Tag Project I: Cleanup & Rules

When I began this project we had 3,622 tags, of which more than two-thirds (2,666) were singletons.  In sorting through our existing list to see what’s what, I have already pared it down to about 3,560.
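For anyone who wants to reproduce counts like these, here is a minimal sketch in Python. It assumes the tags can be exported as a flat list with one entry per use on an essay; that export format, the function name `singleton_stats`, and the sample data are assumptions made up for illustration, not something Soapblox is known to provide.

```python
from collections import Counter

def singleton_stats(tag_uses):
    """Return (distinct tags, singleton tags, singleton share of distinct tags)."""
    counts = Counter(tag_uses)
    total = len(counts)
    # A singleton is a tag that has been used exactly once.
    singletons = sum(1 for n in counts.values() if n == 1)
    share = singletons / total if total else 0.0
    return total, singletons, share

# Hypothetical sample data: one entry per tag use on an essay.
sample_uses = ["Iraq War", "Iraq War", "dogs", "FISA", "FISA", "FISA", "polar bears"]
total, singletons, share = singleton_stats(sample_uses)
print(total, singletons, share)

# The figures quoted above: 2,666 singletons out of 3,622 tags.
print(round(2666 / 3622, 3))
```

The last line is just the arithmetic behind "more than two-thirds".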

I make a living working with databases and am somewhat fanatical about having a clean dataset.  The banes of my existence are duplicates and ambiguous or incomplete information.  Keeping database size down is also a concern, since size can affect speed and performance.  The first phase of the Tag project will be to clean up duplicates and ambiguous info.  In the second phase we’ll look at the singletons and see where we can consolidate tags to keep the numbers and size down.

Before we go any further, meet our Official Tag Team mascots:  

I did some research at the Daily Kos tag cleanup project.  They are trying to pare 40,000 tags down to a list of 5,000.  They have a nice Standard Tags page with 1,500 of the most popular tags.  I hope to have something similar available here.  In the meantime, use theirs as a reference.  I relied on their expertise in deciding how to handle punctuation, abbreviations and proper names in our tags.

Thus, I have drafted the following rules:  

None of these are set in stone.  They are guidelines, more than rules.   And there are always exceptions.  

1. Every essay should have at least one tag.

There is a Diary Topic pick list in the Essay editor.  If nothing else, use this as your first tag – or if none apply, put something else in the tag field.  I will put a note in the “rules” to alert people to choose a topic.  This list could still use some refinement.  

Exception:  I know of at least one member here who does not want their essays tagged.  A solution is to use the “No Tags” tag if you don’t want tags; it also helps ensure that no one else adds them inadvertently.

2. Tags should be single words, first & last names, or short phrases.

3.  The maximum length of a tag is 45 characters.  There is probably a limit to the total number of tags you can fit in the box but I’m not sure what it is.

4.  Proper names.

Use first and last names.

Use initials only to disambiguate individuals, e.g. George W. Bush and George H.W. Bush

Use periods after initials or a prefix/suffix, e.g. Robert F. Kennedy Jr.

No nicknames  – unless that is the most common usage, e.g. Dick Cheney instead of Richard Cheney.

Spelling variations – if there are multiple variants, check the existing tags and select the most frequently used form. e.g. Al-Qaeda / al-Qaida  

Don’t include titles or ranks with names, e.g. use John Conyers and Wesley Clark, not Rep. John Conyers or Gen. Wesley Clark.

5. Use the plural form of nouns, e.g. dogs, movies, candidates.

6.  Punctuation –

Do not use periods, hyphens, quotes, question marks, exclamation points or any other non-alphanumeric characters.  

Remove extra spaces between words.  

Exceptions:

Congressional bills and resolutions, e.g. H.R. 1221, H.Res 333 and S. 2323  

Proper Names with initials or Jr.

URLs www.123.com

Congressional Districts CA-42

U.S. and U.N.  (for United States and United Nations)

7. Abbreviations –

Avoid abbreviations or acronyms unless it is a commonly used reference, e.g. ACLU, FISA.

If you use acronyms, make sure the full name is also spelled out in a tag.  (Be aware that the character limit is 45.)

Do not abbreviate states.

8. Duplication –

To avoid duplicate tags, check the current tag list.  Here is the tag list in Alphabetical Order.  Right now it is fairly quick to load this list.  The fewer tags the better.  When you are on the page, you can search it in your browser by pressing Ctrl+F (Find).  Sorry, this is a clunky way to search, but it’s the best we can do right now.

9. Profanity?  It is preferable to keep it out of tags, but it’s not like we would ban people for it.

10. Recommended and Promoted tags?   I am ambivalent about these.  However, some essays here are already being tagged as such.    I will leave this up for discussion.  
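The mechanical parts of the rules above (the 45-character limit in rule 3, and the punctuation and extra-space checks in rule 6 with its exceptions) can be sketched as a small checker. This is only an illustration: the regular expressions are rough guesses at the exception formats, the "Sr." suffix is an assumption added alongside "Jr.", and the name `check_tag` is made up here. Anything it flags would still need a human to look at it.

```python
import re

MAX_TAG_LENGTH = 45  # rule 3

# Rough, assumed patterns for the rule 6 exceptions; adjust as needed.
EXCEPTION_PATTERNS = [
    re.compile(r"(?:H\.R\.|H\.Res|S\.) ?\d+"),  # bills and resolutions, e.g. H.R. 1221
    re.compile(r"[A-Z]{2}-\d{1,2}"),            # congressional districts, e.g. CA-42
    re.compile(r"www\.[\w.-]+"),                # URLs, e.g. www.123.com
    re.compile(r"U\.S\.|U\.N\."),               # U.S. and U.N.
]

# A name token: a capitalized word, initials such as "W." or "H.W.",
# or a Jr./Sr. suffix ("Sr." is an assumption, not in the rules above).
NAME_TOKEN = re.compile(r"[A-Z][A-Za-z]+|(?:[A-Z]\.)+|(?:Jr|Sr)\.")

def is_punctuation_exception(tag):
    """True if the tag matches one of the allowed punctuation exceptions."""
    if any(p.fullmatch(tag) for p in EXCEPTION_PATTERNS):
        return True
    tokens = tag.split(" ")
    return len(tokens) > 1 and all(NAME_TOKEN.fullmatch(t) for t in tokens)

def check_tag(tag):
    """Return a list of rule problems found in a tag (empty list = looks fine)."""
    problems = []
    if len(tag) > MAX_TAG_LENGTH:
        problems.append("longer than 45 characters")
    cleaned = " ".join(tag.split())  # collapse runs of spaces
    if cleaned != tag:
        problems.append("extra spaces")
    has_punctuation = any(not (ch.isalnum() or ch == " ") for ch in cleaned)
    if has_punctuation and not is_punctuation_exception(cleaned):
        problems.append("punctuation")
    return problems

for tag in ["George H.W. Bush", "CA-42", "Al Gore!", "peace  process", "impeachment."]:
    print(repr(tag), check_tag(tag))
```

The checker deliberately reports problems instead of rewriting tags, since the rules are guidelines and there are always exceptions.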

What is the point of all this?

Things we can do with tags…

Soapblox has a few built-in bloxes that make use of the tags:  HotTags, Subjects, and Feeds.  In the right column (on the front page) these are the sections called Hot Tags, Topics, and Action Alert.  The Hot Tags blox shows the top tags in use over a given period of time (the number of tags and the length of time are set by the admin).  I dislike Hot Tags because it overemphasizes the Pony Parties and Open Threads, which are not the most important things happening here.  Curiously, while the Pony Party is supposed to be an Open Thread, none of the PPs use the Open Thread tag.  But this is a meta discussion for another time.
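To make the Hot Tags idea concrete, here is a minimal sketch of a "top tags over a period" calculation. It is not Soapblox's actual code, which I have not seen; the function name `hot_tags`, the `(tag, timestamp)` input format, and the sample data are all assumptions for illustration. The `days` and `limit` parameters stand in for the two settings the admin controls.

```python
from collections import Counter
from datetime import datetime, timedelta

def hot_tags(tag_uses, now, days=7, limit=10):
    """Return the `limit` most-used tags among uses from the last `days` days.

    `tag_uses` is an iterable of (tag, used_at) pairs.
    """
    cutoff = now - timedelta(days=days)
    counts = Counter(tag for tag, used_at in tag_uses if used_at >= cutoff)
    return [tag for tag, _ in counts.most_common(limit)]

# Hypothetical sample data.
now = datetime(2008, 2, 20)
sample_uses = [
    ("Pony Party", datetime(2008, 2, 19)),
    ("Pony Party", datetime(2008, 2, 18)),
    ("Pony Party", datetime(2008, 2, 17)),
    ("Open Thread", datetime(2008, 2, 19)),
    ("Open Thread", datetime(2008, 2, 18)),
    ("FISA", datetime(2008, 2, 19)),
    ("FISA", datetime(2008, 1, 1)),  # outside the 7-day window
]
print(hot_tags(sample_uses, now, days=7, limit=2))
```

A pure frequency count like this will always favor tags attached to daily recurring essays, which is the overemphasis described above.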

One alternative to the HotTags is the Topics blox.   This is a much better index of our most popular tags.  Interestingly, it is the same list of topics that I have added to the drop-down list in the Essay editor.  Do people like these tags?  Should there be more/fewer?    As another alternative, it is possible for us to ditch the Hot Tags and Topics and create our own custom Tags blox – with whatever subjects we want.

Feeds

The Action Alert blox is a feed of the most recent essays on our site that are tagged “Action”.  As you can see, I have tagged this Essay with Action – so that it will stay visible for our tag crew.    We could create other mini-feeds for Environment, Elections, or whatever, from our own site or pulling feeds from other sites.  For example, the ACLU has an Action Alert feed too.  I could put their feed right below the Take Action feed for our site.   Unfortunately, they don’t seem to keep their alerts up-to-date; the last posted was in 2007.   Note – for the feeds we control how many articles are listed.  I have the Action list set to 4 right now.  
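In outline, a tag-driven feed like this amounts to "filter by tag, sort newest first, keep the first few". The sketch below shows that logic under stated assumptions: the essay records are plain dictionaries with made-up fields (`title`, `posted`, `tags`), and `tag_feed` is a hypothetical name, not a Soapblox function.

```python
from datetime import date

def tag_feed(essays, tag, limit=4):
    """Return titles of the most recent essays carrying `tag`, newest first."""
    matching = [e for e in essays if tag in e["tags"]]
    matching.sort(key=lambda e: e["posted"], reverse=True)
    return [e["title"] for e in matching[:limit]]

# Hypothetical sample data.
sample_essays = [
    {"title": "Tag Project I", "posted": date(2008, 2, 20), "tags": ["Action", "Tags"]},
    {"title": "Polar bear update", "posted": date(2008, 2, 18), "tags": ["Environment"]},
    {"title": "Call your senator", "posted": date(2008, 2, 19), "tags": ["Action"]},
]
print(tag_feed(sample_essays, "Action"))
```

The default `limit=4` mirrors the current setting for the Action list; a feed for Environment or Elections would just pass a different tag.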

The dharmazine.  This is notlightnessofbeing’s project.  It is a very cool way to aggregate our topics into more of a news format, allowing one to scan the headlines at-a-glance.  As we get more organized, and nlob gets some breathing room from settling into his new place, this “zine” can be revisited.

Widgets – there are a lot of sexy widgets out there that can be hooked up to Docudharma feeds and placed on our site or shared around the web.    nlob and News Corpse know a lot more about this than I do.  It looks like the wave of the future to me.  

Speaking of the future – Soapblox will soon be open source.  We should be able to construct a Tag Search function (which is not available right now) and do lots of cool things we haven’t even imagined yet.    

Tag Cleanup Project

Tags may be modified or corrected (based on the rules above) for misspellings, duplicates, ambiguous abbreviations, incomplete names, consolidation of topics,  incorrect punctuation, or wordiness.  

I have identified sets of tags based on: abbreviations, plurals, groupings, proper names, spelling errors, multiple words, and others that I have never heard of or don’t know what they mean.   For the latter, I just want someone to verify they are valid and spelled correctly.  
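Some of these sets can be found mechanically. As a minimal sketch, the code below groups tags that differ only in capitalization, spacing, or punctuation, and suggests keeping whichever variant is used most, in line with the "select the most frequently used form" guideline. The function names and the usage counts in the sample are hypothetical; real counts would come from the tag list.

```python
from collections import defaultdict

def normalize(tag):
    """Comparison key that ignores case, spaces, and punctuation."""
    return "".join(ch for ch in tag.lower() if ch.isalnum())

def duplicate_groups(tag_counts):
    """Map each suggested tag to keep -> sorted list of variants to change.

    `tag_counts` maps a tag to the number of essays using it.
    """
    groups = defaultdict(list)
    for tag in tag_counts:
        groups[normalize(tag)].append(tag)
    result = {}
    for variants in groups.values():
        if len(variants) > 1:
            keep = max(variants, key=lambda t: tag_counts[t])  # most-used form
            result[keep] = sorted(v for v in variants if v != keep)
    return result

# Hypothetical usage counts, for illustration only.
sample_counts = {
    "Daily Kos": 40, "dailykos": 3,
    "how to": 5, "how-to": 2,
    "impeachment": 30, "impeachment.": 1,
    "FISA": 50,
}
for keep, variants in duplicate_groups(sample_counts).items():
    print(keep, "<-", variants)
```

This only catches variants that normalize to the same key. Misspellings and reworded tags would need fuzzy matching (for instance Python's `difflib.get_close_matches`) plus a human check, which is why the lists are being worked by hand.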

Clean-up lists will be posted in the comments below.  For those who want to help out, reply to the list that you want to work on.  Come back and post another comment when you are finished.  When all that is done I can re-analyze the list to work on consolidating the one-of-a-kind tags where possible.  Going forward, just keep an eye on tags in the essays you read and make corrections or suggestions where warranted.

Brief instructions:

The best way to find the errant tags is to use the alphabetized tag list and scroll through.  

Click on the tag and it will bring up all the essays with that tag.  

Go into the Essay, click the Add/Edit tags button.  

Make changes or do what you need to do.  

Click the Add/Edit tags button again to save the changes.

Click the back button a couple times to return to the Essay list or go all the way back to the alphabetized tag list.  

97 comments


  1. Another stud Tag Team –

    Don’t worry boys, we’ll find something else for you to do around here…

  2. (split these up if necessary)

    betryal

    Buddism and politics

    Capitulaltion

    dafur

    Dems

    Dialogoue on Democracy

    Diane Feinstein

    DucuDharma Times

    Environement

    Fillibuster

    Free Speach

    Hillar Clinton

    House Judciary Committee

    house oversite committee

    househearingvet affairs

    Huga Chavez

    Hurircane Katrina

    Iglesia.

    international defence exhibition

    Joe Liebernman

    Lousiana  

    • pfiore8 on February 20, 2008 at 04:37

    Doc and Dharma… that is sooooooo cute!

    Dick and Drama aren’t bad either.

  3. Mahmoud Amadinejad

    Marcos Moulitsas

    Markos Moulitas

    Mohamed Elbaradi

    Montery

    Newbery Award

    Nicholas Sarkozy

    Osama Bin Ladin

    philosopy

    poision

    Progressiv Democrats of America

    real eastate

    steriods

    sustainabiility

    Ted Kenedy

    Tony Blaire

    Townswnd Press

    Tramatic Brain Injuries

    U.S.Constitution

    Univeral Health Care

    Venezuala

    Veterns for Peace

    war profitteering  

  4. (First & Last names only.  Remove Titles. Check spelling.)

    Ahmadenijad

    Chavez

    Clinton

    Clinton I

    Clinton II

    Congressmember Norman Dicks

    Congressmember Stephanie Herseth-Sandlin

    DKos

    dubya

    Feingold

    Feingold/Reid

    feinstein

    FOX Studio’s

    ft drum

    Gen. Wesley Clark

    general david petraeus

    general petraeus

    General Petraues

    George I

    George II

    gonzales

    Hill and Bill

    Hillary Rodham Clinton

    Ilona

    ilona meagher

    Jefferson

    John Meriwether

    Jonathan

    Kos

    Lincoln  

  5. (First & Last names only.  Remove Titles. Check spelling.)

    Mukasey

    Musharaff

    Musharraf

    Nacchio

    nadler

    O’Reilly

    Patraeus

    Peleliu

    Pelosi

    petergof

    Petraeus

    Piaget

    President Bush

    President Clinton

    President Dwight D. Eisenhower

    Putin

    Rep John Murtha

    Rep. Jose Serrano

    Rep. Obey

    Rep. Steny Hoyer

    Representative Stephanie Herseth-Sandlin

    RFK

    Richardson

    Senator Jim Webb

    Senator John Inhofe

    Sgt Brad Gaskins

    Subcomandante Marcos

    Unger

    Wes Clark  

  6. (remove any state abbrevs.  Spell out Acronyms in a separate tag)

    AEI

    AEP

    AIG

    ALA

    ANTM

    BARDA

    BCCI

    BCII

    CCR

    CDC

    CDO

    CMO

    DADT

    DDD

    DRA Legal

    DRM

    ESA

    EZLN

    FBN

    FCNL

    FDL

    FUSE

    GOA

    HIPAA

    HIPPA

    IPCC

    IUCN

    MOP

    MRFF  

    • pfiore8 on February 20, 2008 at 04:41

    put a rec button on this so it stays on the essay list for while to get maximum number of eyeballs (as ek would say)

    v.v.v. good work and strategy OTB!!!

  7. (remove any state abbrevs.  Spell out Acronyms in a separate tag.  Change US to U.S.)

    NIE

    NY

    NY Sun

    NYC

    NYTimes

    NYTimes Book Review

    PENS

    PJAK

    PKK

    PRI

    PUK

    RCMP

    RFID

    SMAP

    spp

    SPPNA

    TCE

    TSX  

    UNC

    UNEP

    US

    US Constitution

    US corruptioin

    US Dollar

    US Marines

    US myth

    usa

    USFWS

    VA

    VFP

    VFP Chapter 099

    VT

    VVAW

    WTI

    WTO  

  8. (Split into one or more tags)

    9/11 AUMF

    9/11/ CIA Tapes Scandal

    Acid-Test Neocons

    Al Qaeda in Iceland?

    ancient philosophy philosophactory

    and Karl Rove

    Bildeburg owns Obama

    Bildeburg Selection Process

    bleedin’ demised

    blood and flame

    Bob Herbert. Doug Wilder

    Bush lies

    CIA Terror Interrogation Tapes Destroyed

    Conspiracy Theory? Britney Spears

    Defense Authorization Bill 2008

    Democratic Candidates 2008

    Democratic Debate

    Destroyed Torture Tapes

    differant drummer cafe

    economy banking housing mortgage

    el salvadoran oligarchy

    end the Iraq war

    family values comes a crapper

    free.speech meta blogging political.influence

    global warming. software

    Gonzales Torture Memo

    Hillary Clinton 2008 elections

    homesteding nature

    Huckabee vs. Paul  

  9. (Split into one or more tags)

    Intelligence lack of

    Iowa Chris Dodd

    Kucinich Campaign Update

    love and happiness

    Mahatma Gandhi. Republicans

    medicine. Obama

    Missing CIA Torture Tapes

    Monty Python’s Flying Circus

    Monty Python’s Life of Brain

    New Hampshire Constitution Article 10

    New Year’s resolutions. pony party

    NPR Investigation

    Obama depression

    Pervez Musharraf. Pakistan

    philosophy cynicism

    photos of Russia

    politics economics

    politics media

    prosecute war crimes and corruption

    Republicans on Iraq

    School of the Americas Watch

    Skyway Aircraft. George W. Bush

    Stupid Medieval Democrats

    sucky holidays

    Survival of American Indians Association

    Traitorous Dems

    troops/veterans  

  10. (check for context and part of speech before changing singulars)

    activist

    activists

    African-American

    African-Americans

    belief

    Beliefs

    brokered convention

    Brokered Conventions

    cost of war

    Costs of War

    crime

    crimes

    cult

    cults

    Debate

    debates

    demonstration

    demonstrations

    disaster

    Disasters

    dog

    dogs

    Endorsement

    Endorsements

    Extraordinary Rendition

    extraordinary renditions

    Fire Department

    fire departments

    Foreign Policies

    foreign policy

    Indian

    Indians

    Iowa Caucus

    Iowa Caucuses

    light

    Lights

    missile

    missiles

    mortgage

    mortgages

    movie

    movies  

  11. (check for context and part of speech before changing singulars)

    petition

    petitions

    poem

    poems

    Phobia

    phobias

    polar bear

    polar bears

    poll

    polls

    Presidential candidate

    presidential candidates

    presidential debate

    Presidential Debates

    project

    projects

    puppies

    puppy

    quotation

    quotations

    Quote

    quotes

    rambling

    ramblings

    rant

    rants

    Recount

    recounts

    Reference

    references

    Republican

    Republicans

    song

    Songs

    Taser

    Tasers

    tribute

    Tributes

    Veterans Suicide

    veterans suicides

    weapon

    weapons

    Widget

    widgets

    wildfire

    wildfires

    wolf

    wolves  

  12. (these need to be consolidated into one tag. Look for whichever tag has highest number and change the others to match)

    Al Gore

    Al Gore 2008

    Al Gore!

    Al Qaeda

    al queda

    audioblog

    audioblogging

    bipolar disorder

    bi-polarism

    Booman

    Booman Tribune

    Bush Crime Family

    Bush/Cheney

    bushco

    the Bush Administration

    Climat change

    climate

    climate change

    Climate Crisis

    COINTELPRO

    co-intel-pro

    Daily Kos

    dailykos

    Deborah Mayer

    Deborah Meier

    Dick Cheney

    Richard B. Cheney

    Richard Cheney

    Richard V. Cheney

    doing it for ourselves

    doing it ourselves

    DREAM Act

    DreamAct

    Election Integrity

    Elections Integrity

    electronic voting

    electronic voting machines

    evoting

    geothermal energy

    Geothermal Power

    Guantanamo

    Guantanamo Bay

    how to

    how-to

    impeach

    Impeachable

    Impeachable Offenses

    impeachment

    impeachment.

    Iraq Moratorium

    Iraq War Moratorium

    Iraq War

    Iraq war and occupation

    War In Iraq

    JPop

    J-Pop

    LDS

    LDS Church

    Lieberman-Kyl

    Lieberman-Kyl Amendment

    Kyl-Lieberman Amendment.

    local food

    local produce

    Lockheed Martin

    Lockheed-Martin

    Martin Luther King

    Martin Luther King Jr

    Martin Luther King Jr.

    MLK

    meetup

    Meet-Up

    Michae B. Mukasey

    Michael B. Mukasey

    Michael Mukasey

    Middle East

    Middle-East

    midnight cowboy

    midnight cowboying

    Military Commissions

    Military Commissions Act

    miscellaneous

    miscellany  

  13. (these need to be consolidated into one tag. Look for whichever tag has highest number and change the others to match)

    neocons

    Neo-Cons

    Neo-Con’s

    neo-conservatives

    New Years Eve

    New Year’s Eve

    night life

    nightlife

    Non-violance

    nonviolence

    Non-Violence

    peace process

    peace  process

    Photographs

    photos

    PROTECT ACT

    Protect America Act

    Rasmussen Poll

    Rassmussen

    restore the constitution

    Restoring the Constitution

    Roman Nose

    Roman Nose State Park

    rss

    RSS Feed

    Rudolph Giuliani

    Rudy Giuliani

    Rudy Guilliani

    sayed pervez kambakhsh

    sayed pervez kambaksh

    SCHIP

    S-CHIP

    self reliance

    self-reliance

    snark

    Snark Attack!

    SNARK!!!

    State of the Union

    State of the Union Address

    suck

    sucking

    sucks

    tech

    technology

    Telco Amnesty

    telco immunity

    telecom amnesty

    telecom immunity

    United States

    United States of America

    Universal Health Care

    universal single-payer

    universal single-payer health insurance

    Veteran

    veterans

    Veteran’s

    Viet Nam

    Vietnam

    WGA

    WGA strike

    writers guild of america

    writers strike

    Writer’s Strike

    wiretap

    wiretapping

    Wounded Knee Massacre

    Wounded Knee Massacre Of 1890

    You Tube

    youtube  

  14. (Not sure what some of this means.  Let’s get rid of the nonsense tags and profanity)  

    #3

    aerial killing

    Ammiq

    amta

    Awwwwwww

    Billo

    bleg

    bobeta

    Cheka

    fananafanafofeta

    Fuck this shit!

    fuck you

    Fuckity Fuck McFuck

    GacktJob

    Guilin

    holy moly

    House Veterans  

  15. (Not sure what some of this means.  Let’s get rid of the nonsense tags and profanity)

    Jr.

    JRock

    Just fuckin with you

    kossack fullstop

    Literature for Kossacks

    ME-ta

    Or not

    Penza

    peom hat

    people unclear on the concept

    recep tayyip Erdogan

    ri

    Roles?

    the 2007 Joe Lieberman’s Boggy Cecum Award

    throwaway meta personal

    want to contribute anyway

    Who are you and where is my country

    Why do Dems hate freedom  

  16. Need to do more research on how to consolidate these tags. I was planning to use the Dkosopedia tag list as a guide.

    2008 Eelctions

    2008 Eelections

    2008 Election

    2008 Election cycle

    2008 elections

    2008 Elections!

    2008 Elections.

    2008 Primaries

    2008 primary

    2008 Vote

    2008. GOP

    Election

    election ’08

    Election 207

    Election 2208

    elections

    Elections 2008

    2008 Presidential Candidates

    2008 Presidential election

    2008 U.S. Presidential Election

    presidential election

    Presidential Election 2008

    Presidential race

    Presidential Race 08

    President 2008  

  17. like the bane of your existence!

    I shall attempt to conform!

  18. ;-)………..

    • Benny on February 20, 2008 at 07:53

    I had a conversation with Chris Bowers about a month ago concerning tags and their purposes.  At MyDD, they had lousy, overly broad tags when he and Matt Stoller were still there, and it was not easy to find some of their poll results diaries.  I made a case to him that despite the power of Google and other search engines, tags were important.  Maybe I will write him again and let him know this blog is making this attempt.

    Although DK has turned into a temple for certain worshipers, I still think their tag patrol is good and the model is a good one to follow, and I like the guidelines here.  But I remind anyone that the more specific, the better.  Just needs to be consistent.  

    Matt Stoller is likely to take the “Everything is Miscellaneous” approach at OL, but I do think there needs to be some organization amongst the blogosphere in order to locate good diaries, information, etc.

    • plf515 on February 20, 2008 at 14:29

    congress and elections and the states’ names? Or abbreviations? or what?

    • Alma on February 20, 2008 at 16:42

    I will be back tonight or tomorrow to pick a list to do.  I like to finish what I start in one sitting (OCD thing), so I want to have a decent block of time set aside to do it in.

    • kj on February 20, 2008 at 17:16

    am going to have to beg off helping today!  have to get ready for an interview and testing thingy… going the temp agency route until i can figure out what it is i wanna do with my life, lol!
