I Never Metadata I Didn't Like (not Will Rogers)

Living in Oklahoma, I couldn't resist the tag line of this piece by Jacob Harris in the New York Times' Open Source blog. Another thing I couldn't resist was his mention of the role librarians have played in creating metadata (though not by that term) for the NYT since 1851. Mr. Harris describes these librarians as "the most advanced computational text-categorizing system known to mankind." Pretty heady stuff for a library student to read. He points to some of the things that search engines can't do because of problems with language (esp. English) as it relates to news stories:
  • Disambiguation — Is this story about Ford the president or Ford the automotive company?
  • Summarization — This article might quote Nancy Pelosi, but it’s really just an article about President Bush, isn’t it?
  • Normalization — The text of one story may use “The United States,” while another says “U.S.” Can we label both with the “United States of America” geographic label?
  • Taxonomies — One story may be about Global Warming and another on Pollution; can we label both of them as being subcategories for Environment?
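The last two items are the easiest to picture in code. Here is a minimal sketch in Python of what normalization and a taxonomy lookup might look like. The table names, function names, and labels are my own illustrations built from the examples above, not the NYT's actual vocabulary or system.

```python
from typing import Optional

# Hypothetical alias table: surface forms found in story text -> one canonical label.
PLACE_ALIASES = {
    "the united states": "United States of America",
    "u.s.": "United States of America",
}

# Hypothetical taxonomy: narrower topic -> broader category.
BROADER_TOPIC = {
    "Global Warming": "Environment",
    "Pollution": "Environment",
}


def normalize_place(mention: str) -> str:
    """Return the canonical geographic label, or the trimmed mention if unknown."""
    cleaned = mention.strip()
    return PLACE_ALIASES.get(cleaned.lower(), cleaned)


def broader_topic(topic: str) -> Optional[str]:
    """Return the parent category for a topic, or None if it is not in the taxonomy."""
    return BROADER_TOPIC.get(topic)


if __name__ == "__main__":
    print(normalize_place("The United States"))
    print(normalize_place("U.S."))
    print(broader_topic("Pollution"))
```

The lookups themselves are trivial. The hard part, and the part the librarians have been doing, is deciding what goes into those tables and keeping them consistent.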
His point is that carefully created metadata allows content to be more accurately archived and retrieved. In fact, good metadata is critical, and this is an example of a company creating its own metadata system to keep track of what it produces. He does lament the lack of standards and the sometimes proprietary nature of this information. He also notes that other news services either don't use metadata or don't share it. I would argue that a standard could be established and that we would all benefit from having access to more relevant news. I bring this up because it reflects a view that I expressed on our class discussion board last week concerning Cory Doctorow's essay Metacrap: Putting the torch to seven straw-men of the meta-utopia.
Perhaps I suffer from nerd hubris, but I tend to believe in the meta-utopia, or at least some aspects of it. Some of the same arguments were (and still are) made against OSS. The idea that a bunch of distant programmers could write viable code without being paid was also treated with much disdain, but it is increasingly embraced as a viable (some argue, preferred) model for software development. It took years to make any progress toward this ideal, and now there is growing momentum. I think many of Doctorow's arguments can be shot down by the success of folksonomies. Sure, there are a lot of lazy, stupid, dishonest individuals, but in the aggregate we can create extremely useful tags, which are metadata. While the term meta-utopia may be strong, it can't be denied that metadata in the form of XML is quickly becoming THE standard for describing and expanding information packages. I strongly disagree with his contention that industries can't set up or follow schema, which are really just agreed-upon standards. I agree it is difficult, but most industries recognize that it is necessary to follow protocols and standards like those established by the ISO. Also, computer component manufacturers often join standards-creating groups to ensure that their products will have wide acceptance, thus avoiding format wars (OK, other than Microsoft and Sony). I do admit to being a bit of an idealist, and I acknowledge that there is definitely truth in the essay. But if we can envision metadata working to organize the web, we can consciously move toward that ideal. With information growing exponentially, there have to be solutions like this.