Enthusiasm for word clouds -> musings on indexing

Have you heard about Wordle?  I think I’m behind the times again; I’ve used wordclouds and always liked having a tag cloud on my blog, but I didn’t realize you could go to a site and paste text into a box and get a customizable cloud.  I just pasted one of my college papers in and *pooof* out comes a beautiful word cloud:

I had to use IE instead of Firefox for it to work, but the fun factor is worth it.   Next, I put in the paper I wrote on metadata standards for cultural heritage collections:

What I tried to do but couldn’t really accomplish was to paste in text from all the reference questions I’ve answered in the past months.  It would be cool to see which terms would predominate.  Probably just “database”, but it could be something surprising.  These wordclouds could easily be tweaked to make a nice blog header…and I feel like there are probably lots of excellent ways to use them just lurking in the back of my brain. It’s so tempting to think that if you can just see which terms are most frequently used, you’ll know the subject of the document. Like…if you want to quickly present the major concepts in any chunk of text, this is ideal, right?  Wellll….

I was trying to think through what types of content could usefully be wordclouded, other than blogs. I (half-seriously) started thinking that maybe in addition to an abstract you could have a mini wordcloud at the beginning of a scholarly article, so people could decide even more quickly if they wanted to read it. But really, the idea behind wordclouds (representing content with simple words or phrases) is already applied to most scholarly articles, in the form of keywords and index terms. The cloud is just arranging terms in an attractive way and, what I think is most useful, translating things like term frequency into color and/or size. The question is whether a cloud contains automatically generated keyword terms based on word frequency, user-created keyword terms (tags), or indexing terms that came from a thesaurus and have somehow been verified (by a human) to correspond to the actual subject of the document, not just the words in it.
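To make the "term frequency into size" idea concrete, here is a rough Python sketch of the pure word-frequency kind of cloud. It is not how Wordle works (I have no idea what it does internally); the stopword list, the function names, and the 10 to 48 point range are all my own assumptions.

```python
import re
from collections import Counter

# Hypothetical stopword list, far shorter than any real tool would use.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "i"}

def term_frequencies(text):
    """Count lowercase words in the text, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def font_sizes(freqs, min_pt=10, max_pt=48):
    """Scale each term's count linearly into a font size in points."""
    if not freqs:
        return {}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = hi - lo or 1  # avoid dividing by zero when all counts are equal
    return {
        term: round(min_pt + (count - lo) / span * (max_pt - min_pt))
        for term, count in freqs.items()
    }

# Made-up sample text, standing in for a pasted paper.
sample = ("Gorgons and Amazons: the gorgons in archaic art, "
          "the Amazons in classical art, and art again.")
sizes = font_sizes(term_frequencies(sample))
```

In this sample "art" ends up biggest simply because it appears most often. Nothing in the code knows or cares what the text is about, which is exactly the limitation I keep circling back to.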

I think it’s great that we can use algorithms to do part of the work of assigning subject terms to content, but I worry about people (perhaps non-indexers/catalogers/librarians) forgetting that word frequency is not enough on its own. Looking at a word cloud, you know that I wrote a paper about gorgons and Amazons, but you have no idea what I said about them or what my paper was really about. In this paper from 1961, Bernier & Crane make a similar point:

On the average fewer than 1% of the words of a chemical document are required to subject-index it. Thus, more than 99% of the words that the author uses are useless and even undesirable in a subject index. They are useless to the index user because they do not guide him to the new information reported by the author; they are undesirable because they dilute the useful [index terms] to make for confusion and time-wasting reading.[…] The detection of subjects and their translation into index language is the function of the subject indexer.

Although this article is from the age of print indexes, the same principles still apply in our world of full-text searching and electronic documents. Word frequency algorithms can help subject indexers by showing them terms that might be useful for translation into controlled indexing terms. The indexer must assign terms that cover what the author of a document is actually talking about, not just what they’re saying. If you only index the words and not the subjects, you’ll never be able to connect (or collocate) documents that talk about the same things using different terminology. This seems obvious but I think it’s easy to forget and/or be confused about. Even Bernier and Crane say that “New indexers often have difficulty in understanding the difference between subject- and word-indexing.”
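Here is a toy Python illustration of the collocation point: a word index next to a subject index built from a tiny thesaurus. Everything in it is hypothetical (the thesaurus entries, the preferred terms, the three "documents"), and the dictionary lookup only stands in for the easy part of the job. Deciding what a document is actually about still takes a human indexer.

```python
from collections import defaultdict

# Hypothetical mini-thesaurus: entry terms (what authors write) mapped to
# preferred terms (what the index uses). Not taken from LCSH or AAT.
THESAURUS = {
    "vase": "Pottery",
    "vases": "Pottery",
    "ceramics": "Pottery",
    "amphora": "Pottery",
    "pottery": "Pottery",
    "gorgon": "Gorgons (Greek mythology)",
    "medusa": "Gorgons (Greek mythology)",
}

# Made-up documents, reduced to one line of text each.
DOCUMENTS = {
    "doc1": "Medusa on an Attic amphora",
    "doc2": "Ceramics of the archaic period",
    "doc3": "A gorgon carved on a temple pediment",
}

def word_index(docs):
    """Index every document under each literal word it contains."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def subject_index(docs, thesaurus):
    """Index documents under preferred terms, so synonyms collocate."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in thesaurus:
                index[thesaurus[word]].add(doc_id)
    return index

words = word_index(DOCUMENTS)
subjects = subject_index(DOCUMENTS, THESAURUS)
```

In the word index, "medusa" and "gorgon" each lead to a single document and never meet. In the subject index, both documents sit together under one preferred term, even though they share no vocabulary.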

In the course of thinking about this stuff, I realized I was associating wordclouds ONLY with natural language processing / automatic indexing. But wordclouds (or other word frequency visualizations) don’t necessarily have to be wedded to automatic indexing with all its weaknesses. The clouds themselves are not the enemy; it’s the lack of vocabulary control going into most of them that is causing this unexpectedly long blog post to spill out of me. Obviously you could generate a wordcloud from terms pulled from a controlled vocabulary, and I know a number of people (e.g. librarians) are doing this with LC or AAT subject terms in digital collections. There’s a difference between the visualization itself and what you put into it. Garbage in, garbage out, eh? The same thing goes for clouds based on tags, unless you’ve got some fancy term combination stuff going on like LibraryThing does.

I get kind of worked up thinking about library catalogs that have implemented user tagging and word clouds, but without integrating any of the subject headings in the catalog into the word cloud, or mapping the most popular tags onto subject headings in some way (is this possible? I hope so….). Just think how many end-users probably assume that when they click on a tag, whether in a word cloud or in a record display, they’re getting results for everything in your collection on that topic. Cringe.

(But maybe users of library catalogs don’t even think that anymore? Did patrons know the benefits of using subject headings to search catalogs before we started offering tags and keyword searching? Did they ever think that they could use subject headings to get a really thorough list of results? Maybe the only people who think something is being lost or obscured are people like me who are obsessed with authority control?)

In the end, what I’m thinking is that wordclouds, whether based on controlled index terms or just pure word frequency, are probably most useful for groups of documents (even though I had fun putting my term papers into Wordle). Collection-level clouds can give users a glimpse of the coverage and strengths of a collection, just like document-level word clouds can indicate (SORT OF) the topic of a paper. But it doesn’t really make sense to create a wordcloud for an individual document, for two reasons. First, if you’re applying controlled index terms to it (which obviously you should, if the document is part of a collection), there probably won’t be enough of them to justify needing “visualization”. Second, if you’re making a wordcloud using the entire text of the document (aka just word frequency), you’re not really accurately representing what the document is about. On the other hand, though, you have to admit a wordcloud for a document would be prettier than something like this:

(From a LexisNexis document: a list of index terms (in caps) with relevance indicated in percentages instead of color or font size. Near the end are the keyword terms – possibly automatically generated – in lower case.)

For collections of documents, I think it’s useful to show people a wordcloud of all your indexing terms/subject headings, especially because you can then visually indicate which subject terms you attach to documents most frequently. This benefit of providing a casual glimpse of a collection’s coverage is probably why tag clouds are popular for indicating what a blog (a collection of “documents”, or posts) is “about”. I wonder if more people think of tag clouds in this way, as content indicators, or if they think of them primarily as tools for accessing blog content via tags.
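A collection-level cloud like this starts from counting assigned headings rather than words in the text. A minimal sketch in Python, assuming each catalog record carries a list of human-assigned subject headings (the record structure, titles, and headings below are illustrative, not pulled from a real catalog):

```python
from collections import Counter

# Hypothetical catalog records, each with human-assigned subject headings.
RECORDS = [
    {"title": "Record A", "subjects": ["Pottery, Greek", "Mythology, Greek"]},
    {"title": "Record B", "subjects": ["Pottery, Greek"]},
    {"title": "Record C", "subjects": ["Sculpture, Greek", "Pottery, Greek"]},
]

def heading_counts(records):
    """Count how many records carry each subject heading."""
    counts = Counter()
    for record in records:
        # Use a set so a heading repeated within one record counts once.
        counts.update(set(record["subjects"]))
    return counts

counts = heading_counts(RECORDS)
for heading, n in counts.most_common():
    print(f"{n:>3}  {heading}")
```

The counts could then be scaled into font sizes or colors the same way word frequencies are. The difference is entirely in the input: every term in this cloud was chosen by an indexer.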

Weekend update

I finally uploaded the annotated bibliography I wrote for one of my classes. It’s on ancient (mostly archaic and classical) Greek art & archaeology. The assignment required us to find a certain number of resources in specific formats, so there are a lot fewer web resources than I would include in a bibliography I was doing on my own. In fact, I’ve been thinking it would be fun to do a webliography of all the creative online projects I’ve come across in this field. There’s the vast world of 3-D archaeological site modeling, and then there are all sorts of online exhibitions, image collections, and digital libraries. I have many such sites bookmarked, but I’m sure they’re just the tip of the iceberg. Before I do this I’ll have to see if someone else has already done it.

I’m also working on posts about personal finance resources and an epic overview of my favorite sci-fi books from the past 10 years (meaning ones I’ve read since 2000, not ones that have been published since then). I’m taking an online workshop on taxonomies and controlled vocabularies through Simmons College, so things might be dull around here until that’s over. I do plan to keep doing the visual LCSH roundup, though, because it’s entertaining.

I’m also ruminating on how to possibly create some simple yet helpful document on entrepreneurship that could be mailed to prisoners requesting information on the topic. A zine would be great, but the postage might overwhelm. It seems there is definitely a need for some easily distributable resource on this topic, at least in Pennsylvania.