Enthusiasm for word clouds –> musings on indexing

Have you heard about Wordle?  I think I’m behind the times again; I’ve used wordclouds and always liked having a tag cloud on my blog, but I didn’t realize you could go to a site and paste text into a box and get a customizable cloud.  I just pasted one of my college papers in and *pooof* out comes a beautiful word cloud:

I had to use IE instead of Firefox for it to work, but the fun factor is worth it.   Next, I put in the paper I wrote on metadata standards for cultural heritage collections:

What I tried to do but couldn’t really accomplish was to paste in text from all the reference questions I’ve answered in the past months.  It would be cool to see which terms would predominate.  Probably just “database”, but it could be something surprising.  These wordclouds could easily be tweaked to make a nice blog header…and I feel like there are probably lots of excellent ways to use them just lurking in the back of my brain. It’s so tempting to think that if you can just see which terms are most frequently used, you’ll know the subject of the document. Like…if you want to quickly present the major concepts in any chunk of text, this is ideal, right?  Wellll….

I was trying to think through what types of content could usefully be wordclouded, other than blogs. I (half-seriously) started thinking that maybe, in addition to an abstract, you could have a mini wordcloud at the beginning of a scholarly article, so people could decide even more quickly if they wanted to read it. But really, the idea behind wordclouds (representing content with simple words or phrases) is already applied to most scholarly articles. The cloud is just arranging terms in an attractive way and, what I think is most useful, translating things like term frequency into color and/or size. The question is what goes into the cloud: automatically generated keyword terms based on word frequency, user-created keyword terms (tags), or indexing terms that came from a thesaurus and have somehow been verified (by a human) to correspond to the actual subject of the document, not just the words in it.
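
To make the "term frequency into size" idea concrete, here is a minimal Python sketch. It is not how Wordle works internally (I have no idea what it does); the stopword list, the linear scaling, and the point sizes are all assumptions picked for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "i"}

def term_frequencies(text):
    """Count lowercase word tokens, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def font_sizes(freqs, min_pt=10, max_pt=48):
    """Scale each term's count linearly into a font size between min_pt and max_pt."""
    if not freqs:
        return {}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = hi - lo or 1  # avoid dividing by zero when all counts are equal
    return {
        term: round(min_pt + (count - lo) / span * (max_pt - min_pt))
        for term, count in freqs.items()
    }

sample = "Gorgons and Amazons: the gorgons in art, the Amazons in myth, and gorgons again"
sizes = font_sizes(term_frequencies(sample))
```

In this toy sample "gorgons" gets the largest size simply because it appears most often, which is exactly the limitation discussed next: the sizes reflect the words used, not what the text is about.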

I think it’s great that we can use algorithms to do part of the work of assigning subject terms to content, but I worry about people (perhaps non-indexers/catalogers/librarians) forgetting that it’s not enough on its own.  You know that I wrote a paper about gorgons and Amazons, but you have no idea what I said about them or what my paper was really about.  In this paper from 1961, Bernier & Crane make a similar point:

On the average fewer than 1% of the words of a chemical document are required to subject-index it. Thus, more than 99% of the words that the author uses are useless and even undesirable in a subject index. They are useless to the index user because they do not guide him to the new information reported by the author; they are undesirable because they dilute the useful [index terms] to make for confusion and time-wasting reading.[…] The detection of subjects and their translation into index language is the function of the subject indexer.

Although this article is from the age of print indexes, the same principles still apply in our world of full-text searching and electronic documents. Word frequency algorithms can help subject indexers by showing them terms that might be useful for translation into controlled indexing terms. The indexer must assign terms that cover what the author of a document is actually talking about, not just what they’re saying. If you only index the words and not the subjects, you’ll never be able to connect (or collocate) documents that talk about the same things using different terminology. This seems obvious but I think it’s easy to forget and/or be confused about. Even Bernier and Crane say that “New indexers often have difficulty in understanding the difference between subject- and word-indexing.”
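
Here is a small Python sketch of the difference between word-indexing and subject-indexing. The documents and the word-to-term mapping are made up, and the lookup table only stands in for the indexer's judgment; it is not a claim that subject indexing can be automated this way.

```python
from collections import defaultdict

# Hypothetical mapping from words found in texts to a controlled index term.
# In practice a human indexer makes these decisions, guided by a thesaurus.
CONTROLLED_TERMS = {
    "car": "Automobiles",
    "cars": "Automobiles",
    "automobile": "Automobiles",
    "auto": "Automobiles",
}

# Hypothetical documents that discuss the same subject in different words.
DOCUMENTS = {
    "doc1": "a history of the automobile",
    "doc2": "repairing your car at home",
}

def word_index(documents):
    """Index every document under each word it literally contains."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def subject_index(documents, mapping):
    """Index every document under the controlled term its words map to."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word in mapping:
                index[mapping[word]].add(doc_id)
    return index

words = word_index(DOCUMENTS)
subjects = subject_index(DOCUMENTS, CONTROLLED_TERMS)
```

Looking up "car" in the word index finds only doc2, while the controlled term "Automobiles" collocates both documents.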

In the course of thinking about this stuff, I realized I was associating wordclouds ONLY with natural language processing / automatic indexing. But wordclouds (or other word frequency visualizations) don’t necessarily have to be wedded to automatic indexing with all its weaknesses. The clouds themselves are not the enemy; it’s the lack of vocabulary control going into most of them that is causing this unexpectedly long blog post to spill out of me. Obviously you could generate a wordcloud from terms pulled from a controlled vocabulary, and I know a number of people (e.g. librarians) are doing this with LC or AAT subject terms in digital collections. There’s a difference between the visualization itself, and what you put into it. Garbage in, garbage out, eh? The same thing goes for clouds based on tags, unless you’ve got some fancy term combination stuff going on like LibraryThing does. I get kind of worked up thinking about library catalogs that have implemented user tagging and word clouds, but without integrating any of the subject headings in the catalog into the word cloud, or mapping the most popular tags onto subject headings in some way (is this possible? I hope so….). Just think how many end-users probably assume that when they click on a tag, whether in a word cloud or in a record display, they’re getting results for everything in your collection on that topic. Cringe.  (But maybe users of library catalogs don’t even think that anymore?  Did patrons know the benefits of using subject headings to search catalogs before we started offering tags and keyword searching? Did they ever think that they could use subject headings to get a really thorough list of results? Maybe the only people who think something is being lost or obscured are people like me who are obsessed with authority control?)

In the end, what I’m thinking is that wordclouds, whether based on controlled index terms or just pure word frequency, are probably most useful for groups of documents (even though I had fun putting my term papers into Wordle). Collection-level clouds can give users a glimpse of the coverage and strengths of a collection, just like document-level word clouds can indicate (SORT OF) the topic of a paper. Still, it doesn’t really make sense to create a wordcloud for an individual document, for two reasons: (1) if you’re applying controlled index terms to it (which obviously you should, if the document is part of a collection), there probably won’t be enough of them to justify needing “visualization”, and (2) if you’re making a wordcloud using the entire text (aka just word frequency) of the document, you’re not really accurately representing what the document is about. On the other hand, though, you have to admit a wordcloud for a document would be prettier than something like this:

(From a LexisNexis document: a list of index terms (in caps) with relevance indicated in percentages instead of color or font size. Near the end are the keyword terms – possibly automatically generated – in lower case.)

For collections of documents, I think it’s useful to show people a wordcloud of all your indexing terms/subject headings, especially because you can then visually indicate which subject terms you attach to documents most frequently. This benefit of providing a casual glimpse of a collection’s coverage is probably why tag clouds are popular for indicating what a blog (a collection of “documents”, or posts) is “about”. I wonder if more people think of tag clouds in this way, as content indicators, or if they think of them primarily as tools for accessing blog content via tags.

In praise of local newspaper indexing

“To the newspapers of our communities the historian turns to interpret the mind of the people during any crisis. For this quest the newspaper index is the indispensable key to the treasure house of facts locked away in newspaper files…Yet in the United States there is but one such index, published by The New York Times.”

–Paul P. Foster, librarian of the Philadelphia Inquirer, March 2, 1935. (in a Special to The New York Times)

In many cities, the newspaper with the most readers and the largest distribution area cannot possibly report on the happenings in the myriad neighborhoods and communities it covers. For example, in a 1995 article on collection building, John Yewell wrote that the Minneapolis Star Tribune “is no different from most other large city newspapers. The news decisions they make are intended to appeal to a wider audience. Their withdrawal over the years from local politics in Minneapolis has left a void that several small, scrappy newspapers are attempting to fill.” In Pittsburgh, we have the excellent City Paper, but also an impressive number of publications dedicated to reporting on happenings in our many neighborhoods and communities. The Greenfield Grapevine, the Spirit of Bloomfield, and the Pittsburgh Senior News are just a few examples.

While many large city newspapers (and even some small ones) have adapted to the Web and provide keyword searchable content online, most community publications remain confined to the world of print (or PDFs, if you’re lucky). Few small newspapers possess the funding or expertise to mount their archive of print content online. Their content is not included in news databases, and libraries often neglect to collect them at all. This is unfortunate, because community and neighborhood newspapers can provide journalistic detail and a unique historical perspective that is increasingly absent from larger newspapers. Now that news information is more accessible and ubiquitous than before, community-produced information about local history and current events can be easily lost in the chaos or overshadowed by more easily accessible digital resources.

While in library school, I became especially interested in local news indexing practices and history. Indexing “small, scrappy newspapers” arguably requires knowledge of local history and language, in addition to familiarity with the needs and interests of potential users of the index. Since community publications may never make it to the Web, an index may be the only way to provide subject access to the years of history contained in their pages. It’s hard to argue for the time-intensive effort of indexing when digitization seems so quick and supposedly timeless. Despite the many benefits of news indexes, librarians and archivists have often found it difficult to convince users to use them, and to convince library administrations to support their production at a local level. The format of indexes can be especially daunting for users accustomed to keyword searching, but these users have been found to appreciate indexes once they are made aware of their usefulness (Knapp, 2008). The issue of funding and staffing for the production of a local news index is a larger problem. I admit that researching this topic is a little self-indulgent. Still, you never know what can be accomplished with some tenacious grant-writing. Even if a small newspaper is capable of offering full-text content online, its users could still benefit from the improved retrieval accuracy that a good index can provide. (On a related note, see this article by Heather Hedden on book-style indexes for websites.)

Journalists don’t use standardized terminology within a single issue of a paper, much less over a number of years. Indexing with the use of a controlled vocabulary (or “heading authority”) brings together all references to constantly changing personal, organizational, and community names. Local news indexers often use an already established heading authority like Sears or LCSH, and modify it for local use. “To accurately assign subject headings requires an intimate knowledge of the community and its history” (Weaver, 2006), and indexers with local expertise can better recognize associative relationships that should be created between concepts and entities in the index. Community news indexers may also develop a list of recommended or modified headings designed to reflect local terminology and respond to the needs and viewpoints of anticipated users (Sholtys, 1984). Genealogists, local historians, reporters, authors, business people, county agency staff, students, and historical preservation society members are all potential users of a local news index, and their information-seeking habits should be considered in the wording of terms and their variants.
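
As a rough illustration of how a heading authority pulls variant names together, here is a Python sketch of a tiny authority file. Every name and heading in it is invented, and a real file built on Sears or LCSH would be far richer; this only shows the basic idea of preferred headings, variants, and "see" references.

```python
# Hypothetical authority file: preferred heading -> variant forms seen in print.
AUTHORITY = {
    "Example Neighborhood Council": ["Neighborhood Council", "ENC", "Example Council"],
    "Doe, Jane": ["Jane Doe", "J. Doe", "Councilwoman Doe"],
}

def build_lookup(authority):
    """Map every variant (and the heading itself) to its preferred heading, ignoring case."""
    lookup = {}
    for heading, variants in authority.items():
        lookup[heading.lower()] = heading
        for variant in variants:
            lookup[variant.lower()] = heading
    return lookup

def see_references(authority):
    """Produce 'variant, see heading' lines in alphabetical order."""
    refs = [
        f"{variant}, see {heading}"
        for heading, variants in authority.items()
        for variant in variants
    ]
    return sorted(refs, key=str.lower)

lookup = build_lookup(AUTHORITY)
refs = see_references(AUTHORITY)
```

With a lookup like this, an article that mentions "J. Doe" and another that mentions "Councilwoman Doe" both end up under "Doe, Jane", and the "see" lines point users from the variants to the heading.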

The main reasons to create an index are to save users’ time and facilitate their access to information. In 1982, one librarian wrote that the microfilming of local news archives resulted in “readers [being] less able to browse among the yellowing pages of bound volumes and [being] forced instead to endure the microfilm, where soonest found, soonest done” (J.D.L., 1982). This librarian’s indexing project was valuable because it enabled patrons to go directly to an article without having to deal with browsing and switching between reels of dreaded microfilm. I imagine that an electronic index could be especially useful for the community newspapers that exist in Pittsburgh, which aren’t necessarily archived in local libraries or even accessible to the public. I think a lot of them are stored at community centers, where (for all I know) there might be lovely filing cabinets or even a dedicated volunteer toiling away at an index. (I’ve been meaning to find out about this for the last year!) The time-saving principle of indexes applies to the Web environment as well as to print. Online, indexes have been shown to direct users to answers an average of 2 minutes faster than keyword searching (Knapp, 2008) (who knew?!). Users benefit from not having to browse through results only to realize they are irrelevant, and an index reduces the need to perform multiple time-consuming searches using different terms in hopes of getting all results on a topic.

During the 1970s, there was a flourish of local newspaper indexing projects. The interest was not limited by international boundaries nor by library type: projects were initiated in public and academic libraries in Scotland, Great Britain, New Zealand, the United States, and elsewhere. The majority of these projects were grant-funded, and staffed by one dedicated librarian with a few volunteers. Training the volunteers was time consuming and difficult, and volunteers were often not able to stay interested long enough to complete the project. This was a frequent problem with projects attempting to index a historical collection of newspapers. Projects that began indexing current issues seemed to fare better because they didn’t have such a daunting body of work facing them. In order to make indexing workflows as efficient as possible, one should consult the literature before taking on a project. Though they are outdated in many ways, articles from the 1970s and 1980s provide valuable tips on developing a name and subject authority file, determining depth of indexing, and dealing with staff and money shortages.

People (especially librarians and archivists) who want to begin indexing their neighborhood or community newspapers may be able to advocate for a project by invoking some practical benefits, which may especially appeal to administrators and funders:

  • A local news index is tangible proof of library activity (Dewe, 1972).
  • Librarians working on an index will be acutely aware of trends and issues that are of interest to the community. This is important for almost all aspects of public library operations, including programming, collection development, and marketing.
  • The production of an index is a chance for the library to form partnerships with other institutions. Librarians in Napa, CA used their indexing project as an opportunity to achieve closer cooperation among a community college, two small city libraries, four high school libraries, a private parochial college, and a historical society (Vierra & Trice, 1980).

The Web seems to lead many information-seekers to neglect the importance of information that is not available online. Local newspapers provide unique information about people, places, and events that may never appear on the Web or in any major publication. The time and money it takes to produce an index are significant, and it is harder than ever to convince non-cataloging types that such “old-fashioned” efforts are still worth it. However, without any way of facilitating the retrieval of information from small local publications, their “treasure house of facts” will be “locked away” when it could instead be helping people access relevant information about their communities and their histories.

Works Cited + Further Reading:
American Society for Indexing. “ASI Publications.” http://www.asindexing.org/site/asipub.shtml

Aslin, P. (2001). Raising Rochester history: The history of an index. Key Words 9(3), 76-82.

Beare, G. (1989). Local newspaper indexing projects and products. The Indexer 16(4), 227-233.

Dewe, M. (1972). Indexing local newspapers. American Libraries 3(4), 59.

J.D.L. (1982). Local newspaper indexing in the UK. The Indexer 13(2), 103.

Knapp, C. (2008). Breakout session C1: indexes and the Google Generation: what you don’t know CAN hurt you. 2008 CALI Conference Report. Key Words 16(3), 95.

Knee, M. (1982). Producing a local newspaper index. The Indexer 12(2), 101-103.

Napier, K. (1982). Indexing a local newspaper. New Zealand Libraries 43(12), 197-199.

Sandlin, L. (1985). Indexing of smaller-circulation daily newspapers. The Indexer 14(3), 184-189.

Sholtys, P. (1984). Adapting Library of Congress Subject Headings for newspaper indexing. Cataloging & Classification Quarterly 4(4), 99-102.

Special to The New York Times. (1935, March 2). Newspaper indexing urged as library aid. New York Times (1857-Current file), p. 13. Retrieved November 12, 2008, from ProQuest Historical Newspapers The New York Times (1851 – 2005) database. (Document ID: 97144991).

Vierra, B. & Trice, T. (1980). Local newspaper indexing: a public library reports its experience. The Serials Librarian 5(1), 87-92.

Weaver, C. (2006). The Indexer as consultant: collaborative indexing of community newspapers. Key Words 14(1), 18-33.

Yewell, J. (1995). Why libraries must subscribe to – and preserve – the neighborhood and community press. Collection Building 14(2), 47-48.

FYI: much (but not all) of this post was distilled from a paper I wrote in 2008 as part of my MLIS coursework. Please give me credit if you’re citing or using any part of it.

A back-of-the-book index to images of ancient Greek vases

Before I dive into this, here’s my experimental image index that is discussed in this post. You can see some sample pages from the text using Amazon’s “look inside” feature.

Over the summer I took a class on indexing and abstracting. As part of my final project, I indexed some of the images of Greek vases in the book The History of Greek Vases, by John Boardman (London: Thames & Hudson Ltd, 2001). It seems kind of quaint to produce a back-of-the-book image index; maybe that’s just because I’m too steeped in digital stuff. Indexes are crucial for print materials, and I know it would have been useful to my art historical research to have the subjects, vase types etc. indexed. However, this probably isn’t the most practical of endeavors. Any good index to the text would probably index image captions, or the locators would at least get you close enough that you could find that one image you remembered that showed Perseus with a detached Gorgon head in hand. This exercise was more to see what would happen if I indexed multiple attributes but proceeded as if I were just creating a traditional subject index to a text. What would the index look like? What would the cross-references be like, and how would the image index differ from an index to the text?

Method and Meanderings
I didn’t index drawings, maps, or photographs of sections of vases; I focused on images depicting vases in their entirety, and on the descriptive information about them that was included in captions. The locators in my index are for page numbers, although each image was numbered in the text. It just seemed easier to navigate to a page than to an image number.

When deciding what aspects of the images to index, I took inspiration from the access points used by the Beazley Archive at Oxford University. Some of the elements by which one can search the Beazley pottery database are:

  • Fabric
  • Technique
  • Shape name
  • Date range
  • Inscription type
  • Inscription
  • Artist name
  • Scholar name
  • Decoration description
  • Collection name
  • Publication name

There are even more than that, and it’s pretty impressive. Not all of these categories would be useful for my purposes (a back-of-the-book image index), but several of them (inscription type and vase shape name, for example) are especially useful for various types of art historical research in this field.

I looked at the VRA Core 4.0 metadata schema and noted which of its elements might correspond to those used by the Beazley Archive. The Beazley’s “fabric” element combines VRA’s “cultural context” and “style/period” elements. “Technique” corresponds to VRA’s “material” and “technique”. VRA does have an element for “inscriptions”, but it’s not clear to me if the “type” attribute for the “text” sub-element could be used to indicate the type of inscription (e.g. epoiesen or egrapsen signatures). I was mainly using VRA Core as a point of reference, to get an idea of the types of attributes generally deemed important in creating descriptive metadata for images. (At the time, I didn’t know about Cataloging Cultural Objects, but since VRA Core is based on it I don’t see that as a big deal.) I also considered the facets of AAT, and what they indicate about elements that can be combined (e.g. style and period).
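
The correspondences described above could be written down as a simple crosswalk. This Python sketch uses the element labels as I wrote them in this post, not the schema's exact XML element names, and the sample record is invented. Copying one value into two target elements is a crude simplification, and "Inscription type" is left unmapped on purpose because I'm not sure where it belongs.

```python
# Crosswalk from Beazley search elements to VRA Core elements, using the
# labels as written in this post (not the schema's exact XML element names).
CROSSWALK = {
    "Fabric": ["cultural context", "style/period"],
    "Technique": ["material", "technique"],
    "Inscription": ["inscriptions"],
}

def to_vra(record, crosswalk=CROSSWALK):
    """Copy each field's value into every VRA label it maps to.

    Fields with no mapping are returned separately so nothing is silently lost.
    """
    mapped, unmapped = {}, {}
    for field, value in record.items():
        targets = crosswalk.get(field)
        if targets is None:
            unmapped[field] = value
            continue
        for target in targets:
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped

# Hypothetical record, for illustration only.
record = {"Fabric": "Athenian", "Technique": "black-figure", "Shape name": "amphora"}
mapped, unmapped = to_vra(record)
```

Keeping the unmapped fields visible is a reminder of which Beazley access points have no obvious home in the crosswalk yet.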

The attributes I finally chose to index were:

  • artist name
  • vase shape
  • technique
  • decoration
  • inscription
  • subject (both “things” depicted (e.g. warriors) and mythological figures (e.g. Achilles))

I don’t think any of the vases in my sample set ended up having inscriptions. I chose to index only these six elements because I had a limited amount of time to devote to this. If I were indexing images in an online setting I would definitely want to use more access points.

It was difficult to differentiate between imagery that could be both a subject and a decorative element. Many of the vases I was indexing were decorated with rows of animals in a repeating pattern. This is a common motif, so I needed to decide on a policy for how to index it. At first I was making very specific subheadings indicating the type of vase on which the pattern was appearing, but then I realized that this was creating too much work, and it probably wouldn’t be all that useful. It was also starting to conflate the “fabric” and “technique” elements with the “subject”, and I wanted to keep them separate in hopes of having a less chaotic index.

So, for a vase that had a motif of lions in a row, I decided to just give the locator after the heading (“lions, 30-31”). However, if there was a vase with a lion in any other, non-decorative context, I made a more specific subheading (“lions – being hunted, 20”). I worried that it would be confusing to have some locators listed after the heading, and then a subheading with more locators. So I decided that if any subheadings were required, I would instead list the decorative appearances of the subject with the subheading “as decorative element”. Hopefully more examples will make this clearer:

Sphinxes were only used as decorative elements in the vases I indexed. So the entry for sphinxes is:

sphinxes, 20, 33, 41, 45, 46

Lions were used as decorative elements on some of the vases, but sometimes they were part of the narrative scene. So my entry for lions is:

lions
  being hunted, 20, 23, 30-31
  as decorative element, 29, 33
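
The policy above can also be expressed as a short Python sketch. I built my index by hand, so this is only an after-the-fact illustration; the record format (heading, subheading, locator) and the function name are assumptions, and locators are kept as strings in the order given so that ranges like "30-31" survive.

```python
from collections import defaultdict

DECORATIVE = "as decorative element"

def build_entries(records):
    """Format index entries from (heading, subheading, locator) tuples.

    A subheading of None marks a purely decorative appearance.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for heading, subheading, locator in records:
        grouped[heading][subheading].append(locator)

    lines = []
    for heading in sorted(grouped):
        subs = grouped[heading]
        if set(subs) == {None}:
            # Only decorative appearances: locators follow the heading directly.
            lines.append(f"{heading}, {', '.join(subs[None])}")
            continue
        lines.append(heading)
        # Narrative subheadings first, in the order they were first recorded.
        for sub, locators in subs.items():
            if sub is not None:
                lines.append(f"  {sub}, {', '.join(locators)}")
        # Decorative appearances get an explicit subheading, listed last.
        if None in subs:
            lines.append(f"  {DECORATIVE}, {', '.join(subs[None])}")
    return lines

records = [
    ("sphinxes", None, "20"), ("sphinxes", None, "33"), ("sphinxes", None, "41"),
    ("sphinxes", None, "45"), ("sphinxes", None, "46"),
    ("lions", "being hunted", "20"), ("lions", "being hunted", "23"),
    ("lions", "being hunted", "30-31"),
    ("lions", None, "29"), ("lions", None, "33"),
]
entries = build_entries(records)
```

Fed the sphinx and lion locators from the examples, this produces the same two entries, with headings in alphabetical order.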

Problems / Discussion
Many of my locators appear multiple times within the same entry for a couple of reasons. First is the fact that one page could contain multiple images of different vases with the same subject matter, technique, or fabric. Second: most vases have multiple sections of varying imagery. I may have been indexing too deeply, but I thought that failing to index all the different aspects of the subjects depicted would be akin to giving the researcher a list of undifferentiated locators. A good example of this is the heading for “warriors”. Since I was creating a heading for “hares – being hunted by warriors” it made sense to have “warriors – hunting hares” instead of just listing all the pages for warriors. Why not give the user as much information as possible? Additionally, my entry for “hares” has duplicate locators because in one section of the same vase (on page 33) the hares are a decorative motif, and in another section they are part of a narrative scene in which warriors are hunting them. (confused yet?!)

As mentioned above, deciding to index both “subject” and “decorative elements” caused problems for me when the two were hard to differentiate. I wanted my imaginary user to be able to use the index to find lion friezes and not have to go through a bunch of locators just to find irrelevant images of lions NOT in friezes. But then again, I always expect too much of my information resources.

In hindsight, I think my index would be more useful if I had created separate indexes for each attribute I was indexing. But there are pros and cons to having the index divided by facets instead of in one long alphabetical sequence. One of the books I examined had one index for mythological figures and one for objects that are commonly attributed to specific gods/goddesses (I thought this was awesome). The book I was indexing had three separate indexes: (1) Artists, Groups, and Wares; (2) Mythological and Divine figures; and (3) General (three indexes in a mere 3 pages!). Though it makes sense to divide the index this way, it could lead to confusion if it spanned more than 3 pages. A user could be looking for Odysseus in the wrong index and not know it unless each page was very clearly marked with a header (I think I remember committing this error as an undergraduate). Another benefit to having one long alphabetical index is that it didn’t force me to always differentiate between the aforementioned troublesome decorative elements and subjects, which overlap so often on the same pot.

In real life, it would probably be best to keep the index as simple as possible, only providing sub-headings when the list of locators for a certain topic/shape/artist got excessively long. Nevertheless, if my methods in this exercise could be used on a larger scale, the index would be very useful. Of course that’s unrealistic because of how time-consuming it is, and how “everything is online” now. Back-of-the-book image indexes aren’t unheard of, though. I inspected the indexes in the back of Boardman’s book and another book on Greek art. One of them only listed very general topics (“women” or “fighting scenes”). The other gave some sparse subheadings (and I think this one was the better index overall, though neither of them had any “see” references!).

Indexing a set of images in a book is a good way to become familiar with the basic issues of image description and index construction. It requires decisions that appear simple on the surface, but force you to carefully consider the nature of your subject and the needs of your potential users. You’re forced to make decisions about what to index, how deeply to index, and how to best express what you’re indexing…and those are all just as important in the online environment. So this is a good exercise for a rainy day when you need a rest from computer eye strain.

Also, I think it would be interesting to survey scholarly texts in any image-focused field and try to get an idea of how people deal with images in back-of-the-book indexes. Projects for the future!