InfoCommons #3 – Apps for Sharing Community Data

In all the craziness of transitioning from my job at the United Way to Google, I neglected to post my last AIRS newsletter column…and I’m just realizing it now, 4 months later.  Eh, life!  So here it is, for posterity.

– – – Published April 2013 in the AIRS Newsletter– – –

Sharing data about community needs is one of the many important functions of information and referral services, but it’s often one of our biggest challenges. We have multiple stakeholders of various types who may want data presented to them through very different lenses. We have data that needs to remain confidential, and data that can be shared in aggregate form to provide crucial insight into how needs are changing among the populations we serve. We may be asked to respond to last-minute requests for information about social services in a specific geographic area, or summarize the needs of a certain population.

The variety of data requests to which I&Rs are asked to respond can become overwhelming. While most software systems provide tools to help us in this area, you may still have some reporting needs that aren’t being met. Can you provide your key partners with anonymous data about calls that they can access via a web browser on a weekly basis? Can you provide interactive charts and graphs that allow stakeholders to see, at a glance, the breakdown of caller needs? Can you quickly generate a map of where calls for a certain topic originated during the past 3 months? Does your data team feel overwhelmed by the need to generate unique reports for everyone who asks? Do you have multiple people pulling data from your system and coming up with different numbers for the same questions?

While no one tool can address all these complex challenges, here are two free online apps that may be able to help:

Google Fusion Tables

Google offers Fusion Tables, a free experimental tool that you can use to upload data from a spreadsheet and quickly create interactive charts and graphs. When users mouse over part of a pie chart, for example, the number of calls and need category will pop up. You can also create summary tables, maps, and “card” views of your data, which turn each row of a spreadsheet into an easily readable index-size card displaying the contents of each field.

Perhaps the best part of this tool is the ability to create views of your data that can then be shared with others, either via a link or by providing access to specific users through their Gmail accounts. This feature allows you to share only certain fields of a spreadsheet while keeping others visible to just you. When users access a view, they can create new charts and filter the data to display what’s relevant to them without changing the layout of the original Fusion Table. This type of functionality makes data sharing much more efficient; instead of creating reports for each county in a region, for example, you can simply upload your region-wide (anonymized) call data and give your regional partners access to filter the data by county or by whatever criteria are relevant to them. Google Fusion Tables also offers a mapping feature, but for that I prefer another tool:


GeoCommons

GeoCommons is a free online mapping application that enables you to upload a spreadsheet, geocode the data, and display it on an interactive map. Since GeoCommons is also a place for sharing datasets, there are many types of data that you can layer onto your call data to add meaning. Freely available datasets that have been uploaded to GeoCommons include school districts, foreclosure scores, unemployment rates, census demographics, bicycle counts, legislative districts, adult ESL & GED information, profiles of older adults by state, and many more. The benefit of GeoCommons is the huge variety of datasets that have been uploaded, but this can also be a drawback because some of them are not formatted properly or may not look right when superimposed on your call data. You may have to experiment with multiple versions of the same data to find a set that works best with your map.

Despite the need for some tinkering to create a good map, the visual impact of the final product cannot be overstated. You can instantly see which areas in your region are generating the most calls for a specific need, and if you layer other datasets over your data, you can simultaneously display meaningful information like income level and foreclosure rates for the regions that generate high call volume. GeoCommons maps can be embedded in webpages and shared via links just like Google Fusion Table charts.

These are only a couple of the many free online tools just waiting to be discovered. The more you explore, the likelier you are to find a tool that could end up providing an ideal solution for your data-sharing needs!


Enthusiasm for word clouds –> musings on indexing

Have you heard about Wordle?  I think I’m behind the times again; I’ve used wordclouds and always liked having a tag cloud on my blog, but I didn’t realize you could go to a site and paste text into a box and get a customizable cloud.  I just pasted one of my college papers in and *pooof* out comes a beautiful word cloud:

I had to use IE instead of Firefox for it to work, but the fun factor is worth it.   Next, I put in the paper I wrote on metadata standards for cultural heritage collections:

What I tried to do but couldn’t really accomplish was to paste in text from all the reference questions I’ve answered in the past months.  It would be cool to see which terms would predominate.  Probably just “database”, but it could be something surprising.  These wordclouds could easily be tweaked to make a nice blog header…and I feel like there are probably lots of excellent ways to use them just lurking in the back of my brain. It’s so tempting to think that if you can just see which terms are most frequently used, you’ll know the subject of the document. Like…if you want to quickly present the major concepts in any chunk of text, this is ideal, right?  Wellll….

I was trying to think through what types of content could usefully be wordclouded, other than blogs.  I (half-seriously) started thinking that maybe in addition to an abstract you could have a mini wordcloud at the beginning of a scholarly article, so people could decide even more quickly if they wanted to read it.  But really, the idea behind wordclouds (representing content with simple words or phrases) is already applied to most scholarly articles.  The cloud is just arranging terms in an attractive way and, what I think is most useful, translating things like term frequency into color and/or size.  The question is whether a cloud contains automatically generated keyword terms based on word frequency, user-created keyword terms (tags), or indexing terms that came from a thesaurus and have somehow been verified (by a human) to correspond to the actual subject of the document, not just the words in it.
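To make the "term frequency into size" idea concrete, here is a rough Python sketch of what a frequency-based cloud might be doing under the hood. This is only my guess at the general approach, not how Wordle actually works; the stopword list, the function names, and the linear scaling to point sizes are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "were"}

def term_frequencies(text):
    """Count lowercase words in the text, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def font_sizes(freqs, min_pt=10, max_pt=48):
    """Scale each term's count linearly into a font size in points (assumed scheme)."""
    if not freqs:
        return {}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = hi - lo or 1  # avoid dividing by zero when all counts are equal
    return {
        term: round(min_pt + (count - lo) * (max_pt - min_pt) / span)
        for term, count in freqs.items()
    }

sample = "Gorgons and Amazons. The gorgons were monsters; the Amazons were warriors. Gorgons!"
sizes = font_sizes(term_frequencies(sample))
```

The most frequent word gets the biggest type and everything else is scaled in between. Notice that nothing here knows what the text is about; it only knows which words show up most.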

I think it’s great that we can use algorithms to do part of the work of assigning subject terms to content, but I worry about people (perhaps non-indexers/catalogers/librarians) forgetting that it’s not enough on its own.  You know that I wrote a paper about gorgons and Amazons, but you have no idea what I said about them or what my paper was really about.  In this paper from 1961, Bernier & Crane make a similar point:

On the average fewer than 1% of the words of a chemical document are required to subject-index it. Thus, more than 99% of the words that the author uses are useless and even undesirable in a subject index. They are useless to the index user because they do not guide him to the new information reported by the author; they are undesirable because they dilute the useful [index terms] to make for confusion and time-wasting reading.[…] The detection of subjects and their translation into index language is the function of the subject indexer.

Although this article is from the age of print indexes, the same principles still apply in our world of full-text searching and electronic documents. Word frequency algorithms can help subject indexers by showing them terms that might be useful for translation into controlled indexing terms. The indexer must assign terms that cover what the author of a document is actually talking about, not just what they’re saying. If you only index the words and not the subjects, you’ll never be able to connect (or collocate) documents that talk about the same things using different terminology. This seems obvious but I think it’s easy to forget and/or be confused about. Even Bernier and Crane say that “New indexers often have difficulty in understanding the difference between subject- and word-indexing.”
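Here is a toy Python sketch of that difference between word-indexing and subject-indexing. The three documents and the mini-thesaurus are invented for the example, and the dictionary lookup is only a stand-in: in real life the mapping from an author's words to a controlled heading is a judgment made by an indexer, not a lookup table.

```python
from collections import defaultdict

# Hypothetical mini-thesaurus: variant term -> preferred (controlled) heading.
THESAURUS = {
    "film": "Motion pictures",
    "movie": "Motion pictures",
    "cinema": "Motion pictures",
}

# Hypothetical documents that discuss the same subject in different words.
DOCS = {
    "doc1": "A history of early cinema",
    "doc2": "Why every movie needs a villain",
    "doc3": "Film preservation basics",
}

def word_index(docs):
    """Index each document under the literal words it contains."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def subject_index(docs, thesaurus):
    """Index each document under the controlled heading its words map to."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in thesaurus:
                index[thesaurus[word]].add(doc_id)
    return index

words = word_index(DOCS)
subjects = subject_index(DOCS, THESAURUS)
```

Looking up "movie" in the word index finds only doc2, while the heading "Motion pictures" in the subject index collocates all three documents.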

In the course of thinking about this stuff, I realized I was associating wordclouds ONLY with natural language processing / automatic indexing. But wordclouds (or other word frequency visualizations) don’t necessarily have to be wedded to automatic indexing with all its weaknesses. The clouds themselves are not the enemy; it’s the lack of vocabulary control going into most of them that is causing this unexpectedly long blog post to spill out of me. Obviously you could generate a wordcloud from terms pulled from a controlled vocabulary, and I know a number of people (e.g. librarians) are doing this with LC or AAT subject terms in digital collections. There’s a difference between the visualization itself, and what you put into it. Garbage in, garbage out, eh? The same thing goes for clouds based on tags, unless you’ve got some fancy term combination stuff going on like LibraryThing does. I get kind of worked up thinking about library catalogs that have implemented user tagging and word clouds, but without integrating any of the subject headings in the catalog into the word cloud, or mapping the most popular tags onto subject headings in some way (is this possible? I hope so….). Just think how many end-users probably assume that when they click on a tag, whether in a word cloud or in a record display, they’re getting results for everything in your collection on that topic. Cringe.  (But maybe users of library catalogs don’t even think that anymore?  Did patrons know the benefits of using subject headings to search catalogs before we started offering tags and keyword searching? Did they ever think that they could use subject headings to get a really thorough list of results? Maybe the only people who think something is being lost or obscured are people like me who are obsessed with authority control?)

In the end, what I’m thinking is that wordclouds, whether based on controlled index terms or just pure word frequency, are probably most useful for groups of documents (even though I had fun putting my term papers into Wordle).  Collection-level clouds can give users a glimpse of the coverage and strengths of a collection, just like document-level word clouds can indicate (SORT OF) the topic of a paper. Still, it doesn’t really make sense to create a wordcloud for an individual document because 1. if you’re applying controlled index terms to it (which obviously you should, if the document is part of a collection), there probably won’t be enough of them to justify needing “visualization”, and 2. if you’re making a wordcloud using the entire text (aka just word frequency) of the document, you’re not really accurately representing what the document is about. On the other hand, though, you have to admit a wordcloud for a document would be prettier than something like this:

(From a LexisNexis document: a list of index terms (in caps) with relevance indicated in percentages instead of color or font size. Near the end are the keyword terms – possibly automatically generated – in lower case.)

For collections of documents, I think it’s useful to show people a wordcloud of all your indexing terms/subject headings, especially because you can then visually indicate which subject terms you attach to documents most frequently.  This benefit of providing a casual glimpse of a collection’s coverage is probably why tag clouds are popular for indicating what a blog (a collection of “documents”, or posts) is “about”.  I wonder if more people think of tag clouds in this way, as content indicators, or if they think of them primarily as tools for accessing blog content via tags.
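For what it's worth, the counting behind a collection-level cloud like that is simple. Below is a small Python sketch, assuming each record already carries a list of assigned subject headings; the records and headings are made up for illustration.

```python
from collections import Counter

# Hypothetical catalog records: record id -> assigned subject headings.
RECORDS = {
    "rec1": ["Metadata", "Cultural property"],
    "rec2": ["Metadata", "Indexing"],
    "rec3": ["Indexing", "Metadata", "Mythology, Greek"],
}

def heading_counts(records):
    """Count how many records each subject heading is attached to."""
    counts = Counter()
    for headings in records.values():
        counts.update(set(headings))  # count each heading once per record
    return counts

counts = heading_counts(RECORDS)
top = counts.most_common(2)
```

The counts could then be turned into font sizes or colors, so the headings you assign most often show up largest in the cloud.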