In order to demonstrate some stuff I wanted to have a set of tagged data to play with. Delicious, flickr, that sort of thing. After some digging around on places like theinfo.org I found out that CiteULike (like delicious but targeted at academic papers) makes a dump of their data available. Unfortunately, it’s a bit messy. Not the data format itself, which is a simple pipe-separated text file, but the quality of the tags themselves. This is more or less to be expected for user-reported tagging. It would be nice to have something a bit cleaner though.
I thought it might be illustrative to clean it up in public. It’s a completely hacky process, and not a particularly smart one, but it might be interesting or helpful to someone.
So, first things first. Get the data: http://static.citeulike.org/data/current.bz2. It’s bzip2-compressed, so not too large, but will be about 300M uncompressed.
It looks roughly like this:
42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|ecoli 42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|metabolism 42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|barabasi 42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|networks 43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|control 43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|engineering 43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|robustness 44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|networks 44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|strogatz 44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|survey 44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|review 45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|pleiotropy 45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|barabasi 45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|notsmall
A couple things to note:
As mentioned, it’s pipe separated. We have a document id, an anonymized user id, a date and a tag. There can be multiple tags for the same (document, user) pair.
Another thing to note is that sometimes it contains concatenatedwords. e.g. “notsmall” (a tag which seems to appear only once, for “Wrestling with pleiotropy: genomic and topological analysis of the yeast gene expression network” for some reason. I think it’s a mistag and was meant to go on “The metabolic world of Escherichia coli is not small”).
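To make the record layout concrete, here is a minimal Ruby sketch of parsing one line of the dump. The struct name and field names are made up for illustration, and it assumes every line has exactly the four fields described above.

```ruby
# Hypothetical struct; the four fields follow the description of the dump.
Tagging = Struct.new(:document_id, :user_id, :timestamp, :tag)

# Split one line of the dump into its four pipe-separated fields.
# The limit of 4 means a stray pipe inside the tag (an assumption, I have
# not checked whether any exist) stays in the tag field.
def parse_tagging(line)
  document_id, user_id, timestamp, tag = line.chomp.split("|", 4)
  Tagging.new(Integer(document_id), user_id, timestamp, tag)
end

sample = "42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|ecoli\n"
tagging = parse_tagging(sample)
puts "#{tagging.document_id} #{tagging.tag}"
```

The timestamp is left as a string here, since nothing in the cleanup needs it as a date.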
Let’s get a sense of what the tags for this look like. We’ll count how often each tag occurs, like so:
cat Desktop/current | ruby -ne 'puts $_.split("|")[-1]' | sort | uniq -c | sort -g -r > citeulike_tags
What’s this doing? Not much. We’re catting the data to standard out, feeding it through ruby to split off the last column and sorting the results, giving us a big list of tags with repetitions. We then pipe this through uniq -c, which collapses each run of consecutive identical lines into a single line with a count prepended (that’s what the -c does). We then sort again in general numeric order (-g), reversed (-r), and save the output to a file.
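For comparison, the same count can be sketched in plain Ruby. This is not what I ran, just an equivalent under the assumption that the tag is everything after the last pipe; unlike the shell version it skips lines with an empty tag and breaks ties alphabetically.

```ruby
# Count how often each tag occurs, given an enumerable of dump lines.
def count_tags(lines)
  counts = Hash.new(0)
  lines.each do |line|
    tag = line.chomp.split("|").last
    counts[tag] += 1 unless tag.nil? || tag.empty?
  end
  counts
end

# Most frequent first, roughly mirroring `sort | uniq -c | sort -g -r`.
def by_frequency(counts)
  counts.sort_by { |tag, count| [-count, tag] }
end

lines = [
  "42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|networks\n",
  "44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|networks\n",
  "44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|review\n"
]
by_frequency(count_tags(lines)).each do |tag, count|
  puts format("%7d %s", count, tag)
end
```

On the real file you would pass something like File.foreach on the unpacked dump instead of the inline array.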
Unix is fun.
Here’s what the results look like:
$ head citeulike_tags
212611 bibtex-import
156901 no-tag
27926 elegans
27886 celegans
27825 c_elegans
27795 nematode
27738 wormbase
27736 caenorhabditis_elegans
18897 review
15280 all-articles
$ tail citeulike_tags
1 00301512
1 0025
1 00208
1 001287275ep1114608epa00128727epb00128727
1 0010521342
1 0009811908
1 0009390946
1 000
1 -------------
1 ___
So there’s clearly a bunch of random noise there. The top two are unimportant – they’re some sort of autogenerated thing – and the bottom lot are garbage. So, we’ll throw away the top two and everything with only 1 occurrence.
$ vim citeulike_tags
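I did that edit by hand, but the same filter is easy to sketch in Ruby if you’d rather have it repeatable. The constant and method names are made up, and it assumes the input lines look like the uniq -c output above (a count, whitespace, then the tag).

```ruby
# The two autogenerated tags at the top of the list.
AUTOGENERATED = %w[bibtex-import no-tag].freeze

# Parse a `uniq -c` style line ("  27926 elegans") into [count, tag].
def parse_count_line(line)
  count, tag = line.strip.split(/\s+/, 2)
  [count.to_i, tag]
end

# Keep a tag only if it occurs more than once and is not autogenerated.
def keep?(count, tag)
  !tag.nil? && count > 1 && !AUTOGENERATED.include?(tag)
end

sample = ["212611 bibtex-import\n", "  27926 elegans\n", "      1 000\n"]
sample.each do |line|
  count, tag = parse_count_line(line)
  puts line if keep?(count, tag)
end
```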
Let’s look at the data again.
27926 elegans
27886 celegans
27825 c_elegans
27795 nematode
27738 wormbase
27736 caenorhabditis_elegans
18897 review
15280 all-articles
14597 evolution
13694 meeting_abstract
$ tail citeulike_tags
2 035
2 0325
2 0315
2 021
2 01012
2 004692
2 003
2 0022
2 —-
2 __
The top is looking better. I’m a bit skeptical of “all-articles”. Looking at a few examples it seems to be something generated along with bibtex-import. But we’ll leave that for now.
We’ve got a bunch of duplication there. “celegans”, “c_elegans”, “caenorhabditis_elegans”. CiteULike seems to have a nematode obsession. Not much we can do about that right now though.
The bottom half is the more serious issue. It consists entirely of numbers, which is rubbish. So let’s filter out anything that doesn’t contain some text:
$ mv citeulike_tags citeulike_tags_old
$ cat citeulike_tags_old | ruby -ne 'puts $_ if !($_ =~ /^[^a-z]+$/)' > citeulike_tags
$ tail citeulike_tags
2 0graphs
2 0ds-flicker
2 0-doek
2 0compas
2 0cath
2 05-wise-00-01
2 05matheco30_theoretic
2 05matheco25_theoretic
2 04-sose-00
2 041500040u1
Hm. Those are still pretty shitty. At this point I’m tempted to filter out everything which appears only twice. But after a quick trawl through with less I shall resist the temptation, as one finds things like this:
2 accessibility-technology-for-the-deaf
2 accessibilitystandards
2 accesory
2 acceptorphysiology
2 acceptormetabolismphysiology
2 acceptorgeneticsmetabolism
2 acceptability-judgments
2 accents
2 accent_learning
2 accentedness
2 accelerator-physics
2 acceleration-measurement
2 accelerationadverse
2 accelerating-admixture
2 accelerated-combinatorial-synthesis
Which it would be a shame to lose out on if we can avoid it.
At this point I noticed something annoying: There’s absolutely no consistency in how people space things in these tags. Even ignoring the people who concatenatewords, some people use _, some use -, some even have things like “academic-libraries—-collection-development”, which is just aggravating.
Let’s try to get some consistency out of this.
At this point I break out of the console and move to irb for some more interactive hacking. Unfortunately trying to record this turned out to be a pain due to IRB’s tendency to print the whole giant hash. So here it is as a Ruby script. This looks for all tags which differ only in terms of the presence and type of _s and -s and conflates them all. In each case it chooses the most commonly occurring one, takes that as canonical and gives it the sum of all occurrences of equivalent tags. It then normalises the separator to an underscore.
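The script itself isn’t reproduced in this text, so what follows is only a sketch of the approach just described, not the original group_duplicates.rb. The method names are hypothetical, and it works on a hash of tag to count rather than on standard input.

```ruby
# Key under which separator variants collide: "c_elegans", "c-elegans"
# and "celegans" all map to "celegans".
def separator_key(tag)
  tag.gsub(/[_-]/, "")
end

# Collapse any run of underscores and hyphens into a single underscore.
def normalise_separators(tag)
  tag.gsub(/[_-]+/, "_")
end

# counts: Hash of tag => occurrences.
# Returns a Hash of canonical tag => summed occurrences, where the
# canonical tag is the most common variant with its separators normalised.
def group_separator_variants(counts)
  groups = counts.group_by { |tag, _count| separator_key(tag) }
  groups.each_with_object(Hash.new(0)) do |(_key, variants), merged|
    canonical, _count = variants.max_by { |_tag, count| count }
    total = variants.sum { |_tag, count| count }
    merged[normalise_separators(canonical)] += total
  end
end

counts = {
  "celegans" => 27886,
  "c_elegans" => 27825,
  "academic-libraries" => 3,
  "academic_libraries" => 5
}
group_separator_variants(counts).each { |tag, count| puts "#{count} #{tag}" }
```

Note that a tag which people only ever wrote concatenated stays concatenated. This only merges variants, it doesn’t try to split words.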
I ran this as
$ cat citeulike_tags | ruby group_duplicates.rb > normalized_tags
Now, we’ve still got quite a few tags left:
$ wc -l normalized_tags
118785 normalized_tags
Let’s see if we can figure out some good ways of reducing this (or at least cutting out noise).
I dug around in it for a bit and noticed that there were a lot of tags of the form “file-import-something”. Not that many (291) but it’s a start. We’ll probably continue blacklisting things as we find them.
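A blacklist like this can be kept as a list of patterns. This is a rough sketch only: the pattern is my guess at what “file-import-something” looks like once separators may have been normalised to underscores, and the names are made up.

```ruby
# Patterns for tags we want to drop. The single entry is an assumption
# about the import tags; extend the list as more junk turns up.
BLACKLIST = [/\Afile[_-]import/].freeze

def blacklisted?(tag)
  BLACKLIST.any? { |pattern| pattern.match?(tag) }
end

tags = %w[file_import_08_11_14 file-import-09 evolution profile_import]
puts tags.reject { |tag| blacklisted?(tag) }
```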
Here’s an example of where we’ve got redundant tags: We’ve got genomic and genomics. statistic and statistics. population and populations. i.e. plurals. Let’s fix that.
For this step we’ll use a stemmer. Snowball has a decent binding for ruby, so we’ll use that. We’ll consider two tags to be equivalent if their parts stem to the same words.
We’re going to repurpose the spacing script above. Software reuse by cut and paste, yay. :-) I’ll need to do this all properly later, so I’ll clean it up for then, but now we’re just experimenting with data. So here’s a rewrite to identify things by stemming.
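That rewrite isn’t reproduced in this text either, so here is a sketch of the idea. To keep it self-contained it uses a deliberately naive stand-in stemmer that only strips a trailing “s”. That is NOT Snowball; the stemmer is passed in as a callable so a real binding can be swapped in. All names here are hypothetical.

```ruby
# Stand-in stemmer, NOT Snowball: strips one trailing "s" from words longer
# than three letters, leaving words ending in "ss" alone.
NAIVE_STEMMER = lambda do |word|
  if word.length > 3 && word.end_with?("s") && !word.end_with?("ss")
    word[0..-2]
  else
    word
  end
end

# Two tags are equivalent if their underscore-separated parts stem alike.
def stem_key(tag, stemmer)
  tag.split("_").map { |part| stemmer.call(part) }.join("_")
end

# counts: Hash of tag => occurrences. The most common variant in each
# group becomes canonical and receives the group's total.
def group_by_stem(counts, stemmer = NAIVE_STEMMER)
  groups = counts.group_by { |tag, _count| stem_key(tag, stemmer) }
  groups.each_with_object({}) do |(_key, variants), merged|
    canonical, _count = variants.max_by { |_tag, count| count }
    merged[canonical] = variants.sum { |_tag, count| count }
  end
end

counts = { "genomics" => 10, "genomic" => 4, "population" => 7, "populations" => 9 }
group_by_stem(counts).each { |tag, count| puts "#{count} #{tag}" }
```

Because the canonical tag is the most common surface form rather than the stem, the output keeps readable tags like “genomics” instead of stemmer output.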
At this point we’re down to 106229 tags, from 118494 prior to stemming and 272919 originally. Not doing badly, given that most of what we’ve thrown away is junk or redundant information. If we threw out the tags which only appear twice we’d lose another 30352 tags (cat normalized_tags_2 | grep " 2 " | wc -l). I was resisting doing that because some of them are quite good quality, but really I don’t think we have enough information to clean the remainder up.
We’re nearly at the point where we’ve run out of what we can do with frequency and word based information – there’s still plenty more we could do in principle, but we’ll hit the point of diminishing returns pretty rapidly from here on out. One thing I have noticed though is that there are a bunch of tags like “of” and “and”. At this point I shall pull out the favourite hacky linguistic hammer: The stopword list.
The one I tend to use was compiled for the SMART system by Chris Buckley and Gerard Salton at Cornell University. You can download a copy here. All we’ll use this for is to filter out any tags which happen to be stopwords. Here’s an obvious script to use it to remove stopwords.
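The script isn’t reproduced in this text, so here is a sketch of what such a filter might look like. It assumes the stopword list is a plain text file with one word per line, and the method names are made up.

```ruby
require "set"
require "stringio"

# Load a stopword list from any IO, one word per line (an assumption about
# the file format); blank lines are skipped.
def load_stopwords(io)
  io.each_line.map(&:strip).reject(&:empty?).to_set
end

# Drop any tag that is exactly a stopword. Tags that merely contain one
# ("theory_of_mind") are left alone.
def remove_stopwords(counts, stopwords)
  counts.reject { |tag, _count| stopwords.include?(tag) }
end

stopwords = load_stopwords(StringIO.new("of\nand\n\nthe\n"))
counts = { "of" => 120, "and" => 95, "theory_of_mind" => 40, "evolution" => 14597 }
remove_stopwords(counts, stopwords).each { |tag, count| puts "#{count} #{tag}" }
```

With a real list you would pass an opened File in place of the StringIO.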
This loses us 46 tags. Doesn’t sound like many, but they were mostly fairly high frequency ones, so it’s a nice win.
At this point I declare this part of the work done. We’ve compiled a good list of tags and, although I’ve not actually written the code to do so, because each tag was arrived at by grouping a bunch of existing tags, it’s easy to see how you could figure out which documents are tagged with which of the new tags (if you can’t see, don’t worry, I’ll be tidying this up and using it to generate a tag list next time). In order to go further, we’ll need to actually look at sets of tagged documents. And, to be honest, I don’t know how much further it will take us. It may be that there’s not much more to add. But I’m hopeful.
If you want to have a play with the data, here’s the end result.
In irb, I’ve taken to habitually tacking “;nil” onto the end of a command that would return a ginormous hash/array/etc. Speeds things up quite a bit, too.
>> a = (1..1000000).map {1}; nil
=> nil
>> a.size
=> 1000000
Yeah, I know. I try to do that, but I forget every now and then and when you have 100K items you only need to do it once and you’ve lost the log.
Yeah, true. I wish there were a command line option to suppress it. There’s an option to uglify the output (--noinspect), which slightly reduces the space. I guess there’s no easy way to figure out what to suppress and what not to.
One slightly ‘orrible trick I tried was to monkey patch inspect to return “”, but for some reason this didn’t work correctly. I think it might be that some of the inspect methods are intrinsic.
to mute IRB …
IRB.conf[:PROMPT][ IRB.conf[:PROMPT_MODE] ][:RETURN] = ''
from
http://groups.google.com/group/ruby-talk-google/browse_thread/thread/9c1febbe05513dc0?pli=1
Wow. Fantastic. That’s going right in my .irbrc.
Re: tag “all-articles”. We sometimes get requests to tag all articles with the same tag (typically so a user can delete all their articles) and this is the tag name we typically use.
@Fergus: Ah. Interesting. Thanks for the info.
Pingback: porges - Cleaning up a set of tags with Awk
As you can see from the pingback I’ve written a reply semi-tutorial on how to use Awk to do the same task. Hope you don’t mind :)
I don’t mind at all. :-) It was an interesting read. Thanks.
I will however probably continue using Ruby for these tasks. In particular I think you’re going to be completely unable to port the second one to awk because that’s where it stops being remotely line oriented and where I start making use of more general purpose libraries. I could start out with awk and switch to Ruby when that happens, but frankly I’m much more comfortable keeping it in the same language throughout.
One comment on pluralization. Although it sounds like this did not apply to the data set you normalized, it is possible that Statistic applied to a single fact or piece of information, while Statistics referred to the field of study, and there may be additional cases where the plural and singular are usefully different entities.
That said, I appreciate your work on this, especially the details given on your approach. I hope that because of work like yours, sites will consider cleaning up their tags. Only two sites I can think of attempt to control their tags to any degree: Amazon, which presents previously used tags, and Stack Overflow, which handles this with tag auto-completion and a reputation requirement for creating new tags.
It’s true that stripping off pluralisation can change the meaning of the tag. My expectation is that these cases will be sufficiently rare compared to the number of cases where this removes redundant tags that I can live with the slight loss of useful information. Generally the meaning will be close enough to being preserved that it’s tolerable.
I’m going to do some analysis on usage later when I try to clean up things further, and when I do I’ll see if I can spot any cases where it’s actually breaking things.
Also I’m glad you’re finding it interesting. :-) I don’t expect that much of what I’m doing will see use on these sites – it’s not really for that. It’s more for building data analysis on top of a noisy tag set.