So I have access to my twitter archive now. I was very excited by this, but a month later I still hadn’t done anything about it.

I decided to fix this.

Everything I’m doing today is on a Linux system (running Mint if you must know). You will need the following things installed to follow all of it.

atool
This is basically an archive management tool. Actually you don’t need this at all, but I used it as part of setup and it’s totally worth having on every system you run, so I’m mentioning it anyway.
git
Again, this is strictly optional, but you might want to replace it with some other VCS. When I’m working with a bunch of files I like to put them under version control, so any destructive operations I perform (deliberately or inadvertently) can easily be backed out of. I like git, so I used git, but there’s no reason to use it specifically over just about anything else here.
moreutils
I use this in precisely one step, but it’s quite useful for that step. You can easily find a way to do without it.
wget
jq
This is absolutely the core utility I’m using and you need it installed in order to do anything useful with this post.

david@volcano-base ~ $ mkdir -p data
david@volcano-base ~ $ cd data/
david@volcano-base ~/data $ aunpack ~/Downloads/tweets.zip
Archive: /home/david/Downloads/tweets.zip
inflating: Unpack-2420/data/js/tweets/2013_03.js
(etc)
tweets.zip: extracted to `tweets' (multiple files in root)
david@volcano-base ~/data $ cd tweets/
css/ data/ img/ js/ lib/
david@volcano-base ~/data $ cd tweets/data/js/tweets/
david@volcano-base ~/data/tweets/data/js/tweets $ ls
2008_04.js 2008_09.js 2009_01.js (etc)

You’re now in a directory with lots of .js files.

Before we do anything, let’s put everything under git.

david@volcano-base ~/data/tweets/data/js/tweets $ git init
Initialized empty Git repository in /home/david/data/tweets/data/js/tweets/.git/
david@volcano-base ~/data/tweets/data/js/tweets $ git add *.js
david@volcano-base ~/data/tweets/data/js/tweets $ git commit -m "Initial commit of data files"
[master (root-commit) 9aeb0dc] Initial commit of data files
 59 files changed, 699462 insertions(+)
 create mode 100644 2008_04.js
(etc)

Now let’s take a look at what we have here:

david@volcano-base ~/data/tweets/data/js/tweets $ head -n25 2008_04.js
Grailbird.data.tweets_2008_04 =
 [ {
  "source" : "web",
  "entities" : {
    "user_mentions" : [ ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "geo" : {
  },
  "id_str" : "799851705",
  "text" : "Yay for google whoring.",
  "id" : 799851705,
  "created_at" : "Tue Apr 29 21:30:07 +0000 2008",
  "user" : {
    "name" : "David R. MacIver",
    "screen_name" : "DRMacIver",
    "protected" : false,
    "id_str" : "14368342",
    "profile_image_url_https" : "https://si0.twimg.com/profile_images/2609387884/1tu53xdcpssixve5o09m_normal.jpeg",
    "id" : 14368342,
    "verified" : false
  }
}, {

First thing we see: These aren’t JSON files. They are actually JavaScript. Fortunately they’re formatted in such a way that we can easily turn them into JSON:

david@volcano-base ~/data/tweets/data/js/tweets $ for file in *.js; do tail -n+2 $file | sponge $file; done
david@volcano-base ~/data/tweets/data/js/tweets $ git diff | head
diff --git a/2008_04.js b/2008_04.js
index 5737052..2046162 100644
--- a/2008_04.js
+++ b/2008_04.js
@@ -1,4 +1,3 @@
-Grailbird.data.tweets_2008_04 =
 [ {
   "source" : "web",
   "entities" : {
diff --git a/2008_06.js b/2008_06.js
david@volcano-base ~/data/tweets/data/js/tweets $ git commit -a -m "remove initial assignment line so all our files are valid javascript"
[master 8d41481] remove initial assignment line so all our files are valid javascript
 59 files changed, 59 deletions(-)

What’s going on here?

Well, from the tail man page:

-n, --lines=K
output the last K lines, instead of the last 10; or use -n +K to output lines starting with the Kth

So tail -n+2 outputs all lines starting with the second.

sponge is from moreutils. According to its man page:

sponge reads standard input and writes it out to the specified file. Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows constructing pipelines that read from and write to the same file.

So what we’re doing in this loop is, for each JavaScript file, stripping off the first line, buffering up the rest and then writing it back to the original file.

(I expect there’s also a sed one-liner to do this, but this was easier than looking up what it was.)

Right. Now for something actually interesting!

For a starting point, I’ve often wondered how much I’ve actually written on twitter. So let’s do that.

david@volcano-base ~/data/tweets/data/js/tweets $ cat *.js | jq -r '.[] | .text' | wc -w
326212

Before I analyze what I’ve just done, I’m going to marvel at the fact that that’s a metric fuckton of words. Depending on how you count that’s about three novels tweeted. I don’t know if that says good or bad things about me.

Now let’s unpack the command.

First the uninteresting bits: cat *.js concatenates all the js files and spews them to stdout, wc -w counts the number of words fed to its stdin. You knew that though.

Now let’s talk about the jq command, which is the interesting bit. To be clear: I’m learning jq as I write this (I’ve been meaning to for a while and then didn’t), so I really don’t know that much about it, and any of what I say may be wrong.

As I understand it, the jq model is that everything in it is a stream of JSON values, and that’s also how it parses its STDIN. This is why concatenating the JSON files and feeding them to jq works: It parses one value from STDIN, then another, then another. So we’re starting with a stream of arrays.

We’re then building up a filter. ‘.’ is simply the filter which pipes its input to its output, but by modifying it as ‘.[]’ we get a filter which accepts a stream of arrays and unpacks them by reading each array then streaming it to the output one at a time. So we’ve taken our stream of arrays and turned it into a stream of the array contents. Let’s verify that:

david@volcano-base ~/data/tweets/data/js/tweets $ < 2008_04.js jq '.[]' | head -n50
{
  "user": {
    "verified": false,
    "id": 14368342,
    "profile_image_url_https": "https://si0.twimg.com/profile_images/2609387884/1tu53xdcpssixve5o09m_normal.jpeg",
    "id_str": "14368342",
    "protected": false,
    "screen_name": "DRMacIver",
    "name": "David R. MacIver"
  },
  "created_at": "Tue Apr 29 21:30:07 +0000 2008",
  "id": 799851705,
  "text": "Yay for google whoring.",
  "id_str": "799851705",
  "geo": {},
  "entities": {
    "urls": [],
    "hashtags": [],
    "media": [],
    "user_mentions": []
  },
  "source": "web"
}
{
  "user": {
    "verified": false,
    "id": 14368342,
    "profile_image_url_https": "https://si0.twimg.com/profile_images/2609387884/1tu53xdcpssixve5o09m_normal.jpeg",
    "id_str": "14368342",
    "protected": false,
    "screen_name": "DRMacIver",
    "name": "David R. MacIver"
  },
  "created_at": "Tue Apr 29 10:36:22 +0000 2008",
  "id": 799412882,
  "text": "I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more...",
  "id_str": "799412882",
  "geo": {},
  "entities": {
    "urls": [],
    "hashtags": [],
    "media": [],
    "user_mentions": []
  },
  "source": "web"
}
{
  "user": {
    "verified": false,
    "id": 14368342,

(I’ve switched to just using the 2008_04.js file because we’re only looking at a small amount of data)

We do indeed get a sequence of JSON objects one after another (note no commas or array markers).

We then build up a pipe inside the jq language. The “.text” filter reads its input stream, looks up the text property on each value and outputs the result, as follows.

david@volcano-base ~/data/tweets/data/js/tweets $ < 2008_04.js jq '.[] | .text' | head -n10
"Yay for google whoring."
"I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more..."
"Unleashing my inner interior decorator."
"Having trouble with the twitter UI, which is just embarassing given how simple it is. :-)"
"giving this twitter thing another try."
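To see the stream model on something small, here is a minimal sketch. The sample data is made up rather than taken from the archive, and the -c flag (compact output) is only there to keep each value on one line.

```shell
# Hypothetical sample: two JSON arrays back to back, as if two of the
# monthly files had been concatenated with cat.
sample='[{"text":"one"},{"text":"two"}]
[{"text":"three"}]'

# '.' passes each top-level value through unchanged: two arrays in, two out.
printf '%s\n' "$sample" | jq -c '.'

# '.[]' unpacks each array into its elements: three objects out.
printf '%s\n' "$sample" | jq -c '.[]'

# Piping into '.text' keeps just that field, still JSON-quoted.
texts=$(printf '%s\n' "$sample" | jq -c '.[] | .text')
printf '%s\n' "$texts"
```

The array boundaries disappear after '.[]', which is why the number of files the tweets came from stops mattering for everything downstream.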

The final bit to explain here is the -r flag. From the man page:

--raw-output / -r: With this option, if the filter’s result is a string then it will be written directly to standard output rather than being formatted as a JSON string with quotes. This can be useful for making jq filters talk to non-JSON-based systems.

Indeed it can:

david@volcano-base ~/data/tweets/data/js/tweets $ < 2008_04.js jq -r '.[] | .text' | head -n10
Yay for google whoring.
I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more...
Unleashing my inner interior decorator.
Having trouble with the twitter UI, which is just embarassing given how simple it is. :-)
giving this twitter thing another try.

So we’ve chained all these together to get the actual text.

Now let’s save that text so we can do a bit more analysis on it:

david@volcano-base ~/data/tweets/data/js/tweets $ head all_tweets.txt
Yay for google whoring.
I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more...
Unleashing my inner interior decorator.
Having trouble with the twitter UI, which is just embarassing given how simple it is. :-)
giving this twitter thing another try.
No, seriously. I mean it.
Why do people use ASP? Every site I've encountered using it has driven me insane with its awfullness
Barack Obama is the eleventh doctor!
Brillig and the Slithy Toves would make a great band name
According to Victoria I am well positioned to be the antichrist
I know understand why tea and crumpets are our traditional fare
david@volcano-base ~/data/tweets/data/js/tweets $ git add all_tweets.txt
david@volcano-base ~/data/tweets/data/js/tweets $ git commit all_tweets.txt -m "Just the text for all tweets"
[master df1e778] Just the text for all tweets
 1 file changed, 20553 insertions(+)
 create mode 100644 all_tweets.txt
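The command that created all_tweets.txt is not shown in the session. Presumably it was the earlier pipeline with its output redirected to a file; the sketch below runs that assumed command in a scratch directory on made-up data, so nothing in the real archive is touched.

```shell
# Scratch directory with two hypothetical monthly files.
workdir=$(mktemp -d)
printf '[{"text":"first tweet"},{"text":"second tweet here"}]\n' > "$workdir/2008_04.js"
printf '[{"text":"third"}]\n' > "$workdir/2008_05.js"

# Assumed save step: the same filter as before, redirected into a file.
cat "$workdir"/*.js | jq -r '.[] | .text' > "$workdir/all_tweets.txt"

# Counting words in the saved file should agree with counting them in the pipe.
words=$(wc -w < "$workdir/all_tweets.txt")
echo "$words"
```

Saving the text once means the later grep experiments don't have to re-parse every JSON file each time.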

And a sanity check:

david@volcano-base ~/data/tweets/data/js/tweets $ wc -w all_tweets.txt
326212 all_tweets.txt

Good, the same answer.

Now, let’s ask another interesting question: How much have I written not counting @replies?

Answer: Not nearly so much.

david@volcano-base ~/data/tweets/data/js/tweets $ grep -v '@' all_tweets.txt | wc -w
88966

So apparently most of my twitter usage is conversations: We go from about 3 novels to a short novel or long novella if I remove them.

What’s going on here?

Well, I’m doing a text search on all_tweets.txt. grep ‘@’ gives me all lines which contain an @, and then the -v flag inverts the sense of the match:

david@volcano-base ~/data/tweets/data/js/tweets $ grep '@' all_tweets.txt | head
@benaud I could. It's kinda hacked together and purely for sending messages. Still want it?
@t_a_w JQuery doesn't meet my needs in two crucial ways: a) I'm not doing something in the browser. b) It left survivors.
@t_a_w JQuery fails my needs in two crucial ways. Firstly, I'm not trying to do something in the browser. Secondly, it left survivors.
@gnufied It's possible I was being slightly facetious...
@gnufied But that is not object orientated! Encapsulation! You are a bad programmer and should go back to writing C with global variables.
@t_a_w I'll have you know that the book I just bought covers the state of the art up to at *least* 1990.
@jherber Not sure.I don't think arrays are the issue so much as the general collections API. Certainly there are too damn many toList calls.
@jherber Oh, hey. I'd forgotten about the embarrassing slowness of List.sort. Thanks for reminding me. I'll add fix that.
@mikesten Right, but there are two reports. The one I sent a link to is everything on expertise-identification, but there are other ones.
@mikesten The full super-scary thing or just the work applicable ones? :-)
david@volcano-base ~/data/tweets/data/js/tweets $ grep -v '@' all_tweets.txt | head
Yay for google whoring.
I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more...
Unleashing my inner interior decorator.
Having trouble with the twitter UI, which is just embarassing given how simple it is. :-)
giving this twitter thing another try.
No, seriously. I mean it.
Why do people use ASP? Every site I've encountered using it has driven me insane with its awfullness
Barack Obama is the eleventh doctor!
Brillig and the Slithy Toves would make a great band name
According to Victoria I am well positioned to be the antichrist
I know understand why tea and crumpets are our traditional fare
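A small sketch of how grep and grep -v split a file into two disjoint sets. The four lines are made up, and the -c flag (print a count of matching lines instead of the lines themselves) is an addition that the commands above don't use.

```shell
# Hypothetical sample file: two lines with an @, two without.
sample=$(mktemp)
printf '%s\n' \
  '@alice thanks for the link' \
  'Coffee first, code later' \
  'cc @bob on this one' \
  'No mentions here' > "$sample"

with_at=$(grep -c '@' "$sample")       # lines containing an @ anywhere
without_at=$(grep -vc '@' "$sample")   # lines with no @ at all
total=$(wc -l < "$sample")

echo "$with_at with @, $without_at without, $total total"
```

Note that the third line is not a reply, yet it is excluded too: grep -v '@' drops any tweet that mentions anyone anywhere, so the figure above is really "words in tweets with no @ in them".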

We could have also worked this out with jq:

david@volcano-base ~/data/tweets/data/js/tweets $ cat *.js | jq -r '.[] | select(.entities.user_mentions | length == 0) | .text' | wc -w
91071

What’s going on here?

What we’ve done is we’ve added a filter in the middle of our two previous filters. This is a select filter, as explained in the manual. For each input value, it passes that value to the filter it’s wrapping (note that the thing inside is a filter!) and emits the input once for each true result that filter produces (strictly, each result that is neither false nor null). If the filter produces no true result, the input is dropped. Here we only ever output one value (whether the length of the user_mentions array is 0), but here’s an illustration of what happens if you output multiple values:

david@volcano-base ~/data/tweets/data/js/tweets $ echo -e '[false]\n[true]\n[false,true]\n[true,false]' | jq 'select(.[])'
[
  true
]
[
  false,
  true
]
[
  true,
  false
]

Observe that the arrays with any true value in them get returned but the arrays with all false values don’t.
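One detail the example above doesn't exercise is an array with more than one true value. As far as I can tell from the jq manual, select(f) behaves like "if f then . else empty end", so the input is emitted once per true result rather than at most once. A minimal sketch on made-up data (the any builtin exists in current jq releases; whether it existed in the version used for this post is not something I have checked):

```shell
# One JSON array per input line (made-up data). select(.[]) runs the inner
# filter .[] on each array and emits the array once per true value it yields.
out=$(printf '%s\n' '[false]' '[false,true]' '[true,true]' | jq -c 'select(.[])')
printf '%s\n' "$out"

# Collapsing the test to a single boolean first emits each input at most once.
once=$(printf '%s\n' '[false]' '[false,true]' '[true,true]' | jq -c 'select(any)')
printf '%s\n' "$once"
```

In the first command [true,true] comes out twice. This makes no difference to the user_mentions filter, because "length == 0" always yields exactly one boolean per tweet.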

An interesting thing to note: These answers are not exactly the same. I have more words written with no user mentions than I do words from tweets with no @s in them. What’s going on?

Let’s find out:

david@volcano-base ~/data/tweets/data/js/tweets $ cat *.js | jq -r '.[] | select(.entities.user_mentions | length == 0) | select(.text | contains("@")) | .text' | grep -o '@[[:alnum:]_]\+' | sort | uniq -c | sort -nr
     68 @njbartlett
     19 @sylviazygalo
      8 @Lars_Westergren
      7 @burningodzilla
      4 @jeanfinds
      3 @a_y_alex
      2 @torsemaciver
      2 @Ms_Elevator
      2 @charlesarmstrong
      2 @cambellmichael
      2 @benreesman
      2 @allbery_b
      1 @zarkonnen
      1 @trampodevs
      1 @tooPsychedout
      1 @timperret
      1 @thatsoph
      1 @nuart11
      1 @missbossy
      1 @mikevpotts
      1 @michexile
      1 @MetalBeetleLtd
      1 @mccraicmccraig
      1 @MatthijsAKrul
      1 @m
      1 @lucasowen85
      1 @lisascott869
      1 @lauren0donnell
      1 @Knewton_Pete
      1 @kittenhuffers
      1 @jnbartlet
      1 @itomicsam
      1 @iamreddaveMaximum
      1 @georgie_guy
      1 @geezusfreeek
      1 @GarethAugustus
      1 @drmaciver
      1 @debashshg
      1 @dbasishg
      1 @CTerry1985
      1 @communicaI
      1 @cipher3d
      1 @carnotwitk

These are the usernames that appear in tweets that have no user_mentions in them. Based on a few random samples, none of them appear to be valid twitter user names. Some of them are obviously typos, some of them I recognise as people who have changed their usernames or deleted their accounts. So I think that explains that.

Now let’s explain the command.

The jq command should hopefully be obvious: it’s just more of the same, but I’ve added a new select (this time using contains for string matching).

The remainder is interesting though. The grep string ‘@[[:alnum:]_]\+’ is a basic regular expression (using the POSIX character class [:alnum:]) which matches strings which start with @ and then are followed by any combination of alphanumeric characters and underscores. So e.g. “@foobarbaz” matches, as does “@foobar123”, or “@foo_bar_baz”. The -o flag says to only print the matches rather than the lines containing matches, as is grep’s normal behaviour. So e.g.

david@volcano-base ~/data/tweets/data/js/tweets $ <all_tweets.txt grep -o '@[[:alnum:]_]\+' | head
@benaud
@t_a_w
@t_a_w
@gnufied
@gnufied
@t_a_w
@jherber
@jherber
@mikesten
@mikesten

What’s going on with the bit after the grep?

As a whole unit, ‘sort | uniq -c | sort -nr’ means “Give me a tabulated count of all the lines I feed in here, ordered by reverse frequency”.

Taking it apart, what happens is this:

First we feed the results into sort. This, err, sorts them (in dictionary order). We then pass the output of that to uniq.

What uniq classically does is it removes all consecutive lines which are the same. So for example:

david@volcano-base ~/data/tweets/data/js/tweets $ <all_tweets.txt grep -o '@[[:alnum:]_]\+' | head | uniq
@benaud
@t_a_w
@gnufied
@t_a_w
@jherber
@mikesten

Note that it doesn’t remove all duplicates: Only adjacent ones. That’s why we had to sort first.

Adding the -c flag causes it to count up those adjacent lines:

david@volcano-base ~/data/tweets/data/js/tweets $ <all_tweets.txt grep -o '@[[:alnum:]_]\+' | head | uniq -c
      1 @benaud
      2 @t_a_w
      2 @gnufied
      1 @t_a_w
      2 @jherber
      2 @mikesten

So between the sorting and the counting, this gives us our tallies.

We then sort again to put it in order. However we don’t want to sort numbers by their string value (this would put e.g. 2 after 11), so the -n flag to sort tells it to sort numerically. It compares lines by the number at the start of each line; as far as I can tell, lines with equal numbers then fall back to an ordinary comparison of the whole line. This would put it in ascending numerical order, so the -r flag reverses that.
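The whole tally idiom can be tried on a handful of lines. This is a sketch on made-up usernames; the awk call at the end is an addition that only pulls fields out of the winning line.

```shell
# Hypothetical input: six mentions, none of them adjacent duplicates.
mentions='@bob
@alice
@bob
@carol
@bob
@alice'

# uniq on its own only collapses adjacent duplicates, so nothing merges here.
unsorted=$(printf '%s\n' "$mentions" | uniq | wc -l)

# Sorting first brings duplicates together, uniq -c counts each run,
# and sort -nr orders the tallies by count, largest first.
tally=$(printf '%s\n' "$mentions" | sort | uniq -c | sort -nr)
printf '%s\n' "$tally"

# The most frequent name and its count are on the first line.
top=$(printf '%s\n' "$tally" | head -n1 | awk '{print $2}')
top_count=$(printf '%s\n' "$tally" | head -n1 | awk '{print $1}')
```

Without the first sort, the same name would show up in several separate rows of the tally, each with its own partial count.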

OK. So, the text is actually a more reliable guide to tweets than the metadata if I just want to know who I’m tweeting at. And I do. So let’s get that.

david@volcano-base ~/data/tweets/data/js/tweets $ <all_tweets.txt grep -o '@[[:alnum:]_]\+' | sort | uniq -cd | sort -nr | head
    793 @angusprune
    641 @LovedayBrooke
    503 @petermacrobert
    412 @pozorvlak
    360 @stef
    339 @bluealchemist
    322 @drsnooks
    313 @reyhan
    271 @communicating
    239 @zarkonnen

(The extra -d flag tells uniq to print only lines that occur more than once, which drops the one-off mentions.)

I’m not sure if that’s the exact answer I would have expected, but it’s not terribly surprising.

Now I’d like to take a look at what words I use. In order to do this I’ll use grep again to extract words from the all_tweets.txt file:

david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '\b[[:alpha:]]\+\b' all_tweets.txt | head
Yay
for
google
whoring
I
get
more
and
more
sick

The ‘\b’ special character there is the only thing new in this (also [:alpha:], but that just means any alphabetic character). ‘\b’ means word boundary.

One thing stands out here: There are going to be a lot of really uninteresting words in this, like “I” and “more” and “and”. In natural language processing, these are typically called stop words. We want to remove those from the calculation.

So what do we do?

Well, first we find a list of stop words. Some brief googling led me to this file, which seems to be good enough. Let’s fetch it:

david@volcano-base ~/data/tweets/data/js/tweets $ wget https://stop-words.googlecode.com/svn/trunk/stop-words/stop-words/stop-words-english1.txt
--2013-03-03 13:01:39-- https://stop-words.googlecode.com/svn/trunk/stop-words/stop-words/stop-words-english1.txt
Resolving stop-words.googlecode.com (stop-words.googlecode.com)... 2a00:1450:400c:c05::52, 173.194.78.82
Connecting to stop-words.googlecode.com (stop-words.googlecode.com)|2a00:1450:400c:c05::52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4812 (4.7K) [text/plain]
Saving to: `stop-words-english1.txt'

100%[======================================>] 4,812 --.-K/s in 0.001s

2013-03-03 13:01:39 (8.29 MB/s) - `stop-words-english1.txt' saved [4812/4812]

Now! An annoying bit. What came next was mysteriously not working for me. It turns out that there were some problems with this file.

The first reason it was not working was that a BOM (Byte Order Mark) at the beginning of the file was confusing poor grep. So first we have to strip the BOM like so:

david@volcano-base ~/data/tweets/data/js/tweets $ tail -c+4 stop-words-english1.txt | sponge stop-words-english1.txt

-c works like -n except that it functions on characters (really, bytes) instead of lines. So what we’re doing here is stripping out the first three bytes of the file, which are the BOM.
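One thing to watch with tail -c+4 is that it removes three bytes whether or not they are a BOM, so running it twice eats real data. A minimal sketch of a more cautious variant, on a made-up file: it checks for the UTF-8 BOM bytes (EF BB BF) first, and uses a temporary file instead of sponge. The file contents and variable names are hypothetical.

```shell
# Hypothetical stop word file that starts with the UTF-8 BOM (octal escapes).
f=$(mktemp)
printf '\357\273\277a\nabout\n' > "$f"

# Look at the first three bytes as hex before deciding to strip anything.
first=$(head -c3 "$f" | od -An -tx1 | tr -d ' \n')
if [ "$first" = "efbbbf" ]; then
  # Same tail -c+4 as above, written via a temporary file rather than sponge.
  tail -c+4 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
fi

head -n1 "$f"
```

Because of the check, running the block again on the already-cleaned file leaves it alone.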

The second reason it was not working was to do with line endings.

Basically, line endings are actually special characters which say “Here! Start a new line!”. Unfortunately there is some disagreement as to just what those special characters are. Unix considers it to be a single newline character, traditionally represented as ‘\n’, whereas Windows (and the web) consider it to be two characters, ‘\r\n’, the first being a carriage return, meaning “Go back to the beginning of the line”.

We’re going to be feeding these lines to grep to use as patterns, and grep is of the unix, so it wants its new lines to be pure ‘\n’ and will consider ‘\r’ to be part of the pattern to be matched on. So we need to remove those:

david@volcano-base ~/data/tweets/data/js/tweets $ sed -i 's/\r//' stop-words-english1.txt

The pattern we’re giving sed says “replace the first carriage return on each line with an empty string”. The -i flag means “And then replace the file with the results of this”. Normally it would write the results to the console.

Some poking at the file caused me to realise it also doesn’t consider “I” to be a stop word. I don’t know why. Let’s fix that:

david@volcano-base ~/data/tweets/data/js/tweets $ echo i >> stop-words-english1.txt
david@volcano-base ~/data/tweets/data/js/tweets $ tail stop-words-english1.txt
you'd
you'll
your
you're
yours
yourself
yourselves
you've
zero
i

General lesson here: Real world data is often messy. If your code isn’t working make sure the bug isn’t in your data.

Now our stop words are ready to use. Let’s add them to the git repo:

david@volcano-base ~/data/tweets/data/js/tweets $ git add stop-words-english1.txt
david@volcano-base ~/data/tweets/data/js/tweets $ git commit stop-words-english1.txt -m "File containing stop words"
[master bec4115] File containing stop words
 1 file changed, 636 insertions(+)
 create mode 100644 stop-words-english1.txt

We’re now prepared to use the stop word list:

david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '\b[[:alpha:]]\+\b' all_tweets.txt | grep -xvi -f stop-words-english1.txt | head
Yay
google
whoring
sick
Java
passing
day
Impressive
m
writing

grep -f means “Use the lines from this file as patterns and match on any of them”

The flag -v as before means “invert the match”, i.e. only give us things that don’t match the pattern. -i tells it to match case insensitively and -x tells it to only match things which match the whole line (So e.g. the fact that we have a single character ‘i’ in the list shouldn’t exclude words containing i).
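To see both the carriage return problem and the -x behaviour on something tiny, here is a sketch with a made-up three-word stop list and five made-up input words. It assumes GNU grep and GNU sed, as used in the rest of this post.

```shell
dir=$(mktemp -d)
# Hypothetical stop word list with Windows line endings (\r\n).
printf 'the\r\nand\r\ni\r\n' > "$dir/stop.txt"
printf '%s\n' The coffee and I win > "$dir/words.txt"

# With the \r still in place every pattern is really "the\r", "and\r", ...
# so no whole-line match is possible and nothing gets filtered out.
before=$(grep -xvi -f "$dir/stop.txt" "$dir/words.txt" | wc -l)

# Strip the trailing carriage returns, then filter again.
sed -i 's/\r$//' "$dir/stop.txt"
after=$(grep -xvi -f "$dir/stop.txt" "$dir/words.txt")
printf '%s\n' "$after"
```

After the cleanup, "The", "and" and "I" are removed (case-insensitively), while "win" survives even though it contains an i, because -x only accepts whole-line matches.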

Annoyingly the word m is still coming through. Rather than add every single-character word to the stop words, let’s just change our search to only match words of three letters or longer:

david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '\b[[:alpha:]]\{3,\}\b' all_tweets.txt | grep -xvi -f stop-words-english1.txt | head
Yay
google
whoring
sick
Java
passing
day
Impressive
writing
Unleashing

Replacing the ‘\+’ with ‘\{3,\}’ has made the pattern only match words which are at least 3 characters long.

Now, let’s actually get the answer:

david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '\b[[:alpha:]]\{3,\}\b' all_tweets.txt | grep -xvi -f stop-words-english1.txt | sort | uniq -c | sort -nr | head
   1573 http
   1134 don
    901 people
    826 good
    794 angusprune
    641 LovedayBrooke
    610 time
    525 work
    504 petermacrobert
    484 bit

Err. Clearly there are some problems here.

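First problem is that fairly obviously some of these are @replies showing up as words. Apparently word boundary (‘\b’) doesn’t mean what I thought it did: it matches at any transition between a word character (letter, digit or underscore) and anything else, so ‘@’, ‘:’ and the apostrophe all count as boundaries. That’s why it’s also extracting things like http from http:// and don from don’t.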
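The boundary behaviour is easy to reproduce on a single made-up line containing a mention, a contraction and a URL. This sketch assumes GNU grep, where ‘\b’ and ‘\+’ are supported as extensions.

```shell
# Made-up line containing a mention, a contraction and a URL.
line="@angusprune don't http://example.com"

# The original pattern: runs of letters with a word boundary on each side.
words=$(printf '%s\n' "$line" | grep -o '\b[[:alpha:]]\+\b')
printf '%s\n' "$words"
```

The pieces that come out are angusprune, don, t, http, example and com: every punctuation character splits a token in two, and the @ is simply left behind.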

Here’s my replacement solution:

david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '[^[:space:]]\{3,\}\b' all_tweets.txt | grep -v '@' | grep -xvi -f stop-words-english1.txt | sort | uniq -c | sort -nr | head -n25
    887 people
    810 good
    600 time
    519 work
    443 problem
    361 code
    350 bad
    335 Yeah
    327 idea
    300 pretty
    300 point
    287 bit
    285 day
    279 find
    278 lot
    275 wrong
    250 coffee
    246 hard
    229 read
    224 feel
    221 thought
    216 twitter
    197 today
    191 long
    190 works

Basically instead of looking for purely alphabetic words we look for any sequences of non-whitespace characters. We then filter out things with @ in them afterwards.

The results here turn out to be… really uninteresting. About the only things that look remotely specific to me on here are “coffee” and “code”, both of which it’s true I do care about quite a lot. Mmm. Tasty, tasty code.

Before I go, here’s one more thing:

david@volcano-base ~/data/tweets/data/js/tweets $ cat *.js | jq -r '.[] | .entities.urls[] | .expanded_url' | head
http://ncommandments.com/40
http://twitpic.com/43pl3t
http://twitpic.com/43nccl
http://ncommandments.com/893
http://ncommandments.com/42
http://twitpic.com/4dnzhr
http://yfrog.com/h2797xpj
http://twitpic.com/4a1t1z
http://twitpic.com/49yzs2
http://twitpic.com/49euse

Every URL you’ve ever posted to twitter (well, it would be without the head at the end to truncate it). Just more chaining of jq filters – we unpack the arrays, then we get the URLs off the object as an array, then we unpack that, then we get the .expanded_url off the url objects.
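A minimal sketch of that chain on made-up data, showing what happens to tweets that have no URLs at all. The two tweets and their URLs are hypothetical; only the field names come from the archive format shown earlier.

```shell
# Hypothetical month file: one tweet without links, one with two.
sample='[
  {"text":"no links","entities":{"urls":[]}},
  {"text":"two links","entities":{"urls":[
    {"expanded_url":"http://example.com/a"},
    {"expanded_url":"http://example.org/b"}]}}
]'

# Unpack the tweets, unpack each tweet's urls array, take expanded_url.
urls=$(printf '%s\n' "$sample" | jq -r '.[] | .entities.urls[] | .expanded_url')
printf '%s\n' "$urls"
```

The tweet with an empty urls array contributes nothing to the output, because unpacking an empty array with [] produces no values for the rest of the pipe to work on.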

And that’s about it for now. I can’t think of anything else I particularly want to do. Time line analysis might be interesting – i.e. what’s changed over time (particularly in terms of who I tweet at), but I’m not very interested in doing that right now so I think I’ll leave it there.

Questions? Anything you’d like to know how to do?


Have you considered applying something similar to, if not better than, the tf-idf algorithm to get a rough idea of the important words, the assumption being that the important words are what your twttr updates are mostly concerned with?