So I’m playing around with frequent item set mining, and the approach I’m trying (not that I expect it’s at all novel. I just want to get a feel for the problems, and maybe build a useful tool if I can’t find any that match my needs) needs a “getting started” step which works out the event list for all frequent one and two item sets.
So basically what I want is a map of tags to maps of tags to events. So my_map[t1] gives a map of tags to which events they cooccur with t1 in (this should include t1, so it also includes all events t1 occurs in as a special case).
Further I want to do this in a way that can be expected to process huge amounts of data (certainly gigabytes) without falling over.
I was thinking about how to do the problem in a sensible way, and then I realised “Ah ha! I can use squish for this!”
In fact the solution ends up using squish twice. Here’s the code:
for(i = 1; i <= NF; i++)
for(j = 1; j <= NF; j++)
print $(i) " " $(j) ":" line_count
}' | sort | squish -d ":" -s ", " | sed 's/ /:/' | squish -d ":" -s "; " | sed 's/:/: /'
It takes input like this:
foo bar foo bar foo bar baz baz foo bar baz bif bing bloop blorp blorp
And outputs data like this:
bar: bar:2, 3, 4, 6; baz:4, 6; bif:6; bing:6; bloop:6; blorp:6; foo:3, 4, 6 baz: bar:4, 6; baz:4, 5, 6; bif:6; bing:6; bloop:6; blorp:6; foo:4, 6 bif: bar:6; baz:6; bif:6; bing:6; bloop:6; blorp:6; foo:6 bing: bar:6; baz:6; bif:6; bing:6; bloop:6; blorp:6; foo:6 bloop: bar:6; baz:6; bif:6; bing:6; bloop:6; blorp:6; foo:6 blorp: bar:6; baz:6; bif:6; bing:6; bloop:6; blorp:6, 7; foo:6 foo: bar:3, 4, 6; baz:4, 6; bif:6; bing:6; bloop:6; blorp:6; foo:1, 3, 4, 6
How does it work?
Well, we first use awk to transform it into data that looks like this:
foo foo:1 bar bar:2 foo foo:3 foo bar:3 bar foo:3
giving us cooccurring pairs followed by an event they cooccur in.
We then sort it to get all cooccurrences next to eachother, and pass it through squish using : as the delimiter and , as the separator, which gives us something that looks like this:
bar bar:2, 3, 4, 6 bar baz:4, 6 bar bif:6 bar bing:6 bar bloop:6
A bit of sed (not really necessary, but helps for formatting) changes the character after the first tag, and now we can apply squish again with the new delimiter and use it to join lines together with the separator as ;. We then reformat slightly to get the desired end result.
This wouldn't be terribly complicated to do without squish, but squish made it basically trivial, and because sort scales reasonable well to large data and squish processes only in a pipeline, this too should handle as much data as you want to throw at it.