David R. MacIver's Blog: A possibly new unix-style utility

A possibly new unix-style utility

11 October 2010

I’ve found a couple of times in the past that I have the following problem:

I’ve got a bunch of records of the form

key1: value1
key1: value2
key2: value3
...

And I want to transform them into the form

key1: value1, value2
key2: value3
...

That is, I want to combine consecutive lines that share the same initial key, joining their values together on one line.

It’s not a terribly hard problem, but doing it quickly and in a bounded amount of memory for very large volumes of data is at least non-trivial enough to require a bit of care.

In the past I generally solved this in some application-specific way, but recently I decided to do it properly and extracted a nice little command line utility from it. It accepts a sensible range of options for configuring behaviour and is reasonably fast: I haven’t benchmarked it very extensively, but on all the data I’ve tried it on it’s about an order of magnitude faster than sorting the data (which I tend to want to do before feeding it into squish anyway) and about half the speed of uniq. It runs in memory that will never grow past O(size of largest key).

I didn’t have a good name for it, so I picked a bad one. It’s called “squish” and is available from my data-tools repository (aka “Where I put random crap”). If you have a better name for it I’m all ears.

It’s possible that this duplicates functionality of something that already exists. Anyone know if it does?

Comments

Mike S. on 2010-10-12 14:19:50:

At my job, I had a database layout similar to this but with enough special cases that I couldn’t use an off-the-shelf solution (like PostgreSQL’s “crosstab” add-on) to fix it, so I did something similar to ‘squish’. Query the data and dump it to a file in a specific order, then write a command line program to read the file and assemble the results in the order I wanted.

Michael Chermside on 2010-10-12 14:59:11:

Out of curiosity: what’s the algorithm you use to provide good runtime and O(size-of-largest-key)?

david on 2010-10-12 15:15:17:

Pretty much the obvious one! Record the key with each line, making a note as you read it if it’s changed from the last one. If it has, start a new line and write it. If not, write a separator.
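For readers who want to see that approach spelled out, here is a minimal Python sketch of it. It is an illustration, not the actual squish source: the function name, the ": " and ", " separators, and the handling of lines without a separator are all assumptions. It also holds one whole input line in memory at a time, so its bound is O(longest line) rather than the O(size of largest key) claimed for the real tool.

```python
import io


def squish(lines, out, key_sep=": ", value_sep=", "):
    """Merge adjacent records that share a key, writing the result to `out`.

    Hypothetical sketch: names and default separators are assumptions.
    """
    previous_key = None  # None means no record has been written yet
    for line in lines:
        line = line.rstrip("\n")
        # partition() splits on the first separator only. If the separator
        # is missing, the whole line becomes the key and the value is "".
        key, _, value = line.partition(key_sep)
        if previous_key is not None and key == previous_key:
            # Same key as the previous line: continue the current output line.
            out.write(value_sep + value)
        else:
            # Key changed: finish the previous output line and start a new one.
            if previous_key is not None:
                out.write("\n")
            out.write(key + key_sep + value)
            previous_key = key
    if previous_key is not None:
        out.write("\n")  # terminate the final output line


# Usage on the example from the post.
records = io.StringIO("key1: value1\nkey1: value2\nkey2: value3\n")
result = io.StringIO()
squish(records, result)
print(result.getvalue(), end="")
```

Only the previous key is remembered between lines, which is why the output can be written as the input is read.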

Michael Chermside on 2010-10-12 21:49:26:

Oh, so you require that multiple records for the same key be together?

In other words, for this input:

Key1: Value1a
Key2: Value2
Key1: Value1b

I would expect the output:

Key1: Value1a, Value1b
Key2: Value2

or perhaps this:

Key2: Value2
Key1: Value1a, Value1b

but not this:

Key1: Value1a
Key2: Value2
Key1: Value1b

It sounds like perhaps I’d need to pipe my input file through “sort” beforehand to get the desired behavior. I guess you already mentioned that above, but I didn’t understand why. Thanks for the explanation.

david on 2010-10-13 10:05:25:

Oh, yes. The behaviour is like uniq in that regard: it only compares adjacent lines. This is sort of a feature, as it means there are better guarantees about order preservation, but I’ll admit I normally pipe it through sort before using it.
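The adjacent-only behaviour can be illustrated with Python’s itertools.groupby, which also groups only consecutive items. This is a rough sketch rather than how squish is implemented: the function name and separators are assumptions, and unlike the real tool it collects all the values for a key before emitting the line.

```python
from itertools import groupby


def squish_adjacent(lines, key_sep=": ", value_sep=", "):
    """Yield one merged line per run of adjacent lines sharing a key.

    Hypothetical helper for illustration; not part of squish itself.
    """
    # Each line becomes a (key, separator, value) tuple.
    split = (line.partition(key_sep) for line in lines)
    # groupby only merges consecutive items with equal keys, like uniq.
    for key, group in groupby(split, key=lambda parts: parts[0]):
        yield key + key_sep + value_sep.join(value for _, _, value in group)


records = ["Key1: Value1a", "Key2: Value2", "Key1: Value1b"]

# Without sorting, the two Key1 lines are not adjacent, so nothing is merged.
unsorted_result = list(squish_adjacent(records))

# Sorting whole lines first brings equal keys together.
sorted_result = list(squish_adjacent(sorted(records)))

print(unsorted_result)
print(sorted_result)
```

With the unsorted input the three lines come back unchanged; after sorting, the two Key1 records are merged into one line ahead of Key2.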

silentbicycle on 2011-01-13 16:59:38:

Nice! I’ve written this in awk a couple times, but haven’t ever felt the need to write it in C. AFAIK no similar, well-established tool already exists.