A possibly new unix-style utility

A couple of times in the past I’ve run into the following problem:

I’ve got a bunch of records of the form

key1: value1
key1: value2
key2: value3
...

And I want to transform them into the form

key1: value1, value2
key2: value3
...

That is, consecutive lines with the same initial key get combined, and their values are joined together on one line.

It’s not a terribly hard problem, but doing it quickly and in a bounded amount of memory for very large volumes of data is non-trivial enough to require a bit of care.
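To make the bounded-memory point concrete, here is a minimal Python sketch of one way to do it. This is not the actual utility: the function name, the `": "` key separator and the `", "` value separator are assumptions taken from the example records above. It writes each value as soon as it is read and remembers only the previous key.

```python
import io


def squish(lines, out, key_sep=": ", value_sep=", "):
    """Merge consecutive 'key: value' lines that share a key.

    Illustrative sketch only. Each value is written as soon as it is
    read and only the previous key is kept, so memory use does not grow
    with the number of records or the number of values per key.
    """
    prev_key = None
    for line in lines:
        # A line without the separator is treated as a key with an
        # empty value (an assumption made for this sketch).
        key, _, value = line.rstrip("\n").partition(key_sep)
        if key == prev_key:
            # Same key as the previous line: continue the current output line.
            out.write(value_sep + value)
        else:
            # New key: finish the previous output line (if any) and start another.
            if prev_key is not None:
                out.write("\n")
            out.write(key + key_sep + value)
            prev_key = key
    if prev_key is not None:
        out.write("\n")


demo = io.StringIO()
squish(["key1: value1\n", "key1: value2\n", "key2: value3\n"], demo)
print(demo.getvalue(), end="")
# key1: value1, value2
# key2: value3
```

Passing `sys.stdin` and `sys.stdout` instead of the lists and `StringIO` used here would turn the same function into a simple filter.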

In the past I generally solved this in some application-specific way, but recently I decided to do it properly and have extracted a nice little command line utility from it. It accepts a sensible range of options for configuring behaviour and is reasonably fast. I haven’t benchmarked it very extensively, but on all the data I’ve tried it on it’s about an order of magnitude faster than sorting the data (which I tend to want to do before feeding it into the tool anyway) and about half the speed of uniq. It runs in memory that will never grow past O(size of largest key).
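Because only consecutive lines are merged, sorting first matters whenever records for the same key are scattered through the input. The following Python sketch illustrates the difference. It is a stand-in for the real tool, with hypothetical names, and it uses `itertools.groupby`, which also only groups adjacent items. Unlike a streaming implementation, it builds each output line in memory before yielding it.

```python
from itertools import groupby


def split_record(line, key_sep=": "):
    """Split 'key: value' into (key, value); the separator is an assumption."""
    key, _, value = line.rstrip("\n").partition(key_sep)
    return key, value


def squish(lines, key_sep=": ", value_sep=", "):
    """Yield one merged line per run of consecutive records sharing a key."""
    records = (split_record(line, key_sep) for line in lines)
    for key, group in groupby(records, key=lambda record: record[0]):
        yield key + key_sep + value_sep.join(value for _, value in group)


lines = ["key1: value1a", "key2: value2", "key1: value1b"]

# Unsorted input: the two key1 records are not adjacent, so nothing is merged.
print(list(squish(lines)))
# ['key1: value1a', 'key2: value2', 'key1: value1b']

# Sorted input: equal keys become adjacent and are merged.
print(list(squish(sorted(lines))))
# ['key1: value1a, value1b', 'key2: value2']
```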

I didn’t have a good name for it, so I picked a bad one. It’s called “squish” and is available from my data-tools repository (aka “Where I put random crap”). If you have a better name for it I’m all ears.

It’s possible that this duplicates functionality of something that already exists. Anyone know if it does?


7 thoughts on “A possibly new unix-style utility”

  1. Mike S.

    This seems similar to a cross-tabulation in a database or spreadsheet. You have a table in the form
    id | name | category | value and you transpose it to:
    name | category1 | category2 | category3 | ….

    At my job, I had a database layout similar to this but with enough special cases that I couldn’t use an off-the-shelf solution (like PostgreSQL’s “crosstab” add-on) to fix it, so I did something similar to ‘squish’. Query the data and dump it to a file in a specific order, then write a command line program to read the file and assemble the results in the order I wanted.

    1. david Post author

      Pretty much the obvious one! Keep track of the key as you read each line, noting whether it has changed from the previous line’s key. If it has, start a new output line and write the key. If not, write a separator.

      1. Michael Chermside

        Oh, so you require that multiple records for the same key be together?

        In other words, for this input:

        Key1: Value1a
        Key2: Value2
        Key1: Value1b

        I would expect the output:

        Key1: Value1a, Value1b
        Key2: Value2

        or perhaps this:

        Key2: Value2
        Key1: Value1a, Value1b

        but not this:

        Key1: Value1a
        Key2: Value2
        Key1: Value1b

        It sounds like perhaps I’d need to pipe my input file through “sort” beforehand to get the desired behavior. I guess you already mentioned that above, but I didn’t understand why. Thanks for the explanation.

      2. david Post author

        Oh, yes. The behaviour is like uniq in that regard: it only compares adjacent lines. This is sort of a feature, as it means there are better guarantees about order preservation, but I’ll admit I normally pipe the data through sort before using it.

  2. silentbicycle

    Nice! I’ve written this in awk a couple times, but haven’t ever felt the need to write it in C. AFAIK no similar, well-established tool already exists.
