Edit for warning: This problem has proven substantially harder than I intended it to be. The spec keeps turning out to be ambiguous and needing patching and there are enough nasty edge cases that basically no one gets everything right (indeed, my reference implementation has just been pointed out to contain a bug). I’m going to leave it as is for the challenge but be warned.
This isn’t a question I’ve ever used to interview people, but it’s not dissimilar from the coding test we’ve got very good results from at Aframe (the problem is not at all similar, but the setup is fairly). I’ve been pondering this as an alternative, and I thought it might interesting to share it with people. I’ll explain what it’s designed to test in a later post.
If you want to answer it, I’m happy grade your answer, but there’s no exciting reward for doing so other than public or private recognition of a job well done. Details on this after the question.
You are required to write a command line program which reads lines which are terminated by ‘\n’ or EOF from its STDIN until it gets an end of file and then writes a stably sorted version of them to STDOUT. Where a line is terminated by EOF it should have an implicit ‘\n’ inserted after it.
It is perfectly OK to use library sort functions for this. Additionally, ability to handle large numbers of lines is not required – the only performance requirement is to sort 500 lines of less than 1000 bytes each in less than 10 seconds (this should not be in any way onerous). You can write an external sort if you want, and you’ll get extra kudos for it, but it is in no way required for a correct answer. The task is to implement the comparator used for the sorting, not to implement a sorting algorithm.
The comparator to be implemented is as follows:
Each line is to be considered to be an arbitrary sequence of non-‘\n’ bytes, which will be interpreted as a sequence of non-overlapping tokens. A token is EITHER:
- a sequence of ascii letter characters a-zA-Z
- a numeric string. This is a decimal representation of a number, containing 1 or more digits 0-9, possibly with a leading – sign, possibly with a single decimal point (.) followed by additional digits. The decimal point must come after at least one digit. No other characters are permitted. e.g. -5 and 5.0 are valid numeric tokens but +5, .5 and 5. are not (though they all contain a valid numeric token: 5)
Non-ascii or ascii but non-alphanumeric bytes may be present in the line, but must not be considered to be part of a token.
Each token should be the longest possible token it can be, with ambiguity resolved in favour of making the leftmost token longer. So for example the line “foobar” is the token “foobar”, not the tokens “foo” and “bar”, and “3.14.5” is “3.14”, “5”
Note also that lines may contain characters other than those permitted in tokens, and that tokens are not necessarily separated by whitespace:
should be tokenized as
-10.11, foo, 10, kittens
Lines should then be compared lexicographically as their list of tokens (as usual, if one is a prefix of the other then the shorter one comes first), with individual tokens being compared as follows:
- Two numeric tokens should be compared as their numeric value when interpreted as a decimal representation
- Any numeric token is less than any non numeric token
- Non-numeric tokens should be compared lexicographically by single character, with characters compared case insensitively. Case insensitivity should be performed as in English with ‘a’ corresponding to ‘A’, ‘b’ to ‘B’, etc.
If you want me to grade your answer, email it to me at [email protected] If it’s not obvious how to run it, please include instructions. I will need to be able to run it on a linux VM.
Your email should include:
- The source for your solution, either attached or as a link
- Whether you want me to publish your name and grade (you don’t get to change your mind after you’ve been graded)
- If so, how you want you want to be cited (pseudonym, full name, etc. I’m also happy to include a link
- Whether you’re OK with me using your source code as an example in follow up posts (I will assume you are unless you explicitly say you are not)
- Roughly how long you spent on the problem (I won’t publish this except in aggregate, it’s mostly just for my information)
I will then grade your submission as soon as I feasibly can (I don’t expect to be overwhelmed by submissions, and most of the grading will be automated, so hopefully this shouldn’t take too long). I will reply with your grade. You can at this point (assuming you didn’t get everything right. If you did, well done!) ask for examples of where you went wrong or submit an amended solution. We can keep iterating this process until one of us gets bored (I’m unlikely to accept more than 3 solutions from the same person). If you submit multiple solutions I’ll include your grades for all of them if you’ve asked for your name to be published.
Grading is as follows:
- You have suffered from a serious failure to read the spec
- You got some of it right, but there are significant omissions
- You have mostly got it right, but you missed some edge cases
- You have passed every test case I can throw at it
- And you implemented an external sort or otherwise did something clever. Go you!
(Note that there are no rewards for cleverness if you haven’t got the basic problem right. Such is life)
Hall of fame
(In order of solutions coming in)
- Dave Stark with a grade of B, A. Bonus kudos also for the fact that his solution uncovered a bug in my reference solution
- Alexander Azarov with a grade of B. Also his first solution uncovered some ambiguities in my spec
- Kat (a different one than I’ve referenced here before). B on her first try followed by an A (she got all the hard edge cases right the first time but was tripped over by an annoying one). Also kudos for pointing out a lot of spec ambiguities.
- Eiríkr Åsheim too gets kudos for the first solution which got everything right first time. Also for rolling his own IO code and finite state machine.
Feel free to point out problems in the spec. Note that a lack of detail is not a problem, but ambiguity is.
Do not post solutions in the comments. I will delete your comment. I will also delete or edit any comment I think gives too much away.
Also, comments of the form “What a stupidly easy test” will be deleted unless you have submitted a solution that was graded A.
A minor bit of underspecification: “a numeric string, which may include a sign and may include a decimal point” doesn’t specify whether leading digits are required before a decimal point and whether “+” is an allowed sign. Are e.g. “.52”, “-.52” and “+0.52” valid numeric strings? Also, whether decimal digits are required after a decimal point (is “12.” a valid numeric string?) and whether the sign must necessarily appear in the first position (is “0.52-” a valid numeric string)? These may seem pedantic, but I have encountered both in the wild.
Excellent questions both! I will update the specification.
Although on further thought I note that both a leading + sign and a trailing . shouldn’t change the answer regardless of which interpretation you pick. That makes me tempted to leave them out. Opinion?
Ah indeed, it doesn’t change the answer here. Leaving it out seems fine: if anyone wonders, the answer is here :)
The input consists only of bytes in the range 32-126 and \n, so I don’t have to worry about Unicode or anything like that?
(A good question which I considered to be too much of a spoiler for one of the tricky bits so I removed it – DRMacIver)
The current specification completely states what range of implementation is permissible. What isn’t forbidden is permitted.
(Actually on reread I guess there are some implicit assumptions on encoding. I will make those explicit)
What do you mean by an “offline sort”? I’m not familiar with the term. The closest I can find after some googling is Wikipedia on “online algorithms“, which suggests that you mean sorting the list in one call, rather than inserting each row into a sorted list as they are encountered using insertion sort. Is there much difference in implementation difficulty between the two approaches? In C# I think the only difference would be using repeated calls to “SortedList.Add” for the “online” sort v.s. one call to “List.Sort” for the “offline” sort. The rest of the code feels like it would be identical. I must be misunderstanding something?
“Offline sort” means “Sort where you don’t have to load the entire contents into memory”. A typical strategy for this is to use some sort of merge sort variation where you store the runs on disk, as you can merge two sorted files without loading more than two lines at a time into memory. This is implemented e.g. in the standard unix sorting program (try feeding it a couple GB of data and watch its memory usage not grow that large).
Wikipedia’s external sorting page explains it pretty well without using the term. I’m now puzzled – I swear I’ve heard this called an offline sort many times, but you’re right that the internet doesn’t seem to agree.
I see — it has nothing to do with the definition I found! That makes more sense: it would be much harder as you’ll need to start mucking with temp files and so on. Thanks
Each token should be the longest possible token it can be.
What if this requirement clashes with the requirement that tokens be disjoint?
These are obviously two tokens, but are they 1.23 and .45, or 1. and 23.45, or even 1.2 and 3.45?
Also, are there any constraints to language used? Do you want something compiled or interpreted?
Sorry, I meant 1.234.56: is this 1.23 and 4.56, or 1.2 and 34.56?