There’s a common sentiment that 99% of programmers don’t need to know how to build basic data structures, and that it’s stupid to expect them to.
There’s certainly an element of truth to that. Most jobs don’t require knowing how to implement any data structures at all, and a lot of this sentiment is just backlash against using them as part of the interview process. I agree with that backlash. Don’t use data structures as part of your interview process unless you expect the job to routinely involve writing your own data structures (or working on ones somebody has already written). Bad interviewer. No cookie.
But setting aside the interview question, there is still a strong underlying sentiment of this not actually being that useful a thing to spend your time on. After all, you wouldn’t ever implement a hash table when there’s a great one in the standard library, right?
This is like arguing that you don’t need to learn to cook because you can just go out to restaurants.
A second, related, point of view is that if you needed to know how this worked you’d just look it up.
That is, you don’t need to learn how to invent your own recipes because you can just look it up in a cook book.
In principle both of these arguments are fine. There are restaurants, there are cook books, not everybody needs to know how to cook and they certainly don’t need to become a gourmet chef.
But nevertheless, most people’s lives will be improved by acquiring at least a basic facility in the kitchen. Restaurants are expensive and may be inconvenient. You run out of ingredients and can’t be bothered to go to the store so you need to improvise or substitute. Or you’re just feeling creative and want to try something new for the hell of it.
The analogy breaks down a bit, because everybody needs to eat but most people don’t really need to implement custom data structures. It’s not 99%, but it might be 90%. Certainly it’s more than 50%.
But “most” isn’t “all”, and there’s a lot of grey areas at the boundary. If you’re not already sure you need to know this, you can probably get on fine without learning how to implement your own data structures, but you might find it surprisingly useful to do so anyway. Even if you don’t, there are some indirect benefits.
I’m not using this analogy just to make a point about usefulness, I also think it’s a valuable way of looking at it: Data structures are recipes. You have a set of techniques and tools and features, and you put them together in an appropriate way to achieve the result you want.
I think a lot of the problem is that data structures are not usually taught this way. I may be wrong about this – I’ve never formally taken a data structures course because my academic background is maths, not computer science, but it sure doesn’t look like people are being taught this way based on the books I’ve read and the people I’ve talked to.
Instead people are taught “Here’s how you implement an AVL tree. It’s very clever” (confession: I have no idea how you implement an AVL tree. If I needed to know I’d look it up, right?). It’s as if you were going to cookery school and they were taking you through a series of pages from the recipe book and teaching you how to follow them.
Which is not all bad! Learning some recipes is a great way to learn to cook. But some of that is because you already know how to eat food, so you’ve got a good idea what you’re trying to achieve. It’s also not sufficient in its own right – you need to learn to adapt, to combine the things you’ve already seen and apply the basic skills you’ve learned to solve new constraints or achieve new results.
Which is how I would like data structures to be taught. Not “Here is how to implement this named data structure” but “Here is the set of operations I would like to support, with these sorts of complexities as valid. Give me some ideas.”
Because this is the real use of learning to implement data structures: Sometimes the problem you’re given doesn’t match the set of data structures you have in the standard library or any of the standard ones. Maybe you need to support some silly combination of operations that you don’t normally do, or you have an unusual workload where some operations are very uncommon and so you don’t mind paying some extra cost there but some operations are very common so need to be ultra fast.
At that point, knowing the basic skills of data structure design becomes invaluable, because you can take what you’ve learned and put it together in a novel way that supports what you want.
And with that, I’ll finish by teaching you a little bit about data structures.
First lets start with a simple problem: Given a list of N items, I want to sample from them without replacement. How would I do that with an O(N) initialisation and O(1) sample?
Well, it’s easy: You create a copy of the list as an array. Now when you want to sample, you pick an index into the array at random.
Now that you have that index that gives you the value to return. Replace the value at that index with the value that’s at the end of the array, and reduce the array length by one.
Here’s some python:
def sample(ls, random): i = random.randint(0, len(ls) - 1) result = ls[i] ls[i] = ls[-1] ls.pop() return result |
Now I’ve given you a recipe to build on, lets see you improve upon it!
- If you assume the list you are given is immutable and you can hang onto it, can you improve the initialisation to O(1)? (you may need to make sampling only O(1) amortised and/or expected time to do this. Feel free to build on other standard data structures rather than inventing them from scratch).
- How would I extend that data structure to also support a “Remove the smallest element” operation in O(log(n))? (You may wish to read about how binary heaps work). You’ll probably have to go back to O(n) initialisation, but can you avoid that if you assume the input list is already sorted?
- How would you create a data structure to support weighted sampling with rejection? i.e. you start with a list of pairs of values and weights, and each value is sampled with probability proportionate to its weight. You may need to make sample O(log(n)) to do this (you can do it in expected O(1) time, but I don’t know of a data structure that does so without quite a lot of complexity). You can assume the weights are integers and/or just ignore questions of numerical stability.
- How would add an operation to give a key selected uniformly at random to a hash table? (If you haven’t read about how pypy dicts work you may wish to read that first)
- How would you extend a hash table to add an O(log(n)) “remove and return the smallest key” operation with no additional storage but increasing the insert complexity to O(log(n))? Can you do it without adding any extra storage to the hash table?
These aren’t completely arbitrary examples. Some of them are ones I’ve actually needed recently, others are just applications of the tricks I figured out in the course of doing so. I do recommend working through them in order though, because each will give you hints for how to do later ones.
You may never need any of these combinations, but that doesn’t matter. The point is not that these represent some great innovations in data structures. The point is to learn how to make your own data structures so that when you need to you can.
If you want to learn more, I recommend just playing around with this yourself. Try to come up with odd problems to solve that could be solved with a good data structure. It can also be worth learning about existing ones – e.g. reading about how the standard library in your favourite language implements things. What are the tricks and variations that it uses?
If you’d like to take a more formal course that is structured like this, I’m told Tim Roughgarden’s Coursera specialization on algorithms follows this model, and the second course in it will cover the basics of data structures. I’ve never taken it though, so this is a second hand recommendation. (Thanks @pozorvlak for the recommendation).
(And if you want to learn more things like this by reading more about it from me, support me on Patreon and say so! Nine out of ten cats prefer it, and you’ll get access to drafts of upcoming blog posts)