David R. MacIver's Blog: How and why to learn about data structures

How and why to learn about data structures

22 March 2017

There’s a common sentiment that 99% of programmers don’t need to know how to build basic data structures, and that it’s stupid to expect them to.

There’s certainly an element of truth to that. Most jobs don’t require knowing how to implement any data structures at all, and a lot of this sentiment is just backlash against using them as part of the interview process. I agree with that backlash. Don’t use data structures as part of your interview process unless you expect the job to routinely involve writing your own data structures (or working on ones somebody has already written). Bad interviewer. No cookie.

But setting aside the interview question, there is still a strong underlying sentiment of this not actually being that useful a thing to spend your time on. After all, you wouldn’t ever implement a hash table when there’s a great one in the standard library, right?

This is like arguing that you don’t need to learn to cook because you can just go out to restaurants.

A second, related, point of view is that if you needed to know how this worked you’d just look it up.

That is, you don’t need to learn how to invent your own recipes because you can just look it up in a cook book.

In principle both of these arguments are fine. There are restaurants, there are cook books, not everybody needs to know how to cook and they certainly don’t need to become a gourmet chef.

But nevertheless, most people’s lives will be improved by acquiring at least a basic facility in the kitchen. Restaurants are expensive and may be inconvenient. You run out of ingredients and can’t be bothered to go to the store so you need to improvise or substitute. Or you’re just feeling creative and want to try something new for the hell of it.

The analogy breaks down a bit, because everybody needs to eat but most people don’t really need to implement custom data structures. It’s not 99%, but it might be 90%. Certainly it’s more than 50%.

But “most” isn’t “all”, and there’s a lot of grey areas at the boundary. If you’re not already sure you need to know this, you can probably get on fine without learning how to implement your own data structures, but you might find it surprisingly useful to do so anyway. Even if you don’t, there are some indirect benefits.

I’m not using this analogy just to make a point about usefulness, I also think it’s a valuable way of looking at it: Data structures are recipes. You have a set of techniques and tools and features, and you put them together in an appropriate way to achieve the result you want.

I think a lot of the problem is that data structures are not usually taught this way. I may be wrong about this - I’ve never formally taken a data structures course because my academic background is maths, not computer science, but it sure doesn’t look like people are being taught this way based on the books I’ve read and the people I’ve talked to.

Instead people are taught “Here’s how you implement an AVL tree. It’s very clever” (confession: I have no idea how you implement an AVL tree. If I needed to know I’d look it up, right?). It’s as if you were going to cookery school and they were taking you through a series of pages from the recipe book and teaching you how to follow them.

Which is not all bad! Learning some recipes is a great way to learn to cook. But some of that is because you already know how to eat food, so you’ve got a good idea what you’re trying to achieve. It’s also not sufficient in its own right - you need to learn to adapt, to combine the things you’ve already seen and apply the basic skills you’ve learned to solve new constraints or achieve new results.

Which is how I would like data structures to be taught. Not “Here is how to implement this named data structure” but “Here is the set of operations I would like to support, with these sorts of complexities as valid. Give me some ideas.”

Because this is the real use of learning to implement data structures: Sometimes the problem you’re given doesn’t match the set of data structures you have in the standard library or any of the standard ones. Maybe you need to support some silly combination of operations that you don’t normally do, or you have an unusual workload where some operations are very uncommon and so you don’t mind paying some extra cost there but some operations are very common so need to be ultra fast.

At that point, knowing the basic skills of data structure design becomes invaluable, because you can take what you’ve learned and put it together in a novel way that supports what you want.

And with that, I’ll finish by teaching you a little bit about data structures.

First lets start with a simple problem: Given a list of N items, I want to sample from them without replacement. How would I do that with an O(N) initialisation and O(1) sample?

Well, it’s easy: You create a copy of the list as an array. Now when you want to sample, you pick an index into the array at random.

Now that you have that index that gives you the value to return. Replace the value at that index with the value that’s at the end of the array, and reduce the array length by one.

Here’s some python:

def sample(ls, random):
    i = random.randint(0, len(ls) - 1)
    result = ls[i]
    ls[i] = ls[-1]
    ls.pop()
    return result

Now I’ve given you a recipe to build on, lets see you improve upon it!

If you assume the list you are given is immutable and you can hang onto it, can you improve the initialisation to O(1)? (you may need to make sampling only O(1) amortised and/or expected time to do this. Feel free to build on other standard data structures rather than inventing them from scratch).
How would I extend that data structure to also support a “Remove the smallest element” operation in O(log(n))? (You may wish to read about how binary heaps work). You’ll probably have to go back to O(n) initialisation, but can you avoid that if you assume the input list is already sorted?
How would you create a data structure to support weighted sampling with rejection? i.e. you start with a list of pairs of values and weights, and each value is sampled with probability proportionate to its weight. You may need to make sample O(log(n)) to do this (you can do it in expected O(1) time, but I don’t know of a data structure that does so without quite a lot of complexity). You can assume the weights are integers and/or just ignore questions of numerical stability.
How would add an operation to give a key selected uniformly at random to a hash table? (If you haven’t read about how pypy dicts work you may wish to read that first)
How would you extend a hash table to add an O(log(n)) “remove and return the smallest key” operation with no additional storage but increasing the insert complexity to O(log(n))? Can you do it without adding any extra storage to the hash table?

These aren’t completely arbitrary examples. Some of them are ones I’ve actually needed recently, others are just applications of the tricks I figured out in the course of doing so. I do recommend working through them in order though, because each will give you hints for how to do later ones.

You may never need any of these combinations, but that doesn’t matter. The point is not that these represent some great innovations in data structures. The point is to learn how to make your own data structures so that when you need to you can.

If you want to learn more, I recommend just playing around with this yourself. Try to come up with odd problems to solve that could be solved with a good data structure. It can also be worth learning about existing ones - e.g. reading about how the standard library in your favourite language implements things. What are the tricks and variations that it uses?

If you’d like to take a more formal course that is structured like this, I’m told Tim Roughgarden’s Coursera specialization on algorithms follows this model, and the second course in it will cover the basics of data structures. I’ve never taken it though, so this is a second hand recommendation. (Thanks @pozorvlak for the recommendation).

(And if you want to learn more things like this by reading more about it from me, support me on Patreon and say so! Nine out of ten cats prefer it, and you’ll get access to drafts of upcoming blog posts)

Comments

Alaric Snell-Pym on 2017-03-22 14:34:06:

An area where this keeps cropping up in my life is making custom queues.

I tend to work a lot on systems that deal with lots of things to do - updates to a database, network packets to send, HTTP requests to service, that sort of thing. Due to these things to do arriving whenever they feel like it, but our hardware and software only being able to service requests at a certain rate, it’s standard practice to have a queue in between; new requests go in, idle workers pull requests out and do them. The queue acts as a buffer so we can accept a burst of requests and get around to them as soon as we can. Insert and removal can’t be O(N) or the system gets increasingly swamped as the queue overloads; O(1) is ideal, but O(log N) is acceptable. Off-the-shelf FIFOs (modulo some interesting details about multithreaded access, or how to recover from a hard reboot of the server, and what to do if a worker fails while processing a request, aside) tend to do the trick.

But we often start to want to do clever things with that queue. Prioritise “urgent” requests. Allocate requests to users and give each user a guaranteed fraction of the system capacity. Express dependencies between requests (“X must finish before Y can start”). Mark requests as “droppable”, so if the system is overloaded they can just be thrown away so the system focusses on more important work (this is related to priority, which basically means the request can be delayed until there’s no more important work; a droppable request goes a step further and relieves the system from the burden of storing the request somewhere). Requests that supercede other requests (if we have a request to turn the light on already in the queue, and a request comes in to turn the light OFF before we’ve gotten around to turning the light ON, we can probably just drop both requests (if the light was off to begin with)).

At which point, off-the-shelf FIFOs start to suffer a bit. I’ve tended to end up having a pool of requests with multiple “index” data structures pointing into it to cope with different access requirements while avoiding unacceptable O(N) behaviour - a heap organising them in priority order, plus another heap putting them in drop-prioirity order but only containing droppable requests (so we can find the most droppable request easily), a hashmap on the thing the request affects (so multiple requests affecting the same thing can be merged into one according to the rules for superceding), a hashmap on request ID (so if X must finish before Y can start, X can go in the normal heap while Y doesn’t as it’s not “runnable” yet; but we put Y in a linked list of dependents hanging off of X, and have the contents of that list be tipped into the heap when X completes), and so on.

Maintaining all these parallel indexes correctly is a pain, particularly when there aren’t simple invariants like “every request must be in every index structure”, and I often wonder if there might be data structures that can implement several of the access patterns at once, to at least reduce the number of parallel structures required... So I think it’s far from true that a “standard set” of data structures from a standard library is all we need. There’s plenty of interesting applications for more complicated access patterns and complexity requirements!