How and why to learn about data structures

There’s a common sentiment that 99% of programmers don’t need to know how to build basic data structures, and that it’s stupid to expect them to.

There’s certainly an element of truth to that. Most jobs don’t require knowing how to implement any data structures at all, and a lot of this sentiment is just backlash against using them as part of the interview process. I agree with that backlash. Don’t use data structures as part of your interview process unless you expect the job to routinely involve writing your own data structures (or working on ones somebody has already written). Bad interviewer. No cookie.

But setting aside the interview question, there is still a strong underlying sentiment of this not actually being that useful a thing to spend your time on. After all, you wouldn’t ever implement a hash table when there’s a great one in the standard library, right?

This is like arguing that you don’t need to learn to cook because you can just go out to restaurants.

A second, related, point of view is that if you needed to know how this worked you’d just look it up.

That is, you don’t need to learn how to invent your own recipes because you can just look it up in a cook book.

In principle both of these arguments are fine. There are restaurants, there are cook books, not everybody needs to know how to cook and they certainly don’t need to become a gourmet chef.

But nevertheless, most people’s lives will be improved by acquiring at least a basic facility in the kitchen. Restaurants are expensive and may be inconvenient. You run out of ingredients and can’t be bothered to go to the store so you need to improvise or substitute. Or you’re just feeling creative and want to try something new for the hell of it.

The analogy breaks down a bit, because everybody needs to eat but most people don’t really need to implement custom data structures. It’s not 99%, but it might be 90%. Certainly it’s more than 50%.

But “most” isn’t “all”, and there’s a lot of grey areas at the boundary. If you’re not already sure you need to know this, you can probably get on fine without learning how to implement your own data structures, but you might find it surprisingly useful to do so anyway. Even if you don’t, there are some indirect benefits.

I’m not using this analogy just to make a point about usefulness, I also think it’s a valuable way of looking at it: Data structures are recipes. You have a set of techniques and tools and features, and you put them together in an appropriate way to achieve the result you want.

I think a lot of the problem is that data structures are not usually taught this way. I may be wrong about this – I’ve never formally taken a data structures course because my academic background is maths, not computer science, but it sure doesn’t look like people are being taught this way based on the books I’ve read and the people I’ve talked to.

Instead people are taught “Here’s how you implement an AVL tree. It’s very clever” (confession: I have no idea how you implement an AVL tree. If I needed to know I’d look it up, right?). It’s as if you were going to cookery school and they were taking you through a series of pages from the recipe book and teaching you how to follow them.

Which is not all bad! Learning some recipes is a great way to learn to cook. But some of that is because you already know how to eat food, so you’ve got a good idea what you’re trying to achieve. It’s also not sufficient in its own right – you need to learn to adapt, to combine the things you’ve already seen and apply the basic skills you’ve learned to solve new constraints or achieve new results.

Which is how I would like data structures to be taught. Not “Here is how to implement this named data structure” but “Here is the set of operations I would like to support, with these sorts of complexities as valid. Give me some ideas.”

Because this is the real use of learning to implement data structures: Sometimes the problem you’re given doesn’t match the set of data structures you have in the standard library or any of the standard ones. Maybe you need to support some silly combination of operations that you don’t normally do, or you have an unusual workload where some operations are very uncommon and so you don’t mind paying some extra cost there but some operations are very common so need to be ultra fast.

At that point, knowing the basic skills of data structure design becomes invaluable, because you can take what you’ve learned and put it together in a novel way that supports what you want.

And with that, I’ll finish by teaching you a little bit about data structures.

First lets start with a simple problem: Given a list of N items, I want to sample from them without replacement. How would I do that with an O(N) initialisation and O(1) sample?

Well, it’s easy: You create a copy of the list as an array. Now when you want to sample, you pick an index into the array at random.

Now that you have that index that gives you the value to return. Replace the value at that index with the value that’s at the end of the array, and reduce the array length by one.

Here’s some python:

def sample(ls, random): i = random.randint(0, len(ls) - 1) result = ls[i] ls[i] = ls[-1] ls.pop() return result

Now I’ve given you a recipe to build on, lets see you improve upon it!

1. If you assume the list you are given is immutable and you can hang onto it, can you improve the initialisation to O(1)? (you may need to make sampling only O(1) amortised and/or expected time to do this. Feel free to build on other standard data structures rather than inventing them from scratch).
2. How would I extend that data structure to also support a “Remove the smallest element” operation in O(log(n))? (You may wish to read about how binary heaps work). You’ll probably have to go back to O(n) initialisation, but can you avoid that if you assume the input list is already sorted?
3. How would you create a data structure to support weighted sampling with rejection? i.e. you start with a list of pairs of values and weights, and each value is sampled with probability proportionate to its weight. You may need to make sample O(log(n)) to do this (you can do it in expected O(1) time, but I don’t know of a data structure that does so without quite a lot of complexity). You can assume the weights are integers and/or just ignore questions of numerical stability.
4. How would add an operation to give a key selected uniformly at random to a hash table? (If you haven’t read about how pypy dicts work you may wish to read that first)
5. How would you extend a hash table to add an O(log(n)) “remove and return the smallest key” operation with no additional storage but increasing the insert complexity to O(log(n))? Can you do it without adding any extra storage to the hash table?

These aren’t completely arbitrary examples. Some of them are ones I’ve actually needed recently, others are just applications of the tricks I figured out in the course of doing so. I do recommend working through them in order though, because each will give you hints for how to do later ones.

You may never need any of these combinations, but that doesn’t matter. The point is not that these represent some great innovations in data structures. The point is to learn how to make your own data structures so that when you need to you can.

If you want to learn more, I recommend just playing around with this yourself. Try to come up with odd problems to solve that could be solved with a good data structure. It can also be worth learning about existing ones – e.g. reading about how the standard library in your favourite language implements things. What are the tricks and variations that it uses?

If you’d like to take a more formal course that is structured like this, I’m told Tim Roughgarden’s Coursera specialization on algorithms follows this model, and the second course in it will cover the basics of data structures. I’ve never taken it though, so this is a second hand recommendation. (Thanks @pozorvlak for the recommendation).

(And if you want to learn more things like this by reading more about it from me, support me on Patreon and say so! Nine out of ten cats prefer it, and you’ll get access to drafts of upcoming blog posts)

This entry was posted in programming, Python on by .

Looking into doing a PhD

As regular readers of this blog have probably figured out, I’m a researchy sort of person.

A lot of my hobbies – maths, voting theory, weird corners of programming, etc – are research oriented, and most of my work has had some sort of research slant to it.

The last two years I’ve basically been engaged in a research project working on Hypothesis. It’s come quite far in that time, and I feel reasonably comfortable saying that it’s the best open source property based testing library on most metrics you’d care to choose. It has a number of novel features and implementation details that advance the state of the art.

It’s been pretty great working on Hypothesis like this, but it’s also been incredibly frustrating.

The big problem is that I do not have an academic background. I have a masters in mathematics (more technically I have a BA, an MA, and a CASM. Cambridge is weird. It’s entirely equivalent to a masters in mathematics though), but that’s where I stopped. Although it says “DR” in my online handle and the domain of this blog, those are just my initials and not my qualification.

As a result, I have little to no formal training or experience in doing academic research, and a similarly low understanding of who’s who and what’s what within the relevant fields. So I’ve been reading papers and trying to figure out the right people to talk to all on my own, and while it’s gone OK it’s still felt like fumbling around in the dark.

Which leads to the obvious solution that I spoilered in the title: If the problem is that I’m trying to do research outside of an academic context, the solution is to do research in an academic context.

So I’d like to do a PhD that is either about Hypothesis, or about something close enough to Hypothesis that each can benefit from the other.

There’s probably enough novel work in Hypothesis already that I could “just” clean it up, factor it out, and turn it into a PhD thesis as it is, but I’m not really expecting to do that (though I’d like that to be part of it). There are a number of additional directions that I think it would be worth exploring, and I expect most PhD funding will come with a focus subject attached which I would be happy to adapt to (a lot of the most interesting innovations in Hypothesis came because some external factor forced me to think about things in ways I wouldn’t otherwise have!). If you’d like to know more, I’ve written up a fairly long article about Hypothesis and why I think it’s interesting research on the main Hypothesis site.

Which, finally, brings me to the main point of the post: What I want from you.

I’m already looking into and approaching potential universities and interesting researchers there who might be good supervisors or able to recommend people who are. I’ve been in touch with a couple (some of whom might be reading this post. Hi), but I would also massively appreciate suggestions and introductions.

So, if you work in relevant areas or know of people who do and think it would be useful for me to talk to, please drop me an email at [email protected]. Or just leave a comment on this blog post, tweet at me, etc.

This entry was posted in Hypothesis, life, programming, Python on by .

Laziness is better when it’s visible

This is a trick I invented a while ago. I’m morally certain it’s a reinvention rather than an invention, but I’ve not really seen it in use and at the very least it doesn’t seem to be widely known. I recently ran into a situation where a library would benefit from it greatly and doesn’t use it, so I thought I would write it up.

Suppose we have a bunch of strings and we want to concatenate them.

def cat(my_strings): result = '' for s in my_strings: result += s return result

That’s how you concatenate strings, right?

You can fix this using a special string buffer type, or in python by just using ”.join(my_strings), but wouldn’t it be nice if you didn’t have to? It’s often very convenient to build things up using expressions. Although it’s no great hardship for strings to do bulk operations, you run into the same problem in e.g. pulp where you have more complex expressions (and no corresponding sum method in the library). It would be great if this all just worked.

One way to do this sort of thing is to switch to an immutable tree based representation like a rope where the concatenation operation has a more reasonable complexity (usually log(n)).

But that then comes with its own costs. Using a tree structure slows down access and iteration – only by an O(log(n)) factor, but with relatively high constants. As a result you end up paying a relatively high performance penalty for what was mostly just wanted as a convenience (ropes do have other advantages, but that’s not what we’re looking at here).

But what if you didn’t have to do any of that? What if you could get O(1) concatenation and all the benefits of the compact representation? It sounds implausible, but I’m here to tell you you can!

Or, rather, that you almost can. Both of the above are true: You do get O(1) concatenation and you do get all the benefits of the compact representation, but you do end up paying some additional cost, because the idea is to use laziness. So although concatenation is O(1) you end up paying an additional O(n) cost the first time you want to use the result. Fortunately this still avoids the problem of sums being quadratic.

The key idea is to use a data structure that can swap out its implementation on demand. It starts by just storing the abstract expression tree that lead to it, and then switches to a more efficient representation as soon as you need it to.

e.g. the version of this for the above is that a string becomes a binary tree where the leaves are buffers and the branches indicate that the string is a concatenation of its left and right parts. Concatenation is then just creating a new split node, which is O(1).

Then, once we want the compact representation (which will tend to be as soon as we start doing interesting operations on the data – because the expression tree is entirely unnormalized there is very little we can usefully do to it that isn’t an O(n) operation!), we calculate that, store the result on the string and throw away the expression data that brought us here.

That is, as soon as we have forced the string, the string’s switches to a new representation using the forced buffer, essentially replacing the split node with a leaf node.

This feels like we’re back where we started – if you’re doing this lazily like that then you’re just summing together two string children so you’re quadratic again – but we’re not, for one very important reason: Because the implementation of the laziness is under our control, we can tell whether a string has already been forced or not. When forcing a node we then don’t force its child nodes, but instead just walk the tree and behave appropriately when we get to the leaves.

This sort of thing can be very useful, because the common cases where this runs into issues is that you have a complex expression graph and only actually care about a very small fraction of the subexpressions (e.g. in the sum case).

This isn’t always a win, in that it does behave suboptimally under some workloads (e.g. when you do care about a lot of the intermediate results but process them in the reverse of the order you created them), but it’s rarely a substantial loss and usually produces dramatic speedups by converting accidentally quadratic cases into the desired linear behaviour.

There are additional tricks you can build on top of this:

• You can precompute some data so you don’t always have to force the structure. e.g. you can always calculate the length of the string in the above example without forcing it and still have the operations be O(1)
• you can sometimes have operations that only require partially forcing the data structure (e.g. if you index into a string you might only have to force one half of it (or neither if the index is out of bounds!)
• If you have more complex operations then you can do a sort of “query optimization” to rewrite the expression tree into a more efficient execution plan. For example, a thing I’ve done in the past is when the operation is intersection you can rewrite it so that intersections are processed in order of increasing size, which often ends up with you being able to terminate early because you’ve discovered that the end result is going to be empty regardless of what happens next.

Depending on circumstances, any of the above can be worth doing, but most of the big wins come from the basic trick which is almost always a valuable one if you’re running into this sort of problem.

Using this technique won’t always be an improvement – e.g. you’ll end up doing some duplicated work if you do something like x + x because it will effectively recompute forcing x twice – but most of the work loads on which it will behave particularly badly are ones that you should probably have avoided anyway with the normal approach. The only real downsides where you do suffer a hit from using this is that the laziness adds an additional check to each operation, which can be anywhere between a small and modest performance hit depending on how expensive the operation normally is. Additionally, if you want the operations to be thread safe then you’ll need a memory barrier of some sort (e.g. making the relevant field volatile) to get the visibility right, which adds another small hit.

So it’s not a universal win, but the cost is light enough and there are enough work loads where it improves behaviour substantially that it is often worth considering.

To finish off, and make this more concrete, here’s some Python code implementing this idea for strings:

(Like this post? Want to see more like it? Why not support my Patreon! You’ll get to see drafts of upcoming posts and also increase the amount I write)

This entry was posted in programming, Python on by .

Rigging elections with integer linear programming

No, this isn’t a post about politics, sorry, it’s just a post about voting theory.

As you might have noticed (and if not, this is the post announcing it), I have a book out! It’s called Voting by Example and is about the complexities of voting and why different voting systems might be interesting. You should go buy it.

But this isn’t just an ad post for the book, it’s about something else: How I came up with the examples in this book.

At its core is the following example. We have a student election where there are four candidates are running for class president. Each student casts a vote ranking the four candidates in order of most to least preferred.

• 32% of students voted Alex, Kim, Charlie, Pat
• 27% of students voted Charlie, Pat, Kim, Alex
• 25% of students voted Pat, Charlie, Kim, Alex
• 16% of students voted Kim, Pat, Alex, Charlie

The point of this example is that each of four different ranked voting systems give a different answer:

It’s also constructed to have a minimal number of voting blocs, though it’s minimum amongst a slightly more specific set of elections than just those satisfying the above conditions.

Roughly what this means is:

• Alex has the most first choice votes
• When compared pairwise with any other candidate, the majority prefer Charlie
• Pat wins a complicated procedure where people iteratively drop out depending on who is the favourite of the remaining candidates amongst the fewest voters
• If you give people a score of 3 for each voter who ranks them first, two for each who ranks them second, and one for each who ranks them third, then Kim has the highest score.

If you want to know more than that I recommend the relevant Wikipedia links above, which are all very approachable.

The significance of this is that each of these is a popular (either in practice or amongst electoral theorists) way of saying who should be the winner. So the situation ends up being rather complex to decide.

But this isn’t a post about the significance. This is a post about how I constructed the example.

Historically I’ve generally created example elections with Hypothesis or a similar approach – randomly fuzzing until you get a result you want – but that wasn’t really going to work very well here due to the rather complicated set of constraints that are hard to satisfy by accident.

So I instead turned to my old friend, integer linear programming.

The idea of integer linear programming (ILP) is that we have a number of variables which are forced to take integer values. We can then impose linear constraints between them, and give a linear objective function to optimise for. For example we might have three variables $$v_1, v_2, v_3$$ and the following constraints:

• $$v_1, v_2, v_3 \geq 0$$
• $$v_1 + 2 v_2 + 3 v_3 \leq 5$$

And then try to maximise $$v_1 + 2 v_2 + 4 v_3$$.

Given a problem like this we can feed it to an ILP solver and it will spit out a solution. If we do that, it will tell us that an optimal solution for this problem is $$v_1 = 2, v_2 = 0, v_3 = 1$$.

A lot of problems can be nicely turned into ILP problems, and there are some fairly good open source ILP solvers, so it’s often a nice way to solve complex combinatorial problems.

So how do we turn our election rigging problem into an integer linear program?

The key idea is this: 4! isn’t very large. It’s only 24. 24 is tiny from an ILP point of view.

This means that there’s no problem with creating a variable for each of the possible votes that someone could cast, representing the number of people who cast that vote. That is, we create integer variables $$v_p$$ indexed by permutations with the constraints:

• $$\sum v_p = 100$$
• $$v_p \geq 0$$

Then the value of e.g. $$v_{(0, 1, 2, 3)}$$ is the number of people who cast the exact vote $$(0, 1, 2, 3)$$ (or, by name, Alex,, Charlie, Pat, Kim).

We then use a trick to get nicer examples, which is that we try to minimise the number of non-zero votes. The idea is to create variables which, when minimised, just look like markers that say whether a vote is non-zero or not.

So we create supplementary 0/1 valued integer variables $$u_p$$ with the constraints that $$v_p \leq 100 u_p$$, and set the objective to minimise $$\sum u_p$$. Then $$u_p$$ will be set to $$0$$ wherever it can be, and the only places where it can’t are where $$v_p$$ is non-zero. Thus this minimises the number of voting blocs.

So that’s how we create our basic ILP problem, but right now it will just stick 100 votes on some arbitrary possible ballot. How do we then express the voting conditions?

Well, lets start with the plurality and Borda scores. These are pretty easy, because they constitute just calculating a score for each candidate for each permutation and adding up the scores. This means that the scores are just a linear function of the variables, which is exactly what an ILP is built on.

Victory is then just a simple matter of one candidate’s score exceeding another. You need to set some epsilon for the gap (linear programming can’t express $$<$$, only $$\leq$$), but that’s OK – the scores are just integers, so we can just insist on a gap of $$1$$.

The following code captures all of the above using Pulp, which is a very pleasant to use Python interface to a variety of ILP solvers:

The idea is that the additive_scores parameter takes a list of scoring functions and a partial list of winners given those functions and returns an election producing those orders.

So if we run this asking for the plurality winners to be (0, 1, 2, 3) and the Borda winners to be (3, 2, 1, 0) we get the following:

>>> build_election(4, 100, [
(lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
(lambda p, c: 3 - p.index(c), (3, 2, 1, 0))])
[((0, 3, 1, 2), 35), ((1, 2, 3, 0), 34), ((2, 3, 0, 1), 31)]

So this is already looking quite close to our starting example.

Creating a Condorcet winner is similarly easy: Whether the majority prefers a candidate to another is again just an additive score. So we just need to add the requisite $$N$$ constraint that our desired Condorcet candidate wins.

if condorcet_winner is not None: victories = { (i, j): lpsum( v for p, v in variables if p.index(i) > p.index(j) ) for i in candidates for j in candidates } for c in candidates: if c != condorcet_winner: problem.addConstraint( victories[(condorcet_winner, c)] >= victories[(c, condorcet_winner)] + 1 )

If we run this to force the Condorcet winner to be $$1$$ we now get the following:

>>> build_election(4, 100, [
(lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
(lambda p, c: 3 - p.index(c), (3, 2, 1, 0))],
condorcet_winner=1,
)
[((0, 3, 1, 2), 28),
((1, 2, 3, 0), 27),
((2, 1, 3, 0), 24),
((3, 2, 0, 1), 21)]


This is pretty close to the desired result. We just need to figure out how to set the IRV winner.

This is a bit more fiddly because IRV isn’t a simple additive procedure, so we can’t simply set up scores for who wins it.

But where it is a simple additive procedure is to determine who drops out given who has already dropped out, because that’s simply a matter of calculating a modified plurality score with some of the candidates ignored.

So what we can do is specify the exact dropout order: This means we know who has dropped out at any point, so we can calculate the scores for who should drop out next and add the appropriate constraints.

The following code achieves this:

 if irv_dropout_order is not None: remaining_candidates = set(candidates) for i in irv_dropout_order: if len(remaining_candidates) <= 1: break assert i in remaining_candidates allocations = {j: [] for j in remaining_candidates} for p, v in variables: for c in p: if c in remaining_candidates: allocations[c].append(v) break loser_allocations = sum(allocations.pop(i)) remaining_candidates.remove(i) for vs in allocations.values(): problem.addConstraint(loser_allocations + 1 <= sum(vs))

And running this we get the following:

>>> build_election(4, 100, [
(lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
(lambda p, c: 3 - p.index(c), (3, 2, 1, 0))
], condorcet_winner=1, irv_dropout_order=(3, 1, 0, 2))
[((0, 3, 1, 2), 31),
((1, 2, 3, 0), 27),
((2, 1, 3, 0), 25),
((3, 2, 0, 1), 17)]


This isn’t quite the same example in the book (the first bloc has one fewer vote which the last bloc got in this), because the book example had a bit more code for optimizing it into a canonical form, but that code isn’t very interesting so we’ll skip it.

Here’s the full code:

I’m constantly surprised how nice integer linear programming ends up being for constructing examples. I knew I could do the Borda and plurality scores – that’s in fact the example that motivated me trying this out at all – but although I find it obvious in retrospect that you can also fix the Condorcet winner it definitely wasn’t a priori. The fact that it’s easy to also calculate the IRV dropout order was genuinely surprising.

This is also a nice example of how much fun small n programming can be. This isn’t just an O(n!) solution – it generates a problem of size O(n!) and then feeds it to a solver for an NP-hard problem! in principle that should be catastrophic. In practice, n is 4, so who cares? (n=4 is about the limit for this too – it just about works for n=5, and for n=6 it doesn’t really).

I also find it somewhat surprising how much freedom there is to get different results from different voting systems. I was motivated to do some of this by The ultimate of chaos resulting from weighted voting systems, so I knew there was a fair bit of freedom, but I was somewhat expecting that to be a pathology of weighted voting systems, so I’m still a little surprised. I guess there’s just quite a lot of room to play around with in a 24-dimensional simplex.

Which brings me to my final point: When you’re trying to understand what is possible in a domain, this sort of example generation focused programming is very useful for exploring the limits of the possible and building your intuition. I feel like this isn’t a thing people do enough, and I think we should all do more of it.

I also think you should buy my book.

This entry was posted in programming, Python, voting on by .

Avoiding conferences in the USA for now

This is an administrative note, with a side order of politics.

Although I am not personally affected to any large degree (I’m a US citizen, and while there are theoretically things that could cause me problems, in practice I think by the time they come for me you’ll care more about how to join the violent resistance than any particular effect of this post), the situation with Trump in the US and his rather dangerously slapdash approach to things like immigration and the basic rule of law are making me extremely unhappy, and I don’t really feel comfortable attending US conferences while this is going on.

I was probably going to be going to PyCon US (and had submitted a tutorial proposal which I’ve now withdrawn), and was quite happy that PyCon UK and Strange Loop did not clash this year, as I keep wanting to give a talk about Hypothesis implementation at the latter (which might or might not have been accepted of course, but I’d have at least tried), so amidst the rather larger and more important political horror this is also personally quite annoying. Oh well. Bigger things to worry about.

Naturally I hold neither of these conferences accountable for Trump’s actions, which are almost entirely beyond their control, and I wish them all the best. I just don’t feel comfortable attending myself.

I won’t entirely be avoiding travel to the US (I have family over there), but I will be doing my best to minimise it, and will probably not do so for work purposes unless it’s something really important.

I don’t know how long I intend to keep this policy up, but it’s almost certainly going to be for the next four years unless something very surprising happens.

Lest this seem like there’s an awful lot of glass in this house I’m throwing stones from, I’d probably recommend you do the same to the UK. The best I can claim right now is that we’re not yet as bad as Trump, but we already have a history of really awful border control so that’s probably not very reassuring. That being said, I live here, so I won’t make any such attempt to avoid local conferences.

Apologies to everyone who I won’t be able to see and/or meet for the first time in person at PyCon US and Strange Loop. Maybe we can find a great conference somewhere in the EU to both attend?

Assuming they still let Brits in after this debacle we’ve made of things.

This entry was posted in Admin, Python on by .