Category Archives: Python

Lazy Fisher-Yates shuffling for precise rejection sampling

This is a trick I figured out a while ago. It came up in a problem I was working on, so I thought I’d write it up.

I haven’t seen it anywhere else, but I would be very surprised if it were in any way original to me rather than a reinvention. I think it’s niche and not very useful, so it’s hard to find prior art.

Attention conservation notice: My conclusion at the end is going to be that this is a neat trick that is probably not worth bothering with. You get a slight asymptotic improvement for a worse constant factor. I mention some variants and cases where it might be useful at the end, but it’s definitely more of an interesting intellectual exercise than a practical tool.

Suppose you have the following problem: you want to implement this function:

def sample(random, n, f):
    """returns a number i in range(n) such that f(i)
    is True. Raises ValueError if no such i exists."""

What is the right way to do this?

The obvious way to do it is this:

def sample(random, n, f):
    choices = [i for i in range(n) if f(i)]
    if not choices:
        raise ValueError("No such i!")
    return random.choice(choices)

We calculate all of the possible outcomes up front and then sample from those.

This works but it is always \(O(n)\). We can’t hope to do better than that in general – in the case where \(f(i)\) always returns False, we must call it \(n\) times to verify that and raise an error – but it is highly inefficient if, say, \(f\) always returns True.

For cases where we know a priori that there is at least one i with f(i) returning True, we can use the following implementation:

def sample(random, n, f):
    while True:
        i = random.randrange(0, n)
        if f(i):
            return i

Here we do rejection sampling: We simulate the distribution directly by picking from the larger uniform distribution and selecting on the conditional property that \(f(i)\) is True.

How well this works depends on a parameter \(A\) – the number of values in \([0, n)\) such that \(f(i)\) is True. The probability of stopping at any loop iteration is \(\frac{A}{n}\) (the probability of the result being True), so the number of loops follows a geometric distribution with that parameter, and the expected number of loop iterations is \(\frac{n}{A}\). i.e. we’ve effectively sped up our search by a factor of \(A\).
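
If you want to convince yourself of that \(\frac{n}{A}\) figure empirically, here is a quick simulation sketch (mine, not from the original post; the specific values \(n = 100\) and \(A = 7\) are arbitrary):

import random
from statistics import mean

def rejection_iterations(rng, n, accepted):
    # Count how many draws rejection sampling needs to hit an accepted value.
    count = 0
    while True:
        count += 1
        if rng.randrange(0, n) in accepted:
            return count

rng = random.Random(0)
n, A = 100, 7
accepted = set(range(A))
print(mean(rejection_iterations(rng, n, accepted) for _ in range(100000)))
print(n / A)  # the two numbers printed should be close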

So in expectation rejection sampling is always at least as good and usually better than our straightforward loop, but it has three problems:

  1. It will occasionally take more than \(n\) iterations to complete because it may try the same \(i\) more than once.
  2. It loops forever when \(A=0\).
  3. Even when \(A > 0\) its worst-case may take strictly more iterations than the filtering method.

So how can we fix this?

If we fix the first by arranging so that it never calls \(f\) with the same \(i\) more than once, then we also implicitly fix the other two: At some point we’ve called \(f\) with every \(i\) and we can terminate the loop and error, and if each loop iteration returns a fresh \(i\) then we can’t have more than \(n\) iterations.

We can do this fairly naturally by shuffling the order in which we try the indices and then returning the first one that returns True:

def sample(random, n, f):
    indices = list(range(n))
    random.shuffle(indices)
    for i in indices:
        if f(i):
            return i
    raise ValueError("No such i!")

Effectively at each iteration of the loop we are uniformly at random selecting an \(i\) among those we haven’t seen yet.

So now we are making fewer calls to \(f\) – we only call it until we first find an example, as in our rejection sampling. However we’re still paying an \(O(n)\) cost for actually doing the shuffle!

We can fix this partially by inlining the implementation of the shuffle as follows (this is a standard Fisher-Yates shuffle, though we’re doing it backwards):

def sample(random, n, f):
    indices = list(range(n))
    for j in range(n):
        # One Fisher-Yates step: indices[:j + 1] is now a uniform random prefix.
        k = random.randrange(j, n)
        indices[j], indices[k] = indices[k], indices[j]
        i = indices[j]
        if f(i):
            return i
    raise ValueError("No such i!")

This works because after \(j\) iterations of the loop in a Fisher-Yates shuffle the indices up to and including \(j\) are shuffled. Thus we can effectively fuse our search into the loop – if we know we’re going to terminate here, we can just stop and not bother shuffling the rest.

But we’re still paying \(O(n)\) cost for initialising the list of indices. Can we do better? Yes we can!

The idea is that we can amortise the cost of our reads of indices into our writes of it, and we only do \(O(\frac{n}{A})\) writes. If we haven’t written to a position in the indices list yet, then its value must be equal to its position.

We can do this naturally in Python using a defaultdict subclass (there are better data structures for this, but a hash map gets us the desired amortized complexity) as follows:

from collections import defaultdict

class key_dict(defaultdict):
    def __missing__(self, key):
        # Positions we haven't written to implicitly hold their own index.
        return key

def sample(random, n, f):
    indices = key_dict()
    for j in range(n):
        k = random.randrange(j, n)
        indices[j], indices[k] = indices[k], indices[j]
        i = indices[j]
        if f(i):
            return i
    raise ValueError("No such i!")

So we have now completely avoided any \(O(n)\) costs (assuming we’re writing Python 3 and so range is lazy. If you’re still using legacy Python, replace it with an xrange).

What is the actual expected number of loop iterations?

Well if \(A = n\) then we only make one loop iteration, and if \(A = 0\) we always make \(n\). In general we certainly make no more than \(n - A + 1\) iterations, because after \(n - A\) iterations we’ve exhausted all the values for which \(f\) is False, so the next draw must succeed.

I, err, confess I’ve got a bit lost in the maths for what the expected number of iterations of this loop is. If \(L\) is a random variable that is the position of the loop iteration on which this terminates with a successful result (with \(L = n + 1\) if there are no valid values) then a counting argument gives us that \(P(L \geq k) = \frac{(n - A)! (n - k)!}{n! (n - A - k)!}\), and we can calculate \(E(L) = \sum\limits_{k=0}^{n - A} P(L \geq k)\), but the sum is fiddly and my ability to do such sums is rusty. Based on plugging some special cases into Wolfram Alpha, I think the expected number of loop iterations is something like \(\frac{n+1}{A + 1}\), which at least gives the right numbers for the boundary cases. If that is the right answer then I’m sure there’s some elegant combinatorial argument that shows it, but I’m not currently seeing it. Assuming this is right, this is asymptotically no better than the original rejection sampling.
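
For what it’s worth, a quick simulation sketch (my own, not from the post) seems consistent with the \(\frac{n+1}{A + 1}\) guess. Here I treat the values below \(A\) as the ones for which \(f\) is True:

import random
from statistics import mean

def lazy_shuffle_iterations(rng, n, A):
    # Run the lazy Fisher-Yates sampler and count loop iterations,
    # treating i < A as "f(i) is True".
    indices = {}
    for j in range(n):
        k = rng.randrange(j, n)
        indices[j], indices[k] = indices.get(k, k), indices.get(j, j)
        if indices[j] < A:
            return j + 1
    return n

rng = random.Random(0)
for n, A in [(10, 1), (10, 3), (50, 5)]:
    observed = mean(lazy_shuffle_iterations(rng, n, A) for _ in range(100000))
    print(n, A, observed, (n + 1) / (A + 1))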

Regardless, the expected number of loop iterations is definitely \(\leq \frac{n}{A}\), because we can simulate it by running our rejection sampling and only counting an iteration when we see a value we haven’t seen before, so it is strictly dominated by the expected number of iterations of the pure rejection sampling. So we do achieve our promised result of an algorithm that strictly improves on both rejection sampling and filter then sample – it has a better expected complexity than the former, and the same worst-case complexity as the latter.

Is this method actually worth using? Ehhh… maybe? It’s conceptually pleasing to me, and it’s nice to have a method that complexity-wise strictly out-performs either of the two natural choices, but in practice the constant overhead of using the hash map almost certainly is greater than any benefit you get from it.

The real problem that limits this being actually useful is that either \(n\) is small, in which case who cares, or \(n\) is large, in which case the chances of drawing a duplicate are sufficiently low that the overhead is negligible.

There are a couple of ways this idea can still be useful in the right niche though:

  • This does reduce the constant factor in the number of calls to \(f\) (especially if \(A\) is small), so if \(f\) is expensive then the constant factor improvement in number of calls may be enough to justify this.
  • If you already have the values you want to sample from in an array, and the order of the array is arbitrary, then you can use the lazy Fisher-Yates shuffle trick directly without the masking hash map (there’s a sketch of this just after this list).
  • If you’re genuinely unsure about whether \(A > 0\) and do want to be able to check, this method allows you to do that (but you could also just run rejection sampling \(n\) times and then fall back to filter and sample and it would probably be slightly faster to do so).
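
To illustrate that second point, here’s a minimal sketch (mine, not from the post) of sampling directly from a list whose order you don’t care about; the lazy shuffle happens in place and no masking dict is needed:

def sample_from_list(random, values, f):
    # values is a list we are allowed to reorder in place.
    for j in range(len(values)):
        k = random.randrange(j, len(values))
        values[j], values[k] = values[k], values[j]
        if f(values[j]):
            return values[j]
    raise ValueError("No such value!")

The only cost over plain rejection sampling is the swap, and the list is left partially shuffled as a side effect.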

If any of those apply, this might be a worthwhile trick. If they don’t, hopefully you enjoyed reading about it anyway. Sorry.


A pathological example for test-case reduction

This is an example I cut from a paper I am currently writing about test-case reduction because I needed to save space and nobody except me actually cares about the algorithmic complexity of test-case reduction.

Suppose you have the following test case:

from hypothesis import given, strategies as st
 
K = 64
 
INTS = st.integers(0, 2 ** K - 1)
 
@given(INTS, INTS)
def test_are_not_too_far_apart(m, n):
    assert abs(m - n) > 1

If this test ever fails, with initial values \(m\) and \(n\), and reduction can replace an int with a smaller one but can’t change two ints at once, then the reduction will take at least \(\frac{m + n}{2}\) test executions: the fastest possible path is to reduce the larger of \(m\) and \(n\) by two at each step, so each step only reduces \(m + n\) by \(2\), and the whole reduction takes \(\frac{m + n}{2}\) steps.
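
To make that counting argument concrete, here is a small sketch (mine, not code from the paper or from Hypothesis) that follows the fastest possible path, always shrinking the larger value as far as it can go while keeping the test failing, and counts the steps:

def test_fails(m, n):
    return abs(m - n) <= 1  # the assertion in the test above is violated

def fastest_reduction_steps(m, n):
    # Each step lowers a single value while keeping the test failing.
    steps = 0
    while (m, n) != (0, 0):
        if m >= n:
            m = max(n - 1, 0)
        else:
            n = max(m - 1, 0)
        assert test_fails(m, n)
        steps += 1
    return steps

print(fastest_reduction_steps(10 ** 6, 10 ** 6))  # roughly (m + n) / 2 steps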

(Side note: You can get lucky with Hypothesis and trigger some special cases that make this test faster. If you happen to have \(m = n\) then it can reduce the two together, but if you start with \(m = n \pm 1\) then that will currently never trigger, because it will never have duplicates at the entry to that step. Hypothesis will also actually find this bug immediately, because it will try the test with both examples set to zero. Trivial modifications to the test can be made to avoid these problems, but I’m going to ignore them here.)

The interesting thing about this from a Hypothesis point of view is that \(m + n\) is potentially exponential in \(k\), and the data size is linear in \(k\), so Hypothesis’s test case reduction is of exponential complexity (which doesn’t really cause a performance problem in practice because the number of successful reductions gets capped, but does cause an example quality problem because you then don’t run the full reduction). But this isn’t specifically a Hypothesis problem – I’m morally certain every current property-based testing library’s test case reduction is exponential in this case (except for ones that haven’t implemented reduction at all), possibly with one or two patches to avoid trivial special cases like always trying zero first.

Another way to get around this is to almost never trigger this test case with large values! Property-based testing libraries will usually only generate an example like this with very small values. But it’s easy for that not to be the case – mutational property-based testing libraries like Hypothesis or Crowbar can in theory fairly easily find this example for large \(m\) and \(n\) (Hypothesis currently doesn’t. I haven’t tried with Crowbar). Another way you could easily trigger it is with distributions that special-case large values.

One thing I want to emphasise is that regardless of the specific nature of the example and our workarounds for it, this sort of problem is inherent. It’s easy to make patches that avoid this particular example (especially in Hypothesis which has no problem making simultaneous changes to \(m\) and \(n\)).

But if you fix this one, I can just construct another that is tailored to break whatever heuristic it was that you came up with. Test-case reduction is a local search method in an exponentially large space, and this sort of problem is just what happens when you block off all of the large changes your local search method tries to make but still allow some of the small ones.

You can basically keep dangling the carrot in front of the test-case reducer going “Maybe after this reduction you’ll be done”, and you can keep doing that indefinitely because of the size of the space. Pathological examples like this are not weird special cases, if anything the weird thing is that most examples are not pathological like this.

My suspicion is that we don’t see this problem cropping up much in practice for a couple of reasons:

  1. Existing property-based testing libraries are very biased towards finding small examples in the first place, so even when we hit cases with pathological complexity, \(n\) is so small that it doesn’t actually matter.
  2. This sort of boundary case relies on what are essentially “weird coincidences” in the test case. They happen when small local changes unlock other small local changes that were previously locked. This requires subtle dependency between different parts of the test case, and where that subtle dependency exists we are almost never finding it. Thus I suspect the fact that we are not hitting exponential slow downs in our test case reduction on a regular basis may actually be a sign that there are entire classes of bug that we are just never finding because the probability of hitting the required dependency combination is too low.
  3. It may also be that bugs just typically do not tend to have that kind of sensitive dependencies. My suspicion is that this is not true given the prevalence of off-by-one errors.
  4. It may also be that people are hitting this sort of problem in practice and aren’t telling us because they don’t care that much about the performance of test case reduction or example quality.

Shaping the World

I gave a keynote at PyCon UK recently – it was mostly about the book “Seeing Like A State” and what software developers can learn from it about our effect on the world.

I’ve been meaning to edit it up into a blog post, and totally failing to get around to it, so in lieu of that, here’s my almost entirely unedited script – it’s not that close to the version I actually got up on stage and said, because I saw 800 people looking at me and panicked and all the words went out of my head (apparently this was not at all obvious to people), but the general themes of the two are the same and neither is strictly better than the other – if you prefer text like a sensible person, read this post. If you prefer video, the talk is supposedly pretty good based on the number of people who have said nice things to me about it (I haven’t been able to bear to watch it yet).

The original slides are available here (warning: Don’t load on mobile data. They’re kinda huge). I’ve inserted a couple of the slide images into the post where the words don’t make sense without the accompanying image, but otherwise decided not to clutter the text with images (read: I was too lazy).


Hi, I’m David MacIver. I’m here to talk to you today about the ways which we, as software developers, shape the world, whether we want to or not.

This is a talk about consequences. Most, maybe all, of you are good people. The Python community is great, but I’d be saying that anywhere. Most people are basically good, even though it doesn’t look that way sometimes. But unless you know about what effect your actions have, all the good intentions in the world won’t help you, and good people can still make the world a worse place. I’m going to show you some of the ways that I think we’re currently doing that.

The tool I’m going to use to do this is cultural anthropology: The study of differences and similarities between different cultures and societies. I’m not a cultural anthropologist. I’ve never even taken a class on it, I’ve just read a couple of books. But I wish I’d read those books earlier, and I’d like to share with you some of the important lessons for software development that I’ve drawn from them.

In particular I’d like to talk to you about the work of James C. Scott, and his book “Seeing like a state”. Seeing like a state is about the failure modes of totalitarian regimes, and other attempts to order human societies, which are surprisingly similar to some of the failure modes of software projects. I do recommend reading the book. If you’re like me and not that used to social science writing, it’s a bit of a heavy read, but it’s worth doing. But for now, I’ll highlight what I think are the important points.

[Slide image: an unsorted binary tree. “Binary Tree” by Derrick Coetzee.]

Before I talk about totalitarian states, I’d like to talk about trees. If you’re a computer scientist, or have had an unfortunate developer job interview recently, a tree is probably something like this. It has branches and leaves, and not much else.

If you’re anyone else, a tree is rather different. It’s a large living organism. It has leaves and branches, sure, but it also has a lot of other context and content. It provides shade, maybe fruit, it has a complex root system. It’s the center of its own little ecosystem, providing shelter and food for birds, insects, and other animals. Compared to the computer scientist’s view of a tree it’s almost infinitely complicated.

But there’s another simplifying view of a tree we could have taken, which is that of the professional forester. A tree isn’t a complex living organism, it’s just potential wood. The context is no longer relevant; all we really care about is the numbers – it costs this much to produce this amount of this grade of wood and, ultimately, this amount of money when you sell the wood.

This is a very profitable view of a tree, but it runs into some difficulties. If you look at a forest, it’s complicated. You’ve got lots of different types of trees. Some of them are useful, some of them are not – not all wood is really saleable, some trees are new and still need time to grow, trees are not lined up with each other so you have to navigate around ones you didn’t want. As well as the difficulty of harvesting, this also creates difficulty measuring – even counting the trees is hard because of this complexity, let alone more detailed accounting of when and what type of wood will be ready, so how can you possibly predict how much wood you’re going to harvest and thus plan around what profit you’re going to make? Particularly a couple of hundred years ago when wood was the basis of a huge proportion of the national economy, this was a big deal. We have a simple view of the outcomes we want, but the complex nature of reality fights back at our attempts to achieve that. So what are we going to do?

Well, we simplify the forest. If the difficulty in achieving our simple goals is that reality is too complicated, we make the reality simpler. As we cut down the forest, we replant it with easy to manage trees in easy to manage lines. We divide it into regions where all of the trees are of the same age. Now we have a relatively constant amount of wood per unit of area, and we can simply just log an entire region at once, and now our profits become predictable and, most importantly, high.

James Scott talks about this sort of thing as “legibility”. The unmanaged forest is illegible – we literally cannot read it, because it has far more complexity than we can possibly hope to handle – while, in contrast, the managed forest is legible – we’ve reshaped its world to be expressible in a small number of variables – basically just the land area, and the number of regions we’ve divided it into. The illegible world is unmanageable, while the legible world is manageable, and we can control it by adjusting a small number of parameters.

In a technical sense, legibility lets us turn our control over reality into optimisation problems. We have some small number of variables, and an outcome we want to optimise for, so we simply reshape the world by finding the values of those variables that maximize that outcome – our profits. And this works great – we have our new simple refined world, and we maximize our profit. Everyone is happy.

Oh, sure, there are all those other people who were using the forest who might not be entirely happy. The illegible natural forest contains fruit for gathering, brush to collect for firewood, animals for hunting, and a dozen other uses all of which are missing from our legible managed forest. Why? Well because those didn’t affect our profit. The natural behaviour of optimisation processes is to destroy everything in their path that isn’t deliberately preserved or directly required for their outcome. If the other use cases didn’t result in profit for us, they’re at best distractions or at worst impediments. Either way we get rid of them. But those only matter to the little people, so who cares? We’re doing great, and we’re making lots of money.

At least, for about eighty years, at which point all of the trees start dying. This really happened. These days we’re a bit better at forest management, and have figured out more of which complexity is necessary and which we can safely ignore, but in early scientific forestry, about 200 years ago in Germany, they learned the hard way that a lot of things they had thought weren’t important really were. There was an entire complex ecological cycle that they’d ignored, and they got away with it for about 80 years because they had a lot of high quality soil left over from that period that they could basically strip mine for a while. But the health of the forest deteriorated over time as the soil got worse, and eventually the trees were unhealthy enough that they started getting sick. And because all of the trees were the same, when one got sick it spread like wildfire to the others. They called it Waldsterben – forest death.

The problem that the German scientific foresters ran into is that complex, natural, systems are often robust in ways that simple, optimised systems are not. They’ve evolved over time, with lots of fiddly little details that have occurred locally to adapt to and patch over problems. Much of that illegibility turns out not to be accidental complexity, but instead the adaptation that was required to make the system work at all. That’s not to say all complexity is necessary, or that there isn’t a simpler system that also works, but if the complexity is there, chances are we can’t just remove it without replacing it with something else and assume the system will keep working, even if it might look like it does for a while.

This isn’t actually a talk about trees, but it is a talk about complexity, and about simplification. And it’s a talk about what happens when we apply this kind of simplification process to people. Because it turns out that people are even more complicated than trees, and we have a long history of trying to fix that, to take complex, messy systems of people and produce nice, simple, well behaved social orders that follow straightforward rules.

This is what James Scott calls Authoritarian High-Modernism – the desire to force people to fit into some rational vision of the world. Often this is done for entirely virtuous reasons – many authoritarian high-modernist projects are utopian in nature – we want everyone to be happy and well fed and fulfilled in their lives. Often they are less virtuous – totalitarian regimes love forcing people into their desired mould. But virtuous or not, they often fail in the same way that early scientific forestry did. Seeing like a state has a bunch of good examples of this. I won’t go into them in detail, but here’s a few.

[Slide image: a building with multiple windows bricked up. Portland Street, Southampton, England, by Gary Burt.]

An amusing example is buildings like this. Have you seen these? Do you know why there are these bricked-up windows? Well it’s because of window taxes. A while back, income tax was very unpopular. Depending on who you ask, maybe it still is, but it was even more so back then. But the government wanted to extract money from its citizens. What could they do? Well, they could tax where people live by size – rich people live in bigger buildings – but houses are often irregularly shaped, so measuring the size of the house is hard. There is, however, a nice, simple, convenient proxy for it – the number of windows. So this is where window taxes come from – you take the complex, messy realities of wealth, pick a simple proxy for them, and you end up taxing the number of windows. Of course what happens is that people brick up their windows to save on taxes. And then suffer health problems from lack of natural light and proper ventilation in their lives, which is less funny, but so it goes.

Another very classic example that also comes from taxation is the early history of the cadastral, or land-use, map. We want to tax land use, so we need to know who owns the land. So we create these detailed land-use maps which say who owns what, and we tax them accordingly. This seems very straightforward, right? But in a traditional village this is nonsense. Most land isn’t owned by any single person – there are complex systems of shared usage rights. You might have commons on which anyone can graze their animals, but where certain fruit trees are owned, but everyone has the rights to use fallen fruit. It’s not that there aren’t notions of ownership per se, but they’re very fine-grained and contextual, and they shift according to a complex mix of circumstance and need. The state doesn’t care. These complex shared ownerships are illegible, so we force people to conform instead to the legible idea of single people or families owning each piece of land. This is where a lot of modern notions of ownership come from by the way – the state created them so they could collect more tax.

And of course we have the Soviet Union’s program of farm collectivization, which has the state pushing things in entirely the opposite direction. People were operating small family-owned farms, which were adapted to their local conditions and what grew well where they were. A lot of it was subsistence farming, particularly in lean times – when you had excess, you sold it. When you didn’t, you lived off the land. This was hard to manage if you’re a state that wants to appropriate large quantities of food to feed your army and decide who is deserving and who gets what. So they forcibly moved everyone to work on large, collective farms which grew what the state wanted, typically in large fields of monocultures that ignored the local conditions. From the perspective of producing enough food, this worked terribly. The large, collectivized farms produced less food, less reliably, than the more distributed, locally adapted farms. The result was famine which killed millions. But from the point of view of making the food supply legible, and allowing the state to control it, the system worked great, and the Soviets weren’t exactly shy about killing millions of people, so the collectivization program was largely considered a success by them, though it did eventually slow and stop before they converted every farm.

But there’s another, more modern, example of all of these patterns. We have met the authoritarians and they are us. Tech may not look much like a state, even ignoring its strongly libertarian bent, but it has many of the same properties and problems, and every tech company is engaged in much the same goal as these states were: Making the world legible in order to increase profit.

Every company does this to some degree, but software is intrinsically a force for legibility. A piece of software has some representation of the part of the world that it interacts with, boiling it down to the small number of variables that it needs to deal with. We don’t necessarily make people conform to that vision, but we don’t have to – as we saw with the windows, people will shape themselves in response to the incentives we give them, as long as we are able to reward compliance with our vision or punish deviance from it.

When you hear tech companies talk about disruption, legibility is at the heart of what we’re doing. We talk about efficiency – taking these slow, inefficient, legacy industries and replacing them with our new, sleek, streamlined versions based on software. But that efficiency comes mostly from legibility – it’s not that we’ve got some magic wand that makes everything better, it’s that we’ve reduced the world to the small subset of it that we think of as the important bits, and discarded the old, illegible, reality as unimportant.

And that legibility we impose often maps very badly to the actual complexity of the world. You only have to look at the endless stream of “falsehoods programmers believe” articles to get a sense of how much of the world’s complexity we’re ignoring. It’s not just programmers of course – if anything the rest of the company is typically worse – but we’re still pretty bad. We believe falsehoods about names, but also gender, addresses, time, and many more.

This probably still feels like it’s not a huge problem. Companies are still not states. We’re not forcing things on anyone, right? If you don’t use our software, nobody is going to kick down your door and make you. Much of the role of the state is to hold a monopoly on the legitimate use of physical force, and we don’t have access to that. We like to pretend this makes some sort of moral difference. We’re just giving people things that they want, not forcing them to obey us.

Unfortunately, that is a fundamental misunderstanding of the nature of power. Mickey Mouse, despite his history of complicity in US racism, has never held a gun to anyone’s head and forced them to do his bidding, outside of a cartoon anyway. Nevertheless he is almost single-handedly responsible for reshaping US copyright law, and by extension copyright law across most of the world. When Mickey Mouse is in danger of going out of copyright, US copyright law mysteriously extends the length of time after the creator’s death that works stay in copyright. We now live in a period of eternal copyright, largely on the strength of the fact that kids like Mickey Mouse.

This is what’s called Soft Power. Conventional ideas of power are derived from coercion – you make someone do what you want – while soft power is power that you derive instead from appeal – People want to do what you want. There are a variety of routes to soft power, but there’s one that has been particularly effective for colonising forces, the early state, and software companies. It goes like this.

First you make them want what you have, then you make them need it.

The trick is to basically ease people in – you give them a hook that makes your stuff appealing, and then once they’re used to it they can’t do without it. Either because it makes their life so much better, or because in the new shape of the world doing without it would make their life so much worse. These aren’t the same thing. There are some common patterns for this, but there are three approaches that have seen a lot of success that I’d like to highlight.

The first is that you create an addiction. You sell them alcohol, or you sell them heroin. The first one’s free – just a sampler, a gift of friendship. But hey, that was pretty good. Why not have just a little bit more… Modern tech companies are very good at this. There’s a whole other talk you could give about addictive behaviours in modern software design. But, for example, I bet a lot of you find yourselves compulsively checking Twitter. You might not want to – you might even want to quit it entirely – but the habit is there. I’m certainly in this boat. That’s an addictive behaviour right there, and perhaps it wasn’t deliberately created, but it sure looks like it was.

The second strategy is that you can sell them guns. Arms dealing is great for creating dependency! You get to create an arms race by offering them to both sides, each side buys it for fear that the other one will, and now they have to keep buying from you because you’re the only one who can supply them bullets. Selling advertising and social media strategies to companies works a lot like this.

The third is you can sell them sugar. It’s cheap and delicious! And is probably quite bad for you and certainly takes over your diet, crowding out other more nutritious options. Look at companies who do predatory pricing, like Uber. It’s great – so much cheaper than existing taxis, and way more convenient than public transport, right? Pity they’re going to hike the prices way up when they’ve driven the competition into the ground and want to stop hemorrhaging money.

And we’re going to keep doing this, because this is the logic of the market. If people don’t want and need our product, they’re not going to use it, we’re not going to make money, and your company will fail and be replaced by one with no such qualms. The choice is not whether or not to exert soft power, it’s how and to what end.

I’m making this all sound very bleak, as if the things I’m talking about were uniformly bad. They’re not. Soft power is just influence, and it’s what happens every day as we interact with people. It’s an inevitable part of human life. Legibility is just an intrinsic part of how we come to understand and manipulate the world, and is at the core of most of the technological advancements of the last couple of centuries. Legibility is why we have only a small number of standardised weights and measures instead of a different notion of a pound or a foot for every village.

Without some sort of legible view of the world, nothing resembling modern civilization would be possible and, while modern civilization is not without its faults, on balance I’m much happier for it existing than not.

But civilizations fall as well as rise, and things that seemed like they were a great idea in the short term often end in forest death and famine. Sometimes it turns out that what we were disrupting was our life support system.

And on that cheerily apocalyptic note, I’d like to conclude with some free advice on how we can maybe try to do a bit better on that balancing act. It’s not going to single handedly save the world, but it might make the little corners of it that we’re responsible for better.

My first piece of free advice is this: Richard Stallman was right. Proprietary software is a harbinger of the end times, and an enemy of human flourishing. … don’t worry, I don’t actually expect you to follow this one. Astute observers will notice that I’m actually running Windows on the computer I’m using to show these slides, so I’m certainly not going to demand that you go out and install Linux, excuse me, GNU/Linux, and commit to a world of 100% free software all the time. But I don’t think this point of view is wrong either. As long as the software we use is not under our control, we are being forced to conform to someone else’s idea of the legible world. If we want to empower users, we can only do that with software they can control. Unfortunately I don’t really know how to get there from here, but a good start would be to be better about funding open source.

In contrast, my second piece of advice is one that I really do want you all to follow. Do user research, listen to what people say, and inform your design decisions based on it. If you’re going to be forming a simplified model of the world, at least base it on what’s important to the people who are going to be using your software.

And finally, here’s the middle ground advice that I’d really like you to think about. Stop relying on ads. As the saying goes, if your users aren’t paying for it, they’re not the customer, they’re the product. The product isn’t a tree, it’s planks. It’s not a person, it’s data. Ads and adtech are one of the most powerful forces for creating a legible society, because they are fundamentally reliant on turning a complex world of people and their interactions into simple lists of numbers, then optimising those numbers to make money. If we don’t want our own human shaped version of forest death, we need to figure out what important complexity we’re destroying, and we need to stop doing that.

And that is all I have to say to you today. I won’t be taking questions, but I will be around for the rest of the conference if you want to come talk to me about any of this. Thank you very much.


Python Coverage could be fast

Ned Batchelder’s coverage.py is a foundation of the Python testing ecosystem. It is solid, well maintained, and does its job extremely well. I think literally every Python project that cares about testing should be using it.

But it’s not without its faults. Specifically, its performance can be quite bad. On some workloads it’s absolutely fine, but on others you can see anything up to an order of magnitude slowdown (and this is just on CPython. On pypy it can be even worse).

Recently, after some rather questionable late night hacking (and a significant amount of fixing not late at night), I finally made good on my promise that Hypothesis would eventually use coverage information and shipped Hypothesis 3.29.0 which added a dependency on Coverage and turned it on for every Hypothesis based test.

This hasn’t been an entirely smooth process – some of that is for reasons that are my fault, and some of it is that users are now running into these performance problems.

The right place to fix this is obviously in Coverage rather than Hypothesis itself, so I’ve been looking into this recently. I’ve already made one patch which gives branch coverage a 50% speedup in a loop-based microbenchmark and about a 20% speedup on one real world example I tried it on.

The idea of the patch is very straightforward (though apparently I’m unusual in thinking that “Here I wrote you a hash table!!!” is a straightforward patch). Coverage creates a lot of Python objects in a very hot part of the code, so this caches them off an integer key so that most of the time it can avoid creating those objects and significantly speed things up as a result.

Unfortunately that’s probably it for now. My priorities for the immediate future are PhD, paid work, and conference prep, which means that I certainly don’t have any time in the next month and probably not the next couple of months (this could be fixed by making this paid work. I may do a kickstarter or something for that, but in the meantime if any interested companies wanted to fund this work I’d be more than happy to discuss it…).

So I thought I’d write down my notes before I forget. These are both for future-me and for anyone interested who feels motivated to work on this problem in the meantime.

Initial Benchmarking

I haven’t done super formal benchmarking, but I set up pytest-benchmark with some basic benchmarks that just ran a loop adding numbers.

The benchmarking functionality itself was pretty great but I didn’t find it easy to compare benchmarks in the way that I wanted – in particular I had the same benchmark which was run in three different ways (no coverage, line coverage, branch coverage) and I wanted to break those down for side-by-side comparison, but I couldn’t find any functionality to do so (I admit I didn’t look very hard). It was nice having the statistics handled though and I will almost certainly want to sink some time into getting a decent pytest-benchmark suite for coverage if I work on this further.

Real World Benchmarking

The real world benchmark I used was Alex Groce’s tstl, because his usage profile is similar to mine (there’s a lot of overlap between what Hypothesis and TSTL do), and he had an existing example that was seeing an order of magnitude slow down. This is the example that gets a 20% speedup from my above patch.

The problem can be reproduced as follows:

git clone https://github.com/agroce/tstl.git
cd tstl
virtualenv v
source v/bin/activate
pip install .
cd examples/AVL
tstl avlnodisp.tstl
tstl_rt --seed=0 --timeout=30 
tstl_rt --seed=0 --timeout=30 --noCover

The thing to compare is the total number of test operations run in the outputs from each of the tstl_rt commands. For me I see “12192 TOTAL TEST OPERATIONS” with coverage, and “96938 TOTAL TEST OPERATIONS” without, so it runs about 8 times as many operations in the same time frame with coverage turned off (this is without my patch. With my patch I get 14665 under coverage, so about 20% more).

Profiling Coverage

I confess I didn’t figure out how to profile coverage until after I made the above patch. I had a benchmark I was measuring, and just based on inspection I was more than 90% certain that the above would help, so I decided to just give it a go to validate my suspicion, and it turned out I was right.

But after I’d picked the most obvious low hanging fruit I figured it would be silly to try to proceed further without getting profiling set up, so I poked around. I spent a little time trying to get google-perf-tools working with Python and failing, but eventually figured out that I could do it with perf and it works great (modulo quite a lot of hacking and data munging).

The basic idea with perf is that you run your program under “perf record” and it gives you raw output data. You can then do analysis on this to find out about your program’s performance.

The first thing to do to use perf is that you need to make sure that everything is compiled with debug symbols. This includes both your extension and Python itself.

To get a Python with debug symbols I used pyenv's python-build plugin:

export PYTHON_CFLAGS='-pg'
~/.pyenv/plugins/python-build/bin/python-build 2.7.13 ~/debug-python2

This builds a version of Python 2 (TSTL doesn’t work under Python 3) with the “-pg” flag to gcc which includes debug symbols. I also modified setup.py for coverage to include extra_compile_args=['-pg'] (it should be possible to do this with an environment variable, but I didn’t try).

Once I had that, running under perf was straightforward:

perf record tstl_rt --seed=0 --timeout=30

This creates a file called perf.data that you can analyze. I did not find perf report, the default way of analyzing it, super helpful, so I used CPU Flame Graphs.

I was only interested in the performance for calls below CTracer_trace, and I didn’t find the way it was spread out in the SVG (there were a lot of bits) very helpful, so I ended up aggregating the data through some um very sophisticated data analysis tools as follows:

perf script > out.perf && \
    ~/scratch/FlameGraph/stackcollapse-perf.pl out.perf | \
    grep CTracer | sed 's/.\+;CTracer_trace/CTracer_trace/' | \
    sort | \
    python sum.py > out2.folded
~/scratch/FlameGraph/flamegraph.pl out2.folded > out.svg

sum.py is the following very basic code:

from __future__ import print_function

import sys

if __name__ == '__main__':
    prev = None
    for l in sys.stdin:
        u, v = l.split()
        v = int(v)
        if prev is None:
            prev = u
            count = v
        elif prev == u:
            count += v
        else:
            print(prev, count)
            prev = u
            count = v
    print(prev, count)

(the data munging earlier creates duplicated entries, so this merges them together).

WordPress won’t let me upload the generated SVG “For Security Reasons” (which do not apparently preclude running WordPress itself), so here’s a gist of it, and here’s one from before my patch was applied (right click and view image in a new tab to get a usable interactive version of it).

pypy

PyPy performance for coverage is more complicated. My understanding is that there are roughly three classes of problems here:

  1. coverage itself is not as well optimised as it could be (same problem as CPython)
  2. Using settrace interferes with the JIT
  3. The baseline speed of pypy operations is much faster so coverage is a much higher overhead by comparison.

The first is the only one I could deal with, and I think it probably will benefit significantly from whatever changes I make to the C extension also being ported over to the pure Python version (pypy doesn’t use the C extension tracer because it’s even slower than the pure python one on pypy), but I’m not sure that will be enough – I suspect there are also going to be more changes to pypy internals required for this, and I don’t know enough about those to say how difficult or easy they are.

The Limits of What Is Possible

Python coverage is never going to run at the speed of Python without coverage, especially on pypy.

You can see the limits of how much faster it could be by running with an empty trace function (note: Whether you are using a C or Python level trace function makes a big difference. sys.settrace is ruinously slow even with an empty function).
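
If you want to see this on your own machine, here’s a rough sketch (mine, not from the post) of the Python-level version of that experiment. An empty C-level trace function needs an extension module, so this only demonstrates the even more pessimistic sys.settrace case:

import sys
import time

def empty_trace(frame, event, arg):
    # Returning the function keeps line events coming, but does nothing with them.
    return empty_trace

def workload():
    total = 0
    for i in range(10 ** 6):
        total += i
    return total

def timed(trace):
    sys.settrace(trace)
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    sys.settrace(None)
    return elapsed

print("no tracing:       %.3fs" % timed(None))
print("empty trace func: %.3fs" % timed(empty_trace))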

The difference varies significantly depending on your code though – with the TSTL workload above I see a 50% slow down with an empty C level trace function. With more microbenchmark style workloads with few functions and lots of looping I see almost no speed loss.

So my suspicion is that for CPython at least we can get coverage to reliably be within a factor of two of running without it, and maybe reliably within a factor of 1.5.

What next

For now my plan is to shepherd the patch from the beginning of this post and otherwise ignore this problem for the next couple of months.

Based on the profiling I did above most of the time is currently being spent in PyDict_SetItem, so I think the obvious next line of attack when I do start working on this is to replace the file_data objects in coverage which currently use Python dictionaries keyed off Python values with some sort of specialized data structure (possibly another hash table, possibly something better optimized for write heavy workloads). Longer term I think the goal should be move all calls back into Python land out of the trace function and just normalize at the end of tracing.

Even the more modest goal is a bit more invasive of a change than I wanted to start with, hence the above rather more conservative patch, but there’s nothing conceptually difficult to it – it just involves a bunch of slog and basic engineering work followed by some experimenting with clever data structure designs.

Once I’ve made it through the next month or two I’ll start seeing about getting some work on this funded. I’ve already suggested the notion to an existing customer who I know is having problems with coverage performance, but if they don’t want to fund it (which would be totally understandable) I’ll look further afield.

My fall back plan is a Kickstarter for this, but honestly I think some motivated company who is suffering from this should just think about picking up the tab.

I’ve been doing a bunch of funded work on Hypothesis for Stripe and Smarkets (here and here so far, with more to come) and it’s been going great – it’s worked out well for me, them, and Hypothesis users at large, and I don’t see why other projects shouldn’t benefit from the same (I’d encourage paying people who are actually maintainers of those projects by default, but many of them including Ned have full time jobs and I’m more flexibly available).

We’re not talking about a lot of money – a couple of weeks of development work (and, say, a 10-20% extra consultancy fee for Ned) should be enough to see some serious performance improvements, and as well as making your developers much happier, you’ll also earn some serious community good will (which is great when it comes to hiring new developers). That’s probably less than a month of your normal meeting schedule costs you.

This is a problem that affects a large majority of people who care about testing Python, which should include most commercial Python users, so if that’s you why not get in touch?


The other half of binary search

Binary search is one of those classic algorithms that most people who know about algorithms at all will know how to do (and many will even be able to implement it correctly! Probably fewer than think they can though – it took me a long time to go from thinking I could implement binary search correctly to actually being able to do so).

Some of this is because the way people think about binary search is somewhat flawed. It’s often treated as being about searching sorted array data, when that’s really only one application of it. So let’s start with a review of the right way to think about binary search.

We have two integers \(a\) and \(b\) (probably non-negative, but it doesn’t matter), with \(a < b\). We also have some function \(f\) defined on integers \(a \leq i \leq b\), with \(f(a) \neq f(b)\). We want to find \(c\) with \(a \leq c < b\) such that \(f(c) \neq f(c + 1)\).

i.e. we’re looking to find a single exact point at which a function changes value. For functions that are monotonic (that is, either non-increasing or non-decreasing), this point will be unique, but in general it may not be.

To recover the normal idea of binary search, suppose we have some array \(xs\) of length \(n\). We want to find the smallest insertion point for some value \(v\).

To do this, we can use the function \(f(i)\) that returns whether \(xs[i] < v\). Either this function is constantly true (in which case every element is less than \(v\) and \(v\) should be inserted at the end), constantly false (in which case \(v\) should be inserted at the beginning), or the index \(i\) with \(f(i) \neq f(i + 1)\) is the point after which \(v\) should be inserted.

This also helps clarify the logic for writing a binary search:

def binary_search(f, lo, hi):
    # Invariant: f(hi) != f(lo)
    while lo + 1 < hi:
        assert f(lo) != f(hi)
        mid = (lo + hi) // 2
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo

Every iteration we cut the interval in half - because we know the gap between lo and hi is at least two whenever the loop body runs, the midpoint lies strictly between them, so this must reduce the length. If \(f\) gives the same value to the midpoint as to lo, it must be our new lower bound, if not it must be our new upper bound (note that generically in this case we might not have \(f(mid) = f(hi)\), though in the typical case where \(f\) only takes two values we will).
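
To tie this back to the sorted-array case, here is a small usage sketch (my own; the sentinel handling of the endpoints is not from the post) that recovers the least insertion point and checks it against bisect:

import bisect

def insertion_point(xs, v):
    # f(i) is True exactly when xs[i] < v; virtual endpoints give us
    # f(-1) = True and f(len(xs)) = False, so f(lo) != f(hi) always holds.
    def f(i):
        if i < 0:
            return True
        if i >= len(xs):
            return False
        return xs[i] < v
    return binary_search(f, -1, len(xs)) + 1

xs = [1, 3, 3, 7, 9]
for v in [0, 3, 4, 10]:
    assert insertion_point(xs, v) == bisect.bisect_left(xs, v)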

Anyway, all of this is beside the point of this post, it's just scene setting.

Because the point of this post is this: Is this actually optimal?

Generically, yes it is. If we consider the functions \(f_k(i) = i < k\), each value we examine can only cut out half of these functions, so we must ask at least \(\log_2(b - a)\) questions, so binary search is optimal. But that's the generic case. In a lot of typical cases we have something else going for us: Often we expect change to be quite frequent, or at least to be very close to the lower bound. For example, suppose we were binary searching for a small value in a sorted list. Chances are good it's going to be a lot closer to the left hand side than the right, but we're going to do a full \(\log_2(n)\) calls every single time. We can solve this by starting the binary search with an exponential probe - we try small values, growing the gap by a factor of two each time, until we find one that gives a different value. This then gives us a (hopefully smaller) upper bound, and a lower bound somewhat closer to that.

def exponential_probe(f, lo, hi):
    gap = 1
    while lo + gap < hi:
        if f(lo + gap) == f(lo):
            lo += gap
            gap *= 2
        else:
            return lo, lo + gap
    return lo, hi

We can then put these together to give a better search algorithm, by using the exponential probe as the new upper bound for our binary search:

def find_change_point(f, lo, hi):
    assert f(lo) != f(hi)
    return binary_search(f, *exponential_probe(f, lo, hi))

When the value found is near or after the middle, this will end up being more expensive by a factor of about two - we have to do an extra \(\log_2(n)\) calls to probe up to the midpoint - but when the heuristic wins it potentially wins big - it will often take the \(\log_2(n)\) factor (which although not huge can easily be in the 10-20 range for reasonable sized gaps) and turn it into 1 or 2. Complexity wise, this will run in \(O(\log(k - lo))\), where \(k\) is the value returned, rather than the original \(O(\log(hi - lo))\).

This idea isn't as intrinsically valuable as binary search, because it doesn't really improve the constant factors or the response to random data, but it's still very useful in a lot of real world applications. I first came across it in the context of timsort, which uses this to find a good merge point when merging two sublists in its merge step.

Edit to note: It was pointed out to me on Twitter that I'm relying on python's bigints to avoid the overflow problem that binary search will have if you implement it on fixed sized integers. I did know this at one point, but I confess I had forgotten. The above code works fine in Python, but if int is fixed size you want the following slightly less clear versions:


def midpoint(lo, hi):
    if lo <= 0 and hi >= 0:
        return (lo + hi) // 2
    else:
        return lo + (hi - lo) // 2

def binary_search(f, lo, hi):
    # Invariant: f(hi) != f(lo)
    while lo + 1 < hi:
        assert f(lo) != f(hi)
        mid = midpoint(lo, hi)
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo

def exponential_probe(f, lo, hi):
    gap = 1
    midway = midpoint(lo, hi)
    while True:
        if f(lo + gap) == f(lo):
            lo += gap
            if lo >= midway:
                break
            else:
                gap *= 2
        else:
            hi = lo + gap
            break
    return lo, hi

These avoid calculating any intermediate integers which overflow in the midpoint calculation:

  • If \(lo \leq 0\) and \(hi \geq 0\) then \(lo \leq hi + lo \leq hi\), so it is representable.
  • If \(lo \geq 0\) then \(0 \leq hi - lo \leq hi\), so it is representable.

The reason we need the two different cases is that e.g. if \(lo\) were INT_MIN and \(hi\) were INT_MAX, then \(hi - lo\) would overflow but \(lo + hi\) would be fine. Conversely if \(lo\) were INT_MAX - 1 and \(hi\) were INT_MAX, \(hi - lo\) would be fine but \(hi + lo\) would overflow.

The following should then be a branch free way of doing the same:

def midpoint(lo, hi):
    large_part = lo // 2 + hi // 2
    small_part = ((lo & 1) + (hi & 1)) // 2
    return large_part + small_part

We calculate (lo + hi) // 2 as lo // 2 + hi // 2, and then we fix up the rounding error this causes by calculating the midpoint of the low bits correctly. The intermediate parts don't overflow because we know the first sum fits in \([lo, hi]\), and the second fits in \([0, 1]\). The final sum also fits in \([lo, hi]\), so it doesn't overflow either.

I haven't verified this part too carefully, but Hypothesis tells me it at least works for Python's big integers, and I think it should still work for normal C integers.
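
For reference, the Hypothesis check I mean is roughly this sketch (only meaningful on Python's unbounded ints, where the naive (lo + hi) // 2 is itself safe to compute as the reference answer):

from hypothesis import given, strategies as st

@given(st.integers(), st.integers())
def test_midpoint_matches_naive_formula(a, b):
    lo, hi = min(a, b), max(a, b)
    assert midpoint(lo, hi) == (lo + hi) // 2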
