Category Archives: Python

Python Coverage could be fast

Ned Batchelder’s coverage.py is a foundation of the Python testing ecosystem. It is solid, well maintained, and does its job extremely well. I think literally every Python project that cares about testing should be using it.

But it’s not without its faults. Specifically, its performance can be quite bad. On some workloads it’s absolutely fine, but on others you can see anything up to an order of magnitude slow down (and this is just on CPython. On pypy it can be even worse).

Recently, after some rather questionable late night hacking (and a significant amount of fixing not late at night), I finally made good on my promise that Hypothesis would eventually use coverage information and shipped Hypothesis 3.29.0 which added a dependency on Coverage and turned it on for every Hypothesis based test.

This hasn’t been an entirely smooth process – some for reasons that are my fault, and some that users are now running into these performance problems.

The right place to fix this is obviously in Coverage rather than Hypothesis itself, so I’ve been looking into this recently. I’ve already made one patch which gives branch coverage a 50% speedup in a loop-based microbenchmark and about a 20% speedup on one real world example I tried it on.

The idea of the patch is very straightforward (though apparently I’m a unusual in thinking that “Here I wrote you a hash table!!!” is a straightforward patch). Coverage creates a lot of Python objects in a very hot part of the code, so this caches them off an integer key so that most of the time it can omit creating those objects and significantly speed things up as a result.

Unfortunately that’s probably it for now. My priorities for the immediate future are PhD, paid work, and conference prep, which means that I certainly don’t have any time in the next month and probably not the next couple of months (this could be fixed by making this paid work. I may do a kickstarter or something for that, but in the meantime if any interested companies wanted to fund this work I’d be more than happy to discuss it…).

So I thought I’d write down my notes before I forget. These are both for future-me and for anyone interested who feels motivated to work on this problem in the meantime.

Initial Benchmarking

I haven’t done super formal benchmarking, but I set up pytest-benchmark with some basic benchmarks that just ran a loop adding numbers.

The benchmarking functionality itself was pretty great but I didn’t find it easy to compare benchmarks in the way that I wanted – in particular I had the same benchmark which was run in three different ways (no coverage, line coverage, branch coverage) and I wanted to break those down for side-by-side comparison, but I couldn’t find any functionality to do so (I admit I didn’t look very hard). It was nice having the statistics handled though and I will almost certainly want to sink some time into getting a decent pytest-benchmark suite for coverage if I work on this further.

Real World Benchmarking

The real world benchmark I used was Alex Groce’s tstl, because his usage profile is similar to mine (there’s a lot of overlap between what Hypothesis and TSTL do), and he had an existing example that was seeing an order of magnitude slow down. This is the example that gets a 20% speedup from my above patch.

The problem can be reproduced as follows:

git clone https://github.com/agroce/tstl.git
cd tstl
virtualenv v
source v/bin/activate
pip install .
cd examples/AVL
tstl avlnodisp.tstl
tstl_rt --seed=0 --timeout=30 
tstl_rt --seed=0 --timeout=30 --noCover

The thing to compare is the total number of test operations run in the outputs from each of the test_rt commands. For me I see “12192 TOTAL TEST OPERATIONS” with coverage, and “96938 TOTAL TEST OPERATIONS” without, so it runs about 8 times as many operations in the same time frame with coverage turned off (this is without my patch. With my patch I get 14665 under coverage, so about 20% more).

Profiling Coverage

I confess I didn’t figure out how to profile coverage until after I made the above patch. I had a benchmark I was measuring, and just based on inspection I was more than 90% certain that the above would help, so I decided to just give it a go and validate my suspicion and turned out to be right.

But after I’d picked the most obvious low hanging fruit I figured it would be silly to try to proceed further without getting profiling set up, so I poked around. I spent a little time trying to get google-perf-tools working with Python and failing, but eventually figured out that I could do it with perf and it works great (modulo quite a lot of hacking and data munging).

The basic idea with perf is that you run your program under “perf record” and it gives you raw output data. You can then do analysis on this to find out about your program’s performance.

The first thing to do to use perf is that you need to make sure that everything is compiled with debug symbols. This includes both your extension and Python itself.

To get a Python with debug symbols I used pyenv‘s python-build plugin:

export PYTHON_CFLAGS='-pg'
~/.pyenv/plugins/python-build/bin/python-build 2.7.13 ~/debug-python2

This builds a version of Python 2 (TSTL doesn’t work under Python 3) with the “-pg” flag to gcc which includes debug symbols. I also modified setup.py for coverage to include  extra_compile_args=[‘-pg’] (it should be possible to do this with an environment variable, but I didn’t try).

Once I had that, running under perf was straightforward:

perf record tstl_rt --seed=0 --timeout=30

This creates a file called perf.data that you can analyze. I did not find prof report, the default way of analyzing it, super helpful, so I used CPU Flame Graphs.

I was only interested in the performance for calls below CTracer_trace, and I didn’t find the way it was spread out in the SVG (there were a lot of bits) very helpful, so I ended up aggregating the data through some um very sophisticated data analysis tools as follows:

perf script > out.perf && \
    ~/scratch/FlameGraph/stackcollapse-perf.pl out.perf | \
    grep CTracer | sed 's/.\+;CTracer_trace/CTracer_trace/' | \
    sort | \
    python sum.py > out2.folded
~/scratch/FlameGraph/flamegraph.pl out2.folded > out.svg

sum.py is the following very basic code:

from __future__ import print_function

import sys

if __name__ == '__main__':
        prev = None
        for l in sys.stdin:
                u, v = l.split()
                v = int(v)
                if prev is None:
                        prev = u
                        count = v
                elif prev == u:
                        count += v
                else:
                        print(prev, count)
                        prev = u
                        count = v
        print(prev, count)

(the data munging earlier creates duplicated entries, so this merges them together).

WordPress won’t let me upload the generated SVG “For Security Reasons” (that do not apparently preclude running WordPress itself), so here’s a gist of it, and her’es one from before my patch was applied (right click and view image in a new tab to get a usable interactive version of it)

pypy

PyPy performance for coverage is more complicated. My understanding is that there are roughly three classes of problems here:

  1. coverage itself is not as well optimised as it could be (same problem as CPython)
  2. Using settrace interferes with the JIT
  3. The baseline speed of pypy operations is much faster so coverage is a much higher overhead by comparison.

The first is the only one I could deal with, and I think it probably will benefit significantly from whatever changes I make to the C extension also being ported over to the pure Python version (pypy doesn’t use the C extension tracer because it’s even slower than the pure python one on pypy), but I’m not sure that will be enough – I suspect there are also going to be more changes to pypy internals required for this, and I don’t know enough about those to say how difficult or easy they are.

The Limits of What Is Possible

Python coverage is never going to run at the speed of Python without coverage, especially on pypy.

You can see the limits of how much faster it could be by running with an empty trace function (note: Whether you are using a C or Python level trace function makes a big difference. sys.settrace is ruinously slow even with an empty function).

The difference varies significantly depending on your code though – with the TSTL workload above I see a 50% slow down with an empty C level trace function. With more microbenchmark style workloads with few functions and lots of looping I see almost no speed loss.

So my suspicion is that for CPython at least we can get coverage to reliably be within a factor of two of running without it, and maybe reliably within a factor of 1.5.

What next

For now my plan is to shepherd the patch from the beginning of this post and otherwise ignore this problem for the next couple of months.

Based on the profiling I did above most of the time is currently being spent in PyDict_SetItem, so I think the obvious next line of attack when I do start working on this is to replace the file_data objects in coverage which currently use Python dictionaries keyed off Python values with some sort of specialized data structure (possibly another hash table, possibly something better optimized for write heavy workloads). Longer term I think the goal should be move all calls back into Python land out of the trace function and just normalize at the end of tracing.

Even the more modest goal is a bit more invasive of a change than I wanted to start with, hence the above rather more conservative patch, but there’s nothing conceptually difficult to it – it just involves a bunch of slog and basic engineering work followed by some experimenting with clever data structure designs.

Once I’ve made it through the next month or two I’ll start seeing about getting some work on this funded. I’ve already suggested the notion to an existing customer who I know is having problems with coverage performance, but if they don’t want to fund it (which would be totally understandable) I’ll look further afield.

My fall back plan is a Kickstarter for this, but honestly I think some motivated company who is suffering from this should just think about picking up the tab.

I’ve been doing a bunch of funded work on Hypothesis for Stripe and Smarkets (here and here so far, with more to come) and it’s been going great – it’s worked out well for me, them, and Hypothesis users at large, and I don’t see why other projects shouldn’t benefit from the same (I’d encourage paying people who are actually maintainers of those projects by default, but many of them including Ned have full time jobs and I’m more flexibly available).

We’re not talking about a lot of money – a couple of weeks of development work (and, say, a 10-20% extra consultancy fee for Ned) should be enough to see some serious performance improvements, and as well as making your developers much happier, you’ll also earn some serious community good will (which is great when it comes to hiring new developers). That’s probably less than a month of your normal meeting schedule costs you.

This is a problem that affects a large majority of people who care about testing Python, which should include most commercial Python users, so if that’s you why not get in touch?

This entry was posted in Python on by .

The other half of binary search

Binary search is one of those classic algorithms that most people who know about algorithms at all will know how to do (and many will even be able to implement correctly! Probably fewer than think they can though – it took me a long time to go to thinking I could implement binary search correctly to actually being able to implement it correctly).

Some of this is because the way people think about binary search is somewhat flawed. It’s often treated as being about sorted arrays data, when that’s really only one application of it. So lets start with a review of what the right way to think about binary search is.

We have two integers \(a\) and \(b\) (probably non-negative, but it doesn’t matter), with \(a < b\). We also have some function that takes integers \(a \leq i \leq b\), with \(f(a) \neq f(b)\). We want to find \(c\) with \(a \leq c < b\) such that \(f(c) \neq f(c+ 1)\).

i.e. we’re looking to find a single exact point at which a function changes value. For functions that are monotonic (that is, either non-increasing or non-decreasing), this point will be unique, but in general it may not be.

To recover the normal idea of binary search, suppose we have some array \(xs\) of length \(n\). We want to find the smallest insertion point for some value \(v\).

To do this, we can use the function \(f(i)\) that that returns whether \(xs[i] < v\). Either this function is constantly true (in which case every element is < v and v should be inserted at the end), constantly false (in which case v should be inserted at the beginning), or the index i with \(f(i) \neq f(i + 1)\) is the point after which \(v\) should be inserted.

This also helps clarify the logic for writing a binary search:

def binary_search(f, lo, hi):
    # Invariant: f(hi) != f(lo)
    while lo + 1 < hi:
        assert f(lo) != f(hi)
        mid = (lo + hi) // 2
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo

Every iteration we cut the interval in half - because we know the gap between them is at least one, this must reduce the length. If \(f\) gives the same value to the midpoint as to lo, it must be our new lower bound, if not it must be our new upper bound (note that generically in this case we might not have \(f(mid) = f(hi)\), though in the typical case where \(f\) only takes two values we will).

Anyway, all of this is besides the point of this post, it's just scene setting.

Because the point of this post is this: Is this actually optimal?

Generically, yes it is. If we consider the functions \(f_k(i) = i < k\), each value we examine can only cut out half of these functions, so we must ask at least \(\log_2(b - a)\) questions, so binary search is optimal. But that's the generic case. In a lot of typical cases we have something else going for us: Often we expect change to be quite frequent, or at least to be very close to the lower bound. For example, suppose we were binary searching for a small value in a sorted list. Chances are good it's going to be a lot closer to the left hand side than the right, but we're going to do a full \(\log_2(n)\) calls every single time. We can solve this by starting the binary search with an exponential probe - we try small values, growing the gap by a factor of two each time, until we find one that gives a different value. This then gives us a (hopefully smaller) upper bound, and a lower bound somewhat closer to that.

def exponential_probe(f, lo, hi):
    gap = 1
    while lo + gap < hi:
        if f(lo + gap) == f(lo):
            lo += gap
            gap *= 2
        else:
            return lo, lo + gap
    return lo, hi

We can then put these together to give a better search algorithm, by using the exponential probe as the new upper bound for our binary search:

def find_change_point(f, lo, hi):
    assert f(lo) != f(hi)
    return binary_search(f, *exponential_probe(f, lo, hi))

When the value found is near or after the middle, this will end up being more expensive by a factor of about two - we have to do an extra \(\log_2(n)\) calls to probe up to the midpoint - but when the heuristic wins it potentially wins big - it will often take the \(\log_2(n)\) factor (which although not huge can easily be in the 10-20 range for reasonable sized gaps) and turn it into 1 or 2. Complexity wise, this will run in \(O(\log(k - lo)\), where \(k\) is the value returned, rather than the original \(O(hi - lo)\).

This idea isn't as intrinsically valuable as binary search, because it doesn't really improve the constant factors or the response to random data, but it's still very useful in a lot of real world applications. I first came across it in the context of timsort, which uses this to find a good merge point when merging two sublists in its merge step.

Edit to note: It was pointed out to me on Twitter that I'm relying on python's bigints to avoid the overflow problem that binary search will have if you implement it on fixed sized integers. I did know this at one point, but I confess I had forgotten. The above code works fine in Python, but if int is fixed size you want the following slightly less clear versions:

def midpoint(lo, hi):
    if lo <= 0 and hi >= 0:
        return (lo + hi) // 2
    else:
        return lo + (hi - lo) // 2

def binary_search(f, lo, hi):
    # Invariant: f(hi) != f(lo)
    while lo + 1 < hi:
        assert f(lo) != f(hi)
        mid = midpoint(lo, hi)
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo

def exponential_probe(f, lo, hi):
    gap = 1
    midway = midpoint(lo, hi)
    while True:
        if f(lo + gap) == f(lo):
            lo += gap
            if lo >= midway:
                break
            else:
                gap *= 2
         else:
            hi = lo + gap
            break
    return lo, hi

These avoid calculating any intermediate integers which overflow in the midpoint calculation:

  • If \(lo \leq 0\) and \(hi \geq 0\) then \(lo \leq hi + lo \leq hi\), so is representable.
  • If \(lo \geq 0\) then \(0 \leq hi - lo \leq hi\), so is representable.

The reason we need the two different cases is that e.g. if \(lo\) were INT_MIN and \(hi\) were INT_MAX, then \(hi - lo\) would overflow but \(lo + hi\) would be fine. Conversely if \(lo\) were INT_MAX - 1 and \(hi\) were INT_MAX, \(hi - lo\) would be fine but \(hi + lo\) would overflow.

The following should then be a branch free way of doing the same:

def midpoint(lo, hi):
    large_part = lo // 2 + hi // 2
    small_part = ((lo & 1) + (hi & 1)) // 2
    return large_part + small_part

We calculate (x + y) // 2 as x // 2 + y // 2, and then we fix up the rounding error this causes by calculating the midpoint of the low bits correctly. The intermediate parts don't overflow because we know the first sum fits in \([lo, hi]\), and the second fits in \([0, 1]\). The final sum also fits in \([lo, hi]\) so also doesn't overflow.

I haven't verified this part too carefully, but Hypothesis tells me it at least works for Python's big integers, and I think it should still work for normal C integers.

This entry was posted in programming, Python on by .

Linear time (ish) test case reduction

A problem I’m quite interested in, primarily from my work on Hypothesis, is test case reduction: Taking an example that produces a bug, and producing a smaller version that triggers the same bug.

Typically a “test-case” here means a file that when fed to a program triggers a crash or some other wrong behaviour. In the abstract, I usually think of sequences as the prototypical target for test case reduction. You can view a file as a sequence in a number of usefully different ways – e.g. as bytes, as lines, etc. and they all are amenable to broadly similar algorithms.

The seminal work on test-case reduction is Zeller and Hildebrant’s “Simplifying and Isolating Failure-Inducing Input“, in which they introduce delta debugging. Delta-debugging is essentially an optimisation to the greedy algorithm which removes one item at a time from the sequence and sees if that still triggers the bug. It repeatedly applies this greedy algorithm to increasingly fine partitions of the test case, until the partition is maximally fine.

Unfortunately the greedy algorithm as described in their paper, and as widely implemented, contains an accidentally quadratic bug. This problem is not present in the reference implementation, but it is present in many implementations of test case reduction found in the wild, including Berkeley delta and QuickCheck. Hypothesis gets this right, and has for so long that I forgot until quite recently that this wasn’t more widely known.

To see the problem, lets look at a concrete implementation. In Python, the normal implementation of the greedy algorithm looks something like this:

def greedy(test_case, predicate):
    while True:
        for i in range(len(test_case)):
           attempt = list(test_case)
           del attempt[i]
           if predicate(attempt):
               test_case = attempt
               break
        else:
           break
    return test_case

We try deleting each index in turn. If that succeeds, we start again. If not, we’re done: No single item can be deleted from the list.

But consider what happens if our list is, say, the numbers from 1 through 10, and we succeed at deleting a number from the list if and only if it’s even.

When we run this algorithm we try the following sequence of deletes:

  • delete 1 (fail)
  • delete 2 (succeed)
  • delete 1 (fail)
  • delete 3 (fail)
  • delete 4 (succeed)
  • delete 1 (fail)

Every time we succeed in deleting an element, we start again from scratch. As a result, we have a classic accidentally quadratic bug.

This is easy to avoid though. Instead of starting again from scratch, we continue at our current index:

def greedy(test_case, predicate):
    deleted = True
    while deleted:
        deleted = False
        i = 0
        while i <  len(test_case):
           attempt = list(test_case)
           del attempt[i]
           if predicate(attempt):
               test_case = attempt
               deleted = True
           else:
               i += 1
    return test_case

At each stage we either successfully reduce the size of the test case by 1 or we advance the loop counter by one, so this loop makes progress. It stops in the same case as before (when we make it all the way through with no deletions), so it still achieves the same minimality guarantee that we can’t delete any single element.

The reason this is only linear time “ish” is that you might need to make up to n iterations of the loop (this is also true with the original algorithm) because deleting an element might unlock another element. Consider e.g. the predicate that our sequence must consist of the numbers 1, …, k for some k – we can only every delete the last element, so we must make k passes through the list. Additionally, if we consider the opposite where it must be the numbers from k to n for some k, this algorithm is quadratic while the original one is linear!

I believe the quadratic case to be essential, and it’s certainly essential if you consider only algorithms that are allowed to delete at most one element at a time (just always make the last one they try in any given sequence succeed), but anecdotally most test cases found in the wild don’t have nearly this high a degree of dependency among them, and indexes that previously failed to delete tend to continue to fail to delete.

A model with a strong version of this as its core assumption shows up the complexity difference: Suppose you have \(k\) indices out of \(n\) which are essential, and every other index can be deleted. Then this algorithm will always run in \(n + k\) steps (\(n\) steps through the list the first time, \(k\) steps at the end to verify), while the classic greedy algorithm will always run in \(O(k^2 + n)\) steps.

Although this model is overly simplistic, anecdotally examples found in the wild tend to have something like this which acts as an “envelope” of the test case – there’s some large easily deletable set and a small essential core. Once you’re down to the core you have to do more work to find deletions, but getting down to the core is often the time consuming part of the process.

As a result I would be very surprised to find any instances in the wild where switching to this version of the algorithm was a net loss.

This entry was posted in programming, Python on by .

A worked example of designing an unusual data structure

Due to reasons, I found myself in need of a data structure supporting a slightly unusual combination of operations. Developing it involved a fairly straightforward process of refinement, and a number of interesting tricks, so I thought it might be informative to people to walk through (a somewhat stylised version of) the design.

The data structure is a particular type of random sampler, starting from a shared array of values (possibly containing duplicates). Values are hashable and comparable for equality.

It needs to support the following operations:

  1. Initialise from a random number generator and a shared immutable array of values so that it holds all those values.
  2. Sample an element uniformly at random from the remaining values, or raise an error if there are none.
  3. Unconditionally (i.e. without checking whether it’s present) remove all instances of a value from the list.

The actual data structure I want is a bit more complicated than that, but those are enough to demonstrate the basic principles.

What’s surprising is that you can do all of these operations in amortised O(1). This includes the initialisation from a list of n values!

The idea behind designing this is to start with the most natural data structure that doesn’t achieve these bounds and then try to refine it until it does. That data structure is a resizable array. You can sample uniformly by just picking an index into the array. You can delete by doing a scan and deleting the first element that is equal to the value. This means you have to be able to mutate the array, so initalising it requires copying.

Which means it’s time for some code.

Let’s start by writing some code.

First lets write a test suite for this data structure:

from collections import Counter
 
from hypothesis.stateful import RuleBasedStateMachine, rule, precondition
import hypothesis.strategies as st
 
from sampler import Sampler
 
 
class FakeRandom(object):
    def __init__(self, data):
        self.__data = data
 
    def randint(self, m, n):
        return self.__data.draw(st.integers(m, n), label="randint(%d, %d)" % (
            m, n
        ))
 
 
class SamplerRules(RuleBasedStateMachine):
    def __init__(self):
        super(SamplerRules, self).__init__()
        self.__initialized = False
 
    @precondition(lambda self: not self.__initialized)
    @rule(
        values=st.lists(st.integers()).map(tuple),
        data=st.data()
    )
    def initialize(self, values, data):
        self.__initial_values = values
        self.__sampler = Sampler(values, FakeRandom(data))
        self.__counts = Counter(values)
        self.__initialized = True
 
    @precondition(lambda self: self.__initialized)
    @rule()
    def sample(self):
        if sum(self.__counts.values()) != 0:
            v = self.__sampler.sample()
            assert self.__counts[v] != 0
        else:
            try:
                self.__sampler.sample()
                assert False, "Should have raised"
            except IndexError:
                pass
 
    @precondition(lambda self: self.__initialized)
    @rule(data=st.data())
    def remove(self, data):
        v = data.draw(st.sampled_from(self.__initial_values))
        self.__sampler.remove(v)
        self.__counts[v] = 0
 
TestSampler = SamplerRules.TestCase

This uses Hypothesis’s rule based stateful testing to completely describe the valid range of behaviour of the data structure. There are a number of interesting and possibly non-obvious details in there, but this is a data structures post rather than a Hypothesis post, so I’m just going to gloss over them and invite you to peruse the tests in more detail at your leisure if you’re interested.

Now lets look at an implementation of this, using the approach described above:

class Sampler(object):
    def __init__(self, values, random):
        self.__values = list(values)
        self.__random = random
 
    def sample(self):
        if not self.__values:
            raise IndexError("Cannot sample from empty list")
        i = self.__random.randint(0, len(self.__values) - 1)
        return self.__values[i]
 
    def remove(self, value):
        self.__values = [v for v in self.__values if v != value]

The test suite passes, so we’ve successfully implemented the operations (or our bugs are too subtle for Hypothesis to find in a couple seconds). Hurrah.

But we’ve not really achieved our goals: Sampling is O(1), sure, but remove and initialisation are both O(n). How can we fix that?

The idea is to incrementally patch up this data structure by finding the things that make it O(n) and seeing if we can defer the cost for each element until we actually definitely need to incur that cost to get the correct result.

Let’s start by fixing removal.

The first key observation is that if we don’t care about the order of values in a list (which we don’t because we only access it through random sampling), we can remove the element present at an index in O(1) by popping the element that is at the end of the list and overwriting the index with that value (if it wasn’t the last index). This is the approach normally taken if you want to implement random sampling without replacement, but in our use case we’ve separated removal from sampling so it’s not quite so easy.

The problem is that we don’t know where (or even if) the value we want to delete is in our array, so we still have to do an O(n) scan to find it.

One solution to this problem (which is an entirely valid one) is to have a mapping of values to the indexes they are found in. This is a little tricky to get right with duplicates, but it’s an entirely workable solution. It makes it much harder to do our O(1) initialize later though, so we’ll not go down this route.

Instead the idea is to defer the deletion until we know of an index for it, which we can do during our sampling! We keep a count of how many times a value has been deleted and, if we end up sampling it and the count is non-zero, we remove it from the list and decrement the count by one.

This means that we potentially pay an additional O(deletions) cost each time we sample, but these costs are “queued up” from the previous delete calls, and once incurred do not repeat, so this doesn’t break our claim of O(1) amortised time – the costs we pay on sampling are just one-off deferred costs from earlier.

Here’s some code:

class Sampler(object):
    def __init__(self, values, random):
        self.__values = list(values)
        self.__random = random
        self.__deletions = set()
 
    def sample(self):
        while True:
            if not self.__values:
                raise IndexError("Cannot sample from empty list")
            i = self.__random.randint(0, len(self.__values) - 1)
            v = self.__values[i]
            if v in self.__deletions:
                replacement = self.__values.pop()
                if i != len(self.__values):
                    self.__values[i] = replacement
            else:
                return v
 
    def remove(self, value):
        self.__deletions.add(value)

So now we’re almost done. All we have to do is figure out some way to create a mutable version of our immutable list in O(1).

This sounds impossible but turns out to be surprisingly easy.

The idea is to create a mask in front of our immutable sequence, which tracks a length and a mapping of indices to values. Whenever we write to the mutable “copy” we write to the mask. Whenever we read from the copy, we first check that it’s in bounds and if it is we read from the mask. If the index is not present in the mask we read from the original sequence.

The result is essentially a sort of very fine grained copy on write – we never have to look at the whole sequence, only the bits that we are reading from, so we never have to do O(n) work.

Here’s some code:

from collections import Counter
 
 
class Sampler(object):
    def __init__(self, values, random):
        self.__values = values
 
        self.__length = len(values)
        self.__mask = {}
        self.__random = random
        self.__deletions = set()
 
    def sample(self):
        while True:
            if not self.__length:
                raise IndexError("Cannot sample from empty list")
            i = self.__random.randint(0, self.__length - 1)
            try:
                v = self.__mask[i]
            except KeyError:
                v = self.__values[i]
            if v in  self.__deletions:
                j = self.__length - 1
                try:
                    replacement = self.__mask.pop(j)
                except KeyError:
                    replacement = self.__values[j]
                self.__length = j
                if i != j:
                    self.__mask[i] = replacement
            else:
                return v
 
    def remove(self, value):
        self.__deletions.add(value)

And that’s it, we’re done!

There are more optimisations we could do here – e.g. the masking trick is relatively expensive, so it might make sense to switch back to a mutable array once we’ve masked off the entirety of the array, e.g. using a representation akin to the pypy dict implementation and throwing away the hash table part when the value array is of full length.

But that would only improve the constants (you can’t get better than O(1) asymptotically!), so I’d be loathe to take on the increased complexity until I saw a real world workload where this was the bottleneck (which I’m expecting to at some point if this idea bears fruit, but I don’t yet know if it will). We’ve got the asymptotics I wanted, so lets stop there while the implementation is fairly simple.

I’ve yet to actually use this in practice, but I’m still really pleased with the design of this thing.  Starting from a fairly naive initial implementation, we’ve used some fairly generic tricks to patch up what started out as O(n) costs and turn them O(1). As well as everything dropping out nicely, a lot of these techniques are probably reusable for other things (the masking trick in particular is highly generic).

Update 09/4/2017: An earlier version of this claimed that this solution allowed you to remove a single instance of a value from the list. I’ve updated it to a version that removes all values from a list, due to a comment below correctly pointing out that that approach biases the distribution. Fortunately for me in my original use case the values are all distinct anyway so the distinction doesn’t matter, but I’ve now updated the post and the code to remove all instances of the value from the list.


Do you like data structures? Of course you do! Who doesn’t like data structures? Would you like more data structures? Naturally. So why not sign up for my Patreon and tell me so, so you can get more exciting blog posts like this! You’ll get access to drafts of upcoming blog posts, a slightly increased blogging rate from me, and the warm fuzzy feeling of supporting someone whose writing you enjoy.

This entry was posted in programming, Python on by .

My time: Now available to the highest bidder

This is an experiment.

Would you like me to do some work for you? Specifically, one day worth of work?

Well, that option is of course always open to you and you can email me to talk about it. But maybe you’ve been worried that I’m too busy, too expensive, etc. Or maybe you just haven’t got around to it.

So I’m going to try an experiment: I am auctioning off one work day of my time next month to the highest bidder. It can be for anything you like – Hypothesis, writing, having someone come in to take a look at your business and give you some pointers on your software development, or really anything else at all. This can be more or less whatever you want as long as it’s something I’m in principle willing to do for money (and if it’s paying me to do a concrete thing like write some software or an article rather than general consulting, I’m happy for this to include rights assignment).

Why am I doing this? The prompting thought was that I’m looking into ways to fill precisely one day of my time per month once I start doing my PhD, which will hopefully be in September – PhD stipends are a lot lower than developer salaries or day rates, so a top up would be nice, but I don’t want to do my PhD part time.

But once I thought about it I realised that it just seemed like a really fun experiment to try, so I figured I might as well try it now! Worst case scenario, we all learn something interesting and the cost of a day of my time.

Here are the rules:

  1. You are getting one day of work. If I think what you have asked for cannot be done in one day I will tell you up front, but if it takes longer than expected then I’m not going to continue working on it for free. I recommend picking things that either are intrinsically time bounded or can be stopped at any time and still be useful, but small projects that I’m confident can be done in a day are OK too.
  2. The auction will be a Vickrey auction. That is, the highest bidder wins the auction and pays the second highest bidder’s bid. That means that you will usually pay less than your bid (And also means that the game theoretically correct amount for you to bid is exactly the amount my time is worth to you).
  3. If there is only one bidder then that means they get a day of my work for free.
  4. The auction will run until midnight BST (UTC + 1) on April 30th. No bids made after that point will be counted.
  5. If there is a tie for top place then I will pick the one I most want to do among them (or flip a coin or something if I’m torn) and charge the full bid.
  6. If I judge a job to be illegal, unethical, or I just really don’t want to do it for any reason then I will remove it from the list and not count it for either the top or the second place bid. I will let you know if I do this if you would have been the first or second place bid and I don’t think you’re being deliberately abusive.
  7. I will publish both the winning bid and what they actually paid (and, if it is acceptable to the winner, who won) once the auction completes.
  8. The day of work must be redeemed at some point in May, after which I will send you an invoice. If we can’t come to a mutually acceptable time in May I will rerun the auction without your bid and offer the spot to the new winner and you won’t get charged anything.
  9. If the job cannot be done remotely I will also invoice you for travel (and, if necessary, accommodation) on top of the bid.

Note that yes you can bid ridiculously small amounts, or even zero, and if that’s the winning bid then I will honour it (though I will probably not repeat the experiment for very long if that happens). You can also bid ridiculously high amounts and hope that the next best bid isn’t ridiculous, but I don’t recommend it (and may query your willingness to pay before accepting…). The winning strategy really is to bid what you think my time is worth to you.

Interested? Just fill out this form! I will let you know if you win the bid, and we can then arrange dates accordingly.

If you’re still on the fence, here are some ideas for how a day of my time can be useful to you:

  1. I can write a short article (or story if you prefer!) about something I already know a reasonable amount about. Very few of my blog posts take more than a day of work.
  2. I can read a paper and discuss it with you and/or try to write a short precis and introduction to its ideas (whether this is feasible will vary based on the paper, but I can get a pretty good idea if it’s going to be with a brief skim).
  3. I can add tests using Hypothesis to a small Python project.
  4. If working with someone who knows that project well, I can add tests using Hypothesis to a larger Python project.
  5. I can come in to your company and answer peoples’ questions on a subject of your choosing, which will probably be Hypothesis, but could be Integer Linear Programming, Voting Theory, or general software development.
  6. I can come in to your company and take a look at your development practices and processes and offer you tips on how to improve them.
  7. I can come in to your company and help with improving your interviewing process.

Those are just some ideas though! The point is that I have a variety of expertise, a lot of which can be usefully applied in under a day if you pick a reasonably small project or just use me as a general consultant.

Note that if you do have any general desire to bring me in as a consultant you should put in a bid, because you will almost certainly get a very good deal on this auction compared to my normal rates.

This entry was posted in programming, Python, Work on by .