Ned Batchelder’s coverage.py is a foundation of the Python testing ecosystem. It is solid, well maintained, and does its job extremely well. I think literally every Python project that cares about testing should be using it.
But it’s not without its faults. Specifically, its performance can be quite bad. On some workloads it’s absolutely fine, but on others you can see anything up to an order-of-magnitude slowdown (and that’s just on CPython; on PyPy it can be even worse).
Recently, after some rather questionable late night hacking (and a significant amount of fixing not late at night), I finally made good on my promise that Hypothesis would eventually use coverage information, and shipped Hypothesis 3.29.0, which added a dependency on Coverage and turned it on for every Hypothesis-based test.
This hasn’t been an entirely smooth process – partly for reasons that are my fault, and partly because users are now running into these performance problems.
The right place to fix this is obviously in Coverage rather than Hypothesis itself, so I’ve been looking into this recently. I’ve already made one patch which gives branch coverage a 50% speedup in a loop-based microbenchmark and about a 20% speedup on one real world example I tried it on.
The idea of the patch is very straightforward (though apparently I’m unusual in thinking that “Here I wrote you a hash table!!!” is a straightforward patch). Coverage creates a lot of Python objects in a very hot part of the code, so the patch caches them off an integer key so that most of the time it can avoid creating those objects, and significantly speeds things up as a result.
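The real patch is a hash table inside coverage’s C tracer, but the shape of the idea translates to a few lines of Python. Here is a hedged sketch – the names are mine for illustration, not coverage’s API:

    # Branch coverage records each arc as a (previous_line, current_line)
    # pair. Building that pair object on every trace event is expensive,
    # so cache it off a single packed integer key and reuse it.
    _arc_cache = {}

    def arc_for(prev_line, cur_line):
        # Pack the two line numbers into one integer key (this sketch
        # assumes line numbers fit in 20 bits).
        key = (prev_line << 20) ^ cur_line
        arc = _arc_cache.get(key)
        if arc is None:
            arc = _arc_cache[key] = (prev_line, cur_line)
        return arc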
Unfortunately that’s probably it for now. My priorities for the immediate future are PhD, paid work, and conference prep, which means that I certainly don’t have any time in the next month and probably not the next couple of months (this could be fixed by making this paid work. I may do a kickstarter or something for that, but in the meantime if any interested companies wanted to fund this work I’d be more than happy to discuss it…).
So I thought I’d write down my notes before I forget. These are both for future-me and for anyone interested who feels motivated to work on this problem in the meantime.
I haven’t done super formal benchmarking, but I set up pytest-benchmark with some basic benchmarks that just ran a loop adding numbers.
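For concreteness, here’s a sketch of the sort of benchmark I mean (not the actual suite). pytest-benchmark provides a `benchmark` fixture that times the callable you hand it; you then run the same test plain, under line coverage, and under branch coverage to get the three comparisons:

    def add_numbers(n=10000):
        total = 0
        for i in range(n):
            total += i
        return total

    def test_add_numbers(benchmark):
        # The benchmark fixture calls the function repeatedly and
        # collects timing statistics.
        benchmark(add_numbers)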
The benchmarking functionality itself was pretty great, but I didn’t find it easy to compare benchmarks in the way that I wanted. In particular, I had the same benchmark run in three different ways (no coverage, line coverage, branch coverage) and wanted to break those down for side-by-side comparison, but I couldn’t find any functionality to do so (I admit I didn’t look very hard). It was nice having the statistics handled, though, and I will almost certainly want to sink some time into building a decent pytest-benchmark suite for coverage if I work on this further.
Real World Benchmarking
The real world benchmark I used was Alex Groce’s tstl, because his usage profile is similar to mine (there’s a lot of overlap between what Hypothesis and TSTL do), and he had an existing example that was seeing an order of magnitude slow down. This is the example that gets a 20% speedup from my above patch.
The problem can be reproduced as follows:
    git clone https://github.com/agroce/tstl.git
    cd tstl
    virtualenv v
    source v/bin/activate
    pip install .
    cd examples/AVL
    tstl avlnodisp.tstl
    tstl_rt --seed=0 --timeout=30
    tstl_rt --seed=0 --timeout=30 --noCover
The thing to compare is the total number of test operations run in the outputs of the two tstl_rt commands. For me, I see “12192 TOTAL TEST OPERATIONS” with coverage and “96938 TOTAL TEST OPERATIONS” without, so it runs about 8 times as many operations in the same time frame with coverage turned off (this is without my patch; with it I get 14665 under coverage, about 20% more).
I confess I didn’t figure out how to profile coverage until after I made the above patch. I had a benchmark I was measuring, and just based on inspection I was more than 90% certain that the above would help, so I decided to just give it a go to validate my suspicion, and it turned out I was right.
But after I’d picked the most obvious low hanging fruit I figured it would be silly to try to proceed further without getting profiling set up, so I poked around. I spent a little time trying to get google-perf-tools working with Python and failing, but eventually figured out that I could do it with perf and it works great (modulo quite a lot of hacking and data munging).
The basic idea with perf is that you run your program under “perf record” and it gives you raw output data. You can then do analysis on this to find out about your program’s performance.
The first thing you need to do to use perf is make sure that everything is compiled with debug symbols. This includes both your extension and Python itself.
To get a Python with debug symbols I used pyenv‘s python-build plugin:
    export PYTHON_CFLAGS='-pg'
    ~/.pyenv/plugins/python-build/bin/python-build 2.7.13 ~/debug-python2
This builds a version of Python 2 (TSTL doesn’t work under Python 3) with the “-pg” flag to gcc, which includes debug symbols. I also modified coverage’s setup.py to pass extra_compile_args=['-pg'] to the C extension (it should be possible to do this with an environment variable, but I didn’t try).
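Roughly, from memory, the local change looks something like this (the source list here is abbreviated and illustrative, not coverage’s exact setup.py):

    # Add -pg to the C tracer's compile flags so perf can attribute
    # samples to its functions.
    from setuptools import Extension

    tracer = Extension(
        "coverage.tracer",
        sources=["coverage/ctracer/tracer.c"],  # plus the other ctracer sources
        extra_compile_args=["-pg"],
    )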
Once I had that, running under perf was straightforward:
    perf record tstl_rt --seed=0 --timeout=30
This creates a file called perf.data that you can analyze. I did not find perf report, the default way of analyzing it, super helpful, so I used CPU Flame Graphs.
I was only interested in the performance for calls below CTracer_trace, and I didn’t find the way it was spread out in the SVG (there were a lot of bits) very helpful, so I ended up aggregating the data through some um very sophisticated data analysis tools as follows:
    perf script > out.perf && \
    ~/scratch/FlameGraph/stackcollapse-perf.pl out.perf | \
        grep CTracer | sed 's/.\+;CTracer_trace/CTracer_trace/' | \
        sort | \
        python sum.py > out2.folded
    ~/scratch/FlameGraph/flamegraph.pl out2.folded > out.svg
sum.py is the following very basic code:
    from __future__ import print_function

    import sys

    if __name__ == '__main__':
        prev = None
        for l in sys.stdin:
            u, v = l.split()
            v = int(v)
            if prev is None:
                prev = u
                count = v
            elif prev == u:
                count += v
            else:
                print(prev, count)
                prev = u
                count = v
        print(prev, count)
(the data munging earlier creates duplicated entries, so this merges them together).
WordPress won’t let me upload the generated SVG “For Security Reasons” (which do not apparently preclude running WordPress itself), so here’s a gist of it, and here’s one from before my patch was applied (right-click and view image in a new tab to get a usable interactive version of it).
PyPy performance for coverage is more complicated. My understanding is that there are roughly three classes of problems here:
- coverage itself is not as well optimised as it could be (same problem as CPython)
- Using settrace interferes with the JIT
- The baseline speed of PyPy operations is much higher, so coverage is a much larger relative overhead.
The first is the only one I can deal with directly, and I think it will probably benefit significantly from whatever changes I make to the C extension also being ported over to the pure Python version (PyPy doesn’t use the C tracer because it’s even slower than the pure Python one on PyPy). But I’m not sure that will be enough – I suspect there will also be changes required to PyPy internals, and I don’t know enough about those to say how difficult or easy they would be.
The Limits of What Is Possible
Python under coverage is never going to run at the speed of Python without coverage, especially on PyPy.
You can see the limits of how much faster it could be by running with an empty trace function (note: whether you are using a C-level or Python-level trace function makes a big difference – sys.settrace is ruinously slow even with an empty function).
The difference varies significantly depending on your code, though – with the TSTL workload above I see a 50% slowdown with an empty C-level trace function. With more microbenchmark-style workloads, with few functions and lots of looping, I see almost no speed loss.
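Here’s a minimal sketch of the Python-level version of this experiment (a C-level trace function needs a small extension module, which is exactly why coverage’s C tracer is so much cheaper than its pure Python one):

    from __future__ import print_function

    import sys
    import time

    def empty_trace(frame, event, arg):
        # Returning ourselves keeps line events flowing for this frame.
        return empty_trace

    def workload(n=1000000):
        total = 0
        for i in range(n):
            total += i
        return total

    start = time.time()
    workload()
    baseline = time.time() - start

    sys.settrace(empty_trace)
    start = time.time()
    workload()
    traced = time.time() - start
    sys.settrace(None)

    print("slowdown: %.1fx" % (traced / baseline))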
So my suspicion is that for CPython at least we can get coverage to reliably be within a factor of two of running without it, and maybe reliably within a factor of 1.5.
For now my plan is to shepherd the patch from the beginning of this post and otherwise ignore this problem for the next couple of months.
Based on the profiling I did above, most of the time is currently being spent in PyDict_SetItem, so I think the obvious next line of attack when I do start working on this is to replace the file_data objects in coverage – which are currently Python dictionaries keyed off Python values – with some sort of specialised data structure (possibly another hash table, possibly something better optimised for write-heavy workloads). Longer term, I think the goal should be to move all calls back into Python land out of the trace function and just normalise at the end of tracing.
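In hypothetical Python pseudocode (not coverage’s actual code), the contrast between the two shapes looks something like this:

    # Today (roughly): every traced line does a dict write with
    # Python-object keys - this is the PyDict_SetItem showing up in the
    # profiles.
    def trace_line_today(file_data, lineno):
        file_data[lineno] = None

    # The longer-term idea: append raw integers to a flat buffer in the
    # hot path...
    def trace_line_buffered(buf, lineno):
        buf.append(lineno)

    # ...and do the dictionary work once, after tracing finishes.
    def normalize(buf):
        return dict.fromkeys(buf)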
Even the more modest goal is a rather more invasive change than I wanted to start with, hence the above more conservative patch, but there’s nothing conceptually difficult about it – it just involves a bunch of slog and basic engineering work, followed by some experimenting with clever data structure designs.
Once I’ve made it through the next month or two I’ll start seeing about getting some work on this funded. I’ve already suggested the notion to an existing customer who I know is having problems with coverage performance, but if they don’t want to fund it (which would be totally understandable) I’ll look further afield.
My fall back plan is a Kickstarter for this, but honestly I think some motivated company who is suffering from this should just think about picking up the tab.
I’ve been doing a bunch of funded work on Hypothesis for Stripe and Smarkets (here and here so far, with more to come) and it’s been going great – it’s worked out well for me, them, and Hypothesis users at large, and I don’t see why other projects shouldn’t benefit from the same (I’d encourage paying people who are actually maintainers of those projects by default, but many of them including Ned have full time jobs and I’m more flexibly available).
We’re not talking about a lot of money – a couple of weeks of development work (and, say, a 10-20% extra consultancy fee for Ned) should be enough to see some serious performance improvements, and as well as making your developers much happier, you’ll also earn some serious community goodwill (which is great when it comes to hiring new developers). That’s probably less than what a month of your normal meeting schedule costs you.
This is a problem that affects a large majority of people who care about testing Python, which should include most commercial Python users, so if that’s you, why not get in touch?