Python Coverage could be fast

Ned Batchelder’s coverage.py is a foundation of the Python testing ecosystem. It is solid, well maintained, and does its job extremely well. I think literally every Python project that cares about testing should be using it.

But it’s not without its faults. Specifically, its performance can be quite bad. On some workloads it’s absolutely fine, but on others you can see anything up to an order of magnitude slow down (and this is just on CPython. On pypy it can be even worse).

Recently, after some rather questionable late night hacking (and a significant amount of fixing not late at night), I finally made good on my promise that Hypothesis would eventually use coverage information and shipped Hypothesis 3.29.0 which added a dependency on Coverage and turned it on for every Hypothesis based test.

This hasn’t been an entirely smooth process – some for reasons that are my fault, and some that users are now running into these performance problems.

The right place to fix this is obviously in Coverage rather than Hypothesis itself, so I’ve been looking into this recently. I’ve already made one patch which gives branch coverage a 50% speedup in a loop-based microbenchmark and about a 20% speedup on one real world example I tried it on.

The idea of the patch is very straightforward (though apparently I’m a unusual in thinking that “Here I wrote you a hash table!!!” is a straightforward patch). Coverage creates a lot of Python objects in a very hot part of the code, so this caches them off an integer key so that most of the time it can omit creating those objects and significantly speed things up as a result.

Unfortunately that’s probably it for now. My priorities for the immediate future are PhD, paid work, and conference prep, which means that I certainly don’t have any time in the next month and probably not the next couple of months (this could be fixed by making this paid work. I may do a kickstarter or something for that, but in the meantime if any interested companies wanted to fund this work I’d be more than happy to discuss it…).

So I thought I’d write down my notes before I forget. These are both for future-me and for anyone interested who feels motivated to work on this problem in the meantime.

Initial Benchmarking

I haven’t done super formal benchmarking, but I set up pytest-benchmark with some basic benchmarks that just ran a loop adding numbers.

The benchmarking functionality itself was pretty great but I didn’t find it easy to compare benchmarks in the way that I wanted – in particular I had the same benchmark which was run in three different ways (no coverage, line coverage, branch coverage) and I wanted to break those down for side-by-side comparison, but I couldn’t find any functionality to do so (I admit I didn’t look very hard). It was nice having the statistics handled though and I will almost certainly want to sink some time into getting a decent pytest-benchmark suite for coverage if I work on this further.

Real World Benchmarking

The real world benchmark I used was Alex Groce’s tstl, because his usage profile is similar to mine (there’s a lot of overlap between what Hypothesis and TSTL do), and he had an existing example that was seeing an order of magnitude slow down. This is the example that gets a 20% speedup from my above patch.

The problem can be reproduced as follows:

git clone https://github.com/agroce/tstl.git
cd tstl
virtualenv v
source v/bin/activate
pip install .
cd examples/AVL
tstl avlnodisp.tstl
tstl_rt --seed=0 --timeout=30 
tstl_rt --seed=0 --timeout=30 --noCover

The thing to compare is the total number of test operations run in the outputs from each of the test_rt commands. For me I see “12192 TOTAL TEST OPERATIONS” with coverage, and “96938 TOTAL TEST OPERATIONS” without, so it runs about 8 times as many operations in the same time frame with coverage turned off (this is without my patch. With my patch I get 14665 under coverage, so about 20% more).

Profiling Coverage

I confess I didn’t figure out how to profile coverage until after I made the above patch. I had a benchmark I was measuring, and just based on inspection I was more than 90% certain that the above would help, so I decided to just give it a go and validate my suspicion and turned out to be right.

But after I’d picked the most obvious low hanging fruit I figured it would be silly to try to proceed further without getting profiling set up, so I poked around. I spent a little time trying to get google-perf-tools working with Python and failing, but eventually figured out that I could do it with perf and it works great (modulo quite a lot of hacking and data munging).

The basic idea with perf is that you run your program under “perf record” and it gives you raw output data. You can then do analysis on this to find out about your program’s performance.

The first thing to do to use perf is that you need to make sure that everything is compiled with debug symbols. This includes both your extension and Python itself.

To get a Python with debug symbols I used pyenv‘s python-build plugin:

export PYTHON_CFLAGS='-pg'
~/.pyenv/plugins/python-build/bin/python-build 2.7.13 ~/debug-python2

This builds a version of Python 2 (TSTL doesn’t work under Python 3) with the “-pg” flag to gcc which includes debug symbols. I also modified setup.py for coverage to include  extra_compile_args=[‘-pg’] (it should be possible to do this with an environment variable, but I didn’t try).

Once I had that, running under perf was straightforward:

perf record tstl_rt --seed=0 --timeout=30

This creates a file called perf.data that you can analyze. I did not find prof report, the default way of analyzing it, super helpful, so I used CPU Flame Graphs.

I was only interested in the performance for calls below CTracer_trace, and I didn’t find the way it was spread out in the SVG (there were a lot of bits) very helpful, so I ended up aggregating the data through some um very sophisticated data analysis tools as follows:

perf script > out.perf && \
    ~/scratch/FlameGraph/stackcollapse-perf.pl out.perf | \
    grep CTracer | sed 's/.\+;CTracer_trace/CTracer_trace/' | \
    sort | \
    python sum.py > out2.folded
~/scratch/FlameGraph/flamegraph.pl out2.folded > out.svg

sum.py is the following very basic code:

from __future__ import print_function

import sys

if __name__ == '__main__':
        prev = None
        for l in sys.stdin:
                u, v = l.split()
                v = int(v)
                if prev is None:
                        prev = u
                        count = v
                elif prev == u:
                        count += v
                else:
                        print(prev, count)
                        prev = u
                        count = v
        print(prev, count)

(the data munging earlier creates duplicated entries, so this merges them together).

WordPress won’t let me upload the generated SVG “For Security Reasons” (that do not apparently preclude running WordPress itself), so here’s a gist of it, and her’es one from before my patch was applied (right click and view image in a new tab to get a usable interactive version of it)

pypy

PyPy performance for coverage is more complicated. My understanding is that there are roughly three classes of problems here:

  1. coverage itself is not as well optimised as it could be (same problem as CPython)
  2. Using settrace interferes with the JIT
  3. The baseline speed of pypy operations is much faster so coverage is a much higher overhead by comparison.

The first is the only one I could deal with, and I think it probably will benefit significantly from whatever changes I make to the C extension also being ported over to the pure Python version (pypy doesn’t use the C extension tracer because it’s even slower than the pure python one on pypy), but I’m not sure that will be enough – I suspect there are also going to be more changes to pypy internals required for this, and I don’t know enough about those to say how difficult or easy they are.

The Limits of What Is Possible

Python coverage is never going to run at the speed of Python without coverage, especially on pypy.

You can see the limits of how much faster it could be by running with an empty trace function (note: Whether you are using a C or Python level trace function makes a big difference. sys.settrace is ruinously slow even with an empty function).

The difference varies significantly depending on your code though – with the TSTL workload above I see a 50% slow down with an empty C level trace function. With more microbenchmark style workloads with few functions and lots of looping I see almost no speed loss.

So my suspicion is that for CPython at least we can get coverage to reliably be within a factor of two of running without it, and maybe reliably within a factor of 1.5.

What next

For now my plan is to shepherd the patch from the beginning of this post and otherwise ignore this problem for the next couple of months.

Based on the profiling I did above most of the time is currently being spent in PyDict_SetItem, so I think the obvious next line of attack when I do start working on this is to replace the file_data objects in coverage which currently use Python dictionaries keyed off Python values with some sort of specialized data structure (possibly another hash table, possibly something better optimized for write heavy workloads). Longer term I think the goal should be move all calls back into Python land out of the trace function and just normalize at the end of tracing.

Even the more modest goal is a bit more invasive of a change than I wanted to start with, hence the above rather more conservative patch, but there’s nothing conceptually difficult to it – it just involves a bunch of slog and basic engineering work followed by some experimenting with clever data structure designs.

Once I’ve made it through the next month or two I’ll start seeing about getting some work on this funded. I’ve already suggested the notion to an existing customer who I know is having problems with coverage performance, but if they don’t want to fund it (which would be totally understandable) I’ll look further afield.

My fall back plan is a Kickstarter for this, but honestly I think some motivated company who is suffering from this should just think about picking up the tab.

I’ve been doing a bunch of funded work on Hypothesis for Stripe and Smarkets (here and here so far, with more to come) and it’s been going great – it’s worked out well for me, them, and Hypothesis users at large, and I don’t see why other projects shouldn’t benefit from the same (I’d encourage paying people who are actually maintainers of those projects by default, but many of them including Ned have full time jobs and I’m more flexibly available).

We’re not talking about a lot of money – a couple of weeks of development work (and, say, a 10-20% extra consultancy fee for Ned) should be enough to see some serious performance improvements, and as well as making your developers much happier, you’ll also earn some serious community good will (which is great when it comes to hiring new developers). That’s probably less than a month of your normal meeting schedule costs you.

This is a problem that affects a large majority of people who care about testing Python, which should include most commercial Python users, so if that’s you why not get in touch?

This entry was posted in Python on by .

Programmer at Large progress report

This is a very short post to cheat on my Beeminder goal for programmer at large (it just says I have to post under the tag! They don’t have to be chapters! I feel like obeying the letter of the law that is designed to achieve a particular social effect is entirely in the spirit of the story so I don’t feel at all bad about this. OK. Maybe a little bad).

But I’m also going to use it to just put a quick note here on why there hasn’t been much programmer at large recently.

Short version: I’ve been finding it hard to work on programmer at large because the story has ended up going in a direction I didn’t really intend it to. I don’t dislike the direction per se, but it doesn’t quite work for me as an author. This combined with the fact that I’ve had a lot going on recently means that I just haven’t been able to get into the right head space to write it.

I’m not formally putting it on hiatus because “on hiatus” is another word for dead as far as web serials and the like are concerned. I do intend to finish it, but it will probably be a relatively abbreviated finish where I wrap it up in probably no more than another two or three chapters.

I think the primary thing I have learned from writing this serial is that I don’t like writing web serials. I will probably either stick to more self-contained stories in future or write a much larger amount before I start publishing.

This entry was posted in Programmer at Large on by .

Surgery recovery update

I realised that announcing that I was going to have surgery and then stopping updating the blog from the surgery date might be considered less than ideal, so this is just a quick update to reassure readers that I aten’t dead.

Anyway, I had my surgery last Tuesday as planned. It went fine according to my surgeon. It’s hard for me to tell – it’s somewhat in the nature of surgery recovery is that you spend a period immediately after where the thing you wanted to get better gets worse instead, and this one is very much exemplifying that. A week later I’m just about at the point where I can almost kinda breathe through my nose.

In general the recovery has been not much fun but entirely manageable – I spent the first couple of days useless, but since then I’ve been able to think OK. Despite feeling a lot like a bad cold it doesn’t seem to turn my brain to mush in the same way.

Anyway, hopefully in another week or so I might start seeing some benefit, or at least get back to feature-parity with pre-surgery me, I shall report back with some updated notes on the war on sleep when that happens.

Update: Apparently I was being unduly pessimistic. My breathing through my nose, while it still feels congested and uncomfortable, is actually now as good as or maybe slightly better than it was prior to the surgery in terms of measured airflow.

This entry was posted in Uncategorized on by .

Programmer at Large: Why aren’t you laughing?

This is the latest chapter in my web serial, Programmer at Large. The first chapter is here and you can read the whole archives here or on the Archive of Our Own mirror. This chapter is also mirrored at Archive of Our Own.


“Well… some of us think they just wanted to see what would happen.”

I blinked at Kimiko a couple times.

“What.”

“Oh come on, didn’t you read between the lines in history class? Half the point of culture design when you spawn a new ship is to try weird things and mess with people.”

“That… doesn’t seem much like what they told us. Isn’t it supposed to be all about designing for resilience and the long-term survival of the trade fleet and the human race?”

“Yeah. By messing with people. Societies doesn’t last long if they can’t take a joke. Well, here we are. We’re the joke. Why aren’t you laughing?”

“It doesn’t seem very funny.”

They sighed.

“It’s really not. Especially if you’re stuck in the middle of it. But I’m serious about the messing with people.”

“OK but… why?”

“A whole bunch of reasons, but it mostly boils down to evolution and data gathering. Trying random nonsense and seeing what works. Sometimes unexpected things turn out to be a really good idea and other people copy them, sometimes it all explodes horribly and the ship dumps a whole bunch of really good experimental data about what not to do onto the trade network. Most of them are somewhere in between, like us.”

“This still seems horribly irresponsible.”

They shrugged elaborately.

“And yet we’re the longest lasting civilisation in human history. As much as it hurts to be on the receiving end of this particular iteration, I can’t deny it works. In some ways we’re even a successful experiment – turns out having a dedicated minority to gently oppress is a great bonding exercise for the rest of the crew, and the systems in place are good enough at stopping things from blowing up. Our drama metrics are really low compared to most ships.”

“That’s horrible,”

“Believe me, I know. Fortunately it’d be a hard sell for any new ship design – it doesn’t have to just work, people have to actually buy into the design, and now that there’s data from us it’d be harder to repeat the experiment. But maybe our data will help somebody figure out a less dysfunctional way of doing it. That’s how the system works.”

I didn’t really know what to say to that, so I just floated there for a while with a slightly ill look on my face. Eventually, Kimiko continued speaking.

“So, uh, now that you know, what are you going to do about it?”

That, at least, was obvious.

“Oh, nothing.”

“Nothing? Really? You’re not going to make a fuss about it?”

“What? No, of course not. That would be stupid. I mean, let me know if I’m wrong and there’s something you want me to do about it, but until then I’m going to do the same thing I do with any complex problem that I don’t understand properly and the experts are already on top of: leave it alone until I do understand it properly.”

They breathed a sigh of relief.

“Good. Thank you. Right answer. And no, there’s nothing much you can do about it. Though, uh, I should warn you that you still might not want to be friends with me. It looks like you’re in enough social metrics trouble as it is without people calling you a sex fiend.”

“Oh, waste that. This whole thing is stupid and even if I’m not going to try and fix it I’m not going to make it worse. Besides, if I get kicked off because people think I’m having sex with you, at least that way I’ll be part of a group rather than all on my own surrounded by grounders.”

I gave a slightly pained smile to show I was only joking. Mostly.

Apparently I’d said something right anyway. I could feel a tension I hadn’t even realised they were holding go out of them.

“That’s… nice to hear. Thank you.”

They paused, grinned.

“And now of course, we must celebrate our new friendship in the way of my people. Let’s bang.”

They waggled their eyebrows suggestively.

I gave them an extremely flat look. Even I could spot that one was a joke.

They held the grin for a few seconds before bursting out laughing.

“Sorry, sorry, couldn’t help myself. Don’t worry, I know better than to actually hit on you. But let me know if you ever want to experiment.”

I nodded.

“I doubt I will, but thanks.”

I stifled a yawn.

“Sorry, excuse me. It’s getting close to my bed time. Is there anything else we need to talk about?”

They shook their head.

“I don’t think so. We’ve got the basic facts of life covered, you’re not going to treat me as a social pariah, those were the big things I wanted to check on, so I’m good if you are.”

They yawned too.

“To be honest though, I’m wiped. It’s a bit off cycle for me, but mind if I join you?”

“Sure, fine by me.”

I usually sleep alone. Not for any particularly good reason, it just happens that way. It would be nice to have some company for once.

“Now?”

“Might as well, if we’re done.”

We stripped off and plugged into the wall. It took a little while to find a comfortable position, but we eventually settled on Kimiko cuddling up to me from behind.

“Sleep well, Arthur”

“You too, Kimiko”

I triggered my sleep mode. Within seconds the world went fuzzy, and shortly after I was fast asleep.

This entry was posted in Programmer at Large on by .

A trick for getting to sleep

I used to have a lot of trouble getting to sleep. These days I manage better (partly because I’m sufficiently tired all the time that I’m much sleepier when I’m trying), but for a long time the biggest obstacle to my sleep was just that I couldn’t shut off my mind well enough to go to sleep.

And there are still nights where that’s the case and my thoughts end up keeping me up. e.g. because I can’t shut my brain off thinking about work, or because I’m having an anxiety day and my brain really wants to remind me of all my failings. Thanks brain.

The trick for those nights which I hit on ages ago is basically to give myself something to focus on that is sufficiently mentally engaging that I can keep paying attention to it, but insufficiently so to keep me up.

Historically I’ve basically made up stories to do this – not high quality ones, often just variations on whatever I’ve been reading recently (i.e. basically bad self-insert fanfic. Don’t worry, I won’t inflict it on you). This works… pretty well, some of the time. It rather relies on my feeling sufficiently creative in order to work though, which I often don’t.

But I’ve recently found a new technique that I think works strictly better. I thought I had got it from this article about a sleep app, but on a recent reread I realised that not only had I completely got the wrong end of the stick about what their technique was, what they’re actually proposing is completely impossible for me to use due to my inability to visualise.

So, uh, yay for creative misunderstanding?

Anyway, here’s the thing I do now:

  1. Pick a starting word. Usually something short with three or four letters (cat, was, rim, keel, etc).
  2. Repeatedly try to change your current word into another word. I vary what counts as a change but usually go for allowing you to add, delete, or replace one letter, or swap two letters.
  3. When you can’t find any new words starting from the current one, start again from the beginning with a new word (you can repeat words if you want to, you just don’t have to).

I tend to set myself minigoals while playing it – e.g. I normally default to trying to make the word longer, or if there’s some word that looks reachable nearby I try to get to it – e.g. how do you get from child to shield? (this one is actually moderately hard. I don’t currently know the answer and decided not to get distracted from writing this to figure it out. Edit: See end of post). Sometimes these turn out to be harder than expected and I abandon them. If I find I’m getting frustrated with a minigoal I definitely abandon it.

That’s all there is to it really. I just keep playing the word game until I fall asleep.

This ends up working pretty well – it’s mildly engaging, certainly enough that I don’t get bored of it, but it’s also not that interesting, so it doesn’t stop me falling asleep playing it.

The major limitation I run into with this (and all techniques in this space) is that sometimes when my thoughts are very fragmented – the sort of fragmenting that comes from say a headache or a fever, not the sort that comes from sleepiness – my working memory is shot and I can’t actually hold things coherently enough to. I don’t really have a good answer for that sort of scenario other than taking some painkillers and hoping that I’m drained enough already that that’s enough.

Fortunately that’s not the normal case for me (although I’ve got a bit of it at the time of this writing. Fortunately external memory is a good substitute when writing), and this works pretty well for the common case when the problem is just that my won’t shut off properly and it’s stopping me from getting to sleep.

Now if you’ll excuse me, I’m going to go implement A* search and get a computer to figure out that damned child to shield play.

Update: ‘child’, ‘chid’, ‘hid’, ‘hied’, ‘shied’, ‘shield’. I’d probably have given up before I got that (there are other shorter ones but they go via stupid words that I wouldn’t use without sowpod in front of me to check). If you’re suspicious of even that chid (past tense of chide, obviously) then there’s ‘child’, ‘chile’, ‘chide’, ‘hide’, ‘hied’, ‘shied’, ‘shield’.

This entry was posted in The War On Sleep on by .