TLDR, this is another Hypothesis post. If you’re bored of those you can press back now.
I discovered a bug today that has been in Hypothesis from the beginning. I’m really really glad that I found it the way I did: Partly because I feel all smug and validated about my development practices as a result, but mostly because finding and figuring this one out in the wild would have been a nightmare, and it probably would have pissed off a ton of users who just wrote the library off as inherently flaky before I found out about it.
In Hypothesis you can define strategies for instances. This is how things like @given([int]) and so on work – there’s a rule that says how to build strategies for lists given strategies for their elements.
But in Python you have subtyping. There’s a similar strategy definition for tuples, but you might want more specific ones for your custom namedtuples (e.g. insisting on some relation between the fields).
So you need to be able to define instance strategy mappings for a subtype of something that already has a mapping defined and have it work. But you also want to have it inherit the supertype’s definition if a subtype one has not been defined (e.g. if you have a specific namedtuple you want it to inherit the generic namedtuple logic).
This is basically a method dispatch problem. You have to find the most specific supertype of the current type that has a definition (note: There’s potentially an ambiguity in “most specific” which is currently not well handled. I plan to look into this but it’s not high on my priority list).
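To make that concrete, here’s a toy sketch of the dispatch rule. This is not the actual Hypothesis internals and all the names are made up for illustration: strategies get registered per type, and a lookup for a class falls back to the most specific registered supertype.

```python
import collections

strategies = {}  # type -> function that builds a strategy for instances of it

def define_strategy_for(cls, builder):
    strategies[cls] = builder

def strategy_for(cls):
    # Every registered type that cls is a subtype of is a candidate...
    candidates = [t for t in strategies if issubclass(cls, t)]
    if not candidates:
        raise KeyError(cls)
    # ...and we want a most specific one: a candidate with no other candidate
    # strictly below it. If several are minimal this just picks one of them,
    # which is the ambiguity mentioned above.
    best = next(
        t for t in candidates
        if not any(u is not t and issubclass(u, t) for u in candidates)
    )
    return strategies[best](cls)

Point = collections.namedtuple('Point', ['x', 'y'])

define_strategy_for(tuple, lambda cls: 'generic tuple strategy')
print(strategy_for(Point))  # no Point-specific rule yet, so it inherits the tuple one

define_strategy_for(Point, lambda cls: 'specific Point strategy')
print(strategy_for(Point))  # now the more specific definition wins
```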
The Hypothesis implementation of this is ridiculously inefficient, but it shouldn’t be used as a general purpose object system so that’s basically not a huge problem. I did however think that it was correct – I’d frequently said things along the lines of “Yes it’s ridiculous that Hypothesis has its own object system but honestly it’s been super useful and never given me any problems”.
Ah, hubris.
So yeah, there turns out to be a bug in this logic which can cause things to go really confusingly wrong in surprising and hard to debug ways.
The bug is this:
Due to reasons that are honestly a bit silly and probably entirely unnecessary (I need to revisit this code properly at some point), the core of the algorithm Hypothesis uses for this requires doing a topological sort of the instance handlers with respect to the subset relation.
So the way I did this was to do a normal sort by the relation that x < y if x is a subtype of y, x > y if y is a subtype of x and otherwise sort by comparing names.
And this is so very much not how you do a topological sort. I don’t know what past David was thinking but apparently I’m just really bad at extending partial orders. This produces a non-transitive order, and as a result sorting it can basically do anything it wants.
What it wants in practice seems to be to almost work. Most of the time it produces the right results, but depending on the order you started with it can be almost arbitrarily wrong.
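Here’s a small reconstruction of the shape of the bug, using toy classes rather than the real handler types: a cmp-style comparator that prefers subtypes but falls back to names is not transitive, so what sorted() gives you depends on the order you feed it.

```python
from functools import cmp_to_key
from itertools import permutations

class B: pass
class C: pass
class D(B): pass  # D is a strict subtype of B

def subtype_then_name_cmp(x, y):
    # The flawed "extension" of the subtype partial order:
    # subtypes sort before their supertypes, unrelated types sort by name.
    if x is not y and issubclass(x, y):
        return -1
    if x is not y and issubclass(y, x):
        return 1
    return (x.__name__ > y.__name__) - (x.__name__ < y.__name__)

# By name B < C < D, but by subtyping D < B, so the relation contains a
# cycle and sorting with it is free to do more or less anything.
for perm in permutations([B, C, D]):
    result = sorted(perm, key=cmp_to_key(subtype_then_name_cmp))
    broken = result.index(D) > result.index(B)  # the subclass should come first
    print([c.__name__ for c in perm], '->', [c.__name__ for c in result],
          '(subclass after its superclass!)' if broken else '')
```

Depending on the starting permutation, some of the sorted results put D after its superclass B, which is exactly the “mostly works, occasionally arbitrarily wrong” behaviour described above.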
I spotted this because my builds are designed to fail at less than 100% branch coverage. The tests I had around this were flaky in the sense that sometimes they would randomly fail to produce enough coverage and I couldn’t figure out why. Investigating caused me to find this dodgy code and do a double take.
I then really struggled to find an example that actually demonstrated it was dodgy. So I just wrote some code to throw lots of examples at it and see what happened, and indeed got lots of super complicated examples demonstrating that it really was dodgy but not providing any great deal of insight.
Then I realised that I was literally sitting in the middle of a library that was about example generation and minimization, so I once more turned Hypothesis’s terrible gaze upon itself and it immediately started giving me small examples of 3 or 4 classes.
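As a rough reconstruction of that, using today’s @given/strategies API rather than the one Hypothesis had at the time, and reusing the toy classes and comparator from the sketch above, the property “subclasses must sort before their superclasses” is enough for Hypothesis to turn up a failing ordering:

```python
from functools import cmp_to_key
from hypothesis import given, strategies as st

# B, C, D and subtype_then_name_cmp as defined in the earlier sketch.

@given(st.permutations([B, C, D]))
def test_subclasses_sort_before_their_superclasses(classes):
    result = sorted(classes, key=cmp_to_key(subtype_then_name_cmp))
    for i, earlier in enumerate(result):
        for later in result[i + 1:]:
            # A strict subclass must never end up after its superclass.
            assert not (later is not earlier and issubclass(later, earlier))

# Running this (e.g. under pytest) promptly reports a falsifying permutation.
```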
Which, on trying to reproduce in a deterministic test or in ipython, didn’t work.
And this is where the bug gets super funny, and where I spent a good half an hour finding test cases, then trying and failing to reproduce them, and getting increasingly convinced that I’d found a new and exciting different bug in Hypothesis before I finally figured out what was going on.
You see, as I mentioned above the behaviour depends on the starting order. In this case, the items we start with are the entries in a dict. This means that given a specific order of operations to build up the mapping, the starting order is entirely deterministic if basically arbitrary – it’s just the dict iteration order.
Entirely deterministic, that is, within any given process. As of Python 3.3, hash randomization is on by default. Each new Python process will have a different hash secret and will thus have its own iteration order.
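A quick way to see the per-process part is below. One caveat: on current CPython, dicts preserve insertion order, so the hash-dependent dict ordering described here reflects the Python versions of the time; plain hash() and sets of strings still show the effect.

```python
# str hashes are salted with a per-process random key, so both of these
# usually change from one run of the interpreter to the next, but are
# stable for the lifetime of a single process:
print(hash("hypothesis"))
print(list({"squirrel", "badger", "otter", "weasel", "stoat"}))

# Pinning the seed makes runs reproducible again, e.g.:
#   PYTHONHASHSEED=0 python this_file.py
```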
This means that a lovingly hand-crafted example of this will probably only reproduce the problem some fraction of the time, and that no matter how many times you run it in a given test run it will always produce the same answer.
Hypothesis on the other hand? No problem. It just throws examples at the problem until it finds one that triggers the bug under the current process’s iteration order.
This is really interesting to me because it’s an example where property-based testing is not just normal testing minus having to write the examples by hand. In this case the system is actively fighting against your ability to produce examples, but Hypothesis just shrugs, does its thing and gives you one anyway.
All told, despite the ridiculously terrible code that was the source of this, I think this is pretty much a success story for Hypothesis: The policy of failing the build if there’s not 100% coverage is working out really well, because it makes me find areas of the codebase which are merely erratically covered and give them some serious attention, and the ability of Hypothesis to find and reduce interesting examples continues to impress.
If you’re interested, here is the commit that fixes this. Somewhat embarrassingly, it only works because of a typo, but it does still work rather well.