Stable serialization and cryptographic hashing for tracking seen objects

This is a clever trick I figured out yesterday. It’s in retrospect moderately obvious, but I’ve never seen it before and it’s taken Hypothesis example tracking down from an expected space complexity of O(mn) and time complexity of O(god I don’t even know) to a deterministic O(n) space complexity and an O(mn) time complexity. It’s also vastly improved the constant factors by moving a bunch of stuff from pure python to native code.

(Actually there’s a bit of a wrinkle in that it’s not strictly deterministic: n is a random function of the number of examples you’re looking for, depending on the duplicate example rate, but in practice it’s rarely more than a few times larger than the number of examples.)

First: the problem. Due to the way its data is generated, Hypothesis has a decent chance of producing the same example more than once (if it didn’t, there would be a lot of other interesting examples it struggled to produce). Rather than run the same example multiple times, Hypothesis keeps a record of which examples it’s already tried and doesn’t try them again. This is also used during simplification to avoid loops in the simplify graph (it’s a lot easier to write simplify methods that might occasionally have loops and let Hypothesis sort that out than it is for every simplify method to do a lot of careful bookkeeping).

Previously I was using the obvious data structure to track which examples we’ve already seen: a hash table. Stick the object in the table when it’s seen; check the table to see whether we’ve already seen it.
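In outline, that approach is nothing more than this (a minimal sketch; the names here are mine, not Hypothesis’s):

    # The obvious approach: a set as the record of seen examples.
    seen = set()

    def is_new_example(example):
        # Record an example, returning True only the first time it's seen.
        if example in seen:
            return False
        seen.add(example)
        return True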

There are two problems with this. One is intrinsic, the other implementation specific.

The intrinsic one is that we’re keeping all these objects around! That uses up a lot of memory. Some of these objects may be large nested lists of things, and even if they aren’t Python is not exactly good at compact object representation.

The second problem is that Python’s default equality and hashing are completely unsuitable for this. We don’t want to consider {1} and frozenset({1.0}) equal for these purposes, and we do want to consider nan and nan equal for these purposes. Additionally, we need to be able to track things like lists, which aren’t hashable. So this means we need our own hashing and equality logic. This was written in pure Python and thus was very slow when dealing with large objects.
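To make that concrete, here’s a small illustrative snippet showing how the default semantics go wrong:

    nan = float('nan')

    print({1} == frozenset({1.0}))  # True  - but we want these treated as distinct
    print(nan == nan)               # False - but we want repeated nans deduplicated
    try:
        hash([1, 2, 3])             # lists aren't hashable at all
    except TypeError as err:
        print(err)                  # unhashable type: 'list'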

This is very sub-optimal, but it’s not clear how I could have done it better when tracking arbitrary Python objects.

…which is the clue. These days Hypothesis doesn’t need to track arbitrary Python objects. It tracks templates, which I can require to have a much more specific type than the objects tracked. In particular I can require them to be marshallable.

In the end I decided on a slightly laxer requirement because marshalling doesn’t handle certain types I wanted to be able to use (specifically namedtuple instances). Every template needs to be either:

  1. Marshallable
  2. An iterable type containing other templates

Combining these by flattening out collections into tuples + a type tag and then marshalling the whole lot lets us generate a binary string for every template value such that if the two binary strings are equal, the templates are equivalent (and generally speaking the converse too, though there are some special cases where that might not quite work if the iteration order is unstable, and we just accept the slight duplication).
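A sketch of what that combination might look like (the function name and details are my illustration of the scheme, not the actual Hypothesis code):

    import marshal

    def serialize_template(template):
        # Flatten collections into a (type tag, elements) tuple and
        # marshal the result; the type tag stops e.g. a list and a
        # tuple of the same elements from colliding. Note that
        # namedtuple instances pass the isinstance check too, which is
        # how the laxer requirement accommodates them.
        if isinstance(template, (list, tuple, set, frozenset)):
            # For sets the iteration order may be unstable; as noted
            # above, we just accept the occasional duplicate.
            return marshal.dumps((
                type(template).__name__,
                tuple(serialize_template(t) for t in template),
            ))
        # Everything else is required to be directly marshallable.
        return marshal.dumps(template)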

This is already a pretty great saving: we have native equality and hashing that do what we want, and the representation is much more compact than the Python object representation.

But we can do better! Sometimes (often) the strings we get here will be quite long. When a string is at least twenty bytes long we replace it with its SHA-1 hash. This gives us yet another space saving (in theory it also gives us a time saving, but Python’s hash for binary strings is quite good, and with random data equality will tend to return False early, so in practice I don’t expect equality comparisons between long strings would have been a noticeable time cost).
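The trick itself is tiny. Something like this (a sketch, with names of my own choosing):

    import hashlib

    def tracking_key(serialized):
        # Collapse long serialized values to a fixed-size digest.
        if len(serialized) >= 20:
            return hashlib.sha1(serialized).digest()
        return serialized

The twenty-byte threshold is no accident: a SHA-1 digest is itself exactly twenty bytes, so the substitution can never make a key longer.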

The result of this change has been that some parts of my build are now nearly ten times faster than they used to be, and the memory usage has become much more predictable. It’s also meant that I could ditch a ton of code that I really hated – the custom hashing and equality code used to have other use cases, but this was the last remaining one. Overall, this has been one of my better ideas.


2 thoughts on “Stable serialization and cryptographic hashing for tracking seen objects”

  1. pozorvlak

    This is very sub-optimal, but it’s not clear how I could have done it better when tracking arbitrary Python objects.

    Could you have pickled the objects and used the resulting strings as your keys?

    1. david (post author)

      Not all objects are sensibly pickleable, and I don’t think the pickle serialization format is stable because it needs to e.g. track identity.
