# A possible more principled approach to generation and simplification in hypothesis

Currently, generation and simplification in hypothesis work in a very ad hoc and undisciplined way. The API is spread across various places, and the actual behaviour is badly under-specified.

There are two operations exposed, produce and simplify. produce takes a size parameter and produces values of the specified type of about the provided size, whereas simplify takes a value of the specified type and produces a generator over simplified variants of it.

The meaning of these terms is deliberately undefined: there is no specified meaning for “about that size” or “simplified variants”.

This post is an attempt to sketch out some ideas about how this could become better specified. I’m just going to use maths rather than Python here – some of the beginnings of this have made it into code, but it’s no more than a starting point right now.

Let $$S$$ be a set of values we’ll call our state space. We will call a search tactic for $$S$$ a triple:

\begin{align*} \mathrm{complexity} & : S \to [0, \infty) \\ \mathrm{simplify} & : S \to [S]^{< \omega}\\ \mathrm{produce} & : [0, \infty) \to \mathrm{RV}(S) \\ \end{align*}

Where $$\mathrm{RV}(S)$$ is the set of random variables taking values in $$S$$ and $$[S]^{<\omega}$$ is the set of finite sequences taking values in $$S$$.

These operations should satisfy the following properties:

1. $$\mathrm{h}(\mathrm{produce}(x)) = \min(x, \log(|S|))$$, where $$\mathrm{h}$$ is the entropy function
2. $$x \to \mathbb{E}(\mathrm{complexity}(\mathrm{produce}(x)))$$ should be monotone increasing in $$x$$
3. $$\mathrm{complexity}(y) \leq \mathrm{complexity}(x)$$ for every $$y \in \mathrm{simplify}(x)$$

In general it would be nice if the distribution of $$\mathrm{produce}(x)$$ also minimized the expected complexity for its entropy, but I think that would be too restrictive.

The idea here is that the entropy of the distribution is a good measure of how spread out the search space is – a low entropy distribution will be very concentrated, whereas a high entropy distribution will be very spread out. This makes it a good “size” function. The requirement that the expected complexity be monotone increasing captures the idea of the search space spreading out, and the requirement that simplification not increase the complexity captures the idea of moving towards values more like those you generate at low size.
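In Python terms (purely illustrative, and not the API hypothesis currently exposes) the triple might look something like this:

```python
from abc import ABC, abstractmethod

class SearchTactic(ABC):
    """Hypothetical interface for the (complexity, simplify, produce) triple."""

    @abstractmethod
    def complexity(self, value):
        """Map a value to a non-negative real number."""

    @abstractmethod
    def simplify(self, value):
        """Return a finite list of candidate simplifications of value."""

    @abstractmethod
    def produce(self, entropy, rng):
        """Draw a value whose distribution has roughly the given entropy."""
```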

Here are some examples of how you might produce search strategies:

A search strategy for positive real numbers could be:

\begin{align*} \mathrm{complexity}(x) & = x \\ \mathrm{simplify}(x) & = [x, x - 1, \ldots, x - n] \mbox{ for the largest integer } n < x\\ \mathrm{produce}(h) & = \mathrm{Exp}(e^h - 1) \\ \end{align*}

The exponential distribution seems to be a nice choice because it’s a maximum entropy distribution for a given expectation, but I don’t really know if it’s a minimal choice of expectation for a fixed entropy.
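As a sketch in Python (this is my reading of the above, with Exp(e^h - 1) taken as the exponential distribution with *mean* e^h - 1; note that `random.expovariate` takes a rate, i.e. the reciprocal of the mean):

```python
import math
import random

def complexity(x):
    return x

def simplify(x):
    # x, x - 1, ... down to the last value above zero: each candidate
    # is no more complex than x, as property 3 requires.
    return [x - n for n in range(math.ceil(x))]

def produce(h, rng=random):
    # Exponential with mean e^h - 1. At h = 0 the mean is 0, so we
    # degenerate to the constant 0 (entropy 0, loosely speaking).
    mean = math.expm1(h)
    if mean <= 0:
        return 0.0
    return rng.expovariate(1.0 / mean)
```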

Another example. Given search strategies for $$S$$ and $$T$$ we can produce a search strategy for $$S \times T$$ as follows:

\begin{align*} \mathrm{complexity}(x, y) & = \mathrm{complexity}_S(x) + \mathrm{complexity}_T(y)\\ \mathrm{simplify}(x, y) & = [(a, b) : a \in \mathrm{simplify}_S(x), b \in \mathrm{simplify}_T(y)] \\ \mathrm{produce}(h) & = (\mathrm{produce}_S(\frac{1}{2}h), \mathrm{produce}_T(\frac{1}{2}h)) \\ \end{align*}

The first two should be self-explanatory. The produce function works because the entropy of a product of independent variables is the sum of the entropy of its components. It might also be potentially interesting to distribute the entropy less uniformly through the components, but this is probably simplest.
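A Python sketch of the product construction, assuming each component tactic is an object exposing the three operations (a hypothetical interface, matching the formulas above):

```python
def product_tactic(s, t):
    # `s` and `t` are assumed to expose complexity/simplify/produce.
    def complexity(pair):
        x, y = pair
        return s.complexity(x) + t.complexity(y)

    def simplify(pair):
        x, y = pair
        return [(a, b) for a in s.simplify(x) for b in t.simplify(y)]

    def produce(h, rng):
        # Entropy is additive over independent components, so give each half.
        return (s.produce(h / 2, rng), t.produce(h / 2, rng))

    return complexity, simplify, produce
```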

You can also build a search strategy for sequences: given a search strategy for the natural numbers $$\mathbb{N}$$ and a search strategy for $$S$$, we can generate a search strategy for $$[S]^{< \omega}$$.

If we define the complexity of a sequence as the sum of the complexities of its components, we can define its simplifications as coordinatewise simplifications of its subsequences. produce is the only operation that is hard to define. Its definition goes as follows:

We will generate the length as a random variable $$N = \mathrm{produce}_{\mathbb{N}}(i)$$ for some entropy $$i$$. We will then allocate a total entropy $$j$$ to the coordinates, so that each coordinate is generated as $$x_k = \mathrm{produce}_S(\frac{j}{N})$$. Let $$T$$ be the random variable of the sequence produced.

The value of $$N$$ is completely determined by $$T$$ (it is just the length), so $$h(T) = h(T,N)$$. We can then use conditional entropy to calculate this: we have that $$h(T \mid N = n) = j$$, because we allocated the entropy equally between the coordinates.

So $$h(T) = h(N) + h(T \mid N) = i + j$$.

So we can allocate the entropy between length and coordinates as we wish – either an equal split, or biasing in favour of longer sequences with simpler coordinates, or shorter sequences with more complex coordinates.
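A sketch of this produce in Python, with a hypothetical `split` parameter controlling how the entropy budget is divided between the length and the coordinates:

```python
def sequence_produce(nat_tactic, s_tactic, total_entropy, rng, split=0.5):
    # Spend `split` of the entropy on the length and share the rest
    # equally between the coordinates, so h(T) = i + j = total_entropy.
    i = split * total_entropy       # entropy for the length N
    j = total_entropy - i           # entropy shared by the coordinates
    n = nat_tactic.produce(i, rng)
    if n == 0:
        return []
    return [s_tactic.produce(j / n, rng) for _ in range(n)]
```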

Anyway, those are some worked examples. It seems to work reasonably well, and is more pleasantly principled than the current ad hoc approach. It remains to be seen how well it works in practice.

# Stateful testing with hypothesis

The idea of stateful testing is that it is “quickcheck for mutable data structures”.

The way it works is that rather than trying to produce arguments which falsify a property, we instead try to produce a sequence of operations which breaks a data structure.

Let me show you. We’re going to start with the following broken implementation of a set:

```python
class BadSet:
    def __init__(self):
        self.data = []

    def add(self, arg):
        self.data.append(arg)

    def remove(self, arg):
        for i in range(len(self.data)):
            if self.data[i] == arg:
                del self.data[i]
                break

    def contains(self, arg):
        return arg in self.data
```

Because it uses an array internally to store its items and doesn’t check if an item is already contained when adding it, if you add an item twice and then remove it then the item will still be there.

(Obviously this is a really stupid example, but it should demonstrate how it works for more complicated examples)
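To see the failure concretely (the class is repeated here so the snippet stands alone):

```python
class BadSet:
    def __init__(self):
        self.data = []

    def add(self, arg):
        self.data.append(arg)

    def remove(self, arg):
        for i in range(len(self.data)):
            if self.data[i] == arg:
                del self.data[i]
                break

    def contains(self, arg):
        return arg in self.data

bad = BadSet()
bad.add(1)
bad.add(1)              # the duplicate is appended rather than ignored
bad.remove(1)           # only the first copy is deleted
assert bad.contains(1)  # the "removed" item is still present
```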

Now let’s write some tests that break this!

The operations we want to test are adding and removing elements. So we do the following:

```python
from hypothesis.statefultesting import StatefulTest, step, requires

class BadSetTester(StatefulTest):
    def __init__(self):
        self.target = BadSet()

    @step
    @requires(int)
    def add(self, i):
        self.target.add(i)
        assert self.target.contains(i)

    @step
    @requires(int)
    def remove(self, i):
        self.target.remove(i)
        assert not self.target.contains(i)
```

We can now ask this to produce us a breaking example:

```python
>>> print(BadSetTester.breaking_example())
[('add', 0), ('add', 0), ('remove', 0)]
```

Note the nicely minimized example sequence.

At the moment this code is very much a work in progress – what I have works, but should probably be considered a sketch of how it might work rather than a finished product. As such, feedback would very definitely be appreciated.

# Quickcheck style testing in python with hypothesis

So I’ve been tinkering a bit more with hypothesis. I would now cautiously label it “probably not too broken, and I’m unlikely to completely change the API”. The version number is currently 0.0.4 though, so you should certainly regard it with some degree of suspicion. However I’m now reasonably confident that it’s good enough for simple usage, in the sense that I expect bug reports from most people who use it in anger, but probably not from everyone, and probably not many from each. :-)

Since the initial version I’ve mostly been thinking about the API and fixing anything that was really obviously stupid. I’ve also cleaned up a bunch of internals. The implementation is still pretty far from perfect, but it’s no longer awful.

Most importantly, I’ve added a feature to it that makes it actually useful. Test integration! You can now use it for some fairly pretty quickcheck style testing:

```python
@given(int, int)
def test_int_addition_is_commutative(x, y):
    assert x + y == y + x

@given(str, str)
def test_str_addition_is_commutative(x, y):
    assert x + y == y + x
```

The first test will pass, because int addition really is commutative, but of course string addition is not, so what you’ll get is an error from the test suite:

```
x = '0', y = '1'

    @given(str, str)
    def test_str_addition_is_commutative(x, y):
>       assert x + y == y + x
E       assert '01' == '10'
E       - 01
E       + 10
```

This doesn’t currently support passing keyword arguments through to the tests – all arguments have to be positional. I think I know how to fix this, I just haven’t sorted out the details yet.

But yeah, quickcheck style testing with test case minimization (look at how nice the counter-example it produced was!). If that turns out to be exactly what you’ve always wanted, go wild.

# An interview question

Edit for warning: This problem has proven substantially harder than I intended it to be. The spec keeps turning out to be ambiguous and needing patching and there are enough nasty edge cases that basically no one gets everything right (indeed, my reference implementation has just been pointed out to contain a bug). I’m going to leave it as is for the challenge but be warned.

This isn’t a question I’ve ever used to interview people, but it’s not dissimilar from the coding test we’ve got very good results from at Aframe (the problem is not at all similar, but the setup is fairly close). I’ve been pondering this as an alternative, and I thought it might be interesting to share it with people. I’ll explain what it’s designed to test in a later post.

If you want to answer it, I’m happy to grade your answer, but there’s no exciting reward for doing so other than public or private recognition of a job well done. Details on this after the question.

### The problem

You are required to write a command line program which reads lines terminated by ‘\n’ or EOF from its STDIN until it gets an end of file, and then writes a stably sorted version of them to STDOUT. Where a line is terminated by EOF, it should have an implicit ‘\n’ inserted after it.

It is perfectly OK to use library sort functions for this. Additionally, ability to handle large numbers of lines is not required – the only performance requirement is to sort 500 lines of less than 1000 bytes each in less than 10 seconds (this should not be in any way onerous). You can write an external sort if you want, and you’ll get extra kudos for it, but it is in no way required for a correct answer. The task is to implement the comparator used for the sorting, not to implement a sorting algorithm.

The comparator to be implemented is as follows:

Each line is to be considered to be an arbitrary sequence of non-’\n’ bytes, which will be interpreted as a sequence of non-overlapping tokens. A token is EITHER:

- a sequence of ASCII letter characters a-zA-Z
- a numeric string. This is a decimal representation of a number, containing 1 or more digits 0-9, possibly with a leading - sign, possibly with a single decimal point (.) followed by additional digits. The decimal point must come after at least one digit. No other characters are permitted. e.g. -5 and 5.0 are valid numeric tokens but +5, .5 and 5. are not (though they all contain a valid numeric token: 5)

Non-ASCII bytes, or ASCII but non-alphanumeric bytes, may be present in the line, but must not be considered part of any token.

Each token should be the longest possible token it can be, with ambiguity resolved in favour of making the leftmost token longer. So for example the line “foobar” is the single token “foobar”, not the tokens “foo” and “bar”, and “3.14.5” tokenizes as “3.14”, “5”.

Note also that lines may contain characters other than those permitted in tokens, and that tokens are not necessarily separated by whitespace:

“-10.11foo10 kittens%$£$%”

should be tokenized as

-10.11, foo, 10, kittens
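A sketch of one possible tokenizer in Python (this is my reading of the spec, not the reference implementation; in particular, whether a - following a number starts a new numeric token is one of the ambiguities that bit people):

```python
import re

# One left-to-right pass, trying the numeric form first: an optional
# leading -, digits, and an optional decimal point followed by digits;
# otherwise a run of ASCII letters. Everything else is skipped.
TOKEN = re.compile(r'-?[0-9]+(?:\.[0-9]+)?|[A-Za-z]+')

def tokenize(line):
    return TOKEN.findall(line)
```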

Lines should then be compared lexicographically as their list of tokens (as usual, if one is a prefix of the other then the shorter one comes first), with individual tokens being compared as follows:

1. Two numeric tokens should be compared as their numeric value when interpreted as a decimal representation
2. Any numeric token is less than any non numeric token
3. Non-numeric tokens should be compared lexicographically by single character, with characters compared case insensitively. Case insensitivity should be performed as in English with ‘a’ corresponding to ‘A’, ‘b’ to ‘B’, etc.
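Given a list of tokens, however produced, the three rules above can be sketched as a Python sort key (using Decimal for exact numeric comparison; comparing via float would lose precision on long decimals, which is exactly the kind of edge case that trips people up):

```python
import re
from decimal import Decimal

# Matches the numeric token grammar exactly; deliberately a pattern test
# rather than try/except, so letter tokens like "nan" stay non-numeric.
NUMERIC = re.compile(r'-?[0-9]+(?:\.[0-9]+)?\Z')

def token_key(tok):
    # Numeric tokens sort first (rule 2), by value (rule 1);
    # words sort case-insensitively (rule 3).
    if NUMERIC.match(tok):
        return (0, Decimal(tok), '')
    return (1, Decimal(0), tok.lower())

def line_key(tokens):
    # Python compares lists lexicographically, with a strict prefix
    # sorting first, which is exactly the rule described above.
    return [token_key(t) for t in tokens]
```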

### Submission

If you want me to grade your answer, email it to me at [email protected]. If it’s not obvious how to run it, please include instructions. I will need to be able to run it on a linux VM. Your email should include:

1. The source for your solution, either attached or as a link
2. Whether you want to appear in the hall of fame below
3. If so, how you want to be cited (pseudonym, full name, etc. I’m also happy to include a link)
4. Whether you’re OK with me using your source code as an example in follow up posts (I will assume you are unless you explicitly say you are not)
5. Roughly how long you spent on the problem (I won’t publish this except in aggregate, it’s mostly just for my information)

Answers are graded on the following scale:

- F: You have suffered from a serious failure to read the spec
- C: You got some of it right, but there are significant omissions
- B: You have mostly got it right, but you missed some edge cases
- A: You have passed every test case I can throw at it
- A+: And you implemented an external sort or otherwise did something clever. Go you!

(Note that there are no rewards for cleverness if you haven’t got the basic problem right. Such is life)

### Hall of fame

(In order of solutions coming in)

1. Dave Stark with a grade of B, then an A. Bonus kudos also for the fact that his solution uncovered a bug in my reference solution
2. Alexander Azarov with a grade of B. Also his first solution uncovered some ambiguities in my spec
3. Kat (a different one than I’ve referenced here before). B on her first try followed by an A (she got all the hard edge cases right the first time but was tripped up by an annoying one). Also kudos for pointing out a lot of spec ambiguities.
4. Eiríkr Åsheim too gets kudos for the first solution which got everything right first time. Also for rolling his own IO code and finite state machine.

Feel free to point out problems in the spec. Note that a lack of detail is not a problem, but ambiguity is.

Do not post solutions in the comments. I will delete your comment. I will also delete or edit any comment I think gives too much away.

Also, comments of the form “What a stupidly easy test” will be deleted unless you have submitted a solution that was graded A.

# A (rewritten) manifesto for error reporting

So I wrote A manifesto for error reporting. I stand by it entirely, but it did end up more of a diatribe than a manifesto, and it mixed implementation details with end results. This post contains largely the same information but with less anger and hopefully clearer presentation.

### The Manifesto

This is a manifesto for how errors should be reported by software to technical people whose responsibility it is to work with said software – it is primarily focused on the information that programmers need, but it’s going to be a lot of help to ops people as well. The principles described in here apply equally whether you’re writing a library or an application. They do not apply to how you should report errors to a non-technical end-user. That’s an entirely different problem.

This is primarily about how errors appear when represented in text formats – either through some sort of alerting mechanism or logs. It doesn’t cover more advanced tools like debuggers and live environments. Textual reports of errors are a lowest common denominator across multiple languages and are important to get right even if you have better tools.

The guiding principle you should follow is that the way you report errors is important, and that you should think as carefully about how you convey information in failure cases as how you behave in non-failure cases. A moderate amount of careful forethought at this point can prevent a vast amount of effort and frustration at a later date.

In particular, when crafting software you should think about what information the person who is attempting to debug the problem is going to need. This information primarily takes three forms:

1. What, specifically, is the problem that occurred?
2. What has triggered this problem?
3. Where in the code has this problem occurred?

If you bear these three questions in mind, and make sure to provide enough information to answer them, it will stand you in good stead. What follows is some specific advice for helping people answer these questions.

#### Be as specific as you can in your error messages

Your error messages should not be too long – a sentence is typically more than enough. They should however be descriptive, and tell you what happened.

A bad example of an error message is:

Invalid state


Better is:

Transaction aborted


Better yet:

Cannot commit an aborted transaction


Rather than merely telling you that the state is invalid, the error message tells you which invalid state you were in and what it is preventing you from doing.

#### Error messages should contain pertinent information about the values that produced them

This is not a good error message:

Index out of range


This is:

Index 8 is out of range for array of length 7


You could also do

Index 8 is out of range for array [1,2,3,4,5,6,7]


The problem with this is that if the array gets very large then so does the error message. So while error messages should contain information about the values that generated them, they do not need to contain the entire value: only enough information about it to say why it triggered this error.
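For example, a sketch in Python of constructing such a message at the point of failure (the function and message here are illustrative):

```python
def get_item(items, index):
    # Include the values that triggered the failure, but summarize the
    # container rather than dumping it wholesale.
    if not 0 <= index < len(items):
        raise IndexError(
            "Index %d is out of range for array of length %d"
            % (index, len(items))
        )
    return items[index]
```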

Another example of an error message that should not be generated from the exact value:

Failure to process credit card number XXXX XXX XXX XXX


Even ignoring the specific laws around processing credit card numbers, you should obviously not be logging confidential or secret information about users like this.

So there are reasons why your error messages can’t always sensibly contain the full values that triggered them. That being said, it’s much easier to recreate a problem if you can recreate the exact value, so it’s a good default to include more rather than less, and you should certainly be including some.

#### Error messages should locate where in the code they occur

In an ideal world, every error message would come with a complete stack trace giving the exact chain of calls it went through to get there. If absolutely necessary, and if you’re generating good and expressive error messages, it can be sufficient to include just the file and line number where the error occurred, but this is not perfect and gives you much less information about how the problem was triggered.

The reason this is so important is that determining where the problem occurs in code is one of the first steps of any debugging process, so you can save a lot of time and effort for the person debugging by doing this for them at the point of the error.

In most languages if you are using exceptions, you get pretty close to this by default. On POSIX systems in C or C++ you can apparently also do this with the backtrace function.

Additionally, you should make a best effort to include stack traces when crossing process boundaries through RPC mechanisms: If a remote procedure can reasonably report a stack trace, it should report a stack trace and you should include that in your error report.

#### You should not mask lower level errors

It is common to wrap lower level errors in high level ones. It is also common to alter the display of errors in code you’re calling – e.g. in testing frameworks.

When you do either of these things, the golden rule you must follow is that you should not remove information from the lower level errors, as they may be the most informative evidence the developer debugging the problem has about what actually went wrong.

In particular, if you are rethrowing exceptions you need to take steps to ensure that you include the original stack trace and error message (in many languages it is possible to alter the stack trace of the exception you’re throwing, and you can use this to chain the stack traces together).
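In Python, for example, `raise ... from` does this chaining for you (the ConfigError wrapper here is hypothetical):

```python
import json

class ConfigError(Exception):
    pass

def load_config(text):
    # `raise ... from e` attaches the low-level error as __cause__,
    # so its message and traceback survive the rethrow.
    try:
        return json.loads(text)
    except ValueError as e:
        raise ConfigError("Could not parse configuration") from e
```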

Additionally you should never remove stack trace elements for display (it is acceptable to e.g. compress repeated adjacent lines into a single one with a counter for repetitions: it’s OK to change the display, but not to remove information).

#### Error conditions should not be covered up

It is often tempting to believe that it is the code’s responsibility to attempt to cover for an error and keep on working regardless. Sometimes this is even viable and true. Sometimes however an error is more likely to be a sign of developer error which should be addressed sooner rather than later, and even when it is not an obvious developer error it is likely a symptom of something going genuinely wrong.

As a consequence, unless an error condition is genuinely routine (a rough rule of thumb here would be “can reasonably be expected to happen multiple times a day, and we’re not going to do anything about that”) it should be reported. It is fine for the code to recover from the error and attempt to proceed regardless, but the error needs to be logged. Even if it’s not a problem that needs fixing, it may end up being symptomatic of other problems.

#### Errors should be reported when you enter an invalid state, not just when you attempt to operate whilst in one

One of the most common errors to see in a Java application is a NullPointerException. In Ruby it’s similarly common to see a NameError or a NoMethodError.

Inevitably this is because a value has been allowed to enter somewhere that it shouldn’t be permitted.

Other forms of invalid state are also possible, but they basically all come down to the same thing: your error is not caused by what you are currently doing, it is caused by what has come before. Your debugging now has to backtrack to find the point at which the object was put into an invalid state, because where the error appears to occur is of no help to you.

The solution to this is to validate your state when it changes: If data is only permitted to be within a certain range of values, check that it belongs to that range of values when you set or change it. This means that the problem will be caught at the point where it occurs rather than the point where it causes problems.
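For example, in Python a property setter gives you a natural place to do this (the Account class is illustrative):

```python
class Account:
    def __init__(self, balance):
        self.balance = balance  # routed through the validating setter

    @property
    def balance(self):
        return self._balance

    @balance.setter
    def balance(self, value):
        # Validate at the point of mutation, so a bad value fails here
        # rather than at some unrelated later operation.
        if value < 0:
            raise ValueError("Balance may not be negative, got %r" % (value,))
        self._balance = value
```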

In summary: