
On the new data generation in hypothesis

I mentioned in the Hypothesis short term road map that one of the things I wanted to do was improve the data generation.

What’s wrong with the data generation as it currently stands?

Well, in classic quickcheck style testing basically what you have is a size parameter that controls the shape of the distribution. A large value for the size parameter results in a more variable distribution – larger elements, longer lists, etc.

How, given this, would you generate a list of more than 100 floating point numbers in which every element had the same sign?

Assume your floating point generator has probability p of producing a non-negative number, with 0 < p < 1. Then the probability of a list of at least 100 elements all having the same sign is at most max(p^100, (1-p)^100). Or, as I like to call it, as close to zero as makes no difference.

Hypothesis has historically solved this by introducing into the mix a notion of flags, which are boolean variables that can be turned on and off for a whole data generation run. For example, whether negative floating point numbers are allowed is a flag, which means your list has a 50/50 chance of containing only non-negative numbers. This idea was inspired by John Regehr’s post “Better Random Testing By Leaving Features Out”.

But it’s not quite enough. What if, for example, you want to generate a long list with a low average value? Or a high one? Depending on how you use the size parameter, one of these will be hard to do (in Hypothesis it was the latter – the size was split amongst the elements, so the elements of a long list would tend to be much smaller).

It’s also very non-orthogonal. There are these two different notions of parameter which are similar but not really the same, and don’t share any implementation.

So what I wanted to do (and now have done! It proved surprisingly easy) is to introduce a common notion of parameter which unified these two.

A parameter is just any sort of distribution you can draw from. It can contain any sort of values – booleans, floats, integers, composite values, etc. Each strategy has a parameter you can draw from, and data generation then occurs by first drawing a parameter value and then drawing data conditional on that parameter value.

A list then has a composite parameter: one component is its average length – lengths are drawn from a geometric distribution with that average – and the other is whatever parameter its elements take.

The reason this is importantly different from just drawing from the marginal distribution is that all of the draws then use the same parameter value. So for floating point numbers we have one parameter which controls the sign of the value (it can be always positive, always negative or either) and parameters that control the shape of the distribution given that sign. This means that when generating a list, the chances of all elements being high or all elements being low are much better (it does mean that the chances of getting a mix are lower, but you can fix that with a more complicated parameter shape).
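
To sketch the idea in code – the names and distributions here are purely illustrative, not Hypothesis’s actual API:

```python
import random

def draw_float_parameter(rnd):
    # Drawn once per run: a sign constraint plus a scale.
    return {
        "sign": rnd.choice([1, -1, None]),  # always +, always -, or mixed
        "scale": rnd.expovariate(1.0) + 0.1,
    }

def draw_float(rnd, param):
    value = rnd.expovariate(1.0 / param["scale"])
    if param["sign"] is None:
        return value if rnd.random() < 0.5 else -value
    return param["sign"] * value

def draw_list_parameter(rnd):
    # Composite parameter: an average length plus the element parameter.
    return {
        "average_length": rnd.expovariate(0.1),
        "element": draw_float_parameter(rnd),
    }

def draw_list(rnd, param):
    # Length is geometric with the given mean; crucially, every element
    # is drawn from the *same* element parameter, so e.g. an all-positive
    # list of 100 elements is no longer vanishingly unlikely.
    p = 1.0 / (1.0 + param["average_length"])
    length = 0
    while rnd.random() > p:
        length += 1
    return [draw_float(rnd, param["element"]) for _ in range(length)]
```

Fixing the element parameter’s sign for a whole run makes an all-positive long list routine, where drawing each sign independently would make it a probability-2^-100 event.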

The overall effect is that this allows you to generate much more interesting large structured examples. With the classic approach you end up with a sort of curse of dimensionality where everything looks pretty flat because the differences average out.

Another nice perk of this is that you can use the parameters to guide rejection better. A test in hypothesis can essentially fail for two reasons: One is that it’s a legitimate failure and the other is that a precondition has not been met. Generally speaking when the latter happens you don’t really want to generate a whole bunch of data like that.

To continue with our sign example, suppose I want to generate lists of non-negative numbers. I could do this by defining a special strategy for non-negative floats, but that’s an annoyingly large amount of work (I’d like to make it less work and have some ideas around that, but there will always be cases like this). It would be nice if I could do things like assume(all(x >= 0 for x in xs)) and have any chance of succeeding for large lists.

Well, with the new system it will succeed for about half of parameter values. So the logic becomes: generate a parameter and try some number of draws of data from it. Eventually draw a new parameter anyway, but if the rejection rate is high, draw a new parameter much sooner.
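
The redraw logic might look something like this (a sketch with made-up names, not Hypothesis’s real internals):

```python
import random

def generate_examples(draw_parameter, draw_value, precondition,
                      n_examples=100, max_rejections=5):
    # Keep drawing values from the current parameter while it mostly
    # satisfies the precondition; redraw the parameter as soon as the
    # rejection count gets too high.
    rnd = random.Random(0)
    examples = []
    param = draw_parameter(rnd)
    rejections = 0
    while len(examples) < n_examples:
        value = draw_value(rnd, param)
        if precondition(value):
            examples.append(value)
        else:
            rejections += 1
            if rejections >= max_rejections:
                # This region of parameter space mostly fails the
                # precondition: move on rather than rejecting forever.
                param = draw_parameter(rnd)
                rejections = 0
    return examples
```

With a sign parameter and a precondition like all(x >= 0 for x in xs), about half of parameters succeed immediately, and the bad half get abandoned after a handful of rejections.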

This has all landed in master as of yesterday evening, and I’m pretty pleased with it so far. I’m going to do a bit more work on hypothesis over the next few days, but expect to see it in a released version near you quite soon.

This entry was posted in Hypothesis, Uncategorized on by .

Hypothesis short term road map

As some of you might be aware, I authored a python testing library called hypothesis.

It’s basically quickcheck for Python. I first wrote it within about a month of my learning python, so it has some… eccentricities (I still maintain that the fact that it has its own object model with multiple dispatch is totally legit, but I admit that the prototype based inheritance feature in it was probably misguided), but I flatter myself into thinking it’s not only pretty good but is by some small margin possibly the most advanced library of its ilk.

It’s also pretty neglected. I haven’t actually released a version of it in just over a year. Part of this is because I was pretty excited about testmachine and was considering it as the successor to hypothesis, but then I didn’t do any work on testmachine either. Basically, for a variety of reasons 2014 was a pretty rough year and I didn’t get much done on my open source work during it.

Despite this neglect, people seem to have started using hypothesis. I’ve got a bunch of issues opened, a few pull requests, and pypi stats are telling me it’s had more than 500 downloads in the last month.

And so, I appear to have failed to avoid success at all costs, and will instead embrace success and resume development on hypothesis.

So here’s the plan.

In the next few days there will be a point release of hypothesis. It will not be large – it will clear up a few bugs, maybe have one or two minor enhancements, improve python 3 support, and generally demonstrate that things are moving again.

In the next month or two there are a variety of larger features I want to work on. Some of them are pretty damn exciting, and if I get all of them working then I’m going to remove the weasel words from the above and simply say, flat out, that hypothesis will be the most advanced library of its kind.

In rough order of least to most exciting, I’m going to be working on:

Logging

This is super unsexy and to be honest I will probably not be inspired enough to work on it immediately, but I very much want it to get done as it’s both important and something people have actually asked for: Hypothesis needs to report information on its progress and what it’s tried. It’s important to know what sort of things hypothesis has tried on your code – e.g. if it’s only managed to generate a very small number of examples.

Some sort of top level driver

Hypothesis is right now fundamentally a library. If you want to actually use it to write tests you need to use it within pytest or similar.

Continued usage like this will 100% continue to be supported, encouraged and generally considered a first class citizen, but I would like there to be some sort of top level hypothesis program as well so you can get a more tailored view of what’s going on and have a better mechanism for controlling things like timeouts, etc.

Improved data generation

The data generation code is currently both ropy and not very intelligent. It has two discrete concepts – flags and size – which interact to control how things are generated. I want to introduce a more general and unifying notion of a parameter which gives you much more fine grained control over the shape of the distribution. This should also improve the coverage a lot, make this more easily user configurable, and it may even improve performance because currently data generation can be a bit of a bottleneck in some cases.

Merging the ideas from TestMachine

Testmachine is pretty awesome as a concept and I don’t want it to die. It turns out I have this popularish testing library that has a lot in common with it. Let’s merge the two.

One thing that I discovered with TestMachine is that making this sort of thing work well with mutable data is actually pretty hard, so this will probably necessitate some improved support around that. I suspect this will involve a bunch of fixes and improvements to the stateful testing feature.

Remembering failing test cases

One of the problems with randomized testing is that tests are inherently flaky – sometimes a passing test will become a failing test, and sometimes vice versa, without any changes to the underlying code.

A passing test becoming a failing test is generally fine in the sense that it means that the library has just found an exciting new bug for you to fix and you should be grateful.

A failing test becoming a passing test on the other hand is super annoying because it makes it much harder to reproduce and fix.

One way to deal with this is to have your library generate test cases that can be copied and pasted into your non-randomized test suite. This is the approach I took in testmachine, and it’s a pretty good one.

Another approach that I’d like to explore instead is the idea of a test database which remembers failing test cases. Whenever a small example produces a failure, that example should be saved and tried first next time. Over time you build up a great database of examples to test your code with.

This also opens the possibility of giving hypothesis two possible run modes: One in which it just runs for an extended amount of time looking for bugs and the other in which it runs super quickly and basically only runs on previously discovered examples. I would be very interested in such an approach.

Support for glass-box testing

Randomized property based testing is intrinsically black box. It knows nothing about the code it’s testing except for how to feed it examples.

But what if it didn’t have to be?

American Fuzzy Lop is a fuzz tester, mostly designed for testing things that handle binary file formats (it works for things that are text formats too but the verbosity tends to work against it). It executes roughly the following algorithm:

    1. Take a seed example.
    2. Mutate it a bit.
    3. Run the mutated example through the program in question.
      1. If this produces bad behaviour, output it as a test case.
      2. If this produces a new interesting state transition in the program – where a state transition is a pair of positions in the code, one immediately following the other in this execution – add it to the list of seed examples.
    4. Run ad infinitum, outputting bad examples as you go.

This produces a remarkably good set of examples, and it’s 100% something that hypothesis could be doing. We can detect state transitions using coverage and generate new data until we stop getting new interesting examples. Then we can mutate existing data until we stop getting new interesting examples from that.
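
As a toy sketch of that loop – purely illustrative: here the “program” simply returns a list of branch labels, where real tools would get transitions from instrumentation or coverage data:

```python
import random

def fuzz(program, seed, mutate, n_iterations=1000):
    # program(example) returns a trace of branch labels, or raises on a bug.
    rnd = random.Random(0)
    seen_transitions = set()
    failures = []
    queue = [seed]
    for _ in range(n_iterations):
        example = mutate(rnd, rnd.choice(queue))
        try:
            trace = program(example)
        except Exception:
            failures.append(example)       # bad behaviour: report it
            continue
        transitions = set(zip(trace, trace[1:]))
        if not transitions <= seen_transitions:
            seen_transitions |= transitions
            queue.append(example)          # new behaviour: keep as a seed
    return failures, queue
```

The queue only grows when an example exercises a transition nobody has seen before, which is what keeps the search moving towards new behaviour instead of resampling the same flat region.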

This sort of looking inside the box will allow one to have much greater confidence in hypothesis’s ability to find interesting failures. Currently it just executes examples until it has executed a fixed number or runs out of time, which is fine but may mean that it stops too early.

The existing behaviour will remain as an option – initially it will stay the default, but once the glass-box approach is robust enough it will become the default instead. The old behaviour will remain useful for cases where you want to e.g. test C code and thus can’t get reliable coverage information.

This would also work well with the test database idea, because you could prepopulate the test database with minimal interesting examples.

Probably some other stuff too

e.g. in the above I keep getting the nagging feeling that hypothesis needs a more general notion of settings to support some of these, so I will likely be doing something around that. There’s also some code clean up that really could use doing.

It’s currently unclear to me how long all of this is going to take and whether I will get all of it done. Chances are also pretty high that some of these will turn out to be bad ideas.

If any of you are using or are considering using hypothesis, do let me know if any of these seem particularly exciting and you’d like me to work on them. I’m also open to suggestions of other features you’d like to see included.
Extending a preorder efficiently and correctly

This is a thing that shouldn’t be hard to do and yet I manage to get it wrong every single time. I’m writing this up partly to try to save you from my fate and partly in the hope that if I write it up I will remember this next time.

Consider the following problem: you have a matrix of booleans A. It currently stores a relation such that:

  1. It is reflexive: for all t, A[t, t] is true
  2. It is transitive: for all t, u, v, if A[t, u] and A[u, v] are both true then A[t, v] is true.

(the first condition isn’t strictly necessary for this, but it makes the whole thing easier, and if you have something satisfying the second condition but not the first you can just set A[i, i] to true for all i and get a relation satisfying both properties)

Such a relation is called a preorder. The main interesting special cases of preorders are equivalence relations and partial orders. This is mostly interesting for things that look more like partial orders than equivalence relations, as equivalence relations admit a much more efficient storage format and corresponding algorithms (you just store a single array of length n labelling each element with a number identifying its equivalence class).

Now, you pick some pair i != j such that neither A[i, j] nor A[j, i] is true. You want to set A[i, j] to true while maintaining the above invariants.

There is a simple linear time algorithm for this. It’s efficient, easy to understand and entirely wrong, yet somehow I keep falling for it:

def wrong_update(A, i, j):
    # Propagate the new pair (i, j) one step in each direction.
    n = A.shape[0]
    for k in range(n):
        if A[k, i]:
            A[k, j] = True
        if A[j, k]:
            A[i, k] = True

Why is this wrong? It’s wrong because it misses out on pairs u, v where initially A[u, i] and A[j, v] but not A[u, v].

In fact, seeing that problem makes it relatively easy to see that there can’t possibly be an algorithm for this that does better than quadratic time.

Let n = 2k. Construct the matrix as follows: A[i, j] = (i < j) and ((i < k) == (j < k)).

That is, we’ve split the range of indices into two halves. Each half is the usual total order, but there is no relation between things in either half.

Now suppose we want to set A[k-1, k] to be true. For u < k we have A[u, k-1], and for v >= k we have A[k, v]. So transitivity requires us to set A[u, v] for any u < k and v >= k. This requires us to change k^2 = n^2 / 4 = O(n^2) elements of A from false to true.

We can in fact do it in O(n^2). Consider the following algorithm:

def correct_update(A, i, j):
    # Anything that reaches i must now reach anything reachable from j.
    n = A.shape[0]
    for u in range(n):
        for v in range(n):
            if A[u, i] and A[j, v]:
                A[u, v] = True

Does this produce a transitive relation?

For convenience of analysis, instead consider the version which produces a matrix B such that B[u, v] is true if A[u, v] is true or if both A[u, i] and A[j, v] are. This is effectively the algorithm above followed by assigning A = B (note: these are only equivalent given the result that B is transitive, so that any changes we’ve written to A during iteration can’t affect the result).

Let’s check that B has all the desired properties.

Certainly B[t, t] is true because A[t, t] is true.

Suppose B[t, u] and B[u, v]. There are four cases to consider based on where these come from.

  1. If A[t, u] and A[u, v] then A[t, v] by transitivity of A. Thus B[t, v]
  2. If A[t, u] and not A[u, v] then we have A[t, u], A[u, i], A[j, v]. So A[t, i] and A[j, v], and thus B[t, v].
  3. If not A[t, u] and A[u, v] then A[t, i] and A[j, u]. So A[j, v] by transitivity and thus B[t, v] again by definition of B.
  4. Finally if not A[t, u] and not A[u, v] then A[t, i], A[j, u], A[u, i], A[j, v]. Therefore by transitivity of A we have A[j, i]. This contradicts our precondition that neither A[i, j] nor A[j, i], so this case is impossible.

So this works, which is nice.
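
Both halves of the argument are easy to check mechanically. A small script (using plain nested lists rather than a numpy matrix so that it is self-contained; the two update functions are restated from above):

```python
import itertools

def wrong_update(A, i, j):
    # The tempting linear-time version.
    n = len(A)
    for t in range(n):
        if A[t][i]:
            A[t][j] = True
        if A[j][t]:
            A[i][t] = True

def correct_update(A, i, j):
    # The quadratic version from the text.
    n = len(A)
    for u in range(n):
        for v in range(n):
            if A[u][i] and A[j][v]:
                A[u][v] = True

def is_transitive(A):
    n = len(A)
    return all(not (A[t][u] and A[u][v]) or A[t][v]
               for t, u, v in itertools.product(range(n), repeat=3))

def two_chains(k):
    # A[i][j] = (i < j) and ((i < k) == (j < k)), plus the diagonal.
    n = 2 * k
    return [[i == j or (i < j and (i < k) == (j < k))
             for j in range(n)] for i in range(n)]

k = 4
A = two_chains(k)
B = two_chains(k)

wrong_update(B, k - 1, k)
assert not is_transitive(B)  # misses the pairs u < k, v >= k

before = sum(map(sum, A))
correct_update(A, k - 1, k)
assert is_transitive(A)
assert sum(map(sum, A)) - before == k * k  # the quadratic blow-up
```
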

Can we do better?

Well, as established, sometimes we have to do O(n^2) operations, so we can’t asymptotically do better in the worst case. But if the matrix is sparse we can do the whole thing in O(n + gh) time with an additional O(g + h) of space, where g = |{k such that A[k, i]}| and h = |{k such that A[j, k]}|, by first precomputing the two index lists.

def faster_correct_update(A, i, j):
    # Precompute the two index lists, then fill in their product.
    n = A.shape[0]
    left = [k for k in range(n) if A[k, i]]
    right = [k for k in range(n) if A[j, k]]
    for u in left:
        for v in right:
            A[u, v] = True

Other minor variations:

  1. You can do the first step in one pass, if you are willing to spend the full O(n) space, by allocating a vector of size n and writing the k with A[k, i] in from the left and the k with A[j, k] in from the right.
  2. If you don’t want to spend the extra space you can do this in O(n + n min(g, h)) by using the earlier version, picking whichever side is smaller as the outer loop, and skipping the inner loop when the outer variable doesn’t satisfy the relevant condition.

Another thing worth noting is that if A is anti-symmetric (for any u != v, not (A[u, v] and A[v, u])) then so is the resulting relation. i.e. if A encodes a partial order then so does B. Suppose B[u, v] and B[v, u]. Consider two cases:

  1. If one of these is from A, assume without loss of generality that A[u, v]. Then we know that not A[v, u], because A is anti-symmetric, so we must have A[v, i] and A[j, u]. By transitivity A[u, v] and A[v, i] give A[u, i], and combined with A[j, u] this gives A[j, i], contradicting the precondition on i and j.
  2. If neither of them are from A then we have A[u, i] and A[j, v], A[v, i] and A[j, u]. So transitivity again gives us A[j, i], another contradiction.

It turns out integer linear programming solvers are really good

In the comments to my last post, pozorvlak pointed out that what I was describing was awfully close to the branch and bound algorithm that an integer linear solver would be using anyway and asked if I had tried just adding integer constraints to the LP.

I did actually already know that what I was doing was a form of branch and bound. I’d come up with it independently before realising this, but alas there are no prizes for reinventing a 50 year old algorithm. The question was whether the way I was applying it was a good one or if I’d be better off just using the standard one in the ILP solver. So how do they compare?

Well, uh, there’s a funny story on that one which explains why I hadn’t tried just adding integer constraints to the LP. I actually tried it ages back, found it intolerably slow, and gave up on it as a bad idea.

Basically I only realised quite late how effective the LP was even at large sizes, because I spent the first chunk of working on this problem thinking that the LP couldn’t possibly scale to more than about 10 elements, even without the integer constraints.

The reason for this is that I was using PuLP as an interface to the LP solver with its GLPK bindings. This gave me huge performance problems – it’s probably fine if you’re setting up an LP once or twice, maybe, but its performance at building the LP is terrible. It’s several orders of magnitude slower than actually running the LP.

So, I ripped out PuLP and wrote my own interface to lp_solve, because it was easy enough to just generate its text format (thanks to Bill Mill for the suggestion). Even that was quite slow at large \(n\) though – because the size of the LP is \(O(n^3)\), you end up generating and parsing several MB of text to do a decent job of it. So even then the size of the LP I could hand off to it was strictly bounded.

Anyway, to cut a long story short, I’m now using the C API to lp_solve directly so setting up the model has become a lot cheaper (it’s still surprisingly slow for n bigger than around 30, but not intolerably so) and yeah it turns out that lp_solve’s integer constraints scale a lot better than my custom tailored branch and bound algorithm does.

Moreover, after a while tweaking things and trying various clever stuff and huffily insisting that of course my approach was better and I just needed to optimise it more, I think I’ve convinced myself that the ILP branch and bound algorithm is intrinsically better here, for a reason that is both straightforward and kinda interesting.

The bit about it that is somewhat interesting is that it seems empirically that the LP that results from FAST in some sense “wants” to be 0/1 valued. The vast majority of the time the solution you get out of the relaxed LP is just integer valued automatically and you can call it a day as soon as you get the answer back from it.

The interesting thing is what happens when you don’t. You then have to branch, compute a new LP for each branch, and see what happens there. If each of those sub-problems is integral (or costs at least as much as your current best integer solution) then you’re done, else you’ll have to recurse some more.

And when this happens, my solution branches to n sub problems and the normal ILP branch and bound branches to two. If each is equally likely to produce integer valued sub-problems then the latter is obviously better.

My solution could be better if it chooses sub-problems that are substantially more likely to be integer valued. However this doesn’t seem to be the case – empirically it seems to almost always need to go at least two levels deep if it doesn’t find an integer solution immediately. In fact, it seems plausible to me that the greater freedom in choosing how to branch available to the ILP solution is more likely to let it achieve integer valued sub-problems. I don’t know if it does this, but it could just choose n/2 pairs to branch on, see if any of them produce both sides as integers and then be done if they do. At that point it would have calculated the same number of LPs as my solution but have had a substantially higher chance of not having to recurse further.

So, in conclusion, the better algorithm has won. It was a fun try in which I learned a few things but ultimately as a research avenue not very productive. The only thing from it that might be interesting to pursue further is something I actually didn’t mention in the last post, which is how to use the relaxed LP to get a good heuristic ordering, but I suspect that the quality of that ordering is directly correlated with how easy it is to just get the right answer from the ILP solver so even that is probably a dead end.


Optimal solutions to FAST

This is another post about FAST which I mentioned in the last post and provided some heuristic solutions to. In this I’d like to talk about finding exact solutions to it.

Note: There’s a lot of literature about this and I am only starting to get to grips with it (I have a book on order which will help with that). What I’m presenting does not I think add anything very new to the state of the art except that it was a lot simpler for me to understand than anything equivalently good I could find.

My not very well optimised python version of this can handle \(20 \times 20\) matrices pretty quickly (a couple of seconds) and \(25 \times 25\) in under a minute. This is actually a pretty good achievement. I think a reasonably well written C or C++ version that wasn’t committing atrocities in how it communicated with the LP solver could easily be an order of magnitude and possibly two faster, which might bring things up to about \(30 \times 30\) within feasible reach.

Also, if you want to see code instead of maths, you can just skip straight to the repo where I’ve implemented this solution (plus a few details not mentioned here).

There are a couple ideas in this solution:

Firstly, note that we only need to find the weight of the optimal solution. If we can do this then we can reconstruct an optimal solution with only \(O(n^2)\) additional calls (and because of the way this solution will use dynamic programming, this will actually be a lot faster than it sounds): we can find the head of the solution by looking at the weights of optimal solutions for each of the subsets of size \(n - 1\) and then recursively finding an ordering for the best subset. So specifically we want the function \(\mathrm{FasWeight}(A) = \min\limits_{\sigma \in S_n} w(A, \sigma)\).

Henceforth, let \(A\) be fixed. We can also assume without loss of generality that \(A\) is non-negative and integer valued, so we will, because it makes our life easier.

The starting point for our solution will be the following recursive brute force implementation of the function \(\mathrm{FasWeightSubset}(U)\) where \(U \subseteq N = \{1, \ldots, n\}\).

If \(|U| \leq 1\) then \(\mathrm{FasWeightSubset}(U) = 0\)

Else \(\mathrm{FasWeightSubset}(U) = \min\limits_{x \in U} \left[ \mathrm{FasWeightSubset}(U - \{x\}) + \mathrm{HeadScore}(x, U) \right]\)

Where \(\mathrm{HeadScore}(x, U) = \sum\limits_{y \in U, y \neq x} A_{yx}\).

What we’re doing here is conditioning, for each element \(x\) of \(U\), on whether it is at the front of the list. If it is, then the optimal score is the optimal score for ordering \(U\) without \(x\), plus the cost of all the backward arcs into \(x\). We take the minimum over all \(x\) and get the desired result.

If one uses dynamic programming to memoize the result for each subset (which you really have to do in order for it to be remotely feasible to run this) then this is an \(O(n^2 2^n)\) algorithm. It also uses rather a lot of memory (\(\sum\limits_{k \leq n} k {n \choose k}\), which works out to \(n 2^{n-1}\) – i.e. lots).
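
The recursion translates almost directly into code. A sketch – not the optimised version – assuming \(A\) is given as a list of lists of non-negative integers with A[y][x] the cost of the arc from y to x when it points backwards, and with subsets encoded as bitmasks:

```python
from functools import lru_cache

def fas_weight(A):
    n = len(A)

    def head_score(x, members):
        # Cost of putting x first: every arc from the rest of the
        # subset into x becomes a backward arc.
        return sum(A[y][x] for y in members if y != x)

    @lru_cache(maxsize=None)
    def subset_weight(U):
        # U is a bitmask encoding a subset of {0, ..., n - 1}.
        members = [x for x in range(n) if U & (1 << x)]
        if len(members) <= 1:
            return 0
        return min(subset_weight(U & ~(1 << x)) + head_score(x, members)
                   for x in members)

    return subset_weight((1 << n) - 1)
```

Sanity checks: a unit-weight 3-cycle must pay for at least one backward arc, and a transitive tournament costs nothing.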

The idea for improving this is that if we have some fast way of calculating a lower bound for how good a subset could possibly be, then we may be able to skip some subsets without doing the full calculation. E.g. a trivial lower bound is \(0\): if a subset would need a negative score to improve the current best then we can disregard it. A slightly less trivial lower bound is \(\sum\limits_{x, y \in U, x < y} \min(A_{xy}, A_{yx})\), because for each pair of indices you must pay the cost of an arc in at least one direction. It turns out that neither of these bounds is good enough, but we’ll skip the question of what is. For now, assume we have a function \(\mathrm{LowerBound}(U)\) such that every ordering of \(U\) has weight \(\geq \mathrm{LowerBound}(U)\).

We can now use this to define the function \(\mathrm{CappedFasWeightSubset}(U, t)\). This will be equal to \(\min(\mathrm{FasWeightSubset}(U), t)\), but we can offer a more efficient algorithm for it:

For \(U\) with \(\mathrm{LowerBound}(U) \geq t\), let \(\mathrm{CappedFasWeightSubset}(U, t) = t\).

Otherwise, \(\mathrm{CappedFasWeightSubset}(U, t) = \min\limits_{x \in U} \left[ \mathrm{CappedFasWeightSubset}(U - \{x\}, t - \mathrm{HeadScore}(x, U)) + \mathrm{HeadScore}(x, U) \right]\)

At each recursive call, placing \(x\) at the head incurs a certain cost and thus lowers the threshold we pass to the tail.

In fact we can do better than this: given any known good solution lower than that threshold, we can lower the threshold to that value, as we’re only interested in whether we can improve on it. This also means that we can use the running minimum of the sub-problems to reduce the threshold. I won’t write that out in the current notation because it’s a pain. It’ll be clearer in the code though.

This is basically it. There are a few minor optimisations, and how to memoize the results may not be obvious (you don’t want the threshold in the cache key or you’ll get zero hits), but there’s not much more to it than that.

However: this works about as badly as the earlier version if your lower bound isn’t good enough – even with a reasonably good one I found I was only cutting out maybe 70% of the subsets, which was nice but not actually worth the cost of calculating the bound.

So where do we get better lower bounds?

Well, it turns out that there’s another formulation of the problem: you can treat it as an instance of integer linear programming. It requires \(O(n^2)\) variables and \(O(n^3)\) constraints, so the integer program rapidly becomes too difficult to solve directly, but its linear relaxation (i.e. the same program without the integer constraints) provides a good lower bound.

The way this works is that we introduce 0/1 \(D_{ij}\) variables for each pair \(i \neq j\). \(D_{ij}\) will define a total ordering \(\prec\) of \(N\), with \(D_{ij} = 1\) if \(i \prec j\). This gives us the following ILP:

Minimize: \(\sum\limits_{i \neq j} A_{ji} D_{ij}\) (you pay the cost of the backwards arcs)

Subject to: \(\forall i \neq j, D_{ij} + D_{ji} = 1\) (the order is total – exactly one of \(i \prec j\) and \(j \prec i\) holds)

And: \(\forall i, j, k, D_{ij} + D_{jk} + D_{ki} \leq 2\) (the order is transitive)

Solving this linear program without the integer constraints then gives us a fairly good lower bound.

What’s more, it turns out that often the integer constraints happen to be satisfied anyway. This allows us a further shortcut: If the lower bound we found turns out to have all integer variables, we can just immediately return that as the solution.
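
For illustration, here is the relaxation set up with scipy.optimize.linprog (scipy is my assumption here – the post itself used lp_solve), including the check for an already-integral solution:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def fas_lp_lower_bound(A):
    # One variable D[i, j] per ordered pair i != j, relaxed from {0, 1}
    # to [0, 1]. Returns (lower bound, whether the optimum was integral).
    n = len(A)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    index = {p: k for k, p in enumerate(pairs)}
    c = [A[j][i] for (i, j) in pairs]  # pay the cost of the backward arcs

    # D_ij + D_ji = 1: the order is total.
    A_eq = []
    for i, j in itertools.combinations(range(n), 2):
        row = np.zeros(len(pairs))
        row[index[i, j]] = row[index[j, i]] = 1
        A_eq.append(row)

    # D_ij + D_jk + D_ki <= 2: the order is transitive.
    A_ub = []
    for i, j, k in itertools.permutations(range(n), 3):
        row = np.zeros(len(pairs))
        row[index[i, j]] = row[index[j, k]] = row[index[k, i]] = 1
        A_ub.append(row)

    res = linprog(c, A_ub=A_ub, b_ub=[2] * len(A_ub),
                  A_eq=A_eq, b_eq=[1] * len(A_eq), bounds=(0, 1))
    integral = all(abs(v - round(v)) < 1e-6 for v in res.x)
    return res.fun, integral
```

If the integral flag comes back true, the bound is the exact answer and there is nothing left to branch on.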
