Category Archives: programming

Writing libraries is terrible

My talk, Writing Libraries is Terrible, is now available online.

You can see the video on youtube, and the slides here.

Transcription

So, yeah. I’m here to tell you about writing libraries and why it’s terrible. And I’ve been basically working full time on a library for about the last year and a half, so I’ve got to experience all of this first hand.

Some caveats: I’m talking about my specific experience, so I’m talking about writing open source libraries. Closed source, proprietary, libraries, do exist. I have no experience with this at the moment, and my impression is that using those is terrible, so this is what the other end looks like.

I’m also talking specifically about my experience writing a Python library. I think very little of what I have to say is actually about Python, or Python specific, but that does colour the experiences. Different languages and different communities do respond differently.

Now, despite the title, I actually really like writing libraries. You get to focus on a very constrained problem, you get to do a lot of interesting things in the course of doing them. There’s really a lot to recommend writing libraries, and the process of writing them is enlightening and enjoyable.

But then people start using them, and that’s where things start to go a bit wrong.

This was originally going to be a talk full of technical war stories, and I’m really glad it no longer is because Jim completely topped any technical war stories I can give you, but I am going to start by just mentioning the single biggest technical problem that gives me a headache as a library author, which is that I currently support three operating systems officially and really dropping any one of these is completely non-optional [ed. I of course mean “supporting” rather than dropping], and on top of that I have to support a really large number of versions of python, and even with the ones I support people complain about the ones I don’t support. So apparently it’s really annoying that I don’t support Python 2.6 despite the fact that Python 2.6 has been end of lifed for a large number of years.

And in general, everyone’s computer is different, and as an application developer you have to deal with this and it turns out that as a library developer you have to deal with this more, because you don’t just have to worry about what’s on their computer you have to worry about the specific versions of different things. If you’re writing an application you can ship a language runtime with it and not really care which version they already have, there all sorts of advantages. I don’t get any of those.

OK, that’s all I want to say on the technical side of things, because it turns out that none of these are actually the really bad thing about writing libraries. The really bad thing about writing libraries is money.

Because, culturally we’ve just moved to this idea that of course libraries are things we get for free. Like I said, open source libraries are great to use, so of course everyone is using open source libraries. And you do get paid for open source projects, like a lot of the big ones in particularly are very well paid these days. Most people who work on Linux are paid to do so. But a lot of the open source ecosystem isn’t this sort of large open source project, it’s these single serving libraries that do one specific thing and people just install them, they don’t even really think about paying for them.

And what they pay for them instead is complaining. Like I said, people complain at me for not supporting Python 2.6, and I get some really entitled comments about it, basically just like “I’m a giant bank building legacy software in Python 2.6, making lots of money, why aren’t you supporting this thing for my for profit business for free?”, and this isn’t much fun to be on the receiving end of obviously, and it turns out to be quite hard to pay your bills with other peoples’ whining.

But actually none of this is the worst about not getting any money to work on these. Because it turns out that as well as sort of being intrinsically useful, being paid for your work gives you lots of advantages that not being paid for your advantages doesn’t have. One of the biggest, I think, under appreciated advantages of making money is that it gives you an amazingly good decision heuristic. Whenever you come to a choice, and you’ve got a choice where you’ve crossed off all the other directions – you’re ethically OK with both options, both options seem like good ideas, so on and so forth… which one of them is going to make you more money? Do I support this obscure version of Python? Well, am I going to make more money than it costs me in effort to support this version of Python? And in the course of design there are a million choices you can make in all these, and basically with open source, and with this sort of library, all that really guides your choice is do I feel like doing it and will I feel bad about not supporting this. And this turns out, if you’re like me, to be a great way of creating an amazingly large amount of work for yourself, and it’s really not obvious which of the work is important and which of the work is not important.

The other great thing about money is that you can hire people. And you do get contributions as an open source library. My project, Hypothesis, I have absolutely got contributions from people, and it’s great, but this is the amount of Hypothesis I’ve done [ed. I’m holding out my arms wide here] and this amount of Hypothesis literally everyone else has done [ed. My hands are close together now]. And I don’t hold that against them. Why should they? I’m not paying them, and basically everything they do for Hypothesis is a gift, and so I’m very grateful to those gifts, but at the same time, it’s still just me doing most of this work.

And what that means is that I have to be a lot of people. If you think about your normal software projects, you don’t just have developers, you certainly don’t just have one developer. That would be ridiculous, why would you do that? I mean, you’ve got not redundancy there, right? But you’ve also got project managers, you’ve got customer supports, you’ve got sales, you’ve got marketing, you’ve got technical writing, you’ve QA… I have to be all of those people. And… in some ways this is a good thing. I’ve really enjoyed learning a lot of these, and the best way to get a lot of appreciation for your colleagues is to try and do their jobs and discover that all the things that you thought were really easy and you thought your colleagues were kindof idiots and were just making a big deal of it… it turns out that actually people have specialised skills and some of those specialised skills are really hard. I’ve been doing a lot of attempts at marketing recently [ed. This slightly overstates the case. I’m talking mostly about hypothesis.works and some behind the scenes stuff, but I’ve been reading more] and it turns out that maybe those marketers did actually know something. This is not an opinion I would have had a year ago. But at the same time, this is really bad, because I’m not good at marketing and I’m doing the marketing for this project. I’m OK at writing, well I think I’m good at writing but I’m not good at technical writing. Documentation it turns out is an entire new skill on top of both technical understanding and writing. So I’m having to learn all of these because I don’t really have another way to do it, and I certainly can’t afford to pay anyone else to do it.

And this is all a massive problem and this isn’t going to change, because from the point of view of the people using our work, what we are basically doing is giving them money. We’re leaving this giant pile of cash on the ground and they’re just coming along and picking up because someone’s just left this giant pile of cash and obviously that means it’s free for use, right? If someone tells you that there’s this thing that is better than thing you were going to pay for and you can get it for free then of course you’re going to take it.

And this very short term thinking. As we saw in the lightning talk about CentOS yesterday, this will bite you, but by the time it’s bitten you we’ve already done all the work and it’s probably too late. So regardless of what a bad decision this may or may not be on the company’s side, this doesn’t actually help me. And this isn’t just about me obviously, this is about everyone else who also finds themselves in this position. But… in particular this is about me. And so this doesn’t help me.

And there’s really only one way this can change, which is my side and our side. At some point we’re going to have to start withdrawing our labour. If the problem is that people are taking our work and using it for free and this is really bad for us, maybe we could just… stop giving our work away for free? Is that a valid plan? It seems like a valid plan. Every time I try and explain my situation to my family they look at me like I’m crazy and ask me why I’m giving away stuff for free.

And I feel sad about this. I really like sharing my work. I’m not writing open source software because I want to make lots of money. My judgement isn’t quite that bad. I’m writing open source software because I want people to use it, and I want so share these ideas, and I really enjoy doing it.

So what I’m really hoping is that there’s some compromise we can do here. And… software licensing is a large part of this. I don’t know what part the GPL necessarily has to play in any eventual solution, but I do know that the MIT and BSD style solutions have nothing to play in this solution. They can go to hell as far as I’m concerned. And I think there’s more scope for improving the situation with licensing. And I don’t know what yet that’s going to be. I am talking to people to find out what it’s going to be. Because ultimately we do need to give people a better way to pay us, and we do need to still have ways of giving things away for free, and I don’t know where the middle ground is.

But… despite my very adversarial framing of this as us vs them, I don’t think us vs them is the solution either. Because as well as sharing, I also like making money, and I’m perfectly happy for other people to make money, and I think working together is the only way we can move forward. We have to start this, but we can’t finish this alone.

And ultimately as well as being better for us open source developers, I think it’s also going to be better for companies. Because when you think about it, 90% of the code you’re running in production has this label on it, and that’s probably a bad idea.

Thank you very much. Any questions?

[ed. I was stupid and forgot to repeat questions and we didn’t have individual microphones. I’m guessing at what people asked me based on interpretation, memory and my responses]

Question: “You must have had a plan for making money off open source software. What was it?”

Answer: I had various plans. Right now I’m mostly doing services and training. I’m going to be starting offering proprietary extensions which try and make it easy to use the core libraries for everyone, but when you start to have problems that you only really have if you’re a large company who is doing serious deployments of these things then you need to start paying me money. And there are a variety of other plans, I’m happy to talk about that at greater length afterwards. But the short answer is that it’s really hard and no-one seems to have a very successful plan yet. The best plan seems to be to do a really bad job of writing the library and then to get people to pay you to help them figure it out, and I’m not a big fan of that one either.

Question: (I have no idea what was said. I think this was mostly a question about bounty source and similar bounty programs?)

Answer: So for about the last six months a lot of the issues and feature requests on my github tracker have had the tag “for a modest fee” on them. Literally no-one has ever approached me about this. I do have a bounty source account which I need to make more obvious but I currently money from donations on that and zero money on issues. This works very well for some people but there’s this general pattern where applications with lots of consumer usage do quite well on these bounty programs more than things with companies using them. For example I know that neovim seems to be doing very well out of bounytsource right now, and I think it’s again a thing where large open source applications work much better for this than these sorts of small libraries. That is probably part of a viable future but I don’t what part it is and it hasn’t been working for me so far.

Question: Do you feel you should be paying for Python and the standard library.

Answer: Well right now I’m not a successful commercial user of it. I kinda do though, yeah. I think certainly in terms of donating to the Python software foundation that’s something I will probably do once my revenue stream starts to look slightly less pitiful and I actually start paying myself again [ed. I am now paying myself again! Or at least, I am starting doing pay roll for myself and that will at some point actually make it into my bank account], but it’s hard. There’s this huge ecosystem of open source and I don’t know yet where the money for it should come from, but I do feel that a lot more of it should come from the people who are making literally billions of dollars running off it than necessarily the individual developers who are building on top of it. But yes, it’s complex, I don’t know the answer.

Question: (Something about developers not general being buyers in a company)

Answer: Not really. I call this the “take me to your leader” problem, which is that basically the only solution I can figure out is that you need to get developers to introduce you to their managers, but that runs into the developers are not marketers problem because basically you’re trying to turn developers into your marketers and we’re not very good at that.

Question: (I have no idea)

Answer: I think there does need to be better charitable organisations but I think I’m not necessarily the person to come up with the ideas around that or to try that.

Question: Have you approached the Python Software Foundation for funding? (Something more to it than that but I’m not sure what)

Answer: The Python Software Foundation have very specific policies which is that you can apply to them for funding to work on a specific isolated chunk of work, so if there were a feature I wanted to add to Hypothesis that I wanted funding for I could potentially apply to the PSF for it, but for the general problem of ongoing Hypothesis development that’s not the sort of thing that they fund, and so I’ve largely not looked into it past that point. But for certain things that’s absolutely a thing you should think about doing. And also there are within specific areas there are other funding bodies, so for example for Django there’s the Django foundation who can do this, there’s NumFOCUS who tend to focus on scientific Python, and a bunch of relatetd things. So sometimes that works, but it tends to be more solutions to specific problem than the general larger problem.

Question: (I think about the idea of my writing a book)

Answer: Yes I’m going to look into that, because I’ve already a huge number of articles and so I figure I can turn those articles into a book. But that’s more marketing than it is going to be necessarily a useful income stream. I’m expecting that to make some money, and this is the sort of solution that only works once you’ve put in all this work and done all this free work, so again I think writing books is… it’s a great partial solution to this problem but it doesn’t actually solve the problem. And maybe with all the partial solutions available if you do literally all of them it starts to add up to a whole solution, but right now I’m not convinced it does.

Question: Is open source possible the wrong answer to this? Or is just a marketing strategy?

Answer: I don’t know if open source is the wrong answer. I think that closed source is definitely the wrong answer, and I think that finding some hybrid between the two, but as we saw in Jim’s talk earlier and in plenty of others, there’s a reason why open source is so popular and it’s not just that stuff is being given away for free. Your life tends to get very bad if you build your things on a foundation of other peoples’ closed source products. And this sort of free sharing of ideas really does have a lot of benefits, so I don’t want to throw the baby out with the bathwater. I really like open source and want it to succeed, and that’s why I’m trying to find a compromise. In terms of the question of whether it’s marketing or not, marketing is definitely a component but I think we need more than just marketing. I think no matter how well you market your open source project that’s not going to make you any money unless you also have a path to making money from it. Which is what I’m trying to figure out.

Question: Something about using the open source product as the marketing for something.

Answer: That is the open source core, or something like that, and I think that’s a viable solution, but you do then end up with… in that case you’re sort of asking “how much of the benefits of open source do you want to get?” and I would like something where you could get both the benefits of open source and the benefits of commercial work. I would also like a pony.

This entry was posted in programming on by .

In defence of developer exceptionalism

Epistemic status: I am thinking through a possible position out loud. While I believe the argument I am making in this post, I am not totally convinced by it. I am aware of a number of counter-arguments that I do not touch on here.

There’s a rather annoying attitude that often comes up: Software Developers are special and you should treat them differently from everyone else when designing how your organisation works.

I don’t think this is true. Software development is knowledge work with a fairly logical (though less logical than we like to pretend) basis, and most of the specific features of it as a profession just emerge out of that.

But I am starting to think it might be useful.

There is a category of argument I have often encountered in negotiation (not at all coincidentally, usually right before I turned down a job offer or shortly before I left a company). I call it the “We can’t treat you fairly because then we’d have to treat other people fairly too” argument. e.g. I’ve seen this in requests for flexible working (I like four day weeks a lot, and will in general happily take 80% of the salary for a 4 day week) where the response to the request is “But loads of people want that and if you had it they’d ask why they can’t have it too.”

My attitude to this in the past has always been “Well, why can’t they have it too?”

This is, of course, still my attitude, but I’ve come to accept that it will win the argument precisely 0% of the time.

The problem with accepting arguments of the form “Why should you get this treatment if everybody else does not?” is that they remove all possibility of incremental change. Even when everybody should get something, there’s probably no route to that that doesn’t pass through some people getting it and not others.

One example of where this comes up in a more explicitly developers vs non developers scenario is in responses to things like “Why you should not interrupt a programmer“. Every time this or similar is posted, a lot of my friends become angry and say “Programmers aren’t special! You shouldn’t interrupt anyone!”.

This is more or less true (there are large categories of jobs where it’s not only fine but desirable to interrupt them because that’s what their job is. But for basically all knowledge work and a whole bunch of adjacent jobs it’s true). You shouldn’t interrupt anyone.

But it’s a lot easier to win arguments about not interrupting programmers.

And, perhaps, it’s a lot easier to win arguments about not interrupting other people once you’ve established a norm that it’s not OK to interrupt programmers?

And this is why I think developer exceptionalism might be useful. Not because developers are exceptional, but because it is a place where we can put the thin end of the wedge.

A lot of how business is run is very dysfunctional, and exceptionalism gives us a convenient way of side stepping that.  If your business is very interrupt driven but you can establish a small island of uninterrupted calm around your programmers, maybe later the designers can come join you on the island? And maybe after that marketing can note that actually a lot of their work involves pretty intense thinking like that and they’d like to be uninterrupted at least some of the time and…

I can’t promise that it will work, but I’m pretty sure it won’t work without. Systems are quite stable without something to shake them up, and without some specific argument that you should be treated specially, you’ll be treated the same way as everyone else: Badly.

Exceptionalism and solidarity feel fundamentally at odds, and perhaps they are, but perhaps solidarity without the permission to occasionally engage in exceptionalism is just a crab bucket, and the best way to get everyone to where you want them to be is to let some people get ahead by whatever means necessary and then pull the others up behind them?

Of course, if this is going to work you have to actually pull people up behind you.

The correct progression is:

  1. “You can’t have X”
  2. “But developers are special!”
  3. “OK fine developers can have X”
  4. “You can’t have X”
  5. “But the developers have X and it worked really well at improving Y which is also our problem”
  6. (Developers chime in) “We can totally back that up. This is what we’ve seen out of X and it really does help with Y”
  7. “Oh OK, you can have X too”

An incorrect progression would be:

  1. “You can’t have X”
  2. “But developers are special!”
  3. “OK fine developers can have X”
  4. “You can’t have X”
  5. “But the developers have X and it worked really well at improving Y which is also our problem”
  6. (Developers chime in) “No! X is only for developers! You are not a developer and thus are not special and cannot have X”
  7. “You can’t have X”

If developer exceptionalism is to be a useful lie then we need to do two things:

  1. Never forget that it is a useful lie and not actually the truth.
  2. Have other peoples’ back when they try to follow our lead.

Otherwise it’s just developers being privileged jerks, which we have quite enough of already.

 

This entry was posted in life, programming on by .

Language reconstruction based fuzzing without reconstructing languages

Note: This idea has not been implemented yet so may turn out not to work. It all clicked together while I was walking around Paris this afternoon and I’m writing it down while it’s fresh in my mind. I think it will work though.

Secondary note: This post may be too inside baseball for you.

An idea that I’ve been banging my head against recently is that of reconstructing languages corresponding to tests based on some bounded data. I’ve posted a bit about its use for reducing, but I’ve also been interested in its use for fuzzing.

Here’s a problem Hypothesis experiences: The | operator is currently not associative. x | (y | z) will produce a different data distribution than (x | y) | z. This is suboptimal. I could special case it (and for a long time I thought I had, but apparently I dreamed that code), but it’s easy to come up with trivial ways to defeat that. e.g. (x | y).map(lambda x: x) | z should really have the same distribution as x | y | z but doesn’t if you special case |.

But under the hood Hypothesis just works with streams of bytes, so in theory you can represent its data distribution as a finite state machine. You could then randomly walk the state machine reweighting your choices by the number of reachable states and thus get much get better coverage of the set of available paths by reaching rare paths much more often.

Or, somewhat suggestively, here’s another algorithm that could work well and is a lot simpler than reweighting:

  1. Pick a state uniformly at random
  2. Take the shortest path to that state (or, indeed, any random path to that state).
  3. Continue the random walk from there

Unfortunately we run into a major problem with either of these approaches: Reconstructing regular languages for unknown predicates is really hard.

The insight I had this afternoon is this: We don’t need to! We can use the same trick as I used to produce self-optimizing boolean expressions to just store exemplars of distinct states we’ve found. We then just pick a random state we’ve already discovered and run the generation from there (Hypothesis can extend any string to a complete but possibly invalid example).

How does this work?

We maintain three sets of data:

  1. A corpus of examples such that we know that every element of this list definitely corresponds to a different state in the state machine for our predicate.
  2. A set of previously seen strings (this doesn’t have to be precise and can e.g. be a bloom filter)
  3. A binary search tree with branches labelled by strings and leaves labelled by indices into our list of examples.

We populate this as follows:

  1. The corpus starts as the empty string
  2. The set of previously seen strings contains just the empty string
  3. The binary tree is a single leaf pointing to 0 (the index of the empty string).

We then fuzz as follows:

  1. Pick a random element of our corpus.
  2. Generate an example such that that random element is a strict prefix of the example (it’s easy to do this in Hypothesis).
  3. For each proper prefix of that example, perform the incorporate operation.

The incorporate operation when run on a string s is:

  1. If the string is in our set of seen strings, stop. Else, add it to the set.
  2. Walk the binary tree until we hit a leaf. At each branch look at the label e. If the concatenation of s followed by e is valid, walk right. Else, walk left.
  3. Once we are at a node, find the string in our corpus index by it. Call that t. By repeatedly fuzzing extending s and t, attempt to find e such that s + e and t + e produce different results. If we do find such an e, add s to the corpus as a new element and split the current tree node, creating a branch labelled by e and with the left and right children being leaves containing s and t (which one is which depending on which way around the difference occurred). Otherwise, if s is a better example than t (shorter, or same length and lexicographically smaller), replace t in the corpus with s.

The experiments we split on witness that two nodes must be in different states (cf. the Myhill-Nerode theorem), so we guarantee that our corpus consists of unique states. The fuzzing step is probabilistic, so we can sometimes discard a state inappropriately, but this shouldn’t be a major problem in practice – states that we can’t distinguish with high probability might as well be the same from a fuzzing point of view.

Anyway, like I said, it’s currently unclear how useful this is because I haven’t actually implemented it! In particular I suspect it’s more useful for long running fuzzing processes than for the sort of fast fuzzing Hypothesis tries to do. Fortunately I’m planning how to incorporate such workflows into Hypothesis. However I definitely still think it’s a good idea and I’m looking forward to exploring it further.

This entry was posted in Hypothesis, programming on by .

Self-optimizing boolean expressions

This is a trick I figured out this morning. It seems to work (in that I have a prototype that works just well enough to pass a single test), and I think it might have some interesting applications. I haven’t yet decided whether it’s more than a fun toy though.

It’s entirely possible (likely, even) that this is a well known idea that has some fancy name that I don’t know. I was doing some digging on related areas and couldn’t see anything like it, but that just means I wasn’t looking in the right places.  It may also be that it’s never been written up because it’s a bad idea. Usual disclaimers about having clever ideas apply.

The idea is this: We’re working with boolean expressions over a set of variables. We want a canonical representation of them. We could use reduced ordered binary decision diagrams, but we dislike exponential blow-up. So what do we do?

There are a whole pile of extensions to the idea of BDDs which relax constraints, add features, etc. We’re not going to do that. Instead we’re going to canonicalize as we go using the awesome power of mutability.

For this trick we will need:

  1. A SAT solver (this problem is intrinsically equivalent to solving SAT, so there’s no getting away from that. We might as well let someone else do the hard work). I used minisat for my prototype. This would presumably work better with one of the various patched minisat extensions.
  2. A single mutable binary tree with leaves as boolean expressions and splits labelled by sets of variables.
  3. Union find.
  4. Hash consing (with weak references if you prefer but it will hurt speed)

Our working representation for boolean expressions will be that an expression can be one of:

  1. A literal true
  2. A literal false
  3. A single variable
  4. The negation of another boolean expression
  5. The conjunction (and) of two other boolean expressions

We hash cons the representation of our expressions so that any two structurally equal expressions are reference equal. We then make them our items in union find, so additionally every expression has a reference to another expression, or to itself if it is our current canonical representation for equivalent expressions.

Now, every time we introduce a new expression that is not previously in our hash cons we want to find out if it is equivalent to any expression we have previously seen. If it is, we perform a merge operation in our union find, rooting the more complicated of the two in the simpler of the two. If not, we add it as a new root.

Obviously we don’t do that by searching through every expression we’ve seen so far. This is where the binary tree comes in.

The binary tree starts with a single branch keyed with the empty set with the literal false as the left branch and the literal true as the right branch. When we insert a new node, we walk the tree as follows:

At a split node, look at the set of variables and determine what value this expression has when those variables are set to true and every other variable is set to false. If that result is true, walk to the right leaf. Else, walk to the left branch.

Once you get to a leaf, look at the expression that is currently there.  Using a SAT solver, attempt to find a variable assignment that produces different results for the two expressions.t eIf the solver doesn’t find one, the expressions are equivalent. Merge them. If the solver does find one, split the leaf out into a branch with the label as the set of variables that are true in that assignment and false otherwise.

Anyway, once you have this data structure, you can always cheaply convert an expression that you’ve already constructed to the simplest equivalent expression you’ve seen so far. Testing equivalence of expressions then becomes equally cheap – just simplify both and see if you get the same answer.

This is of course a value of cheap that involves having run a large number of calls to a SAT solver to get to this point, but for the case where you’ve got a large number of small expressions which you’re going to want to then pass to a SAT solver all at once later I think that’s probably not too bad – passing the individual expressions to a SAT solver is extremely cheap, and then you just convert the whole batch to CNF together rather than trying to insert them into the data structure.


 

Notes:

  1. The tree will almost certainly get unbalanced. Occasional rebalancing probably becomes necessary. I’m not sure what the best way to do this is going to be, as it’s not actually an ordered tree in the traditional sense. I think something Splay Tree like could be made to work.
  2. You have to decide what constitutes a “better” expression in an implementation of this, but some sort of heuristic about number of reachable distinct subexpressions is a good start.
  3. It is probably worth putting some effort in to minimizing and reusing the expressions you get back from the SAT solver. It depends a bit on what the hit rate is likely to be.
  4. There are some optimizations you can do to make the evaluation a bit more efficient that I haven’t gone into here.
  5. There are some simple algebraic rewrite rules it’s worth performing before farming anything off to the sat solver.
  6. You can also optimize a little based on the fact that literals, variables and their negations are always distinct from each other.

 

This entry was posted in programming on by .

Shrinking failing input using a SAT solver

Advance warning: This is mostly me writing up a negative to neutral result. The technique described in this post works OK, but it doesn’t seem to work better than a well optimized delta debugging and as such does not currently justify its complexity. It also doesn’t work very well when you start from large examples, but that’s fortunately not a huge problem using the ideas I outlined in my last post because you can work with smaller sequences until the example is small enough to be feasible when approached bytewise.

My hope was that it would give a system that was much better at finding longer intervals to delete (delta debugging only guarantees that you can not delete any byte. Guaranteeing that you cannot delete any interval provably requires a quadratic number of tests to check, but I was hoping that this would be able to exploit structure found in the string to find more intervals that pure delta debugging would have missed.

I’ve got some ideas I want to pursue to extend it that might make it worth it, but this post just describes the current state of work in progress. Also I might end up incorporating it into my shrinker anyway and hope that I can improve its working with more cleverness later (I already have a prototype of doing this).

Anyway, disclaimers over, on to the idea.

We are trying to take a string \(S\) and a predicate \(f\). We are trying to construct a string \(T\) of minimal length such that \(f(T)\) is true.

This builds on my previous idea of using regular language induction to aid shrinking. The version I described in that post appears to be too slow to make work.

The core idea is still that we are looking for a DFA with fewer than \(|S|\) nodes such that every accepted string satisfies \(f\). If we can find such a DFA then we can shrink \(S\) because the shortest path to an accepted node must have no more than the number of nodes, which is fewer than \(|S|\) by assumption.

The idea is to try to manufacture a DFA from \(S\) using a SAT solver (Disclaimer: I actually used the naive algorithm that they aimed to replace because I didn’t quite understand their work and wanted to prototype something quickly) to colour the nodes in a manner consistent with the transitions observed in \(S\). (i.e. if we colour the indices \(i\) and \(j\) the same and \(S[i] = S[j]\) then we colour \(i + 1\) and \(j + 1\) the same).

We maintain a set of pairs of indices which we know do not correspond to the same state in the DFA. We initialize it to \(\{(0, |S|\}\) at the start and will add to it as we go.

We now produce a colouring consistent with that set and the transition requirement.

If that colouring gives a unique colour to every index, stop. There is no DFA with fewer than \(|S|\) nodes that matches \(S\) and.

Otherwise, produce the quotient DFA and look at the minimal path to an accepting node. If it satisfies \(f\), we have shrunk the string. Start again from the beginning with the new string (this usually doesn’t work, but it’s worth checking).

If that didn’t work, we now produce either a strictly smaller string satisfying \(f\) or we increase the size of our set of known inconsistencies as follows:

For each colour, we look at the smallest and the largest index in it, \(i, j\). If deleting the interval \([i, j)\) from the  string produces a string satisfying \(f\) then we have shrunk the string. Start again from the beginning with the new string. Otherwise, we have proved that the subsequences of \(S\) up to \(i\) and \(j\) respectively must live in distinct states (because following the suffix of \(S\) starting at \(j\) produces different results). Add \((i, j)\) to the set of inconsistencies. Do this for each colour (starting from the ones which produce the largest gap so we maximize how much we shrink by), and then if none of them produced a shrink then try colouring again.

Eventually we’ll either hit the point where we need all \(|S| + 1\) colours to colour the indices, or we’ll have shrunk the string.

Note: I think this doesn’t actually guarantee even 1-minimality like delta debugging does. The problem is that you might have an index that is not equivalent to its successor index even though you can delete the character there. I’m not currently sure if this algorithm can witness that possibility or not.

Anyway, I’ve been running the example I used to prototype structureshrink (the C++ hello world) with this as the list minimization algorithm. It seems to do alright, but it’s nothing to write home about, but it’s been an interesting experiment and I’m glad to have got the excuse to play with minisat.

This entry was posted in programming, Python on by .