
Life Changes Announcement: Academia Edition

This is just a small blog post to let people know what’s going on in my life.

I wrote a little while ago that I was thinking about doing a PhD. Well, thought turned into action remarkably quickly, and in October I will be starting a PhD with Dr Alistair Donaldson’s Multicore Programming Group at Imperial College London.

Despite the group title, my focus will be almost entirely on the sorts of things I’ve already been doing for the last few years and probably won’t involve that much multicore programming at all (I’m interested in them because they do a lot of work on language-based fuzzing for GPUs, and also because Ally and I really hit it off and there’s a bunch of other interesting related work at Imperial). I’ll be focusing on Hypothesis and related technology, but with a goal of validating and publishing some of the research I’ve been doing on it from a more academic standpoint.

Separately from, though not unrelated to, that, I’ve also started a part-time research assistant job at Imperial with the same group. I’ll be doing three days a week there for the next three months, helping them with some of their work on GLFuzz (I somehow convinced Ally I knew a thing or two about programming in general and writing fuzzers in particular. Not sure how). So I am now officially an academic.

Feels weird, to be honest. Also not really sure it’s sunk in yet.

For now this doesn’t mean I’ll be around London any more often than I have been, as it will mostly be a remote working position, but come October or so when I start my PhD I will be moving back. I have decidedly mixed feelings about this, but it will be very good to see my friends and family who live there more often.


Programmer at Large: Can we not?

This is the latest chapter in my web serial, Programmer at Large. The first chapter is here and you can read the whole archives here or on the Archive of Our Own mirror. This chapter is also mirrored at Archive of Our Own.


“I think that went well, don’t you?”

I let out a huff of frustration.

“What on the ground was that about?”

Sam cuddled up closer into a two-person conversation stance.

“Brian is just a bit… sensitive about the subject of sex. Not sure why. Their profile says they have a high disgust reflex, but I’ve never been sure if it says that because of the sex thing or whether it’s the reason for it.”

“OK, but call me off consensus here, but wasn’t that just straight up rumour mongering? We’re not supposed to do that, right? What am I missing?”

Sam grimaced and (check tagging) looked uncomfortable.

“There are… special cases. Sex is one of them. Kimiko could legally challenge Brian about this if they wanted, but without that Brian is technically within socially acceptable bounds, though they’re being a bit gauche about it.”

I called up Kimiko’s social graph into a shared workspace and highlighted ties that had risen and then fallen. Brian was connected to about a third of them. I switched it to timeline mode and the pattern was even clearer.

“Gauche? Look at that. This isn’t gauche, this is practically aggression!”

I was getting mood warnings again, and Sam was stroking my back in a calming manner.

“Arthur, calm down. This isn’t your problem.”

“How is this not my problem? This is a clear social breakdown on the ship! Those are everybody’s problem! It says so right there in the charter!”

“Look, it’s complicated.”

“That’s what people always say when they think the rules don’t apply to them!”

Sam grimaced again and shook their head.

“Ugh, Arthur, I can’t do this right now. I’m sorry. I’m not angry at you, and I understand why you feel this way, but this is a much higher effort conversation than I have the capacity for at the moment. Can we drop it?”

Sam and I have had more than a few conversations like this, and they probably could tell how this one was going to go.

The problem was that it was hitting right at the core of the things I find hardest about shipboard society.

The goals of our charter are mostly worthy, and the rules it defines are mostly a fair way of achieving those goals. It’s not perfect, but nothing created by humanity is. We’ve learned that the hard way.

The official rules are sometimes very hard for me to follow, but I accept their legitimacy and where I struggle too much I have software to help keep me on track.

But then there are all the unofficial rules, which are impossible to keep straight because nobody knows what they are in any formal sense, they just somehow magically know how to follow them. Every time I ask people to explain, we just both get frustrated – they tell me things they want to be true, and I get mad because they obviously aren’t.

And when the unofficial rules override the official rules you get this sort of completely hypocritical situation where people just say “it’s complicated” and can’t really explain why.

But it wasn’t Sam’s fault the ship was like this, and I could definitely understand not feeling able to talk about it. Even if it wasn’t basic courtesy, I’d respect that.

I was glad they had clarified they weren’t angry with me though.

I took a deep breath, pulled myself together, and nodded acceptance.

“Of course. Sorry I got carried away there, but this stuff… gets to me.”

“I understand. It gets to me too sometimes.”

“It’s fine to not talk about it, but then I need to not talk right now. I think this is going to be on my mind for a while and I’m probably not going to be able to stay off the subject.”

“That’s fine. Do you want to go? Or should we hang out together quietly for a bit?”

I hesitated briefly, but the need to show Sam that I wasn’t angry at them won out.

“I don’t need to be anywhere, and the company would be nice if you’re still willing.”

“Of course I am! And I’ve got plenty of quiet things I can get on with, so this works well.”

“Great.”

We shifted around a bit so we weren’t directly facing each other and could each have a hand free to work with.

The first thing I did was drop a note in Kimiko’s inbox saying I’d like to talk to them at their convenience. I flagged it as low-urgency but mildly important. Their status showed as busy at the moment, so they wouldn’t get the notification until later.

The second thing I did was start calling up social network diagrams.

Kimiko was indeed poorly connected to a lot of the rest of the crew. They had the obvious contacts in the biology sections, and there was a fairly densely connected group of about fifty people that they were part of that didn’t have any obvious reason for the connection.

I guessed that was probably the sexual subset of the crew, assuming Brian hadn’t simply been wrong or lying.

The network structure here was weird. The group was much more densely connected relative to its external connections than a group this size should be. It looked a lot like a clique or a minority interest group, and we weren’t supposed to have those. I looked up the various group metrics to see why it hadn’t been flagged.

Apparently the answer was that it consistently sat just under the alerting threshold on every single metric – slightly too large to be a clique, slightly too small to be a minority interest. The standard clustering algorithms all cleaved the group roughly in half, though they didn’t agree on which halves. Average group centrality was low but not quite low enough to be worth a flag. And so on – we have about ninety social unity metrics and this group managed to just avoid alerting on every single one of them.

If I’d found a system in the plumbing like this then I would have immediately flagged it up for review. It’s in the nature of difficult systems to push right up against the alerting boundaries, and often it’s a sign that you need a new alerting metric.

Properly that was exactly what I should have done here too: The rules don’t distinguish systems made out of humans from systems made out of machines. If you see anomalous structure you should flag it for review.

But I had a nagging feeling that doing that here would be… bad. I resolved to wait until after I talked to Kimiko, and raised the importance level of my request to meet slightly, while still leaving it as non-urgent. This had clearly been going on for a while, and just because I only found out about it now didn’t make it suddenly urgent.

The whole scenario left me feeling intensely uncomfortable, but on the plus side I’d found my own little exception to the rules to be hypocritical about. Maybe I was starting to understand the rest of the crew after all.


The next chapter’s schedule is still TBD, but it will most likely continue on the established schedule (which this one fell out of) and appear on Friday May 5th.

Like this? Why not support me on Patreon! It’s good for the soul, and will result in you seeing chapters as they’re written (which is currently not very much earlier than they’re published, but hopefully I’ll get back on track soon).


Ordering and mapping regular languages

I’ve had the following question for a while: How do I create a mapping of keys to values where the keys are regular expressions, and two regular expressions are considered equivalent if they correspond to the same language?

An example of why you might want to do this is e.g. when constructing a minimal deterministic finite automaton for a regular language you end up labelling states by regular expressions that represent the language matched when starting from that state. In order for the automaton to be minimal you need to have any two equivalent regular expressions correspond to the same state, so you need a way of taking a new regular expression and finding out which state it should go to.

It’s easy (if potentially expensive) to test regular expression equivalence once you know how, so the naive implementation of this is just to do a linear scan of all the regular expressions you’ve found so far. It’s O(n) lookup but it’s at least an existence proof.
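
For concreteness, here’s a minimal sketch of that naive approach, assuming a hypothetical equivalent(x, y) predicate that tests whether two regular expressions denote the same language:

class LinearScanMap(object):
    """Map keyed by regular expressions, compared by language: O(n) lookup."""

    def __init__(self):
        self.__items = []

    def setdefault(self, key, value):
        # equivalent(x, y) is a hypothetical predicate testing whether two
        # regular expressions denote the same language.
        for existing_key, existing_value in self.__items:
            if equivalent(existing_key, key):
                return existing_value
        self.__items.append((key, value))
        return value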

In the past I’ve ended up implementing a crude hash function for regular languages and then just used a hash table. It works, but collisions are common unless you’re quite clever with your hashing, so it doesn’t work well.
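
For illustration, one crude hash along these lines (not necessarily the one I used) is to hash the membership of the first few words in shortlex order, assuming a hypothetical matches(expr, word) membership test:

from itertools import islice, product


def crude_hash(expr, alphabet, prefix_size=16):
    """Hash a regular expression by which of the first few shortlex words it matches.

    Equivalent expressions always hash equal; inequivalent ones that agree on
    this prefix of words collide, which is why collisions are common.
    """
    def shortlex_words():
        length = 0
        while True:
            for letters in product(sorted(alphabet), repeat=length):
                yield "".join(letters)
            length += 1

    # matches(expr, word) is a hypothetical membership test, not a real API.
    return hash(tuple(
        matches(expr, w) for w in islice(shortlex_words(), prefix_size)
    ))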

But it turns out that there is a better way! Rather than using hashed data structures you can use ordered ones, because there is a natural and easy to compute (or at least not substantially harder than testing equivalence) total ordering over the set of regular languages.

That way is this: If you have two regular languages \(L\) and \(M\) that are not equivalent, there is some \(x \in L \triangle M\), the symmetric difference. That is, we can find an \(x\) which is in one but not the other. Let \(x\) be the shortlex minimal such word (i.e. the lexicographically first word amongst those of minimal length). Then \(L < M\) if \(x \in L\), else \(M < L\).

The work in the previous post on regular language equivalence is thus enough to calculate the shortlex minimal element of the symmetric difference of an inequivalent pair of languages (though I still don’t know if the faster of the two algorithms gives the minimal one. But you can use the fast algorithm for equivalence checking and then the slightly slower algorithm to get a minimal refutation), so we can readily compute this ordering between two regular expressions. This, combined with any sort of ordered collection type (e.g. a balanced binary tree of some sort), gives us our desired mapping.
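
To make that concrete, here’s a sketch (not the implementation from the previous post) of computing this ordering for two languages given as complete DFAs over a shared alphabet, with each DFA assumed to be a hypothetical (initial state, transition function, accepting predicate) triple with hashable states. A breadth-first search of the product automaton that expands characters in sorted order reaches state pairs in shortlex order of the minimal words leading to them, so the first pair found where exactly one side accepts tells us which language contains the minimal word of the symmetric difference:

from collections import deque


def compare_languages(dfa_l, dfa_m, alphabet):
    """Return -1 if L < M, 1 if M < L, and 0 if the languages are equivalent."""
    # Each DFA is assumed to be an (initial state, transition function,
    # accepting predicate) triple; this representation is illustrative only.
    init_l, step_l, accepting_l = dfa_l
    init_m, step_m, accepting_m = dfa_m
    start = (init_l, init_m)
    seen = {start}
    queue = deque([start])
    while queue:
        state_l, state_m = queue.popleft()
        in_l, in_m = accepting_l(state_l), accepting_m(state_m)
        if in_l != in_m:
            # The shortlex-minimal word reaching this pair of states is in
            # exactly one of the two languages, and by the definition above
            # the language containing it is the smaller one.
            return -1 if in_l else 1
        for c in sorted(alphabet):
            successor = (step_l(state_l, c), step_m(state_m, c))
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return 0  # No distinguishing word exists, so the languages are equal.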

But why does this definition provide a total order?

Well, consider the enumeration of all words in increasing shortlex order as \(w_0, \ldots, w_n, \ldots\). Let \(l_n = 1\) if \(w_n \in L\), else \(l_n = 0\). Define \(m_n\) similarly for \(M\).

Then the above definition is equivalent to the reverse of the lexicographical ordering between \(l\) and \(m\)! If \(w_k\) is the smallest word in the symmetric difference then \(k\) is the first index at which \(l\) and \(m\) differ. If \(w_k \in L\) then \(l_k = 1\) and \(m_k = 0\), so \(l > m\), and vice versa. The lexicographical order is a total order, and the reverse of a total order is a total order, so the above definition is also a total order.

This definition has a number of nice properties:

  • Any language containing the empty word sorts before any language that doesn’t
  • The function \(L \to \overline{L}\) is order reversing.
  • \(L \cap M \leq L, M \leq L \cup M\)

I originally thought that union was coordinate-wise monotonic, but it’s not. Suppose we have four words \(a < b < c < d\), and consider the languages \(L = \{a, d\}, M = \{b, c\}\). Then \(L < M\) because the deciding value is \(a\). But now consider \(P = \{a, b\}\). Then \(L \cup P > M \cup P\) because the deciding element now ends up being \(c\), which is in \(M\).

I’ve yet to try this in practice, so it might turn out that there are some interestingly pathological failure modes for this comparison function, but I don’t think there are likely to be any that aren’t also present in testing regular expression equivalence itself.

Another open question here is which sorted map data structure to use. The comparison is relatively expensive, so it might be worth putting in a bit of extra effort to balance it. As such an AVL tree might be a fairly reasonable choice. I’m not sure.
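
To sketch what the resulting mapping might look like, here’s a deliberately simple version built on a plain unbalanced binary search tree keyed by the compare_languages sketch above (keys are languages in whatever representation that comparison accepts); a real implementation would want the balancing just mentioned, but the shape is the same:

class LanguageMap(object):
    """A mapping keyed by regular languages, ordered by compare_languages."""

    def __init__(self, alphabet):
        self.__alphabet = alphabet
        self.__root = None  # Each node is [key, value, left, right].

    def setdefault(self, language, value):
        """Return the value stored for an equivalent language, inserting if absent."""
        if self.__root is None:
            self.__root = [language, value, None, None]
            return value
        node = self.__root
        while True:
            c = compare_languages(language, node[0], self.__alphabet)
            if c == 0:
                return node[1]
            i = 2 if c < 0 else 3
            if node[i] is None:
                node[i] = [language, value, None, None]
                return value
            node = node[i]

The setdefault operation is exactly what the motivating DFA construction needs: either it returns the state already assigned to an equivalent language, or it registers the new one.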


Want more blog posts like this? Join the 30 others who are supporting my writing on Patreon!


An epistemic vicious circle

Let’s start with an apology: this blog post will not contain any concrete examples of what I want to talk about. Please don’t ask me to give examples. I will also moderate out any concrete examples in the comments. Sorry.

Hopefully the reasons for this will become clear and you can fill in the blanks with examples from your own experience.

There’s a pattern I’ve been noticing for a while, but it happens that three separate examples of it came up recently (only one of which involved me directly).

Suppose there are two groups. Let’s call them the Eagles and the Rattlers. Suppose further that the two groups are roughly evenly split.

Now suppose there is some action, or fact, on which people disagree. Let’s call the two sides of that disagreement blue and orange.

One thing is clear: If you are a Rattler, you prefer orange.

If you are an Eagle however, opinions are somewhat divided. Maybe due to differing values, or different experiences, or differing levels of having thought about the problem. It doesn’t matter. All that matters is that there is a split of opinions, and it doesn’t skew too heavily orange. Let’s say it’s 50/50 to start off with.

Now, suppose you encounter someone you don’t know and they are advocating for orange. What do you assume?

Well, it’s pretty likely that they’re a Rattler, right? 100% of Rattlers like orange, and 50% of Eagles do, so there’s a two thirds chance that a randomly picked orange advocate will be Rattler. Bayes’ theorem in action, but most people are capable of doing this one intuitively.
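
Spelled out with Bayes’ theorem, given that the groups are evenly sized and Eagles split 50/50: \(P(\text{Rattler} \mid \text{orange}) = \frac{\frac{1}{2} \cdot 1}{\frac{1}{2} \cdot 1 + \frac{1}{2} \cdot \frac{1}{2}} = \frac{2}{3}\).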

And thus if you happen to be an Eagle who likes orange, you have to put in extra effort every time the subject comes up to demonstrate that. It’ll usually work – the evidence against you isn’t that strong – but sometimes you’ll run into someone who feels really strongly about the blue/orange divide and be unable to convince them that you want orange for purely virtuous reasons. Even when it’s not that bad it adds extra friction to the interaction.

And that means that if you don’t care that much about the blue/orange split you’ll just… stop talking about it. It’s not worth the extra effort, so when the subject comes up you’ll just smile and nod or change it.

Which, of course, brings down the percentage of Eagles you hear advocating for orange.

So now if you encounter an orange advocate they’re more likely to be Rattler. Say 70% chance.

Which in turn raises the amount of effort required to demonstrate that you, the virtuous orange advocate, are not in fact Rattler. Which raises the threshold of how much you have to care about the issue, which reduces the fraction of Eagles who talk in favour of orange, which raises the chance that an orange advocate is actually Rattler, etc.

The result is that when the other side is united on an issue and your side is divided, you effectively mostly cede an option to the other side: Eventually the evidence that someone advocating for that option is a Rattler is so overwhelming that only weird niche people who have some particularly strong reason for advocating for orange despite being an Eagle will continue to argue the cause.

And they’re weird and niche, so we don’t mind ostracising them and considering them honorary Rattlers (the real Rattlers hate them too of course, because they still look like Eagles by some other criteria).

As you can probably infer from the fact that I’m writing this post, I think this scenario is bad.

It’s bad for a number of reasons, but one very simple reason dominates for me: Sometimes Rattlers are right (usually, but not always, for the wrong reasons).

I think this most often happens when the groups are divided on some value where Eagles care strongly about it, but Rattlers don’t care about that value either way, and vice versa. Thus the disagreement between Rattler and Eagles is of a fundamentally different character: Blue is obviously detrimental to the Rattlers’ values, so they’re in favour of orange. Meanwhile the Eagles have a legitimate disagreement not over whether those values are good, but over the empirical claim of whether blue or orange will be better according to those values.

Reality is complicated, and complex systems behave in non-obvious ways. Often the obviously virtuous action has unfortunate perverse side effects that you didn’t anticipate. If you have ceded the ground to your opponent before you discover those side effects, you have now bound your hands and are unable to take what now turns out to be the correct path because only a Rattler would suggest that.

I do not have a good suggestion for how to solve this problem, except maybe to spend less time having conversations about controversial subjects with people whose virtue you are unsure of and to treat those you do have with more charity. A secondary implication of this suggestion is to spend less time on Twitter.

But I do think a good start is to be aware of this problem, notice when it is starting to happen, and explicitly call it out and point out that this is an issue that Eagles can disagree on. It won’t solve the problem of ground already ceded, but it will at least help to stop things getting worse.


Like my writing? Why not support me on Patreon! Especially if you want more posts like this one, because I mostly don’t write this sort of thing any more if I can possibly help it, but might start doing so again given a nudge from patrons.

A worked example of designing an unusual data structure

Due to reasons, I found myself in need of a data structure supporting a slightly unusual combination of operations. Developing it involved a fairly straightforward process of refinement, and a number of interesting tricks, so I thought it might be informative to people to walk through (a somewhat stylised version of) the design.

The data structure is a particular type of random sampler, starting from a shared array of values (possibly containing duplicates). Values are hashable and comparable for equality.

It needs to support the following operations:

  1. Initialise from a random number generator and a shared immutable array of values so that it holds all those values.
  2. Sample an element uniformly at random from the remaining values, or raise an error if there are none.
  3. Unconditionally (i.e. without checking whether it’s present) remove all instances of a value from the list.

The actual data structure I want is a bit more complicated than that, but those are enough to demonstrate the basic principles.

What’s surprising is that you can do all of these operations in amortised O(1). This includes the initialisation from a list of n values!

The idea behind designing this is to start with the most natural data structure that doesn’t achieve these bounds and then try to refine it until it does. That data structure is a resizable array. You can sample uniformly by just picking an index into the array. You can delete by doing a scan and removing every element that is equal to the value. This means you have to be able to mutate the array, so initialising it requires copying.

Which means it’s time for some code.

First, let’s write a test suite for this data structure:

from collections import Counter
 
from hypothesis.stateful import RuleBasedStateMachine, rule, precondition
import hypothesis.strategies as st
 
from sampler import Sampler
 
 
class FakeRandom(object):
    def __init__(self, data):
        self.__data = data
 
    def randint(self, m, n):
        return self.__data.draw(st.integers(m, n), label="randint(%d, %d)" % (
            m, n
        ))
 
 
class SamplerRules(RuleBasedStateMachine):
    def __init__(self):
        super(SamplerRules, self).__init__()
        self.__initialized = False
 
    @precondition(lambda self: not self.__initialized)
    @rule(
        values=st.lists(st.integers()).map(tuple),
        data=st.data()
    )
    def initialize(self, values, data):
        self.__initial_values = values
        self.__sampler = Sampler(values, FakeRandom(data))
        self.__counts = Counter(values)
        self.__initialized = True
 
    @precondition(lambda self: self.__initialized)
    @rule()
    def sample(self):
        if sum(self.__counts.values()) != 0:
            v = self.__sampler.sample()
            assert self.__counts[v] != 0
        else:
            try:
                self.__sampler.sample()
                assert False, "Should have raised"
            except IndexError:
                pass
 
    @precondition(lambda self: self.__initialized)
    @rule(data=st.data())
    def remove(self, data):
        v = data.draw(st.sampled_from(self.__initial_values))
        self.__sampler.remove(v)
        self.__counts[v] = 0
 
TestSampler = SamplerRules.TestCase

This uses Hypothesis’s rule-based stateful testing to completely describe the valid range of behaviour of the data structure. There are a number of interesting and possibly non-obvious details in there, but this is a data structures post rather than a Hypothesis post, so I’m just going to gloss over them and invite you to peruse the tests in more detail at your leisure if you’re interested.

Now let’s look at an implementation of this, using the approach described above:

class Sampler(object):
    def __init__(self, values, random):
        self.__values = list(values)
        self.__random = random
 
    def sample(self):
        if not self.__values:
            raise IndexError("Cannot sample from empty list")
        i = self.__random.randint(0, len(self.__values) - 1)
        return self.__values[i]
 
    def remove(self, value):
        self.__values = [v for v in self.__values if v != value]

The test suite passes, so we’ve successfully implemented the operations (or our bugs are too subtle for Hypothesis to find in a couple of seconds). Hurrah.

But we’ve not really achieved our goals: Sampling is O(1), sure, but remove and initialisation are both O(n). How can we fix that?

The idea is to incrementally patch up this data structure by finding the things that make it O(n) and seeing if we can defer the cost for each element until we actually definitely need to incur that cost to get the correct result.

Let’s start by fixing removal.

The first key observation is that if we don’t care about the order of values in a list (which we don’t because we only access it through random sampling), we can remove the element present at an index in O(1) by popping the element that is at the end of the list and overwriting the index with that value (if it wasn’t the last index). This is the approach normally taken if you want to implement random sampling without replacement, but in our use case we’ve separated removal from sampling so it’s not quite so easy.
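
To make that trick concrete, here’s a minimal standalone sketch of the O(1) removal by index (the same move the final sampler below makes when it gets around to performing a deferred deletion):

def remove_at_index(values, i):
    """Remove values[i] from a list in O(1), not preserving order."""
    last = values.pop()
    if i != len(values):
        # i wasn't the last index, so move the popped element into the hole.
        values[i] = last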

The problem is that we don’t know where (or even if) the value we want to delete is in our array, so we still have to do an O(n) scan to find it.

One solution to this problem (which is an entirely valid one) is to have a mapping of values to the indexes they are found in. This is a little tricky to get right with duplicates, but it’s an entirely workable solution. It makes it much harder to do our O(1) initialisation later though, so we’ll not go down this route.

Instead the idea is to defer the deletion until we know of an index for it, which we can do during our sampling! We keep a count of how many times a value has been deleted and, if we end up sampling it and the count is non-zero, we remove it from the list and decrement the count by one.

This means that we potentially pay an additional O(deletions) cost each time we sample, but these costs are “queued up” from the previous delete calls, and once incurred do not repeat, so this doesn’t break our claim of O(1) amortised time – the costs we pay on sampling are just one-off deferred costs from earlier.

Here’s some code:

class Sampler(object):
    def __init__(self, values, random):
        self.__values = list(values)
        self.__random = random
        self.__deletions = set()
 
    def sample(self):
        while True:
            if not self.__values:
                raise IndexError("Cannot sample from empty list")
            i = self.__random.randint(0, len(self.__values) - 1)
            v = self.__values[i]
            if v in self.__deletions:
                replacement = self.__values.pop()
                if i != len(self.__values):
                    self.__values[i] = replacement
            else:
                return v
 
    def remove(self, value):
        self.__deletions.add(value)

So now we’re almost done. All we have to do is figure out some way to create a mutable version of our immutable list in O(1).

This sounds impossible but turns out to be surprisingly easy.

The idea is to create a mask in front of our immutable sequence, which tracks a length and a mapping of indices to values. Whenever we write to the mutable “copy” we write to the mask. Whenever we read from the copy, we first check that it’s in bounds and if it is we read from the mask. If the index is not present in the mask we read from the original sequence.

The result is essentially a sort of very fine grained copy on write – we never have to look at the whole sequence, only the bits that we are reading from, so we never have to do O(n) work.

Here’s some code:

from collections import Counter
 
 
class Sampler(object):
    def __init__(self, values, random):
        self.__values = values
 
        self.__length = len(values)
        self.__mask = {}
        self.__random = random
        self.__deletions = set()
 
    def sample(self):
        while True:
            if not self.__length:
                raise IndexError("Cannot sample from empty list")
            i = self.__random.randint(0, self.__length - 1)
            try:
                v = self.__mask[i]
            except KeyError:
                v = self.__values[i]
            if v in self.__deletions:
                j = self.__length - 1
                try:
                    replacement = self.__mask.pop(j)
                except KeyError:
                    replacement = self.__values[j]
                self.__length = j
                if i != j:
                    self.__mask[i] = replacement
            else:
                return v
 
    def remove(self, value):
        self.__deletions.add(value)

And that’s it, we’re done!

There are more optimisations we could do here – e.g. the masking trick is relatively expensive, so it might make sense to switch back to a mutable array once we’ve masked off the entirety of the array, e.g. using a representation akin to the pypy dict implementation and throwing away the hash table part when the value array is of full length.

But that would only improve the constants (you can’t get better than O(1) asymptotically!), so I’d be loath to take on the increased complexity until I saw a real-world workload where this was the bottleneck (which I’m expecting to at some point if this idea bears fruit, but I don’t yet know if it will). We’ve got the asymptotics I wanted, so let’s stop there while the implementation is fairly simple.

I’ve yet to actually use this in practice, but I’m still really pleased with the design of this thing. Starting from a fairly naive initial implementation, we’ve used some fairly generic tricks to patch up what started out as O(n) costs and turn them into O(1). As well as everything dropping out nicely, a lot of these techniques are probably reusable for other things (the masking trick in particular is highly generic).

Update 09/4/2017: An earlier version of this claimed that this solution allowed you to remove a single instance of a value from the list. I’ve updated it to a version that removes all values from a list, due to a comment below correctly pointing out that that approach biases the distribution. Fortunately for me in my original use case the values are all distinct anyway so the distinction doesn’t matter, but I’ve now updated the post and the code to remove all instances of the value from the list.


Do you like data structures? Of course you do! Who doesn’t like data structures? Would you like more data structures? Naturally. So why not sign up for my Patreon and tell me so, so you can get more exciting blog posts like this! You’ll get access to drafts of upcoming blog posts, a slightly increased blogging rate from me, and the warm fuzzy feeling of supporting someone whose writing you enjoy.
