
Selection pressures on company behaviour

Epistemic status: Speculative. Almost certainly not an idea that is original to me, but it’s hard to google for this without being swamped in people who don’t actually understand what evolution is.

Evolution is not only real, it’s everywhere. The set of criteria required for an evolutionary selection process is so mild that it’s almost weird to find things where it isn’t a factor (though it may not be the dominant one).

In order to be subject to evolution, roughly all you need is the following:

  1. Heritability – the ability to reproduce and create new things that are in some way “based” on the original and share its traits.
  2. Mutation – a certain degree of random changes to traits as you reproduce.
  3. Selection – reproductive success must be influenced by inherited traits, usually due to some sort of interaction with the environment (e.g. “survival of the fittest”, but also things like traits that make you more able to attract mates)

This absolutely includes evolution in the classical, biological sense. A while back Dawkins proposed memes (in the original sense that just means “ideas that are subject to selection pressure”, rather than pictures of cats with words on them), and there’s also a fairly significant evolutionary process in how language changes, but really it’s just about everything: programming languages evolve, natural languages evolve, genres of literature evolve, etc.

Often there is some mix of evolution and design going on (evolution is not design!) but the evolution is still there, the design is just a major part of the inheritance and mutation processes.

Note that evolve doesn’t mean “improve”, it just means “adapt to selection pressures”, and again, often that adaptation isn’t one that we’d particularly think of as a good thing. As Dawkins (again. Sorry. In my defence this was all from before his evolution into his current role as chief new atheist ranter) pointed out in The Selfish Gene, the evolution of our genetics doesn’t select in any way for our well-being, it just does whatever makes the numbers go up (which is why you get things like genes that work great when you have one of them and kill you when you have two. Statistically it’s a great deal!)

One thing I’ve been noticing for a while is that company cultures are absolutely in the camp of things that are subject to evolutionary pressures. A company culture:

  1. Reproduces as people leave it and join or form other companies, or as people outside the company are exposed to ideas about how the company works.
  2. Mutates, as people change their opinions over time through their natural growth and change and through exposure to other companies.
  3. Is selected for, because company culture is a large part of determining company success.

I’ve previously thought of this as a mostly neutral thing with some interesting implications. e.g. it probably tends to mean that over time we evolve companies which are more or less rational economic actors even if they’re not made out of rational economic actors. Not that I necessarily think that’s a good thing, but it’s at least a useful thing for reasoning about company behaviour.

But I realised this morning a rather unfortunate problem: Although having highly profitable companies which treat their employees great and make economically rational decisions is one perfectly viable reproductive strategy in this model (where in that case reproduction tends to come more from people trying to emulate you than from people leaving, although there tends to be enough turnover in even great companies that it’s both), there is another extremely adaptive model.

That model is to basically just be incredibly dysfunctional and haemorrhage employees as a result.

This might not be good for the survival of the company, but it’s not the company that we’re considering here, it’s the population of company cultures. If the company goes under and all its employees go off and form five different companies that look exactly the same as their parent company, that’s evolutionarily speaking a great success.

The problem is that people tend to reproduce the cultures they’re used to. Even when they realise that those cultures are bad, they will typically only pick up on a few of the ways in which they were bad, and will tend to just carry the rest of them along without ever really noticing.

As a result, dysfunctional company cultures are fairly actively selected for, and will tend to thrive even as they destroy the companies they live in.

This is perhaps analogous to r/K selection theory – some animals have a small number of offspring that they invest a lot of effort in to ensure they thrive (K-selection, roughly analogous to well-functioning company cultures that reproduce slowly but tend to promote the survival of the companies that have them), while others produce many offspring, most of whom will die but enough of whom will survive to adulthood to continue on to another generation (r-selection, roughly analogous to dysfunctional company cultures that destroy their companies but are passed on when the employees leave for other companies). The analogy isn’t perfect because the mechanism of “gene” transmission is so different, but it’s at least suggestive.

Which one is more adaptive varies from niche to niche, and there are plenty of niches, so it’s unlikely that either type of culture or organism is ever going to out-compete the other and replace it, but anecdotally it certainly seems like we have enough niches for the dysfunctional company cultures to thrive in right now.

So that’s the problem. What’s the solution?

Good question. Unfortunately I’m not sure of the answer. Culture is hard.

But I do think that whatever the solution is, it’s going to have to work with the evolutionary dynamic, not against it – evolution is going to happen regardless of whether we want it to or not, but we can try to shape it so that it selects for things we want.

So here are some ideas for how we can try to guide each of the three aspects of that evolution:

  1. To control how dysfunctional cultures reproduce, we can reduce how much hiring we do along friendship and network lines – when a company fails and people leave, they tend to bring their friends at that company along with them. This is a perfectly natural thing to do, but tends to perpetuate the culture they are used to. (That being said, I don’t think it’s a good idea to actively avoid hiring people from dysfunctional companies – that will just send them to companies that will happily reproduce their culture. Also it punishes individuals unfairly for things that are not their fault. Just… maybe treat their opinions on how things should be done with a grain of salt.)
  2. To control how culture mutates, we can be more actively involved in designing our cultures for the traits we want – culture isn’t just a thing that happens, it’s a thing we create. If we let it emerge purely organically, we will tend to reproduce what has come before. If we design it, and in particular design it to avoid things we know are dysfunctional, the mutation will more actively push us in the directions we want.
  3. To control how dysfunction is selected for, we as individuals can more actively avoid working for dysfunctional companies – investigate the culture before we join, and leave sooner rather than later once we realise there are major problems we can’t fix. Where possible, if we know a company has actively bad culture we should warn people off, though whisper networks and doing this in public have problems that I don’t entirely know how to solve.

I don’t necessarily think those ideas are either necessary or sufficient, but I do think they will probably help. Evolution is hard to manage, and tends to do things we don’t expect, but a certain amount of selective breeding is possible, and I think we need to start thinking about how to do that for the cultures we create.

Of course, when we’re doing that, we need to be careful about going too far in the other direction.

(Thanks to Jess for the conversation that led to me thinking about this stuff again and led to the realisation that sparked this post, and Kelsey for acting as my consulting biologist)


An algorithmic puzzle

I posted the following problem on Twitter yesterday: I have intervals \(L_1, \ldots, L_m\) which are represented as half-open intervals \(L_k = [u_k, v_k)\). I know that the intervals are all distinct and that for any two intervals \(L_i, L_j\) with \(i \neq j\), either one is contained in the other or the two are disjoint.

I want to sample uniformly from all pairs \(\{(i, j) : L_i \cap L_j = \emptyset \}\), or determine that there are no such pairs. How?

A solution now follows. It might not be the optimal solution, but it’s pretty close.

The two observations that make the solution work are:

  1. If every interval is disjoint from at least one other interval, then rejection sampling runs in expected time at most \(O(m)\): the probability of accepting a random pair is at least the lowest probability, over all \(i\), of picking a \(j\) disjoint from \(i\), and that is at least \(\frac{1}{m - 1}\) when every interval has a disjoint partner. (A sketch of this loop follows the list.)
  2. A root interval (one containing everything else) appears in no disjoint pair and thus can be removed from the list without changing the distribution. If there is no root interval then every interval is disjoint from at least one other.
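
To make the first observation concrete, here is a minimal sketch of the rejection sampling loop. The representation (intervals as \((u, v)\) tuples of half-open endpoints) and the function name are just illustrative assumptions of mine, and it samples ordered pairs; it is only safe to call if at least one disjoint pair exists:

import random

def sample_disjoint_pair(intervals, rng=random):
    """Rejection sampling: keep drawing ordered pairs until a disjoint one
    is found. Only safe to call if at least one disjoint pair exists."""
    m = len(intervals)
    while True:
        i, j = rng.randrange(m), rng.randrange(m)
        if i == j:
            continue
        (u1, v1), (u2, v2) = intervals[i], intervals[j]
        # Half-open intervals are disjoint iff one ends at or before
        # the other starts.
        if v1 <= u2 or v2 <= u1:
            return i, j

Disjointness is symmetric, so whether you care about ordered or unordered pairs this samples them uniformly.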

To unpack the second observation: if there is no root interval, take some \(L_i\). Then \(L_i\) is contained in some interval \(M\) of maximal length (possibly \(L_i\) itself). \(M\) is not a root interval, so there is some \(L_j\) not contained in \(M\); \(L_j\) cannot contain \(M\) either, since nothing strictly contains an interval of maximal length, so by the nesting property \(L_j\) is disjoint from \(M\), and hence from \(L_i\).

It’s easy to find a root interval in \(O(m)\), but removing it might introduce a new root-interval (in the degenerate case the whole list could be a chain of the form \(L_i = [i, m)\)).

Here’s how to simultaneously remove all root intervals in \(O(m \log(m))\):

First note that a root interval is necessarily of maximum length. Thus if we sort the intervals in order of decreasing length (that’s the \(O(m \log(m))\) bit; you can do it in \(O(m)\) with bucket sort if the interval boundaries are not-too-large integers), the root intervals must form some initial prefix of the list. By renumbering where necessary, let’s assume that the \(L_i\) are sorted in that order.

Now calculate the intervals \(B_i = [\min\limits_{k \geq i} u_k, \max\limits_{k \geq i} v_k)\). These are the bounding intervals of the tail sets and can easily be worked out in \(O(m)\) by just iterating backwards over the list.

\(L_i\) contains all \(L_k\) with \(k \geq i\) if and only if it contains \(B_i\). So now we find the first index that is not a root index just by iterating forward through the \(L_i\) until \(B_i\) is not contained in \(L_i\).

We now take all intervals from that point onwards and discard everything earlier. If the set is now empty (every interval contained everything after it, i.e. the list was a chain), we raise an error because there are no disjoint pairs. If it’s not empty, we can now run the rejection sampling algorithm on it as desired and have it terminate in expected \(O(m)\) time.
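
Putting the whole thing together, a sketch of the complete procedure might look like the following. It reuses the hypothetical sample_disjoint_pair (and the random import) from the sketch above, all names are again just illustrative, and it assumes the input satisfies the nesting property:

def strip_root_prefix(intervals):
    """Return the indices of the intervals that survive once all root
    intervals have been discarded (empty if the list is a chain)."""
    # Sort indices by decreasing length: the O(m log m) step.
    order = sorted(range(len(intervals)),
                   key=lambda i: intervals[i][1] - intervals[i][0],
                   reverse=True)
    # Backwards pass: bounds[j] is the bounding interval of the tail
    # starting at position j in sorted order.
    bounds = [None] * len(order)
    lo, hi = None, None
    for j in reversed(range(len(order))):
        u, v = intervals[order[j]]
        lo = u if lo is None else min(lo, u)
        hi = v if hi is None else max(hi, v)
        bounds[j] = (lo, hi)
    # Forward pass: the first interval that does not contain the bounding
    # interval of its tail is the first non-root; keep it and all after it.
    for j, i in enumerate(order):
        u, v = intervals[i]
        if not (u <= bounds[j][0] and bounds[j][1] <= v):
            return order[j:]
    return []

def sample(intervals, rng=random):
    survivors = strip_root_prefix(intervals)
    if not survivors:
        raise ValueError("no disjoint pairs")
    i, j = sample_disjoint_pair([intervals[k] for k in survivors], rng)
    return survivors[i], survivors[j]

On a pure chain such as \(L_i = [i, m)\) everything gets stripped and we raise; otherwise every surviving interval has a disjoint partner, so the rejection loop terminates in expected \(O(m)\).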

I think there’s another solution that may sometimes be better, involving explicitly tracking the tree structure and counting the number of disjoint pairs that live in each sub-interval, but it fell apart the more I looked at it and I wasn’t convinced it was ever actually better.


Lazy Fisher-Yates shuffling for precise rejection sampling

This is a trick I figured out a while ago. It came up in a problem I was working on, so I thought I’d write it up.

I haven’t seen it anywhere else, but I would be very surprised if it were in any way original to me rather than a reinvention. I think it’s niche and not very useful, so it’s hard to find prior art.

Attention conservation notice: My conclusion at the end is going to be that this is a neat trick that is probably not worth bothering with. You get a slight asymptotic improvement for a worse constant factor. I mention some variants and cases where it might be useful at the end, but it’s definitely more of an interesting intellectual exercise than a practical tool.

Suppose you have the following problem: You want to implement the following function:

def sample(random, n, f):
    """returns a number i in range(n) such that f(i)
    is True. Raises ValueError if no such i exists."""

What is the right way to do this?

The obvious way to do it is this:

def sample(random, n, f):
    choices = [i for i in range(n) if f(i)]
    if not choices:
        raise ValueError("No such i!")
    return random.choice(choices)

We calculate all of the possible outcomes up front and then sample from those.

This works but it is always \(O(n)\). We can’t hope to do better than that in general – in the case where \(f(i)\) always returns False, we must call it \(n\) times to verify that and raise an error – but it is highly inefficient if, say, \(f\) always returns True.

For cases where we know a priori that there is at least one i with f(i) returning True, we can use the following implementation:

def sample(random, n, f):
    while True:
        i = random.randrange(0, n)
        if f(i):
            return i

Here we do rejection sampling: We simulate the distribution directly by picking from the larger uniform distribution and selecting on the conditional property that \(f(i)\) is True.

How well this works depends on a parameter \(A\) – the number of values in \([0, n)\) such that \(f(i)\) is True. The probability of stopping at any loop iteration is \(\frac{A}{n}\) (the probability of the result being True), so the number of loops is a geometric distribution with that parameter, and so the expected number of loop iterations is \(\frac{n}{A}\). i.e. we’ve effectively sped up our search by a factor of \(A\).

So in expectation rejection sampling is always at least as good and usually better than our straightforward loop, but it has three problems:

  1. It will occasionally take more than \(n\) iterations to complete because it may try the same \(i\) more than once.
  2. It loops forever when \(A=0\)
  3. Even when \(A > 0\) its worst-case may take strictly more iterations than the filtering method.

So how can we fix this?

If we fix the first by arranging so that it never calls \(f\) with the same \(i\) more than once, then we also implicitly fix the other two: At some point we’ve called \(f\) with every \(i\) and we can terminate the loop and error, and if each loop iteration returns a fresh \(i\) then we can’t have more than \(n\) iterations.

We can do this fairly naturally by shuffling the order in which we try the indices and then returning the first one that returns True:

def sample(random, n, f):
    indices = list(range(n))
    random.shuffle(indices)
    for i in indices:
        if f(i):
            return i
    raise ValueError("No such i!")

Effectively at each iteration of the loop we are uniformly at random selecting an \(i\) among those we haven’t seen yet.

So now we are making fewer calls to \(f\) – we only call it until we first find an example, as in our rejection sampling. However we’re still paying an \(O(n)\) cost for actually doing the shuffle!

We can fix this partially by inlining the implementation of the shuffle as follows (this is a standard Fisher-Yates shuffle, though we’re doing it backwards):

def sample(random, n, f):
    indices = list(range(n))
    for j in range(n):
        k = random.randrange(j, n)
        indices[j], indices[k] = indices[k], indices[j]
        i = indices[j]
        if f(i):
            return i
    raise ValueError("No such i!")

This works because after \(j\) iterations of the loop in a Fisher-Yates shuffle the indices up to and including \(j\) are shuffled. Thus we can effectively fuse our search into the loop – if we know we’re going to terminate here, we can just stop and not bother shuffling the rest.

But we’re still paying \(O(n)\) cost for initialising the list of indices. Can we do better? Yes we can!

The idea is that we can amortise the cost of our reads of indices into our writes of it, and we only do \(O(\frac{n}{A})\) writes. If we haven’t written to a position in the indices list yet, then its value must be equal to its position.

We can do this naturally in Python using a defaultdict subclass (there are better data structures for this, but a hash map gets us the desired amortised complexity) as follows:

from collections import defaultdict

class key_dict(defaultdict):
    # An unwritten position reads back as its own key, so this behaves
    # like list(range(n)) without ever materialising it.
    def __missing__(self, key):
        return key

def sample(random, n, f):
    indices = key_dict()
    for j in range(n):
        k = random.randrange(j, n)
        indices[j], indices[k] = indices[k], indices[j]
        i = indices[j]
        if f(i):
            return i
    raise ValueError("No such i!")

So we have now completely avoided any up-front \(O(n)\) costs (assuming we’re writing Python 3, so that range is lazy. If you’re still using legacy Python, replace it with an xrange).

What is the actual expected number of loop iterations?

Well if \(A = n\) then we only make one loop iteration, and if \(A = 0\) we always make \(n\). In general, when \(A > 0\), we certainly make no more than \(n - A + 1\) iterations, because after \(n - A\) iterations we’ve exhausted all the values for which \(f\) is False, so the next one must succeed.

I, err, confess I’ve got a bit lost in the maths for what the expected number of iterations of this loop is. If \(L\) is a random variable that is the position of the loop iteration on which this terminates with a successful result (with \(L = n + 1\) if there are no valid values) then a counting argument gives us that \(P(L > k) = \frac{(n - A)! (n - k)!}{n! (n - A - k)!}\), and we can calculate \(E(L) = \sum\limits_{k=0}^{n - A} P(L > k)\), but the sum is fiddly and my ability to do such sums is rusty. Based on plugging some special cases into Wolfram Alpha, I think the expected number of loop iterations is something like \(\frac{n+1}{A + 1}\), which at least gives the right numbers for the boundary cases. If that is the right answer then I’m sure there’s some elegant combinatorial argument that shows it, but I’m not currently seeing it. Assuming this is right, this is asymptotically no better than the original rejection sampling.
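
For what it’s worth, the lazy shuffle is equivalent to walking a uniformly shuffled list and stopping at the first success, so the guess is easy to sanity-check numerically. This throwaway snippet (the name and setup are mine, not part of the sampler above) just averages that stopping position:

import random

def average_iterations(n, A, trials=100000):
    """Estimate the expected number of loop iterations when exactly A of
    the n values satisfy f, by pretending f(i) is True iff i < A."""
    total = 0
    for _ in range(trials):
        order = list(range(n))
        random.shuffle(order)
        # 1-based position of the first satisfying value, or n if none.
        total += next((pos + 1 for pos, i in enumerate(order) if i < A), n)
    return total / trials

# If the guess is right, average_iterations(20, 4) should hover around
# (20 + 1) / (4 + 1) = 4.2.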

Regardless, the expected number of loop iterations is definitely \(\leq \frac{n}{A}\), because we can simulate this process by running our rejection sampling and counting an iteration only the first time we see each index, so it is strictly dominated by the expected number of iterations of the pure rejection sampling. So we do achieve our promised result of an algorithm that strictly improves on both rejection sampling and filter-then-sample – it has a better expected complexity than the former, and the same worst-case complexity as the latter.

Is this method actually worth using? Ehhh… maybe? It’s conceptually pleasing to me, and it’s nice to have a method that complexity-wise strictly out-performs either of the two natural choices, but in practice the constant overhead of using the hash map almost certainly is greater than any benefit you get from it.

The real problem that limits this being actually useful is that either \(n\) is small, in which case who cares, or \(n\) is large, in which case the chances of drawing a duplicate are sufficiently low that the overhead is negligible.

There are a couple of ways this idea can still be useful in the right niche though:

  • This does reduce the constant factor in the number of calls to \(f\) (especially if \(A\) is small), so if \(f\) is expensive then the constant factor improvement in number of calls may be enough to justify this.
  • If you already have the values you want to sample in an array, and the order of the array is arbitrary, then you can use the lazy Fisher-Yates shuffle trick directly without the masking hash map.
  • If you’re genuinely unsure about whether \(A > 0\) and do want to be able to check, this method allows you to do that (but you could also just run rejection sampling \(n\) times and then fall back to filter and sample and it would probably be slightly faster to do so).

If any of those apply, this might be a worthwhile trick. If they don’t, hopefully you enjoyed reading about it anyway. Sorry.


A pathological example for test-case reduction

This is an example I cut from a paper I am currently writing about test-case reduction because I needed to save space and nobody except me actually cares about the algorithmic complexity of test-case reduction.

Suppose you have the following test case:

from hypothesis import given, strategies as st
 
K = 64
 
INTS = st.integers(0, 2 ** K - 1)
 
@given(INTS, INTS)
def test_are_not_too_far_apart(m, n):
    assert abs(m - n) > 1

Then if this test ever fails, with initial values \(m\) and \(n\), and reduction can only shrink one of the two ints at a time (never both at once), the reduction will take at least \(\frac{m + n}{2}\) test executions: any single change must keep \(|m - n| \le 1\) for the test to keep failing, so even the best possible step, replacing the larger of \(m\) and \(n\) with one less than the smaller, only reduces \(m + n\) by \(2\), and hence the whole process takes at least \(\frac{m + n}{2}\) steps.
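
To make that counting concrete, here is a toy model of the reduction. This is emphatically not Hypothesis’s actual reducer, just a hypothetical idealised greedy reducer that lowers one value per step and must keep the test failing (i.e. keep \(|m - n| \le 1\)) after every step:

def still_failing(m, n):
    # The test fails exactly when the assertion abs(m - n) > 1 is violated.
    return abs(m - n) <= 1

def greedy_reduction_steps(m, n):
    """Count the steps taken by an idealised reducer that lowers one value
    at a time and keeps the test failing after every step."""
    assert still_failing(m, n)
    steps = 0
    while (m, n) != (0, 0):
        # Best legal single-value move: replace the larger value with one
        # less than the smaller (bottoming out at 0).
        if m >= n:
            m = max(n - 1, 0)
        else:
            n = max(m - 1, 0)
        assert still_failing(m, n)
        steps += 1
    return steps

# For large m and n this takes roughly (m + n) / 2 steps, e.g.
# greedy_reduction_steps(10 ** 6, 10 ** 6) is about 10 ** 6.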

(Side note: you can get lucky with Hypothesis and trigger some special cases that make this test faster: if you happen to have \(m = n\) then it can reduce the two together, but if you start with \(m = n \pm 1\) then that special case will currently never trigger, because it never has duplicates at the entry to that step. Hypothesis will also actually find this bug immediately, because it will try it with both examples set to zero. Trivial modifications to the test can be made to avoid these problems, but I’m going to ignore them here.)

The interesting thing about this from a Hypothesis point of view is that \(m + n\) is potentially exponential in \(K\), and the data size is linear in \(K\), so Hypothesis’s test case reduction is of exponential complexity (which doesn’t really cause a performance problem in practice because the number of successful reductions gets capped, but does cause an example quality problem because you then don’t run the full reduction). But this isn’t specifically a Hypothesis problem – I’m morally certain every current property-based testing library’s test case reduction is exponential in this case (except for ones that haven’t implemented reduction at all), possibly with one or two patches to avoid trivial special cases like always trying zero first.

Another way to get around this is to almost never trigger this test case with large values! Typically, property-based testing libraries will only generate an example like this with very small values. But it’s easy for that not to be the case – mutational property-based testing libraries like Hypothesis or Crowbar can in theory fairly easily find this example for large \(m\) and \(n\) (Hypothesis currently doesn’t; I haven’t tried with Crowbar). Another way you could easily trigger it is with distributions that special-case large values.

One thing I want to emphasise is that regardless of the specific nature of the example and our workarounds for it, this sort of problem is inherent. It’s easy to make patches that avoid this particular example (especially in Hypothesis which has no problem making simultaneous changes to \(m\) and \(n\)).

But if you fix this one, I can just construct another that is tailored to break whatever heuristic it was that you came up with. Test-case reduction is a local search method in an exponentially large space, and this sort of problem is just what happens when you block off all of the large changes your local search method tries to make but still allow some of the small ones.

You can basically keep dangling the carrot in front of the test-case reducer going “Maybe after this reduction you’ll be done”, and you can keep doing that indefinitely because of the size of the space. Pathological examples like this are not weird special cases, if anything the weird thing is that most examples are not pathological like this.

My suspicion is that we don’t see this problem cropping up much in practice for a couple of reasons:

  1. Existing property-based testing libraries are very biased towards finding small examples in the first place, so even when we hit cases with pathological complexity, \(n\) is so small that it doesn’t actually matter.
  2. This sort of boundary case relies on what are essentially “weird coincidences” in the test case. They happen when small local changes unlock other small local changes that were previously locked. This requires subtle dependency between different parts of the test case, and where that subtle dependency exists we are almost never finding it. Thus I suspect the fact that we are not hitting exponential slow downs in our test case reduction on a regular basis may actually be a sign that there are entire classes of bug that we are just never finding because the probability of hitting the required dependency combination is too low.
  3. It may also be that bugs just typically do not tend to have that kind of sensitive dependencies. My suspicion is that this is not true given the prevalence of off-by-one errors.
  4. It may also be that people are hitting this sort of problem in practice and aren’t telling us because they don’t care that much about the performance of test case reduction or example quality.

The Clarke Gap

I coined a term on Twitter earlier and almost immediately found it useful, so I thought I’d register it somewhere a bit more permanent.

The “Clarke Gap” of a technology is the degree to which it is distinguishable from magic.

(The term of course comes from Clarke’s third law, that any sufficiently advanced technology is indistinguishable from magic, or more specifically from the corollary that any piece of technology distinguishable from magic is insufficiently advanced).

The Clarke gap is characterised mostly by how likely something is to go wrong in a way that requires you to understand what the system is actually doing in order to fix it. The more likely it is, the larger the Clarke gap is. If something always works and you never need to understand it, it might as well be magic. If you have to open the hood every five minutes and poke the bits that have got misaligned, it’s obviously technology. Most things are somewhere in between.

It’s not that things with a low Clarke gap don’t go wrong, but when they go wrong they tend to do it in a way that is either easy or extremely hard to fix – e.g. a computer has a fairly low Clarke gap – when it goes seriously wrong you need a lot of technical expertise to figure out what’s happened and why – but computers go wrong all the time and mostly you can just turn it off and on again (either the whole computer or the misbehaving program) and it will work.

A low Clarke gap is neither wholly good nor wholly bad. It’s mostly about which cases you’re optimising for: systems with a low Clarke gap tend to be much easier to learn at a casual usage level, but much harder to acquire expertise in.

For any given class of technology, the Clarke gap tends to go down over time. If you look at search engines now versus search engines twenty years ago, the Clarke gap has gone way down – a modern search engine is very much trying to guess your intent and thinks it knows better than you, while one from twenty years ago is basically a boolean query engine with maybe some spelling correction if you’re lucky. I suspect this has a correspondingly negative impact on people’s abilities to pick up search skills – we’re so used to search being magically good, we don’t pick up the necessary knowledge about what to do when it’s not.

There are a number of factors that tend to push it this way. If users commonly experience a problem, the natural thing to do is to try to make that problem non-existent or easy to solve. This is good development practice! You’re making people’s lives easier. But by doing this you’re also removing a lot of intermediate ways in which the system can go wrong, and making the system more complex and thus increasing the number of more subtle ways it has to fail. This lowers the Clarke gap.

I think this is a useful phenomenon to pay attention to, and consequently to have a term for, so I’m probably going to keep using the term until a better one comes around.
