David R. MacIver's Blog

Generating provably solvable instances for testing solvers

Do you wanna test a SAT solver? (It doesn't have to be a SAT solver)

17 June 2026

Do you wanna test a SAT solver? (It doesn't have to be a SAT solver)

There’s a technique in property-based testing that almost nobody seems to know about but me. There are a variety of reasons for this, but most of them are that nobody really seems to innovate in what sorts of properties to test, and this one only works painlessly in Hypothesis (and Hegel). It’s useful any time you have something that looks like a “solver” or “optimiser” - something that takes a problem description, and outputs some solution to that problem. For example, scheduling a meditation retreat.

Testing a SAT solver never misses solutions

To take the central example of such problems, suppose that you want to test a SAT solver, a completely normal and reasonable thing that normal and reasonable people do.¹

What are some reasonable property-based tests you might want to write for this?

from hypothesis import assume, given
from hypothesis import strategies as st
from my_solver import MySolver


@st.composite
def sat_problems(draw):
    n_variables = draw(st.integers(1, 100))
    literals = st.integers(1, n_variables) | st.integers(-n_variables, -1)
    clauses = st.lists(literals, min_size=1, unique_by=abs)
    return draw(st.lists(clauses, unique_by=frozenset, min_size=1))


@settings(verbosity=Verbosity.verbose)
@given(sat_problems())
def test_every_model_solves_the_problem(problem):
    solver = MySolver(problem)
    assume(solver.solve())
    model = solver.get_model()
    assert model is not None
    model = set(model)
    for clause in problem:
        assert not set(clause).isdisjoint(model)

That is, we construct our problem in DIMACS CNF - every clause is represented as a list of integers, with -n meaning that n is assigned false, and n meaning n is assigned true - and pass it to the solver. If the solver says that the SAT problem is solvable, then we get a model out of it. This is a list of each variable, positive if the variable is set true, negative if it is set false.² Our test asserts that every clause is satisfied - that is, one of the literals in it also appears in the model.

This is a perfectly reasonable test, but it should raise some concerns.

Firstly, we’ve got that assume in there which is an assumption based on the results of calling the solver. That’s not intrinsically bad, but whenever we’ve got an assume like that we should worry that there’s a decent chance that Hypothesis is going to yell at us about filtering too much.³

But the real worry I’d like to point out here is this: We’ve tested a weaker property than we’d ideally like to. This property essentially says that when the SAT solver says a problem is solvable, it’s right.⁴ But we’d like to check the other direction too: When the solver says a problem isn’t solvable, it’s right. How might we do that?

One easy answer to this is differential testing. Take some other SAT solver, and whenever our solver claims a problem is unsolvable, check if the other SAT solver agrees. If it doesn’t, there’s a bug in one of the solvers.

If we didn’t have another SAT solver though,⁵ how might we do that?

Well, we just write a more specific generator, one that only generates solvable SAT problems. This sounds hard, until you realise the trick, which results on the following very deep and important insight: Every solvable problem has a solution.

This is useful because you can generate that solution first, and work backwards from there.

from hypothesis import assume, given
from hypothesis import strategies as st
from my_solver import MySolver

@st.composite
def solvable_sat_problems(draw):
    n_variables = draw(st.integers(1, 100))
    solution = [
        draw(st.sampled_from((i, -i))) for i in range(1, n_variables + 1)
    ]
    literals = st.integers(1, n_variables) | st.integers(-n_variables, -1)

    @st.composite
    def clauses(draw):
        base = draw(
            st.lists(st.sampled_from(solution), min_size=1, unique_by=abs)
        )
        used = set(map(abs, base))
        unused_literals = literals.filter(lambda n: abs(n) not in base)
        extra = draw(st.lists(unused_literals, unique_by=abs))
        clause = base + extra
        return draw(st.permutations(clause))

    return draw(st.lists(clauses(), unique_by=frozenset, min_size=1))


@settings(verbosity=Verbosity.verbose)
@given(solvable_sat_problems())
def test_every_model_solves_the_problem(problem):
    solver = MySolver(problem)
    assert solver.solve()
    model = solver.get_model()
    assert model is not None
    model = set(model)
    for clause in problem:
        assert not set(clause).isdisjoint(model)

The strategy here is a bit more complicated, so let me walk you through it: First, at the start, we generate a solution to the SAT problem. This is just an assignment of the variables.

Then when we generate a clause, we do it in two parts: First we generate the part of the clause that our chosen solution satisfies, then we generate some extra list of literals whose variables are not used in that solution, and then just for good measure we mix up the clauses.⁶ This means that, by construction, every clause is satisfied by our chosen solution, and as a result every problem generated by this is necessarily solvable.

We then change our existing test to just replace the assume with an assert, because now all our problems are solvable by construction.

This test is complete, in the sense that every solvable SAT problem can in theory be generated by this,⁷ because every SAT problem has a satisfying assignment, and if we start by generating some such assignment then we can always generate the target SAT problem. This means that if a SAT solver passes the logically-infinite version of this test then we know that it solves every solvable SAT problem.

It doesn’t however handle the other side of this, that is that whenever a problem is unsolvable the SAT solver says it’s unsolvable. I don’t actually have a very good answer for how to write those tests other than differential testing.⁸ There are a few interesting properties, e.g. if you add clauses to an unsolvable SAT problem it should stay unsolvable,⁹ but nothing very satisfying.

Still, I think the solvable case is often the more interesting side in practice.

Extending this to optimisers

Another problem domain in this class is optimisers, which have to give you not just a solution but the best solution. This technique has a natural extension there: We generate a solution, then we generate a problem to which that is a solution, and then we run the optimiser and see how well it does. If it does worse than our starting problem, that’s a bug.

For example, consider the knapsack packing problem: I have an integer capacity, and a list of integer weights, and I want to fit the maximum amount of stuff into my knapsack.

from hypothesis import assume, given
from hypothesis import strategies as st
from my_solver import KnapsackSolver

@st.composite
def knapsack_problems_with_score(draw):
    solution = draw(st.lists(st.integers(min_value=0), min_size=1))
    extra_capacity = draw(st.integers(min_value=0))
    capacity = sum(solution) + extra_capacity
    rest = draw(st.lists(st.integers(min_value=extra_capacity + 1), min_size=1)))
    weights = draw(st.permutations(solution + rest))
    return weights, capacity, sum(solution)

@given(knapsack_problems_with_score())
def test_always_beats_known_achievable_score(problem):
    weights, capacity, to_beat = problem
    solver = KnapsackSolver(weights, capacity)
    solution = solver.get_solution()
    assert sum(solution) >= to_beat

Unlike the SAT solver, the knapsack solver always succeeds - it can just return the empty list of weights - but the question is how well it does. We don’t know the optimal solution here, but we do know a solution, and the optimal solution has to be at least as good as that.¹⁰

Why this is Hypothesis-model-specific

The big guarantee that the Hypothesis property-based testing model gives you is that you can write generators and everything will more or less just work. They will shrink reasonable well, and you will never be given a shrunk test case that couldn’t have come from the generator.

This sort of test-case generation where you first generate something and then you do dependent generation on it is the classic place where other approaches to deriving shrinkers just straight up doesn’t work, and if you try to do type-based shrinking on the result then you will lose your guarantee that the problem is solvable. For example if you generate [[-1], [2]] and then try to shrink it to [[-1], [1]], now you’ve got an insoluble SAT problem. This will never happen in Hypothesis, but will absolutely happen in QuickCheck.

I think there’s something further than this though, which is that people have got very used to the idea that you are trying to write property-based tests whose generators range over the set of all valid inputs to your program. This sort of technique where you construct more specialised generators that are designed to explore only a subset of the valid inputs (and as a result can guarantee stronger properties hold there) seems underexplored, and I suspect there is a lot of easy low-hanging fruit for new properties lurking in it as a result.

As a reminder, a SAT solver is a tool that takes some boolean formula and tries to find an assignment of its variables that causes it to return true. Essentially all SAT solvers operate in CNF, “conjunctive normal form”, which means that they are expressed as a big and over a set of ors. e.g. a & (b | !c)
↩︎
This is the standard PySAT API, but is also just a very common way of representing this.
↩︎
It doesn’t in this case. This strategy almost always generates solvable SAT problems. That stops being the case if we set the max number of variables smaller, but it’s still always basically fine.
↩︎
Technically it says that when it says the problem is solvable, the model it claims solves it is right, which is a slightly stronger property.
↩︎
Mostly because this is an instance of a general class of problem rather than because we might realistically not have a SAT solver
↩︎
We could equally well sort by variable or something, I just wanted to make sure we avoided putting all the base clauses first.
↩︎
In practice, the distribution of problems could definitely use some work.
↩︎
Which can be against a guaranteed-correct brute force version if you restrict the number of variables to be small.
↩︎
This is a form of metamorphic testing.
↩︎
You can also regard this as a straightforward solver for the decision problem of “Does this problem admit a solution with at least this score?”
↩︎

How I've been using Claude Code

9 April 2026

I wrote a comment on lobste.rs about how we’re using Claude Code on Hegel. Various people have asked me to turn it into a blog post. This isn’t exactly that, as I’d like to talk about how I’ve been using Claude Code more broadly, but it should have the content people were asking for.

Claude use in Hegel

To start with, a disclaimer: None of this is official Antithesis policy. We’ve been running a bit ahead of the pack on how we’ve been using Claude in Hegel, because it’s open source code (and thus we’re somewhat less concerned about it being leaked because it being leaked is, well, what it’s for) and greenfield so small enough for the team working on it to keep on top of the code. Everyone else in the company is being much more cautious in their adoption of agentic coding than we are with Hegel.

Anyway, we’ve been using Claude Code a lot. I’d say that something in the region of 90% of code I’ve “written” for Hegel has been written by a Claude, at least in its first draft. Sometimes the second draft is me going in and tearing all the bullshit Claude has done apart and fixing it properly. More often though it’s me telling Claude what to fix, or me going in and doing a targeted rewrite of some particularly egregiously wrong bit.

Although some people seem to have decided that the words mean “anything made with an LLM”, this is not vibecoding and little to none of it is slop.¹ We’ve reviewed the code ourselves, and heavily dictated its design. Hegel was extremely not “Claude, make me a property-based testing library”. I designed the protocol, Liam and I designed the API together,² but we got Claude to do a lot of the actual line-by-line writing of the code.

This has gone great. Hegel works really well,³ and we have been able to develop it much faster than we ever would by hand.

But if we’d just gone down the “Claude, make me a property-based testing library”, it would have been a disaster. Claudes can probably just about port minithesis, and we’re unironically hoping that a future model will be good enough to port enough of Hypothesis for us to rewrite hegel-core in Rust, but most of the early work on Hegel where we gave a Claude a bit of a free rein was a complete mess. e.g. all the original protocol implementation is hand-written by me, because I gave a Claude a detailed spec of the protocol and it decided that it was too complicated and it would take a more “pragmatic”⁴ solution and shoved every message down the control stream.⁵

Instead, we have a standard pull-request workflow, where a human reviews all the code. Actually, two humans, because first the person making the pull request reviews all the code that a Claude wrote on their behalf. We still catch places where Claude comically fucked something up and we failed to catch it. Often this is my fault: I’m very used to code reviewing in a high trust environment, where I can trust that the person who wrote the code is basically competent and well-intentioned and was actually trying to succeed. This means that I’m looking for a different set of things than you need when reviewing AI, or even junior human, code - high level misunderstandings and problems with the design rather than “took a shortcut that completely undermines the entire point of the feature”.

Also, prior to the code being written, we’ve decided more or less what we want the code to do. We don’t necessarily know how to achieve it, but we know what we want the API to be, and we know roughly how we want it to achieve that.

A decent recent example is the better output printing pull request I made. It massively improves the quality of hegel-rust’s test output.⁶ I probably couldn’t have written this code. I’m OK at Rust, but I’m shit at Rust macros. Claude, in contrast, is pretty good at Rust macros. So I figured out enough to know what I needed to do (define the API and the way the macro rewrites it), told Claude to do it, and then went through all the ways it could go wrong and made sure there were tests for them, spotted a few more edge cases that neither I nor the Claude had thought of, and am genuinely pretty happy with the code (and, more importantly, delighted with the results of running it) despite, if I’m honest, not fully understanding all the macro code.⁷

Speaking of, testing. One of the things that ensures Claude doesn’t go completely off the rails is making sure that the code is actually tested, and we review those tests if anything more thoroughly than we review the code.⁸

In order to ensure there’s enough testing, we set minimum coverage to 100%. I basically think there’s no good reason to have untested code in a project with AI working on it.

Unfortunately, Claude disagrees. It’s become a bit of a running joke that I’m the guy who is constantly yelling at Claude to write tests and yes I really mean 100%. On hegel-rust, coverage is currently moderately short of 100% because I discovered that in the early days before we were as careful about reviewing as we are now, Claude had decided that it wasn’t pragmatic to enforce 100% coverage, and lowered the number. Normally it lowers the number to 98% or something. In this case it lowered it to uh… 30%. We’ve not fully fixed it yet and have introduced a ratchet script that forces the number of uncovered lines down to zero.⁹

9

As an aside about this… one of the things I’m constantly surprised by when writing in languages that aren’t Python is how many things in Python I took for granted are just far better than the alternatives, especially in the testing ecosystem.

To me, “100% coverage” has the standard meaning of “Tool reports 100% branch coverage, assertions and other unreachable constructs excluded, nocov comments allowed only in extreme circumstances and subject to justification in a comment and at review time”. This is straightforward inn Python. In every other language, I’ve run into at least one of:

“100% coverage” isn’t actually a thing, because coverage reports all sorts of lines as uncovered that can’t possibly be covered (e.g.structural elements)
No branch coverage
No configurable exclusions
No exclusions at all!

Which means that basically every non-Python project I’ve wanted to do this on I’ve ended up getting Claude to write a custom scripts for parsing coverage output because the built in tooling was not good enough for asserting the invariant that I want.

Originally rather than the ratchet, I tried to get a Claude to just fix the testing, but there was so much slop in the tests it wrote that I eventually gave up on getting those mergeable, and decided the ratchet was the better option. We’ll gradually fix the coverage over time, because the number can mostly only go down, and as we work on a particularly area of code we’ll refactor it towards testability at the same time.

BTW one thing you will notice on the ratchet is that it’s got explicit instructions that it is not allowed to increase the ratchet or edit the script unless a human explicitly says so. That works… most of the time. Once you’ve fenced Claude in enough that it actually has to get to 100% coverage, it will still often just decide that testing something is too hard and just try to exclude it instead. I’ve not found a better solution than human review yet, but I’m still working on it.

In general, writing quality code using Claude is this constant battle and balancing act. It enables you to do so much more - every single project I’ve used Claude on has far better setup and infrastructure than almost anything I worked on pre-Claude - but requires a level of constant vigilance to actually get the outcomes you want, and that vigilance will often slip. Sometimes you can automate it, and Claude is actually very good at helping you automate it, but it still requires a great deal of human attention.

I don’t feel like I’m doing less work as a result of using Claude on Hegel - if anything I’m working harder than I otherwise would have - but I’m definitely getting a lot done for that work.

Other projects

I’ve had a variety of other projects I’ve done with Claude, of varying degrees of quality. Most of them I’ve abandoned, because the results were not good enough, or because I lost interest after playing around with it for a while. There have been a few success stories though.

One is this website! I suppose technically it’s too early to call that a success story, but what determines whether I abandon it is really whether I want to keep writing it. The actual software is, at worst, fine, and I definitely wouldn’t have completed the migration or got the new design working without Claude support. I hate doing web development - it’s not my forte, and I don’t want to invest the time and effort required to be good at it. Also, the actual conversion process involved lots of fiddly little details that I could absolutely have done myself but was goign to keep procrastinating on indefinitely if I didn’t.

Shrinkray has had some big refactors and a new UI using Claude, and that’s proven pretty great. The UI has improved greatly as a result of doing this.¹⁰

I’ve also done some of the inevitable “I got a Claude to write me tools for using more Claude” work that comes with getting into AI agents. My current best attempt is pr-manager which is a tool for helping me keep on top of my many open pull requests. I don’t recommend it to anyone else, but it’s worked great for me.

There are also some partial success stories:

I wrote a slay the spire mod. It lets you try out different decks in different fights. I like it conceptually, I found it sortof useful, but it was definitely still buggy when I stopped working on it and now Slay the Spire 2 is out and I’m not very interested in resuming work on it. I think there’s a decent chance that if Megacrit don’t add a similar feature (which I’ve no reason to believe they will) I’ll port this mod to StS2 when there’s a good modding story, but I’m not currently motivated to do this.
A dynamic random sampler. This is a datatype from a paper that I’ve wanted a solid implementation of for ages. It’s a way of sampling from a discrete random distribution where you can efficiently update the weights. I think the implementation Claude has produced is really solid as far as I can tell - I keep coming up with new ways to test it and it keeps passing them. This required a degree of yelling at Claude and making it fix problems, but I haven’t actually done more than skim the implementation code. This is only a partial success because I haven’t actually had a use case for it since writing it, but I’ll report back when I do.
I ported redo to rust. I don’t really have a use case for this if I’m honest, I was just curious if it would work, and to the best of my ability to tell it worked great. I’ve always liked the idea of redo, but didn’t really want to use a codebase that hadn’t been maintained in 7 years and only ran on Python 2. Turns out, I still don’t want to use redo even after that, but I’ve given it significant consideration. I’m somewhat tempted to ask Claude to rebuild the build system for this website on top of redo-rs.

I’ve had a bunch of other side-experiments that ended up not going anywhere. e.g. I wrote a tool for training perfect pitch which worked pretty well but mostly lost interest in the project. I was working on a game of exploring a solar system with realistic gravity, but ended up a bit too nerd-sniped by trying to get the details working before I lost interest. These sorts of things are normal for creating software projects, but Claude has definitely accelerated the process.

A fortuitously timed case study in bugfinding

While I was writing this, cfbolz reported a bug in shrinkray without a reproducer, so here is a log of how I used Claude Code to fix it.

I pointed a Claude at the issue and asked it to diagnose it and write me some tests reproducing the problem. Its diagnosis was mostly wrong - it pointed it enough in the right direction to find the problem, but it very clearly didn’t understand what was actually going on - and its tests were mostly bad, but they did succesfully reproduce the problem, which would have been tedious for me to do by hand.

I think it was also useful to explain to it why it was wrong and made it much more obvious to me what the bug was, like a sort of advanced mode reverse rubber duck.

In the end I kept one of the two tests it wrote, after significant rewriting, and didn’t even try to get it to suggest a fix and just wrote one myself once I understood the problem.¹¹

I still count this as a win, because I really didn’t want to figure out how to reproduce this bug, and would absolutely have put off working on this without that reproducer.

I then discovered a bunch of human errors:

The branch protection rule set was not set up correctly to prevent merging when status checks fail. I discovered this because I clicked merge with failing status checks without noticing.¹²

As a result of this, the build was apparently already failing on main and I hadn’t noticed.
Also, I’d failed to push because of some history rewriting and didn’t notice, and as a result merged an earlier version of the PR.

Altogether, not my finest hour, but that’s what I get for distractedly working on an issue when I’m meant to be writing a blog post.

I then pointed Claude (using pr-manager) at the pull request and told it to make it green. Turns out, one of the problems we were seeing was the result of a part of my fix that I’d tried early on, discarded as not relevant, and left in (I’d changed a call from is_reduction to is_interesting because I thought the non-determinism of the changing test case might be the problem after Claude claimed it was. It wasn’t). It fixed that, and then discovered that the coverage job was also failing.

It then did what Claudes always do with problems: Declared it a pre-existing failure unrelated to its changes. In its defence, this was true for this particular instance, and was the right thing for it to investigate. Also the problems turned out to be my fault.

One of the previous failures on main is that a previous Claude had written a bunch of imports inside functions. Claudes fuckin’ love doing this, and it’s baffling to me how much they do it despite being repeatedly told not to, so there’s a lint that catches that. Apparently, and I’m not entirely sure why, there were still some in main, so I’d fixed those. Except apparently somehow in fixing these (and I really can’t blame Claude for this, I must have done some sort of over-eager deletion, but I’m a little baffled as to how) I’d deleted some of the tests in that file which were previously covering those lines.

At this point I reverted the file to its previous state and did what I should have done before: Asked a Claude to write an autofix script using libcst and ran ‘just format’.

At this point, I had a green branch, and I merged it. Eventually, the build completed on main, and… the autorelease failed, because I’d made the branch ruleset stricter. I needed to change the the autorelease script¹³ script to use a GitHub app so that I could add branch protection rules. Claude walked me through the manual bit, made the changes needed to the release, issued a new pull request, and once that was green¹⁴, I merged it.

The release failed again, because I’d gotten “app secret” and “private key” confused and configured the wrong one. I figured this out by pointing Claude at the error and it told me what I’d done wrong. I fixed the problem, reran the job, and finally shrinkray released.

Anyway, this story probably doesn’t convince you that I’m a very serious software engineer whose opinions on software quality you should take seriously. That’s OK. I’m mostly not trying to convince you of anything in this post, I’m just telling you what I do.

I would say in my defence that this was unusually bad for me, and that I don’t normally ship production fixes interspersed with writing a blog post and cleaning the house. But it’s certainly representative of me on a bad day. I don’t work on software correctness tools and processes because I’m natively good at making things that work, I work on them to corral the disorganised little chaos monkey that I am when left to my own devices.

I would also say that it was about 4 hours from bug report to bug fix, while I was doing multiple other things, and the project was left in a better state on multiple axes than it was at the start of this process.

And I do think in this whole scenario, Claude was clearly a net positive. It provided me with a good reproducer, acted to some degree as an external source of state about what was going on, and helped me ratchet my processes to be higher quality with the new formatter. I could 100% have done this without Claude, but I probably wouldn’t have done it today without Claude, and I definitely wouldn’t have had the patience or inclination to do the GitHub workflows wrangling and would probably have just turned off branch protections without Claude to do that bit for me.

Some general reflections on using Claude

It’s definitely changed my working habits. Sometimes for good, sometimes for bad.

One of the things I’ve noticed is that it makes my productivity much more robust to environments that disrupt my attention. This is good because I’ve started working full-time in an office again.¹⁵ I need to spend some amount of focused time in short bursts when trying to figure something or review some code, but because I’m orchestrating Claude work in conversation, I can externalise a lot more of the thinking state and this is very helpful.

Unfortunately this also means that I often feel the need to work while doing other things because it makes me so much better at multi-tasking. You saw this in the above scenario: All of that was done interspersed with writing this piece and also cleaning the house. It very much wasn’t getting my focused time, and in the beforetimes it just wouldn’t have got my time at all.

Once recently I even caught myself waking up and doing some work from bed,¹⁶ which is somewhat unprecedented for me.

I think there are a couple factors leading into this, and some of it is just getting over-excited with a new toy and is already normalising. The other is that it does feel a bit like a continuous partial attention game¹⁷ and as a result keeps drawing my attention back to it.

Some of it is also just that I’ve got a bit obsessed with my work recently. It happens. It’s mostly a good thing. It will settle down in a few months.

In general, I’m not super worried about this at the moment and expect I’ll figure out healthy habits and ways of working with it as time goes on, but I’m not there yet.

Agent etiquette

A lot of people have had very negative experiences with AI coding. Mostly I think they’ve had very negative experiences with other people’s AI coding. I’m not surprised, but it hasn’t been my experience. Part of why it’s not been my experience is that I have, for the most part, either been using this stuff reasonably carefully (and working with people I trust to do the same) or on solo projects, and another part is that I treat it more like a tool than an author. Certainly it’s written a lot of code for me, and sometimes when it’s written a lot of code for me all at once it’s even gone well, but ultimately I’m responsible for the code it produces.

I think a lot of the negative experiences people have are from coworkers that are not taking that responsibility. I do not have a high opinion of people doing this, and I think they should be held to that responsibility whether they take it or not.

A few months ago, after my latest round of calling Claude a lazy little shit for how it was behaving with regards to coverage, someone said I was being very harsh and asked if I would say these things about it if it were a person. I answered that I would be much harsher with a person that behaved like Claude did, and instead of calling it a lazy little shit, I would using words like “fundamentally untrustworthy” and “when are we going to fire them?”.

And I basically think that if you’re regularly submitting agent-authored code without vetting it yourself, you should probably be held to this standard too.

So, I think the first part of using agentic coding on a team is this: You’ve got to read the code yourself before anyone else does. You should review it at least as thoroughly as if a junior coworker had written it before you hand it off to anyone else. If you’re not doing that, you’re causing problems.

This doesn’t mean that you need to fully understand the code well enough to have written it yourself. There are bits it’s more OK to gloss over. e.g. I’m bad at reading build code. This has absolutely caused problems, and probably will continue to cause problems, but I think it’s still ended up a lot better than the alternative.

In general I think the big thing with code written by agents is that you need to decide your slop tolerance level. One-off scripts, tools for just yourself, things that will never run in production, these can be a lot sloppier than anything you definitely 100% need to work, and you should focus your time and energy more on the latter.

One consequence of this is that yes, you will spend a lot more time reviewing code. First “your own” that the agent produced, then again your coworkers’ code. I don’t think you can really skip this with the current generation of agents. Yes, you can get agents to review other agents’ code. I think it’s even a good idea to do so. I also think it doesn’t decrease the amount of work you need to spend on code review much, it just increases the quality of the end result, because it’s definitely not good enough that you can afford to skip a human check.

I think you should use it anyway

A lot of people are really put off by agentic coding because of their bad experiences with it being used badly, and I do agree that it is currently very easy to use badly and somewhat hard to use well, but I think it’s worth it and will only get more pervasive and, hopefully, better from here.

One of the things I keep finding is that agentic coding, while not (yet?) the miracle software factory that its proponents want it to be, really is transformative in a bunch of key ways. Principally:

A lot of things that you always knew would be good to do for your project but were just a bit too much of a pain in the ass are things that an agent can more or less one shot.
It offers specific workflows that would have just been somewhat magical if you’d looked at them a few years ago.

Note the absence of (3) it’s really good at writing code and you should definitely use it to write all your code. I’ve found that worthwhile, but it’s far more clearly a trade-off than the more mundane use cases even if it’s the one that everyone is super excited about.

Here are what I think of as things that are obviously worth everyone’s time today to try:

Rebase this branch onto main for me
Fix the build on this PR
Write me a script to do quality task X in this codebase (e.g. custom linting rules, custom formatting, script for parsing coverage output)
Sort out my project infra (e.g. write me a justfile for common tasks, set up github actions)
Port this code from the old deprecated method to the new supported method
Read the code and look for out of date documentation
Here’s a bug report, write me a test that reproduces it
Investigate XYZ problem

Many of these won’t work 100% of the time, but even when they fail they will probably give you a good starting point for succeeding.

I also think that they’re very good for coding tasks that you otherwise just won’t do, and that are better done badly than done well. e.g. the website port, but also for me anything to do with github actions is done only with the greatest of protests, and the fact that Claude makes it easy is a genuinely huge upgrade to my quality of life when developing.

A lot of the complaints about generative AI are that they’re taking over the most creative bits of our job. They can do, if you let them, and sometimes it’s worth letting them, but with agentic coding in particular I think they’re actually very good at removing the drudge work from our jobs. Maybe they’ll take our jobs eventually, and there are a whole bunch of things I’m worried about,¹⁸ but I don’t think we fix that by ignoring the genuine improvements that are here today and hoping they’ll go away.

Some of the build infrastructure is definitely closer to slop, but I think that’s OK.
↩︎
Mostly by arguing a lot.
↩︎
Yes, yes, except the Python thing. We know the Python thing is weird. But also the Python thing is 0% Claude’s fault, that was entirely on me, and I still stand by it as the right call.
↩︎
Every time a Claude says it’s being pragmatic, you know shenanigans are about to occur.
↩︎
If you’re curious what this means, here’s the protocol reference. But the short answer is that Hegel multiplexes many logical connections across a single actual transport layer, so as to allow lots of really cheap short-lived connections and easily support concurrency. It also has a single central stream which you’re supposed to use sparingly mostly for messages that let you know when a new test has started. That is not how Claude set it up.
↩︎
A little too much perhaps. I don’t know how we’re going to get anything nearly as good in the other languages.
↩︎
I have read all the macro code, and convinced myself that it looks like it’s doing the right thing, but I certainly couldn’t reproduce it myself without a lot of work. This is a crucial detail of “human review everything” that we’re still finding the right balance on: It doesn’t mean that you necessarily understood all the code. You could require that, but I’m not sure it would be the right trade off. It’s certainly caused us some problems that we don’t though. Especially in hegel-go, as I’m much worse at Go than Rust.
↩︎
Or, at least, that’s the ideal. There is always a temptation to LGTM on tests. I try to resist it. Liam probably does a better job of resisting it than I do.
↩︎
As an aside about this… one of the things I’m constantly surprised by when writing in languages that aren’t Python is how many things in Python I took for granted are just far better than the alternatives, especially in the testing ecosystem.

To me, “100% coverage” has the standard meaning of “Tool reports 100% branch coverage, assertions and other unreachable constructs excluded, nocov comments allowed only in extreme circumstances and subject to justification in a comment and at review time”. This is straightforward inn Python. In every other language, I’ve run into at least one of:
- “100% coverage” isn’t actually a thing, because coverage reports all sorts of lines as uncovered that can’t possibly be covered (e.g.structural elements)
- No branch coverage
- No configurable exclusions
- No exclusions at all!
Which means that basically every non-Python project I’ve wanted to do this on I’ve ended up getting Claude to write a custom scripts for parsing coverage output because the built in tooling was not good enough for asserting the invariant that I want.
↩︎
Honesty compels me to admit that I’ve received more bug reports with shrinkray since doing this, but also many of those bug reports are in code that I don’t think Claude has touched, so it’s unclear to me how much this is increased usage, possibly partly because it’s nicer software now, and how much is that shrinkray has genuinely got buggier. It was never the most reliable software in the first place.
↩︎
Which was 100% in code that I wrote, and that has been sitting happily there for well over a year without anyone triggering it, which is evidence for the “increased usage” theory.
↩︎
Part of the problem here is that GitHub’s new fancy branch protection rules are insane and make it remarkably difficult to say “Yes obviously I don’t want to be able to merge with any failing status checks” and require you to manually add each one. Further, they require you to manually add each one without providing you a list to click on. Fuck sake. Anyway probably what happened is that I set up these rules before the first PR run with the new workflows and as a result couldn’t add them at that time.
↩︎
Also, previously, written by Claude
↩︎
The first edition violated the repo’s zizmor checks.
↩︎
Antithesis is very into this. Also the London office sucks right now. It was fine when we were a third of the size that we are, but now we’ve grown. We’re moving offices in a few weeks and I expect to need this feature of it much less.
↩︎
I don’t normally allow my laptop in my bedroom, but sometimes when I can’t sleep well I watch some videos - usually slay the spire streamers - to fall asleep to, and I assume that was what happened the previous night.
↩︎
e.g. cookie clicker or, ironically, Universal Paperclips.
↩︎
e.g. I more or less buy the argument that taking away the drudge work is quite bad for juniors, though I think it’s More Complicated Than That.
↩︎

Hello

9 April 2026

Hello. Remember this place?

If you’re reading this, there’s a decent chance you’ve been reading my substack. I’ve enjoyed writing that, and I intend to continue to cross-post relevant posts over there, but I’ve also been missing having a “real” blog, and feeling increasingly bad about the lingering rotting corpse of my old blog, so I’m attempting to revitalise it.

As part of that I’ve moved off Wordpress and onto a static site generator.¹ I’ve also imported all my substack posts into here. Hopefully all of those will work. Let me know if you run into any problems.

One of the big motivators is that I want to do more writing about technical subjects again. I’ve been doing a bit of that over on my notebook but the notebook feels off for anything I want to be taken seriously, because the whole point of things that are on the notebook is that they oughtn’t be taken seriously. They feel even more off on the substack, which has never really been about that. I could start a new blog, but there’s one sitting right here which has historically been for exactly that.

So, here we are, back at the beginning - me writing posts, sometimes about software development, on drmaciver.com.

That a Claude wrote for me. Thanks, that Claude.
↩︎

On Mediocrity

24 February 2026

This post was originally published at https://drmaciver.substack.com/p/on-mediocrity.

I’m in the middle of a very long draft post,¹ and after I’d already written a very large number of words I started going off on a digression explaining the way I use a particular word and how it differs slightly from the normal meaning. This seemed like a sign that that bit should be its own post, so here it is as its own post.

The word is, as you can probably guess from the title, “mediocre”.

Used normally, “mediocre” means something like “only adequate”, which is compatible with my meaning, but for me it includes overtones of “and could never hope to be better without fundamentally changing its nature”.

The easiest way you can tell I mean something slightly different from the normal meaning is that English is a tonal language, and when I say this word it is usually pronounced somewhere between tinged and dripping with contempt. In my idiolect,² mediocre is a much harsher judgement than “terrible”.

This doesn’t, exactly, mean that mediocre things are worse than terrible things. A mediocre meal is edible even when a terrible meal might not be. But it is possible to produce a terrible meal and not have me judge you for it, while I will always judge a mediocre one at least a bit.³

(This is not, you understand, to say that I do not ever produce mediocre things. I absolutely do, somewhat regularly. And then I judge myself, at least slightly, for doing so)

Let me illustrate with two examples of adequate meals that I somewhat regularly cook. Both are “low energy” meals - they don’t require much in the way of time, energy, or planning, and they fulfil the requirements of actually having something resembling a real meal:

The first is: Rice, stir fry greens⁴ from the freezer, a Thai omelette.

The second is: Leon’s frozen waffle fries, Leon’s frozen gluten-free chicken, frozen green beans cooked in the microwave.

Both of these are adequate meals. The second is mediocre, the first is not. What makes the difference?

It’s not effort or time. The first is a little more effort, and cooking rice takes longer than cooking waffle fries, but not much, and that effort and time can be reduced without the meal becoming mediocre. e.g. the stir fry greens I use microwave just fine, and I often have leftover rice in the fridge that can be reheated, at which point this meal becomes almost trivial. There’s a decent chance that if I’m in low effort mode I’ll fuck up the Thai omelette, but a Thai omelette that you’ve slightly fucked up cooking is still delicious.⁵

It’s also not quality. I made a variation of the rice one the other day where I swapped out the Thai omelette for Tofoo sriracha tofu⁶ fried in Schezuan pepper oil.⁷ The results were, if I’m completely honest, kinda bad. I hadn’t used this Schezuan pepper oil before and greatly misjudged its flavour profile, and the result was too sour and made your mouth go unpleasantly instead of pleasantly numb. If you asked me to compare this and the fries-and-chicken meal, I’d have to in full honesty admit that the latter was tastier. But I still wouldn’t judge this meal as mediocre, it was just kinda bad. Perfectly edible - I still ate the leftovers for lunch the next day - but I would never choose to cook it that way again.

Fries, chicken, and green beans is, in contrast, honestly a pretty OK meal. It’s filling, reasonably tasty, contains a somewhat adequate amount of nutrition.⁸ I wouldn’t evangelise it as a good choice, but I could do a lot worse. Fries aren’t obviously worse than rice, breaded chicken isn’t obviously worse than a Thai omelette. It is still, however, extremely mediocre.

What’s the difference?

Well, if I’m honest, some of it is vibes. This is a bit like the real vs fake thing all over again, and there are parts of this distinction that I can’t quite defend. There are a couple quite specific concrete things I can point to about it though.

The first is that the the rice dish is just more interesting. It has more variety of flavour profile built into it, and permits more variation around that. I will often season it differently - sometimes adding pickled chinese vegetables, or chilli crisp, usually adding some subset of soy sauce, rice vinegar, and sesame oil, etc. It provides more of a base of variation, and is an easy target of creativity.

Chips and chicken in contrast, I will serve with both types of condiments, ketchup and mayonnaise.⁹

This, I think, is the basis of the second and more important part of the distinction: What happens if I put more effort into the dish?

The rice dish, I can very easy turn into a pretty respectable meal. If I add a second vegetable dish to it, it’s starting to look like an actually good meal. I can turn the Thai omelette into a side dish accompanying a main. The stir fry vegetables are easily doctored up with the addition of e.g. extra ginger and some spices. I could do a coconut rice instead of a plain one, or swap it out for a more interesting rice than a basic Thai jasmine white rice.¹⁰ There are a million directions to go in to make the dish better, many of them easily accessible.

I do not, to be clear, that often go very far in those directions. It’s common for me to do maybe one improvement, typically in the form of some sort of accompanying condiment, slight improvement to the vegetables, or replacing the omelette with something equally easy. But even at its most basic version of this dish, the possibilities are there.

In contrast, the fries and chicken… where do I go from there? I can put more effort into getting good results, e.g. by putting some fat and salt on the waffle fries and making sure they’re well separated on the pan before cooking them, but the fundamental character of the dish is pretty stable and putting in more work doesn’t really produce a better dish, just a better execution of an adequate dish.

Don’t get me wrong. It’s possible to do very good fries and chicken. I’m not sure I can make my own waffle fries easily enough to be worth it, but I can certainly do something in the space of crispy potatoes with breaded chicken¹¹ and a vegetable side dish and get a very good result. This is a fairly big step change in the recipe though.

Sure, I can make my own fries, but that’s a hell of a lot more work than putting some waffle fries in the oven. Breading and frying my own chicken is even more work. Sometimes that’s work worth doing! I like fried chicken, and I like putting work into meals. It is not, however, at all contiguous with the lazy meal I started with, it’s its own separate thing with a passing resemblance to it.

This is, I think, the core thing I am pointing to when I describe something as “mediocre”.

With the basic rice meal, I am operating in a process where I can decide how good a dish I want to make, and then make it to that standard. I choose to make an adequate dish, because that is currently the right trade off for me - I have limited time, energy, and ingredients to hand, and limited motivation to do something great, so I’ve chosen a level of quality appropriate to that and executed on that.

Frozen fries and chicken, in contrast, I’ve rather locked myself into the quality level as soon as the process has been chosen. If I wanted to produce a good meal, I wouldn’t put in a bit more effort, I would choose an entirely different plan.

Mediocrity is not just about quality, it is about acting in such a way that you can be responsive to level of quality the situation demands. If you produce something merely adequate because you decided adequacy was sufficient, or because you failed in your attempt to do better, that’s one thing. If you behave in such a way that you can cannot hope to be better than adequate, that’s mediocre.

My running example makes it sound like this is about premade goods, which do indeed tend to lock you into a plan, and probably are the easiest way to produce mediocre results, but I don’t think this is essential. There are, for example, plenty of mediocre writers - people who produce slop, even without the assistance of an LLM.

Many of these people are writing every word from scratch themselves, but they are doing so to make quota, without putting much thought into it, or in response to whatever will get them clicks.

Sometimes this produces mediocrity because they are operating in an environment where they have to churn something out too fast for thought - I can just about produce good work daily for a few months, but I pretty rapidly lose the plot after that, and if I had to produce five new pieces a day you’d pretty rapidly find what my version of slop looks like - but I think often it runs deeper than that. The reason the process is not sensitive to the needs of quality is that the person executing it is not.

This is a very easy place to end up if you are just naively following incentives. The short-term rewards for “good enough” are not actually much less than the short-term rewards for “actually good”, and it takes a lot less effort. If you don’t develop your own aesthetic sense that you can follow independently of external reward signals, you will very rapidly converge into the aesthetic equivalent of frozen chicken - broadly palatable, but incapable of becoming better without fundamentally changing your character.

This might be what you want. That’s allowed. We cannot choose excellence in all things. I, for example, have a fairly mediocre dress sense. I don’t dress badly, but I don’t dress well, and I would need to profoundly change my approach to clothing in order to do better.

I’m also mediocre at editing. It’s probably my greatest weakness as a writer - shortly after writing this sentence I’m going to finish this piece and click publish on this piece, probably without so much as rereading what I’ve written so far. Unlike the dres sense, this is something where I’d like to do better, but it requires a genuine investment of effort and practice that I simply never manage to budget the time for.¹²

I do, for the record, judge myself slightly for both these things. The editing more than the dress sense, but I do consider both to be flaws. That’s OK, human beings are flawed, but it doesn’t make the flaws not problems.

More importantly, it doesn’t mean we get to ignore our flaws.

It is, more or less, OK to choose mediocrity, particularly in specifics domain where there’s no compelling reason that you need to be good at. But if you are mediocre at something, let it be because you’ve chosen to be, not because you’ve unconsciously slid into that state.

More, if you’re mediocre at something that I should be able to depend on you being actually good at, I’m probably going to be furious at you in a way that I will not not be if you are just bad at it, because to me mediocrity means something worse than failing: It means you don’t care to do good work, and I can forgive almost any failure but that.

This is unusual for me and when I do it often a sign that the post will never get published. Sorry.
↩︎
“the speech habits peculiar to a particular person” - like a dialect but with one person. Part of my idiolect is that I use the word “idiolect”.
↩︎
But, to be clear, usually well below the cheeseburger threshold.
↩︎
Strong Roots stir fry greens are unironically very good.
↩︎
Thai Omelette recipe: Take some eggs and some fish sauce. Beat them together. Heat a pan containing more oil than seems reasonable. When it’s hot, pour the mix in. Let cook for maybe a minute, then flip. Cook for another minute or so, then decant to a plate with a paper towel to absorb some of the oil, and maybe pat some of the oil off the top.

It is very easy to break the omelette in the flipping stage and it is objectively worse texture and composition if you do that, but also it’s still fried eggs flavoured with fish sauce which is pretty good.
↩︎
Which it turns out is just someone squirting a bunch of sriracha in the tofu bag. Don’t particularly recommend.
↩︎
This might be starting to sound like I’m putting in actual effort, but what this means is “I looked in the fridge to figure out what I had to hand, and then I remembered I had bought some Schezuan pepper oil for a terrible experiment and thought it was actually quite nice and that I should use it for something better than sensory crimes.
↩︎
It definitely could use more vegetables. When I’m feeling well-prepared I will sometimes have a bunch of coleslaw I made previously in the fridge, and adding that to this meal is a significant quality upgrade.

The vegetables are probably part of my different judgements of these recipes. I do have part of me that feels that a meal is not a real meal unless it contains at least three different vegetables, which is part of why the coleslaw is such a quality upgrade here - beans, carrots, cabbage. Great, that’s three vegetables.

I still think it’s a mediocre meal though.
↩︎
And, in fairness, sometimes hot sauce.
↩︎
Which is, don’t get me wrong, a perfectly fine rice.
↩︎
I used to do a very nice cornflake crusted chicken. I should have another go at that.
↩︎
Partly because I find it aversive, honestly.

Also it’s worth noting that this is a list of things that I am specifically mediocre at. There are many more things I am bad at, and many more things I am only OK at for reasons that are not my being trapped in a local optimum, but those are a different category.
↩︎

The hard problem with hard problems

13 February 2026

This post was originally published at https://drmaciver.substack.com/p/the-hard-problem-with-hard-problems.

For reasons that are neither here nor there, I’ve been getting Claude Code to write a solar system simulation for me recently. It keeps getting it wrong, and I’ve kept getting incandescently angry at it for this.¹

Dave pointed out to me recently that, in Claude’s defence, gravity simulation is actually a very hard problem that very clever people struggle with. I fully agreed with this, but also don’t consider it relevant.

One of the ways you can tell it’s not relevant is that the actual gravity simulation is done in REBOUND which, as far as I know, is written entirely by humans. It was originally written by Claude, but I was also worried about it getting that right, so I replaced it, and none of the problems went away.

The actual problem that I keep getting angry about Claude with is that it keeps cheating - sneaking in shortcuts where it violates the laws of physics, sometimes because it’s too hard for it to do the right thing but usually because it made some stupid mistake it’s covering up, or writing tests that don’t actually test the problems I’m pointing out to it and then claiming that the problem is fixed. The problem is not that it’s getting it wrong, it’s that it’s not trying to get it right.

When a coding agent does things like this our running joke² on Discord is “Oh, so it can replace human engineers”.

It’s very tempting though to just stop there and go “Oh well, I guess this problem is too hard for an LLM”. Some problems genuinely are. But the thing is, although it’s being given a hard problem, it’s not the hard problem it’s stuck on. What it needs here is not to be smarter, but to be better shepherded through the basic processes of software engineering.³

This isn’t a problem we see only with LLMs, it’s a problem all over the place.

A previous organisation I worked at was… quite dysfunctional. It had many positive features, but it also was a total mess where I was often amazed that anyone ever got anything done. I used to ha-ha-only-serious joke that this was a result of its two biggest problems:

The biggest problem was that it had grown too rapidly.
The second biggest problem is that it could blame everything else it was doing wrong on the first biggest problem.

Depending on how grumpy I was on a given day, I might reverse the order of the two.

Many of that organisation’s problems were legitimately caused by their rapid growth. Keeping an organisation functional during rapid growth is legitimately a hard problem.⁴ Also many of that organisation’s problems were straightforward unforced errors caused by e.g. bad policies or inadequate adminstration,⁵ but instead of being treated as addressable they just said “Yup we sure have grown a lot crazy how nothing works properly any more”.

You see this over and over again in a lot of projects. People try to do something legitimately hard, and also fail. This is the expected result when trying to do something hard, so the hardness of the problem acts as an emotional shield against failure: You don’t have to feel bad about failing, you were just trying to do something hard.⁶

This conflates two claims:

I tried to do something hard and I failed.
I tried to do something hard and as a result I failed.

I could⁷ try to make a soufflé. This is supposedly a hard cooking task. I will likely fail at it, at least the first time. This is because it’s hard. But if I try to make it out of cabbage, the reason I failed to make a soufflé is not that soufflé is hard.

This is, obviously, a ridiculous example, but I don’t find many of the other examples I run into in practice much less ridiculous. People engaged in hard problems often over-focus on the hard problem, because that is where they expect to fail, and as a result neglect basic housekeeping and maintenance tasks. I don’t want to cite specific examples because they’re a little too pointed, but for example I’ve seen software projects which are trying to solve hard problems which have completely inadequate testing and that are constantly bogged down by bugs as a result, or where the basic foundations of the project are garbage in a way that means that you have to be hundred times as clever to achieve basic things.⁸

There are a variety of reasons why this happens and ways that it can. I don’t want to diagnose them all, because that’s not the hard problem with hard problems.

The hard problem with hard problems is noticing that the reason you’re struggling isn’t that the problem is hard.

When you notice this, you can diagnose the specific problem you’re actually running into, and fix it. Your problem will still be hard, but it will no longer be unnecessarily hard.

When you don’t notice this, you will probably fail to solve your problem, and then you’ll have an excuse for your failure which prevents you from learning from it, which is much worse than merely failing.

Truer words have never been spoken than Kelsey Piper’s “Vibecoding is like having a genius at your beck and call and also yelling at your printer”
↩︎
Started by Aria
↩︎
One of the reasons that I’ve been writing less recently is that I’ve gotten a bit obsessed with little vibe coding projects like this btw, and part of that is because it’s been interesting creating novel things, and part of it is that the process of how to get a computer to be a better software engineer is really interesting.
↩︎
Good thing I don’t have that problem any more and haven’t joined a company about to experience rapid growth.

Shit.
↩︎
Does “inadequate administration” mean that there wasn’t enough, or that the administration staff who were there were personally inadequate?

Yes.
↩︎
Probably you shouldn’t feel particularly bad about failing. Most of the time failure shouldn’t feel bad. You learned something! Good job. It’s reasonable to feel disappointed, but you shouldn’t feel guilty unless you genuinely fucked up.
↩︎
I wouldn’t, but I could.
↩︎
Which you often then feel great about achieving. There’s a real problem, particularly in programming, where by making your life much harder for no good reason you give yourself a great sense of accomplishment when achieving relatively basic tasks.

My somewhat cynical take is that this is the reason for the popularity of a lot of unusual tools and programming languages. e.g. it often seems like the popularity of Haskell is because it turns mundane tasks into a beautiful puzzle to solve.

The result of this is that it is often hard to get people to let go of ways in which they’ve made their life needlessly difficult, because a lot of what they enjoyed about the status quo was that difficulty.
↩︎

How to be less box-shaped

7 February 2026

This post was originally published at https://drmaciver.substack.com/p/how-to-be-less-box-shaped.

Hi all,

Been a while. I started a new job at Antithesis back in November and it’s been taking up most of my mental energy, and what’s left over has largely been spent on some coding projects, so there’s not been much brain space for writing recently. I’d like to fix that.

But anyway, that’s not what this post is about, this post is about C Thi Nguyen’s new book “The Score”.

First, my review of it: This book is excellent. You should go read it. I think it is very unlikely that anyone who likes this newsletter would not like this book.

If you are already familiar with Nguyen’s body of work, it will probably be about 80% familiar, but even the familiar bits are helpfully clarifying and much of the remaining 20% is genuinely interesting. There are bits of the book I expect to revisit over and over again.

This is a book about scoring systems. A scoring system is a way of producing a number which tells you which of two things is better.¹ It tells you what, in this particular situation, you are supposed to care about.

The basic question of this book is as follows: There are two common ways we use scoring systems: The first is in games. The second is in metrics (e.g. performance metrics at your job). In games, scoring systems are fun. In performance metrics, they are soul crushing. Why is that?

I must admit, I got to the end of the book, and I still don’t feel like I 100% understand the answer to this question, but I am definitely more productively confused about it.

Here is one part of Nguyen’s answer that stands out to me the most: It’s about convergence. Through their widespread and permanent deployment, metrics have the property of making everything the same. If everyone is trying to produce the perfect result across some metric, you end up with a sort of bland sameness of every result being close to identical. Games, in contrast, by being things that you can take up, put down, and shift between, instead create a world in which everything is different, because you can always shift between different scores.

I think this is not always the case. In many professional games and sports, you in fact see exactly this sort of convergence. It’s also telling that that often produces very boring play, much like the theory would predict.

This is not Nguyen’s only conclusion though, and I intend to reread bits of it in a more targeted manner to understand them better.

For now though, I shall present you with a grab bag of thoughts in which I randomly talk about things more or less about or inspired by the book. It’s somewhere between a summary of key bits of it and my own riffs on some of the ideas.

Reading books

Funnily, reading this book comes right off the back of a recent conversation with Lucy. We were talking about the terrible books we’d been reading recently, and at some point in the conversation I said “You know… Maybe we should read… good books?”

Anyway apparently manifesting is real now², because without making any particularly deliberate change of behaviour other than declaring my intention, the next two books I read were good. This is the second of those and it is, as mentioned, very good.³

Prior to this conversation, I’d been in a bit of a reading slump. I’d been reading a lot, don’t get me wrong, but it was mostly books with titles like “The Shadow of Mars” or “Undying Immortal System” or “Chaotic Craftsman Worships the Cube”.⁴

This might come as some surprise to you, as everyone seems to think I’m absurdly well read, but this happens to me pretty regularly. I go through periods of months or even years where my reading drops off almost entirely. This is… not fine exactly, but it’s just a problem to be solved, not a character flaw, and I know of a couple of ways to solve it.

The first is to do what I’m doing now: Read good books. Except… “good” isn’t quite right. It is often the case that I have many good books on my shelves, and absolutely none of them appeal. This is because what’s needed is not just for the book to be good, but also for the book to be good in a way that I actually want right now. This is often quite eclectic, and prone to changing at the drop of a hat.⁵

This is fine, as long as I know what I actually want right now, but I often don’t. Sometimes that’s depression, sometimes the rats in my brain are just yelling YEARNING but can’t tell me for what.

Fortunately, there’s an easy solution to this: Read more until I find out what it is I want. But, given that I don’t currently want to read, how do I do that?

Well, it’s easy: First I make up a number, and then I set myself a goal of making that number go up. I create a score.

That number can be anything, but the two obvious choices are number of books read and time spent reading.⁶ Making those numbers go up requires me to read more books, so I do.

This is an example of what Nguyen calls a Suitsian Game, or what Bernard Suits just calls a game: A voluntary attempt to overcome an imaginary obstacle.⁷ There is no particular extrinsic reason to want that number to go up. It’s just a number. But by taking it on as a temporary goal, I have chosen to play a game with myself.

Crucially, by making it into a game, I have separated my goal (make the number go up), from my purpose - read more books.

Except of course, “read more books” isn’t quite right. As we’ve established, I’m reading plenty during my off periods, just not the sorts of books that I actually want. If I just wanted to read more, I’d just trawl through Royal Road some more and read things like An Infinite Recursion of Time⁸. What I want isn’t exactly to read more, but to have my life enriched by reading. And no number can easily capture that.

This isn’t necessarily a problem, as long as I relate to the number the right way.

The problem is there are two ways I can relate to the number.

The first is that I can treat it as what Agnes Callard calls a proleptic reason to do things.⁹ I am not making the number go up because I care deeply about the number, it is serving as a stand-in for my true harder to grasp goal. At some point I get to let go of the number and let me newly acquired (or, in this case, reawakened) value drive me to do the thing that the number was merely a stepping stone towards. At some point the score becomes unnecessary and I can let go of it because I’m reading on my own.

To use, instead, Nguyen’s distinction, there is a separation between my goal (make number go up) and my purpose (have my life enriched by reading). The advantage of treating this score as a game is that the goal is disposable, and I can easily stop playing the game when it no longer service my purpose.

The other way I can treat these numbers is less good for me. Instead of a game, I can treat the numbers as a measure of how well I am doing. A metric. If I do this, I risk falling afoul of what he calls value capture: The complex value I started with (have my life enriched by reading) gets degraded into following the simple measure (e.g. number of books read, amount of time reading) that I used to track it. I change my behaviour to match the goal, even when it doesn’t serve the purpose.

This isn’t a hypothetical risk. I’ve done the “number of books read” metric a number of times in the past. It’s worked very well for getting me out of reading slumps. I strongly recommend it as a method. But at some point the process I always start to notice that I’m deliberately picking shorter, easier, books to read because if I pick a long and difficult book then I will make it much harder to keep the number going up at the rate I want it to. Usually that’s a sign that it’s time to stop tracking the number.¹⁰ I’ve not actually tried the reading time one, but there the failure mode would be spending all my time reading trash, which is if anything even worse.

A healthy relationship to scores requires this willingness to understand the difference between your purpose and your goal, to playfully pick up and put down goals as they are and aren’t working for you, and to hold these goals lightly.

This is easy when those scores are created by you - as long as you are capable of noticing that this is happening, which is a lot of what this book exists to help you do. It’s significantly harder when those scores are imposed on you by other people.

A short history of the legibility war

There’s a section of the book where I kept thinking “Hmm the logical person to bring in here would be…” and then the next chapter he brings them in. My initial reaction to this section was “Fuck, I wish I’d written this”. On further reflection, I don’t think that’s right. I’m glad he wrote it and I didn’t, because it’s also not the thing I would have written linking this work, even if I had ever got around to it because I last wrote notes on this subject a good seven years ago and never revisited it properly. If I ever do get around to revisiting this topic properly, I’ll get to draw on Nguyen’s work on this as well.

The works in question are “Seeing Like a State” by James C. Scott, “Sorting things out: Classification and its consequences” by Bowker and Star, and “Epistemic Injustice” by Miranda Fricker,¹¹ and the broad theme here is something like… the way that we simplify the world in a way that erases individual and between-group variations, and who this affects and who it doesn’t.

I got extremely into this as an issue a while back. I still think it’s important but it’s on my long list of things that I never quite manage to articulate well enough to write about and never quite treat as important to prioritise figuring out how to do that.

If I try now I will probably fail, so I’m going to stop this section here. This is just a placeholder to note that there is something important here, and I’d like to owe you an essay about it, but my track record does not suggest I will ever write that essay. I might write you some more targeted ones though.

The four bargains

A recurring theme throughout the book is that there are four bargains that you can make with the world, in which you sacrifice something that matters to you for power.

Each of these bargains offers you something and takes something from you in exchange. Together they offer you great power, as well as great cost.

Nguyen does not call them the four great bargains. He calls them something else. More on that in a moment, but first, I intend to tell you what these bargains are. I’ll try to follow Nguyen’s framing as much as possible, but these are all in my own words, and any errors introduced are mine and not intended.

The first bargain is Mechanical Rules (Rules for short).

Mechanical Rules gives us clear procedures that everybody can follow in the same way. This make policies consistent and universal. It lets us replicate the way we make decisions from one location to another. This forms the basis of being able to conduct large scale civilisation, because it allows you to make things that work in a context free way. If everything has to be handled differently and by an expert, you will struggle to ever have enough of it.

The cost of mechanical rules is that you no longer get to handle each case differently. If the system is to work without expert judgement, you need to remove expert judgement from the system. You can’t make exceptions based on discretion or circumstance. If you do, you lose the power that rules granted to you.

Rules offers us accessibility, and asks us to sacrifice adaptability.

The second bargain is Replaceable Parts (Parts for short).

Replaceable Parts asks us to make everything interchangeable. One screw is much the same as another, one factory worker is much the same as another, one bag of sugar is much the same as another. As long as we do, we will be able to produce consistent results over and over again.

The downside is that we can no longer benefit from individual variation. You will never end up in a situation where what you have determines what you make, because what you have is always the same. That one bag of sugar that was interchangeable? It wasn’t, until you made it so. Brown sugar is highly variable in flavour - even between allegedly the same types¹², and if you are using it then you need to adapt to that variability and get to benefit from its specificity.

Those workers you made interchangeable by giving them assigned roles and Mechanical Rules to follow actually aren’t - they have different strengths and weaknesses, and you can always achieve better results by playing to those strengths and weaknesses, but those results will change as your workers do.

Parts is the bargain that gives us reliability, and asks us to sacrifice specificity.

The third bargain is Centralised Control (Control for short).

Control asks us to make decisions in an organized way, from a central location. This lets us coordinate actions across many contexts. To do this we compress information into legible forms, reducing it to the key features that matter, and then make sensible, rational, decisions, on the basis of that information. This lets us coordinate vast numbers of people and resources into a coherent direction and plan. As long as those people actually follow the plan.

Control offers us coordination, and asks us to sacrifice autonomy.

The fourth bargain is Scale, and it subordinates the other three. It says, if you do these three things, vast powers will be available to you, because what you can do once you can do many times. You can run nations, corporations, and do things far larger than anything you can manage as an individual. All it will cost you is your ability to handle the small things in more suitable ways. One size fits all, and if it doesn’t, that’s too bad for the all that it doesn’t fit.

Scale offers us portability, and asks us to sacrifice context.

Each of these bargains is a foundation of the modern world. We could not have achieved what we have achieved without them. They are the technology upon which our civilisation rests.

Nguyen is fully on board with this, and wants to be clear about that. James C Scott seemed to want to throw it all in and move to small scale anarchy¹³, but Nguyen in contrast is fully on board with the benefits of modern civilisation and has an interesting, nuanced, account of the trade-offs involved with each of them.

Which is why it is so annoying that he actually calls these four bargains “The Four Horsemen of Bureaucracy”.

Contrasting recipes

The following two paragraphs are one of my favourite small bits of the book:

What’s the antidote? I have a book on my shelf that exemplifies the opposite tendency. It is a simple book, called Julia and Jacques Cooking at Home. They take you through a lot of basic recipes: how to make a good omelet or sauté some fish. And for every dish, they give you two completely different recipes: Julia Child’s and Jacques Pepin’s. And next to the recipes are sidebars in which they bicker with each other. They explain why their version of the recipe is the way it is, what decision they made to get which effects—and why the other person’s recipe misses the mark. The book is formatted this way because it is the companion piece to an old PBS TV show. It is the record of an argument—a rowdy conversation between friends.

And the effect on me, as I learned to cook from this book, was to undermine the sense that there is a single, correct way to cook. Instead, it revealed every cooking act as a set of decisions through a network of legitimate but different alternatives. You want your scrambled eggs to be fluffier? Use higher heat and faster motion. You want them to be like sweet pudding? Turn the heat down low, add more butter, and stir it for a long, slow time. And once you learn those two endpoints of cooking scrambled eggs, you will know how to improvise in between. The book uses mechanical recipes to communicate—but by pairing different recipes and introducing dissent, it frames them differently. It undermines the monolithic authority of the classic cookbook, offering a landscape of variation, of different choices you can make, guided by different tastes. It is an unsettled cookbook. The paired recipes appear not as the Right and Official way to do things, but as points on a wide spectrum. The book uses mechanical recipes to create space for your culinary agency.

I’ve ordered a copy of the book. I suspect I won’t actually like the recipes very much - My dietary requirements are almost completely incompatible with French cooking - but I love the idea.

In Learning to walk through walls, which is one of the big places I’ve previously drawn on Nguyen’s work, I talked about two competing moves:

Noticing you’re playing a game that you don’t need to be, and stepping out of the rule set.
Building restricted games to help you navigate a more complex skill by turning it into a serious of discrete moves.

Recipes serve a similar function. Cooking is a running theme throughout this book - which I of course really appreciated - and one of the points that he makes several times is how these mechanical recipes act as a great starting point for cooking, but you do need to be able to break out of them, and many people don’t even start from them, in contrast to a more traditional approach to cooking:

I asked my mom to teach me my very favorite Vietnamese dish: hot and sour catfish soup. So she did—or she tried to. What she gave me wasn’t anything I could follow; it was nothing like a recipe at all. It seemed to me, at the time, like this vast and disorganized ramble, a weird organic messy flowchart of possibilities and decisions and judgment calls. I was supposed to add tomato and pineapple but I was supposed to taste the ingredients first. If one was sweet and the other sour, I was probably fine. But if they were both particularly sweet, I would need to balance them with some extra vinegar. Or if they were both sour, I might need to add a little brown sugar. My mom wouldn’t ever tell me how much; it all depended on how things were tasting that day.

This is a level and type of cooking skill that I envy, truth be told. It’s not mine though, and I’m not going to try to acquire it. As with many things one envy’s, what I want is the upsides without the cost.

I am an improvisational cook, for sure, and can adapt a recipe on the fly, but what I cannot do that is adapt a recipe to achieve a consistent result like this. It’s an impressive level of dedication to a single dish, and one that comes from a lifetime of skilled practice and you cannot, I think, easily shortcut that practice, and while the skill implied is fascinating to me it’s also not one where I’m prepared to do the legwork.

The reason we start with recipes is that they are easy. Mechanical Rules grants us the promise that anyone with the basic prerequisite skills can follow the instructions and get the desired result. When you’re a manager, you want that because you want your workers to be fungible. But when you’re an anyone, you want that because you want to be able to follow the rules and get the desired result, and it’s pretty great that you can do that.

The Julia and Jacques approach that Nguyen is pointing to here is that two recipes is a lot better than one, because it’s also the entire continuous spectrum of recipes in between those two, and this helps you notice that in fact there are no barriers around you that you cannot walk through. It won’t teach you to the depth of experience found in his mother’s approach to hot and sour catfish soup, but it will give you the foundation you need to move as far in that direction as you are prepared

Philosophy and self-help

When I first found about this book, my initial reaction was to go ugh. I hate the title, the subtitle, and the cover. It looks like the worst sort of airport book self-help trash. If I did not already have an extreme amount of faith in the author and his abilities as a thinker and a writer, I would not have bought this book without a resounding endorsement from someone I trusted.

Fortunately I did have that faith, and took the corresponding leap and the book is, as I mentioned, excellent.

But… the impression of the book as a self-help book isn’t entirely wrong. This is not a book about how to fix the world. This is a book about how to navigate a world that is trying to squeeze you into a tidy set of boxes where you can be one among a set of Replaceable Parts following Mechanical Rules under Central Control to produce a society that works at Scale. This is good for you only to the degree that you are box-shaped, and the book is trying to help you push back against that tendency.

And… it’s only OK at that, because although it is written as if it were a self help book, it is really a philosophy book, albeit one written in an engaging and accessible manner for a general audience, and philosophy is much better at helping you understand the problem than it is at providing you a solution.

Nguyen’s proposed solution is games: By learning to play with rules, we can start to add play and flexibility and personal meaning back into a world that has tried to force us into simple and easy to understand shapes.

To which I say… well, maybe.

I think this is a better solution for Nguyen than it is for me, because Nguyen is delighted by games in a way that I simply am not. I like games, don’t get me wrong. I think they’re fun, they offer an interesting lens on the world, and they are provide a great way to engage with and socialise with others.

A lot of Nguyen’s work starts with Bernard Suits’s ideas around games, which are roughly:

A game is a voluntary attempt to overcome an unnecessary obstacle.¹⁴
14
This is the “portable definition”, but the distinction between it and the more formal definition doesn’t matter here.
In a sufficiently advanced utopia, there would be no necessary obstacles, therefore everything that we choose to spend our time doing must be a game.

The primary definition has never quite sat right with me. I think it’s an excellent lens on games, but not necessarily a good definition of one. I think, for example, you could regard proving a theorem in pure mathematics as a game under this definition, and while I don’t think it’s exactly wrong to do so (this is, after all, the formalist position in philosophy of mathematics), it feels to me at least like you are missing a lot of the richness of the mathematical experience in reducing it to that.

I think a weaker form of Nguyen’s thesis - that games help us to relearn play in a world so steeped in rules as part of its essential nature - is clearly true, and I am glad of them for that (and for their own sake), but as a solution, or even the starting point of one, it leaves me unsatisfied.

Fortunately, for me, I think the book itself is a better solution than the solution it presents is. Sometimes the thing you need to solve the problem is not a solution, but just to be able to see and describe it clearly. To acquire the tools of interpretation that let you look at the world and say “Ah, I see what is going on”, and know what it is that you want about it to be different.

This book has certainly helped me with that, and I intend to revisit it, probably several times, because my first read I basically wolfed down the contents, and it deserves a more thorough chewing on.

In review

As I said at the beginning, this book is excellent. You should go read it.

Part of why I say that is a sort of altruism. I like Nguyen, and want his work to do well, and while you are, by default, an anonymous reader, I at the very least want the best for you in that you are a person and I wish you well as a result of it, and I think this book would be good for you.

But partly I say this because, for my own sake, I would really like there to be a broader understanding of these sorts of issues. Nguyen talks about one of the costs of Scale being in the form of Fricker’s notion of hermeneutical injustice - people lack the tools to understand and communicate their experience, because they must communicate in the sanitised language of legibility - but another form of hermeneutical injustice is that people lack the tools to understand what is happening in the world, and how it is shaping them. This book is one part of that tool set.

We’ll never have a world where everyone understands this. It is the nature of tools like this that they decay at scale. But if we, as individuals, understand this, then we can build communities that do too, and even if we can’t - and don’t really even want to - push back on the broader legibilising forces of the world, we can at least understand them enough to build shelter from them.

“A scoring system is a social process that delivers a quantified evaluation, and so enters a singular verdict into some official record.” according to Nguyen.
↩︎
TBF I already mostly believed this, the article is just timely.
↩︎
The other was “Sanity and Sainthood” by Tucker Peck. Maybe more about that another time, but here’s Sasha’s review if you’re curious.
↩︎
I do actually recommend all of these. Maybe not “The Shadow of Mars”. It’s the latest in a long long series and I like about half the books in it but it’s way too military sci-fi and I’m mostly here for the wizards in space.
↩︎
One problem I often have is that I end up drawing a through-line through a series of books where I’m reading obsessively for a topic for about five or six books in a row and have several more queued up and then suddenly have had enough of the whole thing and don’t want to read any more of it. There’s a lot of debris on my shelves resulting from this process.
↩︎
Number of pages read is also an option, but it’s mostly pretty well correlated with time spent reading, and it’s more annoying to track, so I’ve never tried that.
↩︎
This concept has come up here a number of times in e.g. Learning to walk through walls. I’ve never actually written a really good introduction to the concept, or seen anyone else do so in less than book length, but hopefully you get the idea.
↩︎
This one I don’t recommend.
↩︎
Nguyen doesn’t specifically mention Callard in The Score, but I was introduced to her by his previous book “Games: Agency as Art”.
↩︎
Not always! This is actually a feature, not a bug, early on in the process, because if you’re in a reading slump then reading a bunch of easy to read books is actually a good place to be. It just can’t continue indefinitely.
↩︎
WELL ACTUALLY, the right thing to be reading here is “A Cautionary Tale: On Limiting Epistemic Oppression” by Kristie Dotson, who Nguyen doesn’t cite, but the problem that he talks about is much more in line with her concept of “contributory injustice” than Fricker’s “hermeneutical injustice”, because it is about hermeneutical resources that exist in the relevant communities but are not taken up by the majority.
↩︎
Unless, of course, you made that brown sugar by first refining out the white sugar and the molasses, adjusting your molasses until it achieved the desired reliability and consistency, and then added it back in.
↩︎
Especially the sort of small scale anarchy where he got to keep being a professor at Yale.
↩︎
This is the “portable definition”, but the distinction between it and the more formal definition doesn’t matter here.
↩︎

Making connections

3 November 2025

This post was originally published at https://drmaciver.substack.com/p/making-connections.

I sometimes think I only have two moves in writing: “These two things you thought were different are actually the same thing” and “These two things you thought were the same are actually different”.¹ This isn’t quite true, e.g. there’s also “Look at this thing, isn’t it neat!”,² but they’re fairly major parts of what I write.

1

These two moves are actually the same thing, because they’re both just special cases of the more general skill you might call “practical ontology”, or perhaps “personal construct psychology” if you’re George Kelly - finding the right concepts with which to carve up the world in order to better engage with it.

But they’re actually different things, because they point you in very different directions. “Actually the same thing” is mostly about drawing analogies between different situations in order to help you understand each better, while “these two things you thought are the same are actually different” is more about causing you to look closer at the specific situation and see whether you are applying an inappropriate strategy that you’ve learned in a context which is more different than you think.

These two observations about these two moves are actually the same observation.

Anyway, this is a “these two things are actually the same thing” post.

One evening recently, I was visiting two friends of mine and we were talking about memory. I said that I don’t have a very good memory for facts, but I have a very good associative memory. I remember patterns, and how things fit together, and can often recall relevant information and methods that I’ve seen before that are appropriate to the situation.

My two friends agreed that their memory was also similar, and one of them said (paraphrasing from memory) “I think maybe being able to make that sort of connection is just what intelligence looks like”.

That stuck with me, because I think it’s very right, and it points to one of the biggest comparative differences in what you are good at when you are smarter than the others around you. If you are smart, you are good at making connections between things. That’s a large part of what being smart means.

I’m aware it’s gauche to self-describe as smart, but I don’t think saying “I’m good at [things that are classically associated with intelligence]” is any better even if it avoids saying the word, and I’m on the record as thinking it’s important to acknowledge and talk about intelligence.

You’re probably smart too.³ I say this not to flatter you, but because I think my stuff is primarily interesting to people who have the same sorts of problems I do, which means that my target demographic is mostly my fellow smart nerds who are somehow still not very good at life. (Sorry. I say it with love). As a result, you’re probably good at this too, and it’s worth knowing how to use it well.

Anyway, I recently introduced two… let’s say friendly acquaintances, in that I know and like them but don’t know or interact with them enough for the label “friend” to really apply, to each other. I was catching up with one of them about what he was doing, and when described it, the way he described it made me think of this other person I knew, and her research into a very similar area. I mentioned this, he was interested, and now I’ve introduced the two of them.

I don’t know if anything will come of it - I’d give like… maybe 2% it results in an interesting collaboration, and say 0.1% that it results in a really important project for one or both of them, and most of the rest of the time they just have an interesting conversation and never talk again, or maybe casually stay in touch.

Which is to say… it probably won’t achieve anything, but for the amount of effort required from everyone involved, this is a ridiculously good deal. It required almost no effort from me, the cost to them is one conversation that they’ll probably enjoy, and although the probability of anything major resulting from it is pretty low, it’s not that low, and the upside is pretty huge if it pays off.

I have some abandoned writing on luck that I never finished off, but one of the results from Richard Wiseman’s study on luck, and what makes people lucky, is that the biggest difference is whether you notice and take opportunities.

The sort of opportunity that, in my opinion, matters the most is these sorts of relatively low effort things that there’s negligible cost to trying and a reasonable choice of it resulting in something big. My post about looking for new projects last year was an example of this - relatively easy to do, and it resulted in some much more interesting work than I’d have found if I went looking for contracts and projects directly.⁴ I’ve previously used tweets as another example of this - it was, once upon a time, an incredibly low effort way of creating potential opportunities.⁵

But anyway, this isn’t a post about how to be lucky, this is a post about how to make the people around you luckier: Make connections between them.

A classic sociology (which may even be true! Certainly it feels true) result is the strength of weak ties: The social connections that are most useful to you are the ones that aren’t that close, because if you know someone super well then you probably already have access to many of the same connections and opportunities as they do. I’m not sure if this is part of the original observation, but there are also just more of them.⁶

As a result, these sorts of opportunities for introductions tend to come up with people you know less well. You’re more likely to get a job opportunity via someone you’ve not talked to in a few years, you’re more likely to meet potential partners at a party⁷ hosted by someone you’re not that close to than you are via your best friend.

Actively navigating these connections for people and making introductions along those weak ties is, thus, a huge favour you can do for the people around you to make their lives better.

This may seem like I’ve taken two unrelated senses of the phrase “making connections” and talked about them independently but, well, you see, these two things are the same thing.

The difficulty with proactively making these connections between people (as opposed to doing it in response to requests, or creating opportunities for it to happen naturally) is that you’ve got to actually figure out which people to introduce. If you just introduce two random acquaintances to each other, probably nothing very interesting will happen. Sure, they both know you and have that in common, but thinking that is enough is Geek Social Fallacy #4: Friendship is transitive.

But if you’re good at making connections between ideas, at the sort of associative pattern matching that lets you encounter something and go “Ah, that reminds me of…” and pull in some seemingly completely unrelated topic, then you can also easily be good at figuring out who should talk to whom: When you’re talking to someone, if that happens and what they say reminds you of something that relates to someone else you know, that’s an opportunity for an introduction right there.

It’s especially valuable if they’re likely to be useful to each other in some way, but that’s not even required really. If you have interesting and compatible conversations with two people about similar subjects, maybe they’d like to have those conversations with each other.

Either way, if you’re talking to someone and they remind you of someone else, ask them if they’d like an introduction. It can’t hurt to ask, and there’s a reasonable chance that doing so will pay off massively.

These two moves are actually the same thing, because they’re both just special cases of the more general skill you might call “practical ontology”, or perhaps “personal construct psychology” if you’re George Kelly - finding the right concepts with which to carve up the world in order to better engage with it.

But they’re actually different things, because they point you in very different directions. “Actually the same thing” is mostly about drawing analogies between different situations in order to help you understand each better, while “these two things you thought are the same are actually different” is more about causing you to look closer at the specific situation and see whether you are applying an inappropriate strategy that you’ve learned in a context which is more different than you think.

These two observations about these two moves are actually the same observation.
↩︎
Wouldn’t you think my collection’s complete? Wouldn’t you think I’m the girl, the girl who has everything?
↩︎
Especially if you’re a subscriber! That’s a very smart move. And if you’re a paid subscriber, you’re probably not just smart but also very interesting and extremely attractive to people of your preferred gender or genders.
↩︎
This does illustrate a common limitation, which having access to these sorts of opportunities is very dependent on the infrastructure you’ve already built up. That post would have worked far less well if I wasn’t both a reasonably well read writer on Substack and also someone with a really interesting technical background.
↩︎
Twitter still exists of course, even if some people call it X, and maybe it still works this way, but for me at least the cost of being on it got too great at the same time as the upside waned heavily.
↩︎
I suspect also these days many more ties are weak in the sense of the paper, because social connections are much more dyadic than they used to be. Whether or not we’re more atomised, I do think it’s much more common to have individual friends than groups of friends these days.
↩︎
Of course, as per the paper Go to More Parties? Social Occasions as Home to Unexpected Turning Points in Life Trajectories, parties are an excellent example of the sort of high pay off low cost bets I’m talking about.

I should go to more parties tbh.

This one has even more “maybe it’s even true” than the strength of weak ties. Alice Goffman is a somewhat controversial figure. The claim feels very plausible to me though and I enjoyed the paper.
↩︎

Reduce the need for active stabilisation

1 November 2025

This post was originally published at https://drmaciver.substack.com/p/reduce-the-need-for-active-stabilisation.

I recently visited my parents and accidentally left my Kindle behind at their house. I’ll claim it back soon, but in the meantime I’ve been reading on the Kindle app on my phone, and it’s interesting how much worse I am at it.¹

I fail at reading on my phone in pretty much the expected way: I get distracted. I read some, then I check discord, or substack, or go shop on something on Amazon, look at Feedbin, or play Sudoku… There are myriad things I can do on the phone, and when I zone out while reading I naturally reach for one of them.

In contrast, the Kindle is a single purpose device. When I zone out while reading on my Kindle, I just zone out for a bit and then either stop reading or return to reading. Very occasionally I might switch to another book, but I tend not to have multiple Kindle books on the go at once because I find it a bit irritating to do.

The laptop has the same problem as the phone. Literally right now I caught myself zoning out and reflexively swiping up to switch apps. Wouldn’t have that problem on my typewriter.

I’m not helpless in this situation. I can notice myself doing it, and suppress it. I have to do that in order to write a lot of the time - as I said, I got distracted mid sentence and tried to switch away. Instead of doing so I caught the action, inhibited it, and returned to writing. This is completely possible.

This is something that some of my early writing on magical practice and secondary anchors aims to help with.² You have a desired state you want to be in, and you set up some rules to help you stay in that state and allow you to return to it when you lose it.

I think, though, it’s better to just set things up so that you don’t lose the state in the first place, and the Kindle vs phone issue is pretty illustrative of the problem.

Reading a Kindle, or a book, if I lose my focus, my attention fairly returns to the activity. If I’m reading on my phone, it tends to settle on something else. Sometimes that activity is itself one that I lose focus on and switch away from (occasionally back to my original reading), sometimes it’s something more absorbing (usually in a bad way) and I lose the thread of what I was doing entirely.³ Either way though, the point is that moving away from the intended activity causes one to continue to stay away from it.

In contrast, when reading a book in the absence of a source of distractions, if my attention drifts the natural thing for it to drift back to is the book.⁴ The state is a stable one - it tends to correct small perturbations.

The problem puts me in mind of one I have with certain exercises. I’m moderately hypermobile⁵, and as a result certain activities are harder for me than they would be for people with normal joints. Plank is one⁶, and standing on one leg is another.

I used to attribute the difficulty standing on one leg as being a “balance issue”, and in some sense it definitionally is: I’m trying to balance and failing, therefore I have a balance issue. On the other hand, classic victim of metonymy problem. It’s not a balance issue in the sense that I’ve got some abstract generalised difficulty with staying vertical. It’s a balance issue because my joints are extra wibbly, and as a result it’s harder to keep them in place.

One way to demonstrate that the culprit is ease of movement is that I can stand one one leg just fine if I tense all the right muscles extra hard. This is, to some degree, normal. Standing on one leg requires using your muscles to stabilise. I have to use them a lot before it feels stable though, and this is exhausting.

The easier something is to move, the harder you have to work to keep it in one place.

I think the same is true with reading and writing, or anything else with this shape. Using a multifunction device like a phone or a computer gives you a lot more “freedom of movement” - you can easily switch from what you’re doing to something else, and as a result it requires active effort to avoid doing so.⁷

In contrast, single function devices lack this freedom of movement, and as a result you can free up the time and attention you would spend on staying focused. It’s easier, and it’s more pleasant.

Of course, a single function device is often not much help if my phone or laptop are nearby. If I was reading on my Kindle, or reading a physical book, or writing on my typewriter, or doing anything else that requires focus, but had my phone in my pocket, there’s a decent chance that I’d get my phone out and check it. The mere presence of the device is enough to make the situation at least moderately unstable.

I tend to solve this by blocking out devices entirely.⁸ Sometimes I do this by physically separating myself from them - leaving them in my room and going downstairs with books, pen, and paper. Sometimes I do it by lighting a candle and agreeing with myself that as long as the candle lit I will not use any devices.⁹

These aren’t states I always want to be in of course. I like the internet, and my phone and laptop are both hugely life improving devices. But their power comes with a cost, and when I don’t want to pay that cost, explicitly separating them frees up capacity for what I want to focus on.

It doesn’t always work of course. Sometimes it turns out I genuinely don’t want to read or write, or don’t want to read or write the particular thing I’m doing. Even then I’ve often found it’s interesting and useful to see where my attention goes without an easy attractor for it, and it leads to me finding other stable states that I’d otherwise have missed.

Some of this is that the app is garbage.

Who on earth thought it was a good idea to have the highlight feature easily triggered by the natural motion you make when scrolling? Because whomever they are, I want to track them down, and cover them with random splashes of yellow paint in arbitrary locations all over their body.

That’s not a kink thing, it’s just what they’ve done to all of the books I read on the Kindle app, so it seems only fair to reciprocate.
↩︎
I should really get back to doing that. I’ve somewhat lost the habit. Spells are all very well, but they only work when you use them.
↩︎
This problem is also why I wear a watch. Anyone who uses their phone as their primary means of checking time has probably had the experience of getting their phone out to check the time, getting distracted by a notification, and five minutes later putting their phone back still having no idea what the time is.
↩︎
Or, sometimes, staring blankly into space, or out the window if I’m on a train, but I tend to think that when that’s the most absorbing thing for me it’s probably a sign that that’s a good thing for me to be doing.

There’s a line from somewhere I no longer remember the origin of, that most people would be better spending more time staring at a wall instead of their phone. This seems right to me.
↩︎
In case you thought I wasn’t screaming “I have ADHD!!” loudly enough in this post already.

It’s not actually clear whether I do have ADHD, but given the number of people who would say “uh huh” to that…
↩︎
This is also harder for me because my core strength is shit.

But let’s say it’s the hypermobility that’s at fault.
↩︎
Unless you’re in full flow. If you are, great! But requiring full flow to be able to do things is a form of affect entitlement. Who am I that I should get to do only things that can absorb my full attention?

I think it’s also harder to get into flow in the first place in this state.
↩︎
I have a special case where the kindle isn’t a device, it’s just a funny shaped book.
↩︎
I’m allowed to blow the candle out at any time, I just have to make a conscious decision to do so.
↩︎

How to find things (an intro to binary search)

21 October 2025

This post was originally published at https://drmaciver.substack.com/p/how-to-find-things-an-intro-to-binary.

As probably most of you are aware, I’m actually a software developer. I don’t talk about it here much¹, as this is mostly a newsletter about the rest of my interests, but I’ve been experimenting with writing about whatever I feel like writing about, and today I feel like writing about binary search.

If you’re not a programmer, don’t worry. Most of this should be highly accessible to you anyway. There will be some code later, but it’s not the main point of this article. I just think it’s worth understanding the technique either way. If you are a programmer, this will still teach you something new, because I think you’ve probably been taught binary search wrong.

Part of why I’m thinking about binary search is that it came up in a recent issue of London Centric, in the context of bike theft. This isn’t the first time it’s come up, and it’s a story that makes programmers (and mathematicians apparently) wince in pain when they hear it.

Here’s the problem: You leave your bike somewhere at 9AM. You come back at 5PM, and discover to your horror that your bike has been stolen. Fortunately, there’s a CCTV camera. You go to the police and report the issue.

“Sorry”, they say, “that would require us to watch eight hours of CCTV footage to spot the theft. Bike theft just isn’t a high enough priority to justify that kind of effort.”

At this point you might feel outraged, but depending on the sort of person you are your outrage might take a different form. On the one hand, you might be outraged that your bike has been stolen and that you’re not getting it back. If, on the other hand, you are a Cambridge professor or otherwise similarly inclined, you will be outraged at what you see as a failure of basic common sense, whip your blackboard out of your pocket, and proceed to give the police a lecture on what you believe to be one of the fundamental rights of humankind: Replacing O(n) operations with O(log(n)) ones.

Here’s the idea: OK, sure. There are eight hours in which your bike could have been stolen. That’s a lot of footage, makes sense that you don’t want to watch that.

But… you could just check the footage for 1PM. Is the bike there? If so, great. Now you know the crime was done between 1PM and 5PM. Otherwise, you know it was done between 9AM and 1PM. Either way, that’s four hours of footage to watch. Still too long, but a lot less than eight!

Let’s say the bike is still there at 1PM. Now, you check whether the bike was there at 3PM. If it was, the crime happened between 3PM at 5PM, otherwise between 1PM and 3PM. Again, now you’ve only got two hours to check…

You can repeat this process, each time cutting the time you have to check in half. Once you’re down to a few minutes where you know the bike was there at the start and wasn’t at the end, you just watch those few minutes of footage to see the crime being committed.²

This strategy is called binary search. Search meaning, well, you’re looking for something, and binary meaning “relating to two”. It’s binary search because you’re searching by repeatedly reducing the size of the space you have to search by a factor of two.

You have to check seven times to get the time down from eight hours to under five minutes. If you exactly halve it each time (you probably won’t, but it works basically the same way if you’re close enough and pick nice round numbers), the amount of footage you have left is eight hours, then four hours, two hours, one hour, half an hour, 15 minutes, 7.5 minutes, 3.75 minutes.

If you instead had sixteen hours of footage to check, you would require one more step, because your first step would cut you down from sixteen hours to eight, and then you’d need the seven more steps as above.

This is what my crack about O(log(n)) operations means: The original strategy is linear - that is, every minute you add to the footage adds a minute³ to the time it takes you to process it.

In contrast, the binary search strategy is logarithmic. Every time you double the size of the search space, you increase the time it takes you by some fixed amount (the amount of time it takes you to check a midpoint).

For large spaces this matters a lot: If you start with eight hours, and each step takes you say 10 seconds, then it takes you a minute and ten seconds (70 seconds) to get yourself down to three minutes and 45 seconds, so the whole task takes you just under five minutes, in contrast to watching the whole eight hours. If on the other hand you started with ten minutes of footage, your binary search step cuts the space in half to five minutes, then you watch five minutes, and it only saves you just under five minutes of time watching (you could of course cut it down further - e.g. if you wanted to only watch two and a half minutes of footage, this could save you a bit over 7 minutes).

If you only had 30 seconds to watch in the first place, there’d be little point in doing any fussy midpoint finding

The point of “linear” versus “logarithmic” is not that logarithmic is faster in general, it’s that logarithmic is always going to be better for large enough problems. For small problems it might still be faster to use a linear operation.⁴

Now, if you’re a programmer there are two things you should probably have noticed about the above explanation:

It is blindingly obvious if you’ve encountered binary search even once.
It is probably not actually an example of binary search as you’ve been taught it.

The other prompt for writing this post is that I was running a calibration interview question on a colleague who I have a very high opinion of, and at some point he tried to write a slightly custom binary search to solve a problem, and got into a little bit of a mess (I think he’d have sorted it out fine when not under time pressure) and reverted to a linear search. I, meanwhile, was surprised, because I forgot that binary searches were a thing people got wrong despite myself having previously failed an interview by getting a binary search wrong. The reason is that in the meantime I have learned the correct way to think about binary search, and forgot that almost everyone teaches it wrong. So the next part of this post is my attempt to teach it to you right.

In order to get a stereotypical example of how binary search is normally taught, I asked Claude.⁵ Here’s what it said:

Binary search is an efficient algorithm for finding a target value in a sorted array or list.

How it works:

Start by examining the middle element

If the target equals the middle element, you’re done

If the target is less than the middle element, search the left half

If the target is greater than the middle element, search the right half

Repeat this process on the chosen half until you find the target or the search space is empty

I think this is a perfectly adequate example of how binary search is normally explained, but note that this is very much not what we’ve just done in the bike theft example! We don’t have a list, nothing is sorted, we’re not looking for a specific item.

And yet, they do seem quite similar. The reason for this is that they are the same thing, but Claude’s (and everyone else’s) explanation is only describing a single specific use of binary search rather than the general phenomenon, and as a result contains a lot of distracting details that make it harder to understand and also less useful.

In order to see one of the key differences, let me tell you a story.

Suppose a thief steals your bike at 12:30 PM. Riding away madly, he is suddenly struck by an epiphany: Stealing is wrong.

Guiltily, he cycles back to where he found it, and puts your bike back and locks it up with the somehow magically still intact lock (or maybe you didn’t lock it, at which point frankly this whole saga is on you), returning it to the bike rack at 1:30PM.

Now, at 3PM, some other bastard steals your bike and cycles off with it. He is in no way burdened with a sense of conscience, and keeps it for himself.

What happens now with our binary search?

Well, we check at 1PM, and the bike’s not there. Therefore it’s stolen before 1PM. Now we check at 11 AM, bike’s there, etc. until we eventually find our bike thief at 12:30. Great! We now have a bike theft to prosecute.⁶

As binary search is normally taught, it works on sorted arrays. In this case, that would mean that your bike has a certain stolenness that only increases over time: It can be stolen, but once it is stolen it is never unstolen. As the above example shows, this method finds a theft just fine if you lose that property and allow for the bike to be unstolen.

The next thing is that this is binary search over a continuous space. If you wanted to, you could run the binary search forever, slicing the time finer and finer.⁷ This would be dumb, so we bail out when we get to a small enough time to watch it ourselves, so it causes no problems, but the pure binary search as explained by Claude never stops here unless we get lucky and find the exact instant of the theft (as opposed to some point when the theft is in progress).

Neither of these are problems if you think of binary search in the right way, and the bike theft example is a nice illustration of what binary search is actually doing: You have two numbers (or points on a line if you prefer) which might be far apart, and are different in some way. Binary search lets you find two numbers between them that are close together and are still different from each other. That is, it lets you find two nearby points where something changes, where “nearby” means “within five minutes of each other”, and the something that changes is whether there is a bike there.

In the sorted list example, what you’re looking for is a point where the values change from less than your target to greater than or equal to your target, and your notion of “nearby” is “right next to each other”. So, in code, the standard binary search looks like this (if you can’t read the code, don’t worry about it too much, I’ll explain in a minute):

def find_target(elements, target):
    """
    Given a sorted list `elements`, returns the index
    of the first position where all elements after that
    point are >= target.
    """
    # If the first element is >= the target, all elements
    # are, so we return 0.
    if elements[0] >= target:
        return 0

    # If the last element is < the target, then no elements
    # in the array are, so we return the length meaning that
    # only an empty set of elements are >= the target
    if elements[-1] < target:
        return len(elements)

    # Invariant: elements[low] < target
    low = 0

    # Invariant: elements[high] >= target
    high = len(elements) - 1

    while low + 1 < high:
        # Take the midpoint
        mid = (lo + hi) // 2
        # Assign to whichever endpoint would preserve
        # the invariant.
        if elements[mid] < target:
            low = mid
        else:
            high = mid

    # By the invariants, elements[low] < target and
    # elements[high] >= target
    # We know that at this point low + 1 == high, so
    # that means that high is necessarily the first
    # point at which elements become >= target.
    return high

If you’re a programmer reading this, the main take home you should have from this code is that if you find yourself implementing binary search (which hopefully you won’t too often, but sometimes you gotta), never under any circumstances skip writing those “Invariant” comments at the top. They will save you confusion every time.

If you’re not a programmer, here’s what’s going on in it:

We’re looking for the point where the list changes from elements that are less than the target to the point where they are greater than or equal to the target. We find a pair of indices next to each other where this changes, and thus the larger of the two is necessarily the first point (because it is greater than or equal to the every element before it is less than the target).

If the target is in the list, that first point where they are greater than or equal is necessarily where the target is. Otherwise, you can tell the target is not in the list by looking at that index and checking whether it’s past the end of the list or the value at it is greater than the target.

Let’s try an example to work through it: Suppose you’ve got the list of elements 1, 3, 5, 7, 9, and we’re looking for the number 7 in it. This proceeds as follows:

We start with low = 0, high = 5 (indexes start from 0, so the first element is the element at position 0).
We take the midpoint of this, which is 2 (division rounds down), so we look at the element at position 2, which is 5.
This is less than 7, so we now set low = 2.
Now, low = 2, high = 5, so mid = 3. The element at position 3 is 7, which is greater than or equal to seven, so we set high = 3.
Now, low is 2, high is 3, so lo + 1 equals high, and we stop and return that the first index greater than or equal to 7 is at position 3 (which it is, because 7 is there).

But honestly, if you’re not a programmer, this is probably the least interesting possible example of binary search, this problem just comes up in programming a lot, and it’s much more interesting to understand the general principle.

Why is it important to understand? Well, partly just because I think knowing things like this that give you an easy perspective shift on what is possible is almost always valuable, and partly because binary search is very useful.

We’ve already seen how it takes the cycle theft example from not worth it to to easy,⁸ by helping you figure out when something happened. Another place I’ve suggested in the past where understanding this principle is useful is estimating unknown quantities. I used the example of guessing the population of Paraguay. You first pick a number that’s obviously too small, and a number that’s obviously too large, and then you progressively narrow that range through binary search.

Except, that again isn’t quite the same thing! The problem there is that while you are doing the same thing (finding a pair of points where one is obviously too small and one is obviously too large), your stopping condition is no longer necessarily when the points are close. The problem is that there’s a large range of values in the middle where your answer to “Is this obviously too large or obviously too small?” is “No. Seems kinda plausible but I’m not sure.”, so you can get stuck on making progress.

One way to solve this is with the same way we did with the ordered list: Our condition for the lower bound is “obviously too small”, and for the upper bound is “not obviously too small”. I think you’ll still get stuck at some point when you find yourself asking “OK, but how obvious is obvious…?”, but you’ll do a lot better. You can then repeat this on the other side with “obviously too large”, and you get a range of values that you think are plausible.

Another way to solve this is that once your midpoint is non-obvious you can just try testing some other points in the range and updating your range on the basis of that. Pick randomly, or near the edges, or whatever, and narrow the range on that basis. There’s not actually anything magical about the exact midpoint - any point you try between the two ends will allow you to update the range if you get an “obviously too small” or “obviously too large” answer.

This sort of flexibility is one of the reasons why I think it’s super helpful to understand binary search as looking for changes in behaviour. It makes it much easier to reason about new variants, because you really understand what’s going on.

And in general this is one of the key purposes of understanding how things work: If you know how to make it, you know how to make something like it that works slightly differently. In the pen cap post, I talked about how “how things work is how you work with them”, and that’s important, but the other reason to understand how things work is that how things work is how other things like them work, and how things work is how you make them.

I’ve run into this principle over and over again in programming, software development, and maths.⁹ e.g. I did some algorithmic work I’m really proud of recently,¹⁰ and it only happened because I spent some time staring at the preliminary material going “Hmm… How on earth does that work?”, in much the same way I found out about pen caps recently. Right now I’m working on some statistics, and I think if I treated statistics the way most people do - as a set of tools to use without understanding them - the task I’m doing would be almost impossible.

I don’t think this is just for abstract subjects either. I see it a lot in other people who better at more physical skills than me: In much the same way I can just whip up a new algorithm, they can just improvise or custom build a physical object where I’d have to ad hoc adapt some existing object that was much less fit for purpose.

And, you know, often that’s fine. It’s impossible to acquire complete knowledge of everything. Some of the people I’m comparing myself to on the physical skills have literal decades of more experience in the subject,¹¹ and I’m unlikely to put in the effort to acquire those. I expect that will be the same for most people on most of the specific things where I understand them in this way too. This isn’t an invitation to try to learn everything, it’s just me drawing your attention to what happens with the things you do decide to learn.

And part of that is that when you find yourself in the position of acquiring a new technique or skill, I think that if you can it’s worth taking the time to make sure you really understand it, and to find the right way of looking at it, so you can stop treating it as a narrow and well-defined thing you can do,¹² and start treating it as a lesson in how to interact with the world in a variety of related ways.

There’s more technical posting on my notebook blog. I used to do technical blogging at my “main” blog, but that’s as the last post says “on indefinite hiatus”.
↩︎
There’s also the possibility of you happening to get lucky and catching the criminal mid-act in one of your repeatedly checking points in the video.
↩︎
In general, linear/O(n) actually means “some fixed multiple of a minute”. For example if you had to watch the footage twice, every minute would add two minutes of time for you. If you were able to watch the video at 10x speed, every minute would add six seconds. Both of these are still O(n) operations, because every minute of footage adds a fixed amount of time to the task.
↩︎
And, indeed, you can see that this is the case in this example! Once we get under the five minute mark we switch to a linear search. This is partly because of the problem - we actually want to catch the thief in the act - but also for viewing a continuous chunk of footage, whenever you get to under the time it takes you to fiddle with your video controls, it’s always going to be easier and faster just to watch it.
↩︎
Claude should of course not be taken as authoritative, but I think it’s pretty safe to rely on it being clichéd.
↩︎
Of course, it’s not very useful for getting your bloody bike back, but details details.
↩︎
Well you couldn’t because the video has a fixed frame rate, and maybe there’s a quantum time consideration even if that weren’t the case, but ignore this detail.
↩︎
Or, at least, reduced to the problem of getting the police to follow basic instructions and care about doing their jobs.
↩︎
Even with binary search! As the expression goes, if I had a nickel for every time I’ve invented a novel variant of binary search, I’d have… Well, two nickels, but it’s weird that it’s happened twice.

The first is the observation that binary search gets much faster (improved complexity even!) if you have a good guess about where the point you’re looking for is. The second is that if what you’re actually looking for is a random sample in some ordered range, you can do binary search incrementally to get the desired result efficiently.

I don’t think I’d have figured out either of these without the perspective on binary search I’m telling you about here.
↩︎
The GAWRS and RAWRS variants in section H.3 of Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling if you’re interested. Sadly I only did this work quite late in the publication, so it only made it into an appendix, but these are implemented in genlm-control as the main algorithm used for AWRS.
↩︎
Hi dad.
↩︎
A spell, if you will.
↩︎

How pen caps work

16 October 2025

This post was originally published at https://drmaciver.substack.com/p/how-pen-caps-work.

So, I recently learned how pen caps work. I think it’s neat, so I’m going to tell you about it, but I also think the experience of figuring this out illustrates some other interesting things about the world, so I’m going to tell you about that too.

First, pen caps.

Some context: I mostly write with a fountain pen.¹ I often put it down when I’m mid-writing because I need to think something through, get distracted, and as a result I leave it out having forgotten to put the lid back on and the nib dries out. This is mildly annoying.

One time recently I had done this and thought “Oh, that’s annoying, I should have put the pen cap on”. Then I stared at the pen cap for a bit and went “Wait, hang on… how on earth does that work?”

My naive intuitive model of this is that it’s like vegetables. If you put cut vegetables in an open bowl in the fridge, they’ll dry out. If you put them in a sealed container or bag, they won’t. If anything they’ll seem damper when they come out.

The reason it works like this though is that vegetables contain a lot of water. When in dry air, that water evaporates and is released into the air. The dryer the air, the more water is released. The more humid the air, the less, until at a certain level of wetness the amount of water condensing onto the vegetable is the same or equal to the amount of water evaporating from it, and the vegetable stops getting drier. The reason why putting vegetables into a container stops them drying out too much is because they release water into the air until the air is humid enough that it reaches that point and they stop drying out further.

Something you can see from this is why it matters that the container be small. If you think about it, just putting them in a bowl in the fridge is also putting them in a sealed air tight container, it’s just a fridge sized one.² Because the fridge is so large relative to the vegetables, they can never release enough water to reach the point where they stop drying out. In contrast, a small container doesn’t have much air in it, so the vegetables can easily make it humid. The smaller the container, the better this works - the ideal is a small bag you’ve pressed out all the air from.

The thing is, a pen nib doesn’t have a lot of liquid on it, and the cap is quite large. I can’t give you actual numbers so this may be wildly off, but based on how quickly a pen nib dries out in air, and how little ink is lost when you do that, I’d be astonished if it could reach equilibrium before the pen nib is just thoroughly dried out. Also if you take the lid off and on again a lot, you’d be let the humid air out of the cap, restarting the whole process, which doesn’t track my experience that cycling putting the cap on and taking it off again doesn’t result in the pen drying out.³

So: Intuitive theory that the way pen caps work is just keeping the nib in a closed environment, seemingly false.

My second theory was that the cap was structured so that the nib was actually pressing against metal on the inside, so there was little to no air contact causing the evaporation. I did some fiddling with putting the pen away and it’s a little hard to tell for sure, but the shape of the pen cap doesn’t seem to support this: The nib is straight, while the cap on my pen is curved. Also the cap is much longer than the part of the pen above the seal.

At this point I performed the ultimate experiment: I looked up the answer. As one would expect from the internet, there is an an entire website dedicated to fountain pen design, and after a bit of googling I found both it and this specific page on fountain pen cap mechanics and physics.

The key bit of this page is:

In summary:

Inserting a pen into a cap as well as removing the cap from the pen causes pumping action

taking the cap off → suction, some vacuum is caused inside the cap

putting it on → compression of air inside the cap

That is: It’s not just, or even necessarily primarily, that putting a cap on the pen keeps it dry. Taking a cap off a pen makes it wet again, because it draws out the ink from the reservoir through suction .

This is an easy experiment to perform, so I did: I waited for my pen to get dry, put the cap on, and immediately took it off. The pen was still a little dry but had become wet enough to be usable for writing again. Claim validated.

At this point, my mind was absolutely blown. This is not a mechanism that it would ever have occurred to me was in use here. It makes complete sense in retrospect, but it simply wasn’t on my radar as a possibility.

An additional wrinkle: Some subsequent experimentation caused me to conclude that it wasn’t just this that was going on, and my earlier theory that it was like vegetables actually does hold some water. I investigated the other pen I use on the regular, a uni-ball eye,⁴ and by shining a flashlight down it, the way the cap is designed very clearly has a tiny near-airtight well at the end that the nib goes into. The nib isn’t in contact with the lid, but it is enclosed in a very small space, which means that it’s much easier for it to reach equilibrium with the air. This also has an air tight cap, and it’s harder to make a rollerball dry out, but some experimentation with it made it clear that there’s still some of the same effect going on and the nib has much more ink on it when the cap is put on and removed.

Once I had spotted this, I experimented with running the fountain pen nib down the inside of the lid, and realised that there is a slight ridge around where the actual nib will reach, so there’s clearly an extra air tight seal happening there, putting the nib in a much smaller area.

Speculating, I think this is particularly important for long term nib usage: If I keep the fountain pen closed for hours or days, especially if it’s in a bag, more ink will come out of it as it moves (if you shake a fountain pen vigorously, ink will fly out, so presumably moderate shaking also causes moderate flow), and so reaching equilibrium with the air in the cap quickly helps prevent that drying out.

This detail aside, for short term usage of the pen cap, what clearly dominates is the suction effect of the cap.

Why am I telling you about this?

Well because it turns out that pen caps’ mechanics are fascinating, and if you didn’t find this information incredibly interesting I really don’t know what to tell you. I’m probably not going to go out and become a pen cap enthusiast,⁵ but I’m really delighted to have learned this fact and wanted to share the delight.

It’s also incredibly useful to know about for when the pen does dry out, because it’s far more effective as a means of fixing that than anything else I was doing before I knew this,⁶ and hopefully some of you will find that useful too. Although, according to the fountain pen mechanics site, don’t do this too often as it may cause your pen to drip by filling up the feed too much.⁷

But I also think this interesting for a couple of more generalisable reasons.

The first is that it’s an interesting encounter with reality having a surprising amount of detail. Pens seem… if not simple, at least not something that you have to think about the complexity of that much. I know how pens work, in much the same way that I know how bicycles or toilets work - I couldn’t build one, but I could sketch out enough of the details of how they work and express the bounds on my ignorance that I’m not going to embarrass myself too badly when I try to explain them.

But we’ve been using pens for thousands of years, and fountain pens for at least hundreds of years, and that’s a lot of time for the design to evolve, and for us to figure out how to solve problems with previous designs, and looking at the modern version of them you might not even realise that they’re solving those problems because you’ve never really noticed you had them in the first place.

The second thing is that it illustrates a principle that I don’t have a very pithy name for, but is something like… how things work is how you work with them.

You may not realise that you have those problems, but you still have those problems, and if you use things wrong they will not solve those problems for you. The ideal device is designed so that you never have to think about that, and even in non-ideal cases you can often get away with not thinking about it, but there will inevitably come times where knowing how it works will make life better for you - because it’s broken, or because you want to do something unusual with it, or even just because you want it to work better.

Most things though are not ideal, because we cannot build perfect systems, and as a result you will use them wrong if you don’t have some understanding of what they do. A trivial example of this is dishwashers. A lot of people seem to treat dishwashers as magic “put dishes in, clean things come out”, and as a result stack dishes in them like an insane person in a way that obviously cannot possibly work if you think for a second about how a dishwasher cleans things through the straightforward mechanical application of water to them, because the flow of water to the dish is blocked.⁸

I’ve run into this a lot with software too. For a lot of previous companies I ended up as the local expert in git, because unlike everyone else I’d bothered to acquire a rudimentary working knowledge of how git worked, so things like “Yes, the remote branch and the local branch are different things and can diverge” were not mysteries to me. I think these days enough general git knowledge has diffused into the population of programmers that this happens less, but maybe I’ve just stopped paying attention.

The same thing happens in Hypothesis. A lot of it is designed to just magically work, which is great up until the point where it doesn’t. Users don’t need to care about how shrinking works, until suddenly their generated test cases are horribly complicated because they wrote their code “wrong”, and now they need to know how it works.⁹

The process by which I found out how this worked is also interesting, because I think it’s quite illustrative of how discovering how things work in general goes.

I had a problem which I cared about (my pen nibs were drying out). I knew the solution (I should put the cap on more reliably)¹⁰, but then I asked “Why does that work…?”

I didn’t need to know why this worked, but I was curious, and being curious about silly and basic things is important.

I generated an explanation, but the explanation felt wrong. I was analogising from a similar situation I understood, but the more I thought about the analogy the more it didn’t hold up. Following that sense of wrongness until you get to a point where the pieces fit together is important. It’s very easy to get to wrong explanations, it’s important to check them.

And then I looked up the answer. Doing science is all very well, and it’s important to be able to do it well enough to notice problems and understand other people’s explanations, but in this case I just wanted to know the answer and other people clearly know how fountain pens work so I may as well draw on that expert knowledge.

And now I’m telling you about it, because sharing stories is how we learn from each other.

A Pilot MR3 for the fountain pen nerds in the audience.

I’m not especially knowledgeable about fountain pens, but I have friends who are, so I just asked them what to try and after trying a few this is the one I liked the most and have stuck with.
↩︎
Also the fridge isn’t actually airtight, and a good one is designed to remove humidity because it will condense on cold surfaces and then freeze up. Ideally that condensed water is drained away and then evaporates on the outside. But this claim would be true even if that weren’t true.
↩︎
This is, as we will discover shortly, a Clue.
↩︎
Which I firmly believe are objectively the best rollerball.
↩︎
Though I am genuinely tempted to read the fountain pen book written by the site I linked to.
↩︎
My best trick before was slightly wetting the nib and then scribbling until the ink looked consistent. This sucks as a method and I clearly should have looked up a better one but never thought to.
↩︎
What’s the feed? It’s the bit under the pen nib. It controls the flow of ink from reservoir to nib. No, I didn’t know this either before reading this page.
↩︎
They then, of course, fail to notice that this doesn’t work and adjust their behaviour because they’re not actually trying to succeed.
↩︎
One of the ways I feel like the design fails a bit is that it doesn’t provide a particularly gentle onramp to understanding this and figuring it out.
↩︎
But not too reliably, lest I flood the feed!
↩︎