Triangular Beeminding; Or, Drink Less, Using the Power of Triangles

This is a crosspost of a guest post I’ve written for the Beeminder blog.

One of my vices is that I drink a bit too much. Not to the level where I have a problem, but it would be strictly better if I cut out about 2 or 3 of the drinks I have in a typical week. This seems like an obvious use case for Beeminder.

I’ve previously beeminded units of alcohol consumption and concluded that, measured as a total number of units per week, I’m completely fine. The recommended maximum intake for an adult male is somewhere in the region of 20 – 25 units per week depending on who you ask. When I was beeminding this regularly I never had trouble keeping under 12 units. I drink a bit more than that now, but nowhere close to twice as much.

So if I’m that far under the recommended guideline why do I think I drink too much?

Well, the low average is because I actually have a lot of nights in any given week where I don’t drink at all. The problem is that on nights when I do drink I often have a drink or two more than I should. I make tasty cocktails, and if I’ve just had a cocktail I really liked then making another one sounds like an excellent idea. The next day I will usually discover that the third drink of the evening wasn’t such an excellent idea.

So what I need is a Beeminder goal that matches the structure of the behaviour I want to change: I need a way to beemind the peaks as well as the averages. A week with two three-drink nights should “cost” more than a week with a single drink every night.

I’ve come up with what seems like a good structure for this.

The idea is to assign each drink [1] a number of points. The first drink in a day costs one point, the second two, the third three, and so on. Because these add up, this means that a day with one drink costs one point, a day with two drinks costs three, a day with three drinks costs six. It mounts up pretty quickly. These running totals are called triangle numbers, hence the title of this post.

To start with I’ve capped the total number of points at 15/week. This is a deliberately lax starting rate which equates to a maximum of 11 drinks in a week (3 days with 1 drink and 4 with 2). Since the drinks I tend to have are two units this is about at the recommended maximum. Note that I can hit the limit while drinking less than that: If I have more than two drinks on any night, the extra points mean I’m forced [2] to reduce the total for the rest of the week to compensate.

Example permitted maximum drinking patterns:

  1. 4 days with 2 drinks and 3 days with 1 (11 drinks)
  2. 1 day with 3 drinks, 1 day with 2 drinks, 5 with 1 (10 drinks)
  3. 2 days with 3 drinks, 3 days with 1 drink, 2 days alcohol free (9 drinks)
  4. 1 day with 4 drinks, 1 day with 2 drinks, 2 days with 1 drink (8 drinks)
  5. 1 day with 5 drinks (!) and no drinking the rest of the week (5 drinks)

Note that 3 days with 3 drinks is not permitted even with the rest of the week entirely alcohol free: That would be 18 points, which would take me over the threshold [3].
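If you want to check the arithmetic yourself, here’s a tiny sketch of the scoring (this is just me verifying the numbers above, not anything Beeminder runs, and the helper names are made up):

# Toy arithmetic for the triangular scoring above. Nothing here is
# Beeminder code; it's just a check of the numbers in the list.

def day_cost(drinks):
    # The nth drink in a day costs n points, so a day with k drinks
    # costs the kth triangle number: 1 + 2 + ... + k.
    return drinks * (drinks + 1) // 2

def week_cost(days):
    # days is a list of drink counts, one entry per day of the week.
    return sum(day_cost(d) for d in days)

WEEKLY_CAP = 15

patterns = {
    "4 days of 2, 3 days of 1": [2, 2, 2, 2, 1, 1, 1],
    "1 day of 3, 1 of 2, 5 of 1": [3, 2, 1, 1, 1, 1, 1],
    "2 days of 3, 3 of 1": [3, 3, 1, 1, 1, 0, 0],
    "1 day of 4, 1 of 2, 2 of 1": [4, 2, 1, 1, 0, 0, 0],
    "1 day of 5, rest dry": [5, 0, 0, 0, 0, 0, 0],
    "3 days of 3 (not allowed)": [3, 3, 3, 0, 0, 0, 0],
}

for name, days in patterns.items():
    cost = week_cost(days)
    print(name, cost, "OK" if cost <= WEEKLY_CAP else "over the cap")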

I was originally planning to track this manually, but then Danny got so excited by the concept that he added a feature for it, so it’s easy to give this a try yourself:

  1. Go to “Terrifyingly advanced settings”
  2. Convert your goal to a custom goal (this requires a premium plan).
  3. Switch the aggregation mode to “Triangle”

Best to apply this to a fresh goal. This stuff can easily screw up your goal if you’re not careful, so don’t do it to one with data you care about. [4]

If you want to track it manually instead, just enter the numbers yourself: 1 for the first drink, 2 for the second, and so on [5]. A standard Do Less goal will sum those up, yielding the triangular numbers.

So far this is experimental. I’ve only been running this for a few days, so it may turn out to be a silly idea in the long run. I don’t think it will though. I’m quite pleased with the incentive structure it sets up, and the effect so far has definitely been to make me think more carefully about the later drinks. I’ll add a follow-up comment to this post in a month or so when I’ve had time to see how it works.


Footnotes

[1] Drinks, not units of alcohol. I try to keep my Beeminder goals based on things I don’t need to estimate or measure. Especially if I have to estimate them after a few drinks. Most of my drinks are approximately two units as I tend to drink cocktails or spirits.

[2] Well, “forced”. I have this goal set up so that it’s OK to fail occasionally. I’ve got a pledge cap of $10 set, so the worst case scenario is that my drinks suddenly become a bit more expensive. This is coupled with a no-mercy recommit: If I decided last night that I was OK derailing, I’m not off the hook today. This is based on a concept from Bethany Griswold about using Beeminder to make free things not free. The Bethany better known in these parts makes a related point in “Be Nice To Yourself”.

[3] Normally this wouldn’t be quite true because I could build up buffer from week to week if I wasn’t drinking much, but I’ve got this goal set to auto-ratchet so I can’t actually do that. If I build up more than a week of buffer it cuts back down to a week. Another terrifyingly advanced premium feature, available with Plan Bee.

[4] Danny here: But we’re excited about people trying this so actually please do feel free to experiment and holler at [email protected] if you break something and we’ll fix it!

[5] Danny doesn’t like this because it breaks the “Quantified Self First” principle. The numbers that you enter this way don’t correspond directly to something you want to measure. [6] Personally I’m much more interested in behaviour change than QS, so I don’t have a problem with it.

[6] Danny again: Actually, QS is more about measuring real-world things. With the triangle aggregation, we’ve got the best of both worlds. We get a true count of the number of drinks (real-world measure) and we get a goal-friendly aggregation for beeminding. You can export the data and do something else with it or change the aggregation function back to normal summing if you want to go back to beeminding total amount of alcohol consumed.


Monadic data generation strategies and why you should care

I posted this gist earlier. It’s a toy port of part of the new templatized data generation system from Hypothesis to Haskell.

It doesn’t do most of the important things. Its purpose is to demonstrate one simple point: Hypothesis strategies are now monads.

I don’t want to get into the deep philosophical implications of this or how it means Hypothesis is now like a burrito in a space suit. What I want to do here is point out that this enables some super useful things.

Consider the following example from the Hypothesis README (at the time of this writing; this is going to change soon for various reasons, one of them being the stuff I’m about to get into):

from decimal import Decimal
from hypothesis.searchstrategy import MappedSearchStrategy
 
class DecimalStrategy(MappedSearchStrategy):
    def pack(self, x):
        return Decimal(x) / 100
 
    def unpack(self, x):
        return int(x * 100)

This is for defining a Decimal strategy in terms of an integer strategy – it has an operation to convert an int to a decimal (pack) and a decimal back to an int (unpack).

The reason it needs to convert a decimal to an int is because of simplification. If it can’t convert back then it can’t simplify. This is also what stops strategies from being a functor (remember that all monads are functors): In order for something to be a functor we need to be able to define a new version by just mapping values. We can’t require the mapping to go both ways.

Which is why it’s pretty great that the new template API lets us throw away half of this! Now MappedSearchStrategy no longer requires you to implement unpack. As well as being half the work, this means you can use it in cases where you might sometimes need to throw away data – the mapping no longer has to be a bijection. The reason it can do this is that it just uses the templates for the original type, so there’s no need for you to convert back.
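For comparison, here is roughly what that README example shrinks to once unpack is gone. Treat this as a sketch of the shape of the change (I’m assuming the forward conversion is still called pack), not a guarantee about the exact released API:

from decimal import Decimal
from hypothesis.searchstrategy import MappedSearchStrategy

# Sketch only: assumes the forward conversion keeps the name pack.
# The point is just that the reverse conversion (unpack) goes away.
class DecimalStrategy(MappedSearchStrategy):
    def pack(self, x):
        return Decimal(x) / 100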

But that’s just an example where being a functor is useful. Why is being a monad useful?

Well, the operation that monads add over functors is bind. map lets us take a function a -> b and turn a Strategy a into a Strategy b. bind lets us take a function a -> Strategy b and turn a Strategy a into a Strategy b. When would we want to do this?

Well, one example is when we want some sort of sharing constraint. Suppose for example we wanted to generate a list of dates, but we wanted them to all be in the same time zone. The bind operation would let us do this: We could do something like strategy(timezones).bind(lambda s: [dates_from(s)]) (this is a made up API, the details for this are not yet in place in actual Hypothesis). This would generate a timezone, then we generate a strategy for generating dates in that time zone, and a strategy for producing lists from that.
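To make the shape of bind concrete without pretending this is the real Hypothesis API, here’s a toy model where a “strategy” is just a function from a random number generator to a value. All of the names below are made up for illustration:

import random

# Toy model, not Hypothesis: a "strategy" is just a function that
# takes a Random instance and returns a value.

def just_one_of(options):
    return lambda rnd: rnd.choice(options)

def map_strategy(strat, f):
    # Functor map: apply f to whatever the strategy produces.
    return lambda rnd: f(strat(rnd))

def bind_strategy(strat, f):
    # Monadic bind: use the produced value to pick a *new strategy*,
    # then draw from that.
    return lambda rnd: f(strat(rnd))(rnd)

def lists_of(strat, max_size=5):
    return lambda rnd: [strat(rnd) for _ in range(rnd.randint(0, max_size))]

# Sharing constraint: pick a timezone once, then generate a list of
# dates that all use that same timezone (dates here are just strings).
timezones = just_one_of(["UTC", "Europe/London", "America/New_York"])

def dates_in(tz):
    return lambda rnd: "2015-%02d-%02d %s" % (rnd.randint(1, 12), rnd.randint(1, 28), tz)

same_zone_dates = bind_strategy(timezones, lambda tz: lists_of(dates_in(tz)))

print(same_zone_dates(random.Random(0)))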

Given that this is Python, you don’t have the advantage you get in Haskell that being a monad gives you nice syntax and a rich set of support libraries, but that’s OK. The reason monads are a thing in the first place is that the monad operations are generally super useful in their own right, and that remains true in Python too.


The three stage value pipeline for Hypothesis generation

Note: This describes a work in progress rather than something that is in a released Hypothesis version. As such it’s liable to change quite a lot before it’s done. I’m partly writing this as a form of thinking through the design, and partly as a way of procrastinating from the fact that I’ve literally broken the entire test suite in the course of moving to this; trying to fix it is really making me wish I’d written Hypothesis in Haskell.

I’ve previously talked about how generating data in Hypothesis is a two step process: Generate a random parameter, then for a given parameter value generate a random value.

I’ve introduced a third step to the process because I thought that wasn’t complicated enough. The process now goes:

  1. Generate a parameter
  2. From a parameter generate a template
  3. Reify that template into a value.

Why?

The idea is that rather than working with values which might have state we instead always work with merely the information that could be used to construct the values. The specific impetus for this change was the Django integration, but it also allows me to unify two features of Hypothesis that… honestly it would never have even occurred to me to unify.

In order to motivate this, allow me to present some examples:

  1. When generating, say, a list of integers, we want to be able to pass it to user provided functions without worrying about it being mutated. How do we do this?
  2. If we have a strategy that produces things for one_of((integers_in_range(1, 10), integers_in_range(9, 20))) and we have found a counter-example of 10, how do we determine which strategy to pass it to for simplification?
  3. How does one simplify Django models we’ve previously generated once we’ve rolled back the database?

Previously my answers to these would have been respectively:

  1. We copy it
  2. We have a method which says whether a strategy could have produced a value, and we arbitrarily pick one that could have produced it
  3. Oh god I don’t know can I have testmachine back?

But the templates provide a solution to all three problems! The new answers are:

  1. Each time we reify the template it produces a fresh list
  2. A template for a one_of strategy encodes which strategy the value came from and we use that one
  3. We instantiate the template for the Django model outside the transaction and reify it inside the transaction.

A key point in order for part 3 to work is that simplification happens on templates, not values. In general most things that would previously have happened to values now happen to templates. The only point at which we actually need values is when we actually want to run the test.
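To make that concrete, here’s a rough sketch of what a strategy looks like in this world. The method names are illustrative rather than the exact Hypothesis interface, but the division of labour is the point: templates are plain immutable data, and values only appear at the last possible moment.

import random

# Sketch only: the real SearchStrategy interface may differ, but the
# three-stage pipeline is the point.

class ListOfIntsStrategy(object):
    def produce_parameter(self, rnd):
        # Step 1: a parameter controlling the shape of the distribution.
        return {"average_length": rnd.randint(1, 10), "max_int": 100}

    def produce_template(self, rnd, parameter):
        # Step 2: a template -- a tuple here, so it's immutable and the
        # test function can't mess with it.
        length = rnd.randint(0, 2 * parameter["average_length"])
        return tuple(rnd.randint(0, parameter["max_int"]) for _ in range(length))

    def reify(self, template):
        # Step 3: build a fresh value from the template. Each call
        # returns a new list, so mutation by the test is harmless.
        return list(template)

    def simplify(self, template):
        # Simplification works on templates, never on reified values.
        for i in range(len(template)):
            yield template[:i] + template[i + 1:]

strategy = ListOfIntsStrategy()
rnd = random.Random(0)
parameter = strategy.produce_parameter(rnd)
template = strategy.produce_template(rnd, parameter)
print(strategy.reify(template))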

As I said, this is still all a bit in pieces on the floor, but I think it’s a big improvement in the design. I just wish I’d thought of it earlier so I didn’t have to fix all this code that’s now broken by the change.

(Users who do not have their own SearchStrategy implementations will be mostly unaffected. Users who do have their own SearchStrategy implementations will probably suffer quite a lot in this next release. Sorry. That’s why it says 0.x in the version)


What are developer adjacent skills?

I idly mused on Twitter about how one would go about teaching essential developer adjacent skills. I’m still not sure, but I thought I would elaborate on what it is I actually mean by a developer adjacent skill.

More generally, I’d like to elaborate what I mean by a job adjacent skill. There are sales adjacent skills, journalist adjacent skills, designer adjacent skills, etc. I’m only talking about developer adjacent skills because I am a developer and thus that’s where most of my expertise in the subject lies.

When I say a job adjacent skill what I mean is any skill that:

  1. Would improve your ability to interact with people doing that job in their capacity as someone who does that job
  2. Would probably be much less useful if you didn’t have to interact with anyone doing that job
  3. Would not require you to basically learn to do the job in order to acquire it

Examples:

  • Learning to code is not a developer adjacent skill because once you’ve learned to code sufficiently well you’re basically able to be a developer (You might not be able to be a very good developer yet, but you could almost certainly get a job as one).
  • Learning to write better emails is not a developer adjacent skill because it’s a generally useful skill – almost every job will be improved by this skill, not just ones that require interaction with developers.
  • Learning to write better bug reports is a developer adjacent skill, because usually you’re sending bug reports to developers rather than other people (either directly or indirectly), and it will make your interactions with those developers much better.

Skills adjacent to your job are usually not ones you need in order to do the job directly – for example you can happily code away without ever learning how to file a bug report (far too many people do) – but most jobs require interacting with other people doing the same job. You may have coworkers, you may have a professional community, etc. So most job adjacent skills are ones that you should also try to pick up if you do the job itself.

There are exceptions. Some job adjacent skills that you might benefit from are specific to the type of job you have. For example, I’m a backend developer. I have to work with frontend developers, and this means that it would be useful for me to acquire frontend developer adjacent skills. There is some overlap, but not all of these are the same frontend developer adjacent skills that a sales person would need to acquire.

One simple example of a skill that I should acquire is how to launch the app in a way that compiles all the CSS, etc, and to understand the buildchain enough that I don’t have to go bother someone and say “Help what’s a grunt it’s telling me it can’t find a grunt” when something doesn’t work. This is a front-end developer adjacent skill that the front-end developers should also have but that sales people probably don’t need to care about.

But another example of these is that I should know how to design an API that cleanly exposes the data in a way that front-end developers can easily make use of. This is a front-end developer adjacent skill that the front-end developers don’t need to have – it’s not their job, it’s mine. They need to have skills around how to use an API and how to clearly convey their requirements, but building it isn’t their thing and it doesn’t need to be. Sales people are unlikely to care about this one either.

So some job adjacent skills are quite specific, but I think the majority of them are ones that are generally useful in almost any role that interacts with that job.

Here are some examples of things I think of as good general purpose developer adjacent skills:

  • The aforementioned how to write a good bug report
  • How to find bugs in an application
  • How to behave in a sprint planning meeting or local equivalent
  • Understanding what software security is and why it’s important

There are doubtless many more. Those are just the ones I can think of off the top of my head.

Should you acquire developer adjacent skills?

Well, do you interact with developers? If not, then no you probably shouldn’t.

If you do interact with developers then yes, yes you should. And the developers should in turn acquire the skills adjacent to your job.

In general I think it’s important to acquire the adjacent skills of any job you routinely interact with. If not, you are basically making them carry the weight of your refusal to learn – if you don’t learn to write a decent bug report then your bugs are less likely to get fixed and will take up more of my time in which I could be working on other things; if I don’t learn to compile the CSS and launch the app I’ll be bugging you with questions every time I need to do that and taking away your time in which you could be working on other things.

This can be hard. There are a lot of jobs, and as a result a lot of adjacent skills that you might need to pick up. No one is going to be perfect at this.

I think the important thing is to bear all of this in mind. Whenever you interact with someone with a different job and the interaction is less productive than it could be, think of it as a learning opportunity – there are probably adjacent skills on both sides that are missing. Rather than getting frustrated at the lack of them on the other side, try to teach them, and in turn try to encourage them to teach you. Hopefully the next interaction will be less frustrating.


What is the testmachine?

It turns out you can be told and don’t have to experience it for yourself. However it also turns out that in the intervening year and a bit since I wrote this code, I’d mostly forgotten how it worked, so I thought I would complete a long ago promise and document its behaviour as a way to help me remember.

First, some history

Around the beginning of 2014 I was working on a C project called intmap. It was more or less a C implementation of Chris Okasaki and Andy Gill, “Fast Mergeable Integer Maps”, the basis for Haskell’s IntMap type, with an interesting reference counting scheme borrowed from jq internals. It also has some stuff where it takes advantage of lazy evaluation to optimize expressions (e.g. if you do a bunch of expensive operations and then intersect with an empty set you can throw away the expensive operations without ever computing them). I had some vague intention of trying to use this to give jq a better table and array implementation (it has immutable types, but internally they’re really copy on write), but I never actually got around to that and was mostly treating it as a fun hack project.

The internals of the project are quite fiddly in places. There are a bunch of low level optimisations where it takes advantage of the reference counting semantics to go “Ah, I own this value, I can totally mutate it instead of returning a new one”, and even without that the core algorithms are a bit tricky to get right. So I needed lots of tests for correctness.

I wrote a bunch, but I wasn’t really very confident that I had written enough – both because I had coverage telling me I hadn’t written enough and because even if I had 100% coverage I didn’t really believe I was testing all the possible interactions.

So I turned to my old friend, randomized testing. Only it wasn’t enough just to generate random data to feed to simple properties; I wanted to generate random programs. I knew this was possible because I’d had quite good luck with Hypothesis and Scalacheck’s stateful testing, but this wasn’t a stateful system, so what to do?

Well the answer was simple: Turn it into a stateful system.

Thus was born the baby version of testmachine. Its core was a stack of intmaps mapping to string values. It generated random stack operations: Pushing singleton maps onto the stack, performing binary operations on the top two elements of the stack, rotations, deletions, copies, etc. It also remembered what keys and values it had previously used and sometimes reused them – testing deletion isn’t very interesting if the key you’re trying to delete is in the map with probability ~0.

It then had a model version of what should be on the stack written in python and generated assertions by comparing the two. So it would generate a random stack program, run the stack program, and if the results differed between the model and the reality, it would minimize the program into a smaller one and spit out a corresponding C test case. It also did some shenanigans with forking first so it could recover from segfaults and assertion failures and minimize things that produced those too.

The initial versions of the C test cases explicitly ran the stack machine, but this was ugly and hard to reason about, so I wrote a small compiler that turned the stack machine into SSA (this is much easier than it sounds because there are no non-trivial control structures in the language, so a simple algorithm where you just maintain a single stack of variable labels in parallel was entirely sufficient for doing so). It then spat out the C program corresponding to this SSA representation with no explicit stack. You can see some of the generated tests here.

At this point I looked at what I had wrought for testing intmap and concluded this was much more exciting than intmap itself and started to figure out how to generalise it. And from that was born testmachine.

I experimented with it for a while, and it was looking really exciting, but then uh 2014 proper happened and all my side projects got dropped on the floor.

Then the end of 2014 happened and it looked like Hypothesis was where the smart money was, so I picked it up again and left testmachine mostly abandoned.

And finally I realised that actually what Hypothesis really needed for some of the things I wanted to do was the ideas from testmachine. So soon it will live again.

What’s the relation between testmachine and Hypothesis?

At the time of this writing, there is no relation between testmachine and Hypothesis except that they are both randomized testing libraries written in Python by me.

Originally I was considering testmachine as the beginnings of a successor to Hypothesis, but I about-faced on that and now I’m considering it a prototype that will inspire some new features in Hypothesis. As per my post on the new plans for Hypothesis 1.0, the concepts presented in this post should soon (it might take ~2 weeks) be making it into a version of Hypothesis near you. The result should be strictly better than either existing Hypothesis or testmachine – a lot of new things become possible that weren’t previously possible in Hypothesis, a lot of nice things like value simplification, parametrized data generation and an example database that weren’t present in testmachine become available.

So what is testmachine?

Testmachine is a form of randomized testing which rather than generating data generates whole programs, by combining sequences of operations until it finds one that fails. It doesn’t really distinguish assertions from data transforming operations – any operation can fail, and any failing operation triggers minimization of a failing example.

To cycle back to our intmap example: We have values which are essentially a pair (intmap, mirror_intmap), where mirror_intmap is a python dictionary of ints to strings, and intmap is our real intmap type. We can define a union operation which performs a union on each and fails if the resulting mirror does not correspond to the resulting map.

There are approximately three types of testmachine operation:

  1. Internal data shuffling operations
  2. Directly generate some values
  3. Take 1 or more previously generated values and either fail or generate 0 or more new values

(You can unify the latter two but it turns out to be slightly more convenient not to)

TestMachine generates a valid sequence of operations of a fixed size (200 operations by default). It then runs it. If it fails it finds a minimalish (it can’t be shrunk by deleting fewer than two instructions and still get a failing test case) subsequence of the program that also fails.

The internal representation for this is a collection of named stacks, usually corresponding to types of variables (e.g. you might have a stack for strings, a stack for ints, and a stack for intmaps). An operation may push, pop, or read data from the stacks, or it may just shuffle them about a bit. Testmachine operations are then just instructions on this multi-stack machine.

Simple, right?

Well… there are a few complications.

The first is that they’re not actually stacks. They’re what I imaginatively named varstacks. A varstack is just a stack of values paired with integer labels. Whenever a variable is pushed onto the stack it gets a fresh label. When you perform a dup operation on the stack (pushing the head of the stack onto it a second time) or similar the variable label comes along with it for free.

This is important both for the later compilation stage and it also adds an additional useful operation: Because we can track which values on the stack are “really the same” we can invalidate them all at once. This allows us to implement things like “delete” operations which mark all instances of a value as subsequently invalid. In the intmap case this is an actual delete and free the memory operation (ok, it’s actually a “decrement the reference count and if it hits zero delete and free some memory”, but you’re supposed to pretend it’s a real delete). It could also be e.g. closing a collection, or deleting some object from the database, or similar.
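A toy version of a varstack is small enough to sketch here. This is a reconstruction from the description above, not the actual testmachine source; it just illustrates the labelling and invalidation:

import itertools

# Toy reconstruction of a varstack; not the real testmachine code.

class VarStack(object):
    def __init__(self):
        self.data = []          # list of (label, value) pairs
        self.labels = itertools.count()
        self.invalid = set()    # labels that have been invalidated

    def push(self, value):
        # A freshly pushed value gets a fresh label.
        self.data.append((next(self.labels), value))

    def dup(self):
        # Duplicating the head keeps the same label, so both copies
        # are "really the same" value.
        self.data.append(self.data[-1])

    def peek(self):
        label, value = self.data[-1]
        if label in self.invalid:
            raise ValueError("variable %d has been invalidated" % label)
        return value

    def invalidate_top(self):
        # Mark every occurrence of the head's label as invalid -- the
        # machinery behind "delete"-style operations.
        label, _ = self.data[-1]
        self.invalid.add(label)

s = VarStack()
s.push("an intmap")
s.dup()                # two stack entries, one variable
s.invalidate_top()     # both entries are now unusable; peek() would raise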

The variable structure of the varstack also lets us define one of the most important invariants that lets testmachine do its work: An operation must have a consistent stack effect that depends only on the variable structure of the stacks. Further, an operation’s validity must depend only on the variable structure of the stacks (whether or not it fails obviously depends on the values on the stack; the point is that whether it was valid to execute the operation in the first place can’t depend on the available data).

This is a bit restrictive, and definitely eliminates some operations you might want to perform – for example “push all the values from this list onto that stack” – but in practice it seems to work well for the type of things that you actually want to test.

The reason it’s important is that it allows you to decouple program generation from program execution. This is desirable because it lets you do things like forking before executing the program to protect against native code, but it also just allows a much cleaner semantics for the behaviour of the library.

In fact the actual invariant that testmachine maintains is even more restrictive. All testmachine operations have one of the following stack effects:

  1. They operate on a single stack and do not change the set of available variables on that stack (but may cause the number of times a variable appears on the stack to change)
  2. They perform a read argspec operation, followed by a series of push operations

What is a read argspec operation?

An argspec (argument specifier. I’m really not very good at names) is a tuple of pairs (stack name, bool), where the bool is a flag that indicates whether that value is consumed.

The interpretation of an argspec like e.g. ((intmaps, True), (ints, False), (strings, False)) is that it reads the top intmap, int and string from each of their stacks, and then invalidates all intmaps with that label. An argspec like ((intmaps, True), (intmaps, True)) reads and invalidates the top two intmaps. An argspec like ((ints, False), (ints, False)) reads the top two ints and doesn’t invalidate them.

It’s a little under-specified what happens when you have something like ((intmaps, True), (intmaps, False)). Assume something sensible and consistent happens, which I’d totally have pinned down if I’d continued with the project.
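In code, reading an argspec comes down to something like the following. Again this is a reconstruction for illustration, using bare lists of (label, value) pairs and a per-stack set of invalidated labels rather than the real varstacks:

# Illustration only: stacks are lists of (label, value) pairs and
# invalidation is tracked per stack as a set of labels.

def read_argspec(stacks, invalidated, argspec):
    # argspec is a tuple of (stack_name, consumed) pairs. Reading walks
    # down each named stack and invalidates the label if consumed.
    # (The behaviour when one stack appears with both flags is left
    # under-specified here too.)
    args = []
    depth = {}
    for stack_name, consumed in argspec:
        stack = stacks[stack_name]
        index = -1 - depth.get(stack_name, 0)
        depth[stack_name] = depth.get(stack_name, 0) + 1
        label, value = stack[index]
        if consumed:
            invalidated[stack_name].add(label)
        args.append(value)
    return args

stacks = {"intmaps": [(0, {1: "a"}), (1, {2: "b"})], "ints": [(2, 7)]}
invalidated = {"intmaps": set(), "ints": set()}
print(read_argspec(stacks, invalidated, (("intmaps", True), ("ints", False))))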

The reason for this super restrictive set of operations is a) It turned out to be enough for everything I cared about and b) It made the compilation process and thus the resulting program generation much cleaner. Every operation is either invisible in the compiled code or can be written as r1, …, rn = some_function(t1, …, tn).

So we have the testmachine (the collection of varstacks), and we have a set of permitted operations on it. How do we go from that to a program?

And the answer is that we then introduce the final concept: That of a testmachine language.

A language is simply any function that takes a random number generator and the variable structure of a testmachine program and produces an operation that would be valid given that variable structure.

Note that a language does not have to have a consistent stack effect, only the operations it generates – it’s possible (indeed, basically essential) for a single language to generate operations with a wide variety of different stack effects on a wide variety of different stacks.

So we have a testmachine and a language, and that’s enough to go on.

We now create a simulation testmachine. We repeatedly ask the language to generate an operation, simulate the result of that operation (which we can do because we know the stack effect) by just putting None values wherever actually running the operation would have produced real ones, and iterate this process until we have a long enough program. We then run the program and see what happens. If it fails, great! We have a failing test case. Minimize that and spit it out as output. If not, start again from scratch until you’ve tried enough examples.
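Here’s a toy sketch of what generating-by-simulating looks like. None of this is the real testmachine code; the point is just that the language only ever sees the variable structure, so we can build the whole program before running any of it:

import random

# Toy sketch of program generation, not the real testmachine code.
# An "operation" here is just (name, stack_effect) where stack_effect
# simulates pushes and pops of variable labels -- no real values.

def toy_language(rnd, structure):
    # structure maps stack names to lists of variable labels.
    choices = [("new_int", lambda s: s["ints"].append(object()))]
    if structure["ints"]:
        choices.append(("dup_int", lambda s: s["ints"].append(s["ints"][-1])))
    if len(structure["ints"]) >= 2:
        choices.append(("add_top_two",
                        lambda s: (s["ints"].pop(), s["ints"].pop(),
                                   s["ints"].append(object()))))
    return rnd.choice(choices)

def generate_program(language, rnd, length=200):
    structure = {"ints": []}
    program = []
    for _ in range(length):
        name, effect = language(rnd, structure)
        effect(structure)        # simulate only; nothing is executed yet
        program.append(name)
    return program

print(generate_program(toy_language, random.Random(0), length=10))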

And that’s pretty much it, modulo a bunch of details around representation and forking that aren’t really relevant to the interesting parts of the algorithm.

This is strictly more general than quickcheck, or Hypothesis or Scalacheck’s stateful testing: You can represent a simple quickcheck program as a language that pushes values onto a stack and an operation that reads values from that stack and fails if the desired property isn’t true.

With a little bit of work you can even make it just plain better at Quickcheck-style testing.

Suppose you want to test the property that for all lists of integers xs, after xs.remove(y), y not in xs. This property is false because remove only removes the first element.

So in Hypothesis you could write the test:

@given([int], int)
def test_not_present_after_remove(xs, y):
    try:
        xs.remove(y)
    except ValueError:
        pass
 
    assert y not in xs

And this would indeed fail.

But this would probably pass:

@given([int], int)
def test_not_present_after_remove(xs, y):
    try:
        xs.remove(y)
        xs.remove(y)
    except ValueError:
        pass
 
    assert y not in xs

Because the only way it would fail is thanks to some special cases for being able to generate low entropy integers, and it’s hard to strike a balance between being low entropy enough to generate a list containing the same value three times and high entropy enough to generate an interesting range of edge cases.

Testmachine though? No sweat. Because the advantage of testmachine is that generation occurs knowing what values you’ve already generated.

You could produce a testmachine with stacks ints and intlists, and a language which can generate ints, generate lists of ints from the ints already generated, and can perform the above test, and it will have very little difficulty falsifying it, because it has a high chance of sharing values between examples.
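To illustrate, here’s a toy version of that setup (made up for illustration, not testmachine itself). Because lists are built out of ints that have already been generated, lists containing the same value several times come up all the time, and the property gets falsified almost immediately:

import random

# Toy illustration of the sharing idea, not real testmachine code. We
# keep stacks of previously generated ints and lists of ints, and new
# lists are built out of ints we've already generated, so repeated
# values are common.

def falsify_remove_twice(seed, steps=2000):
    rnd = random.Random(seed)
    ints, intlists = [], []
    for _ in range(steps):
        move = rnd.randint(0, 2)
        if move == 0 or not ints:
            ints.append(rnd.randint(-10, 10))
        elif move == 1:
            # Build a list out of ints already on the stack.
            intlists.append([rnd.choice(ints) for _ in range(rnd.randint(0, 6))])
        elif intlists:
            # The "test" operation: remove y twice, then check y is gone.
            original = rnd.choice(intlists)
            xs, y = list(original), rnd.choice(ints)
            try:
                xs.remove(y)
                xs.remove(y)
            except ValueError:
                pass
            if y in xs:
                return (original, y)   # counterexample found
    return None

for seed in range(10):
    result = falsify_remove_twice(seed)
    if result is not None:
        print("counterexample:", result)
        break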

The future

Testmachine is basically ideal for testing things like ORM code where there’s a mix of returning values and mutating the global state of the database, which is what caused me to make the decision to bring it forward and include it in (hopefully) the next version of Hypothesis. It’s not going to be an entirely easy fit, as there are a bunch of mismatches between how the two work, but I think it will be worth it. As well as allowing for a much more powerful form of testing it will also make the quality of Hypothesis example generation go up, while in turn Hypothesis will improve on testmachine by improving the quality of the initial data generation and allowing simplification of the generated values.
