Category Archives: programming

A file that should exist in all your ruby projects

Alright, the title is maybe a little more general than is strictly valid. I’m making the assumption that if you’re writing ruby then you are, like me, the sort of trend-bucking nonconformist that does exactly the same thing as all the other trend-bucking nonconformists.

Specifically I’m assuming that if you are using ruby you’re also using bundler (you should be. It is the way and the truth and the light. There is no salvation save through bundler), and you are using git. If you’re not using the former you should fix that (did I mention you should fix that?). If you’re not using the latter then this might be worth reading anyway, but the specific file you’re going to need is different.

Anyway, at the root of your git repo you should have a file called “.gitattributes”. That file may contain various things, but the line it needs to contain is

Gemfile.lock -merge

What’s going on here?

Well, the Gemfile.lock is basically a compiled and pinned down version of your Gemfile. You’re supposed to commit it to your repo so as to get a consistent gem environment across all the different platforms you run on.

The problem comes when you’re working with other people, or even just on different branches, and you make changes to the Gemfile (or even just the Gemfile.lock) on each of those different branches. You might get away with it, but there’s a good chance that when you merge the branches it will silently just merge your Gemfile.lock files. This is because the lock file is a text format so git assumes it’s safe to merge.

Sometimes this will cause you no problems, or will cause you problems that you notice very quickly. The trouble is that it will often produce a Gemfile.lock that bundler accepts in some cases, and then much later down the line you get really confusing bundler errors when you try to use it in a slightly different context. For example, we’ve found that having two incompatible versions of a gem in your Gemfile.lock can give confusingly different results depending on what’s installed on your system.

So what does this .gitattributes do? Simple: It tells git never to merge your Gemfile.locks. For merge purposes it treats them as binary files and will generate a conflict at this point, thus localising the error to where the problem occurred instead of some distant point down the line.
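If you want to make sure that line doesn't quietly go missing later, a small check can live in a test suite or Rake task. The following is only a rough sketch: the helper name `lockfile_merge_disabled?` is made up for illustration, and it only recognises a literal `Gemfile.lock` pattern rather than everything git's attribute matching supports.

```ruby
# Hypothetical helper: returns true if the given .gitattributes text
# unsets the merge attribute for a literal "Gemfile.lock" pattern.
def lockfile_merge_disabled?(gitattributes_text)
  gitattributes_text.each_line.any? do |line|
    pattern, *attributes = line.strip.split(/\s+/)
    pattern == "Gemfile.lock" && attributes.include?("-merge")
  end
end

# Usage, assuming the script is run from the root of the repository.
path = ".gitattributes"
if File.exist?(path) && lockfile_merge_disabled?(File.read(path))
  puts "Gemfile.lock merges are disabled"
else
  puts "Add 'Gemfile.lock -merge' to #{path}"
end
```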

This entry was posted in programming.

What are developers bad at?

I’m currently trying to put together a post about different modes of thinking that I think developers are not very good at and traditionally rely on other roles to cover for, but I’m not having much luck. It occurred to me that a large part of the problem here was that I am, in fact, a developer so most of my perspective of this is from the inside looking out. I’d like to fix that.

As I’ve previously expressed I think having a good team who you can rely on to support you in areas you’re bad at is a wonderful thing, but I also think that relying on them too much and not meeting them halfway will create unnecessary blockages and inefficiencies in your workflow – you’ll have to wait for someone else to do something you could really have done in 5 minutes, and so it will instead take several hours and may be indefinitely delayed.

When I’d previously written about this I was thinking in terms of different types of development, but it now occurs to me that exactly the same problem occurs when developers work with non-developers.

So, people who work closely with developers: What are we bad at that you’d like us to meet you halfway on? What would make your life easier if we all got a little bit better at it? Can you offer concrete advice on how to do so?

This entry was posted in programming, Work.

Questions for prospective employers

So I’m starting interviewing this coming week, which means I have to distil my long list of questions to ask employers into something more manageable. The following is very much a work in progress and I’ve not tested it in the wild yet, but I think it roughly matches what I’m going to want to use. If I change what I actually use then I’ll update it appropriately.

First off, how to raise these: I was originally thinking of mentioning right at the beginning of the interview that this is going to happen. Something along the lines of:

Hey. Up front warning: I have a long list of questions I need to ask your company in order to determine if this is somewhere where I’d like to work. It’ll probably take about half an hour to an hour, and I’ll need at least one developer and one non-developer present. I’d sort of expect this to happen at the end when we get to the “any questions?” part, but really we can do this at any point in the interview you like – we can even schedule a separate interview for it afterwards if you don’t want to take up this much time on it today – but I really will need to do this.

My friend Daniel Royde pointed out in the comments that this is something that I should really be mentioning before I arrive for the interview, so they can actually incorporate this into their schedule. He is 100% right about this.

The one non-developer and one developer thing is very important. I want to see how different their answers are and how they interact.

The actual questions I’m planning to ask are as follows:

  1. What do you personally like about working here? What do you dislike?
  2. What’s your office culture like? Can you describe the typical sense of humour?
  3. What’s your racial and gender diversity like in the company? Does this vary from team to team? If it’s not good, do you know why and is it something you’re trying to change? How?
  4. What is your company doing to have a positive impact on society? (question suggested by my friend Alex White)
  5. What’s your staff turnover rate like? Is it different from team to team? If it’s high, do you know why people typically leave and is it something you’re trying to change? (question suggested by Jamie MacIver, my brother)
  6. How do you deal with failure? When something goes wrong, what do you do to make sure it doesn’t happen again? (question suggested by my friend Kat Matfield)
  7. Who in your company is completely indispensable? Tell me about them (question suggested by my friend Elisabeth)
  8. How much technical debt do you have? What are you doing to address it? Is it working?
  9. Can you tell me about a change you’ve made in your development process recently? What prompted it? (question suggested by Michael Chermside)
  10. How do you decide and manage what to work on next?
  11. Suppose it is decided that a feature is needed and should be part of the current work priorities. What happens between now and the point where that feature hits production?
  12. What’s your business model like? Is it working? How would you know if it wasn’t working, and how would you go about fixing it?

I feel like this is a bit too long, but there really aren’t any questions in there I’d be willing to take out. If anything there are questions I’d like to put in that I’ve not!

UPDATE: I’ve decided the bit that follows is a bad idea. Real interviews are too chaotic for voice recording to be practical, and some informal polling suggests that people are bothered enough by the idea that it’s not worth the headache. Left in for posterity.

The other thing I am currently considering is how to handle note taking. Specifically, I’d like to record these question and answer sessions. I’m considering offering this with a speech something along the lines of this:

So I need to take notes on this bit for later review. If I do that by writing, this will slow everything down and it’ll be a pain for all of us. Would you object terribly if I record this bit? I promise never to publish any part of this recording, or even a transcript of it, and at the end I’ll give you the option to delete it before I leave. If you take that option I’ll ask for a few minutes to write down some notes and reference the recording before you do, but will be completely happy to comply.

I’m also thinking of mentioning at the beginning of the interview that I’m going to have a long set of questions to ask them and that I’m happy to do them either at the end of the interview or at whatever other point in it they’ll find convenient.

What do you think? Does this sound like a terrible idea?

This entry was posted in Hiring, life, programming, Work.

Implementing find_or_create correctly is impossible

There’s a method that seems to be present in every single ruby ORM. I assume it’s reasonably common outside of ruby ORMs too, but I hadn’t noticed it before I started using Ruby.

The method is called find_or_create. It looks something like:

class MyModel
   def self.find_or_create(find_params, create_params)
      self.find(find_params) || self.create(find_params.merge(create_params))
   end
end

(although all code in this post will be ruby, the examples are using a pseudo-ORM that most closely resembles Sequel)

In fact, in many cases, this is exactly what the implementation looks like too.

The thing about this implementation is that it is completely wrong in the presence of concurrency: If you call find_or_create at the same time in two different processes with the same arguments, it will almost certainly result in duplicated creates. If there is a uniqueness constraint then this will cause one of them to error. If there is not, you’ll get two rows when you expected one.
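To make the failure concrete, here is a small sketch that forces the unlucky interleaving deterministically. Everything in it is an assumption for illustration: `ToyTable` is an in-memory stand-in for a table with no uniqueness constraint, and the block passed to `find_or_create` plays the role of "another process gets to run here".

```ruby
# A toy in-memory "table" with no uniqueness constraint (hypothetical;
# a real ORM would be talking to a database).
class ToyTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  def find(params)
    @rows.find { |row| params.all? { |key, value| row[key] == value } }
  end

  def create(params)
    @rows << params.dup
    params
  end

  # The naive implementation, with a hook between the two queries so the
  # bad interleaving can be reproduced on demand.
  def find_or_create(find_params, create_params = {})
    found = find(find_params)
    yield if block_given? # stand-in for "another process runs here"
    found || create(find_params.merge(create_params))
  end
end

table = ToyTable.new
# "Process" A finds nothing; "process" B then runs to completion before
# A gets around to inserting.
table.find_or_create({ name: "x" }) do
  table.find_or_create({ name: "x" })
end
puts table.rows.length # 2: two rows where we expected one
```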

It turns out that as far as I can tell it’s actually impossible to implement this method signature in a sensible way that satisfies the following properties:

  1. The method may always be called
  2. The method does not care about the structure of the table
  3. The method will never result in multiple inserts when called twice with the same arguments (in the presence of no other modifications to the database contents)
  4. Calling the method twice in two different processes will not error unless calling it once would error (in the presence of no other modifications to the database contents)

Here are two proposed implementations.

Just use transactions

This is the most straightforward and general implementation: you just wrap the method in a transaction. The problem is that we perform two queries and these absolutely cannot be interleaved with anyone else’s, so the transaction in question must be serializable. So the code has to look something like this:

class MyModel
   def self.find_or_create(find_params, create_params)
      database.transaction(:level => :serializable) do
         self.find(find_params) || self.create(find_params.merge(create_params))
      end
   end
end

But this suffers from a problem: What if we’re already inside a transaction? If the transaction is already serializable then that’s fine. If it’s not, though, there’s nothing we can do to fix this problem.

Additionally, support for correct handling of serializable transactions doesn’t seem to be great. Sequel, which is normally awesome, doesn’t really do the right thing with them (or rather it leaves doing the right thing up to the user and just handles setting the isolation level, which is fine really). But really we need a method that looks something like this:

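def serializable_transaction
  if current_transaction
    # Already inside a transaction: reuse it, but only if it is serializable.
    if current_transaction.serializable?
      return yield
    else
      raise ArgumentError, "Cannot start serializable transaction inside a non-serializable transaction"
    end
  else
    # Not inside a transaction: start one, and retry the whole block
    # whenever the database reports a serialization failure.
    loop do
      begin
        transaction(:level => :serializable) do
          return yield
        end
      rescue FailedToSerialize
        # Fall through to the next iteration of the loop and try again.
      end
    end
  end
end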

i.e. if we’re already in a serializable transaction we just reuse it. If we’re not inside a transaction we retry the block until we don’t get a FailedToSerialize error (which may happen if you do concurrent modification inside serializable transactions). Note this is pseudo-Sequel. Most of these methods don’t exist in Sequel.
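The retry part of that can be tried out on its own, without a database. This is a minimal sketch under stated assumptions: `FailedToSerialize` here is a made-up error class standing in for whatever your driver raises, `retry_on_serialization_failure` is a hypothetical helper name, and unlike the pseudocode above it gives up after a bounded number of attempts rather than looping forever.

```ruby
# Stand-in for the error a database driver raises when a serializable
# transaction cannot be committed (hypothetical class).
class FailedToSerialize < StandardError; end

# Runs the block, retrying on serialization failure up to max_attempts
# times, then re-raises the last failure.
def retry_on_serialization_failure(max_attempts: 5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue FailedToSerialize
    retry if attempts < max_attempts
    raise
  end
end

# Usage: a block that fails twice before succeeding.
calls = 0
result = retry_on_serialization_failure do
  calls += 1
  raise FailedToSerialize if calls < 3
  :committed
end
puts "#{result} after #{calls} attempts" # committed after 3 attempts
```

Bounding the attempts is a design choice on my part: an unbounded loop is simpler, but under heavy contention it can spin for a very long time.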

So, given correct support for serializable transactions, we can implement find_or_create as

class MyModel
   def self.find_or_create(find_params, create_params)
      database.serializable_transaction do
         self.find(find_params) || self.create(find_params.merge(create_params))
      end
   end
end

What’s the problem?

Well, firstly, the inability to use this inside non-serializable transactions is rather a big deal. We like transactions, but we don’t want to be making our transactions serializable if we can possibly avoid it.

Secondly, serializable transactions are not always supported and may be really bad ideas on some databases – e.g. they might just cause a global lock. On PostgreSQL this method should behave mostly acceptably, but it’s still not ideal due to the lack of ability to nest it.

So here’s a version which imposes a different limitation:

Rely on uniqueness constraints

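class MyModel
   def self.find_or_create(find_params, create_params)
      self.find(find_params) || self.create(find_params.merge(create_params))
   rescue UniqueConstraintViolation => e
      # Another process may have inserted the row between our find and our
      # create, so look again before giving up.
      find_again = self.find(find_params)
      # Still missing: the violation came from some other constraint, so rethrow.
      raise e unless find_again
      find_again
   end
end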

How does this work?

Well, we rely on there being a constraint enforced on the find params. If there is a uniqueness constraint that they satisfy then trying to do the insert if the find would have succeeded will reliably error.

If however there is some other unique constraint that might be violated – e.g. if some strict subset of the find params or some of the create params have a unique constraint on them – we may throw a unique constraint violation for entirely unrelated reasons. This is why we test if the row is in the database and rethrow the unique constraint violation if it’s missing.

Note that one failure mode for this is if someone in another process inserts then deletes this row. This will result in us rethrowing a UniqueConstraintViolation where it would have succeeded if we immediately did an insert. I’m not bothered by that.

I think basically all my use cases for find_or_create key off a unique value in the find params, so this is the one I’ve been using mostly. It’s strictly less general than find_or_create normally is, but has the pleasant advantage of actually being correct.
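Here is a runnable sketch of that approach against a toy table, so you can watch it hold up under concurrent calls. All the names are assumptions for illustration: `UniqueTable` is an in-memory stand-in whose Mutex plays the part of the database enforcing a unique constraint on `:name`, and `UniqueConstraintViolation` is a made-up error class.

```ruby
class UniqueConstraintViolation < StandardError; end

# Toy table with a unique constraint on :name. The Mutex stands in for
# the database enforcing the constraint atomically (hypothetical class).
class UniqueTable
  attr_reader :rows

  def initialize
    @rows = []
    @lock = Mutex.new
  end

  def find(params)
    @lock.synchronize do
      @rows.find { |row| params.all? { |key, value| row[key] == value } }
    end
  end

  def create(params)
    @lock.synchronize do
      raise UniqueConstraintViolation if @rows.any? { |row| row[:name] == params[:name] }
      @rows << params.dup
      params
    end
  end

  def find_or_create(find_params, create_params = {})
    find(find_params) || create(find_params.merge(create_params))
  rescue UniqueConstraintViolation => e
    # Someone else may have inserted our row; if not, the violation is
    # unrelated to us and gets rethrown.
    find(find_params) || raise(e)
  end
end

table = UniqueTable.new
threads = 10.times.map do
  Thread.new { table.find_or_create({ name: "x" }, { score: 1 }) }
end
results = threads.map(&:value)
puts table.rows.length # 1
```

Note that the find and the create are still two separate steps here; it is only the constraint check inside `create` that is atomic, which is the same guarantee a real unique index gives you.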

This entry was posted in programming.

Majority Judgement implementation in python

So I’ve done my first Proper open source library in ages. It’s a Python implementation of the majority judgement voting algorithm.

You can check out the source on GitHub or just use the PyPI package.

If, you know, you find yourself in need of an implementation of majority judgement in Python. I admit that’s not very likely.

Majority judgement basically works as follows: Every voter assigns each candidate a grade from a discrete list of ordinal grades (e.g. terrible, bad, OK, good, great). You then tally these grades and arrange the grades for each candidate into a sequence. You then compare these sequences lexicographically to determine who won. The first element is always the (lower) median of the candidate’s grades, and each later element is the lower median of the grades that remain once the earlier elements have been removed.

So e.g. a candidate with three OK grades, two bad ones and one great one would get the sequence:

OK, OK, Bad, OK, Bad, Great

A candidate who had two OK grades, two bad ones and two great ones would get

OK, OK, Bad, Great, Bad, Great

So they would differ in the fourth position, with the second candidate getting “great” and the first getting “OK”, so the second candidate would win.
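The two sequences above can be reproduced with a few lines of code. This is a naive sketch written in Ruby to match the other code on this page; it is not the library's code (which is Python), and the encoding of grades as integers (bad = 0, OK = 1, great = 2) and the name `judgement_sequence` are assumptions made for the example.

```ruby
# Builds the sequence by repeatedly removing the lower median from the
# sorted grades. Grades are integers, higher being better.
def judgement_sequence(grades)
  remaining = grades.sort
  sequence = []
  until remaining.empty?
    sequence << remaining.delete_at((remaining.length - 1) / 2)
  end
  sequence
end

# bad = 0, OK = 1, great = 2
first  = judgement_sequence([1, 1, 1, 0, 0, 2]) # three OK, two bad, one great
second = judgement_sequence([1, 1, 0, 0, 2, 2]) # two OK, two bad, two great

p first   # [1, 1, 0, 1, 0, 2]
p second  # [1, 1, 0, 2, 0, 2]
puts(first <=> second) # -1, i.e. the second candidate wins
```

Ruby compares arrays lexicographically with `<=>`, so the comparison step comes for free here.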

However, implementing it this way for very large electorates is on the inefficient side, so the library I released contains an optimisation: Basically instead of generating a list it generates a run length encoded list. This is nice because you can often tell how many runs you’re going to need in advance so you don’t have to generate the whole list and compress. It’s also much faster to compare heavily compressed run length encoded lists – you can compare long stretches at a time rather than having to go through one element at a time.
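To give a feel for why the run length encoded form is cheaper to compare, here is a rough Ruby sketch. It is not how the library does it: for simplicity it encodes an already generated sequence rather than producing the runs directly, and the names `run_length_encode` and `compare_runs` are made up for the example.

```ruby
# Collapse a sequence into [value, run_length] pairs.
def run_length_encode(sequence)
  sequence.chunk_while { |a, b| a == b }.map { |run| [run.first, run.length] }
end

# Lexicographic comparison of two run length encoded sequences. Each step
# consumes a whole stretch of equal values instead of a single element.
def compare_runs(left, right)
  left = left.map(&:dup)   # copy so the callers' data is not modified
  right = right.map(&:dup)
  until left.empty? || right.empty?
    (left_value, left_count), (right_value, right_count) = left.first, right.first
    return left_value <=> right_value unless left_value == right_value
    step = [left_count, right_count].min
    left.first[1] -= step
    right.first[1] -= step
    left.shift if left.first[1].zero?
    right.shift if right.first[1].zero?
  end
  # One side ran out: the shorter sequence sorts first, as with arrays.
  left.length <=> right.length
end

a = run_length_encode([1, 1, 0, 1, 0, 2])
b = run_length_encode([1, 1, 0, 2, 0, 2])
p a                   # [[1, 2], [0, 1], [1, 1], [0, 1], [2, 1]]
puts compare_runs(a, b) # -1
```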

Another optimisation that this library performs is that the evaluation is lazy: We only ever work out as much of the sequence as is needed.
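The lazy idea can be sketched with Ruby's Enumerator. Again this is an illustration and not the library's implementation: `lazy_judgement_sequence` and `compare_lazily` are hypothetical names, grades are integers as an assumption, and the comparison assumes both candidates received the same number of grades.

```ruby
# Yields the majority judgement sequence one element at a time; nothing
# beyond the initial sort is computed until a caller asks for it.
def lazy_judgement_sequence(grades)
  Enumerator.new do |yielder|
    remaining = grades.sort
    until remaining.empty?
      yielder << remaining.delete_at((remaining.length - 1) / 2)
    end
  end
end

# Walks both sequences in step and stops at the first position where
# they differ. Assumes both sequences have the same length.
def compare_lazily(left, right)
  left.lazy.zip(right).map { |a, b| a <=> b }.find { |result| result != 0 } || 0
end

first  = lazy_judgement_sequence([1, 1, 1, 0, 0, 2])
second = lazy_judgement_sequence([1, 1, 0, 0, 2, 2])

p first.first(2)                   # [1, 1]
puts compare_lazily(first, second) # -1
```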

An optimisation this library doesn’t perform yet, but that I want it to, concerns another type of structure I’d like to compress: Frequently you end up with situations where you’ve got a long sequence which looks like OK, Good, OK, Good, OK, Good, etc. repeating for a long stretch (e.g. you can get this when you’ve got just two grades assigned to a candidate and the same number of each). I’d like to compress those down in a similar manner to the run length encoding. I roughly know how the code for that will look, I just haven’t

Although this is a very early stage of the library, I’m actually pretty happy to label it production ready. It’s not going to change much except in implementation details as the interface is basically “It looks like a list but behind the scenes it’s doing cunning things”, and the test coverage is probably the highest of any project I’ve worked on – it’s got about half again as many lines of tests as lines of code and all those tests are heavily parametrized so there are effectively nearly 600 test cases for < 200 lines of code. Branch coverage is 100% and I intend to keep it that way.

This entry was posted in programming.