David R. MacIver's Blog: Implementing find_or_create correctly is impossible

Implementing find_or_create correctly is impossible

22 February 2013

There’s a method that seems to be present in every single ruby ORM. I assume it’s reasonably common outside for ruby ORMs too, but I hadn’t noticed it before I started using Ruby.

The method is called find_or_create. Its looks something like:

class MyModel
   def self.find_or_create(find_params, create_params)
      self.find(find_params) || self.create(find_params.merge(create_params))
   end
end

(although all code in this post will be ruby, the examples are using a pseudo-ORM that most closely resembles Sequel)

In fact, in many cases, this is exactly what the implementation looks like too.

The thing about this implementation is that it is completely wrong in the presence of concurrency: If you call find_or_create at the same time in two different processes with the same arguments, it will almost certainly result in duplicated creates. If there is a uniqueness constraint then this will cause one of them to error. If there is not, you’ll get two rows when you expected one.

It turns out that as far as I can tell it’s actually impossible to implement this method signature in a sensible way that satisfies the following properties:

The method may always be called
The method does not care about the structure of the table
The method will never result in multiple inserts when called twice with the same arguments (in the presence of no other modifications to the database contents)
Calling the method twice in two different processes will not error unless calling it once would error (in the presence of no other modifications to the database contents)

Here are two proposed implementations.

Just use transactions

This is the most straightforward and general implementation. You just wrap the method in a transaction. The problem is, that we only perform two queries and these absolutely cannot be interleaved, so the transaction in question must be serializable. So the code has to look something like this:

class MyModel
   def self.find_or_create(find_params, create_params)
      database.transaction(:level => :serializable) do
         self.find(find_params) || self.create(find_params.merge(create_params))
      end
   end
end

But this suffers from a problem: What if we’re already inside a transaction? If the transaction is already serializable then that’s fine. If it’s not though there’s nothing we can do to fix this problem.

Additionally, support for correct handling of serializable transactions doesn’t seem to be great. Sequel, which is normally awesome, doesn’t really do the right thing with them (or rather it leaves doing the right thing up to the user and just handles setting the isolation level, which is fine really). But really we need a method that looks something like this:

def serializable_transaction
  if current_transaction
    if current_transaction.serializable?
      return yield
    else
      raise ArgumentError, "Cannot start serializable transaction inside a non-serializable transaction"
    end
  else
    loop do
      begin
        transaction(:serializable => true) do
          return yield
        end
      rescue FailedToSerialize
      end
    end
  end
end

i.e. if we’re already in a serializable transaction we just reuse it. If we’re not inside a transaction we retry the block until we don’t get a FailedToSerialize error (which may happen if you do concurrent modification inside serializable transactions). Note this is pseudo-Sequel. Most of these methods don’t exist in Sequel.

So, given correct support for serializable transactions, we can implement find_or_create as

class MyModel
   def self.find_or_create(find_params, create_params)
      database.serializable_transaction do
         self.find(find_params) || self.create(find_params.merge(create_params))
      end
   end
end

What’s the problem?

Well, firstly, the inability to use this inside non-serializable transactions is rather a big deal. We like transactions, but we don’t want to be making our transactions serializable if we can possibly avoid it.

Secondly, serializable transactions are not always supported and may be really bad ideas on some databases - e..g. they might just cause a global lock. On PostgreSQL this method should behave mostly acceptably, but it’s still not ideal due to the lack of ability to nest it.

So here’s a version which imposes a different limitation:

Rely on uniqueness constraints

class MyModel
   def self.find_or_create(find_params, create_params)
      begin
         return self.find(find_params) || self.create(find_params.merge(create_params))
      rescue UniqueConstraintViolation => e
         find_again = self.find(find_params)
         if find_again
            return find_again
         else
            raise e
         end
      end
   end
end

How does this work?

Well, we rely on there being a constraint enforced on the find params. If there is a uniqueness constraint that they satisfy then trying to do the insert if the find would have succeeded will reliably error.

If however there is some other unique constraint that might be violated - e.g. if some strict subset of the find params or some of the create params have a unique constraint on them - we may through a unique constraint violation for entirely unrelated reasons. This is why we test if the row is in the database and rethrow the unique constraint violation if it’s missing.

Note that one failure mode for this is if someone in another process inserts then deletes this row. This will result in us rethrowing a UniqueConstraintViolation where it would have succeeded if we immediately did an insert. I’m not bothered by that.

I think basically all my use cases for find_or_create key off a unique value in the find params, so this is the one I’ve been using mostly. It’s strictly less general than find_or_create normally is, but has the pleasant advantage of actually being correct.