Why you should use a single repository for all your company’s projects

In my post about things that might help you write better software, a couple of points are controversial. Most of them, I think, are controversial for uninteresting reasons, but monorepos (putting all your code in one repository) are controversial for interesting ones.

With monorepos the advice is controversial because monorepos are good at everything you might reasonably think they’re bad at, while many small repositories (one per project) are bad at everything you might reasonably think they’re good at.

First a note: When I am talking about a monorepo I do not mean that you should have one undifferentiated ball o’ stuff where all your projects merge into one. The point is not that you should have a single project, but that one repository can contain multiple distinct projects.

In particular, a monorepo should be organised as a number of subdirectories, each of which looks more or less like the root directory of what would otherwise have been its own repo (possibly with additional directories for grouping projects together, though for small to medium-sized companies I wouldn’t bother).

The root of your monorepo should have very little in it – a README, maybe some useful scripts for managing workflows, etc. It certainly shouldn’t have a “src” directory, or whatever your local equivalent for where the code goes is.
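
For illustration, here is a hypothetical sketch of the sort of layout I mean (all the project names are invented):

```
monorepo/
├── README.md
├── scripts/            # shared workflow tooling
├── website/            # would otherwise have been the "website" repo
│   ├── README.md
│   ├── src/
│   └── tests/
└── billing-service/    # would otherwise have been the "billing-service" repo
    ├── README.md
    ├── src/
    └── tests/
```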

But given this distinction, I think almost everyone will benefit from organising their code this way.

For me the biggest advantages of this style of organisation are:

  1. It is impossible for your code to get out of sync with itself.
  2. Any change can be considered and reviewed as a single atomic unit.
  3. Refactoring to modularity becomes cheap.

There are other advantages (and some disadvantages that I’ll get to later), but those three are necessary and sufficient for me: On their own they’re enough to justify the move to a monorepo, and without them I probably wouldn’t be convinced it was worth it.

Let’s break them down further:

It is impossible for your code to get out of sync with itself

This is relatively straightforward, but is the precursor to the other benefits.

When you have multiple projects across multiple repos, “Am I using the right version of this code?” is a question you always have to be asking yourself – if you’ve made a change in one repository, is it picked up in a dependent one? If someone else has made changes in two repositories, have you updated both of them? Etc.

You can partly fix this with tooling, but people mostly don’t, and when they do the tooling is rarely perfect, so it remains a constant low-grade annoyance and source of wasted time.

With a monorepo this just doesn’t happen. You are always using the right version of the code because it’s right there in your local repo. Everything is kept entirely self-consistent because there is only a single coherent notion of version.

Any change can be considered and reviewed as a single atomic unit

This is essentially a consequence of the “single consistent version” feature, but it’s an important one: If you have a single notion of version, you have a single notion of change.

This is a big deal for reviewing and deploying code. The benefit for deploying is straightforward – you now just have a single notion of version to be put on a server, and you know that version has been tested against itself.

The benefit for reviews is more interesting.

How many times have you seen a pull request that says “This pull request depends on these pull requests in another repo”?

I’ve seen it happen a lot. When you’ve got well-factored libraries and/or services then it’s basically routine to want to add a feature to them at the same time as adding a consumer of that feature.

As well as adding significant overhead by splitting what is logically a single step into multiple steps, I find this makes code review significantly worse: You often end up either having to review something without the context needed to understand it, or constantly switching between the linked pull requests.

This could in principle be fixed with better tooling for multi-repo review, but current review tools don’t seem to support it well, and at that point it really feels like you’re emulating a monorepo on top of a collection of repos just so you can say you don’t have one.

Refactoring to modularity becomes cheap

This is by far the biggest advantage of a monorepo for me, and is the one that I think is the most counter-intuitive to people.

People mostly seem to want multiple repositories for modularity, but multiple repositories actually significantly hurt modularity compared to a monorepo.

There are two major reasons for this:

The first builds on the previous two features of a monorepo: Because multiple repositories add friction, if repositories are the basis of your modularity then every time you want to make things more modular by extracting a library, a sub-project, etc. you are adding significant overhead – now you have two things to keep in sync. The previous problems compound this directly: If I have some code I want to extract from one project into a common library, how on earth do I juggle the versions and reviews for that in a way that isn’t going to mess everyone up? It’s certainly manageable, but it’s enough of a pain that you will be reluctant to do what should be an easy thing.

The second is that by enforcing boundaries between projects across which it is difficult to refactor, you end up building up cruft along the boundary: If project A depends on project B, what will tend to happen if they are in separate repos is that A will build up an implicit “B-compatibility layer” of things that should have happened in B but were just slightly too much work. This means both that different users of B end up duplicating work, and that it often becomes much harder to make changes to B, because you’d then need to change all the different weird compatibility layers at once.

In a monorepo this doesn’t happen: If you want to make an improvement to B as part of a change to A, you just do so as part of the same change – there’s no problem (if B has a lot of other dependants and you’re changing rather than adding things there might be, but the build will tell you). Everyone then benefits, and the cruft is minimised.

I’ve seen this borne out in practice multiple times, at both large and small scales: Well organised single repositories actually produce much more modular code than trying to split it out into multiple repositories would allow for.

Reasons you might still want to have multiple repositories

It has almost always been my experience that multiple repositories are better consolidated, but there are a few exceptions.

The first is when your company’s code is partly but not entirely open source. In this case it is probably useful to have one or more separate repositories for your open source projects. This doesn’t make the problems of multiple repositories go away, mind you – it just means that you can’t currently use the better system!

Similarly, if you’re doing client work where you are assigning copyright to someone else then you should probably keep per client repos separate.

Ideally I’d like to solve this problem with tooling that made it easy to mirror directories of a monorepo as individual repositories, but I’m not aware of any good systems that do that right now.
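
For what it’s worth, git’s bundled subtree command can approximate this one-way mirroring today, though it falls well short of a polished system. A rough sketch, with the directory and remote names invented for illustration:

```bash
# Extract the history of just the open-source/mylib directory into
# its own branch, then push that branch to the public mirror repo.
# (Directory, remote URL, and branch names are all hypothetical.)
git subtree split --prefix=open-source/mylib -b mylib-mirror
git push git@github.com:mycompany/mylib.git mylib-mirror:master
```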

The other reason you might want to have multiple repositories is if your codebase is really large. The Linux kernel is about 15 million lines of code and works more or less fine as a single git repository, so that’s a rough idea of the scale at which “large” starts (if you have non-text assets checked in you may hit that limit faster, but I’ve not seen that be much of a problem in practice).

This is another thing that should be fixable with tooling. To some extent it’s in the process of being fixed: Various large companies are investing in the ability of git and mercurial to handle larger scales.

If you do not currently have this problem you should not worry about having this problem. It will take you a very long time to get there, and if the tools haven’t improved sufficiently by the time you do then you can almost certainly afford to pay someone to customise them to your use case like the current generation of large companies are doing.

The final thing that might cause you to stick with multiple repos is the tooling you’ve built around your current multi-repo setup. In the long run I think you’d still benefit from a monorepo, but if all your deploys, CI, etc. have multi-repo assumptions baked into them then that can significantly increase the cost of migrating.

What to do now

Embrace the monorepo. The monorepo is your friend and wants you to be happy.

More seriously, the nice thing about this observation is that it doesn’t require you to go all in to get the benefits. All of the above benefits of having one repository also extend to having fewer repositories. So start reducing your repository count.

The first and easiest thing to do is simply to stop creating new repositories. Either create a new monorepo, or designate some existing large project as the new monorepo by moving its current contents into a subdirectory. Everything that you’d previously have created a repository for now goes in as a new directory there.
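
If you go the “designate an existing project” route, the mechanics are simple; a minimal sketch, with all the names hypothetical:

```bash
# Inside the repo that is becoming the monorepo: move its current
# contents into a subdirectory named after the original project.
# (Adjust the file list to whatever is actually in your repo.)
mkdir bigproject
git mv README.md src tests bigproject/
git commit -m "Move bigproject into a subdirectory of the new monorepo"
```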

Now move existing projects into subdirectories as and when it is convenient, e.g. before starting a large chunk of work that touches multiple repositories. Supposedly if you’re using git you can do this in a way that preserves their history, though when I’ve done it in the past I have typically just thrown away the history (or rather, kept the old repository around as read-only for when I wanted to consult it).
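
For reference, one history-preserving approach uses git’s bundled subtree command; a sketch, with the path and directory name invented:

```bash
# From the monorepo root, import another repository as a subdirectory
# while keeping its history reachable via a merge commit.
# (The source path and prefix directory are hypothetical.)
git subtree add --prefix=oldproject ../oldproject.git master
```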

This may require some modifications to your deployment and build scripts to tie everything together, but it should be minor and most of the difficulty will come the first time you do it.

And you should feel the benefit almost immediately. Whenever I’ve done this it’s felt like an absolute breath of fresh air, and has immediately made me happier.


8 thoughts on “Why you should use a single repository for all your company’s projects”

  1. Scott Muc

I’m in the camp of keeping the fewest repositories possible as well. Where I find myself having a hard time selling this is in the following scenarios:

1 – The organisation wants super granular access controls, or some people want repositories separated in a “Single Responsibility” structure. For both of these, I find the monorepo satisfies the concern because it affords easier audit processes (and I much prefer audit controls to control controls). The Linux kernel is an interesting example. What are the artifacts of all the code when built?

2 – Triggering build pipelines is difficult to set up in some tools. People have gotten used to pointing their CI tool of choice at a repository and having everything work by convention. Also, builds become slower because of cloning larger repositories (in environments where the build infrastructure is ephemeral and state is not expected to linger).

    3 – The history firehose becomes difficult to parse.

Personally, I can work with these hurdles to gain all the things you’ve mentioned. Not having to follow a chain of Pull Requests to add a feature is how I prefer to work. I continue to have a hard time convincing people that they would be much better off just leaving things in one repository. If you can’t keep it neat and tidy in one repository, what makes anyone think it’ll be neat and tidy when it spreads out into all those separate repositories? Directories are good organisational structures too.

  2. Greg

What about disclosing everything the company is working on to anybody working for it? Let’s say you have two teams working on different products – surely it would not be wise to keep both code bases in the same repository.

    1. david Post author

      I’ll definitely grant that if you’re working somewhere where you have to treat other people within your company as hostile adversaries there are probably better things you can do to improve productivity than moving to a single repo.

  3. Tom

What are your thoughts from a security perspective? A single compromised developer host would mean all IP lost in the monorepo model, whereas with multiple repos and a least-privilege model a single compromise would at least limit the IP lost.

The first thing that comes to mind is the HackingTeam breach in 2015. Now, this wasn’t a repo that was compromised, but it was all (or at least thought to be all) 400GB of their IP related to malware, exploits, and other hacking software. The breach was said to be of a single account within the organization. If all IP resides in a single repo, or server in this case, doesn’t that increase your risk?
    https://www.wired.com/2015/07/hacking-team-breach-shows-global-spying-firm-run-amok/

The other one that comes to mind is Code Spaces. Again, not a single repo, but all their eggs were in a single proverbial basket (AWS), similar to a single repo. A single compromise put the entire company out of business. In the same regard, a single compromise would mean ALL IP stolen (not that you’ll go out of business, but certainly brand-impacting).
    https://threatpost.com/hacker-puts-hosting-service-code-spaces-out-of-business/106761/

    Is this a risk businesses are accepting by moving to a mono-repo? Are the gains that much greater than the risk? Or, is my tinfoil hat just fitting a bit tighter than normal today?

I’ve seen a few articles (this one in particular – https://about.gitlab.com/2014/11/26/keeping-your-code-protected/) that outline “protected” branches. Which seems great, but doesn’t solve the “Confidentiality” issue when looking at this problem from the CIA (Confidentiality, Integrity, and Availability) perspective.

Also, regarding the efficiency side of the monorepo: At what point does having many GB of code sitting in your repo become cumbersome when, say, a team of front-end devs are only working on a small subset of those files?

Adding to the efficiency side of things, if you need to outsource a problem, I’m assuming you would split out that code and bring the contractor(s) into a separate repo. Does it again become cumbersome to test code that’s being developed internally against code that’s being developed externally? And then to merge the new code back into the repo once the project is complete or as sprints come to a close?

    Sorry for the large number of questions, this is just an issue I’m trying to explore a bit further and attempting to discover other perspectives out there that might shed some light on how others are using git securely.

    Great article and appreciate your time!

    1. david Post author

      > Is this a risk businesses are accepting by moving to a mono-repo? Are the gains that much greater than the risk? Or, is my tinfoil hat just fitting a bit tighter than normal today?

      Generally speaking, leaking code is not actually a particularly big deal unless you have secrets like SSH private keys or the equivalent checked into your repo (which you shouldn’t do whether you have one or many repos). It’s a big deal, but it’s not *that* big a deal and is unlikely to actually cause much medium to long run commercial damage.

      I know some companies have separate repos for anything that is for some reason extremely sensitive (e.g. because it’s a major part of their competitive advantage against other similar companies) and then one repo for everything else, which seems workable.

      Data (especially backup data) and servers need to be much better siloed than code, but I think that’s really an orthogonal problem which doesn’t have much to do with how you organise your code.

      > Also, regarding the efficiency side of the mono-repo. At what point does having many Gb of code sitting in your repo, when, say, a team of front-end dev’s are only working on a small subset of those files become cumbersome?

It mostly doesn’t seem to be too bad until you hit a very large scale. The initial clones can get quite slow, but after that the tooling mostly handles it OK.

      > Adding to the efficiency side of things, if you need to outsource a problem I’m assuming you would split out that code and bring in the contractor(s) to a separate repo. Does it again become cumbersome to test with code that’s being developed internally vs code that’s being dev’ed externally? Then, having to merge the new code back into the repo once the project is complete or as sprints come to close?

I’ve no direct experience of how this works, but I definitely know that some of the larger companies doing this just solve it by getting the contractors to sign NDAs and giving them access. They also tend to have internal tooling that lets them do more fine-grained access control.

      Otherwise, it’s not the end of the world to have them work on their own repo and either depend on it from that repo or integrate it into yours – it’s no worse than you’d have to deal with if you were using multiple repos yourself.

  4. Glen Mailer

One thing I often see glossed over in posts like this is the atomicity (or lack thereof) of deploys. My own opinion is that monorepos benefit you most when you don’t have well-defined interfaces between components and modules – which could be a valid and effective trade-off to make.

    When you run a monorepo locally, sure – everything is on the right version.
    When you want to make a wide-reaching change, sure – you can see a single atomic change.

    But when you come to ship this into production where you have a large number of running processes – the only way to get an atomic change to all of the new code is to turn everything off, update, and turn everything back on again. I worry that a monorepo approach and the confidence it gives for large changes in development and source control can make it easy to overlook this aspect of deployment.

    Do you have any thoughts on this?

    1. david Post author

      > Do you have any thoughts on this?

      A couple. They mostly boil down to “I haven’t really seen this as being a problem in practice, but here are some ideas that might help if it is one for you”.

      The first is that I generally think you should be running fewer processes in production anyway (http://www.drmaciver.com/2014/03/write-libraries-not-services/), which makes this less of an issue. But I accept this is another controversial point to swallow and I don’t think this one is dependent on the other.

The second is that I think if this is something that is important to worry about, then it’s important to test (I’ll confess I’ve never really seen it tested in real life – most people just seem to adopt a policy of “deploy and hope for the best”). The right way to fix this is not to separate things out into multiple repos and hope that that somehow fosters an attitude of people thinking about it the right way (I haven’t particularly noticed that it does), but to run the tests for your deployed versions against the new versions of the services they depend on. Having a unified project structure with explicit dependencies between projects probably makes this easier to do, but again I have not actually tried it myself.

As a side note, I think you could actually do atomic deploys fairly easily in a multi-process model, but it would require some moderately fancy tooling as a precursor: If you attach a version tag (e.g. a git commit hash, or just a release counter) to all your running processes, then you can use load balancer config or service discovery to ensure that processes only talk to things of the same version, and once the leaf nodes (e.g. public API endpoints or web servers) are up and green, start load balancing to them. Having this tooling is probably a good idea anyway, but it’s probably overkill at smaller scales.
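
A minimal sketch of the version-pinning idea, assuming a hypothetical flat-file registry where each running process has recorded itself – a real setup would use proper service discovery:

```bash
# Hypothetical sketch only: resolve backends that match this build's
# version tag. Assumes each process appended a line of the form
#   service:version:host:port
# to /etc/service-registry when it started. (The registry file is
# invented for illustration; a real deployment would use something
# like Consul or etcd instead.)
WANT_VERSION=$(git rev-parse --short HEAD)  # the tag baked into this release

# Print host:port for every "api" backend running the same version.
awk -F: -v v="$WANT_VERSION" \
    '$1 == "api" && $2 == v { print $3 ":" $4 }' /etc/service-registry
```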

As far as not having well-defined interfaces between components and modules goes, it depends what you mean by well-defined. Certainly you benefit less from this if you have stable public interfaces between modules that you’re committing to. But if “well-defined” doesn’t include “stable over time” then I’ve generally not found that to be the case – I’ve seen much cleaner module boundaries result from moving things to monorepos, because the cost of creating them is much lower.
