Author Archives: david

How hard can it be?

There are two types of people in the world:

  1. People who assume jobs they haven’t done and don’t understand are hard
  2. People who assume jobs they haven’t done and don’t understand are easy
  3. People who try to divide the world into overly simple two item lists

Joking aside, there’s definitely a spectrum of attitudes in terms of how you regard jobs you don’t understand.

Developers seem very much to cluster around the “jobs I don’t understand are easy” end (“Not all developers” you say? Sure. Agreed. But it seems to be the dominant attitude, and as such it drives a lot of the discourse). It may be that this isn’t just developers but everyone. It seems especially prevalent amongst developers, but that may just be because I’m a developer so this is where I see it. At any rate, this is about the bits that I have observed directly, not the bits that I haven’t, and about the specific way it manifests amongst developers.

I think this manifests in several interesting ways. Here are two of the main ones:

Contempt for associated jobs

Have you noticed how a lot of devs regard ops as “beneath them”? I mean it just involves scripting a couple of things. How hard is it to write a bash script that rsyncs some files to a server and then restarts Apache?? (Note: If your deployment actually looks like this, sad face).

What seems to happen with devs and ops people is that the devs go “The bits where our jobs overlap are easy. The bits where our jobs don’t overlap I don’t understand, therefore they can’t be important”.

The thing about ops is that their job isn’t just writing the software that does deployment and similar. It’s asking questions like “Hey, so, this process that runs arbitrary code passed to it over the network… could it maybe not do that? Also, if it has to do that, perhaps we shouldn’t be running it as root” (Let’s just pretend this is a hypothetical example that none of us have ever seen in the wild).

The result is that when developers try and do ops, it’s by and large a disaster. Because they think that the bits of ops they don’t understand must be easy, they don’t understand that they are doing ops badly. 

The same happens with front-end development. Back-end developers will generally regard front-end as a trivial task that less intelligent people have to do. “Just make it look pretty while I do the real work”. The result is much the same as ops: It’s very obvious when a site was put together by a back-end developer.

I think to some degree the same happens with front-end developers and designers, but I don’t have much experience of that part of the pipeline so I won’t say anything further in that regard.

(Note: I am not able to do the job of an ops person or the job of a front-end person either. The difference is not that I know their job is hard and can therefore do it. The difference is that I know their job is hard, so I don’t con myself into thinking that I can do it as well as they can. The solution is to ask for help, or at least, if you don’t, to not pretend that you’ve done a good job.)

Buzzword jobs

There seems to be a growing category of jobs that are basically defined by developers going “Job X: How hard can it be?” and creating a whole category out of doing that job like a developer. Sometimes this genuinely does achieve interesting things: Cross-fertilisation between domains is a genuinely useful thing that should happen more often.

But often when this happens, the actual job the developers are trying to replace ends up being done badly, and a lot of the things that were important about it are lost.

Examples:

  1. “Dev-ops engineer” – Ops: how hard can it be? (Note: There’s a lot of legit stuff that also gets described as dev-ops. That tends to be more under the heading of cross-fertilisation than devs doing ops. But a lot of the time dev-ops ends up as devs doing ops badly)
  2. “Data scientist” – Statistics: How hard can it be?
  3. “Growth hacker” – Marketing: How hard can it be? (actually I’m not sure this one is devs’ fault, but it seems to fit into the same sort of problem)

People are literally creating entire job categories out of the assumption that the people who already do those jobs don’t really know what they’re doing and aren’t worth learning from. This isn’t going to end well.

Conclusion

The main thing I want people to take from this is “This is a dick move. Don’t do it”. Although I’m sure there are plenty of jobs that are not actually all that hard, most jobs are done by people because they are hard enough that they need someone dedicated to doing them. Respect that.

If you really think that another profession could benefit from a developer’s insight because they’re doing things inefficiently and surely it would all be so much better with software, then talk to them. Put in the effort to find out what their job involves. Talk to them about the problems they face. Offer them solutions to their actual problems and learn what’s important. It’s harder than just assuming you know better than them, but it has the advantage of being both the right thing to do and way less likely to result in a complete disaster.

This entry was posted in life, programming, rambling nonsense.

Correlation does not imply correlation

By now you’ve probably seen a graph like this:

[Graph from tylervigen.com showing two unrelated variables that track each other closely]

The site tylervigen.com finds spurious and entertaining correlations. It’s pretty good. If you haven’t seen it I encourage you to go check it out.

Chances are if you’ve already seen it then you’ve seen it linked with some caption like “Reminder that correlation does not imply causation”.

I wish this wasn’t the take-home message that people were deriving from this, for two main reasons:

Firstly, I just don’t think this message needs pushing any harder. Most of the people who can usefully receive it either already understand stats well enough that they don’t need it or already understand stats badly enough that every time someone posts a paper they smugly assert “Ah, but correlation doesn’t imply causation” and then pat themselves on the back and say “Well done, me. You did a science. Good job. You deserve your cookie” (Sorry. This is one of the many peeves in my menagerie of pets). “Correlation doesn’t imply causation” has more or less reached saturation, and that saturation point is far larger than is actually useful.

But secondly, there’s something more interesting going on here.

Some of these correlations are very much of the classic common-causal-variable sort: Neither causes the other, but both are caused by the same thing. e.g. in the graph above, what we have is two variables which are increasing over time, so they correlate. This is unsurprising (if you’re wondering why they track so well on the graph, that’s another interesting thing demonstrated here: Note that the two variables aren’t on the same scale. They only track that closely because the scale was adjusted to fit. This doesn’t actually affect the correlation, but it makes its display more convincing). More of them are like this:

[Graph: age of Miss America vs. murders by steam, hot vapours and hot objects]

Sure, the tracking is not quite as perfect as before, but it’s still remarkably good.

Clearly people are inspired by Miss America to commit hot steamy murder.

Except, well, that’s probably not what’s really going on. I mean, I’m not ruling out the possibility that there is some underlying common causal factor here, but it doesn’t seem very likely. What seems more likely is that this correlation has zero predictive power and will actually fall apart if more data points are added. All that produced this graph is that the dice happened to roll the right way – the fact that the graphs have tracked each other for all 11 data points is nothing more than a coincidence.

“But David,” you cry, “that seems incredibly unlikely. How could such a perfect relationship occur by chance?”

I’m glad you asked, suspiciously convenient anonymous blog commenter. I’m glad you asked.

The simple answer is that this happens in two ways. First, we only have 11 data points. This means that you can get quite high correlations purely by chance. The standard critical value for a correlation which will occur by chance no more than one time in 100 for this many data points is 0.684. This is based on a model that doesn’t actually hold here, as it assumes normality of the data, which is unlikely to hold for all these variables, but it illustrates the point. Also, regardless of what your model is, a correlation of any sort of variables that only holds by chance about one time in a hundred is probably a pretty reasonable correlation.
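If you want a sanity check on that number, a quick Monte Carlo sketch does the job (illustrative only: it uses normally distributed data, i.e. the same assumption the textbook critical value relies on):

    import numpy as np

    rng = np.random.default_rng(0)
    n_points, n_trials, critical_r = 11, 100_000, 0.684

    hits = 0
    for _ in range(n_trials):
        x = rng.standard_normal(n_points)
        y = rng.standard_normal(n_points)
        r = np.corrcoef(x, y)[0, 1]
        if r >= critical_r:  # a positive correlation at least this strong, purely by chance
            hits += 1

    print(f"fraction of random pairs with r >= {critical_r}: {hits / n_trials:.4f}")
    # should come out at roughly 0.01, i.e. about one time in a hundred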

Still, one time in a hundred is a pretty big coincidence, right?

Well… it is if you’re not looking for it. The thing about things which happen one time in a hundred is that they’re surprising if you look at a single thing and it happens, but they’re not very surprising when you look at 100 things and one of them exhibits it.

And when looking at correlations there are a lot more than a hundred things you’re looking at, because you’re not just looking at the individual variables, you’re looking at pairs of them, and there are a lot more pairs than variables. So if there were e.g. 100 different variables then there would be 4950 pairs of variables (the formula is \(\frac{n(n-1)}{2}\) – there are n choices for the first variable and n – 1 choices for the other, because it has to be a different one. This overcounts by a factor of two because the pairs are unordered, so you can get each pair in two ways depending on which you pick first, hence divide by 2).

This means that if you were looking for correlations that only occur one time in a hundred, you’d expect to find about 50 of them in 100 variables. And every time you double the number of variables, this goes up by a factor of (nearly) 4 – so if you had 200 variables you’d find 200 correlations, etc. In fact, the site seems to have about 3500 variables (based on some html scraping) so you’d expect it to find about 61,000 different significant correlations. Suddenly the fact that it was able to mine through all those variables to find things which track so well doesn’t seem very surprising, does it?
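The arithmetic is easy to check for yourself (the 3500 figure is only a scraping-based estimate, so treat the final number as a ballpark):

    def n_pairs(n_variables):
        # unordered pairs: n choices for the first variable, n - 1 for the second,
        # then divide by 2 because each pair gets counted twice
        return n_variables * (n_variables - 1) // 2

    for n in (100, 200, 3500):
        pairs = n_pairs(n)
        expected = pairs * 0.01  # correlations this strong turn up ~1 time in 100 by chance
        print(f"{n:5d} variables -> {pairs:9,d} pairs -> ~{expected:,.0f} chance correlations")

With 100 variables that gives about 50, with 200 about 200, and with 3500 about 61,000, matching the figures above.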

Which is the message I’d like people to derive from this site. Not that correlation doesn’t imply causation, but that there will always appear to be correlations if you look hard enough, even where none really exist, and you shouldn’t expect that the correlations you find that way will actually predict anything about future values. This is an object lesson about the dangers of running too many experiments and not considering that most of the positive results will be false positives, not about the conclusions you draw from those results.

This entry was posted in Numbers are hard.

How UK MEP elections work

A friend commented on facebook (and rightly so) that the list of MEP candidates in London was very distorted in favour of getting us out of Europe. I pointed out that the MEP electoral system we used was vulnerable to vote splitting (it has to be: You only get to vote for one party), so in some sense this was a good thing (although I hate to celebrate bad voting systems).

But it made me realise that I’m actually incredibly unclear as to what the voting system for our MEPs is. It turns out that I’m not alone. The available literature is terrible. It’s scattered, hard to find and poorly explained. This is my attempt to make sense of it. You might want to just go read the wikipedia page instead, as it’s actually pretty good, but I was most of the way through finishing this when I found that particular page and decided I might as well keep doing the research. These are my research notes.

Note that this is UK specific, though the details are not unique to us in that a lot of other EU countries use similar systems. The EU imposes various restrictions on the voting systems used. It requires that the system be a form of proportional representation, and provides some bounds on what sort of proportional representation, but it does not mandate a specific form. It also allows for countries to be divided up into constituencies who each use a form of proportional representation (it is unclear to me what would happen if a country wanted to subdivide into constituencies of one member each and use FPTP in each of them. This would obviously be extremely non-proportional). Most countries don’t, but some do.

So step one is that there are 12 constituencies in the UK. Each of these elects multiple MEPs, with a different number for each constituency. Note that these constituencies do not map directly onto the constituencies for electing your MP – they are much larger. I’m unclear on whether some of the normal electoral constituencies cross multiple MEP constituencies, but I don’t think they do (EDIT: Alex Foster confirms in the comments that they don’t).

The regions and number of MEPs are as follows:

  • Eastern – 7
  • East Midlands – 5
  • London – 8
  • North East – 3
  • North West – 8
  • South East – 10
  • South West – 6
  • West Midlands – 7 (NOTE: This is one more than last time, apparently due to the Lisbon Treaty. The last election was rerun with the original counts to make up the difference when this change came into effect)
  • Yorkshire and Humber – 6
  • Wales – 4
  • Scotland – 6
  • Northern Ireland – 3

So there are 73 UK MEPs in total (up from 72).

(Source)

How these MEPs are then elected varies by constituency. Every constituency other than Northern Ireland uses the D’Hondt method. Northern Ireland uses STV.

…or at least that’s what seems to be said all over the place, e.g. on the gov.uk page about voting systems, but the guidance paper for returning officers makes no mention of anything other than D’Hondt. Additionally, that’s the only place where I’ve found them come out and say “We use the D’Hondt method” as opposed to giving a half-baked and incomplete explanation of it. I’m pretty sure it’s accurate though.

The D’Hondt method is essentially an extension of first past the post to multiple members (supposedly a proportional one; I haven’t seen the maths that justifies this, but it seems to work out that way in examples).

It’s a party-list proportional method. What that means is that you vote for parties, not individuals, and each party puts forth a list of candidates. If they get N seats then the top N people on their list get in.

The way it works is that you run as many rounds as there are candidates to elect. In each round, a party gets a score equal to the number of votes it got divided by 1 + the number of candidates it has already had elected (so if it has no candidates so far its score is the number of votes it got; if it has one candidate elected already its score is half that, etc). The party with the highest score gets a candidate elected and you move on to the next round, until you’ve elected enough members.
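As a sketch of that procedure (with made-up party names and vote counts, and ignoring how ties are broken), it looks something like this:

    def dhondt(votes, seats):
        """Allocate seats by the D'Hondt method: each round, a party's score is
        its votes divided by (1 + seats it has already won); the top score wins
        the seat."""
        won = {party: 0 for party in votes}
        for _ in range(seats):
            scores = {party: votes[party] / (1 + won[party]) for party in votes}
            winner = max(scores, key=scores.get)
            won[winner] += 1
        return won

    # Hypothetical five-seat constituency with three made-up parties:
    print(dhondt({"Party A": 100_000, "Party B": 80_000, "Party C": 30_000}, seats=5))
    # -> {'Party A': 3, 'Party B': 2, 'Party C': 0}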

(It appears to be the case that all MEP elections must use either a proportional party list system or an STV system. I am unclear on exactly what “the list system” constitutes here, but it doesn’t appear to be as restrictive as to specify the D’Hondt method given the range of examples described).

Northern Ireland elects its MEPs with the variant of STV it normally uses. It’s a pretty damn good one. The neat feature of it is that it avoids any ambiguity over which votes are transferred (which can affect the result of an election quite significantly) by transferring all votes but at a fractional value. You calculate the number of excess votes for a candidate that should be transferred onwards and distribute that to each vote according to the fraction of the vote it made up (this isn’t the same as distributing it equally amongst each voter being transferred onwards because some of those may already be at reduced value because they were transferred from a previous candidate). This is the Gregory method of STV.
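To make the fractional transfer concrete, here’s a rough sketch of one common formulation of it (not necessarily the exact counting rules Northern Ireland uses), with made-up numbers:

    def transfer_weight(current_weight, surplus, candidate_total):
        # Every ballot for the elected candidate moves on to its next preference,
        # but at a reduced weight, so that exactly the surplus moves on in total.
        return current_weight * surplus / candidate_total

    quota, candidate_total = 1000, 1200   # candidate_total = sum of ballot weights received
    surplus = candidate_total - quota     # 200 votes' worth of value to pass on
    print(transfer_weight(1.0, surplus, candidate_total))  # fresh ballot: ~0.167
    print(transfer_weight(0.5, surplus, candidate_total))  # already-reduced ballot: ~0.083

Note that a ballot which arrived at half weight passes on half as much as a fresh one, which is exactly the “according to the fraction of the vote it made up” point above.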

Anyway, this is most of the information I was interested in finding out. If you want to find out a bit more, go read the Wikipedia page I linked.

This entry was posted in voting.

Putting word counts into beeminder

As I promised in my post about subsuming myself into the hive mind, I’m now using beeminder to try and keep myself blogging. I’ve set up a goal here.

I’ve ended up using word count as the metric rather than number of blog posts because:

  1. It will prevent me from weaselling out with ridiculous tiny blog posts like this one.
  2. It was easy to do.

I’ve currently set it up at what seems like a rather measly 400 words per week (a typical blog post for me seems to be anywhere between 400 and 1500 words), but given that the whole point of this is to provide a lower bound while I’m under pressure not to blog, this seems reasonable. I have however retroratcheted it right before writing this, to get rid of the 70 days of buffer the last couple of weeks of blogging gave me.

If you’re interested, I’ve open sourced the code I’m using to automate this. It’s pretty trivial, but it may save you the hour or so it took me to figure out how to write this (still well within the xkcd time allotment, but only if I really believe I’m going to keep this up for 5 years).
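If you’d rather roll your own, the core of such a script is a single API call. A minimal sketch (assuming the Beeminder datapoints endpoint, with placeholder username, goal and token values rather than anything from my actual setup):

    import requests  # third-party: pip install requests

    # Placeholder values: substitute your own Beeminder username, goal slug and auth token
    USERNAME, GOAL, AUTH_TOKEN = "example-user", "blogging", "your-auth-token"

    def post_word_count(words, comment=""):
        """Send a word-count datapoint to a Beeminder goal over its HTTP API."""
        url = f"https://www.beeminder.com/api/v1/users/{USERNAME}/goals/{GOAL}/datapoints.json"
        response = requests.post(url, data={
            "auth_token": AUTH_TOKEN,
            "value": words,
            "comment": comment,
        })
        response.raise_for_status()
        return response.json()

    # e.g. after counting the words in a freshly published post:
    # post_word_count(523, comment="new blog post")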

This entry was posted in Admin, life.

Printing, a status report

The process of moving countries appears to involve a lot of paper. The intended workflow is that you download a file, you print it out, you scribble on it, you scan it back in. It’s a bit ridiculous.

The main obstacle to this is that I didn’t have a printer, or a scanner. I’ve resisted having one for a long time: I just don’t need to print much, and printers have a tendency to break if you don’t use them for extended periods of time. Also, printing under Linux has a tendency to be… variable.

But this seemed to have become a necessity, and I figured that, worst-case scenario, I could just buy the printer and then gift or sell it on after I’m done with it. If I can’t get it working on Linux I do have a Windows machine (it’s intended purely for games).

So, I gave in and bought one, the Epson XP-312.

The verdict? Actually surprisingly good. Admittedly, given how bad I was expecting it to be, not being eaten by sharks is a surprisingly good outcome.

Getting the printer onto the WiFi was a complete pain. On setup it basically says “Here are 5 SSIDs. Want one of them?” (These were not the 5 strongest. It was sitting literally right next to my router when I did this). If those are not the ones you want, you have to enter the entire SSID manually, one character at a time, selecting each character by pressing up and down on the controls. It took a while. Once I’d done that, though, it did manage to connect just fine.

I solved the problem of printers not working well with Linux by just not using either of my laptops and doing virtually everything via my phone.

The printer supports Google cloud printing natively, which works rather nicely. There was some WTF about the complete lack of integration between drive/docs and cloud print on the web (it works fine on android), but it wasn’t too much of an issue.

Scanning was a little more challenging. There’s Google cloud printing, but it doesn’t support the dual operation of scanning. So it was either figure out how to get the scanner working under Linux (nooooo. Also this would have required the shit kind of USB cable that no one ever wants and only printers think is a good idea, which naturally I don’t have and they don’t include) or install the software they provide under Windows (noooooo).

But then I spotted the existence of Epson’s cloud services which claim to support scanning. On “select printers”. Apparently mine is not one of the select printers, but I only found this out after installing the cloud services. Well… I guess now I have two ways to print from the internet? Yay?

So, that plan having failed, I noticed that there’s an Android app for Epson printers. Called “iPrint”. Amusing misbranding, but whatever.

And… you know, it works pretty well. Using my phone I can activate the scanner, get the scanned file on my phone and upload that to Google Drive. The UI is like something out of the 90s, but it works remarkably well despite that. It isn’t using the cloud services either – it just detects the printer on the local network and connects to it directly.

So… printers are still a bit clumsy, but it seems to be possible to bypass CUPS these days, at least for this particular model. All told it could have been a lot worse.

This entry was posted in Uncategorized.