David R. MacIver's Blog: Lets talk about sets

Lets talk about sets

1 December 2017

I was talking to a lecturer in the computational optimisation group about a problem I have in Hypothesis the other day. It took a while to convince her I had an interesting problem. Once I had, I started talking to her colleague about it and the confusion immediately reoccurred.

After a while, it dawned on me why this might be happening, and I realised that it is probably a thing that has been causing subtle confusion in my conversations for a while.

The probable source of the confusion is the word “set”.

The way I framed my problem was roughly as follows:

I have a set of strings defined by some oracle function and I am trying to find a shortlex minimal element of the set.

What I mean by this is that I have a function f that takes a string, and I’m trying to find a shortlex (shortest length, lexicographically minimal among those) string such that f(s) is True. There’s no implication that I am holding all of the values in memory at once - The set in question is technically finite, but only in the sense that it hasn’t got more than about \(10^{20000}\) elements in it.

So when I say “set” it’s very much a mathematician’s notion of a set - it’s some abstract-ish definition that defines a bunch of values, not a concrete thing for working with those values. You could enumerate it if you wanted to, but there’s no implication that doing so is easy.

But what computer scientists typically mean by set is “A data structure that supports efficient membership queries”. The set is fundamentally something you can enumerate - you’ve got a collection of stuff right there that you’ve put in your data structure. It’s not that they’re unfamiliar with the mathematical notion of a set, it’s just not the thing that leaps to mind when they hear the word.

So when I say “set” and mean it in the mathematical sense to mean “Here is the shape of the problem”, computer scientists will often hear it in the computer scientist sense to mean “The problem is small enough that I can fit everything in memory”.

So, given that confusion, it’s not surprising that my problem sounded trivial! It’s just O(n), right? Take your n items and calculate the minimum of them in the desired ordering, bam. Done.

There’s a line (probably from George Bernard Shaw) that the United States and Great Britain are two countries separated by a common language. The same seems to be true of mathematics and computer science - we share a lot of words in our jargon, and some of them mean the same thing and some of them don’t, so there are a lot of false cognates.

It’s almost worse than the problem with natural languages, because as far as I can tell most of our shared jargon almost means the same thing in each. You probably won’t go very long without learning that you’re not in fact pregnant, merely embarrassed, but as I’ve discovered you can go quite some time saying “set” without realising that people are hearing something other than what you intended.

Comments

Alaric Snell-Pym on 2017-12-02 12:15:26:

I find your use of the word “set” unsurprising!

Computer scientists who are familiar with Bloom filters (which are a “set” data structure with two operations: “insert an element” and “is this element in the set?” with possible answers “NO” and “MAYBE”) should also be familiar with a set not being some efficiently-searchable list, but I do find that when I explain a bloom filter as a non-enumerable set structure, many do have a few initial mental hurdles to leap to see that...

Irene Knapp on 2017-12-03 00:44:45:

Huh. That makes sense.

Conversely, we often see mathematicians be surprised that they have to think about performance characteristics when writing code. Actual data structures have all these inconvenient properties related to the fact that they exist in a different sense than mathematical objects do...