My friend Dave Stark has an ongoing project to write an OCR engine in Java called Longan. It’s still very much a work in progress, but I’ve been enjoying hearing about that progress and it sounds like it’s going well.
He’s currently experimenting with a new backend for the recognition based on a metric search approach instead of the neural networks he’s been using so far and mentioned that it was currently rather slow. So I was all “Ooh! Vantage Point Trees! You should use Vantage Point Trees! They’re fun and awesome. Here, I will write you one!”
Err, so I did. It took me about two hours all told, plus some tinkering afterwards to fix the bits I was ashamed of. It’s a reasonably complete implementation – it does sensible center selection when building the tree, it buckets leaves into small numbers of points rather than making you do distance comparisons all the way down to the bottom and it tracks the radius of subtrees to let you quickly eliminate trees which aren’t going to hit at all.
Currently supported API:
- Given a metric and a collection of list of points, build a vantage tree
- Given a vantage tree and a point p find all points in the tree within epsilon of p
- Given a vantage tree and a point p find the n nearest points to p in the tree
In particular there are no mutation operations on the tree to speak of, mostly because I couldn’t be bothered.
It’s currently what you might call “lightly tested” in the sense that there isn’t even the pretence of a test suite. I’ve tried it and it seems to work (and my trying it was thorough enough that I found and fixed various bugs) but I haven’t got around to writing proper tests for it. The performance is surprisingly good – on the order of a few ms for querying out of a million (reasonably well distributed – I haven’t tried it on less favourable cases) points.
Although it works, it should probably be considered more of a tech demo than a real library. I’m happy to accept patches, I may fix reported bugs, but I’m not likely to otherwise maintain it and I’m certainly not going to sort out packaging for it, but if you would find such a thing useful be my guest to take it, run with it and otherwise do whatever you want with it.