I’ve been planning to do some work on my term extractor to make it a bit smarter. It’s currently a rule-based system on top of various machine learning tools. That’s perfectly legitimate, but it’s starting to hit the limitations of the approach, and I’d like to experiment with something more intelligent that uses machine learning more directly.
To do this, though, I need a training set. My plan is to build a first pass by running the existing version over some sentence corpus and then editing the results to taste.
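For concreteness, here’s a rough sketch of what that bootstrapping step might look like. The file names and the extract_terms function are hypothetical stand-ins, not the actual extractor:

    def extract_terms(sentence):
        """Hypothetical stand-in for the existing rule-based extractor."""
        return [word for word in sentence.split() if word[0].isupper()]

    # Run the current extractor over every sentence and write the results out,
    # one tab-separated line per sentence, ready for manual correction.
    with open("sentences.txt", encoding="utf-8") as corpus, \
         open("training-candidates.tsv", "w", encoding="utf-8") as out:
        for line in corpus:
            sentence = line.strip()
            if sentence:
                out.write(sentence + "\t" + ", ".join(extract_terms(sentence)) + "\n")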
Of course, that requires a decent sentence corpus, so today I set out to generate one. It was a lot fiddlier than it should have been, but I think in the end I’ve got a decent one.
I’m presumably not the only person to need something like this, so I’m making a largish sample of it available. It’s not hard to generate yourself but it’s something of a pain, so maybe I can save you some effort.
So, here you go. A bzipped list of one million random sentences from Wikipedia.
The format is obvious: Plain text, one sentence per line.
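If you want to read it programmatically, Python’s standard bz2 module will stream it without unpacking it first. A minimal sketch, assuming the file is UTF-8 and using a placeholder file name for the download:

    import bz2

    # Placeholder name; use whatever you saved the download as.
    with bz2.open("wikipedia-sentences.txt.bz2", mode="rt", encoding="utf-8") as corpus:
        for line in corpus:
            sentence = line.rstrip("\n")
            print(sentence)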
I make no guarantees about the quality of the data (there’s definitely some noise), and I definitely don’t claim this to be a statistically fair sample of Wikipedia. But initial impressions are that it’s a reasonably good list. Certainly it should be good enough for my purposes.