I want ONE MEELYUN sentences

I’ve been planning to do some work on my term extractor to make it a bit smarter. It’s currently a rule-based system on top of various machine learning tools. This is perfectly legitimate, but it’s starting to hit the limitations of that approach. I’d like to experiment with a more intelligent approach that uses machine learning more directly.

To do this, though, I need a training set. My plan is to build a first pass by running the existing version over some sentence corpus and then editing the output to taste.

Of course, to do this I need a decent sentence corpus. So today I set out to generate one. It was a lot fiddlier than it should have been, but I think in the end I’ve got a decent one.

I’m presumably not the only person to need something like this, so I’m making a largish sample of it available. It’s not hard to generate yourself but it’s something of a pain, so maybe I can save you some effort.

So, here you go. A bzipped list of one million random sentences from Wikipedia.

The format is obvious: plain text, one sentence per line.
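If you want to read the file from Python without decompressing it first, something like the following should work. This is only a minimal sketch: the function names are made up for illustration, and it assumes the file is UTF-8 encoded, which is not stated anywhere above.

```python
import bz2
from itertools import islice
from typing import Iterator, List


def iter_sentences(path: str) -> Iterator[str]:
    """Yield one sentence per non-empty line of a bzip2-compressed text file.

    Assumption: the file is UTF-8 encoded plain text, one sentence per line.
    """
    # "rt" opens the compressed stream in text mode, so lines are decoded lazily
    # and the whole corpus never has to sit in memory at once.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            sentence = line.rstrip("\n")
            if sentence:  # skip blank lines
                yield sentence


def first_sentences(path: str, count: int) -> List[str]:
    """Return the first `count` sentences, handy for a quick look at the data."""
    return list(islice(iter_sentences(path), count))
```

Calling `first_sentences(path, 5)` with the path of the downloaded file gives you a quick look at the first few lines, and `iter_sentences` can be fed straight into whatever processing you have in mind.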

I make no guarantees about the quality of the data (there’s definitely some noise), and I definitely don’t claim this to be a statistically fair sample of Wikipedia. But initial impressions are that it’s a reasonably good list. Certainly it should be good enough for my purposes.

This entry was posted in computational linguistics, programming.

2 thoughts on “I want ONE MEELYUN sentences”

  1. mitcho

    How hard would it be to rerun your script to preserve inline links? That’s something I’d be interested in for my own research, and haven’t gotten around to writing a script yet.

    Also, when you say a “random sample”, how did you choose sentences randomly?

  2. Rob

    Thank you so much for this! I’m doing some personal NLP studies and this is by far the best free corpus I’ve seen.
