I want ONE MEELYUN sentences

I’ve been planning to do some work on my term extractor to make it a bit smarter. It’s currently a rule-based system on top of various machine learning tools. That’s a perfectly legitimate design, but the extractor is starting to hit the limitations of that approach. I’d like to experiment with a more intelligent approach using machine learning more directly.

To do this, though, I need a training set. My plan is to build a first pass by running the existing version over some sentence corpus and then editing the output to taste.

Of course, to do this I need a decent sentence corpus. So today I set out to generate one. It was a lot fiddlier than it should have been, but I think in the end I’ve got a decent one.

I’m presumably not the only person to need something like this, so I’m making a largish sample of it available. It’s not hard to generate yourself but it’s something of a pain, so maybe I can save you some effort.

So, here you go. A bzipped list of one million random sentences from Wikipedia.

The format is obvious: Plain text, one sentence per line.
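
In case it saves anyone a minute, here is a minimal sketch of reading a file in this format from Python without unpacking it first. The path you pass in is whatever you saved the download as, and the UTF-8 encoding is my assumption rather than something checked against the file, so adjust it if you hit decoding errors. The function names are just illustrative.

```python
import bz2
from itertools import islice


def iter_sentences(path, encoding="utf-8"):
    """Yield sentences from a bzip2-compressed file with one sentence per line."""
    # "rt" opens the archive in text mode and decompresses lazily,
    # so the whole corpus never has to sit in memory at once.
    # The UTF-8 default is an assumption about the file, not a known fact.
    with bz2.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:  # skip any blank lines
                yield sentence


def head(path, n=5):
    """Return the first n sentences, handy for a quick look at the data."""
    return list(islice(iter_sentences(path), n))
```

Because `iter_sentences` is a generator, you can feed it straight into whatever processing you like, or call `head(path, 10)` to eyeball the first few lines.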

I make no guarantees about the quality of the data (there’s definitely some noise), and I definitely don’t claim this to be a statistically fair sample of Wikipedia. But initial impressions are that it’s a reasonably good list. Certainly it should be good enough for my purposes.

2 Responses to “I want ONE MEELYUN sentences”

  1. mitcho says:

    How hard would it be to rerun your script to preserve inline links? That’s something I’d be interested in for my own research, and haven’t gotten around to writing a script yet.

    Also, when you say a “random sample”, how did you choose sentences randomly?

  2. Rob says:

    Thank you so much for this! I’m doing some personal NLP studies and this is by far the best free corpus I’ve seen.
