<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>David R. MacIver</title>
	<atom:link href="http://www.drmaciver.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.drmaciver.com</link>
	<description></description>
	<lastBuildDate>Thu, 11 Feb 2010 18:13:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>You Might Not Know&#8230;</title>
		<link>http://www.drmaciver.com/2010/02/you-might-not-know/</link>
		<comments>http://www.drmaciver.com/2010/02/you-might-not-know/#comments</comments>
		<pubDate>Thu, 11 Feb 2010 18:13:32 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[life]]></category>

		<guid isPermaLink="false">http://www.drmaciver.com/?p=3801</guid>
		<description><![CDATA[&#8230;that Mike and I have been working on a secret project. 
Some time last year a friend gave me a really good piece of advice. I don&#8217;t even remember what it was about &#8211; it was something totally minor. Useful at the time, but not something that particularly sticks in your memory. What did stick [...]]]></description>
			<content:encoded><![CDATA[<p>&#8230;that <a href="http://donotremove.co.uk/">Mike</a> and I have been working on a secret project. </p>
<p>Some time last year a friend gave me a really good piece of advice. I don&#8217;t even remember what it was about &#8211; it was something totally minor. Useful at the time, but not something that particularly sticks in your memory. What did stick in my memory was the realisation that everyone has a pile of these little unexpected ways of doing things which plenty of people could use, yet most of them go unshared. This seems like a shame. </p>
<p>So, Mike and I set out to fix that. After a fair bit of work and a lot more procrastination, we give you <a href="http://youmightnotknow.com/">You Might Not Know</a>: A site for sharing those tips and tricks for life that you might otherwise not have. </p>
<p>I&#8217;m pretty pleased with how it&#8217;s gone so far. There&#8217;s still plenty left to do, but I find what&#8217;s there remarkably pleasant and easy to use, in no small part due to Mike being <em>way</em> better at user experience and design than I am. </p>
<p>So, do go check it out. If you&#8217;ve got something to share, great! Even if not, have a browse. Maybe you&#8217;ll learn something useful. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.drmaciver.com/2010/02/you-might-not-know/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Understanding timsort, Part 1: Adaptive Mergesort</title>
		<link>http://www.drmaciver.com/2010/01/understanding-timsort-1adaptive-mergesort/</link>
		<comments>http://www.drmaciver.com/2010/01/understanding-timsort-1adaptive-mergesort/#comments</comments>
		<pubDate>Mon, 11 Jan 2010 16:11:42 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[timsort]]></category>

		<guid isPermaLink="false">http://www.drmaciver.com/?p=3745</guid>
		<description><![CDATA[Python&#8217;s timsort has a reputation for being rather scary. This is understandable, as there are a lot of bits to it. However, really when you come down to it it&#8217;s &#8220;just&#8221; a pile of variations applied to mergesort. Some of these variations are rather clever, some of them are pretty straightforward, but together they result [...]]]></description>
			<content:encoded><![CDATA[<p>Python&#8217;s timsort has a reputation for being rather scary. This is understandable, as there are a lot of bits to it. However, really when you come down to it it&#8217;s &#8220;just&#8221; a pile of variations applied to mergesort. Some of these variations are rather clever, some of them are pretty straightforward, but together they result in something really quite impressive.</p>
<p>I&#8217;m going to show you through worked examples how you might arrive at timsort starting from a basic mergesort. In this article I&#8217;ll cover how to arrive at a basic adaptive mergesort that represents the &#8220;core&#8221; of timsort. Later articles will build on this to cover more of the specific optimisations it uses. </p>
<p>For the sake of simplicity I&#8217;m not going to worry about the general case, but am going to stick to arrays of integers (it&#8217;s easy to generalise once you have this, it just makes the code easier to follow). Also, this is necessarily a summary, so I will probably gloss over details (or may just have some things plain wrong), so naturally you should always refer to <a href="http://svn.python.org/projects/python/trunk/Objects/listsort.txt">Tim Peters&#8217;s description of the algorithm</a> if you want the specific details.</p>
<p>Oh, and the example code is going to be in C. Sorry.</p>
<p>So, we&#8217;ll start with a really naive implementation of mergesort.</p>
<p>Hopefully you already know how mergesort works (if not, you might want to refer elsewhere to find out), but here&#8217;s a refresher: Arrays of length 1 are already sorted. For arrays of length n > 1 partition the array into two (the most common approach is to split it down the middle). Mergesort the two partitions. Now apply a merge operation, which takes two sorted arrays and combines them into a larger sorted array by scanning through both and always writing the smaller of the two current elements as the next element of the output.</p>
<p>Here&#8217;s some code: </p>
<pre>
#include "timsort.h"
#include <stdlib.h>
#include <string.h>

// Merge the sorted arrays p1, p2 of length l1, l2 into a single
// sorted array starting at target. target may overlap with either
// of p1 or p2 but must have enough space to store the array.
void merge(int target[], int p1[], int l1, int p2[], int l2);

void integer_timsort(int array[], int size){
  if(size <= 1) return; 

  int partition = size/2;
  integer_timsort(array, partition);
  integer_timsort(array + partition, size - partition);
  merge(array, array, partition, array + partition, size - partition);
}

void merge(int target[], int p1[], int l1, int p2[], int l2){
  int *merge_to = malloc(sizeof(int) * (l1 + l2)); 

  // Current index into each of the two arrays we're writing
  // from.
  int i1, i2;
  i1 = i2 = 0; 

  // The address to which we write the next element in the merge
  int *next_merge_element = merge_to;

  // Iterate over the two arrays, writing the least element at the
  // current position to merge_to. When the two are equal we prefer
  // the left one, because if we're merging left, right we want to
  // ensure stability.
  // Of course this doesn't matter for integers, but it's the thought
  // that counts.
  while(i1 < l1 &#038;&#038; i2 < l2){
    if(p1[i1] <= p2[i2]){
      *next_merge_element = p1[i1];
      i1++;
    } else {
      *next_merge_element = p2[i2];
      i2++;
    }
    next_merge_element++;
  }

  // If we stopped short before the end of one of the arrays
  // we now copy the rest over.
  memcpy(next_merge_element, p1 + i1, sizeof(int) * (l1 - i1));
  memcpy(next_merge_element, p2 + i2, sizeof(int) * (l2 - i2)); 

  // We've now merged into our additional working space. Time
  // to copy to the target.
  memcpy(target, merge_to, sizeof(int) * (l1 + l2));

  free(merge_to);
}
</pre>
<p>I won't always paste the full code. You can follow these along as revisions in <a href="http://github.com/DRMacIver/understanding-timsort">the github repo</a>.</p>
<p>Now, if you're a C programmer one thing probably leapt out at you as a horrible abomination: We're allocating and freeing space for the merge with every merge call (you may also be grumpy that we're not checking for null returns. Pretend that I would if this were real code rather than demo if that makes you feel better).</p>
<p>This is easy to fix with a few signature changes and a wrapper:</p>
<pre>
void merge(int target[], int p1[], int l1, int p2[], int l2, int storage[]);
void integer_timsort_with_storage(int array[], int size, int storage[]);

void integer_timsort(int array[], int size){
  int *storage = malloc(sizeof(int) * size);
  integer_timsort_with_storage(array, size, storage);
  free(storage);
}
</pre>
<p>So we have a top level sort function which does some setup and then passes to the recursive version. This is a pattern we'll take a lot of advantage of for timsort, though what is passed in to the worker version will end up being more complicated than just a flat block of storage.</p>
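<p>For reference, here is a rough sketch of how the rest of this version might look. The signatures are the ones declared above; the bodies are an assumption about how they would be filled in rather than a copy of the code in the repo, and the abort on a failed malloc is just one possible way of handling the null return.</p>
<pre>
#include <stdlib.h>
#include <string.h>

// Same merge as before, but writing into caller-supplied storage
// instead of allocating. storage must have room for l1 + l2 ints
// and must not overlap the arrays being merged.
void merge(int target[], int p1[], int l1, int p2[], int l2, int storage[]){
  int i1 = 0, i2 = 0;
  int *next_merge_element = storage;

  // Ties go to the left array to keep the merge stable.
  while(i1 < l1 && i2 < l2){
    if(p1[i1] <= p2[i2]) *next_merge_element++ = p1[i1++];
    else *next_merge_element++ = p2[i2++];
  }

  // At most one of these two copies moves any data.
  memcpy(next_merge_element, p1 + i1, sizeof(int) * (l1 - i1));
  next_merge_element += l1 - i1;
  memcpy(next_merge_element, p2 + i2, sizeof(int) * (l2 - i2));

  memcpy(target, storage, sizeof(int) * (l1 + l2));
}

void integer_timsort_with_storage(int array[], int size, int storage[]){
  if(size <= 1) return;

  int partition = size / 2;
  integer_timsort_with_storage(array, partition, storage);
  integer_timsort_with_storage(array + partition, size - partition, storage);
  merge(array, array, partition, array + partition, size - partition, storage);
}

void integer_timsort(int array[], int size){
  if(size <= 1) return;
  int *storage = malloc(sizeof(int) * size);
  if(storage == NULL) abort();  // assumption: give up on allocation failure
  integer_timsort_with_storage(array, size, storage);
  free(storage);
}
</pre>
<p>One block of size ints is enough for every level of the recursion, because no merge ever handles more than size elements and only one merge is in progress at a time.</p>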
<p>So, we have our basic mergesort. We need to ask: How can we optimise this?</p>
<p>In general we can't expect to optimise it to get a win in every case. Mergesort's behaviour is very close to optimal for a comparison sort. The key feature of timsort is that it is optimised to exploit certain common sorts of regularity in data. When they are there, we should take advantage of them as much as possible. When they are not we should merely be not substantially worse than a normal mergesort.</p>
<p>If you look at the mergesort implementation, essentially all the work is done by the merge operation. So optimising basically comes down to that. This suggests three optimisation approaches:</p>
<ol>
<li>Can we make merges faster?</li>
<li>Can we perform fewer merges?</li>
<li>Are there cases where we're actually better off doing something different and not using mergesort?</li>
</ol>
<p>The answer to three is unequivocally yes, and this is a very common source of merge sort optimisations. For example, the recursive implementation makes it very easy to switch to different sorting algorithms based on the size of the array. Mergesort is a very good general purpose sorting algorithm, but for small arrays the constant factors tend to dominate. Frequently one drops down to insertion sort for arrays under some size (around 7 or 8 seems to be a common choice).</p>
<p>This isn't actually how timsort works, but we will need insertion sort later, so I'll take a quick digression down this route.</p>
<p>Basically: Suppose we have a sorted array of n elements, with space for an n+1th at the end, and we want to add a single element to it in such a way that the end result is sorted. We need to find the appropriate place for it and move the elements larger than that up. One obvious way to do this is to insert the element into the n+1th slot and then swap backwards until it's in the right place (for large arrays this isn't necessarily the best bet: You might want to binary search and then move the rest of the array up without doing comparisons. For small arrays however this is likely to lose due to cache effects). </p>
<p>This is how insertion sort works: You have the first k elements sorted. You insert the k+1th element into these k sorted elements as above, so now you have k+1 elements sorted. Proceed until you hit the end of the array.</p>
<p>Here's some code:</p>
<pre>
void insertion_sort(int xs[], int length){
  if(length <= 1) return;
  int i;
  for(i = 1; i < length; i++){
    // The array before i is sorted. Now insert xs[i] into it
    int x = xs[i];
    int j = i - 1;

    // Move j down until it's either at the beginning or on
    // something <= x, and everything to the right of it has
    // been moved up one.
    while(j >= 0 &#038;&#038; xs[j] > x){
      xs[j+1] = xs[j];
      j--;
    }
    xs[j+1] = x;
  }
}
</pre>
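<p>The binary search variant mentioned earlier could be sketched roughly like this. The names binary_insert and binary_insertion_sort are made up for illustration; nothing later in the article depends on them.</p>
<pre>
#include <string.h>

// Insert xs[n] into the sorted prefix xs[0..n): binary search for the
// slot, then shift the tail up with a single memmove.
void binary_insert(int xs[], int n){
  int x = xs[n];
  int lo = 0, hi = n;

  // Find the first position holding something > x. Stopping after any
  // elements equal to x keeps the sort stable.
  while(lo < hi){
    int mid = lo + (hi - lo) / 2;
    if(xs[mid] <= x) lo = mid + 1;
    else hi = mid;
  }

  // Move xs[lo..n) up one place without doing any more comparisons.
  memmove(xs + lo + 1, xs + lo, sizeof(int) * (n - lo));
  xs[lo] = x;
}

void binary_insertion_sort(int xs[], int length){
  for(int i = 1; i < length; i++) binary_insert(xs, i);
}
</pre>
<p>This does about log(k) comparisons per insertion instead of up to k, but still moves up to k elements, so it only helps when comparisons are the expensive part.</p>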
<p>And the sort code gets modified as follows:</p>
<pre>
void integer_timsort_with_storage(int array[], int size, int storage[]){
  if(size <= INSERTION_SORT_SIZE){
    insertion_sort(array, size);
    return;
  }
</pre>
<p>You can see this version <a href="http://github.com/DRMacIver/understanding-timsort/commit/57a91bd8c5383ffa1e0e5dc1df0849e16ec037bd">here</a>.</p>
<p>Anyway, that digression aside, we return to our questions about optimisation.</p>
<p>Can we perform fewer merges?</p>
<p>Well, in general probably not. But let's consider a couple of common cases.</p>
<p>Suppose we have an array that's already sorted. How many merges do we need to perform?</p>
<p>Well, in principle none: The array is already sorted. There's nothing to do. So one option would be to add an initial check to see if the array is already sorted and exit early there.</p>
<p>But that adds a bunch of extra work to the sort. It wins big in the case where it succeeds (drops us down to O(n) instead of O(n log(n)) with a worse constant factor), but adds a bunch of useless extra work in the case where it fails. So let's try and figure out how we can perform this check and make use of the results even when it doesn't succeed.</p>
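<p>As a small sketch of what such a check might look like: rather than answering yes or no, it can report how far it got before the order broke. The helper name sorted_prefix_length is hypothetical.</p>
<pre>
// Length of the longest non-decreasing prefix of xs[0..length).
// The array is already sorted exactly when this returns length.
// Uses at most length - 1 comparisons.
int sorted_prefix_length(int xs[], int length){
  if(length == 0) return 0;
  int i = 1;
  while(i < length && xs[i - 1] <= xs[i]) i++;
  return i;
}
</pre>
<p>For {5, 6, 7, 8, 9, 10, 1, 2, 3} this returns 6. When the check "fails" we still learn that the first 6 elements are sorted, which is exactly the information the rest of this section tries to exploit.</p>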
<p>Suppose we've got the following array:</p>
<pre>
{5, 6, 7, 8, 9, 10, 1, 2, 3}
</pre>
<p>(ignoring for the moment the fact that we want to use a different sort for smaller n).</p>
<p>Where do we want to partition in order to get the best merge?</p>
<p>Clearly there are two already sorted subarrays: 5 to 10 and then 1 to 3. It would be nice to be able to choose them as our partitions. </p>
<p>Let me propose a broken solution: </p>
<p>Find the longest initial increasing sequence. Choose that as the first partition, and the rest of the array as the second partition. </p>
<p>In the case where the array is partitioned into a small number of sorted arrays, this works pretty well (actually even then it's not a great idea), but it has pretty awful worst case behaviour. Consider what happens if we have an array in reverse order. The first sorted subarray on each partition will have length 1. So at each stage we'll have only one element in the first partition, and then recursively perform the merge sort on n - 1 elements. This gives us a most distinctly unsatisfying O(n^2) behaviour.</p>
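<p>To make the worst case concrete, here is a rough sketch of the broken strategy that also counts how many elements the merges write. The names prefix_partition_sort, merge_adjacent and elements_merged are invented for this illustration.</p>
<pre>
#include <string.h>

// Total number of elements written by merges: a crude cost measure.
long elements_merged = 0;

// Merge the adjacent sorted ranges xs[0..l1) and xs[l1..l1+l2).
static void merge_adjacent(int xs[], int l1, int l2, int storage[]){
  int i1 = 0, i2 = l1, end = l1 + l2, out = 0;
  while(i1 < l1 && i2 < end){
    if(xs[i1] <= xs[i2]) storage[out++] = xs[i1++];
    else storage[out++] = xs[i2++];
  }
  while(i1 < l1) storage[out++] = xs[i1++];
  while(i2 < end) storage[out++] = xs[i2++];
  memcpy(xs, storage, sizeof(int) * end);
  elements_merged += end;
}

// The broken strategy: the first partition is the longest sorted
// prefix, the second partition is everything else.
void prefix_partition_sort(int xs[], int length, int storage[]){
  if(length <= 1) return;

  int prefix = 1;
  while(prefix < length && xs[prefix - 1] <= xs[prefix]) prefix++;
  if(prefix == length) return;  // already sorted, nothing to merge

  prefix_partition_sort(xs + prefix, length - prefix, storage);
  merge_adjacent(xs, prefix, length - prefix, storage);
}
</pre>
<p>On a reversed array of 100 elements the merges have sizes 2, 3, ..., 100, so elements_merged ends up at 5049. A balanced mergesort on the same input writes about 700.</p>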
<p>We could fix this by artificially boosting short arrays to the first n / 2 elements, but this is unsatisfying: We're still most likely to ignore the extra work we're doing, and it's going to pay off very rarely. </p>
<p>However, the basic idea is sound: Use the already sorted subarrays as the basis of partitions for our merge. </p>
<p>The bad part is our choice of the second partition. We want to ensure that our merges are better balanced in order to ensure we don't hit pathological worst case behaviour.</p>
<p>In order to see how to fix this, let's take a step back. Consider the following slightly strange inversion of how a standard merge sort works:</p>
<p>Partition the array into sections of length 1.</p>
<p>While there is more than one partition, merge alternating even/odd partitions and replace them with a single partition.</p>
<p>so e.g. if we had the array {1, 2, 3, 4} then this would go:</p>
<pre>
{{1}, {2}, {3}, {4}}
{{1, 2}, {3, 4}}
{{1, 2, 3, 4}}
</pre>
<p>It's relatively easy to see that this is "the same" as the standard mergesort: We've just turned it inside out by making the recursion explicit and using external storage instead of the stack. However, this approach is more suggestive of how we can use the existing sorted subarrays: Instead of partitioning the array into segments of length 1 in the first step, we partition it into the already sorted segments. We then proceed with the merges as above.</p>
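<p>Here is a minimal sketch of the inside out version for the plain case where every starting partition has length 1. It takes one shortcut: because all partitions at a given level have the same width, it tracks them with a single width counter rather than an explicit list of partitions. That shortcut stops working once the partitions are arbitrary sorted segments. The function names are made up for the example.</p>
<pre>
#include <stdlib.h>
#include <string.h>

// Merge the adjacent sorted ranges xs[lo..mid) and xs[mid..hi).
static void merge_ranges(int xs[], int lo, int mid, int hi, int storage[]){
  int i = lo, j = mid, out = lo;
  while(i < mid && j < hi){
    if(xs[i] <= xs[j]) storage[out++] = xs[i++];
    else storage[out++] = xs[j++];
  }
  while(i < mid) storage[out++] = xs[i++];
  while(j < hi) storage[out++] = xs[j++];
  memcpy(xs + lo, storage + lo, sizeof(int) * (hi - lo));
}

void bottom_up_mergesort(int xs[], int length){
  if(length <= 1) return;
  int *storage = malloc(sizeof(int) * length);
  if(storage == NULL) abort();

  // width is the partition length at the current level: 1, 2, 4, ...
  for(int width = 1; width < length; width *= 2){
    // Merge partitions pairwise. A trailing partition with no partner
    // is left alone until a later level.
    for(int lo = 0; lo + width < length; lo += 2 * width){
      int mid = lo + width;
      int hi = mid + width < length ? mid + width : length;
      merge_ranges(xs, lo, mid, hi, storage);
    }
  }

  free(storage);
}
</pre>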
<p>Now, there's just one small problem with this: We're using a pile of external storage that we didn't need to use. With the original mergesort we used O(log(n)) stack space. This version uses O(n) space to store the initial partitions.</p>
<p>So, how is it that our "equivalent" algorithms have vastly different space usage? </p>
<p>Well, the answer is that I sort of lied about their equivalence. The big difference is that with the original mergesort the partition lists are generated lazily. We only ever generate as much as we need to produce the next level up and then discard it once we've produced the next level.</p>
<p>Put another way, we're actually merging as we go rather than generating all the partitions up front. </p>
<p>So, let's see if we can convert that into an algorithm. </p>
<p>First pass: At each step, generate a new base level partition (in normal mergesort this is a single element. In our version proposed above this is a sorted subarray). Add it to a stack of already generated partitions. Possibly reduce the size of the stack by merging the top two elements some number of times. Repeat until there are no more partitions to generate. Collapse the entire stack by merging.</p>
<p>There was one bit of fakery in there: We've left the logic for when to merge as we go completely unspecified. </p>
<p>At this point there's been far too much text and far too little code, so I'm going to propose a temporary answer: Pick it at random. In the normal merge sort about half of operations result in a merge. Half of the partitions generated are merged with the previous one, half of the merges at a given level are merged with the previous one, etc. So we'll simply flip a metaphorical coin as to whether or not we should merge.</p>
<p>Now, let's write some code for this. </p>
<p>The first thing we do is encapsulate all the state we're going to be passing around:</p>
<pre>
// We use a fixed size stack. This size is far larger than there is
// any reasonable expectation of overflowing. Of course, we do still
// need to check for overflows.
#define STACK_SIZE 1024

typedef struct {
  int *index;
  int length;
} run;

typedef struct {
  int *storage;
  // Storage for the stack of runs we've got so far.
  run runs[STACK_SIZE];
  // The index of the first unwritten element of the stack.
  int stack_height;

  // We keep track of how far we've partitioned up to so we know where to start the next partition.
  // The idea is that everything before partitioned_up_to is on the stack, everything at or after
  // partitioned_up_to is not yet on the stack. When partitioned_up_to == array + length we'll have
  // put everything on the stack.
  int *partitioned_up_to;

  int *array;
  int length;

} sort_state_struct;

typedef sort_state_struct *sort_state;
</pre>
<p>We'll pass around a pointer to the sort_state to all the functions we need.</p>
<p>The basic logic of the sort is this:</p>
<pre>
  while(next_partition(&#038;state)){
    while(should_collapse(&#038;state)) merge_collapse(&#038;state);
  }

  while(state.stack_height > 1) merge_collapse(&#038;state);
</pre>
<p>next_partition either pushes a new partition onto the stack and returns 1 or returns 0 if there are no more partitions to add (i.e. we're at the end of the array). We then collapse the stack a bit. Finally when the entire array is partitioned we collapse the stack to one element.</p>
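<p>The three helpers aren't shown above, so here is a rough, self-contained sketch of how they might look at this stage. It repeats the state struct so that it compiles on its own. The bodies are an assumption based on the description in the text, not the code from the repo; in particular, forcing a collapse when the stack is full is one guess at how to do the overflow check, and adaptive_mergesort is a made-up name for the top level function.</p>
<pre>
#include <stdlib.h>
#include <string.h>

#define STACK_SIZE 1024

typedef struct { int *index; int length; } run;

typedef struct {
  int *storage;
  run runs[STACK_SIZE];
  int stack_height;
  int *partitioned_up_to;
  int *array;
  int length;
} sort_state_struct;

typedef sort_state_struct *sort_state;

// Push the next increasing run onto the stack and return 1, or
// return 0 if the whole array has already been partitioned.
int next_partition(sort_state state){
  int *end = state->array + state->length;
  if(state->partitioned_up_to >= end) return 0;

  int *start = state->partitioned_up_to;
  int *next = start + 1;
  while(next < end && *next >= *(next - 1)) next++;

  run run_to_add = { start, (int)(next - start) };
  state->runs[state->stack_height++] = run_to_add;
  state->partitioned_up_to = next;
  return 1;
}

// Temporary policy: flip a coin. A full stack always collapses, so
// next_partition never pushes past the end of runs.
int should_collapse(sort_state state){
  if(state->stack_height >= STACK_SIZE) return 1;
  if(state->stack_height < 2) return 0;
  return rand() % 2;
}

// Merge the top two runs, which are adjacent in the array, into one.
void merge_collapse(sort_state state){
  run *left = state->runs + (state->stack_height - 2);
  run *right = left + 1;
  int *out = state->storage;
  int i = 0, j = 0;

  while(i < left->length && j < right->length){
    if(left->index[i] <= right->index[j]) *out++ = left->index[i++];
    else *out++ = right->index[j++];
  }
  while(i < left->length) *out++ = left->index[i++];
  while(j < right->length) *out++ = right->index[j++];

  memcpy(left->index, state->storage,
         sizeof(int) * (left->length + right->length));
  left->length += right->length;
  state->stack_height--;
}

void adaptive_mergesort(int array[], int size){
  if(size <= 1) return;

  sort_state_struct state;
  state.storage = malloc(sizeof(int) * size);
  if(state.storage == NULL) abort();
  state.stack_height = 0;
  state.partitioned_up_to = array;
  state.array = array;
  state.length = size;

  while(next_partition(&state)){
    while(should_collapse(&state)) merge_collapse(&state);
  }
  while(state.stack_height > 1) merge_collapse(&state);

  free(state.storage);
}
</pre>
<p>Whatever the coin does, the result is sorted: the merge policy only affects how much work is done, never correctness.</p>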
<p>We now have our first adaptive version of mergesort: If there are a lot of sorted subarrays we'll be able to get large shortcuts out of them. If not, we'll still run in (expected) O(n log(n)) time.</p>
<p>That "expected" qualification is a bit of a wart though. The randomisation was clearly a quick hack to avoid us having to actually figure out good conditions to merge on.</p>
<p>So, let's see if we can figure out a better condition. The natural way to do it is to try to maintain some invariant on the stack and merge until that invariant is satisfied.</p>
<p>Further, we want that invariant to maintain the stack as having at most log(n) elements. </p>
<p>For now, let's consider the following invariant: Each run on the stack has to be at least twice as long as the run above it (the one that would be popped right before it). So the head is the shortest run, the one below it is at least twice as long as the head, and so on down the stack.</p>
<p>This invariant certainly achieves the log(n) elements criterion. It does however have the tendency to create very long runs of collapses. Consider the case where the lengths of the stack look as follows:</p>
<pre>
64, 32, 16, 8, 4, 2, 1
</pre>
<p>Suppose we push a run of length 1 onto the stack. We start the following sequences of merges:</p>
<pre>
64, 32, 16, 8, 4, 2, 1, 1
64, 32, 16, 8, 4, 2, 2
64, 32, 16, 8, 4, 4
64, 32, 16, 8, 8
64, 32, 16, 16
64, 32, 32
64, 64
128
</pre>
<p>Later on, as the merge gets smarter, this will prove to be a bad thing (basically because it stomps on certain structure that might be present in the array). However right now our merges are pretty dumb, so we don't need to worry about it. So we'll simply go with this for now.</p>
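<p>The cascade is easy to reproduce by tracking only the run lengths and ignoring the data. This is a small sketch with a hypothetical helper, push_run_length, that pushes one length and then merges until the invariant holds again.</p>
<pre>
// Push a run length onto the stack, then merge the top two entries
// for as long as the head is more than half the length of the run
// below it. Returns the new stack height. lengths must have room
// for one more entry than height.
int push_run_length(long lengths[], int height, long new_length){
  lengths[height++] = new_length;
  while(height >= 2 && 2 * lengths[height - 1] > lengths[height - 2]){
    lengths[height - 2] += lengths[height - 1];
    height--;
  }
  return height;
}
</pre>
<p>Starting from the lengths 64, 32, 16, 8, 4, 2, 1 and pushing a 1 performs seven merges in a row and leaves a single run of length 128, as in the listing above.</p>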
<p>One thing worth noting: We now have deterministic guarantees over how big our stack can get. Suppose the head of the stack has length 1 (the smallest it can be). Then the next is >= 2, the next is >= 4, etc. So the total length of the segments on a stack of n runs is at least 2^n - 1. Since there can be at most 2^64 elements in the array on a 64-bit machine (and that would be a really alarmingly large array even there), we know that a stack satisfying this invariant can have at most 65 elements. Adding 1 more for the element being pushed, this means we can allocate 66 spaces for the stack and never worry about overflowing.</p>
<p>It's also worth noting that we only need to check whether the element one off the head is >= 2 * the head, because we're always pushing onto a stack satisfying this invariant and a merge only affects the top two elements.</p>
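<p>The stack bound can be checked mechanically. This sketch computes the largest stack height whose minimum total length 1 + 2 + 4 + ... still fits in an array of a given size. The function name is made up, and it assumes the invariant exactly as stated above.</p>
<pre>
#include <stdint.h>

// Largest n such that 2^n - 1 <= max_elements, i.e. the most runs a
// stack satisfying the invariant can hold for an array of that size.
int max_stack_height(uint64_t max_elements){
  int height = 0;
  uint64_t remaining = max_elements;
  uint64_t shortest = 1;  // minimum length of the run at this depth

  while(shortest != 0 && remaining >= shortest){
    remaining -= shortest;
    height++;
    shortest *= 2;  // wraps to 0 after 2^63, which ends the loop
  }
  return height;
}
</pre>
<p>For the largest value a 64-bit unsigned integer can hold this gives 64, so the figure of 65 used above is a safe overestimate rather than a tight bound.</p>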
<p>So, in order to satisfy this invariant we simply change should_collapse as follows:</p>
<pre>
int should_collapse(sort_state state){
  if (state->stack_height < 2) return 0;

  int h = state->stack_height - 1;

  int head_length = state->runs[h].length;
  int next_length = state->runs[h-1].length;

  return 2 * head_length > next_length;
}
</pre>
<p>So, our adaptive merge is now deterministic. Huzzah.</p>
<p>Now, let's go back to our previous example of a case that was problematic and see what happens.</p>
<p>Consider the following reversed array:</p>
<pre>
5, 4, 3, 2, 1
</pre>
<p>What happens when we apply our adaptive mergesort to it?</p>
<p>Well, the stack of runs looks like this:</p>
<pre>
{5}
{5}, {4}
{4, 5}
{4, 5}, {3}
{4, 5}, {3}, {2}
{4, 5}, {2, 3}
{2, 3, 4, 5}
{2, 3, 4, 5}, {1}
{1, 2, 3, 4, 5}
</pre>
<p>Which is a sane enough merge strategy. </p>
<p>But you know what a nicer way to sort a reverse order array is? Reverse it in place and you're done.</p>
<p>There's an obvious way to modify our algorithm to take advantage of this. We're already looking for increasing runs, we can simply look for a decreasing run when we don't find an increasing one, reverse it in place and add it as an increasing run. </p>
<p>So we modify the code for finding the next run as follows:</p>
<pre>
  if(next_start_index < state->array + state->length){
    if(*next_start_index < *start_index){
      // We have a decreasing sequence starting here.
      while(next_start_index < state->array + state->length){
        if(*next_start_index < *(next_start_index - 1)) next_start_index++;
        else break;
      }

      // Now reverse it in place.
      reverse(start_index, next_start_index - start_index);

    } else {
      // We have an increasing sequence starting here.
      while(next_start_index < state->array + state->length){
        if(*next_start_index >= *(next_start_index - 1)) next_start_index++;
        else break;
      }
    }
  }
</pre>
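<p>The fragment above relies on a reverse helper and on variables set up earlier in next_partition. As a standalone sketch of the same idea, here is a guess at reverse (taking a pointer and a length, to match the call above) plus a hypothetical find_run that works on a plain array.</p>
<pre>
// Reverse xs[0..length) in place.
void reverse(int xs[], int length){
  for(int i = 0, j = length - 1; i < j; i++, j--){
    int tmp = xs[i];
    xs[i] = xs[j];
    xs[j] = tmp;
  }
}

// Find the run starting at xs[0] and return its length. If the run
// is strictly decreasing it is reversed in place, so on return
// xs[0..result) is always in non-decreasing order.
int find_run(int xs[], int length){
  if(length <= 1) return length;

  int n = 1;
  if(xs[1] < xs[0]){
    // Strictly decreasing only: reversing a run that contains equal
    // elements would swap their order and break stability.
    while(n < length && xs[n] < xs[n - 1]) n++;
    reverse(xs, n);
  } else {
    while(n < length && xs[n] >= xs[n - 1]) n++;
  }
  return n;
}
</pre>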
<p>As well as the basic case of a reversed array, the sort will now deal much better with things which "zig zag" up and down. e.g. consider sorting the following:</p>
<pre>
{1, 2, 3, 4, 5, 4, 3, 2, 1}
</pre>
<p>We get the following merges:</p>
<pre>
{1, 2, 3, 4, 5}
{1, 2, 3, 4, 5}, {1, 2, 3, 4}
{1, 1, 2, 2, 3, 3, 4, 4, 5}
</pre>
<p>Which is a lot better than we would have on the previous implementation!</p>
<p>And for one final optimisation on the run generation:</p>
<p>In our previous mergesort we had a cutoff size at which we switched to insertion sort for small arrays. There's currently no analogue of this for our adaptive version, which means that we will potentially underperform compared to normal mergesort when there isn't much structure to exploit.</p>
<p>Looking back to our "inside out" mergesort, the process of switching to insertion sort for small runs can be viewed as follows: Rather than starting with runs of size 1, we start with runs of size INSERTION_SORT_SIZE, which we run insertion sort on to ensure that they are sorted.</p>
<p>This suggests a natural adaptation to our adaptive sort: When we find a run which is less than some minimum size, use insertion sort to boost it to a run of that size.</p>
<p>This causes us to change the end of next_partition as follows:</p>
<pre>
  if(run_to_add.length < MIN_RUN_SIZE){
    boost_run_length(state, &#038;run_to_add);
  }
  state->partitioned_up_to = start_index + run_to_add.length;
</pre>
<p>Where boost_run_length is defined as:</p>
<pre>
void boost_run_length(sort_state state, run *run){
  // Need to make sure we don't overshoot the end of the array
  int length = state->length - (run->index - state->array);
  if(length > MIN_RUN_SIZE) length = MIN_RUN_SIZE;

  insertion_sort(run->index, length);
  run->length = length;
}
</pre>
<p>(It would make more sense to specialize this a bit, as we know that we have an initially sorted segment, but I'm being lazy).</p>
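<p>The less lazy version might look something like this sketch: since the first part of the segment is already sorted, only the new elements need inserting. The name extend_run and its parameters are invented here.</p>
<pre>
// Given that xs[0..sorted) is already sorted, insert the elements
// xs[sorted..target) one at a time so that xs[0..target) is sorted.
void extend_run(int xs[], int sorted, int target){
  for(int i = sorted; i < target; i++){
    int x = xs[i];
    int j = i - 1;

    // Shift everything larger than x up one place.
    while(j >= 0 && xs[j] > x){
      xs[j + 1] = xs[j];
      j--;
    }
    xs[j + 1] = x;
  }
}
</pre>
<p>boost_run_length could then call extend_run(run->index, run->length, length) in place of insertion_sort(run->index, length), skipping the comparisons that would only rediscover the existing run.</p>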
<p>This should improve the behaviour on random data to a degree which is fairly competitive with a normal merge sort.</p>
<p>We now have an adaptive mergesort which is in some sense the "core" of timsort. Timsort adds a large number of optimisations on top of this, many of them integral to its success, but this is the starting point on which they're all based. I hope/plan to cover the rest in later articles.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.drmaciver.com/2010/01/understanding-timsort-1adaptive-mergesort/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I want ONE MEELYUN sentences</title>
		<link>http://www.drmaciver.com/2009/12/i-want-one-meelyun-sentences/</link>
		<comments>http://www.drmaciver.com/2009/12/i-want-one-meelyun-sentences/#comments</comments>
		<pubDate>Sun, 06 Dec 2009 20:18:17 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[computational linguistics]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.drmaciver.com/?p=3707</guid>
		<description><![CDATA[I&#8217;ve been planning to do some work on my term extractor to make it a bit smarter. It&#8217;s currently a rule based system on top of various machine learning tools. This is perfectly legitimate, but it&#8217;s starting to hit the limitations of that approach. I&#8217;d like to experiment with a more intelligent approach using machine [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been planning to do some work on my <a href="http://github.com/DRMacIver/term-extractor">term extractor</a> to make it a bit smarter. It&#8217;s currently a rule based system on top of various machine learning tools. This is perfectly legitimate, but it&#8217;s starting to hit the limitations of that approach. I&#8217;d like to experiment with a more intelligent approach using machine learning more directly.</p>
<p>To do this though I need a training set. My plan is to do this by building a first pass using the existing version on some sentence corpus and then editing that to taste.</p>
<p>Of course, to do this I need a decent sentence corpus. So today I set out to generate one. It was a lot fiddlier than it should have been, but I think in the end I&#8217;ve got a decent one. </p>
<p>I&#8217;m presumably not the only person to need something like this, so I&#8217;m making a largish sample of it available. It&#8217;s not hard to generate yourself but it&#8217;s something of a pain, so maybe I can save you some effort.</p>
<p>So, here you go. <a href="http://d3t3fd87rd28b5.cloudfront.net/one_meelyun_sentences.bz2">A bzipped list of one million random sentences from wikipedia</a>. </p>
<p>The format is obvious: Plain text, one sentence per line. </p>
<p>I make no guarantees about the quality of the data (there&#8217;s definitely some noise), and I definitely don&#8217;t claim this to be a statistically fair sample of Wikipedia. But initial impressions are that it&#8217;s a reasonably good list. Certainly it should be good enough for my purposes. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.drmaciver.com/2009/12/i-want-one-meelyun-sentences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Shaving yaks and finding feeds</title>
		<link>http://www.drmaciver.com/2009/12/shaving-yaks-and-finding-feeds/</link>
		<comments>http://www.drmaciver.com/2009/12/shaving-yaks-and-finding-feeds/#comments</comments>
		<pubDate>Sun, 06 Dec 2009 13:06:49 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[labour saving]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://www.drmaciver.com/?p=3704</guid>
		<description><![CDATA[So I had some interesting ideas I wanted to play with to do with keeping on top of streams of information.
Of course, I needed some streams of information to keep on top of in order to do this. I decided to go with my RSS feeds (the other obvious source being twitter).
To do that I [...]]]></description>
			<content:encoded><![CDATA[<p>So I had some interesting ideas I wanted to play with to do with keeping on top of streams of information.</p>
<p>Of course, I needed some streams of information to keep on top of in order to do this. I decided to go with my RSS feeds (the other obvious source being twitter).</p>
<p>To do that I needed a database of feed entries. So I created a small program to do that (I really should just have used feed-bag, but there were some things I wanted to tweak and integrate so I didn&#8217;t).</p>
<p>Unfortunately for whatever reason I ended up with a lot of URLs that pointed to sites or something invalid in my opml. I&#8217;m not sure offhand if this was an import problem or a problem in the google reader export. </p>
<p>So, I thought, let&#8217;s do our damnedest to correct URLs: If it points to a site do feed discovery, follow redirects, etc. It can&#8217;t be that hard.</p>
<p>Cue me getting very angry. Suffice it to say, if you do what I did and foolishly expect people on the web to follow standards you are very mistaken. </p>
<p>Anyway, after much hacking around trying to get this to work I decided to codify the various tricks into a library so you don&#8217;t have to share my anger. I&#8217;ve called this library <a href="http://github.com/DRMacIver/feedify">feedify</a>. This is very rude of me as there&#8217;s another ruby library called feedify, but given that it hit 0.0.1 in January 2008 and hasn&#8217;t been updated since then I don&#8217;t feel too bad about stomping on its namespace. </p>
<p>Additionally I&#8217;ve put up an http interface to it. If you go to http://feedify.merobe.com/feed/(some url) then it will try to find a feed associated with that URL and redirect you to it. You can also run this service yourself &#8211; it&#8217;s included in the github project.</p>
<p>This is all very rough and liable to change at the moment. If you have any bug reports of URLs it misses or gets wrong I&#8217;d be very interested to receive them.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.drmaciver.com/2009/12/shaving-yaks-and-finding-feeds/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Filtering deleted documents with PostgreSQL rules</title>
		<link>http://www.drmaciver.com/2009/11/filtering-deleted-documents-with-postgresql-rules/</link>
		<comments>http://www.drmaciver.com/2009/11/filtering-deleted-documents-with-postgresql-rules/#comments</comments>
		<pubDate>Fri, 13 Nov 2009 16:23:37 +0000</pubDate>
		<dc:creator>david</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.drmaciver.com/?p=3665</guid>
		<description><![CDATA[I&#8217;m currently working on a Mysterious Project (coming soon to an internet near you) which involves a lot of user generated content (Yes, fine, slap a 2.0 on my name and call me &#8220;still in beta&#8221;). As such, it&#8217;s got all the usual problems with user generated content. In particular it has spam. 
So, we [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m currently working on a Mysterious Project (coming soon to an internet near you) which involves a lot of user generated content (Yes, fine, slap a 2.0 on my name and call me &#8220;still in beta&#8221;). As such, it&#8217;s got all the usual problems with user generated content. In particular it has spam. </p>
<p>So, we need some sort of spam filtering in place to make sure we never show spam to users. But we don&#8217;t want to delete spam from the database &#8211; partly in case of mistakes, partly because we want to use the data for automated classification of spam. </p>
<p>Ok, this is easy enough to do. You add a flag &#8220;spam&#8221; to the table and don&#8217;t show the user anything flagged as spam.</p>
<p>The problem here is that this content gets used in all sorts of contexts, and it&#8217;s really annoying to have to add &#8220;where not spam&#8221; to every query that touches it. </p>
<p>No problem. We create a view. That&#8217;s what they&#8217;re for.</p>
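<p>As a rough sketch of that starting point (the table name &#8220;content&#8221; and the view name &#8220;visible_content&#8221; are made up for illustration, they are not the names from the real project), the flag and the view might look something like this:</p>
<pre>
-- Hypothetical names: "content" stands in for whatever table holds the user generated content.
alter table content add column spam boolean not null default false;

-- Everything that reads content goes through this view rather than the table.
create view visible_content as select * from content where not spam;
</pre>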
<p>But this is slightly annoying: basically all our access to content goes through this view, but modifications to content have to go through the original table. It would be really nice if we could have all updates and inserts going to the same thing we access the data from. &#8220;Really nice&#8221; is partly aesthetic, but there&#8217;s also a boring practical reason: we&#8217;re using an ORM (ActiveRecord in fact. Sigh), and we&#8217;d like the ORM to read from the filtered version, but we&#8217;d also like to be able to update the same objects.</p>
<p>Hang on. We&#8217;re using PostgreSQL. There&#8217;s an app&#8230; err. feature for that. </p>
<p>PostgreSQL has a feature called <a href="http://www.postgresql.org/docs/8.2/interactive/rules.html">rules</a> which allows you to change the meaning of various operations on a table or a view (views in PostgreSQL are themselves implemented with the rule system). We can use these to make our view updatable. Let&#8217;s see how.</p>
<p>We&#8217;ll start with a slightly abstracted version of the problem. Instead of thinking about spam filtering we&#8217;ll concern ourselves with deleting posts. We want to retain the old posts but not show them:</p>
<pre>
david=# create sequence post_ids;
CREATE SEQUENCE
david=# create table unfiltered_posts(id int primary key default nextval('post_ids'),
david(#                                        body text,
david(#                                        deleted boolean not null default false);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "unfiltered_posts_pkey" for table "unfiltered_posts"
CREATE TABLE
</pre>
<p>So we first create our view that will be the posts which have not been deleted:</p>
<pre>
david=# create view posts as select * from unfiltered_posts where not deleted;
CREATE VIEW
</pre>
<p>Making sure everything&#8217;s working as expected:</p>
<pre>
david=# select * from posts;
 id | body | deleted
----+------+---------
(0 rows)

david=# insert into unfiltered_posts(body, deleted) values('I like kittens', false);
INSERT 0 1

david=# insert into unfiltered_posts(body, deleted) values('I don''t like kittens', true);
INSERT 0 1

david=# select * from posts;
 id |      body      | deleted
----+----------------+---------
  1 | I like kittens | f
(1 row)

david=# select * from unfiltered_posts ;
 id |         body         | deleted
----+----------------------+---------
  1 | I like kittens       | f
  2 | I don't like kittens | t
(2 rows)
</pre>
<p>So all working as expected: unfiltered_posts marked deleted don&#8217;t show up in the view.</p>
<p>But of course this was the bit we already knew how to do. What doesn&#8217;t work is inserting into the view:</p>
<pre>
david=# insert into posts(body) values('I am the very model of a modern major general');
ERROR:  cannot insert into a view
HINT:  You need an unconditional ON INSERT DO INSTEAD rule.
</pre>
<p>Indeed it doesn&#8217;t work. But it does give us a nice hint of what to do next. </p>
<pre>
david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts(body) values(NEW.body);
CREATE RULE
</pre>
<p>So, now we can insert into the view:</p>
<pre>
david=# insert into posts(body) values('I am the very model of a modern major general');
INSERT 0 1
david=# select * from posts;
 id |                     body                      | deleted
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
</pre>
<p>This works, but I find it a bit ugly. The problem here is that you have to explicitly enumerate the fields in order for this to work. I couldn&#8217;t find a terribly satisfactory solution unfortunately. So if someone is reading this who knows more about PostgreSQL than I do, I&#8217;d love to get some hints. </p>
<p>The following <em>does</em> work as an alternative:</p>
<pre>
david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts values(NEW.*);
CREATE RULE
</pre>
<p>But the problem is that it plays badly with the defaults. If we try this we get:</p>
<pre>
david=# insert into posts(body) values('I am the very model of a modern major general');
ERROR:  null value in column "id" violates not-null constraint
</pre>
<p>The problem is that inserting null into a not-null column doesn&#8217;t replace null with the default value. It would be nice if it did as that would make this easy, but oh well (this isn&#8217;t postgresql specific behaviour. I&#8217;m not aware of any database where inserting null into a not null default blah column will work. Certainly MySQL does the same thing). You could probably make this work with a before insert or update trigger, but that&#8217;s a little gross. </p>
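<p>For what it&#8217;s worth, a trigger along those lines might look roughly like the following. This is an untested sketch rather than something from the session above: the name &#8220;fill_post_defaults&#8221; is made up, it only covers the insert case, and it assumes PL/pgSQL is installed in the database. The idea is that a before insert row trigger fires before the not-null check, so it gets a chance to swap the nulls for the defaults:</p>
<pre>
-- Hypothetical helper: replaces nulls with the column defaults before the row is stored.
create or replace function fill_post_defaults() returns trigger as $$
begin
  if NEW.id is null then
    NEW.id := nextval('post_ids');
  end if;
  if NEW.deleted is null then
    NEW.deleted := false;
  end if;
  return NEW;
end;
$$ language plpgsql;

-- Attach it to the underlying table, which is where the rule redirects the insert.
create trigger fill_post_defaults
  before insert on unfiltered_posts
  for each row execute procedure fill_post_defaults();
</pre>
<p>Note that this still duplicates the defaults, just in a different place, so it&#8217;s not obviously less gross.</p>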
<p>An alternative version which offers slightly better functionality but still requires you to explicitly enumerate the columns in the rule is the following:</p>
<pre>
david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts values(coalesce(NEW.id, nextval('post_ids')), NEW.body, coalesce(NEW.deleted, false));
CREATE RULE
david=# insert into posts(body) values('I''ve information vegetable, animal, and mineral');
INSERT 0 1
david=# select * from posts;
 id |                      body                       | deleted
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | f
</pre>
<p>This requires us to duplicate the defaults as well as the columns, which is rather annoying, but at least it works satisfactorily (note: Some of you will complain that I didn&#8217;t explicitly enumerate the columns in the insert into. This is deliberate &#8211; the view will break if I change the table structure in any interesting way. If I explicitly enumerated the column names it would instead silently do the wrong thing). </p>
<p>So, this works. We can do the same on update:</p>
<pre>

david=#   create or replace rule update_to_posts
david-#   as on update to posts
david-#   do instead
david-#      update unfiltered_posts
david-#      set id = coalesce(NEW.id, OLD.id),
david-#           body = coalesce(NEW.body, OLD.body),
david-#           deleted = coalesce(NEW.deleted, OLD.deleted)
david-#      where id = OLD.id;
CREATE RULE

david=# update posts set deleted = true where id = 4;
UPDATE 1
david=# select * from posts;
 id |                     body                      | deleted
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

david=# select * from unfiltered_posts;
 id |                      body                       | deleted
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | t
(4 rows)
</pre>
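<p>So now updating things in posts works. Note that if we try to update a post which has been filtered out of the view (such as post 4, which we just marked as deleted) nothing happens, because the view can&#8217;t see it:</p>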
<pre>
david=# update posts set body = 'kittens' where id = 4;
UPDATE 0
david=# select * from unfiltered_posts ;
 id |                      body                       | deleted
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | t
(4 rows)
</pre>
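<p>As an aside, the coalesce calls in the update rule are probably not needed. According to the PostgreSQL rules documentation, in an on update rule any column of NEW that the original statement didn&#8217;t set means the same as the OLD value, so NEW is never missing a column. The coalesce version also has the side effect that you can&#8217;t set body to null through the view. A simpler version, which I haven&#8217;t run as part of the session above, would be something like:</p>
<pre>
-- Sketch only: same rule as before, minus the coalesce calls.
create or replace rule update_to_posts
as on update to posts
do instead
   update unfiltered_posts
   set id = NEW.id,
       body = NEW.body,
       deleted = NEW.deleted
   where id = OLD.id;
</pre>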
<p>And, finally, we want to hook deletion into it. Obviously we don&#8217;t want deletion to delete things from the underlying table but instead to set their deleted flag to true:</p>
<pre>
david=# create or replace rule delete_posts
david-# as on delete to posts do instead
david-# update unfiltered_posts
david-# set deleted = true where id = OLD.id;
CREATE RULE
david=# select * from posts;
 id |                     body                      | deleted
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

david=# delete from posts where id = 3;
DELETE 0
david=# select * from posts;
 id |      body      | deleted
----+----------------+---------
  1 | I like kittens | f
(1 row)

david=# select * from unfiltered_posts;
 id |                      body                       | deleted
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  4 | I've information vegetable, animal, and mineral | t
  3 | I am the very model of a modern major general   | t
(4 rows)
</pre>
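<p>So there we have it: a view which we can insert into, update and delete from. (The delete reports DELETE 0 because the rule substitutes an update for it, and PostgreSQL only reports a row count from a rule action of the same command type as the original query; the row was flagged all the same.) Despite the slight annoyances around default values, this is definitely a really neat feature. I look forward to exploring its use. </p>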
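<p>If you lose track of which rules are attached to the view, the pg_rules system view will list them along with their definitions. A quick check, assuming the view is called posts as above:</p>
<pre>
-- Lists the user-defined rules on the posts view.
select rulename, definition from pg_rules where tablename = 'posts';
</pre>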
<p>If you want to have a play with this, I&#8217;ve created <a href="https://gist.github.com/7aa8ea79e2e6a68bc074">a gist</a> containing the table, view and rules.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.drmaciver.com/2009/11/filtering-deleted-documents-with-postgresql-rules/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
