Minimum Edit Distance

Here's a straightforward translation of the recurrence on page 75 of the text directly into Python.

def minEditDistR(target, source):
   """ Minimum edit distance. Straight from the recurrence. """

   i = len(target); j = len(source)

   if i == 0:  return j
   elif j == 0: return i

   return min(minEditDistR(target[:i-1], source) + 1,      # drop the last char of target
              minEditDistR(target, source[:j-1]) + 1,      # drop the last char of source
              minEditDistR(target[:i-1], source[:j-1]) + substCost(source[j-1], target[i-1]))  # substitute (free on a match)

def substCost(x,y):
    if x == y: return 0
    else: return 2

Let's see if it works with the example from the book shown in Figure 3.26 on page 76.
>>> ref1 = "intention"
>>> ref2 = "execution"
>>> minEditDistR(ref1, ref2)
8
>>> 

Well, that works. At least it gets the right answer for the final value. Note, however, that if we wanted to inspect any intermediate values, or get the edit trace (or an alignment), we're out of luck, since there's no explicit table being built here.

Worse is that noticeable pause we got when we tried this example. This seemingly simple operation took quite a while to come up with an answer. That's due, of course, to all the repeated calculation of intermediate values during the recursion. The usual way to fix problems like this is to turn it into a dynamic programming problem: add a table (and optionally backpointers) and fill the table in a systematic bottom-up fashion.

Here's the table-based version of that, again translated into Python from the code in Figure 3.25.

def minEditDist(target, source):
    ''' Computes the min edit distance from target to source. Figure 3.25 '''
    
    n = len(target)
    m = len(source)

    distance = [[0 for i in range(m+1)] for j in range(n+1)]

    for i in range(1,n+1):
        distance[i][0] = distance[i-1][0] + insertCost(target[i-1])

    for j in range(1,m+1):
        distance[0][j] = distance[0][j-1] + deleteCost(source[j-1])

    for i in range(1,n+1):
        for j in range(1,m+1):
            distance[i][j] = min(distance[i-1][j] + 1,     # insert target[i-1]
                                 distance[i][j-1] + 1,     # delete source[j-1]
                                 distance[i-1][j-1] + substCost(source[j-1], target[i-1]))  # substitute
    return distance[n][m]
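
One gap to note: this code calls insertCost and deleteCost, which aren't defined anywhere above. Since the inner loop charges 1 for insertions and deletions, a pair of unit-cost stubs like these (an assumption consistent with the recurrence, not code taken from the figure) is enough to make it run:

def insertCost(x):
    """ Cost of inserting character x; assumed here to be 1, matching the inner loop. """
    return 1

def deleteCost(x):
    """ Cost of deleting character x; assumed here to be 1, matching the inner loop. """
    return 1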

Running that produces the same result as the earlier version, just a lot faster. And if we wanted to add backpointers or inspect the table, we could. But just how much faster is it? Let's look at Python's timing facilities to find out.

Python provides a module called timeit for exactly this kind of measurement. To use it, you first create a Timer object with two arguments: the code you want to time and some initializing code. You need the initializing code because timeit essentially creates a clean environment in which to run the test.

You can then ask for the code to be timed with the timeit method. By default, timeit runs the code 1,000,000 times and returns the aggregate time. Probably not a good idea in our case. Let's run our candidates 10 times each.

>>> import timeit

>>> timeit.Timer("minedit.minEditDistR(minedit.ref1,minedit.ref2)", "import minedit").timeit(number=10)
27.515401840209961

>>> timeit.Timer("minedit.minEditDist(minedit.ref1,minedit.ref2)", "import minedit").timeit(number=10)
0.0030419826507568359
>>> 

The short story is that the dynamic programming version is roughly 9000 times faster than the direct recurrence version. Ouch.

It turns out, however, that you can have your cake and eat it too without much additional effort. The main complaint about the recursive version is that it doesn't remember sub-problems that it has already solved. What if we could leave the original code alone (mostly) but have it remember what it's done in the past? It turns out there's a name for that --- memoization --- and Python provides a pretty easy way to do it with what are called decorators. We just need to add one line to what we have.

@memoized
def minEditDistR(target, source):
   """ Minimum edit distance. Straight from the recurrence. """

   i = len(target); j = len(source)

   if i == 0:  return j
   elif j == 0: return i

   return min(minEditDistR(target[:i-1], source) + 1,      # drop the last char of target
              minEditDistR(target, source[:j-1]) + 1,      # drop the last char of source
              minEditDistR(target[:i-1], source[:j-1]) + substCost(source[j-1], target[i-1]))  # substitute (free on a match)
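
Note that memoized isn't something Python gives us for free; it's a small decorator we have to supply ourselves (or borrow from one of the standard recipes), defined above the function it decorates. Here's one minimal sketch, assuming the decorated function's arguments are hashable, which is fine here since we're passing strings:

import functools

def memoized(func):
    """ Cache func's results, keyed by the arguments it was called with. """
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper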

This wraps some code around our function and adds a cache where previous answers are stored under the relevant arguments. If we have already computed the answer, it's simply returned; otherwise the answer is computed, cached, and then returned. Let's see what happens if we do that...

>>> import timeit
>>> timeit.Timer("minedit.minEditDistR(minedit.ref1,minedit.ref2)", "import minedit").timeit(number=1)
0.00091791152954101562
>>> timeit.Timer("minedit.minEditDist(minedit.ref1,minedit.ref2)", "import minedit").timeit(number=1)
0.00031018257141113281
>>> 

So simply memoizing the recursive version knocks the time for a single call down from nearly 3 seconds to under a millisecond. Not a bad speedup for one line of code. It's still not as fast as the table version, but it's easily in the ballpark. Note that you have to be careful when analyzing this kind of thing, though. Look what happens if we run the memoized recursive version 1000 times.

>>> timeit.Timer("minedit.minEditDistR(minedit.ref1,minedit.ref2)", "import minedit").timeit(number=1000)
timeit.Timer("minedit.minEditDistR(minedit.ref1,minedit.ref2)", "import minedit").timeit(number=1000)
0.0013098716735839844

That's way less than 1000 times our earlier .0009, so what's up? Well, since we're doing the same problem 1000 times, we're really only computing the answer the first time; the other 999 calls are just cache lookups. Not exactly what we were trying to measure.
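
If you really wanted cold-cache numbers for every call, one option (just a sketch, assuming the code lives in a module named minedit as above) is to reload the module before each timed run, so the decorator starts out with an empty cache. With importlib.reload on Python 3 (the built-in reload on Python 2 does the same job) that looks something like this:

import importlib
import timeit
import minedit

def coldRun():
    # Re-executing the module rebuilds minEditDistR, giving it a fresh, empty cache.
    importlib.reload(minedit)
    t = timeit.Timer("minedit.minEditDistR(minedit.ref1, minedit.ref2)",
                     "import minedit")
    return t.timeit(number=1)

times = [coldRun() for n in range(10)]
print(min(times))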


Last modified: Wed Mar 10 13:57:27 MST 2010