Levenshtein Distance Tutorial

Anton Haugen
8 min readApr 12, 2021

--

I just finished a novel written by Patricia Lockwood called No One is Talking About This, a genre-defying book that begins embedded in the social media experiences of an extremely online protagonist and ends with the protagonist’s sisters birth of a child with proteus syndrome, a one in a billion chance. To diagnose this syndrome prenatally, the doctors must perform a exome sequencing of her DNA. The doctors tell the family will be like looking for “a single misspelling in a single word on a single page of a very long book.”

It reminded me of one of the uses of Levenshtein Distance, or Edit Distance, a problem I’ve enjoyed solving in the past with help from online tutorials, because the algorithm behind Levenshtein Distance are used not just for spelling correct, but for comparing two sequences of DNA. While reading a very long book may induce yawns in a human reader, it is common practice for a programming algorithm. Levenshtein Distance is one way a computer can parse huge amounts of text.

Because I have been doing coding interviews in the past week, I wanted to understand some basic approaches for these problems, the principles, ideas, and strategies that can be employed to solve problems that one has never solved before. While by solving numerous hackerrank problems one can intuit these principles, it’s also great to have a framework in which to situate each new problem.

One programmer I talked to told me about the Window approach found in the book Cracking the Coding Interview, the approach of forming substrings or subsequences. I learned more from this lecture provided by MIT open courses.

The “five easy steps” for a dynamic programming problem is broken down as follows:

1.)Define subproblems and count how many there are

2.) Guess part of the solution

3.) Start to ontroduce a recursive step, where you solve one of the subproblems you already know how to solve. Find the runtime, while ignoring the recursion.

4.)Recurse or Bottom Up. Either way, make sure the loop is acyclic and has a finite endpoint. To compute the runtime, you would multiply the runtimes together.

5.) Double check that you can solve the original problem, either as one of your subproblems or a combination of the subproblems.

In this lecture, he details three types of subproblems for strings: for string x this would be suffixes x[i:], prefixes x[:i], or substrings x[i:j]. While prefixes and suffixes have a runtime n, because you would have to loop back, the substring subproblem would have a runtime of n².

For parenthetisization, or the optimal evaluation of associative expression, you are finding the cheapest ways of performing some form of computation. Suppose you must perform multiplication between a column vector, a row vector, and a column vector. A column times a row produces a matrix, while a row times a column produces a scalar, or a real number. To parenthesize the column and the row first would result in a matrix times a column. To parenthesize the row and the second column would result in the first column times the scalar product. The second approach would result in less runtime because a matrix has n*n values times a column of values, while the second approach is one number times a column vector.

Now let’s try applying this theory to the popular Levenshtein problem, which is as follows.

Levenshtein Distance or Edit distance: given two strings, x and y, what is the cheapest possible sequence of character edits to convert x to y. Character edits comprise inserting characters, deleting characters, and replacing characters.

In the real world, cost can be variable for character edits, allowing one to consider character proximity to other characters (i.e. an ‘a’ is closer to an ‘s’ on a QWERTY keyboard, and would have a cost lower than an ‘a’ to an ‘l’). In addition, for DNA sequences, some mutations are more likely than others, a C to a G is more likely than a C to an A so a C to a G would have a lower cost than a C to an A and edit distance can give you a measure of how similar two strings of DNA are. However, in this example we will assign each type of change as having cost 1.

For our first attempt at the problem let us try thinking of an algorithm that could tell us the minimum number of changes to get from ‘hop’ to ‘hot’. Just by looking at the two strings you’ll see we will have a minimal cost of 1, a replacement, similarly for ‘hop’ to ‘pop’. For ‘hop’ to ‘hope’, we would also have a minimal cost of 1, an insertion. But about for ‘hop’ to ‘pope’? We would have a cost of 2, a replacement and then an insertion. For ‘hopeful’ to ‘hop’, we would have a cost of 4, four deletions.

As for the algorithm, you can see that we could iterate through the characters of the two strings and that there would then be four options, the aforementioned insert, replace, and delete, which each cost 1 and the ‘match’ which costs 0.

For our first attempt at the algorithm, let’s try this function

def levenshtein(s1,s2):  cost=0  for i in range(len(s2)):     if s1[i]!=s2[i]:         cost+=1  return cost

While this approach could work for ‘hop’ to ‘pop’ or ‘hop’ to ‘hoe’, it would not work for ‘hop’ to ‘hope’ because they have two different lengths. We would therefore need to track both word’s lengths in the algorithm.

def levenshtein(s1,s2):  costs=[]
for j in range(len(s2)):
cost=0 for i in range(len(s1)): if s1[i]!=s2[i]: cost+=1 costs.append(cost)
return min(costs)

But wait a minute, this may work for ‘hope’ to ‘hopeful’, but try running ‘crash’ and ‘cash’ and you’ll see that it returns four. I think this means we need to change our approach. Remember there are four possible actions, match and the three other actions which cost 1. We need a way to track substring edits to the main string. Let’s try solving ‘hopeful’ to an empty string. We know that it will be 7, but within this bigger problem we also have the edit distance for ‘hope’ and ‘hop’. We also know that in the worst case scenario, we would have to make as many changes as the longest word, if there are no matching characters. For example, ‘automaton’ to ‘pixel’ would require nine changes as would ‘pixel’ to ‘automaton’. Let’s include this consideration of distance, word length, and of our four possible actions in our next crack at a solution.


def levenshtein(s1,s2)
#here we make sure s2 is the longer of the two strings
if len(s1)>len(s2):
s1, s2= s2, s1
#like in 'hope' to 'hopeful', we need to track substrings
distances=range(len(s1)+1)
#here's our range object to track 'hopeful' lengths
for j in range(len(s2)):
distances_ = [j+1]
for i in range(len(s1)):
if s1[i] == s2[j]:
distances_.append(distances[i])

else:
cost_steps=(distances[i], distances[i + 1], distances_[-1])
distances_.append(1 + min(cost_steps))
distances = distances_
return distances[-1]

To help visualize this, you can think of a matrix of len(s1) by len(s2). Our origin is the first cell in the matrix and our end point. To change a string of c to an empty string, we would make one deletion, but to change a string of c to c, we would do nothing.

  '' c
'' 0 1
c 1 0

In the case of ‘cat’ to ‘hat, we can see the problem in this matrix

 ''hat
''0123
c 1123
a 2212
t 3321

As you can see, for each step in our algorithm above, we are returning the bottom right cell as 1 plus the minimum value of its three adjacent values (left, left-diagonal, and top) when the characters are not matching. When there’s a match, we use the distance value corresponding to our current index position in the shorter string and the distanct it refers to in the shorter word’s distance range object. You can see this demonstrated in the matrix of ‘cat’ to cataract.

 ''cataract
''012345678
c 101234567
a 210123456
t 321012345

When we reach i=3 when j = 3 (the ‘cat’ x ‘cat’ cell), we are still using the distance object from i=2 and j=3 or ‘ca’ to ‘cat’ where the matrix looks as follows:


1. Our distances_ object contains the number of changes of the longest word to an empty string and iteratesup to the full length of 'cataract' (8) while our distances object will contain a range from 0 to the length of 'cat', inclusive (0,1,2,3).
2. With j=0 and i=0, when 'c' and 'c' match, we append distances[0], which equals 0, to distances_.
3. With j=0 and i=1, we get 'c' to 'ca' and need to find which of the three actions are the most effective. In this step we find the minimum cost of insert, delete, and replace by using distances object (0,1,2,3) and distances_ object, [1,0]. distances[i] = 1 (the top cell) and distances[i+1]= 2 (the left cell) while distances_[-1]=0 (the diagonal). The minimum is distances[-1] so we add 1 to this value and append it to our distances_ object, which now contains [1,0,1]
4. With j=0 and i=2, we end out loop of the shorter word and now find the edit distance for 'c' to 'cat'. Like in step three we find the minimum of the three options. Now, distances[i] = 2 and distances[i+1]=3, so again distances_[-1]=1 would be the min cost we'd add one to. distances_=[1,0,1,2]5. For j=1, 'ca' to the iterations of cats, we now use the distances_ from j=0 as our distances object. In j=1, distances=[1,0,1,2] instead of [0,1,2,3]. Distances will always be the length of the shorter word plus 1, because we add a value for each character.

Let’s try fast forwarding to j=3, or ‘cata’ to ‘cat’:

1. When j=3, distances_=[4] and our distances object contains [3, 2, 1,0] or our distances_ from 'cat' to 'cat'.
2. When j=3 and i=0, s2[3] or 'a' does not equal 'c' we append 1 + the minimum of distances[i] = 3, distances[i+1]=2, and distances_[-1]=4. distances_ now contains[4,3]
3. When j=3 and i=1, s2[3] or 'a' does equal s1[1], so we append distances[i] to distances_. distances_ now contains [4,3,2]
4. When j=2 and i=2, s2[3] does not equal s1[2], so we again append the sum of 1 and the minimum. distances[2]=1, distances[3]=0, and distances_[-1] =2, so we append the sum of 1 and distances[3].
5. We reassign distances to equal distances_ [4,3,2,1]

Levenshtein distance took me a long time to understand, so I hope this tutorial helps you out!

def levenshtein(s1,s2)
#here we make sure s2 is the longer of the two strings
if len(s1)>len(s2):
s1, s2= s2, s1
#like in 'hope' to 'hopeful', we need to track substrings
distances=range(len(s1)+1)
#here's our range object to track 'hopeful' lengths
for j in range(len(s2)):
distances_ = [j+1]
for i in range(len(s1)):
if s1[i] == s2[j]:
distances_.append(distances[i])

else:
cost_steps=(distances[i], distances[i + 1], distances_[-1])
distances_.append(1 + min(cost_steps))
distances = distances_
return distances[-1]

--

--

Anton Haugen
Anton Haugen

No responses yet