If you’re working with user inputs, you’re likely to come across typos. A popular LeetCode problem, Levenshtein Distance (also called Edit Distance), asks coders to write an algorithm that finds the minimum number of changes needed to turn one word into another, where each change can only insert, delete, or substitute a single letter. For example, ‘horse’ requires 3 changes to become ‘ros’ (1. rorse, 2. rose, 3. ros).
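The distance itself can be computed with a short dynamic-programming routine. Here is a minimal sketch in plain Python (the function name `levenshtein_distance` is my own; PySpark ships its own built-in version, used later in this post):

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein_distance('horse', 'ros'))  # 3
```

Each cell of the DP table only depends on the previous row, so the sketch keeps just two rows in memory.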
One cool way I saw Levenshtein Distance used this past week was in conjunction with a join between PySpark dataframes.
Imagine you have a dataframe listing every animal in every American zoo. Each row records the animal’s name, its age, and the kind of animal. However, the number of distinct animal types in this dataframe is much larger than your list of recognized animal names, meaning the dataframe contains a large number of typos.
# The original snippet was truncated; the extra rows here are illustrative,
# added so the example runs end to end.
values = [('Monkey', 10), ('Monkay', 36), ('Mnky', 123),
          ('Elephant', 48), ('Elaphant', 16), ('Hippo', 54)]
zoo = spark.createDataFrame(values, ['Animal', 'age'])
Although not a perfect solution, one way to correct this issue is to join the dataframe to a dataframe of correctly spelled animal types, using a Levenshtein distance condition as the ‘key’ for the join.
from pyspark.sql.functions import levenshtein
from pyspark.sql.types import StringType

options = spark.createDataFrame(['Monkey', 'Elephant', 'Hippopotamus'], StringType())
results = zoo.join(options, levenshtein(zoo['Animal'], options['value']) < 5, 'left')
results.show()
As you’ll see above, the Levenshtein Distance algorithm made finding the correct types a breeze for this large dataset, though it did miss the abbreviation ‘Hippo’. Since this is a common abbreviation for Hippopotamus and is probably as likely to occur as the full name, a future coding solution could include ‘Hippo’ in the dictionary.
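One way that refinement might look, sketched in plain Python rather than PySpark: extend the dictionary with common abbreviations that map back to a canonical name, and pick the single closest entry instead of every entry within the threshold. The `CANONICAL` mapping and `best_match` helper below are my own illustrative names:

```python
# Recognized names, extended with a common abbreviation.
# Each key maps back to the canonical spelling.
CANONICAL = {'Monkey': 'Monkey', 'Elephant': 'Elephant',
             'Hippopotamus': 'Hippopotamus', 'Hippo': 'Hippopotamus'}

def levenshtein_distance(a, b):
    # Standard dynamic-programming edit distance, two rows at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def best_match(name, max_dist=5):
    """Return the canonical name closest in spelling to `name`,
    or None if nothing is within max_dist edits."""
    dist, match = min((levenshtein_distance(name, k), k) for k in CANONICAL)
    return CANONICAL[match] if dist < max_dist else None

print(best_match('Mnky'))   # Monkey
print(best_match('Hippo'))  # Hippopotamus
```

Taking the single best match also avoids the duplicate rows that a threshold-based join can produce when a typo sits within 5 edits of more than one dictionary entry.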