So much of data science is about understanding the context around your data. In this talk, we address how to work with messy text data by leveraging fuzzy search algorithms in Python or against a database such as PostgreSQL. We will talk specifically about fuzzy matching algorithms such as Soundex, trigram/n-gram search, and Levenshtein distance, and demonstrate use cases in an IPython notebook.
Fuzzy searching, or approximate string matching, is powerful because text data is often messy: shorthand and abbreviations are common across data sets, and voice-to-text conversion can introduce errors of its own. To make the most of our data, we want to extract as much information from it as possible. In this talk, we will explore the various approaches used in fuzzy string matching and demonstrate how they can be used as a feature in a model or as a component in your Python code. We will dive deep into how algorithms such as Soundex, trigram/n-gram search, and Levenshtein distance work and what their best use cases are. For instance, Levenshtein distance is great for real-time analytics, whereas trigram/n-gram search works best on a batch data set with appropriate indexes. Finally, we will demonstrate via live coding how to implement some of these fuzzy search algorithms using Python and/or PostgreSQL.
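To make the discussion concrete, here is a minimal pure-Python sketch of the three measures named above. The function names are our own; the Soundex version is simplified (it ignores the special handling of "h" and "w" in the full algorithm), and the trigram padding mimics the style of PostgreSQL's pg_trgm module. Production code would typically use an optimized library or the database's built-in functions instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn a into b, via the classic DP recurrence."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def soundex(name: str) -> str:
    """Simplified Soundex: keep the first letter, encode the rest as
    digits, collapse adjacent duplicates, pad/truncate to 4 characters."""
    codes = {}
    for digit, letters in zip("123456",
                              ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r")):
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    out = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out += digit
        prev = digit
    return (out + "000")[:4]


def trigrams(s: str) -> set:
    """Set of 3-character substrings, padded with spaces in the style
    of pg_trgm so word boundaries contribute trigrams too."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Share of trigrams the two strings have in common (Jaccard-style),
    the same flavor of score that pg_trgm's similarity() returns."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


print(levenshtein("kitten", "sitting"))   # 3
print(soundex("Robert"))                  # R163
print(trigram_similarity("hello", "helo"))
```

Note the trade-off hinted at in the abstract: `levenshtein` compares one pair of strings at a time, which suits real-time lookups, while trigram sets can be precomputed and indexed (as pg_trgm does with GIN/GiST indexes) to filter large batches efficiently.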