Building an end-to-end machine learner that evolves regular expressions as features for text classification tasks with genetic programming
In a business context you frequently encounter text or document classification tasks, for example determining whether an e-mail is spam or, at ING Wholesale Banking, whether an account name belongs to a company or to a private individual. A successful approach is to train a classifier on character or word n-grams, but sometimes it also pays off to use regular expression patterns as features, for example to detect acronyms, legal forms or initials. The regular expressions we used were designed by hand, based on what we expected to find in the text. This made us wonder whether there are more regular expression patterns that we did not think of but that could be learned automatically from the data. So we set out to build an end-to-end machine learner that combines genetic programming, which builds the regular expressions, with a classifier based on which of those regular expressions match. We will show the architecture of the learner, how it is trained using a distributed (island) model, and how it performs on various data sets, both stand-alone and on top of character or word n-grams.
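
To make the starting point concrete, here is a minimal sketch of how hand-written regular expressions can be turned into binary features for an account-name classifier. The three patterns, the `PATTERNS` dictionary and the `regex_features` helper are illustrative assumptions, not the expressions or code used in this work. The resulting vector could be fed to any classifier, either alone or alongside n-gram features.

```python
import re

# Hypothetical hand-written patterns, chosen only for illustration.
PATTERNS = {
    # Two or more consecutive capitals as a separate token, e.g. "ING".
    "acronym": re.compile(r"\b[A-Z]{2,}\b"),
    # A small sample of legal forms, with optional dots, e.g. "B.V." or "NV".
    "legal_form": re.compile(
        r"\b(?:B\.?V\.?|N\.?V\.?|Ltd|Inc|GmbH)(?=\W|$)", re.IGNORECASE
    ),
    # One or more dotted initials followed by a capitalised word, e.g. "J.P. Jansen".
    "initials": re.compile(r"\b(?:[A-Z]\.\s?)+[A-Z][a-z]+"),
}


def regex_features(text):
    """Return one 0/1 feature per pattern: 1 if the pattern matches anywhere in text."""
    return [1 if pattern.search(text) else 0 for pattern in PATTERNS.values()]


if __name__ == "__main__":
    for name in ["Acme Holding B.V.", "J.P. Jansen", "ING Bank NV"]:
        print(name, regex_features(name))
```

Each feature records only whether a pattern matches, which keeps the representation independent of how the pattern was obtained: a hand-written expression and an automatically evolved one can be plugged in the same way.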