Thursday 1:30 PM–2:15 PM in Track 3 Room

MAP all the things

John Healy

Audience level:
Intermediate

Description

Embedding techniques like word2vec and doc2vec are taking over the world. An up-and-coming technique for embedding numeric data is UMAP. How would you go about applying UMAP to word data? How about text data? What about malware? In this talk we’ll learn how to MAP all the things!

Abstract

Embedding techniques are taking over the world, from word2vec for embedding words all the way to Latent Dirichlet Allocation and doc2vec for embedding documents. All of these techniques are really about turning non-numeric data into vector space data suitable for either machine learning or visualization. An up-and-coming technique for embedding numeric data is UMAP. How would you go about applying UMAP to word data? How about text data? What about malware? In this talk we’ll learn how to MAP all the things!

We’ll introduce you to a new technique called WordMAP for generating very low-dimensional word embeddings by making use of UMAP. With this technique in hand, one can generalize to a document embedding algorithm we're calling DocMAP. This approach ultimately requires only sequences of tokens and thus can apply to much broader classes of problems. We’ll demonstrate this by applying a variation of DocMAP to the problem of mapping the space of malware based on its behaviour.
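
To make "only requires sequences of tokens" concrete, here is a minimal sketch of one way token sequences could be turned into numeric rows that a reducer such as UMAP can consume. It is not the WordMAP or DocMAP algorithm itself, which this abstract does not specify. The function name `cooccurrence_rows`, the `window` parameter, and the toy behaviour traces are all hypothetical, chosen for illustration.

```python
from collections import Counter


def cooccurrence_rows(sequences, window=2):
    """Count, for each token, the tokens seen within `window` positions of it.

    Hypothetical helper: returns the sorted vocabulary and one count row
    per token, with columns in the same order as the vocabulary.
    """
    vocab = sorted({tok for seq in sequences for tok in seq})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = Counter()
    for seq in sequences:
        for i, tok in enumerate(seq):
            lo = max(0, i - window)
            hi = min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the token's own position
                    counts[(index[tok], index[seq[j]])] += 1
    rows = [[0] * len(vocab) for _ in vocab]
    for (r, c), n in counts.items():
        rows[r][c] = n
    return vocab, rows


# Toy token sequences (assumed data; they could equally be words in
# sentences or calls in a behaviour trace).
sequences = [
    ["open", "read", "close"],
    ["open", "write", "close"],
]
vocab, rows = cooccurrence_rows(sequences, window=1)
for tok, row in zip(vocab, rows):
    print(tok, row)
```

With a realistically sized corpus and the umap-learn package installed, rows like these could then be passed to something like `umap.UMAP(n_components=2).fit_transform(rows)` to obtain a low-dimensional embedding per token. The toy data above is far too small for that step to be meaningful.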

While the math behind UMAP might be challenging to some, this talk will take a more practical approach and focus on how to apply it in novel situations. If you have problems that can fit in this framework, you should come and learn how to MAP all the things!
