Thursday 10:50 AM–11:30 AM in Radio City (#6604)

Using embeddings to understand the variance and evolution of data science skill sets

Maryam Jahanshahi

Audience level:
Intermediate

Description

In this talk I will discuss exponential family embeddings, methods that extend the idea behind word embeddings to other data types. I will describe how we applied dynamic embeddings to our large corpus of jobs to understand how data science skill sets have changed over the last 3 years. The key takeaway is that these models can enrich the analysis of specialized datasets.

Abstract

Many data scientists are familiar with word embedding models such as word2vec, which capture semantic similarity among words and phrases in a corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models need either significant amounts of data or tuning, through transfer learning, to the domain-specific vocabulary that most commercial applications have.

In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used for advanced topic modelling on medium-sized datasets that are specialized enough to require significant modifications of a word2vec model and that contain more general data types (including categorical, count and continuous data).
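
As a rough illustration of the idea (not the implementation presented in the talk), the sketch below scores count data under a Poisson embedding model. Each item i gets an embedding vector rho_i and a context vector alpha_i, and the log rate of its count is the inner product of rho_i with the count-weighted sum of the context vectors of the other items in the same row. The function name, the array shapes and the choice of "all other items in the row" as the context are assumptions made for this example.

```python
import numpy as np

def poisson_embedding_loglik(X, rho, alpha):
    """Sum of Poisson conditional log-likelihoods, dropping the log(x!) constant.

    X     : (n_rows, n_items) array of non-negative counts
    rho   : (n_items, k) embedding vectors
    alpha : (n_items, k) context vectors
    Assumption: the context of item i in a row is every other item in that row.
    """
    # Count-weighted sum of context vectors per row, shape (n_rows, k)
    ctx_all = X @ alpha
    # Inner product with each rho_i, still including item i's own term
    full = ctx_all @ rho.T
    # Subtract item i's own contribution x_i * (rho_i . alpha_i)
    own = X * np.sum(rho * alpha, axis=1)
    eta = full - own  # natural parameter, here the log rate
    # Poisson log-density up to a constant: x * eta - exp(eta)
    return float(np.sum(X * eta - np.exp(eta)))
```

Swapping the last line for a Gaussian or Bernoulli log-density is what lets the same structure cover continuous or binary data. Fitting the model means maximizing this quantity (plus regularization) over rho and alpha, for example with a gradient-based optimizer.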

I will discuss how we implemented a dynamic embedding model using scikit-learn and TensorFlow on our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last 3 years. I will focus my results on how data science and quantitative skill sets have developed, grown and pollinated other types of jobs over time. If time allows, I will also discuss other segmentation analyses we performed, including by company type and geography.
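
To make "dynamic" concrete, here is a minimal NumPy sketch, assuming one embedding matrix per time slice tied together by a Gaussian random-walk prior so that each item's vector can only drift gradually between consecutive slices. This is not the scikit-learn and TensorFlow implementation described above; the function names, the (T, n_items, k) layout and the precision parameter `lam` are hypothetical choices for illustration.

```python
import numpy as np

def random_walk_log_prior(rho_t, lam=1.0):
    """Gaussian random-walk log-prior over time slices, up to a constant.

    rho_t : (T, n_items, k) array, one embedding matrix per time slice
    lam   : assumed precision of the drift; larger values allow less movement
    """
    # Change in every embedding between consecutive time slices
    diffs = rho_t[1:] - rho_t[:-1]
    return float(-0.5 * lam * np.sum(diffs ** 2))

def drift(rho_t):
    """Euclidean distance between each item's first and last embedding."""
    return np.linalg.norm(rho_t[-1] - rho_t[0], axis=1)
```

Adding the prior to the per-slice data likelihood gives the training objective, and ranking items by `drift` is one simple way to surface the skills whose usage changed most over the period.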

This talk is for both data science practitioners and product/business function managers, because it straddles the boundary between a technical talk and a data/results talk. I will keep discussion of the mathematical underpinnings to a minimum and focus on big-picture concepts. For data scientists, the takeaway will be a new tool for topic modelling. For product/business function managers, the takeaway will be a discussion of how to map trends from a combination of natural language and structured data. For broader audiences, the talk will also discuss how data science skills have varied across industries, across functions and over time.

I will specifically discuss the following:
