Saturday 17:00–17:45 in LG6

The CV: A Data Scientist's View

Rui Miguel Forte


This talk has two main goals. The first is to present the CV, and a job seeker's career trajectory more generally, as a highly interesting and fertile application domain for data science. The second is to highlight a number of practical lessons and tips gleaned through years of hands-on experience in this area.


Despite the proliferation of professional networks such as LinkedIn, and the widespread use of online forms in which job seekers fill in their details when applying for a job, the CV remains relevant today. From online job boards to recruitment firms, and from personal websites to companies that remain old-fashioned in their hiring, the CV finds its way into every facet of job hunting. For the HR professional, however, the CV is cumbersome to work with. As a result, it has consumed countless man-hours, usually spent on mundane tasks. For example, candidate details from CVs are often manually copied into forms to create a structured profile in a database.

From a data scientist's perspective, the CV offers a wealth of interesting applications and opportunities for investigation. In this talk, the author will present lessons learned from over three years of working in this domain, as well as a range of applications that can arise.

Concretely, the talk will cover:

  • Introducing the CV Parsing task and its components
  • Automatically extracting text from documents with Apache Tika
  • Understanding and mitigating text extraction errors
  • Analyzing CV text with an NLP pipeline
  • Using open source tools for core NLP tasks
  • High-level CV parsing with section identification and sentence classification
  • Identifying important entities with Named Entity Recognition (NER)
  • Understanding the limitations of existing NER systems
  • Building your own NER model
  • Reproducibility in experiments
  • Building a corpus of annotated data
  • Recognizing entity relationships
  • Entity normalization
  • Discovering and extracting skills from CVs
  • Clustering CVs by word vectors
  • Identifying duplicate profiles

Though each of these topics could be an entire talk on its own, the objective here is to present the main idea of each and the role it plays within the broader context of CV parsing and interpreting job seeker data. The talk will also feature code snippets in Python and Apache Spark to give a practical foundation for some of the concepts discussed.
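
As a flavour of one topic from the list above, identifying duplicate profiles, here is a minimal sketch in plain Python. It is not taken from the talk: the choice of Jaccard similarity over word bigrams, the function names, the 0.6 threshold, and the sample CV texts are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, n=2):
    """Return the set of n-word shingles (overlapping word n-grams)."""
    words = tokens(text)
    if len(words) < n:
        # Too short for a full shingle: fall back to one tuple of all words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; treated as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_duplicates(cvs, threshold=0.6):
    """Return pairs of CV ids whose shingle sets overlap at or above threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    ids = sorted(cvs)
    sets = {cv_id: shingles(cvs[cv_id]) for cv_id in ids}
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            if jaccard(sets[left], sets[right]) >= threshold:
                pairs.append((left, right))
    return pairs


# Hypothetical CV snippets: "a" and "b" differ only in punctuation and case.
cvs = {
    "a": "Jane Doe. Data scientist with five years of experience in Python and Spark.",
    "b": "Jane Doe - Data Scientist with five years of experience in Python and Spark",
    "c": "John Smith. Mechanical engineer specialising in turbine design.",
}
print(likely_duplicates(cvs))
```

Comparing every pair of CVs is quadratic in the number of profiles, so this only suits small collections; larger collections would need some form of blocking or approximate matching before the pairwise step.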