Monday 5:10 PM–5:50 PM in Central Park East (6501a)

Semantic modeling of data science code

Evan Patterson

Audience level:
Intermediate

Description

Programming languages and libraries are proliferating in the data science community. In an effort to reduce communication barriers and enable automation and intelligent tooling, we are developing software to automatically construct language-agnostic semantic models of data science code written in Python or R. In this talk, we introduce our methods and illustrate them by example.

Abstract

As suggested by the name of Project Jupyter (“JUlia-PYThon-R”), contemporary data science is increasingly pluralistic, involving several popular programming languages and countless software packages. While a diverse and growing ecosystem is generally a boon to the field, it can make it harder for data scientists to communicate and share knowledge effectively, both with each other and with their collaborators in other fields. The same fragmentation also makes it harder to build intelligent tools for data scientists and to conduct automated meta-analyses.

We present our ongoing efforts to automatically construct semantic models of data science code, expressed in terms of general concepts and independent of any specific programming language or library. We explain and demonstrate by example the elements of our process: program analysis tools for Python and R; a fledgling Data Science Ontology; and an ontology-driven algorithm for semantic enrichment, implemented in Julia. We also suggest possible applications and future directions for this technology.
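
To give a flavor of what semantic enrichment means here, the following is a minimal sketch in Python, not the project's actual implementation (which the abstract describes as written in Julia). It assumes a tiny, hypothetical annotation table that maps library-specific functions to language-agnostic concepts; the concept names, the `Call` record, and the `enrich` function are all invented for illustration and do not come from the Data Science Ontology itself.

```python
from dataclasses import dataclass

# Hypothetical annotations: (language, package, function) -> abstract concept.
# The concept names are illustrative placeholders, not real ontology entries.
ANNOTATIONS = {
    ("python", "pandas", "read_csv"): "read-tabular-file",
    ("r", "utils", "read.csv"): "read-tabular-file",
    ("python", "sklearn.cluster", "KMeans"): "k-means",
    ("r", "stats", "kmeans"): "k-means",
}


@dataclass(frozen=True)
class Call:
    """One function call as a program analysis tool might record it (assumed shape)."""
    language: str
    package: str
    function: str


def enrich(calls):
    """Replace each annotated call with its concept; mark unannotated calls as unknown."""
    enriched = []
    for call in calls:
        key = (call.language, call.package, call.function)
        concept = ANNOTATIONS.get(key)
        if concept is None:
            concept = f"unknown:{call.package}.{call.function}"
        enriched.append(concept)
    return enriched


# Two analyses that differ in language and library but not in meaning.
python_script = [
    Call("python", "pandas", "read_csv"),
    Call("python", "sklearn.cluster", "KMeans"),
]
r_script = [
    Call("r", "utils", "read.csv"),
    Call("r", "stats", "kmeans"),
]

# Both scripts reduce to the same language-agnostic sequence of concepts.
assert enrich(python_script) == enrich(r_script)
```

In this toy version, a Python script and an R script that perform the same analysis are mapped to the same sequence of concepts, which is the sense in which the resulting model is independent of language and library. A real system would work on a richer program representation than a flat list of calls.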

Attendees will find it helpful to have a working knowledge of at least one programming language commonly used in data science, such as Python or R. All project components are available as open source software under a permissive license.
