Sunday 17:00–17:45 in D105 Audimax

Data Science for Digital Humanities: Extracting meaning from Images and Text

Hendrik Heuer

Audience level:
Intermediate

Description

Analyzing millions of images and enormous text sources with machine learning and deep learning techniques is straightforward in the Python ecosystem. Powerful machine learning algorithms and interactive visualization frameworks make it easy to conduct and communicate large-scale experiments. Exploring this data can yield new insights for researchers, journalists, and businesses.

Abstract

The focus of this talk is extracting meaning from data and making powerful methods usable by everybody. With the advent of big data, new approaches and technologies are needed to tackle the increase in volume, variety, and velocity of data. This talk illustrates how analysts, journalists, and scientists can benefit from exploratory data analysis and data science.

Imagine a journalist who wants to cross-reference the names on the guest list of a parliament with online information about lobbyists to identify which party meets with which company. A business analyst might want to quantify which topics certain customers are discussing on Twitter or what their sentiment towards a particular product is. Exploratory data analysis and data science techniques enable researchers, journalists, and businesses to ask bigger and more ambitious questions than anybody before them and to leverage the abundance of information that is available today.
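The journalist's cross-referencing task could be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not the method presented in the talk: the function names, the data layout (visitor and party pairs, a name-to-company mapping), and all sample names are assumptions made up for this example, and real data would need far more careful name matching.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase a name and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def cross_reference(guest_list, lobbyists):
    """Map each party to the set of companies whose lobbyists visited it.

    guest_list: iterable of (visitor_name, host_party) pairs (assumed layout)
    lobbyists:  dict mapping lobbyist name -> company (assumed layout)
    """
    by_name = {normalize(name): company for name, company in lobbyists.items()}
    meetings = defaultdict(set)
    for visitor, party in guest_list:
        company = by_name.get(normalize(visitor))
        if company is not None:  # visitors who are not known lobbyists are skipped
            meetings[party].add(company)
    return dict(meetings)


# Entirely made-up illustrative data
guest_list = [
    ("Alex  Example", "Party A"),
    ("jo sample", "Party B"),
    ("Sam Nobody", "Party A"),
]
lobbyists = {"Alex Example": "Acme Corp", "Jo Sample": "Globex"}

result = cross_reference(guest_list, lobbyists)
```

Exact matching after normalization is the simplest possible choice here; fuzzy matching or entity resolution would be a natural next step for messy real-world name lists.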

The Digital Humanities are located at the intersection of computing and the disciplines of the humanities. They can benefit from the massive-scale automated analysis of content like images and text. Researchers, analysts, and journalists can quantify the state of society from publicly available data like tweets. It is now possible to construct an almost complete map of our civilization just by looking at the tags and GPS coordinates of Flickr photos.

A vast Python ecosystem supports this work, including machine learning frameworks like scikit-learn, dedicated deep learning frameworks like Keras, and topic modeling tools like gensim. All of these tools are open source and can be integrated into powerful data science pipelines. Rather than training neural networks from scratch, pretrained features for text and images can be adapted for fast results.
