Sunday 2:45 PM–3:30 PM in Room 1

Using Exploratory Data Analysis to Discover Patterns in Image and Document Collections

Mehrdad Yazdani

Audience level:
Novice

Description

Exploratory Data Analysis (EDA) is a key set of procedures for summarizing a dataset. In this talk we will develop an EDA procedure for large collections of documents and images, such as photo albums, emails, and articles. We will show how features from NLP and deep neural networks are used, and introduce novel visualization techniques for large image collections using PyImagePlot.

Abstract

Introduction to EDA

  • Why EDA?
  • Challenges with unstructured datasets: text and image collections
  • Feature embedding of text and image collections using machine learning
  • Visualization techniques for high-dimensional data: PCA, t-SNE, PyImagePlot
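
As a rough illustration of the PCA step listed above, here is a minimal sketch that projects a feature matrix onto its top two principal components using NumPy's SVD. The function name `pca_project` and the synthetic `embeddings` array are assumptions made for this example; they are not taken from the talk.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project the rows of `features` onto the top principal components."""
    X = np.asarray(features, dtype=float)
    # PCA requires centered data: subtract the per-feature mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Synthetic stand-in for document or image embeddings (hypothetical data).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 50))
coords = pca_project(embeddings)  # one 2D point per item, ready to scatter-plot
```

Each row of `coords` can then be drawn as a point in a scatter plot, which is one common way to get a first look at high-dimensional features.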

Text case study: using EDA to investigate the gender pronoun gap in newspapers

  • About the text corpus and the problem: gender pronoun gap in language
  • Define summary statistics and extract features from the text collection
  • Visualize summary statistics and features using PCA
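
One possible summary statistic for this case study can be sketched as follows. The pronoun lists, the function names, and the gap definition (masculine minus feminine counts, divided by their sum) are assumptions chosen for illustration, not necessarily the statistics used in the talk.

```python
import re
from collections import Counter

# Illustrative pronoun lists (an assumption, not the talk's exact vocabulary).
MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def pronoun_counts(text):
    """Return (masculine, feminine) pronoun counts for one document."""
    # Tokenize on runs of letters so that "there" never matches "her".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    masculine = sum(counts[w] for w in MASCULINE)
    feminine = sum(counts[w] for w in FEMININE)
    return masculine, feminine

def pronoun_gap(text):
    """Gap in [-1, 1]: positive means more masculine pronouns."""
    masculine, feminine = pronoun_counts(text)
    total = masculine + feminine
    return (masculine - feminine) / total if total else 0.0

sample = "He said his plan was ready. She agreed with him."
print(pronoun_counts(sample))  # (3, 1)
print(pronoun_gap(sample))     # 0.5
```

Computing this value per article yields one number per document, which can be aggregated by date or section before plotting.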

Image case study: using EDA to find patterns in a collection of images

  • Finding patterns in large collections of images
  • Extracting deep learning features from images using SkiCaffe
  • Visualizing image collections using PyImagePlot
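
The general idea behind plotting an image collection, placing each thumbnail at a 2D position derived from its features, can be sketched with plain NumPy. This is not PyImagePlot's API; the function `image_scatter`, its parameters, and the toy data are hypothetical and assume equally sized square grayscale thumbnails.

```python
import numpy as np

def image_scatter(thumbnails, coords, canvas_size=512):
    """Paste square grayscale thumbnails onto a blank canvas at 2D positions."""
    thumbs = np.asarray(thumbnails, dtype=float)  # shape (n, t, t)
    xy = np.asarray(coords, dtype=float)          # shape (n, 2), e.g. PCA output
    t = thumbs.shape[1]
    # Rescale coordinates to pixel offsets that keep every thumbnail in bounds.
    lo = xy.min(axis=0)
    span = xy.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero when all points coincide
    offsets = ((xy - lo) / span * (canvas_size - t)).astype(int)
    canvas = np.zeros((canvas_size, canvas_size))
    for thumb, (x, y) in zip(thumbs, offsets):
        # Later thumbnails overwrite earlier ones where they overlap.
        canvas[y:y + t, x:x + t] = thumb
    return canvas

# Toy data: three white 8x8 thumbnails at hypothetical 2D positions.
thumbnails = np.ones((3, 8, 8))
positions = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.2]]
canvas = image_scatter(thumbnails, positions, canvas_size=64)
```

The resulting `canvas` array can be shown with any image viewer; in practice the positions would come from a 2D projection of the extracted features.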