An online pedagogical tool for data science using Bokeh

Raghuram Thiagarajan, Neel Surya, Timothy Odom, Xu Chen

Prior knowledge:
No previous knowledge expected

Summary

Computational modules can serve as pedagogical tools to teach complex engineering concepts via enquiry-based learning (EBL). In this context, we will demonstrate an interactive, online, visualization-based module that teaches the fundamental concepts of data science using a real-world dataset. We will also discuss the use of Python packages such as Bokeh and scikit-learn to build this module.

Description

Introduction

The use of interactive visualizations to convey complex concepts is an emerging teaching paradigm that offers multiple advantages. First, students engage in active learning of the material, which leads to greater concept retention than traditional classroom instruction. Second, students can see in real time the effect of changing parameters, and so grasp the fundamental concepts through enquiry facilitated by an instructor; that is, such modules serve as hands-on computational experiments. Third, since these simulations are easier to perform than physical experiments, one can run hundreds of settings to relate the inputs and parameters of a problem to its output through real-time visualization. Thus, they can serve as effective educational supplements that educators can employ for personalized STEM education.

In this context, we have developed four Python-based, online, interactive visualization modules using the Bokeh package:

  1. A data science module for a catalysis dataset used by chemical engineers

  2. An evaporatively-cooled zero energy cooling chamber (ZECC)

  3. An infectious disease SEIR (Susceptible-Exposed-Infected-Recovered) model

  4. A simple reaction kinetics model

Description of Modules

The data science module covers five aspects:

  1. A data exploration section to understand the distribution of the datasets being used

  2. A correlation matrix section to show the correlation between all the features in the datasets

  3. A multivariable regression section to build a customized regression model

  4. An unsupervised learning section covering clustering and principal component analysis

  5. A classification section to determine how datasets are partitioned into various “classes”
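As a rough illustration of the five sections listed above, the following sketch runs each step with NumPy and scikit-learn. It is not the module's actual code: the data are synthetic stand-ins for the catalysis dataset, and the estimator choices (linear regression, k-means, logistic regression) and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): three made-up features, one target.
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)
labels = (y > 0).astype(int)  # made-up binary class derived from the target

# 1. Data exploration: per-feature summary statistics.
summary = {"mean": X.mean(axis=0), "std": X.std(axis=0)}

# 2. Correlation matrix over all features plus the target (4 x 4).
corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)

# 3. Multivariable regression on a chosen set of features.
reg = LinearRegression().fit(X, y)
r2 = reg.score(X, y)

# 4. Unsupervised learning: PCA projection and k-means clustering.
pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 5. Classification, evaluated on a held-out split.
X_tr, X_te, c_tr, c_te = train_test_split(
    X, labels, test_size=0.25, random_state=0
)
clf = LogisticRegression().fit(X_tr, c_tr)
accuracy = clf.score(X_te, c_te)

print(f"R^2 = {r2:.3f}, held-out accuracy = {accuracy:.3f}")
print("explained variance ratio:", pca.explained_variance_ratio_)
```

In an interactive setting, each numbered step would typically be recomputed whenever the student changes a widget value, such as the selected features or the number of clusters.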

The reaction kinetics example describes a sequential reaction, A → B → C, a model system representative of many real chemical processes such as the conversion of biomass to green fuels. Students can manipulate the rate and order of each step and immediately observe the effect on the concentration profiles.

A ZECC is a sustainable and cost-effective way of storing agricultural and horticultural produce in warm, low-income regions. The module allows students to vary the simulation parameters for various locations in the world and immediately evaluate whether their design is suitable for that region.

The SEIR model uses ordinary differential equations (ODEs) to describe the spread of infectious diseases, such as COVID-19, through a population. Students can vary parameters governing both the intrinsic characteristics of the disease (infection and fatality rates) and human interventions to contain the spread (social distancing, vaccination, etc.), and observe the emergent behavior of the model to understand the rationale behind recent public health policies.
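Both the reaction kinetics and SEIR modules come down to integrating a small ODE system for a given set of parameters. The sketch below shows one way this could be done in plain Python with a fixed-step fourth-order Runge-Kutta integrator. The function names, parameter values, and the basic SEIR form (without fatality or vaccination terms) are illustrative assumptions, not the equations used in the deployed modules.

```python
def rk4(f, y0, t_end, dt):
    """Integrate dy/dt = f(t, y) from t = 0 with classical fixed-step RK4."""
    t, y = 0.0, list(y0)
    ts, ys = [t], [list(y)]
    for _ in range(round(t_end / dt)):
        k1 = f(t, y)
        k2 = f(t + dt / 2, [yi + dt / 2 * k for yi, k in zip(y, k1)])
        k3 = f(t + dt / 2, [yi + dt / 2 * k for yi, k in zip(y, k2)])
        k4 = f(t + dt, [yi + dt * k for yi, k in zip(y, k3)])
        y = [yi + dt / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += dt
        ts.append(t)
        ys.append(list(y))
    return ts, ys


def series_reaction(k1, k2, n1=1.0, n2=1.0):
    """Right-hand side for A -> B -> C with rate constants and orders."""
    def rhs(t, y):
        a, b, c = y
        r1 = k1 * max(a, 0.0) ** n1  # rate of A -> B
        r2 = k2 * max(b, 0.0) ** n2  # rate of B -> C
        return [-r1, r1 - r2, r2]
    return rhs


def seir(beta, sigma, gamma):
    """Right-hand side for a basic SEIR model (population fractions)."""
    def rhs(t, y):
        s, e, i, r = y
        new_exposed = beta * s * i / (s + e + i + r)
        return [-new_exposed,
                new_exposed - sigma * e,
                sigma * e - gamma * i,
                gamma * i]
    return rhs


# Illustrative parameter values (assumptions, not fitted to any data).
ts_kin, kin = rk4(series_reaction(k1=1.0, k2=0.5), [1.0, 0.0, 0.0],
                  t_end=10.0, dt=0.01)
ts_epi, epi = rk4(seir(beta=0.5, sigma=0.2, gamma=0.1),
                  [0.99, 0.01, 0.0, 0.0], t_end=200.0, dt=0.1)

peak_b = max(state[1] for state in kin)       # peak intermediate B
peak_inf = max(state[2] for state in epi)     # peak infected fraction
print(f"peak [B] = {peak_b:.3f}, peak infected fraction = {peak_inf:.3f}")
```

Rerunning `rk4` with new arguments each time a parameter changes, and redrawing the resulting curves, is the kind of update loop that lets students see the effect of a parameter immediately.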

All of these deployed modules can be accessed on a website hosted by Lehigh University.