Welcome to PyData Global 2022

This conference is packed with events, talks, and virtual get-togethers.

Discord

Discord is the town square of PyData Global 2022.

Gather.town

Join the PyData Community Gather Workspace for networking and social events.

Schedule


Generate Actionable Counterfactuals using Multi-objective Particle Swarm Optimization

Day 1
Talks
Niranjan G S, Shashank Shekhar  |  2022/12/01 08:00:00 UTC - 2022/12/01 08:30:00

Counterfactual explanations (CFE) are methods that explain a machine learning model by giving an alternate class prediction of a data point with some minimal changes in its features. In this talk, we describe a counterfactual (CF) generation method based on particle swarm optimization (PSO) and show how we can have greater control over the proximity and sparsity properties of the generated CFs.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=91



Managing Python Dependencies at scale

Day 1
Talks
Jarek Potiuk  |  2022/12/01 08:00:00 UTC - 2022/12/01 08:30:00

This talk is about the approach we’ve taken at Apache Airflow to manage dependencies at scale for a project that is the most popular data orchestrator in the world, consists of ~80 independent packages, and has more than 650 dependencies in total (all without losing our sanity).

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=75s



Measurement of Trust in AI

Day 1
Talks
Shashank Shekhar  |  2022/12/01 08:30:00 UTC - 2022/12/01 09:00:00

For enterprises to adopt and embrace AI in their transformational journey, it is imperative to build Trustworthy AI, so that the AI products and solutions that are built, delivered, and acquired are responsible enough to drive trust and wider adoption. We look at AI Trust as a function of 4 key constructs: Reliability, Safety, Transparency, and Responsibility and Accountability. These core constructs are the pillars that drive AI trust in our products and solutions. In this talk, I will explain how to enable each core construct and will articulate how they can be measured in some real-world use cases.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=2025



ARCH/GARCH Models Tour

Day 1
Talks
Kalyan Prasad  |  2022/12/01 08:30:00 UTC - 2022/12/01 09:00:00

When the goal of your study is to analyze and forecast volatility, ARCH/GARCH models come into the picture to solve complicated time series problems.
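
As a rough illustration of what such models capture (not material from the talk), the standard GARCH(1,1) recursion makes today's conditional variance depend on yesterday's squared shock and yesterday's variance. The sketch below simulates it with the standard library only; the function name `simulate_garch11` and the parameter values are assumptions chosen for the example.

```python
import random


def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate n returns from a GARCH(1,1) process (illustrative parameters).

    sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1]
    eps[t]    = sqrt(sigma2[t]) * z[t],  with z[t] ~ N(0, 1)
    """
    rng = random.Random(seed)
    # Start at the unconditional variance, which exists when alpha + beta < 1.
    sigma2 = [omega / (1.0 - alpha - beta)]
    eps = [rng.gauss(0.0, 1.0) * sigma2[0] ** 0.5]
    for _ in range(1, n):
        # Volatility clustering: a large shock raises the next period's variance.
        next_var = omega + alpha * eps[-1] ** 2 + beta * sigma2[-1]
        sigma2.append(next_var)
        eps.append(rng.gauss(0.0, 1.0) * next_var ** 0.5)
    return eps, sigma2


returns, variances = simulate_garch11(500)
```

In practice the parameters would be estimated from data with a dedicated library rather than fixed by hand as they are here.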

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=1815s



Expressive and fast dataframes in Python with polars

Day 1
Talks
Juan Luis Cano Rodríguez  |  2022/12/01 09:00:00 UTC - 2022/12/01 09:30:00

The pandas library is one of the key factors that enabled the growth of Python in the Data Science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, nowadays several open-source projects claim to improve pandas in various ways, either by bringing it to a distributed computing setting (Dask), accelerating its performance with minimal changes (Modin), or offering a slightly different API that solves some of its shortcomings (Polars).

In this talk we will dive into Polars, a new dataframe library backed by Arrow and Rust that offers an expressive API for dataframe manipulation with excellent performance.

If you are a seasoned pandas user willing to explore alternatives, or a beginner wondering what all the fuss about these new dataframe libraries is, this talk is for you!

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=3917



Data Validation for Feature pipelines: Using Great Expectations and Hopsworks

Day 1
Talks
Moritz Meister  |  2022/12/01 09:00:00 UTC - 2022/12/01 09:30:00

Have you ever trained an awesome model just to have it break in production because of a null value? At its core, a feature store needs to provide reliable features to data scientists to build and productionize models. So how can we avoid garbage in, garbage out situations? Great Expectations is the most popular library for data validation, and so the two are a natural fit. In this talk we will touch briefly upon different Python data validation libraries such as Pydantic and Pandera, but then dive deeper into Great Expectations’ concepts and how you can leverage them in feature pipelines powering a feature store.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=3711s



Full-stack Machine Learning for Data Scientists

Day 1
Workshops
Hugo Bowne-Anderson  |  2022/12/01 09:30:00 UTC - 2022/12/01 11:00:00

One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how do you move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we present a hands-on introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows. Participants will learn how to take common machine learning models, such as those from scikit-learn, XGBoost, and Keras, and productionize them using Metaflow.

We’ll present a high-level overview of the 8 layers of the ML stack: data, compute, versioning, orchestration, software architecture, model operations, feature engineering, and model development. We’ll present a schematic as to which layers data scientists need to be thinking about and working with, and then introduce attendees to the tooling and workflow landscape. In doing so, we’ll present a widely applicable stack that provides the best possible user experience for data scientists, allowing them to focus on parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure.

You can find the companion repository for the workshop here: https://github.com/outerbounds/full-stack-ML-metaflow-tutorial.

Join: https://numfocus-org.zoom.us/j/86917942169?pwd=LzFnc1RsS0ZRdGhSZFN5ZGJTNmQrQT09

Watch: https://numfocus-org.zoom.us/rec/play/3FN9BaDdHeAqLO5O_qBmWb3w2k8_mDEWIhozv4JaoLKQN0zhSOrTolAra_xtwr62VTOPUtz_MmF-_uje.ybiOKzIp5KT5fS0m?continueMode=true&_x_zm_rtaid=5PvSTUKvRo2mcYSl0MELBA.1669915913700.51590fdcae43689e515219d25aed7aa0&_x_zm_rhtaid=270



Inequality Joins in Pandas with Pyjanitor

Day 1
Talks
Samuel Oranyeli  |  2022/12/01 10:00:00 UTC - 2022/12/01 10:30:00

Inequality joins are less frequent than equality joins, but are useful in temporal analytics and even in some conventional applications. Pyjanitor fills this gap in pandas with an efficient implementation.
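
To make the idea concrete, here is a minimal plain-Python sketch of an inequality join that pairs every left value with every right value strictly greater than it. It is only an illustration of the concept under simple assumptions (two lists of numbers, a hypothetical helper named `less_than_join`); it is not Pyjanitor's API or its implementation.

```python
from bisect import bisect_right


def less_than_join(left, right):
    """Return all (l, r) pairs with l < r for two lists of numbers."""
    right_sorted = sorted(right)
    pairs = []
    for l in left:
        # Index of the first value in right_sorted that is strictly greater than l.
        start = bisect_right(right_sorted, l)
        pairs.extend((l, r) for r in right_sorted[start:])
    return pairs


pairs = less_than_join([1, 5, 9], [4, 5, 10])
# pairs == [(1, 4), (1, 5), (1, 10), (5, 10), (9, 10)]
```

Sorting one side and using binary search avoids comparing every left row with every right row, which is the kind of saving an efficient inequality join aims for.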

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=7296



Interpretable and realistic generative models in data science? Likelihood-free Bayes’ says yes!

Day 1
Talks
Narendra Mukherjee  |  2022/12/01 10:00:00 UTC - 2022/12/01 10:30:00

Are you fascinated by the real-life images or text produced by deep generative models but cannot interpret their underlying data generation process or see how they can be applied to other problems? I will talk about generative simulations built using knowledge of the problem domain that can produce realistic data in a variety of scenarios. This talk will be a Bayesian thinking exercise cum data science case study of product star rating timeseries from an online marketplace (like Amazon.com) – I will show how we use recent advances in likelihood-free Bayesian inference together with a detailed simulation of an online marketplace to directly infer factors involved in how customers purchase and rate products.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=7255s



Data-Centric AI Cookbook: let’s prep that data

Day 1
Talks
Marysia Winkels  |  2022/12/01 10:30:00 UTC - 2022/12/01 11:00:00

Data Centric AI is about iterating on data instead of models to improve machine learning predictions. Why is this trend relevant now? Is this yet another hype in data science? Or has something really changed? And most of all — how is this relevant to you?

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=9046



Explaining Why You have a Favorite Cereal

Day 1
Talks
Gatha  |  2022/12/01 10:30:00 UTC - 2022/12/01 11:00:00

It’s crunchy! It’s sweet! Maybe it is the presence of the nuts or their absence. There are various features that make you favor a particular cereal. Now surely, if we modeled the consumer ratings for cereals, some features would be considered more important than others. After all, feature engineering is one of the most critical steps in modeling. But after the model is up and running, what if we tweak the features just to see how much meddling can affect the preference? This process is called post-hoc feature attribution and it seeks to interpret the model behavior. In this talk, let us spoon through the interpretability of ML models.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=9023s



Supercharge your training on TPUs with PyTorch Lightning

Day 1
Talks
Kaushik Bokka  |  2022/12/01 11:00:00 UTC - 2022/12/01 11:30:00

This session will discuss scaling your PyTorch models on TPUs with zero code changes using PyTorch Lightning. We’ll cover training on TPUs from beginning to end, including setting them up, TPU architecture, frequently faced issues, and debugging techniques. You’ll learn about the experience of using PyTorch Lightning to make working with TPUs and the PyTorch XLA library easier and explore best practices for getting started with training large-scale models on TPUs.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=10864



Knowing what you don’t know matters: Uncertainty-aware model rating

Day 1
Talks
Malte Tichy  |  2022/12/01 11:30:00 UTC - 2022/12/01 12:00:00

Meaningful probabilistic models do not only produce a “best guess” for the target, but also convey their uncertainty, i.e., a belief in how the target is distributed around the predicted estimate. Business evaluation metrics such as mean absolute error, a priori, neglect that unavoidable uncertainty. This talk discusses why and how to account for uncertainty when evaluating models using traditional business metrics, using standard Python tooling. The resulting uncertainty-aware model rating satisfies the requirements of statisticians because it accounts for the probabilistic process that generates the target. It should please practitioners because it is based on established business metrics. It appeases executives because it allows concrete quantitative goals and non-defensive judgements.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=12611



Detecting anomalous sequences using text processing methods

Day 1
Talks
Liron Faybish  |  2022/12/01 11:30:00 UTC - 2022/12/01 12:00:00

Hello wait you talk see to can’t all my in!

Sounds weird, right?! Detecting abnormal sequences is a common problem.
Join my talk to see how this problem involves BERT, Word2vec, and autoencoders, and learn how you can also apply it to information security.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=12600s



sktime – python toolbox for time series: pipelines and transformers

Day 1
Tutorials
Franz Kiraly, Benedikt Heidrich, Mirae L Parker, Martin Walter  |  2022/12/01 11:30:00 UTC - 2022/12/01 13:00:00

sktime is a widely used scikit-learn compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the PyData/NumFOCUS stack. sktime has a rich framework for building pipelines across the multiple learning tasks that it supports, including forecasting, time series classification, regression, and clustering. This tutorial explains basic and advanced sktime pipeline constructs, and introduces in detail the time series transformer, which is the main component in all types of pipelines. It is a continuation of the sktime introductory tutorial at PyData Global 2021.

Join: https://numfocus-org.zoom.us/j/89277835827?pwd=QlFxWFpLWjJiYVpkNU1VczI4eW91QT09

Watch: https://numfocus-org.zoom.us/rec/share/8PO_HT_6q5jzDn9xPGmJK0HQRYU-4zNAfS4NXzgTMJ9CXCSFQIdb8anD6enQOEP8.rqrhaMISvkS4e5EE



Building Data Products in a Lakehouse using Trino, dbt, and Dagster

Day 1
Tutorials
Przemysław Denkiewicz  |  2022/12/01 11:30:00 UTC - 2022/12/01 13:00:00

Build data pipelines using Trino and dbt, combining heterogeneous data sources without having to copy everything into a single system. Manage access to your data products using modern and flexible security principles from authentication methods to fine-grained access control. Run and monitor your data pipelines using Dagster.

Join: https://numfocus-org.zoom.us/j/86481679219?pwd=bUNEYTl3d1hmekxqc0ZDMExkZm9rdz09

Watch: https://drive.google.com/file/d/1NLUuWsKDA2V4gfoRraoNuQtAluNBrmpL/view



Teaching papermill new tricks: creating custom engines for flexible notebook execution

Day 1
Talks
Eduardo Blancas  |  2022/12/01 12:00:00 UTC - 2022/12/01 12:30:00

This talk will show you how to build papermill plugins. As motivating examples, we’ll describe how to customize papermill for notebook debugging and profiling.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=14424



Do You Follow What I’m Explaining? A Practitioner’s Guide to Opening the AI Black Box for Humans

Day 1
Talks
Kilian Kluge  |  2022/12/01 12:00:00 UTC - 2022/12/01 12:30:00

Numerous tools generate “explanations” for the outputs of machine-learning models and similarly complex AI systems. However, such “explanations” are prone to misinterpretation and often fail to enable data scientists or end-users to assess and scrutinize “an AI.” We share best practices for implementing “explanations” that their human recipients understand.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=14429s



The Beauty of Zarr

Day 1
Talks
Sanket Verma  |  2022/12/01 12:30:00 UTC - 2022/12/01 13:00:00

In this talk, I will talk about Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. This talk presents a systematic approach to understanding and implementing Zarr by showing how it works, the need for using it, and a hands-on session at the end. Zarr is based on an open technical specification, making implementations across several languages possible. I will mainly talk about Zarr’s Python implementation and show how it beautifully interoperates with the existing libraries in the PyData stack.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=16323



Data Storytelling through Visualization

Day 1
Talks
Marysia Winkels  |  2022/12/01 12:30:00 UTC - 2022/12/01 13:00:00

Data is everywhere. It is through analysis and visualization that we are able to turn data into information that can be used to drive better decision making. Out-of-the-box tools will allow you to create a chart, but if you want people to take action, your numbers need to tell a compelling story. Learn how elements of storytelling can be applied to data visualization.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=16200s



Start asking your data “Why?” – A Gentle Introduction To Causal Inference

Day 1
Talks
Eyal Kazin  |  2022/12/01 13:00:00 UTC - 2022/12/01 13:30:00

Correlation does not imply causation. It turns out, however, that with some simple, ingenious tricks one can unveil causal relationships within standard observational data, without having to resort to expensive randomised controlled trials. Learn how to make the most out of your data, avoid misinterpretation pitfalls and draw more meaningful conclusions by adding causal inference to your toolbox.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=18030s



Data Science Project Patterns that Work

Day 1
Talks
Ian Ozsvald  |  2022/12/01 13:30:00 UTC - 2022/12/01 14:00:00

Getting your team to choose good projects, reliably derisk them, research ideas, productionise the solutions and create positive change in an organisation is hard. Really hard. I’ll present patterns that work for these 5 critical project stages. This guidance is based on 15 years of experience writing AI and DS solutions and 5 years giving both strategic guidance and training on how to get to success. You’ll come away from the session with new techniques to help your team deliver successfully and increase their confidence in the roadmap, new thoughts on how to diagnose your model’s quality and new ideas to make a positive difference in your organisation.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=19875



Crowd-Kit: A Scikit-Learn for Crowdsourced Annotations

Day 1
Talks
Evgeniya  |  2022/12/01 13:30:00 UTC - 2022/12/01 14:00:00

The talk includes the presentation of Crowd-Kit – an open-source computational quality control library – followed by its demonstration. Crowdsourced annotations in most cases require post-processing due to their heterogeneous nature; raw data contains errors, is biased and non-trivial to combine. Crowd-Kit provides various methods like aggregation, uncertainty, and agreements, which could be used as helping tools in getting an interpretable result out of data labeled with the help of crowdsourcing.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=19894s



Keynote – Ada Nduka Oyom

Day 1
Keynotes
Ada Nduka Oyom  |  2022/12/01 14:00:00 UTC - 2022/12/01 15:00:00

Ada is the Founder of She Code Africa (SCA), a non-profit organisation focused on empowering young girls and women in Africa through technical skills, and Co-founder of Open Source Community Africa, one of the largest communities for open-source enthusiasts, advocates and experts across Africa. She’s currently engaged with Google as the Ecosystem community manager for Sub-Saharan Africa.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=21702



Algorithms at Scale: Raising Awareness on Latent Inequities in Our Data

Day 1
Talks
Dr. Lalitha Krishnamoorthy  |  2022/12/01 15:00:00 UTC - 2022/12/01 15:30:00

In today’s digital age, we use machine learning (ML) and artificial intelligence (AI) to solve problems and improve productivity and efficiency. Yet, there’s risk in delegating decision-making power to algorithmically based systems: their workings are often opaque, turning them into uninterpretable “black boxes.”

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=25313



Trying No GIL on Scientific Programming

Day 1
Talks
Cheuk Ting Ho  |  2022/12/01 15:00:00 UTC - 2022/12/01 15:30:00

Recently Sam Gross, the author of the nogil fork of Python 3.9, demonstrated that the GIL can be removed. For scientific programs that run heavy CPU-bound processes, this could be a huge performance improvement. In this talk, we will see if this is true and compare the nogil version to the original.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=25335s



Bayesian Decision Analysis

Day 1
Tutorials
Allen Downey  |  2022/12/01 15:00:00 UTC - 2022/12/01 16:30:00

This tutorial is a hands-on introduction to Bayesian Decision Analysis (BDA), which is a framework for using probability to guide decision-making under uncertainty. I start with Bayes’s Theorem, which is the foundation of Bayesian statistics, and work toward the Bayesian bandit strategy, which is used for A/B testing, medical tests, and related applications. For each step, I provide a Jupyter notebook where you can run Python code and work on exercises. In addition to the bandit strategy, I summarize two other applications of BDA, optimal bidding and deriving a decision rule. Finally, I suggest resources you can use to learn more.
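
For readers unfamiliar with the Bayesian bandit strategy, the sketch below shows one common form of it, Thompson sampling with Beta posteriors, using only the standard library. It is an illustration under assumed inputs (three arms with hypothetical success rates, a helper named `thompson_sampling`), not code from the tutorial's notebooks.

```python
import random


def thompson_sampling(true_rates, rounds=2000, seed=1):
    """Play a Bernoulli bandit by sampling from each arm's Beta posterior."""
    rng = random.Random(seed)
    # Beta(1, 1) prior for every arm, stored as [successes + 1, failures + 1].
    posteriors = [[1, 1] for _ in true_rates]
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible success rate per arm and play the best draw.
        samples = [rng.betavariate(a, b) for a, b in posteriors]
        arm = samples.index(max(samples))
        reward = rng.random() < true_rates[arm]
        # Bayesian update: a success increments a, a failure increments b.
        posteriors[arm][0 if reward else 1] += 1
        pulls[arm] += 1
    return pulls, posteriors


# Hypothetical success rates for three arms; the last arm is the best one.
pulls, posteriors = thompson_sampling([0.1, 0.5, 0.9])
```

Because arms are chosen in proportion to how likely they are to be best, most pulls end up concentrated on the strongest arm while weaker arms are still explored occasionally.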

Join: https://numfocus-org.zoom.us/j/81618582619?pwd=ODcwUkVGanVkSkEyMys2VVRSVTlVQT09

Watch: https://numfocus-org.zoom.us/rec/share/qigNn_NCmnL6o9kGUi5NiiP18Nq0kre9nw_FhXMxLnsp78K7eQAfUAg6YKNe4rT3.fqvI1Q_r1_iWdfcS



Anomaly Detection on Streaming Data in Python using Bytewax and River

Day 1
Tutorials
Zander  |  2022/12/01 15:00:00 UTC - 2022/12/01 16:30:00

Bytewax is an open source, Python-native framework and distributed processing engine for processing data streams that makes it easy to build everything from pipelines for anonymizing data to more sophisticated systems for fraud detection, personalization, and more. In this tutorial, we will cover how you can use Bytewax and the Python library River to build an online machine learning system that detects anomalies in IoT data from streaming systems like Kafka and Redpanda. This tutorial is for data scientists, data engineers, and machine learning engineers interested in machine learning and streaming data. At the end of the tutorial session you will know how to:
– run a streaming platform like Kafka or Redpanda in a Docker container
– develop a Bytewax dataflow
– run a River anomaly detection algorithm to detect anomalous data

The tutorial material will be available via a GitHub Repo and the content will be covered in roughly the timeline shown below.

0-10min – Introduction to stream processing and online machine learning
10-30min – Setup streaming system and prepare the data
30-60min – Write the Bytewax dataflow and anomaly detector code
60-90min – Tune the anomaly detector and run the Bytewax dataflow successfully.

Join: https://numfocus-org.zoom.us/j/81351614841?pwd=MFZRTXNVNFJML0htUXd0cEp1d09IZz09

Watch: https://numfocus-org.zoom.us/rec/share/9d3H1AE8Y3DZCKVbW-hYr2cABswKGgCMg-LEw6Ag74IvBVa4Eh52Q3uf38plNHwg.GO9aS-HMNEl1JkEq



Executives at PyData

Day 1
Community Events & Sponsor Sessions
Ian Ozsvald, Lauren Oldja, Douglas Squirrel  |  2022/12/01 15:00:00 UTC - 2022/12/01 17:00:00

Executives at PyData is a facilitated discussion session for leaders on the challenges around designing and delivering successful projects, organizational communication, product management and design, hiring, and team growth.

Join: https://numfocus-org.zoom.us/j/87259453482?pwd=dHZXbVhIbjlIbUNaeEU1RmFoQVkvZz09

Watch: https://numfocus-org.zoom.us/rec/share/f_zFeY5NjhoJIedWFu06-6wSUL91jKSG0qkQ5ntvkczLLuJKKgkx_t9o7GgTzJNi.4KmjE9lxftqoSWFd



Deploying Dask

Day 1
Talks
Matthew Rocklin  |  2022/12/01 15:30:00 UTC - 2022/12/01 16:00:00

Dask is a framework for parallel computing in Python.
It’s great, until you need to set it up.

Kubernetes? Cloud? HPC? SSH? YARN/Hadoop even?
What’s the right deployment technology to choose?

After you set it up, a new set of problems arises:

How do you install software across the cluster?
How do you secure network access?
How do you access secure data that needs credentials?
How do you track who uses it and constrain costs?
When things break, how do you track them down?

There exist solutions to these problems in open source packages like dask-kubernetes, helm charts, dask-cloudprovider, and dask-gateway, as well as commercially supported products like Coiled, Saturn, QHub, AWS EMR, and GCP Dataproc. How do we choose?

This talk describes the problem faced by people trying to deploy any distributed computing system, and tries to construct a framework to help them make decisions on how to deploy.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=27070s



An Evolving Jupyter Notebook

Day 1
Talks
Rosio Reyes  |  2022/12/01 16:00:00 UTC - 2022/12/01 16:30:00

The Jupyter ecosystem has been undergoing many changes in the past few years. While JupyterLab has been embraced by many, there are still many active users of Jupyter Notebook. With that in mind, Jupyter developers have been gearing up for the release of the updated Notebook 7, based on JupyterLab components, as outlined in Jupyter Enhancement Proposal #79. This brings significant changes to Notebook 6, for which the upcoming Notebook 6.5 is intended to be the end-of-life release, and users installing Notebook will soon receive a version of the project that may disrupt their workflows. To give users time to transition to the updated codebase, the NbClassic project has been introduced. NbClassic is the Jupyter Server extension implementation of the classical notebook. NbClassic has also become the owner of the static assets for the classical notebook, and Notebook 6.5 depends on NbClassic to provide those.
The aim of this talk is to:
1. Reflect on the changes to the Jupyter ecosystem with the introduction of NbClassic and Notebook 7.
2. Address some questions that may come up about NbClassic and Notebook 6.5, as well as some of those that may come up once Notebook 7 is released.
3. Showcase, with a demo, how easily users can work with the different front-ends: NbClassic, Notebook 7, and JupyterLab.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=28929



Create text classifiers in a few hours using the open-source, no-code Label Sleuth system

Day 1
Talks
Yannis Katsis  |  2022/12/01 16:00:00 UTC - 2022/12/01 16:30:00

Domain experts often need to create text classification models; however, they may lack the ML or coding expertise to do so. In this talk, we show how domain experts can create text classifiers without writing a single line of code through the open-source, no-code Label Sleuth system, which combines an intuitive labeling UI with active learning techniques and integrated model training functionality. Finally, we describe how the system can also benefit more technical users, such as data scientists and developers, who can customize it for more advanced usage.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=28785s



OpenTeams’ AMA with Travis Oliphant, Lalitha Krishnamoorthy & Fatma Tarlaci

Day 1
Community Events & Sponsor Sessions
Travis Oliphant, Lalitha Krishnamoorthy & Fatma Tarlaci  |  2022/12/01 16:00:00 UTC - 2022/12/01 16:30:00

We’re on a global mission to make open source software thrive and be more sustainable—from supporting open source contributors in their career paths with our Open Source Professional Network (OSPN) to helping organizations transform their business with support from our vetted network of enterprise solution architects (ESA Network) to helping our clients select the right open source software stack for their business challenge by leveraging our AI-driven scoring system. Please join us during Sponsor Open Hours to learn more and ask us anything about open source.

Join: https://numfocus-org.zoom.us/j/88281360503?pwd=cFdsaGd4N3FoQWRDdlJHZmxSM0JaUT09

Watch: https://numfocus-org.zoom.us/rec/share/a6ANzTpAJCey4ri13lrPJbqhyeJ4R4spd5ChAd43iZzCUbtQLeZOHkd1WtBfErhl.Hiy84Oy9x8gNeBlj



PyScript and Data Science: a love story

Day 1
Talks
Fabio Pliger  |  2022/12/01 16:30:00 UTC - 2022/12/01 17:00:00

PyScript has brought change to the Python and PyData ecosystem, making it much easier to execute Python in the browser and opening the road to many possibilities that were not available before. The talk will explore what has happened since we presented it and how PyScript can change the way we do Data Science and many other things.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/pPgic2V7oWg?t=30658



On copies and views: updating pandas’ internals (a.k.a. “Getting rid of the SettingWithCopyWarning”)

Day 1
Talks
Joris Van den Bossche  |  2022/12/01 16:30:00 UTC - 2022/12/01 17:00:00

Pandas’ current behavior on whether indexing returns a view or copy is confusing, even for experienced users. But it doesn’t have to be this way. We can make this aspect of pandas easier to grasp by simplifying the copy/view rules, and at the same time make pandas more memory-efficient. And get rid of the SettingWithCopyWarning.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=30765s



Keynote – Thomas Dohmke

Day 1
Keynotes
Thomas Dohmke  |  2022/12/01 17:00:00 UTC - 2022/12/01 18:00:00

AI is the future of software development

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: Coming soon



Mischief Managed: What hackers can do on your Jupyter instance

Day 1
Talks
Joseph Lucas  |  2022/12/01 18:00:00 UTC - 2022/12/01 18:30:00

Many Python data professionals work daily in JupyterLab or Notebook instances. What can a hacker do with access to that system? In this presentation, I will introduce the threat model and show why Jupyter instances are valuable targets. Next, I will demonstrate several post-exploitation activities that someone may try to perform on systems hosting Jupyter instances. We will conclude with some defensive strategies to minimize the likelihood and impact of these activities. This talk will help data scientists and information technology professionals better understand the perspective of potential attackers operating in Jupyter environments to improve defensive awareness and behavior.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=502



Parallelization of code in Python for beginners

Day 1
Talks
Cheryl Roberts  |  2022/12/01 18:00:00 UTC - 2022/12/01 18:30:00

Stuck with long-running code that takes too long to complete, if it ever does? Learn to think strategically about parallelizing your workflows, including the characteristics that make a workflow a good candidate for parallelization as well as the options in Python for executing parallelization. The talk eschews PySpark or other big data platforms.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=35815s



Text to Data: Make Your Code Malleable, Not Brittle

Day 1
Talks
David Barrett, Martha L Escobar-Molano  |  2022/12/01 18:30:00 UTC - 2022/12/01 19:00:00

Extracting the highly valuable data from unstructured text often results in hard-to-read, brittle, difficult-to-maintain code. The problem is that using regular expressions directly embedded in the program control flow does not provide the best level of abstraction. We propose a query language (based on the tuple relational calculus) that facilitates data extraction. Developers can explicitly express their intent declaratively, making their code much easier to write, read, and maintain.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=2256



Single node shared memory comes to dask

Day 1
Talks
Martin Durant  |  2022/12/01 17:30:00 UTC - 2022/12/01 18:00:00

The Ray project has shown that having a shared memory facility greatly helps in certain compute problems, particularly where the job can be performed on a single large machine as opposed to a cluster. We present preliminary work showing that Dask can also achieve the same benefits.

Join: https://youtu.be/qk0B64Ku2Q8

Watch: https://www.youtube.com/watch?v=qk0B64Ku2Q8&t=37659s



Keynote – Embracing multi-lingual data science

Day 1
Keynotes
Hadley Wickham  |  2022/12/01 19:00:00 UTC - 2022/12/01 20:00:00

RStudio recently changed its name to Posit to reflect the fact that we’re already a company that does more than just R. Come along to this talk to hear a few of the reasons that we love R, and to learn about some of the open source tools we’re working on for Python.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=3984

View Details


Reactive data processing in Python

Day 1
Talks
Adrian Kosowski  |  2022/12/01 20:00:00 UTC - 2022/12/01 20:30:00

Machine Learning models designed to work with streaming systems make decisions on new data points as they arrive. But there is a downside: model decisions can’t be easily changed later when the model is updated with fresher data, user feedback, or freshly tuned hyperparameters. This is often a blocker for anomaly detection, recommender systems, process mining, and human-in-the-loop planning.

To deal with this, we’ll demonstrate design patterns to easily express reactive data processing logic. We will use Pathway, a scalable data processing framework built around a Python programming interface. Pathway is battle-tested with operational data in enterprise, including graphs and event streams in real-world supply chains, and is now launching as open-core.

You will leave the talk with a thorough understanding of the practical engineering challenges behind reactive data processing with a Machine Learning angle, and the steps needed to overcome these challenges.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=7605

View Details


Running Apache Airflow at Scale

Day 1
Talks
Jean-Martin Archer, Michael Petro  |  2022/12/01 20:00:00 UTC - 2022/12/01 20:30:00

Apache Airflow is a foundational component of data platform orchestration at Shopify. In this talk, we’ll dive into the many performance and reliability challenges we’ve encountered running Airflow at Shopify’s scale, our custom tooling, and the new multi-instance architecture we rolled out.

Join: https://www.youtube.com/watch?v=exHKcj5_kQw&ab_channel=PydataMedia

Watch: https://www.youtube.com/watch?v=exHKcj5_kQw&t=428s

View Details


Apache Airflow at Scale: Let’s Discuss

Community Events & Sponsor Sessions
Join the team from Shopify for this open discussion.  |  2022/12/01 20:30:00 UTC - 2022/12/01 21:30:00

Apache Airflow is a foundational component of data platform orchestration at Shopify. Following the main talk, this session is scheduled for you to ask questions and discuss running Airflow at scale with Jean-Martin Archer, Staff Data Engineer at Shopify, and Michael Petro, Data Engineer at Shopify.

Join: https://numfocus-org.zoom.us/j/85801544685?pwd=MVNKOFMrd2hRR2tlcFFXMUZOUUQvQT09

Watch: Coming soon

View Details


Urdu poems to Shakespearean English – Machine Translation

Day 1
Talks
Sidra Effendi  |  2022/12/01 20:30:00 UTC - 2022/12/01 21:00:00

All languages are rich in prose and poetry. A lot of the literature is inaccessible because of a lack of understanding of that language. It is often difficult to appreciate a simple translation of a poem due to gaps in cultural knowledge. A poem translated in the style of an author familiar to the reader might help to both add cultural context for the reader and capture the essence of the poem itself.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=9474

View Details


Lightning Talks – Dale Tovar, Kacper Łukawski, Kurt Schelfthout, Richard Lee, Allan Campopiano, Eyal Kazin, Ziheng Wang, Caroline Arnold

Day 1
Talks
Dale Tovar, Kacper Łukawski, Kurt Schelfthout, Richard Lee, Allan Campopiano, Eyal Kazin, Ziheng Wang, Caroline Arnold  |  2022/12/01 20:30:00 UTC - 2022/12/01 22:00:00

Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.

Join: https://www.youtube.com/watch?v=exHKcj5_kQw&ab_channel=PydataMedia

Watch: https://www.youtube.com/watch?v=exHKcj5_kQw&t=2149s

View Details


Using feedback loops to tune predictive models in a video ad marketplace

Day 1
Talks
Emily Hopper  |  2022/12/01 21:00:00 UTC - 2022/12/01 21:30:00

For video advertisers, precisely hitting their ad performance goals is critical. Undershooting on campaign viewability objectives means spending money on ads that nobody watches, while overshooting them can mean vastly reducing the available ad slots. At JW Player, we combine predictive models with PID controllers to tune decision thresholds and deliver the maximum possible reach to our advertisers while hitting their goals.
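
To make the idea of a feedback loop on a decision threshold concrete, here is a minimal sketch of a PID controller driving a toy simulation. It is not JW Player's implementation: the class `PIDController`, the gains, and the plant model in `observed_viewability` (which assumes calibrated scores uniformly distributed on [0, 1]) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative, hypothetical gains)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, target: float, measured: float, dt: float = 1.0) -> float:
        error = target - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def observed_viewability(threshold: float) -> float:
    """Toy plant model (assumption): predicted scores are calibrated and
    uniform on [0, 1], so serving only scores >= threshold yields an
    average viewability of (1 + threshold) / 2."""
    return (1.0 + threshold) / 2.0


def tune_threshold(target: float, steps: int = 100) -> float:
    """Repeatedly measure viewability and let the PID output set the threshold."""
    pid = PIDController(kp=0.5, ki=0.5, kd=0.0)
    threshold = 0.0
    for _ in range(steps):
        measured = observed_viewability(threshold)
        # Clamp the controller output to a valid score threshold.
        threshold = min(1.0, max(0.0, pid.update(target, measured)))
    return threshold


if __name__ == "__main__":
    print(round(tune_threshold(0.7), 4))
```

In this toy model a lower threshold serves more impressions (more reach) at lower average viewability, so the controller settles on the lowest threshold that still meets the target.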

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=10916

View Details


Data Prep for Graphs

Day 1
Talks
Paco Nathan  |  2022/12/01 21:30:00 UTC - 2022/12/01 22:00:00

Data science practitioners have a saying that 80% of their time gets spent on data prep. Often this involves tools such as Pandas and Jupyter. Graph Data Science is similar, except the data prep techniques are highly specialized and computationally expensive. Moreover, data prep for graphs is required before commercial tools such as graph databases or visualization can be used effectively. This talk shows examples of data prep for graphs. A progressive example illustrates the challenges plus techniques that leverage open source integrations with the PyData stack: Arrow/Parquet, PSL, Ray, Keyvi, Datasketch, etc.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=13005

View Details


The 10 commandments of reliable data science

Day 1
Talks
Isaac Slavitt  |  2022/12/01 22:00:00 UTC - 2022/12/01 22:30:00

Data science as a professional discipline is still in its infancy, and our field lacks widespread technical norms around project organization, collaboration, and reproducibility. This is painful both for practitioners and their end users because disorganized analysis is bad analysis, and bad analysis costs money and wastes time. This talk presents ten principles for correct and reproducible data science inheriting from software engineering’s seven decades of hard-earned lessons as well as numerous experiences with data science teams at organizations of all sizes. We motivate these principles by looking at some hard truths about data science “in the wild.”

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=14809

View Details


Revolutionizing the Big Data Age With Compute over Data

Day 1
Talks
David Aronchick  |  2022/12/01 22:00:00 UTC - 2022/12/01 22:30:00

Introducing a new project, Compute over Data (Bacalhau), to run any computation on decentralized data. No need to move large datasets & all languages/data are supported. If you can run Docker/WASM, you’re in the game! Bacalhau is a decentralized public computation network that takes a job and moves it near where the data is stored, including across a decentralized server network that stores data and runs jobs inside it. Bacalhau runs the job near where data lives and eliminates data management for the user.

Join: https://www.youtube.com/watch?v=exHKcj5_kQw&ab_channel=PydataMedia

Watch: https://www.youtube.com/watch?v=exHKcj5_kQw&t=7408s

View Details


The Dask at Hand: Using Dask to Speed up the High Quality Transit Areas dataset for the CA Open Data Portal.

Day 1
Talks
Tiffany Chu  |  2022/12/01 22:30:00 UTC - 2022/12/01 23:00:00

Where are CA’s frequent, high quality transit corridors? The CA Public Resources Code defines them, but identifying them requires continued access to General Transit Feed Specification (GTFS) data and fairly complex geospatial processing. The Integrated Travel Project within Caltrans tackles this by leveraging the combined powers of Dask and Python to make this dataset publicly available and updated monthly on the CA open data portal.

Join: https://www.youtube.com/watch?v=gVqshlX4aW0

Watch: https://youtu.be/7uKc8RsZgR8?t=16620

View Details


Object Detection with KerasCV

Day 1
Talks
Lucas Wood  |  2022/12/01 22:30:00 UTC - 2022/12/01 23:00:00

KerasCV offers a complete set of APIs to train your own state-of-the-art, production-grade object detection model. These APIs include object-detection-specific data augmentation techniques, models, and COCO metrics. This talk covers how to train a RetinaNet on your own dataset using KerasCV.

Join: https://www.youtube.com/watch?v=exHKcj5_kQw&ab_channel=PydataMedia

Watch: https://www.youtube.com/watch?v=exHKcj5_kQw&t=9330s

View Details


BastionAI: Towards an Easy-to-use Privacy-preserving Deep Learning Framework

Day 2
Talks
Daniel Huynh  |  2022/12/02 08:00:00 UTC - 2022/12/02 08:30:00

We present BastionAI, a new framework for privacy-preserving deep learning leveraging secure enclaves and Differential Privacy. We provide promising first results on fine-tuning a BERT model on the SMS Spam Collection Data Set within a secure enclave with Differential Privacy. The library is available at https://github.com/mithril-security/bastionai.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=41

View Details


Level up your Jupyter Notebooks with VS Code

Day 2
Tutorials
Sarah Kaiser (She/Her)  |  2022/12/02 08:00:00 UTC - 2022/12/02 09:30:00

Visual Studio Code is one of the most popular editors in the Python and data science communities, and the extension ecosystem makes it easy for users to customize their workspace for the tools and frameworks they need.

Jupyter notebooks are one such popular tool, and there are some really great features for working in notebooks that can reduce context switching, enable multi-tool workflows, and utilize powerful Python IDE features in notebooks.

This tutorial is geared for all Jupyter Notebook users, who either have interest in or are regularly using VS Code.

Participants will learn how to use some of the best VS Code features for Jupyter Notebooks, as well as a bunch of other tips and tricks to run, visualize and share your notebooks in VS Code.

Some familiarity with Jupyter Notebooks is required, but experience with VS Code is not necessary.

Materials and sample notebooks for the tutorial will be hosted on GitHub, which participants will be able to launch in their browser in the VS Code editor with GitHub Codespaces with no local setup.

Participants who have VS Code installed locally are also encouraged to open one of their own notebooks and try out the features as we go along.

Join: https://numfocus-org.zoom.us/j/82841113669?pwd=YWgxQzl3U3JUdVhLTS9JNWY2Zjg4QT09

Watch: https://numfocus-org.zoom.us/rec/share/9e_h8e8nPA-YGATbXMSBQjdWd5xGxwQuZGacMjEjPScPSmjlsHy4NFzjVd-68Bex.5CUsZjKCsa4JY7bc

View Details


ML in Production – What does “Production” even mean?

Day 2
Talks
Dean Pleban  |  2022/12/02 08:30:00 UTC - 2022/12/02 09:00:00

We like talking about production – one famous, but probably wrong statement about it is “87% of data science projects never make it to production”.

While giving a talk to a group of up-and-coming data scientists, a question that surprised me came up:

When you say “production”, what exactly do you mean?

Buzzwords are great, but all the cool kids know what production is, right? Wrong.

In this talk, we’ll define what production actually means. I’ll present a first-principles, step-by-step approach to thinking about deploying a model to production. We’ll talk about challenges you might face in each step, and provide further reading if you want to dive deeper into each one.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=1901s

View Details


Don’t Stop ’til You Get Enough – Hypothesis Testing Stop Criterion with “Precision Is The Goal”

Day 2
Talks
Eyal Kazin  |  2022/12/02 09:00:00 UTC - 2022/12/02 09:30:00

In hypothesis testing, the stopping criterion for data collection is a non-trivial question that puzzles many analysts. This is especially true with sequential testing, where demands for quick results may lead to biased ones.

I show how the belief that Bayesian approaches magically resolve this issue is misleading and how to obtain reliable outcomes by focusing on sample precision as a goal.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=3655s

View Details


Responsible AI – What, Why, How and Future!

Day 2
Talks
Dr. Sonal Kukreja  |  2022/12/02 09:00:00 UTC - 2022/12/02 09:30:00

Mostly, people relate Artificial Intelligence to progress, intelligence and productivity. But with this comes unfair decisions, biases, human workforce being replaced, lack of privacy and security. And to make matters worse, a lot of these problems are specific to AI. This indicates that the rules and regulations in place are inadequate to deal with them. Responsible AI comes into play in this situation. It seeks to resolve these problems and establish AI system responsibility. In this talk I am going to talk about What is Responsible AI, Why is it needed, How it can be implemented, What are the various frameworks for Responsible AI and What is the Future?

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=65s

View Details


Supercharging your pandas workflows with Modin

Day 2
Talks
Alejandro Herrera  |  2022/12/02 09:30:00 UTC - 2022/12/02 10:00:00

Data practitioners are typically forced to choose between tools that are either easy to use (pandas) or highly scalable (Spark, SQL, etc.). Modin, an open source project originally developed by researchers at UC Berkeley, is a highly scalable, drop-in replacement for pandas.

This talk will give an overview of Modin and practical examples on how to use it to effortlessly scale up your pandas workflows.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=5807s

View Details


Why we do ML model retraining wrong, and how to do better

Day 2
Talks
Emeli Dral  |  2022/12/02 10:00:00 UTC - 2022/12/02 10:30:00

Machine learning models degrade with time. You need to update and retrain them regularly. However, the decision on the maintenance approach is often arbitrary, and the models are simply retrained on a schedule or after every new batch. This can lead to suboptimal performance or wasted resources. In this talk, I will discuss how we can do better: from estimating the speed of the model decay in advance to constructing a proper evaluation set.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=7319s

View Details


Building Large-scale, Localized Language Models: From Data Preparation to Training and Deployment to Production.

Day 2
Talks
Miguel Martínez, Meriem Bendris  |  2022/12/02 10:00:00 UTC - 2022/12/02 10:30:00

Recent advances in natural language processing demonstrate the capability of large-scale language models (such as GPT-3) to solve a variety of NLP problems in a zero-shot setting, shifting from supervised fine-tuning to prompt engineering/tuning.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=3713s

View Details


Working session for the Bayesian Python Ecosystem

Day 2
Workshops
Oriol Abril Pla  |  2022/12/02 10:00:00 UTC - 2022/12/02 12:00:00

There is a rich ecosystem of libraries for Bayesian analysis in Python, and following a Bayesian workflow, from model creation through sampling and model checking to presenting results, requires using several of these libraries at the same time.

This working session aims to bring together practitioners to discuss and address interoperability issues within the ecosystem. Attendees should expect a hands-on get together where they will meet other Bayesian practitioners with whom to discuss the issues faced and contribute to open source libraries with issues, pull requests and discussions.

Join: https://numfocus-org.zoom.us/j/81429725137?pwd=a3NRNnNwWEg5TlU2aFEwVjYvSjNldz09

Watch: https://numfocus-org.zoom.us/rec/share/4bHUQVHx02XeyADxPYONFF6_zhgWZ6BVW4mcuu3vg4UZaQBylvInGHYgtwm4oFI1.nkxHF-Xuu10p4eeb

View Details


Data visualisation with Seaborn

Day 2
Tutorials
Myles Mitchell, Parisa Gregg  |  2022/12/02 10:00:00 UTC - 2022/12/02 11:30:00

Want to create beautiful and complex visualisations of your data with concise code? Look no further than Seaborn, Python’s fantastic plotting library which builds on the hugely popular Matplotlib package. This hands-on tutorial will provide you with all the necessary tools to communicate your data insights with Seaborn.

Join: https://numfocus-org.zoom.us/j/85436614845?pwd=ZDVzcHVOcUhWYTBDdTU3dHoyUVFZUT09

Watch: https://numfocus-org.zoom.us/rec/share/xF9BiwBR0fqpbIjRB3wcBH8WFlVCm_QFg-snS1BZPR4CGQcUnip3m7aX43tIQFIk.Fp3Tu7CMrQipoHYR

View Details


Super Search with OpenSearch and Python

Day 2
Talks
Laysa Uchoa  |  2022/12/02 10:30:00 UTC - 2022/12/02 11:00:00

OpenSearch is an open source document database with search and aggregation superpowers, based on Elasticsearch. This session covers how to use OpenSearch to perform both simple and advanced searches on semi-structured data such as a product database.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=9122s

View Details


Maps, Maps, Maps!

Day 2
Talks
Geir Arne Hjelle  |  2022/12/02 10:30:00 UTC - 2022/12/02 11:00:00

Python has many different packages that are useful for working with different kinds of geographical data. This presentation will introduce several of these packages and show you how you can get started working with geolocated information and presenting insights on maps.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=5662s

View Details


How to maximally parallelize the entire pandas API

Day 2
Talks
Rehan Durrani  |  2022/12/02 11:00:00 UTC - 2022/12/02 11:30:00

pandas has rapidly become one of the most popular tools for data analysis, but is limited by its inability to scale to large datasets. We developed Modin, a scalable, drop-in alternative to pandas, that preserves the dynamic and flexible behavior of pandas dataframes while enhancing the scalability.

This talk will walk you through our team’s research at UC Berkeley, which enabled the development of Modin. We’ll also discuss our latest publication at VLDB, which covers a novel approach to parallelization and metadata management techniques for dataframes.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=10864s

View Details


Exploring Feature Redundancy and Synergy with FACET 2.0 – and Why You Need It to Interpret ML Models Correctly

Day 2
Talks
Mateusz Sokół  |  2022/12/02 11:30:00 UTC - 2022/12/02 12:00:00

Understanding dependencies between features is crucial in the process of developing and interpreting black-box ML models. Mistreating or neglecting this aspect can lead to incorrect conclusions and, consequently, sub-optimal or wrong decisions leading to financial losses or other undesired outcomes. Many common approaches to explain ML models – as simple as feature importance or more advanced methods such as SHAP – can yield misleading results if mutual feature dependencies are not taken into account.

In this talk we present FACET 2.0 – a new approach for global feature explanations using a new technique called SHAP vector projection, open-sourced at: https://github.com/BCG-Gamma/facet/.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=12650s

View Details


Discover Inspirational Insights in Motivational Sports Speeches Using Speech-to-Text

Day 3
Talks
Tonya Sims  |  2022/12/03 15:30:00 UTC - 2022/12/03 16:00:00

Inspirational sports speeches have motivated and reinvigorated folks for years. Whether you’re a developer or an athlete, they’ve withstood the journey because even the smartest, the bravest, and the most resilient need some encouragement on occasion.

During our time together, we’ll use Python and a speech-to-text provider to transcribe sports podcasts that contain inspirational speeches. We’ll discover insights from the transcripts to determine which ones might give you a boost of energy or rally your team.

We’ll discover common topics of each sports podcast episode and measure how they leave us feeling: victorious or perhaps overcoming the agony of defeat. We’ll also investigate if there are any similarities and differences in the sports speeches and what makes a great motivational speech that moves people to action.

By the end, you’ll have a better understanding of using speech recognition in real-world scenarios and using features of Machine Learning with Python to derive insights.

This talk is for developers of all levels, including beginners.

Join: https://www.youtube.com/watch?v=Ik8YHADlFVk

Watch: https://www.youtube.com/watch?v=48nWtfgEevc&t=2008s

View Details


BERT’s Achilles’ heel? Applying contrastive learning to fight anisotropy in language models.

Day 2
Talks
Aleksander Molak  |  2022/12/02 11:30:00 UTC - 2022/12/02 12:00:00

Transformer models became state-of-the-art in natural language processing. Word representations learned by these models offer great flexibility for many types of downstream tasks from classification to summarization. Nonetheless, these representations suffer from certain conditions that impair their effectiveness. Researchers have demonstrated that BERT and GPT embeddings tend to cluster in a narrow cone of the embedding space which leads to unwanted consequences (e.g. spurious similarities between unrelated words). During the talk we’ll introduce SimCSE – a contrastive learning method that helps to regularize the embeddings and reduce the problem of anisotropy. We will demonstrate how SimCSE can be implemented in Python.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=8987s

View Details


Machine Learning Frameworks Interoperability

Day 2
Talks
Christian Hundt, Miguel Martínez  |  2022/12/02 12:00:00 UTC - 2022/12/02 12:30:00

To develop mature data science, machine learning, and deep learning applications, one must develop a large number of pipeline components, such as data loading, feature extraction, and frequently a multitude of machine learning models.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=14423s

View Details


How to Properly Test ML Models & Data

Day 2
Talks
Shir Chorev  |  2022/12/02 12:00:00 UTC - 2022/12/02 12:30:00

Automatic testing for ML pipelines is hard. Part of the executed code is a model that was dynamically trained on a fresh batch of data, and silent failures are common. Therefore, it’s problematic to use known methodologies such as automating tests for predefined edge cases and tracking code coverage.
In this talk we’ll discuss common pitfalls with ML models, and cover best practices for automatically validating them: What should be tested in these pipelines? How can we verify that they’ll behave as we expect once in production? We’ll demonstrate how to automate tests for these scenarios and introduce a few open-source testing tools that can aid the process.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=10775s

View Details


ipyvizzu-story – a new, open-source charting tool to build, create and share animated data stories with Python in Jupyter

Day 2
Workshops
Peter Vidos  |  2022/12/02 12:00:00 UTC - 2022/12/02 13:30:00

Sharing and explaining the results of your analysis can be a lot easier and much more fun when you can create an animated story of the charts containing your insights. ipyvizzu-story – a new open-source presentation tool for Jupyter & Databricks notebooks and similar platforms – enables just that using a simple Python interface.

In this workshop, one of the creators of ipyvizzu-story introduces this tool and helps the audience take the first steps in utilizing the power of animation in data storytelling. After the workshop, the members can build and present animated data stories independently.

Join: https://numfocus-org.zoom.us/j/89651814635?pwd=WkN0QUc1b1o3UkpvTmJGRnhJN0REUT09

Watch: https://numfocus-org.zoom.us/rec/share/V1pI27mqD_1LYuyLOb8BW4r_ZLqCjXhteTbi4Ykgz9vvsPcX3ehrW3DW_jPKoASe.5LzNJck2P0ZvZ14c

View Details


Lessons Learned Building Our Own Dashboard Solution Using Open-Source Technologies

Day 2
Talks
Jan Dix, Zornitsa Manolova  |  2022/12/02 12:30:00 UTC - 2022/12/02 13:00:00

Most organisations have implemented some kind of dashboard to monitor their data, processes, or business. However, many dashboard solutions come with a caveat – either the licensing costs, lack of transparency in the workflows, limited creativity, or they cannot be connected to existing infrastructure.
This talk is aimed at Data Scientists, Data Engineers, Data Practitioners and Managers struggling with choosing between a myriad of commercial dashboard solutions and DIY. We present how to create your own dashboard using open-source Python technologies like FastAPI, SQLAlchemy, and Celery and the challenges involved. We look back at the pitfalls and solutions we have worked on over the past 3 years. The goal is not to present our unique solution, but to show how we can combine different Python libraries to implement custom solutions to solve different use cases. Attendees should be familiar with the basic concepts of web infrastructure. Previous knowledge of any libraries is not required. We hope to provide a starting point to build your custom dashboard solution using open-source tooling.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=16211s

View Details


Industrial Strength DALL-E: Scaling Complex Large Text & Image Models

Day 2
Talks
Alejandro Saucedo  |  2022/12/02 12:30:00 UTC - 2022/12/02 13:00:00

Identifying the right tools to enable high-performance machine learning may be overwhelming as the ecosystem continues to grow at break-neck speed. This becomes particularly pronounced when dealing with the increasingly popular large language and image generation models such as GPT2, OPT and DALL-E, among others. In this session we will dive into a practical showcase where we will be productionising the large image generation model DALL-E, and showcase some optimizations that can be introduced as well as considerations as the use-cases scale. By the end of this session practitioners will be able to run their own DALL-E powered applications as well as integrate these with functionalities from other large language models like GPT2, etc. We will be leveraging key tools in the Python ecosystem to achieve this, including Pytorch, HuggingFace, FastAPI and MLServer.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=12587s

View Details


Converting sentence-transformers models to a single tensorflow graph

Day 2
Talks
Georgios Balikas  |  2022/12/02 13:00:00 UTC - 2022/12/02 13:30:00

Getting predictions from transformer models such as BERT requires two steps: first querying the tokenizer, then feeding its outputs to the deep learning model itself. These two parts of the model are kept in separate classes in popular open source implementations like Huggingface Transformers and Sentence-Transformers. This works well within Python, but when one wants to put such a model in production or convert it to a more efficient format like ONNX that may be served from other languages, such as JVM-based ones, it is preferable, simpler, and less risky to have a single artifact that is queried directly. This talk builds on the popular sentence-transformers library and shows how one can transform a sentence-transformer model into a single TensorFlow artifact that can be queried with strings and is ready for serving. By the end of the talk, the audience will have a better understanding of the architecture of sentence-transformers and the steps required to convert a sentence-transformer model to a single TensorFlow graph. The code is released as a set of notebooks so that the audience can replicate the results.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=17928s

View Details


Metadata Systems for End-to-End Data & Machine Learning Platforms

Day 2
Talks
Alejandro Saucedo  |  2022/12/02 13:00:00 UTC - 2022/12/02 13:30:00

Organisations have been increasingly adopting and integrating a non-trivial number of different frameworks at each stage of their machine learning lifecycle. Although this has helped reduce time-to-value for real-world AI use-cases, it has come at a cost of complexity and interoperability bottlenecks.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=14371s

View Details


Real-world Perspectives to Avoid the Worst Mistakes using Machine Learning in Science

Day 2
Workshops
Jesper Dramsch, Valerio Maggio, Gemma Turon, Mike Walmsley, Goku Mohandas  |  2022/12/02 13:00:00 UTC - 2022/12/02 15:00:00

Numerous scientific disciplines have noticed a reproducibility crisis of published results. While this important topic was being addressed, the danger of non-reproducible and unsustainable research artefacts using machine learning in science arose. The brunt of this has been avoided by better education of reviewers who nowadays have the skills to spot insufficient validation practices. However, there is more potential to further ease the review process, improve collaboration and make results and models available to fellow scientists. This workshop will teach practical lessons that can be directly applied to elevate the quality of ML applications in science by scientists.

Join: https://numfocus-org.zoom.us/j/85651624814?pwd=dEQ3dlFlaW95M2ZGOGs0d1owYlExZz09

Watch: https://numfocus-org.zoom.us/rec/share/KnLRtJYUe9E7Op521FXzmbEdHxP4Zmhn7nD4NnK5OVJuqjqLBzwLyJFYW3R75h0u.ExhqTkvfTl9n2lCO

View Details


Steering a data science project

Day 2
Talks
Morgane Mahaud  |  2022/12/02 13:30:00 UTC - 2022/12/02 14:00:00

Starting a new data science project is an exciting time, full of exotic model possibilities and incredible faraway features. However, this ocean of possibilities is treacherous, and the risks of veering off course are numerous.

This talk aims to provide a checklist to help you set a course for your data science project, and keep it. An industrial project about image pseudo-classification will be used as a working example.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=19805s

View Details


Things I learned running neural networks on microcontrollers

Day 2
Talks
Saradindu Sengupta  |  2022/12/02 13:30:00 UTC - 2022/12/02 14:00:00

A somewhat beginner’s guide on running neural networks on micro-controllers, understanding the training pipeline, deployment and how to update the deployed model.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=16186s

View Details


Building a Machine Learning Platform with OSS in 90 min

Day 2
Workshops
Anindya Saha  |  2022/12/02 13:30:00 UTC - 2022/12/02 15:00:00

Have you ever wondered what it takes to build a production grade Machine Learning platform? With so many OSS tools and frameworks, figuring out how to make everything work together can get overwhelming. In this workshop we will build a production grade Model Training, Model Serving, and Model Monitoring platform on AWS EKS. Nothing will be local. These ideas can serve ML Engineers, Applied Data Scientists & Researchers to further extend them and develop a holistic picture of building an ML Platform on OSS.

Join: https://numfocus-org.zoom.us/j/89759495181?pwd=YVZReUZxL1Z1YVVZeUNwbGE4a1Y4dz09

Watch: https://numfocus-org.zoom.us/rec/share/V_WpDffKKwwAUP9420M9zmlunEu7QX1berpPRJyPf_xIR5uMi_j0nrOfpY_uGYev.3ioic8GXblnYUflJ

View Details


The Pythonic Common Chemical Universe

Day 2
Talks
Suliman Sharif  |  2022/12/02 14:00:00 UTC - 2022/12/02 14:30:00

The virtual chemical universe is expanding rapidly as open-access titan databases, such as the Enamine Database (20 Billion), Zinc Database (2 Billion), and PubMed Database (68 Million), and cheminformatic tools to process, manipulate, and derive new compound structures are being established. We present our open-source knowledge graph, Global-Chem, written in Python, which distributes dictionaries of common chemicals relevant to different sub-communities to the general public. For example: What is inside Food? Cannabis? Sex Products? Chemical Weapons? Narcotics? Medical Therapeutics?

To navigate new chemical space, we use our data as a reference index to help us keep track of common patterns of interest and explore new chemicals that could be theoretically real. In our talk, we will present the chemical data, the rules governing the data and its integrity, and how to use our tools to understand the chemical universe with Python.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=21561s

View Details


Extending Awkward Array into the broader PyData Ecosystem

Day 2
Talks
Doug Davis  |  2022/12/02 14:00:00 UTC - 2022/12/02 14:30:00

The Awkward Array project provides a library for operating on nested, variable-length data structures with NumPy-like idioms. We present two projects that provide native support for Awkward Arrays in the broader PyData ecosystem. In dask-awkward we have implemented a new Dask collection to scale up and distribute workflows with partitioned Awkward Arrays. In awkward-pandas we have implemented a new Pandas extension array type, making it easy to use Awkward Arrays in Pandas workflows and enabling massive acceleration in the processing of nested data. We will show how these projects plug into PyData and present some compelling use cases.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=17969s

View Details


Practical MLOps for better models

Day 2
Talks
Isabel Zimmerman  |  2022/12/02 14:30:00 UTC - 2022/12/02 15:00:00

Machine learning operations (MLOps) are often synonymous with large and complex applications, but many MLOps practices help practitioners build better models, regardless of the size. This talk shares best practices for operationalizing a model and practical examples using the open-source MLOps framework vetiver to version, share, deploy, and monitor models.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=19763s

View Details


Keynote – DJ Patil

Day 2
Keynotes
DJ Patil   |  2022/12/02 15:00:00 UTC - 2022/12/02 16:00:00

DJ Patil is the former U.S. Chief Data Scientist.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=25327s

View Details


Testing Big Data Applications (Spark, Dask, and Ray)

Day 2
Talks
Han Wang  |  2022/12/02 16:00:00 UTC - 2022/12/02 16:30:00

Data practitioners use distributed computing frameworks such as Spark, Dask, and Ray to work with big data. One of the major pain points of these frameworks is testability. For testing simple code changes, users have to spin up local clusters, which have a high overhead. In some cases, code dependencies force testing against a cluster. Because testing on big data is hard, it becomes easy for practitioners to avoid testing entirely. In this talk, we’ll show best practices for testing big data applications. By using Fugue to decouple logic and execution, we can bring more tests locally and make it easier for data practitioners to test with low overhead.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=28884s

View Details


On creating behavioral profiles of your customer from event stream data – introduction to Cleora, the open-source tool for real time multimodal modeling.

Day 2
Talks
Dominika Basaj, Barbara Rychalska  |  2022/12/02 16:00:00 UTC - 2022/12/02 16:30:00

We want to present Cleora, an open-source tool for creating compact representations of your clients’ behavior. Cleora uses graph theory to transform streams of event data into embeddings, which are suitable as input for training models such as churn, propensity, and recommender systems. This talk is useful for anyone who wishes to learn how to work with client event data and how to model clients’ behavior.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=25186s

View Details


Data annotation for humans: Creating and refining annotation guidelines from a UX perspective

Day 2
Workshops
Damian Romero, Magda  |  2022/12/02 16:00:00 UTC - 2022/12/02 18:00:00

In this workshop, attendees will learn how to create data annotation guidelines from a user experience (UX) perspective.

Creating annotation guidelines from a UX perspective means imbuing them with usability, resulting in a better experience for annotators, and more effective and productive annotation campaigns. With Python being at the forefront of Machine Learning and data science, we believe that the Python community will benefit from learning more about the design of data annotation guidelines and why they are essential for creating great machine learning applications.

Join: https://numfocus-org.zoom.us/j/81392908618?pwd=empVQlBwcGZWK245OWkzajBURkNUdz09

Watch: https://numfocus-org.zoom.us/rec/share/gqj5sVoT4mphKeC6uGdpneu6mdRtp97YkqPCz_p2wqbUnPvk0j8GAJh3eM6H0iY.wAe7rz1MHwYOXII3

View Details


Too much data? When big data starts to become a bad idea

Day 2
Tutorials
Cheuk Ting Ho  |  2022/12/02 16:00:00 UTC - 2022/12/02 18:00:00

Nowadays we know that social media and tech giants are harvesting tons of data from their users, and most of us agree that the ability of these companies to deliver suggestions and customization for you is driven by big data.

However, this raises a question: Is more data always better? Does more data mean a more accurate model? When do you need big data, and when does it start becoming a bad idea? Let’s find out in this panel session.

Join: https://numfocus-org.zoom.us/j/87402632626?pwd=cWx1dDNNTUl3Zk8zSXdSTEZiWjQrQT09

Watch: https://numfocus-org.zoom.us/rec/share/R35fWv7QlHm701H45ltgpVOfKeMPYzvx-ykN8CJAums0y-ux3UKjQdbOrLdptyRX.1abhCmT9UX1rgusV

View Details


Improving production workflows for scikit-learn models with skops

Day 2
Talks
Merve Noyan  |  2022/12/02 16:30:00 UTC - 2022/12/02 17:00:00

Production workflows in machine learning have their own requirements compared to DevOps. In this talk, I will present a new library we are developing called “skops” that’s built to improve production workflows for scikit-learn models.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=26981s

View Details


MLOps for the rest of us: A poor man’s guide to putting models in production

Day 2
Talks
Duarte Carmo  |  2022/12/02 17:00:00 UTC - 2022/12/02 17:30:00

What if you’re a two-man machine learning team deploying models to users? What if you don’t have a full-blown team of Data Engineers working with you? What if nobody around you cares about making that nasty production data available in a pristine feature store? What if you don’t even have time to build out your entire Machine Learning platform?

There must be a way to still deliver your ML model to users, right? There must be a way to deliver value.

In this session, I’ll talk about how small teams address the problem of delivering ML value to users at a reasonable scale. I’ll go over some misconceptions and lessons learned from 4 years working with early-stage startups.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=32448s

View Details


Machine Learning in the Warehouse with Python

Day 2
Talks
Allan Campopiano  |  2022/12/02 17:00:00 UTC - 2022/12/02 17:30:00

Moving data in and out of a warehouse is both tedious and time-consuming. In this talk, we will demonstrate a new approach using the Snowpark Python library. Snowpark for Python is a new interface for Snowflake warehouses with Pythonic access that enables querying DataFrames without having to use SQL strings, using open-source packages, and running your model without moving your data out of the warehouse. We will discuss the framework and showcase how data scientists can design and train a model end-to-end, upload it to a warehouse and append new predictions using notebooks.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=28826s

View Details


How to Eliminate the I/O Bottleneck and Continuously Feed the GPU While Training in the Cloud

Day 2
Talks
Lu Qiu  |  2022/12/02 17:30:00 UTC - 2022/12/02 18:00:00

Model training is a time-consuming, data-intensive, and resource-hungry phase in machine learning, with much use of storage, CPUs, and GPUs. The data access pattern in training requires frequent I/O of a massive number of small files, such as images and audio files. With the advancement of distributed training in the cloud, it is challenging to maintain the I/O throughput to keep expensive GPUs highly utilized without waiting for access to data. The unique data access patterns and I/O challenges associated with model training compared to traditional data analytics necessitate a change in the architecture of your data platform.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=34076s

View Details


Better Python Coding through Prefect Blocks

Day 2
Talks
Jeff Hale  |  2022/12/02 17:30:00 UTC - 2022/12/02 18:00:00

Everyone who codes can save time by reusing configuration — whether for logging in to cloud providers or databases, spinning up Docker containers, or sending notifications. The Prefect open source library provides you with blocks – sharable, reusable, and secure configuration with code. Blocks can be created and edited through the Prefect UI or Python code, allowing for easier collaboration with team members of all skill levels.

Join: https://youtu.be/h779eyy-zRE

Watch: https://www.youtube.com/watch?v=h779eyy-zRE&t=30646s

View Details


Keynote – Gabriela de Queiroz

Day 2
Keynotes
Gabriela de Queiroz  |  2022/12/02 18:00:00 UTC - 2022/12/02 19:00:00

At Microsoft, Gabriela leads and manages the Global AI/ML/Data team in Education Advocacy.

Before that, she worked at IBM as a Program Director on Open Source, Data & AI Technologies and then as Chief Data Scientist at IBM, leading AI Strategy and Innovations.

She is the founder of AI Inclusive, a global organization that is helping increase the representation and participation of gender minorities in Artificial Intelligence. She is also the founder of R-Ladies, a worldwide organization for promoting diversity in the R community with more than 200 chapters in 55+ countries.

She has worked at several startups, where she built teams, developed statistical models, and employed a variety of techniques to derive insights and drive data-centric decisions. She likes to mentor and share her knowledge through mentorship programs, tutorials, and talks.

Join: https://www.youtube.com/watch?v=LjtixglMLCg

Watch: https://www.youtube.com/watch?v=LjtixglMLCg&t=36074s

View Details


100x Faster NetworkX: Dispatching to GraphBLAS

Day 2
Talks
Jim Kitchen, Erik Welch, Mridul Seth  |  2022/12/02 19:00:00 UTC - 2022/12/02 19:30:00

NetworkX is the most popular graph/network library in Python. It is easy to use, well documented, easy to contribute to, extremely flexible, and extremely slow for large graphs.
An upcoming release begins to fix that last issue by calling fast GraphBLAS implementations instead of the native Python implementation.

If you use NetworkX or have ever written a graph algorithm, this talk will be of interest to you as it shows how NetworkX is planning on a path of pluggable algorithm libraries so users can opt-in to faster implementations with minimal code changes.

Join: https://www.youtube.com/watch?v=pd-Sbm8lHTc

Watch: https://www.youtube.com/watch?v=pd-Sbm8lHTc&t=43s

View Details


Data pipelines != workflows: orchestrating data with Dagster

Day 2
Talks
Sandy Ryza  |  2022/12/02 19:00:00 UTC - 2022/12/02 19:30:00

Data pipelines consist of graphs of computations that produce and consume data assets like tables and ML models.

Data practitioners often use workflow engines like Airflow to define and manage their data pipelines. But these tools are an odd fit – they schedule tasks, but miss that tasks are built to produce and maintain data assets. They struggle to represent dependencies that are more complex than “run X after Y finishes” and lose the trail on data lineage.

Dagster is an open-source framework and orchestrator built to help data practitioners develop, test, and run data pipelines. It takes a declarative approach to data orchestration that starts with defining data assets that are supposed to exist and the upstream data assets that they’re derived from.

Attendees of this session will learn how to develop and maintain data pipelines in a way that makes their datasets and ML models dramatically easier to trust and evolve.

Join: https://www.youtube.com/watch?v=SQdq9xtzdJI

Watch: https://www.youtube.com/watch?v=SQdq9xtzdJI&t=260s

View Details


Missing Data in the Age of Machine Learning

Day 2
Tutorials
Haw-minn Lu, Haoyin Xu  |  2022/12/02 19:00:00 UTC - 2022/12/02 20:30:00

Machine learning algorithms, especially artificial neural networks, are not tolerant of missing data. Many practitioners simply remove records with missing fields without any consideration for the potential statistical bias that might be introduced. The field of imputation has become mature with imputations not only predicting missing values, but reflecting the uncertainty in the prediction. Traditional statistical estimators make use of the full benefits offered by advanced imputation techniques. This tutorial illustrates techniques and architectures that can incorporate advanced imputation techniques into machine learning pipelines including artificial neural networks.

Join: https://numfocus-org.zoom.us/j/81102390207?pwd=KzI0V2t0VkhnczdwMXFIMjR4Y0ZGdz09

Watch: https://numfocus-org.zoom.us/rec/share/z5cr7fHyqnhiE021RH4MwcB9gT28OA9nh0m9qX-AuPI2iLnKJjOIU9pjWPR9XyeR.syJJXMydd0cO2nZT

View Details


Simulations in Python: Discrete Event Simulation with SimPy

Day 2
Tutorials
Lara Kattan  |  2022/12/02 19:00:00 UTC - 2022/12/02 20:30:00

Add to your machine learning arsenal with an introduction to simulation in Python using SimPy! Simulations are increasingly important in machine learning, with applications that include simulating the spread of COVID-19 to make decisions about public policy, vaccination and shutdowns.

You can use simulation to answer questions like, Can you increase profits by adding more tables or staff to your restaurant? You can also use simulation to create data for modeling when it’s hard or impossible to get (e.g. simulate purchases in response to promotions on certain products to see if they increase sales).

To benefit from this talk, you’ll need to know a small amount of Python, specifically how to write functions and simple classes. No previous knowledge of simulation needed! If you know about simulation in another language and want to see a SimPy example, you can also benefit from this talk. You’ll get a Jupyter notebook with a simple but fully worked out example to follow along with and to study on your own time after the conference.

Join: https://numfocus-org.zoom.us/j/88337941336?pwd=cDJ5TWRobDlMSE9HQzFnbGhZbDdSZz09

Watch: https://numfocus-org.zoom.us/rec/share/gq6CjRBk6-pu3KGlQ1B5nL-s4B1SbPqlYbZoVHjZIVwS5lLpY7yLXOtPQfG3CFRs.UJdPHWjyWAjrYFmJ

View Details


Vaex: the perfect DataFrame Library for Python data apps

Day 2
Talks
Jovan Veljanoski, Maarten Breddels  |  2022/12/02 19:30:00 UTC - 2022/12/02 20:00:00

Vaex is an incredibly powerful DataFrame library that allows one to work with datasets much larger than RAM on a single node. It combines memory mapping, lazy evaluations, efficient C++ algorithms, and a variety of other tricks to empower your off-the-shelf laptop and make it crunch through a billion samples in real time.

A common use-case for Vaex is as a backend for data apps, especially if one needs to process, transform, and visualize a larger amount of data in real time. Vaex implements a number of features that have been specifically designed to improve performance of data hungry dashboards or apps, namely:
– caching
– async evaluations
– early stopping of operations
– progress bars

In this talk we will showcase how you can use these features to build efficient dashboards and data apps, regardless of the data app library you prefer using.

Join: https://www.youtube.com/watch?v=pd-Sbm8lHTc

Watch: https://www.youtube.com/watch?v=pd-Sbm8lHTc&t=1825s

View Details


Daft: the Distributed Python Dataframe for “Complex Data” (images, video, documents and more)

Day 2
Talks
Jay Chia, Sammy Sidhu  |  2022/12/02 19:30:00 UTC - 2022/12/02 20:00:00

Daft is an open-source distributed dataframe library built for “Complex Data” (data that doesn’t usually fit in a SQL table, such as images, videos, documents, etc.).

Experiment Locally, Scale Up in the Cloud

Daft grows with you and is built to run just as efficiently and seamlessly in a notebook on your laptop as on a Ray cluster consisting of thousands of machines with GPUs.

Pythonic

Daft lets you have tables of any Python object such as images/audio/documents/genomic files. This makes it really easy to process your Complex Data alongside all your regular tabular data. Daft is dynamically typed and built for fast iteration, experimentation and productionization.

Blazing Fast

Daft is built for distributed computing and fully utilizes all of your machine’s or cluster’s resources. It uses modern technologies such as Apache Arrow, Parquet and Iceberg for optimizing data serialization and transport.

Join: https://www.youtube.com/watch?v=SQdq9xtzdJI

Watch: https://www.youtube.com/watch?v=SQdq9xtzdJI&t=1990s

View Details


Scale Data Science by Pandas API on Spark

Day 2
Talks
Xinrong Meng, Takuya Ueshin  |  2022/12/02 20:00:00 UTC - 2022/12/02 20:30:00

With Python emerging as the primary language for data science, pandas has grown rapidly to become one of the standard data science libraries. One of the known limitations of pandas is that it does not scale linearly with your data volume due to single-machine processing. Pandas API on Spark overcomes this limitation, enabling users to work with large datasets by leveraging Apache Spark. In this talk, we will introduce Pandas API on Spark and help you scale your existing data science workloads with it. Furthermore, we will share the cutting-edge features in Pandas API on Spark.

Join: https://www.youtube.com/watch?v=pd-Sbm8lHTc

Watch: https://www.youtube.com/watch?v=pd-Sbm8lHTc&t=3610s

View Details


Testing Pandas: Shoots, leaves, and garbage!

Day 2
Talks
Matt Harrison  |  2022/12/02 20:00:00 UTC - 2022/12/02 20:30:00

“It works on my machine”… those dreaded words.

“I’m not a developer, I don’t know how to test”… arghhh.

“Let QA test it”….

No more excuses. Learn how to debug and test Pandas code.

Join: https://www.youtube.com/watch?v=SQdq9xtzdJI

Watch: https://www.youtube.com/watch?v=SQdq9xtzdJI&t=3779s

View Details


Keynote – Quincy Larson

Day 2
Keynotes
Quincy Larson  |  2022/12/02 20:30:00 UTC - 2022/12/02 21:30:00

Quincy Larson is the Founder of freecodecamp.org.

Join: https://www.youtube.com/watch?v=pd-Sbm8lHTc

Watch: https://www.youtube.com/watch?v=pd-Sbm8lHTc&t=5400s

View Details


Everything you need to know about Transformer Models

Day 2
Talks
Mike Rothenhäusler  |  2022/12/02 21:30:00 UTC - 2022/12/02 22:00:00

Transformer models are everywhere in the deep learning community, and this talk will help you better understand why transformers achieve such impressive results. Using various explainability techniques and plain NumPy examples, participants will gain an understanding of the attention mechanism, its implementation, and how it all comes together.

Join: https://www.youtube.com/watch?v=pd-Sbm8lHTc

Watch: https://www.youtube.com/watch?v=pd-Sbm8lHTc&t=9110s

View Details


Lightning Talks – Aidan Russell, Ray Bell, Cameron Devine PhD, Archit Khosla

Day 2
Talks
Aidan Russell, Ray Bell, Cameron Devine PhD, Josh Seltzer, Archit Khosla  |  2022/12/02 21:30:00 UTC - 2022/12/02 22:30:00

Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.

Join: https://numfocus-org.zoom.us/j/86535220090

Watch: https://numfocus-org.zoom.us/rec/share/vOsGoT-NH-gFpU235BLQ2uzCucWuENXPTJGemk8V9dZ9Xz6PKvQgaadCeQEsmuSO.vQqfFAW-RlEeIPpO

View Details


You don’t need a cluster for that: using embedded SQL engines for plotting massive datasets on a laptop

Day 2
Talks
Eduardo Blancas  |  2022/12/02 22:00:00 UTC - 2022/12/02 22:30:00

This talk will show you a simple yet effective technique to visualize larger-than-memory datasets on your laptop by leveraging SQLite or DuckDB. No need to spin up a Spark cluster!

Join: https://www.youtube.com/watch?v=pd-Sbm8lHTc

Watch: https://www.youtube.com/watch?v=pd-Sbm8lHTc&t=10871s

View Details


Bon Voyage! Leading machine learning research journeys with happy (into-production) endings

Day 3
Talks
Topaz Gilad  |  2022/12/03 08:00:00 UTC - 2022/12/03 08:30:00

Why is the process of transforming research into a “real world” product so full of question marks? We often know where the research journey starts but have uncertainty about how and WHEN it ends.

In this talk, I will share my own experience leading algorithmic teams through the cycle of research into the production of live-streaming AI products. I will also share how to balance agile incremental delivery against giant leaps forward that require longer research, and how understanding the minimum viable product (MVP) way of thinking can help not only managers but every developer. Learn to outline an MVP for new AI capabilities and move forward with production in mind, while always raising the quality standards. At the end of this session, you will get the boost you need to take the data-driven experimental mindset to the next level, spiced with methodologies you can adapt to development as well as research.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=240s

View Details


Building an ML Application Platform from the Ground Up

Day 3
Talks
Sean Sheng  |  2022/12/03 08:30:00 UTC - 2022/12/03 09:00:00

The value of an ML model is not realized until it is deployed and served in production. Building an ML application is more challenging than building a traditional application because models and data add complexity on top of the application code. Using web serving frameworks (e.g. FastAPI) can work for simple cases but falls short on performance and efficiency. Alternatively, using pre-packaged model servers (e.g. Triton Inference Server) can be ideal for low-latency serving and resource utilization but lacks flexibility in defining custom logic and dependencies. BentoML abstracts the complexities by creating separate runtimes for IO-intensive preprocessing logic and compute-intensive model inference logic. At the same time, BentoML offers an intuitive and flexible Python-first SDK for defining custom preprocessing logic, orchestrating multi-model inference, and integrating with other frameworks in the MLOps ecosystem.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=2130s

View Details


ML Model Traceability and Reproducibility by Design

Day 3
Talks
Basak Eskili, Maria Vechtomova  |  2022/12/03 09:30:00 UTC - 2022/12/03 10:00:00

Model traceability and reproducibility are crucial steps when deploying machine learning models. Model traceability allows us to know which version of the model generated which prediction. Model reproducibility ensures that we can roll back to the previous versions of the model anytime we want.
We, as ML engineers, designed reusable workflows which enable data scientists to follow these two principles by design.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=5610s

View Details


Implementation and analysis of deep learning models for codeswitched speech classification

Day 3
Talks
Yashasvi Misra  |  2022/12/03 10:00:00 UTC - 2022/12/03 10:30:00

Automatic speech recognition (ASR) is used in many devices to identify bilingual speech data. Bilingual speech, or in more scientific terms code-switched speech, is two or more languages being mixed in a single speech utterance. In this presentation, learn about different deep learning techniques that can be used for the classification of such speech utterances. If you are a beginner in this field and don’t know where to start, join me to explore this use case and learn something new!

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=7414s

View Details


Is it possible to have entities within entities within entities?

Day 3
Talks
Victoria Slocum  |  2022/12/03 11:00:00 UTC - 2022/12/03 11:30:00

Named entity recognition models might not be able to handle a wide variety of spans, but Spancat certainly can! Within our open-source library for NLP, spaCy, we’ve created a NER model to handle overlapping and arbitrary text spans. Dive into named entity recognition, its limitations, and how we’ve solved them with a solution-focused talk and practical applications.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=10984s

View Details


Workflows Deep Dive: From Data Engineering to Machine Learning

Day 3
Workshops
Ramon Perez  |  2022/12/03 11:00:00 UTC - 2022/12/03 13:00:00

Programmers, regardless of their level of experience, enjoy solving increasingly complex challenges within their domains of expertise, and one of the main reasons they can spend more time working on different challenges is the workflows they put in place around their projects. Data Engineers build pipelines to make sure the company’s data is in optimal condition for Analysts to answer business-critical questions, for Data Scientists to automate the selection, engineering, and analysis of distinct features before training models, and for machine learning engineers to know where to get data from, or send it to, for the APIs they build. On the other hand, developers automate the infrastructure of software products to reduce the time to market of new features. These groups of data professionals and engineers are not too foreign to each other, as they all speak the same language: Python. That said, the goal of this workshop is to dive deep into different workflow patterns for building pipelines for data and machine learning projects. In other words, this workshop bridges the gap between building one-off projects and building automated and reusable pipelines, all while creating an environment that welcomes both newcomers and experts, whether they come from the data and machine learning fields or from engineering.

Join: https://numfocus-org.zoom.us/j/89834952307?pwd=bUxaVE4rZVpTdi9NS3NFMG5NWUZJZz09

Watch: https://numfocus-org.zoom.us/rec/share/-_uBF0aeQIG2JwgUxsg09prdCYq-7UtNvCvS_yxgM2ay8yRvqhRgVupUn0hqNw8.1uQNaeadSO_CX8xx

View Details


Mixing art with Python: an introduction to Style Transfer

Day 3
Talks
Isac Moura Gomes  |  2022/12/03 11:30:00 UTC - 2022/12/03 12:00:00

What would a sunset painted by van Gogh look like? And the front of your house? This is entirely possible with Deep Learning. The Neural Style Transfer technique aims to compose an image in the style of another image, modifying the content while preserving it at the same time.

In this lecture, the concepts of Deep Learning and neural networks, along with the step-by-step process for carrying out style transfer, will be introduced.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=13084s

View Details


A Practical Approach To Unlock Value From Data and Analytics

Day 3
Talks
Maria Feria  |  2022/12/03 12:00:00 UTC - 2022/12/03 12:30:00

There are many stories about Data Science hires that end up working in silos, buried in ad hoc business requests. According to Gartner, only 20% of analytic insights will deliver business outcomes in 2022. And a large number of Machine Learning Models never go to production. On top of that, work satisfaction among data professionals is staggeringly low; for instance, 97% of data engineers reported feeling burnt out in a 2021 Wakefield Research Survey. Furthermore, despite living in the era of information, many business executives are making decisions based on guesswork because they lack timely access to relevant data. This talk covers why many data initiatives fail and, more importantly, how to prevent it. I lay out a number of practical approaches based on work experience that will help you to unlock the potential of data and analytics, from how to build the case and gain buy-in to promoting a fact-based decision-making culture. This talk is for you if you are a business leader sponsoring data initiatives, if you work in data applications, or if you would benefit from enhanced analytics.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=14584s

View Details


Lightning Talks – Shivay Lamba, Srikanth, Shrabastee Banerjee, Kefentse Mothusi, Roshini Sudhaharan, Ted Conway, Lutz Ostkamp, SARADINDU SENGUPTA, Srivatsa Kundurthy, Aadit Kapoor

Day 3
Talks
Shivay Lamba, Srikanth, Shrabastee Banerjee, Kefentse Mothusi, Roshini Sudhaharan, Ted Conway, Lutz Ostkamp, SARADINDU SENGUPTA, Srivatsa Kundurthy, Aadit Kapoor  |  2022/12/03 12:00:00 UTC - 2022/12/03 13:30:00

Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.

Join: https://youtu.be/qqUGGBKtx0c

Watch: https://www.youtube.com/watch?v=qqUGGBKtx0c

View Details


What-if? Causal reasoning meets Bayesian Inference

Day 3
Talks
Benjamin Vincent  |  2022/12/03 12:30:00 UTC - 2022/12/03 13:00:00

We learn about the world from data, drawing on a broad array of statistical and inferential tools. The problem is that causal reasoning is needed to answer many of our questions, but few data scientists have this in their skill set. This talk will give a high-level introduction to aspects of causal reasoning and how it is complemented by Bayesian inference. A worked example will be given of how to answer what-if questions.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=16284s

View Details


Visually Inspecting Data Profiles for Data Distribution Shifts

Day 3
Tutorials
Felipe de Pontes Adachi  |  2022/12/03 13:00:00 UTC - 2022/12/03 14:30:00

The real world is a constant source of ever-changing and non-stationary data. That ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are one of the major post-production concerns for any ML/data practitioner. As organizations are increasingly relying on ML to improve performance as intended outside of the lab, the need for efficient debugging and troubleshooting tools in the ML operations world also increases. That becomes especially challenging when taking into consideration common requirements in the production environment, such as scalability, privacy, security, and real-time concerns.

In this talk, Data Scientist Felipe Adachi will talk about different types of data distribution shifts and how these issues can affect your ML application. Furthermore, the speaker will discuss the challenges of enabling distribution shift detection in data in a lightweight and scalable manner by calculating approximate statistics for drift measurements. Finally, the speaker will walk through steps that data scientists and ML engineers can take in order to surface data distribution shift issues in a practical manner, such as visually inspecting histograms, applying statistical tests and ensuring quality with data validation checks.

Join: https://numfocus-org.zoom.us/j/87378269186?pwd=Y21qejBvZmtBc2hmZi9EN2l1QlJNQT09

Watch: https://numfocus-org.zoom.us/rec/share/WCXZJ6QKjBsj2JrbdvrXQYYoBq2fiJZLsN1ZFJ7AchnYilO7oDxRj7C7UjX3IjyA.wZBJ9d12C4xZWWkJ

View Details


Probabilistic demand forecasting at scale

Day 3
Talks
Hagop Dippel  |  2022/12/03 13:30:00 UTC - 2022/12/03 14:00:00

It’s common to hear about demand forecasting in the e-commerce ecosystem. Indeed, it plays a pivotal role in logistics and inventory applications. However, due to uncertainty impacting demand and the stochastic nature of most downstream applications, the need for probabilistic demand forecasting emerges. Moreover, for the most realistic use cases, you’ll have to forecast for thousands if not hundreds of thousands of time series. The problem we will explore together is: how can we get probabilistic forecasts that embrace uncertainty and scale?

The talk is light-hearted, contains few math formulas, and is aimed at forecasting practitioners! If you are new to the topic of forecasting, you’ll be able to follow! We take the time to pose the problems and go deeper from there.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=19884s

View Details


Scalable Feature Engineering with Hamilton

Day 3
Talks
Elijah ben Izzy, Stefan Krawczyk  |  2022/12/03 14:00:00 UTC - 2022/12/03 14:30:00

In this talk we present Hamilton, a novel open-source framework for developing and maintaining scalable feature engineering dataflows. Hamilton was initially built to solve the problem of managing a codebase of transforms on pandas dataframes, enabling a data science team to scale the capabilities they offer with the complexity of their business. Since then, it has grown into a general-purpose tool for writing and maintaining dataflows in Python. We introduce the framework, discuss its motivations and initial successes at Stitch Fix, and share recent extensions that seamlessly integrate it with distributed compute offerings, such as Dask, Ray, and Spark.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=21759s

View Details


Media Mix Modeling: How to Measure the Effectiveness of Advertising in Python

Day 3
Talks
Hajime Takeda  |  2022/12/03 14:30:00 UTC - 2022/12/03 15:00:00

Media Mix Modeling, also called Marketing Mix Modeling (MMM), is a technique that helps advertisers to quantify the impact of several marketing investments on sales.

If a company advertises in multiple media (TV, digital ads, magazines, etc.), how can we measure the effectiveness and make future budget allocation decisions? Traditionally, regression modeling has been used, but obtaining actionable insights with that approach has been challenging.

Recently, many researchers and data scientists have tackled this problem using Bayesian statistical approaches. For example, Google has published multiple papers about this topic.

In this talk, I will show the key concepts of a Bayesian approach to MMM, its implementation using Python, and practical tips.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=23569s

View Details


Production-grade Machine Learning with Flyte

Day 3
Tutorials
Niels Bantilan  |  2022/12/03 15:00:00 UTC - 2022/12/03 16:30:00

MLOps encapsulates the discipline of – and infrastructure that supports – building and maintaining machine learning models in production. This tutorial highlights five challenges in carrying this out effectively: scalability, data quality, reproducibility, recoverability, and auditability. As a data science and machine learning practitioner, you’ll learn how Flyte, an open source data- and machine-learning-aware orchestration tool, is designed to overcome these challenges, and you’ll get your hands dirty using Flyte to build ML pipelines with increasing complexity and scale!

Join: https://numfocus-org.zoom.us/j/84247302012?pwd=UHhRUzc2bHFJN2thcy9tR3V2L3c3QT09

Watch: https://numfocus-org.zoom.us/rec/share/0mlFy4Xb8nn5tK-o_uLd1h737C1VOCePyojsSDHGJ0-5H0Qqhdxk1hX1LH-jkR0Z.gdCEUb2mGOyMMQPd

View Details


A dive into time series for the energy sector

Day 3
Talks
Rosana de Oliveira Gomes  |  2022/12/03 15:00:00 UTC - 2022/12/03 15:30:00

The energy sector has gained great attention in 2022 due to the current global energy crisis. Understanding which technologies and techniques are suitable for this sector is crucial to guarantee an effective transition to a future with cleaner and more efficient energy sources. This talk aims to educate tech professionals interested in the applications of machine learning in the energy sector, especially when it comes to time series analysis and forecasting. The audience is expected to have a basic understanding of data science and machine learning, and will be introduced to the concepts of time series, as well as the most common techniques utilized in the sector.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=25455s

View Details


Navigating Career Adjustments in Times of Uncertainty

Day 3
Talks
Jose Mesa  |  2022/12/03 15:00:00 UTC - 2022/12/03 15:30:00

Throughout the COVID pandemic, we’ve experienced extremes brought on by economic downturns and uncertainty across industries—to this day, we are feeling these effects around the globe. In fact, statistics show that many professionals have changed careers following the waves of layoffs that have recently occurred—but how? How can we best prepare for this type of situation, and how easy or difficult is it to change careers? If these questions have been on your mind, join this session to learn about several global industry trends, ways to adapt to career changes, and how to grow your tech skills and leverage certain platforms to support your learning process.

Join: https://www.youtube.com/watch?v=48nWtfgEevc

Watch: https://www.youtube.com/watch?v=48nWtfgEevc&t=314s

View Details


Speed up Python data processing with vectorization

Day 3
Talks
Itamar Turner-Trauring  |  2022/12/03 15:30:00 UTC - 2022/12/03 16:00:00

You need to quickly process a large amount of data—but running Python code is slow.
Libraries like NumPy and Pandas bridge this performance gap using a technique called vectorization.
In order to take full advantage of these libraries to speed up your code, it’s helpful to understand what vectorization means and when and how it works.

In this talk you’ll learn what vectorization means (there are 3 different definitions!), how it speeds up your code, and how to apply it to your code.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=27195s

View Details


Critical CV/NLP Data Errors and How to Fix Them with Galileo

Day 3
Talks
Nikita Demir  |  2022/12/03 16:00:00 UTC - 2022/12/03 16:30:00

Bad data is likely the largest factor limiting your model’s performance. We’ll talk about common data errors and how you can fix them today using Galileo. Although the majority of examples used will be in CV and NLP, the same insights apply to other modalities!

Join: https://www.youtube.com/watch?v=48nWtfgEevc

Watch: https://www.youtube.com/watch?v=48nWtfgEevc&t=3830s

View Details


Developing Battery Materials with Python

Day 3
Talks
Gabriel Birnbaum  |  2022/12/03 16:30:00 UTC - 2022/12/03 17:00:00

The electrochemical battery is one of the most important technologies for a renewable future. In this beginner-friendly talk, we will walk through how fundamental quantum mechanics and data science inform how we fine-tune battery materials for higher performance. We will also show how we used these techniques to computationally model a lithium-oxygen battery in Python.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=30792s

View Details


IMF Data Discovery and Collection

Day 3
Talks
Irina Klein  |  2022/12/03 16:30:00 UTC - 2022/12/03 17:00:00

The International Monetary Fund (IMF) provides a huge variety of economic datasets from different countries. We have explored the Python API for data extraction from the IMF, which allows users (primarily economists or financial analysts) to access the data. The structure of the underlying JSON datasets is quite complex for an unprepared user. In the talk, we will demonstrate the API workflow and go over the issues that led us to design a new, easier-to-use API, which is currently under development. This is joint work with Dr. Sou-Cheng Choi (Illinois Institute of Technology and SAS Institute Inc.).

The talk is primarily directed at data analysts and economists interested in utilizing IMF’s macroeconomic data.

Join: https://www.youtube.com/watch?v=48nWtfgEevc

Watch: https://www.youtube.com/watch?v=48nWtfgEevc&t=5600s

View Details


Reproducible Publications with Python and Quarto

Day 3
Talks
Tom Mock  |  2022/12/03 17:00:00 UTC - 2022/12/03 17:30:00

Quarto is an open-source scientific and technical publishing system that builds on standard markdown with features essential for scientific communication. The system has support for reproducible embedded computations, equations, citations, crossrefs, figure panels, callouts, advanced layout, and more. In this talk we’ll explore the use of Quarto with Python, describing both integration with IPython/Jupyter and the Quarto VS Code extension.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=32598s

View Details


Modern Analytics in the Cloud – A case for fraud detection

Day 3
Talks
Marwa Ahmed  |  2022/12/03 17:00:00 UTC - 2022/12/03 17:30:00

There’s a growing interest from small and large companies alike to move their data and their analytical pipelines into the Cloud as it adds large cost and operational benefits to businesses. Despite this, it can be unclear and sometimes confusing to know how cloud services can be used to replicate your existing analytical solutions in the Cloud or even how services can fit together to build new solutions.
The goal of this talk is to help answer these two questions. First by explaining what modern analytics look like in cloud environments and then by presenting a live use case for building an end-to-end analytical solution in the context of fraud detection for E-commerce businesses.

This talk assumes some knowledge of the Hadoop ecosystem and the main tools used, such as Airflow, Kafka, and Spark (an overall idea will be more than sufficient), as well as some experience with building and deploying machine learning models (some MLOps experience). The target audience is therefore data scientists/engineers with 4-5 years of experience working in analytics and/or architects who are looking to move their analytics solutions to the Cloud but are still unsure how it can fit together.

At the end of the talk, the audience will have a clear understanding of how modern analytics can be performed in the cloud and what a typical modern data architecture looks like. In the context of AWS, the audience will also have an understanding of the AWS analytics service offerings and what services can be used for/tailored to their needs. Finally, the audience will gain a clearer idea of how they can leverage ML capabilities to build a full pipeline in the cloud while cutting their development time by half.

Join: https://www.youtube.com/watch?v=48nWtfgEevc

Watch: https://www.youtube.com/watch?v=48nWtfgEevc&t=7392s

View Details


Classification Through Regression: Unlock the True Potential of Your Labels

Day 3
Talks
Topaz Gilad  |  2022/12/03 17:30:00 UTC - 2022/12/03 18:00:00

“Is a lion closer to being a giraffe or an elephant?” It is not a question anyone asks. So why address that classification problem the same as you would classification of age groups or medical condition severity?

The talk will walk you through a review of regression-based approaches for what may seem like classification problems. Unlock the true potential of your labels!
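
One way to picture the idea: when class labels have a natural order (such as age groups or severity levels), they can be mapped to numbers, fitted with a regressor, and the prediction snapped back to the nearest class. The sketch below does this with a one-feature least-squares line in plain Python. The severity labels, the data, and the helper names are hypothetical and are not taken from the talk.

```python
from statistics import mean

# Ordered classes: the position in the list is the numeric regression target.
SEVERITY = ["mild", "moderate", "severe"]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def predict_label(model, x):
    """Regress, then snap to the nearest valid class index."""
    slope, intercept = model
    index = round(slope * x + intercept)
    index = min(max(index, 0), len(SEVERITY) - 1)  # clip to the label range
    return SEVERITY[index]


# Hypothetical training data: one numeric feature and an ordered label.
features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["mild", "mild", "moderate", "moderate", "severe", "severe"]
model = fit_line(features, [SEVERITY.index(label) for label in labels])

print(predict_label(model, 3.5))
```

Because the targets are numbers, a squared-error loss penalizes confusing "mild" with "severe" more than confusing "mild" with "moderate", which a plain classification loss does not do.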

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=34419s

View Details


HuggingFace + Ray AIR Integration: A Python developer’s guide to scaling Transformers

Day 3
Talks
Antoni Baum  |  2022/12/03 17:30:00 UTC - 2022/12/03 18:00:00

Hugging Face Transformers is a popular open-source project with cutting-edge Machine Learning (ML), but meeting the computational requirements of the advanced models it provides often requires scaling beyond a single machine. In this session, we explore the integration between Hugging Face and Ray AI Runtime (AIR), allowing users to scale their model training and data loading seamlessly. We will dive deep into the implementation and API and explore how we can use Ray AIR to create an end-to-end Hugging Face workflow, from data ingest through fine-tuning and HPO to inference and serving.

Join: https://www.youtube.com/watch?v=48nWtfgEevc

Watch: https://www.youtube.com/watch?v=48nWtfgEevc&t=9192s

View Details


Keynote – Pia Mancini

Day 3
Keynotes
Pia Mancini  |  2022/12/03 18:00:00 UTC - 2022/12/03 19:00:00

Pia is the co-founder and CEO of Open Collective.

Join: https://www.youtube.com/watch?v=5j1w18Rwdi4

Watch: https://www.youtube.com/watch?v=5j1w18Rwdi4&t=36239s

View Details


I hate writing tests, that’s why I use Hypothesis

Day 3
Talks
Cheuk Ting Ho  |  2022/12/03 19:00:00 UTC - 2022/12/03 19:30:00

Ok, I lied, I still write tests. But instead of the example-based tests that we normally write, have you heard of property-based testing? With Hypothesis, instead of you having to think about what data to test with, the library generates test data, including boundary cases, for you.
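
To make the contrast with example-based tests concrete, here is a hand-rolled sketch of the property-based idea using only the standard library. It does not use Hypothesis itself: the generator `random_int_lists`, the function under test, and the chosen properties are illustrative assumptions. Hypothesis automates the input generation and also shrinks failing inputs to minimal cases, which this sketch does not attempt.

```python
import random


def dedupe_keep_order(items):
    """Function under test: drop repeated values, keeping first occurrences."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def random_int_lists(count, seed=0):
    """Yield three boundary cases first, then `count` random lists of small integers."""
    rng = random.Random(seed)
    yield []         # empty input
    yield [0]        # single element
    yield [7, 7, 7]  # all duplicates
    for _ in range(count):
        size = rng.randint(0, 20)
        yield [rng.randint(-5, 5) for _ in range(size)]


def check_properties(count=200):
    """Assert properties that must hold for every input; return how many inputs were checked."""
    checked = 0
    for data in random_int_lists(count):
        out = dedupe_keep_order(data)
        assert len(out) == len(set(out))      # no duplicates remain
        assert set(out) == set(data)          # no values lost or invented
        assert dedupe_keep_order(out) == out  # applying it twice changes nothing
        checked += 1
    return checked


print(check_properties())
```

The test states what must always be true of the output rather than listing expected outputs for hand-picked inputs.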

Join: https://www.youtube.com/watch?v=TMUDmA_aIKo

Watch: https://www.youtube.com/watch?v=TMUDmA_aIKo&t=1096s

View Details


Lightning Talks – Sophie Clayton, Dina Bavli, Colleen M. Farrelly, David Chapuis

Day 3
Talks
Sophie Clayton, Dina Bavli, Colleen M. Farrelly, David Chapuis  |  2022/12/03 19:00:00 UTC - 2022/12/03 20:30:00

Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.

Join: https://www.youtube.com/watch?v=48nWtfgEevc

Watch: https://www.youtube.com/watch?v=48nWtfgEevc&t=14588s

View Details


Bayesian Optimization: Fundamentals, Implementation, and Practice

Day 3
Talks
Quan Nguyen  |  2022/12/03 19:30:00 UTC - 2022/12/03 20:00:00

How can we make smart decisions when optimizing a black-box function?
Expensive black-box optimization refers to situations where we need to maximize/minimize some input–output process, but we cannot look inside and see how the output is determined by the input.
Making the problem more challenging is the cost of evaluating the function in terms of money, time, or other safety-critical conditions, limiting the size of the data set we can collect.
Black-box optimization can be found in many tasks such as hyperparameter tuning in machine learning, product recommendation, process optimization in physics, or scientific and drug discovery.

Bayesian optimization (BayesOpt) sets out to solve this black-box optimization problem by combining probabilistic machine learning (ML) and decision theory.
This technique gives us a way to intelligently design queries to the function to be optimized while balancing between exploration (looking at regions without observed data) and exploitation (zeroing in on good-performance regions).
While BayesOpt has proven effective at many real-world black-box optimization tasks, many ML practitioners still shy away from it, believing that they need a highly technical background to understand and use BayesOpt.

This talk aims to dispel that belief and offers a friendly introduction to BayesOpt, including its fundamentals, how to get it running in Python, and common practices.
Data scientists and ML practitioners who are interested in hyperparameter tuning, A/B testing, or more generally experimentation and decision making will benefit from this talk.
While most background knowledge necessary to follow the talk will be covered, the audience should be familiar with common concepts in ML such as training data, predictive models, multivariate normal distributions, etc.

Join: https://www.youtube.com/watch?v=TMUDmA_aIKo

Watch: https://www.youtube.com/watch?v=TMUDmA_aIKo&t=2749s

View Details


Deep Into the Tweet

Day 3
Talks
Dina Bavli  |  2022/12/03 20:00:00 UTC - 2022/12/03 20:30:00

Let’s scratch the Twitter metadata together and go below the surface with tweepy. Want to find out if the tweets you follow are trying to persuade you to do things? Have the feeling that advocates for some issues use certain emotions to push you in certain directions? Now you can find out.

Join: https://www.youtube.com/watch?v=TMUDmA_aIKo

Watch: https://www.youtube.com/watch?v=TMUDmA_aIKo&t=4498s

View Details