Saturday 2:15 PM–3:00 PM in Boardroom

Fighting Against Chaotically Separated Values with Embulk

Sadayuki Furuhashi

Audience level:
Intermediate

Description

Python is a great tool for data analysis, but oftentimes the hardest part is getting access to your data, which is spread across a variety of business systems: files, databases, and SaaS applications. Productionizing this process is even harder: scripts frequently fail and require precious time to fix and re-test. In this talk, I will review some open source tools I authored and show you how

Abstract

In this talk we will:

- Describe how we created a data collection tool that can read any chaotically formatted file called "CSV" by guessing its structure automatically
- Explore the plugin-based architecture that makes it easy to load data from external sources, from files to business systems such as Salesforce and Mixpanel, and publish it to production systems
- Review current plugins (over 100 released by the OSS community) and use cases
- Explain how distributed execution enhances stability and scalability
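
To give a feel for what "guessing its structure automatically" can mean, here is a minimal sketch in Python. It is not Embulk's guess implementation; it uses the standard library's `csv.Sniffer` as a stand-in, and the function name `guess_csv_structure`, the candidate delimiter set, and the returned dictionary keys are assumptions made for illustration only.

```python
import csv
import io


def guess_csv_structure(sample: str) -> dict:
    """Guess delimiter, quote character, and header presence from a text sample.

    Illustrative only: relies on csv.Sniffer heuristics from the Python
    standard library, not on anything Embulk does internally.
    """
    sniffer = csv.Sniffer()
    # Restrict the candidates to a few common separators (an assumption).
    dialect = sniffer.sniff(sample, delimiters=",;\t|")
    # has_header compares the first row against the apparent types of later rows.
    has_header = sniffer.has_header(sample)
    # Parse the sample with the guessed dialect to count columns.
    rows = list(csv.reader(io.StringIO(sample), dialect))
    return {
        "delimiter": dialect.delimiter,
        "quotechar": dialect.quotechar,
        "has_header": has_header,
        "columns": len(rows[0]) if rows else 0,
    }


if __name__ == "__main__":
    sample = "id;name;score\n1;alice;3.5\n2;bob;4.0\n3;carol;2.5\n"
    print(guess_csv_structure(sample))
```

Heuristics like these work from a small sample of the file, so the guessed settings are best treated as a starting point to review rather than a guaranteed result.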