Saturday October 30 8:30 AM – Saturday October 30 9:00 AM in Talks I

A first step from ad-hoc SQL to scalable ETL

Martin Wanjiru

Prior knowledge:
Previous knowledge expected
Data Warehouse, SQL, Business Intelligence, Data Analysis, Data Science

Summary

Software engineers design tables to capture transactions, and these tables are not ideal for analytical querying. Data Scientists and Analysts, on the other hand, require an unrestricted view of business data, which is only achievable through long and complex SQL queries. This creates the need for a separate analytics database with a multidimensional structure. This talk discusses a solution to this problem.

Description

“Hey team, I have these two dashboards on the number of sales and they don’t agree. Why is this the case, and which is correct? Can this be fixed before my 10 am meeting with investors?” or “I want to create a data science model to categorize our customers based on their repayment trends and their adherence to their PTP (promise to pay) plans, so that the operations team can focus more effort on the customers less likely to pay on time and less on the promising ones - where can I get this data?” These are the kinds of reporting problems we faced daily.

We lacked a single source of truth. Most of our SQL business logic was spread across our tools, from 400+ lines of complex SQL queries creating materialized views on the production database (something the Engineering team did not take kindly to) to other queries living in our BI tools (Redash and Tableau). If the business logic changed even slightly, it was very difficult to update all our queries. Furthermore, debugging was painful!

All these challenges prompted us to set up a separate database for analytics. We opted for an ELT process: we adopted a tool to replicate data from the production database to the data warehouse, then started using dbt for the SQL transformations. dbt does most of the work for us - it determines the execution order of the models (SQL statements) and creates the corresponding objects in our warehouse. It also allows us to test our models, to ensure that the defined business logic is not violated. The models live in version control, making it easy to track the changes we make, and a CI/CD tool runs them hourly.

This setup has improved the way we report. The time spent gathering and cleaning data to be consumed by our data science models has also been significantly reduced.
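
To make the dbt workflow concrete, here is a minimal sketch of what one of these models might look like. The model name (daily_sales), the staging model it references (stg_orders), and the column names are hypothetical examples rather than code from our project; dbt uses the ref() call to work out that this model must run after stg_orders, and the config() block tells it how to materialize the result in the warehouse.

    -- models/daily_sales.sql (hypothetical example, not taken from our project)
    -- dbt infers from ref('stg_orders') that stg_orders must be built first,
    -- and materializes this query's result as a table in the analytics warehouse.
    {{ config(materialized='table') }}

    select
        date_trunc('day', o.created_at) as sale_date,
        count(*)                        as number_of_sales,
        sum(o.amount)                   as total_sales_amount
    from {{ ref('stg_orders') }} as o
    group by 1

A business rule can be checked the same way: a singular dbt test is simply a SQL file in the tests/ directory that selects the rows violating the rule, and the run fails if any rows come back. Again, a hypothetical sketch:

    -- tests/assert_no_negative_sales.sql (hypothetical singular test)
    -- dbt runs this query and fails the test if it returns any rows.
    select *
    from {{ ref('daily_sales') }}
    where total_sales_amount < 0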