Friday October 29 5:00 PM – Friday October 29 5:30 PM in Talks I

Dev, Staging, and Production in Data Engineering with Terraform

Sarah Krasnik

Prior knowledge:
No previous knowledge expected

Summary

Early on, software engineers are taught: develop locally, test in staging, deploy to production. What does this mean for analytics? In this talk, we’ll walk through how data engineering teams should leverage a multi-environment development workflow built using Terraform. The proposed architecture allows for thorough data quality testing, resulting in reliable, production-grade data products.

Description

Development Environments in Software Engineering

I'll start with some background on software engineering workflows. Although the dev/staging/production setup might be common knowledge for software engineers, it's not something data engineers have been exposed to as often. The complexity of fully fledged applications creates a need for thorough testing. Without separate environments, software products cannot be QAed thoroughly. With the rising complexity of data products, the same level of QA is required to sustain reliable data pipelines.

How and Why to Test Data Pipelines

At a truly data-driven organization, the output of analytics teams is mission-critical. Just like deploying a bug-ridden, consumer-facing application, making a decision based on incorrect data can have significant dollar consequences. Without a code and data testing framework, there's no way to ensure data quality.

I advocate for developing locally, running on staging with end-to-end tests, and deploying to production with confidence. If the staging environment is configured in the same way as production, any errors will be caught before they make their way into data products exposed to end users. I will walk through how a testing framework can be configured to work in multiple environments.
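As a rough illustration of what "a testing framework configured to work in multiple environments" could look like, here is a minimal Python sketch. It is not taken from the talk or its example repository. The environment variable `DATA_ENV`, the schema names, and the `fail_on_error` flag are all hypothetical names chosen for the example. The idea is that the same data quality check runs everywhere, and only the configuration changes per environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    name: str
    schema: str          # hypothetical schema that holds this environment's tables
    fail_on_error: bool  # whether a failed check should stop the pipeline


# Hypothetical environment definitions; staging mirrors production's strictness.
ENVIRONMENTS = {
    "dev": EnvConfig("dev", "analytics_dev", fail_on_error=False),
    "staging": EnvConfig("staging", "analytics_staging", fail_on_error=True),
    "production": EnvConfig("production", "analytics", fail_on_error=True),
}


def load_config(env_name=None):
    """Pick a config by explicit name, else by the (assumed) DATA_ENV variable."""
    name = env_name or os.environ.get("DATA_ENV", "dev")
    if name not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {name}")
    return ENVIRONMENTS[name]


def check_no_null_ids(rows):
    """Return the rows that violate the 'id must not be null' rule."""
    return [row for row in rows if row.get("id") is None]


def run_checks(rows, config):
    """Run the same check in every environment; only the reaction differs."""
    failures = check_no_null_ids(rows)
    if failures and config.fail_on_error:
        raise AssertionError(
            f"{len(failures)} row(s) with null id in schema {config.schema}"
        )
    return len(failures)
```

With this shape, `run_checks(rows, load_config("dev"))` only reports the number of bad rows during local development, while the same call with the staging or production config raises and blocks the deploy.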

Infrastructure as Code: Terraform

Setting up one data platform is hard enough, let alone hosting two (staging, production) and accommodating three (staging, production, plus local development). Terraform allows engineers to specify infrastructure as code, inherently making the data platform configuration easily repeatable. This talk will include a detailed outline of deploying a data platform with Terraform, accompanied by an example repository. The outcome will be an easily transferable, coded, tested, and therefore understood data platform with multiple environments.
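One way to picture the "repeatable" part: a single Terraform configuration can be applied once per environment, with only the input variables changing. The Python sketch below is an assumption-laden illustration, not the talk's example repository. It renders a per-environment JSON variable file, and the variable names (`environment`, `warehouse_size`, `retention_days`) are hypothetical.

```python
import json

# Hypothetical defaults shared by every hosted environment.
BASE_VARIABLES = {"warehouse_size": "small", "retention_days": 7}

# Hypothetical per-environment overrides; staging keeps the defaults.
OVERRIDES = {
    "staging": {},
    "production": {"warehouse_size": "large", "retention_days": 90},
}


def render_tfvars(environment):
    """Return the JSON text of a variable file for one environment."""
    if environment not in OVERRIDES:
        raise ValueError(f"unknown environment: {environment}")
    values = {**BASE_VARIABLES, **OVERRIDES[environment], "environment": environment}
    return json.dumps(values, indent=2, sort_keys=True)


if __name__ == "__main__":
    for env in OVERRIDES:
        # Terraform reads JSON variable files that end in .tfvars.json.
        with open(f"{env}.tfvars.json", "w") as handle:
            handle.write(render_tfvars(env) + "\n")
```

Each generated file can then be passed to Terraform with the `-var-file` option, for example `terraform plan -var-file=staging.tfvars.json`, so staging and production are built from the same code and differ only in their declared variables.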

Learning Goals

  • Why data engineers should embrace a multi-environment deployment like software engineers do
  • What "multi-environment" means in the data world and how it enables data quality
  • How to leverage infrastructure as code to implement a scalable data platform