From this session, you will be able to learn:
- Concepts of designing a pristine data pipeline for Machine Learning applications
- Evolution of the data landscape, from BASE to ACID and from data warehouses to lakehouses
- Overview of modern data technologies such as Apache Spark, cloud services (AWS/GCP/Azure), and Delta Lake
- Best practices for data engineering and the ML lifecycle
For any data application, data engineering is a critical component that underpins Machine Learning, data science, and analytics projects. For Machine Learning or any downstream application, if you get the data right, most of your problems are solved at the very start of designing the pipeline. In this talk, we'll discuss the concepts of databases and data warehousing and how modern applications have evolved to use lakehouses. The talk will cover core concepts of designing data pipelines; under-the-hood mechanics such as ACID transactions, MVCC, and data formats; and best practices that guarantee a pristine data lake for successful Machine Learning applications. Attendees from a software engineering background will see how the data engineering world goes through an SDLC-like cycle of implementation, assessment, change, and improvement applied to the data lifecycle itself.