The Need for Modular Data Science Solutions

Usman Kamran

Prior knowledge:
No previous knowledge expected

Summary

Data science solutions have an evolving scope. Modularity needs to be built into both the development process and the delivery process for data science projects to translate into scalable, enterprise-ready solutions. This talk will show how to accelerate the adoption of data science solutions into the development pipeline through piecewise, iterative development.

Description

Data science projects often start off as experiments. To develop a solution, the problem naturally needs to be scoped and clearly defined. A gap, however, exists between the development of data science experiments and their ultimate translation into enterprise-ready applications capable of delivering insights. Bridging that gap requires modular, sustainable development. From the first line of code, data scientists need to think about how that code could eventually translate into a well-defined product that returns business value. Thinking modularly and adopting an API-first methodology early reduces the rework needed once a project is handed off to development teams. Great data scientists think sustainably and naturally gravitate towards excellent development practices; efficient data science solutions translate readily into backend-ready code. By adopting a forward-looking mindset, data scientists can save their company the months otherwise spent reworking a solution into a business-ready application.
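
To make "API-first" concrete, here is a minimal sketch of wrapping a model behind a stable endpoint, assuming FastAPI as the web framework; the predict function, schema, and field names are hypothetical placeholders, not part of the talk.

    # Minimal API-first sketch; FastAPI is an assumed framework choice,
    # and predict() is a hypothetical stand-in for a trained model.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class PredictionRequest(BaseModel):
        features: list[float]  # hypothetical input schema

    class PredictionResponse(BaseModel):
        score: float

    def predict(features: list[float]) -> float:
        # Placeholder logic; a real trained estimator would be loaded here.
        return sum(features) / max(len(features), 1)

    @app.post("/predict", response_model=PredictionResponse)
    def predict_endpoint(req: PredictionRequest) -> PredictionResponse:
        # The endpoint contract stays fixed while the model underneath evolves.
        return PredictionResponse(score=predict(req.features))

Because downstream consumers depend only on the /predict contract, the experiment behind it can be swapped out without reworking the integration.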

There are three major keys to developing a scalable, sustainable data science solution: writing reusable code, making parallelizability a priority, and thinking about infrastructure early. Firstly, functions are your friends! Splitting code into constituent pieces avoids redoing work. A single function that fits a number of use-cases prevents the long, messy Jupyter notebooks that, while functional, are impractical and inefficient to reuse. Breaking code down into constituent pieces strengthens its impact in both the current project and future ones, as in the sketch below.
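
As one sketch of that refactoring, the copy-pasted cleaning cells of a notebook can collapse into a single parameterized function; the column and file names below are illustrative assumptions.

    # One reusable cleaning function replacing duplicated notebook cells.
    # Column and file names are hypothetical.
    import pandas as pd

    def clean_events(df: pd.DataFrame, date_col: str, value_col: str) -> pd.DataFrame:
        """Parse dates, drop incomplete rows, and clip negative values."""
        out = df.copy()
        out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
        out = out.dropna(subset=[date_col, value_col])
        out[value_col] = out[value_col].clip(lower=0)
        return out

    # The same function now serves multiple datasets and future projects:
    signups = clean_events(pd.read_csv("signups.csv"), "signup_date", "revenue")
    orders = clean_events(pd.read_csv("orders.csv"), "order_date", "amount")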

Secondly, for problems that may grow into big data territory, use parallelizable development approaches. This can mean incorporating Python’s multiprocessing library, utilizing Dask, or venturing into the world of Scala and Spark. In a world where multi-core machines are increasingly prevalent, it is imperative to use the parallelism of the compute available. Code that can run in parallel can deliver a multi-fold increase in data-processing throughput, which expedites scalability and wide-scale user adoption. Think big early! Parallelizable code scales up efficiently and has wide-ranging use-cases, whereas single-threaded development limits the scope of the project from the start.
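
A minimal sketch of that idea with Python’s standard multiprocessing library: the same per-partition function runs serially and then fanned out across cores (process_partition and the synthetic partitions are illustrative stand-ins).

    # The identical per-chunk transform, run serially and then in parallel.
    from multiprocessing import Pool

    def process_partition(rows):
        # Stand-in for a CPU-bound transform on one chunk of the data.
        return sum(x * x for x in rows)

    if __name__ == "__main__":
        partitions = [range(i, i + 1_000_000) for i in range(0, 4_000_000, 1_000_000)]

        serial = [process_partition(p) for p in partitions]  # one core at a time

        with Pool() as pool:  # same function, fanned out across all cores
            parallel = pool.map(process_partition, partitions)

        assert serial == parallel

Because the unit of work is a pure function over one partition, roughly the same shape carries over to Dask (e.g., a bag of partitions mapped through the same function) or a Spark job with little change.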

Lastly, infrastructure deserves consideration early in the development of a data science project. Of note is the use of Docker containers to modularize the deployment of finished code. Containers can then be run on scalable infrastructure such as Kubernetes. With scalable compute on hand, data scientists can quickly test their solutions on increasingly large datasets, garnering a concrete proof of concept before translating the solution into a market-ready application. With the code modularized into containers and a scalable infrastructure already in place, stakeholders on the business side can also form an early estimate of the financial overhead required to adopt the data science solution into an application. This gives a quicker read on the financial viability of a project, allowing time to evaluate the profitability of the solution prior to further investment.
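
For illustration, a minimal Dockerfile for the hypothetical scoring service sketched earlier might look like the following; the project layout (requirements.txt, an app/ package) and the uvicorn server are assumptions, not prescriptions.

    # Sketch of a small image for a Python scoring service.
    FROM python:3.11-slim
    WORKDIR /app

    # Install pinned dependencies first so this layer caches between builds.
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy application code last; code edits do not invalidate the pip layer.
    COPY app/ ./app

    # Serve the API sketched above (uvicorn assumed in requirements.txt).
    CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Keeping the image this small and self-contained is what lets an orchestrator such as Kubernetes scale the same artifact from a laptop test to a large-dataset proof of concept.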

In summary, modular code saves time, expedites development, enhances adoptability, and can enable successful, profitable data science solutions. Learning how to develop efficient code starts with a modular mindset. By working smarter, you can make your data work harder for you!