In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'scalability':
Do the tools scale and perform for data exploration, preparation,
modeling, scoring, and deployment?
As data, data science projects, and the data science team grow,
is the enterprise able to support these adequately?
The term 'scalability' can be defined as the "capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth." With respect to data science, scalability must reflect the hardware and software aspects as well as the people and process aspects. Several factors come into play:
- Data volume (number of rows, columns, and overall bytes)
- Algorithm design and implementation (parallel, distributed, memory-efficient) for data preparation, model building, and scoring
- Hardware (RAM, CPU, storage)
- Volume and rate of data science work products produced
- Number of data science players and projects
- Workflow complexity
The 5 maturity levels of the "scalability" dimension are:
Level 1: Data volumes are typically "small" and limited by desktop-scale hardware and tools, with analytics performed by individuals using simple workflows.
Level 1 enterprises perform analytics on data that can fit and be manipulated in memory, typically on desktop hardware, and possibly using open source tools. At Level 1, data volumes are such that loading data from flat files or programmatically from databases doesn't introduce problematic latency. Similarly, algorithm efficiency in terms of memory consumption or ability to take advantage of multiple CPUs isn't a significant issue. Data science work products are produced at a rate that taxes neither individuals nor infrastructure.
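At Level 1, the entire dataset fits comfortably in memory on a single desktop machine. A minimal sketch of that working style, using only Python's standard library and a made-up three-row dataset (the column names and values are illustrative, not from any real system):

```python
import csv
import io
import statistics

# Hypothetical desktop-scale dataset: small enough to load fully into memory.
raw = io.StringIO(
    "customer_id,spend\n"
    "1,120.0\n"
    "2,80.5\n"
    "3,99.9\n"
)

# Level 1 pattern: read the entire flat file into a list in one pass;
# data-loading latency and memory pressure are not yet concerns.
rows = list(csv.DictReader(raw))
spend = [float(r["spend"]) for r in rows]

# Simple single-machine analytics with no parallelism or memory management.
mean_spend = statistics.mean(spend)
print(round(mean_spend, 2))  # prints 100.13
```

Everything here happens in one process on one machine, which is exactly why this style stops working once the data outgrows the desktop.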
Level 2: Data science projects take on greater complexity and leverage larger data volumes.
In Level 2 enterprises, data science players take on more projects of greater complexity that require more data. The increase in data volume introduces increasingly intolerable latency from data movement and exposes inadequate hardware resources and inefficient algorithm implementations. The need to produce more data science work products, more frequently, further taxes existing hardware. The Level 2 enterprise begins exploring scalable tools that process data where it resides instead of relying on data movement, along with tools that extend the reach of open source tools and packages. In the meantime, data scientists resort to data sampling to work around tool limitations.
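The sampling workaround mentioned above can be sketched in a few lines. This is a generic illustration with synthetic data, not a prescription for any particular tool; in practice the "population" would be rows in a database or a large file:

```python
import random

# Hypothetical full dataset, too large for the available tools; integers
# stand in for rows that would normally live in a database or large file.
population = list(range(1_000_000))

# Level 2 workaround: draw a reproducible random sample so the analysis
# fits within desktop memory and tool limits, at the cost of sampling error.
rng = random.Random(42)  # fixed seed so the analysis can be reproduced
sample = rng.sample(population, k=10_000)

# Analytics proceed on the sample; estimates approximate the full data.
sample_mean = sum(sample) / len(sample)
```

The trade-off is that every estimate now carries sampling error, which is precisely the limitation the scalable tools adopted at Level 3 remove.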
Level 3: Individual groups adopt varied scalable data science tools and provide greater hardware resources for data scientist use.
The Level 3 enterprise addresses the data science growing pains experienced at Level 2 by adopting tools that minimize latency due to data movement, have parallel, distributed algorithm implementations, and provide infrastructure for leveraging open source tools. These new tools enable data scientists to use more, if not all, desired data in their analytics; however, there is no standard suite of tools across the enterprise, and the various tools do not facilitate collaboration. An increase in available hardware resources (on-premises or cloud) for solving bigger and more complex data science problems yields significant productivity gains for the data science team.
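The "process data where it resides" idea can be illustrated with a tiny example. Here an in-memory SQLite database is a stand-in for an enterprise data store (the table and values are made up); the point is that the aggregation runs inside the database so only the small summary crosses the wire, rather than every row:

```python
import sqlite3

# In-memory SQLite stands in for a production database;
# the pattern, not the engine, is what matters here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# Level 3 pattern: push the aggregation down to where the data lives,
# so only the per-region summary is moved to the client.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(summary)  # prints [('east', 150.0), ('west', 75.0)]
conn.close()
```

Contrast this with the Level 1/2 habit of pulling every row to the client before computing: with millions of rows, the pushed-down query moves kilobytes instead of gigabytes.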
Level 4: Enterprise standardizes on an integrated suite of scalable data science tools and dedicates sufficient hardware capacity to data science projects.
Having explored and test-driven various data science tools, the Level 4 enterprise standardizes on an integrated suite of scalable tools that enables data science players to realize full-scale data science projects. Data science projects, and data scientists in particular, have sufficient hardware resources (on-premises or cloud) for both development and production.
Level 5: Data scientists have on-demand access to elastic compute resources both on premises and in the cloud with highly scalable algorithms and infrastructure.
The Level 5 enterprise focuses on elastic compute resources for data scientists. As data volumes grow, data science projects benefit from being able to scale compute resources up or down quickly and easily, which expedites data exploration, data preparation, machine learning model training, and data scoring - whether for individual models or for massive predictive modeling involving thousands or even millions of models. Elastic compute resources can eliminate the need to dedicate resources to peak demand. Alternatively, cloud-at-customer solutions can provide the same benefits while meeting regulatory or data privacy requirements. Combining scalable algorithms and infrastructure with elastic compute resources enables the enterprise to meet time-sensitive business objectives while minimizing cost.
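The many-models scenario above is essentially an embarrassingly parallel fan-out: each model trains independently, so the work spreads across however many workers the elastic environment currently provides. A minimal sketch using Python's standard library, with a trivial per-segment "model" (the segment mean) purely for illustration - a real pipeline would fit an actual estimator per segment:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-segment training data; in practice each segment might be
# a customer, store, or device that gets its own model.
segments = {
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0],
    "c": [5.0],
}

def train(item):
    # Stand-in "model": the segment mean. The independent fan-out pattern,
    # not the estimator, is the point of this sketch.
    name, values = item
    return name, sum(values) / len(values)

# Level 5 pattern: fan the independent per-segment fits out across the
# worker pool; an elastic cluster simply grows or shrinks that pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    models = dict(pool.map(train, segments.items()))

print(models)  # prints {'a': 2.0, 'b': 15.0, 'c': 5.0}
```

Because the fits share no state, the same pattern scales from a thread pool on one machine to thousands of workers on an elastic cluster without changing the per-segment logic.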
In my next post, we'll cover the 'asset management' dimension of the Data Science Maturity Model.