In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'tools':
What tools are used within the enterprise for data science?
Can data scientists take advantage of open source tools in combination with high-performance, scalable, production-quality infrastructure?
A wide range of tools supports data science: open source and proprietary, relational databases and "big data" platforms, simple analytics and complex machine learning. Tools may support isolated activities or be highly collaborative, and may enable anything from small-scale modeling to massive predictive modeling with full model management. Orthogonal to each of these is the scale at which the tools can perform. Some tools and algorithm implementations perform well on small or even moderately sized data, but fail or become unusable when presented with larger data volumes. Larger volumes call for parallel, distributed implementations that take advantage of multi-node or multiprocessor machines and machine clusters.
Seldom will a single tool provide all required functionality; more often, a mix of commercial and open source tools provides it. However, enterprises require commercial support for the tools they adopt. As a result, commercial tools that integrate with open source tools, support data- and task-parallel execution, and are easy to deploy are highly desired.
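To make the idea of data-parallel execution a little more concrete, here is a minimal Python sketch of the split, apply, and combine pattern that scalable implementations generally rely on. It is an illustration only, not a description of any particular product: the function names (split_into_chunks, partial_sum_and_count, chunked_mean) and the map_fn parameter are hypothetical and chosen just for this example.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_chunks(values: Sequence[float], n_chunks: int) -> List[Sequence[float]]:
    """Split values into at most n_chunks contiguous, non-empty slices."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_and_count(chunk: Sequence[float]) -> Tuple[float, int]:
    """Work done independently on each chunk (the data-parallel step)."""
    return sum(chunk), len(chunk)


def chunked_mean(values: Sequence[float], n_chunks: int = 4,
                 map_fn: Callable = map) -> float:
    """Compute a mean from per-chunk partial results.

    map_fn is the pluggable part (an assumption of this sketch): the
    built-in map processes chunks one after another, while the map method
    of a process pool can hand each chunk to a different processor core.
    """
    chunks = split_into_chunks(values, n_chunks)
    partials = list(map_fn(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("cannot take the mean of an empty sequence")
    return total / count
```

Called as chunked_mean(data), the sketch runs serially. Passing the map method of a concurrent.futures.ProcessPoolExecutor as map_fn would spread the chunks across cores on one machine without changing the rest of the code. Distributed tools apply the same idea across the nodes of a cluster, which is why an algorithm has to be expressed in terms of combinable partial results before it can scale.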
The five maturity levels of the "tools" dimension are:
Level 1: An ad hoc array of non-scalable tools is predominantly used for isolated data analysis on desktop machines.
Data science players at Level 1 use traditional desktop tools for data analysis, relying heavily on spreadsheet-based tools along with various open source tools for analytics and visualization.
Level 2: Enterprise manages data through database management systems and relies on extensive open source libraries along with specialized commercial tools.
Level 2 enterprises, taking data management more seriously, introduce relational database management software tools. Data science projects also benefit from the broader open source package ecosystem for advanced data exploration, statistical analysis, visualization, and predictive analytics / machine learning. However, at Level 2, there is little integration between commercial and open source tools, and performance and scalability are an issue for data science projects.
Level 3: Enterprise seeks scalable tools to support data science projects involving large volumes of data.
Data science projects at Level 3 enterprises are hindered by performance and scalability of existing software and environments. A concerted effort is made to evaluate and acquire commercial tools with a range of scalable machine learning algorithms and techniques to complement open source techniques and facilitate production deployment. Data science players may begin to explore Big Data platforms to address new sources of high volume data, scalability, and cost reduction. Cloud-based tools are also under review. As data science projects grow in complexity involving larger team efforts, tools supporting collaboration become a recognized need.
Level 4: Enterprise standardizes on a suite of tools to meet data science project objectives.
The Level 4 enterprise understands the needs of data science players and projects to meet business objectives. Enhanced productivity requires scalable tools that support collaboration and work with data from a wide range of sources. Automation and integration play a major role in enhancing productivity, so tools that avoid paradigm shifts and automate tasks in data exploration, preparation, machine learning, and graph and spatial analytics are particularly valuable. Adopted tools are available or function across multiple platforms, including on-premises and cloud. As machine learning models have become a focal point for data science projects, adopted tools must support full model management.
Level 5: Enterprise regularly assesses state-of-the-art algorithms, methodologies, and tools for improving solution accuracy, insights, and performance, along with data scientist productivity.
Level 5 enterprises optimize their data science tool environment. Having understood what is required for effective data science projects and data science player productivity at Level 4, enterprises work with tool providers to further enhance those tools to meet business objectives.
In my next post, I'll cover the 'deployment' dimension of the Data Science Maturity Model, the last dimension in this series.