In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'data access':
How do data analysts and data scientists request and access data?
How is data access controlled, managed, and monitored?
When we consider 'data access,' one definition refers to "software and activities related to storing, retrieving, or acting on data housed in a database or other repository," normally coupled with authorization - who is permitted to access what - and auditing - who accessed what, when, and from where. As discussed below, data access can be provided with little or no control, such as when handing someone a memory stick, or with strict control through secure database and network authentication. Data access takes into account not only the user side, but also the ability of administrators to manage the data access life cycle effectively - from initial request to revoking privileges and post-use data cleanup.
The 5 maturity levels of the "data access" dimension are:
Level 1: Data analysts typically access data via flat files obtained explicitly from IT or other sources.
Data science players at Level 1 enterprises use what has historically been called the 'sneakernet': if you need data, you walk over to the data owners, get a copy on a hard drive or memory stick, and load it onto your local machine. This has, of course, morphed into emailing requests to data owners and getting the requested data back via email, drop boxes, or FedEx. Providing access to data in this manner is clearly not secure. Further, obtaining the 'right' data is unlikely on the first try, so multiple iterations with data owners - the data request cycle - may be needed, resulting in delays and even annoying those data owners.
Level 2: Data access available via direct programmatic database access.
In Level 2 enterprises, the sneakernet is recognized as insecure and inefficient. Moreover, since much of enterprise data is stored in databases, authorization and programmatic access are more readily enabled. With direct access to databases via convenient APIs (ODBC, R and Python packages, etc.), more data can be made available to data science players, thereby shortening the data request cycle. However, any processing beyond what is possible in the data repository/environment itself, e.g., SQL for relational databases, still requires data to be pulled to the client machine, which can have security implications.
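A minimal sketch of this pattern in Python, using the standard library's sqlite3 module as a stand-in for an enterprise database (the sales table and its contents are hypothetical). The point is that aggregation is pushed into the database via SQL, so only the summarized result - not the raw rows - crosses to the client machine:

```python
import sqlite3

# Illustrative only: an in-memory SQLite database stands in for an
# enterprise data store; the table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 75.0)],
)

# Push the aggregation into the database so only the summary result
# is pulled to the client, rather than every raw row.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

Against a real enterprise database, the same shape of code would use an ODBC driver or a database-specific connector rather than sqlite3, with credentials supplied through the enterprise's authentication mechanism.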
Level 3: Data scientists have authenticated, programmatic access to large volume data, but database administrators struggle to manage the data access life cycle.
The Level 3 enterprise is experiencing data access growing pains. Data scientists now have access to large-volume data and want to use more if not all of that data in their work. Database administrators are inundated with requests for both broad (multi-schema) and narrow (individual table) data access. Ensuring that individuals have proper approvals for accessing the data they need, and possibly implementing data masking, cause data access request backlogs. The Level 3 enterprise has also started to supplement traditional structured database data with new "big data" repositories, e.g., HDFS, NoSQL, etc. These even greater volumes of data include anything from social media data to sensor, image, text, and voice data.
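To make the masking step concrete, here is one common technique - deterministic pseudonymization via a salted hash - sketched in Python. The field names and the literal salt are hypothetical; in practice the salt would be a managed secret, and the masking would typically run in the database or an access layer before analysts ever see the data:

```python
import hashlib

# Hypothetical salt; in a real deployment this is a managed secret.
SALT = b"example-salt"

def mask(value: str) -> str:
    """Replace a sensitive value with a deterministic pseudonym.

    Determinism preserves joinability: the same input always maps to
    the same pseudonym, so analysts can still link records without
    seeing the raw identifier.
    """
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:12]

# Illustrative record with a direct identifier to be masked.
records = [{"ssn": "123-45-6789", "amount": 100.0}]
masked = [{**r, "ssn": mask(r["ssn"])} for r in records]
```

This is only one masking approach; others include tokenization, redaction, and format-preserving encryption, and which one applies depends on the approvals attached to each access request.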
Level 4: Data access is more tightly controlled and managed with identity management tools.
While enterprises in some industries, e.g., Finance, will have addressed access control to varying degrees, the Level 4 enterprise addresses data access more broadly: it understands the importance of end-to-end life cycle management of user identities and begins introducing tools to strengthen security and simplify compliance as appropriate. A goal for Level 4 enterprises is to make it easier for data science players to request and receive access to data, while also making it easier for administrators to manage, especially as more big data repositories are introduced. An enterprise-wide self-service access request web application may be used to facilitate requesting and granting data access. Ideally, this would be integrated with the metadata management tool used for data awareness.
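The access life cycle such a self-service application tracks can be sketched as a small state machine. Everything here - the state names, the fields on a request, the allowed transitions - is illustrative, not a description of any particular identity management product:

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    """Hypothetical states in a data access request's life cycle."""
    REQUESTED = "requested"
    APPROVED = "approved"
    GRANTED = "granted"
    REVOKED = "revoked"

# Which transitions the life cycle permits; e.g., access can only be
# granted after approval, and only granted access can be revoked.
ALLOWED = {
    State.REQUESTED: {State.APPROVED},
    State.APPROVED: {State.GRANTED},
    State.GRANTED: {State.REVOKED},
    State.REVOKED: set(),
}

@dataclass
class AccessRequest:
    user: str
    resource: str
    state: State = State.REQUESTED
    history: list = field(default_factory=list)  # audit trail of transitions

    def transition(self, new_state: State) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot move {self.state} -> {new_state}")
        self.history.append((self.state, new_state))
        self.state = new_state

# Walk one request through approval and grant; the history doubles
# as the audit record of who-approved-what that auditors need.
req = AccessRequest(user="analyst1", resource="sales_schema")
req.transition(State.APPROVED)
req.transition(State.GRANTED)
```

The audit trail and the enforced transition rules are the point: administrators can see where every request stands, and revocation is a first-class step rather than an afterthought.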
Level 5: Data access lineage tracking enables unambiguous data derivation and source identification.
The Level 5 enterprise has standardized on identity management and auditing to support secure data access, and now focuses on the question "what is the source of the data that produced this result?" Even in enterprises that leverage an enterprise data warehouse, data may still be replicated to other databases, or various gateways may be leveraged to give transparent access to remote data. The Level 5 enterprise enables tracking the derivation of data science work products - their lineage - with verification of actual data sources.
In my next post, we'll cover the 'scalability' dimension of the Data Science Maturity Model.