Arrayo spoke with industry executives and experts about the keys to success in data strategy. We talked to Chief Data Officers, Chief Digital Officers, CFOs, and experts in financial services, pharma, biotech and healthcare. Our goal is to find an answer to one simple question: What works?
The people we spoke with come from a variety of organizations — some companies are young, agile startups, while others are mature Fortune-500 corporations.
In our previous research projects, we discussed data culture and “data nirvana,” and how to find the right balance between data defense and data offense. In this third installment, we focus on DataOps. This third series in Arrayo’s cross-industry research project comprises four Chapters. In Chapter 1, we discuss the Principles of DataOps. In subsequent Chapters, we will delve into the Tools that DataOps professionals use (Chapter 2), the various components of the Data Stack (Chapter 3), and some Use Cases to illustrate how teams can thrive by having a strong data culture (Chapter 4).
Inspired by the DevOps movement, DataOps seeks to create business value from the data pipeline in a more agile, collaborative way. In a nutshell, DataOps seeks to pipe the data in at every point of an enterprise to solve industry problems, and to do so with guaranteed quality at an accelerated speed. DataOps is a set of practices, processes, and technologies for building, operationalizing, automating, and managing data pipelines from source to consumption.
DataOps is about both technology and methodology. The emphasis is on collaboration, and to achieve it your enterprise needs to leverage integration, automation, communication, and cooperation among everyone in the data-verse. This encompasses data designers, architects and engineers, analytics teams, and any other group that needs to increase its consumption of high-quality, easily accessible data to deliver new and innovative solutions.
Every industry has a set of challenges that DataOps can solve. In Life Sciences, the challenge may be to enable scientists to query, analyze, and visualize pre-clinical and clinical trial biomarker data and to accelerate data-driven decision making, given that the costs of bringing new therapies to market are known to be substantial. In Finance, it may be used to increase trust in the data that informs risk calculations, to let market research flex its wings, to decrease trade breaks, or to reduce regulatory reporting headaches. Some CDOs have started to fill a new role called “data product manager,” which has responsibilities similar to those of an application product manager in the software development world, but with a DataOps focus.
No matter the industry, your data team is probably stretched to its limits. Requests for data come from all sides, and business users are frustrated by how long it takes to get the data and by the poor quality of what they receive. Data professionals and analysts are spending too much time cleaning the data before passing it along. They are writing the same jobs over and over, with minor variations to satisfy stakeholders. All the while, the people who are trying to generate value from this data are being told to get in line. Even self-service models produce challenges of their own, such as confusion over the “correct” source for data, differences in results based on misunderstandings, and limited access to data SMEs.
So…where to start? It is best to keep in mind some of the accomplishments that DataOps promises to bring. These will apply to data organizations in any industry.
The top three goals of DataOps were best summed up by one of the CDOs we spoke with: “the right data at the right time with the right veracity.”
Because DataOps is about both the technology and the methodology (and the people), we will start with the basic principles. There are many challenges that you will encounter during the move from traditional data management towards DataOps, so it is a good idea to keep your goals, guidelines, and principles in mind to help navigate the hurdles.
As one speaker at a recent data conference said, “Don’t solve your problems. Figure out how not to create them in the first place.” As you dig into problems in your data ecosystem, you will need to follow them back to the logical point of prevention, whatever that may be. Sometimes the cause is as simple as a late file or a lack of controls around data consumption at a critical point. Placing more controls at key points, upgrading legacy systems, and improving operational procedures will prevent problems from occurring in the first place. Challenges here may come from corporate culture as well as from the data ecosystem. As one speaker at a recent conference said, “It’s culture — and only culture — that keeps me up at night.”
You may also find that a manual process creates bottlenecks. Perhaps a legacy system is creating problems: if it can be decommissioned or bypassed, the relief would be felt in other areas of your data ecosystem. Perhaps there is a missing enrichment process that causes manual joins to be made to datasets before they are useful: make sure you understand the links and enrichment that are needed for the critical use cases. Perhaps there is a problem or a gap in the reference data: make sure you automate the retrieval of all necessary components. If you automate tactically, be sure to understand what end-to-end automation will look like to deliver strategic solutions.
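To make this concrete, here is a minimal sketch, in Python with pandas, of what replacing one such manual join with an automated enrichment step might look like. The file names, columns, and the trades/instruments scenario are hypothetical placeholders, not a prescription for any particular stack.

```python
# Minimal sketch: automate a reference-data join that used to be done by hand.
# All file names and column names below are hypothetical examples.
import pandas as pd

def enrich_trades(trades_path: str, reference_path: str) -> pd.DataFrame:
    """Attach reference data to raw trades so downstream users never join by hand."""
    trades = pd.read_csv(trades_path, parse_dates=["trade_date"])
    reference = pd.read_csv(reference_path)

    # Fail loudly if the reference data has a gap, rather than silently dropping rows.
    missing = set(trades["instrument_id"]) - set(reference["instrument_id"])
    if missing:
        raise ValueError(f"Reference data is missing {len(missing)} instruments")

    # validate="many_to_one" guarantees the enrichment never duplicates trades.
    return trades.merge(reference, on="instrument_id", how="left", validate="many_to_one")

if __name__ == "__main__":
    enriched = enrich_trades("trades.csv", "instrument_reference.csv")
    enriched.to_parquet("trades_enriched.parquet", index=False)
```

Run on a schedule or triggered by file arrival, a step like this removes the manual bottleneck while keeping the enrichment logic visible and repeatable.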
Of all the DataOps principles, this is arguably one of the most important. Without the right level of collaboration between all stakeholders, and the right kinds of communication, it is difficult to achieve a streamlined, agile data organization. An important focus of DataOps is to closely tailor data products to the needs of the business units that will use them. A lack of collaboration between people translates into poorly integrated data and may lead to data products that do not meet the needs of any group.
A common problem that points to a lack of collaboration is competing sources of truth. If you are in meetings where different groups spend time arguing about who has the better data, you will need to figure out how to break the silos and ignite collaboration. On the other hand, if teams work together to validate the methods that create the source of truth, then development time can be saved because the same work is not repeated.
To have usable data that enables collaboration, you must have integrated data. Well-curated sets of data and well-managed metadata, points of linkages, and logical layers are critical. The integration of data will allow communication and collaboration and create a “virtuous cycle.” To start, you must break the walls that are creating data silos, either literally by using new data storage platforms and technology, or virtually by using logical layers.
Everyone must speak the same language. If you cannot nail down terminology, you will not be able to enable collaboration. This is easier when your group already employs standard industry terminology, as in medical and pharma research or in financial instruments. It becomes more problematic for internal corporate usage or for areas that lack standardized terminology.
One way to ensure that the data speaks to everyone is to create, or purchase and customize, an ontology. Some are already fit-for-purpose and can be deployed in your environment. In finance, you may wish to deploy FIBO (Financial Industry Business Ontology). In healthcare, well-structured, controlled vocabularies for shared use include the Unified Medical Language System (UMLS), and for the biological sciences you can look at the Open Biological and Biomedical Ontology (OBO) Foundry. Sophisticated data dictionaries and a business glossary that incorporates your internal corporate terms, and that is known and easily accessible to all stakeholders, are must-haves.
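As a small illustration, here is a minimal Python sketch, using the open-source rdflib library, of turning a downloaded ontology file into a simple shared term lookup. The local file name (“fibo.ttl”) and the example term are hypothetical; any RDF/OWL vocabulary your organization adopts could be loaded the same way.

```python
# Minimal sketch: build a shared business-term lookup from an ontology file.
# Assumes the ontology (e.g. FIBO) has been downloaded locally as "fibo.ttl".
from rdflib import Graph
from rdflib.namespace import RDFS, SKOS

g = Graph()
g.parse("fibo.ttl")  # load the ontology into an in-memory graph

# Map preferred labels to term URIs, so analysts and engineers can resolve
# business terms to a single shared definition.
glossary = {}
for term, label in g.subject_objects(RDFS.label):
    glossary[str(label).lower()] = str(term)
for term, label in g.subject_objects(SKOS.prefLabel):
    glossary[str(label).lower()] = str(term)

print(glossary.get("interest rate swap"))  # the shared URI for that concept, if defined
```

A lookup like this can sit behind a business glossary UI or an API, so every team resolves the same term to the same definition.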
Trust is hard to define, but easy to recognize when you experience it. Once all your stakeholders begin to trust the data, you will be on firm ground. One of the most important underpinnings of trust is transparency. Your stakeholders probably talk a lot about data quality, but what they are really asking for is the ability to trust the data. You cannot implement data quality in a vacuum and expect people to trust it on hearsay alone. Your data constituents must see what is happening firsthand. You must strive for transparency at every step of the way: show where the data comes from, how and where it was obtained, what it means, how it is being cleaned and transformed, and how it is being used. In other words, show your work. Many organizations support this process by putting their entire data pipeline code into source control software. By doing this, interested parties can see how the data was sourced and transformed, and any changes to the data process can be tracked over time.
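One lightweight way to “show your work,” sketched below in Python under assumed file and step names, is to have every pipeline step append a provenance record (source, row count, content fingerprint) to a log that stakeholders can inspect alongside the code in source control.

```python
# Minimal sketch: each pipeline step records what it did and to how many rows,
# so consumers can see where data came from and how it changed. Step names and
# input files are illustrative only.
import hashlib
import json
from datetime import datetime, timezone

import pandas as pd

LINEAGE_LOG = "lineage_log.jsonl"

def log_step(step: str, source: str, df: pd.DataFrame) -> pd.DataFrame:
    """Append a provenance record for this step and pass the data through."""
    record = {
        "step": step,
        "source": source,
        "rows": len(df),
        "fingerprint": hashlib.sha256(
            pd.util.hash_pandas_object(df).values.tobytes()
        ).hexdigest(),
        "at": datetime.now(timezone.utc).isoformat(),
    }
    with open(LINEAGE_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return df

raw = log_step("ingest", "claims.csv", pd.read_csv("claims.csv"))
clean = log_step("drop_duplicates", "ingest", raw.drop_duplicates())
```

The resulting log answers the everyday trust questions (what ran, on what, and did the data change?) without requiring a full lineage platform on day one.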
Standardization gives everyone the same language to speak. It allows people to find what they seek, because the data across the whole ecosystem has been standardized. Data consumers understand what the data means, and how to use it. The data ecosystem has standard procedures for ingesting and usage, and everyone knows what to expect. Standardization, when combined with automation, has the additional benefit of allowing repeatability. If data is standardized and data processes are automated, then your consumers are secure in the knowledge that the inputs are understood, and outputs are expected, each and every time.
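A minimal sketch of what such a standard can look like in practice is shown below, assuming a hypothetical biomarker feed and an agreed column contract: every incoming file is checked against the same schema before it enters the ecosystem, so consumers always know what to expect.

```python
# Minimal sketch: a standardized ingestion contract. Column names and types are
# hypothetical examples of an agreed standard, not a recommendation.
import pandas as pd

EXPECTED_SCHEMA = {
    "patient_id": "object",
    "visit_date": "datetime64[ns]",
    "biomarker": "object",
    "value": "float64",
}

def ingest(path: str) -> pd.DataFrame:
    """Load a feed only if it matches the agreed schema."""
    df = pd.read_csv(path, parse_dates=["visit_date"])

    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Feed {path} is missing standard columns: {missing}")

    wrong = {c: str(df[c].dtype) for c, t in EXPECTED_SCHEMA.items() if str(df[c].dtype) != t}
    if wrong:
        raise TypeError(f"Feed {path} has non-standard types: {wrong}")

    return df
```

Because the same contract is applied to every feed, automated runs produce predictable outputs, and a failing check points directly at the supplier rather than at downstream consumers.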
Another aspect to standardization is data enrichment. If you have standardized your processes and done the job to curate the data, you need to take the next step towards powerful data by allowing people to see and trust the enrichment process. This means linking datasets in a predictable and understood way, managing your metadata with an iron hand, and having well-defined semantic and logical layers.
DataOps seeks to bind data tightly to the business needs and to daily operations for each group — regardless of whether they are analysts, management, or other data constituents. Each use case may have radically different data requirements, from getting products to market quickly to filing accurate external reports. An agile team of data engineers, business experts, and possibly a new role called a “Data Product Manager” can be given the power to navigate a trusted datasphere. Using the appropriate platforms and tools can bring huge benefits to your enterprise.
We hope you have enjoyed this article and would love to hear any thoughts or reactions you may have. Feel free to reach out to contact@teamarrayo.com or write a response below.
*This article was written for SteepConsult Inc. dba Arrayo by Renée Colwell and John Hosmer.