How can this new technology concretely help with data quality monitoring?
Data applications must aggregate and process ever-growing volumes of data, and as volumes grow, so do the data quality issues that hamper business and activity monitoring.
Data quality problems and anomalies are often identified by end users, who then submit an incident ticket to the Data team to get them fixed. Data quality tasks represent a significant manual effort: when Business and Data Analysts work on data issues, they spend the bulk of their time identifying and understanding the issue, and far less time remediating it. And because the business raises data quality tickets so frequently, project teams must repeatedly interrupt their development work to correct these anomalies.
Time wasted on data quality tasks…
Before we move on to machine learning, let's define more precisely what we mean by data quality, and explain how a commonly used technique, data profiling, can be useful for monitoring data quality issues.
Data quality encompasses data consistency and data accuracy.
We define data profiling as a way of assessing data in terms of structure and content.
For example, let's imagine a column storing birth dates, in which all data must fit the following format: mm/dd/yyyy. Say that Max was born on January 1st, 1982. If the cell indicating his birthdate contains "01/01/1982," the data is formatted correctly, i.e., it is consistent. If the cell contains "011/01/1982," "Jan 1st, 1982," or nothing at all, it will raise an error and prompt escalation. Data profiling thus offers a quick way to check for data consistency. But what about data accuracy? Does the data reflect reality? What if the cell corresponding to Max's birthdate contains "01/01/1967"? While the cell's content conforms to the expected format, the data itself is wrong, i.e., inaccurate. Data profiling can therefore overlook data accuracy issues.
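To make this concrete, here is a minimal sketch of such a consistency check in Python (the function name and the exact format rule are illustrative assumptions, not a prescribed implementation):

```python
import re
from datetime import datetime

# Profiling rule: birth dates must match mm/dd/yyyy.
DATE_FORMAT = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def is_consistent(value: str) -> bool:
    """Return True if the cell matches mm/dd/yyyy and is a real calendar date."""
    if not value or not DATE_FORMAT.match(value):
        return False
    try:
        datetime.strptime(value, "%m/%d/%Y")  # also rejects impossible dates like 13/40/1982
        return True
    except ValueError:
        return False

print(is_consistent("01/01/1982"))     # True: consistent
print(is_consistent("011/01/1982"))    # False: bad format
print(is_consistent("Jan 1st, 1982"))  # False: bad format
print(is_consistent(""))               # False: empty cell
print(is_consistent("01/01/1967"))     # True: consistent, yet inaccurate for Max
```

The last call illustrates the limitation: the check passes even though the value is wrong.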
In sum, while data profiling can be effective for checking data consistency and identifying potential issues in a dataset, it may not be enough to ensure data accuracy. This is where machine learning can be useful.
Machine learning helps proactively identify data quality issues before they impact downstream systems and the end user.
First, a given dataset is selected for investigation. Second, the machine learning algorithm identifies outliers in the data (further explanation can be found under the Machine Learning Techniques section of this article). Third, these outliers are displayed on a dashboard in terms of data quality KPIs, such as "Number of suspicious data points."
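As an illustration of the third step, here is a minimal sketch, assuming a hypothetical table of scored records, of how such a KPI could be aggregated for a dashboard using pandas:

```python
import pandas as pd

# Hypothetical output of the detection step: one row per data point,
# with a boolean flag set by the machine learning algorithm.
scored = pd.DataFrame({
    "table":      ["clients", "clients", "trades", "trades", "trades"],
    "column":     ["birth_date", "birth_date", "amount", "amount", "amount"],
    "is_suspect": [True, False, True, True, False],
})

# KPI: "Number of suspicious data points" per table and column.
kpi = (scored.groupby(["table", "column"])["is_suspect"]
             .sum()
             .rename("suspicious_data_points")
             .reset_index())
print(kpi)
```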
Based on the dashboard, the Data Analyst, the end user, and one or more SMEs confirm the detected anomalies as data issues. The Data Analyst and SME then decide which data issues to prioritize based on their impact on reports. Finally, the Data Analyst informs the developers of the data issues and which ones to remediate first.
Thus, the successive interactions are far less stressful than those in the reactive, ticket-driven state described earlier.
Next, we'll focus on the technical side of the machine learning approach.
In machine learning, it is important to understand the difference between outlier detection and anomaly detection. Outliers are data points that differ significantly from the rest of the data, whereas anomaly detection focuses on suspect behavior and patterns. Although the two notions are strongly related, an outlier is not necessarily an anomaly, and an anomaly is not necessarily an outlier: a single unusually large but legitimate trade is an outlier without being an anomaly, while a duplicate posting of a perfectly typical amount is an anomaly without being an outlier.
Whether we want to identify potential issues at the column level or at the element level (within a row) of a dataset or table, the key is to understand both the data's behavior and the business perspective. Only then can we analyze and clean the data to build the machine learning model.
Different machine learning approaches make it possible to identify different patterns and outliers. Since anomalies are often undocumented and data labels (which a supervised algorithm would require) are usually unavailable, unsupervised algorithms are a good starting point. Examples of unsupervised algorithms include clustering methods such as K-Means and DBSCAN, and isolation-based methods such as Isolation Forest.
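As a minimal sketch, here is how one such algorithm, Isolation Forest (as implemented in scikit-learn), might flag suspicious values in a single numeric column; the data is synthetic and the contamination rate is an assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic column: mostly typical values plus a few corrupted ones.
amounts = np.concatenate([rng.normal(100, 10, 500), [1_000_000, -5, 9_999]])

# contamination is the assumed share of outliers in the data.
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(amounts.reshape(-1, 1))  # -1 = outlier, 1 = inlier

print(amounts[labels == -1])  # the "suspicious data points" fed to the dashboard
```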
Once we have clustered the data and identified outliers, it is important to go further and explain why these outliers occur. To do so, we must analyze the identified patterns and share our findings with the business and the SMEs.
We have to keep in mind that the results are strongly conditioned by the state of the dataset. It is therefore important to analyze the data along dimensions such as completeness (missing values), uniqueness (duplicates), and structure (data types and distributions). These preparation tasks are essential and give us an early idea of potential impacts on the results.
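A minimal sketch of such preparation checks with pandas (the file name and the choice of checks are illustrative assumptions):

```python
import pandas as pd

df = pd.read_csv("clients.csv")  # hypothetical raw extract to be profiled

prep_report = {
    "rows": len(df),
    "missing_per_column": df.isna().sum().to_dict(),  # completeness
    "duplicate_rows": int(df.duplicated().sum()),     # uniqueness
    "dtypes": df.dtypes.astype(str).to_dict(),        # structure
}
print(prep_report)
```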
A Subject Matter Expert (SME), i.e., an expert in the domain at hand (e.g., Finance, Risk, Compliance, Translational), is needed to give feedback on the results produced by the model. The idea is that the SME must confirm each anomaly and provide an explanation for it. This information allows the analyst to adjust the model (via segmentation or profiling) and improve the approach.
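One way this feedback loop might be operationalized, sketched here with hypothetical fields, is to record each SME verdict in a structured form:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class AnomalyReview:
    record_id: str    # key of the flagged row
    column: str       # field that was flagged
    confirmed: bool   # SME verdict: real data issue or false positive
    explanation: str  # SME's business explanation
    reviewed_on: date

reviews = [
    AnomalyReview("TX-1042", "birth_date", True,
                  "Migration bug: year truncated during load", date(2023, 3, 1)),
    AnomalyReview("TX-2087", "amount", False,
                  "Legitimate year-end bulk settlement", date(2023, 3, 1)),
]

# False positives suggest model adjustments (e.g., segmenting year-end data);
# confirmed issues feed the knowledge base used later for supervised learning.
```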
Once the model is ready, it can start interacting with the target system; doing so raises several operational considerations.
Enterprise machine learning solutions are very useful tools for improving anomaly tracking and reducing manual analysis effort. A strong commitment from stakeholders, coordinated by a data office, makes it possible to transition successfully from unsupervised learning to supervised learning, provided a real Data Issue Management policy is in place.
Building a knowledge base of confirmed anomalies then makes it possible to train a model that targets anomalies more precisely (i.e., supervised learning).
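As a minimal sketch, assuming a hypothetical knowledge base in which each flagged record carries engineered features and an SME-confirmed label, a supervised classifier could then be trained as follows:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical knowledge base: features of flagged records plus the SME verdict.
kb = pd.DataFrame({
    "amount":          [120.0, 9800.0, 130.0, 10500.0, 125.0, 9700.0],
    "days_late":       [0, 45, 1, 60, 0, 50],
    "is_duplicate":    [0, 0, 0, 1, 0, 0],
    "confirmed_issue": [0, 1, 0, 1, 0, 1],  # label provided by the SMEs
})

X = kb.drop(columns="confirmed_issue")
y = kb["confirmed_issue"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_test))  # predicted verdicts for held-out flagged records
```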
In sum, AI, in the form of machine learning, provides an effective way for large organizations to overcome data quality challenges. An effective machine learning strategy, however, requires a real commitment from all the stakeholders involved.
Written by Nicolas Drisse Vaillant, Senior Data & Analytics Lead at Arrayo, and Olympe Scherer, Business Development Manager at Arrayo.