As a practitioner, author, and educator, he has been involved in some of the most challenging projects the industry has seen. These projects were often the result of major corporate mergers and the need to consolidate and integrate databases of enormous variety and complexity. A mathematician by education, Arkady started his career in academic research in the area of pattern recognition. By the age of 20, he had co-authored three books and numerous articles in the field. He also used his unique set of skills to build mathematical models in a variety of applications, from medical research to finance to the construction of earthquake-resistant buildings.
|Published (Last):||5 November 2011|
The data quality scorecard has three levels: aggregate scores, score decompositions, and atomic level data quality information. At the top level are aggregate scores, which are high-level measures of data quality. Well-designed aggregate scores are goal driven; they allow us to evaluate data fitness for various purposes and indicate the quality of various data collection processes. From the perspective of understanding data quality and its impact on the business, aggregate scores are the key piece of data quality metadata.
At the bottom level of the data quality scorecard is information about the data quality of individual data records. In the middle are various score decompositions and error reports that allow us to analyze and summarize data quality across various dimensions and for different objectives.
Aggregate Scores
On the surface, the data quality scorecard is a collection of aggregate scores.
Each score aggregates errors identified by the data quality rules into a single number — a percentage of good data records among all target data records. Aggregate scores help make sense out of the numerous error reports produced in the course of data quality assessment. Without aggregate scores, error reports often discourage rather than enable data quality improvement.
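As a rough illustration of this definition, the score can be computed as the share of target records that pass every rule. The record layout, the two rules, and the name `aggregate_score` are assumptions made for this sketch; they are not taken from any particular scorecard tool.

```python
from typing import Callable, Iterable

Record = dict
Rule = Callable[[Record], bool]  # returns True when the record passes the rule


def aggregate_score(records: Iterable[Record], rules: list[Rule]) -> float:
    """Percentage of target records that pass every data quality rule."""
    records = list(records)
    if not records:
        raise ValueError("no target records to score")
    good = sum(1 for r in records if all(rule(r) for rule in rules))
    return 100.0 * good / len(records)


# Hypothetical rules for an employee table (illustrative only).
rules = [
    lambda r: r.get("birth_date") is not None,  # birth date must be present
    lambda r: r.get("salary", 0) > 0,           # salary must be positive
]

employees = [
    {"id": 1, "birth_date": "1970-04-02", "salary": 52000},
    {"id": 2, "birth_date": None, "salary": 61000},
    {"id": 3, "birth_date": "1985-11-30", "salary": 0},
    {"id": 4, "birth_date": "1990-01-15", "salary": 48000},
]

score = aggregate_score(employees, rules)
print(f"{score:.1f}% of target records are error-free")
```

A record counts as bad if it violates at least one rule, so several errors on the same record do not lower the score further.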
You have to be careful when choosing which aggregate scores to measure. Scores that are not tied to a meaningful business objective are useless. For instance, a simple aggregate score for the entire database is usually rather meaningless. Suppose we know that 6. So what? This number does not help me at all if I cannot say whether it is good or bad, and I cannot make any decisions based on this information.
On the other hand, consider an HR database that is used, among other things, to calculate employee retirement benefits. Now, if you can build an aggregate score that says 6. You can use it to measure the annual cost of data quality to the business through its impact on a specific business process.
You can further use it to decide whether or not to initiate a data-cleansing project by estimating its ROI. The bottom line is that good aggregate scores are goal driven and allow us to make better decisions and take actions. Poorly designed aggregate scores are just meaningless numbers. Of course, it is possible and desirable to build many different aggregate scores by selecting different groups of target data records.
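The idea of selecting different groups of target records might be sketched as follows. The `vested` flag, the `errors` field, and the function `score_for_target` are hypothetical names chosen for illustration; a real HR database would need its own definition of which records feed the retirement benefit calculation.

```python
def score_for_target(records, is_target, has_error):
    """Percentage of error-free records among those selected by is_target."""
    target = [r for r in records if is_target(r)]
    if not target:
        return None  # the score is undefined when no record is selected
    good = sum(1 for r in target if not has_error(r))
    return 100.0 * good / len(target)


# Hypothetical HR records; "errors" lists rule violations found earlier.
employees = [
    {"id": 1, "vested": True, "errors": []},
    {"id": 2, "vested": True, "errors": ["missing_hire_date"]},
    {"id": 3, "vested": False, "errors": ["bad_zip_code"]},
    {"id": 4, "vested": True, "errors": []},
    {"id": 5, "vested": False, "errors": []},
]


def has_error(record):
    return bool(record["errors"])


# Score for the whole database versus score for one business use.
whole_db = score_for_target(employees, lambda r: True, has_error)
benefits = score_for_target(employees, lambda r: r["vested"], has_error)
print(f"whole database: {whole_db:.1f}%, benefit-eligible: {benefits:.1f}%")
```

Keeping the target selection as a separate predicate makes it cheap to define one score per business process over the same set of detected errors.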
The most valuable scores measure data fitness for various business uses. These scores allow us to estimate the cost of bad data to the business, to evaluate potential ROI of data quality initiatives, and to set correct expectations for data-driven projects.
In fact, if you define the objective of a data quality assessment project as calculating one or several such scores, you will have a much easier time finding sponsors for your initiative. Other important aggregate scores measure the quality of various data collection procedures. For example, scores based on the data origin provide estimates of the quality of the data obtained from a particular data source or through a particular data interface.
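A score per data origin could be sketched by grouping records on a source field. The `source` and `has_error` fields and the source names are assumptions for this example, which presumes that each record already carries the outcome of the rule checks.

```python
from collections import defaultdict


def scores_by_key(records, key):
    """Map each value of `key` to the percentage of error-free records."""
    totals = defaultdict(int)
    good = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        if not r["has_error"]:
            good[r[key]] += 1
    return {k: 100.0 * good[k] / totals[k] for k in totals}


# Hypothetical records tagged with the interface they arrived through.
records = [
    {"id": 1, "source": "web_form", "has_error": False},
    {"id": 2, "source": "web_form", "has_error": True},
    {"id": 3, "source": "batch_feed", "has_error": False},
    {"id": 4, "source": "batch_feed", "has_error": False},
    {"id": 5, "source": "web_form", "has_error": True},
]

by_source = scores_by_key(records, "source")
for source, value in sorted(by_source.items()):
    print(f"{source}: {value:.1f}%")
```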
A similar concept involves measuring the quality of the data collected during a specific period of time. Indeed, it is usually important to know if the data errors are mostly historic or were introduced recently.
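One possible way to compare historic and recent data is to restrict the score to records created within a date window. The `created` field, the dates, and the function `score_in_period` are illustrative assumptions, not part of the original method.

```python
from datetime import date


def score_in_period(records, start, end):
    """Percentage of error-free records created in the window [start, end)."""
    window = [r for r in records if start <= r["created"] < end]
    if not window:
        return None  # no records were collected in this period
    good = sum(1 for r in window if not r["has_error"])
    return 100.0 * good / len(window)


# Hypothetical records with a creation date and a precomputed error flag.
records = [
    {"created": date(2008, 3, 1), "has_error": True},
    {"created": date(2008, 9, 12), "has_error": True},
    {"created": date(2009, 6, 30), "has_error": False},
    {"created": date(2011, 2, 14), "has_error": False},
    {"created": date(2011, 7, 4), "has_error": True},
    {"created": date(2011, 10, 20), "has_error": False},
]

historic = score_in_period(records, date(2008, 1, 1), date(2011, 1, 1))
recent = score_in_period(records, date(2011, 1, 1), date(2012, 1, 1))
print(f"historic: {historic:.1f}%, recent: {recent:.1f}%")
```

The half-open window avoids counting a record in two adjacent periods.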
The presence of recent errors indicates a greater need for data collection improvement initiatives. Such measurement can be accomplished by an aggregate score with constraints on the timestamps of the relevant records. To conclude, analysis of the aggregate scores answers key data quality questions: What is the impact of the errors in your database on business processes? What are the sources and causes of the errors in your database? Where in the database can most of the errors be found?
Score Decompositions
The next layer in the data quality scorecard is composed of various score decompositions, which show the contributions of different components to the data quality.
Score decompositions can be built along many dimensions, including data elements, data quality rules, subject populations, and record subsets. For instance, in the above example we may find that 6. This can be used to prioritize a data cleansing initiative. This may suggest a need to improve data collection procedures in that subsidiary. The level of detail obtained through score decompositions is enough to understand where most data quality problems come from. However, if we want to investigate data quality further, more drill-downs are necessary.
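A simple decomposition can be sketched by counting how the detected errors split along one dimension. This example measures each component's share of all errors, which is only one of several possible ways to decompose a score; the error log layout, the rule names, and the subsidiary labels are hypothetical.

```python
from collections import Counter

# Hypothetical error log: one entry per rule violation on a record.
errors = [
    {"record_id": 2, "rule": "not_null", "element": "birth_date", "subsidiary": "A"},
    {"record_id": 7, "rule": "not_null", "element": "birth_date", "subsidiary": "B"},
    {"record_id": 9, "rule": "valid_range", "element": "birth_date", "subsidiary": "B"},
    {"record_id": 9, "rule": "positive", "element": "salary", "subsidiary": "B"},
    {"record_id": 12, "rule": "not_in_future", "element": "hire_date", "subsidiary": "B"},
]


def decompose(errors, dimension):
    """Percentage of all errors attributed to each value of `dimension`."""
    counts = Counter(e[dimension] for e in errors)
    total = sum(counts.values())
    # most_common() orders components from largest to smallest contribution
    return {k: 100.0 * n / total for k, n in counts.most_common()}


by_element = decompose(errors, "element")
by_subsidiary = decompose(errors, "subsidiary")
print(by_element)
print(by_subsidiary)
```

The same function works for any dimension recorded in the error log, such as the data quality rule or the record subset.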
The next step would be to produce various reports of individual errors that contribute to the score or sub-score tabulation. These reports can be filtered and sorted in various ways to better understand the causes, nature, and magnitude of the data problems. Finally, at the very bottom of the data quality scorecard pyramid are reports showing the quality of individual records or subjects.
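The drill-down from error reports to individual records might look like the sketch below. The `severity` field is an assumption added here to give the report something to filter and sort on; the source does not prescribe any particular error attributes.

```python
from collections import defaultdict

# Hypothetical error log produced by the data quality rules.
errors = [
    {"record_id": 17, "element": "birth_date", "rule": "not_null", "severity": 3},
    {"record_id": 17, "element": "salary", "rule": "positive", "severity": 2},
    {"record_id": 42, "element": "hire_date", "rule": "not_in_future", "severity": 3},
    {"record_id": 58, "element": "zip_code", "rule": "valid_format", "severity": 1},
]


def error_report(errors, min_severity=1):
    """Errors at or above min_severity, most severe first, then by record."""
    kept = [e for e in errors if e["severity"] >= min_severity]
    return sorted(kept, key=lambda e: (-e["severity"], e["record_id"]))


def record_report(errors):
    """Record-level view: the data elements flagged for each affected record."""
    by_record = defaultdict(list)
    for e in errors:
        by_record[e["record_id"]].append(e["element"])
    return dict(by_record)


severe = error_report(errors, min_severity=2)
per_record = record_report(errors)
for e in severe:
    print(e["record_id"], e["element"], e["rule"])
print(per_record)
```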
These atomic level reports identify records and subjects affected by errors and could even estimate the probability that each data element is erroneous.
Summary
A data quality scorecard is a valuable analytical tool that allows us to measure the cost of bad data to the business and to estimate the ROI of data quality improvement initiatives. Building and maintaining a dimensional, time-dependent data quality scorecard must be one of the first priorities in any data quality management initiative.
About the Author: Arkady Maydanchik
For more than 30 years, Arkady Maydanchik has been a recognized leader and innovator in the fields of data quality and information integration.
Data Quality Assessment
Mr. Maydanchik graduated from high school at the age of 15, having won numerous math contests in the former Soviet Union. By the age of 20, he had co-authored three books and numerous articles in the fields of mathematical statistics and cluster analysis. His mathematical models were used in various fields, ranging from the construction of earthquake-resistant buildings to early diagnostics for children with immune system deficiencies and financial modeling. After immigrating to the United States in , Mr.
How to Create a Data Quality Scorecard by Arkady Maydanchik