Data Sentinel: A Declarative Production-Scale Data Validation Platform

Abstract

Many organizations process big data for important business operations and decisions. Hence, data quality greatly affects their success. Data quality problems continue to be widespread, costing US businesses an estimated $600 billion annually. To date, addressing data quality in production environments still poses many challenges: easily defining properties of high-quality data; validating production-scale data in a timely manner; debugging poor quality data; designing data quality solutions to be easy to use, understand, and operate; and designing data quality solutions to easily integrate with other systems. Current data validation solutions do not comprehensively address these challenges. To address data quality in production environments at LinkedIn, we developed Data Sentinel, a declarative production-scale data validation platform. In a simple and well-structured configuration, users declaratively specify the desired data checks. Then, Data Sentinel performs these data checks and writes the results to an easily understandable report. Furthermore, Data Sentinel provides well-defined schemas for the configuration and report. This makes it easy for other systems to interface or integrate with Data Sentinel. To make Data Sentinel even easier to use, understand, and operate in production environments, we provide Data Sentinel Service (DSS), a complementary system to help specify data checks; schedule, deploy, and tune data validation jobs; and understand data checking results. The contributions of this paper include the following: 1) Data Sentinel, a declarative production-scale data validation platform successfully deployed at LinkedIn; 2) a generic design to build and deploy similar systems for production environments; and 3) experiences and lessons learned that can benefit practitioners with similar objectives.

  • IEEE Keywords

    • Data integrity
    • Production
    • LinkedIn
    • Big Data
    • Business
    • Debugging
    • Schedules
  • Controlled Indexing

    • business data processing
    • data analysis
    • data handling

Introduction

Many organizations process big data for important business operations and decisions. Hence, data quality greatly affects their success. In a comprehensive study and survey of 647 data warehousing and business intelligence professionals, “The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year” [1]. Three proprietary studies estimated that the total cost of poor data quality ranged from 8-12% of revenue, and poor data may consume 40-60% of a service organization’s expense [2]. Despite the staggering costs of poor data quality, this problem appears to be widespread. Having performed measurements at the data field level, many case studies reported field error rates varying from 0.5% to 30% [2]. “Unless an enterprise has made extraordinary efforts, it should expect data (field) error rates of approximately 1-5%” [2].

At LinkedIn, data quality problems affected the job recommendations platform in October 2018. As a result, client job views and usage declined by 40-60% over time, resulting in financial losses for LinkedIn. Once revenue drops were detected, a total of 5 engineers took 8 days to identify the root cause and 11 days to resolve the issue. This incident revealed the following observations:

  • Poor data quality is difficult to detect and can result in significant business value loss
  • Data debugging is difficult and requires significant engineering resources
  • Resolving poor data quality not only requires timely and correct intervention but also results in significant opportunity costs, such as other projects and deliverables being delayed

Addressing data quality in production environments faces the following challenges:

  • Establishing fundamental principles, universal standards, and best practices for data quality, especially in production environments: such ideals do not exist [3]
  • Defining properties of high-quality data: an arduous and time-consuming task, especially when properties and measurements of high-quality data may not be well-known
  • Examining and checking data: manual efforts are error-prone and not scalable, while automated efforts consume engineering resources for development and maintenance
  • Debugging poor quality data and mitigating its effects: the ground truth may be unknown and the types of data errors may be unbounded and unforeseeable, especially in big data
  • Addressing data quality in big data: dimensions of big data complicate data quality operations
  • Improving attitudes and culture toward data quality: organizations usually prioritize and address data quality only after detecting related drops in revenue or productivity

We believe the following criteria are important for addressing data quality in production environments:

  • A clean separation of concerns between application logic and logic for data quality operations
  • Easy to define properties of high-quality data: these properties help detect poor data quality
  • Easy to examine and check data for different use cases: this encourages prioritization of data quality, increases adoption rate, and enables data-driven decisions from understandable results
  • Easy to debug poor quality data, such as identifying faulty records and their root causes
  • Easy to clean and repair poor data quality
  • Easy to intervene, such as preventing the propagation of poor quality data
  • Big data considerations: noise level tolerance, algorithmic and computational optimizations, and scalability
  • Production environment considerations: execution, efficiency, reliability, automation, standardization, handling expected and sudden changes, availability, deployment, scheduling, monitoring, and interfacing or integrating with other systems

This paper does not address data cleaning and reparation to mitigate poor data quality. Instead, we focus on data validation solutions for production environments. Current data validation systems [4]–[7] do not comprehensively address these criteria. Most provide an API framework to define properties of high-quality data, perform the corresponding data checks, and persist the results. This design does not offer a clean separation of concerns between application logic and data checking logic. Furthermore, it requires users to learn new APIs and does little to prevent users from mixing data checking code with application code. This burdens them to maintain this mixed code, and any API changes require re-testing at least parts of the code base invoking the API for data checking. Other shortcomings include a lack of consideration for the following: interfacing or integrating with other systems; optimizations for big data; and production environment aspects. Related work in Section VIII discusses current data validation systems and their drawbacks in more detail.
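To make the mixing problem concrete, consider the following minimal Python sketch. The function and field names are invented for illustration, and it does not depict any particular framework’s API; it only shows how hand-written checks end up interleaved with application logic, so both must be maintained and re-tested together whenever the checking code changes:

    # Illustrative anti-pattern (hypothetical code, not any specific library):
    # data-checking logic is interleaved with application logic.
    def process_job_views(records):
        # Hand-written checks the team must maintain and re-test.
        for r in records:
            if r.get("member_id") is None:
                raise ValueError("null member_id")
            if r["view_count"] < 0:
                raise ValueError("negative view_count")
        # Application logic, now coupled to the checks above.
        return sum(r["view_count"] for r in records)

A declarative platform instead moves the checks out of this function and into a configuration owned by the validation system, leaving the application code untouched.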

To meet these previously identified criteria, we developed Data Sentinel, a declarative production-scale data validation platform. To easily define properties of high-quality data with Data Sentinel, users only need to declaratively specify the desired data checks in a simple and well-structured configuration. Hence, users do not need to write and maintain data checking code. Then, Data Sentinel performs these data checks and writes the results to an easily understandable report. This report clearly shows which data checks passed or failed and why, with sample faulty records provided if requested. Because Data Sentinel provides well-defined schemas for the configuration and report, other systems can easily interface or integrate with Data Sentinel. In other words, Data Sentinel’s design follows the MVC [8] software design pattern and establishes these schemas as a well-defined model. This model of schemas guides the development of other systems that integrate with or extend Data Sentinel, through implementing custom controllers and views. To effectively address data quality in production environments at LinkedIn, it is imperative to make Data Sentinel easy to use, understand, and operate in these environments. To this end, we provide Data Sentinel Service (DSS), a complementary system to help specify data checks; schedule, deploy, and tune data validation jobs; and understand validation results. The main contributions of this paper include the following:

  • Data Sentinel, a declarative production-scale data validation platform successfully deployed at LinkedIn
  • A generic design to build and deploy similar systems for production environments
  • Experiences and lessons learned that can benefit practitioners with similar objectives

The rest of this paper is organized as follows. Section II gives an overview of Data Sentinel’s design. Section III describes the design of declarative configurations and data checking results to easily specify and understand a rich suite of natively supported data checks; Section III also discusses the implementation of these data checks. Section IV examines the design and implementation of optimized code generation from the configuration file to perform the specified data checks. Section V discusses the design and implementation for big data considerations, including effective optimizations and scalability challenges. Section VI focuses on the design of DSS for production environments. For practitioners with similar objectives, Section VII provides experiences and lessons learned during the development and deployment of Data Sentinel. Finally, we conclude with Sections VIII and IX to highlight related and future work.
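The paper’s actual configuration and report schemas are not reproduced here. Purely as a hedged illustration of the declarative idea, the Python sketch below (with invented keys such as "assertions" and "sample_faulty_records", which are our own assumptions rather than Data Sentinel’s real schema) separates a check specification from a generic engine that executes it and emits a pass/fail report with sample faulty records:

    # Hypothetical sketch in the spirit of Data Sentinel; configuration keys
    # and check names are illustrative assumptions, not the platform's API.
    config = {
        "dataset": "job_views",
        "assertions": [
            {"field": "member_id", "check": "not_null"},
            {"field": "view_count", "check": "in_range", "min": 0, "max": 1_000_000},
        ],
        "sample_faulty_records": 2,
    }

    def validate(records, config):
        """Run the declared checks and return an understandable report."""
        report = []
        for a in config["assertions"]:
            faulty = []
            for r in records:
                v = r.get(a["field"])
                if a["check"] == "not_null":
                    ok = v is not None
                else:  # "in_range"
                    ok = v is not None and a["min"] <= v <= a["max"]
                if not ok:
                    faulty.append(r)
            report.append({
                "field": a["field"],
                "check": a["check"],
                "passed": not faulty,
                "num_faulty": len(faulty),
                "sample_faulty": faulty[:config["sample_faulty_records"]],
            })
        return report

    rows = [{"member_id": 1, "view_count": 42},
            {"member_id": None, "view_count": -5}]
    for entry in validate(rows, config):
        print(entry)

Because the checks live in data rather than code, another system such as DSS can generate, inspect, or version the configuration without touching the engine, which is the interfacing benefit that well-defined schemas provide.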

Conclusion

In conclusion, we presented Data Sentinel, a declarative production-scale data validation platform, and showed how it successfully addresses data quality problems at LinkedIn. We also showed how a complementary system like Data Sentinel Service (DSS) can be instrumental in making a platform like Data Sentinel easy to use, understand, and operate in production environments. This resulted in Data Sentinel’s increased adoption at LinkedIn. Furthermore, we believe that Data Sentinel can serve as a generic design to build and deploy similar systems for production environments. Finally, our experiences and lessons learned can provide invaluable insights for practitioners with similar objectives. Future work includes automatically discovering data assertions, recommending them, and generating a corresponding validation configuration. This process takes as input a dataset of interest, its schema, its semantics, or other important details. Such automation decreases the time and difficulty 1) to discover and formalize (potentially ill-defined) properties of what constitutes high-quality data and 2) to prepare a configuration file for data validation with Data Sentinel. Another direction includes validating data in a real-time streaming manner, as opposed to Data Sentinel’s current offline batch-processing approach. We hope that Data Sentinel not only continues to raise awareness of the importance of data quality but also inspires the concepts of “data testing coverage” and data health metrics for datasets to be incorporated into software engineering and big data analytics.
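As a rough sketch of the assertion-discovery direction (the profiling heuristics and output format below are our own illustrative assumptions, not a published Data Sentinel feature), a discovery step could profile a sample of the dataset and emit a candidate declarative configuration for a human to review:

    # Hypothetical sketch of assertion discovery: profile a data sample and
    # propose declarative checks; thresholds and output keys are assumptions.
    def discover_assertions(records):
        fields = sorted({k for r in records for k in r})
        assertions = []
        for f in fields:
            values = [r.get(f) for r in records]
            non_null = [v for v in values if v is not None]
            # Propose a not_null check only if the sample contains no nulls.
            if len(non_null) == len(values):
                assertions.append({"field": f, "check": "not_null"})
            # Propose a range check for numeric fields from the observed span.
            if non_null and all(isinstance(v, (int, float)) for v in non_null):
                assertions.append({"field": f, "check": "in_range",
                                   "min": min(non_null), "max": max(non_null)})
        return {"assertions": assertions}

    sample = [{"member_id": 1, "view_count": 42},
              {"member_id": 2, "view_count": 7}]
    print(discover_assertions(sample))

In practice, the recommended assertions would be reviewed and tuned before adoption, since ranges observed in one sample may be too tight for future data.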

Acknowledgment

Varun Mithal developed the Gradle plugin for Data Sentinel. Maneesh Varshney and members of the Data Sentinel Service team (Adrian Fernandez, Sailesh Mittal, Jiefu Zheng, Audrey Alpizar, Changling Huang) worked with us to design and build Data Sentinel Service. Deepak Agarwal, Kapil Surlaker, and Suga Viswesan supported the Data Sentinel project. We thank everyone for their help and support.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization established in December 2013 to provide research and educational services. Its members initially formed a virtual group on the Viber social network, and the core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.

KSRA, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for everyone around the world. We want to integrate research and education. We believe education is a fundamental human right, so our services concentrate on inclusive education.

The KSRA team partners with under-served local communities around the world to improve the access to and quality of knowledge-based education, amplify and augment learning programs where they exist, and create new opportunities for e-learning where traditional education systems are lacking or non-existent.

Full Paper PDF File

Data Sentinel: A Declarative Production-Scale Data Validation Platform

Bibliography

A. Swami, S. Vasudevan, and J. Huyn, “Data Sentinel: A Declarative Production-Scale Data Validation Platform,” 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 2020, pp. 1579-1590, doi: 10.1109/ICDE48307.2020.00140.



Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship and holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.