Current Conditional Functional Dependency (CFD) discovery algorithms always need a well-prepared training dataset. This condition makes them difficult to apply on large and low-quality datasets. To handle the volume issue of big data, we develop the sampling algorithms to obtain a small representative training set. We design the fault-tolerant rule discovery and conflict-resolution algorithms to address the low-quality issue of big data. We also propose parameter selection strategy to ensure the effectiveness of CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.
- data mining ,
- conditional functional dependency ,
- big data ,
- data quality
With the accumulation of data at present, databases have become increasingly large. At the same time, due to the difficulty in manual maintenance and variations of data sources, big data involves a high possibility of quality problems which make them difficult to use. Therefore, cleaning techniques are crucial for effective use of big data. Conditional Functional Dependency (CFD) discovery algorithms are powerful tools in data cleaning, because they can find the hidden relation among items. Such relation can help us find dirty tuples which can be modified accordingly. The Functional Dependencies (FDs) can be considered as special forms of CFDs. High-quality rules are the core of effective data cleaning systems with CFDs.
High-quality CFD discovery on big data brings two challenges. First, the volume of big data requires a highly efficient and scalable discovery algorithm with complexity that is linear or sublinear to the data size. Second, big data may involve further data-quality problems. Thus, a clean training set can hardly be prepared for CFD discovery and a fault-tolerant CFD discovery approach is necessary. Owing to the importance of such an approach, some CFD discovery algorithms have been proposed. However, none of them could address the aforementioned challenges. Most existing methods such as the method in Ref.  discover high-quality rules with data mining algorithms on a small but clean dataset efficiently. However, these approaches are unsuitable for big data cleaning due to the lack of representative dataset. Other methods for discovering CFDs on dirty datasets such as the method in Ref.  need many passes over the dataset to find approximate CFDs, but this method may not be effective on a big dataset that cannot be loaded into the main memory.
Therefore, developing a scalable method is necessary to mine high-quality rules from big data with size larger than the main memory. To achieve this goal, we design a scalable and systemic algorithm. We sample from the big data first to obtain an effective training set within one pass of scan. Then, we discover CFDs based on sampling results. The reason for sampling before discovery is that, without sampling, we have to scan the big dataset many times to mine patterns and calculate support for finding CFDs. This task is time-consuming, especially when the dataset is larger than the memory. Another purpose of sampling is to filter dirty items and keep clean ones. The following example illustrates the need for sampling.
For big data, rule discovery in data cleaning brings new challenges. To solve this problem, we proposed a novel CFD discovery method for big data. For the volume feature of big data, we designed a sampling algorithm to obtain typical samples by scanning data only once. Then, on the sample set, we adapted existing CFD discovery algorithms to tolerate the fault. By integrating these modified methods, we discovered a preliminary CFD set. To increase the quality in the discovered rule set, we designed a graph-based rule selection algorithm. Considering that a user may have different requirements for CFD discovery, we proposed a strategy to select parameters according to the requirements of users. The experimental results demonstrated that the proposed algorithm is suitable for big data and outperforms existing algorithms. Future work includes extending the proposed algorithm to parallel platforms and modifying the proposed algorithm to discover other rules.
This paper was partially supported by the National Key R&D Program of China (No. 2018YFB1004700), the National Natural Science Foundation of China (Nos. U1509216, U1866602, and 61602129), and Microsoft Research Asia.
The Kavian Scientific Research Association (KSRA) is a non-profit research organization to provide research / educational services in December 2013. The members of the community had formed a virtual group on the Viber social network. The core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific / research association with an emphasis on education.
KSRA research association, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public or private knowledge-based companies, students, researchers, researchers, professors, universities, and industrial and semi-industrial centers around the world.
Our main services Based on Education for all Spectrum people in the world. We want to make an integration between researches and educations. We believe education is the main right of Human beings. So our services should be concentrated on inclusive education.
The KSRA team partners with local under-served communities around the world to improve the access to and quality of knowledge based on education, amplify and augment learning programs where they exist, and create new opportunities for e-learning where traditional education systems are lacking or non-existent.
FULL Paper PDF file:Mining conditional functional dependency rules on big data
Mining conditional functional dependency rules on big data
, in Big Data Mining and Analytics, vol. 3, no. 1, pp. 68-84, March 2020,
PDF reference and original file: Click here
Professor Siavosh Kaviani was born in 1961 in Tehran. He had a professorship. He holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.