Chapter One
Introduction
We live in a world where large amounts of data are collected daily through automated systems. This has led to a flood of data that has reached petabytes or even exabytes (EB), where one exabyte equals 10¹⁸ bytes. The sheer size of this data makes it impossible for humans to absorb and benefit from it on their own. Organizations consider this data an important asset: it contains hidden knowledge that can be very important for achieving many goals that support the growth and success of organizations, including effective marketing, decision support, fraud detection, and many others. To take advantage of this huge volume of data, tools and techniques are needed to discover that knowledge from the data. In this work, such knowledge is used to catch discrepancies in the data, serving as a validation tool in the data entry phase.
1.1. Knowledge Discovery in Databases
Knowledge Discovery in Databases (KDD) has been defined as: “KDD is the nontrivial
process of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data.” [12]. The KDD process consists of an iterative sequence of steps to discover interesting patterns from huge amounts of data [10]. These steps are:
1. Data cleaning is the process of removing noise and inconsistent data.
2. Data integration is the process of combining data from multiple heterogeneous sources.
3. Data selection is the process of retrieving the data that is relevant to the current mining task.
4. Data transformation is the process in which data is converted and consolidated into forms suitable for mining by performing operations such as summarization or aggregation.
5. Data mining is one of the most important steps, in which intelligent mining methods are applied to extract data patterns.
6. Pattern evaluation is another important step, in which truly interesting patterns that represent knowledge are identified based on interestingness measures.
7. Knowledge presentation is the process of delivering knowledge to the end user using visualization and knowledge representation techniques.
Steps 1 to 4 are concerned with improving the quality of the data and making it suitable for the mining step. In this view, data mining is just one step in the knowledge discovery process, but an essential one, because it reveals the hidden patterns that are then evaluated. The KDD process is depicted in Figure 1.1, and a short illustrative sketch of the steps chained into a pipeline follows the figure.
Figure 1.1: The process of KDD.
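To make the flow of these steps concrete, the following minimal Python sketch chains simplified versions of cleaning, selection, transformation, and mining over a toy record set. The function bodies, field names, and records are illustrative assumptions only; they are not part of the KDD definition or of the system described later in this thesis.

# Minimal sketch of the KDD steps as a pipeline; the functions below are
# illustrative placeholders, not an implementation of any specific system.
def clean(records):
    # Data cleaning: drop records with missing values (a simple noise filter).
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    # Data selection: keep only the task-relevant attributes.
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    # Data transformation: discretize a numeric attribute into categories.
    return [{**r, "age_group": "60+" if r["age"] >= 60 else "<60"} for r in records]

def mine(records):
    # Data mining placeholder: count how often each (status, age_group) pair occurs.
    counts = {}
    for r in records:
        key = (r["status"], r["age_group"])
        counts[key] = counts.get(key, 0) + 1
    return counts

raw = [
    {"status": "retired", "age": 67, "name": "A"},
    {"status": "retired", "age": 63, "name": None},
    {"status": "active",  "age": 45, "name": "B"},
]
patterns = mine(transform(select(clean(raw), ["status", "age"])))
print(patterns)  # {('retired', '60+'): 1, ('active', '<60'): 1}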
1.2. Data Mining
The Data Mining (DM) step in the KDD process is an essential one: patterns are discovered and, if they are interesting enough, they are converted into knowledge. DM is not restricted to a particular subject, so mining can be carried out on any type of data, and its techniques are drawn from a wide range of disciplines. "The interdisciplinary nature of data mining research and development contributes significantly to the success of data mining and its extensive applications." [Han, p. 23]. So, DM is a confluence of many subjects, as depicted in Figure 1.2.
In general, DM can be considered a learning process, either supervised or unsupervised, depending on the data format or structure (classified or unclassified). The mining tasks, or functionalities, are categorized into two types, namely descriptive mining tasks and predictive mining tasks, as depicted in Figure 1.3. Descriptive mining tasks use unclassified data, while predictive mining tasks use classified data. Descriptive mining tasks give the general characteristics or properties of the data; they include characterization, discrimination, cluster analysis, outlier analysis, and association analysis.
Figure 1.2: The confluence of data mining fields (figure borrowed from [Han]).
Predictive mining tasks are used to predict the class of an object instance and can also be used to predict the value of a missing attribute. Predictive mining tasks include classification and prediction.
Figure 1.3: Classification of data mining tasks into predictive tasks (classification, prediction) and descriptive tasks (cluster analysis, evolution analysis, outlier analysis, deviation analysis, association analysis, characterization, discrimination).
Since 1993, one of the most common and widely used data mining functionalities has been association analysis (the discovery of association rules), for which a wide range of algorithms have been developed.
1.3. Association analysis
Association analysis, also known as Association Rule Mining (ARM), is the process of discovering interesting relationships in data in the form of association rules. AIS was the first algorithm for ARM; its name stems from the initials of its inventors' surnames: Agrawal, Imielinski, and Swami [1]. An association rule, which is built from frequent itemsets, is a vital tool that helps decision-makers take appropriate action. The concept of association analysis has been used in many domains, such as business, health care, government agencies, market basket analysis, and many others. The knowledge extracted in the form of association rules is presented as 'IF-THEN' rules, which are easily understood. Association rules can therefore serve as a beneficial tool for data validation during the data entry phase; a small worked sketch of the underlying measures follows.
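To illustrate the two standard measures behind association rules, support and confidence, the following minimal Python sketch evaluates one candidate IF-THEN rule over a toy transaction set. The item names and the rule itself are illustrative assumptions, not rules mined from the pensioners' data.

# Minimal sketch: support and confidence of one candidate association rule
# over a toy transaction set (item names are illustrative assumptions).
transactions = [
    {"retired", "male", "pension_type_A"},
    {"retired", "female", "pension_type_A"},
    {"retired", "male", "pension_type_B"},
    {"active", "male", "pension_type_B"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Support of the whole rule divided by support of the antecedent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Candidate rule: IF retired THEN pension_type_A
rule_if, rule_then = {"retired"}, {"pension_type_A"}
print(support(rule_if | rule_then, transactions))    # 0.5
print(confidence(rule_if, rule_then, transactions))  # 0.666...

A rule is typically kept only if both values exceed user-defined minimum support and minimum confidence thresholds.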
1.4. Data validation
Data validation refers to the process of ensuring the accuracy and quality of data. It is
implemented by building several checks into a system to ensure the logical consistency
of input and stored data [8]. There are many types of data validation that can be carried out by a system. Most data validation procedures perform one or more of the following checks to ensure the data's correctness before storing it in a database (a short code sketch after the list illustrates a few of them). Common types of data validation checks include:

• Data Type Check confirms that the entered data has the data type required by the system. This type of check is usually enforced by the DBMS used.
• Code Check ensures that a field value is selected from a valid list of values or follows certain formatting rules.
• Range Check verifies whether the input data falls within a predefined range.
• Format Check ensures that the input data follows a certain predefined format.
• Consistency Check confirms that the input data follows a consistent logical order.
• Uniqueness Check confirms that certain data values carry the uniqueness property, which is a vital constraint in databases.
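The short Python sketch below illustrates a few of these checks applied to a single input record. The field names, the 12-digit identifier format, and the specific checks chosen are assumptions made for illustration only; they are not the actual rules of the Social Security Fund system.

# Minimal sketch (assumed field names and limits): a few classic
# validation checks applied to one input record.
import re
from datetime import date

def validate_record(record, existing_ids):
    # Collects human-readable error messages; an empty list means the record passes.
    errors = []
    # Data type check: the pension amount must be numeric.
    if not isinstance(record.get("pension_amount"), (int, float)):
        errors.append("pension_amount must be numeric")
    # Range check: the birth date must not lie in the future.
    if record.get("birth_date") and record["birth_date"] > date.today():
        errors.append("birth_date lies in the future")
    # Format check: the national ID is assumed here to be exactly 12 digits.
    if not re.fullmatch(r"\d{12}", record.get("national_id", "")):
        errors.append("national_id must be 12 digits")
    # Uniqueness check: the national ID must not already be registered.
    if record.get("national_id") in existing_ids:
        errors.append("national_id already registered")
    return errors

print(validate_record(
    {"pension_amount": 850.0, "birth_date": date(1950, 4, 2), "national_id": "119500402123"},
    existing_ids=set(),
))  # prints [] because every check passes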
Validation is an important and vital process in any system that receives input data from the outside world. It plays an important role in a system's ability to produce accurate output that can be used confidently in decision-making, and it can mitigate system defects and failures. Validation ensures the integrity of the data and, consequently, of the results produced from it.
All of the validation checks mentioned above lie within the domain of software engineering and programming. Here, we introduce data discrepancy validation, which checks for discrepancies in the data using functionality drawn from DM, a field that is part of Artificial Intelligence (AI); a brief sketch of the idea follows. As mentioned above, verifying the validity of data at entry time helps improve data quality at the lowest possible cost.
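To give a rough sense of how mined association rules can flag a discrepancy at data entry time, the following minimal Python sketch checks a new record against a small set of previously mined rules. The rules, field names, and confidence threshold are hypothetical placeholders, not rules produced by the system described in this thesis.

# Minimal sketch: flag a record whose values contradict a high-confidence
# mined association rule (rules and fields below are hypothetical).
mined_rules = [
    # (antecedent, consequent, confidence)
    ({"pension_type": "survivor"}, {"beneficiary": "family_member"}, 0.97),
    ({"status": "retired"}, {"age_group": "60+"}, 0.95),
]

def find_discrepancies(record, rules, min_confidence=0.9):
    # Report every rule whose antecedent matches the record but whose consequent does not.
    issues = []
    for antecedent, consequent, conf in rules:
        if conf < min_confidence:
            continue
        antecedent_holds = all(record.get(k) == v for k, v in antecedent.items())
        consequent_violated = any(record.get(k) != v for k, v in consequent.items())
        if antecedent_holds and consequent_violated:
            issues.append((antecedent, consequent, conf))
    return issues

new_record = {"pension_type": "survivor", "beneficiary": "employer", "status": "active"}
print(find_discrepancies(new_record, mined_rules))
# The first rule fires: a survivor pension whose beneficiary is not a family member.

At data entry time, such a flag can be shown to the operator as a warning rather than a hard rejection, since even a high-confidence rule may have legitimate exceptions.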
1.5 Data quality
Data quality is critical for many information-intensive applications.
Organizations and individuals routinely make important decisions based on incorrect
data stored in supposedly reliable databases. Data errors in some fields, such as
medicine, can have particularly serious consequences. These errors can arise at a variety
of points in the data life cycle, from data entry, through storage, integration, and
cleaning, to analysis and decision making [110].
Given the importance of data quality, major public initiatives have been launched such
as the “Data Quality Act” in the United States of America and the “European Directive
2003/98” issued by the European Parliament [111].
Data entry is a critical and valuable process, and data quality control is key to ensuring
that the data entered meets certain quality standards, such as accuracy, completeness,
consistency, and relevance.
Other factors also affect data quality, such as the training of data entry staff and the environment in which data entry takes place.
1.6. Problem Statement
The Social Security Fund is the largest Libyan institution of its kind: for the month of February 2023 it disbursed 450,000 pensions, and it keeps records of 1,950,000 registered contributors subject to the Social Security Law, for whom it collects contributions and disburses pensions and benefits. The Social Security Fund relies entirely on the data of the insured registered in its databases.
The Social Security Fund has used information technology since 1965, when punched cards and tapes were used to store the data of the insured. Pensions and benefits were later disbursed through systems built with the FoxPro language and SQL Server 2000 databases. Through this sequence of software generations, the data moved from one database to another with irregularities in its structure, which led to poor data quality. This affects the reliability of the data and causes a reluctance to publish any reports or indicators that could help in making decisions.
Accurate validation of Social Security Fund data before it is stored in the databases is a challenge, because the validation rules cannot all be predicted at the time the data entry interfaces are programmed. Rapidly issued decisions and court rulings are the main reason for this, and they cause the data entry software to be inaccurate in validating the data.
1.7. Motivations
This research is motivated by a number of points, such as:
• Accurate validation of data in smart ways helps organizations trust their data.
• Reducing the cost and effort involved in validating data entry.
• Improving the user-friendliness of data entry applications so that correct values are suggested based on the entered values.
• The use of AI tools in data validation.
• Saving time and effort in programming the constraints imposed on the data.
• Increasing institutions' confidence in data entry applications that use smart methods to validate data entry.
• Increasing end-users' confidence in the software used in the organization when it applies accurate methods for verifying the validity of the data.
1.8. Aims of the study

• To develop a data entry application that uses artificial intelligence techniques to verify the validity of entered data without reprogramming the data entry interfaces every time a new decision is issued that affects the data validation rules, and thereby to reach an acceptable and reliable level of data quality in several areas, including actuarial studies.
• To save the effort and time spent guessing at and writing the constraint code for data entry interfaces, which can be very cumbersome.
1.9 Research goals
Using intelligent and adaptive input forms makes the data entry process faster through auto-completion. The main goal is to increase the accuracy and quality of the data without any additional cost.
1.10 Thesis organization
This thesis is organized into five chapters, as follows:
• Chapter Two
The researcher introduces the field of ARM, presents a brief survey of ARM algorithms, and focuses on the Apriori algorithm. The concept and methods of data validation are also introduced, together with previous research on the use of the Apriori algorithm in several applications and research on data validation.
• Chapter Three
We present the design and implementation of our system for validating discrepancies in pensioners' data using association rules. The system was designed and programmed using Visual Basic.Net 2019, and realistic data from the pensioners' databases were used as its inputs.
• Chapter Four
We explain the results of a number of experiments carried out on our system as a test of it, and we present the interpretation of all association rules produced by running the algorithm.
• Chapter Five
We discuss the results obtained from the experiments performed on our system, offer some comments on these results, and give some suggestions for future directions.