Chapter One
Introduction
We live in a world where large amounts of data are collected daily through automated systems. This has led to a flood of data that has reached petabytes or even exabytes (EB), where one exabyte equals 10¹⁸ bytes. The sheer size of this data makes it impossible for humans to absorb and benefit from it on their own. Organizations consider this data an important asset: it contains hidden knowledge that can be very important for achieving many goals that support the growth and success of organizations, including effective marketing, decision support, fraud detection, and many others. To take advantage of this huge volume of data, tools and techniques are needed to discover that knowledge from the data. In this work, such knowledge is used to catch discrepancies in the data, serving as a validation tool in the data entry phase.
1.1. Knowledge Discovery in Databases
Knowledge Discovery in Databases (KDD) has been defined as: “KDD is the nontrivial
process of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data.” [12]. The KDD process consists of an iterative sequence of steps to discover interesting patterns from huge amounts of data [10]. These steps are:
1. Data cleaning is the process of removing noise and inconsistent data.
2. Data integration is the process of combining data from multiple heterogeneous sources.
3. Data selection is the process of retrieving the data that is relevant to the current mining task.
4. Data transformation is the process in which data is converted and consolidated into forms suitable for mining by performing operations such as summarization or aggregation.
5. Data mining is one of the most important steps, in which intelligent mining methods are applied to extract data patterns.
6. Pattern evaluation is another important step, in which truly interesting patterns that represent knowledge are identified based on interestingness measures.
7. Knowledge presentation is the process of delivering knowledge to the end user using visualization and knowledge representation techniques.
Steps 1 to 4 are concerned with improving the quality of the data and making it suitable for the mining step. In this view, data mining is just one step in the knowledge discovery process, but an essential one, because it reveals the hidden patterns that are then evaluated. The KDD process is depicted in Figure 1.1, and a short illustrative sketch of the steps chained into a pipeline follows the figure.
Figure 1.1: The process of KDD.
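To make the flow of these steps concrete, the following minimal Python sketch chains simplified versions of cleaning, selection, transformation, and mining over a toy record set. The function bodies, field names, and records are illustrative assumptions only; they are not part of the KDD definition or of the system described later in this thesis.

# Minimal sketch of the KDD steps as a pipeline; the functions below are
# illustrative placeholders, not an implementation of any specific system.
def clean(records):
    # Data cleaning: drop records with missing values (a simple noise filter).
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    # Data selection: keep only the task-relevant attributes.
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    # Data transformation: discretize a numeric attribute into categories.
    return [{**r, "age_group": "60+" if r["age"] >= 60 else "<60"} for r in records]

def mine(records):
    # Data mining placeholder: count how often each (status, age_group) pair occurs.
    counts = {}
    for r in records:
        key = (r["status"], r["age_group"])
        counts[key] = counts.get(key, 0) + 1
    return counts

raw = [
    {"status": "retired", "age": 67, "name": "A"},
    {"status": "retired", "age": 63, "name": None},
    {"status": "active",  "age": 45, "name": "B"},
]
patterns = mine(transform(select(clean(raw), ["status", "age"])))
print(patterns)  # {('retired', '60+'): 1, ('active', '<60'): 1}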
1.2. Data Mining
The Data Mining (DM) step in the KDD process is an essential one: patterns are discovered and, if they are interesting enough, they are converted into knowledge. DM is not restricted to a particular subject, so mining can be carried out on any type of data, and its techniques are drawn from a wide range of disciplines. "The interdisciplinary nature of data mining research and development contributes significantly to the success of data mining and its extensive applications." [Han, p. 23]. So, DM is a confluence of many subjects, as depicted in Figure 1.2.
In general, DM can be considered a learning process, either supervised or unsupervised, depending on the data format or structure (classified or unclassified). The mining tasks, or functionalities, are categorized into two types, namely descriptive mining tasks and predictive mining tasks, as depicted in Figure 1.3. Descriptive mining tasks use unclassified data, while predictive mining tasks use classified data. Descriptive mining tasks give the general characteristics or properties of the data; they include characterization, discrimination, cluster analysis, outlier analysis, and association analysis.
Figure 1.2: The confluence of data mining fields (figure borrowed from [Han]).
Predictive mining tasks are used to predict the class of an object instance and can also be used to predict the value of a missing attribute. Predictive mining tasks include classification and prediction.
Figure 1.3: Classification of data mining tasks into predictive tasks (classification, prediction) and descriptive tasks (cluster analysis, evolution analysis, outlier analysis, deviation analysis, association analysis, characterization, discrimination).
Since 1993, one of the most common and widely used data mining functionalities has been association analysis (the discovery of association rules), for which a wide range of algorithms have been developed.
1.3. Association analysis
Association analysis, also known as Association Rule Mining (ARM), is the process of discovering interesting relationships in data in the form of association rules. AIS was the first algorithm for ARM; its name stems from the initials of its inventors' surnames: Agrawal, Imielinski, and Swami [1]. An association rule, which is built from frequent itemsets, is a vital tool that helps decision-makers take appropriate action. The concept of association analysis has been used in many domains, such as business, health care, government agencies, market basket analysis, and many others. The knowledge extracted in the form of association rules is presented as 'IF-THEN' rules, which are easily understood. Association rules can therefore serve as a beneficial tool for data validation during the data entry phase; a small worked sketch of the underlying measures follows.
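To illustrate the two standard measures behind association rules, support and confidence, the following minimal Python sketch evaluates one candidate IF-THEN rule over a toy transaction set. The item names and the rule itself are illustrative assumptions, not rules mined from the pensioners' data.

# Minimal sketch: support and confidence of one candidate association rule
# over a toy transaction set (item names are illustrative assumptions).
transactions = [
    {"retired", "male", "pension_type_A"},
    {"retired", "female", "pension_type_A"},
    {"retired", "male", "pension_type_B"},
    {"active", "male", "pension_type_B"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Support of the whole rule divided by support of the antecedent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Candidate rule: IF retired THEN pension_type_A
rule_if, rule_then = {"retired"}, {"pension_type_A"}
print(support(rule_if | rule_then, transactions))    # 0.5
print(confidence(rule_if, rule_then, transactions))  # 0.666...

A rule is typically kept only if both values exceed user-defined minimum support and minimum confidence thresholds.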
1.4. Data validation
Data validation refers to the process of ensuring the accuracy and quality of data. It is
implemented by building several checks into a system to ensure the logical consistency
of input and stored data [8]. There are many types of data validation that can be carried out by a system. Most data validation procedures perform one or more of the following checks to ensure the data's correctness before storing it in a database (a short code sketch after the list illustrates a few of them). Common types of data validation checks include:

• Data Type Check confirms that the entered data has the data type required by the system. This type of check is usually enforced by the DBMS used.
• Code Check ensures that a field value is selected from a valid list of values or follows certain formatting rules.
• Range Check verifies whether the input data falls within a predefined range.
• Format Check ensures that the input data follows a certain predefined format.
• Consistency Check confirms that the input data follows a consistent logical order.
• Uniqueness Check confirms that certain data values carry the uniqueness property, which is a vital constraint in databases.
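The short Python sketch below illustrates a few of these checks applied to a single input record. The field names, the 12-digit identifier format, and the specific checks chosen are assumptions made for illustration only; they are not the actual rules of the Social Security Fund system.

# Minimal sketch (assumed field names and limits): a few classic
# validation checks applied to one input record.
import re
from datetime import date

def validate_record(record, existing_ids):
    # Collects human-readable error messages; an empty list means the record passes.
    errors = []
    # Data type check: the pension amount must be numeric.
    if not isinstance(record.get("pension_amount"), (int, float)):
        errors.append("pension_amount must be numeric")
    # Range check: the birth date must not lie in the future.
    if record.get("birth_date") and record["birth_date"] > date.today():
        errors.append("birth_date lies in the future")
    # Format check: the national ID is assumed here to be exactly 12 digits.
    if not re.fullmatch(r"\d{12}", record.get("national_id", "")):
        errors.append("national_id must be 12 digits")
    # Uniqueness check: the national ID must not already be registered.
    if record.get("national_id") in existing_ids:
        errors.append("national_id already registered")
    return errors

print(validate_record(
    {"pension_amount": 850.0, "birth_date": date(1950, 4, 2), "national_id": "119500402123"},
    existing_ids=set(),
))  # prints [] because every check passes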
Validation is an important and vital process in any system that receives input data from the outside world. It plays an important role in a system's ability to produce accurate output that can be used confidently in decision-making, and it can mitigate system defects and failures. Validation ensures the integrity of the data and, consequently, of the results produced from it.
All of the validation checks mentioned above lie within the domain of software engineering and programming. Here, we introduce data discrepancy validation, which checks for discrepancies in the data using functionality drawn from DM, a field that is part of Artificial Intelligence (AI); a brief sketch of the idea follows. As mentioned above, verifying the validity of data at entry time helps improve data quality at the lowest possible cost.
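To give a rough sense of how mined association rules can flag a discrepancy at data entry time, the following minimal Python sketch checks a new record against a small set of previously mined rules. The rules, field names, and confidence threshold are hypothetical placeholders, not rules produced by the system described in this thesis.

# Minimal sketch: flag a record whose values contradict a high-confidence
# mined association rule (rules and fields below are hypothetical).
mined_rules = [
    # (antecedent, consequent, confidence)
    ({"pension_type": "survivor"}, {"beneficiary": "family_member"}, 0.97),
    ({"status": "retired"}, {"age_group": "60+"}, 0.95),
]

def find_discrepancies(record, rules, min_confidence=0.9):
    # Report every rule whose antecedent matches the record but whose consequent does not.
    issues = []
    for antecedent, consequent, conf in rules:
        if conf < min_confidence:
            continue
        antecedent_holds = all(record.get(k) == v for k, v in antecedent.items())
        consequent_violated = any(record.get(k) != v for k, v in consequent.items())
        if antecedent_holds and consequent_violated:
            issues.append((antecedent, consequent, conf))
    return issues

new_record = {"pension_type": "survivor", "beneficiary": "employer", "status": "active"}
print(find_discrepancies(new_record, mined_rules))
# The first rule fires: a survivor pension whose beneficiary is not a family member.

At data entry time, such a flag can be shown to the operator as a warning rather than a hard rejection, since even a high-confidence rule may have legitimate exceptions.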
1.5 Data quality
Data quality is critical for many information-intensive applications.
Organizations and individuals routinely make important decisions based on incorrect
data stored in supposedly reliable databases. Data errors in some fields, such as
medicine, can have particularly serious consequences. These errors can arise at a variety
of points in the data life cycle, from data entry, through storage, integration, and
cleaning, to analysis and decision making [110].
Given the importance of data quality, major public initiatives have been launched such
as the “Data Quality Act” in the United States of America and the “European Directive
2003/98” issued by the European Parliament [111].
Data entry is a critical and valuable process, and data quality control is key to ensuring
that the data entered meets certain quality standards, such as accuracy, completeness,
consistency, and relevance.
Other factors also affect data quality, such as the training of data entry staff and the environment in which data entry takes place.
1.6. Problem Statement
The Social Security Fund is the largest Libyan institution of its kind: for the month of February 2023 it disbursed 450,000 pensions, and it keeps records of 1,950,000 registered contributors subject to the Social Security Law, for whom it collects contributions and disburses pensions and benefits. The Social Security Fund relies entirely on the data of the insured registered in its databases.
The Social Security Fund has used information technology since 1965, when punched cards and tapes were used to store the data of the insured. Pensions and benefits were later disbursed through systems built with the FoxPro language and SQL Server 2000 databases. Through this sequence of software generations, the data moved from one database to another with irregularities in its structure, which led to poor data quality. This affects the reliability of the data and causes a reluctance to publish any reports or indicators that could help in making decisions.
Accurate validation of Social Security Fund data before it is stored in the databases is a challenge, because the validation rules cannot all be predicted at the time the data entry interfaces are programmed. Rapidly issued decisions and court rulings are the main reason for this, and they cause the data entry software to be inaccurate in validating the data.
1.7. Motivations
This research is motivated by a number of points, such as:
• Accurate validation of data in smart ways helps organizations trust their data.
• Reducing the cost and effort involved in validating data entry.
• Improving the user-friendliness of data entry applications so that correct values are suggested based on the entered values.
• The use of AI tools in data validation.
• Saving time and effort in programming the constraints imposed on the data.
• Increasing institutions' confidence in data entry applications that use smart methods to validate data entry.
• Increasing end-users' confidence in the software used in the organization when it applies accurate methods for verifying the validity of the data.
1.8. Aims of the study

• To develop a data entry application that uses artificial intelligence techniques to verify the validity of entered data without reprogramming the data entry interfaces every time a new decision is issued that affects the data validation rules, and thereby to reach an acceptable and reliable level of data quality in several areas, including actuarial studies.
• To save the effort and time spent guessing at and writing the constraint code for data entry interfaces, which can be very cumbersome.
1.9 Research goals
Using intelligent and adaptive input forms makes the data entry process faster through auto-completion. The main goal is to increase the accuracy and quality of the data without any additional cost.
1.10 Thesis organization
This thesis is organized into five chapters, as follows:
• Chapter Two
The researcher introduces the field of ARM, presents a brief survey of ARM algorithms, and focuses on the Apriori algorithm. The concept and methods of data validation are also introduced, together with previous research on the use of the Apriori algorithm in several applications and research on data validation.
• Chapter Three
We present the design and implementation of our system for validating discrepancies in pensioners' data using association rules. The system was designed and programmed using Visual Basic.Net 2019, and realistic data from the pensioners' databases were used as its inputs.
• Chapter Four
We explain the results of a number of experiments carried out on our system as a test of it, and we present the interpretation of all association rules produced by running the algorithm.
• Chapter Five
We discuss the results obtained from the experiments performed on our system, offer some comments on these results, and give some suggestions for future directions.