Research on Data Cleaning in Data Acquisition
Zhang Jin
(Nanjing Audit University, Nanjing 210029, Jiangsu, China)
Abstract: Starting from the present condition of electronic data collection for IT audit, this paper analyzes the importance of data cleaning in electronic data acquisition. Based on an exposition of the principle of data cleaning, the paper investigates solutions to the common problems in electronic data acquisition and illustrates them with an example of applying data cleaning to acquired electronic data.
Key words: IT audit, data acquisition, data cleaning
1. Introduction
To implement audit supervision effectively in a networked environment, it is necessary to study the techniques of data acquisition and processing, of which electronic data acquisition is an essential part. The features of IT audit data acquisition and transfer are as follows [1]:
(1) Auditees' data take a variety of forms: databases, text files, and web data.
(2) Not all of an auditee's data can be acquired during an audit, so the data must be classified and cleaned; either true original data or integrated data are needed.
(3) During data acquisition, the auditees' data cannot be examined and analyzed in detail because of limited time, so it is uncertain which data are important and which are not. The usual practice is to determine a certain range, acquire all the data within it, and leave further processing and coordination for later.
(4) Given the volume of data and the risks involved in acquisition, it is common practice to acquire more data than strictly necessary, which usually leads to repetition and excessive size.
(5) Because the values of some data attributes are uncertain or cannot be obtained, the acquired data are often incomplete.
It can thus be concluded that data cleaning plays an important role in making acquired data meet the demands of audit analysis. Accordingly, based on an exposition of the principle of data cleaning, this paper investigates solutions to the common problems in electronic data acquisition.
2. The principle of data cleaning
[Figure: dirty data → data cleaning (drawing on business knowledge, cleaning rules and cleaning algorithms; automatic cleaning or cleaning by user) → higher quality data]
Fig. 1 The Principle of Data Cleaning
Data cleaning is also called data cleansing or scrubbing. Simply stated, it means removing errors and inconsistencies from data, i.e. detecting and cleaning erroneous, incomplete and duplicated data by means of statistics, data mining or predefined cleaning rules, so as to improve data quality. The principle of data cleaning is illustrated in Figure 1 [2].
3. Data Cleaning Methods
Data cleaning covers many tasks. According to the practical needs of electronic data acquisition, this paper focuses on the cleaning of approximately duplicated records, the cleaning of incomplete data, and data standardization.
3.1 Approximately Duplicated Records Cleaning
3.1.1 The Principle of Approximately Duplicated Records Cleaning
To reduce redundant information in the acquired electronic data, it is important to clean approximately duplicated records. Approximately duplicated records are records that refer to the same real-world entity but cannot be recognized as identical by the database system because of differences in formatting or spelling. Figure 2 shows the principle of approximately duplicated records cleaning.
[Figure: dirty data with duplicate records → sorting the database → detecting approximately duplicate records / detecting record approximation, repeated M times → if the duplicate records satisfy a merge/purge rule, automatic merge/purge; otherwise merged/purged by the user → clean data]
Fig. 2 The Principle of Approximately Duplicated Records Cleaning
The process of approximately duplicated records cleaning can be described as follows:
Firstly, the data to be cleaned is input into the system. Then data cleaning is performed. The database-sorting module loads a sorting algorithm from the algorithm base and sorts the database. Once the database is sorted, the record-approximation detection module loads an approximation-detection algorithm from the algorithm base. Approximation detection is carried out within a neighboring range in order to 1) calculate the approximation between records and 2) determine whether the records are approximate duplicates. A single pass over the sorted database is not sufficient to detect all approximately duplicated records, so multiple rounds of sorting and comparison are adopted, each round using a different key, before all the detected approximately duplicated records are integrated. In this way the detection of approximately duplicated records is completed. Finally, each detected group of approximately duplicated records is processed according to the predefined merge/purge rules.
3.1.2 The Key Steps of Approximately Duplicated Records Cleaning
From Figure 2, the key steps of approximately duplicated records cleaning can be summarized as: sorting the database → detecting record approximation → merging/purging approximately duplicated records. Their functions are as follows:
(1) Sorting the database
To locate all the duplicated records in a data source, every possible pair of records would in principle have to be compared, which makes the detection of approximately duplicated records a costly operation. When the amount of acquired electronic data grows large, this brute-force approach becomes impractical. To decrease the number of record comparisons and increase the effectiveness of detection, the general approach is to compare records only within a limited range, i.e. to sort the database first and then compare records within a neighboring window.
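As a minimal sketch of this sorted-neighborhood style of comparison (the record representation, the key function and the window size w are illustrative assumptions, not the paper's implementation), the following Java code sorts the records by a key and compares only records falling within a fixed-size window:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

public class SortedNeighborhood {

    // Detect candidate duplicate pairs by sorting the records on a key and
    // comparing only records that fall within a sliding window of size w.
    public static List<int[]> detect(List<String[]> records,
                                     Function<String[], String> keyOf,
                                     int w,
                                     BiPredicate<String[], String[]> isApproxDuplicate) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing(keyOf));                 // 1) sort the database by the chosen key

        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {                 // 2) compare only neighboring records
            for (int j = i + 1; j < Math.min(i + w, sorted.size()); j++) {
                if (isApproxDuplicate.test(sorted.get(i), sorted.get(j))) {
                    pairs.add(new int[]{i, j});                   // candidate pair (positions in sorted order)
                }
            }
        }
        return pairs;
    }
}

Running this routine several times with different key functions and merging the resulting pairs corresponds to the multi-round sorting and comparison described in Section 3.1.1.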
(2) Records approximation detecting
Detecting the approximation between records is an essential step in approximately duplicated records cleaning: by measuring how close two records are, we can determine whether they are approximate duplicates.
(3) Approximately duplicated record purge/merge
Once the detection of approximately duplicated records is completed, the detected duplicates must be processed. For a group of approximately duplicated records, two methods can be applied:
Method 1: regard one record in the group as correct and the rest as redundant; the task is then to delete the redundant duplicated records from the database. In this case, the following rules can be used:
Manual rules
The manual rule means manually selecting the most accurate record of a group of approximately duplicated records to keep, and deleting all the rest from the database. This is the simplest approach.
Random rules
The random rule means randomly selecting any one record of a group of approximately duplicated records to keep, and deleting all the rest from the database.
The latest rules
In many cases the latest record best represents a group of approximately duplicated records: the more up-to-date the information is, the more accurate it tends to be. For example, the address of an account in daily use is more authoritative than that of a retired account. The latest rule therefore keeps the most recent record of the group and deletes the others.
Integrated rules
The integrated rule means keeping the most complete record of a group of approximately duplicated records and deleting the others (a sketch of this rule is given after Method 2 below).
Practical rules
The more often a piece of information is repeated, the more accurate it tends to be. For instance, if the phone numbers of two out of three supplier records are identical, the repeated number is most likely the accurate and reliable one. Accordingly, the practical rule keeps the record with the largest number of matching values and deletes the other duplicated records.
Method 2: regard each individual approximately duplicated record as carrying part of the information about the entity; the purpose is then to integrate the group of duplicated records into a single, more complete new record. This approach is generally carried out manually.
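To make the purge/merge step concrete, the following Java sketch implements the integrated rule from Method 1, assuming records are simple string arrays and taking "most complete" to mean "most non-empty fields" (both assumptions are for illustration only):

import java.util.Comparator;
import java.util.List;

public class MergePurge {

    // Count how many fields of a record actually carry a value.
    private static int filledFields(String[] record) {
        int count = 0;
        for (String field : record) {
            if (field != null && !field.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Integrated rule: from a group of approximately duplicated records,
    // keep the record with the most non-empty fields; the caller deletes the rest.
    public static String[] keepMostComplete(List<String[]> duplicateGroup) {
        return duplicateGroup.stream()
                .max(Comparator.comparingInt(MergePurge::filledFields))
                .orElseThrow(() -> new IllegalArgumentException("empty duplicate group"));
    }
}

The latest rule or the practical rule could be implemented in the same way by swapping the comparator, e.g. comparing a timestamp field or the number of values matched within the group.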
3.1.3 Improved Efficiency of Approximately Duplicated Records Detection
It is very important to complete data cleaning quickly, so improving the efficiency of approximately duplicated records detection is essential. From the previous analysis it is clear that the approximation test between records remains the main cost of approximately duplicated records detection. The key step is the approximation test between individual fields, whose efficiency determines the efficiency of the whole algorithm. Edit distance [3,4] is generally applied here. Since the complexity of computing the edit distance between strings of lengths m and n is O(m × n), detection time can become excessive for large data volumes unless unnecessary edit-distance computations are filtered out. To increase the efficiency of approximately duplicated records detection, the length-filtering technique can therefore be adopted to avoid unnecessary edit-distance computations. Length filtering is based on the following theorem:
Theorem 1. Given any two character strings x and y with lengths |x| and |y|, if their edit distance does not exceed k, then the difference between the lengths of the two strings cannot exceed k either, i.e.
| |x| - |y| | ≤ k.
Theorem 1 is called length filtering. The pseudocode of the approximation-detection algorithm optimized by length filtering is as follows [5]:
Input: two records R1 and R2; δ1, the approximation threshold for a pair of fields; δ2, the approximation threshold for the pair of records (the two thresholds determine whether the two records are approximate).
Output: True/False
Rdist = 0;
n = GetFieldNum(R1);
m = n;
For i = 1 to n
{
    If R1.Field[i] == NULL OR R2.Field[i] == NULL Then
        m = m - 1;
        Continue;
    End If;
    //********** Length-filtering method **********
    s_int = length(R1.Field[i]);
    t_int = length(R2.Field[i]);
    If abs(s_int - t_int) > δ1 Then
        Return False;
    Else
        Dist = d(R1.Field[i], R2.Field[i]);
    End If;
    //********** End of length-filtering method **********
    If Dist > δ1 Then
        Return False;
    Else
        Rdist = Rdist + Dist;
    End If;
}
Rdist = Rdist / m;
If Rdist < δ2 Then
    Return True;
Else
    Return False;
End If;
Here, function d(R1.Field[i],R2.Field[i]) is used to calculate the edit distance between fields
R1.Field[i] and R2.Field[i].
The study shows that the length-filtering method can effectively eliminate unnecessary edit-distance calculations, shorten detection time, and improve the efficiency of approximately duplicated records detection.
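The field-level comparison used in the pseudocode can be rendered in Java roughly as follows; this is a simplified sketch (the class and method names are assumptions), with the standard dynamic-programming edit distance playing the role of d(·,·) and the length filter of Theorem 1 applied before it:

public class LengthFilteredDistance {

    // Classic dynamic-programming (Levenshtein) edit distance, complexity O(m * n).
    static int editDistance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[s.length()][t.length()];
    }

    // Field-level approximation test with length filtering (Theorem 1): if the
    // lengths differ by more than the field threshold, the edit distance must
    // also exceed it, so the O(m * n) computation can be skipped entirely.
    static boolean fieldsApproximate(String s, String t, int fieldThreshold) {
        if (Math.abs(s.length() - t.length()) > fieldThreshold) {
            return false;                                         // filtered out without computing the distance
        }
        return editDistance(s, t) <= fieldThreshold;
    }
}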
3.2 The Cleaning of Incomplete Data
Because the values of some data attributes cannot be obtained during data collection, some records are incomplete. To meet the demands of audit analysis, it is necessary to clean the incomplete data in the data source. The principle is illustrated in Figure 3.
[Figure: dirty data with incomplete records → detecting incomplete data → if a record is incomplete, determine its usability → if usable, infer the missing attribute values; otherwise delete the record → clean data]
Fig. 3 The Principle of Incomplete Data Cleaning
The main steps of incomplete data cleaning are as follows:
(1) Detecting incomplete data
The first step is to detect the incomplete data so that it can be cleaned in the subsequent steps.
(2) Determining data usability
This is an important step in incomplete data cleaning. When many attribute values of a record are missing, it is not worthwhile to repair every such record. It is therefore essential to determine the usability of each record in order to decide how to handle its incompleteness. Determining usability means deciding whether a record should be kept or deleted, based on its degree of incompleteness and other factors.
The degree of incompleteness is evaluated first, by calculating the percentage of missing attribute values in the record; other factors are then considered, for instance whether the key information still exists among the remaining attribute values, before deciding whether to accept or reject the record. If an attribute of a record holds only a default value, that value is also treated as missing. The evaluation of data incompleteness is as follows:
Let R = {a1, a2, …, an}.
Here the n attributes of record R are denoted by a1, a2, …, an; m denotes the number of missing attribute values in record R (including fields that hold only default values); AMR denotes the percentage of missing attribute values in record R; and δ denotes the threshold on that percentage. If
AMR = m / n ≤ δ, with δ ∈ [0, 1],
then the record should be retained; otherwise the record should be discarded.
When cleaning incomplete data, the value of δ is determined by expert analysis of the specific data source and is stored in the system for use. The default attribute values are also defined in the rules base for calculating the value of m.
Besides the degree of incompleteness, the existence of key attributes also has to be taken into account. The key attributes are determined by domain experts according to their analysis of the concrete data source. A record should be retained even if AMR > δ, provided that its key attributes are present in the incomplete data.
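The following Java sketch puts the two criteria together, assuming records are string arrays and that the default values and key-attribute positions have been loaded from the rules base (all names here are illustrative, not part of the paper's system):

import java.util.Set;

public class UsabilityCheck {

    // Decide whether an incomplete record should be retained for repair or discarded.
    // 'defaults' lists the attribute values that count as missing (e.g. "N/A"),
    // 'keyIndexes' marks the key attributes chosen by the domain experts,
    // 'delta' is the threshold on the percentage of missing attribute values.
    public static boolean isUsable(String[] record, Set<String> defaults,
                                   Set<Integer> keyIndexes, double delta) {
        int n = record.length;
        int m = 0;                                                   // number of missing attribute values
        for (String value : record) {
            if (value == null || value.trim().isEmpty() || defaults.contains(value)) {
                m++;
            }
        }
        double amr = (double) m / n;                                 // AMR = m / n

        if (amr <= delta) {
            return true;                                             // incompleteness below the threshold: retain
        }
        for (int k : keyIndexes) {                                   // otherwise retain only if a key attribute survives
            String value = record[k];
            if (value != null && !value.trim().isEmpty() && !defaults.contains(value)) {
                return true;
            }
        }
        return false;                                                // too incomplete and no key attribute: discard
    }
}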
(3) Incomplete data processing
This means filling in the missing attribute values of the retained records by appropriate techniques after their usability has been determined. Some common methods are as follows [6]:
Manual method: this is often applied to important data, or to incomplete data when the amount is not large.
Constant substitute method: all missing values are filled in with the same constant, such as "Unknown" or "Missing Value". This method is simple but may distort later analysis, since every missing value receives the same filler.
Average substitute method: the average of an attribute is used to fill in all missing values of that attribute.
Regular substitute method: the value that occurs most often in an attribute is used to fill in all missing values of that attribute.
Estimated attribute method: this is the most complex but most scientific method. Algorithms such as regression or decision trees are used to predict the probable values of the missing attributes, and the defaults are then replaced with the predicted values.
The methods above are the usual approaches to handling missing attribute values in record processing; which one to adopt should be decided according to the specific data source (the average and regular substitute methods are sketched below).
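As a hedged illustration of the average and regular substitute methods (the column-oriented representation and the use of null for a missing value are assumptions made for this sketch), in Java:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class Imputation {

    // Average substitute method: replace missing (null) entries of a numeric
    // attribute column with the mean of the observed values.
    public static void fillWithAverage(Double[] column) {
        double sum = 0;
        int count = 0;
        for (Double v : column) {
            if (v != null) { sum += v; count++; }
        }
        if (count == 0) return;                                   // nothing observed, nothing to fill
        double mean = sum / count;
        for (int i = 0; i < column.length; i++) {
            if (column[i] == null) column[i] = mean;
        }
    }

    // Regular substitute method: replace missing (null) entries of a categorical
    // attribute column with the value that occurs most often.
    public static void fillWithMostFrequent(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column) {
            if (v != null) counts.merge(v, 1, Integer::sum);
        }
        if (counts.isEmpty()) return;
        String mode = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        for (int i = 0; i < column.length; i++) {
            if (column[i] == null) column[i] = mode;
        }
    }
}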
3.3 Data Standardization
Since the data acquired during data acquisition may come in various formats, it is essential to standardize the different formats into a unified format to facilitate audit analysis. Two methods can be used to achieve standardization:
1) For dates and similar types of data, the system's internal functions are used (a sketch of date standardization is given at the end of this subsection).
2) IF-THEN rules are used to standardize field values.
e.g.
IF Ri(Gender) = 'F' THEN
    Ri(Gender) = 0
ELSE
    Ri(Gender) = 1
END IF
Here, Ri(Gender) denotes the value of the field "Gender" in record Ri. In this way, different representations such as "F/M" or "0/1" can be converted into the unified "0/1" form.
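For the first kind of standardization, a minimal Java sketch (the source formats and the class name are illustrative assumptions; the paper itself relies on the system's internal functions) could normalize date values from several source formats into one unified format:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DateStandardizer {

    // Candidate input formats; the actual set depends on the acquired data sources.
    private static final DateTimeFormatter[] INPUT_FORMATS = {
            DateTimeFormatter.ofPattern("yyyy-MM-dd"),
            DateTimeFormatter.ofPattern("yyyy/MM/dd"),
            DateTimeFormatter.ofPattern("dd.MM.yyyy")
    };

    // Unified target format for all dates after standardization.
    private static final DateTimeFormatter TARGET = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Try each known format in turn and rewrite the date in the unified format;
    // values that match none of the formats are returned unchanged for manual inspection.
    public static String standardize(String raw) {
        for (DateTimeFormatter format : INPUT_FORMATS) {
            try {
                return LocalDate.parse(raw.trim(), format).format(TARGET);
            } catch (DateTimeParseException ignored) {
                // not this format, try the next one
            }
        }
        return raw;
    }
}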
4. A Practical Example
Based on the methods in Section 3, JBuilder 10 can be adopted to implement these cleaning methods. The windows of the subsystems for approximately duplicated records cleaning and incomplete data cleaning are shown in Figures 4 and 5.
Taking data acquired from an ERP (Enterprise Resource Planning) system as an example, the key steps of approximately duplicated records cleaning and incomplete data cleaning are as follows:
(1) The cleaning of approximately duplicated records
First of all, the value of each parameter should be determined. Based on the analysis of the "Customer Information" data table, the field threshold δ1 and the record threshold δ2 are set accordingly.
Fig. 4 The Interface of Approximately Duplicated Records Cleaning
Then the approximately duplicated records detection process is run, as shown in the window in Figure 4.
Finally, the approximately duplicated records detected in "Customer Information" are removed according to the integrated rule, i.e. the most complete record of each group of approximately duplicated records is kept and the others are deleted.
Thus, we can complete the approximately duplicated records cleaning effectively.
(2) The cleaning of incomplete data
The value of each parameter should first be determined and defined in the rules base. Analysis of the "Clients" database suggests setting δ to 0.5 and making "Customer Name" the key field.
Then the incomplete data detection process is run, as shown in the window in Figure 5.
Finally, the incomplete data are detected. The incomplete records found in the "Clients" database are handled manually, as the data are important and the amount is not large.
Thus, the incomplete data cleaning is completed effectively.
Fig. 5 The Interface of Incomplete Data Cleaning
5. Conclusion
As auditing becomes more challenging and auditees more varied, it is worthwhile to study how to apply advanced data cleaning techniques in practice, so that the collected electronic data can meet the increasing demands of audit analysis. Besides structured data, semi-structured XML (Extensible Markup Language) data may also appear in the future, which likewise deserves attention.
References:
[1] Computer Technology Center of the National Audit Office of China. Overall Design Report on Data Acquisition and Processing Techniques for Computer Audit [R], 2004 (in Chinese)
[2] Lee M L, Ling T W, Low W L. IntelliClean: a knowledge-based intelligent data cleaner [A]. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining [C]. Boston: ACM Press, 2000: 290-294
[3] Chen Wei, Ding Qiu-lin. Edit distance application in data cleaning and realization with Java [J]. Computer and Information Technology, 2003, 11(6): 33-35
[4] Bunke H, Jiang X Y, Abegglen K, et al. On the weighted mean of a pair of strings [J]. Pattern Analysis & Applications, 2002, 5(5): 23-30
[5] Chen Wei, Ding Qiu-lin, Xie Qiang. Interactive data migration system and its approximately-detecting efficiency optimization [J]. Journal of South China University of Technology (Natural Science Edition), 2004, 22(2): 148-153
[6] Batista G E A P A, Monard M C. An analysis of four missing data treatment methods for supervised learning [J]. Applied Artificial Intelligence, 2003, 17(5-6): 519-533