Data Mining As A Financial Auditing Tool

M.Sc. Thesis in Accounting
Swedish School of Economics and Business Administration
2002
The Swedish School of Economics and Business Administration
Department:
Accounting
Type of Document: Thesis
Title:
Data Mining As A Financial Auditing Tool
Author:
Supatcharee Sirikulvadhana
Abstract
In recent years, the volume and complexity of accounting transactions in major
organizations have increased dramatically.
To audit such organizations, auditors frequently must deal with voluminous data with rather complicated data structures. Consequently, auditors can no longer rely only on reporting or summarizing tools in the audit process. Rather, additional tools, such as data mining techniques that can automatically extract information from large amounts of data, may be very useful.
Although adopting data mining techniques in the audit process is a relatively new field, data mining has been shown to be cost effective in many business applications related to auditing, such as fraud detection, forensic accounting and security evaluation.
The objective of this thesis is to determine if data mining tools can directly
improve audit performance. The selected test area was the sample selection step of the
test of control process.
The research data was based on accounting transactions provided by AVH PricewaterhouseCoopers Oy. Various samples were extracted from the test data set using data mining software and generalized audit software, and the results were evaluated. IBM’s DB2 Intelligent Miner for Data Version 6 was selected to represent the data mining software, and ACL for Windows Workbook Version 5 was chosen as the generalized audit software.
Based on the results of the test and the opinions solicited from experienced
auditors, the conclusion is that, within the scope of this research, the results of data
mining software are more interesting than the results of generalized audit software.
However, there is no evidence that the data mining technique brings out material matters or presents a significant enhancement over the generalized audit software. Further
study in a different audit area or with a more complete data set might yield a different
conclusion.
Search Words: Data Mining, Artificial Intelligence, Auditing, Computer Assisted Audit Tools, Generalized Audit Software
Table of Contents
1. Introduction  1
1.1. Background  1
1.2. Research Objective  2
1.3. Thesis Structure  2
2. Auditing  4
2.1. Objective and Structure  4
2.2. What Is Auditing?  4
2.3. Audit Engagement Processes  5
2.3.1. Client Acceptance or Client Continuance  5
2.3.2. Planning  6
2.3.2.1. Team Mobilization  6
2.3.2.2. Client’s Information Gathering  7
2.3.2.3. Risk Assessment  7
2.3.2.4. Audit Program Preparation  9
2.3.3. Execution and Documentation  10
2.3.4. Completion  11
2.4. Audit Approaches  12
2.4.1. Tests of Controls  12
2.4.2. Substantive Tests  13
2.4.2.1. Analytical Procedures  13
2.4.2.2. Detailed Tests of Transactions  13
2.4.2.3. Detailed Tests of Balances  14
2.5. Summary  14
3. Computer Assisted Auditing Tools  17
3.1. Objective and Structure  17
3.2. Why Computer Assisted Auditing Tools?  17
3.3. Generalized Audit Software  18
3.4. Other Computerized Tools and Techniques  22
3.5. Summary  23
4. Data Mining  24
4.1. Objective and Structure  24
4.2. What Is Data Mining?  24
4.3. Data Mining Process  25
4.3.1. Business Understanding  26
4.3.2. Data Understanding  27
4.3.3. Data Preparation  27
4.3.4. Modeling  27
4.3.5. Evaluation  28
4.3.6. Deployment  28
4.4. Data Mining Tools and Techniques  29
4.4.1. Database Algorithms  29
4.4.2. Statistical Algorithms  30
4.4.3. Artificial Intelligence  30
4.4.4. Visualization  30
4.5. Methods of Data Mining Algorithms  32
4.5.1. Data Description  32
4.5.2. Dependency Analysis  33
4.5.3. Classification and Prediction  33
4.5.4. Cluster Analysis  34
4.5.5. Outlier Analysis  34
4.5.6. Evolution Analysis  35
4.6. Examples of Data Mining Algorithms  36
4.6.1. Apriori Algorithms  36
4.6.2. Decision Trees  37
4.6.3. Neural Networks  39
4.7. Summary  40
5. Integration of Data Mining and Auditing  43
5.1. Objective and Structure  43
5.2. Why Integrate Data Mining with Auditing?  43
5.3. Comparison between Currently Used Generalized Auditing Software and Data Mining Packages  44
5.3.1. Characteristics of Generalized Audit Software  45
5.3.2. Characteristics of Data Mining Packages  46
5.4. Possible Areas of Integration  48
5.5. Examples of Tests  58
5.6. Summary  66
6. Research Methodology  68
6.1. Objective and Structure  68
6.2. Research Period  68
6.3. Data Available  68
6.4. Research Methods  69
6.5. Software Selection  70
6.5.1. Data Mining Software  70
6.5.2. Generalized Audit Software  71
6.6. Analysis Methods  71
6.7. Summary  72
7. The Research  73
7.1. Objective and Structure  73
7.2. Hypothesis  73
7.3. Research Processes  73
7.3.1. Business Understanding  73
7.3.2. Data Understanding  74
7.3.3. Data Preparation  75
7.3.3.1. Data Transformation  75
7.3.3.2. Attribute Selection  76
7.3.3.3. Choice of Tests  80
7.3.4. Software Deployment  82
7.3.4.1. IBM’s DB2 Intelligent Miner for Data  82
7.3.4.2. ACL  91
7.4. Result Interpretations  94
7.4.1. IBM’s DB2 Intelligent Miner for Data  94
7.4.2. ACL  95
7.5. Summary  99
8. Conclusion  101
8.1. Objective and Structure  101
8.2. Research Perspective  101
8.3. Implications of the Results  102
8.4. Restrictions and Constraints  103
8.4.1. Data Limitation  103
8.4.1.1. Incomplete Data  103
8.4.1.2. Missing Information  103
8.4.1.3. Limited Understanding  104
8.4.2. Limited Knowledge of Software Packages  104
8.4.3. Time Constraint  105
8.5. Suggestions for Further Research  105
8.6. Summary  105
List of Figures
List of Tables
References
a) Books and Journals
b) Web Pages
Appendix A: List of Columns of Data Available  109
Appendix B: Results of IBM’s Intelligent Miner for Data
a) Preliminary Neural Clustering (with Six Attributes)
b) Demographic Clustering: First Run
c) Demographic Clustering: Second Run
d) Neural Clustering: First Run
e) Neural Clustering: Second Run
f) Neural Clustering: Third Run
g) Tree Classification: First Run
h) Tree Classification: Second Run
i) Tree Classification: Third Run
Appendix C: Sample Selection Result of ACL
1. Introduction
1.1. Background
Auditing is a relatively old field, and auditors are frequently viewed as stuffy, fussy people. That is no longer true. In recent years, auditors have recognized the dramatic increase in the transaction volume and complexity of their clients’ accounting and non-accounting records. Consequently, computerized tools such as general-purpose and generalized audit software (GAS) have increasingly been used to supplement the traditional manual audit process.
The emergence of enterprise resource planning (ERP) systems, with the concept of integrating all operating functions together in order to increase the profitability of the organization as a whole, means that the accounting system is no longer a simple debit-and-credit system. Instead, it is the central register of all operating activities. Though it can be debated what is, and what is not, an accounting transaction, the system nonetheless contains valuable information. It is the auditors’ responsibility to audit a sufficient number of the transactions recorded in the client’s databases in order to gain enough evidence on which an audit opinion may be based and to ensure that no risk is left unaddressed.
The amount and complexity of accounting transactions have also increased tremendously due to the innovations of electronic commerce, online payment and other high-technology devices. Electronic records have become more common; consequently, auditing online data is increasingly challenging, let alone accessing it manually. Although complicated accounting transactions can now be presented in a more comprehensible format using today’s improved generalized audit software (GAS), they still require auditors to make assumptions, perform analyses and interpret the results.
The GAS and other computerized tools currently in use only allow auditors to examine a company’s data in certain predefined formats by running various query commands, not to extract information from that data, especially when such information is unknown and hidden. Auditors need something more than presentation tools to enhance their investigation of facts or, simply, material matters.
On the other hand, data mining techniques have improved with the advancement of database technology. In the past two decades, databases have become commonplace in business. However, a database by itself does not directly benefit the company; in order to reap its benefits, the abundance of data has to be turned into useful information. Thus, data mining tools that facilitate data extraction and data analysis have received greater attention.
There seems to be an opportunity for auditing and data mining to converge. Auditing needs a means to uncover unusual transaction patterns, and data mining can fulfill that need. This thesis attempts to explore the opportunities of using data mining as a tool to improve audit performance. The effectiveness of various data mining tools in reaching that goal will also be evaluated.
1.2. Research Objective
The research objective of this thesis is to make a preliminary evaluation of the usefulness of data mining techniques in supporting auditing by applying selected techniques to the available data sets. However, it is worth noting that it remains questionable whether results obtained from the available data sets can be generalized.
Given the data available, the focus of this research is the sample selection step of the test of controls process. The relationship patterns discovered by data mining techniques will be used as a basis for sample selection, and the samples selected will be compared with the samples drawn by generalized audit software.
1.3. Thesis Structure
The remainder of this thesis is structured as follows:
Chapter 2 is a brief introduction to auditing. It introduces some essential
auditing terms as a basic background. The audit objectives, audit engagement processes
and audit approaches are also described here.
Chapter 3 discusses some computer assisted auditing tools and techniques currently used to assist auditors in their audit work. The main focus will be on generalized audit software (GAS), particularly Audit Command Language (ACL), the most popular such software in recent years.
Chapter 4 provides an introduction to data mining. The data mining process, tools and techniques are reviewed. The discussion also explores the concept, methods and appropriate techniques of each type of data mining pattern in greater detail. Additionally, some examples of the most frequently used data mining algorithms are demonstrated.
Chapter 5 explores several areas where data mining techniques may be utilized to support auditors’ work. It also compares GAS packages and data mining packages from the auditing profession’s perspective. The characteristics of these techniques and their roles as substitutes for manual processes are also briefly discussed. For each of those areas, the audit steps, potential mining methods and required data sets are identified.
Chapter 6 describes the selected research methodology, the reasons for
selection, and relevant material to be used. The research method and the analysis
technique of the results are identified as well.
Chapter 7 illustrates the actual study. The hypothesis, relevant facts of the
research processes and the study results are presented. Finally, the interpretation of
study results will be attempted.
Finally, chapter 8 provides a summary of the entire study. The assumptions,
restrictions and constraints of the research will be reviewed, followed by suggestions for
further research.
2. Auditing
2.1. Objective and Structure
The objective of this chapter is to introduce the background information on
auditing. In section 2.2, definitions of essential terms as well as the main objectives and tasks of the auditing profession are covered. Four principal audit procedures are discussed in section 2.3. Audit approaches, including tests of controls and substantive tests, are discussed in greater detail in section 2.4. Finally, section 2.5 provides a brief summary of the chapter.
Note that most of the content covered in this chapter is based on the notable textbook “Auditing: An Integrated Approach” (Arens & Loebbecke, 2000) and my own experience.
2.2. What Is Auditing?
Auditing is the accumulation and evaluation of evidence about information to
determine and report on the degree of correspondence between the information and
established criteria (Arens & Loebbecke, 2000, 16). Normally, independent auditors,
also known as certified public accountants (CPAs), conduct audit work to ascertain
whether the overall financial statements of a company are, in all material respects, in
conformity with the generally accepted accounting principles (GAAP). Financial
statements include Balance Sheets, Profit and Loss Statements, Statements of Cash
Flow and Statements of Retained Earnings. Generally speaking, what auditors do is to
apply relevant audit procedures, in accordance with GAAP, in the examination of the
underlying records of a business, in order to provide a basis for issuing a report as an
attestation of that company’s financial statements. Such written report is called auditor’s
opinion or auditor’s report.
The auditor’s report expresses the opinion of an independent expert regarding the degree of reliability of the information presented in the financial statements. In other words, the auditor’s report assures the users of the financial statements, who are normally external parties such as shareholders, investors, creditors and financial institutions, of the reliability of the financial statements, which are prepared by the management of the company.
Due to time and cost constraints, auditors cannot examine every detailed record behind the financial statements. The concepts of materiality and fairly stated financial statements were introduced to solve this problem. Materiality is the magnitude of an omission or misstatement of information that misleads the financial statement users.
The materiality standard applied to each account balance varies and depends on the auditors’ judgement. It is the responsibility of the auditors to ensure that all material misstatements are indicated in the auditors’ opinion.
In business practice, it is more common to find an auditor working as a member of the staff of an auditing firm. Generally, several CPAs join together to practice as partners of an auditing firm, offering auditing and other related services, including reviews, to interested parties. The partners normally hire professional staff and form an audit team to assist them in the audit engagement. In this thesis, auditors, auditing firm and audit team are used synonymously.
2.3. Audit Engagement Processes
The audit engagement processes of each auditing firm may be different.
However, they generally involve the four major steps: client acceptance or client
continuance, planning, execution and documentation, and completion.
2.3.1. Client Acceptance or Client Continuance
Client acceptance, or client continuance in case of a continued
engagement, is a process through which the auditing firm decides whether or not the
firm should be engaged by this client. Major considerations are:
- Assessment of engagement risks: Each client presents a different level of risk to the firm. The important risks that an auditing firm must evaluate carefully before accepting an audit client include accepting a company with a bad reputation or questionable ethics that is involved in illegal business activities or material misrepresentation of business and accounting records. Some auditing firms have basic requirements for favorable clients. On the other hand, some have a list of criteria to identify the unfavorable ones. Unfavorable clients, for example, operate in dubious businesses or have too complex a financial structure.
- Relationship conflicts: Independence is a key requirement of the audit profession; of equal importance are the auditor’s objectivity and integrity. These factors help to ensure a quality audit and to earn people’s trust in the audit report.
- Requirements of the clients: The requirements include, for example, the qualification of the auditor, time constraints, extra reports and the estimated budget.
- Sufficient competent personnel available.
- Cost-benefit analysis: The potential costs of the engagement are compared with the audit fee offered by the client. The major portion of the cost of an audit engagement is the professional staff charge.
If the client is accepted, a written confirmation, generally on an annual
basis, of the terms of engagement is established between the client and the firm.
2.3.2. Planning
The objective of the planning step is to develop an audit plan. It includes
team mobilization, client’s information gathering, risk assessment and audit program
preparation.
2.3.2.1. Team Mobilization
This step is to form the engagement team and to communicate
among team members. First, key team members have to be identified. Team members
include engagement partner or partners who will sign the audit report, staff auditors
who will conduct most of the necessary audit work and any specialists that are deemed
necessary for the engagement. The mobilization meeting, or pre-planning meeting, should be conducted to communicate all engagement matters, including client requirements and deliverables, level of involvement, tentative roles and responsibilities of each team member and other relevant matters. The meeting should also cover the determination of the most efficient and effective process of information gathering.
In the case of client continuance, the prior year audit should be reviewed to identify opportunities for improving efficiency or effectiveness.
2.3.2.2. Client’s Information Gathering
In order to perform this step, the most important thing is the
cooperation between the client and the audit team. A meeting is arranged to update the
client’s needs and expectations as well as management’s perception of their business
and the control environment.
Next, the audit team members need to perform the preliminary
analytical procedures which could involve the following tasks:
- Obtaining background information: This includes the understanding of the client’s business and industry, the business objectives, legal obligations and related risks.
- Understanding system structures: System structures include the
system and computer environments, operating procedures and the controls embedded in
those procedures.
- Control assessment: Based upon information about controls
identified from the meeting with the client and the understanding of system structures
and processes, all internal controls are updated, assessed and documented. The subjects
include control environment, general computerized (or system) controls, monitoring
controls and application controls.
More details about internal control, such as
definitions, nature, purpose and means of achieving effective internal control, can be
found in “Internal Control – Integrated Framework” (COSO, 1992).
Audit team members’ knowledge, expertise and experience are considered the most valuable tools in performing this step.
2.3.2.3. Risk Assessment
Risk, in this case, is some level of uncertainty in performing audit
work. Risks identified in the first two steps are gathered and assessed. The level of
risks assessed in this step directly leads to the audit strategy to be used. In short, the level of testing is based on the level of risk. Therefore, the auditor must be careful not to understate or overstate the level of these risks.
The level of risk differs from one auditing area to another. In planning the extent of audit evidence for each auditing area, auditors primarily use an audit risk model such as the one shown below:

Planned Detection Risk = Acceptable Audit Risk / (Inherent Risk × Control Risk)
- Planned detection risk: Planned detection risk is the highest level of misstatement risk that the audit evidence may fail to detect in each audit area. The auditors need to accumulate audit evidence until the level of misstatement risk is reduced to the planned detection risk level. For example, if the planned detection risk is 0.05, then audit testing needs to be expanded until the audit evidence obtained supports the assessment that only a five percent misstatement risk remains.
- Acceptable audit risk: Audit risk is the probability that the auditor will unintentionally render an inappropriate opinion on the client’s financial statements. Acceptable audit risk, therefore, is a measure of how willing the auditor is to accept that the financial statements may be materially misstated after the audit is completed (Arens & Loebbecke, 2000, 261).
- Inherent risk: Inherent risk is the probability that there are
material misstatements in financial statements. There are many risk factors that affect
inherent risk including errors, fraud, business risk, industry risk, and change risk. The
first two are preventable and detectable but others are not. Auditors have to ensure that
all risks are taken into account when considering the probability of inherent risk.
- Control risk: Control risk is the probability that the client’s control system cannot prevent or detect errors. Normally, after defining the inherent risks, controls that are able to detect or prevent such risks are identified. Then, the auditors assess whether the client’s system has such controls and, if it does, how much they can rely on those controls. The more reliable the controls, the lower the control risk. In other words, control risk represents the auditor’s reliance on the client’s control structure.
It is the responsibility of the auditors to ensure that no risk factors
of each audit area are left unaddressed and the evidence obtained is sufficient to reduce
all risks to an acceptable audit risk level. More information about audit risk can be
found in Statement of Auditing Standard (SAS) No. 47: Audit Risk and Materiality in
Conducting an Audit (AICPA, 1983).
2.3.2.4. Audit Program Preparation
The purpose of this step is to determine the most appropriate audit
strategy and tasks for each audit objective within each audit area based on client’s
background information about related audit risks and controls identified from the
previous steps.
Firstly, the audit objectives, both transaction-related and balance-related, of each audit area have to be identified. These two types of objectives share one thing in common: they must be met before auditors can conclude that the information presented in the financial statements is fairly stated. The difference is that while transaction-related audit objectives ensure the correctness of the total transactions of any given class, balance-related audit objectives ensure the correctness of any given account balance. A primary purpose of the audit strategy and tasks is to ensure that those objectives are materially met. Such objectives include the following.
Transaction-Related and Balance-Related Audit Objectives
- Existence or occurrence: To ensure that all balances in the
balance sheet have really existed and the transactions in the
income statement have really occurred.
- Completeness: To ensure that all balances and transactions are
included in the financial statements.
- Accuracy: To ensure that the balances and transactions are
recorded accurately.
- Classification: To ensure that all transactions are classified in
the suitable categories.
- Cut-off (timing): To ensure that the transactions are recorded in
the proper period.
Other Balance-Related Audit Objectives
- Valuation: To ensure that the balances and transactions are
stated at the appropriate value.
- Right and obligation: To ensure that the assets belong to, and the liabilities are obligations of, the company.
- Presentation and disclosure: To ensure that the presentation of
the financial statements does not mislead the users and the
disclosures are enough for users to understand the financial
statements clearly.
After addressing the audit objectives, it is time to develop an overall audit plan. The audit plan should cover the audit strategy for each area and all details related to the engagement, including the client’s needs and expectations, reporting requirements and timetable. Then, planning at the detail level has to be performed. This detailed plan is known as a tailored audit program. It should cover task identification and schedule, types of tests to be used, materiality thresholds, acceptable audit risk and the persons responsible. Notice that the related risks and controls of each area are taken into account when prescribing the audit strategy and tasks.
The finalized general plan should be communicated to the client in order to agree upon significant matters such as deliverables and timetable. Both the overall audit plan and the detailed audit programs need to be communicated to the team as well.
2.3.3. Execution and Documentation
In short, this step is to perform the audit examinations by following the
audit program. It includes audit tests execution, which will be described in more detail
in the next subsection, and documentation. Documentation includes summarizing the
results of audit tests, level of satisfaction, matters found during the tests and
recommendations. If specialists are involved, the processes performed and the outcomes have to be documented as well.
Communication is considered the most important skill in performing this step. It is crucial to communicate not only with the client and the staff working for the client, but also among the team. Normally, it is the responsibility of the more senior auditors to coach the less senior ones. Techniques used are briefing, coaching, discussing and reviewing.
A meeting with client in order to discuss the issues found during the
execution process and the recommendations of those findings can be arranged either
formally or informally. It is a good idea to inform and resolve those issues with the
responsible client personnel such as the accounting manager before the completion step
and leave only the critical matters to the top management.
2.3.4. Completion
This step is similar to the final step of every other kind of project. The results of the aforementioned steps are summarized, recorded, assessed and reported. Normally, the assistant auditors report their work results to the senior, or in-charge, auditors. The auditor-in-charge should perform the final review to ensure that all necessary tasks have been performed and that the audit evidence gathered for each audit area is sufficient. Also, the critical matters remaining from the execution process have to be resolved. Those matters may be resolved either by the client’s management (adjusting the financial statements or adequately disclosing the matters in them) or by the auditors (disclosing them in the auditor’s opinion).
The last piece of field work for auditors is the review of subsequent events. Subsequent events are events that occur after the balance sheet date but before the auditor’s report date and that require recognition in the financial statements.
Based on the accumulated audit evidence and audit findings, the auditor’s opinion can be issued. The types of auditor’s opinion are unqualified, unqualified with explanatory paragraph or modified wording, qualified, adverse and disclaimer.
After everything is done, it is time to arrange the clearance meeting with the client. Generally, auditors are required to report the results and all conditions to the audit committee or senior management. Although not required, auditors often make suggestions to management for improving business performance through the Management Letter. In turn, auditors can get feedback from the client on how well their needs and expectations were met.
Auditors should also consider evaluating their own performance in order to improve their efficiency and effectiveness. The evaluation includes summarizing the client’s comments, top-down evaluation (more senior auditors evaluate the work of assistant auditors) and bottom-up evaluation (feedback from field work auditors).
2.4. Audit Approaches
In order to determine whether financial statements are fairly stated, auditors have to perform audit tests to obtain competent evidence. The audit approaches used in each audit area, as well as the level of testing, depend on the auditor’s professional judgement. Generally, audit approaches fall into one of two categories:
2.4.1. Tests of Controls
There are nearly as many sets of control objectives as there are textbooks about system security nowadays. Generally, however, control objectives can be categorized into four broad categories -- validity, completeness, accuracy and restricted access. With these objectives in mind, auditors can distinguish control activities from normal operating ones.
When assessing controls during the planning phase, auditors are able to identify the level of control reliance -- the extent to which controls help reduce risks. The effectiveness of such controls during the period can be assessed by performing tests of controls. However, only key controls will be tested, and the level of testing depends solely on the control reliance level. The higher the control reliance, the more tests are performed.
The scope of tests should be sufficiently thorough to allow the auditor to draw a conclusion as to whether controls have operated effectively, in a consistent manner and by the properly authorized persons. In other words, the level of testing should be adequate to provide assurance of the relevant control objectives. The assurance evidence can be obtained from observation, inquiry, inspection of supporting documents, re-performance or a combination of these.
2.4.2. Substantive Tests
Substantive testing is an approach designed to test for monetary misstatements or irregularities directly affecting the correctness of the financial statement balances. Normally, the level of testing depends on the level of assurance obtained from the tests of controls. When the tests of controls cannot be relied upon, either because there is no or low control reliance or because the amount and extensiveness of the evidence obtained is not sufficient, substantive tests are performed. Substantive tests include analytical procedures, detailed tests of transactions and detailed tests of balances. Details of each test are as follows:
2.4.2.1. Analytical Procedures
The objective of this approach is to ensure that overall audit results,
account balances or other data presented in the financial statements are stated
reasonably. Statement of Auditing Standard (SAS) No. 56 also requires auditors to use
analytical procedures during planning and final reporting phases of audit engagement
(AICPA, 1988).
Analytical procedures can be performed in many different ways. Generally, the most accepted one is to develop an expectation for each account balance and an acceptable variation, or threshold. Then, the expectation is compared with the actual figure. Further investigation is required only when the difference between the actual and expected balances falls outside the prescribed acceptable variation range. Further investigation includes extending the analytical procedures, detailed examination of supporting documents, conducting additional inquiries and performing other substantive tests.
Notice that the reliability of the data, the predictive method and the size of the balance or transactions can strongly affect the reliability of the assurance. Moreover, this type of test requires significant professional judgement and experience.
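The expectation-versus-actual comparison described above can be sketched in a few lines. The account names, expected balances and the ten percent tolerance below are illustrative assumptions, not figures from the research data:

```python
# A minimal sketch of the analytical procedure described above: compare each
# account's actual balance against an expectation, and flag differences that
# fall outside the acceptable variation range.

def accounts_needing_investigation(balances, expectations, tolerance=0.10):
    """Return the accounts whose actual balance deviates from the expected
    balance by more than the given relative tolerance."""
    flagged = []
    for account, actual in balances.items():
        expected = expectations[account]
        if abs(actual - expected) > tolerance * abs(expected):
            flagged.append(account)
    return flagged

# Illustrative balances: only Sales deviates by more than 10% from expectation.
balances = {"Sales": 118_000, "Rent expense": 12_100, "Payroll": 95_000}
expectations = {"Sales": 100_000, "Rent expense": 12_000, "Payroll": 93_000}
print(accounts_needing_investigation(balances, expectations))  # ['Sales']
```

In practice the expectation and tolerance would come from the auditor's model and materiality judgement, not fixed constants.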
2.4.2.2. Detailed Tests of Transactions
The purpose of detailed tests of transactions (also known as substantive tests of transactions) is to ensure that the transaction-related audit objectives are met for each accounting transaction. Confidence in the transactions leads to confidence in the account totals in the general ledger. Testing techniques include the examination of relevant documents and re-performance.
The extent of testing remains a matter of professional judgement. It
can vary from a limited sample to all transactions, depending on the level of assurance
that auditors want to obtain. Generally, samples are drawn from items with particular
characteristics, selected randomly, or a combination of both. Examples of such particular
characteristics are size (materiality consideration) and unusualness (risk consideration).
This approach is time-consuming. Therefore, it is a good idea to
reduce the sample size by considering whether analytical procedures or tests of
controls can be performed to obtain assurance in relation to the items not tested.
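The combined sampling approach described above, drawing items with particular characteristics and topping up with a random selection, can be sketched as follows. This is an illustrative sketch only; the field names, the materiality cut-off and the use of a fixed random seed are assumptions, not taken from the thesis data.

```python
# Hypothetical sketch: draw all transactions above a materiality cut-off
# (characteristic-based selection), then add a random sample from the
# remaining items. A fixed seed keeps the selection reproducible, which
# mirrors the audit-trail requirement discussed later in the thesis.
import random

def select_samples(transactions, materiality, random_n, seed=0):
    material = [t for t in transactions if abs(t["amount"]) >= materiality]
    remainder = [t for t in transactions if abs(t["amount"]) < materiality]
    rng = random.Random(seed)  # reproducible random component
    return material + rng.sample(remainder, min(random_n, len(remainder)))

transactions = [{"id": i, "amount": a}
                for i, a in enumerate([120, 95_000, 340, 7_800, 61_000, 15])]
sample = select_samples(transactions, materiality=50_000, random_n=2)
print(sorted(t["id"] for t in sample))
```

Transactions 1 and 4 are always selected on size grounds; two further items are drawn at random from the rest.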
2.4.2.3. Detailed Tests of Balances
Detailed tests of balances (also called substantive tests of balances)
focus on the ending balances of each general ledger account. They are performed
after the balance sheet date to gather sufficient competent evidence as a reasonable basis
for expressing an opinion on the fair presentation of the financial statements (Rezaee,
Elam & Sharbatoghlie, 2001, 155). The extent of testing depends on the results of the
tests of controls, analytical procedures and detailed tests of transactions relating to each
account. Like detailed tests of transactions, the sample size can vary and remains a
matter of professional judgement.
Techniques applied in this kind of test include account
reconciliation, third party confirmation, observation of the items comprising an account
balance and agreement of account details to supporting documents.
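The account reconciliation technique mentioned above, agreeing each general-ledger ending balance to the sum of its supporting detail items, can be sketched as follows. The account names and amounts are invented for illustration.

```python
# Minimal reconciliation sketch: compare each general-ledger balance with
# the total of its supporting detail items and report any difference
# (balance minus detail total) for follow-up.

def reconcile(gl_balances, subledger_details):
    differences = {}
    for account, balance in gl_balances.items():
        detail_total = sum(subledger_details.get(account, []))
        if detail_total != balance:
            differences[account] = balance - detail_total
    return differences

gl_balances = {"Receivables": 45_000, "Payables": 30_000}
subledger_details = {
    "Receivables": [20_000, 15_000, 10_000],   # agrees with the ledger
    "Payables":    [18_000, 11_500],           # 500 short of the ledger
}
print(reconcile(gl_balances, subledger_details))  # {'Payables': 500}
```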
2.5. Summary
Auditing is the accumulation and evaluation of evidence about information to
determine and report on the degree of correspondence between the information and
established criteria. As seen in figure 2.1, the main audit engagement processes are
client acceptance, planning, execution and completion.
[Figure 2.1 is a flowchart; its contents are summarized below in outline form.]

Client Acceptance
- Gather information
- Evaluate client
- Mobilize

Planning
- Gather information in detail
- Perform preliminary analytical procedures
- Assess risk and control
- Set materiality

Execution & Documentation
- Develop audit plan and detailed audit program
- Perform tests of controls: identify controls, assess control reliance,
  select samples, test controls, further investigate unusual items,
  evaluate results
- Depending on control reliance (high or low), perform substantive tests:
  - Detailed tests (of transactions and of balances): select samples, test
    samples, further investigate unusual items, evaluate results
  - Analytical review: develop expectations, compare expectations with
    actual figures, further investigate major differences, evaluate results
- Document testing results
- Gather audit evidence and audit findings

Completion
- Review subsequent events
- Evaluate overall results
- Issue auditor’s report
- Arrange clearance meeting with client
- Evaluate team performance

Figure 2.1: Summary of audit engagement processes
Planning includes mobilization, information gathering, risk assessment and
audit program preparation. Two basic types of audit approaches the auditors can use
during execution phase are tests of controls and substantive tests. Substantive tests
include analytical procedures, detailed tests of transactions and detailed tests of
balances. The extent of testing is based on the professional judgement of the auditors.
However, materiality, control reliance and risks are also major concerns.
The final output of audit work is the auditor's report. The type of audit report --
unqualified, unqualified with explanatory paragraph or modified wording, qualified,
adverse or disclaimer -- depends on the combination of evidence obtained from the
fieldwork and the audit findings.
At the end of each working period, the accumulated evidence and performance
evaluation should be reviewed to assess scope for improving efficiency or effectiveness
for the next auditing period.
It is accepted that the auditing business is not a profitable area for auditing
firms. Instead, value-added services, also known as assurance services, such as
consulting and legal services, are more profitable. The reason is that while the costs of
all services are relatively similar, clients are willing to pay only a limited amount for
auditing services compared to other services. However, auditing has to be trustworthy
and standardized, and all the above-mentioned auditing tasks are, more or less,
time-consuming and require professional staff involvement. Thus, the main cost of an
auditing engagement is the salaries of professional staff, and it is considerably high.
This cost pressure is a major problem the auditing profession is facing nowadays.
To improve the profitability of the auditing business, efficient utilization of
professional staff seems to be the only practical method. The question is how. Some
computerized tools and techniques have been introduced into the auditing profession in
order to assist and enhance auditing tasks. However, the level of automation is still
questionable. As long as these tools still require professional staff involvement,
auditing costs remain unavoidably high.
3. Current Auditing Computerized Tools
3.1. Objective and Structure
The objective of this chapter is to provide information about the technological
tools and techniques currently used by auditors. Section 3.2 discusses why computer
assisted auditing tools (CAATs) are indispensable to the auditing profession at present.
In section 3.3, generalized audit software (GAS) is reviewed in detail. The discussion
focuses on the most popular package, Audit Command Language (ACL). Other
computerized tools and techniques are briefly identified in section 3.4. Finally, a brief
summary of some currently used CAATs is provided in section 3.5.
Before proceeding, it is worth noting that this chapter is mainly based on two
textbooks and one journal article: “Accounting Information Systems” (Bodnar &
Hopwood, 2001), “Core Concepts of Accounting Information Systems” (Moscove,
Simkin & Bagranoff, 2000) and “Audit Tools” (Needleman, 2001).
3.2. Why Computer Assisted Auditing Tools?
It is accepted that advances in technology have affected the audit process.
With ever increasing system complexity, especially in computer-based accounting
information systems such as enterprise resource planning (ERP) systems, and the vast
number of transactions, it is impractical for auditors to conduct the overall audit
manually. It is even less feasible in an e-commerce-intensive environment, because all
the accounting data auditors need to access are computerized.
Over the past ten years, auditors have frequently outsourced technical
assistance in some auditing areas to information system (IS) auditors, also called
electronic data processing (EDP) auditors. However, now that computer-based
accounting information systems have become commonplace, such technical skills are
even more important. The rate of growth of the information system practices within the
big audit firms (known as “the Big Five”) was estimated at between 40 and 100 percent
between 1990 and 2005 (Bagranoff & Vendrzyk, 2000, 35).
Nowadays, the term “auditing with the computer” is extensively used. It
describes the employment of technologies by auditors to perform some audit work
that would otherwise be done manually or outsourced. Such technologies are commonly
referred to as computer assisted auditing tools (CAATs), and they now play an
important role in audit work.
In auditing with the computer, auditors employ CAATs alongside other
auditing techniques to perform their work. As the name suggests, a CAAT is a tool to
assist auditors in performing their work faster, better, and at lower cost. As CAATs
become more common, this technical skill is as important to the auditing profession as
auditing knowledge, experience and professional judgement.
A variety of software packages are available to assist auditors. Some are
general-purpose software and some are specially designed packages customized to
support the entire audit engagement process. Many auditors consider simple general
ledger software, automated working paper software or even spreadsheets to be audit
software. In this thesis, however, the term audit software refers to software that allows
auditors to perform the overall auditing process, generally known as generalized
audit software.
3.3. Generalized Audit Software
Generalized audit software (GAS) is an automated package originally
developed in-house by professional auditing firms. It assists auditors in performing
necessary tasks during most audit procedures, but mostly in the execution and
documentation phase.
Basic features of a GAS are data manipulation (including importing, querying
and sorting), mathematical computation, cross-footing, stratifying, summarizing and file
merging. It also supports extracting data according to specifications, statistical sampling
for detailed tests, generating confirmations, identifying exceptions and unusual
transactions and generating reports. In short, GAS provides auditors with the ability to
access, manipulate, manage, analyze and report data in a variety of formats.
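The stratification and summarization features mentioned above can be sketched as follows: transactions are grouped into amount bands (strata), and a count and total are produced per band. The band boundaries and amounts are illustrative assumptions.

```python
# Hypothetical sketch of a GAS "stratify" feature: assign each amount to
# the first stratum whose upper bound it does not exceed, and accumulate
# a [count, total] summary per stratum.
from collections import defaultdict

def stratify(amounts, bounds):
    """Group amounts into strata defined by ascending upper bounds;
    return {upper_bound: [count, total]} (inf for the open top band)."""
    strata = defaultdict(lambda: [0, 0])
    for a in amounts:
        band = next((b for b in bounds if a <= b), float("inf"))
        strata[band][0] += 1
        strata[band][1] += a
    return dict(strata)

amounts = [120, 950, 4_300, 75, 12_000, 800]
print(stratify(amounts, bounds=[1_000, 10_000]))
# {1000: [4, 1945], 10000: [1, 4300], inf: [1, 12000]}
```

A real GAS would combine such a summary with materiality thresholds to decide where detailed testing effort should be concentrated.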
Some packages also provide more specialized features such as risk assessment,
continuous monitoring of high-risk transactions and unusual items, fraud detection, key
performance indicator tracking and standardized audit program generation. With
standardized audit programs, these packages help users adopt some of the
profession's best practices.
Nowadays, most auditing firms have either developed their own GAS or
purchased a commercially available one. Among the commercial packages, the most
popular is Audit Command Language (ACL). ACL is widely accepted as the leading
software for data access, analysis and reporting. Some large auditing firms even allow
their in-house GAS systems to interface with ACL for data extraction and analysis.
Figure 3.1: ACL software screenshot (version 5.0 Workbook)
ACL software (figure 3.1) is developed by ACL Services Ltd. (www.acl.com).
It allows auditors to connect their personal laptops to the client’s system and download
the client’s data for further processing. It is capable of working on large data sets,
which makes testing at hundred-percent coverage possible. Moreover, it provides
a comprehensive audit trail by allowing auditors to view their files, steps and results at
any time. ACL's popularity results from its convenience, its flexibility and its
reliability. Table 3.1 illustrates the features of ACL and how they are used in each
step of the audit process.
Audit Processes and ACL Features

Planning
- Risk assessment: “Statistics” menu; “Evaluation” menu

Execution and Documentation

Tests of Controls
- Sample selection: “Sampling” menu, with the ability to specify sampling
  size and selection criteria; “Filter” menu
- Controls testing: “Analyze” menu, including Count, Total, Statistics, Age,
  Duplicate, Verify and Search; expression builder
- Results evaluation: “Evaluation” menu

Analytical Review
- Expectations development: “Statistics” menu
- Comparison of expected versus actual figures: “Merge” command; “Analyze”
  menu, including Statistics, Age, Verify and Search; expression builder
- Results evaluation: “Evaluation” menu

Detailed Tests
- Sample selection: “Sampling” menu, with the ability to specify sampling
  size and selection criteria; “Filter” menu
- Sample testing: “Analyze” menu, including Count, Total, Statistics, Age,
  Duplicate, Verify and Search; expression builder
- Results evaluation: “Evaluation” menu
- Documentation: document notes; automatic command log; file history

Completion
- Lessons learned record: “Document Notes” menu; “Reports” menu

Other Possibilities
- Fraud detection: “Analyze” menu, including Count, Total, Statistics, Age,
  Duplicate, Verify and Search; expression builder; “Filter” menu

Table 3.1: ACL features used in assisting each step of the audit process
With ACL’s capacity and speed, auditors can shorten the audit cycle while
conducting a more thorough investigation. Three beneficial features make ACL a
promising tool for auditors. First, its interactive capability allows auditors to test,
investigate, analyze and obtain results in a timely manner. Second, its audit trail
capability records the history of the files, the commands used by auditors and the
results of those commands. This includes command log files that are, in a way,
considered a record of the work done. Finally, its reporting capability produces various
kinds of reports, both predefined and customized.
However, there are some shortcomings. The most critical one is that, like
other GAS packages, ACL is not able to deal with files that have complex data
structures. Although ACL’s Open Data Base Connectivity (ODBC) interface was
introduced to reduce this problem, some intricate files still require flattening, which
presents control and security problems.
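The "flattening" just mentioned can be illustrated as turning a nested record, such as an invoice with line items, into flat rows that a GAS can process. The record layout here is an assumption made purely for illustration.

```python
# Hypothetical sketch of flattening a nested file: each invoice header is
# repeated once per line item, producing a flat row-per-line table of the
# kind a GAS such as ACL can work on directly.

def flatten_invoices(invoices):
    rows = []
    for inv in invoices:
        for line in inv["lines"]:
            rows.append({"invoice": inv["number"],
                         "customer": inv["customer"],
                         "item": line["item"],
                         "amount": line["amount"]})
    return rows

invoices = [{"number": "INV-1", "customer": "Acme",
             "lines": [{"item": "A", "amount": 100},
                       {"item": "B", "amount": 250}]}]
for row in flatten_invoices(invoices):
    print(row)
```

The control and security concern follows naturally: the flattened copy is a second version of the data that must itself be safeguarded and reconciled to the source.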
3.4. Other Computerized Tools and Techniques
As mentioned above, there are many other computerized tools other than audit
software that are capable of assisting some part of the audit processes. Those tools
include the following:
- Planning tools: project management software, personal information
manager and audit best practice database, etc.
- Analysis tools: database management software and artificial intelligence.
- Calculation tools: spreadsheet software, database management software
and automated working paper software, etc.
- Sample selection tools: spreadsheet software.
- Data manipulation tools: database management software.
- Document preparation tools: word processing software and automated
working paper software.
Instead of using these tools as substitutes for GAS, auditors can incorporate
some of them with GAS to improve the efficiency of the audit process. Planning
tools are a good example.
Together with the computerized tools, computerized auditing techniques that
used to be performed by EDP auditors have now become part of the financial auditor’s
repertoire. At a minimum, financial auditors are required to understand which
techniques to use, how to apply them, and how to interpret the results to support their
audit findings.
Such techniques should be employed appropriately to accomplish the audit
objectives. Some examples are as follows:
- Test data: test how the system detects invalid data,
- Integrated test facility: observe how fictitious transactions are processed,
- Parallel simulation: simulate the original transactions and compare the
results,
- System testing: test controls of the client’s accounting system, and
- Continuous auditing: embed an audit program into the client’s system.
3.5. Summary
Nowadays, technology affects the way auditors perform their work. To
conduct an audit, auditors can no longer rely solely on traditional auditing
techniques. Instead, they have to combine such knowledge and experience with
technical skills. In short, the boundary between the financial auditor and the
information system auditor has become blurred. Therefore, it is important for
auditors to keep pace with technological developments so that they can decide which
tools and techniques to use and how to use them effectively.
Computer assisted auditing tools (CAATs) are used to complement manual
audit procedures. There are many CAATs available on the market. The challenge for
auditors is to choose the most appropriate ones for their work. Both generalized
audit software (GAS), which integrates the overall audit functions, and other similar
software are available to support their work. However, GAS packages tend to be more
widely used due to their low cost, high capability and high reliability.
4. Data Mining
4.1. Objective and Structure
The objective of this chapter is to describe the basic concepts of data mining.
Section 4.2 provides some background on data mining and explains its basic elements.
Section 4.3 describes the data mining process in greater detail. Data mining tools and
techniques are discussed in section 4.4, and methods of data mining algorithms in
section 4.5. Examples of the most frequently used data mining algorithms are provided
in section 4.6. Finally, section 4.7 provides a brief summary.
Note that the main content of this chapter is based on “CRISP-DM 1.0
Step-by-Step Data Mining Guide” (CRISP-DM, 2000), “Data Mining: Concepts and
Techniques” (Han & Kamber, 2000) and “Principles of Data Mining” (Hand, Mannila
& Smyth, 2001).
4.2. What Is Data Mining?
Data mining is a set of computer-assisted techniques designed to automatically
mine large volumes of integrated data for new, hidden or unexpected information or
patterns. Data mining is sometimes known as knowledge discovery in databases
(KDD).
In recent years, database technology has advanced in great strides. Vast
amounts of data have been stored in databases, and business people have realized the
wealth of information hidden in those data sets. Data mining has thus become a focus
of attention, as it promises to turn raw data into valuable information that businesses
can use to increase their profitability.
Data mining can be used in different kinds of databases (e.g. relational
database, transactional database, object-oriented database and data warehouse) or other
kinds of information repositories (e.g. spatial database, time-series database, text or
multimedia database, legacy database and the World Wide Web) (Han, 2000, 33).
Therefore, data to be mined can be numerical data, textual data or even graphics and
audio.
The capability to deal with voluminous data sets does not mean that data
mining requires a huge amount of data as input. In fact, the quality of the data to be
mined is more important. Aside from being a good representative of the whole
population, the data sets should contain the least possible amount of noise -- errors that
might affect the mining results.
Many data mining goals have been recognized; these goals may be grouped
into two categories -- verification and discovery. Both categories share one thing in
common: the final products of the mining process are discovered patterns that may be
used to predict future trends.
In the verification category, data mining is used to confirm or disprove
identified hypotheses or to explain observed events or conditions. The limitation,
however, is that such hypotheses, events or conditions are restricted by the knowledge
and understanding of the analyst. This category is also called the top-down approach.
The other category, discovery, is also known as the bottom-up approach. This
approach is simply the automated exploration of hitherto unknown patterns. Since data
mining is not limited by the inadequacy of the human brain and does not require a
stated objective, unanticipated patterns might be recognized. However, analysts are still
required to interpret the mining results to determine whether they are interesting.
In recent years, data mining has been studied extensively, especially for
supporting customer relationship management (CRM) and fraud detection. Moreover,
many other areas have begun to realize the usefulness of data mining, including
biomedicine, DNA analysis, the financial industry and e-commerce. However, there are
also criticisms of data mining's shortcomings, such as its complexity, the technical
expertise required, the low degree of automation, the lack of user-friendliness, the lack
of flexibility and presentation limitations. Data mining software developers are now
trying to address these criticisms by adopting an interactive development approach. It
is expected that, with advances in this new approach, data mining will continue to
improve and attract more attention from other application areas as well.
4.3. Data Mining Process
According to CRISP-DM, a consortium that attempted to standardize data
mining process, data mining methodology is described in terms of a hierarchical process
that includes four levels as shown in Figure 4.1. The first level is data mining phases,
or processes describing how to deploy data mining to solve business problems. Each
phase consists of several generic tasks, which together are intended to cover all possible
data mining situations. The next level contains specialized tasks, that is, actions to be
taken to carry out the generic tasks in specific situations. To make these unambiguous,
the generic tasks of the second level have to be enumerated in greater detail. The
questions of how, when, where and by
whom have to be answered in order to develop a detailed execution plan. Finally, the
fourth level, process instances, is a record of the actions, decisions and results of an
actual data mining engagement or, in short, the final output of each phase.
[Figure 4.1 depicts the four levels -- Processes/Phases, Generic Tasks, Specialized
Tasks and Process Instances -- as a hierarchy.]
Figure 4.1: Four level breakdown of the CRISP-DM data mining methodology
(CRISP-DM, 2000, 9)
The top level, the data mining process, consists of six phases: business
understanding, data understanding, data preparation, modeling, evaluation and
deployment. Details of each phase are described as follows.
4.3.1. Business Understanding
The first step is to map business issues to data mining problems.
Generic tasks of this step include business objective determination, situation
assessment, data mining feasibility evaluation and project plan preparation. At the end
of the phase, a project plan is produced as a guideline for the whole project. The plan
should include the business background, business objectives and deliverables, data
mining goals and requirements, available and required resources and capabilities,
identified assumptions and constraints, as well as an assessment of risks and
contingencies.
This project plan should be dynamic: at the end of each phase, or at each
prescribed review point, the plan should be reviewed and updated in order to keep up
with the state of the project.
4.3.2. Data Understanding
The objective of this phase is to gain insight into the data set to be
mined. It includes capturing and understanding the data. The nature of data should be
reviewed in order to identify appropriate techniques to be used and the expected
patterns.
Generic tasks of this phase include data organization, data collection,
data description, data analysis, data exploration and data quality verification. At the end
of the phase, the results of all above-mentioned tasks have to be reported.
4.3.3. Data Preparation
As mentioned above, one of the major concerns in using data mining
techniques is the quality of the data. The objective of this phase is to ensure that the
data sets are ready to be mined. The process includes data selection (deciding which
data are relevant), data cleaning (removing all, or most, incompleteness, noise and
inconsistency), data scrubbing (correcting or removing inaccurate or malformed
records), data integration (combining data from multiple sources into a standardized
format), data transformation (converting the standardized data into a ready-to-be-mined
format) and data reduction (removing redundancies and merging data into an
aggregated format).
The end products of this phase are the prepared data sets and reports
describing the whole process. The characteristics of the data sets may differ from those
prescribed; therefore, the project plan has to be reviewed.
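The preparation steps described above can be sketched in miniature: selection drops unusable records, cleaning removes noisy ones, and transformation brings the remaining values onto a common scale. The field names, the noise rule and the min-max scaling are assumptions chosen purely for illustration.

```python
# Hypothetical data-preparation pipeline sketch covering three of the
# steps named in the text: selection, cleaning and transformation.

def prepare(records):
    # Selection: keep only records with a usable "amount" field.
    selected = [r for r in records if r.get("amount") is not None]
    # Cleaning: treat negative amounts as noise and drop them.
    cleaned = [r for r in selected if r["amount"] >= 0]
    # Transformation: min-max scale the amounts to [0, 1].
    amounts = [r["amount"] for r in cleaned]
    lo, hi = min(amounts), max(amounts)
    for r in cleaned:
        r["scaled"] = (r["amount"] - lo) / (hi - lo)
    return cleaned

raw = [{"amount": 100}, {"amount": None}, {"amount": -5}, {"amount": 300}]
print([r["scaled"] for r in prepare(raw)])  # [0.0, 1.0]
```

In a real engagement each step would, as the text notes, be documented in a preparation report rather than applied silently.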
4.3.4. Modeling
Though the terms “model” and “pattern” are often used interchangeably,
there are differences between them. A model is a global summary of a data set that
can describe the population from which the data were drawn, while a pattern describes
a structure relating to a relatively small, local part of the data (Hand, Mannila & Smyth,
2001, 165). Put simply, a model can be viewed as a set of patterns.
In this phase, a set of data mining techniques is applied to the
preprocessed data set. The objective is to build the model that most satisfactorily
describes the global data set. Steps include data mining technique selection, model
design, model construction, model testing, model validation and model assessment.
Notice that, typically, several techniques can be applied in parallel to the
same data mining problem. The modeling can either focus on the most promising
technique or use many techniques simultaneously. The latter approach, however,
requires cross-validation capabilities and evaluation criteria.
4.3.5. Evaluation
After applying the data mining techniques to the data sets in a model, the
results of the model are interpreted. This does not mean, however, that the data mining
engagement is over once the results are obtained. The results have to be evaluated in
conjunction with the business objectives and context. If the results are satisfactory, the
engagement can move on to the next phase. Otherwise, another iteration, or a move
back to a previous phase, is required. The expertise of analysts is needed in this
phase.
Besides the results of the model, some evaluation criteria should be
taken into account. Such criteria include the benefits the business would gain from the
model, the accuracy and speed of the model, the actual costs, the degree of automation
and scalability.
Generic tasks of this phase include evaluating the mining results, reviewing
the processes and determining the next steps. At the end of the phase, the satisfactory
model is approved and a list of further actions is identified.
4.3.6. Deployment
Data mining results are deployed into the business process in this phase.
The phase begins with deployment plan preparation. In addition, a plan for monitoring
and maintenance has to be developed. Finally, the success of the data mining
engagement should be evaluated, including areas to be improved and explored.
Another important point is that the possibility of failure has to be
accepted. No matter how well the model is designed and tested, it is just a model built
from a set of sample data. Therefore, the ability to adapt to business change and prompt
management decisions to correct the model are required. Moreover, the performance of
the model needs to be evaluated on a regular basis.
The sequence of these phases is not rigid, so moving back and forth between
phases is allowed, and relationships can exist between any of the phases. At each
review point, the next step -- which can be either forward or backward -- has to be
specified.
The lessons learned during and at the end of each phase should be documented
as a guideline for the next phase. In addition, all phases, as well as the results of
deployment, should be documented for the next engagement. Details should include
the results of each phase, matters arising, problem solving options and the methods
selected.
Besides the CRISP-DM guideline, there are other textbooks dedicated to
integrating data mining into business problems. For the sake of simplicity, I will not
go into more detail than mentioned above. More information may be found in
“Building Data Mining Applications for CRM” (Berson, Smith & Kurt, 2000) and
“Data Mining Cookbook” (Rud, 2001).
4.4. Data Mining Tools and Techniques
Data mining has developed from many fields, including database technology,
artificial intelligence, traditional statistics, high-performance computing, computer
graphics and data visualization. Hence, there is an abundance of data mining tools and
techniques available. These tools and techniques can, however, be classified into four
broad categories: database algorithms, statistical algorithms, artificial intelligence and
visualization. Details of each category are as follows:
4.4.1. Database algorithms
Although data mining does not require a large volume of data as input, it is
more practical to deploy data mining techniques on large data sets, since data mining is
most useful for information that human brains cannot capture. It can therefore be said
that the objective of data mining is to mine databases for useful information.
Thus, many database algorithms can be employed to assist the mining
process, especially in the data understanding and preparation phases. Examples of such
algorithms are data generalization, data normalization, missing data detection and
correction, data aggregation, data transformation, attribute-oriented induction, and
fractal and online analytical processing (OLAP).
4.4.2. Statistical algorithms
The distinction between statistics and data mining is blurred, as almost all
data mining techniques are derived from the field of statistics. This means statistics can
be used in almost all data mining processes, including data selection, problem solving,
result presentation and result evaluation.
Statistical techniques that can be deployed in data mining processes
include the mean, median, variance, standard deviation, probability, confidence
intervals, correlation coefficients, non-linear regression, chi-square tests, Bayes'
theorem and Fourier transforms.
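Several of the building blocks just listed are available in Python's standard library. The sketch below computes a mean, a sample standard deviation and a rough 95% confidence interval using the normal approximation; the figures are invented for illustration.

```python
# Illustrative use of basic statistical techniques from the list above.
import statistics

data = [102, 98, 105, 97, 101, 99, 103]
mean = statistics.mean(data)
stdev = statistics.stdev(data)                  # sample standard deviation
half_width = 1.96 * stdev / (len(data) ** 0.5)  # 95% CI, normal approximation
interval = (mean - half_width, mean + half_width)

print(round(mean, 2), round(stdev, 2))  # 100.71 2.87
```

For a sample this small a t-distribution would give a wider, more honest interval; the normal multiplier is used here only to keep the sketch short.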
4.4.3. Artificial Intelligence
Artificial intelligence (AI) is the scientific field that seeks ways to produce
intelligent behavior in a machine. It can be said that artificial intelligence
techniques are the most widely used in the mining process. Some statisticians even
think of a data mining tool as an artificial statistical intelligence. The capability to
learn is the benefit of artificial intelligence most appreciated in the data mining field.
Artificial intelligence techniques used in data mining processes include
neural network, pattern recognition, rule discovery, machine learning, case-based
reasoning, intelligent agents, decision tree induction, fuzzy logic, genetic algorithm,
brute force algorithm and expert system.
4.4.4. Visualization
Visualization techniques are commonly used to present multidimensional
data sets in various formats for analysis purposes. They can be viewed as higher-level
presentation techniques that allow users to explore complex multi-dimensional
data in a simpler way. Generally, they require human effort to analyze
and assess the results from their interactive displays. Techniques include audio, tabular,
scatter-plot matrices, clustered and stacked chart, 3-D charts, hierarchical projection,
graph-based techniques and dynamic presentation.
Separating data mining from data warehousing, online analytical processing
(OLAP) or statistics is intricate. One thing is certain: data mining is not any of
them. The difference between data warehousing and data mining is quite clear. Though
some data warehousing textbooks devote a few pages to the topic of data mining,
this does not mean that they regard data mining as a part of data warehousing.
Instead, they all agree that while a data warehouse is a place to store data, data mining
is a tool to distil the value of such data. Examples of those textbooks are “Data
Management” (McFadden, Hoffer & Prescott, 1999) and “Database Systems: A
Practical Approach to Design, Implementation, and Management” (Connolly, Begg &
Strachan, 1999).
One might argue that the value of data could be realized by using OLAP, as
claimed in many data warehousing textbooks. OLAP, however, can be thought of as
another presentation tool that reformats and recompiles the same set of data in order to
help users find such value more easily. It requires human intervention both in stating
the presentation requirements and in interpreting the results. Data mining, on the other
hand, uses automated techniques to do those jobs.
As mentioned above, differentiating data mining from statistics is
much more complicated. It is accepted that the algorithms underlying data mining tools
and techniques are, more or less, derived from statistics. In general, however, statistical
tools are not designed to deal with enormous amounts of data, whereas data mining
tools are. Moreover, the target users of statistical tools are statisticians, while data
mining is designed for business people. This simply means that data mining tools are
enhancements of statistical tools that blend many statistical algorithms together and
possess the capability of handling more data in an automated manner, as well as a
user-friendly interface.
The choice of an appropriate technique and its timing depend on the nature of the data to be analyzed, the size of the data sets and the type of method to be mined. A range of techniques can be applied to the problems, either alone or in combination. However, when deploying a sophisticated blend of data mining techniques, there are at least two requirements that need to be met -- the ability to cross-validate results and the measurement criteria.
4.5. Methods of Data Mining Algorithms
Though nowadays data mining software packages are claimed to be more automated, they still require some direction from users. The expected method of the data mining algorithm is one of those requirements. Therefore, in employing data mining tools, users should have a basic knowledge of these methods. The types of data mining methods can be categorized differently. In general, however, they fall into six broad categories: data description, dependency analysis, classification and prediction, cluster analysis, outlier analysis and evolution analysis. Details of each method are as follows:
4.5.1. Data Description
The objective of data description is to provide an overall description of
data, either in itself or in each class or concept, typically in summarized, concise and
precise form. There are two main approaches in obtaining data description -- data
characterization and data discrimination. Data characterization summarizes the general characteristics of the data, while data discrimination, also called data comparison, compares the characteristics of data between contrasting groups or classes. Normally, these two approaches are used in an aggregated manner.
Though data description is one among many types of data mining algorithm methods, usually it is not the real target of the search. Often the data description is the analyst's first requirement, as it helps to gain insight into the nature of the data and to find potential hypotheses, or the last one, in order to present data mining results. An example of using data description as a presentation tool is describing the characteristics of each cluster that could not be identified by a neural network algorithm.
Appropriate data mining techniques for this method are attribute-oriented
induction, data generalization and aggregation, relevance analysis, distance analysis,
rule induction and conceptual clustering.
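As a small illustration (a Python sketch over invented numeric records; nothing here comes from the thesis data), data characterization can be as simple as summarizing each class by per-attribute means, and data discrimination as comparing the summaries of two contrasting classes:

```python
def characterize(records):
    """Crude data characterization: the per-attribute mean of a class."""
    n = len(records)
    return [sum(r[i] for r in records) / n for i in range(len(records[0]))]

# Two contrasting classes of (amount, frequency) records; data
# discrimination compares their characterizations side by side.
class_a = [(10.0, 1.0), (12.0, 3.0)]
class_b = [(100.0, 9.0), (104.0, 11.0)]
profile_a = characterize(class_a)
profile_b = characterize(class_b)
```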
4.5.2. Dependency Analysis
The purpose of dependency analysis, also called association analysis, is to search for the most significant relationships across a large number of variables or attributes. Sometimes, association is viewed as one type of dependency where affinities of data items are described (e.g., describing data items or events that frequently occur together or in sequence).
This type of method is very common in the marketing research field. The most prevalent example is market-basket analysis. It analyzes which products customers tend to buy together and presents the findings as "[support, confidence]" association rules. The support measurement states the percentage of events occurring together compared to the whole population. The confidence measurement states the percentage of occurrences of the following event compared to the leading one. For example, the association rule in figure 4.2 means milk and bread were bought together in 6% of all transactions under analysis, and 75% of customers who bought milk also bought bread.
Milk => bread  [support = 6%, confidence = 75%]
Figure 4.2: Example of association rule
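The two measurements can be computed directly. The sketch below (Python, with a hypothetical basket list that is not from the thesis data) derives support and confidence for a rule of the form "antecedent => consequent":

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support: share of transactions containing both item sets.
    Confidence: share of antecedent transactions that also contain
    the consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / n, (both / ante if ante else 0.0)

baskets = [
    {"milk", "bread"}, {"milk"}, {"bread", "eggs"},
    {"milk", "bread", "eggs"}, {"eggs"}, {"bread"},
]
support, confidence = rule_metrics(baskets, {"milk"}, {"bread"})
```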
Some techniques for dependency analysis are nonlinear regression, rule induction, statistical sampling, data normalization, the Apriori algorithm, Bayesian networks and data visualization.
4.5.3. Classification and Prediction
Classification is the process of finding models, also known as classifiers,
or functions that map records into one of several discrete prescribed classes. It is
mostly used for predictive purposes.
Typically, the model construction begins with two types of data sets -- training and testing. The training data sets, with prescribed class labels, are fed into the model so that the model is able to find parameters or characteristics that distinguish one class from the others. This step is called the learning process. Then, the testing data sets, without pre-classified labels, are fed into the model. The model will, ideally, automatically assign the precise class labels to those testing items. If the results of testing are unsatisfactory, then more training iterations are required. On the other hand, if the results are satisfactory, the model can be used to predict the classes of target items whose class labels are unknown.
This method is most effective when the underlying reasons for labeling are subtle. The advantage of this method is that the pre-classified labels can be used as a performance measurement of the model, giving the model developer confidence in how well the model performs.
Appropriate techniques include neural network, relevance analysis,
discriminant analysis, rule induction, decision tree, case-based reasoning, genetic
algorithms, linear and non-linear regression, and Bayesian classification.
4.5.4. Cluster Analysis
Cluster analysis addresses segmentation problems. The objective of this
analysis is to separate data with similar characteristics from the dissimilar ones. The
difference between clustering and classification is that while clustering does not require
pre-identified class labels, classification does. That is why classification is also called
supervised learning while clustering is called unsupervised learning.
As mentioned above, sometimes it is more convenient to analyze data in aggregated form and to break it down into details as needed. For data management purposes, cluster analysis is frequently the first task of the mining process. Then, the most interesting cluster can be selected for further investigation. Besides, description techniques may be integrated in order to identify the characteristics that provide the best clustering.
Examples of appropriate techniques for cluster analysis are neural
networks, data partitioning, discriminant analysis and data visualization.
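The segmentation idea can be sketched with a one-dimensional k-means over invented transaction amounts (a Python sketch; real data mining packages use far more elaborate clustering algorithms):

```python
def kmeans_1d(values, k=2, iters=20):
    """Minimal 1-D k-means: assign each value to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    # Spread the initial centroids across the sorted values
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Invented amounts: two obvious groups of small and large transactions
amounts = [1.0, 2.0, 3.0, 100.0, 101.0, 102.0]
centers = kmeans_1d(amounts, k=2)
```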
4.5.5. Outlier Analysis
Some data items that are distinctly dissimilar to the others, or outliers, can be viewed as noise or errors, which ordinarily need to be removed before feeding data sets into a data mining model. However, such noise can be useful in cases where unusual items or exceptions are the major concern. Examples are fraud detection, unusual usage patterns and remarkable response patterns.
The challenge is to distinguish the outliers from the errors. During the data understanding phase, data cleaning and scrubbing are required. This step includes finding erroneous data and trying to fix it. Thus, the possibility of detecting interesting differentiation might be diminished. On the other hand, if the incorrect data remained in the data sets, the accuracy of the model would be compromised.
Appropriate techniques for outlier analysis include data cube,
discriminant analysis, rule induction, deviation analysis and non-linear regression.
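One elementary form of deviation analysis is flagging values that lie far from the mean. The sketch below (Python, with invented posting amounts) illustrates the idea:

```python
def zscore_outliers(values, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from
    the mean -- a simple deviation-analysis view of outlier detection."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# Twenty routine postings and one suspiciously large one (invented data)
postings = [10.0] * 20 + [500.0]
suspicious = zscore_outliers(postings)
```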
4.5.6. Evolution Analysis
This method is the newest one. Evolution analysis was created to support a promising capability of data warehouses: the collection of data or events over a period of time. Now that business people have come to realize the value of capturing trends in the time-related data in the data warehouse, this method is attracting increasing attention.
The objective of evolution analysis is to determine the most significant changes in data sets over time. In other words, it is the other types of algorithm methods (i.e., data description, dependency analysis, classification or clustering) plus time-related and sequence-related characteristics. Therefore, the tools and techniques available for this type of method include all the tools and techniques of the other types, as well as time-related and sequential data analysis tools.
Examples of evolution analysis are sequential pattern discovery and time-dependent analysis. Sequential pattern discovery detects patterns between events such that the presence of one set of items is followed by another (Connolly, 1999, 965). Time-dependent analysis determines the relationship between events that correlate within a definite period of time.
Different types of methods can be mined in parallel to discover hidden or
unexpected patterns, but not all patterns found are interesting. A pattern is interesting if
it is easily understood, valid, potentially useful and novel (Han & Kamber, 2000, 27).
Therefore, analysts are still needed in order to evaluate whether the mining results are
interesting.
To distinguish interesting patterns, users of data mining tools have to solve at least three problems. First, the correctness of the patterns has to be measured. For example, the measurement for dependency analysis is the "[support, confidence]" value. For methods that have historical or training data sets, such as classification and prediction, it is easier to compare the correctness of the patterns with the real ones. For methods where training data sets are not available, the professional judgement of the users of the data mining tools is required.
Second, an optimization model for the patterns found has to be created. For example, the significance of confidence versus support has to be formulated. In simpler terms, how does one tell which is better: higher confidence with lower support, or lower confidence with higher support?
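There is no single right answer to this. One purely illustrative approach (the weighting scheme and weights below are invented for the example, not taken from the thesis) is to rank rules by a weighted score of the two measurements:

```python
def rank_rules(rules, w_conf=0.7, w_sup=0.3):
    """Rank association rules by a weighted score of confidence and
    support. The weights are illustrative only; choosing them is
    exactly the optimization problem described in the text."""
    return sorted(rules,
                  key=lambda r: w_conf * r["confidence"] + w_sup * r["support"],
                  reverse=True)

rules = [
    {"rule": "milk => bread", "support": 0.06, "confidence": 0.75},
    {"rule": "pen => paper", "support": 0.20, "confidence": 0.40},
]
best = rank_rules(rules)[0]["rule"]
```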
Finally, the right point to stop searching for patterns has to be specified. This is probably the most challenging problem. It leads to two further problems -- how to tell whether the current optimized pattern is the most satisfactory one, and how to know whether it can be used as a generalized pattern on other data sets. In short, while trying to optimize the patterns, the over-fitting problem has to be taken into account as well.
4.6. Examples of Data Mining Algorithms
As mentioned above, there are plenty of algorithms used to mine data. Due to limited space, this section focuses on the most frequently used and widely recognized algorithms that can indisputably be thought of as data mining algorithms -- neither purely statistical nor database algorithms. The examples include Apriori algorithms, decision trees and neural networks. Details of each algorithm are as follows:
4.6.1. Apriori Algorithms
The Apriori algorithm is the one most frequently used in the dependency analysis method. It attempts to discover frequent item sets using candidate generation for Boolean association rules. A Boolean association rule is a rule that concerns associations between the presence or absence of items (Han & Kamber, 2000, 229).
The steps of the Apriori algorithm are as follows:
(a) The analysis data is first partitioned according to the item sets.
(b) The support count of each item set (1-itemsets), also called a candidate, is computed.
(c) The item sets that cannot satisfy the required minimum support count are pruned, creating the frequent 1-itemsets (a list of item sets that have at least the minimum support count).
(d) Item sets are joined together (2-itemsets) to create the second-level candidates.
(e) The support count of each candidate is accumulated.
(f) After pruning unsatisfactory item sets according to the minimum support count, the frequent 2-itemsets are created.
(g) Steps (d), (e) and (f) are iterated until no more frequent k-itemsets can be found -- in other words, until the next frequent k-itemsets is empty.
(h) At the terminal level, the candidate with the maximum support count wins.
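The steps above can be sketched in a few lines (a Python sketch; the basket data is invented for illustration):

```python
def apriori(transactions, min_support):
    """Sketch of the Apriori steps: count candidate item sets, prune
    those below the minimum support, join the survivors into larger
    candidates, and repeat until no frequent k-itemsets remain."""
    n = len(transactions)
    # Steps (a)-(c): frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in sorted(items)
               if sum(1 for t in transactions if i in t) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Step (d): join frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Steps (e)-(f): accumulate support counts and prune
        current = [c for c in candidates
                   if sum(1 for t in transactions if c <= t) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}, {"bread"}]
result = apriori(baskets, min_support=0.5)
```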
By using the Apriori algorithm, the groups of item sets that most frequently occur together are identified. However, dealing with a large number of transactions means the candidate generation, counting and pruning steps need to be repeated numerous times. Thus, to make the process more efficient, techniques such as hashing (reducing the candidate size) and transaction reduction can be used (Han & Kamber, 2000, 237).
4.6.2. Decision Trees
A decision tree is a predictive model with a tree, or hierarchical, structure. It is used mostly in classification and prediction methods. It consists of nodes, which contain classification questions, and branches, which represent the results of those questions. At the lowest level of the tree -- the leaf nodes -- the label of each classification is identified. The structure of a decision tree is illustrated in figure 4.3.
Typically, like other classification and prediction techniques, the decision tree begins with an exploratory phase. It requires labeled training data sets to be fed in. The underlying algorithm will try to find the best-fit criteria to distinguish one class from another. This is called tree growing. The major concerns are the quality of the classification problems as well as the appropriate number of levels of the tree. Some leaves and branches need to be removed in order to improve the performance of the decision tree. This step is called tree pruning.
At the higher level, the predetermined model can be used as a prediction tool. Before that, the testing data sets should be fed into the model to evaluate the model performance. Scalability of the model is the major concern in this phase.

[Figure 4.3 shows a decision tree whose root splits 50 transactions on "x > 35?"; further splits on "y > 52?", "y > 25?" and "x > 65?" assign the transactions to Groups A-E.]

Figure 4.3: A decision tree classifying transactions into five groups
The fundamental algorithms can differ in each model. Probably the most popular ones are Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detector (CHAID). For the sake of simplicity, I will not go into the details of these algorithms; only an overview of each is provided.
CART is an algorithm developed by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone. The advantage of CART is that it automates the pruning process with cross-validation and other optimizers. It is capable of handling missing data, and it sets unqualified records apart from the training data sets.
CHAID is another decision tree algorithm; it uses contingency tables and the chi-square test to create the tree. The disadvantage of CHAID compared to CART is that it requires more data preparation.
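Tree growing of the kind described can be sketched as a recursive search for the split that most reduces Gini impurity. This is a simplified, CART-style illustration in Python (with invented data), not the actual CART implementation:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Find the (feature, threshold) pair that most reduces impurity."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for threshold in {r[f] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[f] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def grow(rows, labels):
    """Recursively grow the tree; leaves store the majority class label."""
    split = best_split(rows, labels)
    if split is None:
        return max(set(labels), key=labels.count)
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow([r for r, _ in left], [l for _, l in left]),
            grow([r for r, _ in right], [l for _, l in right]))

def predict(node, row):
    """Walk from the root to a leaf, answering each node's question."""
    while isinstance(node, tuple):
        f, t, lo, hi = node
        node = lo if row[f] <= t else hi
    return node

# Invented one-feature training data with two obvious classes
rows = [(1.0,), (2.0,), (10.0,), (11.0,)]
labels = ["A", "A", "B", "B"]
tree = grow(rows, labels)
```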
4.6.3. Neural Networks
Nowadays, neural networks, or more correctly artificial neural networks (ANN), attract the most interest among all data mining algorithms. A neural network is a computer model based on the architecture of the brain. To put it simply, it first detects the patterns in data sets. Then, it predicts the best classifiers. Finally, it learns from its mistakes. It works best in classification and prediction as well as clustering methods. The structure of a neural network is shown in figure 4.4.
[Figure 4.4 shows a network of neurons arranged in an input layer, a first and a second hidden layer, and an output layer.]

Figure 4.4: A neural network with two hidden layers
As figure 4.4 shows, a neural network is comprised of neurons in an input layer, one or more hidden layers and an output layer. Each pair of connected neurons has a weight. Where there is more than one input neuron, the input weights are combined using a combination function such as summation (Berry & Linoff, 2000, 122). During the training phase, the network learns by adjusting the weights so as to be able to predict the correct output (Han & Kamber, 2000, 303).
The most well-known neural network learning algorithm is backpropagation. It is a method of updating the weights of the neurons. Unlike other learning algorithms, the backpropagation algorithm works -- that is, learns and adjusts the weights -- backward, propagating the error from the output layer back toward the input layer.
Neural networks are widely recognized for their robustness; their weakness, however, is a lack of self-explanation capability. Even when the performance of the model is satisfactory, some people do not feel comfortable or confident relying blindly on the model.
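As a concrete (and heavily simplified) illustration, the Python sketch below trains a single sigmoid neuron with the gradient-descent weight update that backpropagation generalizes to multi-layer networks; the AND-function training data is invented for the example:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=10000, lr=0.5, seed=0):
    """Gradient-descent weight updates for one sigmoid neuron.
    `samples` is a list of ((x1, x2), target) pairs, targets in {0, 1}."""
    rng = random.Random(seed)
    w1, w2, b = (rng.uniform(-1, 1) for _ in range(3))
    for _ in range(epochs):
        for (x1, x2), t in samples:
            y = sigmoid(w1 * x1 + w2 * x2 + b)
            delta = (y - t) * y * (1 - y)  # error gradient at the output
            w1 -= lr * delta * x1          # propagate it back to each weight
            w2 -= lr * delta * x2
            b -= lr * delta
    return w1, w2, b

# Toy training data: the logical AND function
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, b = train(samples)
```

In a real multi-layer network the same error signal is propagated through each hidden layer in turn, which is where the name backpropagation comes from.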
It should be noted that some algorithms are good at discovering specific methods, while others are appropriate for many types of methods. The choice of the algorithm, or set of algorithms, used depends solely on the user's judgement.
4.7. Summary
Data mining, also known as knowledge discovery in databases (KDD), has been an area of attention in recent years. It is a set of exhaustively automated techniques for uncovering potentially interesting patterns in large amounts of data in any kind of data repository. Data mining goals can be roughly divided into two main categories: verification (including explanation and confirmation) and discovery.
The first step of the data mining process is to map business problems to data mining problems. Then, the data to be mined is captured, studied, selected and preprocessed, respectively. The preprocessing activities are performed in order to prepare the final data sets to be fed into the data mining model. Next, the data mining model is constructed, tested and applied. The results of this step are subsequently evaluated. If the result is satisfactory, it will be deployed in the real business environment. Lessons learned during the data mining engagement should be recorded as guidelines for future projects.
As data mining is developed from and driven by multidisciplinary fields, different tools and techniques can be applied in each step of the data mining process. Those tools and techniques include database algorithms, statistical algorithms, artificial intelligence and data visualization. The choice of tools and techniques depends on the nature and size of the data as well as the types of methods to be mined.
The types of data mining methods can be categorized into six groups: data description, dependency analysis, classification and prediction, cluster analysis, outlier analysis and evolution analysis. The appropriate techniques or algorithms for each data mining method are summarized in table 4.1. Among all the underlying algorithms of these methods, Apriori algorithms, decision trees and neural networks are the most familiar ones.
Data description:
- Attribute-oriented induction
- Data generalization and aggregation
- Relevance analysis
- Distance analysis
- Rule induction
- Conceptual clustering

Dependency analysis:
- Nonlinear regression
- Rule induction
- Distance-based analysis
- Data normalization
- Apriori algorithm
- Bayesian network
- Visualization

Classification and prediction:
- Neural network
- Relevance analysis
- Discriminant analysis
- Rule induction
- Decision trees
- Case-based reasoning
- Genetic algorithms
- Linear and nonlinear regression
- Data visualization
- Bayesian classification

Cluster analysis:
- Neural network
- Data partitioning
- Discriminant analysis
- Data visualization

Outlier analysis:
- Data cube
- Discriminant analysis
- Rule induction
- Deviation analysis
- Nonlinear regression

Evolution analysis:
- All above-mentioned techniques
- Time-related analysis
- Sequential analysis

Table 4.1: Summary of the appropriate data mining techniques for each data mining method
Data mining already has its market in customer relationship management as well as fraud detection, and it is expected to penetrate new areas in the near future. However, the data mining software packages currently available have been criticized for not being automated enough or user-friendly enough. Therefore, given the abundance of market opportunities, continued improvement and growth in the data mining arena can be anticipated.
5. Integration of Data Mining and Auditing
5.1. Objective and Structure
The objective of this chapter is to identify ways that data mining techniques
may be utilized in the audit process.
The reasons why data mining should be integrated with the auditing process are reviewed in section 5.2. Section 5.3 provides a comparison between the characteristics of currently used generalized audit software (GAS) and data mining packages from the auditing profession's perspective. In section 5.4, each possible area of integration is discussed in more detail, including possible mining methods and required data sets. Lastly, a brief summary of the data mining trend in the auditing profession is provided.
5.2. Why Integrate Data Mining with Auditing?
As mentioned in the first chapter, auditors have realized the dramatic increase in the volume and complexity of accounting and non-accounting transactions. The greater number of transactions results from newly emerging technologies, especially business intelligence systems such as enterprise resource planning (ERP) systems and supply chain management (SCM) systems. Now that transactions can be made flexibly online without time constraints, such growth can be anticipated unsurprisingly.
Besides online systems and transactions, other high-technology devices make accounting and managerial transactions more complicated. As transactions are made, recorded and stored electronically, advanced tools to capture, analyze, present and report them are required.
Dealing with intricate transactions in large volumes requires considerably more effort from professional staff, and that cannot be cost-effective. Besides, in some cases, professional judgement alone might not be sufficient due to the limitations of the human brain. Therefore, the capability to automatically manipulate complicated data through data mining is of great interest to the auditing profession.
On the other hand, the huge auditing market presents a tremendous opportunity for the data mining business as well. Auditing is one of many application areas in which the explosive growth of data mining integration is predicted to continue. Therefore, the opportunity to have data mining tools as advanced computer-assisted audit tools (CAATs) can be expected before long.
5.3. Comparison between Generalized Auditing Software and Data
Mining Packages
As mentioned above, auditors nowadays rely heavily on generalized audit software (GAS). The objective of this section is to clarify the differences between currently used GAS and the data mining packages available in the market from the auditing profession's perspective.
This section is mainly based on the features of auditing software gathered from the software developers' websites and some software review publications. The software packages include the following:
- ACL - Audit Command Language (ACL Services Ltd., 2002)
- IDEA - Interactive Data Extraction and Analysis (Audimation Services Inc., 2002)
- DB2 Intelligent Miner for Data (IBM Corporation, 2002)
- DBMiner (DBMiner Technology Inc., 2002)
- Microsoft Data Analyzer (Microsoft Corporation, 2002)
- SAS Enterprise Miner (SAS Institute Inc., 2002)
- SAS Analytic Intelligence (SAS Institute Inc., 2002)
- SPSS (SPSS Inc., 2002)
The publications include the following:
- Audit Tools (Needleman, 2001)
- How To Choose a PC Auditing Tool (Eurotek Communication Ltd., 2002)
- Information Systems Auditing and Assurance (Hall, 2000)
- Software Showcase (Glover, Prawitt & Romney, 1999)
- A Review of Data Mining Techniques (Lee & Siau, 2001)
- Data Mining - A Powerful Information Creating Tool (Gargano & Raggad, 1999)
- Data Warehousing, Technology Assessment and Management (Ma, Chou & Yen, 2000)
5.3.1. Characteristics of Generalized Audit Software
Though the features of each GAS package differ from one another, most packages, more or less, share the following characteristics:
- All-in-one features: GAS packages are designed to support the entire audit engagement, including data access, project management features for managing the engagement, and all audit procedures.
- Specifically customized for audit work: As audit work generally follows some predictable approaches, GAS packages can be designed to support those approaches. This means that auditors do not need to alter the programs before employing them and are able to understand how to work with them easily. Of all the features, the audit trail might be the most valuable one.
- User friendliness: Most GAS packages have a user-friendly interface that includes easy-to-use and easy-to-understand features as well as high presentation capability.
- No or little technical skill required: Due to GAS's user-friendly interface and the fact that it is specifically designed for audit work, it requires no or little technical skill to work with.
- Speed depending on the amount of transaction input: Nearly all GAS packages available nowadays are designed to process huge numbers of transactions, which can reach millions. However, the processing speed depends on the transaction input.
- Professional judgement required: Audit features that are built into
GAS packages include sorting, querying, aging and stratifying. However, they still
require auditors to interactively observe, evaluate and analyze the results.
There are many GAS packages available in the market nowadays. The return on investment for GAS packages is considered high, especially compared to the expense of professional staff. Therefore, most auditing firms rely on them heavily. However, the experience and professional judgement of auditors are still indispensable. That is, GAS can reduce the degree of professional staff required but cannot replace any level of professional staff.
5.3.2. Characteristics of Data Mining Packages
Among the plethora of data mining packages available, some general characteristics are:
- Automated capability: The ideal objective of data mining is to automatically discover useful information from a data set. Today's data mining packages are still not completely automated: guiding the tool to ensure that the results are interesting, and evaluating those results, still require intensive human effort.
- High complexity: How data mining algorithms work is sometimes mysterious because of their complexity. Their poor self-explanation capability results in low user confidence in the results.
- Scalability: It could be said that data warehousing is the foundation of the evolution of data mining. Data mining, therefore, is designed for the unlimited amounts of data in the data warehouse, making scalability one of its key characteristics.
- Ability to handle complex problems: As its capability is not limited by the human brain, data mining is able to handle complex ad hoc problems.
- Opportunity to uncover unexpected interesting information: When performing audit work, auditors normally know what they are looking for. This might result in a limited scope of tests. Data mining, on the other hand, can be used even when users do not have a preconceived notion of what to look for. Therefore, with data mining, some interesting information hidden in the accounting transactions can be revealed.
- Learning capability embedded: Many data mining algorithms have
learning capability. Experiences and mistakes from the past can be used to improve the
quality of the models automatically.
- Technical skill required: Substantial technical skill is mandatory for data mining software users. First, users must know the differences between the various types of data mining algorithms in order to choose the right one. Second, they are supposed to guide the program to ensure that the findings are interesting. Finally, the results of the data mining process must be evaluated.
- Lack of interoperability: There are numerous data mining algorithm methods and techniques that could be employed. However, the software packages currently available tend to focus on one method and employ only a few techniques. Interoperability between different data mining algorithm methods still presents significant challenges to data mining software developers.
- Lack of self-explanation capability: In general, data mining processes are performed automatically, and the underlying reasons for the results are frequently subtle. From an auditing perspective, this is a major problem because auditability, audit trails and replicability are key requirements in audit work.
- Relatively high cost: Though data mining software has become cheaper, it is still somewhat expensive compared to other software. Besides, in performing data mining, users incur data preparation, analysis and training costs.
Although data mining is frequently considered a highly proficient technique in many application areas, it has not been widely adopted in the auditing profession yet. However, it is expected to gain increasing popularity in auditing. The automation potential of data mining suggests that it could greatly improve the efficiency of audit professionals, including replacing some levels of professional staff involvement.
5.4. Possible Areas of Integration
Recently, people in the auditing profession have started to realize that technological advancement can facilitate auditing practices. Data mining is one of those technologies; it has been proven to be very effective in application areas such as fraud detection, which can be thought of as a part of auditing. However, the integration of data mining techniques and audit work is relatively immature.
Based on the theories mentioned in chapters 2 and 4, as well as my personal experience, I went through all the auditing steps and listed the possibilities for applying data mining techniques to those steps. The opportunities and their details are summarized in table 5.1.
Note that though, in my opinion, there are many audit steps that data mining techniques are likely to be capable of assisting, enhancing or handling, such potential may not seem realistic at this moment. One might argue that some of the listed steps can be done effortlessly by accustomed manual procedures with little help from, or even without, easy-to-use software packages. Examples of those steps are client acceptance, client continuance and opinion issuance.
I have nothing against such opinions, especially when the financial figures or data sets required for those steps are not large and the electronic data is proprietary. However, I still do not want to disregard any steps, because they can provide ideas for further research when such data becomes abundant.
Another point to note is that the table below is by no means the definitive answer for the integration of auditing procedures and data mining techniques. However, as far as I am concerned, the major and apparent notions of the topic should already be covered.
Audit process: Client acceptance or client continuance
- Appropriate mining methods: classification and prediction; evolution analysis
- Data sets required: previous years' financial statements; business rating; industry rating; economic index; previous years' actual costs (in the case of client continuance)
- Possibility: by using financial ratios, the business rating and the industry rating, the client can be classified as favorable or unfavorable (classification and prediction methods). Then, together with the estimated cost based on previous years' records and the economic index (evolution analysis), the acceptance decision can be reached.

Audit process: Planning -- risk assessment
- Appropriate mining methods: dependency analysis; classification and prediction
- Data sets required: previous years' financial statements; business rating; industry rating; economic index; system flowcharts
- Possibility: by using dependency analysis, risk triggers (e.g. financial ratios, business rating, industry rating and controls) can be identified. Then, the level of risk of each audit area can be prescribed by using the risk triggers as criteria (classification and prediction methods).

Audit process: Planning -- audit program preparation
- Appropriate mining methods: classification and prediction
- Data sets required: client's system information; results of the risk assessment step
- Possibility: the appropriate combination of audit approaches for each audit area can be determined based on the client information gathered and the risks identified in the risk assessment step.

Audit process: Execution and documentation -- tests of controls: controls identification
- Appropriate mining methods: classification; data description
- Data sets required: system information
- Possibility: controls can be identified from normal activities by using classification analysis; the characteristics of such controls can be identified by the data description method.

Audit process: Execution and documentation -- tests of controls: control reliance assessment
- Appropriate mining methods: classification and prediction
- Data sets required: results of the risk assessment step; results of the controls identification step
- Possibility: the control reliance level of each area can be categorized based on the risks, and the controls over such risks, identified in the previous steps.

Audit process: Execution and documentation -- tests of controls: sample selection
- Appropriate mining methods: cluster analysis; outlier analysis
- Data sets required: accounting transactions

Table 5.1: Possible areas of data mining and audit process integration
- Accounting transactions with
similar characters are grouped
- 52 -
Audit Processes
Appropriate Mining Methods
Data Sets Required
Possibility
through clustering. Samples
can be selected from each
cluster, together with unusual
items identified by outlier
analysis method.
- Controls Testing
-
Cluster analysis
-
Outlier analysis
- Results of sample selection step
- Either by grouping ordinary
transactions together or by
identifying the outliers, the
unusual items or matters can
be identified.
-
Results evaluation
-
Classification
- Results of control testing step
- The test results from previous
step can be classified as either
satisfactory or unsatisfactory.
If unsatisfactory, further
investigation can be done by
iterating the test or using other
Table 5.1: Possible areas of data mining and audit Processes integration (Continued)
- 53 -
Audit Processes
Appropriate Mining Methods
Data Sets Required
Possibility
techniques including
interviewing responsible
personnel, review of
supporting documents or
additional investigative works.
Analytical Review
- Expectations development
-
Classification and prediction
-
Evolution analysis
- Previous years accounting
transactions
- The expectations of each
balance can be predicted based
- Business rating
on previous years’ balances,
- Industry rating
current circumstances of the
- Economic index
business, the state of its
industry and the economic
environment.
- Expected versus actual
-
Classification
figures comparison
-
Outlier analysis
- Results of expectations
development step
Table 5.1: Possible areas of data mining and audit Processes integration (Continued)
- The differences between
expected and actual figures are
- 54 -
Audit Processes
Appropriate Mining Methods
Data Sets Required
- Accounting transactions
Possibility
grouped. Those that do not
fall into acceptable range
should be identified and
further investigated.
- Results evaluation
-
Classification
- Results of expected versus
actual figures comparison step
- The test results from previous
step can be classified as either
satisfactory or unsatisfactory.
If unsatisfactory, further
investigation can be done by
iterating the test or using other
techniques including
interviewing responsible
personnel, review of
supporting documents or
additional investigative works.
Table 5.1: Possible areas of data mining and audit Processes integration (Continued)
- 55 -
Audit Processes
Appropriate Mining Methods
Data Sets Required
Possibility
Detailed Tests
- Sample selection
-
Cluster analysis
-
Outlier analysis
- Accounting transactions
- By Accounting transactions
with similar characters are
grouped through clustering.
Samples can be selected from
each cluster, together with
unusual items identified by
outlier analysis method.
- Sample testing
-
Cluster analysis
-
Outlier analysis
- Results of sample selection step
- Either be grouping ordinary
transactions together or by
identifying the outliers, the
resulting unusual items or
matters arising can be
identified.
Table 5.1: Possible areas of data mining and audit Processes integration (Continued)
- 56 -
Audit Processes
- Results evaluation
Appropriate Mining Methods
-
Classification
Data Sets Required
- Results of sample testing step
Possibility
- The testing results from
previous step can be classified
as satisfactory or
unsatisfactory. In case of
unsatisfactory, further
investigation can be done by
iterating the test.
Documentation
-
Data description
- Results of all results evaluation
steps
- The characters of test results
and matters arising can be
described and recorded by
data description method
Completion
- Opinion Issuance
-
Dependency analysis
-
Classification and prediction
- Results of all results evaluation
step
Using dependency analysis,
circumstances or evidence that
will affect the types of opinion
Table 5.1: Possible areas of data mining and audit Processes integration (Continued)
- 57 -
Audit Processes
Appropriate Mining Methods
Data Sets Required
Possibility
issued can be collected. Then,
based on the test results, audit
findings, matters surfaced and
other related circumstances,
types of opinion can be
rendered.
- Lesson learned record
-
Data description
- Results of all results evaluation
steps
- The nature of tests, test
results, audit findings, matter
surfaced and other relevant
circumstances can be
described and recorded.
Table 5.1: Possible areas of data mining and audit Processes integration (Continued)
- 58 -
Notice that the size of the data sets required is generally small. One might argue that when the data sets are not massive, there is little point in changing from familiar GAS to complicated data mining packages. Note, however, that the data sets specified are those required for the current-year auditing processes. To build a data mining model, training data sets based on historical data are essential as well. Historical data includes both the previous years' data of the client itself and that of other businesses in similar and substitute industries. The data sets used could therefore be considerably larger in the first audit period. Besides, the data sets required for some steps can be substantially large, such as the sample selection step, which takes the accounting transactions as input.
5.5. Examples of Tests
In reality, the worst limitation is the lack of data availability, especially in the first-year audit, which makes some steps of the table in the previous section look less promising. The only data certain to be available for every audit engagement is the general ledger, i.e. the accounting transactions for the audited period. Therefore, this section focuses on what data mining can contribute when the only data available is the current-period general ledger.
As a general note, data mining methods that require historical data as a training data set cannot be used. Examples are classification and prediction, dependency analysis and evolution analysis. However, in some cases, data from previous months can be used as training data for the following months. To put it simply, the audit steps performed at the beginning of an audit engagement require data from previous years to train the model and are thus not feasible when only the current-period general ledger is available. These steps include the client acceptance and planning steps. The execution phase is therefore the only phase in which data mining techniques can be used in a first-year audit with such limited data.
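The idea of using earlier months as training data can be made concrete with a small sketch. The following is illustrative only: the figures are hypothetical and the functions are crude simplifications of the time-dependent analyses that real mining tools provide. It predicts an expectation for the next month as a moving average of the preceding months and flags actual figures that fall outside a tolerance band:

```python
def expected_balance(history, window=3):
    """Expectation for the next period: mean of the last `window` months."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def within_tolerance(expected, actual, pct=0.10):
    """True if the actual figure deviates from the expectation by at most pct."""
    return abs(actual - expected) <= pct * abs(expected)

# Hypothetical monthly balances for January..June
history = [95_000, 95_000, 95_000, 96_000, 95_500, 95_000]
expectation = expected_balance(history)          # expectation for July
print(expectation)                               # 95500.0
print(within_tolerance(expectation, 94_800))     # True: acceptable
print(within_tolerance(expectation, 150_000))    # False: investigate further
```

An auditor would of course also weigh seasonality and other factors; this sketch only shows the mechanics of training on earlier periods of the same engagement.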
The structure of the general ledger may differ from company to company. To avoid confusion, the general ledger on which this section is based is a simple flat file, as shown in figure 5.1, with the following common attributes:
Journal  Trans.  Date      Doc.     Description            Account   Amount   Resp.   Auth.   Profit
Number   Number            Number                          Number             Person  Person  Center
0001     01      1/7/2002  E00198   Rental fee             1101       95,000  SCS     SVC     100
0001     02      1/7/2002  E00198   Rental fee             7801      -95,000  SCS     SVC     100
0002     01      5/7/2002  S00059   Sales - Customer A     1209          765  SCS     SCS     403
0002     02      5/7/2002  S00059   Sales - Customer A     6103          520  SCS     SCS     403
0002     03      5/7/2002  S00059   Sales - Customer A     4103         -765  SCS     SCS     403
0002     04      5/7/2002  S00059   Sales - Customer A     1303         -520  SCS     SCS     403
0003     01      6/7/2002  P00035   Purchase - Company Z   1307        7,300  SCS     SCS     215
0003     02      6/7/2002  P00035   Purchase - Company Z   1312          450  SCS     SCS     215
0003     03      6/7/2002  P00035   Purchase - Company Z   2106       -7,750  SCS     SCS     215

Figure 5.1: Basic structure of general ledger
- Journal Number -- the record number, which can be used as a reference for each balanced transaction.
- Transaction Number -- included in order to make each record in the file unique.
- Date -- ideally, the transaction date and the recording date should be the same. If not, at least the transaction date has to be identified.
- Document Number -- refers to the source or supporting documents.
- Description -- an explanation of each journal entry.
- Account Number
- Amount -- normally the debit amount has a positive value, while the credit amount has a negative value.
- Responsible Person -- the person who prepares or keys in the transaction.
- Authorized Person -- some transactions with certain characteristics may require approval.
- Other Additional Information -- such as profit center, customer group, currency and spot exchange rate. In this case, profit center is selected as the example.
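For illustration, such a flat file could be represented and loaded as follows. This is a minimal sketch: the field names mirror figure 5.1, but the delimiter and parsing details are assumptions, since the actual file format is not specified here.

```python
import csv
from dataclasses import dataclass
from io import StringIO

@dataclass
class LedgerRow:
    journal_no: str      # reference of each balanced transaction
    trans_no: str        # makes each record unique within a journal
    date: str            # transaction date (d/m/yyyy)
    document_no: str     # source or supporting document
    description: str     # explanation of the journal entry
    account_no: str
    amount: float        # debit positive, credit negative
    responsible: str     # person who keys in the transaction
    authorized: str      # approver, where approval is required
    profit_center: str   # example of additional information

def load_ledger(text):
    """Parse a semicolon-separated flat file into LedgerRow records."""
    rows = []
    for rec in csv.reader(StringIO(text), delimiter=';'):
        rec[6] = float(rec[6].replace(',', ''))   # "95,000" -> 95000.0
        rows.append(LedgerRow(*rec))
    return rows

sample = ("0001;01;1/7/2002;E00198;Rental fee;1101;95,000;SCS;SVC;100\n"
          "0001;02;1/7/2002;E00198;Rental fee;7801;-95,000;SCS;SVC;100")
ledger = load_ledger(sample)
print(sum(r.amount for r in ledger))   # 0.0 -- journal 0001 balances
```

The sign convention on the amount field means that every complete journal should sum to zero, which is itself a cheap integrity check before any analysis.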
Based on this general ledger structure, detailed examples of tests using data mining techniques in each audit step of the execution phase are presented in table 5.2. It is important to note, however, that the examples of interesting patterns in table 5.2 do not include patterns that can be identified effortlessly by manual procedures or GAS packages. Examples are sample selection based on significant amounts, transactions that occurred on pre-identified dates (e.g. weekends, holidays) and differences between the sub-ledger and general ledger systems. In addition, some audit processes that are easily done manually or with GAS, such as detailed testing and results evaluation, are not included.
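To make that boundary concrete, the excluded checks are simple deterministic filters of the kind GAS handles directly. A sketch of two of them follows; the materiality threshold and the transactions are hypothetical:

```python
from datetime import date

def flag_simple(transactions, materiality=50_000):
    """GAS-style deterministic checks: significant amounts and weekend postings.

    Each transaction is (document_no, (day, month, year), amount)."""
    flagged = []
    for doc_no, (d, m, y), amount in transactions:
        if abs(amount) >= materiality:
            flagged.append((doc_no, 'significant amount'))
        if date(y, m, d).weekday() >= 5:   # 5 = Saturday, 6 = Sunday
            flagged.append((doc_no, 'weekend posting'))
    return flagged

txns = [('E00198', (1, 7, 2002), 95_000),   # a Monday, large amount
        ('S00059', (6, 7, 2002), 765)]      # a Saturday, small amount
print(flag_simple(txns))
# [('E00198', 'significant amount'), ('S00059', 'weekend posting')]
```

Checks like these need no training data and no mining algorithm, which is exactly why they are left out of table 5.2.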
Test of Controls

- Sample selection
  Applied techniques: Grouping all accounting transactions, with all relevant attributes, by using the clustering technique.
  Examples of interesting patterns:
  - Transactions approved by an unauthorized person.
  - Transactions that almost reach the limit of the authorized person and occur repeatedly in sequence.
  - Types of transactions that are always approved by a certain authorized person, e.g. transactions of profit center A are always approved by authorized person B.
  - Transactions that are approved by an authorized person, but not by the usual one, e.g. transactions of profit center A are always approved by authorized person B, but in a few cases they were approved by authorized person C.

- Controls testing
  Applied techniques: Using the clustering technique, grouping the samples according to a more limited set of relevant attributes.
  Examples of interesting patterns:
  - The range of transaction amounts prepared by each responsible person.
  - The range of transaction amounts approved by each authorized person.
  - The distribution of transaction amounts in each profit center.
  - The distribution of transaction amounts grouped by date.
  - The relationship between responsible person and authorized person.
  - The relationship between responsible or authorized person and profit center.
  - The date distribution of transactions prepared by each responsible person or approved by each authorized person.
  - The date distribution of transactions of each profit center.
  - Combinations of some of the patterns above.

Analytical Review

- Expectations development
  Applied techniques: Predicting the figures of the following months based on the previous months, using time-dependent analysis. The technique can be more effective if other factors, such as season, inflation rate, interest rate and industry index, are taken into account.
  Examples of interesting patterns:
  - Expectation figures that have a very stable pattern.
  - Expectation figures that have a very unstable pattern.

- Other general analysis
  Applied techniques: Taking the time variable into account, clustering the accounting transactions in each category (e.g. assets, liabilities, sales, expenses) separately.
  Examples of interesting patterns:
  - Small transactions that occur repeatedly in a certain period of the month.
  - The same types of transactions recorded differently (e.g. in different account numbers).
  - Sales figures for some months that are excessively higher or lower than those of others.
  - Expenses that are extremely unstable during the year.
  - Repeated purchases of fixed assets.
  - Repeated re-financing transactions, especially loans to and from related companies.

Detailed Tests of Transactions

- Sample selection
  Applied techniques: Grouping all accounting transactions in each area, with all relevant attributes, by using the clustering technique. It may be a good idea to include the results of both controls testing and analytical review for each area.
  Examples of interesting patterns:
  - Transactions that do not belong to any cluster, i.e. outliers.
  - Groups of transactions that make up a large percentage of the population.
  - Groups of transactions that have an unusual relationship with the results of controls testing and analytical review.

- Sample testing
  Applied techniques: Using the clustering technique, grouping the samples according to a more limited set of relevant attributes.
  Examples of interesting patterns:
  - A large number of transactions that refer to the same document number.
  - A large number of transactions that occur on the same date, especially when it is not a month-end or other pre-identified peak date.
  - Groups of non-regular transactions that occur repeatedly, for example fixed-asset purchase transactions that occur at the same amount on every month-end date.
  - Time differences between document date and record date that deviate from the normal pattern, for example a time gap 5 days longer than normal during the second week of every month.

Detailed Tests of Balances

- Sample selection
  Applied techniques: Grouping all accounting transactions in each area, with all relevant attributes, by using the clustering technique. It may be a good idea to include the results of both controls testing and analytical review for each area.
  Examples of interesting patterns:
  - Transactions that do not belong to any cluster, i.e. outliers.
  - Groups of transactions that make up a large percentage of the population.
  - Groups of transactions that have an unusual relationship with the results of controls testing and analytical review.

- Sample testing
  Applied techniques: Using the clustering technique, grouping the samples according to a more limited set of relevant attributes.
  Examples of interesting patterns:
  - "Cash at bank" ending balances of some banks that differ from the others, such as a large fixed balance even though the company does not have any obligation agreements with the bank.
  - Customers with many small balances of the same product, for example an insurance customer whose balance is comprised of many insurance policies bought repeatedly within 2 weeks.
  - Inter-company balances whose pricing patterns are significantly different from those of normal transactions.

Table 5.2: Examples of tests of each audit step in the execution phase
Generally, it is more effective to analyze the data in aggregated form and to iterate the test at a more detailed level only when needed. By running the test in an aggregated manner, the exploration is faster and more comprehensible, in the sense that the number of results stays within a manageable range. Over-fitting problems can also be prevented. For example, the higher levels of the account number hierarchy should be used in the first clustering analysis of detailed testing. Then, only the interesting groups of accounts are selected for further testing, in which the detailed account numbers can be used.
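This aggregate-then-drill-down idea can be sketched as follows, hypothetically treating the first digits of the account number as the higher hierarchy level (the account structure and the threshold are assumptions for illustration):

```python
from collections import defaultdict

def aggregate(transactions, level):
    """Total amounts by account-number prefix of the given length."""
    totals = defaultdict(float)
    for account_no, amount in transactions:
        totals[account_no[:level]] += amount
    return dict(totals)

txns = [('1101', 95_000.0), ('1209', 765.0), ('1303', -520.0),
        ('7801', -95_000.0), ('4103', -765.0)]

# First pass: coarse 2-digit account groups keep the result set manageable
coarse = aggregate(txns, level=2)
interesting = {prefix for prefix, total in coarse.items() if abs(total) > 10_000}

# Second pass: full account numbers, but only within the interesting groups
detail = aggregate([t for t in txns if t[0][:2] in interesting], level=4)
print(sorted(detail))   # ['1101', '7801']
```

The same two-pass pattern applies to clustering: cluster on the coarse grouping first, then re-cluster the detailed accounts of the groups that look interesting.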
As the table above suggests, sample selection might be the most promising step for applying data mining. By using a data mining package to perform clustering analysis, auditors can select samples from representatives of groups categorized in ways they had not distinguished before, as well as from the transactions that differ markedly from normal ones, i.e. outliers. Further testing then remains a matter of professional judgement: auditors can consider using data mining packages to continue the tests, or take the selected samples and test them with other GAS packages.
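A rough sketch of this kind of sample selection, using a deliberately simple one-dimensional k-means on transaction amounts (illustrative only: commercial mining tools cluster on many attributes at once, and the amounts below are hypothetical):

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means: returns (cluster centers, cluster label per value)."""
    svals = sorted(values)
    centers = [svals[i * (len(svals) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels

def select_samples(values, k=2, outlier_factor=3.0):
    """One representative per cluster, plus points unusually far from their center."""
    centers, labels = kmeans_1d(values, k)
    spread = [abs(v - centers[l]) for v, l in zip(values, labels)]
    cutoff = outlier_factor * sum(spread) / len(spread)
    reps = []
    for c in range(k):
        members = [v for v, l in zip(values, labels) if l == c]
        if members:
            reps.append(min(members, key=lambda v: abs(v - centers[c])))
    outliers = [v for v, d in zip(values, spread) if d > cutoff]
    return sorted(set(reps + outliers))

# Hypothetical amounts: a tight low cluster, a high cluster, and one stray
# value (3000.0) that belongs to no natural group
amounts = [500.0, 505.0, 510.0, 515.0, 520.0, 3000.0, 9800.0, 9900.0]
print(select_samples(amounts))   # [520.0, 3000.0, 9800.0]
```

The selected sample combines a typical member of each cluster with the stray value, which is precisely the mix of "representatives plus outliers" described above.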
From the discussion above, the conclusion is that, at present, data mining cannot substitute for GAS or the other computerized tools currently used in the auditing profession. However, it might be incorporated as an enhancement of GAS in some auditing steps, and, if that proves worthwhile, research and development towards a customized data mining package for auditing can be anticipated.
5.6. Summary
Recently, data mining has become an area of interest for the auditing profession, owing to cost pressure, while the auditing profession in turn offers another promising market for data mining. The integration of auditing knowledge and data analysis techniques is therefore not far-fetched.
As seen in table 5.3, generalized audit software (GAS) and data mining packages have somewhat different characteristics from the audit perspective. At present, GAS already has a market in the audit profession. Its capability to assist the overall audit process, with little technical skill required, is the major reason for its success.
Characteristics                     GAS      Data Mining Package
Customized for audit work           Yes      No
Supports entire audit procedures    Yes      No
User friendly                       More     Less
Requires technical skill            Less     More
Automated                           No       Yes
Capable of learning                 No       Yes
Cost                                Lower    Higher

Table 5.3: Comparison between GAS and data mining package characteristics
GAS has also been criticized, however, as being merely an old query tool with more efficient presentation features: it can make some tasks easier, but it cannot complete anything by itself.
Data mining, on the other hand, promises automated work but is quite difficult to employ. Nevertheless, data mining tools remain promising in a variety of application areas, pending further research, improvement and refinement. If appropriate data mining tools for the auditing profession are developed, they can be expected to replace some of the professional expertise required in certain auditing processes.
Though data mining seems feasible in almost all steps of the audit procedures, the most practical and most needed area is the execution phase, especially the sample selection step. The integration can be achieved by mapping the audit approaches, including tests of controls, analytical review and detailed tests, onto data mining problems.
6. Research Methodology
6.1. Objective and Structure
The objective of this chapter is to provide a perspective on the actual study -- the empirical part. The research period is specified in section 6.2. The material of the study is discussed in section 6.3, and the research methods, including the reasons for the chosen study and the techniques used, are provided in section 6.4. Next, section 6.5 specifies the software chosen for the testing phase. Then, the methods of result interpretation and analysis are identified in section 6.6. Finally, all of the above is summarized in section 6.7.
6.2. Research Period
The research period covers a twelve-month period of the accounting transaction archive, starting from January 2000.
6.3. Data Available
For this thesis, the most appropriate data set is an accounting transaction archive. Although not strictly required, it makes more sense to apply data mining to a vast amount of data, so that the automation capability of data mining can be appreciated; the expected data set is therefore relatively large.
The data set used in the study was provided courtesy of SVH PricewaterhouseCoopers Oy (www.pwcglobal.com/fi/fin/main/home/index.html). To preserve the confidentiality of the data, the data set was sanitized so that the proprietor of the data remained anonymous and sensitive information, such as account names, cost center names, account descriptions and structures, and the basic nature of the transactions, was eliminated. In addition, supporting documents containing confidential information, such as the chart of accounts, were not provided.
According to PwC, the data set was captured from a client's general ledger system. However, since the purpose of the PwC research was to analyze the relationship between expenses and cost centers, only transactions relevant to that research were obtained.
Although the data set does not represent a complete general ledger, it is considered complete for cost center analysis. A more detailed account of this matter is provided in section 7.4 -- Restrictions and Constraints. However, it is worth noting that, due to the incomplete data set, missing information and limited supporting knowledge, the scope of the tests was limited to a few areas. This is somewhat different from a normal audit engagement, where such an information limitation is a serious matter that could prevent the auditors from rendering an opinion.
The initial data set consists of accounting transactions taken from the general ledger system. It contains 494,704 tuples (transactions) and forty-six attributes (columns), of which seven are numeric calculation attributes and only two are explanatory in nature. The list of columns is provided in Appendix A.
Due to the limitation of the available data, the area of research was the sample selection step of the tests of controls phase. A detailed discussion of the research area is provided in section 7.3.2 -- Data Understanding.
6.4. Research Methods
This study focuses on the use of data mining techniques to assist audit work. The data set was tested using both data mining software and generalized audit software (GAS), in order to determine whether data mining software can be used as an enhancement of, or a replacement for, GAS.
As stated in Chapter 4, the data mining process consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. However, the modeling phase, which also includes building a model for future use, is time-consuming. Due to the time constraint, the data mining techniques were applied to the data so that their usefulness could be evaluated without building a model. Besides, as the results of the tests have to be compared, interpreted and analyzed against the hypotheses in section 7.4 -- Results Interpretation, the evaluation phase is not included in the research process section. Finally, the deployment phase is considered to be out of the scope of this thesis.
In my opinion, the data mining process can be applied to the use of any software, but the first three phases are especially valuable in helping users understand how to benefit from the software efficiently. However, for ready-to-use software packages such as GAS, the modeling phase can be thought of as the deployment phase. To make the processes of data mining and GAS compatible, the fourth phase, which is the last phase of the research process, will be called software deployment.
For all practical purposes, the first three phases were performed once, and both the GAS and the data mining software packages were used in the last phase. The prepared data was tested with the GAS and the data mining software, and the results were then evaluated and compared for usability and interestingness.
6.5. Software Selection
6.5.1. Data Mining Software
For the data mining software, DB2 Intelligent Miner for Data Version 6.1 was chosen. It was developed by IBM Corporation (www.ibm.com) and released in September 1999. The distinction of the product is its capability to mine both relational data and flat-file data. However, as implied by its name, the product works best with DB2 relational data.
Figure 6.1: IBM’s DB2 Intelligent Miner for Data Version 6.1 screenshot
As shown in figure 6.1, the mining template of DB2 Intelligent Miner for Data supports six data mining methods: association, classification, clustering, prediction, sequential patterns and similar sequences. The underlying algorithms, as well as other parameters and constraints, can be chosen and specified for each mining session. In addition, the mining results can be viewed in many formats with the enhanced visualization tools.
Another interesting feature of DB2 Intelligent Miner for Data is its interoperability. It can be integrated with other business intelligence products, such as SPSS (Statistical Package for the Social Sciences), and with operational information systems, such as SAP.
6.5.2. Generalized Audit Software
ACL (Audit Command Language), the most well-known generalized audit software, was selected to represent GAS. It was developed by ACL Services Ltd. (www.acl.com), and the version chosen was ACL for Windows Workbook Version 5, released in 1997.
I will not go into detail about ACL, because almost all of its features were described in section 3.3 -- Generalized Audit Software. It is worth noting, however, that besides the statistical and analytical functions, the preeminent feature of ACL is its automatic command log, which captures all activities, including the commands called by the users, messages and the results of the commands.
6.6. Analysis Methods
The auditing results from both software packages were interpreted, analyzed and compared. The focus was on the assertions from the auditing point of view. The elements of comparison include the following:
- Interestingness and relevance of the results
- Time spent
- Level of difficulty
- Level of required technical knowledge
- Level of automation
- Risks and constraints
However, as auditing assertions are frequently linked to materiality and judgment, basing the analysis on a single opinion alone was considered insufficient. Thus, to strengthen the interpretation of the test results and to avoid bias, opinions, suggestions and comments on the above-mentioned points were also gathered from experienced auditors and competent professionals.
6.7. Summary
For the study, a data set provided by SVH PricewaterhouseCoopers Oy was tested with data mining software and generalized audit software (GAS). IBM's DB2 Intelligent Miner for Data Version 6 was selected to represent data mining software, while ACL for Windows Workbook Version 5 was chosen for GAS.
Based on the data available, the study focused on sample selection for controls testing. The data set was tested with DB2 Intelligent Miner for Data and ACL for Windows Workbook. The results of the tests were interpreted with respect to relevance, interestingness, time spent, required knowledge, level of automation, and risks and constraints. The interpretation was also confirmed by a reasonable number of auditors and competent professionals.
7. The Research
7.1. Objective and Structure
This chapter aims to provide information about the research. The hypotheses are described in section 7.2. Then, the facts about the work performed are documented in section 7.3. The results of the research are summarized in section 7.4 and interpreted in section 7.5. Finally, a brief conclusion of the actual study is presented in section 7.6.
7.2. Hypothesis
The hypotheses of this study are as follows:
H1: Data mining will enhance computer assisted audit tools by automatically
discovering interesting information from the data.
H2: By using clustering techniques, data mining will find more interesting groups of samples to be selected for the tests of controls than sampling using generalized audit software.
7.3. Research Process
The research process for this thesis consists of four phases: business understanding, data understanding, data preparation, and software deployment. The first three phases were performed only once, while the last phase was performed with both software packages. The details of each phase are as follows:
7.3.1. Business Understanding
In auditing, the term “business” might be ambiguous, as it can be thought of as the business of the proprietor of the data, auditing itself, or both. Since the proprietor of the data is kept anonymous in this case, the focus here is auditing. However, it is important to note that normally the requirements of both businesses should be taken into account. Once the business objective is set, it should be kept fixed, so that the following phases are easier and more logical to perform.
The main purpose of this thesis is to find out whether data mining is useful to the auditing profession. However, it is impossible to test all assertions, since data and time are limited. Therefore, only one practical area of research, for which data is available, was selected for testing. If this research shows that data mining can contribute something to the auditing profession, further research using more complete data may be anticipated.
In this research, the possible areas of testing depend on the data available, and the detailed business objective cannot be identified until the data has been studied; consequently, this phase had to be performed in parallel with the data understanding and data preparation phases.
7.3.2. Data Understanding
In practice, the data understanding and data preparation phases go hand in hand. In order to carry out the data understanding phase, the data needs to be transformed so that the software will accept it as input. On the other hand, the data has to be understood before the cleaning is performed; otherwise some useful information might be stripped out, or some noise might be left in.
As a result, the data understanding and data preparation processes have to be iterated. Going back and forth between these two phases allows the user to revise the data mining objective so that it meets the business objective. As mentioned before, the business objective of this thesis is ambiguous, because the selected test depends on the data available. Therefore, the business understanding phase is also included in every iteration. To keep things simple, the details of the iterations are documented only once, in section 7.3.3 -- Data Preparation.
I chose to pre-study the data by using SQL (Structured Query Language)
commands in the DB2 Command Line Processor. The study process only analyzes
the relevance and basic nature of the attributes by querying each attribute in different ways.
For example, all unique records of each attribute were queried so that the empty
attributes (with null values) and the attributes that have the same value in all records could be
eliminated. Notice that, in ACL, these queries are embedded in user-friendly built-in
functions; the results of querying are therefore the same.
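The kind of attribute pre-study described above can be sketched as follows. This is a hypothetical illustration using Python's built-in sqlite3 module in place of the DB2 Command Line Processor; the table and column names are invented, not those of the PwC data set.

```python
import sqlite3

# Hypothetical stand-in for the general ledger table studied in DB2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gl (batch_error TEXT, cost_center TEXT, amount REAL)")
conn.executemany("INSERT INTO gl VALUES (?, ?, ?)",
                 [(None, "CC1", 10.0), (None, "CC1", -10.0), (None, "CC2", 25.5)])

# For each attribute, query its distinct values; attributes that are entirely
# NULL or hold a single value in all records are candidates for elimination.
for col in ("batch_error", "cost_center", "amount"):
    distinct = [r[0] for r in conn.execute(f"SELECT DISTINCT {col} FROM gl")]
    all_null = distinct == [None]
    constant = len(distinct) == 1
    print(col, "eliminate" if (all_null or constant) else "keep")
```

In DB2 the same check would be an ordinary `SELECT DISTINCT` per column, which is exactly the query ACL hides behind its built-in functions.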
Another point to note is that, in reality, the data can be studied
much more extensively. After the preliminary analysis, the interesting matters would be
investigated further. Such investigation includes aggregated analysis (analyzing the
studied data by aggregating it with other relevant information), supporting material
review and information queries to the responsible person. However, this could not be
done in this research due to the confidentiality constraint.
Since the available data is the general ledger of a certain period, the
possible areas of research were limited to those stated in table 5.2. However, due to
data and time constraints, the research was restricted to only the sample selection step of the
control testing process. That is because the knowledge about the available data is
insufficient, which made the analysis process of control testing and analytical procedures
more complicated, or even infeasible. In addition, the data cannot be tested in
aggregated format because the structures of neither the accounts nor the cost centers were
available. Therefore, the detailed test of transactions and the detailed test of balances
cannot be performed effectively either.
In a normal audit engagement, the sample size of the control testing
varies depending on the control reliance. As the control reliance of this data cannot be
identified, the sample size is set to fifty transactions, which is considered a medium-size
sample set.
7.3.3. Data Preparation
The objective of this step is to understand the nature and structure of the
data in order to select what is to be tested and how it should be tested. It includes three
major steps: data transformation, attribute selection and choice of tests.
Details of each step are as follows:
7.3.3.1. Data Transformation
As the data file provided by SVH PricewaterhouseCoopers Oy
(PwC) was in ACL format, it could be used with the ACL package directly, although the
version difference caused some problems. However, to test the data with Intelligent
Miner for Data, the data had to be transformed into a format that could be placed in
the DB2 environment. Therefore, in this research, this step was only needed for the data
mining test.
Notice that IM can also mine data in normal text files.
However, this is more complicated, as it requires users to specify the length of each record,
the length of the columns and the definition of each column. Besides, it is more convenient
to study the data and to test the correctness of the imported data in the DB2 command
processor because the commands are based on SQL.
As they are out of the scope of this paper, the details of each
transformation step will not be provided. In short, the data was converted into a text file with
a certain format and then imported into a DB2 database. However, it is worth
noting that this process was time-consuming, especially because the structure of the data
was unclear.
7.3.3.2. Attribute Selection
It is always better to scope the data set to only the relevant attributes
so that the data set can be reduced to a more manageable size and the running time of the
software is minimized. However, this step is not considered significant for ACL, because
the sample selection algorithm of ACL is quite rigid and is not based on any
particular attribute except for those specified by the user. Therefore, this step is necessary
for the data mining test only.
This step aims to understand the data better so that the choice of
tests can be specified. As mentioned in the data understanding section, the most
appropriate test is the sample selection step of the control testing process, given the data
constraints. However, it is crucial to identify the relevant attributes as the selection
criteria. The risk is that relevant attributes might be eliminated, which would affect the
interestingness and the accuracy of the result. On the other hand, remaining
irrelevant attributes can distort the data structure.
Many iterations of this step were performed. At the end of each
iteration, the possible choices of test were reviewed against the remaining data set. A
brief summary of each iteration is documented below:
a) First Iteration: Eliminating all trivial attributes -- This
iteration includes the following:
- Eliminate all attributes that contained only null values --
altogether seven attributes.
- Eliminate all attributes that contained the same value in all
records -- altogether twelve attributes.
- Eliminate all redundant attributes that are highly
correlated with more detailed ones (e.g., the Month
attribute can be replaced by the Date attribute) -- these
attributes add no information and may distort the
result; altogether six attributes.
b) Second iteration: Eliminating irrelevant attributes that might
introduce noise or whose structure is unclear -- This iteration
includes the following:
- Eliminate attributes that mainly contained null or
meaningless (e.g. "*****") values -- This step can be
considered risky, as the exceptions might be outliers or
interesting patterns. However, as the data cannot be
analyzed in detail, keeping these attributes would
contribute nothing but distortion of the data structure.
Thus, the attributes that contained more than forty percent
empty values were removed -- altogether six attributes.
- Eliminate attributes with a fixed structure, such as
sequential numbers grouped by specific criteria -- altogether
four attributes.
At the end of this stage, only complete and non-deviating
attributes remained in the data set. However, the choice
of tests from these eleven attributes could not be specified
for two reasons. First, although data mining software can
handle many divergent attributes, the test result would be
excessively incomprehensible. Second, attributes
with unknown or unclear definitions would reduce the
accuracy of the test result and make it more
complicated to analyze.
c) Third iteration: Eliminating attributes whose structures are
unclear and might degrade the accuracy of the results --
This is very risky because it depends on judgement.
Therefore, after the selection, confirmation from the responsible
person at PwC was also obtained. This step includes the
following:
- Eliminate attributes that require aggregated analysis -- As
mentioned earlier, attributes that are complicated, such as
those having many unique values, should be analyzed in
aggregated format. Therefore, three attributes that require
an understanding of the structure were eliminated.
- Eliminate attributes that were added for internal
analysis by the company to which the data belongs -- altogether
three attributes.
At the end of this phase, the scope of the test was reduced to only
six attributes: "Batch Error", "Cost Center", "Authorized
Person", "Document Number", "Document Date" and
"Amount (EUR)". Aside from using the
DB2 Command Line Processor to analyze each attribute
more extensively, a small clustering test (optimizing the mining
run for time) was also run in order to ensure the relevance of the
attributes and to preliminarily examine the relationships among
them. However, the results of the test were, as expected,
incomprehensible, especially when further analysis could not be
made. An example of the testing results is shown in figure
7.1. Therefore, another iteration of this step was required.
Figure 7.1: Result of neural clustering method with six input attributes.
d) Fourth iteration: Eliminating attributes for the final test --
According to PwC, this data set was taken from the general
ledger system for cost center analysis. However, after the
preliminary test, it cannot be established that the data set
represents the complete general ledger, a limitation
that will be addressed in section 8.4 -- Restrictions and
Constraints.
Finally, three further attributes were
eliminated so that the final test would include only relevant
attributes and maximize the efficiency of the test. Details of
each eliminated attribute are as follows:
- Cost Center -- As this analysis is important for PwC, this
attribute was retained in the earlier iterations despite its
ambiguous structure. However, it consisted of 748 unique values,
which makes it difficult to analyze without an understanding of its
hierarchical structure.
- Document Number -- This is the only reference number at
the transaction level of this data set. The first assumption
was that it is a journal number referring to the original
document. However, the sum of the transaction
balances for each document number is not zero. Besides,
the transactions under each document number referred to
different Document Dates. Thus, it was concluded that
these are not journal numbers and it would be of no use to
leave the attribute in the data set.
- Document Date -- As mentioned earlier, the document
dates of the transactions differed even when they
referred to the same Document Number. Besides, the
knowledge of this attribute was insufficient, so further
analysis of it was not feasible.
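The elimination rules applied across these iterations can be summarized in a small filter function. This is an illustrative sketch only; the toy columns and the forty-percent threshold follow the description above, not the actual ledger data.

```python
def select_attributes(columns, max_empty_ratio=0.40):
    """Keep only attributes that are not all-NULL, not constant,
    and not mostly empty (the second-iteration rule above)."""
    kept = {}
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        if not non_null:                       # all NULL -> eliminate
            continue
        if len(set(non_null)) == 1 and len(non_null) == len(values):
            continue                           # same value in all records
        empty_ratio = 1 - len(non_null) / len(values)
        if empty_ratio > max_empty_ratio:      # mostly empty -> eliminate
            continue
        kept[name] = values
    return kept

# Hypothetical toy columns standing in for the general ledger attributes.
cols = {
    "all_null": [None, None, None, None, None],
    "constant": ["X", "X", "X", "X", "X"],
    "sparse":   ["A", None, None, None, None],   # 80% empty
    "amount":   [10.0, -10.0, 25.5, 3.0, -3.0],
}
print(sorted(select_attributes(cols)))  # → ['amount']
```

Redundancy checks (the Month-versus-Date case) and fixed-structure attributes still require the judgement described above; they are not captured by a mechanical rule like this.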
7.3.3.3. Choice of Tests
As mentioned in the data understanding subsection, the only
possible testing area is the sample selection step of the control testing. This step can be
performed differently by grouping the transactions using different sets of attributes.
However, since the knowledge of the data is insufficient, only three relevant attributes
remained in the data set. Therefore, it cannot be studied extensively.
The focus of the test is how data mining will group the
accounting transactions according to those three relevant attributes and whether the
result brings out any interesting matters. Although the most appropriate data mining
method for this step is clustering, it has the problem of lacking self-explanation. In
other words, the results of clustering functions are clusters that are
automatically grouped by the underlying algorithms, but the criteria
of grouping are left unaddressed. For the purpose of cross-validation, the tree
classification method was also chosen, because it shows how the rules are derived.
In conclusion, the mining methods chosen are clustering and classification.
Descriptions of each method, according to IBM's Intelligent Miner for Data help manual,
are as follows:
a) Clustering
There are two options of clustering mining functions --
demographic and neural. Demographic clustering
automatically determines the number of clusters to be
generated. Similarities between records are determined by
comparing their field values. The clusters are then defined so
that Condorcet's criterion is maximized. Condorcet's
criterion is a floating-point number between zero and one and is
the sum of all record similarities of pairs in the same cluster
minus the sum of all record similarities of pairs in different
clusters (IBM, 2001a, 6). Put another way, the higher the
Condorcet value, the more similar all records in the same
cluster are.
Similarly, neural clustering groups database records by
similar characteristics. However, it employs a Kohonen
Feature Map neural network. The Kohonen Feature Map uses a
process called self-organization to group similar input
records together (IBM, 2001a, 6).
Besides the underlying algorithms, there are two major
differences between these two functions according to
"Mining Your Own Business in Banking Using DB2
Intelligent Miner for Data" (IBM, 2001c, 63). First,
demographic clustering has been developed to work with
non-continuous numeric (or categorical) variables, while
neural clustering works best with continuous
numeric values and maps categorical values to
numeric values.
Second, for neural clustering, users have to specify the
number of clusters they wish to derive, while with
demographic clustering the natural number of clusters is
created automatically based on the user specifying how
similar the records within the individual clusters should be.
b) Classification
Two algorithms of the classification method are tree
classification and neural classification. Neural
classification employs a back-propagation neural network,
which is a general-purpose, supervised-learning algorithm, to
classify data. This means that its results would be as
ambiguous as those of the clustering techniques. Since this method
was used only for gaining a deeper understanding of the structure
of the records and for cross-validation, only the tree classification
function was chosen.
Tree classification utilizes historical data containing a set of
values and a classification for these values. By analyzing
data that has already been classified, it reveals the
characteristics that led to the previous classification (IBM,
2001a, 7).
In sum, three mining functions, namely demographic clustering,
neural clustering and tree classification, were chosen to complement each other and to
validate the derived results.
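As a rough, simplified illustration of Condorcet's criterion described above (and not IBM's actual implementation), a clustering score in the zero-to-one range can be sketched by comparing within-cluster and between-cluster pair similarities:

```python
from itertools import combinations

def similarity(a, b):
    """Fraction of matching field values between two records."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def condorcet(records, labels):
    """Simplified Condorcet-style score in [0, 1]: the within-cluster
    pair similarity rescaled against the total pair similarity."""
    within = between = 0.0
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        s = similarity(a, b)
        if labels[i] == labels[j]:
            within += s
        else:
            between += s
    total = within + between
    return within / total if total else 0.0

# Hypothetical records: (batch error flag, authorized person).
recs = [("ERR", None), ("ERR", None), ("OK", "LOUHITA"), ("OK", "SMITH")]
print(round(condorcet(recs, [0, 0, 1, 1]), 3))  # → 1.0
```

In the toy run, all pair similarity falls within the clusters, so the score is the maximum, mirroring the near-1 Condorcet value reported later for the all-error cluster.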
7.3.4. Software Deployment
This is the most critical step in this research and will provide the trail for
further research. Therefore, more space is devoted to explaining each software
deployment process. Details are as follows:
7.3.4.1. IBM’s DB2 Intelligent Miner for Data
Before proceeding, it is important to note that the explanations
in this subsection are mainly based on the following:
- "Intelligent Miner for Data" (IBM, 2001a)
- "Data Mining for Detecting Insider Trading in Stock
Exchange with IBM DB2 Intelligent Miner for Data" (IBM,
2001b)
- "Mining Your Own Business in Banking Using DB2
Intelligent Miner for Data" (IBM, 2001c)
- The online help manual of the software
As a general note, each function was run more than once because
there is no golden rule as to the most appropriate parameter values.
The parameters for each run were updated and revised in order to make the result of
the next run more comprehensible or to ensure that the current run was the most
satisfactory. Only the interesting graphical results are illustrated in the discussion below,
while all high-resolution graphical results and the descriptive results are provided in
Appendix B.
Three functions of DB2 Intelligent Miner for Data -- demographic
clustering, neural clustering and tree classification -- were chosen for testing.
Details of each function are as follows:
a) Demographic Clustering
As mentioned above, the number of clusters to be generated
is determined automatically. The function finds the optimum
combination of values that maximizes the similarity of all
records within each cluster while at the same time
maximizing the dissimilarity between the clusters it produces;
in other words, it tries to maximize the Condorcet value
(IBM, 2001a, 64). However, there are four parameters that
users have to specify, the details of which are given in table
7.1.
For the clustering results, the derived clusters are presented in
separate horizontal strips ordered by size. Each attribute is
shown in each strip ordered by its importance to the individual
cluster (IBM, 2001c, 70).
Parameter              Default  Definition
Maximum Passes         2        The maximum number of times the function goes
                                through the input data to perform the mining
                                function.
Maximum Clusters       9        The largest number of clusters the function
                                generates.
Accuracy Improvement   2        The minimum percentage of improvement in
                                clustering quality after each pass of the data,
                                used as the stopping criterion: if the actual
                                improvement is less than the value specified,
                                no more passes occur.
Similarity Threshold   0.5      Limits the values accepted as best fit for a
                                cluster.
Table 7.1: Definitions and default values of demographic clustering parameters
The categorical variables are shown as pie charts. The inner circle
shows the percentage of each variable value within the cluster,
while the outer one shows the percentage of the same variable value
within the whole population. On the other hand, the numerical
variables are shown as histograms. The highlighted part is the
distribution of the population, while the line shows the
distribution of the cluster. Notice that this result pattern applies
to neural clustering as well.
In the first run of this method, the default values of all
parameters were chosen. The graphical result of this run is
shown in figure 7.2.
Figure 7.2: Graphical result of the first run of demographic clustering (Parameter
values: 2, 9, 2, 0.5)
As seen from the figure, eight clusters were derived from this
run. The interesting matters are as follows:
- The largest cluster (Cluster0) contained 493,846
transactions, or 99.83% of the population. This simply
means that almost all of the records follow the same
pattern: they are not errors, and 86.37% of the
transaction amounts are between 0 and 50,000.
- The second largest cluster (Cluster2) is where all the error
transactions are. It also shows that none of them had an
authorized person to approve them. The
Condorcet value of this cluster is 0.9442, which means the
records are almost identical to each other and extremely
dissimilar to the other clusters.
- Cluster3 and Cluster4 should have a close relationship
because their distributions of authorized person are almost
the same. Besides, their distributions of the absolute
transaction amount are almost equal.
- Cluster5 and Cluster7, which mainly contain the extremely
high and low transaction amounts, include only seven
transactions in total. It is interesting that this small
number of transactions was grouped separately from the
others. In other words, these two clusters can be thought
of as outliers.
The global Condorcet value of 0.6933 is considered
satisfactory. However, for comparison purposes, another run
was performed. Because the result already seemed
detailed, in the second run the maximum number of clusters
was reduced to five while the other parameter values remained
the same.
Except for the number of derived clusters, the result of the
second run is almost the same as that of the first. This is
especially true of the exact Condorcet value and the
distributions of the major clusters, including Cluster0,
Cluster2, Cluster3 and Cluster4. This means that neither a
smaller nor a greater number of clusters provides a better result
in this case. However, as the result of the first run shows
more detail, it will be used as the representative of
demographic clustering for the comparison analysis.
b) Neural Clustering
For neural clustering, users have to specify two
parameters -- the maximum passes and the maximum
clusters. The default values of these parameters are five and
nine, respectively. Moreover, the input data are selected by
default, and normalization of the data is strongly
recommended. Input data normalization means that the
data will be scaled to a range of zero to one in the case of
continuous or discrete numeric fields and converted into a
numeric code in the case of categorical fields, so that they can be
presented to the neural network.
All default values were used in the first run of this method.
The result is shown in Figure 7.3. Notice that the result of
neural clustering is different from the result of demographic
clustering. The reason why different clustering techniques
produce different views of the transaction records is that they
are based on different algorithms. Normally, the
choice of technique depends on the user's judgement; however,
most of the time more than one technique is used
for comparison purposes.
Figure 7.3: Graphical result of the first run of neural clustering (Parameter values: 5, 9)
From Figure 7.3, seven clusters were produced. The
interesting patterns of this result are as follows:
- The most interesting cluster is Cluster4, which contains
573 transactions, or 0.12% of the population. It contains
all the error transactions, which were already known from
the preliminary analysis and the demographic clustering to
be unauthorized. However, it also contains
transactions authorized by another person (LOUHITA),
which might mean that this person approved a set of
transactions that has the same pattern as the erroneous ones.
Thus, if that is really the case, this is the cluster that
auditors should focus on.
- The smallest cluster, Cluster1, contains only
twenty-eight transactions, some of which were not authorized.
In addition, the transaction amounts of this group
are mostly extremely low.
For comparison, two other runs with maximum clusters of
four and sixteen were produced. For the third run, the result
is considered too detailed and does not show anything more
specific than the first one.
In the second run, the error transactions were grouped with
the majority of the records (64.86%). However, the most
interesting cluster is one that contains ninety-five
transactions, including those that were not authorized but also
not recorded as errors. As this result is relatively similar to
Cluster1 of the first run, only the result of the
first run is selected as the representative of the neural
clustering results.
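The input normalization described for neural clustering (min-max scaling of numeric fields, numeric coding of categorical fields) can be sketched as follows; this mirrors the help-manual description rather than Intelligent Miner's internal code, and the sample values are hypothetical.

```python
def normalize_numeric(values):
    """Scale continuous/discrete numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def encode_categorical(values):
    """Map each distinct category to a numeric code."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

print(normalize_numeric([-100.0, 0.0, 100.0]))  # → [0.0, 0.5, 1.0]
print(encode_categorical(["B", "A", "B"]))      # → [1, 0, 1]
```

Without such scaling, an attribute like Amount (EUR), with its wide range, would dominate the distance computations inside the Kohonen network.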
c) Tree Classification
As mentioned above, the tree classification technique was chosen
for cross-validation purposes. The advantage of this technique
is its rule explanation. In DB2 Intelligent Miner for Data,
a binary decision tree visualization is built as the output of
tree classification.
Like other classification techniques, tree classification
requires training and testing data sets to build the model, or
classifier. However, in this test, all records were used as the
training data set, because the objective of this test is
not to build a model but to find the decision path
on which the model is built.
The only parameter that the user has to specify when running a
tree classification test is the maximum tree depth, i.e. the
maximum number of levels to be created for the binary
decision tree. The default value for maximum tree depth is
unlimited.
However, the choice of data fields has to be specified as well.
The relevant attributes are chosen from available fields as
input fields and class labels. The class label represents the
particular pre-identified classification based on the input
fields. In this case, the Batch Error attribute was used as the
class label, while the other two attributes were specified as
input fields. Once more, the default value of maximum tree
depth was used in the first run.
Clearly, the result of the first run, with sixty-seven tree-depth
levels, is incomprehensible. Although Intelligent Miner for
Data has a feature that allows pruning the tree, it only
allows pruning away groups whose population is less than a
certain size, not the other way around. By doing so, the small
clusters, which normally are the more interesting ones in terms
of different patterns, would be pruned away.
Therefore, with the same choice of data fields, the maximum
tree depth of the second run was set at ten in order to reduce
the number of levels of the tree while keeping all records
grouped. Figure 7.4 shows the binary tree of the
classification result.
Figure 7.4: Graphical result of the second run of tree classification (Maximum tree
depth: 10)
The most interesting nodes are the second, the fourth and the
fifth leaf nodes, because the majority of their populations are
error transactions. An example of the interpretation is that,
for the second leaf node, a transaction whose value is
greater than -21,199.91 but less than -8,865.80 has an error
probability of 83.3%. However, the tree path begins with the
null value of the authorized person attribute, which simply means
that none of the authorized transactions were included in the tree
path. Therefore, only the amount attribute was taken into
consideration. Due to the fact that the transaction
amount varies by its nature, this result does not contribute
an interesting pattern.
Finally, a third run of tree classification was done. In this
run, the maximum tree depth remained the same, but the
Authorized Person attribute was chosen as the predefined
class label and the Batch Error attribute was switched to the
input field list. However, the result came out with a 59.69%
error rate, which is not considered satisfactory.
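The kind of amount-based split rule found in the second leaf node can be illustrated with a toy one-level decision tree (a decision stump); the data and the exhaustive threshold search below are hypothetical and far simpler than the tree classification algorithm in Intelligent Miner.

```python
def best_split(amounts, is_error):
    """Exhaustively pick the single threshold on `amounts` that best
    separates error from non-error records (a one-level decision tree)."""
    best = (None, -1.0)
    for t in sorted(set(amounts)):
        # Candidate rule: amount <= t predicts "error".
        correct = sum((a <= t) == e for a, e in zip(amounts, is_error))
        acc = correct / len(amounts)
        if acc > best[1]:
            best = (t, acc)
    return best  # (threshold, training accuracy)

# Hypothetical transactions: large negative amounts flagged as errors.
amounts  = [-9000.0, -8900.0, -50.0, 120.0, 300.0]
is_error = [True, True, False, False, False]
print(best_split(amounts, is_error))  # → (-8900.0, 1.0)
```

A full tree repeats this kind of search recursively on each branch, which is how the sixty-seven-level tree of the first run arises when the depth is left unlimited.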
In conclusion, the clustering techniques fit the objective of this
research better than the classification techniques. The first runs of both demographic
clustering and neural clustering were chosen as representatives of the data mining results
to compare with the results of ACL in section 7.4 -- Result Interpretations.
7.3.4.2. ACL
As mentioned earlier, the ACL software is customized especially for
audit work. Therefore, the sampling feature, which is illustrated in Figure 7.5, is
provided for the sample selection step. Notice that the discussion below is mainly based
on the help manual of the software.
Figure 7.5: Sampling feature of ACL
The sampling functions of ACL are quite rigid. Users first have
to specify the sample type, either monetary unit sample (MUS) or record sample. As
the name suggests, the MUS function is biased toward higher-value items. Simply put, a
transaction with a higher value has a higher probability of being selected than a transaction
with a lower value. It is useful for detailed testing because it provides a high level of
assurance that all material items in the population are subject to testing. Notice that the
population for MUS is the absolute value of the field being sampled.
On the other hand, the record sample is unbiased, which simply
means that each record has an equal chance of being selected and the transaction value
is not taken into account. Record sampling is most useful for control or compliance
testing, where the rate of errors is more important than the monetary value. Therefore,
record sampling was selected for this research.
Next, the sample parameters have to be chosen to specify the
specific method to be used to draw the samples. The methods include the fixed
interval, cell and random methods. Brief explanations of each method are as
follows:
- Fixed interval sample: An interval value and a random start
number must be specified. The sample set will
consist of the record at the start number and every item one
interval apart thereafter.
- Cell sample: The population is broken into groups of the size
of the interval, and one random item is chosen from each group
based on the random seed specified. Therefore, this method
is also known as a random interval sample.
- Random sample: The size of the sample, a random seed value
and the size of the population have to be specified. ACL
generates the required number of random numbers between one
and the specified population size based on the random seed,
and the selection is then made using those random numbers.
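The three methods can be sketched in Python as follows. This is an illustrative reimplementation based on the descriptions above, not ACL's actual algorithms, and the sample data are hypothetical.

```python
import random

def fixed_interval_sample(population, interval, start):
    """Take the record at `start` (1-based) and every `interval`-th after it."""
    return population[start - 1::interval]

def cell_sample(population, interval, seed):
    """Split into cells of `interval` records; pick one at random per cell."""
    rng = random.Random(seed)
    return [rng.choice(population[i:i + interval])
            for i in range(0, len(population), interval)]

def random_sample(population, size, seed):
    """Draw `size` distinct records using the given random seed."""
    rng = random.Random(seed)
    return rng.sample(population, size)

records = list(range(1, 259))  # population of 258, as in the test below
print(fixed_interval_sample(records, 50, 3))  # → [3, 53, 103, 153, 203, 253]
print(len(random_sample(records, 50, 42)))   # → 50
```

Note how only the random method frees the auditor from choosing an interval: exactly the trade-off discussed next, since interval-based selection presumes some understanding of how the records are ordered.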
To put it simply, when using ACL to generate samples,
auditors either have to have sampling criteria in mind or let the software choose for
them randomly. This might be very effective when auditors have a clear
understanding of the data they have. On the other hand, if the structure of the data is
ambiguous, only monetary unit sampling and random sampling can be used and, thus,
the interesting transactions may be completely missed.
As the focus of this research is the test of controls and the
structure of the available data is extremely unclear, random record sampling was
selected. However, notice that the samples can also be selected differently based on the
preliminary analysis from the data understanding step. Those possibilities are as
follows:
- Randomly selected from the whole population: With this
strategy, every transaction has an equal chance of being selected
regardless of the error flag or the authorized person.
- Randomly selected from all records except those
recorded as errors: In this case, all of the non-error
transactions have the same probability of being selected.
However, the potential for interesting pattern discovery is
uncertain and depends mainly on the experience and skill of
the auditor.
- Randomly selected from the transactions that were not
authorized: According to the preliminary analysis, all
erroneous transactions are unauthorized transactions.
Therefore, the chance that potentially inaccurate transactions
will be selected is assumed to be similar to the chance that
the error transactions will be selected.
- Randomly selected from the transactions that were not
authorized and were not recorded as errors: In this case, there
is a chance that inaccurate records that were not
recorded as errors will be selected.
From my point of view, the only possible benefit from
examining error records is that the patterns of errors can be specified. However, it is
not considered an efficient approach, especially when the number of error transactions
is large. On the other hand, if a small sample size is selected from a large population, it
is difficult to find any correlation among the samples. Therefore, the last option was
chosen for testing, so that the samples are selected from the most potentially inaccurate
transactions.
Fifty samples were selected from a population of 258 based on
100 random numbers. Details of the sample are provided in Appendix C. The
distribution of the transaction amounts of the sample is shown in Figure 7.6.
Figure 7.6: The transaction amount distribution of ACL samples
Notice that, of the fifty samples, three transactions have
relatively high amounts and only one has a very low amount. Otherwise, the transactions
do not differ much from the average amount of -72.67.
7.4. Result Interpretations
Before proceeding further, it might be worth reviewing some of the interesting
matters of the result. The details are as follows:
7.4.1. IBM’s DB2 Intelligent Miner for Data
From demographic clustering (the first run), the most interesting clusters
are Cluster5 and Cluster7, where only four and three transactions were identified,
respectively. These two clusters share the same pattern: a small number
of transactions with extremely high absolute values. These transactions can be
considered outliers because there are only a few of them and they were grouped
separately from the other transactions.
From neural clustering (the first run), there are two outstanding clusters,
Cluster4 and Cluster1. Cluster4 comprises 457 error transactions and
116 transactions authorized by an authorized person, "LOUHITA". Although the
reason why these transactions were grouped together is not provided and is subject to
further research, it is still undeniably interesting.
The other cluster, Cluster1, consists of twenty-eight transactions,
some of which were not authorized. Besides, the amounts of the transactions in this
cluster are relatively low. As the size of this cluster is only 0.01% of the population, it
can be considered an outlier as well.
The fifty samples selected based on the mining clustering results, in my opinion,
would comprise the following:
a) Seven transactions from Cluster5 and Cluster7 of the demographic
clustering result.
b) Twenty-eight transactions from Cluster1 of the neural clustering
result.
c) Five randomly selected transactions from the 116 transactions of Cluster4
of the neural clustering result.
d) One transaction from each cluster other than the above, except for
Cluster2 of the demographic clustering result, which contains all the error
transactions.
However, there is a chance that some records in a) and b) might be the
same. One of each duplicated sample should be eliminated so that it is counted only
once; in that case, a substitute sample should be selected from c). Unfortunately, the
transactions in this data set do not have identification information, so there is no way
to know whether there are any duplicates until further investigation is conducted.
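The assembly rule above can be sketched as follows, under the assumption that each transaction carries a unique record identifier (which, as noted, the real data set lacks); all identifiers in the demo are hypothetical.

```python
import random

def assemble_samples(demo_c5_c7, neural_c1, neural_c4_pool,
                     one_per_other_cluster, seed=1):
    """Build the sample set: a) + b) deduplicated, substitutes for any
    duplicates drawn with the five random picks from the Cluster4
    pool c), then one transaction per remaining cluster d)."""
    rng = random.Random(seed)
    chosen = list(dict.fromkeys(demo_c5_c7 + neural_c1))       # a) + b)
    duplicates = len(demo_c5_c7) + len(neural_c1) - len(chosen)
    pool = [r for r in neural_c4_pool if r not in chosen]
    chosen += rng.sample(pool, 5 + duplicates)                 # c)
    chosen += [r for r in one_per_other_cluster if r not in chosen]  # d)
    return chosen

# Hypothetical record ids: one record ("d0") appears in both a) and b).
demo = [f"d{i}" for i in range(7)]
c1 = ["d0"] + [f"n{i}" for i in range(27)]
c4 = [f"c{i}" for i in range(116)]
others = [f"o{i}" for i in range(10)]
sample = assemble_samples(demo, c1, c4, others)
print(len(sample))  # 50 -- the duplicate was replaced from the c) pool
```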
7.4.2. ACL
The sample set derived by ACL consists of fifty transactions
randomly selected from the transactions that were not authorized but also not recorded
as errors. It includes four transactions with outstanding absolute values and forty-six
indifferent small ones.
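A minimal analogue of ACL's random record sampling, as used here, can be sketched in standard Python: only the sample size and a seed value need to be specified, and the same seed always reproduces the same sample. The record numbers are hypothetical.

```python
import random

# Random record sampling: specify only the population, the sample
# size, and a seed. Re-running with the same seed reproduces the
# sample, which matters for audit review. Record numbers are made up.
def random_record_sample(population, size, seed):
    return random.Random(seed).sample(population, size)

records = list(range(1, 1001))
first = random_record_sample(records, 50, seed=42)
second = random_record_sample(records, 50, seed=42)
assert first == second  # reproducible given the same seed
```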
Based on the results from both DB2 Intelligent Miner for Data and ACL, the
comparison between the two sample selections is summarized in table 7.2.

Interestingness and Relevance
Intelligent Miner for Data: The number of samples can be varied from each cluster, so
more focus can be put on the more interesting ones, while at least one sample is still
selected from each group of the population. However, the criteria of clustering are in
question and, thus, auditors' judgement is still required to preliminarily analyze the
interestingness and the relevance of the clusters.
ACL: It is fair to say that the samples selected by ACL are relatively ordinary; the
chance of finding interesting transactions is low. Besides, although the test is
satisfactory, it cannot verify the correctness of the whole population, not even at the
group level. Thus, the decision on sample size is very important in order to ensure that
the samples are, to some extent, a good representative of the population.

Time Spent
Intelligent Miner for Data: Excluding the time spent learning how to use the software,
the data mining test itself takes only a few minutes per run. However, in order to find
the optimum parameters for each type of test, many iterations have to be run, which
requires extra time for analysis and decision making.
ACL: Running the sampling function takes only a few minutes, and only the sample
size and seed value must be specified. Therefore, little time needs to be spent on the
decision process.

Required Technical Knowledge
Intelligent Miner for Data: Technical knowledge about databases and an understanding
of data mining are required in all data mining processes, especially in the data
understanding phase. Besides, for auditors without a technical background, it is still
not easy to learn how to use data mining software, even though it is much more
user-friendly at present.
ACL: The features of ACL are somewhat similar to those of general-purpose software
packages. Thus, it does not require much effort to learn how to use.

Level of Automation
Intelligent Miner for Data: Although the most important and most difficult process --
clustering -- is performed automatically, auditors are still required to assess whether
the clustering result is interesting and whether another run should be performed.
Besides, the choice of samples from each cluster has to be made by the auditors as
well.
ACL: As the features of ACL are not that flexible, auditors need only specify a few
parameter values in order to run the test. Otherwise, all tasks are done automatically.

Risk and Constraints
Intelligent Miner for Data: As the reason why each cluster is grouped together is
ambiguous, measuring whether the clustering result is interesting is a matter of
judgement. However, as the samples are selected from clusters of records sharing
similar characteristics, there is more assurance that the sample set is a better
representative of the population.
ACL: The sample size is very important in random record sampling. If the sample size
is too low, the chance of finding interesting matters is low. On the other hand, if the
sample size is too large, it is not feasible to perform the test. Besides, the choice of
population from which the samples are to be selected is also significant and requires
auditors' judgement.

Table 7.2: Comparison between results of IBM's DB2 Intelligent Miner for Data and
ACL
From the discussion above, one may conclude that the samples selected by the
clustering functions of Intelligent Miner for Data are more interesting than the samples
selected by ACL. However, using Intelligent Miner for Data requires more technical
knowledge than using ACL. In my opinion, the technical knowledge required for using
Intelligent Miner for Data is tolerable, especially when the result is satisfactory.
Comments and suggestions about the results were also solicited from five
auditors at different experience levels. For confidentiality reasons, however, they
remain anonymous. All of them agreed that the result from Intelligent Miner for Data
is more interesting and that, if the mining result were available, they would certainly
choose the sample selected from it. Nevertheless, the result frequently is not specific
enough, and additional investigation is still required. Two out of five said that the
required technical knowledge makes data mining less appealing; only if data mining
software were better customized for audit work would they consider using it. Finally,
professional judgement is not an issue because it is required in any case.
In conclusion, based on the results of this research, there are some interesting
patterns discovered automatically by data mining techniques that cannot be found by
generalized audit software. This finding confirms hypothesis H2. It is worth noting,
however, that this does not prove whether those interesting patterns are material
matters.
Although data mining techniques are able to draw out potentially interesting
information from the data set, it is not realistic to claim that data mining is an
advanced computer-assisted auditing tool, due to its deployment difficulties and the
ambiguity of its results. Therefore, hypothesis H1 is only partly confirmed and is
subject to further research.
7.5. Summary
The hypotheses of this thesis are that the clustering techniques of data mining will
find more interesting groups of samples than those selected by using generalized
audit software and that, consequently, data mining may be considered an enhancement
of computer-assisted audit tools.
The research process consists of business understanding, data understanding,
data preparation, and software deployment phases. Many iterations of the first three
phases were performed in order to understand the data better and to determine what to
study. Finally, the chosen test is the sample selection step of the control testing
process, concentrating on authorized persons, batch errors and the transaction balances.
As data mining methods, demographic clustering, neural clustering and tree
classification were chosen.
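Demographic and neural clustering are Intelligent Miner-specific algorithms. As a generic stand-in, a one-dimensional k-means over transaction amounts illustrates the shared idea, grouping records by similarity so that extreme values separate from the bulk; this is not the algorithm the thesis actually ran, and the amounts are invented.

```python
# Generic one-dimensional k-means over transaction amounts -- a
# stand-in illustration of clustering in general, not the demographic
# or neural algorithms that Intelligent Miner implements.
def kmeans_1d(values, k, iters=20):
    # Spread the initial centers across the sorted value range.
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    groups = [[] for _ in centers]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

amounts = [-72.0, -70.5, -75.3, 12.4, 15.0, 50000.0, 51710000.0]
centers, groups = kmeans_1d(amounts, k=3)
# The two extreme amounts end up isolated in their own clusters,
# mirroring how small clusters flag outlier transactions.
```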
The prepared data was used as input to both IBM's DB2 Intelligent Miner for Data
and ACL. While the sampling choice in Intelligent Miner for Data is based on
auditors' judgement regarding the derived clusters, the sample set selected by ACL is
generated automatically. Auditors' judgement is indispensable in determining which
cluster is more interesting and should be focused on.
Based on the results of this research, the conclusion is that the result of
Intelligent Miner for Data is more interesting than the result of ACL. Although greater
technical knowledge is required by Intelligent Miner for Data, it is at an acceptable
level. However, the automated capability of data mining cannot be fully appreciated, as
the auditors' judgement is still required to interpret the results. On the other hand,
ACL is much easier to use, but the quality of the result can be compromised. A
comparison of the results derived from both software packages, based on my opinion,
is summarized in table 7.3. It should be remembered, however, that all analysis is
based on certain unavoidable assumptions about the data, and further investigation is
required to confirm these interpretations.
Issue                           Intelligent Miner for Data   ACL
Interestingness and Relevance   Higher                       Lower
Time Spent                      Slightly Higher              Slightly Lower
Required Technical Knowledge    Higher                       Lower
Level of Automation             Lower                        Higher
Risk and Constraint             Lower                        Higher

Table 7.3: Summary of comparison between the sample selection results of Intelligent
Miner for Data and ACL
The consensus of the auditors' comments on these results is that the results of the data
mining clustering techniques are more interesting than the results of ACL, although
they require a high level of auditors' judgement. However, the auditors also agree that
further research is still required to determine whether the sample set from Intelligent
Miner for Data is a better option than the sample set randomly selected by ACL.
Moreover, they feel more comfortable working with ACL due to its well-customized
features and user-friendly interface.
In sum, it is unarguable that, within the scope of this research, the result of data
mining is more interesting than the result of normal generalized audit software, even
though the data was incomplete and the supporting knowledge and information were
limited. However, the conclusion that data mining is a superior computer-assisted audit
tool cannot be made until more extensive research is conducted.
8. Conclusion
8.1. Objective and Structure
This chapter attempts to provide an overall perspective of this thesis,
including a brief summary of the whole study in section 8.2. The results and
implications are discussed in section 8.3, the restrictions and constraints of the study in
section 8.4, and suggestions for further research in section 8.5. A final conclusion is
given in section 8.6.
8.2. Research Perspective
This master's thesis aims to find out whether data mining is a promising tool for
the auditing profession. Due to the dramatically increased volume of accounting and
business data and the increased complexity of business, auditors can no longer rely
solely on their instincts and professional judgement. Lately, auditors have realized that
technology, especially intelligent software, is more than just a required tool. New tools,
including generalized audit software, have been adopted by the audit profession.
Another relatively new field that has received greater attention from all
businesses is data mining. It has been applied to fraud detection, forensic
accounting and security evaluation in business applications related to auditing.
The allure of data mining software is its automated capability.
However, although data mining has been around for more than a decade, the
integration of data mining and auditing is still esoteric. Since the biggest cost of
auditing is professional staff expense, the employment of data mining seems to make
good sense in this profession.
In this thesis, the ideal opportunities for integrating data mining with
audit work are explored. However, due to the restrictions and limitations of the
available research data, the test could not be done extensively. The only area of testing
is the sample selection step of the test of control process. The data provided by SVH
PricewaterhouseCoopers Oy was studied with both data mining software (IBM's DB2
Intelligent Miner for Data) and generalized audit software (ACL), and the results from
both studies were compared and analyzed.
In general, sample selection can be done differently depending on the focus.
In this thesis, the focus is the relationship between authorized persons, errors and
transaction amounts. With Intelligent Miner for Data, the demographic clustering and
neural clustering functions were selected to draw out the relationship patterns among
the data. The results were analyzed, and the choice of sample was based on that
analysis.
For ACL, the random record samples were selected automatically based on the
specified random number seed and sample set size. The population from which the
samples were selected was the group of records with the most potential to be
erroneous.
8.3. Implications of the Results
Based on the feedback from the five auditors and my observations, it can be
concluded that, within the scope of this research, the results derived from data mining
techniques are more interesting than those derived from generalized audit software.
However, this conclusion is predicated on certain unavoidable assumptions about the
data set and, thus, is not conclusive. Further investigation of whether the data mining
result is genuinely more interesting is necessary but, unfortunately, due to data
limitations and time constraints, it could not be done in this research.
However, this by no means implies that the results of generalized audit software
are not useful. If the transaction archive to be audited is not that massive and the
relationships between the transactions are unambiguous, generalized audit software is
easier to use and not a bad choice at all.
In sum, the hypothesis that the clustering techniques of data mining will find
more interesting groups of samples for the test of controls, compared to sampling by
generalized audit software, is confirmed. However, the other hypothesis, that data
mining will enhance computer-assisted audit tools by automatically discovering
interesting information from the data, is still unclear. Whether the information derived
from data mining techniques is really interesting cannot be determined until further
investigation is performed. Besides, auditors' judgement is still a prerequisite, so the
level of automation cannot be fully appreciated.
8.4. Restrictions and Constraints
The restrictions and constraints of this research are as follows:
8.4.1. Data Limitation
Although the available data was taken from a general ledger system, it was
not a complete collection of the general ledger transactions. The limitations include
incomplete data, missing information, and limited understanding of the data. Details
are as follows:
8.4.1.1. Incomplete data
In the data understanding phase, the objective of which is to
better understand the nature of the data in general, the data set was found to be an
incomplete general ledger. This is because the sum of all transaction amounts (with a
plus sign for debit amounts and a minus sign for credit amounts) does not zero out.
Besides, neither the sum of the amounts for each recording date nor for each document
number is zero. Therefore, the assumption is that this is a partial listing of the general
ledger, which is considered complete for cost center analysis.
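The completeness check just described, signed amounts summing to zero overall and within each document number, can be sketched as follows; the transactions are invented for the example.

```python
from collections import defaultdict

# In a complete general ledger the signed amounts (plus for debits,
# minus for credits) sum to zero overall and within each document
# number. Any non-zero group is evidence of an incomplete listing.
def unbalanced_groups(transactions, key):
    totals = defaultdict(float)
    for t in transactions:
        totals[t[key]] += t["amount"]
    return sorted(k for k, total in totals.items() if round(total, 2) != 0)

txns = [
    {"doc": "D1", "amount": 100.0},   # debit
    {"doc": "D1", "amount": -100.0},  # matching credit
    {"doc": "D2", "amount": 250.0},   # credit leg missing
]
print(unbalanced_groups(txns, "doc"))  # ['D2']
```

Running the same function with the recording date as the key gives the per-date version of the test.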
Unquestionably, for data mining, which aims at finding hidden
interesting patterns, it is much better when the data is complete. However, obtaining a
reasonably large set of accounting transactions is not simple either. Moreover, for tests
of controls in general, it is not that critical to have all the accounting transactions as
the sampling population; this is especially true when auditors have some selection
criteria in mind and an extended test is allowed. In summary, this data set is fairly
satisfactory for a pilot test such as this research, but more comprehensive data is
required for more detailed study.
8.4.1.2. Missing information
As mentioned earlier, all sensitive information, such as account
names, was eliminated from the data set. Besides, the supporting information for
aggregated analysis -- such as the chart of accounts, the cost center structure or the
transaction mapping -- is not available. Without knowing what is what, the analysis is
extremely difficult, if not impossible. Therefore, scrutinized testing and analytical
review were no longer feasible for this research.
For detailed testing, it is normal to select samples according
to the audit areas or, in other words, account groups. Therefore, the chart of
accounts, or at least knowledge of the data structure, is remarkably important. This
situation limited the scope of testing to tests of controls only.
8.4.1.3. Limited Understanding
In a normal audit engagement, one of the most necessary
requirements is the ability to investigate further when a matter arises. Generally, this
can be done by reviewing supporting documents and interviewing the responsible
persons. If the result of further investigation is unsatisfactory, the scope of the test may
be expanded later on as well.
However, further investigation could not be done in this research.
The one and only data resource is a limited version of a data file obtained from SVH
PricewaterhouseCoopers Oy. Although the analysis process includes opinions gathered
from many auditors and competent persons, these are based on assumptions and
judgements.
8.4.2. Limited Knowledge of Software Packages
Although both software packages chosen for this research are well
customized and user-friendly, there is a chance that some useful features might have
been overlooked due to limited knowledge about the software. However, since this
research is just a pilot test of the subject, self-study, including reading manuals and
inquiring of competent persons, is considered sufficient.
Nevertheless, it is worth noting that when an auditing firm decides to
study, use or implement a software package, it is a good idea to have the responsible
personnel educated by experts in that software. This includes training the real users,
providing real-time support and sharing the discovered knowledge among team
members.
8.4.3. Time Constraints
This might be the most important constraint of this research. The more
time spent, the more extensive the test can be. Further testing with this limited data
might yield additional knowledge through trial and error until more interesting matters
are found. However, in my opinion, this is not an intelligent strategy, and it would not
contribute more satisfactory results relative to the time spent. Therefore, the test was
restricted to the most promising one, and the rest is left to future research when more
complete data is available.
8.5. Suggestions for Further Research
As mentioned above, research on the integration of data mining and
auditing can be extended considerably, especially when complete data is available.
Examples are the possible areas of integration in table 5.1 and the examples of tests
that can be performed in the execution phase, when only the general ledger
transactions of the current year are available, in table 5.2.
However, it is important to note that it is far more feasible and efficient to work
on a complete data set. This includes the privilege to scrutinize all relevant supporting
information and the permission to perform more extensive investigation if necessary.
8.6. Summary
Though the integration of data mining techniques and audit processes is
a relatively new field, data mining is considered useful and helps reduce cost pressure
in many business applications related to auditing. Therefore, this thesis aimed to
explore the possibility of integrating data mining into actual audit engagement
processes. However, due to the data and other limitations, the study could not be done
extensively. Only the sample selection step of the test of controls was studied.
The results of this research show that data mining techniques might
be able to contribute something to this profession, even when the data is incomplete
and all the analyses are based on assumptions. However, the results also do not prove
that data mining is a good fit for every audit task. It requires a substantial effort to
learn how to employ data mining techniques and to understand the implications of the
results.
However, if auditing firms have vast quantities of data to be audited and the
auditors are familiar with the nature of the transactions and the expected error patterns,
then data mining does provide an efficient means of surfacing interesting matters. It is,
however, still a long way from possessing the "artificial intelligence" needed to fully
automate audit testing for the auditors.
List of Figures
Figure 2.1: Summary of audit engagement processes (p. 15)
Figure 3.1: ACL software screenshot (Version 5.0 Workbook) (p. 19)
Figure 4.1: Four level breakdown of the CRISP-DM data mining methodology (p. 26)
Figure 4.2: Example of association rule (p. 33)
Figure 4.3: A decision tree classifying transactions into five groups (p. 38)
Figure 4.4: A neural network with two hidden layers (p. 39)
Figure 5.1: Basic structure of general ledger (p. 59)
Figure 6.1: IBM's DB2 Intelligent Miner for Data Version 6.1 screenshot (p. 70)
Figure 7.1: Results of neural clustering method with six input attributes (p. 79)
Figure 7.2: Graphical result of the first run of demographic clustering (parameter values 2, 9, 2, 0.5) (p. 85)
Figure 7.3: Graphical result of the first run of neural clustering (parameter values 5, 9) (p. 87)
Figure 7.4: Graphical result of the second run of tree classification (parameter values 2, 9, 2, 0.5) (p. 90)
Figure 7.5: Sampling feature of ACL (p. 91)
Figure 7.6: The transaction amount distribution of ACL samples (p. 94)
List of Tables
Table 3.1: ACL features used in assisting each step of audit processes (p. 20)
Table 4.1: Summarization of appropriate data mining techniques for each data mining method (p. 41)
Table 5.1: Possible areas of data mining and audit processes integration (p. 49)
Table 5.2: Examples of tests of each audit step in execution phase (p. 60)
Table 5.3: Comparison between GAS and data mining package characteristics (p. 67)
Table 7.1: Definitions and default values of demographic clustering parameters (p. 84)
Table 7.2: Comparison between results of IBM's DB2 Intelligent Miner for Data and ACL (p. 96)
Table 7.3: Summary of comparison between sample selection result of Intelligent Miner for Data and ACL (p. 100)
References
a) Books and Journals
American Institute of Certified Public Accountants (AICPA) (1983), Statement
on Auditing Standards (SAS) No. 47: Audit Risk and Materiality in Conducting an
Audit.
American Institute of Certified Public Accountants (AICPA) (1988), Statement
on Auditing Standards (SAS) No. 56: Analytical Procedures.
Arens, Alvin A. & Loebbecke, James K. (2000), Auditing: An Integrated
Approach, New Jersey: Prentice-Hall.
Bagranoff, Nancy A. & Vendrzyk, Valaria P. (2000), The Changing Role of IS
Audit Among the Big Five US-Based Accounting Firms, Information Systems
Control Journal: Volume 5, 2000, 33-37.
Berry, Michael J. A. & Linoff, Gordon S. (2000), Mastering Data Mining, New
York: John Wiley & Sons Inc.
Berson, Alex, Smith, Stephen & Thearling, Kurt (2000), Building Data Mining
Applications for CRM, McGraw-Hill Companies Inc.
Bodnar, George H. & Hopwood, William S. (2001), Accounting Information
Systems, New Jersey: Prentice-Hall.
Committee of Sponsoring Organizations (COSO) (1992), Internal Control -
Integrated Framework.
Connolly, Thomas M., Begg, Carolyn E. & Strachan, Anne D. (1999), Database
Systems – A Practical Approach to Design, Implementation, and Management,
Addison Wesley Longman Limited.
Cross Industry Standard Process for Data Mining (CRISP-DM) (2000),
CRISP-DM 1.0 Step-by-Step Data Mining Guide, www.crisp-dm.org/.
Gargano, Michael L. & Raggad, Bel G. (1999), Data Mining – A Powerful
Information Creating Tool, OCLC Systems & Services, Volume 15, Number 2,
1999, 81-90.
Glover, Steven, Prawitt, Douglas & Romney, Marshall (1999), Software
Showcase, The Internal Auditor, Volume 56, Issue 4, August 1999, 49-56.
Hall, James A. (2000), Information Systems Auditing and Assurance, South-Western
College Publishing.
Han, Jiawei & Kamber, Micheline (2000), Data Mining: Concepts and
Techniques, San Francisco: Morgan Kaufmann Publisher.
Hand, David, Mannila, Heikki & Smyth, Padhraic (2001), Principles of Data
Mining, MIT Press.
IBM Corporation (2001a), Intelligent Miner for Data - Data Mining, IBM
Corporation.
IBM Corporation (2001b), Data Mining for Detecting Insider Trading in Stock
Exchange with IBM DB2 Intelligent Miner for Data, IBM Corporation.
IBM Corporation (2001c), Mining Your Own Business in Banking Using DB2
Intelligent Miner for Data, IBM Corporation.
Lee, Sang Jun & Siau, Keng (2001), A Review of Data Mining Techniques,
Industrial Management & Data Systems: Volume 101, Number 1, 2001, 44-46.
Ma, Catherine, Chou, David C. & Yen, David C. (2000), Data Warehousing,
Technology Assessment and Management, Industrial Management & Data
Systems: Volume 100, Number 3, 2000, 125-135.
McFadden, Fred R., Hoffer, Jeffrey A. & Prescott, Mary B. (1999), Modern
Database Management, Addison-Wesley Educational Publishers Inc.
Moscove, Stephen A., Simkin, Mark G. & Bagranoff, Nancy A. (2000), Core
Concepts of Accounting Information Systems, New York: John Wiley & Sons Inc.
Needleman, Ted (2001), Audit Tools, The Practical Accountant: March 2001, 38-40.
Rezaee, Zabihollah, Elam, Rick & Shabatoghlie, Ahmad (2001), Continuous
Auditing: The Audit of the Future, Managerial Auditing Journal: Volume 16,
Number 3, 2001, 150-158.
Rud, Olivia Parr (2001), Data Mining Cookbook, New York: John Wiley & Sons
Inc.
b) Web Pages
ACL Service Limited (2002), ACL for Windows,
www.acl.com/en/softwa/softwa_aclwin.asp (Accessed on January 4, 2002)
Audimation Services Inc. (2002), IDEA – Setting The Standard in Ease of Use,
www.audimation.com/idea.html (Accessed on January 4, 2002)
DB Miner Technology Inc. (2002), DBMiner Insight – The Next Generation of
Business Intelligence, www.dbminer.com/products/index.html (Accessed on
February 2, 2002)
Eurotek Communication Limited (2002), How To Choose a PC Auditing Tool,
www.eurotek.co.uk/howchoose.htm (Accessed on March 12, 2002)
IBM Corporation (2002), DB2 Intelligent Miner for Data,
www-3.ibm.com/software/data/iminer/fordata/ (Accessed on February 5, 2002)
Microsoft Corporation (2002), Microsoft Data Analyzer – The Office Analysis
Solution, www.microsoft.com/office/dataanalyzer/ (Accessed on February 2, 2002)
SAS Institute Inc. (2002), Uncover Gems of Information – Enterprise Miner,
www.sas.com/products/miner/index.html (Accessed on March 12, 2002)
SAS Institute Inc. (2002), SAS Analytic Intelligence,
www.sas.com/technologies/analytical_intelligence/index.html (Accessed on March
12, 2002)
SPSS Inc. – Business Intelligence Department (2002), Effectively Guide Your
Organization's Future with Data Mining, www.spss.com (Accessed on February
21, 2002)
Appendix A: List of Columns of Data Available

No.  Original Name         Translated Name
1.   AS_RYHMA_GLM          Customer Group
2.   CHARACTER1            Character
3.   EROTIN1               Separator1
4.   EROTIN3               Separator3
5.   KAUSI_TUN             Period ID
6.   KLO_AIKA              Period Date
7.   LAJI_TUN              Type ID
8.   NIPPU_JNO             Batch Queue
9.   NIPPU_JNO_VIRHE       Batch Error
10.  NIPPU_KAUSI_TUN       Batch Period
11.  NIPPU_KAUSI_TYYPPI    Batch Technical Number
12.  NIPPU_KIRJ_PVM        Batch Date
13.  NIPPU_MLK_1           Batch Point 1
14.  NIPPU_MLK_M_JNO       Batch Point Queue
15.  NIPPU_MLK_T_JNO       Batch Point Queue
16.  NIPPU_MLK_TUN         Batch Point ID
17.  NIPPU_TEKN_NRO        Batch Technical Number
18.  NIPPU_TUN_GLM         Batch ID
19.  NIPPU_VALKAS_KDI      Batch Currency
20.  ORG_TUN_LINJA_GLM     Foreign Cost Center
21.  ORG_TUN_MATR_GLM      Cost Center
22.  PAIVITYS_PVM          Transaction Date
23.  SELV_TILI_TUNNUS      Authorized Person
24.  TILI_NO               Account Number
25.  TILI_NO_AN            Reconciliation Account Number
26.  TOSITE_KASPVM         Document Date
27.  TOSITE_NO             Document Number
28.  TUNNUS                ID
29.  TUN_KUMPP_GLM         Partner ID
30.  VAL_KURSSI            Exchange Rate
31.  VAL_KURSSI2           Exchange Rate 2
32.  VIENTI_JNO            Entry Queue
33.  VIENTI_KASTILA_GLM    Status
34.  VIENTI_M_VAL_DES_1    Currency Amount Point
35.  VIENTI_MAARA_ALKP     Original Amount (FIM)
36.  VIENTI_MAARA_M        Amount (EUR)
37.  VIENTI_MAARA_M_DK     Debit / Credit
38.  VIENTI_MAARA_T        Amount
39.  VIENTI_MAARA_VAL_1    Currency Amount 1
40.  VIENTI_MAARA_VAL_2    Currency Amount 2
41.  VIENTI_SELITE         Explanation
42.  VIENTI_SELITE2        Explanation2
43.  VIENTI_VALPAIV_KE     Center
44.  YHTIO_TUN_GLM         Company Code
45.  YHTIO_TUN_GLM_AN      Company Reconciliation Code
46.  YHTIO_TUN_KUMPP       Inter-company Code
Appendix B: Results of IBM's Intelligent Miner for Data

a) Preliminary Neural Clustering (with Six Attributes)

User Specified Parameters:
  Maximum Number of Passes:   5
  Maximum Number of Clusters: 9

Mining Run Outputs:
  Number of Passes Performed: 5
  Number of Clusters:         9
  Deviation:                  0.732074

Cluster Characteristics:
Id   Absolute   %
0    129267     26.13
1    51018      10.31
2    79163      16.00
3    11592      2.34
4    5987       1.21
5    264        0.05
6    43665      8.83
7    31327      6.33
8    142422     28.79

Reference Field Characteristics (For All Field Types):
(Field Types: ( ) = Supplementary, CA = Categorical, CO = Continuous Numeric, DN = Discrete Numeric)
Id   Name               Type   Modal Value   Modal Frequency (%)   No. of Possible Values/Buckets
1    NIPPU_JNO_VIRHE    CA     0             99.91                 2
2    ORG_TUN_MATR_GLM   CA     7989          3.31                  748
3    SELV_TILI_TUNNUS   CA     AUT.HYV       35.12                 18
4    TOSITE_KASPVM      CA     2000-02-29    3.41                  303
5    TOSITE_NO          CO     250           91.14                 15
6    VIENTI_MAARA_M     CO     50000         86.37                 12

Reference Field Characteristics (For Numeric Fields Only):
Id   Name             Minimum Value   Maximum Value   Mean      Standard Deviation
5    TOSITE_NO        1               23785           867.926   3283.79
6    VIENTI_MAARA_M   -5.64084E7      5.171E7         9.72619   271013
b) Demographic Clustering: First Run

User Specified Parameters:
  Maximum Number of Passes:   2
  Maximum Number of Clusters: 9
  Improvement Over Last Pass: 2
  Similarity Threshold:       0.5

Mining Run Outputs:
  Number of Passes Performed: 2
  Number of Clusters:         8
  Improvement Over Last Pass: 0
  Global Condorcet Value:     0.6933

Cluster Characteristics:
Id   Absolute   %       Condorcet Value
0    493846     99.83   0.6933
1    73         0.01    0.6916
2    457        0.09    0.9442
3    174        0.04    0.6765
4    117        0.02    0.7988
5    4          0.00    0.9337
6    31         0.01    0.7918
7    3          0.00    0.8095

Similarity Between Clusters (Similarity Filter: 0.25):
Cluster 1   Cluster 2   Similarity
0           1           0.46
0           2           0.44
0           3           0.41
0           4           0.39
0           5           0.34
0           6           0.34
0           7           0.35
1           3           0.42
1           4           0.43
1           5           0.44
1           6           0.42
1           7           0.39
3           4           0.48
3           5           0.43
3           6           0.43
3           7           0.36
4           5           0.36
4           6           0.43
4           7           0.47
5           6           0.48
5           7           0.56
6           7           0.45

Reference Field Characteristics (For All Field Types):
Id   Name               Type   Modal Value   Modal Frequency (%)   No. of Possible Values/Buckets   Condorcet Value
1    NIPPU_JNO_VIRHE    CA     0             99.91                 2                                0.9982
2    SELV_TILI_TUNNUS   CA     AUT.HYV       35.21                 18                               0.1911
3    VIENTI_MAARA_M     CO     50000         86.37                 12                               0.8871

Reference Field Characteristics (For Numeric Fields Only):
Id   Name             Minimum Value   Maximum Value   Mean      Standard Deviation   Distance Unit
3    VIENTI_MAARA_M   -5.64084E7      5.171E7         9.72619   271013               135506.453
c) Demographic Clustering: Second Run

User Specified Parameters:
  Maximum Number of Passes:   2
  Maximum Number of Clusters: 5
  Improvement Over Last Pass: 2
  Similarity Threshold:       0.5

Mining Run Outputs:
  Number of Passes Performed: 2
  Number of Clusters:         5
  Improvement Over Last Pass: 0
  Global Condorcet Value:     0.6933

Cluster Characteristics:
Id   Absolute   %       Condorcet Value
0    493846     99.83   0.6933
1    91         0.02    0.6803
2    457        0.09    0.9442
3    178        0.04    0.6706
4    133        0.03    0.6519

Similarity Between Clusters (Similarity Filter: 0.25):
Cluster 1   Cluster 2   Similarity
0           1           0.44
0           2           0.44
0           3           0.41
0           4           0.38
1           3           0.43
1           4           0.42
3           4           0.47

Reference Field Characteristics (For All Field Types):
Id   Name               Type   Modal Value   Modal Frequency (%)   No. of Possible Values/Buckets   Condorcet Value
1    NIPPU_JNO_VIRHE    CA     0             99.91                 2                                0.9982
2    SELV_TILI_TUNNUS   CA     AUT.HYV       35.21                 18                               0.1911
3    VIENTI_MAARA_M     CO     50000         86.37                 12                               0.8871

Reference Field Characteristics (For Numeric Fields Only):
Id   Name             Minimum Value   Maximum Value   Mean      Standard Deviation   Distance Unit
3    VIENTI_MAARA_M   -5.64084E7      5.171E7         9.72619   271013               135506.453
d) Neural Clustering: First Run
- 123 -
User Specified Parameters
Mining Run Outputs
Maximum Number of Passes:
5
Maximum Number of Clusters: 9
Number of Passes Performed: 5
Number of Clusters:
7
Deviation:
0.0980459
Cluster Characteristics:
Id
Cluster Size
Absolute
Id
%
Cluster Size
Absolute
%
0
162528
32.85
5
79773
16.13
1
54463
11.01
6
69742
14.10
2
11221
2.27
8
19814
4.01
4
9421
1.90
Reference Field Characteristics (For All Field Types):

Id  Name              Type  Modal Value  Modal Frequency (%)  No. of Possible Values/Buckets
1   NIPPU_JNO_VIRHE   CA    0            99.91                2
2   SELV_TILI_TUNNUS  CA    AUT.HYV      35.21                18
3   VIENTI_MAARA_M    CO    50000        86.37                12
Reference Field Characteristics (For Numeric Fields Only):

Id  Name            Minimum Value  Maximum Value  Mean     Standard Deviation
3   VIENTI_MAARA_M  -5.64084E7     5.171E7        9.72619  271013
e) Neural Clustering: Second Run
User Specified Parameters:
  Maximum Number of Passes:    5
  Maximum Number of Clusters:  4

Mining Run Outputs:
  Number of Passes Performed:  5
  Number of Clusters:          3
  Deviation:                   0.272368
Cluster Characteristics:

Id  Cluster Size (Absolute)  Cluster Size (%)
0   173749                   35.12
1   95                       0.02
3   320861                   64.86
Reference Field Characteristics (For All Field Types):

Id  Name              Type  Modal Value  Modal Frequency (%)  No. of Possible Values/Buckets
1   NIPPU_JNO_VIRHE   CA    0            99.91                2
2   SELV_TILI_TUNNUS  CA    AUT.HYV      35.21                18
3   VIENTI_MAARA_M    CO    50000        86.37                12
Reference Field Characteristics (For Numeric Fields Only):

Id  Name            Minimum Value  Maximum Value  Mean     Standard Deviation
3   VIENTI_MAARA_M  -5.64084E7     5.171E7        9.72619  271013
f) Neural Clustering: Third Run
User Specified Parameters:
  Maximum Number of Passes:    5
  Maximum Number of Clusters:  16

Mining Run Outputs:
  Number of Passes Performed:  5
  Number of Clusters:          9
  Deviation:                   0.0380213
Cluster Characteristics:

Id  Cluster Size (Absolute)  Cluster Size (%)
0   162528                   32.85
3   54463                    11.01
5   11221                    2.27
8   9421                     1.90
10  12939                    2.62
11  79773                    16.13
12  69742                    14.10
14  19814                    4.01
15  74804                    15.12
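The percentage column in the neural clustering tables is simply each cluster's share of the full 494,705-record input, and in the third run the nine clusters partition the data exactly. A minimal check against the third run's tabulated sizes:

```python
# Cluster sizes from the third neural clustering run, as tabulated above.
cluster_sizes = {
    0: 162528, 3: 54463, 5: 11221, 8: 9421, 10: 12939,
    11: 79773, 12: 69742, 14: 19814, 15: 74804,
}

total = sum(cluster_sizes.values())
shares = {cid: round(100 * size / total, 2) for cid, size in cluster_sizes.items()}

print(total)       # 494705: the nine clusters cover every input record
print(shares[0])   # 32.85, matching the table
```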
Reference Field Characteristics (For All Field Types):

Id  Name              Type  Modal Value  Modal Frequency (%)  No. of Possible Values/Buckets
1   NIPPU_JNO_VIRHE   CA    0            99.91                2
2   SELV_TILI_TUNNUS  CA    AUT.HYV      35.21                18
3   VIENTI_MAARA_M    CO    50000        86.37                12
Reference Field Characteristics (For Numeric Fields Only):

Id  Name            Minimum Value  Maximum Value  Mean     Standard Deviation
3   VIENTI_MAARA_M  -5.64084E7     5.171E7        9.72619  271013
g) Tree Classification: First Run
Internal Node          Class  Records  Errors  Purity (%)
0                      0      494705   457     99.9
0.0                    1      715      258     63.9
0.0.0                  0      137      32      76.6
0.0.1                  1      578      153     73.5
0.0.1.0                1      337      38      88.7
0.0.1.1                1      241      115     52.3
0.0.1.1.0              1      238      112     52.9
0.0.1.1.0.0            0      51       22      56.9
0.0.1.1.0.1            1      187      83      55.6
0.0.1.1.0.1.0          1      36       10      72.2
0.0.1.1.0.1.1          1      151      73      51.7
0.0.1.1.0.1.1.0        0      5        0       100.0
0.0.1.1.0.1.1.1        1      146      68      53.4
0.0.1.1.0.1.1.1.0      1      70       28      60.0
0.0.1.1.0.1.1.1.1      0      76       36      52.6
0.0.1.1.0.1.1.1.1.0    0      20       6       70.0
0.0.1.1.0.1.1.1.1.1    1      56       26      53.6
0.0.1.1.0.1.1.1.1.1.0  1      34       13      61.8
0.0.1.1.0.1.1.1.1.1.1  0      22       9       59.1
0.0.1.1.1              0      3        0       100.0
0.1                    0      493990   0       100.0
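In these tree-classification tables, Purity is the percentage of a node's records that belong to the node's predicted class, i.e. (Records − Errors) / Records. A quick check against a few rows of the first run:

```python
def purity(records: int, errors: int) -> float:
    """Percentage of records at a node that match its predicted class."""
    return round(100 * (records - errors) / records, 1)

# (records, errors) for selected nodes of the first tree run, from the table.
nodes = {
    "0":       (494705, 457),
    "0.0":     (715, 258),
    "0.0.1":   (578, 153),
    "0.0.1.1.0.1.1.0": (5, 0),
}
for node_id, (records, errors) in nodes.items():
    print(node_id, purity(records, errors))  # 99.9, 63.9, 73.5, 100.0
```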
h) Tree Classification: Second Run
Internal Node  Class  Records  Errors  Purity (%)
0              0      494705   457     99.9
0.0            1      715      258     63.9
0.0.0          0      137      32      76.6
0.0.0.0        0      11       0       100.0
0.0.0.1        0      126      32      74.6
0.0.0.1.0      1      6        1       83.3
0.0.0.1.1      0      120      27      77.5
0.0.1          1      578      153     73.5
0.0.1.0        1      337      38      88.7
0.0.1.0.0      1      213      11      94.8
0.0.1.0.1      1      124      27      78.2
0.0.1.1        1      241      115     52.3
0.0.1.1.0      1      238      112     52.9
0.0.1.1.1.0    0      3        0       100.0
0.1            0      493990   0       100.0
i) Tree Classification: Third Run
Internal Node  Class    Records  Errors  Purity (%)
0              AUT.HYV  493990   320241  35.2
0.0            LEHTIIR  32586    23651   27.4
0.0.0          SILFVMI  30353    22089   27.3
0.0.0.0        LEHTIIR  12599    6963    44.7
0.0.0.0.0      SILFVMI  713      379     46.8
0.0.0.0.1      LEHTIIR  11886    6308    46.9
0.0.0.1        SILFVMI  17643    12643   28.3
0.0.1          LEHTIIR  2344     279     88.1
0.0.1.0        LEHTIIR  2308     243     89.5
0.0.1.1        KYYKOPI  36       18      50.0
0.1            AUT.HYV  461404   290884  37.0
0.1.0          AUT.HYV  435505   267403  38.6
0.1.0.0        AUT.HYV  29330    9723    66.8
0.1.0.1        AUT.HYV  406175   257680  36.6
0.1.0.1.0      LINDRHA  153161   112878  26.3
0.1.0.1.1.1    AUT.HYV  253014   138822  45.1
0.1.1          LEHTIIR  25899    14121   45.5
0.1.1.0        LEHTIIR  25031    13378   46.6
0.1.1.1        SILFVMI  868      452     47.9
Appendix C: Sample Selection Results of ACL
Sample Number  Transaction Number  Transaction Amount
1              008                 -104.28
2              011                 -916.43
3              014                 -660.14
4              024                 639.11
5              026                 -1248.53
6              029                 -4030.20
7              030                 -2047.32
8              039                 -1091.54
9              042                 -2799.32
10             050                 -2565.03
11             056                 -1442.37
12             063                 -5886.58
13             065                 -127.02
14             068                 -1492.67
15             069                 -660.14
16             086                 318.92
17             088                 67.82
18             089                 -479.34
19             090                 4860.15
20             101                 58.02
21             116                 97.13
22             117                 2123.52
23             120                 26859.97
24             121                 1081.77
25             126                 13543.16
26             137                 185.00
27             154                 6179.24
28             156                 435.43
29             159                 43.76
30             162                 48.49
31             165                 1795.71
32             167                 94.64
33             168                 -253.80
34             174                 -85.08
35             176                 30.90
36             178                 -325.68
37             181                 -28.56
38             192                 325.68
39             193                 33.60
40             197                 960.23
41             198                 -36389.95
42             199                 14555.98
43             200                 3638.99
44             212                 -243.20
45             228                 1173.36
46             231                 -205.84
47             237                 41.16
48             239                 -652.72
49             242                 -238.79
50             250                 9.18
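The table lists the 50 transactions ACL selected, but not the selection procedure itself. Purely as an illustration, a simple random record sample, one of ACL's standard sampling modes, can be drawn as below; the population range and seed are assumptions for the sketch, not the values ACL used.

```python
import random

def random_record_sample(transaction_ids, sample_size, seed=None):
    """Draw a simple random sample of transaction ids without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(transaction_ids, sample_size))

# Hypothetical population: transaction numbers 001-250, mirroring the
# numbering range seen in the table above.
population = [f"{n:03d}" for n in range(1, 251)]
sample = random_record_sample(population, 50, seed=42)
print(len(sample))  # 50 selected transactions, in ascending order
```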