Uploaded by Sayo Enoch Aluko

e Banking Fraud Detection publish

DEDICATION
This research work is dedicated first to my Father in heaven (GOD) and Savior, the Lord Jesus Christ; then to my mother, Mrs. Olufumilola Comfort Aluko; my father, the late Mr. Samuel Oluwole Aluko; and my grandmother, the late Princess Alice Adeleye Aluko.
My humble gratitude goes to my siblings: Mr. Gbenga Aluko, Cornel Olubunmi Aluko, Dr. Seun Aluko, Mr. James Aluko (FNM), Temilola Abidemi Aluko,... and to my nieces and nephews, friends, colleagues and well-wishers.
ACKNOWLEDGEMENTS
The fact that I am writing this sentence echoes my enormous indebtedness to the Almighty God
for his grace, mercy and most importantly his unending blessings and favour towards me.
I also thank my project supervisor, Dr. (Mrs.) M.I Akinyemi, for her encouragement, guidance and support, and for her contribution and effort towards the success of this research work: for guiding my work tirelessly, going through all my drafts, and providing valuable suggestions and constructive criticism for the improvement of the dissertation.
I would also like to appreciate the effort of my lecturers in the Department of Mathematics: the departmental HOD, Prof. J.O Olaleru, Prof. R.O Okafor, Prof. J.A Adepoju, Prof. S.O Ajala, Prof. S.A Okunuga, Dr. R.A Kasumu, Dr. A.A Mogbademu, Dr. (Mrs.) J.N Onyeka-Ubaka, Dr. M.O Adamu Ira, Dr. I.O Abiala and Dr. A.A Akinfenwa, among others.
My profound gratitude goes to my parents: my mother, Mrs. Olufumilola Comfort Aluko; my father, the late Mr. Samuel Oluwole Aluko; and my grandmother, the late Princess Alice Adeleye Aluko. My humble gratitude also goes to my siblings: Mr. Gbenga Aluko, Cornel Olubunmi Aluko, Dr. Seun Aluko, Mr. James Aluko (FNM), Temilola Abidemi Aluko,... and to my nieces and nephews, friends, colleagues and well-wishers, for their encouragement, words of wisdom, financial support and so much more, and for always supporting me and being my source of strength.
Finally, I express my gratitude to my entire family, beloved colleagues and friends.
To all of us, God’s richest blessing is ours!
TABLE OF CONTENTS
Dedication
Acknowledgements
Table of contents
List of Figures
CHAPTER ONE: Background of the Study
1.0 Introduction
1.1 Background of the study
1.2 Statement of the problem
1.3 Objectives of the study
1.4 Research questions
1.5 Significance of the study
1.6 Scope and Limitation of the study
1.7 Historical Background of the case study
1.8 Definition of terms
CHAPTER TWO: Literature Review
2.0 Electronic banking fraud characteristics and related work
2.1 Electronic banking fraud characteristics
2.2 General work in fraud detection
2.3 Fraud detection in Electronic banking
2.4 Credit card fraud detection
2.5 Computer intrusion detection
2.6 Telecommunication fraud detection
CHAPTER THREE: Research Methodology
3.0 Introduction
3.1 Methodology description
3.2 Credit Card Fraud Detection Methods
3.3 Model Specification
3.4 Gaussian distribution to developing Anomaly detection Algorithms
3.5 Data Pre-Processing and Fraud Detection
CHAPTER FOUR: Data Presentation and Analysis
4.1 Exploratory Data Analysis and Gaussian distribution Validation
4.2 K-Mean Cluster Analysis
4.6 Principal Component Analysis
4.8 Model Based Anomaly Detection Output
4.9 Outlier Detection based Mechanism
CHAPTER FIVE: Summary, Conclusions and Recommendations
5.0 Summary of Findings
5.1 Conclusion
5.2 Recommendations
5.3 Suggestion for Further Studies
References
APPENDIX: Experimental Tools and Code for Project Implementation
LIST OF FIGURES:
Figure 3.1.1
Figure 3.5.4
Figure 3.5.5
Figure 3.5.6
Figure 3.5.7
Figure 3.6.2
Figure 4.2.3
Figure 4.4.2
Figure 5.1.1
Figure 5.1.2
Figure 5.1.2b
CHAPTER ONE
INTRODUCTION
1.1 BACKGROUND OF THE STUDY
In spite of the challenging economy, the use of e-channel platforms (Internet banking, Mobile Banking, ATM, POS, Web, etc.) has continued to experience significant growth. According to the NIBSS 2015 annual fraud report, transaction volume and value grew by 43.36% and 11.57% respectively, compared to 2014. Although the e-fraud rate in terms of value fell by 63% in 2015, due in part to the introduction of BVN and improved collaboration among banks via the fraud desks, the total fraud volume increased significantly, by 683%, in 2015 compared to 2014. Similarly, data released recently by NITDA (Nigeria Information Technology Development Agency) indicated that Nigeria experienced a total of 3,500 cyber-attacks with a 70% success rate, and a loss of $450 million, within the past year.
The sustained growth of e-transactions as depicted by the increased transaction volume and value
in 2015, coupled with the rapidly evolving nature of technology advancements within the
e-channel ecosystem continues to attract cybercriminals who continuously develop new schemes
to perpetrate e-fraud.
What is e-fraud? What is responsible for its growth in Nigeria? What are the major techniques
used by these criminals to commit fraud? Is e-fraud dying in Nigeria? Can it be mitigated?
What is e-fraud?
e-fraud can be briefly defined as Electronic Banking trickery and deception which affects the
entire society, impacting upon individuals, businesses and governments.
Why Is It Growing?
The following inherent factors fuel e-fraud in Nigeria:
i. Dissatisfied staff;
ii. Increased adoption of e-payment systems for transactions due to their convenience and simplicity;
iii. Emerging payment products being adopted by Nigerian banks;
iv. Growing complexity of e-channel systems;
v. Abundance of malicious code, malware and tools available to attackers;
vi. Rapid pace of technological innovations;
vii. Casual security practices and knowledge gap;
viii. Obscurity approach of the internet;
ix. The increasing role of third-party processors in switching e-payment transactions;
x. Passive approach to fraud detection and prevention;
xi. Lack of inter-industry collaboration in fraud prevention (banks, telecoms, police, etc.).
What are the Major Techniques?
Cybercriminals employ several techniques to perpetrate e-fraud, including:
1. Cross Channel Fraud: customer information obtained from one channel (e.g. call center)
is used to carry out fraud in another channel (e.g. ATM).
2. Data theft: hackers access secure or non-secure sites, get the data and sell it.
3. Email Spoofing: changing the header information in an email message in order to hide
identity and make the email appear to have originated from a trusted authority.
4. Phishing: refers to stealing of valuable information such as card information, user IDs,
PAN and passwords using email spoofing technique.
5. Smishing: attackers use text messages to defraud users. Often, the text message will
contain a phone number to call.
6. Vishing: fraudsters use phone calls to solicit personal information from their victim.
7. Shoulder Surfing: refers to using direct observation techniques, such as looking over
someone's shoulder, to get personal information such as PIN, password, etc.
8. Underground websites: Fraudsters purchase personal information such as PIN, PAN, etc.
from underground websites.
9. Social Media Hacking: obtaining personal information such as date of birth, telephone
number, address, etc. from social media sites for fraudulent purposes.
10. Key logger Software: use of malicious software to steal sensitive information such as
password, card information, etc.
11. Web Application Vulnerability: attackers gain unauthorized access to critical systems
by exploiting weaknesses on web applications.
12. Sniffing: viewing and intercepting sensitive information as it passes through a network.
13. Google Hacking: using Google techniques to obtain sensitive information about a
potential victim with the aim of using such information to defraud the victim.
14. Session Hijacking: unauthorized control of communication session in order to steal data
or compromise the system in some manner.
15. Man-in-The-Middle Attack: an attacker secretly intercepts, and possibly alters, the
communication between two parties; a basic tool for stealing data and facilitating more
complex attacks.
Is e-fraud becoming extinct?
Fraud value may have reduced in 2015, but the significant increase in volume of attacks
depicts the enormous threat of e-fraud. Furthermore, information released by security firm,
Kaspersky, shows that in 2015, there were over a million attempted malware infections that
aimed to steal money via Electronic Banking access to bank accounts.
As financial institutions adopt emerging payment systems and other technological innovations as a means of increasing revenue and reducing costs, cyber thieves, on the other hand, are exploiting gaps inherent in these innovations to perpetrate fraud, bearing in mind that security is usually not the primary focus of most of these innovations.
Can it be alleviated?
Because of the risk inherent in the e-channel space, many organisations have attempted to
implement the following comprehensive strategies for detecting and preventing e-fraud:
• Fraud Policies
• Fraud Risk Assessment
• Fraud Awareness and Training
• Monitoring
• Penetration Testing
• Collaboration
In conclusion, increased revenue, optimized costs, innovations, regulation, convenience and simplicity are the major factors driving the massive adoption of e-channel platforms in Nigeria. Furthermore, the usage of these platforms has created opportunities for cyber-thieves, who continuously devise new and sophisticated schemes to perpetrate fraud.
e-fraud will continue to grow, and combating it requires effective fraud strategies and the collaboration and cooperation of many organisations in Nigeria, including government agencies, as well as other countries. Otherwise, cybercriminals will keep getting richer from the hard work of others for lack of a united front on the part of everyone.
1.2 STATEMENT OF THE PROBLEM
Electronic banking is a driving force that is fundamentally changing the landscape of the banking environment towards a more competitive industry. Electronic banking has blurred the boundaries between different financial institutions, enabled new financial products and services, and made existing financial services available in different packages (Anderson S. 2000), but the influence of electronic banking goes far beyond this.
The developments in electronic banking, together with other financial innovations, are constantly bringing new challenges to finance theory and changing people’s understanding of the financial system. It is not surprising that, in the application of electronic banking in Nigeria, the financial institutions have to face the following problems:
1. Communication over the internet is insecure and often congested.
2. The financial institutions also have to contend with other internet challenges, including insecurity, quality of service and some aberrations in electronic finance.
3. Besides, the existing banking environment also creates some challenges to the smooth operation of electronic banking in Nigeria.
To this effect, this project will serve as a practical verification, by carrying out various fraud detection techniques, of whether an integrated system of techniques does indeed provide better performance than a single technique, as suggested by most researchers.
1.3 OBJECTIVES OF THE STUDY
The main objective of this study is to find a solution for controlling fraud, since it seems to be a critical problem in many organisations, including the government. Specifically, the following are the objectives of the study:
i. Identify the factors that cause fraud.
ii. Explore the various techniques of fraud detection.
iii. Explore some major detection techniques based on the unlabelled data available for analysis, which do not contain a useful indicator of fraud. Thus, unsupervised Machine Learning and predictive modelling, with a major focus on Anomaly/Outlier Detection (OD), will be considered as the major techniques for this project work.
1.4 RESEARCH QUESTIONS
• What are the factors that cause fraud?
• What specific phenomena typically occur before, during, or after a fraud incident?
• What other characteristics are generally seen with fraud?
• What are the various techniques of fraud detection?
• Is there a specific fraud detection technique suitable for a typical type of fraud?
When all these phenomena and characteristics are pinpointed, predicting and detecting fraud
becomes a much more manageable task.
1.5 SIGNIFICANCE OF THE STUDY
• Understand the different areas of fraud and their specific detection methods.
• Identify anomalies and risk areas using data mining and machine learning techniques.
• Carry out some major fraud detection techniques, as a model and an encouragement for different banks to work together on fraud detection and achieve more extensive and better results.
1.6 SCOPE AND LIMITATION OF THE STUDY
This work considers anomaly detection as its main theme. The resources used therefore illustrate the variety of approaches, methods and tools for the task in each ecosystem. To make this study successful, data mining and statistical methodology will be explored to detect fraud so that immediate action can be taken to minimize costs. Through the use of sophisticated data mining tools, millions of transactions can be searched to spot patterns and detect fraudulent transactions.
Such tools include decision trees (boosting trees, classification trees and random forests), machine learning, association rules, cluster analysis and neural networks. Predictive models can be generated to estimate quantities such as the probability of fraudulent behavior or the naira amount of fraud. These predictive models help to focus resources in the most efficient manner to prevent or recover fraud losses.
In the course of this research work some constraints were encountered. For instance, it does not make sense to describe fraud detection techniques in great detail in the public domain, as this gives criminals the information they require to evade detection. Although data sets are readily available, results are often censored, making them difficult to assess (for example, Leonard 1993). Many fraud detection problems involve huge data sets that are constantly evolving; besides, original data sets are modified so as not to expose clients' personal information and to protect the organisation's security.
Data Source: Chartered Institute of Treasury Management, Abuja
http://www.cbn.gov.ng/neff%20annual%20report%2015.pdf
http://www.nibbs-plc.com/ng/report/2014fraud.report
https://statistics.cbn.gov.ng/cbn-ElectronicBankingstats/DataBrowser.aspx
1.7 HISTORICAL BACKGROUND OF THE CASE STUDY
2015 was an incredible year for cyber-security in Nigeria. In May 2015, the cybercrime bill was signed into law in Nigeria by President Goodluck Jonathan. The implication of this for individuals and corporations is that cybercrime is now properly defined and legal consequences are attached to any violation of this law. A cyber-attack hit the main website of the British Broadcasting Corporation (BBC) and its iPlayer streaming service on New Year's Eve. The BBC’s websites were unavailable for several hours as a result of the attack. This was the first widely reported cyber-attack of the year 2016. While it is bad enough to hear such news at the start of the year, what should be of greater concern is the number of unreported or stealth cyber-attacks that have occurred and will occur in 2017.
As the Internet and technology continue to evolve, the world becomes more connected and no one is immune to these threats.
At the beginning of year 2014, an annual forecast of Nigeria’s cyber-security landscape was detailed in the 2015 Nigeria Cyber-security Outlook. This included the forecast that the likelihood of cyber-security issues would reduce towards the last quarter of the year due to the successful implementation of the Bank Verification Number (BVN) exercise, an initiative powered by the Central Bank of Nigeria (CBN). This prediction was confirmed in a report presented by Mr. Dipo Fatokun, Chairman of the Nigeria Electronic Fraud Forum (NEFF) and Director, Banking and Payment System Department, CBN, during the forum’s annual dinner. He stated that the loss arising from electronic payment fraud had fallen by 63% and that there had been a reduction of 45.98% in attempted Electronic Banking fraud by the end of 2015 as against the beginning of the same year. This drop could be partly attributed to the successful implementation of the BVN, a commendable initiative implemented to secure Nigeria’s payment system in 2015.
The 2015 forecast also indicated higher risk of current and former employees or contractors
resorting to cybercrime as a means to maintain their standard of living. During the course of the
year, forensic specialists were kept busy (hopefully with pockets full) as several companies had
to engage digital forensic specialists to investigate cybercrime perpetrated by various suspects
who are largely made up of employees and former employees of the victim organizations.
The forecast further highlighted the fact that there would be an increase in cyber-attacks
of websites and information technology (IT) infrastructure of political organizations and public
institutions, and these would appear as headlines in local dailies. The prediction became a reality
and at various points during the year, there were several allegations of hacking attempts on the
websites of public institutions and political parties. Some worthy of mention are the reported
hack and defacing of the Independent National Electoral Commission (INEC) website in March
2015 and that of the Lagos state government in December 2015. Through 2015 and 2016, the
cyber-security journey of hacks, attacks and triumphs has continued.
1.8 DEFINITION OF TERMS
• Fraud detection: refers to the detection of criminal activities occurring in commercial organizations.
• Anomaly: a pattern in the data that does not conform to the expected behavior.
• Classification: finding models that analyze and classify a data item into several predefined classes.
• Sequencing: similar to the association rule, except that the relationship exists over a period of time, such as repeat visits to a supermarket.
• Regression: mapping a data item to a real-valued prediction variable.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Dependency Modeling (Association Rule Learning): finding a model which describes significant dependencies between variables.
• Deviation Detection (Anomaly Detection): discovering the most significant changes in the data.
• Summarization: finding a compact description for a subset of data.
• Data Cleaning: removes noise from data.
• Data Integration: combines various data sources.
• Data Selection/Transformation: transforms data into the form appropriate for mining.
• Automated Teller Machine (ATM): gives customers easy access to their cash whenever they need it (24 hours a day, 7 days a week).
• Internet banking: with a PC connected to the bank via the internet, the product empowers a customer to transact banking business when, where and how he/she wants, with little or no physical interaction with the bank.
• Mobile Banking: offers customers the freedom of banking with a mobile phone. The product keeps a customer in touch with his/her finances all the time and anywhere.
• Electronic banking: refers to the use of computers and telecommunications to enable banking transactions to be done by telephone or computer.
• Electronic funds transfer (EFT): involves the transfer of money from one bank account to another by means of communication links.
• Smart Card: a plastic card that contains a microprocessor that stores and updates information, typically used in performing financial transactions.
• E-money: also known as electronic cash; refers to money or scrip which is exchanged only electronically. A good example of e-money is money transfer.
• Bill payment: refers to an e-banking application whereby a customer directs the financial institution to transfer funds to the account of another person or business.
• Classification by decision tree induction: a decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.
• Bayesian Classification: also known as Naive Bayes Classification. As the name suggests, this classifier uses Bayes' Theorem, with a naive independence assumption between variables, to obtain the classification for given variable values.
• Neural Networks: a neural network is a set of connected input/output units in which each connection has a weight associated with it.
This research work will explore the procedures for detecting the presence of outliers using various distance measures, with clustering-based anomaly detection as the methodology. Since the available data set for this research is unlabelled and does not contain a useful indicator of fraud, there could be a need to reduce the available multidimensional data set to a lower dimension while retaining most of the information, using Principal Component Analysis. Predictive Modelling with Unsupervised Machine Learning, with a major focus on Anomaly Detection, will nevertheless be the main center of attention.
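As a rough illustration of the clustering-based, distance-driven outlier scoring described above, the following minimal sketch clusters a handful of points with plain k-means and scores each point by its Euclidean distance to the nearest centroid. The data values, the starting centroids and the cutoff are hypothetical and chosen only for illustration; this is not the implementation used in this study, and it omits the Principal Component Analysis step.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm starting from caller-supplied centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids

def outlier_scores(points, centroids):
    """Score = distance to the nearest centroid (larger = more unusual)."""
    return [min(euclidean(p, c) for c in centroids) for p in points]

# Hypothetical, already-scaled (amount, daily frequency) pairs.
transactions = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9),
                (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
                (9.5, 0.5)]

centroids = kmeans(transactions, centroids=[(0.0, 0.0), (6.0, 6.0)])
scores = outlier_scores(transactions, centroids)

cutoff = 2.0  # assumed cutoff; in practice it would be tuned on the data
flagged = [p for p, s in zip(transactions, scores) if s > cutoff]
print(flagged)
```

Other distance measures (for example Manhattan or Mahalanobis distance) could be substituted for `euclidean` without changing the rest of the sketch.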
CHAPTER TWO
LITERATURE REVIEW
2.0 Electronic banking fraud characteristics and related work
I will first summarize the main characteristics of Electronic banking fraud, and then discuss the related work on different areas of fraud detection. Most published work on fraud detection relates to the domains of credit card fraud, computer intrusion and telecommunication fraud. I will therefore discuss each of these and explain the limitations of the existing work when applied to detecting Electronic banking fraud.
2.1 Electronic banking fraud characteristics
From a system point of view, the essence of Electronic fraud reflects the synthetic abuse
of interaction between resources in three worlds: the fraudster’s intelligence abuse in the social
world, the abuse of web technology and Internet banking resources in the cyber world, and the
abuse of trading tools and resources in the physical world. A close investigation of the
characteristics is important for developing effective solutions, which will then be helpful for
other problem-solving. (Sahin, Y., and Duman, E. 2011)
Investigations based on the literature review show that real-world Electronic banking transaction data sets and most electronic banking fraud have the following characteristics and challenges: (1) highly imbalanced large data set; (2) real time detection; (3) dynamic fraud behavior; (4) weak forensic evidence; and (5) diverse genuine behavior patterns.
The data set is large and highly imbalanced. According to a study on one Australian bank’s Electronic banking data, Electronic banking fraud detection involves a large number of transactions, usually millions. However, the number of daily frauds is usually very small. For instance, there were only 5 frauds among more than 300,000 transactions on one day. This results in the task of detecting very rare fraud dispersed among a massive number of genuine transactions.
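To make the scale of this imbalance concrete, the short calculation below uses the figures quoted above, taking exactly 300,000 transactions as an assumption (the source says "more than 300,000"). It also shows why plain accuracy is a misleading measure here: a rule that labels every transaction as genuine is almost perfectly accurate yet detects no fraud at all.

```python
frauds = 5
transactions = 300_000  # assumed exact count for illustration

fraud_rate = frauds / transactions

# A trivial "detector" that labels every transaction as genuine:
true_negatives = transactions - frauds   # all genuine transactions are correct
true_positives = 0                       # no fraud is ever caught
accuracy = (true_negatives + true_positives) / transactions
recall = true_positives / frauds

print(f"fraud rate: {fraud_rate:.5%}")
print(f"accuracy of the trivial detector: {accuracy:.5%}")
print(f"recall of the trivial detector: {recall:.0%}")
```

For this reason, measures that focus on the rare class, such as recall and precision, are usually more informative than accuracy on data of this kind.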
Fraud detection needs to be real time. According to Linda D., Hussein A., John P. (2009), in Electronic banking the interval between a customer making a payment and the payment being transferred to its destination account is usually very short. To prevent instant money loss, a fraud detection alert should be generated as quickly as possible. This requires a high level of efficiency in detecting fraud in large and imbalanced data.
The fraud behavior is dynamic. According to Masoumeh Zareapoor, fraudsters continually advance their techniques to defeat Electronic banking defenses. Malware, which accounts for the greater part of Electronic banking fraud, has been reported to grow by over 55,000 new malicious programs every day. This puts fraud detection in the position of having to defend against an ever-growing set of attacks. This is far beyond the capability of any single fraud detection model, and requires adaptive models and the possibility of engaging multiple models to address the challenges that cannot be handled by any single model. (Seeja.K.R, and M.Afshar.Alam, 2012)
The forensic evidence for fraud detection is weak. For Electronic banking transactions, it is only possible to know the source account, the destination account and the currency value associated with each transaction; other external information, for example the purpose of the spending, is not available. Moreover, with the exception of ID theft, most electronic banking fraud is not caused by the hijack of an Electronic banking system but by attacks on customers’ computers. In fraud detection, only the Electronic banking activities recorded in banking systems can be accessed, not the whole compromise process or solid forensic evidence (including labels showing whether a transaction is fraudulent), which could be very useful for understanding the nature of the deception. This makes it challenging to identify sophisticated fraud with very limited information. (Adnan M. Al-Khatib, 2012)
The customer behavior patterns are diverse. An Electronic banking interface provides a
one-stop entry for customers to access most banking services and multiple accounts. In
conducting Electronic banking business, every customer may perform very differently for
different purposes. This leads to a diversity of genuine customer transactions. In addition,
fraudsters simulate genuine customer behavior and change their behavior frequently to compete
with advances in fraud detection. This makes it difficult to characterize fraud and even more
difficult to distinguish it from genuine behavior. (Tung-shou Chen, 2006)
The Electronic banking system is fixed. The Electronic banking process and system of
any bank are fixed. Every customer accesses the same banking system and can only use the
services in a predefined way. This leads to good references for characterizing common genuine
behavior sequences, and for identifying tiny suspicions in fraudulent Electronic banking.
The above characteristics make it very difficult to detect Electronic banking fraud, and Electronic banking fraud detection presents several major challenges to research, especially for the mainstream data mining community: extremely imbalanced data, big data, model efficiency in dealing with complex data, dynamic data mining, pattern mining with limited or no labels, and discriminant analysis of data without clear differentiation. In addition, it is very challenging to develop a single model that tackles all of the above aspects, which greatly challenges the existing work in fraud detection. (Tung-shou Chen, 2006)
2.2 General work in fraud detection
Many statistical and machine learning techniques have been developed for tackling fraud, for example Neural Networks, Decision Trees, Logistic Regression and Rule-based Expert Systems. They have been used to detect abnormal activities and fraud in many fields, such as money laundering, credit card fraud, computer intrusion, and so on. They can be categorized as unsupervised approaches and supervised ones. Unsupervised approaches, such as the Hidden Markov Model, are mainly used in outlier detection and spike detection when the training samples are unlabeled. Based on historical data and domain knowledge, Electronic banking can collect clearly labeled data samples from the reports of victims or related crime control organizations. Unsupervised approaches cannot use such label information, and their accuracy is lower than that of supervised approaches. Some supervised methods, such as Neural Networks and Random Forests, perform well in many classification applications, including fraud detection applications, even in certain class-imbalanced scenarios. However, they either cannot tackle extremely imbalanced data, or are not capable of dealing with the comprehensive complexities shown in Electronic banking data and business. (Philip K. Chan, Wei Fan, Andreas L., 1999)
Understanding the complexities of contrast between fraudulent behavior and genuine
behavior can also provide essential patterns which, when incorporated in a classifier, lead to high
accuracy and predictive power. Such understanding has triggered the emergence of contrast pattern
mining, such as emerging patterns, jumping emerging patterns, and mining contrast sets.
However, various research works show that these approaches are not efficient for detecting rare
fraud among an extremely large number of genuine transactions.
2.2b. An approach to fraud detection has been described that is based on tracking calling behaviour on an account over time and scoring calls according to the extent to which they deviate from the account's usual pattern and resemble fraud. Account summaries are compared to a threshold each period, and an account whose summary exceeds the threshold can be queued to be analyzed for fraud. Thresholding has several disadvantages: thresholds may need to vary with time of day, type of account and type of call in order to be sensitive to fraud without setting off too many false alarms for legitimate traffic. (Fawcett, T and Provost, F., 1996)
Fawcett and Provost developed an innovative method for choosing account-specific thresholds rather than universal thresholds that apply to all accounts or to all accounts in a segment. In the experiment, fraud detection is based on tracking account behaviour. First, fraud detection is event driven rather than time driven, so that fraud can be detected as it is happening. Second, fraud detection must be able to learn the calling pattern on an account and adapt to legitimate changes in calling behaviour. Lastly, fraud detection must be self-initializing so that it can be applied to new accounts that do not have enough data for training. The approach adopted probability distribution functions to track legitimate calling behaviour.
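The three requirements above (event-driven scoring, adaptation to legitimate change, and self-initialization) can be illustrated with a small sketch. It is not the method of Fawcett and Provost: the class `AccountProfile`, the function `observe`, the prior values and the threshold of 3 standard deviations are all hypothetical, and a running mean and variance stand in for the probability distribution functions mentioned in the text.

```python
import math

class AccountProfile:
    """Running mean/variance of one account's event values (Welford's method)."""

    def __init__(self, prior_mean, prior_std, prior_weight=5):
        # Self-initialization: seed a new account with an assumed
        # population-level prior so that it can be scored from its first event.
        self.n = prior_weight
        self.mean = prior_mean
        self.m2 = (prior_std ** 2) * prior_weight  # sum of squared deviations

    def score(self, value):
        """How many standard deviations the event lies above the account mean."""
        std = math.sqrt(self.m2 / self.n)
        return (value - self.mean) / std if std > 0 else 0.0

    def update(self, value):
        """Adapt the profile to a legitimate event."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

def observe(profile, value, threshold=3.0):
    """Event-driven check: score each event as it arrives.

    Returns True if the event is flagged. Flagged events are not used to
    update the profile, so suspected fraud does not shift the baseline.
    """
    if profile.score(value) > threshold:
        return True
    profile.update(value)
    return False

# Hypothetical account: typical event values (e.g. call minutes) around 10.
profile = AccountProfile(prior_mean=10.0, prior_std=5.0)
normal_flags = [observe(profile, v) for v in [8, 9, 11, 10, 9, 10, 11, 9]]
suspicious_flag = observe(profile, 60.0)
print(normal_flags, suspicious_flag)
```

Because the threshold is expressed relative to each account's own mean and spread, the same rule yields a different absolute cutoff for every account, which is the sense in which the threshold is account-specific.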
Other models that have been developed in research settings that have promising potential for real
world applications include the Customer Relationship Model, Bankruptcy Prediction Model,
Inventory Management Model, and Financial Market Model. (Fawcett, T and Provost, F.,
1997)
Similarly, it was stated that many financial institutions see the value of Artificial Neural Networks (ANNs) as a supporting mechanism for financial analysts and are actively investing in this arena. The models described provide the knowledge needed to choose the type of neural network to be used. The use of decision tree techniques, in conjunction with the CRISP-DM management model, to help in the prevention of bank fraud has also been evaluated. The study recognized the fact that it is almost impossible to eradicate bank fraud and focused on what can be done to minimize and prevent fraud. The research offered a study on decision trees, an important concept in the field of artificial intelligence, and focused on discussing how these trees are able to assist in the decision-making process of identifying fraud by analysing information regarding bank transactions. This information is captured with the use of data mining techniques and the CRISP-DM management model in large operational databases logged from internet banking.
The Cross Industry Standard Process for Data Mining (CRISP-DM) is a model of the data mining process used by experts to solve problems. The model identifies the different stages in implementing a data mining project, while a decision tree is both a data-representing structure and a method used for data mining and machine learning. The work also describes the use of neural networks in analyzing the great increase in credit card transactions, since credit card fraud has become increasingly rampant in recent years. This study investigates the efficacy of applying classification models to credit card fraud detection problems.
Three different classification methods, i.e. decision trees, neural networks and logistic regression,
were tested for their applicability in fraud detection. The paper provides a useful framework for
choosing the best model to recognize credit card fraud risk. Detecting credit card fraud is
difficult with normal procedures, so the development of credit card fraud
detection models has recently become significant in both the academic and the business
community.
These models are mostly statistics-driven or artificial intelligence-based, and have the
theoretical advantage of not imposing arbitrary assumptions on the input variables.
To increase the body of knowledge on this subject, an in-depth examination of important
publicly available predictors of fraudulent financial statements was offered. The authors tested the
value of these suggested variables for detecting fraudulent financial statements within a
matched-pairs sample. The self-organizing Artificial Neural Network (ANN) AutoNet was used in
conjunction with standard statistical tools to investigate the usefulness of these publicly available
predictors. The study resulted in a model with a high probability of detecting fraudulent financial
statements on one sample. An illustration of the decision tree for the training sets for the
multilayer perceptron network based on the work is displayed below.
Source: (Werbos ; Rumelhart )
In this work, the irregularity detection system model seeks to reduce the risk level of
fraudulent transactions that take place in the Nigerian banking industry, thereby helping to
reduce bank fraud. If implemented properly, it will bring about a reduction in fraudulent
transactions. Neural network technology is appropriate for detecting fraudulent transactions because
of its ability to learn and remember the characteristics of fraudulent transactions and to apply
that “knowledge” when assessing new transactions. (Yuhas B.P., 1993)
The study reinforced the validity and efficiency of AutoNet as a research tool and provided
additional empirical evidence regarding the merits of suggested red flags for fraudulent financial
statements. It also reviewed the various factors that may contribute to fraud in the banking
system, on the premise that there must be some factors that lead to such fraudulent behaviour.
2.3 Fraud detection in Electronic banking
There are very few papers about fraud detection in Electronic banking. Most of them concern
fraud prevention, which uses efficient security measures to prevent fraudulent financial
transactions performed by unauthorized users and to ensure transaction integrity. Aggelis
proposed an Electronic banking fraud detection system for offline processing. Another reported
system works well online but needs a component that must be downloaded and installed
on the client device, which is inconvenient for deployment. (Kevin J. L., 1995)
In practice, typical existing Electronic banking fraud detection systems are rule based and
match likely fraud in transactions. The rules are mostly generated according to domain
knowledge; consequently, these systems usually have a high false positive rate but a low fraud
detection rate. Importantly, the rules are not adaptive to changes in the types of fraud.
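A rule-based system of this kind can be sketched in a few lines of Python (the thesis itself works in R; this is only an illustration). The rule names, fields and thresholds below are hypothetical and are not taken from any real banking system.

```python
# Hypothetical hand-written rules; the thresholds are illustrative only.
RULES = [
    ("large_amount", lambda t: t["amount"] > 500_000),
    ("new_beneficiary_at_night", lambda t: t["new_beneficiary"] and t["hour"] < 5),
]

def matched_rules(transaction):
    """Return the names of all rules that the transaction triggers."""
    return [name for name, rule in RULES if rule(transaction)]

tx_flagged = {"amount": 750_000, "new_beneficiary": False, "hour": 14}
tx_missed = {"amount": 499_000, "new_beneficiary": True, "hour": 6}
```

The second transaction sits just outside both thresholds and triggers nothing, which shows why fixed rules that do not adapt to new fraud patterns are easy to evade.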
2.4 Credit card fraud detection
Credit card fraud is divided into two types: offline fraud and Electronic fraud. Offline
fraud is committed by using a stolen physical card at a storefront or call center. In most cases, the
institution issuing the card can lock it before it is used fraudulently, if the theft is
discovered quickly enough. Electronic fraud is committed via the web, phone shopping or other
cardholder-not-present channels. Only the card’s details are needed; a manual signature and card
imprint are not required at the time of purchase. With the increase of e-commerce, Electronic
credit card transaction fraud is increasing. Compared to Electronic banking fraud detection, there
are many available research discussions and solutions for credit card fraud detection.
Most of the work on preventing and detecting credit card fraud has been carried out with
neural networks. CARDWATCH features a neural network trained with the past data of a
particular customer and causes the network to process current spending patterns to detect
possible anomalies. Brause and Langsdorf proposed a rule-based association system combined
with the neuro-adaptive approach. Falcon, developed by HNC, uses feed-forward Artificial
Neural Networks trained on a variant of a back-propagation training algorithm. Machine
learning, adaptive pattern recognition, neural networks, and statistical modeling are employed to
develop Falcon predictive models to provide a measure of certainty about whether a particular
transaction is fraudulent. A neural MLP-based classifier is another example of a system that uses
neural networks. It acts only on the information of the operation itself and of its immediate
previous history, but not on historic databases of past cardholder activities. (Yuhas B.P., 1993)
A parallel Granular Neural Network (GNN) method uses a fuzzy neural network and a
rule-based approach. The neural system is trained in parallel using training data sets, and the
trained parallel fuzzy neural network then discovers fuzzy rules for future prediction.
CyberSource introduced a hybrid model that combines an expert system with a neural network to
improve its statistical modeling and reduce the number of “false” rejections. There are also some
unsupervised methods, such as HMM and clustering, that target unlabeled data sets.
All credit card fraud detection methods seek to discover spending patterns based on the
historical data of a particular customer’s past activities. This approach is not suitable for Electronic
banking because of the diversity of Electronic banking customers’ activities and the limited historical
data available for a single customer. (Reategui, E.B. and Campbell, J. A, 1994)
2.5 Computer intrusion detection
Many intrusion detection systems base their operations on analysis of audit data
generated by the operating system. According to Sundaram, intrusion detection approaches in
computers are broadly classified into two categories based on a model of intrusions: misuse and
anomaly detection. Misuse detection attempts to recognize attacks that follow previously observed
intrusions, in the form of a pattern or signature, and then monitors for such occurrences. Misuse
approaches include expert systems, model-based reasoning, state transition analysis, and
keystroke dynamics monitoring. Misuse detection is simple and fast. Its primary drawback is that
it cannot anticipate all the different attacks, because it looks only for known patterns of
abuse. (Sundaram, A. 1996)
According to Reichl, anomaly detection tries to establish a historical normal profile for
each user and then uses a sufficiently large deviation from the profile to indicate possible
intrusions. Anomaly detection approaches include statistical approaches, predictive pattern
generation, and neural networks. The advantage of anomaly detection is that it can
detect novel attacks; its weakness is that it is likely to have a high false alarm rate.
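The contrast between the two categories can be sketched as follows, in illustrative Python (the thesis itself works in R). The stored signature, the session counts and the three-standard-deviation threshold are assumptions made for this example only.

```python
import statistics

# Hypothetical stored attack signature: an exact sequence of audit events.
KNOWN_SIGNATURES = {("login_fail", "login_fail", "login_fail", "password_reset")}

def misuse_detect(events):
    """Misuse detection: flag only sequences that exactly match a stored signature."""
    return tuple(events) in KNOWN_SIGNATURES

def anomaly_detect(history, value, threshold=3.0):
    """Anomaly detection: flag a value that lies more than `threshold`
    standard deviations from the user's historical mean."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(value - mean) > threshold * sd

history = [20, 22, 19, 21, 20, 23, 18]  # assumed daily activity counts for one user
```

The misuse detector is fast but blind to any sequence it has not stored, whereas the anomaly detector can flag unseen behaviour at the cost of false alarms whenever a legitimate user deviates from the profile.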
Data mining approaches can be applied for intrusion detection. A classification model
with association rules algorithm and frequent episodes has been developed for anomaly intrusion
detection. This approach can automatically generate concise and accurate detection models from
a large amount of audit data. However, it requires a large amount of audit data in order to
compute the profile rule sets. Because most forensic evidence for fraud is left on customers’
computers and it is difficult to retrieve, intrusion detection methods cannot be directly used for
Electronic banking. (Buschkes R, Kesdogan D, Reichl P., 1998)
2.6 Telecommunication fraud detection
According to Yuhas, the various types of telecommunication fraud can be classified into
two categories: subscription fraud and superimposed fraud. Subscription fraud occurs when a
subscription to a service is obtained, often with false identity details and no intention of making
payment. Superimposed fraud occurs when a service is used without necessary authority and is
usually detected by the appearance of unknown calls on a bill. Research work in
telecommunication fraud detection has concentrated mainly on identifying superimposed fraud.
Most techniques use Call Detail Record data to create behavior profiles for customers, and
detect deviations from these profiles. (Yuhas, B.P. 1995)
Proposed approaches include the rule-based approach, neural networks, visualization
methods, and so on. Among them, neural networks can actually calculate user profiles in an
independent manner, thus adapting more elegantly to the behavior of various users. Neural
networks are claimed to substantially reduce operation costs. As with credit card fraud detection,
it is difficult for telecommunication fraud detection methods to characterize the behavior patterns
of Electronic banking customers effectively. (Wills, G.J)
Clearly, no single existing method can solve the Electronic banking fraud detection
problem easily. Because different approaches have advantages in different aspects, it is believed
that a combined solution will outperform any single solution. Neural networks have been
successfully adopted in all three kinds of fraud detection and are believed to be a stable model.
As Electronic banking behavior sequence data is available from the Electronic banking
interface log and discriminates between abnormal and normal activities, sequential behavior
patterns should be included for fraud detection. (Brachman, R.J and Wills G.J)
CHAPTER THREE
RESEARCH METHODOLOGY
3.1 Introduction
This chapter presents the analytical framework and the methodology used to build the Electronic
Banking Fraud Detection system, using data mining and R to implement machine learning
algorithms for fraud detection. The methods of analysis were K-Means Clustering Analysis and
Principal Component Analysis. Accordingly, a predictive model was formulated, and adequate
procedures and techniques for detecting the presence of outliers, using various distance measures,
were adopted.
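To make the combination of K-Means clustering and distance-based outlier detection concrete, here is a minimal Python sketch; the thesis implements its analysis in R, so this is an illustration of the general technique, not the author's code. The sample points, the starting centroids and the choice of Euclidean distance are assumptions.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return centroids

def outlier_score(point, centroids):
    """Euclidean distance to the nearest centroid; a larger value means more unusual."""
    return min(math.dist(point, c) for c in centroids)

# Assumed two-feature transaction summaries forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids = kmeans(points, centroids=[points[0], points[3]])
```

A transaction such as `(9.0, 0.5)` lies far from both centroids and receives a much larger score than `(1.1, 1.0)`; other distance measures could be substituted inside `outlier_score`.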
3.2 Methodology Description
3.2.1 Electronic Banking Transaction Fraud Detection Techniques Summary
This technique follows the tabular procedure below for Electronic Banking transactions to
demonstrate the fraud detection process. The table below summarises the steps:

Steps                    Description
1. read-untagged-data    Data (data object name before preprocessing)
2. data-preprocessing    Preprocess and clean the data: group or aggregate the items together based on the labelID
3. create-risk-table     Split the data into (behavioral transaction pattern)
4. Modelling             Build clusters which identify groups within the datasets and numeric variables using K-Means Algorithms / display Discriminant Analysis Plot; model Principal Component Variables
5. Visualisation         Highlight homogeneous groups of individuals with Parallel Coordinate Plot (PCP)
6. Prediction            Prediction on experimental sets
7. Evaluation            Evaluate performance
3.2.2 Credit Card Fraud Detection Methods
From the literature survey of various methods for fraud detection, I conclude that there are
many approaches to detecting credit card fraud, as follows:
• A Hybrid Approach and Bayesian Theory
• Hybridization
• Hidden Markov Model
• Genetic Algorithm
• Neural Network
• Bayesian Network
• K-nearest neighbor algorithm
• Stream Outlier Detection based on Reverse K-Nearest Neighbors (SODRNN)
• Fuzzy Logic Based System
• Decision Tree
• Fuzzy Expert System
• Support Vector Machine
• Meta Learning Strategy
3.2.3 Credit Card Fraud Detection Techniques
According to Wheeler, R. and Aitken, S. (2000), credit card fraud detection techniques are
classified into two general categories: fraud analysis (misuse detection) and user behavior analysis
(anomaly detection).
The first group of techniques deals with supervised classification at the transaction level. In
these methods, transactions are labeled as fraudulent or normal based on previous historical data.
This dataset is then used to create classification models which can predict the state (normal or
fraud) of new records. There are numerous model creation methods for a typical two-class
classification task, such as rule induction, decision trees and neural networks. This approach has
been shown to reliably detect most fraud tricks that have been observed before; it is also known as
misuse detection.
The second approach (anomaly detection) deals with unsupervised methodologies which are
based on account behavior. In this method a transaction is flagged as fraudulent if it contrasts
with the user’s normal behavior. This is because we do not expect fraudsters to behave in the same
way as the account owner or to be aware of the owner’s behavior model. To this end, we need to extract
the legitimate user behavioral model (i.e. user profile) for each account and then detect fraudulent
activities against it. New behaviors are compared with this model, and activities that differ
sufficiently are identified as frauds. The profiles may contain the activity information of the account,
such as transaction types, amounts, locations and times of transactions. This method is also known as
anomaly detection (Yeung, D., and Ding, Y., 2002).
It is important to highlight the key differences between the user behavior analysis and fraud analysis
approaches. The fraud analysis method can detect known fraud tricks with a low false positive
rate (FPR). These systems extract the signatures and models of the fraud tricks present in the dataset and
can then easily determine exactly which frauds the system is currently experiencing. If the test
data does not contain any fraud signatures, no alarm is raised. Thus, the false positive rate (FPR)
can be greatly reduced. However, since the learning of a fraud analysis system (i.e. classifier) is
based on limited and specific fraud records, it cannot distinguish or detect novel frauds. As a
result, the false negative rate (FNR) may be extremely high, depending on how ingenious the
fraudsters are. User behavior analysis, on the other hand, largely addresses the problem of detecting
novel frauds. These methods do not search for specific fraud patterns, but rather compare
incoming activities with the constructed model of legitimate user behavior. Any activity that
differs sufficiently from the model is considered a possible fraud.
Although user behavior analysis approaches are powerful in detecting innovative frauds, they
suffer from high false alarm rates. Moreover, if a fraud occurs during the training phase,
this fraudulent behavior will be entered into the baseline model and assumed to be normal in further
analysis. (Yeung, D., and Ding, Y., 2002).
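The two error rates discussed above follow directly from confusion-matrix counts. The short Python sketch below (illustrative only; the thesis itself works in R) computes both, treating fraud as the positive class. The counts are invented purely to show the trade-off and are not results from any study.

```python
def error_rates(tp, fp, tn, fn):
    """Return (FPR, FNR) from confusion-matrix counts, with fraud as the positive class."""
    fpr = fp / (fp + tn)  # share of legitimate transactions wrongly flagged
    fnr = fn / (fn + tp)  # share of frauds that went undetected
    return fpr, fnr

# Hypothetical counts, invented for illustration only.
misuse_fpr, misuse_fnr = error_rates(tp=40, fp=5, tn=9895, fn=60)
anomaly_fpr, anomaly_fnr = error_rates(tp=85, fp=495, tn=9405, fn=15)
```

With these assumed counts the misuse-style detector has the lower FPR but the higher FNR, while the anomaly-style detector shows the opposite pattern, mirroring the qualitative comparison in the text.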
I will now briefly introduce and discuss some current fraud detection techniques that are
applied to credit card fraud detection tasks, together with the main advantages and disadvantages
of each approach.
3.2.4 Artificial Neural Network
An artificial neural network (ANN) is a set of interconnected nodes designed to imitate the
functioning of the human brain (Douglas, L., and Ghosh, S., 1994). Each node has a weighted
connection to several other nodes in adjacent layers. Individual nodes take the input received
from connected nodes and use the weights together with a simple function to compute output
values. Neural networks come in many shapes and architectures. The neural network
architecture, including the number of hidden layers, the number of nodes within a specific hidden
layer and their connectivity, must be specified by the user based on the complexity of the problem.
ANNs can be configured by supervised, unsupervised or hybrid learning methods.
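The node computation described above (a weighted sum of inputs passed through a simple function) can be sketched in Python as follows; the thesis itself works in R, so this is only an illustration. The sigmoid activation, the layer sizes and every weight value are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Each node outputs sigmoid(weighted sum of its inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node, arbitrary weights.
hidden = layer_forward([0.5, -1.0], weights=[[0.4, 0.6], [-0.3, 0.8]], biases=[0.0, 0.1])
output = layer_forward(hidden, weights=[[1.2, -0.7]], biases=[0.05])
```

The number of layers and nodes is fixed here by the shapes of the weight lists, which corresponds to the architecture choices that the user must specify.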
3.2.5 Supervised techniques
In supervised learning, samples of both fraudulent and non-fraudulent records, together with
their labels, are used to create models. These techniques are often used in the fraud analysis approach.
One of the most popular supervised neural networks is the back propagation network (BPN). It
minimizes the objective function using a multi-stage dynamic optimization method that is a
generalization of the delta rule. The back propagation method is often useful for feed-forward
networks with no feedback. The BPN algorithm is usually time-consuming, and parameters such as the
number of hidden neurons and the learning rate of the delta rule require extensive tuning and training to
achieve the best performance. In the domain of fraud detection, supervised neural networks such as
back-propagation are known as efficient tools with numerous applications.
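Since back propagation generalizes the delta rule, the simplest way to see the update is on a single sigmoid unit. The Python sketch below (illustrative; the thesis itself works in R) shows only that single-unit delta rule, not a full multi-layer BPN. The toy data, learning rate and epoch count are assumptions.

```python
import math

def predict(w, b, x):
    """Output of one sigmoid unit with two inputs."""
    return 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))

def train_delta_rule(samples, learning_rate=0.5, epochs=2000):
    """Gradient descent on squared error for a single sigmoid unit (the delta rule)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = predict(w, b, x)
            delta = (target - out) * out * (1.0 - out)  # error times sigmoid derivative
            w = [wi + learning_rate * delta * xi for wi, xi in zip(w, x)]
            b += learning_rate * delta
    return w, b

# Toy, linearly separable data: label 1 ("fraud") only when both features are 1.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_delta_rule(samples)
```

The need to pick `learning_rate` and `epochs` by hand, even for this tiny problem, illustrates the tuning burden mentioned above; a full BPN adds the number of hidden neurons to that list.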
Raghavendra Patidar et al. used a dataset to train a three-layer back propagation neural
network in combination with genetic algorithms (GA) for credit card fraud detection. In this
work, the genetic algorithms were responsible for deciding the network architecture:
the network topology, the number of hidden layers and the number of nodes in each layer.
Also, Aleskerov et al. developed a neural network based data mining system for credit card fraud
detection. The proposed system (CARDWATCH) had a three-layer auto-associative architecture.
They used a set of synthesized data for training and testing the system. The reported results show
very successful fraud detection rates.
A P-RCE neural network has also been applied to credit card fraud detection. P-RCE is a type of
radial basis function network that is usually applied to pattern recognition tasks. Krenker et al. proposed
a model for real-time fraud detection based on bi-directional neural networks. They used a data
set of cell phone transactions provided by a credit card company. It was claimed that the system
outperforms rule-based algorithms in terms of false positive rate.
A parallel granular neural network (GNN) has likewise been proposed to speed up the data mining and
knowledge discovery process for credit card fraud detection. GNN is a kind of fuzzy neural
network based on knowledge discovery (FNNKD). The underlying dataset was extracted from a
SQL Server database containing sample Visa Card transactions and then preprocessed for
use in fraud detection. Lower average training errors were obtained with larger
training datasets.
3.2.6 Unsupervised techniques
According to Yamanishi, K., and Takeuchi, J. (2004), unsupervised techniques do not need
prior knowledge of fraudulent and normal records. These methods raise an alarm for those
transactions that are most dissimilar from the normal ones. These techniques are often used in
the user behavior approach. ANNs can produce acceptable results given a sufficiently large
transaction dataset; they need a large training dataset. The self-organizing map (SOM), introduced
by Kohonen, is one of the most popular unsupervised neural network learning methods. It provides
a clustering method that is appropriate for constructing and analyzing customer profiles in credit card
fraud detection. SOM operates in two phases: training and mapping. In the former
phase, the map is built and the weights of the neurons are updated iteratively based on input
samples; in the latter, test data is classified automatically into normal and fraudulent classes through
the procedure of mapping. After the SOM has been trained, new unseen transactions are compared to the
normal and fraud clusters; a transaction that is similar to the normal records is classified as normal. New
fraud transactions are detected similarly.
One of the advantages of using unsupervised neural networks over similar techniques is that these
methods can learn from a data stream. The more data is passed to a SOM model, the more it adapts
and the better the results become. More specifically, the SOM adapts its model as time
passes. Therefore it can be used and updated online in banks or other financial corporations.
As a result, the fraudulent use of a card can be detected quickly and effectively. However, neural
networks have some drawbacks and difficulties, which are mainly related to specifying a suitable
architecture on the one hand and the excessive training required to reach the best performance on
the other. (Williams, G. and Milne, P., 2004)
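The two SOM phases, training and mapping, can be illustrated with a very small one-dimensional map in Python (the thesis itself works in R; this is a sketch, not the cited systems' implementation). The three-unit grid, the initial weights, the linear decay schedule and the toy data are all assumptions.

```python
import math

def best_matching_unit(x, weights):
    """Mapping phase: index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))

def train_som(data, weights, epochs=50, lr0=0.5, radius0=1.5):
    """Training phase: pull the best-matching unit and its grid neighbours towards each sample."""
    weights = [list(w) for w in weights]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs  # linearly decay the learning rate and radius
        lr, radius = lr0 * frac, radius0 * frac
        for x in data:
            bmu = best_matching_unit(x, weights)
            for j, w in enumerate(weights):
                if abs(j - bmu) <= radius:  # neighbourhood on a 1-D grid of units
                    for d in range(len(w)):
                        w[d] += lr * (x[d] - w[d])
    return weights

# Assumed two-feature transaction summaries forming a low and a high group.
data = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
som = train_som(data, weights=[(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)])
```

After training, `best_matching_unit` maps a new transaction to a unit; in a real system each unit would then be labelled normal or fraudulent from the records it attracts, and training could continue as new data streams in.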
3.2.7 Hybrid supervised and unsupervised techniques
In addition to supervised and unsupervised learning models of neural networks, some researchers
have applied hybrid models. John ZhongLei et al. proposed hybrid supervised (SICLN) and
unsupervised (ICLN) learning networks for credit card fraud detection. They improved the
reward-only rule of the SICLN model in the ICLN in order to update weights according to both reward and
penalty. This improvement increased stability and reduced the training
time. Moreover, the number of final clusters of the ICLN is independent of the number of
initial network neurons. As a result, the inoperable neurons can be omitted from the clusters by
applying the penalty rule. The results indicated that both the ICLN and the SICLN have high
performance, but the SICLN outperforms well-known unsupervised clustering algorithms.
(R. Huang, H. Tawfik, A. Nagar., 2010)
3.2.8 DECISION TREES AND SUPPORT VECTOR MACHINES:
Classification models based on decision trees and support vector machines (SVM) have been
developed and applied to the credit card fraud detection problem. In this technique, each account is
tracked separately using suitable descriptors, and an attempt is made to identify each
transaction as fraudulent or legitimate (Sahin, Y., and Duman, E., 2011).
The identification is based on the suspicion score produced by the developed classifier model.
When a new transaction is in progress, the classifier can predict whether the transaction is
normal or fraudulent.
In this approach, all the collected data is first pre-processed before the modeling
phase begins. Since the distribution of the data with respect to the classes is highly imbalanced,
stratified sampling is used to under-sample the normal records so that the models have a chance to
learn the characteristics of both the normal and the fraudulent records’ profiles. To do this, the
variables that are most successful in differentiating legitimate from fraudulent transactions
are found. These variables are then used to form stratified samples of the legitimate records.
Later, these stratified samples of the legitimate records are combined with the fraudulent ones
to form three samples with different fraudulent-to-normal record ratios. The first sample set has a
ratio of one fraudulent record to one normal record; the second has a ratio of one fraudulent
record to four normal ones; and the last has a ratio of one fraudulent record to nine normal ones.
The variables that are used make the difference in fraud detection systems. The main
motive in defining the variables used to form the data mart is to differentiate the profile
of the fraudulent card user from that of the legitimate card user. The results show that the
decision tree classifiers outperform SVM in solving the problem
under investigation. However, as the size of the training data sets becomes larger, the accuracy
of SVM-based models becomes equivalent to that of decision tree-based models, although the
number of frauds caught by SVM models is still lower than the number caught by
decision tree methods. (Carlos Leon, Juan I. Guerrero, Jesus Biscarri., 2012)
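The sampling scheme described above, with fraud-to-normal ratios of 1:1, 1:4 and 1:9, can be sketched in Python as follows (the thesis itself works in R; this is an illustration, not the cited authors' procedure). The record layout, the `band` stratification variable, the helper name `undersample` and the fixed random seed are assumptions.

```python
import random

def undersample(fraud, legit, ratio, stratum_of, seed=0):
    """Keep all fraud records and draw about `ratio` legitimate records per fraud record,
    sampling each stratum in proportion to its share of the legitimate records."""
    rng = random.Random(seed)
    strata = {}
    for rec in legit:
        strata.setdefault(stratum_of(rec), []).append(rec)
    wanted = ratio * len(fraud)
    sample = []
    for members in strata.values():
        k = round(wanted * len(members) / len(legit))  # rounding may shift the total slightly
        sample.extend(rng.sample(members, min(k, len(members))))
    return fraud + sample

# Hypothetical records: 10 frauds and 1000 legitimate transactions in two equal strata.
fraud = [{"id": i, "band": "high", "fraud": True} for i in range(10)]
legit = [{"id": 100 + i, "band": "low" if i % 2 else "high", "fraud": False} for i in range(1000)]
samples = {r: undersample(fraud, legit, r, stratum_of=lambda rec: rec["band"]) for r in (1, 4, 9)}
```

Each of the three resulting sets keeps all fraud records, so a classifier trained on them sees enough fraudulent examples to learn their profile despite the original imbalance.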
3.2.9 FUZZY LOGIC BASED SYSTEMS:
Fuzzy Neural Network
The purpose of fuzzy neural networks is to process large volumes of uncertain information,
and they are extensively applied in everyday life. Syeda et al. (2002) proposed fuzzy neural
networks that run on parallel machines to speed up rule production for customer-specific credit
card fraud detection. Their work relates to data mining and Knowledge Discovery in Databases
(KDD). In this technique, they used the GNN (Granular Neural Network) method, which uses a
fuzzy neural network based on knowledge discovery (FNNKD), to examine how fast the network can
be trained and how fast a number of customers can be processed for fraud detection in parallel.
A transaction table includes fields such as the transaction amount, statement date, posting date,
time between transactions, transaction code, day, and transaction description. For the
implementation of this credit card fraud detection method, however, only the significant fields
are extracted from the database into a simple text file by applying suitable SQL queries. In this
detection method the transaction amounts for a customer are the key input data. This
preprocessing reduces the data size and the processing required, which speeds up the training
and makes the patterns more concise. In the fuzzy neural network process, data is classified into
three categories:
• First for training,
• Second for prediction, and
• Third one is for fraud detection.
The detection system routine for any customer is as follows:
• Preprocess the data from a SQL server database.
• Extract the preprocessed data into a text file.
• Normalize the data and distribute it into 3 categories (training, prediction, detection).
For normalization, the GNN accepts inputs in the range 0 to 1, whereas a transaction amount can
be any number greater than or equal to zero; the amounts are therefore scaled by a factor based
on the maximum transaction amount for the particular customer. In this detection
method, there are two important parameters used during training:
(i) Training error and
(ii) Training cycles.
As the training cycles increase, the training error decreases. The accuracy of the
results depends on these parameters. In the prediction stage, the maximum absolute prediction error
is calculated. In the fraud detection stage, the absolute detection error is also calculated; if
it is greater than zero, it is checked against the maximum absolute prediction error. If the
absolute detection error is the greater, the transaction is flagged as fraudulent; otherwise it is
reported as safe. Both training cycles and data partitioning are extremely important for good
results. The more data there is for training the neural network, the better its predictions.
A lower training error makes prediction and detection more accurate. The higher the fraud
detection error, the greater the possibility that the transaction is fraudulent. (Peter J. Bentley, 2000)
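The normalization step and the decision rule of the detection stage can be sketched in Python as follows (the thesis itself works in R, and this is not the GNN itself). The predicted values stand in for the output of a trained network and are invented for illustration, as are the function names.

```python
def normalise(amounts):
    """Scale a customer's amounts into [0, 1] by that customer's maximum amount."""
    peak = max(amounts)
    return [a / peak for a in amounts]

def max_abs_error(actual, predicted):
    """Largest absolute error observed in the prediction stage."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

def is_fraudulent(actual, predicted, max_prediction_error):
    """Flag the transaction when its absolute detection error is positive and
    exceeds the maximum absolute prediction error."""
    detection_error = abs(actual - predicted)
    return detection_error > 0 and detection_error > max_prediction_error

# Hypothetical prediction-stage values (already normalised); "predicted" mimics network output.
threshold = max_abs_error(actual=[0.20, 0.25, 0.30], predicted=[0.22, 0.24, 0.27])
```

With this assumed threshold of about 0.03, a transaction with actual value 0.95 and predicted value 0.28 is flagged, while one with 0.26 against 0.25 is reported as safe.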
3.3 Model Specification
In this work, the Predictive Model for the Unsupervised Machine Learning Detection System
seeks to reduce the risk level of fraudulent transactions that take place in the Nigerian banking
industry, thereby helping to reduce bank fraud. If implemented properly, it will bring about a
reduction in fraudulent transactions. Efficiency is measured by the frequency with which
outliers or unusual user behaviour patterns are detected.
3.3.1 Model for Data Reduction
According to Bruker Daltonics, data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet closely maintains the integrity
of the original data. That is, mining on the reduced data set should be more efficient yet produce
the same (or almost the same) analytical results. (D.L Massart, and Y. Vander Heyden., 2004)
3.3.2 Principal Component Analysis (PCA) Procedure
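Suppose that we have a random vector X:
$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}$$
with population variance covariance matrix:
$$\operatorname{var}(X) = \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_p^2 \end{pmatrix}$$
Consider the linear combinations:
$$\begin{aligned} Y_1 &= e_{11}X_1 + e_{12}X_2 + \cdots + e_{1p}X_p \\ Y_2 &= e_{21}X_1 + e_{22}X_2 + \cdots + e_{2p}X_p \\ &\;\;\vdots \\ Y_p &= e_{p1}X_1 + e_{p2}X_2 + \cdots + e_{pp}X_p \end{aligned}$$
Each of these can be thought of as a linear regression, predicting $Y_i$ from $X_1, X_2, \ldots, X_p$. There is no intercept, but $e_{i1}, e_{i2}, \ldots, e_{ip}$ can be viewed as regression coefficients.
Note that $Y_i$ is a function of our random data, and so is also random. Therefore it has a
population variance:
$$\operatorname{var}(Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{ik}e_{il}\sigma_{kl} = e_i'\Sigma e_i$$
Moreover, $Y_i$ and $Y_j$ will have a population covariance:
$$\operatorname{cov}(Y_i, Y_j) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{ik}e_{jl}\sigma_{kl} = e_i'\Sigma e_j$$
Here the coefficients $e_{ij}$ are collected into the vector:
$$e_i = \begin{pmatrix} e_{i1} \\ e_{i2} \\ \vdots \\ e_{ip} \end{pmatrix}$$
First Principal Component (PCA1): $Y_1$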
The first principal component is the linear combination of X-variables that has maximum
variance (among all linear combinations), so it accounts for as much variation in the data as
possible. Specifically, we will define coefficients $e_{11}, e_{12}, \ldots, e_{1p}$ for that component in such
a way that its variance is maximized, subject to the constraint that the sum of the squared
coefficients is equal to one. This constraint is required so that a unique answer may be obtained.
More formally, select $e_{11}, e_{12}, \ldots, e_{1p}$ that maximize:
$$\operatorname{var}(Y_1) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{1k}e_{1l}\sigma_{kl} = e_1'\Sigma e_1$$
subject to the constraint that:
$$e_1'e_1 = \sum_{j=1}^{p} e_{1j}^2 = 1$$
Second Principal Component (PCA2): $Y_2$
The second principal component is the linear combination of X-variables that accounts for as
much of the remaining variation as possible, with the constraint that the correlation between the first
and second components is 0.
Select $e_{21}, e_{22}, \ldots, e_{2p}$ that maximize the variance of this new component:
$$\operatorname{var}(Y_2) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{2k}e_{2l}\sigma_{kl} = e_2'\Sigma e_2$$
subject to the constraint that the sums of squared coefficients add up to one,
$$e_2'e_2 = \sum_{j=1}^{p} e_{2j}^2 = 1$$
along with the additional constraint that these two components will be uncorrelated with one
another:
$$\operatorname{cov}(Y_1, Y_2) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{1k}e_{2l}\sigma_{kl} = e_1'\Sigma e_2 = 0$$
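All subsequent principal components have this same property; they are linear combinations
that account for as much of the remaining variation as possible and they are not correlated
with the other principal components. We will do this in the same way with each additional
component. For instance:
ith Principal Component (PCAi): $Y_i$
We select $e_{i1}, e_{i2}, \ldots, e_{ip}$ that maximize:
$$\operatorname{var}(Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{ik}e_{il}\sigma_{kl} = e_i'\Sigma e_i$$
subject to the constraint that the sums of squared coefficients add up to one, along with
the additional constraint that this new component will be uncorrelated with all the previously
defined components:
$$e_i'e_i = \sum_{j=1}^{p} e_{ij}^2 = 1$$
$$\begin{aligned} \operatorname{cov}(Y_1, Y_i) &= \sum_{k=1}^{p}\sum_{l=1}^{p} e_{1k}e_{il}\sigma_{kl} = e_1'\Sigma e_i = 0 \\ \operatorname{cov}(Y_2, Y_i) &= \sum_{k=1}^{p}\sum_{l=1}^{p} e_{2k}e_{il}\sigma_{kl} = e_2'\Sigma e_i = 0 \\ &\;\;\vdots \\ \operatorname{cov}(Y_{i-1}, Y_i) &= \sum_{k=1}^{p}\sum_{l=1}^{p} e_{i-1,k}e_{il}\sigma_{kl} = e_{i-1}'\Sigma e_i = 0 \end{aligned}$$
Therefore all principal components are uncorrelated with one another.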
3.3.2 How do we find the coefficients?
How do we find the coefficients for a principal component?
The solution involves the eigenvalues and eigenvectors of the variance covariance matrix Σ.
Solution:
We are going to let $\lambda_1$ through $\lambda_p$ denote the eigenvalues of the variance covariance matrix Σ. These
are ordered so that $\lambda_1$ is the largest eigenvalue and $\lambda_p$ is the smallest:
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$$
We are also going to let the vectors $e_1$ through $e_p$, that is, $e_1, e_2, \ldots, e_p$, denote the corresponding
eigenvectors. It turns out that the elements of these eigenvectors will be the coefficients of our
principal components.
The variance for the ith principal component is equal to the ith eigenvalue:
$$\operatorname{var}(Y_i) = \operatorname{var}(e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p) = \lambda_i$$
Moreover, the principal components are uncorrelated with one another:
$$\operatorname{cov}(Y_i, Y_j) = 0 \quad \text{for } i \neq j$$
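These two properties can be checked numerically on a small case. The Python sketch below (illustrative; the thesis itself works in R) uses the closed-form eigen decomposition of a symmetric 2 × 2 matrix, writing the eigenvalues as `lam1`, `lam2` and the unit eigenvectors as `e1`, `e2`. The covariance matrix is hypothetical and the helper names are assumptions.

```python
import math

def eigen_sym_2x2(a, b, d):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, d]], assuming b != 0."""
    mean, half_diff = (a + d) / 2.0, (a - d) / 2.0
    root = math.hypot(half_diff, b)
    pairs = []
    for lam in (mean + root, mean - root):
        v = (b, lam - a)  # satisfies (a - lam) * v[0] + b * v[1] = 0
        norm = math.hypot(*v)
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs

def quad_form(u, sigma, v):
    """Compute u' Sigma v for 2-vectors u, v and a 2x2 matrix Sigma."""
    return sum(u[i] * sigma[i][j] * v[j] for i in range(2) for j in range(2))

sigma = [[4.0, 2.0], [2.0, 3.0]]  # hypothetical covariance matrix
(lam1, e1), (lam2, e2) = eigen_sym_2x2(4.0, 2.0, 3.0)
```

Here `quad_form(e1, sigma, e1)` reproduces `lam1` (the variance of the first component) and `quad_form(e1, sigma, e2)` is zero up to rounding (the components are uncorrelated).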
The variance covariance matrix may be written as a function of the eigenvalues and their
corresponding eigenvectors. This is determined by using the Spectral Decomposition Theorem.
This will become useful later when we investigate topics under factor analysis.
Spectral Decomposition Theorem
The variance covariance matrix can be written as the sum over the p eigenvalues, multiplied by
the product of the corresponding eigenvector times its transpose, as shown in the first expression
below:
$$\Sigma = \sum_{i=1}^{p} \lambda_i e_i e_i' \cong \sum_{i=1}^{k} \lambda_i e_i e_i'$$
The second expression is a useful approximation if $\lambda_{k+1}, \lambda_{k+2}, \ldots, \lambda_p$ are small. We
might approximate Σ by:
$$\sum_{i=1}^{k} \lambda_i e_i e_i'$$
Again, this will become more useful when we talk about factor analysis.
Note, we defined the total variation of X as the trace of the variance covariance
matrix, or if you like, the sum of the variances of the individual variables. This is
also equal to the sum of the eigenvalues as shown below:
$$\operatorname{trace}(\Sigma) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_p^2 = \lambda_1 + \lambda_2 + \cdots + \lambda_p$$
This will give us an interpretation of the components in terms of the amount of the
full variation explained by each component. The proportion of variation explained
by the ith principal component is then going to be defined to be the eigenvalue for
that component divided by the sum of the eigenvalues. In other words, the ith
principal component explains the following proportion of the total variation:
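$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$$
A related quantity is the proportion of variation explained by the first k principal
components. This is the sum of the first k eigenvalues divided by the total
variation:
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$$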
Naturally, if the proportion of variation explained by the first k principal components
is large, then not much information is lost by considering only the first k principal
components.
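As a minimal sketch of the two proportions just defined, the following Python
function takes a list of eigenvalues (ordered from largest to smallest) and returns
the proportion explained by each component and by the first k components. The
eigenvalues used here are hypothetical and chosen only to make the arithmetic easy
to follow; they are not taken from the transaction data.

```python
def explained_variance(eigenvalues):
    """Return (proportions, cumulative) for eigenvalues sorted largest first."""
    total = sum(eigenvalues)  # equals the trace of the covariance matrix
    proportions = [lam / total for lam in eigenvalues]
    cumulative = []
    running = 0.0
    for share in proportions:
        running += share
        cumulative.append(running)
    return proportions, cumulative


# Hypothetical eigenvalues of a 4-variable covariance matrix (illustration only).
eigenvalues = [4.0, 3.0, 2.0, 1.0]
proportions, cumulative = explained_variance(eigenvalues)
print(proportions)  # share of the total variation carried by each component
print(cumulative)   # share carried by the first k components, k = 1..4
```

If the cumulative value for a small k is already close to one, little information
is lost by keeping only those k components.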
Why It May Be Possible to Reduce Dimensions
When we have correlations (multicollinearity) between the x variables, the data may
more or less fall on a line or plane in a lower number of dimensions. For instance,
imagine a plot of two x variables that have a nearly perfect correlation. The data
points will fall close to a straight line.
All of this is defined in terms of the population variance covariance matrix Σ, which
is unknown. However, we may estimate Σ by the sample variance covariance matrix,
which is given by the standard formula:
$$S = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{X}_i - \bar{\mathbf{x}})(\mathbf{X}_i - \bar{\mathbf{x}})'$$
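The formula for S can be sketched in a few lines of Python. The data below are
hypothetical: two variables that are perfectly correlated, so that the points lie
exactly on a line. The helper `eigenvalues_2x2` is my own illustrative function for
the 2 x 2 symmetric case only (it uses the trace and determinant); a general
analysis would rely on a numerical library instead.

```python
import math


def sample_covariance(rows):
    """Sample variance covariance matrix S, using the n - 1 divisor."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    S = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [row[j] - means[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                S[a][b] += dev[a] * dev[b]
    return [[S[a][b] / (n - 1) for b in range(p)] for a in range(p)]


def eigenvalues_2x2(S):
    """Eigenvalues of a symmetric 2 x 2 matrix, largest first."""
    trace = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    root = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    return (trace + root) / 2.0, (trace - root) / 2.0


# Hypothetical data: the second variable is exactly twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
S = sample_covariance(data)
lam1, lam2 = eigenvalues_2x2(S)
print(S)           # [[1.0, 2.0], [2.0, 4.0]]
print(lam1, lam2)  # 5.0 0.0
```

The second eigenvalue is zero, so a single principal component carries all of the
variation, which is the dimension reduction described above in its most extreme form.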
Procedure:
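Compute the eigenvalues $\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_p$ of the sample variance covariance matrix S, and the
corresponding eigenvectors $\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p$; then we will define our estimated principal components
using the eigenvectors as our coefficients:
$$\hat{Y}_1 = \hat{e}_{11} X_1 + \hat{e}_{12} X_2 + \cdots + \hat{e}_{1p} X_p$$
$$\hat{Y}_2 = \hat{e}_{21} X_1 + \hat{e}_{22} X_2 + \cdots + \hat{e}_{2p} X_p$$
$$\vdots$$
$$\hat{Y}_p = \hat{e}_{p1} X_1 + \hat{e}_{p2} X_2 + \cdots + \hat{e}_{pp} X_p$$
Generally, we only retain the first k principal components. Here we must balance two
conflicting desires: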
1. To obtain the simplest possible interpretation, we want k to be as small as
possible. If we can explain most of the variation with just two principal components,
then this would give us a much simpler description of the data. However, the smaller k is,
the smaller the amount of variation explained by the first k components.
2. To avoid loss of information, we want the proportion of variation explained
by the first k principal components to be large, ideally as close to one as possible;
i.e., we want
$$\frac{\hat\lambda_1 + \hat\lambda_2 + \cdots + \hat\lambda_k}{\hat\lambda_1 + \hat\lambda_2 + \cdots + \hat\lambda_p} \cong 1$$
3.3.3 Standardize the Variables
According to Baxter, R., and Hawkins, S., (2002), if raw data is used principal
component analysis will tend to give more emphasis to those variables that have
higher variances than to those variables that have very low variances. In effect the
results of the analysis will depend on what units of measurement are used to measure
each variable. That would imply that a principal component analysis should only be
used with the raw data if all variables have the same units of measure. And even in
this case, only if you wish to give those variables which have higher variances more
weight in the analysis.
Summary
The results of principal component analysis depend on the scales at which the
variables are measured. Variables with the highest sample variances will tend to be
emphasized in the first few principal components.
Principal Component analysis using the covariance function should only be
considered if all of the variables have the same units of measurement.
If the variables either have different units of measurement (e.g., pounds, feet, gallons,
etc.), or if we wish each variable to receive equal weight in the analysis, then the
variables should be standardized before a principal component analysis is carried
out. Standardize each variable by subtracting its mean and dividing by its standard
deviation:
$$Z_{ij} = \frac{X_{ij} - \bar{x}_j}{s_j}$$
Where,
$X_{ij}$ = Data for variable j in sample unit i
$\bar{x}_j$ = Sample mean for variable j
$s_j$ = Sample standard deviation for variable j
We will now perform the principal component analysis using the standardized data.
Note: the variance covariance matrix of the standardized data is equal to the
correlation matrix for the unstandardized data. Therefore, principal component
analysis using the standardized data is equivalent to principal component analysis
using the correlation matrix.
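The note above can be checked numerically. The sketch below standardizes two
hypothetical variables measured on very different scales (the names `amount` and
`hour` and their values are illustrative assumptions, not the thesis data) and
confirms that the covariance of the standardized variables equals the correlation
of the raw variables.

```python
import statistics


def standardize(column):
    """Z-scores: subtract the sample mean, divide by the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # n - 1 divisor
    return [(x - mean) / sd for x in column]


def sample_cov(u, v):
    """Sample covariance of two equally long columns (n - 1 divisor)."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def sample_corr(u, v):
    """Sample correlation of two columns."""
    return sample_cov(u, v) / (statistics.stdev(u) * statistics.stdev(v))


# Hypothetical raw variables on different scales.
amount = [120.0, 5000.0, 75.0, 20000.0, 310.0]
hour = [9.0, 14.0, 2.0, 23.0, 11.0]

z_amount, z_hour = standardize(amount), standardize(hour)
print(sample_cov(z_amount, z_hour))  # covariance of the standardized data
print(sample_corr(amount, hour))     # correlation of the raw data (same value)
```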
Principal Component Analysis Procedure
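The principal components are first calculated by obtaining the eigenvalues of the
sample correlation matrix R:
$$\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_p$$
Here $\hat\lambda_1, \ldots, \hat\lambda_p$ denote the eigenvalues of the sample correlation matrix R, and the
corresponding eigenvectors are
$$\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p$$
Then the estimated principal component scores are calculated using formulas similar
to before, but instead of using the raw data we use the standardized data in the
formulae below:
$$\hat{Y}_1 = \hat{e}_{11} Z_1 + \hat{e}_{12} Z_2 + \cdots + \hat{e}_{1p} Z_p$$
$$\hat{Y}_2 = \hat{e}_{21} Z_1 + \hat{e}_{22} Z_2 + \cdots + \hat{e}_{2p} Z_p$$
$$\vdots$$
$$\hat{Y}_p = \hat{e}_{p1} Z_1 + \hat{e}_{p2} Z_2 + \cdots + \hat{e}_{pp} Z_p$$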
3.3.4. Measures of Association for Continuous Variables
According to Johnson and Wichern, the following standard notation is generally used:
$X_{ik}$ = Response for variable k in sample unit i (the number of individual observations at site i)
$n$ = Number of sample units
$p$ = Number of variables
Johnson and Wichern list four different measures of association (similarity) that are
frequently used with continuous variables in cluster analysis:
Euclidean Distance - This is used most commonly. For instance, in two dimensions, we
can plot the observations in a scatter plot, and simply measure the distances between the
pairs of points. More generally we can use the following equation:
$$d(\mathbf{X}_i, \mathbf{X}_j) = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2}$$
This is the square root of the sum of the squared differences between the measurements
for each variable.
Some other distances use a similar concept. For instance, the Minkowski Distance is:
$$d(\mathbf{X}_i, \mathbf{X}_j) = \left[ \sum_{k=1}^{p} |X_{ik} - X_{jk}|^m \right]^{1/m}$$
Here the square is replaced by raising the difference to the power m and, instead of
taking the square root, we take the mth root.
Here are two other methods for measuring association:
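Canberra Metric
$$d(\mathbf{X}_i, \mathbf{X}_j) = \sum_{k=1}^{p} \frac{|X_{ik} - X_{jk}|}{X_{ik} + X_{jk}}$$
Czekanowski Coefficient
$$d(\mathbf{X}_i, \mathbf{X}_j) = 1 - \frac{2 \sum_{k=1}^{p} \min(X_{ik}, X_{jk})}{\sum_{k=1}^{p} (X_{ik} + X_{jk})}$$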
For each of these distance measures, the smaller the distance, the more similar (more
strongly associated) are the two subjects.
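The four measures above can be sketched directly from their definitions. The two
observation vectors below are hypothetical, and the Canberra and Czekanowski
functions assume positive-valued data so that the denominators are non-zero.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, m):
    """m-th root of the sum of absolute differences raised to the power m."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)


def canberra(x, y):
    """Sum of |difference| / sum for each variable (assumes a + b > 0)."""
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y))


def czekanowski(x, y):
    """One minus twice the sum of minima over the sum of all values."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(a + b for a, b in zip(x, y))
    return 1.0 - 2.0 * shared / total


# Two hypothetical observations on three positive-valued variables.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(euclidean(x, y))     # 5.0
print(minkowski(x, y, 1))  # 7.0
print(canberra(x, y))      # 3/5 + 4/8 + 0/6
print(czekanowski(x, y))   # 1 - 12/19
```

With m = 2 the Minkowski distance reduces to the Euclidean distance.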
The measure of association must satisfy the following properties:
1. Symmetry
$$d(\mathbf{X}_i, \mathbf{X}_j) = d(\mathbf{X}_j, \mathbf{X}_i)$$
i.e., the distance between subject one and subject two must be the same as the distance
between subject two and subject one.
2. Positivity
$$d(\mathbf{X}_i, \mathbf{X}_j) > 0, \text{ if } \mathbf{X}_i \ne \mathbf{X}_j$$
i.e., the distances must be positive; negative distances are not allowed.
3. Identity
$$d(\mathbf{X}_i, \mathbf{X}_j) = 0, \text{ if } \mathbf{X}_i = \mathbf{X}_j$$
i.e., the distance between a subject and itself should be zero.
4. Triangle inequality
$$d(\mathbf{X}_i, \mathbf{X}_k) \le d(\mathbf{X}_i, \mathbf{X}_j) + d(\mathbf{X}_j, \mathbf{X}_k)$$
This follows from geometric considerations: the sum of two sides of a
triangle cannot be smaller than the third side.
3.3.5. Agglomerative Hierarchical Clustering
Combining Clusters in the Agglomerative Approach
In the agglomerative hierarchical approach, we start by defining each data point to be a
cluster and combine existing clusters at each step (Bates, S., and Saker, H., 2006).
Here are five different methods for doing this:
1. Single Linkage: In single linkage, we define the distance between two clusters to be
the minimum distance between any single data point in the first cluster and any
single data point in the second cluster. On the basis of this definition of distance
between clusters, at each stage of the process we combine the two clusters that have
the smallest single linkage distance.
2. Complete Linkage: In complete linkage, we define the distance between two clusters
to be the maximum distance between any single data point in the first cluster and any
single data point in the second cluster. On the basis of this definition of distance
between clusters, at each stage of the process we combine the two clusters that have
the smallest complete linkage distance.
3. Average Linkage: In average linkage, we define the distance between two clusters
to be the average distance between data points in the first cluster and data points in
the second cluster. On the basis of this definition of distance between clusters, at
each stage of the process we combine the two clusters that have the smallest
average linkage distance.
4. Centroid Method: In centroid method, the distance between two clusters is the
distance between the two mean vectors of the clusters. At each stage of the process
we combine the two clusters that have the smallest centroid distance.
5. Ward’s Method: This method does not directly define a measure of distance between
two points or clusters. It is an ANOVA based approach. At each stage, the two
clusters that are merged are those giving the smallest increase in the combined error sum of
squares from the one-way univariate ANOVAs that can be done for each variable, with
groups defined by the clusters at that stage of the process.
According to Pinheiro, R., and Bates, S., (2000), none of these methods is uniformly
the best. In practice, it is advisable to try several methods and then compare the results
to form an overall judgment about the final formation of clusters.
Notationally, define:
$\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k$ = Observations from cluster 1
$\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_l$ = Observations from cluster 2
$d(\mathbf{x}, \mathbf{y})$ = Distance between observation vector $\mathbf{x}$ and observation vector $\mathbf{y}$
3.4.1. Linkage Methodology or Measuring Association between Cluster 1 and 2 ($d_{12}$)
1. Single Linkage: $d_{12} = \min_{i,j} d(\mathbf{X}_i, \mathbf{Y}_j)$. This is the distance between the closest
members of the two clusters.
2. Complete Linkage: $d_{12} = \max_{i,j} d(\mathbf{X}_i, \mathbf{Y}_j)$. This is the distance between the members
that are farthest apart (most dissimilar).
3. Average Linkage: $d_{12} = \frac{1}{kl} \sum_{i=1}^{k} \sum_{j=1}^{l} d(\mathbf{X}_i, \mathbf{Y}_j)$. This method involves looking at the
distances between all pairs and averaging all of these distances. This is also called the Unweighted
Pair Group Method with Arithmetic Mean (UPGMA).
4. Centroid Method: $d_{12} = d(\bar{\mathbf{x}}, \bar{\mathbf{y}})$. This involves finding the mean vector location for
each of the clusters and taking the distance between these two centroids (Vesanto, J.,
& Alhoniemi, E., 2000).
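The four linkage definitions can be sketched as follows, assuming the Euclidean
distance as the underlying measure $d$. The two small clusters are hypothetical
and serve only to show how the four definitions differ on the same data.

```python
import math


def dist(x, y):
    """Euclidean distance between two observation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(cluster):
    """Mean vector of a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def single_linkage(c1, c2):
    return min(dist(x, y) for x in c1 for y in c2)


def complete_linkage(c1, c2):
    return max(dist(x, y) for x in c1 for y in c2)


def average_linkage(c1, c2):
    return sum(dist(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def centroid_linkage(c1, c2):
    return dist(centroid(c1), centroid(c2))


# Two hypothetical clusters in two dimensions.
cluster_1 = [[0.0, 0.0], [0.0, 2.0]]
cluster_2 = [[3.0, 0.0], [3.0, 2.0]]

print(single_linkage(cluster_1, cluster_2))    # closest pair: 3.0
print(complete_linkage(cluster_1, cluster_2))  # farthest pair: sqrt(13)
print(average_linkage(cluster_1, cluster_2))   # mean over all four pairs
print(centroid_linkage(cluster_1, cluster_2))  # distance between (0, 1) and (3, 1): 3.0
```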
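3.4.2 Applying the Gaussian Distribution to Developing Anomaly Detection Algorithms:
Han et al. and Andrew Ng (2012)
Let $x \in \mathbb{R}$, assuming $x \sim N(\mu, \sigma^2)$, such that the probability density function of
$x$ is given by:
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),$$
where $\mu$ and $\sigma^2$ are unknown. Thus, to estimate $\mu$ and $\sigma^2$, we have
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \quad \text{and} \quad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu \right)^2 .$$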
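3.4.3 Now, to develop the anomaly detection algorithm,
let the training set be $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$, the features from m users, where each $x \in \mathbb{R}^n$, such
that $x$ is a vector. Thus we can model the probability of $x$, $p(x)$, based on all the features
$x_1, x_2, x_3, \ldots, x_n$, with the assumption that $x_j \sim N(\mu_j, \sigma_j^2)$, such that
$p(x_j) \sim p(x_j; \mu_j, \sigma_j^2)$, although in machine learning practice it does not necessarily follow
that the features are independently and identically distributed.
Consequently, it follows that:
$$x_1 \sim N(\mu_1, \sigma_1^2)$$
$$x_2 \sim N(\mu_2, \sigma_2^2)$$
$$\vdots$$
$$x_n \sim N(\mu_n, \sigma_n^2)$$
such that the joint probability density function is $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$.
This can be written in expanded form as:
$$p(x) = p(x_1; \mu_1, \sigma_1^2) \cdot p(x_2; \mu_2, \sigma_2^2) \cdot p(x_3; \mu_3, \sigma_3^2) \cdots p(x_n; \mu_n, \sigma_n^2)$$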
3.4.4 Anomaly Detection Algorithms Steps:
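If a user is a suspect, the following steps should be followed:
1. Choose features $x_j$ that are likely to be indicative of anomalous examples.
2. Fit the parameters $\mu_1, \mu_2, \ldots, \mu_n$; $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, such that
$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)} \quad \text{estimates } \mu_1, \mu_2, \ldots, \mu_n$$
and
$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2 \quad \text{estimates } \sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2 ,$$
and compute $p(x_j; \mu_j, \sigma_j^2)$.
3. Given a new example $x$, compute $p(x)$, such that:
$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right)$$
4. The example is an anomaly if and only if $p(x) < \varepsilon$.
5. Plot the graph of $p(x_1; \mu_1, \sigma_1^2) \cdot p(x_2; \mu_2, \sigma_2^2) \cdot p(x_3; \mu_3, \sigma_3^2) \cdots p(x_n; \mu_n, \sigma_n^2)$.
6. Let us assume $\varepsilon = 0.02$; then:
a. If $p(x_{test}^{(1)}) = 0.0426 \ge \varepsilon$, then the example is not an anomaly; but if
b. $p(x_{test}^{(2)}) = 0.0021 < \varepsilon$, then that is an indication of a red flag, and the example is
flagged as an anomaly.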
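The steps above can be sketched as follows. This is a minimal illustration, not the
implementation used in this research work: the training rows are hypothetical
two-feature examples, and the function names are my own. Only the threshold
ε = 0.02 is taken from the text.

```python
import math


def fit_gaussian(train):
    """Step 2: per-feature mean and variance, both with the 1/m divisor."""
    m, n = len(train), len(train[0])
    mu = [sum(row[j] for row in train) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in train) / m for j in range(n)]
    return mu, sigma2


def density(x, mu, sigma2):
    """Step 3: product of the univariate Gaussian densities of the features."""
    p = 1.0
    for xj, mj, s2 in zip(x, mu, sigma2):
        p *= math.exp(-((xj - mj) ** 2) / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
    return p


def is_anomaly(x, mu, sigma2, epsilon):
    """Step 4: flag the example when p(x) falls below epsilon."""
    return density(x, mu, sigma2) < epsilon


# Hypothetical training set of normal examples with two features.
train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0], [2.0, 8.0]]
mu, sigma2 = fit_gaussian(train)
epsilon = 0.02  # threshold assumed in step 6

print(mu, sigma2)
print(is_anomaly([2.0, 10.0], mu, sigma2, epsilon))  # example at the mean: not flagged
print(is_anomaly([5.0, 20.0], mu, sigma2, epsilon))  # far from the mean: flagged
```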
3.4.5 The Importance of Real-Number Evaluation:
When developing a learning algorithm (choosing features, etc.), decision making is
much easier if we have a way of evaluating the learning algorithm:
1. Choose the features to use and include in the learning model.
2. Decide how to evaluate them, and improve the algorithm by deciding which features
to include and which features to leave out.
3. Assume we have some labelled data of anomalous and non-anomalous examples,
with $y = 0$ if normal and $y = 1$ if anomalous.
4. In the process of developing and evaluating the algorithm, the data sets are, for example:
a. Training Set: $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$, assumed to be normal examples, none of which is
anomalous
b. Cross Validation Set: $\{(x_{cv}^{(1)}, y_{cv}^{(1)}), (x_{cv}^{(2)}, y_{cv}^{(2)}), \ldots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})\}$
c. Test Set: $\{(x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), \ldots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})\}$
5. It is necessary to include a few anomalous ($y = 1$) examples in the Test Set and Cross
Validation Set.
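3.4.6 Algorithm Evaluation:
Step 1. Fit the model $p(x)$ on the Training Set: $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$
Step 2. Evaluate the fitted model $p(x)$ on the Cross Validation Set:
$\{(x_{cv}^{(1)}, y_{cv}^{(1)}), (x_{cv}^{(2)}, y_{cv}^{(2)}), \ldots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})\}$
Step 3. Evaluate the fitted model $p(x)$ on the Test Set:
$\{(x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), \ldots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})\}$
Step 4. Predict
$$y = \begin{cases} 1, & \text{if } p(x) < \varepsilon \text{ (anomaly)} \\ 0, & \text{if } p(x) \ge \varepsilon \text{ (normal)} \end{cases}$$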
Possible Evaluation Metrics (Ref: 3.2.3)
1. TP, the number of true positives
2. FP, the number of false positives
3. FN, the number of false negatives
4. TN, the number of true negatives
5. Precision/Recall
6. F1-Score. Note that the Cross Validation Set can be used to estimate and choose $\varepsilon$.
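The use of these metrics to choose ε on the Cross Validation Set can be sketched as
below. The density values and labels are hypothetical (two of the densities reuse
the 0.0426 and 0.0021 quoted in Section 3.4.4), and the candidate thresholds are
assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN), treating y = 1 (anomalous) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall (0.0 when there are no true positives)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def choose_epsilon(p_cv, y_cv, candidates):
    """Pick the candidate threshold with the highest F1-score on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1


# Hypothetical p(x) values and labels for a small cross validation set.
p_cv = [0.0426, 0.0021, 0.0900, 0.0150, 0.0005, 0.0600]
y_cv = [0, 1, 0, 0, 1, 0]
best_eps, best_f1 = choose_epsilon(p_cv, y_cv, [0.001, 0.01, 0.02, 0.05])
print(best_eps, best_f1)
```

The F1-score is preferred here over plain accuracy because anomalous examples are
rare, so a model that always predicts y = 0 would still appear highly accurate.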
3.4.7 Now, given the training, cross validation and test sets, the algorithm evaluation is
computed as follows:
1. Think of $(x_{test}^{(i)}, y_{test}^{(i)})$, where $x$ = features of user i's transaction activities,
predicting
$$y = \begin{cases} 1, & \text{if } p(x) < \varepsilon \text{ (anomaly)} \\ 0, & \text{if } p(x) \ge \varepsilon \text{ (normal)} \end{cases}$$
so that we obtain the predicted label y.
2. The algorithm's label y is either normal or anomalous.
3. However, for the transaction data set used in this research work, there are many more
$y = 0$ (normal) examples than $y = 1$ (anomalous) examples. In addition, looking at the
normality test and the histogram plot of the dependent variable, we can see that the
dependent feature, transactionAmount, is highly skewed.
4. Thus classification accuracy may not necessarily be a good evaluation metric because of the
skew, and the data are transformed to meet the normality
assumption.
3.4.8 Suggestions and Guidelines on how to Design or Choose Features for Anomaly
Detection Algorithms:
1. Plot the histogram of each candidate feature from the available data set to check whether it is
normally distributed.
2. If it is normal, fit the algorithm's model; otherwise, transform it by taking the log or any other
appropriate function, and check again whether the histogram plot validates the normality
assumption.
3. Define the transformed feature as the new X and use it in place of the previous variable X.
4. Then fit the anomaly detection algorithm as stated earlier.
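Step 2 can be sketched numerically by comparing the skewness of a feature before
and after a log transform. The amounts below are hypothetical, the base-10 log is
one possible choice of transform, and the moment-based skewness coefficient is used
here only as a simple stand-in for inspecting the histogram.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical, strongly right-skewed transaction amounts.
amounts = [100.0, 150.0, 200.0, 250.0, 400.0, 900.0, 5000.0, 50000.0]
log_amounts = [math.log10(a) for a in amounts]  # requires strictly positive values

print(skewness(amounts))      # large positive skew on the raw scale
print(skewness(log_amounts))  # noticeably smaller after the log transform
```

If the transformed feature is still clearly skewed, another transform can be tried
before the feature is passed to the anomaly detection model.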
3.5.7 Data Pre-Processing for Fraud Detection
Deploying the unsupervised K-Mean clustering algorithm directly from the mathematical and
algorithmic steps and procedures suggested in the literature could be too demanding and
unrealistic, even for an expert R package user.
Consequently, I sourced a graphical user interface package, Rattle, for easy
manipulation and implementation, based on the guidelines and suggestions of Williams,
Graham, Data Mining with Rattle and R, s.l., Springer.
The problem at hand contains a large amount of data with no prior known features that
can be used for classification. Clustering the data into different groups and trying to
understand the behaviour of each group is suggested as a methodology for modelling the user
behavioural pattern of the transaction data sets. Thus, I explore the dtrans_data and
Aggdtrans_data sets with R/Rattle to validate the legitimate user behavioural model. The
algorithm chosen for clustering the transaction data is the K-mean algorithm, and the tools for the
implementation are R and Rattle. The following sections present the algorithm
used for clustering and the tools used for implementing the solution.
3.5.8 K-means algorithm
K-means is one of the simplest unsupervised clustering algorithms. It partitions the data
set into k clusters using the cluster mean values, so that the resulting intra-cluster
similarity is high and the inter-cluster similarity is low.
K-means is iterative in nature and follows these steps:
1. Arbitrarily generate k points (cluster centres), k being the number of clusters desired.
2. Calculate the distance between each of the data points and each of the centres, and
assign each point to the closest centre.
3. Calculate the new cluster centre by calculating the mean value of all data points in the
respective cluster.
4. With the new centres, repeat step 2. If the assignment of cluster for the data points
changes, repeat step 3; else stop the process.
The distance between the data points is calculated using the Euclidean distance. For two
points or features X1 = (x11, x12, ..., x1m) and X2 = (x21, x22, ..., x2m), the Euclidean distance is:
$$\mathrm{Dist}(X_1, X_2) = \sqrt{\sum_{j=1}^{m} (x_{1j} - x_{2j})^2}$$
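The steps above can be sketched in plain Python as follows. This is an illustrative
version only, not the R/Rattle implementation used in this work: the initial centres
are drawn at random from the data points (one common way of generating them
arbitrarily), and the six two-dimensional points are hypothetical.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centres, assignment) after iterating the K-means steps."""
    rng = random.Random(seed)
    # Step 1: arbitrary initial centres (assumption: k distinct data points).
    centres = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centre.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centres[c])) for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centre as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assignment


# Hypothetical data: two well separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centres, assignment = kmeans(points, k=2)
print(sorted(centres))
print(assignment)
```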
Advantages
• Fast, robust and easy to understand.
• Relatively efficient: O(t k n d), where n is the number of objects, k is the number of clusters,
d is the dimension of each object, and t is the number of iterations. Normally k, t, d ≪ n.
• Gives the best results when the clusters in the data set are distinct or well separated from each other.
Disadvantages
• The learning algorithm requires a priori specification of the number of cluster centres.
• The learning algorithm only finds a local optimum of the squared error function.
• Applicable only when the mean is defined, i.e. it fails for categorical data.
• Unable to handle noisy data and outliers.
3.6.3 Strategies for data reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost
the same) analytical results. Strategies for data reduction include the following:
• Data aggregation, where aggregation operations are applied to the data in the construction of
optimal data variables and features for the analysis (Bruker Daltonics).
• Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or
dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters
instead of the actual data) or nonparametric methods such as clustering, sampling, and the use
of histograms.
• Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity
reduction that is very useful for the automatic generation of concept hierarchies.
Discretization and concept hierarchy generation are powerful tools for data mining, in that
they allow the mining of data at multiple levels of abstraction.
From the above data reduction strategies, attribute subset selection has been
selected for the data cleaning and transformation step in the typical Rattle work flow.
CHAPTER FOUR
4.0 RESULTS AND ANALYSIS
In this chapter, I present the results of the experimental deployment and practical evaluation
of K-Mean Clustering Analysis and Principal Component Analysis, covering the procedures for
computing the presence of outliers using various distance measures and the general detection
performance of unsupervised machine learning, with a view to how to design and choose features and
carry out electronic transaction fraud detection.
4.1 Descriptive Statistics Visualization for Probability Model
Suitable Variables for the Models (Response Variable)
The response variable was initially unsuitable for the proposed model because it was highly
skewed, so I needed to transform transactionNairaAmount. As we can see, the histogram of
the transformed variable with the normality curve is displayed above.
4.2 Iteration 1: Applying K-Mean Cluster Analysis on the dtrans_data:
Prediction before manipulation and transformation of the data field variables:
K-Means Clustering is a cluster analysis which identifies groups within a dataset. The
K-Means clustering algorithm will search for K clusters, where K is specified. The resulting K clusters
are represented by the mean or average values of each of the variables.
By default, K-Means only works with numeric variables (Han et al., Data Mining, 2012).
The result output is displayed below (List of figures 4.2.3):
Cluster centres:
Cluster  transactionNairaAmount  transactionDate  transactionTime  localHour
1        0.0009903937            0.54419590       0.7084763        0.4720105
2        0.0001173569            0.88409986       0.6963829        0.4688121
3        0.0016224521            0.26661406       0.8541757        0.7489533
4        0.0011566641            0.19543350       0.4618868        0.1810789
5        0.0072892047            0.26520170       0.1615614        0.7907233
6        0.0011805694            0.55722499       0.3504528        0.1709038
7        0.0001121783            0.84923575       0.3917612        0.2103045
8        0.0001213718            0.09760178       0.7445254        0.5105361
9        0.0017597340            0.74819760       0.1186837        0.8762319
10       0.0037814491            0.79978916       0.8846654        0.7172179
Within cluster sum of squares (clusters 1 to 10):
25.422345  17.294585  24.974340  19.229342  24.111857
26.248646  26.344331  52.941855  16.483182   9.860579
The cluster centre table above summarises the measure of association or linkage between
two clusters. This involves finding the mean vector location for each of the clusters and taking
the distance between the two centroids. First, the initial cluster centroids are randomly
selected from the four variables. The first row gives the initial cluster centres; the procedure
then works iteratively. The within-cluster sum of squares table summarises the nearest neighbours
between two distinct clusters based on the initial table, the cluster centre table. For instance,
from the table above, it seems that cluster 3 is in the middle, because seven (7) of the clusters
(1, 2, 4, 6, 8, 9 and 10) are closest to cluster 3 and not to any other cluster.
Implication: The principal purpose is to look at the cluster means to identify the significant
explanatory transaction variables based on the cluster centres. We can see from row
3 of the cluster centre table that transactionTime has the highest cluster centre value,
followed by localHour, and so on. Besides, from the tables above it is now clear that
cluster 3 is the nearest neighbour to cluster 10, based on the values of the best explanatory
cluster variables (0.8541757 against 0.8846654 for transactionTime, and 0.7489533 against
0.7172179 for localHour). Furthermore, the graphical display of the score plot in the later
analysis will validate this more explicitly.
After the clusters have been built, the discriminant plot is displayed below:
The discriminant coordinate figure above gives a visual representation of the cluster
sizes for the ten clusters previously explained, which together account for 53.69% of the
point variability, as shown in the figure. Cluster sizes vary from cluster to cluster, with 426
as the size of the smallest cluster and 1133 as the size of the biggest cluster.
Refer to List of figures 4.4.2 for the remaining cluster sizes.
(Vesanto, J., & Alhoniemi, E.)
4.4 The result of Iteration 2 on the transformed dtrans_data is displayed below:
Data means:
R10transactionNairaAmount  transactionDate  RRK_transactionTime  RRK_localHour
0.2860758                  0.4936329        0.5011834            0.5020773
The Data means table: Recall that the principal purpose is to look at the
cluster means to identify the best explanatory transaction variables based
on the cluster centres. This involves finding the mean vector location for each of the clusters
and taking the distance between the two centroids. Since the distance between two clusters is
the distance between the two mean vectors of the clusters, we can see from the data means
table that transactionTime and localHour have the shortest mean distance apart, with data
means of 0.5011834 and 0.5020773, respectively. Generally, according to Vesanto, J.,
& Alhoniemi, E., at each stage we combine the two clusters that have the smallest centroid
distance.
Implication: This further ascertains, from the tables above, that transactionTime and
localHour are key explanatory variables.
Cluster centres:
Cluster  R10_transactionNairaAmount  transactionDate  RRK_transactionTime  RRK_localHour
1        0.2808516                   0.6300578        0.7018665            0.5124956
2        0.2815987                   0.6524175        0.2876451            0.1057152
3        0.2512386                   0.2537390        0.8967377            0.7156922
4        0.2280972                   0.1744212        0.3881951            0.1435193
5        0.5577502                   0.3500956        0.2615866            0.6418214
6        0.3368204                   0.5052231        0.5856259            0.2270445
7        0.2945089                   0.9007382        0.5553950            0.3095079
8        0.2351319                   0.1072978        0.7326407            0.4686039
9        0.2425915                   0.4943533        0.1179586            0.8835740
10       0.2755736                   0.8378023        0.8836406            0.6828752
Within cluster sum of squares (clusters 1 to 10):
30.05577  34.28323  31.54214  22.94297  32.45812
36.23424  38.23785  63.03053  44.04175  29.80263
The cluster centre table for iteration 2 above similarly summarises the measure of
association or linkage between two clusters. This involves finding the mean vector location
for each of the clusters and taking the distance between the two centroids, just as in the first
iteration. First, the initial cluster centroids are randomly selected from the four variables.
The first row gives the initial cluster centres; the procedure then works iteratively. The
within-cluster sum of squares table summarises the nearest neighbours between two distinct clusters
based on the initial table, the cluster centre table. From the table above, it is clear that cluster 3
is in the middle, because seven (7) of the clusters (1, 2, 4, 6, 8, 9 and 10) are closest to cluster
3 and not to any other cluster, in confirmation of iteration 1.
Implication: The principal purpose is to look at the cluster means to identify the significant
explanatory transaction variables based on the cluster centres. We can see from row
3 of the cluster centre table that transactionTime has the highest cluster centre value,
followed by localHour, and so on. Besides, from the tables above it is now clear that
cluster 3 is the nearest neighbour to cluster 10, based on the values of the best explanatory
cluster variables (0.8967377 against 0.8836406 for transactionTime, and 0.7156922 against
0.6828752 for localHour). Similarly, the graphical display of the score plot in the later
analysis will validate this more explicitly.
4.5 Iteration 2: Applying K-Mean Cluster Analysis on the transformed dtrans_data:
Prediction after manipulation and transformation of the algorithm variables:
4.6 Now the model has been enhanced, and it explains 61.12% of the point variability.
PCA is a tool to reduce multidimensional data to lower dimensions while retaining most
of the information. Because PCA is a transformation of the old coordinate system (peaks)
into the new coordinate system (PCs), it can be estimated how much each of the old
coordinates (peaks) contributes to each of the new ones (PCs). These values are called
loadings. The higher the loading of a particular peak onto a PC, the more it contributes to that
PC. (Vesanto, J., & Alhoniemi, E., 2000)
4.7 Finalizing on the Desired Variables using Principal Component Analysis (PCA)
Note that principal components are calculated on only the numeric variables, and so we
cannot use this approach to remove categorical variables from consideration. Any numeric
variables with relatively large rotation values (negative or positive) in any of the first few
components are generally variables that I may wish to include in the modelling (List of
figures 4.3.2). The next three (3) tables are more constructive and
consequential when considered in view of one another. (D.L. Massart and Y. Vander
Heyden)
Standard deviations:
PC1        PC2        PC3        PC4
1.4158685  1.0254212  0.9915365  0.9681633
Rotation:
                           PC1          PC2         PC3          PC4
R10transactionNairaAmount   0.06134073  0.66102682   0.43958358  -0.60494127
TransactionDate            -0.07433338  0.70342333  -0.8119341    0.70217871
RRK_transactionTime        -0.69456341  0.01979713  -0.09915599  -0.10961678
RRK_localHour               0.14005622  0.26000575  -0.88842588  -0.34803941
Importance of Components:
                        PC1     PC2     PC3     PC4
Standard Deviation      1.4159  1.0254  0.9915  0.9682
Proportion of variance  0.4009  0.2103  0.1966  0.1875
Cumulative Proportion   0.4009  0.6112  0.8079  0.9953
Interpretations:
The loadings for the principal components are given in the Rotation table. This is a
matrix in which the first column contains the loadings for the first principal component,
the second column contains the loadings for the second principal component, and so on.
From the Rotation table above, the first principal component (PC1) has its highest
loading (in absolute value) on transactionTime. The loadings for transactionDate and
transactionTime are negative, while that of localHour is positive, in view of the
transactionNairaAmount. Consequently, the implication of the first principal component is
that transactionTime contributes most to PC1, which gives the direction of the highest
variance; PC1 therefore represents a contrast between the explanatory variables
(transactionDate and transactionTime against localHour) in relation to the response
variable, transactionNairaAmount. However, the second principal component (PC2) has its
highest loadings on transactionDate and localHour; thus, the contrast is mainly between
transactionDate and localHour.
Implication: The original variables are represented in the PC1 and PC2 dimension space, as
will be demonstrated explicitly in the score plot of the PCs in the later
analysis. PC1 represents the resultant of all values projected on the x-axis and is
dominated by transactionTime and, to a lesser extent, by localHour. In contrast, the
y-axis (PC2) is defined by the transactionNairaAmount and is dominated by
transactionDate and, to a lesser extent, by localHour. Consequently, the transactions would
be ranked according to PC1, with the highest-scoring explanatory variables probably being
the best, at least in terms of transactionTime and localHour.
4.8 Determine Number of Components to Retain:
In practice, H0: retain components that each account for at least 5% to 10% of the total variance.
In the Importance of Components output, the row labelled Proportion of variance shows that
the PC1, PC2, PC3 and PC4 columns all give values greater than 10%,
namely approximately 40%, 21%, 20% and 19% respectively.
Similarly, H0: retain the components that together account for at least 70% of the Cumulative
Proportion. In the Importance of Components output, the row labelled Cumulative Proportion
shows that PC1, PC2 and PC3 together give a value greater than
70%, namely approximately 80.79%, and approximately 100% if PC4 were to be included.
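The two retention rules can be sketched as follows, using the Proportion of variance
values reported in the Importance of Components output. The function names are my
own. Note that summing the rounded proportions gives 0.8078 for the first three
components, against the 0.8079 reported in the Cumulative Proportion row; the small
difference comes from rounding.

```python
# Proportion of variance, as reported in the Importance of Components output.
proportion = {"PC1": 0.4009, "PC2": 0.2103, "PC3": 0.1966, "PC4": 0.1875}


def components_above(proportion, minimum):
    """Rule 1: keep every component whose own share is at least `minimum`."""
    return [name for name, share in proportion.items() if share >= minimum]


def components_for_cumulative(proportion, target):
    """Rule 2: keep leading components until their combined share reaches `target`."""
    kept, running = [], 0.0
    for name, share in proportion.items():
        kept.append(name)
        running += share
        if running >= target:
            break
    return kept, running


print(components_above(proportion, 0.10))           # all four components pass the 10% rule
print(components_for_cumulative(proportion, 0.70))  # PC1 to PC3 reach the 70% target
```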
4.8.1 The Loading Plot below reveals the relationship between variables in the space of the
first two components. In the loading plot, we can see that transactionTime and localHour
have similarly heavy loadings for PC1 and PC2, whereas the other variables have heavy
loadings for PC3 and PC4.
The main component variables can be expressed as linear combinations of the original
variables; the eigenvectors table above provides the coefficients for the equations.
I will only express PC1 and PC2 as linear combinations of the original variables, because
these two constitute the combinations of predictor and response variable scores that
contributed the most information in the analysed data sets. (D.L. Massart and Y. Vander
Heyden)
4.8.2 Main Component Variables (PCV) as a linear combination of the original variables
$PC_1 = -0.0743 X_1 + 0.0613 X_2 - 0.6956 X_3 + 0.1401 X_4$
$PC_2 = 0.7034 X_1 + 0.6610 X_2 - 0.0197 X_3 + 0.2600 X_4$
PC1 = the resultant of all values projected in the x-axis
PC2 = the resultant of all values projected in the y-axis
where $X_1$ = transactionDate, $X_2$ = transactionNairaAmount, $X_3$ = transactionTime,
$X_4$ = localHour
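As a worked illustration of the linear combination in 4.8.2, the R sketch below applies the PC1 and PC2 coefficients to a single standardised observation. The coefficients are those of the two equations; the mapping of X1 to X4 onto variable names follows my reading of the text, and the observation `z` is hypothetical, chosen only to make the arithmetic visible.

```r
# Loadings taken from the PC1 and PC2 equations in Section 4.8.2
loadings <- cbind(
  PC1 = c(-0.0743, 0.0613, -0.6956, 0.1401),
  PC2 = c( 0.7034, 0.6610, -0.0197, 0.2600)
)
rownames(loadings) <- c("transactionDate", "transactionNairaAmount",
                        "transactionTime", "localHour")

# Hypothetical standardised observation (illustrative values, not thesis data)
z <- c(transactionDate = 0.5, transactionNairaAmount = 2.0,
       transactionTime = -1.0, localHour = 0.3)

# Each score is the sum of coefficient times standardised value
scores <- drop(z %*% loadings)
print(scores)
```

For this observation the PC1 score is driven almost entirely by the transactionTime term (0.6956 out of a total of about 0.82), which is what "dominated by transactionTime" means in practice.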
Note that the principal component variables now represent the aggregation of the desired
variables that are finally included and used in the final modelling and in the implementation
of electronic transaction fraud detection techniques. The primary multidimensional data sets
have been reduced to lower dimensions while still retaining most of the information.
The subsequent analysis of the score plot of the explanatory variables against the
response variable below will help validate the previous findings and give a better
understanding of the principal component variables.
4.8.3 The score plot of the explanatory variables against the response variable:
The score plot of the explanatory variables against the response variable is displayed above
to visualise the relationship between variables in the space of the principal components.
The interpretation of the axes comes from the analysis of the score plot above, in which the
original variables are represented in the two-dimensional space of PC1 and PC2.
PC1 can be interpreted as the resultant of all the values projected on the x-axis. The
longer the projected vector, the more important its contribution in that dimension.
The origin of the new coordinate system is located at the centre of the data set. The first PC,
that is, PC1, points in the direction of the highest variance and is dominated by the
transactionTime. In contrast, the y-axis (PC2) points in the direction of the second highest
variance and is defined by the transactionNairaAmount, dominated by the transactionDate
and, to a lesser extent, by the localHour, while the coordinate axes stay perpendicular (D.L.
Massart and Y. Vander Heyden).
The implication is that the transactions are ranked according to PC1, with the
highest-scoring explanatory variables, whose loadings have magnitudes 0.6956 and 0.1401
respectively, being the best, at least in terms of the transactionTime and the localHour.
4.9 Model based Anomaly Detection Output:
This is the dataset overview for dtrans_data before applying IDEA (Interactive
Data Exploration Analysis).
We can see that transactionNairaAmount, transactionTime and localHour have been replaced
by the corresponding PC variables. Points can now be identified by the unique identifier,
labelID, and linked by brushing across multiple plots, to check for deviations from the
conventional transaction behavioral model.
In view of the research objectives, I have been able to explore some of the detection
techniques for unsupervised machine learning, such as K-Means Cluster Analysis and
Principal Component Analysis. In addition, the analyses above have helped in perceiving the
users' transaction behaviour patterns, in identifying transactionTime and localHour as the two
major explanatory attributes and key factors that can be worked with in e-banking fraud
detection, and in determining the threshold for identifying the relationship. Not only can we
identify the direction of the slope of the relationship between the Principal Component
Variables, but we can also identify the strength of the relationship, or the degree of the slope.
I shall now proceed to the final stage: computing the presence of outliers using various
distance measures, and the general detection performance based on the previous analysis.
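To make the idea of a distance-based outlier computation concrete, here is a minimal R sketch that uses the Mahalanobis distance. This particular distance is my assumption for illustration; the text does not say which distance measures were used. The data are synthetic, with three extreme amounts planted in the first three rows, and the 0.999 chi-square cutoff is likewise an illustrative choice.

```r
# Synthetic two-variable data (illustrative only, not the thesis data)
set.seed(7)
n <- 500
X <- cbind(logAmount = rnorm(n, mean = 10, sd = 1),
           localHour = rnorm(n, mean = 13, sd = 3))
X[1:3, "logAmount"] <- c(21, 20, 19.5)   # planted extreme amounts

# Squared Mahalanobis distance of every row from the sample centre
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Flag rows whose squared distance exceeds a chi-square quantile
cutoff  <- qchisq(0.999, df = ncol(X))
flagged <- which(d2 > cutoff)
print(flagged)
```

Because the centre and covariance are estimated from data that already contain the outliers, very large contamination can mask itself; a robust covariance estimate would be the usual refinement.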
Identification of Outliers:
4.9.4 3D Plot of the best explanatory variables and the Response Variable.
Outliers are observations which deviate so much from other observations as to arouse
suspicion that they were generated by a different mechanism (Abe, N., Zadrozny, B., and
Langford, J.).
An inspection of the 3D plots displayed above shows how transactionNairaAmount varies with
transactionDate and transactionTime. The date-time is subdivided into six transaction periods,
one per month: April, May, June, July, August and September, since the original data set
features transactions from April to September, that is, between 2015-04-02
01:44:50 and 2015-09-30 23:06:54 [Ref: 3.1.4].
The interactive graphic of these three variable components helps show how
transactionNairaAmount varies over time, that is, with respect to the transactionDate
and localHour. The following user account identities deviate from the behavioral
pattern based on our model, as shown above: LabelID = [641B6A70B816],
[AB77E701417E], [C03089119C16], [AA39724E34AD], [973114BAAC2A],
[91C33507469F].
Scatterplot of transactionAmount against transactionTime and localHour
4.9.5 An inspection of the scatterplot above, which contains outliers, shows characteristics
such as large gaps between outlying and inlying observations, and deviations
between the outliers and the group of inliers as measured on the suitably standardized scale
based on the previous analysis.
Red Flags labelID: the following are the user account identities that deviate from the
behavioral pattern based on our model as shown in the scatterplot:
The scatterplot of the principal-component-based variables is displayed above, with a few of
the labelIDs as identifiers (Aleskerov, E., & Freisleben, B.).
The plot features the response variable, transactionNairaAmount, against the explanatory
variables transactionDate, transactionTime and localHour respectively, validating the
labelIDs listed above: [641B6A70B816], [AB77E701417E], [C03089119C16],
[AA39724E34AD], [973114BAAC2A], [91C33507469F], in conformity with the previous
3D plots.
CHAPTER FIVE
SUMMARY OF FINDINGS, CONCLUSION AND RECOMMENDATION
5.1.1 Summary of Findings
A comprehensive evaluation of Data Mining Techniques and Machine Learning for
Unsupervised Anomaly Detection Algorithms on electronic banking transaction data sets
consisting of 9 column variables and 8,641 observations was carried out.
(Ref: Appendix for detail)
5.1.2 The experimental research findings and output are summarised below:
Red Flags labelID: the following are the user account identities that deviate from the
behavioral pattern based on our model.
At least 6 transactions, out of the 8,430 observations in the transaction dataset (8,641
observations in the data set described in 5.1.1), are suspected and predicted to be fraudulent.
LabelID = [641B6A70B816], [AB77E701417E], [C03089119C16], [AA39724E34AD],
[973114BAAC2A], [91C33507469F]. For details, see Figure 5.1.2 in the List of Figures.
Red Flags accountID Table Summary
labelID         transactionNairaAmount   transactionDate   transactionTime
AB77E701417E                 71,919.00   2015/07/03        10:33:01
AA39724E34AD          2,024,999,550.00   2015/06/24        04:26:19
973114BAAC2A            449,999,550.00   2015/06/25        02:56:36
91C33507469F            287,999,550.00   2015/08/27        09:42:12
641B6A70B816                105,678.00   2015/05/16        23:45:55
C03089119C16                 22,495.00   2015/09/05        22:09:41
(Ref: Appendix for detail, List of Figure 5.1.2)
5.1.3 The main objective of this study is to find the best solution (a singular or an integrated
detection methodology) for controlling fraud, since it appears to be a critical problem in many
organisations, including the government.
Specifically, my findings are summarised as follows.
The fraud detection techniques proffered by this research work are:
(i) Pre-process the original data set to suit the requirements of the techniques
(ii) Transform the processed data variable fields for the detection techniques
(iii) Apply K-Means Cluster Analysis on the dtrans_data, which identifies
groups within a dataset
(iv) Reduce multidimensional data to lower dimensions while retaining most of the
information using PCA
(v) Any numeric variables with relatively large rotation values (negative or
positive) in any of the first few components are generally variables that I may
wish to include in the modelling
(vi) Determine the number of components to retain: the Loading Plot
reveals the relationship between variables in the space of the first two
components
(vii) Express the main component variables as a linear combination of the original
variables
(viii) Highlight homogeneous groups of individuals with a Parallel Coordinate Plot
(PCP)
(ix) Perform advanced exploratory Interactive Data Exploration Analysis (IDEA)
(x) The major technique used in the final analysis is unsupervised Machine
Learning and predictive modeling, with a major focus on Anomaly/Outlier
Detection (OD).
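Steps (i) to (vii) above can be strung together in a few lines of base R. The sketch below is an assumption-laden outline, not the code used in this work: the data frame `dtrans` is a synthetic stand-in for dtrans_data, transactionTime is encoded as seconds since midnight for simplicity, and the choice of ten centres simply mirrors the ten cluster sizes listed in Appendix B.

```r
set.seed(123)
n <- 300
# Hypothetical stand-in for dtrans_data (synthetic values, illustrative only)
dtrans <- data.frame(
  transactionNairaAmount = rlnorm(n, meanlog = 9, sdlog = 1.5),
  transactionTime        = sample(0:86399, n, replace = TRUE),  # assumed encoding
  localHour              = sample(0:23, n, replace = TRUE)
)

# (i)-(ii) pre-process and transform: log the amount, then standardise all fields
feat <- transform(dtrans, transactionNairaAmount = log(transactionNairaAmount))
Z <- scale(feat)

# (iii) k-means cluster analysis; nstart reruns reduce sensitivity to random starts
km <- kmeans(Z, centers = 10, nstart = 20, iter.max = 50)

# (iv)-(v) PCA on the standardised fields, then inspect the rotation (loadings)
pca <- prcomp(Z)
print(round(pca$rotation, 3))

# (vii) scores on the first two components, i.e. the linear combinations
scores <- pca$x[, 1:2]
print(km$size)
```

Steps (viii) and (ix) are interactive and graphical, so they are left out of this outline.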
5.2.0 Conclusion:
This research deals with the procedure for computing the presence of outliers using
various distance measures. As a general detection performance result, I can
conclude that nearest-neighbor based algorithms perform better in most cases than
clustering algorithms for small data sets. Also, the stability with respect to
an imperfect choice of k is much higher for the nearest-neighbor based methods. The
reason for the higher variance in clustering-based algorithms is very likely the
non-deterministic nature of the underlying k-means clustering algorithm.
Despite this disadvantage, clustering-based algorithms have a lower computation
time. In conclusion, I prefer nearest-neighbor based algorithms if
computation time is not an issue. If faster computation is required for large datasets,
such as the unlabelled dataset used for this research work, or
in a near real-time setting, I observed that clustering-based anomaly detection is the
method of choice.
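The remark about stability under an imperfect choice of k can be illustrated with a small nearest-neighbor score. The R sketch below is only an illustration on synthetic data with one planted anomaly; the helper `knn_score` is hypothetical (not from any package) and it does not reproduce the comparison carried out in this work.

```r
set.seed(2015)
X <- rbind(
  cbind(rnorm(200), rnorm(200)),  # 200 ordinary points
  c(8, 8)                         # one planted anomaly, row 201
)

# Hypothetical helper: mean distance from each point to its k nearest neighbours
knn_score <- function(X, k) {
  D <- as.matrix(dist(X))                              # pairwise Euclidean distances
  apply(D, 1, function(d) mean(sort(d)[2:(k + 1)]))    # skip the self-distance of 0
}

# The same row should receive the highest score for quite different values of k
ks <- c(3, 5, 10, 20)
top_by_k <- sapply(ks, function(k) which.max(knn_score(X, k)))
print(unname(top_by_k))
```

The score is deterministic for a given k, whereas a k-means based score depends on the random starting centres. The full distance matrix costs memory quadratic in the number of rows, which is the computation-time trade-off mentioned above.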
Besides supporting the unsupervised anomaly detection research community, I also
believe that the study and its implementation are useful for researchers from
neighboring fields.
5.3.0 Recommendation:
On completion of the underlying system, I can conclude that the integrated technique
provides far better performance and efficiency than a singular system
using k-means for outlier detection. Since the main focus is on finding fraudulent data
in a transaction dataset of credit cards, efficiency is measured by the
frequency of detecting outliers or unusual user behavioral patterns. For this purpose, the
techniques use a mechanism consisting of a clustering-based K-Nearest Neighbor
algorithm with Anomaly Detection Efficiency. Thus, the final product is a system that
efficiently detects unusual behavioral patterns.
5.3.1 Suggestion for Further Studies
Future work on this system can use more attributes of the accountID
information. As technology grows rapidly, hackers are finding new ways to
defeat security measures, so working with more attributes can make the system
more complex. This in turn will make the system safer.
REFERENCE
Agboola, A.A (2002). Information Technology, Bank Automation and Attitude of Workers
in Nigeria Banks, Journal of Social Sciences, 5, 89-102
Aleskerov, E., Freisleben B., Rao B., (1997): “CARDWATCH: A Neural Network-Based Database
Mining System for Credit Card Fraud Detection”, the International Conference on
Computational Intelligence for Financial Engineering, pp. 220-226
Andreas, L., David W., Prodromidis, L., & Salvatore, J., (1997): “Credit Card Fraud
Detection Using Meta-Learning, Issues and Initial Results”; Department of Computer
Science Columbia University.
Bell, D., and La Padula L., (1976): “Secure Computer System: Unified Exposition and
Multics Interpretation”, ESD-TR-75-306 (March), Mitre Corporation
Cai, S., & Jun, M., (2001): “The Key Determinant of Internet Banking Service Quality: A
Content analysis”, International Journal of Bank Marketing, (2001) 19(7), pp.276-291.
Central Bank of Nigeria (2003): Report of Technical Committee on Electronic Banking,
Abuja: CBN
Central Bank of Nigeria (2003b): Guidelines on Electronic Banking, Abuja
Christopher, G., Mike, C., and Amy, W. (2006): “A logit analysis of electronic banking in
New Zealand”, International Journal of Bank Marketing, Vol. 24, No. 6, pp.360-383
Douglas, L., Sushmito, G., (1994): “Credit Card Fraud Detection with a Neural Network,”
Proceedings of the 27th Annual Hawaii International Conference of System Science
Duman, E., & Sahin, Y., (2011): “Detecting Credit Card Fraud by Decision Trees and
Support Vector Machine”, proceeding International Multi-Conference of engineering and
Computer Statistics, Vol.1, 2011
Ekberg, P., et al, (2013) “Online Banking Access System: Principles behind Choices and
Further Development Seen from Managerial Perspectives”, retrieve December, (2013).
Geethal, V., and Malarvizhi, M,: “Acceptance of E-Banking Among Customers”, Journal
of Management and Science Vol.2, No.1
Ghosh S., & Reilly, D., (2004): “Credit Card Fraud Detection with Neural Network”, Proc. of
27th Hawaii International Conference on Systems Science 3: 621-630
Hamid, M. R., et al., (2007): “A Comparative Analysis of Internet banking in Malaysia and
Thailand” Journal of Internet Business (2007) (4), 1-19
Hearst, M., Rachna D., & Tygar, J., (2006): “Why Phishing Works”, In the Proceedings of
Human Factors in Computing Systems
Hutchinson, D., & Warren, M. (2001): “A Framework of Security Authentication for
Internet Banking”, Paper presented at the International We-B Conference (2nd), 2001. Perth
Jain, A., Hong, L., & Pankanti, S., (2000): Biometric Identification. Association for
Computing Machinery, Communication of the ACM, 43(2), 90-98
Karim, Z., et al., (2009): “Towards Secured Information System in Online Banking” Paper
presented at International Conference for Internet Technology and Secured Transaction,
London
Leow, H.B., (1999): “New Distribution Channels In Banking Services”, Banker Journal
Malaysia, (199) (110), pp 48-56
Lokesh Sharma, RaghavendraPatridar, (2011): “Credit Card Fraud Detection Using
Neural Network”, International Journal of Soft Computing and Engineering (IJSCE)
Maes S., Tuyls K., Vanschoenwinkel B., (2002): “Credit Card Fraud Detection Using
Bayesian and Neural Networks”; Vrije Universiteit Brussel, Belgium.
Panida S., & Sunsern L., (2011): “A Comparative Analysis of the Security of Internet
Banking in Australia: A Customer Perspective”, being a discussion paper delivered at the 2nd
Internal Cyber-Resilience Conference, Australia.
Pavlou, P., (2001): “Integrating trust in electronic commerce with the Technology
Acceptance Model”, Development and Validation AMCIS (2001) Proceeding [Online].
Available at: http://aisel.aisnet.org/amcis2001/159. Accessed 3 August 2008.
APPENDIX A
Experimental Tools and Code for Project Implementation:
##############################################################
##Load Rattle for Data Mining
#######################################################
library(rattle)
rattle()
#############################################################
#Read untagged transaction
#############################################################
data=read.csv(file.choose(),stringsAsFactors = F,header=TRUE)
attach(data)
str(data)
head(data)
data <- data[,-7]
############################################################
#Make Account ID as string
labelID=as.character(labelID)
stateCode=as.factor(stateCode)
#######################################################
##### Format Time to 6digits #########################
##### Time Formatting
###############################
require(lubridate)
transactionDate= as.character(transactionDate)
transactionTime= as.character(transactionTime)
transactionTime=sprintf("%06d", data$transactionTime)
#### data <- read.table(file = "file.csv", header = FALSE, sep = ";") #####
date_time <- paste(transactionDate, transactionTime)
date_time <- ymd_hms(date_time)
#Append the new variable date_time
data=data.frame(data,date_time)
head(data)
head(transactionTime)
str(transactionTime)
############################################################
#### Sort Data in Account-Date-Time order ###########
########################################################
library("dplyr")
#Now, we will select seven columns from data, arrange the rows by
#the transactionDate and then arrange the rows by transactionTime.
#Finally show the details of the final data frame
sort=data %>% select(labelID,
transactionNairaAmount,stateCode,transactionDate,transactionTime,date_time,localHour) %>%
arrange(transactionDate, transactionTime)
tail(sort)
head(sort)
str(sort)
#############################################################
#Remove duplicate rows
dtrans_data=sort[!duplicated(sort), ]
str(dtrans_data)
head(dtrans_data)
tail(dtrans_data)
glimpse(dtrans_data)
summary(dtrans_data)
#####################################################################
Tdtrans_data_AD=data.frame(log(transactionNairaAmount),localHour,as.numeric(transactionTime))
dtrans_data_AD=data.frame(transactionNairaAmount,localHour,as.numeric(transactionTime))
########################################################
#Descriptive Statistics Visualization for Probability Model
#Suitable Variables for the Models
###########################################################
#To demonstrate the need for response variable transformation
#transactionNairaAmount is a variable from a multivariate time series dataset
plot.ts(log(transactionNairaAmount))
##########################################################
#Transaction Amount
transactionNairaAmount=sapply(transactionNairaAmount, mean, na.rm=TRUE)
x=log(transactionNairaAmount)
xbar=mean(x)
S=sd(x)
graph1=hist(x,col='grey')
graph2=hist(x,col='grey',probability=T,main="Histogram of transaction Naira Amount",xlab="transformed transaction Naira Amount")
curve(dnorm(x,xbar,S),col=2,add=T)
##########################################################
#transactionIPaddress (bad predictor of transactionAmount)
transactionIPaddress=sapply(transactionIPaddress,mean, na.rm=TRUE)
x=log(transactionIPaddress)
xbar=mean(x)
S=sd(x)
graph1=hist(x,col='grey')
graph2=hist(x,col='grey',probability=T)
curve(dnorm(x,xbar,S),col=2,add=T)
#########################################################
#localHour(good predictor of transactionAmount)
localHour=sapply(localHour, mean, na.rm=TRUE)
x=(localHour)
xbar=mean(x)
S=sd(x)
graph1=hist(x,col='grey')
graph2=hist(x,col='grey',probability=T)
curve(dnorm(x,xbar,S),col=2,add=T)
##########################################################
#Normality Test
###########################################################
# normal fit
#qqnorm is for univariate data
qqnorm(localHour); qqline(localHour)   #qqline must use the same variable as qqnorm
old.par <- par(mfrow=c(1, 2))
qqnorm(transactionNairaAmount); qqline(transactionNairaAmount)
qqnorm(log(transactionNairaAmount)); qqline(log(transactionNairaAmount))
par(old.par)   #restore the previous plotting layout
#qqplot is for bivariate data
qqplot(localHour,log(transactionNairaAmount))
qqplot(log(transactionNairaAmount),localHour)
APPENDIX B
List of Figures
Figure 3.1.1
Pre-processed data structure, object tagged dtrans_data
Figure 3.5.4 data head
Figure 3.5.5 dtrans_data Summary
Figure 3.5.6 dtrans_data summary
Figure 3.5.7 Frequency table of electronic banking transactions in different states in Nigeria
Figure 3.6.2 The typical work flow of a dtrans_data data set as captured by Rattle and R.
Figure 4.2.3 Cluster Analysis Output
Cluster sizes: Figure 4.4.2
“604 531 426 625 408 512 507 571 1133 584”
Figure 5.1.1
The first and last (10) variable rows of the primary transaction dataset, with seven (7) data
field variables and 8,430 observations.
Figure 5.1.2
At least 6 of the 8,430 transactions in the dataset are suspected and predicted to be
fraudulent.
LabelID = [641B6A70B816], [AB77E701417E], [C03089119C16], [AA39724E34AD],
[973114BAAC2A], [91C33507469F]. The R programming output snapshot is displayed
below for detailed reference.
Figure 5.1.2b