" # $% & ' & '( # ' ) *" + ,-. # / # 0123# 4 / 5#-62 / # ' ' ( ! ' ' # ' / ' ' ( * # % # #7 " / "# / ( ) '"#&))+ '# / *( #) ( (# (* # , .& & 89 ,: ' '.(; / < '( ! ! " #$% " #" # & !"#$ % " ! & ' & ( )*! & " " *)+,-! ./ 0 2& # 2 1,)* % " ' & ( " " 1,)* 1,)* ! & DEDICATION This research work is dedicated first to my father in heaven (GOD) and Savior, Lord Jesus Christ. My mother; Mrs. Olufumilola Comfort Aluko, My father, Late Mr. Samuel Oluwole Aluko and grand-mother, Late Princess Alice Adeleye Aluko; My humble gratitude goes to my siblings: Mr. Gbenga Aluko, Cornel Olubunmi Aluko, Dr. Seun Aluko, Mr. James Aluko (FNM), Temilola Abidemi Aluko,... Nieces and Nephews, Friends, Colleagues, good and well wishers i ACKNOWLEDGEMENTS The fact that I am writing this sentence echoes my enormous indebtedness to the Almighty God for his grace, mercy and most importantly his unending blessings and favour towards me. I also thank my project supervisor, Dr. M.I Akinyemi for her encouragement, guardians, support and so much more, for the contribution and effort towards the success of this research work. Dr. (Mrs.) M.I Akinyemi, for guiding my work tirelessly, going through all my draft, providing valuable suggestions and constructive criticism, for the improvement of the dissertation I will also like to appreciate the effort of my lecturers in Department of Mathematics, the departmental HOD, Prof. J.O Olaleru, Prof. R.O Okafor, Prof. J.A Adepoju, Prof. S.O Ajala, Prof. S.A Okunuga, Dr. R.A Kasumu, Dr. A.A Mogbademu, Dr. (Mrs.) J.N Onyeka-Ubaka, Dr. M.O Adamu Ira, Dr I.O Abiala, DR A.A, Akinfenwa, among others My profound gratitude goes to my parents; my mother; Mrs. Olufumilola Comfort Aluko, My father, Late Mr. Samuel Oluwole Aluko and grand-mother, Late Princess Alice Adeleye Aluko; my humble gratitude goes to my siblings: Mr. Gbenga Aluko, Cornel Olubunmi Aluko, Dr. Seun Aluko, Mr. 
James Aluko (FNM), Temilola Abidemi Aluko,...Nieces and Nephews, Friends, Colleagues, good and well wishers for their encouragement, words of wisdom, financial support and so much more, for always supporting me and being my source of strength. Finally, I express my gratitude to my entire families, beloved colleagues and friends. To all of us, God’s richest blessing is ours! ii TABLE OF CONTENTS Dedication i Acknowledgements ii Table of contents iii List of Figures v CHAPTER ONE: Background of the Study 1.0 Introduction 1 1.1 Background of the study 1 1.2 Statement of the problem 4 1.3 Objectives of the study 4 1.4 Research questions 5 1.5 Significance of the study 5 1.6 Scope and Limitation of the study 5 1.7 Historical Background of the case study 6 1.8 Definition of terms 7 CHAPTER TWO: 10 Literature Review 2.0 Electronic banking fraud characteristics and related work 10 2.1 Electronic banking fraud characteristics 10 2.2 General work in fraud detection 12 2.3 Fraud detection in Electronic banking 15 iii 2.4 Credit card fraud detection 15 2.5 Computer intrusion detection 16 2.6 Telecommunication fraud detection 17 CHAPTER THREE: Research Methodology 19 3.0 Introduction 19 3.1 Methodology description 19 3.2 Credit Card Fraud Detection Methods 20 3.3 Model Specification 26 3.4 Gaussian distribution to developing Anomaly detection Algorithms 37 3.5 Data Pre-Processing and Fraud Detection 41 CHAPTER FOUR: Data Presentation and Analysis 44 4.1 Exploratory Data Analysis and Gaussian distribution Validation 44 4.2 K-Mean Cluster Analysis 45 4.6 Principal Component Analysis 49 4.8 Model Based Anomaly Detection Output 53 4.9 Outlier Detection based Mechanism 54 CHAPTER FIVE: Summary, Conclusions and Recommendations 5.0 Summary of Findings 56 5.1 Conclusion 57 5.2 Recommendations 58 iv 5.3 Suggestion for Further Studies 58 References 59 APPENDIX: Experimental Tools and Code for Project Implementation 61 LIST OF FIGURES: 64 Figure 3.1.1 64 Figure 3.5.4 64 Figure 3.5.5 64 
Figure 3.5.6
Figure 3.5.7
Figure 3.6.2
Figure 4.2.3
Figure 4.4.2
Figure 5.1.1
Figure 5.1.2
Figure 5.1.2b

CHAPTER ONE
INTRODUCTION

1.1 BACKGROUND OF THE STUDY

In spite of the challenging economy, the use of e-channel platforms (Internet banking, mobile banking, ATM, POS, Web, etc.) has continued to experience significant growth. According to the NIBSS 2015 annual fraud report, transaction volume and value grew by 43.36% and 11.57% respectively, compared to 2014. Although the e-fraud rate in terms of value fell by 63% in 2015, due in part to the introduction of the BVN and improved collaboration among banks via the fraud desks, the total fraud volume increased significantly, by 683%, in 2015 compared to 2014. Similarly, data released recently by NITDA (National Information Technology Development Agency) indicated that Nigeria experienced a total of 3,500 cyber-attacks with a 70% success rate, and a loss of $450 million, within the last year.

The sustained growth of e-transactions, as depicted by the increased transaction volume and value in 2015, coupled with the rapidly evolving nature of technology within the e-channel ecosystem, continues to attract cybercriminals who continuously develop new schemes to perpetrate e-fraud. What is e-fraud? What is responsible for its growth in Nigeria? What are the major techniques used by these criminals to commit fraud? Is e-fraud dying in Nigeria? Can it be mitigated?

What is e-fraud?
E-fraud can be briefly defined as electronic banking trickery and deception which affects the entire society, impacting individuals, businesses and governments.

Why Is It Growing?
The following inherent factors fuel e-fraud in Nigeria:
i. Dissatisfied staff;
ii. Increased adoption of e-payment systems for transactions due to their convenience and simplicity;
iii. Emerging payment products being adopted by Nigerian banks;
iv. Growing complexity of e-channel systems;
v. Abundance of malicious code, malware and tools available to attackers;
vi. Rapid pace of technological innovations;
vii. Casual security practices and knowledge gap;
viii. Obscurity approach of the internet;
ix. The increasing role of third-party processors in switching e-payment transactions;
x. Passive approach to fraud detection and prevention;
xi. Lack of inter-industry collaboration in fraud prevention (banks, telecoms, police, etc.).

What are the Major Techniques?
Cybercriminals employ several techniques to perpetrate e-fraud, including:
1. Cross Channel Fraud: customer information obtained from one channel (e.g. call center) is used to carry out fraud in another channel (e.g. ATM).
2. Data theft: hackers access secure or non-secure sites, get the data and sell it.
3. Email Spoofing: changing the header information in an email message in order to hide identity and make the email appear to have originated from a trusted authority.
4. Phishing: stealing valuable information such as card information, user IDs, PAN and passwords using the email spoofing technique.
5. Smishing: attackers use text messages to defraud users. Often, the text message will contain a phone number to call.
6. Vishing: fraudsters use phone calls to solicit personal information from their victims.
7. Shoulder Surfing: using direct observation techniques, such as looking over someone's shoulder, to get personal information such as PIN, password, etc.
8. Underground websites: fraudsters purchase personal information such as PIN, PAN, etc. from underground websites.
9. Social Media Hacking: obtaining personal information such as date of birth, telephone number, address, etc. from social media sites for fraudulent purposes.
10. Key logger Software: use of malicious software to steal sensitive information such as passwords, card information, etc.
11. Web Application Vulnerability: attackers gain unauthorized access to critical systems by exploiting weaknesses in web applications.
12. Sniffing: viewing and intercepting sensitive information as it passes through a network.
13. Google Hacking: using Google techniques to obtain sensitive information about a potential victim with the aim of using such information to defraud the victim.
14. Session Hijacking: unauthorized control of a communication session in order to steal data or compromise the system in some manner.
15. Man-in-The-Middle Attack: a basic tool for stealing data and facilitating more complex attacks.

Is e-fraud becoming extinct?
Fraud value may have reduced in 2015, but the significant increase in the volume of attacks depicts the enormous threat of e-fraud. Furthermore, information released by the security firm Kaspersky shows that in 2015 there were over a million attempted malware infections that aimed to steal money via electronic banking access to bank accounts. As financial institutions adopt emerging payment systems and other technological innovations as a means of increasing revenue and reducing costs, cyber thieves are exploiting the gaps inherent in these innovations to perpetrate fraud, given that security is usually not the primary focus in most of these innovations.

Can it be alleviated?
Because of the risk inherent in the e-channel space, many organisations have attempted to implement the following comprehensive strategies for detecting and preventing e-fraud:
- Fraud Policies
- Fraud Risk Assessment
- Fraud Awareness and Training
- Monitoring
- Penetration Testing
- Collaboration

In conclusion, increased revenue, optimized costs, innovations, regulation, convenience and simplicity are the major factors driving the massive adoption of e-channel platforms in Nigeria.
Furthermore, the use of these platforms has created opportunities for cyber-thieves who continuously devise new and sophisticated schemes to perpetrate fraud. E-fraud will continue to grow, and combating it requires effective fraud strategies and the collaboration and cooperation of many organisations in Nigeria, including government agencies, and of other countries. Otherwise, cybercriminals will keep getting richer from the hard work of others because of the lack of a united front on the part of everyone.

1.2 STATEMENT OF THE PROBLEM

Electronic banking is a driving force that is fundamentally changing the landscape of the banking environment towards a more competitive industry. Electronic banking has blurred the boundaries between different financial institutions, enabled new financial products and services, and made existing financial services available in different packages (Anderson S. 2000), but the influence of electronic banking goes far beyond this. The developments in electronic banking, together with other financial innovations, are constantly bringing new challenges to finance theory and changing people's understanding of the financial system. It is not surprising that, in applying electronic banking in Nigeria, the financial institutions have to face its problems:
1. Communication over the internet is insecure and often congested.
2. The financial institutions also have to contend with other internet challenges, including insecurity, quality of service and some aberrations in electronic finance.
3. Besides, the existing banking environment also creates some challenges to the smooth operation of electronic banking in Nigeria.

To this effect, this project serves as a practical verification: by carrying out various fraud detection techniques, it seeks to discover whether a system that integrates several techniques does indeed perform better than a single-technique system, as suggested by most researchers.
1.3 OBJECTIVES OF THE STUDY

The main objective of this study is to find a solution for controlling fraud, since it is a critical problem in many organisations, including the government. Specifically, the objectives of the study are to:
i. Identify the factors that cause fraud;
ii. Explore the various techniques of fraud detection;
iii. Explore some major detection techniques based on the unlabelled data available for analysis, which do not contain a useful indicator of fraud.
Thus, unsupervised machine learning and predictive modeling, with a major focus on Anomaly/Outlier Detection (OD), will be considered as the major techniques for this project work.

1.4 RESEARCH QUESTIONS

- What are the factors that cause fraud?
- What specific phenomena typically occur before, during, or after a fraud incident?
- What other characteristics are generally seen with fraud?
- What are the various techniques of fraud detection?
- Is there a specific fraud detection technique suitable for a typical type of fraud?
When all these phenomena and characteristics are pinpointed, predicting and detecting fraud becomes a much more manageable task.

1.5 SIGNIFICANCE OF THE STUDY

- Understand the different areas of fraud and their specific detection methods.
- Identify anomalies and risk areas using data mining and machine learning techniques.
- Carry out some major fraud detection techniques, as a model and an encouragement for different banks to work together on fraud detection in order to achieve more extensive and better results.

1.6 SCOPE AND LIMITATION OF THE STUDY

This work considers anomaly detection as the main theme. Therefore, the following resources illustrate the variety of approaches, methods and tools for the task in each ecosystem. In order to make sure this study will be successful, data mining and statistical methodology will be explored to detect fraud and take immediate action to minimize costs.
Through the use of sophisticated data mining tools, millions of transactions can be searched to spot patterns and detect fraudulent transactions. Such tools include decision trees (boosted trees, classification trees and random forests), machine learning, association rules, cluster analysis and neural networks. Predictive models can be generated to estimate quantities such as the probability of fraudulent behavior or the naira amount of fraud. These predictive models help to focus resources in the most efficient manner to prevent or recover fraud losses.

In the course of this research work some constraints were encountered. For instance, it does not make sense to describe fraud detection techniques in great detail in the public domain, as this gives criminals the information that they require in order to evade detection. Although data sets are readily available, results are often censored, making them difficult to assess (for example, Leonard 1993). Many fraud detection problems involve huge data sets that are constantly evolving; besides, original data sets are modified so as not to infringe on clients' personal information and as a security measure for the organisation.

Data Source: Chartered Institute of Treasury Management, Abuja
http://www.cbn.gov.ng/neff%20annual%20report%2015.pdf
http://www.nibbs-plc.com/ng/report/2014fraud.report
https://statistics.cbn.gov.ng/cbn-ElectronicBankingstats/DataBrowser.aspx

1.7 HISTORICAL BACKGROUND OF THE CASE STUDY

2015 was an incredible year for cyber-security in Nigeria. In May 2015, the cybercrime bill was signed into law in Nigeria by President Goodluck Jonathan. The implication of this for individuals and corporations is that cybercrime is now properly defined and legal consequences are attached to any defiance of this law.

A cyber-attack hit the main website of the British Broadcasting Corporation (BBC) and its iPlayer streaming service on New Year's Eve.
The BBC's websites were unavailable for several hours as a result of the attack. This was the first widely reported cyber-attack of the year 2016. While it is bad enough to hear such news at the start of the year, what should be of main concern is the number of unreported or stealth cyber-attacks that have occurred and will occur in 2017. As the Internet and technology continue to evolve, the world becomes more connected and no one is immune to these threats.

At the beginning of year 2014, an annual forecast of Nigeria's cyber-security landscape was detailed in the 2015 Nigeria Cyber-security Outlook. This included the forecast that the likelihood of cyber-security issues was expected to reduce towards the last quarter of the year due to the successful implementation of the Bank Verification Number (BVN) exercise, an initiative powered by the Central Bank of Nigeria (CBN). This prediction was confirmed in a report presented by the Chairman of the Nigeria Electronic Fraud Forum (NEFF), who is also Director, Banking and Payment System Department, CBN, Mr. Dipo Fatokun, during the forum's annual dinner. He stated that the loss arising from electronic payment fraud had fallen by 63% and that there had been a reduction of 45.98% in attempted electronic banking fraud by the end of 2015 as against the beginning of the same year. This drop could be partly attributed to the successful implementation of the BVN, a commendable initiative implemented to secure Nigeria's payment system in 2015.

The 2015 forecast also indicated a higher risk of current and former employees or contractors resorting to cybercrime as a means to maintain their standard of living. During the course of the year, forensic specialists were kept busy (hopefully with pockets full) as several companies had to engage digital forensic specialists to investigate cybercrime perpetrated by various suspects, who were largely employees and former employees of the victim organizations.
The forecast further highlighted that there would be an increase in cyber-attacks on the websites and information technology (IT) infrastructure of political organizations and public institutions, and that these would appear as headlines in local dailies. The prediction became a reality: at various points during the year, there were several allegations of hacking attempts on the websites of public institutions and political parties. Some worthy of mention are the reported hack and defacing of the Independent National Electoral Commission (INEC) website in March 2015 and that of the Lagos State government in December 2015. Through 2015 and 2016, the cyber-security journey of hacks, attacks and triumphs has continued.

1.8 DEFINITION OF TERMS

- Fraud detection: the detection of criminal activities occurring in commercial organizations.
- Anomaly: a pattern in the data that does not conform to the expected behavior.
- Classification: finding models that analyze and classify a data item into several predefined classes.
- Sequencing: similar to the association rule, except that the relationship exists over a period of time, such as repeat visits to a supermarket.
- Regression: mapping a data item to a real-valued prediction variable.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Dependency Modeling (Association Rule Learning): finding a model which describes significant dependencies between variables.
- Deviation Detection (Anomaly Detection): discovering the most significant changes in the data.
- Summarization: finding a compact description for a subset of data.
- Data Cleaning: removes noise from data.
- Data Integration: combines various data sources.
- Data Selection and Transformation: transforms data into the form appropriate for mining.
- Automated Teller Machine (ATM): gives customers easy access to their cash whenever they need it (24 hours a day, 7 days a week).
- Internet banking: with a PC connected to the bank via the internet, the product empowers a customer to transact banking business when, where and how he/she wants, with little or no physical interaction with the bank.
- Mobile Banking: offers customers the freedom of banking with a mobile phone. The product keeps a customer in touch with his/her finances all the time and anywhere.
- Electronic banking: the use of computers and telecommunication to enable banking transactions to be done by telephone or computer.
- Electronic funds transfer (EFT): the transfer of money from one bank account to another by means of communication links.
- Smart Card: a plastic card containing a microprocessor that stores and updates information, typically used in performing financial transactions.
- E-money: also known as electronic cash; money or scrip which is exchanged only electronically. A good example of e-money is money transfer.
- Bill payment: an e-banking application whereby a customer directs the financial institution to transfer funds to the account of another person or business.
- Classification by decision tree induction: a decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.
- Bayesian Classification: also known as Naive Bayes Classification. As the name suggests, this classifier uses Bayes' theorem to obtain the classification for given variable values.
- Neural Networks: a neural network is a set of connected input/output units in which each connection has a weight associated with it.

This research work will explore the procedures for detecting the presence of outliers using various distance measures, with clustering-based anomaly detection as the methodology. Since the data sets available for this research are unlabelled and do not contain a useful indicator of fraud, there may be a need to reduce the available multidimensional data set to a lower dimension while retaining most of the information, using Principal Component Analysis. Nevertheless, predictive modelling with unsupervised machine learning, with a major focus on anomaly detection, will be the center of attention.

CHAPTER TWO
LITERATURE REVIEW

2.0 Electronic banking fraud characteristics and related work

I will first summarize the main characteristics of electronic banking fraud, and then discuss the related work on different areas of fraud detection. Most published work on fraud detection relates to the domains of credit card fraud, computer intrusion and telecommunication fraud. Therefore I will discuss each of these and explain the limitations of the existing work when applied to detecting electronic banking fraud.

2.1 Electronic banking fraud characteristics

From a system point of view, the essence of electronic fraud reflects the synthetic abuse of interaction between resources in three worlds: the fraudster's intelligence abuse in the social world, the abuse of web technology and Internet banking resources in the cyber world, and the abuse of trading tools and resources in the physical world. A close investigation of the characteristics is important for developing effective solutions, which will then be helpful for other problem-solving. (Sahin, Y., and Duman, E.
2011)

Investigations based on the literature review show that real-world electronic banking transaction data sets and most electronic banking fraud have the following characteristics and challenges: (1) highly imbalanced large data set; (2) real time detection; (3) dynamic fraud behavior; (4) weak forensic evidence; and (5) diverse genuine behavior patterns.

The data set is large and highly imbalanced. According to a study on one Australian bank's electronic banking data, electronic banking fraud detection involves a large number of transactions, usually millions. However, the number of daily frauds is usually very small. For instance, there were only 5 frauds among more than 300,000 transactions on one day. This results in the task of detecting very rare fraud dispersed among a massive number of genuine transactions.

Fraud detection needs to be real time. According to Linda D., Hussein A., John P. (2009), in electronic banking the interval between a customer making a payment and the payment being transferred to its destination account is usually very short. To prevent instant money loss, a fraud detection alert should be generated as quickly as possible. This requires a high level of efficiency in detecting fraud in large and imbalanced data.

The fraud behavior is dynamic. According to Masoumeh Zareapoor, fraudsters continually advance their techniques to defeat electronic banking defenses. Malware, which accounts for the greater part of electronic banking fraud, has been reported to appear at a rate of over 55,000 new malicious programs every day. This puts fraud detection in the position of having to defend against an ever-growing set of attacks. This is far beyond the capability of any single fraud detection model; it requires models that can adapt, and the possibility of engaging multiple models to meet the challenges that cannot be handled by any single model. (Seeja.K.R, and M.Afshar.Alam, 2012)

The forensic evidence for fraud detection is weak.
For electronic banking transactions, it is only possible to know the source account, the destination account and the currency value associated with each transaction; other external information, for example the purpose of the spending, is not available. Moreover, with the exception of ID theft, most electronic banking fraud is not caused by the hijack of an electronic banking system but by attacks on customers' computers. In fraud detection, only the electronic banking activities recorded in banking systems can be accessed, not the whole compromise process or solid forensic evidence (including labels showing whether a transaction is fraudulent), which could be very useful for understanding the nature of the deception. This makes it challenging to identify sophisticated fraud with very limited information. (Adnan M. Al-Khatib, 2012)

The customer behavior patterns are diverse. An electronic banking interface provides a one-stop entry for customers to access most banking services and multiple accounts. In conducting electronic banking business, every customer may behave very differently for different purposes. This leads to a diversity of genuine customer transactions. In addition, fraudsters simulate genuine customer behavior and change their behavior frequently to keep up with advances in fraud detection. This makes it difficult to characterize fraud and even more difficult to distinguish it from genuine behavior. (Tung-shou Chen, 2006)

The electronic banking system is fixed. The electronic banking process and system of any bank are fixed. Every customer accesses the same banking system and can only use the services in a predefined way. This provides good references for characterizing common genuine behavior sequences, and for identifying tiny suspicions in fraudulent electronic banking.
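
To make the imbalance described in this section concrete, the following minimal Python sketch works through the quoted figure of 5 frauds on one day. It is illustrative only: the round total of 300,000 transactions (the text says "more than 300,000"), the function name `summarize`, and the hypothetical detector that catches 4 frauds while raising 300 false alarms are assumptions, not results from this study.

```python
from dataclasses import dataclass


@dataclass
class DetectionSummary:
    accuracy: float   # fraction of all transactions labelled correctly
    recall: float     # fraction of frauds that were flagged
    precision: float  # fraction of flagged transactions that were fraud


def summarize(n_total: int, n_fraud: int, true_pos: int, false_pos: int) -> DetectionSummary:
    """Compute accuracy, recall and precision from raw daily counts."""
    n_genuine = n_total - n_fraud
    if not (0 <= true_pos <= n_fraud <= n_total) or not (0 <= false_pos <= n_genuine):
        raise ValueError("inconsistent counts")
    true_neg = n_genuine - false_pos
    flagged = true_pos + false_pos
    return DetectionSummary(
        accuracy=(true_pos + true_neg) / n_total,
        recall=true_pos / n_fraud if n_fraud else 0.0,
        precision=true_pos / flagged if flagged else 0.0,
    )


# Assumed round figure: 5 frauds among 300,000 transactions in one day.
N_TOTAL, N_FRAUD = 300_000, 5

# A "detector" that labels every transaction as genuine.
always_genuine = summarize(N_TOTAL, N_FRAUD, true_pos=0, false_pos=0)

# A hypothetical detector: catches 4 of the 5 frauds, raises 300 false alarms.
hypothetical = summarize(N_TOTAL, N_FRAUD, true_pos=4, false_pos=300)

for name, s in [("always genuine", always_genuine), ("hypothetical", hypothetical)]:
    print(f"{name}: accuracy={s.accuracy:.4%} recall={s.recall:.0%} precision={s.precision:.2%}")
```

Under these assumed counts, the detector that never raises an alert has the higher accuracy yet catches no fraud at all, which is why accuracy alone is a poor yardstick for data this imbalanced and why recall and precision are reported alongside it.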
The above characteristics make it very difficult to detect electronic banking fraud, and electronic banking fraud detection presents several major challenges to research, especially for the mainstream data mining community: extremely imbalanced data, big data, model efficiency in dealing with complex data, dynamic data mining, pattern mining with limited or no labels, and discriminant analysis of data without clear differentiation. In addition, it is very challenging to develop a single model that tackles all of the above aspects, which greatly challenges the existing work in fraud detection. (Tung-shou Chen, 2006)

2.2 General work in fraud detection

Many statistical and machine learning techniques have been developed for tackling fraud, for example Neural Network, Decision Tree, Logistic Regression and Rule-based Expert Systems. They have been used to detect abnormal activities and fraud in many fields, such as money laundering, credit card fraud, computer intrusion, and so on. They can be categorized as unsupervised approaches and supervised ones. Unsupervised approaches, such as the Hidden Markov Model, are mainly used in outlier detection and spike detection when the training samples are unlabeled. Based on historical data and domain knowledge, electronic banking can collect clearly labeled data samples from the reports of victims or related crime control organizations. Unsupervised approaches cannot use such label information, and their accuracy is lower than that of supervised approaches. Some supervised methods, such as Neural Network and Random Forests, perform well in many classification applications, including fraud detection applications, even in certain class-imbalanced scenarios. However, they either cannot tackle extremely imbalanced data, or are not capable of dealing with the comprehensive complexities shown in electronic banking data and business. (Philip K.
Chan, Wei Fan, Andreas L., 1999)

Understanding the complexities of the contrast between fraudulent behavior and genuine behavior can also provide essential patterns which, when incorporated in a classifier, lead to high accuracy and predictive power. Such understanding triggered the emergence of contrast pattern mining, such as emerging patterns, jumping emerging patterns, and mining contrast sets. However, various research works show that these approaches are not efficient for detecting rare fraud among an extremely large number of genuine transactions.

2.2b. An approach to fraud detection has been described that is based on tracking calling behaviour on an account over time and scoring calls according to the extent to which they deviate from the account's usual pattern and resemble fraud. Account summaries are compared to a threshold each period, and an account whose summary exceeds the threshold can be queued to be analyzed for fraud. Thresholding has several disadvantages: thresholds may need to vary with time of day, type of account and type of call in order to be sensitive to fraud without setting off too many false alarms for legitimate traffic. (Fawcett, T and Provost, F., 1996)

Fawcett and Provost developed an innovative method for choosing account-specific thresholds rather than universal thresholds that apply to all accounts or to all accounts in a segment. In the experiment, fraud detection is based on tracking account behaviour. First, fraud detection was event driven and not time driven, so that fraud can be detected as it is happening. Second, fraud detection must be able to learn the calling pattern on an account and adapt to legitimate changes in calling behaviour. Lastly, fraud detection must be self-initializing so that it can be applied to new accounts that do not have enough data for training. The approach adopted probability distribution functions to track legitimate calling behaviour.
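
The three requirements above (event-driven scoring, adaptation to each account, and self-initialization for new accounts) can be illustrated with a short Python sketch. It is not Fawcett and Provost's algorithm: the class `AccountThresholds`, the mean-plus-k-standard-deviations rule, the fallback `default_threshold` and all numeric values are assumptions chosen purely for illustration.

```python
import math
from collections import defaultdict


class AccountThresholds:
    """Illustrative per-account adaptive thresholds (hypothetical design).

    Each account keeps a running mean and variance (Welford's method) of a
    usage measure, e.g. call minutes per event. An event is flagged when it
    exceeds mean + k * std for that account.
    """

    def __init__(self, k: float = 3.0, min_history: int = 5,
                 default_threshold: float = 100.0) -> None:
        self.k = k
        self.min_history = min_history              # events needed before adapting
        self.default_threshold = default_threshold  # universal fallback for new accounts
        self._stats = defaultdict(lambda: [0, 0.0, 0.0])  # count, mean, M2

    def threshold(self, account: str) -> float:
        count, mean, m2 = self._stats[account]
        if count < self.min_history:
            # Self-initializing: too little history, use the universal threshold.
            return self.default_threshold
        std = math.sqrt(m2 / (count - 1))
        return mean + self.k * std

    def observe(self, account: str, value: float) -> bool:
        """Score one event as it arrives; return True if it is flagged."""
        flagged = value > self.threshold(account)
        if not flagged:
            # Adapt only to behaviour that was treated as legitimate.
            stats = self._stats[account]
            stats[0] += 1
            delta = value - stats[1]
            stats[1] += delta / stats[0]
            stats[2] += delta * (value - stats[1])
        return flagged


monitor = AccountThresholds(k=3.0, min_history=5, default_threshold=100.0)
for minutes in [10, 12, 9, 11, 10, 13]:    # a light user's history
    monitor.observe("light_user", minutes)
for minutes in [80, 90, 85, 95, 88, 92]:   # a heavy user's history
    monitor.observe("heavy_user", minutes)

light_flag = monitor.observe("light_user", 60)   # far above this account's norm
heavy_flag = monitor.observe("heavy_user", 60)   # unremarkable for this account
print(light_flag, heavy_flag)
```

In this sketch the same 60-minute event is flagged for the light user but not for the heavy user, whereas the universal threshold of 100 would have flagged neither. A production system would also need a way to stop slow, deliberate drift from raising an account's threshold, which this sketch does not address.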
Other models that have been developed in research settings and have promising potential for real world applications include the Customer Relationship Model, Bankruptcy Prediction Model, Inventory Management Model, and Financial Market Model. (Fawcett, T and Provost, F., 1997)

Similarly, it has been stated that many financial institutions see the value of Artificial Neural Networks (ANNs) as a supporting mechanism for financial analysts and are actively investing in this arena. The models described provide the knowledge needed to choose the type of neural network to be used.

The use of decision tree techniques, in conjunction with the CRISP-DM management model, to help in the prevention of bank fraud has also been evaluated. The study recognized that it is almost impossible to eradicate bank fraud and focused on what can be done to minimize and prevent fraud. The research offered a study of decision trees, an important concept in the field of artificial intelligence. The study focused on discussing how these trees can assist in the decision making process of identifying fraud by analyzing information regarding bank transactions. This information is captured using data mining techniques and the CRISP-DM management model in large operational databases logged from internet banking.

The Cross Industry Standard Process for Data Mining (CRISP-DM) is a model of a data mining process used by experts to solve problems. The model identifies the different stages in implementing a data mining project, while a decision tree is both a data-representing structure and a method used for data mining and machine learning. Neural networks have likewise been used to analyze the great increase in credit card transactions, since credit card fraud has become increasingly rampant in recent years. This study investigates the efficacy of applying classification models to credit card fraud detection problems.
Three different classification methods, i.e. decision trees, neural networks and logistic regression, were tested for their applicability in fraud detection. The paper provides a useful framework for choosing the best model to recognize credit card fraud risk. Detecting credit card fraud is a difficult task when using normal procedures, so the development of credit card fraud detection models has recently become significant in both the academic and the business community. These models are mostly statistics-driven or artificial intelligence-based, which gives them the theoretical advantage of not imposing arbitrary assumptions on the input variables.

To increase the body of knowledge on this subject, an in-depth examination of important publicly available predictors of fraudulent financial statements was offered. The authors tested the value of these suggested variables for the detection of fraudulent financial statements within a matched-pairs sample. The self-organizing Artificial Neural Network (ANN) AutoNet was used in conjunction with standard statistical tools to investigate the usefulness of these publicly available predictors. The study resulted in a model with a high probability of detecting fraudulent financial statements on one sample.

An illustration of the decision tree for the training sets for the multilayer perceptron network based on the work is displayed below.

Source: (Werbos; Rumelhart)

In this work, the irregularity detection system model seeks to reduce the risk level of fraudulent transactions that take place in the Nigerian banking industry, thereby aiding in the reduction of bank fraud. This will bring about reduced fraudulent transactions if implemented properly. Neural network technology is appropriate for detecting fraudulent transactions because of its ability to learn and remember the characteristics of fraudulent transactions and apply that "knowledge" when assessing new transactions.
(Yuhas B.P., 1993)

The study reinforced the validity and efficiency of AutoNet as a research tool and provided additional empirical evidence regarding the merits of suggested red flags for fraudulent financial statements. The various factors that lead to fraud in our banking system may be connected to one another; there must, therefore, be some underlying factors that have led to this fraudulent activity.

2.3 Fraud detection in Electronic banking

There are very few papers about fraud detection in Electronic banking. Most of them concern fraud prevention, which uses efficient security measures to prevent fraudulent financial transactions performed by unauthorized users and to ensure transaction integrity. Aggelis proposed an Electronic banking fraud detection system for offline processing. Another system works well for Electronic banking but needs a component that must be downloaded and installed on the client device, which is inconvenient for deployment. (Kevin J. L., 1995)

In practice, typical existing Electronic banking fraud detection systems are rule based and match likely fraud in transactions. The rules are mostly generated according to domain knowledge; consequently, these systems usually have a high false positive rate but a low fraud detection rate. Importantly, the rules are not adaptive to changes in the types of fraud.

2.4 Credit card fraud detection

Credit card fraud is divided into two types: offline fraud and Electronic fraud. Offline fraud is committed by using a stolen physical card at a storefront or call center. In most cases, the institution issuing the card can lock it before it is used in a fraudulent manner, if the theft is discovered quickly enough. Electronic fraud is committed via the web, phone shopping or other cardholder-not-present channels. Only the card's details are needed, and a manual signature and card imprint are not required at the time of purchase. With the increase of e-commerce, Electronic credit card transaction fraud is increasing.
Compared to Electronic banking fraud detection, there are many available research discussions and solutions for credit card fraud detection. Most of the work on preventing and detecting credit card fraud has been carried out with neural networks. CARDWATCH features a neural network trained with the past data of a particular customer; the network then processes current spending patterns to detect possible anomalies. Brause and Langsdorf proposed a rule-based association system combined with the neuro-adaptive approach. Falcon, developed by HNC, uses feed-forward Artificial Neural Networks trained on a variant of a back-propagation training algorithm. Machine learning, adaptive pattern recognition, neural networks, and statistical modeling are employed to develop Falcon predictive models that provide a measure of certainty about whether a particular transaction is fraudulent. A neural MLP-based classifier is another example of a system that uses neural networks. It acts only on the information of the operation itself and of its immediate previous history, but not on historic databases of past cardholder activities. (Yuhas B.P., 1993)

A parallel Granular Neural Network (GNN) method uses a fuzzy neural network and a rule-based approach. The neural system is trained in parallel using training data sets, and the trained parallel fuzzy neural network then discovers fuzzy rules for future prediction. Cyber Source introduced a hybrid model, combining an expert system with a neural network to improve its statistical modeling and reduce the number of "false" rejections. There are also some unsupervised methods, such as HMM and clustering, targeting unlabeled data sets. All credit card fraud detection methods seek to discover spending patterns based on the historical data of a particular customer's past activities.
Such methods are not suitable for Electronic banking because of the diversity of Electronic banking customers' activities and the limited historical data available for a single customer. (Reategui, E.B. and Campbell, J. A, 1994)

2.5 Computer intrusion detection

Many intrusion detection systems base their operations on analysis of audit data generated by the operating system. According to Sundaram, intrusion detection approaches in computers are broadly classified into two categories based on a model of intrusions: misuse detection and anomaly detection. Misuse detection attempts to recognize the attacks of previously observed intrusions in the form of a pattern or signature, and then monitors for such occurrences. Misuse approaches include expert systems, model-based reasoning, state transition analysis, and keystroke dynamics monitoring. Misuse detection is simple and fast. Its primary drawback is that it cannot anticipate all the different attacks, because it looks only for known patterns of abuse. (Sundaram, A. 1996)

According to Reichl, anomaly detection tries to establish a historical normal profile for each user and then uses a sufficiently large deviation from the profile to indicate possible intrusions. Anomaly detection approaches include statistical approaches, predictive pattern generation, and neural networks. The advantage of anomaly detection is that it can detect novel attacks; its weakness is that it is likely to have high rates of false alarm.

Data mining approaches can be applied to intrusion detection. A classification model with an association rules algorithm and frequent episodes has been developed for anomaly intrusion detection. This approach can automatically generate concise and accurate detection models from a large amount of audit data. However, it requires a large amount of audit data in order to compute the profile rule sets.
Because most forensic evidence for fraud is left on customers' computers and is difficult to retrieve, intrusion detection methods cannot be directly used for Electronic banking. (Buschkes R, Kesdogan D, Reichl P., 1998)

2.6 Telecommunication fraud detection

According to Yuhas, the various types of telecommunication fraud can be classified into two categories: subscription fraud and superimposed fraud. Subscription fraud occurs when a subscription to a service is obtained, often with false identity details and no intention of making payment. Superimposed fraud occurs when a service is used without the necessary authority and is usually detected by the appearance of unknown calls on a bill. Research work in telecommunication fraud detection has concentrated mainly on identifying superimposed fraud. Most techniques use Call Detail Record data to create behavior profiles for customers, and detect deviations from these profiles. (Yuhas, B.P. 1995)

Proposed approaches include the rule-based approach, neural networks, visualization methods, and so on. Among them, neural networks can calculate user profiles in an independent manner, thus adapting more elegantly to the behavior of various users. Neural networks are claimed to substantially reduce operation costs. As with credit card fraud detection, it is difficult for telecommunication fraud detection methods to characterize the behavior patterns of Electronic banking customers effectively. (Wills, G.J)

Clearly, no single existing method can solve the Electronic banking fraud detection problem easily. Because different approaches have advantages in different aspects, it is believed that a combined solution will outperform any single solution. Neural networks have been successfully adopted in all three kinds of fraud detection and are believed to be a stable model.
As the Electronic banking behavior sequence data is available from the Electronic banking interface log and is discriminative between abnormal and normal activities, sequential behavior patterns should be included for fraud detection. (Brachman, R.J and Wills G.J)

CHAPTER THREE
RESEARCH METHODOLOGY

3.1 Introduction

This chapter presents the analytical framework and the methodology for building Electronic Banking Fraud Detection using Data Mining, with R used to implement the Machine Learning Algorithms for detecting fraud. The methods of analysis were K-Mean Clustering Analysis and Principal Component Analysis. Accordingly, a predictive model was formulated, and adequate procedures and techniques for detecting the presence of outliers, using various distance measures, were adopted.

3.2 Methodology Description

3.2.1 Electronic Banking Transaction Fraud Detection Techniques Summary

The fraud detection process for Electronic Banking transactions follows the steps summarised in the table below:

Step  Name                 Description
1.    read-untagged-data   Data (data object name before preprocessing)
2.    data-preprocessing   Preprocess and clean the data: group or aggregate the items together based on the labelID
3.    create-risk-table    Split the data into (behavioral transaction pattern)
4.    Modelling            Build clusters which identify groups within the datasets and numeric variables using K-Mean Algorithms / display Discriminant Analysis Plot; model Principal Component Variables
5.    Visualisation        Highlight homogeneous groups of individuals with Parallel Coordinate Plot (PCP)
6.    Prediction           Prediction on experimental sets
7.    Evaluation           Evaluate performance

3.2.2 Credit Card Fraud Detection Methods

From the literature survey of various methods for fraud detection, I conclude that there are many approaches to detecting credit card fraud, stated as follows:

A Hybrid Approach and Bayesian Theory
Hybridization Hidden Markov Model
Genetic Algorithm
Neural Network
Bayesian Network
K-nearest neighbor algorithm
Stream Outlier Detection based on Reverse K-Nearest Neighbors (SODRNN)
Fuzzy Logic Based System
Decision Tree
Fuzzy Expert System
Support Vector Machine
Meta Learning Strategy

3.2.3 Credit Card Fraud Detection Techniques

According to Wheeler, R and Aitken, S. (2000), credit card fraud detection techniques are classified into two general categories: fraud analysis (misuse detection) and user behavior analysis (anomaly detection).

The first group of techniques deals with supervised classification at the transaction level. In these methods, transactions are labeled as fraudulent or normal based on previous historical data. This dataset is then used to create classification models which can predict the state (normal or fraud) of new records. There are numerous model creation methods for a typical two-class classification task, such as rule induction, decision trees and neural networks. This approach has been shown to reliably detect most fraud tricks that have been observed before; it is also known as misuse detection.

The second approach (anomaly detection) deals with unsupervised methodologies which are based on account behavior. In this method a transaction is detected as fraudulent if it is in contrast with the user's normal behavior. This is because we do not expect fraudsters to behave in the same way as the account owner or to be aware of the owner's behavior model. To this end, we need to extract the legitimate user behavioral model (i.e. user profile) for each account and then detect fraudulent activities according to it.
When new behaviors are compared with this model, activities that are sufficiently different are distinguished as frauds. The profiles may contain the activity information of the account, such as transaction types, amount, location and time of transactions. This method is also known as anomaly detection (Yeung, D., and Ding, Y., 2002).

It is important to highlight the key differences between the user behavior analysis and fraud analysis approaches. The fraud analysis method can detect known fraud tricks with a low false positive rate (FPR). These systems extract the signature and model of the fraud tricks present in the dataset and can then easily determine exactly which frauds the system is currently experiencing. If the test data does not contain any fraud signatures, no alarm is raised. Thus, the false positive rate (FPR) can be reduced substantially. However, since the learning of a fraud analysis system (i.e. classifier) is based on limited and specific fraud records, it cannot distinguish or detect novel frauds. As a result, the false negative rate (FNR) may be extremely high, depending on how ingenious the fraudsters are.

User behavior analysis, on the other hand, largely addresses the problem of detecting novel frauds. These methods do not search for specific fraud patterns, but rather compare incoming activities with the constructed model of legitimate user behavior. Any activity that is sufficiently different from the model will be considered a possible fraud. Although user behavior analysis approaches are powerful in detecting innovative frauds, they suffer from high rates of false alarm. Moreover, if a fraud occurs during the training phase, this fraudulent behavior will be entered into the baseline model and assumed to be normal in further analysis. (Yeung, D., and Ding, Y., 2002)
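The trade-off between the two error rates discussed above can be made concrete with a short Python sketch. The function name and the two sets of predictions are hypothetical and only illustrate how FPR and FNR are computed from labelled outcomes; they are not results from this study.

```python
def error_rates(actual, predicted):
    """Return (FPR, FNR) for boolean sequences where True means fraud."""
    pairs = list(zip(actual, predicted))
    false_pos = sum(1 for a, p in pairs if not a and p)   # genuine flagged as fraud
    false_neg = sum(1 for a, p in pairs if a and not p)   # fraud that was missed
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr

# Ten illustrative transactions: eight genuine, two fraudulent.
actual = [False] * 8 + [True, True]

# A misuse-style detector: no false alarms, but misses a fraud of a new type.
misuse_pred = [False] * 8 + [True, False]

# An anomaly-style detector: catches both frauds, but raises two false alarms.
anomaly_pred = [True, True] + [False] * 6 + [True, True]

misuse_rates = error_rates(actual, misuse_pred)
anomaly_rates = error_rates(actual, anomaly_pred)
print(misuse_rates, anomaly_rates)
```

In this made-up example the misuse-style detector has a low FPR and a high FNR, while the anomaly-style detector shows the opposite pattern, which mirrors the qualitative comparison in the text.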
I will now briefly introduce some current techniques that are applied to credit card fraud detection tasks, and discuss the main advantages and disadvantages of each approach.

3.2.4 Artificial Neural Network

An artificial neural network (ANN) is a set of interconnected nodes designed to imitate the functioning of the human brain (Douglas, L., and Ghosh, S., 1994). Each node has a weighted connection to several other nodes in adjacent layers. Individual nodes take the input received from connected nodes and use the weights together with a simple function to compute output values. Neural networks come in many shapes and architectures. The neural network architecture, including the number of hidden layers, the number of nodes within a specific hidden layer and their connectivity, must be specified by the user based on the complexity of the problem. ANNs can be configured by supervised, unsupervised or hybrid learning methods.

3.2.5 Supervised techniques

In supervised learning, samples of both fraudulent and non-fraudulent records, together with their labels, are used to create models. These techniques are often used in the fraud analysis approach. One of the most popular supervised neural networks is the back propagation network (BPN). It minimizes the objective function using a multi-stage dynamic optimization method that is a generalization of the delta rule. The back propagation method is often useful for feed-forward networks with no feedback. The BPN algorithm is usually time-consuming, and parameters such as the number of hidden neurons and the learning rate of the delta rule require extensive tuning and training to achieve the best performance. In the domain of fraud detection, supervised neural networks such as back-propagation are known as efficient tools that have numerous applications.

Raghavendra Patidar et al. used a dataset to train a three-layer back propagation neural network in combination with genetic algorithms (GA) for credit card fraud detection. In this work, the genetic algorithms were responsible for deciding the network architecture: the network topology, the number of hidden layers and the number of nodes in each layer.

Also, Aleskerov et al. developed a neural network based data mining system for credit card fraud detection. The proposed system (CARDWATCH) had a three-layer auto-associative architecture. They used a set of synthesized data for training and testing the system. The reported results show very successful fraud detection rates.

A P-RCE neural network has also been applied to credit card fraud detection. P-RCE is a type of radial basis function network that is usually applied to pattern recognition tasks. Krenker et al. proposed a model for real-time fraud detection based on bi-directional neural networks. They used a data set of cell phone transactions provided by a credit card company. It was claimed that the system outperforms rule-based algorithms in terms of false positive rate.

A parallel granular neural network (GNN) has likewise been proposed to speed up the data mining and knowledge discovery process for credit card fraud detection. GNN is a kind of fuzzy neural network based on knowledge discovery (FNNKD). The underlying dataset was extracted from a SQL Server database containing sample Visa Card transactions and then preprocessed for use in fraud detection. They obtained lower average training errors in the presence of a larger training dataset.

3.2.6 Unsupervised techniques

According to Yamanishi, K., and Takeuchi, J. (2004), unsupervised techniques do not need previous knowledge of fraudulent and normal records. These methods raise an alarm for those transactions that are most dissimilar from the normal ones.
These techniques are often used in the user behavior approach. ANNs can produce acceptable results for sufficiently large transaction datasets; they need a large training dataset. The self-organizing map (SOM), introduced by Kohonen, is one of the most popular unsupervised neural network learning methods. It provides a clustering method which is appropriate for constructing and analyzing customer profiles in credit card fraud detection. SOM operates in two phases: training and mapping. In the former phase, the map is built and the weights of the neurons are updated iteratively, based on input samples; in the latter, test data is classified automatically into normal and fraudulent classes through the procedure of mapping. After the SOM has been trained, each new, unseen transaction is compared to the normal and fraud clusters; if it is similar to the normal records, it is classified as normal. New fraud transactions are detected similarly.

One of the advantages of using unsupervised neural networks over similar techniques is that these methods can learn from a data stream. The more data is passed to a SOM model, the more its results adapt and improve. More specifically, the SOM adapts its model as time passes. Therefore it can be used and updated electronically in banks or other financial corporations. As a result, the fraudulent use of a card can be detected quickly and effectively. However, neural networks have some drawbacks and difficulties, which are mainly related to specifying a suitable architecture on the one hand and the excessive training required to reach the best performance on the other. (Williams, G. and Milne, P., 2004)

3.2.7 Hybrid supervised and unsupervised techniques

In addition to supervised and unsupervised learning models of neural networks, some researchers have applied hybrid models. John ZhongLei et al. proposed hybrid supervised (SICLN) and unsupervised (ICLN) learning networks for credit card fraud detection.
They improved the reward-only rule of the SICLN model to ICLN in order to update weights according to both reward and penalty. This improvement appeared in terms of increased stability and reduced training time. Moreover, the number of final clusters of the ICLN is independent of the number of initial network neurons. As a result, the inoperable neurons can be omitted from the clusters by applying the penalty rule. The results indicated that both the ICLN and the SICLN have high performance, but the SICLN outperforms well-known unsupervised clustering algorithms. (R. Huang, H. Tawfik, A. Nagar., 2010)

3.2.8 DECISION TREES AND SUPPORT VECTOR MACHINES:

Classification models based on decision trees and support vector machines (SVM) have been developed and applied to the credit card fraud detection problem. In this technique, each account is tracked separately by using suitable descriptors, and an attempt is made to identify and flag each transaction as normal or fraudulent (Sahin, Y., and Duman, E., 2011). The identification is based on the suspicion score produced by the developed classifier model. When a new transaction is in progress, the classifier can predict whether the transaction is normal or fraudulent.

In this approach, all the collected data is first pre-processed before the modeling phase starts. Since the distribution of the data with respect to the classes is highly imbalanced, stratified sampling is used to under-sample the normal records so that the models have a chance to learn the characteristics of both the normal and the fraudulent records' profiles. To do this, the variables that are most successful in differentiating the legitimate and the fraudulent transactions are found. Then, these variables are used to form stratified samples of the legitimate records. Later on, these stratified samples of the legitimate records are combined with the fraudulent ones to form three samples with different fraudulent to normal record ratios.
The first sample set has a ratio of one fraudulent record to one normal record; the second has a ratio of one fraudulent record to four normal ones; and the last has a ratio of one fraudulent record to nine normal ones.

The variables which are used make the difference in fraud detection systems. Our main motive in defining the variables that are used to form the data-mart is to differentiate the profile of the fraudulent card user from the profile of the legitimate card user. The results show that the decision tree classifiers outperform SVM in solving the problem under investigation. However, as the size of the training data sets becomes larger, the accuracy of SVM based models becomes equivalent to that of decision tree based models, but the number of frauds caught by SVM models is still less than the number of frauds caught by decision tree methods. (Carlos Leon, Juan I. Guerrero, Jesus Biscarri., 2012)

3.2.9 FUZZY LOGIC BASED SYSTEMS: Fuzzy Neural Network

The purpose of fuzzy neural networks is to process large volumes of uncertain information, and they are extensively applied in our lives. Syeda et al. (2002) proposed fuzzy neural networks which run on parallel machines to speed up rule production for customer-specific credit card fraud detection. Their work can be associated with Data Mining and Knowledge Discovery in Databases (KDD). In this technique, they used the GNN (Granular Neural Network) method, which uses a fuzzy neural network based on knowledge discovery (FNNKD), to train the network quickly and to determine how fast a number of customers can be processed for fraud detection in parallel. A transaction table includes various fields such as the transaction amount, statement date, posting date, time between transactions, transaction code, day, transaction description, and so on.
For the implementation of this credit card fraud detection method, however, only the significant fields from the database are extracted into a simple text file by applying suitable SQL queries. In this detection method, the transaction amount for any customer is the key input data. This preprocessing of the data helped in decreasing the data size and processing, which speeds up the training and makes the patterns briefer. In the fuzzy neural network process, data is classified into three categories: the first for training, the second for prediction, and the third for fraud detection. The detection system routine for any customer is as follows:

Preprocess the data from a SQL Server database.
Extract the preprocessed data into a text file.
Normalize the data and distribute it into three categories (training, prediction, detection).

The data are normalized by a factor because the GNN accepts inputs in the range 0 to 1, whereas a transaction amount can be any number greater than or equal to zero; for a particular customer, only the maximum transaction amount is considered throughout the work.

In this detection method, two important parameters are used during training: (i) the training error and (ii) the number of training cycles. As the number of training cycles increases, the training error decreases. The accuracy of the results depends on these parameters. In the prediction stage, the maximum absolute prediction error is calculated. In the fraud detection stage, the absolute detection error is calculated; if it is greater than zero, it is then checked against the maximum absolute prediction error. If the absolute detection error is greater than the maximum absolute prediction error, the transaction is flagged as fraudulent; otherwise the transaction is reported to be safe. Both training cycles and data partitioning are extremely important for better results.
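The decision rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the trained granular neural network is replaced by a hypothetical stand-in that always predicts the mean of the training data, amounts are normalized by dividing by an assumed customer maximum, and all names and figures are invented for the example.

```python
def normalise(amounts, max_amount):
    """Scale amounts by the customer's maximum amount (assumed factor)."""
    return [a / max_amount for a in amounts]

def detect_fraud(train, predict, detect, max_amount):
    """Flag each transaction in `detect` using the error-comparison rule."""
    train_n = normalise(train, max_amount)
    predict_n = normalise(predict, max_amount)
    detect_n = normalise(detect, max_amount)

    # Stand-in for the trained network (NOT a GNN): it always predicts
    # the mean of the normalized training amounts.
    expected = sum(train_n) / len(train_n)

    # Prediction stage: maximum absolute prediction error.
    max_prediction_error = max(abs(x - expected) for x in predict_n)

    # Detection stage: a transaction is flagged when its absolute detection
    # error exceeds the maximum absolute prediction error.
    return [abs(x - expected) > max_prediction_error for x in detect_n]

# Illustrative amounts for one customer, partitioned into three categories.
training_set = [20, 30, 25, 35, 40]
prediction_set = [22, 38, 30]
detection_set = [28, 95]

flags = detect_fraud(training_set, prediction_set, detection_set, max_amount=100)
print(flags)
```

Here the amount of 28 lies within the error range seen in the prediction stage and is reported as safe, while the amount of 95 produces a much larger detection error and is flagged.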
The more data there is for training the neural network, the better the prediction it gives. A lower training error makes prediction and detection more accurate. The higher the fraud detection error, the greater the possibility that the transaction is fraudulent. (Peter J. Bentley, 2000)

3.3 Model Specification

In this work, the Predictive Model for the Unsupervised Machine Learning Detection System seeks to reduce the risk level of fraudulent transactions that take place in the Nigerian banking industry, thereby aiding in the reduction of bank fraud. This will bring about reduced fraudulent transactions if implemented properly. The efficiency is measured on the basis of the frequency of detecting outliers or unusual behavioral user patterns.

3.3.1 Model for Data Reduction

According to Bruker Daltonics, data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. (D.L Massart, and Y. Vander Heyden., 2004)

3.3.2 Principal Component Analysis (PCA) Procedure

Suppose that we have a random vector X:

\[ X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} \]

with population variance-covariance matrix:

\[ \operatorname{var}(X) = \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_p^2 \end{pmatrix} \]

Consider the linear combinations:

\[ \begin{aligned} Y_1 &= e_{11}X_1 + e_{12}X_2 + \cdots + e_{1p}X_p \\ Y_2 &= e_{21}X_1 + e_{22}X_2 + \cdots + e_{2p}X_p \\ &\;\;\vdots \\ Y_p &= e_{p1}X_1 + e_{p2}X_2 + \cdots + e_{pp}X_p \end{aligned} \]

Each of these can be thought of as a linear regression, predicting $Y_i$ from $X_1, X_2, \ldots, X_p$. There is no intercept, but $e_{i1}, e_{i2}, \ldots, e_{ip}$ can be viewed as regression coefficients.

Note that $Y_i$ is a function of our random data, and so is also random. Therefore it has a population variance (writing $\sigma_{kk} = \sigma_k^2$):

\[ \operatorname{var}(Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{ik}e_{il}\sigma_{kl} = e_i'\Sigma e_i \]

Moreover, $Y_i$ and $Y_j$ will have a population covariance:

\[ \operatorname{cov}(Y_i, Y_j) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{ik}e_{jl}\sigma_{kl} = e_i'\Sigma e_j \]

Here the coefficients $e_{ij}$ are collected into the vector:

\[ e_i = \begin{pmatrix} e_{i1} \\ e_{i2} \\ \vdots \\ e_{ip} \end{pmatrix} \]

First Principal Component (PCA1): $Y_1$

The first principal component is the linear combination of X-variables that has maximum variance (among all linear combinations), so it accounts for as much variation in the data as possible. Specifically, we define the coefficients $e_{11}, e_{12}, \ldots, e_{1p}$ for that component in such a way that its variance is maximized, subject to the constraint that the sum of the squared coefficients is equal to one. This constraint is required so that a unique answer may be obtained.

More formally, select $e_{11}, e_{12}, \ldots, e_{1p}$ that maximize:

\[ \operatorname{var}(Y_1) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{1k}e_{1l}\sigma_{kl} = e_1'\Sigma e_1 \]

subject to the constraint that:

\[ e_1'e_1 = \sum_{j=1}^{p} e_{1j}^2 = 1 \]

Second Principal Component (PCA2): $Y_2$

The second principal component is the linear combination of X-variables that accounts for as much of the remaining variation as possible, with the constraint that the correlation between the first and second components is 0.

Select $e_{21}, e_{22}, \ldots, e_{2p}$ that maximize the variance of this new component:

\[ \operatorname{var}(Y_2) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{2k}e_{2l}\sigma_{kl} = e_2'\Sigma e_2 \]

subject to the constraint that the sum of the squared coefficients adds up to one:

\[ e_2'e_2 = \sum_{j=1}^{p} e_{2j}^2 = 1 \]

along with the additional constraint that these two components are uncorrelated with one another:

\[ \operatorname{cov}(Y_1, Y_2) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{1k}e_{2l}\sigma_{kl} = e_1'\Sigma e_2 = 0 \]

All subsequent principal components have this same property: they are linear combinations that account for as much of the remaining variation as possible and they are not correlated with the other principal components. We proceed in the same way with each additional component. For instance:

ith Principal Component (PCAi): $Y_i$

We select $e_{i1}, e_{i2}, \ldots, e_{ip}$ that maximize:

\[ \operatorname{var}(Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p} e_{ik}e_{il}\sigma_{kl} = e_i'\Sigma e_i \]

subject to the constraint that the sum of the squared coefficients adds up to one, along with the additional constraint that this new component is uncorrelated with all the previously defined components:

\[ \begin{aligned} e_i'e_i &= \sum_{j=1}^{p} e_{ij}^2 = 1 \\ \operatorname{cov}(Y_1, Y_i) &= \sum_{k=1}^{p}\sum_{l=1}^{p} e_{1k}e_{il}\sigma_{kl} = e_1'\Sigma e_i = 0 \\ \operatorname{cov}(Y_2, Y_i) &= \sum_{k=1}^{p}\sum_{l=1}^{p} e_{2k}e_{il}\sigma_{kl} = e_2'\Sigma e_i = 0 \\ &\;\;\vdots \\ \operatorname{cov}(Y_{i-1}, Y_i) &= \sum_{k=1}^{p}\sum_{l=1}^{p} e_{i-1,k}e_{il}\sigma_{kl} = e_{i-1}'\Sigma e_i = 0 \end{aligned} \]

Therefore all principal components are uncorrelated with one another.

3.3.2 How do we find the coefficients?
How do we find the coefficients $e_{ij}$ for a principal component? The solution involves the eigenvalues and eigenvectors of the variance-covariance matrix $\Sigma$.

Solution: Let $\lambda_1$ through $\lambda_p$ denote the eigenvalues of the variance-covariance matrix $\Sigma$. These are ordered so that $\lambda_1$ is the largest eigenvalue and $\lambda_p$ is the smallest:

\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \]

Let the vectors $e_1$ through $e_p$, that is, $e_1, e_2, \ldots, e_p$, denote the corresponding eigenvectors. It turns out that the elements of these eigenvectors are the coefficients of our principal components.

The variance of the ith principal component is equal to the ith eigenvalue:

\[ \operatorname{var}(Y_i) = \operatorname{var}(e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p) = \lambda_i \]

Moreover, the principal components are uncorrelated with one another:

\[ \operatorname{cov}(Y_i, Y_j) = 0 \quad (i \ne j) \]

The variance-covariance matrix may be written as a function of the eigenvalues and their corresponding eigenvectors. This is determined by using the Spectral Decomposition Theorem. This will become useful later when we investigate topics under factor analysis.

Spectral Decomposition Theorem

The variance-covariance matrix can be written as the sum over the p eigenvalues, each multiplied by the product of the corresponding eigenvector and its transpose, as shown in the first expression below:

\[ \Sigma = \sum_{i=1}^{p} \lambda_i e_i e_i' \cong \sum_{i=1}^{k} \lambda_i e_i e_i' \]

The second expression is a useful approximation if $\lambda_{k+1}, \lambda_{k+2}, \ldots, \lambda_p$ are small. We might then approximate $\Sigma$ by:

\[ \sum_{i=1}^{k} \lambda_i e_i e_i' \]

Again, this will become more useful when we talk about factor analysis.

Note that we defined the total variation of X as the trace of the variance-covariance matrix or, if you like, the sum of the variances of the individual variables. This is also equal to the sum of the eigenvalues, as shown below:

\[ \operatorname{trace}(\Sigma) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_p^2 = \lambda_1 + \lambda_2 + \cdots + \lambda_p \]

This gives us an interpretation of the components in terms of the amount of the full variation explained by each component. The proportion of variation explained by the ith principal component is then defined to be the eigenvalue for that component divided by the sum of the eigenvalues.
In other words, the ith principal component explains the following proportion of the total variation:

$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$$

A related quantity is the proportion of variation explained by the first k principal components. This is the sum of the first k eigenvalues divided by the total variation:

$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$$

Naturally, if the proportion of variation explained by the first k principal components is large, then not much information is lost by considering only the first k principal components.

Why It May Be Possible to Reduce Dimensions

When there are correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line.

All of this is defined in terms of the population variance-covariance matrix $\Sigma$, which is unknown. However, we may estimate $\Sigma$ by the sample variance-covariance matrix, which is given by the standard formula:

$$S = \frac{1}{n-1}\sum_{i=1}^{n} (\mathbf{X}_i - \bar{\mathbf{x}})(\mathbf{X}_i - \bar{\mathbf{x}})'$$

Procedure: Compute the eigenvalues $\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_p$ of the sample variance-covariance matrix S and the corresponding eigenvectors $\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p$; then define the estimated principal components using the eigenvectors as coefficients:

$$\hat{Y}_1 = \hat{e}_{11}X_1 + \hat{e}_{12}X_2 + \cdots + \hat{e}_{1p}X_p$$

$$\hat{Y}_2 = \hat{e}_{21}X_1 + \hat{e}_{22}X_2 + \cdots + \hat{e}_{2p}X_p$$

$$\vdots$$

$$\hat{Y}_p = \hat{e}_{p1}X_1 + \hat{e}_{p2}X_2 + \cdots + \hat{e}_{pp}X_p$$

Generally, we only retain the first k principal components. Here we must balance two conflicting desires:

1. To obtain the simplest possible interpretation, we want k to be as small as possible. If we can explain most of the variation with just two principal components, this gives a much simpler description of the data. However, the smaller k is, the smaller the amount of variation explained by the first k components.

2. To avoid loss of information, we want the proportion of variation explained by the first k principal components to be large.
Ideally it should be as close to one as possible; i.e., we want

$$\frac{\hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_k}{\hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p} \cong 1$$

3.3.3 Standardize the Variables

According to Baxter, R., and Hawkins, S., (2002), if raw data are used, principal component analysis will tend to give more emphasis to variables with higher variances than to variables with very low variances. In effect, the results of the analysis will depend on the units of measurement used for each variable. This implies that a principal component analysis should only be used with the raw data if all variables have the same units of measure, and even then only if we wish to give the variables with higher variances more weight in the analysis.

Summary

The results of principal component analysis depend on the scales at which the variables are measured. Variables with the highest sample variances will tend to be emphasized in the first few principal components. Principal component analysis using the covariance function should only be considered if all of the variables have the same units of measurement.

If the variables have different units of measurement (e.g., pounds, feet, gallons), or if we wish each variable to receive equal weight in the analysis, then the variables should be standardized before a principal component analysis is carried out. Standardize each variable by subtracting its mean and dividing by its standard deviation:

$$Z_{ij} = \frac{X_{ij} - \bar{x}_j}{s_j}$$

Where,

$X_{ij}$ = Data for variable j in sample unit i
$\bar{x}_j$ = Sample mean for variable j
$s_j$ = Sample standard deviation for variable j

We will now perform the principal component analysis using the standardized data.

Note: the variance-covariance matrix of the standardized data is equal to the correlation matrix of the unstandardized data. Therefore, principal component analysis using the standardized data is equivalent to principal component analysis using the correlation matrix.
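The note above can be checked numerically. The following is a minimal sketch in plain Python, assuming two hypothetical variables on very different scales (the values, and the names `amount` and `hour`, are illustrative and are not the thesis data). It standardizes both variables, confirms that their covariance equals the correlation of the raw variables, and uses the fact that a 2 × 2 correlation matrix with off-diagonal entry r has eigenvalues 1 + |r| and 1 − |r|.

```python
import math

def sample_cov(a, b):
    """Sample covariance with the n - 1 divisor."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def standardize(values):
    """Z_ij = (X_ij - mean_j) / s_j for a single variable j."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sample_cov(values, values))
    return [(v - mean) / sd for v in values]

# Hypothetical values on very different scales (not the thesis data).
amount = [1200.0, 5300.0, 800.0, 15000.0, 2500.0, 700.0]
hour = [9.0, 14.0, 8.0, 23.0, 11.0, 7.0]

z_amount, z_hour = standardize(amount), standardize(hour)

# Covariance of the standardized variables ...
r = sample_cov(z_amount, z_hour)
# ... should equal the correlation of the raw variables.
r_raw = sample_cov(amount, hour) / math.sqrt(
    sample_cov(amount, amount) * sample_cov(hour, hour)
)

# For two variables the correlation matrix is [[1, r], [r, 1]],
# whose eigenvalues are 1 + |r| and 1 - |r|.
eigenvalues = [1 + abs(r), 1 - abs(r)]
proportion_pc1 = eigenvalues[0] / sum(eigenvalues)

print(round(r, 4), round(r_raw, 4), round(proportion_pc1, 4))
```

With more than two variables the eigenvalues no longer have this closed form, and a numerical routine would be needed instead.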
Principal Component Analysis Procedure

The principal components are first calculated by obtaining the eigenvalues of the correlation matrix:

$$\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_p$$

Here these denote the eigenvalues of the sample correlation matrix R, with corresponding eigenvectors:

$$\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p$$

The estimated principal component scores are then calculated using formulas similar to before, but using the standardized data instead of the raw data:

$$\hat{Y}_1 = \hat{e}_{11}Z_1 + \hat{e}_{12}Z_2 + \cdots + \hat{e}_{1p}Z_p$$

$$\hat{Y}_2 = \hat{e}_{21}Z_1 + \hat{e}_{22}Z_2 + \cdots + \hat{e}_{2p}Z_p$$

3.3.4. Measures of Association for Continuous Variables

According to Johnson and Wichern, the following standard notation is generally used:

$X_{ik}$ = Response for variable k in sample unit i (the number of individual observation at site i)
$n$ = Number of sample units
$p$ = Number of variables

Johnson and Wichern list four different measures of association (similarity) that are frequently used with continuous variables in cluster analysis:

Euclidean Distance - This is the most commonly used measure. For instance, in two dimensions, we can plot the observations in a scatter plot and simply measure the distances between pairs of points. More generally we can use the following equation:

$$d(\mathbf{X}_i, \mathbf{X}_j) = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2}$$

This is the square root of the sum of the squared differences between the measurements for each variable.

Some other distances use a similar concept. For instance, the Minkowski Distance is:

$$d(\mathbf{X}_i, \mathbf{X}_j) = \left[\sum_{k=1}^{p} |X_{ik} - X_{jk}|^m\right]^{1/m}$$

Here the square is replaced by raising the absolute difference to the power m, and instead of taking the square root we take the mth root.

Here are two other methods for measuring association:

Canberra Metric

$$d(\mathbf{X}_i, \mathbf{X}_j) = \sum_{k=1}^{p} \frac{|X_{ik} - X_{jk}|}{X_{ik} + X_{jk}}$$

Czekanowski Coefficient

$$d(\mathbf{X}_i, \mathbf{X}_j) = 1 - \frac{2\sum_{k=1}^{p} \min(X_{ik}, X_{jk})}{\sum_{k=1}^{p} (X_{ik} + X_{jk})}$$

For each of these distance measures, the smaller the distance, the more similar (more strongly associated) the two subjects are.

The measure of association must satisfy the following properties:

1. Symmetry: $d(\mathbf{X}_i, \mathbf{X}_j) = d(\mathbf{X}_j, \mathbf{X}_i)$
i.e., the distance between subject one and subject two must be the same as the distance between subject two and subject one.

2. Positivity: $d(\mathbf{X}_i, \mathbf{X}_j) > 0$ if $\mathbf{X}_i \ne \mathbf{X}_j$
i.e., the distances must be positive; negative distances are not allowed.

3. Identity: $d(\mathbf{X}_i, \mathbf{X}_j) = 0$ if $\mathbf{X}_i = \mathbf{X}_j$
i.e., the distance between a subject and itself must be zero.

4. Triangle inequality: $d(\mathbf{X}_i, \mathbf{X}_k) \le d(\mathbf{X}_i, \mathbf{X}_j) + d(\mathbf{X}_j, \mathbf{X}_k)$
This follows from geometric considerations: the sum of two sides of a triangle cannot be smaller than the third side.

3.3.5. Agglomerative Hierarchical Clustering

Combining Clusters in the Agglomerative Approach

In the agglomerative hierarchical approach, we start by defining each data point to be a cluster and then combine existing clusters at each step (Bates, S., and Saker, H., 2006). Here are five different methods for doing this:

1. Single Linkage: In single linkage, we define the distance between two clusters to be the minimum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest single linkage distance.

2. Complete Linkage: In complete linkage, we define the distance between two clusters to be the maximum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest complete linkage distance.

3. Average Linkage: In average linkage, we define the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest average linkage distance.

4.
Centroid Method: In the centroid method, the distance between two clusters is the distance between the two mean vectors of the clusters. At each stage of the process we combine the two clusters that have the smallest centroid distance.

5. Ward's Method: This method does not directly define a measure of distance between two points or clusters. It is an ANOVA-based approach. At each stage, the two clusters that merge are those giving the smallest increase in the combined error sum of squares from one-way univariate ANOVAs, which can be done for each variable with groups defined by the clusters at that stage of the process.

According to Pinheiro, R., and Bates, S., (2000), none of these methods is uniformly the best. In practice, it is advisable to try several methods and then compare the results to form an overall judgment about the final formation of clusters.

Notationally, define:

$\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k$ = Observations from cluster 1
$\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_l$ = Observations from cluster 2
$d(\mathbf{x}, \mathbf{y})$ = Distance between observation vector $\mathbf{x}$ and observation vector $\mathbf{y}$

3.4.1. Linkage Methodology or Measuring Association between Cluster 1 and 2 ($d_{12}$)

1. Single Linkage: $d_{12} = \min_{i,j} d(\mathbf{X}_i, \mathbf{Y}_j)$
This is the distance between the closest members of the two clusters.

2. Complete Linkage: $d_{12} = \max_{i,j} d(\mathbf{X}_i, \mathbf{Y}_j)$
This is the distance between the members that are farthest apart (most dissimilar).

3. Average Linkage: $d_{12} = \frac{1}{kl}\sum_{i=1}^{k}\sum_{j=1}^{l} d(\mathbf{X}_i, \mathbf{Y}_j)$
This method involves looking at the distances between all pairs and averaging all of these distances. This is also called the Unweighted Pair Group Method with Arithmetic Mean (UPGMA).

4. Centroid Method: $d_{12} = d(\bar{\mathbf{x}}, \bar{\mathbf{y}})$
This involves finding the mean vector location for each of the clusters and taking the distance between these two centroids. (Vesanto, J., & Alhoniemi, E., 2000).

3.4.2 Applying Gaussian distribution to developing Anomaly detection Algorithms: Han et al.
and Andrew Ng (2012)

Let $x \in \mathbb{R}$ and assume $x \sim N(\mu, \sigma^2)$, so that the probability density function of $x$ is given by:

$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where $\mu$ and $\sigma^2$ are unknown. Thus, to estimate $\mu$ and $\sigma^2$, we have:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \quad \text{and} \quad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$

3.4.3 Now, to develop the anomaly detection algorithm, let the training set be $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$, the features from m users, where each $x^{(i)} \in \mathbb{R}^n$ is a vector. We can then model the probability of $x$, $p(x)$, based on all the features $x_1, x_2, x_3, \ldots, x_n$, with the assumption that $x_j \sim N(\mu_j, \sigma_j^2)$, so that $p(x_j) = p(x_j; \mu_j, \sigma_j^2)$, although in machine learning it does not necessarily follow that the features are independently and identically distributed. Consequently, it follows that:

$$x_1 \sim N(\mu_1, \sigma_1^2)$$

$$x_2 \sim N(\mu_2, \sigma_2^2)$$

$$\vdots$$

$$x_n \sim N(\mu_n, \sigma_n^2)$$

so that the joint probability density function is:

$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$

This can be written in expanded form as:

$$p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2)\, p(x_3; \mu_3, \sigma_3^2) \cdots p(x_n; \mu_n, \sigma_n^2)$$

3.4.4 Anomaly Detection Algorithm Steps:

If a user is a suspect, the following steps should be followed:

1. Choose features $x_j$ that are likely to be indicative of anomalous examples.

2. Fit the parameters $\mu_1, \mu_2, \ldots, \mu_n$ and $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ using the estimates:

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)} \quad \text{and} \quad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$

3. Given a new example $x$, compute $p(x)$:

$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$

4. The example is an anomaly if and only if $p(x) < \varepsilon$.

5. Plot the graph of $p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2)\, p(x_3; \mu_3, \sigma_3^2) \cdots p(x_n; \mu_n, \sigma_n^2)$.

6. Let us assume $\varepsilon = 0.02$, then:
a. If $p(x_{test}^{(1)}) = 0.0426 \ge \varepsilon$, then the example is not an anomaly, but if:
b. $p(x_{test}^{(2)}) = 0.0021 < \varepsilon$, then that is an indication of a red flag, thus flag the example as an anomaly.

3.4.5 The Importance of Real-Number Evaluation:

When developing a learning algorithm (choosing features, etc.), decision making is much easier if we have a way of evaluating the learning algorithm:

1. By choosing the features to use and include in our learning model

2.
How to evaluate them, and decide on improvements to the algorithm by deciding which features to include and which to leave out

3. Assume we have some labelled data of anomalous and non-anomalous examples ($y = 0$ if normal and $y = 1$ if anomalous)

4. In the process of developing and evaluating the algorithm, the data are split, for example, into:
a. Training Set: $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$, assumed to contain normal examples with none anomalous
b. Cross Validation Set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), (x_{cv}^{(2)}, y_{cv}^{(2)}), \ldots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$
c. Test Set: $(x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), \ldots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$

5. It is necessary to include a few anomalous ($y = 1$) examples in the Test Set and the Cross Validation Set.

3.4.6 Algorithm Evaluation:

Step 1. Fit the model $p(x)$ on the Training Set: $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$

Step 2. Evaluate the fitted model $p(x)$ on the Cross Validation Set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), \ldots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$

Step 3. Evaluate the fitted model $p(x)$ on the Test Set: $(x_{test}^{(1)}, y_{test}^{(1)}), \ldots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$

Step 4. Predict

$$y = \begin{cases} 1, & \text{if } p(x) < \varepsilon \text{ (anomaly)} \\ 0, & \text{if } p(x) \ge \varepsilon \text{ (normal)} \end{cases}$$

Possible Evaluation Metrics (Ref: 3.2.3)

1. TP, the number of true positives
2. FP, the number of false positives
3. FN, the number of false negatives
4. TN, the number of true negatives
5. Precision / Recall
6. F1-score; note that the Cross Validation Set can be used to estimate and choose $\varepsilon$

3.4.7 Now, given the training, cross validation and test sets, the algorithm evaluation is computed as follows:

1. Consider $(x_{test}^{(i)}, y_{test}^{(i)})$, where $x$ = features of user i's transaction activities, and predict

$$y = \begin{cases} 1, & \text{if } p(x) < \varepsilon \text{ (anomaly)} \\ 0, & \text{if } p(x) \ge \varepsilon \text{ (normal)} \end{cases}$$

so that each example receives a y-label.

2. The label y assigned by the algorithm is either normal or anomalous.

3. However, for the transaction data set used in this research work, there are many more $y = 0$ (normal) examples than $y = 1$ (anomalous) examples. In addition, the normality test and the histogram plot of the dependent variable show that the dependent feature, transactionAmount, is highly skewed.

4.
Thus classification may not necessarily be a good evaluation metric because of the skewed variable; the data are therefore transformed to meet the normality assumption.

3.4.8 Suggestions and Guidelines on how to Design or Choose Features for Anomaly Detection Algorithms:

1. Plot the histogram of the candidate features from the available data set to confirm whether they are normally distributed.

2. If a feature is normal, fit the model; otherwise, transform it by taking the log or any other appropriate function, and check again whether the histogram validates the normality assumption.

3. Define the transformed feature as the new X and use it in place of the previous variable X.

4. Then fit the anomaly detection algorithm as stated earlier.

3.5.7 Data Pre-Processing for Fraud Detection

Deploying the unsupervised K-Means clustering algorithm by hand, following the mathematical and algorithmic steps suggested in the literature, could be too demanding and unrealistic, even for an expert R user. Consequently, I used a graphical user interface package, Rattle, for easy manipulation and implementation, based on the guidelines and suggestions of Williams, Graham. Data Mining with Rattle and R. s.l. Springer.

The problem at hand involves a large amount of data with no prior known features that can be used for classification. Clustering the data into different groups and trying to understand the behaviour of each group is suggested as a methodology for modelling the user behavioral pattern of the transaction data sets. Thus, I explore the dtrans_data and Aggdtrans_data sets with R/Rattle to validate the legitimate user behavioral model.

The algorithm chosen for clustering the transaction data is the K-Means algorithm, and the tools for the implementation are R and Rattle. The following sections present the algorithm used for clustering and the tools used for implementing the solution.
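Before turning to clustering, the Gaussian anomaly detection steps of Sections 3.4.4 to 3.4.8 can be summarised in a short sketch. This is a minimal illustration in plain Python, not the implementation used in this work (which relies on R and Rattle). The training values, the two feature names and the test points are hypothetical; the threshold ε = 0.02 is the one assumed in Section 3.4.4.

```python
import math

def fit_gaussian(train):
    """Estimate mu_j and sigma_j^2 per feature, using the 1/m divisor of Section 3.4.4."""
    m = len(train)
    n = len(train[0])
    mu = [sum(row[j] for row in train) / m for j in range(n)]
    var = [sum((row[j] - mu[j]) ** 2 for row in train) / m for j in range(n)]
    return mu, var

def density(x, mu, var):
    """p(x): product over features of the univariate normal densities."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p

def is_anomaly(x, mu, var, epsilon):
    """Flag x as anomalous when p(x) < epsilon."""
    return density(x, mu, var) < epsilon

# Hypothetical training examples, assumed normal (non-fraudulent):
# feature 1 = log10 of the transaction amount (log transform per Section 3.4.8),
# feature 2 = local hour rescaled to [0, 1].
train = [
    [3.1, 0.40], [3.3, 0.45], [2.9, 0.50],
    [3.0, 0.55], [3.2, 0.48], [3.4, 0.52],
]
mu, var = fit_gaussian(train)

epsilon = 0.02          # threshold assumed in Section 3.4.4
typical = [3.15, 0.49]  # hypothetical example close to the training data
unusual = [5.0, 0.05]   # hypothetical example far from the training data

print(is_anomaly(typical, mu, var, epsilon), is_anomaly(unusual, mu, var, epsilon))
```

In practice ε would be chosen on the cross validation set rather than fixed in advance, as noted in Section 3.4.6.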
3.5.8 K-means algorithm

K-Means is one of the simplest unsupervised clustering algorithms. It partitions the data set into k clusters using the cluster mean values, so that the resulting intra-cluster similarity is high and the inter-cluster similarity is low. K-Means is iterative in nature and follows these steps:

1. Arbitrarily generate k points (cluster centres), k being the number of clusters desired.
2. Calculate the distance from each data point to each of the centres, and assign each point to the closest centre.
3. Calculate the new cluster centres as the mean value of all data points in the respective cluster.
4. With the new centres, repeat step 2.
5. If the cluster assignment of any data point changes, repeat step 3; otherwise stop the process.

The distance between data points is calculated using the Euclidean distance. For two points or feature vectors $X_1 = (x_{11}, x_{12}, \ldots, x_{1m})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2m})$:

$$\mathrm{Dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{m} (x_{1i} - x_{2i})^2}$$

Advantages

- Fast, robust and easy to understand.
- Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters, d the dimension of each object, and t the number of iterations. Normally k, t, d ≪ n.
- Gives the best results when the clusters in the data set are distinct or well separated from each other.

Disadvantages

- The learning algorithm requires a priori specification of the number of cluster centres.
- The learning algorithm may converge only to a local optimum of the squared error function.
- Applicable only when the mean is defined, i.e. it fails for categorical data.
- Unable to handle noisy data and outliers.

3.6.3 Strategies for data reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:

- Data aggregation, where aggregation operations are applied to the data in the construction of optimal data variables and features for the analysis (Bruker Daltonics).
- Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed.
- Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
- Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need to store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
- Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.

From the above data reduction strategies, attribute subset selection has been chosen for the data cleaning and transformation step in the typical Rattle workflow.
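Returning to the K-Means steps listed in Section 3.5.8, they can be sketched in a few lines of plain Python. This is an illustrative sketch only; the analysis in this work uses R and Rattle. The function name `k_means`, the sample points and the choice to pass the initial centres explicitly (rather than generate them arbitrarily, so that the run is reproducible) are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, initial_centres, max_iter=100):
    """Alternate assignment (step 2) and centre update (step 3) until stable."""
    centres = [list(c) for c in initial_centres]  # copy so the input is not modified
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centre.
        new_assignment = [
            min(range(len(centres)), key=lambda c: euclidean(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Step 5: no assignment changed, so stop.
        assignment = new_assignment
        # Step 3: recompute each centre as the mean of its members.
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assignment

# Hypothetical two-dimensional points forming two well separated groups.
points = [
    [0.10, 0.20], [0.15, 0.25], [0.20, 0.10],
    [0.80, 0.90], [0.85, 0.80], [0.90, 0.95],
]
centres, assignment = k_means(points, initial_centres=[points[0], points[3]])
print(centres, assignment)
```

Because the result depends on the starting centres, different initial choices can lead to different local optima, which is the second disadvantage listed in Section 3.5.8.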
CHAPTER FOUR

4.0 RESULTS AND ANALYSIS

In this chapter, I present the results of the experimental deployment and practical evaluation of K-Means Cluster Analysis and Principal Component Analysis. The chapter covers the procedures for detecting outliers using various distance measures, the general detection performance of unsupervised machine learning, and how to design and choose features for electronic transaction fraud detection.

4.1 Descriptive Statistics Visualization for Probability Model

Suitable Variables for the Models (Response Variable)

The response variable was initially unsuitable for the proposed model because it was highly skewed, so I needed to transform transactionNairaAmount. The histogram of the transformed variable, with the normality curve, is displayed above.

Iteration 1: Applying K-Mean Cluster Analysis on the dtrans_data:

4.2 Prediction before manipulation and transformation of the data field variables:

K-Means Clustering is a cluster analysis that identifies groups within a data set. The K-Means clustering algorithm searches for the specified number of clusters, K. The resulting K clusters are represented by the mean or average values of each of the variables. By default, K-Means only works with numeric variables (Han et al., Data Mining, 2012).

The output is displayed below (List of figures 4.2.3):

Cluster centres:

| Cluster | transactionNairaAmount | transactionDate | transactionTime | localHour |
|---|---|---|---|---|
| 1 | 0.0009903937 | 0.54419590 | 0.7084763 | 0.4720105 |
| 2 | 0.0001173569 | 0.88409986 | 0.6963829 | 0.4688121 |
| 3 | 0.0016224521 | 0.26661406 | 0.8541757 | 0.7489533 |
| 4 | 0.0011566641 | 0.19543350 | 0.4618868 | 0.1810789 |
| 5 | 0.0072892047 | 0.26520170 | 0.1615614 | 0.7907233 |
| 6 | 0.0011805694 | 0.55722499 | 0.3504528 | 0.1709038 |
| 7 | 0.0001121783 | 0.84923575 | 0.3917612 | 0.2103045 |
| 8 | 0.0001213718 | 0.09760178 | 0.7445254 | 0.5105361 |
| 9 | 0.0017597340 | 0.74819760 | 0.1186837 | 0.8762319 |
| 10 | 0.0037814491 | 0.79978916 | 0.8846654 | 0.7172179 |

Within cluster sum of squares (clusters 1 to 10):

25.422345, 17.294585, 24.974340, 19.229342, 24.111857, 26.248646, 26.344331, 52.941855, 16.483182, 9.860579

The cluster centre table above summarises the measure of association or linkage between clusters. This involves finding the mean vector location for each of the clusters and taking the distance between the centroids. First, the initial cluster centroids are randomly selected from the four variables. The first row gives the initial cluster centres; the procedure then works iteratively.

The within sum of squares table summarises the nearest neighbours between two distinct clusters based on the initial table, the cluster centre table. For instance, from the table above, it seems that cluster 3 is in the middle, because seven (7) of the clusters (1, 2, 4, 6, 8, 9 and 10) are closest to cluster 3 and not to any other cluster.

Implication: The principal purpose is to examine the cluster means in order to identify the significant explanatory transaction variables from the cluster centres. We can see from row 3 of the cluster centre table that transactionTime has the highest cluster centre value, followed by localHour, and so on. Besides, from the tables above it is now clear that cluster 3 is the nearest neighbour to cluster 10, based on the values of the best explanatory cluster variables (0.8541757 against 0.8846654) and (0.7489533 against 0.7172179). Furthermore, the graphical display of the score plot in the later analysis will validate this more explicitly.
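The nearest-neighbour comparison between cluster 3 and cluster 10 can be reproduced with the Euclidean distance of Section 3.3.4. The sketch below is a plain-Python illustration, not part of the Rattle output. It copies the transactionTime and localHour centres from the cluster centre table and restricts the distance to those two variables, as the comparison in the text does; a distance over all four variables may rank the clusters differently. The helper name `nearest_centre` is an assumption made for the example.

```python
import math

# (transactionTime, localHour) centres for clusters 1 to 10,
# copied from the iteration 1 cluster centre table.
centres = {
    1: (0.7084763, 0.4720105),
    2: (0.6963829, 0.4688121),
    3: (0.8541757, 0.7489533),
    4: (0.4618868, 0.1810789),
    5: (0.1615614, 0.7907233),
    6: (0.3504528, 0.1709038),
    7: (0.3917612, 0.2103045),
    8: (0.7445254, 0.5105361),
    9: (0.1186837, 0.8762319),
    10: (0.8846654, 0.7172179),
}

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centre(target, centres):
    """Return (cluster id, distance) of the centre closest to the target cluster."""
    others = [k for k in centres if k != target]
    best = min(others, key=lambda k: euclidean(centres[target], centres[k]))
    return best, euclidean(centres[target], centres[best])

nearest, distance = nearest_centre(3, centres)
print(nearest, round(distance, 4))
```

On these two variables the closest centre to cluster 3 is cluster 10, in line with the implication stated above.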
After the clusters have been built, the Discriminant plot is displayed as shown below:

The Discriminant coordinate figure above gives a visual representation of the cluster sizes. There are ten clusters altogether, as previously explained, which account for 53.69% of the point variability. As shown in the figure above, the cluster sizes vary, with 426 being the size of the smallest cluster and 1133 the size of the largest cluster. Refer to List of figures 4.4.2 for the remaining cluster sizes. (Vesanto, J., & Alhoniemi, E.,)

4.4 The result of Iteration 2 on the transformed dtrans_data is displayed:

Data means:

| R10transactionNairaAmount | transactionDate | RRK_transactionTime | RRK_localHour |
|---|---|---|---|
| 0.2860758 | 0.4936329 | 0.5011834 | 0.5020773 |

The Data means table: Recall that the principal purpose is to examine the cluster means in order to identify the best explanatory transaction variables from the cluster centres. This involves finding the mean vector location for each of the clusters and taking the distance between the centroids. Since the distance between two clusters is the distance between the two mean vectors of the clusters, we can see from the data means table that transactionTime and localHour have the shortest mean distance apart, with data means of 0.5011834 and 0.5020773, respectively. Generally, according to Vesanto, J., & Alhoniemi, E., at each stage we combine the two clusters that have the smallest centroid distance.

Implication: This further confirms, from the tables above, that transactionTime and localHour are the key explanatory variables.

Cluster centres:

| Cluster | R10_transactionNairaAmount | transactionDate | RRI_transactionTime | RRI_localHour |
|---|---|---|---|---|
| 1 | 0.2808516 | 0.6300578 | 0.7018665 | 0.5124956 |
| 2 | 0.2815987 | 0.6524175 | 0.2876451 | 0.1057152 |
| 3 | 0.2512386 | 0.2537390 | 0.8967377 | 0.7156922 |
| 4 | 0.2280972 | 0.1744212 | 0.3881951 | 0.1435193 |
| 5 | 0.5577502 | 0.3500956 | 0.2615866 | 0.6418214 |
| 6 | 0.3368204 | 0.5052231 | 0.5856259 | 0.2270445 |
| 7 | 0.2945089 | 0.9007382 | 0.5553950 | 0.3095079 |
| 8 | 0.2351319 | 0.1072978 | 0.7326407 | 0.4686039 |
| 9 | 0.2425915 | 0.4943533 | 0.1179586 | 0.8835740 |
| 10 | 0.2755736 | 0.8378023 | 0.8836406 | 0.6828752 |

Within cluster sum of squares (clusters 1 to 10):

30.05577, 34.28323, 31.54214, 22.94297, 32.45812, 36.23424, 38.23785, 63.03053, 44.04175, 29.80263

The cluster centre table for iteration 2 above similarly summarises the measure of association or linkage between clusters. As in the first iteration, this involves finding the mean vector location for each of the clusters and taking the distance between the centroids. First, the initial cluster centroids are randomly selected from the four variables. The first row gives the initial cluster centres; the procedure then works iteratively.

The within sum of squares table summarises the nearest neighbours between two distinct clusters based on the initial table, the cluster centre table. From the table above, it is clear that cluster 3 is in the middle, because seven (7) of the clusters (1, 2, 4, 6, 8, 9 and 10) are closest to cluster 3 and not to any other cluster, which confirms the result of iteration 1.

Implication: The principal purpose is to examine the cluster means in order to identify the significant explanatory transaction variables from the cluster centres. We can see from row 3 of the cluster centre table that transactionTime has the highest cluster centre value, followed by localHour, and so on. Besides, from the tables above it is now clear that cluster 3 is the nearest neighbour to cluster 10, based on the values of the best explanatory cluster variables (0.8967377 against 0.8836406) and (0.7156922 against 0.6828752). Similarly, the graphical display of the score plot in the later analysis will validate this more explicitly.
4.5 Iteration 2: Applying K-Mean Cluster Analysis on the transformed dtrans_data:

Prediction after manipulation and transformation of the algorithm variables:

4.6 The model has now been enhanced and explains 61.12% of the point variability.

PCA is a tool for reducing multidimensional data to lower dimensions while retaining most of the information. Since PCA is a transformation of the old coordinate system (peaks) into the new coordinate system (PCs), it can be estimated how much each of the old coordinates (peaks) contributes to each of the new ones (PCs). These values are called loadings. The higher the loading of a particular peak onto a PC, the more it contributes to that PC. (Vesanto, J., & Alhoniemi, E., 2000)

4.7 Finalizing on the Desired Variables using Principal Component Analysis: PCA

Note that principal components are calculated on the numeric variables only, so this approach cannot be used to remove categorical variables from consideration. Any numeric variables with relatively large rotation values (negative or positive) in any of the first few components are generally variables that I may wish to include in the modelling. (List of figures 4.3.2)

The next three (3) tables are best interpreted together, in view of one another. (D.L. Massart and Y.
Vander Heyden)

Standard deviations:

| PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|
| 1.4158685 | 1.0254212 | 0.9915365 | 0.9681633 |

Rotation:

| Variable | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| R10transactionNairaAmount | 0.06134073 | 0.66102682 | 0.43958358 | -0.60494127 |
| TransactionDate | -0.07433338 | 0.70342333 | -0.8119341 | 0.70217871 |
| RRK_transactionTime | -0.69456341 | 0.01979713 | -0.09915599 | -0.10961678 |
| RRK_localHour | 0.14005622 | 0.26000575 | -0.88842588 | -0.34803941 |

Importance of Components:

| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Standard Deviation | 1.4159 | 1.0254 | 0.9915 | 0.9682 |
| Proportion of Variance | 0.4009 | 0.2103 | 0.1966 | 0.1875 |
| Cumulative Proportion | 0.4009 | 0.6112 | 0.8079 | 0.9953 |

Interpretations: The loadings of the principal components are given in the Rotation table. This is a matrix in which the first column contains the loadings for the first principal component, the second column contains the loadings for the second principal component, and so on.

From the Rotation table above, the first principal component (PC1) has its highest loading (in absolute value) for transactionTime. The loadings for transactionDate and transactionTime are negative, while that of localHour is positive, in view of the transactionNairaAmount. Consequently, the implication of the first principal component is that transactionTime contributes most to PC1, which gives the direction of the highest variance. PC1 also represents a contrast between the explanatory variables (transactionDate and transactionTime against localHour, in relation to the response variable, transactionNairaAmount). The second principal component (PC2), however, has its highest loadings for transactionDate and localHour; thus, the contrast is mainly between transactionDate and localHour.

Implication: The original variables are represented in the PC1 and PC2 dimension spaces, as will be explicitly demonstrated and confirmed in the score plot of the PCs in the later analysis.
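The Cumulative Proportion row of the Importance of Components table can be reproduced from the Proportion of Variance row, and the same running total gives a simple rule for how many components to retain. The sketch below is a plain-Python illustration; the function name `components_to_retain` is an assumption for the example, and the threshold is passed as a parameter (0.70 is used here as one common choice). Because the proportions are taken from the rounded table values, the running total for PC3 comes to 0.8078 rather than the 0.8079 printed in the table.

```python
# Proportion of variance per component, copied from the Importance of Components table.
proportion = {"PC1": 0.4009, "PC2": 0.2103, "PC3": 0.1966, "PC4": 0.1875}

def components_to_retain(proportion, threshold):
    """Keep adding components, in order, until the cumulative proportion reaches the threshold."""
    cumulative = 0.0
    retained = []
    for name, share in proportion.items():
        retained.append(name)
        cumulative += share
        if cumulative >= threshold:
            break
    return retained, cumulative

retained, cumulative = components_to_retain(proportion, threshold=0.70)
print(retained, round(cumulative, 4))
```

With a 70% threshold the first three components are retained, since PC1 and PC2 together account for only about 61% of the variance.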
PC1 represents the resultant of all values projected onto the x-axis; it is dominated by transactionTime and, to a lesser extent, by localHour. In contrast, the y-axis (PC2) is defined by transactionNairaAmount and is dominated by transactionDate and, to a lesser extent, by localHour. Consequently, the transactions would be ranked according to PC1, with the highest scoring explanatory variables probably being the best, at least in terms of transactionTime and localHour.

4.8 Determine Number of Components to retain:

In practice:

H0: Retain components that account for at least 5% to 10% of the total variance. In the Importance of Components output, the row labelled Proportion of Variance shows that the PC1, PC2, PC3 and PC4 columns all have values greater than 10%, namely approximately 40%, 21%, 20% and 19% respectively.

Similarly, H0: Retain the components that together account for at least 70% of the Cumulative Proportion. In the Importance of Components output, the row labelled Cumulative Proportion shows that PC1, PC2 and PC3 together exceed 70%, at approximately 80.79%, rising to approximately 100% if PC4 is included.

4.8.1 The Loading Plot below reveals the relationship between variables in the space of the first two components. In the loading plot, we can see that transactionTime and localHour have similarly heavy loadings for PC1 and PC2, whereas the others have heavy loadings for PC3 and PC4.

The main component variables can be expressed as a linear combination of the original variables; the eigenvectors table above provides the coefficients for the equations. I will only express PC1 and PC2 as linear combinations of the original variables, because these two constitute the combinations of predictor and response variable scores that contribute the most information in the analysed data sets. (D.L. Massart and Y.
Vander Heyden)

4.8.2 Main Component Variables (PCV) as a linear combination of the original variables

$$PC1 = -0.0743X_1 + 0.0613X_2 - 0.6956X_3 + 0.1401X_4$$

$$PC2 = 0.7034X_1 + 0.6610X_2 - 0.0197X_3 + 0.2600X_4$$

PC1 = the resultant of all values projected onto the x-axis
PC2 = the resultant of all values projected onto the y-axis

$X_1$ = transactionDate, $X_2$ = transactionNairaAmount, $X_3$ = transactionTime, $X_4$ = localHour

Note that the principal component variables now represent the aggregation of the desired variables that are finally included and used in the final modelling and in the implementation of electronic transaction fraud detection techniques. The original multidimensional data sets have been reduced to lower dimensions while still retaining most of the information.

The subsequent analysis of the score plot of the explanatory variables against the response variable below will help validate the previous findings and give a better understanding of the principal component variables.

4.8.3 The score plot of the explanatory variables against the response variable:

The score plot of the explanatory variables against the response variable is displayed above, for visualising the relationship between variables in the space of the principal components. The interpretation of the axes comes from the analysis of this figure. The original variables are represented in the PC1 and PC2 dimensional spaces. PC1 can be interpreted as the resultant of all the values projected onto the x-axis. The longer the projected vector, the more important its contribution in that dimension. The origin of the new coordinate system is located at the centre of the data sets. The first PC, that is, PC1, points in the direction of the highest variance and is dominated by transactionTime.
In contrast, the y-axis (PC2) points in the direction of the second-highest variance and is defined by transactionNairaAmount, dominated by transactionDate and, to a lesser extent, by localHour, while the coordinate axes stay perpendicular. (D.L. Massart and Y. Vander Heyden)

The implication is that transactions are ranked according to PC1, whose highest-scoring explanatory variables, with coefficient magnitudes of 0.6956 for transactionTime and 0.1401 for localHour, are the best, at least in terms of transactionTime and localHour.

4.9 Model based Anomaly Detection Output:

This is the dataset overview for dtrans_data before applying IDEA (Interactive Data Exploration Analysis). We can see that transactionNairaAmount, transactionTime and localHour have been dropped in favour of the corresponding PC variables. Points can now be identified by the unique identifier, labelID, and linked by brushing across multiple plots to check for deviations from the conventional transaction behavioral model.

In view of the research objectives, I have been able to explore some of the detection techniques for unsupervised machine learning, such as K-Mean Cluster Analysis and Principal Component Analysis. The analyses above have also helped in perceiving the user transaction behaviour patterns, in identifying transactionTime and localHour as two major explanatory attributes and key factors that can be worked with in e-banking fraud detection, and in determining the threshold for identifying the relationship. Not only can we identify the direction of the slope of the relationship between the Principal Component Variables, but we can also identify the strength of the relationship, or the degree of the slope.

I shall now proceed to the final stage: computing the presence of outliers using the various distance measures, and the general detection performance based on the previous analysis.
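The component-retention rules of Section 4.8 can be sketched in R with `prcomp`. This is only an illustration under stated assumptions: the data are simulated, the column names merely mirror the study variables, and the day-index encoding of transactionDate is hypothetical, so the printed proportions will not match the values reported above.

```r
# Minimal sketch of the component-retention rules, on SIMULATED data.
# Assumption: four numeric columns named after the study variables; the values
# are random and do not reproduce the study's results.
set.seed(1)
n <- 500
sim <- data.frame(
  transactionNairaAmount = rlnorm(n, meanlog = 9, sdlog = 1),
  transactionDate        = sample(1:180, n, replace = TRUE),  # day index (hypothetical encoding)
  transactionTime        = runif(n, 0, 86400),                # seconds since midnight
  localHour              = sample(0:23, n, replace = TRUE)
)

# Centre and scale so that no variable dominates because of its units.
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and the running total.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

# Rule 1: keep components that each explain at least 10% of the variance.
keep_individual <- which(prop_var >= 0.10)
# Rule 2: keep the smallest number of components whose cumulative share reaches 70%.
keep_cumulative <- seq_len(which(cum_var >= 0.70)[1])

# Loadings (eigenvectors): coefficients of each PC as a linear combination
# of the standardised original variables.
loadings <- pca$rotation
# Scores: each transaction's coordinates on the principal components.
scores <- pca$x

print(round(rbind(prop_var, cum_var), 3))
print(round(loadings[, 1:2], 3))
```

The first two columns of `loadings` play the role of the coefficients in the PC1 and PC2 equations, and `scores[, 1]` is the quantity by which transactions would be ranked along PC1.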
Identification of Outliers:

4.9.4 3D Plot of the best explanatory variables and the Response Variable.

Outliers are observations which deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism. (Abe, N., Zadrozny, B., and Langford, J.)

An inspection of the 3D plots displayed above shows how transactionNairaAmount varies with transactionDate and transactionTime. The date-time is subdivided into six transaction periods or categories, namely April, May, June, July, August and September, since the original data set features transactions from April to September, that is, between 2015-04-02 01:44:50 and 2015-09-30 23:06:54. [Ref: 3.1.4]. The interactive graphic of these three variable components helps show how transactionNairaAmount varies over time or, better still, with respect to transactionDate and localHour.

The following user account identities deviate from the behavioral pattern based on our model, as shown above: LabelID = [641B6A70B816], [AB77E701417E], [C03089119C16], [AA39724E34AD], [973114BAAC2A], [91C33507469F].

Scatterplot of transactionAmount against transactionTime and localHour

4.9.5 An inspection of the scatterplot above, which contains outliers, shows such characteristics as large gaps between outlying and inlying observations, and deviations between the outliers and the group of inliers as measured on the suitably standardized scale based on the previous analysis.

Red Flags labelID: the user account identities that deviate from the behavioral pattern based on our model are shown in the scatterplot. The scatterplot of the principal-component-based variables is displayed above with a few of the labelIDs as identifiers. (Aleskerov, E., & Freisleben, B.)
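One common way to measure deviation on a standardized scale is the Mahalanobis distance. The sketch below is illustrative only and is not the implementation used in this study: the data are simulated, the identifiers are hypothetical, three extreme values are planted deliberately, and the 0.999 chi-square cutoff is an assumed threshold.

```r
# Illustrative sketch: flag outliers by Mahalanobis distance on SIMULATED data.
# Assumptions: hypothetical labelIDs, planted anomalies, assumed cutoff.
set.seed(42)
n <- 300
toy <- data.frame(
  labelID   = sprintf("ID%04d", seq_len(n)),   # hypothetical identifiers
  logAmount = rnorm(n, mean = 9, sd = 1),      # log of the transaction amount
  localHour = rnorm(n, mean = 13, sd = 3)      # hour of day, treated as continuous
)
# Plant three extreme transactions so that there is something to find.
planted <- c(10, 150, 290)
toy$logAmount[planted] <- c(21.4, 19.9, 19.5)

X <- as.matrix(toy[, c("logAmount", "localHour")])

# Squared Mahalanobis distance of every row from the sample centre,
# which standardizes for both scale and correlation of the variables.
toy$d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Under approximate normality d2 follows a chi-square distribution with
# ncol(X) degrees of freedom; rows beyond the assumed quantile are flagged.
cutoff  <- qchisq(0.999, df = ncol(X))
flagged <- toy$d2 > cutoff

print(toy[flagged, ])
```

Because the classical mean and covariance are themselves pulled by extreme points, a robust estimate of centre and spread would be a natural refinement when many outliers are expected.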
The plot features the response variable, transactionNairaAmount, against the explanatory variables transactionDate, transactionTime and localHour respectively, validating the labelIDs listed above: [641B6A70B816], [AB77E701417E], [C03089119C16], [AA39724E34AD], [973114BAAC2A], [91C33507469F], in conformity with the 3D plots shown previously.

CHAPTER FIVE
SUMMARY OF FINDINGS, CONCLUSION AND RECOMMENDATION

5.1.1 Summary of Findings

A comprehensive evaluation of Data Mining Techniques and Machine Learning for Unsupervised Anomaly Detection Algorithms was carried out on electronic banking transaction data sets consisting of 9 column variables and 8,641 observations. (Ref: Appendix for detail)

5.1.2 The experimental research findings and output are summarised below:

Red Flags labelID: the following user account identities deviate from the behavioral pattern based on our model. At least 6 of the 8,430 transactions in the dataset are suspected and predicted to be fraudulent. LabelID = [641B6A70B816], [AB77E701417E], [C03089119C16], [AA39724E34AD], [973114BAAC2A], [91C33507469F]. For detail, refer to Figure 5.1.2 in the List of Figures.

At least 6 of the 8,641 transactions in the dataset are suspected and predicted to be fraudulent.

Red Flags accountID Table Summary

labelID        transactionNairaAmount   transactionDate   transactionTime
AB77E701417E   71,919.00                2015/07/03        10:33:01
AA39724E34AD   2,024,999,550.00         2015/06/24        04:26:19
973114BAAC2A   449,999,550.00           2015/06/25        02:56:36
91C33507469F   287,999,550.00           2015/08/27        09:42:12
641B6A70B816   105,678.00               2015/05/16        23:45:55
C03089119C16   22,495.00                2015/09/05        22:09:41

(Ref: Appendix for detail, List of Figures, Figure 5.1.2)

5.1.3 The main objective of this study is to find the best solution (a singular or an integrated detection methodology) for controlling fraud, since it appears to be a critical problem in many organisations, including the government.
Specifically, my findings are summarised as follows. The fraud detection techniques proffered by the research work are:

(i) Pre-process the original data set to suit the requirements of the techniques.
(ii) Transform the processed data variable fields for the detection techniques.
(iii) Apply K-Mean Cluster Analysis to dtrans_data, which identifies groups within a dataset.
(iv) Reduce the multidimensional data to lower dimensions, while retaining most of the information, using PCA.
(v) Any numeric variables with relatively large rotation values (negative or positive) in any of the first few components are generally variables that I may wish to include in the modelling.
(vi) Determine the number of components to retain: the Loading Plot reveals the relationship between variables in the space of the first two components.
(vii) Express the main component variables as a linear combination of the original variables.
(viii) Highlight homogeneous groups of individuals with a Parallel Coordinate Plot (PCP).
(ix) Perform advanced Interactive Data Exploration Analysis (IDEA).
(x) The major technique used in the final analysis is unsupervised Machine Learning and predictive modeling, with a major focus on Anomaly/Outlier Detection (OD).

5.2.0 Conclusion:

This research deals with the procedure for computing the presence of outliers using various distance measures. As a general detection performance result, I can conclude that nearest-neighbor based algorithms perform better in most cases than clustering algorithms for small data sets. Also, the stability with respect to an imperfect choice of k is much higher for the nearest-neighbor based methods. The higher variance in clustering-based algorithms is very likely due to the non-deterministic nature of the underlying k-means clustering algorithm. Despite this disadvantage, clustering-based algorithms have a lower computation time.
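The trade-off described above can be illustrated with a small base-R sketch that computes both kinds of anomaly score on the same data. It is a toy example under stated assumptions, not the benchmark behind these conclusions: the data are simulated, three anomalies are planted, and the values of k, the number of centres and nstart are arbitrary choices.

```r
# Sketch on SIMULATED 2-D data: two unsupervised anomaly scores.
# Assumptions: two Gaussian groups of inliers, three planted anomalies.
set.seed(7)
inliers <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 5, sd = 0.5), ncol = 2)
)
outliers <- rbind(c(2.5, 10), c(-6, 2.5), c(11, -3))   # planted anomalies
X <- scale(rbind(inliers, outliers))                    # rows 201 to 203 are the anomalies

# (a) Nearest-neighbour score: mean distance to the k nearest neighbours.
#     Needs the full n x n distance matrix, so the cost grows as O(n^2).
k <- 5
D <- as.matrix(dist(X))
knn_score <- apply(D, 1, function(d) mean(sort(d)[2:(k + 1)]))  # skip the self-distance of 0

# (b) Clustering score: distance to the centroid of the assigned k-means cluster.
#     Cheaper per pass over the data, but the result depends on the random
#     start, so several starts (nstart) are used to reduce run-to-run variation.
km <- kmeans(X, centers = 2, nstart = 10)
km_score <- sqrt(rowSums((X - km$centers[km$cluster, ])^2))

# The three highest-scoring rows under each method.
top_knn <- order(knn_score, decreasing = TRUE)[1:3]
top_km  <- order(km_score,  decreasing = TRUE)[1:3]
print(sort(top_knn))
print(sort(top_km))
```

On well-separated toy data like this, both scores rank the planted rows highest; the differences discussed in the conclusion concern stability and computation time on larger and less clean data.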
In conclusion, I prefer nearest-neighbor based algorithms if computation time is not an issue. If faster computation is required for large datasets, such as the unlabelled dataset used for this research work, or better still in a near real-time setting, I observed that clustering-based anomaly detection is the method of choice. Besides supporting the unsupervised anomaly detection research community, I also believe that the study and its implementation are useful for researchers from neighboring fields.

5.3.0 Recommendation:

On completion of the underlying system, I can conclude that the integrated technique provides far better performance efficiency than a singular system using k-means for outlier detection. Since the main focus is on finding fraudulent data in a credit card transaction dataset, efficiency is measured by the frequency of detecting outliers or unusual behavioral user patterns. For this purpose, the techniques use a mechanism consisting of a clustering-based K-Nearest neighbor algorithm with Anomaly Detection Efficiency. Thus, the final product is a system which efficiently detects unusual behavioral patterns.

5.3.1 Suggestion for Further Studies

Future work on this system could use more attributes of the accountID information. As technology grows rapidly, hackers are finding new ways to crack security measures, so by working with more attributes we can make the system more complex. This in turn will make the system safer.

REFERENCES

Agboola, A.A. (2002): Information Technology, Bank Automation and Attitude of Workers in Nigeria Banks, Journal of Social Sciences, 5, 89-102.
Aleskerov, E., Freisleben, B., & Rao, B. (1997): “CARDWATCH: A Neural Network-Based Database Mining System for Credit Card Fraud Detection”, International Conference on Computational Intelligence for Financial Engineering, pp. 220-226.
Andreas, L., David, W., Prodromidis, L., & Salvatore, J. (1997): “Credit Card Fraud Detection Using Meta-Learning, Issues and Initial Results”, Department of Computer Science, Columbia University.
Bell, D., & La Padula, L. (1976): “Secure Computer System: Unified Exposition and Multics Interpretation”, ESD-TR-75-306 (March), Mitre Corporation.
Cai, S., & Jun, M. (2001): “The Key Determinant of Internet Banking Service Quality: A Content Analysis”, International Journal of Bank Marketing, 19(7), pp. 276-291.
Central Bank of Nigeria (2003): Report of Technical Committee on Electronic Banking, Abuja: CBN.
Central Bank of Nigeria (2003b): Guidelines on Electronic Banking, Abuja.
Christopher, G., Mike, C., & Amy, W. (2006): “A logit analysis of electronic banking in New Zealand”, International Journal of Bank Marketing, Vol. 24, No. 6, pp. 360-383.
Douglas, L., & Sushmito, G. (1994): “Credit Card Fraud Detection with a Neural Network”, Proceedings of the 27th Annual Hawaii International Conference of System Science.
Duman, E., & Sahin, Y. (2011): “Detecting Credit Card Fraud by Decision Trees and Support Vector Machine”, Proceedings of the International Multi-Conference of Engineering and Computer Statistics, Vol. 1.
Ekberg, P., et al. (2013): “Online Banking Access System: Principles behind Choices and Further Development Seen from Managerial Perspectives”, retrieved December 2013.
Geethal, V., & Malarvizhi, M.: “Acceptance of E-Banking Among Customers”, Journal of Management and Science, Vol. 2, No. 1.
Ghosh, S., & Reilly, D. (2004): “Credit Card Fraud Detection with Neural Network”, Proc. of 27th Hawaii International Conference on Systems Science, 3: 621-630.
Hamid, M.R., et al. (2007): “A Comparative Analysis of Internet Banking in Malaysia and Thailand”, Journal of Internet Business, (4), 1-19.
Hearst, M., Rachna, D., & Tygar, J. (2006): “Why Phishing Works”, in the Proceedings of Human Factors in Computing Systems.
Hutchinson, D., & Warren, M. (2001): “A Framework of Security Authentication for Internet Banking”, paper presented at the International We-B Conference (2nd), Perth.
Jain, A., Hong, L., & Pankanti, S. (2000): Biometric Identification. Association for Computing Machinery, Communications of the ACM, 43(2), 90-98.
Karim, Z., et al. (2009): “Towards Secured Information System in Online Banking”, paper presented at the International Conference for Internet Technology and Secured Transaction, London.
Leow, H.B. (1999): “New Distribution Channels in Banking Services”, Banker Journal Malaysia, (110), pp. 48-56.
Lokesh Sharma, & Raghavendra Patridar (2011): “Credit Card Fraud Detection Using Neural Network”, International Journal of Soft Computing and Engineering (IJSCE).
Maes, S., Tuyls, K., & Vanschoenwinlel, B. (2002): “Credit Card Fraud Detection Using Bayesian and Neural Networks”, Vrije University Brussel, Belgium.
Panida, S., & Sunsern, L. (2011): “A Comparative Analysis of the Security of Internet Banking in Australia: A Customer Perspective”, discussion paper delivered at the 2nd Internal Cyber-Resilience Conference, Australia.
Pavlon, P. (2001): “Integrating trust in electronic commerce with the Technology Acceptance Model”, Development and Validation, AMCIS (2001) Proceedings [Online]. Available at: http://aisel.aisnet.org/amcis2001/159. Accessed 3 August 2008.
APPENDIX A

Experimental Tools and Code for Project Implementation:

############################################################
## Load Rattle for Data Mining
############################################################
library(rattle)
rattle()
############################################################
# Read untagged transaction
############################################################
data=read.csv(file.choose(),stringsAsFactors = FALSE,header=TRUE)
attach(data)
str(data)
head(data)
# Drop the seventh column
data <- data[,-7]
############################################################
# Make Account ID as string
labelID=as.character(labelID)
stateCode=as.factor(stateCode)
############################################################
##### Format Time to 6 digits
##### Time Formatting
############################################################
library(lubridate)
transactionDate= as.character(transactionDate)
transactionTime= as.character(transactionTime)
# Zero-pad the time to six digits (HHMMSS)
transactionTime=sprintf("%06d", data$transactionTime)
#### data <- read.table(file = "file.csv", header = FALSE, sep = ";")
date_time <- paste(transactionDate, transactionTime)
date_time <- ymd_hms(date_time)
# Append the new variable date_time
data=data.frame(data,date_time)
head(data)
head(transactionTime)
str(transactionTime)
############################################################
#### Sort Data in Account-Date-Time order
############################################################
library("dplyr")
#Now, we will select seven columns from data, arrange the rows by
#the transactionDate and then arrange the rows by transactionTime.
#Finally show the details of the final data frame
sort=data %>%
  select(labelID, transactionNairaAmount, stateCode, transactionDate, transactionTime, date_time, localHour) %>%
  arrange(transactionDate, transactionTime)
tail(sort)
head(sort)
str(sort)
############################################################
# Remove duplicate rows
dtrans_data=sort[!duplicated(sort), ]
str(dtrans_data)
head(dtrans_data)
tail(dtrans_data)
glimpse(dtrans_data)
summary(dtrans_data)
############################################################
# Log-transformed and untransformed inputs for anomaly detection
Tdtrans_data_AD=data.frame(log(transactionNairaAmount),localHour,as.numeric(transactionTime))
dtrans_data_AD=data.frame(transactionNairaAmount,localHour,as.numeric(transactionTime))
############################################################
# Descriptive Statistics Visualization for Probability Model
# Suitable Variables for the Models
############################################################
# To demonstrate the need for response variable transformation
# transactionAmount is a variable from a time series multivariate dataset
plot.ts(log(transactionNairaAmount))
############################################################
# Transaction Amount
transactionNairaAmount=sapply(transactionNairaAmount, mean, na.rm=TRUE)
x=log(transactionNairaAmount)
xbar=mean(x)
S=sd(x)
graph1=hist(x,col='grey')
graph2=hist(x,col='grey',probability=TRUE,main="Histogram of transaction Naira Amount",xlab="transformed transaction Naira Amount")
# Overlay the fitted normal density
curve(dnorm(x,xbar,S),col=2,add=TRUE)
############################################################
# transactionIPaddress (bad predictor of transactionAmount)
transactionIPaddress=sapply(transactionIPaddress,mean, na.rm=TRUE)
x=log(transactionIPaddress)
xbar=mean(x)
S=sd(x)
graph1=hist(x,col='grey')
graph2=hist(x,col='grey',probability=TRUE)
curve(dnorm(x,xbar,S),col=2,add=TRUE)
############################################################
# localHour (good predictor of transactionAmount)
localHour=sapply(localHour, mean, na.rm=TRUE)
x=(localHour)
xbar=mean(x)
S=sd(x)
graph1=hist(x,col='grey')
graph2=hist(x,col='grey',probability=TRUE)
curve(dnorm(x,xbar,S),col=2,add=TRUE)
############################################################
# Normality Test
############################################################
# normal fit
# qqnorm is for univariate data; qqline must use the same variable as qqnorm
qqnorm(localHour); qqline(localHour)
# Raw and log-transformed amounts side by side
old.par <- par(mfrow=c(1, 2))
qqnorm(transactionNairaAmount); qqline(transactionNairaAmount)
qqnorm(log(transactionNairaAmount)); qqline(log(transactionNairaAmount))
# Restore the previous plotting layout
par(old.par)
# qqplot is for bivariate data
qqplot(localHour,log(transactionNairaAmount))
qqplot(log(transactionNairaAmount),localHour)

APPENDIX B

List of Figures

Figure 3.1.1 Pre-processed data structure, object tagged dtrans_data
Figure 3.5.4 data head
Figure 3.5.5 dtrans_data summary
Figure 3.5.6 dtrans_data summary
Figure 3.5.7 Frequency table of electronic banking transactions in different states in Nigeria
Figure 3.6.2 The typical work flow of a dtrans_data data set as captured by Rattle and R.
Figure 4.2.3 Cluster Analysis Output. Cluster sizes:
Figure 4.4.2 “604 531 426 625 408 512 507 571 1133 584”
Figure 5.1.1 The first and last (10) variable rows of the primary transaction dataset, with seven (7) data field variables and 8,430 observations.
Figure 5.1.2 At least 6 of the 8,430 transactions in the dataset are suspected and predicted to be fraudulent. LabelID = [641B6A70B816], [AB77E701417E], [C03089119C16], [AA39724E34AD], [973114BAAC2A], [91C33507469F]. The R programming output snapshot is displayed below for detailed reference.
Figure 5.1.2b