International Journal of futuristic Machine Intelligence and its Application, Vol 1, Issue 2

Data Warehouse with Grid Computing for Anti-Money Laundering and Fraud Detection

Mrs. Sangita Nemade 1, Netrali Bhandare 2, Sheetal Shetty 3
1 Asst. Professor, Dept. of Computer Engineering, Govt. College of Engineering and Research, Pune, Maharashtra, India
2 B.E. Student, Dept. of Computer Engineering, Govt. College of Engineering and Research, Pune, Maharashtra, India
3 B.E. Student, Dept. of Computer Engineering, Govt. College of Engineering and Research, Pune, Maharashtra, India

Abstract
A retail bank is a commercial institution with branches across countries that provides financial services such as receiving deposits of money and processing transactions. In the existing setup, banks use different OLTP (transactional) systems to run their day-to-day operations. When banks plan for the future, they use this data for analysis and business decision making, which includes planning for expansion and promotions, calculating and minimizing future risk, reducing expenses, and optimizing sales and promotions. Currently, banks use this historical operational data to create analytical reports in Excel. This becomes difficult when the data is large and prone to human error, which is why an automated, scalable analytical system is needed. The source data for this system arrives every week from the OLTP systems as flat files and is loaded into a common database using an ETL (Extract, Transform and Load) process. From there it is loaded into the data warehouse, where it is used by the reporting system for anti-money laundering and fraud detection. The data architecture is defined for the following banking subject areas: Customer Profile, Deposits, Loan Accounts, Interest Income, Expenses, Profitability, Asset Liability Management, Human Resources, Credit Card and ATM.

Index Terms: Banking, Data Warehousing, Online Analytical Processing, Data Mining.

I. INTRODUCTION
The banking industry is becoming increasingly dependent on information technology to retain its competitiveness and adapt to the ever-evolving business environment. The industry, which is essentially becoming a service industry of a higher order, has to rely on technology to keep abreast of the global economy that technology has opened up. Day after day, mountains of data are produced directly as a result of banking activities and as a by-product of various transactions. A vast amount of this information is about customers. Yet most of this data remains locked within archival systems that must be coupled with operational systems to generate the information necessary to support strategic decision-making. Model-based decision support and executive information systems have always been restricted by the lack of consistent data. The data warehouse covers this gap by providing up-to-date, decision-relevant information that allows critical success factors to be monitored. A data warehouse integrates large amounts of enterprise data from multiple, independent data sources (operational databases) into a common repository for querying and analysis. Data warehousing therefore becomes critically important for data mining and for generating analytical reports that are usually not available from the original transaction processing systems.
II. PROBLEM STATEMENT
Design a system to perform fraud detection and anti-money laundering through business intelligence on the basis of transaction amounts and uneven transaction patterns. In earlier systems no such technology was present unless an enquiry was set up on a user's account by the government. The major disadvantage of this approach was that the enquiry came much later, which could result in financial loss of assets; sometimes the fraud could not even be tracked down. Tracking down a user involved in such criminal activities becomes a very challenging job, and in today's world, with crores and crores of accounts, it becomes very difficult to work out how money laundering or fraud is carried out. Considering the drawbacks of the present system, a new approach, AML and fraud detection using a data warehouse with grid computing, is proposed.

III. SYSTEM ARCHITECTURE
The system architecture consists of four blocks: the data sources, the data warehouse servers, the OLAP servers, and the reporting and data mining block. The functionality of each block is described below.

1. Operational & External Data Sources
For the implementation of a data warehouse and business intelligence system, the availability of reliable and current data sources is essential; without them, the information reported, mined and forecast may not be useful. For our system, the bank provided its customers' profile and transaction databases in formats such as .txt and .sql. These data sources are flat files and need to be converted into a multi-dimensional format for OLAP operations.

Fig. 1 System Architecture

2. Data Warehouse Servers
This block contains the staging area, the warehouse database servers and the metadata repository. There is physical data movement from the source database to the data warehouse database. The staging area is primarily designed to serve as an intermediate resting place for data before it is processed and integrated into the target data warehouse, and it serves many purposes beyond this primary function. The data here is most consistent with the source: it is devoid of any transformation or has only minor format changes. The staging area in a relational database can be read, scanned and queried using Oracle without logging into the source system or reading files. It is a prime location for validating data quality from the source and for auditing and tracking down data issues. The staging area also acts as a repository for historical data if it is not truncated.

3. Warehouse Database Server
The next component is the warehouse database server, which is almost always a relational database system. Back-end tools and utilities are used to feed data into this tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning and transformation (e.g., merging similar data from different sources into a unified format), as well as the load and refresh functions that update the data warehouse.
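As a concrete illustration of the flat-file-to-staging-to-warehouse flow just described, the following is a minimal sketch only. The file name, delimiter, column names and table names are assumptions, and sqlite3 stands in for the Oracle warehouse purely so the sketch is self-contained and runnable.

import csv
import sqlite3

# Minimal ETL sketch: extract a weekly transactions flat file into a staging
# table, then transform and load the cleaned rows into a warehouse fact table.
# All names and formats here are illustrative assumptions, not the paper's.

def load_staging(conn, flat_file="weekly_transactions.txt"):
    """Extract: copy the raw flat file into the staging area unchanged."""
    conn.execute("""CREATE TABLE IF NOT EXISTS stg_transactions (
                        account_id TEXT, txn_date TEXT, amount TEXT)""")
    with open(flat_file, newline="") as f:
        rows = [(r["account_id"], r["txn_date"], r["amount"])
                for r in csv.DictReader(f, delimiter="|")]
    conn.executemany("INSERT INTO stg_transactions VALUES (?, ?, ?)", rows)

def transform_and_load(conn):
    """Transform and load: drop malformed rows, cast the amount, and move
    the cleaned records into the warehouse fact table."""
    conn.execute("""CREATE TABLE IF NOT EXISTS fact_transactions (
                        account_id TEXT, txn_date TEXT, amount REAL)""")
    conn.execute("""INSERT INTO fact_transactions
                    SELECT account_id, txn_date, CAST(amount AS REAL)
                    FROM stg_transactions
                    WHERE account_id IS NOT NULL AND amount != ''""")
    conn.commit()

if __name__ == "__main__":
    # Create a tiny sample flat file so the sketch runs end to end.
    with open("weekly_transactions.txt", "w") as f:
        f.write("account_id|txn_date|amount\n"
                "ACC001|2015-01-05|125000.00\n"
                "ACC002|2015-01-05|900.50\n")
    conn = sqlite3.connect("bank_dw.db")
    load_staging(conn)
    transform_and_load(conn)
    print(conn.execute("SELECT * FROM fact_transactions").fetchall())

Keeping the staging load free of transformations and deferring cleaning to the second step mirrors the role of the staging area described above: the staging table stays consistent with the source and can be audited independently of the warehouse.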
4. Reporting & Data Mining Tools
The front-end client layer in data warehousing is the presentation layer, which contains the query and reporting tools, analysis tools and data mining tools for fraud detection and anti-money laundering. The reporting tool used for this purpose is Oracle Business Intelligence Enterprise Edition (OBIEE) 11g. For the analytical results, Online Analytical Processing (OLAP) operations such as slicing and dicing, roll-up and drill-down, and pivoting are used. The results are presented in a multidimensional view using OLAP cube technology, intended to assist decision makers with visualizations that compare different dimensions such as location and time. Transactions made by fraudsters using counterfeit cards, and cardholder-not-present purchases, are detected by methods that look for changes in transaction patterns as well as for particular patterns known to be indicative of counterfeiting.

IV. ALGORITHMS USED
For data mining, the proposed system uses several techniques; the algorithms that implement these techniques are explained below along with their mathematical models.

A. Classification
Classification is the most commonly used technique for predicting a specific outcome, such as response/no-response, high/medium/low-value customer, or likely to buy/not buy.

Generalized Linear Models (GLM)
GLM is a popular statistical technique for linear modelling. Oracle Data Mining implements GLM for regression and for binary classification. GLM provides extensive coefficient statistics and model statistics, as well as row diagnostics, and also supports confidence bounds. In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows response variables to have error distributions other than the normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function, and by allowing the variance of each measurement to be a function of its predicted value. In a GLM, each outcome of the dependent variable Y is assumed to be generated from a particular distribution in the exponential family, a large class of probability distributions that includes the normal, Poisson and gamma distributions, among others. The mean, μ, of the distribution depends on the independent variables, X, through

E(Y) = \mu = g^{-1}(X\beta),

where E(Y) is the expected value of Y, Xβ is the linear predictor (a linear combination of the unknown parameters β), and g is the link function. In this framework, the variance is typically a function V of the mean,

\operatorname{Var}(Y) = V(\mu) = V(g^{-1}(X\beta)).

It is convenient if V follows from an exponential family distribution, but it may simply be that the variance is a function of the predicted value. The unknown parameters β are typically estimated with maximum likelihood, maximum quasi-likelihood or Bayesian techniques.
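As an illustration of binary classification with a GLM, the sketch below fits a logistic regression, i.e. a GLM with a logit link, so that E(Y) = g^{-1}(Xβ). The paper's system uses Oracle Data Mining's GLM implementation; scikit-learn, the synthetic transaction features and the synthetic "suspicious" label used here are assumptions made only so the example is self-contained.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Binary-classification GLM sketch with a logit link on synthetic features
# (transaction amount in thousands, transaction count over 30 days).
# Features and labels are illustrative assumptions, not bank data.

rng = np.random.default_rng(0)
n = 1000
amount_k = rng.lognormal(mean=1.5, sigma=1.0, size=n)   # amount, in thousands
txn_count = rng.integers(0, 50, size=n)                 # transactions in last 30 days
X = np.column_stack([amount_k, txn_count])

# Synthetic label driven by large amounts and unusual transaction frequency.
logits = 0.3 * amount_k + 0.05 * txn_count - 3.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
print("estimated beta:", model.coef_[0], "intercept:", model.intercept_[0])
print("P(suspicious) for amount=250k, 40 txns:",
      model.predict_proba([[250.0, 40.0]])[0, 1])

The fitted coefficients play the role of β and the predicted probability is the inverse-link transform of the linear predictor, which is the quantity a scoring report would threshold to flag transactions.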
B. Regression
Regression is a technique for predicting a continuous numerical outcome, such as customer lifetime value, house value or process yield rates.

Support Vector Machines (SVM)
SVM is a powerful, state-of-the-art algorithm for linear and nonlinear regression. Oracle Data Mining implements SVM for regression, classification and anomaly detection. SVM regression supports two kernels: the Gaussian kernel for nonlinear regression and the linear kernel for linear regression. SVM also supports active learning. In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns; they are used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

Given some training data D, a set of n points of the form

D = \{ (x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\} \}_{i=1}^{n},

where each y_i is either 1 or −1, indicating the class to which the point x_i belongs, and each x_i is a p-dimensional real vector, we want to find the maximum-margin hyperplane that divides the points having y_i = 1 from those having y_i = −1. Any hyperplane can be written as the set of points x satisfying

w \cdot x - b = 0,

where \cdot denotes the dot product and w is the (not necessarily normalized) normal vector to the hyperplane. The parameter b/\|w\| determines the offset of the hyperplane from the origin along the normal vector. Samples that lie on the margin of the maximum-margin hyperplane are called the support vectors.

C. Clustering
Clustering is useful for exploring data and finding natural groupings. Members of a cluster are more like each other than they are like members of a different cluster. Common examples include finding new customer segments and life-science discovery.

k-Means Clustering
The k-Means algorithm is a distance-based clustering algorithm that partitions the data into a specified number of clusters. Distance-based algorithms rely on a distance function to measure the similarity between cases; cases are assigned to the nearest cluster according to the distance function used.

Oracle Data Mining Enhanced k-Means
Oracle Data Mining implements an enhanced version of the k-Means algorithm with the following features:
Distance function — The algorithm supports Euclidean, Cosine and Fast Cosine distance functions. The default is Euclidean.
Hierarchical model build — The algorithm builds a model in a top-down hierarchical manner, using binary splits and refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-Means algorithm. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree evolves, and the whole tree is returned.
Tree growth — The algorithm uses a specified split criterion to grow the tree one node at a time until a specified maximum number of clusters is reached, or until the number of distinct cases is reached. The split criterion may be the variance or the cluster size; by default it is the variance.
Cluster properties — For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes and the mean and variance for numerical attributes.
This approach to k-Means avoids the need to build multiple k-Means models and provides clustering results that are consistently superior to traditional k-Means.

Centroid
The centroid represents the most typical case in a cluster. For example, in a data set of customer ages and incomes, the centroid of each cluster would be a customer of average age and average income in that cluster. The centroid is a prototype; it does not necessarily describe any given case assigned to the cluster. The attribute values for the centroid are the means of the numerical attributes and the modes of the categorical attributes.

Given a set of observations (x_1, x_2, …, x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S_1, S_2, …, S_k} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2,

where μ_i is the mean of the points in S_i.
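As an illustration of the clustering objective above, the sketch below runs k-Means on synthetic customer age and income records. The paper relies on Oracle Data Mining's enhanced k-Means; scikit-learn's standard k-Means and the synthetic customer segments are illustrative assumptions only.

import numpy as np
from sklearn.cluster import KMeans

# k-Means sketch: partition synthetic (age, income) records into k = 3
# clusters by minimising the within-cluster sum of squares defined above.
# The data and cluster count are illustrative assumptions.

rng = np.random.default_rng(1)
# Three loose segments: young/low income, mid-age/high income, older/medium income.
ages = np.concatenate([rng.normal(25, 4, 200),
                       rng.normal(45, 4, 200),
                       rng.normal(65, 4, 200)])
incomes = np.concatenate([rng.normal(30_000, 4_000, 200),
                          rng.normal(80_000, 4_000, 200),
                          rng.normal(50_000, 4_000, 200)])
X = np.column_stack([ages, incomes])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids (mean age, mean income per cluster):")
print(km.cluster_centers_)
print("within-cluster sum of squares (inertia):", km.inertia_)

The reported cluster centers are the centroids described above (per-cluster means of the numerical attributes), and the inertia value is exactly the WCSS objective being minimised.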
D. Association
Association finds rules for frequently co-occurring items and is used for market basket analysis, cross-sell and root cause analysis. It is useful for product bundling, in-store placement and defect analysis.

Apriori
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets, as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to derive association rules that highlight general trends in the database; this has applications in domains such as market basket analysis. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or the details of website visits). Each transaction is seen as a set of items (an itemset). Given a threshold C, the Apriori algorithm identifies the item sets that are subsets of at least C transactions in the database. Apriori uses a "bottom-up" approach, in which frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k−1 and then prunes the candidates that have an infrequent sub-pattern. By the downward closure lemma, the candidate set contains all frequent length-k item sets. After that, it scans the transaction database to determine the frequent item sets among the candidates. The pseudo code for the algorithm is sketched below for a transaction database T and a support threshold of ε. The usual set-theoretic notation is employed, though note that T is a multiset. C_k is the candidate set for level k. The Generate() function is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. Count[c] accesses a field of the data structure representing candidate set c, which is initially assumed to be zero. Many details are omitted; usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies.
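The authors' original pseudo-code listing did not survive in the source; the sketch below is a reconstruction of the standard level-wise Apriori procedure the paragraph describes, not the authors' listing. Here T is a list of transactions, eps stands in for the support threshold ε, and generate_candidates plays the role of the Generate() routine; these names are assumptions.

from itertools import combinations

# Level-wise Apriori sketch: extend frequent (k-1)-itemsets to candidate
# k-itemsets, prune by the downward closure lemma, then scan T to count
# support. A plain dict of counts stands in for Count[c] and the hash tree.

def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets; keep a k-itemset only if every
    (k-1)-subset is itself frequent (downward closure)."""
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k and all(
                    frozenset(s) in frequent_prev
                    for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

def apriori(T, eps):
    T = [frozenset(t) for t in T]
    items = {frozenset([i]) for t in T for i in t}
    frequent = {c for c in items if sum(c <= t for t in T) >= eps}
    result, k = set(frequent), 2
    while frequent:
        candidates = generate_candidates(frequent, k)
        # One scan of the transaction database per level to count support.
        counts = {c: sum(c <= t for t in T) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= eps}
        result |= frequent
        k += 1
    return result

# Example: items co-occurring across five transactions.
T = [{"atm", "transfer"}, {"atm", "transfer", "cash"},
     {"transfer", "cash"}, {"atm", "transfer"}, {"atm"}]
print(apriori(T, eps=2))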
V. CONCLUSION
Intensifying competition and rising loan delinquency rates are leading more banks to explore ways to use their data assets to gain a competitive advantage. This paper analyses how, in practice, data warehouse applications fit various business problems in the banking sector, and demonstrates how a bank-wide enterprise data warehouse can be implemented to provide atomic-level information on all banking transactions, customers and products for use in decision-support systems. Setting up a full data warehouse seems more remote than setting up data marts, which can later be integrated into a bank-wide enterprise data warehouse. The integrated data store can be used to uncover potential revenue losses, which can then be averted, and to guide pricing and service grouping well into the future.