
International Journal of Futuristic Machine Intelligence and its Application, Vol. 1, Issue 2
Data Warehouse with Grid Computing for Anti-Money Laundering and Fraud Detection
Mrs. Sangita Nemade1, Netrali Bhandare2, Sheetal Shetty3,
Asst. Professor, Dept. of Computer Engineering, Govt. College of Engineering and Research, Pune, Maharashtra, India 1
B.E. Student, Dept. of Computer Engineering, Govt. College of Engineering and Research, Pune, Maharashtra, India 2
B.E. Student, Dept. of Computer Engineering, Govt. College of Engineering and Research, Pune, Maharashtra, India 3
Abstract
A retail bank is a commercial institution with branches across countries that provides financial services such as receiving deposits and processing transactions. In the existing setup, banks use different OLTP (transactional) systems to run their day-to-day operations, and they use this operational data for analysis and business decision making when planning for the future. Business planning and decision making include planning for expansion, planning promotions, calculating and minimizing future risk, reducing expenses and optimizing sales. Currently, banks take this historical operational data and create analytical reports in Excel. This becomes difficult when the data is huge and prone to human error, which is why an automated, scalable analytical system is needed. Source data for the proposed system arrives every week as flat files from the OLTP systems; it is loaded into a common database using an ETL (Extract, Transform and Load) process and placed in a data warehouse, where it is used by the reporting system for anti-money laundering and fraud detection. The data architecture is defined for the following banking subject areas: Customer Profile, Deposits, Loan Accounts, Interest Income, Expenses, Profitability, Asset Liability Management, Human Resources, Credit Card and ATM.
Index Terms- Banking, Data Warehousing, Online Analytical Processing, Data Mining.
I. INTRODUCTION
The banking industry is becoming increasingly dependent on information technology to retain its competitiveness and to adapt to the ever-evolving business environment. The industry, which is essentially becoming a service industry of a higher order, has to rely on technology to keep abreast of the global economy that technology has opened up. Day after day, mountains of data are produced directly as a result of banking activities and as a by-product of various transactions, and a vast amount of this information is about customers. Yet most of these data remain locked within archival systems that must be coupled with operational systems to generate the information necessary to support strategic decision making. Model-based decision support and executive information systems have always been restricted by the lack of consistent data. Nowadays, a data warehouse tries to cover this gap by providing up-to-date and decision-relevant information that allows the control of critical success factors. A data warehouse integrates large amounts of enterprise data from multiple independent data sources, consisting of operational databases, into a common repository for querying and analysis. Data warehousing therefore gains critical importance for data mining and for generating the types of analytical reports that are usually not available in the original transaction processing systems.
II. PROBLEM STATEMENT
Design a system to perform fraud detection and anti-money laundering through business intelligence on the basis of transaction amounts and uneven transaction patterns. In earlier systems, no such technology was present unless an enquiry was set up on a user account by the government. The major disadvantage of this approach was that the enquiry came up at a very late stage, resulting in financial loss of assets, and sometimes the fraud could not be tracked down at all. Tracking down the user involved in such criminal activities becomes a very challenging job, and in the current world, with crores of accounts, it becomes very difficult to work out how money laundering or fraud is carried out. Thus, considering the drawbacks of the present system, a new approach, i.e. AML and fraud detection using a data warehouse with grid computing, is proposed.
III. SYSTEM ARCHITECTURE
The system architecture consists of four blocks: the data sources, the data warehouse servers, the OLAP servers, and the reporting & data mining block. The functionality of each block is described below.
1. Operational & External Data Sources
For the implementation of a data warehouse & business intelligence system, the availability of reliable and current data sources is essential; without it, the information reported, mined and forecasted may not be fruitful. For our system, the bank provided its customers' profile and transaction databases in various formats such as .txt and .sql. These data sources are flat files and need to be converted into a multi-dimensional format for OLAP operations.
Fig. 1 System Architecture
2. Data Warehouse Servers
This block contains the staging area, the warehouse database servers and the metadata repository. There is physical data movement from the source database to the data warehouse database. The staging area is primarily designed to serve as an intermediate resting place for data before it is processed and integrated into the target data warehouse, and it serves many purposes beyond this primary function. The staged data is most consistent with the source: it is devoid of any transformation or has only minor format changes. The staging area in a relational database can be read, scanned and queried using Oracle without the need to log into the source system or read files, and it is a prime location for validating data quality from the source and for auditing and tracking down data issues. The staging area also acts as a repository for historical data if it is not truncated.
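As a concrete illustration, the weekly flat files can be landed unchanged in a staging table before any warehouse processing. The following is a minimal Python sketch of that landing step; the file name, delimiter, column names and the use of SQLite in place of the Oracle staging schema are assumptions made purely for illustration.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical weekly flat file from an OLTP system; names and layout are assumed.
weekly = pd.read_csv("transactions_week.txt", sep="|",
                     names=["account_id", "txn_date", "txn_type", "amount"])

# Land the data as-is in a staging table (no transformation, only parsing),
# so quality can be validated against the source before the warehouse load.
engine = create_engine("sqlite:///staging.db")   # stand-in for the Oracle staging schema
weekly.to_sql("stg_transactions", engine, if_exists="append", index=False)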
3. Warehouse Database Server
The next component is a warehouse database server that is
almost always a relational database system. Back-end tools
and utilities are used to feed data into the bottom tier from
operational databases or other external sources (such as
customer profile information provided by external
consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from
different sources into a unified format), as well as load and
refresh functions to update the data warehouse.
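To make the extraction, cleaning and transformation step concrete, the sketch below conforms the staged records to a unified format and loads them into a warehouse fact table. It continues the staging example above; the table names, cleaning rules and SQLite engine are illustrative assumptions rather than the bank's actual ETL tooling.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///staging.db")   # stand-in staging/warehouse database

# Extract from staging, clean, and conform to a unified format (illustrative rules).
stg = pd.read_sql("SELECT * FROM stg_transactions", engine)
stg["txn_date"] = pd.to_datetime(stg["txn_date"], errors="coerce")
stg = stg.dropna(subset=["account_id", "txn_date"])          # drop unusable rows
stg["txn_type"] = stg["txn_type"].str.upper().str.strip()    # unify codes from different sources

# Load/refresh the warehouse fact table with the conformed records.
stg.to_sql("fact_transactions", engine, if_exists="append", index=False)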
4. Reporting & Data Mining Tools
The front-end client layer in data warehousing is the presentation phase, which contains the query and reporting tools, analysis tools and data mining tools for fraud detection and anti-money laundering. The reporting tool we have used for this purpose is Oracle Business Intelligence Enterprise Edition (OBIEE) 11g. For providing the analytical results, we use some of the Online Analytical Processing (OLAP) operations such as slicing & dicing, roll-up & drill-down, and pivoting. The analytical results are provided in a multi-dimensional view using OLAP cube technology, projected to assist decision makers, for example through visualization and comparison across different dimensions such as location and time. Transactions made by fraudsters using counterfeit cards and making cardholder-not-present purchases are detected through methods that look for changes in transaction patterns, as well as checks for particular patterns known to be indicative of counterfeiting.
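The slicing, roll-up and pivoting operations mentioned above can be illustrated on a tiny fact table. The following is a minimal Python/pandas sketch of these OLAP-style operations; the branch, quarter and amount columns are hypothetical, and the sketch is not the OBIEE implementation itself.

import pandas as pd

# Hypothetical fact table of transactions (location and time are the dimensions).
fact = pd.DataFrame({
    "branch":  ["Pune", "Pune", "Mumbai", "Mumbai", "Pune"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "amount":  [1200.0, 900.0, 2500.0, 1800.0, 400.0],
})

# Slice: restrict one dimension to a single value.
q1_slice = fact[fact["quarter"] == "Q1"]

# Pivot: branches as rows, quarters as columns, total amount as the measure.
cube = pd.pivot_table(fact, values="amount", index="branch",
                      columns="quarter", aggfunc="sum", margins=True)

# Roll-up: aggregate away the quarter dimension to branch-level totals.
rollup = fact.groupby("branch")["amount"].sum()
print(q1_slice)
print(cube)
print(rollup)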
IV. ALGORITHMS USED
For the process of data mining, the proposed system uses several techniques; the algorithms used to implement these techniques are explained below, along with their mathematical modelling.
A. Classification
Classification is the most commonly used technique for predicting a specific outcome such as response/no-response, high/medium/low-value customer, or likely-to-buy/not-buy.
Generalized Linear Models (GLM)
GLM is a popular statistical technique for linear modelling.
Oracle Data Mining implements GLM for regression and for
binary classification. GLM provides extensive coefficient
statistics and model statistics, as well as row diagnostics.
GLM also supports confidence bounds. In statistics, the
generalized linear model (GLM) is a flexible generalization of
ordinary linear regression that allows for response variables
that have error distribution models other than a normal
distribution. The GLM generalizes linear regression by
allowing the linear model to be related to the response variable
via a link function and by allowing the magnitude of the
variance of each measurement to be a function of its predicted
value. In a generalized linear model, each outcome of the dependent variable, Y, is assumed to be generated from a particular distribution in the exponential family, a large class of probability distributions that includes the normal, Poisson and gamma distributions, among others. The mean, μ, of the distribution depends on the independent variables, X, through

    E(Y) = μ = g⁻¹(Xβ),

where E(Y) is the expected value of Y, Xβ is the linear predictor (a linear combination of unknown parameters β) and g is the link function. In this framework, the variance is typically a function, V, of the mean:

    Var(Y) = V(μ) = V(g⁻¹(Xβ)).
It is convenient if V follows from the exponential family
distribution, but it may simply be that the variance is a
function of the predicted value. The unknown parameters, β,
are typically estimated with maximum likelihood, maximum
quasi-likelihood or Bayesian techniques.
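The paper uses the Oracle Data Mining GLM implementation; purely as an illustration of the same idea, the sketch below fits a binomial GLM (logit link) in Python with statsmodels on synthetic transaction features. The feature names and the data-generating process are assumptions made for the example only.

import numpy as np
import statsmodels.api as sm

# Synthetic transaction-level features (hypothetical): amount and daily count.
rng = np.random.default_rng(0)
n = 200
amount = rng.gamma(2.0, 500.0, n)                    # transaction amount
count = rng.poisson(3, n)                            # transactions per day
logit = -4.0 + 0.0008 * amount + 0.3 * count
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))    # 1 = flagged as suspicious

# Binomial family with the default logit link: a GLM for binary classification.
X = sm.add_constant(np.column_stack([amount, count]))
result = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # maximum likelihood via IRLS
print(result.summary())        # coefficient statistics and model statistics
print(result.predict(X)[:5])   # predicted probabilities for the first few cases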
B. Regression
Regression is a technique for predicting a continuous numerical outcome such as customer lifetime value, house value, or process yield rates.
Support Vector Machines (SVM)
SVM is a powerful, state-of-the-art algorithm for linear and
nonlinear regression. Oracle Data Mining implements SVM
for regression, classification, and anomaly detection. SVM
regression supports two kernels: the Gaussian kernel for
nonlinear regression, and the linear kernel for linear
regression. SVM also supports active learning. In machine
learning, support vector machines (SVMs, also support vector
networks) are supervised learning models with associated
learning algorithms that analyze data and recognize patterns,
used for classification and regression analysis. Given a set of
training examples, each marked as belonging to one of two
categories, an SVM training algorithm builds a model that
assigns new examples into one category or the other, making it
a non-probabilistic binary linear classifier. An SVM model is
a representation of the examples as points in space, mapped so
that the examples of the separate categories are divided by a
clear gap that is as wide as possible. New examples are then
mapped into that same space and predicted to belong to a
category based on which side of the gap they fall on. In
addition to performing linear classification, SVMs can
efficiently perform a non-linear classification using what is
called the kernel trick, implicitly mapping their inputs into
high-dimensional feature spaces.
Given some training data D, a set of n points of the form

    D = { (x_i, y_i) | x_i ∈ R^p, y_i ∈ {−1, 1} },  i = 1, …, n,

where each y_i is either 1 or −1, indicating the class to which the point x_i belongs, and each x_i is a p-dimensional real vector. We want to find the maximum-margin hyperplane that divides the points having y_i = 1 from those having y_i = −1. Any hyperplane can be written as the set of points x satisfying

    w · x − b = 0,

where · denotes the dot product and w is the (not necessarily normalized) normal vector to the hyperplane. The parameter b/‖w‖ determines the offset of the hyperplane from the origin along the normal vector w. The samples that lie on the margins of the maximum-margin hyperplane are called the support vectors.
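Again purely as an illustration (the paper relies on the Oracle Data Mining SVM), the sketch below trains a scikit-learn SVM with a Gaussian (RBF) kernel on hypothetical labelled transaction features and scores new cases; the features and labels are assumptions made for the example.

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical training data: [amount, transactions per day], label 1 = suspicious.
X_train = np.array([[120, 2], [80, 1], [150, 3], [9500, 18], [7200, 15], [8800, 22]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Scale the features, then fit an SVM with the Gaussian (RBF) kernel.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

# Score new transactions: which side of the maximum-margin boundary do they fall on?
X_new = np.array([[100, 2], [9100, 20]])
print(clf.predict(X_new))   # expected [0 1] for this toy data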
C. Clustering
Clustering is useful for exploring data and finding natural groupings. Members of a cluster are more like each other than they are like members of a different cluster. Common examples include finding new customer segments and life sciences discovery.
k-means clustering
The k-Means algorithm is a distance-based clustering
algorithm that partitions the data into a specified number of
clusters. Distance-based algorithms rely on a distance function
to measure the similarity between cases. Cases are assigned to
the nearest cluster according to the distance function used.
Oracle Data Mining Enhanced k-Means
Oracle Data Mining implements an enhanced version of the k-Means algorithm with the following features:
Distance function — The algorithm supports Euclidean,
Cosine, and Fast Cosine distance functions. The default is
Euclidean.
Hierarchical model build —The algorithm builds a model in a
top-down hierarchical manner, using binary splits and
refinement of all nodes at the end. In this sense, the algorithm
is similar to the bisecting k-Means algorithm. The centroids of
the inner nodes in the hierarchy are updated to reflect changes
as the tree evolves. The whole tree is returned.
Tree growth — The algorithm uses a specified split criterion
to grow the tree one node at a time until a specified maximum
number of clusters is reached, or until the number of distinct
cases is reached. The split criterion may be the variance or the
cluster size. By default the split criterion is the variance.
Cluster properties — For each cluster, the algorithm returns
the centroid, a histogram for each attribute, and a rule
describing the hyperbox that encloses the majority of the data
assigned to the cluster. The centroid reports the mode for
categorical attributes and the mean and variance for numerical
attributes.
This approach to k-Means avoids the need for building
multiple k-Means models and provides clustering results that
are consistently superior to the traditional k-Means.
Centroid
The centroid represents the most typical case in a cluster. For
example, in a data set of customer ages and incomes, the
centroid of each cluster would be a customer of average age
and average income in that cluster. The centroid is a
prototype. It does not necessarily describe any given case
assigned to the cluster. The attribute values for the centroid are
the mean of the numerical attributes and the mode of the
categorical attributes.
Given a set of observations (x1, x2, …, xn), where each
observation is a d-dimensional real vector, k-means clustering
aims to partition the n observations into k (≤ n) sets S = {S1,
S2, …, Sk} so as to minimize the within-cluster sum of
squares (WCSS). In other words, its objective is to find

    arg min over S of  Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − μ_i‖²,

where μ_i is the mean of the points in S_i.
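The paper's clustering is done with the enhanced k-Means of Oracle Data Mining; as a minimal stand-in for illustration, the sketch below runs scikit-learn's (Euclidean) k-Means on hypothetical customer features and inspects the centroids and the WCSS. The feature choice and the number of clusters are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: [age, monthly transaction volume].
X = np.array([[23, 1200], [25, 1500], [47, 300], [52, 250],
              [31, 8000], [29, 9500], [45, 400], [24, 1300]])

# Standardize so both features contribute comparably to the Euclidean distance.
X_std = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X_std)     # cluster assignment for each customer
print(labels)
print(km.cluster_centers_)         # centroids in standardized feature space
print(km.inertia_)                 # within-cluster sum of squares (WCSS)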
D. Association
Association finds rules for frequently co-occurring items and is used for market basket analysis, cross-sell and root cause analysis. It is useful for product bundling, in-store placement, and defect analysis.
Apriori
Apriori is an algorithm for frequent item set mining and
association rule learning over transactional databases. It
proceeds by identifying the frequent individual items in the
database and extending them to larger and larger item sets as
long as those item sets appear sufficiently often in the
database. The frequent item sets determined by Apriori can be
used to determine association rules which highlight general
trends in the database: this has applications in domains such
as market basket analysis. Apriori is designed to operate on
databases containing transactions (for example, collections of
items bought by customers, or details of a website
frequentation). Each transaction is seen as a set of items (an
itemset). Given a threshold C, the Apriori algorithm identifies
the item sets which are subsets of at least C transactions in the
database. Apriori uses a "bottom up" approach, where frequent
subsets are extended one item at a time (a step known as
candidate generation), and groups of candidates are tested
against the data. The algorithm terminates when no further
successful extensions are found. Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k − 1 and then prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set then contains all frequent item sets of length k. After that, it scans the transaction database to determine the frequent item sets among the candidates.
The pseudo code for the algorithm is given below for a transaction database T and a support threshold of ε. The usual set-theoretic notation is employed, though note that T is a multiset. C_k is the candidate set for level k. The generate() step is assumed to produce the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. count[c] accesses a field of the data structure that represents candidate set c, which is initially assumed to be zero. Many details are omitted below; usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies.
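A minimal Python sketch of this procedure, under the assumptions just stated (T given as a list of item sets and ε as a minimum support count), is as follows; it is an illustration rather than a production implementation.

from itertools import combinations

def apriori(T, eps):
    # T: list of transactions, each a set of items; eps: minimum support count.
    counts = {}
    for t in T:                                   # count individual items (level 1)
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c >= eps}
    frequent = set(L)
    k = 2
    while L:
        # Candidate generation: join frequent (k-1)-item sets, then prune any
        # candidate with an infrequent (k-1)-subset (downward closure lemma).
        C = {a | b for a in L for b in L if len(a | b) == k}
        C = {c for c in C
             if all(frozenset(sub) in L for sub in combinations(c, k - 1))}
        # Scan the transaction database to count the support of each candidate.
        counts = {c: sum(1 for t in T if c <= t) for c in C}
        L = {c for c, n in counts.items() if n >= eps}
        frequent |= L
        k += 1
    return frequent

# Example: with eps = 2, all single items and all pairs are frequent, but {a, b, c} is not.
print(apriori([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}], 2))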
V. CONCLUSION
Intensifying competition and rising loan delinquency rates are prompting more banks to explore ways of using their data assets to gain a competitive advantage. This paper analyses how, in practice, data warehouse applications fit the various business problems of the banking sector, and demonstrates how a bank-wide enterprise data warehouse can be implemented to provide atomic-level information on all banking transactions, customers and products for use in decision-support systems. Setting up a full data warehouse at once seems more remote than setting up data marts, which can later be integrated into a bank-wide enterprise data warehouse. The integrated data store can be used to uncover a huge potential loss of revenue, which can then be averted, and will further guide how to approach pricing and service grouping well into the future.