Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
IDEAS SoCal Conf 2018
Oct 20 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
Big Data AI Center (BigDAI / HiPIC)
California State University Los Angeles
Jongwook Woo
CalStateLA
Contents
– Myself
– Introduction to Big Data
– Introduction to Big Data Predictive Analytics
– Fraud Detection Predictive Analytics
– Summary
Myself
Experience:
Since 2002: Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix Online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
Myself: Partners for Services
Experience in Big Data
Collaboration
– Big Data Technical Advisor of Isaac Engineering for Smart * (Factory, Farms, …) in Korea
– Council Member of IBM Spark Technology Center
– City of Los Angeles for DSF, OpenHub and Open Data
– Startup companies in Los Angeles
– External Collaborator and Advisor in Big Data
• IMSC of USC
• Pennsylvania State University
• The Big Link, Softzen, Wiken in Korea
Grants
– Research and Education Grants: Oracle Cloud Big Data, IBM Bluemix, Microsoft Windows Azure, Amazon AWS
Partnership
– Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
Myself: Public Partners
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
Data Issues
Large-scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Smart *: sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Non-/semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Built its own supercomputers
Published papers in 2003 (GFS) and 2004 (MapReduce)
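To make the MapReduce idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases for a word count. It is only an illustration in plain Python: the function names and the sample lines are hypothetical, and a real Hadoop job would distribute these phases across many commodity machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input, for illustration only.
sample_lines = ["big data big compute", "store big data"]
counts = word_count(sample_lines)
```

Because each line is mapped independently and each key is reduced independently, both phases can run in parallel on separate machines, which is the property the slide refers to.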
What is Hadoop?
Hadoop Founder:
– Doug Cutting
Apache Committer:
– Lucene, Nutch, …
Super Computer vs Hadoop
Cluster for Store
Cluster for Compute/Store
Cluster for Compute
Parallel vs. Distributed file systems by Michael Malak
Updated by Jongwook Woo
Hadoop Cluster: Logical Diagram
[Diagram: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S); cluster nodes are labeled Hadoop, HDFS, and Agent, with Hive, ZooKeeper, and Impala on some nodes]
Hadoop Ecosystems
http://dawn.dbsdataprojects.com/tag/hadoop/
Definition: Big Data
Inexpensive frameworks, built as distributed parallel systems, that can store large-scale data and process it in parallel [1, 2]
Hadoop
– Inexpensive supercomputer
– More accessible than traditional supercomputers
• You can store your data and run your applications
– In university labs, small companies, research centers
Others
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch
Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
– In-memory storage for intermediate data
– 20 to 100 times faster than MapReduce, which relies on network and disk
Good for machine learning
– Iterative algorithms
Integrating Spark and Hadoop
Spark
– File system: Tachyon
– Resource manager: Mesos
– Dedicated Spark: Cassandra, Couchbase…
Integrating Spark into a Hadoop cluster
– Hadoop has been in the market for over 10 years
– Cloud computing: Oracle Cloud Big Data Compute, Amazon AWS, Azure HDInsight, IBM Bluemix, Google Cloud Platform
• Object Storage, S3
– Hadoop vendors: HDP, CDH
– Databricks: Spark on AWS & Azure
• Not much of the Hadoop ecosystem
Spark
Spark SQL
– Querying using SQL, HiveQL
– DataFrame
Spark Streaming
– DStream
• RDD in streaming
ML
– Machine learning on DataFrame, pipelining
MLlib
– On RDD
– Sparse vector support, decision trees, linear/logistic regression, PCA, SVM, …
Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, StreamSets, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Qlik, Tableau, …)
Data Visualization
– Qlik, Excel PowerMap, Tableau, Looker, …
Roles across the flow: Big Data Engineering, Big Data Analysis, Big Data Science, Data Visualization
Terms
We know
Data Engineering
– Collect, clean, transform, filter data
Data Analysis
– Find insights from the existing data
Data Science (Predictive Analysis)
– Predict the trend or pattern from the existing data
Do we know?
Big Data Analysis and Science
– Using Big Data for Data Analysis and Science
• Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, …
– For Massive Data Set
• How to store and compute?
Big Data Science
Fraud Detection:
Accepted by the APJIS journal in 2018, by Jongwook Woo et al.
– Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
– Indexed in SCOPUS
Goal
Analyzing transaction data and detecting fraud
– For mobile money transactions
• Based on a sample of real transactions
– Extracted from one month of financial logs from a mobile money service
– Using Spark ML (Big Data) and Azure ML (Traditional)
Financial Data Set
Data is always an issue
– No publicly available datasets on financial services
• Private nature of financial transactions
PaySim
– URL: https://www.kaggle.com/ntnu-testimon/paysim1
– Generates a synthetic dataset
• From the private dataset
– That resembles the normal operation of transactions
Financial Data Set (Cont'd)
Size: 470 MB (=> 718 MB)
6,362,620 records
Not large-scale data compared to data sets larger than a GB
But the architecture here is applicable to much bigger data sets
– As it still uses the Spark computing engine for Big Data
– Linearly scalable
Attributes: 11
Target column to predict: 'isFraud'
Experiment Environment: Traditional Systems and Big Data
Experiment Environment
Azure ML:
Traditional small data set
Implement fundamental prediction models
– Using sample data: 80 MB (1/5 – 1/6 of the data set)
Select the best model among a number of classification algorithms
Experiment Environment (Cont'd)
Spark ML
Test with Databricks CE and IBM Cloud
– 470 MB
AWS EMR
– Analyze all data
• 470 MB (=> 718 MB)
– Implement and evaluate prediction models
• 3 different models
• Spark clusters with 3 different numbers of nodes
Hardware Specifications: Spark
IBM DSX Lite
Python 2, Spark 2.1
File System: Object Storage
2 Spark Executors, 16GB Memory
Databricks
Python 2, Spark 2.1 (Auto-updating, Scala 2.10)
File System: Databricks File System
Single/Unlimited Cluster, Memory: 6 GB
Experiment Environment
AWS EMR
EMR 12.1
– Spark 2.2.1 on Hadoop 2.8.3
– YARN with Ganglia 3.7.2 and Zeppelin 0.7.3
m3.xlarge instance
– Memory: 15.0 GiB
– CPU: 4 vCPUs
– Storage: 80 GiB (2 * 40 GiB SSD)
File System: S3
3 different EMR clusters
– Number of nodes (servers): 3, 6, 11
PySpark on Databricks
Work Flow in Azure ML
Relatively easy to build and test
– Drag-and-drop GUI
Work Flow
1. Data Engineering
– Understanding data
– Data preparation
– Balancing data statistically
2. Data Science: Machine Learning (ML)
– Model building and validation
• Classification algorithms
– Model evaluation
– Model interpretation
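The slides do not say how the data was balanced. One common option, assumed here purely for illustration, is random undersampling of the majority (non-fraud) class. The sketch below uses plain Python dictionaries; the `undersample` helper, the seed, and the toy records are all hypothetical.

```python
import random


def undersample(records, label_key="isFraud", seed=42):
    """Randomly drop majority-class rows until both classes have equal size."""
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    # The shorter list is the minority class, the longer one the majority.
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 5 fraud rows among 100 transactions.
toy = [{"amount": float(i), "isFraud": 1 if i < 5 else 0} for i in range(100)]
balanced = undersample(toy)
fraud_count = sum(r["isFraud"] for r in balanced)
```

Undersampling discards data, so on a real data set it may be worth comparing it against oversampling or class weighting before settling on one approach.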
Data Understanding
• Numeric attributes: amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest
• Categorical attributes: step, type, isFraud, isFlaggedFraud
• String attributes: nameOrig, nameDest
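As a rough illustration of how these attribute groups could feed a classifier, the sketch below turns one record into a numeric feature vector: the categorical `type` column is one-hot encoded and the numeric columns are passed through. The column names follow the public PaySim CSV (which spells the column `newbalanceOrig`); the list of `type` values and the example record are assumptions, not taken from the slides. The string attributes are left out because this sketch has no use for identifiers.

```python
NUMERIC_COLUMNS = ["amount", "oldbalanceOrg", "newbalanceOrig",
                   "oldbalanceDest", "newbalanceDest"]
# Assumed category values for the `type` column (not listed on the slide).
TRANSACTION_TYPES = ["CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER"]


def to_features(row):
    """Return (feature_vector, label) for one transaction record."""
    one_hot = [1.0 if row["type"] == t else 0.0 for t in TRANSACTION_TYPES]
    numeric = [float(row[c]) for c in NUMERIC_COLUMNS]
    return one_hot + numeric, int(row["isFraud"])


# Hypothetical record, for illustration only.
example = {"step": 1, "type": "TRANSFER", "amount": 181.0,
           "nameOrig": "C1", "oldbalanceOrg": 181.0, "newbalanceOrig": 0.0,
           "nameDest": "C2", "oldbalanceDest": 0.0, "newbalanceDest": 0.0,
           "isFraud": 1, "isFlaggedFraud": 0}
features, label = to_features(example)
```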
Experiment in Azure ML
Precision vs Recall
Positive: event occurs (fraud)
Negative: event does not occur (non-fraud)
True Positive (TP): predicted fraud, and it is fraud
False Negative (FN): predicted non-fraud, but it is fraud
False Positive (FP): predicted fraud, but it is not
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Ref: https://en.wikipedia.org/wiki/Precision_and_recall
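The two formulas can be checked with a few lines of Python. This is a minimal sketch; the function name and the counts are hypothetical and are not results from the experiment.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only:
# 90 frauds caught, 10 false alarms, 30 frauds missed.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
```

With these counts, precision is 90 / 100 and recall is 90 / 120, so the missed frauds (FN) lower recall while the false alarms (FP) lower precision.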
Model Evaluation
Focus on Recall
– To capture as many fraudulent transactions as possible
Bad Recall: fatal
– If there are many false negatives (FN)
• The model predicts the transaction as normal, not fraud
– But it is a fraud
– Painful
• Need to decrease FN
– That is, to increase Recall
Experimental Results in AzureML
Model                          | Accuracy | Precision | Recall
Two Class Logistic Regression  |          |           |
Two Class Decision Forest      |          |           |
Two Class Decision Jungle      |          | 0.916     | 0.998
Experimental Results
Accuracy
Decision Jungle
– Highest Recall: 0.998
• With Precision: 0.916
– With a small sample data set (359 KB)
• Takes 11 sec
Performance:
Time taken to build a model with the whole data set:
– 470 MB + data tweaking
– Over a day
Good guide
– To adopt the 3 similar algorithms for Spark ML
• Decision Tree, Random Forest, Logistic Regression
Experiment with Spark ML
1. Load the data source
– 470 MB (=> 718 MB)
2. Train and build the models
– Balance the data statistically
3. Evaluate
Define the pipeline
Train the models
Pipeline Classification with Spark ML
[Diagram components: Feature Transformer, Classification Estimator, ParamMap Estimator, Classification Evaluator, Validation Estimator, Model Transformer]
Feature Transformer
– The feature is generated from input columns
Classification Estimator
– Classifiers: Decision Tree, RandomForest, LogisticRegression
ParamMap Estimator
– Combination of parameters: Max Bins, Max Depth, …
Validation Estimator
– Validators: Cross Validator, Train Validation Split
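The parameter-map and validator stages can be sketched without Spark. The code below is a plain-Python illustration of the idea, not the authors' pipeline: it expands a grid of parameter combinations and picks the one with the best mean score over k folds. In PySpark the corresponding classes are `ParamGridBuilder` and `CrossValidator` in `pyspark.ml.tuning`. The helper names, the toy threshold "model", and the data are all hypothetical.

```python
from itertools import product


def build_param_grid(options):
    """Expand {name: [values]} into a list of parameter maps."""
    names = sorted(options)
    return [dict(zip(names, combo))
            for combo in product(*(options[n] for n in names))]


def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) for each of k folds."""
    for fold in range(k):
        validation = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, validation


def cross_validate(fit, score, rows, param_grid, k=3):
    """Return the parameter map with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        fold_scores = []
        for train_idx, val_idx in k_fold_indices(len(rows), k):
            model = fit([rows[i] for i in train_idx], params)
            fold_scores.append(score(model, [rows[i] for i in val_idx]))
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Parameter names mirror the slide (Max Bins, Max Depth); values are made up.
grid = build_param_grid({"maxDepth": [2, 5, 10], "maxBins": [16, 32]})

# Toy stand-ins: rows are (amount, label); the "model" is just a threshold.
rows = [(amount, 1 if amount > 50 else 0) for amount in range(0, 100, 5)]


def fit(train_rows, params):
    return params["threshold"]


def score(threshold, val_rows):
    """Recall on the validation fold."""
    tp = sum(1 for a, y in val_rows if y == 1 and a > threshold)
    fn = sum(1 for a, y in val_rows if y == 1 and a <= threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


threshold_grid = build_param_grid({"threshold": [80, 50, 20]})
best_params, best_recall = cross_validate(fit, score, rows, threshold_grid)
```

Ties are resolved in favor of the first parameter map tried, which is a simplification of this sketch.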
Results
Model (Area under ROC, Precision, Recall)
– DecisionTreeClassifier
– RandomForestClassifier
– LogisticRegression
0.909573
• 3 models with different combinations of the parameters
• Time taken (Spark cluster): 1 hour
• In theory, with linear scalability: 2 minutes with 30 Spark clusters
• Random Forest has the best recall score compared to Decision Tree and Logistic Regression
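The 2-minute figure follows from dividing the measured time by the scale factor. A minimal sketch of that arithmetic, assuming ideal linear scalability (real clusters usually fall somewhat short of it); the function name is hypothetical:

```python
def ideal_time_minutes(base_minutes, scale_factor):
    """Run time under ideal linear scalability: time shrinks by the scale factor."""
    return base_minutes / scale_factor


# 1 hour measured, scaled out by a factor of 30 as stated on the slide.
estimate = ideal_time_minutes(base_minutes=60, scale_factor=30)
```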
Experimental Results in AWS
Execution times
3 nodes:
– 40 – 70 min
11 nodes:
– 10 – 20 min
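One way to read these timings is to compare the observed speedup with the increase in node count. The sketch below pairs the lower bounds of the two ranges and the upper bounds of the two ranges; that pairing is an assumption, since the slide does not say which runs correspond to each other.

```python
def speedup(base_minutes, scaled_minutes):
    """Observed speedup: how many times faster the larger cluster ran."""
    return base_minutes / scaled_minutes


node_ratio = 11 / 3              # the cluster grew from 3 to 11 nodes

fast_case = speedup(40, 10)      # lower bounds of the two ranges
slow_case = speedup(70, 20)      # upper bounds of the two ranges

# Parallel efficiency: observed speedup relative to the ideal linear speedup.
efficiency_fast = fast_case / node_ratio
efficiency_slow = slow_case / node_ratio
```

Under this pairing the speedup is between 3.5x and 4x for roughly 3.7x the nodes, which is consistent with the near-linear scalability claimed in the summary.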
Summary
Introduction to Big Data
Introduction to Big Data Predictive Analytics
Experimental Result of Fraud Detection
Recall:
– RandomForest in SparkML
– DecisionJungle in AzureML
Performance:
– Traditional Systems:
• not good for large scale data
– Spark ML:
• Linearly Scalable
• Fast
Questions?
References
1. Jongwook Woo and Yuhang Xu, "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
2. Jongwook Woo, "Market Basket Analysis Algorithms with MapReduce", DMKD-00150, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi and Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
7. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat and Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (accepted in Sept 2018)
8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframesgoogle-tensorflow-on-apache-spark
9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apachespark
10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-onspark
11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewithapache-spark-keynote-by-ziya-ma
12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
13. Tensor Flow Deep Learning Open SAP
14. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factorysolutions-68137094/6
15. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive