Financial Fraud Detection with Azure and Spark ML

Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
IDEAS SoCal Conf 2018, Oct 20 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
Big Data AI Center (BigDAI / HiPIC), California State University Los Angeles

Contents
- Myself
- Introduction To Big Data
- Introduction To Big Data Predictive Analytics
- Fraud Detection Predictive Analytics
- Summary

Myself
Experience:
- Since 2002: Professor at California State University Los Angeles
  - PhD in 2001: Computer Science and Engineering at USC
- Since 1998: R&D consulting in Hollywood
  - Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
  - Information search and integration with FAST, Lucene/Solr, Sphinx
  - Implemented eBusiness applications using J2EE and middleware
- Since 2007: Exposed to Big Data at CitySearch.com
- 2012 - Present: Big Data Academic Partnerships
  - For Big Data research and training
    • Amazon AWS, Microsoft Azure, IBM Bluemix
    • Databricks, Hadoop vendors

Myself: Partners for Services

Experience in Big Data
- Collaboration
  - Big Data Technical Advisor of Isaac Engineering for Smart * (Factory, Farms, …) in Korea
  - Council Member of IBM Spark Technology Center
  - City of Los Angeles for DSF, OpenHub and Open Data
  - Startup companies in Los Angeles
  - External Collaborator and Advisor in Big Data
    • IMSC of USC
    • Pennsylvania State University
    • The Big Link, Softzen, Wiken in Korea
- Grants
  - Oracle Cloud Big Data, IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
- Partnership
  - Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata

Myself: Public Partners

Myself: S/W Development Lead
- http://www.mobygames.com/game/windows/matrix-online/credits

Data Issues
- Large-scale data
  - Terabytes (10^12 bytes), petabytes (10^15 bytes)
    • Because of the web
    • Smart *: sensor data (IoT), bioinformatics, social computing, streaming data, smart phones, online games…
- Cannot be handled with the legacy approach
  - Too big
  - Non-/semi-structured data
  - Too expensive
- Need new systems
  - Inexpensive

Two Cores in Big Data
- How to store Big Data
- How to compute Big Data
- Google
  - How to store Big Data
    • GFS
    • Distributed systems on inexpensive commodity computers
  - How to compute Big Data
    • MapReduce
    • Parallel computing with inexpensive computers
  - Own supercomputers
  - Published papers in 2003, 2004

What is Hadoop?
- Hadoop founder: Doug Cutting
- Apache committer: Lucene, Nutch, …

Super Computer vs Hadoop
[Figure with panels labeled: Cluster for Store, Cluster for Compute/Store, Cluster for Compute. Source: "Parallel vs. Distributed file systems" by Michael Malak, updated by Jongwook Woo]

Hadoop Cluster: Logical Diagram
[Diagram: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S); cluster nodes are labeled Agent, Hadoop, and HDFS, with Hive, ZooKeeper, and Impala also shown]

Hadoop Ecosystems
- http://dawn.dbsdataprojects.com/tag/hadoop/

Definition: Big Data
- Inexpensive frameworks that are distributed parallel systems, able to store large-scale data and process it in parallel [1, 2]
- Hadoop
  - Inexpensive supercomputer
  - More accessible to the public than traditional supercomputers
    • You can store and process your applications
    • In your university labs, small companies, research centers
- Others
  - NoSQL DB (Cassandra, MongoDB, Redis, HBase)
  - ElasticSearch

Alternative to Hadoop MapReduce
- Limitations of MapReduce
  - Hard to program in Java
  - Batch processing
    • Not interactive
  - Disk storage for intermediate data
    • Performance issue
- Spark by UC Berkeley AMP Lab
  - In-memory storage for intermediate data
  - 20 to 100 times faster than network- and disk-based MapReduce
  - Good for machine learning
    • Iterative algorithms

Integrating Spark and Hadoop
- Spark
  - File system: Tachyon
  - Resource manager: Mesos
  - Dedicated Spark
    • Cassandra, Couchbase…
- Integrating Spark into a Hadoop cluster
  - As Hadoop has been in the market for over 10 years
  - Cloud computing
    • Oracle Cloud Big Data Compute, Amazon AWS, Azure HDInsight, IBM Bluemix, Google Cloud Platform
    • Object Storage, S3
  - Hadoop vendors
    • HDP, CDH
  - Databricks: Spark on AWS & Azure
    • Not much of the Hadoop ecosystem

Spark
- Spark SQL
  - Querying using SQL, HiveQL
  - Data Frame
- Spark Streaming
  - DStream
    • RDD in streaming
- ML
  - Machine learning on Data Frame, pipelining
- MLlib
  - On RDD
  - Sparse vector support, decision trees, linear/logistic regression, PCA, SVM, …

Big Data Analysis and Prediction Flow
- Data Collection
  - Batch API: Yelp, Google
  - Streaming: Twitter, Apache NiFi, Kafka, StreamSets, Storm
  - Open Data: Government
- Data Storage
  - HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
- Data Filtering
  - Hive, Pig
- Data Analysis and Science
  - Hive, Pig, Spark, BI Tools (Qlik, Tableau, …)
- Data Visualization
  - Qlik, Excel PowerMap, Tableau, Looker, …
- Stages: Big Data Engineering, Big Data Analysis, Big Data Science, Data Visualization

Terms
- We know
  - Data Engineering: collect, clean, transform, filter data
  - Data Analysis: find insights from the existing data
  - Data Science (Predictive Analysis): predict the trend or pattern from the existing data
- Do we know?
  - Big Data Analysis and Science
    • Using Big Data for Data Analysis and Science: Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, ...
    • For massive data sets: how to store and compute?
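The MapReduce model named on the "Two Cores in Big Data" slide can be illustrated with a minimal single-process sketch. Word counting is used here only as a stand-in workload; it is an assumption for illustration and does not come from the slides, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. A real Hadoop or Spark job would distribute these phases across the nodes of a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(record: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on in-memory records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input records, for illustration only.
    print(word_count(["big data", "Big compute"]))
```

In classic MapReduce the intermediate pairs are written to disk between phases, which is the performance issue the "Alternative to Hadoop MapReduce" slide attributes to it; in this sketch they simply stay in memory.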
Big Data Science
- Fraud Detection: accepted to the APJIS journal, by Jongwook Woo et al., in 2018
  - Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
  - Indexed in SCOPUS
- Goal: analyzing transaction data and fraud detection
  - For mobile money transactions
    • Based on a sample of real transactions extracted from one month of financial logs from a mobile money service
  - Using Spark ML (Big Data) and Azure ML (Traditional)

Financial Data Set
- Data is always an issue
- No publicly available datasets on financial services
  - Private nature of financial transactions
- PaySim
  - URL: https://www.kaggle.com/ntnu-testimon/paysim1
  - Generates a synthetic dataset from the private dataset
  - Resembles the normal operation of transactions

Financial Data Set (Cont'd)
- Size: 470 MB (=> 718MB), 6,362,620 records
  - Not large-scale data compared to data sets larger than a GB
  - But the architecture here is applicable to much bigger data sets
    • As it still adopts the Spark computing engine in Big Data
    • Linearly scalable
- Attributes: 11
- Target column to predict: 'isFraud'

Experiment Environment: Traditional Systems and Big Data

Experiment Environment
- Azure ML: traditional, small data set
- Implement fundamental prediction models
  - Using sample data: 80MB (1/5 to 1/6 of the data set)
- Select the best model among a number of classification models

Experiment Environment (Cont'd)
- Spark ML
- Test with Databricks CE and IBM Cloud
  - 470 MB
- AWS EMR
  - Analyze all data
    • 470 MB (=> 718MB)
  - Implement and evaluate prediction models
    • 3 different models
    • Spark clusters with 3 different numbers of nodes

Hardware Specifications: Spark
- IBM DSX Lite
  - Python 2, Spark 2.1
  - File system: Object Storage
  - 2 Spark executors, 16GB memory
- Databricks
  - Python 2, Spark 2.1 (Auto-updating, Scala 2.10)
  - File system: Databricks File System
  - Single/Unlimited cluster, memory: 6GB

Experiment Environment: AWS EMR
- EMR 12.1
  - Spark 2.2.1 on Hadoop 2.8.3
  - YARN with Ganglia 3.7.2 and Zeppelin 0.7.3
- m3.xlarge instance
  - Memory: 15.0 GiB
  - CPU: 4 vCPUs
  - Storage: 80 GiB (2 * 40 GiB SSD)
- File system: S3
- 3 different EMR clusters
  - Number of nodes (servers): 3, 6, 11

PySpark on Databricks

Work Flow in Azure ML
- Relatively easy to build and test
- Drag-and-drop GUI
- Work flow
  1. Data Engineering
     - Understanding data
     - Data preparation
     - Balancing data statistically
  2. Data Science: Machine Learning (ML)
     - Model building and validation
       • Classification algorithms
     - Model evaluation
     - Model interpretation

Data Understanding
- Numeric attributes: amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest
- Categorical attributes: step, type, isFraud, isFlaggedFraud
- String attributes: nameOrig, nameDest

Experiment in Azure ML

Precision vs Recall
- Positive: event occurs (fraud)
- Negative: event does not occur (non-fraud)
- True Positive (TP): Fraud? Yes, it is
- False Negative (FN): No fraud? But it is
- False Positive (FP): Fraud? But it is not
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- Ref: https://en.wikipedia.org/wiki/Precision_and_recall

Model Evaluation
- Focus more on recall, to capture the most fraudulent transactions
- Bad recall: fatal
  - If there are many false negatives (FN)
    • The transaction is predicted as normal, not fraud, but it is a fraud
    • Painful
  - Need to decrease FN
    • That is, to increase recall

Experimental Results in AzureML
[Table: Model | Accuracy | Precision | Recall, for Two Class Logistic Regression, Two Class Decision Forest, and Two Class Decision Jungle; only the values 0.916 and 0.998 survive in this transcript]

Experimental Results
- Accuracy
  - Decision Jungle: highest recall, 0.998
    • While precision: 0.916
  - With small sample data set: 359KB
    • Takes 11 sec
- Performance: time taken to build a model with the whole data set
  - 470MB + data tweaking
  - Over a day
- Good guide for adopting the 3 similar algorithms in Spark ML
  - Decision Tree, Random Forest, Logistic Regression

Experiment with Spark ML
1. Load the data source
   - 470 MB (=> 718MB)
2. Train and build the models
   - Balanced data statistically
3. Evaluate

Define the pipeline

Train the models

Pipeline Classification with Spark ML
- Components: Feature Transformer, Classification Estimator, ParamMap Estimator, Validation Estimator, Model Transformer, Classification Evaluator
- Feature Transformer: features are generated from input columns
- Classification Estimator: the classifiers are Decision Tree, RandomForest, LogisticRegression
- ParamMap: combinations of parameters (Max Bins, Max Depth, …)
- Validation Estimator: the validators are Cross Validator and Train Validation Split

Results
[Table: Model | Area under ROC | Precision | Recall, for DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression; only the value 0.909573 survives in this transcript]
- 3 models with different combinations of the parameters
- Time taken (Spark cluster): 1 hour
- In theory, with linear scalability: 2 minutes with 30 Spark clusters
- Random Forest has the best recall score compared to Decision Tree and Logistic Regression

Experimental Results in AWS
- Execution times
  - 3 nodes: 40 min - 70 min
  - 11 nodes: 10 min - 20 min

Summary
- Introduction to Big Data
- Introduction to Big Data Predictive Analytics
- Experimental Result of Fraud Detection
  - Recall
    • RandomForest in SparkML
    • DecisionJungle in AzureML
  - Performance
    • Traditional systems: not good for large-scale data
    • Spark ML: linearly scalable, fast

Questions?

References
1. Jongwook Woo and Yuhang Xu, "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
2. Jongwook Woo, DMKD-00150, "Market Basket Analysis Algorithms with MapReduce", Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
5. Manik Katyal, Parag Chhadva, Shubhra Wahi and Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
7. (Accepted in Sept 2018) Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems
8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scale-with-apache-spark-keynote-by-ziya-ma
12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
13. Tensor Flow Deep Learning Open SAP
14. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
15. https://dzone.com/articles/sqoop-import-data-from-mysql-to-hive
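The precision and recall definitions from the "Precision vs Recall" slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used in the study: the label lists are made-up examples, and the helper names (`confusion_counts`, `precision`, `recall`) are hypothetical. Labels follow the slide's convention, with 1 meaning fraud (positive) and 0 meaning non-fraud (negative).

```python
from typing import Sequence, Tuple


def confusion_counts(actual: Sequence[int],
                     predicted: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for binary labels where 1 = fraud."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # fraud, and flagged as fraud
        elif a == 0 and p == 1:
            fp += 1          # not fraud, but flagged as fraud
        elif a == 1 and p == 0:
            fn += 1          # fraud, but predicted as normal
        else:
            tn += 1          # not fraud, and predicted as normal
    return tp, fp, fn, tn


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 when nothing was flagged."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); 0.0 when there are no actual positives."""
    positives = tp + fn
    return tp / positives if positives else 0.0


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    print("precision:", precision(tp, fp))
    print("recall:", recall(tp, fn))
```

In the made-up example every fraud is caught (no false negatives) at the cost of two false alarms, which mirrors the trade-off the "Model Evaluation" slide argues for: a missed fraud is treated as worse than a false alarm, so recall is prioritized over precision.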

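The linear-scalability claim on the "Results" and "Experimental Results in AWS" slides can be checked with simple arithmetic. The sketch below assumes ideal linear scaling (run time inversely proportional to the number of nodes), which is the slides' stated theoretical model and not a measured guarantee; the function name `ideal_minutes` is hypothetical, and treating the reported 1-hour run as a single unit of compute is also an assumption.

```python
def ideal_minutes(measured_minutes: float,
                  measured_nodes: int,
                  target_nodes: int) -> float:
    """Estimate run time on target_nodes under ideal linear scaling.

    Total work is taken as measured_minutes * measured_nodes and is
    divided evenly over target_nodes.
    """
    if measured_nodes <= 0 or target_nodes <= 0:
        raise ValueError("node counts must be positive")
    return measured_minutes * measured_nodes / target_nodes


if __name__ == "__main__":
    # 1 hour on one unit of compute, scaled out to 30 units.
    print(ideal_minutes(60, 1, 30))
    # Reported 3-node figures (40 and 70 minutes) projected to 11 nodes.
    print(round(ideal_minutes(40, 3, 11), 1))
    print(round(ideal_minutes(70, 3, 11), 1))
```

Projecting the reported 3-node figures of 40 and 70 minutes to 11 nodes gives roughly 10.9 and 19.1 minutes, which is close to the 10 to 20 minutes reported for the 11-node cluster, so the AWS figures are at least consistent with near-linear scaling.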