Predictive Analysis of Financial Fraud Detection using Azure and Spark ML

IDEAS SoCal Conf 2018, Oct 20 2018
Jongwook Woo, PhD, jwoo5@calstatela.edu
Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
Big Data AI Center (BigDAI / HiPIC)
California State University Los Angeles

Big Data AI Center (BigDAI / HiPIC)   Jongwook Woo   CalStateLA

Contents
  Myself
  Introduction To Big Data
  Introduction To Big Data Predictive Analytics
  Fraud Detection Predictive Analytics
  Summary

Myself
  Experience:
  Since 2002: Professor at California State University Los Angeles
    – PhD in 2001: Computer Science and Engineering at USC
  Since 1998: R&D consulting in Hollywood
    – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
    – Information search and integration with FAST, Lucene/Solr, Sphinx
    – Implemented eBusiness applications using J2EE and middleware
  Since 2007: Exposed to Big Data at CitySearch.com
  2012 - Present: Big Data Academic Partnerships
    – For Big Data research and training
      • Amazon AWS, Microsoft Azure, IBM Bluemix
      • Databricks, Hadoop vendors

Myself: Partners for Services

Experience in Big Data
  Collaboration
    Big Data Technical Advisor of Isaac Engineering for Smart * (Factory, Farms, …) in Korea
    Council Member of IBM Spark Technology Center
    City of Los Angeles for DSF, OpenHub and Open Data
    Startup companies in Los Angeles
    External Collaborator and Advisor in Big Data
      – IMSC of USC
      – Pennsylvania State University
      – The Big Link, Softzen, Wiken in Korea
  Grants
    Oracle Cloud Big Data, IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
  Partnership
    Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata

Myself: Public Partners
Myself: S/W Development Lead
  http://www.mobygames.com/game/windows/matrix-online/credits

Data Issues
  Large-scale data
    Terabyte (10^12 bytes), Petabyte (10^15 bytes)
      – Because of the web
      – Smart *: sensor data (IoT), bioinformatics, social computing, streaming data, smart phone, online game…
  Cannot be handled with the legacy approach
    Too big
    Non-/semi-structured data
    Too expensive
  Need new systems
    Non-expensive

Two Cores in Big Data
  How to store Big Data
  How to compute Big Data
  Google
    How to store Big Data
      – GFS
      – Distributed systems on non-expensive commodity computers
    How to compute Big Data
      – MapReduce
      – Parallel computing with non-expensive computers
    Own super computers
    Published papers in 2003, 2004

What is Hadoop?
  Hadoop founder:
    o Doug Cutting
  Apache committer: Lucene, Nutch, …

Super Computer vs Hadoop
  [Figure labels: Cluster for Store, Cluster for Compute/Store, Cluster for Compute]
  Source: "Parallel vs. Distributed file systems" by Michael Malak, updated by Jongwook Woo

Hadoop Cluster: Logical Diagram
  [Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); agents run on each node alongside Hadoop HDFS, Hive, ZooKeeper, and Impala]
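A minimal single-process sketch of the MapReduce model named in the Two Cores in Big Data slide, assuming a word-count task. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are hypothetical illustrations, not Hadoop APIs. On a real cluster the map and reduce calls would run in parallel on different nodes, and the shuffle would move data across the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on a list of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: each string stands in for one block of a distributed file.
records = ["fraud alert fraud", "normal transfer", "fraud transfer"]
counts = word_count(records)
print(counts)
```

Because each map call sees only its own record and each reduce call sees only one key, both phases can be spread over many inexpensive machines, which is the property the slide attributes to MapReduce.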
Hadoop Ecosystems
  http://dawn.dbsdataprojects.com/tag/hadoop/

Definition: Big Data
  Non-expensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]
  Hadoop
    – Non-expensive super computer
    – More public than the traditional super computers
      • You can store and process your applications
        – In your university labs, small companies, research centers
  Others
    – NoSQL DB (Cassandra, MongoDB, Redis, HBase)
    – ElasticSearch

Alternate of Hadoop MapReduce
  Limitation in MapReduce
    Hard to program in Java
    Batch processing
      – Not interactive
    Disk storage for intermediate data
      – Performance issue
  Spark by UC Berkeley AMP Lab
    In-memory storage for intermediate data
      – 20 ~ 100 times faster than MapReduce, which goes through N/W and disk
    Good in Machine Learning
      – Iterative algorithms

Integrating Spark and Hadoop
  Spark
    File systems: Tachyon
    Resource manager: Mesos
  Dedicated Spark
    – Cassandra, Couchbase…
  Integrating Spark into a Hadoop cluster
    As Hadoop has been in the market for over 10 years
    Cloud computing
      – Oracle Cloud Big Data Compute, Amazon AWS, Azure HDInsight, IBM Bluemix, Google Cloud Platform
        • Object Storage, S3
    Hadoop vendors
      – HDP, CDH
  Databricks: Spark on AWS & Azure
    – Not much of the Hadoop ecosystem

Spark
  Spark SQL
    Querying using SQL, HiveQL
    Data Frame
  Spark Streaming
    DStream
      – RDD in streaming
  ML
    Machine learning on Data Frame, pipelining
  MLlib
    – On RDD
    – Sparse vector support, decision trees, linear/logistic regression, PCA, SVM, …

Big Data Analysis and Prediction Flow
  Data Collection
    Batch API: Yelp, Google
    Streaming: Twitter, Apache NiFi, Kafka, StreamSets, Storm
    Open Data: Government
  Data Storage
    HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
  Data Filtering
    Hive, Pig
  Data Analysis and Science
    Hive, Pig, Spark, BI tools (Qlik, Tableau, …)
  Data Visualization
    Qlik, Excel PowerMap, Tableau, Looker, …
  Stages: Big Data Engineering, Big Data Analysis, Big Data Science, Data Visualization

Terms
  We know
    Data Engineering
      – Collect, clean, transform, filter data
    Data Analysis
      – Find insights from the existing data
    Data Science (Predictive Analysis)
      – Predict the trend or pattern from the existing data
  Do we know?
    Big Data Analysis and Science
      – Using Big Data for Data Analysis and Science
        • Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, ...
      – For massive data sets
        • How to store and compute?
Big Data Science
  Fraud Detection: accepted to the APJIS journal by Jongwook Woo et al. in 2018
    – Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
    – Indexed in SCOPUS
  Goal
    Analyzing transaction data and detecting fraud
      – For mobile money transactions
        • based on a sample of real transactions
          – extracted from one month of financial logs from a mobile money service
      – using Spark ML (Big Data) and Azure ML (Traditional)

Financial Data Set
  Data is always an issue
    No publicly available datasets on financial services
      – Private nature of financial transactions
  PaySim
    – URL: https://www.kaggle.com/ntnu-testimon/paysim1
    – generates a synthetic dataset
      • from the private dataset
        – that resembles the normal operation of transactions

Financial Data Set (Cont'd)
  Size: 470 MB (=> 718 MB)
    6,362,620 records
    Not large-scale data compared to data sets larger than a GB
    But the architecture here is applicable to much bigger data sets
      – As it adopts the Spark computing engine for Big Data
      – Linearly scalable
  Attributes: 11
  Target column to predict: 'isFraud'

Experiment Environment: Traditional Systems and Big Data

Experiment Environment
  Azure ML: traditional, small data set
    Implement fundamental prediction models
      – Using sample data: 80 MB (1/5 – 1/6 of the data set)
    Select the best model among a number of classification algorithms

Experiment Environment (Cont'd)
  Spark ML
    Test with Databricks CE and IBM Cloud
      – 470 MB
    AWS EMR
      – Analyze all data
        • 470 MB (=> 718 MB)
      – Implement and evaluate prediction models
        • 3 different models
        • Spark clusters with 3 different numbers of nodes

Hardware Specifications: Spark
  IBM DSX Lite
    Python 2, Spark 2.1
    File system: Object Storage
    2 Spark executors, 16 GB memory
  Databricks
    Python 2, Spark 2.1 (Auto-updating, Scala 2.10)
    File system: Databricks File System
    Single/unlimited cluster, memory: 6 GB

Experiment Environment
  AWS EMR
    EMR 12.1
      – Spark 2.2.1 on Hadoop 2.8.3
      – YARN with Ganglia 3.7.2 and Zeppelin 0.7.3
    m3.xlarge instance
      – Memory: 15.0 GiB
      – CPU: 4 vCPUs
      – Storage: 80 GiB (2 * 40 GiB SSD)
    File system: S3
    3 different EMR clusters
      – number of nodes (servers):
        • 3, 6, 11 nodes

PySpark on Databricks

Work Flow in Azure ML
  Relatively easy to build and test
    Drag-and-drop GUI
  Work flow
    1. Data Engineering
      – Understanding data
      – Data preparation
      – Balancing data statistically
    2. Data Science: Machine Learning (ML)
      – Model building and validation
        • Classification algorithms
      – Model evaluation
      – Model interpretation

Data Understanding
  • Numeric attributes: amount, oldbalanceOrg, newbalanceOrg, oldbalanceDest, newbalanceDest
  • Categorical attributes: step, type, isFraud, isFlaggedFraud
  • String attributes: nameOrig, nameDest

Experiment in Azure ML

Precision vs Recall
  Positive: event occurs (fraud)
  Negative: event does not occur (non-fraud)
  True Positive (TP): Fraud? Yes, it is
  False Negative (FN): No fraud? But it is
  False Positive (FP): Fraud? But it is not
  Precision
    TP / (TP + FP)
  Recall
    TP / (TP + FN)
  Ref: https://en.wikipedia.org/wiki/Precision_and_recall

Model Evaluation
  Focus more on Recall, to capture the most fraudulent transactions
  Bad Recall: fatal
    – If there are many false negatives (FN)
      • the transaction is predicted as normal, not fraud
        – but it is a fraud
    – Painful
      • Need to decrease FN
        – That is, to increase Recall

Experimental Results in AzureML
  Model                            Accuracy   Precision   Recall
  Two Class Logistic Regression
  Two Class Decision Forest
  Two Class Decision Jungle                   0.916       0.998

Experimental Results
  Accuracy
    Decision Jungle
      – Highest Recall: 0.998
        • While Precision: 0.916
      – With small sample data set: 359 KB
        • takes 11 sec
  Performance: time taken to build a model with the whole data set
    – 470 MB + data tweaking
    – Over a day
  Good guide for adopting the 3 similar algorithms in Spark ML
    – Decision Tree, Random Forest, Logistic Regression

Experiment with Spark ML
  1. Load the data source
    470 MB (=> 718 MB)
  2. Train and build the models
    o Balanced data statistically
  3. Evaluate

Define the pipeline

Train the models

Pipeline Classification with Spark ML
  Components: Feature Transformer, Classification Estimator, ParamMap Estimator, Classification Evaluator, Validation Estimator, Model Transformer
  Feature is generated from input columns
  Classifiers: Decision Tree, RandomForest, LogisticRegression
  Combination of parameters: Max Bins, Max Depth, …
  Validators: Cross Validator, Train Validation Split

Results
  Model                     Area under ROC   Precision   Recall
  DecisionTreeClassifier
  RandomForestClassifier
  LogisticRegression
  0.909573 (row and column not recoverable)
  • 3 models with different combinations of the parameters
  • Time taken (Spark cluster): 1 hour
  • In theory, with linear scalability: 2 minutes with 30 Spark clusters
  • The Random Forest has the best recall score
    • compared to Decision Tree and Logistic Regression

Experimental Results in AWS
  Execution times
    3 nodes:
      – 40 min – 70 min
    11 nodes:
      – 10 min – 20 min

Summary
  Introduction to Big Data
  Introduction to Big Data Predictive Analytics
  Experimental Result of Fraud Detection
    Recall:
      – RandomForest in Spark ML
      – Decision Jungle in Azure ML
    Performance:
      – Traditional systems:
        • not good for large-scale data
      – Spark ML:
        • Linearly scalable
        • Fast

Questions?

References
  1. Jongwook Woo and Yuhang Xu, "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
  2. Jongwook Woo, DMKD-00150, "Market Basket Analysis Algorithms with MapReduce", Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
  3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
  4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
  5. Manik Katyal, Parag Chhadva, Shubhra Wahi and Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
  6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
  7. (Accepted in Sept 2018) Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems
  8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframesgoogle-tensorflow-on-apache-spark
  9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apachespark
  10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-onspark
  11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewithapache-spark-keynote-by-ziya-ma
  12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
  13. Tensor Flow Deep Learning Open SAP
  14. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factorysolutions-68137094/6
  15. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
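
Appendix sketch: the two quantities the Summary leans on, recall and linear scalability, can be worked through with a few lines of Python. This is only an illustration under stated assumptions. The confusion-matrix counts are hypothetical and are not taken from the experiments. The timing inputs are the execution times reported for AWS, read here as ranges (40 to 70 minutes on 3 nodes, 10 to 20 minutes on 11 nodes). The 1-hour baseline is assumed to be a single cluster. The helper names (precision, recall, ideal_minutes) are made up for this sketch.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): share of flagged transactions that are fraud."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Recall = TP / (TP + FN): share of actual frauds that were flagged."""
    return tp / (tp + fn)


def ideal_minutes(base_minutes, base_nodes, nodes):
    """Run time under ideal linear scaling: time shrinks in proportion to nodes."""
    return base_minutes * base_nodes / nodes


# Hypothetical counts, for illustration only (not experimental results).
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)   # 90 of 100 flagged transactions are fraud
r = recall(tp, fn)      # 90 of 120 actual frauds are caught; 30 slip through

# Reported 3-node range (40 to 70 min) projected onto 11 nodes.
low = ideal_minutes(40, 3, 11)
high = ideal_minutes(70, 3, 11)

# Assumed single-cluster baseline of 1 hour, projected onto 30 clusters.
thirty = ideal_minutes(60, 1, 30)

print(f"precision={p:.3f} recall={r:.3f}")
print(f"ideal 11-node range: {low:.1f} to {high:.1f} min")
print(f"ideal time on 30 clusters: {thirty:.1f} min")
```

Under these assumptions the projected 11-node range is roughly 11 to 19 minutes, which sits close to the reported 10 to 20 minutes, and the 30-cluster projection reproduces the 2-minute figure quoted in the Results slide.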