Machine Learning Overview Artificial Intelligence has evolved significantly in the past year Artificial Intelligence = The ability to mimic human intelligence, eg. Understand, reason, and learn Machine Learning (Predictive AI) Generative AI Programs that can analyze content to make predictions and prescribe actions eg. forecast revenue based on historical sales eg. prescribe next best offer eg. visually identify product defect Analytics Machine Learning Deep Learning Programs that can generate net new content and better understand existing content eg. create image from prompt eg. answers questions from PDFs eg. summarize an article Foundation Model Large Language Model 13 Artificial intelligence, Machine Learning, and Deep Learning • AI components • Machine Learning • Deep Learning 14 Machine Learning Machine learning is the scientific study of algorithms and statistical models using inference instead of instructions to perform a task. Data Model Prediction Machine Learning flow 15 When to use machine learning? Classical programming approach Use machine learning when you have: • Large datasets, large number of variables Business logic Task Procedures • Lack of clear procedures to obtain the solution • Existing machine learning expertise • Infrastructure already in place to support ML • Management support for ML Machine learning approach Data Model Task 24 Machine Learning process ML pipeline: Business problem Business problem Problem formulation Example: You have a dataset containing information about various cars, such as their brand, model, year of manufacture, mileage, and price. The goal is to build a machine-learning model that can predict the resell price of a car based on its features. Problem formulation: Given the car’s features, predict its resell price. 26 ML pipeline: Data preparation Data handling and cleaning Business problem data data Problem formulation data data Collect and label data Evaluate data Name Country Sex dob Richard Roe UK Male 18/2/1972 Paulo Santos Male Mrs. Mary Major Denver F 37 Desai, Arnav USA M 2/22/1962 11/2/1969 27 ML pipeline: Iterative Model Training Business problem Problem formulation Collect and label data Tune model Evaluate data Feature Engineering Select and train model Evaluate model Feature augmentation Meets business goal? No Data augmentation 28 ML pipeline: Feature Engineering Name Country Richard Roe UK Paulo Santos Male Mrs. Mary Major Denver F 37 Desai, Arnav USA M 2/22/1962 Name Sex Male dob 18/2/1972 11/2/1969 ? USA UK sex age bm dow target Richard Roe 0 1 0 49 2 5 140,000 Paulo Santos 1 0 0 51 11 7 78,000 Mary Major 1 0 1 37 NAN 0 167,000 Arnav Desai 1 0 0 58 2 4 100,000 29 ML pipeline: Model Training Name USA UK sex age bm dow target Richard Roe 0 1 0 49 2 5 140,000 … … … … … … … … 10–20% Test data 80% Algorithm XGBoost Trained model {hyperparameters} 30 ML pipeline: Evaluating and tuning the model Name USA UK sex age bm dow target Richard Roe 0 1 0 49 2 5 140,000 … … … … … … … … Change features 80% Algorithm XGBoost 10-20% Test data predict Trained model Hosted model {hyperparameters} Change hyperparameters Metrics 31 Overfitting and underfitting Y Y Y X Overfitting X Underfitting X Balanced 32 ML pipeline: Deployment New data, retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Feature augmentation Meets business goal? No Data augmentation 33 Machine Learning Challenges Data Business • • • • Poor quality Non-representative Insufficient Overfitting and underfitting • Complexity in formulating questions • Explaining models to the business • Cost of building systems Users Technology • Lack of data science expertise • Cost of staffing with data scientists • Lack of management support • Data privacy issues • Tool selection can be complicated • Integration with other systems 36 Let's dive deeper to understand what a typical Machine Learning Pipeline entails. Machine Learning Pipeline 1. Scenario Introduction 2. Data Collection 3. Data Evaluation 4. Data Preprocessing 5. Model Development (Training) 6. Model Evaluation 7. Model Deployment 8. Hyperparameter and Model tuning 38 Machine Learning Pipeline New data and retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation Data augmentation © 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved. 39 Machine Learning Pipeline New data and retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation Data augmentation © 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved. 40 Define business objective What is the business goal? Questions to ask: • How is this task done today? • How will the business measure success? • How will the solution be used? • Do similar solutions exist that you might learn from? • What assumptions have been made? • Who are the domain experts? 42 How should you frame this problem? • Is the problem a machine learning problem? • Is the problem supervised or unsupervised? Supervised learning • What is the target to predict? • Do you have access to the data? • What is the minimum performance? Unsupervised learning Reinforcement learning • How would you solve this problem manually? • What’s the simplest solution? 43 Example: Problem Formulation You want to identify fraudulent credit card transactions so that you can stop the transaction before it processes. Why? Reduce the number of customers who end their membership because of fraud. 10% reduction in fraud claims in retail Can you measure it? Move from qualitative statements to quantitative statements that can be measured. 44 Make it an ML model Credit card transaction is either fraudulent or not fraudulent. Fraud Binary classification problem Not fraud Use historical data of fraud reports to help define your model. 45 Wine quality dataset Question: Based on the composition of the wine, can you predict the quality and, therefore the price? Why: • View statistics • Deal with outliers • Scale numeric data Citation Source: UCI Wine quality dataset Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. 46 Car evaluation dataset Question: Can you use a car’s attributes to predict whether the car will be purchased? Why: • View statistics • Encode categorical data Citation Source: UCI Car evaluation dataset Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 47 Machine Learning Pipeline New data and retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation Data augmentation 49 Data sources • Private data: Data that customers create • Commercial data: HKEx, Reuters, etc. • Open-source data: Data that is publicly available (check for limits on usage) • Kaggle • World Health Organization • U.S. Census Bureau • National Oceanic and Atmospheric Administration (U.S.) • UC Irvine Machine Learning Repository 51 Observations ML problems need a lot of data—also called observations— where the target answer or prediction is already known. Customer Date of transaction Vendor Charge amount Was this fraud? ABC 10/5 Store 1 10.99 No DEF 10/5 Store 2 99.99 Yes GHI 10/5 Store 2 15.00 No JKL 10/6 Store 2 99.99 ? MNO 10/6 Store 1 99.99 Yes Feature Target 52 Extract, transform, load (ETL) Data Catalog Crawler Data Stores Schedule or Event Extract Load Transform Script Data Source Data Target 54 Summary • The first step in the machine learning pipeline is to obtain data • Extract, transform, and load (ETL) is a common term for obtaining data for machine learning 55 Machine Learning Pipeline New data and retraining Deploy model Business problem Format data Problem formulation Collect and label data Yes Tune model Examine data types Perform descriptive statistics Evaluate data Feature engineering Select and train model Data Understanding Evaluate model Visualize data Feature augmentation Meets business goal? No Data augmentation 56 You must understand your data Before you can run statistics on your data, you must ensure that it’s in the right format for analysis. Customer Date of Transaction Vendor Charge Amount Was This Fraud? “Customer:ABC,DateOfTran ABC 10/5 Store 1 10.99 No sation:10/5,Vendor:Store1, DEF 10/5 Store 2 99.99 Yes GHI 10/5 Store 2 15.00 No JKL 10/6 Store 2 99.99 ? MNO 10/6 Store 1 99.99 Yes ChargeAmount:10.99,WasT hisFruad:No…” 57 Summary • The first step in evaluating data is to make sure that it’s in the right format • pandas is a popular Python library for working with data • Use descriptive statistics to learn about the dataset • Create visualizations with pandas to examine the dataset in more detail 70 Feature Engineering Machine Learning pipeline New data and retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation Data augmentation 72 Feature Selection and Extraction Feature Extraction Feature Selection XX X 73 Feature Extraction • Invalid values • Transformation • Wrong formats • Normalization • Misspelling • Dimensionality reduction • Duplicates • Consistency • Rescale • Encode categories • Remove outliers • Reassign outliers • Bucketing • Decomposition • Aggregation • Combination 74 Encoding ordinal data • Categorical data is non-numeric data. • Categorical data must be converted (encoded) to a numeric scale. Maintenance Costs Low Medium High Very High Encoding 1 2 3 4 75 Encoding non-ordinal data • If data is non-ordinal, the encoded values also must be non-ordinal. • Non-ordinal data might need to be broken into multiple categories. … Color … Red Blue Green … Red … 1 0 0 … Blue … 0 1 0 … Green … 0 0 1 … Blue … 0 1 0 Green … 0 0 1 © 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved. 76 Cleaning data Types of data to clean: Type Example Action Variations in strings Med. vs. Medium Convert to standard text Variations in scale Number of doors vs. number purchased Normalize to a common scale Columns with multiple data items Safe high maintenance Parse into multiple columns Missing data Missing columns of data Delete rows or impute data Outliers Various © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 77 Finding missing data • Missing data makes it difficult to interpret relationships • Causes of missing data – • Undefined values • Data collection errors • Data cleaning errors • Example pandas code to find missing data – df.isnull().sum() df.isnull().sum(axis=1) #count missing values for each column #count missing values for each row 78 Plan for missing values Before dropping or imputing (replacing) missing values, ask: What were the mechanisms that caused the missing values? Are these missing values missing at random? Are rows or columns missing that you are not aware of? 79 Dropping missing values Drop missing data with pandas • dropna function to drop rows • df.dropna() • dropna function to drop columns with null values • df.dropna (axis=1) • dropna function to drop a subset • df.dropna(subset=[“buying”]) 80 Imputing missing data • First, determine why the data is missing • Two ways to impute missing data – • Univariate: Adding data for a single row of missing data • Multivariate: Adding data for multiple rows of missing data from sklearn.preprocessing import Imputer import numpy as np Arr = np.array([[5,3,2,2],[3,None,1,9],[5,2,7,None]]) imputer = Imputer(strategy='mean’) imp = imputer.fit(arr) imputer.transform(arr) 81 Outliers • Outliers can – • Provide a broader picture of the data • Make accurate predictions difficult • Indicate the need for more columns • Types of outliers – • Univariate: Abnormal values for a single variable • Multivariate: Abnormal values for a combination of two or more variables 82 Finding outliers • Box plots show variation and distance from the mean • Example to the right shows a box plot for the amount of alcohol in a collection of wines • Scatter plots can also show outliers • Example to the right shows a scatter plot that shows the relationship between alcohol and sulphates in a collection of wines 83 Dealing with outliers Delete the outlier Outlier is based on an artificial error. Transform the outlier Reduces the variation that the extreme outlier value causes and the outlier’s influence on the dataset. Impute a new value for the outlier You might use the mean of the feature, for instance, and impute that value to replace the outlier value. 84 Feature Selection 1. Filter method • Use statistical methods 2. Wrapper methods • Train the model. Feature Selection X X X 3. Embedded methods • Integrated with model 85 Feature Selection: Filter methods • Measures • Pearson’s correlation • Linear discriminant analysis (LDA) • Analysis of variance (ANOVA) • Chi-square © 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved. All features Statistics and correlation Best features 86 Feature Selection: Wrapper All features • Methods • Forward selection • Backward selection Feature subset Train model Evaluate results Train ? Best features © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 87 Feature Selection: Embedded methods • Methods • Decision trees • LASSO and RIDGE All features Train model Feature subset Evaluate results ? © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 88 Summary • Feature engineering involves: • Selection • Extraction • Preprocessing gives you better data • Two categories for preprocessing • Cleaning up dirty data • Encoding the data • Develop a strategy for dirty data • Replace or delete rows with missing data • Delete, transform, or impute new values for outliers 89 Model Training Machine Learning pipeline New data and retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation Data augmentation 91 Why split the data? Y All data that is used for training and evaluation X Overfitting 93 Holdout method 10% - Test data 10% - Validation data 80% - Training data 94 K-fold Cross Validation Model 1 k=5 Model 2 Model 3 Model 4 Model 5 Test Train Train Train Train Train Test Train Train Train Train Train Test Train Train Train Train Train Test Train Train Train Train Train Test 95 Shuffle your data quality Fixed acidity sulphates 3 7.2 0.56 3 7.2 0.68 4 7.8 0.65 4 7.8 0.65 4 … … 5 … … 5 … … 5 … … 6 … … 6 … … Test data Training data 96 Machine Learning Algorithms Classification Quantitative Recommendations Unsupervised XGBoost XGBoost Factorization machines K-means Image classification Neural Topic Model (NTM) Principal Component Analysis Sequence-tosequence Latent Dirichlet Allocation (LDA) Decision Tree Regression Specialized 97 Summary • Split data into training and testing sets • Optionally, split into three sets, including the validation set • Can use k-fold cross-validation to use all the non-test data for validation • Can use two key algorithms for supervised learning • XGBoost • Linear learner • Use k-means for unsupervised learning 98 Machine Learning Pipeline New data and retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation Data augmentation 99 Is your model ready to deploy? Trained Tuned Tested 100 Model Deployment Predictions Managed environment hosting models In production 101 Model Deployment • Model Operationalization => Convert model into a web services • Exposed the ML model with an Open API Real-time? Batch Process? 102 Obtaining Predictions Data processing steps Model training Predictions 103 Model Evaluation Machine Learning pipeline New data and retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation Data augmentation 105 Evaluation determines how well your model predicts the target on future data 10% – Test data Testing and Evaluation: To evaluate the predictive quality of the trained model 10% – Validation data 80% – Training data 106 Success Metric Business problem: Fraudulent credit card transactions Model metric Success metric: 10% reduction in fraud claims in 6-month period The model metric must align to both the business problem and the success metric. 107 Confusion Matrix 10% – Test data Trained model Results © 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved. Predicted Actual Cat Not cat Cat Not cat 107 23 69 42 108 Confusion Matrix terminology True positive (TP) Predicted a cat and it was a cat Predicted Actual Cat Not cat Cat TP FP Not cat FN TN False positive (FP) Predicted not a cat and it was a cat False negative (FN) Predicted a cat and it was not a cat True negative (TN) Predicted not a cat and it was not a cat © 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved. 109 Which model is better? Confusion matrices from two models that use the same data: Cat Not cat Actual Cat Not cat 107 23 69 42 Predicted Predicted Actual Cat Not cat Cat Not cat 148 53 28 12 110 Sensitivity Sensitivity (Recall): What percentage of cats were correctly identified? Predicted Actual Cat Not cat Cat TP:107 FP:23 Not cat FN:69 TN:42 TP sensitivity = TP + FN sensitivity = !"# !"#$%& = 0.6079 60% of cats were identified as cats. 111 Specificity Specificity: What percentage of not cats were correctly identified? Predicted Actual Cat Not cat Cat TP:107 FP:23 Not cat FN:69 TN:42 TN speci0icity = TN + FP speci0icity = '( '($() = 0.6461 64% of not cats were identified as not cats 112 Which model is better now? Model A Model B Cat Not cat Actual Cat Not cat 107 23 69 42 Predicted Predicted Actual Cat Not cat Cat Not cat 148 53 28 12 Sensitivity 60% Sensitivity 84% Specificity 64% Specificity 18% 113 Thresholds When models return probabilities: Predicted 0.96 if [predicted] >= threshold It’s a cat! if [predicted] < threshold NOT cat! 0.77 0.58 0.01 0.47 Changing the value of threshold can change the prediction. 114 Classification: ROC graph Receiver operating characteristic (ROC) 1.0 Sensitivity True-positive rate 0.8 Plot points for each threshold value 0.6 0.4 TPR = FPR Select threshold based on business case 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 1 – Specificity False-positive rate 115 Classification: AUC-ROC Area Under the Curve 1.0 Sensitivity True-positive rate 0.8 0.6 AUC-ROC is used to easily compare one model to another AUC: 89% 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 1-Specificity False-positive rate 116 Classification: Other metrics Accuracy = TP + TN (TP + TN)/(TP + TN + FP + FN) Precision = TP TP + FP precision ⋅ sensilvity F1 score = 2 · precision + sensilvity Model’s score Proportion of positive results that were correctly classified Combines Precision and Sensitivity (Recall) 117 Regression: Mean squared error (x6,y6) (x4,y4) (x3,y3) % MSE = 1 & ๐ฆ" − แปน" ๐ & y (x1,y1) (x5,y5) "#$ (x7,y7) (x2,y2) x 118 Model Evaluation Metrics Summary • Several ways to validate the model • Hold-out • K-fold cross validation • Two types of model evaluation • Classification • Confusion matrix • AUC-ROC • Regression testing • Mean squared Error 120 Hyperparameter and model tuning Machine Learning pipeline New data and retraining Deploy model Business problem Problem formulation Collect and label data Yes Tune model Evaluate data Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation Data augmentation 122 ML Tuning process Tune model Feature engineering Select and train model Evaluate model Meets business goal? No Feature augmentation 123 Recap: Goal for models Balanced and generalizes well © 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved. 124 Hyperparameter categories Model Optimizer Data Help define the model How the model learns patterns on data Define attributes of the data itself Filter size, pooling, stride, padding Gradient descent, stochastic gradient descent Useful for small or homogenous datasets 125 Tuning Hyperparameters Manually select hyperparameters according to one’s intuition or experience However, this approach is often not as thorough or efficient as needed. 126 Python Library for Machine Learning 127 Python libraries for machine learning • Pandas: High-level library for data importing, manipulation and analysis (numerical tables, time series…) • NumPy: Math library to work with n-dimensional arrays • SciPy: Collection of numerical algorithms and domainspecific toolboxes (signal processing, optimization, statistics…) • Matplotlib: 2D / 3D plotting package • Scikit-learn: Collection of algorithms and tools for ML 128 Scikit-learn • Free software machine learning library • Classification, Regression and Clustering algorithms • Works with NumPy and SciPy • Easy to Implement 129 Jupyter Notebook Environment 130 Jupyter Notebook Environment • Anaconda • Install Anaconda in your own laptop • https://www.anaconda.com/products/individual • Use the installed version of Anaconda in the PC in DLB311 • Kaggle • https://www.kaggle.com/ • Google Colab • https://colab.research.google.com/ • https://www.geeksforgeeks.org/how-to-use-google-colab/ 131 Jupyter Notebook Jupyter Notebook Users Manual https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Us ers%20Manual.ipynb Jupyter Notebook Cheatsheet http://datacamp-community-prod.s3.amazonaws.com/21fdc814-3f08-4aa9-90fa247eedefd655 132 End of Module 1