Uploaded by jayzeng220

001-FIN7790

advertisement
Machine Learning Overview
Artificial Intelligence has evolved significantly in the past year
Artificial
Intelligence
=
The ability to mimic human intelligence,
eg. Understand, reason, and learn
Machine Learning (Predictive AI)
Generative AI
Programs that can analyze content to
make predictions and prescribe actions
eg. forecast revenue based on historical sales
eg. prescribe next best offer
eg. visually identify product defect
Analytics
Machine Learning
Deep Learning
Programs that can generate net new
content and better understand existing
content
eg. create image from prompt
eg. answers questions from PDFs
eg. summarize an article
Foundation Model
Large Language Model
13
Artificial intelligence, Machine Learning, and Deep Learning
• AI components
• Machine Learning
• Deep Learning
14
Machine Learning
Machine learning is the scientific study of algorithms and statistical
models using inference instead of instructions to perform a task.
Data
Model
Prediction
Machine Learning flow
15
When to use machine learning?
Classical programming approach
Use machine learning when you have:
• Large datasets, large number of variables
Business logic
Task
Procedures
• Lack of clear procedures to obtain the solution
• Existing machine learning expertise
• Infrastructure already in place to support ML
• Management support for ML
Machine learning approach
Data
Model
Task
24
Machine Learning process
ML pipeline: Business problem
Business problem
Problem
formulation
Example: You have a dataset containing information about various cars, such as
their brand, model, year of manufacture, mileage, and price. The goal is to build a
machine-learning model that can predict the resell price of a car based on its
features.
Problem formulation: Given the car’s features, predict its resell price.
26
ML pipeline: Data preparation
Data handling and cleaning
Business problem
data
data
Problem
formulation
data
data
Collect and label
data
Evaluate data
Name
Country
Sex
dob
Richard Roe
UK
Male
18/2/1972
Paulo Santos
Male
Mrs. Mary Major
Denver
F
37
Desai, Arnav
USA
M
2/22/1962
11/2/1969
27
ML pipeline: Iterative Model Training
Business problem
Problem
formulation
Collect and
label data
Tune model
Evaluate data
Feature
Engineering
Select and train
model
Evaluate model
Feature augmentation
Meets
business
goal?
No
Data augmentation
28
ML pipeline: Feature Engineering
Name
Country
Richard Roe
UK
Paulo Santos
Male
Mrs. Mary Major
Denver
F
37
Desai, Arnav
USA
M
2/22/1962
Name
Sex
Male
dob
18/2/1972
11/2/1969
?
USA
UK
sex
age
bm
dow
target
Richard Roe
0
1
0
49
2
5
140,000
Paulo Santos
1
0
0
51
11
7
78,000
Mary Major
1
0
1
37
NAN
0
167,000
Arnav Desai
1
0
0
58
2
4
100,000
29
ML pipeline: Model Training
Name
USA
UK
sex
age
bm
dow
target
Richard Roe
0
1
0
49
2
5
140,000
…
…
…
…
…
…
…
…
10–20%
Test data
80%
Algorithm
XGBoost
Trained model
{hyperparameters}
30
ML pipeline: Evaluating and tuning the model
Name
USA
UK
sex
age
bm
dow
target
Richard Roe
0
1
0
49
2
5
140,000
…
…
…
…
…
…
…
…
Change
features
80%
Algorithm
XGBoost
10-20%
Test data
predict
Trained model
Hosted model
{hyperparameters}
Change hyperparameters
Metrics
31
Overfitting and underfitting
Y
Y
Y
X
Overfitting
X
Underfitting
X
Balanced
32
ML pipeline: Deployment
New data, retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Feature augmentation
Meets
business
goal?
No
Data augmentation
33
Machine Learning Challenges
Data
Business
•
•
•
•
Poor quality
Non-representative
Insufficient
Overfitting and underfitting
• Complexity in formulating questions
• Explaining models to the business
• Cost of building systems
Users
Technology
• Lack of data science expertise
• Cost of staffing with data scientists
• Lack of management support
• Data privacy issues
• Tool selection can be complicated
• Integration with other systems
36
Let's dive deeper to understand what a typical
Machine Learning Pipeline entails.
Machine Learning Pipeline
1. Scenario Introduction
2. Data Collection
3. Data Evaluation
4. Data Preprocessing
5. Model Development (Training)
6. Model Evaluation
7. Model Deployment
8. Hyperparameter and Model tuning
38
Machine Learning Pipeline
New data and retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
Data augmentation
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
39
Machine Learning Pipeline
New data and retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
Data augmentation
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
40
Define business objective
What is the business goal?
Questions to ask:
• How is this task done today?
• How will the business measure success?
• How will the solution be used?
• Do similar solutions exist that you might learn from?
• What assumptions have been made?
• Who are the domain experts?
42
How should you frame this problem?
• Is the problem a machine learning problem?
• Is the problem supervised or unsupervised?
Supervised
learning
• What is the target to predict?
• Do you have access to the data?
• What is the minimum performance?
Unsupervised
learning
Reinforcement
learning
• How would you solve this problem manually?
• What’s the simplest solution?
43
Example: Problem Formulation
You want to identify fraudulent credit card transactions so that you
can stop the transaction before it processes.
Why?
Reduce the number of customers who end their membership
because of fraud.
10%
reduction in fraud
claims in retail
Can you measure it?
Move from qualitative statements to quantitative statements that
can be measured.
44
Make it an ML model
Credit card transaction is either fraudulent or not fraudulent.
Fraud
Binary classification problem
Not fraud
Use historical data of fraud reports to help define your model.
45
Wine quality dataset
Question:
Based on the composition of the wine, can you predict the quality and, therefore the price?
Why:
• View statistics
• Deal with outliers
• Scale numeric data
Citation
Source: UCI Wine quality dataset
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
46
Car evaluation dataset
Question:
Can you use a car’s attributes to predict whether the car will be purchased?
Why:
• View statistics
• Encode categorical data
Citation
Source: UCI Car evaluation dataset
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science.
47
Machine Learning Pipeline
New data and retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
Data augmentation
49
Data sources
• Private data: Data that customers create
• Commercial data: HKEx, Reuters, etc.
• Open-source data: Data that is publicly available (check for limits on usage)
• Kaggle
• World Health Organization
• U.S. Census Bureau
• National Oceanic and Atmospheric Administration (U.S.)
• UC Irvine Machine Learning Repository
51
Observations
ML problems need a lot of data—also called observations— where
the target answer or prediction is already known.
Customer
Date of
transaction
Vendor
Charge
amount
Was this
fraud?
ABC
10/5
Store 1
10.99
No
DEF
10/5
Store 2
99.99
Yes
GHI
10/5
Store 2
15.00
No
JKL
10/6
Store 2
99.99
?
MNO
10/6
Store 1
99.99
Yes
Feature
Target
52
Extract, transform, load (ETL)
Data Catalog
Crawler
Data
Stores
Schedule or Event
Extract
Load
Transform Script
Data
Source
Data
Target
54
Summary
• The first step in the machine learning pipeline is to obtain data
• Extract, transform, and load (ETL) is a common term for obtaining
data for machine learning
55
Machine Learning Pipeline
New data and retraining
Deploy model
Business problem
Format data
Problem
formulation
Collect and
label data
Yes
Tune model
Examine data types
Perform descriptive statistics
Evaluate data
Feature
engineering
Select and train
model
Data
Understanding
Evaluate model
Visualize data
Feature augmentation
Meets
business
goal?
No
Data augmentation
56
You must understand your data
Before you can run statistics on your data, you must ensure that it’s in
the right format for analysis.
Customer
Date of
Transaction
Vendor
Charge
Amount
Was This
Fraud?
“Customer:ABC,DateOfTran
ABC
10/5
Store 1
10.99
No
sation:10/5,Vendor:Store1,
DEF
10/5
Store 2
99.99
Yes
GHI
10/5
Store 2
15.00
No
JKL
10/6
Store 2
99.99
?
MNO
10/6
Store 1
99.99
Yes
ChargeAmount:10.99,WasT
hisFruad:No…”
57
Summary
• The first step in evaluating data is to make sure that it’s in the right
format
• pandas is a popular Python library for working with data
• Use descriptive statistics to learn about the dataset
• Create visualizations with pandas to examine the dataset in more
detail
70
Feature Engineering
Machine Learning pipeline
New data and retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
Data augmentation
72
Feature Selection and Extraction
Feature Extraction
Feature Selection
XX
X
73
Feature Extraction
• Invalid values
• Transformation
• Wrong formats
• Normalization
• Misspelling
• Dimensionality reduction
• Duplicates
• Consistency
• Rescale
• Encode categories
• Remove outliers
• Reassign outliers
• Bucketing
• Decomposition
• Aggregation
• Combination
74
Encoding ordinal data
• Categorical data is non-numeric data.
• Categorical data must be converted (encoded) to a numeric scale.
Maintenance Costs
Low
Medium
High
Very High
Encoding
1
2
3
4
75
Encoding non-ordinal data
• If data is non-ordinal, the encoded values also must be non-ordinal.
• Non-ordinal data might need to be broken into multiple categories.
…
Color
…
Red
Blue
Green
…
Red
…
1
0
0
…
Blue
…
0
1
0
…
Green
…
0
0
1
…
Blue
…
0
1
0
Green
…
0
0
1
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
76
Cleaning data
Types of data to clean:
Type
Example
Action
Variations in strings
Med. vs. Medium
Convert to standard text
Variations in scale
Number of doors vs.
number purchased
Normalize to a common
scale
Columns with multiple
data items
Safe high maintenance
Parse into multiple
columns
Missing data
Missing columns of data Delete rows or impute
data
Outliers
Various
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights
reserved.
77
Finding missing data
• Missing data makes it difficult to interpret
relationships
• Causes of missing data –
• Undefined values
• Data collection errors
• Data cleaning errors
• Example pandas code to find missing data –
df.isnull().sum()
df.isnull().sum(axis=1)
#count missing values for each column
#count missing values for each row
78
Plan for missing values
Before dropping
or imputing
(replacing)
missing values,
ask:
What were the mechanisms that caused the
missing values?
Are these missing values missing at random?
Are rows or columns missing that you are not
aware of?
79
Dropping missing values
Drop missing data with pandas
• dropna function to drop rows
• df.dropna()
• dropna function to drop columns with null values
• df.dropna (axis=1)
• dropna function to drop a subset
• df.dropna(subset=[“buying”])
80
Imputing missing data
• First, determine why the data is missing
• Two ways to impute missing data –
• Univariate: Adding data for a single row of missing data
• Multivariate: Adding data for multiple rows of missing data
from sklearn.preprocessing import Imputer
import numpy as np
Arr = np.array([[5,3,2,2],[3,None,1,9],[5,2,7,None]])
imputer = Imputer(strategy='mean’)
imp = imputer.fit(arr)
imputer.transform(arr)
81
Outliers
• Outliers can –
• Provide a broader picture of the data
• Make accurate predictions difficult
• Indicate the need for more columns
• Types of outliers –
• Univariate: Abnormal values for a single
variable
• Multivariate: Abnormal values for a
combination of two or more variables
82
Finding outliers
• Box plots show variation and distance from
the mean
• Example to the right shows a box plot for the
amount of alcohol in a collection of wines
• Scatter plots can also show outliers
• Example to the right shows a scatter plot that
shows the relationship between alcohol and
sulphates in a collection of wines
83
Dealing with outliers
Delete the outlier
Outlier is based on an
artificial error.
Transform the outlier
Reduces the variation that
the extreme outlier value
causes and the outlier’s
influence on the dataset.
Impute a new value
for the outlier
You might use the mean of
the feature, for instance,
and impute that value to
replace the outlier value.
84
Feature Selection
1. Filter method
• Use statistical methods
2. Wrapper methods
• Train the model.
Feature Selection
X X
X
3. Embedded methods
• Integrated with model
85
Feature Selection: Filter methods
• Measures
• Pearson’s correlation
• Linear discriminant analysis
(LDA)
• Analysis of variance (ANOVA)
• Chi-square
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
All features
Statistics and
correlation
Best features
86
Feature Selection: Wrapper
All features
• Methods
• Forward selection
• Backward selection
Feature subset
Train model
Evaluate
results
Train
?
Best features
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights
reserved.
87
Feature Selection: Embedded methods
• Methods
• Decision trees
• LASSO and RIDGE
All features
Train model
Feature subset
Evaluate
results
?
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights
reserved.
88
Summary
• Feature engineering involves:
• Selection
• Extraction
• Preprocessing gives you better data
• Two categories for preprocessing
• Cleaning up dirty data
• Encoding the data
• Develop a strategy for dirty data
• Replace or delete rows with missing data
• Delete, transform, or impute new values for outliers
89
Model Training
Machine Learning pipeline
New data and retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
Data augmentation
91
Why split the data?
Y
All data that is
used for
training and
evaluation
X
Overfitting
93
Holdout method
10% - Test data
10% - Validation data
80% - Training data
94
K-fold Cross Validation
Model 1
k=5
Model 2
Model 3
Model 4
Model 5
Test
Train
Train
Train
Train
Train
Test
Train
Train
Train
Train
Train
Test
Train
Train
Train
Train
Train
Test
Train
Train
Train
Train
Train
Test
95
Shuffle your data
quality
Fixed acidity
sulphates
3
7.2
0.56
3
7.2
0.68
4
7.8
0.65
4
7.8
0.65
4
…
…
5
…
…
5
…
…
5
…
…
6
…
…
6
…
…
Test data
Training data
96
Machine Learning Algorithms
Classification
Quantitative
Recommendations
Unsupervised
XGBoost
XGBoost
Factorization
machines
K-means
Image
classification
Neural Topic
Model (NTM)
Principal
Component
Analysis
Sequence-tosequence
Latent Dirichlet
Allocation (LDA)
Decision Tree
Regression
Specialized
97
Summary
• Split data into training and testing sets
• Optionally, split into three sets, including the validation set
• Can use k-fold cross-validation to use all the non-test data for
validation
• Can use two key algorithms for supervised learning
• XGBoost
• Linear learner
• Use k-means for unsupervised learning
98
Machine Learning Pipeline
New data and retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
Data augmentation
99
Is your model ready to deploy?
Trained
Tuned
Tested
100
Model Deployment
Predictions
Managed
environment hosting
models
In production
101
Model Deployment
• Model Operationalization => Convert model into a web services
• Exposed the ML model with an Open API
Real-time?
Batch Process?
102
Obtaining Predictions
Data processing
steps
Model training
Predictions
103
Model Evaluation
Machine Learning pipeline
New data and retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
Data augmentation
105
Evaluation determines how well your model predicts the
target on future data
10% – Test data
Testing and Evaluation:
To evaluate the predictive
quality of the trained model
10% – Validation data
80% – Training data
106
Success Metric
Business problem:
Fraudulent credit
card transactions
Model metric
Success metric:
10% reduction in fraud
claims in 6-month
period
The model metric must align to both the business problem and the success metric.
107
Confusion Matrix
10% – Test data
Trained
model
Results
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Predicted
Actual
Cat
Not cat
Cat
Not cat
107
23
69
42
108
Confusion Matrix terminology
True positive (TP)
Predicted a cat and it was a cat
Predicted
Actual
Cat
Not cat
Cat
TP
FP
Not cat
FN
TN
False positive (FP)
Predicted not a cat and it was a cat
False negative (FN)
Predicted a cat and it was not a cat
True negative (TN)
Predicted not a cat and it was not a cat
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
109
Which model is better?
Confusion matrices from two models that use the same data:
Cat
Not cat
Actual
Cat
Not cat
107
23
69
42
Predicted
Predicted
Actual
Cat
Not cat
Cat
Not cat
148
53
28
12
110
Sensitivity
Sensitivity (Recall):
What percentage of cats were correctly identified?
Predicted
Actual
Cat
Not cat
Cat
TP:107
FP:23
Not cat
FN:69
TN:42
TP
sensitivity =
TP + FN
sensitivity =
!"#
!"#$%&
= 0.6079
60% of cats were identified as cats.
111
Specificity
Specificity:
What percentage of not cats were correctly identified?
Predicted
Actual
Cat
Not cat
Cat
TP:107
FP:23
Not cat
FN:69
TN:42
TN
speci0icity =
TN + FP
speci0icity =
'(
'($()
= 0.6461
64% of not cats were identified
as not cats
112
Which model is better now?
Model A
Model B
Cat
Not cat
Actual
Cat
Not cat
107
23
69
42
Predicted
Predicted
Actual
Cat
Not cat
Cat
Not cat
148
53
28
12
Sensitivity
60%
Sensitivity
84%
Specificity
64%
Specificity
18%
113
Thresholds
When models return probabilities:
Predicted
0.96
if [predicted] >= threshold
It’s a cat!
if [predicted] < threshold
NOT cat!
0.77
0.58
0.01
0.47
Changing the value of threshold can change the prediction.
114
Classification: ROC graph
Receiver operating
characteristic (ROC)
1.0
Sensitivity
True-positive rate
0.8
Plot points for each
threshold value
0.6
0.4
TPR = FPR
Select threshold based on
business case
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
1 – Specificity
False-positive rate
115
Classification: AUC-ROC
Area
Under the
Curve
1.0
Sensitivity
True-positive rate
0.8
0.6
AUC-ROC is used to
easily compare one
model to another
AUC: 89%
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
1-Specificity
False-positive rate
116
Classification: Other metrics
Accuracy =
TP + TN
(TP + TN)/(TP + TN + FP + FN)
Precision =
TP
TP + FP
precision ⋅ sensilvity
F1 score = 2 ·
precision + sensilvity
Model’s score
Proportion of positive results that
were correctly classified
Combines Precision and Sensitivity
(Recall)
117
Regression: Mean squared error
(x6,y6)
(x4,y4)
(x3,y3)
%
MSE =
1
& ๐‘ฆ" − แปน"
๐‘›
&
y
(x1,y1)
(x5,y5)
"#$
(x7,y7)
(x2,y2)
x
118
Model
Evaluation
Metrics
Summary
• Several ways to validate the model
• Hold-out
• K-fold cross validation
• Two types of model evaluation
• Classification
• Confusion matrix
• AUC-ROC
• Regression testing
• Mean squared Error
120
Hyperparameter and model tuning
Machine Learning pipeline
New data and retraining
Deploy model
Business problem
Problem
formulation
Collect and
label data
Yes
Tune model
Evaluate data
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
Data augmentation
122
ML Tuning process
Tune model
Feature
engineering
Select and train
model
Evaluate model
Meets
business
goal?
No
Feature augmentation
123
Recap: Goal for models
Balanced and generalizes well
© 2021, Amazon Web Services, Inc. or its affiliates. All rights reserved.
124
Hyperparameter categories
Model
Optimizer
Data
Help define the model
How the model learns
patterns on data
Define attributes of the data itself
Filter size, pooling,
stride, padding
Gradient descent,
stochastic gradient descent
Useful for small or homogenous
datasets
125
Tuning Hyperparameters
Manually select
hyperparameters according to
one’s intuition or experience
However, this approach is often not as thorough
or efficient as needed.
126
Python Library for Machine Learning
127
Python libraries for machine learning
• Pandas: High-level library for data importing,
manipulation and analysis (numerical tables, time
series…)
• NumPy: Math library to work with n-dimensional
arrays
• SciPy: Collection of numerical algorithms and domainspecific toolboxes (signal processing, optimization,
statistics…)
• Matplotlib: 2D / 3D plotting package
• Scikit-learn: Collection of algorithms and tools for ML
128
Scikit-learn
• Free software machine learning library
• Classification, Regression and Clustering algorithms
• Works with NumPy and SciPy
• Easy to Implement
129
Jupyter Notebook Environment
130
Jupyter Notebook Environment
• Anaconda
• Install Anaconda in your own laptop
• https://www.anaconda.com/products/individual
• Use the installed version of Anaconda in the PC in DLB311
• Kaggle
• https://www.kaggle.com/
• Google Colab
• https://colab.research.google.com/
• https://www.geeksforgeeks.org/how-to-use-google-colab/
131
Jupyter Notebook
Jupyter Notebook Users Manual
https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Us
ers%20Manual.ipynb
Jupyter Notebook Cheatsheet
http://datacamp-community-prod.s3.amazonaws.com/21fdc814-3f08-4aa9-90fa247eedefd655
132
End of Module 1
Download