Introduction to Machine Learning
Theory and Practice
David R. Pugh
Instructional Assistant Professor, KAUST
Director, SDAIA-KAUST AI
• 5+ years teaching applied machine learning and deep learning at KAUST.
• 2+ years as the director of SDAIA-KAUST AI, where I work to match applied AI problems of interest to SDAIA with AI solutions developed at KAUST.
• 15+ years of experience with the core data science Python stack: NumPy, SciPy, Pandas, Matplotlib, NetworkX, Jupyter, Scikit-Learn, PyTorch, etc.
Agenda

Introduction to Machine Learning: Theory and Practice

09:00 - 09:05   Welcome and Opening Remarks      Prof. David Pugh
09:05 - 10:30   The Machine Learning Landscape   Prof. David Pugh
10:30 - 10:45   Break
10:45 - 12:00   Classification and Regression    Prof. David Pugh
12:00 - 13:00   Lunch
13:00 - 14:30   Linear Regression with NumPy     Prof. David Pugh + TAs
14:30 - 14:45   Break
14:45 - 16:00   Introduction to Scikit-Learn     Prof. David Pugh + TAs
References

• Slides closely follow Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
• Another great reference is Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka.
• The official Scikit-Learn documentation is also fantastic.
The ML Landscape
What is the difference between AI and ML?
What is ML?
• ML is the science (and art) of programming computers so they can learn from data (Geron, 2019).
• [ML is the] field of study that gives computers the ability to learn without being explicitly programmed (Samuel, 1959).
• A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E (Mitchell, 1997). For example, a spam filter's task T is flagging spam, its experience E is a set of emails labeled by users, and its performance measure P might be the fraction of emails classified correctly.
Why is ML so popular right now?
Stanford's Coursera machine learning course had more than 100,000 people express interest in its first year.

1. The field has matured, both in terms of identity and in terms of methods and tools.
2. There is an abundance of data available.
3. There is an abundance of computation to run the methods.
4. There have been impressive results, increasing acceptance, respect, and competition.

Resources + Ingredients + Tools + Desire = Popularity

Based on: http://machinelearningmastery.com/machine-learning-is-popular/?__s=yq1qzcnf67sfiuzmnvjf
Traditional approach is model/rules-based...
...ML approach is data-driven!
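To make the contrast concrete, below is a minimal sketch, not taken from the slides, of a spam filter built both ways. The keyword list, toy emails, and labels are invented for illustration; the data-driven version assumes scikit-learn.

# Traditional approach: a human writes the rules by hand.
SPAM_KEYWORDS = {"free", "winner", "credit"}  # hypothetical hand-picked keywords

def rule_based_spam_filter(email: str) -> bool:
    """Flag an email as spam if it contains any hand-picked keyword."""
    return bool(set(email.lower().split()) & SPAM_KEYWORDS)

# ML approach: the "rules" are learned from labeled examples instead.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["free credit now", "meeting at noon", "you are a winner", "see attached report"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy data)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)              # learn word-spam associations from the data
print(model.predict(["free winner"]))  # -> [1]

When spammers change tactics, the rule-based filter needs new hand-written rules, while the learned filter only needs retraining on fresh labeled data.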
ML adapts to change!
ML can help humans learn!
Types of ML systems
• Supervised vs unsupervised
• Semi-supervised vs self-supervised
• Batch (offline) vs incremental (online)
• Instance-based vs model-based
Supervised learning
[Figures: Classification | Regression]
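As a minimal sketch of the two supervised tasks, assuming scikit-learn and invented toy data: classification predicts a discrete label, while regression predicts a continuous value.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Classification: predict a discrete label (here, 0 or 1).
y_class = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[2.5]]))  # a class label: 0 or 1

# Regression: predict a continuous value.
y_reg = np.array([1.1, 1.9, 3.2, 3.9])
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[2.5]]))  # a number, roughly 2.5 for this toy data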
Other forms of supervised learning
[Figures: Semi-supervised learning | Self-supervised learning]
Unsupervised learning
[Figures: Clustering | Data visualization]
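A minimal sketch of clustering, assuming scikit-learn; the blobs are synthetic and, crucially, the labels are never shown to the algorithm.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # labels discarded
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])  # cluster assignments discovered without supervision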
Reinforcement Learning
Batch (offline) vs incremental (online) learning
[Figures: Batch (offline) learning | Incremental (online) learning]
Out-of-core learning
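A minimal sketch of incremental learning using scikit-learn's partial_fit API, which is also the usual route to out-of-core learning: stream mini-batches from disk instead of loading everything into memory at once. The mini-batch stream below is synthetic.

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
rng = np.random.default_rng(0)

for _ in range(100):                    # pretend each batch arrives over time,
    X_batch = rng.normal(size=(32, 3))  # or is read from disk chunk by chunk
    y_batch = X_batch @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=32)
    model.partial_fit(X_batch, y_batch)  # update the model one mini-batch at a time

print(model.coef_)  # converges toward the true weights [1.0, -2.0, 0.5]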
Instance-based vs model-based learning
[Figures: Instance-based learning | Model-based learning]
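A minimal sketch of the distinction, assuming scikit-learn and synthetic data: an instance-based learner (k-nearest neighbors) predicts by comparing new points to stored training examples, while a model-based learner (linear regression) fits parameters and can then discard the training set.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# Instance-based: memorize the training examples; predict from nearby neighbors.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Model-based: learn two parameters (slope and intercept); no examples needed later.
lin = LinearRegression().fit(X, y)

print(knn.predict([[5.0]]), lin.predict([[5.0]]))  # both near 11.0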
Main Challenges of Applying ML
• Insufficient quantity of training data
• Non-representative training data
• Poor quality data
• Irrelevant features
• Overfitting the training data
• Underfitting the training data
Insufficient quantity of training data
• The more data for training, the better!
• It can take a lot of data for most ML algorithms to work.
• "Simple" problems often require O(10k) samples.
• "Complex" problems often require O(1m) samples.
Non-representative training data
• Need training data to be representative of new data for generalization.
• Sampling noise: not enough data => training data not representative by chance.
• Sampling bias: poor sampling technique => training data not representative (biased).
Poor quality training data
• Data can be full of errors, outliers, and noise (e.g., due to poor-quality measurements).
• Dirty data => hard for any algorithm to detect patterns.
• A significant amount of your time will be spent cleaning data.

Questions to ask about your data:
• Data types? Do you have numeric features? Ordinal features? Categorical features?
• Look for outliers in your data: Remove? Fix manually?
• Look for missing data: Remove? Impute values?
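A minimal sketch of these clean-up steps with pandas and scikit-learn; the tiny DataFrame, with one missing value per column and an implausible age, is invented for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, 32, np.nan, 41, 250],   # NaN = missing, 250 = outlier
                   "income": [40e3, 55e3, 48e3, np.nan, 52e3]})

# Option 1 for missing data: remove the affected rows.
cleaned = df.dropna()

# Option 2 for missing data: impute values (here, the column median).
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Outliers: flag implausible values for removal or manual fixing.
print(df[df["age"] > 120])  # candidate outlier rows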
Irrelevant features
Garbage in => garbage out!

• Learning requires sufficient relevant features (and not too many irrelevant ones!).
• Developing a good set of features for training is a critical part of any ML project.
• A significant amount of your time will be spent doing feature engineering.

Feature engineering is often critical to success:
• Feature selection: selecting the "best" subset of features for training.
• Feature extraction: combining existing features to produce new ones.
• Creating new features from new data.
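A minimal sketch of feature selection and feature extraction, assuming scikit-learn; the synthetic target below depends on only two of the five candidate features.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # 5 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only 2 matter

# Feature selection: keep the k features most related to the target.
X_selected = SelectKBest(f_regression, k=2).fit_transform(X, y)

# Feature extraction: combine existing features into new ones
# (here, products and squares of the original columns).
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_selected.shape, X_poly.shape)  # (200, 2) (200, 20)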
Overfitting the training data
What is overfitting?

• Overfitting is when a model performs well on training data but poorly on new data.
• If the model is complex or the training data is limited, the model will detect spurious patterns.
• Constraining a complex model to make it simpler is called regularization.
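A minimal sketch of regularization, assuming scikit-learn: the same high-degree polynomial model is fit with and without a Ridge penalty on synthetic data, and the penalty shrinks the coefficients toward a simpler, less noise-chasing fit.

import numpy as np
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(20, 1)), axis=0)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=20)

# An unconstrained high-degree model is free to chase the noise...
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X, y)

# ...while the Ridge penalty shrinks the coefficients, constraining the model.
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0)).fit(X, y)

# The regularized coefficients are typically far smaller in magnitude.
print(np.abs(overfit[-1].coef_).max(), np.abs(regularized[-1].coef_).max())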
Underfitting the training data
What is underfitting?

• Underfitting is when a model is too simple to learn the underlying structure of the data.
• Linear models will often underfit (but are often a good place to start).

How to reduce underfitting?

• Select a more complex model (more parameters).
• Feed better features to the model (feature engineering).
• Reduce the constraints on the model (reduce regularization).
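A minimal sketch of underfitting and two of the fixes above, assuming scikit-learn and synthetic data: a plain linear model underfits a quadratic relationship, while adding polynomial features (a more complex model fed better features) fits it well.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=100)

linear = LinearRegression().fit(X, y)                    # too simple: underfits
quadratic = make_pipeline(PolynomialFeatures(degree=2),
                          LinearRegression()).fit(X, y)  # richer features fit well

print(linear.score(X, y), quadratic.score(X, y))  # R²: near 0 vs close to 1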
Validation and Testing
Why measure generalization error?
• The only way to know if your model is good is to measure its performance on new data!
• Split your data into train and test sets: the error on the test set is an estimate of the generalization error.
• Low training error, high generalization error => overfitting!

Some train-test split heuristics:
• For datasets smaller than O(100k) samples, take 80% for train and hold out 20% for test.
• For larger datasets, O(1m) samples, hold out 1-10% of the dataset for test.
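A minimal sketch of the train-test split, assuming scikit-learn and synthetic data, using the 80/20 heuristic above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=1000)

# 80% for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # test-set R² estimates generalization performance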
Model Selection
• Often you need to tune hyperparameters to find a good model within a particular class of models.
• How? Split the training data into a training set and a validation set.
• Always compare tuned models using the test set!

Sizing the validation set:
• Validation set too small => you might select a "bad" model by mistake.
• Validation set too large => training set too small!
• Cross validation: create lots of small validation sets, evaluate the model on each one, and measure average performance across them.
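A minimal sketch of this workflow, assuming scikit-learn: GridSearchCV performs the cross-validation (building the many small validation sets for you), and the held-out test set is touched only for the final comparison. The data is synthetic.

import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation over the regularization strength.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # hyperparameters chosen on the validation folds
print(search.score(X_test, y_test))  # final evaluation on the held-out test set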
Model selection process
Thanks!