
Machine Learning: Theory and Practice Presentation

Introduction to Machine Learning
Theory and Practice
David R. Pugh
Instructional Assistant Professor, KAUST
• 5+ years teaching applied machine learning and deep learning at KAUST.
• 2+ years as the director of SDAIA-KAUST AI where I work to match applied AI
problems of interest to SDAIA with AI solutions developed at KAUST.
• 15+ years experience with the core data science Python stack: NumPy, SciPy,
Pandas, Matplotlib, NetworkX, Jupyter, Scikit-Learn, PyTorch, etc.
KAUST Academy
Introduction to Machine Learning: Theory and Practice
09:00 - 09:05  Welcome and Opening Remarks (Prof. David Pugh)
09:05 - 10:30  The Machine Learning Landscape (Prof. David Pugh)
10:30 - 10:45
10:45 - 12:00  Classification and Regression (Prof. David Pugh)
12:00 - 13:00
13:00 - 14:30  Linear Regression with NumPy (Prof. David Pugh + TAs)
14:30 - 14:45
14:45 - 16:00  Introduction to Scikit-Learn (Prof. David Pugh + TAs)
Slides closely follow Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow by Aurélien Géron.
Another great reference is Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka.
The official Scikit-Learn documentation is also fantastic.
The ML Landscape
What is the difference between AI and ML?
What is ML?
ML is the science (and art) of programming computers so they can learn from
data (Géron, 2019).
[ML is the] field of study that gives computers the ability to learn without
being explicitly programmed (Samuel, 1959).
A computer program is said to learn from experience E with respect to some
task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E (Mitchell, 1997).
Why is ML so popular right now?
Stanford’s Coursera machine learning course had more than 100,000 people express interest
in its first year.
1. The field has matured both in terms of identity and in terms of methods and tools.
2. There is an abundance of data available.
3. There is an abundance of computation to run methods.
4. There have been impressive results, increasing acceptance, respect, and competition.
Resources + Ingredients + Tools + Desire = Popularity
Based on: http://machinelearningmastery.com/machine-learning-is-popular/?__s=yq1qzcnf67sfiuzmnvjf
Traditional approach is model/rules based...
...ML approach is data-driven!
ML adapts to change!
ML can help humans learn!
Types of ML systems
Supervised vs unsupervised
Semi-supervised vs self-supervised
Batch (offline) vs incremental (online)
Instance-based vs model-based
Supervised learning
Other forms of supervised learning
Semi-supervised learning
Self-supervised learning
Unsupervised learning
Data visualization
Reinforcement Learning
Batch (offline) vs incremental (online) learning
Batch (offline) Learning
Incremental (online) learning
Out-of-core learning
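A minimal sketch of incremental (online) learning, using only the Python standard library. The data stream, the linear model y = w*x + b, and the helper names `sgd_step` and `stream_batches` are assumptions made for illustration; they are not from the slides. The model sees one small mini-batch at a time and never holds the full dataset in memory, which is also the idea behind out-of-core learning.

```python
import random

def sgd_step(w, b, batch, lr):
    """One gradient step on mean squared error for the model y = w*x + b."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * grad_w, b - lr * grad_b

def stream_batches(n_batches, batch_size, seed=0):
    """Hypothetical data stream: yields mini-batches of (x, y) with y = 2x + 1."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        xs = [rng.uniform(-1, 1) for _ in range(batch_size)]
        yield [(x, 2 * x + 1) for x in xs]

# Start from an untrained model and update it as each mini-batch arrives.
w, b = 0.0, 0.0
for batch in stream_batches(n_batches=2000, batch_size=8):
    w, b = sgd_step(w, b, batch, lr=0.1)

print(round(w, 2), round(b, 2))  # close to the true values 2.0 and 1.0
```

In Scikit-Learn, estimators such as `SGDRegressor` support the same pattern through their `partial_fit` method.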
Instance-based vs model-based learning
Instance-based learning
Model-based learning
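A small sketch contrasting the two approaches on the same toy data (the data and function names are illustrative assumptions). The instance-based learner stores the training examples and answers with the label of the closest one; the model-based learner fits the parameters of a line and then predicts from the line alone.

```python
def predict_nearest_neighbor(train, x_new):
    """Instance-based: return the target of the closest stored training example."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x_new))
    return nearest_y

def fit_line(train):
    """Model-based: fit y = w*x + b by ordinary least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    w = sxy / sxx
    return w, mean_y - w * mean_x

# Toy training set that follows y = 2x exactly.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w, b = fit_line(train)
knn_prediction = predict_nearest_neighbor(train, 10.0)  # 8.0: copies the nearest example
line_prediction = w * 10.0 + b                          # 20.0: extrapolates the fitted line
print(knn_prediction, line_prediction)
```

The two predictions differ far from the training data: the nearest-neighbor answer can never leave the range of stored targets, while the fitted line extrapolates.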
Main Challenges of Applying ML
Insufficient quantity of training data
Non-representative training data
Poor quality data
Irrelevant features
Overfitting the training data
Underfitting the training data
Insufficient quantity of training data
• The more data for training the
better.
• Most ML algorithms need a lot
of data to work well.
• "Simple" problems often require
O(10k) samples.
• "Complex" problems often
require O(1M) samples.
Non-representative training data
• Training data needs to be
representative of the new data
the model must generalize to.
• Sampling noise: not enough
data => training data not
representative by chance.
• Sampling bias: poor sampling
technique => training data not
representative (biased).
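A sketch of sampling bias on a made-up population (all numbers here are assumptions chosen for illustration). A random sample recovers the true rate up to sampling noise, while a sample drawn from only one subgroup stays far off no matter how large it is.

```python
import random

# Hypothetical population of 10,000 survey answers (1 = yes, 0 = no).
# Assumption: the first 5,000 people answer yes 20% of the time and
# the last 5,000 answer yes 80% of the time.
population = [1] * 1000 + [0] * 4000 + [1] * 4000 + [0] * 1000
true_rate = sum(population) / len(population)

rng = random.Random(42)

# Random sampling: every person is equally likely to be picked.
random_sample = rng.sample(population, 1000)
random_rate = sum(random_sample) / len(random_sample)

# Biased sampling: only the first subgroup is ever surveyed.
biased_sample = rng.sample(population[:5000], 1000)
biased_rate = sum(biased_sample) / len(biased_sample)

print(f"true={true_rate:.2f} random={random_rate:.2f} biased={biased_rate:.2f}")
```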
Poor quality training data
• Data can be full of errors,
outliers, and noise (e.g., due to
poor-quality measurements).
• Dirty data => hard for
any algorithm to detect
the underlying patterns.
• A significant amount of your
time will be spent cleaning
data.
• Data types? Do you have
numeric features? Ordinal
features? Categorical features?
• Look for outliers in your data:
Remove? Fix manually?
• Look for missing data:
Remove? Impute values?
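One possible cleaning pass for a single numeric column, sketched with the standard library. The outlier rule (more than 5 median absolute deviations from the median) and the choice to impute with the median are assumptions for illustration, not recommendations from the slides; the right choice depends on the data.

```python
from statistics import median

def clean_column(values, threshold=5.0):
    """Replace missing values (None) and outliers with the median of the inliers."""
    observed = [v for v in values if v is not None]
    center = median(observed)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - center) for v in observed])

    def is_outlier(v):
        return abs(v - center) > threshold * mad

    inliers = [v for v in observed if not is_outlier(v)]
    fill = median(inliers)
    return [fill if v is None or is_outlier(v) else v for v in values]

# Hypothetical "age" column with two missing entries and one data-entry error.
ages = [22, 25, None, 31, 28, 240, 27, None, 30]
cleaned = clean_column(ages)
print(cleaned)  # [22, 25, 27.5, 31, 28, 27.5, 27, 27.5, 30]
```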
Irrelevant features
Garbage in => garbage out!
• Learning requires sufficient
relevant features (and not too
many irrelevant ones!).
• Developing a good set of
features for training is a critical
part of an ML project.
• A significant amount of
your time will be spent doing
feature engineering.
Feature engineering is often
critical to success.
• Feature selection:
selecting the "best" subset
of features for training.
• Feature extraction:
combining existing features to
produce new ones.
• Creating new features
from new data.
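A toy sketch of feature extraction followed by feature selection. The housing data, the derived `area` feature, and the correlation threshold of 0.5 are all assumptions for illustration. Here `length` alone is uncorrelated with price, yet combining it with `width` yields the most informative feature.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical raw features for five houses, plus the target.
features = {
    "width": [4, 5, 6, 8, 10],
    "length": [10, 8, 9, 7, 9],
    "house_number": [4, 2, 3, 3, 3],
}
price = [80, 80, 108, 112, 180]

# Feature extraction: combine existing features to produce a new one.
features["area"] = [w * l for w, l in zip(features["width"], features["length"])]

# Feature selection: keep features whose |correlation| with the target is high.
scores = {name: abs(pearson(column, price)) for name, column in features.items()}
selected = [name for name, score in scores.items() if score >= 0.5]
print(selected)  # ['width', 'area']
```

Correlation with the target is only one simple selection criterion; it misses features that matter only in combination, as `length` does here.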
Overfitting the training data
What is overfitting?
• Overfitting is when a model
performs well on training data
but poorly on new data.
• If the model is complex or the
training data is limited, the model
will detect spurious patterns.
• Constraining a complex
model to make it simpler is
called regularization.
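A minimal sketch of regularization, assuming an L2 (ridge) penalty as one common form; the slides do not name a specific technique. For a one-parameter model y = w*x, minimizing sum((w*x - y)^2) + alpha * w^2 has the closed-form solution w = sum(x*y) / (sum(x^2) + alpha), so a larger `alpha` constrains the slope toward zero.

```python
def fit_slope(points, alpha=0.0):
    """Slope w of y = w*x minimizing sum((w*x - y)**2) + alpha * w**2."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + alpha)

# Small, noisy toy dataset (values are made up for illustration).
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2)]

w_plain = fit_slope(points)             # no regularization
w_ridge = fit_slope(points, alpha=5.0)  # penalized: slope is pulled toward zero
print(round(w_plain, 3), round(w_ridge, 3))
```

The value of `alpha` is a hyperparameter: too small and the constraint does nothing, too large and the model underfits.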
Underfitting the training data
What is underfitting?
• Underfitting is when a model is
too simple to learn the
underlying structure of the data.
• Linear models will often
underfit (but are often a good place
to start).
How to reduce underfitting?
• Select a more complex model
(more parameters).
• Feed better features to the
model (feature engineering).
• Reduce the constraints on
the model (reduce regularization).
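A small sketch of underfitting and of one remedy from the list above (feeding the model a better feature). The quadratic toy data is an assumption for illustration: a straight line in x cannot capture it, but the same linear model fitted on the engineered feature x^2 can.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = w*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    return w, mean_y - w * mean_x

def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data with quadratic structure: y = x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

# A line in x is too simple: high error even on the training data.
w, b = fit_line(xs, ys)
mse_line = mse(xs, ys, w, b)

# Same model class, better feature: use x^2 as the input instead of x.
squared = [x ** 2 for x in xs]
w2, b2 = fit_line(squared, ys)
mse_engineered = mse(squared, ys, w2, b2)

print(mse_line, mse_engineered)
```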
Validation and Testing
Why measure generalization error?
• The only way to know if your model
is good is to measure its
performance on new data!
• Split your data into train and
test sets: the error on the test set is
an estimate of the generalization error.
• Low training error, high
generalization error =>
overfitting!
Some train-test split heuristics:
• For datasets smaller than
O(100k) samples, take 80%
for train and hold out 20%
for test.
• For larger datasets, O(1M)
samples, hold out 1-10% of the
dataset for test.
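A minimal sketch of the 80/20 heuristic using only the standard library. The helper name `split_train_test` is hypothetical; in practice Scikit-Learn's `train_test_split` from `sklearn.model_selection` does the same job. Shuffling before splitting avoids a test set that reflects only the ordering of the data.

```python
import random

def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples, then hold out a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(100))  # stand-in for 100 labeled examples
train_set, test_set = split_train_test(samples, test_fraction=0.2)
print(len(train_set), len(test_set))  # 80 20
```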
Model Selection
• Often need to tune
hyperparameters to find a good
model within a particular class
of models.
• How? Split training data into
training set and validation set.
• Always compare tuned models
using the test set!
• Validation set too small =>
might select "bad" model by
mistake.
• Validation set too large
=> training set too small!
• Cross-validation: create lots
of small validation sets,
evaluate model on each
validation set, measure
average performance across
validation sets.
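A sketch of k-fold cross-validation with the standard library. The "model" here simply predicts the mean of the training targets, and the data is made up; both are assumptions chosen to keep the example short. The data is assumed to be already shuffled, so contiguous folds are used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

def cross_validate_mean_model(ys, k):
    """Average validation MSE of a model that predicts the training mean."""
    fold_errors = []
    for train, validation in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)  # "fit" on the training fold
        mse = sum((ys[i] - prediction) ** 2 for i in validation) / len(validation)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)

ys = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 4.0, 6.0, 5.0, 5.0]
average_mse = cross_validate_mean_model(ys, k=5)
print(average_mse)
```

Every sample is used for validation exactly once and for training k - 1 times, which is why the averaged score is less sensitive to one unlucky split.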
Model selection process