Uploaded by Maria Marrocco Trischitta

T01-Intro

Introduction
Definitions
The Modeling Cycle
Challenges
Diploma di Data Science
Scuola Nazionale dell’Amministrazione
Introduction to predictive models - the modeling cycle
B. Guardabascio, A. Virgillito
Introduction
This section of the course focuses on predictive modeling, that is, the
use of statistical techniques and Machine Learning to predict likely future
outcomes based on historical data.
We will learn about machine learning algorithms and their implementation
in R.
Many see this as the core activity of data scientists. In fact, it is
only one of the key capabilities a data scientist should master, and it is
closely connected with the topics already covered in the course.
The first part of this section gives a qualitative, high-level overview of
the main terminology and concepts, which will be treated in depth
(with implementations in R) in the following days.
Topics
• Introduction to predictive modeling and machine learning
• Definitions
• The cycle of predictive modeling
What is Machine Learning
Machine Learning (ML) is a set of techniques through which computers
can solve problems by learning from data rather than by following
explicitly written programs.
ML is used to extract information (business value) from data in an
automated way.
Possible applications:
• Predictions about the future based on observations of the past
• Automated identification of patterns within (large amounts of) data
• Support to decisions
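The contrast between an explicitly written rule and a rule learned from data can be sketched in a few lines. This is only an illustration in Python (the course itself uses R): the toy emails, the `rule_based` function and the `learn_threshold` helper are all hypothetical names and data, not part of the course material.

```python
# Hypothetical toy data: (number of exclamation marks, is_spam)
emails = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]

def rule_based(n_exclam):
    """Explicit program: the threshold 3 is chosen by the programmer."""
    return n_exclam > 3

def learn_threshold(data):
    """Learn the threshold from data: try the midpoints between
    consecutive observed values and keep the one with the fewest errors."""
    values = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((x > t) != is_spam for x, is_spam in data)

    return min(candidates, key=errors)

threshold = learn_threshold(emails)
print("learned threshold:", threshold)
print("rule-based says spam for 6 marks:", rule_based(6))
print("learned rule says spam for 6 marks:", 6 > threshold)
```

If the data changes, the learned threshold changes with it, while the hand-written rule has to be edited by a person.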
Rule-based vs. ML
Machine Learning can help humans learn
Some definitions
• Training set: the data provided to the system as a basis to
“learn”
• Variables/features/predictors/attributes: the characteristics included
in the training set (columns in a tabular data set)
• Target variable: the variable to be predicted
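As a small illustration of these terms, the sketch below (in Python, with made-up house data and hypothetical column names) separates a tabular training set into its features and its target variable.

```python
# Hypothetical training set: each row is one example, each key one column
training_set = [
    {"size_m2": 50, "rooms": 2, "price": 150_000},
    {"size_m2": 80, "rooms": 3, "price": 240_000},
    {"size_m2": 120, "rooms": 5, "price": 350_000},
]

target = "price"                                          # target variable
features = [c for c in training_set[0] if c != target]    # predictors

# X holds the feature values, y the known values of the target
X = [[row[f] for f in features] for row in training_set]
y = [row[target] for row in training_set]

print(features)
print(X)
print(y)
```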
Types of Machine Learning
• Supervised learning: humans annotate the training set with the
“correct” answers. That is, the training set comes with known
values of the target variable.
• Unsupervised learning: the training data is not annotated and no
target variable is identified.
Supervised learning
A sample of examples (the training data) is available in which the outcome
(the value of the target variable) is known together with the predictors.
Two aims:
• Predict the outcome in future settings where the target variable
is not available and only the predictors are known
• Understand the interrelationships among the predictors and the outcome
Supervised learning
• Classification: the target variable is qualitative (categorical), that
is, it can take two or more distinct values (classes). Example:
detect whether an email is spam or not.
• Regression: the target variable is quantitative (numerical). Example:
predict the price of houses, starting from their characteristics (size,
city, district, etc.).
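The difference between the two tasks can be sketched with two deliberately simple models in Python. The data, the nearest-class-mean classifier and the single-feature least-squares line are assumptions chosen for brevity; they are not the specific algorithms the course prescribes.

```python
def mean(values):
    return sum(values) / len(values)

# --- Classification: the target is a category (spam / ham) ---
links = [0, 1, 2, 8, 9, 10]                      # hypothetical feature
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

# "Training": store the mean feature value of each class
class_means = {
    c: mean([x for x, y in zip(links, labels) if y == c]) for c in set(labels)
}

def classify(x):
    """Predict the class whose mean is closest to x."""
    return min(class_means, key=lambda c: abs(class_means[c] - x))

# --- Regression: the target is a number (price in thousands) ---
sizes = [50, 70, 90, 110]                        # hypothetical feature
prices = [150, 210, 270, 330]

mx, my = mean(sizes), mean(prices)
slope = sum((x - mx) * (y - my) for x, y in zip(sizes, prices)) / sum(
    (x - mx) ** 2 for x in sizes
)
intercept = my - slope * mx

def predict_price(size):
    """Predict a numerical value from the fitted line."""
    return slope * size + intercept

print(classify(7), classify(3))
print(predict_price(100))
```

The classifier returns one of a fixed set of labels, while the regression model returns a number on a continuous scale.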
Classification vs. Regression
Supervised learning Algorithms
• Linear regression
• Logistic regression
• Decision trees
• Random forests
• Support Vector Machines
• Neural Networks
Unsupervised learning
The training data is unlabeled. The system tries to learn by detecting
patterns within the predictors.
There is no target variable to predict, so the aim is to understand the
structure of the data and the interrelationships among the predictors.
Algorithms:
• Clustering
• Dimensionality reduction and visualization
• Anomaly detection
Unsupervised learning: Clustering
Clustering algorithms automatically try to detect “similar” data points
and associate them in groups (clusters).
Analyzing the emerging clusters can highlight interesting patterns in the
data that may not be evident through traditional analysis techniques, for
example because the groups are defined by more than one variable.
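A minimal sketch of one clustering algorithm, k-means, is given below in plain Python. The choice of k-means, the naive initialization with the first k points, and the toy 2D points are assumptions made for illustration; the slides do not name a specific algorithm here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=10):
    # Naive initialization (an assumption): use the first k points
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical data: two visually separated groups, each defined by
# both coordinates together rather than by a single variable
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

No labels are supplied: the grouping emerges only from the distances between the points.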
Unsupervised learning: dimensionality reduction
Related to clustering are the following two tasks, also part of
unsupervised learning:
• Dimensionality reduction: simplify the data by removing unnecessary
or redundant features, or apply unsupervised techniques to collapse
several features into a “compact” representation (e.g. Principal
Component Analysis).
• Visualization: graphical representation of 2D or 3D data, often
the result of dimensionality reduction, which can help highlight the
dense zones corresponding to clusters and relate them to data
features.
Unsupervised learning: Anomaly detection
Unsupervised learning techniques can also be used to detect data points
that deviate from the norm, for example for automated outlier removal or
for specific business applications (e.g. recognizing unusual credit card
transactions).
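One simple way to flag points that deviate from the norm is to measure each value's distance from the median in units of the median absolute deviation (MAD). The sketch below uses this approach in Python; the transaction amounts and the threshold of 5 are hypothetical, and real applications typically use more elaborate methods.

```python
from statistics import median

def mad_outliers(values, threshold=5.0):
    """Return the values lying more than `threshold` MADs from the median.
    The threshold is an illustrative assumption, not a standard value."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All typical values are identical: anything different is unusual
        return [v for v in values if v != med]
    return [v for v in values if abs(v - med) / mad > threshold]

# Hypothetical credit card transaction amounts
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
anomalies = mad_outliers(amounts)
print(anomalies)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value distorts them much less.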
Model validation
How can we tell whether a (supervised) model is “good” at predicting?
First we need to define some form of metric.
Roughly, in classification you can simply count the number of correct
predictions while in regression you can consider a measure of the error
between the predicted values and the correct ones.
Several metrics can be used, depending on which aspect matters most
in our problem.
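Two common metrics matching this description are accuracy (the share of correct predictions) for classification and mean squared error for regression. The Python sketch below computes both on made-up predictions; the function names are illustrative.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the correct class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between correct and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Classification: 3 of the 4 hypothetical predictions are correct
acc = accuracy(["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"])

# Regression: squared errors are 1, 0 and 4
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
rmse = math.sqrt(mse)  # same units as the target variable

print(acc, mse, rmse)
```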
Generalization
The better a model is, the better it generalizes to “unseen” data, i.e.
records different from those used for training.
The only way to know how well a model will generalize to new cases is to
actually try it out on new cases. This is done by using a subset of the
labeled data as the test set.
You train your model using the training set, and you test it using the test
set. The error rate on new cases is called the generalization error (or
out-of-sample error).
By evaluating your model on the test set, you get an estimate of this
error. This value tells you how well your model will perform on instances
it has never seen before.
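The procedure can be sketched as follows in Python. The synthetic data, the 80/20 split ratio and the simple least-squares line are assumptions made for illustration only.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical labeled data: y = 2x + 1 plus Gaussian noise
data = []
for _ in range(50):
    x = rng.uniform(0, 10)
    data.append((x, 2 * x + 1 + rng.gauss(0, 1)))

# Hold out 20% of the labeled data as the test set
rng.shuffle(data)
n_test = len(data) // 5
test_set, training_set = data[:n_test], data[n_test:]

def fit_line(points):
    """Least-squares line through (x, y) points: (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = sum((x - mx) * (y - my) for x, y in points) / sum(
        (x - mx) ** 2 for x, _ in points
    )
    return slope, my - slope * mx

def mse(points, slope, intercept):
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

# Train on the training set only, then evaluate on the held-out test set
slope, intercept = fit_line(training_set)
training_error = mse(training_set, slope, intercept)
test_error = mse(test_set, slope, intercept)  # estimate of generalization error

print(f"training MSE = {training_error:.3f}, test MSE = {test_error:.3f}")
```

The test records play no part in fitting the line, which is what makes the test error a fair estimate of performance on new cases.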
Underfitting vs. Overfitting
• Underfitting occurs when a model is too simple to represent
the characteristics of the training data well.
• Overfitting refers to models that adapt “too well” to the training data
but have poor predictive performance on the test set.
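The two failure modes can be illustrated by comparing three models of increasing flexibility on the same noisy data. In this Python sketch, the synthetic data and the choice of models (a constant, a straight line, and a 1-nearest-neighbour predictor that memorizes the training set) are assumptions made for illustration.

```python
import random

rng = random.Random(1)

def sample(n):
    """Hypothetical noisy linear data: y = 2x + Gaussian noise."""
    pts = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        pts.append((x, 2 * x + rng.gauss(0, 2)))
    return pts

train, test = sample(30), sample(30)

def constant_model(points):
    """Too simple: always predicts the mean of the training targets."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y

def linear_model(points):
    """Least-squares straight line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = sum((x - mx) * (y - my) for x, y in points) / sum(
        (x - mx) ** 2 for x, _ in points
    )
    return lambda x: slope * (x - mx) + my

def one_nn_model(points):
    """Too flexible: returns the target of the closest training point,
    so it reproduces the training data (noise included) exactly."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, points):
    return sum((y - model(x)) ** 2 for x, y in points) / len(points)

results = {}
for name, build in [
    ("constant", constant_model),
    ("linear", linear_model),
    ("1-NN", one_nn_model),
]:
    model = build(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:8s} train MSE = {results[name][0]:7.3f}  "
          f"test MSE = {results[name][1]:7.3f}")
```

The constant model has a large error even on the training data (underfitting), while the 1-NN model has zero training error by construction, so any error it shows on the test set reflects noise it has memorized (overfitting).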
Hyperparameters
So evaluating a model seems quite simple: just use a test set. If you are
hesitating between two models you can train both and compare how well
they generalize using the test set.
However, the behavior of a model typically depends not only on the
training data you feed it but also on specific characteristics of each
algorithm, represented by numerical values that can be adjusted by the
data scientist. These are called hyperparameters.
Hyperparameter tuning is an important phase of the modeling cycle, which
strongly affects model performance.
In the following we will introduce methods to carry out hyperparameter
tuning correctly, in order to prevent overfitting.
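As a preview, one common approach is to set aside a separate validation set for choosing hyperparameters and to use the test set only once at the end. The Python sketch below tunes k, the number of neighbours in a k-nearest-neighbours regressor; the data, the candidate values and the choice of algorithm are assumptions, and this is not necessarily the method presented later in the course.

```python
import random

rng = random.Random(2)

def sample(n):
    """Hypothetical noisy data: y = 0.5x + Gaussian noise."""
    pts = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        pts.append((x, 0.5 * x + rng.gauss(0, 1)))
    return pts

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbours) / k

def mse(train, points, k):
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in points) / len(points)

train, validation, test = sample(60), sample(20), sample(20)

# k is a hyperparameter: it is not learned from the training data
candidate_ks = [1, 3, 5, 9, 15]
validation_errors = {k: mse(train, validation, k) for k in candidate_ks}
best_k = min(validation_errors, key=validation_errors.get)

# The test set is touched only once, after k has been chosen
test_error = mse(train, test, best_k)
print(validation_errors)
print(f"best k = {best_k}, test MSE = {test_error:.3f}")
```

Choosing k by its test-set error instead would make the test error an optimistic estimate, because the hyperparameter would have been fitted to that particular test set.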
Model development
• Retrieve the dataset
• Discover and visualize the data to gain insights
• Prepare the data for Machine Learning algorithms (select features,
remove outliers, build new features, etc.)
• Iterate:
  - Select a model and train it
  - Evaluate performance
  - Fine-tune hyperparameters
• Present the solution and put it in production
Challenges
Insufficient training data
Even for very simple problems you typically need thousands of
examples, and for complex problems such as image or speech
recognition you may need millions of examples.
Given enough training data, the differences among algorithms may
blur.
However, obtaining labeled data in large quantities is often neither easy
nor cheap, so it is often necessary to work on the algorithms instead.
Challenges
Non-representative training data
In order to generalize well, it is crucial that your training data be
representative of the new cases you want to generalize to.
If the sample is too small, you will have sampling noise (i.e.,
nonrepresentative data as a result of chance). If the sampling method
is flawed, you can have sampling bias even with large samples.
Figure: the dotted line is a linear model trained on the blue points only; the solid line
is the same model trained after adding the red points.
Challenges
Poor-quality data
If the training data is full of errors, outliers, and noise (e.g., due to
poor-quality measurements), it is harder for the system to detect the
underlying patterns, so the system is less likely to perform well. It is
often well worth the effort to spend time cleaning up your training data.
• If some instances are clearly outliers, it may help to simply discard
them or try to fix the errors manually.
• If some instances are missing a few features, several options exist:
ignore the attribute altogether, ignore these instances, fill in the
missing values, or train one model with the feature and one model
without it.
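Two of the options for missing values, ignoring the affected instances and filling in the missing values, can be sketched as follows in Python. The records are made up, and filling with the column mean is just one possible strategy among several.

```python
# Hypothetical records; None marks a missing value
rows = [
    {"age": 30, "income": 2000},
    {"age": None, "income": 2500},
    {"age": 50, "income": None},
    {"age": 40, "income": 3000},
]

# Option 1: ignore the instances that have any missing value
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Option 2: fill each missing value with the mean of its column
def column_mean(records, column):
    observed = [r[column] for r in records if r[column] is not None]
    return sum(observed) / len(observed)

means = {col: column_mean(rows, col) for col in rows[0]}
imputed_rows = [
    {col: (means[col] if value is None else value) for col, value in r.items()}
    for r in rows
]

print(complete_rows)
print(imputed_rows)
```

Dropping instances is simple but discards half of this small data set, while imputation keeps every record at the cost of introducing estimated values.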
Challenges
Irrelevant features
Features that have little influence on the prediction target can
introduce noise and bias rather than help. The construction of a good
set of features for training is called feature engineering and roughly
consists of:
• Feature selection: choosing only the most useful features and
getting rid of irrelevant or harmful ones.
• Feature extraction: combining existing features to produce more
useful ones.
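A very simple form of feature selection keeps only the features whose correlation with the target exceeds a threshold. The Python sketch below applies this filter to made-up house data; the feature names, the values and the threshold of 0.5 are assumptions, and correlation filtering is only one of many selection techniques.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: size should matter for price, door number should not
features = {
    "size_m2": [50, 60, 70, 80, 90, 100],
    "door_number": [3, 1, 4, 1, 5, 2],
}
price = [150, 185, 205, 245, 270, 300]

threshold = 0.5  # illustrative cut-off
correlations = {name: pearson(values, price) for name, values in features.items()}
selected = [name for name, r in correlations.items() if abs(r) > threshold]

print(correlations)
print(selected)
```

A filter like this only detects linear, one-feature-at-a-time relationships, so a feature it discards may still be useful in combination with others.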