Introduction Definitions The Modeling Cycle Challenges

Diploma di Data Science
Scuola Nazionale dell’Amministrazione
Introduction to predictive models - the modeling cycle
B. Guardabascio, A. Virgillito

Introduction

This section of the course focuses on predictive modeling, that is, the use of statistical techniques and Machine Learning to predict likely future outcomes based on historical data. We will learn about machine learning algorithms and their implementation in R.

Many see this as the core activity of data scientists. In fact, it is only one of the key capabilities a data scientist should master, and it is closely connected with the topics already covered in the course.

The first part of this section gives a qualitative, high-level overview of the main terminology and concepts, which will be treated in depth (with implementations in R) in the following days.

Topics

• Introduction to predictive modeling and machine learning
• Definitions
• The cycle of predictive modeling

What is Machine Learning

Machine Learning (ML) is a set of techniques through which computers solve problems based on data rather than on explicitly written programs. ML is used to extract information (business value) from data in an automated way.

Possible applications:
• Predictions about the future based on observation of the past
• Automated identification of patterns within (large amounts of) data
• Decision support

Rule-based vs. ML

[Figure]
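The contrast between rule-based programming and ML can be illustrated with a minimal Python sketch. It is not taken from the slides: the toy spam data, the function names and the hand-picked threshold of 10 are all hypothetical. The rule-based function hard-codes its decision, while the learned version derives its threshold from labeled examples.

```python
# Hypothetical toy data: (number of suspicious words, is_spam) pairs.
labeled_emails = [(0, False), (1, False), (2, False), (4, True), (5, True), (7, True)]


def rule_based_is_spam(suspicious_words):
    """Rule-based approach: a programmer hard-codes the decision threshold."""
    return suspicious_words >= 10  # hand-picked value, never checked against data


def learn_threshold(examples):
    """ML-style approach: choose the threshold that makes the fewest
    mistakes on the labeled examples."""
    candidates = sorted({x for x, _ in examples})

    def mistakes(t):
        return sum((x >= t) != label for x, label in examples)

    return min(candidates, key=mistakes)


threshold = learn_threshold(labeled_emails)


def learned_is_spam(suspicious_words):
    """Apply the threshold that was derived from the data."""
    return suspicious_words >= threshold
```

With these examples, an email containing 6 suspicious words is flagged by the learned rule but missed by the hand-written one. If the data changes, only `labeled_emails` needs updating, not the program logic.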
Machine Learning can help humans learn

[Figure]

Some definitions

• Training set: data provided to the system as a basis to “learn”
• Variables/Features/Predictors/Attributes: characteristics included in the training set (columns in a tabular data set)
• Target variable: the variable to be predicted

Types of Machine Learning

• Supervised learning: humans annotate the training set with the “correct” answers. That is, the training set comes with known values of the target variable.
• Unsupervised learning: training data is not annotated and no target variable can be identified.

Supervised learning

A sample of examples (training data) is available in which an outcome (the value of the target variable) is known together with the predictors.

Two aims:
• Predict the outcome in future settings where the target variable is not available and we only have the predictors
• Understand the interrelationships among predictors and outcome

Supervised learning

• Classification: the target variable is qualitative (categorical), that is, it can take two or more different values (classes). Example: detect whether an email is spam or not.
• Regression: the target variable is quantitative (numerical). Example: predict the price of houses from their characteristics (size, city, district, etc.).
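The difference between the two supervised tasks can be sketched in a few lines of Python. This is an illustrative sketch only: the data sets are invented, and the nearest-centroid classifier and the one-predictor least-squares line were chosen for brevity, not because the slides prescribe them.

```python
from statistics import mean

# --- Classification: the target is a class label ---------------------------
# Hypothetical training set: (number of links in the email, label).
emails = [(0, "ham"), (1, "ham"), (2, "ham"), (8, "spam"), (9, "spam"), (10, "spam")]


def fit_centroids(examples):
    """Nearest-centroid classifier: store the mean feature value of each class."""
    classes = {label for _, label in examples}
    return {c: mean(x for x, label in examples if label == c) for c in classes}


def classify(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


centroids = fit_centroids(emails)
predicted_label = classify(centroids, 7)  # the prediction is a category

# --- Regression: the target is a number ------------------------------------
# Hypothetical training set: (size in square meters, price in thousands).
houses = [(50, 150.0), (70, 210.0), (90, 270.0), (110, 330.0)]


def fit_line(examples):
    """Ordinary least squares with one predictor: price = intercept + slope * size."""
    xs = [x for x, _ in examples]
    ys = [y for _, y in examples]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in examples)
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(houses)
predicted_price = intercept + slope * 80  # the prediction is a number
```

Both models are trained on examples with known target values. The only structural difference is the type of the target: a class label in the first case, a numeric value in the second.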
Classification vs. Regression

[Figure]

Supervised learning algorithms

• Linear regression
• Logistic regression
• Decision trees
• Random forests
• Support Vector Machines
• Neural Networks

Unsupervised learning

Training data is unlabeled. The system tries to learn by detecting patterns within the predictors. Because there is no target variable, the aim is not to predict an outcome but to understand the structure of the data and the interrelationships among the predictors.

Algorithms:
• Clustering
• Dimensionality reduction and visualization
• Anomaly detection

Unsupervised learning: Clustering

Clustering algorithms automatically try to detect “similar” data points and associate them in groups (clusters). Analysis of the emerging clusters can highlight interesting patterns in the data that may not be evident through traditional analysis techniques, for example because groups are defined by more than one variable.

Unsupervised learning: dimensionality reduction

Related to clustering are the following two tasks, also part of unsupervised learning:

• Dimensionality reduction: simplify the data by removing unnecessary or redundant features, or apply unsupervised techniques to collapse several features into a “compact” representation (e.g. Principal Component Analysis).
• Visualization: graphical representation of 2D or 3D data, often the result of dimensionality reduction, that can help highlight the density zones corresponding to clusters and relate them to data features.

Unsupervised learning: Anomaly detection

Unsupervised learning techniques can also be used to detect data points that deviate from the norm, for example for automated outlier removal or for specific business applications (e.g. recognizing unusual credit card transactions).

Model validation

How do we say that a (supervised) model is “good” at predicting? First we need to define some form of metric. Roughly, in classification you can simply count the number of correct predictions, while in regression you can consider a measure of the error between the predicted values and the correct ones. Several metrics can be used, depending on which aspect matters most in the problem at hand.

Generalization

The better a model is, the more it can generalize to “unseen” data, i.e. records different from those used for training. The only way to know how well a model will generalize to new cases is to actually try it out on new cases. This is done by setting aside a subset of the labeled data as the test set: you train your model on the training set and test it on the test set.

The error rate on new cases is called the generalization error (or out-of-sample error). By evaluating your model on the test set, you get an estimate of this error, which tells you how well your model will perform on instances it has never seen before.
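The validation ideas above (a metric, a held-out test set, an estimate of the generalization error) can be sketched in plain Python. Everything here is an assumption made for illustration: the synthetic data, the 70/30 split, the helper names and the toy threshold classifier.

```python
import random


def accuracy(y_true, y_pred):
    """Classification metric: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Regression metric: average squared gap between prediction and truth."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def train_test_split(records, test_fraction, seed):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_threshold(train):
    """Toy classifier: predict True when x > t, with t minimizing training errors."""
    candidates = sorted(x for x, _ in train)
    return min(candidates, key=lambda t: sum((x > t) != y for x, y in train))


# Hypothetical labeled data: the label is True when x > 5,
# with every 17th label flipped to imitate noise.
records = [(i / 10, (i / 10 > 5) != (i % 17 == 0)) for i in range(100)]

train, test = train_test_split(records, test_fraction=0.3, seed=0)
t = fit_threshold(train)

train_accuracy = accuracy([y for _, y in train], [x > t for x, _ in train])
test_accuracy = accuracy([y for _, y in test], [x > t for x, _ in test])
generalization_error_estimate = 1 - test_accuracy
```

The model never sees `test` while fitting, so `generalization_error_estimate` approximates the error on new cases. For a regression model, `mean_squared_error` would take the place of `accuracy`.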
Underfitting vs. Overfitting

• Underfitting occurs when a model is too simple and cannot represent the characteristics of the training data well.
• Overfitting refers to models that adapt “too well” to the training data but have poor predictive performance on the test set.

Hyperparameters

So evaluating a model seems quite simple: just use a test set. If you are hesitating between two models, you can train both and compare how well they generalize using the test set.

However, the behavior of a model typically does not depend only on the training data you feed it, but also on specific characteristics of each algorithm, represented by numerical values that the data scientist can adjust. These are called hyperparameters.

Hyperparameter tuning is an important phase of the modeling cycle and strongly affects model performance. In the following we will introduce methods to carry out hyperparameter tuning correctly, in order to prevent overfitting.

Model development

• Retrieve the dataset
• Discover and visualize the data to gain insights
• Prepare the data for Machine Learning algorithms (select features, remove outliers, build new features, etc.)
• Iterate:
   • Select a model and train it
   • Evaluate performance
   • Fine-tune hyperparameters
• Present the solution and put it in production
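The iterate step of the cycle (train, evaluate, fine-tune hyperparameters) can be sketched as a small Python loop. This is an illustration under stated assumptions: the data is synthetic, the model is a one-feature k-nearest-neighbours classifier with `k` as its hyperparameter, and a separate validation set is used for tuning. That last choice is one common approach, not necessarily the method introduced later in the course.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """k-nearest-neighbours vote on one numeric feature; k is the hyperparameter."""
    neighbours = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def error_rate(train, data, k):
    """Fraction of records in `data` misclassified by the k-NN model built on `train`."""
    return sum(knn_predict(train, x, k) != y for x, y in data) / len(data)


# 1. Retrieve the dataset (hypothetical and synthetic: the label is x > 5,
#    with roughly 15% of the labels flipped as noise).
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.uniform(0, 10)
    data.append((x, (x > 5) != (rng.random() < 0.15)))

# 2. Prepare the data: split into training, validation and test sets.
train, validation, test = data[:180], data[180:240], data[240:]

# 3. Iterate: train, evaluate on the validation set, adjust the hyperparameter.
#    Odd values of k avoid tied votes between the two classes.
candidate_ks = [1, 5, 15, 45]
validation_errors = {k: error_rate(train, validation, k) for k in candidate_ks}
best_k = min(validation_errors, key=validation_errors.get)

# 4. Report: the test set is used once, only for the chosen model.
test_error = error_rate(train, test, best_k)
```

Choosing `k` on the validation set rather than on the test set keeps the final `test_error` an honest estimate. Tuning directly on the test set would tend to overfit the hyperparameter to that particular set.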
Challenges

Insufficient training data

Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples. With enough training data, differences among algorithms may blur. However, obtaining labeled data in large quantities is often neither easy nor cheap, so it is often necessary to work on the algorithms instead.

Challenges

Non-representative training data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. If the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance). If the sampling method is flawed, you can have sampling bias even with large samples.

[Figure: the dotted line is a linear model trained on the blue points only; the solid line is the same model trained after adding the red points.]

Challenges

Poor-quality data

If training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will be harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data.

If some instances are clearly outliers, it may help to simply discard them or to try to fix the errors manually. If some instances are missing a few features, several options exist: ignore the attribute altogether, ignore those instances, fill in the missing values, or train one model with the feature and one model without it.
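Three of the options for handling missing features can be sketched in Python as follows. The records and helper names are hypothetical, and filling with the median is only one possible choice, since the slides do not say which fill value to use.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
rows = [
    {"size": 50, "rooms": 2, "price": 150},
    {"size": 70, "rooms": None, "price": 210},
    {"size": None, "rooms": 3, "price": 260},
    {"size": 110, "rooms": 4, "price": 330},
]


def drop_attribute(rows, name):
    """Option 1: ignore the attribute altogether."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def drop_incomplete(rows):
    """Option 2: ignore the instances that have any missing value."""
    return [row for row in rows if None not in row.values()]


def fill_with_median(rows, name):
    """Option 3: fill in missing values with the median of the observed ones
    (the median is an assumption; the mean or a model-based fill also works)."""
    fill = median(row[name] for row in rows if row[name] is not None)
    return [{**row, name: fill if row[name] is None else row[name]} for row in rows]


without_rooms = drop_attribute(rows, "rooms")
complete_only = drop_incomplete(rows)
rooms_filled = fill_with_median(rows, "rooms")
```

Each function returns new records and leaves `rows` untouched, so the options can be compared side by side. Dropping instances loses data (here half of the rows), while filling keeps every record at the cost of introducing estimated values.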
Challenges

Irrelevant features

Features with little influence on the prediction target can introduce noise and bias rather than help. The construction of a good set of features for training is called feature engineering and roughly consists of:

• Feature selection: choosing only the most useful features and getting rid of irrelevant or harmful ones.
• Feature extraction: combining existing features to produce more useful ones.
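Both parts of feature engineering can be sketched in Python. The housing columns are invented, and ranking features by their absolute Pearson correlation with the target, with a cutoff of 0.5, is an assumed and deliberately simple selection criterion. The slides do not prescribe a specific method.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical housing data, one list per column.
columns = {
    "width_m": [5, 6, 8, 10, 7, 9],
    "length_m": [10, 9, 10, 11, 12, 10],
    "door_number": [3, 7, 6, 2, 9, 5],  # unrelated to the price by construction
}
price = [150, 162, 240, 330, 252, 270]

# Feature selection: keep only features whose absolute correlation with the
# target reaches a (hypothetical) cutoff.
CUTOFF = 0.5
selected = [name for name, values in columns.items()
            if abs(pearson(values, price)) >= CUTOFF]

# Feature extraction: combine existing features into a more useful one.
area_m2 = [w * l for w, l in zip(columns["width_m"], columns["length_m"])]
area_correlation = pearson(area_m2, price)
```

In this toy data the price is proportional to the area, so the extracted `area_m2` feature correlates with the target more strongly than either width or length alone, while `door_number` is filtered out. Correlation only captures linear, one-feature-at-a-time relationships, so this is a starting point, not a complete selection procedure.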