Introduction to Machine Learning: Theory and Practice

David R. Pugh
Instructional Assistant Professor, KAUST
Director, SDAIA-KAUST AI

• 5+ years teaching applied machine learning and deep learning at KAUST.
• 2+ years as director of SDAIA-KAUST AI, where I work to match applied AI problems of interest to SDAIA with AI solutions developed at KAUST.
• 15+ years of experience with the core data science Python stack: NumPy, SciPy, Pandas, Matplotlib, NetworkX, Jupyter, Scikit-Learn, PyTorch, etc.

Agenda

09:00 - 09:05   Welcome and Opening Remarks      Prof. David Pugh
09:05 - 10:30   The Machine Learning Landscape   Prof. David Pugh
10:30 - 10:45   Break
10:45 - 12:00   Classification and Regression    Prof. David Pugh
12:00 - 13:00   Lunch
13:00 - 14:30   Linear Regression with NumPy     Prof. David Pugh + TAs
14:30 - 14:45   Break
14:45 - 16:00   Introduction to Scikit-Learn     Prof. David Pugh + TAs

References

• Slides closely follow Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
• Another great reference is Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka.
• The official Scikit-Learn documentation is also fantastic.

The ML Landscape

What is the difference between AI and ML?

What is ML?

• ML is the science (and art) of programming computers so they can learn from data (Géron, 2019).
• [ML is the] field of study that gives computers the ability to learn without being explicitly programmed (Samuel, 1959).
• A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E (Mitchell, 1997).

Why is ML so popular right now?

Stanford's machine learning course on Coursera had more than 100,000 students express interest in its first year. Why?

1. The field has matured, both in terms of identity and in terms of methods and tools.
2. There is an abundance of data available.
3. There is an abundance of computation available to run methods.
4. There have been impressive results, leading to increasing acceptance, respect, and competition.

Resources + Ingredients + Tools + Desire = Popularity

Based on: http://machinelearningmastery.com/machine-learning-is-popular/?__s=yq1qzcnf67sfiuzmnvjf

The traditional approach is model/rules-based...

...the ML approach is data-driven!

ML adapts to change!

ML can help humans learn!

Types of ML systems

• Supervised vs unsupervised
• Semi-supervised vs self-supervised
• Batch (offline) vs incremental (online)
• Instance-based vs model-based

Supervised learning
[figures: Classification | Regression]

Other forms of supervised learning
[figures: Semi-supervised learning | Self-supervised learning]

Unsupervised learning
[figures: Clustering | Data visualization]

Reinforcement Learning

Batch (offline) vs incremental (online) learning

Out-of-core learning
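The batch vs incremental and out-of-core slides above are figure-driven in the deck; as a minimal illustrative sketch (not from the original slides), here is out-of-core learning with Scikit-Learn's SGDRegressor, one of the estimators that exposes a partial_fit method. The synthetic chunks stand in for a dataset too large to fit in memory:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
true_coef = np.array([1.5, -2.0, 3.0])  # synthetic ground truth, for illustration only

# An estimator that supports incremental (online) learning via partial_fit.
model = SGDRegressor(learning_rate="constant", eta0=0.01)

# Stream the data in chunks, as if the full dataset could not fit in memory.
for _ in range(100):
    X_chunk = rng.normal(size=(1_000, 3))
    y_chunk = X_chunk @ true_coef + rng.normal(scale=0.1, size=1_000)
    model.partial_fit(X_chunk, y_chunk)  # update the model one chunk at a time

print(model.coef_)  # should end up close to [1.5, -2.0, 3.0]
```

A batch (offline) learner, by contrast, would have to be retrained from scratch on the full dataset every time new data arrived.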
Instance-based vs model-based learning

Main Challenges of Applying ML

• Insufficient quantity of training data
• Non-representative training data
• Poor quality data
• Irrelevant features
• Overfitting the training data
• Underfitting the training data

Insufficient quantity of training data

• The more data for training, the better!
• It can take a lot of data for most ML algorithms to work.
• "Simple" problems often require O(10k) samples.
• "Complex" problems often require O(1m) samples.

Non-representative training data

• Training data must be representative of new data for the model to generalize.
• Sampling noise: not enough data => training data is not representative, by chance.
• Sampling bias: poor sampling technique => training data is not representative (biased).

Poor quality training data

• Data can be full of errors, outliers, and noise (e.g., due to poor-quality measurements).
• Dirty data => hard for any algorithm to detect patterns.
• A significant amount of your time will be spent cleaning data.
• Data types: do you have numeric features? Ordinal features? Categorical features?
• Look for outliers in your data: remove them? Fix them manually?
• Look for missing data: remove it? Impute values?

Irrelevant features

Garbage in => garbage out!
• Learning requires sufficient relevant features (and not too many irrelevant ones!).
• Developing a good set of features for training is a critical part of any ML project.
• A significant amount of your time will be spent doing feature engineering.

Feature engineering is often critical to success:
• Feature selection: selecting the "best" subset of features for training.
• Feature extraction: combining existing features to produce new ones.
• Creating new features from new data.

Overfitting the training data

What is overfitting?
• Overfitting is when a model performs well on the training data but poorly on new data.
• If the model is complex or the training data is limited, the model will detect spurious patterns.
• Constraining a complex model to make it simpler is called regularization.

Underfitting the training data

What is underfitting?
• Underfitting is when a model is too simple to learn the underlying structure of the data.
• Linear models will often underfit (but are often a good place to start).

How to reduce underfitting?
• Select a more complex model (one with more parameters).
• Feed better features to the model (feature engineering).
• Reduce the constraints on the model (reduce regularization).

Validation and Testing

Why measure generalization error?

• The only way to know whether your model is good is to measure its performance on new data!
• Split your data into train and test sets: the error on the test set is an estimate of the generalization error (see the first sketch below).
• Low training error but high generalization error => overfitting!

Some train-test split heuristics:
• For datasets smaller than O(100k) samples, take 80% for training and hold out 20% for testing.
• For larger datasets, O(1m) samples, hold out 1-10% of the dataset for testing.

Model Selection

• You often need to tune hyperparameters to find a good model within a particular class of models.
• How? Split the training data into a (smaller) training set and a validation set.
• Always compare tuned models using the test set!
• Validation set too small => you might select a "bad" model by mistake.
• Validation set too large => the training set becomes too small!
• Cross-validation: create lots of small validation sets, evaluate the model on each one, and measure the average performance across all of them.
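A minimal sketch of the train-test split workflow from the "Why measure generalization error?" slide, using Scikit-Learn on a purely synthetic dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1_000)

# Hold out 20% of the data; the test error estimates the generalization error.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
# A test error much higher than the training error is a sign of overfitting.
```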
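And a sketch of the cross-validation idea from the Model Selection slide above, used here to tune the regularization strength of a Ridge model (an illustrative choice of model and hyperparameter, not the only option). Each candidate is scored by its average performance across five validation folds, and the test set is touched only once, at the very end:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Same synthetic data as in the previous sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # 5-fold cross-validation: average the validation score across five folds.
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

# Refit the selected model on the full training set and estimate its
# generalization error on the test set, which is used only this once.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(f"best alpha: {best_alpha}, test R^2: {final_model.score(X_test, y_test):.3f}")
```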
Model selection process

Thanks!