Data Science and Public Policy Badge
Introduction to Machine Learning

Terminologies
• The differences between popular terminologies

What does machine learning do?
• As opposed to (linear) regression?
• Regression centers on inference
  • Not causal inference; in most cases, linear regression infers association, not causality
  • Inference from a sample to a population
  • Hence p-values: the probability of an "abnormality" arising beyond sampling variation
  • Relatively small samples compared to populations
  • Numerical data (text translated into numbers, such as frequencies)
• ML centers on prediction
  • Accuracy reigns supreme
  • "Big data"
  • The 3 Vs: volume, velocity, and variety

Commonality and difference
• Commonality
  • Neither regression nor ML is causal
• Contrast
  • Regression focuses on inference; ML focuses on prediction and predictive accuracy
  • Regression imposes functional forms; ML is often free of imposed forms
  • Regression rests on assumptions (linearity, i.i.d. observations, independence among variables); ML is often free of such assumptions
  • Regression works with relatively small samples; ML works with "big data"

R or Python?
• R is rooted in the S language (commercialized as S-PLUS) from the 1970s-80s
  • A statistical programming language
  • Widely used for regression, with and without ML
• Python also dates back to the 1980s
  • A general-purpose programming language
  • Widely used in web development
• Not a right-or-wrong choice; it is up to your preference
  • Functionality or packages developed in one will likely become available in the other
  • "caret" as an example
• One justification for choosing Python
  • If your work focuses on web scraping, social media, and other web-related topics
  • Seamless integration with websites, because many websites are written in Python

Types of ML
• Supervised ML
  • Supervised by whom or what?
    • Not by me, for sure
    • By the response/outcome variable, i.e., the dependent variable in regression language
  • Train an algorithm to predict an (actually observed) outcome
    • Classification if the outcome is discrete
    • Regression if the outcome is continuous
• Unsupervised ML
  • Data mining
  • Pattern identification
  • Clustering

Types of supervised ML
• Regression
  • If the response (dependent) variable is continuous
  • Examples:
    • Willingness to pay at flight/hotel booking websites
    • Estimated wait time or congestion time
• Classification
  • If the response variable is discrete (binary or multiclass)
  • Examples:
    • Cancer or not
    • Face recognition: open the door or not? Terrorist or not?
    • Netflix recommendations
• Self-driving cars: supervised or unsupervised?

Hands-on machine learning

Workflow of ML
• Data collection
• Data cleaning and preparation
• Model (algorithm) selection
• Training the model
• Testing and improving (retraining) the model
• Application and making predictions
• This course follows the same workflow

Commonly used algorithms
• Random forest
  • Tree-based
• Support vector machines
  • Effective in high-dimensional feature spaces
• Extreme gradient boosting
  • Tree-based
• All three can be used for both regression and classification problems

Training a model: training is learning
• The process of feeding an ML algorithm data so that it identifies and learns good values for all the parameters involved
• A key step, because the performance of the trained algorithm determines the quality of its applications
• Considerations for a "good" trained model
  • Complexity (parsimony), interpretability, performance, computing requirements, etc.
  • Performance/accuracy only, for now
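The slides stay language-agnostic, so as a minimal hands-on sketch, here is what training and testing might look like in Python with scikit-learn; the synthetic data, feature count, and hyperparameters are illustrative assumptions, not part of the course material:

```python
# Minimal sketch: train and evaluate a random forest regressor.
# scikit-learn and the synthetic data here are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))               # 500 cases, 5 features
y = 2 * X[:, 0] + rng.normal(size=500)      # continuous outcome with noise

# Hold back 30% as the test set (the 70/30 split discussed below)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)                 # training is learning

pred = model.predict(X_test)                # testing is evaluating
print("Test MAE:", mean_absolute_error(y_test, pred))
```

For a classification problem, the same pattern applies with a discrete outcome and, for example, RandomForestClassifier in place of the regressor.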
Testing a model: testing is evaluating
• The process of obtaining an unbiased evaluation of the performance/accuracy of the final trained algorithm
• Such an unbiased performance evaluation can be used for model comparison and selection
• Train vs. test datasets
  • The test dataset is a sample of the data held back from training the model
  • It is critical to keep the test set completely separate from training
  • "Peeking" is strongly discouraged
  • The train-test split approach (e.g., 70% vs. 30%)
• Validation set
  • Many use "validation set" and "test set" interchangeably
  • More strictly, a validation set is held back from training (like a test set) but is used to tune the model (unlike a test set)
  • k-fold cross-validation

Bias and variance
• Bias: the model's predictions differ from the target values in the training data
  • A high level of bias can lead to underfitting
  • Underlying structure vs. noise
• Variance: inconsistency of predictions across different datasets (test or brand-new)
  • A high level of variance can lead to overfitting
  • Generalization
• The bias-variance trade-off

ML example: regression

Background and setting
• 311 requests
  • Submitted by phone call, online, or app
  • Non-emergency public service requests or inquiries
  • Trash pickup, potholes, noise, service outages
  • During COVID, inquiries about test sites and vaccine availability
• 311 requests in Miami-Dade County (MDC)
  • Q: Can we predict request volumes in small geographic areas?
  • Geographic unit: census tracts, 529 in MDC
  • Response variable: counts (total or by category)
  • Features: housing, demographic, and economic characteristics, plus autoregressive counts

Accuracy indicators
• MAE (mean absolute error)
• RMSE (root mean squared error)
• R-squared (coefficient of determination)
• In most cases, they provide consistent findings

ML example: classification

Background and setting
• Still 311 requests in MDC
• Unit of analysis: individual requests
• A unique feature in the MDC 311 data: expected completion time
  • Such a feature may not be available in other cities
• Q: What determines a shorter-than-expected completion time?
  • Binary
  • Alternatively, multiclass (shorter, on time, and longer)
• Features: community and request characteristics

Confusion matrix
• A 2x2 table of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN)

Sensitivity and specificity
• Sensitivity (aka recall) = TP/(TP+FN)
  • Numerator: correctly labeled/identified cases having a certain attribute
  • Denominator: all cases having the attribute, whether detected or not
  • Example: detecting brain tumors
  • FN is intolerable: we want to reduce missed brain tumors
• Specificity = TN/(TN+FP)
  • Numerator: correctly labeled/identified cases NOT having the attribute
  • Denominator: all cases NOT having the attribute, whether detected or not
  • Example: drug testing for employment
  • FP (a false alarm) is intolerable: we don't want to mistakenly take away employment opportunities

Accuracy and precision
• Accuracy = (TP+TN)/(TP+FP+FN+TN)
  • Numerator: all correctly labeled/identified cases (positives and negatives)
  • Denominator: all cases
  • How many cases did we label correctly out of all cases?
  • A balanced view of positives and negatives
• Precision = TP/(TP+FP)
  • Numerator: correctly labeled/identified cases having a certain attribute
  • Denominator: all cases labeled/identified as having the attribute (true and false)
  • How many of the labeled cases actually have the attribute?
  • Being more confident in your true positives

Relations
• Sensitivity vs. specificity
  • Which is more unacceptable: a false negative or a false positive?
• Sensitivity vs. precision
  • Precision is how sure you are of your true positives (among predicted positives), while sensitivity is how sure you are that you are not missing any (true) positives
• Accuracy vs. precision
  • A balanced view of TP and TN vs. TP out of labeled positives
  • TN may be disproportionately large in number but substantively unimportant
  • Excellent prediction of individuals who are not terrorists... did we miss the point?
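A minimal numeric sketch of these four metrics; the 2x2 counts below are hypothetical, deliberately chosen so that a disproportionately large TN count inflates accuracy, the point raised in the Relations slide:

```python
# Minimal sketch: the four classification metrics from hypothetical counts.
# TP, FP, FN, TN are made up for illustration; note the very large TN.
TP, FP, FN, TN = 80, 10, 20, 890

sensitivity = TP / (TP + FN)                    # aka recall: 0.80
specificity = TN / (TN + FP)                    # ~0.99
accuracy = (TP + TN) / (TP + FP + FN + TN)      # 0.97
precision = TP / (TP + FP)                      # ~0.89

print(f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
print(f"accuracy={accuracy:.2f}  precision={precision:.2f}")
```

Here accuracy looks excellent (0.97) mostly because negatives dominate, even though one in five true positives is missed; sensitivity is what exposes that gap.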
ML example: ensemble

Spaghetti forecast

Ensemble prediction
• Take advantage of individual, independent predictions
  • The pros and cons of different models
  • The nature and characteristics of the datasets
  • Reduced generalization error when uncorrelated models are combined
• How to combine models? (see the sketch at the end of these notes)
  • Equal-weight average
  • Weighted average
  • Stacking
    • Individual predictions serve as inputs for an ensemble prediction
  • Many more
• Why not simply choose the single model with the highest predictive accuracy?

Additional resources

Resources
• There are so many...
• Books:
  • Applied Predictive Modeling, by Max Kuhn and Kjell Johnson
• YouTube:
  • StatQuest with Josh Starmer
  • Andrew Ng, Stanford CS229: Machine Learning
  • Andrew Ng, Stanford CS230: Deep Learning
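As promised in the ensemble slide, a minimal sketch of equal-weight averaging; all data and models are illustrative assumptions (scikit-learn's GradientBoostingRegressor stands in for extreme gradient boosting here), and a weighted average or stacking would replace the plain mean:

```python
# Minimal sketch: equal-weight ensemble of three regressors.
# Data and models are synthetic/illustrative, not the course's 311 data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = [RandomForestRegressor(random_state=42),
          GradientBoostingRegressor(random_state=42),
          SVR()]
preds = [m.fit(X_tr, y_tr).predict(X_te) for m in models]

# Equal-weight average of the individual predictions
ensemble = np.mean(preds, axis=0)
for m, p in zip(models, preds):
    print(f"{type(m).__name__}: MAE = {mean_absolute_error(y_te, p):.3f}")
print(f"Equal-weight ensemble: MAE = {mean_absolute_error(y_te, ensemble):.3f}")
```

When the three models' errors are not strongly correlated, the averaged prediction tends to beat the single best model, which is the answer to the slide's closing question.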