
Module 4 Lecture Slides: Machine Learning (PP DS Badge)

Data Science and Public Policy Badge
Introduction to machine learning
Terminologies
• The differences between popular terminologies
What does machine learning do?
• As opposed to (linear) regression?
• Inference
• Not causal inference; in most cases linear regression identifies association, not causation
• Inference from a sample to a population
• Hence p-values: the probability of an “abnormality” this large arising from sampling variability alone
• Relatively small samples compared to populations
• Numerical data (text translated to numbers, such as frequency)
• ML centers on prediction
• Accuracy reigns supreme
• “big data”
• 3V: volume, velocity, and variety
Commonality and difference
• Commonality
• Neither regression nor ML is causal
• Contrast
• Regression on inference, but ML on prediction and predictive accuracy
• Regression on functional forms, but ML often free of imposed forms
• Regression on assumptions (linearity, i.i.d., independence among variables), but ML often free of assumptions
• Regression on relatively small samples, but ML on “big data”
R or Python?
• R rooted in the S language from Bell Labs in the 1970s (later commercialized as S-PLUS)
• Statistical programming
• Widely used in regression without and with ML
• Python also dates to the late 1980s
• General programming language
• Widely used in web development
• Not a right or wrong choice; up to your preference
• Functionality/packages developed in one will likely be available in the other
• “caret” in R (with scikit-learn playing a similar role in Python) as an example
• One justification for choosing Python
• If your work focuses on web scraping, social media, and other web-related topics
• Smooth integration with the web, because much web tooling and many site backends are written in Python
Types of ML
• Supervised ML
• Supervised by whom or what?
• Not by me for sure
• By the response/outcome variable, i.e., the dependent variable in regression language
• Train an algorithm to predict an observed (actually occurred) outcome
• Classification if discrete outcomes
• Regression if continuous outcomes
• Unsupervised ML
• Data mining
• Pattern identification
• Clustering (contrast sketched below)
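A minimal sketch of the supervised/unsupervised contrast, in Python with scikit-learn (one common toolkit; the slides do not prescribe a library):

# Supervised vs. unsupervised: the presence or absence of a labeled outcome.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Supervised: the outcome y "supervises" the training
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: no outcome variable; KMeans searches for clusters in X alone
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])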
Types of supervised ML
• Regression
• If the response (dep) variable is continuous
• Example:
• Willingness to pay at flight/hotel booking websites
• Estimated wait time or congestion time
• Classification
• If the response variable is discrete (binary or multiclass)
• Example:
• Cancer or not
• Face recognition: open the door or not? terrorist or not?
• Netflix recommendations (see the sketch below)
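The same algorithm family often offers both variants; a brief sketch on simulated data (scikit-learn assumed):

# Regression vs. classification is determined by the response variable
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Continuous response (e.g., willingness to pay) -> regression
y_cont = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
reg = RandomForestRegressor(random_state=0).fit(X, y_cont)
print(reg.predict(X[:3]))  # continuous predictions

# Discrete response (e.g., cancer or not) -> classification
y_bin = (y_cont > 0).astype(int)
clf = RandomForestClassifier(random_state=0).fit(X, y_bin)
print(clf.predict(X[:3]))  # class labels (0/1)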
Self-driving car: supervised or unsupervised?
Hands-on machine learning
Workflow of ML
• Data collection
• Data cleaning and preparation
• Model (algorithm) selection
• Training the model
• Testing and improving the model
• Application: make predictions
Workflow of ML: for this course
• Data collection
• Data cleaning and preparation
• Model (algorithm) selection
• Training the model
• Testing and improving (retraining) the model
• Application: make predictions (full workflow sketched below)
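A compact end-to-end sketch of this workflow (simulated data; scikit-learn assumed; in real projects, collection and cleaning dominate the effort):

# End-to-end workflow: collect -> clean -> select -> train -> test -> predict
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1-2. Data collection and cleaning (simulated here; drop missing rows)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
df["outcome"] = (df["a"] + df["b"] > 0).astype(int)
df = df.dropna()

# 3. Model (algorithm) selection
model = RandomForestClassifier(random_state=0)

# 4. Training
X_train, X_test, y_train, y_test = train_test_split(
    df[list("abcd")], df["outcome"], test_size=0.3, random_state=0)
model.fit(X_train, y_train)

# 5. Testing (and retraining if performance is poor)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. Application: predict for new cases
new_cases = pd.DataFrame(rng.normal(size=(2, 4)), columns=list("abcd"))
print(model.predict(new_cases))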
Commonly used algorithms
• Random forest
• Tree based
• Support vector machines
• High-dimensional feature spaces
• Extreme gradient boosting
• Tree based
• All three algorithms can be used for both regression and classification problems (instantiated below)
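The three algorithms in their classification variants; a sketch assuming the third-party xgboost package is installed (scikit-learn's GradientBoostingClassifier is a built-in alternative):

# The three algorithms named above, in their classification variants
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier  # assumes `pip install xgboost`

X, y = make_classification(n_samples=200, random_state=0)

models = {
    "random forest (tree based)": RandomForestClassifier(random_state=0),
    "support vector machine": SVC(),
    "extreme gradient boosting (tree based)": XGBClassifier(random_state=0),
}
for name, m in models.items():
    m.fit(X, y)
    print(name, "training accuracy:", m.score(X, y))

# Each also has a regression counterpart:
# RandomForestRegressor, SVR, XGBRegressor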
Training a model: training is learning
• A process of feeding an ML algorithm data so that it identifies and learns good values for all parameters involved
• A key step, because the performance of the trained algorithm determines the quality of its applications
• Considerations for a “good” trained model
• Complexity (parsimony), interpretability, performance, computational requirements, etc.
• Performance/accuracy only for now
Testing a model: testing is evaluating
• A process for unbiased evaluation of the performance/accuracy of the final trained algorithm
• Such unbiased performance evaluation can be used for model comparison and selection
• Train vs. test datasets
• Test dataset is a sample of the dataset held back from training the model
• Critical to keep the test set completely separate from training
• “Peeking” is strongly discouraged
• Train-test split approach (e.g., 70% train vs. 30% test)
• Validation set
• Many use “validation” and “test” sets interchangeably
• More strictly, a validation set is held back from training (like the test set) but is used to tune the model (unlike the test set)
• k-fold cross-validation (see the sketch below)
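A sketch of the 70/30 split and k-fold cross-validation (scikit-learn assumed):

# Train/test split and k-fold cross-validation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Hold back 30% as a test set; never touch it while training ("no peeking")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 5-fold cross-validation *within the training set* for tuning decisions;
# the test set is reserved for the final, unbiased evaluation
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cv accuracy (tuning):", cv_scores.mean())
print("test accuracy (final):", model.score(X_test, y_test))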
Bias and variance
• Bias: how far a model’s predictions are from the target values in the training data
• A high level of bias can lead to underfitting
• Underlying structure vs. noise
• Variance: inconsistency of predictions across different datasets (test or brand-new data)
• A high level of variance can lead to overfitting
• Generalization
• Bias-variance trade-off (illustrated below)
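One way to see the trade-off: vary the depth of a decision tree on noisy simulated data. A shallow tree underfits (high bias); a fully grown tree overfits (high variance), so its training accuracy far exceeds its test accuracy:

# Under- vs. overfitting as decision-tree depth grows
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=10, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):  # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")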
ML example: regression
Background and setting
• 311 requests
• Calls, online, or app
• Non-emergency public service requests or inquiries
• Trash pickup, potholes, noise, service outages
• During COVID, inquiries on test sites, vaccine availability
• 311 requests in Miami-Dade County
• Q: Can we predict request volumes in small geographic areas?
• Geo unit: Census tracts, 529 in MDC
• Response var: counts (total or by category)
• Features: housing, demographic, and economic characteristics, plus autoregressive counts (sketched below with hypothetical data)
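A hypothetical sketch of this setup; every column name and value below is invented for illustration and does not reflect the actual MDC data:

# Tract-level sketch: predict 311 request counts from housing/demographic/
# economic features plus a lagged (autoregressive) count. Data are invented.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_tracts = 529  # Census tracts in Miami-Dade County, per the slide
df = pd.DataFrame({
    "median_income": rng.normal(55_000, 15_000, n_tracts),
    "pct_renters": rng.uniform(0, 1, n_tracts),
    "population": rng.integers(1_000, 9_000, n_tracts),
    "requests_last_year": rng.poisson(400, n_tracts),  # autoregressive term
})
df["requests"] = (0.9 * df["requests_last_year"]
                  + 0.02 * df["population"]
                  + rng.normal(0, 30, n_tracts)).round()

X = df.drop(columns="requests")
X_train, X_test, y_train, y_test = train_test_split(
    X, df["requests"], test_size=0.3, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))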
Accuracy indicators
• MAE (Mean Absolute Error)
• RMSE (Root Mean Squared Error)
• R-squared (Coefficient of Determination)
• In most cases, the three indicators provide consistent findings (computed below)
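Computing all three with sklearn.metrics (toy numbers):

# The three accuracy indicators; np.sqrt(MSE) is the portable way to get RMSE
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10, 20, 30, 40])
y_pred = np.array([12, 18, 33, 37])

mae = mean_absolute_error(y_true, y_pred)           # mean(|error|)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(mean(error^2))
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot

print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, R^2={r2:.3f}")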
ML example: classification
Background and setting
• Still 311 requests in MDC
• Unit of analysis: individual requests
• Unique feature in MDC 311 data
• Expected completion time
• This feature may not be available in other cities
• Q: What predicts a shorter-than-expected completion time?
• Binary
• Alternatively, multiclass (shorter, on time, and longer)
• Features: community and request characteristics
Confusion matrix
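For reference, the standard 2×2 layout, with rows as actual classes and columns as predicted classes:

                    Predicted positive     Predicted negative
Actual positive     True positive (TP)     False negative (FN)
Actual negative     False positive (FP)    True negative (TN)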
Sensitivity and specificity
• Sensitivity (aka recall) = TP/(TP+FN)
• Numerator: correctly labeled/identified cases having a certain attribute
• Denominator: all cases having a certain attribute, regardless of being detected or not
• Example: detecting brain tumor
• FN is intolerable: we want to reduce missed brain tumors
• Specificity = TN/(TN+FP)
• Numerator: correctly labeled/identified cases NOT having a certain attribute
• Denominator: all cases NOT having a certain attribute, regardless of being detected or not
• Example: drug test for employment
• FP (false alarm) is intolerable: we don’t want to mistakenly take away employment opportunities
Accuracy and precision
• Accuracy = (TP+TN)/(TP+FP+FN+TN)
• Numerator: all correctly labeled/identified cases (positives and negatives)
• Denominator: all cases
• How many cases did we correctly label out of all cases?
• Balanced view on positives and negatives
• Precision = TP/(TP+FP)
• Numerator: correctly labeled/identified cases having a certain attribute
• Denominator: all labeled/identified cases having a certain attribute (true and false)
• How many labeled cases actually have the attribute?
• Be more confident of true positives (all four metrics computed in the sketch below)
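All four metrics from one confusion matrix (scikit-learn assumed; toy labels):

# Sensitivity, specificity, accuracy, and precision from a confusion matrix
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# sklearn orders the 2x2 matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("sensitivity (recall):", tp / (tp + fn))  # don't miss true positives
print("specificity:         ", tn / (tn + fp))  # avoid false alarms
print("accuracy:            ", (tp + tn) / (tp + tn + fp + fn))
print("precision:           ", tp / (tp + fp))  # trust the labeled positives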
Relations
• Sensitivity vs. specificity
• Which is less acceptable: a false negative or a false positive?
• Sensitivity vs. precision
• Precision is how sure you are of your predicted positives, while sensitivity is how sure you are that you are not missing any true positives
• Accuracy vs. precision
• Balanced view on TP and TN vs. TP out of labeled positives
• TN may be disproportionately large in number but substantively unimportant
• Excellent prediction of individuals who are not terrorists…did we miss the point?
ML example: ensemble
Spaghetti forecast
Ensemble prediction
• Take advantage of individual, independent predictions
• Pros and cons of different models
• Nature and characteristics of the datasets
• Reduced generalization error if uncorrelated models are combined
• How to combine models?
• Equal weight average
• Weighted average
• Stacking
• Individual predictions as inputs for an ensemble prediction
• Many more
• Why not simply choose the single model with the highest predictive accuracy?
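A sketch of the three combination methods on simulated data (scikit-learn assumed; the 0.7/0.3 weights are arbitrary placeholders):

# Equal-weight average, weighted average, and stacking
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)
svr = SVR().fit(X_train, y_train)
p_rf, p_svr = rf.predict(X_test), svr.predict(X_test)

equal_weight = (p_rf + p_svr) / 2     # simple average
weighted = 0.7 * p_rf + 0.3 * p_svr   # weights chosen by the analyst

# Stacking: a meta-model learns how to weight the base predictions
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0)), ("svr", SVR())],
    final_estimator=LinearRegression(),
).fit(X_train, y_train)
print("stacked R^2 on test:", stack.score(X_test, y_test))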
Additional resources
Resources
• There are so many…
• Books:
• Applied Predictive Modeling, by Max Kuhn and Kjell Johnson
• YouTube:
• StatQuest with Josh Starmer
• Andrew Ng, Stanford CS229: Machine learning
• Andrew Ng, Stanford CS230: Deep learning