Machine Learning
• It is reminiscent of the famous statement by the British statistician George E. P. Box that "all models are wrong, but some are useful".
• The goal of ML is never to make "perfect" guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

What is machine learning?
• A branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
• Since intelligence requires knowledge, the computer must first acquire that knowledge.

Defining the Learning Task
Improve on task T, with respect to performance metric P, based on experience E.
• T: Playing checkers
  P: Percentage of games won against an arbitrary opponent
  E: Playing practice games against itself
• T: Recognizing hand-written words
  P: Percentage of words correctly classified
  E: Database of human-labeled images of handwritten words
• T: Driving on four-lane highways using vision sensors
  P: Average distance traveled before a human-judged error
  E: A sequence of images and steering commands recorded while observing a human driver
• T: Categorizing email messages as spam or legitimate
  P: Percentage of email messages correctly classified
  E: Database of emails, some with human-given labels

Learning system model
[Diagram: training samples feed a learning method that produces a system; the system is then applied to new input during testing.]

Training and testing
[Diagram: data acquisition draws a training set (observed) from the universal set (unobserved); in practical usage the learned model is applied to a testing set (unobserved).]

y = f(x), where f is the prediction function, x the features, and y the output.
• Training: given a training set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the prediction function f by minimizing the prediction error on the training set.
• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x). (See the short code sketch at the end of this part.)
Slide credit: L. Lazebnik

Algorithms
• Supervised learning
  • Prediction
  • Classification (discrete labels), Regression (real values)
• Unsupervised learning
  • Clustering
  • Probability distribution estimation
  • Finding associations (in features)
  • Dimension reduction
• Semi-supervised learning
• Reinforcement learning
  • Decision making (robot, chess machine)

Reinforcement Learning
• Reinforcement learning comes into play when examples of desired behavior are not available, but it is possible to score examples of behavior according to some performance criterion.
• Consider a simple scenario. Mobile phone users sometimes resort to the following procedure to obtain good reception in a new locale where coverage is poor:
• We move around with the phone while monitoring its signal strength indicator, or by repeating "Do you hear me now?" and carefully listening to the reply.
• We keep doing this until we either find a place with an adequate signal or find the best place we can under the circumstances, at which point we either try to complete the call or give up.
• Here, the information we receive is not directly telling us where we should go to obtain good reception, nor is each reading telling us in which direction we should move next. Each reading simply allows us to evaluate the goodness of our current situation.
• We have to move around—explore—in order to decide where we should go. We are not given examples of correct behavior.
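To make the earlier training-and-testing slides concrete, here is a minimal sketch in Python (scikit-learn and the toy data are illustrative assumptions, not part of the original slides): estimate a prediction function f from labeled pairs, then apply it to an unseen example.

from sklearn.linear_model import LogisticRegression

# Toy labeled training set {(x1, y1), ..., (xN, yN)} -- values are purely illustrative.
X_train = [[0.2, 1.1], [0.4, 0.9], [1.8, 0.3], [2.1, 0.4]]
y_train = [0, 0, 1, 1]

# "Training": estimate the prediction function f by fitting it to the labeled examples.
f = LogisticRegression()
f.fit(X_train, y_train)

# "Testing": apply f to a never-before-seen example x and output y = f(x).
x_new = [[1.9, 0.5]]
print(f.predict(x_new))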
What is Data?
• A collection of data objects and their attributes.
• An attribute is a property or characteristic of an instance.
  • Examples: eye color of a person, temperature, etc.
  • An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an instance.
  • An instance is also known as a record, point, case, sample, entity, or object.

  Tid  Refund  Marital Status  Taxable Income  Cheat
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single          70K             No
   4   Yes     Married         120K            No
   5   No      Divorced        95K             Yes
   6   No      Married         60K             No
   7   Yes     Divorced        220K            No
   8   No      Single          85K             Yes
   9   No      Married         75K             No
  10   No      Single          90K             Yes

Attributes (field, variable, feature)
• Categorical: a finite number of discrete values.
  • The type nominal denotes that there is no ordering between the values, such as last names and colors.
  • The type ordinal denotes that there is an ordering, such as an attribute taking on the values low, medium, or high.
• Continuous (quantitative): commonly a subset of the real numbers, where there is a measurable difference between the possible values.
  • Integers are usually treated as continuous in practical problems.

Instance (example, case, record)
• A single object of the world from which a model will be learned, or on which a model will be used (e.g., for prediction).
• In most data mining work, instances are described by feature vectors; some work uses more complex representations (e.g., containing relations between instances or between parts of instances).

Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes (see the table above).

Document Data
• Each document becomes a 'term' vector (see the sketch below):
  • each term is a component (attribute) of the vector,
  • the value of each component is the number of times the corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0
Document 3     0     1     0     0      1     2     2     0      3       0

Transaction Data
• A special type of record data, where each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

  TID  Items
   1   Bread, Coke, Milk
   2   IceCream, Bread
   3   IceCream, Coke, Butter, Milk
   4   IceCream, Bread, Cheese, Milk
   5   Coke, Cheese, Milk
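Going back to the document-data slide above, here is a minimal sketch of how such term vectors could be built (plain Python; the vocabulary and the two toy documents are made up for illustration and are not the documents from the table).

from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

documents = [
    "game score game team play lost",   # hypothetical Document A
    "coach ball lost coach",             # hypothetical Document B
]

for doc in documents:
    counts = Counter(doc.split())
    # One component per vocabulary term: how often the term occurs in the document.
    vector = [counts.get(term, 0) for term in vocabulary]
    print(vector)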
Graph Data
• Examples: a generic graph and HTML links, e.g.:
  <a href="papers/papers.html#bbbb"> Data Mining </a>
  <li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
  <li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
  <li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers

Chemical Data
• Benzene molecule: C6H6

Ordered Data
• Sequences of transactions: each element of the sequence is a set of items/events.
• Genomic sequence data, e.g.:
  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG
• Spatio-temporal data, e.g., average monthly temperature of land and ocean.

Instance-Based Classifiers
• First example of supervised classification.
• Examples:
  • Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.
  • Nearest neighbor: uses the k "closest" points (nearest neighbors) to perform classification.
• Store the training records; use the training records to predict the class label of unseen cases.
[Diagram: a set of stored cases with attributes Atr1 … AtrN and class labels A/B/C, plus an unseen case to be classified.]

Things to decide
• Which ML method would be best?
• Answer: cross-validate different ML methods to get a sense of how well they work in practice.
  1. Estimate the parameters for the ML method.
  2. Evaluate how well the ML method works.
• Estimating the parameters is known as "training the algorithm".
• Evaluating it is called "testing the algorithm".

Nearest-Neighbor Classifiers
• Require three things:
  – the set of stored records,
  – a distance metric to compute the distance between records,
  – the value of k, the number of nearest neighbors to retrieve.
• To classify an unknown record:
  – compute its distance to the training records,
  – identify the k nearest neighbors,
  – use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).

Definition of Nearest Neighbor
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
[Diagram: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor around an unknown record X.]

Nearest Neighbor Classification
• Compute the distance between two points, e.g.:
  • Euclidean distance: d(p, q) = sqrt( Σ_i (p_i − q_i)^2 )
  • Manhattan (city-block) distance: d(p, q) = Σ_i |p_i − q_i|
• Determine the class from the nearest-neighbor list:
  • take the majority vote of the class labels among the k nearest neighbors.
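As a minimal sketch of this procedure (plain Python; Manhattan distance and k = 1 are chosen to match the worked example that follows, and the helper names are my own):

from collections import Counter

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def knn_predict(train_X, train_y, x, k=1):
    # Sort training indices by distance from the unknown record x.
    order = sorted(range(len(train_X)), key=lambda i: manhattan(train_X[i], x))
    # Majority vote among the class labels of the k nearest neighbors.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Training data from the example below: features (F1, F2) and class labels.
train_X = [(1, 5), (0, 8), (0, 6), (1, 2)]
train_y = [0, 0, 1, 1]

print(knn_predict(train_X, train_y, (1, 3), k=1))  # expected: class 1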
Example (NN Classifier)

Training Data                 Test Data
  F1   F2   Class               F1   F2   Class
   1    5     0                  1    3     ?
   0    8     0                  1    4     ?
   0    6     1                  0    3     ?
   1    2     1                  0    4     ?

Step 1: Compute the distance from Test Sample 1 to the training data.
Step 2: Distances from Test Sample 1 (1, 3) to all training samples:
  Training sample 1 (class 0): |1-1| + |3-5| = 0 + 2 = 2
  Training sample 2 (class 0): |1-0| + |3-8| = 1 + 5 = 6
  Training sample 3 (class 1): |1-0| + |3-6| = 1 + 3 = 4
  Training sample 4 (class 1): |1-1| + |3-2| = 0 + 1 = 1
Step 3: Assign the test sample to the class of the training sample with the minimum distance, here class 1. So Test Sample 1 belongs to class 1.
Exercise: calculate the distances for the other 3 test samples.

Nearest Neighbor Classification…
• Choosing the value of k:
  • If k is too small, the classifier is sensitive to noise points.
  • If k is too large, the neighborhood may include points from other classes.

Nearest Neighbor Classification…
• Scaling issues:
  • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
  • Example:
    • height of a person may vary from 1.5 m to 1.8 m
    • weight of a person may vary from 90 lb to 300 lb
    • income of a person may vary from $10K to $1M

Example (NN Classifier): Normalize Data from 0 to 1

Training Data                 Test Data
  F1   F2      Class            F1   F2      Class
   1   0.5       0               1   0.167     ?
   0   1         0               1   0.334     ?
   0   0.667     1               0   0.167     ?
   1   0         1               0   0.334     ?

Training and testing
• Training is the process of making the system able to learn.
• No free lunch rule:
  • The training set and testing set come from the same distribution.
  • We need to make some assumptions or introduce bias.

Assessing Performance
• Training-data performance is typically optimistic, e.g., the error rate on the training data. Reasons:
  – the classifier may not have enough data to fully learn the concept (but on training data we don't know this);
  – for noisy data, the classifier may overfit the training data.
• In practice we want to assess performance "out of sample": how well will the classifier do on new, unseen data? This is the true test of what we have learned (just like a classroom).
• With large data sets we can partition our data into 2 subsets, train and test:
  – build a model on the training data,
  – assess performance on the test data.

• The task of supervised learning is this: given a training set of N example input–output pairs (x1, y1), (x2, y2), …, (xN, yN), where each yj was generated by an unknown function y = f(x), discover a function h that approximates the true function f.
• Here x and y can be any value; they need not be numbers. The function h is a hypothesis.
• Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set. To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set.
• We say a hypothesis generalizes well if it correctly predicts the value of y for novel examples.

Performance
• Several factors affect performance:
  • the type of training provided,
  • the form and extent of any initial background knowledge,
  • the type of feedback provided,
  • the learning algorithms used.
• Two important factors: modeling and optimization.

Algorithms
• The success of a machine learning system also depends on the algorithms.
• The algorithms control the search to find and build the knowledge structures.
• The learning algorithms should extract useful information from training examples.
• Main families: supervised learning, unsupervised learning, semi-supervised learning.

Machine learning structure
• Supervised learning

The machine learning framework
• Apply a prediction function to a feature representation of the image to get the desired output:
  f( image ) = "apple"
  f( image ) = "tomato"
  f( image ) = "cow"
Slide credit: L. Lazebnik
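A small sketch of this framework, assuming numpy, a toy intensity-histogram feature, and some already-trained classifier clf (all of these are illustrative assumptions rather than anything specified in the slides):

import numpy as np

def features(image):
    # A toy feature representation: a normalized 16-bin intensity histogram of the pixels.
    hist, _ = np.histogram(image, bins=16, range=(0, 255))
    return hist / hist.sum()

# 'image' would be an array of pixel intensities; here a random stand-in.
image = np.random.randint(0, 256, size=(32, 32))

x = features(image)                 # feature representation of the image
print(x.round(3))                   # the vector the prediction function would receive
# label = clf.predict([x])[0]       # prediction function f applied to the features -> e.g., "apple"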
• Regression machine learning systems: systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of "How much?" or "How many?".
  • Multi-target: more than one continuous target variable.
• Classification machine learning systems: systems where we seek a yes-or-no prediction, such as "Is this tumor cancerous?", "Does this cookie meet our quality standards?", and so on.
  • Multiclass (mutually exclusive class assignments), multilabel (a data point can lie in more than one class).

Machine learning structure
• Unsupervised learning

Clustering Strategies
• K-means – iteratively re-assign points to the nearest cluster center.
• Agglomerative clustering – start with each point as its own cluster and iteratively merge the closest clusters.
• Mean-shift clustering – estimate the modes of the probability density function.
• Spectral clustering – split the nodes in a graph based on assigned links with similarity weights.
As we go down this list, the clustering strategies have more tendency to transitively group points even if they are not nearby in feature space.

Features
• Raw pixels
• Histograms
• …
Slide credit: L. Lazebnik

Semi-supervised learning
• "We expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object." LeCun, Bengio, Hinton, Nature (2015)
• Semi-supervised learning algorithms are trained on a combination of labeled and unlabeled data.
• Semi-supervised learning uses the unlabeled data to gain more understanding of the population structure in general.
• The process of labeling massive amounts of data for supervised learning is often prohibitively time-consuming and expensive.
• What's more, too much labeling can impose human biases on the model.

Choosing an Evaluation Metric
• How do you evaluate the result of your model?
• Some misclassifications are worse than others: false negatives may be worse than false positives.
• A domain expert can decide the evaluation metric.

Confusion Matrix
• In the field of machine learning, a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

                       Actual Negative   Actual Positive
  Predicted Negative   True Negative     False Negative
  Predicted Positive   False Positive    True Positive

Choosing an Evaluation Metric
• Evaluating the efficiency of your model needs domain expertise; it is a tradeoff between sensitivity and specificity.
• Sensitivity (true positive rate): correctly marking a positive as positive; also known as recall.
• Specificity (true negative rate): correctly marking a negative as negative.
• F-measure
• One possibility is accuracy, which is the ratio of the number of correct labels to the total number of labels.
• Misclassification rate = 1 − accuracy.
• How to determine the value of k? The one which gives the minimum misclassification rate.
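One way to act on this, sketched with scikit-learn (the iris data and the 80/20 split are stand-ins; the slides do not prescribe a particular dataset): try several values of k and keep the one with the lowest misclassification rate on held-out data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_error = None, 1.0
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error = 1.0 - knn.score(X_val, y_val)   # misclassification rate = 1 - accuracy
    if error < best_error:
        best_k, best_error = k, error

print(best_k, best_error)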
Confusion Matrix
• TN is the number of correct predictions that an instance is negative.
• FP is the number of incorrect predictions that an instance is positive.
• FN is the number of incorrect predictions that an instance is negative.
• TP is the number of correct predictions that an instance is positive.

Confusion Matrix
• Several standard terms have been defined for the 2-class matrix.
• The accuracy (AC) is the proportion of the total number of predictions that were correct:
  Accuracy = (TN + TP) / (TN + FN + TP + FP)
• The recall or true positive rate (TPR) is the proportion of positive cases that were correctly identified:
  TPR = TP / (TP + FN)
• The false positive rate (FPR) is the proportion of negative cases that were incorrectly classified as positive:
  FPR = FP / (FP + TN)
• The true negative rate (TNR) is the proportion of negative cases that were classified correctly:
  TNR = TN / (FP + TN)
• The false negative rate (FNR) is the proportion of positive cases that were incorrectly classified as negative:
  FNR = FN / (FN + TP)
• Precision (P) is the proportion of the predicted positive cases that were correct:
  Precision = TP / (TP + FP)
• The F-measure is the harmonic mean of precision and recall.
• Worked example (the figures below imply the counts TP = 2, TN = 1, FP = 0, FN = 1):
  • Accuracy = 3 / 4 = 75%
  • TPR (recall) = 2 / 3 = 66.7%
  • FPR = 0 / 1 = 0%
  • TNR = 1 / 1 = 100%
  • FNR = 1 / 3 = 33.3%
  • Precision = 2 / 2 = 100%
  • F1 = (2 × 1 × 0.667) / (1 + 0.667) = 0.8
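A short script that reproduces the worked numbers above (plain Python; the counts TP = 2, TN = 1, FP = 0, FN = 1 are inferred from the slide's figures rather than given explicitly):

TP, TN, FP, FN = 2, 1, 0, 1   # implied counts from the worked example

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.75
recall    = TP / (TP + FN)                    # true positive rate, ~0.667
fpr       = FP / (FP + TN)                    # false positive rate, 0.0
tnr       = TN / (FP + TN)                    # specificity, 1.0
fnr       = FN / (FN + TP)                    # ~0.333
precision = TP / (TP + FP)                    # 1.0
f1 = 2 * precision * recall / (precision + recall)   # 0.8

print(accuracy, recall, precision, round(f1, 3))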
Cross Validation
• Divide the data into a training set (80%) and a test set (20%).
• Find the values of the parameters using the training set and calculate the MSE.
• Calculate the MSE for the TEST set as well.
• If both values of MSE are approximately the same, then it is a good model.

Training and Validation Data
[Diagram: the full data set is split into training data and validation data.]
• Idea: train each model on the "training data" and then test each model's accuracy on the validation data.
• The simplest approach is the one we have seen already: randomly split the available data into a training set, from which the learning algorithm produces h, and a test set on which the accuracy of h is evaluated. This method, sometimes called holdout cross-validation, has the disadvantage that it fails to use all the available data; if we use half the data for the test set, then we are only training on half the data, and we may get a poor hypothesis.

The k-fold Cross-Validation Method
• Why just choose one particular 90/10 "split" of the data? In principle we could do this multiple times.
• "k-fold cross-validation" (e.g., k = 10):
  – randomly partition the full data set into k disjoint subsets (each roughly of size n/k, where n is the total number of training data points);
  – for i = 1 to k (here k = 10): train on 90% of the data, and let Acc(i) be the accuracy on the other 10%;
  – Cross-Validation-Accuracy = (1/k) Σ_i Acc(i);
  – choose the method with the highest cross-validation accuracy;
  – common values for k are 5 and 10;
  – we can also do "leave-one-out", where k = n.

Disjoint Validation Data Sets
[Diagram: the full data set is divided into k disjoint partitions; in each round one partition serves as the validation (test) data and the remaining partitions serve as the training data.]

More on Cross-Validation
• Notes:
  – cross-validation generates an approximate estimate of how well the learned model will do on "unseen" data;
  – by averaging over different partitions it is more robust than just a single train/validate partition of the data;
  – "k-fold" cross-validation is a generalization: partition the data into k disjoint validation subsets of size n/k, then train, validate, and average over the k partitions (e.g., k = 10 is commonly used);
  – k-fold cross-validation is approximately k times more computationally expensive than just fitting a model to all of the data.
• The extreme is k = n, also known as leave-one-out cross-validation or LOOCV.

from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
count = 0
for train, test in kf.split(X):
    # Each split reports the indices used for training and for testing.
    print("Split %i : Train %s Test %s" % (count, train, test))
    count = count + 1

Ensembles
• Bagging: we draw the samples with replacement; this is called bootstrapping.
• Boosting: one of the most widely used ensemble techniques. It is a sequential process, where each model attempts to correct the predictions of the previous models.
• Boosting models construct multiple weak models and then build a strong model from them. A single weak model may perform well on a subset of a dataset; combining weak models increases the overall performance.

Random Forest
• Step 1: Create a bootstrapped dataset (draw samples from the original data with replacement).

Inductive learning method
• Construct/adjust h to agree with f on the training set.
• (h is consistent if it agrees with f on all examples.)
• E.g., curve fitting.

How Overfitting Affects Prediction
[Chart: predictive error versus model complexity, showing the error on training data, the error on test data, the underfitting and overfitting regions, and the ideal range for model complexity.]

Underfitting, Overfitting, Bias and Variance
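To connect curve fitting with the overfitting chart, here is a small sketch (numpy assumed; the noisy sine data and the polynomial degrees are my own illustrative choices): as the degree grows, the training error typically keeps shrinking while the held-out error eventually stops improving or gets worse.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy target

# Hold out part of the data to measure "out of sample" error.
train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x[train], y[train], degree)   # fit h on the training data only
    pred = np.polyval(coeffs, x)
    train_mse = np.mean((pred[train] - y[train]) ** 2)
    test_mse = np.mean((pred[test] - y[test]) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))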
References
• An overview of ML, speaker: Yi-Fan Chang
• Slides: Isabelle Guyon, Erik Sudderth, Mark Johnson, Derek Hoiem, Lana Lazebnik
• https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer
• IST 511 Information Management: Information and Technology, Machine Learning, Dr. C. Lee Giles, David Reese Professor, College of Information Sciences and Technology
• Wikipedia
• https://pdfs.semanticscholar.org/1c54/68ca9da39aaf7ba8850c0eafd8c86acf5c09.pdf

References
• Introduction to Data Mining by Tan, Steinbach, Kumar (lecture slides)
• http://robotics.stanford.edu/~ronnyk/glossary.html
• http://www.cs.tufts.edu/comp/135/Handouts/introductionlecture-12-handout.pdf
• https://www.educative.io/