
Machine Learning Fundamentals: Algorithms & Data

Machine Learning
• It is somewhat reminiscent of the famous statement by British
mathematician and professor of statistics George E. P. Box that “all
models are wrong, but some are useful”.
• The goal of ML is never to make “perfect” guesses, because ML deals
in domains where there is no such thing. The goal is to make guesses
that are good enough to be useful.
What is machine learning?
• A branch of artificial intelligence, concerned with the design and development of
algorithms that allow computers to evolve behaviors based on empirical data.
• As intelligence requires knowledge, it is necessary for the computers to acquire
knowledge.
Defining the Learning Task
Improve on task T, with respect to performance metric P, based on experience E.

T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself

T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while observing a human driver

T: Categorizing email messages as spam or legitimate
P: Percentage of email messages correctly classified
E: Database of emails, some with human-given labels
Learning system model
[Diagram: input samples are fed to a learning method to build the system; the system is built during the training phase and applied during the testing phase]
Training and testing
[Diagram: from the universal set (unobserved), a training set (observed) is obtained through data acquisition, while a testing set (unobserved) stands in for practical usage]
y = f(x), where y is the output, f is the prediction function, and x is the feature vector.
• Training: given a training set of labeled examples {(x1,y1), …,
(xN,yN)}, estimate the prediction function f by minimizing the
prediction error on the training set
• Testing: apply f to a never before seen test example x and output
the predicted value y = f(x)
Slide credit: L. Lazebnik
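As a concrete illustration of this train-then-test loop, here is a minimal sketch in Python with scikit-learn; the model choice (linear regression) and the toy data are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Training set {(x1, y1), ..., (xN, yN)}: features and labels.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1])

# Training: estimate the prediction function f by minimizing
# prediction error on the training set.
f = LinearRegression().fit(X_train, y_train)

# Testing: apply f to a never-before-seen example x, output y = f(x).
x_new = np.array([[5.0]])
print(f.predict(x_new))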
Algorithms
• Supervised learning
  • Prediction
  • Classification (discrete labels), Regression (real values)
• Unsupervised learning
  • Clustering
  • Probability distribution estimation
  • Finding associations (in features)
  • Dimension reduction
• Semi-supervised learning
• Reinforcement learning
  • Decision making (robot, chess machine)
Reinforcement Learning
• Reinforcement learning comes into play when examples of desired behavior are not available, but where it is possible to score examples of behavior according to some performance criterion
• Consider a simple scenario: mobile phone users sometimes resort to the following procedure to obtain good reception in a new locale where coverage is poor
• We move around with the phone while monitoring its signal-strength indicator, or by repeating "Do you hear me now?" and carefully listening to the reply
• We keep doing this until we either find a place with an adequate signal, or until we find the best place we can under the circumstances, at which point we either try to complete the call or give up
Reinforcement Learning
• Here, the information we receive is not directly telling us where we should go to obtain good reception
• Nor is each reading telling us in which direction we should move next. Each reading simply allows us to evaluate the goodness of our current situation
• We have to move around—explore—in order to decide where we should go. We are not given examples of correct behavior
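A toy sketch of this explore-and-evaluate loop in Python; the signal model and the greedy random-search strategy are illustrative assumptions, not a full reinforcement learning algorithm.

import random

def signal_strength(x, y):
    # Hypothetical environment, unknown to the learner: reception
    # peaks at (3, 4), and each reading is noisy.
    return -((x - 3) ** 2 + (y - 4) ** 2) + random.gauss(0, 0.1)

# Explore: try random steps, keeping a move only when the evaluative
# reading improves. No reading tells us where to go, only how good
# the current spot is.
pos = (0.0, 0.0)
best = signal_strength(*pos)
for _ in range(200):
    cand = (pos[0] + random.uniform(-1, 1), pos[1] + random.uniform(-1, 1))
    reading = signal_strength(*cand)
    if reading > best:
        pos, best = cand, reading
print("best spot found:", pos)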
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an instance
  • Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an instance
• Instance is also known as record, point, case, sample, entity, or object

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Columns are attributes; rows are instances)
Attributes (field, variable, feature)
• Categorical: a finite number of discrete values
  • The type nominal denotes that there is no ordering between the values, such as last names and colors
  • The type ordinal denotes that there is an ordering, such as in an attribute taking on the values low, medium, or high
• Continuous (quantitative): commonly a subset of the real numbers, where there is a measurable difference between the possible values
  • Integers are usually treated as continuous in practical problems
Instance (example, case, record)
• A single object of the world from which a model will be learned, or on which a model will be used (e.g., for prediction)
• In most data mining work, instances are described by feature vectors
• Some work uses more complex representations (e.g., containing relations between instances or between parts of instances)
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Document Data
• Each document becomes a 'term' vector,
  • each term is a component (attribute) of the vector,
  • the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0      2     6     0     2      0       2
Document 2    0     7     0     2      1     0     0     3      0       0
Document 3    0     1     0     0      1     2     2     0      3       0
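A minimal sketch of building such term vectors in plain Python; the vocabulary and the sample document are illustrative assumptions.

from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text):
    # Count how often each vocabulary term occurs in the document.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

doc = "coach play play ball team game win"
print(term_vector(doc))  # one component per vocabulary term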
Transaction Data
• A special type of record data, where
  • each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    IceCream, Bread
3    IceCream, Coke, Butter, Milk
4    IceCream, Bread, Cheese, Milk
5    Coke, Cheese, Milk
Graph Data
• Examples: generic graph and HTML links

[Figure: a small generic graph with numbered, linked nodes]

<a href="papers/papers.html#bbbb">Data Mining</a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li>
<a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li>
<a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-temporal data

[Figure: average monthly temperature of land and ocean]
Instance-Based Classifiers
• First example of supervised classification
• Examples:
  • Rote-learner
    • Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
  • Nearest neighbor
    • Uses the k "closest" points (nearest neighbors) for performing classification
Instance-Based Classifiers
• Store the training records
• Use the training records to predict the class label of unseen cases

[Figure: a set of stored cases with attributes Atr1 … AtrN and class labels (A, B, B, C, A, C, B); an unseen case with attributes Atr1 … AtrN is matched against them]
Things to decide
• Which ML method would be best?
• Ans: Cross-validate different ML methods to get a sense of how well they work in practice.
• 1. Estimate the parameters for the ML method
• 2. Evaluate how well the ML method works
• Estimating the parameters is known as "training the algorithm"
• Evaluating is called "testing the algorithm"
Rote-Learner
• Rote-learner
• Memorizes entire training data
• Performs classification only if attributes of record match one of the training examples
exactly
Nearest-Neighbor Classifiers
• Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute its distance to the other training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor, shown as circles of increasing radius around a record x]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Nearest Neighbor Classification
• Compute the distance between two points:
  • Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
  • Manhattan distance: $d(p, q) = \sum_i |p_i - q_i|$
• Determine the class from the nearest neighbor list
  • take the majority vote of class labels among the k-nearest neighbors
Example (NN Classifier)

Training Data:
F1  F2  Class
1   5   0
0   8   0
0   6   1
1   2   1

Test Data:
F1  F2  Class
1   3   ?
1   4   ?
0   3   ?
0   4   ?
Example (NN Classifier)
Step 1: Compute the distance from test sample 1 (F1 = 1, F2 = 3) to each training sample, using the Manhattan distance.
Step 2: Distances from test sample 1 to all training samples:

Training sample  Distance                    Class
1                |1-1| + |3-5| = 0 + 2 = 2   0
2                |1-0| + |3-8| = 1 + 5 = 6   0
3                |1-0| + |3-6| = 1 + 3 = 4   1
4                |1-1| + |3-2| = 0 + 1 = 1   1

Step 3: Assign the test sample to the class of the training sample with the minimum distance. The minimum distance is 1, at training sample 4, whose class is 1, so test sample 1 belongs to class 1.
Example (NN Classifier)
Exercise: Calculate for the other 3 test samples (the sketch below can be used to check your answers)
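A minimal 1-nearest-neighbor sketch in Python that reproduces the computation above; the data is taken from the example, and the function name is illustrative.

# Training data: ((F1, F2), Class); test data: (F1, F2).
train = [((1, 5), 0), ((0, 8), 0), ((0, 6), 1), ((1, 2), 1)]
test = [(1, 3), (1, 4), (0, 3), (0, 4)]

def manhattan(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

for x in test:
    # Pick the stored record with the smallest distance to x.
    _, label = min(train, key=lambda rec: manhattan(rec[0], x))
    print("test sample", x, "-> class", label)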
Nearest Neighbor Classification…
• Choosing the value of k:
  • If k is too small, the classifier is sensitive to noise points
  • If k is too large, the neighborhood may include points from other classes
Nearest Neighbor Classification…
• Scaling issues
• Attributes may have to be scaled to prevent distance measures from being
dominated by one of the attributes
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
Example (NN Classifier)
Normalize the data from 0 to 1 (here F2 has been min-max scaled using the training minimum, 2, and maximum, 8; F1 is already 0/1):

Training Data:
F1  F2     Class
1   0.5    0
0   1      0
0   0.667  1
1   0      1

Test Data:
F1  F2     Class
1   0.167  ?
1   0.334  ?
0   0.167  ?
0   0.334  ?
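A sketch of the min-max normalization used above; computing the scaling constants from the training data's minimum and maximum is an assumed (and common) convention.

# Min-max scale F2 to [0, 1] using the training min and max.
f2_train = [5, 8, 6, 2]
f2_test = [3, 4, 3, 4]

lo, hi = min(f2_train), max(f2_train)

def scale(v):
    return (v - lo) / (hi - lo)

print([round(scale(v), 3) for v in f2_train])  # [0.5, 1.0, 0.667, 0.0]
print([round(scale(v), 3) for v in f2_test])   # [0.167, 0.333, ...] (the table rounds 2/6 to 0.334)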
Training and testing
• Training is the process of making the system able to learn.
• No free lunch rule:
• Training set and testing set come from the same distribution
• Need to make some assumptions or bias
Assessing Performance
Training data performance is typically optimistic
e.g., error rate on training data
Reasons?
- classifier may not have enough data to fully learn the concept (but
on training data we don’t know this)
- for noisy data, the classifier may overfit the training data
In practice we want to assess performance “out of sample”
how well will the classifier do on new unseen data? This is the
true test of what we have learned (just like a classroom)
With large data sets we can partition our data into 2 subsets, train and test
- build a model on the training data
- assess performance on the test data
• The task of supervised learning is this:
• TRAINING SET: Given a training set of N example input–output pairs
• (x1, y1), (x2, y2), …, (xN, yN),
• where each yj was generated by an unknown function y = f(x),
• discover a function h that approximates the true function f.
• Here x and y can be any value; they need not be numbers. The function h is a hypothesis.
• Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set.
• To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set.
• We say a hypothesis generalizes well if it correctly predicts the value of y for novel examples.
Performance
• There are several factors affecting the performance:
• Types of training provided
• The form and extent of any initial background knowledge
• The type of feedback provided
• The learning algorithms used
• Two important factors:
• Modeling
• Optimization
Algorithms
• The success of a machine learning system also depends on the algorithms.
• The algorithms control the search to find and build the knowledge structures.
• The learning algorithms should extract useful information from training examples.

Algorithms
• Unsupervised learning
• Supervised learning
• Semi-supervised learning
Machine learning structure
• Supervised learning
The machine learning framework
• Apply a prediction function to a feature representation of the image to get the desired output:
  f(image of an apple) = "apple"
  f(image of a tomato) = "tomato"
  f(image of a cow) = "cow"
Slide credit: L. Lazebnik
• Regression machine learning systems: systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of "How much?" or "How many?"
  • Multi-target: more than one continuous target variable
• Classification machine learning systems: systems where we seek a yes-or-no prediction, such as "Is this tumor cancerous?", "Does this cookie meet our quality standards?", and so on.
  • Multiclass (mutually exclusive class assignments), multilabel (a data point can lie in more than one class)
Machine learning structure
• Unsupervised learning
Clustering Strategies
• K-means (see the sketch below)
  – Iteratively re-assign points to the nearest cluster center
• Agglomerative clustering
  – Start with each point as its own cluster and iteratively merge the closest clusters
• Mean-shift clustering
  – Estimate the modes of the probability density function
• Spectral clustering
  – Split the nodes in a graph based on assigned links with similarity weights

As we go down this chart, the clustering strategies have more tendency to transitively group points even if they are not nearby in feature space.
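A minimal K-means sketch in Python (2-D points, k = 2); the toy data and the random initialization are illustrative assumptions.

import random

def kmeans(points, k, iters=20):
    # Initialize cluster centers by sampling k of the points.
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

pts = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(pts, k=2))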
Features
• Raw pixels
• Histograms, etc.
• …
Slide credit: L. Lazebnik
Semi-supervised learning
• “We expect unsupervised learning to become far more important in the longer term. Human and
animal learning is largely unsupervised: we discover the structure of the world by observing it, not
by being told the name of every object.” LeCun, Bengio, Hinton, Nature (2015)
• Semi-supervised learning algorithms are trained on a combination of
labeled and unlabeled data
• Semi-supervised learning uses the unlabeled data to gain more
understanding of the population structure in general
• The process of labeling massive amounts of data for supervised
learning is often prohibitively time-consuming and expensive.
• What’s more, too much labeling can impose human biases on the
model
Choosing an Evaluation Metric
• How do you evaluate the result of your model?
• Some misclassifications are worse than others
  • False negatives may be worse than false positives
• A domain expert can decide the evaluation metric
Confusion Matrix
• In the field of machine learning, a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm

                     Actual Negative   Actual Positive
Predicted Negative   True Negative     False Negative
Predicted Positive   False Positive    True Positive
Choosing an Evaluation Metric
• To evaluate the efficiency of your model
• Needs domain expertise
• It is a tradeoff between sensitivity and specificity
• Sensitivity (True Positive Rate)
  • Correctly marking a positive as positive
  • Also known as recall
• Specificity (True Negative Rate)
  • Correctly marking a negative as negative
• The F-measure combines precision and recall (defined below)
Choosing an Evaluation Metric
• One possibility is accuracy, which is the ratio of the number of correct labels to the total number of labels
• Misclassification rate = 1 − accuracy
• How to determine the value of k?
  • Choose the one which gives the minimum misclassification rate
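A sketch of choosing k this way with scikit-learn; the dataset, the split, and the candidate values of k are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

# Pick the k with the minimum misclassification rate on held-out data.
best_k, best_err = None, 1.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    err = 1 - model.score(X_val, y_val)  # misclassification rate
    if err < best_err:
        best_k, best_err = k, err
print("best k:", best_k, "error:", round(best_err, 3))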
Confusion Matrix
• TN is the number of correct predictions that an
instance is negative
• FP is the number of incorrect predictions that an
instance is positive
• FN is the number of incorrect predictions that an
instance is negative
• TP is the number of correct predictions that an
instance is positive
Confusion Matrix
• Several standard terms have been defined for the 2-class matrix
• The accuracy (AC) is the proportion of the total number of predictions that were correct:

$\text{Accuracy} = \dfrac{TN + TP}{TN + FN + TP + FP}$

• Accuracy = 3 / 4 = 75%
• (These values correspond to a running example with TP = 2, TN = 1, FP = 0, FN = 1, i.e., 4 predictions in total)
Confusion Matrix
• The recall or true positive rate (TPR) is the proportion of positive cases that were correctly identified:

$TPR = \dfrac{TP}{TP + FN}$

• The false positive rate (FPR) is the proportion of negative cases that were incorrectly classified as positive:

$FPR = \dfrac{FP}{FP + TN}$

• TPR or recall = 2 / 3 = 66.7%
• FPR = 0 / 1 = 0%
Confusion Matrix
• The true negative rate (TNR) is the proportion of negative cases that were classified correctly:

$TNR = \dfrac{TN}{FP + TN}$

• The false negative rate (FNR) is the proportion of positive cases that were incorrectly classified as negative:

$FNR = \dfrac{FN}{FN + TP}$

• TNR = 1 / 1 = 100%
• FNR = 1 / 3 = 33.3%
Confusion Matrix
• Precision (P) is the proportion of the predicted positive cases that were correct:

$\text{precision} = \dfrac{TP}{TP + FP}$

• precision = 2 / 2 = 100%
• The F-measure is the harmonic mean of precision and recall:

$F_1 = \dfrac{2 \cdot P \cdot R}{P + R}$

• F1 = (2 × 1 × 0.667) / (1 + 0.667) = 0.8
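A sketch that reproduces all of these numbers from the counts implied by the running example (TP = 2, TN = 1, FP = 0, FN = 1):

# Counts implied by the running example above.
TP, TN, FP, FN = 2, 1, 0, 1

accuracy = (TN + TP) / (TN + FN + TP + FP)          # 0.75
recall = TP / (TP + FN)                             # 0.667 (TPR)
fpr = FP / (FP + TN)                                # 0.0
tnr = TN / (FP + TN)                                # 1.0
fnr = FN / (FN + TP)                                # 0.333
precision = TP / (TP + FP)                          # 1.0
f1 = 2 * precision * recall / (precision + recall)  # 0.8
print(accuracy, recall, fpr, tnr, fnr, precision, f1)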
Cross Validation
• Divide the data into a training set (80%) and a test set (20%)
• Find the values of the parameters using the training set and calculate the MSE
• Calculate the MSE for the TEST set as well
• If both values of MSE are approximately the same, then it is a good model
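A sketch of this check with scikit-learn; the model (linear regression) and the synthetic data are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=100)

# 80/20 split; fit the parameters on the training set only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Similar train and test MSE suggests a good model.
print("train MSE:", mean_squared_error(y_tr, model.predict(X_tr)))
print("test MSE:", mean_squared_error(y_te, model.predict(X_te)))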
Training and Validation Data

[Figure: the full data set is split into training data and validation data]

Idea: train each model on the "training data" and then test each model's accuracy on the validation data.
• The simplest approach is the one we have seen already: randomly
split the available data into a training set from which the learning
algorithm produces h and a test set on which the accuracy of h is
evaluated. This method, sometimes called holdout cross-validation,
has the disadvantage that it fails to use all the available data; if we
use half the data for the test set, then we are only training on half the
data, and we may get a poor hypothesis.
The k-fold Cross-Validation Method
• Why just choose one particular 90/10 "split" of the data?
  – In principle we could do this multiple times
• "k-fold Cross-Validation" (e.g., k = 10)
  – randomly partition our full data set into k disjoint subsets (each roughly of size n/k, n = total number of training data points)
  – for i = 1 to 10 (here k = 10):
    • train on the other 90% of the data
    • Acc(i) = accuracy on the held-out 10%
  – Cross-Validation-Accuracy = $\frac{1}{k} \sum_i \text{Acc}(i)$
  – choose the method with the highest cross-validation accuracy
  – common values for k are 5 and 10
  – can also do "leave-one-out", where k = n
Disjoint Validation Data Sets

[Figure: the full data set shown five times; in each of the 1st through 5th partitions, a different disjoint slice serves as the validation data (aka test data) while the remainder is the training data]
More on Cross-Validation
• Notes
  – cross-validation generates an approximate estimate of how well the learned model will do on "unseen" data
  – by averaging over different partitions it is more robust than just a single train/validate partition of the data
  – "k-fold" cross-validation is a generalization
    • partition data into k disjoint validation subsets of size n/k
    • train, validate, and average over the k partitions
    • e.g., k = 10 is commonly used
  – k-fold cross-validation is approximately k times computationally more expensive than just fitting a model to all of the data
• The extreme is k = n, also known as leave-one-out cross-validation or LOOCV.
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
count = 0
for train, test in kf.split(X):
    # kf.split yields index arrays for each train/test partition.
    print("Split %i: Train %s Test %s" % (count, train, test))
    count += 1
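And a sketch of the cross-validation accuracy itself via scikit-learn's cross_val_score; the classifier and dataset are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Acc(i) for each of the k = 10 folds, then their average.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)
print("per-fold accuracy:", scores)
print("Cross-Validation-Accuracy:", scores.mean())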
Ensembles
Bagging
• We draw the samples with replacement; this is called bootstrapping
• Each model is trained on its own bootstrap sample, and the models' predictions are aggregated (bagging = bootstrap aggregating), as in the sketch below
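A bagging sketch with scikit-learn; the dataset is an illustrative assumption (the base model defaults to a decision tree).

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# 50 models, each trained on a bootstrap sample (drawn with
# replacement); predictions are aggregated by majority vote.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:3]))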
Boosting
• Boosting is one of the most widely used ensemble techniques. It is a sequential process, where each model attempts to correct the predictions of the previous models. Boosting models construct multiple weak models and then combine them into a strong model. A single weak model may perform well on a subset of the dataset; combining them increases the overall performance (see the sketch below).
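A boosting sketch using AdaBoost, one common boosting algorithm, in scikit-learn; the dataset is an illustrative assumption (the weak models default to depth-1 trees).

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Weak models are fit sequentially; each round focuses on the
# examples the previous models got wrong.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())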
Random Forest
• Step 1: Create a bootstrapped (BS) dataset
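A minimal random-forest sketch in scikit-learn: each tree is fit on its own bootstrapped dataset (Step 1) and considers a random subset of features at each split; the dataset is an illustrative assumption.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrapped dataset; the forest
# predicts by majority vote over the trees.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))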
Inductive learning method
• Construct/adjust h to agree with f on the training set
• (h is consistent if it agrees with f on all examples)
• E.g., curve fitting

[Figure: curve-fitting examples]
How Overfitting affects Prediction

[Figure: predictive error vs. model complexity. The error on training data decreases as complexity grows, while the error on test data falls and then rises again; underfitting occurs at low complexity, overfitting at high complexity, and the ideal range for model complexity lies between them]
Underfitting, Overfitting, Bias and Variance
References
• An overview of ML. Speaker: Yi-Fan Chang
• Slides: Isabelle Guyon, Erik Sudderth, Mark Johnson, Derek Hoiem, Lana Lazebnik
• https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer
• IST 511 Information Management: Information and Technology; Machine Learning. Dr. C. Lee Giles, David Reese Professor, College of Information Sciences and Technology
• Wikipedia
• https://pdfs.semanticscholar.org/1c54/68ca9da39aaf7ba8850c0eafd8c86acf5c09.pdf
References
• Introduction to Data Mining by Tan, Steinbach, Kumar (Lecture Slides)
• http://robotics.stanford.edu/~ronnyk/glossary.html
• http://www.cs.tufts.edu/comp/135/Handouts/introductionlecture-12-handout.pdf
• https://www.educative.io/