Lecture 1: Introduction
Machine Learning
CUNY Graduate Center
Today
• Welcome
• Overview of Machine Learning
• Class Mechanics
• Syllabus Review
• Basic Classification Algorithm
My research and background
• Speech
– Analysis of Intonation
– Segmentation
• Natural Language Processing
– Computational Linguistics
• Evaluation Measures
• All of this research relies heavily on Machine Learning
You
• Why are you taking this class?
• For Ph.D. students:
– What is your dissertation on?
– Do you expect it to require Machine Learning?
• What is your background and comfort with:
– Calculus
– Linear Algebra
– Probability and Statistics
• What is your programming language of preference?
– C++, Java, and Python are preferred.
Machine Learning
• Automatically identifying patterns in data
• Automatically making decisions based on data
• Hypothesis:
Data → Learning Algorithm → Behavior   ≥   Data → Programmer → Behavior
(That is, behavior learned automatically from data can match or exceed behavior hand-coded by a programmer from the same data.)
Machine Learning in Computer Science
• Machine Learning sits at the center of many areas of computer science:
– Speech/Audio Processing
– Natural Language Processing
– Biomedical/Chemical Informatics
– Robotics
– Planning
– Human Computer Interaction
– Vision/Image Processing
– Financial Modeling
– Analytics
Major Tasks
• Regression
– Predict a numerical value from “other information”
• Classification
– Predict a categorical value
• Clustering
– Identify groups of similar entities
• Evaluation
Feature Representations
• How do we view data?
Entity in the World (web page, user behavior, speech or audio data, vision, wine, people, etc.)
→ Feature Extraction → Feature Representation → Machine Learning Algorithm
• Our focus: the feature representation and the machine learning algorithm.
Feature Representations

Height  Weight  Eye Color  Gender
66      170     Blue       Male
73      210     Brown      Male
72      165     Green      Male
70      180     Blue       Male
74      185     Brown      Male
68      155     Green      Male
65      150     Blue       Female
64      120     Brown      Female
63      125     Green      Female
67      140     Blue       Female
68      145     Brown      Female
66      130     Green      Female
Classification
• Identify which of N classes a data point, x, belongs to.
• x is a column vector of features: x = (x_1, x_2, …, x_D)^T.
Target Values
• In supervised approaches, in addition to a data point, x, we will also have access to a target value, t.
Goal of Classification: identify a function y such that y(x) = t.
Feature Representations

Height  Weight  Eye Color  Gender
66      170     Blue       Male
73      210     Brown      Male
72      165     Green      Male
70      180     Blue       Male
74      185     Brown      Male
68      155     Green      Male
65      150     Blue       Female
64      120     Brown      Female
63      125     Green      Female
67      140     Blue       Female
68      145     Brown      Female
66      130     Green      Female
Graphical Example of Classification
[Figures: a sequence of scatter plots of labeled points; query points marked “?” are assigned labels by the learned classifier.]
Decision Boundaries
[Figure: decision boundaries separating the classes in feature space.]
Regression
• Regression is a supervised machine learning task.
– So a target value, t, is given.
• Classification: nominal t
• Regression: continuous t
Goal of Regression: identify a function y such that y(x) = t.
Differences between Classification and Regression
• Similar goals: identify y(x) = t.
• What are the differences?
– The form of the function, y (naturally).
– Evaluation
• Root Mean Squared Error
• Absolute Value Error
• Classification Error
• Maximum Likelihood
– Evaluation drives the optimization operation that learns the function, y (see the sketch below).
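To make the first three of these measures concrete, here is a minimal Python sketch (the function names and interfaces are ours, not from the lecture):

```python
import math

def rmse(predictions, targets):
    """Root Mean Squared Error: a common regression measure."""
    return math.sqrt(sum((y - t) ** 2 for y, t in zip(predictions, targets)) / len(targets))

def absolute_error(predictions, targets):
    """Mean absolute (value) error for regression."""
    return sum(abs(y - t) for y, t in zip(predictions, targets)) / len(targets)

def classification_error(predictions, targets):
    """Fraction of points labeled incorrectly."""
    return sum(y != t for y, t in zip(predictions, targets)) / len(targets)
```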
Graphical Example of Regression
[Figures: a scatter of (x, t) points, a query point marked “?”, and a fitted curve predicting its value.]
Clustering
• Clustering is an unsupervised learning task.
– There is no target value to shoot for.
• Identify groups of “similar” data points that are “dissimilar” from others.
• Partition the data into groups (clusters) that satisfy two constraints (one concrete method is sketched below):
1. Points in the same cluster should be similar.
2. Points in different clusters should be dissimilar.
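The slide states the constraints but no algorithm; as one illustration, here is a minimal sketch of k-means, a common clustering method (k-means itself is not introduced on this slide):

```python
import random

def kmeans(points, k, iters=20):
    """Group points (a list of numeric tuples) into k clusters of nearby points."""
    centroids = random.sample(points, k)           # initialize with k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # constraint 1: same cluster, similar points
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):          # move each centroid to its cluster's mean
            if cl:
                centroids[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return clusters
```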
Graphical Example of Clustering
[Figures: an unlabeled scatter of points, then the same points grouped into clusters.]
Mechanisms of Machine Learning
• Statistical Estimation
– Numerical Optimization
– Theoretical Optimization
• Feature Manipulation
• Similarity Measures
Mathematical Necessities
• Probability
• Statistics
• Calculus
– Vector Calculus
• Linear Algebra
• Is this a Math course in disguise?
Why do we need so much math?
• Probability Density Functions allow the evaluation of how likely a data point is under a model.
– We want to identify good PDFs (calculus).
– We want to evaluate a data point against a known PDF (algebra).
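For instance, evaluating a data point against a known Gaussian PDF is pure algebra; a minimal sketch (the mean and standard deviation below are made-up numbers):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of x under a Gaussian model with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

print(gaussian_pdf(66.0, mu=66.0, sigma=3.0))  # a point at the mean: high density
print(gaussian_pdf(80.0, mu=66.0, sigma=3.0))  # a distant point: much lower density
```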
Gaussian Distributions
• We use Gaussian Distributions all over the place.
[Figures: examples of Gaussian distributions.]
Class Structure and Policies
• Course website:
– http://eniac.cs.qc.cuny.edu/andrew/gcml-11/syllabus.html
• Google Group for discussions and announcements:
– http://groups.google.com/gcml-spring2011
– Please sign up for the group ASAP.
– Or put your email address on the sign-up sheet, and you will be sent an invitation.
Data Data Data
• “There’s no data like more data.”
• All machine learning techniques rely on the availability of data to learn from.
• There is an ever-increasing amount of data being generated, but it’s not always easy to process.
• UCI Machine Learning Repository
– http://archive.ics.uci.edu/ml/
• LDC (Linguistic Data Consortium)
– http://www.ldc.upenn.edu/
Half time.
Get Coffee.
Stretch.
Decision Trees
• Classification Technique.
[Figure: a decision tree over the height/weight/eye color data: the root tests eye color (blue, green, brown), internal nodes test height (h) and weight (w) against break points such as <64, <66, <140, <145, <150, and <170, and leaves predict m or f.]
Decision Trees
[Figure: the same decision tree as above.]
• Very easy to evaluate.
• Nested if statements (see the sketch below).
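For instance, the tree trained at the end of this lecture compiles directly to nested if statements (a sketch in Python; variable names are ours):

```python
def classify(height, weight):
    """The tree trained later in this lecture (weight < 165, then height < 68,
    then weight < 155), written as nested if statements."""
    if weight < 165:
        if height < 68:
            return "Female"
        else:
            if weight < 155:
                return "Female"
            else:
                return "Male"
    else:
        return "Male"

print(classify(height=66, weight=170))  # Male
print(classify(height=63, weight=125))  # Female
```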
More Formal Definition of a Decision Tree
• A tree data structure.
• Each internal node corresponds to a feature.
• Leaves are associated with target values.
• Nodes with nominal features have N children, where N is the number of nominal values.
• Nodes with continuous features have two children, for values less than and greater than or equal to a break point (a minimal data-structure sketch follows).
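A minimal Python sketch of this data structure (the field and method names are our own, not from the slides):

```python
class Node:
    """A decision tree node: internal nodes test a feature, leaves hold a target value."""
    def __init__(self, feature=None, break_point=None, children=None, label=None):
        self.feature = feature          # feature tested at this internal node
        self.break_point = break_point  # threshold, if the feature is continuous
        self.children = children or {}  # nominal value -> child, or {"lt": ..., "gte": ...}
        self.label = label              # target value, if this node is a leaf

    def predict(self, x):
        if self.label is not None:          # leaf: return its target value
            return self.label
        v = x[self.feature]
        if self.break_point is not None:    # continuous: two children around the break point
            child = self.children["lt"] if v < self.break_point else self.children["gte"]
        else:                               # nominal: one child per value
            child = self.children[v]
        return child.predict(x)

# A one-level stub, just to show the interface:
tree = Node(feature="weight", break_point=165,
            children={"lt": Node(label="Female"), "gte": Node(label="Male")})
print(tree.predict({"weight": 170, "height": 66}))  # Male
```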
Training a Decision Tree
• How do you decide what feature to use?
• For continuous features, how do you decide what break point to use?
• Goal: optimize classification accuracy (one way to scan break points is sketched below).
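One simple way to pick a break point for a continuous feature is to scan the observed values and keep whichever two-way split classifies the most points correctly. A sketch of that idea (the slides state the goal, not this exact procedure):

```python
from collections import Counter

def best_break_point(values, labels):
    """Return (break_point, accuracy) for the best two-way split of one
    continuous feature, labeling each side with its majority class."""
    best_bp, best_acc = None, 0.0
    for bp in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v < bp]
        right = [l for v, l in zip(values, labels) if v >= bp]
        # Each non-empty side contributes its majority-class count.
        correct = sum(Counter(side).most_common(1)[0][1] for side in (left, right) if side)
        acc = correct / len(labels)
        if acc > best_acc:
            best_bp, best_acc = bp, acc
    return best_bp, best_acc

# Hypothetical toy feature, not the slide's table:
print(best_break_point([1, 2, 3, 10, 11, 12], list("AAABBB")))  # (10, 1.0)
```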
Example Data Set

Height  Weight  Eye Color  Gender
66      170     Blue       Male
73      210     Brown      Male
72      165     Green      Male
70      180     Blue       Male
74      185     Brown      Male
68      155     Green      Male
65      150     Blue       Female
64      120     Brown      Female
63      125     Green      Female
67      140     Blue       Female
68      145     Brown      Female
66      130     Green      Female
Baseline Classification Accuracy
• Select the majority class.
– Here, 6/12 Male and 6/12 Female.
– Baseline accuracy: 50%.
• How good is each branch?
– The improvement to classification accuracy.
Training Example
• Possible branches: eye color
– blue: 2M / 2F    brown: 2M / 2F    green: 2M / 2F
• 50% accuracy before branch
• 50% accuracy after branch
• 0% accuracy improvement
Example Data Set (sorted by height)

Height  Weight  Eye Color  Gender
63      125     Green      Female
64      120     Brown      Female
65      150     Blue       Female
66      170     Blue       Male
66      130     Green      Female
67      140     Blue       Female
68      145     Brown      Female
68      155     Green      Male
70      180     Blue       Male
72      165     Green      Male
73      210     Brown      Male
74      185     Brown      Male
Training Example
• Possible branches: height < 68
– height < 68: 1M / 5F    height ≥ 68: 5M / 1F
• 50% accuracy before branch
• 83.3% accuracy after branch
• 33.3% accuracy improvement
Example Data Set (sorted by weight)

Height  Weight  Eye Color  Gender
64      120     Brown      Female
63      125     Green      Female
66      130     Green      Female
67      140     Blue       Female
68      145     Brown      Female
65      150     Blue       Female
68      155     Green      Male
72      165     Green      Male
66      170     Blue       Male
70      180     Blue       Male
74      185     Brown      Male
73      210     Brown      Male
Training Example
• Possible branches: weight < 165
– weight < 165: 1M / 6F    weight ≥ 165: 5M
• 50% accuracy before branch
• 91.7% accuracy after branch
• 41.7% accuracy improvement
Training Example
• Recursively train child nodes.
weight < 165?
  yes → height < 68?
    yes → 5F
    no  → 1M / 1F (train this child further)
  no  → 5M
Training Example
• Finished tree:
weight < 165?
  yes → height < 68?
    yes → 5F
    no  → weight < 155?
      yes → 1F
      no  → 1M
  no  → 5M
Generalization
• What is the performance of the tree on the training data?
– Is there any way we could get less than 100% accuracy?
• What performance can we expect on unseen data?
Evaluation
• Evaluate performance on data that was not used in training.
• Isolate a subset of data points to be used for evaluation (a holdout sketch follows).
• Evaluate generalization performance.
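A minimal sketch of isolating a held-out evaluation set (the interface is ours; the slide describes only the idea):

```python
import random

def train_test_split(points, labels, held_out_fraction=0.25, seed=0):
    """Randomly hold out a fraction of the data for evaluation;
    the rest is used for training."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * held_out_fraction)
    test, train = idx[:cut], idx[cut:]
    return ([points[i] for i in train], [labels[i] for i in train],
            [points[i] for i in test], [labels[i] for i in test])
```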
Evaluation of our Decision Tree
• What is the training performance?
• What is the evaluation performance?
– It never classifies a point with weight ≥ 165 as female.
– It never classifies a point with weight < 165 and height < 68 as male.
– The middle region is trickier.
• What are some ways to make training and evaluation performance similar?
Pruning
• There are many pruning techniques.
• A simple approach is to have a minimum membership size in each node (a sketch follows the figure).
[Figure: the tree before and after pruning. Before, the height ≥ 68 branch ends in a weight < 155 test with single-point leaves (1F, 1M); after, that test is collapsed into a single 1F / 1M leaf:]
weight < 165?
  yes → height < 68?
    yes → 5F
    no  → 1F / 1M (pruned leaf)
  no  → 5M
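A sketch of the minimum-membership idea, built on the Node class from the earlier sketch; it assumes each node also stores the labels of the training points that reached it in a `members` list, an attribute we invent here for illustration:

```python
from collections import Counter

def prune(node, min_size):
    """Collapse an internal node into a majority-class leaf whenever one of
    its children would hold fewer than min_size training points.
    Assumes node.members (a list of training labels) exists on every node."""
    if node.label is not None:                      # already a leaf
        return node
    if any(len(child.members) < min_size for child in node.children.values()):
        majority = Counter(node.members).most_common(1)[0][0]
        return Node(label=majority)                 # e.g. the 1F / 1M node above
    node.children = {k: prune(c, min_size) for k, c in node.children.items()}
    return node
```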
Decision Tree Recap
• Training via recursive partitioning.
• Simple, interpretable models.
• Different node selection criteria can be used.
– Information theory is a common choice.
• Pruning techniques can be used to make the model more robust to unseen data.
Next Time: Math Primer
• Probability
– Bayes Rule
– Naïve Bayes Classification
• Calculus
– Vector Calculus
• Optimization
– Lagrange Multipliers