Machine Learning
Week 1, Lecture 2
Recap
[Figure: the supervised learning setup. An unknown target f, a data set of examples, a hypothesis set, and a learning algorithm that outputs a hypothesis h with h(x) ≈ f(x).]
Examples: classification (digit examples 5 0 4 1 9 2 1 3 1 4) and regression.
Linear classification: a hyperplane with normal vector w splits the space into a halfspace < 0 and a halfspace > 0.
NP-hard in general.
Assume the data is linearly separable!
The perceptron algorithm finds a separating hyperplane.
Today
• Convex Optimization
– Convex sets
– Convex functions
• Logistic Regression
– Maximum Likelihood
– Gradient Descent
• Maximum likelihood and Linear Regression
Convex Optimization
Optimization problems are in general very hard (if solvable at all)!
For convex optimization problems, theoretical (polynomial time) and practical solutions exist (most of the time).
Example:
Convex Sets
A set C is convex if the “line” (line segment) from x to y stays in the set for any x, y ∈ C: θx + (1 − θ)y ∈ C for all θ ∈ [0, 1].
Convex Set
Non-convex Set
Convex Sets
The intersection of convex sets is convex.
The union of convex sets may not be convex.
Convex Functions
f is convex if for all x, y in its (convex) domain and all θ ∈ [0, 1]:
f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)
i.e. the chord from (x, f(x)) to (y, f(y)) lies on or above the graph of f.
f is concave if −f is convex.
Concave? Convex? Both?
Differentiable Convex Functions
A differentiable f is convex if and only if for all x, y in its domain:
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)
Example in 1D: the tangent line f(x) + f′(x)(y − x) lies below the point (y, f(y)); the first-order approximation is a global underestimator.
Twice Differentiable Convex Functions
f is convex if the Hessian is positive semi-definite for all x.
A real symmetric matrix A is positive semidefinite if xᵀA x ≥ 0 for all (nonzero) x.
1D: f is convex if f″(x) ≥ 0 for all x.
Simple 2D Example
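A minimal sketch of the second-order check (the concrete 2D example from the slide is not reproduced here, so the quadratic below is an assumption): compute the Hessian and look at its eigenvalues.

% Check convexity of f(x) = x'*A*x for a symmetric 2x2 matrix A (illustrative choice).
A = [2 1; 1 2];            % symmetric matrix, assumption for illustration
H = A + A';                % Hessian of f(x) = x'*A*x (equals 2*A for symmetric A)
ev = eig(H);               % real eigenvalues, since H is symmetric
isConvex = all(ev >= 0);   % positive semidefinite Hessian  =>  f is convex
fprintf('Eigenvalues: %s, convex: %d\n', mat2str(ev'), isConvex);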
More Examples
Affine functions: f(x) = aᵀx + b (both convex and concave).
Quadratic functions: f(x) = xᵀA x + bᵀx + c
Convex if A is positive semidefinite.
Convexity of Linear Regression
Least squares is a quadratic function:
E(w) = ‖Xw − y‖² = wᵀ(XᵀX)w − 2yᵀXw + yᵀy, i.e. A = XᵀX.
Convex if A is positive semidefinite.
XᵀX is real and symmetric, and clearly positive semidefinite: vᵀXᵀXv = ‖Xv‖² ≥ 0 for all v.
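A quick numerical sanity check (the random data below is an assumption for illustration): the Hessian of the least squares cost, 2·XᵀX, is always positive semidefinite.

% Smallest eigenvalue of the least squares Hessian should be >= 0 (up to round-off).
X = randn(20, 3);                  % 20 hypothetical data points, 3 features
H = 2 * (X' * X);                  % Hessian of ||X*w - y||^2
fprintf('Smallest Hessian eigenvalue: %g\n', min(eig(H)));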
Epigraph
Connection between convex sets and convex functions:
epi(f) = { (x, t) : x ∈ dom f, t ≥ f(x) }
f is convex if and only if epi(f) is a convex set.
Sublevel sets
For a convex function f, define the α-sublevel set:
C_α = { x ∈ dom f : f(x) ≤ α }
C_α is a convex set.
Convex Optimization
minimize f(x)
subject to gᵢ(x) ≤ 0, i = 1, …, m
           hⱼ(x) = 0, j = 1, …, p
where f and the gᵢ are convex and the hⱼ are affine.
Local minima are global minima: if f(y) < f(x), then by convexity the points θy + (1 − θ)x arbitrarily close to x have value at most θ f(y) + (1 − θ) f(x) < f(x), so x cannot be a local minimum.
Examples of Convex Optimization
• Linear Programming: minimize cᵀx subject to Ax ≤ b
• Quadratic Programming: minimize ½ xᵀP x + qᵀx subject to Ax ≤ b (P is positive semidefinite)
(a small solver sketch follows below)
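A small solver sketch (Matlab style, assuming the Optimization Toolbox functions linprog and quadprog are available; the toy problems are illustrative, not from the slides):

% LP: maximize x1 + x2 (i.e. minimize -x1 - x2) over the triangle x >= 0, x1 + x2 <= 1.
f = [-1; -1];
A = [1 1; -1 0; 0 -1];  b = [1; 0; 0];
x_lp = linprog(f, A, b);                  % some point on the edge x1 + x2 = 1

% QP: minimize 0.5*x'*P*x + q'*x with P positive semidefinite, subject to x >= 0.
P = eye(2);  q = [-1; -1];
x_qp = quadprog(P, q, -eye(2), [0; 0]);   % optimum at [1; 1]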
Summary
Rockafellar stated, in his 1993 SIAM Review survey paper:
“In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.”
Convex
GOOD!!!!
Estimating Probabilities
• Probability of getting cancer given your situation.
• Probability that AGF wins against Viborg given the last 5 results.
• Probability that a loan is not paid back, as a function of creditworthiness.
• Probability of a student getting an A in Machine Learning given his grades.
The data consists of actual events, not probabilities, e.g. some students who failed and some who did not…
Breast Cancer
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
Input Features
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
Target Function
• benign
• malignant
PREDICT THE PROBABILITY OF BENIGN AND MALIGNANT ON FUTURE PATIENTS
Maximum Likelihood
Biased coin (bias θ = probability of heads).
Flip it n times independently (Bernoulli trials), count the number of heads k.
After seeing the data D, what can we infer about θ?
Fix θ: what is the probability of seeing D? The likelihood of the data:
P(D | θ) = θ^k (1 − θ)^(n−k)
Take logs: ln P(D | θ) = k ln θ + (n − k) ln(1 − θ)
Maximum Likelihood
Maximize the likelihood P(D | θ)  ⇔  minimize the negative log likelihood of the data (log is monotone):
NLL(θ) = −k ln θ − (n − k) ln(1 − θ)
Compute the gradient (derivative) and solve for 0:
dNLL/dθ = −k/θ + (n − k)/(1 − θ) = 0, which gives the maximum likelihood estimate θ = k/n.
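A minimal sketch of the coin example (the simulated flips and the grid of candidate biases are assumptions for illustration): the grid minimizer of the negative log likelihood should match the closed form k/n.

% Maximum likelihood estimate of a coin's bias.
n = 100;                                 % number of flips
theta_true = 0.7;                        % hypothetical true bias
flips = rand(n, 1) < theta_true;         % 1 = heads, 0 = tails
k = sum(flips);                          % number of heads
thetas = 0.01:0.01:0.99;                 % candidate biases
nll = -k * log(thetas) - (n - k) * log(1 - thetas);   % negative log likelihood
[~, idx] = min(nll);
fprintf('Closed form k/n = %.2f, grid argmin = %.2f\n', k/n, thetas(idx));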
Bayesian Perspective
Bayes Rule:
P(h | D) = P(D | h) · P(h) / P(D)
Want: the posterior P(h | D)
Need: the likelihood P(D | h), a prior P(h), and the normalizing factor P(D)
Posterior ∝ Likelihood × Prior
Bayesian Perspective
• Compute the probability of each hypothesis
• Pick the most likely one and use it for predictions (MAP = maximum a posteriori)
• Or compute expected values (a weighted average over all hypotheses; see the sketch below)
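A minimal sketch of the Bayesian view for the coin example (the uniform grid prior, the discretization, and the data of 7 heads in 10 flips are assumptions for illustration):

% Posterior over a grid of hypotheses for the coin bias.
thetas = 0.01:0.01:0.99;                            % discretized hypotheses
prior = ones(size(thetas)) / numel(thetas);         % uniform prior P(h)
n = 10;  k = 7;                                     % hypothetical data
likelihood = thetas.^k .* (1 - thetas).^(n - k);    % P(D | h)
posterior = likelihood .* prior;                    % likelihood x prior
posterior = posterior / sum(posterior);             % divide by normalizing factor P(D)
[~, map_idx] = max(posterior);                      % MAP hypothesis
expected_theta = sum(thetas .* posterior);          % expected value over all hypotheses
fprintf('MAP = %.2f, posterior mean = %.2f\n', thetas(map_idx), expected_theta);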
Logistic Regression
Hard threshold: sign(wᵀx).  Soft threshold: the logistic (sigmoid) function
σ(z) = 1 / (1 + e^(−z)),  giving P(y = 1 | x) = σ(wᵀx)
Assume independent data points, apply maximum likelihood (there is a Bayesian version too).
Can be, and is, used for classification: predict the most likely y.
Maximum Likelihood Logistic Regression
Negative log likelihood (with labels y ∈ {0, 1}):
NLL(w) = −Σᵢ [ yᵢ ln σ(wᵀxᵢ) + (1 − yᵢ) ln(1 − σ(wᵀxᵢ)) ]
The neg. log likelihood is convex, but we cannot solve ∇NLL(w) = 0 analytically (see the sketch below).
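A minimal sketch of the quantities above (Matlab style; the 0/1 label convention and the function name are assumptions, not necessarily the handin's exact setup):

function [nll, grad] = logistic_nll(w, X, y)
  % Negative log likelihood of logistic regression and its gradient.
  z = X * w;                       % linear scores, one per data point
  p = 1 ./ (1 + exp(-z));          % sigmoid: P(y = 1 | x)
  nll = -sum(y .* log(p) + (1 - y) .* log(1 - p));   % negative log likelihood
  grad = X' * (p - y);             % gradient with respect to w
end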
Descent Methods
Minimize f(x), where f is twice continuously differentiable.
Iteratively move toward a better solution:
• Pick a start point x
• Repeat until a stopping criterion is satisfied:
• Compute a descent direction v
• Line search: compute a step size t
• Update: x = x + t·v
Gradient Descent
Use the negative gradient as the descent direction: v = −∇f(x).
Line (Ray) Search
• Pick a start point x
• Repeat until a stopping criterion is satisfied:
• Compute a descent direction v
• Line search: compute a step size t
• Update: x = x + t·v
• Solve analytically (if possible)
• Backtracking ("backwards") search: start high and decrease until an improving step is found [SL 9.2] (see the sketch after this list)
• Fix t to a small constant
• Use the size of the gradient scaled by a small constant
• Start with a constant and let it decrease slowly, or decrease it when it gets too high
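A minimal sketch of the backtracking strategy (Matlab style; the parameters alpha and beta and the function name are assumptions in the spirit of [SL 9.2]):

function t = backtracking(f, grad_f, x, v, alpha, beta)
  % f: function handle, grad_f: gradient handle, x: current point, v: descent direction,
  % 0 < alpha < 0.5, 0 < beta < 1.
  t = 1;                                                % start high
  while f(x + t*v) > f(x) + alpha * t * (grad_f(x)' * v)
    t = beta * t;                                       % decrease until the improvement is sufficient
  end
end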
Stopping Criteria
• Gradient becomes very small
• Max number of iterations used
Gradient Descent for Linear Reg.
Cost: E(θ) = (1/n) Σᵢ (θᵀxᵢ − yᵢ)² = (1/n) ‖Xθ − y‖²
Gradient: ∇E(θ) = (2/n) Xᵀ(Xθ − y)
GD For Linear Regression
Matlab style
function theta = GD(X, y, theta)
  % Gradient descent for linear regression.
  LR = 0.1;                                         % learning rate (step size)
  for i = 1:50
    cost = (1/length(y)) * sum((X*theta - y).^2);   % current cost (mean squared error)
    grad = (1/length(y)) * 2 .* X' * (X*theta - y); % gradient of the cost
    theta = theta - LR * grad;                      % step in the negative gradient direction
  end
end
Note: we do not scale the gradient to a unit vector.
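A usage sketch (the synthetic data below is an assumption for illustration):

% Fit y = 3*x + 2 + noise with the GD function above.
n = 100;
x = randn(n, 1);
y = 3*x + 2 + 0.1*randn(n, 1);
X = [ones(n, 1) x];                % column of ones for the intercept
theta = GD(X, y, zeros(2, 1));     % start from theta = 0
disp(theta')                       % should be close to [2 3]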
Learning Rate
[Plots: the effect of different learning-rate choices on convergence.]
Gradient Descent Jumps Around
[Plot: gradient descent with exact line search, starting from (10, 1).]
Gradient Descent Running Time
• Total time = number of iterations × cost per iteration.
• The cost per iteration is usually not a problem.
• The number of iterations clearly depends on the choice of line search and stopping criterion.
– Very problem- and data-specific.
– Needs a lot of math to give bounds.
– We will not cover it in this course.
Gradient Descent for Logistic Regression
Handin 1!
Along with a multiclass extension.
Stochastic Gradient Descent
Pick a single data point at random and use the gradient of its term in the cost instead of the full gradient.
Mini-Batch Gradient Descent
Use K points chosen at random (and average their gradients).
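A minimal sketch of mini-batch gradient descent for linear regression (Matlab style; the batch size, learning rate, and iteration count are assumptions):

function theta = MiniBatchGD(X, y, theta)
  LR = 0.1;                                % learning rate
  K = 10;                                  % mini-batch size
  n = length(y);
  for iter = 1:500
    idx = randperm(n, K);                  % K points chosen at random
    Xb = X(idx, :);  yb = y(idx);
    grad = (2/K) .* Xb' * (Xb*theta - yb); % gradient estimate from the mini-batch
    theta = theta - LR * grad;
  end
end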
Linear Classification with K classes
• Use logistic regression all-vs-one:
– Train K classifiers, one for each class.
– The input X is the same; y is 1 for all elements from that class and 0 otherwise (all vs. one).
– Prediction: compute the probability under all K classifiers and output the class with the highest probability.
• Use softmax regression:
– An extension of the logistic function to K classes, in some sense (see the sketch below).
– Covered in Handin 1.
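A minimal sketch of the softmax function for K classes (the weight-matrix layout and the max-subtraction trick are implementation choices, not from the slides):

function p = softmax_probs(W, x)
  % W: K x d weight matrix, x: d x 1 input; returns K x 1 class probabilities.
  z = W * x;                  % one score per class
  z = z - max(z);             % subtract the max for numerical stability
  p = exp(z) / sum(exp(z));   % probabilities sum to 1
end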
Maximum Likelihood and Linear Regression (time to spare slide)
Assume: yᵢ = wᵀxᵢ + εᵢ, with εᵢ ~ N(0, σ²), independently.
Maximizing the likelihood is then equivalent to minimizing the least squares cost (see the derivation sketch below).
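A short derivation sketch (LaTeX; the only assumption is the Gaussian noise model stated above):

\[
P(D \mid w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right)
\]
\[
-\ln P(D \mid w) = \frac{n}{2}\ln(2\pi\sigma^2)
+ \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - w^\top x_i)^2
\]
Minimizing the negative log likelihood over \(w\) therefore minimizes \(\sum_i (y_i - w^\top x_i)^2\), the least squares cost.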
Today's Summary
• Convex Optimization
– Many definitions
– Local optima are global optima
– Usually theoretically and practically feasible
• Maximum Likelihood
– Use the likelihood P(D | h) as a proxy for the posterior P(h | D)
– Assume independent data
• Gradient Descent
– Minimize a function
– Iteratively find a better solution by taking local steps based on the gradient
– First-order method (uses the gradient)
– Other methods exist, e.g. second-order methods (use the Hessian)