Machine Learning 10601
Recitation 6
Sep 30, 2009
Oznur Tastan
Outline
• Multivariate Gaussians
• Logistic regression
Multivariate Gaussians (also called the "multinormal distribution" or "multivariate normal distribution")
Univariate case: a single mean $\mu$ and variance $\sigma^2$:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Multivariate case: a vector of observations $x$, a vector of means $\mu$ and a covariance matrix $\Sigma$:
$$p(x) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$
where $m$ is the dimension of $x$ and $|\Sigma|$ is the determinant of $\Sigma$.
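As a concrete illustration, here is a minimal Matlab sketch that evaluates this density at a single point; the values of mu, Sigma and x are made up for the example and are not from the recitation.

% Evaluate the multivariate Gaussian density p(x) at a point x,
% given a mean vector mu and covariance matrix Sigma.
mu    = [0; 0];              % mean vector (hypothetical)
Sigma = [1 0.5; 0.5 2];      % covariance matrix (symmetric, positive definite)
x     = [1; -1];             % point at which to evaluate the density
m     = length(mu);          % dimension of x
d     = x - mu;              % deviation from the mean
p     = exp(-0.5 * d' * inv(Sigma) * d) / ((2*pi)^(m/2) * sqrt(det(Sigma)));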
Multivariate Gaussians
In both the univariate and the multivariate case the density has the same structure: a normalization constant in front that does not depend on x, multiplied by an exponential term that depends on x and is always positive.
The mean vector
$$\mu = E(x) = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{bmatrix}$$
Covariance of two random variables
Recall that for two random variables $x_i$, $x_j$:
$$\sigma_{ij}^2 = \mathrm{Cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)] = E(x_i x_j) - E(x_i)E(x_j)$$
The covariance matrix
$$\Sigma = E[(x - \mu)(x - \mu)^{T}]$$
where $T$ is the transpose operator. Written out,
$$\Sigma = E\!\left[\begin{bmatrix} x_1 - \mu_1 \\ \vdots \\ x_m - \mu_m \end{bmatrix}\begin{bmatrix} x_1 - \mu_1 & \cdots & x_m - \mu_m \end{bmatrix}\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12}^2 & \cdots & \sigma_{1m}^2 \\ \sigma_{21}^2 & \sigma_2^2 & \cdots & \sigma_{2m}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{m1}^2 & \sigma_{m2}^2 & \cdots & \sigma_m^2 \end{bmatrix}$$
The diagonal entries are the variances: $\mathrm{Var}(x_m) = \mathrm{Cov}(x_m, x_m)$.
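For concreteness, here is a minimal Matlab sketch that estimates the mean vector and covariance matrix from a data matrix; the data matrix X below is a made-up example, not from the recitation.

% Estimate the mean vector and covariance matrix from data.
% X is an N-by-m data matrix: N observations of m variables.
X  = randn(100, 3);            % hypothetical data
N  = size(X, 1);
mu = mean(X);                  % 1-by-m sample mean vector
Xc = X - repmat(mu, N, 1);     % subtract the mean from every row
Sigma = (Xc' * Xc) / (N - 1);  % m-by-m sample covariance matrix
% Matlab's built-in cov(X) computes the same matrix.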
An example: the 2-variate case
Take a diagonal covariance matrix
$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}, \qquad |\Sigma| = \sigma_1^2\sigma_2^2.$$
The pdf of the multivariate Gaussian will be
$$p(x_1, x_2) = \frac{1}{2\pi\,\sigma_1\sigma_2}\exp\!\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right),$$
which is factorized into two independent univariate Gaussians: $x_1$ and $x_2$ are independent!
Recall that in the general case independence implies uncorrelatedness, but uncorrelatedness does not necessarily imply independence. The multivariate Gaussian is a special case where uncorrelatedness implies independence as well.
Diagonal covariance matrix
If all the variables are independent of each other, the covariance matrix will be diagonal. For Gaussians the reverse is also true: if the covariance matrix is diagonal, the variables are independent.
$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$$
A diagonal matrix is an $m \times m$ matrix whose off-diagonal terms are zero:
$$\sigma_{ij}^2 = E[(x_i - \mu_i)(x_j - \mu_j)] = 0 \quad \text{for } i \neq j.$$
Gaussian Intuitions: Size of Σ
[Figure: three Gaussians with the same mean μ = [0 0] and covariances Σ = I (the identity matrix), Σ = 0.6I, and Σ = 2I.]
As Σ becomes larger, the Gaussian becomes more spread out.
Gaussian Intuitions: Off-diagonal
As the off-diagonal entries increase, there is more correlation between the value of x and the value of y.
Gaussian Intuitions: off-diagonal and diagonal
[Figure: decreasing the off-diagonal entries (panels 1-2); increasing the variance of one dimension on the diagonal (panel 3).]
Isocontours
An isocontour is the set of points where the density takes a constant value.
Isocontours example
We have shown that, for the diagonal 2-variate case,
$$p(x_1, x_2) = \frac{1}{2\pi\,\sigma_1\sigma_2}\exp\!\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right).$$
Now let's try to find, for some constant $c$, the isocontour $p(x_1, x_2) = c$. Taking logarithms and collecting the constants gives
$$\frac{(x_1-\mu_1)^2}{2\sigma_1^2} + \frac{(x_2-\mu_2)^2}{2\sigma_2^2} = \log\frac{1}{2\pi c\,\sigma_1\sigma_2}.$$
Define
$$r_1 = \sqrt{2\sigma_1^2\log\frac{1}{2\pi c\,\sigma_1\sigma_2}}, \qquad r_2 = \sqrt{2\sigma_2^2\log\frac{1}{2\pi c\,\sigma_1\sigma_2}},$$
so the isocontour becomes
$$\left(\frac{x_1-\mu_1}{r_1}\right)^2 + \left(\frac{x_2-\mu_2}{r_2}\right)^2 = 1,$$
the equation of an ellipse centered on $(\mu_1, \mu_2)$ with axis lengths $2r_1$ and $2r_2$.
We had started with a diagonal covariance matrix: in the diagonal covariance matrix case the ellipses will be axis aligned.
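To visualize this, here is a small Matlab sketch (my own illustration, with arbitrary parameter values) that draws the isocontours of a diagonal-covariance 2-variate Gaussian:

% Plot isocontours of a 2-variate Gaussian with diagonal covariance.
mu = [1 2]; s1 = 1; s2 = 0.5;              % hypothetical parameters
[X1, X2] = meshgrid(-3:0.05:5, -1:0.05:5); % grid of (x1, x2) points
P = exp(-0.5*((X1-mu(1)).^2/s1^2 + (X2-mu(2)).^2/s2^2)) / (2*pi*s1*s2);
contour(X1, X2, P);                        % axis-aligned ellipses centered on mu
xlabel('x_1'); ylabel('x_2');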
Don't confuse Multivariate Gaussians with Mixtures of Gaussians
A mixture of Gaussians:
$$p(x) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the $k$-th component and $\pi_k$ is its mixing coefficient (the slide's figure shows a mixture with K = 3 components).
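For illustration, here is a hedged Matlab sketch that evaluates a one-dimensional K = 3 mixture density at a point; the component parameters are made up, not from the slides.

% Evaluate a 1-D mixture of K = 3 Gaussians at a point x.
pis  = [0.5 0.3 0.2];   % mixing coefficients (sum to 1)
mus  = [-2  0   3 ];    % component means
sigs = [0.5 1   1.5];   % component standard deviations
x    = 0.7;
p = 0;
for k = 1:3
    p = p + pis(k) * exp(-0.5*((x - mus(k))/sigs(k))^2) / (sqrt(2*pi)*sigs(k));
end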
Logistic regression
In linear regression the outcome variable Y is continuous; in logistic regression the outcome variable Y is binary.
Logistic function (sigmoid function)
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The term $e^{-z}$ lies in $[0, \infty)$, so $\sigma(z)$ is always bounded between $[0, 1]$ (a nice property); as $z$ increases $\sigma(z)$ approaches 1, and as $z$ decreases $\sigma(z)$ approaches 0.
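As a quick check of these properties, a small illustrative Matlab sketch evaluating the logistic function at a few points:

% The logistic (sigmoid) function is bounded between 0 and 1.
sigma = @(z) 1 ./ (1 + exp(-z));   % anonymous function handle
sigma(0)     % 0.5
sigma(10)    % close to 1 as z increases
sigma(-10)   % close to 0 as z decreases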
Logistic regression
Learn a function that maps X values to Y given data $(X^1, Y^1), \ldots, (X^N, Y^N)$. Y is discrete (binary); X can be continuous or discrete. The function we try to learn is P(Y|X).
Logistic regression
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}$$
$$P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{\exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}{1 + \exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}$$
where $n$ is the number of features $X_i$.
Classification
Assign Y = 0 if
$$1 < \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)};$$
if this holds, Y = 0 is more probable than Y = 1 given X. Substituting
$$P(Y = 0 \mid X) = \frac{\exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}{1 + \exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)} \quad\text{and}\quad P(Y = 1 \mid X) = \frac{1}{1 + \exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)},$$
the ratio simplifies to
$$1 < \exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right).$$
Take the log of both sides to get the classification rule: if
$$0 < w_0 + \sum_{i=1}^{n} w_i X_i$$
holds, classify Y = 0.
Logistic regression is a linear classifier.
Decision boundary
The decision boundary is the set of points where $w_0 + \sum_{i=1}^{n} w_i X_i = 0$, i.e., where $P(Y = 0 \mid X) = P(Y = 1 \mid X) = 0.5$.
Predict Y = 0 where $w_0 + \sum_{i=1}^{n} w_i X_i > 0$, and Y = 1 where $w_0 + \sum_{i=1}^{n} w_i X_i < 0$.
Classification example
Consider a single feature $X_1$, so $P(Y = 1 \mid X) = \frac{1}{1 + e^{w_0 + w_1 X_1}}$.
Left panel: $w_0 = +2$, $w_1 = -1$. To check, evaluate at $X_1 = 0$: $P(Y = 1 \mid X) = 1/(1 + e^{2}) \approx 0.1$. Notice $P(Y = 1 \mid X)$ is 0.5 when $X_1 = 2$: setting $w_0 + w_1 X_1 = 0$ gives $2 + (-1)X_1 = 0$, i.e., $X_1 = 2$. Points with $X_1 < 2$ are classified as Y = 0.
Right panel: $w_0 = 0$, $w_1 = -1$. $P(Y = 1 \mid X)$ is 0.5 when $X_1 = 0$: setting $w_0 + w_1 X_1 = 0$ gives $0 + (-1)X_1 = 0$, i.e., $X_1 = 0$. Points with $X_1 < 0$ are classified as Y = 0.
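The left panel's numbers can be reproduced with a short Matlab sketch; this is my own illustration of the rule above using the slide's values w0 = 2 and w1 = -1.

% Worked example: P(Y=1|X) = 1/(1 + exp(w0 + w1*X1)) with w0 = 2, w1 = -1.
w0 = 2; w1 = -1;
X1 = 0;
pY1  = 1 / (1 + exp(w0 + w1*X1));   % about 0.12, so Y = 0 is more probable
% Decision rule: classify Y = 0 when w0 + w1*X1 > 0, i.e. when X1 < 2.
yhat = double(w0 + w1*X1 <= 0);     % predicted label (1 means Y = 1)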
Estimating the parameters
Given data $(X^1, Y^1), \ldots, (X^N, Y^N)$, the objective is
$$\arg\max_{w} \prod_{i=1}^{N} P(Y^i \mid X^i, w).$$
Train the model to get the w that maximizes the conditional likelihood.
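One common way to maximize this conditional likelihood is gradient ascent on the conditional log-likelihood. The sketch below is a minimal, hypothetical Matlab implementation: the data, step size and iteration count are made up, the slides do not prescribe an optimizer, and it uses the standard sigmoid convention P(Y=1|X) = sigma(w0 + sum_i wi*Xi), which simply flips the sign of the weights relative to the slides.

% Gradient ascent on the conditional log-likelihood of logistic regression.
% X is N-by-n (N examples, n features), Y is N-by-1 with entries 0 or 1.
X = randn(50, 2); Y = double(X(:,1) + X(:,2) > 0);   % hypothetical data
[N, n] = size(X);
w0 = 0; w = zeros(n, 1);
eta = 0.1;                                % step size (made up)
for iter = 1:500
    p  = 1 ./ (1 + exp(-(w0 + X*w)));     % P(Y=1|X) for every example
    g0 = sum(Y - p);                      % gradient w.r.t. w0
    g  = X' * (Y - p);                    % gradient w.r.t. the weights
    w0 = w0 + eta * g0 / N;
    w  = w  + eta * g  / N;
end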
Difference between Naïve Bayes and Logistic Regression
The loss function! Optimizing different functions yields different solutions.
Naïve Bayes: argmax P(X|Y) P(Y)
Logistic Regression: argmax P(Y|X)
Naïve Bayes and Logistic Regression
• Have a look at Tom Mitchell's book chapter
http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf
Linked under Sep 23 Lecture Readings as well.
Some Matlab tips for the last question in HW3
• The logical function might be useful for dividing the data into splits. An example of logical in use (please read the Matlab help):
S = X(logical(X(:,1)==1), :)
This will also work: S = X(X(:,1)==1, :)
Both subset the portion of the X matrix where the first column has value 1 and put it in matrix S (like Data > Filter in Excel).
• Matlab has functions for mean, std, sum, inv, log2.
• Scaling data to zero mean and unit variance: shift by the mean (subtract the mean from every element of the vector) and scale so that the variance is 1 (divide every element of the vector by the standard deviation). To do this on matrices you will need the repmat function; have a look at it, otherwise the sizes of the matrices will not match. A small sketch follows these tips.
• For elementwise multiplication use .*
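Here is a minimal sketch of the scaling described above; this is my own example, assuming column-wise scaling of an N-by-m data matrix.

% Scale each column of X to zero mean and unit variance using repmat.
X  = rand(10, 3);                     % hypothetical N-by-m data matrix
N  = size(X, 1);
mu = mean(X);                         % 1-by-m vector of column means
sd = std(X);                          % 1-by-m vector of column standard deviations
Z  = (X - repmat(mu, N, 1)) ./ repmat(sd, N, 1);   % elementwise division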
References
• http://www.stanford.edu/class/cs224s/lec/224s.09.lec10.pdf
• http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf
• Carlos Guestrin lecture notes
• Andrew Ng lecture notes