
Recitation
Shuang Li, Amir Afsharinejad, Kaushik Patnaik
Machine Learning
CS 7641 / CSE/ISYE 6740, Fall 2014
Probability and Statistics
Basic Probability Concepts
A sample space S is the set of all possible outcomes of a
conceptual or physical, repeatable experiment. (S can be finite
or infinite.)
E.g., S may be the set of all possible outcomes of a dice roll: $S = \{1, 2, 3, 4, 5, 6\}$
E.g., S may be the set of all possible nucleotides of a DNA site: $S = \{A, C, G, T\}$
E.g., S may be the set of all possible time-space positions of an
aircraft on a radar screen.
An event A is any subset of S:
seeing "1" or "6" in a dice roll; observing a "G" at a site; UA007 being in a given space-time interval.
Probability
An event space E is the set of possible worlds in which the outcomes can occur,
e.g., all dice rolls, reading a genome, monitoring the radar signal.
A probability P(A) is a function that maps an event A onto the
interval [0,1]. P(A) is also called the probability measure or
probability mass of A.
[Figure: Venn diagram. The sample space of all possible worlds has total area 1; P(A) is the area of the oval containing the worlds in which A is true; the remaining worlds are those in which A is false.]
Kolmogorov Axioms
For any events $A, B \subseteq S$:
$0 \le P(A) \le 1$
$P(S) = 1$
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Random Variable
A random variable is a function that associates a unique
numerical value (a token) with every outcome of an experiment.
(The value of the r.v. will vary as the experiment is repeated.)
RV Distributions:
Continuous RV:
The outcome of observing the measured location of an aircraft
The outcome of recording the true location of an aircraft
Discrete RV:
The outcome of a dice-roll
The outcome of a coin toss
Discrete Prob. Distribution
A probability distribution P defined on a discrete sample space
S is an assignment of a non-negative real number P(s) to each
sample $s \in S$:
Probability Mass Function (PMF): $p_X(x_i) = P[X = x_i]$
Cumulative Mass Function (CMF): $F_X(x_i) = P[X \le x_i] = \sum_{j:\, x_j \le x_i} p_X(x_j)$
Properties: $p_X(x_i) \ge 0$ and $\sum_i p_X(x_i) = 1$
Examples:
Bernoulli Distribution:
$p_X(x) = \begin{cases} 1 - p & \text{for } x = 0 \\ p & \text{for } x = 1 \end{cases}$
Binomial Distribution:
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$
Continuous Prob. Distribution
A continuous random variable X is defined on a continuous
sample space: an interval on the real line, a region in a high
dimensional space, etc.
It is meaningless to talk about the probability of the random variable assuming a particular value: $P(X = x) = 0$.
Instead, we talk about the probability of the random variable assuming a
value within a given interval, or half interval, or arbitrary Boolean
combination of basic propositions.
Cumulative Distribution Function (CDF): $F_X(x) = P[X \le x] = \int_{-\infty}^{x} f_X(t)\,dt$
Probability Density Function (PDF): $f_X(x) = \frac{d F_X(x)}{dx}$
Properties: $f_X(x) \ge 0$ and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
Interpretation: $f_X(x) = \lim_{\Delta \to 0} \frac{P[X \in (x, x + \Delta]]}{\Delta}$
Continuous Prob. Distribution
Examples:
Uniform Density Function:
$f_X(x) = \begin{cases} \frac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$
Exponential Density Function:
$f_X(x) = \frac{1}{\mu} e^{-x/\mu}$ for $x \ge 0$, with CDF $F_X(x) = 1 - e^{-x/\mu}$ for $x \ge 0$
Gaussian (Normal) Density Function:
$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
Continuous Prob. Distribution
Gaussian Distribution:
If $Z \sim N(0, 1)$, then
$F_Z(x) = \Phi(x) = \int_{-\infty}^{x} f_Z(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2}\,dz$
This has no closed-form expression, but it is built into most software packages (e.g., normcdf in the MATLAB Statistics Toolbox).
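In Python, a comparable call is scipy.stats.norm.cdf (a minimal sketch, assuming SciPy is installed):

```python
from scipy.stats import norm
print(norm.cdf(0.0))    # Phi(0) = 0.5
print(norm.cdf(1.96))   # ~0.975
```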
Characterizations
Expectation: the mean value, center of mass, first moment:
$E_X[g(X)] = \int_{-\infty}^{\infty} g(x)\, p_X(x)\,dx$
N-th moment: $g(x) = x^n$
N-th central moment: $g(x) = (x - \mu)^n$
Mean: $E_X[X] = \int_{-\infty}^{\infty} x\, p_X(x)\,dx = \mu$
$E[\alpha X] = \alpha E[X]$
$E[\alpha + X] = \alpha + E[X]$
Variance (second central moment):
$Var(X) = E_X[(X - E_X[X])^2] = E_X[X^2] - E_X[X]^2$
$Var(\alpha X) = \alpha^2 Var(X)$
$Var(\alpha + X) = Var(X)$
Central Limit Theorem
If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables (with finite variance), consider their sample mean
$\bar{X} = f(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i$
The CLT states that the distribution of $\bar{X}$ approaches a Gaussian with mean $E[X_i]$ and variance $Var(X_i)/n$ as $n \to \infty$.
Somewhat of a justification for assuming Gaussian noise
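A small simulation illustrates this (a sketch assuming NumPy; the sample sizes are arbitrary):

```python
# Minimal sketch: the sample mean of i.i.d. Uniform(0,1) draws looks Gaussian.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
print(means.mean(), 0.5)             # close to E[X_i] = 1/2
print(means.var(), (1 / 12) / n)     # close to Var(X_i)/n = 1/(12n)
# A histogram of `means` would look approximately Gaussian.
```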
Joint RVs and Marginal Densities
Joint densities: $F_{X,Y}(x, y) = P[X \le x \cap Y \le y] = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(\alpha, \beta)\,d\alpha\,d\beta$
Marginal densities:
$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, \beta)\,d\beta$
$p_X(x_i) = \sum_j p_{X,Y}(x_i, y_j)$
Expectation and Covariance:
$E[X + Y] = E[X] + E[Y]$
$cov(X, Y) = E[(X - E_X[X])(Y - E_Y[Y])] = E[XY] - E[X]E[Y]$
$Var(X + Y) = Var(X) + 2\,cov(X, Y) + Var(Y)$
Conditional Probability
P(X|Y) = fraction of the worlds in which X is true given that Y is also true.
For example:
H=“Having a headache”
F=“Coming down with flu”
๐‘ƒ(๐ป๐‘’๐‘Ž๐‘‘๐‘โ„Ž๐‘’|๐น๐‘™๐‘ข) = fraction of flu-inflicted worlds in which you
have a headache. How to calculate?
Definition:
$P(X|Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(Y|X)\,P(X)}{P(Y)}$
Corollary:
$P(X \cap Y) = P(Y|X)\,P(X)$
This is called the Bayes rule.
The Bayes Rule
๐‘ƒ ๐ป๐‘’๐‘Ž๐‘‘๐‘Ž๐‘โ„Ž๐‘’ ๐น๐‘™๐‘ข =
=
๐‘ƒ ๐ป๐‘’๐‘Ž๐‘‘๐‘Ž๐‘โ„Ž๐‘’,๐น๐‘™๐‘ข
๐‘ƒ ๐น๐‘™๐‘ข
๐‘ƒ
๐น๐‘™๐‘ข ๐ป๐‘’๐‘Ž๐‘‘๐‘Ž๐‘โ„Ž๐‘’
๐‘ƒ ๐ป๐‘’๐‘Ž๐‘‘๐‘Ž๐‘โ„Ž๐‘’
๐‘ƒ ๐น๐‘™๐‘ข
Other cases:
๐‘ƒ ๐‘Œ๐‘‹ =
๐‘ƒ
๐‘ƒ
๐‘‹๐‘Œ
๐‘ƒ ๐‘Œ = ๐‘ฆ๐‘– ๐‘‹ =
๐‘‹๐‘Œ
๐‘ƒ(๐‘Œ)
๐‘‹ ¬๐‘Œ ๐‘ƒ(¬๐‘Œ)
๐‘ƒ ๐‘‹ ๐‘Œ ๐‘ƒ(๐‘Œ)
๐‘‹ ๐‘Œ = ๐‘ฆ๐‘– ๐‘ƒ ๐‘Œ=๐‘ฆ๐‘–
๐‘ƒ ๐‘Œ +๐‘ƒ
๐‘–∈๐‘† ๐‘ƒ
๐‘ƒ ๐‘‹ ๐‘Œ, ๐‘ ๐‘ƒ(๐‘Œ,๐‘)
๐‘ƒ ๐‘Œ ๐‘‹, ๐‘ =
=
๐‘ƒ(๐‘‹,๐‘)
๐‘ƒ ๐‘‹ ๐‘Œ, ๐‘ ๐‘ƒ(๐‘Œ,๐‘)
๐‘ƒ ๐‘‹ ๐‘Œ, ๐‘ ๐‘ƒ ๐‘Œ,๐‘ +๐‘ƒ ๐‘‹ ¬๐‘Œ, ๐‘ ๐‘ƒ(¬๐‘Œ,๐‘)
Rules of Independence
Recall that for events E and H, the probability of E given H,
written as P(E|H), is
๐‘ƒ ๐ธ∩๐ป
๐‘ƒ ๐ธ๐ป =
๐‘ƒ ๐ป
E and H are (statistically) independent if
$P(E \cap H) = P(E)\,P(H)$
or equivalently
$P(E) = P(E|H)$
That means the probability that E is true doesn't depend on whether H is true or not.
E and F are conditionally independent given H if
$P(E \mid H, F) = P(E \mid H)$
or equivalently
$P(E, F \mid H) = P(E \mid H)\,P(F \mid H)$
Rules of Independence
Examples:
P(Virus | Drink Beer) = P(Virus)
iff Virus is independent of Drink Beer
P(Flu | Virus;DrinkBeer) = P(Flu|Virus)
iff Flu is independent of Drink Beer, given Virus
P(Headache | Flu;Virus;DrinkBeer) =
P(Headache|Flu;DrinkBeer)
iff Headache is independent of Virus, given Flu and Drink Beer
Assuming the above independence relations, we obtain:
P(Headache;Flu;Virus;DrinkBeer)
=P(Headache | Flu;Virus;DrinkBeer) P(Flu | Virus;DrinkBeer)
P(Virus | Drink Beer) P(DrinkBeer)
=P(Headache|Flu;DrinkBeer) P(Flu|Virus) P(Virus) P(DrinkBeer)
Multivariate Gaussian
$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}$
Moment Parameterization: $\mu = E(X)$, $\Sigma = Cov(X) = E[(X - \mu)(X - \mu)^\top]$
Mahalanobis Distance: $\Delta^2 = (x - \mu)^\top \Sigma^{-1} (x - \mu)$
Canonical Parameterization:
$p(x \mid \eta, \Lambda) = \exp\left\{ a + \eta^\top x - \frac{1}{2} x^\top \Lambda x \right\}$
where $\Lambda = \Sigma^{-1}$, $\eta = \Sigma^{-1} \mu$, and $a = -\frac{1}{2}\left( n \log 2\pi - \log|\Lambda| + \eta^\top \Lambda^{-1} \eta \right)$
Tons of applications (MoG, FA, PPCA, Kalman filter,…)
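The density and the Mahalanobis distance are straightforward to evaluate numerically (a minimal sketch assuming NumPy/SciPy; the mean and covariance are illustrative):

```python
# Minimal sketch: evaluating a 2-D Gaussian density by hand and via SciPy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, 0.0])

d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)      # squared Mahalanobis distance
pdf = np.exp(-0.5 * d2) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
print(pdf, multivariate_normal(mu, Sigma).pdf(x))    # the two values agree
```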
Multivariate Gaussian P (X1, X2)
Joint Gaussian $P(X_1, X_2)$ with
$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$
Marginal Gaussian $P(X_2)$:
$\mu_2^m = \mu_2, \quad \Sigma_2^m = \Sigma_{22}$
Conditional Gaussian $P(X_1 \mid X_2 = x_2)$:
$\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)$
$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$
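When $X_1$ and $X_2$ are scalars the matrix inverses reduce to divisions, so the conditioning formulas fit in a couple of lines (a sketch with illustrative numbers):

```python
# Minimal sketch: conditional mean and variance of X1 given X2 = x2.
mu1, mu2 = 0.0, 1.0
S11, S12, S21, S22 = 2.0, 0.6, 0.6, 1.5     # blocks of the joint covariance
x2 = 2.0

mu_1_given_2 = mu1 + S12 / S22 * (x2 - mu2)
S_1_given_2 = S11 - S12 / S22 * S21
print(mu_1_given_2, S_1_given_2)            # conditional mean and variance
```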
Operations on Gaussian R.V.
The linear transform of a Gaussian r.v. is a Gaussian. Remember that, no matter how X is distributed,
$E[AX + b] = A\,E[X] + b$
$Cov(AX + b) = A\,Cov(X)\,A^\top$
This means that for Gaussian-distributed quantities:
$X \sim N(\mu, \Sigma) \;\Rightarrow\; AX + b \sim N(A\mu + b,\ A \Sigma A^\top)$
The sum of two independent Gaussian r.v.'s is a Gaussian:
$Y = X_1 + X_2,\ X_1 \perp X_2 \;\Rightarrow\; \mu_Y = \mu_1 + \mu_2,\ \Sigma_Y = \Sigma_1 + \Sigma_2$
The multiplication of two Gaussian functions is another Gaussian function (although no longer normalized):
$N(a, A)\,N(b, B) \propto N(c, C)$, where $C = (A^{-1} + B^{-1})^{-1}$ and $c = C A^{-1} a + C B^{-1} b$
MLE
Example: toss a coin $N$ times and observe $n$ heads. Let $D$ denote the observed sequence of tosses.
Objective function (log-likelihood):
$\ell(\theta; D) = \log P(D \mid \theta) = \log \theta^n (1 - \theta)^{N - n} = n \log \theta + (N - n) \log(1 - \theta)$
We need to maximize this w.r.t. $\theta$.
Take the derivative w.r.t. $\theta$ and set it to zero:
$\frac{d\ell}{d\theta} = \frac{n}{\theta} - \frac{N - n}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta}_{MLE} = \frac{n}{N}$
Linear Algebra
Outline
Motivating Example – Eigenfaces
Basics
Dot and Vector Products
Identity, Diagonal and Orthogonal Matrices
Trace
Norms
Rank and linear independence
Range and Null Space
Column and Row Space
Determinant and Inverse of a matrix
Eigenvalues and Eigenvectors
Singular Value Decomposition
Matrix Calculus
Motivating Example - EigenFaces
Consider the task of recognizing faces from images.
Given images of size 512x512, each image contains 262,144
dimensions or features.
Not all dimensions are equally important in classifying faces.
Solution: use ideas from linear algebra, especially eigenvectors, to form a new, reduced set of features.
Linear Algebra Basics - I
Linear algebra provides a way of compactly representing and
operating on sets of linear equations
$4x_1 - 5x_2 = -13$
$-2x_1 + 3x_2 = 9$
can be written in the form $Ax = b$ with
$A = \begin{pmatrix} 4 & -5 \\ -2 & 3 \end{pmatrix}, \quad b = \begin{pmatrix} -13 \\ 9 \end{pmatrix}$
$A \in \mathbb{R}^{m \times n}$ denotes a matrix with $m$ rows and $n$ columns, whose elements are real numbers.
$x \in \mathbb{R}^n$ denotes a vector with $n$ real entries. By convention, an $n$-dimensional vector is often thought of as a matrix with $n$ rows and 1 column.
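In NumPy the same system is solved in one call (a minimal sketch, assuming NumPy is available):

```python
# Minimal sketch: solving the 2x2 system Ax = b above.
import numpy as np

A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])
x = np.linalg.solve(A, b)
print(x)   # [3. 5.], i.e. x1 = 3, x2 = 5 satisfies both equations
```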
Linear Algebra Basics - II
The transpose of a matrix results from flipping the rows and columns. Given $A \in \mathbb{R}^{m \times n}$, its transpose is $A^\top \in \mathbb{R}^{n \times m}$.
For each element of the matrix, the transpose can be written as $(A^\top)_{ij} = A_{ji}$.
The following properties of transposes are easily verified:
$(A^\top)^\top = A$
$(AB)^\top = B^\top A^\top$
$(A + B)^\top = A^\top + B^\top$
A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^\top$ and anti-symmetric if $A = -A^\top$. Thus every square matrix can be written as the sum of a symmetric and an anti-symmetric matrix.
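To make the last statement concrete, the decomposition (not written out on the slide) is
$A = \frac{1}{2}(A + A^\top) + \frac{1}{2}(A - A^\top)$,
where the first term is symmetric and the second is anti-symmetric.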
Vector and Matrix Multiplication - I
The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is given by $C \in \mathbb{R}^{m \times p}$, where $C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$.
Given two vectors $x, y \in \mathbb{R}^n$, the term $x^\top y$ (also written $x \cdot y$) is called the inner product or dot product of the vectors, and is a real number given by $\sum_{i=1}^{n} x_i y_i$. For example,
$x^\top y = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \sum_{i=1}^{3} x_i y_i$
Given two vectors $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, the term $x y^\top$ is called the outer product of the vectors, and is a matrix with entries $(x y^\top)_{ij} = x_i y_j$. For example,
Vector and Matrix Multiplication - II
$x y^\top = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix} = \begin{pmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{pmatrix}$
The dot product also has a geometrical interpretation: for vectors $x, y \in \mathbb{R}^2$ with angle $\theta$ between them,
$x \cdot y = \|x\|\,\|y\| \cos\theta$
which leads to the use of the dot product for testing orthogonality, getting the Euclidean norm of a vector, and computing scalar projections.
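A short NumPy illustration (a sketch; the vectors are arbitrary):

```python
# Minimal sketch: inner product, outer product, and the angle between vectors.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(np.dot(x, y))      # inner product, a scalar
print(np.outer(x, y))    # outer product, a 3x3 matrix
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # angle between x and y, in degrees
```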
Trace of a Matrix
The trace of a matrix $A \in \mathbb{R}^{n \times n}$, denoted $\mathrm{tr}(A)$, is the sum of the diagonal elements of the matrix:
$\mathrm{tr}(A) = \sum_{i=1}^{n} A_{ii}$
The trace has the following properties:
For $A \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A) = \mathrm{tr}(A^\top)$
For $A, B \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$
For $A \in \mathbb{R}^{n \times n}$, $t \in \mathbb{R}$, $\mathrm{tr}(tA) = t \cdot \mathrm{tr}(A)$
For $A, B, C$ such that $ABC$ is a square matrix, $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$
The trace of a matrix helps us easily compute norms and
eigenvalues of matrices as we will see later
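The cyclic property in particular is easy to confirm numerically (a sketch assuming NumPy; the shapes are arbitrary but chosen so that ABC is square):

```python
# Minimal sketch: tr(ABC) = tr(BCA) = tr(CAB) for random matrices.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))
print(np.trace(A @ B @ C), np.trace(B @ C @ A), np.trace(C @ A @ B))  # all equal
```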
Linear Independence and Rank
A set of vectors $\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^m$ is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. That is, if
$x_n = \sum_{i=1}^{n-1} \alpha_i x_i$
for some scalar values $\alpha_1, \alpha_2, \ldots \in \mathbb{R}$, then we say that the vectors are linearly dependent; otherwise the vectors are linearly independent.
The column rank of a matrix $A \in \mathbb{R}^{m \times n}$ is the size of the largest subset of columns of $A$ that constitutes a linearly independent set. The row rank of a matrix is defined similarly for the rows of the matrix. For any matrix, row rank = column rank.
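Numerically, linear dependence shows up as a rank deficiency (a sketch assuming NumPy; the vectors are constructed to be dependent):

```python
# Minimal sketch: detecting linear dependence via the matrix rank.
import numpy as np

x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([0.0, 1.0, 1.0])
x3 = x1 + 2 * x2                      # a linear combination of x1 and x2
A = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(A))       # 2, so the three vectors are dependent
```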
Range and Null Space
The span of a set of vectors $\{x_1, x_2, \ldots, x_n\}$ is the set of all vectors that can be expressed as a linear combination of the set: $\{v : v = \sum_{i=1}^{n} \alpha_i x_i,\ \alpha_i \in \mathbb{R}\}$
If $\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^n$ is a set of $n$ linearly independent vectors, then $\mathrm{span}(\{x_1, x_2, \ldots, x_n\}) = \mathbb{R}^n$
The range of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{R}(A)$, is the span of the columns of $A$
The nullspace of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{N}(A)$, is the set of all vectors that equal 0 when multiplied by $A$:
$\mathcal{N}(A) = \{x \in \mathbb{R}^n : Ax = 0\}$
Column and Row Space
The row space and column space of a matrix are the linear subspaces generated by its row vectors and column vectors, respectively.
A linear subspace is a vector space that is a subset of some other, higher-dimensional vector space.
For a matrix $A \in \mathbb{R}^{m \times n}$:
$\mathrm{col\ space}(A) = \mathrm{span}(\text{columns of } A)$
$\mathrm{rank}(A) = \dim(\mathrm{row\ space}(A)) = \dim(\mathrm{col\ space}(A))$
Norms – I
The norm of a vector $x$ is informally a measure of the "length" of the vector.
More formally, a norm is any function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies:
For all $x \in \mathbb{R}^n$, $f(x) \ge 0$ (non-negativity)
$f(x) = 0$ if and only if $x = 0$ (definiteness)
For $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = |t|\,f(x)$ (homogeneity)
For all $x, y \in \mathbb{R}^n$, $f(x + y) \le f(x) + f(y)$ (triangle inequality)
Common norms used in machine learning are:
$\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$
Norm - II
$\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
$\ell_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$
All norms presented so far are examples of the family of $\ell_p$ norms, which are parameterized by a real number $p \ge 1$:
$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
Norms can also be defined for matrices, such as the Frobenius norm:
$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}$
Identity, Diagonal and Orthogonal Matrices
The identity matrix, denoted $I \in \mathbb{R}^{n \times n}$, is a square matrix with ones on the diagonal and zeros everywhere else.
A diagonal matrix is a matrix in which all non-diagonal elements are 0. This is typically denoted $D = \mathrm{diag}(d_1, d_2, d_3, \ldots, d_n)$.
Two vectors $x, y \in \mathbb{R}^n$ are orthogonal if $x \cdot y = 0$. A square matrix $U \in \mathbb{R}^{n \times n}$ is orthogonal if all its columns are orthogonal to each other and are normalized.
It follows from orthogonality and normality that
$U^\top U = I = U U^\top$
$\|Ux\|_2 = \|x\|_2$
Determinant and Inverse of a Matrix
The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function $f: \mathbb{R}^{n \times n} \to \mathbb{R}$, denoted $|A|$ or $\det A$, which can be computed by cofactor expansion:
$|A| = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus j}|$ (for any $j \in \{1, 2, \ldots, n\}$), where $A_{\setminus i, \setminus j}$ denotes $A$ with row $i$ and column $j$ removed.
The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is denoted $A^{-1}$ and is the unique matrix such that $A^{-1} A = I = A A^{-1}$.
For some square matrices $A^{-1}$ may not exist; we then say that $A$ is singular or non-invertible. In order for $A$ to have an inverse, $A$ must be full rank.
For non-square matrices the pseudo-inverse, denoted $A^{+}$, is given by $A^{+} = (A^\top A)^{-1} A^\top$ (when $A^\top A$ is invertible).
Eigenvalues and Eigenvectors - I
Given a square matrix $A \in \mathbb{R}^{n \times n}$, we say that $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ and $x \in \mathbb{C}^n$ is a corresponding eigenvector if
$Ax = \lambda x, \quad x \ne 0$
Intuitively, this means that multiplying the matrix $A$ by the vector $x$ gives back the same vector, scaled by the parameter $\lambda$.
Geometrically, the eigenvectors are the directions that $A$ maps onto themselves, and the eigenvalues are the factors by which those directions are stretched.
Eigenvalues and Eigenvectors - II
[Figure: illustration of eigenvectors and eigenvalues (source: Wikipedia)]
Eigenvalues and Eigenvectors - III
All the eigenvectors can be written together as $AX = X\Lambda$, where the columns of $X$ are the eigenvectors of $A$ and $\Lambda$ is a diagonal matrix whose elements are the eigenvalues of $A$.
If the matrix of eigenvectors $X$ is invertible, then $A = X \Lambda X^{-1}$.
There are several useful properties of eigenvalues and eigenvectors:
$\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$
$|A| = \prod_{i=1}^{n} \lambda_i$
The rank of $A$ equals the number of non-zero eigenvalues of $A$.
If $A$ is non-singular, then $1/\lambda_i$ are the eigenvalues of $A^{-1}$.
The eigenvalues of a diagonal matrix are the diagonal elements of the matrix itself!
Eigenvalues and Eigenvectors - IV
For a symmetric matrix $A$, it can be shown that the eigenvalues are real and the eigenvectors are orthonormal, so $A$ can be represented as $A = U \Lambda U^\top$.
Considering the quadratic form of $A$,
$x^\top A x = x^\top U \Lambda U^\top x = y^\top \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2$ (where $y = U^\top x$)
Since $y_i^2$ is always non-negative, the sign of the expression depends only on the $\lambda_i$. If all $\lambda_i > 0$, the matrix $A$ is positive definite; if all $\lambda_i \ge 0$, it is positive semidefinite.
For a multivariate Gaussian, the variances of $x$ and $y$ alone do not fully describe the distribution. The eigenvectors of the covariance matrix capture the directions of highest variance, and the eigenvalues give the variance along those directions.
Eigenvalues and Eigenvectors - V
We can rewrite the original equation in the following manner:
$Ax = \lambda x, \quad x \ne 0$
$\Rightarrow (\lambda I - A)x = 0, \quad x \ne 0$
This is only possible if $\lambda I - A$ is singular, that is, $|\lambda I - A| = 0$.
Thus, eigenvalues and eigenvectors can be computed as follows:
Compute the determinant of $A - \lambda I$.
This results in a polynomial of degree $n$.
Find the roots of the polynomial by setting it equal to zero.
The $n$ roots are the $n$ eigenvalues of $A$. They make $A - \lambda I$ singular.
For each eigenvalue $\lambda$, solve $(A - \lambda I)x = 0$ to find an eigenvector $x$.
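In practice the roots are found numerically; a minimal sketch with NumPy (the matrix is illustrative):

```python
# Minimal sketch: computing eigenvalues/eigenvectors and checking Ax = lambda x.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)
print(vals)                                   # eigenvalues (3 and 1, in some order)
for i in range(len(vals)):
    x = vecs[:, i]
    print(np.allclose(A @ x, vals[i] * x))    # True for each eigenpair
```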
Singular Value Decomposition
Singular value decomposition, known as SVD, is a factorization
of a real matrix with applications in calculating pseudo-inverse,
rank, solving linear equations, and many others.
For a matrix $M \in \mathbb{R}^{m \times n}$, assume $n \le m$. Then
$M = U \Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{m \times n}$
The $m$ columns of $U$ and the $n$ columns of $V$ are called the left and right singular vectors of $M$. The diagonal elements $\Sigma_{ii}$ of $\Sigma$ are known as the singular values of $M$.
Let $v$ be the $i$-th column of $V$, $u$ the $i$-th column of $U$, and $\sigma$ the $i$-th diagonal element of $\Sigma$. Then
$Mv = \sigma u$ and $M^\top u = \sigma v$
Singular Value Decomposition - II
$M = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix} \begin{pmatrix} \Sigma_{11} & & \\ & \ddots & \\ & & \Sigma_{nn} \end{pmatrix} \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}^\top$
Here the columns $u_i$ are the principal directions, the diagonal entries $\Sigma_{ii}$ are the scaling factors, and the $v_i$ give the projections onto the principal directions.
Singular value decomposition is related to eigenvalue decomposition.
Suppose $X = \begin{pmatrix} x_1 - \bar{x} & x_2 - \bar{x} & \cdots & x_m - \bar{x} \end{pmatrix}$ is the matrix whose columns are the $m$ centered samples ($\bar{x}$ is the sample mean). Then the covariance matrix is $C = \frac{1}{m} X X^\top$.
Starting from a singular vector pair of $M = X$,
$M^\top u = \sigma v \;\Rightarrow\; M M^\top u = \sigma M v \;\Rightarrow\; M M^\top u = \sigma^2 u \;\Rightarrow\; C u = \lambda u$
so the left singular vectors of the centered data matrix are eigenvectors of the covariance matrix, with $\lambda = \sigma^2 / m$.
Singular Value Decomposition - III
[Figure: geometric illustration of the singular value decomposition (source: Wikipedia)]
Matrix Calculus
For vectors $x, b \in \mathbb{R}^n$, let $f(x) = b^\top x$. Then $\nabla_x (b^\top x) = b$, since
$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k$
Now for a quadratic function $f(x) = x^\top A x$ with $A \in \mathbb{S}^n$ (symmetric), $\nabla_x f(x) = 2Ax$, since
$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j = \sum_{i \ne k} A_{ik} x_i + \sum_{j \ne k} A_{kj} x_j + 2 A_{kk} x_k = 2 \sum_{i=1}^{n} A_{ki} x_i$
Let $f(X) = X^{-1}$. Then $\partial(X^{-1}) = -X^{-1} (\partial X) X^{-1}$
References for self-study
Best resources for review of this material:
Linear Algebra Review and Reference, by Zico Kolter
The Matrix Cookbook, by K. B. Petersen