Recitation
Shuang Li, Amir Afsharinejad, Kaushik Patnaik
Machine Learning CS 7641, CSE/ISYE 6740, Fall 2014

Probability and Statistics

Basic Probability Concepts
A sample space S is the set of all possible outcomes of a conceptual or physical, repeatable experiment. (S can be finite or infinite.)
E.g., S may be the set of all possible outcomes of a dice roll: S = {1, 2, 3, 4, 5, 6}
E.g., S may be the set of all possible nucleotides of a DNA site: S = {A, C, G, T}
E.g., S may be the set of all possible time-space positions of an aircraft on a radar screen.
An event A is any subset of S: seeing "1" or "6" in a dice roll; observing a "G" at a site; UA007 in a given space-time interval.

Probability
An event space E is the set of possible worlds in which the outcomes can happen: all dice rolls, reading a genome, monitoring the radar signal.
A probability P(A) is a function that maps an event A onto the interval [0, 1]. P(A) is also called the probability measure or probability mass of A.
Visually: the sample space of all possible worlds has total area 1; the worlds in which A is true form a region inside it, and P(A) is the area of that region.

Kolmogorov Axioms
For any events A, B ⊆ S:
1 ≥ P(A) ≥ 0
P(S) = 1
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Random Variable
A random variable is a function that associates a unique numerical value (a token) with every outcome of an experiment. (The value of the r.v. will vary as the experiment is repeated.)
Continuous RV examples: the measured location of an aircraft; the true location of an aircraft.
Discrete RV examples: the outcome of a dice roll; the outcome of a coin toss.

Discrete Probability Distribution
A probability distribution P defined on a discrete sample space S is an assignment of a non-negative real number P(s) to each sample s ∈ S.
Probability Mass Function (PMF): p_X(x_i) = P[X = x_i]
Cumulative Mass Function (CMF): F_X(x_i) = P[X ≤ x_i] = \sum_{j : x_j ≤ x_i} p_X(x_j)
Properties: p_X(x_i) ≥ 0 and \sum_i p_X(x_i) = 1
Examples:
Bernoulli distribution: P(X = 0) = 1 − p, P(X = 1) = p
Binomial distribution: P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}

Continuous Probability Distribution
A continuous random variable X is defined on a continuous sample space: an interval on the real line, a region in a high-dimensional space, etc.
It is meaningless to talk about the probability of the random variable assuming a particular value: P(x) = 0.
Instead, we talk about the probability of the random variable assuming a value within a given interval, half interval, or arbitrary Boolean combination of basic propositions.
Cumulative Distribution Function (CDF): F_X(x) = P[X ≤ x]
Probability Density Function (PDF): f_X(x) = \frac{d F_X(x)}{dx}
Properties: f_X(x) ≥ 0 and \int_{-\infty}^{\infty} f_X(x) \, dx = 1
Interpretation: f_X(x) = \lim_{\Delta \to 0} \frac{P[X ∈ (x, x + \Delta)]}{\Delta}

Continuous Probability Distribution (Examples)
Uniform density function:
f_X(x) = \frac{1}{b − a} for a ≤ x ≤ b, and 0 otherwise
Exponential density function:
f_X(x) = \frac{1}{\lambda} e^{−x/\lambda} for x ≥ 0, with CDF F_X(x) = 1 − e^{−x/\lambda} for x ≥ 0
Gaussian (normal) density function:
f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{−(x−\mu)^2 / (2\sigma^2)}

Continuous Probability Distribution (Gaussian CDF)
If Z ~ N(0, 1), its CDF is
F_Z(x) = \Phi(x) = \int_{−\infty}^{x} f_Z(z) \, dz = \int_{−\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{−z^2/2} \, dz
This has no closed-form expression, but it is built into most software packages (e.g., normcdf in the MATLAB Statistics Toolbox).
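Since Φ(x) has no closed form, packages evaluate it numerically. As a minimal sketch (my own illustration, assuming a Python environment with NumPy and SciPy; the slides only mention MATLAB's normcdf), one can evaluate Φ(x) and sanity-check it against a Monte Carlo estimate:

# Minimal sketch (not from the slides): evaluating the standard normal CDF.
import numpy as np
from scipy.stats import norm

x = 1.0
phi_x = norm.cdf(x)                  # Phi(x) = P[Z <= x] for Z ~ N(0, 1)

# Cross-check with a simple Monte Carlo estimate of P[Z <= x]
z = np.random.standard_normal(1_000_000)
mc_estimate = np.mean(z <= x)

print(phi_x, mc_estimate)            # both should be close to 0.8413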
Characterizations
Expectation: the mean value, center of mass, first moment:
E_X[g(X)] = \int_{−\infty}^{\infty} g(x) f_X(x) \, dx
N-th moment: g(x) = x^n
N-th central moment: g(x) = (x − μ)^n
Mean: E_X[X] = \int_{−\infty}^{\infty} x f_X(x) \, dx = μ
E[αX] = α E[X] and E[α + X] = α + E[X]
Variance (second central moment): Var(X) = E_X[(X − E_X[X])^2] = E_X[X^2] − E_X[X]^2
Var(αX) = α^2 Var(X) and Var(α + X) = Var(X)

Central Limit Theorem
If X_1, X_2, …, X_n are i.i.d. random variables (with finite variance), consider their sample mean
\bar{X} = f(X_1, X_2, …, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i
The CLT states that, as n → ∞, the distribution of \bar{X} approaches a Gaussian with mean E[X_i] and variance Var[X_i] / n.
This is somewhat of a justification for assuming Gaussian noise.

Joint RVs and Marginal Densities
Joint densities:
F_{X,Y}(x, y) = P[X ≤ x ∩ Y ≤ y] = \int_{−\infty}^{x} \int_{−\infty}^{y} f_{X,Y}(α, β) \, dβ \, dα
Marginal densities:
f_X(x) = \int_{−\infty}^{\infty} f_{X,Y}(x, β) \, dβ and p_X(x_i) = \sum_j p_{X,Y}(x_i, y_j)
Expectation and covariance:
E[X + Y] = E[X] + E[Y]
Cov(X, Y) = E[(X − E_X[X])(Y − E_Y[Y])] = E[XY] − E[X] E[Y]
Var(X + Y) = Var(X) + 2 Cov(X, Y) + Var(Y)

Conditional Probability
P(X|Y) = fraction of the worlds in which X is true given that Y is also true.
For example: H = "having a headache", F = "coming down with flu".
P(Headache|Flu) = fraction of flu-inflicted worlds in which you have a headache. How to calculate?
Definition: P(X|Y) = P(X ∩ Y) / P(Y)
Corollary: P(X ∩ Y) = P(X|Y) P(Y)
Rewriting the definition gives P(X|Y) = P(Y|X) P(X) / P(Y). This is called Bayes' rule.

The Bayes Rule
P(Headache|Flu) = P(Headache, Flu) / P(Flu) = P(Flu|Headache) P(Headache) / P(Flu)
Other cases:
P(X|Y) = P(Y|X) P(X) / (P(Y|X) P(X) + P(Y|¬X) P(¬X))
P(X = x_i | Y) = P(Y | X = x_i) P(X = x_i) / \sum_{x_j ∈ X} P(Y | X = x_j) P(X = x_j)
P(X|Y, Z) = P(Y|X, Z) P(X, Z) / P(Y, Z) = P(Y|X, Z) P(X, Z) / (P(Y|X, Z) P(X, Z) + P(Y|¬X, Z) P(¬X, Z))

Rules of Independence
Recall that for events E and H, the probability of E given H, written as P(E|H), is
P(E|H) = P(E ∩ H) / P(H)
E and H are (statistically) independent if P(E ∩ H) = P(E) P(H), or equivalently P(E) = P(E|H).
That means the probability that E is true does not depend on whether H is true or not.
E and F are conditionally independent given H if P(E|H, F) = P(E|H), or equivalently P(E, F|H) = P(E|H) P(F|H).

Rules of Independence (Examples)
P(Virus | DrinkBeer) = P(Virus) iff Virus is independent of DrinkBeer
P(Flu | Virus, DrinkBeer) = P(Flu | Virus) iff Flu is independent of DrinkBeer, given Virus
P(Headache | Flu, Virus, DrinkBeer) = P(Headache | Flu, DrinkBeer) iff Headache is independent of Virus, given Flu and DrinkBeer
Assuming the above independencies, we obtain:
P(Headache, Flu, Virus, DrinkBeer)
= P(Headache | Flu, Virus, DrinkBeer) P(Flu | Virus, DrinkBeer) P(Virus | DrinkBeer) P(DrinkBeer)
= P(Headache | Flu, DrinkBeer) P(Flu | Virus) P(Virus) P(DrinkBeer)

Multivariate Gaussian
p(x | μ, Σ) = \frac{1}{(2\pi)^{n/2} |Σ|^{1/2}} \exp\{ −\frac{1}{2} (x − μ)ᵀ Σ⁻¹ (x − μ) \}
Moment parameterization: μ = E[X], Σ = Cov(X) = E[(X − μ)(X − μ)ᵀ]
Mahalanobis distance: Δ^2 = (x − μ)ᵀ Σ⁻¹ (x − μ)
Canonical parameterization:
p(x | η, Λ) = \exp\{ a + ηᵀ x − \frac{1}{2} xᵀ Λ x \}
where Λ = Σ⁻¹, η = Σ⁻¹ μ, a = −\frac{1}{2} (n \log 2\pi − \log|Λ| + μᵀ Λ μ)
Tons of applications (MoG, FA, PPCA, Kalman filter, …)

Multivariate Gaussian (Joint, Marginal, Conditional)
Joint Gaussian p(x_1, x_2) with
μ = [μ_1; μ_2], Σ = [[Σ_11, Σ_12], [Σ_21, Σ_22]]
Marginal Gaussian over x_2: μ_2^m = μ_2, Σ_2^m = Σ_22
Conditional Gaussian p(x_1 | x_2):
μ_{1|2} = μ_1 + Σ_12 Σ_22⁻¹ (x_2 − μ_2)
Σ_{1|2} = Σ_11 − Σ_12 Σ_22⁻¹ Σ_21
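As a minimal NumPy sketch (my own illustration, not from the slides; the block sizes and values are hypothetical), the conditional formulas above can be applied directly to a partitioned joint Gaussian:

# Conditional of a partitioned Gaussian, using the formulas above.
import numpy as np

mu1, mu2 = np.array([1.0]), np.array([-2.0])
S11 = np.array([[2.0]]); S12 = np.array([[0.8]])
S21 = S12.T;             S22 = np.array([[1.0]])

x2 = np.array([0.5])                       # observed value of x2

S22_inv = np.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (x2 - mu2) # mu_{1|2}
S_cond  = S11 - S12 @ S22_inv @ S21        # Sigma_{1|2}

print(mu_cond, S_cond)   # the marginal of x2 is simply N(mu2, S22)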
Operations on Gaussian R.V.
The linear transform of a Gaussian r.v. is Gaussian. Remember that, no matter how X is distributed,
E[AX + b] = A E[X] + b
Cov(AX + b) = A Cov(X) Aᵀ
For Gaussian-distributed quantities this means:
X ~ N(μ, Σ) ⟹ AX + b ~ N(Aμ + b, A Σ Aᵀ)
The sum of two independent Gaussian r.v.'s is Gaussian:
Y = X_1 + X_2, X_1 ⊥ X_2 ⟹ μ_Y = μ_1 + μ_2, Σ_Y = Σ_1 + Σ_2
The product of two Gaussian functions is another Gaussian function (although no longer normalized):
N(a, A) N(b, B) ∝ N(c, C), where C = (A⁻¹ + B⁻¹)⁻¹ and c = C A⁻¹ a + C B⁻¹ b

MLE Example: toss a coin
Suppose we observe k heads in n tosses. Objective function (log-likelihood):
l(θ; data) = \log P(heads | θ) = \log θ^k (1 − θ)^{n−k} = k \log θ + (n − k) \log(1 − θ)
We need to maximize this w.r.t. θ. Take the derivative w.r.t. θ and set it to zero:
\frac{\partial l}{\partial θ} = \frac{k}{θ} − \frac{n − k}{1 − θ} = 0
⟹ θ_MLE = k / n

Linear Algebra

Outline
Motivating Example – Eigenfaces
Basics
Dot and Vector Products
Identity, Diagonal and Orthogonal Matrices
Trace
Norms
Rank and Linear Independence
Range and Null Space
Column and Row Space
Determinant and Inverse of a Matrix
Eigenvalues and Eigenvectors
Singular Value Decomposition
Matrix Calculus

Motivating Example - EigenFaces
Consider the task of recognizing faces from images. Given images of size 512x512, each image contains 262,144 dimensions or features.
Not all dimensions are equally important in classifying faces.
Solution: use ideas from linear algebra, especially eigenvectors, to form a new, reduced set of features.

Linear Algebra Basics - I
Linear algebra provides a way of compactly representing and operating on sets of linear equations.
4x_1 − 5x_2 = −13
−2x_1 + 3x_2 = 9
can be written in the form Ax = b, with
A = [[4, −5], [−2, 3]], b = [−13, 9]ᵀ
A ∈ ℝ^{m×n} denotes a matrix with m rows and n columns, whose elements are real numbers.
x ∈ ℝ^n denotes a vector with n real entries. By convention, an n-dimensional vector is often thought of as a matrix with n rows and 1 column.

Linear Algebra Basics - II
The transpose of a matrix results from flipping the rows and columns. Given A ∈ ℝ^{m×n}, its transpose is Aᵀ ∈ ℝ^{n×m}.
For each element of the matrix, the transpose can be written as (Aᵀ)_ij = A_ji.
The following properties of transposes are easily verified:
(Aᵀ)ᵀ = A
(AB)ᵀ = Bᵀ Aᵀ
(A + B)ᵀ = Aᵀ + Bᵀ
A square matrix A ∈ ℝ^{n×n} is symmetric if A = Aᵀ and anti-symmetric if A = −Aᵀ. Any square matrix can be written as a sum of a symmetric and an anti-symmetric matrix.

Vector and Matrix Multiplication - I
The product of two matrices A ∈ ℝ^{m×n} and B ∈ ℝ^{n×p} is the matrix C ∈ ℝ^{m×p} with C_ij = \sum_{k=1}^{n} A_ik B_kj.
Given two vectors x, y ∈ ℝ^n, the term xᵀy (also written x · y) is called the inner product or dot product of the vectors, and is the real number \sum_{i=1}^{n} x_i y_i. For example,
xᵀy = [x_1, x_2, x_3] [y_1, y_2, y_3]ᵀ = \sum_{i=1}^{3} x_i y_i
Given two vectors x ∈ ℝ^m, y ∈ ℝ^n, the term xyᵀ is called the outer product of the vectors, and is the m×n matrix with (xyᵀ)_ij = x_i y_j. For example,

Vector and Matrix Multiplication - II
xyᵀ = [x_1; x_2; x_3] [y_1, y_2, y_3] =
[[x_1 y_1, x_1 y_2, x_1 y_3],
 [x_2 y_1, x_2 y_2, x_2 y_3],
 [x_3 y_1, x_3 y_2, x_3 y_3]]
The dot product also has a geometric interpretation: for vectors x, y ∈ ℝ^2 with angle θ between them,
x · y = ‖x‖ ‖y‖ cos θ
which leads to the use of the dot product for testing orthogonality, getting the Euclidean norm of a vector, and computing scalar projections.
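A minimal NumPy sketch of these products (my own illustration; the vectors are hypothetical):

# Inner product, outer product, and the angle interpretation x . y = ||x|| ||y|| cos(theta).
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y                 # scalar: sum_i x_i * y_i
outer = np.outer(x, y)        # 3x3 matrix with entries x_i * y_j

cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(cos_theta)  # angle between x and y, in radians

print(inner, outer.shape, theta)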
Trace of a Matrix
The trace of a square matrix A ∈ ℝ^{n×n}, denoted tr(A), is the sum of the diagonal elements of the matrix:
tr(A) = \sum_{i=1}^{n} A_ii
The trace has the following properties:
For A ∈ ℝ^{n×n}, tr(A) = tr(Aᵀ)
For A, B ∈ ℝ^{n×n}, tr(A + B) = tr(A) + tr(B)
For A ∈ ℝ^{n×n} and t ∈ ℝ, tr(tA) = t · tr(A)
For A, B, C such that ABC is a square matrix, tr(ABC) = tr(BCA) = tr(CAB)
The trace helps us easily compute norms and eigenvalues of matrices, as we will see later.

Linear Independence and Rank
A set of vectors {x_1, x_2, …, x_n} ⊂ ℝ^m is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. That is, if
x_n = \sum_{i=1}^{n−1} α_i x_i
for some scalar values α_1, α_2, … ∈ ℝ, then we say that the vectors are linearly dependent; otherwise the vectors are linearly independent.
The column rank of a matrix A ∈ ℝ^{m×n} is the size of the largest subset of columns of A that constitutes a linearly independent set. The row rank of a matrix is defined similarly for the rows of a matrix.
For any matrix, row rank = column rank.

Range and Null Space
The span of a set of vectors {x_1, x_2, …, x_n} is the set of all vectors that can be expressed as a linear combination of the set:
span({x_1, …, x_n}) = { v : v = \sum_{i=1}^{n} α_i x_i, α_i ∈ ℝ }
If {x_1, x_2, …, x_n} ⊂ ℝ^n is a set of n linearly independent vectors, then span({x_1, …, x_n}) = ℝ^n.
The range of a matrix A ∈ ℝ^{m×n}, denoted ℛ(A), is the span of the columns of A.
The nullspace of a matrix A ∈ ℝ^{m×n}, denoted 𝒩(A), is the set of all vectors that equal 0 when multiplied by A:
𝒩(A) = { x ∈ ℝ^n : Ax = 0 }

Column and Row Space
The row space and column space are the linear subspaces generated by the row and column vectors of a matrix.
A linear subspace is a vector space that is a subset of some higher-dimensional vector space.
For a matrix A ∈ ℝ^{m×n}:
col space(A) = span(columns of A)
rank(A) = dim(row space(A)) = dim(col space(A))

Norms - I
The norm of a vector x is informally a measure of the "length" of the vector.
More formally, a norm is any function f : ℝ^n → ℝ that satisfies:
For all x ∈ ℝ^n, f(x) ≥ 0 (non-negativity)
f(x) = 0 if and only if x = 0 (definiteness)
For x ∈ ℝ^n, t ∈ ℝ, f(tx) = |t| f(x) (homogeneity)
For all x, y ∈ ℝ^n, f(x + y) ≤ f(x) + f(y) (triangle inequality)
A common norm used in machine learning is the ℓ2 norm:
‖x‖_2 = \sqrt{\sum_{i=1}^{n} x_i^2}

Norms - II
ℓ1 norm: ‖x‖_1 = \sum_{i=1}^{n} |x_i|
ℓ∞ norm: ‖x‖_∞ = \max_i |x_i|
All norms presented so far are examples of the family of ℓp norms, which are parameterized by a real number p ≥ 1:
‖x‖_p = ( \sum_{i=1}^{n} |x_i|^p )^{1/p}
Norms can also be defined for matrices, such as the Frobenius norm:
‖A‖_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} A_ij^2 } = \sqrt{ tr(Aᵀ A) }
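A minimal NumPy sketch of the vector norms above and the Frobenius-norm identity (my own illustration; the arrays are hypothetical):

# Vector norms and ||A||_F = sqrt(tr(A^T A)).
import numpy as np

x = np.array([3.0, -4.0, 1.0])
l2   = np.linalg.norm(x)            # sqrt(sum x_i^2)
l1   = np.linalg.norm(x, 1)         # sum |x_i|
linf = np.linalg.norm(x, np.inf)    # max |x_i|
lp   = np.sum(np.abs(x) ** 3) ** (1 / 3)   # generic l_p norm, here p = 3

A = np.array([[1.0, 2.0], [3.0, 4.0]])
fro_direct = np.linalg.norm(A, 'fro')
fro_trace  = np.sqrt(np.trace(A.T @ A))    # should match fro_direct

print(l2, l1, linf, lp, fro_direct, fro_trace)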
Identity, Diagonal and Orthogonal Matrices
The identity matrix, denoted I ∈ ℝ^{n×n}, is a square matrix with ones on the diagonal and zeros everywhere else.
A diagonal matrix is a matrix in which all non-diagonal entries are 0. This is typically denoted D = diag(d_1, d_2, d_3, …, d_n).
Two vectors x, y ∈ ℝ^n are orthogonal if x · y = 0.
A square matrix U ∈ ℝ^{n×n} is orthogonal if all its columns are orthogonal to each other and are normalized.
It follows from orthogonality and normality that
Uᵀ U = I = U Uᵀ
‖Ux‖_2 = ‖x‖_2

Determinant and Inverse of a Matrix
The determinant of a square matrix A ∈ ℝ^{n×n} is a function f : ℝ^{n×n} → ℝ, denoted |A| or det(A), and can be calculated by cofactor expansion:
|A| = \sum_{j=1}^{n} (−1)^{i+j} a_ij |A_{\setminus i, \setminus j}|   (for any i ∈ {1, 2, …, n})
where A_{\setminus i, \setminus j} is A with its i-th row and j-th column removed.
The inverse of a square matrix A ∈ ℝ^{n×n} is denoted A⁻¹ and is the unique matrix such that A⁻¹ A = I = A A⁻¹.
For some square matrices A⁻¹ may not exist; we then say that A is singular or non-invertible. In order for A to have an inverse, A must be full rank.
For non-square matrices the pseudo-inverse, denoted A⁺, is given by A⁺ = (Aᵀ A)⁻¹ Aᵀ (when Aᵀ A is invertible).

Eigenvalues and Eigenvectors - I
Given a square matrix A ∈ ℝ^{n×n}, we say that λ ∈ ℂ is an eigenvalue of A and x ∈ ℂ^n is the corresponding eigenvector if
Ax = λx, x ≠ 0
Intuitively this means that upon multiplying the matrix A with the vector x, we get back the same vector, but scaled by the parameter λ.
Geometrically, x defines a direction that the transformation A leaves unchanged except for a scaling by λ.

Eigenvalues and Eigenvectors - II
[Figure: eigenvectors of a linear transformation. Source - Wikipedia]

Eigenvalues and Eigenvectors - III
All the eigenvector equations can be written together as
A X = X Λ
where the columns of X are the eigenvectors of A, and Λ is a diagonal matrix whose elements are the eigenvalues of A.
If the matrix of eigenvectors X is invertible, then A = X Λ X⁻¹.
There are several properties of eigenvalues and eigenvectors:
tr(A) = \sum_{i=1}^{n} λ_i
|A| = \prod_{i=1}^{n} λ_i
The rank of A is the number of non-zero eigenvalues of A.
If A is non-singular, then 1/λ_i are the eigenvalues of A⁻¹.
The eigenvalues of a diagonal matrix are the diagonal elements of the matrix itself!

Eigenvalues and Eigenvectors - IV
For a symmetric matrix A, it can be shown that the eigenvalues are real and the eigenvectors are orthonormal. Thus A can be represented as A = U Λ Uᵀ.
Considering the quadratic form of A,
xᵀ A x = xᵀ U Λ Uᵀ x = yᵀ Λ y = \sum_{i=1}^{n} λ_i y_i^2   (where y = Uᵀ x)
Since y_i^2 is always non-negative, the sign of the expression depends entirely on the λ_i. If all λ_i > 0 then the matrix A is positive definite; if all λ_i ≥ 0 then the matrix A is positive semidefinite.
For a multivariate Gaussian, the variances of x and y alone do not fully describe the distribution. The eigenvectors of the covariance matrix capture the directions of highest variance, and the eigenvalues give the variance along those directions.

Eigenvalues and Eigenvectors - V
We can rewrite the original equation in the following manner:
Ax = λx, x ≠ 0 ⟹ (λI − A) x = 0, x ≠ 0
This is only possible if (λI − A) is singular, that is |λI − A| = 0. Thus, eigenvalues and eigenvectors can be computed as follows:
Compute the determinant of A − λI. This results in a polynomial of degree n.
Find the roots of the polynomial by equating it to zero. The n roots are the n eigenvalues of A. They make A − λI singular.
For each eigenvalue λ, solve (A − λI) x = 0 to find an eigenvector x.
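A minimal NumPy sketch checking the eigenvalue facts above (my own illustration; the matrix is hypothetical):

# A v = lambda v, tr(A) = sum of eigenvalues, |A| = product of eigenvalues.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])           # symmetric, so eigenvalues are real

eigvals, eigvecs = np.linalg.eig(A)  # columns of eigvecs are eigenvectors

v0 = eigvecs[:, 0]
print(np.allclose(A @ v0, eigvals[0] * v0))          # A v = lambda v
print(np.isclose(np.trace(A), eigvals.sum()))        # tr(A) = sum lambda_i
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # |A| = prod lambda_i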
Singular Value Decomposition
Singular value decomposition, known as SVD, is a factorization of a real matrix with applications in calculating the pseudo-inverse, rank, solving linear equations, and many others.
For a matrix M ∈ ℝ^{m×n} (assume m ≤ n),
M = U Σ Vᵀ, where U ∈ ℝ^{m×m}, Σ ∈ ℝ^{m×n}, Vᵀ ∈ ℝ^{n×n}
The m columns of U and the n columns of V are called the left and right singular vectors of M. The diagonal elements Σ_ii of Σ are known as the singular values of M.
Let v_i be the i-th column of V, u_i the i-th column of U, and σ_i the i-th diagonal element of Σ. Then
M v_i = σ_i u_i and Mᵀ u_i = σ_i v_i

Singular Value Decomposition - II
M = [u_1, u_2, …, u_m] Σ [v_1, v_2, …, v_n]ᵀ
The columns of U give the principal directions, Σ contains the scaling factors, and Vᵀ gives the projections onto the principal directions.
Singular value decomposition is related to eigenvalue decomposition. Suppose
M = [x_1 − \bar{x}, x_2 − \bar{x}, …, x_n − \bar{x}] ∈ ℝ^{m×n}
is a matrix of centered data points, so the covariance matrix is C = \frac{1}{n} M Mᵀ.
Starting from a singular vector pair:
Mᵀ u = σ v ⟹ M Mᵀ u = σ M v ⟹ M Mᵀ u = σ^2 u ⟹ C u = λ u   (with λ = σ^2 / n)
So the left singular vectors of M are eigenvectors of the covariance matrix (see the first numerical sketch at the end of these notes).

Singular Value Decomposition - III
[Figure: SVD as a rotation, a scaling, and another rotation. Source - Wikipedia]

Matrix Calculus
For vectors x, b ∈ ℝ^n, let f(x) = bᵀ x. Then the gradient ∇_x bᵀ x is obtained elementwise from
\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k
so ∇_x bᵀ x = b.
Now for a quadratic function f(x) = xᵀ A x with A ∈ S^n (symmetric),
f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_ij x_i x_j
\frac{\partial f(x)}{\partial x_k} = \sum_{i ≠ k} A_ik x_i + \sum_{j ≠ k} A_kj x_j + 2 A_kk x_k = 2 \sum_{i=1}^{n} A_ki x_i
so ∇_x xᵀ A x = 2Ax (see the second numerical sketch at the end of these notes).
Let f(X) = X⁻¹. Then ∂(X⁻¹) = −X⁻¹ (∂X) X⁻¹.

References for self study
Best resources for review of the material:
"Linear Algebra Review and Reference" by Zico Kolter
"The Matrix Cookbook" by K. B. Petersen and M. S. Pedersen
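As a closing illustration of the SVD facts above (my own sketch, assuming NumPy; none of this code comes from the slides, and the data matrix is hypothetical), the following checks that M v_i = σ_i u_i and that the left singular vectors of a centered data matrix are eigenvectors of its covariance C = (1/n) M Mᵀ:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))          # hypothetical data: 100 points in R^3
M = X - X.mean(axis=1, keepdims=True)  # center each row (feature)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Check M v_i = sigma_i u_i for the first singular pair
print(np.allclose(M @ Vt[0], s[0] * U[:, 0]))

# Left singular vectors are eigenvectors of C = (1/n) M M^T, with lambda = sigma^2 / n
C = (M @ M.T) / M.shape[1]
print(np.allclose(C @ U[:, 0], (s[0] ** 2 / M.shape[1]) * U[:, 0]))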
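And a small finite-difference check (again my own sketch, not from the slides) of the gradient identities ∇_x bᵀx = b and ∇_x xᵀAx = 2Ax for symmetric A:

import numpy as np

rng = np.random.default_rng(1)
n = 4
b = rng.normal(size=n)
A = rng.normal(size=(n, n)); A = (A + A.T) / 2   # make A symmetric
x = rng.normal(size=n)
eps = 1e-6

def num_grad(f, x):
    # central finite-difference approximation of the gradient of f at x
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x); e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda v: b @ v, x), b))                          # grad b^T x = b
print(np.allclose(num_grad(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-5))   # grad x^T A x = 2 A x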