
Recitation
Shuang Li, Amir Afsharinejad, Kaushik Patnaik
Machine Learning
CS 7641 / CSE/ISYE 6740, Fall 2014
Probability and Statistics
Basic Probability Concepts
A sample space S is the set of all possible outcomes of a
conceptual or physical, repeatable experiment. (S can be finite
or infinite.)
E.g., S may be the set of all possible outcomes of a dice roll: $S = \{1, 2, 3, 4, 5, 6\}$
E.g., S may be the set of all possible nucleotides of a DNA site: $S = \{A, C, G, T\}$
E.g., S may be the set of all possible time-space positions of an
aircraft on a radar screen.
An event A is any subset of S:
seeing "1" or "6" in a dice roll; observing a "G" at a site; UA007 being in a given space-time interval.
Probability
An event space E is the set of possible worlds in which the outcomes can occur,
e.g., all dice rolls, reading a genome, monitoring the radar signal.
A probability P(A) is a function that maps an event A onto the
interval [0,1]. P(A) is also called the probability measure or
probability mass of A.
[Figure: Venn diagram. The sample space of all possible worlds has total area 1; P(A) is the area of the oval containing the worlds in which A is true; the remaining worlds are those in which A is false.]
Kolmogorov Axioms
For any events $A, B \subseteq S$:
$0 \le P(A) \le 1$
$P(S) = 1$
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Random Variable
A random variable is a function that associates a unique
numerical value (a token) with every outcome of an experiment.
(The value of the r.v. will vary as the experiment is repeated.)
RV Distributions:
Continuous RV:
The outcome of observing the measured location of an aircraft
The outcome of recording the true location of an aircraft
Discrete RV:
The outcome of a dice-roll
The outcome of a coin toss
Discrete Prob. Distribution
A probability distribution P defined on a discrete sample space
S is an assignment of a non-negative real number P(s) to each
sample $s \in S$:
Probability Mass Function (PMF): $p_X(x_i) = P[X = x_i]$
Cumulative Mass Function (CMF): $F_X(x_i) = P[X \le x_i] = \sum_{j:\, x_j \le x_i} p_X(x_j)$
Properties: $p_X(x_i) \ge 0$ and $\sum_i p_X(x_i) = 1$
Examples:
Bernoulli Distribution:
$p_X(x) = \begin{cases} 1 - p & \text{for } x = 0 \\ p & \text{for } x = 1 \end{cases}$
Binomial Distribution:
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$
Continuous Prob. Distribution
A continuous random variable X is defined on a continuous
sample space: an interval on the real line, a region in a high
dimensional space, etc.
It is meaningless to talk about the probability of the random variable assuming a particular value: $P(X = x) = 0$.
Instead, we talk about the probability of the random variable assuming a
value within a given interval, or half interval, or arbitrary Boolean
combination of basic propositions.
Cumulative Distribution Function (CDF): $F_X(x) = P[X \le x] = \int_{-\infty}^{x} f_X(t)\,dt$
Probability Density Function (PDF): $f_X(x) = \frac{d F_X(x)}{dx}$
Properties: $f_X(x) \ge 0$ and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
Interpretation: $f_X(x) = \lim_{\Delta \to 0} \frac{P[X \in (x, x + \Delta]]}{\Delta}$
Continuous Prob. Distribution
Examples:
Uniform Density Function:
$f_X(x) = \begin{cases} \frac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$
Exponential Density Function:
$f_X(x) = \frac{1}{\mu} e^{-x/\mu}$ for $x \ge 0$, with CDF $F_X(x) = 1 - e^{-x/\mu}$ for $x \ge 0$
Gaussian (Normal) Density Function:
$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
Continuous Prob. Distribution
Gaussian Distribution:
If $Z \sim N(0, 1)$, then
$F_Z(x) = \Phi(x) = \int_{-\infty}^{x} f_Z(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2}\,dz$
This has no closed-form expression, but it is built into most software packages (e.g., normcdf in the MATLAB Statistics Toolbox).
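In Python, a comparable call is scipy.stats.norm.cdf (a minimal sketch, assuming SciPy is installed):

```python
from scipy.stats import norm
print(norm.cdf(0.0))    # Phi(0) = 0.5
print(norm.cdf(1.96))   # ~0.975
```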
Characterizations
Expectation: the mean value, center of mass, first moment:
$E_X[g(X)] = \int_{-\infty}^{\infty} g(x)\, p_X(x)\,dx$
N-th moment: $g(x) = x^n$
N-th central moment: $g(x) = (x - \mu)^n$
Mean: $E_X[X] = \int_{-\infty}^{\infty} x\, p_X(x)\,dx = \mu$
$E[\alpha X] = \alpha E[X]$
$E[\alpha + X] = \alpha + E[X]$
Variance (second central moment):
$Var(X) = E_X[(X - E_X[X])^2] = E_X[X^2] - E_X[X]^2$
$Var(\alpha X) = \alpha^2 Var(X)$
$Var(\alpha + X) = Var(X)$
Central Limit Theorem
If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables (with finite variance), consider their sample mean
$\bar{X} = f(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i$
The CLT states that the distribution of $\bar{X}$ approaches a Gaussian with mean $E[X_i]$ and variance $Var(X_i)/n$ as $n \to \infty$.
Somewhat of a justification for assuming Gaussian noise
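A small simulation illustrates this (a sketch assuming NumPy; the sample sizes are arbitrary):

```python
# Minimal sketch: the sample mean of i.i.d. Uniform(0,1) draws looks Gaussian.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
print(means.mean(), 0.5)             # close to E[X_i] = 1/2
print(means.var(), (1 / 12) / n)     # close to Var(X_i)/n = 1/(12n)
# A histogram of `means` would look approximately Gaussian.
```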
Joint RVs and Marginal Densities
Joint densities: $F_{X,Y}(x, y) = P[X \le x \cap Y \le y] = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(\alpha, \beta)\,d\alpha\,d\beta$
Marginal densities:
$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, \beta)\,d\beta$
$p_X(x_i) = \sum_j p_{X,Y}(x_i, y_j)$
Expectation and Covariance:
$E[X + Y] = E[X] + E[Y]$
$cov(X, Y) = E[(X - E_X[X])(Y - E_Y[Y])] = E[XY] - E[X]E[Y]$
$Var(X + Y) = Var(X) + 2\,cov(X, Y) + Var(Y)$
Conditional Probability
P(X|Y) = fraction of the worlds in which X is true given that Y is also true.
For example:
H=“Having a headache”
F=“Coming down with flu”
๐‘ƒ(๐ป๐‘’๐‘Ž๐‘‘๐‘โ„Ž๐‘’|๐น๐‘™๐‘ข) = fraction of flu-inflicted worlds in which you
have a headache. How to calculate?
Definition:
$P(X|Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(Y|X)\,P(X)}{P(Y)}$
Corollary:
$P(X \cap Y) = P(Y|X)\,P(X)$
This is called the Bayes rule.
The Bayes Rule
๐‘ƒ ๐ป๐‘’๐‘Ž๐‘‘๐‘Ž๐‘โ„Ž๐‘’ ๐น๐‘™๐‘ข =
=
๐‘ƒ ๐ป๐‘’๐‘Ž๐‘‘๐‘Ž๐‘โ„Ž๐‘’,๐น๐‘™๐‘ข
๐‘ƒ ๐น๐‘™๐‘ข
๐‘ƒ
๐น๐‘™๐‘ข ๐ป๐‘’๐‘Ž๐‘‘๐‘Ž๐‘โ„Ž๐‘’
๐‘ƒ ๐ป๐‘’๐‘Ž๐‘‘๐‘Ž๐‘โ„Ž๐‘’
๐‘ƒ ๐น๐‘™๐‘ข
Other cases:
๐‘ƒ ๐‘Œ๐‘‹ =
๐‘ƒ
๐‘ƒ
๐‘‹๐‘Œ
๐‘ƒ ๐‘Œ = ๐‘ฆ๐‘– ๐‘‹ =
๐‘‹๐‘Œ
๐‘ƒ(๐‘Œ)
๐‘‹ ¬๐‘Œ ๐‘ƒ(¬๐‘Œ)
๐‘ƒ ๐‘‹ ๐‘Œ ๐‘ƒ(๐‘Œ)
๐‘‹ ๐‘Œ = ๐‘ฆ๐‘– ๐‘ƒ ๐‘Œ=๐‘ฆ๐‘–
๐‘ƒ ๐‘Œ +๐‘ƒ
๐‘–∈๐‘† ๐‘ƒ
๐‘ƒ ๐‘‹ ๐‘Œ, ๐‘ ๐‘ƒ(๐‘Œ,๐‘)
๐‘ƒ ๐‘Œ ๐‘‹, ๐‘ =
=
๐‘ƒ(๐‘‹,๐‘)
๐‘ƒ ๐‘‹ ๐‘Œ, ๐‘ ๐‘ƒ(๐‘Œ,๐‘)
๐‘ƒ ๐‘‹ ๐‘Œ, ๐‘ ๐‘ƒ ๐‘Œ,๐‘ +๐‘ƒ ๐‘‹ ¬๐‘Œ, ๐‘ ๐‘ƒ(¬๐‘Œ,๐‘)
Rules of Independence
Recall that for events E and H, the probability of E given H,
written as P(E|H), is
๐‘ƒ ๐ธ∩๐ป
๐‘ƒ ๐ธ๐ป =
๐‘ƒ ๐ป
E and H are (statistically) independent if
$P(E \cap H) = P(E)\,P(H)$
or equivalently
$P(E) = P(E|H)$
That means the probability that E is true doesn't depend on whether H is true or not.
E and F are conditionally independent given H if
$P(E \mid H, F) = P(E \mid H)$
or equivalently
$P(E, F \mid H) = P(E \mid H)\,P(F \mid H)$
Rules of Independence
Examples:
P(Virus | Drink Beer) = P(Virus)
iff Virus is independent of Drink Beer
P(Flu | Virus;DrinkBeer) = P(Flu|Virus)
iff Flu is independent of Drink Beer, given Virus
P(Headache | Flu;Virus;DrinkBeer) =
P(Headache|Flu;DrinkBeer)
iff Headache is independent of Virus, given Flu and Drink Beer
Assuming the above independence relations, we obtain:
P(Headache;Flu;Virus;DrinkBeer)
=P(Headache | Flu;Virus;DrinkBeer) P(Flu | Virus;DrinkBeer)
P(Virus | Drink Beer) P(DrinkBeer)
=P(Headache|Flu;DrinkBeer) P(Flu|Virus) P(Virus) P(DrinkBeer)
Multivariate Gaussian
$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}$
Moment Parameterization: $\mu = E(X)$, $\Sigma = Cov(X) = E[(X - \mu)(X - \mu)^\top]$
Mahalanobis Distance: $\Delta^2 = (x - \mu)^\top \Sigma^{-1} (x - \mu)$
Canonical Parameterization:
$p(x \mid \eta, \Lambda) = \exp\left\{ a + \eta^\top x - \frac{1}{2} x^\top \Lambda x \right\}$
where $\Lambda = \Sigma^{-1}$, $\eta = \Sigma^{-1} \mu$, and $a = -\frac{1}{2}\left( n \log 2\pi - \log|\Lambda| + \eta^\top \Lambda^{-1} \eta \right)$
Tons of applications (MoG, FA, PPCA, Kalman filter,…)
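The density and the Mahalanobis distance are straightforward to evaluate numerically (a minimal sketch assuming NumPy/SciPy; the mean and covariance are illustrative):

```python
# Minimal sketch: evaluating a 2-D Gaussian density by hand and via SciPy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, 0.0])

d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)      # squared Mahalanobis distance
pdf = np.exp(-0.5 * d2) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
print(pdf, multivariate_normal(mu, Sigma).pdf(x))    # the two values agree
```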
Multivariate Gaussian P (X1, X2)
Joint Gaussian $P(X_1, X_2)$ with
$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$
Marginal Gaussian $P(X_2)$:
$\mu_2^m = \mu_2, \quad \Sigma_2^m = \Sigma_{22}$
Conditional Gaussian $P(X_1 \mid X_2 = x_2)$:
$\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)$
$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$
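When $X_1$ and $X_2$ are scalars the matrix inverses reduce to divisions, so the conditioning formulas fit in a couple of lines (a sketch with illustrative numbers):

```python
# Minimal sketch: conditional mean and variance of X1 given X2 = x2.
mu1, mu2 = 0.0, 1.0
S11, S12, S21, S22 = 2.0, 0.6, 0.6, 1.5     # blocks of the joint covariance
x2 = 2.0

mu_1_given_2 = mu1 + S12 / S22 * (x2 - mu2)
S_1_given_2 = S11 - S12 / S22 * S21
print(mu_1_given_2, S_1_given_2)            # conditional mean and variance
```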
Operations on Gaussian R.V.
The linear transform of a Gaussian r.v. is a Gaussian. Remember that, no matter how X is distributed,
$E[AX + b] = A\,E[X] + b$
$Cov(AX + b) = A\,Cov(X)\,A^\top$
This means that for Gaussian-distributed quantities:
$X \sim N(\mu, \Sigma) \;\Rightarrow\; AX + b \sim N(A\mu + b,\ A \Sigma A^\top)$
The sum of two independent Gaussian r.v.'s is a Gaussian:
$Y = X_1 + X_2,\ X_1 \perp X_2 \;\Rightarrow\; \mu_Y = \mu_1 + \mu_2,\ \Sigma_Y = \Sigma_1 + \Sigma_2$
The multiplication of two Gaussian functions is another Gaussian function (although no longer normalized):
$N(a, A)\,N(b, B) \propto N(c, C)$, where $C = (A^{-1} + B^{-1})^{-1}$ and $c = C A^{-1} a + C B^{-1} b$
MLE
Example: toss a coin $N$ times and observe $n$ heads. Let $D$ denote the observed sequence of tosses.
Objective function (log-likelihood):
$\ell(\theta; D) = \log P(D \mid \theta) = \log \theta^n (1 - \theta)^{N - n} = n \log \theta + (N - n) \log(1 - \theta)$
We need to maximize this w.r.t. $\theta$.
Take the derivative w.r.t. $\theta$ and set it to zero:
$\frac{d\ell}{d\theta} = \frac{n}{\theta} - \frac{N - n}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta}_{MLE} = \frac{n}{N}$
Linear Algebra
Outline
Motivating Example – Eigenfaces
Basics
Dot and Vector Products
Identity, Diagonal and Orthogonal Matrices
Trace
Norms
Rank and linear independence
Range and Null Space
Column and Row Space
Determinant and Inverse of a matrix
Eigenvalues and Eigenvectors
Singular Value Decomposition
Matrix Calculus
Motivating Example - EigenFaces
Consider the task of recognizing faces from images.
Given images of size 512x512, each image contains 262,144
dimensions or features.
Not all dimensions are equally important in classifying faces.
Solution: use ideas from linear algebra, especially eigenvectors, to form a new, reduced set of features.
Linear Algebra Basics - I
Linear algebra provides a way of compactly representing and
operating on sets of linear equations
$4x_1 - 5x_2 = -13$
$-2x_1 + 3x_2 = 9$
can be written in the form $Ax = b$ with
$A = \begin{pmatrix} 4 & -5 \\ -2 & 3 \end{pmatrix}, \quad b = \begin{pmatrix} -13 \\ 9 \end{pmatrix}$
$A \in \mathbb{R}^{m \times n}$ denotes a matrix with $m$ rows and $n$ columns, whose elements are real numbers.
$x \in \mathbb{R}^n$ denotes a vector with $n$ real entries. By convention, an $n$-dimensional vector is often thought of as a matrix with $n$ rows and 1 column.
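In NumPy the same system is solved in one call (a minimal sketch, assuming NumPy is available):

```python
# Minimal sketch: solving the 2x2 system Ax = b above.
import numpy as np

A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])
x = np.linalg.solve(A, b)
print(x)   # [3. 5.], i.e. x1 = 3, x2 = 5 satisfies both equations
```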
Linear Algebra Basics - II
The transpose of a matrix results from flipping the rows and columns. Given $A \in \mathbb{R}^{m \times n}$, its transpose is $A^\top \in \mathbb{R}^{n \times m}$.
For each element of the matrix, the transpose can be written as $(A^\top)_{ij} = A_{ji}$.
The following properties of transposes are easily verified:
$(A^\top)^\top = A$
$(AB)^\top = B^\top A^\top$
$(A + B)^\top = A^\top + B^\top$
A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^\top$ and anti-symmetric if $A = -A^\top$. Thus every square matrix can be written as the sum of a symmetric and an anti-symmetric matrix.
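To make the last statement concrete, the decomposition (not written out on the slide) is
$A = \frac{1}{2}(A + A^\top) + \frac{1}{2}(A - A^\top)$,
where the first term is symmetric and the second is anti-symmetric.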
Vector and Matrix Multiplication - I
The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is given by $C \in \mathbb{R}^{m \times p}$, where $C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$.
Given two vectors $x, y \in \mathbb{R}^n$, the term $x^\top y$ (also written $x \cdot y$) is called the inner product or dot product of the vectors, and is a real number given by $\sum_{i=1}^{n} x_i y_i$. For example,
$x^\top y = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \sum_{i=1}^{3} x_i y_i$
Given two vectors $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, the term $x y^\top$ is called the outer product of the vectors, and is a matrix with entries $(x y^\top)_{ij} = x_i y_j$. For example,
Vector and Matrix Multiplication - II
$x y^\top = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix} = \begin{pmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{pmatrix}$
The dot product also has a geometrical interpretation: for vectors $x, y \in \mathbb{R}^2$ with angle $\theta$ between them,
$x \cdot y = \|x\|\,\|y\| \cos\theta$
which leads to the use of the dot product for testing orthogonality, getting the Euclidean norm of a vector, and computing scalar projections.
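A short NumPy illustration (a sketch; the vectors are arbitrary):

```python
# Minimal sketch: inner product, outer product, and the angle between vectors.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(np.dot(x, y))      # inner product, a scalar
print(np.outer(x, y))    # outer product, a 3x3 matrix
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # angle between x and y, in degrees
```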
Trace of a Matrix
The trace of a matrix $A \in \mathbb{R}^{n \times n}$, denoted $\mathrm{tr}(A)$, is the sum of the diagonal elements of the matrix:
$\mathrm{tr}(A) = \sum_{i=1}^{n} A_{ii}$
The trace has the following properties:
For $A \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A) = \mathrm{tr}(A^\top)$
For $A, B \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$
For $A \in \mathbb{R}^{n \times n}$, $t \in \mathbb{R}$, $\mathrm{tr}(tA) = t \cdot \mathrm{tr}(A)$
For $A, B, C$ such that $ABC$ is a square matrix, $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$
The trace of a matrix helps us easily compute norms and
eigenvalues of matrices as we will see later
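The cyclic property in particular is easy to confirm numerically (a sketch assuming NumPy; the shapes are arbitrary but chosen so that ABC is square):

```python
# Minimal sketch: tr(ABC) = tr(BCA) = tr(CAB) for random matrices.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))
print(np.trace(A @ B @ C), np.trace(B @ C @ A), np.trace(C @ A @ B))  # all equal
```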
Linear Independence and Rank
A set of vectors $\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^m$ is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. That is, if
$x_n = \sum_{i=1}^{n-1} \alpha_i x_i$
for some scalar values $\alpha_1, \alpha_2, \ldots \in \mathbb{R}$, then we say that the vectors are linearly dependent; otherwise the vectors are linearly independent.
The column rank of a matrix $A \in \mathbb{R}^{m \times n}$ is the size of the largest subset of columns of $A$ that constitutes a linearly independent set. The row rank of a matrix is defined similarly for the rows of the matrix. For any matrix, row rank = column rank.
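Numerically, linear dependence shows up as a rank deficiency (a sketch assuming NumPy; the vectors are constructed to be dependent):

```python
# Minimal sketch: detecting linear dependence via the matrix rank.
import numpy as np

x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([0.0, 1.0, 1.0])
x3 = x1 + 2 * x2                      # a linear combination of x1 and x2
A = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(A))       # 2, so the three vectors are dependent
```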
Range and Null Space
The span of a set of vectors $\{x_1, x_2, \ldots, x_n\}$ is the set of all vectors that can be expressed as a linear combination of the set: $\{v : v = \sum_{i=1}^{n} \alpha_i x_i,\ \alpha_i \in \mathbb{R}\}$
If $\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^n$ is a set of $n$ linearly independent vectors, then $\mathrm{span}(\{x_1, x_2, \ldots, x_n\}) = \mathbb{R}^n$
The range of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{R}(A)$, is the span of the columns of $A$
The nullspace of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{N}(A)$, is the set of all vectors that equal 0 when multiplied by $A$:
$\mathcal{N}(A) = \{x \in \mathbb{R}^n : Ax = 0\}$
Column and Row Space
The row space and column space of a matrix are the linear subspaces generated by its row vectors and column vectors, respectively.
A linear subspace is a vector space that is a subset of some other, higher-dimensional vector space.
For a matrix $A \in \mathbb{R}^{m \times n}$:
$\mathrm{col\ space}(A) = \mathrm{span}(\text{columns of } A)$
$\mathrm{rank}(A) = \dim(\mathrm{row\ space}(A)) = \dim(\mathrm{col\ space}(A))$
Norms – I
The norm of a vector $x$ is informally a measure of the "length" of the vector.
More formally, a norm is any function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies:
For all $x \in \mathbb{R}^n$, $f(x) \ge 0$ (non-negativity)
$f(x) = 0$ if and only if $x = 0$ (definiteness)
For $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = |t|\,f(x)$ (homogeneity)
For all $x, y \in \mathbb{R}^n$, $f(x + y) \le f(x) + f(y)$ (triangle inequality)
Common norms used in machine learning are:
$\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$
Norm - II
$\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
$\ell_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$
All norms presented so far are examples of the family of $\ell_p$ norms, which are parameterized by a real number $p \ge 1$:
$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
Norms can also be defined for matrices, such as the Frobenius norm:
$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}$
Identity, Diagonal and Orthogonal Matrices
The identity matrix, denoted $I \in \mathbb{R}^{n \times n}$, is a square matrix with ones on the diagonal and zeros everywhere else.
A diagonal matrix is a matrix in which all non-diagonal elements are 0. This is typically denoted $D = \mathrm{diag}(d_1, d_2, d_3, \ldots, d_n)$.
Two vectors $x, y \in \mathbb{R}^n$ are orthogonal if $x \cdot y = 0$. A square matrix $U \in \mathbb{R}^{n \times n}$ is orthogonal if all its columns are orthogonal to each other and are normalized.
It follows from orthogonality and normality that
$U^\top U = I = U U^\top$
$\|Ux\|_2 = \|x\|_2$
Determinant and Inverse of a Matrix
The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function $f: \mathbb{R}^{n \times n} \to \mathbb{R}$, denoted $|A|$ or $\det A$, which can be computed by cofactor expansion:
$|A| = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus j}|$ (for any $j \in \{1, 2, \ldots, n\}$), where $A_{\setminus i, \setminus j}$ denotes $A$ with row $i$ and column $j$ removed.
The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is denoted $A^{-1}$ and is the unique matrix such that $A^{-1} A = I = A A^{-1}$.
For some square matrices $A^{-1}$ may not exist; we then say that $A$ is singular or non-invertible. In order for $A$ to have an inverse, $A$ must be full rank.
For non-square matrices the pseudo-inverse, denoted $A^{+}$, is given by $A^{+} = (A^\top A)^{-1} A^\top$ (when $A^\top A$ is invertible).
Eigenvalues and Eigenvectors - I
Given a square matrix $A \in \mathbb{R}^{n \times n}$, we say that $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ and $x \in \mathbb{C}^n$ is a corresponding eigenvector if
$Ax = \lambda x, \quad x \ne 0$
Intuitively, this means that multiplying the matrix $A$ by the vector $x$ gives back the same vector, scaled by the parameter $\lambda$.
Geometrically, the eigenvectors are the directions that $A$ maps onto themselves, and the eigenvalues are the factors by which those directions are stretched.
Eigenvalues and Eigenvectors - II
[Figure: illustration of eigenvectors and eigenvalues (source: Wikipedia)]
Eigenvalues and Eigenvectors - III
All the eigenvectors can be written together as $AX = X\Lambda$, where the columns of $X$ are the eigenvectors of $A$ and $\Lambda$ is a diagonal matrix whose elements are the eigenvalues of $A$.
If the matrix of eigenvectors $X$ is invertible, then $A = X \Lambda X^{-1}$.
There are several useful properties of eigenvalues and eigenvectors:
$\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$
$|A| = \prod_{i=1}^{n} \lambda_i$
The rank of $A$ equals the number of non-zero eigenvalues of $A$.
If $A$ is non-singular, then $1/\lambda_i$ are the eigenvalues of $A^{-1}$.
The eigenvalues of a diagonal matrix are the diagonal elements of the matrix itself!
Eigenvalues and Eigenvectors - IV
For a symmetric matrix $A$, it can be shown that the eigenvalues are real and the eigenvectors are orthonormal, so $A$ can be represented as $A = U \Lambda U^\top$.
Considering the quadratic form of $A$,
$x^\top A x = x^\top U \Lambda U^\top x = y^\top \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2$ (where $y = U^\top x$)
Since $y_i^2$ is always non-negative, the sign of the expression depends only on the $\lambda_i$. If all $\lambda_i > 0$, the matrix $A$ is positive definite; if all $\lambda_i \ge 0$, it is positive semidefinite.
For a multivariate Gaussian, the variances of $x$ and $y$ alone do not fully describe the distribution. The eigenvectors of the covariance matrix capture the directions of highest variance, and the eigenvalues give the variance along those directions.
Eigenvalues and Eigenvectors - V
We can rewrite the original equation in the following manner:
$Ax = \lambda x, \quad x \ne 0$
$\Rightarrow (\lambda I - A)x = 0, \quad x \ne 0$
This is only possible if $\lambda I - A$ is singular, that is, $|\lambda I - A| = 0$.
Thus, eigenvalues and eigenvectors can be computed as follows:
Compute the determinant of $A - \lambda I$.
This results in a polynomial of degree $n$.
Find the roots of the polynomial by setting it equal to zero.
The $n$ roots are the $n$ eigenvalues of $A$. They make $A - \lambda I$ singular.
For each eigenvalue $\lambda$, solve $(A - \lambda I)x = 0$ to find an eigenvector $x$.
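In practice the roots are found numerically; a minimal sketch with NumPy (the matrix is illustrative):

```python
# Minimal sketch: computing eigenvalues/eigenvectors and checking Ax = lambda x.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)
print(vals)                                   # eigenvalues (3 and 1, in some order)
for i in range(len(vals)):
    x = vecs[:, i]
    print(np.allclose(A @ x, vals[i] * x))    # True for each eigenpair
```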
Singular Value Decomposition
Singular value decomposition, known as SVD, is a factorization
of a real matrix with applications in calculating pseudo-inverse,
rank, solving linear equations, and many others.
For a matrix $M \in \mathbb{R}^{m \times n}$, assume $n \le m$. Then
$M = U \Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{m \times n}$
The $m$ columns of $U$ and the $n$ columns of $V$ are called the left and right singular vectors of $M$. The diagonal elements $\Sigma_{ii}$ of $\Sigma$ are known as the singular values of $M$.
Let $v$ be the $i$-th column of $V$, $u$ the $i$-th column of $U$, and $\sigma$ the $i$-th diagonal element of $\Sigma$. Then
$Mv = \sigma u$ and $M^\top u = \sigma v$
Singular Value Decomposition - II
$M = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix} \begin{pmatrix} \Sigma_{11} & & \\ & \ddots & \\ & & \Sigma_{nn} \end{pmatrix} \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}^\top$
Here the columns $u_i$ are the principal directions, the diagonal entries $\Sigma_{ii}$ are the scaling factors, and the $v_i$ give the projections onto the principal directions.
Singular value decomposition is related to eigenvalue decomposition.
Suppose $X = \begin{pmatrix} x_1 - \bar{x} & x_2 - \bar{x} & \cdots & x_m - \bar{x} \end{pmatrix}$ is the matrix whose columns are the $m$ centered samples ($\bar{x}$ is the sample mean). Then the covariance matrix is $C = \frac{1}{m} X X^\top$.
Starting from a singular vector pair of $M = X$,
$M^\top u = \sigma v \;\Rightarrow\; M M^\top u = \sigma M v \;\Rightarrow\; M M^\top u = \sigma^2 u \;\Rightarrow\; C u = \lambda u$
so the left singular vectors of the centered data matrix are eigenvectors of the covariance matrix, with $\lambda = \sigma^2 / m$.
Singular Value Decomposition - III
[Figure: geometric illustration of the singular value decomposition (source: Wikipedia)]
Matrix Calculus
For vectors $x, b \in \mathbb{R}^n$, let $f(x) = b^\top x$. Then $\nabla_x (b^\top x) = b$, since
$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k$
Now for a quadratic function $f(x) = x^\top A x$ with $A \in \mathbb{S}^n$ (symmetric), $\nabla_x f(x) = 2Ax$, since
$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j = \sum_{i \ne k} A_{ik} x_i + \sum_{j \ne k} A_{kj} x_j + 2 A_{kk} x_k = 2 \sum_{i=1}^{n} A_{ki} x_i$
Let $f(X) = X^{-1}$. Then $\partial(X^{-1}) = -X^{-1} (\partial X) X^{-1}$
References for self-study
Best resources for review of this material:
Linear Algebra Review and Reference, by Zico Kolter
The Matrix Cookbook, by K. B. Petersen