Lecture 3: Math Primer II
Machine Learning
Andrew Rosenberg

Today
• Wrap-up of probability
• Vectors, matrices
• Calculus
• Differentiation with respect to a vector

Properties of probability density functions
• Sum rule: p(x) = ∫ p(x, y) dy
• Product rule: p(x, y) = p(y | x) p(x)

Expected Values
• Given a random variable X with distribution p(X), what is the expected value of X?
• E[X] = ∫ x p(x) dx (or Σ_x x p(x) for a discrete variable)

Multinomial Distribution
• If a variable x can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution.
• The probability of x being in state k is μ_k.

Expected Value of a Multinomial
• The expected value is the vector of means: E[x] = (μ_1, ..., μ_K).

Gaussian Distribution
• One dimension: N(x | μ, σ²) = (2πσ²)^(-1/2) exp(-(x - μ)² / (2σ²))
• D dimensions: N(x | μ, Σ) = (2π)^(-D/2) |Σ|^(-1/2) exp(-(x - μ)^T Σ^(-1) (x - μ) / 2)

Gaussians

How machine learning uses statistical modeling
• Expectation
  – The expected value of a function is the hypothesis.
• Variance
  – The variance is the confidence in that hypothesis.

Variance
• The variance of a random variable describes how much variability there is around the expected value.
• It is calculated as the expected squared error: var[X] = E[(X - E[X])²].

Covariance
• The covariance of two random variables expresses how they vary together: cov[X, Y] = E[(X - E[X])(Y - E[Y])].
• If two variables are independent, their covariance equals zero.

Linear Algebra
• Vectors
  – A one-dimensional array.
  – If not specified, assume x is a column vector.
• Matrices
  – A higher-dimensional array.
  – Typically denoted with capital letters.
  – n rows by m columns.

Transposition
• Transposing a matrix swaps its rows and columns: (A^T)_ij = A_ji.

Addition
• Two matrices can be added iff they have the same dimensions.
  – A and B are both n-by-m matrices.

Multiplication
• To multiply two matrices, the inner dimensions must be the same.
  – An n-by-m matrix can be multiplied by an m-by-k matrix.

Inversion
• The inverse of an n-by-n (square) matrix A is denoted A^(-1) and satisfies A A^(-1) = A^(-1) A = I.
• Here I is the identity matrix: an n-by-n matrix with ones along the diagonal.
  – I_ij = 1 iff i = j, and 0 otherwise.

Identity Matrix
• Matrices are invariant under multiplication by the identity matrix: AI = IA = A.

Helpful matrix inversion properties
• (AB)^(-1) = B^(-1) A^(-1)
• (A^T)^(-1) = (A^(-1))^T

Norm
• The norm of a vector x represents the Euclidean length of the vector: ||x|| = sqrt(x^T x).

Positive Definiteness
• Quadratic form
  – Scalar: a x²
  – Vector: x^T M x
• Positive definite matrix M: x^T M x > 0 for all x ≠ 0.
• Positive semi-definite: x^T M x ≥ 0 for all x.

Calculus
• Derivatives and integrals
• Optimization

Derivatives
• The derivative of a function defines the slope of the function at a point x.
• f'(x) = lim_{h→0} (f(x + h) - f(x)) / h

Derivative Example

Integrals
• Integration is the inverse operation of differentiation (up to an additive constant).
• Graphically, an integral can be interpreted as the area under the curve defined by f(x).

Integration Example

Vector Calculus
• Differentiation with respect to a matrix or vector
• Gradient
• Change of variables with a vector

Derivative w.r.t. a vector
• Given a vector x and a function f(x), how can we find f'(x)?
• The result is the vector of partial derivatives, ∂f/∂x = [∂f/∂x_1, ..., ∂f/∂x_n]^T.

Example Derivation
• The derivative of a function with respect to a vector is also referred to as the gradient of the function.

Useful Vector Calculus identities
• Scalar multiplication
• Product rule
• Derivative of an inverse
• Change of variable

Optimization
• We have an objective function, f(x), that we would like to maximize or minimize.
• Set the first derivative to zero and solve.
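To make "derivative with respect to a vector" and "set the first derivative to zero" concrete before moving to the constrained case, here is a minimal NumPy sketch (not from the slides; the quadratic objective f(x) = 0.5 x^T A x - b^T x and the values of A and b are assumed purely for illustration). For symmetric A the gradient is Ax - b, so the unconstrained minimizer is found by solving Ax = b; the sketch also checks the analytic gradient against a finite-difference approximation.

```python
import numpy as np

# Assumed example objective: f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    # Derivative w.r.t. the vector x: for symmetric A, d/dx (0.5 x^T A x - b^T x) = A x - b.
    return A @ x - b

# Check the analytic gradient against a central finite-difference approximation.
x0 = np.array([0.5, -1.0])
eps = 1e-6
numeric = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(len(x0))])
print("analytic gradient:", grad_f(x0))
print("numeric gradient: ", numeric)

# Optimization: set the first derivative to zero, i.e. solve A x = b.
x_star = np.linalg.solve(A, b)
print("stationary point:", x_star, "gradient there:", grad_f(x_star))
```

Because this A is positive definite (the quadratic form x^T A x is positive for all x ≠ 0), the stationary point found by zeroing the gradient is the unique minimum, which is where the positive-definiteness material above pays off.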
Optimization with constraints
• What if we want to constrain the parameters of the model?
  – For example: the mean is less than 10.
• Find the best likelihood, subject to a constraint.
• Two functions:
  – An objective function to maximize
  – An inequality that must be satisfied

Lagrange Multipliers
• Find the maxima of f(x, y) subject to a constraint.

General form
• Maximizing: f(x)
• Subject to: g(x) = 0
• Introduce a new variable λ (the Lagrange multiplier) and find a maximum of Λ(x, λ) = f(x) + λ g(x).

Example
• Maximizing:
• Subject to:
• Introduce a new variable, and find a maximum.
• This gives 3 equations with 3 unknowns.
• Eliminate λ, then substitute and solve. (A numerical sketch of a constrained problem of this form follows at the end of these notes.)

Why does Machine Learning need these tools?
• Calculus
  – We need to identify the maximum likelihood or minimum risk: optimization.
  – Integration allows the marginalization of continuous probability density functions.
• Linear Algebra
  – Many features lead to high-dimensional spaces.
  – Vectors and matrices allow us to compactly describe and manipulate high-dimensional feature spaces.
• Vector Calculus
  – All of the optimization needs to be performed in high-dimensional spaces.
  – Optimization over multiple variables simultaneously.
  – Gradient descent.
  – We want to take marginals over high-dimensional distributions like Gaussians.

Next Time
• Linear Regression
  – Then regularization
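The specific objective and constraint in the Lagrange-multiplier example above did not survive into these notes, so the sketch below uses a classic stand-in problem (an assumption for illustration, not necessarily the one from the slide): maximize f(x, y) = x + y subject to g(x, y) = x² + y² - 1 = 0. Working it by hand follows the slide's recipe: ∇f = λ∇g gives 1 = 2λx and 1 = 2λy, which together with the constraint makes 3 equations in 3 unknowns; eliminating λ gives x = y, and substituting into the constraint gives x = y = 1/√2. The code checks this numerically with SciPy's SLSQP solver, which enforces the equality constraint in the same spirit.

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in constrained problem (assumed for illustration, not the slide's example):
#   maximize   f(x, y) = x + y
#   subject to g(x, y) = x^2 + y^2 - 1 = 0
def neg_f(v):
    # Minimize the negative objective to perform maximization.
    x, y = v
    return -(x + y)

constraint = {"type": "eq", "fun": lambda v: v[0] ** 2 + v[1] ** 2 - 1.0}

result = minimize(neg_f, x0=np.array([0.5, 0.5]), method="SLSQP", constraints=[constraint])
print("numerical maximizer:", result.x)                      # approx. [0.7071, 0.7071]
print("analytic maximizer: ", np.array([1.0, 1.0]) / np.sqrt(2))
```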