Some Basic Mathematics for Machine Learning
Iain Murray, www.gatsby.ucl.ac.uk/~iam23/
Angela J. Yu <ajyu@ucsd.edu>

1  Probability Theory

See sections 2.1-2.3 of David MacKay's book:
www.inference.phy.cam.ac.uk/mackay/itila/book.html

Probability:  The probability that a discrete variable A takes value a satisfies
    0 ≤ P(A=a) ≤ 1.

Alternatives:  Probabilities of alternative outcomes add:
    P(A ∈ {a, a′}) = P(A=a) + P(A=a′).

Normalization:  The probabilities of all outcomes must sum to one:
    Σ_{all possible a} P(A=a) = 1.

Joint Probability:  P(A=a, B=b) is the joint probability that both A=a and B=b occur.

Marginalization:  Variables can be "summed out" of joint distributions:
    P(A=a) = Σ_{all possible b} P(A=a, B=b).

Conditional Probability:  P(A=a | B=b) is the probability that A=a occurs given the knowledge that B=b.

Product Rule:
    P(A=a, B=b) = P(A=a) P(B=b | A=a) = P(B=b) P(A=a | B=b).

Bayes Rule:  Bayes rule can be derived from the above:
    P(A=a | B=b, H) = P(B=b | A=a, H) P(A=a | H) / P(B=b | H)  ∝  P(A=a, B=b | H).

Normalizing Constant:  Marginalizing over all possible a gives the evidence or normalizing constant:
    Σ_{all possible a} P(A=a, B=b | H) = P(B=b | H).

Independence:  The following hold, for all a and b, if and only if A and B are independent:
    P(A=a | B=b) = P(A=a)
    P(B=b | A=a) = P(B=b)
    P(A=a, B=b) = P(A=a) P(B=b).

Continuous Variables:  All the above theory basically still applies to continuous variables if the sums are converted into integrals[1]. The probability that X lies between x and x+dx is p(x) dx, where p(x) is a probability density function taking values in [0, ∞).
    ∫_{-∞}^{∞} p(x) dx = 1,    P(x1 < X < x2) = ∫_{x1}^{x2} p(x) dx,    p(x) = ∫_{-∞}^{∞} p(x, y) dy.

Expectation:  The expectation or mean under a probability distribution is:
    E[f(a)] = ⟨f(a)⟩ = Σ_a P(A=a) f(a)    or    ⟨f(x)⟩ = ∫_{-∞}^{∞} p(x) f(x) dx.

[1] Integrals are the equivalent of sums for continuous variables, e.g. Σ_{i=1}^{n} f(x_i) Δx becomes the integral ∫_a^b f(x) dx in the limit Δx → 0, n → ∞, where Δx = (b−a)/n and x_i = a + iΔx.

2  Linear Algebra

This complements Sam Roweis's "Matrix Identities": www.cs.toronto.edu/~roweis/notes/matrixid.pdf

Scalars are individual numbers, vectors are columns of numbers, and matrices are rectangular grids of numbers, e.g.:

    x = 3.4,      x = [ x1 ]       A = [ A11  A12  ...  A1n ]
                      [ x2 ]           [ A21  A22  ...  A2n ]
                      [  : ]           [  :    :         :  ]
                      [ xn ]           [ Am1  Am2  ...  Amn ]

Dimensions:  In the above example the scalar x is 1×1, the vector x is n×1 and A is m×n.

Transpose:  The transpose operator ⊤ (' in Matlab) swaps the rows and columns. The transpose of a scalar is itself, the transpose of a column vector is a row vector:
    x⊤ = x,    x⊤ = (x1 x2 ... xn),    (A⊤)_ij = A_ji.

Multiplication:  Quantities whose inner dimensions match may be "multiplied" by summing over this index. The outer dimensions give the dimensions of the answer. Ax has elements
    (Ax)_i = Σ_{j=1}^{n} A_ij x_j    and    (AA⊤)_ij = Σ_{k=1}^{n} A_ik (A⊤)_kj = Σ_{k=1}^{n} A_ik A_jk.

Check Dimensions:  All the following are allowed (the dimensions of the answer are also shown):
    x⊤x      1×1 scalar
    xx⊤      n×n matrix
    Ax       m×1 vector
    AA⊤      m×m matrix
    A⊤A      n×n matrix
    x⊤Ax     1×1 scalar (this one needs m = n),
while xx, AA and xA do not make sense for m ≠ n ≠ 1. Can you see why?

Multiplication by Scalar:  An exception to the above rule is that we may write xA: every element of the matrix A is multiplied by the scalar x.

Basic Properties:  Simple and valid manipulations:
    (AB)C = A(BC),    A(B + C) = AB + AC,    (A + B)⊤ = A⊤ + B⊤,    (AB)⊤ = B⊤A⊤.
Note that AB ≠ BA in general.

2.1  Square Matrices

A square matrix has an equal number of rows and columns, e.g. B with dimensions n × n.
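Before continuing with square matrices, the discrete probability rules of Section 1 are mechanical enough to verify on a toy example. The following sketch is not part of the original notes; it uses Python/numpy, and the 2×2 joint table P_AB is made up. It checks normalization, marginalization, the product rule and Bayes rule.

    # Minimal numpy sketch: discrete probability rules on a made-up joint table.
    import numpy as np

    # Rows index values of A, columns index values of B; entries are P(A=a, B=b).
    P_AB = np.array([[0.10, 0.20],
                     [0.30, 0.40]])
    assert np.isclose(P_AB.sum(), 1.0)       # normalization

    P_A = P_AB.sum(axis=1)                   # marginalization: P(A=a) = sum_b P(A=a, B=b)
    P_B = P_AB.sum(axis=0)                   # P(B=b) = sum_a P(A=a, B=b)

    P_A_given_B = P_AB / P_B                 # conditional: P(A=a|B=b) = P(A=a,B=b) / P(B=b)

    # Product rule: P(A=a, B=b) = P(B=b) P(A=a|B=b)
    assert np.allclose(P_AB, P_A_given_B * P_B)

    # Bayes rule: P(A=a|B=b) = P(B=b|A=a) P(A=a) / P(B=b)
    P_B_given_A = P_AB / P_A[:, None]
    bayes = P_B_given_A * P_A[:, None] / P_B
    assert np.allclose(P_A_given_B, bayes)

    # A and B are *not* independent here: P(A=a,B=b) != P(A=a)P(B=b) in general.
    print(np.allclose(P_AB, np.outer(P_A, P_B)))   # prints False for this table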
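The vector and matrix shape rules above can be checked in the same way. A minimal numpy sketch (again not from the original notes), assuming arbitrary sizes m = 3 and n = 2 and random entries, showing which products are defined and the size of each result:

    # Minimal numpy sketch: which products are allowed and their dimensions.
    import numpy as np

    m, n = 3, 2
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, n))      # m x n matrix
    x = rng.standard_normal((n, 1))      # n x 1 column vector

    print((x.T @ x).shape)    # (1, 1)  scalar      x'x
    print((x @ x.T).shape)    # (n, n)  matrix      xx'
    print((A @ x).shape)      # (m, 1)  vector      Ax
    print((A @ A.T).shape)    # (m, m)  matrix      AA'
    print((A.T @ A).shape)    # (n, n)  matrix      A'A

    # (AB)' = B'A'; note AB != BA in general (here BA is not even the same size).
    B = rng.standard_normal((n, m))
    assert np.allclose((A @ B).T, B.T @ A.T)

    # xA, xx and AA fail because the inner dimensions do not match:
    try:
        x @ A
    except ValueError as err:
        print("x @ A fails:", err)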
Diagonal Matrix:  A diagonal matrix is a square matrix with all off-diagonal elements equal to zero: B_ij = 0 if i ≠ j.

Identity Matrix:  The identity matrix is a diagonal matrix with all diagonal elements equal to one:
    "I is the identity matrix"  ⇔  I_ij = 0 if i ≠ j and I_ii = 1 ∀i.
The identity matrix leaves vectors and matrices unchanged upon multiplication:
    Ix = x,    x⊤I = x⊤,    IB = B = BI.

Inverse:  Some square matrices have an inverse B⁻¹:
    B⁻¹B = BB⁻¹ = I,
which has the additional properties
    (BC)⁻¹ = C⁻¹B⁻¹,    (B⁻¹)⁻¹ = B,    (B⁻¹)⊤ = (B⊤)⁻¹.

Solving Linear Equations:  Linear simultaneous equations could be solved this way:
    if Bx = y then x = B⁻¹y.

Some other commonly used matrix definitions include:

Symmetry:
    B_ij = B_ji  ⇔  "B is symmetric".

Trace:
    Trace(B) = Tr(B) = Σ_{i=1}^{n} B_ii = "sum of diagonal elements".
Cyclic permutations are allowed inside a trace, and the trace of a scalar is a scalar:
    Tr(BCD) = Tr(DBC) = Tr(CDB),    x⊤Bx = Tr(x⊤Bx) = Tr(xx⊤B).

Determinant:  The determinant[2] is written Det(B) or |B|. It is a scalar regardless of n:
    |BC| = |B||C|,    |x| = x,    |xB| = x^n |B|,    |B⁻¹| = 1/|B|.      (Exercise)
It determines whether B can be inverted: |B| = 0 ⇒ B⁻¹ undefined. If the vector to every point of a shape is pre-multiplied by B, then the shape's area or volume increases by a factor of |B|. It also appears in the normalizing constant of a Gaussian. For a diagonal matrix the volume scaling factor is simply the product of the diagonal elements; in general the determinant is the product of the eigenvalues.

Eigenvalue, Eigenvector:
    Bv^(i) = λ^(i) v^(i)  ⇔  "λ^(i) is an eigenvalue of B with eigenvector v^(i)",
    |B| = Π_i λ^(i),    Trace(B) = Σ_i λ^(i).
If B is real and symmetric (e.g. a covariance matrix), the eigenvectors are orthogonal (perpendicular) and so form a basis (can be used as axes).

[2] This section is only intended to give you a flavor so that you understand other references and Sam's crib sheet. More detailed history and overview: http://www.wikipedia.org/wiki/Determinant

3  Differentiation

The gradient of a straight line y = mx + c is a constant:
    y′ = [y(x+Δx) − y(x)] / Δx = m.
Many functions look like straight lines over a small enough range. The gradient of this line, the derivative, is not constant but a new function:

Gradient:
    y′(x) = dy/dx = lim_{Δx→0} [y(x+Δx) − y(x)] / Δx,
which could be differentiated again:
    y′′ = dy′/dx = d²y/dx².

Standard Derivatives:  The following results are well known (c is a constant):
    f(x):     c     c x^n          c x     log_e(x)     e^x
    f′(x):    0     c n x^(n-1)    c       1/x          e^x

Optimization:  At a maximum or minimum the function is rising on one side and falling on the other; in between, the gradient must reach zero somewhere. Therefore maxima and minima satisfy:
    df(x)/dx = 0,    or, for several variables,    df(x)/dx = 0  ⇔  df(x)/dx_i = 0 ∀i.
If we can't solve this analytically, we can evolve our variable x, or variables x, on a computer using gradient information until we find a place where the gradient is zero. But there is no guarantee that we would find all maxima and minima.

Approximation:  A function may be approximated by a straight line[3] about any point a:
    f(a + x) ≈ f(a) + x f′(a),    e.g. log(1 + x) ≈ log(1 + 0) + x · 1/(1 + 0) = x.

Linearity:  The derivative operator is linear:
    d[f(x) + g(x)]/dx = df(x)/dx + dg(x)/dx,    e.g. d[x + exp(x)]/dx = 1 + exp(x).

Product Rule:  Dealing with products is slightly more involved:
    d[u(x)v(x)]/dx = v du/dx + u dv/dx,    e.g. d[x · exp(x)]/dx = exp(x) + x exp(x).
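Before moving on, the table of standard derivatives, the linear approximation and the product rule can all be checked against the limit definition of the derivative above. A minimal sketch, not part of the original notes, in Python/numpy; the test point x = 1.7, the constants c and n, and the step size 1e-6 are arbitrary choices:

    # Minimal numpy sketch: finite differences versus the analytic derivatives above.
    import numpy as np

    def numerical_derivative(f, x, dx=1e-6):
        # f'(x) ~= (f(x + dx) - f(x)) / dx for small dx
        return (f(x + dx) - f(x)) / dx

    x = 1.7          # arbitrary test point
    c, n = 3.0, 4    # arbitrary constants

    # Standard derivatives table:
    assert np.isclose(numerical_derivative(lambda t: c * t**n, x), c * n * x**(n - 1), rtol=1e-4)
    assert np.isclose(numerical_derivative(lambda t: c * t, x), c, rtol=1e-4)
    assert np.isclose(numerical_derivative(np.log, x), 1 / x, rtol=1e-4)
    assert np.isclose(numerical_derivative(np.exp, x), np.exp(x), rtol=1e-4)

    # Product rule: d/dx [x exp(x)] = exp(x) + x exp(x)
    assert np.isclose(numerical_derivative(lambda t: t * np.exp(t), x),
                      np.exp(x) + x * np.exp(x), rtol=1e-4)

    # Linear approximation about a = 0: log(1 + x) ~= x for small x
    small = 1e-3
    print(np.log(1 + small), small)   # the two numbers are close

A central difference, [f(x+Δx) − f(x−Δx)] / (2Δx), gives a more accurate estimate than the one-sided difference for the same step size.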
Chain Rule:  The "chain rule",
    df(u)/dx = (df(u)/du) (du/dx),
allows results to be combined. For example, with u = a y^m:
    d exp(a y^m)/dy = [d(a y^m)/dy] · [d exp(a y^m)/d(a y^m)]
                    = a m y^(m-1) · exp(a y^m).

Exercise:  Convince yourself of the following:
    d/dz [ exp(az)/(b + cz) + e ] = a exp(az)/(b + cz) − c exp(az)/(b + cz)².
Note that a, b, c and e are constants and 1/u = u⁻¹. This might be hard if you haven't done differentiation for a long time. You may want to grab a calculus textbook for a review.

[3] More accurate approximations can be made: look up Taylor series.
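If the algebra feels rusty, both the chain-rule example and the exercise can be checked numerically against a finite difference, as in the earlier sketch. The following is not part of the original notes; the values chosen for the constants a, b, c, e, m and the evaluation points are arbitrary:

    # Minimal numpy sketch: numerical check of the chain-rule example and the exercise.
    import numpy as np

    def numerical_derivative(f, z, dz=1e-6):
        return (f(z + dz) - f(z)) / dz

    # Chain rule example: d/dy exp(a y^m) = a m y^(m-1) exp(a y^m)
    a, m, y = 0.5, 3, 1.2
    lhs = numerical_derivative(lambda t: np.exp(a * t**m), y)
    rhs = a * m * y**(m - 1) * np.exp(a * y**m)
    assert np.isclose(lhs, rhs, rtol=1e-4)

    # Exercise: d/dz [exp(az)/(b + cz) + e]
    #         = a exp(az)/(b + cz) - c exp(az)/(b + cz)^2
    a, b, c, e, z = 0.7, 2.0, 0.3, 5.0, 1.1
    lhs = numerical_derivative(lambda t: np.exp(a * t) / (b + c * t) + e, z)
    rhs = a * np.exp(a * z) / (b + c * z) - c * np.exp(a * z) / (b + c * z)**2
    assert np.isclose(lhs, rhs, rtol=1e-4)
    print("chain rule example and exercise check out")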
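Finally, returning to the square matrices of Section 2.1: the inverse, trace, determinant and eigenvalue identities can all be spot-checked on a random example. A minimal numpy sketch, not part of the original notes; the 4×4 size and the random seed are arbitrary:

    # Minimal numpy sketch: spot-checking the square-matrix identities of Section 2.1.
    import numpy as np

    n = 4
    rng = np.random.default_rng(1)
    B = rng.standard_normal((n, n))
    C = rng.standard_normal((n, n))
    y = rng.standard_normal((n, 1))

    I = np.eye(n)
    B_inv = np.linalg.inv(B)
    assert np.allclose(B_inv @ B, I) and np.allclose(B @ B_inv, I)      # inverse
    assert np.allclose(np.linalg.inv(B @ C), np.linalg.inv(C) @ B_inv)  # (BC)^-1 = C^-1 B^-1

    # Solving Bx = y: np.linalg.solve avoids forming B^-1 explicitly.
    x = np.linalg.solve(B, y)
    assert np.allclose(B @ x, y)

    # Trace: cyclic permutations, and x'Bx = Tr(xx'B)
    D = rng.standard_normal((n, n))
    assert np.isclose(np.trace(B @ C @ D), np.trace(D @ B @ C))
    assert np.isclose((x.T @ B @ x).item(), np.trace(x @ x.T @ B))

    # Determinant and eigenvalues: |BC| = |B||C|, |B| = prod(eigvals), Tr(B) = sum(eigvals)
    evals = np.linalg.eigvals(B)
    assert np.isclose(np.linalg.det(B @ C), np.linalg.det(B) * np.linalg.det(C))
    assert np.isclose(np.linalg.det(B), np.prod(evals).real)
    assert np.isclose(np.trace(B), np.sum(evals).real)

    # Real symmetric matrix: eigenvectors are orthogonal (V'V = I)
    S = B @ B.T                       # symmetric by construction
    w, V = np.linalg.eigh(S)
    assert np.allclose(V.T @ V, I)
    print("square-matrix identities check out")

The sketch uses np.linalg.solve rather than x = B⁻¹y because solving the system directly is the standard, numerically safer route in practice; the explicit inverse appears only to illustrate the identities.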