Some Basic Mathematics for Machine Learning

Iain Murray, www.gatsby.ucl.ac.uk/~iam23/
Angela J. Yu <ajyu@ucsd.edu>
1  Probability Theory
See sections 2.1–2.3 of David MacKay’s book:
www.inference.phy.cam.ac.uk/mackay/itila/book.html
The probability a discrete variable A takes value a is:   0 ≤ P(A = a) ≤ 1   [Probability]

Probabilities of alternative outcomes add:   P(A ∈ {a, a′}) = P(A = a) + P(A = a′)   [Alternatives]

The probabilities of all outcomes must sum to one:   Σ_{all possible a} P(A = a) = 1   [Normalization]
P(A = a, B = b) is the joint probability that both A = a and B = b occur.   [Joint Probability]

Variables can be "summed out" of joint distributions:   [Marginalization]

    P(A = a) = Σ_{all possible b} P(A = a, B = b)

P(A = a|B = b) is the probability A = a occurs given the knowledge B = b.   [Conditional Probability]

    P(A = a, B = b) = P(A = a) P(B = b|A = a) = P(B = b) P(A = a|B = b)   [Product Rule]
Bayes rule can be derived from the above:   [Bayes Rule]

    P(A = a|B = b, H) = P(B = b|A = a, H) P(A = a|H) / P(B = b|H)  ∝  P(A = a, B = b|H)

Marginalizing over all possible a gives the evidence or normalizing constant:   [Normalizing Constant]

    Σ_{all possible a} P(A = a, B = b|H) = P(B = b|H)

The following hold, for all a and b, if and only if A and B are independent:   [Independence]

    P(A = a|B = b) = P(A = a)
    P(B = b|A = a) = P(B = b)
    P(A = a, B = b) = P(A = a) P(B = b) .
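
As a concrete illustration of marginalization, the product rule, Bayes rule and independence, here is a minimal NumPy sketch (the particular 2 × 3 joint table and the variable names are made up for illustration):

    import numpy as np

    # A made-up joint distribution P(A=a, B=b): rows index values of A, columns values of B.
    P_AB = np.array([[0.10, 0.20, 0.10],
                     [0.25, 0.05, 0.30]])
    assert np.isclose(P_AB.sum(), 1.0)              # normalization

    P_A = P_AB.sum(axis=1)                          # marginalization: P(A=a) = sum_b P(A=a, B=b)
    P_B = P_AB.sum(axis=0)                          # P(B=b)

    P_B_given_A = P_AB / P_A[:, None]               # product rule rearranged: P(B=b|A=a)
    P_A_given_B = P_AB / P_B[None, :]               # P(A=a|B=b)

    # Bayes rule: P(A=a|B=b) = P(B=b|A=a) P(A=a) / P(B=b)
    bayes = P_B_given_A * P_A[:, None] / P_B[None, :]
    print(np.allclose(bayes, P_A_given_B))          # True

    # Independence would require P(A=a, B=b) = P(A=a) P(B=b) for all a, b -- not true here.
    print(np.allclose(P_AB, np.outer(P_A, P_B)))    # False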
All the above theory basically still applies to continuous variables if the sums are converted into integrals¹. The probability that X lies between x and x+dx is p(x) dx, where p(x) is a probability density function with range [0, ∞).   [Continuous Variables]

    P(x1 < X < x2) = ∫_{x1}^{x2} p(x) dx ,        ∫_{−∞}^{∞} p(x) dx = 1    and    p(x) = ∫_{−∞}^{∞} p(x, y) dy .

The expectation or mean under a probability distribution is:   [Expectation]

    E f(a) = ⟨f(a)⟩ = Σ_{all possible a} P(A = a) f(a)    or    ⟨f(x)⟩ = ∫_{−∞}^{∞} p(x) f(x) dx

¹ Integrals are the equivalent of sums for continuous variables, e.g. Σ_{i=1}^{n} f(xi)Δx becomes the integral ∫_a^b f(x) dx in the limit Δx → 0, n → ∞, where Δx = (b−a)/n and xi = a + iΔx.
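
A quick numerical sketch of both forms of expectation (the discrete distribution, the choice f(x) = x², and the Gaussian density used for the integral are all arbitrary; the integral is approximated by the Riemann sum of footnote 1):

    import numpy as np

    # Discrete: E f(a) = sum_a P(A=a) f(a), with a made-up distribution and f(a) = a**2.
    a_vals = np.array([0.0, 1.0, 2.0])
    P_A    = np.array([0.2, 0.5, 0.3])
    print(np.sum(P_A * a_vals**2))                  # 0.5*1 + 0.3*4 = 1.7

    # Continuous: <f(x)> = integral p(x) f(x) dx, approximated by sum_i p(x_i) f(x_i) dx.
    # With a standard Gaussian density and f(x) = x**2 the answer should be close to 1.
    dx = 0.001
    x  = np.arange(-10.0, 10.0, dx)
    p  = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    print(np.sum(p * x**2 * dx))                    # approximately 1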
2  Linear Algebra
This complements Sam Roweis’s “Matrix Identities”: www.cs.toronto.edu/~roweis/notes/matrixid.pdf
Scalars are individual numbers, vectors are columns of numbers, and matrices are rectangular grids of numbers, e.g.:

                     [ x1 ]           [ A11  A12  ···  A1n ]
    x = 3.4,     x = [ x2 ],      A = [ A21  A22  ···  A2n ]
                     [ ⋮  ]           [  ⋮    ⋮    ⋱    ⋮  ]
                     [ xn ]           [ Am1  Am2  ···  Amn ]

In the above example the scalar x is 1 × 1, the vector x is n × 1 and A is m × n.   [Dimensions]
The transpose operator, ⊤ (′ in Matlab), swaps the rows and columns:   [Transpose]

    (A⊤)ij = Aji ,        x⊤ = x for a scalar x ,        x⊤ = ( x1  x2  ···  xn ) for a column vector x
Quantities whose inner dimensions match may be "multiplied" by summing over this index. The outer dimensions give the dimensions of the answer.   [Multiplication]

    Ax has elements (Ax)i = Σ_{j=1}^{n} Aij xj    and    (AA⊤)ij = Σ_{k=1}^{n} Aik (A⊤)kj = Σ_{k=1}^{n} Aik Ajk

All the following are allowed (the dimensions of the answer are also shown):   [Check Dimensions]

    x⊤x        xx⊤        Ax         AA⊤        A⊤A        x⊤Ax
    1×1        n×n        m×1        m×m        n×n        1×1
    scalar     matrix     vector     matrix     matrix     scalar

while xx, AA and xA do not make sense for m ≠ n ≠ 1. Can you see why?

An exception to the above rule is that we may write xA. Every element of the matrix A is multiplied by the scalar x.   [Multiplication by Scalar]

Simple and valid manipulations:   [Basic Properties]

    (AB)C = A(BC)        A(B + C) = AB + AC        (A + B)⊤ = A⊤ + B⊤        (AB)⊤ = B⊤A⊤

Note that AB ≠ BA in general.
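
The dimension rules and basic properties are easy to check numerically. A small NumPy sketch (the sizes m = 2, n = 3 and the random entries are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 2, 3
    A = rng.standard_normal((m, n))                 # m x n matrix
    x = rng.standard_normal((n, 1))                 # n x 1 column vector

    print((x.T @ x).shape)                          # (1, 1)  scalar
    print((x @ x.T).shape)                          # (3, 3)  n x n matrix
    print((A @ x).shape)                            # (2, 1)  m x 1 vector
    print((A @ A.T).shape)                          # (2, 2)  m x m matrix
    print((A.T @ A).shape)                          # (3, 3)  n x n matrix

    B = rng.standard_normal((n, n))
    C = rng.standard_normal((n, n))
    print(np.allclose((B @ C).T, C.T @ B.T))        # True:  (BC)^T = C^T B^T
    print(np.allclose(B @ C, C @ B))                # False: BC != CB in general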
2.1  Square Matrices
A square matrix has an equal number of rows and columns, e.g. B with dimensions n × n.

A diagonal matrix is a square matrix with all off-diagonal elements being zero: Bij = 0 if i ≠ j.   [Diagonal Matrix]

The identity matrix is a diagonal matrix with all diagonal elements equal to one.   [Identity Matrix]

    "I is the identity matrix"  ⇔  Iij = 0 if i ≠ j and Iii = 1 ∀i

The identity matrix leaves vectors and matrices unchanged upon multiplication:

    Ix = x        x⊤I = x⊤        IB = B = BI
Some square matrices have inverses:   [Inverse]

    B⁻¹B = BB⁻¹ = I ,

which have the additional properties:

    (BC)⁻¹ = C⁻¹B⁻¹ ,        (B⁻¹)⁻¹ = B ,        (B⁻¹)⊤ = (B⊤)⁻¹

Linear simultaneous equations could be solved this way:   [Solving Linear Equations]

    if Bx = y then x = B⁻¹y
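
A sketch of these identities in NumPy (the 3 × 3 random matrices are arbitrary; in numerical work np.linalg.solve is generally preferred to forming B⁻¹ explicitly):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 3
    B = rng.standard_normal((n, n))
    C = rng.standard_normal((n, n))
    y = rng.standard_normal((n, 1))

    B_inv = np.linalg.inv(B)
    print(np.allclose(B_inv @ B, np.eye(n)))                            # B^-1 B = I
    print(np.allclose(np.linalg.inv(B @ C), np.linalg.inv(C) @ B_inv))  # (BC)^-1 = C^-1 B^-1
    print(np.allclose(np.linalg.inv(B.T), B_inv.T))                     # (B^-1)^T = (B^T)^-1

    # Solving Bx = y: mathematically x = B^-1 y; numerically, use solve().
    x = np.linalg.solve(B, y)
    print(np.allclose(B @ x, y))                                        # True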
Some other commonly used matrix definitions include:

    Bij = Bji  ⇔  "B is symmetric"   [Symmetry]

    Trace(B) = Tr(B) = Σ_{i=1}^{n} Bii = "sum of diagonal elements"   [Trace]

Cyclic permutations are allowed inside trace. Trace of a scalar is a scalar:

    Tr(BCD) = Tr(DBC) = Tr(CDB) ,        x⊤Bx = Tr(x⊤Bx) = Tr(xx⊤B)
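
A quick numerical check of the trace identities (random 4 × 4 matrices and a random vector, all arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4
    B, C, D = (rng.standard_normal((n, n)) for _ in range(3))
    x = rng.standard_normal((n, 1))

    print(np.isclose(np.trace(B @ C @ D), np.trace(D @ B @ C)))         # cyclic permutation
    print(np.isclose(np.trace(B @ C @ D), np.trace(C @ D @ B)))
    print(np.isclose((x.T @ B @ x).item(), np.trace(x @ x.T @ B)))      # x^T B x = Tr(x x^T B)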
The determinant² is written Det(B) or |B|. It is a scalar regardless of n.   [Determinant]

    |BC| = |B||C| ,        |x| = x ,        |xB| = xⁿ|B| ,        |B⁻¹| = 1/|B| .   [Exercise]

It determines if B can be inverted: |B| = 0 ⇒ B⁻¹ undefined. If the vector to every point of a shape is pre-multiplied by B then the shape's area or volume increases by a factor of |B|. It also appears in the normalizing constant of a Gaussian. For a diagonal matrix the volume scaling factor is simply the product of the diagonal elements. In general the determinant is the product of the eigenvalues.

    Bv(i) = λ(i) v(i)  ⇔  "λ(i) is an eigenvalue of B with eigenvector v(i)"   [Eigenvalue, Eigenvector]

    |B| = Π_i λ(i)        Trace(B) = Σ_i λ(i)

If B is real and symmetric (e.g. a covariance matrix), the eigenvectors are orthogonal (perpendicular) and so form a basis (can be used as axes).

² This section is only intended to give you a flavor so you understand other references and Sam's crib sheet. More detailed history and overview is here: http://www.wikipedia.org/wiki/Determinant
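
These eigenvalue facts can also be checked numerically (a NumPy sketch with an arbitrary real symmetric matrix; np.linalg.eigh is the routine for symmetric matrices):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 3
    A = rng.standard_normal((n, n))
    B = A + A.T                                           # a real, symmetric matrix

    evals, evecs = np.linalg.eigh(B)
    print(np.isclose(np.linalg.det(B), np.prod(evals)))   # |B| = product of eigenvalues
    print(np.isclose(np.trace(B), np.sum(evals)))         # Tr(B) = sum of eigenvalues
    print(np.allclose(evecs.T @ evecs, np.eye(n)))        # eigenvectors are orthonormal

    # Determinant rules from above, e.g. |xB| = x^n |B| and |B^-1| = 1/|B|.
    s = 2.5
    print(np.isclose(np.linalg.det(s * B), s**n * np.linalg.det(B)))
    print(np.isclose(np.linalg.det(np.linalg.inv(B)), 1.0 / np.linalg.det(B)))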
3  Differentiation
The gradient of a straight line y = mx + c is a constant:   [Gradient]

    y′ = [y(x+Δx) − y(x)] / Δx = m .

Many functions look like straight lines over a small enough range. The gradient of this line, the derivative, is not constant, but a new function:   [Differentiation]

    y′(x) = dy/dx = lim_{Δx→0} [y(x+Δx) − y(x)] / Δx ,    which could be differentiated again:  y″ = dy′/dx = d²y/dx²

The following results are well known (c is a constant):   [Standard Derivatives]

    f(x):     c      cxⁿ        cx     log_e(x)    eˣ
    f′(x):    0      cnxⁿ⁻¹     c      1/x         eˣ

At a maximum or minimum the function is rising on one side and falling on the other. In between the gradient must reach zero somewhere. Therefore maxima and minima satisfy:   [Optimization]

    df(x)/dx = 0    or, for a vector x,    df(x)/dx = 0  ⇔  df(x)/dxi = 0 ∀i
If we can’t solve this analytically, we can evolve our variable x, or variables x, on a computer
using gradient information until we find a place where the gradient is zero. But there’s no
guarantee that we would find all maxima and minima.
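
A minimal sketch of that idea, gradient descent on f(x) = (x − 3)² using its known derivative (the starting point, step size and stopping tolerance are arbitrary choices):

    def f(x):
        return (x - 3.0) ** 2

    def df_dx(x):                       # derivative of f, from the standard results above
        return 2.0 * (x - 3.0)

    x = 0.0                             # arbitrary starting point
    step = 0.1                          # arbitrary step size
    while abs(df_dx(x)) > 1e-8:         # evolve x downhill until the gradient is ~zero
        x = x - step * df_dx(x)

    print(x)                            # close to 3, the minimum; in general this only
                                        # finds one stationary point, not all of them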
A function may be approximated by a straight line³ about any point a.   [Approximation]

    f(a + x) ≈ f(a) + x f′(a) ,        e.g.:  log(1 + x) ≈ log(1 + 0) + x · 1/(1 + 0) = x
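
For example, the log(1 + x) ≈ x approximation is easy to check numerically (plain Python; x = 0.01 is an arbitrary small value):

    import math

    # f(a + x) ~ f(a) + x f'(a) with f = log, a = 1, f'(1) = 1.
    x = 0.01
    print(math.log(1.0 + x), x)         # 0.00995... vs 0.01: close for small x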
The derivative operator is linear:   [Linearity]

    d(f(x) + g(x))/dx = df(x)/dx + dg(x)/dx ,        e.g.:  d(x + exp(x))/dx = 1 + exp(x).
Dealing with products is slightly more involved:   [Product Rule]

    d(u(x)v(x))/dx = v du/dx + u dv/dx ,        e.g.:  d(x · exp(x))/dx = exp(x) + x exp(x).

The "chain rule", df(u)/dx = (du/dx) · (df(u)/du), allows results to be combined. For example:   [Chain Rule]

    d exp(ay^m)/dy = d(ay^m)/dy · d exp(ay^m)/d(ay^m)    "with u = ay^m"
                   = amy^(m−1) · exp(ay^m)
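
The chain-rule example can be checked against the limit definition of the derivative with a small Δy (a sketch; the values a = 2, m = 3, y = 1.5 and Δy = 10⁻⁶ are arbitrary):

    import math

    a, m = 2.0, 3.0

    def g(y):                                       # g(y) = exp(a * y**m)
        return math.exp(a * y ** m)

    def dg_dy(y):                                   # chain rule: a*m*y**(m-1) * exp(a*y**m)
        return a * m * y ** (m - 1) * math.exp(a * y ** m)

    y, dy = 1.5, 1e-6
    finite_diff = (g(y + dy) - g(y)) / dy           # [g(y+dy) - g(y)] / dy
    print(dg_dy(y), finite_diff)                    # the two values agree closely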
Convince yourself of the following:   [Exercise]

    d/dz [ exp(az)/(b + cz) + e ] = a/(b + cz) · exp(az) − c/(b + cz)² · exp(az)

Note that a, b, c and e are constants and 1/u = u⁻¹. This might be hard if you haven't done differentiation for a long time. You may want to grab a calculus textbook for a review.
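
The exercise can be checked the same way, by comparing the claimed derivative with a finite-difference estimate (the constants a, b, c, e and the point z are arbitrary):

    import math

    a, b, c, e = 1.2, 0.7, 0.4, 5.0                 # arbitrary constants

    def f(z):
        return math.exp(a * z) / (b + c * z) + e

    def df_dz(z):                                   # the derivative claimed above
        return (a / (b + c * z) - c / (b + c * z) ** 2) * math.exp(a * z)

    z, dz = 0.3, 1e-6
    print(df_dz(z), (f(z + dz) - f(z)) / dz)        # should agree closely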
³ More accurate approximations can be made. Look up Taylor series.