Lecture Note 2 – Calculus and Probability
Shuaiqiang Wang
Department of CS & IS, University of Jyväskylä
http://users.jyu.fi/~swang/
shuaiqiang.wang@jyu.fi

Part 1: Calculus

Definition
• Given a function f(x), the derivative is
  f′(x) = (d/dx) f(x) = lim_{t→0} [f(x + t) − f(x)] / t
• Chain rule: (d/dx) f(t) = (df/dt)(dt/dx)
• Derivative of a constant: (d/dx) 2 = 0

Polynomial Function
(d/dx) x^a = a x^(a−1)
• Example:
  (d/dx) x² = lim_{t→0} [(x + t)² − x²] / t
            = lim_{t→0} [x² + 2xt + t² − x²] / t
            = lim_{t→0} (2xt + t²) / t
            = lim_{t→0} t(2x + t) / t
            = lim_{t→0} (2x + t)
            = 2x

Proof: Polynomial Function
• By the binomial theorem,
  (x + t)^n = x^n + C(n,1) x^(n−1) t + C(n,2) x^(n−2) t² + ⋯ + C(n,n−1) x t^(n−1) + t^n
• Therefore
  (d/dx) x^n = lim_{t→0} [(x + t)^n − x^n] / t
             = lim_{t→0} [n x^(n−1) t + C(n,2) x^(n−2) t² + ⋯ + t^n] / t
             = lim_{t→0} [n x^(n−1) + C(n,2) x^(n−2) t + ⋯ + t^(n−1)]
             = n x^(n−1)

Logarithm Function
(d/dx) ln x = 1/x
• where the base e = lim_{n→∞} (1 + 1/n)^n
• Example:
  (d/dx) log_a x = (d/dx) [ln x / ln a] = (1 / ln a) (d/dx) ln x = 1 / (x ln a)
• With t = x² + 2:
  (d/dx) ln(x² + 2) = [(d/dt) ln t] × (dt/dx) = (1/t) × 2x = 2x / (x² + 2)

Proof: Logarithm Function
• (d/dx) ln x = lim_{t→0} [ln(x + t) − ln x] / t
             = lim_{t→0} (1/t) ln[(x + t)/x]
             = lim_{t→0} ln (1 + t/x)^(1/t)
• Substituting n = x/t (so n → ∞ as t → 0⁺):
             = lim_{n→∞} ln (1 + 1/n)^(n/x) = (1/x) ln e = 1/x

Exponential Function
(d/dx) e^x = e^x
• Example: with t = x² + x,
  (d/dx) e^(x² + x) = [(d/dt) e^t] × (dt/dx) = e^t × (2x + 1) = (2x + 1) e^(x² + x)

Proof: Exponential Function
• Let's calculate (d/dx) ln(e^x). Let e^x = t. Then
• (d/dx) ln(e^x) = (d/dx) x = 1
• (d/dx) ln t = (1/t)(dt/dx) = (1/e^x) (d/dx) e^x
• Thus (1/e^x) (d/dx) e^x = 1, and (d/dx) e^x = e^x

Exponential Function
(d/dx) a^x = a^x ln a
• Proof.
• Let a^x = t. Then ln t = ln a^x = x ln a
• (d/dx) ln t = (d/dx) (x ln a) = ln a
• So (1/t)(dt/dx) = ln a
• Thus dt/dx = t ln a = a^x ln a

Taylor Series
f(x) = f(a) + [f′(a)/1!](x − a) + [f″(a)/2!](x − a)² + ⋯
When a = 0:
f(x) = f(0) + [f′(0)/1!] x + [f″(0)/2!] x² + ⋯
Example: f(x) = x²
x² = x²|_{x=0} + [2x|_{x=0}/1!] x + [2/2!] x² + 0 + ⋯ = 0 + 0·x + x² = x²

Partial Derivative and Gradient
x = (x₁, …, x_n)ᵀ
For example, f(x) = a x₁ x₂ + b x₂²
The partial derivative of a function f(x) with respect to a variable x_i is the derivative of f(x) while regarding the other variables x₁, …, x_{i−1}, x_{i+1}, …, x_n as constants.
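The derivative rules above can be sanity-checked numerically: a small step t approximates the limit in the definition. A minimal Python sketch (the helper name `numeric_derivative` is illustrative, not from the lecture):

```python
# Numerically check the derivative rules above with a central
# difference quotient, which approximates the limit t -> 0.
import math

def numeric_derivative(f, x, t=1e-6):
    """Approximate f'(x) by the symmetric difference quotient."""
    return (f(x + t) - f(x - t)) / (2 * t)

# d/dx x^2 = 2x, so at x = 3 the derivative is 6
assert abs(numeric_derivative(lambda x: x**2, 3.0) - 6.0) < 1e-4
# d/dx ln(x) = 1/x, so at x = 2 the derivative is 0.5
assert abs(numeric_derivative(math.log, 2.0) - 0.5) < 1e-4
# d/dx e^x = e^x, so at x = 1 the derivative is e
assert abs(numeric_derivative(math.exp, 1.0) - math.e) < 1e-4
# Chain rule: d/dx ln(x^2 + 2) = 2x/(x^2 + 2), so at x = 1 it is 2/3
assert abs(numeric_derivative(lambda x: math.log(x**2 + 2), 1.0) - 2/3) < 1e-4
```

The central difference (f(x+t) − f(x−t)) / 2t is used instead of the one-sided quotient from the definition because its error shrinks quadratically in t.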
∂f/∂x₁ = a x₂
∂f/∂x₂ = a x₁ + 2 b x₂
∇f(x) = (∂f/∂x₁, …, ∂f/∂x_n)ᵀ

Taylor Approximation
Taylor series:         f(x) = Σ_{i=0}^{∞} [f⁽ⁱ⁾(a)/i!] (x − a)ⁱ
Taylor approximation:  f(x) ≈ Σ_{i=0}^{k} [f⁽ⁱ⁾(a)/i!] (x − a)ⁱ

First-Order Taylor Approximation
1 dimension:  f(x) ≈ f(a) + f′(a)(x − a)
n dimensions, with x = (x₁, …, x_n)ᵀ:  f(x) ≈ f(a) + ∇f(a)ᵀ(x − a)

Gradient Descent Optimization
According to the first-order Taylor approximation of f(x):
  f(x_n + hu) ≈ f(x_n) + h ∇f(x_n)ᵀ u + o(h)
It can be written as:
  f(x_n + hu) − f(x_n) ≈ h ∇f(x_n)ᵀ u + o(h)
where h is the learning rate and u is a unit vector representing the direction. Let x_{n+1} = x_n + hu, which is the value of x in the next iteration. Our optimization objective is:
  argmin_u [f(x_n + hu) − f(x_n)] ≈ argmin_u h ∇f(x_n)ᵀ u
The optimal solution is the direction opposite to the gradient:
  u = −∇f(x_n) / ‖∇f(x_n)‖

Gradient Descent Algorithm
For n = 1, 2, …, N_max:
    g_n = ∇f(x_n)
    if ‖g_n‖ ≤ ε, return x_n
    x_{n+1} = x_n − h g_n
End

Part 2: Probability

Independent Events
• Let A and B be two independent events. Then
  P(A, B) = P(A) P(B)
• Example 1: Coin tossing
  – Each toss is independent of the previous ones
  – P(Head, Tail) = P(Head) P(Tail)
• Example 2: Taking exams
  – Each exam is independent of the previous ones
  – Fail 3 times: P(F, F, F) = P(F) P(F) P(F) = P(F)³
  – Pass at least once: P(pass) = 1 − P(F, F, F) = 1 − P(F)³

Conditional Probability
P(A | B) = P(A, B) / P(B)
Example
• A person went to the sauna 6 times during the last 10 days, at most once per day.
• It snowed on 8 of the last 10 days.
• It snowed on 4 of the 6 sauna days.
• P(sauna | snow) = ?
• P(snow | sauna) = ?
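The gradient descent algorithm above translates almost line for line into code. A minimal Python sketch, assuming the caller supplies the gradient function (all names and default values are illustrative):

```python
# Gradient descent: repeatedly step against the gradient until its
# norm falls below the tolerance or the iteration budget runs out.
import math

def gradient_descent(grad_f, x0, h=0.1, eps=1e-6, n_max=10000):
    """Minimize f from starting point x0 with learning rate h."""
    x = list(x0)
    for _ in range(n_max):
        g = grad_f(x)                               # g_n = grad f(x_n)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            break                                   # ||g_n|| <= eps: stop
        x = [xi - h * gi for xi, gi in zip(x, g)]   # x_{n+1} = x_n - h g_n
    return x

# Example: f(x) = (x1 - 1)^2 + (x2 + 2)^2, minimized at (1, -2).
grad = lambda x: [2 * (x[0] - 1), 2 * (x[1] + 2)]
x_star = gradient_descent(grad, [0.0, 0.0])
assert abs(x_star[0] - 1.0) < 1e-4 and abs(x_star[1] + 2.0) < 1e-4
```

Note the step uses −h gₙ directly rather than the normalized unit vector u; both follow the descent direction, and the unnormalized form is the one in the algorithm box above.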
Bayes' Theorem
Since
  P(θ | y) = P(y, θ) / P(y)
and
  P(y, θ) = P(y | θ) P(θ) = P(θ | y) P(y)
then
  P(θ | y) = P(y, θ) / P(y) = P(y | θ) P(θ) / P(y)

Bayes' Theorem
P(θ | y) = P(y, θ) / P(y) = P(y | θ) P(θ) / P(y)
With the same data P(y) and the same prior P(θ):
  P(θ | y) ∝ P(y | θ)

Maximum Likelihood Estimation
• Input: a set of observations y₁, …, y_n with likelihoods P(y_i | θ) under parameters θ
• Output: the estimate of θ
• Assume that all of the observations are independent
• Thus their joint probability can be calculated as
  ℒ(y | θ) = ∏_{i=1}^{n} P(y_i | θ)

Maximum Likelihood Estimation
• We try to find the θ that makes the given observations y most probable:
  θ̂ = argmax_θ ℒ(θ | y)
• With the same p(y) and p(θ), we can equivalently maximize ℒ(y | θ):
  θ̂ = argmax_θ ℒ(y | θ) = argmin_θ [−ℒ(y | θ)] = argmin_θ [−∏_{i=1}^{n} P(y_i | θ)]

Optimization
• Since ln x is an increasing function, this is equivalent to
  θ̂ = argmin_θ [−ln ℒ(y | θ)] = argmin_θ [−ln ∏_{i=1}^{n} P(y_i | θ)]
    = argmin_θ [−Σ_{i=1}^{n} ln P(y_i | θ)]
Then we can optimize it with gradient descent.

Any questions?
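As a concrete instance of this pipeline, consider maximum likelihood estimation for a Bernoulli model: observations yᵢ ∈ {0, 1} with P(yᵢ = 1 | θ) = θ. A minimal sketch, assuming this model (it is not from the lecture); the closed-form Bernoulli MLE, the sample mean, serves as a check:

```python
# MLE by gradient descent on the negative log-likelihood of a
# Bernoulli model: P(y=1|theta) = theta, P(y=0|theta) = 1 - theta.
import math

def neg_log_likelihood(theta, ys):
    """-sum_i ln P(y_i | theta) for the Bernoulli likelihood."""
    return -sum(math.log(theta if y == 1 else 1 - theta) for y in ys)

ys = [1, 0, 1, 1, 0, 1, 1, 1]  # 6 ones out of 8 observations

# Gradient of the NLL: d/dtheta = -sum(y/theta - (1-y)/(1-theta))
theta = 0.5
for _ in range(2000):
    grad = -sum(y / theta - (1 - y) / (1 - theta) for y in ys)
    theta -= 0.01 * grad
    theta = min(max(theta, 1e-6), 1 - 1e-6)  # keep theta inside (0, 1)

# Gradient descent should have lowered the NLL below the start value
assert neg_log_likelihood(theta, ys) < neg_log_likelihood(0.5, ys)
# ... and reached the closed-form MLE, the sample mean 6/8 = 0.75
assert abs(theta - 0.75) < 1e-3
```

The clamp on θ is a practical guard: the log-likelihood is undefined at θ = 0 and θ = 1, so the iterate is kept strictly inside the open interval.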