Lecture Note 2 – Calculus and Probability
Shuaiqiang Wang
Department of CS & IS
University of Jyväskylä
http://users.jyu.fi/~swang/
shuaiqiang.wang@jyu.fi
Part 1: Calculus
Definition
• Given a function $f(x)$, the derivative is
  $$f'(x) = \frac{d}{dx} f(x) = \lim_{t \to 0} \frac{f(x + t) - f(x)}{t}$$
• Chain rule:
  $$\frac{d}{dx} f(t) = \frac{df}{dt} \frac{dt}{dx}$$
• The derivative of a constant is zero:
  $$\frac{d}{dx} 2 = 0$$
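The limit definition can be checked numerically by plugging in a small $t$; a minimal Python sketch (the test functions and evaluation point are illustrative):

```python
def derivative(f, x, t=1e-6):
    """Finite-difference approximation of the limit definition:
    f'(x) ~ (f(x + t) - f(x)) / t for a small t."""
    return (f(x + t) - f(x)) / t

# The derivative of a constant is 0:
print(derivative(lambda x: 2.0, 3.0))    # 0.0
# d/dx x^2 at x = 3 should be close to 2 * 3 = 6:
print(derivative(lambda x: x * x, 3.0))
```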
Polynomial Function
$$\frac{d}{dx} x^a = a x^{a-1}$$
• Example:
  $$\frac{d}{dx} x^2 = \lim_{t \to 0} \frac{(x+t)^2 - x^2}{t} = \lim_{t \to 0} \frac{x^2 + 2xt + t^2 - x^2}{t}$$
  $$= \lim_{t \to 0} \frac{2xt + t^2}{t} = \lim_{t \to 0} \frac{t(2x + t)}{t} = \lim_{t \to 0} (2x + t) = 2x$$
Proof: Polynomial Function
• $(x + t)^n = x^n + \binom{n}{1} x^{n-1} t + \binom{n}{2} x^{n-2} t^2 + \cdots + \binom{n}{n-1} x t^{n-1} + t^n$
• $$\frac{d}{dx} x^n = \lim_{t \to 0} \frac{(x+t)^n - x^n}{t} = \lim_{t \to 0} \frac{n x^{n-1} t + \binom{n}{2} x^{n-2} t^2 + \cdots + t^n}{t}$$
  $$= \lim_{t \to 0} \left( n x^{n-1} + \binom{n}{2} x^{n-2} t + \cdots + t^{n-1} \right) = n x^{n-1}$$
Logarithm Function
$$\frac{d}{dx} \ln x = \frac{1}{x}$$
• where the base is $e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n$
• Example:
  $$\frac{d}{dx} \log_a x = \frac{d}{dx} \frac{\ln x}{\ln a} = \frac{1}{\ln a} \frac{d}{dx} \ln x = \frac{1}{x \ln a}$$
• Example: with $t = x^2 + 2$,
  $$\frac{d}{dx} \ln(x^2 + 2) = \frac{d}{dt} \ln t \times \frac{dt}{dx} = \frac{1}{t} \times 2x = \frac{2x}{x^2 + 2}$$
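The chain-rule result above can be verified against a finite-difference approximation (a sketch; the evaluation point is arbitrary):

```python
import math

def approx_deriv(f, x, t=1e-6):
    # Finite-difference stand-in for the limit definition of the derivative.
    return (f(x + t) - f(x)) / t

x = 1.5
numeric = approx_deriv(lambda v: math.log(v * v + 2), x)
analytic = 2 * x / (x * x + 2)  # d/dx ln(x^2 + 2) = 2x / (x^2 + 2)
print(numeric, analytic)  # the two values agree to several decimal places
```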
Proof: Logarithm Function
• $$\frac{d}{dx} \ln x = \lim_{t \to 0} \frac{\ln(x + t) - \ln x}{t} = \lim_{t \to 0} \frac{1}{t} \ln \frac{x + t}{x}$$
  $$= \lim_{t \to 0} \ln \left(1 + \frac{t}{x}\right)^{1/t} = \frac{1}{x} \ln \left( \lim_{t \to 0} \left(1 + \frac{t}{x}\right)^{x/t} \right) = \frac{1}{x} \ln e = \frac{1}{x}$$
Exponential Function
$$\frac{d}{dx} e^x = e^x$$
• Example: with $t = x^2 + x$,
  $$\frac{d}{dx} e^{x^2 + x} = \frac{d}{dt} e^t \times \frac{dt}{dx} = e^t \times (2x + 1) = (2x + 1) e^{x^2 + x}$$
Proof: Exponential Function
• Let’s calculate $\frac{d}{dx} \ln(e^x)$. Let $e^x = t$.
• Then
  $$\frac{d}{dx} \ln(e^x) = \frac{d}{dx} x = 1$$
• By the chain rule,
  $$\frac{d}{dx} \ln t = \frac{1}{t} \frac{dt}{dx} = \frac{1}{e^x} \frac{d}{dx} e^x$$
• Thus $\frac{1}{e^x} \frac{d}{dx} e^x = 1$, and $\frac{d}{dx} e^x = e^x$
Exponential Function
$$\frac{d}{dx} a^x = a^x \ln a$$
• Proof.
• Let $a^x = t$. Then $\ln t = \ln a^x = x \ln a$
• Differentiating both sides,
  $$\frac{d}{dx} \ln t = \frac{d}{dx} \ln(a^x) = \frac{d}{dx} (x \ln a) = \ln a$$
• By the chain rule,
  $$\frac{d}{dx} \ln t = \frac{1}{t} \frac{dt}{dx} = \ln a$$
• Thus $\frac{dt}{dx} = t \ln a = a^x \ln a$
Taylor Series
$$f(x) = f(a) + \frac{f'(a)}{1!} (x - a) + \frac{f''(a)}{2!} (x - a)^2 + \cdots$$
When $a = 0$:
$$f(x) = f(0) + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 + \cdots$$
Example: $f(x) = x^2$
$$x^2 = x^2 \big|_{x=0} + \frac{2x \big|_{x=0}}{1!} x + \frac{2}{2!} x^2 + \frac{0}{3!} x^3 + \cdots = 0 + \frac{0}{1!} x + \frac{2}{2!} x^2 + 0 + \cdots = x^2$$
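Truncating the series after a few terms gives a practical approximation; a sketch with $f(x) = e^x$ around $a = 0$, where every derivative at $0$ equals $1$:

```python
import math

def taylor_exp(x, k):
    # Partial sum of the Taylor series of e^x around a = 0:
    # sum_{i=0}^{k} x^i / i!
    return sum(x ** i / math.factorial(i) for i in range(k + 1))

for k in (1, 3, 6, 10):
    print(k, taylor_exp(1.0, k))  # approaches e = 2.71828... as k grows
```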
Partial Derivative and Gradient
$$\boldsymbol{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$
For example, $f(\boldsymbol{x}) = a x_1 x_2 + b x_2^2$.
The partial derivative of a function $f(\boldsymbol{x})$ with respect to a certain variable $x_i$ is the derivative of $f(\boldsymbol{x})$ while regarding the other variables $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants:
$$\frac{\partial f}{\partial x_1} = a x_2, \qquad \frac{\partial f}{\partial x_2} = a x_1 + 2 b x_2$$
The gradient collects all the partial derivatives into a vector:
$$\nabla f(\boldsymbol{x}) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$
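The analytic partial derivatives above can be checked one coordinate at a time with finite differences (the constants $a$, $b$ and the evaluation point are illustrative):

```python
a, b = 2.0, 3.0  # illustrative constants

def f(x1, x2):
    return a * x1 * x2 + b * x2 ** 2

def gradient(x1, x2):
    # Partial derivatives derived above:
    return (a * x2, a * x1 + 2 * b * x2)

# Finite-difference check of each partial derivative at (1, 2):
h = 1e-6
x1, x2 = 1.0, 2.0
g1 = (f(x1 + h, x2) - f(x1, x2)) / h
g2 = (f(x1, x2 + h) - f(x1, x2)) / h
print(gradient(x1, x2))  # (4.0, 14.0)
print(g1, g2)            # numerically close to the analytic gradient
```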
Taylor Approximation
Taylor series:
$$f(x) = \sum_{i=0}^{\infty} \frac{f^{(i)}(a)}{i!} (x - a)^i$$
Taylor approximation (truncated at order $k$):
$$f(x) \approx \sum_{i=0}^{k} \frac{f^{(i)}(a)}{i!} (x - a)^i$$
First-Order Taylor Approximation
1 dimension:
$$f(x) \approx f(a) + f'(a) (x - a)$$
$n$ dimensions, when $\boldsymbol{x} = (x_1, \ldots, x_n)^\top$:
$$f(\boldsymbol{x}) \approx f(\boldsymbol{a}) + \nabla f(\boldsymbol{a})^\top (\boldsymbol{x} - \boldsymbol{a})$$
Gradient Descent Optimization
According to the first-order Taylor approximation of $f(\boldsymbol{x})$:
$$f(\boldsymbol{x}_n + h \boldsymbol{u}) \approx f(\boldsymbol{x}_n) + h \nabla f(\boldsymbol{x}_n)^\top \boldsymbol{u}$$
where $h$ is the learning rate and $\boldsymbol{u}$ is a unit vector representing the direction.
Let $\boldsymbol{x}_{n+1} = \boldsymbol{x}_n + h \boldsymbol{u}$, which is the value of $\boldsymbol{x}$ in the next iteration.
Our optimization objective is:
$$\arg\min_{\boldsymbol{u}} \left[ f(\boldsymbol{x}_n + h \boldsymbol{u}) - f(\boldsymbol{x}_n) \right] \approx \arg\min_{\boldsymbol{u}} h \nabla f(\boldsymbol{x}_n)^\top \boldsymbol{u}$$
The optimal solution is $\boldsymbol{u} = -\nabla f(\boldsymbol{x}_n)$ (up to normalization): the direction of steepest descent.
Gradient Descent Algorithm
For n  1, 2,K , N max :
g n  f ( xn )
if ||g ( xn )||   , return xn
xn 1  xn  hg n
n  n 1
End
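A minimal Python sketch of the loop above, minimizing a one-dimensional quadratic (the objective, step size $h$, and tolerance are illustrative choices):

```python
def gradient_descent(grad, x0, h=0.1, eps=1e-8, n_max=1000):
    """Repeat x_{n+1} = x_n - h * g_n until ||g_n|| <= eps or N_max is hit."""
    x = x0
    for _ in range(n_max):
        g = grad(x)
        if abs(g) <= eps:
            break
        x = x - h * g
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2(x - 3) and the minimum is x = 3.
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_star)  # converges to 3
```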
Part 2: Probability
Independent Events
• Let $A$ and $B$ be two independent events. Then
  $$P(A, B) = P(A) P(B)$$
• Example 1: Coin tossing
  – Each toss is independent of the previous ones
  – $P(Head, Tail) = P(Head) P(Tail)$
• Example 2: Taking exams
  – Each exam is independent of the previous ones
  – Fail 3 times: $P(F, F, F) = P(F) P(F) P(F) = P(F)^3$
  – Pass at least 1 time: $P(Pass) = 1 - P(F, F, F) = 1 - P(F)^3$
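A quick simulation of the exam example (the failure probability 0.4 is a made-up illustration):

```python
import random

random.seed(0)
p_fail = 0.4       # hypothetical probability of failing one exam
trials = 100_000

# Simulate 3 independent exams per trial and count failing all three.
all_fail = sum(
    all(random.random() < p_fail for _ in range(3)) for _ in range(trials)
)
print(all_fail / trials)  # close to P(F)^3 = 0.4^3 = 0.064
print(1 - p_fail ** 3)    # P(pass at least once) = 1 - P(F)^3 = 0.936
```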
Conditional Probability
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$
Example
• A person went to the sauna on 6 of the last 10 days, at
  most once per day.
• It snowed 8 days during the last 10 days.
• It snowed 4 days during the 6 sauna days.
• P(sauna | snow) = ?
• P(snow | sauna) = ?
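Both questions follow directly from the definition; a sketch computing them from the counts above:

```python
days = 10
sauna_days = 6
snow_days = 8
sauna_and_snow_days = 4  # it snowed on 4 of the 6 sauna days

p_sauna_and_snow = sauna_and_snow_days / days  # P(sauna, snow) = 0.4

# P(sauna | snow) = P(sauna, snow) / P(snow)
p_sauna_given_snow = p_sauna_and_snow / (snow_days / days)
# P(snow | sauna) = P(sauna, snow) / P(sauna)
p_snow_given_sauna = p_sauna_and_snow / (sauna_days / days)

print(p_sauna_given_snow)  # 0.5
print(p_snow_given_sauna)  # ~0.667
```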
Bayes’ Theorem
Since
$$P(\theta \mid y) = \frac{P(y, \theta)}{P(y)}$$
and
$$P(y, \theta) = P(y \mid \theta) P(\theta) = P(\theta \mid y) P(y)$$
then
$$P(\theta \mid y) = \frac{P(y, \theta)}{P(y)} = \frac{P(y \mid \theta) P(\theta)}{P(y)}$$
Bayes’ Theorem
$$P(\theta \mid y) = \frac{P(y, \theta)}{P(y)} = \frac{P(y \mid \theta) P(\theta)}{P(y)}$$
With the same data $P(y)$ and the same prior $P(\theta)$:
$$P(\theta \mid y) \propto P(y \mid \theta)$$
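A tiny numeric check of the identity, using hypothetical priors and likelihoods over two parameter values:

```python
# Hypothetical prior P(theta) over two parameter values:
p_theta = {0: 0.3, 1: 0.7}
# Hypothetical likelihood P(y | theta) for one fixed observation y:
p_y_given_theta = {0: 0.9, 1: 0.2}

# Evidence P(y), summing the joint P(y, theta) = P(y | theta) P(theta):
p_y = sum(p_y_given_theta[t] * p_theta[t] for t in p_theta)

# Posterior via Bayes' theorem:
posterior = {t: p_y_given_theta[t] * p_theta[t] / p_y for t in p_theta}
print(posterior)                # P(theta | y) for theta = 0 and theta = 1
print(sum(posterior.values())) # a valid distribution: sums to 1.0
```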
Maximum Likelihood Estimation
• Input: A set of observations $y_1, \ldots, y_n$ generated by a
  model with parameters $\theta$
• Output: The estimate of $\theta$
• Assume that all of the observations are
  independent
• Thus their joint probability (the likelihood) can be calculated as
  $$\mathcal{L}(y \mid \theta) = \prod_{i=1}^{n} P(y_i \mid \theta)$$
Maximum Likelihood Estimation
• We try to find the most probable $\theta$ given the
  observations $y$:
  $$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta \mid y)$$
• With the same $P(y)$ and $P(\theta)$, we can equivalently maximize
  $\mathcal{L}(y \mid \theta)$:
  $$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(y \mid \theta) = \arg\min_{\theta} -\mathcal{L}(y \mid \theta) = \arg\min_{\theta} -\prod_{i=1}^{n} P(y_i \mid \theta)$$
Optimization
• Since $\ln x$ is an increasing function, this is
  equivalent to
  $$\hat{\theta} = \arg\min_{\theta} -\ln \mathcal{L}(y \mid \theta) = \arg\min_{\theta} -\ln \prod_{i=1}^{n} P(y_i \mid \theta) = \arg\min_{\theta} -\sum_{i=1}^{n} \ln P(y_i \mid \theta)$$
Then we can optimize it with gradient descent.
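Putting the two parts together: a sketch that estimates a Bernoulli parameter by running gradient descent on the negative log-likelihood (the observations and learning rate are made up; the closed-form MLE here is simply the sample mean):

```python
import math

# Hypothetical Bernoulli observations y_i in {0, 1}; theta = P(y_i = 1).
y = [1, 0, 1, 1, 0, 1, 1, 1]

def neg_log_likelihood(theta):
    # -sum_i ln P(y_i | theta)
    return -sum(math.log(theta if yi == 1 else 1 - theta) for yi in y)

def grad(theta):
    # Derivative of the negative log-likelihood with respect to theta.
    k = sum(y)   # number of ones
    n = len(y)
    return -k / theta + (n - k) / (1 - theta)

theta = 0.5      # initial guess
h = 0.01         # learning rate
for _ in range(2000):
    theta -= h * grad(theta)

print(theta)  # approaches the closed-form MLE: mean(y) = 6/8 = 0.75
```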
Any Questions?