Calculus I for Machine Learning
Some Applications of Concepts of Sequence and Series
Mohammed Nasser
Professor, Dept. of Statistics, RU, Bangladesh
Email: mnasser.ru@gmail.com
P.C. Mahalanobis (1893-1972), the pioneer of statistics in Asia:
"A good mathematician may not be a good statistician, but a good statistician must be a good mathematician."
Andrey Nikolaevich Kolmogorov (Russian) (25 April 1903 - 20 October 1987)
In 1933, Kolmogorov published the book Foundations of the Theory of Probability, laying the modern axiomatic foundations of probability theory.
Statistics + Machine Learning
Vladimir Vapnik
Jerome H. Friedman
Learning and Inference
The inductive inference process:
• Observe a phenomenon
• Construct a model of the phenomenon
• Make predictions
→ This is more or less the definition of the natural sciences!
→ The goal of Machine Learning is to automate this process.
→ The goal of Learning Theory is to formalize it.
What is Learning?
• 'The action of receiving instruction or acquiring knowledge'
• 'A process which leads to the modification of behaviour or the acquisition of new abilities or responses, and which is additional to natural development by growth or maturation'
Machine Learning
• Negnevitsky: 'In general, machine learning involves adaptive mechanisms that enable computers to learn from experience, learn by example and learn by analogy' (2005:165)
• Callan: 'A machine or software tool would not be viewed as intelligent if it could not adapt to changes in its environment' (2003:225)
• Luger: 'Intelligent agents must be able to change through the course of their interactions with the world' (2002:351)
The Sub-Fields of ML
• Supervised Learning: classification, regression
• Unsupervised Learning: clustering, density estimation
• Reinforcement Learning
Classical Problem
What is the weight of an elephant?
What is the weight of, or the distance to, the sun?
Classical Problem
What is the weight/size of a baby in the womb?
What is the weight of a DNA molecule?
Solution of the Classical Problem
Let us suppose somehow we have x1,x2,- - -xn
measurements
One million dollar question:
How can we choose the optimum one among
infinite possible alternatives to combine these n
obs. to estimate the target,μ
What is the optimum n?
11
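As a first taste of the question, here is a minimal R sketch (not from the original slides; the normal population and the constants n, B, μ, σ are assumptions chosen for illustration) comparing three candidate ways of combining the observations by their simulated mean squared error:

set.seed(1)
B <- 10000; n <- 20; mu <- 5; sigma <- 2           # assumed constants
ests <- replicate(B, {
  x <- rnorm(n, mean = mu, sd = sigma)             # one simulated sample
  c(mean = mean(x), median = median(x), trimmed = mean(x, trim = 0.1))
})
rowMeans((ests - mu)^2)                            # MSE of each estimator around mu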
We need the concepts:
• Probability distributions and probability measures
• The model $X_i = \mu + \varepsilon_i$, $\varepsilon_i \sim F(x/\sigma)$, where $X_i$ is the ith observation and μ is the target that we want to estimate
Our Targets
 We want to chose T s.t.T(Xi,….,Xn) is always very
near to μ
How do we quantify the problem?
Let us elaborate this issue through examples.
13
Inference with a Single Observation
[Diagram: Population (parameter μ) → Sampling → Observation Xi → Inference]
• Each observation Xi in a random sample is a representative of the unobserved variables in the population.
• Each observation is an estimator of μ, but its variance is as large as the population variance.
Normal Distribution
• In this problem the normal distribution is the most popular model for the overall population.
• We can calculate the probability of getting observations greater than or less than any value.
• Usually we don't have a single observation, but instead the mean of a set of observations.
Inference with Sample Mean
[Diagram: Population (parameter μ) → Sampling → Sample → Estimation → Statistic x̄ → Inference]
• The sample mean is our estimate of the population mean.
• How much would the sample mean change if we took a different sample?
• Key to this question: the sampling distribution of x̄.
Sampling Distribution of Sample Mean
• Distribution of the values taken by the statistic in all possible samples of size n from the same population.
• Model assumption: our observations xi are sampled from a population with mean μ and variance σ².
[Diagram: from a population with unknown parameter μ, draw sample 1 of size n, sample 2 of size n, sample 3 of size n, ...; each sample yields one value of x̄. What is the distribution of these values?]
Points to Be Remembered
• If the population is finite, the number of possible sample means is finite.
• If the population is countably infinite, the number of possible sample means is countably infinite.
• If the population is uncountably infinite, the number of possible sample means is uncountably infinite.
Meaning of Sampling Distribution
[Figure: simulated sampling distributions; replications B = 10000]
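In R, the simulation behind such a picture takes only a few lines (a minimal sketch; the normal population and its parameters are assumptions, B = 10000 as on the slide):

set.seed(1)
B <- 10000; n <- 10; mu <- 5; sigma <- 2
xbar <- replicate(B, mean(rnorm(n, mean = mu, sd = sigma)))  # B sample means
hist(xbar, breaks = 50, main = "Simulated sampling distribution of the mean")
abline(v = mu, lwd = 2)                                      # the target mu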
• Comparing the sampling distribution of the sample mean when n = 1 vs. n = 10
Examination on a Real Data Set
We also consider a real set of health data of 1491
Japanese adult male students from various districts of
Japan as population.
Four head measurements: head length, head breadth,
head height and headcircumference and two physical
measurements: stature and weight
 Data were taken by one observer, Funmio Ohtsuki
(Hossain et al. 2005) using the technique of Martin and
Saller (1957).
21
Histogram and Density of Head Length (Truncated at the Left)
Basic Information about Two Populations

Type        Mean     Variance   b1    b2    Size
Original    178.99   37.13      .08   2.98  1491
Truncated   181.85   19.63      .80   3.45  1063

(b1 and b2 are the usual moment coefficients of skewness and kurtosis.)
Sampling Distributions
$\bar X_n$ and $\sqrt{n}\,(\bar X_n - \mu)$, for n = 10, 20, 100 & 500
Replications = 10000
Boxplots of Means for Original Population
[Boxplots of $\bar X_n$ and $\sqrt{n}\,(\bar X_n - \mu)$; replications = 10000]
Descriptive Statistics of Sampling Distribution of Means for Original Population

           biassim   varsim   varasim
n = 10      0.0221   3.5084   35.0836
n = 20     -0.0230   1.8560   37.1210
n = 100     0.0022   0.3634   36.3167
n = 500     0.0041   0.0715   35.7484

(varasim = n × varsim, to be compared with the population variance 37.13.)
Density of Means for Original Population
Histograms of Means for Truncated Population
Boxplots of Means for Truncated Population
Descriptive Statistics of Sampling Distribution of Means for Truncated Population

           biassim   varsim   varasim
n = 10     -0.0105   2.0025   20.0249
n = 20     -0.0002   0.9810   19.62088
n = 100    -0.0014   0.1958   19.5790
n = 500    -0.0029   0.0395   19.7419

(varasim = n × varsim, to be compared with the population variance 19.63.)
Chi-square with Two D.F.
Boxplots of Means for χ²(2)
Histogram of Means for χ²(2)
Central Limit Theorem
• If the sample size is large enough, then the sample mean x̄ has an approximately normal distribution.
• This is true no matter what the shape of the distribution of the original data!
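A minimal R sketch of this fact (added for illustration), reusing the χ²(2) population from the earlier slides, whose mean is 2 and variance is 4:

set.seed(1)
B <- 10000
for (n in c(10, 100)) {
  xbar <- replicate(B, mean(rchisq(n, df = 2)))        # B sample means
  hist(sqrt(n) * (xbar - 2), breaks = 50, freq = FALSE,
       main = paste("sqrt(n)(xbar - mu), n =", n))
  curve(dnorm(x, sd = 2), add = TRUE, lwd = 2)         # N(0, 4) limit
}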
Histogram of 100000 Obs from Standard Cauchy
[Figure: $\bar X_n$, N = 500]
[Figure: $\sqrt{n}\,(\bar X_n - \mu)$, N = 500]
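These pictures can be reproduced with a short R sketch (added for illustration; the sample size is an arbitrary choice). The running mean never settles down, because a Cauchy distribution has no mean, so the CLT conditions fail:

set.seed(1)
x <- rcauchy(100000)                       # standard Cauchy draws
running_mean <- cumsum(x) / seq_along(x)   # xbar_n as n grows
plot(running_mean, type = "l", xlab = "n", ylab = "running mean",
     main = "Running mean of standard Cauchy observations")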
Central Limit Theorem
This is a special case of convergence in distribution. Writing
$F_{\bar X_n}(x) = \Pr(\sqrt{n}\,(\bar X_n - \mu) \le x),$
we have, subject to the existence of the mean and variance,
$F_{\bar X_n}(x) \to \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-t^2/2\sigma^2}\, dt \quad \forall x \text{ as } n \to \infty.$
Research is going on to relax the i.i.d. condition.
How many sequences in CLT?
• The basic random function sequence, $X_n(\omega)$
• The derived random function sequence, $\bar X_n(\omega) - \mu$
• The real sequence $1/n$, used to compare the convergence of $\bar X_n(\omega) - \mu$ to 0
• Another real, nonnegative function sequence, $F_{\bar X_n}(x)$
Significance of CLT
• From mathematics we know that we can approximate $a_n$ by $a$ as accurately as we wish when $a_n \to a$.
• The sampling distribution of means can be approximated by the normal distribution when the CLT holds and the sample size is fairly large.
• It justifies building confidence intervals for μ from the sample mean and the normal table in non-normal cases.
More Topics Worth Studying
• To have error bounds like $\sup_x |F_{\bar X_n}(x) - \Phi(x)| \le g(n)$: the Berry-Esseen Theorem (1941, 1945).
• To characterize extreme fluctuations using sequences like $\log n$, $\log\log n$, etc.: the Law of the Iterated Logarithm (Hartman and Wintner, 1941).
• To check uniformity of convergence: uniform convergence is better than simple pointwise convergence. Polya's theorem guarantees that, since the normal cdf is everywhere continuous, the convergence in the CLT is uniform.
Why do we use x̄ to estimate μ?
P1. $E(\bar X) = \mu$. (What is the meaning of "E"?)
P2. $V(\bar X) = E[(\bar X - \mu)^2] = V(X)/n$. (What is its significance?)
P3. $\bar X$ converges to μ in probability:
$\lim_{n \to \infty} a_n(\varepsilon) = \lim_{n \to \infty} \Pr(|\bar X_n - \mu| < \varepsilon) = 1 \quad \forall\, \varepsilon > 0,$
subject to condition (1): $t\,[1 - F(t) + F(-t)] \to 0$ as $t \to \infty$.
Why do we use x̄ to estimate μ?
P4. $\bar X$ converges to μ almost surely:
$\Pr(\omega : \lim_{n \to \infty} |\bar X_n(\omega) - \mu| = 0) = 1,$
subject to condition (2): $E(|X_1|) < \infty$.
Condition 2 implies condition 1.
Difference between Two Limits
Convergence in probability: $\lim_{n \to \infty} a_n(\varepsilon) = \lim_{n \to \infty} \Pr(|\bar X_n - \mu| < \varepsilon) = 1$. The probability is calculated first, then the limit is taken.
Almost sure convergence: $\Pr(\omega : \lim_{n \to \infty} |\bar X_n(\omega) - \mu| = 0) = 1$. The limit is calculated first, then the probability is calculated.
Why do we use x̄?
P5. Let $X \sim N(\mu, \sigma^2) \Leftrightarrow \varepsilon \sim N(0, \sigma^2)$. Then $\bar X \sim N(\mu, \sigma^2/n)$, so we can make statements like $\Pr[a(X_n) < \mu < b(X_n)] = 1 - \alpha \approx 1$.
P6. The Central Limit Theorem justifies using statements like $\Pr[a(X_n) < \mu < b(X_n)] = 1 - \alpha \approx 1$ when X is not normal.
P7. $V(\bar X_n) \le V(T_n)$ whenever $E(T_n) = \mu$ and $X \sim N(\mu, \sigma^2)$; that is, under normality x̄ has the smallest variance among unbiased estimators of μ.
Meaning of Expectation of g(X), E(g(X))
If X is discrete:
$E(g(X)) = \sum_i g(x_i)\, p(x_i)$, provided $E(|g(X)|) = \sum_i |g(x_i)|\, p(x_i) < \infty$.
Absolute convergence implies conditional convergence, but the converse is not true; under absolute convergence, rearrangement of the terms does not change the limit.
If X is absolutely continuous:
$E(g(X)) = \int g(x)\, f(x)\, dx$ (in the Riemann sense), provided $E(|g(X)|) = \int |g(x)|\, f(x)\, dx < \infty$.
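A standard example (added here for illustration) of why absolute convergence is required: let X take the values $(-1)^k k$ with $\Pr(X = (-1)^k k) = \frac{6}{\pi^2 k^2}$, $k = 1, 2, \dots$ Then
$\sum_k (-1)^k k \cdot \frac{6}{\pi^2 k^2} = \frac{6}{\pi^2} \sum_k \frac{(-1)^k}{k}$
converges conditionally, but $E(|X|) = \frac{6}{\pi^2} \sum_k \frac{1}{k} = \infty$, so E(X) does not exist: by the Riemann rearrangement theorem, rearranging the terms can produce any limit whatsoever.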
Binomial Distribution
• Mathematically a very simple model
• An effective model for several practical situations
• Computationally it is very troublesome:
> choose(100,50)
[1] 1.008913e+29
> choose(10000,5000)
[1] Inf
• A binomial table of a thousand pages is not quite sufficient to cover all cases.
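This is where the normal approximation (developed on the next slides) rescues us; a minimal R sketch, with n, p and k chosen arbitrarily for illustration:

n <- 1000; p <- 0.5; k <- 530
pbinom(k, size = n, prob = p)                        # exact binomial cdf
pnorm(k + 0.5, mean = n*p, sd = sqrt(n*p*(1 - p)))   # normal approx. with continuity correction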
Stirling’s Approximation (1730)
The formula as typically used in applications is
Very hard to prove
The next term in the O(log(n)) is 1⁄2ln(2πn);
a more precise variant of the formula is therefore
often written
48
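A quick R check of the approximation (a sketch added for illustration), using base R's lfactorial() for the exact value of log(n!):

n <- c(10, 100, 1000)
stirling <- 0.5*log(2*pi*n) + n*log(n) - n     # log of sqrt(2*pi*n)*(n/e)^n
exact <- lfactorial(n)                         # log(n!), computed stably
cbind(n, exact, stirling, error = exact - stirling)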
Journey from Binomial to Normal
Abraham de Moivre (1667 - 1754)
Johann Carl Friedrich Gauss (1777 - 1855)
To know this heroic journey, read the attached file and - - -
Computational Advantages of Normal Distribution
• A one-page table is enough for almost all applications.
• Using the wonderful properties of power series with infinite radius of convergence, we can approximate the cdf of the standard normal as closely as we want:
$\Phi(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \int_0^x e^{-\frac{1}{2}t^2}\, dt = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \left( x - \frac{x^3}{3 \cdot 2} + \frac{x^5}{5 \cdot 2!\, 2^2} - \frac{x^7}{7 \cdot 3!\, 2^3} + \cdots \right)$
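The series is easy to use in practice; a minimal R sketch (the function name and the number of terms are choices made here for illustration), checked against R's built-in pnorm():

Phi_series <- function(x, terms = 20) {
  k <- 0:(terms - 1)
  s <- sum((-1)^k * x^(2*k + 1) / ((2*k + 1) * factorial(k) * 2^k))
  0.5 + s / sqrt(2 * pi)                     # 1/2 + series / sqrt(2*pi)
}
Phi_series(1.96)   # approx 0.975
pnorm(1.96)        # built-in cdf, for comparison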
Meaning of Measure
A measure is countably additive:
$\mu\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} \mu(A_n)$
whenever the $A_n$ are pairwise disjoint.
Does rearrangement beget any problem? No: the terms $\mu(A_n)$ are nonnegative, so the series converges absolutely and every rearrangement gives the same sum.
Analytic Concepts versus Probability Concepts
• Bounded (big "Oh") ↔ stochastically bounded
• Little "oh", or convergence in measure ↔ convergence in probability
• Pointwise or a.e. convergence ↔ 1. convergence in law/distribution (weak convergence), 2. a.s. convergence, 3. rth-mean convergence
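For reference (stated here in standard form, not taken from the slides): a sequence $X_n$ is stochastically bounded, written $X_n = O_p(1)$, if for every $\varepsilon > 0$ there exist $M$ and $N$ such that $\Pr(|X_n| > M) < \varepsilon$ for all $n \ge N$; and $X_n = o_p(1)$ means that $X_n \to 0$ in probability.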
Some Definitions
Example
Definitions of Convergence in Law and Convergence in Probability
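In standard form (added for reference): $X_n \xrightarrow{d} X$ (convergence in law/distribution) iff $F_{X_n}(x) \to F_X(x)$ at every continuity point $x$ of $F_X$; and $X_n \xrightarrow{P} X$ (convergence in probability) iff $\Pr(|X_n - X| > \varepsilon) \to 0$ for every $\varepsilon > 0$.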
Definitions of Almost Sure Convergence and Convergence in Lp Norm
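In standard form (added for reference): $X_n \xrightarrow{a.s.} X$ iff $\Pr(\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)) = 1$; and $X_n \xrightarrow{L_p} X$ (for $p \ge 1$) iff $E|X_n - X|^p \to 0$.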
Relation between Various Modes of Convergence
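The standard implication chain (added for reference):
$X_n \xrightarrow{a.s.} X \;\Rightarrow\; X_n \xrightarrow{P} X \;\Rightarrow\; X_n \xrightarrow{d} X$, and $X_n \xrightarrow{L_p} X \;\Rightarrow\; X_n \xrightarrow{P} X$.
None of the reverse implications holds in general, except that convergence in distribution to a constant implies convergence in probability to that constant.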
Equivalent Definitions of Convergence in Distribution
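One such equivalence, the portmanteau theorem (added for reference): $X_n \xrightarrow{d} X$ iff $E\,f(X_n) \to E\,f(X)$ for every bounded continuous function $f$.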
Continuous Mapping Theorem and Slutsky Theorem
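In standard form (added for reference): if $X_n \xrightarrow{d} X$ and $g$ is continuous (a.s. with respect to the distribution of $X$), then $g(X_n) \xrightarrow{d} g(X)$; the same holds for convergence in probability and almost sure convergence. Slutsky's theorem: if $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ for a constant $c$, then $X_n + Y_n \xrightarrow{d} X + c$ and $Y_n X_n \xrightarrow{d} cX$.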
Classical Delta Theorem
Scheffe Theorem
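In standard form (added for reference). Delta theorem: if $\sqrt{n}\,(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \ne 0$, then $\sqrt{n}\,(g(X_n) - g(\theta)) \xrightarrow{d} N(0, [g'(\theta)]^2 \sigma^2)$. Scheffe theorem: if densities $f_n \to f$ pointwise (a.e.), then the corresponding distributions converge in total variation, and hence in law.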
Thanks