Moments and descriptive statistics

Lecture 9
Moments of distributions
Body size distribution of European Collembola
Species                                                       Body weight [mg]
Tetrodontophora bielanensis (Waga 1842)                       13.471729
Orchesella chiantica Frati & Szeptycki 1990                   13.471729
Disparrhopalites tergestinus Fanciulli, Colla, Dallai 2005    12.924837
Orchesella dallaii Frati & Szeptycki 1990                     9.4503028
Seira pini Jordana & Arbea 1989                               9.4503028
Isotomurus pentodon (Kos, 1937)                               7.1044808
Heteromurus (V.) longicornis (Absolon 1900)                   7.1044808
Pogonognathellus flavescens (Tullberg 1871)                   6.9512714
Orchesella hoffmanni Stomp 1968                               6.9512714
Heteromurus (H) constantinellus Lučić, Ćurčić & Mitić 2007    6.3862223
Pogonognathellus longicornis (Müller 1776)                    6.2133935
Orchesella devergens Handschin 1924                           6.2133935
Orchesella flavescens (Bourlet 1839)                          6.2133935
Orchesella quinquefasciata (Bourlet 1841)                     6.2133935
Figure: histogram of the number of Collembola species (y-axis, 0 to 500) per ln body weight class (x-axis, -4.72 to 2.25), with the mode (modus) marked.

ln body weight [mg]   ln weight class means   Number of species
2.6006                -4.71511                7
2.6006                -4.018377               53
2.5592                -3.321643               133
2.246                 -2.624909               224
2.246                 -1.928176               353
1.9607                -1.231442               395
1.9607                -0.534708               325
1.9389                0.162025                126
1.9389                0.858759                45
1.8541                1.555493                24
1.8267                2.252226                9
1.8267
1.8267
1.8267
The histogram of raw data
Three Collembolan weight classes

Class 1: N = 25, mean = 1.8169079
2.6005933, 2.5591508, 2.2460468, 2.2460468, 1.9607257, 1.9607257, 1.9389246, 1.9389246, 1.8541429,
1.8267072, 1.8267072, 1.8267072, 1.8267072, 1.8267072, 1.584378, 1.584378, 1.584378, 1.584378,
1.584378, 1.584378, 1.5326904, 1.5326904, 1.5064044, 1.4529137, 1.4529137

Class 2: N = 31, mean = 1.032923
1.313477, 1.313477, 1.313477, 1.313477, 1.313477, 1.301948, 1.225568, 1.165038, 1.165038, 1.165038,
1.165038, 1.006355, 1.006355, 1.006355, 1.006355, 1.006355, 1.006355, 1.006355, 1.006355, 1.006355,
1.006355, 0.939683, 0.871022, 0.871022, 0.835906, 0.835906, 0.800247, 0.800247, 0.764026, 0.756712,
0.727225

Class 3: N = 43, mean = 0.531059
0.651808, 0.651808, 0.651808, 0.651808, 0.651808, 0.651808, 0.651808, 0.651808, 0.651808, 0.651808,
0.651808, 0.651808, 0.651808, 0.651808, 0.651808, 0.651808, 0.651808, 0.613152, 0.573835, 0.573835,
0.533834, 0.493125, 0.493125, 0.493125, 0.493125, 0.493125, 0.489014, 0.451682, 0.451682, 0.451682,
0.451682, 0.409479
What is the average body weight?
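Population mean
$$\mu = \frac{\sum_{i=1}^{n} x_i}{n}$$
Sample mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Weighted mean
$$\bar{x} = \frac{25}{99}\cdot 1.817 + \frac{31}{99}\cdot 1.033 + \frac{43}{99}\cdot 0.531 = 1.013$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i n_i = \sum_{i=1}^{k} x_i \frac{n_i}{n} = \sum_{i=1}^{k} x_i f(i)$$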
Discrete distributions

Figure: relative frequency of Collembola species (y-axis, 0 to 0.25) per ln body weight class (x-axis, -4.72 to 2.25).

Frequency
$$f(x_i) = \frac{n_i}{n}$$
Weighted mean
$$\bar{x} = \sum_{i=1}^{k} \frac{n_i x_i}{n} = \sum_{i=1}^{k} x_i f(x_i)$$
Spreadsheet columns: A = ln body weight [mg] class means, B = number of species, C = frequency, D = arithmetic mean terms x f(x), E = variance terms (x - mean)^2 f(x). The first data row shows the spreadsheet formulas.

A (class mean)   B (species)   C (frequency)   D (x f(x))   E ((x - mean)^2 f(x))
-4.72            7             =B2/B14         =A2*C2       =(A2-D14)^2*C2
-4.02            53            0.031286895     -0.125723    0.202268085
-3.32            133           0.078512397     -0.26079     0.267516588
-2.62            224           0.132231405     -0.347095    0.174619987
-1.93            353           0.208382527     -0.401798    0.042653444
-1.23            395           0.233175915     -0.287143    0.013917567
-0.53            325           0.191853601     -0.102586    0.169898317
0.16             126           0.074380165     0.0120514    0.199510727
0.86             45            0.026564345     0.0228124    0.144774029
1.56             24            0.014167651     0.0220377    0.130178627
2.25             9             0.005312869     0.0119658    0.073837264
Sum              1694                          -1.475751

Arithmetic mean: -1.475751
Variance: 1.462535979
StDev: 1.209353538
The average European springtail has a body weight of $e^{-1.476} = 0.23$ mg.
The most frequently encountered weight is around $e^{-1.23} = 0.29$ mg.
Continuous distributions
$$\mu = \int_{min}^{max} x f(x)\,dx$$
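The mean, variance and mode of the discrete body weight distribution can be recomputed directly from the class means and species counts. The following is a minimal sketch; the numbers are the ln body weight class means and counts listed in the slides, and the variable names are illustrative.

```python
import math

# ln body weight class means and number of species per class (from the slides).
class_means = [-4.71511, -4.018377, -3.321643, -2.624909, -1.928176,
               -1.231442, -0.534708, 0.162025, 0.858759, 1.555493, 2.252226]
counts = [7, 53, 133, 224, 353, 395, 325, 126, 45, 24, 9]

n = sum(counts)                                   # total number of species
freq = [c / n for c in counts]                    # f(x_i) = n_i / n
mean = sum(x * f for x, f in zip(class_means, freq))
variance = sum((x - mean) ** 2 * f for x, f in zip(class_means, freq))
std_dev = math.sqrt(variance)
mode = class_means[counts.index(max(counts))]     # class with the most species

print(n, round(mean, 3), round(variance, 3), round(std_dev, 3), mode)
```

Back-transforming with `math.exp(mean)` gives the average body weight in mg, and `math.exp(mode)` the most frequent one.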
Why did we use log transformed values?
Species                                                       Average body length [mm]   Body weight [mg]
Tetrodontophora bielanensis (Waga 1842)                       7                          13.472
Orchesella chiantica Frati & Szeptycki 1990                   7                          13.472
Disparrhopalites tergestinus Fanciulli, Colla, Dallai 2005    6.875                      12.925
Orchesella dallaii Frati & Szeptycki 1990                     6                          9.4503
Seira pini Jordana & Arbea 1989                               6                          9.4503
Isotomurus pentodon (Kos, 1937)                               5.3                        7.1045
Heteromurus (V.) longicornis (Absolon 1900)                   5.3                        7.1045
Pogonognathellus flavescens (Tullberg 1871)                   5.25                       6.9513
Orchesella hoffmanni Stomp 1968                               5.25                       6.9513
Heteromurus (H) constantinellus Lučić, Ćurčić & Mitić 2007    5.06                       6.3862
Pogonognathellus longicornis (Müller 1776)                    5                          6.2134
Orchesella devergens Handschin 1924                           5                          6.2134
Orchesella flavescens (Bourlet 1839)                          5                          6.2134
Orchesella quinquefasciata (Bourlet 1841)                     5                          6.2134

Body weight is calculated from body length:
$$W\,[mg] = e^{-1.875}\,[W/L]\; L\,[mm]^{2.3}$$
Spreadsheet formula (JEŻELI is the Polish-locale name of IF): =JEŻELI(B86=0;0;EXP(-1.875+LN(B86)*2.3))

Figure (log transformed data): number of Collembola species (y-axis, 0 to 500) per ln body weight class (x-axis, -6.00 to 4.00).
Figure (linear data): number of Collembola species (y-axis, 0 to 500) per body weight class (x-axis, 0 to 10). The distribution is skewed.
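The length to weight conversion can be written as a small Python function. This is a sketch that mirrors the spreadsheet formula with the constants -1.875 and 2.3 from the slide; the function name `body_weight_mg` and its parameter names are assumptions made for illustration.

```python
import math

def body_weight_mg(length_mm, ln_w0=-1.875, z=2.3):
    """Estimate body weight [mg] from body length [mm] via ln W = ln W0 + z ln L."""
    if length_mm == 0:
        # Mirrors the spreadsheet guard that returns 0 for empty cells.
        return 0.0
    return math.exp(ln_w0 + z * math.log(length_mm))

# Lengths of 7 mm and 5 mm as they appear in the species table.
for length in (7, 5):
    print(length, round(body_weight_mg(length), 4))
```

Working on the log scale turns the power function into a straight line, which is why the constants are given as ln W0 and the exponent z.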
$$W = W_0 L^z$$
$$\ln W = \ln W_0 + z \ln L$$

Figure (linear data): number of Collembola species (y-axis, 0 to 500) per body weight class (x-axis, 0 to 10).

Geometric mean
$$\bar{x}_g = \sqrt[n]{\prod_{i=1}^{n} x_i} = e^{\frac{1}{n}\sum_{i=1}^{n} \ln x_i}$$

lb (binary logarithm) scaled weight classes:

Body weight [mg]   Number of   Frequency      Arithmetic mean   Geometric mean
class means        species                    x f(x)            ln(x) f(x)
0.01               7           0.004132231    3.702E-05         -0.019483926
0.02               53          0.031286895    0.0005626         -0.125722539
0.04               133         0.078512397    0.0028338         -0.260790153
0.07               224         0.132231405    0.0095797         -0.347095405
0.15               353         0.208382527    0.0303016         -0.401798187
0.29               395         0.233175915    0.0680574         -0.287142615
0.59               325         0.191853601    0.1123956         -0.102585655
1.18               126         0.074380165    0.0874629         0.012051446
2.36               45          0.026564345    0.062698          0.02281237
4.74               24          0.014167651    0.0671181         0.022037681
9.51               9           0.005312869    0.0505194         0.011965782
Sum                1694                       0.491566          -1.4757512
Exp()                                                           0.228606933

The average European springtail has a body weight of $e^{-1.476} = 0.23$ mg.
Geometric mean
In the case of exponentially distributed data we have to use the geometric mean.
To make things easier we first log-transform our data.
How to use geometric means
$$\bar{x}_g = \sqrt[n]{\prod_{i=1}^{n} x_i} = e^{\frac{1}{n}\sum_{i=1}^{n} \ln x_i}$$
A tropical forest is logged over three years: 0.1% of the area in the first year, 1% in the second year and 10% in the third year.
Hence about 11% of the area has been logged during the three years.
What is the mean logging rate per year?
$$A_3 = (1 - 0.001)(1 - 0.01)(1 - 0.1)A_0 = 0.890A_0$$
Arithmetic mean
$$\frac{0.999 + 0.99 + 0.9}{3} = 0.963$$
$$A_3 = 0.963^3 A_0 = 0.893A_0$$
Geometric mean
$$(0.999 \cdot 0.99 \cdot 0.9)^{1/3} = 0.962$$
$$A_3 = 0.962^3 A_0 = 0.890A_0$$
In multiplicative processes we should use the geometric mean.
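The logging example can be reproduced numerically. The sketch below uses the three yearly logging rates from the slide and compares what the arithmetic and the geometric mean predict for the remaining forest area; the variable names are illustrative.

```python
import math

# Fraction of the forest left after each year (1 minus the logged fraction).
remaining = [1 - 0.001, 1 - 0.01, 1 - 0.1]

actual = math.prod(remaining)                    # A3 / A0 after three years
arithmetic = sum(remaining) / len(remaining)
geometric = math.exp(sum(math.log(r) for r in remaining) / len(remaining))

print(round(actual, 3))            # true remaining fraction
print(round(arithmetic ** 3, 3))   # prediction from the arithmetic mean
print(round(geometric ** 3, 3))    # prediction from the geometric mean
```

Only the geometric mean, applied three times, returns the true remaining area, because yearly survival fractions multiply rather than add.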
Frequency table of ln body weight classes (as above, 1694 species): arithmetic mean -1.475751, variance 1.462535979, standard deviation 1.209353538.
Variance

Population variance
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
Sample variance, with n - 1 degrees of freedom
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Discrete distributions, with $f(x_i) = n_i / n$
$$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 f(x_i)$$
Continuous distributions
$$s^2 = \int_{min}^{max} (x - \bar{x})^2 f(x)\,dx$$
Standard deviation
$$s = \sqrt{s^2}$$

Figure: relative frequency of Collembola species (y-axis, 0 to 0.25) per ln body weight class (x-axis, -4.72 to 2.25), with the mean and ± 1 SD marked.

The standard deviation is a measure of the width of the statistical distribution that has the same dimension as the mean.
The standard deviation as a measure of errors

Environmental pollution
Station   NOx [ppm]
1         8.49
2         1.12
3         9.11
4         7.75
5         0.75
6         8.23
7         0.97
8         6.06
9         8.48
10        5.88
11        8.51
12        9.62
13        3.35
14        7.74
15        2.03
16        5.06
17        7.61
18        0.99
19        2.55
20        8.91

Mean: 5.66
Variance: 10.45
Standard deviation: 3.23

The precision of derived metrics should always match the precision of the raw data.

Distance [km]   Average NOx concentration   Standard deviation
1               9.53                        1.70
2               7.37                        1.18
3               5.24                        0.86
4               3.15                        0.26
5               2.17                        0.18
6               1.05                        0.09
7               0.84                        0.14
8               0.63                        0.10
9               0.32                        0.03
10              0.21                        0.02

± 1 standard deviation is the most often used estimator of error.
For approximately normally distributed errors, the probability that the true mean is within ± 1 standard deviation is approximately 68%.
The probability that the true mean is within ± 2 standard deviations is approximately 95%.

Figure: NOx concentration (y-axis, 0 to 14) against distance [km] (x-axis, 1 to 10), with error bars of ± 1 standard deviation.
Standard deviation and standard error
Mean and standard deviation (values as listed on the slide): 5.44, 4.15, 4.49, 5.29, 5.55, 3.39, 5.56, 3.13

The standard deviation is constant irrespective of sample size.
The precision of the estimate of the mean should increase with sample size n.
The standard error is a measure of precision.
$$SE = \frac{SD}{\sqrt{n}}$$

Distance [km]   Average NOx concentration   Standard deviation   Standard error (n = 20)
1               9.53                        3.32                 0.74
2               7.37                        2.45                 0.55
3               5.24                        1.24                 0.28
4               3.15                        0.67                 0.15
5               2.17                        0.87                 0.19
6               1.05                        0.34                 0.08
7               0.84                        0.14                 0.03
8               0.63                        0.10                 0.02
9               0.32                        0.03                 0.01
10              0.21                        0.02                 0.01

Figure: NOx concentration (y-axis, 0 to 12) against distance [km] (x-axis, 1 to 10), with error bars.
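As an illustration of the difference between standard deviation and standard error, the following sketch computes both for the 20 NOx station readings listed in the environmental pollution table. It assumes the sample formulas (n - 1 in the denominator), which is what `statistics.stdev` implements.

```python
import math
import statistics

# NOx readings [ppm] of the 20 stations (from the environmental pollution table).
nox_ppm = [8.49, 1.12, 9.11, 7.75, 0.75, 8.23, 0.97, 6.06, 8.48, 5.88,
           8.51, 9.62, 3.35, 7.74, 2.03, 5.06, 7.61, 0.99, 2.55, 8.91]

mean = statistics.mean(nox_ppm)
sd = statistics.stdev(nox_ppm)            # sample standard deviation (n - 1)
se = sd / math.sqrt(len(nox_ppm))         # standard error of the mean

print(round(mean, 2), round(sd, 2), round(se, 2))
```

The standard deviation describes the spread of the single readings, while the standard error shrinks with the square root of the number of stations and describes how precisely the mean is known.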
Central moments
$$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 f(x_i) = \sum_{i=1}^{n} x_i^2 f(x_i) - 2\sum_{i=1}^{n} x_i \bar{x} f(x_i) + \sum_{i=1}^{n} \bar{x}^2 f(x_i)$$
$$s^2 = \sum_{i=1}^{n} x_i^2 f(x_i) - 2\bar{x}\bar{x} + \bar{x}^2 \cdot 1 = \sum_{i=1}^{n} x_i^2 f(x_i) - \bar{x}^2$$
For a sample with n - 1 degrees of freedom the same decomposition reads
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2}{n - 1} - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$$
where the first term corresponds to $E(x^2)$ and the second to $[E(x)]^2$.

The variance is the difference between the mean of the squared values and the squared mean:
$$\sigma^2 = E(x^2) - [E(x)]^2$$

Mathematical expectation (first moment, the measure of central tendency)
$$\mu = E(X)$$
k-th moment
$$E(X^k) = \sum_{i=1}^{n} x_i^k f(x_i)$$
$$E(X^k) = \int x^k f(x)\,dx$$
Second central moment
$$\sigma^2 = \sum_{i=1}^{n} (X_i - \mu)^2 f(X_i) = E((X - \mu)^2)$$
Frequency distributions of resource use or wealth in a population can be described by a
power law (the famous Pareto-Zipf law) with exponents that often have values around
-5/2. What are the mean and the variance of such a power function distribution?
Discrete distribution
$$f(x) = a x^{-z}, \qquad \sum_{min}^{max} f(x) = 1, \qquad z = 2.5$$

x      f(x) = x^(-5/2)   f(x)/sum    x f(x)      (x - m)^2 f(x)
1      1                 0.756475    0.756475    0.196
2      0.176777          0.133727    0.267454    0.032
3      0.06415           0.048528    0.145584    0.108
4      0.03125           0.02364     0.094559    0.147
5      0.017889          0.013532    0.067661    0.165
6      0.01134           0.008579    0.051472    0.173
7      0.007714          0.005835    0.040846    0.176
8      0.005524          0.004179    0.033432    0.176
9      0.004115          0.003113    0.028018    0.175
10     0.003162          0.002392    0.023922    0.172
Sum    1.32192           1           1.50942     1.520

Mean: 1.50942
Variance: 1.520
StDev: 1.233

$$f(x) = \frac{x^{-5/2}}{1.32}$$

Figure: frequency (y-axis, log scale, 0.001 to 1) against wealth class (x-axis, 1 to 10).

Most people are in the lowest income class and the average is halfway between the first and the second class.
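The moments of the discrete power function distribution follow directly from the definitions. The following sketch normalises x^(-5/2) over the ten wealth classes and computes mean and variance as sums of x f(x) and (x - mean)^2 f(x); the variable names are illustrative.

```python
import math

z = 2.5
classes = list(range(1, 11))              # wealth classes 1 to 10

raw = [x ** -z for x in classes]          # unnormalised f(x) = x^(-z)
total = sum(raw)                          # normalisation constant
f = [r / total for r in raw]              # frequencies that sum to one

mean = sum(x * p for x, p in zip(classes, f))
variance = sum((x - mean) ** 2 * p for x, p in zip(classes, f))
std_dev = math.sqrt(variance)

print(round(total, 5), round(mean, 5), round(variance, 3), round(std_dev, 3))
```

Note that each variance term weights the squared deviation of the class value x from the mean by the frequency of that class.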
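Continuous approximation

Figure: frequency (y-axis, log scale, 0.001 to 1) against wealth class (x-axis, 0 to 10). Note that the y-axis is at log scale. An upper bound of ten would only cover half of the last column, so the integration runs from 0.5 to 10.5.

$$\int_{0.5}^{10.5} a x^{-5/2}\,dx = \left[-\frac{2a}{3} x^{-3/2}\right]_{0.5}^{10.5} = \frac{2a}{3}\left(0.5^{-3/2} - 10.5^{-3/2}\right) \approx 1.87a$$
$$1.87a = 1 \;\Rightarrow\; a = \frac{1}{1.87} \approx 0.54$$
The estimate of a is imprecise: the discrete distribution gives $a = 1/1.32 \approx 0.76$.

Mean with the continuous estimate of a:
$$\bar{x} = \frac{1}{1.87}\int_{0.5}^{10.5} x\,x^{-5/2}\,dx = 0.54\left[-2x^{-1/2}\right]_{0.5}^{10.5} = 1.07\left(0.5^{-1/2} - 10.5^{-1/2}\right) \approx 1.18$$
Mean with the discrete estimate of a:
$$\bar{x} = \frac{1}{1.32}\int_{0.5}^{10.5} x\,x^{-5/2}\,dx = 0.757\left[-2x^{-1/2}\right]_{0.5}^{10.5} = 1.51\left(0.5^{-1/2} - 10.5^{-1/2}\right) \approx 1.67$$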
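The continuous approximation can be evaluated in closed form. The sketch below assumes the integration limits 0.5 and 10.5 from the slide and uses the antiderivatives -(2/3) x^(-3/2) and -2 x^(-1/2); the function names are illustrative.

```python
def power_law_norm(lower, upper):
    """Return a such that a * x**-2.5 integrates to one on [lower, upper]."""
    integral = (2.0 / 3.0) * (lower ** -1.5 - upper ** -1.5)
    return 1.0 / integral

def power_law_mean(a, lower, upper):
    """Integral of x * a * x**-2.5 over [lower, upper]."""
    return 2.0 * a * (lower ** -0.5 - upper ** -0.5)

a_continuous = power_law_norm(0.5, 10.5)
a_discrete = 1.0 / 1.32192                # normalisation from the discrete sum

print(round(a_continuous, 3), round(power_law_mean(a_continuous, 0.5, 10.5), 3))
print(round(a_discrete, 3), round(power_law_mean(a_discrete, 0.5, 10.5), 3))
```

The two normalisation constants differ noticeably, which illustrates how sensitive the continuous approximation is to the choice of a.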
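The Arrhenius probability model assumes the same probability of an event irrespective of the time that has elapsed since the start.
What are the mean and the variance of such a distribution?

$$f(t) = a e^{-\lambda t}, \qquad \int_{min}^{max} f(x)\,dx = 1$$
$$a\int_0^\infty e^{-\lambda t}\,dt = \left[-\frac{a}{\lambda} e^{-\lambda t}\right]_0^\infty = \frac{a}{\lambda} = 1 \;\Rightarrow\; a = \lambda$$
Cumulative density function over the whole range
$$\lambda\int_0^\infty e^{-\lambda t}\,dt = 1$$
Mean
$$\mu = \lambda\int_0^\infty t e^{-\lambda t}\,dt = \left[-\frac{e^{-\lambda t}}{\lambda}(\lambda t + 1)\right]_0^\infty = \frac{1}{\lambda}$$
Variance
$$E[x^2] = \lambda\int_0^\infty t^2 e^{-\lambda t}\,dt = \left[-\frac{e^{-\lambda t}}{\lambda^2}(\lambda^2 t^2 + 2\lambda t + 2)\right]_0^\infty = \frac{2}{\lambda^2}$$
$$\sigma^2 = \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 = \frac{1}{\lambda^2}$$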
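The analytic mean and variance of the exponential waiting time distribution can be cross-checked by numerical integration. This is a minimal sketch using the midpoint rule; the rate 0.5, the cut-off at 50 mean waiting times and the function name `exponential_moment` are assumptions chosen for illustration.

```python
import math

def exponential_moment(lam, k, steps=200_000):
    """Approximate E[t**k] for f(t) = lam * exp(-lam * t) with the midpoint rule."""
    upper = 50.0 / lam                    # the tail beyond this point is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t ** k * lam * math.exp(-lam * t)
    return total * h

lam = 0.5                                 # hypothetical event rate
mean = exponential_moment(lam, 1)
variance = exponential_moment(lam, 2) - mean ** 2

print(round(mean, 4), round(variance, 4))
```

For this rate the analytic values are a mean of 1/lam = 2 and a variance of 1/lam^2 = 4, so the numerical result should agree to several decimals.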

Third central moment
$$E((X - \mu)^3) = E(X^3) - 3\mu E(X^2) + 3\mu^2 E(X) - \mu^3 = E(X^3) - 3\mu E(X^2) + 2\mu^3$$
Skewness
$$\text{skewness} = \frac{E((X - \mu)^3)}{\sigma^3}$$
Figure: three panels of f(x) against x: left skewed distribution (skewness < 0), symmetric distribution (skewness = 0), right skewed distribution (skewness > 0).

Kurtosis
$$\text{kurtosis} = \frac{E((X - \mu)^4)}{\sigma^4} - 3$$
Figure: panels of f(x) against x for kurtosis = 0 and kurtosis > 0.
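Skewness and kurtosis can be computed from the third and fourth central moments of a discrete distribution. The sketch below follows the definitions on the slide (including the subtraction of 3 for kurtosis); the function names and the two toy distributions are hypothetical and not taken from the lecture data.

```python
import math

def central_moment(values, freqs, k):
    """k-th central moment of a discrete distribution with relative frequencies."""
    mu = sum(x * f for x, f in zip(values, freqs))
    return sum((x - mu) ** k * f for x, f in zip(values, freqs))

def skewness(values, freqs):
    """Third central moment divided by sigma cubed."""
    sigma = math.sqrt(central_moment(values, freqs, 2))
    return central_moment(values, freqs, 3) / sigma ** 3

def kurtosis(values, freqs):
    """Fourth central moment divided by sigma to the fourth power, minus 3."""
    sigma2 = central_moment(values, freqs, 2)
    return central_moment(values, freqs, 4) / sigma2 ** 2 - 3

# Hypothetical toy distributions for illustration.
symmetric = ([-1, 0, 1], [0.25, 0.5, 0.25])
right_skewed = ([0, 1, 2, 3], [0.5, 0.3, 0.15, 0.05])

print(skewness(*symmetric), kurtosis(*symmetric))
print(skewness(*right_skewed))
```

A symmetric distribution has zero skewness, while a long tail towards large values yields a positive skewness.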
How to get the modus?
$$y = x e^{-x}$$
Figure: f(x) (y-axis, 0 to 1) against x (x-axis, 0 to 8), with mode and mean marked.

We need the maximum of the pdf.
The function is a probability distribution if
$$\int_0^\infty x e^{-x}\,dx = \left[-e^{-x}(x + 1)\right]_0^\infty = 1$$
Mode
$$\frac{d(x e^{-x})}{dx} = -x e^{-x} + e^{-x} = 0 \;\Rightarrow\; x = 1$$
Arithmetic mean
$$E(x) = \int_0^\infty x \cdot x e^{-x}\,dx = \int_0^\infty x^2 e^{-x}\,dx = \left[-e^{-x}(x^2 + 2x + 2)\right]_0^\infty = 2$$
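Mode and mean of y = x e^(-x) can also be located numerically, which is useful when the derivative or the integral has no simple closed form. The sketch below evaluates the pdf on a fine grid; the step size 0.001 and the upper limit 40 are assumptions chosen so that the neglected tail is negligible.

```python
import math

def pdf(x):
    """Probability density y = x * exp(-x) for x >= 0."""
    return x * math.exp(-x)

step = 0.001
grid = [i * step for i in range(1, 40001)]       # x from 0.001 to 40

mode = max(grid, key=pdf)                        # grid point with the highest density
area = sum(pdf(x) for x in grid) * step          # should be close to one
mean = sum(x * pdf(x) for x in grid) * step      # first moment

print(round(mode, 3), round(area, 4), round(mean, 4))
```

The mode lies to the left of the mean, as expected for a right skewed distribution.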
Body volumes are estimated from measures of height × length × width. Assume you estimated the thorax volume of insects and used this volume to infer the body weight.
$$V = c \cdot Length \cdot Height \cdot Width$$
$$W\,[mg] = a \cdot (Length \cdot Height \cdot Width)^z$$
How to get the parameters a and z?
Body weights are estimated from a regression of species dry weights against thorax volume:
$$y = 1.7754\,x^{0.6072}$$
Figure: dry weight (y-axis, 0 to 3) against thorax volume (x-axis, 0 to 2) with the fitted power function.
The body weight of a new species is estimated from the regression function.

Height, length and width could be measured with an accuracy of ± 2%.
Standard deviation is a measure of accuracy (error).
For independent measurements:
$$\sigma_{total}^2 = \sum_{i=1}^{n} \sigma_i^2; \qquad \sigma_{total} = \sqrt{\sum_{i=1}^{n} \sigma_i^2}$$
$$\sigma_{total}^2 = 0.02^2 + 0.02^2 + 0.02^2 = 0.0012$$
$$\sigma_{total} = \sqrt{0.0012} = 0.035$$
The error of the thorax volume estimate is 3.5%.
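The error propagation for independent measurements amounts to adding the squared errors and taking the square root. A minimal sketch, with the function name `combined_error` chosen for illustration and the three 2% errors taken from the slide:

```python
import math

def combined_error(errors):
    """Total error of independent measurements: square root of the summed squares."""
    return math.sqrt(sum(e ** 2 for e in errors))

# Relative errors of height, length and width (each 2%).
thorax_error = combined_error([0.02, 0.02, 0.02])
print(round(thorax_error, 3))
```

The combined error is smaller than the plain sum of the three errors (6%) because independent errors partly cancel.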
Homework and literature
Refresh:
• Arithmetic, geometric, harmonic mean
• Cauchy inequality
• Statistical distribution
• Probability distribution
• Moments of distributions
• Error law of Gauß
• Bootstrap
Prepare for the next lecture:
• Binomial distribution
• Mean and variance of the binomial distribution
• Poisson distribution
• Mean and variance of the Poisson distribution
• Moments of distributions
• DNA mutations
• Transition matrix
Literature:
Łomnicki: Statystyka dla biologów
Binomial distribution:
http://www.stat.yale.edu/Courses/199798/101/binom.htm
Poisson distribution:
http://en.wikipedia.org/wiki/Poisson_distribution