First steps in statistics

Lecture 8
Motivation
What is interesting?
Why is it interesting?
Cui bono?
Envisioned method of analysis
Theory
Experimental or observational study
Data analysis
How to perform a biological study
Literature
Theory
Planning: defining the problem; identifying the state of the art; formulating specific hypotheses to be tested; study design, power analysis, choosing the analytical methods, design of the database
Data: observations, experiments, meta-analysis
Analysis: statistical analysis, modelling
Interpretation: comparing with current theory
Publication: scientific writing, expertise
Preparing the experimental or data collecting phase
Let us look more closely at data collection. Before you start collecting any data, you need a clear vision of what you want to do with them. Hence you have to answer some important questions:
• For what purpose do I collect data?
• Did I read the relevant literature?
• Have similar data already been collected by others?
• Is the experimental or observational design appropriate for the statistical tests I want to apply?
• Are the data representative?
• How many data do I need for the statistical tests I want to apply?
• Does the data structure fit the hypothesis I want to test?
• Can I compare my data and results with other work?
• How large are the measurement errors?
• Do these errors prevent clear final results?
• How large may the errors be for the data to still be meaningful?
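The question of how many data are needed is usually answered with a power analysis. The following is a minimal sketch, not a prescribed method: it assumes a two-sided comparison of two group means with equal group sizes, a common known standard deviation, and the normal approximation. The function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a mean difference `delta`.

    Assumes two groups of equal size, a common standard deviation `sigma`,
    a two-sided test, and the normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


# Hypothetical numbers: detect a difference of 0.5 units when sigma = 1.0.
n = sample_size_per_group(delta=0.5, sigma=1.0)
print(n)
```

Smaller expected differences or larger measurement errors raise the required sample size quickly, which is why this estimate belongs in the planning phase.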
How to lie with statistics
[Figure: pie chart with the shares PO 33%, Unknown 28%, PIS 19%, Samoobrona 10%, LiD 10%]
Representative sampling
[Figure: number of species per body length class [mm] (classes from 0.2 to 51.2 mm), shown with a linear and with a logarithmic species axis]
[Figure: histograms of events per class, drawn with different numbers of classes]
[Figure: mean density against body weight [mg], and against body weight class [mg], on logarithmic axes]
[Figure: Number of stork nests and birth rates in Switzerland; birth rate against numbers of storks]
[Figure: % Catholics against mean body height]
[Figure: several chart versions of Variable 1 and Variable 2 for the categories A to E, marked "Better" and "Worse"; one panel is titled "Influence of variable 1 on variable 2" and annotated y = f(x), R2 = n.s.]
Scientific publications of any type are classically divided into seven major parts:
• Title, affiliations and abstract
Give a short and meaningful title, which may already contain an essential result. The abstract is a short text containing the major hypotheses and results. It should make clear why the study was undertaken.
• Introduction
The introduction should briefly discuss the state of the art and the theories the study is based on, describe the motivation for the present study, and explain the hypotheses to be tested. Do not review the literature extensively, but discuss all of the literature needed to put the paper in a broader context. Explain who might be interested in the study and why it is worth reading!
• Materials and methods
A short description of the study area (if necessary), the experimental or observational techniques used for data collection, and the techniques of data analysis. Indicate the limits of the techniques used.
• Results
This section should describe the results of your study. Most tables and figures belong here. Do not present the same data in both tables and figures.
• Discussion
This should be the longest part of the paper. Discuss your results in the light of current theories and scientific belief. Compare them with the results of other comparable studies. Explain again why the study was undertaken and what is new. Also discuss possible problems with your data and possible misconceptions. Give hints for further work.
• Acknowledgments
Short acknowledgments that mention people who contributed material but are not co-authors, and the funding institutions.
• Literature
The source data base
Each row gets a single data record.
Columns contain variables.
Variables can be of text or metric type.

The table is shown in two parts; each country is one record across both parts (i = island, m = mainland).

Country                 Island/Mainland  Area [km2]  DeltaT [°C]  Lat    Long    Days below zero
Albania                 m                28748       17           41.33  19.92   34
Andorra                 m                468         14.7         42.5   1.5     60
Austria                 m                83871       20           48.12  14.57   92
Azores                  i                2200        7            37.73  -28.01  1
Balearic Islands        i                5014        15           39.55  2.65    18
Belarus                 m                207650      23           53.87  28      144
Belgium                 m                30528       15           50.9   4       50
Bosnia and Herzegovina  m                51197       20           43.82  18      114
Bulgaria                m                110971      21           42.65  25      102
Canary Islands          i                7270        5            27.93  -15.4   1
Corsica                 i                8680        13           41.92  8.73    11
Crete                   i                8259        13           35.33  24.83   1
Croatia                 m                56594       21           45.82  15.5    114
Czech Republic          m                78866       19           50.1   15.5    119
Denmark                 m                43093       16           55.63  12.57   85
Dodecanese Is.          i                2663        14           36.4   23.73   2
Estonia                 m                45227       21           59.35  26      143
Faroe Is.               i                1399        7            62     -7      30
Finland                 m                338145      23           60.32  25      169
France                  m                543965      15           48.73  2.3     50
Franz Josef Land        i                16134       27           79.85  57.42   310
Germany                 m                357021      19           52.38  13.42   97
Greece                  m                131992      17           37.9   23.73   2
Hungary                 m                93054       22           47.43  20      100

Body weight distribution (ln body weight), species numbers and sources (source entries are truncated in the original):

Country                 Min        Max        Mean        Variance  Skewness    Kurtosis    Species  Sources
Albania                 -4.28959   2.60059    -1.31798    1.87086   0.0616831   -0.210158   132      Thibaud, 1992; Thiba
Andorra                 -0.867014  1.58438    -0.0465939  1.22393   1.79255     3.39576     4        Deharveng, 2007
Austria                 -4.84426   2.60059    -1.26057    1.61122   -0.0019179  -0.0696599  486      Pomorski, 2006; Qu
Azores                  -4.13091   1.93892    -1.15658    1.62273   -0.045433   0.101475    94       Gama, 2005a,b
Balearic Islands        -3.98247   0.651808   -1.74797    1.19506   -0.0467805  -0.179688   42       Jordana et al., 2005;
Belarus                 -4.13091   1.82671    -1.02222    1.98246   0.0370745   -0.386657   48       Kuznetsova, 2002; D
Belgium                 -4.64414   1.93892    -1.14785    1.77014   -0.0967065  -0.216933   209      Janssens, 2008
Bosnia and Herzegovina  -4.84426   2.60059    -1.03804    1.38211   0.327301    1.01551     145      Bogojević 1968; Deh
Bulgaria                -4.84426   1.93892    -1.0666     1.74795   -0.143106   -0.0350317  209      Rusek, 1965; Tsone
Canary Islands          -6.95173   0.651808   -2.06754    1.85064   -0.0919176  0.779021    115      Gama, 2005b; Deha
Corsica                 -3.87025   1.82671    -1.1579     1.17258   0.315027    0.669108    60       Deharveng, 2007
Crete                   -3.76326   1.58438    -1.39088    1.74242   0.192635    -0.766043   108      Ellis, 1976; Schultz
Croatia                 -3.98247   2.60059    -0.85965    1.71272   0.321948    -0.213202   141      Bogojević, 1968; Oz
Czech Republic          -4.92946   2.60059    -1.43186    1.78157   0.0042882   -0.0565949  361      Rusek 1977, 1979, 1
Denmark                 -4.84426   1.93892    -1.47086    2.00751   -0.062116   -0.483545   222      Fjellberg, 2007a
Dodecanese Is.          -3.46924   1.58438    -1.16453    1.22197   0.235818    0.207184    36       Deharveng, 2007
Estonia                 -3.87025   1.82671    -1.19182    1.52952   0.43837     0.424804    40       Kanal, 2004; Deharv
Faroe Is.               -4.64414   1.93892    -1.34386    1.79648   -0.164351   -0.0688218  85       Fjellberg, 2007a
Finland                 -4.64414   1.93892    -1.39185    1.7577    0.0182062   -0.334913   225      Fjellberg, 2007a
France                  -5.06348   1.93892    -1.43424    1.46779   0.014345    0.0427404   641      Liste des Collembole
Franz Josef Land        -2.8658    -0.280761  -1.00011    0.50044   -1.70537    3.45228     15       Babenko & Fjellberg
Germany                 -5.06348   2.60059    -1.422      1.69377   -0.0851031  -0.0590463  420      Pallisa, 2000, Dehar
Greece                  -3.98247   1.58438    -1.06325    1.45253   -0.120441   -0.209457   103      Schultz & Lymberak
Hungary                 -4.84426   2.60059    -1.29947    1.74949   0.0120114   -0.205994   408      Traser & Dányi, 2008
Never use the original database for calculations; work only on a copy.
Watch for empty cells.
In calculated cells, watch for impossible values.
http://folk.uio.no/ohammer/past/
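These rules can be illustrated with a short sketch. It assumes the records are held as a list of dictionaries; the field names, the helper `find_problems`, and the third (deliberately faulty) record are hypothetical and only serve the illustration.

```python
import copy

# Two records taken from the source database table; the third is a
# hypothetical faulty record added to show what the checks catch.
original = [
    {"country": "Albania", "area_km2": 28748, "lat": 41.33, "species": 132},
    {"country": "Andorra", "area_km2": 468, "lat": 42.5, "species": 4},
    {"country": "Nowhere", "area_km2": None, "lat": 141.0, "species": -3},
]


def find_problems(records):
    """Return (country, field, reason) for empty cells and impossible values."""
    problems = []
    for rec in records:
        # Empty cells: missing values or empty strings.
        for field, value in rec.items():
            if value is None or value == "":
                problems.append((rec["country"], field, "empty cell"))
        # Impossible values: latitude outside the valid range.
        lat = rec.get("lat")
        if lat is not None and not -90 <= lat <= 90:
            problems.append((rec["country"], "lat", "outside -90..90"))
        # Impossible values: negative areas or species counts.
        for field in ("area_km2", "species"):
            value = rec.get(field)
            if value is not None and value < 0:
                problems.append((rec["country"], field, "negative"))
    return problems


working = copy.deepcopy(original)  # calculate on the copy, never on the original
working[0]["area_km2"] /= 1000     # example calculation: area in 1000 km2

for problem in find_problems(working):
    print(problem)
```

The deep copy keeps the original records untouched even when calculated values overwrite fields in the working copy.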
Frequency distribution

No  Raw data
1   0.154497
2   0.919498
3   0.517978
4   0.742013
5   0.295932
6   0.819647
7   0.693982
8   0.194982
9   0.276991
10  0.054868
11  0.386411
12  0.00286
13  0.129657

Classes  Class means  Counter  Number of occurrences  Frequencies  Cumulative frequencies
0-0.1    0.05         20       20                     0.1          0.1
0.1-0.2  0.15         48       28                     0.14         0.24
0.2-0.3  0.25         83       35                     0.175        0.415
0.3-0.4  0.35         107      24                     0.12         0.535
0.4-0.5  0.45         127      20                     0.1          0.635
0.5-0.6  0.55         149      22                     0.11         0.745
0.6-0.7  0.65         172      13                     0.065        0.925
0.7-0.8  0.75         185      13                     0.065        0.925
0.8-0.9  0.85         198      13                     0.065        0.99
0.9-1    0.95         200      2                      0.01         1

Spreadsheet formulas shown on the slide, in column order: +D10+0.1 (class means), +LICZ.JEŻELI(B$2:B$201;"<1") (counter; LICZ.JEŻELI is the Polish-locale name of COUNTIF), =E11-E12 (number of occurrences), =F11/E$11 (frequencies), =G11+H12 (cumulative frequencies).

[Figure: frequency distribution of the data, N and f(X) against X]
[Figure: frequency distribution f(X) and cumulative frequency distribution F(X) of these data, each against X]
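The spreadsheet steps (sorting raw values into classes, then deriving relative and cumulative frequencies) can be sketched in a few lines. This is an illustrative sketch, not the lecture's own implementation: the function names are hypothetical, the classes are assumed to be equal-width intervals on [0, 1], and only the raw values visible on the slide are binned.

```python
from itertools import accumulate

# The raw values visible on the slide (the full data set is larger).
raw = [0.154497, 0.919498, 0.517978, 0.742013, 0.295932, 0.819647,
       0.693982, 0.194982, 0.276991, 0.054868, 0.386411, 0.00286, 0.129657]


def class_counts(values, n_classes=10):
    """Count values per class for equal-width classes on [0, 1]."""
    counts = [0] * n_classes
    for x in values:
        index = min(int(x * n_classes), n_classes - 1)  # x = 1 goes in the last class
        counts[index] += 1
    return counts


def frequency_table(counts):
    """Return (relative frequencies, cumulative frequencies) for class counts."""
    total = sum(counts)
    freqs = [c / total for c in counts]
    cumulative = [c / total for c in accumulate(counts)]
    return freqs, cumulative


print(class_counts(raw))

# Class counts of the full data set as given on the slide.
slide_counts = [20, 28, 35, 24, 20, 22, 23, 13, 13, 2]
freqs, cumulative = frequency_table(slide_counts)
for f, F in zip(freqs, cumulative):
    print(f"{f:.3f}  {F:.3f}")
```

Dividing the running totals of the counts by N, rather than summing the relative frequencies, avoids accumulating rounding errors, so the last cumulative frequency is exactly 1.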
Discrete and continuous distributions
[Figure: two panels of f(X) against X, a discrete distribution and a continuous distribution]
Discrete distribution, probability mass function (pmf):
$$F(x_{\max}) = \sum_{x_i = x_{\min}}^{x_{\max}} f(x_i) = 1$$
Continuous distribution, probability density function (pdf):
$$F(x_{\max}) = \int_{x_{\min}}^{x_{\max}} f(x)\,dx = 1$$
Statistical or probability distributions add up to one.
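Both statements can be checked numerically. The sketch below sums the relative class frequencies from the frequency distribution example and integrates a density with the trapezoidal rule. The density f(x) = 2x on [0, 1] and the helper name are hypothetical choices made only for illustration.

```python
def trapezoid_integral(f, x_min, x_max, steps=1000):
    """Approximate the integral of f from x_min to x_max (trapezoidal rule)."""
    h = (x_max - x_min) / steps
    total = 0.5 * (f(x_min) + f(x_max))
    for k in range(1, steps):
        total += f(x_min + k * h)
    return total * h


# Discrete case: relative frequencies of the ten classes.
frequencies = [0.1, 0.14, 0.175, 0.12, 0.1, 0.11, 0.115, 0.065, 0.065, 0.01]
discrete_total = sum(frequencies)


# Continuous case: a hypothetical density f(x) = 2x on [0, 1].
def density(x):
    return 2 * x


continuous_total = trapezoid_integral(density, 0.0, 1.0)

print(round(discrete_total, 10), round(continuous_total, 10))
```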
Shapes of frequency distributions
[Figure: six panels of f(x_i) against x: left skewed, symmetric, right skewed, bimodal, decreasing, U-shaped]
Many statistical methods rely on comparing observed frequency distributions with theoretical distributions.
Deviations from the theoretical expectation, the so-called residuals Δf(x), are used to assess statistical significance.
[Figure: observed f(X) against X together with the theoretical expectation; Δf(x) marks the deviations]
If the Δf(x) are too large, we accept the hypothesis that our observations differ from the theoretical expectation.
The problem in statistical inference is to find the appropriate theoretical distribution that can be applied to our data.
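One common way to decide whether the residuals are "too large" is Pearson's chi-square statistic. The sketch below applies it to the class counts of the frequency distribution example. Both the choice of this statistic and the uniform theoretical distribution are assumptions made for illustration; the lecture does not prescribe them here.

```python
# Observed class counts from the frequency distribution example.
observed = [20, 28, 35, 24, 20, 22, 23, 13, 13, 2]
n_total = sum(observed)
n_classes = len(observed)

# Assumption: the theoretical distribution is uniform, so every class
# has the same expected frequency 1 / n_classes.
expected_freq = 1 / n_classes
observed_freq = [c / n_total for c in observed]

# Residuals: observed minus expected frequency, class by class.
residuals = [f - expected_freq for f in observed_freq]

# Pearson chi-square statistic, computed on counts.
expected_count = n_total / n_classes
chi_square = sum((c - expected_count) ** 2 / expected_count for c in observed)

# Tabulated 5% critical value of chi-square with 9 degrees of freedom.
CRITICAL_5_PERCENT_DF9 = 16.919

print([round(r, 3) for r in residuals])
print(round(chi_square, 2), chi_square > CRITICAL_5_PERCENT_DF9)
```

With a different theoretical distribution the expected counts change, and so does the conclusion, which is why the choice of the theoretical distribution matters.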
Homework and literature
Refresh:
• Arithmetic, geometric, harmonic mean
• Variance, standard deviation, standard error
• Central moments
• Third and fourth central moment
• Mean and variance of power and exponential function statistical distributions
• Pseudocorrelation
• Sample bias
• Coefficient of variation
• Representative sample
Literature:
• Mathe-online
• Łomnicki: Statystyka dla biologów.
Prepare for the next lecture:
• Bernoulli distribution
• Pascal distribution
• Hypergeometric distribution
• Linear random number