Lecture 8
First steps in statistics

Motivation
• What is interesting?
• Why is it interesting?
• Cui bono?
• Envisioned method of analysis
• Theory
• Experimental or observational study
• Data analysis

How to perform a biological study
• Literature, theory
• Planning: defining the problem; identifying the state of the art; formulating the specific hypotheses to be tested; study design, power analysis, choosing the analytical methods, design of the database
• Data: observations, experiments, meta-analysis
• Analysis: statistical analysis, modelling
• Interpretation: comparing with current theory
• Publication: scientific writing, expertise

Preparing the experimental or data collecting phase

Let's look a bit closer at data collection. Before you start collecting any data you must have a clear vision of what you want to do with them. Hence you have to answer some important questions:
• For what purpose do I collect data?
• Did I read the relevant literature?
• Have similar data already been collected by others?
• Is the experimental or observational design appropriate for the statistical tests I want to apply?
• Are the data representative?
• How many data do I need for the statistical tests I want to apply?
• Does the data structure fit the hypothesis I want to test?
• Can I compare my data and results with other work?
• How large are the measurement errors?
• Do these errors prevent clear final results?
• How large may the errors be for the data to remain meaningful?
How to lie with statistics

[Figure: pie chart with the shares PO 33%, Unknown 28%, PIS 19%, Samoobrona 10%, LiD 10%]

Representative sampling

[Figure: number of species per body length class [mm], drawn with a linear and with a logarithmic axis for the number of species]
[Figure: histograms of events per class, drawn with different numbers of classes]
[Figure: mean density against body weight [mg] and against body weight class [mg], logarithmic axes]
[Figure: number of stork nests and birth rates in Switzerland (birth rate against number of storks)]
[Figure: % catholics against mean body height]
[Figure: alternative chart layouts for the variables A to E, labelled "Better" and "Worse"]
[Figure: influence of variable 1 on variable 2, annotated y = f(x), R2 = n.s.]

Scientific publications of any type are classically divided into 6 major parts
• Title, affiliations and abstract
Give a short and meaningful title that may already contain an essential result. The abstract is a short text containing the major hypotheses and results. It should make clear why the study has been undertaken.
• The introduction
The introduction should briefly discuss the state of the art and the theories the study is based on, describe the motivation for the present study, and explain the hypotheses to be tested. Do not review the literature extensively, but discuss all of the literature needed to put the present paper into a broader context. Explain who might be interested in the study and why it is worth reading!
• Materials and methods
A short description of the study area (if necessary), of the experimental or observational techniques used for data collection, and of the techniques of data analysis. Indicate the limits of the techniques used.
• Results
This section should contain a description of the results of your study. Most of the tables and figures should be placed here. Do not present the same data in both tables and figures.
• Discussion
This part should be the longest part of the paper. Discuss your results in the light of current theories and scientific belief. Compare them with the results of other comparable studies. Discuss again why your study has been undertaken and what is new. Also discuss possible problems with your data and possible misconceptions. Give hints for further work.
• Acknowledgments
Short acknowledgments that mention people who contributed material but are not co-authors, and the funding institutions.
• Literature

The source data base

The columns Min to Kurtosis describe the body weight distribution (ln body weight). In the column Island/Mainland, i stands for island and m for mainland. The entries in the column Sources are cut off on the slide.

Country | Island/Mainland | Area [km²] | DeltaT [°C] | Lat | Long | Days below zero | Min | Max | Mean | Variance | Skewness | Kurtosis | Species | Sources
Albania | m | 28748 | 17 | 41.33 | 19.92 | 34 | -4.28959 | 2.60059 | -1.31798 | 1.87086 | 0.0616831 | -0.210158 | 132 | Thibaud, 1992; Thiba
Andorra | m | 468 | 14.7 | 42.5 | 1.5 | 60 | -0.867014 | 1.58438 | -0.0465939 | 1.22393 | 1.79255 | 3.39576 | 4 | Deharveng, 2007
Austria | m | 83871 | 20 | 48.12 | 14.57 | 92 | -4.84426 | 2.60059 | -1.26057 | 1.61122 | -0.0019179 | -0.0696599 | 486 | Pomorski, 2006; Qu
Azores | i | 2200 | 7 | 37.73 | -28.01 | 1 | -4.13091 | 1.93892 | -1.15658 | 1.62273 | -0.045433 | 0.101475 | 94 | Gama, 2005a,b
Balearic Islands | i | 5014 | 15 | 39.55 | 2.65 | 18 | -3.98247 | 0.651808 | -1.74797 | 1.19506 | -0.0467805 | -0.179688 | 42 | Jordana et al., 2005;
Belarus | m | 207650 | 23 | 53.87 | 28 | 144 | -4.13091 | 1.82671 | -1.02222 | 1.98246 | 0.0370745 | -0.386657 | 48 | Kuznetsova, 2002; D
Belgium | m | 30528 | 15 | 50.9 | 4 | 50 | -4.64414 | 1.93892 | -1.14785 | 1.77014 | -0.0967065 | -0.216933 | 209 | Janssens, 2008
Bosnia and Herzegovina | m | 51197 | 20 | 43.82 | 18 | 114 | -4.84426 | 2.60059 | -1.03804 | 1.38211 | 0.327301 | 1.01551 | 145 | Bogojević 1968; Deh
Bulgaria | m | 110971 | 21 | 42.65 | 25 | 102 | -4.84426 | 1.93892 | -1.0666 | 1.74795 | -0.143106 | -0.0350317 | 209 | Rusek, 1965; Tsone
Canary Islands | i | 7270 | 5 | 27.93 | -15.4 | 1 | -6.95173 | 0.651808 | -2.06754 | 1.85064 | -0.0919176 | 0.779021 | 115 | Gama, 2005b; Deha
Corsica | i | 8680 | 13 | 41.92 | 8.73 | 11 | -3.87025 | 1.82671 | -1.1579 | 1.17258 | 0.315027 | 0.669108 | 60 | Deharveng, 2007
Crete | i | 8259 | 13 | 35.33 | 24.83 | 1 | -3.76326 | 1.58438 | -1.39088 | 1.74242 | 0.192635 | -0.766043 | 108 | Ellis, 1976; Schultz
Croatia | m | 56594 | 21 | 45.82 | 15.5 | 114 | -3.98247 | 2.60059 | -0.85965 | 1.71272 | 0.321948 | -0.213202 | 141 | Bogojević, 1968; Oz
Czech Republic | m | 78866 | 19 | 50.1 | 15.5 | 119 | -4.92946 | 2.60059 | -1.43186 | 1.78157 | 0.0042882 | -0.0565949 | 361 | Rusek 1977, 1979, 1
Denmark | m | 43093 | 16 | 55.63 | 12.57 | 85 | -4.84426 | 1.93892 | -1.47086 | 2.00751 | -0.062116 | -0.483545 | 222 | Fjellberg, 2007a
Dodecanese Is. | i | 2663 | 14 | 36.4 | 23.73 | 2 | -3.46924 | 1.58438 | -1.16453 | 1.22197 | 0.235818 | 0.207184 | 36 | Deharveng, 2007
Estonia | m | 45227 | 21 | 59.35 | 26 | 143 | -3.87025 | 1.82671 | -1.19182 | 1.52952 | 0.43837 | 0.424804 | 40 | Kanal, 2004; Deharv
Faroe Is. | i | 1399 | 7 | 62 | -7 | 30 | -4.64414 | 1.93892 | -1.34386 | 1.79648 | -0.164351 | -0.0688218 | 85 | Fjellberg, 2007a
Finland | m | 338145 | 23 | 60.32 | 25 | 169 | -4.64414 | 1.93892 | -1.39185 | 1.7577 | 0.0182062 | -0.334913 | 225 | Fjellberg, 2007a
France | m | 543965 | 15 | 48.73 | 2.3 | 50 | -5.06348 | 1.93892 | -1.43424 | 1.46779 | 0.014345 | 0.0427404 | 641 | Liste des Collembole
Franz Josef Land | i | 16134 | 27 | 79.85 | 57.42 | 310 | -2.8658 | -0.280761 | -1.00011 | 0.50044 | -1.70537 | 3.45228 | 15 | Babenko & Fjellberg
Germany | m | 357021 | 19 | 52.38 | 13.42 | 97 | -5.06348 | 2.60059 | -1.422 | 1.69377 | -0.0851031 | -0.0590463 | 420 | Pallisa, 2000, Dehar
Greece | m | 131992 | 17 | 37.9 | 23.73 | 2 | -3.98247 | 1.58438 | -1.06325 | 1.45253 | -0.120441 | -0.209457 | 103 | Schultz & Lymberak
Hungary | m | 93054 | 22 | 47.43 | 20 | 100 | -4.84426 | 2.60059 | -1.29947 | 1.74949 | 0.0120114 | -0.205994 | 408 | Traser & Dányi, 2008

Each row holds a single data record. Columns contain variables. Variables can be of text or metric type.

Never use the original data base for calculations. Use only a copy. Take care of empty cells. In calculated cells take care of impossible values.

http://folk.uio.no/ohammer/past/

Frequency distribution

Raw data (the first 13 values are shown on the slide; the counter column reaches 200 values in total):
0.154497, 0.919498, 0.517978, 0.742013, 0.295932, 0.819647, 0.693982, 0.194982, 0.276991, 0.054868, 0.386411, 0.00286, 0.129657

Classes | Class means | Counter | Number of occasions | Frequencies | Cumulative frequencies
0-0.1 | 0.05 | 20 | 20 | 0.1 | 0.1
0.1-0.2 | 0.15 | 48 | 28 | 0.14 | 0.24
0.2-0.3 | 0.25 | 83 | 35 | 0.175 | 0.415
0.3-0.4 | 0.35 | 107 | 24 | 0.12 | 0.535
0.4-0.5 | 0.45 | 127 | 20 | 0.1 | 0.635
0.5-0.6 | 0.55 | 149 | 22 | 0.11 | 0.745
0.6-0.7 | 0.65 | 172 | 23 | 0.115 | 0.86
0.7-0.8 | 0.75 | 185 | 13 | 0.065 | 0.925
0.8-0.9 | 0.85 | 198 | 13 | 0.065 | 0.99
0.9-1 | 0.95 | 200 | 2 | 0.01 | 1

Spreadsheet formulas shown on the slide (Polish Excel; LICZ.JEŻELI corresponds to COUNTIF):
• Class means: +D10+0.1
• Counter: +LICZ.JEŻELI(B$2:B$201;"<1")
• Number of occasions: =E11-E12
• Frequencies: =F11/E$11
• Cumulative frequencies: =G11+H12

[Figure: frequency distribution, f(X) and N plotted against X]
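The spreadsheet steps of the frequency table (counting values per class, dividing by the total, summing up) can also be sketched in Python. This is only an illustration under stated assumptions: the function names `class_counts` and `frequencies` are hypothetical, the classes are taken to be ten equal-width intervals on [0, 1], and a value equal to the upper bound is put into the last class. The raw values and class counts are the ones listed on the slide.

```python
from itertools import accumulate


def class_counts(values, n_classes=10, lower=0.0, upper=1.0):
    """Count how many values fall into each of n_classes equal-width classes."""
    counts = [0] * n_classes
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"value {v} lies outside [{lower}, {upper}]")
        # Assumption: the upper bound itself is assigned to the last class.
        k = min(int((v - lower) * n_classes / (upper - lower)), n_classes - 1)
        counts[k] += 1
    return counts


def frequencies(counts):
    """Turn class counts into relative and cumulative frequencies."""
    n = sum(counts)
    freq = [c / n for c in counts]
    cum = list(accumulate(freq))
    return freq, cum


# The raw values listed on the slide, sorted into the ten classes.
raw = [0.154497, 0.919498, 0.517978, 0.742013, 0.295932, 0.819647,
       0.693982, 0.194982, 0.276991, 0.054868, 0.386411, 0.00286, 0.129657]
print(class_counts(raw))

# Class counts of the full data set (column "Number of occasions").
slide_counts = [20, 28, 35, 24, 20, 22, 23, 13, 13, 2]
freq, cum = frequencies(slide_counts)
for count, f, big_f in zip(slide_counts, freq, cum):
    print(f"{count:3d}  {f:.3f}  {big_f:.3f}")
```

The last cumulative frequency should equal one up to rounding, which is a quick check that no value has been lost or counted twice.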
[Figure: frequency distribution f(X) and cumulative frequency distribution F(X), both plotted against X]

Discrete and continuous distributions

[Figure: a discrete and a continuous distribution, f(X) plotted against X]

Discrete distribution: probability mass function (pmf)

$F(x_{\max}) = \sum_{i=1}^{x_{\max}} f(x_i) = 1$

Continuous distribution: probability density function (pdf)

$F(x_{\max}) = \int_{x_{\min}}^{x_{\max}} f(x)\,dx = 1$

Statistical or probability distributions add up to one.

Shapes of frequency distributions

[Figure: six frequency distributions f(x_i) plotted against x: left skewed, symmetric, right skewed, bimodal, decreasing, U-shaped]

Many statistical methods rely on a comparison of observed frequency distributions with theoretical distributions. Deviations from the theoretical expectation (so-called residuals) are measures of statistical significance.

[Figure: observed f(X) plotted against X together with the theoretical expectation; the deviations are marked Δf(x)]

If the Δf(x) are too large we accept the hypothesis that our observations differ from the theoretical expectation. The problem in statistical inference is to find the appropriate theoretical distribution that can be applied to our data.

Home work and literature

Literature:
• Mathe-online
• Łomnicki: Statystyka dla biologów.

Refresh:
• Arithmetic, geometric, harmonic mean
• Variance, standard deviation, standard error
• Central moments
• Third and fourth central moment
• Mean and variance of power and exponential function statistical distributions
• Pseudocorrelation
• Sample bias
• Coefficient of variation
• Representative sample

Prepare for the next lecture:
• Bernoulli distribution
• Pascal distribution
• Hypergeometric distribution
• Linear random number