Data collection and Statistics
Evert Jan Bakker and Gerrit Gort
Biometris - Wageningen University
Biometris Quantitative Methods brought to Life

Introduction: What is Statistics?
1. Probability calculus: theoretical and exact (easy program: PQRS).
2. Descriptive statistics: just describes the data. All conclusions refer only to the sample, so the conclusions are 'always correct'. Can already be convincing. Includes graphical representations of the data.
3. Inference (test of hypothesis, estimate with confidence interval): conclusions are drawn about a population (e.g. Wageningen students) or a general phenomenon (e.g. maize yield), using only data from a limited sample.
4. Experimental design / sampling design: randomisation, blocking, special designs, ... / sample size.

CORE STATEMENT
BEFORE collecting the data you should:
- know which analyses you will do, that is, know the model or models to be used;
- be confident that the model is reasonable;
- be confident that the precision to be obtained will be sufficient.
In data collection, be aware of:
- replication (true replication vs pseudo-replication);
- randomisation;
- reduction of error variation.

Qualitative vs quantitative data
Example: "green" (qualitative).

2 types of research aims
1. Exploration: generate new ideas. Measure many response variables; report any fact of interest (relationships, differences), using "any" descriptive analysis.
2. Inference (test / confidence interval): drawing conclusions about a population or a general phenomenon based on sample data. Inference has to be done according to the rules, so as not to 'Lie with Statistics'.
The model of analysis should be reasonable.

Inference
An experiment used for inference:
- Question / hypothesis
- Design of the experiment (Statistics)
- Carrying out the experiment
- Analysis of the experimental data (Statistics)
For standard designs, the data analysis follows a fixed calculation pattern, which is known before the experiment is done.

The statistical MODEL / model types
Model = assumptions about the observations:
- Systematic part: how the mean value of the response depends on the factor levels / factor level combinations.
- Random part: independence, Normality and equal variance (independence follows from correct randomisation).
All influencing factors that are not in the systematic part end up in the random part.
If the response is quantitative (e.g. yield, blood pressure):
- Qualitative factor(s), e.g. variety: 2-sample t-test or ANOVA
- Quantitative factor(s), e.g. amount of fertilizer, amount of rainfall: linear regression
- Both: analysis of covariance

Data collection
Primary data collection:
- for observational research: sampling (how? how many?);
- for experimental research: design of the experiment (choice of experimental units, randomisation, measurement of response(s), number of replications).
In case secondary data are used: know how the data were obtained (meta-data). Otherwise the conclusion will be about an unknown population.
Sampling: random, stratification, subsampling, ...
A conclusion can be drawn about the population from which a random sample was taken.

Design principles: brief overview
1. Repetition (n > 1)
   - required for more precision. 1-sample example: the standard deviation of $\bar{y}$ is $\sigma/\sqrt{n}$.
   - required to know the natural variation. 2-sample example: $\bar{y}_1 - \bar{y}_2$ must be compared with the natural variation, which is impossible without repetition.
2. Random drawings / random allocation of treatments
   - no bias (systematic error)
   - introduction of chance in the system

Design principles: brief overview (2)
3. Increase homogeneity: all experimental units are as similar, and in as similar conditions, as possible, except for the conditions influenced by the treatment.
4. Measure other variables that may influence the response; in the analysis they are used as covariates.
5. In case of known other possible sources of variation: blocking. Create homogeneous groups (blocks). In the analysis, block effects can be corrected for.
Without blocking or covariates: Total variation = Treatment effect + Error
With blocking or covariates: Total variation = Treatment effect + Block/covariate effect + Error

Lessons, also from personal experience
- Own PhD experience: not believing the results led to an extra year of analyses! Lesson: know your analysis in advance.
- Real-life research experience in Mali: choice of experimental units.

Cows observed in pasture land: example
During 10 days, 3 cows are observed, one per observer, during 8 hours, 12 times per hour, for 60 s each time.
Measurement: amount of time spent walking (%) = y.
Result for walking (%) between 10 and 12 a.m.: 72 observations (the total used in the calculation below); suppose the within-cow standard deviation is $s_E = 10$.
Some cows walk more than others: suppose the between-cow standard deviation of the mean time spent walking is $s_C = 4$.

Cows example
Model: y = C + E, with C = mean for a (random) cow and E = deviation = measurement − C.
$\mathrm{Var}(\bar{y}) = \mathrm{Var}(\bar{C}) + \mathrm{Var}(\bar{E}) = 4^2/3 + 10^2/72 = 5.33 + 1.39 = 6.72$
So, using 1 cow per observer: $\mathrm{se}(\bar{y}) = \sqrt{6.72} = 2.6$.
If 2 cows per observer were used: $\mathrm{Var}(\bar{y}) = 4^2/6 + 10^2/72 = 2.67 + 1.39 = 4.06$, so $\mathrm{se}(\bar{y}) = \sqrt{4.06} = 2.01$.
If 4 cows per observer were used: $\mathrm{se}(\bar{y}) = 1.65$.

Cows example
Make sure to think about the sources of variation. Important sources of variation need to be sampled independently, and often enough.
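The standard errors quoted in the cows example can be reproduced with a short sketch. It assumes that the total number of observations stays at 72 however the cows are divided among the 3 observers; this is an assumption, chosen because it is the divisor that reproduces the quoted values 2.6, 2.01 and 1.65. The function and parameter names are illustrative only.

```python
import math


def se_of_overall_mean(n_cows, n_obs_total, s_between=4.0, s_within=10.0):
    """Standard error of the overall mean under the model y = C + E.

    s_between: between-cow standard deviation (s_C in the slides)
    s_within:  within-cow standard deviation (s_E in the slides)
    """
    var_cow_part = s_between ** 2 / n_cows         # Var of the mean cow effect
    var_error_part = s_within ** 2 / n_obs_total   # Var of the mean deviation
    return math.sqrt(var_cow_part + var_error_part)


# Assumption: 3 observers, total number of observations fixed at 72.
for cows_per_observer in (1, 2, 4):
    n_cows = 3 * cows_per_observer
    se = se_of_overall_mean(n_cows, n_obs_total=72)
    print(f"{cows_per_observer} cow(s) per observer: se = {se:.2f}")
```

Adding cows shrinks the between-cow term, while adding more observations on the same cows only shrinks the within-cow term, which is already the smaller of the two.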
The observations were pseudo-replications. The many within-cow observations gave a very precise estimate of the mean walking % for each of the 3 cows, but not of the overall mean.
Experimental / sampling units: units to which a treatment is assigned / units that were randomly sampled.
Measured units: units on which measurements are taken.
Example: pens vs chickens in the pen.

Sample size calculations: 2 treatments
2 hypothetical populations, one for each treatment. We call the population means $\mu_1$ and $\mu_2$.
Parameter of interest: $\Delta = \mu_1 - \mu_2$.
Samples: $y_{1,1}, \ldots, y_{1,n_1}$ and $y_{2,1}, \ldots, y_{2,n_2}$.
Model = assumptions: the data are outcomes of $n_1$ and $n_2$ independent drawings from $N(\mu_1, \sigma_1)$ and $N(\mu_2, \sigma_2)$.
Extra assumption: $\sigma_1 = \sigma_2 = \sigma$.

3 (of many) possible realities
- Δ = 0 (no difference)
- Δ = Δ1 (large difference)
- Δ = Δ2 (small difference)
Assumed: Normality and σ1 = σ2.
[Figure: Normal curves for the two populations, labelled C and T, under the three realities: Δ = 0 (C = T), Δ = Δ1 and Δ = Δ2.]

Testing: reality vs. conclusion
Given a relevant Ha reality (a value for Δ) and given α (e.g. 0.05), the power of a planned experiment can be calculated.

Simulations to mimic the test result
Excel: simulations 2 samples.xls. One experiment with its test is repeated 200 times.
- We assume that σ is approximately known.
- We can vary "reality" Δ = μ1 − μ2. That is: let us assume that Δ is ... (so and so much), then see how frequently H0 is rejected (= power of the test).
- We can vary the sample size n (= n1 = n2).
- We can vary α.
We can then simulate power (demonstration of the simulation program).

Formula for sample size: confidence interval
Confidence interval limits: $\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2}\, s \sqrt{2/n}$
Formula (n per sample) for a $(1-\alpha)$ confidence interval with error margin ≤ M: $n \ge 2\,\sigma^2\, t_{\alpha/2}^2 / M^2$
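As a minimal sketch, the sample size formula for a confidence interval (n per sample, so that the error margin of the interval for the difference of two means is at most M) can be evaluated as follows. The function name and the example values σ = 10 and M = 5 are hypothetical, and the t-value is simply passed in as the rough 2.0 to 2.2 used in the slides rather than looked up from a t-table.

```python
import math


def n_per_sample_for_margin(sigma, margin, t_value=2.0):
    """Smallest n per sample with t * sigma * sqrt(2 / n) <= margin.

    Rearranged: n >= 2 * (t * sigma / margin) ** 2.
    t_value is an approximate t-quantile (about 2.0 to 2.2 for alpha = 0.05).
    """
    return math.ceil(2 * (t_value * sigma / margin) ** 2)


# Hypothetical illustration: guessed sigma = 10, required error margin M = 5.
print(n_per_sample_for_margin(sigma=10.0, margin=5.0, t_value=2.0))
print(n_per_sample_for_margin(sigma=10.0, margin=5.0, t_value=2.2))
```

Because n grows with the square of σ/M, halving the required error margin roughly quadruples the sample size.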
In the sample size formula, $t_{\alpha/2} \approx 2.0$ to $2.2$.
Precision criteria that have to be specified: 1 − α = confidence level, and M = maximum error margin.
Notes
1) σ has to be estimated.
2) If α = 0.05, t = 2.0 to 2.2.
3) If the outcome for n is small (< 10), replace the t-value by the one with df = 2(n − 1) and calculate again.
4) In testing, instead of M we specify Δ, the minimum relevant difference, and β (= 1 − power).

2C. Power calculation with Russ Lenth
Lenth, R. V. (2006). Java Applets for Power and Sample Size [Computer software]. Retrieved March 15, 2009, from http://www.stat.uiowa.edu/~rlenth/Power.
Example: estimate p = fraction of babies with constipation (< 0.2) with an error margin of at most 1%.
Define y = 1 (yes) or 0 (no). Then $\mathrm{Var}(y) = \sigma^2 = p(1-p) < 0.2 \times 0.8 = 0.16$.
Formula: n ≥ …

Conclusions
In the design phase, think about the relevant "sources of variation" (influential factors):
- Which of them will you include in the design, and which will you keep constant?
- Block design? Split plot?
- Measure conditions that vary (weather, ...).
- Measure general conditions (even if they do not vary across treatments in your experiment).
- Correct randomisation.
- Avoid / be aware of pseudo-replication: experimental unit vs measured unit, sampling unit vs measured unit.

Conclusions
For sample size calculations, the researcher must:
- know beforehand which analysis she will perform with the collected data;
- specify research goals in terms of precision requirements: minimum relevant difference Δ, power (0.8 / 0.9), α (5%);
- know the error variation: σ (guess: range/4);
- decide on sample sizes (Russ Lenth Power).
Measure and store quantitative data when possible, not binary data.

Analysis
Conclusions from a statistical analysis are drawn in the context of a statistical model. The correctness and the relevance of the conclusion depend on the correctness and the relevance of the model.
Model = assumptions about the observations:
- Systematic part: how the mean value of the response depends on the factor levels / factor level combinations.
- Random part: independence, Normality and equal variance (independence follows from correct randomisation).

Conclusions
In case of need, contact a statistician ... beforehand!
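The simulation idea from the slide "Simulations to mimic the test result" can be sketched in a few lines. This is not the Excel workbook itself. As a simplifying assumption it treats σ as known and uses a two-sided z-test instead of a t-test, and the function name, the seed and the example values (σ = 10, n = 20) are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_power(delta, sigma, n, alpha=0.05, n_experiments=200, seed=1):
    """Fraction of simulated two-sample experiments in which H0 is rejected.

    Assumption: sigma is known, so a z-test is used (not the t-test).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    se_diff = sigma * math.sqrt(2 / n)             # sd of ybar1 - ybar2
    rejections = 0
    for _ in range(n_experiments):
        sample1 = [rng.gauss(delta, sigma) for _ in range(n)]  # mean mu1 = delta
        sample2 = [rng.gauss(0.0, sigma) for _ in range(n)]    # mean mu2 = 0
        z = (mean(sample1) - mean(sample2)) / se_diff
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_experiments


# Vary "reality" delta and watch how often H0: delta = 0 is rejected.
for delta in (0.0, 5.0, 10.0):
    print(delta, simulate_power(delta, sigma=10.0, n=20))
```

With Δ = 0 the rejection rate should hover around α; as Δ or n grows it moves towards 1, which is the simulated power.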