Biometris - WGS Data Management Planning course

Data collection and Statistics
Evert Jan Bakker and Gerrit Gort
Biometris - Wageningen University
Biometris
Quantitative Methods brought to Life
Introduction: What is Statistics?
1.
Probability calculus - theoretical and exact (Easy program:
PQRS)
2.
Descriptive Statistics
Just describes the data. All conclusions only refer to the sample.
The conclusions are ‘always correct’ and can already be convincing.
Graphical representations of the data.
3.
Inference (Test of Hypothesis, Estimate Conf. Interval)
Conclusions are drawn about a population (e.g.
Wageningen Students) or a general phenomenon (maize
yield), only using data from a limited sample.
4.
Experimental design/ Sampling design
Randomisation, Blocking, Special designs…/sample size
CORE STATEMENT

BEFORE collecting the data




You should know which analyses you will do,
that is: know the model / models to be used,
be confident that the model is reasonable
such that the precision to be obtained will be sufficient
In data collection be aware of



Replication (true replication vs pseudo-replication)
Randomization
Reduction of error variation
Qualitative vs Quantitative data
“green”
2 types of research aims
1. exploration : generate new ideas
Measure many response variables; report any fact of
interest / relationship / differences, using “any” descriptive
analysis.
2. Inference (test / confidence interval):
drawing conclusions about a population or a
general phenomenon based on sample data.
Inference has to be done according to the rules, so
as not to ‘Lie with Statistics’.
The model of analysis should be reasonable
Inference

An experiment used for inference:
1. Question / Hypothesis
2. Design of the experiment (Statistics)
3. Carry out the experiment
4. Analysis of the experimental data (Statistics)

For standard designs, the data analysis follows
a fixed calculation pattern, which is known
before the experiment is done.
The statistical MODEL / model types

Model = assumptions about the observations




Systematic part (how the mean value of the response
depends on the factor levels / factor level combinations)
Random part: independence, Normality and equal
variance
(independence follows from correct randomisation)
All influencing factors not in the systematic part, end up in
the random part
If response is quantitative (e.g. yield, blood pressure)



Qualitative factor(s), e.g. variety → 2-sample t-test or ANOVA
Quantitative factor(s), e.g. amount of fertilizer, amount of rainfall → linear regression
Both → Analysis of Covariance
Data collection

Primary data collection:


for observational research: sampling (how? how many?)
for experimental research: design of experiment (choice of experimental units, randomisation, measurement of response(s), number of replications)

In case secondary data is used: know how the data
were obtained (meta-data). Otherwise the conclusion
will be about an unknown population.

Sampling: random, stratification, subsampling, ...
Conclusion can be drawn about a population from
which a random sample was taken.
Design principles : brief overview
1. Repetition (n > 1)


required for more precision
1-sample example: the st.dev of $\bar{y}$ is $\sigma / \sqrt{n}$
required to know natural variation
2-sample example: $\bar{y}_1 - \bar{y}_2$ must be compared with the natural variation, which is impossible without repetition
2. Random drawings / Random allocation of treatments


no bias (systematic error)
introduction of chance in the system
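
The claim that the standard deviation of a sample mean is σ/√n can be checked with a small simulation. The sketch below is only an illustration: the function name, the value σ = 10 and the number of repeats are assumptions chosen for the example, not part of the course material.

```python
import random
import statistics


def sd_of_sample_means(sigma, n, n_repeats=4000, seed=1):
    """Empirical st.dev of the mean of n independent draws from N(0, sigma)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(n_repeats)
    ]
    return statistics.stdev(means)


sigma = 10.0  # assumed natural variation (hypothetical value)
for n in (1, 4, 16):
    theory = sigma / n ** 0.5          # sigma / sqrt(n)
    empirical = sd_of_sample_means(sigma, n)
    print(f"n={n:2d}  theory={theory:.2f}  simulated={empirical:.2f}")
```

Quadrupling the number of repetitions roughly halves the st.dev of the mean, which is why more precision requires repetition.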
Design principles : brief overview (2)
3. Increase homogeneity : all experimental units are as
similar and in as similar conditions as possible,
- except the conditions influenced by the treatment
4. Measure other variables that may influence the response → used as covariates in the analysis
5. In case of known other possible sources of variation:
Blocking → create homogeneous groups (blocks)
In the analysis, block effects can be corrected for.
Without blocking / covariates: Total variation = Treatment effect + Error
With blocking / covariates: Total variation = Treatment effect + Block/covariate effect + Error
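
Random allocation of treatments within homogeneous blocks could be sketched as follows. This is a minimal illustration of a randomised complete block layout; the function name, the treatment labels and the number of blocks are hypothetical.

```python
import random


def randomize_within_blocks(treatments, n_blocks, seed=None):
    """Return {block number: random order of treatments}.

    Every treatment occurs exactly once in every block, and the order
    within each block is drawn independently at random.
    """
    rng = random.Random(seed)
    layout = {}
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)  # chance decides which unit gets which treatment
        layout[block] = order
    return layout


layout = randomize_within_blocks(["A", "B", "C"], n_blocks=4, seed=42)
for block, order in layout.items():
    print(f"block {block}: {order}")
```

Because each block contains all treatments, differences between blocks can be separated from the treatment effect in the analysis.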
Lessons, also from personal experience


Own PhD experience:

Not believing the results led to an extra year of
analyses!

Lesson: know your analysis in advance
Real-life research experience in Mali
Choice of experimental units
Cows observed in pasture land - example
During 10 days, 3 cows are observed, one per observer, for 8 hours, 12 times per hour, for 60 s each time.
Measurement: amount of time spent walking (%) = y.
Result for walking (%) between 10 and 12 a.m.: 72 observations per cow; (suppose) within-cow standard deviation sE = 10.
Some cows walk more than others, e.g. between-cow standard deviation of mean time spent walking: sC = 4.
Cows example
y=C+E
C = mean for a (random) cow,
E = deviation = measurement – C
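$\mathrm{Var}(\bar{y}) = \mathrm{Var}(\bar{C}) + \mathrm{Var}(\bar{E}) = 4^2/3 + 10^2/72 = 5.33 + 1.39 = 6.72$
So, using 1 cow per observer: $\mathrm{se}(\bar{y}) = \sqrt{6.72} = 2.6$
If 2 cows per observer were used:
$\mathrm{Var}(\bar{y}) = \mathrm{Var}(\bar{C}) + \mathrm{Var}(\bar{E}) = 4^2/6 + 10^2/72 = 2.67 + 1.39 = 4.06$
$\mathrm{se}(\bar{y}) = \sqrt{4.06} = 2.01$
If 4 cows per observer were used, ..... $\mathrm{se}(\bar{y}) = 1.65$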
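
The standard errors quoted above (2.6, 2.01 and 1.65) can be reproduced with a few lines of code. The sketch assumes that the within-cow term is the within-cow variance divided by 72 observations and that only the number of cows changes; that reading of the slide's arithmetic, and the function name, are assumptions.

```python
import math


def se_overall_mean(n_cows, sd_between=4.0, sd_within=10.0, n_obs_per_cow=72):
    """Standard error of the overall mean walking % for n_cows cows.

    Assumed decomposition: Var = sd_between^2 / n_cows
                               + sd_within^2 / n_obs_per_cow
    """
    var = sd_between ** 2 / n_cows + sd_within ** 2 / n_obs_per_cow
    return math.sqrt(var)


# 1, 2 and 4 cows per observer, with 3 observers
for n_cows in (3, 6, 12):
    print(n_cows, "cows:", round(se_overall_mean(n_cows), 2))

# Ten times more observations on the same 3 cows helps very little
print("3 cows, 720 obs:", round(se_overall_mean(3, n_obs_per_cow=720), 2))
```

The between-cow term dominates, so observing more cows improves precision far more than taking more observations on the same cows.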
Cows example

Make sure to think about the sources of variation.
Important sources of variation need to be sampled often and independently.
The observations were pseudo-replications.
The many within-cow observations enabled us to have
a very precise estimate of the mean walking % for
each of the 3 cows, but not for the overall mean.

Experimental /sampling units: units to which a
treatment is assigned / that were randomly
sampled.
Measured units: units on which measurements are
taken. Example: pens vs chickens
in the pen.
Sample size calculations:
2 treatments

2 Hypothetical Populations, one for each treatment.
We call the population means: μ1 and μ2

Parameter of interest: Δ=μ1- μ2
Samples: y1,1, …, y1,n1;
y2,1, …, y2,n2
Model = Assumptions: the data are outcomes of n1
and n2 independent drawings from
N(μ1, σ1) and N(μ2, σ2).


Extra assumption: σ1= σ2 = σ.
3 (of many) possible realities
Δ= 0
(no difference)
Δ= Δ1 (large difference)
Δ= Δ2 (small difference)
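Assumed: Normality and σ1 = σ2 = σ
[Figure: Normal curves for the two populations C and T with common standard deviation σ, shown for Δ = 0 (C = T), Δ = Δ1 and Δ = Δ2]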
Testing: reality vs. conclusion
Given a relevant Ha reality (value for Δ ), and given α
(e.g. 0.05) the power of a planned experiment can be
calculated.
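
As a rough illustration of such a power calculation, the sketch below uses a Normal approximation with σ treated as known, rather than the t-distribution a real two-sample t-test would use. The function name and the example values (Δ = 5, σ = 10, n = 64 per sample) are hypothetical.

```python
from statistics import NormalDist


def power_two_sample(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: delta = 0.

    Assumes two samples of size n, common and known st.dev sigma,
    and a Normal (z) approximation instead of the t-distribution.
    """
    z = NormalDist()
    se = sigma * (2.0 / n) ** 0.5        # st.dev of ybar1 - ybar2
    z_crit = z.inv_cdf(1 - alpha / 2)    # two-sided critical value
    shift = delta / se
    # probability of landing in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


print(round(power_two_sample(delta=5, sigma=10, n=64), 3))
print(round(power_two_sample(delta=0, sigma=10, n=64), 3))  # equals alpha
```

When Δ = 0 the rejection probability equals α, and it grows towards 1 as Δ or n increases.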
Simulations to mimic the test result

Excel: simulations 2 samples.xls
one experiment with test is repeated 200 times
We assume that σ is approximately known

We can vary “reality” Δ = μ1 – μ2

That is: let us assume that Δ is …. (so and so much)
Then see how frequently H0 is rejected (= power of the test)
We can vary sample size n (=n1=n2).
We can vary α
We can then simulate power
(demonstration of simulation program)
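
The same idea can be sketched outside Excel. The code below is not the spreadsheet used in the course; it is a minimal stand-in that repeats one two-sample experiment many times and counts how often H0 is rejected. It assumes σ is known and uses a z-test instead of a t-test; all names and example values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def simulated_power(delta, sigma, n, alpha=0.05, n_experiments=200, seed=0):
    """Fraction of simulated experiments in which H0: delta = 0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * (2.0 / n) ** 0.5        # st.dev of ybar1 - ybar2 (sigma known)
    rejections = 0
    for _ in range(n_experiments):
        sample1 = [rng.gauss(delta, sigma) for _ in range(n)]  # mu1 = delta
        sample2 = [rng.gauss(0.0, sigma) for _ in range(n)]    # mu2 = 0
        z = (fmean(sample1) - fmean(sample2)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_experiments


# Vary "reality" (delta), sample size n and alpha to see the effect on power
for n in (16, 64):
    print(n, simulated_power(delta=5, sigma=10, n=n))
```

With only 200 repetitions the estimated power is itself noisy; increasing `n_experiments` gives a steadier estimate.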




Formula for sample size : confidence interval
Confidence Interval limits: $\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2} \, s \, \sqrt{2/n}$

Formula (n per sample), for a (1-α) C.I. with Error Margin ≤ M:
$n \ge \dfrac{2 s^2 t_{\alpha/2}^2}{M^2}$, with $t_{\alpha/2} \approx 2.0 - 2.2$

Precision criteria that have to be specified:
1-α = confidence level and M = max Error Margin
Notes 1) σ has to be estimated
2) if α=0.05, t = 2.0 – 2.2.
3) if the outcome for n is small (< 10), change the t-value to the one with df = 2(n-1) and calculate again.

4) In testing, instead of M, we specify Δ, the minimum relevant difference, and β (= 1 – power)
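
A small helper shows how the sample size formula n ≥ 2 s² t² / M² might be applied. The function name and the example values (s = 10, M = 5) are hypothetical, and t is fixed at 2.0, the lower end of the range quoted in the notes.

```python
import math


def n_per_sample(s, margin, t=2.0):
    """Smallest n per sample with error margin t * s * sqrt(2/n) <= margin.

    s      : guessed st.dev of the observations
    margin : maximum error margin M of the confidence interval
    t      : t-value, roughly 2.0 - 2.2 for alpha = 0.05
    """
    return math.ceil(2 * s ** 2 * t ** 2 / margin ** 2)


print(n_per_sample(s=10.0, margin=5.0))          # uses t = 2.0
print(n_per_sample(s=10.0, margin=5.0, t=2.2))   # more conservative t-value
```

If the result is small (below 10), the t-value should be replaced by the one for df = 2(n-1) and the calculation repeated, as in note 3.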
2C. Power calculation with Russ Lenth

Lenth, R. V. (2006). Java Applets for Power and
Sample Size [Computer software]. Retrieved March
15, 2009, from
http://www.stat.uiowa.edu/~rlenth/Power.

Example: Estimate p = fraction of babies with constipation (< 0.2) with an Error Margin of at most 1%.
Define y = 1 (yes) or 0 (no). Then Var(y) = σ² = p(1-p) < 0.2 × 0.8 = 0.16. → formula: n ≥ …
Conclusions

In design phase

Think about the relevant “sources of variation” (influential factors): which of them will you include in the design, which of them will you keep constant? Block design? Split plot?

Measure conditions that vary (weather,...)

Measure general conditions (even if they do not vary
across treatments in your experiment)

Correct randomisation

Avoid / be aware of pseudo-replication
experimental unit ≠ measured unit
sampling unit ≠ measured unit
Conclusions

For sample size calculations, the researcher must
- know beforehand which analysis she will perform with the collected data.
- specify research goals in terms of precision requirements: minimum relevant difference Δ, power (0.8/0.9), α (5%)
- know the error variation: σ (guess: range/4)
- decide on sample sizes (Russ Lenth Power)

Measure and store quantitative data, when possible,
not binary data.
Analysis

Conclusions from a statistical analysis are drawn in
the context of a statistical model. The correctness
and the relevance of the conclusion depend on the
correctness and the relevance of the model.

Model = assumptions about the observations


Systematic part (how the mean value of the response
depends on the factor levels / factor level
combinations)
Random part: independence, Normality and equal
variance
(independence follows from correct randomisation)
Conclusions

In case of need, contact a statistician!
... beforehand.