anova_intro

Research problem (dataset can be found online):
We have measurements of type, petal width (PW), petal length (PL), sepal width
(SW), and sepal length (SL) for a sample of 150 irises. The lengths are measured in
millimeters. Type 0 is Setosa; type 1 is Virginica; and type 2 is Versicolor.
Statistical analysis steps:
1. Define your research problem and its context. Ask your questions in research terms.
For example: You study irises. How do their characteristics vary depending on type?
The questions can get more specific than that.
2. Once you have defined your research problem, ask the questions in statistical terms.
E.g.: Is the mean sepal width for the Setosa different from 3.5 mm? Maybe you found this
claim in a journal, you think it is not right, and you want to provide some scientific
evidence against it.
3. Collect data: draw a random sample and measure sepal width.
4. Find the method to test this claim. What are your claims? What are your hypotheses?
What are your data? Here, you want to check if the mean of a sample is different or not
from a certain value. What statistical method do you choose? One-sample t-test.
Then, what if we want to know if the mean sepal width is different for two types of
irises? Two-sample t-test. What if your two samples are dependent? Paired t-test.
What if we have three types of irises and we want to know if there are significant
differences between the three in terms of the sepal width? One-way ANOVA. Notice we
have a continuous, numerical response, and a categorical (nominal) factor of interest with
3 or more levels. What if there is another factor (a covariate) that you know has an
impact but that you cannot control? ANCOVA. Two factors of interest? Two-way ANOVA
(with more than two factors, multi-way ANOVA).
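
To make two of these choices concrete, here is a minimal Python sketch of the test statistics behind the one-sample t-test and the one-way ANOVA. The function names and the numbers are made up for illustration (they are not the iris measurements). Only the test statistic is computed; the P-value would still come from JMP or from t and F tables.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """t statistic for H0: the population mean equals mu0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 divisor
    return (mean(sample) - mu0) / standard_error


def one_way_anova_f(groups):
    """F statistic for H0: all group means are equal."""
    k = len(groups)                            # number of factor levels
    n_total = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Toy data, invented for this sketch only.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

# One sample against a claimed mean of 2: t = sqrt(3), about 1.73.
print(one_sample_t(toy_groups[1], mu0=2))

# Three groups: between-group mean square 13, within-group mean square 1,
# so F is about 13.
print(one_way_anova_f(toy_groups))
```

A large F says the group means differ by more than the within-group scatter would explain; it does not say which groups differ, which is where a follow-up such as Tukey HSD comes in.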
Note: We like to plot data and get a visual feel whenever we can!
Statistical Concepts:
1. Random sample: helps ensure the sample is representative of the population
2. Data: prefer continuous measurements; they contain more information
3. Hypothesis testing. Different types of alternatives, corresponding to the research
question of interest. (Analogy to trial: considered innocent (null) until proven guilty
(alternative))
4. Normal distribution: bell curve
5. t distribution: fatter tails than the normal
6. In statistics, the terms Type I error (also α error, or false positive) and Type II
error (also β error, or false negative) describe the possible errors made in a
statistical decision process.
Type I (α): reject the null hypothesis when the null hypothesis is true, and
Type II (β): fail to reject the null when the null hypothesis is false
7. Tukey HSD. The test compares the mean of every treatment to the mean of every
other treatment; that is, it applies simultaneously to the set of all pairwise
comparisons and identifies where the difference between two means is greater than
the standard error would be expected to allow. The confidence level for the set,
when all sample sizes are equal, is exactly 1 – α.
8. P-value. Pick the one from the JMP output corresponding to the alternative of
interest. The P value is a probability, with a value ranging from zero to one. It is the
answer to this question: If the populations really have the same mean overall, what is
the probability that random sampling would lead to a difference between sample
means as large (or larger) than you observed? If the P value is 0.03, that means that
there is a 3% chance of observing a difference as large as you observed even if the
two population means are identical. It is tempting to conclude, therefore, that there is
a 97% chance that the difference you observed reflects a real difference between
populations and a 3% chance that the difference is due to chance. Wrong. What you
can say is that random sampling from identical populations would lead to a difference
smaller than you observed in 97% of experiments and larger than you observed in 3%
of experiments. You have to choose. Would you rather believe in a 3% coincidence?
Or that the population means are really different?
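
The question behind the P-value ("if the populations really have the same mean, how often would random sampling give a difference this large or larger?") can be answered directly by simulation. The sketch below uses a permutation test: it repeatedly shuffles the group labels, which mimics sampling from identical populations, and counts how often the shuffled difference in means is at least as large as the observed one. This is a swapped-in illustration, not the calculation JMP performs, and the function name and data are made up.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Approximate two-sided P-value for a difference in means."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    hits = 0
    for _ in range(n_shuffles):
        # Shuffling the pooled values reassigns group labels at random,
        # which is what "the populations are identical" would look like.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance so floating-point ties count as "as large as".
        if diff >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles


# Toy data, invented for this sketch only.
separated = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
identical = permutation_p_value([1, 2, 3], [1, 2, 3])
print(separated)   # small: such a gap is rare under identical populations
print(identical)   # 1.0: every shuffle matches an observed difference of 0
```

A small value here means exactly what the text says: random sampling from identical populations would rarely produce a difference as large as the observed one. It is not the probability that the null hypothesis is true.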