Research problem (dataset can be found online): We have measurements of type, petal width (PW), petal length (PL), sepal width (SW), and sepal length (SL) for a sample of 150 irises. The lengths are measured in millimeters. Type 0 is Setosa; type 1 is Virginica; and type 2 is Versicolor.

Statistical analysis steps:
1. Define your research problem and its context. Ask your questions in research terms. For example: you study irises. How do their characteristics vary depending on type? The questions can get more specific than that.
2. Once the research problem is defined, ask the questions in statistical terms. E.g.: Is the mean sepal width for Setosa different from 3.5 mm? Maybe you found this claim in a journal, you think it is not right, and you want to provide some scientific evidence against it.
3. Collect data: draw a random sample and measure sepal width.
4. Find the method to test this claim. What are your claims? What are your hypotheses? What are your data?
Here, you want to check whether the mean differs from a certain value. What statistical method do you choose? One-sample t-test.
What if we want to know whether the mean sepal width differs between two types of irises? Two-sample t-test.
What if your two samples are dependent? Paired t-test.
What if we have three types of irises and we want to know whether there are significant differences among the three in terms of sepal width? One-way ANOVA. Notice that we have a continuous, numerical response and a categorical (nominal) factor of interest with 3 or more levels.
What if there is another factor (a covariate) that you know has an impact but that you cannot control? ANCOVA.
Two factors of interest? Two-way ANOVA (with more than two factors, a multi-way ANOVA).

Note: We like to plot the data and get a visual feel for it whenever we can!

Statistical Concepts:
1. Random sample: ensures that the sample is representative of the population.
2. Data: prefer continuous measurements, since they contain more information.
3. Hypothesis testing.
Different types of alternatives correspond to the research question of interest. (Analogy to a trial: the defendant is considered innocent (null) until proven guilty (alternative).)
4. Normal distribution: the bell curve.
5. T distribution: fatter tails than the normal.
6. Type I and Type II errors. In statistics, the terms Type I error (also α error, or false positive) and Type II error (β error, or false negative) describe the possible errors made in a statistical decision process. Type I (α): reject the null hypothesis when the null hypothesis is true. Type II (β): fail to reject the null hypothesis when the null hypothesis is false.
7. Tukey HSD. The test compares the mean of every treatment to the mean of every other treatment; that is, it applies simultaneously to the set of all pairwise comparisons and identifies where the difference between two means is greater than the standard error would be expected to allow. Confidence for the set, when all sample sizes are equal, is exactly 1 - α.
8. P-value. Pick the one from the JMP output that corresponds to the alternative of interest.
The P value is a probability, with a value ranging from zero to one. It is the answer to this question: If the populations really have the same mean overall, what is the probability that random sampling would lead to a difference between sample means as large as (or larger than) the one you observed?
If the P value is 0.03, there is a 3% chance of observing a difference at least as large as the one you observed even if the two population means are identical. It is tempting to conclude, therefore, that there is a 97% chance that the difference you observed reflects a real difference between populations and a 3% chance that the difference is due to chance. Wrong. What you can say is that random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments and larger than you observed in 3% of experiments.
You have to choose. Would you rather believe in a 3% coincidence? Or that the population means are really different?
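
The "random sampling from identical populations" reading of the P value can be made concrete with a small simulation. The sketch below is only an illustration, not the t-test that JMP reports: it swaps in a permutation test, which shuffles the group labels to mimic two groups drawn from the same population and counts how often the shuffled difference in means is at least as large as the observed one. The function name `permutation_p_value`, its parameters, and the two small samples are made up for this example and are not taken from the iris dataset.

```python
import random
from statistics import fmean


def permutation_p_value(sample_a, sample_b, n_shuffles=10_000, seed=0):
    """Approximate two-sided P value for a difference in means.

    Hypothetical helper: shuffling the pooled values imitates drawing both
    groups from one identical population (the null hypothesis).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(sample_a) - fmean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)

    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1

    # Fraction of "identical population" experiments with a difference
    # at least as large as the one actually observed.
    return at_least_as_large / n_shuffles


# Made-up measurements for two groups (illustrative only, not real iris data).
group_a = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4]
group_b = [3.2, 3.2, 3.1, 2.3, 2.8, 2.8, 3.3, 2.4]

p_value = permutation_p_value(group_a, group_b)
print(f"Observed difference in means: {fmean(group_a) - fmean(group_b):.3f}")
print(f"Approximate two-sided P value: {p_value:.4f}")
```

A small result says that a difference this large would be rare if the two populations were identical. It does not say that the alternative is true with probability one minus that value, which is the misreading warned about above.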