Please bring this handout with you every time you do data analysis in the computer lab. This handout is also available on my website. JMP Instructions A worked example Your biology instructor has encouraged you to investigate some aspects of a flowering plant wingstem. You have decided to investigate whether wingstem grows better in forested areas than in grasslands. You randomly choose wingstem from each of these two habitats for laboratory analysis, ending up with 22 plants - 12 from forest and 10 from grassland. In an attempt to impress your instructor you also note whether there is a trillium plant within 1 meter of the wingstem plant - there seemed to be quite a few trillium plants near wingstem plants in the forested habitat. Entering data in the table Each column must be one variable. For the purposes of this example, you have decided to weigh each plant (“wet weight” means it hasn’t had a chance to dry out), and to measure the length of its stem. If you are comparing forests and grasslands, then one column will be for “habitat” (forest or grassland), the 2nd column will be “wet weight”, the 3rd column will be “stem length”, and your final column will be “trillium nearby?”. You can enter these in any order you’d like. Make sure each column is assigned an appropriate “modeling type” (continuous, nominal, or ordinal data) as follows: In the Columns box, you should see each of your heading names with a symbol (button). Click on the button to get a pop-up menu and change the variable type. Most of the variables you use for analysis should be classified as either continuous (data that you measure in some manner) or nominal (data that you apply a name or category-type to). We may discuss ordinal data at a later time. Thus make sure that the “habitat” and “trillium nearby” columns have an “N” button (for nominal), and the “wet weight” and “stem length” columns have a “C” button (for continuous). The most recent version of jmp replaces the “N” with a bar graph, and the “C” with a line graph. Your data file will look something like this: 1 Which test to use? The answer to this question depends on (1) which questions you ask and on (2) the types of variables you are considering for asking your questions. Let’s consider a few examples. Example I. Do wingstem plants from grasslands tend to weigh more than wingstem plants from forested regions? To answer this question you would ask whether the mean “wet weight” of wingstem plants from grasslands is significantly greater than the mean “wet weight” of wingstem from forested regions. 2 There are two steps to this process: a. Identify your dependent and independent variables dependent variable - the variable you believe might be influenced or modified by some treatment or exposure independent variable - the variable you believe may be influencing or affecting the dependent variable In this case we believe that habitat (the independent variable) may be influencing “wet weight” (the dependent variable). Note that it doesn’t make sense to look at it the other way around! b. Decide whether each of your variables is nominal or continuous. If the dependent variable is continuous, and the independent variable is nominal, you will use a t-test for cases when your independent variable has two categories (such as this case where we have either forest or grassland for our “habitat” variable), or an analysis of variance (ANOVA) in cases where your independent variable has three or more categories. The table below summarizes all of the possible combinations. At this point we are confining the discussion to the first row of the table. Dependent variable Continuous Independent variable Nominal Type of comparison Test statistics Means Continuous Continuous Correlation or Regression Nominal Nominal Nominal Continuous Frequencies or Proportions or Percentages Discriminant Analysis or Logistic Regression t (for t-test) F (for ANOVA (analysis of variance)) r (correlation coefficient) R2 Contingency or X2 (chi-square) Will not be covered Table 1. Types of variables, types of comparisons and test statistics commonly used for data analysis You already knew that you were comparing means - now you know you will do a t-test. t-tests Please do the following: 1. Open or create your data file 2. .Analyze menu Fit Y by X 3. You should see a window as follows: 3 Remember that in this case, our dependent variable is “wet weight”, and we are hypothesizing that it is being influenced by the independent variable - “habitat”. Also make sure you understand that “wet weight” is continuous and “habitat” is nominal (you can confirm by looking at the little buttons in the select columns window). If you check out table 1, you will confirm that you will be comparing the “wet weight” mean for the forested habitat with the “wet weight” mean for the “grassland” habitat, and using a t-test to test for whether the means are statistically significantly different from each other. 4. Highlight “wet weight” and click on “y,response” in the “fit y by x” window. (In JMP language, “y, response” is the same as dependent variable) 5. Highlight “habitat” and click on “x-factor” in the “fit y by x” window. (In JMP language, xfactor is the same as independent variable). 6. Click OK. The following will come up on your screen (hopefully) 4 Oneway Analysis of “wet weight” in grams By habitat 8 wet weight in grams 7 6 5 4 3 2 1 0 forest grassland habitat 7. Click on the red triangle in the title bar to get a pop-up menu. Select “Means and Std Dev”. Below the graph you will see calculations of the means, Standard deviations, Standard Error of the Mean and confidence intervals for each of your independent variables. In the graph some blue lines have been added among your black data points. The center blue line is the mean. It is connected by a vertical blue line to a pair of short blue horizontal lines that represent the mean-plus-one-standard-error and the mean-minus-one-standard-error. Further away from the mean are two longer horizontal blue lines that represent the mean-plus-one-standarddeviation and the mean-minus-one-standard-deviation. 5 8. Click the red triangle again and select “ t test”. You will see the addition of another graph and other calculations. Don’t worry about the graph for now. There are two important numbers in this output. T Ratio = t = -1.81115. The negative sign doesn’t matter, and you should usually report t-vales down to two decimal points, so in your report you would state that t = 1.81 Prob > |t| = p value = 0.0866. Please refer to the end of this handout for information on interpreting p-values. Example II - Linear Correlation Do taller plants tend to weigh more than shorter plants. A more formal way of stating this in hypothesis form is as follows: Hypothesis: There is a positive correlation between “stem length” and “wet weight”. In this case we believe that “stem length” (the independent variable) may be influencing “wet weight” (the dependent variable). In this case you need to do some serious thinking to convince yourself that “stem length” is the factor that would influence “wet weight”, and not the other way around. Because we hypothesize that “stem length” is influencing “wet weight”, “stem length” is the independent variable and “wet weight” is the dependent variable. Both of the variables are continuous. If both variables are continuous, refer to Table 1 and note that the correct analysis is correlation or linear regression (these two analyses are functionally equivalent). JMP Steps 1. Analyze menu Fit Y by X 2. Click on your dependent variable - ““wet weight” in grams” (which should have a C beside it). then on “Y, Response”. Click on the independent variable - “stem length” (cm) - (which should also have a C beside it) and then on “X, Factor.” Click OK. You will see a scatter plot of the data. 6 3. Click on the red triangle in the title bar and select “ Fit Line.” You will see a red line appear through your data points. That is the best straight line that fits your data (it minimizes the sum of the distances from each data point to the line). A bunch of output will also appear below the graph. 7 For now I am concerned with two parts of this output. A. Under “summary of fit” you will note the RSquare = 0.880328. If you were reporting your results you would report that R2 = 0.88. In general, the closer R2 is to 1, the stronger is your correlation. The graph looks impressively linear, so it is no surprise to have a high R 2 value. B. Under “Parameter Estimates” you will note that if you go across from “stem length” and down from Prob > |t| you will find a value of <.0001. That tells us that the correlation between “stem length” and “wet weight” is very strong. Finally to determine if the correlation is positive or negative, you will need to look at the graph and see if the slope of the line is positive or negative. Example III - Contingency or Chi-square (X2) analysis - for this problem, use the made-up data table from the previous lab (the wingstem data table with 22 cases) Table 1. Wingstem data table from previous lab. Is trillium more likely to be associated with wingstem in forested regions than in grasslands? Again you could state this more formally: Hypothesis: There is a higher frequency of trillium near wingstem in forested regions than in grassland regions. 8 You are asking whether habitat type (forest or grassland) influences whether trillium is likely to be nearby (yes or no). Thus both variables are nominal. The dependent variable is “trillium nearby?” and the independent variable is “habitat type”. Refer to Table 2 and note that in cases in which both variables are nominal, you will compare frequencies or percentages using a contingency or chi-square test. Dependent variable Continuous Independent variable Nominal Type of comparison Test statistics Means Continuous Continuous Correlation or Regression Nominal Nominal Nominal Continuous Frequencies or Proportions or Percentages Discriminant Analysis or Logistic Regression t (for t-test) F (for ANOVA (analysis of variance)) r (correlation coefficient) R2 Contingency or X2 (chi-square) Will not be covered Table 2. Types of variables, types of comparisons and test statistics commonly used for data analysis JMP steps 1. Analyze menu Fit Y by X 2. Click on your dependent variable - “trillium nearby?’ (which should have an N beside it) then on “Y, Response.” Click on the independent variable (which should have an N beside it), and then on “X, Factor”. Click OK. This is what you will see if you are a lucky person. 9 3. You will see a “Mosaic Plot,” which shows, for each independent variable, the percentage of total observations in each category of the Y axis. 4. Below the mosaic plot you will see a “Contingency table.” The upper left square in the table explains to you what each % of the table refers to (check with me to get a more complete explanation). 5. Below the contingency table are the test statistics. For our purposes there are three numbers which are relevant. 10 A. DF stands for degrees of freedom: In the “source” table you should note that the model DF = 1. Your instructor may elect to explain degrees of freedom to you. B. In the “Test” table, note that the Likelihood Ratio ChiSquare value =- 3.005, and the Prob>ChiSq = 0.083. OK, how might you go about reporting this? I would recommend something like the following: Eight out of 12 (66.67%) of the wingstem from the forested region had trillium nearby, while 3 out of 10 (30%) of the wingstem from the grassland had trillium nearby. These differences in frequency are not statistically significantly different (X2 = 3.005, df = 1, p = 0.083). CHI-Square (contingency table) analysis - the easy way The data table you made allowed you to test the hypothesis that there is a higher frequency of trillium near wingstem in forested regions than in grassland regions. The data lean in that direction, as follows: 66.67% of the wingstem had trillium nearby in the forested region 30% of the wingstem had trillium nearby in the grassland Please check the contingency table output to make sure you see where these percentages came from. Based on these data, your conclusion should be that the p value of 0.083 is not low enough to support your research hypothesis. I am also hoping that your conclusion will also be that you need more data to adequately test the hypothesis. Unfortunately for you, your instructor felt the same way, and sent you back out to collect more data. The new data look like this: Forested habitat Grassland habitat Trillium nearby? Yes no 21 41 37 18 One problem with your previous data table was that it was very cumbersome. You had to type in yes or no for each case, which was OK when it was 22 times, but would be a real drag when you had to do it for these data (58 yeses and 59 nos) Rejoice - there is an easy way. Just enter the table that looks like this: 11 Then do the following steps: 1. Analyze menu Fit Y by X 2. Click on your dependent variable - “trillium nearby?’ (which should have an N beside it) then on “Y, Response.” Click on the independent variable - habitat (which should have an N beside it), and then on “X, Factor”. Click on “how often”, which is your weighting variable, and click on “weight”. Click OK. The output will be exactly the same format as before, but the actual numbers will change as a result of your valiant data collection. You will get a graph that looks like this: and some data that look like this: 12 How would you report these findings. Do the results support or fail to support your hypothesis? Example IV. Analysis of Variance (ANOVA) Let’s go back to the case in which the dependent variable is continuous and the independent variable is nominal. Just to remind you, you could organize your understanding of the possible cases with the following, now famous, table 1. Dependent variable Continuous Independent variable Nominal Type of comparison Test statistics Means Continuous Continuous Correlation or Regression Nominal Nominal Nominal Continuous Frequencies or Proportions or Percentages Discriminant Analysis or Logistic Regression 13 t (for t-test) F (for ANOVA (analysis of variance)) r (correlation coefficient) R2 Contingency or X2 (chi-square) Will not be covered Table 1. Types of variables, types of comparisons and test statistics commonly used for data analysis You will notice that there are two possibilities: t-test and ANOVA. Which do you use? The answer is very simple. If you’re comparing the means for two categories, like when you compared the “wet weight” of wingstem in forested vs. the “wet weight” of wingstem in grassland habitats, then you use a t-test. When you are comparing the means for three or more categories, then you use an ANOVA. From what we know about mushrooms, we could propose the following: Hypothesis: Wetter and shadier habitats will tend to have a higher species richness of mushrooms than dryer sunnier habitats. Prediction: If the hypothesis is true, experimental quadrats in forests along the river will have the greatest mean number of species, while experimental quadrats in dry grasslands will have lowest mean species richness. Forest away from the river (deep forest) and grasslands along the river would have intermediate values. (The hypothesis does not make distinguishing predictions between the latter two categories.) Notice that we have four categories: Deep forest, river forest, dry grassland and river grassland. Here are the made-up data in a JMP table: 14 Table 2. Made-up data of number of different species of mushrooms collected from five 10 X 10 M quadrats in each of four habitats. Notice that I have a column called “replicate”. Each replicate (or replication) is a sample of the category. Because we set up five quadrats in each habitat, we have five replicates for each of the four habitats. I put the replicate numbers in the data table because it facilitates discussion. For example I could say “replicate 3 of river forest”, and you would know I’m referring to the quadrat with 12 different species of mushrooms. Numbering your replicates is very useful for complex data tables. But for simple data tables you probably don’t need to worry about it. OK. So we want to test our Prediction that experimental quadrats in river forest will have the greatest mean number of species, while experimental quadrats in dry grasslands will have lowest mean species richness. Deep forest and river grasslands will have intermediate values. Please do the following: 1. Open or create your data file 2. .Analyze menu Fit Y by X 3. Select “species richness” as your Y,response, and “habitat” as your X, factor. Click OK. 4. You will get a graph that shows the data points for each category. 15 Figure 1. Data points for each of four habitat categories. 5. Then click on the red triangle to the left of “oneway”, drag down to means/anova, and you should see a new window with a great deal of data on it. You will also that the graph has been redrawn, with the gray horizontal bar representing the mean number of species overall, the longer green horizontal bar representing the mean for each category, and the top and bottom of the diamond shape representing the 95% confidence intervals. Based on these samples we are 95% confident that the true mean for that category falls between the top tip and bottom tip of the diamond. We won’t worry about the meaning of the two shorter green horizontal bars for now. 16 Here are the essential data from this table: A. In the “Summary of Fit” table, the overall mean of response = 3.8. You won’t report that, but it’s nice to know. B. In the Analysis of Variance (ANOVA) table, you will note the DF (degrees of freedom) for habitat = 3, and DF for error = 16. The F-ratio is 11.90, and P = 0.0002. You would report that as follows: “F3,16 = 11.90, P = 0.0002.” 17 So what does this mean? Because the P-value is so low (much less than 0.05), we can support our hypothesis that species richness is significantly different for different habitats. C. In the “Means for Oneway Anova” table, you will see the means and standard errors reported for each category. Beware! Those standard errors are for the entire sample, not for the individual categories. To get the standard error for each category, click on the red triangle and drag to “Means and Std Dev.” You’ll get the following: These are the correct values for means, standard deviations, and standard errors. Your report should include (either in the text of the paper, in a figure, or in a table) the means for each category, and either the standard error of the mean, or the standard deviation. The frustrating thing is that while we now know that habitat influences mushroom species richness, we still don’t know which category is significantly different from which. In other words we still haven’t fully tested the predictions of our hypothesis. There’s only one more step. 6. Go back to that little red triangle, drag to “compare means”, then drag to all “pairs Tukey HSD”. You will get the box below: 18 The bottom portion of this output is of relevance. You will notice that “Levels not connected by same letter are significantly different.” (for our purpose levels = categories). For example the mean (7.6) for River forest has an A next to it. Thus its mean is significantly greater than the 2 categories that don’t have an A next to them (both grassland categories), but it isn’t significantly greater than Deep forest (which also has an A next to it). Similarly Deep forest’s mean is significantly greater than Dry grassland’s mean, but not River grassland (Because Deep forest and River grassland both share the letter B). So that’s how we will be comparing means in this class when there are more than two categories for a nominal independent variable. Notice that I used the term significantly greater rather than significantly different - if the means are different I want to know which is greater than which. Appendix - HYPOTHESIS TESTING We used a t-test to address the question of whether the mean “wet weight” of wingstem plants from grasslands is significantly greater than the mean “wet weight” of wingstem from forested regions. The t-test, and all of the statistics we have been discussing, allow you to ascribe a confidence value to the conclusions you will draw based on the analysis. In this case we are formally testing the hypothesis that wingstem from grasslands on average weigh more than wingstem from forested regions. We can (and should!) compare the means and say that the answer is “yes”, based on the sample we collected. But we’re still not sure whether the same conclusion can be drawn for the entire population of wingstem plants, in part because we haven’t weighed very many plants. Are we justified in concluding that there is truly a significant difference in ‘“wet weight”” based on our sample? If so, how comfortable do we feel about this conclusion? The answer to this question is that we will never be able to say for sure wingstem in grasslands weigh more on average than wingstem in forested regions unless we test the entire population. Fortunately for us, statisticians have developed techniques that allow us to make inferences from relatively small samples to much larger populations. The hypothesis we wish to test is called the research hypothesis (HA). In this case HA is that there truly is a difference in mean wingstem “wet weight” between the two habitats . To establish some credibility for the research hypothesis, we first define a null hypothesis (Ho) that directly opposes HA. In this case Ho is that in reality, there is no difference in mean wingstem “wet weight” weight between the two habitats. To support HA, we attempt to show that the sample provides evidence that Ho is unlikely. How unlikely? It is up to the researcher to decide how unlikely Ho must be before it can be rejected. This is known as the critical value (). The most common is .05. This means that if Ho is true, we would expect to see such a difference between the mean wingstem ““wet weight”s” in the two habitats in only one of every 20 samples. Do the data allow us to reject Ho? The t- test allows us to compute the probability of how often we would expect to have a difference as great as we observe for two habitats by chance 19 alone (i.e. if Ho were true). The calculation of the frequency with which we would expect to see such a difference if Ho is true is the p-value. If p =.01, we will expect to see the observed difference between the two groups (in other words between the ““wet weight”s” of wingstems in forest vs. grassland) due to chance alone, only one out of 100 times. That’s very unlikely, so as scientists, we can feel pretty good claiming that the Ho is probably wrong, and therefore that the HA is supported. In a formal sense, we can then compare the p-value to ; if p < as it is when p = .01), then we can reject the Ho. This provides support for HA, but does not prove it (recall we can prove our research hypothesis only by measuring the entire population). Also beware! If we cannot reject Ho, that does not mean that Ho is correct. It is possible that we would be able to reject Ho if we had a larger sample size. The key realization for you to make is that the higher the value of t, the lower your p-value. The lower your p-value, the more confident you are about rejecting Ho and supporting HA. In the case presented at the beginning of this handout, the p value = 0.0866. This is greater than our critical value of 0.05. Thus we are unable to reject H0, and fail to support the research hypothesis. However, it is still possible that in the natural world, wingstem on average have higher mean “wet weight” in grasslands than in forested regions. Our data are not sufficient to support this hypothesis. A big however is that we really were much too lazy and should have collected many more plants (in other words, our sample size was pretty pitiful). One wonderful aspect of this approach to hypothesis testing is that all of the tests you use will generate a p-value. The p-value will be interpreted in the same manner regardless of whether you are using a t-test, ANOVA, linear regression, or chi-square test. If p << 0.05, you have supported your research hypothesis If P >> 0.05, you have failed to support your research hypothesis. But if P is in the neighborhood of 0.05, you are in the uncomfortable zone, in that it is not clear whether you have support for your research hypothesis. The reality is that there really is nothing magical about a critical value of 0.05, so maybe we should say that between 0.04 and 0.08 we have some weak support for our research hypothesis. We’ll talk about this later in the year. 20