DESCRIPTIVE METHODS

First, we will consider some graphical and numerical methods that allow us to examine the relationship between two numerical variables.

Example 7.1: Once again, consider the example in which the degree of clinical agreement among different physicians on the presence or absence of generalized lymphadenopathy was assessed. The data consist of the total number of palpable lymph nodes counted by each of the two physicians and are given in the file LymphNodes.JMP.

The original data:

Previously, we calculated the difference between the two doctors’ counts in order to make comparisons:

Recall that we can select Analyze > Distribution to view the distribution of the differences:

Questions:
1. What can you say about the differences between these two doctors based on this output?
2. Based on only these univariate descriptive summaries, do you have any insight into WHEN the two doctors tend to agree or disagree? That is, are there certain patients for whom the doctors are closer in their estimates?

Scatterplots

Next, we will investigate the relationship between the two doctors’ counts via a scatterplot. To create the scatterplot in JMP, select Analyze > Fit Y by X.

JMP returns the following:

Since we expect a 1:1 relationship in this example, we will make sure that the scales on the two axes match. To do this, double-click on the y-axis. Then you can change the scale so that the y-axis runs from 0 to 20.

Also, because we expect a 1:1 relationship, we can add the y=x line to this plot:

Questions:
1. What does the y=x line represent?
2. How many points fall exactly on the line? What does this mean?
3. Most of the points fall below the y=x reference line. What does this tell you about the amount of agreement between the two doctors? Explain.

Next, add a trend line based on the data to this plot. In JMP, select Fit Line from the red drop-down menu.

4. Notice that the trend line (based on the data) and the y=x reference line start out in almost the same spot. What does this tell you about the amount of agreement between these two doctors?
5. The trend line appears to be flatter than the y=x reference line (i.e., the slope is smaller). What does this tell us about the agreement between these two doctors?

Correlation

The scatterplot is a GRAPHICAL representation of the relationship between the two doctors’ counts. We can also consider the correlation between the counts of the two doctors, which is a NUMERICAL summary of this relationship. To calculate the correlation in JMP, click on the red drop-down arrow next to Bivariate Fit and choose Density Ellipse. The density ellipse itself is beyond the scope of this course; however, this request in JMP gives us the correlation coefficient:

Interpreting the Correlation Coefficient:
1. A positive correlation coefficient indicates a positive association between the two numerical variables, and a negative correlation coefficient indicates a negative association.
2. The correlation coefficient is ALWAYS between -1 and 1. Values near zero indicate that a very weak linear relationship exists. Values close to 1 indicate that a very strong positive linear relationship exists. Values close to -1 indicate that a very strong negative linear relationship exists.

Questions:
1. What does this correlation coefficient say about the direction of the relationship between the counts of Doctors A and B? Does this agree with what you saw in the scatterplot?
2. What does this correlation coefficient say about the strength of the relationship between the counts of Doctors A and B? Does this agree with what you saw in the scatterplot?

Example 7.2: The U.S. Environmental Protection Agency (EPA) declared El Paso, Texas, a time-critical emergency Superfund site in July of 2002. The emergency declaration was based on 35 residential sites sampled in West and Central El Paso neighborhoods that contained some lead and arsenic above screening levels. The data in the file ElPasoLead.JMP contain several variables measured in a follow-up study. Here, we will investigate the relationship between YearsClose and IQ. That is, does living near the lead pollution source seem to have an impact on IQ?

The following scatterplot shows the relationship between the two variables:

Questions:
1. Discuss the general trends you observe in this plot.
2. What impact does YearsClose have on IQ? Explain.

The Concept of Conditioning

We could also examine these data using side-by-side boxplots. Note that to do this in JMP, you must first change YearsClose to an ordinal data type. Then, select Analyze > Fit Y by X and place YearsClose in the X, Factor box and IQ in the Y, Response box. Select Display Options from the red drop-down arrow, and choose Box Plots. Finally, make sure that X Axis Proportional is UNCHECKED.

We can also display the Means and Standard Deviations for each level of YearsClose:

Questions:
1. Do the patterns that are observed in the side-by-side boxplots agree with what was observed in the scatterplot? Explain.
2. What patterns do you see in the list of means and standard deviations? Do these agree with the scatterplot and side-by-side boxplots? Explain.

The concept of observing a different mean and possibly a different standard deviation for each level of the X-variable is important in an investigation of two numerical variables. This concept is known as conditioning. For example, identify the following conditional means and standard deviations:

Mean of IQ|YearsClose=2:
Standard Deviation of IQ|YearsClose=2:
Mean of IQ|YearsClose=10:
Standard Deviation of IQ|YearsClose=10:

Next, compare and contrast the two conditional means and standard deviations for YearsClose=2 and YearsClose=10.
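For readers who want to see conditioning outside of JMP, the following is a minimal Python sketch of computing a mean and a sample standard deviation of Y for each level of X. The data values and the helper name conditional_summaries are hypothetical and chosen only for illustration; they are not the values in ElPasoLead.JMP.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (YearsClose, IQ) pairs for illustration only;
# these are NOT the values stored in ElPasoLead.JMP.
records = [
    (2, 100), (2, 104), (2, 96),
    (10, 90), (10, 94), (10, 86),
]


def conditional_summaries(pairs):
    """Return {x: (mean of y, sample SD of y)} for each distinct x value."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    # stdev() needs at least two observations per level of x.
    return {x: (mean(ys), stdev(ys)) for x, ys in sorted(groups.items())}


summaries = conditional_summaries(records)
for x, (m, s) in summaries.items():
    print(f"Mean of IQ|YearsClose={x}: {m:.1f}   SD of IQ|YearsClose={x}: {s:.1f}")
```

Each entry of the result corresponds to one row of a means-and-standard-deviations table: the Y values are first restricted to a single level of X, and only then summarized.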
Example 7.3: Finally, consider an example that displays the relationship between Age and Length of fish from Lake Mary in Minnesota (the data are in the file LakeMary.JMP).

First, we construct a scatterplot of Age vs. Length:

Next, we view the side-by-side boxplots:

Finally, we view the conditional means and standard deviations:

Question: Based on these summaries, what can you say about the relationship that exists between Age and Length?

To identify trends, we can consider the following:

Connecting the Conditional Means

Connecting the Conditional Medians

Usually, a straight-line model is used to display this trend:

In the next section, we will discuss this method of fitting a straight line in detail.

SIMPLE LINEAR REGRESSION: MODELING Y WITH ONE X-VARIABLE

Example 7.4: Consider the following study, which recently took place at the Winona Clinic. The study involved measuring bone density in middle-aged women. The goal was to determine whether or not forearm bone density could be used in place of other, more widely accepted bone density measurements (e.g., from the spine, hip, or neck). Bone density in this study was measured by a T-Score.

Category        T-Score
Normal          T-Score ≥ -1
Osteopenia      -2.5 < T-Score < -1
Osteoporosis    T-Score ≤ -2.5

The data can be found in the file TScores.JMP, and a portion of the data is shown below:

First, let’s consider the relationship between Spine T-Score and Forearm T-Score. Select Analyze > Fit Y by X, and place Forearm T-Score in the X, Factor box and Spine T-Score in the Y, Response box. From the red drop-down menu, select Fit Line.

Original Scatterplot

With the Y=X Reference Line

Questions:
1. For what T-Scores do these two sets of measurements tend to agree? Discuss.
2. For what T-Scores do these two sets of measurements tend to disagree? Discuss.

To carry out a simple linear regression analysis in JMP, we can use the Analyze > Fit Y by X menu (as shown above). However, we will use the Analyze > Fit Model menu instead. Place the y-variable in the Y box and the x-variable in the Construct Model Effects box (place your cursor on the variable name and select Add).

JMP returns the following output:

Understanding the Regression Output:

This p-value tests whether or not the overall regression model is useful.

Ho: Regression model is NOT useful
Ha: Regression model is useful
p-value:
Decision:
Conclusion:

These p-values test hypotheses concerning both the intercept and the slope:

Intercept
Ho: Intercept = 0
Ha: Intercept ≠ 0
p-value:
Decision:
Conclusion:

Slope
Ho: Slope = 0
Ha: Slope ≠ 0
p-value:
Decision:
Conclusion:

This is the coefficient of determination, usually called R-Square (R2). This quantity measures the proportion (often reported as a percent) of the total variation in the y-variable that can be explained by the x-variable.

R2 =
Interpretation:

This is the Root Mean Square Error (RMSE). A residual is the vertical distance between a data point and the fitted line, and the RMSE estimates the typical size of these residuals. The smaller the RMSE, the better the fit of the model.

RMSE =

These values give the estimates of both the intercept and slope so that we can write an equation for the fitted line:

Mean Spine|Forearm = -.19 + .81 Forearm, or
Predicted Spine = -.19 + .81 Forearm

Interpretations:

Predicting the Mean of Y Given X

To use the forearm T-score to predict the spine T-score, we simply use the equation:

Mean Spine|Forearm = -.19 + .81 Forearm

For example, predict the spine T-score when the forearm T-score is 0.2:

Mean Spine|Forearm = -.19 + .81 (0.2) = -.028, or about -.03

To get these predicted values (and the residuals) in JMP, select Save Columns > Predicted Values (and Save Columns > Residuals) from the red drop-down arrow. A portion of the results is shown below (after sorting the data by the residual values):
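The prediction step can also be sketched in a few lines of Python. The intercept (-.19) and slope (.81) below are the estimates reported in this handout; the function names predict_spine and residual, and the observed spine T-score of 0.5, are hypothetical and used only for illustration.

```python
# Intercept and slope estimates as reported in the fitted line above.
INTERCEPT = -0.19
SLOPE = 0.81


def predict_spine(forearm):
    """Predicted mean spine T-score for a given forearm T-score."""
    return INTERCEPT + SLOPE * forearm


def residual(observed_spine, forearm):
    """Residual = observed value minus the value predicted by the line."""
    return observed_spine - predict_spine(forearm)


predicted = predict_spine(0.2)
print(round(predicted, 2))  # -0.03, matching the hand calculation

# Hypothetical patient: forearm T-score 0.2, observed spine T-score 0.5.
example_residual = residual(0.5, 0.2)
print(round(example_residual, 3))
```

A positive residual means the observed spine T-score lies above the fitted line, and a negative residual means it lies below.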