STAT 210

DESCRIPTIVE METHODS
First, we will consider some graphical and numerical methods that allow us to examine the
relationship between two numerical variables.
Example 7.1: Once again, consider the example in which the degree of clinical agreement
among different physicians on the presence or absence of generalized lymphadenopathy
was assessed. The data consist of the total number of palpable lymph nodes counted by
each of the two physicians and are given in the file LymphNodes.JMP.
The original data:
Previously, we calculated the difference between the two doctors’ counts in order to make comparisons:
Recall that we can select Analyze > Distribution to view the distribution of the differences:
Questions:
1. What can you say about the differences between these two doctors based on this output?
2. Based on only these univariate descriptive summaries, do you have any insight into
WHEN the two doctors tend to agree or disagree? That is, are there certain patients for
whom the doctors are closer in their estimates?
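The same univariate summary can be sketched outside of JMP. The counts below are invented for illustration and are NOT the values in LymphNodes.JMP; the names doctor_a and doctor_b are hypothetical, and the difference is assumed to be Doctor A's count minus Doctor B's count.

```python
from statistics import mean, median, stdev

# Hypothetical lymph node counts for six patients (not the LymphNodes.JMP data).
doctor_a = [7, 6, 12, 4, 9, 3]
doctor_b = [5, 6, 8, 4, 6, 2]

# Paired difference for each patient, assumed here to be A minus B.
differences = [a - b for a, b in zip(doctor_a, doctor_b)]

print("differences:", differences)
print("mean:", round(mean(differences), 2))
print("median:", median(differences))
print("standard deviation:", round(stdev(differences), 2))
```

Note that these summaries describe the differences alone; they do not show which patients produce the large or small differences.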
Scatterplots
Next, we will investigate the relationship between the two doctors’ counts via a scatterplot.
To create the scatterplot in JMP, select Analyze > Fit Y by X.
JMP returns the following:
Since we expect a 1:1 relationship in this example, we will make sure that the scales on the two
axes match. To do this, double-click on the y-axis. Then you can change the scale so that the y-axis runs from 0 to 20. Also, because we expect a 1:1 relationship, we can add the y=x line to
this plot:
Questions:
1. What does the y=x line represent?
2. How many points fall exactly on the line? What does this mean?
3. Most of the points fall below the y=x reference line. What does this tell you about the
amount of agreement between the two doctors? Explain.
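A numerical companion to the plot is to count how many points fall on, below, and above the y=x line. This is only a sketch: the counts are invented (not the LymphNodes.JMP data), and which doctor is plotted on which axis is an assumption made here for illustration.

```python
def compare_to_identity(x_counts, y_counts):
    """Count the points that fall on, below, and above the y = x line."""
    pairs = list(zip(x_counts, y_counts))
    return {
        "on": sum(1 for x, y in pairs if y == x),      # both counts identical
        "below": sum(1 for x, y in pairs if y < x),    # y-axis count is smaller
        "above": sum(1 for x, y in pairs if y > x),    # y-axis count is larger
    }

# Hypothetical counts; axis assignment is an assumption for this sketch.
doctor_a = [7, 6, 12, 4, 9, 3]   # treated as the x-axis variable
doctor_b = [5, 6, 8, 4, 6, 2]    # treated as the y-axis variable

print(compare_to_identity(doctor_a, doctor_b))
```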
Next, add a trend line based on the data to this plot. In JMP, select Fit Line from the red drop-down menu.
4. Notice the trend line (based on the data) and the y=x reference line start out in almost
the same spot. What does this tell you about the amount of agreement between these
two doctors?
5. The trend line appears to be flatter than the y=x reference line (i.e., the slope is smaller).
What does this tell us about the agreement between these two doctors?
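To compare a trend line with the y=x reference line numerically, we can compute the intercept and slope directly. The sketch below assumes the trend line is the ordinary least-squares line; the function name least_squares and the data are hypothetical.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical counts (not the LymphNodes.JMP data).
doctor_a = [7, 6, 12, 4, 9, 3]
doctor_b = [5, 6, 8, 4, 6, 2]

intercept, slope = least_squares(doctor_a, doctor_b)
print(f"trend line: y = {intercept:.2f} + {slope:.2f} x")
# The y = x line has intercept 0 and slope 1, so compare against those values.
print("flatter than y = x" if slope < 1 else "at least as steep as y = x")
```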
Correlation
The scatterplot is a GRAPHICAL representation of the relationship between the two doctors’
counts. We can also consider the correlation between the counts of the two doctors, which is a
NUMERICAL summary of this relationship.
To calculate the correlation in JMP, click on the red drop-down arrow next to Bivariate Fit and
choose Density Ellipse.
The density ellipse itself is beyond the scope of this course; however, this request in JMP gives
us the correlation coefficient:
Interpreting the Correlation Coefficient:
1. A positive correlation coefficient indicates a positive association between the two
numerical variables, and negative correlation indicates a negative association.
2. The correlation coefficient is ALWAYS between -1 and 1.
   • Values near zero indicate that a very weak linear relationship exists.
   • Values close to 1 indicate that a very strong positive linear relationship exists.
   • Values close to -1 indicate that a very strong negative linear relationship exists.
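The properties above can be checked with a small calculation. This sketch assumes the coefficient in question is the usual Pearson correlation; the function name pearson_r and the example values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical counts (not the LymphNodes.JMP data).
doctor_a = [7, 6, 12, 4, 9, 3]
doctor_b = [5, 6, 8, 4, 6, 2]

r = pearson_r(doctor_a, doctor_b)
print(f"r = {r:.3f}")  # the sign gives the direction, the size gives the strength
```

Points lying exactly on an increasing straight line give r = 1, and points lying exactly on a decreasing straight line give r = -1.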
Questions:
1. What does this correlation coefficient say about the direction of the relationship between
the counts of Doctors A and B? Does this agree with what you saw in the scatterplot?
2. What does this correlation coefficient say about the strength of the relationship between
the counts of Doctors A and B? Does this agree with what you saw in the scatterplot?
Example 7.2: The U.S. Environmental Protection Agency (EPA) declared El Paso, Texas, a time-critical emergency Superfund site in July of 2002. The emergency declaration was based on
35 residential sites sampled in West and Central El Paso neighborhoods that contained some
lead and arsenic above screening levels. The data in the file ElPasoLead.JMP contain
several variables measured in a follow-up study. Here, we will investigate the relationship
between YearsClose and IQ. That is, does living near the lead pollution source seem to have
an impact on IQ?
The following scatterplot shows the relationship between the two variables:
Questions:
1. Discuss the general trends you observe in this plot.
2. What impact does YearsClose have on IQ? Explain.
The Concept of Conditioning
We could also examine these data using side-by-side boxplots. Note that to do this in JMP, you
must first change YearsClose to an ordinal data type. Then, select Analyze > Fit Y by X and
place YearsClose in the X, Factor box and IQ in the Y, Response box. Select Display Options
from the red drop-down arrow, and choose Box Plots. Finally, make sure that X Axis
Proportional is UNCHECKED.
We can also display the Means and Standard Deviations for each level of YearsClose:
Questions:
1. Do the patterns that are observed in the side-by-side boxplots agree with what was
observed in the scatterplot? Explain.
2. What patterns do you see in the list of means and standard deviations? Do these agree
with the scatterplot and side-by-side boxplots? Explain.
The concept of observing a different mean and possibly a different standard deviation for each
level of the X-variable is important in an investigation of two numerical variables. This concept
is known as conditioning.
For example, identify the following conditional means and standard deviations:
• Mean of IQ|YearsClose=2:
• Standard Deviation of IQ|YearsClose=2:
• Mean of IQ|YearsClose=10:
• Standard Deviation of IQ|YearsClose=10:
Next, compare and contrast the two conditional means and standard deviations for
YearsClose=2 and YearsClose=10.
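Conditioning amounts to grouping the Y values by each level of X and summarizing every group separately. The sketch below shows the idea; the function name conditional_summaries is hypothetical, and the values are invented and do not reflect the ElPasoLead.JMP data.

```python
from collections import defaultdict
from statistics import mean, stdev

def conditional_summaries(x_values, y_values):
    """Group y by each distinct x and return {x: (n, mean, sd)}."""
    groups = defaultdict(list)
    for x, y in zip(x_values, y_values):
        groups[x].append(y)
    summaries = {}
    for x in sorted(groups):
        ys = groups[x]
        # A standard deviation needs at least two observations.
        sd = stdev(ys) if len(ys) > 1 else float("nan")
        summaries[x] = (len(ys), mean(ys), sd)
    return summaries

# Invented values for illustration only (not the ElPasoLead.JMP data).
years_close = [2, 2, 2, 10, 10, 10]
iq = [100, 104, 96, 90, 85, 95]

for years, (n, m, sd) in conditional_summaries(years_close, iq).items():
    print(f"IQ|YearsClose={years}: n={n}, mean={m}, SD={sd:.2f}")
```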
Example 7.3: Finally, consider an example that displays the relationship between Age and
Length of fish from Lake Mary in Minnesota (the data are in the file LakeMary.JMP).
First, we construct a scatterplot of Age vs. Length:
Next, we view the side-by-side boxplots:
Finally, we view the conditional means and standard deviations:
Question: Based on these summaries, what can you say about the relationship that exists
between Age and Length?
To identify trends, we can consider the following:
Connecting the Conditional Means
Connecting the Conditional Medians
Usually, you see a straight line model used to display this trend:
In the next section, we will discuss this method of fitting a straight line in detail.
SIMPLE LINEAR REGRESSION: MODELING Y WITH ONE X-VARIABLE
Example 7.4: Consider the following study which recently took place at the Winona Clinic. The
study involved measuring bone density in middle-aged women. The goal was to determine
whether or not forearm bone density could be used in place of other bone density
measurements which were more accepted (e.g., from the spine, hip or neck). Bone density
in this study was measured by a T-Score.
Category        T-Score
Normal          T-Score ≥ -1
Osteopenia      -2.5 ≤ T-Score ≤ -1
Osteoporosis    T-Score ≤ -2.5
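The categories can be expressed as a small classification rule. Because the boundary values -1 and -2.5 each appear in two categories above, this sketch has to make an assumption: a T-Score of exactly -1 is treated as Normal, and a T-Score of exactly -2.5 is treated as Osteoporosis. The function name classify_t_score is hypothetical.

```python
def classify_t_score(t_score):
    """Return the bone density category for a T-Score.

    Assumption for boundary values: -2.5 counts as Osteoporosis
    and -1 counts as Normal.
    """
    if t_score <= -2.5:
        return "Osteoporosis"
    if t_score < -1:
        return "Osteopenia"
    return "Normal"

for score in (0.2, -1.7, -3.0):
    print(score, classify_t_score(score))
```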
The data can be found in the file TScores.JMP, and a portion of the data is shown below:
First, let’s consider the relationship between Spine T-Score and Forearm T-Score. Select
Analyze > Fit Y by X, and place Forearm T-Score in the X, Factor box and Spine T-Score in the Y,
Response box. From the red drop-down menu, select Fit Line.
Original Scatterplot
With the Y=X Reference Line
Questions:
1. For what T-Scores do these two sets of measurements tend to agree? Discuss.
2. For what T-Scores do these two sets of measurements tend to disagree? Discuss.
To carry out a simple linear regression analysis in JMP, we can use the Analyze > Fit Y by X
menu (as shown above). However, we will use the Analyze > Fit Model menu instead. Place
the y-variable in the Y box and the x-variable in the Construct Model Effects box (place your
cursor on the variable name and select Add).
JMP returns the following output:
Understanding the Regression Output:
This p-value tests whether or not the overall regression model is
useful.
Ho: Regression model is NOT useful
Ha: Regression model is useful
p-value:
Decision:
Conclusion:
These p-values test hypotheses concerning both the intercept and the slope:
                Intercept            Slope
Ho:             Intercept = 0        Slope = 0
Ha:             Intercept ≠ 0        Slope ≠ 0
p-value:
Decision:
Conclusion:
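For reference, the test statistics behind these hypotheses can be sketched as follows, assuming the standard t-ratio (estimate divided by its standard error) with n - 2 degrees of freedom. The function name coefficient_t_stats and the data are hypothetical. The p-values themselves require the t distribution, which is not in the Python standard library, so they are not computed here.

```python
from math import sqrt

def coefficient_t_stats(x, y):
    """Return {name: (estimate, std_error, t_ratio)} for intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Sum of squared residuals and the residual standard deviation (n - 2 df).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    se_slope = s / sqrt(sxx)
    se_intercept = s * sqrt(1 / n + x_bar ** 2 / sxx)
    # Note: a perfect fit (sse == 0) would make the t-ratios undefined.
    return {
        "intercept": (intercept, se_intercept, intercept / se_intercept),
        "slope": (slope, se_slope, slope / se_slope),
    }

# Hypothetical T-Scores (not the TScores.JMP data).
forearm = [-2.1, -1.4, -0.6, 0.2, 0.9, 1.5]
spine = [-2.0, -1.1, -0.9, 0.1, 0.4, 1.2]

for name, (est, se, t) in coefficient_t_stats(forearm, spine).items():
    print(f"{name}: estimate={est:.3f}, std error={se:.3f}, t ratio={t:.2f}")
```

A t-ratio far from zero corresponds to a small p-value, that is, to evidence against Ho.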
This is the coefficient of determination, usually called R-Square (R²).
This quantity measures the percent of total variation in the y-variable
that can be explained by the x-variable.
R² =
Interpretation:
This is the Root Mean Square Error (RMSE). This quantity estimates
the average distance of a data point from the fitted line, measured
along the vertical axis. Such a distance is known as a residual.
The smaller the RMSE, the better the fit of the model.
RMSE =
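Both quantities come from the residuals of the fitted line. The sketch below assumes the usual definitions, R² = 1 - SSE/SST and RMSE = sqrt(SSE/(n - 2)); the function name fit_summary and the data are hypothetical.

```python
from math import sqrt

def fit_summary(x, y):
    """Return (intercept, slope, r_squared, rmse) for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed y minus the value on the fitted line.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)          # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)      # total variation in y
    r_squared = 1 - sse / sst
    rmse = sqrt(sse / (n - 2))
    return intercept, slope, r_squared, rmse

# Hypothetical T-Scores (not the TScores.JMP data).
forearm = [-2.1, -1.4, -0.6, 0.2, 0.9, 1.5]
spine = [-2.0, -1.1, -0.9, 0.1, 0.4, 1.2]

b0, b1, r2, rmse = fit_summary(forearm, spine)
print(f"R-Square = {r2:.3f}, RMSE = {rmse:.3f}")
```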
These values give the estimates of both the intercept and slope so that
we can write an equation for the fitted line:
Mean Spine|Forearm = -.19 + .81 × Forearm, or
Predicted Spine = -.19 + .81 × Forearm
Interpretations:
Predicting the Mean of Y Given X
To use the forearm T-score to predict the spine T-score, we simply use the equation:
Mean Spine|Forearm = -.19 + .81 × Forearm
For example, predict the spine T-score when the forearm T-score is 0.2:
Mean Spine|Forearm = -.19 + .81 × (0.2) = -.028 ≈ -.03
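The same prediction can be written as a short function using the rounded estimates from the fitted line. The function names are hypothetical, and the residual is assumed to be the observed value minus the predicted value; the observed spine T-score of 0.5 used below is invented.

```python
INTERCEPT = -0.19   # rounded estimate from the fitted line
SLOPE = 0.81        # rounded estimate from the fitted line

def predict_mean_spine(forearm_t_score):
    """Predicted mean spine T-score for a given forearm T-score."""
    return INTERCEPT + SLOPE * forearm_t_score

def residual(observed_spine, forearm_t_score):
    """Observed spine T-score minus the predicted value."""
    return observed_spine - predict_mean_spine(forearm_t_score)

print(round(predict_mean_spine(0.2), 2))   # -0.03
# Hypothetical patient with an observed spine T-score of 0.5.
print(round(residual(0.5, 0.2), 3))
```

A positive residual means the observed spine T-score lies above the fitted line; a negative residual means it lies below.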
To get these predicted values (and the residuals) in JMP, select Save Columns > Predicted
Values from the red drop-down arrow.
A portion of the results is shown below (after sorting the data by the residual values):