Linear regression - Integrated Quantitative Science

Integrated Quantitative Science
Regression 1 Homework
1. Many metals expand and contract with changes in temperature. Here are some measurements of
the length of a particular copper rod, measured at different temperatures.
x = Temperature (°C)    y = Length (mm)
20.1                    2461.16
28.2                    2461.49
38.5                    2461.88
44.6                    2462.10
57.4                    2462.62
66.2                    2462.93
78.1                    2463.38
a. Scatterplot this data in Excel. Based on your plot, do you think a linear model is a
reasonable way to model the relationship between temperature and length for this rod?
Justify your answer.
b. Find the equation for the linear model that best fits this data. Add this line to your
scatterplot from part a.
c. Using your model, predict the length of this rod when its temperature is 25.0 °C, and when
it is 82.0 °C.
Judging linear model fit.
In the preceding exercise, you were asked to judge the reasonableness of fitting a linear model to a data
set. Researchers who use linear regression models like to use a quantity called the coefficient of
determination (the square of the Pearson correlation coefficient) to measure how well a regression line
fits a particular data set. This quantity, usually denoted by $R^2$, can be computed directly from the data
set, thanks to a really ugly formula (that we don’t have to worry about). This number is always between
0 and 1, and we can use its value to measure how well a regression line fits a data set like this:
• Values of $R^2$ that are close to 1 indicate a good fit.
• Values of $R^2$ that are close to 0 indicate a poor fit.
Fortunately, Excel has a built-in function, called RSQ, that will compute the value of $R^2$ for us. It works
exactly the same as the SLOPE and INTERCEPT commands we learned in class. So, to compute RSQ
for a data set, do exactly the same steps that we use for the SLOPE command, except replace “SLOPE”
with “RSQ”.
Example: The value of $R^2$ for the following data set is 0.9815. You can check that you are using the
RSQ command correctly by trying this yourself in Excel, and seeing if you get 0.9815 too.
t    y
1    2.1
2    3.4
3    3.8
4    5.1
5    5.9
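For anyone who wants to double-check the example outside Excel, here is a minimal Python sketch of the quantities that SLOPE, INTERCEPT, and RSQ return for a simple linear fit. The helper name `linear_fit` is hypothetical and is not part of this assignment. It uses the standard least-squares formulas rather than anything Excel-specific.

```python
def linear_fit(x, y):
    """Return (slope, intercept, r_squared) of the least-squares line y = slope*x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of squares and cross-products of the deviations from the means
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = sxy ** 2 / (sxx * syy)  # square of the Pearson correlation
    return slope, intercept, r_squared

# Example data set from the handout
t = [1, 2, 3, 4, 5]
y = [2.1, 3.4, 3.8, 5.1, 5.9]

slope, intercept, r_squared = linear_fit(t, y)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, R^2 = {r_squared:.4f}")
```

Running this prints an R^2 of 0.9815, which agrees with the value quoted in the example.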
2. (NOTE: The data for this exercise can be found in the Excel file Regression1HWdata.xls, which
is posted with this assignment in Blackboard.) Researchers who study pre-birth child
development measure the size of a growing fetus by using height and weight, just as for children
and adults. However, “height” is a funny term to use for fetuses (and for infants too, for that
matter), because they can’t actually stand up. A better term may be “length”, but obstetric and
pediatric researchers tend to prefer the term fetal stature. The following table gives estimated
“normal” values for fetal stature for various fetal ages.¹
Age (weeks since conception)    Stature (cm)
18                              19.81
20                              23.8
22                              27.4
24                              30.6
26                              33.72
28                              36.52
30                              39.13
32                              41.58
34                              43.84
36                              46.03
38                              48.08
40                              50.02
a. Scatterplot this data in Excel. Based on your plot, do you think a linear model is a
reasonable way to model the relationship between age and stature? Justify your answer.
b. Use the Excel RSQ function to compute $R^2$ (the square of the Pearson correlation
coefficient) for this data set. How closely does this data set seem to fit a straight line?
¹ Shipman, P., et al., The Human Skeleton, Harvard University Press, 1985, p. 253. (Copy of pages 252-254 posted in
Blackboard (Course Documents/Journal Articles) as Shipman1985_portion.pdf.)
c. Find the equation for the linear model that best fits this data. Add this line to your
scatterplot from part a. Does it seem to be a reasonable fit?
d. What does the slope of your regression line represent, in terms of fetal age and fetal
stature?
3. (NOTE: The data for this exercise can be found in the Excel file Regression1HWdata.xls, which
is posted with this assignment in Blackboard.) In a well-known study by Huxley,² published in
the journal Nature, the weight x (in mg) of each member of a group of fiddler crabs was
compared to the weight y (in mg) of each member’s comically large claw. Here is Huxley’s data:
x = Crab weight (mg)    y = Claw weight (mg)
57.6                    5.3
109.2                   13.7
199.7                   38.3
270.0                   59.0
355.2                   104.5
470.1                   164.9
617.9                   243.0
743.3                   319.2
a. Scatterplot this data in Excel. Based on your plot, do you think a linear model is a
reasonable way to model the relationship between crab weight and claw weight? Justify
your answer.
b. Find the equation for the power function model that best fits this data. Add this curve to
your scatterplot from part a.
c. Use your model from part b. to estimate the claw weight for a crab whose body weight is
500 mg.
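As a reminder of the mechanics, here is a hedged Python sketch of a power function fit. It assumes the in-class method is the log-log approach that this handout later alludes to: taking the natural log of both sides of $y = Cx^k$ gives $\ln(y) = \ln(C) + k\ln(x)$, so a linear fit of $\ln(y)$ against $\ln(x)$ has slope $k$ and intercept $\ln(C)$. The function names `linear_fit` and `fit_power` are hypothetical. The data are made up for illustration (generated exactly from $y = 2x^{1.5}$) and are not Huxley's measurements.

```python
import math

def linear_fit(x, y):
    """Least-squares slope and intercept (the values Excel's SLOPE and INTERCEPT return)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def fit_power(x, y):
    """Fit y = C * x**k by linear regression on (ln x, ln y). Requires x > 0 and y > 0."""
    ln_x = [math.log(xi) for xi in x]
    ln_y = [math.log(yi) for yi in y]
    k, ln_c = linear_fit(ln_x, ln_y)
    return math.exp(ln_c), k

# Made-up illustration data, generated exactly from y = 2 * x**1.5
x_data = [1, 2, 4, 8, 16]
y_data = [2 * xi ** 1.5 for xi in x_data]

C, k = fit_power(x_data, y_data)
print(f"C = {C:.4f}, k = {k:.4f}")
print(f"predicted y at x = 10: {C * 10 ** k:.4f}")
```

Because the illustration data lie exactly on a power curve, the fit recovers C = 2 and k = 1.5 up to rounding. Real measurements will scatter around the fitted curve.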
Exponential Model Regression
One of the most common models used for scientific data is the exponential function, which has the
generic form
$$y = Ce^{kt},$$
where
t is the independent variable
y is the dependent variable
C and k are constants (parameters) whose values will be chosen to fit the data
² Huxley, Julian S., Constant differential growth ratios and their significance, Nature, vol. 114, 20 December 1924, pp. 895-896. (Copy posted in Blackboard as Huxley1924.pdf.)
If we attack this model as we did with the power function model in class (i.e. by taking the natural log of
both sides), we get:
$$\ln(y) = \ln(Ce^{kt})$$
$$\ln(y) = \ln(C) + \ln(e^{kt})$$
$$\ln(y) = \ln(C) + kt$$
So, an exponential model has a linear relationship between $t$ and $\ln(y)$, with $k$ = slope and $\ln(C)$ =
intercept. Copying what we did in the power function case, we can fit an exponential model to a data set
like this:
• Step 1: Transform the data set by computing the column of $\ln(y)$ values.³
• Step 2: Use linear regression to fit a linear model to the transformed data set $(t, \ln(y))$ to get a
slope $m$ and an intercept $b$.
• Step 3: The corresponding exponential model for the original data set is $y = Ce^{kt}$, with $k = m$
and $C = e^b$.
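The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `linear_fit` and `fit_exponential` are hypothetical, and the data are made up (generated exactly from $y = 3e^{0.5t}$), not taken from any exercise in this assignment.

```python
import math

def linear_fit(x, y):
    """Least-squares slope and intercept (the values Excel's SLOPE and INTERCEPT return)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def fit_exponential(t, y):
    """Fit y = C * exp(k * t) via the semi-log transform. Requires every y > 0."""
    ln_y = [math.log(yi) for yi in y]   # Step 1: transform y to ln(y)
    m, b = linear_fit(t, ln_y)          # Step 2: linear fit of ln(y) against t
    return math.exp(b), m               # Step 3: C = e^b, k = m

# Made-up illustration data, generated exactly from y = 3 * exp(0.5 * t)
t_data = [0, 1, 2, 3, 4]
y_data = [3 * math.exp(0.5 * ti) for ti in t_data]

C, k = fit_exponential(t_data, y_data)
print(f"C = {C:.4f}, k = {k:.4f}")
print(f"predicted y at t = 5: {C * math.exp(k * 5):.4f}")
```

Since the illustration data lie exactly on an exponential curve, the fit returns C = 3 and k = 0.5 up to rounding. Note that the transform only works when every y value is positive, because ln(y) is undefined otherwise.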
The next exercise will give you a chance to take this for a test-drive.
4. Over the years, physicians and medical researchers have devised a number of different ways to
quantify the influence of a patient’s health and behavior on that patient’s risk of coronary heart
disease (CHD). One of the most commonly used is called the Framingham Risk Scoring System.⁴
In this system, each patient is assigned a score, based on age, sex, total cholesterol, HDL (i.e.
“good”) cholesterol, systolic blood pressure, treatment for hypertension, and tobacco use. Once a
patient’s score is computed, a published table is used to convert their score into a “10-year
percent risk for CHD”. Here is a portion of the table for males:
Framingham Score    10-Year Percent Risk for CHD
1                   1
4                   1
7                   3
10                  6
13                  12
16                  25

The table is read like this: A patient with a Framingham score (F-score) of 13 has a 12% risk of
developing CHD over the next 10 years.

³ This transformation, in which only the dependent variable is fed to the logarithm, is called a semi-log transform, and is
very widely used in the sciences.
⁴ Reference: Executive summary of the third report of the National Cholesterol Education Program (NCEP) Expert Panel on
detection, evaluation, and treatment of high blood cholesterol in adults, Journal of the American Medical Association, vol.
285, no. 19, May 16, 2001, pp. 2486-2497. (Copy posted in Blackboard as JAMA_CHDrisk2001.pdf.)
a. Scatterplot this data in Excel. Based on your plot, do you think a linear model is a
reasonable way to model the relationship between F-score and CHD risk? Justify your
answer.
b. Find the equation for the exponential model that best fits this data. Add this curve to your
scatterplot from part a.
c. Using your model, estimate the 10-year risk for a patient with an F-score of 14.
d. The Framingham table for males doesn’t go past an F-score of 17 (although a patient can
have a score larger than 17), so using the model in part b. for values much larger than 17
would likely be foolish. Let’s be foolish! Assuming your model from part b. is valid for
F-scores beyond 17, use it to estimate the 10-year risk for a patient with an F-score of 23.
How believable is the result?