Integrated Quantitative Science
Regression 1 Homework

1. Many metals expand and contract with changes in temperature. Here are some measurements of the length of a particular copper rod, measured at different temperatures.

x = Temperature (°C)   20.1     28.2     38.5     44.6     57.4     66.2     78.1
y = Length (mm)        2461.16  2461.49  2461.88  2462.10  2462.62  2462.93  2463.38

a. Scatterplot this data in Excel. Based on your plot, do you think a linear model is a reasonable way to model the relationship between temperature and length for this rod? Justify your answer.
b. Find the equation for the linear model that best fits this data. Add this line to your scatterplot from part a.
c. Using your model, predict the length of this rod when its temperature is 25.0 °C, and when it is 82.0 °C.

Judging linear model fit

In the preceding exercise, you were asked to judge the reasonableness of fitting a linear model to a data set. Researchers who use linear regression models like to use the square of a quantity called the Pearson correlation coefficient to measure how well a regression line fits a particular data set. This squared value, usually denoted by $r^2$, can be computed directly from the data set, thanks to a really ugly formula (that we don’t have to worry about). This number is always between 0 and 1, and we can use its value to measure how well a regression line fits a data set like this:

• Values of $r^2$ that are close to 1 indicate a good fit.
• Values of $r^2$ that are close to 0 indicate a poor fit.

Fortunately, Excel has a built-in function, called RSQ, that will compute the value of $r^2$ for us. It works exactly the same as the SLOPE and INTERCEPT commands we learned in class. So, to compute RSQ for a data set, do exactly the same steps that we use for the SLOPE command, except replace “SLOPE” with “RSQ”.

Example: The value of $r^2$ for the following data set is 0.9815. You can check that you are using the RSQ command correctly by trying this yourself in Excel, and seeing if you get 0.9815 too.
t   1    2    3    4    5
y   2.1  3.4  3.8  5.1  5.9

2. (NOTE: The data for this exercise can be found in the Excel file Regression1HWdata.xls, which is posted with this assignment in Blackboard.)

Researchers who study pre-birth child development measure the size of a growing fetus by using height and weight, just as for children and adults. However, “height” is a funny term to use for fetuses (and for infants too, for that matter), because they can’t actually stand up. A better term may be “length”, but obstetric and pediatric researchers tend to prefer the term fetal stature. The following table gives estimated “normal” values for fetal stature for various fetal ages. [1]

Age (in weeks since conception)   18     20     22     24     26     28     30     32     34     36     38     40
Stature (cm)                      19.81  23.8   27.4   30.6   33.72  36.52  39.13  41.58  43.84  46.03  48.08  50.02

a. Scatterplot this data in Excel. Based on your plot, do you think a linear model is a reasonable way to model the relationship between age and stature? Justify your answer.
b. Use the Excel RSQ function to compute $r^2$, the squared Pearson correlation coefficient, for this data set. How closely does this data set seem to fit a straight line?
c. Find the equation for the linear model that best fits this data. Add this line to your scatterplot from part a. Does it seem to be a reasonable fit?
d. What does the slope of your regression line represent, in terms of fetal age and fetal stature?

[1] Shipman, P., et al., The Human Skeleton, Harvard University Press, 1985, p. 253. (Copy of pages 252-254 posted in Blackboard (Course Documents/Journal Articles) as Shipman1985_portion.pdf.)

3. (NOTE: The data for this exercise can be found in the Excel file Regression1HWdata.xls, which is posted with this assignment in Blackboard.)

In a well-known study by Huxley [2], published in the journal Nature, the weight x (in mg) of each member of a group of fiddler crabs was compared to the weight y (in mg) of each member’s comically large claw.
Here is Huxley’s data:

x   57.6  109.2  199.7  270.0  355.2  470.1  617.9  743.3
y   5.3   13.7   38.3   59.0   104.5  164.9  243.0  319.2

a. Scatterplot this data in Excel. Based on your plot, do you think a linear model is a reasonable way to model the relationship between crab weight and claw weight? Justify your answer.
b. Find the equation for the power function model that best fits this data. Add this curve to your scatterplot from part a.
c. Use your model from part b. to estimate the claw weight for a crab whose body weight is 500 mg.

[2] Huxley, Julian S., Constant differential growth ratios and their significance, Nature, vol. 114, 20 December 1924, pp. 895-896. (Copy posted in Blackboard as Huxley1924.pdf.)

Exponential Model Regression

One of the most common models used for scientific data is the exponential function, which has the generic form

$y = Ce^{kt}$,

where
t is the independent variable
y is the dependent variable
C and k are constants (parameters) whose values will be chosen to fit the data

If we attack this model as we did with the power function model in class (i.e. by taking the natural log of both sides), we get:

$\ln(y) = \ln(Ce^{kt})$
$\ln(y) = \ln(C) + \ln(e^{kt})$
$\ln(y) = \ln(C) + kt$

So, an exponential model has a linear relationship between $t$ and $\ln(y)$, with $k$ = slope and $\ln(C)$ = intercept. Copying what we did in the power function case, we can fit an exponential model to a data set like this:

• Step 1: Transform the data set by computing the column of $\ln(y)$ values. [3]
• Step 2: Use linear regression to fit a linear model to the transformed data set $(t, \ln(y))$ to get a slope $m$ and an intercept $b$.
• Step 3: The corresponding exponential model for the original data set is $y = Ce^{kt}$, with $k = m$ and $C = e^{b}$.

The next Exercise will give you a chance to take this for a test-drive.

4.
Over the years, physicians and medical researchers have devised a number of different ways to quantify the influence of a patient’s health and behavior on that patient’s risk of coronary heart disease (CHD). One of the most commonly used is called the Framingham Risk Scoring System. [4] In this system, each patient is assigned a score, based on age, sex, total cholesterol, HDL (i.e. “good”) cholesterol, systolic blood pressure, treatment for hypertension, and tobacco use. Once a patient’s score is computed, a published table is used to convert their score into a “10-year percent risk for CHD”. Here is a portion of the table for males:

Framingham Score               1   4   7   10  13  16
10-Year Percent Risk for CHD   1   1   3   6   12  25

The table is read like this: A patient with a Framingham score (F-score) of 13 has a 12% risk of developing CHD over the next 10 years.

[3] This transformation, in which only the dependent variable is fed to the logarithm, is called a semi-log transform, and is very widely used in the sciences.

[4] Reference: Executive summary of the third report of the National Cholesterol Education Program (NCEP) Expert Panel on detection, evaluation, and treatment of high blood cholesterol in adults, Journal of the American Medical Association, vol. 285, no. 19, May 16, 2001, pp. 2486-2497. (Copy posted in Blackboard as JAMA_CHDrisk2001.pdf.)

a. Scatterplot this data in Excel. Based on your plot, do you think a linear model is a reasonable way to model the relationship between F-score and CHD risk? Justify your answer.
b. Find the equation for the exponential model that best fits this data. Add this curve to your scatterplot from part a.
c. Using your model, estimate the 10-year risk for a patient with an F-score of 14.
d. The Framingham table for males doesn’t go past an F-score of 17 (although a patient can have a score larger than 17), so using the model in part b. for values much larger than 17 would likely be foolish. Let’s be foolish! Assuming your model from part b.
is valid for F-scores beyond 17, use it to estimate the 10-year risk for a patient with an F-score of 23. How believable is the result?
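
For readers who want to cross-check their Excel work outside the spreadsheet, the RSQ calculation and the three-step exponential fit described in this handout can be sketched in Python. This is only an illustrative sketch, not part of the assignment: the course itself uses Excel, the function names `linear_fit`, `r_squared`, and `exponential_fit` are hypothetical, and the demo data for the exponential fit is made up (generated from $y = 2e^{0.5t}$) rather than taken from any exercise.

```python
import math


def linear_fit(xs, ys):
    """Least-squares slope and intercept (the values Excel's SLOPE and INTERCEPT report)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient (the value Excel's RSQ reports)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def exponential_fit(ts, ys):
    """Fit y = C * exp(k * t) by the semi-log method; every y must be positive."""
    log_ys = [math.log(y) for y in ys]   # Step 1: transform y to ln(y)
    m, b = linear_fit(ts, log_ys)        # Step 2: linear regression on (t, ln(y))
    return math.exp(b), m                # Step 3: C = e^b, k = m


# Check against the example data set from the handout (quoted r^2: 0.9815).
t_example = [1, 2, 3, 4, 5]
y_example = [2.1, 3.4, 3.8, 5.1, 5.9]
print(round(r_squared(t_example, y_example), 4))

# Made-up demo data lying exactly on y = 2 * exp(0.5 * t) (an assumption for illustration).
t_demo = [0, 1, 2, 3, 4]
y_demo = [2.0 * math.exp(0.5 * t) for t in t_demo]
C, k = exponential_fit(t_demo, y_demo)
print(C, k)  # should be very close to 2.0 and 0.5
```

Because the demo data lies exactly on an exponential curve, the fit should recover its parameters almost exactly; real data, such as the Framingham table, will scatter around the fitted curve instead.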