Regression Work – Math 3311

Given data, the usual procedure is to analyze it and find a governing equation. One can then project values for which no data points exist. Often the type of equation used is influenced by information about the field in which you are working. Sometimes, though, you are on your own to find a best fit. Let’s work through the following problems to develop some skill at analyzing data.

We will be relying on a statistic, $R^2$, which is easy to calculate with technology and a bit difficult to manage from scratch. Here’s a printout on it from http://www.hedgefundindex.com/d_rsquared.asp#Formula. Note that this is a business-related website. Top management relies on statistics and projections quite a bit.

A general version, based on comparing the variability of the estimation errors with the variability of the original values, is

$$R^2 = 1 - \frac{SSE}{SST}.$$

Another version is common in statistics texts but holds only if the modeled values are obtained by ordinary least squares regression (which must include a fitted intercept or constant term): it is

$$R^2 = \frac{SSR}{SST}.$$

In the above definitions,

$$SST = \sum_i (y_i - \bar{y})^2, \qquad SSR = \sum_i (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_i (y_i - \hat{y}_i)^2,$$

where $y_i$ and $\hat{y}_i$ are the original data values and modeled values respectively, and $\bar{y}$ is the mean of the original data values. That is, SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the sum of squared errors. In some texts, the abbreviations SSR and SSE have the opposite meaning: SSR stands for the residual sum of squares (which then refers to the sum of squared errors in the first definition above) and SSE stands for the explained sum of squares (another name for the regression sum of squares).

In the second definition, $R^2$ is the ratio of the variability of the modeled values to the variability of the original data values. Another version of the definition, which again only holds if the modeled values are obtained by ordinary least squares regression, gives $R^2$ as the square of the correlation coefficient between the original and modeled data values.
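The two definitions of $R^2$ above can be checked against each other by hand. The following is a minimal sketch in plain Python, assuming a straight-line model fitted by ordinary least squares. The data values and the helper names `least_squares_line` and `sums_of_squares` are hypothetical and serve only as illustration; they are not taken from the worksheet or from the cited website.

```python
# Illustrative data only (hypothetical values, not taken from the worksheet).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def sums_of_squares(ys, y_hats):
    """Return (SST, SSR, SSE) for observed values ys and modeled values y_hats."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    return sst, ssr, sse


slope, intercept = least_squares_line(xs, ys)
y_hats = [slope * x + intercept for x in xs]
sst, ssr, sse = sums_of_squares(ys, y_hats)

r2_general = 1 - sse / sst  # first definition: 1 - SSE/SST
r2_ols = ssr / sst          # second definition: SSR/SST (least squares with intercept)
print(round(r2_general, 4), round(r2_ols, 4))
```

Because the line here is an ordinary least squares fit with an intercept, the two values should agree up to rounding. For a model fitted some other way, only the first definition is safe to use.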
Problem 1 Varnish Drying Time

Amount (g)   Hours to dry
1            7.2
2            6.7
3            4.7
4            3.7
5            4.7
6            4.2
7            5.2
8            5.7

Sometimes it is useful to watch paint dry. Let’s see what we can find out about this data. Graph the data. Find the best fit using technology. Be sure to support your conclusion by getting the coefficient of determination $R^2$.

Extension: How long will it take 10 grams of varnish to dry?

Problem 2 Thunderstorm Data

It is conjectured that in a lightning storm, the distance between the observer and the lightning is linearly related to the time interval between the flash and the bang. Answer the following questions:

A. Consider d, the distance to the storm in kilometers, as a function of time t in seconds. Suppose, as an experiment, a friend travels along with the storm and reports the actual distance between the storm and your house as you record the seconds between the flash and the bang.

B. Make a scatter plot of the data and use regression to write the particular equation for this direct variation function.

t (s)    2.98   6.09   14.94   28.99   37.11
d (km)   1      2      5       10      12

C. Recall the regression equation is a "best fit." What might be an actual particular equation that models the data?

D. Use your model to work backwards in order to calculate the times for the thunder sound to reach you from lightning bolts that are 1.5, 2.5, and 15 kilometers away. What do you call the processes of looking within and beyond your actual data?

CHALLENGE: What would your formula be with seconds and miles as your units?

Problem 3 Heights of People who Date Each Other

A student wonders if tall women tend to date taller men than do short women. She measures herself, her dormitory roommate, and all the women in the adjoining rooms. Then she measures the next man each woman dates. Here are the data (heights are in inches).

Women   66   64   66   65   70   65
Men     72   68   70   68   71   65

Make a scatter plot of these data. Do you expect the correlation to be near one? Find the best fit regression line and the correlation coefficient.
What is your conclusion from the data? What height man would a 61 inch woman date based on this data? If every woman dated a man exactly 3 inches taller than herself, what would be the correlation between male and female heights?

Problem 4

Let’s look at a made-up set of data and discuss it.

x    y
1    1
2    3
3    3
4    5
10   1
10   11

Make the scatter plot and find the regression line. Calculate the correlation coefficient. What has happened to the data here? Would you consider any of the points to be anomalies?

Problem 5 Airport Pathway Hearing Loss Data

x (weeks)   y (hearing range)
47          15.1
56          14.1
116         13.2
178         12.7
19          14.6
75          13.8
160         11.9
31          14.8
12          15.3
164         12.6
43          14.7
74          14

Analyze this data. If you are at 90 weeks, what is the associated hearing loss?

Problem 6 Growth of a tubeworm*

It has been shown that the marine tubeworm is the longest-lived non-colonial marine invertebrate known.* Since tubeworms live on the ocean floor and live longer than humans, scientists do not measure their age directly. Instead scientists measure their growth rate at various lengths and then construct a model for the growth rate in terms of length. The length is measured in meters and the growth rate in meters per year.

Length (meters)   Growth rate (meters per year)
0                 0.0510
0.5               0.0255
1.0               0.0128
1.5               0.0064
2.0               0.0032

Often in biology the growth rate is modeled as a decreasing linear function of length. For some organisms, however, it may be appropriate to model the growth rate as a decreasing exponential function. Use the data in the table to decide which model is more appropriate here. Support your decision. Give a practical explanation of the slope or percentage decay, whichever is applicable. Use functional notation to express the growth rate at a length of 0.64 meters, and calculate that value.

*D. Bergquist, F. Williams, and C. Fisher, “Longevity record for deep-sea invertebrate,” Nature 403 (2000), 499-500. Note that their conservative estimate for the life span is between 170 and 250 years.
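One way to compare a linear model with an exponential one, as Problem 6 asks, is to fit both and compare their $R^2$ values. The sketch below uses the Problem 6 table and fits the exponential model by running a straight-line regression on the natural log of the growth rate. The helper names `fit_line` and `r_squared` are hypothetical. $R^2$ is computed here with the general definition $1 - SSE/SST$ on the original scale, which is an assumption of this sketch; a calculator's exponential regression may report $R^2$ for the log-transformed data instead, so the numbers can differ slightly.

```python
import math

# Data from the Problem 6 table.
lengths = [0.0, 0.5, 1.0, 1.5, 2.0]                # meters
rates = [0.0510, 0.0255, 0.0128, 0.0064, 0.0032]   # meters per year


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(ys, y_hats):
    """General definition: 1 - SSE/SST, computed on the original data scale."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    return 1 - sse / sst


# Linear model: rate = m * length + b
m, b = fit_line(lengths, rates)
r2_linear = r_squared(rates, [m * x + b for x in lengths])

# Exponential model: rate = a * base**length, fitted as a line through ln(rate).
k, ln_a = fit_line(lengths, [math.log(r) for r in rates])
a, base = math.exp(ln_a), math.exp(k)
r2_exponential = r_squared(rates, [a * base ** x for x in lengths])

print("linear R^2:", round(r2_linear, 4))
print("exponential R^2:", round(r2_exponential, 4), "base per meter:", round(base, 4))
```

The fitted `base` is the factor by which the growth rate is multiplied for each additional meter of length, so `1 - base` is the percentage decay per meter that the problem asks you to interpret.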
Problem 7

Given the following set of data, which model is the best fit and why? Sketch the data first, please.

x   3      7       9        10      15
y   33.5   988.8   5470.8   12830   893442

Problem 8 Growth rate vs weight

Ecologists have studied how a population’s intrinsic exponential growth rate r is related to the body weight W for herbivorous mammals. In the following table, W is the adult weight measured in pounds, and r is the growth rate per year. Plot the data and try several best fit models. In actuality, most ecologists use a power function. Do these data support that type of model?

Animal              Weight W   Rate r
Short-tailed vole   0.07       4.56
Norway rat          0.7        3.91
Roe deer            55         0.23
White-tailed deer   165        0.55
American elk        595        0.27
African elephant    8160       0.06

Problem 9

Given the following data, which is the best fit model and why?

x   F(x)
1   −4
2   −5
3   −8
4   −13
5   −20

Problem 10 Traffic accidents

The following table shows the cost C of traffic accidents, in cents per vehicle-mile, as a function of vehicular speed, s, in miles per hour, for commercial vehicles driving at night on urban streets.

speed   20    25    30    35    40    45    50
cost    1.3   0.4   0.1   0.3   0.9   2.2   5.8

The rate of vehicular involvement in traffic accidents (per vehicle-mile) can be modeled as a quadratic function of vehicular speed, s, and the cost per vehicular involvement is roughly a linear function of s, so we expect that C (the product of these two functions) can be modeled as a cubic function of s.

Just how important is it to know the information in the last paragraph? What could you have done to figure it out yourself? How much training would it take for you to automatically figure it out? Sketch the graph. What would you have initially guessed? Take common differences. How far do you have to go? Does this suggest cubic? What is the best fit equation? Why is this better than any other model?
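The "common differences" check mentioned in Problem 10 can be sketched as a small difference table. For equally spaced x values, constant first differences point to a linear model, constant second differences to a quadratic, and constant third differences to a cubic. The sketch below is illustrated on the Problem 9 table, where the arithmetic is exact; the function names `differences` and `difference_table` are hypothetical.

```python
def differences(values):
    """Return the successive differences of a list of numbers."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def difference_table(values, depth):
    """Return [values, first differences, second differences, ...] down to depth."""
    rows = [list(values)]
    for _ in range(depth):
        rows.append(differences(rows[-1]))
    return rows


# F(x) from the Problem 9 table, for x = 1, 2, 3, 4, 5 (equally spaced).
f_values = [-4, -5, -8, -13, -20]

table = difference_table(f_values, 2)
for order, row in enumerate(table):
    print("order", order, row)
```

With measured data such as the cost table in Problem 10, the differences will rarely be exactly constant, so treat the table as a hint about the degree and confirm it with a regression and its $R^2$.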
Problem 11 Charles's Law

Physicist Jacques Charles (1746-1823) discovered that the volume of a gas at a constant pressure increases linearly with the temperature of the gas. The table below illustrates this relationship between volume and temperature. In the table, hydrogen is held at a constant pressure of one atmosphere. The volume V is measured in liters and the temperature T is measured in degrees Celsius.

T   -40       -20       0         20        40        60        80
V   19.1482   20.7908   22.4334   24.0760   25.7186   27.3612   29.0038

A. Use the table above and what you have learned about regression to find a model for the linear relationship. Have you seen the value that you found for the constant in the equation before?

B. Solve the equation that you have found for T to find $\lim_{V \to 0} T$. Have you seen the value that you found for the limit before?

C. Save the data for Problem 12, a continuation of this problem!

Problem 12 A study of residuals

A “residual” is the difference between the observed (recorded) value and the value predicted by the model. To check whether a linear model is appropriate for data, plot the residuals. A histogram of the residuals can be checked for multiple modes and for outliers. Take the data from Problem 11 and calculate the residuals, then plot the residuals. Let’s discuss them!

Problem 13

Given the following data for the stretch of a spring, select the best model from among those below: $(x \times 10^{3},\ y \times 10^{5})$. What do x and y represent?

x   5   10   15   20   25    30    35    40    45
y   0   19   57   94   134   173   216   256   343

a. $y = Ax$
b. $y = a(b)^x$
c. $y = ax^2$

How do you know you are right?

Problem 14 A leaking can

The side of a cylindrical can full of water springs a leak, and the water begins to stream out. The depth H, in inches, of water remaining in the can is a function of the distance D in inches (measured out from the base of the can) at which the stream of water strikes the ground.

D   H
0   1.03
1   1.20
2   2.10
3   3.27
4   4.99

What is the best fit model for this data? Why are you sure of that?
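The residual calculation described in Problem 12 can be sketched as follows, using the Problem 11 table and an ordinary least squares line. The helper name `fit_line` is hypothetical, and the fit is done in plain Python; a graphing calculator or spreadsheet regression should give the same line.

```python
# Data from the Problem 11 table.
temps = [-40, -20, 0, 20, 40, 60, 80]                      # degrees Celsius
volumes = [19.1482, 20.7908, 22.4334, 24.0760,
           25.7186, 27.3612, 29.0038]                      # liters


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(temps, volumes)

# Residual = observed value minus the value predicted by the model.
residuals = [v - (slope * t + intercept) for t, v in zip(temps, volumes)]

for t, r in zip(temps, residuals):
    print(t, round(r, 6))
```

When a linear model is appropriate, the residuals scatter around zero with no visible pattern. A systematic curve in the residual plot, which you may want to look for in Problem 14, suggests that a different model type would fit better.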
Problem 15

Given this data, find a model.

x    y
1    1
3    3
4    5
6    2
7    1
8    4
10   8

Problem 16 Poiseuille’s Law for rate of fluid flow

This law applies only to laminar flow, not turbulent flow.

$$F = cR^4$$

where c is a constant and R is the radius of the tube.

Create a list of perfect data. How would you analyze it if you didn’t already have the formula? Create a list of “perturbed” data with the notion of creating a set of data that is “realistic”. What should your residuals look like with this “realistic” data? Assume R increases by 10%. Explain why F increases by 46.41%. What is the flow rate through a ¾ inch pipe compared with that through a ½ inch pipe? Suppose that an artery supplying blood to the heart muscle is partially blocked and is only half its normal radius. What percentage of the usual blood flow is happening in that artery?

Problem 17 Water flowing in a tank

The following table shows the number of gallons W of water left in a tank t hours after it starts to leak.

t, hours          0     3     6     9     12
W, gallons left   860   725   612   515   433

Explain exactly what we’re going to model: is it the number of gallons or $dW/dt$? Find the best fit line and get an estimate of the amount of water left after 8 hours. What else can you find out about the data given?

Problem 18

The following table gives the amount of waste in the US that has been recycled over the 35 years from 1960 to 1995, in millions of tons. Come up with a sensible way to model the curve and project to the year 2000.

Year   Millions of tons
1960   5.6
1970   8.0
1980   14.4
1990   56.2
1995   56.2

Problem 19

Amount of alcohol (in grams) in the body after n hours, consuming half a drink per hour. Assume the first drink is at hour zero. Assume the drinking is steady and goes on for 48 hours.

n    A(n)
0    7
1    7.64
2    8.07
3    8.39
4    8.62
10   9.19
24   9.33
48   9.33

Graph this data and model it.

Problem 20

Comparisons of diameter and circumference of various circles.

Diameter (cm)   Circumference (cm)
5               15.7
7               22.0
12.5            39.3
53.4            53.4

What function models this relationship? How else do you know what you’ve found?

Problem 21

The following table gives the approximate amount of petroleum used for transportation in the US from 1970 to 1995. The amount of petroleum is given in quadrillion Btu (British Thermal Units). Stats from USDOT.

Year   Quadrillion Btu
1970   15.3
1975   17.5
1980   19.4
1985   20.0
1990   21.6
1995   23.8

Problem 22 Water Pressure

Below is a table of water pressure in atmospheres taken at 5 depths in the Atlantic Ocean. What is the governing equation for water pressure?

Depth (feet)   Pressure (atm)
0              1
66             3
167            5
300            10
500            16

Problem 23

According to the World Bank website, Latvia’s average annual growth rate from 1998 to 2015 is predicted to be −0.8%. If the population of Latvia in 1998 was 2.4 million people, predict the population in 2015 using this projected growth rate. What type of function should this be? What will the formula be?

Problem 24

Exponential functions can be used to model the increase in the cost of items due to inflation. If a gallon of milk currently costs $4.95 and the annual inflation rate is projected to be a steady 3.2%, predict the cost of a gallon of milk in 3 years and in 5 years. What formula are you using and how did you find the formula?

Problem 25

The Richter scale, developed in 1935 by Charles F. Richter, is a device for comparing earthquakes. The largest shocks ever recorded have a magnitude of about 8.9. It happens that a small change in the Richter scale indicates a large change in the severity of the quake. Note that 5.3 is a moderate quake. In fact, an earthquake that is 6.3 is 10 times worse than one that measures 5.3, and a 7.3 is 100 times as powerful as a 5.3. What kind of scaling is this? Here is some data:

Year   Reading   Location
1811   8.8       New Madrid, Missouri (changed the course of the Mississippi)
1989   7.1       San Francisco

How much more powerful was the 1811 quake than the 1989 quake?
If an earthquake 1000 times more powerful than the 1989 quake happened, what would be the Richter scale reading? The 2004 quake in the Indian Ocean, near Indonesia, was 80 times more powerful than the San Francisco quake. What was its Richter scale reading?
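The scaling described in Problem 25 (each full unit on the scale is a factor of 10) can be sketched with two small helpers. This is only an illustration of the worksheet's stated rule; the function names `power_ratio` and `reading_for_factor` are hypothetical.

```python
import math


def power_ratio(reading_a, reading_b):
    """How many times more powerful quake A is than quake B,
    assuming each full unit on the scale is a factor of 10."""
    return 10 ** (reading_a - reading_b)


def reading_for_factor(reference_reading, factor):
    """Reading of a quake that is `factor` times more powerful
    than a quake with the given reference reading."""
    return reference_reading + math.log10(factor)


# Check against the worksheet's own statement: 6.3 vs 5.3 is a factor of 10.
print(power_ratio(6.3, 5.3))

# Going the other way: 100 times more powerful than a 5.3.
print(reading_for_factor(5.3, 100))
```

The two functions are inverses of each other: one turns a difference in readings into a ratio, the other turns a ratio back into a difference in readings.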