Regression Analysis Problems

Regression Work – Math 3311
Given data, the usual procedure is to analyze it and find a governing equation. One can then
project values for which no data points exist.
Often the type of equation used is influenced by information about the field in which you are
working. Sometimes, though, you are on your own to find a best fit.
Let’s work through the following problems to develop some skill at analyzing data.
We will be relying on a statistic, $R^2$, which is easy to calculate with technology and a bit
difficult to manage from scratch. Here’s a printout on it from http://www.hedgefundindex.com/d_rsquared.asp#Formula. Note that this is a business-related website; top
management relies on statistics and projections quite a bit.
A general version, based on comparing the variability of the estimation errors with the variability
of the original values, is
\[ R^2 = 1 - \frac{SSE}{SST}. \]
Another version is common in statistics texts but holds only if the modeled values are obtained
by ordinary least squares regression (which must include a fitted intercept or constant term): it is
\[ R^2 = \frac{SSR}{SST}. \]
In the above definitions,
\[ SST = \sum_i (y_i - \bar{y})^2, \qquad SSR = \sum_i (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_i (y_i - \hat{y}_i)^2, \]
where $y_i$ and $\hat{y}_i$ are the original data values and modeled values respectively, and $\bar{y}$ is the mean of the original data values. That is, SST is the total
sum of squares, SSR is the regression sum of squares, and SSE is the sum of squared errors. In
some texts, the abbreviations SSR and SSE have the opposite meaning: SSR stands for the residual
sum of squares (which then refers to the sum of squared errors in the first definition above) and SSE
stands for the explained sum of squares (another name for the regression sum of squares).
In the second definition, $R^2$ is the ratio of the variability of the modeled values to the variability
of the original data values. Another version of the definition, which again only holds if the
modeled values are obtained by ordinary least squares regression, gives $R^2$ as the square of the
correlation coefficient between the original and modeled data values.
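To see how the two definitions compare in practice, here is a minimal Python sketch. The helper names `linear_fit` and `sums_of_squares` and the five data points are made up for illustration; they are not from the website or from any problem below.

```python
def linear_fit(xs, ys):
    """Ordinary least squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def sums_of_squares(ys, fitted):
    """Return (SST, SSR, SSE) for original values ys and modeled values fitted."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return sst, ssr, sse


# Made-up illustration data (an assumption, not taken from the worksheet).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = linear_fit(xs, ys)
fitted = [a + b * x for x in xs]
sst, ssr, sse = sums_of_squares(ys, fitted)

r2_general = 1 - sse / sst  # general version
r2_ols = ssr / sst          # version valid for least squares with an intercept
print(r2_general, r2_ols)
```

Because the line here comes from ordinary least squares with an intercept, the two printed values should agree up to rounding. For a model fitted some other way they can differ.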
Problem 1
Varnish Drying Time
Amount (g)    Hours to dry
1             7.2
2             6.7
3             4.7
4             3.7
5             4.7
6             4.2
7             5.2
8             5.7
Sometimes it is useful to watch paint dry. Let’s see what we can find out about this data.
Graph the data.
Find the best fit using technology. Be sure to support your conclusion by getting the
coefficient of determination, $R^2$.
Extension: How long will it take 10 grams of varnish to dry?
Problem 2
Thunderstorm Data
It is conjectured that in a lightning storm, the distance between the observer and the lightning is
linearly related to the time interval between the flash and the bang. Answer the following
questions:
A. Consider d the distance to the storm in kilometers as a function of time t in seconds.
Suppose, as an experiment, a friend travels along with the storm and reports the actual
distance between the storm and your house as you record the seconds between the flash
and the bang.
B. Make a scatter plot of the data and use regression to write the particular equation for this
direct variation function.
t (seconds)    2.98    6.09    14.94    28.99    37.11
d (km)         1       2       5        10       12
C. Recall the regression equation is a "best fit." What might be an actual particular equation
that models the data?
D. Use your model to work backwards in order to calculate the times for the thunder sound
to reach you from lightning bolts that are 1.5, 2.5, and 15 kilometers away. What do you
call the processes of looking within and beyond your actual data?
CHALLENGE: What would your formula be with seconds and miles as your units?
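One way to get the direct variation equation for part B is to fit a line forced through the origin, for which the least squares slope is k = sum(t*d) / sum(t^2). The sketch below assumes that approach; a calculator's ordinary linear regression will also report a small nonzero intercept, which is part of what part C asks you to think about.

```python
t = [2.98, 6.09, 14.94, 28.99, 37.11]  # seconds between flash and bang
d = [1, 2, 5, 10, 12]                  # distance in kilometers

# Least squares slope for the model d = k * t (no intercept term).
k = sum(ti * di for ti, di in zip(t, d)) / sum(ti ** 2 for ti in t)
print(f"d = {k:.4f} * t")

# Part D: work backwards from distance to time, t = d / k.
for dist in (1.5, 2.5, 15):
    print(f"{dist} km -> {dist / k:.1f} s")
```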
Problem 3
Heights of People who Date Each Other
A student wonders if tall women tend to date taller men than do short women. She measures
herself, her dormitory roommate, and all the women in the adjoining rooms. Then she measures
the next man each woman dates. Here are the data (heights are in inches).
Women    66    64    66    65    70    65
Men      72    68    70    68    71    65
Make a scatter plot of these data. Do you expect the correlation to be near one?
Find the best fit regression line…find the correlation coefficient.
What is your conclusion from the data? What height man would a 61 inch woman date based on
this data?
If every woman dated a man exactly 3 inches taller than herself, what would be the correlation
between male and female heights?
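If you want to check your calculator's correlation coefficient, here is a short Python sketch. The function name `correlation` is our own; it computes the usual Pearson correlation from the deviations about the means. The last lines test the "exactly 3 inches taller" scenario.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


women = [66, 64, 66, 65, 70, 65]
men = [72, 68, 70, 68, 71, 65]

r = correlation(women, men)
print(round(r, 3))

# Hypothetical case: every man is exactly 3 inches taller than his date.
r_shifted = correlation(women, [w + 3 for w in women])
print(r_shifted)
```

Adding a constant to every value shifts the points without changing how tightly they follow a line, which is what the second result illustrates.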
Problem 4
Let’s look at a made up set of data and discuss it.
x     y
1     1
2     3
3     3
4     5
10    1
10    11
Make the scatter plot and find the regression line. Calculate the correlation coefficient.
What has happened to the data here?
Would you consider any of the points to be anomalies?
Problem 5
Airport Pathway Hearing Loss Data
x = # weeks
y = hearing range

x      y
47     15.1
56     14.1
116    13.2
178    12.7
19     14.6
75     13.8
160    11.9
31     14.8
12     15.3
164    12.6
43     14.7
74     14
Analyze this data. If you are at 90 weeks, what is the associated hearing loss?
Problem 6
Growth of a tubeworm*
It has been shown that the marine tubeworm is the longest-lived non-colonial marine invertebrate
known.*
Since tubeworms live on the ocean floor and live longer than humans, scientists do not measure
their age directly. Instead scientists measure their growth rate at various lengths and then
construct a model for the growth rate in terms of length. The length is measured in meters and
the growth rate in meters per year.
Length (meters)    Growth rate (meters per year)
0                  0.0510
0.5                0.0255
1.0                0.0128
1.5                0.0064
2.0                0.0032
Often in biology the growth rate is modeled as a decreasing linear function of length. For some
organisms, however, it may be appropriate to model the growth rate as a decreasing exponential
function. Use the data in the table to decide which model is more appropriate here. Support
your decision. Give a practical explanation of the slope or percentage decay, whichever is
applicable.
Use functional notation to express the growth rate at a length of 0.64, and calculate that value.
*D. Bergquist, F. Williams, and C. Fisher, “Longevity record for deep-sea invertebrate,”
Nature 403 (2000), 499-500. Note that their conservative estimate for the life span is between
170 and 250 years.
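A quick way to tell linear from exponential when the lengths are evenly spaced: a linear model has roughly constant successive differences, while an exponential model has roughly constant successive ratios. Here is a sketch in Python. The function `growth_rate` is a rough model read straight off the table (an assumption for illustration, not a regression result), so compare it with what your technology gives.

```python
lengths = [0.0, 0.5, 1.0, 1.5, 2.0]               # meters
rates = [0.0510, 0.0255, 0.0128, 0.0064, 0.0032]  # meters per year

# Successive differences (constant if linear) and ratios (constant if exponential).
diffs = [b - a for a, b in zip(rates, rates[1:])]
ratios = [b / a for a, b in zip(rates, rates[1:])]
print("differences:", [round(x, 4) for x in diffs])
print("ratios:     ", [round(x, 3) for x in ratios])


def growth_rate(length):
    """Rough exponential model: start at 0.0510 and halve every 0.5 meters."""
    return 0.0510 * 0.5 ** (length / 0.5)


print(round(growth_rate(0.64), 4))
```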
Problem 7
Given the following set of data, which model is the best fit and why?
x    3       7        9         10       15
y    33.5    988.8    5470.8    12830    893442

Sketch the data first, please.
Problem 8
Growth rate vs weight:
Ecologists have studied how a population’s intrinsic exponential growth rate r is related to the
body weight W for herbivorous mammals. In the following table, W is the adult weight measured
in pounds, and r is the growth rate per year.
Plot the data and try several best fit models. In actuality, most ecologists use a power
function…do these data support that type of model?
Animal               Weight    Rate
Short-tailed vole    0.07      4.56
Norway rat           0.7       3.91
Roe deer             55        0.23
White-tailed elk     165       0.55
American elk         595       0.27
African elephant     8160      0.06
Problem 9
Given the following data, which is the best fit model and why?
x    F(x)
1    −4
2    −5
3    −8
4    −13
5    −20
Problem 10
Traffic accidents:
The following table shows the cost C of traffic accidents, in cents per
vehicle-mile, as a function of vehicular speed, s, in miles per hour, for commercial vehicles
driving at night on urban streets.
speed    20     25     30     35     40     45     50
cost     1.3    0.4    0.1    0.3    0.9    2.2    5.8
The rate of vehicular involvement in traffic accidents (per vehicle-mile) can be modeled as a
quadratic function of vehicular speed, s, and the cost per vehicular involvement is roughly a
linear function of s, so we expect that C (the product of these two functions) can be modeled as a
cubic function of s.
Just how important is it to know the information in the last paragraph? What could you have
done to figure it out yourself? How much training would it take for you to automatically figure it
out?
Sketch the graph…what would you have initially guessed? Take common differences…how far
do you have to go? Does this suggest cubic?
What is the best fit equation? Why is this better than any other model?
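Taking common differences by hand gets tedious, so here is a small Python sketch that does it. The helper name `differences` is our own. For evenly spaced inputs, an exact cubic has constant third differences, which the made-up list of perfect cubes at the end illustrates; real data will only approximate this, if at all.

```python
def differences(values):
    """Successive differences of a list, rounded to suppress float noise."""
    return [round(b - a, 10) for a, b in zip(values, values[1:])]


cost = [1.3, 0.4, 0.1, 0.3, 0.9, 2.2, 5.8]  # speeds 20, 25, ..., 50 mph

first = differences(cost)
second = differences(first)
third = differences(second)
print("first: ", first)
print("second:", second)
print("third: ", third)

# Comparison case (made up): an exact cubic sampled at evenly spaced points.
cubes = [n ** 3 for n in range(6)]
cubes_third = differences(differences(differences(cubes)))
print("third differences of n^3:", cubes_third)
```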
Problem 11
Charles's Law
Physicist Jacques Charles (1746-1823) discovered that the volume of a gas at a constant pressure
increases linearly with the temperature of the gas. The table below illustrates this relationship
between volume and temperature. In the table, hydrogen is held at a constant pressure of one
atmosphere. The volume V is measured in liters and the temperature T is measured in degrees
Celsius.
T    -40        -20        0          20         40         60         80
V    19.1482    20.7908    22.4334    24.0760    25.7186    27.3612    29.0038
A. Use the table above and what you have learned about regression to find a model for the linear
relationship.
Have you seen the value that you found for the constant in the equation before?
B. Solve the equation that you have found for T to find $\lim_{V \to 0} T$.
Have you seen the value that you found for the limit before?
C. Save the data for Problem 12, a continuation of this problem!
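If you would like to check parts A and B without a calculator's regression menu, here is a minimal sketch in Python. The helper `linear_fit` is our own name for the standard least squares formulas, and the model form V = a + b*T is the assumption being tested.

```python
def linear_fit(xs, ys):
    """Ordinary least squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


T = [-40, -20, 0, 20, 40, 60, 80]  # degrees Celsius
V = [19.1482, 20.7908, 22.4334, 24.0760, 25.7186, 27.3612, 29.0038]  # liters

a, b = linear_fit(T, V)  # model: V = a + b*T
print(f"V = {a:.4f} + {b:.5f} T")

# Part B: solving for T gives T = (V - a) / b, so as V -> 0, T -> -a / b.
T_at_zero_volume = -a / b
print(round(T_at_zero_volume, 2))
```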
Problem 12
A study of residuals:
A “residual” is the difference between the observed (recorded) value and the value predicted
by the model. To check whether a linear model is appropriate for data, plot the residuals. A
histogram of the residuals can be checked for multiple modes and for outliers.
Take the data from Problem 11 and calculate the residuals, then plot the residuals.
Let’s discuss them!
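Here is a sketch of the residual calculation in Python, using the Charles's Law table from Problem 11 and the convention residual = observed minus predicted. The helper `linear_fit` is our own name for the standard least squares formulas. Plotting is left to you; the list of residuals is all a scatter plot or histogram needs.

```python
def linear_fit(xs, ys):
    """Ordinary least squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


T = [-40, -20, 0, 20, 40, 60, 80]
V = [19.1482, 20.7908, 22.4334, 24.0760, 25.7186, 27.3612, 29.0038]

a, b = linear_fit(T, V)

# Residual = observed value minus the value the model predicts.
residuals = [v - (a + b * t) for t, v in zip(T, V)]
for t, r in zip(T, residuals):
    print(t, f"{r:.2e}")

largest = max(abs(r) for r in residuals)
print("largest residual in absolute value:", largest)
```

Look at the size of these residuals and ask what that says about where the table in Problem 11 came from.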
Problem 13
Given the following data for the stretch of a spring, select the best model from among those
below:
( x 103 , y 105 )
What do x and y represent?
x
5
10
15
20
25
30
35
40
45
y
0
19
57
94
134
173
216
256
343
a.
y  Ax
b.
y  a(b) x
c.
y  ax 2
How do you know you are right?
Problem 14
A leaking can:
The side of a cylindrical can full of water springs a leak, and the water
begins to stream out. The depth H, in inches, of water remaining in the can is a function of the
distance D in inches (measured out from the base of the can) at which the stream of water strikes
the ground.
D    H
0    1.03
1    1.20
2    2.10
3    3.27
4    4.99
What is the best fit model for this data? Why are you sure of that?
Problem 15
Given this data, find a model.

x     y
1     1
3     3
4     5
6     2
7     1
8     4
10    8
Problem 16
Poiseuille’s Law for rate of fluid flow. This law applies only to laminar flow, not turbulent flow.
$F = cR^4$
where F is the flow rate, c is a constant, and R is the radius of the tube.
Create a list of perfect data. How would you analyze it if you didn’t already have the formula?
Create a list of “perturbed” data with the notion of creating a set of data that is “realistic”. What
should your residuals look like with this “realistic” data?
Assume R increases by 10%. Explain why F increases by 46.41%.
What is the flow rate through a ¾ inch pipe compared with that through a ½ inch pipe?
Suppose that an artery supplying blood to the heart muscle is partially blocked and is only half
its normal radius. What percentage of the usual blood flow is happening in that artery?
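Since the constant c cancels whenever two flows through the same kind of tube are compared, only the ratio of the radii matters. A short Python sketch of that idea follows; the function name `flow_ratio` is made up for illustration.

```python
def flow_ratio(radius_ratio):
    """Ratio of flow rates F2/F1 when R2/R1 = radius_ratio, using F = c*R^4."""
    return radius_ratio ** 4


# R increases by 10%: the new radius is 1.10 times the old one.
increase = flow_ratio(1.10)
print(f"{(increase - 1) * 100:.2f}% increase")

# 3/4 inch pipe versus 1/2 inch pipe (the ratio is the same for diameters or radii).
pipes = flow_ratio(0.75 / 0.5)
print(pipes)

# Artery at half its normal radius.
artery = flow_ratio(0.5)
print(f"{artery * 100:.2f}% of the usual flow")
```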
Problem 17
Water flowing in a tank:
The following table shows the number of gallons W of water left in a tank t hours after it starts to
leak.
t, hours           0      3      6      9      12
W, gallons left    860    725    612    515    433
Explain exactly what we’re going to model: is it number of gallons or $\frac{dW}{dt}$?
Find the best fit line and get an estimate of the amount of water left after 8 hours.
What else can you find out about the data given?
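Here is a sketch in Python of the best fit line and the 8-hour estimate, followed by one more thing you can find out: the ratios of successive readings. The helper `linear_fit` is our own name for the standard least squares formulas. Whether a line is the right model at all is for you to judge from the ratios.

```python
def linear_fit(xs, ys):
    """Ordinary least squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


t = [0, 3, 6, 9, 12]            # hours
W = [860, 725, 612, 515, 433]   # gallons left

a, b = linear_fit(t, W)         # model: W = a + b*t
W_at_8 = a + b * 8
print(f"W = {a:.1f} + ({b:.2f}) t, so W(8) is about {W_at_8:.1f} gallons")

# Ratios of successive readings, taken every 3 hours.
ratios = [later / earlier for earlier, later in zip(W, W[1:])]
print([round(r, 3) for r in ratios])
```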
Problem 18
The following table gives the amount of waste in the US that has been recycled over the 35 years
from 1960 to 1995 in millions of tons.
Come up with a sensible way to model the curve and project to the year 2000.
Year    Millions of tons
1960    5.6
1970    8.0
1980    14.4
1990    56.2
1995    56.2
Problem 19
Amount of alcohol (in grams) in the body after n hours, consuming half a drink per hour.
Assume the first drink is at hour zero. Assume the drinking is steady and goes on for 48 hours.
n     A(n)
0     7
1     7.64
2     8.07
3     8.39
4     8.62
10    9.19
24    9.33
48    9.33
Graph this and model this data.
Problem 20
Comparisons of diameter and circumference of various circles.
Diameter (cm)    Circumference (cm)
5                15.7
7                22.0
12.5             39.3
53.4             53.4
What function models this relationship?
How else do you know what you’ve found?
Problem 21
The following table gives the approximate amount of petroleum used for transportation in the US
from 1970 to 1995. The amount of petroleum is given in quadrillion Btu (British Thermal
Units). Stats from USDOT.
Year    Q.Btu
1970    15.3
1975    17.5
1980    19.4
1985    20.0
1990    21.6
1995    23.8
Problem 22
Water Pressure: Below is a table of water pressure in atmospheres taken at 5 depths in the
Atlantic Ocean. What is the governing equation for water pressure?

Depth (feet)    Pressure (atm)
0               1
66              3
167             5
300             10
500             16
Problem 23
According to the World Bank website, Latvia’s average annual growth rate from 1998 – 2015 is
predicted to be −0.8%. If the population of Latvia in 1998 was 2.4 million people, predict the
population in 2015 using this projected growth rate.
What type of function should this be? What will the formula be?
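A constant percentage change per year suggests an exponential function. Here is a minimal sketch in Python, assuming the −0.8% rate is compounded once per year; the function name `population` is our own.

```python
def population(years_after_1998, start=2.4, annual_rate=-0.008):
    """Population in millions, assuming yearly compounding at a constant rate."""
    return start * (1 + annual_rate) ** years_after_1998


p_2015 = population(2015 - 1998)
print(round(p_2015, 3), "million")
```

Compare the formula inside the function with the one you wrote down: the base of the exponential is 1 plus the growth rate, which is less than 1 here because the rate is negative.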
Problem 24
Exponential functions can be used to model the increase in the cost of items due to inflation. If a
gallon of milk currently costs $4.95 and the annual inflation rate is projected to be a steady 3.2%,
predict the cost of a gallon of milk in 3 years and in 5 years.
What formula are you using and how did you find the formula?
Problem 25
The Richter scale, developed in 1935 by Charles F. Richter, is a device for comparing
earthquakes. The largest shocks ever recorded have a magnitude of about 8.9. It happens that a
small change in the Richter scale indicates a large change in the severity of the quake…note that
5.3 is a moderate quake. In fact, an earthquake that is 6.3 is 10 times worse than one that
measures 5.3 and a 7.3 is 100 times as powerful as a 5.3.
What kind of scaling is this?
Here is some data:

Year    Magnitude    Location
1811    8.8          New Madrid, Missouri (changed the course of the Mississippi)
1989    7.1          San Francisco
How much more powerful was the 1811 quake than the 1989 quake?
If an earthquake 1000 times more powerful than the 1989 quake happened what would be the
Richter scale reading?
Year    Magnitude    Location
2004    ?            Indian Ocean, near Indonesia

The 2004 quake was 80 times more powerful than the San Francisco quake. What was its Richter
scale reading?
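Taking the statement above at face value (each whole step on the scale means 10 times more powerful), comparing two quakes comes down to powers of 10 and base-10 logarithms. Here is a sketch in Python; the function names `power_ratio` and `magnitude_for_ratio` are made up for illustration.

```python
from math import log10


def power_ratio(magnitude_a, magnitude_b):
    """How many times more powerful quake A is than quake B."""
    return 10 ** (magnitude_a - magnitude_b)


def magnitude_for_ratio(base_magnitude, ratio):
    """Magnitude of a quake that is `ratio` times more powerful than the base quake."""
    return base_magnitude + log10(ratio)


ratio_1811_vs_1989 = power_ratio(8.8, 7.1)
print(round(ratio_1811_vs_1989, 1))

reading_1000x = magnitude_for_ratio(7.1, 1000)
print(round(reading_1000x, 1))

reading_2004 = magnitude_for_ratio(7.1, 80)
print(round(reading_2004, 1))
```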