Chapter 8 homework

Name/period __________________
Chapter notes linear regression
If there is a linear relationship between two variables, we can _______ the relationship
with a line and give its equation.
The linear model (called the least squares regression line) is just an ____________ of a
straight line through the data. The modeled line cannot go through all of the points on the
scatterplot and still be linear, but it can summarize the general pattern of the relationship.
Remember, models are not perfect; they don’t match reality exactly.
Once we have a modeled line (equation), we can predict the response variable (y) using
the explanatory variable (x). The estimate for the y-variable is called the _________
__________, and is denoted as _____. The difference between the observed y and the
predicted y is called a residual.
\(\text{residual} = y - \hat{y}\)
The least squares regression line is the “best fit” for the data, meaning that, of all the
lines that could be drawn through the scatterplot, it has the smallest sum of squared
residuals. (The line does not pass through every point; it keeps the squared vertical
distances from the points to the line as small as possible overall.)
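To make the residual and "best fit" ideas concrete, here is a minimal Python sketch. The five data points and the helper names (fit_line, sum_sq_resid) are invented for illustration and do not come from the textbook. The sketch fits the least squares line, lists each residual (observed y minus predicted y), and compares the sum of squared residuals with that of a slightly tilted line.

```python
# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Return (b0, b1): intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_sq_resid(b0, b1, xs, ys):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

b0, b1 = fit_line(xs, ys)

# residual = observed y - predicted y
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
for x, y, e in zip(xs, ys, residuals):
    print(f"x={x}  y={y}  y-hat={b0 + b1 * x:.2f}  residual={e:+.2f}")

best = sum_sq_resid(b0, b1, xs, ys)         # the least squares line
other = sum_sq_resid(b0, b1 + 0.1, xs, ys)  # same intercept, steeper slope
print(f"sum of squared residuals: LSRL={best:.3f}, tilted line={other:.3f}")
```

Some residuals are positive and some are negative, and they add up to zero for the least squares line. Any other line, such as the tilted one here, gives a larger sum of squared residuals.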
Example
Correlation and the line.
This section is optional reading. It will help build a better
understanding of regression if you read it, but it is not required.
Skip to page 176 (The regression line in real units)
We know the line of best fit is called the least squares regression line (LSRL). The LSRL
is an equation similar to what you learned in Algebra class. Back then, the equation was
in slope-intercept form, y = mx + b. In Statistics, it’s the same thing, but the equation looks
different. In Stats, the equation is \(\hat{y} = b_0 + b_1 x\); notice that the y-intercept is \(b_0\) and the
slope is \(b_1\). In your calculator, the y-intercept is a and the slope is b. This equation is
used to predict y values for a given x value. The formulas are:
slope _____________________
y-intercept __________________
These formulas are given on your formula sheet, but you will never have to use them by
hand: the calculator will always find them for you, or you will be given the slope and
y-intercept on a computer output page. You do have to know what they both mean in context.
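As a rough illustration of what the calculator does behind the scenes, the sketch below computes the slope and y-intercept from summary statistics, using the standard relationships slope = r(s_y / s_x) and intercept = y-bar minus slope times x-bar. The hours and score data are made up for this example, and predict is a hypothetical helper name.

```python
import statistics

# Hypothetical data (invented for illustration): hours studied vs. test score.
hours = [1, 2, 3, 4, 5, 6]
score = [62, 68, 71, 77, 84, 88]

n = len(hours)
x_bar, y_bar = statistics.mean(hours), statistics.mean(score)
s_x, s_y = statistics.stdev(hours), statistics.stdev(score)

# Correlation r: sum of products of z-scores, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(hours, score)) / (n - 1)

b1 = r * s_y / s_x       # slope: points of score per additional hour
b0 = y_bar - b1 * x_bar  # y-intercept: predicted score at 0 hours

def predict(x):
    """Predicted score (y-hat) for x hours of study."""
    return b0 + b1 * x

print(f"r = {r:.3f}, slope = {b1:.2f} points per hour, intercept = {b0:.2f} points")
print(f"predicted score for 3.5 hours: {predict(3.5):.1f}")
```

Notice that the slope carries the units "y-units per x-unit" and the intercept carries the units of y, which is what the context questions ask you to state. The line also always passes through the point (x-bar, y-bar).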
The slope is____________________________________________________________
The units for the slope are ________________________________________________.
The y-intercept is ______________________________________.
The units of the y-intercept are ____________________________
*******It is very important that you understand how to define the slope and y-intercept
in context perfectly.
Do the example on page 178 on your own. Calculating a regression equation.
Residuals are also very important. The formula for the residual is ____________________
Residual plots – A residual plot is a very important idea in regression analysis. A
residual plot is very similar to an ordinary scatterplot of x and y. Basically, a
residual plot is a scatterplot where the y-values are the residuals and the x-values stay the
same. Sometimes the x-values are replaced by the predicted values instead. You should be
comfortable reading both kinds. The residual plot’s main purpose
is to determine if the original data (x and y) have ______________________ form.
This happens if the residual plot has ______________________ form.
We can also use the residual plot to find residuals, raw y-values, or predicted y-values.
For example:
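The sketch below shows, in a hedged way, how a residual plot can reveal a problem with a linear model. The data are invented (y is simply x squared), so the true relationship is curved. A line is fit anyway, and the signs of the residuals are printed in order of x as a crude text version of a residual plot.

```python
# Hypothetical curved data (y = x squared), invented for illustration.
xs = [0, 1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx           # slope of the least squares line
b0 = y_bar - b1 * x_bar  # y-intercept of the least squares line

# Residual plot data: x on the horizontal axis, residual on the vertical axis.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
for x, e in zip(xs, residuals):
    print(f"x={x}  residual={e:+.2f}")

# Signs of the residuals from left to right.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print("sign pattern:", signs)
```

The residuals are positive at both ends and negative in the middle, a clear U-shaped pattern. A residual plot with a pattern like this suggests the original data are not linear; random scatter with no pattern is what we hope to see.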
The standard deviation of the residuals, \(s_e\) - The standard deviation of the residuals
gives us a measure of how much the points are spread around the regression line. The
distance of each point from the line is its residual, so \(s_e\) is simply the standard
deviation of the residuals. We always want the smallest possible residuals in a regression
analysis, so we want \(s_e\) to be small too. The residuals should also have a symmetric,
unimodal shape (consistent with a roughly normal distribution).
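A minimal sketch of how \(s_e\) is computed, assuming the usual regression formula: the square root of the sum of squared residuals divided by n − 2. The data are invented for illustration.

```python
import math

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Sum of squared residuals, then divide by n - 2 (not n - 1) for regression.
sse = sum(e ** 2 for e in residuals)
s_e = math.sqrt(sse / (n - 2))

print(f"sum of squared residuals = {sse:.3f}")
print(f"s_e = {s_e:.3f} (same units as y)")
```

Because \(s_e\) is in the same units as y, it can be read as the typical size of a prediction error made by the model.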
Example
R-squared – \(R^2\) is the only definition that you have to memorize without a full
understanding of the concept. I will try my best. Here’s the definition:
\(R^2\) is the ____________________________________________________________
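For a least squares line, \(R^2\) can be computed two ways that give the same number: square the correlation r, or take 1 minus the ratio of the variation left in the residuals to the total variation in y. The sketch below checks this on made-up data; the variable names are illustrative only.

```python
import math

# Hypothetical data (invented for illustration): hours studied vs. test score.
hours = [1, 2, 3, 4, 5, 6]
score = [62, 68, 71, 77, 84, 88]

n = len(hours)
x_bar, y_bar = sum(hours) / n, sum(score) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in score)  # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, score))

r = sxy / math.sqrt(sxx * syy)              # correlation
b1 = sxy / sxx                              # least squares slope
b0 = y_bar - b1 * x_bar                     # least squares intercept

# Variation left over in the residuals after fitting the line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, score))
r_squared = 1 - sse / syy

print(f"r = {r:.4f}, r squared = {r ** 2:.4f}")
print(f"1 - SSE/SST = {r_squared:.4f}")
print(f"about {100 * r_squared:.1f}% of the variation in score is accounted for by hours")
```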
4 Regression assumptions
Homework
1. Cereals  For many people, breakfast cereal is an important source of fiber in
their diets. Cereals also contain potassium, a mineral shown to be associated with
maintaining a healthy blood pressure. An analysis of the amount of fiber (in grams) and
the potassium content (in milligrams) in servings of 77 breakfast cereals produced the
regression model \(\widehat{\text{Potassium}} = 38 + 27\,\text{Fiber}\). If your cereal provides 9 grams of fiber
per serving, how much potassium does the model estimate you will get?
2. Horsepower  In Chapter 7’s Exercise 33 we examined the relationship between
the fuel economy (mpg) and horsepower for 15 models of cars. Further analysis produces
the regression model \(\widehat{\text{mpg}} = 46.87 - 0.084\,\text{HP}\). If the car you are thinking of buying has a
200-horsepower engine, what does this model suggest your gas mileage would be?
3. More cereal  Exercise 1 describes a regression model that estimates a cereal’s
potassium content from the amount of fiber it contains. In this context, what does it mean to
say that a cereal has a negative residual?
4. Horsepower again  Exercise 2 describes a regression model that uses a car’s
horsepower to estimate its fuel economy. In this context, what does it mean to say that a
certain car has a positive residual?
5. What slope?  If you create a regression model for predicting the Weight of a car
(in pounds) from its Length (in feet), is the slope most likely to be 3, 30, 300, or 3000?
Explain.
6. Real estate  A random sample of records of sales of homes from Feb. 15 to
Apr. 30, 1993, from the files maintained by the Albuquerque Board of Realtors gives the
Price and Size (in square feet) of 117 homes. A regression to predict Price (in thousands
of dollars) from Size has r = 0.845.
a) What are the variables and units in this regression? (List explanatory and response.)
b) What units does the slope have?
c) Do you think the slope is positive or negative? Explain.
7. More real estate  Consider the Albuquerque home sales from Exercise 6
again. The regression analysis gives the model \(\widehat{\text{Price}} = 47.82 + 0.061\,\text{Size}\).
a) Explain what the slope of the line says about housing prices and house size.
b) What price would you predict for a 3000-square-foot house in this market?
c) A real estate agent shows a potential buyer a 1200-square-foot home, saying that
the asking price is $6000 less than what one would expect to pay for a house of this size.
What is the asking price, and what is the $6000 called?
8. More slope practice  Refer to questions 1 and 2.
Explain what the slope and y-intercept mean in context.
a) \(\widehat{\text{Potassium}} = 38 + 27\,\text{Fiber}\) (Potassium is in mg and Fiber is in grams)
b) \(\widehat{\text{mpg}} = 46.87 - 0.084\,\text{HP}\)
9. Birthrates 2005  The table shows the number of live births per 1000 women
aged 15-44 years in the US, starting in 1965. (National Center for Health Statistics,
www.cdc.gov/nchs/)
Year    Rate
1965    19.4
1970    18.4
1975    14.8
1980    15.9
1985    15.6
1990    16.4
1995    14.8
2000    14.4
2005    14
a) Make a scatterplot and describe the association and what the scatterplot tells us
overall. (Enter Year as years since 1900: 65, 70, 75, etc.)
b) Find the least squares regression equation.
c) Interpret the slope in context.
d) The table gives rates only at 5-year intervals. Estimate what the rate was in 1978.
e) In 1978, the birthrate was actually 15. How close did your model come?
f) Find the residual for the year 1980.
10. Cereals again  The correlation between a cereal’s fiber and potassium
contents is r = 0.903. What percent of the variability in potassium is accounted for by the
amount of fiber that servings contain?
11. Residuals  Tell what each of the residual plots below indicates about the
appropriateness of the linear model that was fit to the data. (page 193, #11)
12. Real estate again  The regression of Price and Size of homes in Albuquerque
had \(R^2\) = 71.4%.
a. Write a sentence (in context) summarizing what the \(R^2\) says about this
regression.
b. What is the correlation between Price and Size? Explain why you chose + or -.
13. Cereal again  The correlation between a cereal’s fiber and potassium contents is
r = 0.903. What percent of the variability in potassium is accounted for by the amount of
fiber that servings contain?
14. Last cereal  For the cereal regression model predicting potassium
content (in mg) from the amount of fiber (in g) in breakfast cereals, \(s_e = 30.77\). Explain
in this context what this means.
15. Cigarettes  Is the nicotine content in a cigarette related to the “tars”? A
collection of data (in milligrams) on 29 cigarettes produced the scatterplot, residual plot,
and regression analysis shown on page 194 (#27).
a. Do you think a linear model is appropriate here? Explain.
b. Explain the meaning of \(R^2\) in this context.
c. What is the correlation between Tar and Nicotine?
16. Last cigarette  Take another look at the regression analysis of tar
and nicotine content of the cigarettes in problem 15 (page 194, #27).
a. Write the equation of the least squares regression line.
b. Estimate the nicotine content of cigarettes with 4 milligrams of tar.
c. Find the slope of the regression line and interpret its meaning in context. (Be sure to
include the units of the slope.)
d. What does the y-intercept mean in context?
e. If a new brand of cigarette contains 7 mg of tar and a nicotine level whose
residual is -0.5, what is the actual nicotine content?
17. Online clothes  An online clothing retailer keeps track of its customers’ purchases.
For those customers who signed up for the company credit card, the company also has
information on the customer’s Age and Income. A random sample of 500 of these
customers showed the following scatterplot of Total Yearly Purchases and Age:
(scatterplot is shown on page 196, top left corner)
The correlation between Total Yearly Purchases and Age is r = 0.037. Summary statistics
for the two variables are
                          Mean           SD
Age                       29.67 years    8.51 years
Total Yearly Purchases    $572.52        $253.62
a. What is the linear regression equation for predicting Total Yearly Purchases from
Age?
b. Do the assumptions and conditions for regression appear to be met?
c. What is the predicted average Total Yearly Purchases for an 18-year-old? A
50-year-old?
d. What percent of the variability in Total Yearly Purchases is accounted for by
Age?
e. Do you think the regression might be useful for this company? Explain.
f. \(s_e\) = $175. Explain what this number means in context.
For the last 4 problems, put them on your own sheet(s) of paper and staple them to this
packet. If you can do these problems with minimal help, then you are on track to do well on
the next test. I want to see a lot of effort and good writing on these problems.
1. #37 (a-g) on page 196
2. #41 (a-e) on page 197
3. #43 (a-f) on page 197
4. #45 (a-g) on page 198