Lecture I: An Overview of Regression Analysis

ECO 391- 007
Lecture Handout for Chapter 15
REGRESSION ANALYSIS
SPRING 2003
Sections 15.1, 15.2
Brief outline:
I. What is Regression Analysis?
A. Define.
B. Independent and Dependent Variables.
C. MBA Admissions Example.
D. A formal definition of regression analysis and what you can use it for.
II. Linear Equations
A. One-variable case
B. Case with many variables
III. Deterministic and Stochastic Relationships
IV. The Simple Linear Regression Model
I. What is Regression Analysis?
Regression analysis is a statistical tool that allows us to look at the impact of one variable on
another while controlling for potential confounding effects. (It holds other things constant, or, in
Latin, "Ceteris Paribus")
Examples:
B. Independent and Dependent Variables
Examples:
In each group below, which variable is the dependent variable and which are the independent ones?
Rain, Agricultural Output
Education, Earnings, Experience
Alcohol Consumption, Potential for Heart Attack, Smoking
Advertising Expenditures, Sales, Prices of Substitute Goods
Independent Variables: (also called exogenous or explanatory variables) are the variables whose
value influences or determines the value of another variable (the dependent variable).
Dependent Variables: (also called endogenous variables) are the variables whose values are
influenced by the value of the independent variable.
Examples:
(1) n is the sample size and i represents the observation number. Here n = 3.
Observation Number (i)    Independent Variable: Make of Car    Dependent Variable: Gasoline Mileage
1                         Nissan                               30
2                         Cadillac                             18
3                         Yugo                                 50
(2) Here n = 3.
Observation Number (i)    Dependent Variable: Yearly Income ($'s)    Independent Variable: Education (years)    Independent Variable: Years of Work Experience
1                         12,000                                     8                                          0
2                         20,000                                     10                                         5
3                         30,000                                     12                                         10
(3) Dependent Variable - The number of votes a candidate receives during an election:
List potential independent variables:
(4) Dependent Variable - The grade you will receive in this class:
List potential independent variables:
C. MBA Admissions Example
The Dean of B&E college needs help to determine which applicants to accept to our MBA program. He
hires you to predict how each applicant would do academically in our MBA program.
1) What factors (variables) on the applicants would you want data on?
2) How can we measure the impact of each of these variables on MBA academic performance?
D. Formal definition of regression analysis
Regression Analysis: A statistical technique that attempts to explain changes in the dependent
variable as a function of changes in independent (explanatory) variables, holding all else constant,
through the quantification of an equation.
Econometrics is what we call regression analysis when we apply it to economic phenomena.
Reasons to use regression analysis:
1) To quantify theories.
(Describe economic reality.)
2) To test our theories
(test hypothesis)
3) To measure the strength of a relationship
4) To use for forecasting
II. Linear Relationships
A. One-Variable Case
Let X = number of minutes you talk on the phone (long-distance call to Europe)
Let Y = the size of the bill for the call.
(Y denotes the dependent variable.)
Observation Number (Call #)    Dependent Variable: Y, the bill in $    Independent Variable: X, minutes
1                              1.2                                     5
2                              17.3                                    70
3                              2.6                                     15
4                              4.7                                     30
5                              7.5                                     50
There is a mathematical relationship between X and Y:
Y = f(X)
Y=
Plot these points to get a scatter plot diagram.
The points are (Xi, Yi) where i denotes the observation number.
[Graph: blank axes for the scatter plot, with Y running from 0 to 10 and X from 0 to 70.]
Points to plot:
(X1, Y1) = (5, 1.2)
(X2, Y2) = (70, 17.3)
(X3, Y3) = (15, 2.6)
(X4, Y4) = (30, 4.7)
(X5, Y5) = (50, 7.5)
Specific functional form for a linear relationship:
Yi = β0 + β1Xi
β0: constant term (or Y-intercept term).
β0 tells us the value of Y when X is zero.
(Graphically, the value of Y where the line hits the Y axis.)
β1 is called the slope term (rise over run),
or ΔY/ΔX, or (Y1 - Y0)/(X1 - X0),
where (X0, Y0) and (X1, Y1) are two points on the line.
If β1 < 0, the line slopes downward: X and Y are inversely related.
If β1 > 0, the line slopes upward: X and Y are positively related.
Specific interpretation of β1:
As X increases by one unit, Y increases by the amount β1.
[Graph: the line Yi = β0 + β1Xi drawn for β0 > 0 and β1 > 0, with the intercept β0 and the slope β1 marked.]
Looking ahead:
Regression analysis allows us to estimate the values of β0 and β1 that characterize the relationship between
X and Y.
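As a minimal sketch of the intercept and slope definitions, the Python below passes a straight line through two of the phone calls and checks the other calls against it. The helper name line_through and the choice of calls 1 and 3 are assumptions made for illustration; this is plain algebra, not an estimation method.

```python
# Phone-call data from the table above: (X = minutes, Y = bill in $).
calls = [(5, 1.2), (70, 17.3), (15, 2.6), (30, 4.7), (50, 7.5)]


def line_through(p0, p1):
    """Return (intercept, slope) of the straight line through two points."""
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)   # rise over run
    intercept = y0 - slope * x0     # value of Y when X is zero
    return intercept, slope


# Hypothetical choice: use call 1 and call 3 as the two points.
beta0, beta1 = line_through(calls[0], calls[2])
print(f"Y = {beta0:.2f} + {beta1:.2f} X")

# Check which observed calls lie on that line (within rounding error).
for x, y in calls:
    gap = y - (beta0 + beta1 * x)
    print(x, y, "on the line" if abs(gap) < 1e-9 else f"off the line by {gap:.2f}")
```

With the data as printed, calls 1, 3, 4, and 5 lie on a single line, while the 70-minute call does not.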
B. Case of Several Variables:
Example: Expenditure on food (at constant prices) as a function of the quantities of goods.
Say, consumers choose among bread, cheese and beer.
Yi = money spent on basket i, in dollars
X's = amounts of the goods in basket i
Yi = β0 + β1Xi,bread + β2Xi,cheese + β3Xi,beer
Interpretation of the coefficients:
β0:
β1:
β2:
β3:
---- still a linear relationship.
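The several-variable equation can be evaluated the same way as the one-variable one. The sketch below assumes hypothetical coefficient values (read as a fixed charge and per-unit prices) and a hypothetical basket; none of these numbers come from the lecture.

```python
# Hypothetical coefficients for Yi = b0 + b1*bread + b2*cheese + b3*beer.
# These values are made up for illustration only.
beta = {"constant": 0.0, "bread": 2.0, "cheese": 5.0, "beer": 1.5}


def food_spending(bread, cheese, beer):
    """Money spent on a basket, given the quantity of each good."""
    return (beta["constant"]
            + beta["bread"] * bread
            + beta["cheese"] * cheese
            + beta["beer"] * beer)


basket = {"bread": 3, "cheese": 1, "beer": 6}            # hypothetical basket i
more_cheese = {**basket, "cheese": basket["cheese"] + 1}

y = food_spending(**basket)
y_more = food_spending(**more_cheese)

# Holding bread and beer constant, one more unit of cheese changes Y
# by exactly the cheese coefficient.
print(y, y_more, y_more - y)
```

This is the "holding all else constant" reading of a coefficient: only one X changes between the two baskets, so the change in Y is attributable to that X alone.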
III. Deterministic and Stochastic Relationships
A deterministic relationship is one in which each value of X is paired with only one Y value. It’s
an exact relationship, of the same nature as discussed in the previous section.
Additional example 1: Let’s assume I am selling apples at a constant price.
Y = my income from selling apples ($'s)
X = number of apples I sell
A deterministic linear relationship is represented by a straight line (one-variable case) or a plane in
three-dimensional space (two-variable case), etc.
Deterministic relationship:
Yi = β0 + β1Xi
A stochastic relationship is one in which one value of X may be associated with several different
values of Y for different data points. In short, there is an underlying linear relation between X and Y,
but Y is subject to some external “noise”.
Example:
Y = yearly family expenditures on recreational activity.
X = yearly family income.
Example: Height (X) and Weight (Y) of people.
Stochastic relationship:
Yi = β0 + β1Xi + εi
The εi in the stochastic equation is called the random or stochastic error term.
The stochastic error term accounts for all of the other variables besides X that determine the value of
Y.
εi accounts for:
1) Independent explanatory variables besides X. (omitted from our equation.)
2) Measurement errors in data.
3) Incorrect functional form
4) Randomness: unpredictable occurrences
Note: Some dependent variables will have more inherent error than others.
Car prices vs. divorce rates
Regression analysis: A method of estimating stochastic relationships and analyzing the estimates.
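The difference between the two kinds of relationship can be sketched in a few lines of Python. The coefficient values, the normally distributed error, and the function names below are all assumptions made for illustration.

```python
import random

rng = random.Random(0)    # fixed seed so the run is repeatable

BETA0, BETA1 = 0.0, 0.5   # hypothetical: apples sell for $0.50 each


def deterministic_y(x):
    """Exact relationship: each X is paired with exactly one Y."""
    return BETA0 + BETA1 * x


def stochastic_y(x, noise_sd=1.0):
    """Same line plus a random error term drawn anew for each observation."""
    epsilon = rng.gauss(0.0, noise_sd)   # assumed normal noise
    return BETA0 + BETA1 * x + epsilon


x = 10
print(deterministic_y(x), deterministic_y(x))   # the same value twice
print(stochastic_y(x), stochastic_y(x))         # two different values
```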
One-variable Stochastic Relationships are best illustrated by a scatter plot diagram:
Example: height-weight stochastic relationship.
[Figure: scatter plot titled "Height/Weight relationship", with weight on the vertical axis (0 to 350) and the horizontal axis running from 50 to 85.]
Aside: On Scatter Plot Diagrams
We use scatter plot diagrams because they show us…
1) If a relationship exists between two variables.
[Figure: two scatter plots of Y against X, Sample A and Sample B.]
2) If two variables are positively (directly) or negatively (inversely) related.
[Figure: two scatter plots. Sample A: X = income, Y = consumption. Sample B: X = price of cars, Y = # of cars sold.]
3) If the relationship between two variables is linear or nonlinear.
[Figure: two scatter plots of Y against X, one labeled Linear and one labeled Nonlinear.]
4) Something about the strength of a relationship between two variables.
[Figure: two scatter plots of Y against X, Sample A and Sample B.]
IV. The Simple Linear Regression Model.
Recall that a stochastic relationship between two variables is one in which the explanatory, independent
variable explains some of the value of the dependent variable, but it is not the sole determinant of Y.
Since other variables and error in data collection might also be affecting the value of Y, we include a
random error term, εi, that accounts for everything that X does not.
Consider the general form of a stochastic equation below:
Yi = β0 + β1Xi + εi
where: β0 and β1 are coefficients
εi is the random or stochastic error term
and i denotes the observation number.
This equation shows the behavioral relationship between X and Y, and if we estimate the specific values
of β0 and β1 then we have statistically quantified the relationship.
Knowledge of the β parameters is extremely valuable in many practical applications. However, the
exact values of the β's can be known only if we have all of the population data in our possession (which we,
unfortunately, do not).
The goal of linear regression analysis is to estimate the values of β0 and β1 using sample
data.
For example,
Let Xi be a family’s income and let Yi be the family’s spending on recreational activities.
Two families who both have an income of $60,000 per year, (X1 = $60,000 and X2 = $60,000), may have
different levels of recreational spending. (Y1 = $5,000 and Y2 = $10,000)
For any given value of X, Y is said to be a random variable, meaning that Y can take on any one of a
distribution of possible values. We expect this distribution to have a mean or expected value. For
instance, ten different families who all earn $60,000 may all spend different amounts on recreation,
but we may say that on average, families who earn $60,000 per year spend $7,000 on recreation.
E(Yi | X = Xi) or E(Yi | Xi) is called the conditional expected value of the random variable Yi when X
takes on a specific value. Below is a distribution showing the different values the random variable Yi can
take on given that Xi takes a specific value (here: Xi = $60,000).
[Figure: distribution of Yi given Xi = $60,000, centered on E(Yi | Xi).]
For a linear regression model
E(Yi | Xi) = β0 + β1Xi
This is called the population regression equation.
The mean of the Y distribution at each value of X falls on the population regression line.
[Figure: distributions f(Y) of Y at X1, X2, and X3, each centered on the population regression line.]
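A small simulation can illustrate the idea that the mean of Y at a given X sits on the population regression line. The parameter values below are hypothetical, picked only so that the line passes through $7,000 at an income of $60,000 as in the recreation example; the normally distributed error is also an assumption.

```python
import random
import statistics

rng = random.Random(1)          # fixed seed for repeatability

# Hypothetical population parameters, chosen so that
# E(Y | X = 60,000) = 1,000 + 0.10 * 60,000 = 7,000.
BETA0, BETA1 = 1_000.0, 0.10
NOISE_SD = 2_000.0              # hypothetical spread of the error term


def draw_spending(income):
    """One family's recreation spending at a given income."""
    return BETA0 + BETA1 * income + rng.gauss(0.0, NOISE_SD)


income = 60_000
sample = [draw_spending(income) for _ in range(10_000)]

conditional_mean = BETA0 + BETA1 * income   # point on the population line
sample_mean = statistics.mean(sample)       # average of the simulated Y values

print(round(conditional_mean), round(sample_mean))
```

Individual families differ widely, but the average of many simulated families at the same income should land close to the value on the line.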
The actual (observed) data points and the population regression line:
[Figure: observed data points scattered around the true population regression line, E(Yi | Xi) = β0 + β1Xi.]
Note that the actual data points from a sample do not all actually fall directly on the true population
regression line.
The difference between the data points and the line is represented by the random error term.
The random error term:
εi = Yi - E(Yi | Xi)
εi = Yi - (β0 + β1Xi) or
Yi = β0 + β1Xi + εi (The Stochastic Equation)
Thus,
1) The (β0 + β1Xi) portion of the above equation is the systematic or deterministic component of the
stochastic equation. If Y depended solely upon this part of the equation, then each value of X would
only be associated with one value of Y.
2) εi is the random error term. This accounts for any part of the Y value that is explained by factors
other than X. This is the part of the equation that allows one X value to be associated with more than
one Y value. (i.e. “the garbage collector”)
Again, we do not observe the entire population to get the values of β0 and β1. We need to estimate these
values using samples.
Sample Information:
1) Ŷi = b0 + b1Xi is called the sample regression equation (estimated regression equation) that
shows the behavioral relationship between X and Y for the sample data. This equation serves as an
estimate of the true population regression line that we cannot actually measure.
This implies that b0 is an estimate of β0
and b1 is an estimate of β1.
2) ei is the estimated value of εi and it represents the distance between
observed data points and the sample regression line. It is called the residual value.
Ŷi is called the predicted (or fitted) value of Yi given X = Xi.
The actual (observed) data points and the sample regression line:
[Figure: observed data points with the population regression line, E(Yi | Xi) = β0 + β1Xi, and the sample regression line, Ŷi = b0 + b1Xi.]
ei (the residual) is an estimate of εi and it represents the difference between the actual observed Yi value
and the Ŷi value that is predicted by plugging Xi into the estimated regression line formula.
There will be n residual values, one for each data point pair:
e1, e2, e3, etc.
ei = Yi - Ŷi
or
ei = Yi - b0 - b1Xi
Example:
Y = consumption in dollars per day
X = income in dollars per day
Observation #    Xi    Yi    Ŷi, the estimated (predicted) value of Yi    ei, the estimated value of εi (the residual)
1                10    6
2                15    8
3                8     5
4                12    8
5                14    10
Suppose that we take these data points and estimate the sample regression equation.
(We would be using formulas and techniques that you will learn in 15.3.) We would estimate:
E(Yi | Xi) = β0 + β1Xi
(the population regression line)
using
Ŷi = b0 + b1Xi
(the sample regression line)
After using the method of least squares that we will learn, we find that b0 = 2 and b1 = .5, or
Ŷi = 2 + .5Xi
IN CLASS EXERCISE:
1) Graph the sample regression line. Return to the previous table and for each value of X, calculate the predicted
value of Yi, or Ŷi. Plot each of these five predicted values on the graph below. Connect these points and you
have the sample regression line. You will be graphing the points (Xi, Ŷi). As we plug each of the values of X into
the sample regression equation, we will calculate the predicted value Ŷi. This is the value of Y if we fit it perfectly
into the behavioral relationship defined by the sample regression line. Complete the fourth column of the table.
2) Plot the five original, observed data points. Label the actual, observed data points 1, 2, 3, 4, and 5.
(X1,Y1), (X2,Y2), etc.
3) On the graph, mark the distance between the sample regression line and the
actual observed data points. These distances represent the residuals. In the table above, calculate the value of the
residuals to complete the last column.
Recall that the residual is calculated as ei = Yi - Ŷi.
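One way to check your hand calculations for the table is a short Python sketch like the one below. It simply applies Ŷi = b0 + b1Xi and ei = Yi - Ŷi to the five observations, taking b0 = 2 and b1 = 0.5 as given; the variable and function names are illustrative choices.

```python
# Data from the consumption/income table (dollars per day).
X = [10, 15, 8, 12, 14]   # income
Y = [6, 8, 5, 8, 10]      # consumption

b0, b1 = 2.0, 0.5         # the handout's sample regression line


def predict(x):
    """Predicted (fitted) value of Y from the sample regression line."""
    return b0 + b1 * x


fitted = [predict(x) for x in X]
residuals = [y - y_hat for y, y_hat in zip(Y, fitted)]   # e_i = Y_i - Yhat_i

# One row per observation: number, Xi, Yi, fitted value, residual.
for i, (x, y, y_hat, e) in enumerate(zip(X, Y, fitted, residuals), start=1):
    print(i, x, y, y_hat, e)
```

A positive residual means the observed point lies above the sample regression line; a negative one means it lies below.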
[Graph: blank grid for the exercise, with Y running from 2 to 14 and X from 2 to 20.]
Next time we will study how to estimate the β's using the sample data above. (Actually, we will look for the b0 and b1 that
minimize the sum of squared residuals.) For now, let’s take for granted that the best estimates are b0 = 2 and b1 = 0.5.
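The "sum of squared residuals" criterion can be sketched as a function that scores any candidate line. The second candidate below is a hypothetical alternative chosen purely for comparison; it is not a value from the lecture, and the function name is an assumption.

```python
# Same five observations as in the consumption/income table.
X = [10, 15, 8, 12, 14]
Y = [6, 8, 5, 8, 10]


def sum_squared_residuals(b0, b1):
    """Sum of squared residuals for the candidate line Yhat = b0 + b1 * X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))


handout_line = (2.0, 0.5)    # values taken for granted in the lecture
other_line = (0.42, 0.59)    # hypothetical alternative, for comparison only

ssr_handout = sum_squared_residuals(*handout_line)
ssr_other = sum_squared_residuals(*other_line)
print(ssr_handout, ssr_other)
```

On the data as printed, the hypothetical alternative gives the smaller sum, which suggests that b0 = 2 and b1 = 0.5 are best read as convenient round numbers for the exercise rather than exact minimizers.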
Part 2: Write the intuitive interpretation of the estimated coefficients:
b0 = 2: means that…
b1 = .5: means that…
An Overview of Regression Analysis
Questions for Practice
1) To test your understanding of linear relationships, try graphing the following linear equations:
a) Y = 4 + 2X
b) Y = 4 - 2X
c) Y = 2 + 2X
d) Y = 2 + 3X
Note that larger values of the slope make the graph of the line appear steeper.
e) Try to verbally interpret the coefficients.
2) Suppose that a company installs and repairs copying machines. The company studied the relationship
between repair costs for a sample of six machines and the number of pages copied by each machine. The
goal is to identify machines whose costs are too high relative to their copying volumes. The repair costs
in dollars and the pages copied in thousands for the six machines are as follows:
Machine    Repair Cost ($)    Pages Copied (thousands)
1          85                 900
2          120                1350
3          70                 550
4          165                850
5          125                1500
6          90                 800
a) Which variable is the dependent variable and which is the independent variable? Why?
b) Make a scatter diagram of these observations.
c) Does the maintenance cost of any machine seem to be out of line?
d) Does there appear to be any relationship between repair costs and the number of pages copied? (i.e.
direct or inverse, linear or nonlinear, weak or strong.)
e) Can you think of any other independent variables that might be influencing this dependent variable?
3) a) Based on lecture to this point, write your own definition of regression analysis that makes sense to
you and memorize it.
b) What are the three primary uses for regression analysis? Give one specific example of each that we did
not discuss in class.
4) If the points (3,18) and (6,9) are two points on a straight line,
a) What is the slope of that line?
b) Are the variables X and Y positively or negatively related?
c) Interpret the value of the slope.
d) Based on the information you have been given, can you find the value of the
Y-intercept term? If so, find it.
5) Consider the following related variable pairs. Which pairs show deterministic relationships and which
show stochastic relationships? Explain.
X                                                       Y
Number of hamburgers consumed per week by person i      Person i’s weight
Number of people who pay for a ticket to ball game i    Ticket revenues from ball game i
Number of hours spent studying per week by person i     Person i’s GPA
6) Does regression analysis attempt to estimate deterministic or stochastic relationships? Explain.
7) Explain the four factors that contribute to the random error term.
8) Which dependent variable, people’s annual income or attendance at UK basketball games, would you
expect to exhibit more random (unexplainable) inherent variation and why?
9) Along with this practice sheet you were given a copy of UK’s MBA program admission application. In
an earlier class, we considered the variables that might determine an applicant’s academic success in the
program. Looking at the application, you will see that the class came up with most of the same variables
that the admissions office actually considers. If we wanted to estimate a student’s MBA GPA as a
function of these potential determinants, list the variables from the application form that we can actually
quantify (measure and use numerically) in our estimation. What unit of measure would we use for each of
these independent variables? For each variable discuss how reliable you think the data are. (An important
bit of info for this class - the word data is plural.)
10) Each year top American cities are ranked according to their ability to provide high-quality and low-cost labor to companies that are relocating. One important measure used to form the rankings is the labor
stress index, which indicates the availability of workers in the city. (The higher the index, the tighter the
job market - i.e. the more difficult for employers to find employees.) Note that one of the determinants of
this measure is the unemployment rate. The values of these two variables for each of the top 10 cities are
listed below in the table.
Obs. #    Labor Market Stress Index (Y)    Unemployment Rate (X)
1         107                              4.5%
2         107                              3.8%
3         100                              5.1%
4         100                              4.9%
5         80                               5.4%
6         100                              4.8%
7         100                              5.5%
8         93                               4.3%
9         87                               5.7%
10        80                               4.6%
(When calculating your statistics, treat the percentages as whole numbers, i.e. enter 4.5% as the number
4.5 rather than .045. The results should be comparable, but your calculations by hand will be less
tedious.)
a) What is the independent variable? Explain.
b) What is the dependent variable? Explain.
c) Is this a stochastic relationship? Explain.
d) Construct a scatter plot diagram.
e) Based on your scatter plot diagram, what is your initial conclusion about the relationship between the
labor market stress index and the unemployment rate? (Relationship positive or negative, linear or
nonlinear, strong or weak?)
11) a) Graph the true regression line and the estimated regression line assuming that o > o and
1 < 1, with each being positive. Clearly denote each line.
b) In the graph, plot one observation (data point) that is below both lines. Show for that observation the
residual, e, and the stochastic error term, ε. (2 points)
12) True/False and Explain:
a) One drawback of conducting controlled experiments is the potential for confounding effects.
b) Regression analysis is used to test theories, quantify theories, and make forecasts.
Lecture I: An Overview of Regression Analysis
Questions for Practice
KEY KEY KEY
1) When graphing a linear equation there are a few things to keep in mind. The most obvious place to
start is with the intercept term. The Y-intercept, or β0, tells the value of Y when X is zero. This is the
number that appears as the constant term in the equation. So for a) we know that one point on the line is
the point (0,4). Find another point that satisfies the equation. For instance, if X = 2, Y = 4 + 2(2) = 4 + 4
= 8. So another point on the line is the point (2,8). All you need to graph a linear function are two points.
Graph these two points and draw a line that runs through both.
[Graph: lines a, b, c, and d drawn on axes with X from 1 to 10 and Y from 1 to 8.]
Note that larger values of the slope make the graph of the line appear steeper.
e) A verbal interpretation of the coefficients would say that as X increases by one unit, Y changes by the value
of the slope. For instance, in part a), as X increases by one unit, Y increases by 2.
2) a) The repair cost is dependent because it depends upon the level of use. Pages copied would then be
the independent variable.
b) Scatter diagram:
[Figure: scatter plot of repair costs (vertical axis, 70 to 170) against # of copies (horizontal axis, 500 to 1,500).]
c) The maintenance cost of machine 4 seems to be out of line. It stands out from the other points in the
diagram.
d) Given the appearance of the scatter diagram it would seem that the variables are positively and linearly
related. The relationship appears to be very strong.
e) Other independent variables that might be influencing the repair cost could be i) how often the user
cleans the machine and how well they maintain service for the machine; ii) Do they give the machine a
rest in between running big jobs; iii) do they use the appropriate type of paper; iv) do they use the machine
as the backboard in the office’s big Nerf basketball championship; etc.
3) a) This one I leave to you.
b) Regression analysis may be used to test theories, quantify relationships, and make predictions or
forecasts. I will let you work on the examples.
4) If the points (3,18) and (6,9) are two points on a straight line,
a) the slope would be (Y2 - Y1)/(X2 - X1) or (9-18)/(6-3) or -9/3 or -3.
b) Since the slope is negative, we can assume that the variables are negatively related.
c) A slope of -3 says that as X increases by one unit, Y decreases by 3 units.
d) Based on the information you have been given, you can find the value of the
Y-intercept term. Try a little simple algebra. We know that a linear equation can be written as Y = β0 +
β1X. We know that β1 = -3. Plug in the X and Y values from one of the points. We know these points
“satisfy” the equation.
18 = β0 - 3(3) or 18 = β0 - 9 or 27 = β0
Also, if you draw the graph, plot the two points, and draw the line going through them, you can usually
see where it hits the Y-axis (although this is not always the most accurate approach).
5) i) The relationship between hamburger consumption and human weight is stochastic. While hamburger
consumption certainly might have an impact on weight, other factors besides hamburger consumption are
also important in determining weight.
ii) The relationship between ticket sales and ticket revenues is deterministic because the number of
tickets sold (as long as we know the price) completely determines the revenue from selling the tickets.
iii) The relationship between study time and GPA is stochastic because other factors in addition to study
time are essential in determining the value of GPA.
6) Regression analysis attempts to estimate stochastic relationships. The whole point
of the analysis is to explain the factors that make one observation have a different value of the dependent
variable from some other observation. With a deterministic equation, we would already know why a
difference occurred. For instance, if girl scout cookies sell for $2.50 per box and Ingrid sells 10 boxes,
her cookie revenue will be $25.00. If Constance also sells 10 boxes, she too will have revenue of $25.00.
BORING. There is not really anything there to analyze. Now suppose that Ingrid and Constance, who are
both girl scouts, hit the streets selling cookies. Ingrid sells 100 boxes and Constance sells 20 boxes. The
interesting question is to figure out why. What is the difference between these two girl scouts that might
explain the wide difference in sales? This is something regression analysis might allow us to consider. Is
age a factor? Did each girl sell in their home neighborhood? How many doors did each girl knock upon?
Did they use the phone to try to make sales? Is Ingrid more pleasant looking or more outgoing? Is
Constance less motivated? Does Ingrid come from a very big family with LOTS of relatives?
7) The random error term consists of four components:
i) Omitted explanatory variables
ii) measurement error in the data
iii) selection of the wrong functional form to represent the relationship
iv) purely random variation
8) Which dependent variable, people’s annual income or attendance at UK basketball
games, would you expect to exhibit more random (unexplainable) inherent variation and why? The
variable that has the most of this type of variation is the one that we feel we can explain the least. So for
each variable - try to think of what explains it. For basketball games, attendance might be determined by
how well the team is doing, weather, flu epidemics, school vacations, etc. We can do a pretty good job of
explaining it. Now let’s think about income. It is affected by our level of education, training, job
experience, personal connections, motivation, physical skills, etc. I wrote this question and I am not
exactly sure myself of what the answer is, but I would imagine that in KY we can probably explain and
predict attendance at basketball games better than we can predict someone’s annual income. This implies
that there are reasons two people might have different incomes that we cannot determine.
10) a) The independent variable is the unemployment rate. This variable is one of the determinants of the
stress index that tells us how tight the job market is in an area.
b) The dependent variable is the stress index. Its value is determined by, or is a function of, the unemployment
rate.
c) This is a stochastic relationship. The value of the stress index varies for other reasons besides just the
level of unemployment. (i.e. unemployment is not the sole determinant of the stress index.)
d) See below:
[Figure: scatter plot of the stress index (vertical axis, 80 to 110) against the unemployment rate (horizontal axis, 3.8 to 5.8).]
e) Although it is somewhat difficult to see given this scatter plot, it would appear that there is some sort
of linear, negative relationship although it does not look very strong.
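If the scatter plot is hard to read, one common numerical summary (not covered in this lecture) is the sample correlation coefficient: its sign gives the direction of the relationship and its size gives a rough sense of strength. The sketch below assumes the X and Y values pair up in the order the table lists them, and the function name is an illustrative choice.

```python
import math

# Data from question 10, paired in the order the table lists them
# (this pairing is an assumption about the table layout).
stress = [107, 107, 100, 100, 80, 100, 100, 93, 87, 80]       # Y
unemp = [4.5, 3.8, 5.1, 4.9, 5.4, 4.8, 5.5, 4.3, 5.7, 4.6]    # X, in percent


def sample_correlation(xs, ys):
    """Pearson correlation: sign gives direction, size gives strength."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = sample_correlation(unemp, stress)
print(round(r, 2))
```

A value that is negative but well short of -1 would be consistent with the reading above: a negative relationship that is not very strong.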
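11) a) and b)
[Figure: the estimated regression line and the true population regression line drawn on Y-X axes, with the intercept and slope of each line labeled, and one observation below both lines showing its residual, ei, and its stochastic error term, εi.]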
12) a) False: Controlled experiments allow you to avoid the problems related to
confounding effects by controlling for potential confounding factors.
b) True: These are the reasons we discussed for using regression analysis.