ECO 391-007 Lecture Handout for Chapter 15
REGRESSION ANALYSIS
SPRING 2003
Sections 15.1, 15.2

Brief outline:
I. What is Regression Analysis?
   A. Define.
   B. Independent and Dependent Variables.
   C. MBA Admissions Example.
   D. A formal definition of regression analysis and what you can use it for.
II. Linear Equations
   A. One-variable case
   B. Case with many variables
III. Deterministic and Stochastic Relationships
IV. The Simple Linear Regression Model

I. What is Regression Analysis?

Regression analysis is a statistical tool that allows us to look at the impact of one variable on another while controlling for potential confounding effects. (It holds other things constant, or, in Latin, "ceteris paribus.")

Examples:

B. Independent and Dependent Variables

Examples: In each group below, which variable is the dependent variable and which are the independent ones?
   Rain, Agricultural Output
   Education, Earnings, Experience
   Alcohol Consumption, Potential for Heart Attack, Smoking
   Advertising Expenditures, Sales, Prices of Substitute Goods

Independent Variables (also called exogenous or explanatory variables) are the variables whose values influence or determine the value of another variable (the dependent variable).

Dependent Variables (also called endogenous variables) are the variables whose values are influenced by the values of the independent variables.

Examples:

(1) n is the sample size and i represents the observation number. Here n = 3.

Observation Number (i)   Independent Variable: Make of Car   Dependent Variable: Gasoline Mileage
1                        Nissan                              30
2                        Cadillac                            18
3                        Yugo                                50

(2) Here n = 3.

Observation Number (i)   Dependent Variable: Yearly Income ($'s)   Independent Variable: Education (years)   Independent Variable: Years of Work Experience
1                        12,000                                    8                                         0
2                        20,000                                    10                                        5
3                        30,000                                    12                                        10

(3) Dependent Variable: the number of votes a candidate receives during an election.
List potential independent variables:

(4) Dependent Variable: the grade you will receive in this class.
List potential independent variables:

C.
MBA Admissions Example

The Dean of the B&E college needs help to determine which applicants to accept to our MBA program. He hires you to predict how each applicant would do academically in our MBA program.
1) What factors (variables) on the applicants would you want data on?
2) How can we measure the impact of each of these variables on MBA academic performance?

D. Formal definition of regression analysis

Regression Analysis: A statistical technique that attempts to explain changes in the dependent variable as a function of changes in independent (explanatory) variables, holding all else constant, through the quantification of an equation.

Econometrics is what we call regression analysis when we apply it to economic phenomena.

Reasons to use regression analysis:
1) To quantify theories. (Describe economic reality.)
2) To test our theories. (Test hypotheses.)
3) To measure the strength of a relationship.
4) To use for forecasting.

II. Linear Relationships

A. One-Variable Case

Let X = number of minutes you talk on the phone (long-distance call to Europe).
Let Y = the size of the bill for the call. (Y denotes the dependent variable.)

Observation Number (Call #)   Dependent Variable Y (bill in $)   Independent Variable X (minutes)
1                             1.2                                5
2                             17.3                               70
3                             2.6                                15
4                             4.7                                30
5                             7.5                                50

There is a mathematical relationship between X and Y: Y = f(X)

Y =

Plot these points to get a scatter plot diagram. The points are (Xi, Yi), where i denotes the observation number.

Points to plot:
(X1, Y1) = (5, 1.2)
(X2, Y2) = (70, 17.3)
(X3, Y3) = (15, 2.6)
(X4, Y4) = (30, 4.7)
(X5, Y5) = (50, 7.5)

[Blank graph for the scatter plot: Y on the vertical axis (0 to 10), X on the horizontal axis (0 to 70).]

Specific functional form for a linear relationship:

Yi = β0 + β1Xi

β0: constant term (or Y-intercept term). β0 tells us the value of Y when X is zero. (Graphically, the value of Y where the line hits the Y axis.)

β1 is called the slope term (rise over run), or ΔY/ΔX, or (Y1 - Y0)/(X1 - X0), where (X0, Y0) and (X1, Y1) are two points on the line.
If β1 < 0, the line slopes downward: X and Y are inversely related.
If β1 > 0, the line slopes upward: X and Y are positively related.

Specific interpretation of β1: As X increases by one unit, Y changes by the amount β1 (an increase if β1 > 0, a decrease if β1 < 0).

[Figure: the line Yi = β0 + β1Xi for β0 > 0 and β1 > 0, with Y on the vertical axis and X on the horizontal axis; β0 is the intercept and β1 is the slope.]

Looking ahead: Regression analysis allows us to estimate the values of β0 and β1 that characterize the relationship between X and Y.

B. Case of Several Variables

Example: Expenditure on food (at constant prices) as a function of the quantities of goods. Say consumers choose among bread, cheese, and beer.

Y = money spent on basket i, in dollars
X's = amounts of the goods in basket i

Yi = β0 + β1Xi,bread + β2Xi,cheese + β3Xi,beer

Interpretation of the coefficients:
β0
β1
β2
β3

This is still a linear relationship.

III. Deterministic and Stochastic Relationships

A deterministic relationship is one in which each value of X is paired with only one Y value. It is an exact relationship, of the same nature as those discussed in the previous section.

Additional example 1: Let's assume I am selling apples at a constant price.
Y = my income from selling apples ($'s)
X = number of apples I sell

A deterministic linear relationship is represented by a straight line (one-variable case), a plane in three dimensions (two-variable case), etc.

Deterministic relationship: Yi = β0 + β1Xi

A stochastic relationship is one in which one value of X may be associated with several different values of Y for different data points. In short, there is an underlying linear relation between X and Y, but Y is subject to some external "noise."

Example: Y = yearly family expenditures on recreational activity. X = yearly family income.
Example: Height (X) and Weight (Y) of people.

Stochastic relationship: Yi = β0 + β1Xi + εi

The εi in the stochastic equation is called the random or stochastic error term. The stochastic error term accounts for all of the other variables besides X that determine the value of Y.
εi accounts for:
1) Explanatory variables besides X that are omitted from our equation.
2) Measurement errors in the data.
3) Incorrect functional form.
4) Randomness: unpredictable occurrences.

Note: Some dependent variables will have more inherent error than others. (Car prices vs. divorce rates.)

Regression analysis: A method of estimating stochastic relationships and analyzing the estimates.

One-variable stochastic relationships are best illustrated by a scatter plot diagram.

Example: height-weight stochastic relationship.

[Figure: scatter plot titled "Height/Weight relationship," with weight on the vertical axis (0 to 350) and height on the horizontal axis (50 to 85).]

Aside: On Scatter Plot Diagrams

We use scatter plot diagrams because they show us:

1) If a relationship exists between two variables.
[Figure: two scatter plots of Y against X, Sample A and Sample B.]

2) If two variables are positively (directly) or negatively (inversely) related.
[Figure: two scatter plots of Y against X. Sample A: X = income, Y = consumption. Sample B: X = price of cars, Y = # of cars sold.]

3) If the relationship between two variables is linear or nonlinear.
[Figure: two scatter plots of Y against X, one labeled Linear and one labeled Nonlinear.]

4) Something about the strength of a relationship between two variables.
[Figure: two scatter plots of Y against X, Sample A and Sample B.]

IV. The Simple Linear Regression Model

Recall that a stochastic relationship between two variables is one in which the explanatory, independent variable explains some of the value of the dependent variable, but it is not the sole determinant of Y. Since other variables and error in data collection might also be affecting the value of Y, we include a random error term, ε, that accounts for everything that X does not.

Consider the general form of a stochastic equation below:

Yi = β0 + β1Xi + εi

where:
β0 and β1 are coefficients,
εi is the random or stochastic error term,
and i denotes the observation number.

This equation shows the behavioral relationship between X and Y. If we estimate the specific values of β0 and β1, then we have statistically quantified the relationship. Knowledge of the β-parameters is extremely valuable in many practical applications.
However, the exact values of the β's can be known only if we have all population data in our possession (which we, unfortunately, do not). The goal of linear regression analysis is to estimate the values of β0 and β1 using sample data.

For example, let Xi be a family's income and let Yi be the family's spending on recreational activities. Two families who both have an income of $60,000 per year (X1 = $60,000 and X2 = $60,000) may have different levels of recreational spending (Y1 = $5,000 and Y2 = $10,000).

For any given value of X, Y is said to be a random variable, meaning that Y can take on any one of a distribution of possible values. We expect this distribution to have a mean or expected value. For instance, ten different families who all earn $60,000 may all spend different amounts on recreation, but we may say that on average, families who earn $60,000 per year spend $7,000 on recreation.

E(Yi | X = Xi), or E(Yi | Xi), is called the conditional expected value of the random variable Yi when X takes on a specific value.

Below is a distribution showing the different values the random variable Yi can take on given that Xi takes a specific value (here: Xi = $60,000).

[Figure: distribution of Yi given Xi = $60,000, with E(Yi | Xi) marked on the Yi axis.]

For a linear regression model,

E(Yi | Xi) = β0 + β1Xi

This is called the population regression equation. The mean of the Y distribution at each value of X falls on the population regression line.

[Figure: distributions f(Y) of Y at X1, X2, and X3, with the mean of each distribution lying on the population regression line.]

The actual (observed) data points and the population regression line:

[Figure: observed data points scattered around the true population regression line E(Yi | Xi) = β0 + β1Xi, with Y on the vertical axis and X on the horizontal axis.]

Note that the actual data points from a sample do not all fall directly on the true population regression line. The difference between the data points and the line is represented by the random error term.

The random error term:
εi = Yi - E(Yi | Xi)
εi = Yi - (β0 + β1Xi)
or
Yi = β0 + β1Xi + εi (The Stochastic Equation)

Thus,
1) The (β0 + β1Xi) portion of the above equation is the systematic or deterministic component of the stochastic equation.
If Y depended solely upon this part of the equation, then each value of X would only be associated with one value of Y.
2) εi is the random error term. This accounts for any part of the Y value that is explained by factors other than X. This is the part of the equation that allows one X value to be associated with more than one Y value. (i.e. "the garbage collector")

Again, we do not observe the entire population to get the values of β0 and β1. We need to estimate these values using samples.

Sample Information:

1) Ŷi = b0 + b1Xi is called the sample regression equation (estimated regression equation). It shows the behavioral relationship between X and Y for the sample data. This equation serves as an estimate of the true population regression line that we cannot actually measure. This implies that b0 is an estimate of β0 and b1 is an estimate of β1.

2) ei is the estimated value of εi and it represents the distance between observed data points and the sample regression line. It is called the residual value.

Ŷi is called the predicted (or fitted) value of Yi given X = Xi.

The actual (observed) data points and the sample regression line:

[Figure: observed data points with the population regression line E(Yi | Xi) = β0 + β1Xi and the sample regression line Ŷi = b0 + b1Xi, with Y on the vertical axis and X on the horizontal axis.]

ei (the residual) is an estimate of εi and it represents the difference between the actual observed Yi value and the Ŷi value that is predicted by plugging Xi into the estimated regression line formula. There will be n residual values, one for each data point pair: e1, e2, e3, etc.

ei = Yi - Ŷi
or
ei = Yi - b0 - b1Xi

Example:
Y = consumption in dollars per day
X = income in dollars per day

Observation #   Xi   Yi   Ŷi, the estimated value of Yi (predicted value)   ei, the estimated value of εi (the residual)
1               10   6
2               15   8
3               8    5
4               12   8
5               14   10

Suppose that we take these data points and estimate the sample regression equation. (We would be using formulas and techniques that you will learn in 15.3.)
We would estimate:
E(Yi | Xi) = β0 + β1Xi (the population regression line)
using
Ŷi = b0 + b1Xi (the sample regression line)

After using the method of least squares that we will learn, we find b0 = 2 and b1 = .5, or

Ŷi = 2 + .5Xi

IN CLASS EXERCISE:

1) Graph the sample regression line. Return to the previous table and for each value of X, calculate the predicted value of Yi, or Ŷi. Plot each of these five predicted values on the graph below. Connect these points and you have the sample regression line. You will be graphing the points (Xi, Ŷi). As we plug each of the values of X into the sample regression equation, we will calculate the predicted value Ŷi. This is the value Y would take if it fit perfectly into the behavioral relationship defined by the sample regression line. Complete the fourth column of the table.

2) Plot the five original, observed data points. Label the actual, observed data points 1, 2, 3, 4, and 5: (X1, Y1), (X2, Y2), etc.

3) On the graph, mark the distance between the sample regression line and the actual observed data points. These distances represent the residuals. In the table above, calculate the value of the residuals to complete the last column. Recall that the residual is calculated as ei = Yi - Ŷi.

[Blank graph for the exercise: Y on the vertical axis (2 to 14), X on the horizontal axis (2 to 20).]

Next time we will study how to estimate the β's using the sample data above (actually, we will look for the b0 and b1 that minimize the sum of squared residuals). For now, let's take for granted that the best estimates are b0 = 2 and b1 = 0.5.

Part 2: Write the intuitive interpretation of the estimated coefficients:
b0 = 2 means that...
b1 = .5 means that...

An Overview of Regression Analysis
Questions for Practice

1) To test your understanding of linear relationships, try graphing the following linear equations:
a) Y = 4 + 2X
b) Y = 4 - 2X
c) Y = 2 + 2X
d) Y = 2 + 3X
Note that larger values of the slope make the graph of the line appear steeper.
e) Try to verbally interpret the coefficients.
2) Suppose that a company installs and repairs copying machines. The company studied the relationship between repair costs for a sample of six machines and the number of pages copied by each machine. The goal is to identify machines whose costs are too high relative to their copying volumes. The repair costs in dollars and the pages copied in thousands for the six machines are as follows:

Machine         1     2      3     4     5      6
Repair Cost     85    120    70    165   125    90
Pages Copied    900   1350   550   850   1500   800

a) Which variable is the dependent variable and which is the independent variable? Why?
b) Make a scatter diagram of these observations.
c) Does the maintenance cost of any machine seem to be out of line?
d) Does there appear to be any relationship between repair costs and the number of pages copied? (i.e. direct or inverse, linear or nonlinear, weak or strong.)
e) Can you think of any other independent variables that might be influencing this dependent variable?

3) a) Based on lecture to this point, write your own definition of regression analysis that makes sense to you and memorize it.
b) What are the three primary uses for regression analysis? Give one specific example of each that we did not discuss in class.

4) If the points (3,18) and (6,9) are two points on a straight line,
a) What is the slope of that line?
b) Are the variables X and Y positively or negatively related?
c) Interpret the value of the slope.
d) Based on the information you have been given, can you find the value of the Y-intercept term? If so, find it.

5) Consider the following related variable pairs. Which pairs show deterministic relationships and which show stochastic relationships? Explain.

X                                                        Y
Number of hamburgers consumed per week by person i       Person i's weight
Number of people who pay for a ticket to ball game i     Ticket revenues from ball game i
Number of hours spent studying per week by person i      Person i's GPA

6) Does regression analysis attempt to estimate deterministic or stochastic relationships? Explain.
7) Explain the four factors that contribute to the random error term.

8) Which dependent variable, people's annual income or attendance at UK basketball games, would you expect to exhibit more random (unexplainable) inherent variation and why?

9) Along with this practice sheet you were given a copy of UK's MBA program admission application. In an earlier class, we considered the variables that might determine an applicant's academic success in the program. Looking at the application, you will see that the class came up with most of the same variables that the admissions office actually considers. If we wanted to estimate a student's MBA GPA as a function of these potential determinants, list the variables from the application form that we can actually quantify (measure and use numerically) in our estimation. What unit of measure would we use for each of these independent variables? For each variable discuss how reliable you think the data are. (An important bit of info for this class: the word data is plural.)

10) Each year top American cities are ranked according to their ability to provide high-quality and low-cost labor to companies that are relocating. One important measure used to form the rankings is the labor stress index, which indicates the availability of workers in the city. (The higher the index, the tighter the job market, i.e. the more difficult it is for employers to find employees.) Note that one of the determinants of this measure is the unemployment rate. The values of these two variables for each of the top 10 cities are listed below in the table.

Obs. #   Labor Market Stress Index (Y)   Unemployment Rate (X)
1        107                             4.5%
2        107                             3.8%
3        100                             5.1%
4        100                             4.9%
5        80                              5.4%
6        100                             4.8%
7        100                             5.5%
8        93                              4.3%
9        87                              5.7%
10       80                              4.6%

(When calculating your statistics, treat the percentages as whole numbers, i.e. enter 4.5% as the number 4.5 rather than .045. The results should be comparable, but your calculations by hand will be less tedious.)
a) What is the independent variable? Explain.
b) What is the dependent variable? Explain.
c) Is this a stochastic relationship? Explain.
d) Construct a scatter plot diagram.
e) Based on your scatter plot diagram, what is your initial conclusion about the relationship between the labor market stress index and the unemployment rate? (Relationship positive or negative, linear or nonlinear, strong or weak?)

11) a) Graph the true regression line and the estimated regression line assuming that b0 > β0 and b1 < β1, with each being positive. Clearly denote each line.
b) In the graph, plot one observation (data point) that is below both lines. Show for that observation the residual, e, and the stochastic error term, ε. (2 points)

12) True/False and Explain:
a) One drawback of conducting controlled experiments is the potential for confounding effects.
b) Regression analysis is used to test theories, quantify theories, and make forecasts.

Lecture I: An Overview of Regression Analysis
Questions for Practice
KEY

1) When graphing a linear equation there are a few things to keep in mind. The most obvious place to start is with the intercept term. The Y-intercept, or β0, tells the value of Y when X is zero. This is the number that appears as the constant term in the equation. So for a) we know that one point on the line is the point (0,4). Find another point that satisfies the equation. For instance, if X = 2, Y = 4 + 2(2) = 4 + 4 = 8. So another point on the line is the point (2,8). All you need to graph a linear function are two points. Graph these two points and draw a line that runs through both.

[Graph: lines a, b, c, and d plotted with Y on the vertical axis and X on the horizontal axis.]

Note that larger values of the slope make the graph of the line appear steeper.

e) A verbal interpretation of the coefficient would say that as X increases by one unit, Y changes by the value of the slope. For instance, in part a), as X increases by one unit, Y increases by 2.
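The two-point approach described in answer 1) can be sketched in a few lines of Python. This is only an illustrative check, not part of the original key: the function name `points_on_line` and the choice of X = 0 and X = 2 as the two plotting values are assumptions made here.

```python
# Intercept and slope for each equation in practice question 1.
lines = {"a": (4, 2), "b": (4, -2), "c": (2, 2), "d": (2, 3)}

def points_on_line(intercept, slope, xs=(0, 2)):
    """Return (X, Y) pairs that satisfy Y = intercept + slope * X."""
    return [(x, intercept + slope * x) for x in xs]

# Two points per line are enough to draw it.
plot_points = {label: points_on_line(b0, b1) for label, (b0, b1) in lines.items()}

for label, pts in plot_points.items():
    print(label, pts)
```

The first point of each pair is the Y-intercept, and the change in Y between the two points, divided by 2, is the slope.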
2) a) The repair cost is dependent because it depends upon the level of use. Pages copied would then be the independent variable.

b) Scatter diagram:

[Figure: scatter diagram of repair costs (vertical axis, 70 to 170) against # of copies (horizontal axis, 500 to 1,500).]

c) The maintenance cost of machine 4 seems to be out of line. It stands out from the other points in the diagram.

d) Given the appearance of the scatter diagram it would seem that the variables are positively and linearly related. The relationship appears to be very strong.

e) Other independent variables that might be influencing the repair cost could be i) how often the user cleans the machine and how well they maintain service for the machine; ii) whether they give the machine a rest in between running big jobs; iii) whether they use the appropriate type of paper; iv) whether they use the machine as the backboard in the office's big Nerf basketball championship; etc.

3) a) This one I leave to you.
b) Regression analysis may be used to test theories, quantify relationships, and make predictions or forecasts. I will let you work on the examples.

4) If the points (3,18) and (6,9) are two points on a straight line,
a) the slope would be (Y2 - Y1)/(X2 - X1) or (9 - 18)/(6 - 3) or -9/3 or -3.
b) Since the slope is negative, we can conclude that the variables are negatively related.
c) A slope of -3 says that as X increases by one unit, Y decreases by 3 units.
d) Based on the information you have been given, you can find the value of the Y-intercept term. Try a little simple algebra. We know that a linear equation can be written as Y = β0 + β1X. We know that β1 = -3. Plug in the X and Y values from one of the points. We know these points "satisfy" the equation.
18 = β0 - 3(3) or 18 = β0 - 9 or 27 = β0
Also, if you draw the graph, plot the two points, and draw the line going through them, you can usually see where it hits the Y-axis (although this is not always the most accurate approach).
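The algebra in answer 4) can be checked with a short Python sketch. The helper name `line_through` is hypothetical and introduced here only for illustration; it applies the rise-over-run formula and then solves Y = β0 + β1X for the intercept.

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: the slope is undefined")
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # rearrange Y = b0 + b1*X to get b0
    return intercept, slope

# The two points from question 4.
intercept, slope = line_through((3, 18), (6, 9))
print("slope:", slope)          # -3.0
print("intercept:", intercept)  # 27.0
```

Plugging in the second point instead, 9 = β0 - 3(6), gives the same intercept, which is a quick way to confirm the answer.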
5) i) The relationship between hamburger consumption and human weight is stochastic. While hamburger consumption certainly might have an impact on weight, other factors besides hamburger consumption are also important in determining weight.
ii) The relationship between ticket sales and ticket revenues is deterministic because the number of tickets sold (as long as we know the price) completely determines the revenue from selling the tickets.
iii) The relationship between study time and GPA is stochastic because other factors in addition to study time are essential in determining the value of GPA.

6) Regression analysis attempts to estimate stochastic relationships. The whole point of the analysis is to explain the factors that make one observation have a different value of the dependent variable from some other observation. With a deterministic equation, we would already know why a difference occurred. For instance, if girl scout cookies sell for $2.50 per box and Ingrid sells 10 boxes, her cookie revenue will be $25.00. If Constance also sells 10 boxes, she too will have revenue of $25.00. BORING. There is not really anything there to analyze.

Now suppose that Ingrid and Constance, who are both girl scouts, hit the streets selling cookies. Ingrid sells 100 boxes and Constance sells 20 boxes. The interesting question is to figure out why. What is the difference between these two girl scouts that might explain the wide difference in sales? This is something regression analysis might allow us to consider. Is age a factor? Did each girl sell in her home neighborhood? How many doors did each girl knock upon? Did they use the phone to try to make sales? Is Ingrid more pleasant looking or more outgoing? Is Constance less motivated? Does Ingrid come from a very big family with LOTS of relatives?
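The contrast drawn in answers 5) and 6) can be illustrated with a small Python sketch. The deterministic part uses the $2.50 cookie price from the answer above. The stochastic part is purely hypothetical: the function `simulated_boxes_sold`, the 0.5 boxes-per-door coefficient, and the normally distributed noise are assumptions made here to show the idea, not anything estimated in the handout.

```python
import random

PRICE_PER_BOX = 2.50  # cookie price used in answer 6)

def revenue(boxes):
    """Deterministic: the same number of boxes always gives the same revenue."""
    return PRICE_PER_BOX * boxes

def simulated_boxes_sold(doors_knocked, rng):
    """Stochastic (hypothetical model): boxes = 0.5 * doors + random error."""
    error = rng.gauss(0, 5)  # assumed noise term, standing in for everything besides X
    return 0.5 * doors_knocked + error

rng = random.Random(391)  # fixed seed so the run is repeatable

# Same X, same Y: nothing to analyze.
print(revenue(10), revenue(10))

# Same X, different Y: the error term makes the two outcomes differ.
ingrid = simulated_boxes_sold(100, rng)
constance = simulated_boxes_sold(100, rng)
print(ingrid, constance)
```

In the stochastic case both scouts knock on the same number of doors, yet their sales differ because of the error term, which is the kind of variation regression analysis tries to explain.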
7) The random error term consists of four components:
i) omitted explanatory variables
ii) measurement error in the data
iii) selection of the wrong functional form to represent the relationship
iv) purely random variation

8) Which dependent variable, people's annual income or attendance at UK basketball games, would you expect to exhibit more random (unexplainable) inherent variation and why?

The variable that has the most of this type of variation is the one that we feel we can explain the least. So for each variable, try to think of what explains it. For basketball games, attendance might be determined by how well the team is doing, weather, flu epidemics, school vacations, etc. We can do a pretty good job of explaining it. Now let's think about income. It is affected by our level of education, training, job experience, personal connections, motivation, physical skills, etc. I wrote this question and I am not exactly sure myself of what the answer is, but I would imagine that in KY we can probably explain and predict attendance at basketball games better than we can predict someone's annual income. This implies that there are reasons two people might have different incomes that we cannot determine.

10) a) The independent variable is the unemployment rate. This variable is one of the determinants of the stress index that tells us how tight the job market is in an area.
b) The dependent variable is the stress index. Its value is determined by, or is a function of, the unemployment rate.
c) This is a stochastic relationship. The value of the stress index varies for other reasons besides just the level of unemployment. (i.e. unemployment is not the sole determinant of the stress index.)
d) See below:

[Figure: scatter plot of the stress index (vertical axis, 80 to 110) against the unemployment rate (horizontal axis, 3.8 to 5.8).]

e) Although it is somewhat difficult to see given this scatter plot, it would appear that there is some sort of linear, negative relationship, although it does not look very strong.

11) a) and b)

[Figure: the estimated regression line and the true population regression line, each labeled with its intercept and slope, with Y on the vertical axis and X on the horizontal axis. One observation lies below both lines, with its residual ei and its stochastic error term εi marked.]

12) a) False: Controlled experiments allow you to avoid the problems related to confounding effects by controlling for potential confounding factors.
b) True: These are the reasons we discussed for using regression analysis.
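For the in-class exercise in Section IV, the fitted values and residuals can be checked with a short Python sketch. It takes the handout's b0 = 2 and b1 = 0.5 as given rather than estimating them, and the variable names are chosen here for illustration.

```python
# Income (X) and consumption (Y) in dollars per day, from the in-class example.
X = [10, 15, 8, 12, 14]
Y = [6, 8, 5, 8, 10]

# Coefficients as stated in the handout (taken as given, not re-estimated here).
b0, b1 = 2, 0.5

# Predicted values: plug each Xi into the sample regression equation.
fitted = [b0 + b1 * x for x in X]

# Residuals: observed Y minus predicted Y.
residuals = [y - y_hat for y, y_hat in zip(Y, fitted)]

print("Obs  Xi  Yi  Yhat  ei")
for i, (x, y, y_hat, e) in enumerate(zip(X, Y, fitted, residuals), start=1):
    print(i, x, y, y_hat, e)
```

A negative residual means the observed point lies below the sample regression line, and a residual of zero means the point lies exactly on it.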