Estimating Demand

Outline
•Where do demand functions come from?
•Sources of information for demand estimation
•Cross-sectional versus time-series data
•Estimating a demand specification using the ordinary least squares (OLS) method
•Goodness-of-fit statistics

The goal of forecasting
To transform available data into equations that provide the best possible forecasts of economic variables, e.g., sales revenues and costs of production, that are crucial for management.

Demand for air travel: Houston to Orlando
Recall that our demand function was estimated as follows:

Q = 25 + 3Y + PO − 2P   [4.1]

where Q is the number of seats sold, Y is a regional income index, PO is the fare charged by a rival airline, and P is the airline's own fare. We will now explain how we estimated this demand equation.

Questions managers should ask about a forecasting equation
1. What is the "best" equation that can be obtained (estimated) from the available data?
2. What does the equation not explain?
3. What can be said about the likelihood and magnitude of forecast errors?
4. What are the profit consequences of forecast errors?

How do we get the data to estimate demand-forecasting equations?
•Customer surveys and interviews
•Controlled market studies
•Uncontrolled market data
For example, Campbell's Soup estimates demand functions from data obtained from a survey of more than 100,000 consumers.

Survey pitfalls
•Sample bias
•Response bias
•Response accuracy
•Cost

Types of data
Time-series data: historical data, i.e., a sample consisting of daily, monthly, quarterly, or annual observations on variables such as prices, income, employment, output, car sales, stock market indices, exchange rates, and so on.
Cross-sectional data: all observations in the sample are taken at the same point in time and represent different individual entities (such as households, houses, etc.).
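Equation [4.1] can be evaluated directly once values of the explanatory variables are supplied. A minimal sketch; the input values below are hypothetical, chosen only for illustration:

```python
def demand(Y, PO, P):
    """Seats sold per equation [4.1]: Q = 25 + 3Y + PO - 2P."""
    return 25 + 3 * Y + PO - 2 * P

# Hypothetical inputs: income index 100, rival fare 250, own fare 200
print(demand(Y=100, PO=250, P=200))  # 25 + 300 + 250 - 400 = 175
```

Note the signs: sales rise with regional income and the rival's fare, and fall with the airline's own fare, as the demand interpretation requires.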
Time-series data: daily observations, Korean won per dollar

Year  Month  Day  Won per dollar
1997  3      10   877
1997  3      11   880.5
1997  3      12   879.5
1997  3      13   880.5
1997  3      14   881.5
1997  3      17   882
1997  3      18   885
1997  3      19   887
1997  3      20   886.5
1997  3      21   887
1997  3      24   890
1997  3      25   891

Example of cross-sectional data

Student ID   Sex  Age  Height  Weight
777672431    M    21   6'1"    178 lbs.
231098765    M    28   5'11"   205 lbs.
111000111    F    19   5'8"    121 lbs.
898069845    F    22   5'4"    98 lbs.
000341234    M    20   6'2"    183 lbs.

Estimating demand equations using regression analysis
Regression analysis is a statistical technique that allows us to quantify the relationship between a dependent variable and one or more independent or "explanatory" variables.

Regression theory
[Figure: scatter of Y against X. X and Y are not perfectly correlated; however, there is on average a positive relationship between Y and X.]
We assume that the expected values of Y conditional on alternative values of X fall on a line:

E(Y|Xi) = β0 + β1Xi

[Figure: an observed value Y1 deviates from the line by the disturbance ε1 = Y1 − E(Y|X1).]

Specifying a single-variable model
Our model is specified as follows:

Q = f(P)

where Q is ticket sales and P is the fare. Q is the dependent variable; that is, we think that variations in Q can be explained by variations in P, the "explanatory" variable.

Estimating the single-variable model

Qi = β0 + β1Pi   [1]

Since the data points are unlikely to fall exactly on a line, (1) must be modified to include a disturbance term (εi):

Qi = β0 + β1Pi + εi   [2]

β0 and β1 are called parameters or population parameters. We estimate these parameters using the data we have available.

Estimated simple linear regression equation
The estimated simple linear regression equation is:

ŷ = b0 + b1x

•The graph of this equation is called the estimated regression line.
•b0 is the y-intercept of the line.
•b1 is the slope of the line.
•ŷ is the estimated value of y for a given x value.
Estimation process
Regression model: y = β0 + β1x + ε
Regression equation: E(y) = β0 + β1x
Unknown parameters: β0, β1
Sample data: (x1, y1), …, (xn, yn) yield the sample statistics b0 and b1, which provide estimates of β0 and β1, giving the estimated regression equation ŷ = b0 + b1x.

Least squares method
Least squares criterion:

min Σ(yi − ŷi)²

where:
yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation

Slope for the estimated regression equation:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

y-intercept for the estimated regression equation:

b0 = ȳ − b1x̄

where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
n = total number of observations

Line of best fit
The line of best fit is the one that minimizes the sum of the squared vertical distances of the sample points from the line.

The 4 steps of demand estimation using regression
1. Specification
2. Estimation
3. Evaluation
4. Forecasting

Table 4-2: Ticket Prices and Ticket Sales along an Air Route

Year and     Average Number    Average
Quarter      of Coach Seats    Fare
97-1         64.8              250
97-2         33.6              265
97-3         37.8              265
97-4         83.3              240
98-1         111.7             230
98-2         137.5             225
98-3         109.6             225
98-4         96.8              220
99-1         59.5              230
99-2         83.2              235
99-3         90.5              245
99-4         105.5             240
00-1         75.7              250
00-2         91.6              240
00-3         112.7             240
00-4         102.2             235
Mean         87.3              239.7
Std. Dev.    27.9              13.1

Simple linear regression begins by plotting the Q-P values on a scatter diagram to determine whether an approximate linear relationship exists.
[Figure: scatter plot of fare against passengers.]
[Figure: scatter plot of average one-way fare against seats sold per flight, with a possible line of best fit; illustrative demand curve Q = 330 − P.]

Note that we use X to denote the explanatory variable and Y the dependent variable. So in our example sales (Q) is the "Y" variable and fare (P) is the "X" variable: Q = Y, P = X.

Computing the OLS estimators
We estimated the equation using the statistical software package SPSS. It generated the following output:

Coefficients (dependent variable: PASS)
             Unstandardized         Standardized
             B         Std. Error   Beta           t        Sig.
(Constant)   478.690   88.036                      5.437    .000
FARE         -1.633    .367         -.766          -4.453   .001

Reading the SPSS output
From this table we see that our estimate of β0 is 478.7 and our estimate of β1 is −1.63. Thus our forecasting equation is given by:

Q̂i = 478.7 − 1.63Pi

Step 3: Evaluation
Now we will evaluate the forecasting equation using standard goodness-of-fit statistics, including:
1. The standard errors of the estimates.
2. The t-statistics of the estimated coefficients.
3. The standard error of the regression (s).
4. The coefficient of determination (R²).

Standard errors of the estimates
•We assume that the regression coefficients are normally distributed variables.
•The standard error (or standard deviation) of an estimate is a measure of the dispersion of the estimate around its mean value.
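The slope and intercept formulas above can be applied directly to the Table 4-2 data. A pure-Python sketch (not part of the lecture, which uses SPSS) that reproduces the reported coefficients:

```python
# Table 4-2: average fare (x) and average coach seats sold (y), 97-1 to 00-4
fares = [250, 265, 265, 240, 230, 225, 225, 220,
         230, 235, 245, 240, 250, 240, 240, 235]
seats = [64.8, 33.6, 37.8, 83.3, 111.7, 137.5, 109.6, 96.8,
         59.5, 83.2, 90.5, 105.5, 75.7, 91.6, 112.7, 102.2]

n = len(fares)
x_bar = sum(fares) / n
y_bar = sum(seats) / n

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fares, seats))
s_xx = sum((x - x_bar) ** 2 for x in fares)
b1 = s_xy / s_xx          # ~ -1.633, matching the SPSS FARE coefficient
b0 = y_bar - b1 * x_bar   # ~ 478.7, matching the SPSS constant
```

Hand-rolling the formulas this way confirms that the SPSS output is just these two summation formulas applied to the 16 quarterly observations.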
•As a general principle, the smaller the standard error, the better the estimate (in terms of yielding accurate forecasts of the dependent variable). The following rule of thumb is useful: the standard error of a regression coefficient should be less than half the size of the corresponding coefficient.

Computing the standard error of b1
Let s_b1 denote the standard error of our estimate of β1. Then:

s_b1 = s / √(Σxi²)

where xi = Xi − X̄, and

s² = Σei² / (n − k)

with ei = Qi − Q̂i and k the number of estimated coefficients.

By reference to the SPSS output above, the standard error of our estimate of β1 is 0.367, whereas the absolute value of the estimate is 1.63. Hence our estimate is about 4½ times the size of its standard error. The SPSS output also tells us that the t-statistic for the fare coefficient (P) is −4.453. The t-test is a way of comparing the error suggested by the null hypothesis to the standard error of the estimate.

The t-test
To test for the significance of our estimate of β1, we set the following null hypothesis, H0, and alternative hypothesis, H1:

H0: β1 = 0
H1: β1 < 0

The t-distribution is used to test for statistical significance of the estimate:

t = (b1 − β1) / s_b1 = (−1.63 − 0) / 0.367 ≈ −4.45

Coefficient of determination (R²)
The coefficient of determination, R², is defined as the proportion of the total variation in the dependent variable (Y) "explained" by the regression of Y on the independent variable (X).
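The t-statistic can be recomputed from scratch with the formulas above. A self-contained sketch on the Table 4-2 data (an assumed illustration, not the lecture's SPSS workflow):

```python
fares = [250, 265, 265, 240, 230, 225, 225, 220,
         230, 235, 245, 240, 250, 240, 240, 235]
seats = [64.8, 33.6, 37.8, 83.3, 111.7, 137.5, 109.6, 96.8,
         59.5, 83.2, 90.5, 105.5, 75.7, 91.6, 112.7, 102.2]

n, k = len(fares), 2                      # k = number of estimated coefficients
x_bar = sum(fares) / n
y_bar = sum(seats) / n
s_xx = sum((x - x_bar) ** 2 for x in fares)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(fares, seats)) / s_xx
b0 = y_bar - b1 * x_bar

# Residuals e_i = Q_i - Q_hat_i, then s, s_b1, and t
residuals = [y - (b0 + b1 * x) for x, y in zip(fares, seats)]
s = (sum(e ** 2 for e in residuals) / (n - k)) ** 0.5   # standard error of regression
se_b1 = s / s_xx ** 0.5                                 # ~ 0.367
t_stat = b1 / se_b1                                     # ~ -4.45
```

Since |t| = 4.45 is large, the estimated fare coefficient is many standard errors away from zero, so the null hypothesis β1 = 0 is rejected.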
The total variation in Y, or the total sum of squares (TSS), is defined as:

TSS = Σ(Yi − Ȳ)² = Σyi²,  where yi ≡ Yi − Ȳ

The explained variation in the dependent variable (Y) is called the regression sum of squares (RSS) and is given by:

RSS = Σ(Ŷi − Ȳ)² = Σŷi²

What remains is the unexplained variation in the dependent variable, or the error sum of squares (ESS):

ESS = Σ(Yi − Ŷi)² = Σei²

We can say the following:
•TSS = RSS + ESS, or
•Total variation = Explained variation + Unexplained variation

R² is defined as:

R² = RSS / TSS = 1 − ESS / TSS

ANOVA (dependent variable: PASS; predictors: (Constant), FARE)
             Sum of Squares   df   Mean Square   F        Sig.
Regression   6863.624         1    6863.624      19.826   .001
Residual     4846.816         14   346.201
Total        11710.440        15

Model summary (predictors: (Constant), FARE)
R      R Square   Adjusted R Square   Std. Error of the Estimate
.766   .586       .557                18.6065

We see from the SPSS model summary table that R² for this model is .586.

Notes on R²
Note that 0 ≤ R² ≤ 1.
•If R² = 0, the regression explains none of the variation: the sample points lie on a horizontal line or scatter with no linear pattern.
•If R² = 1, the sample points all lie on the regression line.
In our case R² ≈ 0.586, meaning that 58.6 percent of the variation in the dependent variable (seats sold) is explained by the regression. This is not a particularly good fit based on R², since 41.4 percent of the variation in the dependent variable is unexplained.

Standard error of the regression
The standard error of the regression (s) is given by:

s = √( Σei² / (n − k) )

The model summary tells us that s = 18.6. Regression is based on the assumption that the error term is normally distributed, so about 68.3% of the actual values of the dependent variable (seats sold) should be within one standard error (18.6 seats in our example) of their fitted values.
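The decomposition TSS = RSS + ESS and the resulting R² and s can be verified numerically. A self-contained sketch on the Table 4-2 data (assumed for illustration):

```python
fares = [250, 265, 265, 240, 230, 225, 225, 220,
         230, 235, 245, 240, 250, 240, 240, 235]
seats = [64.8, 33.6, 37.8, 83.3, 111.7, 137.5, 109.6, 96.8,
         59.5, 83.2, 90.5, 105.5, 75.7, 91.6, 112.7, 102.2]

n = len(fares)
x_bar = sum(fares) / n
y_bar = sum(seats) / n
s_xx = sum((x - x_bar) ** 2 for x in fares)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(fares, seats)) / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in fares]

tss = sum((y - y_bar) ** 2 for y in seats)               # total sum of squares
ess = sum((y - f) ** 2 for y, f in zip(seats, fitted))   # error sum of squares
rss = tss - ess                                          # regression sum of squares
r_squared = rss / tss                                    # ~ 0.586
s = (ess / (n - 2)) ** 0.5                               # ~ 18.6
```

The computed values match the ANOVA table (TSS ≈ 11710.4, ESS ≈ 4846.8) and the model summary (R² = .586, s = 18.6065).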
Also, 95.45% of the observed values of seats sold should be within two standard errors of their fitted values (37.2 seats).

Step 4: Forecasting
Recall the equation obtained from the regression results:

Q̂i = 478.7 − 1.63Pi

Our first step is to perform an "in-sample" forecast. At the most basic level, forecasting consists of inserting forecasted values of the explanatory variable P (fare) into the forecasting equation to obtain forecasted values of the dependent variable Q (passenger seats sold).

In-sample forecast of airline sales

Year and   Actual       Predicted
Quarter    Sales (Q)    Sales (Q*)    Q* − Q    (Q* − Q)²
97-1       64.8         70.44         5.64      31.81
97-2       33.6         45.94         12.34     152.28
97-3       37.8         45.94         8.14      66.26
97-4       83.3         86.77         3.47      12.04
98-1       111.7        103.1         -8.6      73.96
98-2       137.5        111.26        -26.24    688.54
98-3       109.6        111.26        1.66      2.76
98-4       96.8         119.43        22.63     512.12
99-1       59.5         103.1         43.6      1900.96
99-2       83.2         94.94         11.74     137.83
99-3       90.5         78.61         -11.89    141.37
99-4       105.5        86.77         -18.73    350.81
00-1       75.7         70.44         -5.26     27.67
00-2       91.6         86.77         -4.83     23.33
00-3       112.7        86.77         -25.93    672.36
00-4       102.2        94.94         -7.26     52.71
Sum of squared errors: 4846.80

[Figure: in-sample forecast of airline sales; actual versus fitted seats sold, 97-1 through 00-4.]

Can we make a good forecast?
Our ability to generate accurate forecasts of the dependent variable depends on two factors:
•Do we have good forecasts of the explanatory variable?
•Does our model exhibit structural stability? That is, will the causal relationship between Q and P expressed in our forecasting equation hold up over time? After all, the estimated coefficients are average values for a specific time interval (1997-2000). While the past may be a serviceable guide to the future in the case of purely physical phenomena, the same principle does not necessarily hold in the realm of social phenomena (to which the economy belongs).

Single-variable regression using Excel
We will estimate an equation and use it to predict home prices in two cities.
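The in-sample forecast amounts to plugging each observed fare into the forecasting equation and accumulating the squared errors. A minimal sketch (using the slope 1.633 from the SPSS output):

```python
fares = [250, 265, 265, 240, 230, 225, 225, 220,
         230, 235, 245, 240, 250, 240, 240, 235]
seats = [64.8, 33.6, 37.8, 83.3, 111.7, 137.5, 109.6, 96.8,
         59.5, 83.2, 90.5, 105.5, 75.7, 91.6, 112.7, 102.2]

# In-sample forecast: Q* = 478.7 - 1.633 * P for each quarter's fare
predicted = [478.7 - 1.633 * p for p in fares]
errors = [q_star - q for q_star, q in zip(predicted, seats)]
sse = sum(e ** 2 for e in errors)   # ~ 4846.8, matching the residual sum of squares
```

Note that the sum of squared in-sample forecast errors equals the residual sum of squares from the ANOVA table; an in-sample forecast simply reads the fitted values back off the regression line.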
Our data set is on the next slide.
•Income (Y) is average family income in 2003 (in thousands of dollars).
•Home price (HP) is the average price of a new or existing home in 2003 (in thousands of dollars).

City                  Income   Home Price
Akron, OH             74.1     114.9
Atlanta, GA           82.4     126.9
Birmingham, AL        71.2     130.9
Bismarck, ND          62.8     92.8
Cleveland, OH         79.2     135.8
Columbia, SC          66.8     116.7
Denver, CO            82.6     161.9
Detroit, MI           85.3     145.0
Fort Lauderdale, FL   75.8     145.3
Hartford, CT          89.1     162.1
Lancaster, PA         75.2     125.9
Madison, WI           78.8     145.2
Naples, FL            100.0    173.6
Nashville, TN         77.3     125.9
Philadelphia, PA      87.0     151.5
Savannah, GA          67.8     108.1
Toledo, OH            71.2     101.1
Washington, DC        97.4     191.9

Model specification

HP = b0 + b1Y

[Figure: scatter diagram of home prices against income.]

Excel output

Regression statistics
Multiple R           0.9070
R Square             0.8226
Adjusted R Square    0.8115
Standard Error       11.2288
Observations         18

ANOVA
             df   SS
Regression   1    9355.7155
Residual     16   2017.3695
Total        17   11373.0850

            Coefficients   Standard Error   t Stat
Intercept   -48.1104       21.5846          -2.2289
Income      2.3325         0.2708           8.6140

Equation and prediction

HP = −48.11 + 2.33Y

City            Income ($000)   Predicted HP
Meridian, MS    59.6            $90,758
Palo Alto, CA   121.0           $233,820
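The predictions can be reproduced from the rounded coefficients. A minimal sketch; note that income must be entered in thousands of dollars, the same units as the estimation data, so the predicted home price also comes out in thousands:

```python
# Rounded coefficients from the Excel regression output
b0, b1 = -48.11, 2.33

def predict_hp(income_thousands):
    """Predicted home price in $ thousands, given family income in $ thousands."""
    return b0 + b1 * income_thousands

# Out-of-sample cities (incomes in thousands of dollars)
print(round(predict_hp(59.6), 1))   # Meridian, MS  -> 90.8  (about $90,800)
print(round(predict_hp(121.0), 1))  # Palo Alto, CA -> 233.8 (about $233,800)
```

Mixing units (e.g., plugging in a dollar income of 59,600 while the intercept is on the thousands scale) produces meaningless predictions, so it is worth being explicit about units in the function's contract.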