UNIVERSITY OF PARMA
Faculty of Economy
December 20, 2011

EXAMINATIONS
Economic Statistics
Duration: 1.20 hours
Examination Aids: Calculator, Tables and Formulas

Only pocket calculators with, at most, basic statistical functions are allowed in this test. Programmable calculators are not allowed. You may use the statistical tables (Standard Normal, Student t and F) and the SUMMARY OF UNIVARIATE, BIVARIATE AND MULTIPLE REGRESSION FORMULAS attached to this part.

**Unless otherwise specified, use the conventional 5 percent significance level**

Part 1: 3 written questions worth a total of 30 points
Part 2: 3 multiple choice questions worth 1 point each for a total of 3 points

Show your work and answer clearly, concisely, and completely. Questions can be clarified, but no hints will be provided. You may begin. Good luck.

Part 1

EXERCISE 1 [10 points]

A large food company conducts a survey in a sample of 64 countries in order to identify the factors that influence the sales of a new energy bar. The dependent variable Y is monthly sales (thousands of euro), while the independent variables (regressors) are X2: television promotion expenditures (thousands of euro), X3: radio promotion expenditures (thousands of euro), X4: price in euro. We choose to fit a multiple linear regression model to the data. Below are: i) the mean and standard deviation of each variable, ii) the estimates of the model parameters obtained with the Gretl OLS procedure.

Summary Statistics, using the observations 1 - 64

Variable        Mean        Std. Dev.
Y (SALES)       44410.92    1583.45
X2 (TV)         17175.43     886.64
X3 (RADIO)       6055.35     805.32
X4 (PRICE)          2.27       1.06

Model 1: OLS, using observations 1-64
Dependent variable: SALES

                Coefficient   Std. Error   t-ratio
const            -9615.08      7654.85     -1.256077
X2 (TV)              1.53         0.11     13.909091
X3 (RADIO)           1.47         1.26      1.1666667
X4 (PRICE)        -175.52       901.42     -0.194715

Mean dependent var   44410.92         S.D. dependent var   1583.45
Sum squared resid    (not reported)   S.E. of regression   (not reported)
R-squared            0.782            Adjusted R-squared   (not reported)
F                    71.81            P-value(F)           7.74766e-020

a) [point 1] State the multiple regression equation in conventional form and interpret the meaning of the slopes b2, b3 and b4 in this problem.

b) [point 1] Predict the Sales for an expenditure in TV advertising of 30000 thousand euro, an expenditure in RADIO advertising of 5000 thousand euro and a price of 3 euro.

c) [points 2] Which type of advertising is more effective? Explain.

d) [points 2] Determine whether there is a significant relationship between Sales and the three independent variables (TV, RADIO and PRICE) at the 0.05 level of significance. Interpret the meaning of the p-value.

e) [points 2] At the 0.01 level of significance, determine whether each independent variable makes a significant contribution to the regression model (state the null hypothesis clearly). On the basis of these results, indicate the independent variables to include in this model.

f) [points 2] You decide to fit a reduced model where Y depends only on X2 (TV). The coefficient of determination for this simple linear model is R2 = 0.776. Compare, by means of a suitable test (at 0.01), the reduced model with the complete model (in which all variables appear) and state the null hypothesis clearly.

SOLUTION

a)
Y-hat = -9615.08 + 1.53*X2 + 1.47*X3 - 175.52*X4
        (7654.85)  (0.11)    (1.26)    (901.42)
n = 64, R-squared = 0.782 (standard errors in parentheses)

In this model, the regression coefficients are interpreted as follows:

1) Holding constant the spending on Radio advertising and the Price, for each increase of 1.0 thousand euro in TV advertising, Sales are estimated to increase by 1.53 thousand euro (i.e., 1,530 euro).

2) Holding constant the spending on TV advertising and the Price, for each increase of 1.0 thousand euro in Radio advertising, Sales are estimated to increase by 1.47 thousand euro (i.e., 1,470 euro).
3) Holding constant the spending on TV and Radio advertising, for each increase of 1.0 euro in Price, Sales are estimated to decrease by 175.52 thousand euro (b4 = -175.52).

4) The sample Y intercept (b1 = -9615.08) estimates the value of Sales when no money is spent on Radio and TV advertising and the Price is zero. Because these values of promotion and price are outside the range of RADIO, TV and Price observed in this market study, and are nonsensical, the value of b1 has no practical interpretation.

b) Predict the Sales for an expenditure in TV advertising of 30000 thousand euro, an expenditure in RADIO advertising of 5000 thousand euro and a price of 3 euro.

Sales_hat = -9615.08 + 1.53*30000 + 1.47*5000 - 175.52*3 = 43108.36 thousand euro

c) [points 2] Which type of advertising is more effective? Explain.

Holding the other independent variables constant, TV advertising seems to be more effective because its slope is greater. However, when the mean and the variability of the two independent variables differ, standardized versions of the regression coefficients provide more meaningful comparisons. In our case the standard deviations of TV and Radio expenditures are different, so it is better to compute the standardized partial coefficients:

\[ \beta_2 = b_2 \frac{s_{X_2}}{s_Y} = 1.53 \times \frac{886.64}{1583.45} = 0.857 \]

\[ \beta_3 = b_3 \frac{s_{X_3}}{s_Y} = 1.47 \times \frac{805.32}{1583.45} = 0.748 \]

The more effective type of advertising is TV advertising.

d) Determine whether there is a significant relationship between Sales and the three independent variables (TV, RADIO and PRICE) at the 0.05 level of significance. Interpret the meaning of the p-value.

Our next task is to test the "significance" of this model based on the F-ratio, using the standard five-step hypothesis testing procedure.
Hypotheses: H0: all slope coefficients are zero (β2 = β3 = β4 = 0). H1: at least one is different from 0.

Critical value: an F-value based on (k - 1) numerator df and (n - k) denominator df, with k = 4 estimated parameters, gives us F(3, 60) at 0.05 = 2.758.

Calculated value:

\[ F(k-1,\, n-k) = \frac{R^2/(k-1)}{(1-R^2)/(n-k)} = \frac{0.782/3}{(1-0.782)/60} = \frac{0.2607}{0.00363} \approx 71.8 \]

The Gretl output reports F-calc = 71.81.

Compare: F-calc > F-crit and thus we reject H0.

Conclusion: This model has explanatory power with respect to Y. In other words, the set of X variables in this model helps us explain or predict the Y variable. This model is SIGNIFICANT.

The p-value associated with F-calc is 7.74766e-020, which is much less than α. Put another way, the value of F-calc falls in the rejection region of the null hypothesis.

e) [points 2] At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model (state the null hypothesis clearly). On the basis of these results, indicate the independent variables to include in this model.

Our next step is to test the significance of the individual coefficients in the equation. We conduct a t-test for each b associated with an X variable. Mechanically, the test statistic is the value of bi divided by its standard error SEbi, compared with a t-critical value with (n - k) df (the Error df from the ANOVA table). Alternatively, we can use the p-values to decide whether or not to reject H0. The H0 being tested is βi = 0, which means that this variable is not related to Y. We consider each variable separately and thus must conduct as many t-tests as there are X variables.

What NULL are we considering?

Hypotheses: we are testing H0: βi = 0 (this variable is unrelated to the dependent variable) at alpha = 0.05.
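Before the individual t-tests are carried out, the arithmetic of parts b), c) and d) can be checked numerically. The following Python sketch is only an illustration: the function and variable names are hypothetical, and all figures are copied from the Gretl output and the summary statistics above. Because R-squared is used here rounded to three decimals, the F value it produces is slightly lower than the 71.81 reported by Gretl.

```python
# Figures copied from the Gretl output and summary statistics of Exercise 1.
# All names below are illustrative, not part of the exam text.
b = {"const": -9615.08, "TV": 1.53, "RADIO": 1.47, "PRICE": -175.52}
sd = {"SALES": 1583.45, "TV": 886.64, "RADIO": 805.32}
n, k, r2 = 64, 4, 0.782  # observations, estimated parameters, rounded R-squared
F_CRIT = 2.758           # F(3, 60) at the 0.05 level, as quoted in the text


def predict_sales(tv, radio, price):
    """Fitted sales (thousands of euro) from the estimated equation."""
    return b["const"] + b["TV"] * tv + b["RADIO"] * radio + b["PRICE"] * price


def standardized(name):
    """Standardized slope: b_j * s_Xj / s_Y."""
    return b[name] * sd[name] / sd["SALES"]


def overall_f(r2, n, k):
    """Overall F statistic computed from R-squared."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))


y_hat = predict_sales(30000, 5000, 3)   # part b)
beta_tv = standardized("TV")            # part c)
beta_radio = standardized("RADIO")      # part c)
f_calc = overall_f(r2, n, k)            # part d)

print(f"Predicted sales: {y_hat:.2f} thousand euro")
print(f"Standardized slopes: TV {beta_tv:.3f}, RADIO {beta_radio:.3f}")
print(f"F from rounded R-squared: {f_calc:.2f}, reject H0: {f_calc > F_CRIT}")
```

The comparison of the standardized slopes, rather than the raw slopes, is what supports the conclusion in part c).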
With the actual values of the b's and the SEb's, we obtain the t-value (one for each X variable):

t_TV = 1.53/0.11 = 13.909091
t_RADIO = 1.47/1.26 = 1.1666667
t_PRICE = -175.52/901.42 = -0.194715

and compare them with the t-critical value (it is the same for each t-test within a single model) to decide whether or not to reject the H0 associated with each X.

t-critical = 2.0003 with 60 df

At the 0.05 significance level, reject H0 if t ≥ 2.0003 or t ≤ -2.0003. Do not reject H0 if -2.0003 < t < 2.0003. The critical value from the t-table is t = 2.0003 with 60 degrees of freedom. The same conclusions hold at the 0.01 level, where the critical value is 2.6603.

Comparing the t statistics (13.909091, 1.1666667 and -0.194715) with the critical value, X2 is a significant independent variable, while X3 and X4 are not significant independent variables.

Conclusion: Variable X2 (TV) is significant and contributes to the model's explanatory power; X3 (Radio) and X4 (Price) are not significant. Only X2 (TV) should be included in the model.

f) [points 2] You decide to fit a reduced model where Y depends only on X2 (TV). The coefficient of determination for this simple linear model is R2 = 0.776. Compare, by means of a suitable test (at 0.01), the reduced model with the complete model (in which all variables appear) and state the null hypothesis clearly.

To test this, we consider two separate regressions:

(Restricted)
\[ Y = \beta_1 + \beta_2 X_2 + u \]

(Complete or Unrestricted)
\[ Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + u \]

Do the X3 and X4 variables have a significant impact? We perform an F test comparing the RSS when the X3 and X4 variables are included (RSS2 = RSSc) with the RSS when they are not (RSS1 = RSSr).

The null hypothesis is
H0: β3 = 0 and β4 = 0
H1: β3 ≠ 0 or β4 ≠ 0 (at least one of the two differs from zero)

How can we test this hypothesis? The test statistic is defined in the following way:

\[ F(m-k,\, n-m) = \frac{(RSS_r - RSS_c)/df_1}{RSS_c/df_2} = \frac{(R_c^2 - R_r^2)/df_1}{(1 - R_c^2)/df_2} = \frac{(R_c^2 - R_r^2)\,(n-m)}{(1 - R_c^2)\,(m-k)} \]

RSSr = RSS1 = sum of squared residuals of the reduced model
RSSc = RSS2 = sum of squared residuals of the complete (unrestricted) model
df1 = m - k = number of extra parameters, df2 = n - m (complete model)
k = number of parameters in the reduced model, m = number of parameters in the complete (unrestricted) model

\[ F(2,\, 60) = \frac{(R_c^2 - R_r^2)\,(n-m)}{(1 - R_c^2)\,(m-k)} = \frac{(0.782 - 0.776)\,(64 - 4)}{(1 - 0.782)\,(4 - 2)} = 0.826 \]

Decision Rule

From the F-table, F(0.01, 2, 60) = 4.98. The decision rule is to reject H0 if F > 4.98 and not to reject H0 if F ≤ 4.98. The test statistic is F = 0.826, which is smaller than 4.98 and therefore falls in the non-rejection region. Do not reject H0 and conclude that introducing X3 and X4 in the model does not provide a significant improvement in the explanation of Y.

EXERCISE 2

Use the information in the table below to answer the following questions.

Country (currency)       Big Mac Price   Exchange Rate (June 4, 1998)
United States (dollar)   $2.53           -
South Korea (won)        W 2,600         1,475 W/$
Israel (shekel)          sh 12.50        3.70 sh/$
Poland (zloty)           zl 5.30         3.46 zl/$

a. Calculate whether the won, the shekel, and the zloty are overvalued or undervalued with respect to the U.S. dollar in terms of Big Mac purchases. Explain what it means to be overvalued or undervalued.

Answer: One way to answer this is to calculate the dollar price of a Big Mac in South Korea, Israel and Poland using current exchange rates. If the dollar price is less than the price of a Big Mac in the US, then the country's currency is undervalued. Otherwise, the currency is overvalued.

In South Korea: W2600 / 1475 W/$ = $1.76. This is less than $2.53, the US price, therefore the South Korean won is undervalued.
In Israel: sh12.50 / 3.70 sh/$ = $3.38. This is greater than $2.53, therefore the Israeli shekel is overvalued.
In Poland: zl5.30 / 3.46 zl/$ = $1.53. This is less than $2.53, therefore the Polish zloty is undervalued.
Answer: A second way to answer this (solution proposed by the student DANNI Andrea) is to calculate the purchasing power parity (PPP) exchange rate using the Big Mac price from each country:

In South Korea: USPPPSK = PW / P$ = W2600 / $2.53 = 1027.67 W/$
In Israel: USPPPIS = Psh / P$ = sh12.50 / $2.53 = 4.94 sh/$
In Poland: USPPPPL = Pzl / P$ = zl5.30 / $2.53 = 2.09 zl/$

If the PPP rate is less than the exchange rate, then the country's currency is undervalued. Otherwise, the currency is overvalued. For example, for South Korea we have (USPPPSK - USEXRSK) / USEXRSK = -0.303. That is, the won is undervalued by 30.3% against the US dollar.

b. What would the exchange rates have to be in order to equalize Big Mac prices between South Korea and the United States, Israel and the United States, and Poland and the United States?

Answer: Here you can simply apply the purchasing power parity formula using the Big Mac price from each country:

In South Korea: USPPPSK = PW / P$ = W2600 / $2.53 = 1027.67 W/$
In Israel: USPPPIS = Psh / P$ = sh12.50 / $2.53 = 4.94 sh/$
In Poland: USPPPPL = Pzl / P$ = zl5.30 / $2.53 = 2.09 zl/$

These are the PPP exchange rates based on Big Mac prices.

c. If in the long run the exchange rate moves to satisfy Big Mac purchasing power parity (PPP), will the won, shekel, and zloty appreciate or depreciate in terms of dollars? Explain the logic.

Answer: In order to reach the PPP exchange rate, the won rate would have to change from 1475 W/$ to 1027.67 W/$. Since this exchange rate expresses the value of the dollar (the $ is in the denominator), the dollar would need to depreciate, and therefore the won would appreciate. This is consistent with part a: an undervalued won needs to appreciate to reach its PPP value. Similarly, the shekel exchange rate would have to change from 3.70 sh/$ to 4.94 sh/$, representing a dollar appreciation, or a shekel depreciation. For the zloty, the exchange rate would need to change from 3.46 zl/$ to 2.09 zl/$, meaning the dollar would have to depreciate, or the zloty appreciate.
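The calculations of Exercise 2 can be reproduced with a short Python sketch. It is offered only as an illustration: the function name `big_mac_summary` and the dictionary layout are hypothetical, while the prices and exchange rates are those of the table above.

```python
US_PRICE = 2.53  # Big Mac price in the United States, in dollars

# Country: (local Big Mac price, market exchange rate in local units per dollar)
countries = {
    "South Korea (won)": (2600.0, 1475.0),
    "Israel (shekel)": (12.50, 3.70),
    "Poland (zloty)": (5.30, 3.46),
}


def big_mac_summary(local_price, market_rate, us_price=US_PRICE):
    """Return dollar price, PPP rate, relative gap and valuation status."""
    dollar_price = local_price / market_rate       # first method, part a
    ppp_rate = local_price / us_price              # part b
    gap = (ppp_rate - market_rate) / market_rate   # second method, part a
    if gap < 0:
        status = "undervalued"   # local currency must appreciate (part c)
    elif gap > 0:
        status = "overvalued"    # local currency must depreciate (part c)
    else:
        status = "at parity"
    return dollar_price, ppp_rate, gap, status


results = {name: big_mac_summary(*values) for name, values in countries.items()}

for name, (usd, ppp, gap, status) in results.items():
    print(f"{name}: ${usd:.2f} per Big Mac, PPP rate {ppp:.2f}, "
          f"gap {gap:+.1%}, {status}")
```

Both methods of part a give the same classification, because a dollar price below $2.53 is equivalent to a PPP rate below the market exchange rate.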
EXERCISE 3 [10 points]

Dummy variables [5 points]

Role of Categorical (dummy) Variables in the Linear Regression Model (here you find some hints):

1. What is a Dummy variable? Types of Dummy variables. [Analysis & 1-2 sentences]
2. How many dummy variables are needed? [Analysis & 1-2 sentences]
3. How to interpret Dummy Variables. [Analysis & 1-2 sentences]
4. How do you add an interaction to a regression? [Analysis & 1-2 sentences]

What is a Dummy variable?

A Dummy variable or Indicator Variable is an artificial variable created to represent an attribute with two or more distinct categories/levels.

Things to keep in mind about dummy variables

Dummy variables assign the numbers '0' and '1' to indicate membership in any mutually exclusive and exhaustive category.

1. The number of dummy variables necessary to represent a single attribute variable is equal to the number of levels (categories) in that variable minus one.
2. For a given attribute variable, none of the dummy variables constructed can be redundant. That is, one dummy variable cannot be a constant multiple or a simple linear relation of another.
3. The interaction of two attribute variables (e.g. Gender and Marital Status) is represented by a third dummy variable which is simply the product of the two individual dummy variables.

How many dummy variables are needed?

In a multiple regression there are times when we want to include a categorical variable in our model. Examples might include gender or education level. Unfortunately we cannot just enter them directly, because they are not continuously measured variables. However, they can be represented by dummy variables.

The answer to "how many?" is easy. It is r - 1, where r = the number of categories in the categorical variable. Thus for gender (male - female) we would need only one dummy variable, with a coding scheme of Xi = 1 when the individual is male, and 0 when female.
Thus female becomes the base case, and the bi associated with Xi becomes the amount of change in Y when the individual is male versus female.

For the education level example, if we have a question on "highest level completed" with the categories (1) grammar school, (2) high school, (3) undergrad, (4) graduate, we have 4 categories and would need 3 dummy variables (4 - 1). Thus we would create 3 X variables and insert them in our regression equation. We decide on our base case; in this example it will be grammar school. This category will not have an X variable but instead will be represented by the other 3 dummy variables all being equal to zero. We can make X1 = 1 for high school, X2 = 1 for undergrad and X3 = 1 for graduate. For each of these we are comparing the category in question with the grammar school category (our base case). The best way to lay this out is to build a little table to organize the coding:

category/variable   X1   X2   X3
Grammar School       0    0    0
High School          1    0    0
Undergraduate        0    1    0
Graduate             0    0    1

Thus, no matter how many other variables are in the model, in order to include education level you will have to add 3 new dummy variables (X's) to the model.

How to Interpret Dummy Variables.

When a multiple regression equation is calculated by the computer, you get a b value associated with each X variable, whether it is a dummy variable or not. The significance of the model and of each individual coefficient is tested in the same way as before. Concluding that a dummy variable is significant (rejecting the null and concluding that this variable does contribute to the model's explanatory power) means that knowing which category a person falls in helps us explain more variance in Y. So, for instance, in the education level example above, if we test the b associated with X1 and find it to be "significant", that tells us that X1 (high school vs. grammar school) does contribute to the model's explanatory power.
Thus knowing whether a person has a high school education (versus only a grammar school education) helps us explain more of whatever the Y variable is. This process is repeated for each dummy variable, just as it is for each X variable in general.

Location Quotient [5 points]

1) For what reasons do you calculate the Location Quotient in an analysis of your local economy? [Analysis & 1-2 sentences]
2) Write the basic formula for calculating the Location Quotient when we are comparing the regional economy to the national economy, highlighting which key inputs are required to calculate the Location Quotient. [Analysis & 1-2 sentences]
3) How do you define "the basic export employment"? [Analysis & 1-2 sentences]

The Location Quotient Technique is the most commonly utilized economic base analysis method. It was developed in part to offer a slightly more complex model to the variety of analytical tools available to economic base analysts. This technique compares the local economy to a reference economy, in the process attempting to identify specializations in the local economy. The location quotient technique is based upon a calculated ratio between the local economy and the economy of some reference unit. This ratio, called an industry "location quotient", gives this technique its name.

Location Quotient Calculation

To calculate any location quotient the following formula is applied. Note that in this formula we are comparing the Regional Economy (often a county) to the National Economy. Location quotients may also be calculated that compare the county to a state.

\[ \text{Location Quotient} = \frac{\text{Regional Employment in Industry } k \,/\, \text{Total Regional Employment}}{\text{National Employment in Industry } k \,/\, \text{Total National Employment}} \]

Examining this formula more closely, we see that to allocate employment to the basic and non-basic sectors, location quotients are calculated for each industry. Simply stated, the location quotient method compares Local Employment to National Employment.
The LQ provides evidence for the existence of basic employment in a given industry.

Interpreting Calculated Location Quotients

Interpreting the Location Quotient is very simple. Only three general outcomes are possible when calculating location quotients. These outcomes are as follows:

LQ < 1.0
LQ = 1.0
LQ > 1.0

LQ < 1.0 = All Employment is Non-Basic
An LQ that is less than one suggests that local employment is less than was expected for a given industry. Therefore, that industry is not even meeting local demand for a given good or service, and all of this employment is considered non-basic by definition.

LQ = 1.0 = All Employment is Non-Basic
An LQ that is equal to one suggests that local employment is exactly sufficient to meet the local demand for a given good or service. Therefore, all of this employment is also considered non-basic, because none of these goods or services are exported to non-local areas.

LQ > 1.0 = Some Employment is Basic
An LQ that is greater than one provides evidence of basic employment for a given industry. When an LQ > 1.0, the analyst concludes that local employment is greater than expected, and it is therefore assumed that this "extra" employment is basic. These extra jobs must then export their goods and services to non-local areas which, by definition, makes them Basic sector employment.

Calculating the Level of Basic Employment

When the LQ is calculated to be greater than 1.0, it has been determined that some of that industry's employment is Basic. However, it must be emphasized that an LQ > 1.0 does not mean that all of that industry's employment is basic in nature. Recall that it is assumed that any employment "below" an LQ of 1.0 is Non-Basic; those jobs serve local demand. Only those jobs over and above what was expected for the region can be identified as Basic sector jobs.
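The location quotient and the count of jobs "over and above what was expected" can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical and the employment figures are invented purely for the example.

```python
def location_quotient(regional_k, regional_total, national_k, national_total):
    """Regional share of industry k divided by its national share."""
    return (regional_k / regional_total) / (national_k / national_total)


def basic_employment(regional_k, regional_total, national_k, national_total):
    """Jobs in industry k above the level expected from the national share.

    Returns 0 when LQ <= 1.0, since all employment is then non-basic.
    """
    expected = regional_total / national_total * national_k
    return max(0.0, regional_k - expected)


def classify(lq):
    """Apply the three-way interpretation of the location quotient."""
    return "some employment is basic" if lq > 1.0 else "all employment is non-basic"


# Hypothetical figures, invented for illustration only.
regional_k, regional_total = 1_200, 20_000          # region: industry k, all industries
national_k, national_total = 300_000, 10_000_000    # nation: industry k, all industries

lq = location_quotient(regional_k, regional_total, national_k, national_total)
basic = basic_employment(regional_k, regional_total, national_k, national_total)

print(f"LQ = {lq:.2f}: {classify(lq)}")
print(f"Basic sector jobs in industry k: {basic:.0f}")
```

With these invented figures the region employs 6% of its workers in industry k against 3% nationally, so half of the regional jobs in that industry are counted as basic.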
Because of the assumptions of the Location Quotient approach, a second formula must be applied to determine the number of Basic sector jobs when the LQ is greater than 1.0. This formula is as follows:

\[ \text{Basic Sector Employment} = \left( \frac{\text{Regional Employment Industry } k}{\text{National Employment Industry } k} - \frac{\text{Total Regional Employment}}{\text{Total National Employment}} \right) \times \text{National Employment Industry } k \]

Part 2

1) [1 points] Consider the following data related to US GDP:

GDP: $12 Trillion
Consumption: $9.2 Trillion
Government Purchases: $1.8 Trillion
US investment abroad: $0.4 Trillion
Imports: $1.5 Trillion
Domestic Investment: $1.6 Trillion
Private Savings: $2.2 Trillion

1a) What is the value of US exports?
(A) 0.5; (B) -0.9; (C) 1.2; (D) 0.0; (E) 0.9.

1b) Is the US running a trade deficit or surplus?

Answer: NX = GDP - C - I - G = 12 - 9.2 - 1.6 - 1.8 = -0.6. This is negative, so the US is running a trade deficit. Since NX = EX - IM, EX = NX + IM = -0.6 + 1.5 = 0.9, which is option (E).

2) [1 points] A student obtains the following results in several different regression problems. In which cases could you be certain that an error has been committed? (Hint: R2Y.234 denotes the coefficient of multiple determination between Y and the set of independent variables X2, X3 and X4, and R2Y.2345 denotes the coefficient of multiple determination between Y and the set of independent variables X2, X3, X4 and X5.)

a) R2Y.234 = 0.89, R2Y.2345 = 0.86
b) Adjusted R2Y.234 = 0.86, Adjusted R2Y.2345 = 0.82

Answer: statement a) is wrong, because when a further variable is introduced in the regression model the coefficient of determination cannot decrease. Statement b) is possible, because the adjusted R2 can decrease when a variable is added.

3) Suppose you calculate X̄ = 12, Ȳ = 24, sX = 2, sY = 4, and sXY = -12. How do you know you must have made a mistake in calculating these statistics?

Answer: if we compute the correlation coefficient, the result is rxy = sXY / (sX * sY) = -12 / (2 * 4) = -1.5. This is impossible, because rxy must lie between -1 and +1.
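The answers to questions 1 and 3 of Part 2 can be verified with a few lines of Python. The sketch below is only a check of the arithmetic; the variable names are hypothetical and the figures are those given in the questions.

```python
# Question 1: national accounts identity GDP = C + I + G + NX (trillions of dollars).
gdp, consumption, government, investment, imports = 12.0, 9.2, 1.8, 1.6, 1.5

net_exports = gdp - consumption - investment - government
exports = net_exports + imports            # from NX = EX - IM
trade_balance = "deficit" if net_exports < 0 else "surplus"

print(f"NX = {net_exports:.1f}, EX = {exports:.1f}, trade {trade_balance}")

# Question 3: a correlation coefficient must lie in [-1, +1].
s_x, s_y, s_xy = 2.0, 4.0, -12.0
r_xy = s_xy / (s_x * s_y)
is_valid = -1.0 <= r_xy <= 1.0

print(f"r_xy = {r_xy:.1f}, admissible: {is_valid}")
```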