Module 2
Review of Regression Analysis Methods with Applications to Time Series Analysis

Class notes for Statistics 451: Applied Time Series
Iowa State University
Copyright 2015 W. Q. Meeker.
November 11, 2015 17h 20min


Module 2 Segment 1
Fitting a Simple Linear Regression Line To a Time Series Data Set

[Figure: International Airline Passengers 1949-1960. x-axis: Year; y-axis: Thousands of Passengers.]

[Figure: International Airline Passengers 1949-1960 on Log Axis. x-axis: Year; y-axis: Thousands of Passengers (log scale).]

[Figure: Log International Airline Passengers 1949-1960, Model 1 Linear Regression Trend Line. x-axis: Year; y-axis: Log Thousands of Passengers.]

[Figure: International Airline Passengers 1949-1960, Model 1 Log Linear Regression Trend Line. x-axis: Year; y-axis: Thousands of Passengers.]


R airline.frame Data Frame

    > airline.frame
        Passengers Month     Time
      1        112   Jan 1949.000
      2        118   Feb 1949.083
      3        132   Mar 1949.167
      4        129   Apr 1949.250
      5        121   May 1949.333
      .......................
    141        508   Sep 1960.667
    142        461   Oct 1960.750
    143        390   Nov 1960.833
    144        432   Dec 1960.917

Note that "coded time" Time runs from 1949, 1949+1/12, ..., 1960+11/12.


Airline Model 1 R Output

    > airline.fit1 <- lm(log(passengers.ts) ~ Time)
    > summary(airline.fit1)

    Call: lm(formula = log(passengers.ts) ~ Time)
    Residuals:
         Min      1Q   Median      3Q    Max
     -0.3086 -0.1039 -0.01796 0.09737 0.2954

    Coefficients:
                    Value Std. Error  t value Pr(>|t|)
    (Intercept) -230.1879     6.5389 -35.2029   0.0000
           Time    0.1206     0.0033  36.0505   0.0000

    Residual standard error: 0.139 on 142 degrees of freedom
    Multiple R-Squared: 0.9015
    F-statistic: 1300 on 1 and 142 degrees of freedom, the p-value is 0


Time Series Regression Model and Ordinary Least Squares (OLS) Estimates

• Model:
  \[ y_t = \beta_0 + \beta_1 \mathrm{Time}_t + a_t \]
  where
  ◮ $a_t \sim \mathrm{IID}(0, \sigma_a^2)$ or $a_t \sim \mathrm{NID}(0, \sigma_a^2)$. The NID assumption is needed for statistical efficiency, normal-theory inference, and normal-theory prediction intervals.
  ◮ Time is numerical time or coded time (e.g., day, week, or year), and
  ◮ $t = 1, 2, \ldots, n$ indexes the equally spaced observations.
• Model parameter estimates:
  \[ \hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{Time}_t, \qquad S = \hat{\sigma}_a \]


Airline Model 1 and OLS Estimates

• Model:
  \[ \log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + a_t \]
  Time is coded time (in this case Time = 1949, 1949+1/12, ..., 1960+11/12).
• Model parameter estimates:
  \[ \widehat{\log(\mathrm{Passengers})} = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{Time} = -230. + 0.12\,\mathrm{Time} \]
  \[ S = 0.139, \qquad R^2 = .9015 \]


Module 2 Segment 2
Time Series Residuals, Residual Statistics, and ANOVA

[Figure: International Airline Passengers 1949-1960, Model 1 Log Linear Regression Trend Line. x-axis: Year; y-axis: Thousands of Passengers.]


Time Series Residuals

• Residuals are deviations between the observed response (data) and the fitted value of the response (model).
  Unobserved:
  \[ a_t = y_t - E(Y_t) = y_t - (\beta_0 + \beta_1 \mathrm{Time}) \]
  Observed:
  \[ \hat{a}_t = y_t - \hat{y}_t = y_t - (\hat{\beta}_0 + \hat{\beta}_1 \mathrm{Time}) \]


Residual Standard Deviation

• The residual standard deviation (estimate of $\sigma_a$) is
  \[ S = \sqrt{\frac{\text{Unexplained Sum of Squares}}{n-p}} = \sqrt{\frac{\sum_{t=1}^{n} \hat{a}_t^2}{n-p}} = \sqrt{\frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{n-p}} \]
  where p is the number of model parameters.
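The residual standard deviation formula above can be checked numerically. The following is a minimal sketch, assuming that R's built-in AirPassengers series can stand in for the passengers.ts object used in these notes; the data frame construction and the names a.hat, n, p, and S are illustrative choices, not part of the original slides.

```r
# Assumption: the built-in AirPassengers series (monthly totals in thousands,
# 1949-1960) stands in for the passengers.ts object used in these notes.
airline.frame <- data.frame(
  Passengers = as.numeric(AirPassengers),
  Time = as.numeric(time(AirPassengers))   # coded time: 1949, 1949 + 1/12, ...
)

# Model 1: straight-line trend in log(Passengers)
airline.fit1 <- lm(log(Passengers) ~ Time, data = airline.frame)

a.hat <- residuals(airline.fit1)           # observed residuals y_t - yhat_t
n <- length(a.hat)                         # number of observations
p <- length(coef(airline.fit1))            # number of model parameters (2 here)

# S = sqrt(unexplained sum of squares / (n - p))
S <- sqrt(sum(a.hat^2) / (n - p))

print(S)
print(summary(airline.fit1)$sigma)         # the "Residual standard error" from summary()
```

The two printed values should agree, because summary() computes its residual standard error with the same n - p divisor.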
Coefficient of Determination

• The coefficient of determination (proportion of variability explained by the model) is
  \[ R^2 = \frac{\text{Explained Sum of Squares}}{\text{Total Sum of Squares}} = 1 - \frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{n} (y_t - \bar{y})^2} \]


Module 2 Segment 3
Dummy Variable Model for Time Series Seasonality

Airline Model 2 R Output

    > airline.fit2 <- lm(log(passengers.ts) ~ Time + Month)
    > summary(airline.fit2)

    Coefficients:
                    Value Std. Error  t value Pr(>|t|)
    (Intercept) -230.6755     2.7985 -82.4290   0.0000
           Time    0.1208     0.0014  84.3990   0.0000
       MonthAug    0.2144     0.0242   8.8549   0.0000
       MonthDec   -0.0982     0.0242  -4.0541   0.0001
       MonthFeb   -0.0990     0.0242  -4.0870   0.0001
    .........................
       MonthJun    0.1198     0.0242   4.9468   0.0000
       MonthMar    0.0313     0.0242   1.2914   0.1989
       MonthMay   -0.0024     0.0242  -0.0978   0.9222
       MonthNov   -0.2121     0.0242  -8.7548   0.0000
       MonthOct   -0.0684     0.0242  -2.8228   0.0055
       MonthSep    0.0698     0.0242   2.8814   0.0046

    Residual standard error: 0.0593 on 131 degrees of freedom
    Multiple R-Squared: 0.9835
    F-statistic: 649.4 on 12 and 131 degrees of freedom, the p-value is 0


Dummy (or Indicator) Variables

Ordered months:

    > sort(unique(Month))
    [1] Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep

Letting April be the baseline month, we define
\[ \mathrm{Aug} = \mathrm{MonthAug} = \begin{cases} 1 & \text{in August} \\ 0 & \text{otherwise} \end{cases} \]
\[ \mathrm{Dec} = \mathrm{MonthDec} = \begin{cases} 1 & \text{in December} \\ 0 & \text{otherwise} \end{cases} \]
\[ \vdots \]
\[ \mathrm{Sep} = \mathrm{MonthSep} = \begin{cases} 1 & \text{in September} \\ 0 & \text{otherwise} \end{cases} \]
Because April is the baseline month, it needs no dummy variable.


Airline Model 2

    > sort(unique(Month))
    [1] Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep

• Model:
  \[ \log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + \beta_2 \mathrm{Aug} + \beta_3 \mathrm{Dec} + \cdots + \beta_{10} \mathrm{Nov} + \beta_{11} \mathrm{Oct} + \beta_{12} \mathrm{Sep} + a_t \]
• Apr:
  \[ \log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} \]
• Aug:
  \[ \log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + \beta_2 = (\beta_0 + \beta_2) + \beta_1 \mathrm{Time} \]
• Sep:
  \[ \log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + \beta_{12} = (\beta_0 + \beta_{12}) + \beta_1 \mathrm{Time} \]


Airline Model 2 OLS Estimates

• Model parameter estimates:
  \[ \widehat{\log(\mathrm{Passengers})} = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{Time} + \hat{\beta}_2 \mathrm{Aug} + \hat{\beta}_3 \mathrm{Dec} + \cdots + \hat{\beta}_{10} \mathrm{Nov} + \hat{\beta}_{11} \mathrm{Oct} + \hat{\beta}_{12} \mathrm{Sep} \]
  \[ = -231 + 0.12\,\mathrm{Time} + 0.21\,\mathrm{Aug} - 0.10\,\mathrm{Dec} + \cdots - 0.21\,\mathrm{Nov} - 0.07\,\mathrm{Oct} + 0.07\,\mathrm{Sep} \]
  \[ S = 0.0593, \qquad R^2 = 0.9835 \]

[Figure: International Airline Passengers 1949-1960, Model 2 Fitted Values. x-axis: Year; y-axis: Thousands of Passengers; labeled months: Jun, Jul, Aug, Mar, Nov.]

[Figure: International Airline Passengers 1949-1960, Model 2 Residual Time Series Plot. x-axis: Year; y-axis: residuals, roughly -0.15 to 0.10.]

[Figure: International Airline Passengers 1949-1960, Model 2 Residuals versus Fitted Values. x-axis: fitted.values(airline.fit2); y-axis: residuals(airline.fit2).]

[Figure: International Airline Passengers 1949-1960, Model 2 Normal Q-Q Plot of the Residuals. x-axis: Quantiles of Standard Normal; y-axis: residuals(airline.fit2).]

[Figure: International Airline Passengers 1949-1960, Model 2 ACF of the Residuals. Series: residuals(airline.fit2); x-axis: Lag; y-axis: ACF.]


Module 2 Segment 4
Dummy Variable plus Interaction Model for Time Series Seasonality

Airline Model 3 R Output

    > airline.fit3 <- lm(log(passengers.ts) ~ Time + Month + Time:Month)
    > summary(airline.fit3)
                     Value Std. Error  t value Pr(>|t|)
     (Intercept) -222.1037     9.1389 -24.3030   0.0000
            Time    0.1164     0.0047  24.9058   0.0000
        MonthAug  -27.7202    12.9255  -2.1446   0.0340
        MonthDec    2.0172    12.9266   0.1561   0.8763
        MonthFeb   13.9378    12.9238   1.0785   0.2830
    ...........................................
        MonthOct  -10.3970    12.9260  -0.8043   0.4228
        MonthSep   -7.2389    12.9258  -0.5600   0.5765
    TimeMonthAug    0.0143     0.0066   2.1611   0.0327
    TimeMonthDec   -0.0011     0.0066  -0.1634   0.8705
    ...........................................
    TimeMonthOct    0.0053     0.0066   0.7991   0.4258
    TimeMonthSep    0.0037     0.0066   0.5655   0.5728

    Residual standard error: 0.05591 on 120 degrees of freedom
    Multiple R-Squared: 0.9865
    F-statistic: 382.4 on 23 and 120 degrees of freedom, the p-value is 0


Airline Model 3 and OLS Estimates

    > sort(unique(Month))
    [1] Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep

• Model:
  \[ \log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + \beta_2 \mathrm{Aug} + \beta_3 \mathrm{Dec} + \cdots + \beta_{11} \mathrm{Oct} + \beta_{12} \mathrm{Sep} + \beta_{13} \mathrm{TimeMonthAug} + \cdots + \beta_{22} \mathrm{TimeMonthOct} + \beta_{23} \mathrm{TimeMonthSep} + a_t \]
• where
  \[ \mathrm{TimeMonthAug} = \mathrm{Time} \times \mathrm{MonthAug}, \quad \ldots, \quad \mathrm{TimeMonthOct} = \mathrm{Time} \times \mathrm{MonthOct}, \quad \mathrm{TimeMonthSep} = \mathrm{Time} \times \mathrm{MonthSep} \]

[Figure: International Airline Passengers 1949-1960, Model 3 Fitted Values. x-axis: Year; y-axis: Thousands of Passengers; labeled months: July, Aug, Jun, Mar, Nov.]


ANOVA Tables for Model 2 and Model 3

    > anova(airline.fit2)
    Analysis of Variance Table
    Response: log(passengers.ts)
    Terms added sequentially (first to last)
               Df Sum of Sq  Mean Sq  F Value Pr(F)
         Time   1  25.12337 25.12337 7143.588     0
        Month  11   2.28429  0.20766   59.047     0
    Residuals 131   0.46072  0.00352

    > anova(airline.fit3)
    Analysis of Variance Table
    Response: log(passengers.ts)
    Terms added sequentially (first to last)
                Df Sum of Sq  Mean Sq  F Value       Pr(F)
          Time   1  25.12337 25.12337 8037.763 0.000000000
         Month  11   2.28429  0.20766   66.438 0.000000000
    Time:Month  11   0.08564  0.00779    2.491 0.007494726
     Residuals 120   0.37508  0.00313


Comparing Model 2 and Model 3

To test the statistical significance of the Time:Month terms, compare the reduced model (Model 2) with the full model (Model 3):
\[ F = \frac{(\mathrm{SSE}_{\mathrm{Reduced}} - \mathrm{SSE}_{\mathrm{Full}}) / \#\text{parameters tested}}{\mathrm{SSE}_{\mathrm{Full}} / \mathrm{df}_{\mathrm{Full}}} = \frac{(0.46072 - 0.37508)/11}{0.37508/120} = 2.49 \]

    > qf(.95,11,120)
    [1] 1.86929

Thus, because $F = 2.49 > F_{(.95;11,120)} = 1.87$, we conclude that there is evidence of differing slopes at the 5% level of significance.
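The F statistic and critical value for the Model 2 versus Model 3 comparison can be reproduced from the two ANOVA tables alone. This is a minimal sketch that uses only the residual sums of squares and degrees of freedom printed there; the variable names are illustrative and do not come from the notes.

```r
# Residual (unexplained) sums of squares copied from the ANOVA tables
sse.reduced <- 0.46072   # Model 2: Time + Month
sse.full    <- 0.37508   # Model 3: Time + Month + Time:Month

df.tested <- 11          # number of Time:Month parameters being tested
df.full   <- 120         # residual degrees of freedom of the full model

# Extra-sum-of-squares F statistic
F.stat <- ((sse.reduced - sse.full) / df.tested) / (sse.full / df.full)

# Critical value for a test at the 5% level
F.crit <- qf(0.95, df.tested, df.full)

print(round(F.stat, 2))
print(F.crit)
print(F.stat > F.crit)   # TRUE means the interaction terms are significant at 5%
```

When both fitted model objects are available, calling anova() with the reduced and full fits as its two arguments performs the same comparison directly.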
Alternatively, the p-value is Pr(F > 2.49) = 0.0075.

    > 1-pf(2.49,11,120)
    [1] 0.007510861


Module 2 Segment 5
Dummy Variable plus Interaction Model without Transformation

Airline Model 4 R Output

    > airline.fit4 <- lm(passengers.ts ~ Time + Month + Time:Month)
    > summary(airline.fit4)
                       Value Std. Error  t value Pr(>|t|)
     (Intercept) -57192.9980  2618.7677 -21.8397   0.0000
            Time     29.3951     1.3397  21.9417   0.0000
        MonthAug -26257.8983  3703.8125  -7.0894   0.0000
        MonthDec   2005.7773  3704.1284   0.5415   0.5892
        MonthFeb   9479.1992  3703.3389   2.5596   0.0117
    ...........................................
        MonthOct  -2134.5245  3703.9704  -0.5763   0.5655
        MonthSep  -9219.5218  3703.8915  -2.4891   0.0142
    TimeMonthAug     13.4685     1.8946   7.1089   0.0000
    TimeMonthDec     -1.0385     1.8946  -0.5481   0.5846
    ...........................................
    TimeMonthOct      1.0839     1.8946   0.5721   0.5683
    TimeMonthSep      4.7273     1.8946   2.4951   0.0140

    Residual standard error: 16.02 on 120 degrees of freedom
    Multiple R-Squared: 0.985
    F-statistic: 343.4 on 23 and 120 degrees of freedom, the p-value is 0

[Figure: International Airline Passengers 1949-1960, Model 4 Fitted Values. x-axis: Year; y-axis: Thousands of Passengers.]

[Figure: International Airline Passengers 1949-1960, Model 4 Residual Time Series Plot. x-axis: Year; y-axis: residuals, roughly -20 to 20.]

[Figure: International Airline Passengers 1949-1960, Model 4 Residuals versus Fitted Values. x-axis: fitted.values(airline.fit4); y-axis: residuals(airline.fit4).]

[Figure: International Airline Passengers 1949-1960, Model 4 Normal Q-Q Plot of the Residuals. x-axis: Quantiles of Standard Normal; y-axis: residuals(airline.fit4).]

[Figure: International Airline Passengers 1949-1960, Model 4 Residual ACF. Series: residuals(airline.fit4); x-axis: Lag; y-axis: ACF.]


Variability Often Depends on Level

• Possible model: Stock Price varies ±10% of E(Price).
• Example:
  Stock #1: E(Price) = 10, Price = 10 ± 10%, Price = 10 ± 1
  Stock #2: E(Price) = 100, Price = 100 ± 10%, Price = 100 ± 10
  This implies nonconstant variance!
• Percent change prediction:
  \[ \mathrm{Price} = \left[ \widehat{\mathrm{Price}}/1.10,\; 1.10 \times \widehat{\mathrm{Price}} \right] \]
  \[ \log(\mathrm{Price}) = \widehat{\log(\mathrm{Price})} \pm \log(1.10) \]
  \[ \log(\mathrm{Price}) = \widehat{\log(\mathrm{Price})} \pm .0953 \]


Module 2 Segment 6
Interpretation of Log-Linear Model Parameters and Introduction to Prediction Intervals

Interpretation of Regression Coefficients in a Log-Linear Model
(fit linear model after taking the logarithm of the response)

Example:
\[ Y_t = \beta_0 + \beta_1 t + a_t \]
\[ E(Y_t) = \beta_0 + \beta_1 t = E(Y_{t-1}) + \beta_1 \]
\[ \exp[E(Y_t)] = \exp[E(Y_{t-1})] \times \exp[\beta_1] \]
For the airline data,
\[ \exp[\hat{\beta}_1] = \exp[.1206] = 1.128 \]
or about 13% growth/year.

In general:
  100(exp(slope) − 1) = percent growth per unit time
and
  100(slope) ≈ percent growth per unit time
if the slope is small (e.g., a few percent).


Prediction of the Response in a Regression Model

• A 100(1 − α)% prediction interval for a future value of $Y_t$ is
  \[ \hat{y}_t \pm t_{(1-\alpha/2,\, n-p)}\, S_{(Y - \hat{Y})} \]
  where the prediction standard error, $S_{(Y - \hat{Y})}$, is an estimate of $\sqrt{\mathrm{Var}(Y - \hat{Y})}$.
• Familiar formulas for the prediction variance for a regression response at $x_0$:
  Simple:
  \[ \mathrm{Var}(Y - \hat{Y}) = \sigma^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} + 1 \right] \]
  Multiple:
  \[ \mathrm{Var}(Y - \hat{Y}) = \sigma^2 \left[ x_0' (X'X)^{-1} x_0 + 1 \right] \]


Prediction Standard Error for an Independent Future Observation

• There are two sources of variability in a statistical prediction:
  \[ \mathrm{Var}(Y - \hat{Y}) = \mathrm{Var}(Y) + \mathrm{Var}(\hat{Y}) \]
  An estimate of this variance is
  \[ S^2_{(Y - \hat{Y})} = S^2 + S^2_{\hat{Y}} \]
  where $S^2 = \mathrm{MSE} = \sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2 / (n - p)$.
• As $n \to \infty$,
  \[ S^2_{\hat{Y}} \to 0, \qquad S^2 \to \sigma^2 \]
  Thus, for large n,
  \[ S^2_{(Y - \hat{Y})} \approx S^2 \]
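The prediction interval and the two-part prediction variance above can be illustrated for Airline Model 1. This is a minimal sketch, assuming that R's built-in AirPassengers series stands in for passengers.ts; the future time x0 = 1961 and all variable names are hypothetical choices made for illustration. It builds the interval from the simple-regression formula and compares it with the one returned by predict().

```r
# Assumption: the built-in AirPassengers series stands in for passengers.ts.
airline.frame <- data.frame(
  Passengers = as.numeric(AirPassengers),
  Time = as.numeric(time(AirPassengers))
)
fit <- lm(log(Passengers) ~ Time, data = airline.frame)

x0 <- 1961                                  # hypothetical future time point
x  <- airline.frame$Time
n  <- length(x)
p  <- length(coef(fit))                     # 2 parameters
S  <- summary(fit)$sigma                    # residual standard deviation

# Two sources of variability: future observation (S^2) and fitted line (S^2_Yhat)
S2.yhat <- S^2 * (1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
S.pred  <- sqrt(S^2 + S2.yhat)              # prediction standard error

y.hat  <- sum(coef(fit) * c(1, x0))         # fitted log(Passengers) at x0
manual <- y.hat + c(-1, 1) * qt(0.975, n - p) * S.pred

# Same 95% interval from predict()
builtin <- predict(fit, newdata = data.frame(Time = x0),
                   interval = "prediction", level = 0.95)

print(manual)
print(builtin)
print(exp(manual))                          # back-transformed to thousands of passengers
```

Exponentiating the endpoints gives an interval on the original passenger scale. With n = 144, the S2.yhat term is small relative to S^2, in line with the large-n approximation stated above.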