R: 2/3/16

Time Series Forecasting Homework Problems

The videos for this module walk through the first homework problem; follow along to complete it. The video tutorials listed on the time series day of the class schedule demonstrate how to do time series forecasting on the Amtrak example. Work through the tutorial before working on this homework.

1. Department Store Problem

Data Description

The DeptStore.csv file contains the quarterly sales of a department store from 2009 through 2014. Forecast the sales for the year 2015. The screen capture below shows the first several records in the downloaded raw data.

[Screen capture: top of raw data]

Prepare the Data

You will not use the data in columns A and B directly in your model building. Instead, you will use them to generate the data that you will use to build the model. Create a textual value for QtrName (1st, 2nd, 3rd, 4th) so that dummy variables will be generated. The time component in this model is the quarter, so create t_qtr = 1, 2, 3, …, 27, 28.

The first several records and the last records below show how the data should look when it is ready to be partitioned for use in the forecasting process.

[Screen capture: top of prepped data]

[Screen capture: bottom of prepped data]

The dataset contains values through the end of 2014. Create input values for the year 2015 so that you will be able to forecast the sales for the quarters in 2015.

Create Partitions

The table below summarizes which records to use for the four datasets. Use VisMiner to create the four datasets. The video tutorials walk you through this process.

|                   | Training         | Validation       | TrainPlusValid   | Forecast (Future) |
|-------------------|------------------|------------------|------------------|-------------------|
| Begin             | 1st quarter 2009 | 1st quarter 2014 | 1st quarter 2009 | 1st quarter 2015  |
| End               | 4th quarter 2013 | 4th quarter 2014 | 4th quarter 2014 | 4th quarter 2015  |
| Number of Records | 20               | 4                | 24               | 4                 |

Visualize the Data Trend

a. Create an MS Excel chart based on the data from 2009 through 2014 to visualize the trend and assess the strength of a linear and a polynomial relationship between t_qtr and ySales.

Note: Plot only the TrainPlusValid data. Do not include the records to be forecasted, since you have placed zeros there as placeholders. If you include them in the trendline, they will distort the R² values.

Use Excel to calculate the R² for a linear relationship between t_qtr and ySales. Record the R² for the linear relationship.

Use Excel to determine the R² for a polynomial relationship between t_qtr and ySales. Record the R² for that relationship.

Create and Evaluate Models

b. Complete the three models below. The table below shows which input variables should be included in each model. Determine which of the models produces the most accurate results.

Note: In VisMiner, you determine which input variables are included in a model by creating a derived dataset that includes only your chosen input variables and the output variable. The t_qtr2 variable is not in your prepped data because the Polynomial 2nd Order modeler in VisMiner creates the squared term from the non-squared term automatically.

Some check figures are provided.

c. Fill in the quality metrics in the table below.

Note: The mouse-over feature in VisMiner shows the RMSE and R². We could also calculate MAE and MAPE in MS Excel, but that is not necessary for the purpose of this exercise.

Models: 1.1, 1.2, 1.3
Input variables to include: Qtr, t_qtr, t_qtr2 (√ √ √ √ √ √)
Model quality metrics for validation dataset: R², RMSE, MAE, MAPE
Check figures: 0.206; 14,785; 5,514; 4.4%

d. Save a screen capture of the coefficients of the most accurate model.

Update the Coefficients and Create the Forecast

e. Now that you know which model has the best fit, use the TrainPlusValid dataset to re-estimate the coefficients for that model. Also save a screen capture of the coefficients of the re-estimated model.

f. With the resulting model, forecast the sales for the Forecast dataset. Save a screen capture of the forecasted data.

Check figure: The first forecasted ySales value should be 72,406.

2. Berlin Tunnel Homework Problem

Data Description

This data reflects the number of vehicles that went through the Berlin tunnel each day for about two years. Download the Tunnel2.csv dataset. The dataset contains only the Date and yVehicles columns, so you will need to create additional data columns for use in your analysis. The last day for which we have historical data is 11/16/05. To make the process easier, I have added the dates for which we want to create forecasts, 11/17/05 through 12/7/05, and inserted a zero as a placeholder for the number of vehicles in each record that needs a forecast.

[Screen capture: first records]

[Screen capture: records where known traffic ends and placeholders for the forecasted values begin]

Prepare the Data

In this problem, instead of using Excel, use the date manipulation functions in VisMiner (Create derived dataset, then Computed Columns) to create additional data that corresponds to the given dates. Specifically, use the following date manipulation functions to create the following fields:

| Date manipulation function in VisMiner | To create this field | Explanation |
|----------------------------------------|----------------------|-------------|
| DateDiff                               | t_day                | Time index for day. In this function <later date> is the Date column and <earlier date> is "10/31/03", so that 11/1/03 will be computed to be day 1. As explained in an earlier VisMiner reading, specific dates must be surrounded by quotation marks in VisMiner when creating calculated fields. |
| MonthNbrName                           | Month                | Dummy variables for months (e.g., 1Jan, 2Feb, 3Mar, etc.) |
| DayOfWeek                              | Wday                 | Dummy variables for day of week (e.g., 1Sun, 2Mon, 3Tue, etc.) |

The records below show how the data should look when it is ready to be partitioned for use in the forecasting process.

[Screen capture: first few records]

[Screen capture: bottom records with known values]

The table below shows the top and bottom records in the dataset to be forecasted.

[Screen capture: first few records]

[Screen capture: bottom records]

Create Partitions

The table below shows which records should be in each partition.
The videos show how to create these partitions in VisMiner.

|                   | Training | Validation | TrainPlusValid | Forecast (Future) |
|-------------------|----------|------------|----------------|-------------------|
| Begin             | 11/1/03  | 10/1/05    | 11/1/03        | 11/17/05          |
| End               | 9/30/05  | 11/16/05   | 11/16/05       | 12/7/05           |
| Records           | 1-700    | 701-747    | 1-747          | 748-768           |
| Number of Records | 700      | 47         | 747            | 21                |

Visualize the Data Trend

a. Create an MS Excel line chart for traffic from 11/1/03 through the end of the validation data (11/16/05) to visualize the trend and assess the strength of a linear and a polynomial relationship between t_day and yVehicles. Use Excel to calculate the R² for a linear relationship and for a polynomial relationship between tunnel traffic and t_day. Record both R² values.

Create and Evaluate Models

b. Complete the three models below. The table below shows which input variables should be included in each model. Some check figures are provided. Determine which of the models produces the most accurate results. Use Excel to calculate the model quality metrics for the validation dataset for each model. Fill in the quality metrics in the table below.

Models: 2.1, 2.2, 2.3
Input variables to include: Wday, Month, t_day, t_day2 (√ √ √ √ √ √ √ √ √)
Quality metrics for validation dataset: RMSE, R², MAE, MAPE
Check figures: 4,402; 3.9%; 4,702

Update the Coefficients and Create the Forecast

c. Now that you know which model has the best fit, train a model with the TrainPlusValid dataset using the same variables as the most accurate model. Save a screen capture of the coefficients of the retrained model.

d. With the resulting model, use VisMiner to forecast the number of vehicles that will go through the tunnel during the forecast period. Save the dataset that includes the forecasted records and bring it to class.
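The validation metrics requested in both problems (R², RMSE, MAE, MAPE) can be cross-checked outside Excel. The following is a minimal Python sketch and is not part of VisMiner or the course materials. The function name `validation_metrics` and the sample numbers are hypothetical and are not taken from DeptStore.csv or Tunnel2.csv. R² is computed here as 1 − SSE/SST, which is an assumption: VisMiner or an Excel trendline may use a different definition, so the result may not match the check figures exactly.

```python
import math


def validation_metrics(actual, forecast):
    """Return R2, RMSE, MAE, and MAPE (in percent) for paired values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        # MAPE divides by the actual value, so zero placeholders must be excluded.
        raise ValueError("actual values must be non-zero to compute MAPE")

    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    sse = sum(e * e for e in errors)                    # sum of squared errors
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    if sst == 0:
        raise ValueError("actual values are all identical; R2 is undefined")

    return {
        "R2": 1 - sse / sst,  # assumption: 1 - SSE/SST definition
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n,
    }


# Hypothetical validation records (four quarters), for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]

for name, value in validation_metrics(actual, forecast).items():
    print(f"{name}: {value:.3f}")
```

Pass in only the validation records. The forecast rows hold zero placeholders rather than real observations, which is why the sketch rejects zero actual values instead of silently producing a misleading MAPE.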