MGO 616 – Final Exam
Rajarshi Chakraborty, MSS

Q1. Time Series Phenomena

(a) Nonstationarity

A time series, or a stochastic process, is called stationary when its properties are unaffected by a change of time origin. This implies that if m observations of the process are taken at any set of times t1, t2, …, tm, then the joint probability distribution of the observations made after a lag of k is the same as that of the original observations. Formally speaking, a time series is said to be stationary if its mean, variance and autocorrelation function are all essentially constant through time. A nonstationary time series violates this property.

In real life, most stochastic processes are nonstationary. Many empirical time series (e.g. stock prices) behave as though they had no fixed mean. The property that most often displays nonstationarity is the mean of the series: the mean of such a process is not constant across time lags. As an example, the level of the series in the figure above seems to rise and fall episodically rather than trending in one direction.

In spite of the lack of a fixed mean, one part of the series may sometimes behave much like another part, apart from differences in local level (and possibly trend). This is referred to as homogeneous nonstationary behavior. Its importance lies in the fact that such a series can be converted into a stationary one by differencing. Differencing, however, only deals with a nonstationary mean, not a nonstationary variance; nonstationary variance can be handled through logarithmic transformations or sometimes double differencing. Two other kinds of stationarity can be obtained from nonstationary data: trend-stationary and difference-stationary. In the former, a deterministic trend is removed before fitting an ARMA(p,q) model, while in the latter the series is differenced with respect to time.

A process is nonstationary if one of the roots of its characteristic equation equals 1. If the other characteristic roots lie within the unit circle, stationarity can be restored by first differencing. The most important tests for nonstationarity are the unit-root tests, of which the Augmented Dickey-Fuller (ADF) test is the most widely used; alternatives are the Phillips-Perron test and the ADF-GLS procedure. The ADF test statistic is a negative number, and the more negative it is, the stronger the rejection of the hypothesis that there is a unit root at a given significance level (usually 1%).

Source of nonstationarity: In MIS, nonstationary time-series data can usually be seen in daily or monthly reports of attacks on computers over the Internet. Such data exhibits nonstationary properties because every time a new kind of attack appears, most victim computers get patched by security updates, while hackers in turn come up with new attacks that do not start from the same level of activity as before. Hence it is hard to find a stationary mean in this kind of MIS-related data.

(b) Seasonality

Seasonal variation is very commonly seen in economic and business data that is periodic in nature, with a period of one year. In fact, any time series can be divided into four components: (a) seasonal, (b) trend, (c) cyclical and (d) error. The seasonal component repeats with the same magnitude at the same time of the year.
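This decomposition can be illustrated with a short R sketch on a built-in monthly series (AirPassengers is used here purely as an illustration; classical decomposition folds the cyclical component into the trend):

# Classical decomposition of a monthly series into seasonal, trend and
# irregular (error) components; the cyclical part is absorbed into the trend.
dec <- decompose(AirPassengers)   # additive decomposition by default
plot(dec)                         # panels: observed, trend, seasonal, random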
One of the biggest differences between fitting ARIMA to seasonal data and to any other kind of data is that in the former, differencing is also done over a seasonal period s, which is also called the length of periodicity. Observations that are a multiple of the lag s apart are similar, which leads to spikes in the ACF and PACF of a seasonal series at one or more multiples of s. In fact, a seasonal process with a nonstationary mean has an ACF similar to that of a nonstationary, nonseasonal process: it shows spikes at lags s, 2s, 3s, etc. that do not damp out rapidly to zero.

Source of seasonality: In MIS, seasonality can usually be seen in data on the number of upgrades released for software around the world, or the number of software update DVDs shipped worldwide or within a single country. This seasonality comes from the fact that most companies producing such software, especially enterprise versions, follow a product cycle for new versions as well as for service packs (such as patch upgrades). Such data is, however, hard to obtain.

(c) Changing Volatility

Volatility refers to changing variance in a time series; more formally, it is the phenomenon where the conditional variance of the series varies over time. The figure above is an example of a volatile time series. It shows the daily values of the privately traded CREF stock fund over the period August 26, 2004 to August 15, 2006. The stock fund returns are more volatile over certain time periods, especially towards the end (around the 500th day mark).

Engle (1982) first proposed autoregressive conditional heteroskedasticity (ARCH) to model this kind of changing variance. In these processes the variance of the error displays autoregressive behavior: some successive periods demonstrate large error variance while others show small error variance. The ARCH method essentially regresses the conditional variance, also known as the conditional volatility, on the squared returns or observations from past lags; hence ARCH is usually specified by the number of lags used as covariates. ARCH(q) can be represented as:

  σ²_t = α0 + α1·ε²_(t−1) + α2·ε²_(t−2) + … + αq·ε²_(t−q)

The error terms ε_t are also referred to as the innovations of the time series, and the variance of the current innovation is related to the squares of the previous innovations. Due to the regression nature of the ARCH model, it can be estimated using ordinary least squares, where the null hypothesis tested is that αi = 0 for all i = 1..q. Bollerslev (1986) proposed a more general version of ARCH, called GARCH, which adds p lags of the conditional variance itself; combined with ARCH this becomes GARCH(p,q).

In the example below (see the simulation sketch at the end of this subsection) we see volatility clustering in a simulated ARCH(1) model with α0 = 0.01 and α1 = 0.9. The large fluctuations cluster together, but since the conditional variance process has very short memory, the overall series suffers less from these fluctuations.

Sources of volatility: In MIS, counts of Denial of Service attacks recorded over a period of time can show a volatile nature. This phenomenon especially shows volatility clustering due to the sporadic nature of the attacks, which are nevertheless very active over successive time periods or lags. The volatility can arise especially when computers get attacked for a brief period, die and reboot, and then face attack again. This cycle repeats for a short while until the computers are patched, which forces the attackers to take a long pause before returning with a new set of innovative attacks to break past the patches.
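A minimal R sketch of the kind of ARCH(1) simulation described above, using the same illustrative parameters α0 = 0.01 and α1 = 0.9 (this is a generic simulation, not the exact code behind the original figure):

set.seed(123)
n  <- 500
a0 <- 0.01; a1 <- 0.9            # parameters from the example in the text
e  <- rnorm(n)                   # i.i.d. standard normal innovations
r  <- numeric(n); sigma2 <- numeric(n)
sigma2[1] <- a0 / (1 - a1)       # start at the unconditional variance
r[1] <- sqrt(sigma2[1]) * e[1]
for (t in 2:n) {
  sigma2[t] <- a0 + a1 * r[t - 1]^2   # conditional variance driven by last squared value
  r[t] <- sqrt(sigma2[t]) * e[t]
}
plot.ts(r, main = "Simulated ARCH(1): volatility clustering")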
(d) Dynamics

A time series has four components: seasonality, trend, a cyclical component and error. If both the cyclical component and the error component change with time, we call such a process a dynamic time-series model. In order to account for these changes, many researchers use dynamic interdependency models as prescribed by Lin (1992). This model, which uses simultaneous equations and a 3SLS method of estimation, was shown to perform better in forecasting than Box-Jenkins.

Sources of dynamics: One key source of the dynamic nature of some IT-related time series (such as IT investment or computer attacks) is the speed of innovation that IT experiences more than any other field. Not every part of an organization evolves at an equal pace, and very often IT outpaces everything else; this leads to changes in adoption, purchasing, effective use, and so on.

(e) Randomness

Randomness in a time series is usually discussed in the context of a random walk. These processes are generally associated with irregular growth, and one way to handle them is to use first differences. A random process can also be stationary, as shown in the diagram below. The reason it is called a "random walk" is that from one period to the next the series merely takes a random "step" away from its last recorded position. If the constant term is missing from the random walk model, it is a random walk without drift.

One of the most significant tests for detecting randomness in a time series is the runs test. It essentially checks whether the number of consecutive sequences of 0s and 1s ("runs") is what would be expected of a truly random series. For a large number of observations, the distribution of the number of runs is approximately normal.

Sources of randomness: There are certain phenomena in IT that are hard to strip of their randomness. One of them is associated with the dynamic property mentioned earlier: innovation in technology is not always planned but is often the outcome of a development team achieving a breakthrough or an epiphany about solving a hard problem.

(f) Non-normality

Normality is a basic assumption of many statistical models. In time-series analysis, a model is said to fit the observed data well if the error terms (residuals) are normally and independently distributed. This is typically verified through the Jarque-Bera (JB) test, which measures skewness and kurtosis of a variable (see the short R sketch below).

Sources of non-normality: In IT and MIS, normality is never achieved 100%, but an approximate fit to that distribution could be organizations' IT expenses. Most organizations could be expected to make "middle-of-the-road" IT-related purchases, such as buying computers and related technologies and creating new positions to maintain, develop and upgrade those technologies within the company. Only a few companies would invest excessively in IT for their operations, while very few others would invest very little. This phenomenon, we believe, could give rise to a normal distribution of one time series in the IT domain: IT investment. On the other hand, the rising commoditization of IT could actually result in a non-normal and highly skewed distribution of IT expenses at each point in time, on the rationale that no company can afford to invest very little in IT these days, so the distribution would lose its bell-curve shape.
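The JB test and the runs test mentioned above are both available in R's tseries package; a minimal sketch on a simulated placeholder series (in practice one would apply these to an observed series or to model residuals):

library(tseries)                   # provides jarque.bera.test() and runs.test()
set.seed(1)
x <- rnorm(100)                    # placeholder series for illustration
jarque.bera.test(x)                # H0: skewness 0 and excess kurtosis 0 (normality)
runs.test(factor(x > median(x)))   # H0: the sequence of above/below-median values is random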
(g) Nonlinearity

Linearity in time-series data is tested through the BDS test (Brock, Dechert, & Scheinkman, 1987). The null hypothesis tested here is that the data is independently and identically distributed (IID). The test examines the "spatial dependence" of the observed series: the series is embedded in m-space and the dependence of x is examined by counting "near" points, where points whose distance is less than eps are called "near". The BDS test statistic is asymptotically standard normal.

Sources of nonlinearity: Nonlinearity in IT data can be explained by the fact that several studies have shown a heavy influence of moderating variables in any kind of causal relationship involving IT adoption or IT investment. These moderating effects show that the variables cannot be purely linearly related.

Q2. Time Series Analysis

Data: Attacks on Web servers around the world
Software: Gretl and R

For our time-series analysis we use a dataset from the website www.dshield.org. This website, maintained by the Internet Storm Center, collects thousands of reports every day about problems and threats to computers around the world; sensors associated with network devices report this information to its database. The dataset used here concerns various attacks, including unwanted traffic, on Web (HTTP: port 80) servers around the world. In its raw form it is daily data collected from January 7, 2003 to April 30, 2011. However, since there was too much randomness in the daily data, I decided to average it over each month, which resulted in a monthly time series with sample size 100 (see the short R sketch at the end of this subsection). The data consists of two variables: (a) Targets of attacks and (b) Sources of attacks.

The motivation to choose this data came from a strong belief that attackers' confidence in targeting new machines depends a lot on the history of success of, or attempts at, attacking Web servers (including popular attacks like Denial of Service). This phenomenon, which is grounded in the information-assurance field under Management Science and Systems (MSS), thus has consequences not only for the number of targets of attacks but also for the number of sources of attacks.

The following is the regular time-series plot of the collected monthly average data. This data clearly tells us that attacking Web servers was certainly easier back in the early 2000s. Government regulation, awareness and other social factors must also have led to a steady decline in the number of sources of these attacks.

Before we start with the time-series analysis, some basic summary statistics and normality test results are presented here for reference.

Summary statistics, using the observations 2003:01 - 2011:04

            Mean        Median      Minimum     Maximum
  Targets   7.8166e+05  5.0537e+05  1.6741e+05  5.0547e+06
  Sources   77403.      58973.      24508.      2.1894e+05

            Std. Dev.   C.V.        Skewness    Ex. kurtosis
  Targets   8.1723e+05  1.0455      3.1404      10.762
  Sources   47847.      0.61816     1.3929      1.1625

Neither of the two variables shows any normality:

Test for normality of Targets:
  Doornik-Hansen test = 347.996, with p-value 2.71372e-76
  Shapiro-Wilk W = 0.596099, with p-value 3.6318e-15
  Lilliefors test = 0.278988, with p-value ~= 0
  Jarque-Bera test = 646.912, with p-value 3.349e-141

Test for normality of Sources:
  Doornik-Hansen test = 79.2284, with p-value 6.24847e-18
  Shapiro-Wilk W = 0.829703, with p-value 2.25974e-09
  Lilliefors test = 0.215948, with p-value ~= 0
  Jarque-Bera test = 37.9678, with p-value 5.69364e-09
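For reference, the daily-to-monthly averaging described above could be done in R roughly as follows (a sketch only; the file name and column names are illustrative assumptions, not the actual dshield export):

# Hypothetical raw extract with daily columns date, targets and sources
raw <- read.csv("dshield_port80_daily.csv", stringsAsFactors = FALSE)
raw$month <- format(as.Date(raw$date), "%Y-%m")
monthly <- aggregate(cbind(targets, sources) ~ month, data = raw, FUN = mean)
# Monthly averages as time-series objects starting January 2003
targetD <- ts(monthly$targets, start = c(2003, 1), frequency = 12)
sourceD <- ts(monthly$sources, start = c(2003, 1), frequency = 12)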
(1) Box-Jenkins univariate analysis:

The Box-Jenkins approach is applied to univariate analysis, so we start with the variable Targets.

Step 1: Identification – In this stage the ACF and PACF of the variable are explored to make an educated guess about the orders p and q, in order to start with either an AR(), MA() or ARMA() model. This identification step should also include identifying the need for differencing based on the presence of nonstationarity. The Augmented Dickey-Fuller test is therefore conducted first, as shown in the output below:

  Augmented Dickey-Fuller Test
  data:  targetD
  Dickey-Fuller = -3.0582, Lag order = 4, p-value = 0.1383
  alternative hypothesis: stationary

This result tells us that we should be looking not only for the p and q orders but also for d. Since the p-value is greater than 0.01, we cannot reject the null hypothesis that the data has a unit root and is thus nonstationary. We will therefore start with d = 1.

Next we look at the ACF and PACF of Targets. Following the strategy in Table 6.2 of Pankratz, it is decided that we should start with p = 3 and q = 0.

Step 2: Estimation – Next we estimate the coefficients of the ARMA model (over a few iterations) using exact maximum likelihood, which gave the following results. Apparently taking p = 3 is not the best idea. The next combination we try is p = 1, d = 0, q = 1; furthermore, choosing p = 1, d = 1, q = 1, we get the results shown. In addition we also check the residual correlograms for both (1,0,1) and (1,1,1), and we see that the residuals are closer to white noise in the former, as shown below:

  (p=1,d=0,q=1)   (p=1,d=1,q=1)

So at the end of this identification and estimation cycle it is decided that the ideal combination is p = 1, d = 0, q = 1. Thus the model that best fits our data is ARMA(1,1).

> arima0(targetD, order = c(1, 0, 1))
Call:
arima0(x = targetD, order = c(1, 0, 1))
Coefficients:
         ar1     ma1  intercept
      0.6933  0.2774   781642.2
s.e.  0.0942  0.1182   195802.9
sigma^2 estimated as 2.298e+11:  log likelihood = -1450.46,  aic = 2908.92

Therefore Φ1 = 0.6933, θ1 = 0.2774 and C = 781642.2, and the ARMA(1,1) model fitted to our time series of computer attacks is:

  Ŷ_t = 781642.2 + 0.6933·Y_(t−1) + ε_t − 0.2774·ε_(t−1)

Step 3: Diagnostics – We look at the residuals to conduct this step. First we plot the standardized residuals over time. Here it was found that the standardized residuals actually converge over time, revealing a trend, which suggests that ARMA(1,1) on the original data is possibly not a good model. We will do further diagnostics to confirm this. Next we check the normality of the residuals using a quantile-quantile (Q-Q) plot, as given below. Since the points on this plot do not follow the straight line closely, we reject normality of the error terms of our model.
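A rough R sketch of how these residual diagnostics could be produced (object names are illustrative; in the output below the fitted object appears to have been stored under the name arima0):

fit1    <- arima0(targetD, order = c(1, 0, 1))   # the ARMA(1,1) fit from Step 2
res     <- fit1$residuals
res_std <- (res - mean(res)) / sd(res)           # standardized residuals
plot(res_std, type = "l", main = "Standardized residuals")
qqnorm(res); qqline(res)                         # normality check of the residuals
acf(res, main = "Residual correlogram")          # remaining autocorrelation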
We then conduct both the Box-Pierce test and the Ljung-Box test on the residuals of ARMA(1,1). We know that the Box-Pierce test is not good for small samples; we overcome that concern here by virtue of a sample size of 100.

> Box.test(arima0$residuals)
        Box-Pierce test
data:  arima0$residuals
X-squared = 0.1023, df = 1, p-value = 0.749

> Box.test(arima0$residuals, type = "Ljung-Box")
        Box-Ljung test
data:  arima0$residuals
X-squared = 0.1055, df = 1, p-value = 0.7454

These high p-values mean we cannot reject the null hypothesis of no autocorrelation in the residuals, so on this criterion the residuals are consistent with white noise. However, given the trend visible in the standardized residuals and the clear non-normality in the Q-Q plot, our initial Box-Jenkins selection, ARMA(1,1) on the raw data, is still not satisfactory. We therefore go back to Step 1, as shown below, to try something else.

Step 1(b): Since most other combinations of p, d and q on the original data did not give favorable results in the estimation stage, we need to consider transforming the data. We chose log differencing as the transformation, as shown at the R prompt below:

> log_d_targetD <- diff(log(targetD))

Checking the correlograms of this transformed data immediately revealed that there might be an MA(1) process here. Fitting an autoregressive process to the log-differenced target data also suggested that we should pick p = 1 for the AR part:

> ar(log_d_targetD)
Call:
ar(x = log_d_targetD)
Coefficients:
      1
-0.1816
Order selected 1  sigma^2 estimated as  0.1921

We thus pick p = 1 and q = 1 to start with for this transformed data. Furthermore, the ADF test on the transformed data showed that it is not nonstationary, as there is no evidence of a unit root:

(Gretl output)
Dickey-Fuller test for ld_Targets
sample size 98
unit-root null hypothesis: a = 1

  test without constant
  model: (1-L)y = (a-1)*y(-1) + e
  1st-order autocorrelation coeff. for e: -0.014
  estimated value of (a - 1): -1.18129
  test statistic: tau_nc(1) = -11.832
  p-value 2.325e-118

  test with constant
  model: (1-L)y = b0 + (a-1)*y(-1) + e
  1st-order autocorrelation coeff. for e: -0.014
  estimated value of (a - 1): -1.18172
  test statistic: tau_c(1) = -11.7763
  p-value 0.000342

  with constant and trend
  model: (1-L)y = b0 + b1*t + (a-1)*y(-1) + e
  1st-order autocorrelation coeff. for e: -0.015
  estimated value of (a - 1): -1.18186
  test statistic: tau_ct(1) = -11.7176
  p-value 1.055e-14

(R output)
> adf.test(log_d_targetD)
        Augmented Dickey-Fuller Test
data:  log_d_targetD
Dickey-Fuller = -6.018, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
Warning message:
In adf.test(log_d_targetD) : p-value smaller than printed p-value

We are thus going to investigate an ARIMA(1,0,1) model for the log-differenced target data.

Step 2(b): Here we estimate the AR and MA coefficients:

> model2
Call:
arima(x = diff(log(targetD)), order = c(1, 0, 1))
Coefficients:
         ar1      ma1  intercept
      0.7105  -1.0000    -0.0118
s.e.  0.0730   0.0522     0.0045
sigma^2 estimated as 0.1658:  log likelihood = -52.95,  aic = 111.9

Clearly this model is better than the one fitted to the original data in the previous iteration, as the AIC value is much smaller and the coefficients are much further from 0. We will thus run diagnostics on this new ARMA(1,1) for the log-differenced target data.

Step 3(b): Here we look at the same diagnostic output as before. However, as we shall see below, apart from the randomness and lack of trend in the residuals, there is not a lot of improvement in the diagnostics compared to our previous model. Although there is better alignment with the Q-Q line than in the previous case, normality is still not fully established for the residuals of our ARMA(1,1) on the log-differenced target data; the Q-Q plot of the residuals therefore does not confirm the desired normality.
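A sketch of commands that could reproduce Steps 2(b) and 3(b) and the forecasts used in Step 4 below (model2 is the object name appearing in the output above; the exact commands used in the original analysis are not all shown there):

model2 <- arima(diff(log(targetD)), order = c(1, 0, 1))  # ARMA(1,1) on the log-differenced series
tsdiag(model2)                        # standardized residuals, residual ACF, Ljung-Box p-values
qqnorm(residuals(model2)); qqline(residuals(model2))     # normality of the residuals
predict(model2, n.ahead = 2)          # two-month-ahead forecast of the log-differenced series,
                                      # to be transformed back for comparison in Step 4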
Step 4: Forecasting – Finally, we use the ARMA(1,1) models for both the original Targets data and the log-differenced Targets data to compare forecasts of the number of computers or Web servers (port 80) that will be targeted in May and June 2011. The following are those results. We can see that the ARMA(1,1) on the original data gives a better visual indication of where the number of targets will be, while the forecast based on the log-differenced data has a better confidence interval.

As we saw in the Box-Jenkins univariate analysis above, despite the flaws in both models, there is a slight improvement in the estimation once a transformation is adopted. Further analysis is beyond the scope of this answer but will be pursued for research purposes in the near future.

(2) Tiao-Box Multivariate Analysis:

If zt = (z1t, z2t, …, zkt)' is a k-dimensional time series, then in order to improve the accuracy of forecasts, multivariate time-series analysis is required. Tiao-Box multivariate time-series analysis deals with a vector of two or more, possibly nonstationary, time series. The purpose of this type of analysis is to understand the dynamic relationships among the component series.

Tiao and Box proposed an iterative approach with the following steps to build a multiple time-series model:
1. Identification through tentative specification
2. Estimation
3. Diagnostic checking of the fitted ARMA model

The first step requires checking the cross-correlations between the two vectors, as shown below. The persistence of large correlations between the two series suggests the possibility of autoregressive (AR) behavior. A low-order MA model can also be detected from the pattern of indicator symbols, which are used in place of numerical values in the cross-correlation tables, as in the example below from Tiao and Box (1981).

Estimation may be based on either the conditional or the exact likelihood function; for general ARMA(p,q) models the exact likelihood can be approximated by a suitable transformation of the series. Finally, diagnostic checking can be performed by (a) plotting standardized residuals against time and/or other variables, and (b) examining the cross-correlation matrices of the residual series.

(3) VAR(p) and Impulse-Response Function

While Tiao-Box multivariate analysis deals with multiple-series ARMA(p,q) processes, it is practically difficult to analyze multiple processes that have significant moving-average (MA) components. VAR(p), or Vector Autoregression, models instead deal with multiple time-series processes using only autoregressive terms. More precisely, a VAR consists of a set of K endogenous variables with a certain lag order. Before running the VAR in Gretl, it was therefore important to let the software select the lag order for us (based on a maximum lag of 12, since this is monthly data), including a constant and a trend, as shown below. This is further confirmed by the VARselect command in R, whose output shows lag = 6 as the optimal choice based on AIC. We estimate the VAR model using lag = 6, the optimal lag produced by the lag-selection procedure.
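A sketch of these steps using the vars package in R (object names are illustrative; sourceD here stands for the monthly Sources series, a name not used in the original output):

library(vars)                                     # provides VARselect(), VAR() and irf()
y <- cbind(Targets = targetD, Sources = sourceD)  # the two monthly series side by side
VARselect(y, lag.max = 12, type = "both")         # lag-order selection with constant and trend
var6 <- VAR(y, p = 6, type = "both")              # VAR with the selected lag order
summary(var6)
plot(irf(var6))                                   # impulse-response functions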
The following are the VAR results, followed by the impulse-response functions.

For "Targets" with lag = 6:
For "Sources" with lag = 6:
AIC = 51.1839
BIC = 51.9415
Residual plot for each variable:
Combined VAR residuals plot:
Combined impulse-response functions:

(4) State-space model and Kalman Filter

We demonstrate here the method that would be used to estimate a state-space model. A univariate time series can be modeled using two Gaussian white-noise series as follows (the local level model):

  y_t = μ_t + e_t,        e_t ~ N(0, σ²_e)
  μ_t = μ_(t−1) + η_t,    η_t ~ N(0, σ²_η)

This represents a state-space model where y_t is the data and μ_t is the "state". The objective of this analysis is to infer the properties of the state from the data and the model. There are three types of inference: (a) filtering: remove the measurement errors from the data; (b) prediction: forecast either the state or the data; (c) smoothing: estimate the state. The goal of the Kalman filter, on the other hand, is to update our knowledge of the state μ_t recursively as each new data point becomes available, through a set of prediction and updating equations.

(5) Intervention and transfer-function models

Transfer-function models, also known as dynamic regression models, are estimated by bringing in an exogenous variable, while the disturbance may follow some ARIMA(p,d,q) model. The latter is also called the stochastic noise process, and it is specified by examining the residuals from an OLS fit of one time-series variable on another. Since we are introducing an exogenous variable here, we use the ARIMAX model in Gretl and apply it to our data about computer attacks. For our purposes we select the variable "Sources" as the exogenous variable in the ARIMA model. The results show that all coefficients are significant: "Sources" is the exogenous variable and it quite significantly affects our dependent variable "Targets" at time t. This new model gives a satisfactory forecast, as seen below; this forecast should be compared with the forecasts from ARMA(1,1) on the original and log-differenced Targets data. A similar plot can also be obtained by comparing fitted vs. actual "Targets" against time.

(6) Unit root tests

These were performed at the beginning of part (1), where we carried out the Box-Jenkins analysis; they were done early because we wanted to know whether our data needs differencing. Please refer back to part (1), "Box-Jenkins univariate analysis".

(7) Co-integration and Error Correction Model

As the results from both the Engle-Granger test and the Johansen test below show, no evidence of cointegration was found in our data, and thus we did not have to formulate an error correction model.

(8) GARCH / ARCH / EGARCH

1. ARCH: Setting the GARCH order to 0 and the ARCH order to 1 for "Targets" suggests there is an ARCH(1) effect in the data. Setting the GARCH order to 1 and the ARCH order to 1 for "Targets" suggests there is less evidence of a GARCH component in the data.
2. EGARCH: estimation failed.
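A minimal sketch of how such ARCH/GARCH fits could be run in R with the tseries package, applied here to the log-differenced Targets series purely for illustration (the original estimation was done in Gretl; the rugarch package would be needed for EGARCH):

library(tseries)                        # provides garch()
r <- diff(log(targetD))                 # a returns-like, roughly mean-stationary series
arch1   <- garch(r, order = c(0, 1))    # pure ARCH(1); order = c(GARCH order, ARCH order)
garch11 <- garch(r, order = c(1, 1))    # GARCH(1,1)
summary(arch1)
summary(garch11)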
(9 – 11) Discussion

What we have seen from the data above is that the number of targets of computer attacks, especially machines running Web servers on port 80, does seem to depend on its own value a month earlier. As discussed at the beginning of this answer, an interesting observation, especially from the VAR(p) model, is that the number of attacking sources appears to have an effect on the number of targets attacked. This finding seems logical because the sources and targets are directly connected by the Internet.

One of the key findings, especially from the poor diagnostic results in the Box-Jenkins analysis, is that, given the speed at which IT, computer and networking technology evolves, we might not get a very reliable time-series model for data that spans eight years. It is evident from the initial plot of the targets that more computers used to be attacked (and successfully so) back in the early 2000s, and this could partly be attributed to a combination of factors such as (a) lack of awareness about security issues, (b) lack of affordable security technologies and (c) lack of skills among website maintainers. This tells us that it is not appropriate to apply a purely time-series model for estimating, predicting and forecasting the average number of Web servers that are targets of attacks every month. In other words, as seen in the intervention stage, some non-stochastic exogenous variable might play a significant role in the model and explain the expected number of targeted Web servers much better than what we were able to achieve here.

This kind of prediction and forecast information is extremely valuable for managers of organizations as well as for security device and software companies seeking to understand their market. That motivation is what drives our research, and to the best of our knowledge few event studies have been done to understand the patterns of Web server attacks, whether in computer science or in MIS.