MGO 616 – Final Exam
Rajarshi Chakraborty, MSS
Q1. Time Series Phenomenon
(1) Nonstationarity
A time series, or stochastic process, is called stationary when its properties are unaffected by a change of time origin. This implies that if m observations are taken at any set of times t1, t2, …, tm, then their joint probability distribution is the same as that of the m observations taken k lags later. Formally speaking, a time series is said to be stationary if its mean, variance and autocorrelation function are all essentially constant through time. A nonstationary time series, on the other hand, violates this property.
In real life, however, most stochastic processes are nonstationary. Many empirical time series (e.g. stock prices) behave as though they had no fixed mean. The property that most commonly displays nonstationarity is the mean of the series; in other words, the mean of such a process is not constant across different time periods.
As an example, the level of the series in the figure above seems to rise and fall episodically rather than trending in one direction. In spite of the lack of a fixed mean, however, one part of the series may behave much like another part when viewed globally. This is referred to as homogeneous nonstationary behavior, and its importance lies in the fact that such a series can be converted into a stationary one by differencing. Differencing, however, only deals with a nonstationary mean and not with a nonstationary variance; the latter can be handled through logarithmic transformations or, sometimes, double differencing. Two other kinds of stationarity can be obtained from nonstationary data: trend-stationarity and difference-stationarity. In the former, a deterministic trend is removed before fitting an ARMA(p,q) model; in the latter, stationarity is achieved by differencing the series with respect to time.
A process is nonstationary if one of the roots of its characteristic equation equals 1. If the other characteristic roots lie within the unit circle, then stationarity can be restored by first differencing. The most common tests of nonstationarity are unit root tests, of which the Augmented Dickey-Fuller (ADF) test is the best known; alternatives include the Phillips-Perron test and the ADF-GLS procedure. The ADF test statistic is a negative number, and the more negative it is, the stronger the rejection of the hypothesis that there is a unit root at a given significance level (usually 1%).
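A minimal sketch of how such a test can be run in R with the tseries package (the series x is a simulated placeholder, not the attack data used later):

library(tseries)           # provides adf.test()
x <- cumsum(rnorm(200))    # simulated random walk, i.e. a unit-root process
adf.test(x)                # large p-value: the unit-root null cannot be rejected
adf.test(diff(x))          # after first differencing the null is typically rejected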
Source of nonstationarity: In the MIS area, nonstationary time series can usually be seen in daily or monthly reports of attacks on computers over the Internet. Such data exhibit nonstationary behavior because every time a new kind of attack appears, most of the victim computers eventually get patched by security updates, while hackers in turn come up with new attacks that need not start from the same level as before. Hence it is hard to find a stationary mean in this kind of MIS-related data.
(2) Seasonality
Seasonal variation is very common in economic and business data that are periodic in nature; the variation repeats with a period of one year. In fact, any time series can be decomposed into four components: (a) seasonal, (b) trend, (c) cyclical and (d) error. The seasonal component recurs with the same magnitude at the same time of the year. One of the biggest differences between fitting an ARIMA model to seasonal data and to any other kind is that in the former the differencing is done over a seasonal period "s", also called the length of periodicity. Observations that are a multiple of the lag "s" apart are therefore similar, which leads to spikes in the ACF and PACF of a seasonal series at one or more multiples of these lags. A seasonal process with a nonstationary mean has an ACF similar to that of a nonstationary, nonseasonal process: it shows spikes at lags s, 2s, 3s, etc. that do not damp out rapidly to zero.
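A minimal sketch of seasonal differencing in R (assuming a monthly series x, so that s = 12):

d12_x <- diff(x, lag = 12)   # seasonal difference over the period s = 12
acf(d12_x, lag.max = 36)     # check whether the spikes at lags 12, 24, 36 have damped out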
Source of seasonality: In MIS, seasonality can be usually seen in data about number
of upgrades being released for software around the world, or in the number of
software update DVDs being shipped around the world or at least in a single
country. This seasonality comes from the fact that most companies that produce
these software applications, especially the enterprise versions, have a certain
product cycle for new versions as well as for service packs (like patch upgrades).
Such data is however hard to obtain.
(3) Changing Volatility
Volatility refers to changing variance in a time series. More formally, it is the phenomenon where the conditional variance of the series varies over time. The figure above is an example of a volatile time series: it shows the daily values of the privately traded CREF stock fund over the period August 26, 2004 to August 15, 2006. What we see here is that the stock fund returns are more volatile over certain periods, especially towards the end (around the 500th day). Engle (1982) first proposed the autoregressive conditional heteroskedasticity (ARCH) model for this kind of changing variance. More specifically, in these processes the variance of the error displays autoregressive behavior: some successive periods show large error variance while others show small error variance. The ARCH method essentially regresses the conditional variance, also known as the conditional volatility, on the squared returns or observations from past lags. Hence ARCH is usually specified by the number of lags used as covariates, so ARCH(q) can be represented as:
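In standard notation (with z_t an i.i.d. standard normal sequence), the ARCH(q) equations are

\varepsilon_t = \sigma_t z_t, \qquad \sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \dots + \alpha_q \varepsilon_{t-q}^2 .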
The error terms are also referred to as innovations in the time series; the variance of the current innovation is related to the squares of the previous innovations. Because of the regression nature of the ARCH method, it can be estimated using Ordinary Least Squares regression, where the null hypothesis tested is that αi = 0 for all i = 1, …, q. Bollerslev (1986) later proposed a more generalized version of ARCH, called GARCH, which introduces p lags of the conditional variance itself; combining these with the q ARCH lags gives GARCH(p,q):
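In the same notation, the GARCH(p,q) conditional variance is

\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 .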
In the example below, we see volatility clustering in a simulated ARCH(1) model with α0 = 0.01 and α1 = 0.9. The large fluctuations cluster together, but since the conditional variance process has "very short memory", the overall series seems to suffer less from these fluctuations.
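A minimal sketch of such a simulation in R (the seed and series length are arbitrary choices, not taken from the original figure):

set.seed(1)
n  <- 500
a0 <- 0.01; a1 <- 0.9                  # ARCH(1) parameters from the text
z  <- rnorm(n)                         # i.i.d. standard normal innovations
eps <- numeric(n); sigma2 <- numeric(n)
sigma2[1] <- a0 / (1 - a1)             # start at the unconditional variance
eps[1] <- sqrt(sigma2[1]) * z[1]
for (t in 2:n) {
  sigma2[t] <- a0 + a1 * eps[t - 1]^2  # conditional variance driven by the last squared shock
  eps[t]    <- sqrt(sigma2[t]) * z[t]
}
plot.ts(eps, main = "Simulated ARCH(1): volatility clustering")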
Sources of volatility: In MIS, the number of Denial of Service attacks on computers can show a volatile nature if recorded over a period of time. This phenomenon especially shows volatility clustering due to the sporadic nature of the attacks, which are nevertheless very active over successive time periods or lags. The volatility can arise especially when computers are attacked for a brief period, crash and reboot, and then face attacks again. This cycle repeats for a short while until the computers are patched, which forces the attackers to take a long pause before returning with a new set of innovative attacks to break past the patches.
(4) Dynamics
A time series has four components: seasonality, trend, a cyclical component and error. If both the cyclical component and the error component change with time, then we call such a process a dynamic time series model. In order to account for these changes, many researchers use dynamic interdependency models as prescribed by Lin (1992). This model, which uses simultaneous equations and a 3SLS method of estimation, was shown to forecast better than Box-Jenkins.
Sources of dynamics: One key source of the dynamic nature of some IT-related time series (such as IT investment or computer attacks) is the speed of innovation, which IT experiences more than almost any other field. Not every component of an organization evolves at an equal pace, and very often IT outpaces everything else; this leads to changes in adoption, purchase, effective use, and so on.
(5) Randomness
Randomness in a time series is usually discussed in the context of a random walk. Such processes are generally associated with irregular growth, and one of the ways to handle them is to take first differences. A random process can also be stationary, as shown in the diagram below. The reason it is called a "random walk" is that from one period to the next the series merely takes a random "step" away from its last recorded position. If the constant term is missing from the random walk model, it is a random walk without drift (the model is written out after this paragraph). One of the most widely used tests for detecting randomness in a time series is the Runs Test. It essentially checks whether the number of runs (consecutive sequences of 0s and 1s) is consistent with what would be expected from a random series. For a large number of observations, the distribution of the number of runs is approximately normal.
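For reference, a random walk with drift can be written as

Y_t = c + Y_{t-1} + \varepsilon_t ,

and dropping the constant c gives the random walk without drift, Y_t = Y_{t-1} + \varepsilon_t, whose first difference \Delta Y_t = \varepsilon_t is simply white noise.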
Sources of randomness: Certain phenomena in IT are hard to strip of their randomness. One of them is associated with the dynamic property mentioned earlier: innovation in technology is not always planned, but is often the outcome of some development team achieving a breakthrough or an epiphany about solving a hard problem.
(6) Non-normality
Normality is a basic assumption of many statistical models. In time-series analysis, a model is said to fit the observed data well if the error terms (residuals) can be considered normally and independently distributed. This is typically verified through the Jarque-Bera (JB) test, which measures the skewness and kurtosis of a variable.
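A minimal sketch of this check in R with the tseries package (the residual series here is a placeholder; in practice the fitted model's residuals would be used):

library(tseries)
res <- rnorm(100)        # placeholder residuals
jarque.bera.test(res)    # H0: skewness = 0 and excess kurtosis = 0, i.e. normality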
Sources of non-normality: In IT and MIS, normality is never achieved 100%, but an approximate fit of that distribution could be organizations' IT expenses. Most organizations could be expected to make "middle-of-the-road" IT-related purchases, such as buying computers and related technologies and creating new positions to maintain, develop and upgrade those pieces of technology within the company. Only a few companies would invest excessively in IT for their operations, while very few others would invest very little. This phenomenon, we believe, could give rise to a normally distributed time-series variable in the IT domain: IT investment. On the other hand, the rising commoditization of IT could actually result in a non-normal and highly skewed distribution of IT expenses at each point in time. This could be attributed to the rationale that no company can invest very little in IT these days, and hence the distribution would lose its bell-curve shape.
(7) Nonlinearity
Linearity in time-series data is tested through the BDS test (Brock, Dechert, & Scheinkman, 1987). The null hypothesis tested here is that the data are independently and identically distributed (IID). The test examines the "spatial dependence" of the observed series: the series is embedded in m-dimensional space and the dependence of x is examined by counting "near" points, i.e. points whose distance is less than eps. The BDS test statistic is asymptotically standard normal.
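A minimal sketch of the BDS test in R with the tseries package (the series and the embedding dimension are placeholders):

library(tseries)
x <- rnorm(200)     # an i.i.d. series, so the IID null should not be rejected
bds.test(x, m = 3)  # reports BDS statistics for embedding dimensions 2 and 3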
Sources of nonlinearity: Nonlinearity in IT data can be explained by the fact that several studies have shown a heavy influence of moderating variables in any kind of causality that involves IT adoption or IT investment. These moderating effects suggest that the variables are not related in a purely linear way.
Q2. Time Series Analysis
Data: Attacks on Web Servers around the world
Software: Gretl and R
For our time-series analysis we use a dataset from the website www.dshield.org. This website, maintained by the Internet Storm Center, collects thousands of records every day about problems and threats to computers around the world; sensors associated with network devices report this information to its database. The dataset used here concerns various attacks, including unwanted traffic, on Web (HTTP: port 80) servers around the world. In its raw form it is daily data collected from January 7, 2003 to April 30, 2011. However, since there was too much randomness in the daily data, I decided to average it over each month, which resulted in a monthly time-series dataset with a sample size of 100. The data consist of two variables: (a) Targets of attacks and (b) Sources of attacks. The motivation to choose this data came from a strong belief that attackers' confidence in targeting new machines depends a lot on the history of successes or attempts made in attacking Web servers (including popular attacks like Denial of Service). This phenomenon, which is grounded in the information-assurance field under Management Science and Systems (MSS), thus has consequences not only for the number of targets of attacks but also for the number of sources of attacks.
The following is the regular time-series plot of the collected monthly average data:
This data clearly tells us that attacking Web servers was certainly easier back in the early 2000s. Government regulation, awareness and other social factors must also have led to a steady decline in the number of sources of these attacks. Before we start the time-series analysis, some basic summary statistics and normality test results are presented here for reference.
Summary statistics, using the observations 2003:01 - 2011:04

              Mean          Median        Minimum       Maximum
Targets       7.8166e+05    5.0537e+05    1.6741e+05    5.0547e+06
Sources       77403.        58973.        24508.        2.1894e+05

              Std. Dev.     C.V.          Skewness      Ex. kurtosis
Targets       8.1723e+05    1.0455        3.1404        10.762
Sources       47847.        0.61816       1.3929        1.1625
Neither of the two variables appears to be normally distributed, as the tests below confirm.
Test for normality of Targets:

Doornik-Hansen test = 347.996, with p-value 2.71372e-76

Shapiro-Wilk W = 0.596099, with p-value 3.6318e-15

Lilliefors test = 0.278988, with p-value ~= 0

Jarque-Bera test = 646.912, with p-value 3.349e-141
Test for normality of Sources:

Doornik-Hansen test = 79.2284, with p-value 6.24847e-18

Shapiro-Wilk W = 0.829703, with p-value 2.25974e-09

Lilliefors test = 0.215948, with p-value ~= 0

Jarque-Bera test = 37.9678, with p-value 5.69364e-09
(1) Box-Jenkins univariate analysis:
The Box-Jenkins approach applies only to univariate analysis. We thus start with the variable "Targets".
Step 1: Identification – In this stage, the ACF/PACF of the variable are explored to make an educated guess about the orders p and q, in order to start with an AR(p), MA(q) or ARMA(p,q) model. This identification step should also include identifying the need for differencing based on the existence of nonstationarity. Thus an Augmented Dickey-Fuller test is conducted first, as shown in the output below:
Augmented Dickey-Fuller Test
data: targetD
Dickey-Fuller = -3.0582, Lag order = 4, p-value = 0.1383
alternative hypothesis: stationary
This result tells us that we should be looking not only for the p and q orders but also for d. Since the p-value is > 0.01, we cannot reject the null hypothesis that the data has a unit root and is thus nonstationary. We will therefore start with d = 1.
Next we look at the ACF and PACF of “Targets”:
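These plots can be produced in R with, for example:
> acf(targetD)
> pacf(targetD)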
Following the strategy in Table 6.2 of Pankratz, it is decided that we should start
with p = 3 and q = 0.
Step 2: Estimation – Next we estimate the coefficients of the ARMA model (with a few iterations) using exact maximum likelihood, which gave the following:
So perhaps taking p=3 is not the best idea. The next combination we try is:
p = 1, d = 0, q = 1:
Furthermore, choosing p = 1, d = 1, q = 1, we get:
In addition we also check the residual correlogram for both (1,0,1) and (1,1,1), and we see that the residuals look closer to white noise in the former, as shown below:
(p=1,d=0,q=1)
(p=1,d=1,q=1)
So at the end of this identification and estimation iteration of Box-Jenkins, it is decided that the best combination is p = 1, d = 0, q = 1. Thus the model that best fits our data is ARMA(1,1).
> arima0(targetD,order=c(1,0,1))
Call:
arima0(x = targetD, order = c(1, 0, 1))
Coefficients:
ar1 ma1 intercept
0.6933 0.2774 781642.2
s.e. 0.0942 0.1182 195802.9
sigma^2 estimated as 2.298e+11: log likelihood = -1450.46, aic = 2908.92
Therefore, Φ1 = 0.6933, θ1 = 0.2774 and C = 781642.2
Thus the ARMA(1,1) model that fits our time-series computer attack data looks like the following:

\hat{Y}_t = 781642.2 + 0.6933\,Y_{t-1} + \varepsilon_t - 0.2774\,\varepsilon_{t-1}
Step 3: Diagnostics – We look at the residuals to conduct this step. First we plot the standardized residuals over time:
Here it was found that the standardized residuals actually converge over time, revealing a trend; this suggests that ARMA(1,1) for the original data is possibly not a good model. We will do further diagnostics to confirm this.
Next we check the normality of the residuals using a Quantile-Quantile (Q-Q) plot, given below. Since the points on this plot do not follow the straight line closely, we can reject normality of the error terms of our model.
We also conduct both the Box-Pierce test and the Ljung-Box test on the ARMA(1,1) residuals. The Box-Pierce test is known to perform poorly in small samples, but with a sample size of 100 that is less of a concern here.
> Box.test(arima0$residuals)
Box-Pierce test
data: arima0$residuals
X-squared = 0.1023, df = 1, p-value = 0.749
> Box.test(arima0$residuals,type="Ljung-Box")
Box-Ljung test
data: arima0$residuals
X-squared = 0.1055, df = 1, p-value = 0.7454
These high p-values indicate no significant autocorrelation among the residuals, so on this criterion they are consistent with white noise. Taken together with the trend in the standardized residuals and the lack of normality seen above, however, the diagnostics suggest that our initial Box-Jenkins selection, ARMA(1,1) on the original data, is not good enough. We therefore go back to Step 1, as shown below, to try something else.
Step 1(b): Since most of the other combinations of p, d and q with the original data did not give us any favorable results in the estimation stage, we need to look into transforming the data. We thus chose log differencing as the transformation, as shown at the R command prompt below:
> log_d_targetD <- diff(log(targetD))
Checking the correlograms of this transformed data immediately revealed that there might be an MA(1) process here:
Also, fitting an autoregressive process to the log-differenced target data revealed that we should pick p = 1 for the AR part.
> ar(log_d_targetD)
Call:
ar(x = log_d_targetD)
Coefficients:
1
-0.1816
Order selected 1 sigma^2 estimated as 0.1921
>
We thus pick p = 1 and q = 1 to start with for this transformed data. Further, the ADF test on this transformed data showed that it is stationary, with no evidence of a unit root:
(Gretl output):
Dickey-Fuller test for ld_Targets
sample size 98
unit-root null hypothesis: a = 1
test without constant
model: (1-L)y = (a-1)*y(-1) + e
1st-order autocorrelation coeff. for e: -0.014
estimated value of (a - 1): -1.18129
test statistic: tau_nc(1) = -11.832
p-value 2.325e-118
test with constant
model: (1-L)y = b0 + (a-1)*y(-1) + e
1st-order autocorrelation coeff. for e: -0.014
estimated value of (a - 1): -1.18172
test statistic: tau_c(1) = -11.7763
p-value 0.000342
with constant and trend
model: (1-L)y = b0 + b1*t + (a-1)*y(-1) + e
1st-order autocorrelation coeff. for e: -0.015
estimated value of (a - 1): -1.18186
test statistic: tau_ct(1) = -11.7176
p-value 1.055e-14
(R output)
> adf.test(log_d_targetD)
Augmented Dickey-Fuller Test
data: log_d_targetD
Dickey-Fuller = -6.018, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
Warning message:
In adf.test(log_d_targetD) : p-value smaller than printed p-value
We are thus going to investigate the model of ARIMA(1,0,1) with log-differenced
target data.
Step 2(b): Here we estimate the AR and MA coefficients:
> model2
Call:
arima(x = diff(log(targetD)), order = c(1, 0, 1))
Coefficients:
ar1 ma1 intercept
0.7105 -1.0000 -0.0118
s.e. 0.0730 0.0522 0.0045
sigma^2 estimated as 0.1658: log likelihood = -52.95, aic = 111.9
Clearly this model is better than the one fitted to the original data in the previous iteration, as the AIC value is much smaller and the coefficients are much further from 0. We will thus run diagnostics on this new ARMA(1,1) fitted to the log-differenced target data.
Step 3(b): Here we look at the same outputs as we produced earlier. However, as we shall see below, apart from the randomness and lack of trend in the residuals, there is not a lot of improvement in the diagnostics compared to our previous model:
Furthermore, although the points align with the Q-Q line better than in the previous case, normality is still not fully established for the residuals of the ARMA(1,1) model on the log-differenced target data. Thus the Q-Q plot of the residuals does not confirm the desired normality:
Step 4: Forecasting – Finally we use the ARMA(1,1) models, for both the original target data and the log-differenced target data, to compare forecasts of the number of computers or Web servers (port 80) that will be targeted in May and June 2011. The following are those results:
We can definitely see that the ARMA(1,1) on the original data gives a better visual indication of where the new number of targets will be. However, the forecast for the log-differenced data shows a better confidence interval. A sketch of how these forecasts can be produced in R is given below.
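A minimal sketch, assuming model1 is the arima0 fit to the original series and model2 is the arima fit to the log-differenced series (both object names are placeholders):

predict(model1, n.ahead = 2)       # two-month-ahead forecast on the original scale
p <- predict(model2, n.ahead = 2)  # forecast of the log-differenced series
p$pred                             # these increments must be cumulated and exponentiated
                                   # to return to the original scale of "Targets"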
As we saw in the above Box-Jenkins univariate analysis, in spite of the flaws in both models, there is a slight improvement in the estimation when a transformation is adopted. Further analysis is out of the scope of this answer but will be pursued for research purposes in the near future.
(2) Tiao-Box Multivariate Analysis:
If zt = (z1t, z2t, …, zkt)' is a k-dimensional time series, then multivariate time-series analysis is required in order to improve the accuracy of forecasts. Tiao-Box multivariate time series analysis deals with a vector of two or more, possibly nonstationary, time series. The purpose of this type of analysis is to understand the dynamic relationships among the component series. The general vector ARMA(p,q) model is
\Phi(B)\,z_t = \theta_0 + \Theta(B)\,a_t            (2)

where \Phi(B) = I - \Phi_1 B - \dots - \Phi_p B^p and \Theta(B) = I - \Theta_1 B - \dots - \Theta_q B^q are matrix polynomials in the backshift operator B and a_t is a vector white-noise process.
An iterative approach with the following steps was proposed by Tiao and Box to
build a multiple time series model:
1. Identification through tentative specification
2. Estimation
3. Diagnostic checking for ARMA
The first step requires checking the cross-correlations between the two series, as shown below:
The persistence of large correlations between the two series suggests the possibility of autoregressive (AR) behavior. A lower-order MA model can be detected as well from the pattern of indicator symbols, which are used in place of numerical values in the cross-correlation tables, as shown in the example below from Tiao and Box (1981):
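A minimal sketch of this first step in R (assuming the two monthly series targetD and sourceD are already loaded as ts objects):

z <- ts.union(Targets = targetD, Sources = sourceD)
acf(z, lag.max = 12)   # auto- and cross-correlation matrices used for tentative specification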
Estimation may be carried out using either the conditional or the exact likelihood function. For general ARMA(p,q) models, the exact likelihood can be approximated by using the following transformation:
Finally diagnostic checking can be performed by:
(a) Plotting standardized residuals against time and/or other variables
(b) Cross-correlation matrices of the residuals
where the residual series will be of the following form:
(3) VAR(q) and Impulse-Response Function
While Tiao-Box multivariate analysis deals with multiple-series ARMA(p,q) processes, it is practically difficult to analyze multiple processes that have significant moving-average (MA) components. VAR(q), or Vector Autoregression, models instead deal with multiple time-series processes that are purely autoregressive of order q. More precisely, a VAR consists of a set of K endogenous variables with a certain lag order. Before running the VAR in Gretl, it was therefore important to let the software select the lag order for us (based on a maximum lag of 12, since this is monthly data), including a constant and a trend, as shown below:
This is further confirmed by the VARselect command in R whose output shows lag =
6 as the optimal choice based on AIC:
We estimate the VAR model using lag = 6, the optimal lag produced by the lag-selection procedure. The following are the VAR results, followed by results for the impulse-response function:
For “Targets” with lag = 6:
For “Sources” with lag = 6:
AIC = 51.1839
BIC = 51.9415
Residual plot for each variable:
Combined VAR residuals plot
Combined Impulse-Response functions:
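A minimal sketch of the equivalent estimation in R using the vars package (the series names targetD and sourceD are assumptions):

library(vars)
z <- ts.union(Targets = targetD, Sources = sourceD)
VARselect(z, lag.max = 12, type = "both")   # lag-order selection; AIC suggested lag = 6
fit <- VAR(z, p = 6, type = "both")         # VAR(6) with constant and trend
summary(fit)                                # coefficient estimates for both equations
plot(irf(fit, n.ahead = 12))                # impulse-response functions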
(4) State-space model and Kalman Filter
We demonstrate here the method that would be used to estimate a State-space
model.
A univariate time series can be modeled in terms of two Gaussian white-noise series as follows:
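In the standard local-level form these two equations are

y_t = \mu_t + e_t, \qquad \mu_{t+1} = \mu_t + \eta_t,

where e_t and \eta_t are independent Gaussian white-noise series.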
This represents a state-space model where y_t is the data and μ_t is the "state". The objective of this analysis is to infer the properties of that state from the data and the model. There are three types of inference:
(a) Filtering: remove the measurement errors from the data.
(b) Prediction: forecast either the state or the data.
(c) Smoothing: estimate the state using the full sample.
The goal of the Kalman filter, on the other hand, is to update our knowledge of the state μ_t recursively whenever a new data point becomes available. The following set of equations demonstrates this:
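For the local-level model above, with Var(e_t) = \sigma_e^2 and Var(\eta_t) = \sigma_\eta^2, the recursions take the familiar predict/update form

\mu_{t|t-1} = \mu_{t-1|t-1}, \qquad P_{t|t-1} = P_{t-1|t-1} + \sigma_\eta^2,
K_t = \frac{P_{t|t-1}}{P_{t|t-1} + \sigma_e^2}, \qquad \mu_{t|t} = \mu_{t|t-1} + K_t (y_t - \mu_{t|t-1}), \qquad P_{t|t} = (1 - K_t) P_{t|t-1},

where K_t is the Kalman gain and P denotes the state variance.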
(5) Intervention and transfer-function models
Transfer-function models, also known as dynamic regression models, are estimated by bringing in an exogenous variable, with a noise term that may follow some ARIMA(p,d,q) model. The latter is also called a stochastic noise process, and it is specified by examining the residuals from an OLS fit of one of the time-series variables on another. Since we are introducing an exogenous variable here, we make use of the ARIMAX model in Gretl for our data about computer attacks.
For our purposes we select the variable “Sources” as an exogenous variable in
ARIMA. The result shows all coefficients are significant as follows:
As we can see in the above results, "Sources" is the exogenous variable and it quite significantly affects our dependent variable "Targets" at time t. A sketch of an equivalent regression-with-ARMA-errors fit in R is given below.
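A minimal sketch of the equivalent fit in R (the series names are assumptions):

fit_x <- arima(targetD, order = c(1, 0, 1), xreg = sourceD)  # "Sources" as an exogenous regressor
fit_x                                                        # the coefficient on sourceD plays the transfer-function role
predict(fit_x, n.ahead = 2, newxreg = tail(sourceD, 2))      # forecasting needs future regressor values;
                                                             # the last two observed values are reused here as a placeholder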
This new model gives a satisfactory forecast as seen below:
This forecast result should be compared with the forecasts from ARMA(1,1) on the original and the log-differenced target data. The plot above can also be obtained by comparing fitted vs. actual "Targets" against time.
(6) Unit root tests
These were performed at the beginning of part (1), where we were doing the Box-Jenkins analysis. They were done early because we wanted an indication of whether our data needed differencing. Please refer back to Part (1), titled "Box-Jenkins Univariate Analysis".
(7) Co-Integration and Error-Correction Model
As we can see in the results from both the Engle-Granger test and the Johansen test below, no evidence of cointegration was found in our data, and thus we did not have to formulate an error-correction model. A sketch of how these tests can be run in R follows.
(8) GARCH / ARCH / EGARCH
1. ARCH:
With GARCH order 0 and ARCH order 1 for "Targets", the estimates imply there is an ARCH(1) effect in the data.
With GARCH order 1 and ARCH order 1 for "Targets", the estimates imply there is less of a GARCH effect in the data. (A sketch of the equivalent estimation in R follows item 2 below.)
2. EGARCH – Estimation failed
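As referenced above, a minimal sketch of fitting these models in R with the tseries package (the series name is an assumption; the series is log-differenced first, as in the univariate analysis):

library(tseries)
r <- diff(log(targetD))                  # log-differenced "Targets"
fit_arch  <- garch(r, order = c(0, 1))   # ARCH(1): GARCH order 0, ARCH order 1
fit_garch <- garch(r, order = c(1, 1))   # GARCH(1,1)
summary(fit_garch)                       # coefficient tests plus Jarque-Bera and Ljung-Box diagnostics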
(9 – 11): Discussion
What we have seen from the data above is that the number of targets of computer attacks, especially those running Web servers on port 80, does seem to be related to the number targeted a month earlier. As discussed at the beginning of this answer, an interesting observation, especially from the VAR(p) model, is that there seems to be an effect of the number of attacking sources on the number of targets attacked. This finding seems logical because the sources and targets are connected by the Internet.
One of the key findings, especially from the poor diagnostic results in the Box-Jenkins analysis, is that, given the speed at which IT and computer-networking technology evolves, we might not get a very reliable time-series model from data that span 8 years. It is evident from the initial plot of the targets that more computers used to be attacked (and successfully so) back in the early 2000s, and this could partly be attributed to a combination of factors like (a) lack of awareness about security issues, (b) lack of affordable security technologies and (c) lack of skills among website maintainers. This tells us that it is not appropriate to apply a purely time-series model for estimating, predicting and forecasting the average number of Web servers that are targets of attacks every month. In other words, as seen in the intervention stage, some non-stochastic exogenous variable might play a significant role in the model and be able to explain the expected number of targeted Web servers much better than what we were able to achieve here.
This kind of prediction and forecast information is essential for managers of organizations, as well as for security device and software companies, to understand their market. That motivation is what drives our research; to the best of our knowledge, not many event studies have been done to understand the patterns of Web server attacks, whether in computer science or in MIS.