A Case Study in Robust Quickest Detection for
Hidden Markov Models

by

Aliaa Atwi

BEng, American University of Beirut (2008)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2010

© Massachusetts Institute of Technology 2010. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, September 3, 2010

Certified by: Munther A. Dahleh, Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Terry P. Orlando, Chairman, Department Committee on Graduate Theses
A Case Study in Robust Quickest Detection for Hidden
Markov Models
by
Aliaa Atwi
Submitted to the Department of Electrical Engineering and Computer Science
on September 3, 2010, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering
Abstract
Quickest Detection is the problem of detecting abrupt changes in the statistical behavior of an observed signal in real time. The literature has focused much attention on the problem for i.i.d. observations. In this thesis, we assess the feasibility of two recently suggested HMM quickest detection frameworks for detecting rare events in a real data set. The first method is a dynamic programming based Bayesian approach, and the second is a non-Bayesian approach based on the cumulative sum algorithm. We discuss implementation considerations for each method and demonstrate their performance through simulations on a real data set. In addition, we examine, through simulations, the robustness of the non-Bayesian method when the disruption model is not exactly known but belongs to a known class of models.
Thesis Supervisor: Munther A. Dahleh
Title: Professor of Electrical Engineering and Computer Science
Acknowledgments
First and foremost, I thank God for all the experiences He put in my way, including
the ones I was too short-sighted to understand and appreciate until much later.
I would like to thank my advisor Professor Munther Dahleh for giving me the
opportunity to join his group, and for being extremely patient and supportive thereafter. I appreciate the lessons I learned from you, both technical and otherwise. Your
insight allowed me to experience a very rewarding research journey, much needed for
a rookie like me.
I would also like to thank Dr. Ketan Savla, for going above and beyond the call
of duty on several occasions to help me with this work. Your feedback and criticism
are highly appreciated, and I look forward to more.
My colleagues in LIDS have each contributed to making my time more enjoyable.
I would especially like to thank Mesrob Ohannessian for supporting me in difficult
times and for his willingness to go through philosophical discussions that no one else
would agree to have with me. I would also like to thank Parikshit Shah for expanding
my foodie horizons and for all the fun times we had together. Yola Katsargiri for
being a constant source of positive energy and enthusiasm for me and a lot of people
in LIDS. Michael Rinehart for the long enjoyable chats, tons of advice, and the told-you-so's that followed. Giancarlo Baldan for being an awesome office mate, friend
and source of encouragement. Sleiman Itani for introducing me to LIDS and being a
valuable friend ever since. I would also like to thank Giacomo Como, Rose Faghih,
Mitra Osqui, Ermin Wei, AmirAli Ahmadi, Ali Parandegheibi, Soheil Feizi, Arman
Rezaee, Ulric Ferner, Noah Stein for the conversations, help and support they have
given me.
Finally, I would like to thank the dearest people to my heart, my family. Mum and
dad for their unconditional love and countless sacrifices. My grandparents for keeping
me in their dawn prayers year after year. And Sidhant, for being my best friend and
my rock. I dedicate my thesis to all of you.
Contents

1 Introduction
2 Quickest Detection for I.I.D. Observations
  2.1 Problem Statement
      2.1.1 Non-Bayesian Formulations
      2.1.2 Bayesian Formulations
  2.2 Robustness
3 DP-Based Bayesian Quickest Detection for Observations Drawn from a Hidden Markov Model
  3.1 Setup and Problem Statement
  3.2 Bayesian Framework
  3.3 Solution
      3.3.1 An Infinite Horizon Dynamic Program with Imperfect State Information
      3.3.2 ε-Optimal Alarm Time and Algorithm
4 Implementation of the Bayesian DP-based HMM Quickest Detection Algorithm for Detecting Rare Events on a Real Data Set
  4.1 Description of Data Set
  4.2 Proposed Setup
  4.3 Modeling the Data Set using HMMs
  4.4 Implementation of the Quickest Detection Algorithm
  4.5 Simulation Results
5 Non-Bayesian CUSUM Method for HMM Quickest Detection
  5.1 Sequential Probability Ratio Test (SPRT)
  5.2 Cumulative Sum (CUSUM) Procedure
  5.3 HMM CUSUM-Like Procedure for Quickest Detection
  5.4 Simulation Results for Real Data Set
  5.5 Robustness Results
6 Conclusion
7 Appendix
List of Figures

4-1 An Average Week of Network Traffic
4-2 Three Weeks of Data from Termini with a Surge in Cellphone Network Traffic around Termini on Tuesday of the Second Week
4-3 log{Pr[O | Q(N_s)]} for Increasing Number of States of Underlying Model
5-1 Detection Time Versus h for 1 Week Termini Data Fit with a 6-State HMM: h ∈ (1, 1200)
5-2 Detection Time Versus h for 1 Week Termini Data Fit with a 6-State HMM: h ∈ (10, 1000)
5-3 Detection Time Versus h for 1 Week Termini Data Fit with a 6-State HMM: h ∈ (0, 1.6)
5-4 HMM-CUSUM as a Repeated SPRT: No False Alarm Case
5-5 HMM-CUSUM as a Repeated SPRT: Case of False Alarm around h = 150
5-6 HMM-CUSUM as a Repeated SPRT: Case of No Detection Caused by Too High Threshold
5-7 Average Alarm Time Versus h for Locally Optimal 6-State Model Describing Rome Data
5-8 Average Alarm Time Versus h for Different Models in the Disruption Class (* = Model 1) (o = Model 2) (- = Model 3)
5-9 Performance of HMM-CUSUM when the Assumed Disruption Model is that Obtained from the Termini Data and the Actual Disruptions are Drawn from each Model in C (- = Model from Termini) (* = Perturbed Model 1) (o = Perturbed Model 2) (- = Perturbed Model 3)
5-10 Performance of HMM-CUSUM when the Assumed Disruption Model is "Perturbed Model 1" and the Actual Disruptions are Drawn from each Model in C (- = Model from Termini) (* = Perturbed Model 1) (o = Perturbed Model 2) (- = Perturbed Model 3)
5-11 Performance of HMM-CUSUM when the Assumed Disruption Model is "Perturbed Model 2" and the Actual Disruptions are Drawn from each Model in C (- = Model from Termini) (* = Perturbed Model 1) (o = Perturbed Model 2) (- = Perturbed Model 3)
5-12 Performance of HMM-CUSUM when the Assumed Disruption Model is "Perturbed Model 3" and the Actual Disruptions are Drawn from each Model in C (- = Model from Termini) (* = Perturbed Model 1) (o = Perturbed Model 2) (- = Perturbed Model 3)
List of Tables

4.1 Model Parameters of Business As Usual and Disruption States for N_s = 3
4.2 Detection Time and Horizon for Different False Alarm Costs when N_s = 3
4.3 Model Parameters for Business As Usual and Disruption for N_s = 4
4.4 Detection Time and Horizon for Different False Alarm Costs when N_s = 4
5.1 Actual Disruption Emission λ's (Transition Matrix and Initial Probability Same as Those from Termini)
5.2 Actual Disruption Emission λ's (Transition Matrix and Initial Probability Same as Those from Termini)
7.1 Model Parameters for Business As Usual for N_s = 6
7.2 Model Parameters for Business As Usual and Disruption for N_s = 6
Chapter 1
Introduction
Quickest Detection is the problem of detecting abrupt changes in the statistical behavior of an observed signal in real time. Designing optimal quickest detection procedures typically involves a tradeoff between two performance criteria: a measure of the delay between the actual change point and the time of detection, and a measure of the frequency of false alarms [15].
The literature has focused much attention on the case of i.i.d. observations before and after the change occurs. However, real applications often involve complex structural interdependencies between data points, which may be modeled using hidden Markov models (HMMs). HMMs have been successfully used in a wide array of real applications, including speech recognition, economics, and digital communications. A recent paper by Dayanik and Goulding [6] provides a Bayesian dynamic programming based framework to study the quickest detection problem when observations are drawn from a hidden Markov model. In addition, Chen and Willett suggested a non-Bayesian framework for studying the problem in [5]. In this thesis, we assess the feasibility of the suggested methods in detecting disruptions in a real data set consisting of a time series of normalized communication network traffic from Rome.
A recent trend in urban planning is to view cities as cyber-physical systems (CPS)
that acquire data from different sources and use it to make inferences about the states
of cities and provide services to their inhabitants. Cell phone data is available at little
to no cost to city planners, and can be used to detect emergencies that require a rapid
response. A timely response is crucial in such an application to avert catastrophes,
and false alarms may lead to unneeded costly measures. Thus, the problem of using
cell phone data to detect abnormal states in a city can be suitably formulated as a
quickest detection problem. In addition, the nature of cell phone data as data that
reflects periodic human activity indicates that HMMs would provide a more accurate
description of the data than i.i.d. alternatives.
An important consideration in our application is robustness when the disruption
model is not exactly known. This is a realistic assumption to make about rare events
(abnormal states). A recent paper by Unnikrishnan et al. [8] addresses this problem for the case of i.i.d. observations. In this thesis, we examine the issue of robustness for the non-Bayesian framework suggested in [5] through simulations.
The rest of this thesis is organized as follows:
* In Chapter 2, we describe the classical quickest detection problem where the
observations before and after the change are i.i.d. samples from two independent
distributions. We outline both Bayesian and non-Bayesian formulations often
used to describe the optimal tradeoff between detection delay and false alarm.
We also describe the minimax robustness criterion in [8] for the i.i.d. case.
" In chapter 3, we present the Bayesian DP-based formulation suggested in [6]
for solving the problem when the observations are sampled from HMMs. We
describe the optimal solution, which is based on infinite horizon dynamic programming with imperfect state knowledge.
" In chapter 4, we discuss the implementation aspects of the Bayesian DP-based
quickest detection algorithm in [6] to detect disruptions in the real data set at
hand. More especifically, we describe the data set and the procedure we follow
for modeling it with HMMs. We then describe the approximations required
to address several computational challenges in the dynamic program describing
the optimal solution. Finally, we present the results of our simulations for the
Rome data.
* In Chapter 5, we present the non-Bayesian formulation suggested in [5], which extends the CUSUM algorithm often used for the i.i.d. case to the case of observations from an HMM. We explain the behavior of the suggested procedure through simulations, and proceed to examine the performance of the procedure when the disruption model is not entirely known, but is known to belong to a class of models.
* Finally, in Chapter 6, we summarize our findings and suggest directions for future work.
Chapter 2
Quickest Detection for I.I.D.
Observations
The problem of detecting abrupt changes in the statistical behavior of an observed signal has been studied since the 1930s in the context of quality control. In more recent years, the problem has been studied for a wide variety of applications including climate modeling, econometrics, environmental and public health, finance, image analysis, medical diagnosis, navigation, network security, neuroscience, remote sensing, fraud detection and counter-terrorism, and even the analysis of historical texts. In many of these applications, the detection is required to occur online (in real time), as soon as possible after the change happens. Some examples of situations that require immediate detection are: the monitoring of cardiac patients, detecting the onset of seismic events that precede earthquakes, and detecting security breaches [15].
This type of problem, known as the "quickest detection problem", will be the focus
of this chapter. In particular, we will address the case when the observed data is i.i.d.
before the change, and continues to be i.i.d. thereafter under a different model.
2.1 Problem Statement
Consider a sequence Z₁, Z₂, ... of random observations, and suppose that there is a change time T ≥ 1 such that Z₁, Z₂, ..., Z_{T−1} are i.i.d. according to the known marginal distribution Q₀, and Z_T, Z_{T+1}, ... are i.i.d. according to a different known marginal distribution Q₁. For a given T, the sequence Z₁, Z₂, ..., Z_{T−1} is independent of Z_T, Z_{T+1}, .... Our aim is to detect the change in the model describing the observations as quickly as possible after it happens, while minimizing the frequency of false alarms. A sequential change detection procedure is characterized by a stopping time τ with respect to the observation sequence. Essentially, τ is a random variable defined on the space of observation sequences, where τ = k is equivalent to deciding that the change time T for a certain sequence has occurred at or before time k. We denote by Δ the set of all allowable stopping times.
2.1.1 Non-Bayesian Formulations
In Non-Bayesian quickest detection, the change time T is taken to be a fixed but
unknown quantity. Several formulations have been proposed for the optimal tradeoff
between false alarm and detection delay.
The minimax formulation suggested by Lorden in [10] is based on minimizing
the worst-case detection delay (over all possible change points T and all possible
realizations of the pre-change sequence) subject to a false alarm constraint. The
worst-case delay is defined as,
WDD(τ) = sup_{T≥1} ess sup E_Q[(τ − T + 1)⁺ | Z₁, Z₂, ..., Z_{T−1}]

where E_Q refers to the expectation operator when the change happens at T and the pre-change and post-change sequences are sampled from Q₀ and Q₁ respectively. The false alarm rate is defined as

FAR(τ) = 1 / E_{Q₀}[τ]

where E_{Q₀}[τ], computed under the assumption that no change ever occurs, can be interpreted as the mean time to false alarm. The objective can then be expressed as:

min_τ WDD(τ)  s.t.  FAR(τ) ≤ α    (2.1)
The optimal solution to (2.1) was shown by Moustakides [12] to be Page's decision rule [13]:

τ_C = inf{ n ≥ 1 : max_{1≤k≤n} Σ_{i=k}^{n} L_Q(Z_i) ≥ β }    (2.2)

where L_Q is the log-likelihood ratio between Q₁ and Q₀, and β is chosen so that the false alarm constraint FAR(τ_C) ≤ α is met with equality.
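Page's rule (2.2) is easy to compute online, since the inner maximum satisfies the recursion S_n = max(0, S_{n−1} + L_Q(Z_n)). The following minimal Python sketch illustrates this (it is not part of the original development; the Gaussian mean-shift likelihood ratio and the sample values are purely illustrative assumptions):

```python
def cusum_alarm_time(observations, llr, beta):
    """Page's rule (2.2) via the recursion S_n = max(0, S_{n-1} + L_Q(Z_n));
    alarm at the first n with S_n >= beta."""
    s = 0.0
    for n, z in enumerate(observations, start=1):
        s = max(0.0, s + llr(z))
        if s >= beta:
            return n
    return None  # no alarm within the observed sequence

# Illustrative example: mean shift of a unit-variance Gaussian from 0 to 1,
# for which the log-likelihood ratio is L_Q(z) = z - 1/2.
llr = lambda z: z - 0.5
pre = [0.1, -0.2, 0.3, 0.0]   # samples before the change
post = [1.2, 0.9, 1.1, 1.3]   # samples after the change
print(cusum_alarm_time(pre + post, llr, beta=2.0))
```

The recursion resets the statistic to zero whenever the evidence favors Q₀, which is what makes the worst-case delay criterion tractable.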
Another formulation was suggested by Pollak in [14], in which the worst-case average delay is used instead of WDD(τ) to quantify delay.
2.1.2 Bayesian Formulations
In Bayesian quickest detection, the change time T is assumed to be a random variable with a known prior distribution. Let C_D be a constant describing the cost for each observation we take past the change time T, and C_F be the cost of false alarm. The performance measures are the average detection delay:

ADD(τ) = E[(τ − T)⁺]

where the expectation is over T and all possible observation sequences, and the probability of false alarm:

PFA(τ) = P(τ < T)
The objective is to seek τ ∈ Δ that solves the optimization problem:

inf_{τ∈Δ} ADD(τ)  s.t.  PFA(τ) ≤ α

Equivalently, the optimization problem can be expressed as:

inf_{τ∈Δ} C_F · PFA(τ) + C_D · ADD(τ)

where the ratio c = C_D/C_F is chosen to guarantee PFA(τ) ≤ α.
Taking the prior on T to be geometric, the optimal policy is given by Shiryaev's test [18]:

τ_opt = inf{ k ≥ 0 : π_k ≥ π* }

where π_k is the posterior probability that T ≤ k given the observation sequence up to time k, and π* is a threshold that depends on the ratio c = C_D/C_F.
When c = 1, π* = 0.
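The posterior π_k in Shiryaev's test admits a simple two-step recursion: propagate the geometric prior forward, then apply Bayes' rule with the new observation. A minimal Python sketch (illustrative only; the Gaussian densities, prior parameter, and threshold below are assumptions, not part of the thesis setup):

```python
import math

def shiryaev_alarm_time(observations, f0, f1, p, pi_star):
    """Shiryaev's test: stop the first time the posterior
    pi_k = Pr(T <= k | Z_1..Z_k) crosses pi_star, for a geometric
    prior on the change time T with parameter p."""
    pi = 0.0
    for k, z in enumerate(observations, start=1):
        pi_pred = pi + (1.0 - pi) * p          # prior probability of change by time k
        num = pi_pred * f1(z)                  # weight of z under the post-change density
        pi = num / (num + (1.0 - pi_pred) * f0(z))
        if pi >= pi_star:
            return k
    return None

# Illustrative densities: unit-variance Gaussians (shared constants cancel).
f0 = lambda z: math.exp(-z * z / 2)
f1 = lambda z: math.exp(-(z - 1.0) ** 2 / 2)
obs = [0.1, -0.3, 1.1, 0.9, 1.2, 1.0]
print(shiryaev_alarm_time(obs, f0, f1, p=0.05, pi_star=0.5))
```

Note how the prior-propagation step pushes π_k upward even without informative observations, reflecting the belief that the change becomes more likely as time passes.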
2.2 Robustness
The policies outlined above are optimal assuming that we have exact knowledge of Q₀ and Q₁. Several applications, however, involve imperfect knowledge of these distributions. [8] provides robust versions of the quickest detection problems above when the pre-change and post-change distributions are not known exactly but belong to known uncertainty classes of distributions P₀ and P₁. For the Bayesian criterion, the version of the (minimax) robust problem suggested in [8] is:

min_τ  sup_{(ν₀,ν₁)∈P₀×P₁}  E[(τ − T + 1)⁺]  s.t.  sup_{ν₀∈P₀} Pr(τ < T) ≤ γ
For uncertainty classes that satisfy specific conditions, least favorable distributions (LFDs) can be identified such that the solution to the robust problem is the same as the solution of the non-robust problem designed for the LFDs. To describe the conditions, we need to introduce the notion of "joint stochastic boundedness".

Notation: μ ≻ μ′ means that if X ∼ μ and X′ ∼ μ′, then Pr(X > m) ≥ Pr(X′ > m) for all real m.

Definition (Joint Stochastic Boundedness): Let (ν̄₀, ν̄₁) ∈ P₀ × P₁ be a pair of distributions such that ν̄₁ is absolutely continuous with respect to ν̄₀, and let L* denote the log-likelihood ratio between ν̄₁ and ν̄₀. For each ν_j ∈ P_j, let μ_j denote the distribution of L*(X) when X ∼ ν_j, j = 0, 1; μ̄₀ and μ̄₁ denote the distributions of L*(X) when X ∼ ν̄₀ and X ∼ ν̄₁ respectively. The pair (P₀, P₁) is said to be jointly stochastically bounded by (ν̄₀, ν̄₁) if for all (ν₀, ν₁) ∈ P₀ × P₁, μ̄₀ ≻ μ₀ and μ₁ ≻ μ̄₁.

Under certain assumptions on P₀ and P₁, the pair (ν̄₀, ν̄₁) are LFDs for the robust quickest detection problem. Loosely speaking, the LFD from one uncertainty class is the distribution that is "nearest" to the other uncertainty class.
The conditions on P₀ and P₁ for the Bayesian robust quickest detection problem are:

* P₀ contains only one distribution ν̄₀, and the pair (P₀, P₁) is jointly stochastically bounded by (ν̄₀, ν̄₁).
* The prior distribution of T is geometric.
* L*(·) is continuous over the support of ν̄₀.
Chapter 3
DP-Based Bayesian Quickest
Detection for Observations Drawn
from a Hidden Markov Model
3.1 Setup and Problem Statement

Consider a finite-state Markov chain M = {M_t; t ≥ 1} with d states, and suppose that the initial state distribution and the one-step transition matrix of M change suddenly at some unobservable random time T. Conditioned on the change time, M is time-homogeneous before T with initial state distribution μ and one-step transition matrix W₀, and is time-homogeneous thereafter with initial distribution ρ and one-step transition matrix W₁.
The change time T is assumed to have a zero-modified geometric distribution with parameters θ₀ and θ, meaning that

T = 0  w.p. θ₀
T = t  w.p. (1 − θ₀)(1 − θ)^{t−1} θ,  t ≥ 1

Let the process X = {X_t; t ≥ 1} denote a sequence of noisy observations of M. The probability distribution of X_t is a function of the current state M_t and of whether or not the change has occurred by time t. For instance, X_t can be assumed to have a Poisson distribution with parameter λ_{ij}, where i ∈ {1, ..., d} refers to the value of M_t, and j = 1{t ≥ T} is 0 before the change and 1 thereafter.

We would like to use the noisy observation sequence X to detect the change in the underlying unobservable sequence M as soon as possible while minimizing false alarms.
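To make the setup concrete, the following Python sketch simulates one reading of this observation model (all parameters are hypothetical; we assume here, consistent with the description above, that the hidden chain re-initializes from ρ at the change time):

```python
import math
import random

def sample_categorical(probs):
    """Draw an index according to the probability vector probs."""
    u, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u < c:
            return i
    return len(probs) - 1

def sample_poisson(lam):
    """Knuth's method; adequate for the modest rates used here."""
    threshold, k, prod = math.exp(-lam), 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

def simulate(T_change, n, mu, W0, rho, W1, lam):
    """Draw X_1..X_n: the hidden chain follows (mu, W0) before T_change,
    re-initializes from rho at the change and then follows W1; X_t is
    Poisson with rate lam[m][j], where j = 1{t >= T_change}."""
    xs, m = [], 0
    for t in range(1, n + 1):
        if t == T_change:
            m = sample_categorical(rho)   # chain restarts at the change
        elif t == 1:
            m = sample_categorical(mu)
        else:
            m = sample_categorical((W1 if t > T_change else W0)[m])
        xs.append(sample_poisson(lam[m][1 if t >= T_change else 0]))
    return xs

# Hypothetical 2-state example: low emission rates before the change, high after.
random.seed(0)
x = simulate(6, 12, mu=[1.0, 0.0], W0=[[0.9, 0.1], [0.2, 0.8]],
             rho=[0.0, 1.0], W1=[[0.5, 0.5], [0.5, 0.5]],
             lam=[[2.0, 10.0], [3.0, 12.0]])
print(x)
```

Runs of this kind are what the detection algorithms in the remainder of the thesis must process online, one sample at a time.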
The schemes in chapter 2 were developed assuming independence between observations conditioned on the change time. The Markov assumption of the setup in this
chapter necessitates a different approach for solving the problem.
3.2 Bayesian Framework

The framework presented in this section was proposed in [6]. It proceeds as follows: define the process Y by Y_t = (M_t, 1{t ≥ T}) for t ≥ 1. Y_t = (d, 0) has the interpretation that M_t = d and the change has not occurred yet (t < T). Similarly, Y_t = (d, 1) means that the change has occurred and M_t = d. The state space of the process is

Y = {(1, 0), ..., (d, 0), (1, 1), ..., (d, 1)}

and is partitioned into

Y₀ = {(1, 0), ..., (d, 0)}  and  Y₁ = {(1, 1), ..., (d, 1)}
First, we observe that Y is a Markov process with initial distribution η = ((1 − θ₀)μ, θ₀ρ) and one-step transition matrix

P = [ (1 − θ)W₀   θ𝟙ρ ]
    [     0         W₁  ]

where 𝟙 denotes a d × 1 column vector of ones, so that at the change time the post-change state is drawn afresh from ρ. We can also see that Y₁ forms a recurrent class, and the states in Y₀ are transient. Therefore, the change time T of M is the time until absorption of Y in Y₁:

T = min{t ≥ 1 : Y_t ∈ Y₁}
The desire to detect changes quickly is reflected in a cost a paid for every observation taken after T without detecting a change. False alarms are penalized by a cost b for declaring a change before T.¹ The Bayes risk associated with a certain decision rule τ is thus:

R(τ) = a · E[(τ − T)⁺] + b · Pr(τ < T)    (3.1)

where the expectation is taken over all possible sequences X and Y and all change times T.
The objective is to solve the following optimization problem:

inf_{τ∈Δ} R(τ)    (3.2)

3.3 Solution
For every t ≥ 0, let Π_t = (Π_t(y), y ∈ Y) be the row vector of posterior probabilities

Π_t(y) = Pr{Y_t = y | X₁, X₂, ..., X_t},  y ∈ Y

that the Markov chain Y is in state y ∈ Y at time t given the history of the observation process X.

The process {Π_t, t ≥ 0} is a Markov process on the probability simplex state space P = {π ∈ [0, 1]^|Y| : Σ_{y∈Y} π(y) = 1}, with

Π_{t+1} = Π_t P diag(f(X_{t+1})) / (Π_t P f(X_{t+1})ᵀ)    (3.3)

where f(X_{t+1}) is the row vector of emission probabilities of X_{t+1} under each y ∈ Y, and diag(f(X_{t+1})) is the diagonal matrix formed using the elements of f(X_{t+1}).

¹The framework suggested in [6] allows different states of Y to be associated with different costs for delay and false alarm; however, in this thesis, we assume the cost to be uniform across all states.
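For concreteness, one application of (3.3) is just a predict step through P, an elementwise reweighting by the emission probabilities, and a normalization. A minimal plain-Python sketch (the transition matrix and emission vector below are illustrative inputs, not the thesis model):

```python
def posterior_update(pi, P, f_x):
    """One application of (3.3): propagate pi through P, reweight each state
    by the emission probability f_x[y] of the new observation, and normalize."""
    n = len(pi)
    pred = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]  # pi P
    unnorm = [pred[j] * f_x[j] for j in range(n)]                      # elementwise product
    z = sum(unnorm)                                                    # pi P f(x)^T
    return [u / z for u in unnorm]

# Toy 2-state illustration: state 1 is absorbing and favored by the observation.
pi_next = posterior_update([0.5, 0.5], [[0.9, 0.1], [0.0, 1.0]], [0.2, 0.8])
print(pi_next)
```

This is the standard HMM filtering recursion; the normalizing constant z is exactly the denominator Π_t P f(X_{t+1})ᵀ in (3.3).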
Let g(π) be the expected delay cost for the current sample over all possible underlying y ∈ Y given the past. Similarly, define h(π) as the expected cost of false alarm over y ∈ Y given our knowledge of the past if a change is declared at the current sample. The two quantities are given by:

g(π) = a Σ_{y∈Y₁} π(y),    h(π) = b Σ_{y∈Y₀} π(y)

The Bayes risk can then be expressed as

R(τ) = E[ Σ_{t=0}^{τ−1} g(Π_t) + h(Π_τ) ],  for all τ ∈ Δ

where the expectation is taken over all possible observation sequences X and underlying Markov chains Y.
[6] suggests that the optimal cost of the problem (3.2) is a function of η, the initial state probability distribution of Y. More specifically,

R* = inf_{τ∈Δ} R(τ) = J(η)

where

J(π) = inf_{τ∈Δ} E_π[ Σ_{t=0}^{τ−1} g(Π_t) + h(Π_τ) ]

J(π) is the value function of an optimal stopping problem over the Markov process Π, and E_π is the expected value over all sequences X and Y given that Π₀ = π. [6] proceeds to prove that the value function satisfies the following Bellman equation:

J(π) = min{ h(π), g(π) + E[J(Π′) | π] }    (3.4)

where Π′ is obtained from π through the update equation (3.3), and the expectation is taken over Π′. This will be explained in more detail in the next subsection.
3.3.1 An Infinite Horizon Dynamic Program with Imperfect State Information

Equation (3.4) captures the tradeoff between immediate and future costs at the heart of dynamic programming (DP). Dynamic programming is concerned with decisions made in stages, where each decision poses an immediate cost and influences the context in which future decisions are made, in a way that can be predicted to some extent in the present. The goal is to find decision-making policies that minimize the cost incurred over a number of stages. In DP formulations, a discrete-time dynamic system with current state i is assumed to transition to state j with probability p_{ij}(u) dependent on the control u. A transition from i to j under u poses a cost c(i, u, j) in the present and a cost J*(j) in the future, where J*(j) is referred to as the optimal cost-to-go of state j over all remaining stages. The costs-to-go can be shown to satisfy a form of Bellman's equation:

J*(i) = min_u E[ c(i, u, j) + J*(j) | i, u ],  for all i,    (3.5)

where the expectation is taken over j [4].
The DP formulation just described assumes perfect knowledge of the system's states. Our setup, however, grants us access only to noisy observations of the states Y, making this a DP problem with imperfect state information. [3] shows that this type of problem can be reduced to a perfect-state-knowledge DP whose state space consists of the partial information available at every time. In our application, the partial information at time t is the vector of posterior probabilities π_t obtained from the observation sequence X₁, X₂, ..., X_t and the initial state distribution π₀.

Equation (3.4) is clearly a DP Bellman equation of the same form as (3.5), where the state space is the 2d-dimensional simplex P. The decision u is either to "declare a change and stop" or to "continue sampling". If we choose to continue sampling, we incur a cost c(π, continue, π′) = g(π) in the present, representing the expected current sample cost (over the possible underlying states of Y given the history so far); in addition, we incur E[J(Π′) | π] in the future, where the expectation is taken over the next posterior Π′ (a function of the next random noisy observation). On the other hand, if we choose to declare a change, we incur a cost c(π, stop, π′) = h(π) in the present, representing the expected cost of false alarm, and no future costs.
The Bellman equation (3.4) has an infinite horizon, making it computationally infeasible. An approximate solution can be obtained by solving the problem for only N(ε) stages, where ε corresponds to a tolerable error margin. Let R_{N(ε)} be the optimal cost for the finite horizon problem with N(ε) stages. [6] indicates that for every positive ε, there exists N(ε) such that R_{N(ε)} − R* ≤ ε; the explicit expression for N(ε) given in [6] grows with the false alarm cost, with the precision 1/ε, and with the expected time of change. This expression is consistent with the intuition that higher false alarm costs, lower sampling costs, higher precision, and a higher expected time of change each call for a longer horizon. R_{N(ε)} is the N-stage value function evaluated at Π₀ = η. The N-stage finite horizon value function for a starting posterior probability π can be obtained through the following recursion:

J_{k+1}(π) = min{ h(π), g(π) + E[ J_k(Π_{t+1}) | Π_t = π ] }    (3.6)

for k = 0, 1, ..., N(ε) − 1, with J₀ := h; here Π_{t+1} can easily be calculated from Π_t = π for a given value of X_{t+1} using (3.3), and the expectation is taken over X_{t+1}. Simply stated, this equation uses the optimal cost-to-go calculated for a horizon k to calculate the optimal cost-to-go for a horizon k + 1.
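To illustrate recursion (3.6) numerically, consider the degenerate case d = 1, where the posterior collapses to the scalar π = Pr(change has occurred) and the simplex can be discretized by a grid. The binary observation alphabet and all parameter values below are illustrative assumptions, not the Rome model:

```python
def value_iteration(N, grid_size, a, b, theta, f0, f1):
    """Recursion (3.6) for d = 1: pi = Pr(change by now) is a scalar,
    observations take values in {0, 1} with pre/post-change pmfs f0, f1,
    g(pi) = a*pi, h(pi) = b*(1 - pi), and J_0 := h."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    h = lambda p: b * (1.0 - p)
    J = [h(p) for p in grid]

    def interp(p):
        # linear interpolation of the current J on the grid
        t = p * (grid_size - 1)
        i = min(int(t), grid_size - 2)
        w = t - i
        return (1.0 - w) * J[i] + w * J[i + 1]

    for _ in range(N):
        new_J = []
        for p in grid:
            pt = p + (1.0 - p) * theta                    # predictive change probability
            cont = a * p                                  # expected delay cost g(pi)
            for x in (0, 1):
                m = pt * f1[x] + (1.0 - pt) * f0[x]       # marginal probability of x
                if m > 0.0:
                    cont += m * interp(pt * f1[x] / m)    # E[J_k(next posterior)]
            new_J.append(min(h(p), cont))
        J = new_J
    return grid, J

grid, J = value_iteration(N=20, grid_size=101, a=1.0, b=10.0,
                          theta=0.05, f0=(0.8, 0.2), f1=(0.2, 0.8))
# The stopping region Gamma_N is where J coincides with h.
stop = [p for p, v in zip(grid, J) if abs(v - 10.0 * (1.0 - p)) < 1e-9]
print(min(stop))
```

In this scalar case the stopping region comes out as an upper interval of π values; in the general HMM problem the same recursion runs over the 2d-dimensional simplex, which is precisely the computational challenge discussed in the next chapter.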
3.3.2 ε-Optimal Alarm Time and Algorithm

We recap that the problem is to detect a change in the underlying Markov model by observing noisy samples of it. Every sample past the change point where the change goes undetected poses a cost a, and false alarms pose a cost b. The algorithm for detecting the change suggested in [6] is as follows: starting with observation X₁ and Π₀ = η, calculate the updated posterior Π₁ from Π₀ and X₁ using (3.3). If Π₁ belongs to the region

Γ_{N(ε)} = { π ∈ P : J_{N(ε)}(π) = h(π) }

then declare that a change has occurred and stop sampling. Otherwise, repeat for t = 2 by sampling X₂ and calculating Π₂ from Π₁ and X₂, and so on.

The region Γ_{N(ε)} is calculated offline, only once, for a given set of model parameters and costs. [6] shows that the region is a non-empty closed convex subset of the 2d-dimensional simplex P that shrinks with increasing N(ε), and converges to Γ (the infinite horizon optimal stopping region) as N → ∞. Calculating this region in practice poses problems that will be addressed in the next chapter.
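As an illustration of this online procedure, consider again the degenerate case d = 1, where the posterior is the scalar π = Pr(change has occurred) and we assume, for the sketch only, that the precomputed region reduces to a threshold test on π (binary observations and all parameters below are illustrative):

```python
def detect(observations, theta, f0, f1, in_stop_region):
    """Online alarm rule: update the scalar posterior pi after each sample and
    stop the first time it enters the precomputed stopping region."""
    pi = 0.0
    for t, x in enumerate(observations, start=1):
        pt = pi + (1.0 - pi) * theta           # predictive change probability
        num = pt * f1[x]
        pi = num / (num + (1.0 - pt) * f0[x])  # Bayes update, cf. (3.3)
        if in_stop_region(pi):
            return t, pi
    return None, pi

# Illustrative run: the region is a simple threshold here (an assumption).
obs = [0, 0, 0, 1, 1, 1, 1]
t_alarm, pi = detect(obs, theta=0.1, f0=(0.9, 0.1), f1=(0.1, 0.9),
                     in_stop_region=lambda p: p >= 0.9)
print(t_alarm)
```

The split mirrors the algorithm above: the expensive value-function computation that defines the region happens once offline, while the online loop performs only one posterior update and one membership test per sample.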
Chapter 4
Implementation of the Bayesian
DP-based HMM Quickest
Detection Algorithm for Detecting
Rare Events on a Real Data Set
In this chapter, we propose a framework for detecting disruptions in real correlated data sequences based on the HMM quickest detection algorithm described in Chapter 3. We proceed to test the framework on a real-life data set representing cell phone traffic in Rome.
4.1 Description of Data Set
A recent trend in urban planning is to view cities as cyber-physical systems that
acquire data from different sources and use it to make inferences about the states
of cities and provide services for their inhabitants. Wireless network traffic data is
available at little to no cost to city planners [7], and can be used to detect emergencies
that require a rapid response. A timely response is crucial in such an application to
avert catastrophes, and false alarms may lead to unnecessary costly measures. Thus
the problem of using network traffic to detect disruptions in a city can be suitably
formulated as a quickest detection problem. In addition, the nature of this kind of
data as one that reflects periodic human activity indicates that HMMs would provide
a more accurate description of the data than i.i.d. alternatives.
Figure 4-1: An Average Week of Network Traffic
In 2006, a collaboration between Telecom Italia (TI) and MIT's SENSEable City Laboratory allowed unprecedented access to aggregate mobile phone data from Rome. The data consists of normalized numbers of initiated calls in a series of 15-minute intervals spanning 3 months around Termini, Rome's busiest subway and railway station. The data exhibits differences between weekdays, Fridays and weekends, with the latter exhibiting lower overall values. All days, however, share a pattern of rapid increase in communication activity between 6 and 10 a.m., followed by a slight decrease and another increase after working hours. Night time exhibits the lowest level of traffic [7]. This activity is depicted in Figure 4-1.

A disruption of the periodic pattern is known to have occurred in week 8 of the interval covered, leading to a surge in communication traffic, as shown in Figure 4-2. In the following section we propose a method to detect such disruptions in correlated data using the HMM quickest detection algorithm in [6]. We test the performance of the algorithm on the Rome data assuming perfect knowledge of the business-as-usual and disruption models. This unrealistic assumption will be subsequently addressed in Chapter 5.
Figure 4-2: Three Weeks of Data from Termini with a Surge in Cellphone Network
Traffic around Termini on Tuesday of the Second Week
4.2 Proposed Setup
Consider the two-state Markov chain S = {S_t; t >= 0}, and let S_t denote the state of
Rome at time t, where S_t ∈ {0, 1} for all t. S_t = 0 indicates that at time t, the data
is in "business as usual", and S_t = 1 indicates a disruption. Assume that the initial
probability of being in state S_t = 1 is θ, and that the transition matrix is

U = [ 1 - θ_a    θ_a     ]
    [ θ_b        1 - θ_b ]
Assume without loss of generality that the chain starts in state S_1 = 0. The
time T until the chain enters state 1 for the first time is a zero-modified geometric
random variable with parameters θ and θ_a. Detecting a disruption as fast as possible
is equivalent to applying a quickest detection algorithm on real observations where
the change time is T. The model parameters θ_a, θ_b and the emission probabilities
for the pre/post-change models are obtained from training data. Emission probability
distributions describing the normalized initiated call counts for different states of the
city will be assumed Poisson. Upon detecting a disruption, the algorithm is restarted
while switching the pre-change and post-change HMMs to detect the end of the
disruption and the return of business as usual. The new change time is a zero-modified
geometric random variable with parameters θ_1 and θ_b, where θ_1 is the probability that
the first observation after declaring a disruption is a business as usual observation.
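The change time distribution implied by this setup can be written in a few lines. The following is an illustrative Python sketch (not from the thesis); the convention that T = 0 when the chain starts disrupted is an assumption for the illustration:

```python
import random

def change_time_pmf(theta, theta_a, t):
    """P(T = t) for the zero-modified geometric change time: mass `theta`
    at t = 0 (the chain starts disrupted) and a geometric tail with
    success probability `theta_a` otherwise."""
    if t == 0:
        return theta
    return (1.0 - theta) * (1.0 - theta_a) ** (t - 1) * theta_a

def simulate_change_time(theta, theta_a, rng):
    """Run the two-state chain forward until it first enters state 1."""
    if rng.random() < theta:
        return 0
    t = 1
    while rng.random() >= theta_a:  # remain in business as usual
        t += 1
    return t
```

Simulating the chain and evaluating the pmf should agree: for θ = 0 the mean change time is 1/θ_a, the usual geometric mean.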
4.3 Modeling the Data Set using HMMs
The quickest detection algorithm in [6] assumes that the HMM describing
the observation sequence is known before and after the change. Even though this
assumption may be unrealistic for the disruption state, the description of the business
as usual state can be obtained fairly accurately by fitting an HMM to long
training sequences from the past. In this section, we focus on the implementation
aspects of fitting an HMM to the business as usual Rome data obtained from Termini.
For a given number of underlying states N_s, the parameters that describe a hidden
Markov model with Poisson distributions are [17]:

* The transition probability matrix A of the underlying Markov chain, where
  a_ij = Pr[s_{t+1} = j | s_t = i] and s_t is the underlying state at time t.
* The emission probability vector B, such that B(i) is the parameter of the Poisson
  random variable describing the observations under state i of the underlying
  Markov chain.
* The vector I of initial probabilities of the underlying states.
The joint probability and defining property of an HMM sequence is:

Pr(s_1, s_2, ..., s_t, x_1, x_2, ..., x_t) = I(s_1) [∏_{k=1}^{t-1} a_{s_k s_{k+1}}] [∏_{l=1}^{t} Poiss(x_l, s_l)]

where Poiss(x_l, s_l) refers to the Poisson pmf at x_l with parameter B(s_l).
We would like to solve for the model Q = (A, B, I, N_s) that best describes a finite
training sequence of business as usual data O for a fixed number of states N_s. There
is no known way to analytically solve for the model Q which maximizes Pr(O | Q) [17].
However, starting with an initial "guess" model Q_1 = (A_1, B_1, I_1, N_s), we can choose
a model Q_2 = (A_2, B_2, I_2, N_s) that locally maximizes Pr(O | Q) using an iterative
procedure like Baum-Welch [9] (a variant of the EM algorithm [11]). In this work,
we used functions from the "mhsmm" package in the statistical computing language
R [16] to implement the Baum-Welch algorithm for Poisson observations. Note that
this approach assumes the number of states to be fixed by the user and provides
locally optimal solutions.
The choice of initial model parameters in the Baum-Welch algorithm has a significant
effect on the optimality of its final estimate. In particular, the choice of
emission probability parameters is essential for rapid and proper convergence of
the re-estimation formulas used [17]. In our implementation, we made the following
choices for the initial model:
" All underlying states are initially equally likely.
" All transitions are allowed and equally likely.
" The parameters of the different Poisson emission probabilities were obtained by
sorting the observation sequence, dividing it into N, bins of equal length, and
averaging the observations in each bin resulting in B.
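The binning initialization above can be sketched as follows. This is an illustrative Python version of the stated heuristic; the thesis itself used the "mhsmm" R package, so the function name and interface here are assumptions:

```python
def initial_poisson_params(observations, n_states):
    """Initial emission rates for Baum-Welch: sort the observations, split
    them into n_states contiguous bins of (nearly) equal size, and use the
    mean of each bin as that state's initial Poisson rate B(i)."""
    xs = sorted(observations)
    size = len(xs) // n_states
    rates = []
    for i in range(n_states):
        lo = i * size
        # the last bin absorbs any remainder
        hi = (i + 1) * size if i < n_states - 1 else len(xs)
        chunk = xs[lo:hi]
        rates.append(sum(chunk) / len(chunk))
    return rates
```

Because the bins are taken from the sorted sequence, the resulting rates are automatically ordered from the lowest-traffic state to the highest.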
To determine the number of states N_s that best describes our data sequence O,
we ran the Baum-Welch algorithm on O for an increasing number of states N_s.
Pr(O | Q(N_s)), where Q(N_s) is the locally optimal model for N_s states, was taken as
a measure of accuracy of fit.
The probability of an observation sequence O of length len under a given HMM
Q = (A, B, I) can be calculated using the Forward Algorithm [2] [1]. Consider the
forward variable defined by:

α_t(i) = Pr[O_1 O_2 ... O_t, s_t = i | Q]
We can find α_t(i) recursively, as follows:

* Initialization: α_1(i) = I(i) · Poiss(O_1, B(i)) for 1 <= i <= N_s.
* Induction: α_{t+1}(j) = (Σ_{i=1}^{N_s} α_t(i) a_ij) · Poiss(O_{t+1}, B(j)) for
  1 <= j <= N_s and 1 <= t < len.
* Termination: Pr(O | Q) = Σ_{i=1}^{N_s} α_len(i).

Figure 4-3: log Pr[O | Q(N_s)] for Increasing Number of States of Underlying Model
In practice, the forward variable "underflows" for large t, meaning that it decays
exponentially to 0, exceeding the precision range of essentially any machine [17]. A
standard procedure to tackle this issue is to scale α_t(i) by a factor independent of i:

c_t = 1 / (Σ_{i=1}^{N_s} α_t(i))
Then we can calculate the desired probability as:

log[Pr(O | Q)] = -Σ_{t=1}^{len} log(c_t)
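A minimal sketch of the scaled forward recursion for Poisson emissions, assuming the notation (A, B, I) of this section; it accumulates the log normalizers at each step so the log-likelihood is computed without underflow. This is an illustrative implementation, not the thesis's R code:

```python
import math

def poisson_logpmf(x, lam):
    # log of the Poisson pmf; lgamma(x + 1) = log(x!)
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def hmm_loglik(obs, A, lam, init):
    """Scaled forward algorithm for a Poisson-emission HMM.

    Each step divides the forward vector by its sum (the text's 1/c_t)
    and accumulates the log of that sum, returning log Pr(O | Q)."""
    n = len(init)
    # initialization: alpha_1(i) = I(i) * Poiss(O_1, B(i))
    alpha = [init[i] * math.exp(poisson_logpmf(obs[0], lam[i])) for i in range(n)]
    s = sum(alpha)
    loglik = math.log(s)
    alpha = [a / s for a in alpha]
    for x in obs[1:]:
        new = []
        for j in range(n):
            pred = sum(alpha[i] * A[i][j] for i in range(n))
            new.append(pred * math.exp(poisson_logpmf(x, lam[j])))
        s = sum(new)
        loglik += math.log(s)
        alpha = [a / s for a in new]
    return loglik
```

For a one-state model the result reduces to the sum of Poisson log pmfs, which makes a convenient sanity check.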
Figure 4-3 is a plot of log Pr(O | Q(N_s)) for N_s between 2 and 18. Note that
increasing the number of states initially results in increasing accuracy, but beyond
N_s = 6 there is no significant gain in increasing N_s.
In many real life scenarios, the training data available for HMM fitting has missing
samples, creating the need for interpolation. The data set available to us had missing
samples spanning days at a time, making interpolation an unfavorable option,
especially given the pattern of variation within each day. To preserve the periodic
pattern of the data described in [7], we chose to address the issue by replacing each
missing sample with the average of the available samples at the same day of the week
(on different weeks) and time of day as the missing sample.
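The imputation rule above can be sketched as follows, assuming (for illustration only) that the series is stored as a flat list with `None` marking missing samples:

```python
def impute_weekly(series, samples_per_week):
    """Fill missing samples (None) with the average of the available
    samples occupying the same slot in the weekly cycle, i.e. the same
    day of week and time of day on other weeks."""
    filled = list(series)
    for slot in range(samples_per_week):
        idx = range(slot, len(series), samples_per_week)
        vals = [series[i] for i in idx if series[i] is not None]
        if not vals:
            continue  # no week has this slot observed; leave the gap
        avg = sum(vals) / len(vals)
        for i in idx:
            if filled[i] is None:
                filled[i] = avg
    return filled
```

For the Rome data the weekly cycle would be 7 × 96 = 672 samples of 15 minutes each.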
4.4 Implementation of the Quickest Detection Algorithm
For a given HMM description of the data, the quickest detection algorithm consists of
two parts: one that needs to be executed only once and offline to calculate the region
Γ^{N(ε)} for the given model parameters, and one that uses Γ^{N(ε)} to detect changes in
different data sequences from the model in real time.
Despite the favorable properties of the region Γ^{N(ε)} (namely convexity and
closedness), its boundary cannot generally be expressed in closed form [6]. The offline
determination of Γ^{N(ε)} involves computing J^{N(ε)}(π) for all π ∈ P. Since P is the
2d-dimensional probability simplex, this computation is intractable. To get around
this problem, we resort to a simple form of cost-to-go function approximation, where
the optimal cost-to-go is computed only for a set of representative states.
The representative states are chosen through a discretization of the 2d-dimensional
probability simplex. The optimal cost-to-go is then calculated offline for all posterior
probability vectors in the resulting grid. However, for a posterior probability vector
π in the grid, the calculation requires knowing the value of the optimal cost-to-go
function at all states (posterior probability vectors) accessible from π. Nearest
neighbour interpolation is used to estimate the cost-to-go values corresponding to
accessible states outside the grid.
Determining Γ^{N(ε)} poses another challenge relating to the calculation of the
Q-factor at stage k+1: E_{X_{t+1}}[J_k(Π_{t+1}) | Π_t = π_t]. Even when the values of J_k(Π_{t+1})
for the different Π_{t+1}'s are known, the expectation is taken over X_{t+1}, which can take
countably infinitely many values. Conditioned on the underlying state Y_{t+1}, X_{t+1} is a Poisson
random variable with a parameter λ(y) that depends on the value y of Y_{t+1}. For that
reason, we approximate the Q-factor for a given Π_t = π_t by the weighted average of
J_k(Π_{t+1}) evaluated at a well-chosen set S_A of values of X_{t+1}. More specifically, the
approximate Q-factor is:

Q_app = [Σ_{x ∈ S_A} Pr(X_{t+1} = x | π_t) · J_k(π_{t+1})] / Pr(X_{t+1} ∈ S_A | π_t)

The set S_A is chosen to include the points where most of the Poisson pmf mass is
concentrated for parameters λ(y), y ∈ Y.
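The choice of S_A and the renormalized expectation can be sketched as follows. The 4-standard-deviation window is an illustrative choice, not the one used in the thesis:

```python
import math

def poisson_pmf(x, lam):
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def support_set(lams, k=4.0):
    """S_A: the union over the rates lam of the integer intervals
    lam +/- k*sqrt(lam) (clipped at 0), i.e. the values carrying almost
    all of each Poisson's mass."""
    points = set()
    for lam in lams:
        half = k * math.sqrt(lam)
        points.update(range(max(0, int(lam - half)), int(lam + half) + 1))
    return sorted(points)

def truncated_expectation(f, lam, points):
    """E[f(X)] for X ~ Poisson(lam) restricted to S_A and renormalized
    by the covered mass, mirroring the approximate Q-factor above."""
    mass = sum(poisson_pmf(x, lam) for x in points)
    return sum(poisson_pmf(x, lam) * f(x) for x in points) / mass
```

Renormalizing by the covered mass means constants are preserved exactly, while moments are recovered up to the (negligible) truncated tail.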
4.5 Simulation Results
In this section, we show through simulation that the algorithm outlined in [6] can
indeed be utilized to detect the disruption shown in figure 4-2, despite the several
approximations taken to make it practical to run. However, the computational complexity
of the algorithm renders it extremely impractical, especially for rich data sets whose
description requires a high number of Markov states.
In section 4.3, we showed that the number of underlying states required to model
the data set with "reasonable" accuracy is N_s = 6. However, the discretization step
described in section 4.4 results in computational complexity exponential in the number
of states N_s. For simplicity of computations, we start by modeling the data with a three
state HMM, hoping that the model captures enough of the properties of the business
as usual and disruption states to detect the disruption. The number of underlying
states is then increased if needed in order to achieve better detection.
A locally optimal three state HMM for the Termini data is shown in table 4.1.
The initial model for the HMM fitting was obtained as per the guidelines in section
4.3. We run the change detection algorithm at the beginning of the disruption week.
Model Parameter     | Business As Usual Model  | Disruption Model
Initial Probability | [1 0 0]                  | [1 0 0]
Transition Matrix   | [0.9851 0.0149 0     ]   | [0.9522 0.0478 0.0000]
                    | [0.0167 0.9737 0.0096]   | [0.0905 0.8000 0.1095]
                    | [0      0.0089 0.9911]   | [0.0000 0.1191 0.8809]
Emission λ's        | [3 27 107]               | [197 232 249]

Table 4.1: Model Parameters of Business As Usual and Disruption States for N_s = 3
The disruption week consists of 2016 samples, where the disruption starts on Tuesday
around sample 406 (with a value of 181).
False Alarm Cost | Horizon N(ε) | Detection Sample
0.5              | 31           | 404
1                | 301          | 405
10               | 3034         | 405
20               | 6143         | 405

Table 4.2: Detection Time and Horizon for Different False Alarm Costs when N_s = 3
With a discretization step size of 0.25, sample cost of 1, error margin ε = 3 (equivalent
to detecting on average three samples after a disruption has started, if no false
alarms are allowed), and a prior θ = 1/300 (equivalently expressed as an expected time
till disruption of 300 samples starting from business as usual), we obtain the
detection times in table 4.2 for different false alarm costs.
Model Parameter     | Business As Usual Model         | Disruption Model
Initial Probability | [1 0 0 0]                       | [1 0 0 0]
Transition Matrix   | [0.9869 0.0131 0      0     ]   | [0.0727 0.9273 0.0000 0.0000]
                    | [0.0153 0.9683 0.0164 0     ]   | [0.0000 0.9502 0.0498 0.0000]
                    | [0      0.0279 0.9510 0.0210]   | [0.0000 0.0907 0.7992 0.1101]
                    | [0      0      0.0127 0.9873]   | [0      0.0000 0.1217 0.8783]
Emission λ's        | [2 19 58 117]                   | [181 198 232 250]

Table 4.3: Model Parameters for Business As Usual and Disruption for N_s = 4

The number of states of the underlying model was then increased to N_s = 4,
keeping all other parameters constant, resulting in the model in table 4.3 and the
detection times in table 4.4.

False Alarm Cost | Horizon N(ε) | Detection Sample
0.5              | 134          | 404
1                | 267          | 406
10               | 2700         | 406
20               | 5467         | 406

Table 4.4: Detection Time and Horizon for Different False Alarm Costs when N_s = 4
For N_s = 3, we observe that a false alarm occurred for all values of C_F/C_D chosen,
resulting in a cost of 20. The value observed at sample 405 is 171, which is generally
assumed to be a business as usual value for weekday peak time traffic. Loosely
speaking, however, this value is at the boundary of allowable business as usual
observations, since the highest λ in the business as usual state for N_s = 3 is 107,
while the lowest λ in the disruption state is 197. It is then understandable that for
N_s = 3 and for a cost of false alarm in the range we chose, this boundary case is
resolved by declaring a disruption. Note that running the algorithm for C_F/C_D high
enough to result in correct detection was practically infeasible on a regular machine.
Coincidentally, for this particular case, the false-alarm-causing value (171) was close
enough to the actual disruption that the behavior would be acceptable in real applications.
Increasing N_s to 4 helped resolve the false alarm issue discussed above for a
computationally practical cost C_F/C_D. We also notice that the detection accuracy was
almost independent of the false alarm cost when the cost of false alarm is greater
than the sample cost. This behavior is not typical of the quickest detection algorithm,
which is generally expected to exhibit variation in performance with varying
cost criteria. We hypothesize that this behavior was observed due to a much higher
likelihood of the disruption observations under the disruption model than under the
business as usual model, where "much higher" is measured relative to the difference
between the sample and false alarm costs.
The fact that we were able to detect the change with N_s = 4 indicates that the
difference between business as usual and disruption data is significant enough that
even a crude model suffices to capture it.
The computational complexity of the offline procedure with the suggested
approximations is on the order of

(N(ε) + N_p) · (1/d_step + 1)^{2d}

where d_step is the probability grid step size, and N_p is the number of Poisson samples
used to approximate the Q-factor.
Chapter 5
Non-Bayesian CUSUM Method for
HMM Quickest Detection
A different approach to HMM quickest detection was proposed by Chen and Willett
[5]. Their approach is non-Bayesian and mimics the CUSUM test for i.i.d. observations.
In this chapter, we present the algorithm suggested in [5] and compare it
with the previously discussed approach in [6], especially from the point of view of
feasibility for real data sets. We then shed light on the robustness of the algorithm
when the disruption model is not entirely known.
5.1 Sequential Probability Ratio Test (SPRT)
The sequential probability ratio test (SPRT) was devised by Wald [19] to solve the
following sequential hypothesis testing problem. Suppose that X_1, X_2, ..., X_n is a
sequence of observations arriving continually (i.e., n is not a fixed number), and
suppose that H (K) is the hypothesis that the observations are i.i.d. according to
distribution P_H (P_K). The purpose is to identify the correct hypothesis as
soon as possible, while minimizing decision errors.
For a given n, the likelihood ratio is

L(n) = P_K(X_1, ..., X_n) / P_H(X_1, ..., X_n)
     = [P_K(X_1) / P_H(X_1)] · ∏_{i=2}^{n} [P_K(X_i | X_{i-1}, ..., X_1) / P_H(X_i | X_{i-1}, ..., X_1)]
The SPRT defines the following stopping rule:

N* = min{n >= 1 : L(n) >= A or L(n) <= B}

The decision made is H if L(N*) <= B, and K otherwise. The thresholds A and B are
chosen to meet a set of performance criteria. The performance measures include, as
in the traditional hypothesis testing problem, the probability of error (P_e) and the
rate of false alarm (P_f):

P_e = P_K(L(N*) <= B)
P_f = P_H(L(N*) >= A)
The design parameters A and B can be determined in terms of the target P_e and
P_f using Wald's approximation, giving:

A ≈ (1 - P_e) / P_f    (5.1)
B ≈ P_e / (1 - P_f)    (5.2)
In addition to minimizing P_e and P_f, the sequential nature of the problem poses
two more performance criteria: the average run length (ARL) under H and under K.
The SPRT is optimal in the sense that it minimizes the ARL under both H and K for
fixed P_e and P_f.
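Wald's thresholds and the stopping rule can be sketched as follows. This is an illustrative Python sketch, working with log likelihood ratios to avoid overflow; the increment stream is left abstract:

```python
import math

def wald_thresholds(pe, pf):
    """Wald's approximation: A = (1 - Pe)/Pf, B = Pe/(1 - Pf)."""
    return (1.0 - pe) / pf, pe / (1.0 - pf)

def sprt(increments, A, B):
    """Wald's SPRT on a stream of log likelihood ratio increments
    ln(p_K(x)/p_H(x)). Returns ('K' or 'H', samples used), or
    ('undecided', n) if the stream ends before a boundary is crossed."""
    log_a, log_b = math.log(A), math.log(B)
    s, n = 0.0, 0
    for inc in increments:
        n += 1
        s += inc
        if s >= log_a:
            return 'K', n
        if s <= log_b:
            return 'H', n
    return 'undecided', n
```

With target errors P_e = P_f = 0.01, the approximation gives A = 99 and B ≈ 0.0101, so the test stops once the cumulative log ratio leaves roughly (-4.6, 4.6).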
5.2 Cumulative Sum (CUSUM) Procedure
In section 2.1.1, we mentioned that the optimal stopping time for the minimax formulation of the i.i.d. non-Bayesian quickest detection problem is given by Page's
decision rule (2.2).
For pre-change and post-change probability measures fH and fK, Page's test can
be easily reformulated in terms of the following recursion, known as the CUSUM test
[13]:
N* = min{n >= 1 : S_n >= h}

where

S_n = max{0, S_{n-1} + g(X_n)}

and

g(X_n) = ln(f_K(X_n) / f_H(X_n))
h is a parameter that can be chosen according to the desired tradeoff between false
alarms and detection delay.
Essentially, the CUSUM test is a series of SPRTs on the two hypotheses H (observations
are due to f_H) and K (observations are due to f_K), where the lower limit A is
0 and the upper limit B is h. Every time the SPRT declares hypothesis H as its decision,
we can assume that the change has not occurred yet, and the SPRT is restarted.
This pattern continues until the first time that the SPRT detects hypothesis K (hence
crossing the threshold h for the first time).
The main requirement for CUSUM-like procedures to work is the "antipodality"
condition:

E[g(X_n) | H] < 0
E[g(X_n) | K] > 0

For g(X_n) = ln(f_K(X_n)/f_H(X_n)), these conditions follow from the non-negativity
of the conditional KL divergence.
The only performance measures for CUSUM are the ARL under H and K, since
detection is guaranteed to happen for all "closed" tests (Pr[N* < ∞] = 1). CUSUM
exhibits minimax optimality in the sense that, for a given constraint on the time
between false alarms, it minimizes the worst case delay to detection.
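Page's recursion can be sketched in a few lines. This is an illustrative Python sketch; the Gaussian mean-shift log ratio g(x) = x - 1/2 used in the example (for f_H = N(0,1), f_K = N(1,1)) is only one convenient choice of densities:

```python
def cusum(xs, log_ratio, h):
    """Page's CUSUM: S_n = max(0, S_{n-1} + g(X_n)) with
    g(x) = ln(f_K(x)/f_H(x)); alarm at the first n with S_n >= h.
    Returns the 1-based alarm index, or None if no alarm is raised."""
    s = 0.0
    for n, x in enumerate(xs, start=1):
        s = max(0.0, s + log_ratio(x))
        if s >= h:
            return n
    return None
```

The clamp at zero is exactly the implicit restart of the embedded SPRT: negative evidence is discarded rather than accumulated.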
5.3 HMM CUSUM-Like Procedure for Quickest Detection
Designing CUSUM procedures involves finding a relevant function g(X_n) that satisfies
Page's recursion and the antipodality condition. [5] proposes using the scaled forward
variable described in section 4.3 to define such a function for quickest detection on
HMM observations. With k denoting the first sample after the last SPRT restart,

g(n; k)(X_n) = ln( f_K(X_n | X_{n-1}, ..., X_k) / f_H(X_n | X_{n-1}, ..., X_k) )
The CUSUM recursion is then

S_n = max{0, S_{n-1} + g(n; k)}

To calculate f_K(X_n | X_{n-1}, ..., X_k) and f_H(X_n | X_{n-1}, ..., X_k), the scaled forward
variable α̂_t(i) is used:

f_H(X_t | X_{t-1}, ..., X_1) = Σ_{i=1}^{N} α̂_{H,t}(i)
f_K(X_t | X_{t-1}, ..., X_1) = Σ_{i=1}^{N} α̂_{K,t}(i)
The antipodality property for this procedure is proven in [5] and is again directly
related to the non-negativity of the conditional KL divergence. It is also shown that the
average detection delay is linear in h, whereas the average delay between false alarms
is exponential in h, just like the behavior exhibited by the i.i.d. CUSUM. This
"log-linear" behavior is the main reason why CUSUM-like algorithms work.
More specifically, the average detection delay (D) and time till false alarm (T) are
approximately given by:

D ≈ h / D(f_K ∥ f_H)
T ≈ e^h / B̄

where B̄ is the expected value of a single SPRT test statistic given that a correct
decision for H has occurred; B̄ can be obtained through simulation.
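A sketch of the HMM-CUSUM combining the pieces above: two scaled forward filters (one per model) supply the predictive densities, and both are restarted whenever the embedded SPRT decides H. This is an illustrative Python implementation; the one-state Poisson models in the example are toys, not the Termini fit:

```python
import math

def poisson_logpmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)

class ForwardFilter:
    """Scaled forward recursion for a Poisson HMM; step() returns the
    one-step predictive density f(X_t | X_{t-1}, ..., X_k) under the
    model, i.e. the sum of the unscaled forward vector."""
    def __init__(self, A, lam, init):
        self.A, self.lam, self.init = A, lam, init
        self.alpha = None
    def step(self, x):
        n = len(self.lam)
        if self.alpha is None:  # first sample of this SPRT cycle
            new = [self.init[j] * math.exp(poisson_logpmf(x, self.lam[j]))
                   for j in range(n)]
        else:
            new = [sum(self.alpha[i] * self.A[i][j] for i in range(n))
                   * math.exp(poisson_logpmf(x, self.lam[j]))
                   for j in range(n)]
        s = sum(new)
        self.alpha = [a / s for a in new]  # rescale to avoid underflow
        return s

def hmm_cusum(xs, model_h, model_k, h):
    """CUSUM-like test on HMM observations: accumulate the log ratio of
    the two predictive densities; restart both filters when S_n hits 0
    (the SPRT decided H); alarm at the first n with S_n >= h."""
    s = 0.0
    f_h, f_k = ForwardFilter(*model_h), ForwardFilter(*model_k)
    for n, x in enumerate(xs, start=1):
        s += math.log(f_k.step(x)) - math.log(f_h.step(x))
        if s >= h:
            return n
        if s <= 0.0:
            s = 0.0
            f_h, f_k = ForwardFilter(*model_h), ForwardFilter(*model_k)
    return None
```

Note that each restart discards the conditioning on past samples, which is what makes the per-cycle statistic an SPRT on the post-restart data.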
5.4 Simulation Results for Real Data Set
In this section, we discuss the results of running the HMM-CUSUM algorithm on the
data from Termini. The algorithm does not require any priors to be given, except for
the HMMs before and after disruption and a threshold h. The model parameters used
correspond to N_s = 6 and are given in the Appendix. The threshold h determines the
desired tradeoff between the frequency of false alarms and the average detection delay.
A higher h results in an exponentially lower frequency of false alarms, and a linear
increase in detection delay.
Figures 5-1, 5-2 and 5-3 show the alarm times obtained from running HMM-CUSUM
on Termini data for different values of h. In figure 5-3, we observe that
small h (h < 1.5) leads to false alarms. The time between false alarms, however,
increases exponentially as h increases (figure 5-1), and for 1.5 < h < 1100, HMM-CUSUM
consistently detected the disruption but with linearly increasing delay (figure
5-2). Finally, beyond h = 1100, the alarm time went to infinity (figure 5-1), meaning
that the disruption went undetected.
This behavior can be explained by plotting the trajectory of the cumulative sum S_n
for the Termini data under different values of h from the intervals outlined
above. Figure 5-4 shows the behavior of the cumulative sum in the "detection region".
Initially, the data showed an overwhelming likelihood of being "business as usual",
which led to restarting the SPRT with the negative S_n reset to 0. As n approached
Monday's peak traffic time, the likelihood of disruption increased (since high traffic
Figure 5-1: Detection Time Versus h for 1 Week Termini Data Fit with a 6-state
HMM: h ∈ (1, 1200)
is characteristic of the disruption), leading to a positive drift in the direction of h, but
the increase was not enough to cross the threshold and raise an alarm. Eventually,
Tuesday's surge in traffic raised the likelihood of disruption to the point of crossing
the threshold, leading to detection. Figure 5-5 shows the behavior of S_n when h is
in the false alarm region. In this case, the positive drift caused by Monday's peak
traffic was enough to cross the low threshold and cause a false alarm. Finally, when
h is too high, the drift caused by the 45 disruption samples is not enough for S_n to cross
the threshold, and the subsequent business as usual samples only serve to push the
drift away from the threshold towards 0, as shown in figure 5-6.
Figure 5-7 shows an approximate average alarm time for data sampled from the
Termini disruption model (Appendix) under different values of h. The average alarm
time for each h was obtained through Monte Carlo simulations. To guarantee detection
on average, the value of h required is greater than 7.5, much higher than the
corresponding value (> 1.5) required for the specific Termini sequence. One explanation
for that is that the Termini sequence was used to define the business as usual and
disruption states, and is thus a "perfect fit" to the model and decision rule, clearly
depicting the variation between business as usual and disruption. This behavior is
not necessarily typical of all sequences from the model.

Figure 5-2: Detection Time Versus h for 1 Week Termini Data Fit with a 6-state
HMM: h ∈ (10, 1000)
Figure 5-3: Detection Time Versus h for 1 Week Termini Data Fit with a 6-state
HMM: h ∈ (0, 1.6)

Figure 5-4: HMM-CUSUM as a Repeated SPRT: No False Alarm Case (n ≈ 150
corresponds to Monday's peak; n ≈ 406 corresponds to the disruptive peak on Tuesday)

Figure 5-5: HMM-CUSUM as a Repeated SPRT: Case of False Alarm around n = 150
(n ≈ 150 is the Monday peak; n ≈ 406 is the disruptive peak on Tuesday)

Figure 5-6: HMM-CUSUM as a Repeated SPRT: Case of No Detection Caused by
Too High a Threshold

Figure 5-7: Average Alarm Time Versus h for Locally Optimal 6-State Model
Describing Rome Data

In the context of disruption detection for real data, the HMM-CUSUM algorithm
exhibits several advantages compared to the DP-based Bayesian framework suggested
in [6]:

* It gives instant "real time" results without the need to run lengthy computations
  offline.
* Its implementation was fairly simple and did not involve some of the issues
  encountered with the DP-based algorithm (infinite state space, infinite horizon,
  and a cost-to-go function and Q-factor without a closed form).
* Increasing accuracy in HMM-CUSUM did not add any computational complexity
  and was achieved simply by increasing the threshold h, unlike in the Bayesian
  framework, where increasing C_F/C_D resulted in a longer horizon and more
  computations.
* The CUSUM procedure scales well with an increasing number of underlying
  model states. The DP-based algorithm has exponential complexity in the number
  of states under the proposed discretization scheme.
* HMM-CUSUM does not require a look-up table (memory) for its implementation,
  as does the DP-based algorithm (for storing the decision region).
The Bayesian approach in turn has the following advantages when compared to
CUSUM:

* The Bayesian approach has rigorous theoretical guarantees on its performance,
  whereas HMM-CUSUM's performance guarantees often include approximations.
* The user is given more parameters to control the tradeoff between false alarm
  and detection delay. In addition to C_F and C_D, the user can control the prior
  belief on the time until disruption and the probability of starting in the
  disruption state.
* The costs C_F and C_D are intuitively meaningful, unlike the threshold h, which
  can represent vastly different tradeoffs for different problems.
5.5 Robustness Results
In this section we examine, through simulation, the robustness of the HMM-CUSUM
procedure when the disruption model is not entirely known.
Figure 5-8: Average Alarm Time Versus h for Different Models in the Disruption
Class, h ∈ [0.01, 12]. (* = model 1) (o = model 2) (- = model 3)
First, we study the effect of incorrectly assuming that the disruption model is
the one obtained from the Termini data (Appendix) when the actual disruption model
is as described in table 5.1. The actual models have the same underlying transition
matrix and initial state distribution as the assumed model, with progressively smaller
emission λ's. Figure 5-8 shows the average alarm time for data drawn from those
different disruption models under different values of h.

Model | Emission λ's
1     | [161 166 177 194 213 228]
2     | [141 146 157 174 193 208]
3     | [121 126 137 154 173 188]

Table 5.1: Actual Disruption Emission λ's (Transition Matrix and Initial Probability
Same as Those from Termini)
If the actual disruption model is in fact the one used to design the HMM-CUSUM
procedure (Appendix), an optimal threshold of interest is h = 7.5. On average, the
interval h >= 7.5 guarantees negligible false alarm rates, and since detection delay
increases with the threshold, the lowest h in the interval would in addition guarantee
minimal detection delay. Hereafter, we focus our analysis on the threshold h = 7.5.
Figure 5-8 shows that the detection delay for h = 7.5 under the different disruption
models is on average higher (less optimal) than the one obtained when the actual
and assumed models were identical (figure 5-7). This phenomenon has a simple
explanation: when the actual λ's of the disruption model are smaller than the assumed
ones, it takes more samples for S_n to accumulate the magnitude of positive drift that
would cross a threshold designed for higher λ's. This leads to delayed detections on
average.
Second, we attempt to find a robust HMM-CUSUM procedure that would guarantee
a desired performance when the disruption is known to belong to a class C of
|C| models. The desired performance we focus on is again minimizing detection delay
subject to a negligible false alarm rate. The components defining the robust HMM
procedure are the threshold h and the assumed disruption model.
We suggest the following procedure:

* For each model C_i in the class C, plot the average alarm time when the
  HMM-CUSUM is designed for disruptions from C_i and the actual disruptions
  are drawn from C_j, with j = 1, ..., |C|.
* From each C_i plot obtained, find the lowest threshold h_i that guarantees a
  negligible false alarm rate (average alarm time > 406) for all C_j. In addition,
  find the maximum average delay D_i over all C_j at the chosen h_i.
* Choose the pairing of C_i and h_i that minimizes the maximum average delay D_i
  found in the previous step.

Assumed Model     | Threshold For Negligible False Alarm | Worst Average Alarm Time At Threshold | Worst Delay Model
Termini           | 9.6                                  | 437                                   | Perturbed Model 3
Perturbed Model 1 | 9.6                                  | 427                                   | Perturbed Model 3
Perturbed Model 2 | 10                                   | 427                                   | Perturbed Model 3
Perturbed Model 3 | 6.4                                  | 43                                    | Perturbed Model 1

Table 5.2: Design Threshold, Worst Case Average Alarm Time, and Worst Delay
Model for Each Assumed Disruption Model
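The selection in the last two steps can be sketched as a routine over a delay table. All numbers in the example below are hypothetical, merely mirroring the structure of table 5.2:

```python
def robust_design(delays, thresholds):
    """Minimax selection: delays[i][j] is the average detection delay when
    the CUSUM is designed for model i (at its lowest safe threshold
    thresholds[i]) and the data are drawn from model j. Pick the design
    minimizing the worst case delay, breaking ties by lower threshold."""
    best = None
    for i, row in enumerate(delays):
        worst = max(row)
        if (best is None or worst < best[2]
                or (worst == best[2] and thresholds[i] < best[1])):
            best = (i, thresholds[i], worst)
    return best
```

The tie-break by lower threshold matches the rule used below to choose between perturbed models 1 and 2.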
This procedure outlines a practical method to find an approximate minimax robust
HMM-CUSUM for a desired performance tradeoff, when the assumed model is
constrained to belong to C. For the class C consisting of the models in table 5.1 and
the original model obtained from Termini, we obtain the plots in figures 5-9, 5-10,
5-11, and 5-12. In figure 5-9, we observe that when the HMM-CUSUM is designed
for the Termini disruption, the desired false alarm performance is guaranteed for all
models in C when h = 9.6. For that threshold, the worst average detection delay
happens for data drawn from perturbed model 3 (table 5.1) and has a value of 31 samples
(worst case average alarm time = 437). Table 5.2 summarizes the results observed for
all models in the class. Based on the simulation results, the design model of choice
that guarantees minimax robustness for the desired performance tradeoff is perturbed
model 1 or 2, with a worst case detection delay of 21. The tie can be resolved by taking
the one that has the lower threshold, perturbed model 1. Formalizing this procedure is
left for future work.
Figure 5-9: Performance of HMM-CUSUM when the Assumed Disruption Model is
that Obtained from the Termini Data and the Actual Disruptions are Drawn from
each Model in C, h ∈ (0.01, 12) (- = Model from Termini) (* = Perturbed Model 1)
(o = Perturbed Model 2) (- = Perturbed Model 3)
Figure 5-10: Performance of HMM-CUSUM when the Assumed Disruption Model is
"Perturbed Model 1" and the Actual Disruptions are Drawn from each Model in C,
h ∈ (2, 11) (- = Model from Termini) (* = Perturbed Model 1) (o = Perturbed
Model 2) (- = Perturbed Model 3)
Figure 5-11: Performance of HMM-CUSUM when the Assumed Disruption Model is
"Perturbed Model 2" and the Actual Disruptions are Drawn from each Model in C,
h ∈ (2, 11) (- = Model from Termini) (* = Perturbed Model 1) (o = Perturbed
Model 2) (- = Perturbed Model 3)
Figure 5-12: Performance of HMM-CUSUM when the Assumed Disruption Model is
"Perturbed Model 3" and the Actual Disruptions are Drawn from each Model in C,
h ∈ (2, 11) (- = Model from Termini) (* = Perturbed Model 1) (o = Perturbed
Model 2) (- = Perturbed Model 3)
Chapter 6
Conclusion
In this thesis, we assessed the feasibility of two HMM quickest detection procedures
for detecting disruptions in a real data sequence.
For the Bayesian procedure described in [6], we suggested approximations for
finding the optimal region described by an infinite horizon dynamic program with
a continuous state space. The approximations included discretizing the continuous
state space and the countably infinite observation space, resulting in approximate
cost-to-go function and Q-factor values for a grid of probability vectors. The suggested approximations do not scale well with increasing state-space dimension, which
led us to experiment with running the algorithm on a three and four state HMM
representation of the Rome data (which requires at least six states for its accurate
representation). The algorithm successfully detected the disruption in the data for
relatively low costs only when the number of HMM states was four. Even though the
algorithm succeeded at detecting the disruption, our assessment was that the method
is impractical for rich real data sets under the approximation methods we chose.
We proceeded to assess the feasibility of the non-Bayesian CUSUM-like procedure
suggested in [5]. The procedure had the advantage of being very simple to implement,
and of providing real-time detection without the use of time consuming offline
calculations or memory intensive look-up tables. It scales well with the number of states
of the underlying HMM models, which allowed us to use a 6-state representation
of the Rome data and find the alarm time for even the most extreme tradeoffs be-
tween false alarm frequency and detection delay. While the performance guarantees
on HMM-CUSUM often involve approximations, the method is more suited for real
data due to the advantages just outlined.
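The mechanism behind such a CUSUM-like test can be sketched compactly. The version below is a schematic variant (all names are ours, and it assumes Poisson emissions, matching the count-valued data): it runs one normalized forward filter per model, accumulates the log-likelihood ratio of their one-step predictive likelihoods, reflects the sum at zero, and alarms at the threshold h. It illustrates the recursion rather than reproducing the exact statistic of [5].

```python
import math

def poisson_pmf(lam, k):
    # Poisson emission probability for a count observation k
    return math.exp(-lam) * lam ** k / math.factorial(k)

def filter_step(alpha, A, lams, y):
    """One normalized HMM forward-filter step: propagate the state posterior
    alpha through transition matrix A, weight by the emission likelihoods of y,
    and return the new posterior plus the one-step predictive likelihood."""
    n = len(alpha)
    pred = [sum(alpha[i] * A[i][j] for i in range(n)) for j in range(n)]
    joint = [pred[j] * poisson_pmf(lams[j], y) for j in range(n)]
    like = sum(joint)
    return [p / like for p in joint], like

def hmm_cusum(ys, model0, model1, h):
    """CUSUM-like detector: accumulate the log-likelihood ratio of the
    disruption filter (model1) against business-as-usual (model0),
    reflect at zero, and alarm when the statistic first exceeds h."""
    (A0, lam0, pi0), (A1, lam1, pi1) = model0, model1
    a0, a1, S = list(pi0), list(pi1), 0.0
    for t, y in enumerate(ys):
        a0, L0 = filter_step(a0, A0, lam0, y)
        a1, L1 = filter_step(a1, A1, lam1, y)
        S = max(0.0, S + math.log(L1 / L0))
        if S > h:
            return t  # alarm time
    return None  # no alarm on this record

# Toy example: a low-rate regime switches to a high-rate regime at t = 20
A = [[0.95, 0.05], [0.05, 0.95]]
m0 = (A, [1.0, 2.0], [0.5, 0.5])    # business-as-usual emission rates
m1 = (A, [10.0, 12.0], [0.5, 0.5])  # assumed disruption emission rates
ys = [1] * 20 + [10] * 20
alarm = hmm_cusum(ys, m0, m1, h=5.0)
```

Note the per-step cost is one filter update per model, with no offline tables, which is the scalability property discussed above.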
Finally, we examined the robustness of HMM-CUSUM when the disruption model is
not exactly known but belongs to a known class of HMMs. When the actual model
generating the disruption data is 'nearer' to the business-as-usual model than the
assumed disruption model is, simulations show a noticeable performance drop for a
fixed threshold. We then suggested an experimental approximate method for designing
a CUSUM procedure that guarantees some sense of minimax robustness when the
disruption belongs to a known class of models.
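To make the minimax idea concrete, one generic heuristic (not the specific procedure developed in this thesis) is to run one reflected CUSUM statistic per candidate disruption model and alarm only when the worst-case, i.e. smallest, statistic crosses the threshold. In the HMM setting the per-sample log-likelihood-ratio increments would come from forward filters; the sketch below takes them as given inputs.

```python
def robust_cusum(llr_streams, h):
    """Worst-case CUSUM over a class of candidate disruption models:
    llr_streams[m][t] is the log-likelihood-ratio increment of candidate
    model m at time t. Each statistic is reflected at zero; the alarm
    fires at the first time every candidate's statistic exceeds h."""
    S = [0.0] * len(llr_streams)
    for t, incs in enumerate(zip(*llr_streams)):
        S = [max(0.0, s + l) for s, l in zip(S, incs)]
        if min(S) > h:
            return t  # alarm time
    return None

# Two candidates: negative drift before the change at t = 5, positive after;
# the weaker candidate (post-change slope 1) gates the alarm.
streams = [[-1.0] * 5 + [2.0] * 10,
           [-1.0] * 5 + [1.0] * 10]
alarm = robust_cusum(streams, h=3.0)
```

By construction this trades detection delay for a guarantee that the alarm is supported by every model in the class, which is one conservative reading of minimax robustness.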
Future work on this topic can focus on:
• Finding optimal solutions for robust HMM quickest detection in the minimax
sense, extending the i.i.d. results in [8].
• Finding rigorous notions of distance for HMMs compatible with the quickest
detection problem, especially when the initial state distributions of the
pre-change and post-change sequences are not the steady-state distributions of
the corresponding HMMs.
• Finding more efficient approximations for the optimal region described in [6],
exploiting its characteristics (convexity, closedness, etc.) to increase the
range of computationally feasible dimensionality.
Chapter 7
Appendix
Parameter            Business As Usual Model
Initial Probability  [1 0 0 0 0 0]
Transition Matrix    [0.9857 0.0143 0.0000 0      0      0     ]
                     [0.0259 0.9383 0.0357 0.0000 0      0     ]
                     [0.0000 0.0331 0.9415 0.0254 0      0     ]
                     [0      0.0000 0.0376 0.9381 0.0243 0     ]
                     [0      0      0.0000 0.0158 0.9565 0.0276]
                     [0      0      0      0.0000 0.0334 0.9666]
Emission λ's         [1 10 24 50 94 133]
Table 7.1: Model Parameters for Business As Usual for N_s = 6
Parameter            Disruption Model
Initial Probability  [1 0 0 0 0 0]
Transition Matrix    [0      0      0      0      0.0900 0.9100]
                     [0      0      0      1.0000 0      0     ]
                     [0      0      1.0000 0      0      0     ]
                     [0      0      0.2423 0.5102 0.2475 0     ]
                     [0      0      0      0.1201 0.8799 0     ]
                     [1.0000 0      0      0      0      0     ]
Emission λ's         [181 186 197 214 233 248]
Table 7.2: Model Parameters for Disruption for N_s = 6
Bibliography
[1] L. E. Baum and J. A. Eagon. An inequality with applications to statistical
estimation for probabilistic functions of a Markov process and to a model for
ecology. Bull. Amer. Math. Soc., 73:360-363, 1967.
[2] L. E. Baum and G. R. Sell. Growth functions for transformations on manifolds.
Pac. J. Math, 27(2), 1968.
[3] D.P. Bertsekas. Dynamic Programming and Optimal Control, volume 1, chapter 8.
Athena Scientific, third edition, 2007.
[4] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming, chapter 1.
Athena Scientific, 2007.
[5] B. Chen and P. Willett. Detection of hidden Markov model transient signals.
IEEE Transactions on Aerospace and Electronic Systems, 36(4):1253-1268,
October 2000.
[6] S. Dayanik and C. Goulding. Sequential detection and identification of a change
in the distribution of a Markov-modulated random sequence. IEEE Transactions
on Information Theory, 55(7), July 2009.
[7] J. Reades, F. Calabrese, A. Sevtsuk, and C. Ratti. Cellular census: Explorations
in urban data collection. IEEE Pervasive Computing, 6(3), 2007.
[8] J. Unnikrishnan, V. V. Veeravalli, and S. Meyn. Minimax robust quickest change
detection. IEEE Transactions on Information Theory, 2, June 2010. Draft.
Submitted Nov 2009, revised May 2010.
[9] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique
occurring in the statistical analysis of probabilistic functions of Markov
chains. Annals of Mathematical Statistics, 41:164-171, 1970.
[10] G. Lorden. Procedures for reacting to a change in distribution. Annals of
Mathematical Statistics, 42(6):1897-1908, 1971.
[11] T. Moon. The expectation maximization algorithm. IEEE Signal Processing
Magazine, 13:47-60, 1996.
[12] G.V. Moustakides. Optimal stopping times for detecting changes in distributions.
Annals of Statistics, 14(4):1379-1387, December 1986.
[13] E.S. Page. Continuous inspection schemes. Biometrika, 41:100-115, 1954.
[14] M. Pollak. Optimal detection of a change in distribution. Annals of Statistics,
13(1):206-227, 1985.
[15] H.V. Poor and O. Hadjiliadis. Quickest Detection. Cambridge University Press,
first edition, 2009.
[16] R Development Core Team. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010.
ISBN 3-900051-07-0.
[17] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2), February 1989.
[18] A.N. Shiryaev. On optimum methods in quickest detection problems. Theory of
Prob. and App., 8:22-46, 1963.
[19] A. Wald. Sequential Analysis. New York: Wiley, 1947.