Abstract: The last three decades have seen major strides in statistical methodologies.
That, combined with the exploitation of increased computing power and interfaces of
statistics with machine learning and artificial intelligence, has enabled the consideration
of many problems that were typically ignored for a variety of reasons such as too many
parameters, small samples, non-normality, identifiability issues, etc. While many fields
have taken advantage of that progress, some traditional fields of applied probability like
queueing lag behind terribly. We illustrate the opportunities this progress offers through some
examples of our attempts to bring more statistical methodology into our applied probability
work at AT&T, and conclude with some personal thoughts on the challenges ahead.
1. Introduction
Even a cursory reading of Professor DasGupta’s page [1] in the recent IMS Bulletin
will show how the landscape of statistical inference has changed in the last thirty years.
He notes that many newer topics like the bootstrap, EM, MCMC, function
estimation, and extreme values have replaced many classical parametric and nonparametric methods, so much so that what is now considered a set of core courses in
statistical inference has but a small intersection with what was considered core in the 1970s
and earlier.
In addition to the new developments in Mathematical Statistics, much concurrent
development has occurred in the areas of data mining and machine learning. These,
combined with exponentially growing computing power, have allowed statistics to make
major leaps into problem areas previously considered intractable. Many areas like biostatistics, computational biology, genetics, and mathematical finance have incorporated
many of these advances. Yet some traditional areas like queueing have still to take
advantage of them. In the context of stochastic processes, even where statistical
methods are incorporated systematically, the real successes are primarily
in time series analysis.
Statistical problems arising in an area like queueing models typically tend to be harder
due to the inherent non-linearity of performance measures of interest as functions of input
distributions and parameters, and these difficulties are further exacerbated by the lack of
explicit formulas.
Yet on the positive side, in many application areas like
telecommunications, computer performance and finance, data is available in plenty and
the ability to collect, store and manipulate data has increased enormously. The
availability of plentiful data also allows one to validate methods using the more stringent
way of using part of the data as a learning set and the rest as a test set. That gives much
opportunity for the researcher and the applied probabilist to test their models and
algorithms and to refine them in some realistic scenarios. For the academic in
particular, the large data sets available over the World Wide Web provide an ample
testing ground for new ideas and methods, so that validation, the final step of the
research cycle, need no longer be relegated to industry.
Our goal is to illustrate the opportunities available through a simple set of examples
taken from our recent work at AT&T. We shall also share some of our personal thoughts
on several research problems suggested by our work that may enable fields like queueing
to incorporate the newer ideas and to become more useful, and for statistics to permeate
even more deeply into the domain of stochastic processes.
2. Poisson Regression for the Non-stationary Poisson Process
Wi-Fi hotspots in public venues like coffee shops have become an important aspect of
wireless and mobile communications, and there is interest in modeling various aspects of
their performance. We have provided a detailed analysis in the Proceedings of the 2011
IEEE INFOCOM conference [3], and here we highlight only the way we modeled the
arrival stream of connections over week days as a non-stationary Poisson process based
on 15 minute counts of arrivals obtained over a set of five weeks in over 200 coffee shops
in New York and San Francisco. We used the first four weeks of data for model fitting
and the last week’s data for testing the model.
Shown in Figure 1 are the average 15 minute counts of newly starting Wi-Fi connections
over two consecutive days after grouping the venues into four subclasses by size. From
here on, we will concentrate just on the “Large” group comprising 51 coffee shops. It
is clear that the arrival pattern is non-stationary and exhibits a daily periodicity.
Figure 1: Fifteen minute arrival counts over 2 consecutive days (vertical axis: average number of arrivals; horizontal axis: two weekdays in 15 min bins)
A Poisson process is suggested by the fact that a large number of customers typically
arrive into the coffee shop but only a very small fraction log into the Wi-Fi system there,
and non-stationarity is suggested from the above preliminary graphical analysis. Let us
concentrate on one of the groups, say the “Large” one and see how we could proceed.
A simple way of attempting to fit would be to take the empirical mean curve (the blue
curve of Figure 1) averaged over one day and to try to fit to it a continuous curve, say a
polynomial, yielding the required values λ(t). There are several problems associated with
this approach: (a) the throwing away of considerable amounts of information in the data;
(b) inability to ensure nonnegativity of fitted values; and (c) the high degree of the
polynomial that would be needed to mimic the flat regions, peaks and troughs in the data.
With this in mind, I suggested building a methodology based on the Poisson regression
model. The idea of Poisson regression, which is a part of the GLM methodology in
Statistics, is to fit a Poisson distribution to data on a random variable Y by assuming that
its mean λ = exp(α + β.x) , where x is a possibly vector valued co-variate, and then to
estimate the parameters in the model from the observed pairs (x(i), y(i)) using the
maximum likelihood procedure. The details of this may be found in most standard
statistical texts today and won’t be discussed here. This, however, appears to be its first
use in the context of a non-stationary Poisson process.
To that end, we need to identify one or more co-variates, and obviously since the mean is
a function of time, a simple way of achieving this is to choose the covariates so that they
determine the time point at which the mean is to be estimated. The graph of the mean
suggested that it is probably a good idea to group the 15-minute slots into blocks of say 3
hours each starting at 12 AM; that in turn would yield 8 blocks each with 12 fifteen
minute counts for each day. In short, we decided we would use the co-variate X=(I,J)
where I varied over values 1 to 8 and J over 1 to 12, with the pairs (1,1), (1,2) ...., (8,12)
covering the 96 consecutive 15 minute time slots over a day starting at 12 AM. Note that
the 3 hour block structure is motivated by a hope that it will help to model the flat regions
without much trouble. We initially used a Poisson regression model given by
$$\log \lambda(i,j) = \alpha + \sum_{k=1}^{3} \left( \beta_k i^k + \gamma_k j^k \right).$$
A third degree polynomial appears to be the minimum needed; we were trying to be
optimistic in our approach involving only the first three powers of the co-variates.
That model did not fare well, and some thought revealed one glaring problem with the
model. The blocks are not identical with respect to intra-block behavior of the process.
Thus, for instance, while in the three hour block from 12 AM to 3 AM hardly any change
occurred in the counts, the three hour block from 6 AM to 9 AM shows a rapid increase
in the counts. So, we decided to include an interaction term in the model enhancing it to
$$\log \lambda(i,j) = \alpha + \sum_{k=1}^{3} \left( \beta_k i^k + \gamma_k j^k \right) + \delta\,(ij).$$
The model improved significantly; however, when we tried it
on the test data spanning the last week, its performance was not at all satisfactory.
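A model of this form can be fitted by maximum likelihood in a few lines. What follows is a minimal sketch and not the code used in the study: the data are synthetic, the values in `beta_true` are arbitrary, and the helper names `design_matrix` and `fit_poisson_regression` are hypothetical. The covariates are rescaled to i/8 and j/12 before powers are taken; this is only a reparameterization of the same model, chosen here to keep the Newton iteration well conditioned.

```python
import numpy as np

def design_matrix(i, j, n_blocks=8, n_slots=12):
    """Columns: 1, u, u^2, u^3, v, v^2, v^3, u*v with u = i/8 and v = j/12."""
    u = np.asarray(i, dtype=float) / n_blocks
    v = np.asarray(j, dtype=float) / n_slots
    cols = [np.ones_like(u)]
    cols += [u**k for k in (1, 2, 3)]
    cols += [v**k for k in (1, 2, 3)]
    cols.append(u * v)  # interaction term
    return np.column_stack(cols)

def fit_poisson_regression(X, y, n_iter=100, tol=1e-10):
    """Maximum likelihood for log E[y] = X @ beta via Newton's method."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start from the constant-rate model
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)            # gradient of the log-likelihood
        info = X.T @ (X * mu[:, None])    # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Covariates (i, j) for the 96 slots of a day: i = block 1..8, j = slot 1..12.
block, slot = np.divmod(np.arange(96), 12)
X_day = design_matrix(block + 1, slot + 1)

# Synthetic training data: 20 weekdays of counts from arbitrary coefficients.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, 2.0, -1.0, -0.5, 0.3, -0.2, 0.1, 0.5])
mu_true = np.exp(X_day @ beta_true)
n_days = 20
X = np.tile(X_day, (n_days, 1))
y = rng.poisson(np.tile(mu_true, n_days)).astype(float)

beta_hat = fit_poisson_regression(X, y)
mu_hat = np.exp(X_day @ beta_hat)  # fitted mean count per 15 minute slot
```

Because the mean is modeled on the log scale, the fitted values `mu_hat` are positive by construction, which addresses objection (b) to direct polynomial fitting.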
At this point, I decided to try to inject yet another method from more modern statistics,
namely clustering. Recall that our grouping of the time slots into three hour blocks was
entirely arbitrary, and perhaps we could do something better. So we clustered
the 96 time slots of the day into 8 clusters using the k-means clustering
algorithm; this chooses eight centroids as cluster means and assigns individual points to
the clusters so that the intra-cluster variances are minimized. Several interesting things
started happening.
Following are the eight clusters obtained by us.
Note that cluster 0 includes time slots 1 and 96 showing the daily cyclical pattern in the
data identified automatically by the clustering scheme. Also, the clustering scheme
automatically groups like intervals into a common cluster even if they are not contiguous.
Thus, for instance, the time slots 32-34, 38-40 and 64-66 that mark very fast ascents of
the counts are grouped together. Now with the variables I and J defined by the above
groupings, we re-computed the Poisson regression model with an interaction term.
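The clustering step can be illustrated with a one-dimensional version of Lloyd's algorithm applied to the 96 slot means. This is a sketch under stated assumptions: we take the “mean clustering” of the text to be k-means, the daily mean curve below is synthetic, the function name `kmeans_1d` is hypothetical, and the initialization used in the study is not known to us.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on scalar data; returns (labels, centroids)."""
    values = np.asarray(values, dtype=float)
    centroids = np.linspace(values.min(), values.max(), k)  # deterministic start
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        updated = centroids.copy()
        for c in range(k):
            members = values[labels == c]
            if members.size > 0:       # an empty cluster keeps its centroid
                updated[c] = members.mean()
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment of every point to its nearest centroid.
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return labels, centroids

# Synthetic mean arrival count per 15 minute slot (hypothetical shape):
# flat at night, a morning peak and a broader afternoon peak.
hours = np.arange(96) / 4.0
slot_means = (0.2
              + 6.0 * np.clip(1.0 - ((hours - 9.0) / 3.0) ** 2, 0.0, None)
              + 4.0 * np.clip(1.0 - ((hours - 16.0) / 4.0) ** 2, 0.0, None))

labels, centroids = kmeans_1d(slot_means, k=8)
# Slots with similar means share a label even when they are not contiguous,
# for example the first and last slots of the day.
print(labels[0] == labels[95], np.round(centroids, 2))
```

The cluster label can then take the place of the three hour block index I in the regression.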
Shown below in Figure 2 is the data over the last week, the estimated expected values,
and the 95% confidence limits computed based on the Poisson distributions.
Figure 2: Comparison of model and observations over test data for arrivals (vertical axis: number of arrivals; horizontal axis: 5 weekdays in 15 min bins; curves: observed data, model mean, 2.5% quantile, 97.5% quantile)
Of the 5x96 = 480 observations, very few fall outside the confidence limits;
used as a statistical test, their number is too small to provide any reason to reject
the hypothesis that the fit is good.
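The check just described can be sketched as follows. The code is illustrative only: the model means and the observations are simulated here, and the helper names `poisson_quantile` and `fraction_outside` are hypothetical. Each slot gets its own 2.5% and 97.5% Poisson quantiles, and we count how many observations fall outside that band.

```python
import math
import numpy as np

def poisson_quantile(mean, q):
    """Smallest k with P[N <= k] >= q for N ~ Poisson(mean)."""
    k = 0
    pmf = math.exp(-mean)   # P[N = 0]; adequate for the small means used here
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= mean / k
        cdf += pmf
    return k

def fraction_outside(observed, model_mean, level=0.95):
    """Fraction of observations outside the per-slot Poisson band."""
    tail = (1.0 - level) / 2.0
    lower = np.array([poisson_quantile(m, tail) for m in model_mean])
    upper = np.array([poisson_quantile(m, 1.0 - tail) for m in model_mean])
    outside = (observed < lower) | (observed > upper)
    return outside.mean(), lower, upper

# Hypothetical test week: 5 days x 96 slots with a made-up daily mean curve.
rng = np.random.default_rng(1)
hours = np.tile(np.arange(96) / 4.0, 5)
model_mean = 0.5 + 4.0 * np.exp(-((hours - 10.0) / 3.0) ** 2)
observed = rng.poisson(model_mean)

frac, lower, upper = fraction_outside(observed, model_mean)
print(f"{frac:.3f} of {observed.size} observations fall outside the band")
```

If the model is adequate, each observation falls outside its band with probability at most 5%, so a markedly larger fraction over the 480 slots would be evidence against the fit.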
We thus obtained an interesting new method for fitting a non-stationary Poisson process
model to time series data of arrivals by exploiting Poisson regression, a much used
procedure in modern Statistics for static data. Combining it with a clustering algorithm
was key to its success in our context.
Just to conclude our discussion of this data analysis, we wish to note that we did not stop
there. We fitted a heavy tailed distribution to the connection times from the first four
weeks and modeled the number of simultaneous connections as a M(t)/G/∞ queue. We
then compared the resulting model with the data actually observed in the last week.
Figure 3 provides a comparison similar to that in Figure 2.
Figure 3: Comparison of model and test data for simultaneous connections (vertical axis: number of simultaneously present customers; horizontal axis: 5 weekdays in 15 min bins; curves: observed data, model mean m(t), 2.5% quantile, 97.5% quantile)
Thus, we have managed to obtain an interesting way to fit and validate an M(t)/G/∞
model to our data set. To the best of our knowledge, these types of model fitting or
validation have not been made earlier in the queueing literature. For further details, refer
to [3].
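For orientation, we recall the standard result for the infinite-server queue with non-stationary Poisson arrivals. Assuming the system started empty in the distant past, with arrival rate λ(t) and a connection time S having distribution function G (S and G are notation introduced here), the number of simultaneous connections at time t is Poisson distributed with mean

$$m(t) = \int_{0}^{\infty} \lambda(t-u)\, P[S > u]\, du = \int_{0}^{\infty} \lambda(t-u)\,\bigl(1 - G(u)\bigr)\, du .$$

Quantile bands of the kind used for the arrival counts can therefore be read off a Poisson distribution with mean m(t). This text does not say how the integral was evaluated for the fitted model.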
3. Modeling Heavy Tailed Distributions
There are many practical situations where the distribution of an underlying variable is
heavy tailed. For a quick introduction to heavy tailed distributions, refer to Sigman [8]
and references cited therein. These are distributions whose tail P[X>x] decays slowly
with x, and among those an important class is those for which for large x,
$$P[X > x] = O(x^{-\alpha})$$
for some α >0. Note that the tail of a heavy tailed distribution decays very slowly, for
example relative to an exponential distribution which has tail decay of the form exp(-λx).
A simple way to recognize a heavy tail of the type noted is to consider a log log plot of
the complementary distribution P[X>x]; in the heavy tailed case we will see that for
sufficiently large x, it has a linear decay with decay rate (– α ).
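The log log diagnostic is easy to reproduce. The sketch below, on a simulated Pareto sample, computes the empirical complementary distribution and estimates α as minus the least squares slope over the upper tail. The function names and the choice of the top 10% of the sample as "the tail" are assumptions made for illustration, not a recommended estimator.

```python
import numpy as np

def empirical_ccdf(sample):
    """Sorted values x_(i) with the empirical P[X > x_(i)], last point dropped."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    ccdf = 1.0 - np.arange(1, n + 1) / n
    return x[:-1], ccdf[:-1]   # the largest point has ccdf 0, so its log is undefined

def tail_index_from_loglog(sample, tail_fraction=0.1):
    """Minus the slope of log ccdf against log x over the upper tail."""
    x, ccdf = empirical_ccdf(sample)
    start = int((1.0 - tail_fraction) * x.size)
    slope, _ = np.polyfit(np.log(x[start:]), np.log(ccdf[start:]), 1)
    return -slope

rng = np.random.default_rng(2)
alpha_true = 1.5
# Inverse-transform sampling: P[X > x] = x^(-alpha) for x >= 1.
sample = (1.0 - rng.random(100_000)) ** (-1.0 / alpha_true)
alpha_hat = tail_index_from_loglog(sample)
print(round(alpha_hat, 2))
```

On real data the estimate depends on where the linear regime is taken to begin, so the cutoff deserves a sensitivity check.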
A heavy tailed distribution with a tail of the form noted above will not have a finite mean
if α ≤ 1 and will not have a finite variance unless α > 2. In other words, not all moments
need be finite. This has serious implications for the use of some common descriptive
measures and for estimating them from data.
To understand the difference between heavy tailed and other distributions better,
assume X is the repair time of some equipment after a failure and consider a conditional
probability such as P[X>2x | X>x]. For an exponential distribution, this is easily
computed to be exp(-λx), which goes to 0 very fast, whereas in the heavy tailed case
exemplified above, for large x, it is approximately 1/2^α. Thus, for α = 1/2, we see
that this conditional probability is about 0.71 for all large x, which should indicate a very
unacceptable situation indeed.
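The two cases can be worked out in one line each. For the exponential distribution,

$$P[X > 2x \mid X > x] = \frac{P[X > 2x]}{P[X > x]} = \frac{e^{-2\lambda x}}{e^{-\lambda x}} = e^{-\lambda x},$$

while, under the slightly stronger assumption that the tail behaves like $K x^{-\alpha}$ for large x,

$$P[X > 2x \mid X > x] \approx \frac{K (2x)^{-\alpha}}{K x^{-\alpha}} = 2^{-\alpha},$$

which does not depend on x. For instance, $2^{-1/2} \approx 0.71$: a repair that has already lasted a long time x then has roughly a 71% chance of lasting at least as long again.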
Heavy tailed behavior is undesirable in many contexts. For example in a queueing model,
if service times are heavy tailed, delays have even heavier tails, the queue becomes
subject to large excursions, and often descriptors such as steady state measures of
performance and other commonly used measures to characterize the system and control it
become meaningless since convergence to them is not obtained or is too slow.
Some well-known examples of heavy tailed distributions are the Pareto, Weibull and log
Normal distributions. While the Pareto distribution is characterized by a complementary
cdf of the form
$$P[X > x] = K / x^{\alpha},$$
the Weibull is such that
$$P[X > x] = K e^{-\lambda x^{\beta}},$$
where K is a suitable normalizing constant. Finally, we say that X is log normal if log X
is normally distributed. There are slightly modified versions of the Pareto and Weibull
distributions which allow for more than one parameter for greater flexibility.
In the literature, it is customary to use one of the above distributions to model heavy
tailed random variables. One difficulty with this has always been that while this allows
for matching the tail fairly well, the head of the distribution is not matched. In many
practical examples, the heavy tail comes from only a very small fraction of the data, and
thus the bulk of the data is not well represented. This, in turn, results in many
performance measures of interest not being characterized and predicted accurately.
Some authors have attempted to remedy this in an ad-hoc manner by forming mixtures of
heavy tailed and non-heavy tailed distributions, but as far as we know, no cogent theory
has evolved that has yielded one common class of distributions.
Our own encounters with heavy tails have occurred in three diverse contexts within the
last couple of years. The first was the Wi-Fi modeling scenario [3] discussed earlier
where we found the connection durations to be heavy tailed. A second one involved
repair times in a certain part of our communications network [5]. And the third is the
well known example of internet file sizes which we [4] revisited recently for evaluating
the impact of heavy tails on TCP performance.
It is well-known that phase type distributions [6],[7], that is distributions that can be
realized as absorption time distributions in finite Markov chains with an absorbing state,
are dense in the set of all distributions on [0, ∞ ). A difficulty with this class in modeling
heavy tailed random variables is that phase type distributions always have an
exponentially decaying tail; the decay parameter is the Perron eigenvalue of the
submatrix in the infinitesimal generator corresponding to the transient states, although
this fact is not very relevant to our limited discussion here. Just note that when we use a
PH distribution as a model for X, we get for large x,
$$P[X > x] \approx K e^{-\eta x},$$
where η is some positive constant.
Using phase type distributions as a basis, we can, however, obtain an interesting class of
distributions to model heavy tails. To that end, we consider a family that I will call log
PH in analogy with the log normal which is defined by saying that X is a log PH random
variable if log(X) has a phase type distribution. It is easy to verify that if log(X) has an
exponential decay parameter η, then for large x,
$$P[X > x] \approx K / x^{\eta}.$$
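The verification is a change of variable. Since the logarithm is increasing, for large x

$$P[X > x] = P[\log X > \log x] \approx K e^{-\eta \log x} = K x^{-\eta}.$$

When log(X) is exponential with rate η, that is, a phase type distribution with a single phase, the relation is exact: $P[X > x] = x^{-\eta}$ for $x \ge 1$.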
The Pareto distribution is a trivial special case of this formulation, and the general
formulation we have made should provide much greater versatility due to the generality
of the class of phase type distributions. With these in mind, I advocated the use of this
model class in the three examples where we encountered heavy tails. Since the Wi-Fi
example is available in the open literature, I will just discuss some results related to the
other two examples only.
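To make the definition concrete, the following sketch draws from a log PH distribution by simulating the underlying Markov chain until absorption and exponentiating the absorption time. The two-phase representation used here is a toy example chosen for illustration, not one of the fitted models, and the function name `sample_ph` is hypothetical. The initial vector is assumed to sum to one.

```python
import numpy as np

def sample_ph(alpha, T, size, rng):
    """Draw absorption times of a Markov chain with initial vector alpha
    and transient generator T by simulating the chain."""
    alpha = np.asarray(alpha, dtype=float)
    T = np.asarray(T, dtype=float)
    n = alpha.size
    exit_rates = np.clip(-T.sum(axis=1), 0.0, None)  # rates into absorption
    samples = np.empty(size)
    for s in range(size):
        state = rng.choice(n, p=alpha)
        elapsed = 0.0
        while state < n:
            rate = -T[state, state]
            elapsed += rng.exponential(1.0 / rate)   # holding time in this phase
            jump = np.append(T[state], exit_rates[state])
            jump[state] = 0.0
            state = rng.choice(n + 1, p=jump / jump.sum())  # index n = absorbed
        samples[s] = elapsed
    return samples

# Toy representation (hypothetical): start in phase 1 w.p. 0.6, phase 2 w.p. 0.4.
alpha = np.array([0.6, 0.4])
T = np.array([[-2.0, 2.0],
              [0.0, -3.0]])

rng = np.random.default_rng(3)
y = sample_ph(alpha, T, 5000, rng)   # phase type sample, plays the role of log(X)
x = np.exp(y)                        # log PH sample, takes values >= 1

# Analytic mean of the phase type law: -alpha T^{-1} 1.
mean_y = -(alpha @ np.linalg.solve(T, np.ones(2)))
print(round(y.mean(), 3), round(mean_y, 3))
```

In this toy example the eigenvalues of T are -2 and -3, so η = 2 and the tail of X decays like x^(-2).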
3a. The Repair Times Example
The motivation for us [5] to fit a distribution to repair time data arose from trying to
answer an important question as to whether a certain set of observed long repair times
impairing an undesirably large number of transactions were to be treated as outliers or
were somewhat endemic in the system. In the latter case, there exists both the need and
scope for improvement through more stringent vendor management etc. For proprietary
reasons, we are unable to describe the context in greater detail although I must note that
our work did help to catch a potentially damaging situation fairly early and to take
various successful corrective steps.
Figure 4 provides the repair times (units are deliberately masked to preserve
confidentiality), and Figure 5 below provides a log log plot of the empirical
complementary distribution function exhibiting a heavy tail through a linear asymptote.
Figure 4: Year 2010 Repair time distribution (sample size = 151)
Figure 5: log log plot for Year 2010
We attempted to fit a phase type distribution to the logarithms and, after some
experimentation with the number of phases, we obtained, using the
EM algorithm, a phase type distribution of order 6 characterized by the following
α = (0.5867, 0.0001, 0.3989, 0.0143)
and the matrix T given by
[1,] -2.675229 2.675229 0.000000 0.000000 0.000000 0.000000
[2,] 0.000000 -2.675230 2.675230 0.000000 0.000000 0.000000
[3,] 0.000000 0.000000 -2.949639 2.949639 0.000000 0.000000
[4,] 0.000000 0.000000 0.000000 -2.949639 2.949639 0.000000
[5,] 0.000000 0.000000 0.000000 0.000000 -2.949639 2.949639
[6,] 0.000000 0.000000 0.000000 0.000000 0.000000 -3.719682
Note the small number of distinct parameters identified, and these we found to be quite
stable upon testing with different starting points etc. in the iterative scheme.
The following two figures compare the empirical and model cdf directly and through a
quantile to quantile plot.
Figure 6: Empirical and fitted cdf of 2010 log repair times
Figure 7: Comparison of sample and model quantiles for Year 2010
What the above demonstrated for us and to the community of our vendors was that the
heavy tailed phenomenon was inherent in the system, and some action had to be taken to
make repair time durations more acceptable. That squarely put to rest all arguments
about “black swans” and the like, offered as things one simply has to live with.
As we were doing the modeling, some efforts were ongoing to improve repair times, and
we used data in the first half of 2011 and compared it to that in 2010. The following
figure suggests two interesting inferences. The first is that our tail estimate appears to be
very robust. The second one, more of interest to the business, is that while the good is
getting better, the undesirable continues to remain undesirable. This conclusion is one
that should hopefully allow a major change of course that would also eliminate much
wasted effort and cost of failures.
Figure 8: A comparison of years 2010 and 2011
3b. Long File Sizes and TCP Performance
Recently, we undertook an effort to re-examine the famous internet data set earlier
considered in the literature by Crovella, see [2], to see if with our new tools we could
obtain a more accurate understanding of the effect of heavy tails on internet performance.
Given below in Figure 9 is the histogram of the logarithm of 130,000 WWW files
downloaded in the measured system in 1995.
Figure 9: Distribution of log file sizes
The complementary distribution shown in Figure 10 demonstrates the presence of a
heavy tail.
Figure 10: log log plot indicating a heavy tail
There was a negligible amount of data (<0.01%) below 200 bytes. So we decided to take
as our variable of interest X/200, where X is the size of the file in bytes, and then took the
logarithm after sweeping values below 200 to the value 200; see footnote 2.
We thus
decided to fit a phase type distribution to the variable Y = max(0, log(X/200)), and later to
set the original variable X = 200 with probability P(X≤200) and X = 200 * exp(Y) with
probability P(X>200).
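The transformation and its inverse can be written down directly. In the sketch below the helper names are hypothetical, the floor probability `p_floor` is a made-up value, and a single exponential phase stands in for the fitted phase type law of Y; only the two formulas in the docstrings come from the text.

```python
import numpy as np

FLOOR = 200.0  # bytes; sizes at or below this are swept to the floor

def to_log_scale(sizes):
    """Y = max(0, log(X / 200))."""
    sizes = np.asarray(sizes, dtype=float)
    return np.maximum(0.0, np.log(sizes / FLOOR))

def sample_sizes(n, p_floor, sample_y, rng):
    """X = 200 with probability p_floor, otherwise X = 200 * exp(Y)."""
    at_floor = rng.random(n) < p_floor
    y = sample_y(n, rng)
    return np.where(at_floor, FLOOR, FLOOR * np.exp(y))

def draw_y(n, rng):
    # Placeholder for the fitted phase type law of Y: one exponential phase.
    return rng.exponential(1.0, size=n)

rng = np.random.default_rng(4)
sizes = sample_sizes(10_000, p_floor=0.0001, sample_y=draw_y, rng=rng)
print(to_log_scale([100.0, 200.0, 5000.0]))
```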
For the resulting set of values Y, our approach with the EM algorithm allowed us to
settle on a phase type fit with 5 transient phases with initial vector (0,0,0,0,1,0) and
infinitesimal generator T for the transient part given by
1.31 -1.31
1.79 -2.11 0.32
0 -2.11
2.0 0.11 -2.11
In Figure 11 below, we show the empirical histogram and the density of the fitted PH
distribution. We decided to live with the inability of the fit to match a couple of peaks in
the empirical density since the area under the curve is what really matters (see the next
plot), and also because we did not want to increase the dimension of the fit enormously
just to accomplish that.
Figure 11: Empirical histogram and fitted density
The following graph, Figure 12, which provides a comparison of the empirical and model
cdfs shows that the fit does indeed do a very good job. In passing, we also note that a
property of the EM fit is that the model mean will match the empirical mean. Hopefully,
this phase type fit should then pass muster when used in a queueing model.
Figure 12: Empirical and model cdf plots
We then compared our phase type fit to several of the classical types of fit and show the
results in Figure 13. It is clear that the Pareto and Weibull (with additional parameters to
also match the mean) fare very poorly. While visually the log normal appears more
reasonable in the main body, a more detailed analysis (to be presented in [3]) showed that
its tail decays at a rate much faster than what the data would support.
Figure 13: Comparison of models for fit of file size data
In the context of our discussion, the real proof of the pudding is how well the models
predict performance of systems. Certainly, we already have reasons to believe that the fit
based on the phase type distributions will do very well. But how much better would it
do? That is an interesting question.
To answer that, we considered the simple example, see Figure 14, of a bottleneck link
and a set of clients who download files from a web server.
Figure 14: The network analyzed
We considered several different scenarios wherein clients act as a finite source – i.e.,
once they make a request, they may not make another one until that download is
complete - and also the open system where clients can make additional requests even as
prior requests are pending, different system configurations, and many different load
levels. We assumed the “think times” to be exponentially distributed and chose different
rates for it to obtain the desired load level. For brevity we show here only one set of
examples corresponding to the finite source case with a loading of 80% of the link for a
case of 40 clients working under TCP Tahoe with parameters as given by the default
settings in NS2.
The following Figure 15 provides a comparison of queue length distributions obtained
from a set of 1000 simulations (using the NS2 simulator implementing TCP Tahoe) based
on the different models for file sizes as well as simulations of file sizes based on
sampling the empirical trace of file sizes. Note the remarkable accuracy with which the
log PH model is able to predict queue lengths in the output buffer at the server. The poor
match of the head of the distribution by the other models has resulted in their curves not
even maintaining the right shape, and the underestimation of the tail is very noticeable
particularly for the Weibull and the log normal.
Figure 15: Comparison of fits based on queue lengths
One of the major difficulties with heavy tails is that long run averages of queue statistics
converge very slowly, if they converge at all. So, we decided to consider the distribution of the
per second connection throughput, which is a fairly complex functional of the sample
paths. In this case, the log PH model still performs quite well while the other models fail
miserably and provide significant overestimates of throughput.
Figure 16: Comparison of fits based on throughput
This, combined with our previous example, will, we hope, at least generate a much
stronger interest in phase type and log PH distributions.
The above has raised a number of research questions of a modeling/mathematical kind
which are along the lines of what we consider as creating a greater interplay between
modern methods of statistics and applied probability:
(a) Can we obtain any mathematical insights on the quality of the estimator of the tail
decay parameter? What are good methods to estimate it from data as part of the fitting
procedure? Having used one such method, can we force the log PH fit to result in tail
decay equal to the value obtained as the estimate?
(b) What if I were to insist that the fitted means of my non-stationary Poisson process
also show some smoothness properties at the knot points? What are good algorithms to
achieve that?
(c) Can the method be extended to fit more complicated processes like the Markov
modulated Poisson process? How well would a clustering algorithm help to identify the
number of different phases to be used in the model and the associated Poisson rates?
How well would it work as a fitting technique (i) for the arrival process? (ii) for queueing
(d) Is there any need or scope for using zero inflated non-stationary models (as they do in
Statistics for the static case) to handle situations when, relative to the Poisson, there is an
excessive number of intervals with no events? This would, for example, be a meaningful
model in software bug estimation for well-tested software, where there would be a
preponderance of zero counts. If you are the real stochastic processes type, is
there really a continuous time stochastic process generating such a zero inflated process
and what is its structure?
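For reference, the static zero inflated Poisson model mentioned in (d) mixes a point mass at zero with an ordinary Poisson count. Writing π for the extra zero mass (notation introduced here), a count Y has

$$P[Y = 0] = \pi + (1-\pi)\, e^{-\lambda}, \qquad P[Y = k] = (1-\pi)\, \frac{e^{-\lambda} \lambda^{k}}{k!}, \quad k = 1, 2, \dots$$

A non-stationary analogue would presumably let λ, and possibly π, depend on time through covariates as in Section 2; that extension is an open question raised here, not an established model.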
There are many stochastic processes that could lend themselves to an approach based on
Poisson regression. An example would be the modeling of software errors where the
covariates could be the specific software module with the bug, the number of times
specific sections of code are invoked, the number of bugs already found and fixed in the
block, etc. Although there are some reported instances of the use of Poisson and zero
inflated Poisson regression in the software reliability context, we are not aware of their
use in the context of a non-stationary Poisson process model with time also taken into
Turning to the log PH class of distributions, we believe we have asserted it to be an
interesting class in its own right. Studying them in some depth from the perspectives of
extreme value and heavy tailed distributions in probability would therefore become a
worthwhile addition to the literature. The log PH approach provides a way to re-examine
with improved models many networking issues which we have considered in the past
with various ad-hoc procedures.
We used the EM algorithm to fit the phase type distribution to the logarithms. But this
may not be best for getting a good estimate of the tail. Could we obtain an estimate of
the tail independently and constrain the estimated PH to conform to that? For instance,
we could estimate the tail parameter from a regression model for the log log plot of the
complementary distribution function.
For the TCP example, we have used simulation primarily for examining the efficacy of
the log PH and EM framework within a nearly realistic context, but even more
importantly due to the fact that at present we do not have the necessary tools to handle
the queueing model analytically. Developing such tools and appropriate asymptotics and
approximations should form an interesting set of problems for applied probability.
One of the lessons we learned in the TCP modeling is on the risk of extrapolating far
beyond data when we use a heavy tailed distribution with an infinite support based on
data with a finite support. Unlike in the case of fast decaying tails, this extrapolation can
significantly alter the system estimates such as queue length distributions since they are
very sensitive to the tail. Direct use of the log PH and other heavy tailed distributions
gives significantly pessimistic results for the queue compared to the trace based
simulation. In our case, to obtain a good match we had to rescale our distributions within
a finite interval, and we arbitrarily took that to cover 1.2 times the maximum data point
seen; the total mass to the right was very small, less than $10^{-4}$. Can we bring more
science into this aspect? That is not only a challenging statistical question but a very
important one for practical applications.
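One way to read "rescaling within a finite interval" is to truncate the fitted distribution at a cap and renormalize it; that reading is our assumption, as the text does not spell out the procedure. The sketch below does this for a Pareto law, used only as a convenient stand-in for the log PH fit, with the cap set at 1.2 times the largest point of a simulated trace. The function name and all parameter values are hypothetical.

```python
import numpy as np

def sample_truncated_pareto(n, alpha, x_min, cap, rng):
    """Inverse-transform sampling from a Pareto law renormalized to [x_min, cap]."""
    mass_below_cap = 1.0 - (x_min / cap) ** alpha   # F(cap) of the untruncated law
    u = rng.random(n) * mass_below_cap              # uniform on [0, F(cap))
    return x_min * (1.0 - u) ** (-1.0 / alpha)

rng = np.random.default_rng(5)
alpha, x_min = 1.2, 1.0
# Simulated stand-in for a trace of file sizes.
data = x_min * (1.0 - rng.random(5_000)) ** (-1.0 / alpha)

cap = 1.2 * data.max()
mass_right = (x_min / cap) ** alpha   # P[X > cap] before truncation
draws = sample_truncated_pareto(100_000, alpha, x_min, cap, rng)
print(f"cap = {cap:.1f}, mass discarded to the right = {mass_right:.2e}")
```

Comparing performance measures computed under several caps gives a direct, if crude, measure of how sensitive the conclusions are to the extrapolated part of the tail.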
We hope that the examples we have presented above from our recent efforts and the types
of questions they raise drive home our point that there is much that could be done in
effecting a greater tie in for the mutual progress of both applied probability and statistics.
This is certainly a topic that has caught our recent fancy.
It is indeed unfortunate that we as applied probabilists have not taken advantage of the
more recent developments in statistics. And it is equally unfortunate that statisticians
mostly content themselves with static problems and do not get involved in making
significant contributions that will make applied probability as a field more practicable.
Just as Dr. DasGupta [1] noted in the context of a curriculum of statistical inference, we
do have a challenge to look at even broader issues and see how future generations can be
served better by our science. As for our own specific attempts, whether we have
foolishly rushed into an area where angels fear to tread or have identified some real
opportunities, only the future can tell.
Acknowledgment: I record my heartfelt thanks to my co-authors in [3],[4], and [5] for
indulging me with support for my wildly speculative suggestions and working together to
demonstrate them to be practically useful. My thanks are also due to Dr. Soren Asmussen
for some useful comments and observations.
