
Monitoring Frequency of Change
By Li Qin
Abstract
Control charts are widely used in process monitoring. This paper briefly reviews control charts
for monitoring a proportion and presents some initial ideas for using them to monitor the
change frequency of a web page, whose estimator can be expressed as a function of a proportion.
Experiments and/or simulations are still needed to compare the performance of these control charts.
1. Introduction
Control charts such as the Shewhart p-chart, various CUSUM charts and the SPRT chart
have been widely used in process monitoring problems such as quality control in
manufacturing. All the items can be inspected, or samples can be taken from the
process when 100% inspection is economically or practically impossible. Items are
classified as defective or nondefective (nonconforming or conforming) based on the
test result. Applications usually focus on a change (usually an increase) in the
proportion of defective items.
In estimating the change frequency of a web page, a crawler may visit the page
periodically and determine whether it has changed by computing a checksum of
the page at each access. Changes to a web page have been shown to follow a Poisson
process. The frequency ratio is defined as the ratio of the change frequency to the
access frequency. We can estimate the frequency ratio first and then obtain the change
frequency indirectly by multiplying the estimated ratio by the access
frequency. The frequency ratio can be estimated by –log(X/n) [CGM2000a], where n
is the total number of accesses and X is the number of accesses at which the page is
found unchanged. The estimated change frequency can be used to
improve the “freshness” of data warehouses, web caching policies and data mining
[CGM2000b]. One challenge for these applications is that the change frequency
itself may change, so a method for monitoring shifts in the change frequency
is needed. This paper therefore explores the possibility of using existing control
charts to monitor the change frequency. Similarly, we can monitor the change
frequency indirectly by monitoring the frequency ratio.
This paper is organized as follows: Section 2 reviews how to estimate the
change frequency; Section 3 introduces the control charts to be considered;
Section 4 presents some initial ideas for using control charts to monitor the change
frequency, along with some open questions; Section 5 concludes the paper.
2. Estimating the Change Frequency
In most cases, when we try to estimate the change frequency of web pages, we don’t
have the complete change history of web pages, i.e. we don’t know when exactly each
web page changes and how many times it has changed between consecutive accesses.
So, our discussion below is based on an incomplete change history of web pages.
In order to estimate the change frequency, [CGM2000a] traced the daily change
history of 720,000 web pages from 270 sites for four months. The experiment shows
that web page changes follow a Poisson process. The frequency ratio r is
defined as the ratio of the change frequency λ to the access frequency f, so r = λ/f.
An intuitive estimator for the change frequency is X/T, where X is the number of
detected changes and T is the monitoring period; the corresponding estimator for the
frequency ratio is X/n, where n is the number of accesses. This estimator has been
shown to be biased, and it is not consistent because the bias does not decrease as the
sample size increases. Because of these drawbacks, [CGM2000a] proposed an improved
estimator, –log(X/n), where n is the total number of accesses and X now denotes the
number of accesses in which the web page has not changed. For example, if we access
a web page once a day for 100 days and the page is unchanged at 70 accesses,
the estimated frequency ratio is r = –log(70/100) = 0.36 (natural logarithm). This is
slightly larger than the intuitive estimate 30/100 = 0.3, since some changes may have
been missed between accesses. A similar performance analysis shows that this estimator
is less biased and more efficient than X/n, and that it is consistent.
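The estimator above can be written out as a minimal Python sketch. The function names are hypothetical, and rejecting X = 0 is an assumption made here, since the text does not say how that case should be handled.

```python
import math


def estimate_frequency_ratio(unchanged: int, accesses: int) -> float:
    """Return -log(X/n), where X = accesses at which the page was unchanged."""
    if accesses <= 0 or not 0 < unchanged <= accesses:
        # X = 0 would require log(0); how to treat it is left open in the text.
        raise ValueError("need 0 < unchanged <= accesses")
    return -math.log(unchanged / accesses)


def estimate_change_frequency(unchanged: int, accesses: int,
                              access_frequency: float) -> float:
    """Change frequency = estimated frequency ratio * access frequency."""
    return estimate_frequency_ratio(unchanged, accesses) * access_frequency


# Example from the text: 100 daily accesses, page unchanged at 70 of them.
print(round(estimate_frequency_ratio(70, 100), 2))  # 0.36
```

With one access per day, the estimated change frequency per day equals the estimated frequency ratio.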
The change frequency, λ, can change in practice. So far we have no experimental
results showing how the change frequency itself changes. If it
changes very quickly, estimating λ and using it in applications will be difficult
and impractical. We therefore assume that the change frequency remains relatively
stable for at least some period of time. What we can do is test it periodically to
see whether it has changed and whether the change exceeds a predefined threshold.
Since the frequency ratio r can be estimated as –log(X/n), the proportion to be
monitored, p, is X/n. We are interested in detecting both increases and
decreases in p. In order not to miss too many changes, the crawler should access the
web page as frequently as possible. Usually, the crawler cannot access a web page
more than once a day and we are not interested in web pages that change more
than once a day, so the access frequency can be set to one access per day.
3. Control Charts
[W1997] gives a review and bibliography on control charts based on attribute data.
Here, we focus on the Shewhart p-chart, Bernoulli CUSUM chart, Binomial CUSUM
chart and SPRT chart [RS2000a, RS2000b, RS1999, and RS1998].
For our application, since the crawler cannot visit all the web pages once a day,
continuous 100% inspection is not possible. Therefore, our inspection will be based
on samples taken from the process.
3.1 The Shewhart p-chart
When samples of n items are taken from the process, the Shewhart p-chart plots
the fraction of defective items in each sample. So, if T is the total number of defective
items in a sample of size n, then T/n is plotted on the p-chart. T has a binomial
distribution, assuming p is constant and items are independent.
If the crawler visits a web page once a day for n days and X is the number of accesses in
which the web page doesn’t change, the proportion X/n can be plotted on the
Shewhart p-chart. Here, X has a binomial distribution with parameters n and p, where p
is the probability that the web page doesn’t change between two consecutive accesses
and p = e^(–r), where r is the frequency ratio. So, the result of the ith access, X_i, takes the
value 1 with probability p and 0 with probability 1 – p.
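As an illustration, the following Python sketch computes p-chart control limits and flags samples that fall outside them. The paper does not state the limits explicitly; the sketch assumes the usual textbook form p0 ± 3·sqrt(p0(1 – p0)/n), and the function names and sample data are hypothetical.

```python
import math


def p_chart_limits(p0: float, n: int, width: float = 3.0) -> tuple[float, float]:
    """Lower and upper control limits for the plotted proportion X/n."""
    sigma = math.sqrt(p0 * (1.0 - p0) / n)  # binomial std. dev. of X/n
    return max(0.0, p0 - width * sigma), min(1.0, p0 + width * sigma)


def p_chart_signals(unchanged_counts: list[int], n: int, p0: float,
                    width: float = 3.0) -> list[int]:
    """1-based indices of samples whose proportion lies outside the limits."""
    lower, upper = p_chart_limits(p0, n, width)
    return [k for k, x in enumerate(unchanged_counts, start=1)
            if not lower <= x / n <= upper]


# Hypothetical data: four samples of n = 25 daily accesses, in-control p0 = 0.5.
print(p_chart_signals([12, 14, 22, 11], n=25, p0=0.5))  # [3]
```

A signal above the upper limit suggests that p has increased, i.e. that the page now changes less often; a signal below the lower limit suggests the opposite.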
3.2. Bernoulli CUSUM Chart
The Bernoulli CUSUM chart is based on the individual observations X_1, X_2, …. To
detect an increase in p, the Bernoulli CUSUM control statistic is
B_i = max(0, B_{i-1}) + (X_i – r), i = 1, 2, …,
where r is the reference value (not to be confused with the frequency ratio of Section 2).
The chart signals an increase in p if B_i ≥ h, where h > 0 is the control limit.
To detect a decrease in p, the corresponding control statistic is
B_i = min(0, B_{i-1}) + (X_i – r), i = 1, 2, …,
and the chart signals a decrease in p if B_i ≤ h, where h < 0 is the control limit.
To obtain the value of r, we have to specify an out-of-control value p_1 that
we want to detect quickly. The constants r_1 and r_2 are defined as
$$r_1 = -\log\frac{1-p_1}{1-p_0}, \qquad r_2 = \log\frac{p_1(1-p_0)}{p_0(1-p_1)},$$
and the reference value is r = r_1/r_2.
Usually, p_1 is adjusted slightly so that r is the reciprocal of an integer
m, i.e. r = 1/m. The control limit h is chosen so that the average number of
observations to signal when p = p_0, which determines the false alarm rate, attains
some predefined value.
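The definitions in this subsection can be sketched in Python as follows. This is a minimal illustration for detecting an increase in p; the starting value B_0 = 0 is an assumption (the text does not give it), and the function names are hypothetical.

```python
import math
from typing import Iterable, Optional


def reference_value(p0: float, p1: float) -> tuple[float, float, float]:
    """Return (r1, r2, r) with r = r1 / r2, as defined in Section 3.2."""
    r1 = -math.log((1.0 - p1) / (1.0 - p0))
    r2 = math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    return r1, r2, r1 / r2


def bernoulli_cusum_increase(xs: Iterable[int], r: float,
                             h: float) -> Optional[int]:
    """Index (1-based) of the first observation with B_i >= h, or None."""
    b = 0.0  # assumed starting value B_0 = 0
    for i, x in enumerate(xs, start=1):
        b = max(0.0, b) + (x - r)
        if b >= h:
            return i
    return None


r1, r2, r = reference_value(0.5, 0.7)
print(round(r, 3))  # 0.603
```

For p_0 = 0.5 and p_1 = 0.7 the reference value lies between the two proportions, so the statistic drifts downward while the process is in control and upward after the shift.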
3.3 Binomial CUSUM Chart
The binomial CUSUM chart plots a cumulative sum based on the number of defective items
in successive samples of n consecutive items, T_1, T_2, …, where each T_k has a binomial
distribution. To detect an increase in p, the binomial CUSUM control statistic is
S_k = max(0, S_{k-1}) + (T_k – nr), k = 1, 2, …,
where nr is the reference value. The chart signals an increase in p if S_k ≥ h, where
h > 0 is the control limit. To detect a decrease in p, the control statistic is
S_k = min(0, S_{k-1}) + (T_k – nr), k = 1, 2, …,
and the chart signals a decrease in p if S_k ≤ h, where h < 0 is the
control limit.
As with the Bernoulli CUSUM chart, the control limit h for the binomial CUSUM
chart is obtained by making the false alarm rate satisfy some predefined value.
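A one-sided binomial CUSUM can be sketched in the same way. In this hypothetical helper the sign of h selects the direction (h > 0 for an increase, h < 0 for a decrease), and S_0 = 0 is an assumed starting value.

```python
from typing import Iterable, Optional


def binomial_cusum(ts: Iterable[int], n: int, r: float,
                   h: float) -> Optional[int]:
    """Index (1-based) of the first sample at which the chart signals."""
    if h == 0:
        raise ValueError("h must be positive (increase) or negative (decrease)")
    clamp = max if h > 0 else min  # max(0, S) for increases, min(0, S) for decreases
    s = 0.0  # assumed starting value S_0 = 0
    for k, t in enumerate(ts, start=1):
        s = clamp(0.0, s) + (t - n * r)
        if (h > 0 and s >= h) or (h < 0 and s <= h):
            return k
    return None


# Hypothetical weekly samples (n = 7 daily accesses), reference value n*r = 4.2.
print(binomial_cusum([4, 5, 7, 7, 7], n=7, r=0.6, h=5.0))   # 4
print(binomial_cusum([4, 2, 1, 1], n=7, r=0.6, h=-5.0))     # 3
```

Because the statistic is only updated once per sample, a signal can occur only at the end of a sample of n accesses.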
3.4 SPRT Chart
In most applications, both the p-chart and the CUSUM chart take samples of a fixed size
n at a fixed sampling interval. The SPRT chart instead uses a variable
sample size that is determined dynamically. It is a sequential test of the null hypothesis
H0: p = p_0 against H1: p = p_1. For each item, X_i = 1 if the ith item is defective (in our
application, if the web page does not change) and X_i = 0 otherwise. The statistic used
by the SPRT is
$$S_j = r_2 T_j - r_1 j, \qquad T_j = \sum_{i=1}^{j} X_i,$$
where r_1 and r_2 are defined as in Section 3.2.
The SPRT requires specifying two constants a and b, with b < a. The following rules are
used for sampling and for deciding whether to accept or reject H0:
If b < S_j < a, then continue sampling;
If S_j ≥ a, then stop sampling and reject H0;
If S_j ≤ b, then stop sampling and accept H0.
The constants a and b are usually chosen to satisfy some predefined error probabilities
(the probabilities of type I and type II errors).
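The sampling rule can be sketched as below, reading the stopping conditions as S_j ≥ a (reject H0) and S_j ≤ b (accept H0). The function name, the returned labels and the example limits a = 2, b = –2 are hypothetical.

```python
import math
from typing import Iterable


def sprt(xs: Iterable[int], p0: float, p1: float,
         a: float, b: float) -> tuple[str, int]:
    """Run one SPRT on 0/1 observations; return (decision, observations used)."""
    r1 = -math.log((1.0 - p1) / (1.0 - p0))
    r2 = math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    t = 0  # running count T_j of observations with X_i = 1
    j = 0
    for j, x in enumerate(xs, start=1):
        t += x
        s = r2 * t - r1 * j  # S_j, the log-likelihood ratio of H1 against H0
        if s >= a:
            return "reject H0", j
        if s <= b:
            return "accept H0", j
    return "continue sampling", j  # data ran out before a decision


print(sprt([1] * 20, p0=0.5, p1=0.7, a=2.0, b=-2.0))  # ('reject H0', 6)
print(sprt([0] * 20, p0=0.5, p1=0.7, a=2.0, b=-2.0))  # ('accept H0', 4)
```

The number of observations used varies from one test to the next, which is why the sample size of the SPRT chart is not fixed.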
3.5 Comparison of Control Charts
The performance of the above control charts can be measured by the ANSS (average
number of samples to signal), the ANOS (average number of observations to signal) and
the ATS (average time to signal). Since the sample size is not fixed for the SPRT chart and
the ATS depends on the length of the non-inspection period, we suggest using the ANOS to
compare the performance of the control charts.
A corrected diffusion (CD) theory approximation to the ANOS has been developed
for the CUSUM chart and the SPRT chart. For each type of control chart, the ANOS can be
obtained for a range of in-control values p_0 and out-of-control values p_1. When the
Bernoulli CUSUM chart is used, the CD approximation to the ANOS when p = p_0 is
$$\mathrm{ANOS}(p_0) \approx \frac{e^{h^* r_2} - h^* r_2 - 1}{|r_2 p_0 - r_1|}.$$
We can find the value of h* that gives a desired in-control ANOS
(which fixes the average false alarm rate). Then, using
$$h^* = h + \varepsilon(p_0)\sqrt{p_0 q_0}, \qquad q_0 = 1 - p_0,$$
where ε(p) can be approximated by
$$\varepsilon(p) \approx \begin{cases} 0.410 - 0.0842\log p - 0.0391(\log p)^3 - 0.00376(\log p)^4 - 0.000008(\log p)^7, & 0.01 \le p \le 0.5, \\[4pt] \dfrac{1}{3}\left(\sqrt{\dfrac{1-p}{p}} - \sqrt{\dfrac{p}{1-p}}\right), & 0 < p < 0.01, \end{cases}$$
we can find the control limit h. Also, the CD approximation to the ANOS when p = p_1
is
$$\mathrm{ANOS}(p_1) \approx \frac{e^{-h^* r_2} + h^* r_2 - 1}{|r_2 p_1 - r_1|}.$$
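The procedure just described (choose h* for a target in-control ANOS, convert it to h, then evaluate the out-of-control ANOS) can be sketched numerically. The sketch rests on several assumptions about how the formulas are read: h* = h + ε(p_0)·sqrt(p_0 q_0), absolute values in both denominators, and an out-of-control numerator of the form e^(–h* r_2) + h* r_2 – 1. The bisection solver, its search bracket and all function names are hypothetical.

```python
import math


def epsilon(p: float) -> float:
    """Approximate epsilon(p) for 0 < p <= 0.5 (two-case reading assumed)."""
    if 0.01 <= p <= 0.5:
        lp = math.log(p)
        return (0.410 - 0.0842 * lp - 0.0391 * lp**3
                - 0.00376 * lp**4 - 0.000008 * lp**7)
    if 0.0 < p < 0.01:
        return (math.sqrt((1 - p) / p) - math.sqrt(p / (1 - p))) / 3.0
    raise ValueError("epsilon(p) is only given for 0 < p <= 0.5")


def anos_in_control(h_star: float, p0: float, r1: float, r2: float) -> float:
    x = h_star * r2
    return (math.exp(x) - x - 1.0) / abs(r2 * p0 - r1)


def anos_out_of_control(h_star: float, p1: float, r1: float, r2: float) -> float:
    x = h_star * r2
    return (math.exp(-x) + x - 1.0) / abs(r2 * p1 - r1)


def solve_h_star(target: float, p0: float, r1: float, r2: float,
                 lo: float = 1e-9, hi: float = 50.0) -> float:
    """Bisection on h*; anos_in_control is increasing in h* for h* > 0."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if anos_in_control(mid, p0, r1, r2) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def control_limit(h_star: float, p0: float) -> float:
    """Recover h from h* = h + epsilon(p0) * sqrt(p0 * q0)."""
    return h_star - epsilon(p0) * math.sqrt(p0 * (1.0 - p0))


p0, p1 = 0.5, 0.7
r1 = -math.log((1 - p1) / (1 - p0))
r2 = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
h_star = solve_h_star(1000.0, p0, r1, r2)  # target in-control ANOS (hypothetical)
print(control_limit(h_star, p0), anos_out_of_control(h_star, p1, r1, r2))
```

The target in-control ANOS of 1000 is only an example value; any desired false alarm level can be substituted.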
When the SPRT chart is used, let α be the probability of a type I error and β the
probability of a type II error. Using
$$h^* \approx \frac{1}{r_2}\ln\frac{1-\beta}{\alpha}, \qquad g \approx \frac{1}{r_2}\ln\frac{\beta}{1-\alpha},$$
and h* = h + (1 – 2p_0)/3, we can find the values of h and g. Since g = b/r_2 and h = a/r_2, we
can then obtain a and b. ANOS(p_0) and ANOS(p_1) can be obtained using p_0, p_1, r_1, r_2, g
and h*.
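The step from the error probabilities to the SPRT constants a and b can be sketched as follows, treating the approximations for h* and g as equalities. The function name and the example error probabilities are hypothetical.

```python
import math


def sprt_limits(alpha: float, beta: float,
                p0: float, p1: float) -> tuple[float, float]:
    """Return (a, b) from h* ~ ln((1-beta)/alpha)/r2 and g ~ ln(beta/(1-alpha))/r2."""
    r2 = math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    h_star = math.log((1.0 - beta) / alpha) / r2
    g = math.log(beta / (1.0 - alpha)) / r2
    h = h_star - (1.0 - 2.0 * p0) / 3.0  # from h* = h + (1 - 2*p0)/3
    return h * r2, g * r2                # a = h*r2, b = g*r2


a, b = sprt_limits(alpha=0.05, beta=0.05, p0=0.5, p1=0.7)
print(round(a, 3), round(b, 3))  # 2.944 -2.944
```

For p_0 = 0.5 the correction term (1 – 2p_0)/3 vanishes, so h = h* in this example.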
The Shewhart p-chart has the advantage of simplicity, but it also has some
disadvantages. If the control limits are set three standard deviations from the target
value, the false alarm rate can differ considerably from the rate for a normal distribution.
The Shewhart p-chart is also not effective for detecting small changes in p. Its
performance for detecting small shifts can be improved by using a
larger sample size, but then it will not be very effective in detecting large shifts.
The Bernoulli CUSUM chart detects shifts in p much faster than the p-chart. The
binomial CUSUM chart is a little slower than the Bernoulli CUSUM chart for small
shifts in p and considerably slower for very large shifts, since a binomial CUSUM
would have to wait until the end of a sample to signal. The SPRT chart has much
better performance than the p-chart or the CUSUM chart.
4. Monitoring the Change Frequency
Based on the performance analysis of the control charts and the characteristics of our
application, the Bernoulli CUSUM chart or the SPRT chart would be more appropriate for
our purpose, since both are good at detecting small shifts.
We also need to consider the following questions:
a. how to check the change frequency, either periodically or randomly, and, if
periodically, how often;
b. the sample size, since it could have an important effect on
the inspection result;
c. the out-of-control value p_1;
d. the false alarm rate for the CUSUM chart, so that the control limit h can be
determined, or the error probabilities for the SPRT chart, so that the two constants a and b
can be determined.
For example, suppose we know that the current change frequency is 0.6931 per
day, i.e. the web page changes 0.6931 times a day on average. With one access per day,
0.6931 = –log(X/n), so this change frequency corresponds to X/n = 0.5. We want to detect
a shift of the change frequency to 0.3567, which corresponds to X/n = 0.7. In this
case, the in-control value is p_0 = 0.5 and the out-of-control value is p_1 = 0.7. These values are
relatively large compared with those used in quality control. If we use the Bernoulli
CUSUM chart, we can obtain the reference value r from p_0 and p_1. Then, given a desired
value of ANOS(p_0), we can find the control limit h and, next,
ANOS(p_1). If the SPRT chart is used, for some desired values of α and β we can
find a and b, and then the approximate ANOS(p). To
compare the performance of these control charts, we can adjust the parameters so
that the charts give similar values of ANOS(p_0) and then compare ANOS(p_1) for different
values of p_0 and p_1.
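One way to carry out such a comparison is by simulation. The sketch below estimates the ANOS of the Bernoulli CUSUM chart for the example values p_0 = 0.5 and p_1 = 0.7 by Monte Carlo. It is only an illustration: the control limit h = 4, the number of runs, the seed and the function names are hypothetical choices, B_0 = 0 is assumed, and no results from the paper are reproduced.

```python
import math
import random


def reference_value(p0: float, p1: float) -> float:
    """Reference value r = r1 / r2 from Section 3.2."""
    r1 = -math.log((1.0 - p1) / (1.0 - p0))
    r2 = math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    return r1 / r2


def run_length(p: float, r: float, h: float, rng: random.Random,
               max_obs: int = 1_000_000) -> int:
    """Observations until the upper Bernoulli CUSUM first reaches h."""
    b = 0.0  # assumed starting value B_0 = 0
    for i in range(1, max_obs + 1):
        x = 1 if rng.random() < p else 0  # 1 = page unchanged at this access
        b = max(0.0, b) + (x - r)
        if b >= h:
            return i
    return max_obs  # truncated run; only reached for very large h


def simulated_anos(p: float, r: float, h: float,
                   runs: int = 500, seed: int = 0) -> float:
    """Average run length over independent simulated runs."""
    rng = random.Random(seed)
    return sum(run_length(p, r, h, rng) for _ in range(runs)) / runs


r = reference_value(0.5, 0.7)
h = 4.0  # hypothetical control limit
print(simulated_anos(0.5, r, h), simulated_anos(0.7, r, h))
```

The first printed value estimates ANOS(p_0) and the second ANOS(p_1); repeating the run for several values of h shows the trade-off between false alarms and detection speed.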
5. Conclusion & Future Work
This paper gives a brief review of control charts and some initial ideas for applying
them to monitor the change frequency of a web page. However, to
show that these control charts are appropriate, experiments or simulations
should be carried out, and specific data measuring the performance of these control charts
should be obtained and compared.
References
[RS2000a] M. Reynolds, Jr. and Z. Stoumbos, Monitoring a Proportion Using
CUSUM and SPRT Control Charts, Frontiers in Statistical Quality Control 6, pp. 156-176 (2000)
[RS2000b] M. Reynolds, Jr. and Z. Stoumbos, A General Approach to Modeling
CUSUM Charts for a Proportion, IIE Transactions (2000) 32, pp. 515-535
[RS1999] M. Reynolds, Jr. and Z. Stoumbos, A CUSUM Chart for Monitoring a
Proportion When Inspecting Continuously, Journal of Quality Technology, Vol. 31,
No. 1, Jan 1999
[RS1998] M. Reynolds, Jr. and Z. Stoumbos, The SPRT Chart for Monitoring a
Proportion, IIE Transactions (1998) 30, pp. 545-561
[W1997] W. Woodall, Control Charts Based on Attribute Data: Bibliography and
Review, Journal of Quality Technology, Vol. 29, No. 2, April 1997
[CGM2000a] Junghoo Cho and Hector Garcia-Molina, Estimating Frequency of
Change
[CGM2000b] Junghoo Cho and Hector Garcia-Molina, The Evolution of the Web and
Implications for an Incremental Crawler, VLDB 2000, Experience/Application track,
2000.