Monitoring Frequency of Change

By Li Qin

Abstract

Control charts are widely used in process monitoring problems. This paper gives a brief review of control charts for monitoring a proportion and presents some initial ideas for using them to monitor the change frequency of a web page, whose estimator can be expressed as a function of a proportion. Experiments and/or simulations should be done to compare the performance of these control charts.

1. Introduction

Control charts such as the Shewhart p-chart, various CUSUM charts and the SPRT chart have been widely used in process monitoring problems such as quality control in manufacturing. All items can be inspected, or samples can be taken from the process when 100% inspection is economically or practically impossible. Items are classified as defective or nondefective (nonconforming or conforming) based on the test result. Applications usually focus on a change (usually an increase) in the proportion of defective items.

To estimate the change frequency of a web page, a crawler may visit the page periodically and determine whether it has changed by computing a checksum of the page at each access. Web page changes have been shown to follow a Poisson process. The frequency ratio is defined as the ratio of the change frequency to the access frequency. We can estimate the frequency ratio first and then estimate the change frequency indirectly by multiplying the frequency ratio by the access frequency. The frequency ratio can be estimated by -log(X/n) [CGM2000a], where n is the total number of accesses and X is the number of accesses at which the page has not changed during the checking period. The estimated change frequency can be applied to improve the "freshness" of data warehouses, web caching policies and data mining [CGM2000b].

One challenge for these applications is that the change frequency itself may change, so a method for monitoring shifts in the change frequency is needed.
This paper therefore examines the possibility of using existing control charts to monitor the change frequency. As with estimation, we can monitor the change frequency indirectly by monitoring the frequency ratio.

This paper is organized as follows: Section 2 reviews how to estimate the change frequency; Section 3 introduces the control charts to be considered; Section 4 presents some initial ideas for using control charts to monitor the change frequency, along with some questions to be considered; Section 5 concludes the paper.

2. Estimating the Change Frequency

In most cases, when we try to estimate the change frequency of web pages, we do not have their complete change history, i.e. we do not know exactly when each web page changes or how many times it has changed between consecutive accesses. Our discussion below is therefore based on an incomplete change history.

In order to estimate the change frequency, [CGM2000a] traced the daily change history of 720,000 web pages from 270 sites for four months. The experiment shows that web page changes follow a Poisson process.

The frequency ratio r is defined as the ratio of the change frequency λ to the access frequency f, so r = λ / f. An intuitive estimator for the change frequency is X/T, where X is the number of detected changes and T is the monitoring period; for the frequency ratio this corresponds to X/n, where n is the number of accesses. This estimator has been proved to be biased and not consistent, since the bias does not decrease as the sample size increases.

Because of these drawbacks, [CGM2000a] proposed an improved estimator, -log(X/n), where n is the total number of accesses and X is now the number of accesses at which the web page has not changed. For example, if we access a web page once a day for 100 days and the web page has not changed at 70 accesses, the estimated frequency ratio is r = -log(70/100) = 0.36. This result is slightly larger than the intuitive estimate 30/100 = 0.3, since some changes may have been missed between accesses.
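The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the error raised when no unchanged access is observed (where the logarithm is undefined) is an assumption made here, not a rule taken from [CGM2000a].

```python
import math


def estimate_frequency_ratio(unchanged: int, accesses: int) -> float:
    """Estimate the frequency ratio r as -log(X/n).

    unchanged is X, the number of accesses at which the page had not changed;
    accesses is n, the total number of accesses.
    """
    if accesses <= 0 or not 0 < unchanged <= accesses:
        # Assumption: X = 0 is rejected because log(0) is undefined.
        raise ValueError("need 0 < unchanged <= accesses")
    return -math.log(unchanged / accesses)


def estimate_change_frequency(unchanged: int, accesses: int,
                              access_frequency: float) -> float:
    """Estimate the change frequency as r multiplied by the access frequency f."""
    return access_frequency * estimate_frequency_ratio(unchanged, accesses)


# 100 daily accesses, 70 of which found the page unchanged.
r_hat = estimate_frequency_ratio(70, 100)
naive = (100 - 70) / 100  # intuitive estimate based on detected changes
print(round(r_hat, 4), naive)
```

With one access per day the access frequency is 1, so the estimated change frequency equals the estimated frequency ratio.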
A similar performance analysis shows that this estimator has less bias than X/n, is more efficient, and is consistent.

The change frequency λ can change in practice. So far we have no experiment showing how the change frequency itself changes. If it changes very quickly, it will be difficult and impractical to estimate λ and use it in applications. We therefore assume that the change frequency remains relatively stable for at least some period of time. What we can do is test it periodically to see whether it has changed and whether the change exceeds a predefined threshold. Since the frequency ratio r can be estimated as -log(X/n), the proportion to be monitored, p, is X/n. We are interested in detecting both increases and decreases in p.

In order not to miss too many changes, the crawler should access the web page as frequently as possible. Usually, the crawler cannot access web pages more than once a day, and we are not interested in web pages that change more than once a day, so the access frequency can be chosen as one access per day.

3. Control Charts

[W1997] gives a review and bibliography of control charts based on attribute data. Here, we focus on the Shewhart p-chart, the Bernoulli CUSUM chart, the binomial CUSUM chart and the SPRT chart [RS2000a, RS2000b, RS1999, RS1998]. In our application, since the crawler cannot visit all web pages once a day, continuous 100% inspection is not possible. Our inspection will therefore be based on samples taken from the process.

3.1 The Shewhart p-chart

When samples of n items are taken from the process, the Shewhart p-chart plots the fraction of defective items in each sample. So, if T is the total number of defective items in a sample of size n, then T/n is plotted on the p-chart. T has a binomial distribution, assuming p is constant and items are independent.
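As a rough illustration of how the plotted fraction T/n might be checked against control limits, the sketch below assumes the conventional three-sigma limits around an in-control value p0, computed from the binomial standard deviation. The limits, the value p0 = 0.5, the sample size of 30 and the function names are illustrative assumptions rather than choices made in this paper.

```python
import math


def p_chart_limits(p0: float, n: int, width: float = 3.0) -> tuple[float, float]:
    """Return (lower, upper) limits p0 -/+ width * sqrt(p0 * (1 - p0) / n)."""
    sigma = math.sqrt(p0 * (1.0 - p0) / n)
    # Clip to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p0 - width * sigma), min(1.0, p0 + width * sigma)


def p_chart_signals(counts: list[int], n: int, p0: float,
                    width: float = 3.0) -> list[bool]:
    """Flag each sample whose fraction T/n falls outside the limits."""
    lower, upper = p_chart_limits(p0, n, width)
    return [not (lower <= t / n <= upper) for t in counts]


# Four hypothetical samples of 30 items each.
lower, upper = p_chart_limits(0.5, 30)
flags = p_chart_signals([15, 17, 26, 4], n=30, p0=0.5)
print(round(lower, 4), round(upper, 4), flags)
```

Only the last two samples, with fractions 26/30 and 4/30, fall outside the limits and are flagged.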
If the crawler visits a web page once a day for n days and X is the number of accesses at which the web page has not changed, the proportion X/n can be plotted on the Shewhart p-chart. Here, X has a binomial distribution with parameters n and p, where p is the probability that the web page does not change between two consecutive accesses, and $p = e^{-r}$, where r is the frequency ratio. The result of the ith access, $X_i$, thus takes the value 1 with probability p and 0 with probability 1 - p.

3.2 Bernoulli CUSUM Chart

The Bernoulli CUSUM chart is based on the individual observations $X_1, X_2, \ldots$. To detect an increase in p, the Bernoulli CUSUM control statistic is

$$B_i = \max(0, B_{i-1}) + (X_i - r), \quad i = 1, 2, \ldots,$$

where r is the reference value (a chart parameter distinct from the frequency ratio r above). The chart signals that there has been an increase in p if $B_k \ge h$, where $h > 0$ is the control limit. For detecting a decrease in p, the corresponding CUSUM control statistic is

$$B_i = \min(0, B_{i-1}) + (X_i - r), \quad i = 1, 2, \ldots,$$

and the chart signals that there has been a decrease in p if $B_k \le h$, where $h < 0$ is the control limit.

To obtain the value of r, we have to specify an out-of-control value $p_1$ that we want to detect quickly. The constants $r_1$ and $r_2$ are defined as

$$r_1 = -\log\frac{1 - p_1}{1 - p_0}, \qquad r_2 = \log\frac{p_1 (1 - p_0)}{p_0 (1 - p_1)},$$

where $p_0$ is the in-control value of p. The reference value is then $r = r_1 / r_2$. Usually, $p_1$ is adjusted slightly so that r is the reciprocal of an integer m, i.e. r = 1/m. The control limit h is chosen so that the false alarm rate (expressed as the average number of observations or samples to signal when $p = p_0$) takes some predefined value.

3.3 Binomial CUSUM Chart

The binomial CUSUM chart plots a cumulative sum based on the numbers of defective items $T_1, T_2, \ldots$ in successive samples of n consecutive items, where each $T_k$ has a binomial distribution. For detecting an increase in p, the binomial CUSUM control statistic is

$$S_k = \max(0, S_{k-1}) + (T_k - nr), \quad k = 1, 2, \ldots,$$

where nr is the reference value.
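The upper-sided recursion above can be sketched in Python as follows. With n = 1 it reduces to the Bernoulli CUSUM of Section 3.2, and the reference value is computed as r1/r2. The function names, the control limit h = 2.0 and the observation sequence are hypothetical, and the sketch assumes the chart signals as soon as the statistic reaches or exceeds h.

```python
import math
from typing import Optional, Sequence


def reference_value(p0: float, p1: float) -> float:
    """Reference value r = r1 / r2 for in-control p0 and out-of-control p1."""
    r1 = -math.log((1.0 - p1) / (1.0 - p0))
    r2 = math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    return r1 / r2


def upper_cusum(counts: Sequence[int], n: int, r: float,
                h: float) -> Optional[int]:
    """Run S_k = max(0, S_{k-1}) + (T_k - n*r) over the sample counts.

    Returns the 1-based index of the first sample at which S_k >= h,
    or None if the chart never signals. n = 1 gives the Bernoulli CUSUM.
    """
    s = 0.0
    for k, t in enumerate(counts, start=1):
        s = max(0.0, s) + (t - n * r)
        if s >= h:
            return k
    return None


r = reference_value(0.5, 0.7)
# Hypothetical daily results: 1 = page unchanged, 0 = page changed.
observations = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
first_signal = upper_cusum(observations, n=1, r=r, h=2.0)
print(round(r, 4), first_signal)
```

The lower-sided chart follows the same pattern with min in place of max and a negative control limit.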
The binomial CUSUM chart signals that there has been an increase in p if $S_k \ge h$, where $h > 0$ is the control limit. For detecting a decrease in p, the binomial CUSUM control statistic is

$$S_k = \min(0, S_{k-1}) + (T_k - nr), \quad k = 1, 2, \ldots,$$

where nr is the reference value. The chart signals that there has been a decrease in p if $S_k \le h$, where $h < 0$ is the control limit. As with the Bernoulli CUSUM chart, the control limit h is chosen so that the false alarm rate takes some predefined value.

3.4 SPRT Chart

In most applications, both the p-chart and the CUSUM chart take a fixed sample size of n items with a fixed sampling interval between samples. The SPRT chart instead uses a variable sample size that is determined dynamically. It applies a sequential test of the null hypothesis $H_0: p = p_0$ against $H_1: p = p_1$. For each item, $X_i = 1$ if the ith item is defective (in our application, if the web page has not changed) and $X_i = 0$ otherwise. The statistic used by the SPRT is

$$S_j = r_2 T_j - r_1 j, \quad \text{where } T_j = \sum_{i=1}^{j} X_i,$$

and $r_1$ and $r_2$ are defined as in Section 3.2. The SPRT requires specifying two constants a and b, with b < a. The following rules are used for sampling and for deciding whether to accept or reject $H_0$:

If $b < S_j < a$, then continue sampling;
If $S_j \ge a$, then stop sampling and reject $H_0$;
If $S_j \le b$, then stop sampling and accept $H_0$.

The constants a and b are usually chosen to satisfy predefined error probabilities (the probabilities of type I and type II errors).

3.5 Comparison of Control Charts

The performance of the above control charts can be measured by the ANSS (average number of samples to signal), the ANOS (average number of observations to signal) and the ATS (average time to signal). Since the sample size is not fixed for the SPRT chart and the ATS depends on the length of the non-inspecting period, we suggest using the ANOS to compare the performance of the control charts.

A corrected diffusion (CD) theory approximation to the ANOS has been developed for the CUSUM chart and the SPRT chart.
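The ANOS can also be estimated by direct simulation, which offers a simple cross-check on such approximations. The sketch below does this for an upper-sided Bernoulli CUSUM. The values p0 = 0.5, p1 = 0.7 and h = 3.0, the number of runs, the fixed seed and the function names are all illustrative assumptions, not settings taken from the cited papers.

```python
import math
import random


def reference_value(p0: float, p1: float) -> float:
    """Reference value r = r1 / r2 of the Bernoulli CUSUM."""
    r1 = -math.log((1.0 - p1) / (1.0 - p0))
    r2 = math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    return r1 / r2


def run_length(p: float, r: float, h: float, rng: random.Random,
               max_obs: int = 100_000) -> int:
    """Count observations until B_i = max(0, B_{i-1}) + (X_i - r) reaches h."""
    b = 0.0
    for i in range(1, max_obs + 1):
        x = 1 if rng.random() < p else 0  # Bernoulli(p) observation
        b = max(0.0, b) + (x - r)
        if b >= h:
            return i
    return max_obs  # truncated run; not expected with these settings


def simulate_anos(p: float, r: float, h: float,
                  runs: int = 2000, seed: int = 0) -> float:
    """Average the run length over independent simulated runs."""
    rng = random.Random(seed)
    return sum(run_length(p, r, h, rng) for _ in range(runs)) / runs


r = reference_value(0.5, 0.7)
anos_in = simulate_anos(0.5, r, h=3.0)   # in-control: p = p0
anos_out = simulate_anos(0.7, r, h=3.0)  # out-of-control: p = p1
print(anos_in, anos_out)
```

A useful chart gives a large in-control ANOS (few false alarms) and a small out-of-control ANOS (quick detection), so the first printed value should be clearly larger than the second.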
For each type of control chart, the ANOS can be obtained for a range of in-control values $p_0$ and out-of-control values $p_1$. When the Bernoulli CUSUM chart is used, the CD approximation to the ANOS when $p = p_0$ is

$$\mathrm{ANOS}(p_0) \approx \frac{e^{h^* r_2} - h^* r_2 - 1}{|r_2 p_0 - r_1|}.$$

We can find the value of $h^*$ that gives a desired in-control ANOS (average false alarm rate). Then, by using

$$h^* = h + \varepsilon(p_0)\sqrt{p_0 q_0},$$

where $q_0 = 1 - p_0$ and $\varepsilon(p)$ can be approximated by

$$\varepsilon(p) \approx \begin{cases} 0.410 - 0.0842\log p - 0.0391(\log p)^3 - 0.00376(\log p)^4 - 0.000008(\log p)^7, & 0.01 \le p \le 0.5, \\ \dfrac{1}{3}\left(\sqrt{\dfrac{1-p}{p}} - \sqrt{\dfrac{p}{1-p}}\right), & 0 < p < 0.01, \end{cases}$$

we can find the control limit h. The CD approximation to the ANOS when $p = p_1$ is

$$\mathrm{ANOS}(p_1) \approx \frac{e^{-h^* r_2} + h^* r_2 - 1}{|r_2 p_1 - r_1|}.$$

When the SPRT chart is used, let α be the probability of a type I error and β the probability of a type II error. Using

$$h^* \approx \frac{1}{r_2}\ln\frac{1-\beta}{\alpha}, \qquad g \approx \frac{1}{r_2}\ln\frac{\beta}{1-\alpha}$$

and $h^* = h + (1 - 2p_0)/3$, we can find the values of h and g. Since $g = b/r_2$ and $h = a/r_2$, we can get a and b. ANOS($p_0$) and ANOS($p_1$) can then be obtained using $p_0$, $p_1$, $r_1$, $r_2$, g and $h^*$.

The Shewhart p-chart has the advantage of simplicity, but it also has some disadvantages. If the control limits are set three standard deviations from the target value, the false alarm rate will be quite different from that for a normal distribution. The Shewhart p-chart is also not effective for detecting small changes in p. Its performance for detecting small shifts can be improved by using a larger sample size, but it will then not be very effective in detecting large shifts. The Bernoulli CUSUM chart detects shifts in p much faster than the p-chart. The binomial CUSUM chart is a little slower than the Bernoulli CUSUM chart for small shifts in p and considerably slower for very large shifts, since a binomial CUSUM has to wait until the end of a sample to signal. The SPRT chart has much better performance than the p-chart or the CUSUM chart.

4.
Monitoring the Change Frequency

Based on the performance analysis of the control charts and the characteristics of our application, the Bernoulli CUSUM chart or the SPRT chart would be more appropriate for our purpose, since both are good at detecting small shifts. We also need to consider the following questions:

a. how to check the change frequency, either periodically or randomly, and, if periodically, how often;
b. how to choose the sample size, since it could have an important effect on the inspection result;
c. how to choose the out-of-control value $p_1$;
d. how to choose the false alarm rate for the CUSUM chart, so that the control limit h can be determined, or the error probabilities for the SPRT chart, so that the two constants a and b can be determined.

For example, suppose we know that the current change frequency is 0.6931 per day, which means the web page changes 0.6931 times a day on average. With one access per day, 0.6931 = -log(X/n), so this change frequency corresponds to X/n = 0.5. We want to detect a shift of the change frequency to 0.3567, which corresponds to X/n = 0.7. In this case, the in-control value is $p_0 = 0.5$ and the out-of-control value is $p_1 = 0.7$. These values are relatively large compared with those used in quality control.

If we use the Bernoulli CUSUM chart, we can get the reference value r from $p_0$ and $p_1$. Then, given a desired value for ANOS($p_0$), we can find the control limit h. Next, we can get ANOS($p_1$). If the SPRT chart is used, for some desired values of α and β, we can find the values of a and b, and then the approximation of ANOS(p). In order to compare the performance of these control charts, we can adjust the parameters so that the charts give similar values for ANOS($p_0$) and then compare ANOS($p_1$) for different values of $p_0$ and $p_1$.

5.
Conclusion & Future Work

This paper gives a brief review of control charts and some initial ideas for applying them to monitor the change frequency of a web page. However, to show that the control charts are appropriate for this purpose, experiments or simulations should be carried out, and specific data measuring the performance of these control charts should be obtained and compared.

References

[RS2000a] M. Reynolds, Jr. and Z. Stoumbos, Monitoring a Proportion Using CUSUM and SPRT Control Charts, Frontiers in Statistical Quality Control 6, pp. 156-176 (2000)

[RS2000b] M. Reynolds, Jr. and Z. Stoumbos, A General Approach to Modeling CUSUM Charts for a Proportion, IIE Transactions (2000) 32, pp. 515-535

[RS1999] M. Reynolds, Jr. and Z. Stoumbos, A CUSUM Chart for Monitoring a Proportion When Inspecting Continuously, Journal of Quality Technology, Vol. 31, No. 1, Jan 1999

[RS1998] M. Reynolds, Jr. and Z. Stoumbos, The SPRT Chart for Monitoring a Proportion, IIE Transactions (1998) 30, pp. 545-561

[W1997] W. Woodall, Control Charts Based on Attribute Data: Bibliography and Review, Journal of Quality Technology, Vol. 29, No. 2, April 1997

[CGM2000a] Junghoo Cho and Hector Garcia-Molina, Estimating Frequency of Change

[CGM2000b] Junghoo Cho and Hector Garcia-Molina, The Evolution of the Web and Implications for an Incremental Crawler, VLDB 2000, Experience/Application track, 2000.