Intelligent Bayesian Network-Based Approaches for Web Proxy Caching

Prepared by: Waleed Ali Ahmed & Siti Mariyam Shamsuddin
Soft Computing Research Group, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Johor, Malaysia
waleedalodini@gmail.com, mariyam@utm.my

Outline
• Introduction
• Related Works
• The Proposed Intelligent Web Proxy Caching Approaches
• Implementation and Performance Evaluation
• Conclusion and Future Works

Introduction

Background
Web caching is one of the most successful solutions for improving the performance of Web-based systems. It keeps Web objects that are likely to be used in the near future in a location closer to the user.

Why?
• To decrease latency
• To reduce Web server load
• To reduce bandwidth usage

Web proxy caching
In Web proxy caching, popular Web objects that are likely to be revisited in the near future are stored on the proxy server, which sits between users and Web sites and plays a key role in reducing the response time of user requests and saving network bandwidth.

Why Web proxy caching?
• It is the most common caching strategy.
• Proxy caching is widely used by computer network administrators, technology providers, and businesses to reduce user delays and to alleviate Internet congestion (Kaya et al., 2009; Kumar, 2009; Kumar and Norris, 2008).

Problem Statement
• Since the space apportioned to the cache is limited, it must be used judiciously (Romano and ElAarag, 2011).
• The most common Web caching methods are not efficient enough and may suffer from the cache pollution problem (Cobb and ElAarag, 2008; Koskela et al., 2003), which leads to:
• Reduction of the effective cache size
• Low hit ratio
• Wasted bandwidth
• Overload on the origin server

So far, determining which Web objects will be revisited remains a major challenge.

Intelligent Web Proxy Caching

Motivations for using machine learning in Web caching
• Web access logs and trace files are readily available, and this access history can serve as prior knowledge of future accesses.
• An efficient and adaptive scheme is needed, since the Web environment changes rapidly and continuously.
• Recent studies have proposed using artificial neural networks (ANN) in Web proxy caching, although ANN training may take a long time and require extra computational overhead.
• More significantly, the integration of intelligent techniques into Web cache replacement is still under research.

The suggested solutions
• We present new intelligent approaches that rely on the ability of a Bayesian network (BN) to learn from Web proxy log files and predict whether an object will be revisited or not.
• More significantly, the trained BN classifier is incorporated into traditional Web proxy caching algorithms to obtain novel intelligent Web proxy caching approaches.

Bayesian Network (BN)
• A Bayesian network is one of the most popular machine learning models; it relies on probability estimates to find the class of an observed pattern.
• A Bayesian network is defined as a directed acyclic graph over which a probability distribution is defined. Each node in the graph represents a random variable or event, while the arcs (edges) between nodes represent association or causal relationships.
• The probabilistic dependency is maintained by the conditional probability table (CPT) attached to the corresponding node.
• In classification tasks, the decision is computed with a simple rule: the pattern x is assigned to the class c_i that maximizes the product of likelihood and prior,

$$x \in c_i \quad \text{where} \quad i = \arg\max_{r} \{ P(x \mid c_r)\, P(c_r) \}$$

where P(x | c_r) is the probability of finding the pattern x in class c_r, and P(c_r) is the probability of class c_r.

Why BN in Web Caching?
• Bayesian networks are popular supervised learning algorithms, widely used in the medical field and in other applications such as military applications, forecasting, control, modelling for human understanding, cognitive science, statistics, and philosophy.
• Hence, Bayesian networks can be used to produce promising solutions for Web proxy caching.

Related Works

Intelligent Web caching?
• The conventional Web caching methods are not efficient enough (Cobb and ElAarag, 2008; Koskela et al., 2003).
• Therefore, several researchers have proposed incorporating intelligent solutions to cope with the Web caching problem.
• According to Chen (2008), intelligent approaches are more efficient and more adaptive to the Web caching environment than other approaches.

Summary of related works on intelligent Web caching
From the previous studies, we can observe two approaches in intelligent Web caching:
• An intelligent technique is employed in Web caching individually.
• An intelligent technique is employed with the LRU algorithm.
Both approaches may predict Web objects that can be re-accessed. However:
• They do not take into account the cost and size of the predicted objects in the cache replacement decision.
• Some important features are ignored.
• The training process requires a long time and extra computational overhead.

Proposed Approach vs. Existing Approaches

Proposed approach:
• Takes into consideration the most effective factors in the cache replacement decision.
• Depends on BN, which can achieve much better accuracy than BPNN and ANFIS and trains faster.
• Integrates the BN classifier into the GDS algorithm, which takes the cost and size of cached objects into consideration.
• Integrates BN effectively with LRU.

Existing approaches:
• One or more factors are ignored in the cache replacement decision.
• Depend on ANN or ANFIS, whose training may consume a long time and require extra computational overhead.
• An intelligent technique is employed individually or integrated with LRU.

The Proposed Intelligent Web Proxy Caching Approaches

A framework for the proposed approach
The framework consists of two functional components:
• Offline component: it runs only while the proxy server is idle and is responsible for training the BN classifier.
• Online component: the intelligent caching strategies are executed in this part.

Online Component
In the online component, the intelligent caching strategies manage the proxy cache content. We propose intelligent Web proxy caching approaches that integrate BN with traditional Web caching algorithms to provide more effective caching policies:
• Bayesian Network-Greedy-Dual-Size approach (BN-GDS): the BN classifier is integrated with GDS to improve the byte hit ratio of GDS.
• Bayesian Network-Least-Recently-Used approach (BN-LRU): the BN classifier is combined with LRU to form a new algorithm called BN-LRU.
• Bayesian Network-Dynamic Aging approach (BN-DA): the BN classifier is combined with dynamic aging (DA) to form a new algorithm called BN-DA.

1- The intelligent BN-GDS approach
The Greedy-Dual-Size (GDS) caching algorithm was proposed by Cao and Irani (1997). The algorithm assigns a key value K(p) to each object p in the cache, and the object with the lowest key value is replaced:

$$K(p) = L + \frac{C(p)}{S(p)}$$

where C(p) is the cost of bringing object p into the cache, S(p) is the object size, and L is an inflation factor that starts at 0 and is updated to the key value of the last replaced object. If an object is accessed again, its key value is updated using the new L value.

Cherkasova (1998) enhanced the GDS algorithm by incorporating a frequency count; the resulting algorithm is called Greedy-Dual-Size-Frequency (GDSF):

$$K(p) = L + \frac{F(p) \cdot C(p)}{S(p)}$$

where F(p) is the access count of object p.
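The GDS and GDSF key computations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `CachedObject`, `gds_key`, `gdsf_key`, and `evict` are hypothetical and do not come from the original work, and the cost is taken as 1 per object purely for the demonstration.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    cost: float         # C(p): cost of bringing the object into the cache
    size: int           # S(p): object size in bytes
    frequency: int = 1  # F(p): access count, used only by GDSF


def gds_key(inflation: float, obj: CachedObject) -> float:
    """GDS key: K(p) = L + C(p) / S(p)."""
    return inflation + obj.cost / obj.size


def gdsf_key(inflation: float, obj: CachedObject) -> float:
    """GDSF key: K(p) = L + F(p) * C(p) / S(p)."""
    return inflation + obj.frequency * obj.cost / obj.size


def evict(keys: dict) -> tuple:
    """Remove the object with the lowest key.

    Returns the victim's name and its key, which becomes the new
    inflation factor L.
    """
    victim = min(keys, key=keys.get)
    new_inflation = keys.pop(victim)
    return victim, new_inflation


if __name__ == "__main__":
    # Assumed demo values: unit cost, two objects of different size.
    objects = {
        "large": CachedObject(cost=1.0, size=100_000),
        "small": CachedObject(cost=1.0, size=1_000),
    }
    inflation = 0.0
    keys = {name: gds_key(inflation, o) for name, o in objects.items()}
    victim, inflation = evict(keys)
    print("evicted:", victim, "new L:", inflation)
```

With equal costs, the larger object receives the lower key and is evicted first, which is why GDS-style policies tend to favor hit ratio over byte hit ratio.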
One advantage of the GDSF policy is that it performs well in terms of hit ratio. However, its byte hit ratio is too low. Therefore, the BN classifier is integrated with GDS to improve the byte hit ratio; the result is called BN-GDS.

In the proposed BN-GDS, GDS is enhanced by incorporating the accumulated scores (probabilities) W(g) of revisiting object g, as predicted by the BN classifier:

$$K(g) = L + \frac{W(g) \cdot C(g)}{S(g)}$$

This means that the key value of object g is determined not just by its past occurrence frequency, but also by the class predicted from the six input factors of the classifier. The rationale behind BN-GDS is that we can raise the priority of cached objects that, according to the BN classifier, may be revisited in the near future, even if they have not been accessed frequently enough.

2- The intelligent BN-LRU approach
LRU is the most common proxy caching policy. However, LRU suffers from cold cache pollution: a new object is inserted at the top of the cache stack, and if it is never requested again, it takes some time to move down to the bottom of the stack before it is removed. To reduce cache pollution in LRU, the BN classifier is combined with LRU to form a new algorithm called BN-LRU.

The proposed BN-LRU works as follows:
• When a Web object g is requested by a user, the BN classifier predicts whether that object will be revisited or not.
• If g is classified as an object that will be revisited, it is placed at the top of the cache stack.
• Otherwise, g is placed in the middle of the cache stack.
Hence, BN-LRU can remove unwanted objects early to make space for new Web objects.
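The BN-LRU placement rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `BNLRUCache` is hypothetical, the trained BN classifier is stood in for by an arbitrary `predict_revisit` callable, and capacity is counted in objects rather than bytes to keep the example short.

```python
class BNLRUCache:
    """Sketch of an LRU stack where placement depends on a classifier."""

    def __init__(self, capacity, predict_revisit):
        self.capacity = capacity                # assumed: max number of objects
        self.predict_revisit = predict_revisit  # stand-in for the BN classifier
        self.stack = []                         # index 0 = top, last = bottom

    def request(self, obj_id):
        """Process one request and return True on a cache hit."""
        hit = obj_id in self.stack
        if hit:
            self.stack.remove(obj_id)
        elif len(self.stack) >= self.capacity:
            self.stack.pop()                    # evict from the bottom
        if self.predict_revisit(obj_id):
            self.stack.insert(0, obj_id)        # predicted revisit: top
        else:
            middle = len(self.stack) // 2
            self.stack.insert(middle, obj_id)   # otherwise: middle
        return hit


if __name__ == "__main__":
    # Assumed toy predictor: only the object named "hot" will be revisited.
    cache = BNLRUCache(capacity=3, predict_revisit=lambda o: o == "hot")
    for obj in ["x", "hot", "y", "z", "hot"]:
        cache.request(obj)
    print(cache.stack)
```

Objects predicted not to be revisited start halfway down the stack, so they reach the bottom and are evicted sooner than under plain LRU.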
3- The intelligent BN-DA approach
• In addition to frequency, several factors can contribute to predicting whether an object will be revisited in the future.
• The proposed BN-DA approach combines the most significant factors through a Bayesian network (BN) classifier that predicts the probability that a Web object will be revisited later.
• In BN-DA, when a user visits Web object g, the trained BN classifier predicts the probability that g belongs to the class of objects that may be revisited. The probabilities of g are then accumulated as a score W(g) used in the cache replacement decision:

$$K(g) = L + W(g)$$

Implementation and Performance Evaluation

1- Data collection
We obtained proxy log files of Web objects requested at several proxy servers of the IRCache network located around the United States, covering fifteen days (NLANR, 2010). In this study, the proxy log files of 21 August 2010 were used in the training phase, while the proxy log files of the following days were used in the simulation and implementation phase.

| Proxy dataset | Proxy server name | Location | Duration of collection |
|---|---|---|---|
| BO2 | bo.us.ircache.net | Boulder, Colorado | 21/8 – 4/9/2010 |
| SV | sv.us.ircache.net | Silicon Valley, California (FIX-West) | 21/8 – 4/9/2010 |
| SD | sd.us.ircache.net | San Diego, California | 21/8 – 28/8/2010 |
| NY | ny.us.ircache.net | New York, NY | 21/8 – 4/9/2010 |

2- Data pre-processing
Data pre-processing removes irrelevant requests from the log files, since some log entries are invalid or irrelevant. The trace preparation is carried out as follows:
• Parsing: identifying the boundaries between successive fields and records in the log files.
• Filtering: eliminating irrelevant entries, such as uncacheable requests and entries with unsuccessful HTTP status codes.
• Finalizing: removing unnecessary fields. Moreover, each unique URL is converted to a unique integer identifier to reduce simulation time.
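The three preparation steps above (parsing, filtering, finalizing) can be sketched as below. Several things here are assumptions rather than details given in the slides: the input is assumed to be in the Squid native `access.log` layout (timestamp, elapsed, client, action/status, size, method, URL, ident, hierarchy, content type), a request is treated as cacheable only if it is a `GET` without `?` or `cgi-bin` in the URL, and only 2xx status codes count as successful. The function name `preprocess` is hypothetical.

```python
def preprocess(lines):
    """Return (url_id, timestamp, elapsed_ms, size_bytes, content_type) tuples."""
    url_ids = {}   # finalizing: maps each unique URL to an integer identifier
    records = []
    for line in lines:
        # Parsing: split the record into whitespace-separated fields.
        fields = line.split()
        if len(fields) < 10:
            continue  # malformed record
        timestamp, elapsed, _client, action_code, size, method, url = fields[:7]
        content_type = fields[9]

        # Filtering: drop unsuccessful status codes (assumed: keep 2xx only).
        status = action_code.rsplit("/", 1)[-1]
        if not status.startswith("2"):
            continue
        # Filtering: drop uncacheable requests (assumed heuristics).
        if method != "GET" or "?" in url or "cgi-bin" in url:
            continue

        # Finalizing: keep only the needed fields and replace the URL by an ID.
        url_id = url_ids.setdefault(url, len(url_ids) + 1)
        records.append(
            (url_id, float(timestamp), int(elapsed), int(size), content_type)
        )
    return records


if __name__ == "__main__":
    # Made-up sample records in the assumed log layout.
    sample = [
        "1282348905.73 33 10.0.0.1 TCP_MISS/200 33070 GET http://example.com/a.bin - DIRECT/192.0.2.1 application/octet-stream",
        "1282348907.41 703 10.0.0.1 TCP_MISS/404 14179 GET http://example.com/gone.jpg - DIRECT/192.0.2.1 image/jpeg",
        "1282349661.61 31 10.0.0.2 TCP_HIT/200 33070 GET http://example.com/a.bin - NONE/- application/octet-stream",
    ]
    for record in preprocess(sample):
        print(record)
```

Mapping URLs to small integers keeps the simulator's lookups cheap, which is the motivation the slides give for the finalizing step.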
The final format of our data consists of the URL ID, timestamp, elapsed time, size, and type of each Web object:

| URL ID | Timestamp | Elapsed time (milliseconds) | Size (bytes) | Type |
|---|---|---|---|---|
| 1 | 1282348905.73 | 33 | 33070 | application/octet-stream |
| 2 | 1282348907.41 | 703 | 14179 | image/jpeg |
| 3 | 1282348908.47 | 284 | 1276 | image/jpeg |
| 4 | 1282349578.75 | 154 | 24612 | text/html |
| 1 | 1282349661.61 | 31 | 33070 | application/octet-stream |
| 5 | 1282349675.35 | 203 | 5592 | text/html |
| 6 | 1282349688.90 | 231 | 34796 | text/html |
| 4 | 1282349753.72 | 375 | 24612 | text/html |
| 4 | 1282350464.01 | 133 | 24612 | text/html |
| 1 | 1282351887.76 | 135 | 33070 | application/octet-stream |
| 4 | 1282352609.09 | 55 | 24612 | text/html |
| 1 | 1282352861.56 | 111 | 33070 | application/octet-stream |

3- Training phase
The training pattern takes the following format:

| Input | Meaning |
|---|---|
| x1 | Recency of the Web object, based on the sliding window |
| x2 | Frequency of the Web object |
| x3 | Frequency of the Web object, based on the sliding window |
| x4 | Retrieval time of the Web object |
| x5 | Size of the Web object |
| x6 | Type of the Web object |

Preparation of the dataset for Web object classification (inputs x1 to x6, followed by the target class):

| Recency | Frequency | SWL frequency | Retrieval time | Size | Type | Target |
|---|---|---|---|---|---|---|
| 1800 | 1 | 1 | 33 | 33070 | 5 | 1 |
| 1800 | 1 | 1 | 703 | 14179 | 2 | 0 |
| 1800 | 1 | 1 | 284 | 1276 | 2 | 0 |
| 1800 | 1 | 1 | 154 | 24612 | 1 | 1 |
| 1800 | 2 | 2 | 31 | 33070 | 5 | 0 |
| 1800 | 1 | 1 | 203 | 5592 | 1 | 0 |
| 1800 | 1 | 1 | 231 | 34796 | 1 | 0 |
| 1800 | 2 | 2 | 375 | 24612 | 1 | 1 |
| 1800 | 3 | 3 | 133 | 24612 | 1 | 0 |
| 2226.15 | 3 | 1 | 135 | 33070 | 5 | 1 |
| 2145.08 | 4 | 1 | 55 | 24612 | 1 | 0 |
| 1800 | 4 | 2 | 111 | 33070 | 5 | 0 |

Each proxy dataset is then divided randomly into training data (70%) and testing data (30%). Subsequently, the dataset is discretized using the MDL method suggested by Fayyad and Irani (1993) with the default setup in WEKA. Finally, the Bayesian network (BN) is trained using WEKA as well. In WEKA, the BN algorithm is available in the Java class "weka.classifiers.bayes.BayesNet". The default parameter values and settings predefined in WEKA are used in BN training.

4- Performance evaluation
We have modified the WebTraff simulator (Markatchev and Williamson, 2002) to accommodate our proposed proxy caching approaches.
The trained classifiers are integrated with the WebTraff simulator to simulate the proposed intelligent Web proxy caching approaches.

Two common measures are used to analyze efficiency:

Hit Ratio (HR)
$$\text{HR} = \frac{\text{Number of objects acquired from the cache}}{\text{Total number of objects requested}}$$

Byte Hit Ratio (BHR)
$$\text{BHR} = \frac{\text{Number of bytes acquired from the cache}}{\text{Total number of bytes requested}}$$

Analysis of IRCache traces

| | BO2 | NY | SD | UC | SV |
|---|---|---|---|---|---|
| #Total requests | 1210693 | 3248452 | 8891764 | 2496001 | 29871204 |
| #Cacheable requests | 594989 | 1518232 | 2827904 | 1194098 | 6059349 |
| #Cacheable bytes | 23204930341 | 68402036319 | 469362584083 | 48043794224 | 230326816876 |
| #Unique requests | 530192 | 1144885 | 2402406 | 1012355 | 5284441 |
| Total size of unique requests (bytes) | 18690093450 | 56147903761 | 156538171752 | 38364029432 | 190539902251 |
| #Hits | 64797 | 373347 | 425498 | 181743 | 774908 |
| #Byte hits | 4514836891 | 12254132558 | 312824412331 | 9679764792 | 39786914625 |
| Max HR (%) | 10.89 | 24.59 | 15.05 | 15.22 | 12.79 |
| Max BHR (%) | 19.46 | 17.91 | 66.65 | 20.15 | 17.27 |

[Figure: Impact of cache size on HR for different proxy datasets. Panels: (a) BO2, (b) NY]

In terms of hit ratio (HR):
• BN-GDS achieves the best HR among all algorithms, while LRU achieves the worst.
• BN-GDS and BN-LRU improve the HR of GDS and LRU, respectively.
• Although the HR of BN-DA is worse than that of GDS and GDSF, it is better than that of NNPCR-2, BN-LRU, and LRU.

[Figure: Impact of cache size on BHR for different proxy datasets. Panels: (a) BO2, (b) NY]

In terms of byte hit ratio (BHR):
• BN-LRU and BN-DA achieve the best BHR among all algorithms, while GDS and GDSF attain the worst.
• The BHR of LRU is better than that of BN-GDS, GDS, and GDSF.
• BN-GDS significantly improves on the BHR of GDS and GDSF.
• BN-LRU and BN-DA have better BHR than LRU and NNPCR-2.
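The HR and BHR measures used in this evaluation can be computed as in the sketch below. The function names are hypothetical. The helper `infinite_cache_ratios` assumes an unbounded cache, so that every request after the first one for an object is a hit; this assumption matches the trace analysis figures, where the number of hits equals the cacheable requests minus the unique requests.

```python
def hit_ratio(hits, requests):
    """HR (%): objects served from the cache over objects requested."""
    return 100.0 * hits / requests


def byte_hit_ratio(hit_bytes, requested_bytes):
    """BHR (%): bytes served from the cache over bytes requested."""
    return 100.0 * hit_bytes / requested_bytes


def infinite_cache_ratios(requests):
    """Upper bounds on HR and BHR for a trace of (url_id, size) pairs.

    Assumes an unbounded cache: only the first request for each
    object is a miss.
    """
    seen = set()
    hits = hit_bytes = total_bytes = 0
    for url_id, size in requests:
        total_bytes += size
        if url_id in seen:
            hits += 1
            hit_bytes += size
        else:
            seen.add(url_id)
    return hit_ratio(hits, len(requests)), byte_hit_ratio(hit_bytes, total_bytes)


if __name__ == "__main__":
    # BO2 figures from the trace analysis: hits / cacheable requests,
    # byte hits / cacheable bytes.
    print(round(hit_ratio(64797, 594989), 2))                # 10.89
    print(round(byte_hit_ratio(4514836891, 23204930341), 2)) # 19.46
```

Applied to the BO2 counts, the two functions reproduce the reported maximum HR and BHR for that trace.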
Conclusion
• This study has proposed three intelligent Web proxy caching approaches, called BN-GDS, BN-LRU, and BN-DA, for improving the performance of conventional Web proxy caching algorithms.
• The BN classifier learns from Web proxy log files to predict whether objects will be revisited or not.
• The trained classifier is integrated with conventional Web proxy caching algorithms to provide more effective proxy caching policies.
• The simulation results show that BN-GDS achieved the best HR, a better BHR than GDS and GDSF, and an acceptable BHR compared with BN-LRU and BN-DA, which achieved the best BHR. This means that BN-GDS struck a better balance between HR and BHR than the other algorithms.
• On the other hand, BN-LRU and BN-DA achieved the best BHR among all algorithms, and a better HR than LRU and NNPCR-2.

Future works
In the future:
• Other intelligent classifiers can be used to improve the performance of traditional Web caching policies.
• Clustering algorithms can be used to enhance the performance of Web caching policies.

References
Kaya, C.C., Zhang, G., Tan, Y., & Mookerjee, V.S. 2009. An admission-control technique for delay reduction in proxy caching. Decision Support Systems, 46, 594-603.
Kumar, C. 2009. Performance evaluation for implementations of a network of proxy caches. Decision Support Systems, 46, 492-500.
Kumar, C., & Norris, J.B. 2008. A new approach for a proxy-level web caching mechanism. Decision Support Systems, 46, 52-60.
Romano, S., & ElAarag, H. 2011. A neural network proxy cache replacement strategy and its implementation in the Squid proxy server. Neural Computing & Applications, 20, 59-78.
Cobb, J., & ElAarag, H. 2008. Web proxy cache replacement scheme based on backpropagation neural network. Journal of Systems and Software, 81, 1539-1558.
Koskela, T., Heikkonen, J., & Kaski, K. 2003. Web cache optimization with nonlinear model using object features. Computer Networks, 43, 805-817.
Chen, H.T. 2008. Pre-fetching and Re-fetching in Web Caching Systems: Algorithms and Simulation. Trent University, Peterborough, Ontario, Canada.
Cao, P., & Irani, S. 1997. Cost-Aware WWW Proxy Caching Algorithms. In Proceedings of the 1997 USENIX Symposium on Internet Technology and Systems, Monterey, CA.
Cherkasova, L. 1998. Improving WWW Proxies Performance with Greedy-Dual-Size-Frequency Caching Policy. HP Technical Report, Palo Alto.
NLANR. 2010. National Lab of Applied Network Research (NLANR). Sanitized access logs. Available at http://www.ircache.net/.
Fayyad, U.M., & Irani, K.B. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. 13th International Joint Conference on Artificial Intelligence (IJCAI93), pp. 1022-1027.
Markatchev, N., & Williamson, C. 2002. WebTraff: A GUI for Web Proxy Cache Workload Modeling and Analysis. Proceedings of the 10th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, p. 356.