Intelligent Bayesian Network-Based Approaches for Web Proxy Caching
Prepared By :
Waleed Ali Ahmed & Siti Mariyam Shamsuddin
Soft Computing Research Group, Faculty of Computer
Science and Information Systems,
Universiti Teknologi Malaysia, 81310 Johor, Malaysia
waleedalodini@gmail.com, mariyam@utm.my
Outline
Introduction
Related Works
The Proposed Intelligent Web Proxy Caching Approaches
Implementation and Performance Evaluation
Conclusion and Future Works
Introduction
Background
• Web caching is a well-known strategy for improving the performance of Web-based systems: Web objects that are likely to be requested in the near future are kept in a location closer to the user.
• Why?
  – To decrease latency
  – To reduce Web server load
  – To reduce bandwidth usage
Web proxy caching
In Web proxy caching, popular Web objects that are likely to be revisited in the near future are stored on the proxy server, which sits between users and Web sites.
Why Web proxy caching?
• Because the proxy server sits between users and Web sites, serving objects from its cache reduces the response time of user requests and saves network bandwidth.
• It is the most common caching strategy: proxy caching is widely used by computer network administrators, technology providers, and businesses to reduce user delays and alleviate Internet congestion (Kaya et al., 2009; Kumar, 2009; Kumar and Norris, 2008).
Problem Statement
• Since the space apportioned to the cache is limited, it must be utilized judiciously (Romano and ElAarag, 2011).
• The most common Web caching methods are not efficient enough and may suffer from the cache pollution problem (Cobb and ElAarag, 2008; Koskela et al., 2003), resulting in:
  – Reduction of the effective cache size
  – Low hit ratio
  – Wasted bandwidth
  – Overload on the origin server
• Determining which Web objects will be re-visited remains a major challenge.
Intelligent Web Proxy Caching
• Motivations for using machine learning in Web caching:
  – Availability of Web access logs and trace files: the history of accesses can be regarded as complete prior knowledge of future accesses.
  – The need for an efficient and adaptive scheme, since the Web environment changes rapidly and continuously.
• Recent studies have proposed using artificial neural networks (ANN) in Web proxy caching, although ANN training may take a long time and require extra computational overhead.
• More significantly, the integration of intelligent techniques into Web cache replacement is still under research.
The suggested solutions
We present new intelligent approaches that rely on the capability of a Bayesian network (BN) to learn from Web proxy log files and predict whether or not an object will be re-visited.
More significantly, the trained BN classifier is incorporated into traditional Web proxy caching algorithms to form novel intelligent Web proxy caching approaches.
Bayesian Network (BN)
A Bayesian network is one of the most popular machine learning models; it relies on probability estimates to find the class of an observed pattern.
Rationale:
– A Bayesian network (BN) is a directed acyclic graph over which a probability distribution is defined. Each node in the graph represents a random variable or event, while the arcs (edges) between the nodes represent association or causal relationships.
– The probabilistic dependency is maintained by the conditional probability table (CPT), which is attached to the corresponding event.
In classification tasks:
– The classification decision is calculated simply using the formula
$$x \in c_i \quad \text{if} \quad P(x \mid c_i)\,P(c_i) = \max_{r} \{ P(x \mid c_r)\,P(c_r) \}$$
where $P(x \mid c_r)$ is the probability of finding the pattern $x$ in class $c_r$, and $P(c_r)$ is the probability of class $c_r$.
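The decision rule can be illustrated with a minimal Python sketch. It assumes the class-conditional probabilities P(x | c_r) are already available as plain numbers; in a real BN they would be derived from the CPTs. The class names and all numeric values are hypothetical.

```python
def posteriors(likelihoods, priors):
    """Normalise P(x | c_r) * P(c_r) over all classes c_r."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


def classify(likelihoods, priors):
    """Return the class with the largest P(x | c_r) * P(c_r)."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Hypothetical values for a single observed pattern x.
priors = {"revisited": 0.3, "not_revisited": 0.7}
likelihoods = {"revisited": 0.20, "not_revisited": 0.05}

label = classify(likelihoods, priors)
p_revisit = posteriors(likelihoods, priors)["revisited"]
print(label, round(p_revisit, 3))  # revisited 0.632
```

The normalised value `p_revisit` is the kind of probability that a caching policy can later use as a score for the requested object.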
Why BN in Web Caching?
• Bayesian networks are popular supervised learning algorithms that are widely used in the medical field and in other areas such as military applications, forecasting, control, modelling for human understanding, cognitive science, statistics, and philosophy.
• Hence, Bayesian networks can be utilized to produce promising solutions for Web proxy caching.
Related Works
Intelligent Web Caching?
• The conventional Web caching methods are not efficient enough (Cobb and ElAarag, 2008; Koskela et al., 2003).
• Therefore, several researchers have proposed incorporating intelligent solutions to cope with the Web caching problem.
• According to Chen (2008), intelligent approaches are more efficient and more adaptive to the Web caching environment than other approaches.
Related works on intelligent web caching
Summary of intelligent web caching
• From the previous studies, we can observe two approaches in intelligent Web caching:
  – An intelligent technique is employed in Web caching on its own.
  – An intelligent technique is employed together with the LRU algorithm.
• Both approaches may predict which Web objects will be re-accessed. However:
  – They do not take into account the cost and size of the predicted objects in the cache replacement decision.
  – Some important features are ignored.
  – The training process requires a long time and extra computational overhead.
Proposed Approach vs. Existing Approaches

| Proposed approach | Existing approaches |
|---|---|
| Takes into consideration the most effective factors in the cache replacement decision | One or more factors are ignored in the cache replacement decision |
| Depends on BN, which can achieve much better accuracy and is faster than BPNN and ANFIS | Depend on ANN or ANFIS, whose training may take a long time and require extra computational overhead |
| Integrates the BN classifier into the GDS algorithm, which takes the cost and size of cached objects into consideration | --- |
| BN is effectively integrated with LRU | An intelligent technique is employed individually or integrated with LRU |
The Proposed Intelligent Web Proxy Caching Approaches
The operational framework for the proposed approach
A Framework for the proposed approach
The framework consists of two functional components:
• Offline component: it runs only while the proxy server is idle, and is responsible for training the BN classifier.
• Online component: the intelligent caching strategies are executed in this part.
Online Component
• In the online component, the intelligent caching strategies manage the proxy cache content.
• We propose intelligent Web proxy caching approaches that integrate BN with traditional Web caching algorithms to provide more effective caching policies:
  – Bayesian Network-Greedy-Dual-Size approach (BN-GDS): the BN classifier is integrated with GDS to improve the byte hit ratio of GDS.
  – Bayesian Network-Least-Recently-Used approach (BN-LRU): the BN classifier is combined with LRU to form a new algorithm called BN-LRU.
  – Bayesian Network-Dynamic-Aging approach (BN-DA): the BN classifier is combined with dynamic aging (DA) to form a new algorithm called BN-DA.
1- The intelligent BN-GDS approach
• The Greedy-Dual-Size (GDS) caching algorithm was proposed by Cao and Irani (1997). The algorithm assigns a key value K(p) to each object p in the cache, and the object with the lowest key value is replaced:
$$K(p) = L + \frac{C(p)}{S(p)}$$
where C(p) is the cost of bringing object p into the cache, S(p) is the object size, and L is an inflation factor that starts at 0 and is updated to the key value of the last replaced object.
If an object is accessed again, its key value is updated using the new L value.
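The GDS rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the simulator code used in the study: the class name `GDSCache`, the uniform cost C(p) = 1, and the linear scan for the minimum key are all assumptions made for brevity.

```python
class GDSCache:
    """Greedy-Dual-Size cache: evict the object with the lowest key K(p)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.inflation = 0.0   # L in the formula; starts at 0
        self.entries = {}      # object id -> (key, size)

    def key(self, cost, size):
        return self.inflation + cost / size

    def access(self, obj, size, cost=1.0):
        """Process one request; return True on a hit, False on a miss."""
        if obj in self.entries:
            # Hit: recompute the key with the current (possibly larger) L.
            self.entries[obj] = (self.key(cost, size), size)
            return True
        if size > self.capacity:
            return False       # the object can never fit, so it is not cached
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda o: self.entries[o][0])
            victim_key, victim_size = self.entries.pop(victim)
            self.used -= victim_size
            self.inflation = victim_key   # L := key of the replaced object
        self.entries[obj] = (self.key(cost, size), size)
        self.used += size
        return False


cache = GDSCache(capacity_bytes=100)
results = [cache.access("a", 50), cache.access("b", 40),
           cache.access("c", 30), cache.access("b", 40)]
print(results, sorted(cache.entries))
```

In this run, requesting "c" forces the eviction of "a" (the lowest key, 1/50), so L rises to 0.02 and every later key is offset by that amount.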
• Cherkasova (1998) enhanced the GDS algorithm by incorporating a frequency count; the resulting algorithm is called Greedy-Dual-Size-Frequency (GDSF):
$$K(p) = L + \frac{F(p) \cdot C(p)}{S(p)}$$
where F(p) is the access count of object p.
• GDSF performs well in terms of the hit ratio. However, the byte hit ratio of GDSF is too low.
• Therefore, the BN classifier is integrated with GDS to improve performance in terms of the byte hit ratio; the result is called BN-GDS.
• In the proposed BN-GDS, GDS is enhanced by incorporating the accumulated scores (probabilities) W(g) of revisiting object g, as estimated by the BN classifier:
$$K(g) = L + \frac{W(g) \cdot C(g)}{S(g)}$$
• This means that the key value of object g is determined not just by its past occurrence frequency, but also by the class predicted from the six factors.
• The rationale behind the proposed BN-GDS approach is that we can raise the priority of those cached objects that, according to the BN classifier, may be revisited in the near future, even if they have not been accessed frequently enough.
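A small numeric sketch may help show how the BN score changes the eviction order relative to GDSF. It assumes that W(g) is the running sum of the revisit probabilities predicted at each request of g, which is one reading of "accumulative scores"; the probabilities, sizes and object names are hypothetical.

```python
def gdsf_key(inflation, freq, cost, size):
    """GDSF: K(p) = L + F(p) * C(p) / S(p)."""
    return inflation + freq * cost / size


def bn_gds_key(inflation, w, cost, size):
    """BN-GDS: K(g) = L + W(g) * C(g) / S(g)."""
    return inflation + w * cost / size


class ScoreAccumulator:
    """Accumulates the predicted revisit probability of each object into W(g)."""

    def __init__(self):
        self.w = {}

    def update(self, obj, p_revisit):
        self.w[obj] = self.w.get(obj, 0.0) + p_revisit
        return self.w[obj]


scores = ScoreAccumulator()
# Object g: requested twice, the classifier is confident it will return.
for p in (0.9, 0.8):
    scores.update("g", p)
# Object h: requested three times, but predicted unlikely to return.
for p in (0.1, 0.2, 0.1):
    scores.update("h", p)

L, cost, size = 0.0, 1.0, 1000
gdsf = {"g": gdsf_key(L, 2, cost, size), "h": gdsf_key(L, 3, cost, size)}
bn_gds = {o: bn_gds_key(L, scores.w[o], cost, size) for o in ("g", "h")}

print(min(gdsf, key=gdsf.get))      # g is evicted first under GDSF
print(min(bn_gds, key=bn_gds.get))  # h is evicted first under BN-GDS
```

With equal sizes and costs, GDSF keeps the more frequently requested object, while BN-GDS keeps the object that the classifier expects to be revisited.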
2- The intelligent BN-LRU approach
• LRU is the most common proxy caching policy. However, LRU suffers from cold cache pollution: a new object is inserted at the top of the cache stack, and if it is never requested again it still takes some time to move down to the bottom of the stack before it is removed.
• To reduce cache pollution in LRU, the BN classifier is combined with LRU to form a new algorithm called BN-LRU.
• The proposed BN-LRU works as follows. When a Web object g is requested by a user, the BN classifier predicts whether or not the object will be revisited.
  – If the BN classifies g as an object that will be revisited, g is placed at the top of the cache stack.
  – Otherwise, g is placed in the middle of the cache stack.
• Hence, BN-LRU can efficiently remove unwanted objects early to make space for new Web objects.
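The placement rule above can be sketched as follows. For brevity the sketch counts objects rather than bytes, and the trained BN classifier is replaced by a hypothetical callable `will_revisit`; both simplifications are assumptions, as is the class name `BNLRUCache`.

```python
class BNLRUCache:
    """LRU stack in which objects predicted as 'not revisited' enter at the middle."""

    def __init__(self, max_objects, will_revisit):
        self.max_objects = max_objects
        self.will_revisit = will_revisit   # stand-in for the trained BN classifier
        self.stack = []                    # index 0 = top, last index = bottom

    def access(self, obj):
        """Process one request; return True on a hit, False on a miss."""
        hit = obj in self.stack
        if hit:
            self.stack.remove(obj)
        elif len(self.stack) >= self.max_objects:
            self.stack.pop()               # evict the object at the bottom
        position = 0 if self.will_revisit(obj) else len(self.stack) // 2
        self.stack.insert(position, obj)
        return hit


# Hypothetical classifier: objects whose name starts with "hot" will be revisited.
cache = BNLRUCache(max_objects=4, will_revisit=lambda o: o.startswith("hot"))
for name in ("hot1", "hot2", "hot3", "hot4", "cold1"):
    cache.access(name)
print(cache.stack)  # ['hot4', 'cold1', 'hot3', 'hot2']
```

Under plain LRU, "cold1" would sit at the top of the stack; here it starts half-way down, so it reaches the bottom and is evicted sooner if it is never requested again.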
3- The intelligent BN-DA approach
• In addition to frequency, several factors can contribute to predicting whether an object will be revisited in the future.
• The proposed BN-DA approach combines the most significant factors using the Bayesian network (BN) classifier to predict the probability that Web objects will be re-visited later.
• In the proposed BN-DA approach, when a user visits Web object g, the trained BN classifier predicts the probability that g belongs to the class of objects that may be revisited. These probabilities are then accumulated as the score W(g) used in the cache replacement decision:
$$K(g) = L + W(g)$$
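The aging effect of L in this key can be seen in a short worked example. It assumes that L is updated as in GDS (set to the key of the last replaced object); the scores and the value of L are hypothetical.

```python
def bn_da_key(inflation, w):
    """BN-DA: K(g) = L + W(g)."""
    return inflation + w


# "old" was last requested when L was still 0, with accumulated score W = 2.5.
old_key = bn_da_key(0.0, 2.5)

# Later evictions have raised L to 2.0 (hypothetical value). A newly requested
# object with a single predicted revisit probability of 0.6 now gets:
new_key = bn_da_key(2.0, 0.6)

evict_first = "old" if old_key < new_key else "new"
print(old_key, new_key, evict_first)
```

Although "old" accumulated a much higher score, its key was never refreshed, so the inflated L lets the new object outrank it and the stale object is replaced first.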
Implementation and Performance Evaluation
1-Data collection
We obtained proxy log files of Web objects requested at several proxy servers of the IRCache network, located around the United States, over fifteen days (NLANR, 2010).
In this study, the proxy log files of 21 August 2010 were used in the training phase, while the proxy log files of the following days were used in the simulation and implementation phase.
| Proxy dataset | Proxy server name | Location | Duration of collection |
|---|---|---|---|
| BO2 | bo.us.ircache.net | Boulder, Colorado | 21/8 – 4/9/2010 |
| SV | sv.us.ircache.net | Silicon Valley, California (FIX-West) | 21/8 – 4/9/2010 |
| SD | sd.us.ircache.net | San Diego, California | 21/8 – 28/8/2010 |
| NY | ny.us.ircache.net | New York, NY | 21/8 – 4/9/2010 |
2-Data Pre-processing
Data pre-processing removes irrelevant requests from the log files, since some log entries are invalid or irrelevant.
The trace preparation is carried out as follows:
• Parsing: identifying the boundaries between successive fields and records in the log files.
• Filtering: eliminating irrelevant entries, such as uncacheable requests and entries with unsuccessful HTTP status codes.
• Finalizing: removing unnecessary fields. Moreover, each unique URL is converted to a unique integer identifier to reduce simulation time.
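The three steps can be sketched in Python as below. The sketch assumes the logs use the Squid native access log layout (timestamp, elapsed time, client, action/status, size, method, URL, ident, hierarchy, content type). The filtering rules (GET only, status 200, no query string or cgi-bin) are a common convention, not necessarily the exact rules used in this study, and the sample log lines are made up.

```python
def prepare_trace(lines):
    """Parse, filter and finalize proxy log lines into compact records."""
    url_ids = {}
    trace = []
    for line in lines:
        # Parsing: split the record into whitespace-separated fields.
        fields = line.split()
        if len(fields) < 10:
            continue
        timestamp, elapsed, _client, action, size, method, url = fields[:7]
        content_type = fields[9]
        # Filtering: drop unsuccessful and (assumed) uncacheable requests.
        status = action.split("/")[-1]
        if method != "GET" or status != "200":
            continue
        if "?" in url or "cgi-bin" in url:
            continue
        # Finalizing: keep only the needed fields, map each URL to an integer.
        url_id = url_ids.setdefault(url, len(url_ids) + 1)
        trace.append((url_id, float(timestamp), int(elapsed), int(size), content_type))
    return trace


sample = [
    "1282348905.730 33 10.0.0.1 TCP_MISS/200 33070 GET http://example.com/a.bin - DIRECT/192.0.2.1 application/octet-stream",
    "1282348907.410 703 10.0.0.2 TCP_MISS/200 14179 GET http://example.com/b.jpg - DIRECT/192.0.2.1 image/jpeg",
    "1282348908.000 12 10.0.0.2 TCP_MISS/404 512 GET http://example.com/missing - DIRECT/192.0.2.1 text/html",
    "1282348909.000 50 10.0.0.3 TCP_MISS/200 900 GET http://example.com/search?q=x - DIRECT/192.0.2.1 text/html",
    "1282349661.610 31 10.0.0.1 TCP_HIT/200 33070 GET http://example.com/a.bin - NONE/- application/octet-stream",
]
trace = prepare_trace(sample)
for record in trace:
    print(record)
```

Only three of the five sample lines survive: the 404 response and the query-string request are filtered out, and the repeated URL is mapped to the same identifier as its first occurrence.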
The final format of our data consists of URL ID, timestamp, elapsed time, size, and type of Web object:

| URL_ID | Timestamp | Elapsed Time (milliseconds) | Size (bytes) | Type |
|---|---|---|---|---|
| 1 | 1282348905.73 | 33 | 33070 | application/octet-stream |
| 2 | 1282348907.41 | 703 | 14179 | image/jpeg |
| 3 | 1282348908.47 | 284 | 1276 | image/jpeg |
| 4 | 1282349578.75 | 154 | 24612 | text/html |
| 1 | 1282349661.61 | 31 | 33070 | application/octet-stream |
| 5 | 1282349675.35 | 203 | 5592 | text/html |
| 6 | 1282349688.90 | 231 | 34796 | text/html |
| 4 | 1282349753.72 | 375 | 24612 | text/html |
| 4 | 1282350464.01 | 133 | 24612 | text/html |
| 1 | 1282351887.76 | 135 | 33070 | application/octet-stream |
| 4 | 1282352609.09 | 55 | 24612 | text/html |
| 1 | 1282352861.56 | 111 | 33070 | application/octet-stream |
3-Training Phase
• The training pattern takes the format:

| Input | Meaning |
|---|---|
| x1 | Recency of Web object based on sliding window |
| x2 | Frequency of Web object |
| x3 | Frequency of Web object based on sliding window |
| x4 | Retrieval time of Web object |
| x5 | Size of Web object |
| x6 | Type of Web object |
Preparation of dataset for Web object classification

| Recency | Frequency | SWL Frequency | Retrieval Time | Size | Type | Target |
|---|---|---|---|---|---|---|
| 1800 | 1 | 1 | 33 | 33070 | 5 | 1 |
| 1800 | 1 | 1 | 703 | 14179 | 2 | 0 |
| 1800 | 1 | 1 | 284 | 1276 | 2 | 0 |
| 1800 | 1 | 1 | 154 | 24612 | 1 | 1 |
| 1800 | 2 | 2 | 31 | 33070 | 5 | 0 |
| 1800 | 1 | 1 | 203 | 5592 | 1 | 0 |
| 1800 | 1 | 1 | 231 | 34796 | 1 | 0 |
| 1800 | 2 | 2 | 375 | 24612 | 1 | 1 |
| 1800 | 3 | 3 | 133 | 24612 | 1 | 0 |
| 2226.15 | 3 | 1 | 135 | 33070 | 5 | 1 |
| 2145.08 | 4 | 1 | 55 | 24612 | 1 | 0 |
| 1800 | 4 | 2 | 111 | 33070 | 5 | 0 |
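The example rows can be reproduced from the pre-processed trace with the sketch below. The rules are inferred from the example values rather than stated on the slides, so they should be read as assumptions: the sliding window length (SWL) is 1800 seconds; recency is max(SWL, time since the previous request), or SWL for a first request; the SWL frequency is incremented when the gap is within the window and reset to 1 otherwise; and the target is 1 when the same object is requested again within SWL seconds.

```python
SWL = 1800.0  # sliding window length in seconds, assumed from the example rows


def build_patterns(trace, swl=SWL):
    """trace: time-ordered (url_id, timestamp, elapsed_ms, size, type_code) tuples."""
    last_seen, freq, swl_freq, last_row = {}, {}, {}, {}
    rows = []
    for url_id, ts, elapsed, size, type_code in trace:
        if url_id in last_seen:
            gap = ts - last_seen[url_id]
            recency = max(swl, gap)
            freq[url_id] += 1
            swl_freq[url_id] = swl_freq[url_id] + 1 if gap <= swl else 1
            if gap <= swl:
                # The previous request of this object was followed by a
                # re-request inside the window, so its target becomes 1.
                rows[last_row[url_id]][-1] = 1
        else:
            recency = swl
            freq[url_id] = swl_freq[url_id] = 1
        last_seen[url_id] = ts
        last_row[url_id] = len(rows)
        rows.append([round(recency, 2), freq[url_id], swl_freq[url_id],
                     elapsed, size, type_code, 0])
    return rows


# The twelve requests of the pre-processed sample, with the type already
# encoded as in the example rows (1 = text, 2 = image, 5 = application).
trace = [
    (1, 1282348905.73, 33, 33070, 5), (2, 1282348907.41, 703, 14179, 2),
    (3, 1282348908.47, 284, 1276, 2), (4, 1282349578.75, 154, 24612, 1),
    (1, 1282349661.61, 31, 33070, 5), (5, 1282349675.35, 203, 5592, 1),
    (6, 1282349688.90, 231, 34796, 1), (4, 1282349753.72, 375, 24612, 1),
    (4, 1282350464.01, 133, 24612, 1), (1, 1282351887.76, 135, 33070, 5),
    (4, 1282352609.09, 55, 24612, 1), (1, 1282352861.56, 111, 33070, 5),
]
rows = build_patterns(trace)
for row in rows:
    print(row)
```

Under these assumed rules, the tenth request yields recency 2226.15 and SWL frequency 1, because the previous request of the same object fell outside the 1800-second window.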
• Each proxy dataset is then divided randomly into training data (70%) and testing data (30%).
• Subsequently, the dataset is discretized using the MDL method suggested by Fayyad & Irani (1993), with the default setup in WEKA.
• Finally, the Bayesian network (BN) is trained using WEKA as well. In WEKA, the BN algorithm is available in the Java class “weka.classifiers.bayes.BayesNet”. The default parameter values and settings predefined in WEKA are used in BN training.
4-Performance Evaluation
• We modified the WebTraff simulator (Markatchev and Williamson, 2002) to support our proposed proxy caching approaches.
• The trained classifiers are integrated with the WebTraff simulator to simulate the proposed intelligent Web proxy caching approaches.
• Two common measures are used to analyze the efficiency:
Hit Ratio (HR)
$$\text{Hit Ratio} = \frac{\text{Number of objects acquired from the cache}}{\text{Total number of objects requested}}$$
Byte Hit Ratio (BHR)
$$\text{Byte Hit Ratio} = \frac{\text{Number of bytes acquired from the cache}}{\text{Total number of bytes requested}}$$
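As a quick illustration of the two measures, the following sketch computes both from a list of request outcomes. The function name and the four sample requests are hypothetical.

```python
def hit_ratios(requests):
    """requests: iterable of (was_hit, size_bytes) pairs; returns (HR, BHR)."""
    total = hits = total_bytes = hit_bytes = 0
    for was_hit, size in requests:
        total += 1
        total_bytes += size
        if was_hit:
            hits += 1
            hit_bytes += size
    return hits / total, hit_bytes / total_bytes


# Two small objects served from the cache, one large object missed.
outcomes = [(False, 1000), (True, 1000), (False, 50000), (True, 200)]
hr, bhr = hit_ratios(outcomes)
print(f"HR = {hr:.2%}, BHR = {bhr:.2%}")
```

Half of the requests are hits, yet only about 2% of the bytes come from the cache, which is why a policy that favours small objects can score well on HR and poorly on BHR.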
Analysis of IRCache traces

| Measure | BO2 | NY | SD | UC | SV |
|---|---|---|---|---|---|
| #Cacheable requests | 594989 | 1518232 | 2827904 | 1194098 | 6059349 |
| #Cacheable bytes | 23204930341 | 68402036319 | 469362584083 | 48043794224 | 230326816876 |
| #Unique requests | 530192 | 1144885 | 2402406 | 1012355 | 5284441 |
| Total size of unique requests (bytes) | 18690093450 | 56147903761 | 156538171752 | 38364029432 | 190539902251 |
| #Hits | 64797 | 373347 | 425498 | 181743 | 774908 |
| #Byte Hits | 4514836891 | 12254132558 | 312824412331 | 9679764792 | 39786914625 |
| Max HR (%) | 10.89 | 24.59 | 15.05 | 15.22 | 12.79 |
| Max BHR (%) | 19.46 | 17.91 | 66.65 | 20.15 | 17.27 |

#Total requests (values in source order; their column alignment is not recoverable): 8891764, 2496001, 29871204, 1210693, 3248452
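The Max HR and Max BHR rows appear to follow directly from the counts: every cacheable request that is not the first request of its object counts as a hit, so #Hits = #Cacheable requests - #Unique requests. The sketch below checks this for the first two value columns, under the assumption that they belong to BO2 and NY as the header order suggests.

```python
traces = {
    "BO2": dict(cacheable=594989, unique=530192, hits=64797,
                cacheable_bytes=23204930341, byte_hits=4514836891),
    "NY": dict(cacheable=1518232, unique=1144885, hits=373347,
               cacheable_bytes=68402036319, byte_hits=12254132558),
}


def max_ratios(t):
    """Upper bounds on HR and BHR (in percent) for one trace."""
    # Every repeat request of an object is a potential hit.
    assert t["hits"] == t["cacheable"] - t["unique"]
    hr = 100 * t["hits"] / t["cacheable"]
    bhr = 100 * t["byte_hits"] / t["cacheable_bytes"]
    return round(hr, 2), round(bhr, 2)


for name, t in traces.items():
    print(name, max_ratios(t))
# BO2 (10.89, 19.46)
# NY (24.59, 17.91)
```

These values match the Max HR and Max BHR figures reported for the same columns, which suggests they are the ratios an unbounded cache would reach on each trace.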
Figure: Impact of cache size on HR for different proxy datasets: (a) BO2, (b) NY.
In terms of Hit Ratio (HR):
• BN-GDS achieves the best HR among all algorithms, while LRU achieves the worst.
• BN-GDS and BN-LRU improve on the HR of GDS and LRU, respectively.
• Although the HR of BN-DA is worse than that of GDS and GDSF, it is better than that of NNPCR-2, BN-LRU, and LRU.
Figure: Impact of cache size on BHR for different proxy datasets: (a) BO2, (b) NY.
In terms of Byte Hit Ratio (BHR):
• BN-LRU and BN-DA achieve the best BHR among all algorithms, while GDS and GDSF attain the worst.
• The BHR of LRU is better than that of BN-GDS, GDS, and GDSF.
• BN-GDS significantly improves on the BHR of GDS and GDSF.
• BN-LRU and BN-DA have a better BHR than LRU and NNPCR-2.
Conclusion
This study has proposed three intelligent Web proxy caching approaches, called BN-GDS, BN-LRU, and BN-DA, for improving the performance of conventional Web proxy caching algorithms.
• The BN classifier learns from Web proxy log files to predict whether or not objects will be re-visited.
• The trained classifier is integrated with conventional Web proxy caching algorithms to provide more effective proxy caching policies.
The simulation results revealed that BN-GDS achieved the best HR, a better BHR than GDS and GDSF, and an acceptable BHR compared with BN-LRU and BN-DA, which achieved the best BHR. That means BN-GDS struck a better balance between HR and BHR than the other algorithms. On the other hand, BN-LRU and BN-DA achieved the best BHR among all algorithms, and a better HR than LRU and NNPCR-2.
Future works
In the future:
• Other intelligent classifiers can be utilized to improve the performance of traditional Web caching policies.
• Clustering algorithms can be used to enhance the performance of Web caching policies.
References
Kaya, C.C., Zhang, G., Tan, Y., & Mookerjee, V.S. 2009. An admission-control technique for delay reduction in proxy caching. Decision Support Systems, 46, 594-603.
Kumar, C. 2009. Performance evaluation for implementations of a network of proxy caches. Decision Support Systems, 46, 492-500.
Kumar, C., & Norris, J.B. 2008. A new approach for a proxy-level web caching mechanism. Decision Support Systems, 46, 52-60.
Romano, S., & ElAarag, H. 2011. A neural network proxy cache replacement strategy and its implementation in the Squid proxy server. Neural Computing & Applications, 20, 59-78.
Cobb, J., & ElAarag, H. 2008. Web proxy cache replacement scheme based on backpropagation neural network. Journal of Systems and Software, 81, 1539-1558.
Koskela, T., Heikkonen, J., & Kaski, K. 2003. Web cache optimization with nonlinear model using object features. Computer Networks, 43, 805-817.
Chen, H.T. 2008. Pre-fetching and Re-fetching in Web caching systems: Algorithms and Simulation. Trent University, Peterborough, Ontario, Canada.
Cao, P., & Irani, S. 1997. Cost-Aware WWW Proxy Caching Algorithms. In Proceedings of the 1997 USENIX Symposium on Internet Technology and Systems, Monterey, CA.
Cherkasova, L. 1998. Improving WWW Proxies Performance with Greedy-Dual-Size-Frequency Caching Policy. HP Technical Report, Palo Alto.
NLANR. 2010. National Lab of Applied Network Research (NLANR). Sanitized access logs. Available at http://www.ircache.net/.
Fayyad, U.M., & Irani, K.B. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pp. 1022-1027.
Markatchev, N., & Williamson, C. 2002. WebTraff: A GUI for Web Proxy Cache Workload Modeling and Analysis. Proceedings of the 10th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, p. 356.