Enhanced cBoids Algorithm for Web Logs Clustering

International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 4 - Dec 2013
R Suguna¹, D Sharmila²
¹Assistant Professor, Computer Science and Engineering, Arunai College of Engineering, Thiruvannamalai – 606 603
²Professor and Head, Electronics and Instrumentation Engineering, Bannari Amman Institute of Technology, Sathyamangalam – 638 401
Abstract— The development of web based applications generates data in enormous amounts, and managing and extracting information from such voluminous data is very difficult. Grouping the data based on certain parameters provides a way to maintain the data in an easier and more constructive manner; in the data mining perspective such grouping is called clustering. Web data clustering is the method of grouping web information into clusters in order to organize web based documents, facilitate data availability and access, satisfy user preferences, understand the navigation behaviour of users, and improve information retrieval and content delivery. Clustering techniques are applied to web data in the following two methods: (i) session based clustering and (ii) link based clustering. In this paper the EBoid (Enhanced Boid) algorithm is used to group users who have similar browsing patterns, and the User Cluster Comparison (UCC) algorithm is used to identify the likeness value for particular websites. The performance of the enhanced boids algorithm is tested with the parameters precision, recall and f-measure for three websites: google.com, facebook.com and microsoft.com.
Keywords— cBoids, EBoid, cluster, web logs.
I. INTRODUCTION
The development of web based applications generates data in enormous amounts, and managing and extracting information from such voluminous data is very difficult. Grouping the data based on certain parameters provides a way to maintain the data in an easier and more constructive manner; in the data mining perspective such grouping is called clustering. It is essential to group web data for information accessibility, understanding user behaviour and improving information retrieval [1].
ISSN: 2231-5381
Clustering is an important step in data analysis, machine learning and statistics. It is defined as the process of grouping N items into distinct clusters based on a similarity measure. A good clustering technique produces clusters which have low inter-cluster and high intra-cluster similarity. The objective of clustering is to maximize the similarity of the data points within each cluster and the dissimilarity across clusters [2].
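This objective can be illustrated with a small sketch (the toy data and helper names are ours, not from the paper): compact clusters should show small average pairwise distances inside each group and a large distance between group centroids.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_intra_distance(cluster):
    """Average pairwise distance inside one cluster (lower = more compact)."""
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def inter_distance(c1, c2):
    """Distance between the centroids of two clusters (higher = better separated)."""
    centroid = lambda c: tuple(sum(v) / len(c) for v in zip(*c))
    return euclidean(centroid(c1), centroid(c2))

# Two toy clusters: tight groups, far apart.
clusters = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
intra = [mean_intra_distance(c) for c in clusters]
inter = inter_distance(*clusters)
```

A good clustering in the paper's sense is one where every value in `intra` is small relative to `inter`.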
The two main approaches in the clustering process are: (i) two or more clusters are combined to form new clusters, and (ii) initially all items are in one cluster, which is recursively split into the most appropriate clusters. The process continues until a stopping measure is reached. There are two issues in clustering techniques: the first is to find the optimal number of clusters in a given dataset, and the second is to calculate the relative measure between two clusters to assess the possibility of combining them [3].
In conventional clustering techniques, objects of similar character are allocated to the same cluster, while objects that differ in nature belong to different clusters. These clusters are known as hard clusters. In soft clustering, an object may be placed in more than one cluster. Clustering is a widely used technique in data mining applications for discovering patterns in underlying data.
http://www.ijettjournal.org
Users are grouped either session based or link based by using partitional, hierarchical and fuzzy based clustering algorithms. The aim of partitional clustering is to divide P points in N dimensions into M clusters. Hierarchical clustering forms a binary tree of the data by merging similar groups of points into one category, whereas fuzzy based clustering assigns a single data point to two or more groups on the basis of certain membership functions. Though these algorithms have been applied efficiently in various domains, they are mainly helpful in classifying static numerical data [1].
Web data clustering is the method of grouping web information into clusters in order to organize web based documents, facilitate data availability and access, satisfy user preferences, understand the navigation behaviour of users, and improve information retrieval and content delivery. Clustering techniques are applied to web data in the following two methods: (i) session based and (ii) link based [2].
In this paper the EBoid algorithm is used to group users who have similar browsing patterns, and the User Cluster Comparison (UCC) algorithm is used to identify the likeness value for particular websites.

Section I gives an introduction to the work. Related work is detailed in Section II. Section III discusses the cBoids algorithm. The enhanced boid algorithm is presented in Section IV. Results and experiments are detailed in Section V. Conclusions and future work are presented in Section VI.
II. RELATED WORKS
Athena Vakali (2004) stated that clustering is a challenging job in a web based environment. Numerous clustering techniques have been proposed by researchers for diverse applications; clustering is performed either as an offline or an online process depending on the requirement. Sophia et al (2006) described clustering as the process of grouping similar objects into the same group: the data objects within a group have higher similarity and more common access behaviour than those in other groups.
Clustering is one of the pattern mining techniques used to group similar data items. In web usage mining, a variety of clustering algorithms have been proposed by researchers to group web log files in two methods, namely (i) session based clustering and (ii) link based clustering. The session based clustering method groups users who have similar navigation behaviour, while the link based clustering method groups web pages which have similar contents. Session based clustering helps to identify the interest level of website visitors and to group users who have similar browsing patterns; it relies mainly on the time spent on particular web pages during the surfing period. This literature survey discusses the existing research carried out in session based clustering to identify the user interest level in particular web pages and websites.
A session based clustering methodology is described by Athena et al (2004) to group web users and web sessions. Similarity based and model based approaches were proposed for session based clustering. In similarity based clustering, a sequence alignment method is used to measure the similarity between sessions and the BIRCH algorithm is used to cluster the web pages. In the model based clustering approach, model selection techniques are applied to construct the model structure and the parameters are estimated using maximum likelihood algorithms.
George et al (2005) focused on session based clustering using the concept of model based clustering by Athena et al (2004). In this method users are grouped based on the time spent on each website. A model based clustering algorithm is proposed by the authors to group the web logs based on sessions. Each model has defined parameters, and web pages are grouped under each model based on their similarity. The resulting clusters are validated by using the equilibrium distribution.
George et al (2008) introduced the ClustWeb algorithm to group web pages which have similar weights in a directed graph. In the graph, each node represents a web page requested by the user and the edges represent web page requests; each node is associated with a weight. To avoid network traffic delay, the web pages which are expected to be visited by users in the near future are prefetched.
A new clustering algorithm with learning automata was developed by Babak et al (2011). The authors proposed three steps for effective clustering. In the first step, web access patterns are transformed into weight vectors using learning automata. In the second step, web pages are assigned to the nearest group based on their weight vectors. In the final step, a weighted fuzzy c-means algorithm is used to group the cluster centers. This method helps to identify the user's surfing behaviour exactly for further personalized services.
Miao et al (2012) proposed a K-means algorithm to cluster web users based on their browsing behaviour. The cluster quality is validated by using the compactness and separation force of intra- and inter-cluster behaviour: users with high similarity satisfy the compactness measure, while dissimilar users exhibit a greater separation force. A random indexing approach is used to prefetch web pages based on the formed clusters.
Existing research in the field of web usage mining has attempted to extract the required knowledge from web log files. Owing to the nature of web log files, it is difficult to work with them directly, so preprocessing techniques and pattern discovery algorithms are applied to the web logs to obtain meaningful information. It is better to split the data into groups with respect to meaningful parameters, which makes the situation simpler and easier. Therefore, a variety of clustering algorithms have been applied to web log files to group them either link based or session based. The above literature survey reveals some of the clustering algorithms developed by researchers.
III. CBOIDS ALGORITHM
The cBoids algorithm was developed by Marcio & Leandro (2008) for numeric data clustering in data mining. The main concept of the cBoids algorithm is based on the Reynolds (1987) boid model.

The boid model is a swarm intelligence algorithm based on the grouping behaviour of animals such as birds and fishes. This algorithm was introduced by Reynolds (1987) as the boids model, in which a boid is the representation of a bird. The main features of a boid are the speed and direction of the flock, which are controlled by three rules:
• Cohesion rule is used to keep the boids in the same group
• Separation rule helps to avoid collisions between boids
• Alignment rule tries to coordinate a boid's flying direction with its neighbours
The Reynolds (1987) boid model is suitable for creating realistic computer animations of the flocking behaviour of birds, fishes and animals. This algorithm was modified into the cBoids algorithm by Marcio & Leandro (2008), who introduced the clustering boids (cBoids) algorithm for data clustering in data mining by adding two new rules, the centroid and merging rules, for effective data clustering. An affinity value is calculated by measuring the similarity between the data objects, and Reynolds' three basic rules are modified with respect to this affinity value.
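The affinity-weighted versions of the three rules can be sketched as follows (a toy illustration; the function names are ours, and affinity as inverse Euclidean distance follows the paper's equation (1)):

```python
import math

def affinity(x, y):
    """Affinity of two data objects: the inverse Euclidean distance
    over their fields (larger value = more similar)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return float("inf") if d == 0 else 1.0 / d

def separation_force(x, y):
    """Inversely proportional to affinity, i.e. grows with distance."""
    return 1.0 / affinity(x, y)

def cohesion_strength(x, y):
    """Directly proportional to affinity."""
    return affinity(x, y)

def alignment_strength(x, y):
    """Also grows with the affinity between neighbouring boids."""
    return affinity(x, y)
```

In a full simulation these strengths would weight the usual steering vectors; here they only show how one affinity value drives all three rules.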
Centroid Rule - Initially each data object is in its own cluster and is considered the centroid of that cluster.

Merging Rule - This rule is used to combine two similar centroids together.

A. Affinity Calculation

The affinity value defines the degree of relationship between data objects, which shows the similarity between them. The greater the affinity value between two objects, the higher their similarity with each other. In the cBoids algorithm the affinities of the data objects are considered as one of the main parameters for grouping the objects. The affinity value Aij between two objects is calculated using equation (1):

A_ij = ( sqrt( sum_{k=1..L} (X_ik - Y_jk)^2 ) )^(-1)    (1)

where X_ik and Y_jk are the data objects considered for the similarity calculation and L is the total number of fields in a data object. The inverse Euclidean distance is thus used for calculating the distance between the objects in the affinity calculation.

The probability of merging two centroids X and Y is Pmxy, calculated using equation (2); this rule is directly proportional to the affinity value of the objects:

P_mxy = sqrt( sum_{i=1..n} (x_i - y_i)^2 )    (2)

where x_i and y_i are the fields of the two centroids and n is the total number of fields in each centroid.

When two or more objects are grouped in one centroid, there is a chance that some objects leave the current group and join another group. The probability of an object leaving its current group is based on its having greater affinity with another group. The current affinity of an object with its group (cag) is calculated using equation (3), which is the similarity between the object and the centroid of the group it belongs to. Then the similarity of the object with the centroids of all remaining groups is calculated using equation (4); the highest such affinity identifies the greatest affinity group (gag). The probability Pi of the object leaving its group is directly proportional to the difference gag - cag, as in equation (5):

cag = sqrt( sum_{i=1..n} (b_i - c_i)^2 )    (3)

gag = sqrt( sum_{i=1..n} (b_i - g_i)^2 )    (4)

P_i = gag - cag    (5)

where b_i is the object that may leave the group, c_i is the centroid of its current group and g_i is the centroid of the greatest affinity group.

B. Centroid Rule and Merging Rule

According to the cBoids approach, initially every object is considered a centroid: if a database has 100 objects, the algorithm initially has 100 boids and each cluster contains a single object. The behavioural rules are applied to merge centroids and build new centroids. A new centroid is also created by splitting a previously merged group into two new groups. The merging rule is used to combine two clusters into one centroid: when two objects have sufficient similarity between them, the merging rule merges their centroids.

C. The New Separation, Alignment and Cohesion Rules

According to the new rules defined by Marcio Frayze David and Leandro Nunes de Castro, the three behavioural rules of C. Reynolds are redefined based on the affinity value, as follows.

Separation rule: The strength of the separation between two boids is inversely proportional to the affinity between them; the lower the affinity value, the greater the separation force between them.

Alignment rule: The degree of alignment varies with the affinity between the boids; the greater the affinity value, the stronger the alignment between them.

Cohesion rule: The strength of cohesion varies with the boids' affinity; the greater the affinity value, the stronger the cohesion, and vice versa.

By following these three rules, new clusters are formed.

D. Stopping Condition

The algorithm stops when the iteration repeats a maximum number of times without any change in the clustering process. The cBoids algorithm runs for a maximum of 1000 iterations.

IV. ENHANCED CBOIDS ALGORITHM

Web logs are unformatted text files and are inconsistent by nature, so preprocessing techniques are carried out to make them consistent [10][15]. The preprocessing activities, namely data cleaning, user identification, session identification and path completion, are performed to make the web logs suitable for applying pattern mining algorithms [11][12][13].

The following modifications are proposed in the cBoids algorithm with the aim of grouping users through their website visiting behaviour:

(i) The proposed cBoids algorithm is applied to the preprocessed web logs. During preprocessing, the session time and frequency value of websites are extracted from the web logs [14].

(ii) Websites and users are grouped with respect to the parameters session time and frequency value.

(iii) Euclidean distance is used to calculate the affinity value between the data objects.

(iv) Cluster similarity is validated by using the Shannon minimum entropy measure.

(v) The new centroid value is calculated by taking the average of the merged group centroids.

(vi) The stopping criterion is specified based on the total number of unique users.

The proposed cBoids algorithm performs two levels of clustering to identify the user interest level. First, websites are grouped with respect to each user; in the second level, users are grouped with respect to each website. The web logs are processed according to the needs of the proposed approach, and the clustering process groups the websites based on their usage behaviour. In the proposed system each boid has five attributes: user name, IP address, URL, session time and frequency. This is represented as

b = <user name, ip address, url, session, frequency>

These attributes are the input for the EBoid algorithm.

A. Affinity Calculation

The EBoid algorithm takes session time and frequency as the parameters for the affinity calculation. According to the EBoid algorithm the data items are arranged in a distance matrix, so the Euclidean distance between objects is adapted for calculating the affinity value between the objects in a group. The affinity Aij between two websites is calculated using equation (6):

A_ij = ( sqrt( sum_{k in {s,f}} (x_ik - y_jk)^2 ) )^(-1)    (6)

where s is the session time, f is the frequency value, and x_ik and y_jk are the session and frequency values of objects X and Y.

B. Entropy Calculation for Merging Centroids

Entropy is the mechanism for validating the similarity of the data items within a group. When two clusters are merged based on the probability rule Pmxy, the entropy value of the merged cluster either stays equal or increases. If the average entropy value stays equal, it is assumed that the two clusters have high similarity; therefore the clusters having minimum entropy are merged together. The entropy value for each cluster is calculated using equation (7):

H(X) = - sum_{x in X} p(x) log2 p(x)    (7)

where H is the entropy function, X is the set of data items in a cluster, x is a data item and p(x) is the probability mass function of the random variable x.

C. New Centroid Calculation

When two clusters (centroids) are merged together, a new centroid value is calculated for the merged group by taking the average of the two centroids:

nc = (c1 + c2) / 2

where nc is the new centroid, c1 is the centroid of the initial group and c2 is the centroid of the new group.

D. Proposed Separation, Alignment and Cohesion Rules

Separation rule: The strength of the separation between two boids is inversely proportional to the affinity between them; the lower the affinity value and the higher the entropy, the greater the separation force between them.

Alignment rule: The degree of alignment varies with the affinity between the boids; the greater the affinity value and the lower the entropy value, the stronger the alignment between them.

Cohesion rule: The strength of cohesion varies with the boids' affinity; the greater the affinity value and the lower the entropy value, the stronger the cohesion, and vice versa.

By following these three rules, new clusters are formed. The algorithm stops when the number of cycles repeats without any change in the groups; this number equals the total number of users divided by 2.

E. User Interest Level Grouping

The clustered websites obtained from the EBoid algorithm represent the most visited and most used websites of each user. The UCC method aims at extracting the user interest from the clusters formed by the proposed cBoids algorithm. Clusters are made according to a user's session time and frequency of visits to particular websites. The inputs for the UCC algorithm are (i) the user list and (ii) the website clusters.

The user interest against the clusters is calculated based on a constraint, defined as the likeness lk of a user for particular websites. The user interest level depends on three parameters: user, website and likeness. The likeness value lk of a particular website is calculated using equation (8):

lk = vi + ts    (8)

where vi is the number of visits and ts is the total time spent on a particular website. The likeness value of each cluster is calculated to select the high likeness clusters, which are considered for user interest level identification; the clusters which have a low likeness value are discarded from further analysis.
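A toy sketch of the EBoid-style pipeline may help fix the ideas: affinity over (session time, frequency) boids, Shannon entropy for cluster validation, and centroid averaging on merge. The simplified greedy merge loop and all names below are our assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter

def affinity(x, y):
    """Inverse Euclidean distance over the (session time, frequency)
    fields, in the spirit of equation (6): larger = more similar."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return float("inf") if d == 0 else 1.0 / d

def entropy(items):
    """Shannon entropy of the labels in a cluster, as in equation (7)."""
    counts = Counter(items)
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def new_centroid(c1, c2):
    """Average of two merged centroids, per the new-centroid rule."""
    return tuple((a + b) / 2 for a, b in zip(c1, c2))

def cluster_boids(points, k):
    """Toy agglomerative variant of the merge loop: each point starts as
    its own centroid (centroid rule) and the two most affine centroids
    are merged (merging rule) until k groups remain."""
    clusters = [[p] for p in points]
    cents = list(points)
    while len(clusters) > k:
        i, j = max(((i, j) for i in range(len(cents))
                    for j in range(i + 1, len(cents))),
                   key=lambda ij: affinity(cents[ij[0]], cents[ij[1]]))
        merged = clusters[i] + clusters[j]
        nc = new_centroid(cents[i], cents[j])
        for idx in (j, i):            # remove the higher index first
            clusters.pop(idx)
            cents.pop(idx)
        clusters.append(merged)
        cents.append(nc)
    return clusters

# (session time, frequency) boids: three light users, two heavy users.
sessions = [(30, 2), (28, 3), (31, 2), (300, 40), (310, 42)]
groups = cluster_boids(sessions, 2)
```

The paper's version additionally gates merges on the entropy check and stops after (number of users) / 2 unchanged cycles; the sketch only shows the affinity-driven merging itself.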
The following shows the algorithm for the website extraction process.
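The likeness step of the UCC method can be sketched as follows (the per-user log structure and helper names are hypothetical; only the formula lk = vi + ts comes from the paper):

```python
def likeness(vi, ts):
    """UCC likeness of a user for a website, lk = vi + ts (equation (8)):
    vi = number of visits, ts = total time spent. Units are whatever the
    preprocessed log provides; the paper simply adds the raw values."""
    return vi + ts

# Hypothetical per-user summary extracted from preprocessed web logs:
# website -> (visits, total time spent)
user_log = {
    "google.com": (12, 340),
    "facebook.com": (5, 120),
    "microsoft.com": (2, 30),
}

ranked = sorted(user_log, key=lambda w: likeness(*user_log[w]), reverse=True)
top_sites = ranked[:2]   # keep the high-likeness sites, discard the rest
```

In the full algorithm this ranking is applied per cluster, and only high-likeness clusters are carried forward to the interest-level analysis.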
V. EXPERIMENT AND PERFORMANCE ANALYSIS
The preprocessed web logs are given as the input for this work. Java (JDK 1.6) is used for implementing the clustering technique, on a system with an Intel Core i3 processor and 4 GB of RAM. In this section the performance of the proposed approach is evaluated based on recall, precision and f-measure values. The analysis is conducted based on two criteria: first, the websites most visited by the users are retrieved; second, the likeness values of the top websites are analyzed.
A. Evaluation Metrics

The performance of the proposed system is evaluated based on the following three evaluation criteria.

Recall: Recall is the ratio of the number of relevant users that are retrieved to the total number of relevant users for a particular website:

Recall = |U_Ret ∩ U_Rel| / |U_Rel|

Precision: Precision is the ratio of the number of retrieved users that are relevant to the total number of retrieved users:

Precision = |U_Ret ∩ U_Rel| / |U_Ret|

where U_Ret and U_Rel are the sets of retrieved and relevant users respectively.

F-measure: The precision and recall values are combined to find the f-measure value for the total dataset. The f-measure is calculated as:

F-measure = (2 × Recall × Precision) / (Recall + Precision)

B. Performance of the EBoid Algorithm Against Existing Algorithms

The performance of the proposed EBoid algorithm is compared with the cBoids algorithm by applying the UCC model, using the parameters precision, recall and f-measure for three websites: google.com, facebook.com and microsoft.com. The 550 unique users identified during the preprocessing activities are considered for the performance evaluation.

C. Average Precision, Recall and F-Measure Analysis for Three Websites

The average precision, recall and f-measure values are calculated to determine the user interest level for each website. The analysis is carried out with three popular websites, namely google.com, facebook.com and microsoft.com. Table 6.1 shows the average precision, recall and f-measure values of the proposed and existing algorithms for these three websites.

Figure 6.1 Average precision, recall and f-measure values for google.com
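The three metrics can be computed with a small set-based sketch (the user IDs and helper name are illustrative, not from the paper):

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall and F-measure over user sets,
    matching the formulas in the evaluation-metrics subsection."""
    hits = len(retrieved & relevant)          # users both retrieved and relevant
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical user sets for one website.
retrieved = {"u1", "u2", "u3", "u4"}
relevant = {"u1", "u2", "u5"}
p, r, f = precision_recall_f(retrieved, relevant)
```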
Table 6.1 Average Precision (P), Recall (R) and F-measure (F) values for Google.com, Facebook.com and Microsoft.com

Alg.     Google.com         Facebook.com       Microsoft.com
Name     P     R     F      P     R     F      P     R     F
cBoid    0.62  0.57  0.59   0.54  0.47  0.50   0.52  0.37  0.43
EBoid    0.73  0.67  0.70   0.63  0.56  0.59   0.62  0.49  0.55

Figure 6.2 Average precision, recall and f-measure values for facebook.com
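As a quick consistency check, the reported F values in Table 6.1 can be re-derived from the P and R columns (values copied from the table; the helper name is ours):

```python
def f_measure(p, r):
    """F-measure as defined in the evaluation metrics: 2PR / (P + R)."""
    return 2 * p * r / (p + r)

# (precision, recall, reported F) per website, from Table 6.1;
# columns are google.com, facebook.com, microsoft.com.
table = {
    "cBoid": [(0.62, 0.57, 0.59), (0.54, 0.47, 0.50), (0.52, 0.37, 0.43)],
    "EBoid": [(0.73, 0.67, 0.70), (0.63, 0.56, 0.59), (0.62, 0.49, 0.55)],
}

for rows in table.values():
    for p, r, f in rows:
        assert abs(f_measure(p, r) - f) < 0.005   # matches to two decimals
```

Every reported F value agrees with 2PR/(P+R) to the table's two-decimal precision.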
Figure 6.3 Average precision, recall and f-measure values for Microsoft.com

VI. CONCLUSION

Identifying and analyzing website visitors' behaviour is an interesting task for many website analysts. In this paper the EBoid algorithm is used to cluster users based on their browsing patterns and to group websites based on their usage. Web logs are given as the input for the proposed algorithm. The UCC algorithm is newly proposed to extract the likeness value for websites. The proposed algorithms are compared with an existing algorithm to demonstrate their efficiency. The evaluation metrics considered for the proposed clustering algorithm are recall, precision and f-measure. The experimentation is carried out on three different websites, and the analysis is conducted from different aspects to extract the high likeness websites and the corresponding users. The proposed algorithm gives better performance in identifying the interest level of users. In future work, association rule mining will be applied to the high likeness cluster of each user to find frequent patterns.

REFERENCES

[1] Athena Vakali, Jaroslav Pokorny, and Theodore Dalamagas, "An Overview of Web Data Clustering Practices", Springer-Verlag Berlin Heidelberg, 2004.
[2] George Pallis, Athena Vakali, and Jaroslav Pokorny, "A Clustering-Based Prefetching Scheme on a Web Cache Environment", Computers and Electrical Engineering, vol. 34, pp. 309-323, 2008.
[3] Miao Wan, Arne Jönsson, Cong Wang, Lixiang Li, and Yixian Yang, "Web User Clustering and Web Prefetching using Random Indexing with Weight Functions", Knowledge and Information Systems, vol. 33, no. 1, pp. 89-115, 2012.
[4] Sophia G. Petridou, Vassiliki A. Koutsonikola, Athena I. Vakali, and Georgios I. Papadimitriou, "A Divergence-Oriented Approach for Web Users Clustering", ICCSA 2006, LNCS 3981, pp. 1229-1238, Springer-Verlag Berlin Heidelberg, 2006.
[5] Babak Anari, Mohammad Reza Meybodi, and Zohreh Anari, "Clustering Web Access Patterns Based on Learning Automata", IJCSI International Journal of Computer Science Issues, vol. 8, issue 5, no. 1, September 2011.
[6] George Pallis, Lefteris Angelis, and Athena Vakali, "Model-Based Cluster Analysis for Web Users Sessions", ISMIS 2005, LNAI 3488, pp. 219-227, Springer-Verlag Berlin Heidelberg, 2005.
[7] Dipa, D. and Jayant, G., "A New Approach for Clustering of Navigation Patterns of Online Users", International Journal of Engineering Science and Technology, vol. 2, no. 6, pp. 1670-1676, 2010.
[8] Suguna, R. and Sharmila, D., "User Interest Based Web Usage Mining using a Modified Bird Flocking Algorithm", European Journal of Scientific Research (EJSR), vol. 86, no. 2, pp. 218-231.
[9] Marcio Frayze David and Leandro Nunes de Castro, "A New Clustering Boids Algorithm for Data Mining", 2008.
[10] Cooley, R., Mobasher, B., and Srivastava, J., "Data Preparation for Mining World Wide Web Browsing Patterns", Knowledge and Information Systems, Springer-Verlag, pp. 1-27, 1999.
[11] Cooley, R., "Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data", Ph.D. thesis, University of Minnesota, 2000.
[12] Malarvizhi, M. and Sahaaya, A.M., "Preprocessing of Educational Institution Web Log Data for Finding Frequent Patterns using Weighted Association Rule Mining Technique", European Journal of Scientific Research, vol. 74, no. 4, pp. 617-633, 2012.
[13] Maheswara, R. and Valli, V.V.R., "An Enhanced Pre-Processing Research Framework for Web Log Data using a Learning Algorithm", NetCom, CS&IT - CSCP, pp. 1-15, 2011.
[14] Reynolds, C.W., "Flocks, Herds and Schools: A Distributed Behavioral Model", ACM SIGGRAPH Computer Graphics, vol. 21, no. 4, pp. 25-34, 1987.
[15] Suguna, R. and Sharmila, D., "User Interest Level based Preprocessing Algorithms using Web Usage Mining", International Journal of Computer Science and Engineering, vol. 5, no. 9, pp. 815-822, 2013.