Predicting Users’ next visit using Grey Moving Probability Markov Model
Ch. Bindu Madhuri (1), Prof. J. A. Chandulal (2)
(1) Department of Information Technology, JNTUK-UCEV, Vizianagaram.
(2) Department of Computer Science and Engineering, GITAM University, Visakhapatnam, India.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Abstract
In web-based applications, Web Usage Mining (WUM) techniques are applied to the collected usage data. This paper
focuses on the problem of building models that represent past users' behavior and, in turn, predict the links a user is
most likely to request when viewing a web page. WUM is specifically designed to support such applications by analyzing
usage data. The problem of predicting the next request during a user's navigational session has been extensively studied. In
this context, the Grey Moving Probability Markov Model is proposed to model navigational sessions and to
predict the next navigation step using different transition probability estimation approaches. The method uses
Grey Relational Pattern Analysis (GRPA) with Variable-Length Markov Chains to analyze users' navigational behavior in order
to generate user session clusters. The experimental results indicate that the approach can improve the quality of clustering
of user navigation patterns in WUM systems, and that the results can be used to predict a user's next request on large web
sites in order to customize a web site to the needs of specific users.
Keywords: Grey Moving Probability Markov Model, Web Personalization, Web Usage Mining (WUM)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
1 Introduction
Nowadays a huge amount of data exists owing to the rapid growth of data and of the number of users, and internet users
face multiple problems as a result. Providing a user with precisely the data that matters has therefore become a decisive
issue in web usage based applications. This paper deals with improving the performance of web management by extending
and applying Web Usage Mining (WUM). WUM aims to ascertain the essential associations in web usage (log) files,
expressed in the form of usage data, by studying the characteristics of that data with data mining (DM) techniques.
With the abundance of information available on the World Wide Web (WWW), it has become increasingly necessary for
users to find the desired information resources, and to track and analyze their usage patterns. Extracting useful
knowledge from the web by applying data mining techniques is referred to as Web Mining, that is, Knowledge Discovery in
Databases (KDD) applied to the huge amount of information available on the web. It is often categorized into three major
areas [1, 2]: Web Content Mining, Web Structure Mining, and Web Usage Mining.
Web Usage Mining (WUM) is an emerging domain of web mining that exploits data mining techniques to discover
valuable information from the navigation behavior of World Wide Web (WWW) users. Web usage mining comprises three main
steps: preprocessing, knowledge extraction and result analysis [4]. Web Usage Mining [1] tries to make sense of the data
generated by web surfers' sessions or behaviours. While web content and structure mining use the data on the web itself,
Web Usage Mining mines the log data derived from users' interactions with the web. Web usage data includes data from
web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions, cookies, user
queries, bookmark data and mouse clicks. The preprocessing phase consists of data cleaning, user identification, session
identification and transaction identification [5]. Pattern discovery [3] draws upon methods and algorithms developed in
several fields, such as statistics, data mining, machine learning and pattern recognition. Pattern analysis is the last
step in the overall web usage mining process; it filters out uninteresting patterns found in the pattern discovery phase.
The major application areas for web usage mining fall into five categories: personalization, system improvement, site
modification, business intelligence and usage characterization [3].
This paper deals with WUM, for which many DM techniques such as clustering have been applied to web server
logs. We propose the Grey Moving Probability Markov Model (GMPMM), which is used to predict users' future
visits in order to customize a web site to the needs of specific users. The model represents web users' navigational
sessions and predicts a user's next navigational step using transition probabilities. The combination of GRPA with
Markov Chains (MC) generates user session clusters by analyzing user navigational behavior. The method is advantageous
only if the user's navigation history is known; the moving probability model is then able to predict subsequent visits.
The proposed methodology predicts the user's interest more specifically. The main argument is the usefulness of the model
for evaluating predictions, with the objective of discovering a web recommendation set. An extensive evaluation of the
proposed methods on real data sets gives a strong indication that the model predicts accurately. This paper is organized
as follows. Section 2 reviews related work. Section 3 introduces a methodology for predicting users' next visit.
Section 4 presents experimental results, followed by conclusions in Section 5.
2 Related Work
Several WUM systems have been proposed to predict users' preferences and navigation behavior. Liu and Keselj
[7] proposed a novel approach to automatically classifying user navigation patterns and predicting users' future
requests. The approach is based on the combined mining of web server logs and the contents of the retrieved web pages.
They used character N-grams to represent the contents of web pages, and combined them with user navigation patterns by
building user navigation profiles composed of a collection of N-grams. Their off-line mining system can be incorporated
into an on-line web recommendation system to observe and measure real users' satisfaction with the generated
recommendations, which are derived from the requests predicted by the system.
Analog [8] is one of the first web usage mining (WUM) systems. It is structured into an off-line and an on-line
component. The off-line component builds session clusters by analyzing past users' activity recorded in server log files.
In [9, 10], Mobasher et al. present a system that provides dynamic recommendations to users as a list of hypertext links.
The analysis is based on anonymous usage data combined with the structure formed by the hyperlinks of the site.
SUGGEST, the WUM system proposed by Baraglia and Palmerini, provides useful information to ease web user navigation
and to optimize web server performance [11, 12]. It adopts a two-level architecture composed of an off-line creation of
historical knowledge and an on-line engine that understands users' behavior. As requests arrive at this system module, it
incrementally updates a graph representation of the web site based on the active user sessions and classifies the active
sessions using a graph partitioning algorithm.
Mehrdad Jalali et al. [13] proposed a novel approach for classifying user navigation patterns using the Longest Common
Subsequence (LCS) algorithm, which they exploit to improve classification accuracy. In recent years, an increasing number
of research works have addressed web usage mining [14, 15, 16, 17, 18, 19]. The main motivation of these studies is to
better understand the reactions and motivations behind users' navigation. Some studies also apply the mining results to
improve the design of websites [20], analyze system performance and network communications, or even build adaptive
websites [21]. Three web mining approaches that exploit web logs can be distinguished: Association Rules (AR) [22, 23],
Frequent Sequences [24] and Frequent Generalized Sequences [25, 26]. Algorithms for the three approaches have been
developed, but few experiments have been done with real web log data. Géry and Haddad proposed a recommender system
that predicts users' next requests based on their behavior discovered from web log data [27]. P. Kumar et al. [28]
proposed a Sequential PAM algorithm to find clusters and to improve the web personalization system.
This paper mainly deals with predicting users' next visit using the Grey Moving Probability Markov Model
in order to customize a web site to the needs of specific users.
3 Methodology
3.1 Pattern discovery & Navigation Pattern
After data preprocessing, knowledge is extracted using Markov chains. Navigation patterns are mined from the derived
user access sessions. The on-line module then builds active user sessions, which make it possible to identify pages
related to those in the active session and to predict the next requested page.
3.1.1 Preliminaries of Markov Chain Models
This section gives a brief discussion of Markov Chain Models and the Grey Moving Probability Markov Model. Grey
System Theory seeks only the intrinsic structure of a system given limited data. The problem addressed here is that, owing
to lack of information, it is difficult to determine the exact value of one or more entries in the moving probability
matrix of a Markov chain. The Grey Moving Probability Markov Model is used to forecast a web user's subsequent link;
prediction here means the task of predicting the user's subsequent link.
Let $X$ be a sequence of $N$ random variables $X_1, X_2, \ldots, X_N$ representing a navigational sequence. By the chain
rule of probability, the sequence model is
$$\Pr(X) = \Pr(X_1, \ldots, X_N) = \Pr(X_N \mid X_{N-1}, X_{N-2}, \ldots, X_1)\,\Pr(X_{N-1} \mid X_{N-2}, \ldots, X_1) \cdots \Pr(X_1).$$
The key property of a first-order Markov chain is that the probability of each $X_i$ depends only on the value of
$X_{i-1}$, so that
$$\Pr(X) = \Pr(X_N \mid X_{N-1})\,\Pr(X_{N-1} \mid X_{N-2}) \cdots \Pr(X_2 \mid X_1)\,\Pr(X_1) = \Pr(X_1) \prod_{i=2}^{N} \Pr(X_i \mid X_{i-1}).$$
A Markov chain is usually described by its transition parameters $a_{x_{i-1} x_i}$, where
$a_{x_{i-1} x_i} = \Pr(X_i = x_i \mid X_{i-1} = x_{i-1})$, and the probability of a sequence $x$ of length $N$ is
$$a_{S x_1} \prod_{i=2}^{N} a_{x_{i-1} x_i} = \Pr(x_1) \prod_{i=2}^{N} \Pr(x_i \mid x_{i-1}),$$
where $a_{S x_1}$ represents the transition from the start state.
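As an illustration of the first-order factorization, the following minimal Python sketch estimates the transition parameters by relative frequency and evaluates the probability of one sequence. The function names, the start-state label `"S"` and the toy sessions are assumptions made for this example only; they are not taken from the experiments.

```python
from collections import defaultdict

START = "S"  # hypothetical label for the start state


def fit_first_order(sequences):
    """Estimate a[u][v] = Pr(X_i = v | X_{i-1} = u) by relative frequency."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        prev = START
        for page in seq:
            counts[prev][page] += 1
            prev = page
    return {
        u: {v: c / sum(nxt.values()) for v, c in nxt.items()}
        for u, nxt in counts.items()
    }


def sequence_probability(a, seq):
    """Pr(x) = a[S][x_1] * prod_{i=2..N} a[x_{i-1}][x_i].

    Transitions never seen in training contribute probability 0.
    """
    prob = 1.0
    prev = START
    for page in seq:
        prob *= a.get(prev, {}).get(page, 0.0)
        prev = page
    return prob


# Toy sessions (page identifiers), purely illustrative.
sessions = [[3, 6, 11], [3, 6, 7], [3, 3, 6]]
a = fit_first_order(sessions)
p = sequence_probability(a, [3, 6, 11])
print(p)
```

With these toy sessions the factors are $a_{S3} = 1$, $a_{36} = 3/4$ and $a_{6,11} = 1/2$, so the product is $3/8$.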
3.1.2 Preliminaries of Grey Moving Probability Markov Model
To overcome the difficulty of predicting an exact value, the uncertain entry is replaced by a grey interval
$P_{ij}(\otimes)$ based on the known values. When the moving probability matrix is grey, the required whitenization
matrix $\tilde{P}(\otimes) = [\tilde{P}_{ij}(\otimes)]$ must satisfy the following:
1. $\tilde{P}_{ij}(\otimes) \geq 0$, for $i, j \in I$;
2. $\sum_{j \in I} \tilde{P}_{ij}(\otimes) = 1$, for any $i \in I$.
For any $n \in T$ and states $i, j \in I$, $P_{ij}(n) = P(X_{n+1} = j \mid X_n = i)$ is called the moving probability of
the Markov chain.
Properties of Grey Moving Probability Matrix
The moving probability matrix $P(n)$ has to satisfy the following properties:
1. $P_{ij}(n) \geq 0$, for $i, j \in I$;
2. $\sum_{j \in I} P_{ij}(n) = 1$, for any $i \in I$;
3. $P(n) = P^{n}$.
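The three properties can be checked numerically. The sketch below is a plain-Python illustration under the assumption of a time-homogeneous chain with a small, fully known (already whitenized) matrix; the helper names and the 2x2 example matrix are hypothetical.

```python
def is_moving_probability_matrix(P, tol=1e-9):
    """Properties 1 and 2: non-negative entries and every row summing to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def n_step(P, n):
    """Property 3: the n-step matrix is the n-th power of P."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, P)
    return result


P = [[0.5, 0.5], [0.2, 0.8]]  # illustrative values only
P2 = n_step(P, 2)
print(is_moving_probability_matrix(P), is_moving_probability_matrix(P2))
```

Because the product of two row-stochastic matrices is again row-stochastic, the two-step matrix passes the same check as the one-step matrix.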
Transitional probability estimation models
Two transitional probability estimation models are used, namely Maximum Likelihood Estimation (MLE) and a
Bayesian Approach.
Maximum Likelihood Estimation (MLE): The maximum likelihood estimates are the observed frequencies of the states,
as shown in Equation 3.1:
$$\Pr(a) = \frac{n_a}{\sum_i n_i} \qquad (3.1)$$
A Bayesian Approach: Start with some prior belief for each state and use the Laplace estimate shown in Equation 3.2:
$$\Pr(a) = \frac{n_a + 1}{\sum_i (n_i + 1)} \qquad (3.2)$$
In its more general form, the estimate is
$$\Pr(a) = \frac{n_a + p_a m}{\left(\sum_i n_i\right) + m} \qquad (3.3)$$
where $p_a$ is the prior probability of $a$ and $m$ is the number of "virtual" instances.
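The three estimators of Equations 3.1 to 3.3 can be sketched in a few lines of Python. The function names and the toy counts are assumptions for illustration; `counts` maps each state $a$ to its observed count $n_a$.

```python
def mle(counts):
    """Equation 3.1: Pr(a) = n_a / sum_i n_i."""
    total = sum(counts.values())
    return {a: n / total for a, n in counts.items()}


def laplace(counts):
    """Equation 3.2: Pr(a) = (n_a + 1) / sum_i (n_i + 1)."""
    total = sum(n + 1 for n in counts.values())
    return {a: (n + 1) / total for a, n in counts.items()}


def m_estimate(counts, priors, m):
    """Equation 3.3: Pr(a) = (n_a + p_a * m) / (sum_i n_i + m)."""
    total = sum(counts.values()) + m
    return {a: (n + priors[a] * m) / total for a, n in counts.items()}


counts = {"A": 3, "B": 1, "C": 0}          # illustrative counts
uniform = {a: 1 / len(counts) for a in counts}

print(mle(counts))
print(laplace(counts))
print(m_estimate(counts, uniform, m=len(counts)))
```

With these counts, MLE assigns probability 0 to the unseen state `"C"`, whereas the Laplace estimate gives it 1/7. Choosing a uniform prior and $m$ equal to the number of states makes Equation 3.3 coincide with Equation 3.2.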
Procedure to predict the users' next visit
I. Identify the distinct users in the log files
Step 1: Sort the log data by host name and by time stamp.
Step 2: Treat each distinct host name as a different user.
Step 3: If the referrer field is available in the log file, continue with Step 4; otherwise go to Step 5.
Step 4: Merge the user identification data from Steps 1 to 3 to identify every user, then stop.
Step 5: Use the output of Step 2 to identify the users.
II. Extract the URLs of the visited pages
III. Identify user sessions
1. Allocate a distinct session ID to every user recognized in the user identification process.
2. Define the timeout threshold.
3. Compute the time difference between every two successive web access log entries.
4. If the difference exceeds the threshold, assign a new session ID to the user's next access.
5. Sort the entries by session ID.
IV. Transaction identification
V. Split the resulting set of navigational patterns into a training set and a testing set for the experiments.
VI. Compute the similarity between the training sequences (navigation patterns) to form clusters and analyze user
behavior (using the Grey clustering algorithm).
VII. Estimate the transitional probabilities of the sequences with the different estimation approaches.
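Parts I and III of the procedure can be sketched as follows for the simple case in which no referrer field is available, so that each host name is treated as one user. The record layout `(host, timestamp_seconds, url)`, the function name and the 30-minute default timeout are assumptions for illustration; the procedure itself leaves the threshold to be defined.

```python
from itertools import groupby


def identify_sessions(records, timeout=1800):
    """Split log records into sessions.

    records: iterable of (host, timestamp_seconds, url) tuples.
    A new session starts whenever the gap between two successive
    requests of the same host exceeds `timeout` seconds.
    Returns a list of (host, [urls]) pairs, one per session.
    """
    ordered = sorted(records, key=lambda r: (r[0], r[1]))  # by host, then time
    sessions = []
    for host, hits in groupby(ordered, key=lambda r: r[0]):
        current, last_time = [], None
        for _, ts, url in hits:
            if last_time is not None and ts - last_time > timeout:
                sessions.append((host, current))
                current = []
            current.append(url)
            last_time = ts
        sessions.append((host, current))
    return sessions


# Hypothetical log entries.
log = [
    ("h2", 10, "/a"),
    ("h1", 0, "/x"),
    ("h1", 100, "/y"),
    ("h1", 5000, "/z"),
]
result = identify_sessions(log)
print(result)
```

In this toy log the third request of host `h1` arrives 4900 seconds after the second, which exceeds the timeout, so it opens a new session.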
[Algorithm]: Predicting Users' Next Visit
Input: test sequence record
Output: next navigational step
Step 1: Initially consider the test sequence record as the reference sequence pattern $s_{ri}$:
$$s_{ri} = \langle 0, s_{ri}(1), s_{ri}(2), \ldots, s_{ri}(p), 0 \rangle$$
Represent this sequence in first-order and second-order Markov chain format. The first-order form is
$$s_{ri}^{1} = \langle 0 - s_{ri}(1),\ s_{ri}(1) - s_{ri}(2),\ s_{ri}(2) - s_{ri}(3),\ \ldots,\ s_{ri}(p-1) - s_{ri}(p),\ s_{ri}(p) - 0 \rangle$$
Step 2: Find the Grey Relational Pattern Grade between the reference sequence and the cluster $n$, using
$$v(s_{ri}, C_n) = \left( \frac{S_{max} - \Delta_{sim}}{S_{max} - S_{min}} \right)^{\zeta}$$
$$\Delta_{sim} = A \cdot \frac{LLCS(s_{ri}, C_{n s_{cj}})}{\max(|s_{ri}|, |C_{n s_{cj}}|)} + B \cdot \frac{|s_{ri} \cap C_{n s_{cj}}|}{|s_{ri} \cup C_{n s_{cj}}|}$$
where $C_{n s_{cj}}$ denotes the comparative sequences in the cluster.
Step 3: Select the maximum Grey relational pattern grade $v(s_{ri}, C_n)$ among them and
consider the sequences in that particular cluster as the active comparative pattern.
Step 4: Find the transitional probabilities of the sequences with the different estimation
approaches using Equations 3.1, 3.2 and 3.3.
Step 5: Calculate $P(\otimes) = [P_{ij}]$; /* the probability values obtained with the different
estimation approaches are taken as the Grey values */
Step 6: $P^{T}(0) = P(\otimes)$; /* state 0 */
$P^{T}(1) = P^{T}(0)\,P(\otimes)$; /* state 1 */
$P^{T}(2) = P^{T}(1)\,P(\otimes)$; /* state 2 */
$P^{T}(n) = P^{T}(n-1)\,P(\otimes)$; /* for n states */
Step 7: Repeat Step 4 until the desired number of further visits is obtained. /* gives the desired number of further visits */
Step 8: Generate $\tilde{P}(\otimes) = [\tilde{P}_{ij}(\otimes)]$; /* the highest probability link value, chosen using HM, is
taken as the whitenization value of the Grey number */
Algorithm: Predicting the users' next visit using the Grey Moving Probability Markov Model
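Step 2 can be made concrete with a short Python sketch. It assumes that LLCS denotes the length of the longest common subsequence and that the intersection and union are taken over the sets of distinct pages in the two sequences. The weights `A` and `B`, the bounds `s_max` and `s_min` and the coefficient `zeta` are not given numeric values in this section, so they are plain parameters here, and the defaults are illustrative assumptions only.

```python
def llcs(x, y):
    """Length of the longest common subsequence of sequences x and y."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def delta_sim(ref, comp, A=0.5, B=0.5):
    """Weighted sum of an order-aware term (LLCS) and a page-overlap term.

    Both sequences are assumed to be non-empty.
    """
    order_term = llcs(ref, comp) / max(len(ref), len(comp))
    pages_ref, pages_comp = set(ref), set(comp)
    overlap_term = len(pages_ref & pages_comp) / len(pages_ref | pages_comp)
    return A * order_term + B * overlap_term


def grade(delta, s_max, s_min, zeta):
    """v = ((S_max - delta) / (S_max - S_min)) ** zeta, as written in Step 2."""
    return ((s_max - delta) / (s_max - s_min)) ** zeta


ref = [1, 2, 4]        # illustrative reference sequence
comp = [1, 3, 2, 4]    # illustrative comparative sequence
d = delta_sim(ref, comp)
print(d)
```

For this pair the longest common subsequence is `1, 2, 4` (length 3) and three of the four distinct pages are shared, so both terms equal 0.75.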
4 Experimental Results
The use of the Markov chain (MC) relies on the assumption that the states likely to be visited in the next
navigation step depend only on the page a web user is viewing now. Each element in the sequence matrix indicates the
proportion of visits to state j at the next transition, given the present state i. A web user not currently on the web
site is described as being in state 0.
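The construction of such a matrix can be sketched in Python as follows. Each session is padded with state 0 at both ends, matching the form $\langle 0, \ldots, 0 \rangle$ used in the algorithm; the function names and the toy sessions are assumptions for this example and do not come from the data sets used in the experiments.

```python
from collections import Counter, defaultdict


def transition_matrix(sessions):
    """P[i][j] = proportion of visits to state j right after state i.

    State 0 stands for a user who is not currently on the web site.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        padded = [0, *session, 0]
        for i, j in zip(padded, padded[1:]):
            counts[i][j] += 1
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }


def step(dist, P):
    """Propagate a distribution over states by one transition."""
    nxt = defaultdict(float)
    for i, p_i in dist.items():
        for j, p_ij in P.get(i, {}).items():
            nxt[j] += p_i * p_ij
    return dict(nxt)


def predict_next(P, current):
    """Most probable next state given the current state."""
    row = P[current]
    return max(row, key=row.get)


sessions = [[3, 6, 11], [3, 6, 11], [3, 6, 7]]  # illustrative sessions
P = transition_matrix(sessions)
dist = {0: 1.0}
for _ in range(3):
    dist = step(dist, P)
print(predict_next(P, 6), dist)
```

Starting from state 0, three transitions lead to page 11 with probability 2/3 and to page 7 with probability 1/3, and `predict_next(P, 6)` selects page 11.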
Table 1: Predicting subsequent visit using Grey Moving Probability Markov Model
Table 2: Predicting subsequent visit using Grey Moving Probability Markov Model
Test sequence: 1→2→4→?

Training sequences (excluding test sequences):
→3,3,3,3,6,6,8,8,12,12,3,3,3,3,10,6,10,7
→3,3,6,6,6,6,7,7,7,10,10
→3,11,11,10,10,10,10,10,10,11,11,11
→9,9,10,10,10,7,13,11,6
→3,6,11,14,14,11,14,14
→3,3,6,11,11,11,10,11,6,7,7,7,3,7,7,7
→7,17,8,7,7,8,13,11,3,1

Further visit (without using estimation models): 3,6,11,4,12,7
Further visit (using MLE & BA): 3,6,11
5 Conclusion
In the prediction model, a Variable Length Markov model is used to predict the category of a user's next
state from the transition probabilities. The method uses two transition probability estimation models,
Maximum Likelihood Estimation (MLE) and a Bayesian Approach (BA). Predicting the next request during a
user's navigation session has been extensively studied, and higher-order Markov models have been widely
used to model navigation sessions and to predict the next navigation step. Prediction accuracy has
generally been evaluated with the Hit and Miss Score. Here, next link prediction models are evaluated
with the aim of finding a recommendation set. This approach reduces the online recommendation time while
retaining predictive accuracy. On the msnbc dataset, for users who navigated between 1 and 14 pages,
predicting future visits with the Grey Moving Probability variable length Markov model achieves a
maximum of 96% success. On the cti dataset, for users who navigated between 3 and 7 pages, the model
achieves a maximum of 93% success. On the msweb dataset, for users who navigated between 1 and 12 pages,
it achieves a maximum of 91% success.
REFERENCES
[1] R. Kosala, H. Blockeel, Web mining research: a survey, ACM SIGKDD Explorations Newsletter (1)
(2000) 1–15.
[2] F. M. Facca and P. L. Lanzi, "Mining interesting knowledge from web logs: a survey," Data & Knowledge
Engineering, vol. 53, pp. 225-241, 2005.
[3] J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web usage mining: discovery and applications of usage
patterns from Web data, SIGKDD Explorations 1 (2) (2000) 1–12.
[4] Doru Tanasa and Brigitte Trousse, "Advanced Data Preprocessing for Intersites Web Usage Mining", IEEE
Intelligent Systems, March/April 2004, pp. 59-65.
[5] R. Cooley, B. Mobasher, and J. Srivastava, "Data Preparation for Mining World Wide Web Browsing
Patterns," Knowledge and Information Systems, vol. 1, pp. 5-32, 1999.
[6] M. Eirinaki, M. Vazirgiannis, Web mining for Web personalization, ACM Transactions on Internet
Technology 3 (1) (2003) 1–27.
[7] R. Liu, V. Keselj, "Combined mining of Web server logs and web contents for classifying user navigation
patterns and predicting users' future requests", Data & Knowledge Engineering, Elsevier, 2007, pp. 304-330.
[8] T. W. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal, "From user access patterns to dynamic hypertext
linking", Fifth International World Wide Web Conference, 1996.
[9] B. Mobasher, R. Cooley, J. Srivastava, "Automatic personalization based on web usage mining",
Communications of the ACM, 43(8), pp. 142–151, 2000.
[10] M. Nakagawa, B. Mobasher, "A hybrid web personalization model based on site connectivity", WebKDD,
pp. 59-70, 2003.
[11] R. Baraglia, F. Silvestri, "Dynamic personalization of Web Sites Without User Intervention",
Communications of the ACM, 2007, pp. 63-67.
[12] R. Baraglia, F. Silvestri, "An online recommender system for large Web sites", Web
Intelligence, IEEE/WIC/ACM, pp. 20–24, 2004.
[13] Mehrdad Jalali, Norwati Mustapha, Ali Mamat, Md. Nasir B Sulaiman, "A new classification model for
online predicting users' future movements", IEEE, 2008.
[14] M.-S. Chen, J. S. Park, and P. S. Yu. Data mining for path traversal patterns in a web environment. In 16th
International Conference on Distributed Computing Systems, pages 385–392, May 1996.
[15] D. Cheung, B. Kao, and J. Lee. Discovering user access patterns on the worldwide web. In 1st Pacific-Asia
Conference on Knowledge Discovery and Data Mining (PAKDD'97), February 1997.
[16] M. Spiliopoulou and L. C. Faulstich. Wum: A tool for web utilization analysis. In EDBT Workshop
WebDB'98, Valencia, Spain, March 1998.
[17] M. Baumgarten, A. G. Büchner, S. S. Anand, M. D. Mulvenna, and J. G. Hughes. User-driven
navigation pattern discovery from internet data. In International ACM Workshop on Web Usage Analysis
and User Profiling (WebKDD'99), pages 74–91, 1999.
[18] B. Berendt. Web usage mining, site semantics, and the support of navigation. In Workshop Web Mining for
E-Commerce - Challenges and Opportunities, Boston, MA, August 2000.
[19] M. Hansen and E. Shriver. Using navigation data to improve IR functions in the context of web search. In
CIKM, pages 135–142, 2001.
[20] F. Masseglia, P. Poncelet, and M. Teisseire. Using data mining techniques on web access logs to
dynamically improve hypertext structure. ACM SigWeb Letters, 8(3):13–19, October 1999.
[21] T. W. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal. From User Access Patterns to Dynamic
Hypertext Linking. In 5th World Wide Web Conference (WWW'96), Paris, France, May 1996.
[22] E. Frias-Martinez and V. Karamcheti. A prediction model for user access sequences. In WEBKDD
Workshop: Web Mining for Usage Patterns and User Profiles, July 2002.
[23] J. Bollen, H. V. de Sompel, and L. M. Rocha. Mining associative relations from website logs and their
application to context-dependent retrieval using spreading activation. In Workshop on Organizing Web Space
(WOWS), Berkeley, California, August 1999.
[24] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. Using sequential and non-sequential patterns for predictive
web usage mining tasks. In Proceedings of the IEEE International Conference on Data Mining (ICDM’2002),
Maebashi City, Japan, December 2002.
[25] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal occurrences. In
Knowledge Discovery and Data Mining, pages 146–151, 1996.
[26] W. Gaul and L. Schmidt-Thieme. Mining web navigation path fragments. In Workshop on Web Mining for
E-Commerce – Challenges and Opportunities, pages 319–322, Boston, MA, August 2000.
[27] Mathias Géry, Hatem Haddad. "Evaluation of Web Usage Mining Approaches for User's Next Request
Prediction", WIDM'03, November 7–8, 2003, New Orleans, Louisiana, USA.
[28] Sifeng Liu and Yi Lin, "Grey Information Theory and Practical Applications", Springer
Science+Business Media, springer.com.
[29] Ch. Bindu Madhuri and J. A. Chandulal, "Analysis of the Navigation Behavior of the Users' using Grey
Relational Pattern Analysis with Markov Chains," International Journal of Engineering Science and Technology,
Vol. 2(10), 2010, 5402-5412.
[30] Ch. Bindu Madhuri and J. A. Chandulal, "Analysis of Users' Web Navigation Behavior using GRPA with
Variable Length Markov Chains," International Journal of Data Mining & Knowledge Management Process
(IJDKP), Vol. 1, No. 2, March 2011.