
Proposed Model for Prediction and Visualization of
User Tendency towards Purchases using Clickstream
Data
Dakshil Shah
Department of Computer Engineering
Dwarkadas J. Sanghvi College of Engineering
Vile Parle, Mumbai
dakshil@outlook.in

Samkeet Shah
Department of Computer Engineering
Dwarkadas J. Sanghvi College of Engineering
Vile Parle, Mumbai
samkeet@outlook.in

Jamshed Shapoorjee
Department of Computer Engineering
Dwarkadas J. Sanghvi College of Engineering
Vile Parle, Mumbai
jamshed994@outlook.com

Kiran Bhowmick
Assistant Professor, Department of Computer Engineering
Dwarkadas J. Sanghvi College of Engineering
Vile Parle, Mumbai
kiran.bhowmick@djsce.ac.in
Abstract: As of today, online shopping has captured a fair share of the retail market. Customers prefer to shop online because they can compare prices of a product across various stores, buy products while at home or traveling, and pay using e-payment methods. Owing to this convenience, e-commerce websites are gaining more customers by the day. As the number of e-commerce websites increases, companies need to discover trends in user patterns so as to boost sales. One such method is the prediction of user purchases: a company can predict that a certain customer is going to purchase a certain product. It is necessary to have a good idea of users' buying and browsing patterns on e-commerce websites so that online retailers can improve their systems to attract and benefit more users. Traditional methods of gathering user data analyze browsing patterns, which primarily indicate user interests. Clickstream data, which captures browsing paths, visiting frequency, and the time spent per category or product, remains a largely unexplored source for predicting user behavior. Hence we propose a method of analyzing user buying behavior, based upon the collected clickstream data, to predict the probability of a purchase as well as visualize the overall user pattern.
Keywords: Knowledge Discovery, Machine Learning, Data
Mining, Purchase Intention
I. RELEVANT WORK
A. Boroujerdi et al. [1] have discussed the efficiency of
different machine learning algorithms for designing more
accurate models that result in better product
recommendations. The model is built by first
preprocessing data, identifying important features, and
utilizing the data in classifiers so as to predict the
customers' purchasing decision. Voting methods are used
to make the decision among chosen classifiers. The
system is built using features which describe the user's behavior rather than his preferences and past purchases. The dataset used by Boroujerdi et al. has 23 attributes which show a session's status at a given moment during the shopping process of a particular customer. The features fall into three groups: user-related features such as age and gender; session-related features such as the start hour; and basket-related features such as the value and number of items in the basket.
Data pre-processing
1. Filling missing values: Using the fully available features, the training data is first clustered. In each cluster, the missing values are filled with the average of the values available in that cluster (a minimal sketch of this cluster-based imputation is given after this list). Based on the silhouette coefficient, Boroujerdi et al. concluded that the EM algorithm was more efficient than the OPTICS or K-means clustering algorithms.
2. Merging transactions: As each session has multiple transactions or entries, they are merged into a single record per session. For features where only the last value matters, only the last value is kept and the rest are eliminated. New features are created by merging existing features based on their correlation or standard deviation.
3. Feature Extraction: An automated feature extraction tool, a Deep Belief Network (DBN), is used to combine feature engineering and learning and to extract the most useful features.
4. Feature Selection and elimination: Boroujerdi et al. have used backward elimination for feature selection. Outliers, such as customers with far more visits to the site than average during a period, were removed from the dataset so as to achieve a better result.
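The following is a minimal sketch of the cluster-based imputation in step 1. It assumes a pandas DataFrame of session records with fully observed numeric columns and one partially missing column, and it uses a Gaussian mixture (EM) from scikit-learn in place of the authors' exact setup; column names are illustrative.

```python
# Hypothetical cluster-then-impute sketch: cluster on the fully observed
# numeric features with EM (Gaussian mixture), then fill each missing value
# with the mean of its cluster. Column names are placeholders.
import pandas as pd
from sklearn.mixture import GaussianMixture  # EM-based clustering

def cluster_impute(sessions: pd.DataFrame, full_cols, target_col, n_clusters=5):
    gmm = GaussianMixture(n_components=n_clusters, random_state=0)
    labels = gmm.fit_predict(sessions[full_cols])   # cluster on complete features
    filled = sessions[target_col].copy()
    for c in range(n_clusters):
        mask = pd.Series(labels == c, index=sessions.index)
        cluster_mean = sessions.loc[mask, target_col].mean()
        # Replace only the missing entries that fall inside this cluster.
        filled[mask & sessions[target_col].isna()] = cluster_mean
    return filled
```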
Classification Methods: Amongst the many available methods, Boroujerdi et al. found tree-based and rule-based methods to be better. The methods used in the model are ConjunctiveRule, Ridor, and JRip from the rule-based family; C4.5, LMT (Logistic Model Trees), ADTree (Alternating Decision Tree), FT (Functional Trees), RandomForest, and RandomTree from the tree-based family; and Logistic Regression, RBFNetwork (Radial-Basis Function Network), Multilayer Perceptron, SPegasos (SVM using stochastic gradient descent), and SVM from the function-based family.
Voting: After classification, a Weighted Voter is used to combine the classification models based on the predicted instances. A KNN classification model is trained on top of the previous outputs so as to gain better results from the voter.
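This scheme resembles stacking. A hedged sketch using scikit-learn's StackingClassifier with a KNN meta-classifier is shown below; the base learners chosen here are only stand-ins for the classifier families listed above.

```python
# Illustrative stacking sketch: base classifiers are combined by a KNN
# model trained on their outputs, in the spirit of the weighted voter.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=8)),
    ("forest", RandomForestClassifier(n_estimators=100)),
    ("logreg", LogisticRegression(max_iter=1000)),
]
# KNN acts as the meta-classifier over the base models' predictions.
voter = StackingClassifier(estimators=base_learners,
                           final_estimator=KNeighborsClassifier(n_neighbors=5))
# voter.fit(X_train, y_train); voter.predict(X_test)
```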
Using the model: As the model is built for a prospective website, the logging system records user actions. As the user browses the website, data is added to the session records, which are merged and then used for classification by the model.
Experimental Results: The seven algorithms with the highest prediction accuracy were passed to the voting mechanism. C4.5, LMT, RandomForest, REPTree, Multilayer Perceptron, JRip, and Ridor were selected, as together they covered most of the data set with correct predictions; records misclassified by one algorithm were often classified correctly by another.
B. Qiang Su et al. [2] aim to identify user interest patterns from the gathered clickstream data and then cluster them using their proposed clustering algorithm, whose efficiency has been tested. They first define the user interest indicators. The similarities between user patterns are then computed based upon three judging parameters, and finally the user interests are clustered based upon the similarities between the sequences. They have used a dataset with 10,000 entries as the test dataset. The model proceeds as follows; the user interest measurement parameters are selected as the following:
Category Visiting Path: The browsing path of the user is the sequence of webpages visited from the start of the session to its end, or until the final product page reached for a purchase. In addition, the authors also use the category visiting path, that is, the sequence of categories browsed during the particular session.
Visiting Frequency: Visiting frequency is defined as the ratio of the total number of visits to a particular product category to the total length of the visiting path, which is the total number of hops to categories in the session.
Relative Duration: It is the ratio of the total time a user spends on a particular category to the total session time of that user. The time spent on a category is accumulated to both the category and its father node; the duration for a particular category, as given in the corresponding formula, is the time which gets added to this total.
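As an illustration, the two frequency- and duration-based indicators could be computed from a click log roughly as follows; the DataFrame layout (columns user, category, and duration in seconds) is an assumption, not the authors' exact schema.

```python
# Illustrative computation of visiting frequency and relative duration per
# user and category. The `clicks` frame with columns "user", "category",
# and "duration" (seconds spent on a page) is a hypothetical layout.
import pandas as pd

def interest_indicators(clicks: pd.DataFrame) -> pd.DataFrame:
    per_user = clicks.groupby("user").agg(
        total_hops=("duration", "size"),   # total length of the visiting path
        total_time=("duration", "sum"),    # total session time
    ).reset_index()
    per_cat = clicks.groupby(["user", "category"]).agg(
        visits=("duration", "size"),
        time=("duration", "sum"),
    ).reset_index().merge(per_user, on="user")
    # Visiting frequency: visits to the category over the total number of hops.
    per_cat["visiting_frequency"] = per_cat["visits"] / per_cat["total_hops"]
    # Relative duration: time on the category over the total session time.
    per_cat["relative_duration"] = per_cat["time"] / per_cat["total_time"]
    return per_cat
```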
Similarity in User interests:
Frequency similarity: The frequency similarity between
two users user(p) and user(q) is defined as a cosine
similarity measure of two vectors.
Duration similarity: Similar to frequency similarity,
duration similarity is also defined as the cosine similarity
measure of two vectors.
Path similarity: The path similarity between two users user(p) and user(q) is defined as the common path length
divided by the maximal path length. A common path is
defined as the common segment in the two category
paths. If there is more than one common path between
two users, the longest one is used in the calculation of
path similarity.
Total Similarity = α × duration similarity + β × frequency similarity + γ × path similarity, where α, β, and γ are used to adjust the weights of the three dimensions of duration (time), frequency, and path (sequence), and their sum is equal to 1.
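The sketch below illustrates how this combined similarity could be computed: cosine similarity for the duration and frequency vectors, a longest-common-segment ratio for the paths, and a weighted blend with α, β, γ. Function names and the default weights are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the total similarity between two users, combining cosine
# similarities (duration, frequency) with a common-path ratio.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def longest_common_segment(path_p, path_q):
    # Length of the longest common contiguous segment of two category paths.
    best = 0
    for i in range(len(path_p)):
        for j in range(len(path_q)):
            k = 0
            while (i + k < len(path_p) and j + k < len(path_q)
                   and path_p[i + k] == path_q[j + k]):
                k += 1
            best = max(best, k)
    return best

def total_similarity(freq_p, freq_q, dur_p, dur_q, path_p, path_q,
                     alpha=0.4, beta=0.3, gamma=0.3):
    # alpha + beta + gamma is assumed to be 1, as in the formula above.
    path_sim = longest_common_segment(path_p, path_q) / max(len(path_p), len(path_q))
    return (alpha * cosine(dur_p, dur_q)
            + beta * cosine(freq_p, freq_q)
            + gamma * path_sim)
```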
Clustering: The authors have developed a rough leader clustering algorithm based upon rough set theory. The
leader clustering calculation is based on users’ similarities
in terms of browsing behavior. At the same time, rough
set analysis makes it possible for a user to be assigned to
more than one cluster. The algorithm starts with a
randomly selected object as the initial leader. For each
object in the data set, we calculate the similarity between
the object and the leader. If the similarity meets the
predefined threshold, then the object is assigned to the
cluster represented by the leader; otherwise, the object
will be regarded as a new leader. These procedures are repeated until all objects are either assigned to a cluster or regarded as leaders. A threshold is set to limit the number of users per cluster and thus controls the overall similarity between two users.
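A minimal sketch of this leader-clustering procedure follows. The similarity function and threshold are assumptions, and the rough-set aspect appears only in the simplified form that an object may join every cluster whose leader it is similar enough to.

```python
# Simplified rough leader clustering: the first object becomes a leader;
# each later object joins every cluster whose leader it is similar enough
# to (rough assignment), or becomes a new leader otherwise.
def rough_leader_clustering(objects, similarity, threshold=0.7):
    leaders = []        # indices of cluster leaders
    clusters = {}       # leader index -> member indices
    for i, obj in enumerate(objects):
        assigned = False
        for leader in leaders:
            if similarity(obj, objects[leader]) >= threshold:
                clusters[leader].append(i)   # may be assigned to several clusters
                assigned = True
        if not assigned:
            leaders.append(i)                # object becomes a new leader
            clusters[i] = [i]
    return clusters
```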
Experimental results: In a case study using the proposed algorithm, compared against algorithms such as K-medoids clustering, it was found that the rough leader clustering algorithm performs better and provides accurate results on a large dataset.
C. Antonellis et al. [3], in their paper Algorithms for Clustering Click Stream Data, describe various methods for clustering clickstream data. Clustering is the method in which a set of objects is grouped such that the objects in the same group are more similar to each other than to the members of neighboring groups; these groups are called clusters. Clustering clickstream data is not exactly the same as clustering ordinary data: many tools ignore detailed time and sequence information in order to cut down on the space and time needed for processing, yet studies show that this ignored information makes a considerable difference in the quality of the clusters created for particular web sites. The paper extends previously suggested models [4]. In [4], the authors use a framework divided into multiple components, namely an online component which uses micro-clustering and routinely stores comprehensive reports and statistics, and an offline component which makes use of a pyramidal time frame. This paper adds another phase and uses a three-phase architecture instead of a two-phase one.
The online component keeps the summarized information in temporary memory; instead of relying on a separate storage system, the web log data is merged directly, and a fixed-size matrix is created to summarize the users' log data instead of using micro-clusters. The offline component can comprise a number of different clustering methods that run simultaneously and mutually exclusively, so that they provide different sets of data clusters depending on the attributes given to the system. The authors propose a three-phase architecture by dividing the process as follows: 1) an online component that automatically stores continuous and compressed data; 2) a group of offline components that use this compressed data to generate clusters; and 3) a meta-clustering method that finds trends in the generated clusters so that they can further be used to generate a long-term clustering. The clustering algorithm they have chosen to apply is the widely used k-means algorithm [5].
Although k-means may not work well with categorical attributes, it works well with numerical attributes, for which it has sound statistical and geometric meaning. K-means takes the array of counters as input and clusters it; each run of the algorithm generates results that can be stored as clusters. Finally, after the clusters are created, the authors apply meta-clustering, which classifies users according to their behavior in the time period under consideration. The experimental results provide an overview of user behavior and help in recognizing clusters that identify users with constant preferences about the content they view. Hence the three-phase model presented in [3] enriches static user data with dynamic information, providing dependable and sophisticated measures.
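As a rough illustration of the offline phase, the sketch below assumes the online component has already reduced each user's click stream to a fixed-size row of page or category counters (synthetic numbers stand in for that matrix) and runs k-means over it.

```python
# Illustrative offline phase: k-means over a fixed-size counter matrix.
# The synthetic Poisson counts only stand in for the matrix that the online
# component would build from the web log.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
counter_matrix = rng.poisson(lam=2.0, size=(500, 20))  # 500 users x 20 page counters

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(counter_matrix)
# Each label groups users with similar counter profiles; repeated runs over
# successive time windows would feed the meta-clustering step.
```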
II. ALGORITHMS AND DATASET FOR PROPOSED
MODEL
A. Dataset
We use a dataset obtained from YOOCHOOSE, collected from user visits to a European retailer's website over a period of six months. The dataset consists of two files, clicks and buys. The clicks data contains information about every click made by a user on the website. The data first undergoes preprocessing, which involves filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies. The data is then reduced using data reduction, which yields a representation of the dataset that is smaller in volume than the original but still produces the same analytical results. Preprocessing is followed by feature selection and feature extraction. Feature selection is done to make the data easier to interpret with reference to the system and to reduce training time. Based upon the selected features, the system is trained to predict the user's tendency to buy the item in future sessions. The dataset is also used to visualize trends in user buying behavior, such as the peak buying time, day, or month of the year. The trained model is then used to predict the probability of buying, i.e. a yes or no for the purchase, as well as which item the user is going to buy.
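A hedged sketch of the first step is shown below: the clicks and buys files are loaded and each click session is labeled with whether it ended in a purchase. The column names follow the published YOOCHOOSE file description, but the file paths and the exact cleaning rules are placeholders.

```python
# Load the YOOCHOOSE clicks and buys files and label click sessions.
# File paths are placeholders; adjust to the local copy of the dataset.
import pandas as pd

clicks = pd.read_csv("yoochoose-clicks.dat", header=None,
                     names=["session_id", "timestamp", "item_id", "category"])
buys = pd.read_csv("yoochoose-buys.dat", header=None,
                   names=["session_id", "timestamp", "item_id", "price", "quantity"])

# Label each click session: 1 if the session also appears in the buys file.
buying_sessions = set(buys["session_id"])
clicks["bought"] = clicks["session_id"].isin(buying_sessions).astype(int)

# Parse timestamps and drop rows that cannot be parsed (basic inconsistency cleanup).
clicks["timestamp"] = pd.to_datetime(clicks["timestamp"], errors="coerce")
clicks = clicks.dropna(subset=["timestamp"])
```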
B. Feature Extraction
Features are functions of the original measurement variables
that help in classification. Feature extraction is the process of
defining a set of features, which will efficiently represent the
information which is important for the classification problem.
The goal of feature extraction is to improve the effectiveness
and efficiency of analysis and classification. By reducing the
dimensionality of the input set, correlated information is
eliminated at the cost of accuracy. Feature space expansion
involves generating new features based on the original ones
instead of reducing dimensionality. Attribute construction is
the process of constructing new attributes from the given
attributes, so as to help improve the accuracy and
understanding of structure in high-dimensional data. Accuracy
measures the ratio of correctly classified instances across all
classes. Feature subset selection removes redundant features
from the data set as they can lead to a reduction of the
classifier accuracy or clustering quality and lead to an increase
in computational cost (Blum and Langley, 1997), (Koller and
Sahami, 1996). The advantage is that no information about a
single feature is lost. However, if a small set of features is
required and the original features are very diverse, information
may be lost, which in turn will reduce the accuracy of
classification. The dataset used by us contains many hidden features which play a vital role in reaching the goal of our project.
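As an illustration of such constructed attributes, the sketch below derives simple session-level features from the click log (the clicks frame from the preprocessing sketch above). The particular feature list is an assumption, not the final feature set.

```python
# Illustrative attribute construction: session-level features such as number
# of clicks, distinct items and categories, session duration, and start hour.
import pandas as pd

def session_features(clicks: pd.DataFrame) -> pd.DataFrame:
    grouped = clicks.groupby("session_id")
    features = grouped.agg(
        n_clicks=("item_id", "size"),
        n_items=("item_id", "nunique"),
        n_categories=("category", "nunique"),
        start=("timestamp", "min"),
        end=("timestamp", "max"),
    )
    features["duration_sec"] = (features["end"] - features["start"]).dt.total_seconds()
    features["start_hour"] = features["start"].dt.hour
    return features.drop(columns=["start", "end"])
```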
C. Random Forest
Random forest uses a collection of decision trees: the algorithm constructs multiple decision trees and combines their outputs to obtain the final solution. The random forest algorithm is a very powerful algorithm that can be applied in various applications. Its fundamental idea is to aggregate several binary decision trees constructed using a variety of bootstrap samples drawn from the learning sample L, choosing at random, at each node, a subset of the explanatory variables X.
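A minimal scikit-learn sketch of this idea follows; the synthetic data from make_classification merely stands in for the session-level feature matrix and buy/no-buy labels, and the hyperparameters are illustrative.

```python
# Random-forest sketch: many trees, each grown on a bootstrap sample with a
# random subset of variables considered at every node.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of bootstrap-sampled trees
    max_features="sqrt",   # random subset of explanatory variables at each node
    random_state=0,
)
forest.fit(X_train, y_train)
print("hold-out accuracy:", forest.score(X_test, y_test))
```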
D. SVM
Support Vector Machine makes use of the decision plane concept to define boundaries that classify data. In a two-dimensional space, a line separates and classifies the data set into two parts. SVM carries out classification primarily by constructing hyperplanes in a multidimensional space, leading to separate regions which ultimately classify the data [6]. In the case of linearly separable data, the optimum separating hyperplane is found first; the solution is then represented as a linear combination of the data points that lie on the margin of the hyperplane. These points are known as support vectors, hence the name support vector machines. Other data points are ignored.
Therefore, the model complexity of an SVM stays unaffected
irrespective of the number of features present in the data that
is used for training (the number of data points close to the
margins that are selected by the SVM during the learning
phase is generally small). This makes SVM well suited to deal
with tasks in the learning phase where the number of features
is large as compared to the number of training samples.
Although SVM is able to select among multiple possible hyperplanes by ensuring maximum margin, there is still a possibility that the data is misclassified.
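A comparable SVM sketch is given below; the features are standardized first because margins are scale-sensitive, and the commented fit call assumes the same X_train/y_train split as in the random-forest sketch above.

```python
# SVM sketch for the buy/no-buy task: standardize features, then fit a
# maximum-margin classifier with an RBF kernel.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),  # maximum-margin hyperplane in kernel space
)
# svm_model.fit(X_train, y_train)
# print("hold-out accuracy:", svm_model.score(X_test, y_test))
```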
E. Ensemble Learning (bagging, boosting)
Bagging is an ensemble machine learning meta-algorithm developed to improve the stability and performance of machine learning algorithms used in regression and statistical classification. Given a standard training dataset D of size n, bagging generates m new training datasets D_i, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n′ = n, then for large n the set D_i is expected to contain the fraction (1 − 1/e) (≈ 63.2%) of the unique examples of D, the rest being duplicates. This kind of sample is known as a bootstrap sample. The m models are fitted using the m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
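A bagging sketch in this spirit, using scikit-learn's default decision-tree base learner, is shown below; the parameter values are illustrative and the commented fit call assumes the split used earlier.

```python
# Bagging sketch: m bootstrap samples of size n' drawn with replacement,
# one model per sample (decision trees by default), combined by majority vote.
from sklearn.ensemble import BaggingClassifier

bagger = BaggingClassifier(
    n_estimators=50,     # m bootstrap samples, hence m fitted models
    max_samples=1.0,     # each bootstrap sample has size n' = n
    bootstrap=True,      # sample uniformly with replacement
    random_state=0,
)
# bagger.fit(X_train, y_train); y_pred = bagger.predict(X_test)
```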
Boosting is an ensemble machine learning meta-algorithm used primarily for reducing bias, and also variance, in supervised learning; it refers to a family of techniques that turn weak learners into strong ones. Although boosting is not algorithmically constrained, most boosting algorithms iteratively train weak classifiers with respect to a distribution over the data and add them to a final strong classifier. When the weak learners are added, they are typically weighted in a way that is related to their performance. The data is reweighted after the addition of each weak learner: samples that are incorrectly classified gain weight and samples that are correctly classified lose weight. The main difference between the various boosting algorithms lies in their method of weighting training data points and hypotheses. AdaBoost is widely used and is perhaps the most significant historically, as it was the first algorithm that could adapt to weak learners; however, more recently developed methods such as LPBoost, MadaBoost, TotalBoost, BrownBoost, and LogitBoost can do the same.
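For completeness, an AdaBoost sketch follows; again, the parameters are illustrative and the commented fit call assumes the earlier train/test split.

```python
# AdaBoost sketch: weak learners (decision stumps by default) are added
# iteratively; after each round the weights of misclassified samples are
# increased, and the weak learners are combined with performance-based weights.
from sklearn.ensemble import AdaBoostClassifier

booster = AdaBoostClassifier(
    n_estimators=100,    # number of weak learners added sequentially
    learning_rate=0.5,   # shrinks each weak learner's contribution
    random_state=0,
)
# booster.fit(X_train, y_train); y_pred = booster.predict(X_test)
```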
III. PROPOSED MODEL
A. Phase 1
The data set is divided into n different random subsets, each comprising two-thirds of the whole data set; the remaining one-third is used as a test set. In this paper, we propose a model in which the data sets are classified using a hybrid classification method based on SVM, Random Forest, and Boosting. The input clicks and buys data sets are first merged to create one data set. This new data set is then randomly subdivided into subsets, and each item in each subset has a weight factor associated with it. We use SVM to classify the data items in each subset as buy or not buy. If an item is misclassified, its weight factor is increased; otherwise, it is decreased. The data sets are rearranged and the process repeats until the weights settle to a very low value. The output is computed by applying a voting mechanism to the classification outputs of all the random subsets.
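The sketch below is one possible interpretation of this phase, not a reference implementation: a linear SVM is re-fitted on each random subset with the weights of misclassified items increased, and the resulting models vote on the buy/no-buy label. It assumes numpy feature arrays and binary 0/1 labels; the subset count, number of boosting rounds, and reweighting factors are arbitrary choices.

```python
# Hypothetical Phase 1 sketch: boost a linear SVM on random subsets by
# re-weighting misclassified items, then combine all models by majority vote.
import numpy as np
from sklearn.svm import LinearSVC

def boost_subset(X, y, rounds=5):
    weights = np.ones(len(y)) / len(y)
    models = []
    for _ in range(rounds):
        clf = LinearSVC(dual=False)
        clf.fit(X, y, sample_weight=weights)
        wrong = clf.predict(X) != y
        # Increase the weight of misclassified items, decrease the rest.
        weights = np.where(wrong, weights * 1.5, weights * 0.75)
        weights /= weights.sum()
        models.append(clf)
    return models

def vote(models, X):
    preds = np.array([m.predict(X) for m in models])
    # Majority vote over all boosted models (labels assumed to be 0/1).
    return (preds.mean(axis=0) >= 0.5).astype(int)

def phase1(X, y, n_subsets=3, seed=0):
    rng = np.random.default_rng(seed)
    all_models = []
    for _ in range(n_subsets):
        idx = rng.choice(len(y), size=int(2 * len(y) / 3), replace=False)
        all_models.extend(boost_subset(X[idx], y[idx]))
    return all_models

# Usage sketch: models = phase1(X_train, y_train); y_pred = vote(models, X_test)
```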
B. Phase 2
In the second phase, we take the data classified as buys and create a new data set. This data set is again randomly divided as above. For each item in the data set, we check whether this is the item that will be purchased, using the same model as in Phase 1, where the algorithm runs on each data set n times, with n being the number of available items. Thus, using a voting mechanism, it is possible to predict the item that will be bought.
IV. CONCLUSION
In this paper we have explored various techniques that can be
used for purchase prediction using clickstream data. We have
divided the process into various steps like preprocessing,
feature extraction, training and testing and have proposed a
model for the same. We plan to use ensemble learning
methods combining SVM and Random Forest and aim to test
our proposed model and compare the results with other models
to develop the optimal solution.
REFERENCES
[1] Gohari Boroujerdi, E., et al., "A study on prediction of user's tendency toward purchases in websites based on behavior models," Information and Knowledge Technology (IKT), 2014 6th Conference on, IEEE, 2014.
[2] Su, Qiang, and Lu Chen, "A method for discovering clusters of e-commerce interest patterns using click-stream data," Electronic Commerce Research and Applications 14.1 (2015): 1-13.
[3] Antonellis, Panagiotis, Christos Makris, and Nikos Tsirakis, "Algorithms for clustering clickstream data," Information Processing Letters 109.8 (2009): 381-385.
[4] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu, "A Framework for Clustering Evolving Data Streams," Proceedings of the 29th VLDB Conference, 81-92 (2003).
[5] J. A. Hartigan and M. A. Wong, "Algorithm AS136: A k-means clustering algorithm," Applied Statistics, vol. 28, 100-108 (1979).
[6] J. Dukart, "Support Vector Machine Classification Basic Principles and Application," pp. 19-19, 2012.
[7] Shiyu Shu, Lihong Ren, Yongsheng Ding, Kuangrong Hao, and Rui Jiang, "SVM optimization algorithm based on dynamic clustering and ensemble learning for large scale dataset," Systems, Man and Cybernetics (SMC), pp. 2278-2283, 2014.
[8] A. Verikas, A. Gelzinis, and M. Bacauskiene, "Mining data with random forests: A survey and results of new tests," Pattern Recognition, Volume 44, Issue 2, February 2011, Pages 330-349.