Proposed Model for Prediction and Visualization of User Tendency towards Purchases using Clickstream Data

Dakshil Shah, Samkeet Shah, Jamshed Shapoorjee
Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Vile Parle, Mumbai
dakshil@outlook.in, samkeet@outlook.in, jamshed994@outlook.com

Kiran Bhowmick
Assistant Professor, Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Vile Parle, Mumbai
kiran.bhowmick@djsce.ac.in

Abstract: Online shopping has captured a fair share of the retail market. Customers prefer to shop online because they can compare the prices of a product across stores, buy products from home or while traveling, and pay using e-payment methods. Owing to this ease of use, e-commerce websites are gaining more customers by the day. As the number of e-commerce websites increases, companies need to discover trends in user behavior so as to boost sales. One such method is the prediction of user purchases: a company can predict that a certain customer is going to purchase a certain product. Online retailers therefore need a good understanding of users' buying and browsing patterns on e-commerce websites so that they can improve their systems to attract and benefit more users. Traditional methods of gathering user data analyze browsing patterns, which primarily indicate user interests. Clickstream data, which captures browsing paths and visiting frequency along with the time spent per category or product, remains a largely unexplored source for predicting user behavior. We therefore propose a novel method of analyzing user buying behavior that predicts the probability of a purchase and visualizes the overall user pattern from the collected clickstream data.

Keywords: Knowledge Discovery, Machine Learning, Data Mining, Purchase Intention

I. RELEVANT WORK

A. Boroujerdi et al. [1] discuss the efficiency of different machine learning algorithms for designing more accurate models that result in better product recommendations. The model is built by first preprocessing the data, identifying important features, and feeding the data to classifiers that predict the customer's purchasing decision. A voting method is used to combine the decisions of the chosen classifiers. The system is built from features that describe the user's behavior rather than their preferences and past purchases. The dataset used by Boroujerdi et al. has 23 attributes that capture a session's status at a given moment during the shopping process of a particular customer. The features fall into three groups: user-related features such as age and gender; session-related features such as start hour; and basket-related features such as the value and number of items in the basket.

Data pre-processing:
1. Filling missing values: Using the fully available features, the training data is first clustered. In each cluster, missing values are filled with the average of the available values. Based on the silhouette coefficient, Boroujerdi et al. concluded that the EM algorithm was more efficient than the OPTICS or k-means clustering algorithms.
2. Merging transactions: As each session has multiple transactions or entries, they are merged into a single record per session.
For features where only the last value matters, the last value is retained and the rest are discarded. Based on correlation or standard deviation, related features are merged to create new features.
3. Feature extraction: A Deep Belief Network (DBN) is used as an automated feature extraction tool that combines feature engineering and learning to extract the most useful features.
4. Feature selection and elimination: Boroujerdi et al. use backward elimination for feature selection. Outliers, such as customers with far more visits to the site than average during a period, were removed from the dataset so as to achieve a better result.

Classification methods: Among the many available methods, Boroujerdi et al. found tree-based and rule-based methods to perform better. The model uses ConjunctiveRule, Ridor, and JRip from the rule-based family; C4.5, LMT (Logistic Model Trees), ADTree (Alternating Decision Tree), FT (Functional Trees), RandomForest, and RandomTree from the tree-based family; and Logistic Regression, RBFNetwork (Radial Basis Function Network), Multilayer Perceptron, SPegasos (SVM using stochastic gradient descent), and SVM from the function-based family.

Voting: After classification, a weighted voter combines the classification models based on the predicted instances. A KNN classification model is trained on top of the previous outputs so as to gain better results from the voter.

Using the model: As the model is intended for a live website, the logging system records user actions. As the user browses the website, data is added to the session records, which are merged and then classified by the model.

Experimental results: The seven algorithms with the highest prediction accuracy were passed to the voting mechanism. C4.5, LMT, RandomForest, REPTree, Multilayer Perceptron, JRip, and Ridor were selected because together they covered most of the data set with correct predictions: records misclassified by one algorithm were classified correctly by another.

B. Qiang Su et al. [2] aim to classify user interest patterns from gathered clickstream data and cluster them with their own clustering algorithm, whose efficiency they test experimentally. They first define user interest indicators, then measure the similarity between user patterns using three judging parameters, and finally cluster user interests based on the similarities in the sequences. They use a dataset with 10,000 entries as the test dataset. The model proceeds as follows; the user interest measurement parameters are:

Category visiting path: The browsing path of a user is the sequence of webpages visited from the start of the session to its end, or until the final product page reached for a purchase. In addition, the authors use the category visiting path, i.e., the sequence of categories browsed during the session.

Visiting frequency: The visiting frequency is the ratio of the total number of visits to a particular product category to the total length of the visiting path, i.e., the total number of hops to categories in the session.

Relative duration: The relative duration is the ratio of the total time a user spends on a particular category to the user's total session time. The time spent on a category is accumulated to both the category itself and its parent node, and the duration of each category contributes to the total.
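To make these indicators concrete, the sketch below computes the visiting frequency and relative duration per user and category from a toy click log. This is only an illustrative sketch, not code from [2]; the record layout (user id, category, dwell time in seconds) and all names are our own assumptions.

from collections import defaultdict

# Toy click records: (user_id, category, dwell_seconds). The layout is assumed;
# Su et al. [2] do not prescribe a concrete schema.
clicks = [
    ("u1", "shoes", 30), ("u1", "shoes", 45), ("u1", "bags", 15),
    ("u2", "bags", 60), ("u2", "shoes", 20),
]

def interest_indicators(clicks):
    """Per user: visiting frequency and relative duration for each category."""
    visits = defaultdict(lambda: defaultdict(int))      # user -> category -> number of visits
    duration = defaultdict(lambda: defaultdict(float))  # user -> category -> seconds spent
    for user, category, dwell in clicks:
        visits[user][category] += 1
        duration[user][category] += dwell
    indicators = {}
    for user in visits:
        path_length = sum(visits[user].values())      # total hops to categories in the session
        session_time = sum(duration[user].values())   # total session time
        indicators[user] = {
            cat: {
                "frequency": visits[user][cat] / path_length,
                "relative_duration": duration[user][cat] / session_time,
            }
            for cat in visits[user]
        }
    return indicators

print(interest_indicators(clicks))

The per-category frequency and duration values obtained this way form the vectors whose cosine similarities are used in the next step.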
Similarity in user interests:

Frequency similarity: The frequency similarity between two users user(p) and user(q) is defined as the cosine similarity of their frequency vectors.

Duration similarity: Analogously, the duration similarity is defined as the cosine similarity of their duration vectors.

Path similarity: The path similarity between two users user(p) and user(q) is defined as the common path length divided by the maximal path length, where a common path is a common segment of the two category paths. If there is more than one common path between two users, the longest one is used in the calculation.

Total similarity: total similarity = α × duration similarity + β × frequency similarity + γ × path (sequence) similarity, where α, β, and γ adjust the weights of the three dimensions of sequence, frequency, and duration, and sum to 1.

Clustering: The authors develop a rough leader clustering algorithm based on rough set theory. The leader clustering calculation is based on users' similarities in browsing behavior, while rough set analysis makes it possible for a user to be assigned to more than one cluster. The algorithm starts with a randomly selected object as the initial leader. For each object in the data set, the similarity between the object and the leader is calculated; if the similarity meets a predefined threshold, the object is assigned to the cluster represented by that leader, otherwise the object becomes a new leader. This procedure is repeated until every object is either assigned to a cluster or regarded as a leader. A threshold limits the number of users per cluster and thus controls the overall similarity between two users.

Experimental results: In a case study comparing the proposed algorithm with alternatives such as the k-medoids clustering algorithm, the rough leader clustering algorithm was found to perform better and to provide accurate results on a large dataset.

C. Antonellis et al. [3], in their paper "Algorithms for Clustering Click Stream Data", describe various methods for clustering clickstream data. Clustering groups a set of objects such that objects in the same group are more similar to each other than to members of neighboring groups; these groups are called clusters. Clustering clickstream data is not the same as clustering ordinary data. Many tools ignore detailed time and sequence information in order to reduce the space and time needed to process the data, but studies show that this ignored information makes a considerable difference to the quality of the clusters created for particular websites. The paper extends a previously suggested model [4]. In [4], the authors use a framework divided into two components: an online component that uses micro-clustering and routinely stores detailed summary statistics, and an offline component that makes use of a pyramidal time frame. The present paper adds another phase and uses a three-phase architecture instead of a two-phase one. The online component keeps the summarized information in temporary memory; rather than relying on a conventional storage system, the web log data is merged, and the online component maintains a fixed-size matrix that summarizes the log data from the users instead of using micro-clusters.
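As an illustration of such an online summary (our sketch, not the implementation in [3] or [4]), the fixed-size structure can be pictured as a matrix of counters with one row per user and one column per page category; the user ids, categories, and function names below are hypothetical.

import numpy as np

# Hypothetical fixed vocabularies; a real system would derive these from the site.
users = ["u1", "u2", "u3"]
categories = ["home", "shoes", "bags", "checkout"]
user_index = {u: i for i, u in enumerate(users)}
cat_index = {c: j for j, c in enumerate(categories)}

# Fixed-size matrix of counters summarizing the click log (users x categories).
summary = np.zeros((len(users), len(categories)), dtype=np.int64)

def record_click(user_id, category):
    """Online step: update the summary for one incoming web log entry."""
    summary[user_index[user_id], cat_index[category]] += 1

for user_id, category in [("u1", "shoes"), ("u1", "checkout"), ("u2", "bags"), ("u3", "shoes")]:
    record_click(user_id, category)

print(summary)

Each row of such a matrix is the kind of array of counters that the offline component can later hand to a clustering algorithm.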
The offline component can comprise a number of different clustering methods that work simultaneously and with mutual exclusion, so that they provide different sets of data clusters depending on the attributes supplied to the system. The authors propose a three-phase architecture that divides the process as follows: 1) an online component that automatically stores continuous, compressed data; 2) a group of offline components that use this compressed data to generate clusters; and 3) a meta-clustering method that finds trends in the generated clusters so that they can further be used to generate a long-term clustering. The clustering algorithm they choose to apply is the widely used k-means algorithm [5]. Although it may not work well with categorical attributes, it works well with numerical attributes, for which it has sound statistical and geometric interpretations. K-means takes an array of counters as input and clusters it; each run of the k-means algorithm generates results that can be stored as clusters. Finally, after the clusters are created, the authors apply meta-clustering, which classifies users according to their behavior in the time period under consideration. The experimental results provide an overview of user behavior and help recognize clusters that identify users with stable preferences about the content they view. Hence the three-phase model presented in [3] enriches static user data with dynamic information by providing dependable and sophisticated measures.

II. ALGORITHMS AND DATASET FOR PROPOSED MODEL

A. Dataset
We use a dataset obtained from YOOCHOOSE, collected from user visits to a European retailer's website over a period of six months. The dataset consists of two files, clicks and buys; the clicks data contains information about every click made by a user on the website. The data first undergoes preprocessing, which involves filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies. The data is then reduced using data reduction, which yields a reduced representation of the dataset that is smaller in volume than the original but still produces the same analytical results. Preprocessing is followed by feature selection and feature extraction. Feature selection is performed so as to make the data easier to interpret with respect to the system and to reduce training time. Based on the selected features, the system is trained to predict a user's tendency to buy an item in future sessions. The dataset is also used to visualize trends in user buying behavior, such as the peak buying time, day, or month of the year. The trained model is then used to predict the probability of buying, i.e., a yes or no for the purchase, as well as which item the user is going to buy.

B. Feature Extraction
Features are functions of the original measurement variables that help in classification. Feature extraction is the process of defining a set of features that efficiently represent the information important for the classification problem. The goal of feature extraction is to improve the effectiveness and efficiency of analysis and classification. By reducing the dimensionality of the input set, correlated information is eliminated, possibly at some cost in accuracy.
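As an illustrative sketch of session-level attribute construction on the clicks and buys files (our own example, not part of [1]-[3]; it assumes the column layout of the published YOOCHOOSE files and that pandas is available, and the file paths and derived feature names are hypothetical):

import pandas as pd

# Assumed column layout of the YOOCHOOSE files; adjust names and paths to the actual data.
clicks = pd.read_csv("yoochoose-clicks.dat", header=None,
                     names=["session_id", "timestamp", "item_id", "category"],
                     parse_dates=["timestamp"])
buys = pd.read_csv("yoochoose-buys.dat", header=None,
                   names=["session_id", "timestamp", "item_id", "price", "quantity"],
                   parse_dates=["timestamp"])

# Construct session-level attributes from the raw clicks (attribute construction).
sessions = clicks.groupby("session_id").agg(
    n_clicks=("item_id", "size"),
    n_items=("item_id", "nunique"),
    n_categories=("category", "nunique"),
    duration_sec=("timestamp", lambda t: (t.max() - t.min()).total_seconds()),
    start_hour=("timestamp", lambda t: t.min().hour),
)

# Label each session: 1 if it appears in the buys file, else 0.
sessions["bought"] = sessions.index.isin(buys["session_id"].unique()).astype(int)
print(sessions.head())

Session attributes of this kind, together with a bought/not-bought label, are the sort of input the classifiers discussed in the following subsections operate on.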
Feature space expansion involves generating new features based on the original ones instead of reducing dimensionality. Attribute construction is the process of constructing new attributes from the given attributes so as to improve the accuracy and the understanding of structure in high-dimensional data. Accuracy here measures the ratio of correctly classified instances across all classes. Feature subset selection removes redundant features from the data set, since they can reduce classifier accuracy or clustering quality and increase computational cost (Blum and Langley, 1997; Koller and Sahami, 1996). The advantage is that no information about any single retained feature is lost. However, if only a small set of features is required and the original features are very diverse, information may be lost, which in turn reduces classification accuracy. The dataset used by us contains many hidden features that play a vital role in reaching the goal of our project.

C. Random Forest
Random forest uses a collection of decision trees: the algorithm constructs multiple decision trees and combines their outputs to obtain the final result. Random forest is a very powerful algorithm that can be applied in a wide range of applications. Its fundamental idea is to aggregate several binary decision trees, each built on a bootstrap sample drawn from the learning sample L, choosing at each node a random subset of the explanatory variables X.

D. SVM
A Support Vector Machine (SVM) uses the concept of decision planes to define boundaries that classify data. In a two-dimensional space, a line separates and classifies the data set into two parts. SVM carries out classification primarily by constructing hyperplanes in a multidimensional space, producing separate regions that ultimately classify the data [6]. For linearly separable data, the optimum separating hyperplane is found first; the solution is then represented as a linear combination of the data points that lie on the margin of the hyperplane. These points are known as support vectors, hence the name support vector machine; the remaining data points are ignored. The model complexity of an SVM is therefore unaffected by the number of features in the training data (the number of data points close to the margins selected by the SVM during learning is generally small). This makes SVM well suited to learning tasks where the number of features is large compared to the number of training samples. Although SVM selects among the possible hyperplanes by maximizing the margin, there remains a possibility that some data is misclassified.

E. Ensemble Learning (Bagging, Boosting)
Bagging is an ensemble machine learning meta-algorithm developed to improve the stability and performance of machine learning algorithms used in regression and statistical classification. Given a standard training dataset D of size n, bagging generates m new training datasets D_i, each of size n′, by sampling from D uniformly and with replacement; as a result, some observations may be repeated in each D_i. If n′ = n, then for large n the set D_i is expected to contain a fraction (1 - 1/e) ≈ 63.2% of the unique examples of D, the rest being duplicates.
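A minimal sketch of this bootstrap sampling step (illustrative only; the stand-in data set, its size, and the number of samples are made up) also verifies the roughly 63.2% unique fraction empirically:

import random

random.seed(0)

n = 10_000              # size of the original training set D
D = list(range(n))      # stand-in records, identified by index
m = 5                   # number of bootstrap samples (one per model)

bootstrap_samples = []
for _ in range(m):
    # Draw n items from D uniformly and with replacement (n' = n).
    D_i = [random.choice(D) for _ in range(n)]
    bootstrap_samples.append(D_i)
    print(f"unique fraction: {len(set(D_i)) / n:.3f}")  # close to 1 - 1/e ≈ 0.632

Each such bootstrap sample would then be used to fit one of the m models described next.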
Such a sample is known as a bootstrap sample. The m models are fitted on the m bootstrap samples and combined by averaging their outputs (for regression) or by voting (for classification).

Boosting is an ensemble machine learning meta-algorithm used mainly for reducing bias, and also variance, in supervised learning; it refers to a family of techniques that converts weak learners into strong ones. Although boosting is not algorithmically constrained, most boosting algorithms iteratively train weak classifiers with respect to a distribution and add them to a final strong classifier. When they are added, they are usually weighted in a way related to each weak learner's performance. The data is reweighted after each weak learner is added: samples that are incorrectly classified gain weight and samples that are correctly classified lose weight. The main difference between the various boosting algorithms lies in their method of weighting training data points and hypotheses. AdaBoost is widely used and is perhaps the most significant historically, as it was the first algorithm that could adapt to weak learners; more recently developed methods such as LPBoost, MadaBoost, TotalBoost, BrownBoost, and LogitBoost can do the same.

III. PROPOSED MODEL

A. Phase 1
The data set is divided into n different random subsets, each comprising two thirds of the whole data set; the remaining one third is used as the test data set. In this paper, we propose a model in which the data sets are classified using a hybrid classification method based on SVM, Random Forest, and boosting. The input clicks and buys data sets are first merged to create one data set. This new data set is then randomly subdivided into subsets, and each item in each subset has a weight factor associated with it. We use SVM to classify the data items in each subset as buy or not buy. If an item is misclassified, its weight factor is increased; otherwise it is decreased. The data sets are rearranged and the process is repeated until the weights settle at very low values. The output is computed by applying a voting mechanism to the classification outputs of all the random subsets.

B. Phase 2
In the second phase, we take the data classified as buys and create a new data set, which is again randomly divided as above. For each item in the dataset, we check whether that particular item will be purchased, using the same model as in Phase 1, where the algorithm runs on each data set n times, n being the number of available items. A voting mechanism then makes it possible to predict which item will be bought.

IV. CONCLUSION
In this paper we have explored various techniques that can be used for purchase prediction using clickstream data. We have divided the process into steps such as preprocessing, feature extraction, training, and testing, and have proposed a model for the same. We plan to use ensemble learning methods combining SVM and Random Forest, and we aim to test our proposed model and compare the results with other models to arrive at the optimal solution.

REFERENCES
[1] Gohari Boroujerdi, E., et al., "A study on prediction of user's tendency toward purchases in websites based on behavior models," Information and Knowledge Technology (IKT), 2014 6th Conference on, IEEE, 2014.
[2] Su, Qiang, and Lu Chen, "A method for discovering clusters of e-commerce interest patterns using click-stream data,"
Electronic Commerce Research and Applications 14.1 (2015): 1-13.
[3] Antonellis, Panagiotis, Christos Makris, and Nikos Tsirakis, "Algorithms for clustering clickstream data," Information Processing Letters 109.8 (2009): 381-385.
[4] Aggarwal, Charu C., Jiawei Han, Jianyong Wang, and Philip S. Yu, "A framework for clustering evolving data streams," Proceedings of the 29th VLDB Conference, 2003, pp. 81-92.
[5] Hartigan, J. A., and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Applied Statistics 28 (1979): 100-108.
[6] Dukart, J., "Support Vector Machine Classification Basic Principles and Application," pp. 19-19, 2012.
[7] Shu, Shiyu, Lihong Ren, Yongsheng Ding, Kuangrong Hao, and Rui Jiang, "SVM optimization algorithm based on dynamic clustering and ensemble learning for large scale dataset," Systems, Man and Cybernetics (SMC), 2014, pp. 2278-2283.
[8] Verikas, A., A. Gelzinis, and M. Bacauskiene, "Mining data with random forests: A survey and results of new tests," Pattern Recognition 44.2 (2011): 330-349.