International Journal of Computer Application (2250-1797)
Volume 6, No. 2, March-April 2016

A Review on Various Classification Algorithms for Online Shopping Data

Shaffy Goyal
Dept. of Computer Science and Engineering
Giani Zail Singh Campus College of Engg. & Tech.

Namisha Modi
Dept. of Computer Science and Engineering
Giani Zail Singh Campus College of Engg. & Tech.

Abstract- Data mining is a field used in database management systems. The data mining process extracts relationships between the different attributes available in a dataset. This paper describes several algorithms used in the data mining procedure, in particular classifiers based on rules and on distances, and gives brief information about each of them. It also describes the attributes available in an online shopping dataset and the classification approaches that can be applied to such a dataset.

Keywords- K-means, KNN, Naïve Bayes, Lazy algorithm

1. INTRODUCTION TO DATA MINING

1.1 WHAT IS DATA MINING?
Data mining is the field of computer science that deals with the recognition of patterns in data sets using techniques such as artificial intelligence, machine learning, and neural networks. It involves identifying patterns in data and then converting those patterns into a form that a user can act on.

Data mining is used for decision making and for forecasting future market trends. Many organizations now use data mining as a data analysis tool to deal with a competitive environment. Data mining tools and techniques can be used to analyze various market trends. The main objective of this paper is to study the application of data mining in determining consumer online shopping attitudes and behavior.

1.2 CHARACTERISTICS OF DATA MINING
a) Data mining is used to identify patterns in large data sets and hence to analyze future trends.
b) Data mining can predict the possible outcomes of a study.
c) Data mining can answer questions that cannot be addressed through simple queries.
d) Data mining can create actionable information on which a user can rely.
e) Data mining is used to remove the effect of redundancy on the decision-making process.

1.3 APPLICATIONS OF DATA MINING
Data mining is defined as the extraction of information from huge data sets. The extracted information can be used for the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Sports
Internet surfing

1.4 DATA MINING TASK PRIMITIVES
A data mining task is specified in the form of a data mining query, which is the input to the system. The query is defined in terms of task primitives. The first primitive is the set of task-relevant data to be mined, that is, the portion of the database in which the user is interested. This portion includes the database attributes and data warehouse dimensions of interest. Another primitive is the kind of knowledge to be mined, which refers to the kind of functions to be performed. These functions include characterization, discrimination, and association and correlation analysis.

Task primitives are also based on background knowledge, which allows data to be mined at multiple levels of abstraction. Concept hierarchies, for example, are one form of background knowledge that allows data to be mined at multiple levels of abstraction. Interestingness measures and thresholds for pattern evaluation are also task primitives. They are used to evaluate the patterns discovered by the knowledge discovery process.
There are different interestingness measures for different kinds of knowledge.

2. RELATED WORK

Ling Liu [7]: With the advance of internet technologies in recent years, online shopping has become a popular way to make purchases compared with traditional ways. It offers many benefits to both consumers and businesses operating online. However, the increasing number of fraudulent online transactions can also cause great damage to a business. To improve the online shopping experience, there is a great need to reduce and prevent fraudulent activities.

Rana Alaa El-Deen Ahmed [11] comparatively tests eleven data mining classification techniques to find the classifier best suited to consumer online shopping attitudes and behavior, using a dataset obtained from a large online shopping agency. The results show that the decision table classifier and the filtered classifier give the highest accuracy, while classification via clustering and simple CART give the lowest. The paper also provides a recommender system based on the decision table classifier that helps customers find the products they are searching for on e-commerce web sites. The recommender system learns from information about customers and products and provides appropriate personalized recommendations so that customers can find the desired products.

Paresh Tanna [10] shows how different approaches achieve the objective of frequent pattern mining, along with the complexities required to perform the job. The paper demonstrates the use of the WEKA tool for association rule mining using the Apriori algorithm.

Soo Yeon Chung [3] conducted an extensive review of the online shopping literature and proposed a hierarchy model of online shopping behavior. Forty-seven studies were collected and classified by the variables used.
Some critical points were found concerning the research framework, the methodology, and the lack of cross-cultural comparison, so a cross-cultural model of online shopping was developed, including shopping value, attitudes to online retailers' attributes, and online purchasing, based on the integrated V-A-B model.

3. CLASSIFICATION ALGORITHM

3.1 Bayes' Theorem
Bayesian classification is based on Bayes' Theorem, which is named after Thomas Bayes. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. There are two types of probabilities:
Posterior Probability [P(H|X)]
Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis. According to Bayes' Theorem,

\[ P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \]

3.2 The Naive Bayes Classifier technique
The Naive Bayes classifier is based on Bayes' Theorem and is particularly suited to cases where the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.

To demonstrate the concept of Naïve Bayes classification, consider the example shown in Figure 1. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., to decide to which class label they belong, based on the currently existing objects.

Figure 1: Classification of text points

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which has not been observed yet) is twice as likely to have membership GREEN rather than RED. In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.
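As a rough illustration of the prior probability discussed above and of Bayes' Theorem from Section 3.1, the following Python sketch may help. The function names `class_priors` and `posterior` and all numeric values are hypothetical and are not taken from the paper. The sample only preserves the stated 2:1 ratio of GREEN to RED objects.

```python
from collections import Counter
from fractions import Fraction


def class_priors(labels):
    """Return the prior probability of each class as its share of the labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: Fraction(count, total) for label, count in counts.items()}


def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' Theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical sample with twice as many GREEN objects as RED (counts are made up).
labels = ["GREEN"] * 4 + ["RED"] * 2
priors = class_priors(labels)
print(priors["GREEN"], priors["RED"])  # 2/3 1/3

# Hypothetical likelihood P(X|GREEN) = 1/4 and evidence P(X) = 1/2,
# chosen only to exercise the formula.
print(posterior(priors["GREEN"], Fraction(1, 4), Fraction(1, 2)))  # 1/3
```

Exact fractions are used here so that the 2:1 ratio stays visible in the output. A practical classifier would normally work with floating-point probabilities or log-probabilities.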
Thus, the prior probabilities of the two classes can be written as:

\[ P(\text{GREEN}) = \frac{\text{number of GREEN objects}}{\text{total number of objects}}, \qquad P(\text{RED}) = \frac{\text{number of RED objects}}{\text{total number of objects}} \]

3.3 Lazy Classifier
Lazy learners store the training instances and do no real work until classification time. Lazy learning is a learning method in which generalization beyond the training data is delayed until a query is made to the system, in contrast to eager learning, where the system tries to generalize from the training data before receiving queries. The main advantage of a lazy learning method is that the target function is approximated locally, as in the k-nearest neighbor algorithm. Because the target function is approximated locally for each query to the system, lazy learning systems can solve multiple problems concurrently and deal successfully with changes in the problem domain [5][8].

The disadvantages of lazy learning include the large space required to store the complete training dataset. Noisy training data in particular enlarges the stored case base unnecessarily, because no abstraction is made during the training phase. Another disadvantage is that lazy learning methods are usually slower to evaluate, although this is coupled with a faster training phase.

3.4 K-Means Clustering Algorithm
The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters. Initially, k so-called centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster. In Praat, each centroid is an existing data point in the given input data set, picked at random, such that all centroids are unique (that is, for all centroids c_i and c_j with i ≠ j, c_i ≠ c_j). These centroids are used to train a K-NN classifier. The resulting classifier is used to classify the data (using k = 1) and thereby produce an initial randomized set of clusters. Each centroid is thereafter set to the arithmetic mean of the cluster it defines.
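The k-means procedure described in this section can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the implementation used in Praat or WEKA. The names `nearest`, `mean`, and `k_means`, the `seed` and `max_iter` parameters, and the sample points are all hypothetical. Keeping the old centroid when a cluster becomes empty is also an assumption, since the text does not cover that case.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (1-nearest-neighbor rule)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Arithmetic mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    """Cluster data (a list of numeric tuples) into k clusters."""
    rng = random.Random(seed)
    # Initial centroids: k distinct existing data points, picked at random.
    centroids = rng.sample(sorted(set(data)), k)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in data:
            clusters[nearest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # Assumption: an empty cluster keeps its previous centroid.
        new_centroids = [
            mean(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    labels = [nearest(point, centroids) for point in data]
    return centroids, labels


# Hypothetical two-dimensional data forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

The loop stops once an update leaves every centroid unchanged, which corresponds to the centroids stabilizing. The cluster numbers in `labels` depend on the random initial choice, so only the grouping of points is meaningful.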
The process of classification and centroid adjustment is repeated until the values of the centroids stabilize. The final centroids are used to produce the final classification/clustering of the input data, effectively turning the set of initially anonymous data points into a set of data points, each with a class identity.

3.5 K nearest Neighbors Classification
K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). KNN was already in use in statistical estimation and pattern recognition as a nonparametric technique at the beginning of the 1970s. A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors as measured by a distance function. If K = 1, the case is simply assigned to the class of its nearest neighbor.

For two cases x and y with n numerical attributes, the usual distance functions are:

\[ d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

\[ d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]

\[ d_{\text{Minkowski}}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{q} \right)^{1/q} \]

Here q is the order of the Minkowski distance: q = 1 gives the Manhattan distance and q = 2 gives the Euclidean distance.

All three distance measures are valid only for continuous variables. For categorical variables, the Hamming distance must be used. When the dataset contains a mixture of numerical and categorical variables, this also raises the issue of standardizing the numerical variables to the range 0 to 1.

Choosing the optimal value for K is best done by first inspecting the data. In general, a large K value is more precise because it reduces the overall noise, but there is no guarantee. Cross-validation is another way to determine a good K value retrospectively, by using an independent dataset to validate it. Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1NN.

3.6 DATASET USED
The dataset was obtained from a highly reputable online shopping agency that sells only online. It is composed of the online ordering log file for three months and consists of 304 instances and 26 attributes. Ten-fold cross-validation is used to test the accuracy of the selected classification methods. In ten-fold cross-validation, the dataset is divided equally into 10 folds (partitions) with the same distribution. In each test, 9 folds are used for training and one fold is used for testing (unseen data). The test procedure is repeated 10 times.

| Attribute | Description |
| --- | --- |
| Personal information | Includes serial number, buyer name, gender, and age. |
| Educational level | Describes the buyer's educational level, classified into categories from 1 to 10: (1-3) Graduated, (4-6) Master, (7-10) PhD. |
| Brand | Describes the product brand name. |
| Product name | Describes the product name. |
| Item description | Describes the product specification. |
| Category | Describes the product category. |
| Quantity | Describes the product quantity ordered per order. |
| Price | Describes the product price. |
| Item Type | Describes the different types of the product. |
| Payment Method | Describes the order payment method, classified here into three methods: cash on delivery (COD), credit card, and buyer web site account. |
| Number of visits | Describes the number of buyer visits to the web site page. |
| Duration of visit | Describes the duration of the buyer's visit, measured in minutes. |
| Rating | Describes the buyer's rating of the product on a scale from 1 to 5, where 1 represents poor and 5 represents excellent. |
| User Satisfaction of the product | Describes user satisfaction with the product, rated from 1 to 100, where 100 represents highly satisfied and 1 represents not satisfied. |
| Best deal | Describes the best offer for the product: 1 represents Yes and 2 represents No. |
| Number of likes | Describes the number of likes for the product, ranging from 0 to 100. |
| Positive comments | Describes the number of positive comments for a certain product. |
| Negative comments | Describes the number of negative comments for a certain product. |
| Number of posts | Describes the number of posts written on the web page. |
| Facebook | Describes the number of followers on Facebook. |
| Instagram | Describes the number of followers on Instagram. |
| Twitter | Describes the number of followers on Twitter. |

Table 1: Various attributes available in the dataset

4. BLOCK DIAGRAM OF THE PROCESS

[Figure 2 block labels: Collecting data set and data preparation; Log file; Explore and clean data; Data conversion using appropriate tool; Data conversion; WEKA; Applying data mining algorithm and evaluation; Appropriate data mining algorithm; Evaluation]

Figure 2: Block diagram of experiment components used for selecting best classifier performance

The process starts with the collection of data from the online company. In the next phase the data is explored and cleaned, and in the data conversion phase it is transformed into files suitable for different data mining tools. Finally, different classification algorithms are applied to the data set, and a comparative study is carried out to show which classifier performs best on the dataset.

5. CONCLUSION AND FUTURE SCOPE
In the data mining process, the attributes of a dataset are used for classification in order to extract hidden patterns from it. Classification relies on classifiers that divide the dataset into different classes. This paper has reviewed classifiers used in the data mining process: rule-based, clustering-based, and distance-based classifiers were studied with a view to selecting the optimal classifier for data classification. Classification is necessary for online shopping data because of the huge amount of redundant information in the dataset. In the future, the optimal classifier can be used in real-life applications of data mining.
REFERENCES
[1] D. Burdick, M. Calimlim and J. Gehrke, "GenMax: An Efficient Algorithm for Mining Maximal Frequent Item sets", in Data Mining and Knowledge Discovery, 2005.
[2] S. Jie, S. Peiji and F. Jiaming, "A Model for adoption of online shopping: A perceived characteristics of Web as a shopping channel view", in Service Systems and Service Management, 2007 International Conference, 2007.
[3] C. Park, "Online shopping behavior model: A literature review and proposed model", in Advanced Communication Technology, 2009. ICACT, 11th International Conference, 2009.
[4] A. Meenakshi and D. Alagarsamy, "Efficient Storage Reduction of Frequency of Items in Vertical Data Layout", International Journal on Computer Science and Engineering, vol. 3, 2011.
[5] M. Rezaul Karim, J. Jo, B. Jeong and H. Choi, "Mining E-Shopper's Purchase Rules by Using Maximal Frequent Patterns: An E-commerce Perspective", in Information Science and Applications (ICISA), 2012 International Conference, 2012, pp. 1-6.
[6] K. Devkishin, A. Rizvi and V. L. Akre, "Analysis of factors affecting the online shopping behavior of consumers in UAE", in Current Trends in Information Technology (CTIT), 2013 International Conference, 2013, pp. 220-225.
[7] Ling Liu and Zijiang Yang, "Improving Online Shopping Experience using Data Mining and Statistical Techniques", Journal of Convergence Information Technology (JCIT), vol. 8, no. 6, Mar. 2013.
[8] S. K.S, A. Prabhakaran and T. George K, "Decision support system for CRM in online shopping system", International Journal of Advances in Computer Science and Technology, vol. 3, no. 2, 2014.
[9] D. M. Tank, "Improved Apriori Algorithm for Mining Association Rules", I.J. Information Technology and Computer Science, 2014, pp. 15-23.
[10] P. Tanna and Y. Ghodasara, "Using Apriori with WEKA for Frequent Pattern Mining", International Journal of Engineering Trends and Technology (IJETT), vol. 12, no. 3, 2015, pp. 127-131.
[11] Rana Alaa El-Deen Ahmed, "Performance study of classification algorithms for consumer online shopping attitudes and behavior using data mining", Fifth International Conference on Communication Systems and Network Technologies (CSNT), 2015, pp. 1344-1349.