International Journal of Computer Application (2250-1797)
Volume 6– No.2, March- April 2016
A Review on Various Classification Algorithms
for Online Shopping Data
Shaffy Goyal
Dept of Computer Science and Engineering
Giani Zail Singh Campus College of Engg. & Tech.
Namisha Modi
Dept of Computer Science and Engineering
Giani Zail Singh Campus College of Engg. & Tech.
Abstract- Data mining is widely used with database
management systems. In the data mining process,
relationships are extracted between the different attributes
available in a dataset. This paper describes several algorithms
used in the data mining procedure. For classification,
classifiers based on rules and on distances are utilized, and
brief information about these algorithms is provided. The
paper also describes the attributes available in an online
shopping dataset and the classification approaches that can
be applied to such a dataset.
Keywords- K-means, KNN, Naïve Bayes, Lazy algorithm
1. INTRODUCTION TO DATA MINING
1.1 WHAT IS DATA MINING?
Data mining is the field of computer science that deals
with the recognition of patterns among different data sets
using techniques such as artificial intelligence,
machine learning and neural networks. It involves
identifying patterns in data and then converting those
patterns into a form that a user can apply.
Data mining is used for decision making and for
forecasting future market trends. Many organizations
now use data mining as a data analysis tool to cope with
a competitive environment. Data mining tools and
techniques can be used to analyze various market trends.
The main objective of this paper is to study the application
of data mining in determining consumer online
shopping attitudes and behavior.
1.2 CHARACTERISTICS OF DATA MINING
a) Data mining is used to identify patterns among large
data sets and hence analyze future trends.
b) Data mining can predict the possible outcomes of any
study.
c) Data mining can answer questions that cannot be
addressed through simple queries.
d) Data mining can create actionable information upon
which a user can rely.
e) Data mining is used to remove the effect of redundancy
on the decision making process.
1.3 APPLICATIONS OF DATA MINING
Data mining is defined as the extraction of information or
data from huge data sets. The extracted information can be
used for the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Sports
Internet surfing
1.4 DATA MINING TASK PRIMITIVES
A data mining task is specified in the form of a data
mining query, which is the input to the system. The query
is defined in terms of task primitives. The first primitive is
the set of task-relevant data to be mined: the portion of the
database in which the user is interested, including the
database attributes and data warehouse dimensions of
interest. The second primitive is the kind of knowledge to
be mined, which refers to the functions to be performed.
These functions include Characterization, Discrimination,
Association and Correlation Analysis.
Task primitives also cover background knowledge, which
allows data to be mined at multiple levels of abstraction.
Concept hierarchies, for example, are one form of such
background knowledge.
Interestingness measures and thresholds for pattern
evaluation are also task primitives. They are used to
evaluate the patterns discovered by the knowledge
discovery process. Different kinds of knowledge have
different interestingness measures.
2. RELATED WORK
Ling Liu [7] notes that, with the advance of internet
technologies in recent years, online shopping has become a
popular way to make purchases compared with traditional
ways. Conducting business online brings many benefits to
both consumers and businesses. However, it can also cause
great damage to a business because of the increasing
number of fraudulent online transactions. To improve the
online shopping experience, there is a great need to reduce
and prevent fraudulent activities.
Rana Alaa El-Deen Ahmed [11] comparatively tests eleven
data mining classification techniques to find the best
classifier for consumer online shopping attitudes and
behavior, using a dataset obtained from a large online
shopping agency. The results show that the decision table
classifier and the filtered classifier give the highest
accuracy, while the lowest accuracy is achieved by
classification via clustering and simple CART. The paper
also provides a recommender system based on the decision
table classifier that helps customers find the products they
are searching for on e-commerce web sites. The
recommender system learns from information about
customers and products and provides appropriate
personalized recommendations.
Paresh Tanna [10] shows how different approaches achieve
the objective of frequent pattern mining, along with the
complexities required to perform the job. The paper
demonstrates the use of the WEKA tool for association
rule mining using the Apriori algorithm.
Soo Yeon Chung [3] conducted an extensive review of the
online shopping literature and proposed a hierarchy model
of online shopping behavior. The authors collected 47
studies and classified them by the variables used. They
identified critical issues in research framework and
methodology and a lack of cross-cultural comparison, and
therefore developed a cross-cultural model of online
shopping that includes shopping value, attitudes to online
retailers' attributes and online purchasing, based on the
integrated V-A-B model.
3. CLASSIFICATION ALGORITHM
3.1 Bayes’ Theorem
Bayesian classification is based on Bayes' Theorem, which
is named after Thomas Bayes. Bayesian classifiers are
statistical classifiers: they can predict class membership
probabilities, such as the probability that a given tuple
belongs to a particular class. Two types of probability are
involved:
Posterior probability, P(H|X)
Prior probability, P(H)
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
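As a small numerical illustration of the theorem, the following Python sketch computes a posterior probability from a prior and a likelihood. The hypothesis, the evidence and all numbers are assumptions chosen only for illustration; they are not taken from the dataset described in this paper.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    if evidence <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood * prior / evidence


# Hypothetical values (assumptions, not measured data):
# H = "the visitor places an order", X = "the visitor pays by credit card".
p_h = 0.2              # prior P(H)
p_x_given_h = 0.6      # likelihood P(X|H)
p_x_given_not_h = 0.1  # P(X|not H)

# P(X) follows from the law of total probability.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_x_given_h, p_h, p_x)
print(round(p_h_given_x, 3))
```

Observing X raises the probability of H above its prior here because X is assumed to be much more likely under H than under its complement.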
3.2 The Naive Bayes Classifier technique
The Naive Bayes classifier is based on Bayes' theorem and
is particularly suited to cases where the dimensionality of
the inputs is high. Despite its simplicity, Naive Bayes can
often outperform more sophisticated classification methods.
To demonstrate the concept of Naïve Bayes classification,
consider the example illustrated in Figure 1. As indicated,
the objects can be classified as either GREEN or RED. Our
task is to classify new cases as they arrive, i.e., to decide
which class label they belong to, based on the currently
existing objects.
Figure 1: Classification of text points
Since there are twice as many GREEN objects as RED, it is
reasonable to believe that a new case (which has not been
observed yet) is twice as likely to have membership GREEN
rather than RED. In Bayesian analysis, this belief is known
as the prior probability. Prior probabilities are based on
previous experience, in this case the percentage of GREEN
and RED objects, and are often used to predict outcomes
before they actually happen.
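Thus, we can write:
$$P(\text{GREEN}) = \frac{\text{number of GREEN objects}}{\text{total number of objects}}, \qquad P(\text{RED}) = \frac{\text{number of RED objects}}{\text{total number of objects}}$$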
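The idea of combining class priors with per-attribute evidence can be sketched in Python as follows. This is a minimal illustration for categorical attributes, not the implementation evaluated in any of the cited studies. The toy records, the attribute meanings (payment method, best deal) and the use of add-one (Laplace) smoothing are assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the class with the highest log prior plus log likelihoods."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)  # log of the prior probability
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score += math.log((count + 1) / (n_label + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical records: (payment method, best deal). As in the example
# above, there are twice as many GREEN cases as RED cases.
rows = [("credit card", "yes"), ("credit card", "yes"), ("cod", "no"),
        ("credit card", "no"), ("cod", "no"), ("cod", "yes")]
labels = ["GREEN", "GREEN", "RED", "GREEN", "RED", "GREEN"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("cod", "no")))
print(predict(model, ("credit card", "yes")))
```

Working with logarithms keeps the product of many small probabilities numerically stable, which matters when the number of attributes is large.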
3.3 Lazy Classifier
Lazy learners store the training instances and do no real
work until classification time. Lazy learning is a learning
method in which generalization beyond the training data is
delayed until a query is made to the system, as opposed to
eager learning, where the system tries to generalize the
training data before receiving queries. The main advantage
of a lazy learning method is that the target function is
approximated locally, as in the k-nearest neighbor
algorithm. Because the target function is approximated
locally for each query, lazy learning systems can solve
multiple problems simultaneously and deal successfully
with changes in the problem domain [5][8]. The
disadvantages of lazy learning include the large space
required to store the complete training dataset. Noisy
training data in particular enlarges the stored case base
unnecessarily, because no abstraction is made during the
training phase. Another disadvantage is that lazy learning
methods are usually slower to evaluate, although this is
offset by a faster training phase.
3.4 K-Means Clustering Algorithm
The k-means clustering algorithm attempts to split a given
anonymous data set (a set containing no information as to
class identity) into a fixed number (k) of clusters.
Initially, k so-called centroids are chosen. A centroid is a
data point (imaginary or real) at the center of a cluster.
In Praat, each centroid is an existing data point in the
given input data set, picked at random, such that all
centroids are unique (that is, for all centroids ci and cj
with i ≠ j, ci ≠ cj). These centroids are used to train a
K-NN classifier. The resulting classifier is used to
classify (using k = 1) the data and thereby produce an
initial randomized set of clusters. Each centroid is
thereafter set to the arithmetic mean of the cluster it
defines. The process of classification and centroid
adjustment is repeated until the values of the centroids
stabilize. The final centroids are used to produce the
final classification/clustering of the input data,
effectively turning the set of initially anonymous data
points into a set of data points, each with a class identity.
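The procedure described above can be sketched in plain Python as follows. This is a minimal illustration rather than Praat's implementation. The function names, the fixed random seed, the iteration cap and the toy points are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid (1-NN)."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initial centroids: k distinct existing data points picked at random.
    centroids = rng.sample(list(dict.fromkeys(points)), k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move the centroid to the arithmetic mean of its cluster.
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(members[0]))))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

The cluster indices themselves are arbitrary; only the grouping of the points is meaningful.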
3.5 K nearest Neighbors Classification
K nearest neighbors (KNN) is a simple algorithm that stores
all available cases and classifies new cases based on a
similarity measure (e.g., distance functions). KNN has
been used as a non-parametric technique in statistical
estimation and pattern recognition since the beginning of
the 1970s.
A case is classified by a majority vote of its neighbors,
with the case being assigned to the class most common
amongst its K nearest neighbors measured by a distance
function. If K = 1, then the case is simply assigned to
the class of its nearest neighbor.
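The majority vote can be sketched in a few lines of Python. This is an illustrative sketch only. The function name, the use of Euclidean distance via math.dist (Python 3.8 or later) and the toy training cases are assumptions.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Assign `query` to the most common class among its k nearest cases."""
    # Sort the stored cases by Euclidean distance to the query.
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training cases with two numeric attributes each.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_points, train_labels, (2, 2), k=3))
print(knn_classify(train_points, train_labels, (6, 5), k=1))
```

Note that no model is built in advance: all work happens at query time, which is what makes KNN a lazy learner.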
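For two cases x = (x_1, ..., x_n) and y = (y_1, ..., y_n)
with n attributes, the common distance functions are:
$$\text{Euclidean} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
$$\text{Manhattan} = \sum_{i=1}^{n} |x_i - y_i|$$
$$\text{Minkowski} = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$$
It should also be noted that all three distance measures
are valid only for continuous variables. For categorical
variables the Hamming distance must be used. This also
raises the issue of standardizing the numerical variables
to the range 0 to 1 when the dataset contains a mixture of
numerical and categorical variables.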
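The distance functions and the 0 to 1 standardization mentioned above can be written compactly in Python. The function names are hypothetical, and min-max scaling is assumed here as the standardization method, since the text only states the target range.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two numeric cases."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


def euclidean(x, y):
    return minkowski(x, y, 2)  # Minkowski with q = 2


def manhattan(x, y):
    return minkowski(x, y, 1)  # Minkowski with q = 1


def hamming(x, y):
    """Number of attributes on which two categorical cases differ."""
    return sum(a != b for a, b in zip(x, y))


def min_max_scale(column):
    """Rescale a numeric column to the range 0 to 1 (assumed method)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(hamming(("cod", "yes"), ("credit card", "yes")))
print(min_max_scale([10, 20, 30]))
```

Euclidean and Manhattan distance are the special cases q = 2 and q = 1 of the Minkowski distance, which is why a single function covers all three.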
Choosing the optimal value for K is best done by first
inspecting the data. In general, a larger K value is more
precise because it reduces the overall noise, but there is
no guarantee. Cross-validation is another way to determine
a good K value retrospectively, by using an independent
dataset to validate it. Historically, the optimal K for
most datasets has been between 3 and 10, which produces
much better results than 1-NN.
3.6 DATASET USED
The dataset used is obtained from a highly reputable
online shopping agency that sells only online. It is
composed of an online ordering log file covering three
months and consists of 304 instances and 26 attributes.
The ten-fold cross validation method is used to test the
accuracy of the selected classification methods. In
ten-fold cross validation, a dataset is divided equally
into 10 folds (partitions) with the same distribution. In
each test, 9 folds of data are used for training and one
fold is used for testing (unseen data). The test procedure
is repeated 10 times.
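The ten-fold procedure can be sketched in Python as shown below. This is a minimal illustration, not the procedure performed by WEKA. The function names, the random seed and the class labels are assumptions; only the totals of 304 instances and 10 folds come from the text, and the split of 200 "yes" and 104 "no" labels is invented for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split instance indices into folds with similar class distributions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical labels for 304 instances (the class split is an assumption).
labels = ["yes"] * 200 + ["no"] * 104
folds = stratified_folds(labels)
splits = list(train_test_splits(folds))
print(len(splits), sorted(len(fold) for fold in folds))
```

Because 304 is not divisible by 10, four folds hold 31 instances and six hold 30. The accuracy of a classifier would then be averaged over the 10 test folds.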
Attribute | Description
Personal information | Includes serial number, buyer name, gender and age.
Educational level | Describes buyer educational level, classified into categories from 1-10: (1-3) Graduate, (4-6) Master, (7-10) PhD.
Brand | Describes product brand name.
Product name | Describes the product name.
Item description | Describes the product specification.
Category | Describes product category.
Quantity | Describes product ordered quantity per order.
Price | Describes product price.
Item Type | Describes the different product types.
Payment Method | Describes order payment method, classified into three methods: cash on delivery (COD), credit card, buyer web site account.
Number of visits | Describes the number of buyer visits to the web site page.
Duration of visit | Describes the duration of the buyer's visit, measured in minutes.
Rating | Describes product rating from the buyer, measured on a scale from 1-5, where 1 represents poor and 5 represents excellent.
User Satisfaction of the product | Describes user satisfaction with the product, rated from 1-100, where 100 represents highly satisfied and 1 represents not satisfied.
Best deal | Describes the best offer for the product: 1 represents Yes and 2 represents No.
Number of likes | Describes number of likes for the product, ranging from 0-100.
Positive comments | Describes the number of positive comments for a certain product.
Negative comments | Describes the number of negative comments for a certain product.
Number of posts | Describes number of posts written on the web page.
Facebook | Describes the number of followers on Facebook.
Instagram | Describes the number of followers on Instagram.
Twitter | Describes the number of followers on Twitter.
Table 1: Various attributes available in dataset
4. BLOCK DIAGRAM OF THE PROCESS
[Figure 2: block diagram. Stages: Log File; Collecting data
set and data preparation; Explore and clean data; Data
conversion using appropriate tool; Applying data mining
algorithm (WEKA) and evaluation]
Figure 2: Block diagram of experiment components used
for selecting best classifier performance
The process starts with the collection of data from the
online company. The data is cleaned in the next phase and
then, in the data conversion phase, transformed into file
formats suitable for different data mining tools. Finally,
different classification algorithms are applied to the
dataset, and a comparative study is carried out to identify
the best classifier for the dataset.
5. CONCLUSION AND FUTURE SCOPE
In the data mining process, the attributes of a dataset
are used for classification in order to extract hidden
patterns from the dataset. Various classifiers are used to
divide a dataset into different classes. This paper has
reviewed several classifiers used in the data mining
process, covering rule-based, clustering-based and
distance-based classifiers, with the aim of identifying an
optimal classifier for data classification. Classification
is necessary for online shopping data because of the huge
amount of redundant information available in the dataset.
In future work, the optimal classifier can be used in
real-life applications of data mining.
REFERENCES
[1] D. Burdick, M. Calimlim and J. Gehrke, “GenMax: An
Efficient Algorithm for Mining Maximal Frequent Item
sets”, in Data Mining and Knowledge Discovery, 2005.
[2] S. Jie, S. Peiji and F. Jiaming, “A Model for adoption of
online shopping: A perceived characteristics of Web as a
shopping channel view”, in Service Systems and Service
Management, 2007 International Conference, 2007.
[3] C. Park, “Online shopping behavior model: A literature
review and proposed model”, in Advanced Communication
Technology, 2009. ICACT, 11th International Conference,
2009.
[4] A. Meenakshi and D. Alagarsamy, “Efficient Storage
Reduction of Frequency of Items in Vertical Data Layout”,
International Journal on Computer Science and Engineering,
vol. 3, 2011.
[5] M. Rezaul Karim, J. Jo, B. Jeong and H. Choi, “Mining
E-Shopper's Purchase Rules by Using Maximal Frequent
Patterns: An Ecommerce Perspective”, in Information
Science and Applications (ICISA), 2012 International
Conference, 2012, pp. 1-6.
[6] K. Devkishin, A. Rizvi and V. L. Akre, “Analysis of factors
affecting the online shopping behavior of consumers in
UAE”, in Current Trends in Information Technology
(CTIT), 2013 International Conference, 2013, pp. 220-225.
[7] L. Liu and Z. Yang, “Improving Online Shopping
Experience using Data Mining and Statistical Techniques”,
Journal of Convergence Information Technology (JCIT),
vol. 8, no. 6, Mar. 2013.
[8] S. K.S, A. Prabhakaran and T. George K, “Decision support
system for CRM in online shopping system”, International
Journal of Advances in Computer Science and Technology,
vol. 3, no. 2, 2014.
[9] D.M.Tank, “Improved Apriori Algorithm for Mining
Association Rules,” I.J. Information Technology and
Computer Science, 2014, pp. 15-23.
[10] P. Tanna and Y. Ghodasara, “Using Apriori with WEKA for
Frequent Pattern Mining”, International Journal of
Engineering Trends and Technology (IJETT), vol. 12, no. 3,
2015, pp. 127-131.
[11] Rana Alaa El-Deen Ahmed, “Performance study of
classification algorithms for consumer online shopping
attitudes and behavior using data mining”, Fifth International
Conference on Communication Systems and Network
Technologies (CSNT), 2015, pp. 1344-1349.