Website Popularity Prediction (Assignment 2)
Data Mining and Industry Applications (ISCG8042)
Due Date: 10/11/2014
Assisted by:
Jane Zhao
Prepared and Submitted by:
Mohammed Alkorbi (1431788)
Abdulbasit Almatrook (1430943)
Haider Al Oliwi (1440428)
Naji Alobaidi (1373749)
Table of Contents
1 Introduction
1.1. Research Background
1.1.1 Prediction of Fitness Website Design
1.1.2 Popularity Prediction of E-commerce Website
1.1.3 Popularity Prediction of Music Website
1.1.4 Popularity Prediction of Education Website
1.2. Data Mining Applications
1.2.1. Retail Data Mining Applications
1.1.5 Telecommunication Data Mining Applications
1.1.6 Banking Data Mining Applications
1.1.7 Website Popularity Prediction
1.2 Data Mining Tasks
1.2.1 Classification
1.2.2 Regression
1.2.3 Prediction
1.2.4 Clustering
1.2.5 Association Rule
1.2.6 Summarization
1.2.7 Outlier Analysis
2 Methodology
2.1 Flowchart of System Design
2.2 Explanation of System Design
2.2.1 System Module A: Extraction
2.2.2 System Module B: Data Server Module
2.2.3 System Module C: Data Pre-processing Module
2.2.4 System Module D: Data Mining
2.2.5 System Module E: Loading into Warehouse
2.2.6 System Module F: Visualization the Data
2.3 Difficulties and Solutions
2.3.1 Extraction
2.3.2 Transformation Difficulties
2.3.3 Loading Issue
3 Experiments and Results Discussion
3.1 Listening Music Website Popularity
3.1.1 Linear Regression
3.1.2 Logistic Regression
3.1.3 Multilayer Perceptron
3.1.4 Sequential Minimal Optimization for Regression (SMOreg)
3.2 Prediction of Fitness Website Design
3.2.1 Linear Regression
3.2.2 Logistic Regression
3.2.3 Multilayer Perceptron
3.2.4 Sequential Minimal Optimization for Regression (SMOreg)
3.3 Popularity Prediction of E-commerce Website
3.3.3 Linear Regression
3.3.2 Logistic Regression
3.3.3 Multilayer Perceptron
3.3.4 Sequential Minimal Optimization for Regression (SMOreg)
3.4 Popularity Prediction of Education Website
3.4.1 Linear Regression
3.4.2 Logistic Regression
3.4.3 Multilayer Perceptron
3.4.4 Sequential Minimal Optimization for Regression (SMOreg)
3.5 Compression
3.5.1 Linear Regression
3.5.2 Logistic Regression
3.5.3 Multilayer Perceptron
3.5.4 Sequential Minimal Optimization for Regression (SMOreg)
3.5.5 Overall Comparison
4 Conclusion
Bibliography
Figure 1: The Impact of Online Membership Registration Form on the Popularity of the Website
Figure 2: The Concept of Multilayer Perceptrons (NEURAL NETWORK, 2014)
Figure 3: The segmentation system provided by Nielsen Claritas based on American zip code to describe the lifestyle in each area in USA (Ngai, Xiu, & Chau, 2009)
Figure 4: Flowchart of System Design
Figure 5: Weka
Figure 6: Linear Regression (Listening Music Website Popularity)
Figure 7: Logistic Regression (Listening Music Website Popularity)
Figure 8: Multilayer Perceptron (Listening Music Website Popularity)
Figure 9: SMOreg (Listening Music Website Popularity)
Figure 10: Linear Regression (Prediction of Fitness Website Design)
Figure 11: Logistic Regression (Prediction of Fitness Website Design)
Figure 12: Multilayer Perceptron (Prediction of Fitness Website Design)
Figure 13: SMOreg (Prediction of Fitness Website Design)
Figure 14: Linear Regression (Popularity Prediction of E-commerce Website)
Figure 15: Logistic Regression (Popularity Prediction of E-commerce Website)
Figure 16: Multilayer Perceptron (Popularity Prediction of E-commerce Website)
Figure 17: SMOreg (Popularity Prediction of E-commerce Website)
Figure 18: Linear Regression (Popularity Prediction of Education Website)
Figure 19: Logistic Regression (Popularity Prediction of E-commerce Website)
Figure 20: Multilayer Perceptron (Popularity Prediction of E-commerce Website)
Figure 21: SMOreg (Popularity Prediction of E-commerce Website)
1 Introduction
1.1. Research Background
Data mining is mainly concerned with finding new and interesting patterns in large
databases. It is also referred to as knowledge discovery in databases. It helps in
predicting behaviours and future trends, which supports effective and efficient decisions
based on knowledge. The contributions of four research projects on the prediction of
website design and popularity are as follows:
1.1.1 Prediction of Fitness Website Design
The fitness website design research covers the design factors of 50 fitness websites located
in the USA, UK and NZ. The main aim of this research is to determine the significant factors
in fitness website design and whether these factors affect the popularity of a fitness
website. Several design factors were gathered, including online member registration, online
contact form, Google map function, search function, and background image.
The research investigated whether each fitness website has these factors. It also looked at
the overall design of each website and gave each website a percentage rating for overall
design. Information about well-designed websites was gathered to determine this design
rating. It included the effective use of website content, layout design, width and height of
a website page, design of text and design of images (Plumley, 2010) (MacDonald, 2011).
The conclusion of this research was that having online membership registration increases the
popularity of the website (see Figure 1). Of the 50 fitness websites, 33 have online
membership registration. The popularity of each fitness website is expressed as a percentage.
[Chart: Impact of Online Membership Registration Form on Popularity; series: Membership registration, Popularity]
Figure 1: The Impact of Online Membership Registration Form on the Popularity of the Website
1.1.2 Popularity Prediction of E-commerce Website
The topic chosen for assignment 1 was “Website Popularity Prediction”, and useful insights
were developed while carrying it out. Information technology has changed the world, and the
Internet is key to the progress of today's enterprises. The essence of the topic was to apply
data mining concepts to estimate the popularity of numerous websites. A sample of 50
e-commerce websites was taken, and attributes such as “country”, “loading-time”,
“home-page size”, “Ranking” and “linking websites” were recorded for each website. The major
source of data was “Alexa.com”, a large web analytics company owned by Amazon. Another
website, “pingdom.net”, was used to retrieve the “response time” of the websites. The column
“popularity” was the dependent variable in the dataset and depended entirely on the
independent variable “Alexa’s ranking”: a scale was formulated from the ranking column and
used to fill in the popularity values. Several data pre-processing and data transformation
techniques were applied to the complete dataset to remove anomalies and keep the data
consistent.
1.1.3 Popularity Prediction of Music Website
The objective of this dataset is to gather values from 51 sites in order to discover the
popularity among music websites. To do so, data is collected on five factors:
• Loading Time: the page loading time of the site, measured using SEO tools.
• Quality Links: the number of dead links on the page, identified using SEO tools.
• Compatibility: the percentage of devices on which the website can be displayed.
• Social Networks: whether the website offers social network features.
• Site Ranking: the amount of traffic on the website.
1.1.4 Popularity Prediction of Education Website
The dataset collected for the present work consists of 50 educational websites. It provides a
rating for each field in order to analyze the level of compatibility, popularity, quality
links, trust, website traffic and shared content. The ratings are based on users’ visits to
these websites. The data has been labeled and presented in an Excel sheet. It was gathered
from websites in the UK, USA and New Zealand. These 50 selected websites serve as
representatives of the large number of educational websites available on the World Wide Web.
1.2. Data Mining Applications
This literature review covers data mining applications in retail, telecommunications, banking
and website popularity prediction, as follows:
1.2.1. Retail Data Mining Applications
Nowadays, retailers face a rapidly changing and competitive environment (Ramageri & Desai,
2013), and they compete to market their products more effectively. Every day, retailers
collect large amounts of customer, product and purchase information. This information can be
converted into knowledge that assists in making better business decisions (Ramageri & Desai,
2013). Retailers seek to target the right customers, those most likely to buy their products.
Data mining assists in predicting customers’ future purchasing behaviour, which increases
retailers’ productivity (Ramageri & Desai, 2013). Some retail data mining applications are as
follows:
1. Acquiring and Retaining Customers
Retailers seek to keep their existing customers rather than only acquire new ones. They can
therefore convert customer purchase details into knowledge by looking at customers’
purchasing behaviour over past months (Ramageri & Desai, 2013). This helps retailers predict
customers’ needs and interest in buying a specific product. This prediction also helps
retailers keep their customers by offering discounts on particular products, which attracts
customers to buy the products they desire (Ramageri & Desai, 2013).
2. Market Basket Analysis
Market basket analysis is a way to understand which kinds of products are likely to be
purchased together (Ramageri & Desai, 2013). Understanding customers’ purchasing behaviour
helps retailers anticipate customers’ future needs and develop the store layout. Retailers
can predict which products are most frequently purchased together using the association rule
data mining task (Ramageri & Desai, 2013). For instance, a retailer might perform association
rule mining and find that customers who buy razors are likely to buy shaving gel. Based on
this information, these products can be placed together on one shelf, which will attract
customers to buy them together next time. Product association enables retailers to design
effective and attractive shelves (Ramageri & Desai, 2013). In addition, it helps customers
locate products easily and helps retailers offer discounts on two specific products.
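The razor and shaving gel example can be made concrete with the two usual association rule measures, support and confidence. The following is a minimal sketch, not the method used in this research: the transactions, the function name pair_rules and the min_support threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"razor", "shaving gel", "soap"},
    {"razor", "shaving gel"},
    {"razor", "toothpaste"},
    {"shaving gel", "soap"},
    {"razor", "shaving gel", "toothpaste"},
]


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of all baskets containing both items
        if support < min_support:
            continue
        # Confidence of "a -> b": share of baskets with a that also contain b.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


rules = pair_rules(transactions)
for antecedent, consequent, support, confidence in sorted(rules):
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high support and high confidence, such as razor -> shaving gel in this toy data, is the kind of result that would justify placing the two products on the same shelf.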
3. Customer segmentation and target market
Customer segmentation is significant for retailers because it helps divide the market into
parts based on customers’ demand. Data mining allows retailers to group and cluster
customers based on their age, gender, spending habits and interests (Ramageri & Desai, 2013)
(Data Segmentation, n.d). For instance, customer segmentation allows retailers to identify
the most profitable groups of customers and so focus on keeping these groups, which are the
most valuable to their business.
1.1.5 Telecommunication Data Mining Applications
Data mining is used effectively in the telecommunication sector to increase customer
satisfaction, improve processes and raise the effectiveness and efficiency of the sector. The
telecommunication industry was one of the first to adopt data mining. This is likely because
telecommunication companies routinely generate and store enormous amounts of data, have a
very large customer base, and operate in a rapidly changing and highly competitive
environment. Telecommunication companies use data mining to improve their marketing efforts,
detect fraud, and better manage their telecommunication networks. These companies also face
particular data mining challenges because of the enormous size of their data sets, the
sequential and temporal aspects of their data, and the need to predict very rare events, for
example customer fraud and network failures.
1. Generating Intelligence from Customer Data
The popularity of data mining in the telecommunications industry can be seen as an extension
of the use of expert systems in the telecommunications industry (Liebowitz, 1988). These
systems were developed to address the complexity of maintaining a huge network infrastructure
and the need to maximize network reliability while minimizing labour costs.
The data mining applications for any industry depend on two factors: the data that are
available and the business problems facing the industry. This section gives background
information about the data maintained by telecommunication companies. The challenges
associated with mining telecommunication data are also described in this section.
Telecommunication companies maintain data about the phone calls that traverse their networks
in the form of call detail records, which contain descriptive information for each phone
call. In 2001, AT&T long distance customers generated more than 300 million call detail
records per day and, because call detail records are kept online for several months, this
meant that billions of call detail records were readily available for data mining (Cortes &
Pregibon, 2001). Call detail data is useful for marketing and fraud detection applications.
Telecommunication companies also maintain extensive customer information, for example billing
information, and information obtained from outside parties, such as credit score information.
2. Managing Network Data
Telecommunication networks are extremely complex configurations of equipment, made up of a
huge number of interconnected components. Each network element is capable of generating error
and status messages, which leads to a tremendous amount of network data. This data must be
stored and analysed to support network management functions, for instance fault isolation.
Each message will minimally include a timestamp, a string that uniquely identifies the
hardware or software component generating the message, and a code that explains why the
message is being generated. For example, such a message may indicate that "controller 7
experienced a loss of power for 40 seconds starting at 12.03 pm on Monday, July 12." Because
of the enormous number of network messages generated, technicians cannot possibly handle
every message. Expert systems have therefore been developed to analyse these messages
automatically and take appropriate action, involving a technician only when a problem cannot
be resolved automatically (Weiss, 1998). As with call detail data, network data is generated
in real time as a data stream and must often be summarised in order to be useful for data
mining. This is often accomplished by applying a time window to the data. For example, such a
summary may indicate that a hardware component experienced twelve instances of a power
fluctuation in a 10-minute period.
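The time-window summary described above can be sketched in a few lines. This is an illustration only: the message tuples, component names, message codes and the helper names window_start and summarise are hypothetical, and the window length is assumed to divide evenly into one hour.

```python
from collections import Counter
from datetime import datetime

# Hypothetical status messages: (timestamp, component id, message code).
messages = [
    (datetime(2014, 7, 14, 12, 3, 0), "controller 7", "POWER_LOSS"),
    (datetime(2014, 7, 14, 12, 5, 30), "controller 7", "POWER_LOSS"),
    (datetime(2014, 7, 14, 12, 9, 59), "controller 7", "POWER_LOSS"),
    (datetime(2014, 7, 14, 12, 11, 0), "controller 7", "POWER_LOSS"),
    (datetime(2014, 7, 14, 12, 4, 0), "switch 2", "LINK_DOWN"),
]


def window_start(ts, window_minutes=10):
    """Round a timestamp down to the start of its window.

    Assumes window_minutes divides 60, so windows never straddle an hour.
    """
    minute = ts.minute - ts.minute % window_minutes
    return ts.replace(minute=minute, second=0, microsecond=0)


def summarise(msgs, window_minutes=10):
    """Count messages per (window start, component, code)."""
    counts = Counter()
    for ts, component, code in msgs:
        counts[(window_start(ts, window_minutes), component, code)] += 1
    return counts


summary = summarise(messages)
for (start, component, code), count in sorted(summary.items()):
    print(f"{start:%H:%M} {component} {code}: {count}")
```

Each summary row replaces many raw messages with one count, which is the compact form that a data mining algorithm can then work on.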
1.1.6 Banking Data Mining Applications
As per Mota Soares (2008), data mining in the banking sector assists in marketing, risk
management, Customer Relationship Management (CRM), and the acquisition and retention of
customers. The introduction of technology in the banking sector has made it technologically
strong and customer oriented. In the past few years, the data collected by banks has been
increasing at a fast pace. Data mining helps manage this large volume of data so that banks
can analyze information about the financial profiles of customers. This helps not only in
enhancing customer relations but also in developing a customized pricing policy (Soares,
2008). As stated by Gorunescu (2011), data mining is useful for the banking sector as it
helps in the clear identification of different customer segments. It helps in the marketing
of cards and thereby in enhancing profitability. Data mining also helps banks in the
effective implementation of retention and acquisition programs and in product development
(Gorunescu, 2011).
1.1.7 Website Popularity Prediction
Nowadays, many applications, such as business-to-business and e-commerce applications, rely
on web services to promote their services. The move to web application frameworks has
significantly increased the amount of data stored on the Web. In web mining, logs that record
access to a web server are a valuable resource for revealing the browsing behaviour of
website visitors. Because of the large number of accesses, the volume of logs that can be
gathered makes them a worthwhile source for web mining. As web services become more
essential, a scalable infrastructure of registries is needed that gives end-users and
developers tools for knowledge discovery.
Business Applications
Data mining is widely used in business intelligence thanks to the development of data mining
algorithms and efficient processors. Web applications in this field can be used to predict
costs and savings and to monitor the performance of services.
1. Cost and Savings Prediction
The model helps businesses evaluate costs when particular functionalities are needed. It can
be supplied with historical data, which helps businesses planning to use web services to
evaluate the cost and predict the feasibility.
2. Performance Monitoring
In a business environment where the major goal is to reduce costs by stretching resources,
tasks need to be prioritised and observed. By identifying service usage patterns, the
planning of software for groups of services with comparable usage patterns can be improved.
This allows services to be monitored at specific times in order to gain in-depth knowledge of
particular services. Time series are generated from web server logs by the following steps:
• All data entries that contain a service URL are selected and extracted from the web server
log.
• The data entries are grouped by IP address and web service and ordered by time, which yields
each client's interaction with the services.
• Client sessions are identified separately by computing the time between interactions.
• Finally, the number of clients using each service at defined time intervals is computed.
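The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the cited work: the log entries, the "/services/" URL prefix, the 30-minute session gap and the one-hour interval are all hypothetical, and a session is counted only in the interval in which it starts.

```python
from collections import defaultdict

SESSION_GAP = 1800  # assumed: 30 minutes of inactivity ends a session
INTERVAL = 3600     # assumed: one-hour reporting interval

# Hypothetical parsed log entries: (seconds since start of log, client IP, URL).
log = [
    (10, "10.0.0.1", "/services/quote"),
    (50, "10.0.0.2", "/index.html"),
    (70, "10.0.0.1", "/services/quote"),
    (120, "10.0.0.2", "/services/quote"),
    (4000, "10.0.0.1", "/services/quote"),
    (4100, "10.0.0.3", "/services/order"),
]


def service_time_series(entries, prefix="/services/"):
    """Return {(service URL, interval index): number of distinct clients}."""
    # Step 1: keep only entries whose URL points at a service.
    selected = [entry for entry in entries if entry[2].startswith(prefix)]

    # Step 2: group by (IP, service); times are sorted per group in step 3.
    groups = defaultdict(list)
    for ts, ip, url in selected:
        groups[(ip, url)].append(ts)

    # Step 3: start a new session whenever the gap between requests is too long.
    sessions = []  # (service, client IP, session start time)
    for (ip, url), times in groups.items():
        previous = None
        for ts in sorted(times):
            if previous is None or ts - previous > SESSION_GAP:
                sessions.append((url, ip, ts))
            previous = ts

    # Step 4: count distinct clients per service per interval.
    clients = defaultdict(set)
    for url, ip, start in sessions:
        clients[(url, start // INTERVAL)].add(ip)
    return {key: len(ips) for key, ips in clients.items()}


series = service_time_series(log)
for (url, interval), count in sorted(series.items()):
    print(f"{url} interval {interval}: {count} client(s)")
```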
1.2 Data Mining Tasks
There are seven common data mining tasks that organizations can use to discover knowledge in
their large data sets. Organizations should choose the data mining task that is appropriate
for their business requirements (Ngai, Xiu, & Chau, 2009). For instance, retailers might
choose the clustering task to discover the most profitable groups of customers based on their
age, gender and spending habits. Several software packages are available to perform data
mining tasks, such as Weka and NeuCom. The seven data mining tasks are as follows:
1.2.1 Classification
Classification is a very important data mining technique that has long been used in almost
every application of data mining. Classification consists of predicting a certain outcome
based on a given input. In order to predict the outcome, the algorithm processes a training
set containing a set of attributes and the respective outcome, usually called the goal or
prediction attribute (Freitas, 1997). The algorithm tries to discover relationships between
the attributes that would make it possible to predict the outcome. Next, the algorithm is
given a data set not seen before, called the prediction set, which contains the same set of
attributes except for the prediction attribute, which is not yet known. The algorithm
analyses the input and produces a prediction. The prediction accuracy defines how "good" the
algorithm is. Classification is a form of supervised learning. We are given a set of database
objects that are usually labelled with various attributes, and a model is required that
separates them into predefined categories or classes. There are many classification methods,
but the most widely used in data mining applications are decision trees (such as ID3),
rule-based classification and k-nearest neighbours (Buja & Lee, 2001). Classification is
useful in all applications of data mining, and especially in fraud detection.
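The training set, prediction set and accuracy workflow described above can be sketched with a one-nearest-neighbour classifier, chosen here only because it is short. The data points, class labels and function names are hypothetical, and the held-out labels are kept solely to measure accuracy.

```python
import math

# Hypothetical training set: (attributes, known class label).
training_set = [
    ((1.0, 1.2), "legitimate"),
    ((0.8, 1.0), "legitimate"),
    ((1.1, 0.9), "legitimate"),
    ((4.0, 4.2), "fraud"),
    ((4.2, 3.9), "fraud"),
]

# Hypothetical held-out set; labels are used only to score the predictions.
test_set = [
    ((0.9, 1.1), "legitimate"),
    ((4.1, 4.0), "fraud"),
    ((1.2, 1.0), "legitimate"),
    ((3.0, 3.0), "legitimate"),
]


def predict(attributes, training):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training, key=lambda example: math.dist(example[0], attributes))
    return nearest[1]


def accuracy(test, training):
    """Share of held-out examples whose predicted label matches the true label."""
    correct = sum(predict(x, training) == label for x, label in test)
    return correct / len(test)


print(predict((4.1, 4.0), training_set))
print(accuracy(test_set, training_set))
```

The last held-out point lies closer to the fraud examples than to the legitimate ones, so it is misclassified. This shows why accuracy is measured on data the algorithm has not seen.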
1.2.2 Regression
Regression is fundamentally a core concept of statistics, but because data mining is also
built on statistics, both linear and multiple regression methods are used in data mining
applications. Regression analysis is the process of determining how a variable y is related
to one or more other variables x1, …, xk. The variable y is usually called the dependent
variable and the xi are called the independent or explanatory variables (Buja & Lee, 2001).
Multiple linear regression is highly applicable in data mining. Typical scenarios include
modelling the historical activity patterns of customers based on their credit card usage,
demographics and other significant traits; predicting the time of failure of equipment based
on utilization and environmental conditions; predicting expenditure on vacation travel based
on historic frequent flier data; product sales information; staffing management; and
predicting the impact of discounts on sales in retail stores (Buja & Lee, 2001). Regression
is also a form of supervised learning. Regression has several benefits: it reduces the
probability of error; if the predictors are known, inference is straightforward; a linear
approximation is often good even if the true relationship is not linear; and it provides a
strong framework for studying other methods.
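As a compact statement of the relationship described above, a multiple linear regression model and its usual least-squares fit can be written as follows. The coefficient symbols β, the error term ε and the observation index j are notation assumed here for illustration; the text itself names only y and x1, …, xk.

```latex
% Model: y depends linearly on the k explanatory variables plus an error term.
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon

% Least squares: choose the coefficients that minimise the sum of squared
% differences between observed and predicted y over n observations.
\hat{\beta} = \arg\min_{\beta} \sum_{j=1}^{n}
  \Bigl( y_j - \beta_0 - \sum_{i=1}^{k} \beta_i x_{ji} \Bigr)^{2}
```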
1.2.3 Prediction
Prediction is a data mining method whose purpose is to find the relationships among
independent variables and between independent and dependent variables. For example, to
predict future profit for a business, sales could be taken as the independent variable and
profit as the dependent variable. A regression curve fitted to the historical sales and
profit data can then be used to predict profit. Prediction can be combined with other data
mining techniques, including classification and trend analysis. Numeric prediction is the
task of predicting ordered values for a given input. Regression analysis is used to evaluate
the relationship between a dependent variable and one or more independent variables. It
describes how the dependent variable changes as the independent variables vary. Regression
analysis is widely used for forecasting and for exploring the form of the relationships
between variables. It is a suitable option when the predictor variables are also ordered
values. Prediction methods in data mining include the following:
• Linear Regression
It models the relationship between two variables by fitting a linear equation to observed
data. One variable is treated as the predictor (explanatory) variable and the other as the
criterion (dependent) variable. When only one predictor variable is used in the prediction
procedure, the method is called simple regression. The regression line is the straight line
that best fits the data points.
Page 12 of 37
Website Popularity Prediction (Assignment 2)
 Nonlinear Regression
Nonlinear regression is one of the shapes of regression analysis, data fitted as a mathematical
function. In this model a curve line is created where values in the criterion variable was
distributed randomly. Logarithmic functions is used to produce small summation of the squares
in nonlinear regression.
 Multilayer Perceptrons:
Neural networks process information in a way modelled on how the brain processes information
(Mu-sigma.com, 2014). Neural networks have been applied to real-world problems that are
complex to solve, for instance fraud detection, voice recognition, financial forecasting and
facial recognition (Mu-sigma.com, 2014).
The best-known neural network model is the multilayer perceptron (MLP), shown in figure 2. An
MLP takes numeric input data and generates an output (Mu-sigma.com, 2014). The goal of the MLP
is to map the input data to the desired output. An MLP is trained with the backpropagation
algorithm (Mu-sigma.com, 2014): the input data are presented to the network repeatedly, and the
weights are adjusted each time, until the network produces the desired output. This process is
known as training (NEURAL NETWORK, 2014).
Figure 2 The Concept of Multilayer Perceptrons (NEURAL NETWORK, 2014)
 Sequential Minimal Optimization for Regression (SMOreg):
Sequential Minimal Optimization (SMO) is an efficient algorithm for training Support Vector
Machines (SVM), and it can also be used to solve regression problems. In this form of
regression, the input is first mapped onto a multi-dimensional feature space using a fixed
nonlinear mapping, and a linear model is then constructed in that feature space. The regression
uses an epsilon-insensitive loss and, in parallel, tries to reduce model complexity by
minimizing the risk. The underlying SVM aims to maximize the margin, that is, the distance
between the decision boundary separating two classes and the nearest points in the training
data set. (Chou, Chiu, Farfoura, & Al-Taharwa, 2011)
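The insensitive loss used in support vector regression can be sketched in a few lines of Python.
Errors smaller than a tolerance epsilon cost nothing, and larger errors are penalized linearly.
The value epsilon = 0.1 and the sample numbers are assumptions chosen for illustration; they are
not the settings used in the experiments of this report.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Mean epsilon-insensitive loss: deviations up to epsilon are ignored."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical actual and predicted values.
actual = [1.0, 2.0, 3.0]
predicted = [1.05, 2.5, 3.0]

# Only the second prediction is off by more than epsilon (0.5 - 0.1 = 0.4).
loss = epsilon_insensitive_loss(actual, predicted, epsilon=0.1)
print(loss)
```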
1.2.4 Clustering
A cluster is a collection of items that are similar to each other, such as similar records,
observations or cases (Ngai, Xiu, & Chau, 2009). Clustering differs from the classification and
prediction tasks in that it does not attempt to classify or predict the value of a target
variable (Ngai, Xiu, & Chau, 2009). The purpose of the clustering task is to segment the whole
data set into clusters of similar items. For instance, Nielsen Claritas is in the clustering
business. Among its services, it uses a segmentation technique based on American zip codes to
characterize each geographic area in the USA (Ngai, Xiu, & Chau, 2009). This segmentation is
also used to describe the lifestyle of each area, see figure 3. For instance, the clusters for
zip code 90210 (Beverly Hills, California) are clusters 01, 03, 04, 07 and 16 (Ngai, Xiu, &
Chau, 2009).
Clustering can help a small organization with a small marketing budget to target the market for
a niche product (Ngai, Xiu, & Chau, 2009). In addition, clustering can be used in accounting to
segment financial behaviour into normal and suspicious groups.
Figure 3: The segmentation system provided by Nielsen Claritas based on American zip codes to describe the lifestyle in each area in the USA (Ngai, Xiu, & Chau, 2009)
As shown in figure 3, the wealthiest lifestyle in America is defined as 01 Upper Crust (Ngai, Xiu,
& Chau, 2009).
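The idea of segmenting records into groups of similar items can be sketched with k-means, one
common clustering algorithm. The report does not state which clustering algorithm a given
provider uses, so k-means is only an illustrative choice here, and the one-dimensional load
times and starting centers are made-up values.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd's algorithm on one-dimensional data with given starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value into the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, clusters


# Hypothetical website load times in seconds.
load_times = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(load_times, centers=[1.0, 12.0])
print(centers, clusters)
```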
1.2.5 Association Rule
The association rule task refers to the process of finding which attributes go together (Ngai,
Xiu, & Chau, 2009). In other words, it attempts to find relationships between two or more
attributes. It is commonly used to analyse and predict customers' purchasing behaviour (Ngai,
Xiu, & Chau, 2009). It also plays a significant part in market basket analysis, shelf design and
product clustering. An association rule has two parts: the antecedent, which is an item found in
the data, and the consequent, which is an item found in combination with the antecedent (Ngai,
Xiu, & Chau, 2009) (Rouse, n.d). For instance, suppose a supermarket finds, by analysing its
customers' purchasing behaviour, that customers often buy a dozen eggs and milk together. The
supermarket can then offer a discount on these products to attract more customers to buy them.
This knowledge can also help the supermarket to arrange its shelves so that customers can
locate these products easily.
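The strength of a rule such as "eggs, therefore milk" is usually measured by its support and
confidence. The following Python sketch computes both for a handful of made-up shopping
baskets; the basket contents are hypothetical and only illustrate the calculation.

```python
# Hypothetical transactions: each basket is the set of items in one purchase.
baskets = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"milk", "bread"},
    {"eggs", "milk", "butter"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"eggs", "milk"}, baskets)          # 3 of 5 baskets
rule_confidence = confidence({"eggs"}, {"milk"}, baskets)  # 3 of 3 egg baskets
print(rule_support, rule_confidence)
```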
1.2.6 Summarization
According to Han, Kamber, & Pei (2011), summarization is a key concept of data mining that
involves techniques for determining a compact description of a given database. Summarization
methods include the tabulation of the mean and standard deviation. They are used for data
analysis, the generation of automated reports and data visualization (Han, Kamber, & Pei,
2011). According to Tan (2012), summarization compresses a given data set into a smaller set of
patterns while retaining as much information as possible. It is the process of reducing data in
a meaningful way. Summarization helps in the effective management of a database: it reduces the
complexity of large databases and makes them easier to manage. It also supports efficient and
effective visual analysis. Various techniques are used to perform summarization, and they are
designed to facilitate the automated exploration of unprocessed data (Tan, 2012).
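As a minimal sketch of tabulating the mean and standard deviation, the following Python code
summarizes one numeric attribute using the standard statistics module. The response times are
made-up values used only for illustration.

```python
import statistics

# Hypothetical response times in seconds for eight websites.
response_times = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

summary = {
    "count": len(response_times),
    "mean": statistics.mean(response_times),
    # Population standard deviation; use statistics.stdev for a sample estimate.
    "std_dev": statistics.pstdev(response_times),
    "min": min(response_times),
    "max": max(response_times),
}
print(summary)
```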
1.2.7 Outlier Analysis
According to Maimon & Rokach (2006), data objects that do not conform to, and are different
from, the general behaviour of the data set are known as outliers. Outliers help in fraud
detection by indicating activities that can lead to fraud. The analysis of outliers is one of
the important data mining tasks. Outliers are also defined as data elements that are different
from the other observations gathered, and they raise doubt about the mechanism by which they
were generated (Maimon & Rokach, 2006). According to Perner (2009), outliers are abnormalities
or anomalies in the data. There are three types of outliers: global, collective and contextual.
Various approaches are used to detect outliers, such as the distance-based approach. This
approach uses a full-dimensional distance metric between pairs of data elements in a
high-dimensional space to find outliers. Outliers deviate from the mechanism that generates
normal data. Outlier analysis helps in the detection of credit card fraud and other financial
fraud (Perner, 2009).
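The distance-based approach can be sketched as follows: a point is flagged as an outlier when
the distance to its k-th nearest neighbour exceeds a threshold. The values k = 2 and
threshold = 3.0, as well as the sample points, are assumptions chosen for illustration.

```python
import math


def distance_outliers(points, k=2, threshold=3.0):
    """Return the points whose k-th nearest neighbour is farther than threshold."""
    outliers = []
    for i, p in enumerate(points):
        # Euclidean distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        if dists[k - 1] > threshold:
            outliers.append(p)
    return outliers


# Hypothetical two-dimensional records; the last one lies far from the rest.
points = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.1), (1.2, 0.9), (8.0, 8.0)]
flagged = distance_outliers(points, k=2, threshold=3.0)
print(flagged)
```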
2 Methodology
2.1 Flowchart of System Design
Figure 4 Flowchart of System Design
2.2 Explanation of System Design:
2.2.1 System Module A: Extraction:
The data are extracted from Excel files, which contain website loading time, response time,
ranking, country and other attributes used to determine the popularity of the websites.
Popularity results are calculated on the basis of various web resources. Our extraction system
draws on the following sources:
• Archived data consisting of Excel files.
• Website link data (www data).
• IT user researched data.
2.2.2 System Module B: Data Server Module
After the data have been obtained from the various sources, they need to be stored in a
temporary database. A temporary database known as the data server is used for this purpose; it
holds the data files so that they can be used in the transformation step.
2.2.3 System Module C: Data Pre-processing Module
Data collected from different sources need to be processed before being loaded into the data
warehouse. As the data are obtained from the three different sources discussed above, they may
be incomplete, noisy, dirty, inconsistent or redundant. Such data cannot produce quality
results, so they must be processed first. The data warehouse needs to be consistent, and its
consistency depends heavily on the quality of pre-processing. The following steps are performed
in this module.
• Data Cleansing: In this step we identify the incomplete, incorrect, inaccurate or
irrelevant parts of the data and then replace, modify or delete them. For data cleansing
in Excel we will use the Data Cleansing option from XLTools.
• Data Integration: The data collected from the different sources need to be integrated
for the transformation process. This step is done because it is not practical to
implement a different system for each extraction source.
• Data Transformation: Data transformation is part of this pre-processing module. It
converts each URL into a specific format that is consistent across the system, and the
data for all other entities are likewise converted into consistent formats. Units and
rates are also fixed in a specific format in order to avoid inconsistency.
• Data Reduction: In this step the Excel file is checked with XLTools. The remove
redundant option is used to eliminate redundant data.
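The report performs these steps with XLTools inside Excel. Purely as an illustration of the
same ideas in code, the following Python sketch drops incomplete records, normalizes URLs to a
consistent format and removes duplicates. The field names url and load_time, the helper names
and the sample records are hypothetical.

```python
from urllib.parse import urlsplit


def normalise_url(url):
    """Reduce a URL to a lower-case host name without scheme or leading 'www.'."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def clean_records(records):
    """Cleansing, transformation and reduction in one pass."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get("load_time") is None:
            continue  # cleansing: skip incomplete records
        key = normalise_url(rec["url"])  # transformation: consistent URL format
        if key in seen:
            continue  # reduction: skip redundant records
        seen.add(key)
        cleaned.append({**rec, "url": key})
    return cleaned


# Hypothetical raw records combined from several sources.
raw = [
    {"url": "http://www.Example.com/index.html", "load_time": 1.2},
    {"url": "example.com", "load_time": 1.3},
    {"url": "https://music.example.org", "load_time": None},
    {"url": "https://shop.example.net/", "load_time": 0.8},
]
cleaned = clean_records(raw)
print(cleaned)
```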
2.2.4 System Module D: Data Mining
This step involves the business logic. The data that have been cleaned, integrated and
transformed are now used for the computations that meet the business needs of the system
design. One of the main business needs for our system is to know the popularity of the
websites. The steps followed in the data mining module are:
• Classification (decision tree, using Weka): To examine the behaviour of an individual
site, for example how popular a certain site is and which country visits it most, we use
the classification approach. This is done with the Weka software, which produces a tree
model showing the popularity trend. The data must be in a format that Weka
understands, i.e. *.arff, *.arff.gz etc.
Figure 5 Weka
The Classify option is used for classification.
• Clustering: The system uses clustering techniques when the user requests the cluster
or group of sites that are most popular. Our system uses the Weka software, so the
Excel file is first converted into a format that Weka understands. Clusters are then
built in the Weka Explorer using the Cluster option, as shown in figure 5.
• Association Rule Mining: The system uses association rule mining when we need to
know which site a user opens next after visiting a popular site. All of these steps are
available in the Weka Explorer, as shown in figure 5.
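Weka reads data in the ARFF format, which consists of a relation name, a list of attribute
declarations and a data section. The following Python sketch writes a small ARFF text from rows
of values. The relation name, attribute names and sample rows are hypothetical and do not
reproduce the data sets of this report; values containing spaces or commas would additionally
need quoting, which this sketch does not handle.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and rows of values."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes: two numeric measures and a nominal class.
attributes = [
    ("load_time", "numeric"),
    ("response_time", "numeric"),
    ("popularity", "{popular,unpopular}"),
]
rows = [(1.2, 0.3, "popular"), (4.8, 1.9, "unpopular")]

arff_text = to_arff("websites", attributes, rows)
print(arff_text)
```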
2.2.5 System Module E: Loading into Warehouse:
Once the data have been extracted and transformed, they are sent to the data warehouse. The
data warehouse is designed so that the most visited entities are grouped together in one table
in order to decrease computational cost.
2.2.6 System Module F: Visualization of the Data:
In this module, the visualization is designed to meet the user's business demands. For the
Excel sheet data, if the business need is popularity, the popularity module is displayed, which
shows popularity together with its association, classification and clustering results.
2.3 Difficulties and Solutions:
2.3.1 Extraction:
Extracting the data is difficult, because we cannot rely on a single source. Extracting data
from one source and drawing conclusions from it leads to inaccurate results. Gathering the data
and verifying their authenticity is always a problem for extraction.
Solution: The data should be gathered from multiple sources, and the sources should be
authentic. When gathering the Excel data set, we collected data for multiple sites to make it
authentic.
2.3.2 Transformation Difficulties:
• Same Data, Different Representation: One difficulty is that the same data often have
two different representations, for example dates and website names.
Solution: The representations should be converted by a technique that recognizes both
as the same value.
• Domain Value Redundancy: Sometimes the unit used is misinterpreted by the customer,
who then changes the unit of measure, resulting in inconsistent data.
Solution: The unit should be fixed, and before data entry a screen should be displayed to every
user stating the unit used in the system.
• Non-atomic Data Fields: A non-atomic field resembles the address problem: there is no
specific format for an address that separates the street number, house number and name.
Solution: The system should apply a technique that identifies which part is the street number,
which part is the name, and so on.
• Classification: Classification works on dense data, which makes it time consuming and
computationally expensive. The classification module is used to determine the pattern
and details of any single entity value.
Solution: Use this module only when it is really needed, for example when the details of an
attribute value are important from a business perspective.
• Clustering: In clustering we often find outliers, and these outliers distort the
computed results, so the problem must be addressed. In the Excel data, for example,
some of the records fall outside every cluster. We eliminated the outliers in the data
set to obtain correct results.
2.3.3 Loading Issue:
• Full Data Refresh: When the Excel data are loaded in blocks into the data warehouse, it
must be guaranteed that the current values of the Excel block being loaded do not
change and that the target table is empty.
• Incremental Data Refresh: When the Excel data are loaded in blocks into the data
warehouse, it must be guaranteed that the current values of the Excel block do not
change and that the data warehouse already contains the target table.
• Trickle/Continuous Feed: Continuously updating rows and applying insert and delete
operations may slow down the system functions.
Solution: To avoid this, only the operations that are important for the data warehouse should
be performed, and complete block loading or incremental loading should be done only when
activity on the data warehouse and the system is low or stopped, in order to avoid slowness and
any insert or update problems.
3 Experiments and Results Discussion
3.1 Listening Music Website Popularity
When the outcome, or class, is numeric, and all the attributes are numeric, linear regression is a
natural technique to consider.
3.1.1 Linear Regression
Figure 6 Linear Regression (Listening Music Website Popularity)
Correlation Coefficient
The correlation coefficient is close to one, which indicates a near-perfect positive linear
relationship between the two variables: as one increases, the other increases proportionally.
Errors
For all of the error measures shown above, a lower value means a more precise model, and a
value of 0 indicates a statistically perfect model.
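The correlation coefficient and the four error measures reported in this section can be
sketched in Python as follows. The formulas are the standard textbook definitions, with the
relative measures taken against a predictor that always outputs the mean of the actual values;
the exact baseline used by a particular tool may differ. The sample values are made up.

```python
import math


def regression_metrics(actual, predicted):
    """Correlation coefficient and common error measures for numeric prediction."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of the baseline that always predicts the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "mean_absolute_error": abs_err / n,
        "root_mean_squared_error": math.sqrt(sq_err / n),
        "relative_absolute_error": abs_err / base_abs,
        "root_relative_squared_error": math.sqrt(sq_err / base_sq),
        "correlation_coefficient": cov / math.sqrt(base_sq * var_p),
    }


# Hypothetical actual and predicted values; only the last prediction is wrong.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Multiplying the two relative measures by 100 gives the percentages shown in the result figures.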
3.1.2 Logistic Regression
Figure 7 Logistic Regression (Listening Music Website Popularity)
The accuracy is around 92%, which represents 47 out of 51 records. The confusion matrix
shows that the model was rather better at classifying examples of class ‘popular’ than of class
‘unpopular’. Only 2 errors were made against 31 correctly classified examples of class a,
whereas 2 errors were made against only 16 correctly classified examples of class b.
This difference is reflected in the Precision and Recall measures for the two classes. The Recall
for ‘popular’ is high (0.939), reflecting the fact that there were only 2 false negatives, i.e.
‘popular’ examples classified as ‘unpopular’. The Precision for ‘popular’ is the same (0.939),
reflecting the fact that there were also 2 false positives, i.e. ‘unpopular’ examples classified
as ‘popular’.
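The accuracy, recall and precision quoted above follow directly from the confusion matrix. The
following Python sketch recomputes them, assuming the counts are read as 31 correctly
classified and 2 misclassified ‘popular’ examples, and 16 correctly classified and 2
misclassified ‘unpopular’ examples. The function name binary_metrics is made up for
illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, recall and precision for the positive class of a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),       # share of actual positives that were found
        "precision": tp / (tp + fp),    # share of predicted positives that were right
    }


# 'popular' is treated as the positive class.
music = binary_metrics(tp=31, fn=2, fp=2, tn=16)
print({name: round(value, 3) for name, value in music.items()})
```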
3.1.3 Multilayer Perceptron
The perceptron computes a single output from multiple real-valued inputs by forming a linear
combination according to its input weights and then possibly putting the output through some
nonlinear activation function. Multilayer perceptron is closely related to logistic regression.
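The computation performed by a single perceptron can be sketched as a weighted sum followed by
an activation function. The logistic (sigmoid) activation is assumed here, which also shows the
relation to logistic regression: one unit with this activation computes the same function as a
logistic regression model. The weights, bias and inputs are made-up values.

```python
import math


def perceptron_output(inputs, weights, bias):
    """Linear combination of the inputs passed through a logistic activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs and weights; the weighted sum is 0.5 - 0.5 + 0.0 = 0.
output = perceptron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)
```

A multilayer perceptron stacks several layers of such units, so that the outputs of one layer
become the inputs of the next.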
Figure 8 Multilayer Perceptron (Listening Music Website Popularity)
The accuracy is high at 90.2%, which represents 46 out of 51 records. The inter-rater
agreement statistic (Kappa) shows a moderate level of agreement. The confusion matrix
shows that the model was rather better at classifying examples of class ‘popular’ than of class
‘unpopular’. Only 2 errors were made against 31 correctly classified examples of class a,
whereas 3 errors were made against only 15 correctly classified examples of class b.
This difference is reflected in the Precision and Recall measures for the two classes. The Recall
for ‘popular’ is high (0.939), reflecting the fact that there were only 2 false negatives, i.e.
‘popular’ examples classified as ‘unpopular’. The Precision for ‘popular’ is lower (0.912),
reflecting the fact that there were 3 false positives, i.e. ‘unpopular’ examples classified as
‘popular’.
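The Kappa statistic compares the observed agreement between actual and predicted classes with
the agreement expected by chance. The following Python sketch computes Cohen's kappa from a
confusion matrix, assuming the counts above are read as 31 correct and 2 misclassified
‘popular’ examples, and 15 correct and 3 misclassified ‘unpopular’ examples.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of examples on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Rows: actual popular, actual unpopular. Columns: predicted popular, predicted unpopular.
music_mlp = [[31, 2], [3, 15]]
kappa = cohen_kappa(music_mlp)
print(round(kappa, 2))
```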
3.1.4 Sequential Minimal Optimization for Regression (SMOreg)
SMOreg handles missing values by replacing them and transforms nominal attributes into binary ones.
Figure 9 SMOreg (Listening Music Website Popularity)
Correlation Coefficient
The correlation coefficient is close to one, which indicates a near-perfect positive linear
relationship between the two variables: as one increases, the other increases proportionally.
Errors
For all of the error measures shown above, a lower value means a more precise model, and a
value of 0 indicates a statistically perfect model.
3.2 Prediction of Fitness Website Design
3.2.1 Linear Regression
Figure 10 Linear Regression (Prediction of Fitness Website Design)
Correlation Coefficient
The correlation coefficient is one, which indicates a perfect positive linear relationship
between the two variables: as one increases, the other increases proportionally.
Errors
For all of the error measures shown above, a lower value means a more precise model, and a
value of 0 indicates a statistically perfect model.
3.2.2 Logistic Regression
Figure 11 Logistic Regression (Prediction of Fitness Website Design)
The accuracy is 98%, which represents 49 out of 50 records. The confusion matrix shows that
the model was rather better at classifying examples of class ‘unpopular’ than of class
‘popular’. No errors were made in the 47 examples of class b, whereas 1 error was made in only
3 examples of class a.
This difference is reflected in the Precision and Recall measures for the two classes. The Recall
for ‘unpopular’ is perfect (1.000), reflecting the fact that there were no false negatives, i.e.
no ‘unpopular’ examples classified as ‘popular’. The Precision for ‘unpopular’ is lower (0.98),
reflecting the fact that there was 1 false positive, i.e. a ‘popular’ example classified as
‘unpopular’.
3.2.3 Multilayer Perceptron
Figure 12 Multilayer Perceptron (Prediction of Fitness Website Design)
The accuracy is high at 96%, which represents 48 out of 50 records. The inter-rater
agreement statistic (Kappa) shows a weak level of agreement. The confusion matrix shows
that the model was rather better at classifying examples of class ‘unpopular’ than of class
‘popular’. No errors were made in the 47 examples of class b, whereas 2 errors were made in the
3 examples of class a.
This difference is reflected in the Precision and Recall measures for the two classes. The Recall
for ‘unpopular’ is perfect (1.000), reflecting the fact that there were no false negatives, i.e.
no ‘unpopular’ examples classified as ‘popular’. The Precision for ‘unpopular’ is lower (0.959),
reflecting the fact that there were 2 false positives, i.e. ‘popular’ examples classified as
‘unpopular’.
3.2.4 Sequential Minimal Optimization for Regression (SMOreg)
Figure 13 SMOreg (Prediction of Fitness Website Design)
Correlation Coefficient
The correlation coefficient is one, which indicates a perfect positive linear relationship
between the two variables: as one increases, the other increases proportionally.
Errors
For all of the error measures shown above, a lower value means a more precise model, and a
value of nearly 0 indicates a statistically near-perfect model.
3.3 Popularity Prediction of E-commerce Website
3.3.1 Linear Regression
Figure 14 Linear Regression (Popularity Prediction of E-commerce Website)
Correlation Coefficient
The correlation coefficient is close to one, which indicates a strong positive linear
relationship between the two variables: as one increases, the other increases proportionally.
Errors
For all of the error measures shown above, a lower value means a more precise model, and a
value of 0 indicates a statistically perfect model. In this case, however, the errors are
somewhat high.
3.3.2 Logistic Regression
Figure 15 Logistic Regression (Popularity Prediction of E-commerce Website)
The accuracy is 94%, which represents 47 out of 50 records. The confusion matrix shows that
the model was slightly better at classifying examples of class ‘unpopular’ than of class
‘popular’. Only 1 error was made against 18 correctly classified examples of class b, whereas
2 errors were made against 29 correctly classified examples of class a.
This difference is reflected in the Precision and Recall measures for the two classes. The Recall
for ‘unpopular’ is high (0.947), reflecting the fact that there was only 1 false negative, i.e.
an ‘unpopular’ example classified as ‘popular’. The Precision for ‘unpopular’ is lower (0.900),
reflecting the fact that there were 2 false positives, i.e. ‘popular’ examples classified as
‘unpopular’.
3.3.3 Multilayer Perceptron
Figure 16 Multilayer Perceptron (Popularity Prediction of E-commerce Website)
The accuracy is almost perfect at 98%, which represents 49 out of 50 records. The inter-rater
agreement statistic (Kappa) shows a strong level of agreement. The confusion matrix shows
that the model was rather better at classifying examples of class ‘popular’ than of class
‘unpopular’. No errors were made in the 31 examples of class a, whereas 1 error was made
against 18 correctly classified examples of class b.
This difference is reflected in the Precision and Recall measures for the two classes. The Recall
for ‘popular’ is perfect (1.000), reflecting the fact that there were no false negatives, i.e.
no ‘popular’ examples classified as ‘unpopular’. The Precision for ‘popular’ is lower (0.969),
reflecting the fact that there was one false positive, i.e. an ‘unpopular’ example classified as
‘popular’.
3.3.4 Sequential Minimal Optimization for Regression (SMOreg)
Figure 17 SMOreg (Popularity Prediction of E-commerce Website)
Correlation Coefficient
The correlation coefficient is close to one, which indicates a strong positive linear
relationship between the two variables: as one increases, the other increases proportionally.
Errors
For all of the error measures shown above, a lower value means a more precise model, and a
value of 0 indicates a statistically perfect model. In this case, however, the errors are
somewhat high.
3.4 Popularity Prediction of Education Website
3.4.1 Linear Regression
Figure 18 Linear Regression (Popularity Prediction of Education Website)
Correlation Coefficient
The correlation coefficient is 0.56, which indicates a moderate statistical correlation.
Errors
The error measures shown above are all high, which indicates that the model is imprecise.
3.4.2 Logistic Regression
Figure 19 Logistic Regression (Popularity Prediction of Education Website)
The accuracy is 80%, which represents 40 out of 50 records. The confusion matrix shows that
the model was rather better at classifying examples of class ‘popular’ than of class
‘unpopular’. Only 4 errors were made against 23 correctly classified examples of class a,
whereas 6 errors were made against only 17 correctly classified examples of class b.
This difference is reflected in the Precision and Recall measures for the two classes. The Recall
for ‘popular’ is high (0.852), reflecting the fact that there were only 4 false negatives, i.e.
‘popular’ examples classified as ‘unpopular’. The Precision for ‘popular’ is lower (0.793),
reflecting the fact that there were 6 false positives, i.e. ‘unpopular’ examples classified as
‘popular’.
3.4.3 Multilayer Perceptron
Figure 20 Multilayer Perceptron (Popularity Prediction of Education Website)
The accuracy is moderate at 70%, which represents 35 out of 50 records. The inter-rater
agreement statistic (Kappa) shows a minimal level of agreement. The confusion matrix shows
that the model was rather better at classifying examples of class ‘popular’ than of class
‘unpopular’. 6 errors were made against 21 correctly classified examples of class a, whereas
9 errors were made against 14 correctly classified examples of class b.
This difference is reflected in the Precision and Recall measures for the two classes. The Recall
for ‘popular’ is higher (0.78), reflecting the fact that there were 6 false negatives, i.e.
‘popular’ examples classified as ‘unpopular’. The Precision for ‘popular’ is lower (0.700),
reflecting the fact that there were 9 false positives, i.e. ‘unpopular’ examples classified as
‘popular’.
3.4.4 Sequential Minimal Optimization for Regression (SMOreg)
Figure 21 SMOreg (Popularity Prediction of Education Website)
Correlation Coefficient
The correlation coefficient is 0.40, which indicates a weak to moderate statistical correlation.
Errors
The error measures shown above are all high, which indicates that the model is imprecise.
3.5 Comparison
3.5.1 Linear Regression
                              Music     Fitness   E-commerce   Education
root mean-squared error       1.12      0.0028    11.37        4.164
mean absolute error           0.16      0.0023    3.96         3.17
root relative squared error   6.16%     0.01%     31.89%       82.89%
relative absolute error       0.99%     0.01%     11.75%       73.15%
correlation coefficient       0.99      1         0.94         0.56
Linear Regression was applied to the four popularity data sets. The results show that the
prediction for Fitness sites is the best on all five metrics: it has the smallest value for each
error measure and the largest correlation coefficient. The Music website prediction is the
second best on all five metrics. E-commerce and Education rank third and fourth on the relative
error measures and the correlation coefficient, although E-commerce has the largest absolute
errors. E-commerce has a strong correlation of 0.94, but its errors are quite high overall.
Education has a moderate correlation coefficient and the highest relative errors.
3.5.2 Logistic Regression
              Correctly classified   Recall                Precision
              instances              Popular   Unpopular   Popular   Unpopular
Music         92.15%                 0.93      0.88        0.94      0.88
Fitness       98%                    0.68      1           1         0.98
E-commerce    94%                    0.94      0.95        0.97      0.90
Education     80%                    0.85      0.74        0.79      0.81
Logistic Regression was applied to the four popularity data sets. The results show that the
prediction for Fitness sites is the best, with 98 percent accuracy and the fewest false
positives and false negatives. The second best prediction is E-commerce with 94 percent,
whereas Music and Education are third and fourth with 92.15 and 80 percent accuracy
respectively.
3.5.3 Multilayer Perceptron
              Correctly classified   Kappa   Recall                Precision
              instances                      Popular   Unpopular   Popular   Unpopular
Music         90.2%                  0.78    0.93      0.83        0.91      0.88
Fitness       96%                    0.48    0.33      1.00        1.00      0.96
E-commerce    98%                    0.96    1.00      0.95        0.97      1.00
Education     70%                    0.39    0.79      0.61        0.70      0.70
The Multilayer Perceptron was applied to the four popularity data sets. The results show that
the prediction for E-commerce sites is the best, with 98 percent accuracy, the fewest false
positives and false negatives, and a strong level of agreement. The second best prediction is
Fitness with 96 percent, whereas Music and Education are third and fourth with 90.2 and 70
percent accuracy respectively.
3.5.4 Sequential Minimal Optimization for Regression (SMOreg)
                              Music     Fitness   E-commerce   Education
root mean-squared error       1.21      0.08      11.2         5.16
mean absolute error           0.22      0.05      2.98         3.68
root relative squared error   6.64%     0.31%     31.40%       102.22%
relative absolute error       1.38%     0.26%     8.83%        84.95%
correlation coefficient       0.99      1         0.95         0.40
SMOreg was applied to the four popularity data sets. The results show that the prediction for
Fitness sites is the best on all five metrics: it has the smallest value for each error measure
and the largest correlation coefficient. The Music website prediction is the second best on all
five metrics. E-commerce and Education rank third and fourth on the relative error measures and
the correlation coefficient, although E-commerce has the largest root mean-squared error.
E-commerce has a strong correlation of 0.95, but its errors are quite high overall. Education
has the lowest correlation coefficient and the highest relative errors.
3.5.5 Overall Comparison
              Linear       Logistic     Multilayer   SMOreg   Average
              Regression   Regression   Perceptron
Music         0.99         92.15%       90.2%        0.99     0.95
Fitness       1            98%          96%          1        0.98
E-commerce    0.94         94%          98%          0.95     0.95
Education     0.56         80%          70%          0.40     0.61
Average       0.87         91.03%       88.55%       0.85     0.87
Overall, the Fitness prediction is the best under three of the four methods. Fitness has the
strongest correlation coefficient in Linear Regression and SMOreg and the highest accuracy in
Logistic Regression, whereas E-commerce has the highest accuracy in Multilayer Perceptron and
the second-best accuracy in Logistic Regression. Music ranks third, with the second-best
correlation coefficient in Linear Regression and SMOreg. Education ranks fourth, with the lowest
accuracy and correlation coefficient under every method. Across methods, Logistic Regression
gives the best average result, with an average accuracy of 91.03 percent (against 88.55 percent
for Multilayer Perceptron).
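The averages for the two accuracy-based methods can be reproduced from the per-data-set figures in the comparison table. The sketch below is illustrative arithmetic only; the dictionary layout and variable names are assumptions, and the numbers are copied from the table.

```python
# Accuracy (percent) per data set, copied from the comparison table.
accuracy = {
    "Logistic Regression": {
        "Music": 92.15, "Fitness": 98.0, "E-commerce": 94.0, "Education": 80.0,
    },
    "Multilayer Perceptron": {
        "Music": 90.2, "Fitness": 96.0, "E-commerce": 98.0, "Education": 70.0,
    },
}

# Mean accuracy of each method across the four data sets.
average = {
    method: sum(scores.values()) / len(scores)
    for method, scores in accuracy.items()
}

# Data set with the highest accuracy under each method.
best = {
    method: max(scores, key=scores.get)
    for method, scores in accuracy.items()
}

for method in accuracy:
    print(f"{method}: average {average[method]:.4f}%, best data set: {best[method]}")
```

The Logistic Regression mean works out to 91.0375, which the report gives as 91.03, and the Multilayer Perceptron mean to 88.55.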
4 Conclusion:
Website popularity prediction was performed on four data sets: Fitness, Education, Music and
E-commerce websites. The prediction system is designed to determine which category of website
is most suitable for advertising.
Overall, the data sets show fairly high accuracy under the time series prediction methods used
(Linear Regression, Logistic Regression, Multilayer Perceptron and Sequential Minimal
Optimization). Most of the data sets also show low errors and high correlation. However, the
collected data contains only 50 objects, which is a small sample by the standards of real-world
data mining. A much larger data set would be needed to confirm this level of accuracy.
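To give a rough sense of why 50 objects is a small sample, the sketch below estimates the uncertainty of an accuracy figure using the normal approximation to the binomial distribution. This is an illustration under stated assumptions, not part of the report's method: it assumes all 50 objects act as independent test cases, applies the formula to the 91.03 percent average accuracy purely as an example, and uses a hypothetical larger sample of 5000 objects for comparison. The function name accuracy_interval is also illustrative.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n independent cases.

    Assumption: normal approximation to the binomial distribution.
    """
    std_err = math.sqrt(accuracy * (1 - accuracy) / n)
    low = max(0.0, accuracy - z * std_err)
    high = min(1.0, accuracy + z * std_err)
    return low, high


# 50 objects, as in the report; 5000 is a hypothetical larger sample.
small_low, small_high = accuracy_interval(0.9103, 50)
large_low, large_high = accuracy_interval(0.9103, 5000)

print(f"n=50:   {small_low:.3f} to {small_high:.3f}")
print(f"n=5000: {large_low:.3f} to {large_high:.3f}")
```

Under these assumptions the interval at 50 objects spans roughly 16 percentage points, while the interval at 5000 objects is about ten times narrower.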
Bibliography
Buja, A., & Lee, Y.-S. (2001). Data mining criteria for tree-based regression and classification. 27-36.
Chou, J.-S., Chiu, C.-K., Farfoura, M., & Al-Taharwa, I. (2011, May). Optimizing the Prediction Accuracy of
Concrete Compressive Strength Based on a Comparison of Data-Mining Techniques. Retrieved
from Asce Library: http://ascelibrary.org/doi/abs/10.1061/%28ASCE%29CP.1943-5487.0000088
Cortes, C., & Pregibon, D. (2001). Signature-Based Methods for Data Streams. Data Mining and
Knowledge Discovery, 167 - 182.
Data Segmentation. (n.d). Retrieved from Cogent Analytic:
http://www.cogentdataanalytics.com/glossary/data-segmentation.php
Freitas, A. (1997). A Genetic Programming Framework for Two Data Mining Tasks: Classification and
Generalized Rule Induction.
Gorunescu, F. (2011). Data Mining: Concepts, Models and Techniques.
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques.
Elsevier.
Liebowitz, J. (1988). Expert System Applications to Telecommunications. New York: John Wiley & Sons,
Inc.
MacDonald, M. (2011). Creating a Website: The Missing Manual. United States: O'Reilly Media.
Maimon, O., & Rokach, L. (2006). Data Mining and Knowledge Discovery Handbook. Springer Science &
Business Media.
Nayak, R. (2008). Data mining in web services discovery and monitoring. International Journal of Web
Services Research, 5(1). pp. 62-80.
NEURAL NETWORK. (2014). Retrieved from Mu Sigma: http://www.musigma.com/analytics/thought_leadership/cafe-cerebral-neural-network.html
Ngai, E., Xiu, L., & Chau, D. (2009). Application of data mining techniques in customer relationship
management. A literature review and classification, 36(2), 2592–2602.
Perner, P. (2009). Machine Learning and Data Mining In Pattern Recognition. Springer Science &
Business Media.
Plumley, G. (2010). Website Design and Development: 100 Questions to Ask Before Building a Website.
Canada: Wiley.
Ramageri, B. M., & Desai, B. (2013). Role of Data Mining in Retail Sector. International Journal of
Computer Science Engineering, 5(1), 47-50. Retrieved from http://www.ijcse.net/
Rouse, M. (n.d). Association Rules (in data mining). Retrieved from Tech target:
http://searchbusinessanalytics.techtarget.com/definition/association-rules-in-data-mining
Soares, M. (2008). Applications of Data Mining in E-business and Finance. IOS Press.
Tan, H. (2012). Knowledge Discovery and Data Mining. Springer Science & Business Media.
Weiss, G. M. (1998). ANSWER: Network monitoring using object-oriented rule. Proceedings of the Tenth
Conference on Innovative Applications of Artificial Intelligence.