
Machine Learning

ETCS-454
Maharaja Agrasen Institute of Technology, PSP area,
Sector – 22, Rohini, New Delhi – 110085
(Affiliated to Guru Gobind Singh Indraprastha University,
New Delhi)
Vision of the Institute
To nurture young minds in a learning environment of high academic value and imbibe spiritual
and ethical values with technological and management competence.
Mission of the institute:
The Institute shall endeavor to incorporate the following basic missions in the teaching
methodology:
Engineering Hardware – Software Symbiosis
Practical exercises in all Engineering and Management disciplines shall be carried out by
Hardware equipment as well as the related software enabling deeper understanding of basic
concepts and encouraging inquisitive nature.
Life – Long Learning
The Institute strives to match technological advancements and encourage students to keep
updating their knowledge for enhancing their skills and inculcating their habit of continuous
learning.
Liberalization and Globalization
The Institute endeavors to enhance technical and management skills of students so that they are
intellectually capable and competent professionals with Industrial Aptitude to face the challenges
of globalization.
Diversification
The Engineering, Technology and Management disciplines have diverse fields of studies with
different attributes. The aim is to create a synergy of the above attributes by encouraging
analytical thinking.
Digitization of Learning Processes
The Institute provides seamless opportunities for innovative learning in all Engineering and
Management disciplines through digitization of learning processes using analysis, synthesis,
simulation, graphics, tutorials and related tools to create a platform for multi-disciplinary
approach.
Entrepreneurship
The Institute strives to develop potential Engineers and Managers by enhancing their skills and
research capabilities so that they become successful entrepreneurs and responsible citizens.
Vision of CSE Department
To Produce “Critical thinkers of Innovative Technology”
Mission of CSE department
To provide an excellent learning environment across the computer science discipline to
inculcate professional behavior, strong ethical values, innovative research capabilities and
leadership abilities which enable students to become successful entrepreneurs in this globalized
world.
1. To nurture an excellent learning environment that helps students to enhance their
problem solving skills and to prepare students to be lifelong learners by offering a solid
theoretical foundation with applied computing experiences and educating them about
their professional, and ethical responsibilities.
2. To establish Industry-Institute Interaction, making students ready for the industrial
environment and be successful in their professional lives.
3. To promote research activities in the emerging areas of technology convergence.
4. To build engineers who can look into the technical aspects of an engineering solution, thereby
setting the ground for producing successful entrepreneurs.
Introduction to Machine Learning Lab
Course Objective:
The main objective of this course is to ensure that students understand, in practice, the working
of popular machine learning techniques such as the C4.5 algorithm, Naïve Bayes, decision trees,
random forests and neural networks, as applied to statistical data.
Course Outcome:
At the end of the course, a student will be able to:
C454.1: Study and implement supervised learners (e.g., Naive Bayes) and unsupervised learners.
C454.2: Estimate the precision, recall, F-measure and accuracy of classifier methods on a
dataset (e.g., breast cancer) using 5-fold cross-validation.
C454.3: Understand datasets of various problems and evaluate their attributes for training,
testing and validation; estimate overfitting if it exists.
C454.4: Develop a machine learning method to predict stock prices, education of infants, etc.,
based on past variation, or to predict how people would rate movies, books, etc.
C454.5: Select two programming languages to implement the same problem under different
classifier algorithms, using different functions to implement the concept.
C454.6: Evaluate performance by measuring the sum of distances of each example from its
class center, and test performance as a function of the number of clusters.
LIST OF EXPERIMENTS
(As prescribed by G.G.S.I.P.U)
MACHINE LEARNING LAB
Paper Code: ETCS-454
1. Introduction to machine learning lab with tools (hands on weka)
2. Understanding of Machine learning algorithms List of Databases
3. Implement K means Algorithm using WEKA Tool. Implement the K-means algorithm and
apply it to the data you selected. Evaluate performance by measuring the sum of Euclidean
distance of each example from its class center. Test the performance of the algorithm as a
function of the parameter k.
4. Study of Databases and understanding attributes evaluation in regard to problem description.
5. Working of major classifiers: (a) Naïve Bayes (b) Decision Tree (c) CART (d) ARIMA
(e) Linear and Logistic Regression (f) Support Vector Machine (g) KNN (datasets can be the
Breast Cancer data file or the Reuters data set).
6. Design a prediction model for Analysis of round trip Time of Flight measurement from a
supermarket using random forest, naïve bayes etc.
7. Implement supervised learning (KNN classification). Estimate the accuracy using 5-fold
cross-validation, choosing the appropriate options for missing values.
8. Introduction to R. Be aware of basics of machine learning methods in R.
9. Build and Develop a model in R for a particular classifier (random Forest).
10. Develop a machine learning method using Neural Networks in python to Predict stock prices
based on past price variation.
LIST OF EXPERIMENTS
(Beyond the syllabus)
1. Case study of the RMS Titanic database to predict survival using decision trees, logistic
regression, KNN (k-nearest neighbors) and support vector machines.
2. Understanding of Indian education in rural villages to predict whether a girl child
will be sent to school or not.
3. Understanding of a dataset of contact patterns among students collected at the
National University of Singapore.
MACHINE LEARNING
ETCS 454
Faculty name: Mr. Sandeep Tayal
Student name: Aditya Bhardwaj
Roll No.: 01396402714
Semester: VIII
Maharaja Agrasen Institute of Technology, PSP Area,
Sector – 22, Rohini, New Delhi – 110086
Index
Exp. No.   Experiment Name   Date of Performance   Date of Checking   Marks   Signature
Experiment 1
Hands on WEKA (Some Steps) on data set (iris.arff)
INSTRUCTIONS:
Weka is a landmark system in the history of the data mining and machine learning research
communities, because it is the only toolkit that has gained such widespread adoption and
survived for an extended period of time. The Weka or woodhen (Gallirallus australis) is an
endemic bird of New Zealand. It provides many different algorithms for data mining and
machine learning. Weka is open source and freely available. It is also platform-independent.
The GUI Chooser consists of four buttons:
Explorer: An environment for exploring data with WEKA.
Experimenter: An environment for performing experiments and conducting statistical tests
between learning schemes.
Knowledge Flow: This environment supports essentially the same functions as the Explorer
but with a drag-and-drop interface. One advantage is that it supports incremental learning.
Simple CLI: Provides a simple command-line interface that allows direct execution of
WEKA commands for operating systems that do not provide their own command line interface.
(a) Start Weka
Start Weka. This may involve finding it in your program launcher or double-clicking on the
weka.jar file. This will start the Weka GUI Chooser. The Weka GUI Chooser lets you
choose one of the Explorer, Experimenter, Knowledge Flow and the Simple CLI
(command line interface).
Weka GUI Chooser
Click the “Explorer” button to launch the Weka Explorer. This GUI lets you load datasets
and run classification algorithms. It also provides other features, like data filtering, clustering,
association rule extraction, and visualization, but we won’t be using these features right now.
(b) Open the Dataset (sample: iris.arff)
Dataset
A set of data items, the dataset, is a very basic concept of machine learning. A dataset is
roughly equivalent to a two-dimensional spreadsheet or database table. In WEKA, it is
implemented by the weka.core.Instances class. A dataset is a collection of examples, each one
of class weka.core.Instance. Each Instance consists of a number of attributes, any of which can
be nominal (one of a predefined list of values), numeric (a real or integer number) or a string
(an arbitrarily long list of characters, enclosed in "double quotes"). Additional types are date
and relational, which are not covered here but in the ARFF chapter. The external representation
of an Instances class is an ARFF file, which consists of a header describing the attribute types
and the data.
Click the “Open file…” button to open a data set and double click on the “data” directory.
Weka provides a number of small common machine learning datasets that you can use to practice
on. Select the “iris.arff” file to load the Iris dataset.
Weka Explorer interface with the Iris dataset loaded. The Iris Flower dataset is a famous dataset
from statistics and is heavily borrowed by researchers in machine learning. It contains 150
instances (rows) and 4 attributes (columns) plus a class attribute for the species of iris flower
(one of setosa, versicolor and virginica). In our example, we have not mentioned the attribute
type string, which defines "double quoted" string attributes for text mining. In recent
WEKA versions, date/time attribute types are also supported. By default, the last
attribute is considered the class/target variable, i.e. the attribute which should be
predicted as a function of all other attributes. If this is not the case, specify the target
variable via -c. The attribute numbers are one-based indices, i.e. -c 1 specifies the first
attribute. Some basic statistics and validation of given ARFF files can be obtained via the
main() routine of weka.core.Instances.
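The same statistics can also be obtained outside WEKA. Below is a minimal Python sketch, assuming SciPy and pandas are installed and that iris.arff has been copied from WEKA's data directory into the working directory (these are assumptions for illustration, not part of the manual):

# Illustrative sketch: inspect an ARFF file from Python (assumes scipy, pandas, iris.arff present)
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("iris.arff")    # parse the ARFF header and the @data section
df = pd.DataFrame(data)                    # 150 rows x 5 columns for the Iris dataset
print(meta)                                # attribute names and types from the header
print(df["class"].value_counts())          # class distribution (last attribute is the target)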
Classifier
Any learning algorithm in WEKA is derived from the abstract weka.classifiers.Classifier class.
Surprisingly little is needed for a basic classifier: a routine which generates a classifier model
from a training dataset (buildClassifier) and another routine which evaluates the generated
model on an unseen test dataset (classifyInstance), or generates a probability distribution for
all classes (distributionForInstance). A classifier model is an arbitrarily complex mapping from
all-but-one dataset attributes to the class attribute.
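WEKA's Classifier contract maps naturally onto the estimator interface of other libraries. The sketch below is only a scikit-learn analogy (an assumption for illustration, not WEKA's own API): fit plays the role of buildClassifier, predict of classifyInstance, and predict_proba of distributionForInstance.

# Rough scikit-learn analogy of the WEKA Classifier contract (illustrative only)
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()
clf.fit(X, y)                    # ~ buildClassifier: learn a model from the training dataset
print(clf.predict(X[:3]))        # ~ classifyInstance: predict a class label per instance
print(clf.predict_proba(X[:3]))  # ~ distributionForInstance: probability distribution over classes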
(c) Select and Run an Algorithm
Now that you have loaded a dataset, it’s time to choose a machine learning algorithm to model
the problem and make predictions. Click the “Classify” tab. This is the area for running
algorithms against a loaded dataset in Weka. You will note that the “ZeroR” algorithm is selected
by default. Click the “Start” button to run this algorithm.
Weka Results for the ZeroR algorithm on the Iris flower dataset. The ZeroR algorithm selects the
majority class in the dataset (all three species of iris are equally present in the data, so it picks the
first one: setosa) and uses that to make all predictions. This is the baseline for the dataset and the
measure by which all algorithms can be compared. The result is 33%, as expected (3 classes,
each equally represented, assigning one of the three to each prediction results in 33%
classification accuracy).
You will also note that the Test options panel selects cross-validation by default, with 10 folds. This
means that the dataset is split into 10 parts: the first 9 are used to train the algorithm, and the
10th is used to assess the algorithm. This process is repeated, allowing each of the 10 parts of the
split dataset a chance to be the held-out test set. Click the “Choose” button in the “Classifier”
section and click on “trees” and click on the “J48” algorithm. This is an implementation of the
C4.8 algorithm in Java (“J” for Java, 48 for C4.8, hence the J48 name) and is a minor extension
to the famous C4.5 algorithm. Click the “Start” button to run the algorithm.
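The same baseline-versus-tree comparison can be reproduced approximately in Python. This is a hedged analogue rather than WEKA itself: scikit-learn's DummyClassifier(strategy="most_frequent") behaves like ZeroR, and DecisionTreeClassifier is a CART-style tree rather than J48/C4.5, so the exact numbers will differ slightly.

# Approximate ZeroR vs. decision-tree comparison with 10-fold cross-validation (illustrative)
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
baseline = DummyClassifier(strategy="most_frequent")  # ~ ZeroR: always predict the majority class
tree = DecisionTreeClassifier(random_state=0)         # CART-style tree (not J48/C4.5)
print(cross_val_score(baseline, X, y, cv=10).mean())  # about 0.33 on the balanced Iris data
print(cross_val_score(tree, X, y, cv=10).mean())      # typically well above 0.90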
Weka J48 algorithm results on the Iris flower dataset
(d) Review Results
After running the J48 algorithm, you can note the results in the “Classifier output” section.
The algorithm was run with 10-fold cross-validation: this means it was given an opportunity to
make a prediction for each instance of the dataset (with different training folds) and the presented
result is a summary of those predictions.
Just the results of the J48 algorithm on the Iris flower dataset in Weka.
Firstly, note the Classification Accuracy. You can see that the model achieved a result of 144/150
correct or 96%, which seems a lot better than the baseline of 33%. Secondly, look at the
Confusion Matrix. You can see a table of actual classes compared to predicted classes and you
can see that there was 1 error where an Iris-setosa was classified as an Iris-versicolor, 2 cases
where Iris-virginica was classified as an Iris-versicolor, and 3 cases where an Iris-versicolor was
classified as an Iris-setosa (a total of 6 errors). This table can help to explain the accuracy
achieved by the algorithm.
Viva Questions:
Q1: Explain the main features of the WEKA toolkit.
Q2: What are the various tool menus for using algorithms on a problem data set?
Q3: Give examples of real-time applications where WEKA can be used.
Q4: What is the difference between the Experimenter and Knowledge Flow?
Q5: How can classifiers in WEKA be implemented using various sub-options?
Experiment 2
Aim: Understanding of Machine learning algorithms
INSTRUCTIONS:
Before applying data and algorithms to different problems, we need to understand some
machine learning algorithms. In this section, several machine learning algorithms from the top
10 data mining algorithms are described and evaluated, and several meta-algorithms are
presented that enhance the AUC results of the selected machine learning algorithms. The
machine learning classifiers for Web Spam detection are listed below (a short Python
comparison of several of these learners follows the list):
• Support Vector Machine (SVM) - SVM discriminates a set of high-dimensional features
using a hyperplane, or set of hyperplanes, that gives the largest minimum distance to separate
all data points among classes.
• Multilayer Perceptron Neural Network (MLP) - MLP is a non-linear feed-forward
network model which maps a set of inputs x onto a set of outputs y using multiple weighted
connections.
• Bayesian Network (BN) - A BN is a probabilistic graphical model for reasoning under
uncertainty, where the nodes represent discrete or continuous variables and the links
represent the relationships between them.
• C4.5 Decision Tree (DT) - A DT decides the target class of a new sample based on selected
features from the available data using the concept of information entropy. The nodes of the
tree are the attributes, each branch of the tree represents a possible decision, and the end
nodes or leaves are the classes.
• Random Forest (RF) - RF works by constructing multiple decision trees on various
sub-samples of the dataset and outputs the class that appears most often, or the mean
prediction, of the decision trees.
• Naïve Bayes (NB) - The NB classifier is a classification algorithm based on Bayes'
theorem with strong independence assumptions between features.
• K-Nearest Neighbour (KNN) - KNN is an instance-based learning algorithm that stores
all available data points and classifies new data points based on a similarity measure
such as distance.
The machine learning ensemble meta-algorithms, on the other hand, are:
• Boosting algorithms - Boosting works by combining a set of weak classifiers into a single
strong classifier. The weak classifiers are weighted in some way from the training data
points or hypotheses into a final strong classifier, so there is a variety of boosting
algorithms. Three boosting algorithms are introduced here:
• Adaptive Boosting (AdaBoost) - The weights of incorrectly labelled data points are
adjusted in AdaBoost such that the following classifiers focus more on incorrectly
labelled or difficult cases.
• LogitBoost - LogitBoost is an extension of AdaBoost which applies the logistic regression
cost function to AdaBoost, and thus classifies by using a regression scheme as the base
learner.
• Real AdaBoost - Unlike most boosting algorithms, which return binary-valued classes
(Discrete AdaBoost), Real AdaBoost outputs a real-valued probability of the class.
• Bagging - Bagging generates several training sets of the same size, uses the same machine
learning algorithm to build a model on each of them, and combines the predictions by
averaging. It often improves the accuracy and stability of the classifier.
• Dagging - Dagging generates a number of disjoint and stratified folds out of the data and
feeds each chunk of data to a copy of the machine learning classifier. A majority vote is
taken for predictions, since all the generated machine learning classifiers are put into the
Vote meta-classifier. Dagging is useful for base classifiers that are quadratic or worse in
time behaviour with respect to the number of instances in the training data.
• Rotation Forest - A rotation forest is constructed using a number of copies of the same
machine learning classifier (typically a decision tree), trained independently on a new set of
training features formed by sub-sampling the dataset with principal component analysis
applied to each subset.
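To get hands-on with several of these learners before the WEKA experiments, the sketch below (an assumption-laden illustration: it uses scikit-learn and its bundled Breast Cancer data as a stand-in dataset, not a Web Spam corpus) trains a few of the classifiers and ensemble meta-algorithms listed above and compares their cross-validated accuracy.

# Quick comparison of several of the classifiers/meta-algorithms above (illustrative sketch)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "AdaBoost (boosting)": AdaBoostClassifier(),
    "Bagging": BaggingClassifier(),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()  # 5-fold cross-validated accuracy
    print(f"{name}: {score:.3f}")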
Viva Questions:
Q1: What are different classifiers?
Q2: Recognize the situations where one particular classifier is used.
Q3: What are different clustering techniques?
Q4: Name different association algorithms. How are they different from neural networks?
Experiment 3
Aim: Understand clustering approaches and implement the K-means algorithm using the WEKA tool.
Fig. 1: Clustering techniques
K-means Clustering: The term "k-means" was first used by James MacQueen in 1967,
though the idea goes back to 1957. The standard algorithm was first proposed by Stuart
Lloyd in 1957 as a technique for pulse-code modulation, though it wasn't published until
1982. K-means is a widely used partitional clustering method in industry. The K-means
algorithm is the most commonly used partitional clustering algorithm because it can be easily
implemented and is the most efficient one in terms of execution time.
Here’s how the algorithm works
K-Means Algorithm: the algorithm for partitioning, where each cluster's center is
represented by the mean value of the objects in the cluster.
Input: k, the number of clusters; D, a data set containing n objects.
Output: A set of k clusters.
Method:
1. Arbitrarily choose k objects from D as the initial cluster centers.
2. Repeat steps 3 and 4:
3. (Re)assign each object to the cluster to which the object is most similar, using Eq. 1
(the distance measure), based on the mean value of the objects in the cluster.
4. Update the cluster means, i.e. calculate the mean value of the objects for each
cluster, until no change occurs.
A short Python sketch of this procedure is given below.
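The following sketch runs K-means on the Iris measurements, assuming scikit-learn is installed. KMeans exposes inertia_, the within-cluster sum of squared distances of each example from its cluster center, which is the quantity the experiment asks you to study as a function of k.

# K-means on the Iris measurements; within-cluster sum of squares as a function of k (sketch)
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)            # 150 examples, 4 numeric attributes
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=10).fit(X)
    # inertia_ = sum of squared Euclidean distances of examples to their cluster centers
    print(k, km.inertia_)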
Output:
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: iris
Instances: 150
Attributes: 5
            sepallength
            sepalwidth
            petallength
            petalwidth
            class
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 62.1436882815797
Initial starting points (random):
Cluster 0: 6.1,2.9,4.7,1.4,Iris-versicolor
Cluster 1: 6.2,2.9,4.3,1.3,Iris-versicolor
Missing values globally replaced with mean/mode
Final cluster centroids:
                            Cluster#
Attribute        Full Data            0            1
                   (150.0)      (100.0)       (50.0)
=====================================================
sepallength         5.8433        6.262        5.006
sepalwidth           3.054        2.872        3.418
petallength         3.7587        4.906        1.464
petalwidth          1.1987        1.676        0.244
class          Iris-setosa  Iris-versicolor  Iris-setosa
Time taken to build model (full training data) : 0.01 seconds
=== Model and evaluation on training set ===
Clustered Instances
0      100 ( 67%)
1       50 ( 33%)
The above information shows the result of the k-means clustering method using the WEKA tool.
After that we saved the result; it is saved in the ARFF file format, and this file can also be
opened in MS Excel.
Viva Questions:
Q1: When would you use k means cluster and when would you use hierarchical cluster?
Q2: What is the difference between Classification and Clustering?
Q3: What definition of similarity is used by the k-means clustering algorithm?
Q 4. Describe the optimal k-means optimization problem; is it NP-Hard?
Q5. How are the k means typically chosen in the k-means algorithm?
Q6. What is the computing time for the k-means algorithm?
Experiment 4
Aim: Study sample ARFF files in the database.
INSTRUCTIONS:
There are many data files that store the attribute details of a problem description. They store
data in one of the following formats:
1. CSV - comma-separated values
2. ARFF - attribute-relation file format
3. Excel - XLS
The following ARFF files are shown below. These files need to be evaluated and analysed
to get results on the basis of the data provided in them.
1) Airline
2) Breast-cancer
3) Contact-lenses
4) Cpu
5) Credit-g
1. Airline:
Monthly totals of international airline passengers (in thousands) for
1949-1960.
@relation airline_passengers
@attribute passenger_numbers numeric
@attribute Date date 'yyyy-MM-dd'
@data
112,1949-01-01
118,1949-02-01
132,1949-03-01
129,1949-04-01
121,1949-05-01
135,1949-06-01
148,1949-07-01
148,1949-08-01
432,1960-12-01
2. breast-cancer
This data set includes 201 instances of one class and 85 instances of another class. The
instances are described by 9 attributes, some of which are linear and some are nominal.
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
1. Class: no-recurrence-events, recurrence-events
2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
3. menopause: lt40, ge40, premeno.
4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44,
45-49, 50-54, 55-59.
5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26,
27-29, 30-32, 33-35, 36-39.
6. node-caps: yes, no.
7. deg-malig: 1, 2, 3.
8. breast: left, right.
9. breast-quad: left-up, left-low, right-up, right-low, central.
10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
Attribute 6 (node-caps): 8 instances with missing values
Class Distribution:
1. no-recurrence-events: 201 instances
2. recurrence-events: 85 instances
Num Instances: 286
Num Attributes: 10
Num Continuous: 0 (Int 0 / Real 0)
Num Discrete: 10
Missing values: 9 / 0.3%
     name            type  enum  ints  real  missing  distinct  (1)
  1  'age'           Enum   100     0     0    0 / 0     6 / 2    0
  2  'menopause'     Enum   100     0     0    0 / 0     3 / 1    0
  3  'tumor-size'    Enum   100     0     0    0 / 0    11 / 4    0
  4  'inv-nodes'     Enum   100     0     0    0 / 0     7 / 2    0
  5  'node-caps'     Enum    97     0     0    8 / 3     2 / 1    0
  6  'deg-malig'     Enum   100     0     0    0 / 0     3 / 1    0
  7  'breast'        Enum   100     0     0    0 / 0     2 / 1    0
  8  'breast-quad'   Enum   100     0     0    1 / 0     5 / 2    0
  9  'irradiat'      Enum   100     0     0    0 / 0     2 / 1    0
 10  'Class'         Enum   100     0     0    0 / 0     2 / 1    0
@relation breast-cancer
@attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
@attribute menopause {'lt40','ge40','premeno'}
@attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59'}
@attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-32','33-35','36-39'}
@attribute node-caps {'yes','no'}
@attribute deg-malig {'1','2','3'}
@attribute breast {'left','right'}
@attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
@attribute 'irradiat' {'yes','no'}
@attribute 'Class' {'no-recurrence-events','recurrence-events'}
@data
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-
3) Contact-lenses
1. Title: Database for fitting contact lenses
2. Sources:
(a) Cendrowska, J. "PRISM: An algorithm for inducing modular rules",
International Journal of Man-Machine Studies, 1987, 27, 349-370
(b) Donor: Benoit Julien (Julien@ce.cmu.edu)
(c) Date: 1 August 1990
3. Past Usage:
1. See above.
2. Witten, I. H. & MacDonald, B. A. (1988). Using concept
learning for knowledge acquisition. International Journal of
Man-Machine Studies, 27, (pp. 349-370).
Notes: This database is complete (all possible combinations of attribute-value pairs are
represented). Each instance is complete and correct. 9 rules cover the training set.
4. Relevant Information Paragraph:
The examples are complete and noise free.
The examples highly simplified the problem. The attributes do not
fully describe all the factors affecting the decision as to which type,
if any, to fit.
5. Number of Instances: 24
6. Number of Attributes: 4 (all nominal)
7. Attribute Information:
-- 3 Classes
1 : the patient should be fitted with hard contact lenses,
2 : the patient should be fitted with soft contact lenses,
3 : the patient should not be fitted with contact lenses.
1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
2. spectacle prescription: (1) myope, (2) hypermetrope
3. astigmatic: (1) no, (2) yes
4. tear production rate: (1) reduced, (2) normal
8. Number of Missing Attribute Values: 0
9. Class Distribution:
1. hard contact lenses: 4
2. soft contact lenses: 5
3. no contact lenses: 15
@relation contact-lenses
@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}
@data
24 instances
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none
4) CPU
The "vendor" attribute was deleted to make the data consistent with what we used in the
data mining book.
@relation 'cpu'
@attribute MYCT numeric
@attribute MMIN numeric
@attribute MMAX numeric
@attribute CACH numeric
@attribute CHMIN numeric
@attribute CHMAX numeric
@attribute class numeric
@data
125,256,6000,256,16,128,198
29,8000,32000,32,8,32,269
29,8000,32000,32,8,32,220
29,8000,32000,32,8,32,172
29,8000,16000,32,8,16,132
26,8000,32000,64,8,32,318
23,16000,32000,64,16,32,367
23,16000,32000,64,16,32,489
23,16000,64000,64,16,32,636
23,32000,64000,128,32,64,1144
5) Credit-g
Description of the German credit dataset.
1. Title: German Credit data
2. Source Information
3. Number of Instances: 1000
Two datasets are provided. The original dataset, in the form provided
by Prof. Hofmann, contains categorical/symbolic attributes and
is in the file "german.data".
For algorithms that need numerical attributes, Strathclyde University produced the file
"german.data-numeric". This file has been edited and several indicator variables added
to make it suitable for algorithms which cannot cope with categorical variables. Several
attributes that are ordered categorical (such as attribute 17) have been coded as integer.
This was the form used by StatLog.
6. Number of Attributes german: 20 (7 numerical, 13 categorical)
   Number of Attributes german.numer: 24 (24 numerical)
7. Attribute description for german
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/
all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/
other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown / no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/
life insurance
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no
Cost Matrix: This dataset requires use of a cost matrix (see below).
            Predicted
              1    2
Actual   1    0    1
         2    5    0
(1 = Good, 2 = Bad)
The rows represent the actual classification and the columns the predicted classification.
It is worse to class a customer as good when they are bad (5) than it is to class a
customer as bad when they are good (1).
Relabeled values in attribute checking_status
From: A11
To: '<0'
From: A12
To: '0<=X<200'
From: A13
To: '>=200'
From: A14
To: 'no checking'
@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration numeric
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed
previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic
appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount numeric
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known
savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment numeric
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male
mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since numeric
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age numeric
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits numeric
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self
emp/mgmt'}
@attribute num_dependents numeric
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good
'no checking',24,'existing paid',furniture/equipment,2835,'500<=X<1000','>=7',3,'male single',none,4,'life insurance',53,none,own,1,skilled,1,none,yes,good
'0<=X<200',36,'existing paid','used car',6948,'<100','1<=X<4',2,'male single',none,2,car,35,none,rent,1,'high qualif/self emp/mgmt',1,yes,yes,good
6) Breast Cancer (diagnostic)
In this problem we have 30 different columns and we have to predict the
stage of breast cancer: M (malignant) or B (benign).
Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g). concavity (severity of concave portions of the contour)
h). concave points (number of concave portions of the contour)
i). symmetry
j). fractal dimension ("coastline approximation" - 1)
Viva Questions:
Q1: Take the different situations and analyse problem in terms of different attributes?
Q2: What are the various types of data attributes used in problem description?
Q3: How much minimum no. of instances are required to satisfy problem description?
Experiment 5
Aim: Study major classifier algorithms and implement them.
INSTRUCTIONS:
(a) Naive Bayes algorithm
It is a classification technique based on Bayes’ Theorem with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature. For example, a fruit
may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these
features depend on each other or upon the existence of the other features, all of these properties
independently contribute to the probability that this fruit is an apple and that is why it is known
as ‘Naive’. Naive Bayes model is easy to build and particularly useful for very large data sets.
Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods. Bayes theorem provides a way of calculating posterior probability P(c|x)
from P(c), P(x) and P(x|c). Look at the equation below:

P(c|x) = P(x|c) P(c) / P(x)

Above,
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
Pros and cons of Naive Bayes
Pros:
• It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
• It performs well in the case of categorical input variables compared to numerical variable(s).
For numerical variables, a normal distribution is assumed (bell curve, which is a strong
assumption).
Cons:
• If a categorical variable has a category (in the test data set) which was not observed in the
training data set, then the model will assign a 0 (zero) probability and will be unable to make
a prediction. This is often known as "Zero Frequency". To solve this, we can use a smoothing
technique; one of the simplest smoothing techniques is Laplace estimation.
• On the other side, Naive Bayes is also known as a bad estimator, so the probability outputs
from predict_proba are not to be taken too seriously.
• Another limitation of Naive Bayes is the assumption of independent predictors. In real
life, it is almost impossible to get a set of predictors which are completely independent.
Applications of Naive Bayes Algorithms
• Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be
used for making predictions in real time.
• Multi-class prediction: This algorithm is also well known for its multi-class prediction
feature; we can predict the probability of multiple classes of the target variable.
• Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers, mostly used
in text classification (due to better results in multi-class problems and the independence rule),
have a higher success rate compared to other algorithms. As a result, they are widely used in
spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to
identify positive and negative customer sentiments).
• Recommendation systems: a Naive Bayes classifier and collaborative filtering together build a
recommendation system that uses machine learning and data mining techniques to filter unseen
information and predict whether a user would like a given resource or not.
A small scikit-learn illustration of Naive Bayes follows.
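As a sketch only (assuming scikit-learn is installed, and using its bundled Breast Cancer data rather than a WEKA ARFF file), the code below fits a Gaussian Naive Bayes classifier, the variant that handles numeric predictors under the bell-curve assumption mentioned in the cons.

# Gaussian Naive Bayes on scikit-learn's Breast Cancer data (illustrative sketch)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
nb = GaussianNB()
nb.fit(X_train, y_train)             # estimate P(c) and per-class feature distributions
print(nb.score(X_test, y_test))      # accuracy on held-out data
print(nb.predict_proba(X_test[:3]))  # posterior probabilities P(c|x) for a few cases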
(b) Decision Trees in Machine Learning
A tree has many analogies in real life, and it turns out that it has influenced a wide area of
machine learning, covering both classification and regression. In decision analysis, a decision
tree can be used to visually and explicitly represent decisions and decision making. As the
name goes, it uses a tree-like model of decisions. Though a commonly used tool in data mining
for deriving a strategy to reach a particular goal, it is also widely used in machine learning,
which will be the main focus of this section.
Why decision trees?
We have a couple of other algorithms, so why do we have to choose decision trees? There might
be many reasons, but a few key ones are:
1. Decision trees often mimic human-level thinking, so it is simple to understand the data
and make good interpretations.
2. Decision trees actually let you see the logic by which the data is interpreted (unlike
black-box algorithms such as SVM, NN, etc.).
How can an algorithm be represented as a tree?
For this, let's consider a very basic example that uses the Titanic data set for predicting whether
a passenger will survive or not. The model below uses 3 features/attributes/columns from the
data set, namely sex, age and sibsp (number of spouses or children aboard).
A decision tree is drawn upside down with its root at the top. In the image, the bold text in
black represents a condition/internal node, based on which the tree splits into branches/edges.
The end of a branch that doesn't split anymore is the decision/leaf; in this case, whether the
passenger died or survived, represented as red and green text respectively.
Although a real dataset will have a lot more features and this will just be a branch in a much
bigger tree, you can't ignore the simplicity of this algorithm. The feature importance is
clear and relations can be viewed easily. This methodology is more commonly known as
learning a decision tree from data, and the above tree is called a classification tree, as the target
is to classify a passenger as survived or died. Regression trees are represented in the same
manner, except that they predict continuous values like the price of a house. In general, decision
tree algorithms are referred to as CART or Classification and Regression Trees.
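The sketch below mirrors the Titanic-style classification tree described above, but with a handful of made-up rows rather than the real Titanic file, and using scikit-learn's CART implementation; the values and labels are illustrative assumptions only.

# Tiny classification tree on hypothetical Titanic-like rows: sex, age, sibsp -> survived (sketch)
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: sex (0 = female, 1 = male), age, sibsp -- made-up rows, not the real dataset
X = [[1, 22, 1], [0, 38, 1], [0, 26, 0], [1, 35, 0], [1, 4, 3], [0, 27, 0], [1, 54, 0], [0, 2, 1]]
y = [0, 1, 1, 0, 1, 1, 0, 0]  # 1 = survived, 0 = died (illustrative labels)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["sex", "age", "sibsp"]))  # view the learned splits
print(tree.predict([[0, 30, 0]]))  # predict for a new passenger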
(c) Classification and Regression Trees
Growing a tree involves deciding which features to choose and what conditions to use for
splitting, along with knowing when to stop.
Advantages of CART
• Simple to understand, interpret and visualize.
• Decision trees implicitly perform variable screening or feature selection.
• Can handle both numerical and categorical data; can also handle multi-output problems.
• Decision trees require relatively little effort from users for data preparation.
• Nonlinear relationships between parameters do not affect tree performance.
Disadvantages of CART
• Decision-tree learners can create over-complex trees that do not generalize the data well.
This is called overfitting.
• Decision trees can be unstable because small variations in the data might result in a
completely different tree being generated. This is called variance, which needs to be
lowered by methods like bagging and boosting.
• Greedy algorithms cannot guarantee to return the globally optimal decision tree. This can
be mitigated by training multiple trees, where the features and samples are randomly
sampled with replacement.
• Decision-tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the data set prior to fitting the decision tree.
An improvement over decision tree learning is made using the technique of boosting.
(d) Autoregressive Integrated Moving Average Model (ARIMA)
An ARIMA model is a class of statistical models for analyzing and forecasting time series data.
It explicitly caters to a suite of standard structures in time series data, and as such provides a
simple yet powerful method for making skillful time series forecasts.
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a
generalization of the simpler AutoRegressive Moving Average and adds the notion of
integration.
This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:
• AR: Autoregression. A model that uses the dependent relationship between an
observation and some number of lagged observations.
• I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation
from an observation at the previous time step) in order to make the time series stationary.
• MA: Moving Average. A model that uses the dependency between an observation and a
residual error from a moving average model applied to lagged observations.
Each of these components is explicitly specified in the model as a parameter. A standard
notation ARIMA(p,d,q) is used, where the parameters are substituted with integer values to
quickly indicate the specific ARIMA model being used.
The parameters of the ARIMA model are defined as follows:
• p: The number of lag observations included in the model, also called the lag order.
• d: The number of times that the raw observations are differenced, also called the degree
of differencing.
• q: The size of the moving average window, also called the order of the moving average.
A linear regression model is constructed including the specified number and type of terms, and
the data is prepared by a degree of differencing in order to make it stationary, i.e. to remove
trend and seasonal structures that negatively affect the regression model.
In summary, the steps of this process are as follows:
1. Model identification. Use plots and summary statistics to identify trends, seasonality
and autoregression elements to get an idea of the amount of differencing and the size of
the lag that will be required.
2. Parameter estimation. Use a fitting procedure to find the coefficients of the regression
model.
3. Model checking. Use plots and statistical tests of the residual errors to determine the
amount and type of temporal structure not captured by the model.
The process is repeated until either a desirable level of fit is achieved on the in-sample or
out-of-sample observations (e.g. training or test datasets). A sketch of this fit-and-forecast
workflow in Python is shown below.
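A hedged Python sketch of the fit-and-forecast workflow, assuming the statsmodels package is installed; the series below simply reuses the first airline-passenger values shown in Experiment 4 and is for illustration only (a real run would use the full 1949-1960 series).

# ARIMA(p,d,q) fit-and-forecast sketch (assumes statsmodels is installed; series is illustrative)
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# monthly airline passenger counts (the first months of the series listed in Experiment 4)
series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148],
                   index=pd.date_range("1949-01-01", periods=8, freq="MS"))

model = ARIMA(series, order=(1, 1, 1))  # p=1 lag term, d=1 difference, q=1 moving-average term
fitted = model.fit()                    # parameter estimation
print(fitted.summary())                 # model checking: coefficients and residual diagnostics
print(fitted.forecast(steps=3))         # forecast the next three months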
(e) Linear Regression and Logistic Regression
It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on
continuous variable(s). Here, we establish relationship between independent and dependent
variables by fitting a best line. This best fit line is known as regression line and represented by a
linear equation Y= a *X + b.
Look at the below example. Here we have identified the best fit line having linear equation
y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a
person.
Linear Regression is of mainly two types: Simple Linear Regression and Multiple Linear
Regression.
Python Code
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric numpy arrays
x_train=input_variables_values_training_datasets
y_train=target_variables_values_training_datasets
x_test=input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted= linear.predict(x_test)
R Code
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted= predict(linear,x_test)
Logistic regression:
It is used to estimate discrete values ( Binary values like 0/1, yes/no, true/false ) based on given
set of independent variable(s). In simple words, it predicts the probability of occurrence of an
event by fitting data to a logit function. Hence, it is also known as logit regression. Since it
predicts a probability, its output values lie between 0 and 1 (as expected).
Python Code
#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted= model.predict(x_test)
R Code
x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x,family='binomial')
summary(logistic)
#Predict Output
predicted= predict(logistic,x_test)
(f) SVM (Support Vector Machine)
It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional
space (where n is number of features you have) with the value of each feature being the value of
a particular coordinate.
For example, if we only had two features like Height and Hair length of an individual, we’d first
plot these two variables in two dimensional space where each point has two co-ordinates (these
co-ordinates are known as Support Vectors)
we will find some line that splits the data between the two differently classified groups of data.
This will be the line such that the distances from the closest point in each of the two groups will
be farthest away.
In the example shown above, the line which splits the data into two differently classified groups
is the black line, since the two closest points are the farthest apart from the line. This line is our
classifier. Then, depending on where the testing data lands on either side of the line, that’s what
class we can classify the new data as.
Python Code
#Import Library
from sklearn import svm
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create SVM classification object
model = svm.SVC()  # there are various options associated with it; this is simple classification
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)
R Code
library(e1071)
x <- cbind(x_train,y_train)
# Fitting model
fit <-svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)
(g) KNN (k-Nearest Neighbors)
It can be used for both classification and regression problems. However, it is more widely used
in classification problems in industry. K nearest neighbors is a simple algorithm that stores all
available cases and classifies new cases by a majority vote of its k neighbors. The case is
assigned to the class most common amongst its K nearest neighbors, measured by a distance
function. These distance functions can be Euclidean, Manhattan, Minkowski and Hamming
distance. The first three are used for continuous variables and the fourth (Hamming) for
categorical variables. If K = 1, then the case is simply assigned to the class of its nearest
neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.
KNN can easily be mapped to our real lives. If you want to learn about a person of whom you
have no information, you might like to find out about his close friends and the circles he moves
in, and gain access to his/her information!
Things to consider before selecting KNN:
• KNN is computationally expensive.
• Variables should be normalized, else higher-range variables can bias it.
• Work more on the pre-processing stage before going for KNN, e.g. outlier and noise removal.
Python Code
#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create KNeighbors classifier object model
model = KNeighborsClassifier(n_neighbors=6)  # default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R Code
library(knn)
x <- cbind(x_train, y_train)
# Fitting model
fit <- knn(y_train ~ ., data = x, k=5)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
Experiment 6
Aim: Analysis of round trip Time of Flight measurement from a supermarket.
INSTRUCTIONS:
Dataset source http://crawdad.org/cmu/supermarket/20140527/
Tool used: WEKA
Here the true value of interest is taken as θ and the value estimated using some algorithm as θ^.
Correlation tells you how much θ and θ^ are related. It takes values between −1 and 1, where 0
is no relation, 1 is a very strong linear relation and −1 is an inverse linear relation (i.e. bigger
values of θ indicate smaller values of θ^, or vice versa).
Mean absolute error: MAE = (1/N) Σ_{i=1..N} |θ^i − θi|
Root mean squared error: RMSE = sqrt( (1/N) Σ_{i=1..N} (θ^i − θi)² )
Relative absolute error: RAE = Σ_{i=1..N} |θ^i − θi| / Σ_{i=1..N} |θ̄ − θi|, where θ̄ is the mean of the true values
Root relative squared error: RRSE = sqrt( Σ_{i=1..N} (θ^i − θi)² / Σ_{i=1..N} (θ̄ − θi)² )
Some terms used:
• TP Rate: rate of true positives (instances correctly classified as a given class).
• FP Rate: rate of false positives (instances falsely classified as a given class).
• Precision: proportion of instances that are truly of a class divided by the total instances
classified as that class.
• Recall: proportion of instances classified as a given class divided by the actual total in
that class (equivalent to TP rate).
• F-Measure: a combined measure for precision and recall, calculated as
2 * Precision * Recall / (Precision + Recall). A short script computing these measures is shown below.
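The following sketch checks these definitions, assuming scikit-learn is installed and using made-up actual/predicted labels (hypothetical values, not the supermarket results below):

# Computing the measures defined above on made-up labels (illustrative sketch)
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual classes (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # predicted classes (hypothetical)

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)         # recall equals the TP rate
print(p, r, f1_score(y_true, y_pred))    # F-measure = 2 * P * R / (P + R)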
Algorithms:
1. Naïve Bayes
=== Run information ===
Scheme:
weka.classifiers.bayes.NaiveBayes
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Naive Bayes Classifier
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        2948               63.713  %
Incorrectly Classified Instances      1679               36.287  %
Kappa statistic                          0
Mean absolute error                      0.4624
Root mean squared error                  0.4808
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances             4627
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                 1.000    1.000    0.637      1.000   0.778      0.000  0.499     0.637     low
                 0.000    0.000    0.000      0.000   0.000      0.000  0.499     0.363     high
Weighted Avg.    0.637    0.637    0.406      0.637   0.496      0.000  0.499     0.537
=== Confusion Matrix ===
a b <-- classified as
2948 0 | a = low
1679 0 | b = high
Mean absolute error: MAE = (1/N) Σ_{i=1..N} |θ^i − θi|
2. Decision Stump
=== Run information ===
Scheme:
weka.classifiers.trees.DecisionStump
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Decision Stump
Classifications
tissues-paper prd = t : high
tissues-paper prd != t : low
tissues-paper prd is missing : low
Class distributions
tissues-paper prd = t
low                   high
0.48553627058299953   0.5144637294170005
tissues-paper prd != t
low                   high
0.7802521008403361    0.21974789915966386
tissues-paper prd is missing
low                   high
0.7802521008403361    0.21974789915966386
Time taken to build model: 0.09 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        2980               64.4046 %
Incorrectly Classified Instances      1647               35.5954 %
Kappa statistic                          0.2813
Mean absolute error                      0.4212
Root mean squared error                  0.4603
Relative absolute error                 91.0797 %
Root relative squared error             95.7349 %
Total Number of Instances             4627
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC
ROC Area PRC
Area Class
0.627 0.325 0.772 0.627 0.692
0.290 0.642 0.732 low
0.675 0.373 0.507 0.675 0.579
0.290 0.642 0.466 high
Weighted Avg. 0.644 0.343 0.676
0.644 0.651
0.290 0.642 0.635
=== Confusion Matrix ===
a b <-- classified as
1847 1101 | a = low
546 1133 | b = high
3. Random Forest
=== Run information ===
Scheme:
weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0
-V 0.001 -S 1
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
RandomForest
Bagging with 100 iterations and base learner
weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities
Time taken to build model: 3 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        2948               63.713  %
Incorrectly Classified Instances      1679               36.287  %
Kappa statistic                          0
Mean absolute error                      0.4624
Root mean squared error                  0.4808
Relative absolute error                 99.9964 %
Root relative squared error            100      %
Total Number of Instances             4627
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                 1.000    1.000    0.637      1.000   0.778      0.000  0.500     0.637     low
                 0.000    0.000    0.000      0.000   0.000      0.000  0.500     0.363     high
Weighted Avg.    0.637    0.637    0.406      0.637   0.496      0.000  0.500     0.538
=== Confusion Matrix ===
a b <-- classified as
2948 0 | a = low
1679 0 | b = high
4. MLP (Multilayer Perceptron)
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Attrib department216
Class low
Input
Node 0
Class high
Input
Node 1
0.044512208908162154
Time taken to build model: 1166.18 seconds
5. K-Means Clustering
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Test mode: evaluate on training data
=== Clustering model (full training set) ===
Number of iterations: 2
Within cluster sum of squared errors: 0.0
Initial starting points (random):
Cluster 0:
t,t,t, ... ,t,high   (all 216 attribute values are t)

Cluster 1:
t,t,t, ... ,t,low    (all 216 attribute values are t)
Missing values globally replaced with mean/mode
Final cluster centroids:

                        Cluster#
Attribute          Full Data          0          1
                    (4627.0)   (1679.0)   (2948.0)
===================================================
department1                t          t          t
department2                t          t          t
department3                t          t          t
department4                t          t          t
department5                t          t          t
department6                t          t          t
department7                t          t          t
department8                t          t          t
department9                t          t          t
grocery misc               t          t          t
department11               t          t          t
baby needs                 t          t          t
bread and cake             t          t          t
baking needs               t          t          t
coupons                    t          t          t
Time taken to build model (full training data) : 0.22 seconds
=== Model and evaluation on training set ===
Clustered Instances

0      1679 ( 36%)
1      2948 ( 64%)
Result:
By performing classification with the five algorithms above, we observe that Naïve Bayes takes the least time to build its model and is also the most accurate, with the lowest mean absolute error.
Therefore, we conclude that, of the above algorithms, Naïve Bayes performs best.
Viva Questions:
Q1: What are the performance measures used?
Q2: What do you mean by a confusion matrix?
Q3: What is the best accuracy achieved, and by which algorithm?
Q4: How do you choose the attributes that affect the problem description?
Q5: What do you mean by TP rate and FP rate?
Q6: Define precision and recall.
Experiment 7
Aim: Implement supervised learning (KNN classification)
Instruction guidelines:
K Nearest Neighbors - Classification
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases
based on a similarity measure (e.g., distance functions). KNN has been used in statistical
estimation and pattern recognition as a non-parametric technique since the early 1970s.
Algorithm: A case is classified by a majority vote of its neighbors, with the case being assigned
to the class most common amongst its K nearest neighbors measured by a distance function. If K
= 1, then the case is simply assigned to the class of its nearest neighbor.
It should also be noted that the standard distance measures (Euclidean, Manhattan and Minkowski) are only
valid for continuous variables. For categorical variables the Hamming distance must be used instead. This also
raises the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of
numerical and categorical variables in the dataset.
Choosing the optimal value for K is best done by first inspecting the data. In general, a larger K
value reduces the effect of noise, but there is no guarantee it improves accuracy. Cross-validation is
another way to determine a good K value retrospectively, by using an independent dataset to
validate each candidate K. Historically, the optimal K for most datasets has been between 3 and 10,
which usually produces much better results than 1-NN.
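As an illustration of this kind of search (an assumption for demonstration only, not part of the manual's WEKA workflow), the sketch below scores K = 1 to 10 with 10-fold cross-validation in scikit-learn on a placeholder dataset and keeps the best value.

# Illustrative sketch: choosing K for k-nearest neighbours by cross-validation.
import numpy as np
from sklearn.datasets import load_iris          # placeholder dataset, not the lab data
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, -np.inf
for k in range(1, 11):                          # the text suggests K between 3 and 10
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=10).mean()   # mean 10-fold CV accuracy
    if score > best_score:
        best_k, best_score = k, score

print("best K =", best_k, "with CV accuracy =", round(best_score, 3))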
Example:
Consider the following data concerning credit default. Age and Loan are two numerical variables (predictors)
and Default is the target.
We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean
distance. If K=1 then the nearest neighbor is the last case in the training set with Default=Y.
D = sqrt[(48 - 33)^2 + (142000 - 150000)^2] = 8000.01  >>  Default = Y
Fig-1: credit default training data (Age, Loan, Default) [table not reproduced]
With K=3, there are two Default=Y and one Default=N out of three closest neighbors.
The prediction for the unknown case is again Default=Y.
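A minimal sketch of the distance-and-vote mechanics follows. Only the (Age=33, Loan=$150,000, Default=Y) row is taken from the text above; the other training rows are hypothetical stand-ins for the omitted Fig-1 table.

# Sketch of the KNN distance computation and majority vote described above.
# Only the (33, 150000, Y) row comes from the text; the other rows are made up.
import numpy as np
from collections import Counter

train_X = np.array([[25, 40000], [45, 80000], [35, 120000], [33, 150000]], dtype=float)
train_y = np.array(["N", "N", "Y", "Y"])
query = np.array([48, 142000], dtype=float)      # the unknown case (Age=48, Loan=142000)

# Euclidean distance from the query to every training case
dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
print(dists.round(2))                            # distance to (33, 150000) is ~8000.01

k = 3
nearest = np.argsort(dists)[:k]                  # indices of the k closest cases
prediction = Counter(train_y[nearest]).most_common(1)[0][0]   # majority vote
print("K=%d prediction:" % k, prediction)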
Sample Input:
Step 1: Open WEKA GUI and Select Explorer
Step 2: Load the breast cancer data set (breast-cancer.arff), given with the Experiment 4 databases, using "Open
File"
Step 3: Choose the "Classify" tab, choose the decision tree classifier labelled J48 in the trees
folder, select "Cross-Validation" with 5 folds, and then press Start
Output:
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     breast-cancer
Instances:    286
Attributes:   10
              age
              menopause
              tumor-size
              inv-nodes
              node-caps
              deg-malig
              breast
              breast-quad
              irradiat
              Class
Test mode:    5-fold cross-validation
=== Classifier model (full training set) ===

J48 pruned tree
------------------

node-caps = yes
|   deg-malig = 1: recurrence-events (1.01/0.4)
|   deg-malig = 2: no-recurrence-events (26.2/8.0)
|   deg-malig = 3: recurrence-events (30.4/7.4)
node-caps = no: no-recurrence-events (228.39/53.4)

Number of Leaves  :  4

Size of the tree :   6
Time taken to build model: 0.02 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         212               74.1259 %
Incorrectly Classified Instances        74               25.8741 %
Kappa statistic                          0.2288
Mean absolute error                      0.3726
Root mean squared error                  0.4435
Relative absolute error                 89.0412 %
Root relative squared error             97.0395 %
Total Number of Instances              286

=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                 0.960    0.776    0.745      0.960   0.839      0.287  0.582     0.728     no-recurrence-events
                 0.224    0.040    0.704      0.224   0.339      0.287  0.582     0.444     recurrence-events
Weighted Avg.    0.741    0.558    0.733      0.741   0.691      0.287  0.582     0.643
=== Confusion Matrix ===

   a   b   <-- classified as
 193   8 |   a = no-recurrence-events
  66  19 |   b = recurrence-events
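For comparison, a hedged scikit-learn sketch of a k-nearest-neighbours classifier evaluated with 5-fold cross-validation is given below. It uses scikit-learn's built-in numeric Wisconsin breast cancer data, which is not the same file as breast-cancer.arff, so the numbers will differ from the WEKA output above.

# Rough scikit-learn counterpart to the KNN aim of this experiment.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# scale the features so that no attribute dominates the Euclidean distance
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(knn, X, y, cv=5)        # 5-fold cross-validation
print("fold accuracies:", scores.round(3))
print("mean accuracy  :", round(scores.mean(), 3))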
Viva Questions:
Q1: How is the KNN algorithm different from K-means clustering?
Q2: How is the KNN algorithm used in radar target specification?
Q3: Does the KNN algorithm have a centroid, as k-means does?
Q4: Is there a way to obtain a centroid for the data classified by KNN?
Q5: What is the difference between supervised and unsupervised learning?
Experiment 8
Aim: Understanding of R and its basics
R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development Core
Team.
The core of R is an interpreted computer language which allows branching and looping as well
as modular programming using functions. R allows integration with the procedures written in
the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions
are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft licence, and an official part of the GNU
project, called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of
the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.
• A large group of individuals has contributed to R by sending code and bug reports.
• Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.
Features of R
As stated earlier, R is a programming language and software environment for statistical
analysis, graphics representation and reporting. The following are the important features of R:
• R is a well-developed, simple and effective programming language which includes
  conditionals, loops, user-defined recursive functions and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large, coherent and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either on screen or in printed form.
In conclusion, R is among the world's most widely used statistical programming languages. It is a leading
choice of data scientists and is supported by a vibrant and talented community of contributors. R
is taught in universities and deployed in mission-critical business applications.
Experiment 9
Aim: Using R studio to implement linear model for regression.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets


def estimate_coef(x, y):
    # number of observations
    n = np.size(x)
    print(n)
    # means of x and y
    m_x, m_y = np.mean(x), np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients: slope b_1 and intercept b_0
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)


def plot_regression_line(x, y, b):
    # plot the observed points and the fitted line
    plt.scatter(x, y, color="m", marker="o", s=30)
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred, color="g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()


# load the Boston housing data (available in older scikit-learn versions;
# removed in scikit-learn 1.2) and regress the target on column 5 (average rooms)
boston = datasets.load_boston(return_X_y=False)
X = boston.data[:, 5]
y = boston.target
print(X.shape)
print(y.shape)

b = estimate_coef(X, y)
print("Estimated coefficients:\nb_0 = {} \
\nb_1 = {}".format(b[0], b[1]))
plot_regression_line(X, y, b)
OUTPUT
Viva Questions:
Q1: Compare the R and Python programming languages for predictive modelling.
Q2: Explain data import in the R language.
Q3: How are missing values and impossible values represented in the R language?
Q4: R language has several packages for solving a particular problem. How do you make a
decision on which one is the best to use?
Q5: How many data structures does R language have?
Q6: What is the process to create a table in R language without using external files?
Experiment 10
Aim: Using python(Anaconda or Pycharm) software, implement linear model for
regression or Neural Network.
Program :
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# load the stock-price CSV used for this experiment
testdata = pd.read_csv('c://1//2.csv', sep=',')
print("dataset length :", len(testdata))
print("dataset shape :", testdata.shape)

# the first six columns are the predictors, the seventh column is the target
X = testdata.values[:, 0:6]
Y = testdata.values[:, 6]
print(X)
print(Y)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
clf = LinearRegression()
clf.fit(X_train, y_train)
pre = clf.predict(X_test)

# the coefficients
print('Coefficients: ', clf.coef_)
# the mean squared error and variance score (left commented out in the original run;
# note the metrics compare the true targets with the predictions, not X_test with y_test)
# print("Mean squared error: %.2f" % mean_squared_error(y_test, pre))
# print('Variance score: %.2f' % r2_score(y_test, pre))
OUTPUT
"C:\Users\Anureet\PycharmProjects\untitled\venv\Scripts\python.exe"
"C:/Users/Anureet/PycharmProjects/untitled/venv/Stock price prediction.py"
dataset length : 1258
dataset shape : (1258, 7)
[[1.100e+01 2.000e+00 2.013e+03 1.489e+01 1.501e+01 1.426e+01]
[1.200e+01 2.000e+00 2.013e+03 1.445e+01 1.451e+01 1.410e+01]
[1.300e+01 2.000e+00 2.013e+03 1.430e+01 1.494e+01 1.425e+01]
...
[5.000e+00 2.000e+00 2.018e+03 5.199e+01 5.239e+01 4.975e+01]
[6.000e+00 2.000e+00 2.018e+03 4.932e+01 5.150e+01 4.879e+01]
[7.000e+00 2.000e+00 2.018e+03 5.091e+01 5.198e+01 5.089e+01]]
[14.46 14.27 14.66 ... 49.76 51.18 51.4 ]
('Coefficients: ', array([-8.08582422e-04, -1.01285688e-02, -3.88743101e-04, 5.56289130e-01,
8.39542627e-01, 7.15213953e-01]))
Process finished with exit code 0
Viva Questions:
1) How can you build a simple logistic regression model in Python?
2) How can you train and interpret a linear regression model in scikit-learn?
3) Name a few libraries in Python used for data analysis and scientific computations.
4) Which library would you prefer for plotting in Python: Seaborn or Matplotlib?
5) Which Random Forest parameters can be tuned to enhance the predictive power of the model?
6) Which method in pandas.tools.plotting is used to create a scatter plot matrix?
Experiment 11
AIM - Understanding of the RMS Titanic dataset, to predict survival by training a
model and predicting the required solution.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April
15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing
1502 out of 2224 passengers and crew. This sensational tragedy shocked the international
community and led to better safety regulations for ships. One of the reasons that the shipwreck
led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of
people were more likely to survive than others, such as women, children, and the upper class.
• Survived (Target Variable) - Binary categorical variable where 0 represents not survived and 1
  represents survived.
• Pclass - Categorical variable. It is the passenger class.
• Sex - Binary variable representing the gender of the passenger.
• Age - Feature-engineered variable. It is divided into 4 classes.
• Fare - Feature-engineered variable. It is divided into 4 classes.
• Embarked - Categorical variable. It gives the port of embarkation.
• Title - New feature created from the names. The titles are classified into 4 different classes.
• isAlone - Binary variable. It tells whether the passenger is travelling alone or not.
• Age*Class - Feature-engineered variable. (A pandas sketch of this feature engineering follows below.)
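A hedged pandas sketch of this feature engineering is given below. It assumes the usual Kaggle Titanic column names (Name, Sex, Age, Fare, SibSp, Parch, Pclass); the bin edges and title grouping are illustrative choices, not the exact recipe used for this dataset.

# Sketch of the engineered features described above, under the assumed column names.
import pandas as pd

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Title: extract the honorific from the name and collapse rare ones
    out["Title"] = out["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    out["Title"] = out["Title"].replace(["Mlle", "Ms"], "Miss").replace("Mme", "Mrs")
    common = {"Mr", "Mrs", "Miss", "Master"}
    out.loc[~out["Title"].isin(common), "Title"] = "Rare"
    # isAlone: 1 when the passenger has no siblings/spouse/parents/children aboard
    out["isAlone"] = ((out["SibSp"] + out["Parch"]) == 0).astype(int)
    # Age and Fare: discretise into 4 ordinal bands (illustrative binning)
    out["AgeBand"] = pd.qcut(out["Age"].fillna(out["Age"].median()), 4,
                             labels=False, duplicates="drop")
    out["FareBand"] = pd.qcut(out["Fare"].fillna(out["Fare"].median()), 4,
                              labels=False, duplicates="drop")
    # Age*Class interaction feature
    out["Age*Class"] = out["AgeBand"] * out["Pclass"]
    return out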
Model, predict and solve
Now we are ready to train a model and predict the required solution. There are 60+ predictive
modelling algorithms to choose from. We must understand the type of problem and solution
requirement to narrow down to a select few models which we can evaluate. Our problem is a
classification and regression problem. We want to identify the relationship between the output (Survived
or not) and the other variables or features (Gender, Age, Port, ...). We are also performing a category
of machine learning called supervised learning, since we are training our model with a
given dataset. With these two criteria, supervised learning plus classification and regression,
we can narrow down our choice of models to a few. These include:
• Logistic Regression
• KNN or k-Nearest Neighbours
• Support Vector Machines
• Naive Bayes classifier
• Decision Tree
Size of the training and testing dataset: [table omitted]
1. Logistic Regression is a useful model to run early in the workflow. Logistic regression
measures the relationship between the categorical dependent variable (feature) and one or more
independent variables (features) by estimating probabilities using a logistic function, which is the
cumulative logistic distribution.
Note the confidence score generated by the model based on our training dataset.
2. In pattern recognition, the k-Nearest Neighbours algorithm (or k-NN for short) is a nonparametric method used for classification and regression. A sample is classified by a majority
vote of its neighbours, with the sample being assigned to the class most common among its k
nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply
assigned to the class of that single nearest neighbour.
The KNN confidence score is better than that of Logistic Regression but worse than that of SVM.
3. Next we model using Support Vector Machines which are supervised learning models with
associated learning algorithms that analyze data used for classification and regression analysis.
Given a set of training samples, each marked as belonging to one or the other of two categories,
an SVM training algorithm builds a model that assigns new test samples to one category or the
other, making it a non-probabilistic binary linear classifier.
Note that the model generates a confidence score which is higher than that of the Logistic Regression
model.
4. In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers
based on applying Bayes' theorem with strong (naive) independence assumptions between the
features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in
the number of variables (features) in a learning problem.
The model generated confidence score is the lowest among the models evaluated so far.
5. This model uses a decision tree as a predictive model which maps features (tree branches) to
conclusions about the target value (tree leaves). Tree models where the target variable can take a
finite set of values are called classification trees; in these tree structures, leaves represent class
labels and branches represent conjunctions of features that lead to those class labels. Decision
trees where the target variable can take continuous values (typically real numbers) are called
regression trees. The model confidence score is the highest among models evaluated so far.
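As a rough illustration of comparing these five models side by side, the sketch below scores each of them with 5-fold cross-validation in scikit-learn. The synthetic dataset is only a placeholder for the engineered Titanic features, so the resulting ranking will not reproduce the confidence scores discussed above.

# Illustrative comparison of the five models discussed above.
from sklearn.datasets import make_classification   # placeholder for the Titanic data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=891, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()     # mean 5-fold CV accuracy
    print(f"{name:20s} {acc:.3f}")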
Experiment 12
AIM - Understanding of Indian education in rural villages, to predict whether a girl
child will be sent to school or not.
The data is focused on rural India. It primarily looks into whether the villagers
are willing to send their girl children to school, and if they are not sending their daughters to
school the reasons have also been recorded. The district is Gwalior. Various details of the
villagers, such as village, gender, age, education, occupation, category, caste, religion and land,
have also been collected.
1 Naïve Bayes Classifier
The algorithm was run with 10-fold cross-validation: this means it was given an opportunity to
make a prediction for each instance of the dataset (with different training folds) and the presented
result is a summary of those predictions. Firstly, I noted the Classification Accuracy. The model
achieved a result of 109/200 correct or 54.5%.
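For reference, a minimal scikit-learn sketch of this evaluation style is shown below. The synthetic data stands in for the encoded survey attributes, and GaussianNB is just one Naive Bayes variant, so this is an illustration rather than a reproduction of the WEKA run.

# Sketch of 10-fold cross-validation: one out-of-fold prediction per instance,
# then an accuracy and a confusion matrix summarising those predictions.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=10, n_classes=3,
                           n_informative=5, random_state=1)

pred = cross_val_predict(GaussianNB(), X, y, cv=10)   # one prediction per instance
print("accuracy:", accuracy_score(y, pred))
print(confusion_matrix(y, pred))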
=== Confusion Matrix ===

   a  b  c  d  e  f  g  h  i  j  k  l  m   <-- classified as
   0  0  1  1  0  1  0  2  0  0  0  0  0 |  a = Govt.
   2  1  1  1  8  0  0  0  0  1  0  0  0 |  b = Driver
   2  0 17  2  9  0  0  2  0  0  0  0  0 |  c = Farmer
   0  0  4  3  2  0  1  0  0  1  0  0  0 |  d = Shopkeeper
   [rows e-m garbled in extraction; remaining classes: e = labour, f = Security Guard,
    g = Raj Mistri, h = Fishing, i = Labour & Driver, j = Homemaker,
    k = Govt School Teacher, l = Dhobi, m = goats]
The confusion matrix shows where the algorithm goes wrong: 1, 1, 1 and 2 Government
officials were misclassified as Farmer, Shopkeeper, Security Guard and Fishing respectively;
2, 1, 1, 8 and 1 Drivers were misclassified as Govt., Farmer, Shopkeeper, labour and
Homemaker; and so on. This table helps to explain the accuracy achieved by the algorithm.
Now that we have a model,
we need to load the test data created earlier. For this, select "Supplied test set" and click the
"Set" button. Click "More Options" and, in the new window, choose PlainText for "Output
predictions". Then click the recently created model in the result list and select "Re-evaluate model on current test set".
After re-evaluation,
the Classification Accuracy is 151/200 correct, or 75.5%.
TP = true positives: number of examples predicted positive that are actually positive
FP = false positives: number of examples predicted positive that are actually negative
TN = true negatives: number of examples predicted negative that are actually negative
FN = false negatives: number of examples predicted negative that are actually positive
Recall (the TP rate, also referred to as sensitivity) asks: what fraction of those that are actually
positive were predicted positive? Recall = TP / actual positives.
Precision (also referred to as positive predictive value, PPV) asks: what fraction of those
predicted positive are actually positive? Precision = TP / predicted positives.
Other related measures used in classification include the True Negative Rate and Accuracy. The True
Negative Rate is also called Specificity (TN / actual negatives); 1 - specificity is the x-axis of the ROC
curve and is the same as the FP rate (FP / actual negatives).
F-measure: a measure that combines precision and recall is their harmonic mean, the traditional
F-measure or balanced F-score: F = 2 * (precision * recall) / (precision + recall).
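A small worked sketch of these formulas, using made-up counts rather than the counts from the output above:

# Hypothetical counts for one class (not taken from the lab output)
TP, FP, TN, FN = 151, 20, 20, 9

recall      = TP / (TP + FN)             # TP / actual positives (sensitivity, TP rate)
precision   = TP / (TP + FP)             # TP / predicted positives (PPV)
specificity = TN / (TN + FP)             # true negative rate
fp_rate     = FP / (FP + TN)             # 1 - specificity, x-axis of the ROC curve
f_measure   = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} "
      f"FP rate={fp_rate:.3f} F-measure={f_measure:.3f}")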
Mean absolute error (MAE)
The MAE measures the average magnitude of the errors in a set of forecasts, without considering
their direction. It measures accuracy for continuous variables. Expressed in words, the MAE is the
average, over the verification sample, of the absolute values of the differences between each forecast
and the corresponding observation: MAE = (1/n) * sum |forecast_i - observed_i|. The MAE
is a linear score, which means that all the individual differences are weighted equally in the
average.
Root mean squared error (RMSE)
The RMSE is a quadratic scoring rule which measures the average magnitude of the error.
Expressed in words, the differences between the forecasts and the corresponding observed values are
each squared and then averaged over the sample, and finally the square root of that average is taken:
RMSE = sqrt((1/n) * sum (forecast_i - observed_i)^2). Since the errors are
squared before they are averaged, the RMSE gives a relatively high weight to large errors. This
means the RMSE is most useful when large errors are particularly undesirable.
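A short numeric sketch of both measures, with made-up forecasts and observations:

# MAE and RMSE for a small hypothetical verification sample
import numpy as np

forecast = np.array([2.5, 0.0, 2.1, 7.8])
observed = np.array([3.0, -0.5, 2.0, 7.0])

errors = forecast - observed
mae  = np.mean(np.abs(errors))            # every error weighted equally
rmse = np.sqrt(np.mean(errors ** 2))      # squaring penalises large errors more

print("MAE  =", round(mae, 3))
print("RMSE =", round(rmse, 3))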
2 Support Vector Machine
The model achieved a result of 181/200 correct or 92.3469%.
We have classified the dataset on the basis of the reasons why the villagers are unwilling to send
girl children to school in the Gwalior villages. The different classes are NA, Poverty, Marriage,
Distance, X, Unsafe Public Space, Transport Facilities, and Household Responsibilities.
The weighted average true positive rate is 0.923, i.e., nearly all of the actually positive instances
are predicted as positive. The weighted average false positive rate is 0.205, i.e., relatively few instances
are predicted as positive while actually being negative. The precision is 0.902, i.e., the
algorithm is fairly accurate.
=== Confusion Matrix ===

    a    b    c    d    e    f    g    h   <-- classified as
  147    0    1    0    0    0    0    0 |  a = NA
    4   12    0    0    0    0    0    0 |  b = Poverty
    5    0    3    0    0    0    0    0 |  c = Marriage
    0    0    1    3    0    0    0    0 |  d = Distance
    0    0    0    0    8    0    0    0 |  e = X
    4    0    0    0    0    0    0    0 |  f = Unsafe Public Space
    0    0    0    0    0    0    4    0 |  g = Transport Facilities
    0    0    0    0    0    0    0    4 |  h = Household Responsibilities
The confusion matrix shows that for the majority of respondents no reason was recorded (NA), and that among
the reasons which were available, people most often did not send their daughters to school because of poverty,
while very few of them considered distance a major factor for not sending their girl children to
school.
3 Random Forest
The accuracy of this algorithm is 100%, i.e., 200/200 instances have been correctly classified.
=== Confusion Matrix ===

   a  b  c  d  e  f  g  h  i  j  k  l  m   <-- classified as
   5  0  0  0  0  0  0  0  0  0  0  0  0 |  a = Govt.
   0 14  0  0  0  0  0  0  0  0  0  0  0 |  b = Driver
   0  0 32  0  0  0  0  0  0  0  0  0  0 |  c = Farmer
   0  0  0 11  0  0  0  0  0  0  0  0  0 |  d = Shopkeeper
   0  0  0  0 97  0  0  0  0  0  0  0  0 |  e = labour
   0  0  0  0  0  4  0  0  0  0  0  0  0 |  f = Security Guard
   0  0  0  0  0  0  4  0  0  0  0  0  0 |  g = Raj Mistri
   0  0  0  0  0  0  0 11  0  0  0  0  0 |  h = Fishing
   0  0  0  0  0  0  0  0  4  0  0  0  0 |  i = Labour & Driver
   0  0  0  0  0  0  0  0  0  5  0  0  0 |  j = Homemaker
   0  0  0  0  0  0  0  0  0  0  4  0  0 |  k = Govt School Teacher
   0  0  0  0  0  0  0  0  0  0  0  5  0 |  l = Dhobi
   0  0  0  0  0  0  0  0  0  0  0  0  4 |  m = goats
There is no observation which has been misclassified. The maximum number of villagers are labourers.
4 Random Tree
The classification accuracy is 76.0204%, that is, 149/200 have been classified correctly.
The false positive rate is 0.352, the highest of all four algorithms applied above: here
35.2% of the values which should have been classified as negative have been assigned a positive value.
=== Confusion Matrix ===

[The 8x8 confusion matrix is garbled in extraction. Classes: a = NA, b = Poverty, c = Marriage,
 d = Distance, e = X, f = Unsafe Public Space, g = Transport Facilities, h = Household
 Responsibilities. The per-class misclassification counts are summarised below.]
22 NA, 8 Poverty, 5 Marriage, 1 Distance, 2 X, 4 Unsafe Public Space, 4 Transport Facilities
and 1 Household Responsibilities class values have been misclassified.
The best of the above algorithms is Random Forest, with a 100% accuracy rate, and
the worst is the Naïve Bayes algorithm, with a 75.5% accuracy rate.
Experiment 13
AIM - Understanding of a dataset of contact patterns among students collected at the
National University of Singapore.
This dataset was collected from contact patterns among students during the
spring semester of 2006 at the National University of Singapore.
Using the RemovePercentage filter, the instances have been reduced to 500.
This data has been saved as the training dataset and then used for further classification.
ALGORITHM 1: GaussianProcesses
=== Run information ===

Scheme:       weka.classifiers.functions.SimpleLinearRegression
Relation:     MOCK_DATA (1)-weka.filters.unsupervised.instance.RemovePercentage-P50.0
Instances:    500
Attributes:   4
              Start Time
              Session Id
              Student Id
              Duration
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Linear regression on Session Id

0.03 * Session Id + 10.38

Time taken to build model: 0 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0 seconds

=== Summary ===

Correlation coefficient                  0.0677
Mean absolute error                      4.9869
Root mean squared error                  5.7893
Relative absolute error                 99.7326 %
Root relative squared error             99.7708 %
Total Number of Instances              500

ALGORITHM 2: Linear Regression
Linear Regression Model

Start Time =
      0.0274 * Session Id +
     10.3846

Time taken to build model: 0.01 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0 seconds

=== Summary ===

Correlation coefficient                  0.0677
Mean absolute error                      4.9869
Root mean squared error                  5.7893
Relative absolute error                 99.7326 %
Root relative squared error             99.7708 %
Total Number of Instances              500

Algorithm 3: Decision Table
Algorithm 6: DecisionTable

Merit of best subset found: 5.814
Evaluation (for feature selection): CV (leave one out)
Feature set: 1

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0 seconds

=== Summary ===

Correlation coefficient                  0
Mean absolute error                      5.0003
Root mean squared error                  5.8026
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances              500
CONCLUSION:
Six algorithms have been used to determine the best model. Depending on the attributes used,
the performance of the algorithms can be compared via the mean absolute error and the correlation
coefficient.
Based on the results above, the worst correlation has been found for DecisionTable and the best
correlation has been found for Decision Stump. A small sketch of this kind of comparison is given below,
followed by the detailed run information for the DecisionTable model.
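The sketch below shows how such a comparison could be reproduced outside WEKA: it fits Start Time on Session Id by least squares and reports the correlation coefficient and mean absolute error. The data values are randomly generated placeholders, not the MOCK_DATA file used above.

# Hedged sketch: simple linear fit and the two comparison measures on placeholder data
import numpy as np

rng = np.random.default_rng(0)
session_id = rng.integers(1, 100, size=500).astype(float)    # placeholder Session Id values
start_time = 0.03 * session_id + 10.4 + rng.normal(0, 5, size=500)   # synthetic Start Time

# least-squares fit of Start Time = slope * Session Id + intercept
slope, intercept = np.polyfit(session_id, start_time, 1)
pred = slope * session_id + intercept

corr = np.corrcoef(start_time, pred)[0, 1]      # correlation coefficient
mae = np.mean(np.abs(start_time - pred))        # mean absolute error

print(f"Start Time = {slope:.4f} * Session Id + {intercept:.4f}")
print(f"correlation coefficient = {corr:.4f}, MAE = {mae:.4f}")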
=== Run information ===

Scheme:       weka.classifiers.rules.DecisionTable -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"
Relation:     MOCK_DATA (1)-weka.filters.unsupervised.instance.RemovePercentage-P50.0
Instances:    500
Attributes:   4
              Start Time
              Session Id
              Student Id
              Duration
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Decision Table:

Number of training instances: 500
Number of Rules : 1
Non matches covered by Majority class.
        Best first.
        Start set: no attributes
        Search direction: forward
        Stale search after 5 node expansions
        Total number of subsets evaluated: 9