Machine Learning
ETCS-454
Maharaja Agrasen Institute of Technology, PSP Area, Sector – 22, Rohini, New Delhi – 110085
(Affiliated to Guru Gobind Singh Indraprastha University, New Delhi)

Vision of the Institute
To nurture young minds in a learning environment of high academic value and imbibe spiritual and ethical values with technological and management competence.

Mission of the Institute
The Institute shall endeavour to incorporate the following basic missions in the teaching methodology:

Engineering Hardware – Software Symbiosis
Practical exercises in all Engineering and Management disciplines shall be carried out on hardware equipment as well as the related software, enabling deeper understanding of basic concepts and encouraging an inquisitive nature.

Life-Long Learning
The Institute strives to match technological advancements and encourages students to keep updating their knowledge, enhancing their skills and inculcating the habit of continuous learning.

Liberalization and Globalization
The Institute endeavours to enhance the technical and management skills of students so that they become intellectually capable and competent professionals with industrial aptitude, able to face the challenges of globalization.

Diversification
The Engineering, Technology and Management disciplines have diverse fields of study with different attributes. The aim is to create a synergy of these attributes by encouraging analytical thinking.

Digitization of Learning Processes
The Institute provides seamless opportunities for innovative learning in all Engineering and Management disciplines through digitization of learning processes, using analysis, synthesis, simulation, graphics, tutorials and related tools to create a platform for a multi-disciplinary approach.

Entrepreneurship
The Institute strives to develop potential Engineers and Managers by enhancing their skills and research capabilities so that they become successful entrepreneurs and responsible citizens.

Vision of the CSE Department
To produce "Critical Thinkers of Innovative Technology".

Mission of the CSE Department
To provide an excellent learning environment across the computer science discipline that inculcates professional behaviour, strong ethical values, innovative research capabilities and leadership abilities, enabling students to become successful entrepreneurs in this globalized world.
1. To nurture an excellent learning environment that helps students enhance their problem-solving skills and prepares them to be lifelong learners, by offering a solid theoretical foundation with applied computing experiences and educating them about their professional and ethical responsibilities.
2. To establish Industry-Institute interaction, making students ready for the industrial environment and successful in their professional lives.
3. To promote research activities in the emerging areas of technology convergence.
4. To build engineers who can look into the technical aspects of an engineering solution, thereby setting the ground for producing successful entrepreneurs.

Introduction to Machine Learning Lab
Course Objective: The main objective of this course is to ensure that students understand how different machine learning techniques work in practice, by applying popular techniques such as the C4.5 algorithm, Naïve Bayes, decision trees, random forests and neural networks to statistical data.
Course Outcome: At the end of the course, a student will be able to:
C454.1: Study and implement supervised learners (Naïve Bayes) and unsupervised learners.
C454.2: Estimate the precision, recall, F-measure and accuracy of classifier methods on a dataset (breast cancer) using 5-fold cross-validation.
C454.3: Understand datasets of various problems and evaluate their attributes for training, testing and validation; estimate overfitting if it exists.
C454.4: Develop a machine learning method to predict stock prices based on past price variation, the education of infants, etc., or to predict how people would rate movies, books, etc.
C454.5: Select two programming languages to implement the same problem under different classifier algorithms, using different functions to implement the concept.
C454.6: Evaluate performance by measuring the sum of distances of each example from its class centre under different distance measures, and test performance as a function of the number of clusters.

LIST OF EXPERIMENTS (As prescribed by G.G.S.I.P.U)
MACHINE LEARNING LAB
Paper Code: ETCS-454
1. Introduction to the machine learning lab with tools (hands-on WEKA).
2. Understanding of machine learning algorithms; list of databases.
3. Implement the K-means algorithm using the WEKA tool. Implement the K-means algorithm and apply it to the data you selected. Evaluate performance by measuring the sum of Euclidean distances of each example from its class centre. Test the performance of the algorithm as a function of the parameter k.
4. Study of databases and understanding attribute evaluation in regard to the problem description.
5. Working of major classifiers: (a) Naïve Bayes (b) Decision Tree (c) CART (d) ARIMA (e) linear and logistic regression (f) Support Vector Machine (g) KNN (datasets can be the Breast Cancer data file or the Reuters data set).
6. Design a prediction model for analysis of round-trip Time-of-Flight measurements from a supermarket using random forest, Naïve Bayes, etc.
7. Implement supervised learning (KNN classification). Estimate the accuracy using 5-fold cross-validation, choosing appropriate options for missing values.
8. Introduction to R. Be aware of the basics of machine learning methods in R.
9. Build and develop a model in R for a particular classifier (random forest).
10. Develop a machine learning method using neural networks in Python to predict stock prices based on past price variation.

LIST OF EXPERIMENTS (Beyond the syllabus)
1. Case study of the RMS Titanic database to predict survival on the basis of decision trees, logistic regression, KNN (k-nearest neighbours) and support vector machines.
2. Understanding of Indian education in rural villages to predict whether a girl child will be sent to school or not.
3. Understanding of a dataset of contact patterns among students collected at the National University of Singapore.

MACHINE LEARNING ETCS 454
Faculty name: Mr. Sandeep Tayal
Student name: Aditya Bhardwaj
Roll No.: 01396402714
Semester: VIII
Maharaja Agrasen Institute of Technology, PSP Area, Sector – 22, Rohini, New Delhi – 110086

Index
Exp. no | Experiment Name | Date of performance | Date of checking | Marks | Signature

Experiment 1
Hands on WEKA (some steps) on the data set iris.arff
INSTRUCTIONS: Weka is a landmark system in the history of the data mining and machine learning research communities, because it is the only toolkit that has gained such widespread adoption and survived for an extended period of time.
The name "Weka" comes from the weka or woodhen (Gallirallus australis), an endemic bird of New Zealand. The toolkit provides many different algorithms for data mining and machine learning. Weka is open source and freely available, and it is also platform-independent.

The GUI Chooser consists of four buttons:
Explorer: An environment for exploring data with WEKA.
Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
Knowledge Flow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.

(a) Start Weka
Start Weka. This may involve finding it in the program launcher or double-clicking on the weka.jar file. This will start the Weka GUI Chooser. The Weka GUI Chooser lets you choose one of the Explorer, Experimenter, Knowledge Flow and the Simple CLI (command-line interface).
Weka GUI Chooser
Click the "Explorer" button to launch the Weka Explorer. This GUI lets you load datasets and run classification algorithms. It also provides other features, like data filtering, clustering, association rule extraction and visualization, but we won't be using these features right now.

(b) Open the data / dataset (sample iris.arff)
Dataset: A set of data items, the dataset, is a very basic concept of machine learning. A dataset is roughly equivalent to a two-dimensional spreadsheet or database table. In WEKA, it is implemented by the weka.core.Instances class. A dataset is a collection of examples, each one of class weka.core.Instance. Each Instance consists of a number of attributes, any of which can be nominal (one of a predefined list of values), numeric (a real or integer number) or a string (an arbitrarily long list of characters, enclosed in "double quotes"). Additional types are date and relational, which are not covered here but in the ARFF chapter. The external representation of an Instances class is an ARFF file, which consists of a header describing the attribute types and the data.
Click the "Open file…" button to open a data set and double-click on the "data" directory. Weka provides a number of small common machine learning datasets that you can use to practice on. Select the "iris.arff" file to load the Iris dataset.
Weka Explorer interface with the Iris dataset loaded. The Iris Flower dataset is a famous dataset from statistics and is heavily borrowed by researchers in machine learning. It contains 150 instances (rows), 4 attributes (columns) and a class attribute for the species of iris flower (one of setosa, versicolor and virginica). In our example, we have not mentioned the attribute type string, which defines "double quoted" string attributes for text mining. In recent WEKA versions, date/time attribute types are also supported. By default, the last attribute is considered the class/target variable, i.e. the attribute which should be predicted as a function of all other attributes. If this is not the case, specify the target variable via -c. The attribute numbers are one-based indices, i.e. -c 1 specifies the first attribute. Some basic statistics and validation of given ARFF files can be obtained via the main() routine of weka.core.Instances.

Classifier
Any learning algorithm in WEKA is derived from the abstract weka.classifiers.Classifier class.
Surprisingly little is needed for a basic classifier: a routine which generates a classifier model from a training dataset (buildClassifier) and another routine which evaluates the generated model on an unseen test dataset (classifyInstance), or generates a probability distribution for all classes (distributionForInstance). A classifier model is an arbitrarily complex mapping from all-but-one dataset attributes to the class attribute.

(c) Select and Run an Algorithm
Now that you have loaded a dataset, it's time to choose a machine learning algorithm to model the problem and make predictions. Click the "Classify" tab. This is the area for running algorithms against a loaded dataset in Weka. You will note that the "ZeroR" algorithm is selected by default. Click the "Start" button to run this algorithm.
Weka results for the ZeroR algorithm on the Iris flower dataset. The ZeroR algorithm selects the majority class in the dataset (all three species of iris are equally present in the data, so it picks the first one: setosa) and uses that to make all predictions. This is the baseline for the dataset and the measure by which all algorithms can be compared. The result is 33%, as expected (3 classes, each equally represented; assigning one of the three to each prediction results in 33% classification accuracy). You will also note that the test options select Cross-Validation by default with 10 folds. This means that the dataset is split into 10 parts: the first 9 are used to train the algorithm, and the 10th is used to assess the algorithm. This process is repeated, allowing each of the 10 parts of the split dataset a chance to be the held-out test set.
Click the "Choose" button in the "Classifier" section, click on "trees" and then click on the "J48" algorithm. This is an implementation of the C4.8 algorithm in Java ("J" for Java, 48 for C4.8, hence the J48 name) and is a minor extension to the famous C4.5 algorithm. Click the "Start" button to run the algorithm.
Weka J48 algorithm results on the Iris flower dataset.

(d) Review Results
After running the J48 algorithm, you can note the results in the "Classifier output" section. The algorithm was run with 10-fold cross-validation: this means it was given an opportunity to make a prediction for each instance of the dataset (with different training folds), and the presented result is a summary of those predictions.
Results of the J48 algorithm on the Iris flower dataset in Weka.
Firstly, note the classification accuracy. You can see that the model achieved a result of 144/150 correct, or 96%, which is a lot better than the baseline of 33%. Secondly, look at the confusion matrix. You can see a table of actual classes compared to predicted classes, and you can see that there was 1 error where an Iris-setosa was classified as an Iris-versicolor, 2 cases where Iris-virginica was classified as an Iris-versicolor, and 3 cases where an Iris-versicolor was classified as an Iris-setosa (a total of 6 errors). This table can help to explain the accuracy achieved by the algorithm.

Viva Questions:
Q1: Explain the main features of the WEKA toolkit.
Q2: What are the various tool menus for using algorithms on a problem data set?
Q3: Give examples of real-time applications where WEKA can be used.
Q4: What is the difference between the Experimenter and Knowledge Flow?
Q5: How can classifiers in WEKA be implemented using various sub-options?
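For reference, the same Experiment 1 workflow can also be sketched in code. The snippet below is a rough scikit-learn analogue (an assumption, since the lab itself uses the WEKA Explorer GUI): a majority-class baseline stands in for ZeroR and a CART-style decision tree stands in for J48, each evaluated with 10-fold cross-validation on the Iris data.

Python Code (illustrative sketch):
# Illustrative only: scikit-learn analogue of the WEKA Explorer workflow above.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier          # majority-class baseline, plays the role of ZeroR
from sklearn.tree import DecisionTreeClassifier    # stands in for the J48/C4.5 decision tree
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

baseline = DummyClassifier(strategy="most_frequent")
tree = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation, mirroring the default WEKA test options
print("Baseline accuracy:", cross_val_score(baseline, X, y, cv=10).mean())   # about 0.33
print("Decision tree accuracy:", cross_val_score(tree, X, y, cv=10).mean())  # well above the baseline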
Experiment 2
Aim: Understanding of machine learning algorithms
INSTRUCTIONS: Before applying data and algorithms to different problems, we first need to understand the machine learning algorithms themselves. In this section, several machine learning algorithms from the top 10 data mining algorithms are described and evaluated, and several meta-algorithms are presented to enhance the AUC results of the selected machine learning algorithms. The machine learning classifiers (here discussed in the context of Web Spam detection) are:

Support Vector Machine (SVM) - SVM discriminates a set of high-dimensional features using one or more hyperplanes that give the largest minimum distance separating all data points among classes.
Multilayer Perceptron Neural Network (MLP) - MLP is a non-linear feed-forward network model which maps a set of inputs x onto a set of outputs y using multiple weighted connections.
Bayesian Network (BN) - A BN is a probabilistic graphical model for reasoning under uncertainty, where the nodes represent discrete or continuous variables and the links represent the relationships between them.
C4.5 Decision Tree (DT) - A DT decides the target class of a new sample based on selected features from the available data using the concept of information entropy. The internal nodes of the tree are the attributes, each branch of the tree represents a possible decision, and the end nodes or leaves are the classes.
Random Forest (RF) - RF works by constructing multiple decision trees on various sub-samples of the dataset and outputs the class that appears most often (or the mean prediction) of the individual decision trees.
Naïve Bayes (NB) - The NB classifier is a classification algorithm based on Bayes' theorem with strong independence assumptions between features.
K-Nearest Neighbour (KNN) - KNN is an instance-based learning algorithm that stores all available data points and classifies new data points based on a similarity measure such as distance.

The machine learning ensemble meta-algorithms, on the other hand, are:

Boosting algorithms - Boosting works by combining a set of weak classifiers into a single strong classifier. The weak classifiers are weighted in some way based on the training data points or hypotheses to form a final strong classifier; thus there is a variety of boosting algorithms. Three boosting algorithms are introduced here:
Adaptive Boosting (AdaBoost) - The weights of incorrectly labelled data points are adjusted in AdaBoost such that the following classifiers focus more on incorrectly labelled or difficult cases.
LogitBoost - LogitBoost is an extension of AdaBoost which applies the logistic regression cost function to AdaBoost; thus it classifies by using a regression scheme as the base learner.
Real AdaBoost - Unlike most boosting algorithms, which return binary-valued classes (Discrete AdaBoost), Real AdaBoost outputs a real-valued probability of the class.
Bagging - Bagging is a method that generates several training sets of the same size, uses the same machine learning algorithm to build a model on each of them, and combines the predictions by averaging. It often improves the accuracy and stability of the classifier.
Dagging - Dagging generates a number of disjoint and stratified folds out of the data and feeds each chunk of data to a copy of the machine learning classifier. A majority vote is taken for predictions, since all the generated machine learning classifiers are put into the voted meta-classifier.
Dagging is useful for base classifiers that are quadratic or worse in time behaviour with respect to the number of instances in the training data.
Rotation Forest - A rotation forest is constructed from a number of copies of the same machine learning classifier (typically decision trees) trained independently on new sets of features, formed by sub-sampling the dataset and applying principal component analysis to each subset.

Viva Questions:
Q1: What are the different classifiers?
Q2: Recognize the situations where one particular classifier is preferred.
Q3: What are the different clustering techniques?
Q4: Name different association algorithms. How are they different from neural networks?

Experiment 3
Aim: Understand clustering approaches and implement the K-means algorithm using the WEKA tool
Fig. 1: Clustering techniques

K-means Clustering
The term "k-means" was first used by James MacQueen in 1967, though the idea goes back to 1957. The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it wasn't published until 1982. K-means is a widely used partitional clustering method in industry. The K-means algorithm is the most commonly used partitional clustering algorithm because it can be easily implemented and is the most efficient in terms of execution time. Here is how the algorithm works.

K-Means Algorithm: The algorithm for partitioning, where each cluster's centre is represented by the mean value of the objects in the cluster.
Input: k, the number of clusters; D, a data set containing n objects.
Output: A set of k clusters.
Method:
1. Arbitrarily choose k objects from D as the initial cluster centres.
2. Repeat:
3. (Re)assign each object to the cluster to which the object is most similar (using Eq. 1), based on the mean value of the objects in the cluster.
4. Update the cluster means, i.e. calculate the mean value of the objects for each cluster, until no change occurs.

Output:
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: iris
Instances: 150
Attributes: 5
  sepallength
  sepalwidth
  petallength
  petalwidth
  class
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 62.1436882815797
Initial starting points (random):
Cluster 0: 6.1,2.9,4.7,1.4,Iris-versicolor
Cluster 1: 6.2,2.9,4.3,1.3,Iris-versicolor
Missing values globally replaced with mean/mode
Final cluster centroids:
Attribute      Full Data (150.0)   Cluster 0 (100.0)   Cluster 1 (50.0)
sepallength    5.8433              6.262               5.006
sepalwidth     3.054               2.872               3.418
petallength    3.7587              4.906               1.464
petalwidth     1.1987              1.676               0.244
class          Iris-setosa         Iris-versicolor     Iris-setosa
Time taken to build model (full training data): 0.01 seconds
=== Model and evaluation on training set ===
Clustered Instances
0   100 (67%)
1    50 (33%)

The above information shows the result of k-means clustering using the WEKA tool. After that we saved the result; the result is stored in the ARFF file format. We also opened this file in MS Excel.

Viva Questions:
Q1: When would you use k-means clustering and when would you use hierarchical clustering?
Q2: What is the difference between classification and clustering?
Q3: What definition of similarity is used by the k-means clustering algorithm?
Q4: Describe the optimal k-means optimization problem; is it NP-hard?
Q5: How are the k means typically chosen in the k-means algorithm?
Q6: What is the computing time for the k-means algorithm?

Experiment 4
Aim: Study sample ARFF files in the database
INSTRUCTIONS: There are many data files that store the attribute details of a problem description; they store data in one of the following formats:
1. CSV – comma-separated values
2. ARFF
3. Excel (.xls)
The following ARFF files are listed below. These files need to be evaluated and analysed to get results on the basis of the data provided in them.
1) Airline 2) Breast-cancer 3) Contact-lenses 4) Cpu 5) Credit-g

1. Airline: Monthly totals of international airline passengers (in thousands) for 1949-1960.
@relation airline_passengers
@attribute passenger_numbers numeric
@attribute Date date 'yyyy-MM-dd'
@data
112,1949-01-01
118,1949-02-01
132,1949-03-01
129,1949-04-01
121,1949-05-01
135,1949-06-01
148,1949-07-01
148,1949-08-01
...
432,1960-12-01

2. Breast-cancer: This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some nominal.
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
1. Class: no-recurrence-events, recurrence-events
2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
3. menopause: lt40, ge40, premeno.
4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
6. node-caps: yes, no.
7. deg-malig: 1, 2, 3.
8. breast: left, right.
9. breast-quad: left-up, left-low, right-up, right-low, central.
10. irradiat: yes, no.
Missing Attribute Values (denoted by "?"): attribute 6 (node-caps) has 8 instances with missing values.
Class Distribution:
1. no-recurrence-events: 201 instances
2. recurrence-events: 85 instances

Num Instances: 286
Num Attributes: 10
Num Continuous: 0 (Int 0 / Real 0)
Num Discrete: 10
Missing values: 9 / 0.3%

   name           type  enum  ints  real  missing  distinct  (1)
 1 'age'          Enum   100     0     0   0 /  0    6 /  2    0
 2 'menopause'    Enum   100     0     0   0 /  0    3 /  1    0
 3 'tumor-size'   Enum   100     0     0   0 /  0   11 /  4    0
 4 'inv-nodes'    Enum   100     0     0   0 /  0    7 /  2    0
 5 'node-caps'    Enum    97     0     0   8 /  3    2 /  1    0
 6 'deg-malig'    Enum   100     0     0   0 /  0    3 /  1    0
 7 'breast'       Enum   100     0     0   0 /  0    2 /  1    0
 8 'breast-quad'  Enum   100     0     0   1 /  0    5 /  2    0
 9 'irradiat'     Enum   100     0     0   0 /  0    2 /  1    0
10 'Class'        Enum   100     0     0   0 /  0    2 /  1    0

@relation breast-cancer
@attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
@attribute menopause {'lt40','ge40','premeno'}
@attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59'}
@attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-32','33-35','36-39'}
@attribute node-caps {'yes','no'}
@attribute deg-malig {'1','2','3'}
@attribute breast {'left','right'}
@attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
@attribute 'irradiat' {'yes','no'}
@attribute 'Class' {'no-recurrence-events','recurrence-events'}
@data
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-

3) Contact-lenses
1. Title: Database for fitting contact lenses
2.
Sources: (a) Cendrowska, J. "PRISM: An algorithm for inducing modular rules", International Journal of Man-Machine Studies, 1987, 27, 349-370 (b) Donor: Benoit Julien (Julien@ce.cmu.edu) (c) Date: 1 August 1990 3. Past Usage: 1. See above. 2. Witten, I. H. & MacDonald, B. A. (1988). Using concept learning for knowledge acquisition. International Journal of Man-Machine Studies, 27, (pp. 349-370). Notes: This database is complete (all possible combinations of attribute-value pairs are represented). Each instance is complete and correct. 9 rules cover the training set. 4. Relevant Information Paragraph: The examples are complete and noise free. The examples highly simplified the problem. The attributes do not fully describe all the factors affecting the decision as to which type, if any, to fit. 5. Number of Instances: 24 6. Number of Attributes: 4 (all nominal) 7. Attribute Information: -- 3 Classes 1 : the patient should be fitted with hard contact lenses, 2 : the patient should be fitted with soft contact lenses, 1 : the patient should not be fitted with contact lenses. 1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic 2. spectacle prescription: (1) myope, (2) hypermetrope 3. astigmatic: (1) no, (2) yes 4. tear production rate: (1) reduced, (2) normal 8. Number of Missing Attribute Values: 0 9. Class Distribution: 1. hard contact lenses: 4 2. soft contact lenses: 5 3. no contact lenses: 15 @relation contact-lenses @attribute age {young, pre-presbyopic, presbyopic} @attribute spectacle-prescrip {myope, hypermetrope} @attribute astigmatism {no, yes} @attribute tear-prod-rate {reduced, normal} @attribute contact-lenses {soft, hard, none} @data 24 instances young,myope,no,reduced,none young,myope,no,normal,soft young,myope,yes,reduced,none young,myope,yes,normal,hard young,hypermetrope,no,reduced,none young,hypermetrope,no,normal,soft young,hypermetrope,yes,reduced,none young,hypermetrope,yes,normal,hard pre-presbyopic,myope,no,reduced,none pre-presbyopic,myope,no,normal,soft pre-presbyopic,myope,yes,reduced,none pre-presbyopic,myope,yes,normal,hard pre-presbyopic,hypermetrope,no,reduced,none pre-presbyopic,hypermetrope,no,normal,soft pre-presbyopic,hypermetrope,yes,reduced,none pre-presbyopic,hypermetrope,yes,normal,none presbyopic,myope,no,reduced,none presbyopic,myope,no,normal,none presbyopic,myope,yes,reduced,none presbyopic,myope,yes,normal,hard presbyopic,hypermetrope,no,reduced,none presbyopic,hypermetrope,no,normal,soft presbyopic,hypermetrope,yes,reduced,none presbyopic,hypermetrope,yes,normal,none 4) CPU Deleted "vendor" attribute to make data consistent with with what we used in the data mining book. @relation 'cpu' @attribute MYCT numeric @attribute MMIN numeric @attribute MMAX numeric @attribute CACH numeric @attribute CHMIN numeric @attribute CHMAX numeric @attribute class numeric @data 125,256,6000,256,16,128,198 29,8000,32000,32,8,32,269 29,8000,32000,32,8,32,220 29,8000,32000,32,8,32,172 29,8000,16000,32,8,16,132 26,8000,32000,64,8,32,318 23,16000,32000,64,16,32,367 23,16000,32000,64,16,32,489 23,16000,64000,64,16,32,636 23,32000,64000,128,32,64,1144 5) Credit-g Description of the German credit dataset. 1. Title: German Credit data 2. Source Information 3. Number of Instances: 1000 Two datasets are provided. the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data". For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric". 
This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integers. This was the form used by StatLog.
6. Number of Attributes german: 20 (7 numerical, 13 categorical)
   Number of Attributes german.numer: 24 (24 numerical)
7. Attribute description for german
Attribute 1: (qualitative) Status of existing checking account
  A11: ... < 0 DM
  A12: 0 <= ... < 200 DM
  A13: ... >= 200 DM / salary assignments for at least 1 year
  A14: no checking account
Attribute 2: (numerical) Duration in months
Attribute 3: (qualitative) Credit history
  A30: no credits taken / all credits paid back duly
  A31: all credits at this bank paid back duly
  A32: existing credits paid back duly till now
  A33: delay in paying off in the past
  A34: critical account / other credits existing (not at this bank)
Attribute 4: (qualitative) Purpose
  A40: car (new)
  A41: car (used)
  A42: furniture/equipment
  A43: radio/television
Attribute 5: (numerical) Credit amount
Attribute 6: (qualitative) Savings account/bonds
  A61: ... < 100 DM
  A62: 100 <= ... < 500 DM
  A63: 500 <= ... < 1000 DM
  A64: ... >= 1000 DM
  A65: unknown / no savings account
Attribute 7: (qualitative) Present employment since
  A71: unemployed
  A72: ... < 1 year
  A73: 1 <= ... < 4 years
  A74: 4 <= ... < 7 years
  A75: ... >= 7 years
Attribute 8: (numerical) Installment rate in percentage of disposable income
Attribute 9: (qualitative) Personal status and sex
  A91: male: divorced/separated
  A92: female: divorced/separated/married
  A93: male: single
  A94: male: married/widowed
  A95: female: single
Attribute 10: (qualitative) Other debtors / guarantors
  A101: none
  A102: co-applicant
  A103: guarantor
Attribute 11: (numerical) Present residence since
Attribute 12: (qualitative) Property
  A121: real estate
  A122: if not A121: building society savings agreement / life insurance
Attribute 13: (numerical) Age in years
Attribute 14: (qualitative) Other installment plans
  A141: bank
  A142: stores
  A143: none
Attribute 15: (qualitative) Housing
  A151: rent
  A152: own
  A153: for free
Attribute 16: (numerical) Number of existing credits at this bank
Attribute 17: (qualitative) Job
  A171: unemployed / unskilled - non-resident
  A172: unskilled - resident
  A173: skilled employee / official
  A174: management / self-employed / highly qualified employee / officer
Attribute 18: (numerical) Number of people being liable to provide maintenance for
Attribute 19: (qualitative) Telephone
  A191: none
  A192: yes, registered under the customer's name
Attribute 20: (qualitative) Foreign worker
  A201: yes
  A202: no

Cost Matrix
This dataset requires the use of a cost matrix (see below):

              Predicted 1 (Good)   Predicted 2 (Bad)
Actual 1              0                    1
Actual 2              5                    0

(1 = Good, 2 = Bad.) The rows represent the actual classification and the columns the predicted classification. It is worse to class a customer as good when they are bad (cost 5) than it is to class a customer as bad when they are good (cost 1).
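To make the role of the cost matrix concrete, the short sketch below weights a confusion matrix by the costs above to obtain a total misclassification cost. This is an illustration only; the confusion-matrix counts are hypothetical placeholders, not results obtained from this dataset.

Python Code (illustrative sketch):
# Illustration of applying the German-credit cost matrix shown above.
import numpy as np

# Rows = actual class, columns = predicted class, order: [good, bad]
cost_matrix = np.array([[0, 1],     # actual good: predicting bad costs 1
                        [5, 0]])    # actual bad:  predicting good costs 5

confusion = np.array([[620,  80],   # hypothetical counts: actual good predicted good / bad
                      [ 90, 210]])  # hypothetical counts: actual bad predicted good / bad

total_cost = int(np.sum(confusion * cost_matrix))   # 80*1 + 90*5 = 530
average_cost = total_cost / confusion.sum()
print("Total misclassification cost:", total_cost)
print("Average cost per instance:", round(average_cost, 3))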
Relabeled values in attribute checking_status From: A11 To: '<0' From: A12 To: '0<=X<200' From: A13 To: '>=200' From: A14 To: 'no checking' @relation german_credit @attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'} @attribute duration numeric @attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'} @attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other} @attribute credit_amount numeric @attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'} @attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'} @attribute installment_commitment numeric @attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'} @attribute other_parties { none, 'co applicant', guarantor} @attribute residence_since numeric @attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'} @attribute age numeric @attribute other_payment_plans { bank, stores, none} @attribute housing { rent, own, 'for free'} @attribute existing_credits numeric @attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'} @attribute num_dependents numeric @attribute own_telephone { none, yes} @attribute foreign_worker { yes, no} @attribute class { good, bad} @data '<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good '0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad 'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good '<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good '<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad 'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good 'no checking',24,'existing paid',furniture/equipment,2835,'500<=X<1000','>=7',3,'male single',none,4,'life insurance',53,none,own,1,skilled,1,none,yes,good '0<=X<200',36,'existing paid','used car',6948,'<100','1<=X<4',2,'male single',none,2,car,35,none,rent,1,'high qualif/self emp/mgmt',1,yes,yes,good. (f) Breast Cancer In this problem we have to use 30 different columns and we have to predict the Stage of Breast Cancer M (Malignant) and B (Benign) Attribute Information: 1) ID number 2) Diagnosis (M = malignant, B = benign) -3-32.Ten real-valued features are computed for each cell nucleus: a) radius (mean of distances from center to points on the perimeter) b) texture (standard deviation of gray-scale values) c) perimeter d) area e) smoothness (local variation in radius lengths) f) compactness (perimeter^2 / area - 1.0) g). concavity (severity of concave portions of the contour) h). concave points (number of concave portions of the contour) i). symmetry j). 
fractal dimension ("coastline approximation" - 1) Viva Questions: Q1: Take the different situations and analyse problem in terms of different attributes? Q2: What are the various types of data attributes used in problem description? Q3: How much minimum no. of instances are required to satisfy problem description? Experiment 5 Aim: Study Major Classifier algorithms and implement. INSTRUCTIONS: (a) Naive Bayes algorithm It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’. Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods. Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below: Above, • P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes). • P(c) is the prior probability of class. • P(x|c) is the likelihood which is the probability of predictor given class. • P(x) is the prior probability of predictor. Pros and Cons of Naive Bayes? Pros: • • It is easy and fast to predict class of test data set. It also perform well in multi class prediction When assumption of independence holds, a Naive Bayes classifier performs better compare to other models like logistic regression and you need less training data. • It perform well in case of categorical input variables compared to numerical variable(s). For numerical variable, normal distribution is assumed (bell curve, which is a strong assumption). Cons: • • • If categorical variable has a category (in test data set), which was not observed in training data set, then model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest smoothing techniques is called Laplace estimation. On the other side naive Bayes is also known as a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously. Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent. Applications of Naive Bayes Algorithms • Real time Prediction: Naive Bayes is an eager learning classifier and it is sure fast. Thus, it could be used for making predictions in real time. • Multi class Prediction: This algorithm is also well known for multi class prediction feature. Here we can predict the probability of multiple classes of target variable. • Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers mostly used in text classification (due to better result in multi class problems and independence rule) have higher success rate as compared to other algorithms. 
As a result, it is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
• Recommendation systems: A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

(b) Decision Trees in Machine Learning
A tree has many analogies in real life, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions. Though a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning, which is the main focus here.
Why decision trees? We have a couple of other algorithms, so why choose decision trees? There might be many reasons, but a few important ones are:
1. Decision trees often mimic human-level thinking, so it is simple to understand the data and make good interpretations.
2. Decision trees actually let you see the logic used to interpret the data (unlike black-box algorithms such as SVM, NN, etc.).
How can an algorithm be represented as a tree? For this, let's consider a very basic example that uses the Titanic data set for predicting whether a passenger will survive or not. The model below uses 3 features/attributes/columns from the data set, namely sex, age and sibsp (number of spouses or children aboard). A decision tree is drawn upside down, with its root at the top. In the image on the left, the bold text in black represents a condition/internal node, based on which the tree splits into branches/edges. The end of a branch that doesn't split any more is the decision/leaf; in this case, whether the passenger died or survived, represented as red and green text respectively. A real dataset will have many more features, and this will just be a branch in a much bigger tree, but you can't ignore the simplicity of this algorithm. The feature importance is clear and relations can be viewed easily. This methodology is more commonly known as learning a decision tree from data, and the above tree is called a classification tree, as the target is to classify a passenger as survived or died. Regression trees are represented in the same manner; they just predict continuous values, like the price of a house. In general, decision tree algorithms are referred to as CART, or Classification and Regression Trees.

(c) Classification and Regression Trees
Growing a tree involves deciding which features to choose and what conditions to use for splitting, along with knowing when to stop.
Advantages of CART
• Simple to understand, interpret and visualize.
• Decision trees implicitly perform variable screening or feature selection.
• Can handle both numerical and categorical data. Can also handle multi-output problems.
• Decision trees require relatively little effort from users for data preparation.
• Nonlinear relationships between parameters do not affect tree performance.
Disadvantages of CART
• Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
• Decision trees can be unstable, because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
• Greedy algorithms cannot guarantee returning the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting the decision tree. An improvement over decision tree learning is made using the technique of boosting.

(d) Autoregressive Integrated Moving Average Model (ARIMA)
An ARIMA model is a class of statistical models for analyzing and forecasting time series data. It explicitly caters to a suite of standard structures in time series data, and as such provides a simple yet powerful method for making skillful time series forecasts. ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average model and adds the notion of integration. This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:
• AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
• I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary.
• MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
Each of these components is explicitly specified in the model as a parameter. A standard notation of ARIMA(p,d,q) is used, where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used. The parameters of the ARIMA model are defined as follows:
• p: The number of lag observations included in the model, also called the lag order.
• d: The number of times that the raw observations are differenced, also called the degree of differencing.
• q: The size of the moving average window, also called the order of the moving average.
A linear regression model is constructed including the specified number and type of terms, and the data is prepared by a degree of differencing in order to make it stationary, i.e. to remove trend and seasonal structures that negatively affect the regression model. In summary, the steps of this process are as follows:
1. Model identification. Use plots and summary statistics to identify trends, seasonality and autoregression elements to get an idea of the amount of differencing and the size of the lag that will be required.
2. Parameter estimation. Use a fitting procedure to find the coefficients of the regression model.
3. Model checking. Use plots and statistical tests of the residual errors to determine the amount and type of temporal structure not captured by the model.
The process is repeated until a desirable level of fit is achieved on the in-sample or out-of-sample observations (e.g. training or test datasets).

(e) Linear Regression and Logistic Regression
Linear regression is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). Here, we establish a relationship between the independent and dependent variables by fitting a best-fit line.
This best-fit line is known as the regression line and is represented by the linear equation Y = a*X + b. Look at the example below: here we have identified the best-fit line with linear equation y = 0.2811x + 13.9. Now, using this equation, we can find the weight, knowing the height of a person. Linear regression is mainly of two types: simple linear regression and multiple linear regression.

Python Code
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict output
predicted = linear.predict(x_test)

R Code
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict output
predicted <- predict(linear, x_test)

Logistic regression: It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected).

Python Code
#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict output
predicted = model.predict(x_test)

R Code
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
#Predict output
predicted <- predict(logistic, x_test)

(f) SVM (Support Vector Machine)
It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. For example, if we only had two features like height and hair length of an individual, we would first plot these two variables in two-dimensional space, where each point has two coordinates (these coordinates are known as support vectors). We then find some line that splits the data between the two differently classified groups of data. This will be the line such that the distances from the closest point in each of the two groups are farthest from it.
In the example shown above, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on where the testing data lands on either side of the line, that is the class we assign to the new data.

Python Code
#Import Library
from sklearn import svm
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create SVM classification object
model = svm.SVC() # there are various options associated with it; this is simple classification
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict output
predicted = model.predict(x_test)

R Code
library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
#Predict output
predicted <- predict(fit, x_test)

(g) KNN (K-Nearest Neighbours)
It can be used for both classification and regression problems. However, it is more widely used in classification problems in industry. K-nearest neighbours is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbours. The case is assigned to the class most common amongst its K nearest neighbours, measured by a distance function. These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. The first three are used for continuous variables and the fourth one (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbour. At times, choosing K turns out to be a challenge while performing KNN modelling.
KNN can easily be mapped to our real lives. If you want to learn about a person of whom you have no information, you might like to find out about his close friends and the circles he moves in, and gain access to his/her information.
Things to consider before selecting KNN:
• KNN is computationally expensive.
• Variables should be normalized, else higher-range variables can bias it.
• Work more on the pre-processing stage (e.g. outlier and noise removal) before going for KNN.

Python Code
#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X, y)
#Predict output
predicted = model.predict(x_test)

R Code
library(class)
# knn() from the class package takes the training matrix, test matrix, class labels and k
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)

Experiment 6
Aim: Analysis of round-trip Time-of-Flight measurements from a supermarket.
INSTRUCTIONS:
Dataset source: http://crawdad.org/cmu/supermarket/20140527/
Tool used: WEKA
Here the true value of interest is taken as θ and the value estimated using some algorithm as θ^. Correlation tells you how much θ and θ^ are related. It gives values between −1 and 1, where 0 is no relation, 1 is a very strong linear relation and −1 is an inverse linear relation (i.e. bigger values of θ indicate smaller values of θ^, or vice versa).
Mean absolute error is:
MAE = (1/N) Σ |θ^_i − θ_i|
Root mean squared error is:
RMSE = sqrt( (1/N) Σ (θ^_i − θ_i)² )
Relative absolute error:
RAE = Σ |θ^_i − θ_i| / Σ |θ̄ − θ_i|
Root relative squared error:
RRSE = sqrt( Σ (θ^_i − θ_i)² / Σ (θ̄ − θ_i)² )
(the sums run over i = 1 … N, and θ̄ is the mean of the true values θ_i).

Some terms used:
TP Rate: rate of true positives (instances correctly classified as a given class).
FP Rate: rate of false positives (instances falsely classified as a given class).
Precision: proportion of instances that are truly of a class divided by the total instances classified as that class.
Recall: proportion of instances classified as a given class divided by the actual total in that class (equivalent to TP rate).
F-Measure: a combined measure for precision and recall, calculated as 2 * Precision * Recall / (Precision + Recall).

Algorithms:
1. Naïve Bayes
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: supermarket
Instances: 4627
Attributes: 217 [list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Naive Bayes Classifier
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances    2948    63.713 %
Incorrectly Classified Instances  1679    36.287 %
Kappa statistic                   0
Mean absolute error               0.4624
Root mean squared error           0.4808
Relative absolute error           100 %
Root relative squared error       100 %
Total Number of Instances         4627
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               1.000    1.000    0.637      1.000   0.778      0.000  0.499     0.637     low
               0.000    0.000    0.000      0.000   0.000      0.000  0.499     0.363     high
Weighted Avg.  0.637    0.637    0.406      0.637   0.496      0.000  0.499     0.537
=== Confusion Matrix ===
    a    b   <-- classified as
 2948    0 |  a = low
 1679    0 |  b = high

2. Decision Stump
=== Run information ===
Scheme: weka.classifiers.trees.DecisionStump
Relation: supermarket
Instances: 4627
Attributes: 217 [list of attributes omitted]
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Decision Stump
Classifications
tissues-paper prd = t : high
tissues-paper prd != t : low
tissues-paper prd is missing : low
Class distributions
tissues-paper prd = t
low                  high
0.48553627058299953  0.5144637294170005
tissues-paper prd != t
low                  high
0.7802521008403361   0.21974789915966386
tissues-paper prd is missing
low                  high
0.7802521008403361   0.21974789915966386
Time taken to build model: 0.09 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances    2980    64.4046 %
Incorrectly Classified Instances  1647    35.5954 %
Kappa statistic                   0.2813
Mean absolute error               0.4212
Root mean squared error           0.4603
Relative absolute error           91.0797 %
Root relative squared error       95.7349 %
Total Number of Instances         4627
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.627    0.325    0.772      0.627   0.692      0.290  0.642     0.732     low
               0.675    0.373    0.507      0.675   0.579      0.290  0.642     0.466     high
Weighted Avg.  0.644    0.343    0.676      0.644   0.651      0.290  0.642     0.635
=== Confusion Matrix ===
    a    b   <-- classified as
 1847 1101 |  a = low
  546 1133 |  b = high

3.
Random forest == Run information === Scheme: weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1 Relation: supermarket Instances: 4627 Attributes: 217 [list of attributes omitted] Test mode: 10-fold cross-validation === Classifier model (full training set) === RandomForest Bagging with 100 iterations and base learner weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities Time taken to build model: 3 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances 2948 Incorrectly Classified Instances 1679 Kappa statistic 0 Mean absolute error 0.4624 Root mean squared error 0.4808 Relative absolute error 99.9964 % Root relative squared error 100 % Total Number of Instances 4627 === Detailed Accuracy By Class === 63.713 % 36.287 % TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class 1.000 1.000 0.637 1.000 0.778 0.000 0.500 0.637 low 0.000 0.000 0.000 0.000 0.000 0.000 0.500 0.363 high Weighted Avg. 0.637 0.637 0.406 0.637 0.496 0.000 0.500 0.538 === Confusion Matrix === a b <-- classified as 2948 0 | a = low 1679 0 | b = high 4. MLP(multi layer preceptron) === Run information === Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 S 0 -E 20 -H a Relation: supermarket Instances: 4627 Attributes: 217 [list of attributes omitted] Test mode: 10-fold cross-validation === Classifier model (full training set) === Attrib department216 Class low Input Node 0 Class high Input Node 1 0.044512208908162154 Time taken to build model: 1166.18 seconds 5. K MEAN CLUSTERING === Run information === Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodicpruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance R first-last" -I 500 -num-slots 1 -S 10 Relation: supermarket Instances: 4627 Attributes: 217 [list of attributes omitted] Test mode: evaluate on training data === Clustering model (full training set) === Number of iterations: 2 Within cluster sum of squared errors: 0.0 Initial starting points (random): Cluster 0: t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t, t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t, t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t, t,t,t,t,t,t,t,t,t,t,t,t,high Cluster 1: t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t, t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t, t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t, t,t,t,t,t,t,t,t,t,t,t,t,low Missing values globally replaced with mean/mode Final cluster centroids: Cluster# Attribute Full Data 0 1 (4627.0) (1679.0) (2948.0) ===================================================== department1 t tt department2 t tt department3 t tt department4 t tt department5 t tt department6 t tt department7 t tt department8 t tt department9 t tt grocery misc t tt department11 t tt baby needs t tt bread and cake t tt baking needs t tt coupons t tt Time taken to build model (full training data) : 0.22 seconds === Model and evaluation on 
=== Model and evaluation on training set ===

Clustered Instances
0    1679 ( 36%)
1    2948 ( 64%)

Result: By performing classification with the five algorithms above, we observe that Naïve Bayes takes the least time to build its model and is also the most accurate, with the least mean absolute error. Therefore, we conclude that, of the above algorithms, Naïve Bayes performs best.

Viva Questions:
Q1: What are the performance measures used?
Q2: What do you mean by a confusion matrix?
Q3: What is the best accuracy achieved, and by which algorithm?
Q4: How do you choose the attributes which affect the problem description?
Q5: What do you mean by TP rate and FP rate?
Q6: Define precision and recall.

Experiment 7

Aim: Implement supervised learning (KNN classification).

Instruction guidelines: K-Nearest Neighbors - Classification
K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition since the early 1970s as a non-parametric technique.

Algorithm: A case is classified by a majority vote of its neighbors, the case being assigned to the class most common among its K nearest neighbors as measured by a distance function. If K = 1, the case is simply assigned to the class of its nearest neighbor. It should also be noted that the standard numeric distance measures (Euclidean, Manhattan, Minkowski) are only valid for continuous variables; for categorical variables the Hamming distance must be used. This also raises the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.

Choosing the optimal value for K is best done by first inspecting the data. In general, a large K value is more precise as it reduces the overall noise, but there is no guarantee. Cross-validation is another way to retrospectively determine a good K value, by using an independent dataset to validate it. Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1-NN.

Example: Consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default is the target. We can now use the training set to classify an unknown case (Age = 48 and Loan = $142,000) using Euclidean distance. If K = 1, the nearest neighbor is the last case in the training set, with Default = Y:
D = sqrt[(48 - 33)^2 + (142000 - 150000)^2] = 8000.01  >>  Default = Y
(Fig. 1: credit-default training data)
With K = 3, two of the three closest neighbors have Default = Y and one has Default = N, so the prediction for the unknown case is again Default = Y.
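A minimal Python sketch of the K-NN idea described above, applied to the unknown case (Age = 48, Loan = $142,000). Only the training point (33, 150000, Y) comes from the worked example; the other rows are assumed purely for illustration.

import math

# Hypothetical training data (Age, Loan, Default); chosen so that, as in the
# text, K=3 gives two Default=Y and one Default=N among the nearest neighbours.
train = [
    (25,  40000, 'N'),
    (35,  60000, 'N'),
    (45, 120000, 'Y'),
    (52,  18000, 'N'),
    (60, 100000, 'N'),
    (33, 150000, 'Y'),   # the case from the worked example (D = 8000.01)
]

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def knn_predict(query, data, k=3):
    # sort the stored cases by distance to the query and keep the k nearest
    neighbours = sorted(data, key=lambda row: euclidean(query, row))[:k]
    labels = [label for _, _, label in neighbours]
    # majority vote among the k nearest neighbours
    return max(set(labels), key=labels.count)

query = (48, 142000)                     # Age = 48, Loan = $142,000
print(knn_predict(query, train, k=1))    # -> 'Y' (single nearest neighbour)
print(knn_predict(query, train, k=3))    # -> 'Y' (two Y, one N)

Because Loan dominates the Euclidean distance at this scale, in practice the variables would first be standardized to [0, 1], as noted in the guidelines above.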
Sample Input:
Step 1: Open the WEKA GUI and select Explorer.
Step 2: Load the breast cancer dataset (breast-cancer.arff) given in the Experiment 4 databases, using "Open file".
Step 3: Choose the "Classify" tab, choose the decision tree classifier labelled J48 in the trees folder, select "Cross-validation", enter 5 in Folds, then press Start.

Output:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: breast-cancer
Instances: 286
Attributes: 10
  age
  menopause
  tumor-size
  inv-nodes
  node-caps
  deg-malig
  breast
  breast-quad
  irradiat
  Class
Test mode: 5-fold cross-validation

=== Classifier model (full training set) ===
J48 pruned tree
------------------
node-caps = yes
|   deg-malig = 1: recurrence-events (1.01/0.4)
|   deg-malig = 2: no-recurrence-events (26.2/8.0)
|   deg-malig = 3: recurrence-events (30.4/7.4)
node-caps = no: no-recurrence-events (228.39/53.4)

Number of Leaves  : 4
Size of the tree  : 6

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       212      74.1259 %
Incorrectly Classified Instances      74      25.8741 %
Kappa statistic                        0.2288
Mean absolute error                    0.3726
Root mean squared error                0.4435
Relative absolute error               89.0412 %
Root relative squared error           97.0395 %
Total Number of Instances            286

=== Detailed Accuracy By Class ===
              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.960    0.776    0.745      0.960   0.839      0.287  0.582     0.728     no-recurrence-events
              0.224    0.040    0.704      0.224   0.339      0.287  0.582     0.444     recurrence-events
Weighted Avg. 0.741    0.558    0.733      0.741   0.691      0.287  0.582     0.643

=== Confusion Matrix ===
   a   b   <-- classified as
 193   8 | a = no-recurrence-events
  66  19 | b = recurrence-events

Viva Questions:
Q1: How is the KNN algorithm different from k-means clustering?
Q2: How is the KNN algorithm used in radar target specification?
Q3: Does the KNN algorithm have a centroid, as k-means does?
Q4: Is there a way to obtain the centroid for the data classified by KNN?
Q5: What is the difference between supervised and unsupervised learning?

Experiment 8

Aim: Understanding of R and its basics

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with procedures written in C, C++, .Net, Python or FORTRAN for efficiency. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac. R is free software distributed under a GNU-style copyleft, and an official part of the GNU project called GNU S.

Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993. A large group of individuals has contributed to R by sending code and bug reports. Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.

Features of R
As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting.
The following are the important features of R:
• R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large, coherent and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on screen or on paper.
In conclusion, R is among the world's most widely used statistics programming languages. It is a leading choice of data scientists and is supported by a vibrant and talented community of contributors. R is taught in universities and deployed in mission-critical business applications.

Experiment 9

Aim: Using R Studio to implement a linear model for regression.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

def estimate_coef(x, y):
    # number of observations
    n = np.size(x)
    print(n)
    # means of x and y
    m_x, m_y = np.mean(x), np.mean(y)
    # sum of cross-deviations and squared deviations about the means
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients: slope b_1 and intercept b_0
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plot the observed points and the fitted line
    plt.scatter(x, y, color="m", marker="o", s=30)
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred, color="g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

boston = datasets.load_boston(return_X_y=False)
X = boston.data[:, 5]    # average number of rooms per dwelling (RM)
y = boston.target        # median house value
print(X.shape)
print(y.shape)
b = estimate_coef(X, y)
print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
plot_regression_line(X, y, b)

OUTPUT: printed shapes and estimated coefficients, followed by the scatter plot with the fitted regression line produced by plot_regression_line.

Viva Questions:
Q1: Compare the R and Python programming languages for predictive modelling.
Q2: Explain data import in R.
Q3: How are missing values and impossible values represented in R?
Q4: R has several packages for solving a particular problem. How do you decide which one is best to use?
Q5: How many data structures does R have?
Q6: What is the process to create a table in R without using external files?

Experiment 10

Aim: Using Python (Anaconda or PyCharm), implement a linear model for regression or a neural network.

Program:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# load the stock price dataset (comma-separated values)
testdata = pd.read_csv('c://1//2.csv', sep=',')
print("dataset length :", len(testdata))
print("dataset shape :", testdata.shape)

# the first six columns are predictors, the seventh column is the target price
X = testdata.values[:, 0:6]
Y = testdata.values[:, 6]
print(X)
print(Y)

# hold out 25% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
clf = LinearRegression()
clf.fit(X_train, y_train)
pre = clf.predict(X_test)

# The coefficients
print('Coefficients: ', clf.coef_)
# The mean squared error
# print("Mean squared error: %.2f" % mean_squared_error(y_test, pre))
# Explained variance score: 1 is perfect prediction
# print('Variance score: %.2f' % r2_score(y_test, pre))

OUTPUT
"C:\Users\Anureet\PycharmProjects\untitled\venv\Scripts\python.exe" "C:/Users/Anureet/PycharmProjects/untitled/venv/Stock price prediction.py"
dataset length : 1258
dataset shape : (1258, 7)
[[1.100e+01 2.000e+00 2.013e+03 1.489e+01 1.501e+01 1.426e+01]
 [1.200e+01 2.000e+00 2.013e+03 1.445e+01 1.451e+01 1.410e+01]
 [1.300e+01 2.000e+00 2.013e+03 1.430e+01 1.494e+01 1.425e+01]
 ...
 [5.000e+00 2.000e+00 2.018e+03 5.199e+01 5.239e+01 4.975e+01]
 [6.000e+00 2.000e+00 2.018e+03 4.932e+01 5.150e+01 4.879e+01]
 [7.000e+00 2.000e+00 2.018e+03 5.091e+01 5.198e+01 5.089e+01]]
[14.46 14.27 14.66 ... 49.76 51.18 51.4 ]
Coefficients:  [-8.08582422e-04 -1.01285688e-02 -3.88743101e-04  5.56289130e-01  8.39542627e-01  7.15213953e-01]
Process finished with exit code 0

Viva Questions:
1) How can you build a simple logistic regression model in Python?
2) How can you train and interpret a linear regression model in scikit-learn?
3) Name a few libraries in Python used for data analysis and scientific computations.
4) Which library would you prefer for plotting in Python: Seaborn or Matplotlib?
5) Which Random Forest parameters can be tuned to enhance the predictive power of the model?
6) Which method in pandas.tools.plotting is used to create a scatter plot matrix?

Experiment 11

AIM - Understanding of the RMS Titanic dataset to predict survival by training a model and predicting the required solution.

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

• Survived (target variable) - Binary categorical variable where 0 represents not survived and 1 represents survived.
• Pclass - Categorical variable; the passenger class.
• Sex - Binary variable representing the gender of the passenger.
• Age - Feature-engineered variable, divided into 4 classes.
• Fare - Feature-engineered variable, divided into 4 classes.
• Embarked - Categorical variable; the port of embarkation.
• Title - New feature created from names; the title in a name is classified into 4 different classes.
• isAlone - Binary variable; whether the passenger is travelling alone or not.
• Age*Class - Feature-engineered variable.

Model, predict and solve
Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification and regression problem: we want to identify the relationship between the output (survived or not) and the other variables or features (gender, age, port, ...). We are also performing a category of machine learning called supervised learning, as we are training our model with a given dataset. With these two criteria - supervised learning plus classification and regression - we can narrow our choice of models down to a few. These include:
• Logistic Regression
• KNN or k-Nearest Neighbours
• Support Vector Machines
• Naive Bayes classifier
• Decision Tree
Size of the training and testing dataset
1. Logistic Regression is a useful model to run early in the workflow (a short scikit-learn sketch is given after this list of models).
Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Note the confidence score generated by the model based on our training dataset.
2. In pattern recognition, the k-Nearest Neighbours algorithm (k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbours, the sample being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of that single nearest neighbour. The KNN confidence score is better than that of Logistic Regression but worse than that of SVM.
3. Next we model using Support Vector Machines, which are supervised learning models with associated learning algorithms that analyse data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Note that the model generates a confidence score which is higher than that of the Logistic Regression model.
4. In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. The model's confidence score is the lowest among the models evaluated so far.
5. This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. The model's confidence score is the highest among the models evaluated so far.
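As referenced in the model list above, a minimal scikit-learn sketch of the first step (logistic regression on the engineered Titanic features). The column names follow the feature list in this experiment; the file name train_processed.csv and the assumption that the categories are already numerically encoded are illustrative only.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Assumed file: a pre-processed Titanic training set with the engineered
# features described above, already encoded as small integer categories.
df = pd.read_csv('train_processed.csv')

features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Title', 'isAlone', 'Age*Class']
X = df[features]
y = df['Survived']          # 0 = not survived, 1 = survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# the "confidence score" mentioned in the text corresponds to accuracy on the data
print('train accuracy:', round(logreg.score(X_train, y_train), 3))
print('test accuracy :', round(logreg.score(X_test, y_test), 3))

The other models in the list (KNN, SVM, naive Bayes, decision tree) can be compared by swapping the estimator while keeping the same train/test split.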
Experiment 12

AIM - Understanding of Indian education in rural villages, to predict whether a girl child will be sent to school or not.

The data is focused on rural India. It primarily looks at whether the villagers are willing to send girl children to school; where they are not, the reasons have also been recorded. The district is Gwalior. Various details of the villagers such as village, gender, age, education, occupation, category, caste, religion and land have also been collected.

1. Naïve Bayes Classifier
The algorithm was run with 10-fold cross-validation: this means it was given an opportunity to make a prediction for each instance of the dataset (with different training folds), and the presented result is a summary of those predictions. Firstly, I noted the classification accuracy. The model achieved a result of 109/200 correct, or 54.5%.

=== Confusion Matrix ===
  a  b  c  d  e  f  g  h  i  j  k  l  m   <-- classified as
  0  0  1  1  0  1  0  2  0  0  0  0  0 | a = Govt.
  2  1  1  1  8  0  0  0  0  1  0  0  0 | b = Driver
  2  0 17  2  9  0  0  2  0  0  0  0  0 | c = Farmer
  0  0  4  3  2  0  1  0  0  1  0  0  0 | d = Shopkeeper
Rows e-m (values as printed):
1 8 3 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 2 0 3 73 1 0 1 0 0 0 0 1 1 0 0 0 2 0 1 1 0 8 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 2 0 0 3 0 0 0 0 2 0 2 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 4 3 0 0 0 0 0 0
| e = labour | f = Security Guard | g = Raj Mistri | h = Fishing | i = Labour & Driver | j = Homemaker | k = Govt School Teacher | l = Dhobi | m = goats

The confusion matrix shows the precision of the algorithm: for example, 1, 1, 1 and 2 government officials were misclassified as Farmer, Shopkeeper, Security Guard and Fishing respectively, and 2, 1, 1, 8 and 1 drivers were misclassified as Govt., Farmer, Shopkeeper, Labour and Homemaker, and so on. This table can help to explain the accuracy achieved by the algorithm.

Now that we have a model, we need to load the test data created before. For this, select Supplied test set and click the Set button. Click More Options and, in the new window, choose PlainText for Output predictions. Then right-click the recently created model in the result list and select Re-evaluate model on current test set. After re-evaluation, the classification accuracy is 151/200 correct, or 75.5%.

TP = true positives: number of examples predicted positive that are actually positive
FP = false positives: number of examples predicted positive that are actually negative
TN = true negatives: number of examples predicted negative that are actually negative
FN = false negatives: number of examples predicted negative that are actually positive

Recall is the TP rate (also referred to as sensitivity): what fraction of those that are actually positive were predicted positive? Recall = TP / actual positives.
Precision is TP / predicted positives: what fraction of those predicted positive are actually positive? Precision is also referred to as positive predictive value (PPV).
Other related measures used in classification include the true negative rate and accuracy. The true negative rate is also called specificity (TN / actual negatives). 1 - specificity is the x-axis of the ROC curve; this is the same as the FP rate (FP / actual negatives).
F-measure: a measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score.
Mean absolute error (MAE): the MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables; the equation is given in the references. Expressed in words, the MAE is the average, over the verification sample, of the absolute values of the differences between each forecast and the corresponding observation. The MAE is a linear score, which means that all the individual differences are weighted equally in the average.
Root mean squared error (RMSE): the RMSE is a quadratic scoring rule which measures the average magnitude of the error; the equation is given in the references. Expressing the formula in words, the differences between forecast and corresponding observed values are each squared and then averaged over the sample; finally, the square root of the average is taken. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable.
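A small Python sketch of the two error measures as they are defined in words above; the forecast and observation values are made up purely for illustration.

import math

def mae(forecasts, observations):
    # average of the absolute differences between each forecast and its observation
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)

def rmse(forecasts, observations):
    # square the differences, average them, then take the square root
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(forecasts))

forecast    = [2.0, 3.5, 4.0, 5.0]   # made-up forecasts
observation = [2.5, 3.0, 4.0, 7.0]   # made-up observed values
print("MAE :", mae(forecast, observation))    # 0.75 - all errors weighted equally
print("RMSE:", rmse(forecast, observation))   # ~1.06 - the single large error (5 vs 7) dominates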
2. Support Vector Machine
The model achieved a result of 181/200 correct, or 92.3469%. We have classified the dataset on the basis of the reasons why the villagers are unwilling to send girl children to school in the Gwalior villages. The different classes are NA, Poverty, Marriage, Distance, X, Unsafe Public Space, Transport Facilities, and Household Responsibilities. The weighted average true positive rate is 0.923, that is, nearly all the predicted positive values are actually positive. The weighted average false positive rate is 0.205, that is, few instances are predicted positive while actually being negative. The precision is 0.902, that is, the algorithm is fairly accurate.

=== Confusion Matrix ===
   a   b   c   d   e   f   g   h   <-- classified as
 147   0   1   0   0   0   0   0 | a = NA
   4  12   0   0   0   0   0   0 | b = Poverty
   5   0   3   0   0   0   0   0 | c = Marriage
   0   0   1   3   0   0   0   0 | d = Distance
   0   0   0   0   8   0   0   0 | e = X
   4   0   0   0   0   0   0   0 | f = Unsafe Public Space
   0   0   0   0   0   0   4   0 | g = Transport Facilities
   0   0   0   0   0   0   0   4 | h = Household Responsibilities

The confusion matrix shows that the majority of the reasons were not available (NA); of the reasons which were available, people mostly did not send their daughters to school because of poverty, and very few of them considered distance a major factor for not sending their girl children to school.

3. Random Forest
The accuracy of this algorithm is 100%, that is, 200/200 instances have been correctly classified.

=== Confusion Matrix ===
  a   b   c   d   e   f   g   h   i   j   k   l   m   <-- classified as
  5   0   0   0   0   0   0   0   0   0   0   0   0 | a = Govt.
  0  14   0   0   0   0   0   0   0   0   0   0   0 | b = Driver
  0   0  32   0   0   0   0   0   0   0   0   0   0 | c = Farmer
  0   0   0  11   0   0   0   0   0   0   0   0   0 | d = Shopkeeper
  0   0   0   0  97   0   0   0   0   0   0   0   0 | e = labour
  0   0   0   0   0   4   0   0   0   0   0   0   0 | f = Security Guard
  0   0   0   0   0   0   4   0   0   0   0   0   0 | g = Raj Mistri
  0   0   0   0   0   0   0  11   0   0   0   0   0 | h = Fishing
  0   0   0   0   0   0   0   0   4   0   0   0   0 | i = Labour & Driver
  0   0   0   0   0   0   0   0   0   5   0   0   0 | j = Homemaker
  0   0   0   0   0   0   0   0   0   0   4   0   0 | k = Govt School Teacher
  0   0   0   0   0   0   0   0   0   0   0   5   0 | l = Dhobi
  0   0   0   0   0   0   0   0   0   0   0   0   4 | m = goats

There is no observation which has been misclassified. The maximum number of villagers are labourers.

4. Random Tree
The classification accuracy is 76.0204%, that is, 149/200 instances have been classified correctly. The false positive rate is 0.352, which is the highest of all four algorithms applied above: here 35.2% of the values which should have been classified negatively have been assigned a positive value.

=== Confusion Matrix ===
a 126 b 7 c d 3 1 7 8 1 4 1 3 1 0 0 2 0 0 4 0 0 3 1 0 1 0 0 e 0 0 0 3 0 0 f 8 0 0 0 6 0 g 3 0 0 0 0 0 h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
| a = NA | b = Poverty | c = Marriage | d = Distance | e = X | f = Unsafe Public Space | g = Transport Facilities | h = Household Responsibilities

22 NA, 8 Poverty, 5 Marriage, 1 Distance, 2 X, 4 Unsafe Public Space, 4 Transport Facilities and 1 Household Responsibilities class values have been misclassified.

Of the above algorithms, the best is Random Forest, with a 100% accuracy rate, and the worst is the Naïve Bayes algorithm, with a 75.5% accuracy rate.

Experiment 13

AIM - Understanding of a dataset of contact patterns among students, collected at the National University of Singapore.

This dataset was collected from contact patterns among students during the spring semester of 2006 at the National University of Singapore. Using the RemovePercentage filter, the instances have been reduced to 500. This reduced data has been saved as the training dataset and then used for further classification.
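For reference, an equivalent 50% reduction can be sketched outside WEKA in Python; this assumes the contact data has been exported to a CSV file named contacts.csv with the four attributes listed in the run information below (the file name is an assumption).

import pandas as pd

# Assumed export of the contact-pattern data
# (columns: Start Time, Session Id, Student Id, Duration)
df = pd.read_csv('contacts.csv')

# Analogue of WEKA's RemovePercentage -P 50.0: keep a random 50% of the instances
train = df.sample(frac=0.5, random_state=1)
print(len(train), "training instances")   # roughly half of the original rows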
ALGORITHM 1: Simple Linear Regression

=== Run information ===
Scheme: weka.classifiers.functions.SimpleLinearRegression
Relation: MOCK_DATA (1)-weka.filters.unsupervised.instance.RemovePercentage-P50.0
Instances: 500
Attributes: 4
  Start Time
  Session Id
  Student Id
  Duration
Test mode: evaluate on training data

=== Classifier model (full training set) ===
Linear regression on Session Id:
0.03 * Session Id + 10.38
Time taken to build model: 0 seconds

=== Evaluation on training set ===
Time taken to test model on training data: 0 seconds
=== Summary ===
Correlation coefficient          0.0677
Mean absolute error              4.9869
Root mean squared error          5.7893
Relative absolute error         99.7326 %
Root relative squared error     99.7708 %
Total Number of Instances      500

ALGORITHM 2: Linear Regression

Linear Regression Model:
Start Time = 0.0274 * Session Id + 10.3846
Time taken to build model: 0.01 seconds

=== Evaluation on training set ===
Time taken to test model on training data: 0 seconds
=== Summary ===
Correlation coefficient          0.0677
Mean absolute error              4.9869
Root mean squared error          5.7893
Relative absolute error         99.7326 %
Root relative squared error     99.7708 %
Total Number of Instances      500

ALGORITHM 3: DecisionTable

=== Run information ===
Scheme: weka.classifiers.rules.DecisionTable -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"
Relation: MOCK_DATA-weka.filters.unsupervised.instance.RemovePercentage-P50.0
Instances: 500
Attributes: 4
  Start Time
  Session Id
  Student Id
  Duration
Test mode: evaluate on training data

=== Classifier model (full training set) ===
Decision Table:
Number of training instances: 500
Number of Rules: 1
Non matches covered by Majority class.
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 9
Merit of best subset found: 5.814
Evaluation (for feature selection): CV (leave one out)
Feature set: 1
Time taken to build model: 0.02 seconds

=== Evaluation on training set ===
Time taken to test model on training data: 0 seconds
=== Summary ===
Correlation coefficient          0
Mean absolute error              5.0003
Root mean squared error          5.8026
Relative absolute error        100      %
Root relative squared error    100      %
Total Number of Instances      500

CONCLUSION: Six algorithms have been used to find the best classifier. Depending on the attributes, the performance of the algorithms can be measured via the mean absolute error and the correlation coefficient. From the results above, the worst correlation was obtained by DecisionTable and the best correlation by Decision Stump.
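A minimal Python counterpart of the simple linear regression fit above (Start Time regressed on Session Id). It assumes the same hypothetical contacts.csv export used earlier, so the file name and exact column names are assumptions; the reported quantities mirror those in the WEKA summaries (correlation coefficient, MAE, RMSE).

import numpy as np
import pandas as pd

df = pd.read_csv('contacts.csv')             # assumed export of the contact dataset
x = df['Session Id'].to_numpy(dtype=float)
y = df['Start Time'].to_numpy(dtype=float)

# least-squares slope and intercept, analogous to "Start Time = 0.0274 * Session Id + 10.3846"
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept

corr = np.corrcoef(x, y)[0, 1]               # correlation coefficient
mae  = np.mean(np.abs(y - pred))             # mean absolute error
rmse = np.sqrt(np.mean((y - pred) ** 2))     # root mean squared error

print("Start Time = %.4f * Session Id + %.4f" % (slope, intercept))
print("correlation=%.4f  MAE=%.4f  RMSE=%.4f" % (corr, mae, rmse))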