Contents

DATA MINING
MEAN, MEDIAN, MODE IN DATA MINING
FINDING THE ESTIMATED MEAN, MEDIAN AND MODE FOR GROUPED DATA IN DATA MINING
WHAT ARE QUARTILES AND BOX PLOT IN DATA MINING
BOX PLOT IN DATA MINING
HOW TO CALCULATE VARIANCE AND STANDARD DEVIATION OF DATA IN DATA MINING
DATA SKEWNESS IN DATA MINING
ATTRIBUTES TYPES IN DATA MINING
PROXIMITY MEASURE FOR NOMINAL ATTRIBUTES IN DATA MINING
DISTANCE MEASURE FOR ASYMMETRIC BINARY ATTRIBUTES IN DATA MINING
DISTANCE MEASURE FOR SYMMETRIC BINARY VARIABLES
EUCLIDEAN DISTANCE IN DATA MINING
JACCARD COEFFICIENT SIMILARITY MEASURE FOR ASYMMETRIC BINARY VARIABLES
COSINE SIMILARITY IN DATA MINING
MAJOR TASKS OF DATA PRE-PROCESSING
DATA CLEANING
Z-SCORE NORMALIZATION OF DATA
MIN MAX NORMALIZATION OF DATA IN DATA MINING
MIN MAX SCALING IN DATA MINING
NORMALIZATION WITH DECIMAL SCALING IN DATA MINING
STANDARD DEVIATION NORMALIZATION OF DATA IN DATA MINING
DATA DISCRETIZATION IN DATA MINING
BINNING METHODS FOR DATA SMOOTHING IN DATA MINING
CORRELATION ANALYSIS OF NOMINAL DATA
CORRELATION ANALYSIS FOR NUMERICAL DATA
FREQUENT PATTERN MINING IN DATA MINING
APRIORI ALGORITHM IN DATA MINING
APRIORI PRINCIPLES IN DATA MINING
APRIORI CANDIDATES GENERATION IN DATA MINING
KMEANS CLUSTERING IN DATA MINING
KMEANS CLUSTERING ON TWO ATTRIBUTES IN DATA MINING
DECISION TREE INDUCTION IN DATA MINING
COMPUTING INFORMATION-GAIN FOR CONTINUOUS-VALUED ATTRIBUTES IN DATA MINING
WHICH ATTRIBUTE SELECTION MEASURE IS BEST IN DATA MINING
GINI INDEX FOR BINARY VARIABLES IN DATA MINING
NAIVE BAYES CLASSIFIER TUTORIAL IN DATA MINING
BOOSTING IN DATA MINING
RAINFOREST ALGORITHM IN DATA MINING
HOLDOUT METHOD FOR EVALUATING A CLASSIFIER IN DATA MINING
EVALUATION OF A CLASSIFIER BY CONFUSION MATRIX IN DATA MINING
OVERFITTING OF DECISION TREE AND TREE PRUNING IN DATA MINING
DATA MINING

What is data mining?
Data mining is the process of extracting hidden, useful information from huge amounts of data. It is the automated analysis of massive data sets, also described as knowledge discovery from data.

What are alternative names for data mining?
- Knowledge discovery in databases (KDD)
- Data/pattern analysis
- Knowledge extraction
- Data dredging
- Data archeology
- Business intelligence
- Information harvesting

What is not data mining?
- Expert systems (in artificial intelligence)
  - An expert system takes decisions based on the expertise encoded in its algorithms.
- Simple querying
  - A query takes a decision according to the condition given in SQL. For example, the database query "SELECT * FROM table" just displays the information stored in a table; it does not uncover hidden information. So it is a simple query, not data mining.

MEAN, MEDIAN, MODE IN DATA MINING

What is mean?
Mean is the average of the numbers.
Example: 3, 5, 6, 9, 8
Mean = sum of all values / total number of values
Mean = (3 + 5 + 6 + 9 + 8) / 5
Mean = 6.2

How to calculate the mean for data with frequencies?

Age    Frequency   Age * Frequency
22     5           22 * 5 = 110
33     2           33 * 2 = 66
44     6           44 * 6 = 264
66     4           66 * 4 = 264
Total  17          704

Mean = 704 / 17
Mean = 41.41
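The frequency-weighted mean above is easy to verify in code. A minimal sketch in Python, using only the ages and frequencies from the table:

```python
# Frequency-weighted mean: sum(age * frequency) / sum(frequency)
ages = [22, 33, 44, 66]
frequencies = [5, 2, 6, 4]

weighted_sum = sum(a * f for a, f in zip(ages, frequencies))  # 704
total_frequency = sum(frequencies)                            # 17

mean = weighted_sum / total_frequency
print(round(mean, 2))  # 41.41
```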
What is Median?
Median is the middle value among all values.

How to calculate the median for an odd number of values?
Example: 9, 8, 5, 6, 3
Arrange the values in order: 3, 5, 6, 8, 9
Median = 6

How to calculate the median for an even number of values?
Example: 9, 8, 5, 6, 3, 4
Arrange the values in order: 3, 4, 5, 6, 8, 9
Add the 2 middle values and calculate their mean.
Median = (5 + 6) / 2
Median = 5.5

What is Mode?
Mode is the most frequently occurring value.

How to calculate the mode?
Example: 3, 6, 6, 8, 9
Mode = 6 (because 6 occurs 2 times and every other value occurs only once)

FINDING THE ESTIMATED MEAN, MEDIAN AND MODE FOR GROUPED DATA IN DATA MINING

How to calculate the estimated mean and estimated median of grouped data?

Age      Mid of age   Frequency   Mid * Frequency
21 - 25  23           5           23 * 5 = 115
26 - 30  28           2           28 * 2 = 56
31 - 35  33           6           33 * 6 = 198
35 - 40  37           8           37 * 8 = 296
Total                 21          665

Estimated Mean = 665 / 21 = 31.66

Class intervals: The groups 21 to 25, 26 to 30, 31 to 35 and 35 to 40 are class intervals.
The mean is 31.66, which rounds to 32.
Estimated mean = 32

Median group: the group containing the middle value (the 10.5th of 21 values) is 31 to 35.

Estimated Median = L + ((TV / 2 - SBM) / FMG) * GW
where
L   = Lower boundary of the median group = 30.5
TV  = Total number of values = 21
SBM = Sum of frequencies before the median group = 7
FMG = Frequency of the median group = 6
GW  = Group width = 5

Estimated Median = 30.5 + ((21 / 2 - 7) / 6) * 5 = 30.5 + 2.92 = 33.42

Result: Our median group is 31 to 35 and, yes, the estimated median 33.4 falls inside the median group.

How to calculate the estimated mode of the above grouped data?
Mode is the most frequently occurring value in the data. For grouped data, the modal group is the group with the highest frequency, here 35 to 40 (frequency 8).

Estimated Mode = L + ((FMG - FB) / ((FMG - FB) + (FMG - FA))) * GW
where
L   = Lower boundary of the modal group = 34.5
FMG = Frequency of the modal group = 8
FB  = Frequency of the group before the modal group = 6
FA  = Frequency of the group after the modal group = 0
GW  = Group width = 5

Estimated Mode = 34.5 + ((8 - 6) / ((8 - 6) + (8 - 0))) * 5 = 34.5 + 1 = 35.5

WHAT ARE QUARTILES AND BOX PLOT IN DATA MINING

What is quartile?
Quartiles divide the data into four equal groups.

How to find quartiles of an odd-length data set?
Example: Data = 8, 5, 2, 4, 8, 9, 5
Step 1: First of all arrange the values in order.
After ordering the values: Data = 2, 4, 5, 5, 8, 8, 9
Step 2: To divide this data into four equal parts, we need three quartiles.
Q1: Lower quartile
Q2: Median of the data set
Q3: Upper quartile
Step 3: Find the median of the data set and label it as Q2.
Data = 2, 4, 5, 5, 8, 8, 9
Q1: 4 - Lower quartile
Q2: 5 - Middle quartile
Q3: 8 - Upper quartile
Inter Quartile Range = Q3 - Q1 = 8 - 4 = 4

What is an Outlier?
An outlier is a value that lies far away from the common pattern of the data.

How to find outliers?
An outlier is usually a value more than 1.5 * IQR below Q1 or above Q3.
1.5 * IQR = 1.5 * 4 = 6
So values below Q1 - 6 = -2 or above Q3 + 6 = 14 are treated as outliers.

Population size: Population size is the total number of values in the data.

How to find quartiles of an even-length data set?
Example: Data = 8, 5, 2, 4, 8, 9, 5, 7
Step 1: First of all arrange the values in order.
After ordering the values: Data = 2, 4, 5, 5, 7, 8, 8, 9
Step 2: To divide this data into four equal parts, we need three quartiles.
Q1: Lower quartile
Q2: Median of the data set
Q3: Upper quartile
Step 3: Find the median of the data set and label it as Q2.
Data = 2, 4 | 5, 5 | 7, 8 | 8, 9
Minimum: 2
Q1: (4 + 5) / 2 = 4.5 - Lower quartile
Q2: (5 + 7) / 2 = 6 - Middle quartile
Q3: (8 + 8) / 2 = 8 - Upper quartile
Maximum: 9
Inter Quartile Range = Q3 - Q1 = 8 - 4.5 = 3.5
1.5 * IQR = 1.5 * 3.5 = 5.25
So values below Q1 - 5.25 = -0.75 or above Q3 + 5.25 = 13.25 are treated as outliers.
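A short sketch of the quartile and outlier-fence computation in Python, on the even-length data set above. It uses the median-of-halves rule from this tutorial (not NumPy's default interpolation):

```python
# Quartiles by the median-of-halves rule, plus 1.5 * IQR outlier fences.
def median(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

data = sorted([8, 5, 2, 4, 8, 9, 5, 7])
half = len(data) // 2

q1 = median(data[:half])                  # lower half  -> 4.5
q2 = median(data)                         # 6
q3 = median(data[half + len(data) % 2:])  # upper half  -> 8
iqr = q3 - q1                             # 3.5

lower_fence = q1 - 1.5 * iqr              # -0.75
upper_fence = q3 + 1.5 * iqr              # 13.25
outliers = [v for v in data if v < lower_fence or v > upper_fence]
print(q1, q2, q3, iqr, outliers)          # 4.5 6 8 3.5 []
```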
BOX PLOT IN DATA MINING

Note: Please understand the tutorial on quartiles before moving to this topic.

What is a box plot?
A box plot is a way of plotting data as a box shape that represents the five-number summary: minimum value, Quartile 1 (Q1), Median, Quartile 3 (Q3) and maximum value.
- The box (rectangle) is drawn from Q1 to Q3, so the length of the box equals the inter-quartile range (IQR = Q3 - Q1).
- The median is marked by a line within the box.
- Whiskers are the lines drawn from the box out to the minimum and maximum values.

Draw the box plot for the odd-length data set.
Data = 2, 4, 5, 5, 8, 8, 9
First of all find the quartiles.
Q1: 4 - Lower quartile
Q2: 5 - Middle quartile
Q3: 8 - Upper quartile
Inter Quartile Range = Q3 - Q1 = 8 - 4 = 4
Population size: Population size is the total number of values in the data.

Draw the box plot for the even-length data set.
Data = 8, 5, 2, 4, 8, 9, 5, 7
First of all arrange the values in order.
2, 4 | 5, 5 | 7, 8 | 8, 9
Minimum: 2
Q1: (4 + 5) / 2 = 4.5 - Lower quartile
Q2: (5 + 7) / 2 = 6 - Middle quartile
Q3: (8 + 8) / 2 = 8 - Upper quartile
Maximum: 9

HOW TO CALCULATE VARIANCE AND STANDARD DEVIATION OF DATA IN DATA MINING

What are data variance and standard deviation?
The values in a data set are spread around the mean. Variance tells us how far away the values are from the mean. Standard deviation is the square root of the variance. A low standard deviation tells us that few values are far away from the mean; a high standard deviation tells us that many values are far away from the mean.

How to calculate the variance and standard deviation of data?

marks: 8, 10, 15, 20
Mean = (8 + 10 + 15 + 20) / 4 = 13.25
Squared deviations from the mean:
(8 - 13.25)^2 = 27.5625
(10 - 13.25)^2 = 10.5625
(15 - 13.25)^2 = 3.0625
(20 - 13.25)^2 = 45.5625
Sum of squared deviations = 86.75
Variance = 86.75 / (4 - 1) = 28.92 (sample variance)
Standard deviation = sqrt(28.92) = 5.38

DATA SKEWNESS IN DATA MINING

What is data skewness?
When most of the values lean to one side of the median, the data is called skewed. Data can have any of the following shapes:
1. Symmetric: mean, median and mode are at the same point.
2. Positively skewed: most of the values pile up on the left and the long tail stretches to the right (mean > median).
3. Negatively skewed: most of the values pile up on the right and the long tail stretches to the left (mean < median).
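These spread measures can be checked with Python's statistics module (variance and stdev use the sample, n - 1, definition, matching the calculation above):

```python
import statistics

marks = [8, 10, 15, 20]

mean = statistics.mean(marks)          # 13.25
variance = statistics.variance(marks)  # sample variance: 86.75 / 3 = 28.92
stdev = statistics.stdev(marks)        # sqrt(variance) = 5.38

# A quick skewness hint: mean > median suggests a positive skew.
median = statistics.median(marks)      # 12.5
print(mean, round(variance, 2), round(stdev, 2), mean > median)
```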
ATTRIBUTES TYPES IN DATA MINING

What are attributes?
An attribute is a property of an object. Attributes represent different features of the object.
Example: Here RollNo, Name and Result are attributes of the object student.

RollNo   Name    Result
1        Ali     Pass
2        Akram   Fail

Types of attributes:
- Nominal
- Binary (symmetric or asymmetric)
- Ordinal
- Numeric
  - Interval-scaled
  - Ratio-scaled
- Discrete
- Continuous

Nominal data:
Nominal data is in alphabetical form, not in integers.
Example:
Attribute          Value
Categorical data   Lecturer, Assistant Professor, Professor
States             New, Pending, Working, Complete, Finish
Colors             Black, Brown, White, Red

Binary data:
Binary data has only two values/states.
Example:
Attribute      Value
HIV detected   Yes, No
Result         Pass, Fail

A binary attribute is of two types:
1. Symmetric binary
2. Asymmetric binary

Symmetric data: both values are equally important.
Example:
Attribute   Value
Gender      Male, Female

Asymmetric data: both values are not equally important.
Example:
Attribute      Value
HIV detected   Yes, No
Result         Pass, Fail

Ordinal data:
All values have a meaningful order.
Example:
Attribute              Value
Grade                  A, B, C, D, F
BPS - Basic pay scale  16, 17, 18

Discrete data:
Discrete data has a finite number of values. It can be in numerical form and can also be in categorical form.
Example:
Attribute     Value
Profession    Teacher, Businessman, Peon, etc.
Postal Code   42200, 42300, etc.

Continuous data:
Continuous data technically has an infinite number of steps, and is of float type; there can be many numbers between 1 and 2.
Example:
Attribute   Value
Height      5.4..., 6.5..., etc.
Weight      50.09..., etc.

PROXIMITY MEASURE FOR NOMINAL ATTRIBUTES IN DATA MINING

How to calculate a proximity measure for nominal attributes?

RollNo   Marks   Grade
1        90      A
2        80      B
3        82      B
4        90      A

Pairs for distance measurement:
d(RollNo1, RollNo1)  d(RollNo2, RollNo1)  d(RollNo3, RollNo1)  d(RollNo4, RollNo1)
d(RollNo1, RollNo2)  d(RollNo2, RollNo2)  d(RollNo3, RollNo2)  d(RollNo4, RollNo2)
d(RollNo1, RollNo3)  d(RollNo2, RollNo3)  d(RollNo3, RollNo3)  d(RollNo4, RollNo3)
d(RollNo1, RollNo4)  d(RollNo2, RollNo4)  d(RollNo3, RollNo4)  d(RollNo4, RollNo4)

Formula:
distance(object1, object2) = (P - M) / P
P is the total number of attributes; M is the total number of matches.

So in our case we have four objects: RollNo1, RollNo2, RollNo3, RollNo4, and P = 2 (Marks and Grade).
d(1,1) = (2 - 2) / 2 = 0
d(2,1) = (2 - 0) / 2 = 1
d(2,2) = (2 - 2) / 2 = 0
d(3,1) = (2 - 0) / 2 = 1
d(3,2) = (2 - 1) / 2 = 0.5
d(3,3) = (2 - 2) / 2 = 0
d(4,1) = (2 - 2) / 2 = 0
d(4,2) = (2 - 0) / 2 = 1
d(4,3) = (2 - 0) / 2 = 1
d(4,4) = (2 - 2) / 2 = 0

Dissimilarity matrix:
     1    2    3    4
1    0
2    1    0
3    1    0.5  0
4    0    1    1    0

DISTANCE MEASURE FOR ASYMMETRIC BINARY ATTRIBUTES IN DATA MINING

How to calculate a proximity measure for asymmetric binary attributes?
OR
How to measure the distance of asymmetric binary variables?

Contingency table for binary data (consider 1 for positive/True and 0 for negative/False):

                              Object 2
                     1 / True    0 / False    Sum
Object 1  1 / True   A           B            A + B
          0 / False  C           D            C + D
          Sum        A + C       B + D

For asymmetric binary attributes, the negative matches D are ignored:
d(object1, object2) = (B + C) / (A + B + C)

Name    Fever      Cough   Test 1     Test 2     Test 3     Test 4
Asad    Negative   Yes     Negative   Positive   Negative   Negative
Bilal   Negative   Yes     Negative   Positive   Positive   Negative
Tahir   Positive   Yes     Negative   Negative   Negative   Negative

d(Asad, Bilal): A = 2 (Cough, Test 2), B = 0, C = 1 (Test 3), so d = (0 + 1) / (2 + 0 + 1) = 0.33
d(Asad, Tahir): A = 1 (Cough), B = 1 (Test 2), C = 1 (Fever), so d = (1 + 1) / (1 + 1 + 1) = 0.67
d(Bilal, Tahir): A = 1 (Cough), B = 2 (Test 2, Test 3), C = 1 (Fever), so d = (2 + 1) / (1 + 2 + 1) = 0.75

DISTANCE MEASURE FOR SYMMETRIC BINARY VARIABLES

How to calculate a proximity measure for symmetric binary attributes?
OR
How to measure the distance/dissimilarity of symmetric binary variables?

For symmetric binary attributes, all four contingency counts matter:
d(object1, object2) = (B + C) / (A + B + C + D)

Name    Gender   Job_Status
Akram   Male     Regular
Ali     Male     Contract

Consider 1 for positive/True and 0 for negative/False. Here we are considering Male and Regular as positive, and Female and Contract as negative.
A = 1, because Akram and Ali are both positive: both are male, and male is positive.
B = 1, because Akram is positive and Ali is negative: Akram is regular (positive) and Ali is on contract (negative).
C = 0, because there is no attribute where Akram is negative and Ali is positive; Akram is male and regular, and both are positive.
D = 0, because there is no attribute where both are negative.
d(Akram, Ali) = (1 + 0) / (1 + 1 + 0 + 0) = 0.5
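Both binary dissimilarity formulas fit in a compact Python sketch, applied to the Asad/Bilal/Tahir records above (1 = positive/yes, 0 = negative/no):

```python
# Binary dissimilarity from the contingency counts a, b, c, d.
def binary_distance(x, y, asymmetric=True):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    if asymmetric:
        return (b + c) / (a + b + c)      # negative matches d are ignored
    return (b + c) / (a + b + c + d)      # symmetric: all attributes count

#        Fever, Cough, Test1, Test2, Test3, Test4
asad  = [0, 1, 0, 1, 0, 0]
bilal = [0, 1, 0, 1, 1, 0]
tahir = [1, 1, 0, 0, 0, 0]

print(round(binary_distance(asad, bilal), 2))   # 0.33
print(round(binary_distance(asad, tahir), 2))   # 0.67
print(round(binary_distance(bilal, tahir), 2))  # 0.75
```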
EUCLIDEAN DISTANCE IN DATA MINING

What is Euclidean distance?
Euclidean distance is a technique used to find the distance/dissimilarity among objects.

Example:
          Age   Marks
Sameed    10    90
Shahzeb   6     95

Formula:
Euclidean distance (p, q) = sqrt((X1 - X2)^2 + (Y1 - Y2)^2)

Euclidean distance (Sameed, Sameed) = sqrt((10 - 10)^2 + (90 - 90)^2) = 0
Note that (90 - 95) = -5, and when we take the square of a negative number the result is positive. For example, (-5)^2 = 25.
Euclidean distance (Sameed, Shahzeb) = sqrt((10 - 6)^2 + (90 - 95)^2) = sqrt(16 + 25) = 6.40312
Euclidean distance (Shahzeb, Sameed) = sqrt((10 - 6)^2 + (90 - 95)^2) = 6.40312

The Euclidean distance matrix is given below:
          Sameed    Shahzeb
Sameed    0         6.40312
Shahzeb   6.40312   0

JACCARD COEFFICIENT SIMILARITY MEASURE FOR ASYMMETRIC BINARY VARIABLES

How to calculate the similarity of asymmetric binary variables using the Jaccard coefficient?
The Jaccard coefficient is used to calculate the similarity among asymmetric binary attributes. With the contingency counts A, B, C, D defined as before (consider 1 for positive/True and 0 for negative/False):

Jaccard similarity = A / (A + B + C)

This is 1 minus the asymmetric binary dissimilarity, so for the same patient records:

Name    Fever      Cough   Test 1     Test 2     Test 3     Test 4
Asad    Negative   Yes     Negative   Positive   Negative   Negative
Bilal   Negative   Yes     Negative   Positive   Positive   Negative
Tahir   Positive   Yes     Negative   Negative   Negative   Negative

sim(Asad, Bilal) = 2 / (2 + 0 + 1) = 0.67
sim(Asad, Tahir) = 1 / (1 + 1 + 1) = 0.33
sim(Bilal, Tahir) = 1 / (1 + 2 + 1) = 0.25

COSINE SIMILARITY IN DATA MINING

What is Cosine similarity?
Cosine similarity is a measure to find the similarity between two files/documents.

Example of cosine similarity:
What is the similarity between two files, file 1 and file 2?

Formula:
cos(file 1, file 2) = (file 1 · file 2) / (||file 1|| ||file 2||)

file 1 = (0, 3, 0, 0, 2, 0, 0, 2, 0, 5)
file 2 = (1, 2, 0, 0, 1, 1, 0, 1, 0, 3)

file 1 · file 2 = 0*1 + 3*2 + 0*0 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 5*3 = 25
||file 1|| = (0*0 + 3*3 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 5*5)^0.5 = (42)^0.5 = 6.481
||file 2|| = (1*1 + 2*2 + 0*0 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 3*3)^0.5 = (17)^0.5 = 4.12
cos(file 1, file 2) = 25 / (6.481 * 4.12) = 0.94
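The same cosine computation in Python, using the two term-frequency vectors from the example:

```python
import math

# Cosine similarity: dot(v1, v2) / (|v1| * |v2|)
def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

file1 = [0, 3, 0, 0, 2, 0, 0, 2, 0, 5]
file2 = [1, 2, 0, 0, 1, 1, 0, 1, 0, 3]

print(round(cosine_similarity(file1, file2), 2))  # 0.94
```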
MAJOR TASKS OF DATA PRE-PROCESSING

Major tasks of data pre-processing:
1. Data Cleaning
   - Data cleaning is the process of cleaning the data so that it can easily be integrated.
2. Data Integration
   - Data integration is the process of integrating/combining all the data.
3. Data Reduction
   - Data reduction is the process of reducing large data into a smaller representation in such a way that it can easily be transformed further.
4. Data Transformation
   - Data transformation is the process of transforming the data into a reliable shape.
5. Data Discretization
After the completion of these tasks, the data is ready for mining.

DATA CLEANING

Data cleaning is a process to clean dirty data. Data is mostly not clean: it can be incorrect for a large number of reasons, such as hardware error/failure, network error or human error. So it is compulsory to clean the data before mining.

Dirty data examples:
- Incomplete data: salary = " "
- Inconsistent data: Age = "5 years", Birthday = "06/06/1990", Current Year = "2017"
- Noisy data: Salary = "-5000", Name = "123"
- Intentional error: sometimes applications assign a default value to an attribute, e.g. some applications set gender = "male" by default.

How to handle incomplete/missing data?
- Ignore the tuple
- Fill in the missing value manually
- Fill in the value automatically by:
  - using the attribute mean
  - using a constant value, if a suitable constant exists
  - using the most probable value, obtained by a Bayesian formula or a decision tree

How to handle noisy data?
- Binning
- Regression
- Clustering
- Combined computer and human inspection

What is Binning?
Binning is a technique in which we first sort the data and then partition it into equal-frequency bins.
Bin 1: 2, 3, 6, 8
Bin 2: 14, 16, 18, 24
Bin 3: 26, 28, 30, 32

Types of binning:
There are many types of binning. Some of them are as follows:
1. Smoothing by bin means
   Bin 1: 4.75, 4.75, 4.75, 4.75
   Bin 2: 18, 18, 18, 18
   Bin 3: 29, 29, 29, 29
2. Smoothing by bin medians
3. Smoothing by bin boundaries, etc.

Z-SCORE NORMALIZATION OF DATA

What is Z-Score?
The Z-score helps in normalizing the data: each value is replaced by (value - mean) / standard deviation.

How to calculate the Z-score of the following data?
marks: 8, 10, 15, 20
Mean = 13.25
Standard deviation = 5.38

marks   marks after z-score normalization
8       (8 - 13.25) / 5.38 = -0.98
10      (10 - 13.25) / 5.38 = -0.60
15      (15 - 13.25) / 5.38 = 0.33
20      (20 - 13.25) / 5.38 = 1.26

MIN MAX NORMALIZATION OF DATA IN DATA MINING

What is Min-Max normalization?
Min-Max is a technique that helps in normalizing the data. It scales the data between 0 and 1 (or, in general, between newMin and newMax):

v' = ((v - Min) / (Max - Min)) * (newMax - newMin) + newMin

How to normalize the data through the min-max normalization technique?
marks: 8, 10, 15, 20
Min: the minimum value of the given attribute. Here Min = 8.
Max: the maximum value of the given attribute. Here Max = 20.
V: the respective value of the attribute. For example, here V1 = 8, V2 = 10, V3 = 15 and V4 = 20.
newMax: 1
newMin: 0

marks   marks after Min-Max normalization
8       (8 - 8) / 12 = 0
10      (10 - 8) / 12 = 0.17
15      (15 - 8) / 12 = 0.58
20      (20 - 8) / 12 = 1

MIN MAX SCALING IN DATA MINING

Example 2:
The min-max normalization details are available in the previous tutorial; this is just another example for practice.
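A minimal sketch of both normalizations on the same marks column (sample standard deviation, as in the hand calculation; small rounding differences are expected):

```python
import statistics

# Z-score and min-max normalization of the same marks column.
marks = [8, 10, 15, 20]

mean = statistics.mean(marks)        # 13.25
stdev = statistics.stdev(marks)      # 5.38 (sample standard deviation)
z_scores = [round((v - mean) / stdev, 2) for v in marks]

lo, hi = min(marks), max(marks)      # 8, 20
min_max = [round((v - lo) / (hi - lo), 2) for v in marks]

print(z_scores)  # [-0.98, -0.6, 0.33, 1.26]
print(min_max)   # [0.0, 0.17, 0.58, 1.0]
```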
NORMALIZATION WITH DECIMAL SCALING IN DATA MINING

What is decimal scaling?
Decimal scaling is a data normalization technique in which we move the decimal point of the values of the attribute. How far the decimal point is moved depends entirely on the maximum value among all values of the attribute.

Formula:
A value v of attribute A can be normalized as v' = v / 10^j, where j is the smallest integer such that the maximum normalized value is less than 1.

Example 1:
CGPA   Formula   CGPA normalized after decimal scaling
2      2 / 10    0.2
3      3 / 10    0.3

We check the maximum value of our attribute CGPA. Here the maximum value is 3. We count the digits in the maximum value and divide by 1 followed by that many zeros. Here 3 has only one digit, so we divide by 10.

Example 2:
Salary bonus   Formula      Salary bonus normalized after decimal scaling
400            400 / 1000   0.4
310            310 / 1000   0.31

We check the maximum value of the attribute "salary bonus". Here the maximum value is 400, which has three digits, so we divide by 1 followed by three zeros, i.e. 1000.

Example 3:
Salary   Formula           Salary normalized after decimal scaling
40,000   40,000 / 100000   0.4
31,000   31,000 / 100000   0.31

STANDARD DEVIATION NORMALIZATION OF DATA IN DATA MINING

Data normalization with the help of standard deviation:
Data comes as attribute tuples, and it can be normalized using the standard deviation.

Example:
Age: 22, 40

First step: Calculate the mean of the data.
Mean = (22 + 40) / 2 = 31

Second step: Subtract the mean from all the values and square the results.
(22 - 31)^2 = 81
(40 - 31)^2 = 81

Third step: Find the deviation.
Deviation = sqrt((81 + 81) / 2) = sqrt(81) = 9

Fourth step: Normalize the attribute values as (x - Mean) / Deviation.
For 22: (22 - 31) / 9 = -1
For 40: (40 - 31) / 9 = 1

Age before normalization   Age after normalization
22                         -1
40                         1

DATA DISCRETIZATION IN DATA MINING

What is data discretization?
Data discretization converts a large number of data values into a smaller number of categories, so that data evaluation and data management become very easy.

Example: we have an attribute age with the following values.

Before discretization:
Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75

How to discretize:
10, 11, 13, 14, 17, 19  -> Young
30, 31, 32, 38, 40, 42  -> Mature
70, 72, 73, 75          -> Old

After discretization:
Age: Young, Mature, Old

What are some famous techniques of data discretization?
1. Histogram analysis
2. Binning
3. Correlation analysis
4. Clustering analysis
5. Decision tree analysis
6. Equal-width partitioning
7. Equal-depth partitioning

BINNING METHODS FOR DATA SMOOTHING IN DATA MINING

What is the binning method?
The binning method can be used for smoothing data.
Example: Sorted data for Age: 3, 7, 8, 13, 22, 22, 22, 26, 26, 28, 30, 37

How to smooth the data by equal-frequency bins?
Bin 1: 3, 7, 8, 13
Bin 2: 22, 22, 22, 26
Bin 3: 26, 28, 30, 37

How to smooth the data by bin means?
Bin 1: 7.75, 7.75, 7.75, 7.75
Bin 2: 23, 23, 23, 23
Bin 3: 30.25, 30.25, 30.25, 30.25

How to smooth the data by bin boundaries?
Each value is replaced by the closer of its bin's minimum and maximum (ties go to the lower boundary):
Bin 1: 3, 3, 3, 13
Bin 2: 22, 22, 22, 26
Bin 3: 26, 26, 26, 37
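A sketch of the binning smoothers above in Python (the tie rule, sending a value equally far from both boundaries to the lower one, is an assumption matching the worked bins):

```python
# Equal-frequency binning, then smoothing by bin means and bin boundaries.
ages = sorted([3, 7, 8, 13, 22, 22, 22, 26, 26, 28, 30, 37])
n_bins = 3
size = len(ages) // n_bins
bins = [ages[i * size:(i + 1) * size] for i in range(n_bins)]

by_means = [[sum(b) / len(b)] * len(b) for b in bins]

def to_boundary(value, bin_values):
    lo, hi = bin_values[0], bin_values[-1]
    return lo if value - lo <= hi - value else hi  # ties go to the lower boundary

by_boundaries = [[to_boundary(v, b) for v in b] for b in bins]

print(bins)           # [[3, 7, 8, 13], [22, 22, 22, 26], [26, 28, 30, 37]]
print(by_means)       # [[7.75, ...], [23.0, ...], [30.25, ...]]
print(by_boundaries)  # [[3, 3, 3, 13], [22, 22, 22, 26], [26, 26, 26, 37]]
```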
CORRELATION ANALYSIS OF NOMINAL DATA

Correlation analysis of nominal data: the Chi-Square test
This analysis can be done with the chi-square test. The chi-square test is the test used to analyse the correlation of nominal data.

Correlation VS Causality:
Correlation does not always tell us about causality.
Example: The number of students passing an exam and the number of car thefts in a country may be correlated with each other, but that does not mean the number of students passing affects car thefts. In some cases there may be causality: the number of students passing an exam and the number of students who live near the university are correlated, and living near the university may actually be a cause of the students' results.

Observed and expected frequencies (expected = row total * column total / grand total):

                           Passed student             Not passed student          Sum
Live near university       Observed = 140             Observed = 190              330
                           Expected = 180 * 330/1320  Expected = 1140 * 330/1320
                                    = 45                       = 285
Not live near university   Observed = 40              Observed = 950              990
                           Expected = 180 * 990/1320  Expected = 1140 * 990/1320
                                    = 135                      = 855
Sum                        140 + 40 = 180             190 + 950 = 1140            1320

Chi-square = sum of (Observed - Expected)^2 / Expected
           = (140 - 45)^2/45 + (190 - 285)^2/285 + (40 - 135)^2/135 + (950 - 855)^2/855
           = 200.56 + 31.67 + 66.85 + 10.56
           = 309.63

Degrees of freedom:
DF = (r - 1) * (c - 1) = (2 - 1) * (2 - 1) = 1

Level of significance: common choices are .01, .05 and .10. Since 309.63 is far above the critical chi-square value for 1 degree of freedom (3.84 at the .05 level), the two attributes are strongly correlated.

CORRELATION ANALYSIS FOR NUMERICAL DATA

Correlation analysis for numerical data:

A   B
3   1
4   6
1   2

Step 1: Find all the initial values.

A   B   AB   A^2 = C   B^2 = D
3   1   3    9         1
4   6   24   16        36
1   2   2    1         4

The total number of values (n) is 3. The other values we need are:
ΣA = 3 + 4 + 1 = 8
ΣB = 1 + 6 + 2 = 9
ΣAB = 3 + 24 + 2 = 29
ΣC = 9 + 16 + 1 = 26
ΣD = 1 + 36 + 4 = 41

Step 2: Put the values into the formula.
r = (nΣAB - (ΣA)(ΣB)) / sqrt((nΣC - (ΣA)^2)(nΣD - (ΣB)^2))
r = (3(29) - (8)(9)) / sqrt((3(26) - 8^2)(3(41) - 9^2))
r = (87 - 72) / sqrt((78 - 64)(123 - 81))
r = 15 / sqrt(14 * 42)
r = 15 / sqrt(588)
r = 15 / 24.24
r = 0.62
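Both correlation computations above can be reproduced directly from their formulas. A small Python sketch using only the numbers from the two examples:

```python
import math

# Chi-square statistic for the 2x2 pass / live-near table above.
observed = [[140, 190], [40, 950]]
row_totals = [sum(row) for row in observed]        # [330, 990]
col_totals = [sum(col) for col in zip(*observed)]  # [180, 1140]
grand = sum(row_totals)                            # 1320

chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
    / (row_totals[i] * col_totals[j] / grand)
    for i in range(2) for j in range(2)
)
print(round(chi_square, 2))  # 309.63

# Pearson correlation coefficient for the A / B columns above.
a = [3, 4, 1]
b = [1, 6, 2]
n = len(a)
num = n * sum(x * y for x, y in zip(a, b)) - sum(a) * sum(b)
den = math.sqrt((n * sum(x * x for x in a) - sum(a) ** 2)
                * (n * sum(y * y for y in b) - sum(b) ** 2))
print(round(num / den, 2))   # 0.62
```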
FREQUENT PATTERN MINING IN DATA MINING

Frequent pattern:
A frequent pattern is a pattern that occurs again and again (frequently) in a data set. A pattern can be a set of items, a substructure, a subsequence, etc.

Closed frequent itemset: a frequent itemset that has no superset with the same support.
Max frequent itemset: a frequent itemset that has no frequent superset.

Example:
It is compulsory to set a min_support that defines which itemsets are frequent: an itemset is frequent if its support is greater than or equal to min_support.
Suppose the minimum support is 1 and there are two transactions, T1 and T2:
T1 = {A1, A2, A3, ..., A19, A20}
T2 = {A1, A2, A3, ..., A9, A10}

Set of closed frequent itemsets / closed patterns:
- {A1, A2, A3, ..., A19, A20}, with support 1 (it occurs only in T1)
- {A1, A2, A3, ..., A9, A10}, with support 2 (it occurs in both T1 and T2)

Set of max frequent itemsets / max patterns:
- {A1, A2, A3, ..., A19, A20} is the max pattern (no superset of it is frequent).

APRIORI ALGORITHM IN DATA MINING

Apriori Algorithm:
Apriori helps in mining the frequent itemsets.

Example 1: Minimum support = 2
Step 1: Start from the transaction data in the database.
Step 2: Calculate the support/frequency of all items.
Step 3: Discard the items with support less than 2.
Step 4: Combine the surviving items two by two.
Step 5: Calculate the support/frequency of all the 2-itemsets.
Step 6: Discard the 2-itemsets with support less than 2.
Step 6.5: Combine items three by three and calculate their support.
Step 7: Discard the 3-itemsets with support less than 2.
Result: Only one itemset is frequent (Eggs, Tea, Cold Drink), because this itemset has support 2.

Example 2: Minimum support = 3
The same steps are repeated, discarding itemsets with support less than 3 at each level.
Result: There is no frequent 3-itemset, because all 3-itemsets have support less than 3.

APRIORI PRINCIPLES IN DATA MINING

Apriori principles:
1. Downward closure property of frequent patterns:
   All subsets of any frequent itemset must also be frequent.
   Example: If {Tea, Biscuit, Coffee} is a frequent itemset, then all of the following itemsets are frequent:
   {Tea}, {Biscuit}, {Coffee}, {Tea, Biscuit}, {Tea, Coffee}, {Biscuit, Coffee}
2. Apriori pruning principle:
   If an itemset is infrequent, its supersets should not be generated when searching for frequent itemsets.
   Example: If {Tea, Biscuit} is a frequent itemset and {Coffee} is not frequent, then {Tea}, {Biscuit} and {Tea, Biscuit} are frequent, and no superset containing Coffee needs to be generated.

APRIORI CANDIDATES GENERATION IN DATA MINING

Apriori candidates generation:
Candidates can be generated by the following two steps; a sketch of both steps in code follows this section.

Step 1: Self-joining
Example: Suppose the frequent 3-itemsets are L3 = {VWX, VWY, VXY, VXZ, WXY}.
Self-joining L3 * L3 gives:
- VWXY, from VWX and VWY
- VXYZ, from VXY and VXZ
So the candidate 4-itemsets are VWXY and VXYZ.

Step 2: Apriori pruning principle
According to the Apriori pruning principle, VXYZ is removed because its subset XYZ (and likewise VYZ) is not in L3.
So the remaining frequent candidate is VWXY.
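A minimal sketch of candidate generation in Python. For simplicity it joins by taking unions of size k + 1 rather than the usual sorted-prefix join, which produces the same candidates after pruning on this example:

```python
from itertools import combinations

# Apriori candidate generation: self-join the frequent k-itemsets,
# then prune candidates that have an infrequent k-subset.
def generate_candidates(frequent_k):
    frequent_k = {frozenset(s) for s in frequent_k}
    k = len(next(iter(frequent_k)))
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1:
                candidates.add(union)
    # Pruning: every k-subset of a candidate must itself be frequent.
    return [c for c in candidates
            if all(frozenset(sub) in frequent_k for sub in combinations(c, k))]

l3 = [{"V", "W", "X"}, {"V", "W", "Y"}, {"V", "X", "Y"},
      {"V", "X", "Z"}, {"W", "X", "Y"}]
print([sorted(c) for c in generate_candidates(l3)])  # [['V', 'W', 'X', 'Y']]
```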
KMEANS CLUSTERING IN DATA MINING

What is clustering?
Clustering is the process of partitioning a group of data into small partitions, or clusters, on the basis of similarity and dissimilarity.

What is K-Means clustering?
K-Means clustering is a clustering method in which we move every data item (attribute value) to the cluster whose centroid it is nearest to.

How does it work?
Step 1: Pick the centroids randomly. It is better to take boundary and middle values as centroids.
Step 2: Assign each item (value of the attribute) to the cluster of its nearest centroid.
Step 3: Recompute the centroids and repeat the whole process. Every time we repeat, the total sum of the error changes. When the error stops changing, finalize the clusters and their items.

In the tables below, D1, D2 and D3 are the distances of each Age value to centroids 1, 2 and 3, and Error is the squared distance to the assigned centroid.

Iteration 1 (centroids 23, 65, 43):
id   Age   D1   D2   D3   Cluster   Error
1    23    0    42   20   1         0
2    33    10   32   10   1         100
3    28    5    37   15   1         25
4    23    0    42   20   1         0
5    65    42   0    22   2         0
6    67    44   2    24   2         4
7    64    41   1    21   2         1
8    73    50   8    30   2         64
9    68    45   3    25   2         9
10   43    20   22   0    3         0
11   34    11   31   9    3         81
12   43    20   22   0    3         0
13   52    29   13   9    3         81
14   49    26   16   6    3         36
Total error: 401

Iteration 2 (centroids 26.75, 67.4, 44.2):
id   Age   D1      D2     D3     Cluster   Error
1    23    3.75    44.4   21.2   1         14.0625
2    33    6.25    34.4   11.2   1         39.0625
3    28    1.25    39.4   16.2   1         1.5625
4    23    3.75    44.4   21.2   1         14.0625
5    34    7.25    33.4   10.2   1         52.5625
6    65    38.25   2.4    20.8   2         5.76
7    67    40.25   0.4    22.8   2         0.16
8    64    37.25   3.4    19.8   2         11.56
9    73    46.25   5.6    28.8   2         31.36
10   68    41.25   0.6    23.8   2         0.36
11   43    16.25   24.4   1.2    3         1.44
12   43    16.25   24.4   1.2    3         1.44
13   52    25.25   15.4   7.8    3         60.84
14   49    22.25   18.4   4.8    3         23.04
Total error: 257.2725

Iteration 3 (centroids 28.2, 67.4, 46.75):
id   Age   D1     D2     D3      Cluster   Error
1    23    5.2    44.4   23.75   1         27.04
2    33    4.8    34.4   13.75   1         23.04
3    28    0.2    39.4   18.75   1         0.04
4    23    5.2    44.4   23.75   1         27.04
5    34    5.8    33.4   12.75   1         33.64
6    65    36.8   2.4    18.25   2         5.76
7    67    38.8   0.4    20.25   2         0.16
8    64    35.8   3.4    17.25   2         11.56
9    73    44.8   5.6    26.25   2         31.36
10   68    39.8   0.6    21.25   2         0.36
11   43    14.8   24.4   3.75    3         14.0625
12   43    14.8   24.4   3.75    3         14.0625
13   52    23.8   15.4   5.25    3         27.5625
14   49    20.8   18.4   2.25    3         5.0625
Total error: 220.75

Iteration 4 (centroids 28.2, 67.4, 46.75):
id   Age   D1     D2     D3      Cluster   Error
1    23    5.2    44.4   23.75   1         27.04
2    23    5.2    44.4   23.75   1         27.04
3    28    0.2    39.4   18.75   1         0.04
4    33    4.8    34.4   13.75   1         23.04
5    34    5.8    33.4   12.75   1         33.64
6    43    14.8   24.4   3.75    3         14.0625
7    43    14.8   24.4   3.75    3         14.0625
8    49    20.8   18.4   2.25    3         5.0625
9    52    23.8   15.4   5.25    3         27.5625
10   64    35.8   3.4    17.25   2         11.56
11   65    36.8   2.4    18.25   2         5.76
12   67    38.8   0.4    20.25   2         0.16
13   68    39.8   0.6    21.25   2         0.36
14   73    44.8   5.6    26.25   2         31.36
Total error: 220.75

Iteration stops: The iterations stop because the error is the same in iteration 3 and iteration 4. The error is now fixed at 220.75, so there is no need for further iterations. The clusters are final.

Shortcomings of K-Means clustering:
- It is sensitive to outliers.
- It is not very suitable for categorical or nominal data.

KMEANS CLUSTERING ON TWO ATTRIBUTES IN DATA MINING

How is K-Means clustering performed on two attributes?
Solution: For example, we have 2 attributes:
1. Paper 1
2. Paper 2
First of all we choose the centroid values randomly. Then we calculate the distance of each record from the centroids, now using both Paper 1 and Paper 2 (e.g. Euclidean distance), and find the error. We repeat this whole process until the error remains consistent over at least the last two iterations, or the clusters stop changing.
Figure: k-means clustering
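The whole single-attribute iteration trace above fits in a few lines of Python. A minimal 1-D k-means sketch, starting from the same centroids (ties in distance go to the lower-numbered cluster, as in the tables):

```python
# One-dimensional k-means on the Age column, starting from the same
# initial centroids (23, 65, 43) as the worked example above.
ages = [23, 33, 28, 23, 65, 67, 64, 73, 68, 43, 34, 43, 52, 49]
centroids = [23.0, 65.0, 43.0]

while True:
    # Assignment step: each age joins the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for age in ages:
        nearest = min(range(len(centroids)), key=lambda i: abs(age - centroids[i]))
        clusters[nearest].append(age)
    # Update step: move each centroid to the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # stop when nothing changes any more
        break
    centroids = new_centroids

sse = sum(min((age - c) ** 2 for c in centroids) for age in ages)
print(centroids, round(sse, 2))  # [28.2, 67.4, 46.75] 220.75
```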
DECISION TREE INDUCTION IN DATA MINING

Decision Tree Induction:
A decision tree is a tree-like structure consisting of the following parts (discussed in Figure 1):
1. Root node: age is the root node.
2. Branches: <20, 21...50, >50, USA, PK, High and Low are branches.
3. Leaf nodes: Yes and No are leaf nodes.

Entropy:
Entropy is a measure of uncertainty. It lies between 0 and 1. High entropy means the data has more variance; low entropy means the data has less variance.

P = total Yes = 9
N = total No = 5

Info(9, 5) = -9/14 * log2(9/14) - 5/14 * log2(5/14)
           = -0.643 * log2(0.643) - 0.357 * log2(0.357)
           = -0.643 * (-0.637) - 0.357 * (-1.485)
           = 0.940

Note: to calculate the log2 of a number we can use log(x) / log(2). For example, log2(0.643) = log(0.643) / log(2) = -0.637.

For Age:
age       Pi      Ni     Info(Pi, Ni)
<20       2 YES   3 NO   0.970
21...50   4 YES   0 NO   0
>50       3 YES   2 NO   0.970

Note: if Yes = 2 and No = 3 then the entropy is 0.970, and it is the same 0.970 if Yes = 3 and No = 2. So once we calculate the entropy for age < 20, there is no need to recalculate it for age > 50, because the Yes/No counts are just swapped.

Gain of Age: 0.248
Gain of Income: 0.029
Gain of Credit Rating: 0.048
Gain of Region: 0.151

0.248 is greater than the gains of Income, Credit Rating and Region, so Age is selected as the root node.

Notes:
- If the Yes/No counts are (0, any number) or (any number, 0), the entropy is always 0.
- If the counts are (3, 5) or (5, 3), both have the same entropy.
- Entropy measures the impurity or uncertainty of data. If a coin is fair (head and tail each have probability 1/2), uncertainty is at its maximum, because it is difficult to guess whether a head or a tail will occur. If the coin has heads on both sides, the probability is 1 and the uncertainty (entropy) is low.
- If p is equal to q there is more uncertainty; if p is not equal to q there is less uncertainty.

Now, within the age < 20 branch, again calculate the entropy for:
1. Income
2. Region
3. Credit Rating

For Income:
Income   Pi      Ni     Info(Pi, Ni)
High     0 YES   2 NO   0
Medium   1 YES   1 NO   1
Low      1 YES   0 NO   0

For Region:
Region   Pi      Ni     Info(Pi, Ni)
USA      0 YES   3 NO   0
PK       2 YES   0 NO   0

For Credit Rating:
Credit Rating   Pi      Ni     Info(Pi, Ni)
Low             1 YES   2 NO   0.918
High            1 YES   1 NO   1

Gain of Region: 0.970
Gain of Credit Rating: 0.02
Gain of Income: 0.57

0.970 is greater than the gains of Income and Credit Rating, so Region is selected as the next node under the age < 20 branch. Similarly, you can calculate the gains for all the remaining branches.

COMPUTING INFORMATION-GAIN FOR CONTINUOUS-VALUED ATTRIBUTES IN DATA MINING

For example, we have the following data mentioned below.

How can we calculate the split point?

Income   Class
18       YES
45       NO
18       NO
25       YES
28       YES
28       NO
34       NO

Solution:
Step 1: Sort the data in ascending order.

Income   Class
18       YES
18       NO
25       YES
28       YES
28       NO
34       NO
45       NO

Step 2: Find the midpoint of the first two distinct values and calculate the information gain.
Split point = (18 + 25) / 2 = 21.5

Info_income<=21.5(D) = 2/7 * I(1,1) + 5/7 * I(2,3)
= 2/7 * (-1/2 * log2(1/2) - 1/2 * log2(1/2)) + 5/7 * (-2/5 * log2(2/5) - 3/5 * log2(3/5))
= 0.98

WHICH ATTRIBUTE SELECTION MEASURE IS BEST IN DATA MINING

Attribute Selection Measures:
There are different attribute selection measures. Some of them are as follows:
1. Information gain: biased towards multi-valued attributes.
2. Gain ratio: tends to prefer unbalanced splits, in which one partition is much smaller than the other.
3. Gini index: biased towards multi-valued attributes; has difficulty when the number of classes is large; tends to favour tests that result in equal-sized partitions.

GINI INDEX FOR BINARY VARIABLES IN DATA MINING

What is the gini index?
The gini index is the most commonly used measure of inequality. It is also referred to as the gini ratio or gini coefficient.

Gini index for binary variables:

Student   inHostel   Target Class
Yes       True       Yes
Yes       True       Yes
Yes       False      No
False     False      Yes
False     True       No
False     True       No
False     False      No
True      False      Yes
False     True       No

Now we will calculate the gini gain of Student and inHostel.

Step 1:
Gini(X) = 1 - [(4/9)^2 + (5/9)^2] = 40/81

Step 2:
Gini(Student = False) = 1 - [(1/5)^2 + (4/5)^2] = 8/25
Gini(Student = True) = 1 - [(3/4)^2 + (1/4)^2] = 3/8
GiniGain(Student) = Gini(X) - [4/9 * Gini(Student = True) + 5/9 * Gini(Student = False)] = 0.149

Step 3:
Gini(inHostel = False) = 1 - [(2/4)^2 + (2/4)^2] = 1/2
Gini(inHostel = True) = 1 - [(2/5)^2 + (3/5)^2] = 12/25
GiniGain(inHostel) = Gini(X) - [5/9 * Gini(inHostel = True) + 4/9 * Gini(inHostel = False)] = 0.005

Results: The best split attribute is Student because it has the higher gini gain.
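The entropy and gini calculations above are easy to script. A small Python sketch using the class counts from the examples (small rounding differences from the hand calculation are expected):

```python
import math

# Entropy and gini index from class counts, matching the hand calculations.
def entropy(p, n):
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def gini(p, n):
    total = p + n
    return 1 - (p / total) ** 2 - (n / total) ** 2

print(round(entropy(9, 5), 3))  # 0.940 -> Info(9, 5) at the root
print(round(entropy(2, 3), 3))  # 0.970 -> the age < 20 branch
print(round(gini(4, 5), 3))     # 0.494 -> Gini(X) = 40/81

# Information gain of Age = Info(9, 5) - weighted Info of its branches.
gain_age = entropy(9, 5) - (5/14 * entropy(2, 3) + 4/14 * entropy(4, 0)
                            + 5/14 * entropy(3, 2))
print(round(gain_age, 3))       # 0.247
```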
NAIVE BAYES CLASSIFIER TUTORIAL IN DATA MINING

Naive Bayes classifier:

Step 1: Calculate P(Ci).
P(buys_computer = "no") = 5/14 = 0.357
P(buys_computer = "yes") = 9/14 = 0.643

Step 2: Calculate P(X|Ci) for all classes.
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667

Step 3: Select the scenario against which you want to classify.
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Step 4: Calculate P(X|Ci).
P(X | buys_computer = "no") = 0.6 * 0.4 * 0.2 * 0.4 = 0.019
P(X | buys_computer = "yes") = 0.222 * 0.444 * 0.667 * 0.667 = 0.044

Step 5: Calculate P(X|Ci) * P(Ci).
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.019 * 0.357 = 0.007
P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 * 0.643 = 0.028

Therefore, X belongs to the class buys_computer = "yes".

BOOSTING IN DATA MINING

What is Boosting?
Boosting is an efficient algorithm that can convert a weak learner into a strong learner.

Example: Suppose we want to check whether an email is a "spam email" or a "safe email". There can be multiple rules, e.g.:
- Rule 1: The email contains only links to some website.
  Decision: It is spam.
- Rule 2: The email is from an official email address, e.g. t4tutorialsfree@gmail.com.
  Decision: It is not spam.
- Rule 3: The email contains a request for private bank details, e.g. bank account number and father/mother name, etc.
  Decision: It is spam.

Now the question is: are the 3 rules discussed above enough to classify an email as "spam" or not?
Answer: These 3 rules are not enough; they are weak learners, so we need to boost them. We can boost the weak learners into a stronger learner by combining them and assigning a weight to every weak learner. Boosting gives greater accuracy as compared to bagging.

Types of boosting algorithms:
Three main types of boosting algorithm are as follows:
1. XGBoost algorithm
2. AdaBoost algorithm
3. Gradient tree boosting algorithm

RAINFOREST ALGORITHM IN DATA MINING

What is RainForest?
RainForest is a framework especially designed to classify large data sets. RainForest builds AVC sets, which consist of the following parts:
1. Attribute
2. Value
3. Class_Label

Example:

Income   Rank        Buy_Mobile
75,000   Professor   yes
75,000   Professor   yes
50,000   Lecturer    no

After building the AVC sets, the tables look like this:

           Buy_Mobile
Income     Yes   No
75,000     2     0
50,000     0     1

            Buy_Mobile
Rank        Yes   No
Professor   2     0
Lecturer    0     1

AVC sets are built according to the amount of main memory available. This can be described by the following three cases:
1. The AVC-group of the root node fits in main memory.
2. Each individual AVC-set of the root node fits in main memory, but the AVC-group of the root node does not fit in main memory.
3. None of the individual AVC-sets of the root fit in main memory.

HOLDOUT METHOD FOR EVALUATING A CLASSIFIER IN DATA MINING

Holdout method:
All the data is randomly divided into separate data sets, e.g.:
1. Training set
2. Test set
3. Validation set

Training set: the subset of the data used to build the prediction model.
Test set: unseen data, used as a subset of the data to assess the performance of the model.
Validation set: also a data set, used to assess the performance of the model built during training.

For example, the data can be split into two sets:
Training set for model construction: 2/3
Test set for accuracy estimation: 1/3

EVALUATION OF A CLASSIFIER BY CONFUSION MATRIX IN DATA MINING

How to evaluate a classifier?
A classifier can be evaluated by building the confusion matrix. The confusion matrix shows the total number of correct and wrong predictions. The confusion matrix for the class labels positive (+VE) and negative (-VE) is shown below:

                                Actual +VE       Actual -VE
Predicted (Model) +VE           A = True +VE     B = False +VE
Predicted (Model) -VE           C = False -VE    D = True -VE

Accuracy: the proportion of the total number of predictions that are correct.
Accuracy = (A + D) / (A + B + C + D)

Error rate:
Error Rate = 1 - Accuracy

+VE predictions: the proportion of positive predictions that are correct.
+VE predictions = A / (A + B)

-VE predictions: the proportion of negative predictions that are correct.
-VE predictions = D / (C + D)

Precision: the correctness of the positive predictions, i.e. how many of the tuples predicted +VE really are +VE.
Precision = A / (A + B)

Recall (Sensitivity): the true +VE rate; the proportion of actual positive cases that are correctly identified.
Recall = A / (A + C)

F-Measure: the harmonic mean of recall and precision.
F-Measure = (2 * Precision * Recall) / (Precision + Recall)

Specificity: the true -VE rate; the proportion of actual -VE cases that are correctly identified.
Specificity = D / (B + D)
Note: the specificity of one class is the same as the sensitivity of the other class.
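A small sketch of all these metrics in Python. The four counts below are made-up, hypothetical numbers, purely to exercise the formulas:

```python
# Confusion-matrix metrics from the four counts A (TP), B (FP), C (FN), D (TN).
A, B, C, D = 85, 10, 15, 90  # hypothetical example counts

accuracy = (A + D) / (A + B + C + D)  # 0.875
error_rate = 1 - accuracy             # 0.125
precision = A / (A + B)               # 0.895
recall = A / (A + C)                  # 0.85  (sensitivity)
specificity = D / (B + D)             # 0.9
f_measure = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3),
      round(specificity, 3), round(f_measure, 3))
# 0.875 0.895 0.85 0.9 0.872
```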
OVERFITTING OF DECISION TREE AND TREE PRUNING IN DATA MINING

Overfitting of a tree:
Before discussing overfitting, let's revise test data and training data.
Training data: the data that is used to build the prediction model.
Test data: the data used to assess the predictive power of the model built from the training data.
Overfitting: overfitting means too many unnecessary branches in the tree. Overfitting results in different kinds of anomalies that are caused by outliers and noise.

How to avoid overfitting?
There are two techniques to avoid overfitting:
1. Pre-pruning
2. Post-pruning

1. Pre-pruning: stop growing the tree before it is fully grown.
2. Post-pruning: allow the tree to grow with no size limit, and after the tree is complete, start to prune it.

Advantages of pre-pruning and post-pruning:
- Pruning keeps the tree from growing unnecessarily.
- Pruning reduces the complexity of the tree.
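As a practical illustration, assuming scikit-learn is available (the dataset and parameter values here are arbitrary), pre-pruning corresponds to capping the tree while it grows, and post-pruning to growing it fully and then collapsing branches, e.g. via cost-complexity pruning:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=300, random_state=0)

# Pre-pruning: stop growth early with max_depth / min_samples_leaf.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow fully, then prune with cost-complexity parameter ccp_alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

full_tree = DecisionTreeClassifier().fit(X, y)
print(full_tree.tree_.node_count, pre_pruned.tree_.node_count,
      post_pruned.tree_.node_count)  # pruned trees have fewer nodes
```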