
Data Mining Tutorials: Statistics & Pre-processing

Contents
DATA MINING TUTORIALS
MEAN, MEDIAN, MODE IN DATA MINING
FINDING THE ESTIMATED MEAN, MEDIAN AND MODE FOR GROUPED DATA IN DATA MINING
WHAT ARE QUARTILES AND BOX PLOT IN DATA MINING
  How to find outliers?
BOX PLOT IN DATA MINING
  What is a box plot?
  Draw the box plot for the odd-length data set
HOW TO CALCULATE VARIANCE AND STANDARD DEVIATION OF DATA IN DATA MINING
  What is data variance and standard deviation?
  How to calculate variance and standard deviation of data?
DATA SKEWNESS IN DATA MINING
  What is data skewness?
ATTRIBUTE TYPES IN DATA MINING
  What is an attribute?
  Types of attributes: nominal, binary (symmetric, asymmetric), ordinal
PROXIMITY MEASURE FOR NOMINAL ATTRIBUTES IN DATA MINING
  How to calculate the proximity measure for nominal attributes?
DISTANCE MEASURE FOR ASYMMETRIC BINARY ATTRIBUTES IN DATA MINING
  How to measure the distance of asymmetric binary variables?
DISTANCE MEASURE FOR SYMMETRIC BINARY VARIABLES
  How to measure the distance/dissimilarity of symmetric binary variables?
EUCLIDEAN DISTANCE IN DATA MINING
  What is Euclidean distance?
JACCARD COEFFICIENT SIMILARITY MEASURE FOR ASYMMETRIC BINARY VARIABLES
  How to calculate the similarity of asymmetric binary variables using the Jaccard coefficient?
COSINE SIMILARITY IN DATA MINING
  What is cosine similarity?
MAJOR TASKS OF DATA PRE-PROCESSING
  1. Data Cleaning
  2. Data Integration
  3. Data Reduction
  4. Data Transformation
  5. Data Discretization
DATA CLEANING
  What is binning? Types of binning
Z-SCORE NORMALIZATION OF DATA
  What is a Z-score? How to calculate the Z-score of data
MIN MAX NORMALIZATION OF DATA IN DATA MINING
  What is min-max normalization? How to normalize data with the min-max technique
MIN MAX SCALING IN DATA MINING
NORMALIZATION WITH DECIMAL SCALING IN DATA MINING
  What is decimal scaling?
STANDARD DEVIATION NORMALIZATION OF DATA IN DATA MINING
DATA DISCRETIZATION IN DATA MINING
  What is data discretization? What are some famous techniques of data discretization?
BINNING METHODS FOR DATA SMOOTHING IN DATA MINING
  Smoothing by equal-frequency bins, by bin means, and by bin boundaries
CORRELATION ANALYSIS OF NOMINAL DATA
  Correlation vs. causality
CORRELATION ANALYSIS FOR NUMERICAL DATA
FREQUENT PATTERN MINING IN DATA MINING
  Frequent pattern, closed frequent itemset, max frequent itemset
APRIORI ALGORITHM IN DATA MINING
APRIORI PRINCIPLES IN DATA MINING
APRIORI CANDIDATES GENERATION IN DATA MINING
  Step 1: self-joining; Step 2: Apriori pruning principle
KMEANS CLUSTERING IN DATA MINING
  What is clustering? What is K-Means clustering? How it works
KMEANS CLUSTERING ON TWO ATTRIBUTES IN DATA MINING
DECISION TREE INDUCTION IN DATA MINING
COMPUTING INFORMATION-GAIN FOR CONTINUOUS-VALUED ATTRIBUTES IN DATA MINING
WHICH ATTRIBUTE SELECTION MEASURE IS BEST IN DATA MINING
GINI INDEX FOR BINARY VARIABLES IN DATA MINING
NAIVE BAYES CLASSIFIER TUTORIAL IN DATA MINING
BOOSTING IN DATA MINING
RAINFOREST ALGORITHM IN DATA MINING
HOLDOUT METHOD FOR EVALUATING A CLASSIFIER IN DATA MINING
  Training set, test set, validation set
EVALUATION OF A CLASSIFIER BY CONFUSION MATRIX IN DATA MINING
  Accuracy, error rate, precision, recall, F-measure, specificity
OVERFITTING OF DECISION TREE AND TREE PRUNING IN DATA MINING
DATA MINING
What is data mining?
Data mining is about extracting hidden, useful information from huge amounts of data. Data mining is the automated analysis of massive data sets; it is also called knowledge discovery from data.
What are alternative names for data mining?
• Knowledge discovery in databases
• Data/pattern analysis
• Knowledge extraction
• Data dredging
• Data archeology
• Business intelligence
• Information harvesting
What is not data mining?
• Expert systems (in artificial intelligence)
  o An expert system takes decisions based on the expertise encoded in its designed algorithms.
• Simple querying
  o A query takes its decision from the condition given in SQL. For example, the database query "SELECT * FROM table" just displays the information stored in the table; this is not hidden information. So it is a simple query, not data mining.
MEAN, MEDIAN, MODE IN DATA MINING
What is mean?
Mean is the average of the numbers.
Example:
3, 5, 6, 9, 8
Mean = sum of all values / total number of values
Mean = (3 + 5 + 6 + 9 + 8) / 5 = 31 / 5
Mean = 6.2
How to calculate the mean for data with frequencies?

Age   | Frequency | Age * Frequency
22    | 5         | 22 * 5 = 110
33    | 2         | 33 * 2 = 66
44    | 6         | 44 * 6 = 264
66    | 4         | 66 * 4 = 264
Total | 17        | 704

Mean = 704 / 17
Mean = 41.41
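A frequency-weighted mean like this is easy to check with a few lines of Python; the pairs below are taken from the table, and the variable names are just for illustration:

```python
# Mean of data given as (value, frequency) pairs: sum(value * freq) / sum(freq).
data = [(22, 5), (33, 2), (44, 6), (66, 4)]

total = sum(value * freq for value, freq in data)   # 110 + 66 + 264 + 264 = 704
count = sum(freq for _, freq in data)               # 5 + 2 + 6 + 4 = 17

mean = total / count
print(round(mean, 2))  # 41.41
```

Running the arithmetic in code is a handy way to catch the slips that are easy to make by hand.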
What is Median?
Median is the middle value among all values.
How to calculate median for odd number of values?
Example:
9, 8, 5, 6, 3
Arrange values in order
3, 5, 6, 8, 9
Median = 6
How to calculate median for even number of values?
Example:
9, 8, 5, 6, 3, 4
Arrange values in order
3, 4, 5, 6, 8, 9
Add the two middle values and calculate their mean.
Median = (5 + 6) / 2
Median = 5.5
What is Mode?
Mode is the most frequently occurring value.
How to calculate mode?
Example:
3, 6, 6, 8, 9
Mode = 6 (because 6 occurs two times while every other value occurs only once)
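All three measures above are available in Python's statistics module; a quick sketch using the example data sets from this section:

```python
# Mean, median and mode with Python's statistics module.
from statistics import mean, median, mode

print(mean([3, 5, 6, 9, 8]))        # 6.2
print(median([9, 8, 5, 6, 3]))      # 6   (median sorts the values itself)
print(median([9, 8, 5, 6, 3, 4]))   # 5.5 (mean of the two middle values)
print(mode([3, 6, 6, 8, 9]))        # 6   (most frequent value)
```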
FINDING THE ESTIMATED MEAN, MEDIAN AND MODE FOR GROUPED DATA IN DATA MINING
How to calculate the estimated mean and estimated median of grouped data?

Age     | Mid of age | Frequency | Mid * Frequency
21 – 25 | 23         | 5         | 23 * 5 = 115
26 – 30 | 28         | 2         | 28 * 2 = 56
31 – 35 | 33         | 6         | 33 * 6 = 198
35 – 40 | 37         | 8         | 37 * 8 = 296
Total   |            | 21        | 665

Estimated Mean = 665 / 21 = 31.66
Class intervals:
The groups 21 to 25, 26 to 30, 31 to 35 and 35 to 40 are class intervals.
The mean 31.66 rounds to 32.
Estimated mean ≈ 32
Median group = 31 to 35
(With 21 values, the middle value is the 11th; the first two groups hold 5 + 2 = 7 values, so the 11th value falls in the group 31 to 35.)
Estimated Median = L + ((TV / 2 – SBM) / FMG) * GW
L = lower boundary of the median group = 30.5
TV = total number of values = 21
SBM = sum of frequencies before the median group = 7
FMG = frequency of the median group = 6
GW = group width = 5
Estimated Median = 30.5 + ((10.5 – 7) / 6) * 5 = 30.5 + 2.9 = 33.4
Result: Our median group is 31 to 35, and yes, the estimated median 33.4 lies in the median group.
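The estimated-median formula can be wrapped in a small function. The group boundaries, frequencies and width below come from the table above; the function name is illustrative:

```python
# Estimated median of grouped data:
# median = L + ((n/2 - cum_freq_before) / freq_of_median_group) * width
def estimated_median(groups):
    """groups: list of (lower_boundary, frequency, width) tuples in order."""
    n = sum(freq for _, freq, _ in groups)
    cum = 0
    for lower, freq, width in groups:
        if cum + freq >= n / 2:          # the median falls inside this group
            return lower + ((n / 2 - cum) / freq) * width
        cum += freq

# Boundaries 20.5, 25.5, 30.5, 35.5 with frequencies 5, 2, 6, 8 and width 5.
groups = [(20.5, 5, 5), (25.5, 2, 5), (30.5, 6, 5), (35.5, 8, 5)]
print(round(estimated_median(groups), 1))  # 33.4
```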
How to calculate the estimated mode of the above grouped data?
Mode: The mode is the most frequently occurring value in the data. For grouped data, the modal group is the group with the highest frequency; here that is 35 to 40, with frequency 8.
Estimated Mode = L + ((FMG – FBMG) / ((FMG – FBMG) + (FMG – FAMG))) * GW
L = lower boundary of the modal group = 34.5
FMG = frequency of the modal group = 8
FBMG = frequency of the group before the modal group = 6
FAMG = frequency of the group after the modal group = 0 (there is no group after)
GW = group width = 5
Estimated Mode = 34.5 + ((8 – 6) / ((8 – 6) + (8 – 0))) * 5 = 34.5 + (2 / 10) * 5 = 35.5
WHAT ARE QUARTILES AND BOX PLOT IN DATA MINING
What is quartile?
Quartiles divide the data into four equal groups.
How to find quartiles of odd length data set?
Example:
Data = 8, 5, 2, 4, 8, 9, 5
Step 1:
First of all arrange the values in order
After ordering the values:
Data = 2, 4, 5, 5, 8, 8, 9
Step 2:
To divide this data into four equal parts, we need three quartiles.
Q1: Lower quartile
Q2: Median of the data set
Q3: Upper quartile
Step 3:
Find the median of the data set and label it as Q2.
Data = 2, 4, 5, 5, 8, 8, 9
Q1: 4 – Lower quartile
Q2: 5 – Middle quartile
Q3: 8 – Upper quartile
Inter Quartile Range = Q3 – Q1 = 8 – 4 = 4
What is an outlier?
An outlier is a value that lies far away from the common pattern of the data.
How to find outliers?
A value is usually treated as an outlier if it lies more than 1.5 * IQR below Q1 or above Q3.
1.5 * IQR = 1.5 * 4 = 6
Lower fence = Q1 – 6 = 4 – 6 = –2
Upper fence = Q3 + 6 = 8 + 6 = 14
So any value below –2 or above 14 would be an outlier; this data set has none.
Population size:
Population size is the total number of values in data.
How to find quartiles of even length data set?
Example:
Data = 8, 5, 2, 4, 8, 9, 5, 7
Step 1:
First of all arrange the values in order
After ordering the values:
Data = 2, 4, 5, 5, 7, 8, 8, 9
Step 2:
To divide this data into four equal parts, we need three quartiles.
Q1: Lower quartile
Q2: Median of the data set
Q3: Upper quartile
Step 3:
Find the median of the data set and label it as Q2.
Data = 2, 4 | 5, 5 | 7, 8 | 8, 9
Minimum: 2
Q1: (4 + 5) / 2 = 4.5 – Lower quartile
Q2: (5 + 7) / 2 = 6 – Middle quartile
Q3: (8 + 8) / 2 = 8 – Upper quartile
Maximum: 9
Inter Quartile Range = Q3 – Q1 = 8 – 4.5 = 3.5
A value is usually treated as an outlier if it lies more than 1.5 * IQR below Q1 or above Q3.
1.5 * IQR = 1.5 * 3.5 = 5.25
Lower fence = Q1 – 5.25 = 4.5 – 5.25 = –0.75
Upper fence = Q3 + 5.25 = 8 + 5.25 = 13.25
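Both worked examples follow the same recipe: sort the values, split at the median (dropping the middle value when the count is odd), and take the median of each half. A Python sketch, with illustrative names:

```python
# Quartiles by the median-of-halves method used above, plus 1.5*IQR outlier fences.
def quartiles(data):
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower, upper = s[:half], s[half + n % 2:]   # drop the middle value if n is odd

    def median(vals):
        mid = len(vals) // 2
        return vals[mid] if len(vals) % 2 else (vals[mid - 1] + vals[mid]) / 2

    return median(lower), median(s), median(upper)

q1, q2, q3 = quartiles([8, 5, 2, 4, 8, 9, 5, 7])
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(q1, q2, q3, iqr)   # 4.5 6.0 8.0 3.5
print(fences)            # (-0.75, 13.25)
```

Note that several quartile conventions exist (some keep the middle value in both halves, some interpolate); this sketch follows the convention used in the text.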
BOX PLOT IN DATA MINING
Note: Please understand the tutorial of quartile before moving to this topic.
What is a box plot?
• A box plot is a plotting of the data in the shape of a box that represents the five-number summary: minimum value, Quartile 1, median, Quartile 3 and maximum value.
• The ends of the box are set by the inter-quartile range (IQR): IQR = Q3 – Q1.
• The median is marked by a line within the box.
• A rectangle is drawn to represent the second and third quartiles, usually with a line inside to indicate the median value. The maximum and minimum values sit at the ends of the whiskers, the lines drawn out from the box to the maximum and minimum values.
Draw the box plot for the odd-length data set.
Data = 2, 4, 5, 5, 8, 8, 9
First of all, find the quartiles.
Q1: 4 – Lower quartile
Q2: 5 – Middle quartile
Q3: 8 – Upper quartile
Inter Quartile Range = Q3 – Q1 = 8 – 4 = 4
Population size:
Population size is the total number of values in the data.
Box Plot: (figure)
Draw the box plot for the even-length data set.
Data = 8, 5, 2, 4, 8, 9, 5, 7
First of all, arrange the values in order.
2, 4 | 5, 5 | 7, 8 | 8, 9
Minimum: 2
Q1: (4 + 5) / 2 = 4.5 – Lower quartile
Q2: (5 + 7) / 2 = 6 – Middle quartile
Q3: (8 + 8) / 2 = 8 – Upper quartile
Maximum: 9
Box Plot: (figure)
HOW TO CALCULATE VARIANCE AND STANDARD DEVIATION OF DATA IN DATA MINING
What is data variance and standard deviation?
The values in a data set are spread around the mean. Variance tells us how far, on average, the values lie from the mean.
Standard deviation is the square root of the variance.
• A low standard deviation tells us that most values lie close to the mean.
• A high standard deviation tells us that many values lie far away from the mean.
How to calculate variance and standard deviation of data?
marks: 8, 10, 15, 20
Mean = (8 + 10 + 15 + 20) / 4 = 13.25
Sum of squared deviations = (8 – 13.25)² + (10 – 13.25)² + (15 – 13.25)² + (20 – 13.25)² = 86.75
Variance (sample variance, dividing by n – 1) = 86.75 / 3 = 28.92
Standard deviation = √28.92 = 5.38
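The same calculation can be checked in Python, dividing by n − 1 for the sample variance:

```python
# Sample variance (divide by n - 1) and standard deviation for the marks above.
from math import sqrt

marks = [8, 10, 15, 20]
n = len(marks)
mean = sum(marks) / n                          # 13.25
sq_dev = sum((x - mean) ** 2 for x in marks)   # 86.75
variance = sq_dev / (n - 1)                    # about 28.92
std_dev = sqrt(variance)                       # about 5.38
print(round(variance, 2), round(std_dev, 2))
```

Python's statistics.variance and statistics.stdev give the same n − 1 results; dividing by n instead would give the population variance.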
DATA SKEWNESS IN DATA MINING
What is data skewness?
When most of the values are bunched to the left or right of the median rather than spread evenly around it, the data is called skewed.
Data can have any of the following shapes:
1. Symmetric: Mean, median and mode are at the same point.
2. Positively skewed: Most of the values are to the left of (below) the median and the long tail points to the right, so the mean is greater than the median.
3. Negatively skewed: Most of the values are to the right of (above) the median and the long tail points to the left, so the mean is less than the median.
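Because a long right tail pulls the mean above the median and a long left tail pulls it below, comparing the two gives a rough skew check. A deliberately simple sketch (real skewness coefficients are more nuanced):

```python
# Rough skew direction: compare the mean with the median.
from statistics import mean, median

def skew_direction(values):
    m, med = mean(values), median(values)
    if m > med:
        return "positively skewed"   # right tail pulls the mean up
    if m < med:
        return "negatively skewed"   # left tail pulls the mean down
    return "roughly symmetric"

print(skew_direction([1, 2, 2, 3, 20]))   # positively skewed
print(skew_direction([2, 5, 5, 5, 8]))    # roughly symmetric
```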
ATTRIBUTE TYPES IN DATA MINING
What is an attribute?
An attribute is a property of an object. Attributes represent the different features of the object.
Example:
In this example, RollNo, Name and Result are attributes of the object student.

RollNo | Name  | Result
1      | Ali   | Pass
2      | Akram | Fail
Types of attributes
• Nominal
• Binary (symmetric, asymmetric)
• Ordinal
• Numeric
  o Interval-scaled
  o Ratio-scaled
Nominal data:
Nominal data consists of names or categories rather than numbers; the values are labels with no numeric meaning.
Example:

Attribute        | Value
Categorical data | Lecturer, Assistant Professor, Professor
States           | New, Pending, Working, Complete, Finish
Colors           | Black, Brown, White, Red
Binary data:
Binary data has only two values/states.
Example:

Attribute    | Value
HIV detected | Yes, No
Result       | Pass, Fail

A binary attribute is of two types:
1. Symmetric binary
2. Asymmetric binary
Symmetric data:
Both values are equally important.
Example:

Attribute | Value
Gender    | Male, Female
Asymmetric data:
The two values are not equally important; usually the rarer or more significant state (such as a positive test result) matters more.
Example:

Attribute    | Value
HIV detected | Yes, No
Result       | Pass, Fail
Ordinal data:
All values have a meaningful order.
Example:

Attribute             | Value
Grade                 | A, B, C, D, F
BPS – Basic pay scale | 16, 17, 18
Discrete data:
Discrete data has a finite (countable) number of values. It can be in numerical form and can also be in categorical form.
Example:

Attribute   | Value
Profession  | Teacher, Business Man, Peon, etc.
Postal Code | 42200, 42300, etc.
Continuous data:
Continuous data technically has an infinite number of possible values.
Continuous data is of float type; there can be infinitely many numbers between 1 and 2.
Example:

Attribute | Value
Height    | 5.4…, 6.5…, etc.
Weight    | 50.09…, etc.
PROXIMITY MEASURE FOR NOMINAL ATTRIBUTES IN DATA MINING
How to calculate Proximity Measure for Nominal Attributes?
RollNo | Marks | Grade
1      | 90    | A
2      | 80    | B
3      | 82    | B
4      | 90    | A
Pairs for distance measurement:
d(RollNo1,RollNo1), d(RollNo1,RollNo2), d(RollNo1,RollNo3), d(RollNo1,RollNo4)
d(RollNo2,RollNo1), d(RollNo2,RollNo2), d(RollNo2,RollNo3), d(RollNo2,RollNo4)
d(RollNo3,RollNo1), d(RollNo3,RollNo2), d(RollNo3,RollNo3), d(RollNo3,RollNo4)
d(RollNo4,RollNo1), d(RollNo4,RollNo2), d(RollNo4,RollNo3), d(RollNo4,RollNo4)
Formula:
distance(Object1, Object2) = (P – M) / P
P is the total number of attributes (here P = 2: Marks and Grade)
M is the total number of matches
So in our case we have four objects: RollNo1, RollNo2, RollNo3 and RollNo4.
d(1,1) = (2 – 2) / 2 = 0
d(2,1) = (2 – 0) / 2 = 1
d(2,2) = (2 – 2) / 2 = 0
d(3,1) = (2 – 0) / 2 = 1
d(3,2) = (2 – 1) / 2 = 0.5
d(3,3) = (2 – 2) / 2 = 0
d(4,1) = (2 – 2) / 2 = 0
d(4,2) = (2 – 0) / 2 = 1
d(4,3) = (2 – 0) / 2 = 1
d(4,4) = (2 – 2) / 2 = 0
The distance matrix is symmetric, so for example d(1,2) = d(2,1), and the diagonal entries are always 0.
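The formula above can be sketched in Python; the helper name and the use of (Marks, Grade) tuples are our own illustration:

```python
def nominal_dissimilarity(obj1, obj2):
    """Dissimilarity for nominal attributes: d = (p - m) / p,
    where p = number of attributes and m = number of matches."""
    p = len(obj1)
    m = sum(1 for a, b in zip(obj1, obj2) if a == b)
    return (p - m) / p

# Objects from the table: (Marks, Grade) for RollNo 1..4
students = {1: (90, "A"), 2: (80, "B"), 3: (82, "B"), 4: (90, "A")}

print(nominal_dissimilarity(students[1], students[2]))  # 1.0 (no matches)
print(nominal_dissimilarity(students[3], students[2]))  # 0.5 (grade matches)
print(nominal_dissimilarity(students[4], students[1]))  # 0.0 (both match)
```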
DISTANCE MEASURE FOR ASYMMETRIC BINARY ATTRIBUTES IN DATA MINING
How to calculate proximity measure for asymmetric binary attributes?
OR
How to measure the distance of asymmetric binary variables?
Contingency table for binary data (consider 1 for positive/True and 0 for negative/False):

                             Object 2
                  1 / True   0 / False   Sum
Object 1  1       A          B           A + B
          0       C          D           C + D
          Sum     A + C      B + D
Name  | Fever    | Cough | Test 1   | Test 2   | Test 3   | Test 4
Asad  | Negative | Yes   | Negative | Positive | Negative | Negative
Bilal | Negative | Yes   | Negative | Positive | Positive | Negative
Tahir | Positive | Yes   | Negative | Negative | Negative | Negative
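The table above can be evaluated with the standard asymmetric binary dissimilarity, d = (b + c) / (a + b + c), which ignores negative matches; this sketch (function name and 0/1 encoding are ours) computes the pairwise distances for the three patients:

```python
def asym_binary_distance(x, y):
    """Asymmetric binary dissimilarity: d = (b + c) / (a + b + c).
    Negative matches (d) are ignored because 0/0 carries no information."""
    a = sum(1 for i, j in zip(x, y) if (i, j) == (1, 1))
    b = sum(1 for i, j in zip(x, y) if (i, j) == (1, 0))
    c = sum(1 for i, j in zip(x, y) if (i, j) == (0, 1))
    return (b + c) / (a + b + c)

# 1 = Yes/Positive, 0 = No/Negative; order: Fever, Cough, Test1..Test4
asad  = [0, 1, 0, 1, 0, 0]
bilal = [0, 1, 0, 1, 1, 0]
tahir = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_distance(asad, bilal), 2))  # 0.33
print(round(asym_binary_distance(asad, tahir), 2))  # 0.67
print(round(asym_binary_distance(bilal, tahir), 2))  # 0.75
```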
DISTANCE MEASURE FOR SYMMETRIC BINARY VARIABLES
How to calculate proximity measure for symmetric binary attributes?
OR
How to measure the distance/dissimilarity of symmetric binary variables?
Contingency table for binary data:

                             Object 2
                  1 / True   0 / False   Sum
Object 1  1       A          B           A + B
          0       C          D           C + D
          Sum     A + C      B + D
Name  | Gender | Job_Status
Akram | Male   | Regular
Ali   | Male   | Contract
Consider 1 for positive/True and 0 for negative/False.
Here we treat Male and Regular as positive, and Female and Contract as negative.
A = 1: on Gender, Akram is positive and Ali is also positive, because both are male.
B = 1: on Job_Status, Akram is positive (Regular) and Ali is negative (Contract).
C = 0: there is no attribute where Akram is negative and Ali is positive; Akram is male and regular, and both are positive.
D = 0: there is no attribute where both are negative, because Akram is positive on both attributes.
So d(Akram, Ali) = (B + C) / (A + B + C + D) = (1 + 0) / (1 + 1 + 0 + 0) = 0.5.
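The symmetric binary dissimilarity d = (b + c) / (a + b + c + d) can be sketched as follows (the 0/1 encoding is ours):

```python
def sym_binary_distance(x, y):
    """Symmetric binary dissimilarity: d = (b + c) / (a + b + c + d);
    every cell of the contingency table counts equally."""
    a = sum(1 for i, j in zip(x, y) if (i, j) == (1, 1))
    b = sum(1 for i, j in zip(x, y) if (i, j) == (1, 0))
    c = sum(1 for i, j in zip(x, y) if (i, j) == (0, 1))
    d = sum(1 for i, j in zip(x, y) if (i, j) == (0, 0))
    return (b + c) / (a + b + c + d)

# 1 = Male / Regular, 0 = Female / Contract; order: Gender, Job_Status
akram = [1, 1]
ali   = [1, 0]
print(sym_binary_distance(akram, ali))  # 0.5
```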
EUCLIDEAN DISTANCE IN DATA MINING
What is Euclidean distance ?
Euclidean distance is a technique used to find the distance/dissimilarity among objects.
Example:
Name     | Age | Marks
Sameed   | 10  | 90
Shah zeb | 6   | 95
Formula:
Euclidean distance (A, B) = SQRT( (X1 – X2)² + (Y1 – Y2)² )
Euclidean distance (Sameed, Sameed) = SQRT( (10 – 10)² + (90 – 90)² ) = 0
Here note that (90 – 95) = –5, and the square of a negative number is positive. For example, (–5)² = 25.
Euclidean distance (Sameed, Shah zeb) = SQRT( (10 – 6)² + (90 – 95)² ) = SQRT(16 + 25) = 6.40312
Euclidean distance (Shah zeb, Sameed) = SQRT( (10 – 6)² + (90 – 95)² ) = 6.40312
The Euclidean distance matrix is given below:
         | Sameed  | Shah zeb
Sameed   | 0       | 6.40312
Shah zeb | 6.40312 | 0
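The distance matrix above can be reproduced with a small helper (the function name is ours):

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

sameed  = (10, 90)   # (Age, Marks)
shahzeb = (6, 95)
print(euclidean(sameed, sameed))             # 0.0
print(round(euclidean(sameed, shahzeb), 5))  # 6.40312
```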
JACCARD COEFFICIENT SIMILARITY MEASURE FOR ASYMMETRIC BINARY VARIABLES
How to calculate the similarity of asymmetric binary variables using the Jaccard coefficient?
The Jaccard coefficient is used to calculate the similarity among asymmetric binary attributes: sim = A / (A + B + C).
Contingency table for binary data:
                             Object 2
                  1 / True   0 / False   Sum
Object 1  1       A          B           A + B
          0       C          D           C + D
          Sum     A + C      B + D
Name  | Fever    | Cough | Test 1   | Test 2   | Test 3   | Test 4
Asad  | Negative | Yes   | Negative | Positive | Negative | Negative
Bilal | Negative | Yes   | Negative | Positive | Positive | Negative
Tahir | Positive | Yes   | Negative | Negative | Negative | Negative
Consider 1 for positive/True and 0 for negative/False
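A sketch of the Jaccard coefficient, sim = a / (a + b + c), applied to the patient table (the 0/1 encoding is ours):

```python
def jaccard(x, y):
    """Jaccard coefficient for asymmetric binary data: sim = a / (a + b + c)."""
    a = sum(1 for i, j in zip(x, y) if (i, j) == (1, 1))
    b = sum(1 for i, j in zip(x, y) if (i, j) == (1, 0))
    c = sum(1 for i, j in zip(x, y) if (i, j) == (0, 1))
    return a / (a + b + c)

# 1 = Yes/Positive, 0 = No/Negative; order: Fever, Cough, Test1..Test4
asad  = [0, 1, 0, 1, 0, 0]
bilal = [0, 1, 0, 1, 1, 0]
tahir = [1, 1, 0, 0, 0, 0]
print(round(jaccard(asad, bilal), 2))  # 0.67
print(round(jaccard(asad, tahir), 2))  # 0.33
print(round(jaccard(bilal, tahir), 2))  # 0.25
```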
COSINE SIMILARITY IN DATA MINING
What is Cosine similarity?
Cosine similarity is a measure to find the similarity between two files/documents.
Example of cosine similarity:
What is the similarity between two files, file 1 and file 2 ?
Formula: cos(file 1, file 2) = (file 1 · file 2) / (||file 1|| ||file 2||)
file 1 = (0, 3, 0, 0, 2, 0, 0, 2, 0, 5)
file 2 = (1, 2, 0, 0, 1, 1, 0, 1, 0, 3)
file 1 · file 2 = 0*1 + 3*2 + 0*0 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 5*3 = 25
||file 1|| = (0*0 + 3*3 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 5*5)^0.5 = (42)^0.5 = 6.481
||file 2|| = (1*1 + 2*2 + 0*0 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 3*3)^0.5 = (17)^0.5 = 4.123
cos(file 1, file 2) = 25 / (6.481 * 4.123) = 0.94
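The dot-product and norm computations above can be sketched as:

```python
from math import sqrt

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)"""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

file1 = [0, 3, 0, 0, 2, 0, 0, 2, 0, 5]
file2 = [1, 2, 0, 0, 1, 1, 0, 1, 0, 3]
print(round(cosine_similarity(file1, file2), 2))  # 0.94
```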
MAJOR TASKS OF DATA PRE-PROCESSING
Major tasks of data pre-processing
1. Data Cleaning
   Data cleaning is the process of cleaning the data so that it can be easily integrated.
2. Data Integration
   Data integration is the process of integrating/combining data from multiple sources.
3. Data Reduction
   Data reduction is the process of reducing large data into a smaller form so that it can be easily transformed further.
4. Data Transformation
   Data transformation is the process of transforming the data into a reliable shape.
5. Data Discretization
After the completion of these tasks, the data is ready for mining.
DATA CLEANING
Data Cleaning
Data cleaning is a process to clean the dirty data.
Data is often not clean: it can be incorrect for a large number of reasons, such as hardware failure, network errors, or human error. So it is necessary to clean the data before mining.

Dirty data        | Examples
Incomplete data   | salary = ” ”
Inconsistent data | Age = ”5 years”, Birthday = ”06/06/1990″, Current Year = ”2017″
Noisy data        | Salary = ”-5000″, Name = ”123″
Intentional error | Sometimes applications assign a default value to an attribute, e.g. some applications set gender = ”male” by default.
How to handle incomplete/missing data?
- Ignore the tuple
- Fill in the missing value manually
- Fill in the value automatically by:
  - using the attribute mean
  - using a constant value, if one is defined
  - using the most probable value, e.g. from a Bayesian formula or a decision tree
How to handle noisy data?
- Binning
- Regression
- Clustering
- Combined computer and human inspection
What is Binning?
Binning is a technique in which we first sort the data and then partition it into equal-frequency bins.
Bin 1 | 2, 3, 6, 8
Bin 2 | 14, 16, 18, 24
Bin 3 | 26, 28, 30, 32
Types of binning:
There are several types of binning. Some of them are as follows:
1. Smoothing by bin means
Bin 1 | 4.75, 4.75, 4.75, 4.75
Bin 2 | 18, 18, 18, 18
Bin 3 | 29, 29, 29, 29
2. Smoothing by bin medians
3. Smoothing by bin boundaries, etc.
Z-SCORE NORMALIZATION OF DATA
What is Z-Score?
The Z-score normalizes data: z = (v – mean) / standard deviation.
How to calculate the Z-score of the following data?
marks: 8, 10, 15, 20
Mean = 13.25
Standard deviation = 5.37 (sample standard deviation, dividing by n – 1)
marks | marks after z-score normalization
8     | -0.98
10    | -0.60
15    | 0.33
20    | 1.26
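The table above can be reproduced in Python; note that the sample standard deviation (dividing by n − 1) is used, which matches s = 5.37:

```python
from math import sqrt

def z_scores(values):
    """Z-score normalization: z = (v - mean) / s, using the
    sample standard deviation (dividing by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [round((v - mean) / s, 2) for v in values]

print(z_scores([8, 10, 15, 20]))  # [-0.98, -0.6, 0.33, 1.26]
```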
MIN MAX NORMALIZATION OF DATA IN DATA MINING
What is Min-Max normalization?
Min-Max is a technique that normalizes the data by scaling it into a given range, here between 0 and 1:
v' = (v – Min) / (Max – Min) * (newMax – newMin) + newMin
How to normalize the data through the min-max normalization technique?
marks: 8, 10, 15, 20
Min: minimum value of the given attribute. Here Min is 8.
Max: maximum value of the given attribute. Here Max is 20.
V: the respective value of the attribute. For example, here V1=8, V2=10, V3=15 and V4=20.
newMax: 1
newMin: 0
marks | marks after min-max normalization
8     | 0
10    | 0.17
15    | 0.58
20    | 1
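A sketch of the min-max formula above:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization:
    v' = (v - min) / (max - min) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)
    return [round((v - lo) / (hi - lo) * (new_max - new_min) + new_min, 2)
            for v in values]

print(min_max([8, 10, 15, 20]))  # [0.0, 0.17, 0.58, 1.0]
```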
MIN MAX SCALING IN DATA MINING
Example 2:
The details of min-max normalization are available in the previous tutorial. Here is just another example for practice.
NORMALIZATION WITH DECIMAL SCALING IN DATA MINING
What is decimal scaling?
Decimal scaling is a data normalization technique in which we move the decimal point of the attribute's values. How far the decimal point is moved depends on the maximum absolute value of the attribute.
Formula:
A value v of attribute A can be normalized by:
Normalized value = v / 10^j
where j is the smallest integer such that every normalized value is less than 1.
Example 1:
CGPA | Formula | CGPA normalized after decimal scaling
2    | 2 / 10  | 0.2
3    | 3 / 10  | 0.3
We check the maximum value of the attribute CGPA. Here the maximum value is 3, so we can convert the values into decimals by dividing by 10. Why 10?
We count the digits of the maximum value, then write a 1 followed by that many zeros.
Here the maximum value is 3 and it has only one digit, so we put one zero after the 1, giving 10.
Example 2:
Salary bonus | Formula     | Salary bonus normalized after decimal scaling
400          | 400 / 1000  | 0.4
310          | 310 / 1000  | 0.31
We check the maximum value of the attribute “salary bonus”. Here the maximum value is 400, so we can convert it into a decimal by dividing by 1000. Why 1000?
400 contains three digits, so we put three zeros after the 1, giving 1000.
Example 3:
Salary | Formula          | Salary normalized after decimal scaling
40,000 | 40,000 / 100,000 | 0.4
31,000 | 31,000 / 100,000 | 0.31
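The three examples can be reproduced with a small helper that finds the power of 10 automatically (the function name is ours):

```python
def decimal_scaling(values):
    """Decimal scaling: v' = v / 10^j, where j is the smallest integer
    such that every scaled value is below 1 in absolute value."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([2, 3]))          # [0.2, 0.3]
print(decimal_scaling([400, 310]))      # [0.4, 0.31]
print(decimal_scaling([40000, 31000]))  # [0.4, 0.31]
```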
STANDARD DEVIATION NORMALIZATION OF DATA IN DATA MINING
Data Normalization with the Help Of Standard Deviation
Data is in attribute tuples and can be normalized using the standard deviation.
Example:
Age: 22, 40
First step: calculate the mean of the data:
Mean = (22 + 40) / 2 = 31
Second step: subtract the mean from each value and square the result:
(22 – 31)² = 81
(40 – 31)² = 81
Third step: find the deviation:
Deviation = sqrt((81 + 81) / 2) = 9
Fourth step: normalize the attribute values with (x – Mean) / Deviation:
For 22: (22 – 31) / 9 = -1
For 40: (40 – 31) / 9 = 1
Age before normalization | Age after normalization
22                       | -1
40                       | 1
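The four steps above can be sketched as (the function name is ours):

```python
from math import sqrt

def std_normalize(values):
    """Normalize with the population standard deviation:
    x' = (x - mean) / sqrt(sum((x - mean)^2) / n)"""
    n = len(values)
    mean = sum(values) / n
    dev = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / dev for v in values]

print(std_normalize([22, 40]))  # [-1.0, 1.0]
```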
DATA DISCRETIZATION IN DATA MINING
What is data discretization?
Data discretization converts a large number of data values into a smaller number of labels, so that data evaluation and data management become much easier.
Example:
We have an attribute Age with the following values:
Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75
Table: Before discretization
Age                    | Label
10, 11, 13, 14, 17, 19 | Young
30, 31, 32, 38, 40, 42 | Mature
70, 72, 73, 75         | Old
Table: How to discretize
Age: Young, Mature, Old
Table: After discretization
What are some famous techniques of data discretization?
1. Histogram analysis
2. Binning
3. Correlation analysis
4. Clustering analysis
5. Decision tree analysis
6. Equal width partitioning
7. Equal depth partitioning
BINNING METHODS FOR DATA SMOOTHING IN DATA MINING
What is binning method?
The binning method can be used for smoothing data.
Example:
Sorted data for Age: 3, 7, 8, 13, 22, 22, 22, 26, 26, 28, 30, 37
How to smooth the data into equal-frequency bins?
- Bin 1: 3, 7, 8, 13
- Bin 2: 22, 22, 22, 26
- Bin 3: 26, 28, 30, 37
How to smooth the data by bin means?
- Bin 1: 7.75, 7.75, 7.75, 7.75
- Bin 2: 23, 23, 23, 23
- Bin 3: 30.25, 30.25, 30.25, 30.25
How to smooth the data by bin boundaries? (each value is replaced by the nearer bin boundary)
- Bin 1: 3, 3, 3, 13
- Bin 2: 22, 22, 22, 26
- Bin 3: 26, 26, 26, 37
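The three smoothing variants can be sketched as (helper names are ours; on ties, the lower boundary is chosen):

```python
def equal_frequency_bins(data, n_bins):
    """Sort the data and split it into bins of equal size."""
    data = sorted(data)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearest bin boundary (min or max)."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

ages = [3, 7, 8, 13, 22, 22, 22, 26, 26, 28, 30, 37]
bins = equal_frequency_bins(ages, 3)
print(bins)                        # [[3, 7, 8, 13], [22, 22, 22, 26], [26, 28, 30, 37]]
print(smooth_by_means(bins))       # bin means 7.75, 23.0, 30.25
print(smooth_by_boundaries(bins))  # [[3, 3, 3, 13], [22, 22, 22, 26], [26, 26, 26, 37]]
```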
CORRELATION ANALYSIS OF NOMINAL DATA
Correlation analysis of nominal data:
Chi-Square Test
This analysis can be done with the chi-square test. The chi-square test analyses the correlation of nominal data.
Correlation vs causality:
Correlation does not always tell us about causality.
Example:
- The number of students passing an exam and the number of car thefts in a country may be correlated, but that does not mean that students passing exams affects car thefts in the country.
- In some cases it may: the number of students passing an exam and the number of students living near the university are correlated, and living near the university may be a cause of the students' results.
                         | Passed student              | Not passed student            | Sum
Live near university     | Observed = 140              | Observed = 190                | 330
                         | Expected = 180*330/1320 = 45 | Expected = 1140*330/1320 = 285 |
Not live near university | Observed = 40               | Observed = 950                | 990
                         | Expected = 180*990/1320 = 135 | Expected = 1140*990/1320 = 855 |
Sum                      | 140 + 40 = 180              | 190 + 950 = 1140              | 1320
Degrees of freedom:
DF = (r – 1) * (c – 1), where r is the number of rows and c the number of columns. Here DF = (2 – 1) * (2 – 1) = 1.
Common levels of significance: 0.01, 0.05, 0.10
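The chi-square statistic for the table above can be computed as follows (the helper is ours; the original text does not state the final value):

```python
def chi_square(observed):
    """Pearson chi-square for an r x c contingency table.
    Expected count = row_total * col_total / grand_total."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    total = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2

# Observed counts from the table above
table = [[140, 190],   # live near university: passed, not passed
         [40, 950]]    # not near university
print(round(chi_square(table), 2))  # 309.63
```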
CORRELATION ANALYSIS FOR NUMERICAL DATA
Correlation analysis for numerical data
A | B
3 | 1
4 | 6
1 | 2
Step 1: Find all the initial values
A | B | AB | A² = C | B² = D
3 | 1 | 3  | 9      | 1
4 | 6 | 24 | 16     | 36
1 | 2 | 2  | 1      | 4
Total number of values (n) is 3.
The other values we need are:
ΣA =3 + 4 + 1 = 8
ΣB = 1 + 6 + 2 = 9
ΣAB = 3 + 24 + 2 = 29
ΣC = 9 + 16 + 1 = 26
ΣD= 1 + 36 + 4 = 41
Step 2: Input the Values
r = [nΣAB – (ΣA)(ΣB)] / sqrt([nΣC – (ΣA)²][nΣD – (ΣB)²])
r = [3(29) – (8)(9)] / sqrt([3(26) – 8²][3(41) – 9²])
r = [87 – 72] / sqrt([78 – 64][123 – 81])
r = 15 / sqrt(14 * 42)
r = 15 / sqrt(588)
r = 15 / 24.25
r = 0.62
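Step 1 and Step 2 above can be sketched as:

```python
from math import sqrt

def pearson_r(A, B):
    """Pearson correlation coefficient:
    r = (n*SumAB - SumA*SumB) / sqrt((n*SumA2 - SumA^2) * (n*SumB2 - SumB^2))"""
    n = len(A)
    s_a, s_b = sum(A), sum(B)
    s_ab = sum(a * b for a, b in zip(A, B))
    s_a2 = sum(a * a for a in A)
    s_b2 = sum(b * b for b in B)
    return (n * s_ab - s_a * s_b) / sqrt((n * s_a2 - s_a ** 2) * (n * s_b2 - s_b ** 2))

print(round(pearson_r([3, 4, 1], [1, 6, 2]), 2))  # 0.62
```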
FREQUENT PATTERN MINING IN DATA MINING
Frequent pattern :
A frequent pattern is a pattern that occurs again and again (frequently) in a data set.
A pattern can be a set of items, a substructure, a subsequence, etc.
Closed frequent itemset:
A frequent itemset that has no proper superset with the same support.
Max frequent itemset:
A frequent itemset that has no frequent proper superset.
Example:
It is compulsory to set a min_support, which defines which itemsets are frequent: an itemset is frequent if its support is greater than or equal to min_support.
Suppose the minimum support is 1 and there are two transactions, T1 and T2:
T1 = {A1, A2, A3, ……………….A19, A20}
T2 = {A1, A2, A3, …….A9, A10}
Set of closed frequent itemsets / closed patterns:
- {A1, A2, A3, ……………….A19, A20} with support 1 (it occurs only in T1)
- {A1, A2, A3, …….A9, A10} with support 2 (it occurs in both T1 and T2)
Set of max frequent itemsets / max patterns:
- {A1, A2, A3, ……………….A19, A20} is the max pattern, because it has no frequent superset.
APRIORI ALGORITHM IN DATA MINING
Apriori Algorithm
Apriori helps in mining frequent itemsets.
Example 1:
Minimum support: 2
Step 1: Data in the database
Step 2: Calculate the support / frequency of all items
Step 3: Discard the items with support less than 2
Step 4: Combine the remaining items into pairs (2-itemsets)
Step 5: Calculate the support / frequency of all itemsets
Step 6: Discard the itemsets with support less than 2
Step 6.5: Combine the remaining items into triples (3-itemsets) and calculate their support
Step 7: Discard the itemsets with support less than 2
Result:
Only one itemset is frequent (Eggs, Tea, Cold Drink) because this itemset has support 2.
Example 2:
Minimum support: 3
Step 1: Data in the database
Step 2: Calculate the support / frequency of all items
Step 3: Discard the items with support less than 3
Step 4: Combine the remaining items into pairs (2-itemsets)
Step 5: Calculate the support / frequency of all itemsets
Step 6: Discard the itemsets with support less than 3
Step 6.5: Combine the remaining items into triples (3-itemsets) and calculate their support
Step 7: Discard the itemsets with support less than 3
Result:
There is no frequent 3-itemset because all itemsets have support less than 3.
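A minimal Apriori sketch of the level-by-level procedure above. The transactions below are made up for illustration (the example's actual tables are in figures not reproduced here), and candidate generation is simplified to unions of surviving itemsets rather than the full self-join:

```python
def apriori(transactions, min_support):
    """Plain Apriori sketch: generate k-itemsets level by level and
    discard those whose support is below min_support."""
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        k += 1
        # simplified candidate generation: unions of surviving itemsets
        current = list({a | b for a in survivors for b in survivors
                        if len(a | b) == k})
    return frequent

# Hypothetical transactions, chosen so that {Eggs, Tea, Cold Drink}
# reaches support 2 as in the example's result.
transactions = [{"Eggs", "Tea", "Cold Drink"},
                {"Eggs", "Tea", "Coffee"},
                {"Eggs", "Tea", "Cold Drink"},
                {"Coffee", "Cold Drink"}]
result = apriori(transactions, min_support=2)
print(frozenset({"Eggs", "Tea", "Cold Drink"}) in result)  # True
```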
APRIORI PRINCIPLES IN DATA MINING
Apriori principles:
1. Downward closure property of frequent patterns
   All subsets of any frequent itemset must also be frequent.
   Example: if {Tea, Biscuit, Coffee} is a frequent itemset, then all of the following itemsets are frequent:
   - Tea
   - Biscuit
   - Coffee
   - Tea, Biscuit
   - Tea, Coffee
   - Biscuit, Coffee
2. Apriori pruning principle
   If an itemset is infrequent, its supersets should not be generated when searching for frequent itemsets.
   Example: if {Tea, Biscuit} is a frequent itemset and {Coffee} is not frequent, then only the following itemsets need to be considered:
   - Tea
   - Biscuit
   - Tea, Biscuit
APRIORI CANDIDATES GENERATION IN DATA MINING
Apriori Candidates generation:
Candidates can be generated by the following steps:
Step 1: Self-joining
Example: let the frequent 3-itemsets be X3 = {VWX, VWY, VXY, VXZ, WXY}. Self-joining X3 * X3 gives:
- VWXY from VWX and VWY
- VXYZ from VXY and VXZ
So the candidate 4-itemsets are VWXY and VXYZ.
Step 2: Apriori pruning principle
According to the Apriori pruning principle, VXYZ is removed because its subset VYZ is not in X3.
So the remaining frequent candidate is VWXY.
KMEANS CLUSTERING IN DATA MINING
What is clustering?
Clustering is the process of partitioning a group of data into small partitions or clusters on the basis of similarity and dissimilarity.
What is K-Means clustering?
K-Means clustering is a clustering method in which every data item (attribute value) is assigned to its nearest, most similar cluster.
How does it work?
Step 1: Pick the centroids randomly. It is better to take boundary and middle values as centroids.
Step 2: Assign each item (attribute value) to the cluster of its nearest centroid.
Step 3: Recompute the centroids and repeat the whole process. Each time the process is repeated, the total sum of errors changes. When the error rate stops changing, finalize the clusters and their items.
Iteration 1:
id | Age | D1 | D2 | D3 | Cluster | Error
1  | 23  | 0  | 42 | 20 | 1       | 0
2  | 33  | 10 | 32 | 10 | 1       | 100
3  | 28  | 5  | 37 | 15 | 1       | 25
4  | 23  | 0  | 42 | 20 | 1       | 0
5  | 65  | 42 | 0  | 22 | 2       | 0
6  | 67  | 44 | 2  | 24 | 2       | 4
7  | 64  | 41 | 1  | 21 | 2       | 1
8  | 73  | 50 | 8  | 30 | 2       | 64
9  | 68  | 45 | 3  | 25 | 2       | 9
10 | 43  | 20 | 22 | 0  | 3       | 0
11 | 34  | 11 | 31 | 9  | 3       | 81
12 | 43  | 20 | 22 | 0  | 3       | 0
13 | 52  | 29 | 13 | 9  | 3       | 81
14 | 49  | 26 | 16 | 6  | 3       | 36
Centroids: 23, 65, 43 | Total error: 401
Iteration 2:
id | Age | D1    | D2   | D3   | Cluster | Error
1  | 23  | 3.75  | 44.4 | 21.2 | 1       | 14.0625
2  | 33  | 6.25  | 34.4 | 11.2 | 1       | 39.0625
3  | 28  | 1.25  | 39.4 | 16.2 | 1       | 1.5625
4  | 23  | 3.75  | 44.4 | 21.2 | 1       | 14.0625
5  | 34  | 7.25  | 33.4 | 10.2 | 1       | 52.5625
6  | 65  | 38.25 | 2.4  | 20.8 | 2       | 5.76
7  | 67  | 40.25 | 0.4  | 22.8 | 2       | 0.16
8  | 64  | 37.25 | 3.4  | 19.8 | 2       | 11.56
9  | 73  | 46.25 | 5.6  | 28.8 | 2       | 31.36
10 | 68  | 41.25 | 0.6  | 23.8 | 2       | 0.36
11 | 43  | 16.25 | 24.4 | 1.2  | 3       | 1.44
12 | 43  | 16.25 | 24.4 | 1.2  | 3       | 1.44
13 | 52  | 25.25 | 15.4 | 7.8  | 3       | 60.84
14 | 49  | 22.25 | 18.4 | 4.8  | 3       | 23.04
Centroids: 26.75, 67.4, 44.2 | Total error: 257.2725
Iteration 3:
id | Age | D1   | D2   | D3    | Cluster | Error
1  | 23  | 5.2  | 44.4 | 23.75 | 1       | 27.04
2  | 33  | 4.8  | 34.4 | 13.75 | 1       | 23.04
3  | 28  | 0.2  | 39.4 | 18.75 | 1       | 0.04
4  | 23  | 5.2  | 44.4 | 23.75 | 1       | 27.04
5  | 34  | 5.8  | 33.4 | 12.75 | 1       | 33.64
6  | 65  | 36.8 | 2.4  | 18.25 | 2       | 5.76
7  | 67  | 38.8 | 0.4  | 20.25 | 2       | 0.16
8  | 64  | 35.8 | 3.4  | 17.25 | 2       | 11.56
9  | 73  | 44.8 | 5.6  | 26.25 | 2       | 31.36
10 | 68  | 39.8 | 0.6  | 21.25 | 2       | 0.36
11 | 43  | 14.8 | 24.4 | 3.75  | 3       | 14.0625
12 | 43  | 14.8 | 24.4 | 3.75  | 3       | 14.0625
13 | 52  | 23.8 | 15.4 | 5.25  | 3       | 27.5625
14 | 49  | 20.8 | 18.4 | 2.25  | 3       | 5.0625
Centroids: 28.2, 67.4, 46.75 | Total error: 220.75
Iteration 4:
id | Age | D1   | D2   | D3    | Cluster | Error
1  | 23  | 5.2  | 44.4 | 23.75 | 1       | 27.04
2  | 23  | 5.2  | 44.4 | 23.75 | 1       | 27.04
3  | 28  | 0.2  | 39.4 | 18.75 | 1       | 0.04
4  | 33  | 4.8  | 34.4 | 13.75 | 1       | 23.04
5  | 34  | 5.8  | 33.4 | 12.75 | 1       | 33.64
6  | 43  | 14.8 | 24.4 | 3.75  | 3       | 14.0625
7  | 43  | 14.8 | 24.4 | 3.75  | 3       | 14.0625
8  | 49  | 20.8 | 18.4 | 2.25  | 3       | 5.0625
9  | 52  | 23.8 | 15.4 | 5.25  | 3       | 27.5625
10 | 64  | 35.8 | 3.4  | 17.25 | 2       | 11.56
11 | 65  | 36.8 | 2.4  | 18.25 | 2       | 5.76
12 | 67  | 38.8 | 0.4  | 20.25 | 2       | 0.16
13 | 68  | 39.8 | 0.6  | 21.25 | 2       | 0.36
14 | 73  | 44.8 | 5.6  | 26.25 | 2       | 31.36
Centroids: 28.2, 67.4, 46.75 | Total error: 220.75
Iteration stops: the iterations are stopped because the error rate is the same in iteration 3 and iteration 4. The error rate is now fixed at 220.75, so there is no need for further iterations; the clusters are final.
Shortcomings of K-Means clustering:
- It is sensitive to outliers.
- It is not very suitable for categorical or nominal data.
KMEANS CLUSTERING ON TWO ATTRIBUTES IN DATA MINING
How is K-Means clustering performed on two attributes?
Solution:
For example, we have two attributes:
1. Paper 1
2. Paper 2
- First of all, choose the centroid values randomly.
- Then calculate the distance of each value of Paper 1 and Paper 2 from the centroids.
- Find the error rate.
- Repeat the whole process until the error rate remains consistent for at least the last two iterations, or until the clusters stop changing.
Figure: k-means clustering
DECISION TREE INDUCTION IN DATA MINING
Decision Tree Induction:
A decision tree is a tree-like structure and consists of the following parts (discussed in Figure 1):
1. Root node:
   - Age is the root node
2. Branches:
   - The branches are: <20, 21…50, >50, USA, PK, High, Low
3. Leaf nodes:
   - The leaf nodes are: Yes, No
Entropy:
Entropy is a measure of uncertainty.
- For a binary class, entropy lies between 0 and 1.
- High entropy means the data have more variance.
- Low entropy means the data have less variance.
P = total Yes = 9
N = total No = 5
Note: to calculate log2 of a number, use log2(x) = log(x) / log(2). For example, log2(0.642) = log(0.642) / log(2) = -0.639.
Info(P, N) = -9/14 * log2(9/14) – 5/14 * log2(5/14)
= -9/14 * log2(0.642) – 5/14 * log2(0.357)
= -9/14 * (-0.639) – 5/14 * (-1.485)
= 0.941
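The entropy calculation above can be sketched as:

```python
from math import log2

def entropy(p, n):
    """Info(P, N) = -p/(p+n)*log2(p/(p+n)) - n/(p+n)*log2(n/(p+n)),
    treating 0*log2(0) as 0."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * log2(frac)
    return result

print(round(entropy(9, 5), 3))  # 0.94
```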
For Age:
age   | Pi    | Ni   | Info(Pi, Ni)
<20   | 2 YES | 3 NO | 0.970
21…50 | 4 YES | 0 NO | 0
>50   | 3 YES | 2 NO | 0.970
Note: if Yes = 2 and No = 3 the entropy is 0.970, and it is the same 0.970 if Yes = 3 and No = 2. So once we have calculated the entropy for age < 20, there is no need to calculate it again for age > 50, because the Yes and No counts are simply swapped.
Gain of Age           | 0.248
Gain of Income        | 0.029
Gain of Credit Rating | 0.048
Gain of Region        | 0.151
0.248 is greater than the gains of Income, Credit Rating and Region, so Age is selected as the root node.
Note that:
- If Yes and No occur as (0, any number) or (any number, 0), the entropy is always 0.
- If Yes and No occur as (3, 5) and (5, 3), both have the same entropy.
- Entropy measures the impurity or uncertainty of the data.
- If a coin is fair (head and tail each have probability 1/2), uncertainty is maximal, because it is difficult to guess whether heads or tails will occur. If the coin has heads on both sides, the probability of heads is 1 and the uncertainty (entropy) is minimal.
- If p is equal to q, there is more uncertainty.
- If p is not equal to q, there is less uncertainty.
Now again calculate the entropy for:
1. Income
2. Region
3. Credit Rating
For Income:
Income | Pi    | Ni   | Info(Pi, Ni)
High   | 0 YES | 2 NO | 0
Medium | 1 YES | 1 NO | 1
Low    | 1 YES | 0 NO | 0
For Region:
Region | Pi    | Ni   | Info(Pi, Ni)
USA    | 0 YES | 3 NO | 0
PK     | 2 YES | 0 NO | 0
For Credit Rating:
Credit Rating | Pi    | Ni   | Info(Pi, Ni)
Low           | 1 YES | 2 NO | 0.918
High          | 1 YES | 1 NO | 1
Gain of Region        | 0.970
Gain of Credit Rating | 0.02
Gain of Income        | 0.57
Similarly, you can calculate the gain for all attributes.
0.970 is greater than the gains of Income and Credit Rating, so Region is selected as the splitting attribute at this node.
COMPUTING INFORMATION-GAIN FOR CONTINUOUS-VALUED ATTRIBUTES IN DATA MINING
For example, we have the following data mentioned below;
How can we calculate the split point?
Income | Class
18     | YES
45     | NO
18     | NO
25     | YES
28     | YES
28     | NO
34     | NO
Solution:
Step 1 : Sort the data in ascending order.
Income | Class
18     | YES
18     | NO
25     | YES
28     | YES
28     | NO
34     | NO
45     | NO
Step 2: Find the midpoint of the first two distinct values and calculate the information gain.
Split point = (18 + 25) / 2 = 21.5
Info_income<21.5(D) = 2/7 * I(1,1) + 5/7 * I(2,3)
= 2/7 * (-1/2 log2(1/2) – 1/2 log2(1/2)) + 5/7 * (-2/5 log2(2/5) – 3/5 log2(3/5))
= 0.98
In the same way, a candidate split point is evaluated between every pair of adjacent distinct values, and the split with the smallest expected information requirement is chosen.
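The split-point evaluation can be sketched as (helper names are ours):

```python
from math import log2

def info(counts):
    """Entropy of a class distribution, e.g. info([2, 3])."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def split_info(data, split):
    """Weighted entropy Info_A(D) after splitting at a numeric threshold."""
    left = [c for v, c in data if v < split]
    right = [c for v, c in data if v >= split]
    result = 0.0
    for part in (left, right):
        counts = [part.count("YES"), part.count("NO")]
        result += len(part) / len(data) * info(counts)
    return result

data = [(18, "YES"), (18, "NO"), (25, "YES"), (28, "YES"),
        (28, "NO"), (34, "NO"), (45, "NO")]
# midpoint of the first two distinct values: (18 + 25) / 2 = 21.5
print(round(split_info(data, 21.5), 2))  # 0.98
```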
WHICH ATTRIBUTE SELECTION MEASURE IS BEST IN DATA MINING
Attribute Selection Measure:
There are different attribute selection measures. Some of them are as follows;
1. Information gain
2. Gain ratio
3. Gini index
- Information gain: biased towards multi-valued attributes.
- Gain ratio: tends to prefer unbalanced splits, in which one partition is much smaller than the other.
- Gini index: biased towards multi-valued attributes; difficult to manage with a large number of classes; tends to favour tests that produce equal-sized partitions.
GINI INDEX FOR BINARY VARIABLES IN DATA MINING
What is the Gini index?
- The Gini index is one of the most commonly used measures of inequality/impurity.
- It is also referred to as the Gini ratio or Gini coefficient.
Gini index for binary variables:
Student | inHostel | Target Class
True    | True     | Yes
True    | True     | Yes
True    | False    | No
False   | False    | Yes
False   | True     | No
False   | True     | No
False   | False    | No
True    | False    | Yes
False   | True     | No
Now we will calculate gini index of student and inHostel;
Step 1:
Gini(X) = 1 – [(4/9)² + (5/9)²] = 40/81
Step 2:
Gini(Student = False) = 1 – [(1/5)² + (4/5)²] = 8/25
Gini(Student = True) = 1 – [(3/4)² + (1/4)²] = 3/8
GiniGain(Student) = Gini(X) – [4/9 · Gini(Student = True) + 5/9 · Gini(Student = False)] = 0.149
Step 3:
Gini(inHostel = False) = 1 – [(2/4)² + (2/4)²] = 1/2
Gini(inHostel = True) = 1 – [(2/5)² + (3/5)²] = 12/25
GiniGain(inHostel) = Gini(X) – [5/9 · Gini(inHostel = True) + 4/9 · Gini(inHostel = False)] = 0.005
Results:
The best split attribute is Student because it has the higher Gini gain.
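Steps 1–3 can be reproduced as follows (the dictionary encoding of the table is ours):

```python
def gini(counts):
    """Gini index of a class distribution: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_gain(data, attr):
    """GiniGain(A) = Gini(D) - sum(|D_v|/|D| * Gini(D_v))."""
    def class_counts(rows):
        yes = sum(1 for r in rows if r["class"] == "Yes")
        return [yes, len(rows) - yes]
    base = gini(class_counts(data))
    for value in (True, False):
        subset = [r for r in data if r[attr] is value]
        base -= len(subset) / len(data) * gini(class_counts(subset))
    return base

data = [  # the 9 tuples from the table above
    {"student": True,  "inHostel": True,  "class": "Yes"},
    {"student": True,  "inHostel": True,  "class": "Yes"},
    {"student": True,  "inHostel": False, "class": "No"},
    {"student": False, "inHostel": False, "class": "Yes"},
    {"student": False, "inHostel": True,  "class": "No"},
    {"student": False, "inHostel": True,  "class": "No"},
    {"student": False, "inHostel": False, "class": "No"},
    {"student": True,  "inHostel": False, "class": "Yes"},
    {"student": False, "inHostel": True,  "class": "No"},
]
print(round(gini_gain(data, "student"), 3))   # 0.149
print(round(gini_gain(data, "inHostel"), 3))  # 0.005
```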
NAIVE BAYES CLASSIFIER TUTORIAL IN DATA MINING
Naive bayes classifier:
Step 1. Calculate P(Ci):
- P(buys_computer = “no”) = 5/14 = 0.357
- P(buys_computer = “yes”) = 9/14 = 0.643
Step 2. Calculate P(X|Ci) for all classes:
- P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
- P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
- P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
- P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
- P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
- P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
- P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
- P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
Step 3. Select the scenario against which you want to classify:
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Step 4: Calculate P(X|Ci):
- P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
Step 5: Calculate P(X|Ci) * P(Ci):
- P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007
- P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
Therefore, X belongs to class (“buys_computer = yes”).
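Steps 4 and 5 can be sketched directly from the probabilities computed above (the dictionary keys are our own shorthand):

```python
# Conditional probabilities from the steps above
p_yes, p_no = 9 / 14, 5 / 14
cond_yes = {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9}
cond_no = {"age<=30": 3 / 5, "income=medium": 2 / 5,
           "student=yes": 1 / 5, "credit=fair": 2 / 5}

def naive_bayes_score(cond, prior):
    """P(X|Ci) * P(Ci) under the naive independence assumption."""
    score = prior
    for p in cond.values():
        score *= p
    return score

score_yes = naive_bayes_score(cond_yes, p_yes)
score_no = naive_bayes_score(cond_no, p_no)
print(round(score_yes, 3), round(score_no, 3))  # 0.028 0.007
print("yes" if score_yes > score_no else "no")  # yes
```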
BOOSTING IN DATA MINING
What is Boosting?
Boosting is an efficient algorithm that can convert a weak learner into a strong learner.
Example:
Suppose we want to check whether an email is a “spam email” or a “safe email”.
In this case there can be multiple rules, for example:
- Rule 1: The email contains only links to some website. Decision: it is spam.
- Rule 2: The email comes from an official email address, e.g. t4tutorialsfree@gmail.com. Decision: it is not spam.
- Rule 3: The email requests private bank details, e.g. bank account number and father's/mother's name. Decision: it is spam.
Now the question is: are the 3 rules discussed above enough to classify an email as “spam” or not?
- Answer: These 3 rules are not enough. They are weak learners, so we need to boost them; we can combine weak learners into a stronger learner by boosting.
- Boosting is done by combining the weak learners and assigning a weight to each of them.
Boosting generally achieves greater accuracy than bagging.
Types of boosting algorithm:
Three main types of boosting algorithm are as follows;
1. XGBoost algorithm
2. AdaBoost algorithm
3. Gradient tree boosting algorithm
RAINFOREST ALGORITHM IN DATA MINING
What is RainForest?
RainForest is a framework especially designed to classify large data sets.
RainForest uses AVC-sets. An AVC-set (Attribute-Value, Class_label) consists of the following parts:
1. Attribute
2. Value
3. Class_Label
Example:
Income | Rank      | Buy_Mobile
75,000 | Professor | yes
75,000 | Professor | yes
50,000 | Lecturer  | no
After applying the AVC-set, the tables look like:
Income | Buy_Mobile = Yes | Buy_Mobile = No
75,000 | 2                | 0
50,000 | 0                | 1

Rank      | Buy_Mobile = Yes | Buy_Mobile = No
Professor | 2                | 0
Lecturer  | 0                | 1
AVC-sets are built according to the amount of main memory available. Three cases can be described:
1. The AVC-group of the root node fits in main memory.
2. Each individual AVC-set of the root node fits in main memory, but the AVC-group of the root node does not fit in main memory.
3. None of the individual AVC-sets of the root fit in main memory.
HOLDOUT METHOD FOR EVALUATING A CLASSIFIER IN DATA MINING
Holdout method:
The data is randomly partitioned into independent data sets, e.g.:
1. Training set
2. Test set
3. Validation set
Training set:
- The data set used to build (train) the model.
Test set:
- Unseen data used as a subset of the data to assess the performance of the model.
Validation set:
- A data set used to assess the performance of the model built during training (e.g. for tuning).
For example, with two data sets:
- Training set for model construction: 2/3 of the data
- Test set for accuracy estimation: 1/3 of the data
EVALUATION OF A CLASSIFIER BY CONFUSION MATRIX IN DATA MINING
How to evaluate a classifier?
A classifier can be evaluated by building the confusion matrix. The confusion matrix shows the total number of correct and wrong predictions.
The confusion matrix for the class labels positive (+VE) and negative (-VE) is shown below:

                              Actual class (target)
                              +VE             -VE
Predicted   +VE               A = True +VE    B = False +VE    +VE prediction value: A / (A + B)
class       -VE               C = False -VE   D = True -VE     -VE prediction value: D / (C + D)
(model)                       Sensitivity:    Specificity:     Accuracy:
                              A / (A + C)     D / (B + D)      (A + D) / (A + B + C + D)
Accuracy:
Accuracy is the proportion of the total number of predictions that are correct, e.g.
Accuracy = (A + D) / (A + B + C + D)
Error rate:
Error Rate = 1 – Accuracy
+VE predictions:
+VE predictions are the proportion of positive predictions that are correct.
+VE predictions = A / (A + B)
-VE predictions:
-VE predictions are the proportion of negative predictions that are correct.
-VE predictions = D / (C + D)
Precision:
Precision is the fraction of tuples predicted +VE that really are +VE.
Precision = A / (A + B)
Recall:
Recall = A / (number of real positives) = A / (A + C)
Sensitivity (Recall):
Sensitivity is the true +VE rate: the proportion of the actual positive cases that are correctly identified.
Sensitivity (Recall) = A / (A + C)
F-Measure:
F-Measure is the harmonic mean of recall and precision.
F-Measure = 2 * Precision * Recall / (Precision + Recall)
Specificity:
Specificity is true -VE rate.
Specificity is the proportion of the actual -VE cases that are correctly identified.
Specificity = D / (B + D)
Note: Specificity of one class is same as sensitivity of the other class.
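The metrics above can be collected into one helper; the counts tp=40, fp=10, fn=5, tn=45 are hypothetical, chosen only to exercise the formulas:

```python
def classifier_metrics(tp, fp, fn, tn):
    """Standard metrics from the confusion matrix cells
    A = tp, B = fp, C = fn, D = tn."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "error_rate": 1 - accuracy,
            "precision": precision, "recall": recall,
            "specificity": specificity, "f_measure": f_measure}

m = classifier_metrics(tp=40, fp=10, fn=5, tn=45)
print(round(m["accuracy"], 2))   # 0.85
print(round(m["f_measure"], 2))  # 0.84
```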
OVERFITTING OF DECISION TREE AND TREE PRUNING IN DATA MINING
Overfitting of a tree:
Before discussing overfitting, let's revise test data and training data.
Training data:
Training data is the data used to build the model.
Test data:
Test data is used to assess the predictive power of the model built from the training data.
Overfitting:
Overfitting means too many unnecessary branches in the tree. Overfitting results in different kinds of anomalies caused by outliers and noise.
How to avoid overfitting?
There are two techniques to avoid overfitting:
1. Pre-pruning
2. Post-pruning
1. Pre-pruning:
Pre-pruning means stopping the growth of the tree before it is fully grown.
2. Post-pruning:
Post-pruning means allowing the tree to grow with no size limit; after the tree is complete, start to prune it.
Advantages of pre-pruning and post-pruning:
- Pruning prevents the tree from growing unnecessarily large.
- Pruning reduces the complexity of the tree.