Dr. Jaume Bacardit jqb@cs.nott.ac.uk
Topic 1: Preliminaries
• General data mining definitions and concepts
• Handling datasets
• Repositories of datasets
• Experimental evaluation of data mining methods
• Information theory
• Playing a bit with Weka
• In most cases we will treat a dataset as a table with rows and columns
[Figure: a dataset as a table — the columns are the Attributes, the rows are the Instances]
• Instances are the atomic elements of information from a dataset
– Also known as records or prototypes
• Each instance is composed of a certain number of attributes
– Also known as features or variables
– In a dataset attributes can be of different types
• Continuous (or Integer)
• Discrete (i.e. being able to take a value from a finite set)
• Many times the dataset has a special attribute, the class or output
• If it does, the task of the data mining process consists of generating a model that can predict the class/output for a new instance based on the values of the rest of the attributes
• In order to generate this model, we will use a corpus of data for which we already know the answer, the training set
Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
[Figure: supervised learning pipeline — the Training Set is fed to the Learning Algorithm, which produces Models; a New Instance and the Models are fed to the Inference Engine, which outputs an Annotated Instance]
• If the special attribute is discrete
– We call it class
– The dataset is a classification problem
• If the special attribute is continuous
– We call it output
– The dataset is a regression problem
• Also called modelling or function approximation
• Examples of rule-based learners: CN2, RISE, GAssist, BioHEL
[Figure: example rule set over two attributes X, Y in [0,1]:
If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then …
If (X>0.75 and Y>0.75) then …
If (X<0.25 and Y<0.25) then …
Everything else …]
[Figure: a decision tree (ID3/C4.5) — internal nodes test age (<=30, >40), student?, and credit rating (excellent/fair); leaves predict yes/no]
[Figure: two linear separators of the x and o classes — one with a Small Margin, one with a Large Margin; the instances lying on the margin are the Support Vectors]
• When we do not have (or do not take into account) the class/output attribute
• If the goal is to identify aggregations of instances
– Clustering problem
• If the goal is to detect strong patterns in the data
– Association rules/Itemset mining
[Figures: partitional clustering vs. hierarchical clustering]
http://www.mathworks.com/matlabcentral/fx_files/19344/1/k_means.jpg
http://www.scsb.utmb.edu/faculty/luxon.htm
Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
• When the system is given a reward or punishment depending on whether its prediction was correct or not
• The DM system is not told exactly what is right or wrong, just the reward
• RL methods work prediction by prediction
– This is why many times they are called online systems
• In DM most datasets are represented as simple plain text files (or sometimes Excel sheets)
• More sophisticated (and efficient) methods exist
• We need to decide how to specify the dataset structure (i.e. the metadata) and the content (the instances)
• ARFF = Attribute-Relation File Format
• File format from the WEKA DM package
• http://www.cs.waikato.ac.nz/~ml/weka/arff.html
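• For illustration, a minimal sketch of what an ARFF file looks like (the relation, attribute names and values below are invented for the example): the header declares the metadata, the @data section lists the instances
% Toy ARFF file (illustrative content only)
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, 96, yes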
• Much more complex file format designed for scientific data handling
• It can store heterogeneous and hierarchically organized data
• It has been designed for efficiency
• Presentation slides
• UCI repository
– http://archive.ics.uci.edu/ml/
– Probably the most famous collection of datasets in ML!
– Currently has 235 datasets
• Kaggle
– It is not a static repository of datasets, but a site that manages Data Mining competitions
– Example of the modern concept of crowdsourcing
• KDNuggets
– http://www.kdnuggets.com/datasets/
• PSPbenchmarks
– http://www.infobiotic.net/PSPbenchmarks/
– Datasets derived from Protein Structure Prediction problems
– Interesting benchmarks because they can be parametrised in a very large variety of ways
• Pascal Large Scale Learning Challenge
– http://largescale.ml.tu-berlin.de/about/
• We have Data Mining methods and datasets
• How can we know that the patterns they extract are meaningful?
• How can we know how well they perform and which method is the best?
• We need to follow a principled protocol to make sure that the results and conclusions we extract are sound
• The dataset, before the data mining process, is partitioned into two non-overlapping parts:
– Training set. Will be used to generate the model
– Test set. Will be used to validate the model
• We can compute the performance metric (more on the next slides) on both sets. If the metric for the training set is much higher than for the test set, we have a case of overlearning (overfitting)
• That is, the DM method has not modelled the problem, only the training set
• Accuracy: C/D
– C = number of correctly classified examples, D = size of the instance set
– Simplest and most widespread metric
– But what if some classes have more examples than others? The majority class would benefit more from this metric
• Cohen's Kappa
– Computes the agreement between two distributions of categorical variables (i.e. real and predicted classes)
– Takes into account agreement by chance
– Hence, it may be more suitable for multi-class problems
(Garcia et al., 2009)
C = number of classes, x_i = number of examples from class i, x_ii = number of examples correctly classified from class i
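For reference, the usual formulation of Cohen's Kappa in terms of the confusion matrix is given below; the notation x_{i·} and x_{·i} for the row and column totals of the confusion matrix, and D for the total number of examples, is an assumption added here to complete the legend above:
\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = \frac{1}{D}\sum_{i=1}^{C} x_{ii}, \qquad P_e = \frac{1}{D^2}\sum_{i=1}^{C} x_{i\cdot}\, x_{\cdot i}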
• There are more metrics (e.g. ROC curves)
• Regression problems
– RMSD
• Metrics for clustering problems
• Association rule mining
– Support (percentage of instances covered by the pattern)
– Confidence (agreement between the antecedent and the consequent of the rule)
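Stated as formulas (standard definitions, with A the antecedent of the rule, B its consequent and D the number of instances — notation assumed here):
\mathrm{support}(A \Rightarrow B) = \frac{|\{\text{instances containing both } A \text{ and } B\}|}{D}, \qquad \mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \Rightarrow B)}{\mathrm{support}(A)}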
• We have to partition the dataset into training and test, but how is this partitioning done?
• If this partitioning is wrong we will not obtain a good estimation of the method's performance
• Simply divide the dataset into two non-overlapping sets (e.g. 2/3 of the dataset for training, 1/3 for test)
• The simplest method, and computationally cheap
• The performance is computed on the test set
• Its reliability greatly depends on how the sets are partitioned
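A minimal sketch of such a holdout split in R, assuming the instances live in a data frame called dataset (an illustrative name):
set.seed(1)                                   # make the random split reproducible
n <- nrow(dataset)
train_idx <- sample(seq_len(n), size = round(2/3 * n))
training  <- dataset[train_idx, ]             # 2/3 of the instances for training
test      <- dataset[-train_idx, ]            # remaining 1/3 for test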
• Divide the dataset into K strata
• Iterate K times:
– Use sets 1..K-1 for training and set K for test
– Use sets 1..K-2 and K for training and set K-1 for test
– etc.
• Compute the performance estimation as the average of the metric over the K test sets
• A more robust estimator, but computationally more costly
• If each of the K strata has the same class distribution as the whole dataset, this method is called stratified K-fold cross-validation
• Leave-one-out cross-validation
– Extreme case of cross-validation, where K=D, the size of the dataset
– Training sets will have size D-1 and test sets will have only one instance
– Mostly used in datasets of small size
• Bootstrap
– Generate a sample with replacement from the dataset with size D. Use it as training set
– Use the instances never selected as test set
– Repeat this process a high number of times
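A minimal sketch of stratified K-fold cross-validation in R, tying the above together; dataset is assumed to have a factor column class, and train_model()/predict_model() are hypothetical stand-ins for the learner being evaluated:
k <- 10
set.seed(1)
folds <- integer(nrow(dataset))
for (cl in levels(dataset$class)) {            # assign folds within each class: stratified
  idx <- which(dataset$class == cl)
  folds[idx] <- sample(rep(1:k, length.out = length(idx)))
}
accs <- sapply(1:k, function(i) {
  training <- dataset[folds != i, ]
  test     <- dataset[folds == i, ]
  model <- train_model(training)               # hypothetical learner
  preds <- predict_model(model, test)          # hypothetical prediction step
  mean(preds == test$class)                    # accuracy on held-out fold i
})
mean(accs)                                     # cross-validated performance estimate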
• But which method is best?
• Two criteria to take into account
– Bias: Difference between the estimation and the true error
– Variance of the estimation across samples
• Formal and experimental analysis of some of these methods
• Experimental study for microarray data
• At the end of an experimentation we will have tested N methods on D datasets, thus we will have a NxD table of performance metrics
• How can we identify the best method?
– Highest average performance across the D datasets?
– Highest average rank across the D datasets?
• Also, is the best method significantly better than the others?
• For this we need statistical tests (next slide)
• Best study nowadays on statistical tests
• Most of the tests described in the next slides can be found in many statistical packages (e.g. R)
• Procedure for making decisions about data
• Each test defines a null hypothesis (H0) about the observed data
– “The accuracy differences I observe between A and B are just by pure chance and the methods perform equally”
– Or in a more statistical way “The two sets of data observations A and B belong to the same distribution”
• Then, the probability of H0 being true given the observed data is calculated
• If this probability (the p-value) is smaller than a certain threshold, H0 is rejected and the observed differences are statistically significant
• Tests whether the difference between two distributions is significant or not
• Generates a p-value, in this case the probability that the observed differences would arise if the two samples actually came from the same distribution
• P-values of 0.05 or 0.01 are typical thresholds
• Can only be applied if the distributions are normal and have the same variance. These conditions are hardly ever met
• Non-parametric tests, which is not affected by the normality of the distribution
• It ranks the absolute differences in performance, dataset by dataset
• Next, it sums the ranks separately for the positive and negative differences
• A p-value will be generated depending on which sum of ranks is smaller and the number of datasets
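A minimal sketch in R, with invented accuracies for two methods A and B on the same six datasets:
accA <- c(0.81, 0.75, 0.90, 0.68, 0.77, 0.83)  # accuracy of method A on 6 datasets (made up)
accB <- c(0.80, 0.73, 0.87, 0.72, 0.72, 0.77)  # accuracy of method B on the same datasets
wilcox.test(accA, accB, paired = TRUE)         # paired Wilcoxon signed-rank test: p-value for H0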
• Both the t-test and the Wilcoxon test are used to compare two methods
• What if we have more than two methods in our comparison?
– You may hold that A is better than B and that A is better than C, each individually with 95% confidence, but can you claim with the same confidence that it is better than both B and C at the same time?
– The individual p-values do not hold anymore
• We need to apply a correction for multiple tests (e.g. Bonferroni, Holm, etc.)
• Designed explicitly to compare multiple methods
• Based on the average rank of each method across the datasets
• This test just says if the performance of the methods included is similar or not
• Can be used when there are more than 10 datasets and more than 5 methods
• Once the test has determined that there are significant performance differences, a post-hoc test is used to spot them
• Comparing every method to each other (e.g. the Nemenyi test)
• Comparing a control method against the others (e.g. the Holm test)
– E.g. the best method against the others
– This latter kind of test gives less information but is also more powerful (i.e. less conservative)
• Both are based on ranks
• R is an open source statistical computing package
• It is extremely powerful, but not the simplest tool in the world to use…
• It has built-in functions for most of the tests described in the previous slides
• Example from the command line….
• A plain text file with rows consisting of:
– <method> <dataset> <performance metric>
• Example file
• CV has already been computed
• Let's load the data in R:
$ R
> data<-read.table("rchA.dat")
> data
V1 V2 V3
1 C4_5 rchA_1_q2 0.664884
2 C4_5 rchA_2_q2 0.690657
3 C4_5 rchA_3_q2 0.731942
4 C4_5 rchA_4_q2 0.756022
5 C4_5 rchA_5_q2 0.733984
> pairwise.t.test(data$V3,data$V1,pool.sd=F,paired=T)
Pairwise comparisons using t tests with non-pooled SD
data: data$V3 and data$V1
C4_5 HEL+ LCS+
HEL+ 0.00056 -
LCS+ 0.00038 0.53291 -
NB 0.53291 0.52834 0.53291
P value adjustment method: holm
> pairwise.wilcox.test(data$V3,data$V1,paired=T)
Pairwise comparisons using Wilcoxon rank sum test
data: data$V3 and data$V1
C4_5 HEL+ LCS+
HEL+ 0.00038 -
LCS+ 0.00032 0.72586 -
NB 0.72586 0.72586 0.72586
P value adjustment method: holm
> pairwise.wilcox.test(data$V3,data$V1,paired=T)$p.value
             C4_5      HEL+      LCS+
HEL+ 0.0003814697        NA        NA
LCS+ 0.0003204346 0.7258606        NA
NB   0.7258605957 0.7258606 0.7258606
Saving the results to a file:
> write.table(pairwise.wilcox.test(data$V3,data$V1,paired=T)$p.value,"wilcox.dat")
• Code for the Holm and Nemenyi post hoc tests
• Code for accuracy (between 0 and 1)
> source("fried+holm.r
” )
> fried.holm("rchA.dat")
Friedman rank sum test
data: data$V3, data$V1 and data$V2
Friedman chi-squared = 22.4667, df = 3, p-value = 5.216e-05
[1] "Average ranks"
C4_5 HEL+ LCS+ NB
3.666667 1.722222 2.166667 2.444444
[1] "Control is HEL+"
[1] "Confidence level of 0.950000"
[1] "HEL+ is significantly better than C4_5"
[1] "Confidence level of 0.990000"
[1] "HEL+ is significantly better than C4_5"
• N datasets, M methods
– Method can mean anything (mining, preprocessing, etc.)
• For each of the N datasets
– Partition it into training and test sets
– For each of the K pairs of (Training(n,k), Test(n,k)) sets
• For each of the M methods
– model = train method m using set Training(n,k)
– Metric[n][m][k] = test model using set Test(n,k)
– Metric[n][m] = average of Metric[n][m][k] across the K pairs
• Compute average ranks of each method across datasets
• Run statistical tests using Metric matrix
– Are there overall statistically significant performance differences?
– Individual tests one-vs-rest or all-vs-all
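As a minimal sketch, the last two steps could look as follows in R, assuming the averaged results have been collected in a matrix called metric with one row per dataset and one column per method (both the name and the layout are assumptions):
ranks <- t(apply(metric, 1, function(row) rank(-row)))  # per-dataset ranks, 1 = best method
colMeans(ranks)         # average rank of each method across the datasets
friedman.test(metric)   # overall test: are there significant differences at all?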
• Shannon’s Entropy
– Measure of the information content in discrete data
– Metric inspired by statistical information transmission
• “How many bits per symbol (on average) would it take to transmit a message, given the frequencies of each symbol?”
H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
• If we are tossing a coin and want to send a sequence of tosses
– If heads and tails have the same probability it would take one bit per toss
– If heads have a very small probability compared to tails we could just send a list of the heads and say that everything else is a tail
• Each head message would be long, but there would be very few of them, so overall it would take less than one bit per symbol (on average)
• Conditional Entropy
– Entropy of a certain variable given that we know all about a related variable
H(Y|X) = \sum_{x \in X} p(x)\, H(Y|X=x)
• Information gain
– How many bits would I save from transmitting Y if I know about X?
– IG(Y|X) = H(Y) – H(Y|X)
• Mutual information
– Assesses the interrelationship between two variables
I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y)\,\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)
• We have a continuous variable where each data point is associated with a class
– 10 red dots, 10 blue dots
– H(X) = -(0.5log(0.5)+0.5log(0.5))=1
• What if we split this variable in two?
– Where to split?
– The point where we obtain maximum IG; the conditioning variable is the splitting point
IG(X; cutp) = H(X) - \left(\frac{|X_{left}|}{|X|}\, H(X;left) + \frac{|X_{right}|}{|X|}\, H(X;right)\right)
• H(X;left) = 0.469, H(X;right) = 0.469, IG=0.531
• H(X;left) = 0, H(X;right) = 0.863, IG=0.384
• H(X;left) = 0.863, H(X;right)=0, IG=0.384
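A minimal sketch of these computations in R (entropy(), info_gain() and the toy data are illustrative, not part of any package):
entropy <- function(labels) {                  # Shannon entropy in bits
  p <- table(labels) / length(labels)
  p <- p[p > 0]                                # ignore empty classes
  -sum(p * log2(p))
}
info_gain <- function(x, classes, cutp) {      # IG of splitting x at cut point cutp
  left  <- classes[x <  cutp]
  right <- classes[x >= cutp]
  entropy(classes) -
    (length(left)  / length(classes) * entropy(left) +
     length(right) / length(classes) * entropy(right))
}
# Toy data: 10 "red" and 10 "blue" points -> H(X) = 1 bit, as in the slides
x       <- (1:20) / 20
classes <- rep(c("red", "blue"), each = 10)
entropy(classes)              # 1
info_gain(x, classes, 0.525)  # a perfect split: IG = 1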
• Download the WEKA GUI from here
• Dataset for this demo
• What I am going to show are just some of the steps from this tutorial
$ java weka.classifiers.trees.J48 -t dataset.arff
J48 pruned tree
------------------
att16 <= 1.28221
|   att1 <= 1.33404
|   |   att3 <= 1.37681: 3 (6.0)
|   |   att3 > 1.37681: 2 (4.0/1.0)
|   att1 > 1.33404
|   |   att17 <= 1.52505
|   |   |   att5 <= 1.34135: 3 (3.0)
|   |   |   att5 > 1.34135
|   |   |   |   att8 <= 1.3391: 3 (4.0/1.0)
|   |   |   |   att8 > 1.3391: 2 (19.0/1.0)
|   |   att17 > 1.52505: 2 (18.0)
att16 > 1.28221: 3 (938.0/456.0)
Number of Leaves : 7
Size of the tree : 13
=== Confusion Matrix ===
  a   b   <-- classified as
39 457 | a = 2
2 494 | b = 3
=== Stratified cross-validation ===
Correctly Classified Instances 491 49.496 %
Incorrectly Classified Instances 501 50.504 %
Kappa statistic -0.0101
Mean absolute error 0.4992
Root mean squared error 0.5111
Relative absolute error 99.8304 %
Root relative squared error 102.214 %
Total Number of Instances 992
=== Confusion Matrix ===
  a   b   <-- classified as
141 355 | a = 2
146 350 | b = 3
Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.05 seconds
=== Error on training data ===
Correctly Classified Instances 533 53.7298 %
Incorrectly Classified Instances 459 46.2702 %
Kappa statistic 0.0746
Mean absolute error 0.4774
Root mean squared error 0.4885
Relative absolute error 95.4706 %
Root relative squared error 97.7091 %
Total Number of Instances 992
• Knowledge Extraction using Evolutionary Learning
• Another data mining platform with integrated graphical design of experiments
• Download prototype
• Manual
• Instructions about integrating new methods into it (slides, paper)