Dr. Jaume Bacardit jqb@cs.nott.ac.uk
Topic 1: Preliminaries
• General data mining definitions and concepts
• Handling datasets
• Repositories of datasets
• Experimental evaluation of data mining methods
• Information theory
• Playing a bit with Weka
• In most cases we will treat a dataset as a table with rows and columns
[Figure: a dataset as a table — the columns are the Attributes, the rows are the Instances]
• Instances are the atomic elements of information from a dataset
– Also known as records or prototypes
• Each instance is composed of a certain number of attributes
– Also known as features or variables
– In a dataset attributes can be of different types
• Continuous (or Integer)
• Discrete (i.e. being able to take a value from a finite set)
• Many times the dataset has a special attribute, the class or output
• If it does, the task of the data mining process consists of generating a model that can predict the class/output for a new instance based on the values of the rest of the attributes
• In order to generate this model, we will use a corpus of data for which we already know the answer, the training set
Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
[Figure: supervised learning pipeline — the Training Set is fed to the Learning Algorithm, which produces Models; a New Instance and the Models are fed to the Inference Engine, which outputs an Annotated Instance]
• If the special attribute is discrete
– We call it class
– The dataset is a classification problem
• If the special attribute is continuous
– We call it output
– The dataset is a regression problem
• Also called modelling or function approximation
• Examples of rule-based learners: CN2, RISE, GAssist, BioHEL
[Figure: example rule set over two attributes X, Y in [0,1]:
If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then …
If (X>0.75 and Y>0.75) then …
If (X<0.25 and Y<0.25) then …
Everything else …]
[Figure: a decision tree (ID3/C4.5) — internal nodes test age (<=30, >40), student?, and credit rating (excellent/fair); leaves predict yes/no]
[Figure: two linear separators of the x and o classes — one with a Small Margin, one with a Large Margin; the instances lying on the margin are the Support Vectors]
• When we do not have (or do not take into account) the class/output attribute
• If the goal is to identify aggregations of instances
– Clustering problem
• If the goal is to detect strong patterns in the data
– Association rules/Itemset mining
[Figures: partitional clustering vs. hierarchical clustering]
http://www.mathworks.com/matlabcentral/fx_files/19344/1/k_means.jpg
http://www.scsb.utmb.edu/faculty/luxon.htm
Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
• When the system is given a reward or punishment depending on whether its prediction was correct or not
• The DM system is not told exactly what is right or wrong, just the reward
• RL methods work prediction by prediction
– This is why many times they are called online systems
• In DM most datasets are represented as simple plain text files (or sometimes Excel sheets)
• More sophisticated (and efficient) methods exist
• We need to decide how to specify the dataset structure (i.e. the metadata) and the content (the instances)
• ARFF = Attribute-Relation File Format
• File format from the WEKA DM package
• http://www.cs.waikato.ac.nz/~ml/weka/arff.html
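• For illustration, a minimal sketch of what an ARFF file looks like (the relation, attribute names and values below are invented for the example): the header declares the metadata, the @data section lists the instances
% Toy ARFF file (illustrative content only)
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, 96, yes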
• Much more complex file format designed for scientific data handling
• It can store heterogeneous and hierarchically organized data
• It has been designed for efficiency
• Presentation slides
• UCI repository
– http://archive.ics.uci.edu/ml/
– Probably the most famous collection of datasets in ML!
– Currently has 235 datasets
• Kaggle
– It is not a static repository of datasets, but a site that manages Data Mining competitions
– Example of the modern concept of crowdsourcing
• KDNuggets
– http://www.kdnuggets.com/datasets/
• PSPbenchmarks
– http://www.infobiotic.net/PSPbenchmarks/
– Datasets derived from Protein Structure Prediction problems
– Interesting benchmarks because they can be parametrised in a very large variety of ways
• Pascal Large Scale Learning Challenge
– http://largescale.ml.tu-berlin.de/about/
• We have Data Mining methods and datasets
• How can we know that the patterns they extract are meaningful?
• How can we know how well they perform and which method is the best?
• We need to follow a principled protocol to make sure that the results and conclusions we extract are sound
• The dataset, before the data mining process, is partitioned into two non-overlapping parts:
– Training set. Will be used to generate the model
– Test set. Will be used to validate the model
• We can compute the performance metric (more on the next slides) on both sets. If the metric for the training set is much higher than for the test set, we have a case of overlearning (overfitting)
• That is, the DM method has not modelled the problem, only the training set
• Accuracy: C/D
– C = number of correctly classified examples, D = size of the instance set
– Simplest and most widespread metric
– But what if some classes have more examples than others? The majority class would benefit more from this metric
• Cohen's Kappa
– Computes the agreement between two distributions of categorical variables (i.e. real and predicted classes)
– Takes into account agreement by chance
– Hence, it may be more suitable for multi-class problems
(Garcia et al., 2009)
C = number of classes, x_i = number of examples from class i, x_ii = number of examples correctly classified from class i
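For reference, the usual formulation of Cohen's Kappa in terms of the confusion matrix is given below; the notation x_{i·} and x_{·i} for the row and column totals of the confusion matrix, and D for the total number of examples, is an assumption added here to complete the legend above:
\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = \frac{1}{D}\sum_{i=1}^{C} x_{ii}, \qquad P_e = \frac{1}{D^2}\sum_{i=1}^{C} x_{i\cdot}\, x_{\cdot i}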
• There are more metrics (e.g. ROC curves)
• Regression problems
– RMSD
• Metrics for clustering problems
• Association rule mining
– Support (percentage of instances covered by the pattern)
– Confidence (agreement between the antecedent and the consequent of the rule)
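Stated as formulas (standard definitions, with A the antecedent of the rule, B its consequent and D the number of instances — notation assumed here):
\mathrm{support}(A \Rightarrow B) = \frac{|\{\text{instances containing both } A \text{ and } B\}|}{D}, \qquad \mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \Rightarrow B)}{\mathrm{support}(A)}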
• We have to partition the dataset into training and test, but how is this partitioning done?
• If this partitioning is wrong we will not obtain a good estimation of the method's performance
• Simply divide the dataset into two non-overlapping sets (e.g. 2/3 of the dataset for training, 1/3 for test)
• The simplest method, and computationally cheap
• The performance is computed on the test set
• Its reliability greatly depends on how the sets are partitioned
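A minimal sketch of such a holdout split in R, assuming the instances live in a data frame called dataset (an illustrative name):
set.seed(1)                                   # make the random split reproducible
n <- nrow(dataset)
train_idx <- sample(seq_len(n), size = round(2/3 * n))
training  <- dataset[train_idx, ]             # 2/3 of the instances for training
test      <- dataset[-train_idx, ]            # remaining 1/3 for test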
• Divide the dataset into K strata
• Iterate K times:
– Use sets 1..K-1 for training and set K for test
– Use sets 1..K-2 and K for training and set K-1 for test
– etc.
• Compute the performance estimation as the average of the metric over the K test sets
• A more robust estimator, but computationally more costly
• If each of the K strata has the same class distribution as the whole dataset, this method is called stratified K-fold cross-validation
• Leave-one-out cross-validation
– Extreme case of cross-validation, where K=D, the size of the dataset
– Training sets will have size D-1 and test sets will have only one instance
– Mostly used in datasets of small size
• Bootstrap
– Generate a sample with replacement from the dataset with size D. Use it as training set
– Use the instances never selected as test set
– Repeat this process a high number of times
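A minimal sketch of stratified K-fold cross-validation in R, tying the above together; dataset is assumed to have a factor column class, and train_model()/predict_model() are hypothetical stand-ins for the learner being evaluated:
k <- 10
set.seed(1)
folds <- integer(nrow(dataset))
for (cl in levels(dataset$class)) {            # assign folds within each class: stratified
  idx <- which(dataset$class == cl)
  folds[idx] <- sample(rep(1:k, length.out = length(idx)))
}
accs <- sapply(1:k, function(i) {
  training <- dataset[folds != i, ]
  test     <- dataset[folds == i, ]
  model <- train_model(training)               # hypothetical learner
  preds <- predict_model(model, test)          # hypothetical prediction step
  mean(preds == test$class)                    # accuracy on held-out fold i
})
mean(accs)                                     # cross-validated performance estimate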
• But which method is best?
• Two criteria to take into account
– Bias: Difference between the estimation and the true error
– Variance of the estimation across samples
• Formal and experimental analysis of some of these methods
• Experimental study for microarray data
• At the end of an experimentation we will have tested N methods on D datasets, thus we will have a NxD table of performance metrics
• How can we identify the best method?
– Highest average performance across the D datasets?
– Highest average rank across the D datasets?
• Also, is the best method significantly better than the others?
• For this we need statistical tests (next slide)
• Best study nowadays on statistical tests
• Most of the tests described in the next slides can be found in many statistical packages (e.g. R)
• Procedure for making decisions about data
• Each test defines a null hypothesis (H0) about the observed data
– “The accuracy differences I observe between A and B are just by pure chance and the methods perform equally”
– Or in a more statistical way “The two sets of data observations A and B belong to the same distribution”
• Then, the probability of H0 being true given the observed data is calculated
• If this probability (the p-value) is smaller than a certain threshold, H0 is rejected and the observed differences are statistically significant
• Tests whether the difference between two distributions is significant or not
• Generates a p-value, in this case the probability that the observed differences would arise if the two samples actually came from the same distribution
• P-values of 0.05 or 0.01 are typical thresholds
• Can only be applied if the distributions are normal and have the same variance. These conditions are hardly ever met
• Non-parametric tests, which is not affected by the normality of the distribution
• It ranks the absolute differences in performance, dataset by dataset
• Next, it sums the ranks separately for the positive and negative differences
• A p-value will be generated depending on which sum of ranks is smaller and the number of datasets
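A minimal sketch in R, with invented accuracies for two methods A and B on the same six datasets:
accA <- c(0.81, 0.75, 0.90, 0.68, 0.77, 0.83)  # accuracy of method A on 6 datasets (made up)
accB <- c(0.80, 0.73, 0.87, 0.72, 0.72, 0.77)  # accuracy of method B on the same datasets
wilcox.test(accA, accB, paired = TRUE)         # paired Wilcoxon signed-rank test: p-value for H0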
• Both the t-test and the Wilcoxon test are used to compare two methods
• What if we have more than two methods in our comparison?
– You may hold that A is better than B and that A is better than C, each individually with 95% confidence, but can you claim with the same confidence that it is better than both B and C at the same time?
– The individual p-values do not hold anymore
• We need to apply a correction for multiple tests (e.g. Bonferroni, Holm, etc.)
• Designed explicitly to compare multiple methods
• Based on the average rank of each method across the datasets
• This test just says if the performance of the methods included is similar or not
• Can be used when there are more than 10 datasets and more than 5 methods
• Once the test has determined that there are significant performance differences, a post-hoc test is used to spot them
• Comparing every method to each other (e.g. the Nemenyi test)
• Comparing a control method against the others (e.g. the Holm test)
– E.g. the best method against the others
– This latter kind of test gives less information but is also more powerful (i.e. less conservative)
• Both are based on ranks
• R is an open source statistical computing package
• It is extremely powerful, but not the simplest tool in the world to use…
• It has built-in functions for most of the tests described in the previous slides
• Example from the command line….
• A plain text file with rows consisting of:
– <method> <dataset> <performance metric>
• Example file
• CV has already been computed
• Let's load the data in R:
$ R
> data<-read.table("rchA.dat")
> data
V1 V2 V3
1 C4_5 rchA_1_q2 0.664884
2 C4_5 rchA_2_q2 0.690657
3 C4_5 rchA_3_q2 0.731942
4 C4_5 rchA_4_q2 0.756022
5 C4_5 rchA_5_q2 0.733984
> pairwise.t.test(data$V3,data$V1,pool.sd=F,paired=T)
Pairwise comparisons using t tests with non-pooled SD
data: data$V3 and data$V1
C4_5 HEL+ LCS+
HEL+ 0.00056 -
LCS+ 0.00038 0.53291 -
NB 0.53291 0.52834 0.53291
P value adjustment method: holm
> pairwise.wilcox.test(data$V3,data$V1,paired=T)
Pairwise comparisons using Wilcoxon rank sum test
data: data$V3 and data$V1
C4_5 HEL+ LCS+
HEL+ 0.00038 -
LCS+ 0.00032 0.72586 -
NB 0.72586 0.72586 0.72586
P value adjustment method: holm
> pairwise.wilcox.test(data$V3,data$V1,paired=T)$p.value
             C4_5      HEL+      LCS+
HEL+ 0.0003814697        NA        NA
LCS+ 0.0003204346 0.7258606        NA
NB   0.7258605957 0.7258606 0.7258606
Saving the results to a file:
> write.table(pairwise.wilcox.test(data$V3,data$V1,paired=T)$p.value,"wilcox.dat")
• Code for the Holm and Nemenyi post hoc tests
• Code for accuracy (between 0 and 1)
> source("fried+holm.r
” )
> fried.holm("rchA.dat")
Friedman rank sum test
data: data$V3, data$V1 and data$V2
Friedman chi-squared = 22.4667, df = 3, p-value = 5.216e-05
[1] "Average ranks"
C4_5 HEL+ LCS+ NB
3.666667 1.722222 2.166667 2.444444
[1] "Control is HEL+"
[1] "Confidence level of 0.950000"
[1] "HEL+ is significantly better than C4_5"
[1] "Confidence level of 0.990000"
[1] "HEL+ is significantly better than C4_5"
• N datasets, M methods
– Method can mean anything (mining, preprocessing, etc.)
• For each of the N datasets
– Partition it into training and test sets
– For each of the K pairs of (Training(n,k), Test(n,k)) sets
• For each of the M methods
– model = train method m using set Training(n,k)
– Metric[n][m][k] = test model using set Test(n,k)
– Metric[n][m] = average of Metric[n][m][k] across the K pairs
• Compute average ranks of each method across datasets
• Run statistical tests using Metric matrix
– Are there overall statistically significant performance differences?
– Individual tests one-vs-rest or all-vs-all
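As a minimal sketch, the last two steps could look as follows in R, assuming the averaged results have been collected in a matrix called metric with one row per dataset and one column per method (both the name and the layout are assumptions):
ranks <- t(apply(metric, 1, function(row) rank(-row)))  # per-dataset ranks, 1 = best method
colMeans(ranks)         # average rank of each method across the datasets
friedman.test(metric)   # overall test: are there significant differences at all?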
• Shannon’s Entropy
– Measure of the information content in discrete data
– Metric inspired by statistical information transmission
• “How many bits per symbol (on average) would it take to transmit a message, given the frequencies of each symbol?”
H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
• If we are tossing a coin and want to send a sequence of tosses
– If heads and tails have the same probability it would take one bit per toss
– If heads have a very small probability compared to tails we could just send a list of the heads and say that everything else is a tail
• Each head message would be long, but there would be very few of them, so overall it would take less than one bit per symbol (on average)
• Conditional Entropy
– Entropy of a certain variable given that we know all about a related variable
H(Y|X) = \sum_{x \in X} p(x)\, H(Y|X=x)
• Information gain
– How many bits would I save from transmitting Y if I know about X?
– IG(Y|X) = H(Y) – H(Y|X)
• Mutual information
– Assesses the interrelationship between two variables
I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y)\,\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)
• We have a continuous variable where each data point is associated with a class
– 10 red dots, 10 blue dots
– H(X) = -(0.5log(0.5)+0.5log(0.5))=1
• What if we split this variable in two?
– Where to split?
– The point where we obtain maximum IG; the conditioning variable is the splitting point
IG(X; cutp) = H(X) - \left(\frac{|X_{left}|}{|X|}\, H(X;left) + \frac{|X_{right}|}{|X|}\, H(X;right)\right)
• H(X;left) = 0.469, H(X;right) = 0.469, IG=0.531
• H(X;left) = 0, H(X;right) = 0.863, IG=0.384
• H(X;left) = 0.863, H(X;right)=0, IG=0.384
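A minimal sketch of these computations in R (entropy(), info_gain() and the toy data are illustrative, not part of any package):
entropy <- function(labels) {                  # Shannon entropy in bits
  p <- table(labels) / length(labels)
  p <- p[p > 0]                                # ignore empty classes
  -sum(p * log2(p))
}
info_gain <- function(x, classes, cutp) {      # IG of splitting x at cut point cutp
  left  <- classes[x <  cutp]
  right <- classes[x >= cutp]
  entropy(classes) -
    (length(left)  / length(classes) * entropy(left) +
     length(right) / length(classes) * entropy(right))
}
# Toy data: 10 "red" and 10 "blue" points -> H(X) = 1 bit, as in the slides
x       <- (1:20) / 20
classes <- rep(c("red", "blue"), each = 10)
entropy(classes)              # 1
info_gain(x, classes, 0.525)  # a perfect split: IG = 1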
• Download the WEKA GUI from here
• Dataset for this demo
• What I am going to show are just some of the steps from this tutorial
$ java weka.classifiers.trees.J48 -t dataset.arff
J48 pruned tree
------------------
att16 <= 1.28221
|   att1 <= 1.33404
|   |   att3 <= 1.37681: 3 (6.0)
|   |   att3 > 1.37681: 2 (4.0/1.0)
|   att1 > 1.33404
|   |   att17 <= 1.52505
|   |   |   att5 <= 1.34135: 3 (3.0)
|   |   |   att5 > 1.34135
|   |   |   |   att8 <= 1.3391: 3 (4.0/1.0)
|   |   |   |   att8 > 1.3391: 2 (19.0/1.0)
|   |   att17 > 1.52505: 2 (18.0)
att16 > 1.28221: 3 (938.0/456.0)
Number of Leaves : 7
Size of the tree : 13
=== Confusion Matrix ===
  a   b   <-- classified as
39 457 | a = 2
2 494 | b = 3
=== Stratified cross-validation ===
Correctly Classified Instances 491 49.496 %
Incorrectly Classified Instances 501 50.504 %
Kappa statistic -0.0101
Mean absolute error 0.4992
Root mean squared error 0.5111
Relative absolute error 99.8304 %
Root relative squared error 102.214 %
Total Number of Instances 992
=== Confusion Matrix ===
  a   b   <-- classified as
141 355 | a = 2
146 350 | b = 3
Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.05 seconds
=== Error on training data ===
Correctly Classified Instances 533 53.7298 %
Incorrectly Classified Instances 459 46.2702 %
Kappa statistic 0.0746
Mean absolute error 0.4774
Root mean squared error 0.4885
Relative absolute error 95.4706 %
Root relative squared error 97.7091 %
Total Number of Instances 992
• Knowledge Extraction using Evolutionary Learning
• Another data mining platform with integrated graphical design of experiments
• Download prototype
• Manual
• Instructions about integrating new methods into it (slides, paper)