Conventional Data Mining Techniques I
A B M Shawkat Ali

PowerPoint permissions
Cengage Learning Australia hereby permits the usage and posting of our copyright-controlled PowerPoint slide content for all courses wherein the associated text has been adopted. PowerPoint slides may be placed on course management systems that operate under a controlled environment (access restricted to enrolled students, instructors and content administrators). Cengage Learning Australia does not require a copyright clearance form for the usage of PowerPoint slides as outlined above.
Copyright © 2007 Cengage Learning Australia Pty Limited

My Request
"A good listener is not only popular everywhere, but after a while he gets to know something" - Wilson Mizner

Objectives
On completion of this lecture you should know:
• What a classification task is.
• How the Decision Tree algorithm works.
• Experimental testing with Decision Tree.
• How the Naive Bayes algorithm works.
• Naive Bayes demonstration with Weka.

Characteristics of classification
• Classifying attributes: the group of variables that define the class of an instance or record.
• Predicting attributes: the group of variables that determine the class of an instance.
• Supervised learning: classifying attributes are dependent variables and necessarily categorical, whereas predicting attributes are independent variables.
N.B. The term group also includes a singleton.

Three approaches
• Divide the domain into distinct regions for each class. An instance contains enough information to belong to a class.
• Find the probability that an instance falls into a given class.
• Find the probability that a class contains a given instance.

Supervised learning
• One or more dependent variables and one or more independent variables. Classes are predefined.
• To develop a learner model, part of the known data is usually used for training and the rest for testing. Often 70% of the data is used as training data.

Decision tree
• A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
A decision tree model for the Weather dataset (see Figure 4.12 later in this lecture).

Categorising continuous attributes
• The split should be made where the error is minimum:
  Temperature: 15  18  22   28   34   40
  Play Golf:   No  No  Yes  Yes  Yes  No
• There are two possible splits between a Yes and a No: one at 20 [(18+22)/2] and another at 37 [(34+40)/2]. The split at 20 gives one error and the split at 37 gives two errors.

Entropy measure
Suppose a message looks like this: DADAABCDABCCADBB...
You wish to transmit it as a binary code: A=00, B=01, C=10, D=11. You need 2 bits per letter.
Suppose you learn that A and B are more frequent and the probabilities are P(A)=1/2, P(B)=1/4, P(C)=1/8 and P(D)=1/8. If you adopt the code A=0, B=10, C=110, D=111, you can transmit the message using 1.75 bits per letter on average.
This fact is captured by the entropy measure:
H(X) = -p1 log2(p1) - p2 log2(p2) - ...
For the first scenario (all four letters equally likely):
H(X) = -(1/4)log2(1/4) - (1/4)log2(1/4) - (1/4)log2(1/4) - (1/4)log2(1/4) = 2
For the second scenario:
H(X) = -(1/2)log2(1/2) - (1/4)log2(1/4) - (1/8)log2(1/8) - (1/8)log2(1/8) = 1.75
Generally, if there are n possibilities with pi the probability of event i, the entropy is given by:
H(X) = -Σ (i = 1..n) pi log2(pi)
The Greek letter sigma (Σ) simply indicates the summation of all the values.
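As a quick illustration (not part of the original slides), a minimal Python sketch that evaluates the entropy formula for the two coding scenarios above:

import math

def entropy(probs):
    # H(X) = -sum(p * log2(p)) over the symbol probabilities, in bits per symbol
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/4, 1/4, 1/4, 1/4]))   # all four letters equally likely: 2.0 bits per letter
print(entropy([1/2, 1/4, 1/8, 1/8]))   # P(A)=1/2, P(B)=1/4, P(C)=P(D)=1/8: 1.75 bits per letter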
The entropy equation can be extended to conditional probability. For instance, what is the entropy of attribute Y when we already know the outcome of attribute X? Involving two attributes X and Y, we have:
H(Y|X) = -Σ pX Σ pY|X log2(pY|X)
where the sums run over the values of X and of Y given X.

Body image data

Name     Weight   Height   Food Habit      Fitness Program   Concern
Nancy    average  short    vegetarian      yes               satisfied (negative)
Karen    average  short    no restriction  no                dissatisfied (positive)
Wendy    heavy    average  no red meat     no                dissatisfied
Melissa  heavy    tall     vegetarian      no                satisfied
Doiya    heavy    average  vegetarian      no                satisfied
Kate     light    short    no restriction  yes               satisfied
Judy     light    average  no restriction  no                dissatisfied
Ruma     average  tall     no restriction  yes               satisfied

For attribute Food Habit the three branches contain:
• b1 (no restriction): 2 positive, 2 negative
• b2 (no red meat): 1 positive
• b3 (vegetarian): 3 negative

Sample average entropy calculation for the attribute Food Habit:
Average entropy = Σ over branches b of (nb/nt) × [ -Σ over classes c of (nbc/nb) log2(nbc/nb) ]
= (4/8)[-(2/4)log2(2/4) - (2/4)log2(2/4)] + (1/8)[-(1/1)log2(1/1)] + (3/8)[-(3/3)log2(3/3)]
= (4/8)(0.5 + 0.5) + (1/8)(0) + (3/8)(0)
= 0.50
Average entropy for Food Habit = 0.50

For attribute Height the three branches contain:
• b1 (short): 1 positive, 2 negative
• b2 (average): 2 positive, 1 negative
• b3 (tall): 2 negative

Sample average entropy calculation for the attribute Height:
= (3/8)[-(1/3)log2(1/3) - (2/3)log2(2/3)] + (3/8)[-(2/3)log2(2/3) - (1/3)log2(1/3)] + (2/8)[-(2/2)log2(2/2)]
= (3/8)(0.528 + 0.390) + (3/8)(0.390 + 0.528) + (2/8)(0)
= 0.69
Average entropy for Height = 0.69

For attribute Weight the three branches contain:
• b1 (light): 1 positive, 1 negative
• b2 (average): 1 positive, 2 negative
• b3 (heavy): 1 positive, 2 negative

Sample average entropy calculation for the attribute Weight:
= (2/8)[-(1/2)log2(1/2) - (1/2)log2(1/2)] + (3/8)[-(1/3)log2(1/3) - (2/3)log2(2/3)] + (3/8)[-(1/3)log2(1/3) - (2/3)log2(2/3)]
= (2/8)(0.5 + 0.5) + (3/8)(0.528 + 0.390) + (3/8)(0.528 + 0.390)
= 0.94
Average entropy for Weight = 0.94

For attribute Fitness Program the two branches contain:
• b1 (no): 3 positive, 2 negative
• b2 (yes): 3 negative

Sample average entropy calculation for the attribute Fitness Program:
= (5/8)[-(3/5)log2(3/5) - (2/5)log2(2/5)] + (3/8)[-(3/3)log2(3/3)]
= (5/8)(0.442 + 0.529) + (3/8)(0)
= 0.61
Average entropy for Fitness Program = 0.61

Attribute          Average Entropy
Food Habit         0.50
Height             0.69
Weight             0.94
Fitness Program    0.61

Decision tree for the body image dataset
Food Habit has the lowest average entropy and becomes the root. The resulting tree is:
Food Habit = vegetarian: Satisfied
Food Habit = no red meat: Dissatisfied
Food Habit = no restriction
|  Fitness Program = yes: Satisfied
|  Fitness Program = no: Dissatisfied

Avoiding over-fitted trees
• Stop growing the tree if perfect classification requires too much branching.
• Allow the tree to over-fit and then post-prune:
  - Reduced-error pruning
  - Rule post-pruning

Decision tree software
• Quinlan's ID3 (1986) captures the main idea of building decision trees. Its successors are C4.5 (Weka's version is J48) and C5.0 (the Windows version is See5).
• CART (always performs binary splits - two branches).
• CHAID (found in statistical packages such as SAS or SPSS, but works only with categorical data).

Decision tree by WEKA
Figure 4.8 C4.5 (Weka name J48) classifier selection from the Weka environment

J48 (C4.5) classifier for CRX data
Figure 4.10 J48 classifier performance for credit scoring data

Visualise decision tree
Figure 4.11 Visualise Decision Tree selection from Weka

A decision tree model for weather data
Figure 4.12 A decision tree model for Weather data

Tree representation in Weka
attribute9 = t
|  attribute10 = t: 1 (228.0/21.0)
|  attribute10 = f
• Each line represents a node in the tree. The attribute without any vertical bar '|' to its left is the root node. The next two lines, which start with a '|', are child nodes of the first line. In general, a node with one or more '|' characters before the rule is a child of the node immediately preceding it that has one less '|' character.
• A leaf node is identified by a colon (:).
• The class is identified by the number, in our case 1 or 2, immediately following the colon.
• Nodes that generate a classification, such as "| attribute10 = t: 1 (228.0/21.0)", are followed by two numbers (sometimes one) in parentheses. The first number tells how many instances in the training set are correctly classified by this node, in this case 228.0. The second number, 21.0, represents the number of instances incorrectly classified by the node.
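The slides build trees in Weka; as a rough Python counterpart (an assumption, not the book's tooling), scikit-learn's DecisionTreeClassifier with the entropy criterion prints a similarly '|'-indented tree via export_text. Note that scikit-learn makes binary splits on one-hot encoded attribute values, so the printed tree will not match the multiway tree above exactly.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The body image dataset from the worked example above
data = pd.DataFrame({
    "weight":  ["average", "average", "heavy", "heavy", "heavy", "light", "light", "average"],
    "height":  ["short", "short", "average", "tall", "average", "short", "average", "tall"],
    "food":    ["vegetarian", "no restriction", "no red meat", "vegetarian",
                "vegetarian", "no restriction", "no restriction", "no restriction"],
    "fitness": ["yes", "no", "no", "no", "no", "yes", "no", "yes"],
    "concern": ["satisfied", "dissatisfied", "dissatisfied", "satisfied",
                "satisfied", "satisfied", "dissatisfied", "satisfied"],
})

X = pd.get_dummies(data.drop(columns="concern"))   # one-hot encode the categorical attributes
y = data["concern"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # '|'-indented text tree, much like J48 output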
Cross-validation
Figure 4.9 10-fold cross-validation procedure

Applications of decision trees
• Traffic risk analysis
• Remote sensing
• Medical applications
• Speech recognition
• Character recognition
• And many more

Strengths of decision trees
• Easy to understand.
• Map nicely to a set of production rules.
• Applicable to real problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.
• Comparatively faster than neural networks.

Weaknesses of decision trees
• The output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.

Naive Bayes classifier
P(C|A) = P(A|C) P(C) / P(A)
where C is the hypothesis to be tested and A is the evidence associated with C.
[In simple words, if we know how much of A can happen when C happens, the formula tells us how much of C can happen when A happens.]

(cont.)
• P(C) is the prior probability or marginal probability of C. It is "prior" in the sense that it has not yet accounted for the information available in A.
• P(C|A) is the conditional probability of C, given A. It is also called the posterior probability because it has already incorporated the outcome of event A.
• P(A|C) is the conditional probability of A given C.
• P(A) is the prior or marginal probability of A, which is normally the evidence.

Bayes classifier: An example
[Called naive because all attributes are given equal weight and assumed independent.]

Data for Bayes classifier

Finance/Investment   Travel/Tour   Reading/Magazine   Health/Diet   Sex
promotion            promotion     promotion          promotion
Yes                  No            Yes                No            Male
Yes                  Yes           No                 No            Male
No                   Yes           Yes                Yes           Female
No                   Yes           No                 Yes           Male
Yes                  Yes           Yes                Yes           Female
No                   No            Yes                No            Female
Yes                  No            No                 No            Male
Yes                  Yes           No                 No            Male
No                   No            No                 Yes           Female
Yes                  No            No                 No            Male

The instance to be classified
Finance/Investment promotion = No
Travel/Tour promotion = Yes
Reading/Magazine promotion = Yes
Health/Diet promotion = No
Sex = ?

Counts and probabilities for attribute Sex

                   Finance/Investment   Travel/Tour     Reading/Magazine   Health/Diet
                   Male     Female      Male   Female   Male     Female    Male     Female
Yes                5        1           3      2        1        3         1        3
No                 1        3           3      2        5        1         5        1
Ratio: yes/total   5/6      1/4         3/6    2/4      1/6      3/4       1/6      3/4
Ratio: no/total    1/6      3/4         3/6    2/4      5/6      1/4       5/6      1/4
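A short Python sketch (not from the book) showing how the counts and ratios in this table can be tallied directly from the ten records:

from collections import Counter

# Each record: (Finance/Investment, Travel/Tour, Reading/Magazine, Health/Diet, Sex)
records = [
    ("Yes", "No", "Yes", "No", "Male"),    ("Yes", "Yes", "No", "No", "Male"),
    ("No", "Yes", "Yes", "Yes", "Female"), ("No", "Yes", "No", "Yes", "Male"),
    ("Yes", "Yes", "Yes", "Yes", "Female"), ("No", "No", "Yes", "No", "Female"),
    ("Yes", "No", "No", "No", "Male"),     ("Yes", "Yes", "No", "No", "Male"),
    ("No", "No", "No", "Yes", "Female"),   ("Yes", "No", "No", "No", "Male"),
]
attributes = ["Finance/Investment", "Travel/Tour", "Reading/Magazine", "Health/Diet"]

sex_totals = Counter(r[-1] for r in records)          # Male: 6, Female: 4
for i, name in enumerate(attributes):
    for sex in ("Male", "Female"):
        yes = sum(1 for r in records if r[-1] == sex and r[i] == "Yes")
        print(f"P({name} = Yes | Sex = {sex}) = {yes}/{sex_totals[sex]}")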
Computing the probability for sex = male
P(sex = male | A) = P(A | sex = male) P(sex = male) / P(A)
[The denominator is actually the sum of the numerator over all possible values. Thus in this case P(A) = P(A | sex = male) × P(sex = male) + P(A | sex = female) × P(sex = female).]

Conditional probabilities for sex = male
P(Finance/Investment = No | Sex = male) = 1/6
P(Travel/Tour = Yes | Sex = male) = 3/6
P(Reading/Magazine = Yes | Sex = male) = 1/6
P(Health/Diet = No | Sex = male) = 5/6

The probability for sex = male given evidence A:
P(sex = male | A) = 0.006944 / P(A)
The probability for sex = female given evidence A:
P(sex = female | A) = 0.028125 / P(A)

Zero-valued attribute counts
Since probabilities get multiplied, we cannot have a zero value because it would eliminate everything. When 0 is a possibility, each ratio of the form n/d should be replaced by:
(n + k·p) / (d + k)
where k is a value between 0 and 1 (usually 1) and p is an equal fractional part of the total number of possible values for the attribute.

Missing data
With the Bayes classifier, missing data items are ignored.

Numeric data
f(x) = [1 / (σ √(2π))] e^(-(x - μ)² / (2σ²))
where e is the base of the exponential function, μ is the class mean for the given numerical attribute, σ is the class standard deviation for the attribute, and x is the attribute value.
[The formula above gives the probability density function for a continuous attribute such as weight or age. The probability for, say, age 45 is estimated as P(age = 45) = Prob(age ≤ 45.5) - Prob(age ≤ 44.5). These probabilities can be found in a normal distribution table.]

Naive Bayes' algorithm
A probability-based solver. Bayes' theorem relates the conditional and marginal probabilities of stochastic events A and C:
P(C|A) = P(A|C) P(C) / P(A) = L(A|C) P(C) / P(A)
where P stands for the probability of the variable within parentheses and L(A|C) is referred to as the likelihood of A given fixed C.

Bayes' theorem with many attributes
P(C | A1, ..., An) = P(A1, ..., An | C) P(C) / P(A1, ..., An)
where n is the number of attributes. Assuming that the attributes are independent (hence the name 'naive') and ignoring the denominator, we get:
P(C | A1, ..., An) ∝ P(A1|C) P(A2|C) ... P(An|C) P(C)
or
P(C | A1, ..., An) ∝ [Π (i = 1..n) P(Ai|C)] P(C)

Example of Naive Bayes'
Figure 4.13 A simple example for Naive Bayes classification
Figure 4.14 New instance identification using the Naive Bayes methodology

Probability measure
P(New case | Crescent shape) = No. of crescent shapes in the vicinity / Total no. of crescent shapes = 1/40
P(New case | Oval shape) = No. of oval shapes in the vicinity / Total no. of oval shapes = 3/20
Posterior probability of the new case being crescent-shaped ∝ prior probability of crescent shape × likelihood of the new case being crescent = (40/60) × (1/40) = 1/60
Posterior probability of the new case being oval-shaped ∝ prior probability of oval shape × likelihood of the new case being oval = (20/60) × (3/20) = 3/60 = 1/20

Naive Bayes' through WEKA
Figure 4.15 Naive Bayes classifier selection from Weka
Figure 4.16 Naive Bayes classifier performance with US voting data

Applications of Naive Bayes
• Parameter estimation
• Document classification
• Systems performance management
• Medical diagnosis
• And many more

Naive Bayes' algorithm
• Strengths: simplicity, efficiency, convenience, etc.
• Weaknesses: restrictive assumptions, not very effective, etc.

Recap
• Classification task in data mining
• Decision Tree working principle
• Naive Bayes working principle
• Applications of Decision Tree and Naive Bayes
• Data splitting method in data mining
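To pull the Naive Bayes pieces together, a minimal Python sketch (not from the book) that reproduces the worked promotion example: the class-conditional products give 0.006944 for male and 0.028125 for female, and normalising by their sum turns them into posterior probabilities.

# Class-conditional probabilities read off the counts table for the instance
# Finance/Investment = No, Travel/Tour = Yes, Reading/Magazine = Yes, Health/Diet = No
likelihood_male   = (1/6) * (3/6) * (1/6) * (5/6)    # P(A | sex = male)
likelihood_female = (3/4) * (2/4) * (3/4) * (1/4)    # P(A | sex = female)

prior_male, prior_female = 6/10, 4/10                # 6 males and 4 females in the data

num_male   = likelihood_male * prior_male            # ~0.006944
num_female = likelihood_female * prior_female        # ~0.028125

evidence = num_male + num_female                     # P(A), the normalising denominator
print("P(male | A)   =", num_male / evidence)        # ~0.198
print("P(female | A) =", num_female / evidence)      # ~0.802, so the instance is classified as female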
Artificial Neural Networks & Support Vector Machines
Copyright © 2007 Cengage Learning Australia Pty Limited

Objectives
On completion of this lecture you should know:
• What an artificial neural network is
• Applications and limitations of neural networks
• What a support vector machine is
• Applications and limitations of support vector machines

Artificial neural networks
An information-processing conceptual model. Three broad types of models:
• Analytical (physics-based, derived from fundamental principles; also called a white-box model).
• Conceptual (introduces some artificial elements to mimic the working of the system; also known as a grey-box model).
• Black box (only relates input to output, without any consideration of the internal working of the system).

An ANN example
Figure 5.1 A basic neural network process for optical character recognition: an image of the word "mango" is split into sub-images, each character is converted to its ASCII code, and a neural network trained to recognise characters produces the output.

Biological neuron
Figure 5.2 Sketch of a biological neuron, showing the nucleus, cell body (soma), dendrites, axon and presynaptic terminal.

Usefulness
• Where we cannot formulate an algorithmic solution.
• Where we can get lots of examples of the behaviour we require, but few perfect ones.
• Where we need to extract patterns or trends from data that are too complex or imprecise to be noticed by humans or other computer algorithms.

Format
• Mimics the human brain: a network of interconnected neurons.
• One input layer, one or two hidden layers, and one output layer. Two hidden layers are usually enough to capture all the patterns. Each layer can have one or more neurons.
• In a fully connected network, each neuron is connected to all the neurons of the preceding layer and the following layer, but not to other neurons in the same layer.
• In feed-forward neural networks, information moves in one direction only (from input towards output, i.e. there is no loop to carry information backwards).

Processing
• Each neuron in the input layer reads the value of an attribute within the range [0,1] and passes it to the next layer unchanged.
• Each neuron in the next layer receives information from all the neurons of the previous layer, with each value multiplied by a weight and then summed. Weights can be negative and do not have to add to 1. Weights can initially be selected arbitrarily and are adjusted gradually during model training.
• The weighted sum may be passed on to the next layer of neurons without modification, or it can be modified by a simple threshold function (i.e. pass on 0 if the sum is less than a certain value, or 1 if it is greater). The most popular modification is by the sigmoid function (to be described later).
• During training of the model, the model output is compared with the observed values and the weights are adjusted until a satisfactory match between the two is obtained.
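A tiny Python sketch of the processing step just described: a weighted sum of the incoming values passed through the sigmoid squashing function introduced on a later slide. The inputs and weights below are made up purely for illustration.

import math

def sigmoid(x, k=1.0):
    # f(x) = 1 / (1 + e^(-kx))
    return 1.0 / (1.0 + math.exp(-k * x))

def neuron_output(inputs, weights):
    # Weighted sum of the previous layer's outputs, then the sigmoid squashing function
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return sigmoid(weighted_sum)

print(neuron_output([1.0, 0.5, 0.1], [0.2, -0.4, 0.7]))   # hypothetical neuron with three inputs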
Feed-forward neural networks
Figure 5.3 A fully connected feed-forward neural network: input Nodes 1-3 feed hidden Nodes 4 and 5 through weights W14, W24, W34, W15, W25 and W35; the hidden nodes feed the output Node 6 through weights W46 and W56.

Genetic learning
Often used to get a good starting set of initial weights for a neural network model:
• Genetic algorithms are based on a biological metaphor.
• Suitable for both supervised and unsupervised data mining.

Genetic learning operators
• Crossover: first step - forms new elements from the population; the parent population creates offspring.
• Mutation: second step - some offspring undergo random changes in their characteristics.
• Selection: third step - all candidates are evaluated against a fitness function. Only the initial number of most fit population members are kept and the rest are discarded. The final selection is based on 'survival of the fittest'.

An initial population of weights
Table 5.1 Population of connection weights for genetic learning

Population Element   W1    W2    W3    W4    W5    W6    W7    W8    W9
Parent 1             0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1
Parent 2             0.9   0.9   0.9   0.9   0.9   0.9   0.9   0.9   0.9
Child 1              0.1   0.1   0.9   0.9   0.1   0.1   0.9   0.9   0.1
Child 2              0.1   0.9   0.9   0.9   0.1   0.1   0.1   0.1   0.9
Child 3              0.2   0.9   0.9   0.9   0.1   0.1   0.1   0.1   0.9

The Sigmoid function
f(x) = 1 / (1 + e^(-kx))        (Eq 5.2)
where k = 1 is a popular choice and e is the base of the natural logarithm, approximately 2.718281...
The curve for k = 1 is S-shaped, rising from near 0 to near 1 over the range x = -6 to 6. Try drawing it for k = 0.5 and k = 10.

XOR problem solved by NN
Table 5.2 Boolean XOR problem
Input 1   Input 2   Output
0         0         0
0         1         1
1         0         1
1         1         0

Figure 5.5 Training a neural network for the XOR problem: input Nodes 1 and 2 connect to output Node 3 with weights W1 = 1 and W2 = 1. The learning rule for the output is: output = W1(Input 1) + W2(Input 2).

Figure 5.6 Complete neural network for the XOR problem: two input nodes, hidden Nodes 3, 4 and 5, and output Node 6. The activation of the nodes is by a threshold function.

We can calculate the output by following a two-step process:
• Step 1: the input values of the hidden layer nodes are obtained from the following relationships (note that the network is not fully connected):
Node 3 = (1)(Input 1)
Node 4 = (1)(Input 1) + (1)(Input 2)
Node 5 = (1)(Input 2)
The output from each hidden layer node will be either 0 or 1.
• Step 2: the neural network output is obtained by summing the weighted hidden-node outputs (using the output weights shown in Figure 5.6):
Output = (-1)(Node 3) + (2)(Node 4) + (-1)(Node 5)
With a firing threshold of 1 at the hidden nodes, this gives 0 for inputs (0,0) and (1,1) and 1 for inputs (0,1) and (1,0), as required.

Backpropagation networks
Recall Figure 5.3, and suppose the inputs at Nodes 1, 2 and 3 are 1, 0.5 and 0.1, with current weights W14 = 1.2, W24 = 0.7, W34 = 0.9, W15 = -0.4, W25 = 0.6, W35 = 1.1, W46 = 0.8 and W56 = 0.5. Therefore:
Input to Node 4 = 1(1.2) + 0.5(0.7) + 0.1(0.9) = 1.64
Output from Node 4 = 0.8375 [using Equation (5.2)]
Input to Node 5 = 1(-0.4) + 0.5(0.6) + 0.1(1.1) = 0.01
Output from Node 5 = 0.5025
Input to Node 6 = 0.8375(0.8) + 0.5025(0.5) = 0.9213
Output from Node 6 = 0.7153
Suppose that the actual observed output is 0.6; the error is therefore (0.7153 - 0.6) = 0.1153.

The error signal at the output node is this error multiplied by the derivative of the sigmoid: (0.6 - 0.7153) × 0.7153 × (1 - 0.7153) ≈ -0.023, and its magnitude 0.023 is propagated back to the hidden layer. The computed errors are summed across all output nodes; since we have only one output node:
Error(W14) = Error(W24) = Error(W34) = 0.023 × (0.8) × 0.8375 × (1 - 0.8375) = 0.00250
Error(W15) = Error(W25) = Error(W35) = 0.023 × (0.5) × 0.5025 × (1 - 0.5025) = 0.00287
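Before the weights are adjusted, a short Python sketch (not from the book) that reproduces the forward-pass and error figures just computed; the slides round the output-node error to 0.023 before propagating it, so the code's values differ slightly in the last decimal place.

import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

inputs = [1.0, 0.5, 0.1]                          # outputs of input Nodes 1, 2 and 3
w4 = [1.2, 0.7, 0.9]                              # W14, W24, W34
w5 = [-0.4, 0.6, 1.1]                             # W15, W25, W35
w46, w56 = 0.8, 0.5                               # weights into output Node 6

out4 = sigmoid(sum(i * w for i, w in zip(inputs, w4)))    # 0.8375
out5 = sigmoid(sum(i * w for i, w in zip(inputs, w5)))    # 0.5025
out6 = sigmoid(out4 * w46 + out5 * w56)                   # 0.7153

target = 0.6
error6 = (target - out6) * out6 * (1 - out6)              # output-node error signal, about -0.023
print(round(out4, 4), round(out5, 4), round(out6, 4), round(error6, 4))

# Errors attributed to the hidden-layer weights (magnitudes, as on the slide)
err_w4 = abs(error6) * w46 * out4 * (1 - out4)            # about 0.0025 (W14, W24, W34)
err_w5 = abs(error6) * w56 * out5 * (1 - out5)            # about 0.0029 (W15, W25, W35)
print(round(err_w4, 5), round(err_w5, 5))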
The final step is the weight adjustment, made using the delta rule developed by Widrow and Hoff (Widrow and Lehr, 1995). The rule is:
Wjk(new) = Wjk(old) + ΔWjk
ΔWjk = (r)[Error(Wjk)](Ok)
where r is the learning rate, which lies between 0 and 1; a higher value means faster convergence but can lead to instability. With 0.3 as the learning rate, we get:
ΔW56 = ΔW46 = 0.3(0.023)(0.7153) = 0.00494
The updated value of W46 = 0.8 + 0.00494 = 0.80494
The updated value of W56 = 0.5 + 0.00494 = 0.50494
ΔW14 = ΔW24 = ΔW34 = 0.3(0.00250)(0.8375) = 0.000628
The updated value of W14 = 1.2 + 0.000628 = 1.200628

General considerations
• What input attributes will be used to build the network?
• How will the network output be represented?
• How many hidden layers should the network contain?
• How many nodes should there be in each hidden layer?
• What condition will terminate network training?

Neural networks in Weka
Figure 5.7 Iris data file opened for neural network model construction
Figure 5.8 Neural network algorithm selection from Weka
Figure 5.9 Weka environment after neural network algorithm selection
Figure 5.10 Output screen after execution in Weka

Parameter value selection for neural networks
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation: iris
Instances: 150
Attributes: 5
  sepallength
  sepalwidth
  petallength
  petalwidth
  class
Test mode: 10-fold cross-validation

(cont.)
=== Classifier model (full training set) ===
Sigmoid Node 0
  Inputs      Weights
  Threshold   -3.5015971588434005
  Node 3      -1.0058110853859954
  Node 4      9.07503844669134
  Node 5      -4.107780453339232
-----------------
Time taken to build model: 0.66 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      146    97.3333 %
Incorrectly Classified Instances      4     2.6667 %
-----------------

Figure 5.12 Training process of a neural network: A = under-trained model, B = properly trained model, C = over-trained model.
Figure 5.13 Parameter details in Weka for the neural network algorithm

Neural network of the Iris data
A diagram of the trained network: the four inputs (sepal length, sepal width, petal length and petal width) feed a layer of hidden nodes, which in turn feed three output nodes, Node 0 (Iris setosa), Node 1 (Iris versicolor) and Node 2 (Iris virginica).

Neural network strengths
• Work well with noisy data.
• Can process both numeric and categorical data.
• Appropriate for applications requiring a time element.
• Have performed well in several domains.
• Appropriate for supervised learning and unsupervised clustering.

Weaknesses
• Lack explanation capabilities.
• May not provide optimal solutions to problems.
• Overtraining can be a problem.
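Before leaving neural networks, a rough scikit-learn counterpart (an assumption; the slides use Weka's MultilayerPerceptron) to the Iris experiment above. The exact accuracy depends on the parameters and random seed, so no particular figure is claimed.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# One small hidden layer; the epoch limit loosely mirrors Weka's -N 500 setting
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(3,), max_iter=500, random_state=0),
)
scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation, as in the Weka run
print("mean accuracy:", scores.mean())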
Support vector machines (SVM)

Class partitioning lines
Figure 5.17 The training patterns: crescent-shaped and oval-shaped objects are represented by two different symbols based on their class labels. These patterns are separated by lines that are not the optimal hyperplane (OH) in (a) and (b); the solid line in (c) indicates the OH (Ali, 2005).

Optimal hyperplane and the support vectors
Figure 5.18 (a) The margins are narrow; (b) the margins are maximal.

Figure 5.19 (a) The input space; (b) the linear OH construction, with errors; (c) the nonlinear OH construction, without error, obtained by kernel mapping to a 256-dimensional space.

SVM architecture
Figure 5.20 An SVM architecture

XOR solved by SVM
Table 5.3 Boolean XOR problem
Input data x   Output class y
(-1, -1)       -1
(-1, +1)       +1
(+1, -1)       +1
(+1, +1)       -1

First, we transform the dataset with a polynomial kernel:
K(xi, xj) = (1 + xi·xj)²
where xi·xj is the inner product of the input vectors. Therefore the kernel matrix is:
K = | 9 1 1 1 |
    | 1 9 1 1 |
    | 1 1 9 1 |
    | 1 1 1 9 |

Following the SVM implementation given in Figure 5.20, the maximisation term is:
Q(α) = Σ (i = 1..4) αi - (1/2) Σ (i = 1..4) Σ (j = 1..4) αi αj yi yj K(xi, xj)
     = α1 + α2 + α3 + α4 - (1/2)(9α1² - 2α1α2 - 2α1α3 + 2α1α4 + 9α2² + 2α2α3 - 2α2α4 + 9α3² - 2α3α4 + 9α4²)
subject to:
Σ (i = 1..4) yi αi = -α1 + α2 + α3 - α4 = 0, with 0 ≤ α1, 0 ≤ α2, 0 ≤ α3, 0 ≤ α4.

Setting the derivatives of Q(α) with respect to each αi to zero gives the system:
9α1 - α2 - α3 + α4 = 1
-α1 + 9α2 + α3 - α4 = 1
-α1 + α2 + 9α3 - α4 = 1
α1 - α2 - α3 + 9α4 = 1
By solving these equations we can write the solution to this optimisation problem as:
α1 = α2 = α3 = α4 = 1/8

Therefore the decision function in the inner-product representation is:
f̂(x) = Σ (i = 1..4) αi yi K(xi, x) = 0.125 Σ (i = 1..4) yi (xi·x + 1)²

The 2nd-degree polynomial kernel function can be expanded as:
K(xi, xj) = (xi·xj + 1)² = (xi1 xj1 + xi2 xj2)² + 2(xi1 xj1 + xi2 xj2) + 1
          = 1 + (xi1 xj1)² + 2(xi1 xj1)(xi2 xj2) + (xi2 xj2)² + 2(xi1 xj1) + 2(xi2 xj2)
          = φ(xi)·φ(xj)
so the 2nd-degree polynomial transformation function is:
φ(xi) = [1, xi1², √2 xi1 xi2, xi2², √2 xi1, √2 xi2]ᵀ

The weight vector in the transformed space is then:
w = Σ (i = 1..4) αi yi φ(xi) = (1/8)[-φ(x1) + φ(x2) + φ(x3) - φ(x4)] = [0, 0, -1/√2, 0, 0, 0]ᵀ
Therefore the optimal hyperplane function for this XOR problem is:
f̂(x) = w·φ(x) = -x1 x2

Figure 5.21 XOR problem classification by SVM using a second-degree polynomial kernel

Kernel trick
(a) linear kernel; (b) polynomial kernel; (c) RBF kernel; (d) spline kernel; (e) sigmoidal kernel; (f) multiquadratic kernel

SVM (SMO) in Weka
Figure 5.23 SVM classifier (named SMO in the Weka list) selection
Figure 5.24 SVM classification performance for the Colon Tumor data

RBF kernel selection
Figure 5.25 RBF kernel selection for the SVM classifier
Figure 5.26 SVM classification performance with the RBF kernel

Advantages of SVM
• Can handle high-dimensional datasets well.
• Can map almost any nonlinear space with a suitable kernel.
• Recent advances have added to its versatility.

Disadvantages of SVM
• Choosing a proper kernel function is not easy.
• Once a kernel is chosen, the algorithm is rigid.
• Slower than other algorithms.
• Not efficient with discrete data.

Recap
• Neural networks working principle.
• Support vector machines working principle.
• Applications and limitations of neural networks and support vector machines.
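As a closing check on the XOR example, a minimal scikit-learn sketch (an assumption in tooling; the book uses Weka's SMO) with the same second-degree polynomial kernel. The four points are classified correctly and the decision values carry the same signs as the derived hyperplane f̂(x) = -x1·x2.

import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]])
y = np.array([-1, +1, +1, -1])

# K(xi, xj) = (xi . xj + 1)^2  ->  kernel='poly', degree=2, gamma=1, coef0=1
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)

print(clf.predict(X))              # [-1  1  1 -1]: all four XOR points classified correctly
print(clf.decision_function(X))    # signs agree with f(x) = -x1*x2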
Model Evaluation Techniques
Copyright © 2007 Cengage Learning Australia Pty Limited

Objectives
On completion of this lecture you should know:
• Quantity-based model evaluation techniques
• Appropriateness of evaluation techniques
• Bagging and boosting
• Model ranking

What should be evaluated?
The performance criteria should be:
• Reliable (work almost every time)
• Valid (relevant to the objectives)

Accuracy/error rate estimation
Figure 6.1 Colon tumor problem: training performance with the J48 Weka decision tree classifier

Error/accuracy
• The opposite of accuracy is Incorrectly Classified Instances, or alternatively, Error. An empirical percent error can be defined as the ratio of the number of errors to the total number of instances examined in the experiment, which in equation form is:
Percent error = (Number of incorrectly classified instances / Total number of instances) × 100

Figure 6.2 Colon tumor problem: 10-fold cross-validation performance with the J48 Weka classifier

Confusion matrix
                            Predicted Healthy (Negative)   Predicted Sick (Positive)
Actual Healthy (Negative)   a                              b
Actual Sick (Positive)      c                              d
The diagonal terms give the correct classifications.

(cont.)
• Accuracy: ACC = (a + d) / (a + b + c + d)
• True positive rate: TPR = d / (c + d)
• False positive rate: FPR = b / (a + b)
• True negative rate: TNR = a / (a + b)

(cont.)
• False negative rate: FNR = c / (c + d)
• Precision: P = d / (b + d)
• F-measure: F = (β² + 1) P TPR / (β² P + TPR)
• Kappa statistic: K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed accuracy and P(E) = [(a + b)(a + c) + (b + d)(c + d)] / (a + b + c + d)²

Example 6.1
                  Computed Class 1   Computed Class 2   Computed Class 3
Actual Class 1    14                 2                  5
Actual Class 2    ? (x)              40                 2
Actual Class 3    1                  ? (y)              18
Given that the error rate for Class 2 is 0.2 and the overall model accuracy is 0.72, find x and y.
Answer:
For Class 2: 0.2 = (x + 2)/(x + 40 + 2)
or 0.2(x + 42) = x + 2
or 8.4 - 2 = 0.8x
therefore x = 8.
From the model accuracy: 0.72 = 72/(y + 90), which gives y = 10.

Final confusion matrix
                  Computed Class 1   Computed Class 2   Computed Class 3
Actual Class 1    14                 2                  5
Actual Class 2    8                  40                 2
Actual Class 3    1                  10                 18

Error rate estimation
• Mean absolute error: MAE = (|P1 - A1| + |P2 - A2| + ... + |Pn - An|) / n
• Root mean-squared error: RMS = √{[(P1 - A1)² + (P2 - A2)² + ... + (Pn - An)²] / n}
• Relative absolute error: RAE = (|P1 - A1| + |P2 - A2| + ... + |Pn - An|) / (|A1 - Ā| + |A2 - Ā| + ... + |An - Ā|)
where Pi is a predicted value, Ai the corresponding actual value and Ā the mean of the actual values.

Receiver operating curve (ROC)
Figure 6.3 Receiver operating curves (true positive rate against false positive rate): C0 is for a random model, C1 and C2 are for two different parametric models, and P1 is for a non-parametric model.

Euclidean distance comparison
Pd = 1 - √[w(1 - TPR)² + (1 - w) FPR²]
where w is a weight factor, with a range of 0 to 1, that is used to assign the relative importance of false positives to false negatives. Pd ranges from 1 for a perfect classifier to 0 for an absolutely incorrect classifier.

Lift chart
Lift = P(Rs | S) / P(Rt | G)
i.e. the response rate within the targeted sample divided by the response rate within the general population.

Table 6.2 Two models with equal lifts

Model A           Computed accept   Computed reject
Actual accept     300               1,700
Actual reject     9,700             88,300
Total             10,000            90,000
Lift = (300/10,000) / (2,000/100,000) = 1.5

Model B           Computed accept   Computed reject
Actual accept     1,200             800
Actual reject     38,800            59,200
Total             40,000            60,000
Lift = (1,200/40,000) / (2,000/100,000) = 1.5

Figure 6.4 Lift chart for targeted versus mass mailing (number responding against percent sampled, curves L1 and L0)

Cost and utility
Cost = Σi Σj Eij Cij
Utility = Σi Σj Eij Uij

Cost matrix calculation
Table 6.3 Cost matrix to penalise errors
                  Predicted Negative   Predicted Positive
Actual Negative   0                    1
Actual Positive   3                    0

Table 6.4 Confusion matrix of a hypothetical dataset
               Predicted 1   Predicted 2   Predicted 3   Predicted 4
True class 1   43            3             4             1
True class 2   2             34            1             0
True class 3   5             2             39            2
True class 4   3             1             4             56

Table 6.5 Cost matrix for the hypothetical dataset of Table 6.4
               Predicted 1   Predicted 2   Predicted 3   Predicted 4
True class 1   0             1             2             1
True class 2   1             0             1             3
True class 3   2             2             0             1
True class 4   0             3             1             0

Parsimony (Occam's razor principle)
• The simpler of two models with similar performance is the better choice.
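A small Python sketch (not from the book) applying the accuracy and cost calculations above to the hypothetical confusion and cost matrices of Tables 6.4 and 6.5:

import numpy as np

# Table 6.4: confusion matrix E (rows = true class, columns = predicted class)
E = np.array([[43,  3,  4,  1],
              [ 2, 34,  1,  0],
              [ 5,  2, 39,  2],
              [ 3,  1,  4, 56]])

# Table 6.5: cost matrix C penalising each kind of misclassification
C = np.array([[0, 1, 2, 1],
              [1, 0, 1, 3],
              [2, 2, 0, 1],
              [0, 3, 1, 0]])

accuracy = np.trace(E) / E.sum()   # correctly classified / total = 172/200 = 0.86
cost = (E * C).sum()               # Cost = sum over i, j of Eij * Cij = 38

print("accuracy =", accuracy)
print("total cost =", cost)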
Bagging and boosting
• Bagging is the process of developing a model for each random sample and classifying an unknown instance according to what the majority of the models predict.
• Boosting is similar to, but more complex than, bagging. In boosting, increased focus is given to misclassified instances so that their behaviour is captured well.

Bagging and boosting performance evaluation through Weka
Figure 6.5 AdaBoostM1 algorithm selection from the Weka environment
Figure 6.6 Base classifier selection for the AdaBoostM1 algorithm
Figure 6.7 AdaBoostM1 algorithm selection in the Weka environment with J48 as the base classifier
Figure 6.8 AdaBoostM1 performance with base classifier J48
Figure 6.9 Bagging performance with base classifier J48
Figure 6.10 J48 classifier performance for the liver-disorders data

Model ranking
Models can be ranked on:
• Accuracy
• Computation time
• Memory space required
• Ease of interpretation

Model ranking measures
Rij = 1 - (eij - max(ei)) / (min(ei) - max(ei))
where eij is the performance measure of the jth algorithm on dataset i and ei is the vector of accuracies for dataset i; Rij is 1 for the best and 0 for the worst algorithm on that dataset.

The success-failure score for the ith classifier is:
Fi = (1/2)(1 + (si - fi)/n)
where si is the total number of success (best) cases for the ith classifier, fi is the total number of failure (worst) cases for the same classifier, and n is the total number of datasets.

The combined ranking measure is:
Z = α·ai + β·ti
where α and β are the weight parameters for ranking average accuracy against computational complexity, and the average accuracy and computation-time rank are denoted by ai and ti. We consider the range for α and β between 0 and 2.

Example 6.2
Table 6.6 Various measures of accuracy of the models tested (Ali, 2005)
Classifier                    IBK     C4.5    PART    KD      NB      OneR    SVM     NN
TPR                           0.595   0.595   0.595   0.61    0.495   0.365   0.565   0.645
TNR                           0.54    0.6     0.6     0.6     0.44    0.385   0.595   0.605
% of correct classification   0.505   0.615   0.55    0.565   0.385   0.325   0.62    0.615
F-measure                     0.565   0.625   0.625   0.575   0.48    0.365   0.56    0.62
Average accuracy              0.551   0.609   0.593   0.588   0.45    0.36    0.585   0.621

Table 6.7 Ranking of algorithms in terms of computation time (Ali, 2005)
Classifier        IBK     C4.5    PART    KD      NB      OneR    SVM     NN
Execution time    0.535   0.535   0.52    0.51    0.705   0.995   0.5     0.015

Figure 6.11 Combined performances (average accuracy and average computational time) for the classifiers (Ali, 2005)
Figure 6.12 Relative performance (Z values) of the classifiers as β varies, with α = 1 fixed (Ali, 2005)

Model ranking through Weka
Figure 6.13 Weka GUI chooser
Step 2: Figure 6.14 Weka Experimenter setup
Step 3: Figure 6.15 Selection of several datasets and classifiers in the Weka environment; Figure 6.16 Weka run performance with several classifiers for multiple problems
Step 4: Figure 6.17 Weka performance analysis window
Step 5: Figure 6.18 Classifier list visualisation in the Weka environment
Step 6: Figure 6.19 Combined classification performance of several classifiers
Step 7: Figure 6.20 Training time performance for different classifiers

No free lunch (NFL) theorem
• If algorithm A outperforms algorithm B on some cost functions, then, loosely speaking, there must exist exactly as many other functions where B outperforms A.

Recap
• Model evaluation techniques.
• Appropriateness of evaluation techniques.
• Bagging and boosting methods.
• Combined model ranking.
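Finally, a scikit-learn sketch (an assumption; the chapter's experiments use Weka, and the parameter name estimator assumes scikit-learn 1.2 or later) of the bagging-with-J48 and AdaBoostM1-with-J48 idea, using a decision tree as the base classifier on a stand-in dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)     # stand-in dataset; the chapter uses liver-disorders
base = DecisionTreeClassifier(max_depth=3, random_state=0)

models = [
    ("single tree", base),
    ("bagging", BaggingClassifier(estimator=base, n_estimators=50, random_state=0)),
    ("boosting", AdaBoostClassifier(estimator=base, n_estimators=50, random_state=0)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation, as in Weka
    print(name, "mean accuracy:", round(scores.mean(), 3))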