8 March 2010
Boris Mirkin
Computational Intelligence and Data Visualization
http://www.dcs.bbk.ac.uk/~mirkin/advanced
Topics for this and next lecture:
• Correlation: Supervised learning
o Statement of the problem
o Linear regression
o Linear discrimination
o Gradient optimisation for learning
o Linear regression with the steepest descent
o Perceptron
o Artificial neuron
o Neural Network with one hidden layer
o Error back propagation for learning weights
o Data standardization in NN learning
o Prediction of Iris sepal sizes
o Decision trees
o Bayes approach and Naïve Bayes classifier
o SVM and kernel
Correlation: Supervised Learning
Problem:
Given N pairs $(x_i, u_i)$ observed at entities i = 1, …, N, in which $x_i = (x_{i1}, \ldots, x_{ip})$ are predictor/input vectors (dimension p) and $u_i = (u_{i1}, \ldots, u_{iq})$ are target/output vectors (dimension q), build a decision rule
$\hat{u} = F(x)$
such that the difference between the computed $\hat{u}$ and the observed target vector u, given x, is minimal over the class of admissible rules F.
Specifically, let us take a look at the iris.dat data set:
[Figure 0.1. Sepal and petal in an Iris flower.]
This popular data set describes 150 Iris specimens representing three taxa of Iris flowers: I Iris setosa (diploid), II Iris versicolor (tetraploid), and III Iris virginica (hexaploid), with 50 specimens from each.
Each specimen is measured on four morphological variables: sepal length (w1), sepal width (w2), petal length (w3), and petal width (w4) (see Figure 0.1).
Table 0.3. Iris data: Iris specimens measured over four features each (three from each taxon shown).
      I Iris setosa          II Iris versicolor     III Iris virginica
 #    w1   w2   w3   w4      w1   w2   w3   w4      w1   w2   w3   w4
 1    5.1  3.5  1.4  0.3     6.4  3.2  4.5  1.5     6.3  3.3  6.0  2.5
 2    4.4  3.2  1.3  0.2     5.5  2.4  3.8  1.1     6.7  3.3  5.7  2.1
 3    4.4  3.0  1.3  0.2     5.7  2.9  4.2  1.3     7.2  3.6  6.1  2.5
Assume, for illustrative purposes, that the sepal is easy to measure and the petal is not. This mimics real-world data collection.
I would like to have a rule for predicting the petal measures from those of the sepal.
Why (and how) should one restrict the class of admissible rules F?
A big question, a shaky answer.
Take a look at the 2D regression problem: pairs (x,u) are observed
at N entities:
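[Figure: the observed points plotted on axes x and u, with a straight line and polynomial curves drawn through them.]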
We have N = 7 points in the figure, which can thus be fitted exactly by a polynomial of the 6th order,
$u = p(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6$.
Indeed, the 7 points give 7 equations $u_i = p(x_i)$ (i = 1, …, 7) that exactly determine the 7 coefficients of p(x), provided the $x_i$ are all different.
The polynomial p(x), on whose graph all the observations lie, has no predictive power: beyond the observed range, the curve may take either course (like those shown). The blue straight line fits none of the points exactly but expresses the tendency and should be preferred.
If there is no theoretical motivation, it is hard to tell which class of Fs to use.
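The point can be tried out with a minimal Python sketch. The seven (x, u) values below are hypothetical: they are made up to scatter around a straight line and are not the points of the figure. The sketch builds the 6th-order interpolating polynomial (in Lagrange form) and the least-squares line, and compares them inside and beyond the observed range.

```python
# Hypothetical data: 7 points scattered around a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
us = [1.2, 1.3, 2.3, 2.2, 3.4, 3.1, 4.3]

def interpolating_polynomial(x):
    """Value at x of the degree-6 polynomial through all 7 points (Lagrange form)."""
    total = 0.0
    for i, (xi, ui) in enumerate(zip(xs, us)):
        term = ui
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, us):
    """Least-squares straight line u = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_u = sum(us) / n
    a1 = (sum((x - mean_x) * (u - mean_u) for x, u in zip(xs, us))
          / sum((x - mean_x) ** 2 for x in xs))
    return mean_u - a1 * mean_x, a1

a0, a1 = fit_line(xs, us)
for x in (3.0, 7.0, 8.0):  # one observed point, two points beyond the range
    print(x, round(interpolating_polynomial(x), 2), round(a0 + a1 * x, 2))
```

On these made-up points the polynomial reproduces every observation, yet at x = 8 it differs from the straight line by more than a hundred units, whereas the line simply continues the tendency.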
Occam’s razor:
William Ockham (c. 1285–1349): “Entities should not be multiplied
unnecessarily.” (“All things being equal, the simplest explanation
tends to be the best one.”) Interpretation:
“Principle of maximum parsimony (i.e., economy)”
My motto: “The simpler a theory, the more cases it covers.”
A decision rule is an algorithm, not necessarily expressed as an analytic function.
Approaches differ depending on the assumptions about:
Data Flow, Type of Target, Type of Rule, Criterion.
Data flow:
Entities i = 1, …, N either come one by one (incremental, or on-line, learning) or are all known at once (batch mode).
Type of target (Quantitative/Categorical) and rule:
Regression: u quantitative, q=1 or more
- Linear regression (F – linear)
- Decision tree (F – tree-like)
- Neural Nets (F – a net structure, general)
- Evolutionary algorithms (F – general)
Pattern recognition (classification): u binary
- Discrimination (F – linear)
- Support vector machine (SVM) (F –linear)
- Logistic regression (F – probability, exponential-linear)
- Neural Nets (F – general)
- Evolutionary algorithms (F – general)
Criterion: least-squares, maximum likelihood, error count
Frequently, the difference between u and $\hat{u}$ is measured with the squared error,
$E = \langle u - \hat{u},\, u - \hat{u} \rangle = \langle u - F(x),\, u - F(x) \rangle$   (1)
Then it is E that must be minimised over all admissible Fs.
In the non-linear case, this is not easy to do.
First, consider the linear case.
Linear (multiple) regression (q=1, least squares)
Rule assumption:
$u = w_1 x_1 + w_2 x_2 + \ldots + w_p x_p + w_0$
where $w_0, w_1, \ldots, w_p$ are unknown constants.
For any entity i = 1, …, N, the rule-computed value
$\hat{u}_i = w_1 x_{i1} + w_2 x_{i2} + \ldots + w_p x_{ip} + w_0$
differs from the observed one by $d_i = |\hat{u}_i - u_i|$.
To find $w_1, w_2, \ldots, w_p, w_0$, minimise
$D^2 = \sum_i d_i^2 = \sum_i (u_i - w_1 x_{i1} - w_2 x_{i2} - \ldots - w_p x_{ip} - w_0)^2$   (1)
over all possible vectors $w = (w_0, w_1, \ldots, w_p)$.
To make the problem uniform, a fictitious feature $x_0$ is introduced such that all its values are 1: $x_{i0} = 1$ for all i = 1, …, N. Then the criterion $D^2$ involves no separate intercept, just the inner products $\langle w, x_i \rangle$, where $w = (w_0, w_1, \ldots, w_p)$ and $x_i = (x_{i0}, x_{i1}, x_{i2}, \ldots, x_{ip})$ are (p+1)-dimensional vectors, w unknown, $x_i$ known. From now on, by this convention, the intercept in (1) is not written out separately.
Solution:
$D^2$ is just the squared Euclidean distance between the N-dimensional target feature column $u = (u_i)$ and the vector $\hat{u} = Xw$ whose components are $\hat{u}_i = \langle w, x_i \rangle$. Here X is the N x (p+1) matrix whose rows are $x_i$ (augmented with the component $x_{i0} = 1$, thus being (p+1)-dimensional), so that Xw is the matrix product of X and w.
The vectors Xw for all possible w form a vector space, the X-span, which is (p+1)-dimensional when the columns of X are linearly independent.
Problem reformulated: given u, find its projection $\hat{u}$ onto the X-span.
The global solution to the problem is linear in u and defined by X:
$\hat{u} = P_X u$   (2)
where $P_X$ is the so-called orthogonal projection matrix, of size N x N:
$P_X = X (X^T X)^{-1} X^T$
so that $\hat{u} = X (X^T X)^{-1} X^T u$.
Matrix $P_X$ projects every N-dimensional vector u to its nearest match in the X-span.
Equation (2) implies that the optimal $w = (X^T X)^{-1} X^T u$.
The inverse $(X^T X)^{-1}$ does not necessarily exist: it does not when the columns of X are linearly dependent.
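As an illustration, the least-squares solution can be sketched in plain Python on the nine specimens of Table 0.3. The choice of target (petal length w3 predicted from sepal length w1 and width w2) is an assumption made for the example, and the helper `solve` is a hypothetical name: it solves the normal equations $X^T X w = X^T u$ by Gaussian elimination rather than forming the inverse explicitly.

```python
# Nine specimens from Table 0.3: sepal (w1, w2) as predictors, petal length w3 as target.
sepal = [(5.1, 3.5), (4.4, 3.2), (4.4, 3.0),
         (6.4, 3.2), (5.5, 2.4), (5.7, 2.9),
         (6.3, 3.3), (6.7, 3.3), (7.2, 3.6)]
petal_length = [1.4, 1.3, 1.3, 4.5, 3.8, 4.2, 6.0, 5.7, 6.1]

X = [[1.0, w1, w2] for w1, w2 in sepal]  # fictitious feature x0 = 1 comes first

def solve(A, b):
    """Solve the square system A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][k] * w[k] for k in range(r + 1, n))) / M[r][r]
    return w

p1 = len(X[0])  # p + 1
XtX = [[sum(row[a] * row[b] for row in X) for b in range(p1)] for a in range(p1)]
Xtu = [sum(row[a] * u for row, u in zip(X, petal_length)) for a in range(p1)]
w = solve(XtX, Xtu)                                   # w = (X^T X)^{-1} X^T u
u_hat = [sum(wt * xt for wt, xt in zip(w, row)) for row in X]
residual = [u - uh for u, uh in zip(petal_length, u_hat)]
print("w =", [round(v, 3) for v in w])
```

Because $\hat{u}$ is the orthogonal projection of u, the residual u - û should be orthogonal to every column of X, which gives a simple check of the computation.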
Linear discrimination
The linear discrimination problem differs only in that the values $u_i$ are binary, either “yes” or “no”: it is a classification problem, not a regression problem.
To make it quantitative, define $u_i = 1$ if i belongs to the “yes” class and $u_i = -1$ if i belongs to the “no” class.
In the context of the discrimination/classification problem, the intercept is referred to as the bias.
In the figure below, entities $(x_1, x_2, u)$ are presented by stars * at u = 1 and circles at u = -1.
Vector w represents the solution described above; the dashed line represents the set of all x that are orthogonal to w, $\langle w, x \rangle = 0$, which is the separating hyperplane. The figure shows a relatively rare situation in which the two patterns can be separated by a hyperplane; this is referred to as linear separability.
Linear discriminant decision rule: if $\hat{u}_i = \langle w, x_i \rangle > 0$, predict $\mathring{u}_i = 1$; if $\hat{u}_i = \langle w, x_i \rangle < 0$, predict $\mathring{u}_i = -1$; that is,
$\mathring{u}_i = \mathrm{sign}(\langle w, x_i \rangle)$
(sign(a) = 1 when a > 0, -1 when a < 0, and 0 when a = 0.)
A general approach to optimisation of any function:
Gradient optimisation (the steepest ascent/descent, or hill-climbing) of any function f(z) of a multidimensional variable z: given an initial state $z = z_0$, do a sequence of iterations to move to a better location z. Each iteration updates the z-value:
$z^{(new)} = z^{(old)} \pm \mu \, \mathrm{grad} f(z^{(old)})$   (2)
where + applies if f is maximised, and - if it is minimised. Here:
· grad f(z) stands for the vector of partial derivatives of f with respect to the components of z. It is known from calculus that the vector grad f(z) shows the direction of the steepest rise of function f at point z; accordingly, -grad f(z) shows the direction of the steepest descent. We do not discuss here how the gradient vector can be estimated at a given point.
· The value μ controls the length of the change of z in (2). It should be small (to avoid jumping over the optimum), but not too small (to keep the changes noticeable when grad f(z(old)) becomes small; indeed, grad f(z(old)) = 0 if z(old) is an optimum).
Q: For those who know partial derivatives, and thus know that the gradient is the vector of partial derivatives of a function: What is the gradient of the function $f(x_1, x_2) = x_1^2 + x_2^2$? Of the function $f(x_1, x_2) = (x_1 - 1)^2 + 3(x_2 - 4)^2$? Of the function $f(z_1, z_2) = 3 z_1^2 + (1 - z_2)^4$? A: $(2x_1, 2x_2)$; $(2(x_1 - 1), 6(x_2 - 4))$; $(6 z_1, -4(1 - z_2)^3)$.
What is good about the gradient method is that it can be interpreted within the machine learning paradigm in which entities come one by one, and go (like “cats”). This is why it is often used even when a global optimisation algorithm is known.
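A minimal sketch of the steepest descent iteration, applied to the function $f(x_1, x_2) = (x_1 - 1)^2 + 3(x_2 - 4)^2$ from the question above. The starting point, the step size μ = 0.1, and the stopping tolerance are assumptions chosen for illustration.

```python
def grad_f(x1, x2):
    """Gradient of f(x1, x2) = (x1 - 1)^2 + 3 * (x2 - 4)^2."""
    return 2.0 * (x1 - 1.0), 6.0 * (x2 - 4.0)

def steepest_descent(z, mu=0.1, tol=1e-10, max_iter=1000):
    """Iterate z(new) = z(old) - mu * grad f(z(old)) until z stops changing."""
    for _ in range(max_iter):
        g = grad_f(*z)
        z_new = (z[0] - mu * g[0], z[1] - mu * g[1])
        if max(abs(z_new[0] - z[0]), abs(z_new[1] - z[1])) < tol:
            return z_new
        z = z_new
    return z

z_min = steepest_descent((0.0, 0.0))
print(z_min)  # close to the minimum point (1, 4)
```

For this particular function, each step multiplies the distance to the minimum by |1 - 2μ| along $x_1$ and by |1 - 6μ| along $x_2$, so the iterations converge only for 0 < μ < 1/3. This illustrates why μ must be small, but not too small.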
Square error minimisation using steepest descent (for linear regression)
Problem: build a linear rule for predicting a scalar u using p input variables x, $u = w_1 x_1 + w_2 x_2 + \ldots + w_p x_p + w_0 x_0$, where $x_0 = 1$ for any entity.
0. Initialise weights w randomly.
1. For each training instance $(x_i, u_i)$:
a. Compute grad($E_i(w)$), where $E_i(w)$ is the part of criterion E in (1) related to the instance:
$E_i = (u_i - w_1 x_{i1} - w_2 x_{i2} - \ldots - w_p x_{ip} - w_0 x_{i0})^2$
By calculus, the t-th component of the gradient is $\partial E_i / \partial w_t = -2 (u_i - \hat{u}_i) x_{it}$ (t = 0, 1, …, p).
b. Update weights w according to the equation
$w^{(new)} = w - \mu \, \mathrm{grad}(E_i(w))$
so that
$w_t^{(new)} = w_t + \mu (u_i - \hat{u}_i) x_{it}$
(here μ is put rather than 2μ because μ is arbitrary).
2. If $w^{(new)} \approx w^{(old)}$, stop; otherwise go to 1 with w = w(new).
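The steps above can be sketched in Python as follows. The data are hypothetical (five noise-free points on the line u = 1 + 2x), the weights start at zero rather than at random so that the run is reproducible, and the values of μ, the tolerance, and the epoch limit are assumptions.

```python
def sgd_linear_regression(X, u, mu=0.02, tol=1e-9, max_epochs=5000):
    """Steepest descent processing one instance at a time.

    X: list of rows, each starting with the fictitious feature x0 = 1.
    u: list of observed target values.
    """
    w = [0.0] * len(X[0])  # assumption: zero start instead of a random one
    for _ in range(max_epochs):
        w_old = w[:]
        for xi, ui in zip(X, u):
            u_hat = sum(wt * xt for wt, xt in zip(w, xi))
            # w_t(new) = w_t + mu * (u_i - û_i) * x_it
            w = [wt + mu * (ui - u_hat) * xt for wt, xt in zip(w, xi)]
        # step 2: stop when a whole pass changes the weights negligibly
        if max(abs(a - b) for a, b in zip(w, w_old)) < tol:
            break
    return w

X = [[1.0, float(x)] for x in range(5)]     # hypothetical inputs 0, 1, 2, 3, 4
u = [1.0 + 2.0 * row[1] for row in X]       # targets on the line u = 1 + 2x
w = sgd_linear_regression(X, u)
print(w)  # close to [1, 2]
```

With noisy data the weights would keep fluctuating from instance to instance, so in practice the stopping test is applied with a looser tolerance or μ is decreased over time.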
Perceptron algorithm (for linear discrimination)
Problem: build a linear rule for predicting a scalar u equal to 1 or -1 using p input variables x, $u = w_1 x_1 + w_2 x_2 + \ldots + w_p x_p + w_0 x_0$, where $x_0 = 1$ for any entity (pattern recognition: u = 1 for the pattern, u = -1 for not).
0. Initialise weights w randomly or to zero.
1. For each training instance $(x_i, u_i)$:
a. Compute $\mathring{u}_i = \mathrm{sign}(\langle w, x_i \rangle)$.
b. If $\mathring{u}_i \neq u_i$, update weights w according to the equation
$w^{(new)} = w^{(old)} + \mu (u_i - \mathring{u}_i) x_i$
where μ, 0 < μ < 1, is the so-called learning rate.
2. Stopping rule: $w^{(new)} \approx w^{(old)}$.
The perceptron is proven to converge to a w that separates the patterns when they are linearly separable.
The perceptron is a slightly modified form of the conventional anti-gradient minimisation algorithm: the partial derivative of $E_i$ with respect to $w_t$ is equal to $-2(u_i - \hat{u}_i) x_{it}$, which is similar to the term used in the perceptron learning rule, $-2(u_i - \mathring{u}_i) x_i$. Thus, the innovation: change the continuous $\hat{u}_i$ for the discrete $\mathring{u}_i = \mathrm{sign}(\hat{u}_i)$ in the anti-gradient process.
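A minimal Python sketch of the perceptron rule. The six two-dimensional points are hypothetical and chosen to be linearly separable; the learning rate μ = 0.5, the zero initial weights, and the epoch limit are assumptions.

```python
def sign(a):
    """1 when a > 0, -1 when a < 0, 0 when a = 0."""
    return (a > 0) - (a < 0)

def perceptron(X, u, mu=0.5, max_epochs=100):
    """Perceptron learning; rows of X start with the fictitious feature x0 = 1."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        changed = False
        for xi, ui in zip(X, u):
            u_pred = sign(sum(wt * xt for wt, xt in zip(w, xi)))
            if u_pred != ui:
                # w(new) = w(old) + mu * (u_i - predicted u_i) * x_i
                w = [wt + mu * (ui - u_pred) * xt for wt, xt in zip(w, xi)]
                changed = True
        if not changed:  # a full pass without updates: w(new) = w(old)
            break
    return w

# Hypothetical linearly separable data: (x0 = 1, x1, x2) and labels u.
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, 2.5, 3.0],
     [1.0, 0.0, 0.0], [1.0, -1.0, 1.0], [1.0, 0.5, -1.0]]
u = [1, 1, 1, -1, -1, -1]

w = perceptron(X, u)
predictions = [sign(sum(wt * xt for wt, xt in zip(w, xi))) for xi in X]
print(w, predictions)
```

On this data the loop stops after the second pass with w = [-0.5, 1.0, 1.0], that is, with the separating line $x_1 + x_2 = 0.5$.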
Neuron and artificial neuron
A neuron cell fires an output when its total input becomes higher than a threshold. Dendrites bring the signal in, the axon passes it out, and the firing occurs via a synapse, a gap between neurons, which sets the threshold.
The decision rule $\mathring{u}_i = \mathrm{sign}(\hat{u}_i)$ can be interpreted in terms of an artificial neuron: features $x_i$ are input signals (from other neurons), weights $w_t$ are the wiring (axon) features, the bias $w_0$ is the firing threshold, and sign() is the neuron activation function. This way, the perceptron is an example of nature-inspired computation.
An artificial neuron is a model: a set of inputs (corresponding to x-features), wiring weights, and an activation function involving a firing threshold.
Two popular activation functions, besides the sign function $\mathring{u}_i = \mathrm{sign}(\hat{u}_i)$, are
the linear activation function, $\mathring{u}_i = \hat{u}_i$ (we considered it when discussing the steepest descent), and
the sigmoid activation function $\mathring{u}_i = s(\hat{u}_i)$, where
$s(x) = 1 / (1 + e^{-x})$,
which is a smooth analogue to the sign function.
This function’s output is always between 0 and 1. To imitate the perceptron with its sign(x) output, between -1 and 1, use
$th(x) = 2 s(x) - 1 = 2 (1 + e^{-x})^{-1} - 1$   (2)
This function is usually referred to as the hyperbolic tangent (in the standard notation it equals tanh(x/2)). In contrast to the sigmoid s(x), th(x) is symmetric about the origin, th(-x) = -th(x), like sign(x), which can be useful in some contexts.
An artificial neuron is a linear function supplied with a non-linear threshold rule.
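The two smooth activation functions can be tabulated with a few lines of Python. This is only a sketch for checking the properties stated above; the function names `s` and `th` follow the notation of these notes, and the sample points are arbitrary.

```python
import math

def s(x):
    """Sigmoid: output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def th(x):
    """Symmetric sigmoid th(x) = 2 s(x) - 1: output between -1 and 1."""
    return 2.0 * s(x) - 1.0

sample = (-3.0, -0.5, 0.0, 0.5, 3.0)
for x in sample:
    # columns: x, s(x), th(x), and the standard hyperbolic tangent at x / 2
    print(x, round(s(x), 4), round(th(x), 4), round(math.tanh(x / 2.0), 4))
```

The last two columns coincide, and th changes sign together with x, whereas s(0) = 0.5 and s never leaves the interval (0, 1).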
Multi-layer neural nets
Neuron nets with one hidden layer (proven to suffice)
0. Problem: Iris features are in pairs: the size (length and width) of sepals (features 1, 2) and that of petals (features 3, 4). It is likely that the sepal sizes and petal sizes are related.
Data: 150 x 4 matrix in /advanced/ml/Data/iris.dat
[Figure: Sepal and petal in an Iris flower.]
For any Iris specimen $x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4})$, i = 1, …, 150, consider $x = (x_{i1}, x_{i2})$ (sepal) as the input and $u = (x_{i3}, x_{i4})$ (petal) as the output. Find F such that $u \approx F(x)$.
1. Model: One-hidden-layer NN
Build F as a neural network of three layers:
(a) an input layer that accepts $x = (x_{i1}, x_{i2})$ and the bias $x_0 = 1$ (see the previous lecture),
(b) an output layer producing the estimate $\hat{u}$ for the output $u = (x_{i3}, x_{i4})$, and
(c) an intermediate, hidden, layer to allow more flexibility in the space of feasible functions F (hidden because it is not seen from the outside).
This structure (Figure 1) is generic in NN theory; it has been proven, for instance, that such a structure can exactly learn any subset of the set of entities. Moreover, any pre-specified u = F(x) can be approximated with such a one-hidden-layer network if the number of hidden neurons is large enough (Cybenko 1989).
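[Network diagram: input layer I (linear) with nodes I1 (x1), I2 (x2) and I3 (x0 = 1); hidden layer II (sigmoid) with nodes II1, II2, II3; output layer III (linear) with nodes III1 (û1) and III2 (û2). Weights wij connect input node i to hidden node j; weights vjk connect hidden node j to output node k.]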
Figure 1. A feed-forward network (no feedback loops) with 2 input and 2 output features. Layers: input (I, indexed by i), output (III, indexed by k) and hidden (II, indexed by j).
Weights I to II form the 3x3 matrix W = (wij), i = I1, I2, I3, j = II1, II2, II3.
Weights II to III form the 3x2 matrix V = (vjk), j = II1, II2, II3, k = III1, III2.
Layers I and III are assumed to apply the identity (linear) transformation; the hidden layer (II) applies the sigmoid s(x) or th(x).
2. Formula for NN transformation F:
Node j of hidden layer II:
Input:
$z_j = w_{1j} x_1 + w_{2j} x_2 + w_{3j} x_3$
that is, $z_j = \sum_i x_i w_{ij}$, the j-th component of the vector z = xW, where x is the 1x3 input vector and W = (wij) is the 3x3 weight matrix. ($x_3$ represents $x_0$ on the network scheme.)
Output:
$\varphi(z_j)$, j = 1, 2, 3, where φ is the activation function, either the sigmoid s or the symmetric th.
Node k of output layer III:
Output = Input,
$\sum_j v_{jk} \varphi(z_j)$,
which is the k-th component of the matrix product $\hat{u} = \varphi(z) V$.
Thus, the NN in Figure 1 transforms input x into output $\hat{u}$ as:
$\hat{u} = \varphi(xW) V$   (3)
If matrices W, V are known, (3) expresses, and computes, the unknown function u = F(x) in terms of φ, W, and V.
3. Learning problem
Find weight matrices W and V minimising the squared difference between the observed u and the $\hat{u}$ found with (3),
$E = d(u, \hat{u}) = \langle u - \varphi(xW)V,\; u - \varphi(xW)V \rangle$,   (4)
over the training entity set.
(A very much non-linear problem!)
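Formula (3) can be spelled out as a short Python sketch. The weight values below are hypothetical and chosen only so that the result is easy to follow; the symmetric sigmoid th is assumed as the hidden-layer activation φ, and the bias component is placed last in x, as $x_3$ in the formula above.

```python
import math

def phi(z):
    """Hidden-layer activation: the symmetric sigmoid th(z) = 2 s(z) - 1."""
    return 2.0 / (1.0 + math.exp(-z)) - 1.0

def forward(x, W, V):
    """Compute û = phi(x W) V for a 1 x 3 input x, 3 x h matrix W and h x 2 matrix V."""
    h = len(V)
    z = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(h)]
    hidden = [phi(zj) for zj in z]
    u_hat = [sum(hidden[j] * V[j][k] for j in range(h)) for k in range(len(V[0]))]
    return hidden, u_hat

# Hypothetical weights: W passes each input to its own hidden neuron.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
V = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
x = [0.0, 0.0, 1.0]  # x1 = x2 = 0, bias x3 = x0 = 1

hidden, u_hat = forward(x, W, V)
print(hidden, u_hat)
```

Here only the bias reaches the hidden layer, so both outputs equal th(1), which is about 0.46.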
4. Learning weights with error back propagation
4.1. Updating formula.
In NN applications, learning weights W and V minimising E is done with back-propagation, which imitates the gradient descent. It runs iterations of updating V and W, each based on the data of one entity (in our case, one of 150 Iris specimens), with the input values in x = (xi) and output values in u = (uk).
An update moves V and W in the anti-gradient direction:
$V^{(new)} = V^{(old)} - \mu\, gV, \quad W^{(new)} = W^{(old)} - \mu\, gW$   (5)
where μ is the learning rate (step size) and gV, gW are the parts of the gradient of the error function E in (4) related to matrices V and W.
Specifically, the error function is
$E = [(u_1 - \hat{u}_1)^2 + (u_2 - \hat{u}_2)^2] / 2$   (6)
where $e_1 = u_1 - \hat{u}_1$ and $e_2 = u_2 - \hat{u}_2$ are the differences between the actual and predicted outputs. The division by 2 is made to avoid the factor 2 in the derivatives of E. Also, this E resembles the so-called potential energy function in physics.
Equation (5) can be rewritten component-wise:
$v_{jk}^{(new)} = v_{jk}^{(old)} - \mu\, \partial E / \partial v_{jk}$,
$w_{ij}^{(new)} = w_{ij}^{(old)} - \mu\, \partial E / \partial w_{ij}$   (i ∈ I, j ∈ II, k ∈ III)   (5’)
To make this computable, one must express the derivatives explicitly. This can be done with the so-called chain rule of calculus: the derivative of a composition of functions is equal to the product of the derivatives of these functions. For example, if f(x) = g(p(x)), then f′(x) = g′(p(x)) p′(x), where ′ denotes the derivative.
In particular,
$\partial E / \partial v_{jk} = -(u_k - \hat{u}_k)\, \partial \hat{u}_k / \partial v_{jk}$.
Since $\hat{u}_k = \sum_j \varphi(z_j) v_{jk}$, the derivative $\partial \hat{u}_k / \partial v_{jk} = \varphi(z_j)$; thus,
$\partial E / \partial v_{jk} = -(u_k - \hat{u}_k)\, \varphi(z_j)$.   (7)
The derivative $\partial E / \partial w_{ij}$ refers to the layer of W, the previous one, which requires more chain derivations. Specifically,
$\partial E / \partial w_{ij} = \sum_k [-(u_k - \hat{u}_k)\, \partial \hat{u}_k / \partial w_{ij}]$.
Since $\hat{u}_k = \sum_j \varphi(\sum_i x_i w_{ij})\, v_{jk}$, the derivative
$\partial \hat{u}_k / \partial w_{ij} = v_{jk}\, \varphi'(\sum_i x_i w_{ij})\, x_i$.
Now we need the derivatives of the activation functions, sigmoid or hyperbolic tangent, which are simple polynomials of the functions themselves:
$s'(x) = [(1 + e^{-x})^{-1}]' = (-1)(1 + e^{-x})^{-2} (e^{-x})' = (-1)(1 + e^{-x})^{-2} e^{-x} (-1) = (1 + e^{-x})^{-2} e^{-x} = s(x)(1 - s(x))$   (8)
$th'(x) = [2 s(x) - 1]' = 2 s'(x) = 2 s(x)(1 - s(x)) = (1 + th(x))(1 - th(x))/2$
The final expression for the derivatives:
$\partial E / \partial w_{ij} = -\sum_k [(u_k - \hat{u}_k)\, v_{jk}]\, \varphi'(z_j)\, x_i$   (9)
where $\varphi'(z_j)$ is given in (8), depending on which activation function is used.
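The expressions of the derivatives of the activation functions through the functions themselves are easy to check numerically. The sketch below compares them with central finite differences; the helper names `ds`, `dth` and `numeric_derivative`, the sample points, and the step 1e-6 are assumptions of the example.

```python
import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

def th(x):
    return 2.0 * s(x) - 1.0

def ds(x):
    """Derivative of the sigmoid through the sigmoid itself: s(x)(1 - s(x))."""
    return s(x) * (1.0 - s(x))

def dth(x):
    """Derivative of th through th itself: (1 + th(x))(1 - th(x)) / 2."""
    return (1.0 + th(x)) * (1.0 - th(x)) / 2.0

def numeric_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

points = (-2.0, -0.5, 0.0, 1.0, 3.0)
for x in points:
    # both differences should be close to zero
    print(x, ds(x) - numeric_derivative(s, x), dth(x) - numeric_derivative(th, x))
```

The same finite-difference device can be applied to E itself to verify formulas (7) and (9) on a small network.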
Equations (5), (7) and (9) lead to the following rule for the
processing of an instance in the back-propagation algorithm.
4.2. Instance Processing:
1. Forward computation (of the output $\hat{u}$ and error). Given matrices V and W, upon receiving an instance (x, u), the estimate $\hat{u}$ of vector u is computed according to the neural network as formalised in equation (3), and the error $e = u - \hat{u}$ is calculated.
2. Error back-propagation (for estimation of the gradient). Each neuron receives the relevant error estimate, which is
$-e_k = -(u_k - \hat{u}_k)$ from (7)
for output neurons k (k = III1, III2), or
$-\sum_k [(u_k - \hat{u}_k)\, v_{jk}]$ from (9)
for hidden neurons j (j = II1, II2, II3) [this can be seen as the sum of errors arriving from the output neurons according to the corresponding synapse weights], and adjusts it to the derivative (7) or (9) by multiplying it by its local data depending on the source signal, which is $\varphi(z_j)$ for neuron k’s source j in (7), and $\varphi'(z_j)\, x_i$ for neuron j’s source i in (9).
The results constitute matrices gV and gW, respectively.
3. Weights update. Matrices V and W are updated according to formula (5) or, equivalently, (5’).
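The three steps can be put together in a short Python sketch. The instance (x, u), the initial weights, and the learning rate μ = 0.1 are hypothetical values chosen for illustration, and the hidden layer is assumed to use th; the function name `process_instance` is likewise made up.

```python
import math

def phi(z):
    """Symmetric sigmoid th(z) = 2 s(z) - 1."""
    return 2.0 / (1.0 + math.exp(-z)) - 1.0

def process_instance(x, u, W, V, mu):
    """Steps 1-3 for one instance; returns updated W, V and the error E before the update."""
    n_in, h, n_out = len(x), len(V), len(u)
    # 1. Forward computation by (3): z = x W, hidden output a = phi(z), û = a V
    z = [sum(x[i] * W[i][j] for i in range(n_in)) for j in range(h)]
    a = [phi(zj) for zj in z]
    u_hat = [sum(a[j] * V[j][k] for j in range(h)) for k in range(n_out)]
    e = [uk - uhk for uk, uhk in zip(u, u_hat)]
    E = sum(ek * ek for ek in e) / 2.0                      # error function (6)
    # 2. Error back-propagation: gV by (7), gW by (9)
    gV = [[-e[k] * a[j] for k in range(n_out)] for j in range(h)]
    back = [sum(e[k] * V[j][k] for k in range(n_out)) for j in range(h)]
    dphi = [(1.0 + aj) * (1.0 - aj) / 2.0 for aj in a]      # th'(z_j) through th(z_j)
    gW = [[-back[j] * dphi[j] * x[i] for j in range(h)] for i in range(n_in)]
    # 3. Weights update by (5)
    V = [[V[j][k] - mu * gV[j][k] for k in range(n_out)] for j in range(h)]
    W = [[W[i][j] - mu * gW[i][j] for j in range(h)] for i in range(n_in)]
    return W, V, E

# Hypothetical instance (bias component last) and initial weights.
x, u = [0.3, -0.2, 1.0], [0.5, -0.4]
W = [[0.1, -0.2, 0.3], [0.2, 0.1, -0.1], [0.0, 0.1, -0.2]]
V = [[0.2, -0.1], [-0.3, 0.2], [0.1, 0.1]]

errors = []
for _ in range(200):  # the same instance is presented repeatedly
    W, V, E = process_instance(x, u, W, V, mu=0.1)
    errors.append(E)
print(errors[0], errors[-1])
```

Presenting one instance repeatedly is done here only to watch E decrease; in the actual algorithm each instance is processed once per epoch.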
What is nice in this procedure is that the computation is done locally, so
that every neuron processes only the data that are available to this neuron,
first from the previous layers, then backward, from the output layer. In
particular, the algorithm does not change if the number of hidden neurons is
increased from h=3, in Figure 1, to any other integer h.
The procedure 4.2 can be easily extended to any feed-forward
network, with many hidden layers as well.
Procedure 1-3 is performed for all available entities in a random order, which constitutes an epoch. Typically, one epoch is not enough for matrices V and W to stabilise.
Thus, a number of epochs are executed until the matrices are stabilised. Since, in practical calculations, this may take a very long time, other stopping criteria can be utilised. One such criterion is that the difference between the average values (over iterations within an epoch) of the error function (6) in consecutive epochs becomes smaller than a pre-specified threshold, such as 0.0001. Yet another criterion is halting the process when the number of epochs applied to the data reaches a pre-specified threshold, such as 10,000.
An explicit formulation of the back-propagation algorithm is as follows.
5. Back propagation algorithm for NN on Fig. 1 (for Iris data set which is available as a whole).
A. Initialise weight matrices W = (wij) and V = (vjk) with random values from the normal distribution N(0,1), with mean 0 and variance 1.
B. Choose the data standardisation option, which amounts to selecting the shift and scale coefficients, av and bv, for each feature v, so that every data entry xiv is transformed to yiv = (xiv - av)/bv.
C. Formulate the Halt criterion as explained above and run a loop over epochs.
D. Randomise the order of entities [in MATLAB, with the command randperm(N)] and run a loop of the 4.2 Instance Processing in that order (an epoch).
E. If the Halt criterion is met, end the computation and output the results: W, V, û, e, and E. Otherwise, execute D again.
Back propagation should be executed with a re-sampling scheme, such as k-fold cross-validation, to estimate how much the results vary with changes in the data.
6. Data standardisation for NN learning
Due to the specifics of binary target variables and of activation functions such as th(x) and sign(x), which have -1 and 1 as their boundaries, the data are frequently pre-processed in the NN context so that every feature’s range is between -1 and 1: take the scale coefficient bv equal to the half-range, bv = (Mv - mv)/2, and the shift coefficient av equal to the mid-range, av = (Mv + mv)/2. Here Mv denotes the maximum and mv the minimum of feature v.
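A minimal Python sketch of this standardisation for a single feature, applied to the nine sepal lengths (w1) shown in Table 0.3. The function name `standardise` is an assumption of the example, and the sketch presumes that the feature is not constant (otherwise bv = 0).

```python
def standardise(column):
    """Map a feature to the range [-1, 1] using the mid-range shift and half-range scale."""
    M, m = max(column), min(column)
    a = (M + m) / 2.0   # shift coefficient: mid-range
    b = (M - m) / 2.0   # scale coefficient: half-range (assumed non-zero)
    return [(x - a) / b for x in column]

w1 = [5.1, 4.4, 4.4, 6.4, 5.5, 5.7, 6.3, 6.7, 7.2]  # sepal lengths from Table 0.3
y = standardise(w1)
print([round(v, 3) for v in y])
```

The minimum of the feature is mapped to -1 and the maximum to 1, with all other values in between.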
7. Performing computations for Iris data
The MatLab code below implements the back propagation algorithm for
Iris data set stored in \ml\Data as file iris.dat.
% nnbpm.m for learning petal from sepal in iris data
% modified with the hyperbolic tangent in the hidden layer and
% data normalisation to [-1,1] interval
%--------------preparing input and output data --------------------
da=load('Data\iris.dat');
[n,m]=size(da);
%----------------normalise to [-1,1] scale---------------------
mr=max(da);
ml=min(da);
ra=mr-ml;
ba=mr+ml;
tda=2*da-ones(n,1)*ba;
dan=tda./(ones(n,1)*ra);
dan=10*dan; %make the scale 10-fold to see things clearly
output=dan(:,[3 4]); % petal sizes
input=dan(:,[1 2]); % sepal sizes
input(:,3)=10; % bias component, 10-fold
%-----------------initialise the network -----------------------
h=3; %the number of hidden neurons
W=randn(3,h) %initialising wij weights
V=randn(h,2) %initialising vjk weights
W0=W; % to store if this is good
V0=V; % to store if this is good
count=0; %counter of epochs
stopp=0; %halt-condition negative
%------------- looping epochs---------------------------
while(stopp==0)
    mede=zeros(1,2); % mean error to be stored after an epoch
    ror=randperm(n); % randomly ordering
    %----------------looping entities in the random order ror
    for ii=1:n
        x=input(ror(ii),:); %current instance's input
        u=output(ror(ii),:);% current instance's output
        %--------------forward pass (to calculate response ov and error err)-----
        ow=x*W;% summary action of inputs in the hidden layer
        o1=1+exp(-ow);
        oow=ones(1,h)./o1; %sigmoid transformation
        oow=2*oow-1;% symmetric sigmoid output of the hidden layer
        ov=oow*V; %output of the output layer
        err=u-ov; %the error
        mede=mede+abs(err)/n; % the average absolute error
        %------------error back-propagating-------------------------
        gV=-oow'*err; % gradients of matrix V
        t1=V*err'; % error propagated to the hidden layer
        t2=(1-oow).*(1+oow)/2; %the symmetric sigmoid's derivative
        t3=t2.*t1'; % error adjusted by the derivative
        gW=-x'*t3; % gradients of matrix W
        %-----------------change of the weights----------------------
        mu=0.00001;%learn. rate; greater values hinder convergence
        V=V-mu*gV;
        W=W-mu*gW;
    end; %of the entity loop
    %------------------halt: stop-condition -------------------------
    count=count+1;
    ss=mean(mede);
    if ss<0.01||count>=30000 % small error or the number of epochs reached
        stopp=1;
    end;
    if rem(count,200)==0
        count
        mede
    end % these are to watch results every 200-th epoch
end;
V0
W0 %these are to copy initial weights if results are good
This program leads to the average errors presented below at different numbers of hidden neurons h (note that the feature ranges are equal to 20 here):
 h    |e1|   |e2|
 3    1.11   1.76
 6    1.00   1.69
10    0.97   1.63
One can see an improvement, but not so great!
Homework:
1. Find the values of E for the errors reported in the table above.
2. Take a look at what happens if the data are not normalised.
3. What happens if the learning rate is increased, or decreased, ten times?
4. Extend the table above for different numbers of hidden neurons.
5. Try petal sizes as input with sepal sizes as output.
6. Adapt this code to other data such as studn.dat.
7. Modify this code to involve the sigmoid activation function.
8. Find a way to improve the convergence of the process, for instance, with adaptive changes in the step size values.
Decision Trees: a structure used for prediction of quantitative
features (regression tree) or nominal features (classification tree).
Each node corresponds to a subset of entities (the root to the set of
all entities I), and its children are the subset’s parts defined by a
single predictor feature x.
Each terminal node is assigned an individual target feature value u.
Example: Author-defined clusters of eight Companies (u – product)
Sector:
  Util./Ind. -> Ecom:
    No -> A
    Yes -> B
  Retail -> C
Figure 1. Decision tree for three product-defined classes of Companies defined by categorical features.
NSup:
  <4 -> ShareP:
    > 30 -> A
    < 30 -> B
  4 or more -> C
Figure 2. Decision tree for three product-defined classes of Companies defined by quantitative features.
Decision trees:
Advantages:
- Interpretability
- Computational efficiency
Drawbacks:
- Simplistic
- Imprecise
Algorithm: Take a node and a feature with its value(s), and split the corresponding subset accordingly.
Issues (classification tree):
Stop: Whether any node should be split at all.
Select: Which node of the tree to split, and by which feature.
Score:
- Chi-squared (CHAID in SPSS package),
- Entropy (C4.5 package),
- Change of Gini coefficient (CART package).
Assign: What target class k to assign to a terminal node x:
Conventionally, the k* at which p(k/x) is maximised over k.
I suggest: This is ok when p(k) is about 10%-30%.
Otherwise, use a comparison between p(k/x) and p(k). Specifically,
(i) If p(k) is of the order of 50%, then the absolute Quetelet index a(k/x) = p(k/x) - p(k) should be used;
(ii) If p(k) is of the order of 1% or less, the relative Quetelet index q(k/x) = [p(k/x) - p(k)]/p(k) should be employed.
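The two Quetelet indexes are straightforward to compute; here is a small Python sketch. The probabilities in the two calls are hypothetical, picked to match cases (i) and (ii), and the function name `quetelet` is an assumption of the example.

```python
def quetelet(p_k_given_x, p_k):
    """Return the absolute index a(k/x) and the relative index q(k/x)."""
    a = p_k_given_x - p_k          # absolute change of the probability of k at node x
    q = a / p_k                    # the same change relative to p(k)
    return a, q

# (i) a frequent class: p(k) = 50%, p(k/x) = 70% at the node
a_frequent, q_frequent = quetelet(0.70, 0.50)
# (ii) a rare class: p(k) = 1%, p(k/x) = 5% at the node
a_rare, q_rare = quetelet(0.05, 0.01)
print(a_frequent, q_frequent)
print(a_rare, q_rare)
```

In the rare-class case the absolute index is only 0.04, while the relative index equals 4: the probability of k at the node is five times its overall value, a 400% increase. This is why the relative index is more telling there.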