Class Problems Week 4

CE802 Class Problem Sheet 2 (Week 4)
(Q1)
The following data are used to train a system to determine the best crop to grow
in a particular field:
Drainage   Soil    PH         Best crop
Good       Heavy   Acid       Thistles
Poor       Heavy   Acid       Dandelions
Poor       Light   Acid       Bindweed
Good       Light   Alkaline   Thistles
Good       Heavy   Alkaline   Thistles
Poor       Heavy   Alkaline   Dandelions
Good       Light   Acid       Thistles
Poor       Light   Alkaline   Bindweed
Calculate the information gains provided by each of the three attributes:
Drainage, Soil and PH. Hence construct the decision tree for determining the
best crop from the values of these three attributes.
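One way to check the hand calculation is with a short script. The following is a minimal Python sketch, not a model answer: the names `ROWS`, `entropy` and `information_gain` are illustrative assumptions, and the rows are copied from the training data above.

```python
import math
from collections import Counter

# Training data from Q1: (Drainage, Soil, PH, Best crop).
ROWS = [
    ("Good", "Heavy", "Acid", "Thistles"),
    ("Poor", "Heavy", "Acid", "Dandelions"),
    ("Poor", "Light", "Acid", "Bindweed"),
    ("Good", "Light", "Alkaline", "Thistles"),
    ("Good", "Heavy", "Alkaline", "Thistles"),
    ("Poor", "Heavy", "Alkaline", "Dandelions"),
    ("Good", "Light", "Acid", "Thistles"),
    ("Poor", "Light", "Alkaline", "Bindweed"),
]
ATTRIBUTES = {"Drainage": 0, "Soil": 1, "PH": 2}
CLASS_INDEX = 3


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    # p * log2(1/p) summed over the classes that actually occur.
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())


def information_gain(rows, attr_index, class_index=CLASS_INDEX):
    """Entropy of the whole set minus the weighted entropy after the split."""
    before = entropy([row[class_index] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[class_index])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


if __name__ == "__main__":
    for name, index in ATTRIBUTES.items():
        print(f"{name}: {information_gain(ROWS, index):.3f} bits")
```

The attribute with the largest gain is the natural choice for the root of the tree; the same function can then be applied to the subset of rows that falls on each branch.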
(Q2)
(i)
An interstellar space probe is launched to investigate the distribution of life
outside the solar system. Seven planets are investigated and life is found on four
of them. Calculate the amount of information needed to predict whether life will
be found on a planet.
(It may be helpful to remember that log_2(x) = log_e(x)/log_e(2) = log_10(x)/log_10(2).)
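The change-of-base hint can be tried out numerically. This is a small Python sketch under the assumption that the information is measured in bits; the function names are hypothetical and chosen only for illustration.

```python
import math


def log2_from_natural(x):
    """Base-2 logarithm via natural logarithms: log_e(x) / log_e(2)."""
    return math.log(x) / math.log(2)


def log2_from_common(x):
    """Base-2 logarithm via base-10 logarithms: log_10(x) / log_10(2)."""
    return math.log10(x) / math.log10(2)


def two_class_information(positives, total):
    """Information, in bits, needed to predict a two-class outcome."""
    result = 0.0
    for count in (positives, total - positives):
        if count:  # a class with no examples contributes nothing
            p = count / total
            result -= p * log2_from_natural(p)
    return result


if __name__ == "__main__":
    # Both routes should agree with each other up to rounding error.
    print(log2_from_natural(8), log2_from_common(8))
    # Life found on four of the seven planets.
    print(round(two_class_information(4, 7), 3))
```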
(ii)
Measurements are also made of the period of rotation (day length), number of
moons and principal atmospheric gas. The results are displayed in the following
table:
Attributes                                Class
Day Length   Moons           Atmosphere
Medium       More than one   Methane      Life
Short        Zero            Oxygen       No Life
Short        Zero            Nitrogen     Life
Long         Zero            Oxygen       No Life
Long         One             Ammonia      Life
Long         Zero            Ammonia      No Life
Long         One             Nitrogen     Life
Calculate the information gain provided by the Day Length attribute.
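When working this by hand it helps to lay out, for each value of the attribute, how many planets fall on that branch and how mixed their classes are. The Python sketch below does that bookkeeping; `PLANETS`, `split_summary` and `gain_from_summary` are illustrative names rather than anything prescribed by the sheet, and the rows are copied from the table above.

```python
import math
from collections import Counter, defaultdict

# Observations from Q2(ii): (Day Length, Moons, Atmosphere, Class).
PLANETS = [
    ("Medium", "More than one", "Methane", "Life"),
    ("Short", "Zero", "Oxygen", "No Life"),
    ("Short", "Zero", "Nitrogen", "Life"),
    ("Long", "Zero", "Oxygen", "No Life"),
    ("Long", "One", "Ammonia", "Life"),
    ("Long", "Zero", "Ammonia", "No Life"),
    ("Long", "One", "Nitrogen", "Life"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())


def split_summary(rows, attr_index, class_index=3):
    """Return {value: (number_of_rows, entropy_in_bits)} for one attribute."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[attr_index]].append(row[class_index])
    return {value: (len(labels), entropy(labels))
            for value, labels in branches.items()}


def gain_from_summary(rows, summary, class_index=3):
    """Entropy before the split minus the weighted branch entropies."""
    before = entropy([row[class_index] for row in rows])
    remainder = sum(count / len(rows) * h for count, h in summary.values())
    return before - remainder


if __name__ == "__main__":
    summary = split_summary(PLANETS, 0)  # column 0 is Day Length
    for value, (count, h) in summary.items():
        print(f"{value}: {count} planets, entropy {h:.3f} bits")
    print(f"gain: {gain_from_summary(PLANETS, summary):.3f} bits")
```

Changing the column index passed to `split_summary` gives the same breakdown for Moons (1) or Atmosphere (2); small differences in the last decimal place from hand-rounded values are to be expected.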
(iii)
The Moons attribute provides an information gain of 0.523; the Atmosphere
attribute provides an information gain of 0.699. Using these values and the
result of part (ii), construct the complete decision tree to predict whether life will
be found on a planet. No further calculation should be necessary.
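A tree drawn by hand can be compared against one built mechanically. The sketch below assumes the intended procedure is the usual greedy one (split on the attribute with the highest gain, then repeat on any branch whose examples are still mixed); it recomputes the gains from the table rather than using the quoted values, and `build_tree` and the tuple-and-dictionary representation of the tree are hypothetical choices made for illustration.

```python
import math
from collections import Counter

ATTRIBUTES = ("Day Length", "Moons", "Atmosphere")
COLUMN = {name: i for i, name in enumerate(ATTRIBUTES)}

# Observations from Q2(ii); the last field of each row is the class.
PLANETS = [
    ("Medium", "More than one", "Methane", "Life"),
    ("Short", "Zero", "Oxygen", "No Life"),
    ("Short", "Zero", "Nitrogen", "Life"),
    ("Long", "Zero", "Oxygen", "No Life"),
    ("Long", "One", "Ammonia", "Life"),
    ("Long", "Zero", "Ammonia", "No Life"),
    ("Long", "One", "Nitrogen", "Life"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())


def partition(rows, index):
    """Group rows by the value they take in the given column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    return groups


def gain(rows, index):
    """Information gain, in bits, from splitting rows on one column."""
    before = entropy([row[-1] for row in rows])
    after = sum(len(g) / len(rows) * entropy([r[-1] for r in g])
                for g in partition(rows, index).values())
    return before - after


def build_tree(rows, attributes):
    """Return a class label (leaf) or (attribute_name, {value: subtree})."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:      # pure branch: stop with a leaf
        return labels[0]
    if not attributes:             # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda name: gain(rows, COLUMN[name]))
    remaining = [name for name in attributes if name != best]
    return (best, {value: build_tree(subset, remaining)
                   for value, subset in partition(rows, COLUMN[best]).items()})


if __name__ == "__main__":
    print(build_tree(PLANETS, ATTRIBUTES))
```

Each leaf of the result covers only one or two of the seven planets, which is worth bearing in mind when judging how far such a tree can be trusted on new data.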
(iv)
Discuss whether the tree that formed your answer to part (iii) is likely to exhibit
overfitting, giving a full explanation of your conclusion.