CE802 Class Problem Sheet 2 (Week 4)

(Q1) The following data are used to train a system to determine the best crop to grow in a particular field:

    Drainage   Soil    pH         Best crop
    Good       Heavy   Acid       Thistles
    Poor       Heavy   Acid       Dandelions
    Poor       Light   Acid       Bindweed
    Good       Light   Alkaline   Thistles
    Good       Heavy   Alkaline   Thistles
    Poor       Heavy   Alkaline   Dandelions
    Good       Light   Acid       Thistles
    Poor       Light   Alkaline   Bindweed

Calculate the information gain provided by each of the three attributes: Drainage, Soil and pH. Hence construct the decision tree for determining the best crop from the values of these three attributes.

(Q2)

(i) An interstellar space probe is launched to investigate the distribution of life outside the solar system. Seven planets are investigated and life is found on four of them. Calculate the amount of information needed to predict whether life will be found on a planet. (It may be helpful to remember that log_2(x) = log_e(x)/log_e(2) = log_10(x)/log_10(2).)

(ii) Measurements are also made of the period of rotation (day length), number of moons and principal atmospheric gas. The results are displayed in the following table:

    Attributes                                 Class
    Day Length   Moons           Atmosphere
    Medium       More than one   Methane       Life
    Short        Zero            Oxygen        No Life
    Short        Zero            Nitrogen      Life
    Long         Zero            Oxygen        No Life
    Long         One             Ammonia       Life
    Long         Zero            Ammonia       No Life
    Long         One             Nitrogen      Life

Calculate the information gain provided by the Day Length attribute.

(iii) The Moons attribute provides an information gain of 0.523; the Atmosphere attribute provides an information gain of 0.699. Using these values and the result of part (ii), construct the complete decision tree to predict whether life will be found on a planet. No further calculation should be necessary.

(iv) Discuss whether the tree that formed your answer to part (iii) is likely to exhibit overfitting, giving a full explanation of your conclusion.
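
For checking hand calculations, the entropy and information gain computations used in Q1 and Q2 can be sketched in Python as below. This is a minimal sketch, not part of the module material: the function names `entropy` and `information_gain` and the dictionary-per-row layout are assumptions made for illustration, and entropy is measured in bits (base-2 logarithms). The rows are the seven planets from the Q2 table.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - remainder


# The Q2 table, one tuple per planet (column names are illustrative keys).
columns = ("Day Length", "Moons", "Atmosphere", "Class")
planets = [dict(zip(columns, values)) for values in [
    ("Medium", "More than one", "Methane", "Life"),
    ("Short", "Zero", "Oxygen", "No Life"),
    ("Short", "Zero", "Nitrogen", "Life"),
    ("Long", "Zero", "Oxygen", "No Life"),
    ("Long", "One", "Ammonia", "Life"),
    ("Long", "Zero", "Ammonia", "No Life"),
    ("Long", "One", "Nitrogen", "Life"),
]]

# Print the gain for each attribute, rounded to three decimal places.
for attribute in columns[:-1]:
    print(attribute, round(information_gain(planets, attribute, "Class"), 3))
```

The same two functions can be applied to Q1 by replacing `columns` and the row tuples with the crop data and using the crop column as the target. Values printed by the sketch may differ from hand-worked or quoted figures in the third decimal place because of rounding.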