2007 Supplemental - Computing - Dublin Institute of Technology

MSc in Computing
(Knowledge Management)
Dr. B. Mac Namee
Prof. B. O’Shea
Mr. A. French
Time Allowed: 2 hours
Attempt any two questions.
All questions carry equal marks.
1. (a) Discuss the problem of overfitting as it relates to training decision trees.
(10 marks)
(b) With regard to measuring classifier accuracy, discuss the advantages of using
specificity, sensitivity and precision in conjunction with a confusion matrix
rather than using a single accuracy figure.
(12 marks)
(c) As part of a bank’s compliance program, a classification system is to be built
to classify bank loan applicants into those that are likely to default on their
loans and those that are likely to repay them. A large set of historical
labelled data is available for training the system, although there are a large
number of missing values in the data due to data entry problems.
The system must be as accurate as possible. Also, it is important that when
the bank is audited it should be possible for auditors to find out the reasons
behind the classifications that the system makes. It is expected that the
system will be reviewed, and retrained if necessary, every year.
(i) Comment on the issues involved in selecting a suitable classification
technique for this task.
(8 marks)
(ii) Compare the suitability for this task of any three classification
techniques with which you are familiar. Suggest, with reasons, which
one would be the most appropriate.
(13 marks)
(iii) Suggest an appropriate strategy for testing the classifier created for
the scenario described above.
(7 marks)
2. (a) When speaking about Business Systems Intelligence it is often said that
organizations are “drowning in data, but starving for knowledge”. Explain
what is meant by this and suggest how business systems intelligence tools can
be used to remedy the situation.
(8 marks)
(b) Bill Inmon proposes that data in a data warehouse have four properties.
Discuss these properties, illustrating each with an example and appropriate
(16 marks)
(c) Compare and contrast on-line transaction processing (OLTP) and on-line
analytical processing (OLAP).
(14 marks)
(d) Discuss the importance of using a standard process for business systems
intelligence projects and describe one such process.
(12 marks)
3. (a) “Data cleaning is one of the three biggest problems in data warehousing”
—Ralph Kimball
(i) Identify the common ways in which real world data found in business
systems intelligence projects can be dirty. For each, suggest the main
reasons that the data is likely to have become dirty.
(10 marks)
(ii) Why is it important to deal with dirty data?
(4 marks)
(b) Briefly describe one technique for identifying outliers in data and explain
why the issue of handling outliers is so troublesome.
(12 marks)
(c) Briefly explain the difference between lossy and lossless data compression.
(4 marks)
(d) Describe the important properties of a general association rule mining
(12 marks)
(e) Describe the key idea that underlies the Apriori algorithm for association
rule mining and explain why it is so important.
(8 marks)