R217/924

DUBLIN INSTITUTE OF TECHNOLOGY
KEVIN STREET, DUBLIN 8
____________________
MSc in Computing (Knowledge Management)
____________________
SUPPLEMENTAL EXAMINATIONS 2007
____________________
BUSINESS SYSTEMS INTELLIGENCE

Dr. B. Mac Namee
Prof. B. O’Shea
Mr. A. French

Time Allowed: 2 hours

Attempt any two questions. All questions carry equal marks.

1. (a) Discuss the problem of overfitting as it relates to training decision tree classifiers.
       (10 marks)

   (b) With regard to measuring classifier accuracy, discuss the advantages of using specificity, sensitivity and precision in conjunction with a confusion matrix rather than using a single accuracy figure.
       (12 marks)

   (c) As part of a bank’s compliance program, a classification system is to be built to classify bank loan applicants into those that are likely to default on their loans and those that are likely to repay them. A large set of historical labelled data is available for training the system, although there are a large number of missing values in the data due to data entry problems. The system must be as accurate as possible. Also, it is important that when the bank is audited it should be possible for auditors to find out the reasons behind the classifications that the system makes. It is expected that the system will be reviewed, and retrained if necessary, every year.

       (i) Comment on the issues involved in selecting a suitable classification technique for this task.
           (8 marks)

       (ii) Compare the suitability for this task of any three classification techniques with which you are familiar. Suggest, with reasons, which one would be the most appropriate.
           (13 marks)

       (iii) Suggest an appropriate strategy for testing the classifier created for the scenario described above.
           (7 marks)

2. (a) When speaking about Business Systems Intelligence it is often said that organizations are “drowning in data, but starving for knowledge”. Explain what is meant by this and suggest how business systems intelligence tools can be used to remedy the situation.
       (8 marks)

   (b) Bill Inmon proposes that data in a data warehouse have four properties. Discuss these properties, illustrating each with an example and appropriate diagram.
       (16 marks)

   (c) Compare and contrast on-line transaction processing (OLTP) and on-line analytical processing (OLAP).
       (14 marks)

   (d) Discuss the importance of using a standard process for business systems intelligence projects and describe one such process.
       (12 marks)

3. (a) “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)

       (i) Identify the common ways in which real-world data found in business systems intelligence projects can be dirty. For each, suggest the main reasons why the data is likely to have become dirty.
           (10 marks)

       (ii) Why is it important to deal with dirty data?
           (4 marks)

   (b) Briefly describe one technique for identifying outliers in data and explain why the issue of handling outliers is so troublesome.
       (12 marks)

   (c) Briefly explain the difference between lossy and lossless data compression.
       (4 marks)

   (d) Describe the important properties of a general association rule mining algorithm.
       (12 marks)

   (e) Describe the key idea that underlies the Apriori algorithm for association rule mining and explain why it is so important.
       (8 marks)