IS328 Data Mining
Faculty of Science, Technology and Environment
School of Computing, Information and Mathematical Sciences
Final Examination
Semester 2, 2018

Mode: F2F and Online

Question Paper

Duration of Exam: 3 Hours + 10 Minutes
Reading Time: 10 minutes
Writing Time: 3 Hours
Total mark: 100

Instructions:
- This exam has two sections, A and B.
    Section A: 10 Short Answer Questions (34 Marks)
    Section B: 3 Long Answer Questions (66 Marks)
- Answer ALL questions in Section A and only THREE questions in Section B.
- Answer both Sections A and B in the answer booklet provided.
- You may use non-programmable electronic calculators. No other electronic devices are allowed in the exam.
- This exam is worth 50% of your overall mark.
- This booklet has 8 pages including the cover page.

IS328 Final Examination Semester 2, 2018 Page 1 of 8

Section A: Short Answer Questions (34 marks)

Q1) [3 marks]
Briefly explain the following data types and give an example of each data type.
- Interval Data
- Ratio Data
- Nominal Data

Q2) [4 marks]
Consider the following six objects described by two attributes X1 and X2.

        A     B     C     D     E     F
  X1    1     1.5   5     3     4     3
  X2    1     1.5   5     4     4     3.5

Assume there are two clusters as follows:
  C1 = {A, B}
  C2 = {C, D, E, F}
Use Manhattan distances to calculate the following:
(i) The complete link distance between C1 and C2.
(ii) The centroid link distance between C1 and C2.

Q3) [3 marks]
Assume that some medical test results are described using seven symmetric binary attributes as shown below:

        Mary  Jack  John  David
  A1    1     1     1     1
  A2    0     1     1     0
  A3    1     1     0     1
  A4    0     0     1     0
  A5    1     0     0     1
  A6    1     1     0     0
  A7    0     0     1     1

Given a new patient Lisa as below, find the 3 nearest neighbours of Lisa. Show your calculations.

        Lisa
  A1    0
  A2    1
  A3    0
  A4    1
  A5    1
  A6    1
  A7    0

Q4) [3 marks]
Consider the following transaction database.

  Transaction-id   Items
  1                A, B, C, E
  2                A, B, D, E
  3                B, C, D, E
  4                B, D, E
  5                A, B, D
  6                B, E, C
  7                B, A, E
  8                C, B, E

(i) What is the support in percentage of the itemset {B,C,E}?
(ii) What is the maximum support for a 3-item set?

Q5) [4 marks]
Refer to the transaction database given in Q4.
(i) Which of the following rules has the highest confidence?
  R1: A → BD
  R2: B → DE
  R3: C → BE
  R4: D → AE
  R5: E → AB
(ii) Sort the above rules in descending order according to their lift values.

Q6) [3 marks]
Assume that we have a dataset containing information about 200 individuals. One hundred of these individuals have purchased life insurance. A supervised data mining session has discovered the following rule:

  IF age < 30 & credit card insurance = yes THEN life insurance = yes
  Rule Accuracy: 70%
  Rule Coverage: 63%

How many individuals who have credit card insurance and are less than 30 years old will be in the class life insurance = no? Justify your answer.

Q7) [3 marks]
Consider the following transaction database.

  T1   ABC
  T2   BCD
  T3   ABCD
  T4   ABD
  T5   ABCD

a) Complete the following table and highlight all the frequent sets. Assume that the minimum support is 60%.

  ItemSet  Support    ItemSet  Support    ItemSet  Support
  A        ____       AB       ____       ABC      ____
  B        ____       AC       ____       ABD      ____
  C        ____       AD       ____       ACD      ____
  D        ____       BC       ____       BCD      ____
                      BD       ____       ABCD     ____
                      CD       ____

b) How many rules can be generated from the frequent sets? Justify your answer.

Q8) [4 marks]
From the table completed in Q7, determine the following:
(i) All closed frequent sets
(ii) All maximal frequent sets
Justify your answers.

Q9) [4 marks]
Compare the pros and cons of decision tree and neural network classification methods. Provide at least two pros and two cons of each.

Q10) [3 marks]
Calculate the information gain for the split shown below. Show your calculations.

Section B: Long Answer Questions (66 marks)

Answer ANY THREE questions from Q11, Q12, Q13 and Q14 in the answer booklet provided.

Question 11: Decision Tree Algorithm and Classification (22 marks)

a) Describe the 5 major steps and their purposes in knowledge discovery from databases.
(5 marks)

Consider the following data set:

Play-Tennis Example

  Outlook    Temperature  Wind    Humidity  Play
  Sunny      Hot          Weak    High      No
  Sunny      Hot          Strong  High      No
  Overcast   Hot          Weak    High      Yes
  Rain       Mild         Weak    High      Yes
  Rain       Cool         Weak    Normal    Yes
  Rain       Cool         Strong  Normal    No
  Overcast   Cool         Strong  Normal    Yes
  Sunny      Mild         Weak    High      No
  Sunny      Cool         Weak    Normal    Yes
  Rain       Mild         Weak    Normal    Yes
  Sunny      Mild         Strong  Normal    Yes
  Overcast   Mild         Strong  High      Yes
  Overcast   Hot          Weak    Normal    Yes
  Rain       Mild         Strong  High      No

b) Given the above training data, construct a decision tree. Use the information gain method for attribute selection. Show your steps and calculations in detail. (12 marks)

c) Using the following test data set, evaluate the decision tree constructed in (b). Create a confusion matrix for the model. Show your steps and calculations. (3 marks)

  Outlook    Temperature  Wind    Humidity  Play
  Overcast   Hot          Weak    Normal    Yes
  Sunny      Mild         Strong  Normal    Yes
  Rain       Mild         Strong  High      No
  Sunny      Hot          Strong  High      No
  Overcast   Hot          Weak    Normal    No
  Rain       Mild         Strong  High      Yes

d) Calculate the classification accuracy and error rate. (2 marks)

Question 12: Cluster Analysis (22 marks)

a) Distinguish between agglomerative clustering and divisive clustering. (2 marks)

Consider the following six objects described by two attributes X1 and X2.

        A     B     C     D     E     F
  X1    1     1.5   5     3     4     3
  X2    1     1.5   5     4     4     3.5

b) The Euclidean distance matrix is given below. Complete the missing values in the matrix. (6 marks)

  Dist   A      B      C      D      E      F
  A      0      0.71   5.66   3.61   4.24   3.20
  B      0.71   0      4.95   2.92   3.54   2.50
  C      5.66   4.95   0      2.24   1.41   2.50
  D      3.61   2.92   2.24   0      ____   ____
  E      4.24   3.54   1.41   ____   0      ____
  F      3.20   2.50   2.50   ____   ____   0

c) Cluster the above objects using the agglomerative algorithm with the Complete-Link cluster distance. Show the updated distance matrix at each iteration. (10 marks)

d) Provide a dendrogram depicting the final results.
Briefly explain your dendrogram. (4 marks)

Question 13: Probabilistic and Nearest Neighbor (22 marks)

Consider the following training data (Buys_Computer):

  Age      Income   student  credit_rating  Buys_Computer
  <=30     High     No       Good           No
  31..40   Medium   Yes      Fair           No
  >40      Medium   No       Good           Yes
  >40      Low      No       Excellent      No
  31..40   High     No       Excellent      Yes
  <=30     Medium   Yes      Fair           No
  <=30     Low      Yes      Fair           Yes
  31..40   Medium   Yes      Good           Yes
  >40      Medium   Yes      Excellent      Yes
  31..40   Medium   No       Good           Yes
  31..40   High     No       Good           Yes
  31..40   High     No       Fair           No

a) Predict the class of the following new examples using K-Nearest Neighbour for K = 5: (10 marks)

  Age = 29, income = medium, student = yes, credit_rating = excellent
  Age = 32, income = high, student = no, credit_rating = good

For the similarity measure, use a simple match of attribute values as described below:

$$\mathrm{Similarity}(A, B) = \frac{\sum_i w_i \, \delta(a_i, b_i)}{\sum_i w_i}$$

where $\delta(a_i, b_i)$ is 1 if $a_i$ equals $b_i$ and 0 otherwise. $a_i$ and $b_i$ are either age, income, student or credit_rating. The weights $w_i$ are all 1, except for income, whose weight is 2.

b) Using the data set given in (a), predict the class for the following examples using Naïve Bayes classification. Show your calculations in detail. (10 marks)

  Age = 29, income = medium, student = yes, credit_rating = excellent
  Age = 32, income = high, student = no, credit_rating = good

c) Compare the results obtained in (a) and (b). Comment on the results. (2 marks)

Question 14: Web Mining, Big Data and Neural Networks (22 marks)

Web Data Mining (8 marks)

Web mining techniques are categorized into three areas:
- Web Content Mining
- Web Structure Mining
- Web Usage Mining

(a) Describe the objectives and techniques of each of the above web mining areas. (6 marks)
(b) Describe at least two main issues in web mining. (2 marks)

Big Data Analytics (8 marks)

(c) What do you know about the term “Big Data”? (1 mark)
(d) What are the five V’s of Big Data?
Briefly describe. (2 marks)
(e) Differentiate between structured data and unstructured data. Give examples. (2 marks)
(f) What are the steps to deploy a Big Data solution? (2 marks)
(g) What is the role of Hadoop in Big Data? (1 mark)

Neural Networks (6 marks)

Suppose we want to classify potential bank customers as good creditors or bad creditors for loan applications. We have a training dataset describing past customers using the following attributes:

  Marital Status {married, single, divorced}
  Gender {female, male}
  Age {[18-30], [30-50], [50-65], [65+]}
  Income {[10K-25K], [25K-50K], [50K-65K], [65K-100K], [100K+]}

(h) Design a 3-layered neural network that could be trained to predict the credit rating of an applicant. Show the architecture of the network, including the layers and neurons. Explain your decisions. (6 marks)

END OF SECTION B