IS328 Data Mining
Faculty of Science, Technology and Environment
School of Computing, Information and Mathematical Sciences
Final Examination
Semester 2, 2018
Mode: F2F and Online
Question Paper
Duration of Exam: 3 Hours + 10 Minutes
Reading Time: 10 minutes
Writing Time: 3 Hours
Total mark: 100
Instructions:
• This exam has two sections – A and B.
  Section A – 10 Short Answer Questions (34 Marks)
  Section B – 3 Long Answer Questions (66 Marks)
• Answer ALL questions in Section A and only THREE questions in Section B.
• Answer both Sections A and B in the answer booklet provided.
• You may use non-programmable electronic calculators.
• No other electronic devices are allowed in the exam.
• This exam is worth 50% of your overall mark.
• This booklet has 8 pages including the cover page.
IS328 Final Examination Semester 2, 2018 Page 1 of 8
Section A – Short Answer Questions (34 marks)
Q1) [3 marks] Briefly explain the following data types and give an example of each data type.
• Interval Data
• Ratio Data
• Nominal Data
Q2) [4 marks] Consider the following six objects described by two attributes X1 and X2.
      A     B     C     D     E     F
X1    1     1.5   5     3     4     3
X2    1     1.5   5     4     4     3.5
Assume there are two clusters as follows:
C1 = {A, B}
C2 = {C, D, E, F}
Use Manhattan distances to calculate the following:
(i) The complete link distance between C1 and C2.
(ii) The centroid link distance between C1 and C2.
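A minimal Python sketch of how the two cluster distances in Q2 could be computed. It assumes the usual definitions: complete link is the largest pairwise distance between members of the two clusters, and centroid link is the distance between the cluster means. The function and variable names are illustrative only.

```python
# Coordinates copied from the table in Q2.
points = {"A": (1, 1), "B": (1.5, 1.5), "C": (5, 5),
          "D": (3, 4), "E": (4, 4), "F": (3, 3.5)}
C1 = ["A", "B"]
C2 = ["C", "D", "E", "F"]


def manhattan(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def centroid(names):
    """Component-wise mean of the named points."""
    coords = [points[n] for n in names]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))


# Complete link: the largest distance over all cross-cluster pairs.
complete_link = max(manhattan(points[a], points[b]) for a in C1 for b in C2)

# Centroid link: the distance between the two cluster centroids.
centroid_link = manhattan(centroid(C1), centroid(C2))

print("complete link:", complete_link)
print("centroid link:", centroid_link)
```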
Q3) [3 marks] Assume that some medical test results are described using seven symmetric binary
attributes as shown below:
         A1   A2   A3   A4   A5   A6   A7
Mary     1    0    1    0    1    1    0
Jack     1    1    1    0    0    1    0
John     1    1    0    1    0    0    1
David    1    0    1    0    1    0    1
Given a new patient Lisa as below, find the 3 nearest neighbours of Lisa.
Show your calculations.
         A1   A2   A3   A4   A5   A6   A7
Lisa     0    1    0    1    1    1    0
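A minimal Python sketch of a nearest-neighbour ranking for Q3. The paper does not name a distance measure, so this assumes the simple matching distance (the fraction of attributes on which two records disagree), a common choice for symmetric binary attributes. All names are illustrative.

```python
# Attribute vectors (A1..A7) copied from the tables in Q3.
patients = {
    "Mary":  [1, 0, 1, 0, 1, 1, 0],
    "Jack":  [1, 1, 1, 0, 0, 1, 0],
    "John":  [1, 1, 0, 1, 0, 0, 1],
    "David": [1, 0, 1, 0, 1, 0, 1],
}
lisa = [0, 1, 0, 1, 1, 1, 0]


def simple_matching_distance(x, y):
    """Fraction of attributes on which two binary vectors disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


distances = {name: simple_matching_distance(values, lisa)
             for name, values in patients.items()}

# Rank patients from closest to farthest and keep the first three.
nearest = sorted(distances, key=distances.get)[:3]

for name in sorted(distances, key=distances.get):
    print(name, round(distances[name], 3))
```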
Q4) [3 marks] Consider the following transaction database.
Transaction-id   Items
1                A,B,C,E
2                A,B,D,E
3                B,C,D,E
4                B,D,E
5                A,B,D
6                B,E,C
7                B,A,E
8                C,B,E
(i) What is the support in percentage of the itemset {B,C,E}?
(ii) What is the maximum support for a 3-item set?
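A minimal Python sketch of the support computation in Q4, taking support to be the fraction of transactions that contain every item of the itemset. The helper name `support` is illustrative.

```python
from itertools import combinations

# Transactions 1..8 copied from the Q4 table.
transactions = [set("ABCE"), set("ABDE"), set("BCDE"), set("BDE"),
                set("ABD"), set("BEC"), set("BAE"), set("CBE")]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


support_bce = support("BCE")

# Try every 3-item combination of the items that occur in the database.
all_items = sorted(set().union(*transactions))
max_support_3 = max(support(c) for c in combinations(all_items, 3))

print(f"support(BCE) = {support_bce:.0%}")
print(f"max 3-itemset support = {max_support_3:.0%}")
```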
Q5) [4 marks] Refer to the transaction database given in Q4.
(i) Which of the following rules has the highest confidence?
R1: A → BD
R2: B → DE
R3: C → BE
R4: D → AE
R5: E → AB
(ii) Sort the above rules in descending order according to their lift values.
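A minimal Python sketch of the confidence and lift calculations for the five rules. It assumes the standard definitions confidence(X → Y) = support(X ∪ Y) / support(X) and lift(X → Y) = confidence(X → Y) / support(Y). The function names are illustrative.

```python
# The eight transactions of the market-basket database.
transactions = [set("ABCE"), set("ABDE"), set("BCDE"), set("BDE"),
                set("ABD"), set("BEC"), set("BAE"), set("CBE")]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs):
    """support(lhs and rhs together) divided by support(lhs)."""
    return support(lhs + rhs) / support(lhs)


def lift(lhs, rhs):
    """Confidence of the rule divided by the support of its consequent."""
    return confidence(lhs, rhs) / support(rhs)


rules = {"R1": ("A", "BD"), "R2": ("B", "DE"), "R3": ("C", "BE"),
         "R4": ("D", "AE"), "R5": ("E", "AB")}

by_confidence = sorted(rules, key=lambda r: confidence(*rules[r]), reverse=True)
by_lift = sorted(rules, key=lambda r: lift(*rules[r]), reverse=True)

for name, (lhs, rhs) in rules.items():
    print(name, round(confidence(lhs, rhs), 3), round(lift(lhs, rhs), 3))
```

Rules with equal lift keep their original relative order, because Python's sort is stable.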
Q6) [3 marks] Assume that we have a dataset containing information about 200 individuals.
One hundred of these individuals have purchased life insurance.
A supervised data mining session has discovered the following rule:
IF age < 30 & credit card insurance = yes
THEN life insurance = yes
Rule Accuracy: 70%
Rule Coverage: 63%
How many individuals who have credit card insurance and are less than 30 years old
will be in the class life insurance = no? Justify your answer.
Q7) [3 marks] Consider the following transaction database.
T1   ABC
T2   BCD
T3   ABCD
T4   ABD
T5   ABCD
a) Complete the following table and highlight all the frequent sets.
Assume that minimum support is 60%.
ItemSet   Support      ItemSet   Support      ItemSet   Support
A                      AB                     ABC
B                      AC                     ABD
C                      AD                     ACD
D                      BC                     BCD
                       BD                     ABCD
                       CD
b) How many rules can be generated from the frequent sets? Justify your answer.
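A minimal Python sketch of how the support table in Q7 could be filled in by brute force. It enumerates every non-empty itemset over {A, B, C, D} and keeps those that meet the 60% threshold. The names `supports` and `frequent` are illustrative.

```python
from itertools import combinations

# Transactions T1..T5 copied from Q7.
transactions = {"T1": set("ABC"), "T2": set("BCD"), "T3": set("ABCD"),
                "T4": set("ABD"), "T5": set("ABCD")}
MIN_SUPPORT = 0.6


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions.values()) / len(transactions)


# Support of every itemset of size 1 to 4, keyed by a string such as "ABD".
supports = {"".join(c): support(c)
            for k in range(1, 5) for c in combinations("ABCD", k)}

# Frequent itemsets are those whose support reaches the threshold.
frequent = {s: v for s, v in supports.items() if v >= MIN_SUPPORT}

for itemset, value in supports.items():
    flag = "frequent" if itemset in frequent else ""
    print(f"{itemset:5s} {value:.0%} {flag}")
```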
Q8) [4 marks] From the table completed in Q7, determine the following:
(i) All closed frequent sets
(ii) All maximal frequent sets
Justify your answers.
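A minimal Python sketch of the closed and maximal checks, under the usual definitions: a frequent itemset is closed if no proper superset has the same support count, and maximal if no proper superset is frequent. The transactions and the 60% threshold (3 of 5 transactions) are those of Q7. All names are illustrative.

```python
from itertools import combinations

transactions = [set("ABC"), set("BCD"), set("ABCD"), set("ABD"), set("ABCD")]
MIN_COUNT = 3  # 60% of 5 transactions

# Support count of every non-empty itemset over {A, B, C, D}.
counts = {frozenset(c): sum(set(c) <= t for t in transactions)
          for k in range(1, 5) for c in combinations("ABCD", k)}
frequent = {s: n for s, n in counts.items() if n >= MIN_COUNT}


def proper_supersets(itemset):
    """All enumerated itemsets that strictly contain `itemset`."""
    return [t for t in counts if itemset < t]


# Closed: every proper superset has a strictly smaller support count.
closed = [s for s, n in frequent.items()
          if all(counts[t] < n for t in proper_supersets(s))]

# Maximal: no proper superset is itself frequent.
maximal = [s for s in frequent
           if not any(t in frequent for t in proper_supersets(s))]

print("closed :", sorted("".join(sorted(s)) for s in closed))
print("maximal:", sorted("".join(sorted(s)) for s in maximal))
```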
Q9) [4 marks] Compare the pros and cons of decision tree and neural network classification methods.
Provide at least two pros and two cons of each.
Q10) [3 marks] Calculate the information gain for the split shown below.
Show your calculations.
Section B – Long Answer Questions (66 marks)
Answer ANY THREE Questions from Q11, Q12, Q13 and Q14 in the answer booklet provided.
Question 11 Decision Tree Algorithm and Classification (22 marks)
a) Describe the 5 major steps and their purposes in knowledge discovery from databases. (5 marks)
Consider the following data set:
Play-Tennis Example
Outlook    Temperature   Wind     Humidity   Play
Sunny      Hot           Weak     High       No
Sunny      Hot           Strong   High       No
Overcast   Hot           Weak     High       Yes
Rain       Mild          Weak     High       Yes
Rain       Cool          Weak     Normal     Yes
Rain       Cool          Strong   Normal     No
Overcast   Cool          Strong   Normal     Yes
Sunny      Mild          Weak     High       No
Sunny      Cool          Weak     Normal     Yes
Rain       Mild          Weak     Normal     Yes
Sunny      Mild          Strong   Normal     Yes
Overcast   Mild          Strong   High       Yes
Overcast   Hot           Weak     Normal     Yes
Rain       Mild          Strong   High       No
b) Given the above training data, construct a decision tree. Use the information gain method for
attribute selection. Show your steps and calculations in detail. (12 marks)
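A minimal Python sketch of the entropy and information-gain calculation used to choose the root attribute for the Play-Tennis data. It covers only the first split, not the whole tree, and assumes entropy is measured in bits (log base 2). The function names are illustrative.

```python
from collections import Counter
from math import log2

COLUMNS = ["Outlook", "Temperature", "Wind", "Humidity", "Play"]
ROWS = """Sunny Hot Weak High No
Sunny Hot Strong High No
Overcast Hot Weak High Yes
Rain Mild Weak High Yes
Rain Cool Weak Normal Yes
Rain Cool Strong Normal No
Overcast Cool Strong Normal Yes
Sunny Mild Weak High No
Sunny Cool Weak Normal Yes
Rain Mild Weak Normal Yes
Sunny Mild Strong Normal Yes
Overcast Mild Strong High Yes
Overcast Hot Weak Normal Yes
Rain Mild Strong High No""".splitlines()
data = [dict(zip(COLUMNS, line.split())) for line in ROWS]


def entropy(rows):
    """Entropy (in bits) of the Play label over the given rows."""
    counts = Counter(r["Play"] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def information_gain(rows, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gains = {a: information_gain(data, a) for a in COLUMNS[:-1]}
root = max(gains, key=gains.get)

for attribute, gain in gains.items():
    print(f"{attribute:12s} {gain:.4f}")
print("root attribute:", root)
```

The same two functions can be applied again to each subset of rows to choose the attributes further down the tree.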
c) Using the following test data set, evaluate the decision tree constructed in (b).
Create a confusion matrix for the model. Show your steps and calculations. (3 marks)
Outlook    Temperature   Wind     Humidity   Play
Overcast   Hot           Weak     Normal     Yes
Sunny      Mild          Strong   Normal     Yes
Rain       Mild          Strong   High       No
Sunny      Hot           Strong   High       No
Overcast   Hot           Weak     Normal     No
Rain       Mild          Strong   High       Yes
d) Calculate the classification accuracy and error rate. (2 marks)
Question 12: Cluster Analysis (22 marks)
a) Distinguish between agglomerative clustering and divisive clustering. (2 marks)
Consider the following six objects described by two attributes X1 and X2.
      A     B     C     D     E     F
X1    1     1.5   5     3     4     3
X2    1     1.5   5     4     4     3.5
b) The Euclidean distance matrix is given below. Complete the missing values in the matrix. (6 marks)
Dist   A      B      C      D      E      F
A      0      0.71   5.66   3.61   4.24   3.20
B      0.71   0      4.95   2.92   3.54   2.50
C      5.66   4.95   0      2.24   1.41   2.50
D      3.61   2.92   2.24   0      __     __
E      4.24   3.54   1.41   __     0      __
F      3.20   2.50   2.50   __     __     0
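A minimal Python sketch of how a Euclidean distance matrix for the six objects could be computed, rounded to two decimals as in the table. It uses `math.dist` from the standard library (Python 3.8 or later). The name `matrix` is illustrative.

```python
from math import dist

# Coordinates (X1, X2) of the six objects.
points = {"A": (1, 1), "B": (1.5, 1.5), "C": (5, 5),
          "D": (3, 4), "E": (4, 4), "F": (3, 3.5)}

# Symmetric matrix of pairwise Euclidean distances, rounded to 2 decimals.
matrix = {a: {b: round(dist(points[a], points[b]), 2) for b in points}
          for a in points}

print("     " + "".join(f"{name:>6s}" for name in points))
for a in points:
    print(f"{a:5s}" + "".join(f"{matrix[a][b]:6.2f}" for b in points))
```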
c) Cluster the above objects using the agglomerative algorithm with the Complete-Link cluster
distance.
Show the updated distance matrix at each iteration. (10 marks)
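A minimal Python sketch of agglomerative clustering with complete-link distance on the six objects. It assumes Euclidean distance and records only the merge order and merge heights. It does not print the intermediate distance matrices that the question asks for. All names are illustrative.

```python
from math import dist

points = {"A": (1, 1), "B": (1.5, 1.5), "C": (5, 5),
          "D": (3, 4), "E": (4, 4), "F": (3, 3.5)}


def complete_link(c1, c2):
    """Largest Euclidean distance between a member of c1 and a member of c2."""
    return max(dist(points[a], points[b]) for a in c1 for b in c2)


# Start with every object in its own cluster.
clusters = [frozenset(name) for name in points]
merges = []

while len(clusters) > 1:
    # Evaluate every pair of current clusters and pick the closest pair.
    pairs = [(complete_link(x, y), x, y)
             for i, x in enumerate(clusters) for y in clusters[i + 1:]]
    height, x, y = min(pairs, key=lambda p: p[0])
    # Replace the two merged clusters with their union.
    clusters = [c for c in clusters if c not in (x, y)] + [x | y]
    merges.append(("".join(sorted(x | y)), round(height, 2)))

for members, height in merges:
    print(f"merge -> {members} at distance {height}")
```

The recorded merge heights are the values that would label the vertical axis of a dendrogram.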
d) Provide a dendrogram depicting the final results. Briefly explain your dendrogram. (4 marks)
Question 13: Probabilistic and Nearest Neighbor (22 marks)
Consider the following training data (Buys_Computer):
Age      Income   student   credit_rating   Buys_Computer
<=30     High     No        Good            No
31..40   Medium   Yes       Fair            No
>40      Medium   No        Good            Yes
>40      Low      No        Excellent       No
31..40   High     No        Excellent       Yes
<=30     Medium   Yes       Fair            No
<=30     Low      Yes       Fair            Yes
31..40   Medium   Yes       Good            Yes
>40      Medium   Yes       Excellent       Yes
31..40   Medium   No        Good            Yes
31..40   High     No        Good            Yes
31..40   High     No        Fair            No
a) Predict the class of the following new examples using K-Nearest Neighbour for K=5: (10 marks)
Age = 29, income=medium, student=yes, credit_rating=excellent
Age = 32, income=high, student=no, credit_rating=good
For similarity measure use a simple match of attribute values as described below:
$$\mathrm{Similarity}(A, B) = \frac{\sum_i w_i \, \delta(a_i, b_i)}{\sum_i w_i}$$
where δ(ai, bi) is 1 if ai equals bi and 0 otherwise.
ai and bi are either age, income, student or credit_rating. The weights (wi) are all 1, except for
income, whose weight is 2.
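A minimal Python sketch of the weighted simple-match similarity and a K=5 majority vote. Two assumptions go beyond the paper: the numeric ages 29 and 32 are mapped to the bins "<=30" and "31..40", and attribute values are compared in lower case. Ties in similarity are broken by table order. All names are illustrative.

```python
from collections import Counter

ATTRS = ["age", "income", "student", "credit_rating"]
WEIGHTS = {"age": 1, "income": 2, "student": 1, "credit_rating": 1}

# (age, income, student, credit_rating, buys_computer), in table order.
TRAIN = [
    ("<=30", "high", "no", "good", "no"),
    ("31..40", "medium", "yes", "fair", "no"),
    (">40", "medium", "no", "good", "yes"),
    (">40", "low", "no", "excellent", "no"),
    ("31..40", "high", "no", "excellent", "yes"),
    ("<=30", "medium", "yes", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    ("31..40", "medium", "yes", "good", "yes"),
    (">40", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "good", "yes"),
    ("31..40", "high", "no", "good", "yes"),
    ("31..40", "high", "no", "fair", "no"),
]


def similarity(a, b):
    """Sum of the weights of matching attributes divided by the total weight."""
    matched = sum(WEIGHTS[name] for name, x, y in zip(ATTRS, a, b) if x == y)
    return matched / sum(WEIGHTS.values())


def knn_predict(query, k=5):
    """Majority class among the k most similar training rows."""
    ranked = sorted(TRAIN, key=lambda row: similarity(query, row[:4]), reverse=True)
    votes = Counter(row[4] for row in ranked[:k])
    return votes.most_common(1)[0][0]


query1 = ("<=30", "medium", "yes", "excellent")  # Age = 29, assumed bin "<=30"
query2 = ("31..40", "high", "no", "good")        # Age = 32, assumed bin "31..40"

print(knn_predict(query1), knn_predict(query2))
```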
b) Using the data set given in (a), predict the class for the following examples using Naïve
Bayes classification.
Show your calculations in detail. (10 marks)
Age =29, income=medium, student=yes, credit_rating=excellent
Age = 32, income=high, student=no, credit_rating=good
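A minimal Python sketch of a Naïve Bayes score for each class, computed as the class prior times the product of the per-attribute conditional relative frequencies. It assumes no Laplace smoothing and the same age binning as before (29 in "<=30", 32 in "31..40"). Neither assumption is stated in the paper. All names are illustrative.

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer), in table order.
TRAIN = [
    ("<=30", "high", "no", "good", "no"),
    ("31..40", "medium", "yes", "fair", "no"),
    (">40", "medium", "no", "good", "yes"),
    (">40", "low", "no", "excellent", "no"),
    ("31..40", "high", "no", "excellent", "yes"),
    ("<=30", "medium", "yes", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    ("31..40", "medium", "yes", "good", "yes"),
    (">40", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "good", "yes"),
    ("31..40", "high", "no", "good", "yes"),
    ("31..40", "high", "no", "fair", "no"),
]


def naive_bayes_scores(query):
    """Unnormalised score P(class) * product of P(attribute value | class)."""
    class_counts = Counter(row[-1] for row in TRAIN)
    scores = {}
    for label, n in class_counts.items():
        score = n / len(TRAIN)  # class prior
        for i, value in enumerate(query):
            matches = sum(1 for row in TRAIN
                          if row[-1] == label and row[i] == value)
            score *= matches / n  # relative frequency within the class
        scores[label] = score
    return scores


query1 = ("<=30", "medium", "yes", "excellent")  # Age = 29, assumed bin "<=30"
query2 = ("31..40", "high", "no", "good")        # Age = 32, assumed bin "31..40"

for query in (query1, query2):
    scores = naive_bayes_scores(query)
    print(query, scores, "->", max(scores, key=scores.get))
```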
c) Compare the results obtained in (a) and (b). Comment on the results. (2 marks)
Question 14: Web Mining, Big Data and Neural Networks (22 marks)
Web Data Mining – 8 marks
Web Mining techniques are categorized into three areas. They are:
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
(a) Describe the objectives and techniques of each of the above web mining areas. (6 marks)
(b) Describe at least two main issues in web mining. (2 marks)
Big Data Analytics – 8 marks
(c) What do you know about the term “Big Data”? (1 mark)
(d) What are the five V’s of Big Data? Briefly describe. (2 marks)
(e) Differentiate between structured data and unstructured data. Give examples. (2 marks)
(f) What are the steps to deploy a Big Data solution? (2 marks)
(g) What is the role of Hadoop in Big Data? (1 mark)
Neural Networks – 6 marks
Suppose we want to classify potential bank customers as good creditors or bad creditors for
loan applications.
We have a training dataset describing past customers using the following attributes:
Marital Status {married, single, divorced}
Gender {female, male}
Age {[18-30], [30-50], [50-65], [65+]}
Income {[10K-25K], [25K-50K], [50K-65K], [65K-100K], [100K+]}
(h) Design a 3-layered neural network that could be trained to predict the credit rating of
an applicant. Show the architecture of the network showing the layers and neurons.
Explain your decisions. (8 marks)
END OF SECTION B