Final exam review - University of Utah

ACCTG 6901, University of Utah
Final Exam Review, Spring 2003
Types of questions
- Concepts: short answers, comparisons, definitions, descriptions,
  interpretations and examples
- Methods: problem solving, analyses, applications and selections of data
  mining tasks/methods, and attributes
Topics
Data mining basics and the KDD process
- Why data mining? Different views of data mining. Related terms and
  disciplines. Definitions.
- Readings: slides and Ch. 1.1 and Ch. 1.2 in T2

Association rule discovery
- Objectives
- Fundamentals: association rules, item sets, support, confidence, and lift
- Apriori property and algorithm
- Applications
- Types of association rules

Sequential patterns mining
- Objectives
- Fundamentals: Sequence, scope within which sequences are mined, large
  sequence, relationships amongst sequences, support, distance measure and
  calculations, and comparative criteria
- Apriori property and algorithm
- Applications

Clustering
- Objectives
- Fundamentals: Cluster, distance measure and calculations, and comparative
  criteria
- Bottom-up hierarchical algorithm
- Applications and comparative criteria

Classification and prediction
- Objectives
- Fundamentals: classification, different views of prediction, data required, and
  comparative criteria
- Decision tree induction, measure of diversity and pruning
- Neural network classifier, components, steps and tuning parameters
- Applications and comparative criteria
Sample Exam Questions
Question 1 Association rules and sequential patterns
The following is an example of a customer purchase transaction data set.
CID  TID  Date        Items Purchased
1    1    01/01/2001  10,20
1    2    01/02/2001  10,30,50,70
1    3    01/03/2001  10,20,30,40
2    4    01/03/2001  20,30
2    5    01/04/2001  20,40,70
3    6    01/04/2001  10,30,60,70
3    7    01/05/2001  10,50,70
4    8    01/05/2001  10,20,30
4    9    01/06/2001  20,40,60
5    10   01/11/2001  10,20,30,60
Note: CID = Customer ID and TID = Transaction ID
a) Calculate the support, confidence and lift of the following association rule. Indicate if
the items in the association rule are independent of each other or have negative or
positive impacts on each other. (8 points)
{10} -> {50,70}
b) The following is the list of large two item sets. Show the steps to apply the Apriori
property to generate and prune the candidates for large three item sets. Describe how
the Apriori property is used in the steps. Give the final list of candidate large three
item sets. (10 points)
{10,20} {10,30} {20,30} {20,40}
c) Does customer 1 support the sequence <{20} {50,70} {10}>? Justify your answer. (5
points)
d) Calculate the support of <{10}, {30}>. (4 points)
e) Based on the types of association rules discussed in class, identify which type(s) of
rule {10} -> {50,70} is. (3 points)
Question 2. Clustering and Classification
The following is the star schema for the Sales Department of your company.
CUSTOMER (dimension): Name, Date of Birth, Annual Income, City, State
PRODUCT (dimension): Name, Type, Introduction Date
DATE (dimension): Day of Year, Year
SALESFACT (fact): TransactionID, Quantity, Amount
Note:
1. TransactionID is used as the primary key in the fact table because there might be more
than one transaction for each customer and product in a given day.
2. The Introduction Date for a product is the date when it is first introduced into the
market.
a) The clustering task was selected to identify customer segmentation. Suggest the
attributes including derived attributes to be used in the clustering task and justify
your answer. (10 points)
b) Recommend a standardization or normalization method for the attributes in a
distance function. (10 points)
c) You are asked to recommend a classification/prediction task to be performed on
the above data set.
i. Specify the input and class label attributes you choose for this
classification/prediction task. Give an example of business decision(s)
that can benefit from the classification/prediction results using the
input and class label attributes of your choice. (10 points)
ii. Define and give an example of noise using the data set above. (5 points)
iii. Assume that you will use a decision tree classifier. Specify and compare
the different tree pruning approaches. (10 points)
iv. Suppose you are using a neural network instead of a decision tree. List at
least three possible parameters you want to tune to improve its
performance during the training period. (5 points)
Question 3: Selecting data mining tasks
The task attributes of the four data mining tasks discussed in class are briefly described
below:
Association rule and sequential pattern mining - Customer ID, Transaction ID and Item.
Classification/prediction - input and the class label attributes
Clustering - input attributes
The following are the data fields in the data mining server log:
User ID, Session ID, Dataset ID, MiningTask ID, Parameter Value, Accuracy
a) Which task will you perform to identify the data mining tasks that tend to be
performed in the same session? Describe the attributes you choose and how they are
mapped to the data mining task attributes listed above. (6 points)
b) Which task will you perform to identify the sequence of data mining tasks that users
tend to perform on the same data set over time? Describe the attributes you choose
and how they are mapped to the data mining task attributes listed above. (6 points)
c) Which task will you perform to determine if the Parameter Value level (low, medium
or high) and the level of Parameter Value adjustment (small, moderate or large) tend
to have a positive or negative impact on Accuracy? Describe the attributes you choose
and how they are mapped to the data mining task attributes listed above. (8 points)
Answers to Sample Exam Questions
Question 1:
a)
Support = Support({10,50,70}) = 2/10 = 20%
Confidence = Support({10,50,70}) / Support({10}) = 0.2/0.7 = 2/7 ≈ 29%
Lift = Confidence / Support({50,70}) = (2/7)/0.2 = 10/7 ≈ 1.43 > 1
Since the lift is greater than 1, the items are not independent: {10} has a positive
impact on {50,70}.
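The arithmetic in a) can be checked with a short Python sketch. The transaction list is copied from the Question 1 data set; the helper name `support` is only illustrative, and exact fractions are used so the results match the hand calculation.

```python
from fractions import Fraction

# Items purchased in TID 1..10, copied from the Question 1 data set.
transactions = [
    {10, 20}, {10, 30, 50, 70}, {10, 20, 30, 40}, {20, 30}, {20, 40, 70},
    {10, 30, 60, 70}, {10, 50, 70}, {10, 20, 30}, {20, 40, 60}, {10, 20, 30, 60},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

antecedent, consequent = {10}, {50, 70}
rule_support = support(antecedent | consequent)   # Support({10,50,70})
confidence = rule_support / support(antecedent)   # divided by Support({10})
lift = confidence / support(consequent)           # divided by Support({50,70})

print(rule_support, confidence, lift)  # prints: 1/5 2/7 10/7
```

A lift of exactly 1 would indicate independence, and a lift below 1 a negative impact.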
b)
{10,20} {10,30} {20,30} {20,40}
Note: describe how the Apriori property is used to decide which two large two item
sets are joined together and to determine which three item sets should be pruned.
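Join: {10,20,30} {20,30,40}
Prune: {20,30,40} is pruned because its subset {30,40} is not a large two item set.
{10,20,30} is kept because {10,20}, {10,30} and {20,30} are all large.
Final list: {10,20,30}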
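The join and prune steps can be sketched in Python as follows. This is a minimal illustration of Apriori candidate generation, not a full Apriori implementation. The function names `join` and `prune` are illustrative, and item sets are stored as sorted tuples so that two sets are joined only when they share their first item.

```python
from itertools import combinations

# Large two item sets given in part b), stored as sorted tuples.
large_2 = [(10, 20), (10, 30), (20, 30), (20, 40)]

def join(large_k):
    """Join pairs of k-item sets that agree on their first k-1 items."""
    candidates = []
    for a, b in combinations(sorted(large_k), 2):
        if a[:-1] == b[:-1]:
            candidates.append(a + (b[-1],))
    return candidates

def prune(candidates, large_k):
    """Apriori property: keep a candidate only if all its k-item subsets are large."""
    large = set(large_k)
    k = len(large_k[0])
    return [c for c in candidates
            if all(subset in large for subset in combinations(c, k))]

joined = join(large_2)
final = prune(joined, large_2)
print(joined)  # prints: [(10, 20, 30), (20, 30, 40)]
print(final)   # prints: [(10, 20, 30)]
```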
c)
The sequence of customer 1 is:
<{10,20} {10,30,50,70} {10,20,30,40}>
Since {20} ⊆ {10,20}, {50,70} ⊆ {10,30,50,70}, and {10} ⊆ {10,20,30,40},
<{20} {50,70} {10}> is contained in the sequence of customer 1. Therefore, customer 1
supports sequence <{20} {50,70} {10}>.
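The containment test can be written as a small Python sketch. The function name `contains` is illustrative. It assumes the customer's transactions are listed in date order and that each element of the pattern must match a different, later transaction than the element before it.

```python
def contains(customer_sequence, pattern):
    """True if each pattern element is a subset of a transaction, in order."""
    position = 0
    for element in pattern:
        # Skip forward to the next transaction that includes this element.
        while (position < len(customer_sequence)
               and not element <= customer_sequence[position]):
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1  # the next element must match a later transaction
    return True

customer_1 = [{10, 20}, {10, 30, 50, 70}, {10, 20, 30, 40}]
print(contains(customer_1, [{20}, {50, 70}, {10}]))  # prints: True
```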
d)
Only customer 1 supports the sequence <{10} {30}> and there are 5 customers,
therefore,
Support = 1/5 = 20%
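The count can be verified with a short Python sketch. The customer sequences are copied from the Question 1 data set in date order; the names `sequences` and `supports` are illustrative. The shared iterator ensures that each element of the pattern is matched against transactions that come after the previous match.

```python
from fractions import Fraction

# Each customer's transactions in date order, from the Question 1 data set.
sequences = {
    1: [{10, 20}, {10, 30, 50, 70}, {10, 20, 30, 40}],
    2: [{20, 30}, {20, 40, 70}],
    3: [{10, 30, 60, 70}, {10, 50, 70}],
    4: [{10, 20, 30}, {20, 40, 60}],
    5: [{10, 20, 30, 60}],
}

def supports(sequence, pattern):
    """True if the customer sequence contains the pattern in order."""
    remaining = iter(sequence)  # consumed left to right across pattern elements
    return all(any(element <= transaction for transaction in remaining)
               for element in pattern)

pattern = [{10}, {30}]
supporters = [cid for cid, seq in sequences.items() if supports(seq, pattern)]
support = Fraction(len(supporters), len(sequences))
print(supporters, support)  # prints: [1] 1/5
```

Customer 5 buys both 10 and 30, but in the same transaction, so the sequence <{10} {30}> is not supported.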
e)
The association rule {10} -> {50,70} is a single-level, single-dimensional and Boolean
association rule.
Question 2.
a) Customer age, customer’s annual income, # of days since a customer’s last
purchase, the total number of a customer’s sales transactions, and the total amount
of a customer’s purchases. I am interested in groups of similar customers based on
customer age, income and life-time value using the last three derived attributes.
b) For each attribute, calculate the mean value and the mean absolute deviation.
Calculate the standardized value = (original value – mean value)/mean absolute
deviation.
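As a minimal sketch of this standardization in Python: the function name `standardize` and the income values are hypothetical and serve only to illustrate the formula. An attribute whose values are all equal has a mean absolute deviation of zero and would need separate handling.

```python
def standardize(values):
    """Standardize values using the mean and the mean absolute deviation."""
    mean = sum(values) / len(values)
    mean_abs_dev = sum(abs(v - mean) for v in values) / len(values)
    # Assumes mean_abs_dev is not zero (the attribute is not constant).
    return [(v - mean) / mean_abs_dev for v in values]

annual_incomes = [30000, 50000, 70000]  # hypothetical values, not from the data set
z_scores = standardize(annual_incomes)
print(z_scores)
```

After standardization, attributes measured on different scales (for example, age in years and income in dollars) contribute comparably to the distance function.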
c) i) Input: Customer city, State, age, Income. Output: Product Type
Business Analysis: These choices for input and output attributes enable us to
understand the impact of customer demographics on product type preference. In
marketing, this is called “customer segmentation.”
ii) Noise refers to records with the same input attribute values but different
class labels. For example, in the customer table, the same customer name
with a different city and state may be noise: is it because of erroneous
input, or because the customer just moved?
iii)
1. Prepruning: halt the creation of unreliable branches by statistically determining
the goodness of further tree splits.
2. Postpruning: remove unreliable branches from a fully grown tree by minimizing error
rates or the required encoding bits.
iv) Number of hidden layers, number of hidden layer nodes, learning rate, momentum,
number of epochs, and accuracy threshold.
Question 3:
a)
I will suggest using association rule mining. In this data mining task, Session ID can be
mapped to Transaction ID and MiningTask ID can be mapped to Item.
b)
I will suggest using sequential pattern mining. In this data mining task, Dataset ID can be
mapped to Customer ID, Session ID can be mapped to Transaction ID and MiningTask
ID can be mapped to Item.
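The mappings in a) and b) can be illustrated with a short Python sketch. The log rows are hypothetical, as are the variable names, and the sketch assumes that Session IDs increase over time so that sorting by Session ID gives the order of sessions.

```python
from collections import defaultdict

# Hypothetical log rows: (User ID, Session ID, Dataset ID, MiningTask ID).
log = [
    ("u1", 1, "d1", "clustering"),
    ("u1", 1, "d1", "classification"),
    ("u2", 2, "d1", "association"),
    ("u1", 3, "d2", "clustering"),
    ("u2", 4, "d1", "classification"),
]

# a) Association rules: Session ID -> Transaction ID, MiningTask ID -> Item.
baskets = defaultdict(set)
for _, session_id, _, task_id in log:
    baskets[session_id].add(task_id)

# b) Sequential patterns: Dataset ID -> Customer ID,
#    Session ID -> Transaction ID, MiningTask ID -> Item.
by_dataset = defaultdict(lambda: defaultdict(set))
for _, session_id, dataset_id, task_id in log:
    by_dataset[dataset_id][session_id].add(task_id)

# Assumption: a larger Session ID means a later session.
task_sequences = {
    dataset_id: [sessions[s] for s in sorted(sessions)]
    for dataset_id, sessions in by_dataset.items()
}

print(dict(baskets))
print(task_sequences)
```

The baskets can then be mined for item sets of tasks that co-occur in a session, and the per-dataset sequences for ordered patterns of tasks.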
c)
I will suggest using classification. In this data mining task, input attributes include
parameter value level and level of parameter value adjustment, and the class label
attribute is the impact on accuracy (i.e., positive, negative, or no impact).