Intelligent Data Analysis and Probabilistic Inference
Data Mining Tutorial 1: Overview & Data Cleaning
1. Basic Concepts
a. Give a brief definition of the term “Data Mining”.
b. Briefly explain the difference between “Data Mining”, “OLAP” and traditional
“Database Querying”.
c. Explain the difference between “Explorative Data Mining” and “Predictive Data
Mining” and give one example of each.
d. State three different applications for which data mining techniques seem
appropriate. Informally explain each application.
e. Explain what is meant by “Data Integration” and describe why it is an important
pre-processing step for data mining.
2. Data Mining Techniques and Applications
a. Explain briefly the difference between “Classification” and “Clustering” and give
an informal example of an application that would benefit from each technique.
b. Explain briefly the difference between “Regression” and “Classification”.
c. Explain briefly what is meant by “Association Rule Analysis” and describe the
difference between it and “Sequence Rule Analysis”.
3. Clustering:
You are given the task to cluster (i.e. divide into similar groups) the students
attending this tutorial based on their physical appearance.
a. Devise a feature representation scheme that allows describing each student in the
class as a record, make sure that you have at least 5 features to describe each
student. For each feature, describe the type of variable it denotes (Numerical,
Categorical, etc) and state the valid range of values for that variable.
b. Fill in the feature table for six students, i.e. build a table containing 6 rows and 5
columns and provide the values for each cell in the table.
c. Describe why you believe your feature representation scheme will produce good
results when applied to grouping the students in the tutorial.
d. Explain what is meant by an “outlier”. Add a new record to the table that you
believe would be an outlier compared to the whole data set and also to the
different clusters, and explain why it is indeed an outlier.
4. Classification:
You are now given the task to derive a model that can predict whether a student will
pass the data mining course or not (PASS/FAIL decision).
a. Devise a feature representation scheme with five features that can help derive
such a model. Make sure you choose features you believe may be good predictors
of a student’s grade, and describe why you believe they are better predictors than
the features you chose in question 3.
b. Fill in the table with six different records for six hypothetical students from the
class of 2000. This table should contain six columns (one column for each of your
chosen features, and one column for the Pass/Fail result) and six rows (one for each
student). Which columns (variables) of this table are “independent” (“input”)
variables and which are “dependent” (“output” or “class”) variables?
c. A decision rule is in the form “If FeatureA = FeatureValue1 then ClassValue =
ClassValue1”. Informally derive at least 4 “Decision Rules” that can be inferred
from your data table. Is there any inconsistency between your rules?
d. Explain informally how you can test the accuracy of your decision rules based on
the data set you have provided. What is the accuracy of each rule? What is the
accuracy of the overall model (i.e. the 4 rules together)?
e. Testing the accuracy of the rules on your data set may be biased: the rules probably
over-fit your data, since they were derived and tested using only this data set.
What would be a better way to assess the accuracy of your rules?
f. Explain how your decision rules can be applied to predict whether you yourself
will PASS or FAIL the data mining course in 2003.
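As an illustration of what "accuracy of a rule" can mean in question 4, the sketch below scores each rule on the records it covers, first on the full table and then on records held back from rule derivation. The feature names, the six records, and the two rules are all hypothetical and are not taken from the tutorial; they only show the mechanics.

```python
# Hypothetical records: (features, actual result). Not real student data.
students = [
    ({"attends_tutorials": "yes", "coursework": "high"}, "PASS"),
    ({"attends_tutorials": "yes", "coursework": "low"}, "PASS"),
    ({"attends_tutorials": "no", "coursework": "high"}, "PASS"),
    ({"attends_tutorials": "no", "coursework": "low"}, "FAIL"),
    ({"attends_tutorials": "yes", "coursework": "low"}, "FAIL"),
    ({"attends_tutorials": "no", "coursework": "low"}, "FAIL"),
]

# Each rule reads: if feature == value then class == predicted.
rules = [
    ("attends_tutorials", "yes", "PASS"),
    ("coursework", "low", "FAIL"),
]

def rule_accuracy(rule, rows):
    """Fraction of the records covered by the rule that it classifies correctly."""
    feature, value, predicted = rule
    covered = [actual for feats, actual in rows if feats[feature] == value]
    if not covered:
        return None  # the rule never fires on these rows
    return sum(actual == predicted for actual in covered) / len(covered)

# Accuracy measured on the same table the rules were read from (optimistic).
for rule in rules:
    print(rule, rule_accuracy(rule, students))

# Accuracy on records kept aside: a simple hold-out check.
held_out = students[4:]
for rule in rules:
    print(rule, rule_accuracy(rule, held_out))
```

With only six records a hold-out split is too small to be reliable; the point is only that rules should be scored on records that were not used to derive them.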
5. Classification/Prediction/Feature Selection:
There are many applications of data mining in finance. Explain why and how it can
be dangerous to naively use predictive data mining techniques to predict stock price
movements.
Hint: Consider what features you would choose to describe each stock and also
consider what really makes stock prices move. Can you find a good feature set that
can be presented to a data mining algorithm?
6. Data Cleaning:
a. Explain what is meant by “Data Cleaning” and why it may be required before
mining a large data set.
b. Describe three commonly used data cleaning operations.
c. Explain three methods for handling missing data in a data set.
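For orientation on question 6c, here is a minimal sketch of three commonly described ways of handling a missing value: discarding the record, filling with a fixed constant, and filling with the attribute mean. These are only some of the possible answers, and the records and the field name `age` are hypothetical.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 21, "result": "PASS"},
    {"age": None, "result": "PASS"},
    {"age": 25, "result": "FAIL"},
    {"age": 23, "result": "PASS"},
    {"age": None, "result": "FAIL"},
]

def drop_missing(rows, field):
    """Method 1: ignore every record whose field is missing."""
    return [r for r in rows if r[field] is not None]

def fill_with_constant(rows, field, value):
    """Method 2: replace missing entries with a fixed constant (copies the rows)."""
    return [{**r, field: value if r[field] is None else r[field]} for r in rows]

def fill_with_mean(rows, field):
    """Method 3: replace missing entries with the mean of the known values."""
    known_mean = mean(r[field] for r in rows if r[field] is not None)
    return fill_with_constant(rows, field, known_mean)

print(drop_missing(records, "age"))
print(fill_with_constant(records, "age", -1))
print(fill_with_mean(records, "age"))
```

Dropping records is the simplest option but loses data; filling keeps every record at the cost of introducing values that were never observed.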
7. Data Cleaning:
Given the following data set: [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
a. Divide the data set into 3 equi-depth bins.
b. Divide the data set into 3 bins that are smoothed by their means.
c. Normalize the data set based on a min-max normalization.
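The three operations in question 7 can be checked with a short sketch. It assumes that the bins in part b are the equi-depth bins from part a, and that min-max normalization maps onto the range [0, 1]; the question states neither, so adjust if a different binning or target range is intended.

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins with the same number of items."""
    ordered = sorted(values)
    depth = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * depth:(i + 1) * depth] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

bins = equi_depth_bins(data, 3)
print(bins)                    # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))   # bin means are 9.0, 22.75 and 29.25
print(min_max_normalize(data)) # 4 maps to 0.0 and 34 maps to 1.0
```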
yg@doc.ic.ac.uk, mmg@doc.ic.ac.uk
25th Nov 2003