EMTM 554 Data Mining ... Closed book


EMTM 554 Data Mining Spring, 2007

Closed book

All questions are short-answer; one phrase or sentence each will suffice.

1. Why transform data from a transactional database to a data warehouse?

More efficient to query

Join different databases

2. Name (or describe) two good methods for viewing four-dimensional data.

a) mosaic

b) interactive

3. Give two reasons why it is important to collect meta-data.

a) know what the data really means

b) reconcile different databases

4. When or why might k-means be better to use than k-nearest neighbors?

Generalizes better for sparse data

5. When would one prefer to build a dendrogram (tree) using agglomerative clustering to k-means?

Don’t know how many clusters; exploratory data analysis (EDA)

6. When would one prefer k-means to agglomerative clustering?

Large data set

7. Stepwise logistic regression was used to predict purchase of a luxury good as a function of a set of consumer demographic characteristics and prior purchases. The resulting model included “car model” and “value of house”, but not “income”. Does this mean that “income” is uncorrelated with purchase of this luxury good? Please explain.

No. It could be that income is correlated with value of house, so income adds little once value of house is in the model.

8. Why might physicians prefer decision trees to more accurate methods such as artificial neural networks or support vector machines?

Easier to interpret the results

9. Computer scientists often evaluate data mining / machine learning methods in terms of their accuracy on a held out sample of data (“test data”).

(a) Why might the resulting accuracies not be relevant for business decisions?

Asymmetric costs

(b) What might you compute instead?

Precision/recall, or expected cost

10. Describe 5-fold cross-validation.

Divide the data into 5 parts. For each part in turn, train on the other 4 parts and test on the held-out part. Average the errors over the 5 tests.
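The usual form of the procedure, in which each fold is held out once for testing while the remaining four are used for training, can be sketched as follows. This is a minimal illustration, not part of the exam: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mse`) and the mean-predicting stand-in model are assumptions made for the example.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, error, k=5):
    """Train on k-1 folds, test on the held-out fold, and average the k errors."""
    folds = k_fold_indices(len(xs), k)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(
            error(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return sum(errors) / k


# Hypothetical stand-in model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mse(model, xs, ys):
    """Mean squared error of a constant prediction."""
    return sum((y - model) ** 2 for y in ys) / len(ys)


if __name__ == "__main__":
    xs = list(range(10))
    ys = [2.0 * x for x in xs]
    print(cross_validate(xs, ys, fit_mean, mse))
```

Any `fit` and `error` pair with the same call shape could be substituted for the stand-ins.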

11. Given data of the following form, where NA indicates missing data:

        Cost  color  size
Item1   42    blue   small
Item2   57    red    medium
Item3   32    NA     large
Item4   NA    red    small

One way to handle missing data is to replace it with the average (or most frequent) value of the column. (E.g., set the missing cost to (42+57+32)/3 and the missing color to red.)

(a) Why might this be a bad idea?

If they are not missing at random, you are throwing away information

(b) When might it be a good idea?

If they are missing at random

(c) What would that table look like when transformed so that a regression method could use the data? (Please do not estimate the values of the missing data.)

        Cost  Cost_missing  blue  red  color_missing  small  medium  large
Item1   42    0             1     0    0              1      0       0
Item2   57    0             0     1    0              0      1       0
Item3   32    0             0     0    1              0      0       1
Item4   0     1             0     1    0              1      0       0
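The transformation in (c) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the exam's own method: the `encode` helper and the lowercase dictionary keys are assumptions, and a missing cost is stored as 0 alongside a `cost_missing` flag rather than being estimated.

```python
rows = [
    {"item": "Item1", "cost": 42, "color": "blue", "size": "small"},
    {"item": "Item2", "cost": 57, "color": "red", "size": "medium"},
    {"item": "Item3", "cost": 32, "color": None, "size": "large"},
    {"item": "Item4", "cost": None, "color": "red", "size": "small"},
]


def encode(row, colors=("blue", "red"), sizes=("small", "medium", "large")):
    """Turn one row into numeric columns with explicit missing-value indicators."""
    out = {
        # A missing cost becomes 0; the indicator column carries the information.
        "cost": row["cost"] if row["cost"] is not None else 0,
        "cost_missing": int(row["cost"] is None),
    }
    for c in colors:
        out[c] = int(row["color"] == c)
    out["color_missing"] = int(row["color"] is None)
    for s in sizes:
        out[s] = int(row["size"] == s)
    return out


encoded = [encode(r) for r in rows]

if __name__ == "__main__":
    for r, e in zip(rows, encoded):
        print(r["item"], list(e.values()))
```

Keeping the indicator columns lets a regression learn a separate effect for "value was missing", which matters when values are not missing at random.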

12. Given the following confusion matrix

                          Actual
                          purchase   no purchase
Predicted  purchase       10         60
           no purchase    20         200

(a) What is the lift from the model?

Fraction of those predicted to purchase who do (10/70), over the fraction of the total who purchase (30/290):

(10/70)/(30/290), roughly 1.4

(b) What is its precision?

Fraction of those predicted to purchase who do

10/70

(c) What is its recall?

Fraction of those who purchase who are predicted to purchase

10/30

You do not need to simplify your answer; just leave it as an expression like “(100 + 20)/60”.
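As a check on the arithmetic, the three quantities can be computed directly from the four cells of the confusion matrix. The sketch below uses exact fractions; the variable names are chosen for the example only.

```python
from fractions import Fraction

# Cells of the confusion matrix in question 12.
tp, fp = 10, 60    # predicted purchase: actually purchased / did not purchase
fn, tn = 20, 200   # predicted no purchase: actually purchased / did not purchase

total = tp + fp + fn + tn

precision = Fraction(tp, tp + fp)     # 10/70
recall = Fraction(tp, tp + fn)        # 10/30
base_rate = Fraction(tp + fn, total)  # 30/290, purchase rate with no model
lift = precision / base_rate          # (10/70) / (30/290)

if __name__ == "__main__":
    print("precision:", precision)
    print("recall:", recall)
    print("lift:", lift, "which is about", round(float(lift), 2))
```

The lift works out to 29/21, or about 1.38.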

13. A number of statisticians think that data mining is a bad idea.

a) Why?

Hard to determine causality, and one needs causality as a basis for action

b) What do they suggest one do instead?

Experimental design

14. Two steps are missing from the following CRISP methodology list. What are they?

Data Understanding

Data Preparation

Modeling

Deployment

(a) business understanding

(b) evaluation

15. Capital One was extremely profitable because they combined

(a) segmentation and

(b) price discrimination

16. What is the difference between information retrieval and information extraction?

Retrieval gets documents; extraction gets facts to put in a database

17. What is the key idea behind Google’s PageRank?

Pages which are pointed to by other pages are “better”;

Those pointed to by pages that many people point to are even better
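The recursive idea can be illustrated with a small power-iteration sketch. This is a simplified illustration under stated assumptions, not Google's implementation: the `pagerank` function, the damping factor of 0.85, the fixed iteration count, and the four-page link graph are all chosen for the example, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively pass each page's rank along its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of inbound links.
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical graph: A, B and D all point to C; C points back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)

if __name__ == "__main__":
    for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In this graph C ranks highest because three pages point to it, and A outranks B and D because its single inbound link comes from the highly ranked C.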

18. Give four reasons why text mining is hard.

a)

b)

c)

d)

19. Please order from (typically) most to least important

(a) choosing the right machine learning method (e.g., neural nets, logistic regression, SVMs)

(b) having the right data available

(c) doing feature selection well

1. b (most important)

2. c (intermediate)

3. a (least important)

20. When training logistic regression and a decision tree on the same data set, we obtained the following classification error rates:

                      Training set error   Validation set error
Logistic Regression   56%                  71%
Decision Tree         12%                  17%

a. What do you think is happening with the logistic regression?

Underfitting

b. What could you do about it?

Include interaction terms

On another data set, we had the following results:

                      Training set error   Validation set error
Logistic Regression   3%                   33%
Decision Tree         12%                  16%

c. What do you think is happening with the logistic regression?

Overfitting

d. What would you do to improve its behavior?

Stepwise regression
