CIS 4930 Data Mining Spring 2016, Assignment 1

Instructor: Peixiang Zhao
TA: Yongjiang Liang
Due date: Monday, 02/08/2016, during class
Problem 1
[15 = 3 ∗ 5]
Classify the following attributes as binary, discrete, or continuous. Also classify them
as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may
have more than one interpretation, so briefly indicate your reasoning if you think there
may be some ambiguity. Example: Age in years. Answer: Discrete, quantitative,
ratio.
1. Coat check number (when you attend an event, you can often give your coat
to someone who, in turn, gives you a number that you can use to claim your coat
when you leave);
2. Angles as measured in degrees between 0 and 360;
3. Bronze, Silver, and Gold medals as awarded at the Olympics;
4. ISBN numbers for books;
5. Distance from the center of campus.
Problem 2
[15 = 5 ∗ 3]
Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25,
25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
1. What are the mean, median and mode of the data?
2. What are the first quartile (Q1) and the third quartile (Q3) of the data?
3. Draw the boxplot of the data.
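As a starting point for checking hand calculations, the summary statistics asked for above can be sketched with Python's standard `statistics` module. The helper name `summarize` is an illustrative choice, and the quartile convention (`method="inclusive"`, which interpolates linearly) is an assumption: textbooks and tools use several quartile definitions, so Q1 and Q3 may differ slightly from the convention used in class.

```python
import statistics

# Age values copied from the problem statement.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25,
        25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def summarize(values):
    """Return mean, median, modes, Q1, and Q3 for a list of numbers."""
    ordered = sorted(values)
    # Assumption: "inclusive" quartiles with linear interpolation.
    # Other conventions can shift Q1 and Q3 slightly.
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        # multimode returns every value tied for the highest frequency.
        "modes": statistics.multimode(ordered),
        "q1": q1,
        "q3": q3,
    }


summary = summarize(ages)
for name, value in summary.items():
    print(name, value)
```

The five numbers needed for the boxplot (minimum, Q1, median, Q3, maximum) can be read off `summary` together with `min(ages)` and `max(ages)`.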
Problem 3
[20 = 10 ∗ 2]
Suppose that a hospital tested the age and body fat data for 18 randomly selected
adults with the following results:
AGE    FAT      AGE    FAT
23     9.5      52     34.6
23     26.5     54     42.5
27     7.8      54     28.8
27     17.8     56     33.4
39     31.4     57     30.2
41     25.9     58     34.1
47     27.4     58     32.9
49     27.2     60     41.2
50     31.2     61     35.7
1. Draw the boxplot for age and fat;
2. Draw the quantile-quantile (q-q) plot for age and fat.
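For two samples of equal size, one common way to build a q-q plot is to sort each sample independently and pair the values of equal rank. The sketch below only prepares those pairs and leaves the drawing to whatever plotting tool is preferred. The variable names are illustrative.

```python
# Age and body fat values copied from the problem statement (18 adults).
age = [23, 23, 27, 27, 39, 41, 47, 49, 50,
       52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

# Sort each attribute on its own: a q-q plot compares the two
# distributions, not the per-person (age, fat) records.
qq_points = list(zip(sorted(age), sorted(fat)))

for age_quantile, fat_quantile in qq_points:
    print(age_quantile, fat_quantile)
```

Each printed pair is one point of the q-q plot, with the age quantile on one axis and the matching fat quantile on the other.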
Problem 4
[10]
Suppose we have the following 2-D data set:
       d1     d2     d3     d4     d5
x      1.5    2      1.6    1.2    1.5
y      1.7    1.9    1.8    1.5    1.0
Given a new data point, d = (1.4, 1.6), as a query, rank the database points based on
similarity (from the most similar to the least similar) with the query using Euclidean
distance, Manhattan distance, supremum distance, and cosine similarity.
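The four measures can be sketched in a few lines of Python to cross-check a hand-computed ranking. The function names and the `rank` helper are illustrative choices, not part of the assignment. For the three distances a smaller value means more similar, while for cosine similarity a larger value means more similar.

```python
import math

# Data points and query copied from the problem statement.
points = {
    "d1": (1.5, 1.7),
    "d2": (2.0, 1.9),
    "d3": (1.6, 1.8),
    "d4": (1.2, 1.5),
    "d5": (1.5, 1.0),
}
query = (1.4, 1.6)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank(measure, larger_is_closer=False):
    """Order point names from most to least similar to the query."""
    return sorted(points,
                  key=lambda name: measure(points[name], query),
                  reverse=larger_is_closer)


print("Euclidean:", rank(euclidean))
print("Manhattan:", rank(manhattan))
print("Supremum: ", rank(supremum))
print("Cosine:   ", rank(cosine_similarity, larger_is_closer=True))
```

Supremum distance can produce exact ties between points, and floating-point rounding may then order the tied points arbitrarily, so tied ranks are best reported as ties.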
Problem 5
[10]
The following table shows how many of 10,000 transactions contain beer and/or nuts.
Compute χ² for these two factors for correlation analysis.
            Nuts    No Nuts    Total
Beer          50        150      200
No Beer      800       9000     9800
Total        850       9150    10000
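The χ² statistic sums (observed − expected)² / expected over all cells, where the expected count of a cell is (row total × column total) / grand total. A minimal sketch of that computation follows; the function name `chi_square` is an illustrative choice, and the input is the 2 × 2 block of observed counts without the Total row and column.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two factors were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (count - expected) ** 2 / expected
    return chi2


# Observed counts from the table: rows are Beer / No Beer,
# columns are Nuts / No Nuts.
observed = [[50, 150],
            [800, 9000]]
chi2 = chi_square(observed)
print(round(chi2, 3))
```

Comparing an observed cell with its expected count also shows the direction of the association: an observed count above the expected one suggests the two items occur together more often than independence would predict.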
Problem 6
[30]
Consider the exam score dataset (on the assignment page of the course website), which
contains students' exam scores from the past few years of the database
course. The first column is the student's ID, the second column is the mid-term score,
and the third column is the final-exam score. Each row represents one student, and
the values are separated by TABs. Choose one language among C/C++, Java,
or Python and write programs that implement functions for the following statistical
descriptions of the data. If a result is not an integer, round it to 3 decimal places.
Report the basic statistical results for the mid-term and final-exam scores here in the
assignment. Also submit your source code (and a makefile or other supporting software,
if any) to the TA by email.
1. Max, min, median, Q1, Q3;
2. Mean, mode, standard deviation.
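One possible layout for such a program in Python is sketched below. Several points are assumptions rather than requirements of the assignment: the function names, the absence of a header row in the file, the "inclusive" quartile convention, and the use of the population standard deviation (`statistics.pstdev`; swap in `statistics.stdev` for the sample version). The demo rows are made up for illustration and are not taken from the course dataset.

```python
import csv
import statistics


def parse_scores(lines):
    """Parse TAB-separated rows (id, mid-term, final) into two score lists."""
    midterm, final = [], []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        midterm.append(float(row[1]))
        final.append(float(row[2]))
    return midterm, final


def load_scores(path):
    """Read the dataset file; assumes there is no header row."""
    with open(path, newline="") as handle:
        return parse_scores(handle)


def describe(values):
    """Return the requested statistics, rounded to 3 decimal places."""
    # Assumption: "inclusive" quartiles; other conventions differ slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "max": round(max(values), 3),
        "min": round(min(values), 3),
        "median": round(statistics.median(values), 3),
        "q1": round(q1, 3),
        "q3": round(q3, 3),
        "mean": round(statistics.mean(values), 3),
        # A dataset may have several modes, so all of them are reported.
        "mode": [round(m, 3) for m in statistics.multimode(values)],
        # Assumption: population standard deviation.
        "std": round(statistics.pstdev(values), 3),
    }


# Made-up rows for illustration only (not the course dataset).
demo_lines = ["1\t80\t90", "2\t70\t85", "3\t90\t95", "4\t80\t75"]
demo_midterm, demo_final = parse_scores(demo_lines)
print("mid-term:", describe(demo_midterm))
print("final:   ", describe(demo_final))
```

For the real data, `load_scores` would be called with the path of the downloaded file in place of the demo rows.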