CIS 4930 Data Mining
Spring 2016, Assignment 1
Instructor: Peixiang Zhao
TA: Yongjiang Liang
Due date: Monday, 02/08/2016, during class

Problem 1 [15 = 3 ∗ 5]
Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio.
1. Coat check number (when you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave);
2. Angles as measured in degrees between 0 and 360;
3. Bronze, Silver, and Gold medals as awarded at the Olympics;
4. ISBN numbers for books;
5. Distance from the center of campus.

Problem 2 [15 = 5 ∗ 3]
Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
1. What are the mean, median, and mode of the data?
2. What are the first quartile (Q1) and the third quartile (Q3) of the data?
3. Draw the boxplot of the data.

Problem 3 [20 = 10 ∗ 2]
Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following results:

AGE   FAT     AGE   FAT
23    9.5     52    34.6
23    26.5    54    42.5
27    7.8     54    28.8
27    17.8    56    33.4
39    31.4    57    30.2
41    25.9    58    34.1
47    27.4    58    32.9
49    27.2    60    41.2
50    31.2    61    35.7

1. Draw the boxplot for age and fat;
2. Draw the quantile-quantile (q-q) plot for age and fat.
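The boxplots in Problems 2 and 3 are built from a five-number summary (min, Q1, median, Q3, max). The following is a minimal Python sketch for cross-checking hand calculations on the Problem 2 age data. The helper name `five_number_summary` is hypothetical, and the quartile convention (the "inclusive" linear-interpolation method of the standard `statistics` module) is an assumption: textbooks and tools use several conventions, so Q1 and Q3 may differ slightly from the definition used in class.

```python
import statistics

# Age values copied from Problem 2 (already in increasing order).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Assumption: quartiles follow the "inclusive" interpolation rule;
    other quartile conventions can give slightly different Q1 and Q3.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
# multimode returns a list because the data may have more than one mode.
modes = statistics.multimode(ages)
summary = five_number_summary(ages)

print("mean:", round(mean_age, 3))
print("median:", median_age)
print("mode(s):", modes)
print("five-number summary:", summary)
```

The same function can be applied to the AGE and FAT columns of Problem 3 once they are typed in as lists.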

Problem 4 [10]
Suppose we have the following 2-D data set:

     d1    d2    d3    d4    d5
x    1.5   2     1.6   1.2   1.5
y    1.7   1.9   1.8   1.5   1.0

Given a new data point, d = (1.4, 1.6), as a query, rank the database points based on similarity (from the most similar to the least similar) with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.

Problem 5 [10]
The following table shows how many of 10,000 transactions contain beer and/or nuts. Compute χ² for these two factors for correlation analysis.

           Nuts   No Nuts   Total
Beer       50     150       200
No Beer    800    9000      9800
Total      850    9150      10000

Problem 6 [30]
Consider the exam score dataset (on the assignment page of the course website), which contains the records of students' exam scores for the past few years of the database course. The first column is the student's id, the second column is the mid-term score, and the third column is the final-exam score. Each row represents one student, and the values are separated by TABs.
Choose one language among C/C++, Java, or Python and write programs to implement functions for the following statistical descriptions of the data. If a result is not an integer, round it to 3 decimal places. Report the basic statistical results for the mid-term and final-exam scores here in the assignment. Also submit your source code (and, if applicable, a makefile or other supporting software) to the TA via email.
1. Max, min, median, Q1, Q3;
2. Mean, mode, standard deviation.
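
As a starting point for Problem 6, the sketch below shows one possible way to parse TAB-separated rows and compute the requested statistics in Python. It is not a reference solution. The function names `parse_scores` and `describe`, the file name `scores.txt`, and the three sample rows are all hypothetical. Two conventions are assumed and should be checked against the course definitions: quartiles use the "inclusive" interpolation rule, and the standard deviation is the sample (n − 1) version.

```python
import statistics


def parse_scores(lines):
    """Parse TAB-separated rows (id, mid-term, final) into two score lists."""
    midterm, final = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _student_id, mid, fin = line.split("\t")
        midterm.append(float(mid))
        final.append(float(fin))
    return midterm, final


def describe(scores):
    """Return the requested statistics, rounded to 3 decimal places."""
    # Assumption: "inclusive" quartile convention.
    q1, med, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "max": round(max(scores), 3),
        "min": round(min(scores), 3),
        "median": round(med, 3),
        "Q1": round(q1, 3),
        "Q3": round(q3, 3),
        "mean": round(statistics.mean(scores), 3),
        # A list, since several values can tie for most frequent.
        "mode": [round(m, 3) for m in statistics.multimode(scores)],
        # Assumption: sample standard deviation (divides by n - 1).
        "std": round(statistics.stdev(scores), 3),
    }


# Hypothetical sample rows, only to exercise the functions.
sample_rows = ["1\t80\t90", "2\t70\t85", "3\t80\t75", "4\t60\t95"]
midterm, final = parse_scores(sample_rows)
midterm_stats = describe(midterm)
final_stats = describe(final)
print(midterm_stats)
print(final_stats)

# With the real dataset (hypothetical file name):
# with open("scores.txt") as f:
#     midterm, final = parse_scores(f)
```

Replacing `statistics.stdev` with `statistics.pstdev` gives the population standard deviation if that is the definition the course uses.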