1. Data quality can be assessed in term of... Purpose two other dimension of data quality.

1. Data quality can be assessed in term of accuracy, completeness, and consistency. Purpose two other dimension of data quality. 2. In real-world data, tples with missing values for some attributes are acommon occurrence. Describe various methods for handling this problem. 3. Suppose the data for analysis includes the attributes age. The age for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 21, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. a. What is the mean of the data? What is the median? b. Can you find (roughly) the forst quartile (Q1) and the third quartile (Q3) of the data? c. Give the five-number summary of the data 4. Using the data for age given in Exercise 3, answer the following. a. Use smoothing by means to smooth the data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data. b. What other methods are there for data smoothing? 5. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following rasult: age %fat 23 9.5 23 26.5 27 7.8 27 17.8 39 31.4 41 25.9 47 27.4 49 27.2 50 31.2 age %fat 52 34.6 54 42.5 54 28.8 56 33.4 57 30.2 58 34.1 58 32.9 60 41.2 61 35.7 a. b. c. d. e. Calculate the mean, median, and standard deviation of age and %fat Draw the boxplots for age and %flat Draw scatter plot and q-q plot based on these two variables Normalize the two variables based on z-score normalization Calculate the correlation coefficient (Pearson’s mement coefficient). Are these two variables positively or negatively correlated? 6. Suppose a group of 12 salary price records has been sorted as follow: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215 Partition them into three bins by each of the following methods: a. equal-frequency (equidepth) patritioning b. equal-width partitioning c. clustering 7. It is important to define or select similarity measures in data analysis. However, there in no commonly accepted subjevtive similarity measure. Using different similarity measures may deduce different results. Nonetheless, some apparently different similarity measures may be equivalent after some transformation. Suppose we have the following two-dimensional data set: x1 x2 x3 x4 x5 A1 1.5 2 1.6 1.2 1.5 A2 1.7 1.9 1.8 1.5 1.0 a. Consider the data as two-dimensional data points. Given a new data point, x = (1.4, 1.6) as a query, rank the database points based on similarity with the query using (1) Euclidean distance (Equation 7.5), and (2) cosine similariry (Equation 7.16). b. Normalize the data set to make the norm of each data point equal to 1. use Euclidean distance on the transformed data to rank the data points.

1. Data quality can be assessed in term of... Purpose two other dimension of data quality.

Related documents

Products

Support

1. Data quality can be assessed in term of... Purpose two other dimension of data quality.

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib