Exercises Data Mining, Lecture 1

1. Show that the measure of similarity sim does not satisfy the triangle inequality.

2. Use Matlab (or Maple or whatever) to plot, or draw by hand:
   $V = \{\mathbf{x} = (x,y) \in \mathbb{R}^2 \mid \|\mathbf{x}\|_1 \ge 1 \text{ and } \|\mathbf{x}\|_5 \le 1\}$

3. a. Use Matlab (etc.) to plot: $V = \{\mathbf{x} = (x,y) \in \mathbb{R}^2 \mid \|\mathbf{x}\|_{1/2} = 1\}$
   b. Argue from the plot why, in the generalized norm $\|\cdot\|_d$, we normally choose $d \ge 1$.

4. Let: $g = \begin{pmatrix} 4 & 1 \\ 1 & 4 \end{pmatrix}$
   a. Use Matlab (etc.) to plot: $V = \{\mathbf{x} = (x,y) \in \mathbb{R}^2 \mid \|\mathbf{x}\|_g = 4\}$, where $\|\cdot\|_g$ is the Riemannian norm with metric $g$.
   b. Determine the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{v}_i$ of $g$, and draw the vectors $\lambda_1 \mathbf{v}_1$ and $\lambda_2 \mathbf{v}_2$ in the same plot as $V$.

5. Consider the dataset DAMlex3.mat. Load this file in Matlab with: load 'DAMlex3.mat' -ascii. This set contains ten points $\mathbf{x}_i \in \mathbb{R}^4$, $i = 1, \dots, 10$. Let the matrix $d$ contain the distances between these points, where $d(i,j)$ is the distance between the points $\mathbf{x}_i$ and $\mathbf{x}_j$ according to a specific norm or metric.
   a. Determine $d$ in case the norm is:
      i. Euclidean,
      ii. the max-norm, i.e. the generalized p-norm with $p = \infty$,
      iii. the generalized p-norm with $p = 4$.
   b. Consider the matrix¹ $g$:
      $g = \frac{1}{5} \begin{pmatrix} 6 & 1 & -2 & -1 \\ 1 & 7 & -2 & 1 \\ -2 & -2 & 2 & 1 \\ -1 & 1 & 1 & 2 \end{pmatrix}$
      i. Show that $g$ is a valid metric.
      ii. Compute the Riemannian distance matrix $d$ under the metric $g$.

6. Consider the dataset DAMlex1.mat. Load this file in Matlab with: load 'DAMlex1.mat' -ascii.
   a. Determine the two principal axes $\mathbf{a}_1$ and $\mathbf{a}_2$. Do this by applying the Matlab source pca.m available on the web. What fraction of the variance in the data is explained by these two components?
   b. The algorithm in pca.m projects the dataset onto the plane spanned by the two principal axes $\mathbf{a}_1$ and $\mathbf{a}_2$. Plot the dataset in this plane.

7. Consider the dataset DAMlex2.mat. This contains the distances $d$ between ten unknown points $\mathbf{x}_i$, where $d(i,j)$ is the Euclidean distance between the points $\mathbf{x}_i$ and $\mathbf{x}_j$. The objective is to compute or estimate the points $X = \{\mathbf{x}_1^T, \mathbf{x}_2^T, \dots, \mathbf{x}_{10}^T\}$ from this distance matrix $d$.
   a. Implement the algorithm described in Hand et al.
      in equations 3.14 and 3.15, in Section 3.7, pp. 84-86, for computing the matrix $B = XX^T$.
   b. Show that the diagonal components of $B$ are the squared Euclidean lengths of the ten sought points.
   c. Suppose that the points lie in a plane, i.e. $\mathbf{x}_i \in \mathbb{R}^2$. Argue that one can choose one arbitrary point to define the first coordinate axis (e.g. the x-axis). Call your selected index $i^*$, i.e. your selected point is $\mathbf{x}_{i^*}$.
   d. Argue that $B_{ij}$ is the inner product between the points $\mathbf{x}_i$ and $\mathbf{x}_j$. Argue that using $B_{jj}$ and $B_{ij}$ we can determine the length of $\mathbf{x}_j$ and its angle with the x-axis. Argue that in general this allows for two solutions of $\mathbf{x}_j$.
   e. Now select another point, $\mathbf{x}_{j^*}$, and use it to define the second axis, the y-axis. Show that, if the underlying set $X$ were truly 2D, using $i^*$ and $j^*$ we can compute all the coordinates $\mathbf{x}_i$ from $B(i,i)$, $B(i,i^*)$, and $B(i,j^*)$.
   f. Determine in this way the ten 2D coordinates $X$ for the distance matrix $d$.
   g. What will happen if the underlying set $X$ is not truly 2D?

¹ A $D \times D$ matrix with $D > 3$ is often called a tensor.
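
As a possible starting point for exercise 7a, here is a minimal Python sketch (the sheet allows tools other than Matlab). It assumes that equations 3.14 and 3.15 of Hand et al. amount to the standard double-centering of the squared distances; this assumption has not been checked against the book. The function name gram_from_distances and the three-point toy data are illustrative only and are not taken from DAMlex2.mat.

```python
import math


def gram_from_distances(d):
    """Double-centre the squared distances to obtain a Gram matrix B.

    Assumption: B[i][j] = -1/2 * (d2[i][j] - rowmean[i] - colmean[j] + grandmean),
    where d2 holds the squared distances. This is the standard classical-scaling
    formula; it is assumed, not confirmed, to match the book's equations.
    """
    n = len(d)
    d2 = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(d2[i]) / n for i in range(n)]
    col_mean = [sum(d2[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (d2[i][j] - row_mean[i] - col_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: three points of a 3-4-5 right triangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
d = [[math.dist(p, q) for q in points] for p in points]

B = gram_from_distances(d)
for row in B:
    print([round(b, 4) for b in row])
```

Double-centering places the origin at the centroid of the points, so in this sketch the diagonal of B holds the squared distances of the points to their centroid, and every row of B sums to zero.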