Exercises Data Mining Lecture 1

1. Show that the similarity measure sim does not satisfy the triangle inequality.
2. Use Matlab (or Maple, or any other tool) to plot, or draw by hand:
$V = \{\mathbf{x} = (x, y) \in \mathbb{R}^2 \mid \|\mathbf{x}\|_1 \ge 1 \text{ and } \|\mathbf{x}\|_5 \le 1\}$
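For readers without Matlab, the membership test behind this set can be sketched in plain Python. This is only an illustration: the names `p_norm`, `in_region` and `region_grid`, the grid bounds and the resolution are all assumptions, and the text rendering is a crude stand-in for a real plot.

```python
def p_norm(x, p):
    """Generalized p-norm (sum_i |x_i|^p)^(1/p); p = float('inf') gives the max-norm."""
    if p == float("inf"):
        return max(abs(c) for c in x)
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def in_region(x, y):
    """True if (x, y) satisfies ||x||_1 >= 1 and ||x||_5 <= 1."""
    point = (x, y)
    return p_norm(point, 1) >= 1 and p_norm(point, 5) <= 1


def region_grid(n=41, lo=-1.2, hi=1.2):
    """Evaluate membership on an n-by-n grid; rows run over y, columns over x."""
    step = (hi - lo) / (n - 1)
    axis = [lo + k * step for k in range(n)]
    return axis, [[in_region(x, y) for x in axis] for y in axis]


if __name__ == "__main__":
    axis, mask = region_grid()
    # Crude text rendering: '#' marks grid points inside V (top row = largest y).
    for row in reversed(mask):
        print("".join("#" if inside else "." for inside in row))
```

The same two comparisons can be applied to a `meshgrid` in Matlab and passed to a contour or image plot.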
3. a. Use Matlab (etc.) to plot: $V = \{\mathbf{x} = (x, y) \in \mathbb{R}^2 \mid \|\mathbf{x}\|_{1/2} = 1\}$
b. Argue from the plot why one normally chooses $d \ge 1$ in the generalized norm $\|\cdot\|_d$.
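A numerical check can complement the plot-based argument. The sketch below evaluates the triangle inequality for the two unit vectors along the axes under a few exponents. The helper name `triangle_gap` and the choice of test vectors are assumptions made for illustration; it is a spot check, not a proof.

```python
def p_norm(x, p):
    """Generalized p-'norm' (sum_i |x_i|^p)^(1/p) for finite p > 0."""
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def triangle_gap(a, b, p):
    """||a|| + ||b|| - ||a + b||; a negative value means the triangle inequality fails."""
    s = tuple(ai + bi for ai, bi in zip(a, b))
    return p_norm(a, p) + p_norm(b, p) - p_norm(s, p)


e1, e2 = (1.0, 0.0), (0.0, 1.0)
gap_half = triangle_gap(e1, e2, 0.5)  # exponent below 1
gap_one = triangle_gap(e1, e2, 1.0)   # boundary case
gap_two = triangle_gap(e1, e2, 2.0)   # Euclidean norm

if __name__ == "__main__":
    for p, gap in ((0.5, gap_half), (1.0, gap_one), (2.0, gap_two)):
        print(f"p = {p}: gap = {gap:.4f}")
```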
 4 1

4. Let: g  
 1 4
a Use Matlab (etc) to plot: V = {x = (x,y) in IR2 | ||x||g = 4}, where ||*||g is the
Riemannian norm with metric g
b Determine the eigen-values i and eigen-vectors vi of g , and draw the vectors: 1v1
and 2v2 in the same plot as V.
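One possible way to generate points for this plot is sketched below in plain Python. It assumes the Riemannian norm is $\|\mathbf{x}\|_g = \sqrt{\mathbf{x}^T g\, \mathbf{x}}$ with the constant metric g = [[4, 1], [1, 4]]. The function names are illustrative, and the closed-form eigen-solver only covers symmetric 2x2 matrices with a nonzero off-diagonal entry.

```python
import math

G = ((4.0, 1.0), (1.0, 4.0))  # metric from exercise 4


def riemann_norm(x, g):
    """||x||_g = sqrt(x^T g x), assuming g is symmetric positive definite."""
    n = len(x)
    q = sum(x[i] * g[i][j] * x[j] for i in range(n) for j in range(n))
    return math.sqrt(q)


def level_set(g, radius, n=200):
    """Points with ||x||_g = radius, obtained by rescaling unit directions."""
    pts = []
    for k in range(n):
        t = 2.0 * math.pi * k / n
        u = (math.cos(t), math.sin(t))
        s = radius / riemann_norm(u, g)
        pts.append((s * u[0], s * u[1]))
    return pts


def eig_sym2(g):
    """Eigenpairs (ascending eigenvalue, unit eigenvector) of a symmetric 2x2 matrix.

    Assumes the off-diagonal entry is nonzero.
    """
    a, b, d = g[0][0], g[0][1], g[1][1]
    mean = 0.5 * (a + d)
    half = math.hypot(0.5 * (a - d), b)
    pairs = []
    for lam in (mean - half, mean + half):
        v = (b, lam - a)  # solves (g - lam*I) v = 0 when b != 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


curve = level_set(G, 4.0)  # points to plot for part a
pairs = eig_sym2(G)        # eigenpairs for part b
```

In Matlab, `eig(g)` would replace the hand-written solver, and the points in `curve` correspond to what one would pass to `plot`.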
5. Consider the dataset DAMlex3.mat. Load this file in Matlab with:
load 'DAMlex3.mat' -ascii
This set contains ten points $\mathbf{x}_i \in \mathbb{R}^4$, i = 1..10. Let the
matrix d hold the distances between these points, where d(i,j) is the distance
between the points $\mathbf{x}_i$ and $\mathbf{x}_j$ according to a specific norm or metric.
a. Determine d when the norm is:
i. Euclidean; ii. the max-norm, i.e. the generalized p-norm with p = ∞; iii. the generalized p-norm with p = 4.
b. Consider the matrix¹ g:
$g = \frac{1}{5}\begin{pmatrix} 6 & 1 & -2 & -1 \\ 1 & 7 & -2 & 1 \\ -2 & -2 & 2 & 1 \\ -1 & 1 & 1 & 2 \end{pmatrix}$
i. Show that g is a valid metric.
ii. Compute the Riemannian distance matrix d under the metric g.
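The distance matrices asked for in exercise 5 all follow one pattern: pick a distance function and evaluate it for every pair of points. The plain-Python sketch below illustrates that pattern under two assumptions of mine: "valid metric" is read as "symmetric and positive definite", and the Riemannian distance for a constant metric is $\sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T g\, (\mathbf{x}_i - \mathbf{x}_j)}$. The three points are hypothetical stand-ins, since the contents of DAMlex3.mat are not reproduced here.

```python
import math

# Metric from exercise 5b: g = (1/5) * M.
M = [[6, 1, -2, -1],
     [1, 7, -2, 1],
     [-2, -2, 2, 1],
     [-1, 1, 1, 2]]
G = [[m / 5.0 for m in row] for row in M]


def p_distance(a, b, p):
    """Distance induced by the generalized p-norm; p = math.inf is the max-norm."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if p == math.inf:
        return max(diffs)
    return sum(t ** p for t in diffs) ** (1.0 / p)


def riemann_distance(a, b, g):
    """sqrt((a - b)^T g (a - b)) for a constant metric g."""
    v = [ai - bi for ai, bi in zip(a, b)]
    n = len(v)
    return math.sqrt(sum(v[i] * g[i][j] * v[j] for i in range(n) for j in range(n)))


def distance_matrix(points, dist):
    """d[i][j] = dist(points[i], points[j])."""
    return [[dist(a, b) for b in points] for a in points]


def is_symmetric_positive_definite(g, tol=1e-12):
    """Check symmetry, then attempt a Cholesky factorisation g = L L^T."""
    n = len(g)
    if any(abs(g[i][j] - g[j][i]) > tol for i in range(n) for j in range(n)):
        return False
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = g[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False  # non-positive pivot: not positive definite
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


# Hypothetical stand-in points; the exercise uses the ten rows of DAMlex3.mat.
points = [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0), (1.0, 2.0, -1.0, 0.5)]

d_euclid = distance_matrix(points, lambda a, b: p_distance(a, b, 2))
d_max = distance_matrix(points, lambda a, b: p_distance(a, b, math.inf))
d_p4 = distance_matrix(points, lambda a, b: p_distance(a, b, 4))
d_riemann = distance_matrix(points, lambda a, b: riemann_distance(a, b, G))
```

In Matlab the positive-definiteness check would more naturally use `chol` or `eig`; the hand-written Cholesky loop is here only to keep the sketch free of dependencies.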
6. Consider the dataset DAMlex1.mat. Load this file in Matlab with:
load 'DAMlex1.mat' -ascii
a. Determine the two principal axes a1 and a2. Do this by applying the Matlab source pca.m available on the web. What fraction of the variance in the data is explained by these two
components?
b. The algorithm in pca.m projects the dataset onto the plane spanned by the two
principal axes a1 and a2. Plot the dataset in this plane.
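Since pca.m is not reproduced here, the following plain-Python sketch shows one common way to obtain two principal axes and the explained-variance fraction: take the eigenvectors of the sample covariance matrix, found here by power iteration with deflation. It is a stand-in under my own assumptions, not the course's pca.m. The function names, the starting vector, the iteration count and the five sample rows are all hypothetical.

```python
import math


def covariance(rows):
    """Centre the data and return (centred rows, sample covariance matrix)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return centered, cov


def top_eigenpair(a, iters=500):
    """Dominant eigenpair of a symmetric positive semi-definite matrix (power iteration).

    The fixed starting vector must not be orthogonal to the dominant eigenvector.
    """
    d = len(a)
    v = [1.0 / (k + 1) for k in range(d)]
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0.0:
            break
        v = [t / norm for t in w]
    norm = math.sqrt(sum(t * t for t in v))
    v = [t / norm for t in v]
    lam = sum(v[i] * a[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v


def pca_two_axes(rows):
    """Return (a1, a2, explained variance fraction, 2D scores)."""
    centered, cov = covariance(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))  # total variance = trace
    lam1, a1 = top_eigenpair(cov)
    # Deflation: remove the first component so the second becomes dominant.
    deflated = [[cov[i][j] - lam1 * a1[i] * a1[j] for j in range(d)] for i in range(d)]
    lam2, a2 = top_eigenpair(deflated)
    scores = [(sum(c[j] * a1[j] for j in range(d)),
               sum(c[j] * a2[j] for j in range(d))) for c in centered]
    return a1, a2, (lam1 + lam2) / total, scores


# Hypothetical sample: five 3D points that lie in a plane.
rows = [(-2.0, -2.0, 0.5), (-1.0, -1.0, -0.5), (0.0, 0.0, 0.3),
        (1.0, 1.0, -0.3), (2.0, 2.0, 0.0)]
a1, a2, fraction, scores = pca_two_axes(rows)
```

The pairs in `scores` are the coordinates of each point in the plane spanned by a1 and a2, which is what part b asks to plot.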
7. Consider the dataset DAMlex2.mat. This contains the distances d between ten unknown
points $\mathbf{x}_i$, where d(i,j) is the Euclidean distance between the points $\mathbf{x}_i$ and $\mathbf{x}_j$. The
objective is to compute or estimate the points $X = \{\mathbf{x}_1^T, \mathbf{x}_2^T, \ldots, \mathbf{x}_{10}^T\}$ from this
distance matrix d.
a. Implement the algorithm described in Hand et al., equations 3.14 and 3.15,
section 3.7, pp. 84-86, for computing the matrix $B = XX^T$.
b. Show that the diagonal components of B are the squared Euclidean lengths of the
ten sought points.
(Footnote 1: A D×D matrix with D > 3 is often called a tensor.)
c. Suppose that the points lie in a plane, i.e. $\mathbf{x}_i \in \mathbb{R}^2$. Argue that one can choose one
arbitrary point to define the first coordinate axis (e.g. the x-axis). Call the selected
index i*, i.e. the selected point is $\mathbf{x}_{i^*}$.
d. Argue that $B_{ij}$ is the inner product between the points $\mathbf{x}_i$ and $\mathbf{x}_j$. Argue that using $B_{jj}$
and $B_{ij}$ we can determine the length of $\mathbf{x}_j$ and its angle with the x-axis. Argue that in
general this allows for two solutions for $\mathbf{x}_j$.
e. Now select another point, $\mathbf{x}_{j^*}$, and use it to define the second axis, the y-axis. Show
that, if the underlying set X were truly 2D, using i* and j* we can compute all the
coordinates $\mathbf{x}_i$ from B(i,i), B(i,i*), and B(i,j*).
f. Determine in this way the ten 2D coordinates X for the distance matrix d.
g. What will happen if the underlying set X is not truly 2D?
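The steps of exercise 7 can be sketched in plain Python as follows. I assume that equations 3.14 and 3.15 in Hand et al. amount to the standard double-centring step of classical multidimensional scaling, which puts the origin at the centroid of the unknown points; the source does not reproduce those equations, so this may differ in detail. The function names and the five test points are hypothetical.

```python
import math


def gram_from_distances(d):
    """Double-centre the squared distances to get B = X X^T (origin at the centroid)."""
    n = len(d)
    d2 = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(d2[i]) / n for i in range(n)]  # row means (= column means, d symmetric)
    tot = sum(row) / n                        # grand mean
    return [[-0.5 * (d2[i][j] - row[i] - row[j] + tot) for j in range(n)]
            for i in range(n)]


def coords_from_gram(b, i_star, j_star):
    """Recover 2D coordinates with x_{i*} along the x-axis.

    x_{j*} is given a positive y-coordinate, which resolves the mirror ambiguity.
    Assumes x_{i*} is not at the origin and x_{j*} is not on the x-axis.
    """
    n = len(b)
    len_i = math.sqrt(b[i_star][i_star])
    xs = [b[k][i_star] / len_i for k in range(n)]  # projection onto the x-axis
    y_j = math.sqrt(max(b[j_star][j_star] - xs[j_star] ** 2, 0.0))
    if y_j < 1e-12:
        raise ValueError("x_{j*} lies on the x-axis; choose another j*")
    ys = [(b[k][j_star] - xs[k] * xs[j_star]) / y_j for k in range(n)]
    return list(zip(xs, ys))


# Hypothetical 2D test points, used only to build a distance matrix to invert.
true_pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (1.0, 1.0), (2.0, 5.0)]
d = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in true_pts] for p in true_pts]

B = gram_from_distances(d)
recovered = coords_from_gram(B, i_star=0, j_star=1)
```

The recovered coordinates match the original points only up to a translation, a rotation and possibly a reflection, so it is the pairwise distances, not the coordinates themselves, that should be compared.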