Similarity-distance

Playing around with Iris

We will use Iris in class to practice some attribute transformations and computing similarities.

In [3]: import matplotlib.pyplot as plt
        from sklearn import datasets

        # import some data to play with
        iris = datasets.load_iris()
        X = iris.data[:, 2:4]  # we only take petal length and petal width.
        Y = iris.target

        plt.figure(2, figsize=(8, 6))
        plt.clf()

        # Plot the training points
        plt.scatter(X[:, 0], X[:, 1], c=Y)
        plt.xlabel('Petal length')
        plt.ylabel('Petal width')
        plt.show()

[Scatter plot of petal length vs. petal width, coloured by class.]

Let's import numpy; we'll use it to get various statistics. Also, let's peek at the data.

In [4]: import numpy as np

        A = iris.data
        a = A[0, :]
        b = A[-1, :]
        print(a, b)

[5.1 3.5 1.4 0.2] [5.9 3.  5.1 1.8]

In [5]: c = np.log(a)

In [6]: d = np.abs(c)
        print(d)

[1.62924054 1.25276297 0.33647224 1.60943791]

You can also get the min and max of each attribute with numpy.

In [7]: for i in range(A.shape[1]):
            print(np.min(A[:, i]), np.max(A[:, i]))

4.3 7.9
2.0 4.4
1.0 6.9
0.1 2.5

TODO: Standardize the data by subtracting the mean and dividing by the standard deviation.

In [8]: # note: np.mean(A) and np.std(A) are computed over the whole array, not per attribute
        standardised = (A - np.mean(A)) / np.std(A)

TODO: Compute the Euclidean distance between every pair of points and print the pair with the smallest distance.

In [9]: from sklearn.metrics.pairwise import euclidean_distances

        dist = euclidean_distances(standardised)
        print(dist)

[[0.         0.27282639 0.25832953 ... 2.2594606  2.35621893 2.09745567]
 [0.27282639 0.         0.15198777 ... 2.27925353 2.39028652 2.10417536]
 [0.25832953 0.15198777 0.         ... 2.3616593  2.45648262 2.17790216]
 ...
 [2.2594606  2.27925353 2.3616593  ... 0.         0.31230517 0.32439885]
 [2.35621893 2.39028652 2.45648262 ... 0.31230517 0.         0.38914673]
 [2.09745567 2.10417536 2.17790216 ... 0.32439885 0.38914673 0.        ]]

In [19]: # replace the zero diagonal with the maximum distance so that
         # self-distances are not picked up as the minimum
         for i in range(A.shape[0]):
             dist[i, i] = np.max(dist)
         print(dist)

[[3.58954366 0.27282639 0.25832953 ... 2.2594606  2.35621893 2.09745567]
 [0.27282639 3.58954366 0.15198777 ... 2.27925353 2.39028652 2.10417536]
 [0.25832953 0.15198777 3.58954366 ... 2.3616593  2.45648262 2.17790216]
 ...
 [2.2594606  2.27925353 2.3616593  ... 3.58954366 0.31230517 0.32439885]
 [2.35621893 2.39028652 2.45648262 ... 0.31230517 3.58954366 0.38914673]
 [2.09745567 2.10417536 2.17790216 ... 0.32439885 0.38914673 3.58954366]]

In [22]: print(np.min(dist))

0.0

There are duplicate data points in the dataset, so the minimum distance is 0.

In [21]: # indices where the distance is exactly 0, i.e. the duplicate rows
         indices = list(zip(*np.where(dist == 0.0)))
         print(indices)
         for index in indices:
             print("pair of points with minimum distance", A[index[0], :], A[index[1], :])

[(101, 142), (142, 101)]
pair of points with minimum distance [5.8 2.7 5.1 1.9] [5.8 2.7 5.1 1.9]
pair of points with minimum distance [5.8 2.7 5.1 1.9] [5.8 2.7 5.1 1.9]
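A side note on the standardization in In [8]: np.mean(A) and np.std(A) are computed over the whole array, so all four attributes are shifted and scaled by the same global values, and the distances printed above reflect that. The more usual choice is to standardize each attribute separately with axis=0, which is what sklearn's StandardScaler does. The cell below is a minimal sketch added here for illustration, not part of the original session; the variable names are placeholders. Per-attribute scaling stops the wide-range attributes (petal length runs 1.0 to 6.9) from dominating the narrow-range ones (petal width runs 0.1 to 2.5), and the duplicate rows at indices 101 and 142 still end up at distance 0 either way.

In [ ]: import numpy as np
        from sklearn import datasets
        from sklearn.metrics.pairwise import euclidean_distances

        A = datasets.load_iris().data

        # Standardize per attribute: subtract each column's mean and divide
        # by each column's standard deviation (axis=0 works column-wise).
        standardised_cols = (A - np.mean(A, axis=0)) / np.std(A, axis=0)

        # Sanity check: every column should now have mean ~0 and std ~1.
        print(np.round(standardised_cols.mean(axis=0), 6))
        print(np.round(standardised_cols.std(axis=0), 6))

        dist = euclidean_distances(standardised_cols)

        # Mask the zero diagonal, then look up the closest distinct pair.
        np.fill_diagonal(dist, np.inf)
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        print(i, j, dist[i, j])
        print(A[i, :], A[j, :])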