
Clustering, Machine Learning

Distances
Euclidean: straight-line (L2) distance, sqrt(Σ (x_i - y_i)^2)
Manhattan: city-block (L1) distance, Σ |x_i - y_i| (see the SciPy sketch after the cosine notes)
Cosine Distance:
This metric is used when the orientation of the vectors matters rather than their magnitude. A cosine similarity of 1 means the vectors point in the same direction (the documents/data points are similar); 0 means the vectors are orthogonal (unrelated); -1 means they point in opposite directions (no similarity). Cosine distance is defined as 1 - cosine similarity.
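As a quick illustration (not from the source notes), SciPy provides built-in implementations of all three metrics above; a minimal sketch with made-up vectors:

```python
# Hedged sketch: SciPy's implementations of the three metrics above.
from scipy.spatial.distance import euclidean, cityblock, cosine

u, v = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(euclidean(u, v))  # L2: sqrt(sum((u_i - v_i)**2)) ~= 5.196
print(cityblock(u, v))  # L1 / Manhattan: sum(|u_i - v_i|) = 9.0
print(cosine(u, v))     # cosine *distance* = 1 - cosine similarity ~= 0.0253
```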
Mahalanobis Distance:
Mahalanobis distance measures the distance between two data points in a multivariate space, or between a point P and a distribution D: the idea is to measure how many standard deviations away P is from the mean of D. It is often used to measure the strength/similarity between two different data objects:
d(P, D) = sqrt((P - μ)^T S^-1 (P - μ)), where μ is the mean of D and S is the covariance matrix.
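A minimal sketch of the point-to-distribution case using scipy.spatial.distance.mahalanobis, which takes the inverse covariance matrix; the data and parameter values below are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
# Synthetic 2-D distribution D with correlated features (illustrative).
D = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
mu = D.mean(axis=0)                 # mean of D
S_inv = np.linalg.inv(np.cov(D.T))  # inverse of the covariance matrix S
P = np.array([3.0, 1.0])
# "How many standard deviations away" P is from the mean of D.
print(mahalanobis(P, mu, S_inv))
```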
Strings: Hamming distance: the number of positions at which the corresponding characters differ (defined only for strings of equal length).
Levenshtein distance:
the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other:
kitten vs sitting: 3 edits (kitten → sitten → sittin → sitting)
https://en.wikipedia.org/wiki/Levenshtein_distance
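Both string metrics are easy to implement directly; a minimal sketch using the standard dynamic-programming Levenshtein recurrence (my own illustration, not code from the source):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic DP over prefixes: prev[j] = distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def hamming(a: str, b: str) -> int:
    # Only defined for equal-length strings.
    assert len(a) == len(b), "Hamming distance needs equal-length strings"
    return sum(x != y for x, y in zip(a, b))

print(levenshtein("kitten", "sitting"))  # 3
```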
K-means
Pros (+):
Simple: easy to understand and to implement.
Efficient: time complexity O(tkn), where t = iterations, k = clusters, n = data points.
The most widely used clustering algorithm.
Cons (-):
Only applicable when a mean is defined; for categorical data, use k-modes, where the centroid is the vector of the most frequent values.
The user needs to specify k.
Sensitive to outliers.
Not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
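A minimal scikit-learn sketch on synthetic data; the blob data and parameter values are illustrative assumptions, not from the source:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three Gaussian blobs, so the "true" k is known to be 3.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # the k centroids (means), hence "k-means"
print(km.labels_[:10])      # cluster index assigned to the first 10 points
```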
DBSCAN
(density-based clustering)
https://en.wikipedia.org/wiki/DBSCAN
DBSCAN does not require one to specify the number of clusters in advance.
DBSCAN can find arbitrarily shaped clusters.
DBSCAN has a notion of noise, and is robust to outliers.
The quality of DBSCAN depends on the distance metric used.
DBSCAN cannot cluster data sets well when they contain large differences in densities.
Not entirely deterministic: border points that are reachable from more than one cluster can end up in either cluster.
Parameters:
MinPts: the minimum number of points required to form a dense region.
Epsilon (ε): the radius of the neighborhood around each point.
Distance function.
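A minimal scikit-learn sketch on the classic two-moons data, a non-convex shape k-means cannot separate but DBSCAN can; the eps and min_samples values are illustrative assumptions:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: arbitrarily shaped, non-convex clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; noise points are labeled -1
```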
Hierarchical: builds a tree of clusters (a dendrogram), either agglomeratively (merging from single points upward) or divisively (splitting from the top down).
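The notes do not elaborate here; as an assumed illustration, scikit-learn's agglomerative (bottom-up) variant:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)
# Ward linkage merges the pair of clusters that least increases
# within-cluster variance at each step of the bottom-up tree.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
```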
K-SOM
Kohonen Self-Organizing Maps (K-SOM): a competitive-learning, unsupervised neural network model commonly used for clustering high-dimensional data.
SOMs' distinct property is that they can map high-dimensional input vectors onto spaces with fewer dimensions while preserving the dataset's original topology.
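A from-scratch sketch of the competitive-learning loop (my own minimal illustration, not the course's code): each input pulls its best-matching unit and that unit's grid neighbors toward itself, which is what preserves topology on the low-dimensional grid:

```python
import numpy as np

def train_som(data, grid=(8, 8), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    n_rows, n_cols = grid
    dim = data.shape[1]
    # One weight vector per grid node, randomly initialized.
    weights = rng.random((n_rows, n_cols, dim))
    # Grid coordinates of every node, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(n_rows), np.arange(n_cols),
                                  indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1 - frac)                # decaying learning rate
        sigma = sigma0 * (1 - frac) + 1e-3   # shrinking neighborhood radius
        x = data[rng.integers(len(data))]    # random training sample
        # Best-matching unit (BMU): node whose weights are closest to x.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        # Gaussian neighborhood around the BMU, measured on the grid.
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-grid_d2 / (2 * sigma ** 2))
        # Pull every node's weights toward x, scaled by the neighborhood.
        weights += lr * h[..., None] * (x - weights)
    return weights
```

After training, each input is mapped to the grid position of its BMU, so nearby inputs land on nearby nodes of the 2-D map.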
Sources:
1. Center for Brains, Minds & Machines, MIT 9.54, Fall 2014
2. https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/
3. https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
4. https://calculatedcontent.com/2012/10/09/spectral-clustering/
5. https://ogrisel.github.io/scikit-learn.org/sklearntutorial/auto_examples/cluster/plot_cluster_comparison.html
6. https://towardsdatascience.com/17-clustering-algorithms-used-in-data-science-mining-49dbfa5bf69a
7. https://en.wikipedia.org/wiki/DBSCAN
8. https://dzone.com/articles/k-means-and-som-gentle-introduction-to-worlds-most