Distances

Euclidean distance: the straight-line (L2) distance between two points, i.e. the square root of the sum of squared coordinate differences.

Manhattan distance: the L1 distance, i.e. the sum of the absolute coordinate differences along each axis.

Cosine distance: used when the orientation of the vectors matters rather than their magnitude. A cosine similarity of 1 means the vectors point in the same direction, i.e. the documents/data points are similar; 0 means the vectors are orthogonal, i.e. unrelated (no similarity); -1 means the vectors point in opposite directions (maximally dissimilar). Cosine distance is 1 minus cosine similarity.

Mahalanobis distance: used for calculating the distance between two data points in a multivariate space. More precisely, it measures the distance between a point P and a distribution D: how many standard deviations away P is from the mean of D. With mu the mean of D and S the covariance matrix:

D_M(P) = sqrt((P - mu)^T S^(-1) (P - mu))

Strings:

Hamming distance: the number of positions at which the corresponding characters differ (defined only for strings of the same length).

Levenshtein distance: the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, e.g. "kitten" vs "sitting" = 3. https://en.wikipedia.org/wiki/Levenshtein_distance

Similarity / clustering:

K-means (most used)
Strengths:
- Simple: easy to understand and to implement
- Efficient: time complexity O(tkn), with t iterations, k clusters, n points
Weaknesses:
- Only applicable when a mean is defined; for categorical data use k-modes, where the centroid is the vector of the most frequent values
- The user needs to specify k
- Sensitive to outliers
- Not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres)

DBSCAN (density-based clustering) https://en.wikipedia.org/wiki/DBSCAN
- Does not require one to specify the number of clusters in advance
- Can find arbitrarily-shaped clusters
- Has a notion of noise, and is robust to outliers
- Its quality depends on the distance metric used
- Cannot cluster data sets well when they contain large differences in densities
- Not entirely deterministic: border points that are reachable from more than one cluster can end up in either
Parameters: min points (minPts), epsilon, distance function

Hierarchical: builds a tree (dendrogram) of nested clusters, either agglomerative (bottom-up, repeatedly merging the closest clusters) or divisive (top-down, recursively splitting).

K-SOM: Kohonen Self-Organizing Maps, a competitive-learning, unsupervised neural network model commonly used for clustering high-dimensional data. SOMs' distinct property is that they map high-dimensional input vectors onto a lower-dimensional space while preserving the dataset's original topology.

Short code sketches for the metrics and algorithms above follow the Sources list.

Sources:
1. Center for Brains, Minds & Machines, 9.54, Fall 2014
2. https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/
3. https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modellinge51395ffe60d
4. https://calculatedcontent.com/2012/10/09/spectral-clustering/
5. https://ogrisel.github.io/scikit-learn.org/sklearntutorial/auto_examples/cluster/plot_cluster_comparison.html
6. https://towardsdatascience.com/17-clustering-algorithms-used-in-data-science-mining-49dbfa5bf69a
7. https://en.wikipedia.org/wiki/DBSCAN
8. https://dzone.com/articles/k-means-and-som-gentle-introduction-to-worlds-most
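A minimal NumPy sketch of the three vector distances above; the vectors a and b are made-up examples:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2: straight-line distance
manhattan = np.sum(np.abs(a - b))           # L1: sum of absolute differences
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1.0 - cos_sim                    # orientation only, magnitude ignored
print(euclidean, manhattan, cos_sim, cos_dist)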
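A Mahalanobis sketch following the formula above, also with NumPy; the sampled distribution D and the point P are arbitrary illustrations:

import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 3))           # sample standing in for the distribution D
mu = D.mean(axis=0)                     # mean of D
S = np.cov(D, rowvar=False)             # covariance matrix S
diff = np.array([1.0, 2.0, 0.5]) - mu   # point P relative to the mean
d_m = np.sqrt(diff @ np.linalg.inv(S) @ diff)  # std deviations away from the mean of D
print(d_m)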
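Plain-Python sketches of the two string distances; the dynamic-programming version of Levenshtein is the textbook one and reproduces the kitten/sitting example:

def hamming(s, t):
    # only defined for strings of the same length
    if len(s) != len(t):
        raise ValueError("equal-length strings required")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

def levenshtein(s, t):
    # row-by-row dynamic programming over edit costs
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[-1]

print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3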
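A k-means sketch; scikit-learn is an assumption (the notes name no library), and make_blobs only supplies toy data, with k = 3 chosen to match it:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)  # the user must specify k
print(km.cluster_centers_)  # the k centroids (means)
print(km.labels_[:10])      # cluster assignment per point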
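A DBSCAN sketch, again assuming scikit-learn; make_moons produces the kind of arbitrarily-shaped clusters k-means misses, and the eps/min_samples values are illustrative, not tuned:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
# the three parameters from the notes: epsilon, min points, distance function
db = DBSCAN(eps=0.2, min_samples=5, metric="euclidean").fit(X)
print(set(db.labels_))  # the label -1 marks noise points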
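A hierarchical (agglomerative) sketch assuming SciPy; Ward linkage and the cut into 3 clusters are illustrative choices:

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
Z = linkage(X, method="ward")                    # bottom-up merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)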
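A bare-bones K-SOM sketch in NumPy showing the competitive-learning update (find the best-matching unit, then pull it and its grid neighbours toward the input); the 10x10 grid and the decay schedules are arbitrary choices, not from the notes:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 3))                 # higher-dimensional inputs
grid_h, grid_w = 10, 10                  # 2-D map: the lower-dimensional space
W = rng.random((grid_h, grid_w, 3))      # one weight vector per map node
rows, cols = np.indices((grid_h, grid_w))

n_steps = 2000
lr0, sigma0 = 0.5, 3.0
for t in range(n_steps):
    x = X[rng.integers(len(X))]
    # competitive step: the node whose weights are closest to x wins
    bmu = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=2)), (grid_h, grid_w))
    # learning rate and neighbourhood radius shrink over time
    lr = lr0 * np.exp(-t / n_steps)
    sigma = sigma0 * np.exp(-t / n_steps)
    # Gaussian neighbourhood on the grid is what preserves the input topology
    h = np.exp(-((rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2) / (2 * sigma ** 2))
    W += lr * h[:, :, None] * (x - W)

print(W.reshape(-1, 3)[:5])  # trained node weights after the mapping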