
Clustering, Machine Learning

Distances
Euclidean: straight-line (L2) distance, sqrt(Σ (x_i - y_i)^2)
Manhattan: city-block (L1) distance, Σ |x_i - y_i| (see the SciPy sketch after the cosine notes)
Cosine Distance:
This metric is used when the orientation of the vectors matters rather than their magnitude. A cosine similarity of 1 means the vectors point in the same direction (the documents/data points are similar); 0 means the vectors are orthogonal (unrelated); -1 means they point in opposite directions (no similarity). Cosine distance is defined as 1 - cosine similarity.
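As a quick illustration (not from the source notes), SciPy provides built-in implementations of all three metrics above; a minimal sketch with made-up vectors:

```python
# Hedged sketch: SciPy's implementations of the three metrics above.
from scipy.spatial.distance import euclidean, cityblock, cosine

u, v = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(euclidean(u, v))  # L2: sqrt(sum((u_i - v_i)**2)) ~= 5.196
print(cityblock(u, v))  # L1 / Manhattan: sum(|u_i - v_i|) = 9.0
print(cosine(u, v))     # cosine *distance* = 1 - cosine similarity ~= 0.0253
```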
Mahalanobis Distance:
Mahalanobis distance measures the distance between two data points in a multivariate space, or between a point P and a distribution D: the idea is to measure how many standard deviations away P is from the mean of D. It is often used to measure the strength/similarity between two different data objects:
d(P, D) = sqrt((P - μ)^T S^-1 (P - μ)), where μ is the mean of D and S is the covariance matrix.
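A minimal sketch of the point-to-distribution case using scipy.spatial.distance.mahalanobis, which takes the inverse covariance matrix; the data and parameter values below are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
# Synthetic 2-D distribution D with correlated features (illustrative).
D = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
mu = D.mean(axis=0)                 # mean of D
S_inv = np.linalg.inv(np.cov(D.T))  # inverse of the covariance matrix S
P = np.array([3.0, 1.0])
# "How many standard deviations away" P is from the mean of D.
print(mahalanobis(P, mu, S_inv))
```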
Strings: Hamming distance: the number of positions at which the corresponding characters differ (defined only for strings of equal length).
Levenshtein distance:
the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other:
kitten vs sitting: 3 edits (kitten → sitten → sittin → sitting)
https://en.wikipedia.org/wiki/Levenshtein_distance
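Both string metrics are easy to implement directly; a minimal sketch using the standard dynamic-programming Levenshtein recurrence (my own illustration, not code from the source):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic DP over prefixes: prev[j] = distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def hamming(a: str, b: str) -> int:
    # Only defined for equal-length strings.
    assert len(a) == len(b), "Hamming distance needs equal-length strings"
    return sum(x != y for x, y in zip(a, b))

print(levenshtein("kitten", "sitting"))  # 3
```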
K-means
Pros (+):
Simple: easy to understand and to implement.
Efficient: time complexity O(tkn), where t = iterations, k = clusters, n = data points.
The most widely used clustering algorithm.
Cons (-):
Only applicable when a mean is defined; for categorical data, use k-modes, where the centroid is the vector of the most frequent values.
The user needs to specify k.
Sensitive to outliers.
Not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
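A minimal scikit-learn sketch on synthetic data; the blob data and parameter values are illustrative assumptions, not from the source:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three Gaussian blobs, so the "true" k is known to be 3.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # the k centroids (means), hence "k-means"
print(km.labels_[:10])      # cluster index assigned to the first 10 points
```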
DBSCAN
(density-based clustering)
https://en.wikipedia.org/wiki/DBSCAN
DBSCAN does not require one to specify the number of clusters in advance.
DBSCAN can find arbitrarily shaped clusters.
DBSCAN has a notion of noise, and is robust to outliers.
The quality of DBSCAN depends on the distance metric used.
DBSCAN cannot cluster data sets well when they contain large differences in densities.
Not entirely deterministic: border points that are reachable from more than one cluster can end up in either cluster.
Parameters:
MinPts: the minimum number of points required to form a dense region.
Epsilon (ε): the radius of the neighborhood around each point.
Distance function.
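A minimal scikit-learn sketch on the classic two-moons data, a non-convex shape k-means cannot separate but DBSCAN can; the eps and min_samples values are illustrative assumptions:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: arbitrarily shaped, non-convex clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; noise points are labeled -1
```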
Hierarchical: builds a tree of clusters (a dendrogram), either agglomeratively (merging from single points upward) or divisively (splitting from the top down).
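The notes do not elaborate here; as an assumed illustration, scikit-learn's agglomerative (bottom-up) variant:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)
# Ward linkage merges the pair of clusters that least increases
# within-cluster variance at each step of the bottom-up tree.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
```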
K-SOM
Kohonen Self-Organizing Maps (K-SOM): a competitive-learning, unsupervised neural network model commonly used for clustering high-dimensional data.
SOMs' distinct property is that they can map high-dimensional input vectors onto spaces with fewer dimensions while preserving the dataset's original topology.
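A from-scratch sketch of the competitive-learning loop (my own minimal illustration, not the course's code): each input pulls its best-matching unit and that unit's grid neighbors toward itself, which is what preserves topology on the low-dimensional grid:

```python
import numpy as np

def train_som(data, grid=(8, 8), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    n_rows, n_cols = grid
    dim = data.shape[1]
    # One weight vector per grid node, randomly initialized.
    weights = rng.random((n_rows, n_cols, dim))
    # Grid coordinates of every node, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(n_rows), np.arange(n_cols),
                                  indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1 - frac)                # decaying learning rate
        sigma = sigma0 * (1 - frac) + 1e-3   # shrinking neighborhood radius
        x = data[rng.integers(len(data))]    # random training sample
        # Best-matching unit (BMU): node whose weights are closest to x.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        # Gaussian neighborhood around the BMU, measured on the grid.
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-grid_d2 / (2 * sigma ** 2))
        # Pull every node's weights toward x, scaled by the neighborhood.
        weights += lr * h[..., None] * (x - weights)
    return weights
```

After training, each input is mapped to the grid position of its BMU, so nearby inputs land on nearby nodes of the 2-D map.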
Sources:
1. Center for Brains, Minds & Machines, MIT 9.54, Fall 2014
2. https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/
3. https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
4. https://calculatedcontent.com/2012/10/09/spectral-clustering/
5. https://ogrisel.github.io/scikit-learn.org/sklearntutorial/auto_examples/cluster/plot_cluster_comparison.html
6. https://towardsdatascience.com/17-clustering-algorithms-used-in-data-science-mining-49dbfa5bf69a
7. https://en.wikipedia.org/wiki/DBSCAN
8. https://dzone.com/articles/k-means-and-som-gentle-introduction-to-worlds-most