Uploaded by Niko Stark

COMPSCI 753 Notes

Data Stream
Find similar items

Jaccard Similarity
Shingling -> MinHash -> LSH
MinHash
o Random permutations
o One-pass MinHash
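A minimal sketch of one-pass MinHash, assuming a hypothetical hash family h(x) = (a*x + b) mod p stands in for the random permutations. The names `minhash_signature` and `estimate_jaccard`, the CRC32 item encoding, and the default of 64 hash functions are illustrative choices, not taken from the notes.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2^31 - 1, used as the modulus of the hash family

def minhash_signature(items, num_hashes=64, seed=0):
    """Scan the items once, keeping the minimum value of each hash function."""
    rng = random.Random(seed)
    # Assumed hash family: h(x) = (a*x + b) mod PRIME, one (a, b) pair per row.
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(num_hashes)]
    signature = [PRIME] * num_hashes  # larger than any possible hash value
    for item in items:
        x = zlib.crc32(str(item).encode("utf-8")) % PRIME
        for i, (a, b) in enumerate(params):
            value = (a * x + b) % PRIME
            if value < signature[i]:
                signature[i] = value
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature rows on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Both signatures must be built with the same seed and number of hash functions for the comparison to be meaningful.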
Sampling data stream

Fixed proportion

Fixed size
o Reservoir Sampling
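A minimal sketch of reservoir sampling for a fixed-size sample, assuming the stream is any Python iterable. The function name `reservoir_sample` and the optional `rng` parameter are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):  # i counts items seen so far, 0-based
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # uniform integer in [0, i]
            if j < k:                   # happens with probability k / (i + 1)
                reservoir[j] = item
    return reservoir
```

Each arriving item replaces a random slot with probability k/(i+1), which keeps every item seen so far equally likely to be in the sample.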
Find frequent items

[Deterministic] Misra-Gries
o At most (m - m')/(k + 1) decrement steps (m = stream length, m' = sum of the k counters at the end)
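A minimal sketch of Misra-Gries with k counters, written to make the decrement bound visible. The function name `misra_gries` and the returned decrement count are illustrative additions; the sketch assumes one decrement step lowers all k counters and discards the arriving item.

```python
def misra_gries(stream, k):
    """Return (counters, number of decrement steps) using at most k counters."""
    counters = {}
    decrements = 0
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Decrement step: the new item is dropped and all k counters
            # lose one, so the counter sum falls k + 1 behind the stream.
            decrements += 1
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, decrements
```

Under these assumptions every decrement step accounts for k + 1 stream items, so any item occurring more than m/(k + 1) times still has a positive counter at the end.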

[Randomized] CountMin Sketch
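A minimal sketch of a CountMin sketch, assuming salted SHA-256 digests act as the independent hash functions, one per row. The class name and the `width` and `depth` parameters are illustrative.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Assumed hash: SHA-256 salted with the row number.
        data = row.to_bytes(4, "big") + str(item).encode("utf-8")
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```

The estimate never undercounts; it can overcount when other items collide with the queried item in every row.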
Filtering data stream

Bloom Filter
o 1 hash function
o k hash functions
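A minimal sketch of a Bloom filter with k hash functions, assuming salted SHA-256 digests as the hash family and one byte per bit for readability. The class name and the parameters `num_bits` and `num_hashes` are illustrative; setting `num_hashes=1` gives the single-hash variant.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        data = str(item).encode("utf-8")
        for i in range(self.num_hashes):
            # Assumed hash family: SHA-256 salted with the function index.
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```

Membership tests can return false positives but never false negatives, which is why the filter suits a pre-check in front of an exact lookup.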
Locality Sensitive Hashing (LSH)

(r, c, p1, p2)-sensitive

Jaccard similarity/distance
o MinHash
Cosine similarity/distance
o SimHash

Graph
Link Analysis

TF.IDF
Term Spam
PageRank
o Dead ends
   - Recursively remove
   - Taxation
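A minimal sketch of PageRank with taxation by power iteration. It assumes the graph is a dict mapping each node to a list of out-links, and it handles dead ends by spreading all leaked rank uniformly rather than by recursive removal; that choice, the name `pagerank`, and the default `beta=0.85` are assumptions, not taken from the notes.

```python
def pagerank(links, beta=0.85, iterations=100):
    """Power iteration; beta is the probability of following a link."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = beta * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
        # Leaked mass = teleport share (1 - beta) plus rank lost at dead ends.
        leaked = 1.0 - sum(new_rank.values())
        for u in nodes:
            new_rank[u] += leaked / n
        rank = new_rank
    return rank
```

Redistributing the leaked mass keeps the ranks summing to 1 even when some pages have no out-links.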
Biased PageRank
Link Spam (Spam Farm)
TrustRank
Spam Mass
Social Network Analysis

Small world property
o Power law degree distribution
Core-Periphery structure
Strength of ties
o Triadic closure
Clustering Coefficient
o Triangle enumeration
Community Detection
o K-core decomposition
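A minimal sketch of k-core decomposition by repeatedly peeling the minimum-degree node. It assumes an undirected graph stored as a dict of neighbour sets in which every node appears as a key; the name `core_numbers` is illustrative, and the simple min-scan is quadratic, chosen for clarity over speed.

```python
def core_numbers(adj):
    """Map each node to the largest k such that it belongs to the k-core."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        u = min(remaining, key=degree.__getitem__)
        k = max(k, degree[u])  # core numbers never decrease during peeling
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return core
```

For a given k, the k-core is the set of nodes whose core number is at least k.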
Influence Maximization
o Greedy-based
o Sketch-based
Join Analysis

Natural join

Semijoin
Multiway join
Acyclic join
o Yannakakis Algorithm

