Bayesian Hierarchical Clustering
Paper by K. Heller and Z. Ghahramani, ICML 2005
Presented by HAO-WEI, YEH

Outline
- Background - traditional methods
- Bayesian Hierarchical Clustering (BHC)
  - Basic ideas
  - Dirichlet Process Mixture Model (DPM)
  - Algorithm
- Experimental results
- Conclusion

Background: Traditional Methods

Hierarchical Clustering
- Given: data points
- Output: a tree (a series of nested clusters)
  - Leaves: data points
  - Internal nodes: nested clusters
- Examples:
  - Evolutionary tree of living organisms
  - Internet newsgroups
  - Newswire documents

Traditional Hierarchical Clustering
- Bottom-up agglomerative algorithm
- Closeness is based on a given distance measure (e.g. Euclidean distance between cluster means)

Traditional Hierarchical Clustering (cont'd)
Limitations:
- No guide to choosing the correct number of clusters, or where to prune the tree.
- Distance metric selection (especially for data such as images or sequences).
- Evaluation (no probabilistic model):
  - How do we evaluate how good the result is?
  - How do we compare it to other models?
  - How do we make predictions and cluster new data with an existing hierarchy?

BHC: Bayesian Hierarchical Clustering

Bayesian Hierarchical Clustering
Basic ideas:
- Use marginal likelihoods to decide which clusters to merge:
  P(data to merge were from the same mixture component) vs. P(data to merge were from different mixture components)
- Generative model: Dirichlet Process Mixture Model (DPM)

Dirichlet Process Mixture Model (DPM)
- Formal definition
- Different perspectives:
  - Infinite version of a mixture model (motivation and problems)
  - Stick-breaking process (how the generated distribution looks)
  - Chinese restaurant process, Polya urn scheme
- Benefits:
  - Conjugate prior
  - Unlimited number of clusters
- "Rich-get-richer" behaviour: does it really work? It depends! (Alternatives: Pitman-Yor process, Uniform Process, ...)

BHC Algorithm - Overview
- Same as the traditional method:
  - One-pass, bottom-up.
  - Initializes each data point in its own cluster, and iteratively merges pairs of clusters.
- Difference:
  - Uses a statistical hypothesis test to choose which clusters to merge.

BHC Algorithm - Concepts
Two hypotheses to compare:
1. All of the data was generated i.i.d. from the same probabilistic model with unknown parameters.
2. The data has two or more clusters in it.

Hypothesis H1
Probability of the data under H1:
  p(Dk | H1) = ∫ p(Dk | θ) p(θ | β) dθ
where p(θ | β) is the prior over the parameters and Dk is the data in the two trees to be merged.
The integral is tractable with a conjugate prior.

Hypothesis H2
Probability of the data under H2 is a product over the two sub-trees:
  p(Dk | H2) = p(Di | Ti) p(Dj | Tj)

BHC Algorithm - Working Flow
From Bayes' rule, the posterior probability of the merged hypothesis is
  rk = πk p(Dk | H1) / p(Dk | Tk),  where  p(Dk | Tk) = πk p(Dk | H1) + (1 - πk) p(Di | Ti) p(Dj | Tj).
The prior πk depends on the number of data points and the DPM concentration parameter; the likelihood terms depend on the hidden parameters of the underlying distribution.
The pair of trees with the highest rk is merged, and rk < 0.5 gives a natural place to cut the final tree. (A code sketch of this merge rule appears after the Predictive Distribution slide below.)

Tree-Consistent Partitions
Consider the example tree over {1,2,3,4} shown in the slides and all 15 possible partitions of {1,2,3,4}:
(1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4), (2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3), (1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4)
- (1 2)(3)(4) and (1 2 3)(4) are tree-consistent partitions.
- (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.

Merged Hypothesis Prior (πk)
Based on the DPM (CRP perspective), πk = P(all points in Dk belong to one cluster). The d terms accumulate the CRP prior mass over all tree-consistent partitions:
  dk = α Γ(nk) + d_left d_right,  πk = α Γ(nk) / dk,
with di = α and πi = 1 at the leaves.

Predictive Distribution
BHC also defines predictive distributions for new data points.
Note: the tree's predictive distribution p(x | D) is not equal to the single-cluster predictive p(x | Dk, H1), even at the root; it mixes over the clusters in the tree (weighted by the rk values) rather than assuming all the data came from one component.
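Where the slides describe the working flow in words, the sketch below illustrates one way to implement it in Python: the dk / πk recursion from the CRP prior, the merged-hypothesis posterior rk, and the greedy merge loop. It assumes a user-supplied log_ml(data) returning the log marginal likelihood p(Dk | H1) for a conjugate model (e.g. Normal-Inverse-Wishart for Gaussian data); the class and function names are illustrative, not taken from the paper or any reference implementation.

```python
# Minimal BHC sketch (assumes: data is an (n, d) NumPy array and log_ml(data)
# returns the conjugate log marginal likelihood p(Dk | H1) -- both are
# placeholders supplied by the user, not part of the original paper's code).
import itertools
import numpy as np
from scipy.special import gammaln


class Node:
    def __init__(self, points, alpha, log_ml, children=None):
        self.points = points                  # data rows in this subtree
        self.children = children              # None for a leaf
        n = len(points)
        if children is None:
            self.log_d = np.log(alpha)        # d_i = alpha at the leaves
            self.log_pi = 0.0                 # pi_i = 1 at the leaves
            self.log_p_tree = log_ml(points)  # p(D_i | T_i) = p(D_i | H1)
        else:
            left, right = children
            # d_k = alpha * Gamma(n_k) + d_left * d_right
            log_ag = np.log(alpha) + gammaln(n)
            self.log_d = np.logaddexp(log_ag, left.log_d + right.log_d)
            self.log_pi = log_ag - self.log_d            # pi_k = alpha*Gamma(n_k)/d_k
            log_h1 = log_ml(points)                       # p(D_k | H1)
            log_h2 = left.log_p_tree + right.log_p_tree   # p(D_k | H2)
            # p(D_k | T_k) = pi_k p(D_k|H1) + (1 - pi_k) p(D_i|T_i) p(D_j|T_j)
            self.log_p_tree = np.logaddexp(self.log_pi + log_h1,
                                           np.log1p(-np.exp(self.log_pi)) + log_h2)
            # r_k = pi_k p(D_k|H1) / p(D_k|T_k): posterior of the merged hypothesis
            self.log_r = self.log_pi + log_h1 - self.log_p_tree


def bhc(data, alpha, log_ml):
    """Greedy bottom-up BHC: repeatedly merge the pair with the highest r_k."""
    nodes = [Node(data[i:i + 1], alpha, log_ml) for i in range(len(data))]
    while len(nodes) > 1:
        best = None
        for i, j in itertools.combinations(range(len(nodes)), 2):
            cand = Node(np.vstack([nodes[i].points, nodes[j].points]),
                        alpha, log_ml, children=(nodes[i], nodes[j]))
            if best is None or cand.log_r > best[0].log_r:
                best = (cand, i, j)
        cand, i, j = best
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [cand]
    return nodes[0]   # root; cut the tree wherever log_r < log(0.5)
```

For brevity this sketch recomputes every candidate pair at each iteration; a practical implementation would cache the pairwise scores, in line with the O(n^2) tree-building cost noted under Limitations.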
Approximate Inference for the DPM
BHC forms a lower bound on the marginal likelihood of an infinite mixture model by efficiently summing over an exponentially large subset of all partitions.
Idea: deterministically sum over the partitions with high probability, thereby accounting for most of the mass. Compared to MCMC methods, this is deterministic and efficient.

Learning Hyperparameters
- α: concentration parameter of the DPM.
- β: hyperparameters defining the base measure G0.
- Learned by recursively computed gradients and an EM-like method.

To Sum Up for BHC
- A statistical model for comparing merges that decides when to stop.
- Defines predictive distributions for new data points.
- Approximate inference for the DPM marginal likelihood.
- Parameters: α (concentration) and β (defines G0).

Unique Aspects of the BHC Algorithm
- A hierarchical way of organizing nested clusters, not a hierarchical generative model.
- Derived from the DPM.
- Hypothesis test: one cluster vs. many other clusterings (compared to one vs. two clusters at each stage).
- Not iterative and does not require sampling (except for learning the hyperparameters).

Results from the experiments

Conclusion and some take-home notes

Conclusion
How BHC addresses the limitations of traditional hierarchical clustering:
- No guide to choosing the correct number of clusters, or where to prune the tree -> natural stopping criterion.
- Distance metric selection -> model-based criterion.
- Evaluation, comparison, and inference -> probabilistic model (plus some useful results for the DPM).

Summary
- Defines a probabilistic model of the data and can compute the probability of a new data point belonging to any cluster in the tree.
- Model-based criterion to decide on merging clusters.
- Bayesian hypothesis testing is used to decide which merges are advantageous, and to decide the appropriate depth of the tree.
- The algorithm can be interpreted as an approximate inference method for a DPM; it gives a new lower bound on the marginal likelihood by summing over exponentially many clusterings of the data.

Limitations
- Inherent greediness.
- No incorporation of tree uncertainty.
- O(n^2) complexity for building the tree.

References
Main paper:
- Bayesian Hierarchical Clustering, K. Heller and Z. Ghahramani, ICML 2005.
Thesis:
- Efficient Bayesian Methods for Clustering, Katherine Ann Heller.
Other references:
- Wikipedia
- Paper slides: www.ee.duke.edu/~lcarin/emag/.../DW_PD_100705.ppt
- http://cs.brown.edu/courses/csci2950-p/fall2011/lectures/2011-10-13_ghosh.pdf
- General ML: http://blog.echen.me/

References (cont'd)
DPM & nonparametric Bayes:
- http://nlp.stanford.edu/~grenager/papers/dp_2005_02_24.ppt
- https://www.cs.cmu.edu/~kbe/dp_tutorial.pdf
- http://www.iro.umontreal.ca/~lisa/seminaires/31-10-2006.pdf
- http://videolectures.net/mlss07_teh_dp/ , http://mlg.eng.cam.ac.uk/tutorials/07/ywt.pdf
- http://www.cns.nyu.edu/~eorhan/notes/dpmm.pdf (easy to read)
- http://mlg.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf
Heavier reading:
- http://stat.columbia.edu/~porbanz/reports/OrbanzTeh2010.pdf
- http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/dp.pdf
- http://www.stat.uchicago.edu/~pmcc/reports/clusters.pdf
Hierarchical DPM:
- http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf
Other methods:
- https://people.cs.umass.edu/~wallach/publications/wallach10alternative.pdf

Thank You for Your Attention!