COP5992 – DATA MINING TERM PROJECT
RANDOM SUBSPACE METHOD + CO-TRAINING
by SELIM KALAYCI

RANDOM SUBSPACE METHOD (RSM)
- Proposed by Ho, “The Random Subspace Method for Constructing Decision Forests”, 1998
- Another technique for combining weak classifiers, like Bagging and Boosting

RSM ALGORITHM
1. Repeat for b = 1, 2, ..., B:
   (a) Select an r-dimensional random subspace Xb from the original p-dimensional feature space X.
   (b) Construct a classifier Cb(x) in Xb.
2. Combine classifiers Cb(x), b = 1, 2, ..., B, by simple majority voting to a final decision rule.

MOTIVATION FOR RSM
- Redundancy in the data feature space
  - Completely redundant feature set
  - Redundancy is spread over many features
- Weak classifiers that have critical training sample sizes

RSM PERFORMANCE ISSUES
RSM performance depends on:
- Training sample size
- The choice of a base classifier
- The choice of combining rule (simple majority vs. weighted)
- The degree of redundancy of the dataset
- The number of features chosen

DECISION FORESTS (by Ho)
- A combination of trees instead of a single tree
- Assumption: the dataset has some redundant features
- Works efficiently with any decision tree algorithm and data splitting method
- Ideally, look for the best individual trees with the lowest tree similarity

UNLABELED DATA
- Small number of labeled documents
- Large pool of unlabeled documents
- How to classify unlabeled documents accurately?

EXPECTATION-MAXIMIZATION (E-M)

CO-TRAINING
- Blum and Mitchell, “Combining Labeled and Unlabeled Data with Co-Training”, 1998
- Requirements:
  - Two sufficiently strong feature sets
  - The two feature sets are conditionally independent given the class

CO-TRAINING

APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
Algorithm:
  Obtain a small set L of labeled examples
  Obtain a large set U of unlabeled examples
  Obtain two sets F1 and F2 of features that are sufficiently redundant
  While U is not empty do:
    Learn classifier C1 from L based on F1
    Learn classifier C2 from L based on F2
    For each classifier Ci do:
      Ci labels examples from U based on Fi
      Ci chooses the most confidently predicted examples E from U
      E is removed from U and added (with the labels given by Ci) to L
  End loop

THINGS TO DO
- How can we measure redundancy and use it efficiently?
- Can we improve Co-training?
- How can we apply RSM efficiently to:
  - Supervised learning
  - Semi-supervised learning
  - Unsupervised learning

QUESTIONS?
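
The RSM ALGORITHM steps listed earlier in these slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the project. The slides do not name a base classifier, so a simple nearest-mean rule stands in for the weak classifier. The function names (train_nearest_mean, rsm_fit, rsm_predict) and the toy data are hypothetical.

```python
import random
from collections import Counter

def train_nearest_mean(X, y, features):
    """Assumed weak base classifier: class means computed on the chosen features only."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
    means = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

    def predict(row):
        # Assign the class whose mean is closest within the subspace.
        return min(means, key=lambda c: sum((row[f] - m) ** 2
                                            for f, m in zip(features, means[c])))
    return predict

def rsm_fit(X, y, r, B, seed=0):
    """Step 1: build B classifiers, each on a random r-dimensional subspace."""
    rng = random.Random(seed)
    p = len(X[0])
    ensemble = []
    for _ in range(B):
        features = rng.sample(range(p), r)                   # step 1(a)
        ensemble.append(train_nearest_mean(X, y, features))  # step 1(b)
    return ensemble

def rsm_predict(ensemble, row):
    """Step 2: combine the B classifiers by simple majority voting."""
    votes = Counter(clf(row) for clf in ensemble)
    return votes.most_common(1)[0][0]

# Toy data (invented for illustration): four redundant features, two classes.
X = [[0.0, 0.1, 0.2, 0.0], [0.1, 0.0, 0.1, 0.2],
     [1.0, 0.9, 1.1, 1.0], [0.9, 1.0, 0.8, 1.1]]
y = [0, 0, 1, 1]
ensemble = rsm_fit(X, y, r=2, B=5)
print(rsm_predict(ensemble, [0.05, 0.1, 0.0, 0.1]))
print(rsm_predict(ensemble, [1.0, 1.0, 0.9, 1.0]))
```

An odd B avoids tied votes in the two-class case. A weighted combining rule, mentioned under RSM PERFORMANCE ISSUES, would replace the Counter in rsm_predict with a weighted sum per class.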
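
The co-training loop from the APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET slide can be sketched the same way. Again this is only a sketch under assumptions: the nearest-mean classifier, the confidence measure (gap between the two closest class means), the per_round parameter, and the names train_view and co_train are all hypothetical choices made for illustration. F1 and F2 are given as lists of feature indices, and L must contain at least two classes.

```python
def train_view(L, features):
    """Assumed base learner: nearest-mean classifier on one feature set.

    Returns a function mapping a row to (predicted label, confidence).
    """
    groups = {}
    for row, label in L:
        groups.setdefault(label, []).append(row)
    means = {c: [sum(r[f] for r in rows) / len(rows) for f in features]
             for c, rows in groups.items()}

    def predict(row):
        dist = sorted((sum((row[f] - m) ** 2 for f, m in zip(features, mu)), c)
                      for c, mu in means.items())
        # Confidence: gap between the closest and second-closest class mean.
        return dist[0][1], dist[1][0] - dist[0][0]
    return predict

def co_train(L, U, F1, F2, per_round=1):
    """Move examples from U to L until U is empty, as in the slide's loop."""
    L, U = list(L), list(U)
    while U:
        C1 = train_view(L, F1)   # learn C1 from L based on F1
        C2 = train_view(L, F2)   # learn C2 from L based on F2
        for clf in (C1, C2):
            if not U:
                break
            # Each classifier picks its most confidently predicted examples E.
            ranked = sorted(U, key=lambda row: clf(row)[1], reverse=True)
            for row in ranked[:per_round]:
                U.remove(row)                 # E is removed from U ...
                L.append((row, clf(row)[0]))  # ... and added to L with its label
    return train_view(L, F1), train_view(L, F2), L

# Toy data (invented for illustration): features 0-1 and 2-3 are redundant views.
labeled = [([0.0, 0.1, 0.0, 0.2], 0), ([1.0, 0.9, 1.0, 0.8], 1)]
pool = [[0.1, 0.0, 0.2, 0.1], [0.9, 1.0, 0.8, 1.0],
        [0.2, 0.1, 0.1, 0.0], [0.8, 0.9, 1.0, 0.9]]
C1, C2, grown = co_train(labeled, pool, F1=[0, 1], F2=[2, 3])
print(len(grown), C1([0.0, 0.0, 0.0, 0.0])[0], C2([1.0, 1.0, 1.0, 1.0])[0])
```

When only a single feature set is available, F1 and F2 could be drawn as two random subspaces of it, which is where RSM and co-training meet.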
