Metagene Projection • There are a lot of array data available • Species, platform, labeling method, researcher and other issues make using these data difficult. • Metagene Projection claims to “reduce noise while still capturing the invariant biological features of the data.” • This should “enable cross-platform and crossspecies analysis, improve clustering and class prediction and provide a computational means to detect and remove sample contamination.” NMF (Brunet et al., 2004 PNAS 101:4164) – Nonnegative Matrix Factorization W=genes X small # metagenes H= small # metagenes X samples M and T n (genes) x N (samples) KEY point: n (genes) identifiers in M and T must match Unknown: Can M and T be totally different types of data? Moore-Penrose generalized pseudoinverse Model – 30 samples, 3 metagenes Test – 38 samples, 3 metagenes After Metagene Projection Before Metagene Projection Before Metagene Projection : Rank normalized and including only the top 500 markers of each class. – Underperforms metagene projection •KO of the same gene impacts different cell lines in similar way. • Both mouse stem cell lines, one on Exon array, one on 430_2 •For 3’ UTR – max average per gene selected •For Exon – max probe count per transcript cluster id selected •gene symbol <–> gene symbol join •All 17354 genes used Expressed Clustering (10989 genes) RankNorm = 15((Rank-1)/(#genes-1)) Expressed and rank normalized clustering Metagene Projection Preprocessing 2 required inputs for the genepattern metagene projection module are model and test preprocessing parameter files. gct.file="Arv.gct" cls.file="Arv.cls" column.subset="ALL" column.sel.type="samples" thres=3 NO ceil=14 FILTER at fold=1 this value delta=1.5 norm=6 4525 pass Model input, preprocessing and refinement H matrix from NMF Projected model dataset Model combined with test Original data – Platforms separated Projected data – possibly better separation move 1 het to improve clades