Research Projects
Tao Jiang's Lab
Algorithms and Computational Biology Laboratory
Department of Computer Science and Engineering
University of California, Riverside
March 2013

Project Overview
• Predicting Operons by a Comparative Genomics Approach (DOE GtL)
• Evolutionary Dynamics of Myb Gene DNA-binding Domains (NSF ITR)
• Prediction of HNF4 Binding Sites and Target Genes in Human and Mouse Genomes (NIH/NSF)
• Efficient Selection of Unique and Popular Oligos for Large EST Databases (USDA/NSF)
• Oligonucleotide Fingerprinting of Ribosomal RNA Genes and Microorganism Classification (NSF BDI/NIH)
• Efficient Haplotyping Algorithms for Pedigree Data and Gene Association Mapping (NSF CCF and NIH)
• High Throughput Ortholog Assignment via Genome Rearrangement (NSF IIS)
• Genome-Wide Inference of mRNA Isoforms and Estimation of Their Expression Levels from RNA-Seq Short Reads
• Metagenomic Data Analysis

Predicting Operons by a Comparative Genomics Approach
Xin Chen
Collaboration: Ying Xu (ORNL)
Fund: DOE GtL

This project aims to predict candidate operons in the genome of Synechococcus sp. WH8102 using a comparative genomics approach. These candidate operons may provide helpful information for constructing protein-protein interaction networks and functional pathways.

Operon structures
Operons are a basic organizational unit of genes in the hierarchical structure of biological processes in a cell. They mainly serve to make transcriptional regulation efficient, especially in bacteria. Genes in an operon typically:
• share certain regulatory elements
• are arranged in tandem on the same strand
• are separated by short distances
• are well conserved across phylogenetically related species
• have related functions

Existing methods for operon prediction
• Overbeek et al. (1999): gene pairs of close bidirectional best hits
• Salgado et al. (2000): close gene distances and gene functional classes
• Ermolaeva et al. (2001): the likelihood of conserved genes being an operon
• Carven et al. (2002): a probabilistic learning approach on the whole genome
• Sabatti et al. (2002): a Bayesian classification scheme on gene microarray data
• Zheng et al. (2002): based on information from metabolic pathways

Our approach based on comparative genomics
Pipeline:
1. Genome sequences with annotated genes
2. Pairwise comparison: run blastp with an E-value cutoff of 1e-20
3. Gene matches (homolog information)
4. Cluster conserved, nearby genes
5. Scoring
6. Candidate operons: a ranked list of operons is output

Constraints:
1. neighboring genes are separated by 100 bases or less
2. genes in an operon are located on the same strand
3. gene sets are conserved across two or more genomes
4. full matching is required for a candidate operon
5. promoter and terminator to be considered
6. pathway information to be considered

A score is given by:
1. the product of the E-values of the gene matches involved in an operon
2. intergenic distances in an operon (to be considered)
3. predictive reliability of promoter or terminator (to be considered)

Comparative analysis is based on the idea that functional segments tend to evolve at a lower rate than nonfunctional segments, making well-conserved regions likely to be of great interest (Overbeek et al., 1999).

Implementation details
• Data preparation: data for three genomes downloaded from the ORNL website (http://compbio.ornl.gov/channel/index.html).
• Pairwise comparison: blastp with E-value < 1e-20, giving a bipartite gene matching graph. Same COG ID for matched genes.
• Gene clustering:
– neighboring genes separated by 100 bases or less
– genes in an operon located on the same strand
– gene sets conserved across two or more genomes
– full matching required for a candidate operon
• Scoring: product of the E-values of all gene matches involved; operons with lower scores are output earlier.

[Figure: gene matching graph among Genome a (genes a1 to a8), Genome b (genes b1 to b7), and Genome c (genes c1 to c7)]

The gene matching graph for three cyanobacterial genomes
Three genomes with their gene numbers:
• Synechococcus sp. WH8102 (2520)
• Prochlorococcus marinus sp. MED4 (1700)
• Prochlorococcus marinus sp. MIT9313 (2267)
The numbers of gene matching pairs:
• 1593 between syn_wh and par_med
• 2242 between syn_wh and par_mit
• 1579 between par_med and par_mit

Predicted operons in Synechococcus sp. WH8102
A total of 242 operons were output for Synechococcus sp. WH8102:
• 126 operons shared with both of the other two genomes
• 26 operons shared with pmar_med only
• 90 operons shared with pmar_mit only
(See the operons at http://www.cs.ucr.edu/~xinchen/operons.htm)

Several observations on the putative operons
• The average size of putative operons is 2.88, very close to 3.
• The two most frequent intergenic distances are –4 and –1 (overlaps).
• All operons in Synechococcus sp. WH8102 are on the positive strand.
• Matching genes have the same COG IDs across the three genomes.

Publications
• Z. Su, P. Dam, X. Chen, V. Olman, T. Jiang, B. Palenik, and Y. Xu. GIW 2003.
• X. Chen, Z. Su, Y. Xu, and T. Jiang. GIW 2004 (best paper award).
• X. Chen, Z. Su, P. Dam, B. Palenik, Y. Xu, and T. Jiang. Nucleic Acids Research, 2004.

Ongoing work
• Look for a way to predict promoters and terminators upstream and downstream of candidate operons.
• Find a method to validate/score putative operons using promoter/terminator results.
• Incorporate additional information, such as intergenic distances and predicted promoters, into the scoring system.
• Pathway information to be considered.
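The gene clustering and scoring steps of the operon project can be illustrated with a minimal sketch. It assumes genes are given as dictionaries with hypothetical fields `id`, `start`, `end`, and `strand`, that the intergenic distance is `next.start - prev.end - 1` (so overlaps are negative), and that each matched gene carries a single blastp E-value. None of these names come from the project's own code. Promoter, terminator, and pathway information are not modeled, and the cross-genome conservation check is reduced to requiring a match for every gene in a cluster.

```python
from math import log10


def cluster_genes(genes, max_gap=100):
    """Group consecutive same-strand genes separated by at most max_gap bases."""
    clusters, current = [], []
    for gene in sorted(genes, key=lambda g: g["start"]):
        same_strand = current and gene["strand"] == current[-1]["strand"]
        if same_strand and gene["start"] - current[-1]["end"] - 1 <= max_gap:
            current.append(gene)
        else:
            if len(current) > 1:  # a candidate operon needs at least two genes
                clusters.append(current)
            current = [gene]
    if len(current) > 1:
        clusters.append(current)
    return clusters


def log_score(cluster, evalue):
    """log10 of the product of E-values of the gene matches (lower is better)."""
    return sum(log10(evalue[g["id"]]) for g in cluster)


def rank_candidates(genes, evalue, max_gap=100):
    """Keep fully matched clusters and output those with lower scores first."""
    candidates = [c for c in cluster_genes(genes, max_gap)
                  if all(g["id"] in evalue for g in c)]
    return sorted(candidates, key=lambda c: log_score(c, evalue))


# Hypothetical toy genome: a3 overlaps a2 by a few bases, a4 is too far away.
genes = [
    {"id": "a1", "start": 1, "end": 300, "strand": "+"},
    {"id": "a2", "start": 350, "end": 700, "strand": "+"},
    {"id": "a3", "start": 697, "end": 1000, "strand": "+"},
    {"id": "a4", "start": 1500, "end": 1800, "strand": "+"},
    {"id": "a5", "start": 1850, "end": 2000, "strand": "-"},
    {"id": "a6", "start": 2050, "end": 2300, "strand": "-"},
]
evalue = {"a1": 1e-30, "a2": 1e-25, "a3": 1e-40, "a5": 1e-22, "a6": 1e-21}

ranked = rank_candidates(genes, evalue)
for cluster in ranked:
    print([g["id"] for g in cluster], round(log_score(cluster, evalue), 1))
```

Summing log10 E-values instead of multiplying them gives the same ranking as the product while avoiding floating-point underflow for long clusters.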
Evolutionary Dynamics of Myb Gene DNA-binding Domains
Li Jia
Collaboration: Michael Clegg (Botany)
Fund: NSF ITR

Motivation
• Natural selection on changes in "regulatory genes", which regulate the timing or rate of development, must be required for evolution (Britten and Davidson, 1969 and 1971).
• Natural selection on transcription factors (TFs) should provide one of the predominant mechanisms for generating novel phenotypes.

The Crucial Role of TFs
Genes coding for transcriptional regulators:

Organism           Total genes   Regulator genes   Percentage of total
A. thaliana        ~25,000       ~1,500            ~5%
O. sativa          ~50,000       ~200              ~4%
C. elegans         ~18,000       ~700              ~5%
D. melanogaster    ~15,000       ~800              ~6%
H. sapiens         ~35,000       ~3,000            ~9%
M. musculus        ~30,000       ~1,800            ~6%

[Figure: signaling molecules act on TFs, which control target genes (when? where? how?)]

R2R3-MYB
Structure:
[Figure: R2R3-MYB structure with a DNA-binding domain made of repeats R2 and R3 (each with Helix1, Helix2, Helix3), a flexible domain, and an activation domain]
Functions:
1) Secondary metabolism
2) Cell shape
3) Disease resistance
4) Stress response
[Figure: MYB regulates target genes involved in differentiation, proliferation, and metabolism]

OBJECTIVE
To unveil the molecular dynamics that underlie the evolution of TFs (Myb).

Infer Positive Selection Sites
(based on dN/dS analysis, synonymous vs nonsynonymous mutation rates, in the duplication history of the R2R3-Myb gene family)
[Figure: A. thaliana positive selection counts (y-axis, 0 to 20) by amino acid position (x-axis, 1 to 96), with the R2 and R3 helices marked]
Jia, Clegg and Jiang (2003) Plant Mol. Biol.

Positive Selection Sites
A. thaliana (dicot)

Region             Sites   Counts   Percentage   Counts/site
Full R2R3 region   1       531      100%         5.4
R2 Helix1          14      173      33%          12.4*
R2 Helix2          7       83       16%          11.6*
R2 Helix3          10      8        1.5%         0.8
R3 Helix1          14      119      22%          8.5**
R3 Helix2          7       33       6%           4.7*
R3 Helix3          10      1        0.2%         0.1

O. sativa (monocot)

                           Count               Percentage          Count/site
Category           Sites   indica   japonica   indica   japonica   indica   japonica
Full R2R3 region   103     52       380        100%     100%       0.5      3.7
R2 Helix 1         15      12       61         23%      16%        0.8**    4.1**
R2 Helix 2         7       14       73         27%      19%        2.0**    10.4**
R2 Helix 3         10      0        0          0%       0%         0.0      0.0
R3 Helix 1         14      16       197        31%      52%        1.1*     14.1**
R3 Helix 2         6       2        9          4%       2%         0.3      1.5**
R3 Helix 3         10      0        0          0%       0%         0.0      0.0

Jia, Clegg and Jiang (2003) Plant Mol. Biol.
Jia, Clegg and Jiang (2004) Plant Physiol.

Co-evolved α-Helices

                  japonica   indica   Arabidopsis
r (R2, R3)        0.69**     0.68**   0.69**
r (R2-1, R3-1)    N/A        0.15     0.11
r (R2-2, R3-2)    0.40**     0.62**   0.65**
r (R2-3, R3-3)    0.38       0.29     0.2

Jia, Clegg and Jiang (2004) Plant Physiol.

SUMMARY
1) Positive selection sites: positive selection pressure acts through the first and second helices of the R2R3 repeats rather than the third helices, owing to their structural characteristics.
2) Co-evolution patterns: the pairing correlations between related secondary structures are functionally important for preserving the conformation of the specific protein folding pocket (the second helices).

APPLICATIONS
• Determine protein-DNA interaction regions of transcription factors from their primary codon sequences.
• Genetically modify MYB structure to improve economically important traits.

Prediction of HNF4 Binding Sites and Target Genes in Human and Mouse Genomes
Chuhu Yang
Collaboration: F.M. Sladek (Cell Biology, Neuroscience)
Fund: NIH/NSF

HNF4: an important TF
• HNF4 regulates the expression of many genes, especially some liver-specific genes; it also plays an important role in development.
• It has been demonstrated to regulate the expression of over 60 genes.
• Researchers expect to find more HNF4 target genes.
• It is related to many human diseases, such as diabetes, hemophilia, and hepatitis.
[Figure: HNF4 with its targets and associated conditions: coagulation factors, anti-thrombin, apolipoproteins, EPO, MCAD, OTC, PEPCK, L-PK, HNF1, CYP genes, ACO, HBV, BPG; atherosclerosis, diabetes, hemophilia, thrombosis, hypoxia, MCAD deficiency, OTC deficiency, drug metabolism, cancer]

HNF4 is highly conserved in many different organisms
[Figure: HNF4 domain structure (DNA binding with zinc, ligand?, transactivation) in human, rat/mouse, Xenopus, and Drosophila; % = amino acid identity]

Our previous work
• Collected 71 HNF4 binding sequences from the literature.
• Developed software based on an optimized (or permuted) Markov model and trained it with the 71 known sequences.
• Searched the –500 to +100 regions (relative to transcription start sites) of all human genes in the UCSC database.
• Predicted 840 potential HNF4 binding sites in the human genome.
• Verified in vitro 77 new HNF4 binding sequences, resulting in a total of 137 HNF4 binding sequences.
• This work is summarized in a paper published in Bioinformatics (Vol. 18 Suppl. 2, 2002).

Current work
• Search the promoter regions of all human genes with the 137 HNF4 binding sequences for potential HNF4 target genes in human.
• Search the promoter regions of all mouse genes with the 137 HNF4 binding sequences for potential HNF4 target genes in mouse.
• Compare HNF4 target genes in the human and mouse genomes.
• Run in vivo experiments to verify potential HNF4 target genes.

Future work
• Optimize the current software so that it predicts HNF4 binding sites more accurately.
• Study the functions of all HNF4 target genes, cluster them into functional groups, and study the relationships between groups.
• Set up regulatory networks of all HNF4 target genes in the human and mouse genomes.
• Sequence weighting: a new approach to constructing PSSMs (or PWMs) for motif finding from ChIP and gene expression data.

Efficient Selection of Unique and Popular Oligos for Large EST Databases
Jie Zheng
Collaboration: Stefano Lonardi and Timothy J. Close (Botany)
Funding: USDA / NSF

Problems of Oligo Selection (for the Barley EST data in HarvEST)
• Unique Oligo Problem
– Select oligos each of which appears (exactly) in one EST sequence but does not appear (exactly or approximately) in any other EST
• Popular Oligo Problem
– Select oligos that appear (exactly or approximately) in many ESTs

Applications
• Unique oligos
– PCR primer designs
– Microarray probe designs
• Popular oligos
– Useful in screening genomic libraries (such as BAC libraries) for gene-rich regions

Methods
• Basic idea
– Separate dissimilar strings as early as possible to reduce the search space
• Algorithm for unique oligos
– Group similar oligos by hashing 11-mer seeds, and disqualify oligos similar to oligos in other ESTs
• Algorithm for popular oligos
– Cluster similar oligos by hashing 20-mer cores and comparing the regions outside the cores
– Identify centers in clusters

Performance
• Input data:
– 46145 Barley EST sequences, about 28 million base pairs, from the HarvEST database
• Time and space:
– A couple of hours on a machine with a 1.2 GHz CPU and 1 GB of RAM
• Accuracy in simulation:
– Relative error is below 2%

Oligonucleotide Fingerprinting of Ribosomal RNA Genes (OFRG) and Microorganism Classification
Andres Figueroa and Zheng Liu
Collaboration: J. Borneman (Plant Pathology) and M. Chrobak (CSE)
Fund: NSF BDI/NIH R01

Basic Idea
• rRNA genes (rDNA) can be used as an ID of species, especially microorganisms.
• Use microarray technology to identify the rDNAs of the microbes in a community. Oligonucleotide probes are designed to hybridize with the (unknown) rDNA clones in a sample.
• Analyze the hybridization result to obtain fingerprints.

Project Flowchart
[Flowchart: sample (soil, mouse gut, plant tissue, etc.) → extract rDNA → PCR → mixture of rDNA → clone (ligate and transform) → clone library → PCR → individual rDNA → print array → hybridization with probes → signal intensities → normalization → normalized signal intensities → fingerprint assignment → rDNA fingerprints → cluster → taxonomic tree]

Project Structure
[Diagram: a web-based integrated platform connecting expression data, genomic DBs, the rDNA sequence DB, the OFRG management DB, probe set design, fingerprint binarization, clustering, and labeling of unknown clones]

Future Work
• Complete the rDNA sequence database (done)
• Create the OFRG management database (done)
• Intensity normalization/binarization using control information (partially done)
• Extend to [0,t], for t = 2, 3, 4, …
• Combine tools into an integrated platform
• A higher-throughput system based on microbeads and polony sequencing technologies (NIH)
[Figures: polony (polymerase colony); polony hybridizing with different probes]

Efficient Haplotyping Algorithms for Pedigree Data
Lan Liu, Bob Wang and Jing Xiao
Collaboration: Jing Li (CWRU) and Tim Chen (USC)
Fund: NSF CCR/NIH R01

An Example Pedigree: The British Royal Family
[Figure: pedigree of the British Royal Family, from Elizabeth II and Prince Philip through their children, the children's spouses, and their grandchildren]

MRHC Problem
Find a minimum recombinant haplotype configuration from a given pedigree with genotype data.
Assumptions:
• Mendelian law (no mutations)
• Recombination events are rare
[Figure: a pedigree with unphased genotypes as input, e.g. (1 2) (1 2) (2 2), and phased haplotypes as output, e.g. 1|2 2|1 2|2]

Motivations
• A haplotype is more biologically meaningful than a genotype, since each haplotype of a child is inherited from one parent.
• Haplotype data are more informative and more valuable for determining the association between diseases and genes and for studying human history.
• The human genome project gave us the consensus genotype sequence of humans. To understand the genetic effects on complex diseases such as cancers, diabetes, and osteoporosis, genetic variations matter more, and these can be represented by haplotypes.
• Current techniques collect genotype data, so computational methods for deriving haplotypes from genotypes are in high demand.
• The international HapMap project is ongoing.
• It is generally believed that parent/pedigree information yields more accurate haplotype and frequency estimates than data without such information.
• Family-based association studies are widely used. We expect more family-based gene mapping methods that assume accurate haplotype information.
• Model-based statistical methods are not only computationally intensive but may also rely on assumptions that do not hold in real datasets.
Results
• MRHC is NP-hard
• Heuristic: block-extension algorithm
• Exact algorithms: member-based and locus-based dynamic programming
• ILP algorithm for MRHC with missing alleles
• Software: PedPhase
• Special cases:
– Efficient algorithms for ZRHC based on systems of linear equations and low-stretch spanning trees
– Locus-based dynamic programming for loopless pedigrees
• A data mining approach to gene association mapping
• Several results on genome-wide TagSNP selection via linkage disequilibrium

ILP Formulation
Objective function:
$\min \sum_{i \in \text{Non-Founders}} \sum_{j=1}^{m-1} \left( r^{j}_{i,1} + r^{j}_{i,2} \right)$
Subject to
Genotype constraints (0 means missing allele), by genotype of individual $i$ at locus $j$:
• $\{0, 0\}$: $\sum_{k=1}^{t_j} f^{j}_{i,k} = 1$, $\sum_{k=1}^{t_j} m^{j}_{i,k} = 1$
• $\{m^{j}_{r}, 0\}$: $f^{j}_{i,r} + m^{j}_{i,r} \ge 1$
• $\{m^{j}_{r}, m^{j}_{r}\}$: $f^{j}_{i,r} = m^{j}_{i,r} = 1$
• $\{m^{j}_{r}, m^{j}_{s}\}$: $f^{j}_{i,r} + f^{j}_{i,s} = m^{j}_{i,r} + m^{j}_{i,s} = f^{j}_{i,r} + m^{j}_{i,r} = f^{j}_{i,s} + m^{j}_{i,s} = 1$
Mendelian law of inheritance constraints (subscript $f$ denotes the father of $i$):
$f^{j}_{i,k} - f^{j}_{f,k} - g^{j}_{i,1} \le 0$
$f^{j}_{i,k} - m^{j}_{f,k} + g^{j}_{i,1} \le 1$
Constraints for the r variables: [not shown]

Test Results on Real Data
[Figure: test results on real data]

The ZRHC Problem
Problem definition: given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Some Constraints
[Figure: example pedigrees with genotypes and the phase constraints they induce]

The Constraints as Linear Equations
Note: the variables represent phase, and the equations are over GF(2) (that is, addition mod 2).

The Final Linear System
$O(mn)$ equations on $O(mn)$ variables. Standard Gaussian elimination gives an $O(m^3 n^3)$ time algorithm.
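The baseline step, solving a system of phase equations over GF(2) by standard Gaussian elimination, can be sketched as follows. This is a generic illustration under stated assumptions, not PedPhase code: the function name `solve_gf2` and the input encoding (each equation as a list of variable indices plus a right-hand-side bit) are hypothetical, and the sketch does not build the equations from a pedigree.

```python
def solve_gf2(equations, num_vars):
    """Solve a linear system over GF(2) by Gaussian elimination.

    equations: list of (variable_indices, rhs_bit); each equation states that
    the XOR of the listed variables equals rhs_bit.
    Returns one solution as a list of 0/1 values (free variables set to 0),
    or None if the system is inconsistent.
    """
    coeff_mask = (1 << num_vars) - 1
    pivots = {}  # pivot column -> reduced row (bit num_vars holds the rhs)
    for variables, rhs in equations:
        row = rhs << num_vars
        for v in variables:
            row ^= 1 << v  # a variable listed twice cancels mod 2
        # Eliminate every existing pivot column from the new row.
        for col, pivot_row in pivots.items():
            if (row >> col) & 1:
                row ^= pivot_row
        coeffs = row & coeff_mask
        if coeffs == 0:
            if (row >> num_vars) & 1:
                return None  # reduced to 0 = 1: no solution
            continue  # redundant equation
        col = coeffs.bit_length() - 1
        # Keep the stored rows fully reduced with respect to the new pivot.
        for c in pivots:
            if (pivots[c] >> col) & 1:
                pivots[c] ^= row
        pivots[col] = row
    solution = [0] * num_vars
    for col, pivot_row in pivots.items():
        solution[col] = (pivot_row >> num_vars) & 1
    return solution


# Three phase variables: x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 (mod 2).
example = [([0, 1], 1), ([1, 2], 0), ([0, 2], 1)]
print(solve_gf2(example, 3))
```

Packing each equation into a Python integer makes a row addition a single XOR, but the overall cost is still cubic in the number of variables, which is the bound the faster ZRHC algorithm improves on.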
A Faster Algorithm for ZRHC
• We have recently devised a faster algorithm for ZRHC with running time $O(mn^2 + n^3 \log^2 n \log\log n)$.
[Figure: the $O(mn) \times O(mn)$ matrix is transformed into an $O(mn) \times O(n)$ matrix, and redundancy is then reduced to give an $O(n \log^2 n \log\log n) \times O(n)$ matrix]

Some Open Problems
• Faster (and reliable) methods than ILP for large pedigrees
• The k-RHC problem for small k
• Probabilistic models for k-RHC (Xiao Jing)
• Incorporation of population models into pedigrees
– Combine with the parsimony model
– Combine with the perfect phylogeny model
– Population of trios?
• Dealing with mutations, errors, and missing data
• Association mapping on/using pedigree data?

A High-Throughput Combinatorial Approach to Genome-Wide Ortholog Assignment
Zheng Fu, Wilson Shi, Vincent Peng
Collaboration: Liqing Zhang (Virginia Tech)
Fund: NSF IIS
Joint work with X. Chen, J. Zheng, V. Vacic, P. Nan, Y. Zhong, and S. Lonardi

Orthology
• Homolog
– Gene family
• Duplication
– Paralog
• Speciation
– Ortholog
[Figure: gene tree for mouse, chicken, and frog showing a duplication event and a speciation event]
(from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html)

Orthology: the more complicated picture
[Figure: species A, B, and C separated by Speciation 1 and Speciation 2, with Gene duplication 1 and Gene duplication 2 producing genes A1, B1, B2, C1, C2, C3 in groups G1, G2, G3]
• Outparalogs evolved via a duplication prior to a given speciation event.
• Inparalogs evolved via a duplication after a given speciation event.
• The true exemplar is the direct descendant of the ancestral gene of a given set of inparalogs.
• A main ortholog pair is defined as the two true exemplar genes of two co-orthologous gene sets.
Significance
• Orthologous genes in different species are evolutionary and functional counterparts.
• Many methods use orthologs in a critical way:
– Function inference
– Protein structure prediction
– Motif finding
– Phylogenetic analysis
– Pathway reconstruction
– and more ...
• Identification of orthologs, especially exemplar genes, is a fundamental and challenging problem.

Ortholog Assignment Methods
• BBH: Best Bidirectional Hit (by BLASTn / BLASTp)
• COG: Cluster of Orthologous Groups (Tatusov et al., Science, 278:631-637, 1997; Nucleic Acids Res., 28:33-36, 2000)
• TOGA: TIGR Orthologous Gene Alignments (Lee et al., Genome Res., 12:493-502, 2002)
• INPARANOID: Identify Orthologs & Inparalogs (Remm et al., J. Mol. Biol., 314:1041-1052, 2001)
• OrthoMCL: a Markov Cluster algorithm (Li et al., Genome Res., 13:2178-2189, 2003)
• Reconciled Tree: gene tree vs. species tree (Yuan et al., Bioinformatics, 14:285-289, 2001)
• OrthoParaMap: synteny regions (Cannon et al., BMC Bioinformatics, 4(1):35, 2003)
• Shared Genomic Synteny: synteny anchors and synteny blocks (Zheng et al., Bioinformatics, 21:703-710, 2004)
• SOAR: System of Ortholog Assignment by Reversal (Chen et al., IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2005)

Molecular Evolution
• Local mutation
– Base substitution
– Base insertion
– Base deletion
• Global rearrangement and duplication
– Inversion/Reversal
– Translocation
– Transposition
– Fusion/Fission
– Duplication/Loss
A complete ortholog assignment system should make use of information from both levels of molecular evolution.

Example
[Figure: the ancestral genome (a1 b c a2 d e f g) evolves after speciation into two genomes through reversal, duplication, and fission events]
Given the evolutionary scenario in terms of gene order, main ortholog pairs and inparalogs can be identified in a straightforward way.

The Parsimony Approach
• Identify homologs using BLASTp.
• Reconstruct the evolutionary scenario on the basis of the parsimony principle: to assign orthologs, postulate the minimum possible number of rearrangement events and duplication events in the evolution of two closely related genomes since they split.
• The ortholog assignment problem can be formulated as finding a most parsimonious transformation from one genome into the other, without explicitly inferring their ancestral genome.

RD (Reversal-Duplication) Distance
• RD distance: $RD(\Pi, \Gamma) = R(\Pi, \Gamma) + D(\Pi, \Gamma)$
– $R(\Pi, \Gamma)$ denotes the number of rearrangement events in a most parsimonious transformation
– $D(\Pi, \Gamma)$ denotes the number of gene duplications in a most parsimonious transformation
• Example: for $\Pi$ = (b a1 c a2 d e a4 f g) and $\Gamma$ = (a1 b c) (a2 d e f a3 g), $RD(\Pi, \Gamma) = 4$

The Key Algorithmic Problem: SRDD
• Two related (unichromosomal) genomes
– No inparalogs, i.e. no post-speciation duplications
– No gene losses
– Equal gene content
• Signed Reversal Distance with Duplicates
– Given two related genomes
– Only reversals have occurred
– How to find a shortest sequence of reversals
• Almost untouched in the literature
– Duplicated genes are present
– Generalizes the problem of sorting by reversals

Sorting by Reversals Problem
• Goal: given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n)
• Input: permutation p
• Output: a series of reversals r1, …, rt transforming p into the identity permutation such that t is minimum

Sorting by Reversals: Example
• t = d(p), the reversal distance of p
• Example:
p = 3 4 2 1 5 6 7 10 9 8
    4 3 2 1 5 6 7 10 9 8
    4 3 2 1 5 6 7 8 9 10
    1 2 3 4 5 6 7 8 9 10
So d(p) = 3

The MCSP Problem
• Minimum Common Substring Partition
G: 3 1 2 -1 4
H: -4 1 2 3 1
• This may help eliminate many duplicates, but it is different from syntenic blocks.
• Given two related genomes G and H, we have $(L(G, H) - 1)/2 \le d(G, H) \le L(G, H) - 1$

An Outline of MSOAR
Input: Dataset A and Dataset B
Homology search:
1. Apply all-vs.-all comparison by BLASTp
2. Select only the BLAST hits with similarity score above the cutoff
3. Keep the top five bidirectional best hits
Assign orthologs by minimizing RD distance:
1. Apply suboptimal rules
2. Apply minimum common substring partition
3. Maximum cycle decomposition
4. Detect inparalogs by identifying "noise" gene pairs
Output: list of orthologous gene pairs

Real Data
• Homo sapiens:
– Build 36.1 human genome assembly (UCSC hg18, March 2006)
– 20161 protein sequences in total
• Mus musculus:
– Build 36 mouse genome assembly (UCSC mm8, February 2006)
– 19199 protein sequences in total

MSOAR vs Inparanoid
• Validation: official gene symbols extracted from UniProt release 6.0 (September 2005)
• For 20161 human protein sequences and 19199 mouse protein sequences, MSOAR assigned 14362 orthologs between human and mouse, among which 11050 are true positives, 1748 are unknown pairs and 1508 are false positives, resulting in a sensitivity of 92.26% and a specificity of 87.99%.
• The comparison between MSOAR and Inparanoid (Remm et al., J. Mol. Biol., 2001)
[Figure: genes on human chromosome 20 (STK35, TGM3, TGM6, SNRPB, ZNF343, TMC2, NOL5A, IDH3B) matched to genes on mouse chromosome 2 (Stk35, Tgm3, Tgm6, Snrpb, Tmc2, Nol5a, Idh3b)]
The ortholog pair SNRPB (human) and Snrpb (mouse) are not bidirectional best hits, so they could be missed by sequence-similarity-based ortholog assignment methods like Inparanoid.

Validation by HCOP
• The HGNC Comparison of Orthology Predictions (HCOP) is a tool that integrates and displays the human-mouse orthology assertions made by Ensembl, Homologene, Inparanoid, PhIGS, MGD and HGNC. (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/hcop.pl)
[Figure: distribution of the number of supports from HCOP; x-axis: the number of supports (0 to 6); y-axis: the number of orthologs assigned by MSOAR (0 to 6000)]

Future Work
– More efficient algorithms for MCSP and MCD. The best approximation algorithm for MCSP has ratio O(k) (Kolman and Walen, 2006). Can the ratio be improved to O(1)?
– Refine the evolutionary model for MSOAR (transposition, tandem duplication, gene loss, etc.). How would the DCJ model fit in?
– Ortholog assignment for multiple genome comparison. The median problem.
– More explicit treatment of one-to-many and many-to-many orthology relationships.
– Take advantage of other sources of genomic information such as unique sequence tags, syntenic blocks, etc.

Genome-Wide Inference of mRNA Isoforms and Estimation of Their Expression Levels from RNA-Seq Reads
Wei Li (UCR/Harvard) and Jianxing Feng (Tsinghua/Tongji)
This method was also reported later in SLIDE (Li et al., PNAS, 2011).
Separating Metagenomic Short Reads into Genomes via Clustering
Olga Tanaseichuk and James Borneman (UCR)

Metagenomics
• Genomics
– Study of an organism's genome
– Relies upon cultivation and isolation
– > 99% of bacteria cannot be cultivated
• Metagenomics
– Study of all organisms in an environmental sample by simultaneous sequencing of their genomes
– Makes it possible to study organisms that cannot be isolated or are difficult to grow in a lab

Metagenomic Projects
• The Acid Mine Drainage Project
– Simple community: 5 dominant species (3 bacteria and 2 archaea)
– Motivation: to understand the mechanisms by which the microbes tolerate extremely acidic environments
• The Sargasso Sea Project
– Large-scale sequencing in an environmental setting
– Identified >1 million putative genes (10 times more than in all databases at that time)
– ~1800 species
• The Human Microbiome Project
– Microbial community living in a host
– 100 trillion microbes
– 100 times more microbial than human genes
– Is there a core human microbiome?
– How do changes in the microbiome correlate with human health?
[Images: the Tinto River in Spain (credit: Carol Stoker); a coral reef off the coast of Malden Island in Kiribati]

DNA Sequencing
• Next Generation Sequencing (NGS)
– High-throughput
– Cost- and time-effective
– No cloning (reduced clonal biases)
– Shorter read length compared to Sanger reads (~1000 bps)
• Roche/454 (~450 bps)
• Illumina/Solexa (35-100 bps)
• ABI SOLiD (35-50 bps)
– Due to rapid progress, sequencing lengths will increase

Goals of Metagenomics
• Phylogenetic diversity
• Metabolic pathways
• Genes that predominate in a given environment
• Genes for desirable enzymes
• ...
Ultimate goal: complete genomic sequences

Problem Formulation
• Given metagenomic reads, separate reads from different species (or groups of related species)

Difficulties
• Shared with genomics:
– Repeats in genomic sequences
– Sequencing errors
• Specific to metagenomics:
– Unknown number of species and abundance levels
– Common repeats in different genomes due to homologous sequences

Existing Approaches
• Similarity-based
– Similarity search against databases of known genomes or genes/proteins
• Composition-based
– Binning based on conserved compositional features of genomes
• Abundance-based
– Separate genomes by abundance levels

Algorithm: Overview
• Purpose: separating short paired-end reads from different genomes in a metagenomic dataset
• Two-phase heuristic algorithm
– short reads
– similar abundance levels
– arbitrary abundance levels (in combination with AbundanceBin [Wu and Ye, RECOMB, 2010])

Algorithm: Definitions and Observations
• Unique l-mers occur only once; repeated l-mers occur more than once.
• Repeated l-mers are either individual repeats or common repeats.
• Observation 1: most of the l-mers in a bacterial genome are unique, for l ~ 20 and most complete genomes.
[Figure: the ratio of unique l-mers to distinct l-mers]
• Observation 2: most l-mers in a metagenome are unique, for l ~ 20 and genomes separated by sufficient phylogenetic distances.
• Observation 3: most of the repeats in a metagenome are individual, for l ~ 20 and genomes separated by sufficient phylogenetic distances.

Flowchart
[Figure: flowchart of the algorithm]

Arbitrary Abundance Levels
• A significant abundance ratio is defined by the expected misclassification rate (>3%)

Experimental Results: Overview
• Lack of NGS metagenomic benchmarks
• Lack of algorithms in the literature to separate short NGS reads from different genomes
• Datasets
– Tests on a variety of synthetic datasets with different numbers of genomes, phylogenetic distances, and abundance ratios
– Performance on a real metagenomic dataset from gut bacteriocytes of a glassy-winged sharpshooter
• Comparison
– We modify the Velvet assembler [Zerbino and Birney, Genome Research, 2008] to work as a genome separator (clusters in Phase I are replaced by sets of l-mers from the Velvet contigs)
– With CompostBin on longer reads

Experimental Results
• 182 synthetic datasets in 4 categories
– 79 experiments for the same genus
– 66 for the same family
– 29 for the same order
– 8 for the same class
• Read length: 80 bps
• Coverage depth: ~15-30
• Equal abundance levels
• 2-10 genomes in each dataset
• Simulation: MetaSim [Richter et al., PLoS ONE, 2008]
• Phylogeny: NCBI taxonomy
[Figure: results on the synthetic datasets]

Experimental Results: Genomes with Different Abundance Levels
[Figure: results for genomes with different abundance levels]

Experimental Results: Comparison with CompostBin
• Simulated paired-end Sanger reads from [Chatterji et al., RECOMB, 2008]
– Handling longer reads (1000 bps)
• Cut long reads into short reads of 80 bps
• Linkage information is recovered in Phase II
– Handling lower coverage depth (~3-6)
• Choose a higher threshold K to separate repeats and unique l-mers in preprocessing
• Simulated paired-end Illumina reads
– 80 bps, high coverage depth (~15-30)

Test    Abundance ratio   Phylogenetic distance
Test1   1:1               Species
Test2   1:1               Genus
Test3   1:1               Genus
Test4   1:1               Family
Test5   1:1               Family
Test6   1:1               Order
Test7   1:1:8             Family, Order
Test8   1:1:8             Order, Phylum
Test9   1:1:1:1:2:14      Species, Order, Family, Phylum, Kingdom

Experimental Results: Real Dataset
• Gut bacteriocytes of the glassy-winged sharpshooter, Homalodisca coagulata
– Consists of reads from:
• Baumannia cicadellinicola
• Sulcia muelleri
• Miscellaneous unclassified reads
• Sanger reads
• Performance is measured by the ability to separate reads from B. cicadellinicola and S. muelleri
• Performance
– TOSS: sensitivity ~92% and error rate ~1.6%
– CompostBin: error rate ~9%

Implementation of TOSS
• Implemented in C
• Running time and memory depend on
– Number and length of reads
– Total length of the genomes
• For 80 bps reads: 0.5 GB of RAM per 1 Mbps
– 2-4 genomes, total length 2-6 Mbps: 1-3 h, 2-4 GB of RAM
– 15 genomes, total length 40 Mbps: 14 h, 20 GB of RAM

Questions? Comments?
Contact: Tao Jiang
Department of Computer Science and Engineering
University of California, Riverside
jiang@cs.ucr.edu
www.cs.ucr.edu/~jiang