Homology Based Analysis of the Human/Mouse lncRNome

Cédric Notredame, Giovanni Bussotti
Comparative Bioinformatics lab, CRG


Part 1: GENCODE v10 lncRNA screening vs human and mouse genomes

Strategy: PipeR one2many homolog assignment

Template:
  genes                                    10840
  transcripts                              17547
  exons                                    58857
  sum of mature transcript length (nt)     16,927,027
  real coverage (nt)                       13,083,478
  non overlapping loci                     7428

PipeR Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (by the genomic span of the query, on both sides)


PipeR: a pipeline for mapping lncRNAs

• Blast-Exonerate based framework to map lncRNAs against target genomes
• Algorithm used: lncRNA -> Blast hits -> chromosome mapping -> extension -> Exonerate -> spliced transcript


GENCODE lncRNAs vs Complete Genomes
PipeR: lncRNA Homology Mapping

1. Anchor points: ENCODE vs Mouse with tuned Blast
2. Extension: Exonerate
3. Filtering: identity and coverage
4. Validation of the GFF annotation
   - Overlap with annotation
   - Overlap with Cufflinks models
   - RPKM on target genome
5. Further mapping: parameter space exploration using experimental evidence

Output: GFF file

Notredame, Bussotti


Mapping overview

[Diagram: Transcripts 1, 2 and 3 of Gene A and Gene B in the query species are mapped to the target species. Possible outcomes: Blast/Exonerate failed, or multiple homologues. Homolog labels: Homolog 1, best reciprocal; Homolog 2, conserved exon number; Homolog 3, high repeat coverage; Homolog 4, overlap with protein.]


GENCODEv10 vs human genome

• Mapped 17327 transcripts out of 17547
• Many lncRNAs found in multiple copies (lncRNA families)
  - found 144566 homologs corresponding to 501355 exons
• Annotations of discovered homologs are readily available


Homolog repeat coverage

• About 10% of all our homolog predictions are fully covered by repeats
• The homologs can be sub-grouped into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%


HUMAN Mapping statistics

  Repeat coverage                                       <= 20%   <= 80%   <= 100%
  genV10 mapped genes                                     6088    10425     10698
  genV10 mapped transcripts                               9318    16856     17327
  Total homologs                                         35399   102250    144566
  Homologs whose exons overlap protein coding
  exons (same strand)                                     3621     5076      8988


GENCODEv10 vs mouse genome

• Mapped 3190 transcripts out of 17547, representing 2249 human genes
• Many lncRNAs found in multiple copies (lncRNA families)
  - found 14936 homologs corresponding to 38910 exons
• Annotations of discovered homologs are readily available


Human/Mouse Exon Number Conservation

• The plot shows the difference between the number of exons in the human transcripts and in their mouse homologs
• "0" means that the exon number is the same
• Negative bins indicate mouse homologs having more exons than the human query
• 1160 GENCODE v10 transcripts find at least 1 homolog in mouse with the same exon number

[Histogram: negative side labelled "human < mouse", positive side labelled "human > mouse"]


MOUSE Mapping statistics

  Repeat coverage                                       <= 20%   <= 80%   <= 100%   Reciprocal homologs
  genV10 mapped genes                                     1867     2172      2249                  1445
  genV10 mapped transcripts                               2586     3076      3190                  1966
  Total homologs                                          6108    11141     14936                  1966
  Homologs whose exons overlap protein coding
  exons (same strand)                                     1611     2290      3177                   497
  Homologs with conserved number of exons                 1534     2407      2958                   689

Best Candidates: 148 transcripts have < 20% repeat coverage, a conserved exon structure, no overlap with protein coding exons, and are best reciprocal homologs of the human queries.


BlastR vs The World

Figure 2: Exon read support. a) Venn diagram indicating the number of exons detected by the different methods (numbers in parentheses) and their intersection (transcripts annotated identically by the three methods). b) Average number of reads per exon. c) Percentage of exons covered by at least one read.

[Panel a, Venn diagram labels: blastn (8749), blastnOpt (12487), blastr (12093), all (7492)]
[Panel b, bar chart: y-axis "average reads per exon" (800 to 1,400); x-axis "methods": blastn, blastnOpt, blastr, all]
[Panel c, bar chart: y-axis "% exons with read" (60 to 80); x-axis "methods": blastn, blastnOpt, blastr, all]


Part 2: Ensembl.v65 lncRNAs screening vs human and mouse genomes

Strategy: PipeR one2many homolog assignment

Template:
  genes                                    3845
  transcripts                              5669
  exons                                    18353
  sum of mature transcript length (nt)     7,279,679
  real coverage (nt)                       6,091,050
  non overlapping loci                     2790

PipeR Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (by the genomic span of the query, on both sides)


Ensembl.v65 vs human genome

• Mapped 1187 transcripts out of 5669
• Many lncRNAs found in multiple copies (lncRNA families)
  - found 13193 homologs corresponding to 46770 exons
• Annotations of discovered homologs are readily available


Ensembl.v65 vs mouse genome

• Mapped 5622 transcripts out of 5669
• Many lncRNAs found in multiple copies (lncRNA families)
  - found 41005 homologs corresponding to 121515 exons
• Annotations of discovered homologs are readily available


Mouse/Human Exon Number Conservation

• The plot shows the difference between the number of exons in the mouse transcripts and in their human homologs
• "0" means that the exon number is the same
• Negative bins indicate human homologs having more exons than the mouse query
• 481 Ensembl.v65 transcripts find at least 1 homolog in human with the same exon number

[Histogram: negative side labelled "mouse < human", positive side labelled "mouse > human"]


Homolog repeat coverage

• No peak of homolog predictions fully covered by repeats is observed


Ensembl.v65 and GENCODEv10 repeat coverage

• The input lncRNA datasets have similar repeat distributions


Mapping statistics

HUMAN
  ensV65 mapped genes                                                            879
  ensV65 mapped transcripts                                                     1187
  Total homologs                                                               13193
  Homologs whose exons overlap protein coding exons (same strand)               3642
  Homologs whose exons do not overlap any gencode v10 element (same strand)     6085
  Homologs with conserved number of exons                                       4925

MOUSE
  ensV65 mapped genes                                                           3815
  ensV65 mapped transcripts                                                     5622
  Total homologs                                                               41005
  Homologs whose exons overlap protein coding exons (same strand)              10086


Part 3: GENCODE v10 lncRNA coding potential check

Strategies:
1) GeneID ORF score comparison between mRNAs and lncRNAs
2) BlastX against human proteins (Ensembl 65)
3) Overlap with protein coding gene exon annotations (GENCODE v10)
4) PipeR filtering routines


1) ORF scores as returned by GeneID

[Figure: GeneID ORF score distributions]


2) BlastX against human proteins indicates that 1202 GENCODE v10 lncRNAs match proteins

Parameters: seg low complexity filtering, repeat filtering, evalue 10e-10, search on the plus strand only.
Database: human Ensembl 65 protein set


3) Overlap between GENCODE v10 lncRNA exons and GENCODE v10 protein coding exons

- Found 846 lncRNAs having at least one exon overlapping a protein coding gene exon

[Figures: Example 1, Example 2]


4) Extensive filtering

7813 GENCODE v10 transcripts passed *ALL* PipeR filtering routines

Filtering rules:
- overlap with protein coding exons
- GeneID ORF score similar to those of mRNAs
- blastX to uniprot database (50% redundancy)
- blastX to nr database
- rpsBlast to pfam domain families
- blast against Rfam
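
The "passed *ALL* filtering routines" criterion above amounts to a conjunction: a transcript is kept only if none of the listed rules flags it. The following is a minimal sketch of that logic, not PipeR's actual code. The class FilterFlags, its field names and the demo transcript identifiers are hypothetical, and treating a hit against Rfam as a reason for rejection is an assumption.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FilterFlags:
    """Hypothetical per-transcript flags; True means the rule rejects it."""
    overlaps_coding_exon: bool = False   # overlap with protein coding exons
    mrna_like_orf_score: bool = False    # GeneID ORF score similar to mRNAs
    blastx_uniprot_hit: bool = False     # blastX hit in uniprot
    blastx_nr_hit: bool = False          # blastX hit in nr
    rpsblast_pfam_hit: bool = False      # rpsBlast hit to a pfam family
    blast_rfam_hit: bool = False         # blast hit in Rfam (assumed to reject)


def passes_all_filters(flags):
    """Return True only if no filtering rule flags the transcript."""
    return not any(getattr(flags, f.name) for f in fields(flags))


def filter_transcripts(flags_by_id):
    """Keep the ids of transcripts that pass every rule, in input order."""
    return [tid for tid, flags in flags_by_id.items()
            if passes_all_filters(flags)]


# Made-up demo data, for illustration only.
demo = {
    "tx_clean": FilterFlags(),
    "tx_coding_overlap": FilterFlags(overlaps_coding_exon=True),
    "tx_pfam": FilterFlags(rpsblast_pfam_hit=True),
}

print(filter_transcripts(demo))
```

Keeping one boolean per rule makes it easy to report how many transcripts each rule removes on its own, in addition to the final count of survivors.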