Mapping NGS sequences to a reference genome Why? • Resequencing studies (DNA) – Structural variation – SNP identification • RNAseq – Mapping transcripts to a genome sequence • Genome annotation • Transcript enumeration • Identification of splice junctions/variants Blast is too slow • Different alignment algorithms are necessary • Burrows Wheeler Alignment – sequence database (genome) is transformed to produce an index – Individual sequence reads are searched against this index • STAR Aligner (Dobin et al. 2012) Bioinformatics – Uncompressed Suffix trees BWT of “banana” Tophat2 • Based on the Bowtie alignment engine – Bowtie, matching with no gaps – Tophat2, gapped matches • Aligns reads to a Burrows Wheeler transformed index of the genome • 1st pass non-gapped matches • 2nd pass splits unmapped reads and attempts to align the fragments The STAR Aligner • Start at the first base of sequence read • Find Maximal Mappable Prefix (MMP) • Repeat process using unmapped portion of read • 50x faster than other aligners OUTPUTS • TopHat (Bowtie) – .bam file (binary alignment/map) – .sam (sequence alignment/map) – Single .sam file entry: I8MVR:53:837 0 17_dna:chromosome 14090858 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XG:Z:CT 255 21M * 0 0 NM:i:7 XR:Z:CT .sam fields .sam flags 0x0001 1 0x0002 2 0x0004 0x0008 0x0010 4 8 16 0x0020 0x0040 0x0080 0x0100 32 64 128 256 the read is paired in sequencing, no matter whether it is mapped in a pair the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment) the query sequence itself is unmapped the mate is unmapped strand of the query (0 for forward; 1 for reverse strand) strand of the mate the read is the first read in a pair the read is the second read in a pair the alignment is not primary (a read having split hits may have multiple primary alignment records) 1. 1 2. 2 3. 1+2 4. 0+4 5. 1+4 6. 0+2+4 7. 1+2+4 8. 0+8 9. 1+8 10. 0+2+8 11. 1+2+8 12. 0+4+8 13. 1+4+8 14. 0+2+4+8 15. 1+2+4+8 16. …etc. CIGAR format I8MVR:104:144 0 7_dna:chromosome 120102744 255 62M1I14M * 0 0 GGTTTTTTGGAAGAGTAGTTCGCGTTTCATTAATTAGTTATTTTTTAGTTTTTAAATAAAATAAAATTTTAAAAAAA Quantifying alignments • How many reads overlap a given interval on a chromosome (scaffold)? • How do these regions correspond to known genes? – .gtf file • How many transcripts from my gene of interest? • How confident can I be about a variant call? Annotate regions - GTF files 1 2 3 4 5 6 7 8 Chromosome _8.1 Cufflinks transcript 90162 90766 1000 + . Chromosome _8.1 Cufflinks exon 90162 90231 1000 + . Chromosome _8.1 Cufflinks exon 90314 90766 1000 + . Chromosome _8.1 Cufflinks transcript 90889 91620 1000 . . 9 gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537"; gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537"; gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "2"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537"; gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "49.8117204717"; frac "1.000000"; conf_lo "21.651798"; conf_hi "73.074820"; cov "2.193724"; GTF fields 1. 2. 3. 4. 5. Sequence ID Source Feature Start End 6. 7. 8. 9. Score Strand Frame Attribute Variant Calling • .bam/.sam file contains all of the information required to call variants • Variant calls can’t be extracted from the .bam file • Must provide the genome sequence I8MVR:53:837 0 17_dna:chromosome 14090858 255 21M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT Today’s exercises Variant Analysis • Extract variant information from provided .bam file • Examine output file and learn about the information contained in the various fields Introducing… Dr. Eric Rouchka • Bioinformatics Core Director • Department of Computer Engineering and Computer Science • University of Louisville • Kentucky Biomedical Research Infrastructure Network