Short read alignment
BNFO 601

• Input:
  – Reads: short DNA sequences, usually up to 100 base pairs (bp), produced by a sequencing machine
    • Reads are fragments of a longer DNA sequence present in the sample given as input to the machine
    • Usually number in the millions
  – Genome sequence: a reference DNA sequence much longer than the read length

• Applications
  – Genome assembly
  – RNA splicing studies
  – Gene expression studies
  – Discovery of new genes
  – Discovery of cancer-causing mutations

• Two approaches
  – Hashing-based algorithms
    • BFAST
    • SHRiMP
    • MAQ
    • Stampy (statistical alignment)
  – Burrows-Wheeler transform
    • Bowtie
    • BWA

BFAST overview (PLoS ONE 4(11): e7767)
BFAST algorithm (PLoS ONE 4(11): e7767)
BFAST masked keys

• Empirical performance:
  – Simulated data:
    • Extract random substrings of fixed length from the reference genome and add random mutations and gaps
    • Realign them to the reference genome
  – Real data:
    • Paired reads: two ends of the same molecule
    • Count the number of read pairs whose two ends map within 500 to 10,000 bases of each other

Courtesy of Genome Res. June 2011 21: 936-939
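
The simulated-data evaluation described above can be sketched in a few lines of Python. This is a toy illustration only, not the method of BFAST or any other listed tool: it builds a plain k-mer hash index over a random reference, simulates reads with substitutions only (gaps are left out for brevity), and realigns them by seed lookup followed by a mismatch count. All function names and parameter values (build_index, align, simulate_reads, evaluate, k = 10, read length 50) are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def build_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align(read, genome, index, k, max_mismatches):
    """Seed and verify: look up non-overlapping k-mers of the read in the index,
    then count mismatches of the whole read at each implied start position.
    Returns the best start position, or None if no candidate is close enough."""
    best = None  # (mismatches, start)
    tried = set()
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(genome) or start in tried:
                continue
            tried.add(start)
            d = hamming(read, genome[start:start + len(read)])
            if d <= max_mismatches and (best is None or d < best[0]):
                best = (d, start)
    return None if best is None else best[1]


def simulate_reads(genome, n_reads, read_len, n_mutations, rng):
    """Extract random fixed-length substrings and substitute n_mutations bases.
    Returns (true_start, read) pairs. Gaps are not simulated in this sketch."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for pos in rng.sample(range(read_len), n_mutations):
            bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
        reads.append((start, "".join(bases)))
    return reads


def evaluate(genome_len=5000, n_reads=200, read_len=50, k=10, n_mutations=2, seed=0):
    """Fraction of simulated reads that realign to their true origin."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    index = build_index(genome, k)
    reads = simulate_reads(genome, n_reads, read_len, n_mutations, rng)
    correct = sum(align(r, genome, index, k, n_mutations) == s for s, r in reads)
    return correct / n_reads


if __name__ == "__main__":
    print("fraction of reads realigned correctly:", evaluate())
```

With these settings a read of length 50 contains five non-overlapping 10-mers, so two substitutions always leave at least one seed intact and the true position is always among the candidates. Real aligners must also handle gaps, repeats, and genomes far too large for a naive index, which is where masked keys and the Burrows-Wheeler transform come in.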