Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617
Lecture 15: High-throughput sequencing and bioinformatics
Marshall University Genomics Core Facility

High-throughput sequencing
• High-throughput sequencing (a.k.a. next-generation sequencing, or NGS) is an experimental technique that can sequence large numbers of DNA fragments at one time
• Basic idea:
  – Take a DNA sample, denature and fragment it into segments up to a few hundred base pairs long
  – DNA is attached to a substrate (flow cell)
  – Special nucleotide bases are allowed to anneal to single-stranded DNA samples
    • Adapted so that only one base attaches at a time
    • Different fluorescent dyes attached to each of A, C, G, T
  – Flow cell is scanned with an optical scanner to determine the base added to each fragment
  – “Stops” and dyes are removed from the attached base, so another base can be attached
  – Repeat up to 100-150 times
  – Can then “turn around” and sequence the same fragments from the other end

Applications of NGS
• Many different applications of NGS
  – Genome sequencing
    • Sequence DNA from samples, identify variants
    • Potentially identify causal variants for disease
  – Exome sequencing
    • Sequence only the coding portions of the genome
  – RNA sequencing (RNA-Seq)
    • Collect RNA samples, build complementary DNA, sequence the DNA
    • Can count the number of reads mapping to each known gene to measure gene expression
    • Sequences can show transcripts
      – Identify different splice variants for genes
  – Many others

Output from NGS
• A single run on an Illumina sequencer can read up to 3 billion DNA fragments
  – 2 reads of 100 bases per fragment, so up to 6×10¹¹ (600,000,000,000) bases per experimental run
• This is a lot of data
• How to process it?
• Some standard pipelines exist for common types of experiments
• Most start with aligning the reads to a known (“reference”) genome

Reference Genomes
• Human Genome Project was completed in 2003
  – Technically, the first complete draft
• Still an ongoing project
• Basically a sequence of “consensus” bases for each chromosome
• Raw sequences like this are stored in a FASTA file
  – Can be a single file, or one file per chromosome
  – Very simple text file format
    • Line containing the name of a sequence starts with ‘>’
    • Other lines contain bases, conventionally at most 80 per line

Sequencing Reads
• Output from a sequencer consists of a collection of reads
  – Typically 100 bases per read
  – Millions (sometimes even billions) of reads per sample
• Each base also has a quality score associated with it
  – Estimate of how confident the sequencing software is in making the call
• Reads are stored in FASTQ files

Alignment
• Alignment involves finding the location in the genome of a read
  – Which chromosome, which base number?
• In theory, could scan the entire genome and look for a match of the read to the sequence
  – However, must account for variation
    • Natural biological variation
    • Sequencing error
• Alignment needs to find the “best match” in the entire genome
  – Remember, mammalian genomes are around 3 billion bases long
  – And we have up to 3 billion reads per run of the sequencer
  – So potentially up to 9×10¹⁸ (9,000,000,000,000,000,000) things to check
  – Very sophisticated algorithms are needed to make this a viable task
    • Typically give “best approximation” results
  – Most commonly used alignment software: BWA and Bowtie

Alignment output
• Result of alignment is a SAM (“Sequence Alignment/Map”) file, or its binary version, a BAM file
  – For each read, describes the sequence to which it mapped (i.e. the chromosome), how it mapped (were there “gaps”?), the quality of the mapping (how certain?), and other data
• These files (and others) can be viewed in a genome viewer
  – The Integrative Genomics Viewer (IGV) from the Broad Institute and UCSC’s Genome Browser are the most commonly used

RNA-Seq alignment
• RNA-Seq presents special problems for alignment
• In most eukaryotes, transcripts are spliced
• Remember we start with RNA transcripts
• Use those to construct cDNA
• And sequence the cDNA
• Need to map this back to the reference sequence from which it came
• Introns appear as huge deletions to the aligner

RNA-Seq alignment problem
[Figure: transcription, then fragmentation, then make cDNA and sequence. How to map to genome?]

RNA-Seq alignment solutions
• One option is to ignore the problem
  – Will only align reads that map to a single exon
    • Or have a splice junction close to one end of the read
    • Longer exons will be well mapped
  – Maybe works well enough for simple differential expression experiments
• Another option is to build a reference transcriptome and align to it
  – Basically, take all the known genes and their transcripts, and build a “genome” out of the known transcripts
  – Restricts analysis to known transcripts
  – Splice variation can cause problems

TopHat aligner
• TopHat is a specialized aligner for aligning RNA-Seq reads
  – Like Bowtie, collaboration between UMD and MIT
• Works by (computationally) chopping reads into small chunks (~25 bp) and aligning those
  – Much less likely those will span splice junctions, so many will align directly to the genome
  – Use the aligned chunks to guess where the splice junctions are
  – Then use the splice junctions to align the chunks that failed to align the first time
• Very computationally intensive
  – 50 million reads usually take about 16 hours to align on a 12-core processor
  – Typical experiment may have a dozen or more such samples

Downstream Analysis
• After aligning, analysis varies depending on type of application
• For genomic applications, usually want to know where the variants are compared to the reference genome
  – Which are “interesting”?
• For RNA-Seq, often want to compare expression of genes between samples
  – Similar to microarray studies

Variant Calling
• Variant calling is performed by examining the alignment file (BAM file) and comparing the sequence from the read to the sequence from the reference genome
  – Use quality scores for bases to determine the probability that this is a real variant and not a sequencing error
  – Standard software for this step is samtools

Filtering variants
• In humans, typically a subject will have hundreds of thousands of variants relative to the reference genome
  – About 1 base in every 10,000
• Which of these are of interest?
  – Need to filter them
• Typically, for disease studies, look for variants present in the diseased individuals and not present in the normal individuals
  – In family studies this can remove many variants

Filtering steps for variants
• Still have many variants under consideration
  – Typically remove “common variants”
    • If the variant is known, and has a relatively large frequency in the population (say > 5%), then it probably is not causal for a major disease
    • Simply by evolutionary consideration, or by considering the disease incidence rate
  – Often focus only on variants that occur in coding regions of the genome
  – Can also look at the effect those variants have on the generated protein sequence
    • Synonymous changes (ones which change the DNA sequence but result in the same amino acid) are probably not going to have an effect on phenotype
  – Sophisticated tools (PolyPhen-2 and SIFT) will examine the relevant protein sequence across many species
    • If the sequence is highly conserved across many species, changes to it are likely to be deleterious

Differential Expression analysis by RNA-Seq
• Typical use of RNA-Seq is to determine genes differentially expressed between two sets of samples
  – Similar to microarray studies but with several advantages
  – Not restricted to genes spotted on array
    • Not even restricted to known genes
  – Can distinguish between different splice variants if needed
  – Less statistical noise: get an actual count of reads

Typical pipeline for differential expression analysis
• Align reads to reference genome (preferably using TopHat)
• For each known gene, count the number of reads for each sample intersecting that gene
  – Can also do this for each exon instead of each gene if you want differential transcript analysis
• Normalize the counts by the total number of counts for the sample
• For each gene, compare the normalized counts for one set of samples to the normalized counts for another set of samples
• Generate a p-value
• Correct for multiple hypothesis testing

Annotation and Variation databases
• Pipelines for variant filtering and RNA-Seq analysis relied not just on the reference genome but on knowing where the genes are in that genome
  – For RNA-Seq, had to count reads intersecting each gene
  – For variant filtering, wanted to know which variants were in coding exons
  – Variant filtering also relied on “known variants”
• These steps rely on additional databases
• Two main sources:
  – NCBI: National Center for Biotechnology Information
    • Part of National Institutes of Health (NIH)
  – EBI: European Bioinformatics Institute
    • Part of European Molecular Biology Laboratory (EMBL)
  – UCSC maintains a database linking to both of these
    • Kind of meta-database

Strategies for using annotation and variation databases
• There are online and some standalone tools for interacting with these databases
• Because of the scale of the data, however, at some point using these requires writing computer code
• For most use-cases, download a text file and write code to read and process it
  – Usually download from genome.ucsc.edu

Custom analyses
• Some analyses don’t have “off-the-shelf” data analysis pipelines
  – Have to be created ad hoc
• Current project: reduced representation bisulphite sequencing (RRBS)
  – Technique to use NGS to identify sites in the genome which are methylated
    • Methylation affects transcription
    • Known mechanism for turning gene expression on and off

RRBS
• DNA is first cleaved by the MspI enzyme
  – Cleaves DNA at sites matching CCGG
• Then size-selected by gel to select only fragments between 30 and 350 bases in length
• Treat with bisulphite
  – Converts unmethylated cytosines (C) to uracil
    • Reads as T in sequencing
  – But leaves methylated cytosines alone

Bismark
• Some tools exist for RRBS analysis, but are fairly primitive
  – Bismark is an aligner/methylation caller developed by the Babraham Institute
  – Basic idea:
    • Convert all Cs in reference genome to Ts
    • Temporarily convert all Cs in reads to Ts
    • Align converted reads to converted genome
    • Un-convert genome and reads
    • Ts in reads that mapped to Cs in reference are likely to be unmethylated Cs
    • Cs in reads that mapped to Cs in reference are likely to be methylated Cs

RRBS Pipeline
• Before we align and call methylation, want to do some quality control analysis
  – Did the experimental protocol produce the results we expected?
• All reads should begin CGG
  – Cleavage step drops the first base of the CCGG cleavage site
• Should be able to identify potential targets of alignment
  – Find all occurrences of CCGG in the reference genome
  – Find all fragments of the genome between these sites with length between 30 and 350 bases
  – All reads should align to the beginning or end of these fragments

A note on computational power
• Suppose we attempted the task “Find all occurrences of CCGG in the reference genome” by hand
• Optimistically assume you can scan 30 bases per second looking for these
• At 3 billion bases, this would take 100,000,000 seconds
  – 1,666,667 minutes, or 27,778 hours, or 1,157 days
  – i.e. 3.17 years of continuous work
  – or about 14 years full time at 40 hours/week
• My laptop can complete this task in a couple of minutes
• Remember aligning a dozen RNA-Seq samples takes a weekend on our 22×12 CPU computer cluster

Next pipeline steps
• Having confirmed reads are located as expected, look at methylation calls
• Compare methylation status between different cell lines and identify locations where methylation is different
  – Use Fisher’s exact test to compare number of methylated and unmethylated calls between cell lines
  – Correct p-value for multiple hypothesis testing
• Look for “clusters” of differences
• Filter
  – Which are in genes, or perhaps just upstream of genes (in the promoter region)
  – These are likely to affect expression

Summary
• NGS data analysis involves managing and manipulating large amounts of data
• Eventually, some programming skills are necessary
• Statistical analysis is usually involved at the end of the pipeline
• Potential for very powerful analyses and discoveries
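
Worked example: normalizing read counts
The differential expression pipeline includes the step “normalize the counts by the total number of counts for the sample”. The sketch below shows one way that step might look. The function name `counts_per_million` and the scaling to one million reads are assumptions made for illustration; the lecture only states that counts are divided by the sample total, and real pipelines often use more elaborate normalization.

```python
def counts_per_million(gene_counts):
    """Scale raw per-gene read counts by the sample's total read count.

    gene_counts: dict mapping a gene name to its raw read count in one sample.
    The factor of 1,000,000 is an illustrative convention, not taken from
    the lecture; it just makes the normalized values easier to read.
    """
    total = sum(gene_counts.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: count * 1_000_000 / total
            for gene, count in gene_counts.items()}


# Two hypothetical samples sequenced to different depths
sample_a = {"geneA": 10, "geneB": 30}
sample_b = {"geneA": 100, "geneB": 300}
print(counts_per_million(sample_a))
print(counts_per_million(sample_b))
```

After normalization the two samples give the same values for each gene, even though sample_b was sequenced ten times deeper. That is the point of the step: differences in sequencing depth should not look like differences in expression.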
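
Worked example: the Bismark idea on a toy sequence
The “basic idea” listed on the Bismark slide can be illustrated with a few lines of Python. This is a toy sketch, not Bismark's actual implementation: it handles only the forward strand, and an exact string search (`str.find`) stands in for a real aligner. The names `c_to_t` and `call_methylation` are hypothetical.

```python
def c_to_t(seq):
    """Convert every C to T, as done to both genome and reads before aligning."""
    return seq.upper().replace("C", "T")


def call_methylation(read, reference):
    """Return a list of (reference position, call) for one bisulphite read.

    Returns None if the converted read does not match the converted
    reference anywhere. Exact matching is a stand-in for a real aligner.
    """
    offset = c_to_t(reference).find(c_to_t(read))
    if offset < 0:
        return None
    calls = []
    # "Un-convert": compare the original read to the original reference
    for i, base in enumerate(read.upper()):
        if reference[offset + i].upper() != "C":
            continue
        if base == "C":
            calls.append((offset + i, "methylated"))
        elif base == "T":
            calls.append((offset + i, "unmethylated"))
    return calls


reference = "AACGGTTCGATCCA"
# Read covering positions 2-11; the C at position 2 was methylated (still C),
# the Cs at positions 7 and 11 were unmethylated (now read as T).
read = "CGGTTTGATT"
print(call_methylation(read, reference))
```

Converting both sequences first is what lets the read find its place in the genome at all: without it, every unmethylated C would look like a mismatch to the aligner.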
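
Worked example: finding CCGG sites and candidate RRBS fragments
The quality-control steps on the RRBS Pipeline slide (find all occurrences of CCGG, then keep fragments of 30 to 350 bases) can be sketched as follows. This is a simplified illustration on a single strand. It assumes the cut falls between the first and second base of each CCGG site, consistent with the slide's statement that the cleavage step drops the first base, and it measures fragment length as the distance between consecutive cut positions. The function names are hypothetical.

```python
import re


def find_sites(seq, motif="CCGG"):
    """Return the 0-based start position of every occurrence of motif."""
    # The lookahead also reports overlapping matches, which matters for
    # motifs that can overlap themselves (CCGG cannot).
    return [m.start() for m in re.finditer(f"(?={motif})", seq.upper())]


def rrbs_fragments(seq, min_len=30, max_len=350):
    """Return (start, end) half-open intervals of size-selected fragments.

    Assumes a cut one base into each CCGG site, so every fragment between
    two sites begins with CGG.
    """
    cuts = [pos + 1 for pos in find_sites(seq)]
    return [(start, end)
            for start, end in zip(cuts, cuts[1:])
            if min_len <= end - start <= max_len]


# Toy "genome": three CCGG sites, 44 and 14 bases apart
genome = "AACCGG" + "A" * 40 + "CCGG" + "T" * 10 + "CCGG" + "AA"
print(find_sites(genome))
for start, end in rrbs_fragments(genome):
    print(start, end, genome[start:start + 3])
```

Only the 44-base fragment passes size selection here; the 14-base fragment is discarded. On a real genome the same scan would be run per chromosome over a FASTA file, which is the task the laptop finishes in a couple of minutes.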
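
Worked example: Fisher's exact test with multiple-testing correction
The “Next pipeline steps” slide compares methylated and unmethylated call counts between cell lines with Fisher's exact test and then corrects the p-values. The sketch below uses only the Python standard library. The lecture does not say which correction is used; the Benjamini-Hochberg procedure here is an assumption chosen for illustration, and the function names and example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: cell line 1, cell line 2. Columns: methylated, unmethylated calls.
    Sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # largest p-value first
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical call counts at three sites:
# (meth line 1, unmeth line 1, meth line 2, unmeth line 2)
sites = [(18, 2, 3, 17), (10, 10, 9, 11), (3, 0, 0, 3)]
pvals = [fisher_exact_two_sided(*s) for s in sites]
print(pvals)
print(benjamini_hochberg(pvals))
```

With one test per CpG site there can be hundreds of thousands of tests, so the correction step matters: without it, many sites would appear different between cell lines by chance alone.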