BIT 815: Analysis of Deep Sequencing Data

Overview:
• This course will cover methods for analysis of data from Illumina and Roche/454 high-throughput sequencing, with or without a reference genome sequence, using free and open-source software tools with an emphasis on the command-line Linux computing environment*

Lecture Topics:
• Types of samples and analyses
• Experimental design and analysis
• Data formats and conversion tools
• Alignment, de-novo assembly, and other analyses
• Computing needs and available resources
• Annotation
• Summarizing and visualizing results

Labs:
Lab sessions meet in a computing lab, and will provide students with hands-on experience in managing and analyzing datasets from Illumina and Roche/454 instruments, covering the same set of topics as the lectures. Example datasets will be available from both platforms, for both DNA and RNA samples; students who have their own datasets may contact the instructor prior to the course to discuss opportunities for analysis of their data during the lab sessions.
* see http://www.physics.ubc.ca/mbelab/computer/linux-intro/html/ for an overview

Introduction to the course and to each other
- background in biology, computing, and sequencing
- experiments of interest to participants

Course structure
- 3 two-hour blocks per week, one theme per week
  * ~ 45 min lecture/discussion
  * ~ 70 min lab exercises
- some assigned reading
- participation in classroom discussion is expected
- no exams

Course Objective
- to teach you how to teach yourself

The sequencing rate is growing faster than Moore’s Law
[Figure from Stein (2010) Genome Biology 11:207]

An alternative perspective from an independent source
[Figure annotations: doubling time 19.8 months; doubling time 2 months; doubling time 7.3 months]

Sequence data analysis is changing rapidly
- relatively few methods are completely static
- much of the software is still under active development
- new methods and tools are reported every month
- staying on the learning curve is essential

Why use Linux for sequencing data analysis?
- it is well-suited to the task
  * preferred development platform for most tools
  * modular design: thousands of independent programs
  * however … it’s built for speed, not for comfort

Modular design in Linux - a ‘toolbox’ approach
• Individual components of the Linux operating system are written as separate programs
• Different programs can have similar functions
• A Linux “distribution” is a collection of programs that work together as an operating system
• Users have the power to add new programs, or take away existing programs that are not being used, to optimize system performance

[Figure: A map of the software components of the kernel]

Why is modularity an advantage?
- adding new software is relatively straightforward
- the operating system can be continually upgraded
- adding tools to the toolbox is easy
- your analyses are limited only by your imagination; the tools to carry them out are probably already in place

There is always more than one way to do it
- some sequence analysis tasks have matured to stability
- most have not, and are still changing
- ‘best practices’ are also changing, and subject to dispute
- staying on the learning curve is essential

Key principles from the Eric Raymond book chapter
• Clarity is better than cleverness. Document everything you do, because you won’t remember what you did, or why
• Programmer time is more expensive than machine time. Don’t worry about optimizing things unless it is necessary
• Prototype before polishing: get it working before you optimize it. It is often easiest to start with something very simple, then add complexity and capability in steps
• Design for simplicity; add complexity only where you must. “Make things as simple as possible, but no simpler” (paraphrased from Albert Einstein)

Computational thinking - four general principles
• Decompose a complex problem to simple steps. Linux is based on simple tools that do one thing well; these tools require problems to be framed in simple terms.
• Look for patterns. Recognizing similarities among different types of problems allows re-use of the same tools in new contexts.
• Generalize patterns to create abstract versions. A tool is most powerful when it can be applied to a variety of problems that all share common features
• Combine simple tools into more complex pipelines. Repetitive tasks are what computers are good at. Our job is to build the algorithms, or sequences of simple steps, that allow the computer to do those repetitive tasks so we don’t have to.
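
Computational thinking - a small sketch
The four principles above can be illustrated with a few lines of Python. This is only a minimal sketch and is not part of the course materials: the function names and the example text are hypothetical. Each function does one simple thing, and the last one combines them into a small pipeline, much as shell tools are chained together on the command line.

```python
# Illustrative only: these function names are hypothetical, not from the course.

def read_lines(text):
    """Step 1: split a block of text into individual lines."""
    return text.splitlines()


def keep_matching(lines, prefix):
    """Step 2: keep only the lines that start with the given prefix."""
    return [line for line in lines if line.startswith(prefix)]


def count(items):
    """Step 3: count how many items are left."""
    return len(items)


def count_lines_with_prefix(text, prefix):
    """Combine the three simple steps into one small pipeline."""
    return count(keep_matching(read_lines(text), prefix))


# A made-up example: three named records, each followed by a sequence line.
example = ">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
print(count_lines_with_prefix(example, ">"))  # prints 3
```

Because each step is general (any text, any prefix), the same pieces can be re-used for a different pattern without rewriting them.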
File Globbing Exercises
• Download FileGlobbing.pdf from the course website, and also get smallfiles.zip if you are not using a BIT laptop (http://www4.ncsu.edu/~rosswhet/BIT815/Spring2013/list.html)
• Note the complexity of the explanation: far more information than you need at first
• Start with the simplest forms, and work up from there
• Right-click on the bit815 (or smallfiles) folder in Documents, and choose “Command Prompt Here” from the drop-down menu.
• Use ls *, ls *.fq, ls smallread[12].fq, ls small*, and other commands to explore what works and what doesn’t

File and Directory Commands
• The Software Carpentry videos introduced several commands related to directories and files; most if not all of those will work in the Gnu On Windows or Mac Terminal command line
• Start by creating a new directory:
    mkdir sandbox
• Change to the new directory, and create some files there:
    cd sandbox
    touch file1 file2 file3 file4 file5 file6 file7 file8 file9
• List those files, using the short version or the long one:
    ls
    ls -la
• Note that all files are empty (0 bytes in column 5)

File and Directory Commands
• Create another directory within the sandbox directory:
    mkdir bucket
• Create a symbolic link (equivalent to a Windows shortcut):
    ln -s ../smallread1.fq mylink
• Do another long directory listing and look at the output. What information is available there?
    ls -la

File and Directory Commands
    drwxrwxr-x 2 ross ross 4096 Mar 1 11:05 bucket
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file1
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file2
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file3
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file4
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file5
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file6
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file7
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file8
    -rw-rw-r-- 1 ross ross    0 Mar 1 11:05 file9
    lrwxrwxrwx 1 ross ross   16 Mar 1 11:06 mylink -> ../smallread1.fq
• Use pwd to make sure you are still in the sandbox directory
• Try removing everything in the directory with rm *
  What happens?

File and Directory Commands
• Remove the bucket directory using rmdir bucket
• Be very careful when using file globbing with the rm command, because there is no undelete in Linux
• When something is deleted, it is gone forever. Always make sure you know what directory you are in, and which files will be affected by any globbing characters you use.

Regular Expression Exercises
• Download RegularExpressions.pdf from the course website (http://www4.ncsu.edu/~rosswhet/BIT815/Spring2013/list.html)
• Read through the document and note that (as with file globbing) there are different ways to describe the same pattern
• Open Notepad++ from the Programs menu in Windows, navigate to the Documents/bit815 folder, and open smallread1.fq
• Use Ctrl-H to open the search and replace dialog box, and set Search Mode to Regular expression

Regular Expression Exercise
• Enter the following expression in the Find what: box
    ^@1:([0-9]{1,3}):([0-9]{2,5}):([0-9]{2,5}):([YN])
• Enter the following expression in the Replace with: box
    @InstrID_FlowcellID_lane1_tile\1_xcoord\2_ycoord\3_pass\4
• Click the Find Next button to see if the pattern matches anything in the file
• If not, make sure you entered it correctly.
  If so, click Replace to see what happens
• If it works for one example, click Replace All and scroll through the file to see the results

Sequencing technology overview
- Two different systems on campus: Illumina GAIIx, 454
- A similar overall strategy for highly-parallel sequencing
- Different approaches taken at virtually every step
- These different platforms produce data with different characteristics
- Other platforms are available off-campus, but are not a focus of the course

Similarities
- DNA molecules are fragmented and ligated to adaptors
- individual DNA molecules are immobilized on a surface
- a series of nucleotide addition reactions is carried out
- the nucleotide added is detected after each addition
- a data file is produced containing the DNA sequences of many fragments

Sequencing technology overview - 454
DNA fragmentation, usually by sonication
Adaptor oligonucleotide addition
Images from www.454.com

Sequencing technology overview - 454
A single molecule is immobilized on a bead
PCR amplification in oil-water emulsion creates ~10 million copies per bead

Sequencing technology overview - 454
DNA-containing beads are deposited in wells of a PicoTiterPlate, along with smaller beads with immobilized enzymes for light production
“Pyrosequencing” produces light when any nucleotide is incorporated, so only a single nucleotide is provided during a cycle, and light output is recorded during each cycle

Sequencing technology overview - 454
A ‘flowgram’ shows light output from each cycle of base addition
One flowgram is produced for each of the ~1 million wells in a PicoTiterPlate
[Figure label: TACG ‘key’ sequence]

Sequencing technology overview - Illumina
Illumina uses a glass ‘flowcell’, about the size of a microscope slide, with 8 separate ‘lanes’.
The GAIIx instrument focuses the laser and light detection system only on one of the two surfaces inside the flowcell; the newer HiSeq instrument scans both surfaces and therefore doubles the yield of sequence data per lane. Additional improvements in scanning and increases in cluster density bring the overall gain closer to 4x or 5x more data from a HiSeq.

Sequencing technology overview - Illumina
Fragment DNA, ligate adaptor oligos
Single-stranded DNA binds to flowcell surface

Sequencing technology overview - Illumina
Surface-bound primers are extended by DNA polymerase across annealed ssDNA molecules, the DNA is denatured back to single strands, and the free ends of immobilized strands anneal again to oligos bound on the surface of the flowcell. This ‘bridge PCR’ continues until a cluster of ~ 1000 molecules is produced on the surface of the flowcell, all descended from the single molecule that bound at that site. After PCR, the free ends of all DNA strands are blocked.

Sequencing technology overview - Illumina
Another perspective of the amplification process, showing the clusters of products

Sequencing technology overview - Illumina
[Figure labels: Cycle 1, Cycle 2, Cycle 3, Cycle 4, Cycle 5; sequences GCTGA, CTTAG, TAAGT, AGCCG]
Although four different colors are used for the fluorescent nucleotides, only two lasers are used to excite the fluorescence. The fluorescent labels are grouped in pairs: labels on A and C are excited by a red laser, and labels on G and T are excited by a green laser. The software assumes that signal from both lasers will be balanced at each cycle. Because A and C share a laser, distinguishing the A signal from the C signal is more difficult for the instrument than distinguishing A from G or A from T; the same applies to G versus T. Base substitution errors are the most common type of sequencing error for Illumina instruments.
Understanding FASTQ format
or “what do all these symbols mean?”
[Figure labels: Instrument ID, flowcell, lane, tile, X, Y, barcode, read#; header lines, sequence, quality scores]
• Quality scores are numbers that represent the probability that the given base call is an error.
• These probabilities are always less than 1, so the score is given as minus 10 times the base-10 logarithm of the probability: Q = -10 × log10(P)
• For example, an error probability of 0.001 (1 × 10^-3) is represented as a quality score of 30.
• The numbers are converted into text characters so they occupy less space: a single character carries as much information as two digits plus a space between adjacent values

Understanding FASTQ format
Illumina v1.8 header version:
    @HWI-EAS209:06:FC706VJ:5:58:5894:21141 1:N:ATCACG
[Figure labels: Instrument/flowcell ID, lane, tile, X, Y, barcode, read#; header lines, sequence, quality scores]
Unfortunately, at least four different ways of converting numbers to characters have been used, and header line formats have also changed, so one aspect of data analysis is knowing what you have.

Illumina flowcell geometry (GAIIx)
[Figure: lanes 1 2 3 4 5 6 7 8]
A flowcell has 8 lanes, which are physically separated. Each lane is imaged during each cycle of sequencing in multiple separate images, called ‘tiles’, which are not physically separated. Tiles within a GAIIx lane are numbered from 1 to 60 down the length of the lane, then from 61 to 120 back up the other side.
     1   120
     2   119
    ...  ...
    59    62
    60    61

Illumina flowcell geometry (HiSeq)
[Figure: lanes 1 2 3 4 5 6 7 8]
Tiles within a HiSeq lane are numbered using a different system.
The first digit denotes which surface (1 = lower, 2 = upper), the second denotes a vertical “swath” (1 = left, 2 = middle, 3 = right), and the last two digits denote a tile within that swath (01 means closest to the outflow end of the lane; 08 or 16 means closest to the inflow end of the lane).
    1101  1201  1301
    1102  1202  1302
     ...   ...   ...
    1107  1207  1307
    1108  1208  1308

Command-line Exercises
• Download SAMformatAndCLtools.pdf from the course website (http://www4.ncsu.edu/~rosswhet/BIT815/Spring2013/list.html)
• Read the first two pages of the document. Don’t worry about the “bitwise flag” information; that is for future reference
• Go back to the command-line window that you used for the file globbing exercise: right-click on the folder that contains the example FASTQ and SAM files, and select “Command Prompt Here” from the drop-down menu
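
Decoding FASTQ conventions - a small sketch
Returning to the FASTQ material above (quality scores, the v1.8 header, and HiSeq tile numbers), the conventions can be sketched in Python. This is a minimal sketch under stated assumptions, not a course tool. The function names are hypothetical. The quality offset of 33 is an assumption: the slides note that several different encodings have been used, so check which one your data uses. The slide does not label the second header field; it is assumed here to be a run number.

```python
import math


def phred_from_error_prob(p):
    """Quality score Q = -10 * log10(P) for an error probability P."""
    return -10 * math.log10(p)


def decode_quality(qual_string, offset=33):
    """Turn a FASTQ quality string into a list of numeric scores.

    ASSUMPTION: each character's ASCII code minus `offset` is the score.
    Offset 33 is only one of the encodings that have been used.
    """
    return [ord(ch) - offset for ch in qual_string]


def parse_v18_header(header):
    """Split a v1.8-style header line into named fields."""
    location, read_info = header.lstrip("@").split(" ", 1)
    # ASSUMPTION: the second field (unlabeled on the slide) is a run number.
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read_fields = read_info.split(":")
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read_fields[0]),    # first field after the space
        "barcode": read_fields[-1],     # last field after the space
    }


def decode_hiseq_tile(tile):
    """Split a four-digit HiSeq tile number into surface, swath, and tile."""
    digits = f"{int(tile):04d}"
    return {
        "surface": int(digits[0]),        # 1 = lower, 2 = upper
        "swath": int(digits[1]),          # 1 = left, 2 = middle, 3 = right
        "tile_in_swath": int(digits[2:]),
    }


print(round(phred_from_error_prob(0.001), 1))   # error probability 0.001
print(decode_quality("I5!"))                    # prints [40, 20, 0]
fields = parse_v18_header("@HWI-EAS209:06:FC706VJ:5:58:5894:21141 1:N:ATCACG")
print(fields["lane"], fields["tile"], fields["barcode"])
print(decode_hiseq_tile(1208))
```

The header used in the example is the one shown on the slide; the quality string "I5!" is made up for illustration.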