Course Expectations
Sequencing technology and (very) large datasets
6/1/2015

Goals for the course
• Understand how next-generation sequencing technologies are used in biomedical research
• Learn how to conduct an RNA-seq analysis
• Learn how to analyze gene lists to form hypotheses that can be tested experimentally
• Learn to write a results section for a manuscript

Logistics
• Course website: http://biochem.slu.edu/bchm628/
• Some data will be shared via Google Drive
• Contact:
  • Phone: 977-8858
  • Email: donlinmj@slu.edu
  • Office: DRC 507. Call or email. Usually at WashU on Thursdays
  • Lab: DRC 654

Exercise format
• There will be 5 exercises, each consisting of 2-4 sections. Each section represents a biological question to be answered with bioinformatics tools/resources from that week or earlier weeks.
• You’ll provide the answer in the same format as you would write for the results section of a paper:
  • Why did you do this experiment or analysis?
  • What did you actually do?
  • What did you observe?
  • What does it mean?
• Include supporting data:
  • Figures with figure legends
  • Correctly formatted tables of data

Exercises, cont.
• You will hand in your exercise via email in either Word or PDF format, with supplemental data in Excel, Word or PDF format.
• The exercise should print in portrait orientation.
• The exercise should include a header with your name at the top, and the file should be named: Your Name-Ex #.
• There is a penalty for turning in your exercises after the deadline. The timestamp on your email is the final determination of whether an exercise is on time.

Final project
• This will be a project summary of the analyses that you will do over the course of the 4 weeks.
• You will be asked to choose 3 genes from your gene lists that you would follow up on at the bench.
• You will be asked to give a rationale for the choices you made.
• You will analyze the three genes virtually using some of the tools from weeks 3 & 4.
• You will also be asked to propose additional bench experiments for them.
• The final project will be due July 7th at 3:00 pm.

Data tables
• In general, columns describe attributes and rows contain the individual data.
• The first row contains a header.
• If you have lots of data, it is generally formatted to have more rows than columns.

Table 1: Gene expression for WT cells under conditions X, Y, Z.
Gene name   Log2 (Cond. X/untreated)   Log2 (Cond. Y/untreated)   Log2 (Cond. Z/untreated)
NM_00522    2.56                       3.12                       2.75
NM_06588    -1.25                      -1.02                      -0.98

Table 2: Comparison of clinical parameters for groups 1 and 2.
Clinical parameter   Group 1 (avg ± mean)   Group 2 (avg ± mean)   P-value
ALT/AST ratio        25 ± 1                 35 ± 2                 0.0021
Leukocyte count      1200 ± 32              950 ± 65               0.0512
1 Statistical significance was determined by a Mann-Whitney test
2 Statistical significance was determined by 2-tailed t-test

Data tables, cont.
• For the purposes of this class, tables should be formatted to fit onto a letter-size page in portrait orientation.
• If your table is so wide that it forces the page into landscape orientation, include it as a supplemental attachment to the exercise.
• If the table extends past 1 page, include it as a supplemental attachment.
• Refer to supplemental tables in your write-up, number them, and name the file Name_SuppTable1, etc.
• Supplemental tables can be in Excel format.

Figures
• If you can export the figure from whatever program in jpeg or png format, those can be inserted into a Word document easily.
• PDFs can be converted to other formats using Illustrator.
• There are some online converters: http://www.wikihow.com/Convert-PDF-to-JPEG
• Screen capture and placement may also work.
• Talk to me if you have issues. I won’t be very picky about high resolution.

Figures, cont.
• Figures should have figure legends.
• The figure legends should describe the experiment that led to the data in the figure and include an explanation for any symbols used.
• Figures should be numbered consecutively and should not take up more than ¼ of the page.
• If larger than that, include as supplemental data.
• Create a text box in Word, write the figure legend and then insert the figure above the figure legend. This will allow you to resize as necessary.
• Again, talk to me if you have issues.

Grading
• Exercises: 65%
• Final exam: 25%
• Class attendance: 10%
• Grading policy handout: details about late assignments and tests

Lecture outline
• Overview of sequencing a genome
• Next generation sequencing
• High-throughput experiments by sequencing
• Genome browsers

Genome sequencing
Approach depends on the source, size, complexity and goal for the data for a given organism
• Goal?
  • De novo sequencing
  • Re-sequencing for annotation
  • Sequencing to identify variations
• Size and complexity
  • Virus, bacterial, single-celled eukaryote, mammal, plant
• Sample prep
  • Can it be cultured?
  • Tissue source: unlimited or limited quantities?
  • Virus levels, RNA or DNA

Genome sizes
Organism                      Genome size (base pairs)   Number of genes
Hepatitis C virus             0.01 x 10^6                10
Epstein-Barr virus            0.172 x 10^6               37
Bacterium (E. coli)           4.6 x 10^6                 4406
Yeast (S. cerevisiae)         12.5 x 10^6                6172
Nematode worm (C. elegans)    100.3 x 10^6               19,099
Thale cress (A. thaliana)     115.4 x 10^6               25,498
Fruit fly (D. melanogaster)   128.3 x 10^6               13,601
Corn (Z. mays)                2500 x 10^6                39,469
Human (H. sapiens)            3223 x 10^6                20,500
Wheat (T. aestivum)           5500 x 10^6 (x 3)          ~95,000

Types of questions
• How many genes?
• How many functional genetic elements (miRNAs, ncRNAs)?
• What’s different about this genome compared to another one?
  • Virulence differences in pathogenic organisms
• What is the cause of this particular phenotype?
• What taxonomic groups are represented in this population of bacteria, viruses or fungi?
• How do the gene expression patterns change between samples (across time)?
• Where does this transcription factor bind in the genome?
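One way to read the genome sizes table above is to divide gene count by genome size. The sketch below does that for a subset of rows; the values are copied from the table, while the "genes per Mb" metric and the function name `genes_per_mb` are illustrative additions, not part of the lecture.

```python
# Genome size (millions of base pairs) and gene count, copied from the
# "Genome sizes" table. The density metric is an illustrative assumption.
GENOMES = {
    "E. coli": (4.6, 4406),
    "S. cerevisiae": (12.5, 6172),
    "C. elegans": (100.3, 19099),
    "D. melanogaster": (128.3, 13601),
    "H. sapiens": (3223, 20500),
}


def genes_per_mb(size_mb, n_genes):
    """Return the number of annotated genes per million base pairs."""
    return n_genes / size_mb


for organism, (size_mb, n_genes) in GENOMES.items():
    print(f"{organism:18s} {genes_per_mb(size_mb, n_genes):8.1f} genes/Mb")
```

With these numbers, E. coli comes out near 958 genes/Mb and human near 6 genes/Mb, which is one reason genome size alone says little about gene count.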
Genetic maps
Chromosomal banding patterns
• Stain with Giemsa (G-banding pattern)
• Chromosomes are numbered based on size
• Giemsa binds to phosphate groups & attaches to regions that are AT rich
• Dark regions: heterochromatic, late replicating and AT rich
• Lighter regions: euchromatic, early replicating and GC rich

Chromosome nomenclature
• p (petite) = short arm
• q (queue) = long arm
• Bands are numbered going away from the centromere
• 4q21.1 represents chromosome 4, long arm, 2nd band, 1st sub-band and 1st sub-sub-band

DNA sequencing: Overview
• Gel electrophoresis
  • Predominant in 1980s
• Whole genome strategies
  • Physical mapping (BAC clones)
  • Walking
  • Shotgun sequencing
• Capillary sequencing machines
• Computational fragment assembly
• Next generation technologies
  • Polony based sequencing
  • Novel assembly techniques

Cost/base for DNA sequence
[Figure: cost per base of DNA sequence; axis ticks run from 1.0E+02 to 1.0E-07]

Traditional approach
• Shear the very large genome into smaller chunks
• Clone in vectors that can support large inserts
• Digest and separate on high resolution gel to determine the clone overlap
• Pick minimum number of clones
• Shotgun sequence each clone
• Read the traces and assemble
• Make the gene calls
• Load it into a genome viewer

BAC library in DNA sequencing
[Figure: shotgun sequencing. Sequence each clone; individual sequence reads; contig assembly (Contig A, gap, Contig B)]

Paired reads vs single reads
Single reads
• M13 clones
• Robotic template prep
• Contig A, gap, Contig B
Paired reads
• Plasmids, cosmids, BACs
• Contig A, gap, Contig B: gap closure!!
• Prefer 3-10 mate pairs per gap
• Inserts of different, but known sizes

Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence: ..ACGATTACAATAGGTT..

Some Terminology
• read: a 500-900 long word that comes out of the sequencer
• mate pair: a pair of reads from two ends of the same insert fragment
• contig: a contiguous sequence formed by several overlapping reads with no gaps
• supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
• consensus sequence: sequence derived from the multiple sequence alignment of reads in a contig

• Target: 30X coverage or >30 high quality reads per base
• Assembled into chromosomes
• RefSeq nomenclature:
  • NT: genomic sequence of complete gene
  • NC: chromosome
  • NM: mRNA sequence
  • NP: protein sequence
• Assembly: completed genome, multiple assemblies

Calling the genes
• De novo computer algorithms
  • Identify coding sequences by GC content
  • Start and stop sites
  • Intron/exon boundaries
• Comparison with other known genes
• EST libraries

Sanger method
Misha Angrist

Sanger sequencing reached its technical limits
• Only modestly parallel (394 lanes/machine)
• Long read lengths (500-900 bp) & >99.9% correct
• Need to clone the DNA to obtain enough for the sequencing reaction
• At SLU: cost for typical Sanger sequencing is $5-6/sample with reliable 500 bp of sequence

DNA sequencing timeline

How many sequenced genomes?
• NCBI: >12,000 genomes deposited
• JGI (Joint Genome Institute): 6600 complete, >20,000 draft genomes

NGS sequencing
• Polony: discrete clonal amplifications of a single DNA molecule, grown in a gel matrix. The clusters can then be individually sequenced, producing short reads.
• Polony-based or cluster-based sequencing is the basis of most second generation sequencers
• Typical NGS workflow:
  1. Library construction to add adapters to sequence
  2. Template CLONAL amplification (on a bead or chip)
  3. Massively PARALLEL sequencing

Illumina NGS
Library prep: ~6 hours
  A) Fragment DNA
  B) Repair ends/Add A overhang
  C) Ligate adapters
  D) Select ligated DNA
Cluster generation: ~6 hours
  E) Attach DNA to flow cell
  F) Bridge amplification
  G) Generate clusters
  H) Anneal sequencing primer
Sequencing: 2-6 days
  I) Extend 1st base, read & deblock
  J) Repeat to extend strand
  K) Generate base calls

Illumina HiSeq and MiSeq
• 100-200 bp read lengths
• Available locally with MoGene and Cofactor Genomics
• GTAC (Wash U) has a HiSeq 2000, which has 50 bp single-end reads and 100 bp paired-end reads
• Why not use this for all sequencing?
  • Cost is ~$300-400/library and ~$1100/lane of sequencing
  • Generates Tb of data per run, Gb per lane

Ion Torrent: measures pH changes
• Done on a semiconductor chip

Ion Torrent workflow

Illumina vs Ion Torrent
• Illumina has greater capacity but longer run times
• Latest versions of both have read lengths ~200 bp
• SLU has an Ion Torrent machine
  • Cost is ~$270/sample, including the sequencing
• Can do single- or paired-end reads
• Paired-end reads are 2X cost for library construction, but necessary for de novo genome assembly

Bioinformatics challenges
• Each flow cell in the Illumina HiSeq 2000 can generate a billion bases of sequence
• Raw read files are Tb in size
• Processed read files are 700-800 Mb
• Alignment files are 150-300 Mb
• Assembly of millions of short (75-100 bp) reads into a vertebrate genome
• Need a high-performance compute (HPC) cluster for vertebrate-sized genomes

Sequencing has become a standard technique
• RNA sequencing for expression
• ChIP sequencing for TF site identification
• DNA sequencing for variants
  • Identification of populations/genetic changes in highly variable viruses and bacteria
• Metagenomics
  • Identification of unknown/non-culturable communities of bacteria/viruses/fungi

Why RNAseq over microarray?
• Technical variation is less
• Do not need a sequenced genome
• Greater dynamic range of expression
• Detect transcript isoforms
• Identify novel transcripts
• Identify non-coding RNAs

Data availability
• The public repositories of microarray, RNAseq and other high-throughput expression data are GEO & SRA at the NCBI
• GEO: Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/
  • Tools for downloading as well as querying datasets
  • Array and sequence-based data available
• SRA: Sequence Read Archive (formerly Short Read Archive) http://www.ncbi.nlm.nih.gov/sra
  • Can download raw sequence data (fastq files)

Today in computer lab
• Tutorial on searching NCBI/GEO for large datasets
• Partek Genomics Suite (PGS) tutorial
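The fastq files mentioned under SRA are plain text, so a first look at raw reads needs very little code. The following is a minimal sketch, not the tooling used in the lab: it assumes the common four-line record layout (header, sequence, "+" separator, quality string) with no line wrapping, and a Phred+33 quality encoding. The names `read_fastq`, `mean_phred` and the two example reads are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a FASTQ text stream.

    Assumes four lines per record and no wrapped sequence lines.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Average base quality, assuming Phred+33 encoded characters."""
    return sum(ord(c) - offset for c in qual) / len(qual)


# Two made-up reads, used only to exercise the parser.
EXAMPLE = (
    "@read1\nACGATTACAATAGGTT\n+\n" + "I" * 16 + "\n"
    "@read2\nACGT\n+\n!!II\n"
)

for read_id, seq, qual in read_fastq(io.StringIO(EXAMPLE)):
    print(read_id, len(seq), mean_phred(qual))
```

To run it on a downloaded file, pass an open text file handle to `read_fastq` in place of the `io.StringIO` example.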