last time
• pbx1 assignment: find the location of the probes in another one of the probesets for zebrafish.
• Read limma documentation
• Run limma on your data set
• Be sure you have your Galaxy account set up

pbx1
UCSC Genome Browser on Zebrafish Jul. 2010 (Zv9/danRer7) Assembly
chr2:19,708,833-19,758,832

limma

From gene list to interpretation
• limma will generate a list of probeset ids for differentially expressed genes
  – What next?
• Convert the probeset ids to gene symbols
• Look for enrichment of functional terms associated with the genes in your list
http://david.abcc.ncifcrf.gov/

RNA Seq
• Use of next-generation sequencing technology (NGS) to measure RNA levels
• RNA Seq advantages:
  – Wider dynamic range compared to microarray technology
  – Not dependent on known genome annotations
  – Higher throughput compared to microarray technology
• RNA Seq challenges:
  – Specificity versus completeness of alignments, especially for short sequence reads
  – Manipulation and analysis of large files
  – Data storage costs

RNA Seq Library Prep
http://www.geospiza.com/finchtalk/uploaded_images/rna-seq-steps-786705.png

Sequencing Technologies
http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Sequence “Space”
• Roche 454 – Flow space
  – Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
  – Flow space describes sequence in terms of these base incorporations
  – http://www.youtube.com/watch?v=bFNjxKHP8Jc
• AB SOLiD – Color space
  – Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
  – Each base sequenced twice
  – http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
• Illumina/Solexa – Base space
  – Single base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
  – Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
  – http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
• GenomeTV – Next Generation
Sequencing (lecture)
  – http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html

Further Reading
• Metzker, ML. (2010) Sequencing technologies – the next generation. Nature Reviews Genetics 11:31-46.

Short Read Archive
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?

Short Read Archive Handbook
http://www.ncbi.nlm.nih.gov/books/NBK47528/

Aspera Connect
http://www.asperasoft.com/en/products/client_software_2/aspera_connect_8
High performance file transfer for getting data from the Short Read Archive

SRA Toolkit
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

RNA Seq Workflow
• RNA Seq – FASTQ file format
• Alignment – SAM file format
• Annotation – GTF, BED file format
• Alignment Counts – RPKM
• Statistical analysis

FASTQ: Data Format
• FASTQ
  – Text based
  – Encodes sequence calls and quality scores with ASCII characters
  – Stores minimal information about the sequence read
  – 4 lines per sequence
    • Line 1: begins with “@”, followed by the sequence identifier and an optional description
    • Line 2: the sequence
    • Line 3: begins with “+”, optionally followed by the sequence identifier and description
    • Line 4: encoding of quality scores for the sequence in line 2 (one character per base)
• References/Documentation
  – http://maq.sourceforge.net/fastq.shtml
  – Cock et al. (2009). Nuc Acids Res 38:1767-1771.

FASTQ Example
For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina stores quality scores ranging from 0-62, whereas Sanger quality scores range from 0-93. Solexa quality scores are on a different scale and have to be converted to PHRED quality scores.
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
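The four-line FASTQ record layout and the quality score conversions described above can be sketched in Python. This is a minimal illustration, not a replacement for a tested converter. It assumes unwrapped records (exactly 4 lines each). The ASCII offsets (33 for Sanger, 64 for Illumina 1.3+) and the Solexa-to-PHRED formula are taken from Cock et al., not from the slides, and all function names and the demo read are hypothetical.

```python
import io
import math
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading "@" on line 1
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (offset 33 = Sanger)."""
    return [ord(char) - offset for char in quality]


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode Illumina 1.3+ qualities (PHRED, offset 64) as Sanger (offset 33)."""
    scores = phred_scores(quality, offset=64)
    if min(scores, default=0) < 0:
        raise ValueError("quality string is not Illumina 1.3+ encoded")
    return "".join(chr(score + 33) for score in scores)


def solexa_to_phred(solexa_score: float) -> float:
    """Map a Solexa score onto the PHRED scale (formula from Cock et al.)."""
    return 10 * math.log10(10 ** (solexa_score / 10) + 1)


# Hypothetical single-read example in Illumina 1.3+ encoding.
demo = io.StringIO("@read1 hypothetical example\nGATTACA\n+\nhhhhhhh\n")
record = next(read_fastq(demo))
print(record.identifier, phred_scores(illumina13_to_sanger(record.quality)))
```

The two scales differ most at low quality and agree closely at high quality, so the Solexa conversion matters mainly for poor base calls.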
Example Data
Data deposited in GEO with accession id GSE20846
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20846
http://www.ncbi.nlm.nih.gov/sra?term=SRP002119
SRP002119 (study/project)
SRX017794 (experiment)
SRS025246 (sample)
SRR037945 (run)
SRR037946 (run)

SRA to FASTQ
• NCBI’s SRA Tools contains utilities to convert SRA format to FASTQ
  – fastq-dump
• If the utilities and the SRA-formatted file are in the same directory, the command line is:
    fastq-dump <name of sra formatted file>
NOTE: Downloading and working with next generation sequence data will very quickly exceed the capacity of a typical desktop or laptop computer. You will need appropriate infrastructure in place to work with these files, or consider scalable Cloud storage and compute services!

TopHat
http://tophat.cbcb.umd.edu/
TopHat is better suited to aligning RNA Seq data than unspliced aligners such as Maq or BWA because it takes splicing into account during the alignment process.
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Trapnell et al. (2009). Bioinformatics 25:1105-1111.

TopHat is built on the Bowtie alignment algorithm.
Figure from: Trapnell et al. (2009). Bioinformatics 25:1105-1111.

SAM (Sequence Alignment/Map)
• It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
  – SAM is the output of aligners that map reads to a reference genome
  – Tab delimited, with a header section and an alignment section
    • Header lines begin with @ (the header section is optional)
    • Each line in the alignment section has 11 mandatory fields
  – BAM is the binary format of SAM
http://samtools.sourceforge.net/

Mandatory Alignment Fields
http://samtools.sourceforge.net/SAM1.pdf

Alignment Examples
Alignments in SAM format
http://samtools.sourceforge.net/SAM1.pdf

Cufflinks
http://cufflinks.cbcb.umd.edu/
• Assembles transcripts,
• Estimates their abundances, and
• Tests for differential expression and regulation in RNA-Seq samples
Trapnell et al. (2010). Nature Biotechnology 28:511-515.
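The SAM layout described above (optional header lines starting with @, then tab-delimited alignment lines with 11 mandatory fields) can be read with a few lines of Python. This is a sketch for illustration only; in practice a dedicated library or samtools would be used. The field names follow the current SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); older versions of the specification name fields 7 to 9 MRNM, MPOS and ISIZE. The class name, helper names and demo lines are made up.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class SamAlignment(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # read quality string
    optional: Tuple[str, ...]  # any TAG:TYPE:VALUE fields after column 11


def parse_sam(lines: Iterable[str]) -> Iterator[SamAlignment]:
    """Yield alignment records, skipping the optional @-prefixed header lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        if len(fields) < 11:
            raise ValueError(f"expected 11 mandatory fields, got {len(fields)}")
        yield SamAlignment(
            fields[0], int(fields[1]), fields[2], int(fields[3]),
            int(fields[4]), fields[5], fields[6], int(fields[7]),
            int(fields[8]), fields[9], fields[10], tuple(fields[11:]),
        )


def is_unmapped(alignment: SamAlignment) -> bool:
    return bool(alignment.flag & 0x4)   # flag bit 0x4: read is unmapped


def is_reverse_strand(alignment: SamAlignment) -> bool:
    return bool(alignment.flag & 0x10)  # flag bit 0x10: read maps to reverse strand


# Made-up lines written in the style of the SAM specification's examples.
demo_lines = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:ref\tLN:45",
    "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
    "r003\t16\tref\t29\t30\t6H5M\t*\t0\t0\tTAGGC\t*\tNM:i:0",
]
for alignment in parse_sam(demo_lines):
    print(alignment.qname, alignment.rname, alignment.pos,
          "reverse" if is_reverse_strand(alignment) else "forward")
```

Keeping the optional tags as raw strings avoids guessing at their types; a fuller parser would decode them according to the TYPE code.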
Cufflinks Output
• Gene expression
• Transcript expression
• Assembled transcripts

Annotations
• Mapping reads to specific transcripts/genes

Data Visualization
• UCSC Browser (accessible from Galaxy)
• Trackster (native to Galaxy)
External visualization tools:
• Genome Workbench
  – http://www.ncbi.nlm.nih.gov/projects/gbench/
• Integrative Genomics Viewer (IGV)
  – http://www.broadinstitute.org/igv/

Statistical Analysis
• Once the mapping and genome summarization are done, the data can be analyzed just like any other count data
• Bullard, et al. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11:94.

Typical RNA Seq Project Work Flow
[Flowchart: Tissue Sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon Expression → Visualization and Statistical Analysis]
JAX Computational Sciences Service
See Tutorial 1

Galaxy
http://main.g2.bx.psu.edu/
• Build and share data and analysis workflows
• No programming experience required
• Strong and growing development and user community

RNA Seq Workflow
• Convert data to FASTQ
• Upload files to Galaxy
• Quality Control
  – Throw out low quality sequence reads, etc.
• Map reads to a reference genome
  – Many algorithms available
  – Trade off between speed and sensitivity
• Data summarization
  – Associating alignments with genome annotations
  – Counts
• Data Visualization
• Statistical Analysis

[Galaxy interface screenshot; panels: Tools, Dialog/Parameter Selection, History]

Uploading Data to Galaxy
Because of the size of most sequence files, it is necessary to use FTP to get files to Galaxy.
Select the appropriate reference genome at the time of data upload.
You can upload compressed files; they will be uncompressed upon loading into Galaxy.

Tutorial Web Site
http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml
This site will be accessible after the meeting. Check back for updates and new tutorials.
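Worked example for the data summarization step: alignment counts are commonly normalized to RPKM (reads per kilobase of transcript per million mapped reads) so that genes of different lengths and libraries of different sizes can be compared. The sketch below uses the standard definition, RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the gene, N the total number of mapped reads in the library, and L the gene (exon) length in bp. The function name and all numbers are hypothetical.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical counts: gene name -> (reads mapped to the gene, gene length in bp)
counts = {
    "geneA": (500, 2000),
    "geneB": (500, 4000),  # same count, twice the length -> half the RPKM
}
library_size = 10_000_000  # hypothetical total number of mapped reads

for gene, (reads, length_bp) in counts.items():
    print(gene, rpkm(reads, length_bp, library_size))
```

RPKM is a descriptive normalization; the statistical methods compared by Bullard et al. typically start from the raw counts instead.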
next time
• Analyze project data with DAVID
  – Convert probeset ids to genes
  – Look for enrichment of functional terms
• Try the first part of Tutorial 5 in Galaxy