March 22 - Mouse Genome Informatics

last time
• pbx1 assignment: find the location of the probes
in another one of the zebrafish probesets.
• Read limma documentation
• Run limma on your data set
• Be sure you have your Galaxy account set up
pbx1
UCSC Genome Browser on Zebrafish Jul. 2010 (Zv9/danRer7) Assembly
chr2:19,708,833-19,758,832
limma
From gene list to interpretation
• limma will generate a list of probeset ids for
differentially expressed genes
– What next?
• Convert the probeset ids to gene symbols
• Look for enrichment of functional terms
associated with the genes in your list
http://david.abcc.ncifcrf.gov/
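One common way to score "enrichment of functional terms" is a one-sided hypergeometric test: given how many genes in the background carry a term, how surprising is the number seen in your list? The sketch below illustrates that idea only. It is not DAVID's implementation (DAVID's own scoring differs in detail), and the function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value, P(X >= hits).

    population: number of genes in the background (e.g. all genes on the array)
    annotated:  background genes that carry the functional term
    sample:     genes in the differentially expressed list
    hits:       genes in the list that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of seeing `hits` or more annotated genes in the list.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 200 with the term,
# a list of 100 genes of which 5 carry the term.
print(enrichment_p_value(20000, 200, 100, 5))
```

A small p-value suggests the term is over-represented in the list; with many terms tested, a multiple-testing correction would also be needed.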
RNA Seq
• Use of next-generation sequencing technology (NGS)
to measure RNA levels
• RNA Seq advantages:
– Wider dynamic range compared to microarray technology
– Not dependent on known genome annotations
– Higher throughput compared to microarray technology
• RNA Seq challenges:
– Specificity versus completeness of alignments, especially
for short sequence reads
– Manipulation and analysis of large files
– Data storage costs
RNA Seq Library Prep
http://www.geospiza.com/finchtalk/uploaded_images/rna-seq-steps-786705.png
Sequencing Technologies
http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png
Sequence “Space”
• Roche 454 – Flow space
– Measure pyrophosphate released by a nucleotide when it is added to a growing
DNA chain
– Flow space describes sequence in terms of these base incorporations
– http://www.youtube.com/watch?v=bFNjxKHP8Jc
• AB SOLiD – Color space
– Sequencing by DNA ligation via synthetic DNA molecules that contain two nested
known bases with a fluorescent dye
– Each base sequenced twice
– http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
• Illumina/Solexa – Base space
– Single-base extensions of fluorescently labeled nucleotides with protected 3' OH
groups
– Sequencing via cycles of base addition/detection followed by deprotection of the 3'
OH
– http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
• GenomeTV – Next Generation Sequencing (lecture)
– http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
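To make "color space" concrete: in the di-base encoding, each color (0-3) describes a pair of adjacent bases, so every base contributes to two colors. The sketch below uses the standard 2-bit trick (A=0, C=1, G=2, T=3; color = XOR of the two codes). It is a simplified illustration, assuming the read is written as one known leading base followed by colors; the function names are hypothetical.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def encode_color_space(bases: str) -> str:
    """Return the leading base followed by one color per adjacent base pair."""
    colors = [
        str(BASE_TO_CODE[left] ^ BASE_TO_CODE[right])
        for left, right in zip(bases, bases[1:])
    ]
    return bases[0] + "".join(colors)


def decode_color_space(read: str) -> str:
    """Walk the colors from the known leading base to recover base space."""
    bases = [read[0]]
    for color in read[1:]:
        previous = BASE_TO_CODE[bases[-1]]
        bases.append(CODE_TO_BASE[previous ^ int(color)])
    return "".join(bases)


print(encode_color_space("ATGGC"))  # A3103
print(decode_color_space("A3103"))  # ATGGC
```

Because decoding chains from one base to the next, a single wrong color changes every base after it, which is one reason color-space reads are usually aligned in color space rather than decoded first.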
Further Reading
• Metzker, ML. (2010) Sequencing technologies:
the next generation. Nature Reviews
Genetics 11:31-46.
Sequence Read Archive (SRA; formerly the Short Read Archive)
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?
Short Read Archive Handbook
http://www.ncbi.nlm.nih.gov/books/NBK47528/
Aspera Connect
http://www.asperasoft.com/en/products/client_software_2/aspera_connect_8
High-performance file transfer for getting data from the Short Read Archive
SRA Toolkit
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
RNA Seq Workflow
• RNA Seq
– FASTQ file format
• Alignment
– SAM file format
• Annotation
– GTF, BED file format
• Alignment Counts
– RPKM
• Statistical analysis
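RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw alignment count for both transcript length and sequencing depth. A minimal sketch of the arithmetic, with a hypothetical function name and made-up numbers:

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# 500 reads on a 2,000 bp transcript in a run with 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))  # 25.0
```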
FASTQ: Data Format
• FASTQ
– Text based
– Encodes sequence calls and quality scores with ASCII
characters
– Stores minimal information about the sequence read
– 4 lines per sequence
• Line 1: begins with @; followed by sequence identifier and optional
description
• Line 2: the sequence
• Line 3: begins with "+", optionally followed by the same sequence identifier
and description
• Line 4: encoding of quality scores for the sequence in line 2
• References/Documentation
– http://maq.sourceforge.net/fastq.shtml
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
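The four-line layout above can be read with a few lines of code. This is a minimal sketch that assumes the simple case of exactly four lines per record (no wrapped sequence or quality lines); the record type, function name, and example read are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


example = io.StringIO("@read1 demo\nGATTACA\n+\nIIIIIII\n")
for record in read_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```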
FASTQ Example
For analysis, it may be necessary to convert to the Sanger form of FASTQ.
For example, Illumina 1.3+ stores PHRED quality scores ranging from 0-62 (ASCII offset 64);
Sanger quality scores range from 0-93 (ASCII offset 33).
Solexa quality scores use a different scale and have to be converted to PHRED quality scores.
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
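The conversions mentioned on this slide come down to two small operations, following the definitions in Cock et al.: shifting the ASCII offset from 64 to 33 for Illumina 1.3+ quality strings, and mapping a Solexa score onto the PHRED scale. A sketch, with hypothetical function names:

```python
import math


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode an Illumina 1.3+ quality string (offset 64) as Sanger (offset 33)."""
    converted = []
    for char in quality:
        score = ord(char) - 64
        if score < 0:
            raise ValueError(f"{char!r} is not a valid offset-64 PHRED character")
        converted.append(chr(score + 33))
    return "".join(converted)


def solexa_to_phred(solexa_score: float) -> float:
    """Map a Solexa score onto the PHRED scale: Q = 10 * log10(10^(Qs/10) + 1)."""
    return 10 * math.log10(10 ** (solexa_score / 10) + 1)


print(illumina13_to_sanger("hhh@"))  # III!
print(round(solexa_to_phred(0), 2))  # 3.01
```

The two scales differ mostly at low quality; for high scores the Solexa and PHRED values are nearly identical.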
Example Data
Data deposited in GEO with accession id GSE20846
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20846
http://www.ncbi.nlm.nih.gov/sra?term=SRP002119
SRP002119 (study/project)
SRX017794 (experiment)
SRS025246 (source)
SRR037945 (run)
SRR037946 (run)
SRA to FASTQ
• NCBI’s SRA Tools contains utilities to convert
SRA format to FASTQ
– fastq-dump
• If the utilities and the SRA-formatted file are in the
same directory, the command line is:
fastq-dump <name of sra formatted file>
NOTE: Downloading and working with next-generation sequence data will
very quickly exceed the capacity of a typical desktop or laptop computer. You
will need appropriate infrastructure in place to work with these files, or
consider scalable cloud storage and compute services.
TopHat
http://tophat.cbcb.umd.edu/
TopHat is well suited to aligning RNA Seq data because, unlike
general-purpose short-read aligners (Maq, BWA), it takes
splicing into account during the alignment process.
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Trapnell et al. (2009). Bioinformatics 25:1105-1111.
TopHat is built on the Bowtie alignment algorithm (Trapnell et al. 2009).
SAM (Sequence Alignment/Map)
• It may not be necessary to align reads from
scratch; you can instead use existing
alignments in SAM format
– SAM is the output of aligners that map reads to a
reference genome
– Tab delimited w/ header section and alignment
section
• Header lines begin with @ (the header section is optional)
• Alignment section has 11 mandatory fields
– BAM is the binary format of SAM
http://samtools.sourceforge.net/
Mandatory Alignment Fields
http://samtools.sourceforge.net/SAM1.pdf
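The 11 mandatory fields are tab-separated and always appear in the same order, so an alignment line can be split into a small typed record. This sketch uses the field names from the current SAM specification (older versions call RNEXT/PNEXT/TLEN by the names MRNM/MPOS/ISIZE); the record type, function name, and example line are hypothetical.

```python
from typing import NamedTuple, Tuple


class SamAlignment(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, offset 33)
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one alignment line; header lines (starting with @) are rejected."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment")
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs 11 mandatory fields")
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


line = "read1\t0\tchr2\t19708900\t50\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
print(parse_sam_line(line))
```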
Alignment Examples
Alignments in SAM format
http://samtools.sourceforge.net/SAM1.pdf
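In SAM, a spliced alignment shows up in the CIGAR string: the N operation marks a skipped region of the reference, which for RNA Seq reads corresponds to an intron. The sketch below walks a CIGAR string and reports the intron coordinates implied by a read; the function name and the example alignment are hypothetical.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")


def introns_from_cigar(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) intervals for each N operation.

    pos is the 1-based leftmost mapping position from the SAM POS field.
    """
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"unrecognized CIGAR string: {cigar!r}")
    introns = []
    reference_position = pos
    for length_text, operation in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if operation == "N":
            introns.append((reference_position, reference_position + length - 1))
        if operation in CONSUMES_REFERENCE:
            reference_position += length
    return introns


# A 36-base read split across two exons with a 500-base intron between them
print(introns_from_cigar(100, "20M500N16M"))  # [(120, 619)]
```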
Cufflinks
http://cufflinks.cbcb.umd.edu/
• Assembles transcripts
• Estimates their abundances
• Tests for differential expression and regulation
in RNA-Seq samples
Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Cufflinks Output
• Gene expression
• Transcript expression
• Assembled transcripts
Annotations
• Mapping reads to specific transcripts/genes
Data Visualization
• UCSC Browser (accessible from Galaxy)
• Trackster (native to Galaxy)
External visualization tools:
• Genome Workbench
– http://www.ncbi.nlm.nih.gov/projects/gbench/
• Integrative Genomics Viewer (IGV)
– http://www.broadinstitute.org/igv/
Statistical Analysis
• Once the mapping and genome summarization are
done, the data can be analyzed just like any other
count data
• Bullard, et al. (2010). Evaluation of statistical
methods for normalization and differential
expression in mRNA-Seq experiments. BMC
Bioinformatics 11:94.
Typical RNA Seq Project Workflow
Tissue Sample -> Total RNA -> mRNA -> cDNA -> Sequencing -> FASTQ file -> QC ->
TopHat -> Cufflinks -> Gene/Transcript/Exon Expression -> Visualization ->
Statistical Analysis
JAX Computational Sciences Service
See Tutorial 1
Galaxy
http://main.g2.bx.psu.edu/
Build and share data and analysis workflows
No programming experience required
Strong and growing development and user community
RNA Seq Workflow
• Convert data to FASTQ
• Upload files to Galaxy
• Quality Control
– Throw out low quality sequence reads, etc.
• Map reads to a reference genome
– Many algorithms available
– Trade off between speed and sensitivity
• Data summarization
– Associating alignments with genome annotations
– Counts
• Data Visualization
• Statistical Analysis
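As an illustration of the quality-control step in the workflow above, one simple rule is to drop reads whose mean PHRED score falls below a cutoff. This is only a sketch of the idea, not what any particular Galaxy tool does; it assumes Sanger-encoded qualities (ASCII offset 33), and the function names and the cutoff of 20 are hypothetical choices.

```python
from typing import Iterable, List, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean PHRED score of a quality string (Sanger encoding by default)."""
    if not quality:
        raise ValueError("empty quality string")
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(
    reads: Iterable[Tuple[str, str]], min_mean_quality: float = 20.0
) -> List[Tuple[str, str]]:
    """Keep (sequence, quality) pairs whose mean PHRED score meets the cutoff."""
    return [
        (sequence, quality)
        for sequence, quality in reads
        if mean_phred(quality) >= min_mean_quality
    ]


reads = [("GATTACA", "IIIIIII"), ("ACGTACG", "!!!!!!!")]
print(filter_reads(reads))  # keeps only the first read
```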
Galaxy interface panels: Tools; Dialog/Parameter Selection; History
Uploading Data to Galaxy
Because of the size of most sequence files, it is necessary to use FTP to get files to Galaxy.
Select the appropriate reference genome at the time of data upload.
You can upload compressed files; they will be uncompressed upon loading into Galaxy.
Tutorial Web Site
http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml
This site will be accessible after the meeting. Check back for updates and new tutorials.
next time
• Analyze project data with DAVID
– Convert probeset ids to genes
– Look for enrichment of functional terms
• Try the first part of Tutorial 5 in Galaxy