PAG_Notes - Bioinformatics Core Wiki

“Development of Genome and Transcriptome Sequence Resources in European Hazelnut” Todd
Mockler, Oregon State U (tmockler@cgrb.oregonstate.edu)
Did short-insert Solexa only. Did not do any quality trimming/filtering (no error correction)
Due to RAM limits, could only do ~100 million reads at once with velvet, so did velvet subassemblies and used MIRA to co-assemble contigs. Results were extremely fragmented (333,492
contigs, <gene size). Aligned input data to see depth of sequencing; got ~50% of reads in assembly.
Aligned RNA-seq data to contigs – 22.6% of contig space had RNA-seq reads mapped. They
estimate it is 90% of the genes.
Only 20% of velvet contigs matched genbank (65% of hits were grape/poplar/castor bean)
Tried SOAP de novo for entire pile at once, but got 1.3 million scaffolds (much of which was
“N”s)
Have since done 2 KB mate-pair library sequencing
“Next generation transcriptome resequencing in Populus trichocarpa enables studies in natural
genetic diversity, genome-wide analysis of mRNA processing and expression, and association
genetics” Carl J. Douglas, U. of British Columbia (cdouglas@interchange.ubc.ca)
Did 4 lanes of mRNA-Seq per cDNA sample of 20 genotypes
Used MAQ, discarding >3 mismatches. Can map intron-exon boundaries. Can find non-annotated
exons and splice-site variations
8% of reads mapped to “introns”
6% of reads mapped to “intergenic regions”
64% of reads map uniquely
Read count was normalized (how?) and candidate genes were ranked on a log scale
SNP analysis:
Peaks analysis using wig files
Calls based on consistency across 20 individuals
Validation by Sanger sequencing
(but, there can be allele-specific expression)
7x coverage of xylem RNA produced 400,000 candidate SNPs
Also doing transcriptional network analysis based on Arabidopsis
“Prelude to a Genome: De novo assembly, annotation and profiling of an expressed gene catalog
of a fast-growing Eucalyptus hybrid clone using Illumina mRNA-Seq” Eshchar Mizrachi, U. of
Pretoria, South Africa (eshchar.mizrachi@up.ac.za) (Eucspresso/Eucagen website)
Had 4.5x assembly (now have 8x assembly) and very little NCBI gene sequence information
Constructed gene catalog with mRNA-Seq
16 lanes of paired-end sequences (4.2 Gb after filtering rRNA)
Used velvet, then custom scripts to extend contigs and separate genes, pseudogenes and
paralogs, to generate a final set of 19,000 contigs with high coverage, many full length with UTR
Contigs won't assemble if coverage is not uniform, so contigs were assembled one by one using
Velvet and custom scripts to map reads to each contig (using SOAP, bowtie, and ?)
Quality control: mapped Illumina reads to Sanger reads
Found homologs to Arabidopsis, Poplar, and Grape
KEGG maps were very similar to Arabidopsis
Used 128 GB RAM machine, now have 256 GB machine
“Analysis of the transcriptome of the Fagaceae species” Abdelali Barakat, Penn State
(aub14@psu.edu) (Fagaceae Genomics website has all sequences)
Trying to find disease-resistant genes (American Chestnut vs. Chinese Chestnut)
Sampled early in infection (17 dpi) in field
Created chestnut unigene set (how?)
Blast contigs against Arabidopsis looking for blight disease-related defense genes
Also found miRNAs (different prep?)
Now sequencing Chinese Chestnut genome.
“Genetic variability in the conifer wood transcriptome” John Mackay, Université Laval, Canada
(John.Mackay@sbf.ulaval.ca)
Work was in white spruce and is related to wood properties (quantitative variation in transcript levels)
Microarray has been developed with ~32K Spruce genes developed from EST and 454 sequencing
Association population was genotyped with Illumina and microarray
Regulon genes (TFs – coordinate regulation, test in transgenes)
“Genome-wide SNP data refine relationships in the grapevine” Sean Myles, USDA
(smm367@cornell.edu) (Grape Genetic Diversity Project)
Generated 2 Gb Illumina sequence from HpaII digestion of genomic DNA
Identified 470,000 SNPs (on average, 1 every 41 bp), of which 71,000 “good”; reduced that to 10,000
SNPs, put 8,988 on an array (Illumina SNP chip) and found 6,114 high quality SNPs, most of which
preferentially segregate.
They are looking at grape domestication from V. sylvestris; archaeological data indicates this
occurred in the Near East. Was there a domestication bottleneck? SNP selection shows diversity
between domesticated vinifera. LD breakdown is very similar (decays very rapidly), so there are no
haplotypes, so SNPs are difficult to use. May need to use haplotype data for genome-wide analysis.
RNA-Seq: 36-44 nt reads, 17-20 million reads per sample, mapped to reference genome by
eland. Mapreadslocation.py maps unique reads to exons, introns, intergenic regions and unmapped
exons. Use MAQ to find SNPs with minimum read depth of 10. Result was 86,000 SNPs, only 7480 in
genes (exons) and 24,000 in introns.
Splice variants: again align with eland. Scripts: vmf_filter_uniqs.py and get_alt_splice.py used
to find 385 alternatively spliced genes.
Erange – for global analysis of gene expression
Normalization – calculated reads per kb of exon per million total reads (RPKM)
More than 17,000 genes seen
Validated several genes with qPCR
“Genoscope” annotation for manual verification
(see Berkeley Bullard reference)
Did Fisher's Exact Test and found ~7,000 genes had statistically significant differential
expression.
Interestingly, they found that if they took the Unmapped reads and built contigs with those,
sometimes reads would map to those contigs, that didn't previously map.
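The normalization noted above (reads per kb of exon per million total reads) can be sketched as below. This is only an illustration of the arithmetic, not the ERANGE implementation; the function name `rpkm` and the example numbers are made up for the sketch.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total reads must be positive")
    per_kb = read_count / (exon_length_bp / 1_000)
    return per_kb / (total_mapped_reads / 1_000_000)


# Hypothetical gene: 500 reads on 2,000 bp of exon in a 20-million-read sample.
print(rpkm(500, 2_000, 20_000_000))
```

Dividing by exon length makes long and short genes comparable; dividing by library size makes samples of different sequencing depth comparable.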
“A gene expression map of Vitis vinifera cv. Corvina development” Mario Pezzotti, U. of Verona
(mario.pezzotti@univr.it)
3 biological replicates of samples of different tissues (20 developmental stages) over 2008.
Roots were from in vitro
Separated skin, flesh and seed of berries
Used gene prediction from 12X genome assembly (2 different gene predictions)
Combimatrix 90K array: 1 probe/transcript, 3 of each probe (time-consuming)
NimbleGen 12plex (12 x 135K) array: started design with 1 probe/transcript and 4 of each
probe, changed to 4 probes/transcript and 1 of each probe
Analysis used FDR threshold of 0.05 and fold change > 2
Hybridization quality control – compare biological replicates (made sure that different
biological replicates were on different chips)
Performed functional characterization with GO
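The notes do not say how the FDR was computed. As one common possibility, the sketch below applies a Benjamini-Hochberg adjustment and then keeps genes that pass both cutoffs mentioned in the talk (FDR 0.05, fold change > 2). The function names and the input values are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(genes, pvalues, log2_fold_changes, fdr=0.05, min_fold=2.0):
    """Keep genes with q <= fdr and absolute fold change above min_fold."""
    qvalues = benjamini_hochberg(pvalues)
    cutoff = math.log2(min_fold)
    return [g for g, q, lfc in zip(genes, qvalues, log2_fold_changes)
            if q <= fdr and abs(lfc) > cutoff]


# Made-up example values, for illustration only.
print(select_genes(["g1", "g2", "g3", "g4"],
                   [0.001, 0.04, 0.03, 0.5],
                   [1.5, 2.0, -3.0, 0.1]))
```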
“Transcriptomic Analysis of Early Fruit Development In Cucumber (Cucumis sativus)” Rebecca
Grumet, MSU (grumet@msu.edu)
As fruit size increases, cell/vacuole size increases, cell wall/middle lamella thicken
Samples were 8 days post pollination, did 454 sequencing
Showed graph of contig length vs. number of ESTs
Contigs >200 bp and >= 10 ESTs: 90% had homologs in Arabidopsis, 7% had no
homologs in NCBI database
If <10 ESTs, only 55% had homologs
Took replicate samples, used sequence to verify, then qRT-PCR
Different distribution of genes represented if > 100 ESTs vs. the whole list
Found unique phloem genes and hormones involved with fruit set and development
Lipid proteins (surface wax)
Latex-like proteins (defense?)
Transcription/signaling genes were under-represented
Time course experiment: 0, 4, 8, 12, 16 days post pollination
454 titanium yielded 1.1 million ESTs
Similar stats as for previous set with >= 10 ESTs/contig
Looked at >= 100 ESTs/transcript, compared time points to day 0 (anthesis)
Heat map showed -3 fold green to +3 fold red
“Transcriptome mapping and cataloging of alternative splicing in Arabidopsis” Sergei Filichkin,
Oregon State U (filichks@onid.orst.edu) (mockler-lab-tools.cgrb.oregonstate.edu)
RNA-Seq with 32 bp reads
1) Combine RNAs from different tissues with oligo dT primers to get full-length cDNAs
(~60 million reads)
2) Use random priming to generate 300 million trimmed reads and align to genome
(HashMatch). If no match, do more inc. SuperSplat
Produce single-base resolution map of transcriptome
Look for novel splice junctions, make sure that reads cover them
At least 42% intron-containing genes are alternatively spliced
6% of all introns have non-consensus splice junctions (5 fold higher than previously thought)
Use RT-PCR (splice junction specific and exon flanking primers); Sanger sequencing; qRT-PCR
to validate
“Rapeseed (B. napus) SNP Discovery Using a Dedicated Sequence Capture Protocol and 454
Sequencing” Jean-Philippe Pichon, Biogemma
Allotetraploid: B. rapa (n=10) x B. oleracea (n=9) → B. napus (n=19)
Sequence Capture (NimbleGen) → Sequence selection (769 probes/array)
Need exon-exon junction data to avoid chimeric probes
Probes with repeated sequences removed
Capture array (720K probes) → 454 sequencing
102055/102736 reads, assembled into 2439 contigs
“Sequence assembly in color space: from de novo to resequencing” Todd Michael, Rutgers U. and
Monsanto
solidsoftwaretools.com/gf/project/denovo/frs/
Solid pipeline: QVfilter → Accuracy Enhancement → Preprocessor → Velvet
Colorspace=dinucleotide sequence
Their software converts “colorspace” into “basespace”
Can use first base in each read to check contig to find errors. Use minimal mismatches to do error
correction (last stage of de novo pipeline)
SAET (Spectral Accuracy Enhancement Tool) does error correction from raw reads
Runs sliding window across each read
“words” without errors occur more often than words with errors
69% of reads mapped to reference is about expected (= error rate of <1%)
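The word-frequency idea above (error-free words recur, words containing an error are rare) can be illustrated with a plain k-mer count. This is a rough sketch of the principle only, not SAET itself; it works on any string alphabet (bases or colors), and the function names, `k`, and `min_count` are assumptions chosen for the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k word seen in a sliding window over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspect_positions(read, counts, k, min_count):
    """Start positions of windows whose word is rarer than min_count."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


# Five identical reads plus one read with a single changed character.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = kmer_counts(reads, k=4)
print(suspect_positions("ACGTTCGT", counts, k=4, min_count=2))
```

Every window that overlaps the changed character is rare, which localizes the likely error to the position those windows share.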
Complex (diploid) Genomes with SAET (de novo, community-led consortium)
Quality/error correction (pre-processing) → de novo assembly → contig merging → scaffolding
(post processing) → Gap filling
ASID for de novo assembly (improvement on gap and repeat filling)
SOLiD (60 Gb) + 454 (29 Gb) = 70X coverage
Reads → contigs
Paired ends → scaffolds (PE indicate connection of contigs)
454 Sequences assembled with Newbler
It is more important to have a lot of paired-end sequences to generate scaffolds
PASS program to align short sequences (can use colorspace and basespace)
CONSORT program can close the gaps that are created with Newbler
Did alignments with many sets of parameters
All possible words are aligned to minimize errors
Use PE to merge contigs into scaffolds, can estimate gap length (“consolidated arcs”)
Build scaffold paths from intercontig arcs
Assembly of small plant genomes
Every SNP call represents 2 independent reactions, so need 2 changes in colorspace for 1 SNP
(whereas 1 change in colorspace = sequencing error)
Do not just translate colorspace to basespace, because then can't make colorspace correction
Use corrected dnaspace in contig assembly
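A small sketch of the colorspace points above, using the standard two-base encoding in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). It shows that a real SNP changes two adjacent colors, while a single wrong color corrupts every base after it once translated, which is why correction is done before translation. The function names and sequences are made up for the example.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def encode(seq):
    """Return the first base plus one color (0-3) per adjacent base pair."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Translate colors back to bases, starting from the known first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BASES[CODE[bases[-1]] ^ color])
    return "".join(bases)


ref = "ACGTACGT"
first, ref_colors = encode(ref)

# A true SNP (T -> G at index 3) changes two adjacent colors.
_, snp_colors = encode("ACGGACGT")
print([i for i, (x, y) in enumerate(zip(ref_colors, snp_colors)) if x != y])

# One miscalled color, translated naively, shifts all downstream bases.
bad_colors = list(ref_colors)
bad_colors[2] = 0
print(ref)
print(decode(first, bad_colors))
```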
Polyclonal reads need to be removed
2 DNA moieties in amplification process
2 signals emanating from beads
Errors are on the 3' end of the reads (reference Sasson & Michael, Bioinformatics)
If QV < 10, there is an error at that base
Normally, first 5-10 bases should be of highest quality (QV > 25)
If 5 of the first 10 bases are QV < 25 read is polyclonal
86% of filtered reads matched the reference genome (in resequencing) vs. 15% of reads that didn't pass
filter (have multiple errors)
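The polyclonal rule of thumb above (5 of the first 10 bases with QV < 25) is simple enough to write down directly. This is a sketch of that rule only, not the actual filter from the talk; the function name and default parameter names are assumptions.

```python
def is_polyclonal(quals, window=10, min_qv=25, max_low=5):
    """Flag a read when max_low or more of its first `window` QVs are below min_qv."""
    low = sum(1 for q in quals[:window] if q < min_qv)
    return low >= max_low


# Made-up quality strings: good start with a decaying 3' end, then a noisy start.
print(is_polyclonal([30] * 10 + [5] * 20))
print(is_polyclonal([10, 30] * 5 + [30] * 10))
```

Only the start of the read is inspected because, as noted above, ordinary errors accumulate at the 3' end, so low quality in the first bases points to a mixed-template bead instead.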
SOPRA (Statistical Optimization of Paired Read Assembly) will be available soon
V-SOPRA (velvet fragment mode) and S-SOPRA (SSAKE algorithm)
1. Assemble reads in colorspace, but first base is dnaspace to check consistency
2. Translate colorspace to dnaspace
3. Optimize orientation
Contigs that don't pass step 1 are discarded
Reads are aligned onto contigs, then queried to see if optimal for distance and orientation
(“spring model”) to identify misassemblies and chimeras
How does coverage/density compare across scaffold?
Areas of ambiguity have increased density, so apply trimming method to break into different
scaffolds
SOPRA improves assembly accuracy and N50 (this work was based on bacteria)
Can detect structural variants
“Next Generation Forward Genetics” Ies J. Nijman, Hubrecht Institute
Mapping and identification of mutants by SOLiD
Forward genetics – Arabidopsis bulk segregant analysis
Mutagenize, then cross mutant with phenotype to wild-type, then self F1
Homozygous mutant (25%) have recombination events to isolate small region of your mutation
Procedure to call SNPs and determine linkage (need genetic map)
1. Light sequence of homozygous pool (10-20 million reads per mutant)
2. Deep sequence library enriched for region of interest for SNP calling
Map loci to find differences (polymorphisms)
Use Agilent capture array (CGH array) for that region
Multiplexed libraries hybridized on single array
Can find small recombination events
Candidates need biological followup
“Robustness in the face of complexity: Single Molecule Real Time DNA sequencing” Stephen
Turner, Pacific Biosciences (sturner@pacificbiosciences.com)
Hairpin adaptors are used, so sense and antisense strands are both sequenced multiple times
Tends to favor high G+C content
Can span duplications, repeats
Read length distribution is exponential, with mean read length > 1kb
“Strobe Sequencing” – turn lights on and off to generate multiple sequence windows in same molecules
(their version of mate-pairs)
Good for sequencing across multiple insertion events
Can span insertions and anchor them to genomic reference
Can also exploit circular technology to do up to 20X coverage on single molecule, so can get consensus
on single molecule, which makes it easy to find sequencing errors.
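The single-molecule consensus idea can be illustrated with a per-position majority vote. This is a deliberately simplified sketch: it assumes the repeated passes are already aligned, strand-corrected, and of equal length, which real data would not be, and the function name is made up.

```python
from collections import Counter


def consensus(passes):
    """Majority base per column across equal-length passes of one molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*passes))


# Five hypothetical passes over the same molecule, two with one error each.
print(consensus(["ACGT", "ACGA", "ACGT", "TCGT", "ACGT"]))
```

Random errors rarely hit the same position in most passes, so they are outvoted as coverage of the molecule increases.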
Can see methylated bases by measuring delay between nucleotide incorporations (no need for bisulfite
conversion)
Can distinguish between different types of base modifications
Potential sequencing performance: 50 bases/sec > 100 Gb/hr (human genome in 15 minutes)
Maximum readlength will be 15-17kb in 5 years (limiting factor is polymerase processivity)
This platform doesn't lend itself to counting tags, but is better for understanding splice variants
Initially not competitive with short-read mRNA-Seq
Generalized tool for all sorts of biological sequencing (not just DNA/RNA)
They are working on ways of circularizing RNA. Currently have a protocol for linear RNA (but not
ready for release)
Will start selling DNA machines in mid-2010
Sample prep is as short as 20 minutes with random primers
SMART sample prep takes less than a day (and will improve)
Runs are short (6-20 minutes), throughput currently is equivalent to NGS
Automated sample loading system, so you can “walk away” for 8 hours
Dead time between samples is 120 seconds, should be reduced to 10-20 seconds
Data reduction (base calling/quality filtering) is internal to system. Image data is not stored.
Will create “temporal reference sequence” for organism that saves reference kinetics.
Then can do re-sequencing and compare kinetics and structural variation.
“De Novo Sequencing of Plant Genomes” William McCombie, CSHL
Hybrid Hierarchical Assembly (Sanger, 454, and Illumina reads)
Used different combinations of assemblies (tried staged with pre-filtering, then Phrap)
Currently working on 30X peach, 72X cocoa
Using velvet on rice genome – ran into memory problems on 512 GB RAM machine
New version of velvet “gets around this”
Abyss assembly generated 12-35 Mb max contig size with N50 1850-2850
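For reference, N50 is the contig length at which the contigs of that length or longer hold at least half of the assembled bases. A minimal sketch, with a made-up list of contig lengths:

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths, for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 is weighted by length, a few long contigs can coexist with a low N50 when most of the assembly sits in short pieces.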
“Physical mapping and transcriptional analysis of large homeologous deletions in soybean”
Robert Stupar, University of Minnesota (stup004@umn.edu)
Looking at homeologous blocks between Gm08 and Gm15.
Identified common SNPs in 1Mb region
Generated lines with one homeologous group deleted (or additional copy added)
Can't recover homozygous deletions
Used Nimblegen 700k microarray to hybridize to find copies (each probe represented ~1kb)
Transcription consequences: relative proportion similar to dosage.
Does regulation differ?
“Comparative analysis of a 1-Mb region of Phaseolus vulgaris to the highly duplicated soybean
genome” Jer-Young Lin, Purdue (lin51@purdue.edu)
Studying same 1Mb region as previous talk, but comparing Phaseolus to soybean
Reshuffling of genome occurred in the soybean tetraploid ancestor, but some genes are not
found in both homeologs
Compared Gm15 and Gm8 to Pv5: Gm15 showed inversion and loss of genes. Gm8 is closer to
the ancestral genome and more similar to Pv5, but Pv5 has more transposons than either Gm8 or
Gm15.
There are some segmental duplications on Gm15 (very recent)
There are some successive tandem duplications on Gm8 (9 copies)
Both have higher transposon densities
Sequence divergence: lower Ks (synonymous substitution rate) on Gm15 (biased gene loss)
Expressional bias for Gm8 when compare homeologous gene pairs
Correlation between the two? But R values seem low
“Verticillium Comparative Genomics Sheds Light on Pathogenicity of Wilt Pathogen” Li-jun Ma,
Broad Institute (lijun@broadinstitute.org)
V. dahliae vs. V. albo-atrum:
Region unique to V. d. is highly dynamic, rich in transposons and highly expressed genes (but
no housekeeping genes)
Genes shared between wilt-pathogens (Verticillium, Fusarium):
glucosyl transferase – synthesizes membrane oligosaccharides related to bacterial genes (HGT)
CAZy (carbohydrate-active enzymes) – more in Verticillium (contributes to broad host range).
(Expanding arsenal of cell-wall degrading enzymes)
“MAKER: An easy to use genome annotation pipeline” Carson Holt, University of Utah
(www.yandell-lab.org)
References: Cantarel 2008 GR, Coghlan 2008 BMCB
Incorrect genome annotations poison every experiment that uses them
10/2009: 222 eukaryotic genomes sequenced (but unpublished), ~900 projects underway
MAKER is designed to help small research groups:
Easy to use, part of GMOD, GFF output; optimized for parallelization
Annotation pipeline (needs quality control statistics), not gene predictor (just a model).
Identifies repeats, ESTs, etc.
Ab initio annotation (SNAP, Augustus) needs training (e.g., model organism); however
MAKER doesn't need training, can “train itself”. SNAP runs inside of MAKER, can run iteratively
(bootstrap) to improve the gene models. Can update annotations based on new evidence
MAKER algorithm:
1) Identify and mask repetitive elements (although many encode vital proteins) with
Repeat Masker (RepBase or user-supplied library) and RepeatRunner (proteins that have
diverged). MAKER supports SNAP, Augustus, GeneMark, FGENESH; trained HMMs
must be supplied for each (see GMOD wiki for how to train SNAP with MAKER).
2) Align evidence from ESTs/proteins – uses blasts (EST tblastx, blastn, protein blastx)
Can be complications with splice sites
Program EXONERATE polishes blast data (no HSP overlap; HSPs must align with splice
sites)
3) Passes evidence back to ab initio program (e.g., SNAP), different ab initio programs will
produce different models
4) Model is updated
Using MAKER (go to yandell lab website)
Maker Web Annotation Service – can use on-line
MWAS_Tutorial (gmod.org/wiki/MWAS_Tutorial)
SOBA (Sequence Ontology Bioinformatics Analysis) statistics for different features, shows graphs of
statistics on data, DAGs, etc.
APOLLO genome viewer
De novo annotation of newly sequenced genome:
GeneMark is easiest to train (then SNAP/CEGMA, by Ian Korf)
Training info at “Summer School of the Americas”
Other problems: evidence to pass through or update existing annotation
mRNA-seq data: give resolution for intron-exon junctions
Align reads into expression “islands” and “junctions”
Pass alignments as EST evidence via GFF format. Can set threshold for alignment for
required depth of coverage (maybe can use BES as genome to do this?) or can assemble to
set longer alignments
Legacy annotations (existing models) may be poor quality, conflicting annotations
Incorporate mRNA-seq to existing annotations, “private” annotation sets.
MAKER merges with these to produce consensus annotations, and can feed conflicting
annotations into ab initio predictors
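The mRNA-seq step above (collapse aligned reads into expression "islands" above a depth threshold and pass them on as GFF) might look roughly like the sketch below. It writes generic GFF3 lines; the `source` and `feature` values and the function names are placeholders, and the notes do not say which feature type MAKER expects for this evidence.

```python
def islands(depth, min_depth):
    """Yield 1-based inclusive (start, end) runs where depth >= min_depth."""
    start = None
    for pos, d in enumerate(depth, start=1):
        if d >= min_depth and start is None:
            start = pos
        elif d < min_depth and start is not None:
            yield start, pos - 1
            start = None
    if start is not None:
        yield start, len(depth)


def to_gff3(seqid, runs, source="mrnaseq", feature="match"):
    """Format runs as GFF3 text (source and feature are placeholder values)."""
    lines = ["##gff-version 3"]
    for n, (start, end) in enumerate(runs, start=1):
        lines.append("\t".join([seqid, source, feature, str(start), str(end),
                                ".", ".", ".", f"ID=island{n}"]))
    return "\n".join(lines)


# Made-up per-base read depth along a short sequence.
depth = [0, 3, 5, 5, 1, 0, 4, 4]
print(to_gff3("chr1", islands(depth, min_depth=3)))
```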
MAKER is a part of “Genome Investigator”. Other parts are Evaluator and Verifier
MAKER data can be integrated into GMOD tools:
Chado database
Jbrowse (GFF3 → Jbrowse input; web interface)
can show interpro domains
Use to evaluate different contig builds
CEGMA (Ian Korf)
Other programs can be used to annotate repetitive elements
“GMOD Project Update” Dave Clements and Scott Cain (clements@nescent.org) NESCent, Durham
NC (gmod.org/wiki/gbrowse)
Visualization –
Gbrowse 2.0 has been released (Jan 2010), big performance improvement
Includes popup balloons
Will allow private individual accounts to share data
Jbrowse – 2nd generation browser
Gbrowse_syn – comparative genomics viewer
Reference sequence compared with 2 others
Syntenic blocks don't need to be colinear
Data Management –
Chado 1.0 → 1.1 really soon, with improvements, faster, friendlier
Can create ontology-based views
Tripal – web front end for Chado data based on Drupal
Takes care of accounts, job management (e.g., blast)
Table Edit – MediaWiki extension
Easier to make tables, extract data to tables (e.g., Chado → MediaWiki)
BioMart – query multiple databases, new GUI, easier to use
InterMine – data integration (20 common formats)
Annotation –
Maker, DIYA (pipeline), Galaxy (workflow), Ergatis (annotation/cloud computing),
Apollo
Tutorials available at gmod.org
“Using the Jbrowse Genome Browser with Large Amounts of Data” Mitchell Skinner
(mitch_skinner@berkeley.edu), UCB
Jbrowse can use NGS data, can link out to other websites, has smooth scrolling
Moves work from web server to web browser, but without overloading browser (breaks up
data), by doing more reads than writes.
Test Set – 4.4 million features (not PE reads), took 8 minutes to process from bam file
Compression ~90%, only used 400 MB RAM
Breaking up data: NCLists (features contained within other features)
fast to query, tree structure
load fake features containing NCLists (“lazy loading”)
SAM Tools and R-Trees (Big Bed, Big Wig) also came up with loading schemes
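To make the NCList idea concrete, here is a small in-memory sketch of a nested containment list: features contained within another feature are pushed into that feature's sublist, so within any one list both starts and ends are sorted and an overlap query can binary-search its starting point. This is not JBrowse's code (which is JavaScript and loads sublists lazily from the server); class and function names are made up, and intervals are half-open.

```python
from bisect import bisect_right


class NCList:
    def __init__(self):
        self.items = []  # (start, end, sublist); no item contains another
        self.ends = []   # parallel ends, ascending, used for binary search

    def add(self, start, end):
        sub = NCList()
        self.items.append((start, end, sub))
        self.ends.append(end)
        return sub

    def query(self, qstart, qend):
        """Return all intervals overlapping the half-open range [qstart, qend)."""
        hits = []
        i = bisect_right(self.ends, qstart)  # first item ending after qstart
        while i < len(self.items) and self.items[i][0] < qend:
            start, end, sub = self.items[i]
            hits.append((start, end))
            hits.extend(sub.query(qstart, qend))
            i += 1
        return hits


def build(intervals):
    """Nest each interval under the nearest earlier interval that contains it."""
    root = NCList()
    stack = []  # (end, sublist) of containers that are still open
    for start, end in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        while stack and end > stack[-1][0]:
            stack.pop()
        parent = stack[-1][1] if stack else root
        stack.append((end, parent.add(start, end)))
    return root


# Hypothetical features on one reference sequence.
features = [(0, 100), (10, 20), (15, 18), (30, 40), (50, 60), (120, 130)]
print(build(features).query(16, 35))
```

A sublist is only visited when its container overlaps the query, which is what lets a browser fetch just the pieces needed for the region on screen.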
Jbrowse must load through a proxy (translation layer)
Gbrowse vs. Jbrowse:
Gbrowse has more functionality, but Jbrowse does basic tasks well (but rough around the
edges)
“Comparative Genomics with Gbrowse_syn” Sheldon McKay, CSHL
(gmod.org/wiki/GBrowse_syn_PAG_tutorial)
Gbrowse_syn = generic synteny browser, included with Gbrowse 1.69 or later
Used by Wormbase, TAIR
Seamless Gbrowse integration, sits on top of Gbrowse interface + database + alignment data
On-going support/development
Doesn't rely on perfect co-linearity (no orphan alignments)
On the fly chaining (groups alignments together to create pseudo synteny blocks)
No limit on the number of species
Uses grid lines to trace fine-scale indels (so info isn't lost)
Keeps track of inserts/deletions
Can color-code and shade things
Can “flip panels” to make look better (toggle between reference sequence and other sequence)
Can do “all-in-one” view (summary) of all species
Can mouse over things to get info
Can compare gene models
For small aligned regions, can show gene orthology, chained orthologs, panels can be merged
Input data is primarily whole genome alignments; start with raw sequences:
Mask repeats (RepeatMasker, etc.)
Further processing
Identify orthologous regions (ENREDO, MERCATOR, orthocluster, etc.)
Can then go to Gbrowse_syn
Nucleotide level alignment (PECAN, MAVID, etc.)
Wiggle tracks to Gbrowse_syn
Can also send output to Gbrowse, UCSC, etc.
To work well, need pairwise comparisons between each species (MERCATOR Ultra contigs)
Can use data without alignments (co-linear blocks, but no sequence alignment)
Gene orthology alignments based on protein blasts (associate annotation)
Only need start point and end point
Self vs. self comparison of polyploidy, duplication, etc.
Calculate many anchor points with multi sequence alignment
Aligned DNA sequences can be distant (e.g., PECAN alignment of P. pacificus to C. elegans)
Segmental duplications – use protein orthology, then synteny blocks
Architecture – Bio::DB::GFF
Future of Gbrowse_syn:
Integration with Gbrowse 2.0
“On the fly” sequence alignment view
AJAX-based user interface (JBrowse_syn)
Other Synteny Browsers:
SynView (Wang Bioinformatics 2006)
Part of Gbrowse; marked up config file
SynBrowse (Pan, Bioinformatics)
Stand-alone app
Similar to genome browser
Uses sequence alignments and other data to highlight relationships
Usually displays co-linearity relative to a reference genome
Sybil (uses chado database, Crabtree, Methods Mol. Biol. 2007)
CMap (broad user community)
non-GMOD browsers:
circular (mkweb.bcgsc.ca/circos, mizbee.org)
CoGe (synteny.cnr.berkeley.edu/CoGe)
Gbrowse_syn demo (tutorial is on their wiki)
Run on Vmware, take an image first
User name and password are both gmod
Go to full screen after it loads (helps you to remember to stay on virtual machine)
Firefox is installed in the distribution
Configuration – need to run as root (sudo)
To switch: alt-tab
To copy: shift-ctrl-c
To paste: shift-ctrl-v
gbrowse_netinstall.pl grabs all prereqs
Includes cpan module, bioperl, Gbrowse source code
Need mySQL server and a bunch of other things
run -d flag for latest version
hit enter whenever prompted (ignore error messages, unless it refuses to install)
Alignment data needs to be in fasta or clustalw format (but doesn't need to have been generated
by clustalw)
Ids need to have metadata to relate alignment back to reference genome
PECAN or Mauve can align big pieces of DNA like genomes
Can take wiggle tracks
If not using alignments, can load database with other entry points (e.g., gene orthology data)
Central configuration file is different between Gbrowse and Gbrowse_syn
Takes GFF3 files
“Tripal: a Construction Toolkit for Online Genomic Databases” Stephen Ficklin, Clemson U
Genomics Institute (ficklin@clemson.edu) (gmod.org/wiki Tripal tutorial)
Tripal = GMOD Chado (database) + Drupal (content management)
Requirements: Linux/GMOD Chado/Drupal/PHP
Distributed via SourceForge
Can use Drupal themes – use what other people have created to customize screen look
Tripal has modules for expansion (under “Administration”), correlate with Chado tables
To get data out use Master Views – can break down by GO term/category
“SCRI Visualization Tools” David Marshall, SCRI (David.Marshall@scri.ac.uk)
All Java-based, freely available
TABLET
RNA-Seq from different genotypes aligned to different contig assemblies
Use Mosaik → GigaBayes → markup files → import sets of features
Used for variant finding, SNP discovery (can “jump” to particular SNPs)
Can put assemblies in and look at contigs (shows read info past consensus sequence)
Alternative splicing with RNA-Seq
Bowtie (against Arabidopsis pseudo-molecules) → Tophat (takes rest of reads and
creates contigs)
FLAPJACK
Has markers in order on genetic map of chromosomes (database)
Can enter categorical or numerical trait data, experiments
4 sets of data (ordering for markers, project file, SNPs, genome/map)
Can color by allele frequency to find rare haplotypes
Marker select mode – can select markers under a QTL
Can sort horizontally and vertically
STRUDEL
Synteny between brachypodium/barley/rice (very nice looking)
“Integrated Genome Browser: Visualization Software and Data Server for Next-Generation
Genomics” Ann Loraine, U of NC (aloraine@uncc.edu) (genoviz.sourceforge.net, gb.bioviz.org)
Developed by Affy to visualize tiling arrays
GenovizSDK – Java library for building visualization applications
IGB can be used for Chip-Seq and RNA-Seq data
Ways to get data in:
Open file (many formats allowed)
Via Website
Via data server/Quickload (if only want pieces of a big dataset)
Can set up folders and files on server, with directory for each genome
Add annotation file and file with genome structure
Add annotations (e.g., TAIR9)
454 data (2 sample mRNA-Seq data)
Align data onto genome using BLAT (similar to UCSC) – uses a compressed format so data moves
faster
Can move tracks around to create almost heat-map like views.
“Gene order comparison with contigs and scaffolds” Adriana Munoz, U. of Ottawa
(amuno010@uottowa.ca)
Project: 10 Drosophila species + 4 outgroup species
Rearrangement algorithms – use contigs/scaffolds as if they are chromosomes
Based on NGPs (neighboring gene pairs) database of adjacent genes
Reconstruct contigs by overlapping NGPs
Rearrangement operations (single chromosomes vs. 2 chromosomes) – generated mathematical model
Genomic distance is minimum number of rearrangements to get from one genome to another
Genome fragmentation – compare one genome in contig form to another full genome
If both genomes are in contig form, use slightly different algorithm
Can do phylogeny and reconstruct ancestral genomes
The larger the number of contigs, the less accurate this method is
Extended model to compare genomes in scaffold form
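As a rough illustration of working with neighboring gene pairs when one genome is in contig form, the sketch below counts adjacencies present in one genome but absent from the other. Note the substitution: this is a plain breakpoint count on unsigned gene orders, a much simpler proxy than the minimum-rearrangement distance described in the talk. All names and gene orders are hypothetical.

```python
def adjacencies(gene_order):
    """Unordered neighboring gene pairs within one contig or chromosome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def genome_adjacencies(contigs):
    """Union of neighboring gene pairs over all contigs of a genome."""
    pairs = set()
    for contig in contigs:
        pairs |= adjacencies(contig)
    return pairs


def breakpoints(genome_a, genome_b):
    """Number of neighboring gene pairs in A that are not neighbors in B."""
    return len(genome_adjacencies(genome_a) - genome_adjacencies(genome_b))


# One complete chromosome versus the same genes split over two contigs.
complete = [["g1", "g2", "g3", "g4", "g5"]]
fragmented = [["g1", "g2"], ["g4", "g3", "g5"]]
print(breakpoints(complete, fragmented))
```

Contig ends contribute missing adjacencies even when no rearrangement happened, which is one way to see why accuracy drops as the number of contigs grows.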
“Comparative Multi Genome Annotation with Gnomon” Alexander Souvorov
(sourvorov@ncbi.nlm.nih.gov)
Alignments are used for model-building and training, then ab initio gene prediction finds genes, which
are extended
Can be used in any annotation pipeline that can handle protein alignments
Used two Theileria genomes with no ESTs available
Weak homology to pool of known proteins
Easier to combine both genomes for annotation
Used tblastx hits in reference genome to compare introns in targets to find common features
Errors in reference genome won't be propagated
Align each genome to the other multiple times
Works well for small genomes, but large genomes have areas of repeats
Protein sequence support to link two or more genomes
Tested with Arabidopsis (20757 TAIR genes with ~2.5 isoforms each) and grape (3515 contigs,
fewer genes)
Input conditions for “lost” and “found” sequences (mathematical equation)
Grape worked better than Arabidopsis (proteins clustered)
“Gene Identification pipeline for novel eukaryotic genomes combining unsupervised training
with experimental evidence” Mark Borodovsky, GA Inst. of Tech (borodovsky@gatech.edu)
Used for gene identification in novel eukaryotes without ESTs
Conserved regions of DNA can be modeled
Probabilistic models like HMM, supermodel that switches on other models at certain places in
the genome.
GeneMark algorithm uses HMM with duration
As genome becomes more complex the percentage of non-coding DNA increases
Exons stay small, introns/intergenic sequences grow
Unsupervised program is desired: requires no training (so can get to annotation faster) and is iterative
Need 5 MB sequence for model to work well. N50 should be >= 10K
GeneMark-ES is a new version, used on strawberry genome (210 Mb)
633 genes in test set
5832 genes are supported by ESTs
The number of TEs causing repeats (segmental duplications) predict repetitive elements (can be
in introns)
Automatic, faster to use
“TAGdb: a tool for gene and promoter discovery in complex plant genomes” Chris Duran, U. of
Queensland (c.duran@uq.edu.au) (flora.acpfg.com.au.tagdb)
Uses paired end sequence data: web front-end and command line
Illumina read length 35-70 bases, insert size up to 10 kb (mate pairs?)
Uses Javascript & AJAX, “openLayers”, Perl CGI, MySQL
Can submit jobs on their machine → “double-barreled” blast (MEGABLAST)
IGLOO (image generation) – unrendered, vector based data
Use RIVA to view job
“Double barreled blast” (MEGABLAST): use one set of tags as reference to blast sequences to.
Then, filter based on length and score.
Output is custom fasta file.
Short-read libraries are color-coded
Command line Perl scripts (so don't need web interface) give blast output file
Verification script to check insert size/positional info
Can look at read coverage to compare where they are different (more matches in repetitive sequence)
“Data Mining at PLEXdb: the Plant and Plant Pathogen Expression Database for Functional and
Comparative Genomics” Sudhansu Dash (sdash@iastate.edu) Iowa State U.
Plant and Pathogen Gene Expression database (for microarrays).
Only supports Affy arrays (inc. Medicago)
Option to submit data at NCBI GEO
Visualize genes in different treatments
Microarray platform translator (between Affy arrays)
Gene Oscillo Scope – compare between experiments
Data Mining to find co-expressed genes
Plans to include other platforms, NGS, etc. if get NSF funding
“Discover what's new at NCBI” Steve Pechous (pechous@ncbi.nlm.nih.gov)
New Tools:
Primerblast – based on Primer3
CloneFinder – on MapViewer Home Page
Graphical Sequence Viewer (“Graphics” on top) – gene distributions on chromosome
New Databases:
SRA – now just an archive, tools under development; stored under experiments
BioSystems – with links to molecules (proteins, genes, PubChem) and directly to KEGG
pathway
NCBINew on NCBI Bookshelf
There is a 3rd party annotation submission (how?)
“Managing Genome Assemblies” Deanna Church (church@ncbi.nlm.nih.gov)
Unlocalized sequences – know which chromosome, but not where
Unplaced sequences – know the organism, but don't know which chromosome
Alternate loci – alternative representations of sequences present on the chromosome
Assembly Database – allows tracking different versions of an assembly.
Can submit now, but user interface is not done yet
Genome Reference Consortium – Genome workbench tool
“NCBI Genetic Variation Resources” Lon Phan (lonphan@ncbi.nlm.nih.gov)
dbSNP – simple genetic variation (~90 different organisms; searchable)
dbVar – large structural variations and CNV (indels, inversions, etc.)
dbGAP – genotypes and phenotypes
“Expanding the Protein Cluster Database to include plants” Anjana Raina
(raina@ncbi.nlm.nih.gov)
Collection of RefSeq proteins, including plants (RefSeq is database of non-redundant set of
chromosomes, transcripts, and proteins)
Clusters can be split and combined (~41,192 clusters, most have only 2 proteins)
Can build phylogenetic trees
Pre-computed protein alignment link for each (can see protein alignments)
Can link out to mapViewer
Database ACCESS
No blast link available but can use Genome Workbench
“UniGene: A resource for plant and animal transcripts” Lukas Wagner (wagner@ncbi.nlm.nih.gov)
Assemblies are unstable and error-prone (no consensus transcripts)
Grouping transcripts without an annotated genome
mRNA gene associations, HomoloGene associated via blastx
mRNA-mRNA pairwise alignments
Organisms: >70,000 ESTs or mRNAs
Can work directly off 454 transcript reads
They are working on how to handle high-throughput sequences
Digital Differential Display to compare expression levels based on Fisher's Exact Test
Protein similarities – best matching protein for each organism (ProtEST), can align all cDNAs
Links from UniGene to Map Viewer
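For the Fisher's Exact Test comparison of expression levels mentioned above, a two-sided test on a 2x2 table (gene reads vs. all other reads, in two libraries) can be sketched with the standard library alone. This is a generic textbook version, not NCBI's Digital Differential Display code; the function name and counts are made up, and for very large counts a statistics library would be the practical choice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical gene: 30 of 1,000 ESTs in library A vs. 10 of 1,000 in library B.
print(fisher_exact_two_sided(30, 970, 10, 990))
```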