PAG_Notes - Bioinformatics Core Wiki

“Development of Genome and Transcriptome Sequence Resources in European Hazelnut”
Todd Mockler, Oregon State U (tmockler@cgrb.oregonstate.edu)
Did short-insert Solexa only
Did not do any quality trimming/filtering (no error correction)
Due to RAM limits, could only do ~100 million reads at once with Velvet, so did Velvet subassemblies and used MIRA to co-assemble contigs
Results were extremely fragmented (333,492 contigs, shorter than gene size)
Aligned input data to see depth of sequencing; got ~50% of reads in assembly
Aligned RNA-seq data to contigs – 22.6% of contig space had RNA-seq reads mapped. They estimate it is 90% of the genes.
Only 20% of Velvet contigs matched GenBank (65% of hits were grape/poplar/castor bean)
Tried SOAPdenovo for entire pile at once, but got 1.3 million scaffolds (much of which was “N”s)
Have since done 2 kb mate-pair library sequencing

“Next generation transcriptome resequencing in Populus trichocarpa enables studies in natural genetic diversity, genome-wide analysis of mRNA processing and expression, and association genetics”
Carl J. Douglas, U. of British Columbia (cdouglas@interchange.ubc.ca)
Did 4 lanes of mRNA-Seq per cDNA sample of 20 genotypes
Used MAQ, discarding reads with >3 mismatches
Can map intron-exon boundaries
Can find non-annotated exons and splice-site variations
8% of reads mapped to “introns”
6% of reads mapped to “intergenic regions”
64% of reads map uniquely
Read count was normalized (how?) and candidate genes were ranked on a log scale
SNP analysis:
Peaks analysis using wig files
Calls based on consistency across 20 individuals
Validation by Sanger sequencing (but there can be allele-specific expression)
7x coverage of xylem RNA produced 400,000 candidate SNPs
Also doing transcriptional network analysis based on Arabidopsis

“Prelude to a Genome: De novo assembly, annotation and profiling of an expressed gene catalog of a fast-growing Eucalyptus hybrid clone using Illumina mRNA-Seq”
Eshchar Mizrachi, U.
of Pretoria, South Africa (eshchar.mizrachi@up.ac.za) (Eucspresso/Eucagen website)
Had 4.5x assembly (now have 8x assembly) and very little NCBI gene sequence information
Constructed gene catalog with mRNA-Seq
16 lanes of paired-end sequences (4.2 Gb after filtering rRNA)
Used Velvet, then custom scripts to extend contigs and separate genes, pseudogenes and paralogs, to generate a final set of 19,000 contigs with high coverage, many full length with UTR
Contigs won't assemble if coverage is not uniform, so contigs were assembled one by one using Velvet and custom scripts to map reads to each contig (using SOAP, Bowtie, and ?)
Quality control: mapped Illumina reads to Sanger reads
Found homologs to Arabidopsis, Poplar, and Grape
KEGG maps were very similar to Arabidopsis
Used 128 GB RAM machine, now have 256 GB machine

“Analysis of the transcriptome of the Fagaceae species”
Abdelali Barakat, Penn State (aub14@psu.edu) (Fagaceae Genomics website has all sequences)
Trying to find disease-resistance genes (American Chestnut vs. Chinese Chestnut)
Sampled early in infection (17 dpi) in field
Created chestnut unigene set (how?)
Blast contigs against Arabidopsis looking for blight disease-related defense genes
Also found miRNAs (different prep?)
Now sequencing Chinese Chestnut genome.

“Genetic variability in the conifer wood transcriptome”
John Mackay, Université Laval, Canada (John.Mackay@sbf.ulaval.ca)
Work was in white spruce and is related to wood properties (quantitative variation in transcript levels)
Microarray has been developed with ~32K spruce genes derived from EST and 454 sequencing
Association population was genotyped with Illumina and microarray
Regulon genes (TFs – coordinate regulation, test in transgenes)

“Genome-wide SNP data refine relationships in the grapevine”
Sean Myles, USDA (smm367@cornell.edu) (Grape Genetic Diversity Project)
Generated 2 Gb Illumina sequence from HpaII digestion of genomic DNA
Identified 470,000 SNPs (on average, 1 every 41 bp), of which 71,000 “good”; reduced that to 10,000 SNPs, put 8,988 on an array (Illumina SNP chip) and found 6,114 high quality SNPs, most of which preferentially segregate.
They are looking at grape domestication from V. sylvestris; archaeological data indicates this occurred in the Near East.
Was there a domestication bottleneck?
SNP selection shows diversity between domesticated vinifera.
LD breakdown is very similar (decays very rapidly), so there are no haplotypes, so SNPs are difficult to use.
May need to use haplotype data for genome-wide analysis.
RNA-Seq: 36-44 nt reads, 17-20 million reads per sample, mapped to reference genome by ELAND.
Mapreadslocation.py maps unique reads to exons, introns, intergenic regions and unmapped exons.
Use MAQ to find SNPs with minimum read depth of 10.
Result was 86,000 SNPs, only 7,480 in genes (exons) and 24,000 in introns.
Splice variants: again align with ELAND.
Scripts: vmf_filter_uniqs.py and get_alt_splice.py used to find 385 alternatively spliced genes.
ERANGE – for global analysis of gene expression
Normalization – calculated reads per kb of exon per million total reads (RPKM)
More than 17,000 genes seen
Validated several genes with qPCR
“Genoscope” annotation for manual verification (see Berkeley Bullard reference)
Did Fisher's Exact Test and found ~7,000 genes had statistically significant differential expression.
Interestingly, when they took the unmapped reads and built contigs from them, some reads that had not previously mapped would map to those contigs.

“A gene expression map of Vitis vinifera cv. Corvina development”
Mario Pezzotti, U. of Verona (mario.pezzotti@univr.it)
3 biological replicates of samples of different tissues (20 developmental stages) over 2008.
Roots were from in vitro
Separated skin, flesh and seed of berries
Used gene prediction from 12X genome assembly (2 different gene predictions)
CombiMatrix 90K array: 1 probe/transcript, 3 of each probe (time-consuming)
NimbleGen 12plex (12 x 135K) array: started design with 1 probe/transcript and 4 of each probe, changed to 4 probes/transcript and 1 of each probe
Analysis used FDR threshold of 0.05 and fold change > 2
Hybridization quality control – compare biological replicates (made sure that different biological replicates were on different chips)
Performed functional characterization with GO

“Transcriptomic Analysis of Early Fruit Development In Cucumber (Cucumis sativus)”
Rebecca Grumet, MSU (grumet@msu.edu)
As fruit size increases, cell/vacuole size increases, cell wall/middle lamella thicken
Samples were 8 days post pollination, did 454 sequencing
Showed graph of contig length vs. number of ESTs
Contigs >200 bp and >= 10 ESTs: 90% had homologs in Arabidopsis, 7% had no homologs in NCBI database
If <10 ESTs, only 55% had homologs
Took replicate samples, used sequence to verify, then qRT-PCR
Different distribution of genes represented if > 100 ESTs vs.
the whole list
Found unique phloem genes and hormones involved with fruit set and development
Lipid proteins (surface wax)
Latex-like proteins (defense?)
Transcription/signaling genes were under-represented
Time course experiment: 0, 4, 8, 12, 16 days post pollination
454 Titanium yielded 1.1 million ESTs
Similar stats as for previous set with >= 10 ESTs/contig
Looked at >= 100 ESTs/transcript, compared time points to day 0 (anthesis)
Heat map showed -3 fold green to +3 fold red

“Transcriptome mapping and cataloging of alternative splicing in Arabidopsis”
Sergei Filichkin, Oregon State U (filichks@onid.orst.edu) (mockler-lab-tools.cgrb.oregonstate.edu)
RNA-Seq with 32 bp reads
1) Combine RNAs from different tissues with oligo dT primers to get full-length cDNAs (~60 million reads)
2) Use random priming to generate 300 million trimmed reads and align to genome (HashMatch). If no match, do more, including SuperSplat
Produce single-base resolution map of transcriptome
Look for novel splice junctions, make sure that reads cover them
At least 42% of intron-containing genes are alternatively spliced
6% of all introns have non-consensus splice junctions (5 fold higher than previously thought)
Use RT-PCR (splice junction specific and exon flanking primers); Sanger sequencing; qRT-PCR to validate

“Rapeseed (B. napus) SNP Discovery Using a Dedicated Sequence Capture Protocol and 454 Sequencing”
Jean-Philippe Pichon, Biogemma
Allotetraploid: B. rapa (n=10) x B. oleracea (n=9) → B. napus (n=19)
Sequence Capture (NimbleGen) → Sequence selection (769 probes/array)
Need exon-exon junction data to avoid chimeric probes
Probes with repeated sequences removed
Capture array (720K probes) → 454 sequencing
102055/102736 reads, assembled into 2439 contigs

“Sequence assembly in color space: from de novo to resequencing”
Todd Michael, Rutgers U.
and Monsanto (solidsoftwaretools.com/gf/project/denovo/frs/)
SOLiD pipeline: QVfilter → Accuracy Enhancement → Preprocessor → Velvet
Colorspace = dinucleotide sequence
Their software converts “colorspace” into “basespace”
Can use first base in each read to check contig to find errors.
Use minimal mismatches to do error correction (last stage of de novo pipeline)
SAET (Spectral Accuracy Enhancement Tool) does error correction from raw reads
Runs sliding window across each read
“Words” without errors occur more often than words with errors
69% of reads mapped to reference is about expected (= error rate of <1%)
Complex (diploid) genomes with SAET (de novo, community-led consortium)
Quality/error correction (pre-processing) → de novo assembly → contig merging → scaffolding (post processing) → gap filling
ASID for de novo assembly (improvement on gap and repeat filling)
SOLiD (60 Gb) + 454 (29 Gb) = 70X coverage
Reads → contigs
Paired ends → scaffolds (PE indicate connection of contigs)
454 sequences assembled with Newbler
It is more important to have a lot of paired-end sequences to generate scaffolds
PASS program to align short sequences (can use colorspace and basespace)
CONSORT program can close the gaps that are created with Newbler
Did alignments with many sets of parameters
All possible words are aligned to minimize errors
Use PE to merge contigs into scaffolds, can estimate gap length (“consolidated arcs”)
Build scaffold paths from intercontig arcs
Assembly of small plant genomes
Every SNP call represents 2 independent reactions, so need 2 changes in colorspace for 1 SNP, whereas 1 change in colorspace = sequencing error
Do not just translate colorspace to basespace, because then can't make colorspace correction
Use corrected dnaspace in contig assembly
Polyclonal reads need to be removed
2 DNA moieties in amplification process
2 signals emanating from beads
Errors are on the 3' end of the reads (reference Sasson & Michael, Bioinformatics)
If QV < 10, there
is an error at that base
Normally, first 5-10 bases should be of highest quality (QV > 25)
If 5 of the first 10 bases are QV < 25, read is polyclonal
86% of filtered reads matched the reference genome (in resequencing) vs. 15% of reads that didn't pass filter (have multiple errors)
SOPRA (Statistical Optimization of Paired Read Assembly) will be available soon
V-SOPRA (Velvet fragment mode) and S-SOPRA (SSAKE algorithm)
1. Assemble reads in colorspace, but first base is dnaspace to check consistency
2. Translate colorspace to dnaspace
3. Optimize orientation
Contigs that don't pass step 1 are discarded
Reads are aligned onto contigs, then queried to see if optimal for distance and orientation (“spring model”) to identify misassemblies and chimeras
How does coverage/density compare across scaffold?
Areas of ambiguity have increased density, so apply trimming method to break into different scaffolds
SOPRA improves assembly accuracy and N50 (this work was based on bacteria)
Can detect structural variants

“Next Generation Forward Genetics”
Ies J. Nijman, Hubrecht Institute
Mapping and identification of mutants by SOLiD
Forward genetics – Arabidopsis bulk segregant analysis
Mutagenize, then cross mutant with phenotype to wild-type, then self F1
Homozygous mutants (25%) have recombination events to isolate small region of your mutation
Procedure to call SNPs and determine linkage (need genetic map)
1. Light sequence of homozygous pool (10-20 million reads per mutant)
2.
Deep sequence library enriched for region of interest for SNP calling
Map loci to find differences (polymorphisms)
Use Agilent capture array (CGH array) for that region
Multiplexed libraries hybridized on single array
Can find small recombination events
Candidates need biological followup

“Robustness in the face of complexity: Single Molecule Real Time DNA sequencing”
Stephen Turner, Pacific Biosciences (sturner@pacificbiosciences.com)
Hairpin adaptors are used, so sense and antisense strands are both sequenced multiple times
Tends to favor high G+C content
Can span duplications, repeats
Read length distribution is exponential, with mean read length > 1 kb
“Strobe Sequencing” – turn lights on and off to generate multiple sequence windows in same molecules (their version of mate-pairs)
Good for sequencing across multiple insertion events
Can span insertions and anchor them to genomic reference
Can also exploit circular technology to do up to 20X coverage on single molecule, so can get consensus on single molecule, which makes it easy to find sequencing errors.
Can see methylated bases by measuring delay between nucleotide incorporations (no need for bisulfite conversion)
Can distinguish between different types of base modifications
Potential sequencing performance: 50 bases/sec > 100 Gb/hr (human genome in 15 minutes)
Maximum read length will be 15-17 kb in 5 years (limiting factor is polymerase processivity)
This platform doesn't lend itself to counting tags, but is better for understanding splice variants
Initially not competitive with short-read mRNA-Seq
Generalized tool for all sorts of biological sequencing (not just DNA/RNA)
They are working on ways of circularizing RNA.
Currently have a protocol for linear RNA (but not ready for release)
Will start selling DNA machines in mid-2010
Sample prep is as short as 20 minutes with random primers
SMART sample prep takes less than a day (and will improve)
Runs are short (6-20 minutes), throughput currently is equivalent to NGS
Automated sample loading system, so you can “walk away” for 8 hours
Dead time between samples is 120 seconds, should be reduced to 10-20 seconds
Data reduction (base calling/quality filtering) is internal to system. Image data is not stored.
Will create “temporal reference sequence” for organism that saves reference kinetics. Then can do re-sequencing and compare kinetics and structural variation.

“De Novo Sequencing of Plant Genomes”
William McCombie, CSHL
Hybrid Hierarchical Assembly (Sanger, 454, and Illumina reads)
Used different combinations of assemblies (tried staged with pre-filtering, then Phrap)
Currently working on 30X peach, 72X cocoa
Using Velvet on rice genome – ran into memory problems on 512 GB RAM machine
New version of Velvet “gets around this”
ABySS assembly generated 12-35 Mb max contig size with N50 1850-2850

“Physical mapping and transcriptional analysis of large homeologous deletions in soybean”
Robert Stupar, University of Minnesota (stup004@umn.edu)
Looking at homeologous blocks between Gm08 and Gm15.
Identified common SNPs in 1 Mb region
Generated lines with one homeologous group deleted (or additional copy added)
Can't recover homozygous deletions
Used NimbleGen 700k microarray to hybridize to find copies (each probe represented ~1 kb)
Transcription consequences: relative proportion similar to dosage. Does regulation differ?

“Comparative analysis of a 1-Mb region of Phaseolus vulgaris to the highly duplicated soybean genome”
Jer-Young Lin, Purdue (lin51@purdue.edu)
Studying same 1 Mb region as previous talk, but comparing Phaseolus to soybean
Reshuffling of genome occurred in the soybean tetraploid ancestor, but some genes are not found in both homeologs
Compared Gm15 and Gm8 to Pv5: Gm15 showed inversion and loss of genes. Gm8 is closer to the ancestral genome and more similar to Pv5, but Pv5 has more transposons than either Gm8 or Gm15.
There are some segmental duplications on Gm15 (very recent)
There are some successive tandem duplications on Gm8 (9 copies)
Both have higher transposon densities
Sequence divergence: lower Ks (synonymous substitution rate) on Gm15 (biased gene loss)
Expressional bias for Gm8 when comparing homeologous gene pairs
Correlation between the two? But R values seem low

“Verticillium Comparative Genomics Sheds Light on Pathogenicity of Wilt Pathogen”
Li-jun Ma, Broad Institute (lijun@broadinstitute.org)
V. dahliae vs. V. albo-atrum: Region unique to V. d. is highly dynamic, rich in transposons and highly expressed genes (but no housekeeping genes)
Genes shared between wilt pathogens (Verticillium, Fusarium): glucosyl transferase – synthesizes membrane oligosaccharides, related to bacterial genes (HGT)
CAZy (carbohydrate-active enzymes) – more in Verticillium (contributes to broad host range). (Expanding arsenal of cell-wall degrading enzymes)

“MAKER: An easy to use genome annotation pipeline”
Carson Holt, University of Utah (www.yandell-lab.org)
References: Cantarel 2008 GR, Coghlan 2008 BMCB
Incorrect genome annotations poison every experiment that uses them
10/2009: 222 eukaryotic genomes sequenced (but unpublished), ~900 projects underway
MAKER is designed to help small research groups: easy to use, part of GMOD, GFF output; optimized for parallelization
Annotation pipeline (needs quality control statistics), not gene predictor (just a model).
Identifies repeats, ESTs, etc.
Ab initio annotation (SNAP, Augustus) needs training (e.g., model organism); however MAKER doesn't need training, can “train itself”.
SNAP runs inside of MAKER, can run iteratively (bootstrap) to improve the gene models.
Can update annotations based on new evidence
MAKER algorithm:
1) Identify and mask repetitive elements (although many encode vital proteins) with RepeatMasker (RepBase or user-supplied library) and RepeatRunner (proteins that have diverged). MAKER supports SNAP, Augustus, GeneMark, FGENESH; trained HMMs must be supplied for each (see GMOD wiki for how to train SNAP with MAKER).
2) Align evidence from ESTs/proteins – uses blasts (EST tblastx, blastn, protein blastx)
Can be complications with splice sites
Program EXONERATE polishes BLAST data (no HSP overlap; HSPs must align with splice sites)
3) Passes evidence back to ab initio program (e.g., SNAP); different ab initio programs will produce different models
4) Model is updated
Using MAKER (go to Yandell lab website)
MAKER Web Annotation Service – can use on-line
MWAS_Tutorial (gmod.org/wiki/MWAS_Tutorial)
SOBA (Sequence Ontology Bioinformatics Analysis) statistics for different features, shows graphs of statistics on data, DAGs, etc.
APOLLO genome viewer
De novo annotation of newly sequenced genome: GeneMark is easiest to train (then SNAP/CEGMA, by Ian Korf)
Training info at “Summer School of the Americas”
Other problems: evidence to pass through or update existing annotation
mRNA-seq data: give resolution for intron-exon junctions
Align reads into expression “islands” and “junctions”
Pass alignments as EST evidence via GFF format.
Can set threshold for alignment for required depth of coverage (maybe can use BES as genome to do this?) or can assemble to get longer alignments
Legacy annotations (existing models) may be poor quality, conflicting annotations
Incorporate mRNA-seq to existing annotations, “private” annotation sets.
MAKER merges with these to produce consensus annotations, and can feed conflicting annotations into ab initio predictors
MAKER is a part of “Genome Investigator”. Other parts are Evaluator and Verifier
MAKER data can be integrated into GMOD tools:
Chado database
JBrowse (GFF3 → JBrowse input (web interface)) can show InterPro domains
Use to evaluate different contig builds
CEGMA (Ian Korf)
Other programs can be used to annotate repetitive elements

“GMOD Project Update”
Dave Clements and Scott Cain (clements@nescent.org) NESCent, Durham NC (gmod.org/wiki/gbrowse)
Visualization – GBrowse 2.0 has been released (Jan 2010), big performance improvement
Includes popup balloons
Will allow private individual accounts to share data
JBrowse – 2nd generation browser
GBrowse_syn – comparative genomics viewer
Reference sequence compared with 2 others
Syntenic blocks don't need to be colinear
Data Management – Chado 1.0 → 1.1 really soon, with improvements, faster, friendlier
Can create ontology-based views
Tripal – web front end for Chado data based on Drupal
Takes care of accounts, job management (e.g., blast)
Table Edit – MediaWiki extension
Easier to make tables, extract data to tables (e.g., Chado → MediaWiki)
BioMart – query multiple databases, new GUI, easier to use
InterMine – data integration (20 common formats)
Annotation – MAKER, DIYA (pipeline), Galaxy (workflow), Ergatis (annotation/cloud computing), Apollo
Tutorials available at gmod.org

“Using the JBrowse Genome Browser with Large Amounts of Data”
Mitchell Skinner (mitch_skinner@berkeley.edu), UCB
JBrowse can use NGS data, can link out to other websites, has smooth scrolling
Moves work from web server to web browser, but without overloading browser (breaks up data), by doing more reads than writes.
Test Set – 4.4 million features (not PE reads), took 8 minutes to process from BAM file
Compression ~90%, only used 400 MB RAM
Breaking up data: NCLists (features contained within other features)
fast to query, tree structure
load fake features, containing NCLists (“lazy loading”)
SAMtools and R-Trees (BigBed, BigWig) also came up with loading schemes
JBrowse must load through a proxy (translation layer)
GBrowse vs. JBrowse: GBrowse has more functionality, but JBrowse does basic tasks well (but rough around the edges)

“Comparative Genomics with GBrowse_syn”
Sheldon McKay, CSHL (gmod.org/wiki/GBrowse_syn_PAG_tutorial)
GBrowse_syn = generic synteny browser, included with GBrowse 1.69 or later
Used by WormBase, TAIR
Seamless GBrowse integration, sits on top of GBrowse interface + database + alignment data
On-going support/development
Doesn't rely on perfect co-linearity (no orphan alignments)
On the fly chaining (groups alignments together to create pseudo synteny blocks)
No limit on the number of species
Uses grid lines to trace fine-scale indels (so info isn't lost)
Keeps track of inserts/deletions
Can color-code and shade things
Can “flip panels” to make look better (toggle between reference sequence and other sequence)
Can do “all-in-one” view (summary) of all species
Can mouse over things to get info
Can compare gene models
For small aligned regions, can show gene orthology, chained orthologs, panels can be merged
Input data is primarily whole genome alignments; start with raw sequences:
Mask repeats (RepeatMasker, etc.)
Further processing
Identify orthologous regions (ENREDO, MERCATOR, orthocluster, etc.)
Can then go to GBrowse_syn
Nucleotide level alignment (PECAN, MAVID, etc.)
Wiggle tracks to GBrowse_syn
Can also send output to GBrowse, UCSC, etc.
To work well, need pairwise comparisons between each species (MERCATOR Ultra contigs)
Can use data without alignments (co-linear blocks, but no sequence alignment)
Gene orthology alignments based on protein blasts (associate annotation)
Only need start point and end point
Self vs. self comparison of polyploidy, duplication, etc.
Calculate many anchor points with multi sequence alignment
Aligned DNA sequences can be distant (e.g., PECAN alignment of P. pacificus to C. elegans)
Segmental duplications – use protein orthology, then synteny blocks
Architecture – Bio::DB::GFF
Future of GBrowse_syn:
Integration with GBrowse 2.0
“On the fly” sequence alignment view
AJAX-based user interface (JBrowse_syn)
Other Synteny Browsers:
SynView (Wang, Bioinformatics 2006)
Part of GBrowse; marked up config file
SynBrowse (Ran, Bioinformatics)
Stand-alone app
Similar to genome browser
Uses sequence alignments and other data to highlight relationships
Usually displays co-linearity relative to a reference genome
Sybil (uses Chado database, Crabtree, Methods Mol. Biol.
2007)
CMap (broad user community)
Non-GMOD browsers: circular (mkweb.bcgsc.ca/circos, mizbee.org)
CoGe (synteny.cnr.berkeley.edu/CoGe)
GBrowse_syn demo (tutorial is on their wiki)
Run on VMware, take an image first
User name and password are both gmod
Go to full screen after it loads (helps you to remember to stay on virtual machine)
Firefox is installed in the distribution
Configuration – need to run as root (sudo)
To switch: alt-tab
To copy: shift-ctrl-c
To paste: shift-ctrl-v
gbrowse_netinstall.pl grabs all prereqs
Includes CPAN module, BioPerl, GBrowse source code
Need MySQL server and a bunch of other things
run -d flag for latest version
hit enter whenever prompted (ignore error messages, unless it refuses to install)
Alignment data needs to be in fasta or clustalw format (but doesn't need to have been generated by clustalw)
IDs need to have metadata to relate alignment back to reference genome
PECAN or Mauve can align big pieces of DNA like genomes
Can take wiggle tracks
If not using alignments, can load database with other entry points (e.g., gene orthology data)
Central configuration file is different between GBrowse and GBrowse_syn
Takes GFF3 files

“Tripal: a Construction Toolkit for Online Genomic Databases”
Stephen Ficklin, Clemson U Genomics Institute (ficklin@clemson.edu) (gmod.org/wiki Tripal tutorial)
Tripal = GMOD Chado (database) + Drupal (content management)
Requirements: Linux/GMOD Chado/Drupal/PHP
Distributed via SourceForge
Can use Drupal themes – use what other people have created to customize screen look
Tripal has modules for expansion (under “Administration”), correlate with Chado tables
To get data out use Master Views – can break down by GO term/category

“SCRI Visualization Tools”
David Marshall, SCRI (David.Marshall@scri.ac.uk)
All Java-based, freely available
TABLET
RNA-Seq from different genotypes aligned to different contig assemblies
Use Mosaik → GigaBayes → markup files → import sets of features
Used for variant finding, SNP
discovery (can “jump” to particular SNPs)
Can put assemblies in and look at contigs (shows read info past consensus sequence)
Alternative splicing with RNA-Seq
Bowtie (against Arabidopsis pseudo-molecules) → TopHat (takes rest of reads and creates contigs)
FLAPJACK
Has markers in order on genetic map of chromosomes (database)
Can enter categorical or numerical trait data, experiments
4 sets of data (ordering for markers, project file, SNPs, genome/map)
Can color by allele frequency to find rare haplotypes
Marker select mode – can select markers under a QTL
Can sort horizontally and vertically
STRUDEL
Synteny between Brachypodium/barley/rice (very nice looking)

“Integrated Genome Browser: Visualization Software and Data Server for Next-Generation Genomics”
Ann Loraine, U of NC (aloraine@uncc.edu) (genoviz.sourceforge.net, gb.bioviz.org)
Developed by Affy to visualize tiling arrays
GenovizSDK – Java library for building visualization applications
IGB can be used for ChIP-Seq and RNA-Seq data
Ways to get data in:
Open file (many formats allowed)
Via website
Via data server/Quickload (if only want pieces of a big dataset)
Can set up folders and files on server, with directory for each genome
Add annotation file and file with genome structure
Add annotations (e.g., TAIR9)
454 data (2 sample mRNA-Seq data)
Align data onto genome using BLAT (similar to UCSC) – uses a compressed format so data moves faster
Can move tracks around to create almost heat-map like views.

“Gene order comparison with contigs and scaffolds”
Adriana Munoz, U. of Ottawa (amuno010@uottawa.ca)
Project: 10 Drosophila species + 4 outgroup species
Rearrangement algorithms – use contigs/scaffolds as if they are chromosomes
Based on NGPs (neighboring gene pairs), a database of adjacent genes
Reconstruct contigs by overlapping NGPs
Rearrangement operations (single chromosomes vs.
2 chromosomes) – generated mathematical model
Genomic distance is minimum number of rearrangements to get from one genome to another
Genome fragmentation – compare one genome in contig form to another full genome
If both genomes are in contig form, use slightly different algorithm
Can do phylogeny and reconstruct ancestral genomes
The larger the number of contigs, the less accurate this method is
Extended model to compare genomes in scaffold form

“Comparative Multi Genome Annotation with Gnomon”
Alexander Souvorov (sourvorov@ncbi.nlm.nih.gov)
Alignments are used for model-building and training, then ab initio gene prediction finds genes, which are extended
Can be used in any annotation pipeline that can handle protein alignments
Used two Theileria genomes with no ESTs available
Weak homology to pool of known proteins
Easier to combine both genomes for annotation
Used tblastx hits in reference genome to compare introns in targets to find common features
Errors in reference genome won't be propagated
Align each genome to the other multiple times
Works well for small genomes, but large genomes have areas of repeats
Protein sequence support to link two or more genomes
Tested with Arabidopsis (20757 TAIR genes with ~2.5 isoforms each) and grape (3515 contigs, fewer genes)
Input conditions for “lost” and “found” sequences (mathematical equation)
Grape worked better than Arabidopsis (proteins clustered)

“Gene Identification pipeline for novel eukaryotic genomes combining unsupervised training with experimental evidence”
Mark Borodovsky, GA Inst. of Tech (borodovsky@gatech.edu)
Used for gene identification in novel eukaryotes without ESTs
Conserved regions of DNA can be modeled
Probabilistic models like HMM, supermodel that switches on other models at certain places in the genome.
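
The idea of a probabilistic model that switches between sub-models along the genome can be illustrated with a toy two-state HMM decoded by the Viterbi algorithm. This is only a minimal sketch and is not GeneMark: the state names, all probabilities, and the function name `viterbi` are made up for illustration, and it is a plain HMM without the explicit state durations mentioned in the talk.

```python
import math

# Hypothetical two-state model: GC-rich "coding" vs. AT-rich "noncoding".
# All numbers below are illustrative assumptions, not trained parameters.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        prev = score
        score, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Calling `viterbi("AAAAAAAAGCGCGCGCGCGCAAAAAAAA")` labels each base with one of the two states; a real gene finder would use many more states (exons, introns, splice sites) and parameters estimated from the genome itself.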
GeneMark algorithm uses HMM with duration
As genome becomes more complex the percentage of non-coding DNA increases
Exons stay small, introns/intergenic sequences grow
Unsupervised program is desired: requires no training (so can get to annotation faster), is iterative
Need 5 Mb of sequence for model to work well. N50 should be >= 10K
GeneMark-ES is a new version, used on strawberry genome (210 Mb)
633 genes in test set
5832 genes are supported by ESTs
The number of TEs causing repeats (segmental duplications) predict repetitive elements (can be in introns)
Automatic, faster to use

“TAGdb: a tool for gene and promoter discovery in complex plant genomes”
Chris Duran, U. of Queensland (c.duran@uq.edu.au) (flora.acpfg.com.au/tagdb)
Uses paired end sequence data: web front-end and command line
Illumina read length 35-70 bases, insert size up to 10 kb (mate pairs?)
Uses JavaScript & AJAX, “OpenLayers”, Perl CGI, MySQL
Can submit jobs on their machine → “double-barreled” blast (MEGABLAST)
IGLOO (image generation) – unrendered, vector based data
Use RIVA to view job
“Double barreled blast” (MEGABLAST): use one set of tags as reference to blast sequences to. Then, filter based on length and score. Output is custom fasta file.
Short-read libraries are color-coded
Command line Perl scripts (so don't need web interface) give blast output file
Verification script to check insert size/positional info
Can look at read coverage to compare where they are different (more matches in repetitive sequence)

“Data Mining at PLEXdb: the Plant and Plant Pathogen Expression Database for Functional and Comparative Genomics”
Sudhansu Dash (sdash@iastate.edu) Iowa State U.
Plant and Pathogen Gene Expression database (for microarrays).
Only supports Affy arrays (inc.
Medicago)
Option to submit data at NCBI GEO
Visualize genes in different treatments
Microarray platform translator (between Affy arrays)
Gene OscilloScope – compare between experiments
Data mining to find co-expressed genes
Plans to include other platforms, NGS, etc. if they get NSF funding

“Discover what's new at NCBI”
Steve Pechous (pechous@ncbi.nlm.nih.gov)
New Tools:
Primer-BLAST – based on Primer3
CloneFinder – on MapViewer Home Page
Graphical Sequence Viewer (“Graphics” on top) – gene distributions on chromosome
New Databases:
SRA – now just an archive, tools under development; stored under experiments
BioSystems – with links to molecules (proteins, genes, PubChem) and directly to KEGG pathway
NCBI News on NCBI Bookshelf
There is a 3rd party annotation submission (how?)

“Managing Genome Assemblies”
Deanna Church (church@ncbi.nlm.nih.gov)
Unlocalized sequences – know which chromosome, but not where
Unplaced sequences – know the organism, but don't know which chromosome
Alternate loci – alternative representations of sequences present on the chromosome
Assembly Database – allows tracking different versions of an assembly. Can submit now, but user interface is not done yet
Genome Reference Consortium – Genome Workbench tool

“NCBI Genetic Variation Resources”
Lon Phan (lonphan@ncbi.nlm.nih.gov)
dbSNP – simple genetic variation (~90 different organisms; searchable)
dbVar – large structural variations and CNV (indels, inversions, etc.)
dbGaP – genotypes and phenotypes

“Expanding the Protein Cluster Database to include plants”
Anjana Raina (raina@ncbi.nlm.nih.gov)
Collection of RefSeq proteins, including plants (RefSeq is database of non-redundant set of chromosomes, transcripts, and proteins)
Clusters can be split and combined (~41,192 clusters, most have only 2 proteins)
Can build phylogenetic trees
Pre-computed protein alignment link for each (can see protein alignments)
Can link out to MapViewer
Database access
No blast link available but can use Genome Workbench

“UniGene: A resource for plant and animal transcripts”
Lukas Wagner (wagner@ncbi.nlm.nih.gov)
Assemblies are unstable and error-prone (no consensus transcripts)
Grouping transcripts without an annotated genome
mRNA gene associations, HomoloGene associated via blastx
mRNA-mRNA pairwise alignments
Organisms: >70,000 ESTs or mRNAs
Can work directly off 454 transcript reads
They are working on how to handle high-throughput sequences
Digital Differential Display to compare expression levels based on Fisher's Exact Test
Protein similarities – best matching protein for each organism (ProtEST), can align all cDNAs
Links from UniGene to Map Viewer
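
The Fisher's Exact Test mentioned for Digital Differential Display compares how often a gene's ESTs appear in two libraries. The sketch below is a generic two-sided Fisher's exact test on a 2x2 table, written with only the standard library; it is not NCBI's implementation, and the function name `fisher_exact_two_sided` and the example counts are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    For an EST comparison (illustrative layout): a = ESTs for the gene in
    library 1, b = all other ESTs in library 1, c and d likewise for library 2.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


# Made-up counts: 8 of 10 ESTs in one pool vs. 1 of 6 in the other.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.4f}")
```

With real libraries the "other ESTs" cells are in the thousands, so many genes are tested at once and some multiple-testing correction would normally be applied on top of the per-gene p-values.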
