Big data: approaches to biologist

Less is more
Approaches to biologist-driven analysis and next-generation sequencing data
Paul Gordon
Genome Canada Bioinformatics Platform
University of Calgary
What am I doing here?
Genome Canada Bioinformatics Platform
• Next Generation Sequencing
• Next Generation Web
• Future challenges
Better tech: less DNA, more sequence
44μm
70nm
PhytoMetaSyn
Sprockets: Hierarchical Gene Models from ESTs
Developed in collaboration with BASF Plant Sciences
Genozymes
Hydrocarbon Metagenomics
CAVEman
• Java 3D-based, world-first complete 3D human body atlas (adult male)
– 2,335 organs, hierarchical organization following Terminologia Anatomica
• Numerous applications involving mapping of genetic and disease data
• More information: http://cave.ucalgary.ca/caveman
Pharmacokinetics visualization
(Absorption-distribution-metabolism-excretion of Aspirin)
Patient MRI stack mapped onto atlas and registered by landmarks
Exploring gene expression patterns
Basic Research
• ING-protein interactions (cancer and ageing-related proteins)
• Archaeal UV-light response
• Large-scale human genome organization
Research Applications
• Desulf.: mechanisms of oil pipeline corrosion and its prevention
• Kidney transplants: improved rejection diagnostics in Edmonton
• Mad cow disease/chronic wasting disease: live diagnostics
DNA Diagnostics Discovery for Mad Cow
Preinoculation
Preclinical
Control
animal #6
Ball toy
Photo: S. Czub, CFIA Lethbridge
Controls
Clinical
Motif finding (elk dataset)
61 blood samples
Next-gen
107 million base pairs
432 billion pairwise alignments (657431²)
Decypher hardware accelerator
1082019 25mers or smaller
Decypher hardware accelerator
Uninfected
152317
Infected
132417
Thousands of animal coverage/timepoint combos (CPU intensive)
Infected
3 universal
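A quick arithmetic check of the alignment count above. This is only a sketch: it assumes the parenthesised number on the slide is 657431 squared, i.e. an all-against-all comparison of 657431 sequences. The variable names are illustrative and not from the talk.

```python
# Assumption: "432 billion pairwise alignments" means every one of
# 657431 sequences compared against every one (ordered pairs, self included).
n_sequences = 657431

# All ordered pairs, including each sequence against itself.
ordered_pairs = n_sequences ** 2

# For comparison: unordered pairs without self-comparisons.
unordered_pairs = n_sequences * (n_sequences - 1) // 2

print(f"ordered pairs:   {ordered_pairs:,}")
print(f"unordered pairs: {unordered_pairs:,}")
```

Under this reading the ordered-pair count comes to about 432 billion, which matches the slide and suggests why a hardware accelerator was used for this step.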
Motif Results
Possible mode of action?
PrPsc(+?)
Infectious agent
Activation
Feedback
Retrovirus
PrP
Integration
Carp et al., EMBO J., 2006
Leblanc et al., EMBO J. 2006
Stengel et al., Biochem. Biophys. Res. Commun. 2006
Lee et al., Biochem. Biophys. Res. Commun. 2006
Etc.
Endogenous Retrovirus?
↑ EVI1
Consistent with protein-only evidence…
Neurovirulent? (e.g. M.L. Labat 1999)
↑PLZF
↓PLZF-controlled genes
Vacuole
CNA Export
Circulating Nucleic Acids
Cell death
Nucleoprotein complexes
PrP Amyloid fibres
Manuelidis et al, PNAS 2007
Protected promoters
(Motifs A & B)
Virus particles? ~25nm
Better tech: less DNA, more sequence
Better tech: less input, more results
Generate
Manuscript
Now
Where are we at?
Life Sciences
Emerging
Technologies
Web
Bioinformatics
Semantic Web
Source: Gartner Inc.
How software works…
Parameters/Input (Gene name, DNA sequence, QTL…)
Functions/Rules
Results/Output (article, allele,…)
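The input → function/rule → output picture can be sketched in a few lines of Python. The lookup table, gene names and function names here are hypothetical placeholders for illustration, not data or software from the talk.

```python
from typing import Callable, Dict, List

# Hypothetical toy data: which articles mention which gene.
ARTICLES_BY_GENE: Dict[str, List[str]] = {
    "GENE_A": ["article-1", "article-2"],
    "GENE_B": ["article-3"],
}

def find_articles(gene_name: str) -> List[str]:
    """Function/rule: turn an input (gene name) into an output (articles)."""
    return ARTICLES_BY_GENE.get(gene_name, [])

def run(rule: Callable[[str], List[str]], parameter: str) -> List[str]:
    """Apply a rule to a parameter and return its result."""
    return rule(parameter)

print(run(find_articles, "GENE_A"))
```

The point of the sketch is only that the same three parts (parameter, rule, result) are present whether the rule is a database lookup, an alignment tool, or a Web form.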
The problem with the Web
1998
Now
Once you label me, you negate me.
Søren Kierkegaard
Bluejay
http://bluejay.ucalgary.ca
Comparative genomics
Gene expression integration
BioMoby linking
Waypoints
The task at hand
(biologist)
ACCGT…
Sequencer Data File (Binary)
Known Proteins
(computer scientist)
BLAST
Report (related proteins)
DNASequence
NCBI_gi
Sequence_Alignment
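One way to picture the typed hand-off from sequence to BLAST report is a small registry of services keyed by input and output data type. This is a minimal sketch, not the BioMoby API. The service names and the pairing of types to services are assumptions made for illustration; only the type names DNASequence, Sequence_Alignment and NCBI_gi come from the slide.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Service:
    name: str          # hypothetical service name
    input_type: str    # data type the service consumes
    output_type: str   # data type the service produces

# Hypothetical registry; the type-to-service pairing is assumed.
REGISTRY = [
    Service("run_blast", "DNASequence", "Sequence_Alignment"),
    Service("extract_hit_ids", "Sequence_Alignment", "NCBI_gi"),
]

def services_accepting(data_type: str, registry=REGISTRY) -> List[Service]:
    """Services that could be offered to a user holding this data type."""
    return [s for s in registry if s.input_type == data_type]

def find_chain(start_type: str, goal_type: str, registry=REGISTRY) -> Optional[List[str]]:
    """Breadth-first search for a shortest chain of services from start to goal."""
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current, path = queue.popleft()
        if current == goal_type:
            return [s.name for s in path]
        for service in services_accepting(current, registry):
            if service.output_type not in seen:
                seen.add(service.output_type)
                queue.append((service.output_type, path + [service]))
    return None  # no chain of registered services connects the two types

print(find_chain("DNASequence", "NCBI_gi"))
```

With data types labelled this way, the biologist only needs to know what they are holding; the list of applicable next steps can be looked up rather than remembered.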
Audience
Amoeba
God
Self-perception of computer skills
The need for shoehorns
• The current vision of the Semantic Web intends to create a new structure starting up with no reference to its vast, functioning, but more primitive predecessor … things just don’t happen like that
All the Web as Workflows
Seahawk
prompting
Proxied
Web page
Drag ‘n’ drop
Seahawk
What’s Ahead?
The more a man learns, the more he realizes how little he knows
Semantic Web
http://www.uniprot.org/tissues/229
http://purl.uniprot.org/po/0009009
Take home messages
As tech improves, we can ask better questions
We will need shoehorns to access existing resources for the foreseeable future