NGS Bioinformatics Workshop
2.5 Meta-Analysis of
Genomic Data
May 30th, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Acknowledgment:
Several slides courtesy of Professor Fiona Brinkman, MBB
Today’s Agenda
A brief overview of the bioinformatics for
SNP detection software
Proteins
Systems biology
Metagenomics (some resources; very brief…)
Group feedback: bioinformatics needs at SFU?
NGS-based SNP Analysis Programs
From: Nielsen et al. 2011. Nature Reviews Genetics 12:443-451
BIOINFORMATICS OF PROTEINS
From DNA to Protein to Systems
ATGGAATTC…
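The DNA-to-protein step can be illustrated with a minimal sketch, assuming the standard genetic code and reading frame 1. The function name `translate` is illustrative only, and only the nine bases shown on the slide are translated.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA coding sequence in frame 1; a trailing partial codon is ignored."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        # "X" marks codons containing characters other than A, C, G, T
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)

print(translate("ATGGAATTC"))  # MEF
```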
Amino Acid Properties – Venn Diagram
Polypeptides
[Figure: polypeptide backbone starting at the N-terminal H3N+ group, with peptide bonds linking four residues carrying side chains R1 to R4]
Ramachandran Plot
Secondary Structure (SS) Prediction
Note major assumptions in all
• Entire information for forming ss is contained in the primary sequence
• Side groups of residues will determine structure
• Pattern recognition
– Looks for patterns in common ss’s like amphipathic alpha-helices (e.g. pattern
of polar and non-polar residues)
• Homology
– Predict ss of the central residue of a given segment from homologous segments
(neighbors)
– Based on alignments of homologous residues from a protein family
– Assumption: homologous proteins = similar structure
– Extension: Use BLOSUM to detect similarity, or, better, use Position Specific
Scoring Matrix (PSSM)
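One way to make the pattern-recognition idea concrete is the hydrophobic moment, which scores how strongly polar and non-polar residues segregate onto opposite faces of a helix. This is a minimal sketch and not the method used by any of the programs in this deck. The Kyte-Doolittle hydropathy scale and the 100 degrees per residue alpha-helix periodicity are assumptions of the sketch, and both test sequences are artificial.

```python
import math

# Kyte-Doolittle hydropathy values (scale chosen for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(segment, degrees_per_residue=100.0):
    """Mean hydrophobic moment of a segment for a given helical periodicity."""
    delta = math.radians(degrees_per_residue)
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(i * delta) for i, aa in enumerate(segment))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(i * delta) for i, aa in enumerate(segment))
    return math.hypot(sin_sum, cos_sum) / len(segment)

# Artificial 18-residue sequences: leucine on one helical face and lysine on the
# other, versus a uniform poly-leucine segment with no polar/non-polar pattern.
amphipathic = "".join(
    "L" if math.cos(math.radians(100 * i)) > 0 else "K" for i in range(18)
)
uniform = "L" * 18

print(amphipathic, round(hydrophobic_moment(amphipathic), 2))
print(uniform, round(hydrophobic_moment(uniform), 2))
```

The patterned segment scores high and the uniform one scores near zero, even though the uniform segment is the more hydrophobic overall. The measure responds to periodicity, not to composition.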
SS Prediction Programs
• PredictProtein-PHD (72%)
– http://www.predictprotein.org/
• PREDATOR (75%)
– http://www-db.embl-heidelberg.de/jss/servlet/de.embl.bk.wwwTools.GroupLeftEMBL/argos/predator/predator_info.html
• PSIpred (77%)
– http://bioinf.cs.ucl.ac.uk/psipred/ (PSSM generated by
PSI-BLAST, better sequence database, won CASP
competition for many years)
• Jpred (81%)
– http://www.compbio.dundee.ac.uk/jpred/
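The percentages quoted for each program are presumably three-state (Q3) accuracies, although the slide does not say so. Under that assumption, the sketch below shows how such a figure is computed from a predicted and an observed state string. Both strings are made up for illustration.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state (H, E or C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Made-up 10-residue example: H = helix, E = strand, C = coil.
observed = "CHHHHHCEEC"
predicted = "CHHHHCCEEE"
print(q3_accuracy(predicted, observed))  # 80.0 (8 of 10 positions agree)
```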
Tertiary Structure
Lactate Dehydrogenase: Mixed α / β
Immunoglobulin Fold: β
Hemoglobin B Chain: α
Tertiary Structure: Protein Folds
Holm, L. and Sander, C. (1996)
Mapping the protein universe.
Science, 273, 595-603.
Protein Folds
• Folds: definition difficult and different criteria
used for different classification systems
– Normally formed around a separate hydrophobic core
• Current protein fold taxonomy
– Very roughly …
– Approx. 1000-2000 different estimated folds,
depending on method of analysis, of which about half
are estimated to be known (500-1000)
– Average domain size approx. 150 aa
(50 – 250 aa approx std dev)
Protein Fold Major Classes
All alpha proteins (all α)
All beta proteins (all β)
Alpha/beta proteins (α/β)
- Parallel strands connected by helices
(βαβ motifs)
Alpha plus beta proteins (α+β)
- More irregular α and β combinations
“Other”
- Often subclassified now
Protein Fold Classification
• Curated/Semi Manual Classification
– SCOP (Structural Classification Of Proteins)
http://scop.mrc-lmb.cam.ac.uk/scop/
– CATH (Class, Architecture, Topology, Homologous
superfamily)
http://www.cathdb.info/
SCOP classification
• Family: clear evolutionary relationship
– Residue identities >= 30%
– OR known similar functions and structures (example:
globins form family though some only 15% identical)
• Superfamily: Probable common evolutionary
origin
– Low sequence identities, but structural and functional
features suggest common evolutionary origin
(example: actin, ATPase domain of heat shock
proteins, and hexokinase form a superfamily).
• Fold: major structural similarity
– Same major ss in same arrangement with the same
topological connections
– May occur by convergent evolution
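The 30% identity criterion for a family can be checked with a minimal sketch like the one below. It assumes the two sequences are already aligned, with "-" marking gaps. The function names and the example alignment are hypothetical. As the globin example shows, identity alone does not decide family membership, so a result below the threshold is not conclusive.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over the non-gap columns of two pre-aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        identical += (a == b)
    return 100.0 * identical / compared if compared else 0.0

def meets_family_identity_rule(aligned_a, aligned_b, threshold=30.0):
    """True if the pair meets the >= 30% residue identity criterion."""
    return percent_identity(aligned_a, aligned_b) >= threshold

# Hypothetical alignment: 5 identical residues over 8 compared columns.
print(percent_identity("MKT-AYIAK", "MKSQAYLAR"))  # 62.5
print(meets_family_identity_rule("MKT-AYIAK", "MKSQAYLAR"))  # True
```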
SCOP example
CATH example
Protein Fold Classification
• Automated Classification
– DALI
http://ekhidna.biocenter.helsinki.fi/dali
– VAST (Vector Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
DALI/FSSP – Automated classification
Exhaustive all-against-all 3D structure comparison of
protein structures currently in the PDB
Domain Classification # (DC_l_m_n_p)
l: fold space attractor region
m: globular folding topology/fold type (clusters of structural neighbours in fold
space with average pairwise Z-scores, by Dali, above 2)
n: functional family (PSI-BLAST, clusters of identically conserved functional
residues, E.C. numbers, Swiss-Prot keywords)
p: sequence family (>25% identities)
VAST – Automated classification
http://www.ncbi.nlm.nih.gov/Structure/VAST/vasthelp.html
All against all BLAST comparison of NCBI’s MMDB (database of
known protein structure at NCBI, derived from the PDB)
Clustered into groups by a neighbor joining procedure, using
BLAST p-value cutoffs of C or less (where C=10e-7, 10e-40 or
10e-80, to reflect three different levels of redundancy). A fourth
level of classification is based on sequence identity
Motif and Domain Searching
• InterPro – an integration of tools (PROSITE,
PFAM, PRINTS, PRODOM)
– http://www.ebi.ac.uk/interpro/
• Expasy Tools has more…
– PATTINPROT, to search for patterns in proteins yourself, etc…
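Searching for a pattern yourself can be sketched in a few lines by converting a PROSITE-style pattern into a regular expression. The converter below is a simplified illustration and not the code behind any of the tools listed. It handles the common syntax elements only (x, [..], {..}, repeat counts and the < > anchors). The example pattern is the N-glycosylation site motif N-{P}-[ST]-{P}, and the test sequence is made up.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (e.g. N-{P}-[ST]-{P}) into a Python regex string."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")  # "<" anchors to the N-terminus
        anchor_end = element.endswith(">")      # ">" anchors to the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except those listed
        # [ABC] and single letters are already valid regex fragments
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)  # N[^P][ST][^P]
hit = re.search(regex, "AANGSAA")  # made-up sequence
print(hit.start(), hit.group())    # 2 NGSA
```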
But first… Check if the analysis you want to do has
already been done!
e.g. www.ebi.ac.uk/proteome/
db.psort.org
Phylofacts
http://phylogenomics.berkeley.edu/phylofacts/
PhyloFacts includes hidden Markov models for classification of user-submitted protein sequences to protein families across the Tree of Life.
Subcellular Localization Prediction – Example of the
benefit of integrating results with a Bayesian approach
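The idea of integrating several single-feature predictors can be sketched with a naive Bayes combination. This is an illustration of the principle only, not the method of any specific localization program. The predictor names, localization classes and all probabilities below are invented, and the predictors are assumed to be independent given the true localization.

```python
def combine_predictions(priors, likelihoods, observations):
    """Posterior over localizations given independent predictor outputs (naive Bayes)."""
    posterior = dict(priors)
    for predictor, output in observations.items():
        for site in posterior:
            posterior[site] *= likelihoods[predictor][site][output]
    total = sum(posterior.values())
    return {site: value / total for site, value in posterior.items()}

# All numbers below are invented for illustration only.
priors = {"cytoplasm": 0.5, "membrane": 0.3, "secreted": 0.2}
likelihoods = {
    # P(predictor output | true localization)
    "signal_peptide": {
        "cytoplasm": {"yes": 0.05, "no": 0.95},
        "membrane": {"yes": 0.40, "no": 0.60},
        "secreted": {"yes": 0.90, "no": 0.10},
    },
    "tm_helix": {
        "cytoplasm": {"yes": 0.05, "no": 0.95},
        "membrane": {"yes": 0.90, "no": 0.10},
        "secreted": {"yes": 0.10, "no": 0.90},
    },
}

posterior = combine_predictions(
    priors, likelihoods, {"signal_peptide": "yes", "tm_helix": "no"}
)
for site, probability in sorted(posterior.items(), key=lambda item: -item[1]):
    print(site, round(probability, 3))
```

Neither feature alone settles the call, but a signal peptide together with the absence of a transmembrane helix pushes most of the probability onto one class.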
Localization Prediction - methods
• Several programs analyze single features:
– TargetP
• Initially one program analyzed multiple features:
– PSORT I (eukaryotes and prokaryotes)
– Developed in 1990
PSORT I prediction method: Rule based
Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991)
Compositional Analysis
Molecular Weight
Amino Acid Frequency
Isoelectric Point
UV Absorptivity
Solubility, Size, Shape
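Two of these compositional properties, amino acid frequency and molecular weight, can be computed directly from the sequence. This is a minimal sketch. The average residue masses are standard textbook values entered by hand, so check them against a reference before relying on the result, and the example peptide is made up.

```python
from collections import Counter

# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153

def amino_acid_frequency(protein):
    """Fraction of the sequence made up by each amino acid present."""
    counts = Counter(protein)
    return {aa: count / len(protein) for aa, count in counts.items()}

def molecular_weight(protein):
    """Approximate average molecular weight: residue masses plus one water for the termini."""
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER_MASS

peptide = "MEFAAG"  # made-up peptide
print(amino_acid_frequency(peptide))
print(round(molecular_weight(peptide), 2))
```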
SYSTEMS BIOLOGY
Systems Biology
What is systems biology?
① Considers all (or many) of the proteins and genes in
the system
② Links proteins and genes using interactions and
functions
③ Uses computational models to study system
④ Provides insights into mechanisms, system
dynamics, global properties
Molecular Interaction (MI) Network
• Nodes = Gene / Protein
• Edge = Interaction
• Possible interactions:
– phosphorylation
– physical binding
– transcriptional regulation
– others?
Cytoscape
Cytoscape supports many use cases
in molecular and systems biology,
genomics, and proteomics:
• Load molecular and genetic
interaction data sets in many
formats
• Project and integrate global
datasets and functional
annotations
• Establish powerful visual
mappings across these data
• Perform advanced analysis and
modeling using Cytoscape
plugins
• Visualize and analyze human-curated
pathway datasets such
as Reactome or KEGG.
http://www.cytoscape.org/
Cytoscape
Control tabs: Network,
VizMapper, plugin tabs
Search for nodes
Visible networks
Network navigation
Change visible attributes
Attributes for highlighted
nodes / edges
Cytoscape – Loading Data
Data Files:
1. Network (Simple Interaction Format)
2. Node attributes (tab-delimited)
3. Gene expression (tab-delimited)
Cytoscape – Loading Data
1. Network (Simple Interaction Format)
• Format:
gene1 interaction_type gene2
• E.g.:
C1QB pp C1R
C1R pp C2
C2 pp C4
…
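Outside Cytoscape, a Simple Interaction Format file is easy to read into a node/edge structure. This is a minimal sketch with a hypothetical `parse_sif` helper, using the three interactions from the example above. Treating "pp" edges as undirected is an assumption of the sketch.

```python
from collections import defaultdict

def parse_sif(lines):
    """Parse 'source interaction_type target' lines into an edge list and adjacency map."""
    neighbors = defaultdict(set)
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and isolated nodes
        source, interaction = fields[0], fields[1]
        for target in fields[2:]:  # SIF allows several targets on one line
            edges.append((source, interaction, target))
            neighbors[source].add(target)
            neighbors[target].add(source)  # assumption: interactions are undirected
    return edges, neighbors

sif_lines = ["C1QB pp C1R", "C1R pp C2", "C2 pp C4"]
edges, neighbors = parse_sif(sif_lines)
print(edges)
print(sorted(neighbors["C1R"]))  # ['C1QB', 'C2']
```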
Cytoscape – Loading Data
2. Gene Attribute (tab-delimited table)
• Maps data values to nodes
Load File
Check off “Show
Text File Import
Options”
Check off “Transfer
first line as attribute
names..”
Preview
Cytoscape – Loading Data
3. Gene expression (tab-delimited table)
• Format:
gene1 exp_cond1 exp_cond2 … sig_cond1 sig_cond2 …
• Expression value: fold-change or intensity from
microarray
• Significance value: P-value for the difference in
expression between conditions (smaller values mean
stronger evidence of differential expression).
Cytoscape – Network Style
In “Vizmapper”
tab…
Double-click “Node
color”
Select expression
fold-change values
(CMexp)
Select “Continuous
Mapping” as
mapping type
Can change color by
double-clicking on
arrows
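A continuous mapping amounts to linear interpolation between colours set at chosen data values. The sketch below shows the idea only and is not Cytoscape code. The blue-white-red gradient and the -2 to +2 fold-change range are arbitrary choices for the example.

```python
def interpolate_color(value, low=-2.0, mid=0.0, high=2.0,
                      low_rgb=(0, 0, 255), mid_rgb=(255, 255, 255),
                      high_rgb=(255, 0, 0)):
    """Map a fold-change value onto a blue-white-red gradient by linear interpolation."""
    value = max(low, min(high, value))  # clamp values outside the mapped range
    if value <= mid:
        start, end, t = low_rgb, mid_rgb, (value - low) / (mid - low)
    else:
        start, end, t = mid_rgb, high_rgb, (value - mid) / (high - mid)
    rgb = tuple(round(s + t * (e - s)) for s, e in zip(start, end))
    return "#{:02x}{:02x}{:02x}".format(*rgb)

for fold_change in (-2.0, 0.0, 2.0):
    print(fold_change, interpolate_color(fold_change))
```

The three points printed are the handles of the gradient: strongly down-regulated nodes come out blue, unchanged nodes white, and strongly up-regulated nodes red.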
Systems Biology Analyses
1. Differentially-expressed subnetworks
• jActiveModules
2. Functional enrichment
• BiNGO
Differentially-Expressed Subnetworks
• Search for sub-networks that contain a significant
number of differentially-expressed genes (nodes)
• All genes in sub-network interact…
• So these highly differentially-expressed sub-networks
may represent a critical pathway or complex involved in
a condition of interest
Differentially-Expressed Subnetworks
jActive algorithm:
• Searches for sub-networks that contain a significant
number of differentially-expressed genes (or nodes)
• Heuristic – won’t always find the optimum result
• Z-score indicates how unlikely it is to find, by chance,
a subnetwork with a similar number of DE genes.
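The scoring idea can be sketched as follows: convert each gene's p-value to a z-score, then combine the z-scores of the genes in a candidate subnetwork. This is a simplified illustration and not the plugin's code. It assumes the aggregate is the sum of z-scores divided by the square root of the subnetwork size, and it leaves out calibration against random gene sets and the heuristic search. The p-values are made up.

```python
import math
from statistics import NormalDist

def aggregate_z_score(p_values):
    """Combine per-gene p-values into one subnetwork score: sum of z-scores / sqrt(k).

    Each p-value must lie strictly between 0 and 1.
    """
    standard_normal = NormalDist()
    z_scores = [standard_normal.inv_cdf(1.0 - p) for p in p_values]
    return sum(z_scores) / math.sqrt(len(z_scores))

# Made-up p-values for two candidate subnetworks of four genes each.
mostly_significant = [0.001, 0.004, 0.02, 0.3]
mostly_unchanged = [0.4, 0.6, 0.5, 0.7]
print(round(aggregate_z_score(mostly_significant), 2))
print(round(aggregate_z_score(mostly_unchanged), 2))
```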
jActive - Inputs
Select expression
significance
(p-values)
Search from
highlighted nodes
jActive - Results
Subnetworks listed
here
Highlight result and
click “Create
Network”
Functional Enrichment
Functional Enrichment:
• Also called over-representation analysis
• Searches for common or related functions in a gene set
• Is there a common annotation (e.g. pathway, GO term)
for a set of genes that is more frequent than you would
expect by chance?
Gene Ontology
• Controlled vocabulary describing functions, processes and cell
components
• Consistency between organisms and gene products
• GO terms linked by relationships (is-a, part-of) and have
hierarchy (parent – child)
[Figure: example GO hierarchy linking the terms organelle, mitochondrion, other organelles, protein complex, fatty acid beta-oxidation multienzyme complex and other protein complexes by is-a and part-of relationships]
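The parent-child hierarchy means a gene annotated with a specific term is implicitly annotated with all of that term's ancestors. This is a minimal sketch of the upward traversal. The links below are an assumed reading of the example terms in the figure, not data taken from the GO database.

```python
# Assumed parent links for the example terms (child -> list of (relation, parent)).
PARENTS = {
    "mitochondrion": [("is-a", "organelle")],
    "fatty acid beta-oxidation multienzyme complex": [
        ("is-a", "protein complex"),
        ("part-of", "mitochondrion"),
    ],
}

def ancestors(term, parents=PARENTS):
    """All terms reachable by following is-a / part-of links upward from term."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(ancestors("fatty acid beta-oxidation multienzyme complex")))
# ['mitochondrion', 'organelle', 'protein complex']
```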
Functional Enrichment
BiNGO:
• Looks for GO terms that are over-represented in a set of
genes.
• Displays the results in two ways
– A table with p-values
– A graph showing relationships between terms
• Uses the hypergeometric test to statistically test for over-representation of each GO term.
• Performs multiple hypothesis correction (since we are testing
multiple GO terms for over-representation).
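Both statistical steps fit in a few lines of standard-library Python. This is a minimal sketch and not BiNGO's own code. Benjamini-Hochberg is used here as one common multiple-testing correction, since the slide does not say which one is applied, and the gene counts in the example are invented.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): chance of at least k annotated genes in a set of n,
    when K of the N background genes carry the annotation."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Invented example: 8 of 40 selected genes carry a GO term that annotates
# 100 of 5000 background genes.
p = hypergeom_upper_tail(k=8, n=40, K=100, N=5000)
print(p)
print(benjamini_hochberg([p, 0.04, 0.5]))
```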
BiNGO - Inputs
Fill in Name
Lower significance level
Select “Custom” and then
load go.annot file
Click Start BiNGO
BiNGO - Results
General GO Terms
Significance
Specific GO Terms
EGAN: Exploratory Gene Association Networks
http://akt.ucsf.edu/EGAN/
METAGENOMICS
What is Metagenomics?
• The culture-independent isolation and characterization
of DNA from uncultured microorganism communities
• Nice reading list on the topic:
http://www.cbcb.umd.edu/confcour/CMSC828Gmaterials/reading-list.html
• See also: Torsten Thomas, Jack Gilbert and Folker Meyer. 2012.
Metagenomics - a guide from sampling to data analysis.
Microb. Inform. Exp. doi:10.1186/2042-5783-2-3
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3351745/
• I will just mention a few relevant bioinformatics tools
here (no specific endorsements implied).
MG-RAST server
http://metagenomics.nmpdr.org/
Meyer, F. et al. 2008. The metagenomics RAST server –
a public resource for the automatic phylogenetic and
functional analysis of metagenomes. BMC
Bioinformatics. 9:386 doi:10.1186/1471-2105-9-386
MEGAN - MEtaGenome ANalyzer
http://ab.inf.uni-tuebingen.de/software/megan/
Huson DH et al. 2007. MEGAN analysis of metagenomic data. Genome Res. 17: 377-386