Biochemical pathways

What is Systems Biology?
Systems biology is an academic field that seeks to integrate
different levels of information to understand how biological
systems function. By studying the relationships and
interactions between various parts of a biological system (e.g.,
gene and protein networks involved in cell signaling, metabolic
pathways, organelles, cells, physiological systems, organisms
etc.), it is hoped that eventually an understandable model of the
whole system can be developed. (from Wikipedia)
 Use high-throughput methods to quantify changes in RNA and
protein in response to perturbation of cell
 Build regulatory networks linking genes, RNAs, and proteins
 Develop mathematical models to represent the system
 Predict how different perturbations will affect the system
 Test predictions for validity
 Refine models and repeat
Why use Systems Biology?
 Whole organism view
• Identify total physiological capacity
• More complete understanding
• New drugs/vaccine candidates
 Produce resource of data and
materials
• More efficient
• Collaborative
Systems Biology approaches
 Genome (DNA sequencing)
 Transcriptome (RNA microarrays)
 Proteome (Mass spectrometry)
 Metabolome (Mass spectrometry)
 Phenome (Cell biology)
 ‘ome (anything else)
DNA microarrays
 Different uses
• Comparative genomic hybridization (CGH)
• Resequencing/SNP analysis
• Expression profiling
• Chromatin immunoprecipitation
 Data analysis
• Normalization
• T-testing
• Analysis of variance (ANOVA)
• Clustering
What are DNA microarrays
 Spots of DNA arranged on a solid
support (usually glass or silicon)
 Different sources of DNA
• Cloned DNA (genomic or cDNA) – spotted on glass
• PCR products – spotted on glass
• Oligonucleotides (25- to 70-mers)
 Spotted or printed onto glass
 Synthesized directly on silicon
 Different densities
• Spotted or printed – 5,000-30,000/slide
• Synthesized oligos – 200,000-500,000/slide
How do microarrays work?
 Label mRNA or gDNA with fluorescent
probe
 Hybridize to microarray and wash off
excess probe
 Read in a fluorescent scanner
 Quantify signal for each spot
 Signal ~ hybridization ~ abundance of
sequence in probe
One-color (Affymetrix or Nimblegen)
Two color (Spotted or printed)
A typical two color microarray
Plot red vs green intensity
Leishmania procyclics vs metacyclics
• Equal green/red signal = yellow: not differentially expressed
• Greater green signal: procyclic-specific
• Greater red signal: metacyclic-specific
Problems with microarrays
 Cross-hybridization between probes
• false positives (wrong gene)
• false negatives (hides differential expression)
• oligos are better
 Poor experimental design
• Not enough replicates
 Inappropriate analysis
• Normalization of signal within and between arrays
• Need robust statistical analysis
Within Array Normalization
Lowess Normalization
MA - Plots
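The M and A values behind an MA plot can be sketched as follows. This is a minimal illustration, not the slide author's pipeline: the function names are hypothetical, and the global median centering shown here is a simpler stand-in for lowess, which would instead fit a smooth curve of M against A and subtract it.

```python
import math
import statistics

def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities.

    M = log2(R) - log2(G) is the log ratio; A is the mean log intensity.
    Intensities must be positive.
    """
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

def median_center(m):
    """Shift M values so their median is zero (assumed stand-in for lowess)."""
    med = statistics.median(m)
    return [x - med for x in m]

# Hypothetical intensities for four spots on one two-color array.
red = [800.0, 1600.0, 400.0, 3200.0]
green = [400.0, 800.0, 400.0, 800.0]
m, a = ma_values(red, green)
print(median_center(m))
```

Plotting M on the y-axis against A on the x-axis gives the MA plot; after within-array normalization the cloud of points should be centered on M = 0 across the range of A.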
Between Array Normalization
• Invariant gene(s)
• RNA Spike In
• Median Scaling
• Quantile Scaling
Median and quantile normalization both assume that
the arrays in question have the same intensity
distribution. Use these methods only if you can safely
assume that the bulk of genes have the same expression
across the arrays.
Quantile Normalization
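A minimal sketch of median scaling and quantile normalization, assuming each array is a plain list of intensities for the same genes in the same order. The function names are hypothetical, and ties are broken by position rather than averaged, which is a simplification compared with production tools such as RMA.

```python
import statistics

def median_scale(arrays):
    """Rescale each array so that all arrays share the same median."""
    medians = [statistics.median(a) for a in arrays]
    target = statistics.mean(medians)
    return [[x * target / med for x in a] for a, med in zip(arrays, medians)]

def quantile_normalize(arrays):
    """Force every array to have the same distribution of values.

    The k-th smallest value in each array is replaced by the mean of the
    k-th smallest values across all arrays.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices from low to high
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical arrays measuring the same three genes.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

Both functions change every value on the array, which is why they are only appropriate when most genes are expected to be unchanged between samples.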
Robust Multichip Average (RMA)
http://rmaexpress.bmbolstad.com
Finding Significant Genes
 Assume we will compare two conditions
with multiple replicates for each class
 Our goal is to find genes that are
significantly different between these
classes
 These are the genes that we will use for
later data mining
Fold-change Difference
2-fold ?
 suffers from being arbitrary and not taking
into account systematic variation in the data
T-testing
$$t = \frac{\text{signal}}{\text{noise}} = \frac{\text{difference between means}}{\text{variability of groups}} = \frac{\langle X_q \rangle - \langle X_c \rangle}{SE(X_q - X_c)}$$
 Tests whether the means of the query and
reference groups are the same
 Essentially measures signal-to-noise
 Calculate p-value (permutations or distributions)
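The signal-to-noise idea and the permutation route to a p-value can be sketched for a single gene as below. This is an illustration under stated assumptions: the standard error uses the unequal-variance (Welch) form, the function names are hypothetical, and the replicate values are made up.

```python
import random
import statistics

def t_statistic(query, control):
    """Difference between group means divided by its standard error."""
    diff = statistics.mean(query) - statistics.mean(control)
    se = (statistics.variance(query) / len(query)
          + statistics.variance(control) / len(control)) ** 0.5
    if se == 0:
        # Degenerate case: no within-group variability at all.
        return 0.0 if diff == 0 else float("inf")
    return diff / se

def permutation_p_value(query, control, n_perm=2000, seed=0):
    """Two-sided p-value from random reassignment of sample labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(query, control))
    pooled = list(query) + list(control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = t_statistic(pooled[:len(query)], pooled[len(query):])
        if abs(t) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical log2 signals for one gene, four replicates per class.
query = [10.1, 10.4, 9.8, 10.6]
control = [8.0, 8.3, 7.9, 8.4]
print(t_statistic(query, control), permutation_p_value(query, control))
```

With only four replicates per class there are just 70 distinct ways to split the samples, so the permutation p-value cannot fall much below about 0.03; this is one concrete reason replicate number matters.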
Improvement over fold-change
[Figure: two example comparisons. Top: a significant difference. Bottom: probably not.]
But: If you use pooled RNAs, you can’t tell the
difference between the top and bottom cases.
Analysis of variance (ANOVA)
???
 Which genes are most significant for separating
classes of samples?
 Calculate p-value (permutations or distributions)
 Reduces to a t-test for 2 samples
ANOVA
 Assign experiments to >2 groups
 Calculate F-ratio for each gene
• F = mean square (groups)/mean square (error)
• Between group variability/within group variability
• The larger the value of F, the greater the difference
between group means relative to the sampling error
variability
 Calculate p-value associated with F-ratio
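The F-ratio for one gene can be sketched as a one-way ANOVA in a few lines. The function name and the example values are hypothetical; the p-value step (F distribution or permutation) is left out.

```python
def f_ratio(groups):
    """One-way ANOVA F = mean square (groups) / mean square (error)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group variability: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variability: samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical expression values for one gene in three groups.
print(f_ratio([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

For exactly two groups, F equals the square of the pooled-variance t statistic, which is the sense in which ANOVA reduces to a t-test.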
Probability value determination
 Calculated from:
• Theoretical t-distribution
• Permutation
 Correction for multiple testing
• Family Wise Error Rate (FWER)
• Bonferroni – too stringent
• Adjusted Bonferroni
• Benjamini and Hochberg False Discovery Rate (FDR)
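Two of the corrections listed above can be sketched directly on a list of per-gene p-values. This is a minimal illustration with hypothetical function names: Bonferroni multiplies each p-value by the number of tests, while the Benjamini-Hochberg step-up procedure scales each p-value by (number of tests / rank) and enforces monotonicity.

```python
def bonferroni(p_values):
    """Family-wise error control: multiply each p-value by the test count."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """FDR-adjusted p-values using the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

On an array with tens of thousands of genes the Bonferroni factor becomes very large, which is why it is usually considered too stringent for expression data.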
Finding patterns of expression
 Individual genes don’t tell the whole
story
 Identify groups of genes with similar
differential expression patterns
 Cluster analysis
 Statistical reliability is still an issue
Clustering algorithms
 Inputs
• Raw data matrix or similarity matrix
• Number of clusters or some other parameters
 Classification of clustering algorithms
• Hierarchical vs. partitional
• Heuristic-based vs. model-based
• Soft vs. hard
Hierarchical clustering
• Cluster genes with most similar expression patterns
• Cluster samples with most similar gene expression
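A minimal sketch of agglomerative (bottom-up) hierarchical clustering of genes, assuming average linkage and a distance of 1 minus the Pearson correlation. Both choices, the function names, and the toy profiles are assumptions for illustration; real tools offer several linkage and distance options and return a full tree rather than a flat list.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def hierarchical_cluster(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    profiles: dict mapping gene name -> list of expression values.
    """
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        # Average linkage: mean pairwise distance between members.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical profiles across four samples.
profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8.5],
    "geneC": [4, 3, 2, 1],
    "geneD": [8, 6, 4, 1],
}
print(hierarchical_cluster(profiles, 2))
```

Clustering samples instead of genes uses the same procedure on the transposed data, with one profile per sample.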
Example of clustering
Microarray analysis software
 SAM (Significance Analysis of Microarrays)
http://www-stat.stanford.edu/~tibs/SAM/
 TM4 (MIDAS, MADAM, MEV, Spotfinder)
http://www.tigr.org/software/microarray.shtml
 Bioconductor
http://www.bioconductor.org/
 GeneSpring GX
http://www.chem.agilent.com/scripts/pds.asp?lpage=27881
 Rosetta Resolver
http://www.rosettabio.com/products/resolver/default.htm
 Many others
http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
Proteomics
 2-D gel electrophoresis
• Isoelectric focusing
 ISO-DALT
 NEPHGE
 IPG-DALT
• SDS-PAGE
• Computer-aided image analysis
 Protein Identification
• Edman degradation
 expensive
 slow
• Mass spectrometry (MALDI-TOF-MS, LC/ESI-MS)
 Sensitive
 High-throughput
Mass spectrometry in proteomics
 Molecular Weight determination
 Protein identification
 Relative quantitation
 Post-translational modifications
 Biomolecular interactions
What is Mass Spectrometry?
 Proteins are separated or filtered according to their mass-to-charge
(m/z) ratio and detected.
 The resulting mass spectrum is a plot of the (relative) abundance of the
produced ions as a function of the m/z ratio
 Usually carried out after liquid chromatography (LC-MS/MS) or matrix
assisted laser desorption ionization (MALDI-TOF)
 The sequence of the peptide is determined by comparison of acquired
mass spectrum with predicted spectrum from genome / protein
sequence databases, using computer algorithms
Typical proteomics protocol
Cell / Organism
Lysis and Fractionation
Protein purification
• Chromatography
• 1D gel
• 2D gel
Sequence Analysis
using MS/MS
Enzymatic digestion of
the protein(s)
Separation of resulting
peptides by chromatography
• Reverse phase
• IEX - RP
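The enzymatic digestion step is commonly done with trypsin, and the expected peptides can be predicted in silico. The sketch below assumes the usual simplified rule (cleave after K or R unless the next residue is P) and ignores missed cleavages; the function name and the short test sequence are hypothetical.

```python
import re

def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P."""
    # Zero-width split points: preceded by K/R and not followed by P.
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]

# Hypothetical sequence chosen to show the proline exception.
print(tryptic_digest("MKRPAAKSAVFAAAAPRG"))
```

Search engines compare peptides predicted this way from a sequence database against the observed spectra, so the digestion rule used in software has to match the enzyme used at the bench.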
Tandem MS
[Figure: intensity vs. time, showing 1. Full Scan MS, 2. Full Scan MS2, 3. Full Scan MS3; * = ion selected]
Collision Induced Dissociation spectrum
Amino acid sequence of a peptide identified by
MS/MS analysis from the tryptic digest of p46
S-A-V-F-A-A-A-A-P-R
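To show how a sequence relates to the peaks in a CID spectrum, the sketch below computes the singly protonated monoisotopic mass of the peptide above and its theoretical b- and y-ion ladders. It uses standard monoisotopic residue masses, assumes unmodified residues and charge +1 only, and the function names are hypothetical.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def peptide_mh(sequence):
    """Monoisotopic (M+H)+ of an unmodified peptide."""
    residues = sequence.replace("-", "")
    return sum(RESIDUE_MASS[r] for r in residues) + WATER + PROTON

def fragment_ions(sequence):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    residues = sequence.replace("-", "")
    masses = [RESIDUE_MASS[r] for r in residues]
    n = len(masses)
    # b ions contain the N-terminal residues; y ions the C-terminal ones.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions

b, y = fragment_ions("S-A-V-F-A-A-A-A-P-R")
print(round(peptide_mh("S-A-V-F-A-A-A-A-P-R"), 3))
print([round(v, 3) for v in y])
```

Differences between consecutive b ions (or consecutive y ions) equal single residue masses, which is how a sequence can be read off a well-resolved fragment ladder.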
Peptide identification
from CID spectra
s072999_ap_tb07.0367.0369.2.out SEQUEST v.22, Copyright 1993-96
#    Rank  (M+H)+   C*10^4   Reference    Peptide
1.   1     1037.1   3.9923   mHEL61       (F)SSGKVRVCER
2.   2     1037.2   2.9684   6A9.TR       (V)VGGIGTTFER
3.   3     1037.2   2.8651   gi1395223    (A)RFFEAGNVP
4.   4     1037.1   2.7472   18L22.TF     (R)VDDSGKMER
5.   5     1037.1   2.7390   trypEf4.p1p  (S)VDDAYM*IGH
Protein identification
from multiple peptides
[Figure: LC-MS/MS run TbIPEC03_011015161541 (D:\Xcalibur\data\TbIPEC03_011015161541, 10/15/01 04:15:41 PM). Panels, each plotting relative abundance:
• Total ion chromatogram (TIC MS) vs. time (min), RT 19.84 - 59.99, NL 1.47E10
• Scan #1765, RT 36.56, full MS [400.00-2000.00] vs. m/z, NL 4.09E8
• Scan #1766, RT 36.59, full MS2 of 551.10@35.00 [140.00-1115.00] vs. m/z, NL 1.16E7
• Scan #1768, RT 36.62, full MS2 of 825.74@35.00 [215.00-1665.00] vs. m/z, NL 1.99E7]
>gb-AAK64278.1 Trypanosoma brucei RNA-editing complex protein MP81 gene, complete cds; nuclear gene for mitochondrial product
MRRLTRRSGR LSGKGNGGSC LQMSPTHVGA VVTWALNRLM PLHTRTIPLR CSLPTPESGT TEPRELCFYE TFELTEEDVH
YLLLHEAHVK HGVLLNVPPQ LAPNGTPPEV PEVIMPAAQL ERMGGMKLAY EPTHLPPPLH TTGARQLVLD ESFYTTPTKE
KKATTTAVSH VSESTAASGG RGGASATAAG TALPPRLPPD PTMKFHCSAC GKAFRLKFSA DHHVKLNHGS DPKAAVVDGP
GEGELLGGAV TITTAKVAKH SSSAASGTAS RAGDSATLDV KQQPDPQKEL SAPGISAVKI PYSKAVLSLP DDELVDELLI
DVWDAVAAQR DDVPKSNSAN IFLPFASVVT GTADRRKEME AVARPTARAT PEGAAPGIKR PGAMAGGAVA VGKGRSGGQI
LPIRELIKKY PNPFGDSPNA AVQDLENEPL NPFLPEEELA AQLQVACEED TVVTPSACTT DVSTGSVIGK KGSLEKLKEK
LRGTRPSMAA SAAKRRFTCP ICVEKQQTLQ QQQSENVGSG FCTDIPSFRL LDALLDHVES VHGEELTEDQ LRELYAKQRQ
STLYPQKSST GDGAGSRETP DDSEKKEGSV GNTNMDELKS LPEEVRRVVP PAPVEQDALA VHIRAGSNAL MIGRIADVQH
GFLGAMTVTQ YVLEVDGDER INSKGVTTPA SACTPDPAST KAVEAKGEEG EVVEPEKEFI VIRCMGDNFP ASLLKDQVKL
GSRVLVQGTL RMNRHVDDVS KRLHAYPFIQ VVPPLGYVKV VG
Mass (average): 81294.2 Identifier: gi|14495336 Database: D:/Xcalibur/database//t_bruceiprot.fasta
Protein Coverage: 223/762 = 29.3% by amino acid count, 23252.5/81294.2 = 28.6% by mass
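The coverage figures above are simple ratios, and the count of covered residues can be derived by marking where each identified peptide falls in the protein sequence. The sketch below is an illustration with hypothetical function names and a made-up short sequence; it assumes exact substring matches and counts overlapping peptides only once.

```python
def covered_residues(protein, peptides):
    """Count residues of protein that lie inside at least one peptide match."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered)

def coverage_percent(covered, total):
    """Coverage as a percentage, rounded to one decimal place."""
    return round(100.0 * covered / total, 1)

# Figures reported on the slide for MP81.
print(coverage_percent(223, 762))          # by amino acid count
print(coverage_percent(23252.5, 81294.2))  # by mass

# Hypothetical toy example: two peptides on a 14-residue sequence.
print(covered_residues("MKSAVFAAAAPRGK", ["SAVFAAAAPR", "MK"]))
```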
Relative quantitation using ICAT
(Isotope Coded Affinity Tagging)
Gygi et al. Nat Biotech 17:994 (1999)
Multidimensional Protein Identification Technology (MudPIT)





High throughput
100s of proteins
Reiterative
Exclude previous ions
1000s of proteins
Washburn et al. Nat Biotech, 19: 242 (2001)
Software
 Data Acquisition
• Xcalibur (proprietary)
 CID spectrum filtration
• In-house programs
 Peptide identification
• SEQUEST, Mascot, ProbID, COMET
 Compilation
• DTASelect, Contrast, PeptideProphet
 Protein assignment
• SEQUEST, ProteinProphet
• LIMS
• SBeams
Pathways and networks in systems biology
 Genomics
• Linking genes related to cellular processes
 Transcriptomics
• Using microarrays to find coexpression and infer systemic relation
 Proteomics
• Identifying interactions and networks between multiple proteins
 Metabolomics
• Finding and charting the flow of chemical compounds created by biochemical processes
 Phenomics
• Elucidating the effect changes in biochemical pathways may have on cellular biology
Pathways vs. networks
Gene networks
• Clusters of genes (or gene products) with evidence of coexpression
• Connections usually represent degrees of co-expression
• In-depth knowledge of process is not necessary
• Networks are non-predictive
Biochemical pathways
• Series of chained, chemical reactions
• Connections represent describable (and quantifiable) relations
between molecules, proteins, lipids, etc.
• Enzymatic process is elucidated
• Changes via perturbation are predictable downstream
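The gene-network side of this comparison can be sketched as a co-expression graph: genes are nodes, and an edge is drawn when two expression profiles correlate strongly. The threshold of 0.9, the function names, and the toy profiles are assumptions for illustration; note that nothing in the result says which gene acts on which, which is why such networks are described as non-predictive.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every pair with |r| >= threshold."""
    names = list(profiles)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if abs(r) >= threshold:
                edges.append((a, b, round(r, 3)))
    return edges

# Hypothetical expression profiles across four conditions.
profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 1, 3, 2],
}
print(coexpression_edges(profiles))
```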
Pathways vs. networks
                   Gene networks                Biochemical pathways
Curation           Relatively easy: manual      Difficult: mostly manual
Nodes              Genes or gene products       Any general molecule
Edges              Levels of co-expression,     Representation of possible
                   a qualitative relation       mechanisms between
Fidelity           Low – usually very little    High – specific processes
Predictive power   Relatively low               Relatively high
Network software/databases
 Biocyc/Metacyc
 KEGG
 BioCarta
 BioModels
 Cytoscape
 E-cell
BioCyc/Metacyc
 http://biocyc.org/ & http://metacyc.org
 Krieger et al., Nuc. Acids Res. 32:D438 (2004)
 Pathway analysis for >900 organisms
BioCyc/Metacyc
 260 organism-specific databases
• Automated annotation using Pathologic software (Tier 3)
• Some manual curation (Tier 2) (H. sapiens, P. falciparum, 11 bacteria)
• Extensive manual curation (Tier 1) (EcoCyc and Metacyc)
Kyoto Encyclopedia of Genes and Genomes
 http://www.genome.jp/kegg/
 Pathways from 348 organisms
 Links with other databases
BioCarta database
 Corporate-owned, publicly-curated
pathway database
 Series of interactive, “cartoon”
pathway maps
 Predominantly human and mouse
pathways
 Contains 160,000 gene entries and 355
pathways
http://www.biocarta.com
Glycolysis pathway
BioModels database
 Database for published, quantitative
models of biochemical processes
 All models/pathways curated manually,
compliant with MIRIAM
 Models can be output in SBML format for
quantitative modeling
 86 curated models, 40 models pending
curation
http://www.biomodels.net
Glycolysis pathway(s)
Comparison of pathway databases
               MetaCyc/BioCyc         KEGG                     BioCarta                 BioModels
Curation       Manual and automated                            Manual                   Manual
Size           ~621+ pathways         ~289 reference           ~355 pathways            ~126 models
Nomenclature   EC, GO                 EC, KO                   None                     GO
Organism       ~500 species           ~475 species             Primarily human, mouse   Various
Visuals        Species-specific       Reference and specific   Animated,                Non-standardized
Primary        PGDB,                  PGDB, pathway            Human disease            Simulations,
               biology                comparisons
Cytoscape
http://www.cytoscape.org/index.php
Cytoscape
 Cytoscape is a bioinformatics software
platform for visualizing molecular
interaction networks and integrating
these interactions with gene expression
profiles and other state data.
 Plugins are available for network and
molecular profiling analyses, new
layouts, additional file format support
and connection with databases.
Cytoscape
 Input
• Molecular interaction networks such as protein-protein (yeast 2-hybrid and TAP-tag)
and/or protein-DNA interaction pairs (e.g. BIND and TRANSFAC databases)
• mRNA expression profiles
• Gene functional annotations from the Gene Ontology (GO) and KEGG databases.
 Visualization
• Customize network data display using powerful visual styles.
• View a superposition of gene expression ratios and p-values on the
network. Expression data can be mapped to node color, label, border thickness, or
border color, etc.
• Lay out networks in two dimensions. A variety of layout algorithms are available,
including cyclic and spring-embedded layouts.
 Analysis
• Filter the network to select subsets of nodes and/or interactions
• Find active sub-networks/pathway modules
• Find clusters (highly interconnected regions) in any network loaded into Cytoscape.
• More plugins available on the plugins page.
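Interaction pairs like those listed under Input are often handed to Cytoscape as a Simple Interaction Format (SIF) text file, one "source, interaction type, target" triple per line. The sketch below only builds that text; the function name and the node names are hypothetical, and the type labels "pp" (protein-protein) and "pd" (protein-DNA) are used here as conventional labels rather than required values.

```python
def to_sif(interactions):
    """Format (source, interaction_type, target) tuples as tab-delimited SIF."""
    return "\n".join(f"{src}\t{kind}\t{dst}" for src, kind, dst in interactions)

# Hypothetical interactions: one protein-protein and one protein-DNA pair.
interactions = [
    ("geneA", "pp", "geneB"),
    ("tfX", "pd", "geneA"),
]
print(to_sif(interactions))
```

Expression ratios and p-values would be supplied separately as node attributes keyed by the same node names, so that they can be mapped to node color or border in a visual style.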
E-cell
 E-Cell is an international research project aiming to
model and reconstruct biological phenomena in silico,
and developing necessary theoretical supports,
technologies and software platforms to allow precise
whole cell simulation
• Modeling methodologies, formalisms and techniques,
including technologies to predict, obtain or estimate
parameters such as reaction rates and concentrations
of molecules in the cell
• E-Cell System, a software platform for modeling,
simulation and analysis of complex, heterogeneous and
multi-scale systems
• Numerical simulation algorithms
• Mathematical analysis methods
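As a small-scale illustration of what a numerical simulation algorithm does in this setting, the sketch below integrates a two-step pathway S -> I -> P with Michaelis-Menten kinetics using a fixed-step fourth-order Runge-Kutta method. It is not the E-Cell System's implementation or API; the model, the parameter values, and all names are hypothetical.

```python
def michaelis_menten(vmax, km, substrate):
    """Reaction rate for a single-substrate enzymatic step."""
    return vmax * substrate / (km + substrate)

def derivatives(state, params):
    """Rates of change of (S, I, P) for the chain S -> I -> P."""
    s, i, p = state
    v1 = michaelis_menten(params["vmax1"], params["km1"], s)
    v2 = michaelis_menten(params["vmax2"], params["km2"], i)
    return (-v1, v1 - v2, v2)

def rk4_step(state, params, dt):
    """Advance the state by one fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * k for b, k in zip(base, slope))

    k1 = derivatives(state, params)
    k2 = derivatives(shifted(state, k1, dt / 2), params)
    k3 = derivatives(shifted(state, k2, dt / 2), params)
    k4 = derivatives(shifted(state, k3, dt), params)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(state, params, dt=0.01, steps=1000):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, params, dt)
        trajectory.append(state)
    return trajectory

# Hypothetical parameters and initial concentrations (arbitrary units).
params = {"vmax1": 1.0, "km1": 0.5, "vmax2": 0.8, "km2": 0.3}
trajectory = simulate((10.0, 0.0, 0.0), params)
print(trajectory[-1])
```

Perturbation experiments correspond to changing a parameter (for example lowering vmax2 to mimic enzyme inhibition) and rerunning the simulation to see the predicted downstream effect.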
E-cell
http://www.e-cell.org/
E-cell projects
 Mitochondrion (Yugi)
 E-Neuron (Kikuchi)
 E2coli (Hashimoto)
 e-Rice (Ishii, Nakayama)
 Erythrocyte (Kinoshita, Nakayama)
 Cell Signaling (Shimizu)
 Bacterial chemotaxis (Matsuzaki)
 Circadian rhythm (Miyoshi, Nakayama)
 Diabetes (Sano, Naito)
 Mathematical Analysis (Kikuchi)
 Myocardial Cell
E-CELL simulation environment
Image from Tomita, et al., 2001
ATP starvation simulation
ATP level
mRNA level
Images from Tomita, et al., 1999