a, b - Taida Institute for Mathematical Sciences(TIMS)

advertisement
Regression, correlation and liquid
association in complex genomic
data analysis
Ker-Chau Li Institute of Statistical
Science, Academia Sinica
gene-expression data
cond1 cond2 …….. condp
gene1
gene2
gene
n
x11
x21
x12 …….. x1p
x22 …….. x2p
…
…
Historical review of correlation
•
•
•
•
•
Why using mean ?
Why using standardization?
Outliers
Rank correlation
Normal score transformation
Theory of liquid association
•
•
•
•
•
•
Definition of LA
Stein lemma
Compute LA
Normal score transformation
P-value
LA plot
Generalization of liquid association
• Binary Z
• Transforming X and Y
• A paradigm of using liquid association
• LAP web application
http://kiefer.stat2.sinica.edu.tw/LAP3
(Note: case sensitive)
Regression: model, correlation and prediction
Regression toward the mean
http://en.wikipedia.org/wiki/Regressi
on_toward_the_mean#History
Regression, correlation
• Sir Francis Galton (1822 - 1911),
• half-cousin of Charles Darwin,
• was an English Victorian
polymath, anthropologist, eugenicist,
tropical explorer, geographer, inventor,
meteorologist, proto-geneticist,
psychometrician, and statistician.
He was knighted in 1909.
• Galton invented the use of the regression line (Bulmer 2003, p. 184),
and was the first to describe and explain the common phenomenon
of regression toward the mean, which he first observed in his
experiments on the size of the seeds of successive generations of
sweet peas.
Bivariate
normal
Pearson correlation
• Corr (X,Y) = E[(X-E(X))(Y-E(Y) ]/ SD(X)SD(Y)
X,Y are two random variables
Corr (often denoted  , or r ) is between -1 and 1
r= 0 , uncorrelated
>0, positive correlation
<0, negative correlation
Intuitive illustration: larger value of X correlates with
Larger value of Y (r>0)
Larger value of X correlates with smaller value of Y (r<0)
Choice of average value : E(X), E(Y) { why not using median ??? Galton did
so}
Application in microarray gene expression analysis
Galton’s discovery of regression
Messy Data, creative summary, no aid by computer
Elliptically contoured bivariate
distribution
• The set of points with equal frequencies of occurrence lie
on concentric ellipses with common center, shape and
orientation
Under the additional assumption that the marginal
distributions of X and Y are normal and the , it can be
proved that
the joint distribution of X,Y must follow a bivariate normal
distribution, joint density function : f(x,y)= a little bit
complex
Parameters of mean of X, mean of Y, SD of X, SD of Y,
correlation, 1, 2,1,2, 
Rate of Regression
prediction
• SD of height (assume = 1 inch, remains same from generation
to generation)
• Father 1 inch above (below) mean
• Son r inches above (below) mean
• Grandson r2 inches above (below) mean
• Grand grandson r3 inches above(below) mean
• N generation later rN inches above(below) mean
• which tends to 0
[T]he mean filial regression toward mediocrity was directly
proportional to the parental deviation from it. - Galton
Recall : median is used in his definition of SD
Required conditions for regression
effect
• If E(Y|X) is linear in X,
then E(Y|X) = a+ b X
b= given before
If SD(X)=SD(Y), then b= r which is between 0 and 1 for
positive correlated case.
So regression effect is direct consequence of mathematics ( I
am sure if this is statistical or not!)
• Correlation or regression are not equivalent to causal
relationship
• Regression can be forward or inverse
Error of prediction
•
Y= a + bX + error
b= r SD(Y)/SD(X)
R-squared = var (fit)/var (Y)=
Var(a +bX)/Var(Y) =
Var(bX)/Var(Y)=b2var(X)/Var(Y)=
r2 [SD(Y)2/SD(X)2 ]var(X)/var(Y)
= r2
F-ratio = [var(fit)/d.f of fit ]/ [var(error)/ d.f. of residual]
d.f. of fit = number of regression parameters (excluding
intercept) (= dimension of regressor in multiple linear
regression),
d.f. of residual = number of cases - 1-d.f. of fit
Inverse regression line
• Assume r=0.6
• The expected height of the son of a father 1
inch taller than average is 0.6 inches taller
than average
• What is the expected height of the father of a
son whose height is 0.6 taller than average ? 1
inches taller than average ?
Sliced inverse regression
for dimension reduction
• Inverse point of view of regression is useful in
reducing the dimensionality of regressors
• Instead of considering E(Y|x1, ..xp)
which is a p-dimensional surface, it is simper to
consider E(x1|Y), E(x2|Y) …E(xp|Y) which is a onedimensional curve in p-dimensional space.
Binary input and binary output
• Event A correlates with Event B
(p53 mutation correlates with decreased survival;
Aberrant p53 expression correlates with expression of
vascular endothelial growth factor mRNA and interleukin8 mRNA and neoangiogenesis in non-small-cell lung
cancer (Yuan et al, 2002, J.Clin.Oncol.20, 900-910)
Mathematically, does it imply
Event B correlates Event A ? Can you prove it or disprove it ?
(Homework )
Many ways of measuring correlation have been proposed
such as rank correlation, Fisher-Yates correlation, etc.
Why clustering make sense
biologically?
The rationale is
Genes with high degree of expression similarity are likely
to be functionally related.
may form structural complex,
may participate in common pathways.
may be co-regulated
elements.
by
common upstream regulatory
Simply put,
Profile similarity implies functional association
However, the converse is not true
The expression profiles of majority of
functionally associated genes are
indeed uncorrelated
• Microarray is too noisy
•Biology is complex
Patterns of Coexpression for Protein Complexes by Size in
Saccharomyces Cerevisiae
NAR 2008, Ching-Ti Liu, Shinsheng Yuan, Ker-Chau Li
•
Many successful functional studies by gene expression profiling in the literature
have led to the perception that profile similarity is likely to imply functional
association. But how true is the converse of the above statement? Do functionally
associated genes tend to be co-regulated at the transcription level? In this paper, we
focused on a set of well-validated yeast protein complexes provided by Munich
Information Center for Protein Sequences (MIPS). Using four well-known largescale microarray expression datasets, we computed the correlations between genes
from the same complex. We then analyzed the relationship between the distribution
of correlations and the complex size (the number of genes in a protein complex).
We found that except for a few large protein complexes such as mitochondrial
ribosomal and cytoplasmic ribosomal proteins, the correlations are on the average
not much higher than that from a pair of randomly selected genes. The global
impact of large complexes on the expression of other genes in the genome is also
studied. Our result also showed that the expression of over 85% of the genes are
affected by six large complexes: the cytoplasmic ribosomal complex, mitochondrial
ribosomal complex, proteasome complex, F0/F1 ATP synthase (complex V) (size
18), rRNA splicing (size 24), and H+- transporting ATPase, vacular (size 15).
Yeast Cell Cycle
(adapted from Molecular Cell Biology, Darnell et al)
Get t ing a homogeneous
populat ion of cells:
cell cycle
Cells at various
st ages of cell cycle
Synchronizat ion condit ions:
-Temperat ure shif t t o 3 7 C f or
CDC1 5 yeast t s-st rain
-add pheromone
-Elut riat ion
Release back int o cell cycle
Take sample
as cells progress
t hrough cycle
simult aneously
Microarray
gene-expression data
cond1 cond2 …….. condp
gene1
gene2
x11
x21
gene
n
Four large scale datasets
x12 …….. x1p
x22 …….. x2p
…
…
•
Figure 1. Comparison of correlation distributions for protein pairs with respect to
functional association (shown in left panel) and complex size (shown in right panel).
The terms “cc”, “yg”, “rst” and “st1” represent four different data sets: cellcycle,
segregation genetics, rosetta and stress data, respectively. Protein complex pairs are
abbreviated as “rel” and unrelated pairs are abbreviated as “unrel”.
Why no correlation?
• Protein rarely works alone
• Protein has multiple functions
• Different biological processes or pathways have to be
synchronized
• Competing use of finite resources : metabolites, hormones,
• Protein modification: Phosphorylation, proteolysis, shuttle, …
Transcription factors serving both as activators and repressors
The
thyroid hormone receptor
differs
functionally from glucocorticoid receptor
in two important respects :
it
binds to its DNA response elements
in the absence of hormone, and the
bound protein represses transcription
rather than activating it.
When
thyroid hormone binds to the
thyroid hormone receptor, the
receptor is converted from a repressor to
an activator.
Gene A = gene produces THR
Gene B= gene regulated by THR
THR alone represses B
THR+ HM activates B
Transcription factors: proteins that bind to DNA Activator;
repressors
Expression levels of A
THR alone represses B
and B can be either
THR+ HM activates B
positively correlated or
negatively correlated,
depending on thyroid hormone level.
If during an experiment, hormone
level fluctuated as organisms try to
accomplish different tasks and if
we cannot tell what tasks are,
then .....
Of course, the book is not talking
about yeast there. However,
Pairwise similarity is not enough!
Liquid Association (LA)
• LA is a generalized notion of association for describing certain kind
of ternary relationship between variables in a system. (Li 2002
PNAS)
•
Liquid Association
high (+)
•
low (-)
Y
transit
state 1
state 2
Linear (state 1)
Linear (state 2)
•
•
low (-)
X
Green points represent four
conditions for cellular state
1.
Red points represent four
conditions for cellular state
2.
Blue points represent the
transit state between
cellular states 1 and 2.
(X,Y) forms a LA.
high (+)
Profiles of genes X and Y are
displayed in the above scatter plot.
Important! Correlation
between X and Y is 0
Statistical theory for LA
• X, Y, Z random variables with mean 0 and
variance 1
• Corr(X,Y)=E(XY)=E(E(XY|Z))=Eg(Z)
• g(z) an ideal summary of association
pattern between X and Y when Z =z
• g’(z)=derivative of g(z)
• Definition. The LA of X and Y with respect to
Z is LA(X,Y|Z)= Eg’(Z)
Statistical theory-LA
• Theorem. If Z is standard normal, then
LA(X,Y|Z)=E(XYZ)
• Proof. By Stein’s Lemma : Eg’(Z)=Eg(Z)Z
• =E(E(XY|Z)Z)=E(XYZ)
• Additional math. properties:
• bounded by third moment
• =0, if jointly normal
• transformation
Stein Lemma
• To compute E(g’(Z)) is not easy. With
help from mathematical statistics theory,
the LA(X,Y|Z) can be simplified as E(XYZ)
when Z follows normal distribution.
Stein lemma
Normality ?
• Convert each gene expression profile by taking
normal score transformation
• LA(X,Y|Z) = average of triplet product of three
gene profiles:
(x1y1z1 + x2y2z2 + …. ) / n
•
•
Liquid Association is not
Partial correlation
• X, Y, Z
• Z->Y, Z->X (Causal analysis )
• X=aZ+b+error
• Y=a’Z+b’+error’
Partial correlation =corr (error, error’)
If Z causes X and Y, then partial correlation=0
(X=Coke sale, Y=eye disease incidence rate, Z=season)
Start with a pair of correlated genes X, Y, find Z to minimize partial
correlation.
This is very different from LA.
Graph representation
ARG1
8th place
negative
Y
Figure 4. Urea
cycle/argininbiosynthesis
pathway. ARG2 encodes
acetyl-glutamate synthase,
which catalyzes the first step
in synthesizing ornithine
from glutamate. Ornithine
and carbamoyl phosphate
are the substrates of the
enzyme ornithine
transcarbamoylase, encoded
by ARG3. Carbamoyl
phosphate synthetase is
encoded by CPA1 and CPA2.
ARG1 encodes
argininosuccinate
synthetase, ARG4 encodes
argininosuccinase, CAR1
encodes arginase, and CAR2
encodes ornithine
aminotransferase.
X
Compute LA(X,Y|Z) for all Z
Adapted from KEGG
Rank and find leading
genes
Why negative LA?
high CPA2 : signal for arginine demand.
up-regulation of ARG2 concomitant with down-regulation of CAR2
prevents ornithine from leaving the urea cycle.
When the demand is relieved, CPA2 is lowered, CAR2 is up-regulated,
opening up the channel for orinthine to leave the urea cycle.
2
0
-2
-1
0
Low
CAR2
High
1
1
-1
-2
Low
ARG2
High
2
low CPA2
median CPA2
high CPA2
Linear (low CPA2)
Linear (high CPA2)
Statistical significance
• P-value can be calculated by permutation test
or by large sample approximation
• Plot of liquid association is provided by two
methods:
MLE for mixture model
discrete method
How does LA work in cell-lines?
Alzheimer’s Disease hallmark gene
Amyloid-beta precursor protein (APP)
Liquid association:
A method for exploiting lack of correlation between
variables
The brain tissue shows "neurofibrillary tangles" (twisted fragments of protein
within nerve cells that clog up the cell), "neuritic plaques" (abnormal clusters of
dead and dying nerve cells, other brain cells, and protein), and "senile
plaques" (areas where products of dying nerve cells have accumulated
around protein). Although these changes occur to some extent in all brains with
age, there are many more of them in the brains of people with AD. The
destruction of nerve cells (neurons) leads to a decrease in neurotransmitters
(substances secreted by a neuron to send a message to another neuron). The
Amyloid beta peptide is the predominant
component of senile plagues in brains of MD patients.
It is derived from
Amyloid-beta precusor protein (APP) by
consecutive proteolytic cleavage of
Beta-secretase and
gamma-secretase
What is the physiological role of
APP?
Cao X, Sudhof TC.
A transcriptionally active complex of APP
with Fe65 and histone acetyltransferase Tip60.
Science. 2001 Jul 6;293(5527):115-20.
Abstract of Cao and Sudhof
Amyloid-beta precursor protein (APP), a widely expressed cell-surface
protein, is cleaved in the transmembrane region by gamma-secretase.
gamma-Cleavage of APP produces the extracellular amyloid betapeptide of Alzheimer's disease and releases an intracellular tail
fragment of unknown physiological function. We now demonstrate that
the cytoplasmic tail of APP forms a multimeric complex
with the nuclear adaptor protein Fe65 and the histone
acetyltransferase Tip60. This complex potently stimulates
transcription via heterologous Gal4- or LexA-DNA binding
domains, suggesting that release of the cytoplasmic tail of
APP by gamma-cleavage may function in gene expression.
Take X=APP, Y=APBP1
• APBP1 encodes FE65
• Find BACE2 from our short list of LA score
leaders.
• BACE2 encodes a beta-site APP-cleaving
enzyme
Take X=APP, Y=HTATIP
HTATIP encodes Tip60
Finds PSEN1 (second place positive
LA score leader)
Which encodes presenilin 1,
a major component of
gamma-secretase
Application: finding candidate genes
for Multiple sclerosis
Multiple sclerosis
• 1. MS is a chronic neurological disorder disease, characterized by
multicentric inflammation, demyelination and axonal damage,
resulting in heterogeneous clinical features, including pareses,
sensory symptoms and ataxia. The classical clinical features
include disturbances in sensation and mobility. The typical age of
onset is between years 20 and 40, making MS one of the most
common neurological diseases of young adults. Four genome-wide
scans (US, UK, Canada, and Finland) have revealed several putative
susceptibility loci, of which the loci on chromosomes 6p, 5p, 17q
and 19q have been replicated in multiple study samples. More
recently, Professors Aarno Peltonen and Leena
Peltonen’s teams have generated a fine map on
17q22-q24 (Saarela et al 2002). They are now
interested in the functional aspect of the genes in this region using
microarray technology.
Multiple sclerosis study
Aarno : How about PRKCA ?
KC: it is not in the NCI’s cell-line data, how
about PRKCQ?
Aarno: No, we are very picky.
Daniel: GNF has PRKCA
KC: we have not analyzed GNF data yet.
Aarno, Daniel, Robert: give it a try
(GNF_ gene expression Atlas;
http://expression.gnf.org ). Su et al.
X=MBP(myelin basic protein), Y= PRKCA (a
gene in interested chromosome 17q22-24
region),
we get Z=SLC1A3 at the second place.
Returning to the cell-line data, we use SLC1A3
as X to find highest score LA pairs.
MYT1(Myelin transcription factor 1) appears
twice in our short list of the best 40 pairs (out of
a total of over 73 million possible pairs)!!
A lot more evidences: including connection to HLA
(major histocompatibility complex) family of genes in
6p21, the locus which most linkage analysis has
pointed to.
To summarize:
Starting from PRKCA(protein kinase C, alpha), a primary
susceptible gene from an MS locus on chromosome 17q22q24, we use the method of liquid association (LA) to probe
for functionally associated genes. A string of evidences
from four large-scale gene expression databases suggest a
strong connection to SLC1A3(glial high glutamate
transporter, member 3). Significant linkage of SLC1A3 to MS
is established in a follow-up SNF typing involving high MSrisk population sub-isolate of Finland. Our results open up a
new approach of finding susceptible genes for complex
diseases.
MYT1 is a zinc finger transcription factor that
binds to the promoter regions of proteolipid
proteins of the central nervous system and
plays a role in the developing nervous system
and FALZ has a DNA-binding domain and a
zinc finger motif. FALZ is located in
chromosome 17q24.3, the region of interest.
SLC1A3 is a glial high affinity glutamate
transporter, located in chromosome 5p13
which is another region of interested mapped
earlier in the Finish study.
Liquid association:
A method for exploiting lack of correlation between
variables
glutamate-induced excitotoxicity
SLC1A3 is highly expressed in various brain regions including cerebellum, frontal
cortex, basal ganglia and hippocampus. It encodes a sodium-dependent
glutamate/aspartate transporter 1 (GLAST). Glutamate and aspartate are
excitatory neurotransmitters that have been implicated in a number of
pathologic states of the nervous system. Glutamate concentration in
cerebrospinal fluid rises in acute MS patients whilst glutamate antagonist
amantadine reduces MS relapse rate. In EAE, the levels of GLAST and GLT-1
(SLC1A2) are found down-regulated in spinal cord at the peak of disease
symptoms and no recovery was observed after remission. We consider highly
encouraging that several lines of evidence including both genetic association
and gene expression association, would be consistent with the glutamateinduced excitotoxicity hypothesis of the mechanisms resulting in
demyelination and axonal damage in MS.
Validation for the genetic relevance of SLC1A3 to MS. We set to test if there is any
genetic relevance of SLC1A3 to MS. Before SLC1A3 was brought up by the LA method,
our fine mapping effort focused on a more telomeric region (between 10.3 and 17.3 Mb)
of 5p, which had provided the highest two-point lod scores in Finnish MS famili es (22) .
Guided by the LA findings, we further included five SNPs flanking the SLC1A3 gene
(Table 2) to be genotyped in our primary study set consisting of 61 MS famili es from the
high risk region of Finland. The most 5ΥSNP, rs2562582, located within 2kb from the
initiation of the SLC1A3 transcript showed initial evidence for association to MS
(p=0.005) in the TDT analysis, sugges ting a possible functional role of this variant in the
transcriptional regulation of this gene. Moreover, as shown in Table 2, stratification of
the Finnish MS famili es according to the strongest associating SNP on the HLA
region23, rs2239802, strengthened the association between the SLC1A3 SNP and MS
(p=0.0002, TDT). Thus, based on LA, and supported by association analyses in an MS
study sample, the presence of SLC1A3 serves to connect all four major MS loci
identified in Finnish families, elucidating a potential functional relationship between
genetically identified genes and loci.
©
20
07
N
at
ur
e
P
u
bli
sh
in
g
Gr
o
u
p
htt
p:/
/w
w
w.
na
tur
e.
co
m/
na
tur
eg
en
eti
cs
Interleukin 7 rece ptor a chain (IL7R) shows allelic and
functional association with multiple sclerosis
1, 9
1, 9
2
3
1
1
Simon G Gregory
, Silke Schmidt
, Pune
et Seth , Jor ge 6R Oksenberg , Joh
n Ha rt , Angel a Proko
p ,
3
4
5
3
7
Stacy J4 Caillie r , Maria Ba 4n , An Goris , Lis a F
Barc
ello
s
,
Robin
Lin
col
n
,
Jacob
L
McCauley
,
Ste
phen J
5
3
2
Sawcer , D A S Co8 mpston , Ben edicte Dubois
,
Ste
ph
en
L
Hauser
,
Maria
no
A
Garcia-Bl
an
co
,
Margaret
7
A Pericak -Vance & Jon atha n L Hain es , fo r th e Mu ltiple Sclerosis Gen eti cs Group
Multip l e sclerosis is a demyeli nat ing neu rodeg ener at ive di sease with a strong genet ic comp onen t. Previo us genet ic risk studi es
have faile d to ident ify con sis tent ly li nke d regio ns or genes ou tsi de of the majo r histocompat ibi lity comp lex on chromosome 6p. We
describe alle li c assoc iat io n of a pol ymorphism in the gene enc odi ng the interleu ki n 7 receptor a chain (IL7R) as a sig niήcant ris k
Π7
factor for multip le sclerosis in four indep end ent family-based or case-con trol data set s (overall P ¥ 2.9 ¾ 10 ). Further, the li kely
causal SN P, rs68 9793 2, located within the alternat ively sp liced exon 6 of IL7R, has a funct io nal effect on gene expressio n. The SN P
inί uenc es the amount of sol uble and membrane-bo und isoforms of the protein by p utat ively di sru pting an exoni c sp lic ing sile ncer.
Multip l e sclerosis is the prototyp ical human demyeli nat ing di sease, whic h requi res multip l e sourc es and typ es of positiv e evi denc e
to and numerou s ep idemiolog ical, adop tio n and twin studie1 s have imp licate a candi date ge ne, can be used to integ rate bo th statist ical
provi ded evi denc e for a strong und erlying genet
ic lia bil ity .The an d funct io nal data. di sease is most common in y oun g adults, with
3
more than 90 % of Using genomic converge nce , we identi ήed 28 genes that were affected individ uals dia gnosed before the age of 55,
and fewer than 5% differentially expressed in at leas t two of nine previo us expressio n dia gnosed before the age of 14.2 Females are two
to three times more studie s (Sup plementar y Table 1 onli ne). We focus ed on thre e genes freque nt ly affected than male s , and t he dis ease
course can vary (interle uki n 7 rece ptor alp ha chain (IL7R) [MIM: 14666 1], matrix su bs tant ially , with some affected in divid uals
suffering on ly minor metallo p roteinase 19 (MM P19) [M IM: 6018 07] and chemokine (C-C di sabil ity several deca des after their initial
dia gnosi s, and others motif) li gand 2 (CCL2) [M IM: 15 810 5]) that had a p revio us ly reaching wheelc hair depende ncy short ly after
di sease onset. The p ubl ished or inferred funct ional role in multip le sclerosis and that complex etiolog y of the di sease and the current ly
undeήned molecular were not located within the M HC. Two of the three genes locali
ze to mechanisms of mult ip l e sclerosis sug gest
4
that moderate contrib ut ion s previo us regio
ns of gen eti c li nkag e on 17q1 2 (CCL2) and 5p13.2 from mult ip l e risk loci u nderlie the
5
develo p ment and progress io n of (IL7R) . We analy zed a large dat a set of 760 US familie s of Eu rop ean the di sease. descent, including
1,05 5 individ uals with multip le sclerosis, and
Many different approach es, includi ng genet ic li nkag e, can di date id enti ήed a sig niήcant assoc iat io n with multip le sclerosi s
sus ceptib il ity ge ne associat io n and gene expressio n studie s, have been used (sum-on ly for a nons ynonymou s codi ng SN P
(rs68979 32) withi n a key marized in ref. 2) to ident ify the genet ic bas is of multip le sclerosis. transmembrane domain of IL7R. We
subsequent ly rep li cated this Howev er, gen et ic li nkag e screens have faile d to ident ify con sis ten t in iti al sig niήcant associat io n in
three ind ep enden t European p op ula regio ns of linkag e out side of the majo r histocompat ibil ity comp lex tions or populat ion s of
European descent: 43 8 individ uals with (M HC). Candi date gene studie s have suggeste d over 100 different mult ip l e sclerosis and 47 9
unrela ted control s ascertained in the U nited associa ted ge nes, bu t ther e has n ot been consensu s acceptance of any States, 1,338
individ uals with multip le sclerosis and thei r parent s such candidates. Similarly, gene express io n studie s have identi ήed ascert ained
in nort her n Europ e (the UK an d Belgi um) and 1,0 77 hund reds of differentia lly expres sed transcrip ts, with li ttl e cons istency
individ uals with multip le sclerosis and 2,7 25 unrela ted control s also across studie s. The altern at ive approach of genomic
3
conv erge nce , ascertained in nort her n Europe. We show that rs68 9793 2 affects
1
2
Ce nter for Human Genetics, and Center for RNA Bi olo g y and Departme nt of Molecular Genetics an d Microbiolog y, D uke University Med ical Center, Durham, North
Variation in interleukin 7 receptor a chain (IL7R)
inί uences risk of multiple sclerosis
1
2
3
1,3
3
Frida Lund
mark , Kristina
Duvef elt , 5Ell en Iacobaeus
, Ingrid Koc kum
, Erik Wall stro¬m
, Mo hsen
3
4
6
7, 8
8
9
Kh ademi , An net
te
Oturai
,
Lars
P
Ryder
,
Janna
Saarela
,
Ha
nne
F
Ha
rbo
,
Elis
abeth
G
Cel
ius
,
Hugh
Salter ,
3
1
T o mas Olsson & Jan Hill ert
Multip l e sclerosis is a chr onic, often di sabli ng , di sease of the
central ner vous system affecting more than 1 in 1,000 peop le in
most west ern coun trie s. The inί ammatory les io ns typical of
multip le sclerosi s show aut oimmune features and dep end part ly
on genet ic factors. Of these ge neti c factors, on ly the HLA ge ne
comp lex has been repeatedl y conήrmed to be associa ted with
multip le sclerosi s, des p ite cons id erable efforts. Pol ymorphisms in
a number of non-HLA genes have been reported to be assoc iated
with multip le sclerosis, bu t so far conήrmation has been difήcult .
Here, we report comp elli ng evi dence that p olymorp hisms in
IL7R, whic h enc odes the int erleuki n 7 receptor a chain (IL7Ra),
ind eed contrib ute to the non-HLA ge netic risk in multip le
sclerosis, demons trat ing a role for this p athway in the
p atho physio log y of this di sease. In additio n, we report altered
express io n of the genes enc odi ng IL7Ra and its li ga nd, IL7, in
the cerebrosp inal ί uid compartment of in divid uals with multip le
sclerosis.
in divid uals with mul tip le scl erosis and 2 ,6 34 h ealthy c o nt rols from th e N o rdic cou n tries
(D enmark , Fin land, N orw ay a nd Sw ed en), indepen den t from the data set
analyzed in ref. 4 (Sup plementary Table 1 onli ne). Of these, 91%
of the affected individ uals had experie nced an init ia lly rela psing remitting course of multip le sclerosis (RRMS), whereas 9% had a
p rimary p rogress iv e course (P PMS). In additio n, we analy zed the
express io n of IL7R and IL7 in the peripher al bloo d as well as in
cell s from the cerebrosp inal ί uid (CSF).
We gen otyped the three previo us ly ass ocia ted SNP s, locat ed in
intron 6 (rs98710 6 and rs98 7107) and exon 8 (rs 319405 1).
rs98
710 6 and rs3194 051 were in hig h li nk age di sequ ilib2 rium (LD)
2
(r ¥ 0.99 , |DΆ ¥ 1.00| ), and rs98710 7 was in partial LD (r ¥ 0.29, |DΆ
¥ 0.99|) w ith rs9871 06 a nd rs3 19405 1, as they were l ocated in the
same hap lo type bl ock. The size of the study allo wed full p ower
(100 %) to detect an odd s ratio (OR) of 1.3. All SNP s were
genotyp ed us ing the Sequenom hM E assays. The ob served contro l
genotyp es conformed to Hardy -We inb erg equ ilib rium. All three
SN P s conήrmed sig niήcant associat io n with multip l e sclerosis in
IL7Ra (also kn own as CD12 7), enc oded by IL7R,is a member of the this nonoverla p p ing case-con trol g roup, with very similar ORs as
in t he previo us study (rs9871 07, P ¥ 0.002 ; rs 98710 6, P ¥ 0.001 ;
type I cytoki ne receptor family and forms a receptor comp lex with
the common cytoki
ne receptor gamma chain (CD13 2) in whic h IL7 rs31 9405 1, P ¥ 0.002). A tes t for heterogeneity bet ween the dat a
1
set s from Norway, Denmark, Finla nd and Sweden showed no
is the li gand . The IL-7ΠIL 7Ra li gand -rece ptor p air is crucial for
evi denc e of strati ήcat io n, thu s permitting a combined analysi s
proliferat io n and survi val of T and B lymphocytes in a
(Mantel-Haen szelΠco rrect ed and crud e ORs are shown in Table 1 ).
no nredund ant fashio n, as shown in human and animal models, in2
We est imated the three-marker
hap lo type frequ encie s us ing the EM
whic h ge netic aberrat io ns lead to immune deήcie ncy synd romes .
5
alg orithm in Hap lovi ew .The estimated di strib ut io n of ha p lo typ es
IL7R is located on chr omosome 5p13, a regio n 3occas io nally
differed s ig niήcant ly bet ween affected individ uals and controls (P ¥
suggest ed to be li nk ed with multip le sclerosis .
0.00 1), with two ha p lo typ es assoc iat ing with multip l e sclerosis,
We hav e cons id ered IL7R a promising can di dat e gene in multip le
one conferring an increased risk of disease (P ¥ 0.000 4) and the
sclerosis, and we have recent ly reported genetic associat ion s
other conferring a decr eased risk of disease (P ¥ 0.003 )
with three IL7R SN P markers in up to 67 2 Swedish individ uals
(Sup plementary
Table 2 onli ne), in accordance with previou s
with multip l e sclerosis and 672 control s, as well
as
two
4
4
resu lts .
assoc iated hap lo typ es spann ing these markers . To co nήrm t h ese
Accordi ng to data from the HapM ap CEU pop ulat io n, IL7R is
as sociat ions, we assessed an independent case-control gro up consisting of 1 ,8 20
located within a tig ht LD bl ock conta ining no addi tio nal g enes.
To
International MS whole genome association
study(2007).
• Affymetrix 500K to screen common genetic variants of 931 family trios.
• Using the on-line supplementary information provided, we found two
SNPs, rs4869676(chr5:36641766) and rs4869675(chr5: 36636676 ) with
TDT p-value 0.0221 and 0.00399 respectively, are in the upstream
regulatory region of the SLC1A3 gene.
• In fact, within the 1Mb region of rs486975, there are a total 206 SNPs in the
Affymetrix 500K chip. No other SNPs have p-value smaller than that of
rs486975.
• The next most significant SNPs in this region are
rs1343692(chr5:35860930), and rs6897932(chr5:35910332; the identified
MS susceptibility SNP in the IL7R axon).
• The MS marker we identified rs2562582(chr5: 36641117) is , less than 5K
apart from rs4869675, but was not in the Affymetrix chip.
A little bit late
• IL7R was found long time ago before by LA !!!See the attached the e-mail
ハ sent more than two years ago in 2005 !!!
• Begin forwarded message:From: Ker Chau Li (local) <kcli@stat.ucla.edu>Date: March 28, 2005 10:17:51 AM
PSTTo: Robert Yuan <syuan@stat.ucla.edu>, Aarno Palotie <APalotie@mednet.ucla.edu>, Daniel Chen
<pharmacogenomics@yahoo.com>, Denis Bronnikov <denis@ucla.edu>, Palotie Leena <leena.peltonen@ktl.fi>Cc: Ker
Chau Li (local) <kcli@stat.ucla.edu>
•
Subject: IL7R
(I thought this e-mail should have been sent out already; but it has not)I take X=SLC1A3,
Y=MBP, Z= any gene, using 2002 Atlas data. Two genes are from the short list of genes
with highest LA scores.IL7R interleukin 7 receptor and HLA-A
IL7R is at 5p13. Interesting coincidence??
other interesting findings include GFAP glial fibrillary acidic protein on 17q21 (Alexander
disease)GRM3 (glutamate receptor, metabotropic )CDR1 (cerebellar degeneration-related
protein 1)Ighg3 (immunoglobulin heavy constant gamma 3)Iglj3 ( immunoglobulin lambda
joining 3)
Figure 3. Organization chart for incorporating LA with similarity based methods. Coexpressed genes found by profile similarity analysis can be pooled together to obtain a
consensus profile for LA-scouting. Likewise, the genes identified through LA system can be
further analyzed for patterns of clustering. For some applications, the scouting variable may
come from external sources related to the expression profiles. SVD: singular value
decomposition; PCA: principal component analysis.
Full genome
expression
prof iles
Similarity based
analy sis
co-expression neighbors
hierachical cluster
eigen prof ile by
SVD or PCA; etc.
LA-based
analy sis
Finding LA-scouting
genes f or a giv en
pair of genes
Finding LAPs f or
a giv en scouting
v ariable Z
Using a consensus
prof ile as Z
using a gene
prof ile as Z
Using an external
v ariable as Z
Similarity based
analysis
III Correlating gene-expression with
drug-responsiveness
c.line1 c.line2 …….. C.linep
gene1
gene2
drug1
drug2
x11
x21
x12 …….. x1p
x22 …….. x2p
…
…
y11 y12 ………… y1p
y21 y22 ……………y2p
…..
Experiment procedure
Rate of inhibition
(Ti  Tz )
100
(C  Tz )
48 h at 37°C
Tz = 6 cells
C = 12 cells
E.g.:
(9-6)/(12-6)*100=50
GI50 is x
Drug
added at
x (M)
48 h at 37°C
Tz = 6 cells
Ti = 9 cells
GI50 is the concentration of the drug needed to
inhibit the growth of the cells up to 50%
Drug sensitivity
is defined as:
-log(GI50)
Example of Drug Sensitivity Data
Sensitivity profile
Cell Line
89
89
89
89
89
89
89
89
89
,M
,M
,M
,M
,M
,M
,M
,M
,M
,-4.00,Non-Small
,-4.00,Non-Small
,-4.00,Non-Small
,-4.00,Non-Small
,-4.00,Non-Small
,-4.00,Non-Small
,-4.00,Non-Small
,-4.00,Non-Small
,-4.00,Non-Small
Cell
Cell
Cell
Cell
Cell
Cell
Cell
Cell
Cell
Lung
Lung
Lung
Lung
Lung
Lung
Lung
Lung
Lung
,NCI-H23
,NCI-H522
,A549/ATCC
,EKVX
,NCI-H226
,NCI-H322M
,NCI-H460
,HOP-62
,HOP-18
GI50
,1
,1
,1
,1
,1
,1
,1
,1
,1
,1
,3
,4
,8
,13
,17
,21
,26
,27
,4.209,3
,6.159,3
,4.186,3
,4.255,3
,4.231,2
,4.000,3
,4.233,3
,4.000,3
,4.249,3
,3
,3
,3
,3
,2
,3
,3
,3
,3
,0.361
,1.696
,0.322
,0.442
,0.326
,0.000
,0.403
,0.000
,0.431
Drug Sensitivity profile
• For each chemical compound, the tests are done
for all 60 cell lines listed previously.
• For each cell line and each compound, there
were multiple experiments performed to obtain
the average drug concentration.
Drug Sensitivity profile (cont.)
• Drugs with similar profiles usually share similar molecular
structures and biochemical mechanism of actions. (R. Bai et
al. 1991, K. D. Paull et al. 1992, H. N. Jayaram et al., 1992)
• Similarity is measured by Pearson’s correlation coefficients.
• COMPARE ( gateway to NCI’s anticancer screen database)
Compare Drug sensitivity and Gene
expression profiles
• Sherf et. al. (2000) compare gene expression
profile with the drug sensitivity profile by
computing correlation coefficient.
• 9706 genes in the expression data set.
• 118 chemotherapy agents with known
molecular mechanism of drug action
• Genes whose profiles correlate well with drug
sensitivity profiles are thought to be related to
drug functioning. Why?
Figure 1. Gene-drug interaction
drug target genes,
known genes in related pathways,
other genes,
metabolites and others
drug dose
nutrients
other growth conditions
initial
cellular
status
drug
exposure
drug transport, target binding
cytotoxicity
cell cycle arrest, apoptosis
drug resistance
rate of cell growth
pharmacokinetics,
metabolism, signal
transduction,
complex dynamics
48 hours
later
drug sensitivity
-log GI50
In the NCI anticancer screen, each candidate agent is tested for a broad concentration
range against each of the 60 cell lines in the panel. More than 60,000 agents have been
screened.
Similarity comparison : Computer program "COMPARE:each compund is compared with
others and a list of agents with
similar patterns in responsiveness in inhibiting growth inhibition is given,
Limitation of Pearson’s Correlation
Coefficient (CC)
• Many drugs do not correlate well their
molecular target genes.
• The pair, MTX and DHFR, has only 0.071.
Liquid Association (LA) is used to understand the relationship
between MTX and DHFR.
Methotrexate
childhood acute
lymphoblastic leukemia (ALL), non HodgkinΥs lymphoma, osteogenic sarcoma,
chorocarcinoma and carcinomas of breast,
7
head and neck . The b inding of MTX and it s polyglutamated forms to its
MTX has been used for treatment of
primary target DHFR inhibits the reduc tion o f folate and 7,8-dihyd rofo late (DHF) to
5,6,7,8-tetrahydro folate(THF). The inhibiti on in turn results in the quick conv ersion o f all
of a cellΥs limit ed supp ly of THF to DHF by thymi dylate syn thase (encoded by TYMS)
reaction (Figure 2, left panel). This prevents further dTMP synthes is .
Inhibition of DNA component
synthesis
• X is chosen to be a drug. Y
is chosen to be the target
gene which encodes the
protein known to
participate in drug’s
activity.
• E.g. MTX (Methotrexate)
and DHFR (Dihydrofolate
Reductase).
MTX
NADPH+H+
NADP
DHFR
DHF
dTMP
+
THF
TYMS
N5,N10–Methylene-THF
RNA to DNA
U to T
5FU
dUMP
MTX, DHFR,TYMS
Table S.1 Genes with high LA scores for MTX, DHFR, TYMS
Table S.1 (A)
Table S.1 (B)
M TX
DHFR
Z
MDA5
MEF2D
NA-6915
NA-1878
NA-1684
MFGE8
KIAA0337
TESK1
KIAA0555
ZFPL1
EHD2
RER1
DDAH2
NA-728
NA-671
SPTB
APOC1
RAB2L
CNTNAP1
LAP
0.4062
0.3755
0.3746
0.3699
0.3659
0.3655
0.3647
0.3612
0.3585
0.3548
0.3521
0.3435
0.3343
0.3327
0.3227
0.3208
0.3196
0.3117
0.3103
Z
KIAA1706
CCNH
NA-3315
TXN
FLJ10035
EIF4E
COX11
NA-1802
PCSK7
MGC21874
KIAA1354
C2
NA-1844
DPP4
WHSC2
C9orf10
FLJ20156
KIAA1078
MST4
NA-4135
0.3102 ABCE1
Table S.1 (C)
M TX
TYMS
LAP
-0.4263
-0.3878
-0.3869
-0.3832
-0.3832
-0.3675
-0.359
-0.3581
-0.3505
-0.3487
-0.343
-0.3428
-0.3422
-0.3324
-0.3321
-0.3294
-0.3294
-0.3289
-0.3263
Z
DGSI
RANBP1
MAPK1
UFD1L
HTF9C
NA-1390
KPNA1
PPIL2
NA-6953
ATP6V1E1
GHR
IFRD2
CDK4
BHC80
MRPL49
SERHL
HIRA
HSRTSBETA
COPA
-0.3255 NA-1676
LAP
0.391
0.3566
0.3412
0.3375
0.3298
0.324
0.3225
0.3035
0.3021
0.2934
0.2931
0.285
0.2835
0.2802
0.2776
0.2742
0.2685
0.2647
0.2645
Z
NA-6043
NA-4504
KIAA0276
NA-3333
ATP2B1
RAB31
CRIP2
TXNRD1
MGC2721
TNFRSF19L
CYP2C9
ABI-2
AEBP1
SPAG9
NA-2843
APPBP2
FOXP1
GSTZ1
HAN11
0.2639 CORO2B
DHFR
TYMS
LAP
-0.3016
-0.3
-0.2998
-0.2969
-0.2889
-0.287
-0.2768
-0.2752
-0.2732
-0.2732
-0.2689
-0.2686
-0.2674
-0.2619
-0.2614
-0.2603
-0.2601
-0.259
-0.257
Z
NA-1837
NA-2696
NA-2283
TEM8
TOX
SLC9A6
NA-9327
MSN
TMEFF1
NA-6473
TBC1D5
SUSP1
IGFBP3
EDN3
FLJ10392
NA-5323
FST
PRKACB
FAF1
-0.2558 MSCP
LAP
0.4559
0.4421
0.4276
0.415
0.4147
0.4048
0.3995
0.397
0.3931
0.3831
0.3803
0.3736
0.3675
0.3643
0.3633
0.3629
0.3601
0.3582
0.3581
Z
CSTB
CRYL1
NA-1556
GJA5
HPX
MYO15B
M17S2
NA-6537
MRPL12
ICA1
GPCR1
HRASLS3
MAD
FLJ22283
LYZ
TXN
RNF13
LOC51133
SERPINA6
0.3581 DAB2IP
LAP
-0.3975
-0.3925
-0.3903
-0.3903
-0.3868
-0.3848
-0.3837
-0.3827
-0.3788
-0.3757
-0.3721
-0.3704
-0.3702
-0.3692
-0.3679
-0.3666
-0.3665
-0.3661
-0.3656
-0.3638
Pathway for TXN and TXNRD
FAD
NDP
TXN
oxidized
TXN
reduced
FADH2
NADP+
RNR
reduced
TXNRD
reduced
NADPH
TXNRD
oxidized
dNDP
RNR
oxidized
MTX-sensitivity and the co-expression pattern
of DTTT subsystem.
• TXN and TXNRD1 encode thioredoxin and
thioredoxin reductase respectively.
• Together with DHFR and TYMS, they form a
subsystem (abbreviated DTTT) that critically
regulates the biosynthesis of DNA components8
• The role of the thiol-disulfide redox regulation in
tumor growth and drug resistance is reviewed in
Reference (9).
Projective LA ( PLA)
Schematic illustration of LA
high(+)
Fig2-Top
s tate1
low(-) gene Y
transit
s tate2
Linear
(state1 )
Linear
(state2 )
low(-)
gene X
high(+)
condition
Fig2-Bottom
lo w (-)
ge ne Z
hig h(+)
Projection-based LA for studying a
group of genes
Y (second projection)
Second projection
First projection
X (first projection)
mediator Z
Figure 2 (a)
Figure 2(b).
X : vector of p variables, X1,…Xp, each measures the
expression level of one gene. one-dimensional
projection: a’X = a1X1+..+apXp with norm ||a||=1.
2-D projection: a, b be orthogonal :
a’b=a1b1+..+apbp=0.
Liquid association between a’X and b’X as mediated
by Z is LA(a’X, b’X|Z)= E(a’X b’X Z)=E(a’XX’b
Z)=a’E(ZXX’)b.
Most informative 2-D projection :
maximize |a’E(ZXX’)b| over any pair of orthogonal
projection directions a, b.
solution : eigenvalue decomposition of the
matrix E(ZXX’):
E(ZXX’) vi = li vi , l1 ≥ ……… ≥ lp
vi are eigenvectors and lI are eigenvalues.
Theorem. Assume Z is normal with mean 0, SD=
1. Subject to ||a||=||b||=1 and a’b=0, the
maximum for the absolute value of LA(a’X, b’X|Z)
is (l1 - lp)/2. The optimal 2-D projection is given
by a=(v1+vp )/√2 (or -a), b= (v1 - vp )/ √2 (or -b).
An website for co-mining public and in-house
data
http://kiefer.stat2.sinica.edu.tw/LAP3
Data sets
• Organisms :
– Primary: homo sapiens ; mouse; yeast
– Others: C. elegans; arabidopsis; e. coli
• Homo sapiens: 17 datasets (more are added now)
–
–
–
–
–
–
60 cell line_Affy: 60 conditions, 5611 genes
60 cell line cDNA: 60 conditions, 9706 genes
GNF_atlas (2002): 101 conditions, 12536 genes
GNF_atlas(2004): 158 conditions, 33689 genes
Human eQTL (B-cell): 355 conditions, 8793 genes
Lung caner : 4 data sets:
•
•
•
•
Bhattacharjee et al : 203 conditions, 12600 genes
Beer et al : 96 conditions, 7129 genes
Gaber et al: 73 conditions, 23079 genes
Wigle et al: 39 conditions, 18117 genes
Facilities
• Basic
– Correlation for a pair of genes
– Liquid association for a triplet of genes
• Enhancement
– Advanced search methods
• Gene symbols; gene locations; gene ontology; regulation
(Transfac); locus link
– Compute
• Variations in computing LA scores
– Liquid association (default)
– Projective LA (for multiple genes)
» Transformation
– LA scouting genes
• Correlation only
– Raw data; normal transformation
• Clustering: k-mean, hierarchical clustering, self-organizing methods ( still
testing)
Facilities-continued
• Post-LA refining tools
– Summary
• Counts, histogram, GO, Pathway (still testing)
– Correlation
– Liquid association
•
•
•
•
Instant link to Entrez Genes or SGD(yeast only)
Liquid association graphs (two methods )
Save
Info
– (gene annotation, from public domain) Gene_sym, Gene-Name, chrom, start,
stop, etc.
– (expression data, computed) Indices, Ranks, Quantitles, Rank_LAP,
Rank_Corr,
– Transfac
– GO term (for yeast now)
• Compute
– Correlation matrix (raw or normalized)
– Clustering (K means; hierarchical )
Facilities (continued)
• P* : permutation with 50,000 iterations (testing )
• P** : permutation with 1,000,000 (does not work yet)
• Download (create excel files for exporting…)
– MAP (chromosome locations of output genes)
• Alert system
–
–
–
–
MS markers
MS candidate genes
Yeast genetics
User added system (talk to us)
• Disease pages (work in progress)
– Multiple sclerosis
•
•
•
•
Group by
Adding genes
Delete; modify
Computation methods. Databases,
Special tools (under development)
• For handling marker data
– Converting to binary data
• Additional links
– Precomputed data
• Master LA genes (for limited datasets for now)
– Protein Complex data (only in yeast for now)
– KEGG pathway
Heavy IT involvement is required
•
•
•
•
•
•
•
•
•
•
•
GO term specific investigation
KEGG pathway
Many large scale lung cancer microarray data sets
Gene group of interest identified by internal studies, for
example, EMT genes
NSCI-60 Cell line data
Mouse data
MicroRNA data
Phenotype and clinical variables
Array CGH data, SNP data
Different diseases
Cross platform
LAP background
• 2002-2006 : LAP1 and LAP2 websites at UCLA,
Robert Yuan
• 2007 : recruitment of IT persons : including
Guan I (hardware ), Ing-Fu (hardware), and
three outstanding RA with master degrees in
Computer Science students (two continued
their Ph studies in US) to initiate LAP3 which
transformed the scope of LAP2 with more
advanced web2.0 and other techniques
LAP Web Application
• A web application for biological researchers
– http://kiefer.stat2.sinica.edu.tw/LAP3/
– More than a web database
• Search, compute, and data analysis
• Create, save, and share LAP projects.
– No dedicated platform, no installation
• You may use LAP Web Application on any device with
web browser
Technologies behind LAP Web App.
• Distributed computing
– Computation is performed by the cluster of
computers in MIB group
• User does not need a powerful machine
• AJAX(Asynchronous JavaScript and XML,
intensively used in Web 2.0 applications)
– Provide “application-like” web services, better
user experience than traditional web services.
• Enable interactions between front-end user and backend server facilities.
LAP Web App. Features
• LAP is more than a biological database
– Search and compute!
• Supports the following computation methods:
–
–
–
–
–
–
–
–
LAP
PLA
Correlation
Clustering
Cox regression
PCA
P-value
Graphs created by R programs.
LAP Web App. Features (cont.)
• LAP is more than a web site – It’s a web
application!
– Better user experience
• Non-blocking web page
– User can continue using LAP web site functions while project
is still under computing
• Popup tips
– Information about the gene pointed by the cursor
– Notification of job completion
– Tips of icons/tools
• Customized user interface/environment
LAP Web App. Features (cont.)
– Project management
• Just like an application
– Load/Save the current search/computation project
• Session mobility: an active project can be moved
around different computing devices
– User/Service interaction
• User can share their own search/computation result
with other users in LAP web site.
LAP Web App. Architecture
Web browser (user client)
User Interface
HTML+CSS data
JavaScript Call
Ajax Engine
HTTP Request
XML data
Web Server + PHP engine
SQL Query
Commands
Query Result
Computing Result
SQL Query
Biological Databases
Computer Cluster
Query Result
Four-tier architecture for LAP
Future Work — Toward Web 2.0
• Toward Web 2.0
– More community functions
• Users can contribute their own experimental results to
enrich the biological database
• Users can contribute their own computation method
Future Work — Performance improvement
• Improve computing performance
– Optimization/Parallelization of algorithms
• Improve database performance
– Database structures
– Parallelization of database queries
Future Work — Cloud Computing
• Toward cloud computing
– More computation facilities open to the research
community.
– Better separation/protection of computing
resources
Future work – branching out
• Experience to share in statistical computing for other
project oriented scientific investigation
• Echoing the recommendation of “data center” in last
review (except that a bottom-up approach is now taken)
• Data without affiliated computing tools have very limited
use.
• Benefit the statistical theory and methodology
development
• Social network data
• Image data, for clinical application, vision study, neuron
science, etc
• Toward the creation of a cloud environment for statistical
computing
Four-tier architecture for LAP (Cont.)
• Presentation tier
– With web browser, end users can get a search of the databases
and retrieve the computed outcome
• Application tier
– With application platform, we provided end-user a highavailability web server cluster.
• Database tier
– Utilizing EMC storage system, we build up the database cluster
with high-availability and large enough space.
• Computation tier
– With SGE, NFS, NIS and Apache, we set up the computing
environment for computing LA score, correlation, clustering etc,
running C , R, Java codes.
Application tier
• The application
platform will
automatically
redirect end-user to
the web server of
lesser load
• Scheduled website
backup with storage
system
Database tier
• Through highavailability database
cluster, end-user can
get a database query in
more efficient.
• The program in the
computation tier can
also get the data
without affecting the
performance of the
web site.
Computation tier
• 280 computing
nodes for precomputing large
scale data and
serving end-user
from the web site.
• Two submit hosts to
increase host
availability
• Routine backup of
the end-user’s
projects with storage
system of 50
terabytes
Supporting various in house research
projects
• Lung cancer
• Tumor invasion study using NSC60 cell lines
NCI-60 cell line based integrative
computational system for tumor
invasion- related genes
許藝瓊博士
中研院統計所
syic@stat.sinica.edu.tw
Public domain data
mining
Laboratory data
Genomic data from
cancer cell lines
(NCI60) and patients
NCI60
invasion
profile
Biological function
Animal model
Drug development
on-line computational
system
Signature validation
Data mining
Invasion potential-related gene expression signature,
Candidate genes, regulation pathway
Future Work — Toward Web 2.0
• Toward Web 2.0
– More community functions
• Users can contribute their own experimental results to
enrich the biological database
• Users can contribute their own computation method
Future Work — Performance improvement
• Improve computing performance
– Optimization/Parallelization of algorithms
• Improve database performance
– Database structures
– Parallelization of database queries
Future Work — Cloud Computing
• Toward cloud computing
– More computation facilities open to the research
community.
– Better separation/protection of computing
resources
Future work – branching out
• Experience to share in statistical computing for other
project oriented scientific investigation
• Echoing the recommendation of “data center” in last
review (except that a bottom-up approach is now taken)
• Data without affiliated computing tools have very limited
use.
• Benefit the statistical theory and methodology
development
• Social network data
• Image data, for clinical application, vision study, neuron
science, etc
• Toward the creation of a cloud environment for statistical
computing
LA related References
•
•
Li, K.C. (2002) Genome-wide co-expression dynamics: theory and application. Proceedings of National Academy of
Science . 99, 16875-16880.
Li, K.C., and Yuan, S. (2004) A functional genomic study on NCI's anticancer drug screen. The Pharmacogenomics
Journal, 4, 127-135.
•
•
•
•
•
•
•
Li, K.C., Ching-Ti Liu, Wei Sun, Shinsheng Yuan and Tianwei Yu (2004). A system for enhancing genome-wide coexpression dynamics study. Proceedings of National Academy of Sciences . 101 , 15561-15566.
Yu , T., Sun, W., Yuan , S., and Li, K.C. (2005). Study of coordinative gene expression at the biological process level.
Bioinformatics 21 3651-3657.
Yu, T., and Li, K.C. (2005). Inference of transcriptional regulatory network by two-stage constrained space factor analysis. Bioinformatics 21,
4033-4038.
Wei Sun; Tianwei Yu; Ker-Chau Li (2007). Detection of eQTL modules mediated by activity levels of transcription factors. Bioinformatics; doi:
10.1093/bioinformatics/btm327 (correspondence author: Li)
Yuan, S., and Li. K.C. (2007) Context-dependent Clustering for Dynamic Cellular State Modeling of Microarray Gene Expression. Bioinformatics
2007; doi: 10.1093/bioinformatics/btm457 (correspondence author: Li)
Li, KC, Palotie A, Yuan, S, Bronnikov, D., Chen D., Wei X., Choi, O., Saarela J., Peltonen L. (2007) Finding candidate disease genes by liquid
association. Genome Biology (in Press).
Wei, S., Yuan,S., and Li, K.C. Trait-trait interaction: 2D-trait eQTL mapping for genetic vriation study. BMC Genomics 2008, 9:242
Acknowledgements
• Robert Yuan
• Former UCLA students : Wei Sun, Ching-Ti Liu, Xuelian Wei, Tianwei
Yu
• Current biology collaborators: Pan-Chyr Yang (Dean of Medical
School, NTU), S.L.Yu, H.W. Chen
• Post Docs : Yi-Chiung Hsu, Pei-Ing Hwang, Shang-Kai Tai
• Mission specific Research assistants:
Guan I Wu, Ying-Fu Ho, (hardware)
Hung Wei Tseng (Web 2.0, Ajax)
Cin Di Wang, Chia Hsin Liu, Cheng-Tao Chen, Chia Hung Lin, Kang
Chung Yang, Shiao Bang Chang, Shian-Lei Ho
Download