Scientific Advisory Board - Department of Statistics

Exploratory Failure Time Analysis
and Copy Number Variation Inference
Cheng Cheng
Department of Biostatistics
St. Jude Children’s Research Hospital
Outline
Part I Background
Part II Exploratory Failure Time Analysis
Part III Copy Number Variation Inference
I. Background
• Nucleus, nucleotides, DNA, chromosomes, SNP
• SNP arrays
• Genome Wide Association Study (GWAS)
• Multiple tests
• Cause-specific failure and competing risk
• Cumulative incidence function, Gray's test, Fine-Gray hazard rate regression model
• Censoring at the time of a competing event: OK for testing stochastic independence, biased for estimation
Animal Cell Organelles
Nucleus
Nucleolus
Endoplasmic Reticulum
Centriole
Centrosome
Golgi
Cytoskeleton
Cytosol
Mitochondrion
Secretory Vesicle
Lysosome
Peroxisome
Vacuole
Nucleus Functions
The cell nucleus is an organelle that forms the package for our genes and
their controlling factors.
• Store genes on chromosomes
• Organize genes into chromosomes to allow cell division.
• Transport regulatory factors & gene products via nuclear pores
• Produce messages (messenger Ribonucleic acid or mRNA) that code
for proteins
• Produce ribosomes in the nucleolus
• Organize the uncoiling of DNA to replicate key genes
Chromosome inside nucleus
DNA = deoxyribonucleic acid
• What is a chromosome?
– In the nucleus of each cell,
the DNA molecule is
packaged into thread-like
structures called
chromosomes.
– Each chromosome is made up
of DNA tightly coiled many
times around proteins called
histones that support its
structure.
Human chromosomes
• In humans, each cell normally contains
23 pairs of chromosomes, for a total of
46.
• Twenty-two of these pairs, called
autosomes, look the same in both
males and females.
• The 23rd pair, the sex chromosomes,
differ between males and females.
– Females have two copies of the X chromosome.
– Males have one X and one Y chromosome.
Chromosome Structure
• Each chromosome has a constriction point called the centromere, which divides the chromosome into two sections, or “arms.”
• The short arm of the chromosome is labeled the “p arm.” The long arm of the chromosome is labeled the “q arm.”
• Each chromosome has two chromatids as a result of duplication of the DNA which took place during interphase. The two chromatids are linked together at a centromere.
DNA structure
DNA is a double-stranded molecule twisted into a helix (think of a spiral staircase).
Each spiraling strand, composed of a sugar-phosphate backbone
and attached bases, is connected to a complementary strand by non-covalent
hydrogen bonding between paired bases.
The genetic code is specified by the four nucleotide "letters": adenine (A), thymine (T), cytosine (C) and guanine (G).
A Single Nucleotide Polymorphism (SNP) is a change of a single nucleotide within a person's DNA sequence, such as a T replacing one of the other three nucleotide letters (A, C, or G).
SNPs occur in human DNA at a frequency of about one in every 1,000 bases. These variations can be used to track inheritance in families.
SNP Array Design
[Figure: SNP probe design. A genomic sequence (5´ to 3´) carries a T/G SNP; each probe is 25 bases long. For each allele (‘A’ and ‘B’) there is a perfect-match probe and a mismatch probe, forming a probe quartet.]
Hundreds of Millions of Pixel Intensities…
Genotype Calling
AA
AB
BB
Genome Wide Association Study (GWAS)
Typically 400,000 to 900,000 SNPs are investigated in a single study
Number of subjects in a study typically ranges from a few hundred to 20,000
Each SNP takes three possible (generic) values “AA”, “AB”, “BB”, often coded as 0, 1, 2
Each SNP in each individual has a unique value, which is one of 0, 1, or 2
A small number of phenotypes: disease status (yes/no), or quantitative trait
This lecture: time to a cause-specific failure
n subjects, n observed trait values Y1, …, Yn, n observed SNP values for the ith SNP Xi1,
…, Xin
Inference (Test) for stochastic dependence of the ith SNP with the trait based on the
dataset (Xij, Yj), j=1,…,n; do this for each SNP; thus many tests of the null hypothesis
of stochastic independence.
Massive Multiple Tests
“Genome-wide significance”
Bonferroni-type adjustment:
Declare statistical significance if P ≤ 10^-7 (0.05/500K)
FDR and q value
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS-B, 57, 289–300.
Storey, J. D., Taylor, J. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. JRSS-B, 66, 187–205.
Profile information criteria
Cheng, C., Pounds, S., Boyett, J. M. et al (2004). Statistical significance threshold criteria for
analysis of microarray gene expression data. Statistical Applications in Genetics and Molecular
Biology 3, Article 36.
URL //www.bepress.com/sagmb/vol3/iss1/art36
Cheng, C (2006) An adaptive significance threshold criterion for massive multiple hypotheses
testing. IMS Lecture Notes - Monograph Series 2nd Lehmann Symposium – Optimality 49, 51–76
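The two adjustments named on this slide can be sketched in a few lines. This is a minimal illustration, assuming the standard Bonferroni rule and the Benjamini-Hochberg (1995) step-up procedure; the function names and the example P values are hypothetical, and the profile information criterion of Cheng et al. is not implemented here.

```python
import numpy as np

def bonferroni_threshold(alpha, m):
    """Per-test significance threshold for m tests at family-wise level alpha."""
    return alpha / m

def benjamini_hochberg(pvals, q):
    """Boolean rejection vector for the BH step-up procedure at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of P values, smallest first
    thresholds = q * np.arange(1, m + 1) / m    # q * k / m for rank k
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.flatnonzero(passed))      # largest rank meeting its threshold
        reject[order[:k + 1]] = True            # reject everything up to that rank
    return reject

# 0.05 / 500K tests gives the "genome-wide" threshold quoted above
genome_wide = bonferroni_threshold(0.05, 500_000)

# Illustrative (made-up) P values
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_reject = benjamini_hochberg(example_p, q=0.05)
```

With 500K tests the Bonferroni threshold works out to 10^-7, which is why a few hundred subjects rarely yield "genome-wide significant" SNPs and FDR-type criteria are attractive for screening.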
Cause-specific failure and competing risk
State diagram: from “Alive”, a patient can move to one of three failure types:
Relapse: failure type 1 (of interest)
2nd cancer: failure type 2 (competing risk/event)
Die in remission: failure type 3 (competing risk)
Klein, J. P. (2010) Competing risks. WIREs Comp Stat,
www.wiley.com/wires/compstats, DOI: 10.1002/wics.83
Cumulative incidence function (CIN)
(T, δ); Fj(t)=Pr(T ≤ t and δ=j)
Gray’s test: Compare CIN across K groups
Analog of weighted log-rank test
Gray, R. J. (1988) A class of K-sample tests for comparing the
cumulative incidence of a competing risk. Ann. Statist. 16,
1141-1154.
Fine-Gray’s CIN hazard rate regression model
Analog of Cox’s hazard rate regression model
Fine, J. P., Gray, R. J. (1999) A proportional hazards model for the
subdistribution of a competing risk. JASA, 94, 496-509.
Censor at the time of competing event
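To make Fj(t) = Pr(T ≤ t and δ = j) concrete, here is a minimal sketch of the usual nonparametric estimate of the cumulative incidence function: at each event time, the all-cause Kaplan-Meier survival just before that time is multiplied by the cause-j hazard increment. The function name and the status coding (0 = censored, 1, 2, … = failure type) are assumptions made for this illustration.

```python
import numpy as np

def cumulative_incidence(time, status, cause):
    """Nonparametric estimate of F_cause(t) at each distinct event time.

    status: 0 = censored, positive integers = failure type (assumed coding).
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    event_times = np.unique(time[status != 0])
    surv = 1.0          # all-cause Kaplan-Meier survival just before t
    cif = 0.0
    out = []
    for t in event_times:
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (status == cause))
        d_any = np.sum((time == t) & (status != 0))
        cif += surv * d_cause / at_risk     # increment for the cause of interest
        surv *= 1.0 - d_any / at_risk       # update all-cause survival
        out.append(cif)
    return event_times, np.array(out)

# Toy data: 4 subjects, failure type 1 at t=1 and t=3, type 2 at t=2, censored at t=4
t_grid, f1 = cumulative_incidence([1, 2, 3, 4], [1, 2, 1, 0], cause=1)
```

In this toy data set the estimate at t = 3 is 0.5. Treating the competing event at t = 2 as a censored observation and computing 1 minus Kaplan-Meier instead gives 1 − 0.75 × 0.5 = 0.625, which illustrates why censoring at the competing event is biased for estimation.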
II. Exploratory Failure Time Analysis
• Large-scale Genomic Association Analysis
o Feature (variable) screening and feature extraction
• A Motivating Example from a GWAS
• Correlation Profile Test (CPT)
o Hypotheses
o Correlation profile function
o CPT statistic
o Hybrid permutation test of significance
• A Simulation Study: Strength and Weakness
• Example: Analysis of SNPs on Chromosome 9
• Summary and Remarks
• Feature Extraction (sparse regression)
• Example: “Prognostic” Gene (RNA) expression
• Summary and remarks
Large-scale Genomic Association Analysis
• Feature (variable) screening
– Find individual genomic features (factor/predictor variables)
associated with one or more phenotypes (response variables)
• GWAS
– Association: stochastic dependence
– Parametric/semi-parametric approaches: linear models, GLMs, hazard
rate (Cox) regression
• Feature extraction
– Find (linear) combinations (or sets) of genomic features (variables)
associated with one or more phenotypes
– Determine sets of variables using biological knowledge (gene signaling
pathways, functional/ontology groups, etc.): GSEA
– Variable/Model selection methods: ridge regression, LASSO, SCAD,
SEAMLESS, sparse regression
A Motivating Example
• GWAS to screen SNP markers for risk of relapse in childhood leukemia patients

Genotype (X)     AA (0)   AB (1)   BB (2)
Relapse            70        0        0
Comp. Event        24        0        0
Censored          585       11        9
A Motivating Example
Test                      P        Coeff. (s.e.)    Reference
Cox regression
  Test of coeff.          1        -16.4 (2901)
  LR test                 0.0458
Fine-Gray regression                                JASA 1999 94(446):496-509
  Test of coeff.          0        -11.4 (0.334)
  LR test                 0.0488
Gray's test               0.3542                    Ann. Statist. 1988 16(3):1141-1154
Jung's test               0.6607                    Statist. Medicine 2005 24:3077-3088
Note: R gives a warning.
Need: a more omnibus and algorithmically robust test procedure
Correlation Profile Test (CPT)
• Model, Null and alternative hypotheses (classical survival setting)
Correlation Profile Test (CPT)
• Sample correlation profile function
observed event point process of individual i
Can do rank transformation for continuous X
Correlation Profile Test (CPT)
• CPT statistic, hybrid permutation test
Back to the SNP Example
Genotype (X)     AA (0)   AB (1)   BB (2)
Relapse            70        0        0
Comp. Event        24        0        0
Censored          585       11        9

Test                      P        Coeff. (s.e.)    Reference
Cox regression
  Test of coeff.          1        -16.4 (2901)
  LR test                 0.0458
Fine-Gray regression                                JASA 1999 94(446):496-509
  Test of coeff.          0        -11.4 (0.334)
  LR test                 0.0488
Gray's test               0.3542                    Ann. Statist. 1988 16(3):1141-1154
Jung's test               0.6607                    Statist. Medicine 2005 24:3077-3088
CPT                       0.2582                    (test stat is negative)
Note: R gives a warning.
A Simulation Study
• A model mimicking the SNP example
Generate X: Pr(X=0)=0.98, Pr(X=1)=0.015, Pr(X=2)=0.005
Generate Censor Time TC ~ Exp(0.2)
Generate failure indicator IF|X ~ Bernoulli(πF);
πF = 0.2exp{-θ(X-2)}
If IF = 1, generate Failure Time TF|X ~ LogNormal(βX,1)
else set TF = ∞
Generate competing risk indicator IR ~ Bernoulli(0.1)
If IR = 1, generate Competing Failure Time TR ~ Unif(0,7)
else set TR = ∞
Observed Failure Time T = min{TC, TF, TR}
Repeat the above n times to simulate n individuals
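The data-generating steps above translate almost line for line into code. The sketch below is an illustration under stated assumptions: Exp(0.2) is read as an exponential with rate 0.2 (mean 5), πF is capped at 1 as a safeguard, and the function name and the status coding (0 = censored, 1 = failure of interest, 2 = competing failure) are hypothetical.

```python
import numpy as np

def simulate_cohort(n, theta, beta, rng):
    """Simulate n individuals from the model mimicking the SNP example."""
    # Genotype X with Pr(X=0)=0.98, Pr(X=1)=0.015, Pr(X=2)=0.005
    x = rng.choice([0, 1, 2], size=n, p=[0.98, 0.015, 0.005])
    # Censoring time; ASSUMPTION: Exp(0.2) means rate 0.2, i.e. mean 1/0.2
    t_c = rng.exponential(scale=1 / 0.2, size=n)
    # Failure indicator with probability 0.2*exp{-theta*(X-2)} (capped at 1)
    pi_f = np.minimum(0.2 * np.exp(-theta * (x - 2)), 1.0)
    i_f = rng.random(n) < pi_f
    # Failure time ~ LogNormal(beta*X, 1) if the indicator is 1, else infinity
    t_f = np.where(i_f, rng.lognormal(mean=beta * x, sigma=1.0, size=n), np.inf)
    # Competing-risk indicator ~ Bernoulli(0.1), competing time ~ Unif(0, 7)
    i_r = rng.random(n) < 0.1
    t_r = np.where(i_r, rng.uniform(0, 7, size=n), np.inf)
    # Observed time is the minimum; the row index of the minimum gives the status
    times = np.vstack([t_c, t_f, t_r])
    t = times.min(axis=0)
    status = times.argmin(axis=0)   # 0 = censored, 1 = failure, 2 = competing
    return x, t, status

x_sim, t_sim, status_sim = simulate_cohort(700, theta=0.8, beta=1.2,
                                           rng=np.random.default_rng(1))
```

Because Pr(X=0) = 0.98, most simulated cohorts of this size contain only a handful of AB and BB individuals, which is the sparse configuration that makes the regression-based tests unstable in the motivating example.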
A Simulation Study
• A model mimicking the SNP example
0.05
CPT
FG
Jung
Gray
Null
0.0486
θ=0, β=0 0.00215
0.1953
0.0040
0
0
θ=0
β=0.5
0.09
0.0028
0.17
0.0038
θ=0.5
β=1.2
0.453
0.0050
θ=0.8
β=0
θ=0.8
β=1.2
0.01
CPT
FG
Jung
Gray
0.0711 Pwr est. 0.0199
0.0007
s.e. 0.0014
0.1658
0.0037
0
0
0.0281
0.0003
0
0
0.096
0.0029
0.007
0.0008
0.069
0.0025
0
0
0.016
0.0012
0.502
0.0050
0
0
0.277
0.0045
0.04
0.0020
0.189
0.0039
0
0
0.05
0.0022
0.67
0.0047
0.802
0.0040
0.004
0.0006
0.807
0.0039
0.378
0.0048
0.495
0.0050
0
0
0.565
0.0050
0.981
0.0014
0.99
0.0010
0.034
0.0018
0.967
0.0018
0.875
0.0033
0.867
0.0034
0.003
0.0005
0.869
0.0034
A Simulation Study
Exact proportional hazard, continuous predictor
Entries: estimated rejection rate (s.e.)

Level 0.01
          CPT               Cox               Jung
Null      0.0088 (0.0009)   0.0112 (0.0011)   0.0096 (0.0010)
β=0.5     0.093 (0.0029)    0.173 (0.0038)    0.117 (0.0032)
β=0.8     0.359 (0.0048)    0.617 (0.0049)    0.518 (0.0050)

Level 0.005
          CPT               Cox               Jung
Null      0.0047 (0.0007)   0.0067 (0.0008)   0.0053 (0.0007)
β=0.5     0.066 (0.0025)    0.123 (0.0033)    0.082 (0.0027)
β=0.8     0.268 (0.0044)    0.524 (0.0050)    0.436 (0.0050)
A Simulation Study
Exact proportional hazard, continuous predictor
Entries: estimated power at the stated sample size

β=0.5
          Level 0.01                      Level 0.005
CPT       n=300: 0.149; n=400: 0.181      n=300: 0.094; n=400: 0.127
Cox       n=200: 0.173                    n=200: 0.123
Jung      n=200: 0.117                    n=200: 0.082

β=0.8
          Level 0.01                      Level 0.005
CPT       n=300: 0.537; n=400: 0.730      n=300: 0.449; n=400: 0.655
Cox       n=200: 0.617                    n=200: 0.524
Jung      n=200: 0.518                    n=200: 0.436
A Simulation Study
Continuous predictor, deviation from proportional hazard
Entries: estimated rejection rate (s.e.)

Level 0.01
               CPT               FG                Jung
β1=0, β=0      0.0082 (0.0009)   0.0159 (0.0012)   0.0124 (0.0011)
β1=1, β=0      0.193 (0.0039)    0.019 (0.0014)    0.027 (0.0016)
β1=2, β=0      0.361 (0.0048)    0.031 (0.0017)    0.045 (0.0021)
β1=3, β=0      0.399 (0.0049)    0.054 (0.0022)    0.076 (0.0026)
β1=0, β=0.6    0.325 (0.0047)    0.236 (0.0042)    0.201 (0.0040)
β1=0, β=1.2    0.631 (0.0048)    0.698 (0.0046)    0.596 (0.0049)

Level 0.005
               CPT               FG                Jung
β1=0, β=0      0.0039 (0.0006)   0.0096 (0.0010)   0.0052 (0.0007)
β1=1, β=0      0.13 (0.0033)     0.011 (0.0010)    0.016 (0.0012)
β1=2, β=0      0.254 (0.0044)    0.022 (0.0015)    0.025 (0.0016)
β1=3, β=0      0.302 (0.0046)    0.033 (0.0018)    0.046 (0.0021)
β1=0, β=0.6    0.231 (0.0042)    0.165 (0.0037)    0.125 (0.0033)
β1=0, β=1.2    0.502 (0.0050)    0.587 (0.0049)    0.488 (0.0050)
A Simulation Study
Ordinal predictor (AA, AB, BB), deviation from proportional hazard
Opposite scenario of the SNP example
Entries: estimated rejection rate (s.e.)

Level 0.01
                 CPT               FG                Jung              Gray
θ=0, β=0.6       0.172 (0.0038)    0.057 (0.0023)    0.025 (0.0016)    0.061 (0.0024)
θ=0, β=1.2       0.551 (0.0050)    0.36 (0.0048)     0.211 (0.0041)    0.273 (0.0044)
θ=0.25, β=0      0.059 (0.0024)    0.099 (0.0030)    0.057 (0.0023)    0.075 (0.0026)
θ=0.25, β=1.2    0.862 (0.0034)    0.708 (0.0045)    0.569 (0.0050)    0.621 (0.0048)
θ=0.5, β=0       0.314 (0.0046)    0.525 (0.0050)    0.375 (0.0048)    0.441 (0.0050)
θ=0.5, β=0.6     0.895 (0.0031)    0.831 (0.0037)    0.713 (0.0045)    0.782 (0.0041)

Level 0.005
                 CPT               FG                Jung              Gray
θ=0, β=0.6       0.121 (0.0033)    0.043 (0.0020)    0.016 (0.0012)    0.041 (0.0020)
θ=0, β=1.2       0.471 (0.0050)    0.286 (0.0045)    0.142 (0.0035)    0.214 (0.0041)
θ=0.25, β=0      0.037 (0.0020)    0.076 (0.0027)    0.039 (0.0019)    0.061 (0.0024)
θ=0.25, β=1.2    0.801 (0.0040)    0.632 (0.0048)    0.464 (0.0050)    0.538 (0.0050)
θ=0.5, β=0       0.236 (0.0042)    0.452 (0.0050)    0.288 (0.0045)    0.355 (0.0048)
θ=0.5, β=0.6     0.847 (0.0036)    0.76 (0.0043)     0.618 (0.0049)    0.712 (0.0045)

Null (θ=0, β=0), three values only, test labels not recoverable:
Level 0.01:   0.0088 (0.0009)   0.0109 (0.0010)   0.005 (0.0007)
Level 0.005:  0.0044 (0.0006)   0.0063 (0.0008)   0.0019 (0.0004)
Example: Germline SNPs on Chr 9 and risk of
relapse in childhood Acute Lymphoblastic
Leukemia (ALL)
21,909 SNPs on Chr 9 obtained by Affy 100K and 500K SNP
arrays were tested for association with relapse of childhood ALL
Example: Germline SNPs on Chr 9 and risk of relapse
in childhood Acute Lymphoblastic Leukemia (ALL)
• n=707 subjects from the two most recent clinical trials at SJCRH
• 21,909 SNPs
• CPT performed on each SNP, with 200 permutations in the hybrid permutation test
• Significance determined by the profile information criterion Ip (Cheng et al. 2004); 200 SNPs were considered statistically significant, estimated FDR=48.7%
[Figure: histogram (pdf) and empirical cdf of the CPT P values; pi0 = 0.9535]
[Figure: profile information criterion Ip versus alpha (0.00 to 0.10); pi0 = 0.9535, alph.opt = 0.004815, FDR = 0.4869, min Ip = 160.745]
SNP              Pval.CPT       Annotation
SNP_A-4216803    6.44E-06       C9orf82.downstream.461318.AFFY
SNP_A-2142223    7.87E-05       TMC1.upstream.48246.AFFY//ZFAND5.upstream.108579.AFFY
SNP_A-1878719    8.80E-05       PTPRD.In_gene.5000.5kRuleLD
SNP_A-4201296    0.000126717    DBC1.In_gene.5000.5kRuleLD
SNP_A-4254975    0.000131144    JMJD2C.downstream.206626.AFFY
SNP_A-2289668    0.000138093    GLIS3.In_gene.5000.5kRuleLD
SNP_A-1847956    0.000140329    RFX3.downstream.268660.AFFY
SNP_A-1995935    0.000169608    C9orf150.downstream.49950.AFFY
SNP_A-1996276    0.000182668    ELAVL2.In_gene.5000.5kRuleLD
SNP_A-2202030    0.000216207    C9orf93.downstream.130169.AFFY
SNP_A-2100956    0.000254439    C9orf82.downstream.513396.AFFY
SNP_A-2098514    0.000337669    BNC2.In_gene.5000.5kRuleLD
SNP_A-2228460    0.000398633    GLIS3.upstream.6282.AFFY
SNP_A-4252517    0.00040459     TUSC1.downstream.286633.AFFY
SNP_A-1917590    0.000430395    PTPRD.In_gene.5000.5kRuleLD
SNP_A-2052752    0.000432044    DMRT1.In_gene.5000.5kRuleLD
SNP_A-2061098    0.000448184    C9orf94.In_gene.5000.5kRuleLD
SNP_A-1786517    0.000478508    GSN.In_gene.5000.5kRuleLD
SNP_A-2304920    0.000575635    UBE2R2.In_gene.5000.5kRuleLD
SNP_A-1902372    0.00057778     GSN.In_gene.5000.5kRuleLD
SNP_A-2201300    0.000588075    ABL1.In_gene.5000.5kRuleLD
SNP_A-2238268    0.000592182    PCSK5.upstream.10183.AFFY
SNP_A-1830183    0.000613767    TUSC1.downstream.181857.AFFY
[Figure: correlation profile ρ^(tj), j = 1, …, J = 9; Corr versus time; test stat = -3.478]
[Figure: probability versus years by genotype: AA (5.1%), AB (28.7%), BB (66.2%), and overall; numbers at risk, overall: 707 676 646 546 454 366 295]
Gray’s test: P = 0.0451
Fine-Gray regression: P = 0.0380; coeff = -0.3905
ABL1 Gene Germline SNP
AA
36 (0.051)
A
B
AB
201 (0.287)
BB
464 (0.662)
Tot
701 (1.00)
AA
12
(0.152)
7
(0.065)
11
(0.047)
6
(0.026)
AB
27
BB
75
33
67
74
161
67
161
273 (0.195)
1129 (0.805)
T13B intermediate/high risk
T13B Low risk
T15 standard/high risk
T15 Low risk
Extension to Recurrent Events
• Model, Null and alternative hypotheses
Multiple event times
N = # events occurred ≤ t
Summary and Remarks
• Correlation Profile Test:
– Computationally more robust
– More omnibus: covers certain deviations from the
semi-parametric hazard regression model
– Highly competitive with other non-parametric
procedures (Gray’s test, Jung’s test)
– Relative deficiency vs. Cox model under PH ??
– Extension to recurrent-event phenotypes
– Informative censoring in the presence of competing
risk
Feature Extraction (Sparse regression)
• Identify (linear) combinations of covariate
variables that are associated with the failure
phenotype
Feature Extraction (Sparse regression)
• Sparse regression by the General Path seeking (GPS)
algorithm (Friedman 2008)
• Exploratory failure time analysis by weighted least
square -- the association criteria
• The modified GPS algorithm to find a solution
• A small simulation study
• Example: Gene (RNA) expression “prognostic” for
relapse of childhood ALL
Sparse Regression by General Path Seeking (GPS, Friedman 2008)
http://www-stat.stanford.edu/~jhf//ftp/GPSpub.pdf
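General Setup
\[ Y \mid X \sim F(\cdot\,; X, \beta); \quad X, \beta \in \mathbb{R}^m, \ m \text{ is large} \]
\[ \mathrm{Loss}(\beta) = L(Y, F(\cdot\,; X, \beta)) \]
\[ R(\beta) = E_{X,Y}(\mathrm{Loss}(\beta)) \]
Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$
\[ \hat{R}(\beta) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, F(\cdot\,; X_i, \beta)) \]
\[ \hat{R}(\beta; \lambda, P) = \hat{R}(\beta) + P(\beta, \lambda) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, F(\cdot\,; X_i, \beta)) + P(\beta, \lambda) \]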
Lasso (Tibshirani 1996), grouped lasso (Yuan and Lin 2006),
SCAD (Fan and Li 2001)
Elastic net (Zou and Hastie 2005)
SEAL (Xihong Lin, 2009 JSM)
Feature Extraction (Sparse regression)
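• The general GPS algorithm
\[ \hat{R}(\beta; \lambda, P) = \hat{R}(\beta) + P(\beta, \lambda) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, F(\cdot\,; X_i, \beta)) + P(\beta, \lambda), \quad \beta \in \mathbb{R}^m \]
Assume: convex and differentiable loss, and
\[ \frac{\partial P(\beta, \lambda)}{\partial |\beta_j|} > 0, \quad j = 1, \ldots, m \]
\[ g_j = -\left[ \frac{\partial \hat{R}(\beta)}{\partial \beta_j} \right], \qquad p_j = \frac{\partial P(\beta, \lambda)}{\partial |\beta_j|}, \qquad \lambda_j = \frac{g_j}{p_j}, \quad j = 1, \ldots, m \]
Initialize: $\beta_j = 0$, $j = 1, \ldots, m$
REPEAT
  Compute $\lambda_j$ for all $j$; set $S = \{ j : \lambda_j \beta_j < 0 \}$
  IF $S = \emptyset$: $j^* = \arg\max_j |\lambda_j|$ ELSE $j^* = \arg\max_{j \in S} |\lambda_j|$
  $\beta_{j^*} \leftarrow \beta_{j^*} + \Delta\nu \cdot \mathrm{sign}(\lambda_{j^*})$
UNTIL $\lambda_j = 0$, $j = 1, \ldots, m$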
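A minimal sketch of the path-seeking loop may help fix ideas. It is an illustration, not the implementation used in the talk: it assumes squared-error loss R̂(β) = (1/2n)‖y − Xβ‖² and the lasso penalty P(β) = Σ|βj| (so pj = 1), and it replaces the exact stopping rule "all λj = 0" with a tolerance and an iteration cap, since a fixed step size never hits zero exactly. The function and parameter names are hypothetical.

```python
import numpy as np

def gps_path(X, y, step=0.01, max_iter=2000, tol=1e-8):
    """Generalized path seeking with squared-error loss and a lasso penalty.

    ASSUMPTIONS (for illustration only): R(beta) = ||y - X beta||^2 / (2n),
    P(beta) = sum_j |beta_j|, so p_j = dP/d|beta_j| = 1 for every j.
    """
    n, m = X.shape
    beta = np.zeros(m)                       # initialize: beta_j = 0
    for _ in range(max_iter):
        g = X.T @ (y - X @ beta) / n         # g_j = -dR/dbeta_j
        p = np.ones(m)                       # p_j for the lasso penalty
        lam = g / p                          # lambda_j = g_j / p_j
        if np.max(np.abs(lam)) <= tol:       # practical stand-in for lambda = 0
            break
        S = np.flatnonzero(lam * beta < 0)   # coefficients the gradient wants to shrink
        if S.size == 0:
            j = int(np.argmax(np.abs(lam)))
        else:
            j = int(S[np.argmax(np.abs(lam[S]))])
        beta[j] += step * np.sign(lam[j])    # move one coordinate by the step size
    return beta

# Toy data: only the first of five predictors matters
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)
beta_demo = gps_path(X_demo, y_demo)
```

Stopping the loop early yields a sparser solution; running it long enough, as here, drives the coefficients to within a few step sizes of the unpenalized least squares fit.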
Feature Extraction (Sparse regression)
• Exploratory failure time analysis: setup
Feature Extraction (Sparse regression)
• Association criteria: Penalized weighted least square
Feature Extraction (Sparse regression)
• The power penalty function |β|^γ, 0 < γ ≤ 1
[Figure: |β|^γ for β between -0.04 and 0.04, with curves for γ = 0.0001, γ = 0.5 and γ = 1]
Feature Extraction (Sparse regression)
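• The modified GPS algorithm
\[ \hat{R}(\beta; \lambda, P) = \hat{R}(\beta) + P(\beta, \lambda) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, F(\cdot\,; X_i, \beta)) + P(\beta, \lambda), \quad \beta \in \mathbb{R}^m \]
Assume: convex and differentiable loss, and
\[ \frac{\partial P(\beta, \lambda)}{\partial |\beta_j|} > 0, \quad j = 1, \ldots, m \]
\[ g_j = -\left[ \frac{\partial \hat{R}(\beta)}{\partial \beta_j} \right], \qquad p_j = \frac{\partial P(\beta, \lambda)}{\partial |\beta_j|}, \qquad \lambda_j = \frac{g_j}{p_j}, \quad j = 1, \ldots, m \]
Initialize: $\beta_j = 0$, $j = 1, \ldots, m$
REPEAT
  Compute $\lambda_j$ for all $j$; set $S = \{1, 2, \ldots, m\}$
  Set $j^* = \arg\max_{j \in S} |\lambda_j|$
  $\beta_{j^*} \leftarrow \beta_{j^*} + \Delta\nu \cdot \mathrm{sign}(\lambda_{j^*})$
  IF $j^*$ is the same as in the previous iteration AND $g_{j^*}$ changed sign
  THEN remove $j^*$ from $S$
UNTIL $\big|\, \|g\| - \|g\|_{\mathrm{pre}} \big| \le$ specified value OR max # of iterations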
Feature Extraction (Sparse regression)
• Gradient descent with a fixed step size; searches the solution along a
sequence of increasing values of the penalty parameter λ, thus no need to
use CV or GCV type criteria to determine λ.
• Initial value: all β’s are set to zero
• Each iteration modifies just one of the m dimensions; the criterion for choosing
which dimension to update involves the gradients of the association criterion
and of the penalty function
• Need to modify for this particular application
– Relatively large step size Δν: 0.01 (sometimes)
– Stopping rule: Stop if the size of the gradient vector does not change by more than Δν
from the previous iteration or pre-specified max number of iterations is reached
Feature Extraction (Sparse regression)
• A small simulation study
Simulation model: Proportional hazard
Feature Extraction (Sparse regression)
• A small simulation study
Performance assessment
Characteristic of Solution        Freq.    %
X1 & X2 no false + (perfect)        27     5.4
X1 & X2 w/ false +                  40     8
X2 only no false +                 223    44.6
X2 only w/ false +                 199    39.8
X1 only no false +                   2     0.4
X1 only w/ false +                   1     0.2
None (all false +, worst)            8     1.6
TOTAL                              500   100
Feature Extraction (Sparse regression)
• A small simulation study
Performance assessment
#non-zero   Freq. (%)     X1-only   X2-only   Both   none
1           230 (46)         2        223       0      5
2           132 (26.4)       0        103      27      2
3            70 (14)         1         54      14      1
4            31 (6.2)        0         21      10      0
5            19 (3.8)        0         10       9      0
6            10 (2)          0          5       5      0
7             7 (1.4)        0          5       2      0
8             1 (0.2)        0          1       0      0
Feature Extraction (Sparse regression)
• A small simulation study
Performance assessment
R=# of non-zeros in solution
V=# of incorrect non-zeros in solution
FDR = E(V/R|R>0)
Estimated FDR = 0.2984
s.e. = 0.0142
99% CI (0.2618, 0.3350)
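The interval reported above is consistent with a normal-approximation confidence interval, estimate ± z × s.e. The short check below assumes that construction; the function name is hypothetical.

```python
from statistics import NormalDist

def normal_ci(estimate, se, level):
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 2.576 for level 0.99
    return estimate - z * se, estimate + z * se

# Estimated FDR = 0.2984 with s.e. = 0.0142, 99% level
fdr_lo, fdr_hi = normal_ci(0.2984, 0.0142, 0.99)
```

Rounded to four decimals this reproduces (0.2618, 0.3350).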
Feature Extraction (Sparse regression)
• An Example: Gene (RNA) expression in ALL and risk
of relapse
• Affymetrix U133A GeneChip
• n=287 Arrays (subjects)
• m=22,278 Probesets
• Two clinical variables: Age group at Dx, lineage
• Intercept term
• Total number of variables = m+3
Feature Extraction (Sparse regression)
Example (Cont.)
• Run parameters:
– step size = 6×10^-4, γ = 0.01
• Initial values:
– Coeff. of Age = 6×10^-4
– Coeff. of Lineage = 6×10^-4
– All others set to 10^-8
• Top meaningful findings
Variable       Coeff     Gene
Age.DX         0.183
Lineage        0.1302
212869_x_at    0.003     TPT1; (similar to) tumor protein, translationally controlled
201288_at      0.0006    RhoGDI2; plays a role in apoptosis; may be a marker for tumor progression in gastric and breast cancer; literature not consistent
Summary and Remarks
• The sparse regression approach to exploratory
survival analysis
– Step size of descent is crucial
– Newton-type descent more adaptive, maybe
better, how to incorporate?
– Stopping criteria: change in gradient vector? size
of the gradient vector?
– Other association criteria: -log likelihood by
Logistic, Probit, Poisson etc. links
– “Oracle” property of the solution? “Accuracy” of
the solution? in asymptopia
III. Copy Number Variation Inference
• Cell division
• DNA Copy Number Variation (CNV)
• Use SNP array signals to infer CNV
• Reference signal alignment: example
• Reference signal alignment procedure
• Recent development
• Examples
Cell division:
How cells grow and divide
• Mitosis: The process in somatic cell division by which the
nucleus divides
• Meiosis: The process of cell division in sexually reproducing
organisms that reduces the number of chromosomes in
reproductive cells from diploid to haploid, leading to the
production of gametes in animals and spores in plants
Mitosis: Prophase
• Prophase is the first stage of cell division; the cell prepares itself for
division. The nucleus swells, and chromosomes become visible.
• Each chromosome has two chromatids as a result of duplication of the
DNA which took place during interphase. The two chromatids are linked
together at a centromere.
• The centrosome (2 centrioles) duplicates into 2 diplosomes, and each
diplosome, or aster, moves toward an opposite pole of the nucleus.
Mitosis: Metaphase
• Microtubules assemble, and form a network (the spindle fibres).
• The chromosomes move towards the equator of the cell, where they are
visible.
• This is the phase in which morphological studies of chromosomes are
carried out, often for clinical purposes.
Mitosis: Anaphase
• The two sister chromatids separate.
• Each one migrates to an opposite end of the cell, so each daughter cell has
an identical complement of chromosomes.
• The nuclear membrane has disappeared at this stage. The cell membrane
expands as the cell itself elongates.
• The diameter of the cell decreases at the equator.
Mitosis: Telophase
• A new membrane forms around the new nuclei and two cells are quickly
formed.
• The chromatid, now called a chromosome, uncoils, and the nucleolus
becomes visible again.
• Each cell contains a pair of chromosomes (2n chromosomes)
Meiosis
• The process of meiosis essentially involves two cycles of division: a gamete mother
cell (diploid cell) divides and then divides again to form 4 haploid cells. Each division can be
subdivided into four distinct phases, which form a continuous process.
• 1st Division
– Prophase - Homologous chromosomes in the nucleus begin to pair up with one another
and then split into chromatids (one half of a chromosome), where crossing over can occur.
Crossing over can increase genetic variation.
– Metaphase - Chromosomes line up at the equator of the cell in a random order,
increasing genetic variation via independent assortment.
– Anaphase - The homologous chromosomes move from the equator to opposing poles.
– Telophase - A new nucleus forms near each pole around its new chromosome
complement.
– At this stage two haploid cells have been created from the original diploid cell of the
parent.
• 2nd Division
– Prophase II - The nuclear membrane disappears and the second meiotic division is
initiated.
– Metaphase II - Pairs of chromatids line up at the equator.
– Anaphase II - The chromatids of each pair move away from the equator to the poles
via spindle fibres.
– Telophase II - Four new haploid gametes are created that will fuse with the gametes of
the opposite sex to create a zygote.
Meiosis
Meiosis vs. Mitosis
MITOSIS (in somatic cells)
– One single division of the mother cell (m) results in two daughter cells (d)
– The number of chromosomes per nucleus remains the same after division
– Chromosomal re-distribution
MEIOSIS (in reproductive cells)
– Two divisions of the mother cell result in four meiotic products (p)
– The meiotic products contain a haploid (n) number of chromosomes, in contrast to the 2n mother cell
– Chromosomal change (cross over) and re-distribution
Gain/duplication or loss of DNA can occur in either process, resulting in
deviations from the normal, 2-copy state of the chromosomes or segments
on chromosomes.
Oncogenesis
• DNA gain: Excess of genes promoting cell
division and proliferation
• DNA loss: Loss of gene functions regulating
cell cycles, such as signaling apoptosis.
• DNA loss: Loss of functions necessary for
proper lineage differentiation
Karyotyping and
Complex CNV patterns in tumor genomes
• An assay technology to assess gains/losses of
DNA; now routinely performed at diagnosis of
childhood leukemia; not so readily available
for solid tumors
67<3n>,XXY,-3,+8,-9,-16,-17,+20/66,idem,del(X)(p22.1),-8,del(10)(q22q26),-20,+mar
1-2(3)3(2)4-7(3)8(4)9(2)10-15(3)16-17(2)18-19(3)20(4)21-22(3)X(2)
Chromosome counts per group: 6 + 2 + 12 + 4 + 2 + 18 + 4 + 6 + 4 + 6 + 2; total = 66; + Y = 67
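The copy-number summary string above can be tallied mechanically. The sketch below assumes the notation reads "chromosome or chromosome range, followed by the number of copies in parentheses" (for example, 1-2(3) means chromosomes 1 and 2 with 3 copies each); the function name is hypothetical.

```python
import re

def chromosome_total(copy_string):
    """Sum chromosome counts from a string such as '1-2(3)3(2)...X(2)'."""
    total = 0
    # Each token: a chromosome (or a numeric range) followed by (copies)
    pattern = r"(\d+|X|Y)(?:-(\d+))?\((\d+)\)"
    for first, last, copies in re.findall(pattern, copy_string):
        n_chrom = int(last) - int(first) + 1 if last else 1
        total += n_chrom * int(copies)
    return total

summary = "1-2(3)3(2)4-7(3)8(4)9(2)10-15(3)16-17(2)18-19(3)20(4)21-22(3)X(2)"
autosomes_and_x = chromosome_total(summary)   # Y is not listed in the string
with_y = autosomes_and_x + 1
```

The tally gives 66 chromosomes from the listed groups and 67 once the Y chromosome is added, matching the 67<3n> karyotype.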
Contemporary technology:
Array Comparative Genomic Hybridization
(aCGH)
Use SNP array signals to infer CNV
• Goal: Infer loss/gain -- qualitative
Use SNP array signals to infer CNV
• Importance of normalization
• Reason for single-array reference alignment:
example from paper
Use SNP array signals to infer CNV
• Motivation/Reason for single-array reference alignment
Use SNP array signals to infer CNV
• Basic algorithm (Pounds et al. 2009)
1. Select a chromosome that is most likely in the 2-copy
(diploid) state – make an educated guess
2. Use the empirical distribution (EDF) of the signals of the
markers on this chromosome to transform all marker signals
into the unit interval (0,1) via the probability-integral
(quantile) transformation
3. Map the above transformed data into a known, convenient
target distribution (e.g., N(0,1)); this produces the
reference-aligned signals
4. Perform CNV segmentation using the above reference-aligned signals.
Note after Step 3 the empirical distr. of reference markers is
essentially the same as the target distribution (N(0,1))
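Steps 2 and 3 of the basic algorithm amount to a two-stage quantile mapping, sketched below. This is an illustration rather than the code of Pounds et al. (2009): the function name is hypothetical, and the mid-rank convention and the (n+1) denominator, used to keep the transformed values strictly inside (0,1), are assumptions.

```python
import numpy as np
from statistics import NormalDist

def reference_align(signals, is_reference):
    """Map marker signals to N(0,1) scores via the EDF of the reference markers."""
    signals = np.asarray(signals, dtype=float)
    ref = np.sort(signals[np.asarray(is_reference, dtype=bool)])
    n_ref = ref.size
    # Step 2: probability-integral transform with the reference EDF (mid-ranks)
    left = np.searchsorted(ref, signals, side="left")
    right = np.searchsorted(ref, signals, side="right")
    u = ((left + right) / 2 + 0.5) / (n_ref + 1)    # strictly inside (0, 1)
    # Step 3: map to the target distribution, here N(0,1)
    target = NormalDist()
    return np.array([target.inv_cdf(v) for v in u])

# Toy data: 1000 reference markers, plus 500 markers with 1.5x the signal (a "gain")
rng = np.random.default_rng(0)
ref_signals = rng.gamma(5.0, 1.0, size=1000)
gain_signals = 1.5 * rng.gamma(5.0, 1.0, size=500)
aligned = reference_align(np.concatenate([ref_signals, gain_signals]),
                          [True] * 1000 + [False] * 500)
```

After alignment the reference markers are approximately N(0,1) regardless of the raw signal distribution, so markers in a gained or lost region show up as a shift away from zero on a common scale.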
HOW TO SELECT THE REFERENCE CHROMOSOME?
Pounds et al. (2009)
• Cytonormalization
– Utilize karyotype data: select a chromosome not implied as
abnormal by karyotyping
• Algorithmic selection
– Select a chromosome that appears most likely to be in the
diploid state based on a set of statistics, such as the
percentage of heterozygous calls and the joint behavior of signal
mean and standard deviation (details in paper).
Use SNP array signals to infer CNV
New Development
• Affymetrix SNP6 array: Genotype (SNP) and CNV
probesets give two types of signals, which don’t always follow the
same distribution
• The auto selection of reference chromosome fails on
cases with complex CNV patterns
• Modified algorithm: more flexible
– Marker (instead of chrom.) based reference
– Initial CNV inference, adjustment, final CNV inference
Use SNP array signals to infer CNV
• Modified algorithm: Work in progress …
1. Select all SNP markers with heterozygous calls (‘AB’) as reference
markers; use the empirical distribution of these marker signals as the
initial reference distribution
2. Map all the SNP marker signals into (0,1) by probability-integral
transformation using the above distribution
3. Map the above transformed signals to a target distribution – take
N(0,1), to produce initially reference-aligned signals
4. Map the CNV marker signals to have the same distribution as the SNP
markers via a quantile transformation
5. Perform initial CNV segmentation – windowed t test + run-length
encoding
6. Assess each autosome and each “large” (>20 SNP markers) inferred
CNV segment to identify problems
7. Correct the problems by adjusting the initial reference-aligned signals
(step 3) chromosome by chromosome
8. Perform final CNV segmentation using corrected signals – windowed t
test plus run-length encoding.
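The run-length encoding part of steps 5, 6 and 8 can be sketched as follows. The windowed t test itself is not shown; the sketch assumes it has already produced one call per marker, coded here (hypothetically) as -1 for loss, 0 for no change and +1 for gain, and the function names are illustrative.

```python
from itertools import groupby

def run_length_encode(calls):
    """Collapse consecutive identical calls into (call, start_index, length) runs."""
    runs, start = [], 0
    for value, group in groupby(calls):
        length = sum(1 for _ in group)
        runs.append((value, start, length))
        start += length
    return runs

def large_segments(runs, min_markers=21):
    """Keep non-zero runs with more than 20 markers (the "large" segments of step 6)."""
    return [run for run in runs if run[0] != 0 and run[2] >= min_markers]

# Toy per-marker calls: a 25-marker loss and a short 4-marker gain
calls = [0] * 5 + [-1] * 25 + [0] * 3 + [1] * 4
runs = run_length_encode(calls)
segments_to_assess = large_segments(runs)
```

Because each marker's call is computed independently before the runs are collapsed, this representation is cheap to build and easy to re-derive after the reference-aligned signals are adjusted in step 7.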
[Figure: Male, hypodiploid (<46 chromosomes) ALL and germline samples: Steps 3, 4, 5, 6; panels: germline, tumor; legend: overall; 2cp, initially inferred loss; initial reference; t(9) initial reference]
Use SNP array signals to infer CNV
New Development
• Examples:
– Two ALL cases -- one hypodiploid (<46 chr’s)
with matched germline, one hyperdiploid
(triploid, 66 chr’s) with matched germline and
relapse samples
Male, Hypodiploid ALL and germline samples
Probeset signal: mean of probes, directly from the .cell file
[Figure: Male, hypodiploid ALL and matched germline; panels: germline, tumor; annotations: %Het=0.16%, %Het=26%, %Het=29%]
[Figure: Male, hypodiploid ALL and germline samples; legend: overall; 2cp, initially inferred loss; initial reference; t(9) initial reference]
Male, Triploid ALL, DNA index 1.46
67<3n>,XXY,-3,+8,-9,-16,-17,+20/66,idem,del(X)(p22.1),-8,del(10)(q22q26),-20,+mar
1-2(3)3(2)4-7(3)8(4)9(2)10-15(3)16-17(2)18-19(3)20(4)21-22(3)X(2)
Chromosome counts per group: 6 + 2 + 12 + 4 + 2 + 18 + 4 + 6 + 4 + 6 + 2; total = 66; + Y = 67
Matched germline sample, relapse sample
Probeset signal: generated by Affy MAS 5.0 (?) package
[Figure: relapsed tumor and germline; legend: overall; 2cp, initially inferred loss; initial reference; t(9) initial reference]
Use SNP array signals to infer CNV
New Development
• Extension to NextGen sequence data
– Use sequence coverage counts as raw signals
– Preprocessing: Adjust signals for reference genome
and sequence features that affect the depth of
coverage (“mapability”, GC content, etc.)
– The new alignment and segmentation algorithms
consist of only single-marker based computation,
thus straightforward to implement divide-and-conquer strategies and parallelization on
multiprocessor or CPU cluster systems
ACKNOWLEDGEMENTS
NIH/NIGMS Pharmacogenetics Research Network and Database (U01 GM61393,
U01GM61374, http://pharmgkb.org/) National Institutes of Health
Cancer Center Support Grant P30 CA-21765, NIH
The American Lebanese and Syrian Associated Charities (ALSAC).
Stan Pounds, Deqing Pei, Xueyuan Cao – Biostatistics
Mary Relling, William Evans, Jun Yang, Wenjian Yang – Pharmaceutical Science
Ching-Hon Pui, Dario Campana – Oncology & Pathology
Charles Mullighan, James Downing -- Pathology
Geoff Neale, Yiping Fan -- Bioinformatics
Javier Rojo – My first and most favorite Math Statistics teacher
THANK YOU !!