Selecting explanatory variables with the modified
version of Bayesian Information Criterion
Małgorzata Bogdan
Institute of Mathematics and Computer Science, Wrocław University of
Technology, Poland
in cooperation with
J.K.Ghosh, R.W.Doerge, R. Cheng – Purdue University
A. Baierl, F. Frommlet, A. Futschik – Vienna University
A. Chakrabarti - Indian Statistical Institute
P. Biecek, A. Ochman, M. Żak – Wrocław University of Technology
Vienna, 24/07/2008
Małgorzata Bogdan
Modified BIC
Searching large databases
Y - the quantitative variable of interest (fruit size, survival time, process yield)
Aim - identify factors influencing Y
Properties of the database - the number of potential factors, m, may be much larger than the number of cases, n
Assumption of sparsity - only a small proportion of potential explanatory variables influences Y
Specific application - Locating Quantitative Trait Loci
Data for QTL mapping in backcross population and recombinant inbred lines
Only two genotypes possible at a given locus
$X_{ij}$ - dummy variable encoding the genotype of the i-th individual at locus j, $X_{ij} \in \{-1/2, 1/2\}$
Multiple regression model:
$$Y_i = \beta_0 + \sum_{j=1}^{m} \beta_j X_{ij} + \varepsilon_i, \qquad (0.1)$$
where $i \in \{1, \ldots, n\}$ and $\varepsilon_i \sim N(0, \sigma^2)$
Problem: estimation of the number of influential genes
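A minimal sketch of how data from the regression model above could be simulated. The function name `simulate_backcross` and its arguments are illustrative assumptions, not part of the talk, and the loci are drawn independently, which ignores the linkage between neighboring markers present in real crosses.

```python
import random

def simulate_backcross(n, m, effects, sigma=1.0, beta0=0.0, seed=0):
    """Simulate Y_i = beta0 + sum_j beta_j X_ij + eps_i with X_ij in {-1/2, 1/2}.

    `effects` maps a locus index j to its effect beta_j; all other beta_j are 0.
    Assumption: genotypes at different loci are independent (no linkage).
    """
    rng = random.Random(seed)
    # Genotype matrix: each entry is -1/2 or 1/2 with equal probability.
    X = [[rng.choice((-0.5, 0.5)) for _ in range(m)] for _ in range(n)]
    # Trait values: intercept + sparse genetic effects + Gaussian noise.
    Y = [beta0 + sum(b * row[j] for j, b in effects.items()) + rng.gauss(0.0, sigma)
         for row in X]
    return X, Y

# Two influential loci out of 50 candidates, 200 individuals.
X, Y = simulate_backcross(n=200, m=50, effects={3: 1.0, 17: -0.8})
```

The sparse `effects` dictionary mirrors the sparsity assumption: only a few of the m candidate loci have a nonzero coefficient.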
Bayesian Information Criterion (1)
$M_i$ - i-th linear model with $k_i < n$ regressors
$\theta_i = (\beta_0, \beta_1, \ldots, \beta_{k_i}, \sigma)$ - vector of model parameters
Bayesian Information Criterion (Schwarz, 1978): maximize
$$BIC = \log L(Y \mid M_i, \hat\theta_i) - \frac{1}{2} k_i \log n$$
If m is fixed, $n \to \infty$ and $X'X/n \to Q$, where Q is a positive definite matrix, then BIC is consistent: the probability of choosing the proper model converges to 1.
When $n \ge 8$, BIC never chooses more regressors than AIC, and it is usually considered one of the most restrictive model selection criteria.
Surprise?
- Broman and Speed (JRSS, 2002) report that BIC overestimates the number of regressors when applied to QTL mapping.
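A small sketch of the criterion for a Gaussian linear model, written on the log-likelihood scale used above. The helper names are assumptions made for illustration; the last lines check where the per-regressor BIC penalty, (1/2) log n, first exceeds the AIC penalty of 1, which is where the n ≥ 8 condition comes from.

```python
import math

def gaussian_loglik(rss, n):
    """Maximized Gaussian log-likelihood of a linear model with residual sum of squares rss."""
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)

def bic(rss, n, k):
    """Schwarz criterion in 'maximize' form: log L - (1/2) k log n."""
    return gaussian_loglik(rss, n) - 0.5 * k * math.log(n)

def aic(rss, n, k):
    """Akaike criterion on the same scale: log L - k."""
    return gaussian_loglik(rss, n) - k

# BIC penalizes each regressor more than AIC once (1/2) log n > 1, i.e. n > e^2 (about 7.39).
smallest_n = next(n for n in range(2, 100) if 0.5 * math.log(n) > 1.0)
print(smallest_n)  # 8
```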
Explanation - Bayesian roots of BIC (1)
$f(\theta_i)$ - prior density of $\theta_i$, $\pi(M_i)$ - prior probability of $M_i$
$m_i(Y) = \int L(Y \mid M_i, \theta_i) f(\theta_i)\, d\theta_i$ - integrated likelihood of the data given the model $M_i$
posterior probability of $M_i$: $P(M_i \mid Y) \propto m_i(Y)\, \pi(M_i)$
BIC neglects $\pi(M_i)$ and uses the approximation
$$\log m_i(Y) \approx \log L(Y \mid M_i, \hat\theta_i) - \frac{1}{2}(k_i + 2) \log n + R_i,$$
where $R_i$ is bounded in n.
Explanation - Bayesian roots of BIC (2)
neglecting $\pi(M_i)$ ≡ assuming all the models have the same prior probability
≡ assigning a large prior probability to the event that the true model contains approximately $m/2$ regressors
Example: m = 200 gives 200 models with one regressor, $\binom{200}{2} = 19900$ models with two regressors, and $\binom{200}{100} \approx 9 \times 10^{58}$ models with 100 regressors
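The model counts above can be reproduced with the standard library. The variable names are illustrative; the last quantity, an addition not found on the slide, is the share of a uniform prior over all $2^{200}$ models that falls on models with 90 to 110 regressors.

```python
import math

m = 200
one_regressor = math.comb(m, 1)      # 200
two_regressors = math.comb(m, 2)     # 19900
half_regressors = math.comb(m, 100)  # roughly 9 x 10^58

# Share of the uniform prior mass on models with 90 to 110 regressors.
mass_near_half = sum(math.comb(m, k) for k in range(90, 111)) / 2 ** m

print(one_regressor, two_regressors, f"{half_regressors:.2e}", round(mass_near_half, 2))
```

Under equal model probabilities, most of the prior mass sits on models of size close to m/2, which is far from a sparse scenario.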
Modified version of BIC, mBIC (1)
M. Bogdan, J.K. Ghosh, R.W. Doerge, Genetics (2004)
Proposed solution - supplementing BIC with an informative prior distribution on the set of possible models, proposed in George and McCulloch (1993)
p - prior probability that a randomly chosen regressor influences Y
$$\pi(M_i) = p^{k_i} (1 - p)^{m - k_i}$$
$$\log \pi(M_i) = m \log(1 - p) - k_i \log\frac{1 - p}{p}$$
The modified version of BIC recommends choosing the model maximizing
$$\log L(Y \mid M_i, \hat\theta_i) - \frac{1}{2} k_i \log n - k_i \log\frac{1 - p}{p}$$
mBIC (2)
$c = mp$ - expected number of true regressors
$$mBIC = \log L(Y \mid M_i, \hat\theta_i) - \frac{1}{2} k_i \log n - k_i \log\left(\frac{m}{c} - 1\right)$$
The standard version of mBIC uses c = 4 to control the overall type I error at a level below 10%
A similar log m penalty also appears in the RIC of Foster and George (1994)
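A short sketch of the mBIC penalty in the form given above. The function names are assumptions; the default `c=4.0` follows the standard version mentioned on the slide, and the formula requires m > c.

```python
import math

def mbic_penalty(k, n, m, c=4.0):
    """Penalty subtracted from the log-likelihood: (1/2) k log n + k log(m/c - 1)."""
    return 0.5 * k * math.log(n) + k * math.log(m / c - 1.0)

def mbic(loglik, k, n, m, c=4.0):
    """mBIC in 'maximize' form for a model with k regressors and maximized log-likelihood loglik."""
    return loglik - mbic_penalty(k, n, m, c)

# Extra cost of one regressor with n = 200 cases and m = 300 candidates,
# compared with the plain BIC cost (1/2) log n.
print(mbic_penalty(1, 200, 300), 0.5 * math.log(200))
```

With p = c/m, the term log(m/c - 1) equals log((1 - p)/p), so this is the same criterion as the prior-based form.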
Relationship to multiple testing (1)
Orthogonal design:
$$X^T X = n I_{(m+1) \times (m+1)} \qquad (1)$$
BIC chooses those $X_j$'s for which
$$\frac{n \hat\beta_j^2}{\sigma^2} > \log n$$
Under $H_{0j}: \beta_j = 0$,
$$Z_j = \frac{\sqrt{n}\, \hat\beta_j}{\sigma} \sim N(0, 1)$$
Since for $c > 0$,
$$1 - \Phi(c) = \frac{\phi(c)}{c} (1 + o_c),$$
it holds that for large values of n
$$\alpha_n = 2 P\left(Z_j > \sqrt{\log n}\right) \approx \sqrt{\frac{2}{\pi n \log n}}.$$
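The tail approximation for $\alpha_n$ can be checked numerically. This is a sketch using `statistics.NormalDist` from the standard library; the function names are illustrative assumptions.

```python
import math
from statistics import NormalDist

def bic_type1_exact(n):
    """Exact per-test type I error of BIC under orthogonal design: 2 P(Z > sqrt(log n))."""
    return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(math.log(n))))

def bic_type1_approx(n):
    """Large-n approximation sqrt(2 / (pi n log n))."""
    return math.sqrt(2.0 / (math.pi * n * math.log(n)))

for n in (200, 10_000, 1_000_000):
    exact, approx = bic_type1_exact(n), bic_type1_approx(n)
    print(n, exact, approx, exact / approx)
```

The approximation is an upper bound, because $1 - \Phi(c) < \phi(c)/c$ for every c > 0, and the ratio moves toward 1 only slowly as n grows.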
Relationship to multiple testing (2)
When n and m go to infinity and the number of true signals remains fixed, the expected number of "false discoveries" is of the rate $\frac{m}{\sqrt{n \log n}}$.
Corollary: BIC is not consistent when $\frac{m}{\sqrt{n \log n}} \to \infty$.
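Bonferroni correction for multiple testing: $\alpha_{n,m} = \frac{\alpha_n}{m}$
probability of detecting at least one "false positive": $FWER \le \alpha_n$
$$2\left(1 - \Phi\left(\sqrt{c_{Bon}}\right)\right) = \frac{\alpha_n}{m}$$
$$c_{Bon} = 2 \log\left(\frac{m}{\alpha_n}\right)(1 + o_{n,m}) = (\log n + 2 \log m)(1 + o_{n,m}),$$
where $o_{n,m}$ converges to zero when n or m tends to infinity.
$$c_{mBIC} = \log n + 2 \log\left(\frac{m}{c} - 1\right) \approx \log n + 2 \log m - 2 \log c$$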
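A sketch comparing the two thresholds on the statistic $n\hat\beta_j^2/\sigma^2$. The Bonferroni threshold is obtained here by solving $2(1 - \Phi(\sqrt{c_{Bon}})) = \alpha_n/m$ exactly with the normal quantile function, taking $\alpha_n$ to be the exact per-test level of BIC; the function names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def bic_level(n):
    """Per-test type I error of BIC under orthogonal design."""
    return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(math.log(n))))

def bonferroni_threshold(n, m):
    """c_Bon solving 2 (1 - Phi(sqrt(c))) = alpha_n / m."""
    z = NormalDist().inv_cdf(1.0 - bic_level(n) / (2.0 * m))
    return z * z

def mbic_threshold(n, m, c=4.0):
    """c_mBIC = log n + 2 log(m/c - 1)."""
    return math.log(n) + 2.0 * math.log(m / c - 1.0)

n, m = 200, 300
print(math.log(n))                 # BIC threshold
print(bonferroni_threshold(n, m))  # Bonferroni-corrected threshold
print(mbic_threshold(n, m))        # mBIC threshold with c = 4
```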
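Properties of mBIC
1. $FWER \approx \sqrt{\frac{2}{\pi}}\, \frac{c}{\sqrt{n(\log n + 2 \log m - 2 \log c)}}$
2. The power of detecting the explanatory variable with $\beta_j \ne 0$ is given by
$$1 - P\left(-\sqrt{c_{mBIC}} - \frac{\sqrt{n} \beta_j}{\sigma} < \frac{\sqrt{n}(\hat\beta_j - \beta_j)}{\sigma} < \sqrt{c_{mBIC}} - \frac{\sqrt{n} \beta_j}{\sigma}\right) > 1 - \Phi\left(\sqrt{c_{mBIC}} - \frac{\sqrt{n} \beta_j}{\sigma}\right) \to 1.$$
Corollary: Regardless of the choice of c, mBIC is consistent.
The standard version of mBIC uses c = 4 to control FWER at a level below 10% when $n \ge 200$.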
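The FWER approximation in point 1 is easy to evaluate. This sketch uses an assumed helper name and the simulation-sized values n = 200, m = 300 purely as an illustration of the 10% claim for c = 4.

```python
import math

def mbic_fwer_approx(n, m, c=4.0):
    """Approximate FWER of mBIC: sqrt(2/pi) * c / sqrt(n (log n + 2 log m - 2 log c))."""
    inner = math.log(n) + 2.0 * math.log(m) - 2.0 * math.log(c)
    return math.sqrt(2.0 / math.pi) * c / math.sqrt(n * inner)

for m in (50, 300, 5000):
    print(m, mbic_fwer_approx(200, m))
```

The approximation decreases in both n and m, so the bound at n = 200 also covers larger samples for a fixed c.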
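Asymptotic optimality of mBIC (1)
$\gamma_0$ - cost of a false discovery, $\gamma_A$ - cost of missing a true signal
$\beta_j \sim (1 - p)\delta_0 + p N(0, \tau^2)$
Expected value of the experiment cost:
$$R = m(\gamma_0 t_1 (1 - p) + \gamma_A t_2 p),$$
where $t_1$ and $t_2$ are the probabilities of type I and type II errors
Optimal rule: Bayes oracle
$$\frac{f_A(\hat\beta_j)}{f_0(\hat\beta_j)} > \frac{(1 - p)\gamma_0}{p \gamma_A},$$
where $f_A$ is the density of $N\left(0, \tau^2 + \frac{\sigma^2}{n}\right)$ and $f_0$ is the density of $N\left(0, \frac{\sigma^2}{n}\right)$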
Asymptotic optimality of mBIC (2)
Bayes oracle
$$\frac{n \hat\beta_j^2}{\sigma^2} > \frac{\sigma^2 + n\tau^2}{n\tau^2} \left[\log\frac{n\tau^2 + \sigma^2}{\sigma^2} + 2 \log\frac{1 - p}{p} + 2 \log\frac{\gamma_0}{\gamma_A}\right]$$
Asymptotic optimality: the model selection rule V is asymptotically optimal if
$$\lim_{n \to \infty,\, m \to \infty} \frac{R_V}{R_{BO}} = 1.$$
Theorem 1 (Bogdan, Chakrabarti, Ghosh, 2008). Under orthogonal design (1), mBIC is asymptotically optimal when $\lim_{m \to \infty} mp = s$, where $s \in \mathbb{R}$.
Conjecture (Frommlet, Bogdan, 2008). Theorem 1 also holds when
$$\beta_j \sim (1 - p)\delta_0 + p F_A,$$
where $F_A$ has a positive density at 0.
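A sketch of the Bayes oracle threshold, written with $r = n\tau^2/\sigma^2$ so that $(\sigma^2 + n\tau^2)/(n\tau^2) = 1 + 1/r$ and $(n\tau^2 + \sigma^2)/\sigma^2 = 1 + r$. The function names and the example values (taken from the simulation setting: p = 1/30, τ = 1.5, σ = 1, equal costs) are assumptions for illustration.

```python
import math

def bayes_oracle_threshold(n, p, tau, sigma=1.0, gamma0=1.0, gammaA=1.0):
    """Threshold on n * beta_hat^2 / sigma^2 used by the Bayes oracle."""
    r = n * tau ** 2 / sigma ** 2
    bracket = (math.log(1.0 + r)
               + 2.0 * math.log((1.0 - p) / p)
               + 2.0 * math.log(gamma0 / gammaA))
    return (1.0 + 1.0 / r) * bracket

def mbic_threshold_from_p(n, p):
    """mBIC threshold log n + 2 log((1 - p)/p), i.e. c = m p."""
    return math.log(n) + 2.0 * math.log((1.0 - p) / p)

for n in (200, 10_000, 10 ** 8):
    oracle = bayes_oracle_threshold(n, 1 / 30, 1.5)
    print(n, oracle, mbic_threshold_from_p(n, 1 / 30), oracle - mbic_threshold_from_p(n, 1 / 30))
```

With equal costs the gap between the two thresholds approaches $2\log(\tau/\sigma)$, which does not grow with n.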
Computer simulations (1)
Setting: n = 200, m = 300, entries of $X \sim N(0, \sigma = 0.5)$, $k \sim Binomial(m, p)$ with $p = \frac{1}{30}$ (mp = 10), $\beta_i \sim N(0, \sigma = 1.5)$, $\varepsilon \sim N(0, 1)$ and Tukey's gross error model: $\varepsilon \sim Tukey(0.95, 100, 1) = 0.95 \cdot N(0, 1) + 0.05 \cdot N(0, 10)$.
Characteristics: Power, $FDR = \frac{FP}{AP}$, $MR = FP + FN$, $l_2 = \sum_{j=1}^{m} (\beta_j - \hat\beta_j)^2$, and d, the mean value of the absolute prediction error based on 50 additional observations
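A sketch of two ingredients of such a simulation: a sampler for the contaminated noise and the selection characteristics. Two assumptions are made here and are not stated on the slide: the contaminating component N(0, 10) is read as having standard deviation 10, and AP is read as the number of all selected regressors. All function names are illustrative.

```python
import random

def tukey_noise(rng, eps=0.05, sd_contaminated=10.0):
    """Draw from 0.95 * N(0, 1) + 0.05 * N(0, sd = 10) (assumed reading of the gross error model)."""
    sd = sd_contaminated if rng.random() < eps else 1.0
    return rng.gauss(0.0, sd)

def selection_metrics(true_support, selected):
    """FP, FN, Power, FDR and MR for one replication, given sets of regressor indices."""
    true_support, selected = set(true_support), set(selected)
    fp = len(selected - true_support)   # false positives
    fn = len(true_support - selected)   # false negatives
    tp = len(selected & true_support)   # correctly detected regressors
    power = tp / len(true_support) if true_support else float("nan")
    fdr = fp / len(selected) if selected else 0.0  # FP / AP, with AP = number selected (assumption)
    return {"FP": fp, "FN": fn, "Power": power, "FDR": fdr, "MR": fp + fn}

print(selection_metrics(true_support={1, 2, 3, 4}, selected={2, 3, 9}))
```

In a full study these metrics would be averaged over the replications.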
Computer simulations
Table: Results for 1000 replications.

noise                criterion  FP     FN    Power   FDR     MR       l2      d
N(0,1)               BIC        13.3   1.84  0.8155  0.5889  15.1480  2.3610  0.9460
N(0,1)               mBIC       0.073  2.97  0.7030  0.0107  3.0410   0.6025  0.8505
N(0,1)               rBIC       0.08   3.45  0.6586  0.0116  3.5310   0.8500  0.8687
Tukey(0.95, 100, 1)  BIC        12.5   3.95  0.6087  0.6487  16.4440  13.51   1.714
Tukey(0.95, 100, 1)  mBIC       0.08   6.11  0.3923  0.0210  6.1910   4.732   1.503
Tukey(0.95, 100, 1)  rBIC       0.1    4.29  0.5806  0.0162  4.3910   1.597   1.298

$E|\varepsilon_1| \approx 0.8$ for N(0,1) noise, $E|\varepsilon_2| \approx 1.16$ for Tukey(0.95, 100, 1) noise
Applications for QTL mapping
$$Y_i = \mu + \sum_{j \in I} \beta_j X_{ij} + \sum_{(u,v) \in U} \gamma_{uv} X_{iu} X_{iv} + \varepsilon_i,$$
I - a certain subset of the set N = {1, ..., m}, U - a certain subset of N × N
Standard version of mBIC - minimize
$$n \log(RSS) + (p + r) \log(n) + 2p \log(m/2.2 - 1) + 2r \log(N_e/2.2 - 1)$$
p - number of main effects, r - number of interactions, $N_e = m(m - 1)/2$
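A sketch of the criterion to be minimized, with main effects and interactions penalized separately. The function name `mbic_qtl` is an assumption; the example values n = 520 and m = 59 are the ones reported for the real data analysis later in the talk.

```python
import math

def mbic_qtl(rss, n, m, p, r):
    """Standard mBIC for QTL mapping (to be minimized).

    rss: residual sum of squares, n: sample size, m: number of markers,
    p: number of main effects, r: number of interaction terms in the model.
    """
    n_e = m * (m - 1) / 2.0  # number of possible pairwise interactions
    return (n * math.log(rss)
            + (p + r) * math.log(n)
            + 2.0 * p * math.log(m / 2.2 - 1.0)
            + 2.0 * r * math.log(n_e / 2.2 - 1.0))

# Cost of adding one main effect or one interaction, holding RSS fixed.
base = mbic_qtl(1.0, 520, 59, 0, 0)
print(mbic_qtl(1.0, 520, 59, 1, 0) - base)
print(mbic_qtl(1.0, 520, 59, 0, 1) - base)
```

An interaction term costs more than a main effect because it is chosen from the larger pool of $N_e$ candidate pairs.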
Further applications for QTL mapping
1. Extending to more complicated genetic scenarios + iterative version of mBIC: Baierl, Bogdan, Frommlet, Futschik, Genetics, 2006
2. Robust versions based on M-estimates: Baierl, Futschik, Bogdan, Biecek, CSDA, 2007
3. Rank version: Żak, Baierl, Bogdan, Futschik, Genetics, 2007
4. Taking into account the correlations between neighboring markers: Bogdan, Frommlet, Biecek, Cheng, Ghosh, Doerge, Biometrics, 2008
Real Data Analysis (1)
Huttunen et al (2004) - data on the variation in male courtship
song characters in Drosophila virilis.
Real Data Analysis (2)
Drosophila "sing" by vibrating their wings. The most common song type is the pulse song, which consists of rapid transients (short-lived oscillations) of low frequency.
Quantitative trait PN - number of pulses in a pulse train.
Data - 24 markers on three chromosomes, n = 520 males
Huttunen et al. (2004) used single marker analysis and composite interval mapping. They found one QTL on chromosome 2, five QTL on chromosome 3 (not sure if there are only 2) and another QTL on chromosome 4.
We use mBIC together with Haley and Knott regression. We impute the genotypes inside intermarker intervals so that the distance between tested positions does not exceed 10 cM. We penalize these imputed locations as real markers. As a result, m = 59 and $N_e = 1711$.
Real Data Analysis (3)
Real Data Analysis (4)
Zeng et al. (2000) - data on the morphological differences between two species of Drosophila, Drosophila simulans and Drosophila mauritiana
Trait - the size and the shape of the posterior lobe of the male genital arch, quantified by a morphometric descriptor.
$n_1 = 471$, $n_2 = 491$, m = 193, genotypes at neighboring positions are closely correlated, $N_e = 18{,}528$
Real Data Analysis, BM
[Figure: plot comparing the results labelled Forward, mBIC and Zeng; axis ticks run from 0 to 140.]
Real Data Analysis, BS
[Figure: plot comparing the results labelled Forward, mBIC and Zeng; axis ticks run from 0 to 140.]
Further work
1. Relaxing the penalty so as to control FDR instead of FWER,
expected optimality for a wider range of values of p - with F.
Frommlet, J. K. Ghosh, A. Chakrabarti and M. Murawska.
2. Application for association mapping - with F. Frommlet and
M. Murawska.
3. Application for GLM and Zero Inflated Generalized Poisson
Regression - with M. Żak, C. Czado, V. Earhardt.
4. Application for model selection in logic regression and
comparison with Bayesian Regression Trees - with M. Malina,
K. Ickstadt, H. Schwender.
References
1. Baierl, A., Bogdan, M., Frommlet, F., Futschik, A., 2006. On locating multiple interacting quantitative trait loci in intercross designs. Genetics 173, 1693-1703.
2. Baierl, A., Futschik, A., Bogdan, M., Biecek, P., 2007. Locating multiple interacting quantitative trait loci using robust model selection. Computational Statistics and Data Analysis 51, 6423-6434.
3. Bogdan, M., Ghosh, J.K., Doerge, R.W., 2004. Modifying the Schwarz Bayesian Information Criterion to locate multiple interacting quantitative trait loci. Genetics 167, 989-999.
4. Bogdan, M., Frommlet, F., Biecek, P., Cheng, R., Ghosh, J.K., Doerge, R.W., 2008. Extending the Modified Bayesian Information Criterion (mBIC) to dense markers and multiple interval mapping. Biometrics, doi: 10.1111/j.1541-0420.2008.00989.x.
5. Broman, K.W., Speed, T.P., 2002. A model selection approach for the identification of quantitative trait loci in experimental crosses. J. Roy. Stat. Soc. B 64, 641-656.
6. George, E.I., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88, 881-889.
7. Żak, M., Baierl, A., Bogdan, M., Futschik, A., 2007. Locating multiple interacting quantitative trait loci using rank-based model selection. Genetics 176, 1845-1854.