ON PRIMARY STATISTICAL DATA PROCESSING OF
EXPERIMENTAL MEASUREMENTS OF LYMPHOCYTES USING
C57BL/6 MOUSE LINE¹
L.K. Babadzanjanz¹, A.V. Voitylov²
Saint-Petersburg State University, Saint-Petersburg, Russia
E-mail: ¹ levon@mail.wplus.net, ² voivv@mail.ru
P. Krebs, B. Ludewig³
Institute of Experimental Immunology, University of Zürich, Zürich, Switzerland
E-mail: ³ burkhard.ludewig@kssg.ch
D.R. Sarkissian⁴, G.A. Bocharov⁵
Institute of Numerical Mathematics, Russian Academy of Sciences, Russia
E-mail: ⁴ voidcaller@mail.domonet.ru, ⁵ bocharov@inm.ras.ru
Key words: confidence interval, probability distribution identification, Kolmogorov-Smirnov
criterion, omega-squared criterion, measurement errors in biology, optimal sample size
ON PRIMARY STATISTICAL DATA PROCESSING OF EXPERIMENTAL
MEASUREMENTS OF LYMPHOCYTES USING C57BL/6 MOUSE LINE /
L.K. Babadzanjanz, A.V. Voitylov (SPbSU, SPb, Russia), D.R. Sarkissian, G.A. Bocharov (INM
RAS, Moscow, Russia), T. Junt, P. Krebs, B. Ludewig (U. of Zürich, Zürich, Switzerland)
This paper uses the Kolmogorov-Smirnov criterion to identify the most suitable probability
distribution (among the normal, log-normal and gamma distributions) for the errors in
measurements of the percentage of CD8+ βgal+ lymphocytes in the blood and spleen of C57BL/6
mice. Based on a Monte Carlo study, we suggest a criterion for computing the smallest sample
size that still allows sufficiently precise identification of the probability distribution of
the measurement errors.
1. Introduction
Primary statistical processing of experimental data has to be carried out in accordance
with a suitable probability distribution of the measurement errors. If this distribution is
chosen incorrectly, the results of the statistical processing (mean, confidence intervals,
etc.) may deviate significantly from the true ones. Mathematical model identification in
immunology relies on assumptions about the statistical features of the data samples used in
parameter estimation procedures [1]. Therefore, characterizing (choosing) the probability
distribution for typical data sets arising in immunology is an important problem, and it is
the problem we consider in this paper. We begin with a series of samples of measurements of
the percentage of CD8+ lymphocytes specific for β-galactosidase (βgal) in the blood or spleen
of C57BL/6 mice and look for a suitable probability distribution. The Kolmogorov-Smirnov
criterion is used to select one of three popular probability distributions: Normal,
Log-Normal and Gamma. The results are briefly summarized here in tables, while the details of
our computational technique are being prepared for a separate paper.
¹ This work was supported by grant RFFI 03-01-00689.
2. Data and methods
Adenovirus-based immunization represents a promising approach to the treatment of
chronic virus infections and tumors. Recent experimental studies of C57BL/6 mice infected
with βgal recombinant adenovirus provided quantitative insight into the kinetics of the
induction of the CD8+ cytotoxic T lymphocyte (CTL) response (Krebs et al.). In particular, a
detailed characterization of the population dynamics of βgal-specific CD8+ lymphocytes, as
well as of the percentage of CD8+ T cells in the blood or spleen of C57BL/6 mice, is available
(see Krebs et al. [1,2]). In this paper we analyze the following samples:
Name  Sample size  Sample description
b1    50           Percentage of βgal-specific CD8+ lymphocytes in blood at day 0
b2    32           Percentage of βgal-specific CD8+ lymphocytes in blood at day 13
b3    50           Percentage of CD8+ T cells in blood at day 0
b4    32           Percentage of CD8+ T cells in blood at day 13
s1    23           Percentage of βgal-specific CD8+ lymphocytes in spleen at day 0
s2    33           Percentage of βgal-specific CD8+ lymphocytes in spleen at day 13
s3    23           Percentage of CD8+ T cells in spleen at day 0
s4    33           Percentage of CD8+ T cells in spleen at day 13
Table 1. Experimental samples from C57BL/6 mice infected with βgal recombinant adenovirus
3. Results
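To select the most appropriate of three popular probability distributions (Normal,
Log-Normal and Gamma) we use the Kolmogorov-Smirnov criterion. The cumulative
distribution functions with location parameter μ and shape parameter σ are defined as
follows (see [3] for details):

\[
F(x,\mu,\sigma)=
\begin{cases}
\dfrac{1}{\sigma\sqrt{2\pi}}\displaystyle\int_{-\infty}^{x} e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^{2}}\,dt, & \text{for the Normal distribution,}\\[2ex]
\dfrac{1}{\sigma\sqrt{2\pi}}\displaystyle\int_{0}^{x}\frac{1}{t}\,e^{-\frac{1}{2}\left(\frac{\ln t-\mu}{\sigma}\right)^{2}}\,dt, & \text{for the Log-Normal distribution,}\\[2ex]
\dfrac{1}{\sigma\,\Gamma(\mu)}\displaystyle\int_{0}^{x}\left(\frac{t}{\sigma}\right)^{\mu-1} e^{-t/\sigma}\,dt, & \text{for the Gamma distribution.}
\end{cases}
\]

The symbol F(x, μ, σ) thus stands for three different probability distributions. Note that
for each distribution the location parameter μ and the shape parameter σ have their own
“physical significance”. For example, the mean equals μ for the Normal distribution,
exp(μ + σ²/2) for the Log-Normal distribution and μσ for the Gamma distribution.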
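The Kolmogorov-Smirnov criterion is based on comparing, for all distributions under
consideration, the quantity

\[
KS=\min_{\mu,\sigma}\,\max_{x}\left|\,F(x,\mu,\sigma)-\frac{1}{n}\sum_{k=1}^{n} I_{x\ge x_k}\right|,
\qquad\text{where}\qquad
I_{x\ge x_k}=
\begin{cases}
1 & \text{if } x\ge x_k,\\
0 & \text{if } x< x_k,
\end{cases}
\]

and x_1, ..., x_n are the measurements in the sample.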
To compute KS we used an iterative optimization procedure by Artem M.
Babadzhanyants based on an approach from [4,5]. As the initial guess we take the standard
maximum likelihood estimates of the parameters μ and σ. The computed values of KS, the sample
means, the 95% confidence intervals, and the parameters μ, σ for all samples from Table 1 are
summarized in Tables 2 and 3.
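The inner maximum in the KS quantity can be evaluated exactly at the sample points, because the empirical distribution function changes only there. The following is a minimal Python sketch of that step for fixed parameters. It does not perform the minimization over μ and σ and is not the authors' iterative procedure; the function names are illustrative, the toy values are made up, and the Gamma family is left out because the Python standard library has no incomplete gamma function.

```python
import math
from statistics import NormalDist


def ks_distance(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and the model `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides.
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap


def normal_cdf(mu, sigma):
    """CDF of the Normal distribution with parameters (mu, sigma)."""
    return NormalDist(mu, sigma).cdf


def lognormal_cdf(mu, sigma):
    """CDF of the Log-Normal distribution: ln(X) is Normal(mu, sigma)."""
    base = NormalDist(mu, sigma)
    return lambda x: base.cdf(math.log(x)) if x > 0 else 0.0


if __name__ == "__main__":
    toy = [0.12, 0.19, 0.25, 0.31, 0.48]  # made-up values, not data from the paper
    print("normal    :", ks_distance(toy, normal_cdf(0.27, 0.15)))
    print("log-normal:", ks_distance(toy, lognormal_cdf(-1.37, 0.56)))
```

Wrapping `ks_distance` in a numerical minimizer over (μ, σ), started from the maximum likelihood estimates, would give the full quantity KS.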
Normal distribution
#    KS     Mean  Confidence interval
b1   0.085  0.28  -0.018, 0.57
b2   0.13   9.5   -0.24, 19
b3   0.070  11    7.1, 14
b4   0.081  15    8.6, 22
s1   0.16   0.28  -0.11, 0.67
s2   0.11   4.3   0.2, 8.4
s3   0.076  10    5.6, 15
s4   0.069  11    4.5, 18

Log-Normal distribution
#    KS     Mean  Confidence interval
b1   0.052  0.30  0.085, 0.76
b2   0.084  10    3.0, 26
b3   0.078  11    7.6, 15
b4   0.058  15    9.6, 23
s1   0.095  0.32  0.055, 1.1
s2   0.066  4.6   1.5, 11
s3   0.091  10    6.3, 16
s4   0.085  11    6.0, 20

Gamma distribution
#    KS     Mean  Confidence interval
b1   0.044  0.29  0.065, 0.68
b2   0.089  10    2.4, 23
b3   0.089  11    7.2, 15
b4   0.074  15    9.1, 23
s1   0.11   0.30  0.032, 0.87
s2   0.084  4.5   1.2, 9.8
s3   0.082  10    6.1, 15
s4   0.073  11    5.6, 19

Identified parameters
     Normal distribution  Log-Normal distribution  Gamma distribution
#    μ        σ           μ        σ               μ       σ
b1   0.27470  0.14947     -1.3707  0.55632         3.2449  0.0890185
b2   9.5247   4.9835      2.1830   0.55163         3.3934  2.9583
b3   10.756   1.8808      2.3684   0.17656         28.123  0.38467
b4   15.081   3.3005      2.7016   0.22515         18.017  0.85174
s1   0.27870  0.19730     -1.4237  0.75163         1.8581  0.16296
s2   4.3036   2.0676      1.3987   0.50592         4.0382  1.1138
s3   10.032   2.2800      2.2924   0.23080         18.907  0.53523
s4   11.107   3.3841      2.3891   0.30134         10.774  1.0563

Tables 2, 3. Statistical data processing for the blood (b1-b4) and spleen (s1-s4) samples from C57BL/6
mice with Normal, Log-Normal and Gamma probability distributions. Here “KS” is the value of the
functional from the Kolmogorov-Smirnov criterion, “Mean” is the sample’s mean and “Confidence interval” is
the sample’s 95% confidence interval.
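As a plausibility check, the mean and 95% interval columns for the Normal and Log-Normal models can be recomputed from the identified parameters. The sketch below assumes that the interval is the central 95% range of the fitted distribution (its 2.5% and 97.5% quantiles); the function names are illustrative and are not taken from the paper.

```python
import math
from statistics import NormalDist

Z975 = NormalDist().inv_cdf(0.975)  # 97.5% quantile of the standard normal, about 1.96


def normal_summary(mu, sigma):
    """Mean and central 95% interval of a Normal(mu, sigma) model."""
    return mu, (mu - Z975 * sigma, mu + Z975 * sigma)


def lognormal_summary(mu, sigma):
    """Mean and central 95% interval when ln(X) is Normal(mu, sigma)."""
    mean = math.exp(mu + sigma ** 2 / 2)
    return mean, (math.exp(mu - Z975 * sigma), math.exp(mu + Z975 * sigma))


if __name__ == "__main__":
    # Parameters reported for sample b1.
    print("normal    :", normal_summary(0.27470, 0.14947))
    print("log-normal:", lognormal_summary(-1.3707, 0.55632))
```

For sample b1 this gives values that round to the tabulated entries: a negative lower bound of about -0.018 under the Normal model and a strictly positive interval under the Log-Normal model.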
The goodness of fit of the identified probability distributions for all samples can be
compared visually using Figures 1 - 8. Each of these figures shows the histogram, constructed
using the well-known Freedman-Diaconis criterion, together with the density functions of the
best normal, log-normal and gamma distributions, each identified by minimizing the value of
the Kolmogorov-Smirnov criterion within its family. Each density function is accompanied by
vertical lines showing its mean and its 2.5% and 97.5% quantiles. The figures visualize how
the choice of distribution function influences the best estimate of the sample average and of
the 95% confidence interval.
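For reference, the Freedman-Diaconis criterion sets the histogram bin width to 2·IQR·n^(-1/3), where IQR is the interquartile range of the sample. The Python sketch below illustrates it; the quartile convention (the default of `statistics.quantiles`) and the rounding of the bin count are assumptions, since the paper does not state which variants were used.

```python
import math
from statistics import quantiles


def freedman_diaconis_bins(sample):
    """Return (bin width, number of bins) from the Freedman-Diaconis rule."""
    n = len(sample)
    q1, _, q3 = quantiles(sample, n=4)  # quartiles, default 'exclusive' method
    width = 2 * (q3 - q1) / n ** (1 / 3)
    if width == 0:
        # Degenerate case (zero interquartile range): fall back to a single bin.
        return 0.0, 1
    count = max(1, math.ceil((max(sample) - min(sample)) / width))
    return width, count


if __name__ == "__main__":
    toy = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up values, not data from the paper
    print(freedman_diaconis_bins(toy))
```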
Figure 1. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line)
probability density functions for sample #b1. Vertical lines show the average value and the 95% confidence
interval for each probability density function.
Figure 2. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line)
probability density functions for sample #b2. Vertical lines show the average value and the 95% confidence
interval for each probability density function.
Figure 3. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line)
probability density functions for sample #b3. Vertical lines show the average value and the 95% confidence
interval for each probability density function.
Figure 4. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line)
probability density functions for sample #b4. Vertical lines show the average value and the 95% confidence
interval for each probability density function.
Figure 5. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line)
probability density functions for sample #s1. Vertical lines show the average value and the 95% confidence
interval for each probability density function.
Figure 6. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line)
probability density functions for sample #s2. Vertical lines show the average value and the 95% confidence
interval for each probability density function.
Figure 7. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line)
probability density functions for sample #s3. Vertical lines show the average value and the 95% confidence
interval for each probability density function.
Figure 8. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line)
probability density functions for sample #s4. Vertical lines show the average value and the 95% confidence
interval for each probability density function.
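Table 2 shows that, overall, the measurement errors of experiments #b1 through #s4 are
better approximated by the Log-Normal distribution: it gives the lowest KS value for four of
the eight samples (b2, b4, s1 and s2), and its largest KS value (0.095) is well below that of
the Normal distribution (0.16). This agrees quite well with Figures 1 - 8. Note that the
heavy negative tail of the normal distribution, which has no biological meaning, visibly
lowers the estimate of its mean.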
4. Optimal sample size selection
State-of-the-art biological methods are so complex that statistical post-processing is
usually required to check their integrity and precision. A popular approach is to independently
repeat the same measurement n times and then take the average of this sample as an
estimate of the true measurement result, hoping to reduce the distortion due to
measurement errors. Since repeating experiments is quite costly, it is important to determine
the smallest sample size n that is sufficient to get good results. To do this we perform a
Monte Carlo study with the same parameters of the probability distributions as those
identified for the experimental measurement sample #b1. In Figure 9 we plot the average
precision of the identified cumulative distribution function (measured by the value KS of the
Kolmogorov-Smirnov criterion) vs. the size n of the simulated samples.
More precisely, let X=N stand for the Normal distribution with μ = 0.27470 and
σ = 0.14947, let X=L stand for the Log-Normal distribution with μ = -1.3707 and σ = 0.55632,
and let X=G stand for the Gamma distribution with μ = 3.2449 and σ = 0.0890185. For each
n we generate a sample of n independent realizations of the random quantity X and then
compute the value KS of the Kolmogorov-Smirnov criterion to see how well the empirical
distribution of the generated sample matches its true distribution X. We repeat this process
until the average of the KS values stabilizes (by the Central Limit Theorem). In Figure 9 we
plot these averages for n = 2, ..., 60 in black and label the graphs “N” (if X=N), “L” (if X=L)
and “G” (if X=G).
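The simulation just described can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' code: it uses a fixed number of repetitions instead of repeating until the average stabilizes, it covers only the Normal and Log-Normal cases (the standard library has no Gamma cumulative distribution function), and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

B1_NORMAL = (0.27470, 0.14947)     # (mu, sigma) identified for sample b1
B1_LOGNORMAL = (-1.3707, 0.55632)  # (mu, sigma) identified for sample b1


def ks_distance(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and the model `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap


def average_ks(draw, cdf, n, repeats, rng):
    """Mean KS distance between simulated samples of size n and their true CDF."""
    return fmean(
        ks_distance([draw(rng) for _ in range(n)], cdf) for _ in range(repeats)
    )


def true_models():
    """Samplers and true CDFs for the cases X=N and X=L."""
    mu_n, sd_n = B1_NORMAL
    mu_l, sd_l = B1_LOGNORMAL
    log_base = NormalDist(mu_l, sd_l)
    return {
        "N": (lambda rng: rng.gauss(mu_n, sd_n), NormalDist(mu_n, sd_n).cdf),
        "L": (lambda rng: rng.lognormvariate(mu_l, sd_l),
              lambda x: log_base.cdf(math.log(x))),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    for label, (draw, cdf) in true_models().items():
        for n in (5, 10, 20, 30, 60):
            print(label, n, round(average_ks(draw, cdf, n, 500, rng), 3))
```

Because the samples are compared with the distribution that generated them, the averages depend on n but not on the family, which is the behaviour expected of the black curves.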
The gray graphs, labeled “XY”, are obtained as follows. First, for each sample
generated using distribution X, we identify the parameters assuming probability distribution Y.
Here Y stands for the normal, log-normal or gamma distribution with the parameters that
minimize the value KS of the Kolmogorov-Smirnov criterion, which shows how well the empirical
distribution of the sample generated using distribution X matches the identified
distribution Y. We plot the average values of KS for n = 2, ..., 60 in gray.
Figure 9. The quality of probability distribution parameter identification as a function of sample size if the
exact distribution of the sample is known. Normal, log-normal and gamma distributions are used to
generate samples for Monte-Carlo study.
Figure 9 shows that samples of size n < 20 are not reliable enough and can be
significantly improved by adding a few more points to them. On the other hand, for samples of
size n > 30 the quality of the empirical distribution improves only slightly as n
increases.
The graphs of Figure 9 show the average values of KS obtainable when the family
(Normal, Log-Normal or Gamma) of the underlying sample distribution is determined
correctly. In this paper we determine the class of the probability distribution on the basis
of the lowest value KS of the Kolmogorov-Smirnov criterion. We note that when the distribution
family is determined correctly, for sample sizes n > 10 the quality of approximation is
independent of the probability distribution.
The Monte Carlo study presented in the next three figures shows what happens if the
family of the sample distribution is guessed wrongly, and provides an empirical justification
for selecting the distribution family on the basis of the lowest value of KS.
Figure 10. The quality of probability distribution parameter identification as a function of sample size if
sample distribution is unknown. Normal distribution is used to generate samples.
Figure 11. The quality of probability distribution parameter identification as a function of sample size if
sample distribution is unknown. Log-normal distribution is used to generate samples.
Figure 12. The quality of probability distribution parameter identification as a function of sample size if
sample distribution is unknown. Gamma distribution is used to generate samples.
Figures 10 - 12 show that the value of the Kolmogorov-Smirnov criterion becomes a reliable
criterion for selecting the distribution class (choosing between normal, log-normal and gamma
distributions) for samples of size n > 20. They also show that if the class of probability
distributions is guessed incorrectly, the identified cumulative distribution function might
fit the empirical distribution function of the experimental sample quite badly. For example,
postulating a normal distribution of measurement errors in experiment #b1 would have led
us to the physically meaningless confidence interval [-0.018, 0.57] for the percentage of CD8+
βgal+ lymphocytes.
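The selection rule discussed above (choose the family with the lowest KS value) can be illustrated with a simplified Python sketch. It departs from the paper in two labeled ways: the parameters are the maximum likelihood estimates, which the paper uses only as the initial guess, rather than the KS-minimizing ones, and only the Normal and Log-Normal families are compared. The function names are hypothetical, and the sample must be strictly positive.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def ks_distance(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and the model `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap


def fit_normal(sample):
    """Maximum likelihood Normal fit; returns the fitted CDF."""
    return NormalDist(fmean(sample), pstdev(sample)).cdf


def fit_lognormal(sample):
    """Maximum likelihood Log-Normal fit (a Normal fit to the logarithms)."""
    logs = [math.log(x) for x in sample]
    base = NormalDist(fmean(logs), pstdev(logs))
    return lambda x: base.cdf(math.log(x)) if x > 0 else 0.0


def select_family(sample):
    """Return the family with the smaller KS distance, plus both distances."""
    scores = {
        "normal": ks_distance(sample, fit_normal(sample)),
        "log-normal": ks_distance(sample, fit_lognormal(sample)),
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated data drawn with the Log-Normal parameters identified for b1.
    sample = [rng.lognormvariate(-1.3707, 0.55632) for _ in range(50)]
    print(select_family(sample))
```

With small samples the two distances can be close, so a single run may pick either family; this is the effect that Figures 10 - 12 quantify.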
5. Conclusions
Using Tables 2 and 3, we suggest that overall the Log-Normal distribution agrees with the
samples better than the Normal or Gamma distributions. What we can undoubtedly conclude is
that the results summarized in Table 2 show that the Normal distribution is the most
unsuitable of the three distributions considered. In particular, this means that the mean and
confidence intervals computed under the assumption that the underlying distribution is normal
might deviate significantly from the true ones. Therefore, we suggest that the processing of
experimental data in immunology must be based on a suitable probability distribution of
measurement errors obtained for each experimental technique beforehand. We also suggest that
at least 30 repeated measurements are needed to determine a suitable probability distribution
of measurement errors. This is of special importance for mathematical model identification
problems in immunology [6].
6. Bibliography
[1] Bocharov, G., Ludewig, B., Junt, T., Krebs, P. 2004. Underwhelming the Immune
Response: Effect of Slow Virus Growth on CD-8-T-Lymphocyte Responses. Journal of
Virology, 78(5):2247-2254.
[2] Krebs P., Scandella E., Odermatt B., Ludewig B. 2005. Rapid functional exhaustion of
cytotoxic T lymphocytes following immunization with recombinant adenovirus. J. Exp. Med.
(submitted)
[3] NIST/SEMATECH e-Handbook of Statistical Methods. 2004.
http://www.itl.nist.gov/div898/handbook/.
[4] Babadzanjanz, L.K., Boyle, J.A., Sarkissian, D.R., Zhu, J. 2003. Parameter Identification
For Oscillating Chemical Reactions Modelled By Systems Of Ordinary Differential
Equations. Journal of Computational Methods in Science and Engineering, 3(2):233-247.
[5] Babadzanjanz L.K., Voitylov, A.V., Sarkissian, D.R., Krebs, P., Ludewig, B., Bocharov,
G.A. 2005. Assessing the statistical distribution of tet+ CTL using the recombinant
adenovirus immunization data. J. Immunol. Methods (in preparation)
[6] Marchuk, G.I. 1991. Mathematical Models in Immunology: Computational Methods and
Experiments, 3rd ed. Moscow: Nauka, Main Editorial Board for Physical and Mathematical
Literature (in Russian).