ON PRIMARY STATISTICAL DATA PROCESSING OF EXPERIMENTAL MEASUREMENTS OF LYMPHOCYTES USING C57BL/6 MOUSE LINE¹

L.K. Babadzanjanz (1), A.V. Voitylov (2)
Saint-Petersburg State University, Saint-Petersburg, Russia
E-mail: (1) levon@mail.wplus.net, (2) voivv@mail.ru

P. Krebs, B. Ludewig (3)
Institute of Experimental Immunology, University of Zürich, Zürich, Switzerland
E-mail: (3) burkhard.ludewig@kssg.ch

D.R. Sarkissian (4), G.A. Bocharov (5)
Institute of Numerical Mathematics, Russian Academy of Science, Russia
E-mail: (4) voidcaller@mail.domonet.ru, (5) bocharov@inm.ras.ru

Key words: confidence interval, probability distribution identification, Kolmogorov-Smirnov criterion, omega-squared criterion, measurement errors in biology, optimal sample size

ON PRIMARY STATISTICAL DATA PROCESSING OF EXPERIMENTAL MEASUREMENTS OF LYMPHOCYTES USING C57BL/6 MOUSE LINE / L.K. Babadzanjanz, A.V. Voitylov (SPbSU, SPb, Russia), D.R. Sarkissian, G.A. Bocharov (INM RAS, Moscow, Russia), T. Junt, P. Krebs, B. Ludewig (U. of Zürich, Zürich, Switzerland)

This paper uses the Kolmogorov-Smirnov criterion to identify the best-fitting probability distribution (among the normal, log-normal and gamma distributions) for the errors of percentage measurements of CD8+ βgal+ lymphocytes in the blood or spleen of mice of the C57BL/6 line.
Based on a Monte-Carlo study, we suggest a criterion for computing the smallest sample size that still allows a sufficiently precise identification of the probability distribution of measurement errors.

1. Introduction

Primary statistical processing of experimental data has to be carried out in accordance with a suitable probability distribution of measurement errors. If this distribution is chosen incorrectly, the results of statistical processing (mean, confidence intervals, etc.) may deviate significantly from the true ones. Mathematical model identification in immunology relies on assumptions about the statistical features of the data samples used in parameter estimation procedures [1]. Therefore, characterizing (choosing) the probability distribution for typical data sets arising in immunology is an important problem, and it is the one we consider in this paper.

We begin with a series of samples of percentage measurements of CD8+ lymphocytes specific for β-galactosidase (βgal) in the blood or spleen of C57BL/6 mice and look for a suitable probability distribution. The Kolmogorov-Smirnov criterion is used to select one of three popular probability distributions: Normal, Log-Normal and Gamma. The results are briefly summarized in tables here, while the details of our computational technique have been prepared for a separate paper.

¹ This work was supported by grant RFFI 03-01-00689.

2. Data and methods

Adenovirus-based immunization represents a promising approach to the treatment of chronic virus infections and tumors. Recent experimental studies of C57BL/6 mice infected with βgal recombinant adenovirus provided quantitative insight into the kinetics of the induction of the CD8+ cytotoxic T lymphocyte (CTL) response (Krebs et al.).
In particular, a detailed characterization of the population dynamics of βgal-specific CD8+ lymphocytes, as well as of the percentage of CD8+ T cells in the blood or spleen of C57BL/6 mice, is available (see Krebs et al. [1,2]). In this paper we analyze the following samples:

Name  Sample size  Sample description
b1    50           Percentage of βgal-specific CD8+ lymphocytes in blood at day 0
b2    32           Percentage of βgal-specific CD8+ lymphocytes in blood at day 13
b3    50           Percentage of CD8+ T cells in blood at day 0
b4    32           Percentage of CD8+ T cells in blood at day 13
s1    23           Percentage of βgal-specific CD8+ lymphocytes in spleen at day 0
s2    33           Percentage of βgal-specific CD8+ lymphocytes in spleen at day 13
s3    23           Percentage of CD8+ T cells in spleen at day 0
s4    33           Percentage of CD8+ T cells in spleen at day 13

Table 1. Experimental samples from C57BL/6 mice infected with βgal recombinant adenovirus

3. Results

To select the most appropriate distribution among three popular probability distributions (Normal, Log-Normal and Gamma) we use the Kolmogorov-Smirnov criterion. The cumulative distribution functions with location parameter μ and shape parameter σ are defined as follows (see [3] for details):

\[
F(x,\mu,\sigma)=
\begin{cases}
\dfrac{1}{\sigma\sqrt{2\pi}}\displaystyle\int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}}\,dt, & \text{for the Normal distribution},\\[2ex]
\dfrac{1}{\sigma\sqrt{2\pi}}\displaystyle\int_{0}^{x} \frac{1}{t}\,e^{-\frac{(\ln t-\mu)^2}{2\sigma^2}}\,dt, & \text{for the Log-Normal distribution},\\[2ex]
\dfrac{1}{\sigma\,\Gamma(\mu)}\displaystyle\int_{0}^{x} \left(\frac{t}{\sigma}\right)^{\mu-1} e^{-t/\sigma}\,dt, & \text{for the Gamma distribution}.
\end{cases}
\]

The symbol F(x, μ, σ) thus stands for three different probability distributions. Note that for each distribution the location parameter μ and the shape parameter σ have their own "physical significance".

The Kolmogorov-Smirnov criterion is based on comparing, for all distributions under consideration, the following quantity, where x_1, ..., x_n are the sample values:

\[
KS=\min_{\mu,\sigma}\,\max_{x}\left|F(x,\mu,\sigma)-\frac{1}{n}\sum_{k=1}^{n} I(x\ge x_k)\right|,
\qquad
I(x\ge x_k)=
\begin{cases}
1 & \text{if } x\ge x_k,\\
0 & \text{if } x<x_k.
\end{cases}
\]

To compute KS we used an iterative optimization procedure by Artem M. Babadzhanyants based on an approach from [4,5]. As the initial guess we take the standard maximum likelihood estimates of the parameters μ and σ.
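The minimization in the definition of KS can be sketched as follows. This is only an illustration under stated assumptions: it is not the iterative procedure used for the tables, a simple compass (pattern) search stands in for it and may stop at a local minimum, the sample is synthetic rather than experimental, and the names `ks_distance` and `fit_min_ks` are hypothetical. Only the Normal and Log-Normal families are covered, because the Gamma distribution function needs an incomplete gamma function that the Python standard library does not provide.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def ks_distance(sample, cdf):
    """max over x of |F(x) - F_n(x)| for a continuous CDF F and empirical CDF F_n."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # F_n jumps from i/n to (i+1)/n at x, so both sides of the jump are checked.
        worst = max(worst, (i + 1) / n - f, f - i / n)
    return worst


def normal_cdf(mu, sigma):
    return NormalDist(mu, sigma).cdf


def lognormal_cdf(mu, sigma):
    standard = NormalDist()
    return lambda x: standard.cdf((math.log(x) - mu) / sigma) if x > 0 else 0.0


def fit_min_ks(sample, make_cdf, mu0, sigma0, tol=1e-6):
    """Compass search over (mu, sigma) that lowers the KS distance from a start point."""
    best, step = (mu0, sigma0), 0.5 * sigma0
    best_ks = ks_distance(sample, make_cdf(*best))
    while step > tol:
        improved = False
        for dmu, dsigma in ((step, 0), (-step, 0), (0, step), (0, -step)):
            mu, sigma = best[0] + dmu, best[1] + dsigma
            if sigma <= 0:
                continue
            ks = ks_distance(sample, make_cdf(mu, sigma))
            if ks < best_ks:
                best, best_ks, improved = (mu, sigma), ks, True
        if not improved:
            step /= 2  # no better neighbour: refine the mesh
    return best, best_ks


# Synthetic demonstration data (NOT the experimental samples of Table 1).
rng = random.Random(0)
sample = [rng.lognormvariate(-1.37, 0.56) for _ in range(50)]

# Maximum likelihood estimates serve as the initial guess, as in the text.
logs = [math.log(x) for x in sample]
starts = {
    "Normal": (normal_cdf, fmean(sample), pstdev(sample)),
    "Log-Normal": (lognormal_cdf, fmean(logs), pstdev(logs)),
}
results = {}
for name, (make_cdf, mu0, sigma0) in starts.items():
    ks_mle = ks_distance(sample, make_cdf(mu0, sigma0))
    (mu, sigma), ks_min = fit_min_ks(sample, make_cdf, mu0, sigma0)
    results[name] = (ks_mle, ks_min, mu, sigma)
    print(f"{name}: KS at MLE = {ks_mle:.3f}, minimized KS = {ks_min:.3f}")
```

By construction the minimized value never exceeds the value at the maximum likelihood start, so the search can only tighten the fit that the initial guess already provides. The family with the smaller minimized KS would then be preferred.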
The computed values of KS, the sample mean values, the 95% confidence intervals, and the parameters μ and σ for all samples from Table 1 are summarized in Tables 2 and 3.

#    Normal distribution            Log-Normal distribution        Gamma distribution
     KS     Mean  Conf. interval    KS     Mean  Conf. interval    KS     Mean  Conf. interval
b1   0.085  0.28  -0.018, 0.57      0.052  0.30  0.085, 0.76       0.044  0.29  0.065, 0.68
b2   0.13   9.5   -0.24, 19         0.084  10    3.0, 26           0.089  10    2.4, 23
b3   0.070  11    7.1, 14           0.078  11    7.6, 15           0.089  11    7.2, 15
b4   0.081  15    8.6, 22           0.058  15    9.6, 23           0.074  15    9.1, 23
s1   0.16   0.28  -0.11, 0.67       0.095  0.32  0.055, 1.1        0.11   0.30  0.032, 0.87
s2   0.11   4.3   0.2, 8.4          0.066  4.6   1.5, 11           0.084  4.5   1.2, 9.8
s3   0.076  10    5.6, 15           0.091  10    6.3, 16           0.082  10    6.1, 15
s4   0.069  11    4.5, 18           0.085  11    6.0, 20           0.073  11    5.6, 19

#    Normal distribution    Log-Normal distribution    Gamma distribution
     μ        σ             μ        σ                 μ        σ
b1   0.27470  0.14947       -1.3707  0.55632           3.2449   0.0890185
b2   9.5247   4.9835        2.1830   0.55163           3.3934   2.9583
b3   10.756   1.8808        2.3684   0.17656           28.123   0.38467
b4   15.081   3.3005        2.7016   0.22515           18.017   0.85174
s1   0.27870  0.19730       -1.4237  0.75163           1.8581   0.16296
s2   4.3036   2.0676        1.3987   0.50592           4.0382   1.1138
s3   10.032   2.2800        2.2924   0.23080           18.907   0.53523
s4   11.107   3.3841        2.3891   0.30134           10.774   1.0563

Tables 2, 3. Statistical data processing for samples in blood (b1-b4) and spleen (s1-s4) of C57BL/6 mice with Normal, Log-Normal and Gamma probability distributions. Here "KS" is the value of the functional from the Kolmogorov-Smirnov criterion, "Mean" is the sample's mean and "Confidence interval" is the sample's 95% confidence interval.

The goodness of fit of the identified probability distributions for all samples can be compared visually using Figures 1-8.
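As a consistency check on Tables 2 and 3, the sketch below recomputes the "Mean" and "Confidence interval" entries of sample b2 from its identified parameters. It assumes that the interval is bounded by the 2.5% and 97.5% quantiles of the fitted distribution and that the Gamma mean equals the product of its two parameters; both are our reading of the tables, not a statement from the text. The Gamma interval is omitted because the standard library has no inverse incomplete gamma function, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist

# Identified parameters (mu, sigma) for sample b2, copied from Table 3.
B2 = {
    "Normal": (9.5247, 4.9835),
    "Log-Normal": (2.1830, 0.55163),
    "Gamma": (3.3934, 2.9583),
}


def normal_summary(mu, sigma):
    """Mean and central 95% interval of a Normal(mu, sigma) distribution."""
    dist = NormalDist(mu, sigma)
    return mu, (dist.inv_cdf(0.025), dist.inv_cdf(0.975))


def lognormal_summary(mu, sigma):
    """Mean and central 95% interval when ln X is Normal(mu, sigma)."""
    log_dist = NormalDist(mu, sigma)
    mean = math.exp(mu + sigma**2 / 2)
    interval = (math.exp(log_dist.inv_cdf(0.025)), math.exp(log_dist.inv_cdf(0.975)))
    return mean, interval


def gamma_mean(mu, sigma):
    # Assumption: mu acts as the exponent (shape) parameter and sigma as the scale.
    return mu * sigma


normal_mean, normal_ci = normal_summary(*B2["Normal"])
lognormal_mean, lognormal_ci = lognormal_summary(*B2["Log-Normal"])
gamma_mu_sigma = gamma_mean(*B2["Gamma"])

print(f"Normal:     mean {normal_mean:.3g}, interval ({normal_ci[0]:.3g}, {normal_ci[1]:.3g})")
print(f"Log-Normal: mean {lognormal_mean:.3g}, interval ({lognormal_ci[0]:.3g}, {lognormal_ci[1]:.3g})")
print(f"Gamma:      mean {gamma_mu_sigma:.3g}")
```

Under these assumptions the recomputed values round to the b2 entries printed in Table 2, including the negative lower bound of the Normal interval.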
Each of these figures shows the histogram, constructed using the well-known Freedman-Diaconis rule, and the density functions of the best normal, log-normal and gamma distributions, each identified by minimizing the value of the Kolmogorov-Smirnov criterion within its family. Each density function is accompanied by vertical lines showing its average and its 2.5% left and right quantiles. The figures visualize how the choice of distribution function influences the best estimate of the sample average and of the 95% confidence interval.

Figure 1. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line) probability density functions for sample #b1. Vertical lines show the average value and the 95% confidence interval for each probability density function.

Figure 2. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line) probability density functions for sample #b2. Vertical lines show the average value and the 95% confidence interval for each probability density function.

Figure 3. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line) probability density functions for sample #b3. Vertical lines show the average value and the 95% confidence interval for each probability density function.

Figure 4. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line) probability density functions for sample #b4. Vertical lines show the average value and the 95% confidence interval for each probability density function.

Figure 5. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line) probability density functions for sample #s1. Vertical lines show the average value and the 95% confidence interval for each probability density function.

Figure 6. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line) probability density functions for sample #s2.
Vertical lines show the average value and the 95% confidence interval for each probability density function.

Figure 7. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line) probability density functions for sample #s3. Vertical lines show the average value and the 95% confidence interval for each probability density function.

Figure 8. Histogram and identified normal (solid line), log-normal (dashed line) and gamma (dotted line) probability density functions for sample #s4. Vertical lines show the average value and the 95% confidence interval for each probability density function.

Table 2 shows that, overall, the measurement errors of experiments #b1 through #s4 are better approximated by the Log-Normal distribution. This agrees with Figures 1-8 quite well. Note that the heavy negative tail of the normal distribution, which has no biological meaning, visibly lowers the estimate of its mean.

4. Optimal sample size selection

State-of-the-art biological methods are so complex that statistical post-processing is usually required to check their integrity and precision. A popular approach is to repeat the same measurement independently n times and then take the average of this sample as an estimate of the true measurement result, hoping to reduce the distortion due to measurement errors. Since repeating experiments is quite costly, it is important to determine the smallest sample size n that is sufficient to get good results. To do this we perform a Monte-Carlo study with the same parameters of the probability distributions as those identified for the experimental measurement sample #b1. In Figure 9 we plot the average precision of the identified cumulative distribution function (measured by the value KS of the Kolmogorov-Smirnov criterion) against the size n of the simulated samples.
More precisely, let X=N stand for the Normal distribution with μ = 0.27470 and σ = 0.14947, let X=L stand for the Log-Normal distribution with μ = -1.3707 and σ = 0.55632, and let X=G stand for the Gamma distribution with μ = 3.2449 and σ = 0.0890185. For each n we generate a sample of n independent realizations of the random quantity X and then compute the value KS of the Kolmogorov-Smirnov criterion to see how well the empirical distribution of the generated sample matches its true distribution X. We repeat this process until the average of the KS values stabilizes (by the Central Limit Theorem). In Figure 9 we plot these averages for n = 2, ..., 60 in black and label the graphs "N" (if X=N), "L" (if X=L) and "G" (if X=G).

The gray graphs, labeled "XY", are obtained as follows. First, for each sample generated using distribution X, we identify the parameters assuming probability distribution Y. Here Y stands for the normal, log-normal or gamma distribution with the parameters that minimize the value KS of the Kolmogorov-Smirnov criterion, which shows how well the empirical distribution of the sample generated using distribution X matches the identified distribution Y. We then plot the average values of KS for n = 2, ..., 60 in gray.

Figure 9. The quality of probability distribution parameter identification as a function of sample size if the exact distribution of the sample is known. Normal, log-normal and gamma distributions are used to generate samples for the Monte-Carlo study.

Figure 9 shows that samples of size n < 20 are not reliable enough and can be significantly improved by adding a few more points to them. On the other hand, for samples of size n > 30 the quality of the empirical distribution improves only slightly as n increases. The graphs of Figure 9 show the average values of KS obtainable when the family (Normal, Log-Normal or Gamma) of the underlying sample distribution is determined correctly.
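The black "N" and "L" curves described above can be approximated with a short simulation. The sketch below rests on simplifying assumptions of ours: a fixed number of repetitions replaces the "repeat until the average stabilizes" rule, the KS distance is taken against the true distribution with the parameters quoted for sample #b1 (no minimization, so the gray "XY" curves are not reproduced), only a few sample sizes are evaluated, and the Gamma case is left out because its distribution function is not available in the Python standard library. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def ks_distance(sample, cdf):
    """max over x of |F(x) - F_n(x)| for a continuous CDF F and empirical CDF F_n."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # F_n jumps from i/n to (i+1)/n at x, so both sides of the jump are checked.
        worst = max(worst, (i + 1) / n - f, f - i / n)
    return worst


def average_ks(draw, cdf, n, repetitions, rng):
    """Mean KS distance between simulated samples of size n and their true CDF."""
    return fmean(
        ks_distance([draw(rng) for _ in range(n)], cdf) for _ in range(repetitions)
    )


# Parameters identified for sample #b1, as quoted in the text.
MU_N, SIGMA_N = 0.27470, 0.14947
MU_L, SIGMA_L = -1.3707, 0.55632

standard = NormalDist()
cases = {
    "N": (lambda r: r.gauss(MU_N, SIGMA_N), NormalDist(MU_N, SIGMA_N).cdf),
    "L": (
        lambda r: r.lognormvariate(MU_L, SIGMA_L),
        lambda x: standard.cdf((math.log(x) - MU_L) / SIGMA_L),
    ),
}

rng = random.Random(2005)      # fixed seed so the run is repeatable
sizes = (5, 10, 20, 30, 60)    # a subset of n = 2, ..., 60
repetitions = 400              # assumption: fixed count instead of a stopping rule
curves = {
    label: [average_ks(draw, cdf, n, repetitions, rng) for n in sizes]
    for label, (draw, cdf) in cases.items()
}
for label, values in curves.items():
    print(label, "  ".join(f"n={n}: {v:.3f}" for n, v in zip(sizes, values)))
```

Because the distance is measured against the true distribution, the two curves are expected to lie close to each other and to fall as n grows, with the decrease flattening for the larger sample sizes.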
In this paper we determine the class of the probability distribution on the basis of the lowest value KS of the Kolmogorov-Smirnov criterion. We note that when the distribution family is determined correctly, for sample sizes n > 10 the quality of approximation is independent of the probability distribution. The Monte-Carlo study presented in the next three figures shows what happens if the family of the sample distribution is guessed wrongly, and it provides an empirical justification for selecting the distribution family by the lowest value of KS.

Figure 10. The quality of probability distribution parameter identification as a function of sample size if the sample distribution is unknown. The normal distribution is used to generate samples.

Figure 11. The quality of probability distribution parameter identification as a function of sample size if the sample distribution is unknown. The log-normal distribution is used to generate samples.

Figure 12. The quality of probability distribution parameter identification as a function of sample size if the sample distribution is unknown. The gamma distribution is used to generate samples.

Figures 10-12 show that the value of the Kolmogorov-Smirnov criterion becomes a reliable basis for selecting the distribution class (choosing between normal, log-normal and gamma distributions) for samples of size n > 20. They also show that if the class of probability distributions is guessed incorrectly, then the identified cumulative distribution function might fit the empirical distribution function of the experimental sample quite badly. For example, postulating a normal distribution of measurement errors in experiment #b1 would have led us to the physically meaningless confidence interval [-0.018, 0.57] for the percentage of CD8+ βgal+ lymphocytes, since a percentage cannot be negative.

5. Conclusions

Based on Tables 2 and 3, we suggest that overall the Log-Normal distribution agrees with the samples better than the Normal or Gamma distributions.
What we can conclude with confidence is that the results summarized in Table 2 show the Normal distribution to be the least suitable of the three distributions considered. In particular, this means that the mean and confidence intervals computed under the assumption that the underlying distribution is normal might deviate significantly from the true ones. Therefore, we suggest that the processing of experimental data in immunology must be based on a suitable probability distribution of measurement errors, obtained beforehand for each experimental technique. We also suggest that at least 30 repeated measurements are needed to determine the suitable probability distribution of measurement errors. This is especially important for mathematical model identification problems in immunology [6].

6. Bibliography

[1] Bocharov, G., Ludewig, B., Junt, T., Krebs, P. 2004. Underwhelming the Immune Response: Effect of Slow Virus Growth on CD-8-T-Lymphocyte Responses. Journal of Virology, 78(5):2247-2254.
[2] Krebs, P., Scandella, E., Odermatt, B., Ludewig, B. 2005. Rapid functional exhaustion of cytotoxic T lymphocytes following immunization with recombinant adenovirus. J. Exp. Med. (submitted)
[3] NIST/SEMATECH e-Handbook of Statistical Methods. 2004. http://www.itl.nist.gov/div898/handbook/.
[4] Babadzanjanz, L.K., Boyle, J.A., Sarkissian, D.R., Zhu, J. 2003. Parameter Identification For Oscillating Chemical Reactions Modelled By Systems Of Ordinary Differential Equations. Journal of Computational Methods in Science and Engineering, 3(2):233-247.
[5] Babadzanjanz, L.K., Voitylov, A.V., Sarkissian, D.R., Krebs, P., Ludewig, B., Bocharov, G.A. 2005. Assessing the statistical distribution of tet+ CTL using the recombinant adenovirus immunization data. J. Immunol. Methods (in preparation)
[6] Marchuk, G.I. 1991. Mathematical Models in Immunology: Computational Methods and Experiments, 3rd ed. Moscow: Nauka, Main Editorial Board for Physical and Mathematical Literature. (In Russian.)