Using growth curve model in anthropometric data analysis

Anu Roos

University of Tartu, Institute of Mathematical Statistics, Liivi 2, 50409 Tartu, Estonia
anu.roos@ut.ee

Summary. In this paper we use an extension of the MANOVA model, the Growth Curve model, to analyze data on the growth of children (ages 6-11). The Growth Curve model was first introduced by Potthoff & Roy (1964) [PR64], although some other authors had earlier worked with a similar model. In the present paper we examine the distribution of the estimator of the location parameter of the Growth Curve model in a specific case using real data. The paper consists of four sections. In Section 2 an overview of the Growth Curve model and of the theory used is given. In Section 3 the data are briefly described. In Section 4 the Growth Curve model is fitted to the data. In Section 5 the estimated distribution of the location parameter of the Growth Curve model is studied and compared with the normal distribution.

Key words: Growth Curve model, Kotz-type distribution, location parameter, mixture distribution, parameter estimation

1 Introduction

Over the years linear models have been one of the most widely used tools in data analysis. For repeated measurements an immediate extension is the MANOVA model. In the MANOVA model the number of parameters to estimate may be large. When we assume a functional relationship between the mean parameters within experimental units, we get the Growth Curve model. Unfortunately, the distribution of the estimator of the location parameter in the Growth Curve model is not normal, even if the underlying data follow the normal law. In this paper we examine an approximation to the distribution of the location parameter estimator in the Growth Curve model and compare it with the normal distribution (an assumption often made in practice).

2 Growth Curve model

An overview of the Growth Curve model is given, for example, by Woolson & Leeper (1980) [WL80], von Rosen (1991) [vRo91] or Srivastava & von Rosen (1999) [SvR99].
Kshirsagar & Smith (1995) [KS95] have written a book on the model; for a recent contribution see Kollo & von Rosen (2005) [KvR05], Chapter 4, where the model and some extensions are presented. In Kollo, Roos & von Rosen (2005) [KRv05] the distribution of the location parameter of the model is studied.

The Growth Curve model includes two design matrices and has the form

$$x = ABC + E, \tag{1}$$

where $x : p \times n$ is the data matrix (each column corresponds to an individual and each row to an observation point), $B : q \times k$ is a matrix of unknown parameters, $A : p \times q$ is a known within-individuals design matrix of full rank, $C : k \times n$ is a known between-individuals design matrix of full rank, and $E = (e_1, e_2, \ldots, e_n)$ is an error matrix, where the $e_i \sim N_p(0, \Sigma)$ are i.i.d. and $\Sigma$ is an unknown positive definite parameter matrix.

The MLE of $B$ (see e.g. [Kha68] or [KvR05], §4.1.2) equals

$$\hat{B} = (A' S^{-1} A)^{-1} A' S^{-1} x C' (CC')^{-1}, \tag{2}$$

where

$$S = x \left( I - C'(CC')^{-1} C \right) x'. \tag{3}$$

The other parameter, the covariance matrix $\Sigma$ of the error matrix, can be estimated using

$$n\hat{\Sigma} = (x - A\hat{B}C)(x - A\hat{B}C)'. \tag{4}$$

To evaluate the goodness of the model, we need to estimate the covariance matrix of $\hat{B}$:

$$D[\hat{B}] = \frac{n-k-1}{n-k-p+q-1}\, (CC')^{-1} \otimes (A'\Sigma^{-1}A)^{-1}. \tag{5}$$

Note that, unlike in the MANOVA model, $\hat{B}$ in (2) is a non-linear estimator and its distribution is not easy to find; unfortunately, it is not normal. One good approximation to the distribution of $\hat{B}$ is derived in [KvR05], §4.3.2. The result is stated in the next theorem.

Theorem 1. Let $\hat{B}$ be given by (2). Then an Edgeworth-type expansion of the density of $\hat{B}$ equals

$$f_{\hat{B}}(B_0) = f_{B_E}(B_0) + \cdots,$$

where

$$f_{B_E}(B_0) = f_N(B_0) \left\{ 1 + \tfrac{1}{2} s \left( \mathrm{Tr}\{A'\Sigma^{-1}A\,(B_0 - B)\,CC'\,(B_0 - B)'\} - kq \right) \right\}, \tag{6}$$

with

$$s = \frac{p-q}{n-k-p+q-1} \tag{7}$$

and $f_N(B_0)$ is the density function of the matrix normal distribution $N_{q,k}(B, (A'\Sigma^{-1}A)^{-1}, (CC')^{-1})$.

Following the ideas of Fujikoshi (1987) [Fuj87], Kollo & von Rosen (2005) ([KvR05], §4.3.2) have shown that $f_{B_E}(B_0)$ is a good approximation to the density $f_{\hat{B}}(B_0)$:

Theorem 2. $f_{B_E}(B_0) - f_{\hat{B}}(B_0) = O(n^{-2})$.

The distribution of $B_E$ (with density given in (6)) is a mixture of the matrix normal distribution and another elliptical distribution, the matrix Kotz-type distribution, with weights $1 - \tfrac{1}{2}skq$ and $\tfrac{1}{2}skq$, respectively (here $s$ is given in (7), and $k$ and $q$ are the dimensions of $B$). An overview of Kotz-type distributions can be found in [FKN90], §3.2; here we give only the definition of the Kotz distribution, a special case of the Kotz-type distributions.

Definition 1. A matrix $Y : q \times k$ is Kotz distributed with parameters $M$, $V$ and $W$ if its density function has the form

$$f_Y(Y) = \frac{1}{kq\,(2\pi)^{kq/2}}\, |V|^{-k/2}\, |W|^{-q/2}\, g\!\left(\mathrm{Tr}\{V^{-1}(Y - M)W^{-1}(Y - M)'\}\right), \tag{8}$$

where $g(x) = x \exp(-x/2)$. We write $Y \sim K_{q,k}(M, V, W)$ if $Y$ has the density function (8).

Theorem 3. Let a random matrix $x$ be a mixture of random matrices $x_N$ and $x_K$ with weights $1 - \lambda$ and $\lambda$, respectively, where $x_N \sim N_{p,n}(M, V, W)$ and $x_K \sim K_{p,n}(M, V, W)$. Let

$$x = \left( \begin{matrix} x_1 \\ \tilde{x}_1 \end{matrix} \;\; x_2 \right), \qquad M = \left( \begin{matrix} \mu_1 \\ \tilde{\mu}_1 \end{matrix} \;\; M_2 \right) \qquad \text{and} \qquad \Sigma = V \otimes W = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

where $x_1 : p_1 \times 1$, $\tilde{x}_1 : (p - p_1) \times 1$ and $x_2 : p \times (n-1)$, the matrix $M$ has the same structure as $x$, and $\Sigma_{11} : p_1 \times p_1$. Then the vector $x_1$ is a mixture of random vectors $x_{1N}$ and $x_{1K}$ with weights $1 - \lambda \frac{p_1}{np}$ and $\lambda \frac{p_1}{np}$, respectively, where $x_{1N} \sim N_{p_1}(\mu_1, \Sigma_{11})$ and $x_{1K} \sim K_{p_1}(\mu_1, \Sigma_{11})$.

The distribution of a marginal of $x$ is still a mixture of the same type, but the weight of the Kotz distribution decreases. Therefore the distribution of a marginal is always closer to the normal distribution than the distribution of the original matrix.

3 Data description

The data consist of measurements of children's weight.
The data were collected in 1997 retrospectively from doctors across Estonia, using medical records. There are 121 girls and 88 boys in the data, all measured six times, from the age of 6 until the age of 11. The growth patterns of some individuals can be seen in Figure 1. Only a few individuals are shown to keep the figure readable. The growth patterns of the children are quite similar in most cases: the "starting weight" at the age of 6 is 20-30 kg and the weight at the age of 11 lies between 30 and 50 kg. There are some exceptional cases as well. In some cases the weight decreases at some point, which could be due to measurement errors. The overall trend can be better seen in Figure 2, where box plots for each sex and age group are shown together with the mean weight of each group.

Fig. 1. The individual growth curves for some children (panel title: "Some individual growth curves"; axes: age, weight of individuals; legend: boys, girls)

Fig. 2. The box plot of weight, data grouped by age and sex (panel title: "The box plots of the weight by age groups"; axes: age, weight; legend: boys, girls, group mean)

4 Results

In this section we present the Growth Curve model for the data described above. We use the statistical software R for all calculations. First we estimate the parameters of model (1) using formulas (2)-(4). We assume linear growth in weight for both boys and girls; Figure 2 suggests that this is a reasonable assumption. The design matrices $A$ and $C$ are of the form

$$A = \begin{pmatrix} 1 & 6 \\ 1 & 7 \\ 1 & 8 \\ 1 & 9 \\ 1 & 10 \\ 1 & 11 \end{pmatrix}, \qquad C = \begin{pmatrix} 1 \;\; 1 \;\; \cdots \;\; 1 & 0 \;\; 0 \;\; \cdots \;\; 0 \\ \underbrace{0 \;\; 0 \;\; \cdots \;\; 0}_{88} & \underbrace{1 \;\; 1 \;\; \cdots \;\; 1}_{121} \end{pmatrix}. \tag{9}$$

The estimates of $B$ and $\Sigma$ are

$$\hat{B} = \begin{pmatrix} 3.73 & 3.16 \\ 3.20 & 3.20 \end{pmatrix}, \qquad \hat{\Sigma} = \begin{pmatrix} 14.27 & 15.17 & 16.71 & 19.27 & 23.05 & 27.26 \\ 15.17 & 19.61 & 21.21 & 23.92 & 29.09 & 33.58 \\ 16.71 & 21.21 & 26.30 & 29.27 & 35.13 & 40.38 \\ 19.27 & 23.92 & 29.27 & 36.60 & 42.08 & 48.88 \\ 23.05 & 29.09 & 35.13 & 42.08 & 54.88 & 61.53 \\ 27.26 & 33.58 & 40.38 & 48.88 & 61.53 & 78.20 \end{pmatrix}. \tag{10}$$

The estimated covariance matrix of $\hat{B}$ is as follows:

$$\widehat{D[\hat{B}]} = \begin{pmatrix} 0.39 & -0.06 & 0 & 0 \\ -0.06 & 0.01 & 0 & 0 \\ 0 & 0 & 0.28 & -0.05 \\ 0 & 0 & -0.05 & 0.01 \end{pmatrix}. \tag{11}$$

To provide some evaluation of the model, we use simultaneous Bonferroni confidence intervals (at confidence level 0.95) for the elements of $B$. The classical assumption that $\mathrm{vec}\,\hat{B} \sim N_4(\mathrm{vec}\,B, (A'\Sigma^{-1}A)^{-1} \otimes (CC')^{-1})$ is used. The confidence intervals are given in Table 1.

Table 1. The Bonferroni confidence intervals for the elements of $\hat{B}$

parameter    estimate    confidence interval
$B_{1,1}$    3.73        2.17    5.28
$B_{2,1}$    3.20        2.90    3.50
$B_{1,2}$    3.16        1.84    4.49
$B_{2,2}$    3.20        2.94    3.45

The confidence intervals for the intercepts ($B_{1,1}$ and $B_{1,2}$) are quite wide, but they are narrower for the slope parameters.

5 Approximate distribution of the location parameter estimator $\hat{B}$

We shall find the approximate distribution of the estimator of the location parameter $B$ using Theorem 1. The distribution is a mixture of the matrix normal distribution $N_{2,2}(B, (A'\Sigma^{-1}A)^{-1}, (CC')^{-1})$ and the matrix Kotz distribution $K_{2,2}(B, (A'\Sigma^{-1}A)^{-1}, (CC')^{-1})$, with weights $\frac{97}{101}$ and $\frac{4}{101}$, respectively, where $A$ and $C$ are given in (9). Two of the parameters of the mixture distribution are unknown; we know only the third one, which does not depend on the unknown model parameters:

$$(CC')^{-1} = \begin{pmatrix} 0.0114 & 0 \\ 0 & 0.0083 \end{pmatrix}.$$

We can calculate estimates of the other parameters: the estimate of $B$ is given in (10), and the estimate of the second parameter equals

$$(A'\Sigma^{-1}A)^{-1} = \begin{pmatrix} 33.58 & -5.51 \\ -5.51 & 1.25 \end{pmatrix}.$$

The weight of the Kotz distribution in the mixture is rather small.
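The weights and the matrix $(CC')^{-1}$ quoted above follow from the dimensions of the problem alone, so they can be checked by hand or with a few lines of code. The following is a minimal sketch in Python (not the R code used for the paper), using exact fractions; the variable names are illustrative assumptions, and the dimensions $p = 6$, $q = 2$, $k = 2$, $n = 88 + 121$ are read off from (9) and the data description.

```python
from fractions import Fraction

# Dimensions read off from the text (assumed here for illustration):
# p = 6 measurement ages, q = 2 (intercept and slope), k = 2 groups,
# n = 88 boys + 121 girls.
p, q, k = 6, 2, 2
n_boys, n_girls = 88, 121
n = n_boys + n_girls

# s from (7) and the mixture weights stated after Theorem 2.
s = Fraction(p - q, n - k - p + q - 1)
w_kotz = s * k * q / 2
w_normal = 1 - w_kotz

# Between-individuals design matrix C from (9), built row by row.
C = [[1] * n_boys + [0] * n_girls,
     [0] * n_boys + [1] * n_girls]

# CC' is diagonal for a group-indicator design, so its inverse is simply
# the reciprocal of each diagonal entry.
CCt = [[sum(a * b for a, b in zip(r1, r2)) for r2 in C] for r1 in C]
CCt_inv_diag = [Fraction(1, CCt[i][i]) for i in range(k)]

# Theorem 3 applied to the q x k matrix B-hat and a single (1 x 1) element:
# the Kotz weight is multiplied by 1 / (k * q).
w_kotz_elem = w_kotz / (k * q)

print("s =", s)
print("weights (normal, Kotz):", w_normal, w_kotz)
print("Kotz weight for one element:", w_kotz_elem)
print("diag of (CC')^{-1}:", [round(float(d), 4) for d in CCt_inv_diag])
```

Exact fractions are used only so that the weights appear in the same form as in the text; floating-point arithmetic would serve equally well.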
To study the distributions of the elements of $\hat{B}$, we need to calculate the Kronecker product of the second and third parameters of the component distributions:

$$(A'\Sigma^{-1}A)^{-1} \otimes (CC')^{-1} = \begin{pmatrix} 0.382 & -0.063 & 0 & 0 \\ -0.063 & 0.014 & 0 & 0 \\ 0 & 0 & 0.278 & -0.046 \\ 0 & 0 & -0.046 & 0.010 \end{pmatrix}.$$

Using Theorem 3 we get that the distribution of each element $\hat{B}_{i,j}$, $i, j = 1, 2$, of $\hat{B}$ is also a mixture of a one-dimensional normal distribution $N_1(\mu_{i,j}, \sigma_{i,j})$ and a Kotz distribution $K_1(\mu_{i,j}, \sigma_{i,j})$; the estimates of $\mu_{i,j}$ and $\sigma_{i,j}$ are given in Table 2. The weight of the Kotz distribution is $\frac{1}{101}$ and that of the normal distribution is $\frac{100}{101}$.

Table 2. Parameters of the distributions of the components

$i$    $j$    $\mu_{i,j}$    $\sigma_{i,j}$
1      1      3.73           0.382
2      1      3.20           0.014
1      2      3.16           0.278
2      2      3.20           0.010

We take a closer look at $\hat{B}_{1,1}$, the first element of the estimator matrix $\hat{B}$. According to Theorem 2 and Theorem 3, a good approximation to the density of $\hat{B}_{1,1}$ is a mixture of the normal distribution $N(3.73, 0.382)$ and the Kotz distribution $K(3.73, 0.382)$ with weights $\frac{100}{101}$ and $\frac{1}{101}$, respectively. The classical way of approximating the same distribution would be to use the normal distribution $N(3.73, 0.382)$. It is interesting to find out in which way these distributions differ.

Fig. 3. The difference of the density functions of the first marginals: normal distribution minus the mixture

Table 3. The values of the density functions at selected points; $f(x)$: density function of $N(3.73, 0.389)$, $g(x)$: density function of the mixture of $N(3.73, 0.382)$ and $K(3.73, 0.382)$ with weights $\frac{100}{101}$ and $\frac{1}{101}$

point                      $x$        $f(x)$                       $g(x)$
$\mu - 10\sqrt{\sigma}$    -2.51      $4.62779 \times 10^{-23}$    $9.23 \times 10^{-23}$
$\mu - 9\sqrt{\sigma}$     -1.89      $7.46225 \times 10^{-19}$    $1.35 \times 10^{-18}$
$\mu - 8\sqrt{\sigma}$     -1.27      $4.33983 \times 10^{-15}$    $7.09 \times 10^{-15}$
$\mu - 7\sqrt{\sigma}$     -0.641     $9.10291 \times 10^{-12}$    $1.35 \times 10^{-11}$
$\mu - 6\sqrt{\sigma}$     -0.0175    $6.88642 \times 10^{-09}$    $9.31 \times 10^{-09}$
$\mu - 5\sqrt{\sigma}$     0.606      $1.87894 \times 10^{-06}$    $2.33 \times 10^{-06}$
$\mu - 4\sqrt{\sigma}$     1.23       0.000185                     0.000213
$\mu - 3\sqrt{\sigma}$     1.85       0.00656                      0.00709
$\mu - 2\sqrt{\sigma}$     2.48       0.08400                      0.08655
$\mu - \sqrt{\sigma}$      3.1        0.38783                      0.38791
$\mu - \sqrt{\sigma}/2$    3.41       0.56850                      0.56433
$\mu$                      3.73       0.64579                      0.63943

In Figure 3 the difference of the density functions is plotted (the density function of the mixture is subtracted from the density function of the normal distribution), and in Table 3 some values of both density functions are given. It is apparent that near the mean the density of the normal distribution is greater, and that the difference decreases with the distance from the mean. At a distance of two standard deviations from the mean the density function of the mixture is greater, and this remains so further out in the tails. The mixture distribution thus has slightly heavier tails than the normal distribution. In this case the difference is very small. This leads to the conclusion that the normal approximation is acceptable in large samples. With a smaller sample size, the weight of the Kotz distribution in the mixture increases and the distribution deviates more from the normal distribution, which may lead to errors.

Acknowledgements

The author is thankful for the financial support from the Estonian Science Foundation through grant No. 5686.

References

[FKN90] Fang, K.-T., Kotz, S., Ng, W.: Symmetric Multivariate and Related Distributions. Chapman and Hall, London, New York (1990)
[Fuj87] Fujikoshi, Y.: Error bounds for asymptotic expansions of the distribution of the MLE in a gmanova model. Ann. Inst. Statist. Math. 39 153–161 (1987)
[Kha68] Khatri, C.G.: Some results for the singular normal multivariate regression model. Sankhyā, Ser. A 30 267–280 (1968)
[KvR05] Kollo, T. & von Rosen, D.: Advanced Multivariate Statistics with Matrices.
Springer, Dordrecht (2005)
[KRv05] Kollo, T., Roos, A. & von Rosen, D.: Approximation of the Distribution of the Location Parameter in the Growth Curve Model. Research Report 2005:8, Swedish University of Agricultural Sciences, Centre of Biostochastics (2005)
[KS95] Kshirsagar, A.M. & Smith, W.B.: Growth Curves. Marcel Dekker, New York (1995)
[PR64] Potthoff, R.F. & Roy, S.N.: A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 51 313–326 (1964)
[vRo91] von Rosen, D.: The growth curve model: A review. Comm. Statist. Theory Methods 20 2791–2822 (1991)
[SvR99] Srivastava, M.S. & von Rosen, D.: Growth curve models. In: Multivariate Analysis, Design of Experiments, and Survey Sampling, Ed. S. Ghosh. Marcel Dekker, New York, 547–578 (1999)
[WL80] Woolson, R.F. & Leeper, J.D.: Growth curve analysis of complete and incomplete longitudinal data. Comm. Statist. A, Theory Methods 9 1491–1513 (1980)