CI \ / ,-« ALFRED P. WORKING PAPER SLOAN SCHOOL OF MANAGEMENT An Empirical Study of the Chi-Square Probability Plot for Assessing Multivariate Normality by M.A. Wong and W.Mo Lebow Working Paper # 1132-80/1 June 30, 1980 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 50 MEMORIAL DRIVE CAMBRIDGE, MASSACHUSETTS 02139 An Empirical Study of the Chi-Square Probability Plot for Assessing Multivariate Normality by M.A. Wong and W.M„ Lebow Working Paper # 1132-80/^ June 30, 1980 An Empirical Study of the Chi-Square Probability Plot for Assessing Multivariate Normality by M.A. Wong and W.M. Lebow Sloan School of Management Massachusetts Institute of Technology Cambridge, MA 02139, U.S.A. SUr/IMARY The chi-square probability plot of the ordered generalized distances from the sample mean vector has been suggested for use in checking the normal assumption for a given body of multivariate observations. Here, the results of an empirical study are presented to illustrate the small and large sample behavior of this plot for samples from both normal and non-normal populations. Our sample sizes range from 10 to 200. As might be expected, samples of 10 tell us almost nothing about multivariate normality, whereas samples of 200 seem very stable except for their few highest points. In general, for samples of size 30 or more, the chi-square plot appears to be an effective graphical aid for assessing multivariate normality. Keywords: Multivariate Normality; Generalized Distances; Chi-Square Probability Plot; Computer Simulations . 1. INTRODUCTION Since the assumption of multivariate normality underlies much of the classical multivariate statistical methodology, it would be useful to have procedures for checking the validity of the normal assumption for a given set of multivariate data. As pointed out by Gnanadesikan (1977) and Everitt (1978) , such checks would be helpful in guiding the subsequent analysis of the data, perhaps by suggesting the need for and nature of a transformation of the data to make them more nearly normally distributed, or perhaps by indicating appropriate modifications of the models and methods for analyzing the data. Numerical methods for testing joint normality have been developed by Mardia (1970, 1975) (1973), and Andrews et„ al. Afifi, A. A. , Malkovich, J.F. and (1971, 1972, 1973); a detailed summary of these techniques is given in Gnanadisikan (1977) Healy (1958) first proposed an extension of the normal probability plot to assess multivariate normality graphically. His method uses the generalized distance, d. , of each point from the sample mean vector, where d. 1 and X. = (x. ~i _ T - x) ~ -1 S (x. -^ - X) ~ is the i th observation vector, x is the p-variate sample mean vector, and S is the sample variance-covariance matrix. If the sample data are from a multivariate normal distribution, then these distances have approximately a chi-square distribution with p degrees of freedom. (The exact marginal distribution of d. is known to be a constant multiple of a beta rather than a chi-square distribution; see Gnanadesikan and Kettenring, 1972). This distributional property suggests that a chi-square probability plot of the ordered generalized distances may be useful for assessing joint normality. (i = 1,..., n), are Specifically, the n generalized distances, d. ordered in magnitude, and the ith ordered value is plotted against the quantile of a chi-square distribution with p degrees of freedom corresponding to a cumulative probability of i = l,...,n. (i - *5)/n, for For multivariate data, the resulting plot should resemble a straight line passing through the origin; departures from normality will be indicated by departures from linearity in this plot. In 1973, Andrews et. al. suggested a graphical procedure that utilizes a generalized distances - and - angles representation of multivariate data, a procedure which enconpasses Healy's technique. Like other probability plotting procedures, the proposed chi-square probability plot can be considered as an informal but informative graphical aid for data analysis. However, unless the users acquire some feeling for the behavior of this plot for samples from both normal and non-normal distribution, its effectiveness would be limited. Some examples illustrating the usefulness of the chi-square probability plot for assessing joint normality are given in Gnanadesikan (1977) and Everitt (1978); but, no systematic study has been done to examine the small and large sample behavior of this plot for samples from both normal and non-normal populations. In this study, we use computer simulated random samples to examine the behavior of the chi-square probability plot for both the null case (normal population) and the non-null case (non-normal population) In the null case, we generated samples of size n = 10, 20, 30, 50, and 200 from 2- and 5- variate normal distributions, and obtained the corresponding chi-square probability plots. These generated plots provide a reference for judging the normality of a given sample; they are given in Section 2. The results obtained for the non-null case are given in Section 3. There, generated plots are produced to illustrate the power of the chi-square probability plot when samples are drawn from 2- and 5- variate (independently distributed variates) non-normal distributions including chi-square (10 degrees of freedom) , 10% contaminated normal, and Cauchy distributions. Concluding remarks are given in the final section. . CHI-SQUARE PROBABILITY PLOTS FOR NORMAL DATA 2. All the probability plots presented in this paper were produced on an IBM 370/168 computer, and the plotting routine used is a modification of Spark's (1971) algorithm. In each of the plots, a line with unit slope which passes through the origin is plotted for reference purposes; moreover, multiple points are represented by the letter "0" and single point by the asterisk "*". And the required chi-square quantiles were computed using the methods of Kuo (1965) and Goldstein (1973) The random number generator used is the one given in Lewis et. al. (1969) , In the null case, normal deviates were computed from the generated random numbers using Cunningham's (1969) algorithm. For each of the two number of dimensions, p = 2 and 5, four chi-square probability plots were produced for each of the sample sizes n= 10, 20, 30, 50, and 200; where MVN I (0, matrix. half of the normal samples were drawn from MVN(0, I), is the identity matrix of rank p, and half were drawn from E ) where E The plots are given in Figures lA - 5A (p= Figures IB - 5B (p= 30, is some randomly generated positive - definite 50 and 200. 5) 2) and respectively for the five sample sizes n=10, 20, As might be expected, samples of 10 tell us almost nothing about normality, whereas samples of 200 seem very stable except for their few highest points. Samples of 20 show wild fluctuations; samples of 30 are better behaved; and samples of 50 nearly always appear linear but fluctuate at their upper ends. Neither the population covariance structure nor the number of dimension appears to affect the behavior of the plot. (Insert Figures lA-B to 5A-B here.) . . 3. CHI-SQUARE PROBABILITY PLOTS FOR NON-NORMAL DATA In order to examine the power of the chi-square probability plot, we repeated the simulation study using samples from non-normal distributions 3.1 10% Contaminated Multivariate Normal Distribution Here the generated data are from 90% MVN (0, For each of the two number of dimensions, p= 2 I) + 10% MVN (0,101) and 5, two chi-square probability plots were produced for each of the sample sizes n= 10,30 and 200. At n= 200, the chi-square plots clearly indicate non-normality (see Figure 6) ; at n=30, there is some evidence that the distribution (see Figure 7) is not normal can be observed (see Figure ; 8) but at n= 10, no indication of contamination . Thus as expected, large sample size is needed to detect contamination. (Insert Figures 6 to 8 here) 3.2 Multivariate Chi-Square Distribution The generated data consists of p independent (p= 2 and 5) variates, each of which has a chi-square (10) distribution. This is a mildly skewed distribution and at n= 200, there is again a clear indication of non-normality (see Figure (n=10 and 30) , 9) ; but for smaller samples we find it hard to distinguish the plots given in Figures 10 and 11 from those given in Figures lA-B and 3A-B respectively. Hence, a rather large sample is needad to detect the moderate skewness in the chi-square population, and it should be noted that the chi-square probability plots obtained from a "skewed" distribution are convex. (Insert Figures 9 to 11 here.) 3.3 Multivariate Cauchy distribution This is an extremely heavy-tailed distribution. Even at n= 10, the probability plots are far from linear (see Figure 12) . Apparently, the chi-square probability plot has good power for rejecting normality when the time distribution is heavy-tailed. The plots for the sample sizes n= 20 and 100 are given in Figures 13 and 14 respectively. (Insert Figures 12 to 14 here.) 3.4 Multivariate Normal Distribution with an Outlier Since the commonest type of non-normality is a single oversized outlier, it would be interesting to find out if the chi-square probability plot would be able to detect multivariate outliers. Realizing that there may be various types of multivariate outliers (see Gnanadesikan and Kettenring, 1972) , we confine our effort to studying outliers in one particular dimension. In each of the plots given in Figure 15, one observation has been contaminated by adding 10 standard deviations to the first variate. For samples of 30 and more, the chi-square plots consistently identify the outliers; however, the plots are not that effective for small samples. (Insert Figure 15 here.) . 4. DISCUSSION The object of this simulation study is to illustrate the usefulness of the chi-square probability plot for assessing multivariate normality. The results indicate that the minimum sample size needed to detect non-normality depends on the degree to which the underlying distribution is "non-normal"; but, a sample size of 30 is usually sufficient to detect all but slight deviations from normality. For normal samples, chi-square plots for various sample sizes were generated to provide a reference for the "expected" behavior. However, it should be pointed out that it is only through routine usage that an user can become more adept at interpreting the chi-square plots 10 REFERENCES Andrews, D.F., Gnanadesikan, R. , and Warner, J.L. (1971). Transformations of multivariate data. Biometrics , 27 , 825-840. Bell Methods for assessing multivariate normality. (1972) . Laboratories Memorandum. Methods for assessing multivariate normality. In (1973) . Multivariate Analysis III (P.R. Krishnaiah, ed.). Academic Press, New York, pp. 95-116. Cunningham, S.W. (1969). Algorithm AS 24. From Normal Integral to Deviate. Applied Statistics , 18 , 290-293. Everitt, B.S. (1978). Graphical Techniques for Multivariate Data Holland, New York. . North- Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. John Wiley and Sons, New York. Robust estimates, residuals Gnanadesikan, R. and Kettenring, J.R. (1972). and outlier detection with multiresponse data. Biometrics , 28 , 81-124. Goldstein, R.B. (1973). Algorithm 451. Chi-Square Quantiles. Communications of the ACM , 16 , 483-485. Healy , M.JoRo 17_, Kuo, S.S. (1968). Multivariate Normal Plotting. Applied Statistics , 157-161. Numerical Methods and Computers Add i son-Wesley Publishing Company, Reading, Massachusetts, pp. 83-89. (1965). . Lewis, P.A.W., Goodman, A.S., and Mills, JoM. (1969). Pseudo Random Number Generator for the System /360. IBM Systems Journal 2_. , Malkovich, J.F. and Afifi, A. A. (1973). On tests for multivariate normality. J. of American Stat. Assoc , 68, 176-179. Mardia, K.V. (1970). Measures of multivariate skewness and kurtosis with 57 , 519-530. applications. Biometrika , (1975). Assessment of multinormality and the robustness of Hotelling's test. Applied Statistics , 24 , 163-171. Sparks, D.N. (1971). Algorithm AS 44. Scatter Diagram Plotting. Applied Statistics, 20, 327-331. o o a o o « 3 • o •oX -o o o lflCLJI3 o-*u]>-«zuujcn o >a: a iiJ o: 1 O •'V-Vl »- <zU tAl 09 o o •• .1 o nz « 3 O— n= o o o o O C£ CWU I c^tn^czuutm o o o o O OC C LJ C£ UJ <^ < Z UU Bl cnmr>z>-»o»**o w & (D 3 n (D 0) H- M (D Oi & H(0 rt (U 3 O (B CO on^moxo tftrno2i>-if>«o o o o o o t?T ^ m tr 3? 0) o o o o 4J fd o o c< a •o » u. a e o • b. la >^ m m Id o •-I o M ti) o 3 a (0 a a m Ui o» OI o oX «-« u n • > o u w o (U c; nJ -P o w •H o o -a (U N •H QOCOUiCfiuO O Q^ir. p-C£<-IUJU1 ar A uj oc wo OMtA»-«ZUWbl . o o a o ri o u 3 3 a » » • o OX • •-• • <J CM o o o o o o u c .9 4J =1 x> •H U -P to H rs • nr o • C CD n u)rinzj»-«tn — s ifln>nzj»-44r) tanaji —^ c^i^srisTrQ aH- 3 I W M O O C 0) H o O O o o o fD •a hi o tr CO O rt M O —o (A O o c > Z >J ^• o ro o c > u o u o • <-• i-h • 3 u vt o o Ul C4 n> h( 0) N a a I— CO 3 O ttmoz^-^WMO (D 01 wnnz>-4tni-ie em^pioaso I-h o o O e o m o o 0) (P CO Ml l-i 3 o X -• m o o c » r- u • t» o • • o o o o u n= I' o in o 3 C > o »r- o o ft (D 3 O o wo o HCO ft H Htr c rt H- O 3 o o o o e o o o • • on;3rno:^o o o C o P 3 XI H U W -P o o o o 31 (0 e c 0) -p •H > I 3 CN e o n I u M-l • ' O oo aj 0) o o oo o u o o o o M-l M 0) O o n au ecu a OKQUJIKUJC a«-.(A»-czuujcn C ia»UI-«ZUUlU> +J (0 •H 'd o o o -a <u N (0 0) »> M C OJ f^ Cn U. li. O in en o P p>i •H .H H ^ a o a• o X! O rO in X! II o o o o o o o o o a in o 3 w H I • Ui ^ uj :^ M— ',1»-<I 2Ut-t*/^ OCCCUJC^UJii 0^tA»-tZUWtfl U < -.H ' W iftmnz^-^y***^ •amT'noTio irifnnzj^-H'.n t' — ^r- — — o a o o o I —^ o o (0 o o o ai o o « n 0) H- N (D (D to II O o • o oo c - H- 1 »o o in o H- o H >? o»o o • o. o o o c > zo u o •a o » Z -H •r- m • o o z I u o CD • o o « in o (+ 0) o uo o l-h iQ (D a> » H H- o o o o N (D O' a HCO o>rnnz>-no>-"e3 ft onxmsxo iDRnz>-Hyi^o trmx'^tsao (U 3 O o o <0 o> Ml o o •a CO 3 o HCO ft ^i Pcr c rt H- O 3 o o o o CD O o o S O •H •y 3 P 01 o o o • o- o»•o r • O a o »a oa o oo o oo o o o o Q£ a u: a: uj I a oo oo o ooo o o o o o o OKQUJQlUja Qt-t<nt-<t2uujcn Qi-iU3»~«ZUbJU) o o o o z o o 3 o - o o oo «o •o o • OX o a a ao a aa a ao o ao o « 2 ^^iuj^uj— o^-cni o o <r 2 t_l Ui tn o o o o a i£ c \jj :£ uc c»«4/i»-ftz(jujvi o o n C on^no^ao o)mn2>-<'y»»-»'3 «.-)rn-l^I>-«t/»«-3 Ln W o W H- (t I ooo hQ C c &> H- O o o o• Xo o • cr ^ & o c o o o co o oo o n «o • • •o oo o tn •o o »— zo M o o o o o rt ^3 o oa oo o oo o oo a o o oo (D 'O o o o ooo ft 3 o e o o o M Pcr o o o o o o H- QO oa a o• o o c zo o • ^3 •n »• u uo N o O Hi o o tD 3 m m o o N umnz (D >-4 01 Xo om»mo»o innoz»-«(o«-i*3 "snajniir^oo Oi Oi H0) o o o sr 3 O a o• o c oo o• o (D (0 ? Io »o -a (D CO -h O 3 < o c— > zo o o o o o o oo o o o oo o no I -• o oo o o• » o o- > — 2O -* • »o o n r«n 11 u o PI ft § 3 01 o o o oo o co a o o o» o oo o tn o oo Q o > o o o ' o o t3 0) C o u »o •» •• o o o o z o u o u 0) o — O X o a o o o o o o o o Q»^(n»-<c2(juj<n oaouiccuj— QQc a-M ccui c e o a>-iu)v-«zutij(n a <n o u -• u O o o o o (n •-• : u et o o o o O Q^ Z^ U U Ui c* a^ ifi >- •xzuuit/i o o < o o Oa£CUJC£UC Cii-iUI^^ZUUUI C ^^ 0) u 3 to I •rH u 0) p > T1 o oI II z • •-• II u Z o o o o a iL ^ ui CL u cu occCLUbCUCi i=i>-it/)»-<zzuuicn cii->(/it><rzuu(n o o in (M » • O (A n2 3 u o II Z • o •- • o O o o o o I a — •-•tn)-<tzuujui o o n o o o cc a uj c£ uj a o o • (/I »- •! Z (J LJ (/I -< ^ n r4 a- o y)rnnz>-*tn«o (t Ofn»mo»o cr rt o P- O 3 cn o cr w 0) N o (D rt 01 O 3 n 0) o o o • o o o o o o z 3 » oo« o o o U) ri & o OZ J* -• w> ' • O o'^75'*'^^0 (/) n. o r > _, -J, — c- o f^ '- " c x o , [ ,^n-^o « TDfiD 3 S2M EEb M HD28.IV1414 no.l129- 80A Stephe/Taking the burdens Barley, DDl 1T2 bMD TQflD 3 : HD28.M414 no.l129- BOB Roberts. Edwar/Critical functions D*BKS 739558 QDl TTb 203 TOflD 3 P0L32£ HD28.M414 no.ll30- 80 Schmalensee. R/Economies of scale and 739574 P'BKS PQI?,^??? 001 ITb Ibl TOflO 3 HD28.M414 no.1l30- 80A Sterman, John /The effect of D»BKS 739856 energy de 00132G25 TOaO DDl TTS 3 72ti HD28,M414 no.ll31- 80 Latham, Mark. /Doubling strategies, li 00153013... 739564 D>fBKS TOaO DD2 2bl bM^ 3 no.l131- HD28.I\/I414 Latham, Mark. 80 1981 /Doubling strategies, li 001526"" D»BKS 743102 TOAD DD2 257 bl3 3 ;|lol-^0 3 TOflO DDM SEM EME HD28.M414 no.ll32- 80 Keen, Peter 3 G. /Building a decision DxBKS 740317 TOflO sup 00136494 OOE Om 7fl5 HD28.M414 no.ll32- 80A Wong, M. Antho/An empirical study .D.x.BKS. 739570. ..001.5.2.66.6. . 3 of . TOaO OOE ES7 b3'l t l^r/^lb