Simple conditions to theoretically limit potential confounding of least-squares estimates

Brian Knaeble
University of Utah
May 2013

Abstract

Results of multiple regression analysis are often reported as if model uncertainty were not an issue. If, however, omitted-variable bias is a valid concern, then the results of this paper may apply. The main result is a simple theorem, roughly asserting that weak correlates cannot reverse existing estimates. The contrapositive, in a special case, produces necessary conditions for the Yule–Simpson effect. Other applications are discussed, and a few counterexamples are presented to demonstrate how confounding can occur when least expected.

Keywords: least-squares, correlation, confounding, reversal

1 Introduction

Consider this scenario. A sufficient number of high-dimensional observations have been made to fit a linear model, but the set of explanatory variables to be used in the model has yet to be determined. Inference regarding the qualitative nature of the unique effect of $X_i$ on $Y$ is desired. The dimension is large enough that it is not computationally feasible to fit every possible model. Thus it seems that any conclusions reached regarding the unique effect of $X_i$ on $Y$ are potentially spurious.

Specifically, suppose that a linear model has been selected, with explanatory variables indexed by $I$. Denote by ${}_{I}\hat{\beta}_i$ the $i$th fitted coefficient within this model, obtained using the principle of least squares. As long as there is some error in the model, it remains possible that, once the data associated with additional explanatory variables indexed by $J$ are taken into account, $\operatorname{sign}({}_{J,I}\hat{\beta}_i) \neq \operatorname{sign}({}_{I}\hat{\beta}_i)$. For a particular $J$, the probability of such a reversal occurring may be low, but when working with a large set of explanatory variables, there are many such $J$s to consider. The situation becomes more serious if one attempts to account for confounding that arises due to unobserved lurking variables.
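To make the possibility of a reversal concrete, the following Python sketch fits a short model and a long model to a small hypothetical data set. The vectors `x1`, `x2`, `y` and the helper `fitted_coefficients` are illustrative assumptions, not data or code from this paper; the data are built so that `y` depends negatively on `x1` once `x2` is included, while the fit on `x1` alone has a positive slope.

```python
import numpy as np

def fitted_coefficients(y, *columns):
    """Least-squares coefficients (intercept first) of y on the given columns."""
    design = np.column_stack([np.ones_like(y), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical data: x2 tracks x1 closely, and y = -x1 + 3*x2 exactly.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
d = 0.5 * np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
x2 = x1 + d
y = -x1 + 3.0 * x2

# Coefficient of x1 in the model indexed by I = {1}.
beta_short = fitted_coefficients(y, x1)[1]
# Coefficient of x1 in the model indexed by J, I = {2}, {1}.
beta_long = fitted_coefficients(y, x1, x2)[1]

print("short model:", beta_short)   # positive slope
print("long model: ", beta_long)    # negative slope (the sign has reversed)
```

In this sketch the reversal is engineered: `x2` is strongly correlated with both `x1` and `y`, which is the familiar route to confounding.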
In the presence of model uncertainty, is sound inference even possible? For more reading along these lines consult [1] or [2].

To clarify our understanding of confounding, we distinguish between latent confounding of ${}_{I}\hat{\beta}_i$, due to the data associated with variables indexed by $I$, and potential confounding of ${}_{I}\hat{\beta}_i$, due to the data associated with variables indexed by $J$. Note that even the data associated with a single $X_j$, despite not being correlated with the data for $X_i$ or the data for $Y$, holds the potential to drastically confound ${}_{I}\hat{\beta}_i$, just by revealing latent confounding.† This curious phenomenon is demonstrated in Table 1.1.

Latent confounding is not an issue for data that have been transformed so that the vectors of observations indexed by $I$ are centered and mutually orthogonal (see Lemma 4.4 of Section 4.3). Orthogonality can be obtained by applying a Gram–Schmidt procedure in such a way that the specific vector $\mathbf{x}_i$ remains unchanged (except for centering) and the span of the design matrix remains unchanged as well. If we assume full rank for the original design matrix, then such a transformation is always possible. However, despite having arranged for an orthogonal design matrix, potential confounding, due to vectors of data indexed by $J$, remains a possibility.

Some of the concern can be addressed with the following theorem. It is perhaps best understood as a pure mathematical statement regarding the least-squares fitting procedure. Let $r$ denote the bivariate correlation coefficient, and let $R$ denote the positive square root of the coefficient of determination. Let $I$ index centered, orthogonal columns of data for a subset of explanatory variables, and let $J$ index additional columns of data, disjoint from those indexed by $I$ and not necessarily centered or orthogonal (to one another or to the vectors indexed by $I$).

Theorem 1.1. For any $i \in I$,
$$ {}_{J}R < |r(\mathbf{x}_i, \mathbf{y})| \implies \operatorname{sign}({}_{J,I}\hat{\beta}_i) = \operatorname{sign}({}_{I}\hat{\beta}_i). $$
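As an illustration only, the following Python sketch checks the implication numerically on simulated data. It is restricted to the single-variable case $I = \{i\}$ and makes no claim about larger $I$. The helper `fit`, the simulated data, the sample size, and the number of trials are assumptions made for this sketch and do not come from the paper.

```python
import numpy as np

def fit(y, columns):
    """Least-squares fit of y on an intercept plus the given columns.

    Returns (coefficients without the intercept, R), where R is the
    positive square root of the coefficient of determination.
    """
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    r_squared = np.sum((fitted - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
    return coef[1:], np.sqrt(r_squared)

rng = np.random.default_rng(0)
checked = 0      # trials in which the premise J_R < |r(x_i, y)| held
violations = 0   # trials in which the premise held but the sign changed

for _ in range(2000):
    n = 25
    x_i = rng.normal(size=n)
    x_i -= x_i.mean()                       # centered column, I = {i}
    # Two columns indexed by J, deliberately correlated with x_i.
    x_j = [0.7 * x_i + rng.normal(size=n) for _ in range(2)]
    y = rng.normal() * x_i + rng.normal(size=n)

    r_iy = np.corrcoef(x_i, y)[0, 1]
    _, R_J = fit(y, x_j)                    # J_R
    beta_short, _ = fit(y, [x_i])           # model indexed by I
    beta_long, _ = fit(y, x_j + [x_i])      # model indexed by J, I

    if R_J < abs(r_iy) - 1e-9:              # premise, with a small tolerance
        checked += 1
        if np.sign(beta_long[-1]) != np.sign(beta_short[0]):
            violations += 1

print(checked, "trials satisfied the premise;", violations, "sign changes")
```

A simulation of this kind cannot prove anything; it only fails to find a counterexample under the stated assumptions.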
The conclusion is modest, but similar reasoning has been used in the past (see [3] for an early precedent).‡ Also, the main premise is simple, and when true it can be readily verified. This is especially apparent when $J = \{j\}$, because ${}_{j}R = |r(\mathbf{x}_j, \mathbf{y})|$ and correlations are well understood. Conceivably, specialists could verify the truth of this simpler premise, even in the absence of data, with an argument based on existing subject-matter knowledge. However, caution is advised in general, as the combined effect of a set of multiple weak correlates can be greater than expected (see Table 1.2).

† An educational article in Occupational & Environmental Medicine speaks of a trend toward increasing use of (multiple) regression, as opposed to stratification, within epidemiology, as a means to control for confounding and arrive at an adjusted estimate ${}_{I}\hat{\beta}_i$ for the unique effect of $X_i$ on $Y$ [5]. However, there appears to be a belief that in order to confound the estimate ${}_{I}\hat{\beta}_i$, a lurking variable $X_j$ must correlate with both $X_i$ and $Y$ [6].

y    √ √2 − 2 0 0
x1   √ √2 −√ 2 2 √2 −2 2
x2   1 √1 −5√ 2 − 1 5 2−1
x3   √ 5 + √2 5√− 2 √2 − 5 − 2−5

Table 1.1: A contrived dataset where $\mathbf{x}_3$ is uncorrelated with both $\mathbf{x}_1$ and $\mathbf{y}$, yet $\operatorname{sign}({}_{1,2,3}\hat{\beta}_1) \neq \operatorname{sign}({}_{1,2}\hat{\beta}_1)$.

2 The Yule–Simpson effect

In this section it is shown how the content of Theorem 1.1 is related to the Yule–Simpson effect. A few descriptive examples of the Yule–Simpson effect are stated in order to provide context. The terms Yule–Simpson effect and Simpson’s paradox are used interchangeably.

Clifford H. Wagner, writing in The American Statistician, introduces Simpson’s paradox as the designation for a surprising situation that may occur when two populations are compared with respect to the incidence of some attribute: If the populations are separated in parallel into a set of descriptive categories, the population with higher overall incidence may yet exhibit a lower incidence within each such category [4].
In a medical context differing terminology may be used. The following example is taken from the British Medical Journal.

Open surgery (1972-80) had a success rate of 78% (273/350) while percutaneous nephrolithotomy (1980-5) had a success rate of 83% (289/350), an improvement over the use of open surgery. However, the success rates looked rather different when stone diameter was taken into account. This showed that, for stones of less than 2 cm, 93% (81/87) of cases of open surgery were successful compared with just 83% (234/270) of cases of percutaneous nephrolithotomy. Likewise, for stones of more than 2 cm, success rates of 73% (192/263) and 69% (55/80) were observed for open surgery and percutaneous nephrolithotomy respectively [15].

‡ For papers containing similar mathematical results see any of [7, 8, 9]. For scientific papers where Theorem 1.1 could apply see any of [10, 11, 12, 13, 14].

y    √ √ 2 + √ √3 − 2√ + 3 −√ 3 − 3
x1   √ 2 √2 −2√ 2 2 √2 −2 2
x2   √ √ √2 √3 + − √ 2 3 + 2 √2 − −2 2 −
x3   √ √ √2 √3 + − 2√ 3 + −2√ 2 − 2 2−

Table 1.2: A contrived data set illustrating how the confounding potential of $\mathbf{x}_2$ and $\mathbf{x}_3$ combined can be greater than expected. For all $\varepsilon \neq 0$, ${}_{1}\hat{\beta}_1 = 0.5$ and ${}_{1,2,3}\hat{\beta}_1 = -1.0$, yet as $\varepsilon \downarrow 0$ both ${}_{2}R \downarrow 0$ and ${}_{3}R \downarrow 0$, while ${}_{2,3}R \equiv 0.75 > 0.5 = {}_{1}R$. Incidentally, ${}_{1}\hat{\beta}_1 = {}_{1,2,3}\hat{\beta}_1 = 0.5$ when $\varepsilon = 0$.

A mathematical definition for the Yule–Simpson effect is proposed in Table 2.1. Such a definition, however, fails to apply to a well-known historical example of the Yule–Simpson effect, described in Science as follows:

Examination of aggregate data on graduate admissions to the University of California, Berkeley, for fall 1973 shows a clear but misleading pattern of bias against female applicants. . . .
If the data is properly pooled, taking into account the autonomy of departmental decision making, thus correcting for the tendency of women to apply to graduate departments that are more difficult for applicants of either sex to enter, there is a small but statistically significant bias in favor of women [16].

The details show that not every department had a higher acceptance rate for females. Nonetheless, the authors still choose to describe the reversal as “a paradox, sometimes referred to as Simpson’s” [16]. To accommodate their use of the term, and other similar usage (see [17]), an alternative definition of the Yule–Simpson effect should be considered.

The terminology of linear modeling can apply. Let $Y$ indicate the presence or absence of an attribute, taking the values one or zero. Let $X_i$ indicate membership within one population or another, taking the values zero or one. Let the $s$ indicator variables $X_{j_1}, X_{j_2}, \ldots, X_{j_s}$ together indicate category. With $I = \{i\}$ and $J = \{j_1, j_2, \ldots, j_s\}$, the Yule–Simpson effect can be understood as a special instance of the following more general phenomenon.

Definition 2.1. Let $I$ index a subset of explanatory variables and let $J$ index a disjoint subset of explanatory variables. We say that $J$ has induced a reversal of $\hat{\beta}_i$ if $\operatorname{sign}({}_{J,I}\hat{\beta}_i) \neq \operatorname{sign}({}_{I}\hat{\beta}_i)$.

In light of Definition 2.1 we can interpret Theorem 1.1 as providing conditions that preclude the possibility of a reversal.

               category 1   category 2   ···   category s
population 1   a_1/b_1      a_2/b_2      ···   a_s/b_s
population 2   c_1/d_1      c_2/d_2      ···   c_s/d_s

Table 2.1: The Yule–Simpson effect occurs when $\frac{\sum_j a_j}{\sum_j b_j} > \frac{\sum_j c_j}{\sum_j d_j}$, yet $\frac{a_j}{b_j} < \frac{c_j}{d_j}$ for all $j$.

3 The strongest correlate

To further demonstrate the practical nature of Theorem 1.1, we state here a corollary that asserts the dominance, or importance, of the strongest observed correlate. An explicit application to the population-level study of the effects of nutrition on disease is then discussed.

Corollary 3.1.
For $i \notin J$ we have
$$ {}_{J}R < |r(\mathbf{x}_i, \mathbf{y})| \implies \operatorname{sign}({}_{J,i}\hat{\beta}_i) = \operatorname{sign}(r(\mathbf{x}_i, \mathbf{y})). $$

Knowledge of Corollary 3.1 can influence the decision-making process of a descriptive statistician who is charged with the task of summarizing an existing data set. As an example, consider the large data set associated with an ecological study of mortality, biochemistry, diet and lifestyle that was carried out in rural China in the 1980s and early 1990s [18]. County-level consumption rates for various dietary variables were obtained along with county-level rates for heart disease. Unadjusted, bivariate, sample correlations were computed, and the results are displayed in Table 3.1. One may be tempted to dismiss the strong observed association between wheat and heart disease as spurious, because it is conceivable that it would vanish or even reverse after adjustment for confounding factors. However, in light of Corollary 3.1 (here ${}_{J}R = 0.55 < 0.64$), even in the presence of model uncertainty, one may quickly surmise that the data likely do not indicate a protective effect of wheat on heart disease.

dietary variable   observed correlation with heart disease
Cholesterol        -.15
Saturated Fat      -.18
Fish               -.21
Nuts                .01
Salt                .00
Spices              .33
Wheat               .64
Beans              -.33
Fruits             -.03
Vegetables         -.13

Table 3.1: Unadjusted, bivariate, sample correlation values for select variables; with $J$ indexing every explanatory variable except wheat, ${}_{J}R = 0.55$.

4 Mathematics

In this section the mathematics that allows for a proof of Theorem 1.1 is presented. Notation is as close to “standard” as possible, while detailed enough for clear development of the theory. There is a geometric flavor to the definitions that is best embraced before moving on to the lemmas and propositions. A solid understanding of the propositions leads to a thorough understanding of the proof for the theorem.

4.1 Notation

The existence of a general dataset as depicted in Table 4.1 is assumed. There are $n$ observations, each $m$-dimensional.
Let $I$ index a subset of $\{1, 2, \ldots, m\}$, $J$ index a disjoint subset, and $K$ index a generic subset. Let $i$ stand for a generic element of $I$, $j$ stand for a generic element of $J$, and $k$ stand for a generic element of $K$. Bold symbols indicate observed vectors of data within $\mathbb{R}^n$. Also, $\langle \cdot, \cdot \rangle$ is used for the standard inner product, $|\cdot|$ for the associated Euclidean norm, and $\perp$ to indicate orthogonality. With $\mathbf{e}$ denoting a vector of $n$ ones, the vectors $\{\mathbf{e}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m\}$ are assumed to be linearly independent. The span of $\mathbf{e}$ and a subset of vectors indexed by $K$ is a vector subspace denoted by ${}_{K}V$. For every $K$, both $\mathbf{y} \notin {}_{K}V$ and $\mathbf{y} \not\perp {}_{K}V$ are assumed. In general, $V$ stands for a vector subspace. Also, pre-subscripts indicate a subset of explanatory variables, and a post-subscript typically indicates a variable of interest.

y     x_1     x_2     ...   x_m
y_1   x_1,1   x_2,1   ...   x_m,1
y_2   x_1,2   x_2,2   ...   x_m,2
y_3   x_1,3   x_2,3   ...   x_m,3
...   ...     ...     ...   ...
y_n   x_1,n   x_2,n   ...   x_m,n

Table 4.1: A sufficiently general data set that illustrates the notation.

4.2 Definitions

In this subsection $K = \{k_1, k_2, \ldots, k_p\}$.

Definition 4.1. Denote the projection of $\mathbf{y}$ onto $V$ by
$$ p_V(\mathbf{y}) = \operatorname*{argmin}_{\mathbf{v} \in V} |\mathbf{y} - \mathbf{v}|. $$

Definition 4.2. The vector of fitted coefficients, $({}_{K}\hat{\beta}_0, {}_{K}\hat{\beta}_{k_1}, {}_{K}\hat{\beta}_{k_2}, \ldots, {}_{K}\hat{\beta}_{k_p})$, is the unique solution of
$$ p_{{}_{K}V}(\mathbf{y}) = {}_{K}\hat{\beta}_0 \mathbf{e} + {}_{K}\hat{\beta}_{k_1} \mathbf{x}_{k_1} + {}_{K}\hat{\beta}_{k_2} \mathbf{x}_{k_2} + \cdots + {}_{K}\hat{\beta}_{k_p} \mathbf{x}_{k_p}. $$

Definition 4.3. ${}_{K}y$ is the function ${}_{K}y : \mathbb{R}^p \to \mathbb{R}$ given by
$$ {}_{K}y : (\alpha_{k_1}, \alpha_{k_2}, \ldots, \alpha_{k_p}) \mapsto {}_{K}\hat{\beta}_0 + {}_{K}\hat{\beta}_{k_1} \alpha_{k_1} + {}_{K}\hat{\beta}_{k_2} \alpha_{k_2} + \cdots + {}_{K}\hat{\beta}_{k_p} \alpha_{k_p}. $$

Definition 4.4. The $q$th fitted value is ${}_{K}\hat{y}_q = {}_{K}y(x_{k_1,q}, x_{k_2,q}, \ldots, x_{k_p,q})$.

Definition 4.5. The vector of fitted values is ${}_{K}\hat{\mathbf{y}} = ({}_{K}\hat{y}_1, {}_{K}\hat{y}_2, \ldots, {}_{K}\hat{y}_n)$.

Remark 4.1. Within $\mathbb{R}^n$, ${}_{K}\hat{\mathbf{y}} = p_{{}_{K}V}(\mathbf{y})$.

Definition 4.6. Define ${}_{K}R$ as the positive square root of the coefficient of determination:
$$ {}_{K}R = +\sqrt{{}_{K}R^2} = +\sqrt{\frac{\sum_{q=1}^{n} ({}_{K}\hat{y}_q - \bar{y})^2}{\sum_{q=1}^{n} (y_q - \bar{y})^2}}. $$

Definition 4.7.
For generic vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, and with $s$ denoting the sample standard deviation, define the Pearson correlation coefficient $r$ as
$$ r(\mathbf{x}, \mathbf{y}) = \frac{1}{n-1} \sum_{q=1}^{n} \left(\frac{x_q - \bar{x}}{s_x}\right) \left(\frac{y_q - \bar{y}}{s_y}\right). $$

4.3 Geometry

The following lemmas are stated without proof, as they can be surmised to be true or derived from the material in books on mathematical analysis (e.g. Cheney’s text [19]). See the appendix for a proof of Proposition 4.1.

Lemma 4.1. For any $\mathbf{y}$ and for any $V$, $(\mathbf{y} - p_V(\mathbf{y})) \perp V$.

Lemma 4.2. For any $\mathbf{y}$ and for any $V$, $|p_V(\mathbf{y})|^2 + |\mathbf{y} - p_V(\mathbf{y})|^2 = |\mathbf{y}|^2$.

Lemma 4.3. For any vectors $\mathbf{x}$, $\mathbf{y}$: $\mathbf{x} \perp \mathbf{y} \implies |\mathbf{x}|^2 + |\mathbf{y}|^2 = |\mathbf{x} + \mathbf{y}|^2$.

Lemma 4.4. For $V_1 \perp V_2$ and $V = \operatorname{span}\{V_1, V_2\}$, $p_V(\mathbf{y}) = p_{V_1}(\mathbf{y}) + p_{V_2}(\mathbf{y})$.

Definition 4.8. For nonzero vectors $\mathbf{y} \in \mathbb{R}^n$ and $\mathbf{v} \in V$, define $\theta(\mathbf{y}, \mathbf{v})$, with $0 \leq \theta \leq \pi$, via
$$ \cos(\theta) = \frac{\langle \mathbf{y}, \mathbf{v} \rangle}{|\mathbf{y}|\,|\mathbf{v}|}. $$

Proposition 4.1. Let $V$ be a vector subspace of $\mathbb{R}^n$. For a fixed vector $\mathbf{y} \notin V$, with $\mathbf{y} \not\perp V$, and for a fixed, nonzero vector $\mathbf{w} \in V$:
(i) If $\mathbf{w}$ has a component orthogonal to $p_V(\mathbf{y})$, then $\theta(\mathbf{y}, p_V(\mathbf{y}) + t\mathbf{w})$ is a strictly increasing function of $t > 0$.
(ii) Otherwise, $\theta(\mathbf{y}, p_V(\mathbf{y}) + t\mathbf{w})$ is nondecreasing on $\{t : t > 0,\ p_V(\mathbf{y}) + t\mathbf{w} \neq \mathbf{0}\}$.

4.4 Simplifications

Proofs of the propositions in this section are left to the reader.

Definition 4.9. A vector of data $\mathbf{x}$ is centered if $\bar{x} = 0$.

Definition 4.10. A vector of data $\mathbf{x}$ is geometrically standardized if $\bar{x} = 0$ and $|\mathbf{x}| = 1$.

Definition 4.11. Given a vector of data $\mathbf{x}$, we use the term standardization to describe the process
$$ \mathbf{x} \mapsto \frac{\mathbf{x} - \bar{x}\mathbf{e}}{|\mathbf{x} - \bar{x}\mathbf{e}|}. $$

Remark 4.2. Standardization results in geometrically standardized data.

Proposition 4.2. Standardization preserves the orthogonality of a set of centered vectors.

Proposition 4.3. For any $K$, standardization preserves the signs of $\{{}_{K}\hat{\beta}_k\}_{k \in K}$ and the value of ${}_{K}R$.

Proposition 4.4. For any $K$, if the data is geometrically standardized, then ${}_{K}\hat{\beta}_0 = 0$.

Proposition 4.5. For any $K$, if the data is geometrically standardized, then ${}_{K}R = \cos(\theta(\mathbf{y}, p_{{}_{K}V}(\mathbf{y}))) = |p_{{}_{K}V}(\mathbf{y})|$.
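Proposition 4.6. For $k = 1, 2, \ldots, m$, ${}_{k}R = |r(\mathbf{x}_k, \mathbf{y})|$.

Proposition 4.7. For any disjoint indexing sets $I$ and $J$, and for any $i \in I$, if $I$ indexes orthogonal vectors of data, then $\operatorname{sign}({}_{J,I}\hat{\beta}_i) = \operatorname{sign}({}_{J,i}\hat{\beta}_i)$.

4.5 Proof of Theorem 1.1

By Proposition 4.3, geometrically standardized data can be assumed, and by Proposition 4.2, orthogonality of the vectors indexed by $I$ is retained. Proposition 4.6 allows us to state the contrapositive of the implication from Theorem 1.1 as
$$ \operatorname{sign}({}_{J,I}\hat{\beta}_i) \neq \operatorname{sign}({}_{I}\hat{\beta}_i) \implies {}_{J}R > {}_{i}R. $$
By Proposition 4.7 it suffices to demonstrate
$$ \operatorname{sign}({}_{J,i}\hat{\beta}_i) \neq \operatorname{sign}({}_{i}\hat{\beta}_i) \implies {}_{J}R > {}_{i}R. $$

The hypothesis, $\operatorname{sign}({}_{J,i}\hat{\beta}_i) \neq \operatorname{sign}({}_{i}\hat{\beta}_i)$, implies that within ${}_{J,i}V$, the subspace ${}_{J}V$ separates $p_{{}_{J,i}V}(\mathbf{y})$ from $p_{{}_{i}V}(\mathbf{y})$. Thus the straight line from $p_{{}_{J,i}V}(\mathbf{y})$ to $p_{{}_{i}V}(\mathbf{y})$ intersects ${}_{J}V$ at a point $\mathbf{q}$. Consider the two-stage path, along two straight line segments: from $p_{{}_{J}V}(\mathbf{y})$ to $\mathbf{q}$ within ${}_{J}V$, and then from $\mathbf{q}$ to $p_{{}_{i}V}(\mathbf{y})$ within ${}_{J,i}V$. Using Proposition 4.1 we can conclude∗ that
$$ \theta(\mathbf{y}, p_{{}_{J}V}(\mathbf{y})) \leq \theta(\mathbf{y}, \mathbf{q}) < \theta(\mathbf{y}, p_{{}_{i}V}(\mathbf{y})). $$
Application of the cosine function reverses the ordering, resulting in
$$ \cos(\theta(\mathbf{y}, p_{{}_{J}V}(\mathbf{y}))) \geq \cos(\theta(\mathbf{y}, \mathbf{q})) > \cos(\theta(\mathbf{y}, p_{{}_{i}V}(\mathbf{y}))). $$
Proposition 4.5 allows us to substitute ${}_{J}R$ for $\cos(\theta(\mathbf{y}, p_{{}_{J}V}(\mathbf{y})))$ and ${}_{i}R$ for $\cos(\theta(\mathbf{y}, p_{{}_{i}V}(\mathbf{y})))$, and the result is ${}_{J}R > {}_{i}R$.

∗ We have assumed in Section 4.1 that for any $K$, $\mathbf{y} \not\perp {}_{K}V$, which implies, even for geometrically standardized data, and again for any $K$, that ${}_{K}\hat{\beta}_i \neq 0$. Also, if $p_{{}_{i}V}(\mathbf{y}) - p_{{}_{J,i}V}(\mathbf{y})$ is a scalar multiple of $p_{{}_{J,i}V}(\mathbf{y})$, then $\mathbf{q} = \mathbf{0}$ and Proposition 4.4 ensures that $p_{{}_{J,i}V}(\mathbf{y})$ is a scalar multiple of $\mathbf{x}_i$. This contradicts either ${}_{J,i}\hat{\beta}_i \neq 0$ or ${}_{J,i}\hat{\beta}_j \neq 0$ for $j \in J$. Thus we conclude that $p_{{}_{i}V}(\mathbf{y}) - p_{{}_{J,i}V}(\mathbf{y})$ is not a scalar multiple of $p_{{}_{J,i}V}(\mathbf{y})$, and we are justified in using part (i) of Proposition 4.1 along the first segment. Finally, note that Proposition 4.1 applies along the second segment because the segment lies along a ray emanating from $p_{{}_{J,i}V}(\mathbf{y})$.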
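The identities from Section 4.4 that the proof relies on (Propositions 4.4, 4.5 and 4.6) can be checked numerically. The Python sketch below does so on simulated data; the helper `standardize`, the random data, and the choice of three explanatory variables are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def standardize(x):
    """Center x and scale it to unit Euclidean norm (Definition 4.11)."""
    centered = x - x.mean()
    return centered / np.linalg.norm(centered)

rng = np.random.default_rng(1)
n = 20
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

# Geometrically standardized data.
Xs = np.column_stack([standardize(X[:, k]) for k in range(3)])
ys = standardize(y)

# Projection of ys onto V = span{e, x_1, x_2, x_3}.
design = np.column_stack([np.ones(n), Xs])
coef, *_ = np.linalg.lstsq(design, ys, rcond=None)
proj = design @ coef
intercept = coef[0]                                   # Proposition 4.4

# The three quantities that Proposition 4.5 asserts are equal.
R = np.sqrt(np.sum((proj - ys.mean()) ** 2) / np.sum((ys - ys.mean()) ** 2))
cos_theta = (ys @ proj) / (np.linalg.norm(ys) * np.linalg.norm(proj))
norm_proj = np.linalg.norm(proj)

# Proposition 4.6 for k = 1: the single-variable R equals |r(x_1, y)|.
design1 = np.column_stack([np.ones(n), Xs[:, 0]])
coef1, *_ = np.linalg.lstsq(design1, ys, rcond=None)
R_1 = np.linalg.norm(design1 @ coef1)
r_1 = np.corrcoef(X[:, 0], y)[0, 1]

print(R, cos_theta, norm_proj)
print(R_1, abs(r_1))
```

The correlation `r_1` is computed from the raw columns, since the correlation coefficient is unchanged by centering and positive rescaling.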
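Appendix: Proof of Proposition 4.1

For part (ii), with $\alpha \neq 0$ and $\mathbf{w} = \alpha p_V(\mathbf{y})$, it suffices to show that
$$ \cos(\theta) = \frac{\langle \mathbf{y}, p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y})) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y}))|} $$
is nonincreasing on $\{t : t > 0,\ t \neq -1/\alpha\}$. For $\alpha > 0$,
$$ \frac{\langle \mathbf{y}, p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y})) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y}))|} = \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y})|} = \frac{(1+t\alpha)}{(1+t\alpha)} \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y})|} = \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y})|}, $$
which is constant. For $\alpha < 0$, when $t < -1/\alpha$,
$$ \frac{\langle \mathbf{y}, p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y})) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y}))|} = \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y})|} = \frac{(1+t\alpha)}{(1+t\alpha)} \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y})|} = \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y})|}, $$
which is constant, and when $t > -1/\alpha$,
$$ \frac{\langle \mathbf{y}, p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y})) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y}))|} = \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y})|} = \frac{(1+t\alpha)}{-(1+t\alpha)} \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y})|} = -\frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y})|}, $$
which is also constant. Furthermore,
$$ \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y})|} \geq -\frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y})|} $$
because Lemma 4.2 states $|p_V(\mathbf{y})|^2 + |\mathbf{y} - p_V(\mathbf{y})|^2 = |\mathbf{y}|^2$, which expands to give
$$ \langle p_V(\mathbf{y}), p_V(\mathbf{y}) \rangle + \langle \mathbf{y}, \mathbf{y} \rangle - 2\langle \mathbf{y}, p_V(\mathbf{y}) \rangle + \langle p_V(\mathbf{y}), p_V(\mathbf{y}) \rangle = \langle \mathbf{y}, \mathbf{y} \rangle, $$
which implies $\langle \mathbf{y}, p_V(\mathbf{y}) \rangle = \langle p_V(\mathbf{y}), p_V(\mathbf{y}) \rangle \geq 0$.

For part (i), with $\alpha \in \mathbb{R}$, write $\mathbf{w} = \alpha p_V(\mathbf{y}) + \mathbf{u}$, where $\mathbf{u} \perp p_V(\mathbf{y})$. Then $\cos(\theta)$ becomes
$$ \frac{\langle \mathbf{y}, p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y}) + \mathbf{u}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y}) + t(\alpha p_V(\mathbf{y}) + \mathbf{u})|} = \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u} \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u}|} = \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle + \langle \mathbf{y}, t\mathbf{u} \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u}|}. $$
The $\langle \mathbf{y}, t\mathbf{u} \rangle$ term can be dropped since
$$ \langle \mathbf{y}, t\mathbf{u} \rangle = t\langle \mathbf{y}, \mathbf{u} \rangle = t\langle p_V(\mathbf{y}) + (\mathbf{y} - p_V(\mathbf{y})), \mathbf{u} \rangle = t\langle p_V(\mathbf{y}), \mathbf{u} \rangle + t\langle \mathbf{y} - p_V(\mathbf{y}), \mathbf{u} \rangle = 0 + 0, $$
where the final zero is due to Lemma 4.1. Thus, it suffices to show that
$$ \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u}|} $$
is decreasing for $t > 0$.

Lemma .5. For $(1+t\alpha) \neq 0$, $t/(1+t\alpha)$ is a strictly increasing function of $t$.

Proof.
$$ \frac{d}{dt}\,\frac{t}{1+t\alpha} = \frac{1(1+t\alpha) - \alpha t}{(1+t\alpha)^2} = \frac{1}{(1+t\alpha)^2} > 0. $$

For $t$ such that $(1+t\alpha) > 0$,
$$ \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u}|} = \frac{\frac{1}{1+t\alpha} \langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{\frac{1}{1+t\alpha} |\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u}|} = \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y}) + t\mathbf{u}/(1+t\alpha)|}. $$
Note that $t/(1+t\alpha)$ is positive because $t > 0$ and $(1+t\alpha) > 0$, and note also that $t/(1+t\alpha)$ is increasing by Lemma .5.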
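Thus, as a consequence of Lemma 4.3, $|p_V(\mathbf{y}) + t\mathbf{u}/(1+t\alpha)|$ is increasing in $t$, which implies that
$$ \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y}) + t\mathbf{u}/(1+t\alpha)|} $$
is decreasing in $t$ as desired.

For $t$ such that $(1+t\alpha) < 0$,
$$ \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u}|} = \frac{\frac{1}{1+t\alpha} \langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{\frac{1}{1+t\alpha} |\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u}|} = \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{-|\mathbf{y}|\,|p_V(\mathbf{y}) + t\mathbf{u}/(1+t\alpha)|}. $$
Note that $t/(1+t\alpha)$ is negative because $t > 0$ and $(1+t\alpha) < 0$, and note also that $t/(1+t\alpha)$ is increasing by Lemma .5. Thus, as a consequence of Lemma 4.3, $|p_V(\mathbf{y}) + t\mathbf{u}/(1+t\alpha)|$ is decreasing in $t$, so that $-|\mathbf{y}|\,|p_V(\mathbf{y}) + t\mathbf{u}/(1+t\alpha)|$ is increasing in $t$, which implies that
$$ \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{-|\mathbf{y}|\,|p_V(\mathbf{y}) + t\mathbf{u}/(1+t\alpha)|} $$
is decreasing in $t$ as desired.

For $t$ such that $(1+t\alpha) = 0$, note that $\alpha < 0$, so that $0 < t < -1/\alpha \iff (1+t\alpha) > 0$, $t = -1/\alpha \iff (1+t\alpha) = 0$, and $t > -1/\alpha \iff (1+t\alpha) < 0$. Note also that since $\mathbf{y} \notin V$ and $\mathbf{y} \not\perp V$, Lemma 4.2 implies not only $\langle \mathbf{y}, p_V(\mathbf{y}) \rangle \geq 0$ as derived previously, but also the strict inequality $\langle \mathbf{y}, p_V(\mathbf{y}) \rangle > 0$. Thus for $\{(t_1, t_2, t_3) : 0 < t_1 < t_2 = -1/\alpha < t_3 < \infty\}$,
$$ \frac{\langle \mathbf{y}, (1+t_1\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t_1\alpha) p_V(\mathbf{y}) + t_1\mathbf{u}|} = \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|p_V(\mathbf{y}) + t_1\mathbf{u}/(1+t_1\alpha)|} > 0, $$
$$ \frac{\langle \mathbf{y}, (1+t_2\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t_2\alpha) p_V(\mathbf{y}) + t_2\mathbf{u}|} = 0, $$
and
$$ \frac{\langle \mathbf{y}, (1+t_3\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t_3\alpha) p_V(\mathbf{y}) + t_3\mathbf{u}|} = \frac{\langle \mathbf{y}, p_V(\mathbf{y}) \rangle}{-|\mathbf{y}|\,|p_V(\mathbf{y}) + t_3\mathbf{u}/(1+t_3\alpha)|} < 0. $$
This shows that
$$ \frac{\langle \mathbf{y}, (1+t\alpha) p_V(\mathbf{y}) \rangle}{|\mathbf{y}|\,|(1+t\alpha) p_V(\mathbf{y}) + t\mathbf{u}|} $$
must be decreasing at any positive $t$ satisfying $(1+t\alpha) = 0$ as desired.

References

[1] Chatfield C. Model Uncertainty, Data Mining and Statistical Inference. Journal of the Royal Statistical Society: Series A 1995; 158, Part 3, pp. 419–466.
[2] Kurth D, Sonis J. Assessment and Control of Confounding in Trauma Research. Journal of Traumatic Stress October 2007; Vol. 20, No. 5, pp. 807–820.
[3] Cornfield et al. Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute 1959; 22, 173–203.
[4] Wagner CH. Simpson’s Paradox in Real Life. The American Statistician February 1982; 36 (1): 46–48. doi:10.2307/2684093.
[5] McNamee R.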
Regression modeling and other methods to control confounding. Occupational & Environmental Medicine 2004; 62:500-506. doi:10.1136/oem.2002.001115.
[6] Lu CY. Observational studies: a review of study designs, challenges and strategies to reduce confounding. The International Journal of Clinical Practice May 2009; Blackwell Publishing Ltd. 63, 5, 691-697.
[7] Rosenbaum PR, Rubin DB. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society, Series B 1983; 11, 212-218.
[8] Lin DY, Psaty BM, Kronmal RA. Assessing the Sensitivity of Regression Results to Unmeasured Confounders in Observational Studies. Biometrics September 1998; 54, 948-963.
[9] Hosman CA, Hansen BB, Holland PW. The sensitivity of linear regression coefficients’ confidence limits to the omission of a confounder. The Annals of Applied Statistics 2010; Vol. 4, No. 2, 849-870.
[10] Davis et al. Rice Consumption and Urinary Arsenic Concentrations in U.S. Children. Environmental Health Perspectives October 2012; Vol. 120, Issue 10, pp. 1418-1424.
[11] Jungert et al. Serum 25-hydroxyvitamin D3 and body composition in an elderly cohort from Germany: a cross-sectional study. Nutrition & Metabolism 2012; 9:42. http://www.nutritionandmetabolism.com/content/9/1/42
[12] Nelson et al. Daily physical activity predicts degree of insulin resistance: a cross-sectional observational study using the 2003–2004 National Health and Nutrition Examination Survey. International Journal of Behavioral Nutrition and Physical Activity 2013; 10:10. http://www.ijbnpa.org/content/10/1/10.
[13] Lignell et al. Prenatal exposure to polychlorinated biphenyls and polybrominated diphenyl ethers may influence birth weight among infants in a Swedish cohort with background exposure: a cross-sectional study. Environmental Health 2013; 12:44. http://www.ehjournal.net/content/12/1/44.
[14] Cervellati et al.
Bone mass density selectively correlates with serum markers of oxidative damage in post-menopausal women. Clinical Chemistry and Laboratory Medicine 2012; Volume 51, Issue 2, pp. 333-338.
[15] Julious SA, Mullee MA. Confounding and Simpson’s paradox. British Medical Journal 12/03/1994; 309 (6967): 1480–1481.
[16] Bickel PJ, Hammel EA, O’Connell JW. Sex Bias in Graduate Admissions: Data From Berkeley. Science 1975; 187 (4175): 398–404.
[17] Appleton DR, French JM, Vanderpump M. Ignoring a Covariate: an Example of Simpson’s Paradox. The American Statistician 1996; Volume 50, Issue 4, 340-341.
[18] Chen et al. Geographic study of mortality, biochemistry, diet and lifestyle in rural China. 1975; Retrieved August 21, 2003, from University of Oxford’s Epidemiological Studies Unit website; http://www.ctsu.ox.ac.uk/ china/monograph/.
[19] Cheney W. Analysis for Applied Mathematics. Springer: New York, Berlin, Heidelberg, Barcelona, Hong Kong, London, Milan, Paris, Singapore, Tokyo, 2001.