Detecting Outliers

• An outlier is a measurement that appears to be very different from neighboring observations.
• In the univariate case, with an adequate sample size and assuming that normality holds, an outlier can be detected by:
  1. Standardizing the n measurements so that they are approximately N(0, 1).
  2. Flagging observations with standardized values below −3.5 or above 3.5, or thereabouts.
• In p dimensions, detecting outliers is not so easy. A sample unit that may not appear to be an outlier in any of the marginal distributions can still be an outlier relative to the multivariate distribution.

Detecting outliers (cont'd)

[Figure not reproduced.]

Steps for detecting outliers

1. Investigate all univariate marginal distributions visually, and construct the standardized values z_ij = (x_ij − x̄_j)/√σ_jj for the i-th sample unit and j-th variable.
2. If p is moderate, construct all bivariate scatter plots. There are p(p − 1)/2 of them.
3. For each sample unit, calculate the squared distance d²_i = (x_i − x̄)′ S⁻¹ (x_i − x̄), where x_i is the p × 1 vector of measurements on the i-th sample unit.
4. To decide whether d²_i is 'extreme', recall that the d²_i are approximately χ²_p. For example, if n = 100, we would expect about 5 squared distances larger than the 0.95 percentile of the χ²_p distribution. (A code sketch of steps 3 and 4 appears below, after the Box-Cox derivation.)

Example: Stiffness of lumber boards

• Recall that four measurements were taken on each of 30 boards.
• Data are shown in Table 4.4, along with the four columns of standardized values and the 30 squared distances.
• Note that boards 9 and 16 have unusually large d², of 12.26 and 16.85, respectively. The 0.975 and 0.995 percentiles of a χ²_4 distribution are 11.14 and 14.86.
• In a sample of size 30, we would expect about 30 × 0.025 = 0.75 observations with d² > 11.14 and about 30 × 0.005 = 0.15 observations with d² > 14.86.
• Unit 16, with d² = 16.85, is not flagged as an outlier when we consider only the univariate standardized measurements.

Example: Stiffness of boards (cont'd)

[Figure not reproduced.]

Transformations to near normality

• If observations show gross departures from normality, it might be necessary to transform some of the variables to near normality.
• The following suggestions were originally designed to stabilize variances, but they are also commonly used to transform data to near normality:

    Original scale          Transformed scale
    Right-skewed data       log(x)
    Counts x                √x
    Proportions p̂           arcsin(√p̂)
    Correlations r          Fisher's z(r) = (1/2) log[(1 + r)/(1 − r)]

The Box-Cox transformation

• Proposed by Box and Cox in a 1964 JRSS(B) article.
• The Box-Cox transformation is a member of the family of power transformations:

    x^(λ) = (x^λ − 1)/λ,   λ ≠ 0,
    x^(λ) = log(x),        λ = 0.

• To estimate λ we do the following:
  1. Assume that x^(λ) ∼ N(µ, σ²) for some unknown λ.
  2. Do a change of variables to obtain the likelihood with respect to x.
  3. Maximize the resulting log-likelihood for λ.

Box-Cox transformation (cont'd)

• If x^(λ) ∼ N(µ, σ²), then with respect to the untransformed observations the resulting log-likelihood is

    L(µ, σ², λ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x_i^(λ) − µ)² + (λ − 1) Σ_{i=1}^n log(x_i),

  where the last term is the sum of the logarithms of the Jacobians |d x_i^(λ)/d x_i| of the transformation.
• Substituting the MLEs for µ and σ² we get, as a function of λ alone,

    l(λ) = −(n/2) log[ (1/n) Σ_{i=1}^n (x_i^(λ) − x̄^(λ))² ] + (λ − 1) Σ_{i=1}^n log(x_i),

  where x̄^(λ) is the mean of the transformed values.
• The best power transformation λ̂ is the one that maximizes the expression above.
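Before turning to computation, here is a minimal sketch of steps 3 and 4 of the outlier-screening procedure above (referenced there). It assumes the data sit in an n × p numpy array X; the random stand-in data, the squared_distances helper name, and the choice of the 0.975 cutoff are illustrative, not part of the transparencies.

    import numpy as np
    from scipy import stats

    def squared_distances(X):
        # d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar), one value per sample unit
        xbar = X.mean(axis=0)
        S = np.cov(X, rowvar=False)            # sample covariance matrix
        Sinv = np.linalg.inv(S)
        D = X - xbar
        return np.einsum('ij,jk,ik->i', D, Sinv, D)

    # illustrative stand-in; in the lumber example X would be the 30 x 4 stiffness data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 4))
    d2 = squared_distances(X)
    cutoff = stats.chi2.ppf(0.975, df=X.shape[1])   # 0.975 percentile of chi^2_p
    print(np.where(d2 > cutoff)[0])                 # indices of flagged units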
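To make the maximization concrete, the following sketch evaluates the profile log-likelihood l(λ) just derived over a grid of λ values, which is the strategy described on the next transparency. The lognormal sample is a stand-in for real data such as the 42 radiation readings, and boxcox_loglik is a helper name chosen here.

    import numpy as np

    def boxcox_loglik(lam, x):
        # l(lambda) from the transparency above; additive constants dropped
        n = len(x)
        xt = np.log(x) if lam == 0 else (x**lam - 1.0) / lam
        s2 = np.mean((xt - xt.mean())**2)      # MLE of sigma^2 for this lambda
        return -0.5 * n * np.log(s2) + (lam - 1.0) * np.log(x).sum()

    rng = np.random.default_rng(1)
    x = rng.lognormal(size=42)                 # positive data; stand-in for Example 4.10
    grid = np.arange(-1.0, 1.5001, 0.10)       # 26 values of lambda in (-1, 1.5)
    lam_hat = max(grid, key=lambda lam: boxcox_loglik(lam, x))

In practice one would use built-in routines: scipy.stats.boxcox and scipy.stats.boxcox_llf in Python, or MASS::boxcox in R (applied to a linear model, e.g. lm(x ~ 1)).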
Computing the Box-Cox transformation in practice

• One way to find the MLE of λ (or an approximation to it) is simply to plug a sequence of discrete values of λ, say λ ∈ (−3, 3) or some other range, into the log-likelihood on the previous transparency.
• For each λ we compute the log-likelihood, and we then pick the λ for which the log-likelihood is maximized.
• SAS will also compute the MLE of λ, and so will R.

Radiation in microwave ovens example

• Recall Example 4.10, in which radiation measurements were obtained on 42 ovens.
• The log-likelihood was evaluated for 26 values of λ ∈ (−1, 1.5), in steps of 0.10.

Microwave example (cont'd)

[Figures not reproduced.]

Another way to find a transformation

• If data are normally distributed, the ordered observations plotted against the quantiles expected under normality will fall on a straight line.
• Consider a sequence of values λ_1, λ_2, ..., λ_k, and for each λ_j fit the regression

    x_(i)^(λ_j) = β_0 + β_1 q_(i),j + e_i,

  where the x_(i)^(λ_j) are the transformed and ordered sample values, the q_(i),j are the quantiles corresponding to each transformed, ordered observation under the assumption of normality, and β_0, β_1 are the usual regression coefficients.

Alternative estimation method (cont'd)

• For each of the k regressions (one for each λ_j), compute the MSE.
• The best-fitting model will have the lowest MSE.
• Therefore, the λ that minimizes the MSE of the regression of the sample quantiles on the normal quantiles is 'best'. (A code sketch of this method appears at the end of this section.)
• It can be shown that this is also the λ that maximizes the log-likelihood function shown earlier.

Transforming multivariate observations

• By transforming each of the p measurements individually, we approximate normality of the marginals, but not necessarily of the joint p-dimensional distribution.
• It is possible to proceed as in the Box-Cox approach and maximize the log-likelihood jointly over all p values of λ, but in most cases this is not worth the effort. See the discussion on page 199 of the textbook.

Bivariate transformation in microwave example

• The bivariate log-likelihood was maximized jointly over (λ_1, λ_2). Contours are shown below; the optimum is near (0.16, 0.16).

[Contour plot not reproduced.]
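A minimal sketch of the quantile-regression method from "Another way to find a transformation" above (referenced there). Two details are assumptions on my part, not taken from the transparencies: the plotting position (i − 1/2)/n for the normal quantiles, and the rescaling of the power transform by the geometric mean ġ, i.e. (x^λ − 1)/(λ ġ^(λ−1)), a standard normalization that keeps the MSEs comparable across different λ.

    import numpy as np
    from scipy import stats

    def qq_mse(lam, x):
        # MSE from regressing the transformed, ordered data on normal quantiles
        gm = np.exp(np.log(x).mean())                  # geometric mean g
        if lam == 0:
            xt = gm * np.log(x)
        else:
            xt = (x**lam - 1.0) / (lam * gm**(lam - 1.0))
        xt = np.sort(xt)
        n = len(xt)
        q = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # expected normal quantiles
        b1, b0 = np.polyfit(q, xt, 1)                          # fitted line b0 + b1*q
        return np.mean((xt - (b0 + b1 * q))**2)

    rng = np.random.default_rng(2)
    x = rng.lognormal(size=42)                   # illustrative data
    grid = np.arange(-1.0, 1.5001, 0.10)
    lam_best = min(grid, key=lambda lam: qq_mse(lam, x))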
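Finally, a sketch of the joint search on the last transparency. It assumes the joint log-likelihood replaces the univariate variance with the determinant of the MLE covariance matrix of the transformed data (the form discussed around page 199 of the textbook); the array X, the grid range, and the lognormal stand-in data are illustrative.

    import numpy as np

    def joint_loglik(lams, X):
        # joint Box-Cox log-likelihood for an n x p matrix of positive data;
        # assumes l = -(n/2) log|S(lambda)| + sum_j (lambda_j - 1) sum_i log x_ij
        n = X.shape[0]
        Xt = np.column_stack([
            np.log(X[:, j]) if lam == 0 else (X[:, j]**lam - 1.0) / lam
            for j, lam in enumerate(lams)
        ])
        S = np.cov(Xt, rowvar=False, bias=True)    # MLE covariance (divides by n)
        jac = sum((lam - 1.0) * np.log(X[:, j]).sum() for j, lam in enumerate(lams))
        return -0.5 * n * np.log(np.linalg.det(S)) + jac

    rng = np.random.default_rng(3)
    X = rng.lognormal(size=(42, 2))                # stand-in for the two radiation variables
    grid = np.arange(-0.5, 1.0001, 0.05)
    ll = np.array([[joint_loglik((a, b), X) for b in grid] for a in grid])
    i, j = np.unravel_index(ll.argmax(), ll.shape)
    print(grid[i], grid[j])                        # grid optimum (lambda1, lambda2)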