On the estimation of linear models with interval-valued data

González-Rodríguez, Gil¹, Colubi, Ana¹, Coppi, Renato² and Giordani, Paolo²

¹ Dpto. de Estadística e I.O., Universidad de Oviedo, 33007 Oviedo, Spain. {gil, colubi}@uniovi.es
² Dipartimento di Statistica, Probabilità e Statistiche Applicate, Università degli Studi di Roma "La Sapienza". {renato.coppi, paolo.giordani}@uniroma1.it

Summary. The problem of estimating the parameters of a linear regression model between interval-valued random sets, based on an interval-arithmetic approach, is analyzed. We show empirically that, contrary to what happens in the case of real-valued variables, the least squares optimal solutions for the affine transformation model do not produce "suitable" estimates for the simple linear regression model. Concretely, we propose new estimators for the simple linear regression model and we show some simulations which suggest that the solution for the affine transformation model does not provide the optimal estimates, since the new ones lead to a lower mean square error.

Key words: simple linear regression model, interval-valued data, interval arithmetic approach, point estimation, least squares method

1 Introduction

In [GLML02] a linear regression model with interval-valued data based on an interval-arithmetic approach was studied, by modelling interval-valued data as values of compact convex random sets. Least squares optimal solutions for the affine transformation model based on a generalized operational metric were obtained. In fact, it was shown that the solutions are mostly unique, although in some special cases two solutions can be considered. The solutions can be expressed in terms of different moments of the sample distribution of the random vector determined by the mid-points and spreads of the random intervals.

Modelling the stochastic behaviour of each observation by means of a simple linear regression model implies some remarkable differences w.r.t. the preceding viewpoint. In this respect, in [GLML02] it is shown that testing linear independence under the assumed underlying linear model reduces the problem to testing hypotheses about the covariance of certain real-valued variables which characterize the interval-valued random elements (concretely, the mid and the spread values).

In this paper we will show empirically that, contrary to what happens in the case of real-valued variables under the usual normality assumption, the least squares optimal solutions for the affine transformation model do not produce "suitable" estimates for the simple linear regression model, due to the special features of the operations between interval-valued data. In Section 2 we introduce some supporting concepts and results. In Section 3 we discuss the solutions for both models. In Section 4 we suggest new estimators for the simple linear regression model. Finally, in Section 5 we show some simulations which suggest that the solution for the affine transformation model does not provide the optimal estimates, since the new ones lead to a lower mean square error.

2 Preliminaries

Let Kc(R) be the class of nonempty compact intervals of R endowed with the semilinear structure induced by the Minkowski addition and the product by a scalar, that is, A + B = {a + b | a ∈ A, b ∈ B} and λA = {λa | a ∈ A} for all A, B ∈ Kc(R) and λ ∈ R.
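These two operations admit a direct computational translation. The following minimal Python sketch (ours, for illustration only; the identifiers are not from the literature) represents an interval A ∈ Kc(R) by the pair (inf A, sup A):

```python
def minkowski_add(A, B):
    """Minkowski addition: A + B = {a + b : a in A, b in B}."""
    return (A[0] + B[0], A[1] + B[1])

def scalar_prod(lam, A):
    """Product by a scalar: lam*A = {lam*a : a in A}.
    The endpoints swap when lam < 0."""
    lo, hi = lam * A[0], lam * A[1]
    return (min(lo, hi), max(lo, hi))

# [1, 2] + [0, 3] = [1, 5] and (-2)[1, 2] = [-4, -2]:
print(minkowski_add((1, 2), (0, 3)))  # (1, 5)
print(scalar_prod(-2, (1, 2)))        # (-4, -2)
```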
In addition, given A, B ∈ Kc(R), if there exists C ∈ Kc(R) such that A = B + C, then C is called the Hukuhara difference A −H B (that is, A −H B = [inf A − inf B, sup A − sup B], which exists whenever inf A − inf B ≤ sup A − sup B).

Given a probability space (Ω, A, P), a mapping X : Ω → Kc(R) is said to be an interval-valued random set (or compact convex random set) if it is A|B_dH-measurable, B_dH denoting the σ-field generated by the topology induced by the well-known Hausdorff metric dH on Kc(R). An interval-valued random set X can be characterized by means of the random vector (mid X, spr X), where mid A = (sup A + inf A)/2 and spr A = (sup A − inf A)/2 denote the mid-point (centre) and the spread (radius) of the interval A ∈ Kc(R), respectively.

Let X : Ω → Kc(R) be an interval-valued random set such that E(|X|; P) < ∞ (with |X|(ω) = sup{|x| : x ∈ X(ω)} for all ω ∈ Ω). Then the expected value of X in Kudō–Aumann's sense (see, e.g., [Aum65]) is, in this case, the interval E_A[X; P] = [E(inf X; P), E(sup X; P)].

Interval-valued random sets represent a suitable model for random elements that are either essentially imprecise (like economic fluctuations and ranges) or associated with certain imprecisely-known variables (like interval-censored times, or variables for which the available data are grouped).

The least squares method in [GLML02] is based on the dW-distance (W being a non-degenerate symmetric probability measure on ([0,1], β_[0,1])), which is defined by

$$d_W(A, B) = \sqrt{\int_{[0,1]} \left[f_A(\lambda) - f_B(\lambda)\right]^2 \, dW(\lambda)}$$

with f_A(λ) = λ sup A + (1 − λ) inf A, for all A, B ∈ Kc(R) (see [BCS95]).

3 Linear model estimates

Let {(x_i, y_i)}_{i=1}^n be a data matrix with (x_i, y_i) ∈ Kc(R) × Kc(R) for all i = 1, ..., n. This matrix can be associated with a realization of a random sample {(X_i, Y_i)}_{i=1}^n obtained from a two-dimensional interval-valued random set (X, Y) : Ω → Kc(R) × Kc(R). We consider two statistical-probabilistic models:

Model 1: (X, Y) : Ω → Kc(R) × Kc(R) with P_(X,Y) ∈ P and Y* = aX + B, with a ∈ R and B ∈ Kc(R).

Model 2: (X, Y) : Ω → Kc(R) × Kc(R) with P_(X,Y) ∈ P and Y|X = aX + ε_X, where a ∈ R and ε_X is a random set with expected value E_A[ε_X; P] = B ∈ Kc(R).

The relationships imposed between the variables in the two models specify a restriction on the family of probability distributions over the sample space, yielding two subfamilies (satisfying the specific restrictions), namely P1 and P2. The inference on the parameters a and B is carried out within P1 and P2, respectively. In this work we are only interested in point estimation.

The estimation problem concerning Model 1 is to find a suitable P* ∈ P1, that is, values a* ∈ R and B* ∈ Kc(R) so that {(x_i, a* x_i + B*)}_{i=1}^n is as close to the data matrix as possible according to a given criterion. On the other hand, the estimation problem concerning Model 2 implies finding P̂ ∈ P2, that is, values â ∈ R and B̂ ∈ Kc(R) so that {(x_i, â x_i + B̂)}_{i=1}^n is as close to the data matrix as possible and y_i −H â x_i = ε̂_i ∈ Kc(R). Thus the viewpoints are different in the following sense: in Model 1 we search for the affine function closest to the data matrix, but the data are not assumed to verify the affine transformation, while in Model 2 we look for the simple linear regression relationship (see [GGCM06] for testing approaches concerning this model); in this case the collected data are supposed to verify a regression model.
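Note that f_A(λ) − f_B(λ) = (mid A − mid B) + (2λ − 1)(spr A − spr B), so, W being symmetric, d²W(A, B) = (mid A − mid B)² + 4σ²_{f[0,1]}(spr A − spr B)², where σ²_{f[0,1]} = ∫_{[0,1]} λ² dW(λ) − .25. The sketch below (an illustration of ours, assuming W is the Lebesgue measure on [0,1], for which σ²_{f[0,1]} = 1/12) computes this distance together with the Hukuhara difference; both operations are used repeatedly in what follows:

```python
import math

def mid(A): return (A[1] + A[0]) / 2   # centre of A = (inf A, sup A)
def spr(A): return (A[1] - A[0]) / 2   # radius of A

def hukuhara_diff(A, B):
    """A -_H B, i.e. the C with A = B + C; it exists iff spr B <= spr A."""
    if spr(B) > spr(A):
        return None                     # the difference is undefined
    return (A[0] - B[0], A[1] - B[1])

def d_W(A, B):
    """d_W for W = Lebesgue measure on [0,1], where
    d_W(A, B)^2 = (mid A - mid B)^2 + (1/3)(spr A - spr B)^2."""
    return math.sqrt((mid(A) - mid(B)) ** 2 + (spr(A) - spr(B)) ** 2 / 3)
```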
In [GLML02] the least squares optimal solution based on the dW-distance for Model 1 has been obtained. In other words, the value a* ∈ R and the interval B* ∈ Kc(R) minimizing

$$\frac{1}{n}\sum_{i=1}^{n} d_W^2\,(Y_i,\, aX_i + B)$$

have been deduced. If X and Y are non-degenerate, the solutions can be expressed as functions of the variances/covariances of mids and spreads, as follows.

Algorithm for the least squares optimal solution for Model 1

Step 1. Compute the values

$$\mu_{f_X} = \overline{\mathrm{mid}\,X}, \qquad \mu_{f_Y} = \overline{\mathrm{mid}\,Y}, \qquad \sigma^2_{f_{[0,1]}} = \int_{[0,1]} \lambda^2\, dW(\lambda) - .25,$$

$$\sigma^2_{f_X} = S^2_{\mathrm{mid}\,X} + 4\,\sigma^2_{f_{[0,1]}}\Big[S^2_{\mathrm{spr}\,X} + \overline{\mathrm{spr}\,X}^{\,2}\Big], \qquad \sigma^2_{f_Y} = S^2_{\mathrm{mid}\,Y} + 4\,\sigma^2_{f_{[0,1]}}\Big[S^2_{\mathrm{spr}\,Y} + \overline{\mathrm{spr}\,Y}^{\,2}\Big],$$

$$\rho_{f_X f_Y} = \frac{S_{\mathrm{mid}\,X\,\mathrm{mid}\,Y} + 4\,\sigma^2_{f_{[0,1]}}\big(S_{\mathrm{spr}\,X\,\mathrm{spr}\,Y} + \overline{\mathrm{spr}\,X}\cdot\overline{\mathrm{spr}\,Y}\big)}{\sqrt{\sigma^2_{f_X}\,\sigma^2_{f_Y}}}, \qquad \tilde{\rho}_{f_X f_Y} = \frac{S_{\mathrm{mid}\,X\,\mathrm{mid}\,Y} - 4\,\sigma^2_{f_{[0,1]}}\big(S_{\mathrm{spr}\,X\,\mathrm{spr}\,Y} + \overline{\mathrm{spr}\,X}\cdot\overline{\mathrm{spr}\,Y}\big)}{\sqrt{\sigma^2_{f_X}\,\sigma^2_{f_Y}}},$$

$$\rho_{f_X f_{[0,1]}} = 2\sqrt{\frac{\sigma^2_{f_{[0,1]}}}{\sigma^2_{f_X}}}\;\overline{\mathrm{spr}\,X} = -\tilde{\rho}_{f_X f_{[0,1]}}, \qquad \rho_{f_Y f_{[0,1]}} = 2\sqrt{\frac{\sigma^2_{f_{[0,1]}}}{\sigma^2_{f_Y}}}\;\overline{\mathrm{spr}\,Y},$$

and

$$a_1 = \sqrt{\frac{\sigma^2_{f_Y}}{\sigma^2_{f_X}}}\cdot\frac{\rho_{f_X f_Y} - \rho_{f_X f_{[0,1]}}\,\rho_{f_Y f_{[0,1]}}}{1 - \rho^2_{f_X f_{[0,1]}}}, \qquad a_2 = \sqrt{\frac{\sigma^2_{f_Y}}{\sigma^2_{f_X}}}\cdot\frac{\tilde{\rho}_{f_X f_Y} + \rho_{f_X f_{[0,1]}}\,\rho_{f_Y f_{[0,1]}}}{1 - \rho^2_{f_X f_{[0,1]}}},$$

$$b_1 = \sqrt{\frac{\sigma^2_{f_Y}}{\sigma^2_{f_{[0,1]}}}}\cdot\frac{\rho_{f_Y f_{[0,1]}} - \rho_{f_X f_{[0,1]}}\,\rho_{f_X f_Y}}{1 - \rho^2_{f_X f_{[0,1]}}}, \qquad b_2 = \sqrt{\frac{\sigma^2_{f_Y}}{\sigma^2_{f_{[0,1]}}}}\cdot\frac{\rho_{f_Y f_{[0,1]}} + \rho_{f_X f_{[0,1]}}\,\tilde{\rho}_{f_X f_Y}}{1 - \rho^2_{f_X f_{[0,1]}}},$$

$$a'_1 = \sqrt{\frac{\sigma^2_{f_Y}}{\sigma^2_{f_X}}}\cdot\rho_{f_X f_Y}, \qquad a'_2 = \sqrt{\frac{\sigma^2_{f_Y}}{\sigma^2_{f_X}}}\cdot\tilde{\rho}_{f_X f_Y}, \qquad b'' = \sqrt{\frac{\sigma^2_{f_Y}}{\sigma^2_{f_{[0,1]}}}}\cdot\rho_{f_Y f_{[0,1]}},$$

$$c_1 = \mu_{f_Y} - a_1\,\mu_{f_X} - .5\,b_1, \quad c_2 = \mu_{f_Y} - a_2\,\mu_{f_X} - .5\,b_2, \quad c'_1 = \mu_{f_Y} - a'_1\,\mu_{f_X}, \quad c'_2 = \mu_{f_Y} - a'_2\,\mu_{f_X}, \quad c'' = \mu_{f_Y} - .5\,b'',$$

and $\overline{Y} = [c'', c'' + b'']$; then go to Step 2. Here $\overline{\,\cdot\,}$ denotes sample means and $S$ sample (co)variances of the indicated real-valued variables; note that $[c_1, c_1 + b_1] = [\mu_{f_Y} - a_1\mu_{f_X} - .5\,b_1,\; \mu_{f_Y} - a_1\mu_{f_X} + .5\,b_1]$ (analogously for the remaining candidates), and that $\overline{Y}$ reduces to the sample Minkowski mean of the $Y_i$.

Step 2. IF a1 ≤ 0 THEN go to Step 3, ELSE go to Step 5.

Step 3. IF a2 ≥ 0 THEN the optimal solution is given by (a*, B*) = (0, Ȳ), ELSE go to Step 4.

Step 4. IF b2 > 0 THEN the optimal solution is given by (a*, B*) = (a2, [c2, c2 + b2]), ELSE it is given by (a*, B*) = (a′2, {c′2}).

Step 5. IF a2 ≥ 0 THEN go to Step 6, ELSE go to Step 7.

Step 6. IF b1 > 0 THEN the optimal solution is given by (a*, B*) = (a1, [c1, c1 + b1]), ELSE it is given by (a*, B*) = (a′1, {c′1}).

Step 7. IF b1 > 0 THEN go to Step 8, ELSE go to Step 11.

Step 8. IF b2 > 0 THEN go to Step 9, ELSE go to Step 10.

Step 9. IF a1² > a2² THEN the optimal solution is given by (a*, B*) = (a1, [c1, c1 + b1]); ELSE IF a1² < a2² THEN it is given by (a*, B*) = (a2, [c2, c2 + b2]); ELSE there are two optimal solutions, namely (a1, [c1, c1 + b1]) and (a2, [c2, c2 + b2]).

Step 10. IF $a_1^2\,[1 - \rho^2_{f_X f_{[0,1]}}] > a_2'^{\,2} - \frac{\sigma^2_{f_Y}}{\sigma^2_{f_X}}\,\rho^2_{f_Y f_{[0,1]}}$ THEN the optimal solution is given by (a*, B*) = (a1, [c1, c1 + b1]); ELSE IF the reversed strict inequality holds THEN it is given by (a*, B*) = (a′2, {c′2}); ELSE there are two optimal solutions, namely (a1, [c1, c1 + b1]) and (a′2, {c′2}).

Step 11. IF b2 > 0 THEN go to Step 12, ELSE go to Step 13.

Step 12. IF $a_2^2\,[1 - \rho^2_{f_X f_{[0,1]}}] > a_1'^{\,2} - \frac{\sigma^2_{f_Y}}{\sigma^2_{f_X}}\,\rho^2_{f_Y f_{[0,1]}}$ THEN the optimal solution is given by (a*, B*) = (a2, [c2, c2 + b2]); ELSE IF the reversed strict inequality holds THEN it is given by (a*, B*) = (a′1, {c′1}); ELSE there are two optimal solutions, namely (a2, [c2, c2 + b2]) and (a′1, {c′1}).

Step 13. IF a′1² > a′2² THEN the optimal solution is given by (a*, B*) = (a′1, {c′1}); ELSE IF a′1² < a′2² THEN it is given by (a*, B*) = (a′2, {c′2}); ELSE there are two optimal solutions, namely (a′1, {c′1}) and (a′2, {c′2}).
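To fix ideas, the following sketch (ours) computes the Step 1 moments and the candidate slopes a1 and a2 from endpoint data, again under the assumption W = Lebesgue measure on [0,1] (so σ²_{f[0,1]} = 1/12); the remaining quantities are obtained analogously, and Steps 2–13 then reduce to sign checks and comparisons among the resulting candidates:

```python
import numpy as np

def candidate_slopes(x, y):
    """Step 1 moments for W = Lebesgue on [0,1] (sigma2_01 = 1/12).
    x, y: arrays of shape (n, 2) whose rows are (inf, sup) endpoints."""
    mid_x, spr_x = (x[:, 1] + x[:, 0]) / 2, (x[:, 1] - x[:, 0]) / 2
    mid_y, spr_y = (y[:, 1] + y[:, 0]) / 2, (y[:, 1] - y[:, 0]) / 2
    s2_01 = 1.0 / 12.0
    S = lambda u, v: np.mean(u * v) - np.mean(u) * np.mean(v)  # sample covariance
    s2_fx = S(mid_x, mid_x) + 4 * s2_01 * (S(spr_x, spr_x) + np.mean(spr_x) ** 2)
    s2_fy = S(mid_y, mid_y) + 4 * s2_01 * (S(spr_y, spr_y) + np.mean(spr_y) ** 2)
    cross = S(spr_x, spr_y) + np.mean(spr_x) * np.mean(spr_y)
    rho = (S(mid_x, mid_y) + 4 * s2_01 * cross) / np.sqrt(s2_fx * s2_fy)
    rho_t = (S(mid_x, mid_y) - 4 * s2_01 * cross) / np.sqrt(s2_fx * s2_fy)
    rho_x01 = 2 * np.sqrt(s2_01 / s2_fx) * np.mean(spr_x)
    rho_y01 = 2 * np.sqrt(s2_01 / s2_fy) * np.mean(spr_y)
    r = np.sqrt(s2_fy / s2_fx)
    a1 = r * (rho - rho_x01 * rho_y01) / (1 - rho_x01 ** 2)
    a2 = r * (rho_t + rho_x01 * rho_y01) / (1 - rho_x01 ** 2)
    return a1, a2
```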
Concerning the estimation problem in Model 2, we could take advantage of the solutions for Model 1 as usual, although it should be noted that the interval-arithmetic approach involves some constraints. More precisely, the assumption of the linear model Y|X = aX + ε_X based on interval arithmetic forces the Hukuhara differences Y −H aX to exist for the true value of a and for all the values of (X, Y) in the population. However, if we estimate a from the sample on the basis of the solution for Model 1, we could obtain inappropriate values a* (i.e., values for which Y −H a* X does not make sense). In fact, if the estimate a* is closer to 0 than the true value of a (i.e., |a*| ≤ |a|), then, under the linear model assumption, spr(a* X) ≤ spr(aX) ≤ spr Y, whence Y −H a* X is well-defined. On the contrary, if a* is such that |a*| > |a|, the last assertion is not always true.

Example: To illustrate the preceding comments, suppose that Y = X + ε, where mid X and mid ε are independent N(0, 1) random variables and spr X and spr ε are independent χ²₁ random variables. In this case the population regression function is E_A[Y|x] = x + [−1, 1]. If (X1, Y1) is sampled from (X, Y), then we can write Y1 = aX1 + ε1, where a = 1 and ε1 corresponds to the Hukuhara difference between Y1 and aX1.

Table 1. Simulated data to estimate the model E_A[Y|x] = x + [−1, 1]

  i    mid x_i    spr x_i    mid ε_i    spr ε_i
  1     0.6561     0.2863     0.1238     0.0893
  2    -0.0334    0.06533     1.0936     0.8334
  3    -0.2719     0.5166    -1.7599     1.3875

In order to illustrate the inconveniences of combining the linear model assumption with the solutions for Model 1, we have simulated 3 data from X and ε (see Table 1) and later computed the estimates of a obtained from the preceding algorithm. For these data we have spr y1 = spr x1 + spr ε1 = 0.2863 + 0.0893 = 0.3736. If we compute the least squares solutions by applying the algorithm, we obtain a* = 2.2644 and B* = [−0.3282, 0.4041], hence spr(a* x1) = 2.2644 · 0.2863 ≈ 0.648. Since spr(a* x1) > spr y1, it is not possible to compute the Hukuhara difference between y1 and a* x1, and hence it is not possible to find any error ε*1 such that y1 = a* x1 + ε*1. As a consequence, with these sample data, if we take a* = 2.2644 we would conclude that there is no linear relationship between X and Y, which is not true.
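Numerically, the failure in the example reduces to a spread comparison; a sketch (ours, using the rounded values of Table 1):

```python
# Existence check for the Hukuhara difference y1 -_H a*x1 in the example:
spr_x1, spr_eps1 = 0.2863, 0.0893
spr_y1 = spr_x1 + spr_eps1          # 0.3736: spreads add under y1 = x1 + eps1
a_star = 2.2644                     # least squares solution for Model 1

print(a_star * spr_x1)              # approx. 0.648 > spr_y1
print(a_star * spr_x1 <= spr_y1)    # False: y1 -_H a*x1 is undefined, so no
                                    # eps1* can satisfy y1 = a*x1 + eps1*
```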
To overcome these difficulties we propose the method in the following section.

4 A restricted least squares method for linear models

The restricted Least Squares estimator we propose is constrained to satisfy the linear regression model, at least over the observed data; we cannot state the same for unobserved data. Nonetheless, the simulation experiments show that the inferential performance (in terms of mean square error) of the proposed estimator is better than that of the original Least Squares estimator. However, we are not considering, at this stage, optimality properties within a class of possible estimators for Model 2 (this will be the object of future research in the field).

The key idea of the suggested method is to find a kind of "threshold estimate" of |a|. This threshold, which will be denoted by a⁰, is computed as the maximum value of |a| for which the sample Hukuhara differences Yi −H aXi exist for all i = 1, ..., n. Once this threshold a⁰ is computed, we consider the estimate a* given by the optimal solution for Model 1, and we proceed as follows:

– if |a*| ≤ a⁰, then we estimate a by means of a*;
– if a* < −a⁰, then we estimate a by means of −a⁰; otherwise we estimate a by means of a⁰.

This method can be formalized as follows. We search for â ∈ R and B̂ ∈ Kc(R) in order to

$$\text{Minimize } \frac{1}{n}\sum_{i=1}^{n} d_W^2\,(Y_i,\, aX_i + B) \qquad \text{restricted to: } Y_i -_H aX_i \text{ exists for all } i = 1,\dots,n.$$

Equivalently, since Yi −H aXi exists if and only if spr(aXi) = |a| spr Xi ≤ spr Yi, that is, |a| ≤ spr Yi / spr Xi for all i = 1, ..., n, we can consider the threshold

$$a^0 = \min_{i=1,\dots,n} \frac{\mathrm{spr}\,Y_i}{\mathrm{spr}\,X_i}$$

and rewrite the restriction about the existence of the Hukuhara differences as −a⁰ ≤ a ≤ a⁰.

It should be noted that if −a⁰ ≤ a ≤ a⁰, then dW(Yi, aXi + B) = dW(Yi −H aXi, B) for all i = 1, ..., n. Consequently, if â and B̂ are the optima, then B̂ minimizes

$$\frac{1}{n}\sum_{i=1}^{n} d_W^2\,\big(Y_i -_H \widehat{a}X_i,\, B\big),$$

and hence

$$\widehat{B} = \frac{1}{n}\sum_{i=1}^{n} \big(Y_i -_H \widehat{a}X_i\big) = \overline{Y} -_H \widehat{a}\,\overline{X},$$

where X̄ and Ȳ are the sample means (X̄ = [(1/n) Σᵢ inf Xi, (1/n) Σᵢ sup Xi], and analogously Ȳ). As a result, the problem reduces to:

$$\text{Find } \widehat{a} \in \mathbb{R} \text{ minimizing } \frac{1}{n}\sum_{i=1}^{n} d_W^2\,\big(Y_i -_H aX_i,\; \overline{Y} -_H a\overline{X}\big) \qquad \text{restricted to: } -a^0 \le a \le a^0.$$
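A possible implementation of the restricted criterion is sketched below (ours, once more assuming W = Lebesgue measure on [0,1]). The objective is piecewise quadratic in a, so its exact minimizer over [−a⁰, a⁰] could be derived in closed form on each sign branch; a fine grid search stands in for it here:

```python
import numpy as np

def restricted_ls(x, y, grid=20001):
    """Restricted LS sketch; x, y are (n, 2) arrays of (inf, sup) rows."""
    mid_x, spr_x = (x[:, 1] + x[:, 0]) / 2, (x[:, 1] - x[:, 0]) / 2
    mid_y, spr_y = (y[:, 1] + y[:, 0]) / 2, (y[:, 1] - y[:, 0]) / 2
    a0 = np.min(spr_y / spr_x)        # Y_i -_H aX_i exists iff |a| <= a0
    cand = np.linspace(-a0, a0, grid)
    # mid and spread of Y_i -_H aX_i for every candidate a:
    dm = mid_y[:, None] - cand * mid_x[:, None]
    ds = spr_y[:, None] - np.abs(cand) * spr_x[:, None]
    # mean d_W^2 between Y_i -_H aX_i and Ybar -_H aXbar:
    obj = ((dm - dm.mean(0)) ** 2 + (ds - ds.mean(0)) ** 2 / 3).mean(0)
    k = np.argmin(obj)
    m, s = dm[:, k].mean(), ds[:, k].mean()   # B_hat = Ybar -_H a_hat Xbar
    return cand[k], (m - s, m + s)
```

This estimator can then be compared with the unrestricted Model 1 fit over simulated samples, along the lines of the next section.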
5 Simulation studies

We consider the linear regression model Y = aX + ε, where mid X and mid ε are independent N(0, 1) random variables, and spr X and spr ε are independent χ²₁ random variables. For the simulations we have considered random samples {(Xi, Yi)}_{i=1}^n from the linear regression model for different values of a, namely a = 0, a = .5 and a = 1, and different sample sizes n. We have computed the least squares optimal estimates of a and B obtained from the preceding algorithm, denoted by a* and B*, and the corresponding estimates obtained by means of the procedure in the preceding section, denoted by â and B̂.

By the Monte Carlo method we have approximated the distributions of the estimators on the basis of 100,000 iterations, from which we have computed the mean values and the mean square errors of each estimator. Table 2 gathers the obtained results.

Table 2. Comparison between the two estimation methods.

                          Expected value                                 Ratio
  Parameter   n     Least Squares        Restricted LS           MSE_LS/MSE_RLS
  a = 0
  a           10    -.000942             -.000165                16.249071
              30     .000181              .000050                174.868423
              100    .000100             -.000002                4673.475688
  B           10    [-.8285, .8307]      [-.9764, .9779]         1.152597
              30    [-.9033, .9037]      [-.9966, .9968]         1.111057
              100   [-.9471, .9476]      [-.9990, .9994]         1.101803
  a = .5
  a           10     .465281              .380383                1.537858
              30     .497963              .445101                1.864509
              100    .499708              .470880                2.044996
  B           10    [-.9852, .9874]      [-1.0676, 1.0694]       1.069550
              30    [-.9999, 1.0005]     [-1.0513, 1.0516]       1.049216
              100   [-.9996, 1.0001]     [-1.0282, 1.0287]       1.047945
  a = 1
  a           10     .991353              .904667                1.669655
              30     .999783              .947226                2.049933
              100    .999987              .970990                2.052986
  B           10    [-1.0013, 1.0017]    [-1.0816, 1.0824]       1.065727
              30    [-.9995, .9996]      [-1.0508, 1.0508]       1.050074
              100   [-1.0007, .9998]     [-1.0294, 1.0285]       1.047574

We see that the estimator a* seems to be unbiased, while the estimator â seems to be biased when a ≠ 0. However, the Restricted LS estimator produces a much lower mean square error than the Least Squares one. Actually, the closer a is to 0 (w.r.t. the order of magnitude of the variables X and Y), the greater the ratio between the mean square errors (MSE_LS/MSE_RLS). We have checked that, in many cases, if a is close to 0, the empirical ratios between the MSEs become greater as n increases (a = 0 being an extreme case), whereas if a is far from 0, this ratio seems to remain more stable. In connection with the estimation of B, we can see that the differences are not so remarkable, although they depend on the results for a.

References

[Art75] Artstein, Z., Vitale, R.A.: A strong law of large numbers for random compact sets. Ann. Probab. 3, 879–882 (1975)
[Aum65] Aumann, R.J.: Integrals of set-valued functions. J. Math. Anal. Appl. 12, 1–12 (1965)
[BCS95] Bertoluzza, C., Corral, N., Salas, A.: On a new class of distances between fuzzy numbers. Mathware & Soft Computing 2, 71–84 (1995)
[GLML02] Gil, M.A., Lubiano, M.A., Montenegro, M., López-García, M.T.: Least squares fitting of an affine function and strength of association for interval-valued data. Metrika 56, 97–111 (2002)
[GGCM06] Gil, M.A., González-Rodríguez, G., Colubi, A., Montenegro, M.: Testing linear independence in linear models with interval-valued data. Comp. Stat. Data Anal. (2006, to appear)