On the estimation of linear models with interval-valued data

González-Rodríguez, Gil¹, Colubi, Ana¹, Coppi, Renato² and Giordani, Paolo²

¹ Dpto. de Estadística e I.O., Universidad de Oviedo, 33007 Oviedo, Spain, {gil, colubi}@uniovi.es
² Dipartimento di Statistica, Probabilità e Statistiche Applicate, Università degli Studi di Roma "La Sapienza", {renato.coppi, paolo.giordani}@uniroma1.it
Summary. The problem of estimating the parameters of a linear regression model between interval-valued random sets, based on an interval-arithmetic approach, is analyzed. We show empirically that, contrary to what happens in the case of real-valued variables, the least squares optimal solutions for the affine transformation model do not produce "suitable" estimates for the simple linear regression model. Concretely, we propose new estimators for the simple linear regression model and we show some simulations which suggest that the solution for the affine transformation model does not provide the optimal estimates, since the new ones lead to a lower mean square error.
Key words: simple linear regression model, interval-valued data, interval-arithmetic approach, point estimation, least squares method
1 Introduction
In [GLML02] a linear regression model with interval-valued data based on an interval-arithmetic approach was studied, by modelling interval-valued data as values of compact convex random sets. Least squares optimal solutions for the affine transformation model based on a generalized operational metric were obtained. In fact, it was shown that the solution is generally unique, although in some special cases two solutions can be considered. The solutions can be expressed in terms of different moments of the sample distribution of the random vector determined by the mid-points and spreads of the random intervals.
Modelling the stochastic behaviour of each observation by means of a simple linear regression model implies some remarkable differences w.r.t. the preceding viewpoint. In this respect, in [GLML02] it is shown that testing for linear independence under the assumption of the underlying linear model reduces to testing hypotheses about the covariance of certain real-valued variables which characterize the interval-valued random elements (concretely, the mid and the spread values).
In this paper, we will show empirically that, contrary to what happens in the case of real-valued variables under the usual normality assumption, the least squares optimal solutions for the affine transformation model do not produce "suitable" estimates for the simple linear regression model, due to the special features of the operations between interval-valued data. In Section 2 we introduce some supporting concepts and results. In Section 3 we discuss the solutions for both models. In Section 4 we suggest new estimators for the simple linear regression model. Finally, in Section 5 we show some simulations which suggest that the solution for the affine transformation model does not provide the optimal estimates, since the new ones lead to a lower mean square error.
2 Preliminaries
Let Kc (R) be the class of nonempty compact intervals of R endowed with the semilinear structure induced by the Minkowski addition and the product by a scalar
(A + B = {a + b | a ∈ A , b ∈ B} and λA = {λa | a ∈ A} for all A, B ∈ Kc (R) and λ ∈
R). As well, given A, B ∈ Kc(R), if there exists C ∈ Kc(R) such that A = B + C, then C is defined as the Hukuhara difference A −H B (that is, A −H B = [inf A − inf B, sup A − sup B] whenever inf A − inf B ≤ sup A − sup B).
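For intervals, the operations above reduce to endpoint arithmetic. The following sketch (our illustration, not part of the paper, representing an interval as an (inf, sup) pair) shows Minkowski addition, the product by a scalar, and the existence condition for the Hukuhara difference:

```python
def add(A, B):
    # Minkowski addition: A + B = {a + b : a in A, b in B}
    return (A[0] + B[0], A[1] + B[1])

def scale(lam, A):
    # lam * A = {lam * a : a in A}; the endpoints swap when lam < 0
    lo, hi = lam * A[0], lam * A[1]
    return (min(lo, hi), max(lo, hi))

def hukuhara(A, B):
    # A -H B = [inf A - inf B, sup A - sup B], defined only when the
    # result is a proper interval, i.e. when spr A >= spr B
    C = (A[0] - B[0], A[1] - B[1])
    if C[0] > C[1]:
        return None  # the Hukuhara difference does not exist
    return C
```

Note that the Hukuhara difference is not defined for every pair: hukuhara((0, 3), (1, 2)) gives (−1, 1), while hukuhara((1, 2), (0, 3)) does not exist, since the spread of the second interval exceeds that of the first.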
Given a probability space (Ω, A, P), a mapping X : Ω → Kc(R) is said to be an interval-valued random set (or compact convex random set) if it is A|B_dH-measurable, B_dH denoting the σ-field generated by the topology induced by the well-known Hausdorff metric dH on Kc(R).
An interval-valued random set X can be characterized by means of the random
vector (mid X, spr X) where mid A = (sup A+inf A)/2 and spr A = (sup A−inf A)/2
denote the mid-point (centre) and the spread (radius) of interval A ∈ Kc (R), respectively.
Let X : Ω → Kc(R) be an interval-valued random set such that E(|X|; P) < ∞ (with |X|(ω) = sup{|x| : x ∈ X(ω)} for all ω ∈ Ω); then the expected value of X in Kudō-Aumann's sense (see, e.g., [Aum65]) is, in this case, the set E_A[X; P] = [E(inf X; P), E(sup X; P)].
Interval-valued random sets represent a suitable model for random elements that are either essentially imprecise (like economic fluctuations and ranges) or associated with certain imprecisely-known variables (like interval censoring times, or variables for which the available data are grouped).
The least squares method in [GLML02] is based on the dW-distance (W being a nondegenerate symmetric probability measure on ([0,1], β_[0,1])), which is defined by

dW(A, B) = [ ∫_[0,1] (fA(λ) − fB(λ))² dW(λ) ]^{1/2}

with fA(λ) = λ sup A + (1 − λ) inf A, for all A, B ∈ Kc(R) (see [BCS95]).
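Since fA(λ) = mid A + (2λ − 1) spr A, the integral admits a closed form; for instance, when W is the Lebesgue measure on [0,1] one gets dW²(A, B) = (mid A − mid B)² + (1/3)(spr A − spr B)². The sketch below (our illustration, assuming W = Lebesgue) checks the closed form against numerical quadrature:

```python
import numpy as np

def d_w(A, B, n_grid=200_000):
    """d_W distance for W = Lebesgue measure on [0,1], by midpoint-rule quadrature."""
    lam = (np.arange(n_grid) + 0.5) / n_grid           # midpoints of a grid on [0, 1]
    f = lambda I: lam * I[1] + (1 - lam) * I[0]        # f_I(lam) = lam sup I + (1-lam) inf I
    return np.sqrt(np.mean((f(A) - f(B)) ** 2))

def d_w_closed(A, B):
    """Closed form for W = Lebesgue: sqrt(dmid^2 + dspr^2 / 3)."""
    dmid = (A[0] + A[1]) / 2 - (B[0] + B[1]) / 2
    dspr = (A[1] - A[0]) / 2 - (B[1] - B[0]) / 2
    return np.sqrt(dmid ** 2 + dspr ** 2 / 3)
```

For A = [0, 3] and B = [1, 2] both routines return sqrt(1/3): the mids coincide, so only the spread difference contributes.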
3 Linear model estimates
Let {(xi, yi) : i = 1, . . . , n} be a data matrix with (xi, yi) ∈ Kc(R) × Kc(R) for all i = 1, . . . , n. This matrix can be associated with a realization of a random sample {(Xi, Yi) : i = 1, . . . , n} obtained from a bidimensional interval-valued random set (X, Y) : Ω → Kc(R) × Kc(R). We consider two statistical-probabilistic models:
Model 1: (X, Y) : Ω → Kc(R) × Kc(R) with P(X,Y) ∈ P and Y* = aX + B, with a ∈ R and B ∈ Kc(R).
Model 2: (X, Y) : Ω → Kc(R) × Kc(R) with P(X,Y) ∈ P and Y|X = aX + εX, where a ∈ R and εX is a random set with expected value E_A[εX; P] = B ∈ Kc(R).
The relationships imposed between the variables in the two models specify a restriction on the family of probability distributions over the sample space, yielding two subfamilies (satisfying the specific restrictions), namely P1 and P2. The inference on the parameters a and B is carried out within P1 and P2, respectively. In this work we are only interested in point estimation. The estimation problem concerning Model 1 is to find a suitable P* ∈ P1, that is, values a* ∈ R and B* ∈ Kc(R) so that {(xi, a* xi + B*) : i = 1, . . . , n} is as close to the data matrix as possible according to a given criterion. On the other hand, the estimation problem concerning Model 2 implies finding P̂ ∈ P2, that is, values â ∈ R and B̂ ∈ Kc(R) so that {(xi, â xi + B̂) : i = 1, . . . , n} is as close to the data matrix as possible and yi −H â xi = εi ∈ Kc(R). Thus the viewpoints differ in the following sense: in Model 1 we search for the affine function closest to the data matrix, but the data are not assumed to verify the affine transformation, while in Model 2 we look for the simple linear regression relationship (see [GGCM06] for testing approaches concerning this model); in this case the collected data are supposed to verify a regression model.
In [GLML02] the least squares optimal solution based on the dW-distance for Model 1 has been obtained. In other words, the value a* ∈ R and the interval B* ∈ Kc(R) minimizing (1/n) Σ_{i=1}^n dW²(Yi, aXi + B) have been deduced. If X and Y are nondegenerate, the solutions can be expressed as functions of variances/covariances of mids and spreads, as follows:
Algorithm for the least squares optimal solution for Model 1

Step 1. Compute the values

μ_fX = mean(mid X),   σ²_f[0,1] = ∫_[0,1] λ² dW(λ) − .25,

σ²_fX = S²_mid X + 4 σ²_f[0,1] [ S²_spr X + (mean(spr X))² ]   (and analogously σ²_fY),

ρ_fX fY = [ S_mid X mid Y + 4 σ²_f[0,1] ( S_spr X spr Y + mean(spr X) · mean(spr Y) ) ] / sqrt(σ²_fX σ²_fY),

ρ_f̃X fY = [ S_mid X mid Y − 4 σ²_f[0,1] ( S_spr X spr Y + mean(spr X) · mean(spr Y) ) ] / sqrt(σ²_fX σ²_fY),

ρ_fX f[0,1] = −ρ_f̃X f[0,1] = 2 sqrt(σ²_f[0,1] / σ²_fX) · mean(spr X)   (and analogously ρ_fY f[0,1]),

a1 = sqrt(σ²_fY / σ²_fX) · [ ρ_fX fY − ρ_fX f[0,1] ρ_fY f[0,1] ] / [ 1 − ρ²_fX f[0,1] ],

a2 = sqrt(σ²_fY / σ²_fX) · [ ρ_f̃X fY + ρ_fX f[0,1] ρ_fY f[0,1] ] / [ 1 − ρ²_fX f[0,1] ],

b1 = sqrt(σ²_fY / σ²_f[0,1]) · [ ρ_fY f[0,1] − ρ_fX f[0,1] ρ_fX fY ] / [ 1 − ρ²_fX f[0,1] ],

b2 = sqrt(σ²_fY / σ²_f[0,1]) · [ ρ_fY f[0,1] + ρ_fX f[0,1] ρ_f̃X fY ] / [ 1 − ρ²_fX f[0,1] ],

a′1 = sqrt(σ²_fY / σ²_fX) · ρ_fX fY,   a′2 = sqrt(σ²_fY / σ²_fX) · ρ_f̃X fY,

c1 = μ_fY − a1 μ_fX − .5 b1,   c2 = μ_fY − a2 μ_fX − .5 b2,

c′1 = μ_fY − a′1 μ_fX,   c′2 = μ_fY − a′2 μ_fX,

b′′ = sqrt(σ²_fY / σ²_f[0,1]) · ρ_fY f[0,1],   c′′ = μ_fY − .5 b′′,   Ȳ = [c′′, c′′ + b′′],

and go to Step 2.
Step 2. IF a1 ≤ 0 THEN go to Step 3, ELSE go to Step 5.
Step 3. IF a2 ≥ 0 THEN the optimal solution is (a*, B*) = (0, Ȳ), ELSE go to Step 4.
Step 4. IF b2 > 0 THEN the optimal solution is
  (a*, B*) = (a2, [μ_fY − a2 μ_fX − .5 b2, μ_fY − a2 μ_fX + .5 b2]),
  ELSE the optimal solution is (a*, B*) = (a′2, {μ_fY − a′2 μ_fX}).
Step 5. IF a2 ≥ 0 THEN go to Step 6, ELSE go to Step 7.
Step 6. IF b1 > 0 THEN the optimal solution is
  (a*, B*) = (a1, [μ_fY − a1 μ_fX − .5 b1, μ_fY − a1 μ_fX + .5 b1]),
  ELSE the optimal solution is (a*, B*) = (a′1, {μ_fY − a′1 μ_fX}).
Step 7. IF b1 > 0 THEN go to Step 8, ELSE go to Step 11.
Step 8. IF b2 > 0 THEN go to Step 9, ELSE go to Step 10.
Step 9. IF a1² > a2² THEN the optimal solution is
  (a*, B*) = (a1, [μ_fY − a1 μ_fX − .5 b1, μ_fY − a1 μ_fX + .5 b1]), ELSE
  IF a1² < a2² THEN the optimal solution is
  (a*, B*) = (a2, [μ_fY − a2 μ_fX − .5 b2, μ_fY − a2 μ_fX + .5 b2]),
  ELSE there are two optimal solutions, namely both of the above.
Step 10. IF a1² [1 − ρ²_fX f[0,1]] > a′2² − (σ²_fY / σ²_fX) ρ²_fY f[0,1] THEN the optimal solution is
  (a*, B*) = (a1, [μ_fY − a1 μ_fX − .5 b1, μ_fY − a1 μ_fX + .5 b1]), ELSE
  IF a1² [1 − ρ²_fX f[0,1]] < a′2² − (σ²_fY / σ²_fX) ρ²_fY f[0,1] THEN the optimal solution is
  (a*, B*) = (a′2, {μ_fY − a′2 μ_fX}),
  ELSE there are two optimal solutions, namely both of the above.
Step 11. IF b2 > 0 THEN go to Step 12, ELSE go to Step 13.
Step 12. IF a′1² > a′2² THEN the optimal solution is (a*, B*) = (a′1, {μ_fY − a′1 μ_fX}), ELSE
  IF a′1² < a′2² THEN the optimal solution is (a*, B*) = (a′2, {μ_fY − a′2 μ_fX}),
  ELSE there are two optimal solutions, namely both of the above.
Step 13. IF a2² [1 − ρ²_fX f[0,1]] > a′1² − (σ²_fY / σ²_fX) ρ²_fY f[0,1] THEN the optimal solution is
  (a*, B*) = (a2, [μ_fY − a2 μ_fX − .5 b2, μ_fY − a2 μ_fX + .5 b2]), ELSE
  IF a2² [1 − ρ²_fX f[0,1]] < a′1² − (σ²_fY / σ²_fX) ρ²_fY f[0,1] THEN the optimal solution is
  (a*, B*) = (a′1, {μ_fY − a′1 μ_fX}),
  ELSE there are two optimal solutions, namely both of the above.
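As an illustrative sketch (ours, not from the paper), the Step 1 moment quantities can be computed directly from the mid/spread representation; the code below assumes W is the Lebesgue measure on [0,1] (so σ²_f[0,1] = 1/12) and uses sample moments with denominator n, which is an assumption about the S notation:

```python
import numpy as np

def step1_slopes(mid_x, spr_x, mid_y, spr_y, var_w=1.0 / 12.0):
    """Compute a1 and a2 of Step 1; var_w is sigma^2_f[0,1] (1/12 for Lebesgue W)."""
    mid_x, spr_x = np.asarray(mid_x, float), np.asarray(spr_x, float)
    mid_y, spr_y = np.asarray(mid_y, float), np.asarray(spr_y, float)
    cov = lambda u, v: np.mean(u * v) - u.mean() * v.mean()  # denominator-n moments (assumption)
    s2_fx = cov(mid_x, mid_x) + 4 * var_w * (cov(spr_x, spr_x) + spr_x.mean() ** 2)
    s2_fy = cov(mid_y, mid_y) + 4 * var_w * (cov(spr_y, spr_y) + spr_y.mean() ** 2)
    denom = np.sqrt(s2_fx * s2_fy)
    cross = 4 * var_w * (cov(spr_x, spr_y) + spr_x.mean() * spr_y.mean())
    rho_xy = (cov(mid_x, mid_y) + cross) / denom     # rho_{fX fY}
    rho_txy = (cov(mid_x, mid_y) - cross) / denom    # rho_{f~X fY}
    rho_x0 = 2 * np.sqrt(var_w / s2_fx) * spr_x.mean()   # rho_{fX f[0,1]}
    rho_y0 = 2 * np.sqrt(var_w / s2_fy) * spr_y.mean()   # rho_{fY f[0,1]}
    r = np.sqrt(s2_fy / s2_fx)
    a1 = r * (rho_xy - rho_x0 * rho_y0) / (1 - rho_x0 ** 2)
    a2 = r * (rho_txy + rho_x0 * rho_y0) / (1 - rho_x0 ** 2)
    return a1, a2
```

As a sanity check, for degenerate intervals (all spreads zero) ρ_fX f[0,1] vanishes and a1 reduces to the ordinary least squares slope of mid Y on mid X.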
Concerning the estimation problem in Model 2, we could take advantage of the solutions for Model 1 as usual, although it should be noted that the interval-arithmetic approach involves some constraints. More precisely, the assumption of the linear model Y|X = aX + εX, based on interval arithmetic, forces the Hukuhara differences Y −H aX to exist for the true value of a and for all the values of (X, Y) in the population. However, if we estimate a from the sample on the basis of the solution for Model 1, we could obtain inappropriate values of a* (i.e., values for which Y −H a*X does not make sense). In fact, if the estimate a* is closer to 0 than the true value of a (i.e., |a*| ≤ |a|), then under the linear model assumption spr(a*X) ≤ spr(aX) ≤ spr Y, whence Y −H a*X is well-defined. On the contrary, if a* is such that |a*| > |a|, the last assertion is not always true.
Example:
To illustrate the last comments, suppose that Y = X + ε, where mid X and mid ε are independent N(0,1) random variables and spr X and spr ε are independent χ²₁ random variables. In this case, the population regression function is E_A[Y|x] = x + [−1, 1]. If (X1, Y1) is sampled from (X, Y), then we can write Y1 = aX1 + ε1, where a = 1 and ε1 corresponds to the Hukuhara difference between Y1 and aX1.
Table 1. Simulated data to estimate the model E_A[Y|x] = x + [−1, 1]

mid xi    spr xi    mid εi    spr εi
 0.6561   0.2863     0.1238   0.0893
-0.0334   0.06533    1.0936   0.8334
-0.2719   0.5166    -1.7599   1.3875
In order to illustrate the inconveniences of combining the linear model assumption with the solutions for Model 1, we have simulated three observations from X and ε (see Table 1), and we have then computed the estimates of a obtained from the preceding algorithm. For these data we have spr y1 = spr x1 + spr ε1 = 0.2863 + 0.0893 = 0.3736. If we compute the least squares solutions by applying the algorithm, we obtain a* = 2.2644 and B* = [−0.3282, 0.4041], hence spr(a* x1) = 2.2644 · 0.2863 = 0.6478. Since spr(a* x1) > spr y1, it is not possible to compute the Hukuhara difference between y1 and a* x1, and hence it is not possible to find any error ε*1 such that y1 = a* x1 + ε*1. As a consequence, with these sample data, if we set a* = 2.2644, we would conclude that there is no linear relationship between X and Y, which is not true. To overcome these difficulties we propose the method in the following section.
4 A restricted least squares method for linear models
The restricted Least Squares estimator we propose is constrained to satisfy the linear regression model, at least over the observed data; we cannot guarantee the same for unobserved data. Nonetheless, the simulation experiment shows that the inferential performance (in terms of mean square error) of the proposed estimator is better than that of the original Least Squares estimator. However, at this stage we are not considering optimality properties within a class of possible estimators for Model 2 (this will be the object of future research in the field).
The key idea of the suggested method is to find a kind of "threshold estimate" of |a|. This threshold, which will be denoted by a0, is computed as the maximum value of |a| for which the sample Hukuhara differences Yi −H aXi exist for all i = 1, . . . , n. Once this threshold a0 is computed, we consider the estimate a* given by the optimal solution for Model 1 and proceed as follows:
– if |a*| ≤ a0, then we estimate a by means of a*;
– if a* < −a0, then we estimate a by means of −a0; otherwise (a* > a0), we estimate a by means of a0.
This method can be formalized as follows. We search for â ∈ R and B̂ ∈ Kc(R) in order to

Minimize (1/n) Σ_{i=1}^n dW²(Yi, aXi + B)

restricted to: Yi −H aXi exists for all i = 1, . . . , n.
Equivalently, since Yi −H aXi exists if spr(aXi) ≤ spr Yi, that is, if |a| ≤ spr Yi / spr Xi for all i = 1, . . . , n, we can consider the threshold

a0 = min_{i=1,...,n} spr Yi / spr Xi

and rewrite the restriction about the existence of the Hukuhara differences as −a0 ≤ a ≤ a0.
It should be noted that if −a0 ≤ a ≤ a0, one can verify that dW(Yi, aXi + B) = dW(Yi −H aXi, B) for all i = 1, . . . , n. Consequently, if â and B̂ are the optima, then B̂ minimizes

(1/n) Σ_{i=1}^n dW²(Yi −H âXi, B),

and hence

B̂ = (1/n) Σ_{i=1}^n (Yi −H âXi) = Ȳ −H âX̄,

where X̄ and Ȳ are the sample means (X̄ = [1/n Σ_{i=1}^n inf Xi, 1/n Σ_{i=1}^n sup Xi]). As a result, the problem reduces to:

Find â ∈ R minimizing (1/n) Σ_{i=1}^n dW²(Yi −H aXi, Ȳ −H aX̄)
restricted to: −a0 ≤ a ≤ a0.
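A direct way to implement the restricted estimator is to compute the threshold a0 and minimize the one-dimensional objective over [−a0, a0]. The sketch below (our illustration, assuming W is the Lebesgue measure, so dW² = Δmid² + Δspr²/3) uses a simple grid search rather than a closed-form minimizer:

```python
import numpy as np

def restricted_ls(mid_x, spr_x, mid_y, spr_y, n_grid=4001):
    """Minimize (1/n) sum d_W^2(Y_i -H a X_i, Ybar -H a Xbar) over |a| <= a0."""
    mid_x, spr_x = np.asarray(mid_x, float), np.asarray(spr_x, float)
    mid_y, spr_y = np.asarray(mid_y, float), np.asarray(spr_y, float)
    a0 = np.min(spr_y / spr_x)        # Hukuhara differences exist iff |a| <= a0
    best_a, best_val = 0.0, np.inf
    for a in np.linspace(-a0, a0, n_grid):
        # mid/spread of Y_i -H a X_i, centered at the mean interval Ybar -H a Xbar
        dmid = (mid_y - a * mid_x) - (mid_y.mean() - a * mid_x.mean())
        dspr = (spr_y - abs(a) * spr_x) - (spr_y.mean() - abs(a) * spr_x.mean())
        val = np.mean(dmid ** 2 + dspr ** 2 / 3.0)
        if val < best_val:
            best_a, best_val = a, val
    # B_hat = Ybar -H a_hat Xbar, returned as an (inf, sup) pair
    b_mid = mid_y.mean() - best_a * mid_x.mean()
    b_spr = spr_y.mean() - abs(best_a) * spr_x.mean()
    return best_a, (b_mid - b_spr, b_mid + b_spr)
```

When the data satisfy the model exactly (Yi = aXi + B with a fixed interval B), the objective vanishes at the true a, which the grid search recovers up to the grid spacing.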
5 Simulation studies
We consider the linear regression model Y = aX + ε, where mid X and mid ε are independent N(0,1) random variables, and spr X and spr ε are independent χ²₁ random variables. For the simulations we have considered random samples {(Xi, Yi) : i = 1, . . . , n} from the linear regression model for different values of a, namely a = 0, a = .5 and a = 1, and different sample sizes n. We have computed the least squares optimal estimates of a and B obtained from the preceding algorithm, denoted by a* and B*, and the corresponding estimates obtained by means of the procedure in the preceding section, denoted by â and B̂.
Table 2. Comparison between the two estimation methods.

            Expected value                               Ratio
Parameter n   Least Squares      Restricted LS          MSE_LS/MSE_RLS
a = 0
  a      10   -.000942           -.000165                 16.249071
         30    .000181            .000050                174.868423
        100    .000100           -.000002               4673.475688
  B      10   [-.8285,.8307]     [-.9764,.9779]            1.152597
         30   [-.9033,.9037]     [-.9966,.9968]            1.111057
        100   [-.9471,.9476]     [-.9990,.9994]            1.101803
a = .5
  a      10    .465281            .380383                  1.537858
         30    .497963            .445101                  1.864509
        100    .499708            .470880                  2.044996
  B      10   [-.9852,.9874]     [-1.0676,1.0694]          1.069550
         30   [-.9999,1.0005]    [-1.0513,1.0516]          1.049216
        100   [-.9996,1.0001]    [-1.0282,1.0287]          1.047945
a = 1
  a      10    .991353            .904667                  1.669655
         30    .999783            .947226                  2.049933
        100    .999987            .970990                  2.052986
  B      10   [-1.0013,1.0017]   [-1.0816,1.0824]          1.065727
         30   [-.9995,.9996]     [-1.0508,1.0508]          1.050074
        100   [-1.0007,.9998]    [-1.0294,1.0285]          1.047574
By the Monte Carlo method we have approximated the distributions of the estimators on the basis of 100,000 iterations, from which we have computed the mean values and the mean square errors of each estimator.
In Table 2 we gather the obtained results. We see that the estimator a* seems to be unbiased, while the estimator â seems to be biased when a ≠ 0. However, the restricted LS estimator produces a much lower mean square error than the least squares one. Actually, the closer a is to 0 (w.r.t. the order of magnitude of the variables X and Y), the greater the ratio between the mean square errors (MSE_LS/MSE_RLS). We have checked that, in many cases, if a is close to 0, the empirical ratios between the MSEs become greater as n increases (a = 0 being an extreme case), whereas if a is far from 0, this ratio seems to remain more stable. In connection with the estimation of B, we can see that the differences are not so remarkable, although they depend on the results for a.
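The data-generating mechanism of the study can be sketched as follows (our illustration). Note that under the model spr Y = |a| spr X + spr ε, so the ratio spr Yi / spr Xi is at least |a| for every observation, and the threshold a0 of the preceding section always admits the true value of a:

```python
import numpy as np

def simulate_sample(n, a, rng):
    """Draw a sample from Y = aX + eps with mid ~ N(0,1) and spr ~ chi^2_1."""
    mid_x, mid_e = rng.standard_normal(n), rng.standard_normal(n)
    spr_x, spr_e = rng.chisquare(1, n), rng.chisquare(1, n)
    mid_y = a * mid_x + mid_e
    spr_y = abs(a) * spr_x + spr_e   # interval arithmetic: spr(aX) = |a| spr X
    return mid_x, spr_x, mid_y, spr_y

rng = np.random.default_rng(0)
mid_x, spr_x, mid_y, spr_y = simulate_sample(100, 1.0, rng)
a0 = np.min(spr_y / spr_x)           # always >= |a|: Y_i -H a X_i exists at the true a
```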
References
[Art75] Artstein, Z., Vitale, R.A.: A strong law of large numbers for random compact sets. Ann. Probab. 3, 879–882 (1975)
[Aum65] Aumann, R.J.: Integrals of set-valued functions. J. Math. Anal. Appl. 12, 1–12 (1965)
[BCS95] Bertoluzza, C., Corral, N., Salas, A.: On a new class of distances between fuzzy numbers. Mathware & Soft Computing 2, 71–84 (1995)
[GLML02] Gil, M.A., Lubiano, M.A., Montenegro, M., López-García, M.T.: Least squares fitting of an affine function and strength of association for interval-valued data. Metrika 56, 97–111 (2002)
[GGCM06] Gil, M.A., González-Rodríguez, G., Colubi, A., Montenegro, M.: Testing Linear Independence in Linear Models with Interval-Valued Data. Comp. Stat. Data Anal. (2006, to appear)