An alternative method of comparing the means of two populations
Chand Chauhan, Ph.D
Associate Professor
Dept. Of Mathematical Sciences
Indiana University-Purdue University
2101 E, Coliseum Blvd, Fort Wayne, Indiana, 46835
Phone #: 260-481-6227
Fax # : 260-481-0155
chauhan@ipfw.edu
Brad Moss, Student
Dept. Of Mathematical Sciences
Northern Illinois University
Dekalb, Il, 60115
Phone: 574-453-6120
b.rammoss@gmail.com
Results on the ratio of the means of two normally distributed variables.
Abstract
The equality of the means of two populations is tested under three different situations: a)
population standard deviations are known, b) population standard deviations are unknown and
equal, and c) population standard deviations are unknown and unequal. In most elementary
courses, situations a and b are discussed in detail. However, situation c is either neglected or
briefly discussed with a reference to the formula for the degrees of freedom of the t value used
in the test. It is not easy to provide a logical explanation for the complicated formula
for the degrees of freedom of the t distribution.
In this paper we discuss an alternative method of comparing two population means under case
c. We propose a confidence interval for the ratio of the population means, with the
conclusion of equal means if the interval includes the value of one. The proposed method
is valid for both independent and dependent populations. However, in this paper the focus is on
the independent case. The conditions under which the proposed formula works well are
discussed. Simulation results are presented to illustrate the validity of the proposed interval. A
numerical example is provided to illustrate the application of the proposed method.
Introduction:
Consider two sets of positive values, X and Y, such that $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$.
Further suppose $\rho$ represents the correlation coefficient of X and Y. Let $\bar{X}$ and $\bar{Y}$ represent
the corresponding sample means of n pairs $(X_i, Y_i)$.
In most practical situations, $\mu_X$ and $\mu_Y$ are unknown and we may be interested in estimating
and computing the confidence interval of the ratio $\mu_Y/\mu_X$. For this objective, one needs to
investigate the distribution of $\bar{Y}/\bar{X}$, denoted $\bar{R}$, which is a reasonable estimate of $\mu_Y/\mu_X$. While a
linear combination of $\bar{X}$ and $\bar{Y}$ has a normal distribution, the same is not true for the ratio. In
this paper, the results of Hayya, Armstrong, and Gressis (1975) have been modified to
compute such an interval.
Theoretical Results:
Two specific results, one on the approximate distribution of $R = Y/X$, and another on the
approximate distribution of a function of R, given in Hayya, Armstrong, and Gressis (1975),
are as follows:
(1) $E\left(\frac{Y}{X}\right) \cong \frac{\mu_Y}{\mu_X} + \frac{\sigma_X^2 \mu_Y}{\mu_X^3} - \frac{\rho \sigma_X \sigma_Y}{\mu_X^2}$
(2) $Var\left(\frac{Y}{X}\right) \cong \frac{\sigma_X^2 \mu_Y^2}{\mu_X^4} + \frac{\sigma_Y^2}{\mu_X^2} - \frac{2 \rho \sigma_X \sigma_Y \mu_Y}{\mu_X^3}$
(3) $\frac{Y}{X}$ has an approximately normal distribution under certain conditions. (Note: in the
present discussion $\cong$ refers to an approximately equal value.)
(4) $Z = \frac{\mu_X R - \mu_Y}{\sqrt{\sigma_Y^2 - 2 \rho R \sigma_X \sigma_Y + R^2 \sigma_X^2}}$, known as the Geary-Hinkley transformation, also has an
approximate standard normal distribution under certain conditions.
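Results (1), (2), and (4) can be evaluated directly once the population parameters are given. The following is a minimal Python sketch under that assumption; the function names `ratio_moments` and `geary_hinkley` and the parameter values are illustrative and do not come from Hayya et al.

```python
import math

def ratio_moments(mu_x, sigma_x, mu_y, sigma_y, rho=0.0):
    """Approximate mean and variance of Y/X from results (1) and (2)."""
    mean = (mu_y / mu_x
            + sigma_x**2 * mu_y / mu_x**3
            - rho * sigma_x * sigma_y / mu_x**2)
    var = (sigma_x**2 * mu_y**2 / mu_x**4
           + sigma_y**2 / mu_x**2
           - 2 * rho * sigma_x * sigma_y * mu_y / mu_x**3)
    return mean, var

def geary_hinkley(r, mu_x, sigma_x, mu_y, sigma_y, rho=0.0):
    """Result (4): map an observed ratio r = y/x to an approximate N(0, 1) score."""
    return (mu_x * r - mu_y) / math.sqrt(
        sigma_y**2 - 2 * rho * r * sigma_x * sigma_y + r**2 * sigma_x**2)

# Illustrative (assumed) parameters: equal means, sd of X smaller than sd of Y.
mean, var = ratio_moments(mu_x=100, sigma_x=5, mu_y=100, sigma_y=10)
z = geary_hinkley(1.0, mu_x=100, sigma_x=5, mu_y=100, sigma_y=10)
print(mean, var, z)
```

With equal means, an observed ratio of exactly 1 maps to a score of zero, which is a quick sanity check on the sign convention in (4).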
Hayya et al provided conditions for (3) and (4) at the 5% level of significance. Both conditions are in
terms of the values of $\rho$ and the coefficients of variation of X and Y. The rationale for the
conditions is driven by the fact that for a small enough value of the standard deviation of X (and
consequently of the coefficient of variation of X), $Y/X$ may be regarded as Y times a constant,
whose distribution is normal. Hayya et al used these results to derive a confidence interval of
$Y/X$, assuming that the values of $\mu_X$, $\sigma_X$, $\mu_Y$, $\sigma_Y$, and $\rho$ are all known. From a practical point of
view it is highly unlikely that the values of $\mu_X$, $\sigma_X$, $\mu_Y$, $\sigma_Y$, and $\rho$ are known. Moreover, from an
applications point of view, a confidence interval of $\mu_Y/\mu_X$ provides more useful information regarding
the ratio of two attributes, on an average basis, than that of $Y/X$.
Main Results:
It can be shown that
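(5) The correlation coefficient between $\bar{X}$ and $\bar{Y}$, denoted $\rho_{\bar{X}\bar{Y}}$, is equal to $\rho$. This result
follows directly from the fact that Covariance($\bar{X}$, $\bar{Y}$) = Covariance(X, Y)/n, while the standard deviations of $\bar{X}$ and $\bar{Y}$ are $\sigma_X/\sqrt{n}$ and $\sigma_Y/\sqrt{n}$.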
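The step behind (5) can be written out as a short derivation. It assumes, as is usual for a random sample of pairs, that different pairs are independent, so that $\operatorname{Cov}(X_i, Y_j) = 0$ for $i \ne j$; that assumption is stated here for the sketch and is not spelled out in the text.

```latex
\begin{align*}
\operatorname{Cov}(\bar{X}, \bar{Y})
  &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(X_i, Y_j)
   = \frac{1}{n^2} \cdot n \, \rho \, \sigma_X \sigma_Y
   = \frac{\rho \, \sigma_X \sigma_Y}{n}, \\
\operatorname{Corr}(\bar{X}, \bar{Y})
  &= \frac{\rho \, \sigma_X \sigma_Y / n}
          {(\sigma_X / \sqrt{n}) \, (\sigma_Y / \sqrt{n})}
   = \rho .
\end{align*}
```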
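Following (5), results (1)-(4) are modified as follows:
(6) $E\left(\frac{\bar{Y}}{\bar{X}}\right) \cong \frac{\mu_Y}{\mu_X} + \frac{1}{n} \frac{\sigma_X^2 \mu_Y}{\mu_X^3} - \frac{\rho \sigma_X \sigma_Y}{\mu_X^2 n}$
(7) $Var\left(\frac{\bar{Y}}{\bar{X}}\right) \cong \frac{1}{n} \frac{\sigma_X^2 \mu_Y^2}{\mu_X^4} + \frac{1}{n} \frac{\sigma_Y^2}{\mu_X^2} - \frac{2 \rho \sigma_X \sigma_Y \mu_Y}{\mu_X^3 n}$
(8) $\frac{\bar{Y}}{\bar{X}}$ has an approximately normal distribution if the standard deviation of $\bar{X}$ is much
smaller than the standard deviation of $\bar{Y}$. The same rationale applies as in (3).
(9) $Z = \frac{\sqrt{n} \, (\bar{R} \mu_X - \mu_Y)}{\sqrt{\sigma_Y^2 - 2 \rho \bar{R} \sigma_X \sigma_Y + \bar{R}^2 \sigma_X^2}}$ also has an approximate standard normal
distribution under certain conditions.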
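One way to see how well (6) and (7) describe the ratio of sample means is to compare them with a small Monte Carlo experiment. The sketch below is illustrative only: the parameter values, the seed, and the function names `simulate_mean_ratio` and `approx_moments` are assumptions made for this example, and the correlated pairs are generated by mixing two independent standard normal draws.

```python
import math
import random
import statistics

def simulate_mean_ratio(mu_x, sigma_x, mu_y, sigma_y, rho, n, reps, seed=0):
    """Draw `reps` samples of n correlated normal pairs; return each Ybar/Xbar."""
    rng = random.Random(seed)
    ratios = []
    for _rep in range(reps):
        sum_x = sum_y = 0.0
        for _i in range(n):
            z1 = rng.gauss(0.0, 1.0)
            z2 = rng.gauss(0.0, 1.0)
            sum_x += mu_x + sigma_x * z1
            # Mixing z1 and z2 this way gives Corr(X, Y) = rho.
            sum_y += mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)
        ratios.append(sum_y / sum_x)  # equals Ybar/Xbar since both sums share n
    return ratios

def approx_moments(mu_x, sigma_x, mu_y, sigma_y, rho, n):
    """Right-hand sides of results (6) and (7)."""
    mean = (mu_y / mu_x
            + sigma_x**2 * mu_y / (n * mu_x**3)
            - rho * sigma_x * sigma_y / (n * mu_x**2))
    var = (sigma_x**2 * mu_y**2 / (n * mu_x**4)
           + sigma_y**2 / (n * mu_x**2)
           - 2 * rho * sigma_x * sigma_y * mu_y / (n * mu_x**3))
    return mean, var

# Assumed illustrative setting, not taken from the paper.
params = dict(mu_x=100.0, sigma_x=5.0, mu_y=100.0, sigma_y=10.0, rho=0.3, n=30)
ratios = simulate_mean_ratio(reps=4000, seed=1, **params)
approx_mean, approx_var = approx_moments(**params)
print("mean:     simulated", statistics.fmean(ratios), "approx", approx_mean)
print("variance: simulated", statistics.variance(ratios), "approx", approx_var)
```

The simulated and approximate values should agree up to Monte Carlo error; larger coefficients of variation for X would be expected to widen the gap.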
Conditions of approximate normality:
Modifying the simulation results of Hayya et al,
(10) If $\rho = 0$, then $\frac{\bar{Y}}{\bar{X}} = \bar{R}$ has an approximately normal distribution at the 5% level of significance if
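Since the coefficient of variation of X is $CV_X = \frac{\sigma_X}{\mu_X}$, and $CV_Y = \frac{\sigma_Y}{\mu_Y}$, equations 6 and 7
can be approximated as follows:
(10) $E\left(\frac{\bar{Y}}{\bar{X}}\right) \cong \frac{\mu_Y}{\mu_X} \left(1 + \frac{CV_X^2}{n} - \frac{\rho \, CV_X CV_Y}{n}\right) \cong \frac{\mu_Y}{\mu_X} \left(1 + \frac{CV_X^2}{n}\right)$
(11) $Var\left(\frac{\bar{Y}}{\bar{X}}\right) \cong \frac{\mu_Y^2}{\mu_X^2} \left(\frac{1}{n} CV_X^2 + \frac{1}{n} CV_Y^2 - 2 \frac{\rho \, CV_X CV_Y}{n}\right) \cong \frac{\mu_Y^2}{\mu_X^2} \left(\frac{1}{n} CV_X^2 + \frac{1}{n} CV_Y^2\right)$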
The last approximations in equations (10) and (11) are justified when n is large and $\rho$ is either
0 or very small. The ratio $\rho/n$ is insignificant even for a moderately small value of $\rho$ and a
moderately large value of n. Recall that for the normal approximation we do require $\rho \le 0.5$. For
further ease of algebra, one may cautiously drop the factor $CV_X^2/n$ for sufficiently large n and a
sufficiently small value of $CV_X$. In that case $\bar{Y}/\bar{X}$ is an approximately unbiased estimate of $\mu_Y/\mu_X$, with
negligible bias.
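To get a feel for how small the discarded terms are, they can simply be tabulated. The sketch below evaluates the terms dropped in the last steps of (10) and (11) for a few sample sizes; the function name `dropped_terms` and the input values are hypothetical choices for illustration.

```python
def dropped_terms(cv_x, cv_y, rho, n):
    """Absolute size of the terms discarded in the last steps of (10) and (11)."""
    bias_factor = cv_x**2 / n              # the CV_X^2 / n factor kept in (10)
    mean_rho_term = rho * cv_x * cv_y / n  # rho term dropped from (10)
    var_rho_term = 2 * rho * cv_x * cv_y / n  # rho term dropped from (11)
    return bias_factor, mean_rho_term, var_rho_term

# Assumed values: CV_X = 5%, CV_Y = 10%, a small correlation of 0.1.
for n in (15, 30, 100):
    print(n, dropped_terms(cv_x=0.05, cv_y=0.10, rho=0.1, n=n))
```

All three quantities shrink in proportion to 1/n, which is the sense in which the text calls them negligible for large n.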
Applications:
1. If $\rho = 0$, the approximate confidence interval of $\mu_Y/\mu_X$ provides a useful alternative for the Behrens-Fisher
problem, when two populations have unequal standard deviations. The inclusion of the value 1 in
the confidence interval leads to the conclusion of the equality of means.
In many business-related settings, a hypothesis test (or a confidence interval) for the equality of
the means of two independent populations is conducted. Comparing the mean profits of
two stores of varying sizes, or the mean number of transactions for two different credit cards, are
examples of situations where such a confidence interval may be useful. Well-known
methods to compute such an interval are available as long as the population variances are either
known or are unknown but equal. However, no simple solution exists when the population
variances are unknown and unequal. In this paper we have utilized results 5-7 for such a purpose.
In our approach we compute a confidence interval of $\mu_Y/\mu_X$, and conclude that the population means
are equal if the interval contains 1. This approach is different from the traditional approach, in which
the means are compared by computing an interval based on $\bar{Y} - \bar{X}$ and concluding the equality of
the means if zero is within the interval.
Formula for the confidence interval of $\mu_Y/\mu_X$
Assumptions: We assume that the two populations are independent and normally distributed, and
the samples are drawn randomly from each population. Further we assume that the standard
deviation of one population, say X, is much smaller than that of Y (although this condition
may be relaxed depending on the sample sizes). Since $\rho = 0$, results 8 and 9 simplify as
follows:
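(10) $E\left(\frac{\bar{Y}}{\bar{X}}\right) \cong \frac{\mu_Y}{\mu_X} \left(1 + \frac{CV_X^2}{n}\right) \cong \frac{\mu_Y}{\mu_X}$
(11) $Var\left(\frac{\bar{Y}}{\bar{X}}\right) \cong \frac{\mu_Y^2}{\mu_X^2} \left(\frac{1}{n} CV_X^2 + \frac{1}{m} CV_Y^2\right)$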
Moreover, the bias is even smaller for a reasonably large value of n. Central Limit Theorem and
some algebraic manipulations lead to the following 95% confidence interval:
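(12) $\frac{\bar{Y}/\bar{X}}{1 + 1.96 \sqrt{\frac{1}{n} CV_X^2 + \frac{1}{m} CV_Y^2}} \le \frac{\mu_Y}{\mu_X} \le \frac{\bar{Y}/\bar{X}}{1 - 1.96 \sqrt{\frac{1}{n} CV_X^2 + \frac{1}{m} CV_Y^2}}$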
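The algebra leading to (12) is not spelled out in the text; one plausible route is sketched below. The shorthand $\theta = \mu_Y/\mu_X$ and $V = \frac{1}{n} CV_X^2 + \frac{1}{m} CV_Y^2$ is introduced here only for this sketch, and the steps assume $\theta > 0$ and $1.96\sqrt{V} < 1$.

```latex
\begin{align*}
\frac{\bar{Y}/\bar{X} - \theta}{\theta \sqrt{V}} &\approx N(0, 1)
  && \text{from (10), (11) and approximate normality} \\
-1.96 \le \frac{\bar{Y}/\bar{X} - \theta}{\theta \sqrt{V}} &\le 1.96
  && \text{with probability about } 0.95 \\
\theta \, (1 - 1.96 \sqrt{V}) \le \frac{\bar{Y}}{\bar{X}} &\le \theta \, (1 + 1.96 \sqrt{V})
  && \text{multiplying through by } \theta \sqrt{V} > 0 \\
\frac{\bar{Y}/\bar{X}}{1 + 1.96 \sqrt{V}} \le \theta &\le \frac{\bar{Y}/\bar{X}}{1 - 1.96 \sqrt{V}}
  && \text{dividing, since } 1.96 \sqrt{V} < 1
\end{align*}
```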
Intervals of different confidence levels may be computed by selecting appropriate values of Z.
The values of the coefficients of variation are estimated from the samples.
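As a rough illustration of formula (12) with estimated coefficients of variation and an arbitrary confidence level, the following Python sketch computes the interval from two raw samples. The function name `ratio_ci`, the guard against a non-positive denominator, and the sample data are assumptions made for this example.

```python
import math
from statistics import NormalDist, mean, stdev

def ratio_ci(x, y, level=0.95):
    """Interval (12) for mu_Y / mu_X from two independent samples x and y."""
    n, m = len(x), len(y)
    cv_x = stdev(x) / mean(x)  # sample estimate of CV_X
    cv_y = stdev(y) / mean(y)  # sample estimate of CV_Y
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    half = z * math.sqrt(cv_x**2 / n + cv_y**2 / m)
    if half >= 1.0:
        # The upper limit in (12) needs a positive denominator.
        raise ValueError("interval (12) is undefined: z * sqrt(...) >= 1")
    r = mean(y) / mean(x)
    return r / (1.0 + half), r / (1.0 - half)

# Hypothetical data, for illustration only.
x = [98, 101, 100, 99, 102]
y = [90, 110, 105, 95, 100]
for level in (0.90, 0.95, 0.99):
    print(level, ratio_ci(x, y, level))
```

Raising the confidence level increases z and therefore widens the interval on both sides.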
Simulation results: Nine simulations of 4000 runs each were run in Minitab to
compute 95% confidence intervals. In each case both population means were equal. The sample
sizes and the standard deviations of X and Y were varied, keeping the standard deviation of X smaller
than that of Y. The actual confidence level and the mean length of the interval were computed in each
case. Simulation 1, for example, is based on two normally distributed populations, each with
a mean of 100, and standard deviations of 10 (for Y) and 0.5 (for X). The actual confidence
level is 94.04% and the mean length of the interval is 0.0711821. The results follow:
1. Y~N(100,10), X~N(100,0.5), Sample Size 30 each
Actual confidence level: 94.04%
Mean Length: 0.0711821
2. Y~N(100,10), X~N(100,2), Sample Size 30 each
Actual confidence level: 94.42%
Mean Length: 0.0726839
3. Y~N(100,10), X~N(100,5), Sample Size 30 each
Actual confidence level: 94.54%
Mean Length: 0.0795576
4. Y~N(100,10) Sample Size 15, X~N(100,0.5) Sample Size 20
Actual confidence level: 93.34%
Mean Length: 0.100127
5. Y~N(100,10) Sample Size 15, X~N(100,2) Sample Size 20
Actual confidence level: 92.88%
Mean Length: 0.101451
6. Y~N(100,10) Sample Size 15, X~N(100,5) Sample Size 20
Actual confidence level: 93.68%
Mean Length: 0.108929
7. Y~N(100,10) Sample Size 20, X~N(100,0.5) Sample Size 15
Actual confidence level: 93.8%
Mean Length: 0.0869802
8. Y~N(100,10) Sample Size 20, X~N(100,2) Sample Size 15
Actual confidence level: 93.56%
Mean Length: 0.0891369
9. Y~N(100,10) Sample Size 20, X~N(100,5) Sample Size 15
Actual confidence level: 93.84%
Mean Length: 0.100611
Interesting observations:
Notice that in each of the cases the actual confidence level is below 95%. This is due to the fact that
in the simulation we use the sample estimates of the coefficients of variation of both X and Y, keeping
the practical issue in mind. In terms of confidence level, the best results were obtained when
the sample sizes were equal. The worst results were obtained when the sample was smaller for
Y than for X. A logical explanation for this occurrence is that one must take a larger sample from
the population with the larger variability. The length of the interval gets shorter as the ratio of the
standard deviation of X to that of Y gets smaller. The length also increases as the sample
sizes decrease.
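The simulation design described above can be approximated outside Minitab. The following Python sketch is an assumed re-implementation, not the authors' code: it draws two independent normal samples with equal means, builds interval (12) from the sample coefficients of variation, and records how often the interval covers the true ratio of 1. The function name `coverage` and the seed are illustrative.

```python
import math
import random
from statistics import fmean, stdev

def coverage(mu, sd_x, n, sd_y, m, runs=4000, z=1.96, seed=0):
    """Estimate the actual confidence level and mean length of interval (12)
    when both populations are normal with common mean `mu`."""
    rng = random.Random(seed)
    hits, total_len = 0, 0.0
    for _ in range(runs):
        x = [rng.gauss(mu, sd_x) for _ in range(n)]
        y = [rng.gauss(mu, sd_y) for _ in range(m)]
        xbar, ybar = fmean(x), fmean(y)
        cv_x = stdev(x, xbar) / xbar  # estimated from the sample, as in the paper
        cv_y = stdev(y, ybar) / ybar
        half = z * math.sqrt(cv_x**2 / n + cv_y**2 / m)
        r = ybar / xbar
        lo, hi = r / (1 + half), r / (1 - half)
        hits += lo <= 1.0 <= hi  # the true ratio is 1 because the means are equal
        total_len += hi - lo
    return hits / runs, total_len / runs

# Setting of case 1: Y ~ N(100, 10), X ~ N(100, 0.5), sample size 30 each.
level, length = coverage(mu=100, sd_x=0.5, n=30, sd_y=10, m=30)
print("actual confidence level:", level, "mean length:", length)
```

Because the random number stream differs from Minitab's, the output should only be expected to land near the reported figures for case 1, up to Monte Carlo error.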
Restrictions on Standard deviations of X and Y
Hayya et al (1975) recommended the following conservative rule of thumb for $Y/X$ to have
an approximately normal distribution: coefficient of variation of X <= 0.09 and coefficient of
variation of Y > 0.19. In our proposed result, the normal assumption will depend on the ratio
$\sigma_Y/\sigma_X$ as well as the values of n and m. For example, a simulation of 4000 runs showed that
even with the ratio $\sigma_Y/\sigma_X = 1$, $\bar{Y}/\bar{X}$ has an approximately normal distribution when n = 15 and m = 9. Therefore
our result has more flexibility. More investigation is underway on this topic.
Numerical Example:
Suppose the following information is obtained from two independent business transactions:
Transaction 1: n = 15, $\bar{X}$ = $188.00, sample standard deviation = $18.00
Transaction 2: m = 9, $\bar{Y}$ = $196.00, sample standard deviation = $28.00
Based on (12), with $CV_X = 18/188$ and $CV_Y = 28/196$, a 95% confidence interval of $\mu_Y/\mu_X$ is as follows:
$0.9433 \le \mu_Y/\mu_X \le 1.1651$.
The interval (0.9433, 1.1651) provides possible values for the ratio $\mu_Y/\mu_X$. Moreover, we also
conclude that the two population means are not statistically different (note that the value of 1 is
included in the interval).
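For readers who want to retrace the arithmetic, the sketch below evaluates formula (12) directly from the summary statistics of the two transactions. The function name `ratio_ci_from_summary` is hypothetical, and the coefficients of variation are taken as sample standard deviation divided by sample mean.

```python
import math

def ratio_ci_from_summary(xbar, sx, n, ybar, sy, m, z=1.96):
    """Interval (12) for mu_Y / mu_X from sample means, sds, and sizes."""
    cv_x = sx / xbar
    cv_y = sy / ybar
    half = z * math.sqrt(cv_x**2 / n + cv_y**2 / m)
    r = ybar / xbar
    return r / (1.0 + half), r / (1.0 - half)

# Transaction 1 plays the role of X, Transaction 2 the role of Y.
lo, hi = ratio_ci_from_summary(xbar=188.0, sx=18.0, n=15,
                               ybar=196.0, sy=28.0, m=9)
print("interval:", lo, hi)
print("means judged equal:", lo <= 1.0 <= hi)
```

The decision rule is the one stated in the text: equality of the means is concluded when the interval contains 1.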
Suggestions to improve the confidence level of the proposed interval:
As noted from the simulation study, the actual confidence level for the interval is below
95%. This is particularly true when the sample sizes are 20 or less. It is believed that the confidence
level will increase to the desired level for samples of more than 30. To increase the confidence
level for all sample sizes, a value of t may be used in formula (12) in place of z. However,
formula (12) is very appealing to practitioners for its simplicity. More work is underway to
determine the degrees of freedom, if a t value is used in the above formula.
References
Jack Hayya, Donald Armstrong, and Nicolas Gressis. “A Note on the Ratio of Two Normally Distributed Variables.” Management Science, Volume 21, No. 11, Theory Series, 1975, pp. 1338-1341.
Roussas, George. An Introduction to Probability and Statistical Inference. San Diego: Elsevier Science, 2003.
Montgomery, Douglas. Design and Analysis of Experiments, 7th Edition. New Jersey: John Wiley & Sons, 2009.