Exam solutions January 2012

Statistics II
Final Exam - January 2012
Use the University stationery to give your answers to the following questions.
Do not forget to write down your name and class group on each page.
Indicate clearly the beginning and end of each question.
Exercises
1. (2 points) In a certain game, a good player is assumed to be one who scores more than 4 points
per match. You have been following player A, who scored an average of 5 points per match in a
large series of 100 matches, with a sample (quasi)variance of 3.94.
a) (0.5 points) Would you consider player A to be a good player at a 95 % confidence level?
b) (0.5 points) Suppose you also observed player B, whose p-value corresponding to the good-player test is 0.002. According to this evidence, whom would you consider a better player,
A or B? Why?
c) (0.5 points) You have used Statgraphics to carry out a hypothesis test on the data for player
A, with the following results:
Hypothesis Tests
Sample mean = 5,0
Sample standard deviation = 1,98479
Sample size = 100
99,0% confidence interval for mean: 5,0 +/- 0,521287 [4,47871;5,52129]
Null Hypothesis: mean = 4,5
Alternative: not equal
Computed t statistic = 2,51916
P-Value = 0,0133645
******************************
Indicate the null and alternative hypotheses for this test. Would you reject the null hypothesis for a significance level of 1 %? Why?
d) (0.5 points) Suppose that the sample average and variance values for player A have been
obtained from a small series of 5 matches (instead of 100). Can you reach any meaningful
conclusion for the test about the goodness of A?
1) Yes, without any further assumptions.
2) Yes, but we need to make some distributional assumption about the scores.
3) No.
Solution. Let $X_i$ denote the points scored by the player in the i-th match, and $\bar{X} = (X_1 + \cdots + X_n)/n$.
a) We have to test the null hypothesis
$$H_0: \mu \le 4 \quad \text{vs.} \quad H_1: \mu > 4,$$
where $\mu$ denotes player A's average score per match. As $n = 100$ is large, the Central Limit Theorem gives that, for $\mu = \mu_0 = 4$, the test statistic approximately satisfies
$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim N(0, 1),$$
and we reject $H_0$ if $z_{obs} > z_{0.05} = 1.645$. In this case, $z_{obs} = (5 - 4)/\sqrt{3.94/100} = 5.038 > 1.645$, so we reject the null hypothesis and conclude that player A can be considered a good player.
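The computation in part a) can be sketched in a few lines of Python. This is only an illustrative check using the standard library; the variable names are chosen here and are not part of the exam.

```python
import math
from statistics import NormalDist

n = 100        # number of matches observed
x_bar = 5.0    # sample mean score
s2 = 3.94      # sample quasi-variance
mu0 = 4.0      # boundary value of H0: mu <= 4
alpha = 0.05   # significance level (95 % confidence)

# Large-sample z statistic and upper-tail critical value of N(0, 1)
z_obs = (x_bar - mu0) / math.sqrt(s2 / n)
z_crit = NormalDist().inv_cdf(1 - alpha)
reject_h0 = z_obs > z_crit

print(f"z_obs = {z_obs:.3f}, z_crit = {z_crit:.3f}, reject H0: {reject_h0}")
```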
b) To compare the two players, we compare the p-values of their good-player tests. For player A,
$$\text{p-value} = \Pr(Z > z_{obs}) = \Pr(Z > 5.038) = 2.35 \times 10^{-7} \ll 0.002,$$
so a result at least as extreme as the one observed for A is far less likely under the null hypothesis than the corresponding result for B. The evidence against $H_0$ is therefore stronger for A, and player A seems to be much better than B.
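As a quick numerical check of the p-value for player A, the upper tail of the standard Normal can be evaluated with the complementary error function. The names below are illustrative only.

```python
import math

z_obs = 5.038       # observed statistic for player A, from part a)
p_value_b = 0.002   # p-value reported for player B

# Pr(Z > z) for Z ~ N(0, 1), written with erfc to keep precision in the far tail
p_value_a = 0.5 * math.erfc(z_obs / math.sqrt(2))

# The smaller p-value corresponds to the stronger evidence against H0
stronger_evidence = "A" if p_value_a < p_value_b else "B"
print(f"p-value A = {p_value_a:.3g}, stronger evidence: player {stronger_evidence}")
```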
c) The test carried out in this case is
$$H_0: \mu = 4.5 \quad \text{vs.} \quad H_1: \mu \ne 4.5.$$
As the p-value for this test is 0.0134, we reject the null hypothesis for all significance levels larger than this value. In particular, for a significance level of 1 % we would not reject $H_0$, but we would reject it for 5 %. This agrees with the 99 % confidence interval in the output, [4.47871; 5.52129], which contains the value 4.5.
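The figures in the Statgraphics output can be cross-checked by hand. The sketch below recomputes the t statistic and applies the p-value decision rule; the p-value and interval half-width are copied from the output rather than recomputed.

```python
import math

x_bar, s, n = 5.0, 1.98479, 100   # sample mean, standard deviation, size
mu0 = 4.5                         # value under H0
p_value = 0.0133645               # copied from the Statgraphics output
half_width = 0.521287             # half-width of the 99 % interval, from the output

t_obs = (x_bar - mu0) / (s / math.sqrt(n))

# Reject H0 exactly when the p-value is below the significance level
decisions = {alpha: p_value < alpha for alpha in (0.01, 0.05)}

# Equivalent check at the 1 % level: is mu0 inside the 99 % interval?
mu0_inside_ci = x_bar - half_width <= mu0 <= x_bar + half_width

print(f"t = {t_obs:.5f}, reject at 1 %: {decisions[0.01]}, reject at 5 %: {decisions[0.05]}")
```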
d) Option 2. If the average score per match and its sample variance were obtained by observing only
n = 5 matches, this sample size would not be enough to apply the Central Limit Theorem
and we could not reach a meaningful decision, unless we were to compensate the lack of
information in such a small sample with an assumption on the probability distribution of
the scores, such as considering that they follow a Normal distribution.
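As an illustration that goes beyond the original solution: if the scores are assumed to be Normal, the good-player test with n = 5 would use a Student-t statistic with 4 degrees of freedom. The critical value 2.132 below is the tabulated $t_{4;0.05}$ quantile, entered by hand as an assumption rather than computed.

```python
import math

n = 5
x_bar, s2, mu0 = 5.0, 3.94, 4.0   # same sample mean and quasi-variance as before
t_crit = 2.132                    # assumed table value of t_{4;0.05}

# Under Normality the statistic follows a Student-t with n - 1 = 4 d.f.
t_obs = (x_bar - mu0) / math.sqrt(s2 / n)
reject_h0 = t_obs > t_crit

print(f"t_obs = {t_obs:.3f}, reject H0 at 5 %: {reject_h0}")
```

With these numbers the small sample would not provide enough evidence to reject $H_0$, even under the Normality assumption.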
2. (2 points) You are conducting a study on the seasonal variations in the sales of shellfish in
one of Madrid’s districts. You have collected sales data from 20 fish markets in the district,
corresponding to two days in two different periods of interest: December 20th (Christmastime),
and April 17th (Spring); both days are Wednesdays. The following table presents a summary of
the shellfish sales income in each one of the days, as well as the value for the difference in sales
income between both periods:
                  Average sales   Quasi standard deviation
December          300 euros       44 euros
April             180 euros       29 euros
December−April    120 euros       44 euros
Answer the following questions, indicating in each case any sample or population assumptions
that you might need to make:
a) (1 point) Compute two confidence intervals for the average of the sales income in each of
the two periods, for a confidence level of 99 %.
b) (1 point) For a significance level of 5 %, conduct a hypothesis test to determine if the average
daily sales in December are at least 100 euros greater than the sales in April. Indicate the
null and alternative hypotheses and justify your conclusion.
Solution. We define the variables of interest as X = shellfish sales income on December 20 and Y = shellfish sales income on April 17. As we only have information for the 20 fish markets in the district,
we cannot assume that we have a large sample; as a consequence, we will need to assume
that the population follows a normal distribution. We will also assume that the observations
corresponding to X and Y for the 20 markets are simple random samples. These samples (X, Y )
are paired, as they have been obtained for the same markets on two different dates.
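a) The confidence intervals are given by
$$CI_{\mu_X}(99\,\%) = \bar{x} \pm t_{19,0.005} \frac{s_x}{\sqrt{20}} = 300 \pm 2.86\, \frac{44}{\sqrt{20}} = (271.85;\ 328.15) \ \text{in euros;}$$
$$CI_{\mu_Y}(99\,\%) = \bar{y} \pm t_{19,0.005} \frac{s_y}{\sqrt{20}} = 180 \pm 2.86\, \frac{29}{\sqrt{20}} = (161.45;\ 198.55) \ \text{in euros.}$$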
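The two intervals can be reproduced with a small helper. This is a sketch only: the function name `t_interval` is hypothetical, and the quantile 2.861 is the tabulated $t_{19,0.005}$ value entered by hand (the solution rounds it to 2.86).

```python
import math

def t_interval(mean, sd, n, t_crit):
    """Return (lower, upper) for mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

n = 20
t_crit = 2.861   # assumed table value of t_{19,0.005}

ci_december = t_interval(300, 44, n, t_crit)
ci_april = t_interval(180, 29, n, t_crit)

print("December: (%.2f; %.2f)" % ci_december)
print("April:    (%.2f; %.2f)" % ci_april)
```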
b) The null and alternative hypotheses for the test are:
H0 : µX − µY ≤ 100
H1 : µX − µY > 100,
and if we define D = X − Y ,
H0 : µD ≤ 100
H1 : µD > 100.
The value of the test statistic is
$$t = \frac{\bar{d} - d_0}{s_d/\sqrt{n}} = \frac{120 - 100}{44/\sqrt{20}} = 2.033.$$
Under the normality assumption and for µD = 100, this statistic follows a Student-t distribution with n − 1 = 19 degrees of freedom, so the rejection region consists of those samples whose statistic exceeds the quantile t19,0.05 = 1.73,
CR = {t > 1.73}.
As 2.033 > 1.73, we reject the null hypothesis at a significance level of 5 %; that is, we accept that the average sales income in December exceeds that in April in this district by more than 100 euros.
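The paired test above reduces to a one-sample computation on the differences D = X − Y, which can be sketched as follows. The critical value 1.729 is the tabulated $t_{19,0.05}$ quantile, entered by hand as an assumption.

```python
import math

n = 20
d_bar, s_d = 120.0, 44.0   # mean and quasi standard deviation of December - April
d0 = 100.0                 # boundary value of H0: mu_D <= 100
t_crit = 1.729             # assumed table value of t_{19,0.05}

t_obs = (d_bar - d0) / (s_d / math.sqrt(n))
reject_h0 = t_obs > t_crit

print(f"t_obs = {t_obs:.3f}, reject H0 at 5 %: {reject_h0}")
```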
3. (3 points) The sales department of a clothing company is conducting a study on the company’s
catalog sales. Their goal is to determine if there is a meaningful relationship between the number
of phone lines open to receive orders (“Phone lines”, L) and the volume of catalog sales (“Sales”,
S) (measured in hundreds of euros). The department has the following data on the values of
these variables for the last 20 days:
$$\sum_{i=1}^{20} l_i = 599, \quad \sum_{i=1}^{20} l_i^2 = 19195, \quad \sum_{i=1}^{20} s_i = 2835, \quad \sum_{i=1}^{20} s_i^2 = 458657,$$
$$\sum_{i=1}^{20} l_i s_i = 92000, \quad \sum_{i=1}^{20} e_i^2 = 16823.72,$$
where ei denotes the residuals of the regression model explaining the variable S as a function of
L.
a) (0.5 points) Compute the ANOVA table for S.
b) (0.5 points) Test if the variable “Phone lines” has no impact on the values of the variable
“Sales”, for a significance level of 5 %.
c) (0.5 points) Compute the value of the coefficient of determination and interpret it.
d) (0.5 points) Obtain the least-squares estimates for the parameters of the regression line
explaining the variable “Sales” (S) as a function of the values of the variable “Phone lines”
(L).
e) (0.5 points) Obtain an estimate for the sales forecast corresponding to a day in which
you have 12 open phone lines. Compute also a confidence interval at a 95 % level for this
forecast.
f) (0.5 points) Additionally, you have information on the number of catalogs that have been
distributed each day (“Number catalogs”, C). You fit a multiple regression model including
this new variable, and you obtain the following Statgraphics output:
Multiple Regression - Sales
Dependent variable: Sales
Independent variables:
Phone_lines
Number_catalogs
                                  Standard          T
Parameter         Estimate        Error             Statistic     P-Value
CONSTANT          -99,269         69,8328           -1,42152      0,1733
Phone_lines       5,01165         1,03056           4,86301       0,0001
Number_catalogs   0,00957155      0,00861747        1,11071       0,2822
Identify the values of the estimates for the parameters of the multiple linear regression
model, and interpret the value of the coefficient of the variable “Phone lines” (L).
Solution.
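a) From the data we have been given we obtain the residual sum of squares $SSR = \sum_{i=1}^{20} e_i^2 = 16823.72$, and also
$$SST = (n-1)s_s^2 = \sum_{i=1}^{20} (s_i - \bar{s})^2 = \sum_{i=1}^{20} s_i^2 - 20 \times \bar{s}^2 = \sum_{i=1}^{20} s_i^2 - \left(\sum_{i=1}^{20} s_i\right)^2 / 20 = 56795.75.$$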
Based on this information, the ANOVA table is given by:
Source       Sum of squares   D.F.   Mean Squares   F-ratio
Model        39972.03         1      39972.03       42.767
Residuals    16823.72         18     934.651
Total        56795.75         19
b) We test H0 : β1 = 0 (the variable “Phone lines” has no linear effect on “Sales”) against H1 : β1 ≠ 0, where β1 is the slope of the regression line. From the value of the F-ratio in the ANOVA table, we conduct a significance test for the model with critical region given by
CR0.05 = {F > F1,18;0.05 } = {F > 4.41}.
As the observed value F = 42.767 lies in the critical region, we reject the null hypothesis and conclude that the variable “Phone lines” is linearly related to the variable “Sales”.
c) The coefficient of determination is given by
$$R^2 = \frac{SSE}{SST} = \frac{39972.03}{56795.75} = 0.704,$$
where SSE denotes the sum of squares explained by the model. The variable “Phone lines” explains 70.4 % of the variability in the variable “Sales”.
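The quantities in parts a) to c) follow from the given sums by simple arithmetic, as the sketch below shows. Variable names are illustrative, and the critical value 4.41 is the $F_{1,18;0.05}$ quantile quoted in the solution.

```python
n = 20
sum_s, sum_s2 = 2835, 458657   # sum of s_i and of s_i^2
ss_resid = 16823.72            # sum of squared residuals (given)
f_crit = 4.41                  # F_{1,18;0.05}, as quoted in the solution

# Total, model and residual sums of squares with their degrees of freedom
ss_total = sum_s2 - sum_s**2 / n
ss_model = ss_total - ss_resid
df_model, df_resid = 1, n - 2

ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid
f_ratio = ms_model / ms_resid
r2 = ss_model / ss_total

print(f"SST = {ss_total:.2f}, SS model = {ss_model:.2f}, MS resid = {ms_resid:.3f}")
print(f"F = {f_ratio:.3f} (reject H0: {f_ratio > f_crit}), R^2 = {r2:.3f}")
```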
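d) We first compute some required values:
$$\bar{l} = \sum_{i=1}^{20} l_i / 20 = 29.95, \qquad \bar{s} = \sum_{i=1}^{20} s_i / 20 = 141.75,$$
$$s_l^2 = \left(\sum_{i=1}^{20} l_i^2 - 20 \times \bar{l}^2\right) / 19 = 66.05, \qquad s_s^2 = \left(\sum_{i=1}^{20} s_i^2 - 20 \times \bar{s}^2\right) / 19 = 2989.25,$$
$$cov(l, s) = \left(\sum_{i=1}^{20} l_i s_i - 20 \times \bar{l}\,\bar{s}\right) / 19 = 373.25.$$
From these values we obtain
$$\hat{\beta}_1 = \frac{cov(l, s)}{s_l^2} = 5.651, \qquad \hat{\beta}_0 = \bar{s} - \hat{\beta}_1 \bar{l} = -27.50,$$
and the regression model is $\hat{s} = -27.50 + 5.651\, l$.
We also have that the residual variance is (see the ANOVA table)
$$s_R^2 = \sum_{i=1}^{20} e_i^2 / (n - 2) = 934.651.$$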
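The least-squares estimates can be recomputed directly from the sums given in the exercise. The following is a minimal sketch with illustrative variable names.

```python
n = 20
sum_l, sum_l2 = 599, 19195     # sum of l_i and of l_i^2
sum_s = 2835                   # sum of s_i
sum_ls = 92000                 # sum of l_i * s_i

l_bar = sum_l / n
s_bar = sum_s / n

# Quasi-variance of l and quasi-covariance of (l, s), both with divisor n - 1
var_l = (sum_l2 - n * l_bar**2) / (n - 1)
cov_ls = (sum_ls - n * l_bar * s_bar) / (n - 1)

beta1 = cov_ls / var_l            # slope
beta0 = s_bar - beta1 * l_bar     # intercept

print(f"slope = {beta1:.3f}, intercept = {beta0:.2f}")
```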
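e) The point estimate for the forecast corresponding to $l_0 = 12$ is
$$\hat{s}_0 = -27.50 + 5.651\, l_0 = 40.31,$$
in hundreds of euros. To obtain the interval for this individual forecast we use the formula
$$CI_{0.05} = \hat{s}_0 \pm t_{18;0.025} \sqrt{s_R^2 \left(1 + \frac{1}{n} + \frac{(l_0 - \bar{l})^2}{(n-1)s_l^2}\right)} = 40.31 \pm 2.101 \sqrt{934.651 \left(1 + \frac{1}{20} + \frac{(12 - 29.95)^2}{19 \times 66.05}\right)} = (-33.12;\ 113.74).$$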
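The interval for the forecast can be checked numerically. This sketch uses the rounded coefficients from part d) and the quantile $t_{18;0.025} = 2.101$ quoted in the solution; small differences in the last digit come from rounding.

```python
import math

n = 20
beta0, beta1 = -27.50, 5.651   # rounded estimates from part d)
l_bar, var_l = 29.95, 66.05    # mean and quasi-variance of l
s2_resid = 934.651             # residual variance
t_crit = 2.101                 # t_{18;0.025}, as quoted in the solution
l0 = 12                        # number of open phone lines

s0_hat = beta0 + beta1 * l0

# Standard error for an individual forecast (note the leading "1 +")
se_forecast = math.sqrt(
    s2_resid * (1 + 1 / n + (l0 - l_bar) ** 2 / ((n - 1) * var_l))
)
lower = s0_hat - t_crit * se_forecast
upper = s0_hat + t_crit * se_forecast

print(f"forecast = {s0_hat:.2f}, interval = ({lower:.2f}; {upper:.2f})")
```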
f) The multiple linear regression model of interest is
$$\hat{s}_i = \hat{\beta}_0 + \hat{\beta}_1 l_i + \hat{\beta}_2 c_i,$$
and the values of the parameters from the Statgraphics output are $\hat{\beta}_0 = -99.269$, $\hat{\beta}_1 = 5.01165$, $\hat{\beta}_2 = 0.00957155$, yielding the model
$$\hat{s}_i = -99.269 + 5.01165\, l_i + 0.00957155\, c_i.$$
If we increase the number of open phone lines by one unit, while keeping the number of catalogs constant, sales increase by 5.01165 hundreds of euros, that is, 501.165 euros, on average.
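The interpretation of the “Phone lines” coefficient can be illustrated by comparing two predictions that differ only in one phone line. In this sketch the function `predict_sales` and the catalog count 1000 are hypothetical choices made for illustration; the estimates and standard errors are copied from the output.

```python
# (estimate, standard error) pairs copied from the Statgraphics output
params = {
    "CONSTANT": (-99.269, 69.8328),
    "Phone_lines": (5.01165, 1.03056),
    "Number_catalogs": (0.00957155, 0.00861747),
}

# Each T statistic in the output is the estimate divided by its standard error
t_stats = {name: est / se for name, (est, se) in params.items()}

def predict_sales(lines, catalogs):
    """Predicted sales in hundreds of euros (hypothetical helper)."""
    return (params["CONSTANT"][0]
            + params["Phone_lines"][0] * lines
            + params["Number_catalogs"][0] * catalogs)

# One extra phone line with the number of catalogs held fixed at 1000
effect = predict_sales(13, 1000) - predict_sales(12, 1000)
effect_in_euros = effect * 100

print(f"effect of one extra line: {effect_in_euros:.3f} euros")
```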
Questions
1. (1 point) Determine if the following statements are true or false. Provide a brief justification for
your answer.
a) (0.5 points) As a response to the current economic crisis, 15 countries have decided to apply
a policy based on austerity measures, while another group of 15 countries have chosen to
follow a policy based on the use of stimulus packages. You wish to use a statistical testing
procedure to evaluate if the growth rates associated to each set of policies are significantly
different. An appropriate hypothesis test is a two-sided test for paired samples.
b) (0.5 points) We are interested in studying if there is a significant difference between the
salaries of men and women in the communications and services sectors. We have selected
100 companies in the communications sector and 100 companies in the services sector.
For each company we collect information on a standardized indicator for the difference in
salaries between men and women. An appropriate hypothesis test is a two-sided test for
independent samples.
Solution.
a) FALSE. We have no information to think that the countries included in both samples can
be paired in any meaningful way for this study. It would be more reasonable in this case to
consider the samples as independent.
b) TRUE. As in the preceding case, we do not have any information that might indicate that
the companies included in both samples have any relationship. Thus, it seems reasonable
in this case to treat both samples as independent.
2. (1 point) For a simple linear regression model y = β0 + β1 x + u, determine if the following
statements are true or false. Provide a brief justification for your answer.
a) (0.5 points) If the variance of the errors is equal to 0, the coefficient of determination is
also equal to 0.
b) (0.5 points) For the estimated linear regression model ŷi = −3 + 0.5xi , each additional unit
of variable X implies a decrease of 3 units in the value of variable Y .
Solution.
a) FALSE. If the variance of the errors is equal to 0, then the coefficient of determination is equal to 1. If the variance of the errors is 0, then SSR = 0 and
$$R^2 = \frac{SST - SSR}{SST} = \frac{SST}{SST} = 1.$$
b) FALSE. For each additional unit of X, the predicted value of Y increases by β̂1, that is, 0.5 units. The value −3 is the intercept β̂0, the prediction at x = 0.
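Both answers can be illustrated with a toy data set lying exactly on the line ŷ = −3 + 0.5x. The x values below are hypothetical and chosen only so that all errors are zero.

```python
def predict(x):
    """Fitted line from statement b): y_hat = -3 + 0.5 * x."""
    return -3 + 0.5 * x

# Hypothetical data lying exactly on the line, so every residual is zero
xs = [0, 2, 4, 6, 8]
ys = [predict(x) for x in xs]

y_bar = sum(ys) / len(ys)
sst = sum((y - y_bar) ** 2 for y in ys)                  # total sum of squares
ssr = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) # residual sum of squares
r2 = (sst - ssr) / sst

# Effect of one additional unit of x on the prediction
change_per_unit = predict(11) - predict(10)

print(f"SSR = {ssr}, R^2 = {r2}, change per unit of x = {change_per_unit}")
```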
3. (1 point) Answer the following questions, using the information provided in the Statgraphics
output.
Simple Regression - Y vs. X
Dependent variable: Y
Independent variable: X
Linear model: Y = a + b*X
Coefficients
               Least Squares   Standard      T
Parameter      Estimate        Error         Statistic    P-Value
Intercept      21,5885         2,46742       8,74945      0,0001
Slope          -2,68469        0,838677      -3,20111     0,0150

Analysis of Variance
Source           Sum of Squares   Df   Mean Square   F-Ratio   P-Value
Model            561,472          1    561,472       10,25     0,0150
Residual         383,553          7    54,7933
Total (Corr.)    945,025          8

Correlation Coefficient = -0,770801
R-squared = 59,4134 percent
R-squared (adjusted for d.f.) = 53,6154 percent
Standard Error of Est. = 7,40225
Mean absolute error = 4,99915
Durbin-Watson statistic = 2,71064 (P=0,8750)
Lag 1 residual autocorrelation = -0,366548
a) (0.5 points) Specify the values of the estimates for the three parameters in the model.
b) (0.5 points) Is the independent variable significant to explain the values of the response
variable? Why?
Solution.
a) The estimated model is given by $\hat{y}_i = 21.5885 - 2.68469\, x_i$, with a residual variance $s_R^2$ equal to 54.7933 (from the ANOVA table).
b) To carry out this test we look at the p-value associated to the slope of the regression line,
equal to 0.0150 (this same p-value is associated to the F-ratio in the ANOVA table). We
conclude that for any significance level larger than this p-value (α > 0.0150) we reject the
null hypothesis and the independent variable x is significant to explain the values of the
response variable y.
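The internal consistency of the output can be verified with a few lines of arithmetic: in simple regression the F-ratio equals the square of the slope's t statistic, which is why both share the same p-value. The sketch below copies its inputs from the output; variable names are illustrative.

```python
import math

slope, se_slope = -2.68469, 0.838677
ss_model, ss_resid, ss_total = 561.472, 383.553, 945.025
df_resid = 7
p_value = 0.0150   # p-value of the slope, copied from the output

t_slope = slope / se_slope
ms_resid = ss_resid / df_resid           # residual variance s_R^2
f_ratio = ss_model / ms_resid            # model has 1 degree of freedom
r2 = ss_model / ss_total
std_error_est = math.sqrt(ms_resid)      # "Standard Error of Est." in the output

significant_at_5 = p_value < 0.05

print(f"t = {t_slope:.5f}, t^2 = {t_slope**2:.3f}, F = {f_ratio:.3f}")
print(f"R^2 = {100 * r2:.4f} %, s_R = {std_error_est:.5f}, significant at 5 %: {significant_at_5}")
```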