The chi-square test statistic is computed by summing

Q1 - Recently a survey was conducted involving customers of a fitness center in Dallas, Texas.
Participants were asked to indicate how often they use the club by checking one of the following
categories: 0 – 1 time per week; 2-3 times per week; 4-5 times per week; more than 5 times. The
following data show how males and females responded to this question.
One of the purposes of the survey was to determine whether there is a relationship between the
gender of the customer and the number of visits made each week.
a. State the appropriate null and alternative hypothesis.
$H_0$: number of visits is independent of gender
$H_A$: number of visits is related to gender
b. What test procedure is appropriate to use to conduct this test?
Because the data are observed frequencies in various discrete categories, the appropriate test to use is the
chi-square contingency analysis. This involves determining the expected frequencies assuming the null
hypothesis is true and then comparing these expected frequencies, cell by cell, to the observed
frequencies. If these closely match, then the null hypothesis should not be rejected. However, if there is
a big difference between the expected and observed cell frequencies, we should reject the null hypothesis.
c. Conduct the hypothesis test using an alpha = .05 level.
The test statistic for performing a chi-square contingency analysis is computed as follows:
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \]
with d.f. = (r – 1)(c – 1).
The first step needed is to compute the expected cell frequencies. This is done under the assumption that
the null hypothesis is true and that the proportion of customers in each use level is the same regardless of
gender. The expected frequencies can be computed using:
\[ \text{Expected Frequency} = \frac{\text{row total} \times \text{column total}}{\text{grand total}} \]
For example for the cell corresponding to males who use the center 0-1 times per week, we get:
\[ \text{Expected Frequency} = \frac{172 \times 150}{475} = 54.3158 \]
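As a quick check of the arithmetic, the expected-frequency rule can be sketched in Python. The function name `expected_frequency` is illustrative only and does not come from the original solution; the totals 172, 150, and 475 are the ones used in the example above.

```python
def expected_frequency(row_total, column_total, grand_total):
    """Expected cell count under independence:
    (row total x column total) / grand total."""
    return row_total * column_total / grand_total

# Males (row total 172) who use the center 0-1 times per week (column total 150)
e_male_0_1 = expected_frequency(172, 150, 475)
print(round(e_male_0_1, 4))
```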
The following shows the expected cell frequencies for each cell:
Next for each cell we compute:
\[ \frac{(o - e)^2}{e} \]
For example in the cell for males and use between 0 and 1, we get:
\[ \frac{(41 - 54.3158)^2}{54.3158} = 3.26 \]
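The per-cell calculation can be reproduced with a short Python sketch. The helper name `cell_contribution` is an assumption made for illustration; the observed count 41 and expected count 54.3158 are the values from the worked example.

```python
def cell_contribution(observed, expected):
    """Contribution of one cell to the chi-square statistic: (o - e)^2 / e."""
    return (observed - expected) ** 2 / expected

# Males, 0-1 visits per week: observed 41, expected 54.3158
contribution = cell_contribution(41, 54.3158)
print(round(contribution, 2))
```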
Below we show the computation for each cell:
           0-1        2-3        4-5        over 5
Males      3.264433   0.822572   2.59585    0.531475
Females    1.853077   0.466939   1.473552   0.301696
The chi-square test statistic is computed by summing these values giving
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = 11.309 \]
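The summation can be checked with a minimal Python sketch that adds up the eight per-cell values listed above. The dictionary layout and variable names are illustrative choices, not part of the original solution.

```python
# Per-cell (o - e)^2 / e values, in column order 0-1, 2-3, 4-5, over 5
contributions = {
    "Males":   [3.264433, 0.822572, 2.59585, 0.531475],
    "Females": [1.853077, 0.466939, 1.473552, 0.301696],
}

# Sum over every row and column to get the test statistic
chi_square = sum(sum(row) for row in contributions.values())

# Degrees of freedom: (rows - 1) x (columns - 1)
n_rows = len(contributions)
n_cols = len(contributions["Males"])
df = (n_rows - 1) * (n_cols - 1)

print(round(chi_square, 3), df)
```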
The critical value for the contingency analysis test with (2-1) x (4-1) = 3 degrees of freedom and alpha
equal to .05 is found in the chi-square table to be 7.8147.
The decision rule is:
If $\chi^2$ > 7.8147, reject the null hypothesis
Otherwise, do not reject.
Since $\chi^2$ = 11.309 > 7.8147, reject the null hypothesis and conclude that use rate is
related to the gender of the customer.
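The decision rule can also be sketched in Python. The p-value step is an addition that the original solution does not compute: it relies on the closed-form upper-tail probability of a chi-square distribution that holds for exactly 3 degrees of freedom, and the function name `chi2_sf_df3` is hypothetical.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with
    exactly 3 degrees of freedom (closed form, valid for d.f. = 3 only)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

chi_square = 11.309      # test statistic from the text
critical_value = 7.8147  # table value for alpha = .05, d.f. = 3

# Decision rule from the text: reject if the statistic exceeds the critical value
reject_null = chi_square > critical_value

# Equivalent check (not in the original): reject if the p-value is below alpha
p_value = chi2_sf_df3(chi_square)
print(reject_null, p_value < 0.05)
```

Both checks lead to the same decision, since the critical value is by definition the point whose upper-tail probability equals alpha.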
Q2 - A national job placement company is interested in developing a model that might be used to
explain the variation in starting salaries for college graduates based on the college GPA. The
following data were collected through a random sample of the clients with which this company
has been associated.
Based on this sample information, determine the least squares regression model, determine what
percent of the variation in starting salaries is explained by GPA, and test to determine whether
the regression model is statistically significant at the 0.05 level of significance. Also, develop a
scatter plot of the data and locate the regression line on the scatter plot.
ANSWER:
In this situation, the dependent variable, y, is the starting salary for college graduates.
The independent variable, x, is the graduate’s college GPA. A random sample of n = 10 people
was selected and the data were recorded. The least squares regression model seeks the
straight line that “best” fits the data. The least squares criterion states that the
regression line will minimize the sum of the squared residuals. The residual is the difference
between the fitted y value, as determined by the regression line, and the actual y value. The
sample regression model will take the form
\[ \hat{y} = b_0 + b_1 x \]
The values $b_0$ and $b_1$ are called the regression coefficients. They are the y intercept and the
slope respectively. The following equations are used to arrive at $b_0$ and $b_1$:
\[ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \]
\[ b_0 = \bar{y} - b_1 \bar{x} \]
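The two least squares equations translate directly into code. The sketch below is illustrative only: the function name `least_squares` is an assumption, and the three data points are hypothetical values chosen to lie exactly on a line, NOT the sample of 10 graduates used in this problem.

```python
def least_squares(x, y):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x using the
    least squares equations for the slope and intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of (x - x_bar)(y - y_bar); denominator: sum of (x - x_bar)^2
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data for illustration only (not the survey sample)
gpa = [2.0, 3.0, 4.0]
salary = [20000.0, 25000.0, 30000.0]
b0, b1 = least_squares(gpa, salary)
print(b0, b1)
```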
By using these least squares equations, we will arrive at the “optimal” values for regression
coefficients. The following regression model is determined:
\[ \hat{y} = 13,524.6 + 6,208.72 x \]
The scatter plot for the data with the regression line fitted to the data is shown as follows:
As seen in the scatter plot, the regression line intercepts the y axis at about 13,524 and the slope
of the regression line is about 6,208.
To determine the percentage of variation in starting salary that is explained by the regression
model with GPA as the independent variable, we must compute the coefficient of determination
or $R^2$. The following equation is used to compute the coefficient of determination:
\[ R^2 = \frac{SSR}{TSS} \]
The quantity SSR stands for sum of squares regression and TSS is the total sum of squares.
These are computed as follows:
\[ SSR = \sum (\hat{y} - \bar{y})^2 \quad \text{and} \quad TSS = \sum (y - \bar{y})^2 \]
For these sample data we get:
\[ SSR = \sum (\hat{y} - \bar{y})^2 = 49,495,938 \]
and
\[ TSS = \sum (y - \bar{y})^2 = 86,256,000 \]
Thus the coefficient of determination is:
\[ R^2 = \frac{SSR}{TSS} = \frac{49,495,938}{86,256,000} = .5738 \]
Therefore, knowing GPA explains just over 57 percent of the variation in starting salaries for the
sample data.
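The ratio is easy to verify with a few lines of Python, using the two sums of squares reported above. The variable names are illustrative.

```python
ssr = 49_495_938  # sum of squares regression, from the text
tss = 86_256_000  # total sum of squares, from the text

# Coefficient of determination: share of total variation explained by the model
r_squared = ssr / tss
print(round(r_squared, 4))
```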
To determine whether the regression model is statistically significant, we need to test the
following null and alternative hypotheses:
$H_0: \beta_1 = 0.0$
$H_a: \beta_1 \neq 0.0$
We can test this hypothesis in two ways: using a t-test approach or using the related F-test
approach. We will use the F-test approach. The following table provides the required values:
The calculated F test statistic is found by taking the ratio of the mean square regression over the
mean square error:
\[ F = \frac{MSR}{MSE} = \frac{49,495,938}{4,595,008} = 10.77168 \]
We then compare this calculated F to a critical F for an alpha = 0.05 and degrees of freedom
equal to 1 and 8 which is 5.318. Since F = 10.77 > 5.318, we reject the null hypothesis and
conclude that the regression slope coefficient is not zero. This means that GPA is a significant
variable for explaining the variation in the dependent variable, starting salaries.
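The F-test arithmetic can be sketched in Python as follows. One assumption is made here: the error sum of squares is taken as TSS minus SSR and divided by n - 2 = 8 degrees of freedom, which reproduces the mean square error of about 4,595,008 quoted above. The variable names are illustrative.

```python
n = 10            # sample size, from the text
ssr = 49_495_938  # sum of squares regression
tss = 86_256_000  # total sum of squares

# Assumption: SSE = TSS - SSR, with n - 2 error degrees of freedom
sse = tss - ssr
msr = ssr / 1          # one independent variable, so 1 regression d.f.
mse = sse / (n - 2)

f_stat = msr / mse
critical_f = 5.318     # table value for alpha = 0.05 with 1 and 8 d.f.

reject_null = f_stat > critical_f
print(round(f_stat, 2), reject_null)
```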