Q1 - A survey was recently conducted among customers of a fitness center in Dallas, Texas. Participants were asked to indicate how often they use the club by checking one of the following categories: 0-1 times per week; 2-3 times per week; 4-5 times per week; more than 5 times per week. The following data show how males and females responded to this question. One of the purposes of the survey was to determine whether there is a relationship between the gender of the customer and the number of visits made each week.

a. State the appropriate null and alternative hypotheses.

$H_0$: number of visits is independent of gender

$H_A$: number of visits is related to gender

b. What test procedure is appropriate to use to conduct this test?

Because the data are observed frequencies in discrete categories, the appropriate test is the chi-square contingency analysis. This involves determining the expected frequencies, assuming the null hypothesis is true, and then comparing these expected frequencies, cell by cell, to the observed frequencies. If they match closely, the null hypothesis should not be rejected. If the expected and observed cell frequencies differ substantially, we should reject the null hypothesis.

c. Conduct the hypothesis test using an alpha = 0.05 level.

The test statistic for a chi-square contingency analysis is computed as follows:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

with d.f. = (r - 1)(c - 1), where $o_{ij}$ and $e_{ij}$ are the observed and expected frequencies in row i and column j.

The first step is to compute the expected cell frequencies. This is done under the assumption that the null hypothesis is true and that the proportion of customers in each use level is the same regardless of gender. The expected frequencies can be computed using:

$$\text{Expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$

For example, for the cell corresponding to males who use the center 0-1 times per week, we get:

$$\text{Expected frequency} = \frac{172 \times 150}{475} = 54.3158$$

The expected frequencies for the other cells are computed the same way. Next, for each cell we compute:

$$\frac{(o - e)^2}{e}$$

For example, in the cell for males and use between 0 and 1, we get:

$$\frac{(41 - 54.3158)^2}{54.3158} = 3.26$$

Below we show the computation for each cell:

| | 0-1 | 2-3 | 4-5 | over 5 |
|---|---|---|---|---|
| Males | 3.264433 | 0.822572 | 2.59585 | 0.531475 |
| Females | 1.853077 | 0.466939 | 1.473552 | 0.301696 |

The chi-square test statistic is computed by summing these values, giving:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = 11.309$$

The critical value for the contingency analysis test with (2 - 1) x (4 - 1) = 3 degrees of freedom and alpha equal to 0.05 is found in the chi-square table to be 7.8147. The decision rule is:

If $\chi^2 > 7.8147$, reject the null hypothesis. Otherwise, do not reject.

Since $\chi^2 = 11.309 > 7.8147$, we reject the null hypothesis and conclude that use rate is related to the gender of the customer.

Q2 - A national job placement company is interested in developing a model that might be used to explain the variation in starting salaries for college graduates based on college GPA. The following data were collected through a random sample of the clients with which this company has been associated. Based on this sample information, determine the least squares regression model, determine what percent of the variation in starting salaries is explained by GPA, and test whether the regression model is statistically significant at the 0.05 level of significance. Also, develop a scatter plot of the data and locate the regression line on the scatter plot.

ANSWER: In this situation, the dependent variable, y, is the starting salary for college graduates. The independent variable, x, is the graduate's college GPA. A random sample of n = 10 people was selected and the data were recorded. The least squares regression model seeks the straight line that "best" fits the data. The least squares criterion states that the regression line will minimize the sum of the squared residuals.
The residual is the difference between the fitted y value, as determined by the regression line, and the actual y value. The sample regression model takes the form:

$$\hat{y} = b_0 + b_1 x$$

The values $b_0$ and $b_1$ are called the regression coefficients. They are the y intercept and the slope, respectively. The following equations are used to arrive at $b_0$ and $b_1$:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

These least squares equations give the "optimal" values for the regression coefficients. The following regression model is determined:

$$\hat{y} = 13{,}524.6 + 6{,}208.72\,x$$

The scatter plot for the data with the regression line fitted to the data is shown as follows:

[Figure: scatter plot of starting salary versus GPA with the fitted regression line]

As seen in the scatter plot, the regression line intercepts the y axis at about 13,524 and the slope of the regression line is about 6,208.

To determine the percentage of variation in starting salary that is explained by the regression model with GPA as the independent variable, we must compute the coefficient of determination, $R^2$. The following equation is used:

$$R^2 = \frac{SSR}{TSS}$$

The quantity SSR is the sum of squares regression and TSS is the total sum of squares. These are computed as follows:

$$SSR = \sum (\hat{y} - \bar{y})^2 \quad \text{and} \quad TSS = \sum (y - \bar{y})^2$$

For these sample data we get:

$$SSR = \sum (\hat{y} - \bar{y})^2 = 49{,}495{,}938 \quad \text{and} \quad TSS = \sum (y - \bar{y})^2 = 86{,}256{,}000$$

Thus the coefficient of determination is:

$$R^2 = \frac{SSR}{TSS} = \frac{49{,}495{,}938}{86{,}256{,}000} = 0.5738$$

Therefore, knowing GPA explains just over 57 percent of the variation in starting salaries for the sample data.

To determine whether the regression model is statistically significant, we need to test the following null and alternative hypotheses:

$H_0$: $\beta_1 = 0.0$

$H_a$: $\beta_1 \neq 0.0$

We can test this hypothesis in two ways: using a t-test approach or using the related F-test approach. We will use the F-test approach.
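The least squares equations and the definition of $R^2$ given above can be sketched in a few lines of Python. The company's sample data are not reproduced in this answer, so the `x_toy` and `y_toy` values below are made-up placeholders chosen only to exercise the formulas. The function name `least_squares` is likewise an illustrative choice, not part of the original solution.

```python
def least_squares(x, y):
    """Return (b0, b1, r_squared) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    # Sums of squares for the coefficient of determination.
    ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, ssr / tss


# Toy values for illustration only; NOT the placement company's sample.
x_toy = [1, 2, 3, 4, 5]
y_toy = [2, 4, 5, 4, 5]
b0, b1, r2 = least_squares(x_toy, y_toy)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, R^2 = {r2:.2f}")
```

Passing the actual GPA and salary columns in place of the toy lists should give coefficients close to the 13,524.6 and 6,208.72 reported above, assuming the same ten observations are used.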
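The following table provides the required values. They follow from SSR and TSS above with n = 10, so that the error sum of squares is TSS - SSR and the degrees of freedom are 1 and 8:

| Source | d.f. | Sum of squares | Mean square |
|---|---|---|---|
| Regression | 1 | 49,495,938 | 49,495,938 |
| Error | 8 | 36,760,062 | 4,595,008 |
| Total | 9 | 86,256,000 | |

The calculated F test statistic is the ratio of the mean square regression to the mean square error:

$$F = \frac{MSR}{MSE} = \frac{49{,}495{,}938}{4{,}595{,}008} = 10.77168$$

We then compare this calculated F to the critical F for alpha = 0.05 and degrees of freedom equal to 1 and 8, which is 5.318. Since F = 10.77 > 5.318, we reject the null hypothesis and conclude that the regression slope coefficient is not zero. This means that GPA is a significant variable for explaining the variation in the dependent variable, starting salary.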
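The F-test arithmetic can be re-checked with a short Python sketch. It uses only the figures quoted in this answer (SSR, TSS, n = 10, and the table value 5.318); the variable names are illustrative, and the critical value is entered as a constant rather than computed from the F distribution.

```python
# Values reported in the answer above.
ssr = 49_495_938      # sum of squares regression
tss = 86_256_000      # total sum of squares
n = 10                # sample size
k = 1                 # number of independent variables (GPA only)
f_critical = 5.318    # table value for alpha = 0.05 with 1 and 8 d.f.

sse = tss - ssr                 # sum of squares error
df_regression = k
df_error = n - k - 1            # 10 - 1 - 1 = 8
msr = ssr / df_regression       # mean square regression
mse = sse / df_error            # mean square error
f_stat = msr / mse
r_squared = ssr / tss

print(f"SSE = {sse:,}  MSE = {mse:,.0f}")
print(f"R^2 = {r_squared:.4f}  F = {f_stat:.3f}")
print("reject H0" if f_stat > f_critical else "do not reject H0")
```

With one independent variable, the same F statistic can also be written as $R^2 / \big((1 - R^2)/(n - 2)\big)$, which gives a quick cross-check against the coefficient of determination.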