solutions

advertisement

Introduction to Statistics

Math 58 – Fall 2007

Professor Hardin

Exam 1

1) The following data report the number of flights that were “on-time” and “late” for

Continental Airlines and American Airlines in November, 2002 for all flights to Houston,

Chicago, and Los Angeles. The last column is the overall percent on-time for each city.

Houston Chicago Los Angeles Overall

On time

Late % on time

On time

Late % on time

On time

Late % on time

% on time

Continental 7318 1017 87.8 466 135 77.5 544 145 79.0 86.5

American 598 70 89.5 8330 1755 82.6 2707 566 82.7 82.9

(a) (2 pts) Is Simpson’s Paradox present with these data? Explain.

Yes, American had a higher on time arrival rate in each city but a lower rate overall.

(b) (4 pts) Use the data provided to explain why the paradox occurs here.

Continental tends to fly to Houston more than Chicago and LA and Houston has higher arrival rates in general (perhaps the airport is not as busy). While more American flights tend to go to Chicago and LA which have lower arrival rates pulling down American’s overall percentage.

(c) (3 pts) Based on these data, assuming that arriving on time is your prime consideration, which airline would you rather fly on? Explain.

No matter which (of these three) city you fly to, the chance of being on time is higher with American.

2) “Headset Phones May Still Pose Risks for Drivers” (Jesse Drucker, September 24,

2004) reported on a University of Utah study that used a driving simulator to see how drivers performed when randomly assigned to talk on a headset cellphone or to talk to another passenger. Part of the simulation instructed the drivers to take a specific exit.

Ten of 24 using a cellphone with a headset missed the exit, 5 of 24 talking to a passenger missed it.

(a) (1 pt) Identify the observational units.

Drivers

(b) (2 pts) Identify the explanatory variable and the response variable.

Explanatory: Talking via a headset or talking to the passenger

Response: Whether or not they missed the exit

(c) (2 pts) Is this an observational study or an experiment? Explain.

Experiment because the researchers manipulated (assigned) the explanatory variable.

(d) (3 pts) Create a two-way table for these data.

Cell phone passenger Total

Missed exit 10

Didn’t miss exit

14

Total 24

5

19

24

15

33

48

(e) (3 pts) Produce a segmented bar graph of these results (remember to label your axes).

Summarize what the graph reveals.

(f) (3 pts) Calculate the odds ratio for this table. Write a sentence interpreting this odds ratio in the context of this study.

Odds ratio = 10(19)/(14*5) = 2.71

The odds of missing the exit are 2.71 times higher when using the cell phone than when talking to a passenger.

(g) (3 pts) The following table reports the probabilities associated with randomly sampling 15 people from a group of 48 (24 of which were on the cell phone, 24 of which weren’t) and recording which of the 15 were on the cell phone. That is, if I select 15 people from a group of 48 (24 on the phone, 24 not on the phone), the probability that 8 of my 15 will be on the phone is 0.23.

Report the appropriate p-value for this situation.

X 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

P(X=x) 0 0 0.0006 0.005 0.024 0.076 0.16 0.23 0.23 0.16 0.076 0.024 0.005 0.0006 0

P( X ≥ 10 ) = 0.076+0.024+0.005+0.0006 = 0.1056

(h) (4 pts) Provide an interpretation of the p-value you calculated in (g). What is it the probability of?

0

If there was no real difference between the two groups (using the cell phone vs. talking to passenger), randomization alone would produce at least 10 of the 15 “successes” (missing the exit) about 10.56% of the time.

(i) (4 pts) Summarize and justify the conclusions that you would draw from this study, on the question of whether cellphone usage is more distracting than talking to a passenger.

The odds ratio of 2.71 is possibly unlikely to have happened by chance alone. We have very weak evidence (p-value ≈ 0.1) that the cell phone usage caused (since this was an experiment) a higher proportion of drivers to miss the exit.

3) USA Today is known as the “nation’s newspaper.” It strives to reach a very broad readership and is, therefore, reputed to be written at a fairly low readability level. On the other hand, the Washington Post generally has a reputation as a more serious newspaper aiming for a more intellectual readership. To assess whether or not quantitative data can reveal any evidence to support this reputation, data on sentence lengths (measured by number of words in a sentence) for the lead editorials from January 22, 2007, in USA

Today ( n = 31) and in the Washington Post ( n = 21) was collected.

(a) (5 pts) Based on the following dotplots, describe key features of the distributions and any similarities and differences revealed in the dotplots.

Both sentence length distributions are failure symmetric though a bit more right skew to the Washington Post values. There appears to be a tendency for longer sentences in the

USA Today actually but with some large outliers for the Washington Post. This also leads to a perception of more variability for the Washington Post.

You do not need to include an explanation or calculation with the following 3 questions:

(b) (1 pt) Which do you think is larger, the median USA Today sentence length or the median Washington Post sentence length? median USA Today median Washington Post

(c) (1 pt) Which do you think is larger, the median Washington Post sentence length or the mean Washington Post sentence length? median Washington Post mean Washington Post

(d) (3 pts) Suppose we were to remove the two largest sentence lengths for the Washington

Post ( WP ).

(i) How would the WP mean sentence length change? Larger

(ii) How would the WP median sentence length change? Larger

Smaller No Change

Smaller No Change

(iii) How would the WP standard deviation change? Larger Smaller No Change

The following dotplot displays an empirical randomization distribution comparing the mean sentence length between the two papers ( USA Today minus Washington Post ). difference in means (USA Today – WP)

(e) (6 pts) Briefly describe how such a randomization distribution is created and what information it provides.

We would take the 52 sentence lengths and randomly divided them into a group of 31 and a group of 21. Then find the means of these two groups and subtract. Repeat this say

1000 times. This graph gives us insight into how much these group means can differ from zero (variability) when the only reason for them to differ is the random assignment process.

(f) (3 pts) Based on this randomization distribution, approximately how large would the difference in means need to be for you to be convinced that the Washington Post editorials, on average, have longer sentence lengths? Explain your reasoning.

The bottom 5% of these simulated difference seems to be around -6 and below.

(g) (4 pts) For each of the following situations, indicate whether the p-value would increase (I), decrease (D), or stay the same (S), (no explanations needed):

(i) all the sentences examined had one more word

(ii) all the Washington Post sentences were one word longer

I

I

D

D

S

S presuming this creates a larger difference between the groups

(iii) both sample standard deviations were larger

(iv) both sample sizes were larger

I

I

D

D

S

S

4) In 1977, the U.S. government sued the City of Hazelwood, a suburb of St. Louis, on the grounds that it discriminated against African Americans in its hiring of school teachers (Finkelstein and Levin, 1990). The statistical evidence introduced noted that of the 405 teachers hired in 1972 and 1973 (the years following the passage of the Civil

Rights Act), only 15 had been African American. But according to 1970 census figures,

15.4% of teachers employed in St. Louis County that year were African American.

Suppose we model Hazelwood’s hiring practices as a Bernoulli process.

(a) (1 pt) Describe (in words) the variable of interest in this study. whether or not the hired teacher was African American

(b) (2 pts) Describe (in words) the parameter of interest. probability that the hiring process in Hazelwood hires an African American teacher

(c) (3 pts) State appropriate null and alternative hypotheses (in words and symbols) to assess whether there is convincing evidence that the Hazelwood district was hiring

African American teachers at a disproportionately low rate. let p represent the probability described in (b).

Ho: p = .154 (the probability in Hazelwood matches the County)

Ha: p < .154 (suspect Hazelwood is hiring at a lower rate)

(e) (3 pts) If, in fact, 15.4% of the teachers in Hazelwood are African American, we would virtually never see only 15 African Americans randomly selected out of a sample of 405 teachers.

Based on the above statement, should you reject or fail to reject the null hypothesis?

Why?

With such a small p-value we easily reject the null hypothesis.

(g) (3 pts) Include a one-sentence summary of the corresponding conclusion in context.

We have strong evidence that Hazelwood is hiring African Americans at a lower rate than the rest of the county.

(h) (4 pts) Provide a one-sentence interpretation of this p-value (what is it the probability of?) in this context.

This is the probability that we would find 15 or fewer African Americans among 405 randomly selected hired teachers if they were coming from a process where the probability of an African American is .154.

(i) (3 pts) Suppose the researchers obtain a very small p-value, would it be reasonable to use this analysis as proof of discrimination by the city? If not, briefly explain why not and what conclusion can reasonably be drawn.

No, this is only an observational study and cannot to identify the cause for the lower percentage of African Americans among Hazelwood teachers.

5) Some researchers wanted to estimate the proportion of Cal Poly students who drink alcohol.

(a) (3 pts) Briefly explain a method for obtaining a simple random sample of n =100 Cal

Poly students.

Get a list of all Cal Poly students, number each member of the list, and then use some kind of random number generator to select 100 ID numbers. Match these ID numbers to the individuals that will then comprise your simple random sample.

Note: This question specially asked about a “simple random sample”, so don’t describe a systematic sample or a stratified sample. They will lead to “random samples” but not every sample of size n will be equally likely.

(b) (3 pts) Using the same data, one researcher calculates a confidence interval to be

(.701, .846) but another calculates an interval to be (.686, .857). Which one is the 90% confidence interval and which is the 95% confidence interval? Explain briefly.

The interval that is wider will correspond to the larger confidence level.

(c) (3 pts) Briefly explain what the confidence interval (.701, .846) represents in this context as if to someone not in a statistics class.

We are 90% confident that the proportion of all Cal Poly students who drink alcohol is between .701 and .846. That is, this is the range of plausible values for the population parameter.

Note:

Later, we will talk more about what we mean by the phrase “90% confidence.”

Here I wanted you to convince me you knew that the interval was for the population proportion (as defined in this context), and the idea of “range of plausible values.” Recall that this is the set of values that would not be rejected by a two-sided test of significance with a .10 level of significance.

We do not say that there is a 90% chance that the proportion is in the interval. In fact, the proportion is either in the interval or it isn’t. There is no randomness associated with the interval after we have collected the data.

We also don’t say anything about the data being in the interval. The interval represents possible values for the parameter (of the population). In particular, here the data couldn’t possible be in the interval because the data are a series of “yes” and “no” responses.

Download