
MASSEY UNIVERSITY

PALMERSTON NORTH CAMPUS

EXAMINATION FOR

161.120 INTRODUCTORY STATISTICS

161.130 BIOMETRICS

SEMESTER II 2003

________________________________________________

Time allowed: THREE (3) hours.

This paper comprises:

SECTION A, containing 30 multiple choice questions;

SECTION B, containing 3 questions.

An appendix of tables is at the back of the paper.

A Scantron card is provided for your answers to Section A.

Instructions:

Attempt ALL of Section A and

TWO (2) questions from Section B.

Section A will be marked out of 30, and each question in Section B will be marked out of 15.

This examination contributes 50% (internal) or 70% (extramural) to the final assessment.

_________________________________________________________________________

SECTION A:

The answers to the questions in this section must be entered on a Scantron card in pencil and handed in with your blue examination book.

A1 Soil samples were collected at thirty different sites in an agricultural area and the soil acidity (pH) was measured. The following stem-and-leaf plot shows the pH values, which range from 2.6 to 6.3.

Stems Leaves

2 679

3 237789

4 1222446899

5 0556788

6 0233

The median acidity is: a. 4.6

* b. 4.5 c. 4.4 d. 4.3 e. 4.2
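The median in A1 can be checked with a short Python sketch (not part of the original paper). The dictionary below simply transcribes the stem-and-leaf plot, reading each stem as the units digit and each leaf as the tenths digit.

```python
from statistics import median

# Stem-and-leaf rows transcribed from question A1: stem = units, leaf = tenths.
stem_leaf = {
    2: "679",
    3: "237789",
    4: "1222446899",
    5: "0556788",
    6: "0233",
}

# Work in tenths (integers) to avoid floating-point noise, then rescale.
tenths = sorted(
    stem * 10 + int(leaf)
    for stem, leaves in stem_leaf.items()
    for leaf in leaves
)

n = len(tenths)
# With an even number of values, the median is the mean of the two middle ones.
median_ph = median(tenths) / 10

print(n, median_ph)
```

With 30 values the median is the average of the 15th and 16th ordered values, 4.4 and 4.6.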

A2 A _______________ shows the relationship between two variables. a. box plot b. bar chart c. histogram

* d. scatter plot e. pie chart

A3 To help interpret diagnostic tests, doctors need to understand the distribution of the test results for ‘normal’ people. The histogram below shows the plasma glucose concentrations (mg per dL) for 50 normal fasting people.

What proportion of the people has plasma glucose levels below 95? a. 0.09 b. 0.10 c. 0.16

* d. 0.34 e. 0.50

A4 A sample of 99 distances has a mean of 24 metres and a median of 24.5 metres. Unfortunately, it has just been discovered that an observation which was erroneously recorded as "30" actually had a value of "35". If we make this correction to the data, then: a the mean remains the same, but the median is increased b the mean and median remain the same

* c the median remains the same, but the mean is increased d the mean and median are both increased e we do not know how the mean and median are affected without further calculations; but the variance is increased.

A5 The weights of the male and female students in a class are summarized in the following boxplots:

[Figure: boxplots of weight for Males and Females. Horizontal axis: Weight (pounds), 80 to 240.]

Four of the following statements are correct. Which one is FALSE? a. About 50% of the male students have weights between 150 and 185 lbs. b. About 25% of female students have weights more than 130 lbs. c. The median weight of male students is about 162 lbs. d. The mean weight of female students is about 120 because their distribution is fairly symmetric.

* e. The male students have less variability than the female students.

A6 Four of the following statements are correct. Which one is FALSE?

* a The numbers 1, 5, 9 have a smaller standard deviation than 101, 105, 109. b The numbers 3, 3, 3 have a standard deviation of 0.

c The numbers 3, 4, 5 have the same standard deviation as 1003, 1004, 1005. d The standard deviation is a measure of spread around the centre of the data. e The standard deviation can only be computed for numerical data.

A7 A researcher wishes to calculate the average height of patients suffering from a particular disease. From patient records, the mean was computed as 156 cm, and the standard deviation as 5 cm. Further investigation reveals that the scale was misaligned, and that all readings are 2 cm too large, e.g., a patient whose height is really 180 cm was measured as 182 cm. Furthermore, the researcher would like to work with statistics based on meters. The correct mean and standard deviation are: a 1.56m, .05m

* b 1.54m, .05m c 1.56m, .03m d 1.58m, .05m e 1.58m, .07m
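The reasoning behind A7 (a constant shift changes only the mean, while a change of units rescales both the mean and the standard deviation) can be illustrated with a small Python sketch. The list `readings_cm` is hypothetical and is used only to check the rule; it is not data from the paper.

```python
from statistics import stdev

# Summary statistics as recorded (cm), from question A7.
recorded_mean_cm, recorded_sd_cm = 156.0, 5.0
offset_cm = 2.0      # every reading is 2 cm too large
cm_per_m = 100.0

# A constant shift moves the mean only; a change of units rescales both.
correct_mean_m = (recorded_mean_cm - offset_cm) / cm_per_m
correct_sd_m = recorded_sd_cm / cm_per_m

# Sanity check on hypothetical readings (illustrative, not data from the paper).
readings_cm = [150.0, 153.0, 156.0, 159.0, 162.0]
fixed_m = [(h - offset_cm) / cm_per_m for h in readings_cm]
shift_leaves_sd_unchanged = (
    abs(stdev(fixed_m) - stdev(readings_cm) / cm_per_m) < 1e-9
)

print(correct_mean_m, correct_sd_m, shift_leaves_sd_unchanged)
```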

A8 The following information about the country of origin of immigrants to Australia was published in the Dominion Post on Fri August 15.

Country of origin of immigrants to Australia, in the year to June 30

Britain       12,510    Philippines     3,190

China          6,600    South Africa    4,600

Fiji           1,610    Taiwan          1,110

Former USSR    1,100    United States   1,320

India          5,780    Vietnam         2,570

Indonesia      3,030    Yugoslavia      1,630

New Zealand   12,370    Other          36,490

The graphical display of these data that makes it easiest to see the proportion of immigrants from Asia is… a A bar chart with the countries in the same order as presented in the table b A bar chart with the countries ordered in decreasing order of frequency c A pie chart with the countries ordered in decreasing order of frequency d A bar chart with the countries reordered so that the Asian countries are adjacent

* e A pie chart with the countries reordered so that the Asian countries are adjacent

A9 When looking at a sequence of monthly postal revenue data, we note that the revenue is consistently highest in December. The high December revenue is an illustration of: a trend

* b seasonal variation

c random fluctuations d a cycle e an outlier

A10 For children between the ages of 18 months and 29 months, there is approximately a linear relationship between "height" and "age". From a data set of 100 children in this age group, the least squares line was y = 64.93 + 0.63x, where y represents height (in centimeters) and x represents age (in months). One of the

* a +0.9 b 79.1 c -0.9 d 56.6 e 64.93

A11 A company that conducts regular political public opinion polls for a TV station has decided to increase the size of its random sample of voters from about 1500 people to about 4000 people. The effect of this increase is to: a reduce the bias of the estimate. b increase the standard error of the estimate.

* c reduce the variability of the estimate. d increase the confidence interval width for the parameter. e have no effect since the population size is the same.

A12 Four of the following statements are correct. Which one is FALSE? a In a proper random sampling, every element of the population has a known (and often equal) chance of being selected. b The precision of a sample mean or sample proportion depends mainly upon the sample size (and not the population size) in a proper random sample. c Convenience sampling often leads to biases in estimates since the sample is often not representative of the population.

* d In a telephone survey of households in New Zealand, a high sample size guarantees that the mean household income in the country can be accurately estimated. e The sampling distribution of the sample mean describes how the sample mean will vary among repeated samples.

A13 A new headache remedy was given to a group of 25 subjects who had headaches. Four hours after taking the new remedy, 20 of the subjects reported that their headaches had disappeared. From this information you conclude: a that the remedy is effective for the treatment of headaches.

* b nothing, because there is no control group for comparison. c nothing, because the sample size is too small. d that the new treatment is better than aspirin. e that the remedy is not effective for the treatment of headaches.

A14 What is the best reason for performing a paired experiment rather than an experiment with two independent samples? a It is easier to do since we need fewer experimental units and each unit receives more than one treatment.

* b It allows us to remove variation in the results caused by other factors since we can compare both treatments within the same experimental unit. c The calculations will be more accurate since we work only with the differences. d The paired t-test uses fewer degrees of freedom than the two-sample t-test. e It allows us to do more experiments since we use each experimental unit twice.

A15 The daily milk production of Guernsey cows is approximately normally distributed with a mean of 35 kg/day and a std. deviation of 6 kg/day. The probability that one day’s production for a single animal will be less than 28 kg is approximately:

* a .12 b .41 c .09 d .38 e .62
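As a check on A15, the following Python sketch standardises 28 kg and evaluates the normal cumulative distribution function. It assumes the normal model stated in the question and uses `NormalDist` from the standard library in place of the printed tables.

```python
from statistics import NormalDist

# Model from question A15: daily production ~ Normal(mean 35 kg, sd 6 kg).
production = NormalDist(mu=35, sigma=6)

z = (28 - 35) / 6                  # standardised value of 28 kg
p_below_28 = production.cdf(28)    # P(production < 28)

print(round(z, 2), round(p_below_28, 2))
```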

A16 If the sampled population has mean 48 and standard deviation 16, then the mean and the standard deviation for the sampling distribution of x̄ for n = 16 are: a. mean 4, st devn 1 b. mean 12, st devn 4

* c. mean 48, st devn 4 d. mean 48, st devn 1 e. mean 48, st devn 16



A17 The essence of the Central Limit Theorem is that:

* a. Irrespective of the distribution of the parent population, the distribution of the sample mean will be approximately normal, provided the sample size is large b. Irrespective of the sample size, the sample mean will be normally distributed c. Irrespective of the sample size, the population mean will be normally distributed d. Provided the sample size is large, the distribution of the population from which the sample is selected will be normal e. Provided the sample size is large, the distribution of the sample can be regarded as approximately normal

A18 Suppose that 30% of first year students in the University of Auckland live in flats. If 200 students are randomly selected, then the standard deviation of the number who live in flats will be approximately a 0.0011 b 0.0324 c 0.3

* d 6.48 e 42.0
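The calculation for A18 treats the number of students living in flats as a binomial count, so its standard deviation is root(n * p * (1 - p)). A minimal Python sketch under that assumption:

```python
from math import sqrt

# Question A18: n students sampled, each living in a flat with probability p.
n, p = 200, 0.30

# Standard deviation of a binomial count.
sd_count = sqrt(n * p * (1 - p))

print(round(sd_count, 2))
```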

A19 Minitab reports the following information about the weights in pounds of 143 bears, classified by gender (male=1, female=2)

Descriptive Statistics: Weight by Sex

Variable Sex N Mean Median TrMean StDev

Weight 1 99 214.0 180.0 208.9 119.7

2 44 143.05 141.00 139.17 64.48

Variable Sex SE Mean Minimum Maximum Q1 Q3

Weight 1 12.0 34.0 514.0 122.0 316.0

2 9.72 26.00 356.00 114.00 164.50

The difference between the mean weights of the male and female bears is estimated to be

70.9 pounds. The standard error of this estimate is...

* a. 15.4 b. 21.7 c. 55.2 d. 92.1 e. 136.0
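For A19, the standard error of the difference between two independent sample means is the square root of the sum of the squared standard errors. The sketch below uses the "SE Mean" values exactly as printed in the Minitab output, so rounding in that output carries through to the result.

```python
from math import sqrt

# "Mean" and "SE Mean" values as printed in the Minitab output for A19.
mean_male, se_male = 214.0, 12.0
mean_female, se_female = 143.05, 9.72

difference = mean_male - mean_female

# Independent samples: the variances of the two sample means add.
se_difference = sqrt(se_male**2 + se_female**2)

print(round(difference, 2), round(se_difference, 1))
```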

Questions 20 and 21 relate to the following problem.

A Massey University researcher wishes to investigate whether a new variety of wheat is more resistant to a disease than an old variety. It is known that this disease strikes approximately 15% of all plants of the old variety. A field experiment was conducted and 12 of the 120 experimental plants became infected.

A20 The null and alternative hypotheses are: a H0: π = 0.10, HA: π > 0.15 b H0: π = 0.10, HA: π > 0.10 c H0: π = 0.15, HA: π ≠ 0.15

* d H0: π = 0.15, HA: π < 0.15 e H0: π = 0.15, HA: π > 0.15

A21 The calculated value of the test statistic is: a z = -47.1 b z = -0.39 c z = -3.07 d z = -1.83

* e z = -1.53
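The test statistic in A21 can be reproduced with the sketch below. It assumes the usual one-sample z test for a proportion, with the standard error computed under the null value 0.15. The one-sided p-value is an addition for illustration and is not asked for in the question.

```python
from math import sqrt
from statistics import NormalDist

# Questions A20 and A21: 12 of 120 plants infected; null value is 0.15.
pi0 = 0.15
infected, n = 12, 120

p_hat = infected / n
se_null = sqrt(pi0 * (1 - pi0) / n)   # standard error under H0
z = (p_hat - pi0) / se_null

# Lower-tail p-value, matching the alternative pi < 0.15.
p_value = NormalDist().cdf(z)

print(round(z, 2), round(p_value, 3))
```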

A22 A study was carried out on the effectiveness of a grain additive in deterring pigs from eating the grain. 1000 pigs were selected for the study, with 500 assigned to the treatment group (grain laced with 1080+dye) and the remaining 500 assigned to the placebo group (grain laced with dye only). Minitab reports…

T-Test of difference = 0 (vs not =): T-Value = -5.42 P-Value = 0.000 DF = 499

The best conclusion is… a. There is a large difference between the effects of the treatment and the placebo

b. There is strong evidence that the 1080 additive is very effective in altering the intake of grain by pigs

* c. There is strong evidence of a difference in intake between the treatment and placebo but the difference may be small d. There is little evidence that the treatment has any effect on the intake of grain by pigs e. There is evidence of a strong treatment effect

A23 Health researchers wish to investigate whether the tar content (milligrams) varies among four brands of cigarettes. Three packs of each brand were selected, and one cigarette from each pack was placed in a smoking machine to determine the tar content. An analysis of variance was performed and here are the results (some parts are hidden):

Analysis of Variance for Tar

Source DF SS MS F P

Brand 3 348.00 116.0 11.60 0.003

Error 8 80.00 10.0

Total 11 428.00

Which of the following is correct: a Because the p-value is small, there is evidence that all the brands differ from each other in the mean amount of tar present.

* b Because the p-value is small, there is evidence that at least one brand has a different mean tar content from the other brands. c Because the p-value is small, there is no evidence that any of the brands differ in the mean tar content. d Because the p-value is small, there is no evidence that at least one brand has a different mean tar content from the other brands. e Because the p-value is small, there is evidence that all of the brands have the same mean tar content.

Questions 24 to 26 relate to the following.

One concern about the depletion of the ozone layer is that the increase in UV light will decrease crop yields. An experiment was conducted in a greenhouse where 40 soybean plants were exposed to varying levels of UV, measured in Dobson units. At the end of the experiment the yield (kg) was measured. A regression analysis was performed in Minitab with the following results:

The regression equation is

Yield = 3.98 - 0.0463 UV

Predictor Coef SE Coef T P

Constant 3.980 0.0538 74.01 0.000

UV -0.04629 0.01074 -4.31 0.001

A24 The least squares regression line is the line… a that minimizes the sum of the squared differences between the actual UV values and the predicted UV values. b that minimizes the sum of the residuals between the actual yield and the predicted yield.

* c that minimizes the sum of the squared differences between the actual yield and the predicted yield.

d that minimizes the sum of the squared residuals between the actual UV reading and the predicted UV reading. e that minimizes the total variation in the data.

A25 Which of the following is correct? a If the UV reading is increased by 1 Dobson unit, the yield is expected to increase by .0463 kg. b If the yield increases by 1 kg, the UV reading is expected to decline by .0463 Dobson units.

* c The estimated yield is 3.98 kg when the UV reading is 0 Dobson units. d The predicted yield is 4.3 kg when the UV reading is 20 Dobson units. e The t-ratio 74.01 is used to test whether the estimated slope is different from zero.

A26 A 95% confidence interval for the slope is… a –0.046 ± 0.011 b –0.046 ± 0.108 c –0.046 ± 0.054 d –0.046 ± 0.046

* e –0.046 ± 0.021
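The interval in A26 has the form estimate ± multiplier * standard error, using the slope and "SE Coef" from the Minitab output above. The sketch below assumes the rough multiplier 2 that the model answers use elsewhere in this paper. The t-table value for 38 degrees of freedom is slightly larger (about 2.02), which widens the margin a little but leaves option e as the closest choice.

```python
# Slope estimate and its standard error, from the Minitab output for A24 to A26.
slope, se_slope = -0.04629, 0.01074

n_plants = 40
df = n_plants - 2    # residual degrees of freedom for simple regression

# Assumption: approximate 95% multiplier of 2 (the t-table value for 38 df is about 2.02).
multiplier = 2.0

margin = multiplier * se_slope
lower, upper = slope - margin, slope + margin

print(df, round(margin, 3), round(lower, 3), round(upper, 3))
```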

Questions 27 to 29 relate to the following data set.

In the paper "Color Association of Male and Female Fourth-Grade School Children" (J. Psych., 1988, 383-8), children were asked to indicate what emotion they associated with the color red. The response and the sex of the child are shown in the table below.

         anger  happy  love  pain  total

female      27     19    39    17    102

male        34     12    38    28    112

total       61     31    77    45    214

A27. Four of the following statements are correct. Which one is FALSE? a A lower percentage of girls associate the emotion "anger" with the color red than do boys. b More students associate the color red with the emotion "love" than with the emotion "anger". c Each student was classified by gender and by emotion association. Each student was counted in one and only one cell. d We will be unable to compute a correlation for this data because the variables are not both numerical variables.

* e We compute conditional proportions (given gender) by dividing the cell counts by the table total, 214.

A28. The null hypothesis for a chi-squared test on the above data is:

* a the emotion associated with red is independent of gender b gender is dependent upon the emotion associated with red c the probability of a child associating any of the emotions with red is related to gender

d the number of children in each cell does not depend upon gender nor upon emotion e the color red is independent of the emotion associated with it and with gender.

A29. Under this null hypothesis, the expected frequency for the cell corresponding to Anger and Males is: a 34.0 b 55.7 c 30.4 d 29.1

* e 31.9
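Under independence, each expected count is (row total * column total) / grand total. The Python sketch below applies that rule to the table for A27 to A29; the variable names are illustrative only.

```python
# Observed counts from the table for questions A27 to A29.
emotions = ["anger", "happy", "love", "pain"]
observed = {
    "female": [27, 19, 39, 17],
    "male": [34, 12, 38, 28],
}

row_totals = {sex: sum(counts) for sex, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total.
expected = {
    sex: [row_totals[sex] * col / grand_total for col in col_totals]
    for sex in observed
}

expected_male_anger = expected["male"][emotions.index("anger")]
print(round(expected_male_anger, 1))
```

The same totals give the conditional proportions mentioned in A27: dividing each cell by its row total (102 or 112), not by 214.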

A30 A survey was conducted to investigate the severity of rodent problems in egg and chicken operations. A random sample of 78 egg operators and 53 chicken operators was selected, and the operators were classified according to the extent of the rodent population. A Minitab analysis of the data gave the following output. (NB the first row of the contingency table corresponds to egg operators and the second row to chicken operators.)

Chi-Square Test: mild, moderate, severe

Expected counts are printed below observed counts

mild moderate severe Total

1 26 37 15 78

31.56 35.13 11.31

2 27 22 4 53

21.44 23.87 7.69

Total 53 59 19 131

Chi-Sq = 0.979 + 0.100 + 1.202 +

1.440 + 0.147 + 1.768 = 5.635

DF = 2, P-Value = 0.060

The conclusion from the test is... a The severity of rodent problems is the same for egg and poultry operators. b. The severity of rodent problems is different for egg and poultry operators c There is no evidence that the severity of rodent problems is different for egg and poultry operators.

* d The evidence for a difference in severity of the rodent problem between egg and poultry operators is only weak. e There is strong evidence of a difference in severity of rodent problems between egg and poultry operators.

SECTION B:

Answer two out of the three questions in this section.

B1. This question investigates traffic fatalities in New Zealand during 2001 and their relationship to the blood alcohol level of the drivers. The data used in the question were published by the Land Transport Safety Authority.

(a) The diagram below shows the distribution of ages of fatally injured drivers in 2001 and the numbers of these who were tested for the alcohol level in their blood.

[Figure: bar chart of the number of fatally injured drivers (vertical axis, 0 to 50) by Age (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60+), with series Tested and Not tested.]

(i) Ignoring initially the testing for blood alcohol level, critically discuss the effectiveness of this diagram as a way of showing how age is related to death. In particular, consider…

• Treatment of the category “60+”

Wider than the other categories, explaining the higher number of fatalities

• The effectiveness of the graphic as a display of the data

The 3D enhancements are chartjunk

• Whether a different type of display might have been better

Age is continuous, so a histogram would be better, though there is a problem with the “60+” age group. Perhaps treat it as “60 to 75”?

• Lurking variables that may explain the downward trend up to age 59

There are probably more kilometres driven by younger drivers, so the accident rate per km may not be higher for them.

(ii) In the 35-39 age group, 11 out of the 15 fatalities were tested for alcohol level, whereas in the 40-44 age group, 22 out of 25 were tested. Test whether the probability of getting tested is the same in both age groups.

The pooled p = 33/40 = 0.825. z = (11/15 – 22/25) / root(0.825 * 0.175 * (1/15 + 1/25)) = –1.18, so there is no evidence that the probabilities are different.
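The two-proportion z test in (ii) can be reproduced with the sketch below, which follows the same pooled-proportion formula as the model answer. The two-sided p-value is an addition for illustration.

```python
from math import sqrt
from statistics import NormalDist

# B1(a)(ii): drivers tested out of fatalities, in two age groups.
tested_35_39, n_35_39 = 11, 15
tested_40_44, n_40_44 = 22, 25

p1 = tested_35_39 / n_35_39
p2 = tested_40_44 / n_40_44

# Pooled proportion, used because H0 says the two probabilities are equal.
pooled = (tested_35_39 + tested_40_44) / (n_35_39 + n_40_44)
se = sqrt(pooled * (1 - pooled) * (1 / n_35_39 + 1 / n_40_44))

z = (p1 - p2) / se
p_value = 2 * NormalDist().cdf(-abs(z))   # two-sided

print(round(pooled, 3), round(z, 2), round(p_value, 2))
```

With samples of only 15 and 25 drivers the normal approximation is rough, so the p-value should be read as indicative.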

(iii) Discuss the following diagram, taking into account what you have learned in (ii).

[Figure: chart of the percentage (0% to 100%) of fatally injured drivers Tested and Not tested, by Age (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60+).]

This is an effective way to show how the proportion tested depends on age. There is no obvious trend and a fair amount of variability. Since (ii) showed that two of the larger differences were not significantly different, we can conclude that there is little if any influence of age on whether the dead drivers were tested.

(b) Out of the total of 204 dead drivers who were tested, 43 had alcohol levels over 80 mg per 100 ml of blood (the legal limit).

(i) Find a 95% confidence interval for the probability that a tested driver is over the legal limit.

43/204 ± 2 * root(43 * 161 / 204^3) = 0.154 to 0.268

(ii) Use the confidence interval in (i) to find a 95% confidence interval for the number of the 63 untested drivers who were over the 80mg limit, assuming that they have the same distribution of blood alcohol levels as the tested drivers, and hence a 95% confidence interval for the total number of dead drivers over the limit in 2001.

For the untested drivers, 0.154*63 to 0.268*63 = 9.7 to 16.9, so for all drivers, the CI is 52.7 to 59.9.
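The arithmetic in (b)(i) and (b)(ii) can be checked with the following sketch. It uses the same approximate multiplier of 2 as the model answer and the same assumption that untested drivers have the same distribution of blood alcohol levels as tested drivers.

```python
from math import sqrt

# B1(b): 43 of the 204 tested drivers were over the limit; 63 drivers were untested.
tested, over_limit, untested = 204, 43, 63

p_hat = over_limit / tested
se = sqrt(p_hat * (1 - p_hat) / tested)

# Approximate 95% interval for the probability, with multiplier 2 as in the model answer.
lower, upper = p_hat - 2 * se, p_hat + 2 * se

# Scale to the untested drivers, then add the 43 known cases.
untested_low, untested_high = lower * untested, upper * untested
total_low, total_high = over_limit + untested_low, over_limit + untested_high

print(round(lower, 3), round(upper, 3))
print(round(untested_low, 1), round(untested_high, 1))
print(round(total_low, 1), round(total_high, 1))
```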

(c) The table below describes only the fatally injured drivers who were tested for alcohol level.

The deaths were classified by blood alcohol level and the time of day when the accident happened. (Note that the legal limit for driving is 80 mg per 100 ml of blood.)

Blood alcohol level (mg per 100ml blood)

Time of day   Under 80   80 to 200   Over 200   Total

1am to 9am          49           8          5      62

9am to 5pm          77           2          6      85

5pm to 1am          35          18          4      57

Total              161          28         15     204

(i) Draw on graph paper a stacked bar chart that can be used to compare blood alcohol levels at the different times of day. Describe the pattern in this diagram in words that a traffic researcher might understand.

[Figure: stacked bar chart with one bar for each time of day (1am to 9am, 9am to 5pm, 5pm to 1am), each divided into Under 80, 80 to 200 and Over 200. Axis: 0% to 100%.]

The proportion of deaths with extremely high alcohol levels does not seem to depend on the time of day, but the proportion with moderately high alcohol levels (80 to 200) is much higher between 5pm and 1am and is lowest between 9am and 5pm.

(ii) Minitab reports the following results from a chi-squared test on the data.

Expected counts are printed below observed counts

Under 80 80 to 20 Over 200 Total

1 49 8 5 62

48.93 8.51 4.56

2 77 2 6 85

67.08 11.67 6.25

3 35 18 4 57

44.99 7.82 4.19

Total 161 28 15 204

Chi-Sq = 0.000 + 0.031 + 0.043 +

1.466 + 8.010 + 0.010 +

2.216 + 13.237 + 0.009 = 25.021

DF = 4, P-Value = 0.000

2 cells with expected counts less than 5.0

What are your conclusions from the test?

There is strong evidence (P-value = 0.000) that the distribution of blood alcohol levels of the dead drivers differs between times of day, so the pattern described in (i) is unlikely to be due to chance alone.
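The Minitab chi-squared output in (c)(ii) can be reproduced from the observed counts with the sketch below. The p-value line uses the closed form of the chi-squared upper tail that holds for exactly 4 degrees of freedom, exp(-x/2) * (1 + x/2); it is a shortcut for this table and not a general routine.

```python
from math import exp

# Observed counts from B1(c): rows are times of day, columns are alcohol levels.
observed = [
    [49, 8, 5],     # 1am to 9am
    [77, 2, 6],     # 9am to 5pm
    [35, 18, 4],    # 5pm to 1am
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(column) for column in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
small_cells = 0   # cells with expected count below 5
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (obs - expected) ** 2 / expected
        if expected < 5:
            small_cells += 1

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this closed form is valid only when df == 4.
p_value = exp(-chi_sq / 2) * (1 + chi_sq / 2)

print(round(chi_sq, 3), df, small_cells, p_value < 0.001)
```

Most of the statistic comes from the "80 to 200" column, in the 9am to 5pm and 5pm to 1am rows, which matches the pattern described in (i).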
