Chi-squared Worksheet


χ² Test Worksheet

Adapted from: http://www.perdisco.com/

Purpose of χ² test statistics:

Analysis of categorical data

Goodness of fit (see statistics video)

1. Hypothesis:

H₀: There is no association between row and column variables.

Hₐ: There is an association between row and column variables.

To test the null hypothesis, we compare the observed cell counts with the expected cell counts calculated under the assumption that the null hypothesis is true.

2. Test Statistic:

The χ² test statistic is

$$\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{n}, \qquad n = \text{total number of observations}$$

The test statistic X² has a χ² distribution with (r − 1)(c − 1) degrees of freedom when the null hypothesis is true.

Note: the χ² test statistic is always non-negative.

The smallest possible value of the test statistic is zero.

Large values of the test statistic occur when the observed frequencies are far from the expected frequencies under the null hypothesis.

Therefore: large values of the test statistic give evidence against H₀.
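The formulas and the note above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original worksheet: the function names and the two small tables are hypothetical and were chosen only so that the arithmetic is easy to follow by hand.

```python
def expected_counts(table):
    """Expected count per cell under H0: (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


def degrees_of_freedom(table):
    """(r - 1)(c - 1) for a table with r rows and c columns."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical counts, for illustration only.
proportional = [[10, 20], [20, 40]]  # observed equals expected in every cell
lopsided = [[30, 10], [10, 30]]      # observed far from expected (20 per cell)

print(chi_square_statistic(proportional))  # 0.0
print(chi_square_statistic(lopsided))      # 20.0
print(degrees_of_freedom(lopsided))        # 1
```

The first table gives the smallest possible value, zero, because every observed count matches its expected count; the second gives a large value because each cell is 10 away from its expected count of 20.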

3. P-value: area to the right of the test statistic under the χ² density curve with (r − 1)(c − 1) degrees of freedom.
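In place of a printed χ² table, the right-tail area can be computed directly. The sketch below uses only Python's standard library; the function name `chi2_sf` is hypothetical, and the method (starting from the closed forms for 1 and 2 degrees of freedom and stepping up with the standard recurrence for the upper incomplete gamma function) is one possible approach, assuming the degrees of freedom are a positive integer.

```python
import math


def chi2_sf(x, df):
    """Area to the right of x under the chi-squared density with df degrees of freedom."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # closed form for df = 1
    # Recurrence Q(a + 1, h) = Q(a, h) + h^a * e^(-h) / Gamma(a + 1),
    # applied until a reaches df / 2.
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


# The familiar 5% critical values give a right-tail area of about 0.05.
for critical_value, df in [(3.841, 1), (5.991, 2), (7.815, 3)]:
    print(df, round(chi2_sf(critical_value, df), 3))
```

A p-value below the chosen α means the observed statistic lies further into the right tail than the critical value, so H₀ is rejected.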

4. Joint, Marginal, and Conditional probability

         A₁     A₂     Total
B₁       a      b      a+b
B₂       c      d      c+d
Total    a+c    b+d    n = a+b+c+d

Marginal probabilities of A₁, A₂: $\frac{a+c}{n}$, $\frac{b+d}{n}$

The joint probability of A₂ and B₁: $\frac{b}{n}$

The conditional distribution: $\frac{\text{joint}}{\text{marginal}}$
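As a small numerical illustration of the three definitions, the sketch below fills the 2 × 2 layout above with hypothetical counts (the values of a, b, c, d and the function name are assumptions, not worksheet data) and computes one probability of each kind.

```python
def two_by_two_probabilities(a, b, c, d):
    """Probabilities for a 2 x 2 table with rows B1, B2 and columns A1, A2."""
    n = a + b + c + d
    marginal_a1 = (a + c) / n        # column total of A1 over n
    marginal_a2 = (b + d) / n        # column total of A2 over n
    marginal_b1 = (a + b) / n        # row total of B1 over n
    joint_a2_b1 = b / n              # single cell over n
    # conditional = joint / marginal, which simplifies to b / (a + b)
    a2_given_b1 = joint_a2_b1 / marginal_b1
    return {
        "P(A1)": marginal_a1,
        "P(A2)": marginal_a2,
        "P(B1)": marginal_b1,
        "P(A2 and B1)": joint_a2_b1,
        "P(A2 | B1)": a2_given_b1,
    }


# Hypothetical counts: row B1 = (30, 10), row B2 = (20, 40), so n = 100.
for name, value in two_by_two_probabilities(30, 10, 20, 40).items():
    print(name, round(value, 4))
```

Under H₀ (no association), the conditional probability P(A₂ | B₁) would equal the marginal P(A₂); here they differ (0.25 against 0.5), which is the kind of discrepancy the χ² statistic measures.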

Examples:

1.

A survey was conducted with a large group of teenagers from 6 different states and it asked them for their favourite sport out of a large list of sports. The study aims to look at whether a teenager's state of residence is associated with their sporting preferences.

Assume that p is the proportion of teenagers who selected the sport most preferred in their state of residence, such that p₁ is that for State 1, p₂ is that for State 2 and so on. Select from the following all statements that are true as well as the hypothesis that corresponds to this study:

a. H₀: There is an association between state and sporting preference.

b. H₀: p₁ = p₂ = ... = p₆

c. H₀: There is no association between state and sporting preference.

d. Hₐ: There is no association between state and sporting preference.

e. Hₐ: p₁ ≠ p₂ ≠ ... ≠ p₆

f. This is a chi-square test for association.

g. This is a z-test.

h. H₀: p₁ ≠ p₂ ≠ ... ≠ p₆

i. Hₐ: p₁ = p₂ = ... = p₆

j. Hₐ: There is an association between state and sporting preference.

Answer: c, f, and j.

The hypothesis that is to be tested is whether the factor 'state' is associated with the factor 'sporting preference'. The study is therefore about 'association' and whether the distributions of sporting preferences are identical across states. In this case, the χ² test for association is the most appropriate test.

Two factors are deemed not to be associated unless there is supporting evidence to suggest otherwise. Since the null hypothesis is the default belief, the correct pair of hypotheses is:

H₀: There is no association between state and sporting preference.

Hₐ: There is an association between state and sporting preference.

2.

Select whether a chi-square test for association is the most appropriate for the following scenarios (Most Appropriate / Not Appropriate):

a. Emma wishes to know whether gender has any relation with whether one prefers wine, beer, or liqueur. She surveys a group of men and women and records the number of participants who prefer wine, the number who prefer beer and the number who prefer liqueur.

b. Isabelle surveys a group of people on how often they eat dinner at restaurants and diners. The age and annual income of each participant are recorded. Isabelle is interested in knowing whether these two factors are important in determining how often one goes out for dinner.

c. It has been proposed that one's education level has a strong relation with one's income level. Gabrielle collects data from a large group of 38-year-old adults about their education level (primary and lower, secondary, tertiary and higher) and their income level (to the nearest $100). Gabrielle plans to test this data to see whether the mean income is the same across all three education levels.

Answer:

a. The chi-square test is the most appropriate here because the two factors, gender and alcohol preference, are both categorical variables. The frequency of each combination of the two factors is recorded on a two-way table as shown below:

          Wine   Beer   Liqueur   Total
Male      a      b      c         d
Female    e      f      g         h
Total     i      j      k         l

Emma plans to test whether gender has any effect on alcohol preference. This can be interpreted as a test of whether the two factors, gender and alcohol preference, are not associated (null) or associated (alternative).

b. The chi-square test is not the most appropriate here because the factors (frequency of dining out, age, income level) are all numerical variables. The income level, for example, has an almost unlimited number of levels (e.g. $1,000, $1,001 ... $100,000). Chi-square tests are used for categorical variables that have only a few levels (e.g. Male, Female). Even if the levels are grouped into categories of ranges, there will still be the problem of having more than two factors, each with different levels. The data would be tabulated as follows:

Frequency of Dining Out   Age   Income Level
a1                        a2    a3
b1                        b2    b3
c1                        c2    c3
d1                        d2    d3
e1                        e2    e3

The most appropriate method here is multiple regression, because two variables are hypothesised to affect a third variable.

c. The chi-square test is not the most appropriate here because the hypothesis to be tested is whether the mean income within each group is the same. A chi-square test for association is only concerned with whether there is an association between two factors, or whether they are independent of one another.

Furthermore, one of the factors (income level) is not categorical since it can take a range of values. The data would be tabulated as follows:

                     Primary Education   Secondary    Tertiary Education
                     or Lower            Education    or Higher
Income Level         a1                  a2           a3
                     b1                  b2           b3
                     c1                  c2           c3
Mean                 m1                  m2           m3
Standard Deviation   s1                  s2           s3

3.

A pharmaceuticals company claims to have developed a cure for the common cold. In order to test whether this is true, one group of recent cold-sufferers was given this medicine and another group was given a placebo instead. The selection process was completely random.

You plan to test whether the medicine has any effect on the cold by conducting a chi-square test for association on the data available. Calculate the χ² test statistic. Give your answer to 4 decimal places.

At α = 0.05, state whether the null hypothesis is rejected. You may find a χ² distribution table helpful.

The participants were examined two hours after they were given the medicine or placebo. The results are tabulated below:

             Placebo   Medicine   Total
No Effect    47        33         80
Cold Cured   32        25         57
Total        79        58         137

Answer:

χ² = 0.0929

At α = 0.05, the null hypothesis that the medicine has no effect is not rejected.

Degrees of freedom = (total number of rows − 1)(total number of columns − 1)

df = (2 − 1)(2 − 1) = 1

Calculating the χ² test statistic:

f_e = expected count, r = row total, c = column total, n = overall sample size

$$f_e = \frac{r \times c}{n}$$

$$\chi^2 = \frac{(47 - 46.13138686\ldots)^2}{46.13138686\ldots} + \frac{(33 - 33.86861314\ldots)^2}{33.86861314\ldots} + \cdots + \frac{(25 - 24.13138686\ldots)^2}{24.13138686\ldots} = 0.09285271\ldots \approx 0.0929$$

The chi-square statistic of 0.09285271... corresponds to a p-value of 0.76058138.... Since the p-value is greater than α (0.05), the null hypothesis is not rejected.
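The result can be cross-checked independently of the cell-by-cell sum. The sketch below is not part of the original solution: it uses the well-known shortcut for 2 × 2 tables, χ² = n(ad − bc)² / (product of the two row totals and two column totals), and the fact that with 1 degree of freedom the right-tail area equals erfc(√(χ²/2)). The variable names are chosen here for illustration.

```python
import math

# Observed counts from the placebo/medicine table.
a, b = 47, 33   # No Effect: placebo, medicine
c, d = 32, 25   # Cold Cured: placebo, medicine
n = a + b + c + d

# 2 x 2 shortcut: n * (ad - bc)^2 / (row totals * column totals)
chi_sq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# With 1 degree of freedom, the right-tail area is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(round(chi_sq, 4))   # 0.0929
print(round(p_value, 4))  # 0.7606
print(p_value > 0.05)     # True, so H0 is not rejected at alpha = 0.05
```

Both figures agree with the worked answer, which supports the conclusion that the data give no evidence of an association between treatment and outcome.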
