Statistics in Excel
Chapter 3
7. Nonparametric Tests
7.1 Tests of Independence for two Categorical Variables
The following table illustrates the relation between the categorical traits "sex" and "department".
We wish to choose between the null hypothesis H0: there is no association between sex and the
department in which a student studies, and the alternative HA: there exists an association between
sex and department. The data are given by the following contingency table
          Computer science   Management   Mathematics
Females   17                 14           17
Males     17                 19           16

We enter these frequencies into cells A1:C2, obtaining

    A    B    C
1   17   14   17
2   17   19   16
First, we calculate the row and column sums. The sum of the row sums (or equivalently, the sum of
the column sums) is the sample size. We obtain
    A    B    C    D
1   17   14   17   48
2   17   19   16   52
3   34   33   33   100 (n)
Now we calculate the table of expected frequencies under the hypothesis that department is
independent of sex. We place this table in cells A5:C6. The expected frequency in a cell is equal
to the product of the corresponding row sum and column sum divided by the total number of
observations, i.e.
    A          B          C
5   d1*a3/d3   d1*b3/d3   d1*c3/d3
6   d2*a3/d3   d2*b3/d3   d2*c3/d3
We can calculate these expected frequencies step by step. However, note that in the first term the
column is fixed and the row changes as appropriate, while in the second term the row is fixed and
the column changes as appropriate. The denominator is fixed. Hence, we may use the formula
A5=$d1*a$3/$d$3 and copy this formula into the remaining cells. We thus obtain the following
table of expected frequencies:
    A       B       C
5   16.32   15.84   15.84
6   17.68   17.16   17.16
Now we calculate the relative square deviation of the observed frequencies from the expected
frequencies (their square deviations divided by the expected frequency) for each cell in the table.
We write these values into cells A8:C9. For cell A8, this value is given by A8=(A1-A5)^2/A5.
We can copy this formula into the remaining cells. We obtain
    A          B          C
8   0.028333   0.213737   0.084949
9   0.026155   0.197296   0.078415
The realisation of the test statistic for this test of association is given by the sum of the elements in
the third table. Summing these values, =sum(a8:c9), we obtain t =0.628885. Under the null
hypothesis, this test statistic has approximately a chi-squared distribution with (r-1)(c-1) degrees of
freedom, where r is the number of rows and c the number of columns. Here, there are two degrees
of freedom. Large values of the realisation of the test statistic indicate that the null hypothesis is
incorrect. The p-value is given by P(T > t), where T has the appropriate chi-squared distribution
(here with two degrees of freedom). We calculate P(T > t) using the function CHISQ.DIST.RT(t, k)
[in older versions of Excel, CHIDIST(t, k)], where k denotes the number of degrees of freedom.
Hence, the p-value for this test is given by CHISQ.DIST.RT(0.628885, 2) = 0.7302.
Since this value is greater than 0.05, we do not have any evidence against H0, i.e. we may assume
that the choice of department does not depend on sex.
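The computation above can also be checked outside Excel. The following Python sketch (a minimal check using only the standard library, not part of the Excel workflow) reproduces the expected frequencies, the test statistic and the p-value; with 2 degrees of freedom the chi-squared survival function has the closed form exp(-t/2), so no statistics library is needed.

```python
import math

# Observed frequencies, as in cells A1:C2
observed = [[17, 14, 17],   # females
            [17, 19, 16]]   # males

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
n = sum(row_sums)

# Expected frequency = (row sum * column sum) / n, as in A5=$d1*a$3/$d$3
expected = [[r * c / n for c in col_sums] for r in row_sums]

# Test statistic: sum of (O - E)^2 / E over all cells
t = sum((o - e) ** 2 / e
        for orow, erow in zip(observed, expected)
        for o, e in zip(orow, erow))

# With 2 degrees of freedom, P(T > t) = exp(-t/2), matching CHISQ.DIST.RT(t, 2)
p_value = math.exp(-t / 2)
```

This gives the same expected frequencies (e.g. 16.32 in the first cell) and the same p-value of about 0.73 as the spreadsheet calculation.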
7.2 Goodness of Fit Tests
7.2.1 Goodness of Fit Tests – without estimation of parameters
We wish to test the hypothesis H0: data come from a precisely defined distribution, against the
alternative HA: the data come from another distribution. When a variable is continuous, we first
have to group the observations into categories (as when drawing a histogram). The first group
contains all the observations below a certain value, the last group contains all the observations
above another value, and the remaining groups contain the observations from given intervals.
For example, we may group the observations of height from the file lista1.xls in the following way
Height      ≤150   (150,160]   (160,170]   (170,180]   >180
Frequency   12     22          23          27          16
We have 100 observations (100=12+22+23+27+16). We test the hypothesis that these observations
come from a normal distribution with mean 165 and standard deviation 17. It should be observed
that this distribution is precisely defined (i.e. both parameters are given). Such a hypothesis is called
simple.
First, we calculate the probability that an observation belongs to a given group according to the null
hypothesis and place these probabilities under the frequency table.
For the first group, this probability is given by
P(X < 150) = NORMDIST(150,165,17,1).
For the last group, this probability is given by
P(X > 180) = 1 - P(X < 180) = 1 - NORMDIST(180,165,17,1).
For the remaining groups, this probability is given by
P(a < X < b) = P(X < b) - P(X < a) = NORMDIST(b,165,17,1) - NORMDIST(a,165,17,1),
where a is the lower end point of the interval and b is the upper end point.
Note: These probabilities must sum to 1.
In this way, we obtain
Frequency     12         22         23         27         16
Probability   0.188793   0.195541   0.231332   0.195541   0.188793
Now we calculate the expected frequencies for each group. If we have n observations and the
probability of being in group i is pi, then we expect npi observations in group i. Thus, here we must
multiply these probabilities by 100, obtaining:
Frequency            12           22           23           27           16
Probability          0.18879299   0.19554102   0.23133199   0.19554102   0.18879299
Expected Frequency   18.879299    19.554102    23.133199    19.554102    18.879299
Obviously, we may combine these two steps.
As in the test of independence, we calculate the relative square deviations from the expected
frequencies (the square deviations divided by the expected frequency). We obtain:
Frequency (O)            12          22          23          27          16
Probability              0.188793    0.195541    0.231332    0.195541    0.188793
Expected Frequency (E)   18.879299   19.554102   23.133199   19.554102   18.879299
(O-E)^2/E                2.5067      0.30594     0.00077     2.83528     0.43912
The realisation of the test statistic is equal to the sum of the entries in the bottom row. Thus, we
have t=6.08782. This statistic has approximately a chi-squared distribution with k-1 degrees of
freedom, where k is the number of groups (here 5). We calculate the p-value using the function
CHISQ.DIST.RT(t, k-1). In this case, the p-value is
p = CHISQ.DIST.RT(6.08782, 4)=0.1927.
Since p >0.05, we may assume that these data come from a normal distribution with mean 165 and
standard deviation 17.
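As a cross-check, the same goodness-of-fit calculation can be sketched in Python (standard library only; the helper norm_cdf plays the role of NORMDIST, and for 4 degrees of freedom the chi-squared survival function has the closed form exp(-t/2)(1+t/2)):

```python
import math

def norm_cdf(x, mean, sd):
    # Normal CDF via the error function, i.e. NORMDIST(x, mean, sd, 1)
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

observed = [12, 22, 23, 27, 16]
bounds = [150, 160, 170, 180]          # interval end points
mean, sd = 165, 17                     # the precisely defined (simple) hypothesis
n = sum(observed)

# Group probabilities: P(X < 150), the interval probabilities, P(X > 180)
cdfs = [norm_cdf(b, mean, sd) for b in bounds]
probs = [cdfs[0]] + [b - a for a, b in zip(cdfs, cdfs[1:])] + [1 - cdfs[-1]]

expected = [n * p for p in probs]
t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# k - 1 = 4 degrees of freedom: P(T > t) = exp(-t/2) * (1 + t/2),
# matching CHISQ.DIST.RT(t, 4)
p_value = math.exp(-t / 2) * (1 + t / 2)
```

The probabilities sum to 1, as they must, and the statistic and p-value agree with the Excel values (about 6.088 and 0.193).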
7.2.2 Goodness of Fit Tests – with parameter estimation
Assume that we wish to test the null hypothesis H0: the data come from a normal distribution
against the alternative HA: the data do not come from a normal distribution. In this case, the null
hypothesis does not precisely define the distribution (its parameters are not given). Such a
hypothesis is called a composite hypothesis. First, we must estimate the parameters of the
distribution: for the normal distribution, the expected value (using the sample mean) and the
standard deviation (using the sample standard deviation). Using the raw data, the sample mean is
average(B2:B101)=166.59 and the sample standard deviation is stdev(B2:B101)=13.90886.
Now we repeat the calculations from the previous example, but now we assume that height has a
normal distribution with mean 166.59 and standard deviation 13.90886. Thus the probability that an
observation belongs to the first group is given by P(X<150)=normdist(150,166.59,13.90886,1).
In this way, we obtain the following table
Frequency            12         22         23         27         16
Probability          0.116481   0.201341   0.279015   0.235674   0.167490
Expected Frequency   11.6481    20.1341    27.9015    23.5674    16.7490
(O-E)^2/E            0.010633   0.17292    0.86105    0.49996    0.0335
The realisation of the test statistic is the sum of the values in the final row, 1.578. This statistic has
approximately a chi-squared distribution with k-m-1 degrees of freedom, where m is the number of
parameters estimated. In this case, k=5 and m=2, thus there are 2 degrees of freedom. The p-value is
given by CHISQ.DIST.RT(1.578,2)=0.4543. Since p >0.05, we do not have any evidence against
H0. Thus we may assume that the observations come from a normal distribution.
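The same cross-check works for the composite hypothesis. The Python sketch below (standard library only) uses the parameter estimates from above and k-m-1 = 2 degrees of freedom:

```python
import math

def norm_cdf(x, mean, sd):
    # NORMDIST(x, mean, sd, 1) via the error function
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

observed = [12, 22, 23, 27, 16]
bounds = [150, 160, 170, 180]
mean, sd = 166.59, 13.90886            # estimated from the raw data
n = sum(observed)

cdfs = [norm_cdf(b, mean, sd) for b in bounds]
probs = [cdfs[0]] + [b - a for a, b in zip(cdfs, cdfs[1:])] + [1 - cdfs[-1]]
expected = [n * p for p in probs]
t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# k - m - 1 = 5 - 2 - 1 = 2 degrees of freedom: P(T > t) = exp(-t/2)
p_value = math.exp(-t / 2)
```

This reproduces the statistic 1.578 and the p-value 0.4543 obtained in the spreadsheet.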
Note: This approximation to the chi-squared distribution is accurate when each of the expected
frequencies is greater than 5. When this is not the case, we should combine groups
(obviously, in this case the frequency and expected frequency corresponding to a group formed by
such a combination are equal to the sums of the frequencies and expected frequencies in the
constituent groups).
We use a similar procedure to test whether the data come from a specified discrete distribution (or
family of discrete distributions). In this case, we only have to group data when a) the expected
frequency of a given observation is less than 5 (see above), or b) the null hypothesis assumes that
the data come from a distribution with a large or infinite support (set of possible values), e.g. the
Poisson distribution. In the latter case, we denote the largest observation by xmax. The probability
that an observation belongs to the last group should be defined as P(X ≥ xmax). We use a similar
approach when there is a positive probability of observing a smaller value than the smallest
actually observed in the sample.
Now we test the hypothesis that the following data come from a Poisson distribution:
Observation   0     1     2     3    4    5
Frequency     359   370   185   68   17   1
The number of observations is 359+370+185+68+17+1=1000.
Since the null hypothesis does not completely specify the distribution (the hypothesis is composite),
we have to estimate the parameter from the mean of the data. First, we calculate the product of each
observed value and its frequency. We obtain
Observation   0     1     2     3     4    5
Frequency     359   370   185   68    17   1
Product       0     370   370   204   68   5
The sum of these products is equal to the sum of all the observations (1017). Hence, the estimator of
the parameter of the distribution is 1017/1000=1.017.
Now we calculate the expected frequencies. It should be noted that the final group should contain
all the values ≥ 5, since the value taken by a Poisson random variable is unbounded. The expected
frequency of the value k is given by 1000*POISSON(k,1.017,0). Since the expected frequencies
must sum to 1000, the expected frequency of observations ≥ 5 is equal to 1000 minus the sum of
the remaining expected frequencies. In this way, we obtain the following table of frequencies and
expected frequencies:
Observation          0         1         2         3        4        ≥5
Frequency            359       370       185       68       17       1
Expected Frequency   361.678   367.827   187.040   63.407   16.121   3.927
Since the expected frequency of observations in the final group is less than 5, we combine the final
two groups and then calculate the relative squared deviation of the observed frequencies from the
expected frequencies. We obtain
Observation          0         1         2         3        ≥4
Frequency            359       370       185       68       18
Expected Frequency   361.678   367.827   187.040   63.407   20.048
(O-E)^2/E            0.020     0.013     0.022     0.333    0.209
The realisation of the test statistic is the sum of the values in the final row, 0.59695. The number of
degrees of freedom is k-m-1, where k, the number of groups, is 5, and m, the number of parameters
estimated, is 1. Hence, there are 3 degrees of freedom. We calculate the p-value for this test:
p = CHISQ.DIST.RT(0.59695,3)=0.8971. Since this value is greater than 0.05, we may
assume that the data come from the Poisson distribution.
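The whole procedure, including the estimation of the parameter and the merging of groups, can be sketched in Python (standard library only; for 3, i.e. an odd number of, degrees of freedom the chi-squared survival function can be written in terms of the complementary error function):

```python
import math

values = [0, 1, 2, 3, 4, 5]
freqs = [359, 370, 185, 68, 17, 1]
n = sum(freqs)

# Estimate the Poisson parameter by the sample mean: 1017/1000 = 1.017
lam = sum(v * f for v, f in zip(values, freqs)) / n

# Expected frequencies: n * P(X = k) for k = 0..4, remainder for the tail >= 5
pmf = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(5)]
expected = [n * p for p in pmf] + [n - sum(n * p for p in pmf)]
observed = freqs[:]

# The tail's expected frequency (about 3.9) is below 5, so merge the last two groups
observed = observed[:4] + [observed[4] + observed[5]]
expected = expected[:4] + [expected[4] + expected[5]]

t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Chi-squared survival function with 3 degrees of freedom:
# P(T > t) = erfc(sqrt(t/2)) + sqrt(2t/pi) * exp(-t/2)
p_value = math.erfc(math.sqrt(t / 2)) + math.sqrt(2 * t / math.pi) * math.exp(-t / 2)
```

This reproduces the estimate 1.017, the statistic of about 0.597, and the p-value 0.8971.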
Final note: Tests of independence and goodness of fit (with a SIMPLE, not composite, hypothesis)
may be carried out using the command CHITEST(Range 1, Range 2), where Range 1 defines where
the table of observed frequencies lies and Range 2 defines where the table of expected frequencies lies.
These ranges must be of the same dimensions (i.e. contain the same number of rows and columns).
When there is only one row or one column, then Excel carries out the appropriate goodness of fit
test with a simple hypothesis. When the number of rows and the number of columns is greater than
1, then Excel carries out a test of independence. This command gives the appropriate p-value.
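For reference, all the p-values in this chapter can be reproduced with a single helper. The sketch below (a standard-library Python implementation, not an Excel feature) computes the chi-squared survival function, i.e. the same quantity as CHISQ.DIST.RT(t, k), via the power series for the regularized lower incomplete gamma function:

```python
import math

def chi2_sf(t, df):
    """P(T > t) for a chi-squared variable T with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma
    function P(s, x) with s = df/2, x = t/2; valid for t >= 0, df >= 1.
    """
    if t <= 0:
        return 1.0
    s, x = df / 2, t / 2
    term = 1.0 / s          # first series term: x^0 / s
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= x / (s + n)  # next term: multiply by x / (s + n)
        total += term
    cdf = total * math.exp(-x + s * math.log(x)) / math.gamma(s)
    return 1.0 - cdf
```

For example, chi2_sf(0.628885, 2), chi2_sf(6.08782, 4), chi2_sf(1.578, 2) and chi2_sf(0.59695, 3) agree with the Excel values 0.7302, 0.1927, 0.4543 and 0.8971 obtained above.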
Linear Regression
Suppose that we have two continuous variables: X and Y. Let X be the explanatory (independent)
variable and Y the dependent variable. We wish to describe how Y "depends" on X by means of the
linear equation
Y = β0 + β1X + ε,
where we assume that the random errors ε have a normal distribution with mean 0. In addition, we
assume that the residuals are independent. β0 is called the intercept, and β1 is the coefficient of the
independent variable (the slope of the regression equation).
First, we should check whether the association between the variables is really linear. We do
this using a scatter plot. For example, we want to derive a model for weight (in kg) as a linear
function of height (in cm) (data from lista1.xls). We highlight the columns that contain height (X)
and weight (Y). The independent variable (X) should be on the left. We choose graphs from the
insert menu and then click on scatter plots. We obtain the following graph:
[Scatter plot of weight (kg, vertical axis 0–120) against height (cm, horizontal axis 130–200)]
It may be assumed that these points form a cloud around a line rather than a curve. Hence, linear
regression is appropriate.
In order to carry out linear regression, choose “Data Analysis” from the “Data” menu.
Note: If the "Data Analysis" option is unavailable, it is necessary to install it (choose "Options" from
the "File" menu, then "Add-ins", and select "Analysis ToolPak").
From the "Data Analysis" menu we choose "Regression". We input the range of the dependent
variable (weight), here c2:c101, as well as the range of the independent variable X (height), here
b2:b101. The most important tables are the first, which contains summary statistics and the final
table, which contains the coefficients of the regression model β0, β1. We obtain
Regression Statistics
Multiple R          0.888203
R square            0.788904
Adjusted R square   0.786750
Standard Error      5.566871
Observations        100
R2 is the coefficient of determination. Here R2 = 0.788904. This is the proportion of the variance of
the dependent variable Y explained by the independent variable. Hence, height explains around
80% of the variance in weight.
The standard error is the standard deviation of the error terms (residuals). This can be interpreted as
a measure of the expected absolute error when we estimate a person's weight based on the
regression equation.
            Coefficients   Std error   t Stat      p-value
Intercept   -63.763800     6.724257    -9.482650   1.61×10^-15
Var X1      0.769817       0.040226    19.137510   7.16×10^-35
We can read the parameters of the regression equation from the first column, β0 = -63.7638,
β1 = 0.769817. Hence, our model is of the form
Y (weight) = -63.7638 + 0.769817X (height) + ε.
The coefficient β1 describes by how much the mean weight changes (in kg) when height increases
by one unit (cm), i.e. when height increases by one centimetre, weight increases on average by
0.769817 kg.
The p-value in the second row refers to a test of the hypothesis H0: β1=0 against the alternative H1:
β1≠ 0. The null hypothesis states that there is no (linear) association between weight and height.
Since the p-value is very small (in particular p<0.001), we have very strong evidence that there
exists an association between weight and height. This association is positive, since the regression
coefficient is greater than zero (as X increases, Y also increases on average).
We can use this model to estimate the mean weight of people of a given height (by setting ε=0), e.g.
the mean weight of people of height 170cm is given by
Y = -63.7638 + 0.769817×170 ≈ 67.1kg.
Thus the mean weight of people of height 170cm is estimated to be 67.1kg.
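The prediction step is easy to reproduce. A small Python sketch, taking the fitted coefficients from the regression output above as given:

```python
# Coefficients read off the Excel regression output above
beta0, beta1 = -63.7638, 0.769817

def predict_mean_weight(height_cm):
    # Point estimate of mean weight (kg): set the error term to zero
    return beta0 + beta1 * height_cm

estimate = predict_mean_weight(170)   # mean weight of people of height 170 cm
```

The result is about 67.1 kg, as calculated above.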
Exponential Regression
Many economic time series are characterised by exponential or near exponential growth (e.g. prices,
gross domestic product, population size). In such cases, we have a model
Y = αe^(βt),
where t denotes time and 100β is (approximately) the mean percentage growth per unit time. Taking
logarithms, we obtain
Z = ln(Y)=ln(α)+βt.
Hence, there exists a linear dependence between Z = ln(Y) and time t. We can carry out a regression
of Z with respect to t, and then transform the equation to express Y (= e^Z) as a function of time.
Example (population of Nigeria)
The graph and data below illustrate the growth of the population of Nigeria (year – number of years
after 1950)
Year   Pop. (thou.)   ln(pop.)
0      37860          17.44941
5      41122          17.53205
10     45212          17.62687
15     50239          17.73230
20     56132          17.84322
25     63566          17.96759
30     73698          18.11549
35     83902          18.24516
40     95617          18.37586
45     108425         18.50157
50     122877         18.62669
55     139586         18.75419
60     159708         18.88886
[Scatter plot of population (thousands, vertical axis 0–180000) against years after 1950 (horizontal axis 0–70)]
It can be seen from the scatter plot that the population does not increase linearly (the graph is
convex). Assuming that population growth is exponential, we have
P = αe^(βt),
where P is the population. We calculate logarithms of the population size (see the table above), then
carry out a regression of ln(Pop) with respect to time. We obtain the following table with the
regression coefficients:
            Coefficients   Std error   t Stat        P-value
Intercept   17.389260      0.014568    1193.699000   1.79×10^-29
Var X1      0.024612       0.000412    59.734330     3.58×10^-15
Hence, we obtain the model ln(P)=17.38926+0.024612t.
Taking exponentials, we obtain
P = e^(17.38926+0.024612t) = e^17.38926 · e^0.024612t = 35 650 060 e^0.024612t.
The coefficient of time in the exponent gives the mean growth rate of the population, i.e. 2.46% per
annum.
Using this equation, we can estimate the population in 2015 (i.e. t = 65):
P = 35 650 060 e^(0.024612×65) ≈ 176 542 441.
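The whole fit can be reproduced from the table above. The following Python sketch (standard library only) applies the ordinary least-squares formulas to ln(population) and time, then predicts the population in 2015:

```python
import math

# Nigeria population data from the table above (population in thousands)
years = list(range(0, 65, 5))   # years after 1950
pop_thousands = [37860, 41122, 45212, 50239, 56132, 63566, 73698,
                 83902, 95617, 108425, 122877, 139586, 159708]

# Regress z = ln(population) on time t using the least-squares formulas
z = [math.log(p * 1000) for p in pop_thousands]   # population in individuals
n = len(years)
t_mean = sum(years) / n
z_mean = sum(z) / n
beta = sum((t - t_mean) * (zi - z_mean) for t, zi in zip(years, z)) \
       / sum((t - t_mean) ** 2 for t in years)
ln_alpha = z_mean - beta * t_mean

# Predicted population in 2015 (t = 65)
pred_2015 = math.exp(ln_alpha + beta * 65)
```

This recovers the coefficients 17.38926 and 0.024612 from the regression output and a 2015 prediction of about 176.5 million.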
Note: Analysis of time series using regression analysis has its problems. Regression makes the
assumption that the error terms are uncorrelated. On the other hand, if the population is relatively
large in one year (according to the model), then we expect that it will be relatively large in the
following year (i.e. the error terms are correlated). For this reason, in order to predict future values
of a variable, it is better to use methods which are adapted to time series (time series analysis will
be discussed in the “Econometrics” course).