Statistics in Excel, Chapter 3

7. Nonparametric Tests

7.1 Tests of Independence for Two Categorical Variables

The following table illustrates the relation between the categorical traits "sex" and "department". We wish to choose between the null hypothesis H0: there is no association between sex and the department in which a student studies, and the alternative HA: there exists an association between sex and department. The data are given by the following contingency table:

         Computer science  Management  Mathematics
Females  17                14          17
Males    17                19          16

We enter these frequencies into cells, e.g. A1:C2, to obtain

    A   B   C
1   17  14  17
2   17  19  16

First, we calculate the row and column sums. The sum of the row sums (or, equivalently, the sum of the column sums) is the sample size. We obtain

    A   B   C   D
1   17  14  17  48
2   17  19  16  52
3   34  33  33  100 (n)

Now we calculate the table of expected frequencies under the hypothesis that department is independent of sex. We place this table in cells, e.g. A5:C6. The expected frequency in a cell is equal to the product of the corresponding row sum and column sum, divided by the total number of observations, i.e.

    A         B         C
5   D1*A3/D3  D1*B3/D3  D1*C3/D3
6   D2*A3/D3  D2*B3/D3  D2*C3/D3

We could calculate these expected frequencies one by one. However, note that in the first factor (the row sum) the column is fixed while the row changes, in the second factor (the column sum) the row is fixed while the column changes, and the denominator is completely fixed. Hence, we may enter the formula A5=$D1*A$3/$D$3 and copy it into the remaining cells. We thus obtain the following table of expected frequencies:

    A      B      C
5   16.32  15.84  15.84
6   17.68  17.16  17.16

Now, for each cell of the table, we calculate the relative square deviation of the observed frequency from the expected frequency (the square of their difference divided by the expected frequency). We write these values into cells, e.g. A8:C9. For cell A8, this value is given by A8=(A1-A5)^2/A5.
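The same calculation can be checked outside Excel. The following Python sketch (an illustrative check, not part of the original Excel exercise) computes the expected frequencies and the sum of the relative square deviations directly from the observed table:

```python
# Observed contingency table: rows = sex (females, males), columns = departments
obs = [[17, 14, 17],
       [17, 19, 16]]

row_sums = [sum(row) for row in obs]         # 48, 52
col_sums = [sum(col) for col in zip(*obs)]   # 34, 33, 33
n = sum(row_sums)                            # 100

# Expected frequency in each cell = (row sum * column sum) / n
exp = [[r * c / n for c in col_sums] for r in row_sums]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells
t = sum((o - e) ** 2 / e
        for orow, erow in zip(obs, exp)
        for o, e in zip(orow, erow))
print(round(t, 6))  # 0.628885
```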
We can copy this formula into the remaining cells. We obtain

    A         B         C
8   0.028333  0.213737  0.084949
9   0.026155  0.197296  0.078415

The realisation of the test statistic for this test of association is given by the sum of the elements of this third table. Summing these values, =SUM(A8:C9), we obtain t = 0.628885. Under the null hypothesis, this test statistic has approximately a chi-squared distribution with (r-1)(c-1) degrees of freedom, where r is the number of rows and c the number of columns. Here, there are two degrees of freedom. Large values of the realisation of the test statistic indicate that the null hypothesis is incorrect. The p-value is given by P(T > t), where T has the appropriate chi-squared distribution (here, with two degrees of freedom). We calculate P(T > t) using the function CHISQ.DIST.RT(t, k) [in older versions of Excel, CHIDIST(t, k)], where k denotes the number of degrees of freedom. Hence, the p-value for this test is given by CHISQ.DIST.RT(0.628885, 2) = 0.7302. Since this value is greater than 0.05, we do not have any evidence against H0, i.e. we may assume that the choice of department does not depend on sex.

7.2 Goodness of Fit Tests

7.2.1 Goodness of Fit Tests – without parameter estimation

We wish to test the hypothesis H0: the data come from a precisely defined distribution, against the alternative HA: the data come from some other distribution. When a variable is continuous, we first have to group the observations into categories (as when drawing a histogram). The first group contains all the observations below a certain value and the last group contains all the values above another value; the remaining groups contain the observations from given intervals. For example, we may group the observations of height from the file lista1.xls in the following way:

Height     ≤150  (150,160]  (160,170]  (170,180]  >180
Frequency  12    22         23         27         16

We have 100 observations (100 = 12+22+23+27+16).
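For two degrees of freedom, the chi-squared tail probability has the simple closed form P(T > t) = e^(-t/2), so the p-value returned by CHISQ.DIST.RT can be verified by hand. A minimal Python check (illustrative, not part of the exercise):

```python
import math

t = 0.628885          # realisation of the test statistic from the table above
# For df = 2 the chi-squared survival function is exactly exp(-t/2)
p = math.exp(-t / 2)
print(round(p, 4))    # 0.7302
```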
We test the hypothesis that these observations come from a normal distribution with mean 165 and standard deviation 17. It should be observed that this distribution is precisely defined (both parameters are given). Such a hypothesis is called simple. First, we calculate the probability that an observation belongs to a given group according to the null hypothesis and place these probabilities under the frequency table. For the first group, this probability is given by
P(X<150) = NORMDIST(150,165,17,1).
For the last group, this probability is given by
P(X>180) = 1-P(X<180) = 1-NORMDIST(180,165,17,1).
For the remaining groups, this probability is given by
P(X<b)-P(X<a) = NORMDIST(b,165,17,1)-NORMDIST(a,165,17,1),
where a is the lower end point of the interval and b is the upper end point. Note: these probabilities must sum to 1. In this way, we obtain

Frequency    12        22        23        27        16
Probability  0.188793  0.195541  0.231332  0.195541  0.188793

Now we calculate the expected frequencies for each group. If we have n observations and the probability of being in group i is pi, then we expect npi observations in group i. Thus, here we must multiply these probabilities by 100 (obviously, we may combine these two steps). As in the test of independence, we then calculate the relative square deviations of the observed frequencies from the expected frequencies (the square deviations divided by the expected frequency). We obtain:

Frequency (O)           12        22        23        27        16
Probability             0.188793  0.195541  0.231332  0.195541  0.188793
Expected Frequency (E)  18.87930  19.55410  23.13320  19.55410  18.87930
(O-E)^2/E               2.50670   0.30594   0.00077   2.83528   0.43912

The realisation of the test statistic is equal to the sum of the entries in the bottom row. Thus, we have t = 6.08782.
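The steps above can be reproduced in Python using the normal CDF, which the standard library expresses through the error function. This is an illustrative sketch of the same calculation, not part of the Excel procedure:

```python
import math

def norm_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma); equivalent of Excel's NORMDIST(x, mu, sigma, 1)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

bounds = [150, 160, 170, 180]       # cut points of the height groups
observed = [12, 22, 23, 27, 16]
mu, sigma, n = 165, 17, 100

# Group probabilities under H0: first group (-inf, 150], last group (180, inf)
cdfs = [norm_cdf(b, mu, sigma) for b in bounds]
probs = [cdfs[0]] + [b - a for a, b in zip(cdfs, cdfs[1:])] + [1 - cdfs[-1]]

expected = [n * p for p in probs]
t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(t, 4))  # approx. 6.0878
```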
This statistic has approximately a chi-squared distribution with k-1 degrees of freedom, where k is the number of groups (here 5). We calculate the p-value using the function CHISQ.DIST.RT(t, k-1). In this case, the p-value is p = CHISQ.DIST.RT(6.08782, 4) = 0.1927. Since p > 0.05, we may assume that these data come from a normal distribution with mean 165 and standard deviation 17.

7.2.2 Goodness of Fit Tests – with parameter estimation

Assume that we wish to test the null hypothesis H0: the data come from a normal distribution, against the alternative HA: the data do not come from a normal distribution. In this case, the null hypothesis does not precisely define the distribution (its parameters are not given). Such a hypothesis is called composite. First, we must estimate the parameters of the distribution. In the case of the normal distribution, these are the expected value (estimated by the sample mean) and the standard deviation (estimated by the sample standard deviation). Using the raw data, the sample mean is AVERAGE(B2:B101) = 166.59 and the sample standard deviation is STDEV(B2:B101) = 13.90886. Now we repeat the calculations from the previous example, but assume that height has a normal distribution with mean 166.59 and standard deviation 13.90886. Thus the probability that an observation belongs to the first group is given by P(X<150) = NORMDIST(150,166.59,13.90886,1). In this way, we obtain the following table:

Frequency (O)           12        22        23        27        16
Probability             0.116481  0.201341  0.279015  0.235674  0.167490
Expected Frequency (E)  11.6481   20.1341   27.9015   23.5674   16.7490
(O-E)^2/E               0.010633  0.17292   0.86105   0.49996   0.0335

The realisation of the test statistic is the sum of the values in the final row, 1.578. This statistic has approximately a chi-squared distribution with k-m-1 degrees of freedom, where m is the number of parameters estimated. In this case, k=5 and m=2, thus there are 2 degrees of freedom.
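The composite version of the test differs only in that the estimated parameters are plugged in before the group probabilities are computed, and the degrees of freedom are reduced by the number of estimated parameters. A Python sketch of this variant (illustrative; the estimated mean and standard deviation are taken from the text):

```python
import math

def norm_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

bounds = [150, 160, 170, 180]
observed = [12, 22, 23, 27, 16]
n = 100

# Parameters estimated from the raw data (values from the text)
mu_hat, sigma_hat = 166.59, 13.90886

cdfs = [norm_cdf(b, mu_hat, sigma_hat) for b in bounds]
probs = [cdfs[0]] + [b - a for a, b in zip(cdfs, cdfs[1:])] + [1 - cdfs[-1]]
expected = [n * p for p in probs]
t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = k - m - 1 = 5 - 2 - 1 = 2; for df = 2 the chi-squared tail is exp(-t/2)
p_value = math.exp(-t / 2)
print(round(t, 3), round(p_value, 4))  # approx. 1.578 and 0.4543
```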
The p-value is given by CHISQ.DIST.RT(1.578, 2) = 0.4543. Since p > 0.05, we do not have any evidence against H0. Thus we may assume that the observations come from a normal distribution.

Note: This approximation by the chi-squared distribution is accurate when each of the expected frequencies is greater than 5. When this is not the case, we should combine groups (obviously, the frequency and expected frequency corresponding to a group formed by such a combination are equal to the sums of the frequencies and expected frequencies of the constituent groups).

We use a similar procedure to test whether the data come from a specified discrete distribution (or family of discrete distributions). In this case, we only have to group data when
a) the expected frequency of a given value is less than 5 (see above), or
b) the null hypothesis assumes that the data come from a distribution with a large or infinite support (set of possible values), e.g. the Poisson distribution. In this case, we denote the largest observation by xmax; the probability that an observation belongs to the last group should be defined as P(X ≥ xmax). We use a similar approach when there is a positive probability of observing a value smaller than the smallest actually observed in the sample.

Now we test the hypothesis that the following data come from a Poisson distribution:

Observation  0    1    2    3   4   5
Frequency    359  370  185  68  17  1

The number of observations is 359+370+185+68+17+1 = 1000. Since the null hypothesis does not completely specify the distribution (the hypothesis is composite), we have to estimate the parameter of the Poisson distribution from the mean of the data. First, we calculate the products of the values of the observations and their frequencies. We obtain

Observation  0    1    2    3    4   5
Frequency    359  370  185  68   17  1
Product      0    370  370  204  68  5

The sum of these products is equal to the sum of all the observations (1017). Hence, the estimator of the parameter of the distribution is 1017/1000 = 1.017.
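The parameter estimate above is simply the sample mean computed from the frequency table, which can be checked in a couple of lines of Python (illustrative):

```python
values = [0, 1, 2, 3, 4, 5]
freqs  = [359, 370, 185, 68, 17, 1]

n = sum(freqs)                                     # 1000 observations
total = sum(v * f for v, f in zip(values, freqs))  # sum of all observations
lam_hat = total / n                                # sample mean = estimator of lambda
print(total, lam_hat)  # 1017 1.017
```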
Now we calculate the expected frequencies. It should be noted that the final group should contain all the values ≥ 5, since a Poisson random variable is unbounded. The expected frequency of the value k is given by 1000*POISSON(k,1.017,0). Since the expected frequencies must sum to 1000, the expected frequency of observations ≥ 5 is equal to 1000 minus the sum of the remaining expected frequencies. In this way, we obtain the following table of frequencies and expected frequencies:

Observation         0        1        2        3       4       ≥5
Frequency           359      370      185      68      17      1
Expected Frequency  361.678  367.827  187.040  63.407  16.121  3.927

Since the expected frequency of observations in the final group is less than 5, we combine the final two groups and then calculate the relative squared deviations of the observed frequencies from the expected frequencies. We obtain

Observation         0        1        2        3       ≥4
Frequency           359      370      185      68      18
Expected Frequency  361.678  367.827  187.040  63.407  20.048
(O-E)^2/E           0.020    0.013    0.022    0.333   0.209

The realisation of the test statistic is the sum of the values in the final row, 0.59695. The number of degrees of freedom is k-m-1, where k, the number of groups (after combination), is 5 and m, the number of parameters estimated, is 1. Hence, there are 3 degrees of freedom. We calculate the p-value using p = CHISQ.DIST.RT(0.59695, 3) = 0.8971. Since this value is greater than 0.05, we may assume that the data come from a Poisson distribution.

Final note: Tests of independence and goodness of fit (with a SIMPLE, not composite, hypothesis) may be carried out using the command CHITEST(Range 1, Range 2), where Range 1 defines where the table of observed frequencies lies and Range 2 defines where the table of expected frequencies lies. These ranges must be of the same dimensions (i.e. contain the same number of rows and columns).
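The whole Poisson goodness-of-fit calculation, including the merging of the small final group, can be sketched in Python as follows (illustrative; for 3 degrees of freedom the chi-squared tail probability has a closed form in terms of the complementary error function):

```python
import math

freqs = [359, 370, 185, 68, 17, 1]   # observed counts of the values 0..4 and >= 5
n = sum(freqs)                        # 1000
lam = 1.017                           # parameter estimated from the sample mean

# Expected frequencies: n * P(X = k) for k = 0..4, remainder assigned to the tail >= 5
pmf = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(5)]
expected = [n * p for p in pmf]
expected.append(n - sum(expected))    # tail >= 5, approx. 3.927

# The tail's expected frequency is below 5, so merge the last two groups (>= 4)
observed = freqs[:4] + [freqs[4] + freqs[5]]
expected = expected[:4] + [expected[4] + expected[5]]

t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = k - m - 1 = 5 - 1 - 1 = 3; closed-form chi-squared tail for df = 3
p_value = math.erfc(math.sqrt(t / 2)) + math.sqrt(2 * t / math.pi) * math.exp(-t / 2)
print(round(t, 3), round(p_value, 4))  # approx. 0.597 and 0.8971
```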
When there is only one row or one column, Excel carries out the appropriate goodness of fit test with a simple hypothesis. When the number of rows and the number of columns are both greater than 1, Excel carries out a test of independence. This command returns the appropriate p-value.

Linear Regression

Suppose that we have two continuous variables, X and Y. Let X be the explanatory (independent) variable and Y the dependent variable. We wish to describe how Y "depends" on X by means of the linear equation Y = β0 + β1X + ε, where we assume that the random errors ε have a normal distribution with mean 0. In addition, we assume that the errors are independent. β0 is called the intercept, and β1 is the coefficient of the independent variable (the slope of the regression equation). First, we should check whether the association between the variables is really linear. We do this using a scatter plot. For example, we want to derive a model for weight (in kg) as a linear function of height (in cm) (data from lista1.xls). We highlight the columns that contain height (X) and weight (Y). The independent variable (X) should be on the left. We choose graphs from the Insert menu and then click on scatter plots. We obtain the following graph:

[Scatter plot: weight (kg, vertical axis 0–120) against height (cm, horizontal axis 130–200).]

It may be assumed that these points form a cloud around a line rather than a curve. Hence, linear regression is appropriate. In order to carry out linear regression, choose "Data Analysis" from the "Data" menu. Note: If the "Data Analysis" option is unavailable, it is necessary to install it (choose "Options" from the "File" menu, then "Add-ins", and select "Analysis ToolPak"). From the "Data Analysis" menu we choose "Regression". We input the range of the dependent variable Y (weight), here C2:C101, as well as the range of the independent variable X (height), here B2:B101.
The most important tables in the output are the first, which contains summary statistics, and the final table, which contains the coefficients of the regression model, β0 and β1. We obtain

Regression Statistics
Multiple R         0.888203
R Square           0.788904
Adjusted R Square  0.786750
Standard Error     5.566871
Observations       100

R² is the coefficient of determination. Here R² = 0.788904. This is the proportion of the variance of the dependent variable Y explained by the independent variable. Hence, height explains around 79% of the variance in weight. The standard error is the standard deviation of the error terms (residuals). It can be interpreted as a measure of the expected absolute error when we estimate a person's weight using the regression equation.

           Coefficients  Std error  t Stat      p-value
Intercept  -63.763800    6.724257   -9.482650   1.61×10^-15
Var X1     0.769817      0.040226   19.137510   7.16×10^-35

We read the parameters of the regression equation from the first column: β0 = -63.7638, β1 = 0.769817. Hence, our model is of the form
Y (weight) = -63.7638 + 0.769817X (height) + ε.
The coefficient β1 describes by how much the mean weight changes (in kg) when height increases by one unit (cm), i.e. when height increases by one centimetre, weight increases on average by 0.769817 kg. The p-value in the second row refers to a test of the hypothesis H0: β1 = 0 against the alternative H1: β1 ≠ 0. The null hypothesis states that there is no (linear) association between weight and height. Since the p-value is very small (in particular, p < 0.001), we have very strong evidence that there exists an association between weight and height. This association is positive, since the regression coefficient is greater than zero (as X increases, Y also increases on average). We can use this model to estimate the mean weight of people of a given height (by setting ε = 0), e.g. the mean weight of people of height 170 cm is given by Y = -63.7638 + 0.769817×170 ≈ 67.1 kg.
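Evaluating the fitted equation is a one-line calculation; the following sketch (using the coefficients reported in the Excel output above) shows the prediction for a height of 170 cm:

```python
# Fitted regression coefficients taken from the Excel output above
beta0, beta1 = -63.7638, 0.769817

def mean_weight(height_cm):
    """Estimated mean weight (kg) for a given height (cm), setting epsilon = 0."""
    return beta0 + beta1 * height_cm

print(round(mean_weight(170), 1))  # 67.1
```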
Thus the mean weight of people of height 170 cm is estimated to be 67.1 kg.

Exponential Regression

Many economic time series are characterised by exponential or near-exponential growth (e.g. prices, gross domestic product, population size). In such cases, we have a model Y = αe^(βt), where t denotes time and β is (approximately) the mean proportional growth per unit time. Taking logarithms, we obtain Z = ln(Y) = ln(α) + βt. Hence, there exists a linear dependence between Z = ln(Y) and time t. We can carry out a regression of Z with respect to t, and then transform the equation to express Y (= e^Z) as a function of time.

Example (population of Nigeria)
The data below illustrate the growth of the population of Nigeria (year = number of years after 1950):

Year  Pop. (thou.)  ln(pop.)
0     37860         17.44941
5     41122         17.53205
10    45212         17.62687
15    50239         17.73230
20    56132         17.84322
25    63566         17.96759
30    73698         18.11549
35    83902         18.24516
40    95617         18.37586
45    108425        18.50157
50    122877        18.62669
55    139586        18.75419
60    159708        18.88886

[Scatter plot: population (thousands, vertical axis 0–180000) against time (years after 1950, horizontal axis 0–70).]

It can be seen from the scatter plot that the population does not increase linearly (the graph is convex). Assuming that population growth is exponential, we have P = αe^(βt), where P is the population. We calculate the logarithms of the population size (see the table above; note that the logarithm is taken of the population in persons, not in thousands), then carry out a regression of ln(Pop) with respect to time. We obtain the following table of regression coefficients:

           Coefficients  Std error  t Stat       p-value
Intercept  17.389260     0.014568   1193.699000  1.79×10^-29
Var X1     0.024612      0.000412   59.734330    3.58×10^-15

Hence, we obtain the model ln(P) = 17.38926 + 0.024612t. Taking exponentials, we obtain
P = e^(17.38926 + 0.024612t) = e^17.38926 × e^(0.024612t) = 35 650 060 e^(0.024612t).
The coefficient of time in the exponent gives the mean growth rate of the population, i.e. 2.46% per annum. Using this equation, we can estimate the population in 2015 (i.e.
t = 65):
P = 35 650 060 e^(0.024612×65) ≈ 176 542 441.

Note: The analysis of time series using regression has its problems. Regression assumes that the error terms are uncorrelated. However, if the population is relatively large in one year (relative to the model), then we expect it to be relatively large in the following year as well (i.e. the error terms are correlated). For this reason, in order to predict future values of a variable, it is better to use methods adapted to time series (time series analysis will be discussed in the "Econometrics" course).
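The whole exponential-regression example can be reproduced with ordinary least squares on the logged data. The following Python sketch (illustrative; it uses the table of Nigerian population figures above, with populations given in thousands) recovers the coefficients and the 2015 forecast:

```python
import math

# Population of Nigeria (thousands), every 5 years from 1950 (t = 0) to 2010 (t = 60)
pop_thou = [37860, 41122, 45212, 50239, 56132, 63566, 73698,
            83902, 95617, 108425, 122877, 139586, 159708]
t = list(range(0, 65, 5))

# Regress z = ln(population in persons) on t by ordinary least squares
z = [math.log(p * 1000) for p in pop_thou]
n = len(t)
t_bar, z_bar = sum(t) / n, sum(z) / n
beta1 = (sum((ti - t_bar) * (zi - z_bar) for ti, zi in zip(t, z))
         / sum((ti - t_bar) ** 2 for ti in t))
beta0 = z_bar - beta1 * t_bar
print(beta0, beta1)  # approx. 17.38926 and 0.024612

# Forecast for 2015 (t = 65): P = exp(beta0 + beta1 * 65)
forecast = math.exp(beta0 + beta1 * 65)
print(forecast)      # approx. 176.5 million
```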