Statistics in Survey Analysis [1]

Ric Coe
ICRAF, Nairobi, Kenya

[1] Modified from input to a course ‘Formal data analysis for bean researchers’ organised by CIAT at CMRT, Egerton University, February 1996. Thanks to Soniia David for permission to quote the example.

Contents

Introduction
Preliminaries
Descriptive Statistics
   1. Summarizing Single Variables
   2. Two variables
Descriptive statistics - common problems
Confirmatory analysis: estimation and hypothesis testing
   The problem
   Estimates, standard errors and confidence intervals
   Hypothesis tests: The logic
   Examples of calculations
   Limitations
   What should you do
Confirmatory Analysis - Regression
   Starting Regression
   Fitting the regression line
   Check the fit
   Interpretation
   Adding more variables - Multiple regression
Interpretation
References

Introduction

This guide summarises the use of simple statistical analyses in the interpretation of survey data. It is aimed at the typical small surveys (up to a few hundred respondents) carried out by researchers looking at the role and uptake of new agricultural technologies.

There are several common problems in the approaches to survey analysis used by many researchers, probably a result of the research methods courses followed during training. One is to concentrate attention on a few well known statistical techniques, such as chi-squared tests in 2-way tables and regression analysis, and to place a naively simplistic reliance on the results. This is the topic of this guide.
A second problem is to treat statistical analysis as a recipe that can be followed to a successful conclusion without much thought or understanding along the way. This is the topic of a companion guide ‘Steps in survey analysis’ (Coe 2002). A third problem is to ignore the context in which the survey was carried out, so ignoring many of the possibilities and limitations of the statistical analysis. This is the topic of the guide ‘Approaches to analysis of survey data’ (SSC, 2001).

Example

The example used in this guide was a survey of farmers in two districts of Uganda. It aimed to characterize the pattern of bean growing and understand the role of new bean varieties in the household economy of new farmers. A few of the stated objectives were:

Overall: Provide a baseline against which to measure adoption and impact of improved bean varieties.

Hypotheses:
1. Adoption.
   a. There is no relationship between adoption of new varieties and wealth.
   b. The rate of adoption for MCM5001 will be higher in Mbale than Mukono, due to strong non-appreciation of small seeded varieties in Mukono.
2. Impact.
   a. Adoption of new varieties will result in an increase in absolute quantities and proportion of beans sold, hence increasing household income from beans.
   b. Adoption of new varieties will not result in increased sales of fresh beans.
   c. Adoption of new varieties will not change the amount of income from beans controlled by women.
   d. ...

The examples are based on a subset of just 50 households from the whole survey of 179. The variables used in the example have been labeled so should be self-explanatory.

In this guide SPSS has been used for the statistical analysis. General points appear in normal text. Computer output and other items relating specifically to the example are boxed.

Preliminaries

Before starting analysis:

1. Make sure you are familiar with the data source and collection methods. For example:
   Was a random sampling scheme used?
   Were individual questionnaires completed during a group meeting?
   Who was the data collected by? Why and when?
2. Clarify objectives. These should have been listed in detail when the survey was planned. If they were not, or have changed, they must be listed now. It is impossible to analyze a survey if you do not know what you are trying to find out.
3. Coding and data entry.
4. Make sure you understand the data. You must understand the exact meaning of every number and code.

Data that needs clarifying:
Variable WIVES (Question 3): Does ‘1’ mean 1 wife or 2 wives? (Conflict between questionnaire and code book.)
Variable ARRANGE (Question 4): Does ‘NA’ mean there are no bean plots or no husband/wife?
Variables OCCUPHDI and OCCUPHD2 (Question 8): Why are two occupations given when the question asks for the main occupation?
Variable KAW94A (Question 21): What is the difference between ‘na’ and ‘No’?
Variable AMKW94A (Question 21): What are the units?

Descriptive Statistics

1. Summarizing Single Variables

Qualitative (“Coded”) variables

Useful summaries are just frequencies and percentages.

MATOKE   Grows matoke

Value Label             Value  Frequency  Percent  Valid Percent  Cum Percent
Yes                       1       42        84.0       84.0          84.0
No                        2        8        16.0       16.0         100.0
                                -------   -------    -------
Total                             50       100.0      100.0

Valid cases 50    Missing cases 0

HHTYPE   Household type

Value Label             Value  Frequency  Percent  Valid Percent  Cum Percent
Male headed one wife      1       27        54.0       54.0          54.0
Male headed more tha      2        4         8.0        8.0          62.0
Female headed absent      3        3         6.0        6.0          68.0
Female headed, no hu      4       13        26.0       26.0          94.0
Single man                5        2         4.0        4.0          98.0
Other                     7        1         2.0        2.0         100.0
                                -------   -------    -------
Total                             50       100.0      100.0

Valid cases 50    Missing cases 0

Note the different emphasis of frequencies and percentages. Frequencies emphasize the sample, percentages emphasize the population. Give total sample size with percentages.
Take care with percentages: make sure you are using an appropriate baseline (what is 100%) and remember that percentages might not have to add to 100, as in the example below. Edit the computer output for presentation!

Crop          % growing
Cassava          100
Beans             98
Matoke            84
Maize             78
Yams              20
Sample size       50

Look carefully at and identify rare cases. Such data points may be errors, or may need special treatment.

What is the 1 “other” household type in question 2? One farmer does not grow beans. Should this case be deleted from all analyses?

Bar charts are most appropriate when the categories can be ordered in some useful way.

Quantitative Variables

In summarizing quantitative variables the most interesting things are:
o Location (What is a typical value?)
o Spread (How much variation is there?)
o Odd values (What is their source and interpretation?)

Location is measured by mean or median (not usefully the mode). Spread is measured by standard deviation or distance between quartiles. Quantities such as the 10% and 90% point are useful in some situations. Use histograms and boxplots.

Amount of beans harvested in 94a
Mean                    15.9
Standard deviation      34.2
Median                   4.0
25% point                0
75% point               14.0
Mean (ignoring 200)     10.1

[Figure: histogram of total beans harvested 94a (Std. Dev = 34.21, Mean = 16.0, N = 47)]

2. Two variables

Two qualitative variables = cross tabulation

Interpretation can be helped by careful layout. Percentages may be calculated of row totals, column totals or overall totals. Not all of them will make sense!
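The three possible baselines can be compared directly. The following is a minimal sketch, not part of the original SPSS analysis: it uses the counts from this section's example cross tabulation of crop earning highest income by household type, and the names `COUNTS`, `COLUMNS` and `percentages` are invented for the illustration.

```python
# Counts from the example cross tabulation: crop earning highest income (rows)
# by household type (columns). Names below are illustrative assumptions.
COLUMNS = ["Male headed", "Female headed", "Single male"]
COUNTS = {
    "Coffee":    [19, 7, 1],
    "Groundnut": [2, 4, 0],
    "Bogoya":    [1, 3, 0],
    "Cassava":   [1, 0, 1],
    "Matoke":    [2, 0, 0],
    "Beans":     [1, 0, 0],
    "Other":     [5, 0, 0],
    "No sales":  [0, 2, 0],
}


def percentages(counts, base):
    """Return the table as percentages of 'row', 'column' or 'overall' totals."""
    n_cols = len(next(iter(counts.values())))
    col_totals = [sum(row[j] for row in counts.values()) for j in range(n_cols)]
    grand_total = sum(col_totals)
    result = {}
    for label, row in counts.items():
        if base == "row":
            denominators = [sum(row)] * n_cols
        elif base == "column":
            denominators = col_totals
        elif base == "overall":
            denominators = [grand_total] * n_cols
        else:
            raise ValueError("base must be 'row', 'column' or 'overall'")
        # Guard against empty rows or columns, which have no sensible percentage.
        result[label] = [100.0 * x / d if d else float("nan")
                         for x, d in zip(row, denominators)]
    return result


for base in ("row", "column", "overall"):
    coffee = percentages(COUNTS, base)["Coffee"]
    print(base, [f"{name}: {value:.0f}%" for name, value in zip(COLUMNS, coffee)])
```

Each baseline answers a different question. Column percentages describe households of a given type (what share of female headed households earn most from coffee?), row percentages describe households with a given main crop (what share of households earning most from coffee are male headed?), and overall percentages describe the whole sample. Which of these makes sense depends on the objective being addressed.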
Crop earning highest income, by household type

                         Household type
Crop           Male headed   Female headed   Single male   Total
Coffee              19              7              1          27
Groundnut            2              4              0           6
Bogoya               1              3              0           4
Cassava              1              0              1           2
Matoke               2              0              0           2
Beans                1              0              0           1
Other                5              0              0           5
No sales             0              2              0           2
Total               31             16              2          49

One qualitative and one quantitative variable = group comparison

Total beans harvested in 94a

               Household type
               Male     Female
Mean           31.3       5.9
Median         10.0       0
25% point       0         0
Number         31        16

[Figure: boxplots of total beans harvested 94a by simplified hhtype (missing N = 1, male N = 31, female N = 15)]

Two quantitative variables

A scatter diagram is the only really useful way to summarize two quantitative variables and their relationship. The correlation coefficient is a summary of the strength of linear relationship between variables. It should NOT be quoted unless the data have first been looked at in a scatter diagram.

If there appears to be a relationship between variables the points to look for are:
1. Is the relationship monotonic?
2. Are the variables negatively or positively related?
3. Can the relationship be summarized by a straight line?
4. How much effect does X have on Y?
5. How highly clustered are points around a line?
6. Are there any gaps in the plot or do we have data values covering the whole range of X or Y?
7. Are there any outliers or odd observations?

[Figure: scatter diagram of total beans harvested 94a against total amount beans planted 94a, with points marked by simplified hhtype (male, female)]

Three or more variables

When three or more variables are being investigated, cross tabulations become sparse and difficult to interpret and clear graphs difficult to construct. A simple example shows why it is not always enough to consider just two variables at a time. In both Region 1 and Region 2 adoption is clearly not related to income (67% adopt in both high and low income groups in Region 1 and 33% in Region 2), but if the two regions are added together there appears to be higher adoption in the high income group.
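The arithmetic of the artificial example can be checked in a few lines. This is a minimal sketch using the counts of the artificial example (non-adopters and adopters by region and income group); the names `REGIONS`, `adoption_rate` and `combine` are invented for the illustration.

```python
# Counts (not adopting, adopting) from the artificial example,
# by region and income group (L = low, H = high).
REGIONS = {
    "Region 1": {"L": (10, 20), "H": (20, 40)},
    "Region 2": {"L": (40, 20), "H": (20, 10)},
}


def adoption_rate(non_adopters, adopters):
    """Percentage adopting among all households in a group."""
    return 100.0 * adopters / (non_adopters + adopters)


def combine(regions):
    """Add the counts over regions, i.e. ignore the region variable."""
    totals = {}
    for groups in regions.values():
        for income, (no, yes) in groups.items():
            prev_no, prev_yes = totals.get(income, (0, 0))
            totals[income] = (prev_no + no, prev_yes + yes)
    return totals


for name, groups in REGIONS.items():
    for income, counts in groups.items():
        print(f"{name}, income {income}: {adoption_rate(*counts):.0f}% adopt")

for income, counts in combine(REGIONS).items():
    print(f"Overall, income {income}: {adoption_rate(*counts):.0f}% adopt")
```

Within each region the adoption rate is the same for both income groups, yet the combined table gives about 44% for low income and 56% for high income. The apparent income effect arises only because most high income households are in Region 1, where adoption is higher for everyone.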
Artificial Example

Region 1
              Adoption
Income         -     +
L             10    20
H             20    40

Region 2
              Adoption
Income         -     +
L             40    20
H             20    10

Overall
              Adoption
Income         -     +
L             50    40
H             40    50

Exactly the same thing occurs with continuous variables, where spurious correlation (or lack of it) can be due to a third variable which has not been allowed for. More advanced graphical (e.g. small multiple pictures) and numerical (regression and log-linear modeling, multivariate methods such as principal components) methods exist to help here.

[Figure: plots of the variables planted 94a, planted 94b, harvested 94a and harvested 94b]

Descriptive statistics - common problems

Use of standard techniques rather than the most appropriate.
An example is the histogram to show the distribution of a continuous variable. The histogram shows features such as location and skewness. However, other possibilities are cumulative histograms (which show % points), boxplots (good for comparing, and showing outliers), q-q or normal probability plots (to check if the variable has a normal distribution) or stem-and-leaf plots (to look at individual values). Be imaginative - find the best way to display the information you want.

[Figure: four displays of AMPLT94A: histogram, cumulative histogram, box plot (median = 1.75, 25% = 0, 75% = 3, non-outlier min = 0, non-outlier max = 7, outliers and extremes marked) and normal quantile-quantile plot]

Use of techniques you can get your computer to do.
Much statistics software is very flexible.
If you learn enough about it you can get it to do most things, but not everything. Be prepared to do some analysis, including drawing of graphs or tables, by hand.

Concentration on means when variation is important.
Cases which deviate from the mean, contributing to variability, are probably just as important as the average values. Make sure you understand whether variation is important, and if so, describe it.

Limited use of derived quantities.
It is unlikely that each substantive question can be answered from columns of raw data alone. Calculation of new variables is certain to be important. Calculate new variables that are needed to answer the questions.

Confusion over the ‘unit of analysis’.
Many datasets contain data collected at more than 1 level (e.g. plot, person, household, community). Analyses must use the relevant level. Mixed levels are almost always wrong. Even in surveys with data collected at one level there is room for confusion regarding, for example, calculations of percentages.

Variety             Number of farmers      Average of those
                    planting in 94A        farmers who planted
Kawanda                   11                     2.45
Manyigamulimi             21                    10.53
Kanyebwa                   0                      -
White haricot              0                      -
All others                14                     2.04
No beans planted          18                      -

The various interesting percentages are:
Percent of all farmers planting Kawanda = 11/50 = 22%
Percent of all farmers who planted in 94A who planted Kawanda = 11/(50-18) = 34%
Percent of amount planted that was planted to Kawanda = (11 x 2.45) / (11 x 2.45 + 21 x 10.53 + 14 x 2.04) = 26.95/276.64 = 9.7%

Not working with relevant subsets of the data.
Should the farmer who never grows beans be deleted from the dataset? Should cases for whom farming is not the main occupation be omitted when analyzing economic activity? Make sure all relevant data, but no irrelevant data, is being used.

Poor handling of outliers.
Be on the look out for all odd observations, which might represent mistakes or unusual cases. Mistakes must be corrected. Treatment of unusual cases depends on context.
Including them can distort the picture. Omitting them can induce bias.

Balance between ‘Exploratory analysis’ and ‘Data Dredging’.
Exploratory analysis means looking for interesting patterns in the data without focusing on a specific question (e.g. “Who are the farmers who have heard of the new variety?”). This can be valuable, and show up facts which had not been thought of or hypothesized. Data dredging means searching through many statistics until ‘something turns up’. For example, doing a cross-tabulation of “Heard of new varieties” with every other qualitative variable. The results will be spurious (if you search through enough columns of random numbers you will eventually find ‘interesting’ correlations). The distinction between the two approaches is fine!

Confirmatory analysis: estimation and hypothesis testing

The problem

A. Labour

                             Household type
                             Male    Female    Total
Never hire or exchange        23       13        36
Hire or exchange              10        3        13
Total                         33       16        49

In Table A we can see:
33% of the households are female headed.
30% of male headed households hire labour, but only 19% of female headed households do.

B. Amount planted (farmers who planted beans in 94a)

            Male    Female    Overall
Mean         6.5      2.9       5.8
s.d.         9.5      1.3       8.6
n           24        6        30

In Table B we can see:
The mean amount of beans planted in 94a by farmers who grew beans that season is 5.8 kg.
The mean amount planted by males was 6.5 kg, but only 2.9 kg by females.

All these results are based on data from a sample of just 50 farmers in the district. How reliable are they? If we had measured a different 50 how similar would the results have been? If we had measured 500, or the whole population, would the conclusions have been much the same?

The results differ from the ‘true’ answer for two reasons:
Non-sampling errors - incorrect responses, mistakes in coding and data entry, poor recall, biased selection of respondents.
Sampling errors - those due to the fact that we have measured only some (a sample) of the population.
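The question of how similar the results from a different sample would have been can be explored by simulation. The following is a rough sketch under stated assumptions, not a description of the actual survey: it supposes a hypothetical population in which exactly 33% of households are female headed, draws simple random samples of 49 households, and uses an arbitrary seed and number of replicates. The function name is invented for the illustration.

```python
import random
import statistics


def simulate_sample_proportions(true_p, n, replicates, seed=1):
    """Draw repeated simple random samples and return the sample proportions."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    proportions = []
    for _ in range(replicates):
        # Each sampled household is female headed with probability true_p.
        hits = sum(1 for _ in range(n) if rng.random() < true_p)
        proportions.append(hits / n)
    return proportions


# Hypothetical population: 33% female headed households; samples of 49.
props = simulate_sample_proportions(true_p=0.33, n=49, replicates=2000)
print("smallest sample proportion:", round(min(props), 2))
print("largest sample proportion: ", round(max(props), 2))
print("standard deviation of sample proportions:",
      round(statistics.stdev(props), 3))
```

The spread of the simulated proportions is pure sampling error: every sample comes from the same population and is recorded without mistakes. Non-sampling errors do not appear in such a simulation at all, which is one way of seeing why they have to be controlled by survey practice rather than by calculation.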
The non-sampling errors cannot usually be measured, but can be minimized by good survey practice. Sampling errors can be measured, and that is the purpose of much confirmatory statistics.

Estimates, standard errors and confidence intervals

Proportions

The proportion of female headed households in the population is P. P is unknown. The sample value is p = 0.33 (= 16/49). The uncertainty due to sampling errors in this is measured by the standard error. The standard error is

   \[ \mathrm{se}(p) = \sqrt{\frac{p(1-p)}{n}} \]

where n = sample size. se(p) is estimated by

   \[ \sqrt{\frac{0.33\,(1-0.33)}{49}} = 0.07 \]

This is the standard deviation of possible estimates that could be produced by different simple random samples of the same size. The standard error is best interpreted via a confidence interval. A 95% confidence interval for P is

   p ± 2 x se(p) = 0.33 ± 2 x 0.07 = (0.19, 0.47)

This is interpreted as “We are 95% confident that the true percentage of female headed households is between 19% and 47%”. Hence the uncertainty in results due to sampling error is quantified.

Means

The mean amount of beans planted in 94a is 5.8 kg. The standard error of this is

   \[ \mathrm{se}(\text{mean}) = \sqrt{\frac{s^2}{n}} \]

where s^2 is the variance in amount of beans and n the sample size.

   \[ \mathrm{se}(\text{mean}) = \sqrt{\frac{8.6^2}{30}} = 1.6 \]

The 95% confidence interval is

   mean ± 2 x se(mean) = 5.8 ± 2 x 1.6 = (2.6, 9.0)

We are 95% confident that the mean amount of beans planted is between 2.6 and 9.0 kg.

Differences

If interested in differences between subgroups we can similarly estimate the difference and find a standard error of the estimate. Difference in mean amount of beans planted by males and females = 6.5 - 2.9 = 3.6 kg.

   \[ \mathrm{se}(\text{difference}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{9.5^2}{24} + \frac{1.3^2}{6}} = 2.0 \]

The 95% confidence interval for the difference is

   3.6 ± 2 x 2.0 = (-0.4, 7.6)

The mean difference between amounts planted by males and females could be anything between -0.4 kg and 7.6 kg.

Hypothesis tests: The logic

The logic of all the tests commonly used depends on the fact that random samples from a population behave in a predictable way.
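One concrete form of this predictability is the standard error used in the previous section. The following is a minimal sketch, not part of the original SPSS analysis, that recomputes the three standard errors and approximate 95% intervals from the summary figures quoted in Tables A and B. The function names are invented for the illustration, and the multiplier 2 is the same approximation to the 95% point that the text uses.

```python
import math


def se_proportion(p, n):
    """Standard error of a sample proportion p from a simple random sample of n."""
    return math.sqrt(p * (1 - p) / n)


def se_mean(sd, n):
    """Standard error of a sample mean, given the sample standard deviation."""
    return sd / math.sqrt(n)


def se_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def interval(estimate, se, multiplier=2.0):
    """Approximate 95% confidence interval: estimate +/- 2 standard errors."""
    return (estimate - multiplier * se, estimate + multiplier * se)


# Summary figures quoted in the text (Tables A and B).
examples = {
    "proportion female headed": (0.33, se_proportion(0.33, 49)),
    "mean amount planted (kg)": (5.8, se_mean(8.6, 30)),
    "male - female difference (kg)": (6.5 - 2.9, se_difference(9.5, 24, 1.3, 6)),
}

for label, (estimate, se) in examples.items():
    low, high = interval(estimate, se)
    print(f"{label}: estimate {estimate:.2f}, se {se:.2f}, "
          f"interval ({low:.2f}, {high:.2f})")
```

The limits printed here can differ from those in the text in the second decimal place, because the text rounds each standard error before forming the interval. Such differences are far smaller than the width of the intervals and do not change the interpretation.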
The mean amount of beans planted by female headed households, 2.9 kg, is not the actual mean of all households in the districts where the study took place. If a different sample had been randomly selected the mean would have been different. The question is ‘How different?’. If all households are very similar (low variation between households) then it really does not matter which sample is selected. On the other hand, high variation in the population will lead to very different sample means, and hence less certainty in the results obtained. The mathematics of statistics allows quantification of these ideas, and hence answers to the question of how certain we are of the results.

The logic of the hypothesis tests is as follows:
1. Assume some fact is true - the null hypothesis (e.g. There is no difference in mean amount of beans planted by male and female headed households).
2. Deduce how the sample would behave if (1) is true (e.g. How big could the sample differences between male and female headed households be?)
3. Compare the actual sample with the predictions in (2).
4. If (2) and (3) do not agree then (1) must be untrue - the null hypothesis is rejected. If (2) and (3) do agree then there is no reason, in this data, not to believe (1).

The level of agreement is measured by the 'significance level', explained in the examples below.

Examples of calculations

Chi-squared test for no association in a 2 x 2 table.

Taking Table A as an example, we want to test whether the proportion of households hiring labour is the same in male and female headed households. The steps are:

1. Formulate the null hypothesis: the proportion is equal for both male and female households.

2. If (1) is true, then the common proportion never hiring labour is estimated by 36/49 (and the proportion hiring by 13/49). Hence we would expect numbers in each category to be:

                 Male                    Female
   Never hire    33 x 36/49 = 24.2       16 x 36/49 = 11.8
   Hire          33 x 13/49 = 8.8        16 x 13/49 = 4.2

3. The difference between observed and expected frequencies is summarised as

   \[ \chi^2 = \frac{(24.2-23)^2}{24.2} + \frac{(11.8-13)^2}{11.8} + \frac{(8.8-10)^2}{8.8} + \frac{(4.2-3)^2}{4.2} = 0.74 \]

   (the value 0.74 is obtained using the unrounded expected frequencies).

4. If (1) is valid then the value of \chi^2 should be an observation from a \chi^2_1 distribution. Comparison with tables shows that 0.74 is not an extreme observation. A number at least as big as this would occur 39% of the time. The significance level is p = 0.39. Hence there is no strong reason not to believe the null hypothesis.

t-test to compare two means

In example B the steps needed are:

1. Formulate the null hypothesis: the difference in mean amount of beans planted for male and female households is zero.

2, 3. If (1) is true, then the difference in means of 3.6 kg, scaled by its standard error (= 2.0),

   \[ t = \frac{3.6}{2.0} = 1.8 \]

   is an observation from a t_{28} distribution.

4. Comparison with tables shows that 1.8 is not an extreme observation. A difference as big as this would occur 8% of the time if (1) is true. The significance level is p = 0.08. Hence there is not much reason not to believe the null hypothesis.

Limitations

Assumptions. The calculations in both examples (the chi-squared test and the t-test) are based on a series of assumptions. The key ones are:

Independence. In both examples A and B we assume observations are independent. Lack of independence is caused by:
(i) non-simple random samples. In this case we have used a stratified sample.
(ii) interference between observations. This would be the case if several individuals within the same household responded, or if data were collected at a group meeting.

Lack of bias due to non-response, interviewer effects, attempts to 'please' the researcher etc.

Equality of variance and normal distribution (t-test). These assumptions can be checked. In example B the data is clearly not normally distributed.

Limits to interpretation.

(1) If the result is ‘significant’ we can reject the null hypothesis, and conclude that there is a real difference in the population. If the result is ‘not significant’ we have not proved there is no difference.
It is never possible to prove the null hypothesis is true (it almost never will be!). All we can say is that this study has not produced evidence to make us disbelieve the null hypothesis.

(2) At what level of significance should the null hypothesis be rejected? 5% is commonly used but there is absolutely no reason why it should be treated as a rigid cut off. 6% and 4% significance levels are, for all real purposes, equivalent.

(3) Whether the null hypothesis is rejected depends as much on the sample size and precision of the study as on the 'truth' of the null hypothesis. A small, imprecise survey will not detect a difference that could be picked up by a larger study. Maybe we just did not collect enough data!

(4) The whole logic of significance testing and the p-value rests on what would happen in repeated surveys of the same design, using new randomisations. Does this make sense, when we know the survey would not and cannot ever be repeated?

(5) In most analysis exercises, differences which 'look interesting' at the exploratory stage are investigated further in the confirmatory analysis. If the tests to perform have been selected because differences look large, all significance levels are invalid.

(6) If a large number of tests are performed, as is often the case in analysis of a study with many variables, then we would expect 5% of the tests to give "significant" results at the p = 0.05 level even if all null hypotheses were true. Hence it can be difficult to interpret the results of multiple tests.

What should you do

(1) Treat the significance level p as an indication of 'strength of evidence' against the null hypothesis, not as a Yes/No decision maker.

(2) Concentrate on estimating the size of differences, rather than just testing whether they exist. Confidence intervals for differences will be much more useful than hypothesis tests.

At the end of every significance test apply the SO WHAT? test. Ask yourself 'So what?'.
Has the significance test really improved your understanding of the situation and helped you take a rational decision for future action? If not, forget it, and get on with something more useful.

Confirmatory Analysis - Regression

Starting Regression

- Beware! Even ‘simple’ regression is not simple!
- Start by considering types of relationship that might exist. The most useful regression analysis will be one that starts from understanding of the theory behind the process being studied.

The example used here is rather artificial. It examines the proposition that the amount of beans harvested in 94a depends only on land area.

- Plot the data to see if there is any evidence of the relationship.

[Figure: scatter plot of HVTOT94A against LANDAREA]

Fitting the regression line

- Software is widely available to do this
- Understand the output!

        * * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Listwise Deletion of Missing Data

Equation Number 1    Dependent Variable..   HVTOT94A   total beans harvested 94a

Block Number 1.   Method: Enter      LANDAREA

Variable(s) Entered on Step Number 1..   LANDAREA

Multiple R             .54425
R Square               .29621
Adjusted R Square      .28057
Standard Error       29.01659

Analysis of Variance
                 DF      Sum of Squares      Mean Square
Regression        1        15946.10384       15946.10384
Residual         45        37888.31105         841.96247

F =   18.93921       Signif F =  .0001

------------------ Variables in the Equation ------------------
Variable              B          SE B        Beta         T    Sig T
LANDAREA          8.200238    1.884280     .544249     4.352   .0001
(Constant)       -2.863844    6.051297                 -.473   .6383

End Block Number 1    All requested variables entered.

Check the fit

- Look for any unusual points or outliers. They could represent mistakes or cases that require special treatment. They certainly require explanation.
- Look for influential points, which largely determine results. They are not a bad thing, but you must be aware if your conclusions depend critically on one or two observations.
- Look at the residuals to determine:
  1. Whether they satisfy the main assumptions that validate the analysis (constant variance, independence, roughly normally distributed).
  2. Whether they show patterns according to the value of other variables, indicating that those other variables should be allowed for in the analysis.

Interpretation

‘Significance’ does not tell you whether the fitted model is logically sound or if it fits the data well.

‘Significance’ does not tell you whether the model is useful in explaining or describing a relationship, or if the relationship has much predictive power.

A regression model derived from survey data cannot tell you what would happen when an ‘x-variable’ is changed. For example we cannot use it to predict the bean harvest of a farmer whose land holding changes.

Existence of a regression relationship between two variables does not mean there is a causal relationship.

Regression relationships become useful when similar relationships are found in a number of different conditions. Look for ‘significant sameness’ between regions, crops, farm types, etc.

Adding more variables - Multiple regression

Multiple regression is a powerful tool for understanding the relationship of one variable to several others. BUT.....

All the limitations to interpretation above apply, and are compounded by the existence of several ‘x-variables’.

It is hard to draw graphs that show the relationships and the way data depart from them, so the analyst must rely more on numerical indicators of lack of fit, outliers, and influential points. Multiple regression analysis will not be successful if these are not understood.

‘Stepwise’ and similar variable selection techniques, so loved by social scientists, have little theoretical basis and can produce answers which are very poor. Regression modeling will be most successful if understanding of the underlying processes is used to choose possible models, rather than relying on computer algorithms.
The sample size required for multiple regression analysis depends on the ‘configuration’ of the data (in particular the range of the x-variables and correlations among them). The required sample size quickly becomes large as the number of x-variables increases. If regression analysis is part of the principal objectives of the survey, it might be possible to select the sample in a way that makes the analysis more efficient.

[Figure: raw residuals plotted against HHTYPE2]

Interpretation

Interpret results. This does not mean ‘understand which effects are significant’ but ‘understand and communicate what you now know about the problem’. You should be able to:

- Meet the objectives of the study.
- Clearly state what substantive new knowledge has been generated.
- Show how this new information and understanding builds on what was there before. Does it:
  o add more examples of something previously known?
  o mean that general rules or principles can be stated with more confidence?
  o allow predictions to be made for new and important situations?
  o mean that current understanding or theory has to be substantially modified?
- Use the quantitative information you have generated to make quantitative predictions about the larger picture.
- The ultimate goal of the research is a development objective. Explain how your results help you towards that objective, and what the next steps will be.
- Your survey and its analysis cost thousands of dollars. Explain why this was a good investment.
- Answer the ‘So what?’ question. What can we now do which we could not do before you did your survey?

References

Coe R (2002) Steps in Survey Analysis. Nairobi: ICRAF. 15 pp.
SSC (2001) Approaches to analysis of survey data. Reading: Statistical Services Centre. 28 pp.