Statistics in Survey Analysis¹
Ric Coe
ICRAF, Nairobi, Kenya
Contents

Introduction
Preliminaries
Descriptive Statistics
   1. Summarizing Single Variables
   2. Two Variables
Descriptive statistics - common problems
Confirmatory analysis: estimation and hypothesis testing
   The problem
   Estimates, standard errors and confidence intervals
   Hypothesis tests: The logic
   Examples of calculations
   Limitations
   What should you do
Confirmatory Analysis - Regression
   Starting Regression
   Fitting the regression line
   Check the fit
   Interpretation
   Adding more variables - Multiple regression
Interpretation
References
Introduction
This guide summarises the use of simple statistical analyses in the interpretation of
survey data. It is aimed at the typical small surveys (up to a few hundred respondents)
carried out by researchers looking at the role and uptake of new agricultural technologies.
¹ Modified from input to a course ‘Formal data analysis for bean researchers’ organised by CIAT at CMRT, Egerton University, February 1996. Thanks to Soniia David for permission to quote the example.
There are several common problems in the approaches to survey analysis used by many
researchers, probably a result of the research methods courses followed during training.
One is to concentrate attention on a few well known statistical techniques, such as chi-squared tests in 2-way tables and regression analysis, and to place a naively simplistic
reliance on the results. This is the topic of this guide. A second problem is to treat
statistical analysis as a recipe that can be followed to a successful conclusion without
much thought or understanding along the way. This is the topic of a companion guide
‘Steps in survey analysis’ (Coe 2002). A third problem is to ignore the context in which
the survey was carried out, so ignoring many of the possibilities and limitations of the
statistical analysis. This is the topic of the guide ‘Approaches to analysis of survey data’
(SSC, 2001).
Example
The example used in this guide was a survey of farmers in two districts of Uganda. It
aimed to characterize the pattern of bean growing and understand the role of new bean
varieties in the household economy of new farmers. A few of the stated objectives were:
Overall: Provide a baseline against which to measure adoption and impact of improved
bean varieties.
Hypotheses:
1. Adoption.
a. There is no relationship between adoption of new varieties and wealth.
b. The rate of adoption for MCM5001 will be higher in Mbale than Mukono, due to
strong non-appreciation of small seeded varieties in Mukono.
2. Impact.
a. Adoption of new varieties will result in an increase in absolute quantities and
proportion of beans sold, hence increasing household income from beans.
b. Adoption of new varieties will not result in increased sales of fresh beans.
c. Adoption of new varieties will not change the amount of income from beans controlled
by women.
d. ...
The examples are based on a subset of just 50 households from the whole survey of 179.
The variables used in the example have been labeled and should be self-explanatory.
In this guide SPSS has been used for the statistical analysis. General points appear in
normal text. Computer output and other items relating specifically to the example are
boxed.
Preliminaries
Before starting analysis:
1. Make sure you are familiar with the data source and collection methods.
For example:
• Was a random sampling scheme used?
• Were individual questionnaires completed during a group meeting?
• Who was the data collected by? Why and when?
2. Clarify objectives
These should have been listed in detail when the survey was planned. If they were not, or
have changed, they must be listed now. It is impossible to analyze a survey if you do
not know what you are trying to find out.
3. Coding and Data entry.
4. Make sure you understand the data. You must understand the exact meaning of every
number and code.
Data that needs clarifying.
Variable WIVES (Question 3): Does ‘1’ mean 1 wife or 2 wives? (conflict between
questionnaire and code book).
Variable ARRANGE (Question 4). Does ‘NA’ mean there are no bean plots or no
husband/wife?
Variables OCCUPHD1 and OCCUPHD2 (Question 8): Why are two occupations given
when the question asks for the main occupation?
Variable KAW94A (Question 21). What is the difference between ‘na’ and ‘No’?
Variable AMKW94A (Question 21). What are the units?
Descriptive Statistics
1. Summarizing Single Variables
Qualitative (“Coded”) variables

• Useful summaries are just frequencies and percentages.

MATOKE    Grows matoke

Value Label                        Value  Frequency  Percent  Valid Percent  Cum Percent
Yes                                    1         42     84.0           84.0         84.0
No                                     2          8     16.0           16.0        100.0
                                           ---------  -------  -------------
                                   Total         50    100.0          100.0

Valid cases    50      Missing cases    0

HHTYPE    Household type

Value Label                        Value  Frequency  Percent  Valid Percent  Cum Percent
Male headed, one wife                  1         27     54.0           54.0         54.0
Male headed, more than one wife        2          4      8.0            8.0         62.0
Female headed, absent husband          3          3      6.0            6.0         68.0
Female headed, no husband              4         13     26.0           26.0         94.0
Single man                             5          2      4.0            4.0         98.0
Other                                  7          1      2.0            2.0        100.0
                                           ---------  -------  -------------
                                   Total         50    100.0          100.0

Valid cases    50      Missing cases    0
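The tables above come from SPSS. If the data were also exported to a general-purpose tool, the same summaries can be produced in a few lines; the sketch below uses Python with pandas. The file name survey.csv is a hypothetical export, although MATOKE and HHTYPE are the variable names used in the example.

import pandas as pd

df = pd.read_csv("survey.csv")        # hypothetical export of the survey data

for var in ["MATOKE", "HHTYPE"]:
    counts = df[var].value_counts()                      # frequencies
    percent = (counts / counts.sum() * 100).round(1)     # percentages of valid cases
    print(var)
    print(pd.DataFrame({"Frequency": counts, "Percent": percent}))
    print("Valid cases:", df[var].notna().sum(),
          "  Missing cases:", df[var].isna().sum())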
• Note the different emphasis of frequencies and percentages. Frequencies emphasize the sample, percentages emphasize the population. Give the total sample size with percentages.
• Take care with percentages: make sure you are using an appropriate baseline (what is 100%?) and remember that percentages do not always have to add to 100, as in the example below.
• Edit the computer output for presentation!

Crop           % growing
Cassava              100
Beans                 98
Matoke                84
Maize                 78
Yams                  20
Sample size           50

• Look carefully at and identify rare cases. Such data points may be errors, or may need special treatment.

What is the 1 “other” household type in question 2?
One farmer does not grow beans. Should this case be deleted from all analyses?

• Bar charts are most appropriate when the categories can be ordered in some useful way.
Quantitative Variables

• In summarizing quantitative variables the most interesting things are:
   o Location      (What is a typical value?)
   o Spread        (How much variation is there?)
   o Odd values    (What is their source and interpretation?)
• Location is measured by the mean or median (not usefully by the mode).
• Spread is measured by the standard deviation or the distance between quartiles.
• Quantities such as the 10% and 90% points are useful in some situations.
• Use histograms and boxplots.

Amount of beans harvested in 94a

Mean                     15.9
Standard deviation       34.2
Median                    4.0
25% point                 0
75% point                14.0
Mean (ignoring 200)      10.1

[Histogram of total beans harvested 94a: Std. Dev = 34.21, Mean = 16.0, N = 47]
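A sketch of how the same summaries and plots could be obtained outside SPSS, again in Python; HVTOT94A is the name used later in the guide for total beans harvested in 94a, and survey.csv is a hypothetical export.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey.csv")            # hypothetical export of the survey data
harvest = df["HVTOT94A"].dropna()         # total beans harvested in 94a

print("Mean:", harvest.mean())
print("Standard deviation:", harvest.std())
print("Median:", harvest.median())
print("25% and 75% points:", harvest.quantile([0.25, 0.75]).tolist())

# The plots recommended above: a histogram and a boxplot
fig, axes = plt.subplots(1, 2)
axes[0].hist(harvest, bins=10)
axes[0].set_xlabel("total beans harvested 94a")
axes[1].boxplot(harvest)
plt.show()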
2. Two variables
Two qualitative variables = cross tabulation
• Interpretation can be helped by careful layout.
• Percentages may be calculated of row totals, column totals or overall totals. Not all of them will make sense!
Crop earning                          Household type
highest income          Male headed   Female headed   Single male    Total
Coffee                           19               7             1       27
Groundnut                         2               4             0        6
Bogoya                            1               3             0        4
Cassava                           1               0             1        2
Matoke                            2               0             0        2
Beans                             1               0             0        1
Other                             5               0             0        5
No sales                          0               2             0        2
Total                            31              16             2       49
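A sketch of how such a cross tabulation, with the different choices of percentage base, might be produced in Python with pandas. HHTYPE is a variable from the example; CROPINC, standing for the crop earning the highest income, and the file name are hypothetical.

import pandas as pd

df = pd.read_csv("survey.csv")   # hypothetical export of the survey data

# Counts, with row and column totals added as margins
counts = pd.crosstab(df["CROPINC"], df["HHTYPE"], margins=True)

# Percentages of row totals, of column totals, and of the overall total
row_pct = pd.crosstab(df["CROPINC"], df["HHTYPE"], normalize="index") * 100
col_pct = pd.crosstab(df["CROPINC"], df["HHTYPE"], normalize="columns") * 100
all_pct = pd.crosstab(df["CROPINC"], df["HHTYPE"], normalize="all") * 100

print(counts)
print(row_pct.round(1))   # decide which base makes sense before reporting any of them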
One qualitative and one quantitative variable = group comparison

Total beans harvested in 94a
                        Household type
                        Male      Female
Mean                    31.3         5.9
Median                  10.0         0
25% point                0           0
Number                  31          16

[Boxplots of total beans harvested 94a by simplified household type (male, female, missing)]
Two quantitative variables
A scatter diagram is the only really useful way to summarize two quantitative
variables and their relationship.
The correlation coefficient is a summary of the strength of linear relationship
between variables. It should NOT be quoted unless the data have first been looked
at in a scatter diagram.
If there appears to be a relationship between variables the points to look for are:
1. Is the relationship monotonic?
2. Are the variables negatively or positively related?
3. Can the relationship be summarized by a straight line?
4. How much effect does X have on Y?
5. How highly clustered are the points around a line?
6. Are there any gaps in the plot, or do we have data values covering the whole range of X or Y?
7. Are there any outliers or odd observations?

[Scatter plot of total beans harvested 94a against total amount of beans planted 94a, with points identified by simplified household type (male, female)]
Three or more variables
When three or more variables are being
investigated, cross tabulations become
sparse and difficult to interpret and
clear graphs difficult to construct.
A simple example of the need for not
always considering just two variables
at a time is given. In both Region 1
and Region 2 it is clear adoption is not
related to income (67% adopt in both
high and low income groups in Region
1 and 33% in Region 2) but if the sum
of the two regions is studied there
appears to be higher adoption in the
high income group.
Artificial Example

Region 1                 Adoption
                         -      +
Income      L           10     20
            H           20     40

Region 2                 Adoption
                         -      +
Income      L           40     20
            H           20     10

Overall                  Adoption
                         -      +
Income      L           50     40
            H           40     50

Exactly the same thing occurs with continuous variables, where spurious correlation (or lack of it) can be due to a third variable which has not been allowed for. More advanced graphical (e.g. small multiple pictures) and numerical (regression and log-linear modeling, multivariate methods such as principal components) methods exist to help there.

[Scatterplot matrix of amounts planted 94a, planted 94b, harvested 94a and harvested 94b]
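The reversal in the artificial example can be verified with a few lines of arithmetic; the sketch below simply recomputes the adoption percentages from the counts in the tables above.

# Counts of non-adopters (-) and adopters (+) by income group, from the tables above
region1 = {"L": (10, 20), "H": (20, 40)}
region2 = {"L": (40, 20), "H": (20, 10)}

def adoption_rate(non_adopters, adopters):
    return 100 * adopters / (adopters + non_adopters)

for income in ("L", "H"):
    overall = adoption_rate(region1[income][0] + region2[income][0],
                            region1[income][1] + region2[income][1])
    print(f"Income {income}: Region 1 {adoption_rate(*region1[income]):.0f}%, "
          f"Region 2 {adoption_rate(*region2[income]):.0f}%, "
          f"Overall {overall:.0f}%")

Within each region the adoption rate is the same for both income groups, yet the pooled table suggests an income effect, because the two regions differ in both income distribution and overall adoption.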
Descriptive statistics - common problems

• Use of standard techniques rather than the most appropriate.
An example is the histogram to show the distribution of a continuous variable. The
histogram shows features such as location and skewness. However, other
possibilities are cumulative histograms (which show % points), boxplots (good for
comparing, and showing outliers), q-q or normal probability plots (to check if the
variable has a normal distribution) or stem-and-leaf plots (to look at individual
values).
Be imaginative - find the best way to display the information you want.
[Four displays of the distribution of AMPLT94A: a histogram, a cumulative histogram, a box plot (median = 1.75, quartiles 0 and 3) and a normal quantile-quantile plot]
• Use of techniques you can get your computer to do.
Much statistics software is very flexible. If you learn enough about it you can get it
to do most things, but not everything.
Be prepared to do some analysis, including drawing of graphs or tables, by hand.

• Concentration on means when variation is important.
Cases which deviate from the mean, contributing to variability, are probably just as
important as the average values.
Make sure you understand whether variation is important, and if so, describe it.

• Limited use of derived quantities.
It is unlikely that each substantive question can be answered from columns of raw data
alone. Calculation of new variables is certain to be important.
Calculate new variables that are needed to answer the questions.

• Confusion over the ‘unit of analysis’.
Many datasets contain data collected at more than one level (e.g. plot, person, household, community). Analyses must use the relevant level. Mixing levels is almost always wrong.
Even in surveys with data collected at one level there is room for confusion
regarding, for example, calculations of percentages.
Variety              Number of farmers     Average of those
                     planting in 94A       farmers who planted
Kawanda                        11                    2.45
Manyigamulimi                  21                   10.53
Kanyebwa                        0                       -
White haricot                   0                       -
All others                     14                    2.04
No beans planted               18                       -
The various interesting percentages are:
• Percent of all farmers planting Kawanda = 11/50 = 22%
• Percent of all farmers who planted in 94A who planted Kawanda = 11/(50 - 18) = 34%
• Percent of amount planted that was planted to Kawanda
  = (11 x 2.45) / (11 x 2.45 + 21 x 10.53 + 14 x 2.04) = 26.95/276.64 = 9.7%

• Not working with relevant subsets of the data.
Should the farmer who never grows beans be deleted from the dataset? Should cases
for whom farming is not the main occupation be omitted when analyzing economic
activity?
Make sure all relevant data, but no irrelevant data, is being used.

• Poor handling of outliers.
Be on the look out for all odd observations, which might represent mistakes or
unusual cases. Mistakes must be corrected. Treatment of unusual cases depends on
context. Including them can distort the picture. Omitting them can induce bias.

• Balance between ‘Exploratory analysis’ and ‘Data Dredging’.
Exploratory analysis means looking for interesting patterns in the data without
focusing on a specific question (e.g. “Who are the farmers who have heard of the new
variety?”). This can be valuable, and show up facts which had not been thought of or
hypothesized.
Data dredging means searching through many statistics until ‘something turns up’.
For example, doing a cross-tabulation of “Heard of new varieties” with every other qualitative variable. The results will be spurious (if you search through enough columns of random numbers you will eventually find ‘interesting’ correlations).
The distinction between the two approaches is a fine one!
Confirmatory analysis: estimation and hypothesis testing
The problem
A.
                                  Household Type
Labour                           Male    Female    Total
Never hire or exchange             23        13       36
Hire or exchange                   10         3       13
Total                              33        16       49

In Table A we can see:
33% of the households are female headed.
30% of male headed households hire labour, but only 19% of female headed households
do.
B.
Amount planted        Farmers who planted beans in 94a
                         Male    Female    Overall
Mean                      6.5       2.9        5.8
s.d.                      9.5       1.3        8.6
n                          24         6         30
In Table B we can see:
The mean amount of beans planted in 94a by farmers who grew beans that season is 5.8
kg.
The mean amount planted by males was 6.5 kg, but only 2.9 kg by females.
All these results are based on data from a sample of just 50 farmers in the district.
How reliable are they? If we had measured a different 50 how similar would the
results have been? If we had measured 500, or the whole population, would the
conclusions have been much the same?
The results differ from the ‘true’ answer for two reasons:
Non sampling errors - incorrect responses, mistakes in coding and data entry, poor
recall, biased selection of respondents.
Sampling errors - those due to the fact that we have measured only some (a sample)
of the population.
The non-sampling errors can not usually be measured, but can be minimized by good
survey practice. Sampling errors can be measured, and that is the purpose of much
confirmatory statistics.
Estimates, standard errors and confidence intervals.
Proportions
The proportion of female headed households in the population is P. P is unknown.
The sample value is p = 0.33 (= 16/49). The uncertainty due to sampling errors in this is measured by the standard error. The standard error is

   se(p) = √( p(1 - p) / n ),

where n = sample size. se(p) is estimated by

   se(p) = √( 0.33 x (1 - 0.33) / 49 ) ≈ 0.07
This is the standard deviation of possible estimates that could be produced by
different simple random samples of the same size.
The standard error is best interpreted via a confidence interval. A 95% confidence
interval for p is p ± 2 x se(p)
= 0.33 ± 2 x 0.07
= (0.19, 0.47)
This is interpreted as “We are 95% confident that the true percentage of female
headed households is between 19% and 47%”. Hence the uncertainty in results due
to sampling error is quantified.
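The same calculation needs nothing more than a few lines of arithmetic; a sketch in Python:

from math import sqrt

p = 16 / 49                        # sample proportion of female headed households
n = 49
se = sqrt(p * (1 - p) / n)         # standard error of the proportion
ci = (p - 2 * se, p + 2 * se)      # approximate 95% confidence interval

print(f"p = {p:.2f}, se(p) = {se:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# about p = 0.33, se = 0.07 and CI = (0.19, 0.46); the (0.19, 0.47) above
# comes from using the rounded values 0.33 and 0.07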
Means
The mean amount of beans planted in 94a is 5.8 kg. The standard deviation of this is

   se(mean) = √( s² / n ),

where s² is the variance in amount of beans and n the sample size.

   se(mean) = √( 8.6² / 30 ) ≈ 1.6

The 95% confidence interval is

   mean ± 2 x se(mean) = 5.8 ± 2 x 1.6 = (2.6, 9.0)
The mean amount of beans planted is between 2.6 and 9.0 kg.
Differences
If interested in differences between subgroups we can similarly estimate the
difference and find a standard error of the estimate.
Difference in mean amount of beans planted by
males and females = 6.5 - 2.9
= 3.6 kg.
   se(difference) = √( s1²/n1 + s2²/n2 ) = √( 9.5²/24 + 1.3²/6 ) = 2.0

The 95% confidence interval for the difference is

   3.6 ± 2 x 2.0 = (-0.4, 7.6)
The mean difference between amounts planted by males and females could be
anything between -0.4 kg and 7.6 kg.
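The calculations for the mean and for the difference between means follow the same pattern; a sketch using the summary figures in Table B:

from math import sqrt

# Mean amount planted by the 30 farmers who planted beans in 94a
mean, s, n = 5.8, 8.6, 30
se_mean = sqrt(s**2 / n)
print("se(mean) =", round(se_mean, 1),
      " 95% CI =", (round(mean - 2 * se_mean, 1), round(mean + 2 * se_mean, 1)))

# Difference in mean amount planted, male minus female headed households
m1, s1, n1 = 6.5, 9.5, 24
m2, s2, n2 = 2.9, 1.3, 6
diff = m1 - m2
se_diff = sqrt(s1**2 / n1 + s2**2 / n2)
print("difference =", round(diff, 1), " se =", round(se_diff, 1),
      " 95% CI =", (round(diff - 2 * se_diff, 1), round(diff + 2 * se_diff, 1)))
# close to the intervals above, which use the rounded standard errors 1.6 and 2.0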
Hypothesis tests: The logic
The logic of all the tests commonly used depends on the fact that random samples from a
population behave in a predictable way. The mean amount of beans planted by female headed households, 2.9 kg, is not the actual mean of all such households in the districts where the study took place. If a different sample had been randomly selected the mean would have been
different. The question is ‘How different?’. If all households are very similar (low variation
between households) then it really does not matter which sample is selected. On the other
hand, high variation in the population will lead to very different sample means, and hence
less certainty in the results obtained. The mathematics of statistics allows quantification of
these ideas, and hence answers to the question of how certain we are of the results.
The logic of the hypothesis tests is as follows:
1. Assume some fact is true - the null hypothesis (e.g. There is no difference in mean amount of beans planted by male and female headed households).
2. Deduce how the sample would behave if (1) is true (e.g. How big could the sample
differences between male and female headed households be?)
3. Compare the actual sample with the predictions in (2).
4. If (2) and (3) do not agree then (1) must be untrue - the null hypothesis is rejected.
If (2) and (3) do agree then there is no reason, in this data, not to believe (1).
The level of agreement is measured by the 'significance level', explained in the examples
below.
Examples of calculations
Chi-squared test for no association in a 2 x 2 table.
Taking Table A as an example, we want to test whether the proportion of
households hiring labour is the same in male and female headed households. The steps are:
1.   Formulate the null hypothesis: the proportion is equal for both male and female headed households.

2.   If (1) is true, then this proportion is estimated by 36/49. Hence we would expect the numbers in each category to be:

                          Male                   Female
   Never hire    33 x 36/49 = 24.2      16 x 36/49 = 11.8
   Hire          33 x 13/49 =  8.8      16 x 13/49 =  4.2

3.   The difference between observed and expected frequencies is summarised as

   χ² = (24.2 - 23)²/24.2 + (11.8 - 13)²/11.8 + (8.8 - 10)²/8.8 + (4.2 - 3)²/4.2 = 0.74

4.   If (1) is valid then the value of χ² should be an observation from a chi-squared distribution with 1 degree of freedom. Comparison with tables shows that 0.74 is not an extreme observation. A number at least as big as this would occur 39% of the time. The significance level is p = 0.39. Hence there is no strong reason not to believe the null hypothesis.
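The test can be reproduced with scipy; a sketch using the counts from Table A. The Yates continuity correction is switched off so that the statistic matches the hand calculation above.

from scipy.stats import chi2_contingency

observed = [[23, 13],    # never hire or exchange: male, female
            [10,  3]]    # hire or exchange:       male, female

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print("chi-squared =", round(chi2, 2), " df =", dof, " p =", round(p, 2))
print("expected counts:")
print(expected.round(1))
# chi-squared = 0.74 on 1 df with p = 0.39, as in the hand calculation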
t-test to compare two means
In example B the steps needed are:
1.   Formulate the null hypothesis: the difference in mean amount of beans planted for male and female households is zero.

2, 3.   If (1) is true, then the difference in means of 3.6 kg, scaled by its standard error (= 2.0),

   t = 3.6 / 2.0 = 1.8,

is an observation from a t distribution with 28 degrees of freedom.

4.   Comparison with tables shows that 1.8 is not an extreme observation. A difference as big as this would occur 8% of the time if (1) is true. The significance level is p = 0.08. Hence there is not much reason not to believe the null hypothesis.
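A sketch of the same calculation in Python, working from the summary statistics in Table B. The guide uses the unpooled standard error with 28 degrees of freedom, so the statistic is computed directly and the significance level taken from the t distribution; scipy's ready-made ttest_ind would need the raw data and makes slightly different choices about the standard error or degrees of freedom.

from math import sqrt
from scipy.stats import t as t_dist

m1, s1, n1 = 6.5, 9.5, 24    # male headed households
m2, s2, n2 = 2.9, 1.3, 6     # female headed households

se = sqrt(s1**2 / n1 + s2**2 / n2)    # unpooled standard error, as in the text
t_stat = (m1 - m2) / se
df = n1 + n2 - 2                      # 28 degrees of freedom, as in the text
p = 2 * t_dist.sf(abs(t_stat), df)    # two-sided significance level

print(f"t = {t_stat:.1f}, p = {p:.2f}")   # roughly t = 1.8, p = 0.08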
Limitations
Assumptions.
The calculations in both of the examples above are based on a series of assumptions. The key ones are:
Independence. In both examples A and B we assume observations are independent. Lack of independence is caused by:
(i)  non-simple random samples. In this case we have used a stratified sample.
(ii) interference between observations. This would be the case if several individuals within the same household responded, or if data were collected at a group meeting.
Lack of bias due to non-response, interviewer effects, attempts to 'please' the
researcher etc.
Equality of variance and normal distribution (t-test). These assumptions can be checked. In example B the data is clearly not normally distributed.
Limits to interpretation.
(1)
If the result is ‘significant’ we can reject the null hypothesis, and conclude
that there is a real difference in the population. If the result is ‘not significant’ we have not
proved there is no difference. It is never possible to prove the null hypothesis is true (it almost never will be!). All we can say is that this study has not produced evidence to make us
disbelieve the null hypothesis.
(2)
At what level of significance should the null hypothesis be rejected? 5% is
commonly used but there is absolutely no reason why it should be treated as a rigid cut off.
6% and 4% significance levels are, for all real purposes, equivalent.
(3)
Whether the null-hypothesis is rejected depends as much on the sample size
and precision of the study, as on the 'truth' of the null hypothesis. A small, imprecise survey
will not detect a difference that could be picked up by a larger study. Maybe we just did not
collect enough data!
(4)
The whole logic of significance testing and the p-value rests on what would
happen in repeated surveys of the same design, using new randomisations. Is this sensible, when we know the survey would not and cannot ever be repeated?
(5)
In most analysis exercises, differences which 'look interesting' at the
exploratory stage are investigated further in the confirmatory analysis. If the tests to
perform have been selected because differences look large, all significance levels are
invalid.
(6)
If a large number of tests are performed, as is often the case in analysis of a
study with many variables, then we would expect 5% of the tests to give "significant" results
at the p = 0.05 level even if all null hypotheses were true. Hence it can be difficult to
interpret the results of multiple tests.
What should you do
(1)
Treat the significance level p as an indication of 'strength of evidence'
against the null hypothesis, not as a Yes/No decision maker.
(2)
Concentrate on estimating the size of differences, rather than just testing
whether they exist. Confidence intervals for differences will be much more useful than
hypothesis tests.
At the end of every significance test apply the SO WHAT? test. Ask yourself 'So
what?'. Has the significance test really improved your understanding of the situation
and helped you take a rational decision for future action? If not, forget it, and get on
with something more useful.
Confirmatory Analysis - Regression
Starting Regression
- Beware!
Even ‘simple’ regression is not simple!
- Start by considering types of relationship that might exist. The most useful regression
analysis will be one that starts from understanding of the theory behind the process being
studied.
The example used here is rather artificial. It examines the proposition that the amount of
beans harvested in 94a depends only on land area.
-  Plot the data to see if there is any evidence of the relationship.

[Scatter plot of HVTOT94A (total beans harvested 94a) against LANDAREA]

Fitting the regression line
- Software is widely available to do this.
- Understand the output!
* * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Listwise Deletion of Missing Data

Equation Number 1    Dependent Variable..   HVTOT94A   total beans harvested 94a

Block Number  1.  Method:  Enter      LANDAREA

Variable(s) Entered on Step Number
   1..    LANDAREA

Multiple R           .54425
R Square             .29621
Adjusted R Square    .28057
Standard Error     29.01659

Analysis of Variance
                DF      Sum of Squares      Mean Square
Regression       1         15946.10384      15946.10384
Residual        45         37888.31105        841.96247

F =      18.93921       Signif F =  .0001

------------------ Variables in the Equation ------------------
Variable              B         SE B       Beta         T    Sig T
LANDAREA       8.200238     1.884280    .544249     4.352    .0001
(Constant)    -2.863844     6.051297                -.473    .6383

End Block Number   1   All requested variables entered.
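A sketch of how the same line might be fitted outside SPSS, here with scipy; the file name is a hypothetical export and the variable names are those in the output above.

import pandas as pd
from scipy.stats import linregress

df = pd.read_csv("survey.csv")                  # hypothetical export of the survey data
data = df[["LANDAREA", "HVTOT94A"]].dropna()    # listwise deletion of missing data

fit = linregress(data["LANDAREA"], data["HVTOT94A"])
print("slope (B for LANDAREA):", round(fit.slope, 2))
print("intercept (constant):  ", round(fit.intercept, 2))
print("R squared:             ", round(fit.rvalue**2, 3))
print("p-value for zero slope:", round(fit.pvalue, 4))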
Check the fit
- Look for any unusual points or outliers. They could represent mistakes or cases that
require special treatment. They certainly require explanation.
- Look for influential points, which largely determine results. They are not a bad thing,
but you must be aware if your conclusions depend critically on one or two observations.
- Look at the residuals to determine:
1. Whether they satisfy the main assumptions that validate the analysis (constant
variance, independence, roughly normally distributed)
2. Whether they show patterns according to the value of other variables, indicating that those other variables should be allowed for in the analysis (a sketch of these checks follows).
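A sketch of these checks, continuing from the scipy fit above; the plots are the simplest versions of the checks listed, and HHTYPE2 (the simplified household type used in the next section) stands in for 'other variables'.

import matplotlib.pyplot as plt
from scipy.stats import probplot

fitted = fit.intercept + fit.slope * data["LANDAREA"]
residuals = data["HVTOT94A"] - fitted

# Constant variance, odd points and outliers: residuals against fitted values
plt.scatter(fitted, residuals)
plt.axhline(0)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Rough normality check: normal probability (q-q) plot of the residuals
probplot(residuals, dist="norm", plot=plt)
plt.show()

# Patterns against other variables, e.g. simplified household type
plt.scatter(df.loc[data.index, "HHTYPE2"], residuals)
plt.xlabel("HHTYPE2")
plt.ylabel("residuals")
plt.show()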
Interpretation

• ‘Significance’ does not tell you whether the fitted model is logically sound or if it fits the data well.
• ‘Significance’ does not tell you whether the model is useful in explaining or describing a relationship, or if the relationship has much predictive power.
• A regression model derived from survey data cannot tell you what would happen when an ‘x-variable’ is changed. For example we cannot use it to predict the bean harvest of a farmer whose land holding changes.
• Existence of a regression relationship between two variables does not mean there is a causal relationship.
Regression relationships become useful when similar relationships are found in a number
of different conditions. Look for ‘significant sameness’ between regions, crops, farm
types, etc.
Adding more variables - Multiple regression

• Multiple regression is a powerful tool for understanding the relationship of one variable to several others. BUT.....
• All the limitations to interpretation above apply, and are compounded by the existence of several ‘x-variables’.
• It is hard to draw graphs that show the relationships and the way data depart from them, so the analyst must rely more on numerical indicators of lack of fit, outliers, and influential points. Multiple regression analysis will not be successful if these are not understood.
• ‘Stepwise’ and similar variable selection techniques, so loved by social scientists, have little theoretical basis and can produce answers which are very poor. Regression modeling will be most successful if understanding of the underlying processes is used to choose possible models, rather than relying on computer algorithms.
• The sample size required for multiple regression analysis depends on the ‘configuration’ of the data (in particular the range of the x-variables and the correlations among them). The required sample size quickly becomes large as the number of x-variables increases. If regression analysis is part of the principal objectives of the survey, it might be possible to select the sample in a way that makes the analysis more efficient.

[Plot of raw residuals against HHTYPE2]
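A sketch of how a second explanatory variable might be added, here using the statsmodels formula interface; treating HHTYPE2 as a categorical term, and the file name, are assumptions about how the data might be coded, and the variables would be chosen on substantive grounds rather than by an automatic search.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")    # hypothetical export of the survey data

# Model chosen from understanding of the process, not by stepwise selection
model = smf.ols("HVTOT94A ~ LANDAREA + C(HHTYPE2)", data=df).fit()
print(model.summary())     # coefficients, standard errors, R squared, diagnostics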
Interpretation
Interpret results. This does not mean ‘understand which effects are significant’ but
‘understand and communicate what you now know about the problem’. You should be
able to:

• Meet the objectives of the study.
• Clearly state what is the substantive new knowledge which has been generated.
• Show how this new information and understanding builds on what was there before. Does it:
   o add more examples of something previously known?
   o mean that general rules or principles can be stated with more confidence?
   o allow predictions to be made for new and important situations?
   o mean that current understanding or theory has to be substantially modified?
• Use the quantitative information you have generated to make quantitative predictions about the larger picture.
• The ultimate goal of the research is a development objective. Explain how your results help you towards that objective, and what the next steps will be.
• Your survey and its analysis cost thousands of dollars. Explain why this was a good investment.
• Answer the ‘So what?’ question. What can we now do which we could not do before you did your survey?
References
Coe R (2002) Steps in Survey Analysis. Nairobi: ICRAF. 15 pp.
SSC (2001) Approaches to Analysis of Survey Data. Reading: Statistical Services Centre. 28 pp.