
Chapter 13
Analysing quantitative data
Suggested solutions to questions and exercises
1. Describe what is meant by the following:
(a) a case
(b) a variable
(c) a value.
(a) A case is a unit of analysis. A completed questionnaire, containing the answers from one
individual, is a case. If you have a final sample size of 300, you have 300 cases.
(b) A variable is an individual piece of information (a question or part of a question).
(c) A value is the answer the respondent gives to a question.
2. Describe the process of transferring responses from a questionnaire to a data entry
package.
Data collected on a paper questionnaire are transferred or entered into an analysis package
either manually (responses keyed into the data entry program) or electronically
(responses read by a scanner or optical mark reader). On most questionnaires, opposite
each response or at the end of a question, there is a number. These numbers are codes that
represent the responses (or values) to the question (the variable). The code representing the
answer given by the respondent is recorded or entered as a number at a fixed place in a
data record sheet or grid. For the program to receive and understand data from
the questionnaire, the data must be in a regular, predictable format such as a grid. The
grid is made up of rows of cases and columns of variables: each case makes up a line or
row of data and the variables appear as columns of number codes. The purpose of data
entry is to convert the answers on the questionnaire into a ‘line of data’ that the analysis
program will accept and recognise.
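As a minimal sketch of what such a grid might look like, the following Python example builds a small data grid; the variable names and codes are hypothetical, not taken from any particular questionnaire.

```python
# Hypothetical data grid: one row per case, one column per variable,
# each cell holding the numeric code for the answer given.
import pandas as pd

data_grid = pd.DataFrame(
    [
        [1, 1, 3, 2],
        [2, 2, 4, 1],
        [3, 2, 2, 5],
    ],
    columns=["case_id", "q1_gender", "q2_age_band", "q3_area_type"],
)

print(data_grid)  # each printed row is one respondent's 'line of data'
```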
3. What is involved in data editing?
As the data are being entered on a case-by-case basis, they can be edited or cleaned to
ensure that they are free of errors and inconsistencies. Data editing can be carried out by
interviewers and field supervisors during fieldwork and by editors when the
questionnaires are returned from the field. With computer-aided data capture this process
is incorporated into the data capture program. During the editing process missing values,
out-of-range values and errors due to misrouting of questions are sorted out and the data
are checked for other inconsistencies.
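A sketch of how simple edit checks of this kind might be automated is shown below; the column name and the valid code range are hypothetical.

```python
# Hypothetical edit checks: flag out-of-range and missing codes for one question.
import numpy as np
import pandas as pd

data_grid = pd.DataFrame(
    {"case_id": [1, 2, 3, 4], "q3_area_type": [2, 7, np.nan, 5]}
)
valid_codes = [1, 2, 3, 4, 5]  # codes printed on the questionnaire for Q3

out_of_range = data_grid[
    data_grid["q3_area_type"].notna()
    & ~data_grid["q3_area_type"].isin(valid_codes)
]
missing = data_grid[data_grid["q3_area_type"].isna()]

print("Out-of-range cases:\n", out_of_range)
print("Cases with missing values:\n", missing)
```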
4. How are missing data handled?
If a response has been left blank it is known as a ‘missing value’. It is important to deal
with missing values so that they do not contaminate the dataset and mislead the
researcher or client. One way of dealing with the possibility of missing values is at the
questionnaire design stage and at interviewer training and briefing sessions. In a well-designed questionnaire there will be codes for ‘Don't know’ and ‘No answer’ or
‘Refused’. Interviewers should be briefed about how to handle such responses and how
to code them on the questionnaire. It is also possible to avoid missing values by checking
answers with respondents at the end of the interview or during quality control call-backs.
If missing values remain, a code (or codes) can be added to the data entry program that
will allow a missing value to be recorded. Typically a code is chosen with a value that is
out of range of the possible values for that variable. Imagine that for some reason a
respondent to the Life and Times survey did not answer, or the interviewer did not ask for
or record a response to Q3 ‘Would you describe the place where you live as …?’ The
values or response codes for this question range from 1 = ‘big city’ to 5 = ‘farm or home
in the country’; you could assign a missing value code of 9 for ‘No response’. If you
know in more detail why the information is missing – for instance, ‘doesn't apply’,
‘refused to answer’, or ‘don't know’ – and this is not already allowed for on the
questionnaire, you can give each of these a different missing value code: ‘doesn't apply’
could be 96; ‘refused to answer’ could be 97; ‘don't know’ could be 98; and ‘missing for
some other reason’ could be 99.
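A minimal sketch of how such missing value codes might be recoded at the analysis stage is shown below; the column name is hypothetical and the codes follow the example above.

```python
# Recode the out-of-range missing value codes (96-99) so they are treated as
# missing rather than as real answers; the column name is hypothetical.
import numpy as np
import pandas as pd

responses = pd.DataFrame({"q3_area_type": [1, 5, 99, 3, 97, 2]})

missing_codes = [96, 97, 98, 99]  # doesn't apply, refused, don't know, other
responses["q3_area_type"] = responses["q3_area_type"].replace(missing_codes, np.nan)

print(responses["q3_area_type"].isna().sum(), "missing values recorded")
```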
There are other ways of dealing with missing values. One extreme approach, known as
casewise deletion, is to remove from the dataset any case or questionnaire that contains
missing values. This approach, however, results in a reduction in sample size and may
lead to bias, as cases with missing values may differ from those with none. A less drastic
approach is pairwise deletion, in which only the cases with missing values on the variables
involved are excluded from the particular table or calculation. This too will affect the quality of the data, especially if
the sample size is relatively small, or if there is a large number of cases with missing
values. An alternative is to replace the missing value with a real value. There are two
ways of approaching this. You could calculate the mean value for the variable and use
that; or you could calculate an imputed value based on either the pattern of response to
other questions in the case (on that questionnaire) or the response of respondents with
similar profiles to the respondent with the missing value. Substituting a mean value
means that the mean of the values for the sample does not change. We are
assuming, however, that the respondent gave such a response when of course he or she
may have given a more extreme answer. If we substitute an imputed value, we are
making assumptions and risk introducing bias.
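The following sketch illustrates mean substitution on a hypothetical income variable; the figures are invented purely to show the calculation.

```python
# Mean substitution for a missing value on a hypothetical income variable; the
# mean is calculated from the cases that did answer.
import numpy as np
import pandas as pd

income = pd.Series([21000, 34000, np.nan, 28000, 45000])

mean_income = income.mean()              # NaN is ignored in the calculation
income_filled = income.fillna(mean_income)

print(income_filled)
```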
5. Why is it necessary to weight data? Describe how weighting is carried out.
Weighting is used to adjust sample data in order to make them more representative of the
target population on particular characteristics, including, for example, demographics and
product or service usage. The procedure involves adjusting the profile of the sample data
in order to bring it into line with the population profile, to ensure that the relative
importance of the characteristics within the dataset reflects that within the target
population. For example, say that in the usage and attitude survey, the final sample
comprises 60% women and 40% men. Census data tell us that the proportion should be
52% women and 48% men. To bring the sample data in line with the population profile
indicated by the Census data, we apply weights to the gender profile. The over-
represented group – the women – is down-weighted, and the under-represented group –
the men – is up-weighted. Multiplying the sample percentage by the weighting factor
will achieve the target population proportion. To calculate the weighting factor, divide
the population percentage by the sample percentage. Any weighting procedure used
should be clearly indicated and data tables should show unweighted and weighted data.
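The following sketch works through the weighting factors for the example above (52% / 60% for women and 48% / 40% for men).

```python
# Weighting factors for the worked example: divide the population percentage by
# the sample percentage for each group.
sample_pct = {"women": 60.0, "men": 40.0}      # achieved sample profile
population_pct = {"women": 52.0, "men": 48.0}  # Census profile

weights = {g: population_pct[g] / sample_pct[g] for g in sample_pct}
print(weights)  # women ~0.87 (down-weighted), men 1.2 (up-weighted)

# Multiplying the sample percentage by the factor recovers the population profile
print({g: round(sample_pct[g] * weights[g], 1) for g in sample_pct})
```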
6. What is a holecount and why is it useful?
A frequency count is a count of the number of times a value occurs in the dataset, the
number of respondents who gave a particular answer. For example, we want to know
how many people in the sample are very satisfied with the level of service provided by
Bank S. A frequency count – a count of the number of people who said they are very
satisfied with Bank S – tells us this. The ‘holecount’ is the set of these frequency counts for
each of the values of a variable; it may be the first data you see. It is useful to run a
holecount before preparing a detailed analysis or table specification as it gives an
overview of the dataset, allowing you to see the size of particular sub-samples, what
categories of responses might be grouped together, and what weighting might be
required. For example, say we have asked if respondents are users of a particular Internet
banking service. The holecount or frequency count will tell us how many users we have.
We can decide if it is feasible to isolate this group – to look at how the attitudes,
behaviour or opinions of Internet customers compare with those of non-Internet customers,
for example.
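A minimal sketch of a holecount for a hypothetical Internet banking question is shown below.

```python
# A holecount for a hypothetical Internet banking question (1 = Yes, 2 = No):
# a frequency count for each value of the variable.
import pandas as pd

survey = pd.DataFrame({"uses_internet_banking": [1, 2, 1, 1, 2, 1]})

holecount = survey["uses_internet_banking"].value_counts().sort_index()
print(holecount)  # how many respondents gave each answer
```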
7. What do descriptive statistics tell you? Why are they useful? Give examples.
Descriptive statistics are statistics that summarise a set of data or a distribution. Under
the heading of descriptive statistics come frequencies, proportions, percentages, measures
of central tendency (averages – the mean, the mode and the median) and measures of
variation or spread (the range, the interquartile range, the mean deviation, the standard
deviation).
These statistics are useful because they allow us to summarise a mass of numbers in a
small set of figures and yet still know a lot about the data they describe (and from
which they are derived).
For example, if we have collected data on annual household income from a sample of
1,000 people, we can use descriptive statistics to tell us the mean or average income
across the group, what the variation in income is within the group, what percentage of the
sample have a particular income and so on.
Or, for example, in comparing service A and service B, we know that the mean price paid
for A and B was the same at €79, but by calculating the standard deviation of the prices
paid for each service we find that it is greater for A – €22 compared to €14 for B. This tells us that
while the average prices are the same, the price of A is more variable than the price of B.
The next step might then be to check why this variation exists (what might explain it) – is
it due to a sub-group of service A providers charging more, or to one or two providers
charging a lot more?
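The following sketch shows the kind of calculation involved; the individual prices are invented so that both services have a mean of €79 but different standard deviations.

```python
# Invented prices for services A and B with the same mean (79) but different
# spreads, echoing the example above.
import numpy as np

prices_a = np.array([51, 65, 79, 93, 107])  # more variable
prices_b = np.array([61, 71, 79, 87, 97])   # tighter around the mean

for name, prices in (("A", prices_a), ("B", prices_b)):
    print(name, "mean:", prices.mean(), "std dev:", round(prices.std(ddof=1), 2))
```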
8. What are cross-tabulations and why are they useful?
Most quantitative data analysis involves inspecting data laid out in a grid or table format
known as a cross-tabulation. This is the most convenient way of reading the responses of
the sample and relevant groups of respondents within it. For example, we may need to
know what the total sample’s view of a product is, as well as what a particular type or
group of consumers thinks – whether the product appeals more to men or women, or to
different age groups or different geodemographic groups, for instance. Each table or
cross-tab sets out the answers to a question by the total sample and by particular groups
or sub-sets within the sample that are relevant to the aims of the research.
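A minimal sketch of a cross-tabulation of this kind, using hypothetical data, is shown below.

```python
# A small cross-tabulation: whether a hypothetical product appeals, broken down
# by gender, with a total-sample margin.
import pandas as pd

survey = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female", "male"],
    "appeals": ["yes", "yes", "no", "no", "yes", "yes"],
})

print(pd.crosstab(survey["appeals"], survey["gender"], margins=True))
```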
9. What is meant by ‘filtering’ the data?
A data table is usually based on those in the sample eligible to answer the question to
which it relates. Not all questions are asked of the total sample, however, and analysis
based on the total sample is not always relevant. Tables for questions asked of everyone
will be based on the total sample; tables for questions asked of only some respondents
will be based on the relevant sub-sample. For example, in a
survey of the use of e-commerce, we might ask all respondents whether or not their
organisation uses automated voice technology (Q7, say). Those who say ‘Yes’ are asked
a bank of questions (Q8a to Q8f) related to this; those who say ‘No’ are filtered out and
routed to the next relevant question (Q9). When the data tables are run it would be
misleading to base the tables that relate to these questions on the total sample if the
purpose of the table is to show the responses of users of the service. The tables should be
based on those who were eligible to answer the questions, in other words, those saying
‘Yes’ at Q7. The tables for Q8a to Q8f that relate to automated voice technology are said
to be based on those using automated voice technology (those saying ‘Yes’ at Q7). The
table that relates to Q7 is said to be based on the total sample. In designing tables, it is
important to think about what base is relevant to the aims of your analysis.
If you have a particularly large or unwieldy dataset and you do not need to look at
responses from the total sample, ‘filtering’ the data, excluding some types of respondents
or basing tables on the relevant sub-sample can make analysis more efficient and safer.
For example, your preliminary analysis of data from a usage and attitude survey in the
deodorants market involved an overview of the total sample. Your next objective is to
examine the women’s deodorant market. In the interests of efficiency and safety, it may
be worthwhile to have the tables re-run based on the sub-set of women only.
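The following sketch shows filtering applied to the automated voice technology example; the column names and codes are hypothetical.

```python
# Filtering: base the Q8 analysis only on respondents who said 'Yes' (code 1)
# at Q7. Column names and codes are hypothetical.
import pandas as pd

survey = pd.DataFrame({
    "q7_uses_voice_tech": [1, 2, 1, 2, 1],
    "q8a_satisfaction":   [4, None, 5, None, 3],
})

eligible = survey[survey["q7_uses_voice_tech"] == 1]  # the relevant sub-sample
print(eligible["q8a_satisfaction"].describe())        # users of the service only
```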
10. What do inferential statistics tell you? Describe how you select an appropriate test.
There is a battery of inferential statistical tests and procedures that allow us to determine
if the relationships between variables or the differences between means or proportions or
percentages are real or whether they are more likely to have occurred by chance. For
example, is the mean score among women on an attitude to the environment scale
significantly different from the mean score among men? Is the proportion of those who
buy product A significantly greater than the proportion who buy product B? The choice
of test will depend on the type of data and on the level of measurement.
These tests are necessary because most research uses samples rather than populations.
When we talk about our findings, we want to generalise – we want to talk about our
findings in terms of the population and not just the sample. We can do this with some
conviction if we know that our sample is truly representative of its population. With any
sample, however, there is a chance that it is not truly representative. As a result we
cannot be certain that the findings apply to the population. For example, we conduct a
series of opinion polls among a nationally-representative sample of voters of each
European Union member state. In our findings we may want to talk about how the
opinions of German voters compare to those of French voters. We want to know if the
two groups of voters really differ. We compare opinions on a range of issues. There are
some big differences and some small differences. Are these differences due to chance or
do they represent real differences in opinions? We use inferential statistical tests to tell
us if the differences are real rather than due to chance. But we cannot say this for certain.
The tests tell us what the probability is that the differences could have arisen by chance.
If there is a relatively low probability that the differences have arisen by chance, then we
can say that the differences we see between the samples of German voters and French
voters are statistically significant – real differences that are likely to exist in the
population and not just in the sample we have studied.
It is important to choose the correct test for the data, otherwise we risk either ending up
with a test result and a finding that is meaningless, or missing an interesting and useful
finding. First of all, we have to determine what it is we are testing for – a difference or
relationship. Next we check the level of measurement of the data involved; and finally
check whether the data are derived from one sample or two, and if two, whether the
samples are related or unrelated:

Type of analysis: testing for difference or for association/relationship?

If testing for difference: are the data categorical/non-metric or continuous/metric?
  If non-metric: from one sample or from two or more samples?
    One sample: chi-square and binomial tests
    Two related samples: sign test, Wilcoxon test and chi-square
    Two unrelated samples: Mann-Whitney U test and chi-square
    More than two unrelated samples: Kruskal-Wallis
  If metric: from one sample or from two or more samples?
    One sample: z test and t test
    Two or more unrelated samples: z test, t test and ANOVA
    Two related samples: paired t test

If testing for association: what is the level of measurement of the dependent variable (DV) and the independent variable (IV)?
  Both DV and IV categorical: measures of association such as chi-square and tau B (and others)
  DV continuous and IV categorical: ANOVA
  Both IV and DV continuous: regression and correlation
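As a minimal sketch of one of these tests, the following example runs an unrelated (independent) samples t test of the kind that would compare mean attitude scores for women and men; the scores are invented for illustration.

```python
# Unrelated (independent) samples t test on a metric variable, using scipy.
# The attitude scores below are invented for illustration.
from scipy import stats

scores_women = [4.2, 3.8, 4.5, 4.1, 3.9, 4.4]
scores_men = [3.6, 3.9, 3.4, 4.0, 3.5, 3.7]

t_stat, p_value = stats.ttest_ind(scores_women, scores_men)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p value (for example below 0.05) suggests the difference in means is
# unlikely to have arisen by chance alone.
```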