Developing Literacy in Quantitative Methods

DEVELOPING LITERACY IN
QUANTITATIVE
RESEARCH METHODS
Dr Christina Hughes
University of Warwick
C.L.Hughes@warwick.ac.uk
These materials have two inter-related aims. The primary aim is
to develop students' literacy in the use and reading of research
that uses quantitative data. The second is to enhance students'
confidence in their understandings of such approaches. To
achieve these aims the package will introduce students to a
number of basic statistical techniques that are used in social
research. In addition the materials will explore some common
concepts that underpin quantitative social research.
The specific objectives are:
· To develop understandings of the relationship between
different types of quantitative data and their implications for
descriptive and inferential statistical techniques;
· To develop understandings of the statistical techniques of:
measures of central tendency, measures of dispersion;
· To explore the meanings of correlation and causality in
relation to quantitative social research;
· To explore uses, and misuses, of official statistics.
Quantitative techniques are most commonly associated with
survey and experimental research designs. As the name
suggests, quantitative research is concerned with the collection
and analysis of data in numeric form. It tends to emphasize
relatively large-scale and representative sets of data, and is often
(problematically) presented or perceived as being about the
gathering of `facts'. Because of strong associations that are made
between statistics as social facts and dominant ideas of science
as objective and detached, quantitative strategies are often
viewed as more valid.
Many small-scale research studies that use questionnaires as a
form of data collection will not need to go beyond the use of
descriptive statistics and the exploration of the interrelationships
between pairs of variables. It will be adequate to say that so
many respondents (either the number or the proportion of the
total) answered given questions in a certain way; and that the
answers given to particular questions appear to be related. Such
an analysis will make wide use of proportions and percentages,
and of the various measures of central tendency (averages) and
of dispersion (ranges).
You may, however, wish or need to go beyond this level of
analysis, and make use of inferential statistics or multivariate
methods of analysis. There are dozens of inferential statistics
available: three commonly used examples are Chi-square,
Kolmogorov-Smirnov and Student's t-test. The functions of
these statistics vary but they are typically used to compare the
measurements you have collected from your sample for a
particular variable with another sample or a population in order
that a judgement may be made on how similar or dissimilar they
are. It is important to note that all of these inferential statistics
make certain assumptions about both the nature of your data and
how it was collected. This means that you have to be clear
whether your data is, for example, nominal, ordinal, interval or
ratio. If these assumptions do not hold these measures should
not be used.
Multivariate methods of analysis may be used to explore the
interrelationships among three or more variables
simultaneously. Commonly used examples include multiple
regression, cluster analysis and factor analysis. While you do
not need to have an extensive mathematical knowledge to apply
these techniques, as they are all available as part of computer
software packages, you should at least have an understanding of
their principles and purposes.
One key point to be aware of when carrying out quantitative
analysis is the question of causality. One of the purposes of
analysis is to seek explanation and understanding. We would
like to be able to say that something is so because of something
else. However, just because two variables of which you have
measurements appear to be related, this does not mean that they
are. Statistical associations between two variables may be a
matter of chance, or due to the effect of some third variable. In
order to demonstrate causality, you also have to find, or at least
suggest, a mechanism linking the variables together.
[Extracted from Blaxter, Hughes and Tight, 1996]
Bibliography
This bibliography includes texts that are useful for students
new to quantitative techniques and those that are useful for
the more advanced. The asterisk (*) indicates those that are
introductory. The key publishers of methodology texts are
Sage, Routledge and Open University Press. If you wish to
extend your reading or keep up to date with developments you
should put your name on these publishers' catalogue mailing
lists. There are also a number of journals that are primarily
concerned with developments in methodology. These include:
The International Journal of Social Research Methodology
and Social Research Online (http://www.socresonline.org.uk).
In addition, secondary sources produced by the Office for
National Statistics for the Government Statistical Service can
be obtained from The Office for National Statistics, 1
Drummond Gate, London, SW1V 2QQ or through the
STATBASE on-line directory.
Black, T (1999) Doing Quantitative Research in the Social
Sciences: An Integrated Approach to Research Design,
Measurement and Statistics, London, Sage
Blaxter, L, Hughes, C and Tight, M (1996) How to Research,
Buckingham, Open University Press*
Bowling, A (1997) Research Methods in Health:
Investigating Health and Health Services, Buckingham, Open
University Press*
Bryman, A and Cramer, D (1990) Quantitative Data Analysis
for Social Scientists, London, Routledge
Calder, J (1996) Statistical Techniques, in R Sapsford and V
Jupp (Eds) Data Collection and Analysis, London, Sage, pp
225-261
Cramer, D (1994) Introducing Statistics for Social Research:
Step-by-step calculations and computer techniques using
SPSS, London, Routledge
Denscombe, M (1998) The Good Research Guide: For small
scale social research projects, Buckingham, Open University
Press*
De Vaus, D (1991) Surveys in Social Research, Sydney, NSW,
Allen and Unwin
Hek, G, Judd, M and Moule, P (1996) Making Sense of
Research: An Introduction for Nurses, London, Cassell*
Hinton, P (1995) Statistics Explained: A guide for social
science students, London, Routledge*
Leary, M (1991) Introduction to Behavioural Research
Methods, Belmont, Calif, Wadsworth Publishing
Levitas, R and Guy, W (1996) Interpreting Official Statistics,
London, Routledge
Persell, C and Maisel, R (1995) How Sampling Works,
Newbury Park, Calif, Pine Forge
Pilcher, D (1990) Data Analysis for the Helping Professions:
A Practical Guide, Newbury Park, Calif, Sage
Sapsford, R (1996) Extracting and Presenting Statistics, in R
Sapsford and V Jupp (Eds) Data Collection and Analysis,
London, Sage, pp 184-224
Solomon, R and Winch, C (1994) Calculating and Computing
for Social Science and Arts Students, Buckingham, Open
University Press*
Stanley, L (Ed) (1990) Feminist Praxis, London, Routledge
Townsend, P (1996) The Struggle for Independent Statistics on
Poverty, in R Levitas and W Guy (Eds) Interpreting Official
Statistics, London, Routledge, pp 26-44
Traub, R (1994) Reliability for the Social Sciences: Theory
and Application, Thousand Oaks, Calif, Sage
Wright, D (1997) Understanding Statistics: An introduction
for the social sciences, London, Sage*
TYPES OF QUANTITATIVE DATA
Nominal data
Nominal data come from counting things and placing them in a
category. They are the lowest level of quantitative data in the
sense that they allow little by way of statistical manipulation
compared with the other types. Typically there is a head count
of members of a particular category, such as female/male or
African Caribbean/South Asian. These categories are based
simply on names; there is no underlying order to the names.
Used for the following descriptive statistics: proportions,
percentages, ratios.
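As a rough illustration of the head counts, proportions and percentages described above, the following sketch tallies a single nominal variable. The responses are hypothetical, invented purely for the example, and the variable names are assumptions rather than part of the original materials.

```python
from collections import Counter

# Hypothetical answers to one nominal question (not real survey data).
responses = ["female", "male", "female", "female", "male", "female"]

counts = Counter(responses)      # head count for each category
total = sum(counts.values())     # total number of respondents

# Proportion and percentage of the total for each category.
proportions = {category: n / total for category, n in counts.items()}
percentages = {category: 100 * p for category, p in proportions.items()}

for category in counts:
    print(f"{category}: n={counts[category]}, {percentages[category]:.1f}%")
```

Note that the categories are only counted; nothing in the code ranks them or averages them, which is all that nominal data permit.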
Ordinal data
Like nominal data, ordinal data are based on counts of things
assigned to specific categories but in this case the categories
stand in some clear, ordered, ranked relationship. The categories
are `in order'. This means that the data in each category can be
compared with the data in the other categories as being higher or
lower than, more or less than, etc. those in other categories. The
most obvious examples of ordinal data come from the use of
questionnaires in which respondents are asked to respond to a
five-point Likert scale. It is worth stressing that rank order is all
that can be inferred. With ordinal data we do not know the
cause of the order or by how much they differ.
Used for the following descriptive statistics: proportions,
percentages, ratios.
Interval data
Interval data are like ordinal data but the categories are ranked
on a scale. This means that the `distance' between the categories
is a known factor and can be pulled into the analysis. The
researcher can not only deal with the data in terms of `more
than' or `less than' but also say how much more or how much
less. The ranking of the categories is proportionate and this
allows for direct contrast and comparison. Calendar years are
one example. This allows the researcher to use addition and
subtraction (but not multiplication and division) to contrast the
difference between various periods.
Used for the following descriptive statistics: measures of central
tendency (mode, median, mean)
Ratio data
Ratio data are like interval data except that the categories exist
on a scale which has a `true zero' or an absolute reference point.
When the categories concern things like incomes, distances and
weights they give rise to ratio data because the scales have a
zero point. Calendar years, in the previous example, do not exist
on such a scale because the year 0 does not denote the beginning
of all time and history. The important thing about the scale
having a true zero is that the researcher can compare and
contrast the data for each category in terms of ratios, using
multiplication and division, rather than being restricted to the
use of addition and subtraction as is the case with interval data.
Ratio data are the highest level of data in terms of how
amenable they are to mathematical manipulation.
Used for the following descriptive statistics: measures of central
tendency (mode, median, mean)
[adapted from Blaxter, Hughes and Tight, 1996 and Denscombe,
1998]
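The difference between interval and ratio data can be sketched with a little arithmetic. The years and incomes below are hypothetical values chosen only to make the point; they are not taken from the sources cited above.

```python
# Hypothetical values chosen only to illustrate the arithmetic.
year_a, year_b = 1000, 2000            # interval data: no true zero
income_a, income_b = 15_000, 30_000    # ratio data: zero income is a true zero

# Addition and subtraction are meaningful for both types of data.
year_gap = year_b - year_a             # the two years are 1000 years apart
income_gap = income_b - income_a       # the two incomes are 15,000 apart

# Multiplication and division are meaningful only on the ratio scale.
income_ratio = income_b / income_a     # 2.0: twice the income
# year_b / year_a also evaluates to 2.0, but "twice as late" means nothing,
# because the zero point of the calendar is arbitrary.

print(year_gap, income_gap, income_ratio)
```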
TYPES OF QUANTITATIVE DATA
EXAMPLES
Are the following nominal, ordinal, ratio or interval data?
· The income levels of social workers;
· The examination scores of members of this course;
· The sex of your research participants;
· The birth position of members of a family;
· Exam grades received at school;
· Number of exam passes;
· The temperatures of different geographical zones;
· The size of families in the UK;
· IQ scores;
Illustrative Issue
A Likert scale is written to convey equidistant points along an
axis:
*-----------------*-----------------*-----------------*-----------------*
Very              Fairly            Important         Not very          Not at all
Important         Important                           Important         Important
Are the meanings ascribed by research respondents similarly
equidistant?
Are such data interval or ordinal?
TYPES OF QUANTITATIVE DATA
A CAUTIONARY COMMENT
Very important 1
Fairly important 2
Not very important 3
Not at all important 4
The problem is that the `real' distance between the ratings
numbered 3 and 4 for a respondent may be much greater than
the distance they perceive between the items numbered 1 and 2.
The `real' distances between each of the ratings may also vary
from person to person. In theory, therefore, such data should be
treated as ordinal data. Most researchers take a pragmatic
approach, however, and continue with the practice of treating
ratings and psychological tests as interval data.
One way of dealing with data that are difficult to `type' correctly
is through the use of models. Scientists use models of weather
systems to study the relationships between different factors in
order to understand better what the contributory factors are. In
the same way, statisticians produce statistical models based on
their current understanding of the problem. When they do not
quite work as expected, they modify some of their assumptions.
If the assumption of an interval scale does not work, then
further analyses can be carried out on the assumption of an
ordinal scale. Over the years, reviews of the statistical evidence
suggested that the assumption of equality of equal intervals
within rating scales is justified. But where such assumptions are
made, there is always the possibility of misinterpretation of the
data. The important point is to be clear always that there are
different types of data, and that this will affect the type of
analyses that can be used on them. (Calder, 1996: 229)
MEASURES OF CENTRAL TENDENCY
OR MID-POINTS AND AVERAGES
There are three types of average and these are collectively called
`measures of central tendency'. These are the mean, the median
and the mode.
The mean (or arithmetic average)
This is the most common meaning of `average'. It takes every
value in the distribution into account. To calculate the mean:
1. Add together all the values for the category
2. Divide this total by the number of cases
· The mean cannot be used with nominal data. For example, you
cannot `average' names, sexes, nationalities and occupations.
· The mean is affected by extreme values, or outliers. Because
the mean includes all values the average can be pulled toward
the value of the outlier or toward the more extreme values.
· The mean can lead to strange descriptions, such as 2.4 person
households.
Example: Calculate the mean from the following:
1 4 7 11 12 17 17 47
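One way to check the example above is to follow the two steps in a few lines of Python. This is a minimal sketch; the variable names are assumptions made for illustration.

```python
values = [1, 4, 7, 11, 12, 17, 17, 47]

total = sum(values)             # step 1: add all the values together
mean = total / len(values)      # step 2: divide by the number of cases
print(total, len(values), mean)          # 116 8 14.5

# Leaving out the extreme value 47 shows how far one outlier pulls the mean.
trimmed = [v for v in values if v != 47]
mean_without_outlier = sum(trimmed) / len(trimmed)
print(round(mean_without_outlier, 2))    # 9.86
```

Seven of the eight values lie at or below 17, yet the mean is 14.5; without the outlier it falls to just under 10.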
The median or mid-point
The median is the middle value of the data once they are placed
in rank order. To calculate:
1. Place the values in ascending/descending rank order
2. Find the mid-point number
3. With even numbers of values the mid-point is half-way
between the two middle values
· The median can be used with ordinal data as well as interval
and ratio data.
· The median is not affected by extreme values or outliers.
· The median works well with a low number of values.
· The main disadvantage is that you can do no further
calculations with the median.
Example: Calculate the median from the following:
1 4 7 11 12 17 17 47
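The three steps for the median can be sketched as follows. The code is only an illustration of the procedure described above, with variable names chosen for the example.

```python
values = [1, 4, 7, 11, 12, 17, 17, 47]

ordered = sorted(values)        # step 1: place the values in rank order
n = len(ordered)
middle = n // 2

if n % 2 == 1:
    # step 2: with an odd number of values there is a single middle value
    median = ordered[middle]
else:
    # step 3: with an even number, take the point half-way between
    # the two middle values
    median = (ordered[middle - 1] + ordered[middle]) / 2

print(median)   # 11.5
```

Here the two middle values are 11 and 12, so the median is 11.5. Replacing 47 with any larger number would leave this result unchanged.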
The Mode
The mode is the value that is most common. To calculate:
1. Arrange the data in ascending/descending order;
2. Identify the value that occurs more frequently than any other.
· The mode can be used with nominal, ordinal, interval and ratio
data. It has the widest possible scope therefore.
· It is unaffected by outliers or extreme values.
· It does not allow any further mathematical calculations.
· There may not be any `most common' values or there may be
more than one.
Example: Calculate the mode of the following:
1 1 4 4 7 11 12 17 17 17 47
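A short sketch of the mode for the example above. Because there may be more than one most common value, the code collects every value that reaches the highest frequency; this design choice and the variable names are assumptions made for illustration.

```python
from collections import Counter

values = [1, 1, 4, 4, 7, 11, 12, 17, 17, 17, 47]

counts = Counter(values)            # how often each value occurs
highest = max(counts.values())      # the highest frequency found

# Keep every value that occurs that often, since there may be several modes.
modes = [value for value, n in counts.items() if n == highest]

print(modes, highest)   # [17] 3
```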
MEASURES OF DISPERSION
Given some of the problems in the accuracy of conveying
meaning with measures of central tendency, measures of
dispersion are an important adjunct in any description of the
data. Measures of dispersion are used to indicate how widely the
data is spread and how evenly the data is spread. In other words,
how far from the central point is the data dispersed?
There are three main measures of dispersion: the range, fractiles
and standard deviation.
The range
This is the simplest, and a very effective, way of describing the
spread of the data. To calculate the range:
· Subtract the minimum value in the distribution from the
maximum value.
Although effective, the range can still be affected by the value
of any outliers. In consequence it can give a misleading
impression of the spread of the data. This is why it is important
to include a note of the highest and lowest score in your written
presentation of data.
Example: Calculate the range from the following:
3 4 7 11 12 17 17 47
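The range for the example above can be checked as follows. The second calculation, which leaves out the highest score, is an added illustration of the effect of an outlier and is not part of the original exercise.

```python
values = [3, 4, 7, 11, 12, 17, 17, 47]

lowest, highest = min(values), max(values)
value_range = highest - lowest          # maximum minus minimum
print(lowest, highest, value_range)     # 3 47 44

# Leaving out the single highest score shows how much it stretches the range.
without_top = sorted(values)[:-1]
range_without_top = max(without_top) - min(without_top)
print(range_without_top)                # 14
```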
Fractiles
To take account of the spread of values across the whole range,
fractiles (eg quartiles/quarters, deciles/tenths,
percentiles/hundredths) are used. These divide the ranked data
into smaller groups, each containing an equal share of the
cases. Fractiles are used with median values. To calculate:
1. Subdivide the range into equal parts (eg quartiles, deciles,
percentiles)
2. Find the median (mid point) value;
3. Working from the median point divide your data into the
relevant fractiles.
Fractiles can eliminate the high and low values that affect
measures of central tendency. For example, by focusing on the
cases that fall in the second and third quartiles researchers
know that they are dealing with the half of the values that fall in
the middle. In addition it allows the comparison of values
between fractiles. For example, the top ten percent of earners
can be compared with the bottom.
Example: The following is income data of social workers.
Divide the data into quartiles. Find the median that occurs in
each quartile. Find the median that occurs between the second
and third quartile. How would you present this data? What
would you say about the validity of these data?
Income per annum (thousands):
15 16 17 21 22 27 27 47
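One possible reading of the exercise above is sketched below. It assumes that `quartile' means a quarter of the ranked cases, and that the median `between the second and third quartile' means the median of the middle half of the data; both are interpretations, not statements from the original. The code also assumes the number of cases divides evenly by four, as it does here.

```python
from statistics import median

incomes = sorted([15, 16, 17, 21, 22, 27, 27, 47])   # thousands per annum

# Assumption: the number of cases divides evenly by four (8 / 4 = 2).
size = len(incomes) // 4
quarters = [incomes[i * size:(i + 1) * size] for i in range(4)]

quarter_medians = [median(q) for q in quarters]

# The second and third quarters together hold the middle half of the cases.
middle_half = quarters[1] + quarters[2]
middle_half_median = median(middle_half)

print(quarters)             # [[15, 16], [17, 21], [22, 27], [27, 47]]
print(quarter_medians)      # [15.5, 19.0, 24.5, 37.0]
print(middle_half_median)   # 21.5
```

The top quarter has a median of 37, well above the others, because it contains the single income of 47.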
Standard Deviation (SD)
The standard deviation is used with the arithmetic mean. The
standard deviation uses all the values in the range to calculate
the spread of the data. It is a measure of the distance of the
scores from your mean. The larger the standard deviation the
more spread out the data are. To calculate:
1. Find the mean
2. Subtract the mean from all your values
3. Square all the results (to turn your minuses into pluses)
4. Add all these `squared numbers' together
5. Divide this by the number of your values minus one
6. Find the square root of this
· The standard deviation can be used for further statistical
analysis
· Because of this, the standard deviation is an immensely
important aspect of social research
· The standard deviation can only be used with interval and ratio
data. It is meaningless when used with nominal and even ordinal
data.
Exercise: Find the standard deviation of the following:
1 4 7 11 12 17 17 47
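The six steps listed above can be followed one at a time. This is a minimal sketch with variable names chosen for illustration; it divides by the number of values minus one, as step 5 specifies.

```python
from math import sqrt

values = [1, 4, 7, 11, 12, 17, 17, 47]

mean = sum(values) / len(values)                  # 1. find the mean
deviations = [v - mean for v in values]           # 2. subtract the mean from each value
squared = [d ** 2 for d in deviations]            # 3. square all the results
sum_of_squares = sum(squared)                     # 4. add the squared numbers together
variance = sum_of_squares / (len(values) - 1)     # 5. divide by number of values minus one
sd = sqrt(variance)                               # 6. find the square root

print(mean, sum_of_squares, round(sd, 2))   # 14.5 1436.0 14.32
```

Python's standard library function `statistics.stdev` uses the same minus-one divisor, so it can serve as a cross-check.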
CORRELATION
How closely are two variables connected? This question is
answered in statistical terms with correlation. For example, do
the students who spend the most time studying achieve the
highest marks? Do those who spend least time studying get the
lowest marks? These questions are asking us to compare two
variables: study time and examination performance. We are
asking to what extent is there a relationship between these two
variables.
If the answer was that those who spend most time studying
do achieve the highest marks we would say that there is a
positive correlation between the two variables. In other words
we would be saying that as the score increases on one variable it
also increases on the other variable. In addition, if those who
study least achieve the lowest marks, we would also say that
there is a positive correlation between the two variables.
However, if we found that the more time students spent studying
the lower their marks, this would be described as a negative
correlation. There is, for example, a negative correlation
between the variables of smoking and health. The more a person
smokes the less healthy that person is likely to be. If there is no
relation between two variables then we would say that the
variables are uncorrelated. For example, if the hypothesis was
that wearing jeans improved exam scores and the results
suggested that some students who wore jeans had high scores
and some who wore jeans had low scores, some students who
did not wear jeans had high scores and some who did not wear
jeans had low scores the results are likely to show no
correlation.
To examine correlation one first plots the scores on a scatter
diagram. This requires you to plot the scores of the two
variables along the axes of a graph and mark the results. If the
points cluster around a straight line there is a (linear)
correlation. The direction of the line indicates whether this is a
positive (upward-sloping) correlation or a negative
(downward-sloping) correlation.
The two most commonly used correlation statistics are
Spearman's rank correlation coefficient that works for ordinal
data and Pearson's product moment correlation coefficient that
works for interval and ratio data.
When reading statistical research you are likely to find the
following signs:
· +1 this equals a perfect positive correlation (as one variable
goes up so does the other)
· 0 this means there is no relationship between the variables
· -1 this equals a perfect negative correlation (as one variable
goes up the other goes down)
· In practice any correlation coefficient between 0.3 (weak) and
0.7 (strong) suggests a reasonable correlation.
Example: Do the following data indicate a correlation?
Student   Study Time   Examination Mark
1         40           58
2         43           73
3         18           56
4         10           47
5         25           58
6         33           54
7         27           45
8         17           32
9         30           68
10        47           69
(from Hinton, 1995)
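As a hedged sketch, Pearson's product moment correlation coefficient for the ten students above can be computed directly from its definition: the sum of the products of the paired deviations from the means, divided by the square root of the product of the two sums of squared deviations. The variable names are assumptions made for illustration.

```python
from math import sqrt

study_time = [40, 43, 18, 10, 25, 33, 27, 17, 30, 47]
exam_mark = [58, 73, 56, 47, 58, 54, 45, 32, 68, 69]

n = len(study_time)
mean_x = sum(study_time) / n
mean_y = sum(exam_mark) / n

# Sum of the products of paired deviations, and the sums of squared deviations.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(study_time, exam_mark))
sum_xx = sum((x - mean_x) ** 2 for x in study_time)
sum_yy = sum((y - mean_y) ** 2 for y in exam_mark)

r = sum_xy / sqrt(sum_xx * sum_yy)
print(round(r, 2))   # 0.72
```

A coefficient of about 0.72 points to a fairly strong positive correlation: students who studied longer tended to get higher marks. On its own it says nothing about why.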
CORRELATION AND CAUSATION
CORRELATION DOES NOT MEAN CAUSATION
If two things go together it is easy to assume that they are
causally related in some way. Is this the case? Even if the
thickness of a caterpillar's coat correlates closely with the
severity of the winter weather, can we conclude that caterpillars
cause bad weather?
Three criteria are required to infer causality in statistical
research:
· Covariation
· Directionality
· Elimination of extraneous variables
Covariation
To conclude that two variables are causally related they need to
covary or correlate. If one variable causes the other then
changes in the values of one variable should be associated with
changes in the values of the other. This is, of course, the
definition of correlation.
Directionality
To infer that two variables are causally related we must show
that the presumed cause precedes the presumed effect in time.
However in most correlational research both variables are
measured at the same time. There is therefore no way to
determine the direction of causality. Has X caused Y or Y
caused X?
Elimination of Extraneous variables
The third criterion for inferring causality is that all extraneous
factors that might influence the relationship between the two
variables are eliminated. Correlational research never satisfies
this requirement completely. Two variables may be correlated
not because they are causally related to one another but because
they are both related to a third variable. For example, does
loneliness cause depression? Maybe but a third variable - the
quality of a person's social network - may reduce both loneliness
and depression.
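The third-variable problem can be illustrated with a small simulation. Everything in this sketch is synthetic: the variables x, y and z, the sample size, the amount of random noise and the seed are all assumptions made for the example, and the numbers say nothing about real loneliness or depression. A third variable z drives both x and y, which have no direct effect on each other.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for two lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


random.seed(1)   # fixed seed so the sketch is repeatable

# Synthetic data: z influences both x and y; x and y never influence each other.
z = [random.gauss(0, 1) for _ in range(1000)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)   # clearly positive, despite no causal link

# Take out the part of each variable that comes from z, then correlate the rest.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = pearson_r(x_rest, y_rest)   # close to zero

print(round(r_xy, 2), round(r_rest, 2))
```

Once the influence of z is removed, the apparent relationship between x and y largely disappears. In real research the third variable is rarely known or measured so neatly, which is why correlational studies cannot fully satisfy this criterion.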
Example: Does smoking cause cancer?
There is a wealth of research that suggests a strong correlation
between smoking and cancer. Does smoking cause cancer?
[adapted from Leary, 1991]
USING OFFICIAL DATA SETS
There are a number of important, and useful, data sets collected
by government and which can be used for secondary analyses.
These include:
· Census of Employment
· Census of Population
· Labour Force Survey
· General Household Survey
· Family Expenditure Survey
The annual publication Social Trends is a useful source for
those who are seeking some simple statistics. Social Trends
compiles its analyses from these data sets. In addition, the
ESRC keeps data archives of both quantitative and qualitative
research that can be consulted.
Care should be taken in the use of statistics however. For
example, in a discussion of poverty statistics, Townsend notes
how successive governments in the UK have chosen to avoid
using the term `poverty'. As he further notes (1996: 26):
Statistics don't fall out of the skies. Like words - of which they
are of course an extension - they are constructed by human
beings influenced by culture and the predispositions and
governing ideas of the organisations and groups within which
people work. Statistical methodologies are not timeless
creations. They are the current expression of society's attempts
to interpret, represent and analyse information about economic
and social (and other) conditions. As the years pass they change
- not just because there may be technical advances but because
professional, cultural, political and technical conventions
change in terms of retreat as well as advance ... [Thus] Every
student of social science ... needs to be grounded in how
information about social conditions is acquired. Statistics form
a substantial part of such information. Acquiring information is
much more than looking up handbooks of statistics. We have to
become self-conscious about the process of selection.
Levitas and Guy (1996) contextualise these concerns in terms of
the following:
There are developments which may make official data more
easily accessible to academic experts [on-line access]. They do
not make data more easily available to the public in the interests
of informed political debate. Moreover, the (relative) ease of
conducting secondary analysis carries the danger of forgetting
that the concepts used in any research derive from the questions
and interests of its original intentions. The extent to which
secondary analysis can bend data sets to the service of
sometimes quite different agendas is necessarily limited. (p 3)
...The debates ... show that the insistence on the neutrality and
objectivity of facts still dominates discussion of official statistics
and their production. The presentation of statistics in particular
ways for political ends, and the abolition of inconvenient
measures, continue. It is understandable that professional
statisticians should try to counter this by appeals to objectivity.
But it is also abundantly clear that the definitions used in
official statistics still produce measures which embody the
interests of the state rather than of citizens. It is therefore only
with the utmost care that such data can be interpreted for
democratic purposes. (p 6)
The edited text by Levitas and Guy (1996) outlines the kinds of
data sets that are available. It also contains discussions of the
use, and misuse, of government statistics in the following areas:
poverty, unemployment, social class, health, safety at work,
working women, ethnicity, disability and crime. Another useful
text is that of Stanley L (Ed) (1990) Feminist Praxis, London,
Routledge. Amongst the range of issues discussed, this contains
discussions on the ways in which statistics collected on the
homeless are `compromised' by the processes of turning raw
data into statistical information. A chapter by Liz Stanley (A
Referral Was Made) discusses how the politics of objectivity
influences the presentation of a social services case.
USING OFFICIAL DATA SETS
EXERCISE
1. How would you interpret the following statement?
"Statistics on patterns of household disposable income are
provided in Households below Average Income reports ... The
best response to low household income is to sustain economic
recovery and to assist those in greatest need" (Reported in
Townsend, 1996: 27-28)
2. How would you interpret the following conversation?
Ms Corston: Is the Prime Minister aware that Social Trends
1994, a Government publication, reveals that as a direct
consequence of Tory Government policy since 1979 the average
disposable income of the richest 20 per cent of households has
increased by £6,000 a year while the 20 per cent of households
at the bottom of the income scale have had their average
disposable income cut by £3,000 a year? Does that reveal the
hypocrisy of the Prime Minister's professed commitment to
creating a nation at ease with itself?
The Prime Minister: The hon. Lady [Ms Corston] was being
selective in what she said - [Interruption]. She was selective
from the report. The net disposable income of people at all
ranges of income has increased and the proportion of total tax
paid by those on top incomes has increased, not been reduced.
(Reported in Townsend, 1996: 40)