Thursday 21st January 2016
• Cramer (1998) Chapter 2:
- Measurement and univariate analysis.
• Diamond and Jefferies (2001) Chapter 5:
- Measures and displays of spread.
• Sarantakos (2007) Chapter 5:
- Graphical displays.
• Huizingh (2007) Chapter 12:
- SPSS material.
• Quantitative measures are typically referred to as variables.
• Some variables are generated directly via the data generation process, but other, derived
variables may be constructed from the original set of variables later on.
• As the next slide indicates, variables are frequently referred to in more specific ways.
• Often, one variable (and occasionally more than one variable) is viewed as being the dependent
variable.
• Variables which are viewed as impacting upon this variable, or outcome, are often referred to as
independent variables.
• However, for some forms of statistical analysis, independent variables are referred to in more specific ways (as can be seen within the menus of SPSS for Windows).
• A nominal variable relates to a set of categories such as ethnic groups or political parties which is not ordered.
• An ordinal variable relates to a set of categories in which the categories are ordered, such as social classes or levels of educational qualification.
• An interval-level variable relates to a ‘scale’ measure, such as age or income, that can be subjected to mathematical operations such as averaging.
• The starting point for statistical analyses is typically an examination of the distributions of values for the variables of interest. Such examinations of variables one at a time are a form of univariate analysis.
• Once a researcher moves on to looking at relationships between pairs of variables she or he is engaging in
bivariate analyses.
• … and if they attempt to explain why two variables are related with reference to another variable or variables they have moved on to a form of multivariate analysis.
• For nominal/ordinal variables this largely means looking at the frequencies of each category, often pictorially using, say, bar charts or pie charts.
• It is usually easier to get a sense of the relative importance of the various categories if one converts the frequencies into percentages!
Example of a frequency table

Place met marital or cohabiting partner      Frequency      %
At school, college or university                   872   12.4
At/through work                                   1405   19.9
In a pub/cafe/restaurant/bar/club                 2096   29.7
At a social event organised by friend(s)          1055   14.9
Other                                             1631   23.1
TOTAL                                             7059  100.0

[The same distribution shown as a pie chart]
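The percentage column in a frequency table like this can be reproduced in a few lines of code. The labels and counts below are taken from the table; the rest is a sketch:

```python
# Convert the frequencies from the table above into percentages of the total.
counts = {
    "At school, college or university": 872,
    "At/through work": 1405,
    "In a pub/cafe/restaurant/bar/club": 2096,
    "At a social event organised by friend(s)": 1055,
    "Other": 1631,
}
total = sum(counts.values())  # 7059
percentages = {place: round(100 * n / total, 1) for place, n in counts.items()}

for place, pct in percentages.items():
    print(f"{place}: {pct}%")
```

Converting to percentages makes the relative importance of the categories immediately comparable, whatever the sample size.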
• It may seem self-evident, but percentages are a form of descriptive statistic
• Specifically, they are useful in describing the distributions (of frequencies) for nominal or ordinal (i.e. categorical) variables
• When we consider interval-level variables or more than one variable, we need (somewhat) more sophisticated descriptive statistics
Descriptive statistics:
• ... are data summaries which provide an alternative to graphical representations of distributions of values (or relationships)
• ... aim to describe key aspects of distributions of values (or relationships)
• ... are of most relevance when we are thinking about interval-level variables (scales)
• Descriptive statistics summarise relevant features of a set of values.
• Inferential statistics help researchers decide whether features of quantitative data from a sample can be safely concluded to be present in the population.
• Generalizing from a sample to a population is part of the process of statistical inference
• One objective may be to produce an estimate of the proportion of people in the population with a particular characteristic, i.e. a process of estimation.
Types of (univariate) descriptive statistics
Measures of ...
• ... location (averages)
• ... spread
• ... skewness (asymmetry)
• ... kurtosis
• We typically want to know about the first two, sometimes about the third, and rarely about the fourth!
• Increasing kurtosis is associated with the “movement of probability mass from the shoulders of a distribution into its center and tails.” (Balanda, K.P. and MacGillivray, H.L. 1988. ‘Kurtosis: A Critical Review’, The American Statistician 42:2: 111–119.)
• [Figure: a series of distributions in which kurtosis increases from left to right]
• For interval-level data the appropriate visual summary of a distribution is a histogram; examining one allows the researcher to assess whether it is reasonable to assume that the quantity of interest has a particular distributional shape (and whether it exhibits skewness).
• Unlike bar charts, distances along the ‘horizontal’ dimension of a histogram have a well-defined, consistent meaning: i.e. they represent differences between values on the interval-level scale in question.
• Mean (the arithmetic average of the values, i.e. the result of dividing the sum of the values by the total number of cases)
• Median (the middle value, when the values are ranked/ordered)
• Mode (the most common value)
• Standard deviation (and Variance)
(This is linked with the mean, as it is based on averaging [squared] deviations from it. The variance is simply the standard deviation squared).
• Interquartile range / Quartile deviation
(These are linked with the median, as they are also based on the values placed in order).
Measures of location and spread: an example (household size)

West Midlands: Mean = 2.94, Median = 2, Mode = 2; s.d. = 1.93, skewness = 2.10, kurtosis = 5.54
London: Mean = 2.96, Median = 3, Mode = 2; s.d. = 1.58, skewness = 1.27, kurtosis = 2.24
• The standard deviation (or, more precisely, the variance) is important because it introduces the idea of summarising variation in terms of summed, squared deviations.
• And it is also central to some of the statistical theory used in statistical testing/statistical inference...
An example of the calculation of a standard deviation
• Number of seminars attended by a sample of undergraduates:
5, 4, 4, 7, 9, 8, 9, 4, 6, 5
• Mean = 61/10 = 6.1
• Variance = ((5 – 6.1)² + (4 – 6.1)² + (4 – 6.1)² + (7 – 6.1)² + (9 – 6.1)² + (8 – 6.1)² + (9 – 6.1)² + (4 – 6.1)² + (6 – 6.1)² + (5 – 6.1)²)/(10 – 1) = 36.9/9 = 4.1
• Standard deviation = Square root of variance = 2.025
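The calculation above can be checked in code; Python's statistics module uses the same n – 1 (sample) denominator as the slide:

```python
import statistics

# Number of seminars attended by a sample of undergraduates (from the slide).
seminars = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

mean = statistics.mean(seminars)          # 61/10 = 6.1
variance = statistics.variance(seminars)  # summed squared deviations / (n - 1) = 36.9/9
sd = statistics.stdev(seminars)           # square root of the variance

print(mean, variance, round(sd, 3))
```

Note that statistics.pvariance/pstdev would instead divide by n (the population formula) and give slightly smaller values.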
• Comparing descriptive statistics between groups can be done graphically in a rather nice way using a form of display called a ‘boxplot’.
• Boxplots are based on medians and quartiles rather than on the more commonly found mean and standard deviation.
• These are referred to as ‘Measures of association’, as they quantify the (strength of the) association between two variables
• The most well-known of these is the (Pearson) correlation coefficient, often referred to as ‘the correlation coefficient’, or even ‘the correlation’
• This quantifies the closeness of the relationship between two interval-level variables (scales)
Positive and negative relationships
Positive or direct relationships
• If the points cluster around a line that runs from the lower left to upper right of the graph area, then the relationship between the two variables is positive or direct.
• An increase in the value of x is more likely to be associated with an increase in the value of y.
• The closer the points are to the line, the stronger the relationship.
Negative or inverse relationships
• If the points tend to cluster around a line that runs from the upper left to lower right of the graph, then the relationship between the two variables is negative or inverse.
• An increase in the value of x is more likely to be associated with a decrease in the value of y.
Working out the correlation coefficient (Pearson’s r)
• Pearson’s r tells us how much one variable changes as the values of another change – their covariation.
• Covariation is measured by calculating the amount by which each value of X varies from the mean of X, and the amount by which each value of Y varies from the mean of Y, multiplying the differences together and finding the average (by dividing by n – 1):

  Covariation = Σ(x – x̄)(y – ȳ) / (n – 1)

• Variation is measured with the standard deviation. This measures the average variation of each variable from the mean for that variable.
• Pearson’s r is calculated by dividing the covariation by (SD of x) × (SD of y) in order to standardize it:

  r = Σ(x – x̄)(y – ȳ) / ((n – 1) s_x s_y)
• Because r is standardized it will always fall between +1 and -1.
• A correlation of either 1 or -1 means perfect association between the two variables.
• A correlation of 0 means that there is no (linear) association.
• Note: correlation does not mean causation . We can only investigate causation by reference to our theory. However (thinking about it the other way round) there is unlikely to be causation if there is not correlation.
An example of the calculation of Pearson’s r (corresponding to the last slide)
• X = 5, 4, 4, 7, 9, 8, 9, 4, 6, 5 Mean(X) = 6.1
• Y = 8, 7, 9, 7, 8, 8, 8, 5, 5, 6 Mean(Y) = 7.1
• (5 - 6.1)(8 – 7.1) + (4 – 6.1)(7 – 7.1) ... etc.
• -0.99 + 0.21 + ... = 7.9 (Covariation)
• S.D. (X) = 2.02 ; S.D. (Y) = 1.37
• (7.9 / 9) / (2.02 x 1.37) = 0.316
If two variables are nominal or ordinal, i.e. categorical, we can look at the relationship between them in the form of a cross-tabulation, using percentages to summarize the pattern.
(Typically, if there is one variable that can be viewed as depending on the other, i.e. a dependent variable, and the categories of this variable make up the columns of the crosstabulation, then it makes sense to have percentages that sum to 100% across each row; these are referred to as row percentages).
(from Jamieson et al., 2002 #)
‘When you and your current partner first decided to set up home or move in together, did you think of it as a permanent arrangement or something that you would try and then see how it worked?’

                               Both           Both             Different
                               ‘permanent’    ‘try and see’    answers      TOTAL
Cohabiting without marriage    15 (48%)       4 (13%)          12 (39%)     31 (100%)
Cohabited and then married     16 (67%)       1 (4%)           7 (29%)      24 (100%)
Married without cohabiting     9 (100%)       0 (0%)           0 (0%)       9 (100%)
# Jamieson, L. et al. 2002. ‘Cohabitation and commitment: partnership plans of young men and women’, Sociological Review 50.3: 356–377.
• In the following example, row percentages allow us to compare outcomes between the categories of an independent variable.
• However, we can also use column percentages to look at the composition of each category of the dependent variable.
• In addition, we can use total percentages to look at how the cases are distributed across combinations of the two variables.
Class origin * Class destination Crosstabulation

                                             Class destination
Class origin                          Service   Intermediate   Working     Total
Service        Count                      730            323       189      1242
               % within Class origin    58.8%          26.0%     15.2%    100.0%
Intermediate   Count                      857           1140      1108      3105
               % within Class origin    27.6%          36.7%     35.7%    100.0%
Working        Count                      786           1385      2916      5087
               % within Class origin    15.5%          27.2%     57.3%    100.0%
Total          Count                     2373           2848      4213      9434
               % within Class origin    25.2%          30.2%     44.7%    100.0%
Derived from: Goldthorpe, J.H. with Llewellyn, C. and Payne, C. (1987). Social Mobility
and Class Structure in Modern Britain (2nd Edition). Oxford: Clarendon Press.
Class origin * Class destination Crosstabulation

                                                  Class destination
Class origin                               Service   Intermediate   Working     Total
Service        Count                           730            323       189      1242
               % within Class destination    30.8%          11.3%      4.5%     13.2%
Intermediate   Count                           857           1140      1108      3105
               % within Class destination    36.1%          40.0%     26.3%     32.9%
Working        Count                           786           1385      2916      5087
               % within Class destination    33.1%          48.6%     69.2%     53.9%
Total          Count                          2373           2848      4213      9434
               % within Class destination   100.0%         100.0%    100.0%    100.0%
Class origin * Class destination Crosstabulation

                                  Class destination
Class origin               Service   Intermediate   Working     Total
Service        Count           730            323       189      1242
               % of Total     7.7%           3.4%      2.0%     13.2%
Intermediate   Count           857           1140      1108      3105
               % of Total     9.1%          12.1%     11.7%     32.9%
Working        Count           786           1385      2916      5087
               % of Total     8.3%          14.7%     30.9%     53.9%
Total          Count          2373           2848      4213      9434
               % of Total    25.2%          30.2%     44.7%    100.0%
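The three percentage bases shown in the tables above (row, column and total) can all be derived from the same raw counts; a sketch:

```python
# Counts from the Goldthorpe social mobility table
# (rows = class origin, columns = class destination: Service, Intermediate, Working).
counts = [
    [730, 323, 189],    # Service origin
    [857, 1140, 1108],  # Intermediate origin
    [786, 1385, 2916],  # Working origin
]
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)  # 9434

def pct(part, base):
    """Percentage, rounded to one decimal place as in the tables."""
    return round(100 * part / base, 1)

# The Service-origin/Service-destination cell (730) under each base:
print(pct(counts[0][0], row_totals[0]))   # row percentage
print(pct(counts[0][0], col_totals[0]))   # column percentage
print(pct(counts[0][0], grand_total))     # total percentage
```

The same count yields three different percentages depending on the base chosen, which is why the choice of base should follow the substantive question being asked.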
• It is possibly self-evident that the differences between the percentages in different rows (or columns) can collectively be viewed as measuring association
• In the case of a 2x2 cross-tabulation (i.e. one with two rows and two columns), the difference between the percentages is a measure of association for that cross-tabulation
• But there are other ways of quantifying the association in the cross-tabulation…
Odds ratios as a measure of association
• The patterns in the social mobility table examined in an earlier session can clearly be expressed as differences in percentages (e.g. the differences between the percentages of sons with fathers in classes I and VII who are themselves in classes I and VII).
• However, an alternative way of quantifying these class differences is to compare the odds of class I fathers having sons in class I as opposed to class VII with the odds of class VII fathers having sons in class I as opposed to class VII.
• The ratio of these two sets of odds is an odds ratio, which will have a value of close to 1.0 if the two sets of odds are similar, i.e. if there is little or no difference between the chances of being in classes I and VII for sons with fathers in classes I and VII respectively.
Odds Ratios vs. % Differences
An Example: Gender and Higher Education

Age 30-39     Degree        No Degree
Men           56 (13.0%)          374
Women         70 (13.8%)          438
% difference = -0.8%; Odds ratio = (56/374)/(70/438) = 0.937

Age 40-49     Degree        No Degree
Men           56 (14.4%)          334
Women         38 (9.1%)           378
% difference = 5.3%; Odds ratio = (56/334)/(38/378) = 1.668

Age 50-59     Degree        No Degree
Men           34 (9.9%)           308
Women         18 (5.2%)           329
% difference = 4.7%; Odds ratio = (34/308)/(18/329) = 2.018
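The odds ratios above follow directly from the cell counts; a minimal sketch:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a/b) divided by (c/d)."""
    return (a / b) / (c / d)

# (men with degree, men without, women with degree, women without),
# taken from the age-group tables above.
age_groups = {
    "30-39": (56, 374, 70, 438),
    "40-49": (56, 334, 38, 378),
    "50-59": (34, 308, 18, 329),
}
for ages, cells in age_groups.items():
    print(ages, round(odds_ratio(*cells), 3))
```

A ratio close to 1 indicates similar odds for men and women; values above 1 here indicate higher odds of a degree for men.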
• The choice of differences between percentages versus odds ratios as a way of quantifying differences between groups can matter, as in the preceding example of the ‘effect’ of gender on the likelihood of having a degree, according to age.
• The % difference values of 4.7%, 5.3% and -0.8% suggest that inequality increased before it disappeared, whereas the odds ratios of 2.018, 1.668 and 0.937 suggest a small decrease in inequality before a larger decrease led to approximate equality!
• Evidently, there are competing ways of measuring association in a cross-tabulation. But neither differences between percentages nor odds ratios provide an overall summary of the association in a cross-tabulation…
• If we need an overall measure of association for two cross-tabulated (categorical) variables, one standard possibility is Cramér’s V
• Like the Pearson correlation coefficient, it has a maximum of 1, and 0 indicates no relationship, but it can only take on positive values, and makes no assumption of linearity.
• It is derived from a test statistic (inferential statistic), chi-square, which we will consider in a later session…
An example of Cramér’s V
Cramér’s V = 0.074
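Since Cramér's V is derived from chi-square, it can be computed as V = √(χ² / (N × (min(rows, cols) − 1))). The sketch below uses a made-up 2×2 table, not the table behind the 0.074 above:

```python
import math

def cramers_v(table):
    """Cramér's V for a cross-tabulation, from first principles."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Chi-square: sum of (observed - expected)^2 / expected over all cells,
    # where expected = row total * column total / N.
    chi_sq = sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_totals)
        for obs, ct in zip(row, col_totals)
    )
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi_sq / (n * k))

# Hypothetical 2x2 table with a visible association
print(cramers_v([[30, 10], [10, 30]]))
```

For a 2×2 table, V reduces to the absolute value of the phi coefficient.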
• In a literature review more than thirty years ago, Goodman and Kruskal identified several dozen of these:
Goodman, L.A. and Kruskal, W.H. 1979. Measures of Association for Cross Classifications. New York: Springer-Verlag.
• … and I added one of my own, Tog, which measures inequality (in a particular way) where both variables are ordinal…
• The equivalent to comparing percentages in this instance is comparing means… but there may be quite a lot of these!
• So one possible overall measure of association used in this situation is eta-squared (η²)
• But this is a less familiar measure (at least to researchers in some social science disciplines!)