Descriptive statistics - University of Warwick


DTC Quantitative Research Methods

(IM911)

Descriptive Statistics

Thursday 21st January 2016

Some relevant online course extracts

• Cramer (1998) Chapter 2: measurement and univariate analysis.

• Diamond and Jefferies (2001) Chapter 5: measures and displays of spread.

• Sarantakos (2007) Chapter 5: graphical displays.

• Huizingh (2007) Chapter 12: SPSS material.

Some basic terminology

• Quantitative measures are typically referred to as variables.

• Some variables are generated directly via the data generation process, but other, derived variables may be constructed from the original set of variables later on.

• As the next slide indicates, variables are frequently referred to in more specific ways.

Cause(s) and effect…?

• Often, one variable (and occasionally more than one variable) is viewed as being the dependent variable.

• Variables which are viewed as impacting upon this variable, or outcome, are often referred to as independent variables.

• However, for some forms of statistical analyses, independent variables are referred to in more specific ways (as can be seen within the menus of SPSS for Windows).

Levels of measurement

(Types of quantitative data)

• A nominal variable relates to a set of categories such as ethnic groups or political parties which is not ordered.

• An ordinal variable relates to a set of categories in which the categories are ordered, such as social classes or levels of educational qualification.

• An interval-level variable relates to a ‘scale’ measure, such as age or income, that can be subjected to mathematical operations such as averaging.

How many variables?

• The starting point for statistical analyses is typically an examination of the distributions of values for the variables of interest. Such examinations of variables one at a time are a form of univariate analysis.

• Once a researcher moves on to looking at relationships between pairs of variables she or he is engaging in bivariate analyses.

• … and if they attempt to explain why two variables are related with reference to another variable or variables they have moved on to a form of multivariate analysis.

Looking at categorical variables

• For nominal/ordinal variables this largely means looking at the frequencies of each category, often pictorially using, say, bar-charts or pie-charts.

• It is usually easier to get a sense of the relative importance of the various categories if one converts the frequencies into percentages!
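As a quick sketch (in Python rather than SPSS, which this module itself uses), frequencies and percentages for a categorical variable can be computed directly from raw category codes; the category labels here are invented for illustration:

```python
from collections import Counter

# Hypothetical category codes for a nominal variable
responses = ["work", "pub", "work", "school", "pub", "other"]

counts = Counter(responses)                      # frequency of each category
n = sum(counts.values())                         # total number of cases
percentages = {cat: 100 * c / n for cat, c in counts.items()}
```

Converting counts to percentages in this way is exactly the step that makes the relative importance of the categories easier to see.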

Example of a frequency table

Place met marital or cohabiting partner      Frequency       %
At school, college or university                   872    12.4
At/through work                                   1405    19.9
In a pub/cafe/restaurant/bar/club                 2096    29.7
At a social event organised by friend(s)          1055    14.9
Other                                             1631    23.1
TOTAL                                             7059   100.0

Example of a pie-chart

[Pie chart of the same distribution, with segments: At school, college or university; At/through work; In a pub/cafe/restaurant/bar/club; At a social event organised by friend(s); Other]

What are percentages?

• It may seem self-evident, but percentages are a form of descriptive statistic.

• Specifically, they are useful in describing the distributions (of frequencies) for nominal or ordinal (i.e. categorical) variables.

• When we consider interval-level variables or more than one variable, we need (somewhat) more sophisticated descriptive statistics.

Descriptive statistics...

• ... are data summaries which provide an alternative to graphical representations of distributions of values (or relationships)

• ... aim to describe key aspects of distributions of values (or relationships)

• ... are of most relevance when we are thinking about interval-level variables (scales)

Description or inference?

Descriptive statistics summarise relevant features of a set of values.

Inferential statistics help researchers decide whether features of quantitative data from a sample can be safely concluded to be present in the population.

Generalizing from a sample to a population is part of the process of statistical inference.

• One objective may be to produce an estimate of the proportion of people in the population with a particular characteristic, i.e. a process of estimation.

Types of (univariate) descriptive statistics

Measures of ...

• ... location (averages)

• ... spread

• ... skewness (asymmetry)

• ... kurtosis

• We typically want to know about the first two, sometimes about the third, and rarely about the fourth!

What is ‘kurtosis’ anyway?

• Increasing kurtosis is associated with the “movement of probability mass from the shoulders of a distribution into its center and tails.” (Balanda, K.P. and MacGillivray, H.L. 1988. ‘Kurtosis: A Critical Review’, The American Statistician 42:2: 111–119.)

• Below, kurtosis increases from left to right...

Visualising ‘scale’ variables

• For interval-level data the appropriate visual summary of a distribution is a histogram; examining this allows the researcher to assess whether it is reasonable to assume that the quantity of interest has a particular distributional shape (and whether it exhibits skewness).

• Unlike bar charts, distances along the ‘horizontal’ dimension of a histogram have a well-defined, consistent meaning: i.e. they represent differences between values on the interval-level scale in question.

Example of a histogram

Measures of location

• Mean (the arithmetic average of the values, i.e. the result of dividing the sum of the values by the total number of cases)

• Median (the middle value, when the values are ranked/ordered)

• Mode (the most common value)
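The three averages can be sketched with Python's standard library, using the seminar-attendance data from the worked example later in these slides:

```python
import statistics

# Number of seminars attended by a sample of ten undergraduates
attendance = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

mean = statistics.mean(attendance)      # arithmetic average: sum / number of cases
median = statistics.median(attendance)  # middle value of the ordered data
mode = statistics.mode(attendance)      # most common value
```

Note that the three measures need not agree: for this distribution the mean is 6.1, the median 5.5 (the average of the two middle values) and the mode 4.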

... and measures of spread

• Standard deviation (and Variance)

(This is linked with the mean, as it is based on averaging [squared] deviations from it. The variance is simply the standard deviation squared).

• Interquartile range / Quartile deviation

(These are linked with the median, as they are also based on the values placed in order).
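The median-based measures of spread can be sketched similarly; note that `statistics.quantiles` with `method="inclusive"` is only one of several quartile conventions, so other software may report slightly different values:

```python
import statistics

attendance = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

# Quartiles of the ordered values (inclusive method)
q1, q2, q3 = statistics.quantiles(attendance, n=4, method="inclusive")
iqr = q3 - q1                  # interquartile range
quartile_deviation = iqr / 2   # half the interquartile range
```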

Measures of location and spread: an example (household size)

West Midlands: Mean = 2.94, Median = 2, Mode = 2; s.d. = 1.93, skewness = 2.10, kurtosis = 5.54

London: Mean = 2.96, Median = 3, Mode = 2; s.d. = 1.58, skewness = 1.27, kurtosis = 2.24

Why is the standard deviation so important?

• The standard deviation (or, more precisely, the variance) is important because it introduces the idea of summarising variation in terms of summed, squared deviations.

• And it is also central to some of the statistical theory used in statistical testing/statistical inference...

An example of the calculation of a standard deviation

• Number of seminars attended by a sample of undergraduates:

5, 4, 4, 7, 9, 8, 9, 4, 6, 5

• Mean = 61/10 = 6.1

• Variance = ((5 – 6.1)² + (4 – 6.1)² + (4 – 6.1)² + (7 – 6.1)² + (9 – 6.1)² + (8 – 6.1)² + (9 – 6.1)² + (4 – 6.1)² + (6 – 6.1)² + (5 – 6.1)²)/(10 – 1) = 36.9/9 = 4.1

• Standard deviation = square root of variance = 2.025
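The slide's arithmetic can be checked in a few lines of Python; `statistics.variance` and `statistics.stdev` use the same n − 1 divisor as the calculation above:

```python
import statistics

attendance = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]
m = statistics.mean(attendance)  # 6.1

# Sum of squared deviations from the mean, divided by n - 1
variance = sum((x - m) ** 2 for x in attendance) / (len(attendance) - 1)
sd = variance ** 0.5             # standard deviation is the square root
```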

The Empire Median Strikes Back!

• Comparing descriptive statistics between groups can be done graphically in a rather nice way using a form of display called a ‘boxplot’.

• Boxplots are based on medians and quartiles rather than on the more commonly found mean and standard deviation.

Example of a boxplot

Moving on to bivariate ‘descriptive statistics’...

• These are referred to as ‘Measures of association’, as they quantify the (strength of the) association between two variables

• The most well-known of these is the (Pearson) correlation coefficient, often referred to as ‘the correlation coefficient’, or even ‘the correlation’

• This quantifies the closeness of the relationship between two interval-level variables (scales)

Positive and negative relationships

Positive or direct relationships

• If the points cluster around a line that runs from the lower left to upper right of the graph area, then the relationship between the two variables is positive or direct.

• An increase in the value of x is more likely to be associated with an increase in the value of y.

• The closer the points are to the line, the stronger the relationship.

Negative or inverse relationships

• If the points tend to cluster around a line that runs from the upper left to lower right of the graph, then the relationship between the two variables is negative or inverse.

• An increase in the value of x is more likely to be associated with a decrease in the value of y.

Working out the correlation coefficient

(Pearson’s r )

• Pearson’s r tells us how much one variable changes as the values of another change – their covariation.

• Covariation is measured by calculating the amount by which each value of X varies from the mean of X, and the amount by which each value of Y varies from the mean of Y, multiplying the differences together and finding the average (by dividing by n – 1):

Covariation = Σ(x – X̄)(y – Ȳ) / (n – 1)

• Variation is measured with the standard deviation, which measures the average variation of each variable from the mean for that variable.

• Pearson’s r is calculated by dividing the covariation by (SD of x) × (SD of y) in order to standardize it:

r = Σ(x – X̄)(y – Ȳ) / ((n – 1) sx sy)

Working out the correlation coefficient

(Pearson’s r )

• Because r is standardized it will always fall between +1 and -1.

• A correlation of either 1 or -1 means perfect association between the two variables.

• A correlation of 0 means that there is no association.

• Note: correlation does not mean causation. We can only investigate causation by reference to our theory. However (thinking about it the other way round), there is unlikely to be causation if there is no correlation.

A scatterplot of the values of two interval-level variables

Example of calculating a correlation coefficient

(corresponding to the last slide)

• X = 5, 4, 4, 7, 9, 8, 9, 4, 6, 5 Mean(X) = 6.1

• Y = 8, 7, 9, 7, 8, 8, 8, 5, 5, 6 Mean(Y) = 7.1

• (5 - 6.1)(8 – 7.1) + (4 – 6.1)(7 – 7.1) ... etc.

• -0.99 + 0.21 + ... = 7.9 (Covariation)

• S.D. (X) = 2.02 ; S.D. (Y) = 1.37

• (7.9 / 9) / (2.02 × 1.37) ≈ 0.316
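The same calculation can be written out as a small Python function so that each step of the formula is visible (covariation divided by the product of the two standard deviations; the function name is mine, not the slides'):

```python
def pearson_r(xs, ys):
    """Pearson correlation: covariation / ((n - 1) * sd_x * sd_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
    return (covariation / (n - 1)) / (sx * sy)

X = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]
Y = [8, 7, 9, 7, 8, 8, 8, 5, 5, 6]
```

Because r is standardized, a variable correlated with itself always gives exactly 1.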

Looking at the relationship between two categorical variables

If two variables are nominal or ordinal, i.e. categorical, we can look at the relationship between them in the form of a cross-tabulation, using percentages to summarize the pattern.

(Typically, if there is one variable that can be viewed as depending on the other, i.e. a dependent variable, and the categories of this variable make up the columns of the cross-tabulation, then it makes sense to have percentages that sum to 100% across each row; these are referred to as row percentages).

An example of a cross-tabulation

(from Jamieson et al., 2002 # )

‘When you and your current partner first decided to set up home or move in together, did you think of it as a permanent arrangement or something that you would try and then see how it worked?’

                              Both ‘permanent’   Both ‘try and see’   Different answers   TOTAL
Cohabiting without marriage   15 (48%)           4 (13%)              12 (39%)            31 (100%)
Cohabited and then married    16 (67%)           1 (4%)               7 (29%)             24 (100%)
Married without cohabiting    9 (100%)           0 (0%)               0 (0%)              9 (100%)

# Jamieson, L. et al. 2002. ‘Cohabitation and commitment: partnership plans of young men and women’, Sociological Review 50.3: 356–377.

Alternative forms of percentage

• In the following example, row percentages allow us to compare outcomes between the categories of an independent variable.

• However, we can also use column percentages to look at the composition of each category of the dependent variable.

• In addition, we can use total percentages to look at how the cases are distributed across combinations of the two variables.
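A Python sketch of all three kinds of percentage, using the counts from the mobility cross-tabulation that follows (rows are class origins, columns class destinations):

```python
# Observed counts: rows = class origin, columns = class destination
counts = [
    [730, 323, 189],     # Service origin
    [857, 1140, 1108],   # Intermediate origin
    [786, 1385, 2916],   # Working origin
]

n = sum(sum(row) for row in counts)  # grand total of cases
col_totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]

# Row %: each cell as a share of its row total (sums to 100 across a row)
row_pct = [[100 * c / sum(row) for c in row] for row in counts]
# Column %: each cell as a share of its column total (sums to 100 down a column)
col_pct = [[100 * row[j] / col_totals[j] for j in range(len(row))] for row in counts]
# Total %: each cell as a share of the grand total
total_pct = [[100 * c / n for c in row] for row in counts]
```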

Example Cross-tabulation II:

Row percentages

Class origin * Class destination Crosstabulation

                    Class destination
Class origin        Service          Intermediate     Working          Total
Service             730 (58.8%)      323 (26.0%)      189 (15.2%)      1242 (100.0%)
Intermediate        857 (27.6%)      1140 (36.7%)     1108 (35.7%)     3105 (100.0%)
Working             786 (15.5%)      1385 (27.2%)     2916 (57.3%)     5087 (100.0%)
Total               2373 (25.2%)     2848 (30.2%)     4213 (44.7%)     9434 (100.0%)

(Percentages are ‘% within Class origin’.)

Derived from: Goldthorpe, J.H. with Llewellyn, C. and Payne, C. (1987). Social Mobility and Class Structure in Modern Britain (2nd Edition). Oxford: Clarendon Press.

Example Cross-tabulation II:

Column percentages

Class origin * Class destination Crosstabulation

                    Class destination
Class origin        Service          Intermediate     Working          Total
Service             730 (30.8%)      323 (11.3%)      189 (4.5%)       1242 (13.2%)
Intermediate        857 (36.1%)      1140 (40.0%)     1108 (26.3%)     3105 (32.9%)
Working             786 (33.1%)      1385 (48.6%)     2916 (69.2%)     5087 (53.9%)
Total               2373 (100.0%)    2848 (100.0%)    4213 (100.0%)    9434 (100.0%)

(Percentages are ‘% within Class destination’.)

Example Cross-tabulation II:

Total percentages

Class origin * Class destination Crosstabulation

                    Class destination
Class origin        Service          Intermediate     Working          Total
Service             730 (7.7%)       323 (3.4%)       189 (2.0%)       1242 (13.2%)
Intermediate        857 (9.1%)       1140 (12.1%)     1108 (11.7%)     3105 (32.9%)
Working             786 (8.3%)       1385 (14.7%)     2916 (30.9%)     5087 (53.9%)
Total               2373 (25.2%)     2848 (30.2%)     4213 (44.7%)     9434 (100.0%)

(Percentages are ‘% of Total’.)

Percentages and Association

• It is possibly self-evident that the differences between the percentages in different rows (or columns) can collectively be viewed as measuring association

• In the case of a 2x2 cross-tabulation (i.e. one with two rows and two columns), the difference between the percentages is a measure of association for that cross-tabulation

• But there are other ways of quantifying the association in the cross-tabulation…

Odds ratios as a measure of association

• The patterns in the social mobility table examined in an earlier session can clearly be expressed as differences in percentages (e.g. the differences between the percentages of sons with fathers in classes I and VII who are themselves in classes I and VII).

• However, an alternative way of quantifying these class differences is to compare the odds of class I fathers having sons in class I as opposed to class VII with the odds of class VII fathers having sons in class I as opposed to class VII.

• The ratio of these two sets of odds is an odds ratio, which will have a value of close to 1.0 if the two sets of odds are similar, i.e. if there is little or no difference between the chances of being in classes I and VII for sons with fathers in classes I and VII respectively.

Odds Ratios vs. % Differences

An Example: Gender and Higher Education

Age 30-39       Degree        No Degree
Men             56 (13.0%)    374
Women           70 (13.8%)    438
% difference = -0.8%;  Odds ratio = (56/374)/(70/438) = 0.937

Age 40-49       Degree        No Degree
Men             56 (14.4%)    334
Women           38 (9.1%)     378
% difference = 5.3%;  Odds ratio = (56/334)/(38/378) = 1.668

Age 50-59       Degree        No Degree
Men             34 (9.9%)     308
Women           18 (5.2%)     329
% difference = 4.7%;  Odds ratio = (34/308)/(18/329) = 2.018
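The three odds ratios above can be reproduced with a few lines of Python (the helper function is mine, not the slides'):

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a/b) divided by (c/d)."""
    return (a / b) / (c / d)

# Degree / no-degree counts for men and women in each age band
or_30s = odds_ratio(56, 374, 70, 438)  # men's odds vs women's odds, age 30-39
or_40s = odds_ratio(56, 334, 38, 378)  # age 40-49
or_50s = odds_ratio(34, 308, 18, 329)  # age 50-59
```

A ratio close to 1 means the two groups' odds are similar; values further from 1 indicate greater inequality.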

Choice of measure can matter!

• The choice of differences between percentages versus odds ratios as a way of quantifying differences between groups can matter, as in the preceding example of the ‘effect’ of gender on the likelihood of having a degree, according to age.

• The % difference values of 4.7%, 5.3% and -0.8% suggest that inequality increased before it disappeared, whereas the odds ratios of 2.018, 1.668 and 0.937 suggest a small decrease in inequality before a larger decrease led to approximate equality!

• Evidently, there are competing ways of measuring association in a cross-tabulation. But neither differences between percentages nor odds ratios provide an overall summary of the association in a cross-tabulation…

Another measure of association

• If we need an overall measure of association for two cross-tabulated (categorical) variables, one standard possibility is Cramér’s V.

• Like the Pearson correlation coefficient, it has a maximum of 1, and 0 indicates no relationship, but it can only take on positive values, and makes no assumption of linearity.

• It is derived from a test statistic (inferential statistic), chi-square, which we will consider in a later session…

An example of Cramér’s V

Cramér’s V = 0.074
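Cramér's V can be sketched in pure Python from its chi-square definition, V = √(χ² / (n(k − 1))), where k is the smaller of the number of rows and columns; this is an illustrative implementation under that standard definition, not the slides' own code:

```python
def cramers_v(table):
    """Cramér's V for a table of observed counts (a list of rows)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(row[j] for row in table) for j in range(len(table[0]))]
    # Chi-square: sum over cells of (observed - expected)^2 / expected,
    # where expected assumes the two variables are independent
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (obs - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return (chi2 / (n * (k - 1))) ** 0.5
```

As the slides note, V runs from 0 (no association: every row has the same distribution) to 1 (perfect association), and is never negative.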

Other measures of association for cross-tabulations…

• In a literature review more than thirty years ago, Goodman and Kruskal identified several dozen of these: Goodman, L.A. and Kruskal, W.H. 1979. Measures of Association for Cross Classifications. New York: Springer-Verlag.

• … and I added one of my own, Tog, which measures inequality (in a particular way) where both variables are ordinal…

One of Tog’s (distant) relatives

What if one variable is a set of categories, and the other is a scale?

• The equivalent to comparing percentages in this instance is comparing means… but there may be quite a lot of these!

• So one possible overall measure of association used in this situation is eta-squared (η²).

• But this is a less familiar measure (at least to researchers in some social science disciplines!)
