Methodology - Office for National Statistics


Methodology Note for the 2011 Area Classification for Output Areas

Updated April 2015

Office for National Statistics

Methodology note

Contents

Introduction

Overarching aim of a new 2011 Area Classification for Output Areas

Choice of variables

Data Preparation

Data Transformation

Data Standardisation

Final Variable Selection

Clustering

Identifying optimum dataset and cluster numbers

Creating a hierarchical classification using K-means

Methodology Overview

Annex A List of 60 final 2011 Census Variables


Methodology note on the 2011 Area Classification for Output Areas

Introduction

The aim of this paper is to outline the methodology used to produce the 2011 Area Classification for Output Areas (2011 OAC). The classification places each UK output area, as defined following the 2011 Census, into a group with those other output areas that are most similar in terms of census variables. This enables similar areas to be classified according to their particular combination of characteristics.

Overarching aim of a new 2011 Area Classification for Output Areas

Following a user engagement exercise in February 2012 on plans for an updated 2011 Area Classification for Output Areas, respondents clearly indicated that the existing three-tier hierarchy of supergroups, groups and subgroups should be kept as far as possible. The number of categories at each level did not need to match the 2001 Area Classification for Output Areas (2001 OAC) exactly, but the 2001 OAC numbers were recognised as a suitable ‘target’ to aim for with the 2011 OAC.

These principles have been followed with the creation of the 2011 OAC, though rather than repeat the exact same methodology as was used for creating the 2001 OAC, a fresh consideration has been given to the available statistical techniques which could potentially be used, and testing of these has been undertaken to help produce an ‘optimised’ 2011 OAC.

Choice of variables

The analysis was carried out using the 2011 Census Key Statistics and Quick Statistics tables, as published by the Office for National Statistics for England and Wales, National Records of Scotland for Scotland, and the Northern Ireland Statistics and Research Agency for Northern Ireland. Only variables that were consistent across the whole of the UK were considered for the 2011 OAC. An initial 167 socio-economic and demographic variables covered the main dimensions of the Census and for presentation purposes these have been defined as demographic structure; household composition; housing; socio-economic character; and employment. Strongly correlated variables were removed to avoid the duplication of particular factors. This allowed the minimum number of variables to be included so that the five main census domains were represented using the available data.


Data Preparation

Before any analysis of the initial 167 variables of the output area¹ data could be done, the data needed to be prepared to assist with the processes used in the selection of the final variables. For the data preparation, three different rate calculation techniques were considered:

1. Conversion of the data to raw percentages.

2. The calculation of index scores, in this context calculating the percentage of each variable’s count in relation to its denominator.

3. The calculation of mean differences, to identify variables with the greatest deviation away from ‘average’ characteristics.

With the calculation of percentages (method 1), a few variables, such as those relating to area and population density, could not be converted into percentages and were left unchanged.

Three initial datasets using the 167 variables were created, one from each of the three data preparation methods.
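As an illustration only, the percentage and mean-difference calculations might be sketched as follows. The function names and the counts are hypothetical, and the mean difference is taken here to be each area's deviation from the all-area average, which is one reading of method 3. The index-score method is omitted because this note does not give its exact formula.

```python
from statistics import mean

def to_percentages(counts, denominators):
    """Method 1: express each count as a percentage of its denominator."""
    return [100.0 * c / d if d else 0.0 for c, d in zip(counts, denominators)]

def mean_differences(values):
    """Method 3 (assumed reading): deviation of each area from the average."""
    average = mean(values)
    return [v - average for v in values]

# Hypothetical counts of persons aged 0-4 and of all persons in four output areas.
aged_0_4 = [12, 30, 8, 50]
all_persons = [300, 300, 200, 400]

pct = to_percentages(aged_0_4, all_persons)
diff = mean_differences(pct)
```

Variables such as population density, which are not counts against a denominator, would bypass `to_percentages` and be carried through unchanged.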

Data Transformation

The three datasets created from the data preparation were investigated to examine the extent to which outliers may exist within the data, and the data ranges. As a result of this investigation, three different transformation techniques were applied to reduce skew in the three datasets:

1. Log.

2. Box-Cox.

3. Inverse Hyperbolic Sine.

All three techniques perform a ‘normalisation’ function on the datasets. The impact of the three data transformation techniques was later tested with the final assignment of areas into categories after ‘clustering’.

Transforming the data to a log (logarithmic) scale, as was done for the 2001 OAC, greatly reduces the problem of very high value outliers, because differences between values at the extremities of the dataset are reduced by more than differences between the smaller, more typical values. The Box-Cox method is considered better than the log method for dealing with the different distributions which may occur between variables, whilst the Inverse Hyperbolic Sine method is considered better at dealing with datasets with a large number of zero values.
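The three transformations can be sketched in a few lines. This is a minimal illustration rather than the code used for the 2011 OAC. The `offset` added before taking logs and the `lam` parameter passed to the Box-Cox function are assumptions: this note does not state how zero values were handled for the log method or how the Box-Cox parameter was chosen.

```python
import math

def log_transform(x, offset=1.0):
    """Natural log; the offset (an assumption) keeps zero values defined."""
    return math.log(x + offset)

def box_cox(x, lam):
    """Box-Cox power transform for strictly positive x and a supplied lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def inverse_hyperbolic_sine(x):
    """asinh(x) = ln(x + sqrt(x^2 + 1)); defined at zero without an offset."""
    return math.asinh(x)

values = [0.0, 1.0, 10.0, 100.0]  # hypothetical percentages
logged = [log_transform(v) for v in values]
ihs = [inverse_hyperbolic_sine(v) for v in values]
```

The inverse hyperbolic sine behaves like a log for large values but maps zero to zero, which is why it suits variables with many zero values.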

Data Standardisation

All clustering techniques are based on the similarity or dissimilarity of the cases to be clustered. This is measured by constructing a distance matrix reflecting all the variables in the dataset for each case. It is clear that problems will occur if there are differing scales or magnitudes among the variables. In general, variables with larger values and greater variation will have more impact on the final similarity measure. It is necessary to ensure each variable is equally represented in the distance measure by standardising the data.

¹ In Northern Ireland the 2001 Output Areas have been merged to produce new 2011 'small areas'. For ease of readability, references to output areas in this note include Northern Ireland’s small areas.


Three methods of standardisation were considered:

1. Z-score standardisation.

2. Range standardisation.

3. Inter-decile range standardisation.

Z-score standardisation is the most common form of standardisation. It compares each value of a variable, Xi, to the variable mean, X̄, and the difference is then divided by the standard deviation. Z-score standardisation works well when the data are normally distributed, but this may not always be the case.

The range standardisation method was implemented in both the 1991 and 2001 classifications. It compares each value of a variable, Xi, to the minimum value, Xmin, and the difference is then divided by the distance between the minimum value, Xmin, and the maximum value, Xmax, of the variable. After the data have been range standardised, each variable has a range of 1, with a maximum value of 1 and a minimum value of 0. This method does not work well if the data contain extreme outliers.

The inter-decile range standardisation method is a variation of the range standardisation method that overcomes the problems associated with outliers. It compares each value of a variable, Xi, to the median, and the difference is then divided by the distance between the 90th percentile and the 10th percentile.
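The three standardisation methods described above can be written out as a short sketch. The function names are illustrative. Two choices are assumptions not specified in this note: the population standard deviation is used for the Z-score, and the percentiles use the "inclusive" interpolation of Python's `statistics.quantiles`.

```python
from statistics import mean, median, pstdev, quantiles

def z_score(values):
    """(Xi - mean) / standard deviation (population form assumed)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def range_standardise(values):
    """(Xi - Xmin) / (Xmax - Xmin); results lie between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def inter_decile_standardise(values):
    """(Xi - median) / (90th percentile - 10th percentile)."""
    deciles = quantiles(values, n=10, method="inclusive")  # nine cut points
    p10, p90 = deciles[0], deciles[-1]
    med = median(values)
    return [(v - med) / (p90 - p10) for v in values]

values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
z = z_score(values)
ranged = range_standardise(values)
idr = inter_decile_standardise(values)
```

Because the inter-decile version divides by the spread of the central 80 per cent of values, a single extreme outlier changes the scaling far less than it does under range standardisation.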

Final Variable Selection

Reducing the 167 initial variables to a final list of variables for clustering required multiple steps, with the overall aim of retaining only the variables most important for the 2011 OAC. By applying different combinations of data preparation, data transformation and data standardisation techniques, 27 unique datasets (3 x 3 x 3) were created and evaluated. From this evaluation a list of 60 final variables was chosen (see Annex A), and the data preparation, data transformation and data standardisation techniques were repeated on these to produce a further 27 datasets, all with the potential to be used for cluster analysis to produce the 2011 OAC.
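The 27 datasets correspond to every pairing of one preparation, one transformation and one standardisation method. A small sketch of that enumeration follows; the labels are simply the method names used in this note.

```python
from itertools import product

preparation = ["percentages", "index scores", "mean differences"]
transformation = ["log", "Box-Cox", "inverse hyperbolic sine"]
standardisation = ["z-score", "range", "inter-decile range"]

# Every (preparation, transformation, standardisation) triple: 3 x 3 x 3.
combinations = list(product(preparation, transformation, standardisation))
```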

These 27 datasets were then examined further and reduced to a smaller number to be assessed for their suitability for clustering.

Clustering

A number of different clustering techniques can be applied to group together areas with similar characteristics. Previous user engagement had identified a preference for a top-down hierarchical structure for the 2011 OAC, as had been used for the 2001 OAC. Although different methods were assessed, partitional clustering was chosen as the best way to ensure a cluster structure similar to that created for the 2001 OAC.

K-means, a simple non-parametric partitional clustering method, was chosen. It minimises the within-cluster sum of squares whilst maximising the between-cluster variability, and requires the number of clusters to be specified beforehand. It is an iterative relocation algorithm based on the sum of squares: the algorithm repeatedly moves a case from one cluster to another to improve the sum of squares within each cluster, assigning each case to the cluster where it brings the greatest improvement. When all cases have been processed, the algorithm moves to the next iteration. A stable solution is reached when no case moves during a complete iteration.
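A minimal sketch of k-means is given below for illustration. It uses the common batch (Lloyd's) variant, in which all cases are reassigned to their nearest centroid before the centroids are recomputed, rather than the case-by-case relocation described above. Both stop when a complete pass moves no case. The starting centroids and the sample points are hypothetical.

```python
import math

def k_means(points, initial_centroids, max_iter=100):
    """Batch k-means: assign each case to its nearest centroid, recompute
    the centroids, and stop when a complete pass changes no assignment."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # stable solution: no case moved
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

def within_cluster_ss(points, labels, centroids):
    """Total squared distance from each case to its cluster centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centres = k_means(points, [points[0], points[3]])
wss = within_cluster_ss(points, labels, centres)
```

The result depends on the starting centroids, so in practice the algorithm is usually run from several starting points and the solution with the smallest within-cluster sum of squares is kept.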

Identifying optimum dataset and cluster numbers

From the 27 datasets generated from the 60 final variables by different combinations of data preparation, data transformation and data standardisation, the four most suitable for clustering were identified. Three exclusion criteria were applied to reduce the number of datasets from 27 to 4. The following were removed:

1. Datasets that had skewness values above 1 or less than -1.

2. Datasets that led to the creation of very small clusters, accounting for a small percentage of the population.

3. Datasets that led to the creation of different clusters that were not considered sufficiently distinguishable.
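The first criterion can be illustrated with a short skewness check. This is a sketch under assumptions: this note does not say which skewness statistic was used or how it was summarised across the variables in a dataset, so the moment coefficient of skewness for a single variable is shown here, with hypothetical values.

```python
from statistics import mean

def skewness(values):
    """Moment coefficient of skewness: third central moment divided by
    the second central moment raised to the power 1.5 (population form)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def too_skewed(values, limit=1.0):
    """True when skewness is above +limit or below -limit."""
    return abs(skewness(values)) > limit

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
long_tail = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
```

Here `too_skewed(symmetric)` is false and `too_skewed(long_tail)` is true, so only the second variable would count against a dataset under this criterion.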

Creating a hierarchical classification using K-means

When the k-means algorithm is run on the dataset, n clusters are produced. The original dataset is then split into n separate datasets (representing the highest level of the hierarchy). The k-means algorithm is then run on each of these datasets separately to create the second level of the hierarchy. The second level is in turn separated into m datasets, and the k-means algorithm is run on each one to create the lowest level of the hierarchy.
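The top-down procedure can be sketched as a recursive function that clusters a dataset, splits it by cluster, and clusters each part again. This is an illustration under simplifying assumptions: the same number of clusters is used for every parent at a given level (the published classification does not do this), k-means is seeded with the first k cases, and the digit codes and sample points are hypothetical.

```python
import math

def k_means(points, k, max_iter=100):
    """Batch k-means seeded with the first k points (a simplification)."""
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels

def hierarchical_k_means(points, ks, prefix=""):
    """Return {point: code}. ks lists the cluster count for each level;
    each cluster is split off and re-clustered at the next level."""
    if not ks or len(points) < ks[0]:
        return {p: prefix for p in points}
    labels = k_means(points, ks[0])
    codes = {}
    for c in range(ks[0]):
        subset = [p for p, lab in zip(points, labels) if lab == c]
        codes.update(hierarchical_k_means(subset, ks[1:], prefix + str(c + 1)))
    return codes

points = [(0.0,), (100.0,), (1.0,), (101.0,), (10.0,), (110.0,), (11.0,), (111.0,)]
codes = hierarchical_k_means(points, ks=(2, 2))
```

Each code records the path through the hierarchy, so a case coded "12" sits in the second lower-level cluster of the first top-level cluster.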

With four datasets remaining to be evaluated, numerous outputs and cluster number permutations were tested using k-means, based on a preferred range of five to nine clusters for the Supergroup level, reflecting a hierarchical structure similar to the 2001 OAC. The final decision on both the dataset and the cluster numbers used to create the 2011 OAC hierarchy was based on both qualitative and quantitative assessments.

The dataset selected created the optimum clusters in terms of the numbers of groups, and their reflection of the population and geographical distribution characteristics. This dataset was created using the following combination of data preparation, data transformation and data standardisation methods:

Data Preparation – conversion of the data to raw percentages (method 1)

Data Transformation – inverse hyperbolic sine (method 3)

Data Standardisation – range standardisation (method 2)
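Chained together, the selected combination amounts to three steps applied to each variable. The sketch below is illustrative only: the function name and the counts are hypothetical, and variables that cannot be expressed as percentages would skip the first step.

```python
import math

def prepare_variable(counts, denominators):
    """Percentages, then inverse hyperbolic sine, then range standardisation."""
    percentages = [100.0 * c / d for c, d in zip(counts, denominators)]
    transformed = [math.asinh(p) for p in percentages]
    lo, hi = min(transformed), max(transformed)
    return [(t - lo) / (hi - lo) for t in transformed]

# Hypothetical counts for one variable across four output areas.
flats = [0, 10, 40, 120]
households = [100, 100, 160, 150]
scaled = prepare_variable(flats, households)
```

Every variable prepared this way lies between 0 and 1, so no single variable dominates the distance measure used in clustering.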

The resultant three-tier classification contains 8 Supergroups, 26 Groups and 76 Subgroups.

Finally, in addition to the codes allocated to each cluster of the 2011 OAC hierarchy, names were also created to aid the understanding of each cluster’s characteristics.


Methodology Overview

1. 2011 UK Census Data

2. Initial Census Variable Selection (167)

3. Data Preparation, Data Transformation and Data Standardisation Methods Applied to Produce Final Census Variables (60)

4. Rate Calculation, Data Transformation and Data Standardisation

5. K-means Clustering Technique Applied

6. Three-tiered 2011 Area Classification for Output Areas


Annex A List of 60 final 2011 Census Variables

Variable Number   Variable Description
1                 % Persons aged 0–4
2                 % Persons aged 5–14
3                 % Persons aged 25–44
4                 % Persons aged 45–64
5                 % Persons aged 65–89
6                 % Persons aged 90+
7                 Number of persons per hectare
8                 % Persons living in a communal establishment
9                 % Persons aged over 16 who are single
10                % Persons aged over 16 who are married or in a registered same-sex civil partnership
11                % Persons aged over 16 who are divorced or separated
12                % Persons who are white
13                % Persons who have mixed ethnicity or are from multiple ethnic groups
14                % Persons who are Asian/Asian British: Indian
15                % Persons who are Asian/Asian British: Pakistani
16                % Persons who are Asian/Asian British: Bangladeshi
17                % Persons who are Asian/Asian British: Chinese and Other
18                % Persons who are Black/African/Caribbean/Black British
19                % Persons who are Arab or from other ethnic groups
20                % Persons whose country of birth is the United Kingdom or Ireland
21                % Persons whose country of birth is in the old EU (pre 2004 accession countries)
22                % Persons whose country of birth is in the new EU (post 2004 accession countries)
23                % Persons whose main language is not English and who cannot speak English well or at all
24                % Households with no children
25                % Households with non-dependent children
26                % Households with full-time students
27                % Households who live in a detached house or bungalow
28                % Households who live in a semi-detached house or bungalow
29                % Households who live in a terrace or end-terrace house
30                % Households who live in a flat
31                % Households who own or have shared ownership of property
32                % Households who are social renting
33                % Households who are private renting
34                % Households who have at least one room fewer than required
35                Individuals' day-to-day activities limited a lot or a little (Standardised Illness Ratio)
36                % Persons providing unpaid care
37                % Persons aged over 16 whose highest level of qualification is Level 1, Level 2 or Apprenticeship
38                % Persons aged over 16 whose highest level of qualification is Level 3 qualifications
39                % Persons aged over 16 whose highest level of qualification is Level 4 qualifications and above
40                % Persons aged over 16 who are schoolchildren or full-time students
41                % Households with two or more cars or vans
42                % Persons aged 16–74 who use public transport to get to work
43                % Persons aged 16–74 who use private transport to get to work
44                % Persons aged 16–74 who walk, cycle or use an alternative method to get to work
45                % Persons aged 16–74 who are unemployed
46                % Employed persons aged 16–74 who work part-time
47                % Employed persons aged 16–74 who work full-time
48                % Employed persons aged 16–74 who work in the agriculture, forestry or fishing industries
49                % Employed persons aged 16–74 who work in the mining, quarrying or construction industries
50                % Employed persons aged 16–74 who work in the manufacturing industry
51                % Employed persons aged 16–74 who work in the energy, water or air conditioning supply industries
52                % Employed persons aged 16–74 who work in the wholesale and retail trade; repair of motor vehicles and motor cycles industries
53                % Employed persons aged 16–74 who work in the transport or storage industries
54                % Employed persons aged 16–74 who work in the accommodation or food service activities industries
55                % Employed persons aged 16–74 who work in the information and communication or professional, scientific and technical activities industries
56                % Employed persons aged 16–74 who work in the financial, insurance or real estate industries
57                % Employed persons aged 16–74 who work in the administrative or support service activities industries
58                % Employed persons aged 16–74 who work in the public administration or defence; compulsory social security industries
59                % Employed persons aged 16–74 who work in the education sector
60                % Employed persons aged 16–74 who work in the human health and social work activities industries
