Working with EU-SILC using the hierarchical data structure, matching & aggregating data
Practical computing session I – Part 2
Heike Wirth
GESIS – Leibniz Institut für Sozialwissenschaften
DwB-Training Course on EU-SILC, February 13-15, 2013
Romanian Social Data Archive at the Department of Sociology
University of Bucharest, Romania
Introduction
• EU-SILC data has a hierarchical structure
  • more than one level of analysis is possible
  • household & individual levels are represented by separate files
  • data are stored in multiple data files
Example of household level data
Example 1: Household record
#     Year of survey  Country  HH-ID  Dwelling type           Total disposable HHLD income  Ability to make ends meet
      HB010           HB020    HB030  HH010                   HY020                         HS120
1     2010            AT       1      apartment or flat in …  15,271                        with great difficulty
2     2010            AT       2      detached house          30,081                        fairly easily
…     …               …        …      …                       …                             …
1500  2010            RO       1      detached house          2,243                         fairly easily
1501  2010            RO       2      detached house          2,409                         with difficulty
…     …               …        …      …                       …                             …
1 observation = 1 household
Please note: HHLD-ID does not differentiate between countries
To be on the safe side use HHLD-ID with country & year of survey
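A quick way to see why the full key matters is to count records per key. The following SPSS sketch assumes that the household data file (H-File) is the active dataset; the variable names n_id_only and n_full_key are made up for this illustration and are not part of EU-SILC.

```spss
* Sketch: count household records per key (n_id_only and n_full_key are illustrative names).
* Assumes the H-File is the active dataset.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=HB030
  /n_id_only = N.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=HB010 HB020 HB030
  /n_full_key = N.
* n_full_key should be 1 for every record, n_id_only may be larger.
FREQUENCIES VARIABLES=n_id_only n_full_key.
```

If n_id_only exceeds 1 anywhere, the household ID alone is shared by several records (for example by households in different countries) and must not be used as the only key.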
Example of individual level data
Example 2: Individual data record
#      Year of survey  Country  HH-ID  Person-ID  Marital status  Gross monthly earnings  Highest ISCED level attained
       PB010           PB020    PX030  PB030      PB190           PY0200G                 PE040
1      2010            AT       1      11         married         3500                    (upper) secondary
2      2010            AT       1      12         married         1400                    lower secondary
3      2010            AT       1      13         never married   1450                    (upper) secondary
4      2010            AT       1      14         never married   2307                    lower secondary
…      …               …        …      …          …               …                       …
30001  2010            RO       1      11         married         1500                    (upper) secondary
30002  2010            RO       1      12         married         750                     lower secondary
30003  2010            RO       1      13         never married   250                     (upper) secondary
…      …               …        …      …          …               …                       …
1 observation = 1 person
Person-ID sequential within household
Working with this kind of data requires
• A decision on the appropriate unit of analysis for your research question, e.g.
  • research interest in households or persons?
  • % of households / persons / men / women / children who live in poverty?
  • % of households with only 1 person or % of persons who live alone?
    (the two differ: with three households of sizes 1, 1 and 4, 2 of 3 households are one-person households, but only 2 of 6 persons live alone)
• Knowledge of procedures for manipulating the data
Types of Matching
• One-to-one matching
  • Household Register to Household Data
  • Personal Register to Personal Data
• One-to-many matching
  • Household variables to Individual data
• Many-to-one matching (‘aggregation’)
  • e.g. adding information from the individual data to the household data
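EU-SILC – Types of matching
[Diagram]
Household Register File (D)  <- 1:1 ->  Household Data File (H)
Personal Register File (R)   <- 1:1 ->  Personal Data File (P)
Household files (D, H) to personal files (R, P): 1:n
Personal files (R, P) to household files (D, H): n:1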
Linking EU-SILC files (cross-sectional)
• Key variables provide links between the related records
  • between household files
  • between individual files
  • between household and individual files
• Key variables (depending on the files) are
  • household id (DB030; HB030; RX030; PX030)
  • personal id (RB030; PB030)
• To be on the safe side: always use the key variables together with
  • ‘year of survey’ (DB010; HB010; RB010; PB010) &
  • ‘country’ (DB020; HB020; RB020; PB020)
Example 1: one-to-one
• Attach household register information (D-File) to household data file (H-File)
• e.g. ‘Degree of urbanisation’ (DB100) is only included in the household register; it may be useful to have this information in the household data, too.
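One-to-One Match, e.g. household information
Household Register (separate file)
DB010  DB020  DB030  DB075  (…)  DB100
2010   AT     2      3      (…)  intermediate area
2010   AT     12     2      (…)  thinly populated area
2010   AT     13     3      (…)  thinly populated area
2010   AT     19     2      (…)  thinly populated area
2010   AT     26     3      (…)  thinly populated area
2010   AT     59     4      (…)  densely populated area
Household Data (separate file)
HB010  HB020  HB030  HS090               HS120                  (…)  HX060
2010   AT     2      no - cannot afford  with great difficulty  (…)  One person household
2010   AT     12     yes                 with difficulty        (…)  Other hhlds without dep. children
2010   AT     13     no - other reason   fairly easily          (…)  One person household
2010   AT     19     yes                 fairly easily          (…)  Other hhlds without dep. children
2010   AT     26     yes                 easily                 (…)  Other hhlds without dep. children
2010   AT     59     yes                 with some difficulty   (…)  One person household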
HS090 HS120 (…) HX060 no - cannot afford with great difficulty (…) yes with difficulty
One person household
(…) Other hhlds without dep. children no - other reason yes yes yes fairly easily fairly easily
(…) One person household
(…) Other hhlds without dep. children easily (…) Other hhlds without dep. children with some difficulty (…) One person household
Result: Combined Household File
Household Data (combined file)
HB010  HB020  HB030  HS090               HS120                  (…)  HX060                                        DB100
2010   AT     2      no - cannot afford  with great difficulty  (…)  One person household                         intermediate area
2010   AT     12     yes                 with difficulty        (…)  Other households without dependent children  thinly populated area
2010   AT     13     no - other reason   fairly easily          (…)  One person household                         thinly populated area
2010   AT     19     yes                 fairly easily          (…)  Other households without dependent children  thinly populated area
2010   AT     26     yes                 easily                 (…)  Other households without dependent children  thinly populated area
2010   AT     59     yes                 with some difficulty   (…)  One person household                         densely populated area
Example 2: one-to-many
• Attach household register information (D-File) to personal data file (P-File)
• Attach ‘Degree of urbanisation’ (again) to the personal data file
Attaching household data to personal data (1:n)
Household Register (separate file)
DB010  DB020  DB030  DB075  (…)  DB100
2010   AT     2      3      (…)  intermediate area
2010   AT     12     2      (…)  thinly populated area
(…)
2010   AT     26     3      (…)  thinly populated area
Personal Data (combined)
PB010  PB020  PX030  PB030  PH010  PH020  PH030            PX020  DB100
2010   AT     2      201    fair   yes    yes, limited     71     intermediate area
2010   AT     12     1201   fair   no     no, not limited  32     thinly populated area
2010   AT     12     1202   fair   yes    yes, limited     31     thinly populated area
2010   AT     12     1203   good   no     no, not limited  30     thinly populated area
2010   AT     12     1204   fair   no     no, not limited  26     thinly populated area
(…)
Example 3: many-to-one
• e.g. number of persons in a household who are
  • unemployed,
  • full-time employed
  • self-employed?
• such information is not included in the data
  => own computation
Matching: many-to-one (summarizing information)
Personal Data, with summarized variables
PB010  PB020  PX030  PB030  PL031                # unempl  # employed full time  # self employed
2010   AT     2      201    Unemployed (5)       1         0                     0
2010   AT     12     1201   Empl. full time (1)  0         2                     1
2010   AT     12     1202   Empl. full time (1)  0         2                     1
2010   AT     12     1203   Empl. part time (2)  0         2                     1
2010   AT     12     1204   Self-employed (3)    0         2                     1
(…)
Household Data (combined file)
HB010  HB020  HB030  # unempl  # employed  # self employed
2010   AT     2      1         0           0
2010   AT     12     0         2           1
2010   AT     26     ..
…
Hands on – matching 1:1
• Attach ‘Degree of Urbanisation’ (DB100) to the household data file (H-File)
• Open the EU-SILC training dataset – D-File
• Check the variables you are interested in
• Sort your data according to the key variables used for linkage
• Names of key variables in the files to be matched must be identical
=> Create new key variables (ID010, ID020, ID_HH) in such a way that
DB010 = ID010
DB020 = ID020
DB030 = ID_HH
• Create a new file with only the key variables & the variable(s) you are interested in
• Name the new file DB100.sav
SPSS–Matching: one-to-one
**** Before you start ************.
* specify the path where the EU-SILC training dataset is stored.
FILE HANDLE data_path / NAME='H:\wirth\DWB_TRAINING\SILC\DATA\'.
* specify the path where you want to save your data.
FILE HANDLE mydata_path /NAME='H:\wirth\DWB_TRAINING\SILC\EXERCISE_1\'.
* open the EU-SILC training dataset – D-File.
GET FILE='data_path/udb_c10d_silc_course.sav'.
* check the variables you are interested in.
crosstabs /tables=DB020 by DB100.
* Step 1 - sort your data according to the key variables used for linkage.
sort cases by DB010 DB020 DB030.
* Step 2 - names of key variables in the files to be matched must be identical.
rename variables (DB010 DB020 DB030 = ID010 ID020 ID_HH).
* create a new file with the key variables & the variable(s) you are interested in.
save outfile = 'mydata_path/DB100.sav'
  /keep=ID010 ID020 ID_HH DB100.
SPSS–Matching: one-to-one
* open the EU-SILC training dataset – H-File.
GET FILE='data_path/udb_c10H_silc_course.sav'.
sort cases by HB010 HB020 HB030.
* Key variables.
* either rename (like before) or, better, generate new variables.
STRING ID020 (A2).
compute ID010 = HB010.
compute ID020 = HB020.
compute ID_HH = HB030.
MATCH FILES FILE= *
  /file ='mydata_path/DB100.sav'
  /BY ID010 ID020 ID_HH.
execute.
* check whether it worked.
crosstabs /tables=HB020 by DB100.
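A cross-tabulation shows whether DB100 arrived, but not whether every household found a partner record. One possible variant of the match step, sketched here under the assumption that the H-File with ID010, ID020 and ID_HH is the active dataset and sorted by these keys, adds source flags via the /IN subcommand. The flag names in_h and in_d are illustrative only.

```spss
* Sketch of the 1:1 match with source flags (in_h and in_d are illustrative names).
* Assumes the active dataset is the H-File, sorted by ID010 ID020 ID_HH.
* Run it in place of the plain MATCH FILES step, not in addition to it.
MATCH FILES /FILE=* /IN=in_h
  /FILE='mydata_path/DB100.sav' /IN=in_d
  /BY ID010 ID020 ID_HH.
EXECUTE.
* Records with in_h = 1 and in_d = 0 found no partner in the D-File extract.
CROSSTABS /TABLES=in_h BY in_d.
```

In a clean one-to-one match all records should fall into the cell in_h = 1, in_d = 1; any other cell points to households that exist in only one of the two files.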
SPSS–Matching: One-to-many Match (1:n)
Example 2: Combining household and personal data
e.g. attach ‘Degree of Urbanisation’ (DB100) to the personal data.
GET FILE='data_path/udb_c10p_silc_course.sav'.
* Sort by the key variables used for linkage.
sort cases by PB010 PB020 PX030.
* PB020 is a string variable: create a new string variable ID020 (or use the rename command).
STRING ID020 (A2).
compute ID010 = PB010.
compute ID020 = PB020.
compute ID_HH = PX030.
* DB100.sav serves as a lookup table: its household record is attached to every person of that household.
MATCH FILES FILE= *
  /table = 'mydata_path/DB100.sav'
  /BY ID010 ID020 ID_HH.
execute.
* Check whether it worked.
crosstabs /tables=pb020 by db100.
save outfile = 'mydata_path/personal_data.sav'.
Matching: many-to-one (n:1)
• Create new summary variables for personal data (P-File)
  • number of persons living in the same household
  • number of unemployed persons living in a household
  • number of full-time employed persons living in a household
  • number of part-time employed persons living in a household
  • number of self-employed persons living in a household
  • sum of ‘pensions from individual private plans’ (PY080G)
*********************************************************.
* many-to-one (n:1).
* Personal Data.
* example 1.
* number of persons living in the same household.
* number of unemployed persons living in a household.
*********************************************************.
* specify the path where the EU-SILC training dataset is stored.
FILE HANDLE data_path / NAME='H:\wirth\DWB_TRAINING\SILC\DATA\'.
* specify the path where you want to save your data.
FILE HANDLE mydata_path / NAME='H:\wirth\DWB_TRAINING\SILC\EXERCISE_1\'.
* open the EU-SILC training dataset.
GET FILE='data_path/udb_c10p_silc_course.sav'.
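One way the summary variables could be built from here is to flag each person and then sum the flags within the household. This is a sketch only, not necessarily the course solution. It assumes that the P-File is the active dataset and that PL031 is coded as in the slide example (1 = employed full time, 5 = unemployed); all new variable names are illustrative.

```spss
* Sketch only, assuming the P-File is the active dataset.
* PL031 codes are taken from the slide example: 1 = employed full time, 5 = unemployed.
* The names unemp, fulltime, n_persons, n_unemp and n_fulltime are illustrative.
COUNT unemp = PL031 (5).
COUNT fulltime = PL031 (1).
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=PB010 PB020 PX030
  /n_persons = N
  /n_unemp = SUM(unemp)
  /n_fulltime = SUM(fulltime).
* Every person record now carries the counts for his or her household.
FREQUENCIES VARIABLES=n_persons n_unemp n_fulltime.
```

MODE=ADDVARIABLES keeps one record per person and appends the household totals to each of them, as in the "summarized variables" table. Breaking by year, country and household ID follows the earlier advice not to rely on the household ID alone.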