Documentation file for CCRI 1911 test

Document1
Documentation file for test of CCRI Household Classification 1911 for use in RDC for 1921-51
RDC project: File 493861 Contract no: Moldofsky 2688
Steps:
1. Create directories to correspond to RDC directory structure:
H:\moldofsky2688\data
H:\moldofsky2688\syntax
2. In the data directory, copy the original 1911 dataset from:
P:\project\CFI_CCRI\Research\Research_1911\1911_DATA\original\
\CCRI-CDN-Census1911V20091117.sav
FOR 1921-51 datasets: Make sure MISSING values are declared as missing in the variable
definitions for all variables used in the syntax: RELATIONSHIP, MARITAL_STATUS, SEX,
Derived_Surname_Number, Derived_Age_In_Years
3. Large dwellings (LDs) of Single Unit type (SU), such as camps and prisons, will be excluded
from analysis. LDs of Multiple Unit type (MU), such as apartments and rooming houses, will be
included. Extract records with Dwelling_Unit_Type = MU or UU:
Select cases if: Dwelling_Unit_Type = "UU" OR Dwelling_Unit_Type = "MU"
Save as file: CCRI_CDN_Census1911V20091117_MU_UU_only.sav
This will be the base working data file from which all others are created. 355036 records.
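The selection in step 3 can be sketched outside SPSS as follows. This is only an illustration in Python on a few made-up records, assuming each record is a dictionary keyed by variable name; the actual extraction was done with Select Cases on the .sav file.

```python
# Hypothetical in-memory records; the real data live in an SPSS .sav file.
records = [
    {"Derived_Individual_Id": 1, "Dwelling_Unit_Type": "UU"},
    {"Derived_Individual_Id": 2, "Dwelling_Unit_Type": "MU"},
    {"Derived_Individual_Id": 3, "Dwelling_Unit_Type": "SU"},
]

# Dwelling unit types kept for analysis, as stated in step 3.
KEEP_TYPES = {"UU", "MU"}


def select_mu_uu(rows):
    """Keep rows whose Dwelling_Unit_Type is UU or MU; drop everything else."""
    return [row for row in rows if row.get("Dwelling_Unit_Type") in KEEP_TYPES]


base_working = select_mu_uu(records)
```

Records with a missing Dwelling_Unit_Type are dropped by this sketch, which matches the behaviour of a Select Cases condition that is not satisfied.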
4. Use Gordon Darroch’s syntax file as basis, and edit for new paths and filenames. Original
file: Darroch General HHLD Syntax for 1911 - four files with comments.sps
NEW FILE: MOLDOFSKY General HHLD Syntax for 1911 - four files with comments.sps
5. Data file corrections part 1:
Darroch’s file documents corrections that were made after processing, for values that were
missing and later inferred. A better method is to analyze and correct these before
processing. The issues were dealt with as below. Corrections were made manually and saved in file:
CCRI_CDN_Census1911V20091117_MU_UU_revised.sav
This now becomes the new “working file.”
5a. SEX missing for HEAD OF HOUSEHOLD. Can be inferred from name and occupation in
most cases.
Select cases: SELECT IF (MISSING (SEX) and RELATIONSHIP = 1).
Save file as: SEX_HEAD_MISSING.spv. 47 records selected.
In the base working file, copy variable SEX to a new variable: SEX_HEAD_REV
Find the identified cases by Derived_Individual_Id. Infer and correct
SEX_HEAD_REV. Use this variable in place of SEX throughout the syntax.
5b. MARITAL_STATUS missing for HEAD OF HOUSEHOLD. Can be inferred from examination
of household members in many cases.
Select cases: SELECT IF (RELATIONSHIP = 1 and MISSING (MARITAL_STATUS)).
Save file as: MARITAL_STATUS_HEAD_MISSING.spv. 322 records selected.
In the base working file, copy variable MARITAL_STATUS to a new variable: MARITAL_ST_REV
Find the identified cases by Derived_Individual_Id. Infer and correct
MARITAL_ST_REV. Use this variable in place of MARITAL_STATUS throughout the syntax.
5c. RELATIONSHIP incorrect or unknown in households where MARITAL_STATUS was missing (5b). It can
sometimes be inferred: for example, if the head is MARRIED, the unknown relationship of the partner
can be changed to SPOUSE or WIFE.
Copy variable RELATIONSHIP to RELATION_REV. Only correct it where 5b requires it.
Use this variable in place of RELATIONSHIP throughout the syntax.
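The pattern used in 5a to 5c (select the heads with a missing value, copy the variable to a _REV variable, then correct the copy by Derived_Individual_Id) can be sketched as below. This is an illustration in Python rather than the SPSS procedure itself. Missing values are represented as None, and the SEX codes and the corrected value are made up for the example.

```python
HEAD = 1  # RELATIONSHIP code for head of household, as in the SELECT IF lines above


def heads_missing(rows, field):
    """Return heads of household whose `field` is missing (None in this sketch)."""
    return [r for r in rows if r["RELATIONSHIP"] == HEAD and r[field] is None]


def add_revised_copy(rows, source, revised):
    """Copy `source` into a new `revised` variable, leaving `source` untouched."""
    for r in rows:
        r[revised] = r[source]


def apply_corrections(rows, revised, corrections):
    """Write manually inferred values into `revised`, keyed by Derived_Individual_Id."""
    for r in rows:
        if r["Derived_Individual_Id"] in corrections:
            r[revised] = corrections[r["Derived_Individual_Id"]]


# Hypothetical records; the SEX codes 1 and 2 are illustrative only.
rows = [
    {"Derived_Individual_Id": 10, "RELATIONSHIP": 1, "SEX": None},
    {"Derived_Individual_Id": 11, "RELATIONSHIP": 2, "SEX": None},
    {"Derived_Individual_Id": 12, "RELATIONSHIP": 1, "SEX": 1},
]

to_review = heads_missing(rows, "SEX")            # cases to examine by hand
add_revised_copy(rows, "SEX", "SEX_HEAD_REV")     # original SEX is preserved
apply_corrections(rows, "SEX_HEAD_REV", {10: 2})  # hypothetical inferred value
```

Keeping the original variable and correcting only the copy means every inferred value can be traced back and compared with what was entered from the schedule.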
6. Data file corrections part 2:
Edited Gordon’s syntax file to debug it and add comments. Ran tests on the Alberta dataset (see
H:\moldofsky_2688\data\CCRI_1911_TEST\DATA\Alberta_test), made further changes and
refinements, and then ran it on the entire dataset. Final version:
MOLDOFSKY General HHLD Syntax for 1911 REV 2012 02 29.sps
This syntax splits working file into 4 parts based on household heads: Regular heads, Double
(multiple) heads, No heads and Solo heads (only one person in household.) Each of these
files is then processed separately to assign HHLDTYPE (Darroch’s 25 classes of Household
type). Examining these files after completion and debugging revealed a number of other
errors/inconsistencies in the dataset, and it was decided to correct these. Many of these
were revealed when the HHLD_TYPE value was MISSING; some by visual examination of
data. These were handled as described below. Corrections were made to working file
CCRI_1911_V20091117_MU_UU_only.sav and saved as
CCRI_1911_V20091117_MU_UU_revised.sav, which becomes the new working file. The syntax was
changed for the next round.
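The four-way split described above can be sketched as follows. This is an illustration in Python, not Darroch’s syntax. It assumes that RELATIONSHIP = 1 marks a head, that households are identified by Derived_Household_Id, and that a one-person household counts as Solo regardless of the recorded relationship; that last rule is an assumption of this sketch.

```python
from collections import defaultdict

HEAD = 1  # RELATIONSHIP code for head of household


def split_by_head_pattern(rows):
    """Group household ids into solo, regular, double and no_head."""
    households = defaultdict(list)
    for r in rows:
        households[r["Derived_Household_Id"]].append(r)

    groups = {"solo": [], "regular": [], "double": [], "no_head": []}
    for hh_id, members in households.items():
        n_heads = sum(1 for m in members if m["RELATIONSHIP"] == HEAD)
        if len(members) == 1:
            key = "solo"       # only one person in household (assumed to take precedence)
        elif n_heads == 1:
            key = "regular"    # exactly one head
        elif n_heads > 1:
            key = "double"     # two or more heads
        else:
            key = "no_head"    # nobody recorded as head
        groups[key].append(hh_id)
    return groups


# Hypothetical records: RELATIONSHIP 2 and 3 stand for any non-head code.
rows = [
    {"Derived_Household_Id": 100, "RELATIONSHIP": 1},
    {"Derived_Household_Id": 200, "RELATIONSHIP": 1},
    {"Derived_Household_Id": 200, "RELATIONSHIP": 2},
    {"Derived_Household_Id": 200, "RELATIONSHIP": 3},
    {"Derived_Household_Id": 300, "RELATIONSHIP": 1},
    {"Derived_Household_Id": 300, "RELATIONSHIP": 1},
    {"Derived_Household_Id": 400, "RELATIONSHIP": 2},
    {"Derived_Household_Id": 400, "RELATIONSHIP": None},
]

groups = split_by_head_pattern(rows)
```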
6a. Solo heads (file 1911_Solo_Head_ALLVARIABLES.sav)
29 records had HHLD_TYPE MISSING. This is because MARITAL_STATUS or SEX were missing
in the original data file.
For Marital Status: Darroch’s syntax recodes missing values as 0 for both variables, which
causes problems later in classification. Therefore for SOLO heads, change the syntax to Recode
Not_Married_Head (missing=4) instead of 0. This assumes missing is unmarried, which for
households of people living alone is overwhelmingly true.
For SEX: examine records where MISSING(SEX) (only a few, which were not caught in 5a)
and make corrections manually in SEX_HEAD_REV where inferrable from other data such as
FIRST_NAME or RELATIONSHIP.
6b. Solo heads (file 1911_Solo_Head_ALLVARIABLES.sav)
Incorrect identification of Households as different: Examination of file revealed many
records with consecutive Derived_Household_Ids and the same surnames. This should not
happen and is a data entry error; if related people live in the same dwelling they should be
part of the same household. In most cases these were errors in the way the households
were numbered on the manuscript schedule, which were then carried through into the
database because of our “verbatim data entry” policy.
The file 1911_Solo_Head_ALLVARIABLES.sav was put into Access for analysis to identify
duplicate Household/Surname records. These records were copied into a separate file for
documentation purposes: 1911_Solo_Heads_Dup_DIDs_Lastnames.sav.
These records WILL BE combined in the new working file by assigning them the identical
Derived_Household_Id and changing the following fields as necessary:
Derived_Household_Id_In_Dwelling, Derived_Person_Num_In_Household,
Derived_Person_Num_In_Dwelling, Derived_Surname_Number
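The check for wrongly split households was done in Access. As a rough illustration of the same idea in Python, the sketch below flags pairs of solo-head records with consecutive Derived_Household_Ids and the same surname. The surname field name LAST_NAME is hypothetical, integer household ids are assumed, and flagged pairs would still need to be checked by hand before being combined.

```python
def normalise(name):
    """Trim and upper-case a surname; return None when it is missing or blank."""
    if name is None or not name.strip():
        return None
    return name.strip().upper()


def find_split_households(solo_heads, surname_field="LAST_NAME"):
    """Return (id, next_id) pairs of consecutive household ids sharing a surname."""
    ordered = sorted(solo_heads, key=lambda r: r["Derived_Household_Id"])
    pairs = []
    for prev, cur in zip(ordered, ordered[1:]):
        consecutive = cur["Derived_Household_Id"] - prev["Derived_Household_Id"] == 1
        prev_name = normalise(prev[surname_field])
        same_name = prev_name is not None and prev_name == normalise(cur[surname_field])
        if consecutive and same_name:
            pairs.append((prev["Derived_Household_Id"], cur["Derived_Household_Id"]))
    return pairs


# Hypothetical solo-head records.
solo_heads = [
    {"Derived_Household_Id": 501, "LAST_NAME": "Smith"},
    {"Derived_Household_Id": 502, "LAST_NAME": " smith"},
    {"Derived_Household_Id": 503, "LAST_NAME": "Jones"},
    {"Derived_Household_Id": 505, "LAST_NAME": "Jones"},
    {"Derived_Household_Id": 506, "LAST_NAME": None},
    {"Derived_Household_Id": 507, "LAST_NAME": None},
]

suspect_pairs = find_split_households(solo_heads)
```

Records with a missing surname are never flagged, so two consecutive households with unknown names are not treated as a match.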
6c. No Heads file ( file 1911_No_Head_ALLVARIABLES.sav)
33 Records had HHLD_TYPE MISSING. This is because MARITAL_STATUS was missing in the
original data file; in 4 records SEX was missing as well.
For MARITAL_STATUS and SEX: examine records where either is MISSING (those not
caught in step 5), and make corrections manually in MARITAL_ST_REV and SEX_HEAD_REV
where inferrable from other data. This is possible most of the time. This process was
augmented by examination of manuscript schedules online at:
http://automatedgenealogy.com/census11/index.jsp
In the process, some 7 records were identified where entire households were entered with
MISSING values for RELATIONSHIP as well as MARITAL_STATUS and SEX. For 5 of these 7, the
values seem clear from the manuscript schedules, so these data were added to the
dataset in the fields RELATION_REV, MARITAL_ST_REV and SEX_HEAD_REV. These records
and the corrections are documented in the file:
No_Head_HHLDTYPE_Missing_Corrections.sav (total 59 records)
The corrections were then made manually in the new working file.
7. After all corrections were made to the new working file:
CCRI_1911_V20091117_MU_UU_revised.sav
DELETE all constructed variables (egonum, TOTinDWELL, TOTinHHLD, etc.)
The entire syntax was re-run to create new separate classified files.
These were then merged to create one classification file containing all records of heads:
1911_ALL_Head_only_classif.sav
To this file we add Gordon’s reclassification into two grouped variables, one with 8
classes and one with 3 classes. See in the syntax file:
*Reclassification to create HHLD_8 and HHLD_3
8. Tested merging output classification files with original data file (test for Alberta)
Merge screen set up as below, to add HHLD_type variable, using Derived Household ID as
the key or linking variable
Open classification file - Sort on Derived indiv id
Open full data file - Sort on Derived indiv id
Merge files by following using Derived household id as KEY VARIABLE
Data -> Merge files -> Add variables (see below)
Syntax generated was used as a guide as follows, with appropriate filename changes
This has been incorporated into the main syntax file.
This file is then merged with original file to create new cleaned full HHLDTYPE file:
CCRI_1911_V20091117_MU_UU_HHLDTYPE.sav
see:
*Merging with full working file based on Derived Household ID - Using Data - Merge - Add variables
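The merge in step 8 is a keyed lookup: every person record in the full file receives the HHLD_TYPE of its household. A minimal sketch in Python is given below for illustration only; it is not the syntax generated by SPSS. It assumes the classification file has exactly one row per Derived_Household_Id, and the HHLD_TYPE value used is made up.

```python
def add_household_type(full_rows, head_rows,
                       key="Derived_Household_Id", var="HHLD_TYPE"):
    """Return copies of full_rows with `var` looked up from head_rows by `key`."""
    lookup = {}
    for h in head_rows:
        # Assumption of this sketch: one classification row per household.
        if h[key] in lookup:
            raise ValueError(f"duplicate key {h[key]!r} in classification file")
        lookup[h[key]] = h[var]
    # Households with no classification row get None (missing).
    return [{**row, var: lookup.get(row[key])} for row in full_rows]


# Hypothetical records; HHLD_TYPE 7 is an illustrative class number.
full = [
    {"Derived_Individual_Id": 1, "Derived_Household_Id": 100},
    {"Derived_Individual_Id": 2, "Derived_Household_Id": 100},
    {"Derived_Individual_Id": 3, "Derived_Household_Id": 200},
]
heads = [{"Derived_Household_Id": 100, "HHLD_TYPE": 7}]

merged = add_household_type(full, heads)
```

The duplicate-key check is worth keeping in any real run: if a household appears twice in the classification file, a keyed merge cannot say which HHLD_TYPE its members should receive.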
9. Final syntax file: MOLDOFSKY General HHLD Syntax for 1911 REV 2012 02 29.sps
Directory: C:\moldofsky_2688\syntax\CCRI_1911_TEST
Data: C:\moldofsky_2688\data\CCRI_1911_TEST\2012_02
Copied into: H:\moldofsky_2688\ as defined at top of file.
10. Test adding other variables to HHLD file for aggregation etc
Use Merge->Add variables - Merge by Derived Individual ID to add fields to the Household
only file.