Report into the feasibility of matching teenage conception data with school census data

Vera Ruddock and Peter Davies, Office for National Statistics

1. Background

There is considerable research and policy interest in teenage pregnancy. The Office for National Statistics (ONS) publishes statistics on the numbers of women conceiving under the age of 18 [1]. These figures are derived from birth and abortion records; they hold no information about the social background or educational achievement of those who conceive. The Department for Education (DfE) maintains the National Pupil Database (NPD) and the School Census [2,3]. These two sources contain a wealth of information on the characteristics of pupils in England.

The Teenage Pregnancy Unit commissioned ONS to carry out a study linking the conceptions and education data sources. The ultimate aim was to produce a dataset that could be analysed to identify factors associated with teenage pregnancy.

Consent to match the ONS conceptions data to the DfE education data was obtained from the data custodians for the NPD and School Census, the ONS Caldicott Guardian (for the conceptions leading to maternity data) and the Chief Medical Officer for England (for the conceptions leading to abortion data). Access to data was restricted to ONS staff directly involved in the project.

This report describes the process and results of the linkage study.

2. Data sources

The study attempted to link data for three cohorts of girls (Table 1).

Table 1 Cohorts of girls included in study

Cohort | Dates of birth included | Conceptions data included | School census data included | Age at end of 2008 (when follow up ended)
A | 1 September 1989 – 30 August 1990 | 2003-2008 | 2003-2008 | 18
B | 1 September 1990 – 30 August 1991 | 2004-2008 | 2004-2008 | 17
C | 1 September 1991 – 30 August 1992 | 2005-2008 | 2005-2008 | 16

This design was chosen to provide a cohort of girls who were followed up until the age of 18 (cohort A).
The inclusion of cohorts B and C increases the sample of girls who conceived under the age of 16. Without this boost the number of conceptions to girls under 16 would be very small and any analysis of the linked data would lack statistical power.

2.1 Preparing the data for matching – National Pupil Database records

Table 2 shows the number of NPD records in each of the cohorts by survey year.

Table 2 NPD data for England

Year of survey | Cohort A | Cohort B | Cohort C | All cohorts
2003 | 296,842 | | | 296,842
2004 | 296,663 | 300,437 | | 597,100
2005 | 296,144 | 300,146 | 297,439 | 893,729
2006 | 292,520 | 299,609 | 297,358 | 889,487
2007 | 107,430 | 296,357 | 296,955 | 700,742
2008 | 86,757 | 108,726 | 294,051 | 489,534

Each girl has a unique NPD identifier and may have many NPD records, one for each year in which she was in school. A dataset was constructed with one record for each girl for every combination of date of birth and postcode in the NPD data. A girl who remained at the same postcode throughout the surveys and had consistent dates of birth would be included once. Those who had changes in postcode or recorded date of birth were included a number of times. This was to allow us to match conceptions occurring throughout the study period. The year of the survey used for matching was also recorded on the dataset. Where there was more than one year with the same combination of pupil ID number, date of birth and postcode, the earliest survey record was used.

2.2 Preparing the data for matching – ONS conception records

There were 91,476 conceptions to girls under 18 in the three cohorts. These conception records were matched using postcode of residence and the women’s dates of birth to create pseudo conception histories. A dataset was created containing the first conception record for each unique date of birth/postcode combination (86,560 records). Eighty-eight second or subsequent conceptions were included in the denominator in error.
Since this number is very small relative to the total, the impact on match rates is low.

2.3 Matching variables

The matching was carried out in two stages: first using purely statistical data, and then using names to improve the matching of conceptions leading to a maternity. ONS does not have access to the names of women who have an abortion.

Only two variables were available on both the conception dataset and the National Pupil Database: postcode of residence and the woman’s date of birth. Names are not included on the statistical conception dataset.

2.4 The matching process

Matching was carried out in two stages as shown in Figures 1 and 2. The conception records were separated into those where there was a single conception for a given postcode/date of birth combination (Figure 1) and those where there was more than one conception (Figure 2). Match rates were analysed using a dataset containing only the first conception for each date of birth/postcode combination.

The first stage matched records using only date of birth and postcode. Subsequently, conceptions leading to a maternity were matched using names recorded at birth registration.

2.5 Matching conceptions leading to a maternity using names

In addition to the statistical conceptions dataset, ONS also holds information on the names of mothers for births registered in England and Wales. This information may be used by ONS for statistical purposes, such as this project. These names were matched to the conceptions leading to a maternity. ONS does not have access to information on the names of women who have had an abortion.

The Department for Education supplied ONS with the names of girls on the NPD, which were then linked to the NPD data. These two named datasets were matched together.
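
The first-stage match on date of birth and postcode (sections 2.1 to 2.4) can be sketched in Python as follows. This is a minimal illustration and not the code used by ONS: the function names, the tuple layout of the records and the postcode clean-up step are all assumptions made for the example, and the records are invented.

```python
from collections import defaultdict


def normalise_postcode(postcode):
    """Upper-case and strip spaces (an assumed cleaning step, not from the report)."""
    return postcode.replace(" ", "").upper()


def build_npd_index(npd_rows):
    """Map (date of birth, postcode) to the set of NPD pupil ids recorded with it.

    npd_rows: iterable of (pupil_id, dob, postcode, survey_year) tuples.
    Repeated years for the same pupil/dob/postcode collapse to the earliest year.
    """
    earliest_year = {}
    for pupil_id, dob, postcode, year in npd_rows:
        key = (pupil_id, dob, normalise_postcode(postcode))
        earliest_year[key] = min(year, earliest_year.get(key, year))

    index = defaultdict(set)
    for pupil_id, dob, postcode in earliest_year:
        index[(dob, postcode)].add(pupil_id)
    return index


def classify(dob, postcode, index):
    """Return ('unique' | 'possible' | 'none', sorted list of candidate pupil ids)."""
    pupils = sorted(index.get((dob, normalise_postcode(postcode)), ()))
    if len(pupils) == 1:
        return "unique", pupils
    if len(pupils) > 1:
        return "possible", pupils
    return "none", pupils


# Invented example records (hypothetical, not real data)
npd_rows = [
    ("P1", "1990-01-05", "AB1 2CD", 2004),
    ("P1", "1990-01-05", "AB1 2CD", 2003),  # same combination, earlier survey year
    ("P2", "1990-01-05", "ab12cd", 2005),   # second pupil with the same dob/postcode
    ("P3", "1991-03-02", "ZZ9 9ZZ", 2005),
]
index = build_npd_index(npd_rows)
```

A conception record whose date of birth and postcode point to exactly one pupil is a unique match, one that points to several pupils is a possible match, and one that points to none is unmatched. These correspond to the three outcomes shown in Figures 1 and 2.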
The ONS conceptions leading to maternity records were split into three groups:

• Group 1 – Conception records already matched using the key fields of mother’s date of birth and postcode (N = 18,806)
• Group 2 – Conception records that matched using the key fields above, but where the match was not unique, resulting in two or more NPD pupils being identified (N = 191)
• Group 3 – Conception records that could not be matched to the NPD data using these key fields (N = 23,356), or where two or more ONS records matched to two or more NPD records (N = 57)

Matching using names – Group 1

These data had previously been matched on a one to one basis. If the first names matched (ONS and NPD) the match was confirmed. If the first names did not match, records were manually checked and were marked as matched if the forenames were the same allowing for differences in spelling.

Matching using names – Group 2

These conceptions were checked automatically using forenames. Records which did not match automatically were checked manually. The record was marked as matched if the ONS forenames matched the NPD forenames, again allowing for differences in spelling.

Matching using names – Group 3

It was assumed that the most likely reason for these records not matching originally was that the girl had moved house before having the baby. The enhanced matching used the mother’s date of birth as a first match, then used the names to confirm or reject the match. The surname was included in the matching, with the maiden name being used as well as the married surname, if available. The records were matched in three stages:

• a match was attempted using the forenames and the surname(s), and confirmed if successful (Quality=1).
• if no match was obtained then a match was attempted using the forenames and the first four characters of the surname (Quality=2).
• if no match was obtained then a match was attempted using just the forenames.
  These records were then also manually checked to confirm the match (Quality=3).

Table 3 shows the impact of using names to match conceptions in the three groups. The weakest matches (Quality=3) will probably include some false positives. However, many of these matches share similar postcodes; in 560 records the postcode recorded at birth shared the same first two characters as the postcode on the school records. These matches are highly likely to be true matches. For this reason it was decided to include the Quality 3 matched records in the analysis dataset.

2.6 Matching rates

2.6.1 Match rates using only date of birth and postcode

Table 4 shows the number of matches and match rate by outcome (maternity/abortion) and girl’s age. There were 47,808 unique matches (where one conception record linked to a single NPD record). For conceptions to girls under 18 years the match rate was 55 per cent. The match rate was much higher for conceptions leading to abortion (67 per cent) than for conceptions leading to a maternity (44 per cent). The reason lies in which postcode is recorded. The postcode recorded on the conceptions leading to a maternity is the postcode where the girl was living at the time of the birth, not the conception. Girls who moved out of the family home before the birth would have a different postcode from the one recorded when they were at school. Unless they returned to school after the birth, the conception record would not match with the school records. For conceptions leading to abortion the postcode recorded is the usual residence at the time of the abortion. This is far less likely to be different from the postcode recorded in school records.

Match rates decrease with age at conception. In the case of conceptions leading to abortion this decline is small (from 72 per cent for girls aged under 15 at the time of the conception to 61 per cent for 17 year olds). For conceptions leading to maternity the decline is much larger.
While 68 per cent of conceptions to girls under 15 years match to NPD records, only 34 per cent of conceptions to 17 year olds match. This is likely to be because older girls are more likely to move to a new address separate from the family home before giving birth.

2.6.2 Match rates using date of birth, postcode and names

Including names in the matching algorithm for conceptions leading to a maternity had a large impact on match rates. The overall match rate for conceptions leading to a maternity increased from 44 per cent to 71 per cent. The biggest improvement in match rate was for the older girls; the rate for 17 year olds increased from 34 per cent to 68 per cent.

2.7 Investigating poor match rates for conceptions leading to a maternity

When a birth is registered the Registrar records the type of birth registration. There are four possibilities:

• Births inside marriage
• Births outside marriage registered by two parents living at the same address
• Births outside marriage registered by two parents living at different addresses
• Births registered only by the mother (sole registrations).

Table 5 shows the match rates by type of registration and age. Once names are included, match rates are similar for sole registrations and births registered by two parents living at different addresses (73 per cent and 79 per cent respectively). Rates for the small number of maternities where the birth occurred inside marriage are much lower (31 per cent). Including names in the matching algorithm had a large impact on match rates for births outside marriage registered by two parents living at the same address. The match rate for this group increased from 20 per cent to 63 per cent. This is not surprising since many of these mothers will have moved away from the parental home.

2.8 Match rates by region

Table 6 shows match rates by outcome and region. The variation in match rates is larger for conceptions leading to a maternity than for conceptions leading to abortion. The lowest match rates for both maternities and abortions occur in London.
One reason for this may be poorer coverage of the NPD in London. In 2011 ten per cent of girls aged 14 living in London attended an independent school compared with four per cent of girls in the North East [4]. Information on these girls would not be included in the NPD data.

2.9 Match rates by the Income Deprivation Affecting Children Index (IDACI) [5]

The IDACI was constructed by the Social Disadvantage Research Centre at the University of Oxford as part of the English Indices of Deprivation 2010 for the Department for Communities and Local Government. This index represents the proportion of children under 15 living in income deprived households.

The IDACI index of deprivation was assigned to each conception record using the postcode. This enabled an analysis of match rates by IDACI quintile (Tables 7a and 7b).

Match rates for conceptions to girls under 18 years were lower for those living in the most deprived areas (quintile 1). This was true for both conceptions leading to a maternity and conceptions leading to an abortion. For abortions the match rates varied from 63 per cent in the most deprived areas to 70 per cent in the least deprived. Match rates for conceptions leading to a maternity varied from 67 per cent to 77 per cent.

2.10 Possible reasons for poor match rates

There are a number of reasons why conception records may not match with NPD records. It is important to examine these and consider whether they are likely to cause bias in the matched dataset.

The main problem is matching records where the girl’s postcode changed between the ages of 13 and 18 and she did not remain in an educational setting covered by the NPD until the age of 18.

A high proportion of girls in England aged 16 and 17 do not attend either maintained schools or academies. In 2008 only 35 per cent of girls aged 16 attended a maintained school, city technology college or academy [6]. An additional 34 per cent attended further education colleges and 12 per cent sixth form colleges.
Neither of these two groups is covered by the NPD. Sixteen year olds attending further education colleges in England are nearly twice as likely to be eligible for free school meals as those attending a state funded school (15.2 per cent vs 7.8 per cent) [7].

3. Creation of linked pupil level dataset

The second output of the matching study is a pupil level dataset with one record per girl on the NPD dataset. All matched conceptions would be included on the same record. There were six stages to this process.

1. The original matched dataset (not using names) was used to identify matched conception records.
2. This was supplemented by maternity records matched using names.
3. Second and subsequent conceptions were linked in to the data.
4. Conception records with the same NPD pupil ID were linked together. This created a conception history for girls who moved house between conceptions.
5. The resulting file was joined to NPD data for girls who did not have a conception.
6. Explanatory variables from the NPD were merged into the final data file.

The final dataset contained one record for each girl on the NPD with linked in conception data. A single matching variable was added describing the match status of the first conception as matched, possible match or non-match. Possible matches were where two or more NPD records had linked to a single conception history (but see point 2 below).

During the process of creating the file two small errors occurred:

1. In the initial matching (without names) there were 2,697 abortions followed by one or more further conceptions that linked to a single NPD record. These were categorised as unique or ‘true’ matches. Most of these records were misclassified in the final dataset as possible matches. The algorithm created to carry out the matching checked whether the names of NPD and conception records matched before assigning match status.
   Since abortion records were not matched using names they were reclassified as possible matches.

2. In the original matching there were 533 conception histories where a girl had a maternity followed by one or more further conceptions which did not match at the first stage. When names were added to the algorithm 251 of these records matched. Unfortunately only the first conception (the maternity) was added to the analysis dataset. Of these 251, 134 were for a maternity followed by an abortion and 117 for a maternity followed by a maternity. The impact of this error will be on the analyses of repeat maternities in cohort A. Out of the 117 repeat maternities, 59 were to girls in cohort A.

3.1 Research access

Identifiable information (such as postcodes, names and dates of birth) was removed from the matched data file. The resulting file was loaded into the ONS secure Virtual Microdata Laboratory (VML) [8]. The Approved Researchers accessed the data in the VML at ONS to carry out their analyses. Outputs were disclosure controlled by ONS before being released to the researchers.

4. Conclusions

This report shows that it is possible to link NPD pupil data to ONS conception data. However, using only date of birth and postcode, match rates were low, particularly for conceptions leading to a maternity. For conceptions to girls under 18 years the overall match rate was 55 per cent, with rates of 67 per cent for abortions and 44 per cent for maternities. Including names in the matching algorithm for conceptions leading to maternities increased the match rate for these conceptions to 71 per cent.

There is clear evidence of differences between the matched sample and conceptions that did not match.

• Match rates were lower for maternities where the birth either occurred within marriage or was registered by two people at the same address.
• Match rates were lower for conceptions to girls living in deprived areas.
• Match rates were lower in London and the East of England.
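
The headline match rates quoted in these conclusions can be reproduced from the under-18 counts in Table 4. The short Python sketch below does the arithmetic; the helper name match_rate and the variable names are illustrative assumptions, and only the counts come from the report.

```python
def match_rate(matched, total):
    """Percentage of records matched, rounded to one decimal place."""
    return round(100 * matched / total, 1)


# Counts for conceptions to girls under 18, taken from Table 4
abortions_total, abortions_matched = 43_597, 29_002
maternities_total, maternities_matched = 42_963, 18_806
maternities_matched_with_names = 30_400

rates = {
    "abortion": match_rate(abortions_matched, abortions_total),
    "maternity": match_rate(maternities_matched, maternities_total),
    "maternity_with_names": match_rate(maternities_matched_with_names, maternities_total),
    "overall": match_rate(
        abortions_matched + maternities_matched,
        abortions_total + maternities_total,
    ),
}
```

The overall denominator, 43,597 + 42,963 = 86,560, is the number of first conception records described in section 2.2, and the overall numerator, 29,002 + 18,806 = 47,808, is the number of unique matches reported in section 2.6.1.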
A number of possible reasons for non-matches were identified:

• Girls who moved house and changed their postcode of residence during their secondary school career.
• The NPD includes nearly all girls under the age of 16, but it does not include records for those at sixth form colleges and other forms of further education. If a girl moved after she left maintained school and then conceived, the conception would not match.
• Girls living in care are more likely to change address than other girls. The matched sample probably does not include as many girls in care as it should.

Acknowledgement

ONS would like to acknowledge the assistance of Sarah Butt, formerly of the Department for Education, in supplying data and background information.

References

1. Office for National Statistics. Conception statistics, England and Wales. Available at: http://www.ons.gov.uk/ons/rel/vsob1/conception-statistics--england-andwales/2010/index.html2010
2. Information on the National Pupil Database is available at: http://www.education.gov.uk/vocabularies/educationtermsandtags/5385
3. Information on the School Census is available at: http://www.education.gov.uk/rsgateway/schoolcensus.shtml
4. Figures derived from: Department for Education. Schools, Pupils and their characteristics 2011, Tables 9b, 9c and 9d. Available at: http://www.education.gov.uk/researchandstatistics/statistics/statistics-bytopic/schoolpupilcharacteristics/a00196810/schools-pupils-and-their-characteristics-january-2
5. Department for Communities and Local Government 2010. Available at: http://www.communities.gov.uk/publications/corporate/statistics/indices2010
6. Department for Education. Participation in Education, Training and Employment by 16-18 Year Olds in England, Table 6. Available at: http://www.education.gov.uk/rsgateway/DB/SFR/s000938/index.shtml
7. Department for Education, Pupil Census 2007/8. Personal communication from Tim Thair.
8. Virtual Microdata Laboratory. Office for National Statistics.
   Details of this service are available at: http://www.ons.gov.uk/ons/about-ons/who-we-are/services/virtual-microdatalaboratory/index.html

Table 3 Number and percentage of conceptions to girls under 18 leading to a maternity matched to education records using different matching strategies
Conceptions in England in 3 cohorts, 2003-2008

Group | Status of match using date of birth and postcode only | Number | Additional variables used | Quality (Group 3 only) | Method | Number of matches | Total | Percentage matched
1 | Unique match | 18,806 | Forenames | | Automated | 17,895 | 17,903 | 99.3
1 | | | Forenames | | Manual | 791 | 903 | 86.8
1 | | | | | Total | 18,686 | 18,806 | 99.4
2 | Matched to more than 1 NPD record (one ONS record) | 191 | Forenames | | Manual | 189 | 191 | 99.0
2 | | | | | Total | 189 | 191 | 99.0
3 | Matched to more than 1 NPD record (more than 1 ONS record) | 57 | Forenames | | Automated | 24 | 57 | 42.1
3 | | | | | Total | 24 | 57 | 42.1
3 | Did not match at all | 23,909 | Date of birth, forename, surname, maiden name | 1 | Automated | 9,723 | 23,909 | 40.7
3 | | | Date of birth, forename, first four characters of surname, maiden name | 2 | Automated | 130 | 14,186 | 0.9
3 | | | Date of birth, forenames | 3 | Automated + manual check | 1,648 | 14,056 | 11.7
3 | | | | | Total | 11,501 | |
3 | | | | | Did not match | 12,408 | |
Total | | 42,963 | | | | 30,400 | | 70.8

Table 4 Numbers and per cent of conceptions matched to education records
Conceptions in England in 3 cohorts, 2003-2008

Conceptions leading to an abortion (matching using only postcode and date of birth)

Age in years | Total number | Number matched to 1 NPD record | Percentage matched (1)
Under 15 | 3,817 | 2,736 | 71.7
15 | 9,303 | 6,769 | 72.8
16 | 16,140 | 10,722 | 66.4
17 | 14,337 | 8,775 | 61.2
Under 18 | 43,597 | 29,002 | 66.5

Conceptions leading to a maternity

Age in years | Total number | Postcode and date of birth only: number matched to 1 NPD record | Postcode and date of birth only: percentage matched (1) | Using names: number matched | Using names: percentage matched (1)
Under 15 | 2,116 | 1,431 | 67.6 | 1,650 | 78.0
15 | 6,564 | 3,977 | 60.6 | 4,999 | 76.2
16 | 16,117 | 7,161 | 44.4 | 11,337 | 70.3
17 | 18,166 | 6,237 | 34.3 | 12,414 | 68.3
Under 18 | 42,963 | 18,806 | 43.8 | 30,400 | 70.8

(1) Unique matches only included

Table 5 Numbers of cases and percentage of maternities matched to education records by type of registration
Conceptions in England in 3 cohorts, 2003-2008

Age (1) | Total number | Postcode and date of birth only: number matched | Postcode and date of birth only: percentage matched | Adding names: number matched | Adding names: percentage matched

Births inside marriage
Under 15 | 25 | 2 | 8.0 | 4 | 16.0
15 | 71 | 9 | 12.7 | 10 | 14.1
16 | 350 | 77 | 22.0 | 121 | 34.6
17 | 621 | 110 | 17.7 | 198 | 31.9
Under 18 | 1,067 | 198 | 18.6 | 333 | 31.2

Births outside marriage to parents living at the same address
Under 15 | 219 | 102 | 46.6 | 139 | 63.5
15 | 1,401 | 470 | 33.5 | 893 | 63.7
16 | 5,305 | 1,102 | 20.8 | 3,255 | 61.4
17 | 7,264 | 1,152 | 15.9 | 4,601 | 63.3
Under 18 | 14,189 | 2,826 | 19.9 | 8,888 | 62.6

Births outside marriage to parents living at different addresses
Under 15 | 807 | 613 | 76.0 | 684 | 84.8
15 | 2,733 | 1,968 | 72.0 | 2,276 | 83.3
16 | 6,184 | 3,733 | 60.4 | 4,904 | 79.3
17 | 6,396 | 3,245 | 50.7 | 4,883 | 76.3
Under 18 | 16,120 | 9,559 | 59.3 | 12,747 | 79.1

Sole registrations
Under 15 | 1,065 | 714 | 67.0 | 823 | 77.3
15 | 2,359 | 1,530 | 64.9 | 1,820 | 77.2
16 | 4,278 | 2,249 | 52.6 | 3,057 | 71.5
17 | 3,885 | 1,730 | 44.5 | 2,732 | 70.3
Under 18 | 11,587 | 6,223 | 53.7 | 8,432 | 72.8

(1) Age at conception

Table 6 Numbers of cases and percentage of maternities matched to education records by region
Conceptions in England in 3 cohorts, 2003-2008

Region | Maternity: total number | Maternity, postcode and date of birth only: number matched | Maternity, postcode and date of birth only: percentage matched | Maternity, using names: number matched | Maternity, using names: percentage matched | Abortion: total number | Abortion, postcode and date of birth only: number matched | Abortion, postcode and date of birth only: percentage matched
North East | 3,077 | 1,548 | 50.3 | 2,243 | 72.9 | 2,372 | 1,682 | 70.9
North West | 7,138 | 3,308 | 46.3 | 5,199 | 72.8 | 6,731 | 4,539 | 67.4
Yorkshire and The Humber | 5,716 | 2,530 | 44.3 | 4,159 | 72.8 | 4,794 | 3,317 | 69.2
East Midlands | 4,060 | 1,838 | 45.3 | 3,018 | 74.3 | 3,373 | 2,280 | 67.6
West Midlands | 5,489 | 2,510 | 45.7 | 4,012 | 73.1 | 5,292 | 3,591 | 67.9
East of England | 3,720 | 1,456 | 39.1 | 2,562 | 68.9 | 3,915 | 2,551 | 65.2
London | 4,651 | 1,717 | 36.9 | 2,444 | 52.5 | 7,345 | 4,449 | 60.6
South East | 5,406 | 2,347 | 43.4 | 3,983 | 73.7 | 5,937 | 4,055 | 68.3
South West | 3,706 | 1,552 | 41.9 | 2,780 | 75.0 | 3,838 | 2,538 | 66.1
England | 42,963 | 18,806 | 43.8 | 30,400 | 70.8 | 43,597 | 29,002 | 66.5

Table 7a Match rates by IDACI quintile - conceptions to under 18s
Conceptions in England in 3 cohorts, 2003-2008

IDACI quintile | Maternity: number | Maternity, postcode and date of birth only: matched | Maternity, postcode and date of birth only: percentage matched | Maternity, using names: matched | Maternity, using names: percentage matched | Abortion: number | Abortion, postcode and date of birth: matched | Abortion, postcode and date of birth: percentage matched
1 | 18,467 | 7,852 | 42.5 | 12,448 | 67.4 | 12,970 | 8,145 | 62.8
2 | 11,506 | 5,051 | 43.9 | 8,242 | 71.6 | 10,112 | 6,717 | 66.4
3 | 6,816 | 2,962 | 43.5 | 5,033 | 73.8 | 7,841 | 5,278 | 67.3
4 | 3,866 | 1,789 | 46.3 | 2,905 | 75.1 | 6,686 | 4,663 | 69.7
5 | 2,308 | 1,152 | 49.9 | 1,772 | 76.8 | 5,988 | 4,199 | 70.1
England | 42,963 | 18,806 | 43.8 | 30,400 | 70.8 | 43,597 | 29,002 | 66.5

Table 7b Match rates by IDACI quintile - conceptions to under 16s
Conceptions in England in 3 cohorts, 2003-2008

IDACI quintile | Maternity: number | Maternity, postcode and date of birth only: matched | Maternity, postcode and date of birth only: percentage matched | Maternity, using names: matched | Maternity, using names: percentage matched | Abortion: number | Abortion, postcode and date of birth: matched | Abortion, postcode and date of birth: percentage matched
1 | 3,898 | 2,361 | 60.6 | 2,887 | 74.1 | 4,062 | 2,784 | 68.5
2 | 2,313 | 1,486 | 64.2 | 1,809 | 78.2 | 3,194 | 2,343 | 73.4
3 | 1,292 | 814 | 63.0 | 1,008 | 78.0 | 2,322 | 1,699 | 73.2
4 | 712 | 440 | 61.8 | 555 | 77.9 | 1,921 | 1,452 | 75.6
5 | 465 | 307 | 66.0 | 390 | 83.9 | 1,621 | 1,227 | 75.7
England | 8,680 | 5,408 | 62.3 | 6,649 | 76.6 | 13,120 | 9,505 | 72.4

Figure 1 Single conception for date of birth/postcode (flowchart, reproduced as an outline)

Conceptions to girls in cohorts A, B and C under 18: N=91,476
Single conception: N=81,742, matched on date of birth/postcode to NPD records (1 record per postcode)
  Unique match: N=44,207
    Maternity: N=17,902
      Match using names: match N=17,791, no match N=111
    Abortion: N=26,305
  Possible match (more than 1 NPD match): N=554
    Maternity: N=191
      Match using names: match N=189, no match N=2
    Abortion: N=363
  No match: N=36,981
    Maternity: N=23,356
      Match using names: match N=11,250, no match N=12,106
    Abortion: N=13,625

Figure 2 More than one conception for date of birth/postcode (flowchart, reproduced as an outline)

Conceptions to girls in cohorts A, B and C under 18: N=91,476
More than 1 conception: N=9,733, matched on date of birth/postcode to NPD records (1 record per postcode)
  1st conception: N=4,730
    Unique match: N=3,601
      Maternity: N=904
        Match using names: matched N=895, not matched N=9
      Abortion: N=2,697 (AA=1,640, AM=1,057)
    Possible match: N=80
      Maternity: N=24
        Match using names: matched N=24, not matched N=0
      Abortion: N=56
    No match: N=1,049
      Maternity: N=553
        Match using names: matched N=251, not matched N=302
      Abortion: N=496
  Subsequent conceptions: N=5,003
    Unique match: N=3,820
    Possible match: N=88
    No match: N=1,095

AA = two abortions; AM = abortion followed by maternity
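
The three-stage name matching applied to Group 3 records in section 2.5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the ONS implementation: the field names, the norm helper and the exact-comparison rule are hypothetical, and the manual checks that allowed for differences in spelling are not modelled.

```python
def norm(name):
    """Lower-case and keep letters only (an assumed normalisation, not from the report)."""
    return "".join(ch for ch in (name or "").lower() if ch.isalpha())


def match_quality(birth, pupil):
    """Return the match quality (1, 2 or 3) for one birth/pupil pair, or None."""
    if birth["dob"] != pupil["dob"]:
        return None  # date of birth is the first requirement
    if norm(birth["forenames"]) != norm(pupil["forenames"]):
        return None  # every stage requires the forenames to agree
    # Both the registered surname and the maiden name are tried, if present.
    surnames = {norm(birth.get("surname")), norm(birth.get("maiden_name"))} - {""}
    pupil_surname = norm(pupil["surname"])
    if pupil_surname in surnames:
        return 1  # forenames and full surname agree
    if pupil_surname and any(s[:4] == pupil_surname[:4] for s in surnames):
        return 2  # forenames and first four characters of the surname agree
    return 3  # forenames only; the report checked these matches manually


def best_match(birth, pupils):
    """Return (pupil_id, quality) for the strongest candidate, or None if there is none."""
    scored = [(match_quality(birth, p), p["pupil_id"]) for p in pupils]
    scored = [(quality, pupil_id) for quality, pupil_id in scored if quality is not None]
    if not scored:
        return None
    quality, pupil_id = min(scored)  # lower quality number means a stronger match
    return pupil_id, quality


# Invented example records (hypothetical, not real data)
birth = {"dob": "1990-05-01", "forenames": "Sarah Jane",
         "surname": "Smith", "maiden_name": None}
pupils = [
    {"pupil_id": "P1", "dob": "1990-05-01", "forenames": "sarah-jane", "surname": "Smithson"},
    {"pupil_id": "P2", "dob": "1990-05-01", "forenames": "Sarah Jane", "surname": "Jones"},
    {"pupil_id": "P3", "dob": "1990-05-02", "forenames": "Sarah Jane", "surname": "Smith"},
]
```

Scoring each candidate pair and keeping the lowest quality number gives the same result as running the three stages one after another, because a record only reaches a later stage when the earlier ones fail.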