Drug Development Statistics & Data Management
Data Handling I: Data Preparation and Data Cleaning

Dr Yanzhong Wang
Lecturer in Medical Statistics
Division of Health and Social Care Research
King's College London
Email: yanzhong.wang@kcl.ac.uk

Session objectives
• Part 1:
  – How to create a simple data file ready for analysis
  – How to create data files for large scale studies
  – Case study on data checking/cleaning
• Part 2:
  – Advantages and disadvantages of various summary statistics
  – Select appropriate summary statistics for categorical, binary, ordinal and continuous data.
Reading: Statistics at Square One, Chapter 2.

Outline of computerisation of data
• Plan – at protocol stage of study.
• Data entry.
• Data checking and editing of individual files.
• Merging and appending files if necessary.
• Cross-checking of merged files.
• Data analysis.

Planning stage
• Design of data collection forms, e.g. questionnaires, clinical data sheets.
• Coding instructions.
• Decide on data entry program.
• Decide on eventual data analysis program.
• Ensure compatibility between data to be entered in different files.
• Set up data entry.

Design of data collection forms
Example: European Community Respiratory Health Survey

Questionnaires - layout
• Readability and attractiveness to responder or interviewer.
• Readability and lack of ambiguity for data entry clerk.
• Collect date of birth and date of the occasion (e.g. interview or visit), not age.
• Other issues later in course.

Coding instructions
• Unique identification required for each individual in the study – must be included in each separate set of data.
• Assign a unique NUMERIC code to each category of categorical/qualitative data, e.g. male – 1, female – 2.
• Codes may be printed on the questionnaire, applied at data entry, or produced later by text-to-numeric conversion.
• Decide on a code for ‘missing’ data – should be a number well away from possible data values, e.g. 9 for gender, 999 for weight in kg (if ‘blank’ is used, be sure that it will be transferred as ‘missing’).
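
The coding rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the codes (male 1, female 2, missing 9 for gender, 999 for weight) come from the slide, while the function names and the two example records are made up for the sketch.

```python
# Coding scheme from the slide; the names GENDER_CODES / MISSING_CODES are hypothetical.
GENDER_CODES = {"male": 1, "female": 2}
MISSING_CODES = {"gender": 9, "weight": 999}


def encode_gender(text):
    """Map a text answer to its numeric code; blank or unrecognised text gets the missing code."""
    return GENDER_CODES.get(text.strip().lower(), MISSING_CODES["gender"])


def decode_missing(record):
    """Return a copy of the record with missing-value codes replaced by None."""
    return {
        name: None if MISSING_CODES.get(name) == value else value
        for name, value in record.items()
    }


# Two made-up records: one complete, one with a blank gender and the weight missing code.
raw = [
    {"id": 1, "gender": "Male", "weight": 82},
    {"id": 2, "gender": "", "weight": 999},
]

coded = [{**row, "gender": encode_gender(row["gender"])} for row in raw]
clean = [decode_missing(row) for row in coded]

print(coded)
print(clean)
```

Keeping the missing codes in one table makes it harder for a value such as 999 to slip into a mean or a histogram as if it were a real weight.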
ECRHS coding instructions: General
ECRHS – European Community Respiratory Health Survey

ECRHS coding instructions: Specific

Data entry
• Excel – part of Microsoft Office package so almost always available.
• Access – part of Microsoft Office Professional.
• Stata and SPSS – statistical analysis programs, available at King’s.
• Epi-Data – freeware from http://www.epidata.dk/

Data entry - Epi-Data

Data entry programs
• Small amounts of data can be entered in Excel, Stata or SPSS.
• Verification/double entry? – necessary except for small amounts of data.
• Verification/double entry is most easily carried out if data are entered using Epi-Data.
• Microsoft Access – sophisticated data entry, but verification requires complex programming.

Data analysis programs
• Stata – popular with medical statisticians and epidemiologists.
  – flexible, powerful, very few drawbacks
• SPSS – popular with sociologists and psychologists.
• SAS – popular with statisticians in pharmaceutical R&D.
• R or S-plus – popular with academic statisticians.
• Beware of little-known packages.
  – Unknown, limited validation

File transfer between programs
• Excel can read and write delimited text files. (A delimited text file is one in which each line of text is a record, and the fields are separated by a known character such as a comma or a tab.)
• Stata, SPSS and most statistical packages can read and write delimited text files or Excel files.
• Data can be exported directly from Epi-Data to Stata or SPSS.
• Variable names/labels preserved in most cases.

Program formats
• Each program has its own special format
• File extensions tell you the file format
  – Excel .xls
  – Stata .dta
  – SPSS .sav
  – Access .mdb
  – Epi-Data .rec
  – Comma separated file .csv

Spreadsheet & comma-separated files
• Each spreadsheet has its own ‘format’, but it is possible to write a ‘comma-separated file’ which can be read by other programs.
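
To illustrate how another program reads a comma-separated file, here is a minimal Python sketch using the standard csv module. The header and the two records are taken from the example file in these slides; reading from an in-memory string rather than a file on disk, and the variable names, are assumptions made to keep the sketch self-contained.

```python
import csv
import io

# Header and first two records from the slides' example file.
# With a real file, replace io.StringIO(text) by open("yourfile.csv", newline="").
text = """PID,CENTRE,REC DATE,BASE_SS,TPA,DOSE,AGE,TIME TO TREAT,DAY 90 OK,FINAL_DATE,SSS,CHANGE,
1,26620,24-Nov-00,30,0,0.1,81,3.42,Y,22-Feb-01,42,12,
3,26620,28-Nov-00,39,0,0.2,74,3.75,Y,01-Mar-01,58,19,
"""

reader = csv.reader(io.StringIO(text))

# Every line ends with a trailing comma, which produces an empty last field;
# drop empty names from the header so only real variables remain.
header = [name.strip() for name in next(reader) if name.strip()]

records = []
for row in reader:
    if not row:
        continue  # skip blank lines
    # zip stops at the shorter sequence, so the empty trailing field is ignored.
    records.append(dict(zip(header, row)))

# All fields arrive as text; convert explicitly where numbers are needed.
ages = [int(record["AGE"]) for record in records]
print(header)
print(ages)
```

Note that every value is read as text, so dates such as 24-Nov-00 and numbers such as 3.42 still have to be converted before analysis, which is one reason to decide early how dates are to be handled.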
Spreadsheet & comma-separated files
PID,CENTRE,REC DATE,BASE_SS,TPA,DOSE,AGE,TIME TO TREAT,DAY 90 OK,FINAL_DATE,SSS,CHANGE,
1,26620,24-Nov-00,30,0,0.1,81,3.42,Y,22-Feb-01,42,12,
3,26620,28-Nov-00,39,0,0.2,74,3.75,Y,01-Mar-01,58,19,
5,25224,29-Nov-00,39,0,0.1,51,4.5,Y,27-Feb-01,58,19,
7,30912,30-Nov-00,31,0,0.8,61,4.25,Y,27-Feb-01,52,21,
9,30969,05-Dec-00,40,0,0.2,96,5.25,Y,06-Mar-01,55,15,
11,27460,08-Dec-00,28,0,0,80,3.02,Y,08-Mar-01,58,30,

Steps for most studies
• Data entered in Excel (or Epi-Data).
• Data transferred (‘exported’) to Stata or SPSS.
• Data checking and editing.
• Data analysis.

Setting up data entry in Epi-Data
• Should correspond to questionnaire or other data collection form.
• Allowable data determined by coding instructions.
• Set ranges for quantitative data, e.g. height.
• Data entry “clerk” should not be constantly checking.
• Decide how dates are to be handled.

Preliminary editing
• If data were entered as text codes, convert them to numeric codes.
• Text to numeric conversion is simple in Excel.

Data checking
• Data correspond with coding instructions.
• Data correspond with plausible/possible distribution.
• Graphs and tables can identify if there is a problem, e.g. outliers and missing values.
• Listing selected data is required to identify where there is a problem.

Multiple files
• Data from different centres need to be appended – add more rows (more records).
• Data from different sources/questionnaires/time periods for the same individuals need to be merged (e.g. cohort studies and RCTs) – add more columns (more variables).
• Efficient to enter data in separate files if not all data apply to all individuals, e.g. special questionnaire for women.

Compatible files
• If files are to be appended they must contain the same variables, with the same names (columns), for different people (rows).
• If files are to be merged they should contain different data (columns) for the same individuals.
  The identification number(s) need to be the same in each file for each individual, and the identification variable name needs to be the same in each file.

Graphs for checking data (1)
• Single continuous variable
  – Histogram can detect ‘outliers’, e.g. in height (also dot plot, some box and whisker plots).

[Figures: Histogram of age of SLSR patients; Boxplot of age of SLSR patients]

Graphs for checking data (2)
• Two continuous variables
  – Scatter plot can detect ‘outliers’, e.g. in weight for height.
• Follow with list of aberrant values.
• Graphs less useful for categorical data

[Figure: The relationship between aortic pulse wave velocity (Ao-PWV) and ambulatory arterial stiffness index (AASI) in patients with type 2 diabetes, microalbuminuria and systolic hypertension at baseline. Scatter plot titled "Linear relationship between Ao-PWV and AASI in patients with type2 diabetes", with fitted line and 95% confidence interval; x-axis BaseAASI, y-axis BasePWV.]

Tables for categorical data
Wheeze in last 12 months    Frequency (n)       %
No                                   1945    75.0
Yes                                   642    24.7
Not known                               8     0.3
Total                                2595   100.0

Tables for checking categorical data
. tab q1

         q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1945       74.95       74.95
          2 |        642       24.74       99.69
          9 |          8        0.31      100.00
------------+-----------------------------------
      Total |       2595      100.00

. list area id if q1==9

        area     id
 640.    110    640
1853.    110   1853
3280.    110   3280
3624.    110   3624
3663.    110   3663
4509.    110   4509
4623.    110   4623
4923.    110   4923

Missing data
• Convert to program missing value code before calculating summary statistics or plotting graphs
• E.g.
in Stata
  – mvdecode gender, mv(9)
  – mvdecode weight, mv(999)

Case study: Scottish Family History Study Data

SFHS Data Quality Report: PCQ data
• Report description: Draft Summary Report
• Prepared by: Yanzhong Wang
• Last run on: 07/11/2008 by Yanzhong Wang
• Report file name: SFHS_PCQ_check_report.doc
• Created by program: //Rcb-file-2000/Filestore/Studies/SFHS/statistics/programs/PCQ_datacheck_prog/SFHS_PCQ_check_v7.R
• Created using software: R version 2.5.1 (2007-06-27) for Windows
• Checked by: the program for producing this report has not been checked by a second statistician

Overview of SFHS PCQ data
• Duplicate records
  – The analysis data set PCQ combines subjects from both Pre-clinic questionnaire version 1 and version 2.
  – There are 6882 records in total in the PCQ data set and each record has 415 variables.
  – 6863 records have unique subject numbers.
  – 19 subject numbers appear more than once.
  – The number of subject numbers that appear more than twice is 0.
  – The duplicate subject numbers are SFT0400662 SFT0400659 SFT0400656 SFT0400660 SFT0400985 SFT0434277 SFG9500780 SFT0441461 SFT0435134 SFT0441530 SFT0435427 SFT0435155 SFT0441473 SFT0435157 SFT0441602 SFT0435584 SFT0435438 SFG9501775 SFT0435630.
  – The 19 duplicated records are omitted from further analysis, leaving a total of 6863 records with unique subject numbers.

Overview of SFHS PCQ data (continued)
• Blank variables
  – There are 5 variables which contain NA for all subjects. They are:
    "PCQCIV4" – “Prescribed injection / Suppository 4”,
    "PCQCIV6" – “Prescribed injection / Suppository 6”,
    "PCQEHB" – “Family Health / Breast cancer – brother (version 1)”,
    "PCQEKS" – “Family Health / Prostate cancer – sister (version 1)”,
    "PCQELB" – “Family Health / Hip fracture – brother (version 1)”.

Overview of SFHS PCQ data (continued)
• Pre-processing data
All variables are annotated in the following format (Variable name | Description | Data type), e.g.
PCQH1 | Troubled by pain/discomfort? | Categorical
Variables are formatted into numeric, categorical and text data accordingly.
“NA” denotes a missing value.
Summary tables and histograms are produced based on the pre-processed data.

Methods
Summary tables
For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables:
  – Numeric variable: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points.
  – Categorical variable: number (%) in each category.
Histogram
For each numeric variable, a histogram is used to show how the data are distributed (frequencies in specified categories).
Bar chart
For each categorical variable, a bar chart is used for comparing values (counts or percentages) in different categories.

Output 2. Summary of variables for Chest pain (Angina).
For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables. Numeric: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points. Categorical: number (%) in each category.
Produced by Yanzhong Wang on Fri Nov 07 16:18:52 2008

Variable | Data type | N (%) recorded data | Summary
Pre-clinic questionnaire date | Num | 6833 (99.6%) | Min. 2002-02-20; 1st Qu. 2007-03-07; Median 2007-09-18; Mean 2007-09-16; 3rd Qu. 2008-05-13; Max. 2009-09-01
PRESENCE OF ANGINA | Cat | 6863 (100.0%) | Yes: 134 (2.0%); No: 6729 (98.0%)
SEVERITY OF ANGINA | Cat | 6863 (100.0%) | Grade 1: 107 (1.6%); Grade 2: 24 (0.3%); NA: 6732 (98.1%)
PAIN OF POSSIBLE INFARCTION | Cat | 1725 (25.1%) | Yes: 256 (14.8%); No: 1469 (85.2%)

Output 5. Summary of variables for smoking history.
For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given.
Additional summary statistics are given for numeric and categorical variables. Numeric: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points. Categorical: number (%) in each category.
Produced by Yanzhong Wang on Fri Nov 07 16:18:57 2008

Variable | Data type | N (%) recorded data | Summary
EVER SMOKED TOBACCO? | Cat | 6777 (98.7%) | 1: 1201 (17.7%); 2: 222 (3.3%); 3: 1748 (25.8%); 4: 3606 (53.2%)
AGE WHEN STARTED SMOKING | Num | 3165 (46.1%) | Min. 0; 1st Qu. 14; Median 16; Mean 16.78; 3rd Qu. 18; Max. 62; NA's 3698
CIGARETTES/WEEK (MAX) | Num | 2749 (40.1%) | Min. 0; 1st Qu. 35; Median 100; Mean 95.68; 3rd Qu. 140; Max. 560; NA's 4114
PACKETS OF TOBACCO/WEEK (MAX) | Num | 622 (9.1%) | Min. 0; 1st Qu. 0; Median 2; Mean 3.322; 3rd Qu. 6; Max. 81; NA's 6241
CIGARS/WEEK (MAX) | Num | 343 (5.0%) | Min. 0; 1st Qu. 0; Median 0; Mean 6.449; 3rd Qu. 3.5; Max. 99; NA's 6520
HOW LONG SINCE GIVING UP SMOKING? | Num | 1837 (26.8%) | Min. 0; 1st Qu. 4; Median 13; Mean 15.49; 3rd Qu. 25; Max. 92; NA's 5026
WHY DID YOU GIVE UP SMOKING? | Cat | 1804 (26.3%) | 1: 60 (3.3%); 2: 1641 (91.0%); 3: 103 (5.7%)

Education and occupation

Smoking history

PCQ data: outliers/extreme values report
PCQ data: outliers/extreme values report by Yanzhong Wang on Fri Nov 07 15:27:37 2008
(Each outlier is listed as subject number, value.)

1. PCQN1 – TOTAL YEARS IN FULL-TIME STUDY
   Low_limit: 2
   Outlier_low_SNO_value: SFG9500481, 2; SFT0434849, 0; SFG9500819, 1; SFG9501105, 0
   High_limit: 25
   Outlier_high_SNO_value: SFT0400530, 25; SFG9500408, 27; SFT0434350, 25; SFG9501237, 27; SFT0435564, 31; SFT0435386, 28; SFT0441832, 29; SFT0435764, 39; SFG9502609, 25; SFG9501887, 27; SFG9502363, 26; SFT0442041, 27; SFT0442200, 30

2. PCQN5III – HOURS/WEEK WORKING AT NIGHT 7PM7AM
   Low_limit: none given
   Outlier_low_SNO_value: none listed
   High_limit: 50
   Outlier_high_SNO_value: SFT0400539, 50; SFT0400564, 60; SFT0400595, 60; SFT0400627, 50; SFT0400777, 73; SFT0400793, 50; SFT0401048, 56; SFG9500441, 54; SFT0401182, 70; SFT0434230, 50; SFT0434367, 60; SFT0434391, 60; SFT0441084, 72; SFG9500487, 55; SFG9500466, 60; SFG9500845, 60; SFG9501128, 60; SFG9501326, 50; SFT0441418, 50; SFT0435323, 50; SFT0435389, 56; SFT0441788, 60; SFG9502202, 59; SFG9502156, 65; SFT0441972, 50; SFT0435854, 54

3. PCQO1 – NO OF PEOPLE LIVE IN HOUSEHOLD?
   Low_limit: 0
   Outlier_low_SNO_value: SFG9500124, 0; SFT0401027, 0; SFT0434305, 0; SFT0434390, 0; SFT0434428, 0; SFT0441113, 0; SFT0434541, 0; SFT0434828, 0; SFG9500674, 0; SFG9500949, 0; SFT0435099, 0; SFT0435778, 0
   High_limit: 8
   Outlier_high_SNO_value: SFG9500474, 10; SFG9500998, 8; SFG9501593, 9; SFT0441511, 10; SFT0435430, 8; SFT0441995, 9

Lunch time
See you at 1pm for Part 2