Data Handling I: Data Preparation and Data Cleaning

Drug Development Statistics & Data Management
Dr Yanzhong Wang
Lecturer in Medical Statistics
Division of Health and Social Care Research
King's College London
Email: yanzhong.wang@kcl.ac.uk
Session objectives
• Part 1:
– How to create a simple data file ready for analysis
– How to create data files for large scale studies
– Case study on data checking/cleaning
• Part 2:
– Advantages and disadvantages of various summary statistics
– Select appropriate summary statistics for categorical, binary, ordinal and continuous data.
Reading:
Statistics at Square One, Chapter 2.
Outline of computerisation of data
• Plan – at protocol stage of study.
• Data entry.
• Data checking and editing of individual files.
• Merging and appending files if necessary.
• Cross-checking of merged files.
• Data analysis.
Planning stage
• Design of data collection forms, e.g. questionnaires,
clinical data sheets.
• Coding instructions.
• Decide on data entry program.
• Decide on eventual data analysis program.
• Ensure compatibility between data to be entered in
different files.
• Set up data entry.
Design of data collection forms
Example: European Community Respiratory Health Survey
Questionnaires - layout
• Readability and attractiveness to responder or
interviewer.
• Readability and lack of ambiguity for data
entry clerk.
• Collect date of birth and date of the occasion (e.g. the interview), not age; age can then be calculated from the two dates.
• Other issues later in course.
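A minimal Python sketch of why both dates are collected: age at the occasion can be derived exactly from the two dates, whereas a recorded age cannot be turned back into dates. The function name age_in_years and the example dates are illustrative assumptions, not part of the course material.

```python
from datetime import date


def age_in_years(date_of_birth: date, occasion: date) -> int:
    """Completed years between date of birth and the date of the occasion."""
    if occasion < date_of_birth:
        raise ValueError("occasion date is before date of birth")
    years = occasion.year - date_of_birth.year
    # Subtract one year if the birthday has not yet been reached in the occasion year.
    if (occasion.month, occasion.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Hypothetical respondent: born 15 June 1950, interviewed 7 November 2008.
print(age_in_years(date(1950, 6, 15), date(2008, 11, 7)))
```

The same two dates also allow an impossible combination (occasion before birth) to be caught at the checking stage.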
Coding instructions
• Unique identification required for each individual in the
study – must be included in each separate set of data.
• Assign unique NUMERIC code to all
categorical/qualitative data, e.g. male – 1, female – 2.
• Codes may be printed on the questionnaire, applied during data entry, or assigned later by text-to-numeric conversion.
• Decide on code for ‘missing’ data – should be a number well away from possible data, e.g. 9 for gender, 999 for weight in kg (if using ‘blank’, make sure that it will be transferred as ‘missing’).
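The coding rules above can be sketched in Python as a lookup table plus an explicit missing code. The codes 1, 2 and 9 are the ones given on the slide; the function name code_gender and the decision to reject unrecognised text (rather than silently treat it as missing) are assumptions made for this illustration.

```python
# Numeric codes from the coding instructions: male = 1, female = 2, missing = 9.
GENDER_CODES = {"male": 1, "female": 2}
MISSING_GENDER = 9


def code_gender(text: str) -> int:
    """Convert a text answer to its numeric code."""
    cleaned = text.strip().lower()
    if cleaned == "":
        # A blank answer is stored as the agreed missing code, never left empty.
        return MISSING_GENDER
    if cleaned not in GENDER_CODES:
        # Anything outside the coding instructions is flagged for checking.
        raise ValueError(f"unrecognised gender value: {text!r}")
    return GENDER_CODES[cleaned]


print([code_gender(answer) for answer in ["Male", "female ", ""]])
```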
ECRHS coding instructions
General
ECRHS – European Community Respiratory Health Survey
ECRHS coding instructions
Specific
Data entry
• Excel – part of Microsoft Office package so
almost always available.
• Access – part of Microsoft Office Professional.
• Stata and SPSS – statistical analysis programs,
available at King’s.
• Epi-Data – freeware from
http://www.epidata.dk/
Data entry - Epi-Data
Data entry programs
• Small amounts of data can be entered in
Excel, Stata or SPSS.
• Verification/double entry? – necessary except
for small amounts of data.
• Verification/double entry most easily carried
out if data entered using Epi-Data.
• Microsoft Access – sophisticated data entry,
but verification requires complex
programming.
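To illustrate what verification by double entry does, here is a minimal Python sketch that compares two independently entered copies of the same data and lists every disagreement. It is not how Epi-Data implements verification; the function name compare_entries, the field names and the example values are all hypothetical.

```python
import csv
import io


def compare_entries(first_csv: str, second_csv: str, id_field: str = "id") -> list:
    """Return (id, field, first value, second value) for every disagreement."""
    first = {row[id_field]: row for row in csv.DictReader(io.StringIO(first_csv))}
    second = {row[id_field]: row for row in csv.DictReader(io.StringIO(second_csv))}
    discrepancies = []
    for record_id in sorted(set(first) | set(second)):
        a, b = first.get(record_id), second.get(record_id)
        if a is None or b is None:
            # The record was keyed in one entry but not in the other.
            discrepancies.append((record_id, "<record>",
                                  "present" if a else "absent",
                                  "present" if b else "absent"))
            continue
        for field in a:
            if a[field] != b.get(field):
                discrepancies.append((record_id, field, a[field], b.get(field)))
    return discrepancies


# Hypothetical first and second entry of the same three forms.
entry_1 = "id,gender,weight\n1,1,72\n2,2,58\n3,1,80\n"
entry_2 = "id,gender,weight\n1,1,72\n2,2,85\n3,1,80\n"
print(compare_entries(entry_1, entry_2))
```

Each listed discrepancy is then resolved by going back to the original data collection form.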
Data analysis programs
• Stata – popular with medical statisticians and epidemiologists.
– flexible, powerful, very few drawbacks
• SPSS – popular with sociologists and psychologists.
• SAS – popular with statisticians in pharmaceutical R&D.
• R or S-plus – popular with academic statisticians.
• Beware of little-known packages.
– Unknown, limited validation
File transfer between programs
• Excel can read and write delimited text files.
(A delimited text file is one in which each line of text is a record and the fields are separated by a known character, such as a comma or tab.)
• Stata, SPSS and most statistical packages can read and write delimited text files or Excel files.
• Data can be exported directly from Epi-Data to Stata or SPSS.
• Variable names/labels preserved in most cases.
Program formats
• Each program has its own special format
• File extensions tell you the file format
– Excel .xls
– Stata .dta
– SPSS .sav
– Access .mdb
– Epi-Data .rec
– Comma separated file .csv
Spreadsheet & comma-separated files
• Each spreadsheet has its own ‘format’, but it is
possible to write a ‘comma-separated file’
which can be read by other programs.
Spreadsheet & comma-separated files (continued)
PID,CENTRE,REC DATE,BASE_SS,TPA,DOSE,AGE,TIME TO TREAT,DAY 90 OK,FINAL_DATE,SSS,CHANGE,
1,26620,24-Nov-00,30,0,0.1,81,3.42,Y,22-Feb-01,42,12,
3,26620,28-Nov-00,39,0,0.2,74,3.75,Y,01-Mar-01,58,19,
5,25224,29-Nov-00,39,0,0.1,51,4.5,Y,27-Feb-01,58,19,
7,30912,30-Nov-00,31,0,0.8,61,4.25,Y,27-Feb-01,52,21,
9,30969,05-Dec-00,40,0,0.2,96,5.25,Y,06-Mar-01,55,15,
11,27460,08-Dec-00,28,0,0,80,3.02,Y,08-Mar-01,58,30,
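As a rough illustration of how another program reads such a file, the Python sketch below parses the six records shown above with the standard csv module. The function name read_records is hypothetical, and the cross-check that CHANGE equals SSS minus BASE_SS is simply a pattern that holds in these six rows, not a documented definition of the variable.

```python
import csv
import io
from datetime import datetime

# The six records shown above, with the header on a single line.
SAMPLE = """\
PID,CENTRE,REC DATE,BASE_SS,TPA,DOSE,AGE,TIME TO TREAT,DAY 90 OK,FINAL_DATE,SSS,CHANGE,
1,26620,24-Nov-00,30,0,0.1,81,3.42,Y,22-Feb-01,42,12,
3,26620,28-Nov-00,39,0,0.2,74,3.75,Y,01-Mar-01,58,19,
5,25224,29-Nov-00,39,0,0.1,51,4.5,Y,27-Feb-01,58,19,
7,30912,30-Nov-00,31,0,0.8,61,4.25,Y,27-Feb-01,52,21,
9,30969,05-Dec-00,40,0,0.2,96,5.25,Y,06-Mar-01,55,15,
11,27460,08-Dec-00,28,0,0,80,3.02,Y,08-Mar-01,58,30,
"""


def read_records(text: str) -> list:
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        # Each line ends with a comma, which creates an empty unnamed column.
        row.pop("", None)
        for name in ("BASE_SS", "SSS", "CHANGE", "AGE"):
            row[name] = int(row[name])
        # Dates are day-month-year with an English month abbreviation
        # (%b depends on the locale; an English or C locale is assumed).
        row["REC DATE"] = datetime.strptime(row["REC DATE"], "%d-%b-%y").date()
        records.append(row)
    return records


records = read_records(SAMPLE)
# Simple cross-check between variables within each record.
inconsistent = [r["PID"] for r in records if r["SSS"] - r["BASE_SS"] != r["CHANGE"]]
print(len(records), inconsistent)
```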
Steps for most studies
• Data entered in Excel (or Epi-Data).
• Data transferred (‘exported’) to Stata or SPSS.
• Data checking and editing.
• Data analysis.
Setting up data entry in Epi-Data
• Should correspond to questionnaire or other
data collection form.
• Allowable data determined by coding
instructions.
• Set ranges for quantitative data, e.g. Height.
• Data entry “clerk” should not be constantly
checking.
• Decide how dates are to be handled.
Preliminary editing
• If data were entered as text codes, convert them to numeric codes.
• Text to numeric conversion simple in Excel.
Data checking
• Data correspond with coding instructions.
• Data correspond with plausible/possible
distribution.
• Graphs, tables can identify if there is a
problem, e.g. outliers and missing values.
• Listing selected data required to identify
where there is a problem.
Multiple files
• Data from different centres need to be
appended – add more rows (more records).
• Data from different
sources/questionnaires/time periods for the
same individuals need to be merged (e.g.
cohort studies and RCTs) – add more columns
(more variables).
• Efficient to enter data in separate files if not
all data apply to all individuals, e.g. special
questionnaire for women.
Compatible files
• If files are to be appended they must contain the same variable names (columns) for different people (rows).
• If files are to be merged they should contain different data (columns) for the same individuals. The identification number(s) need to be the same in each file for each individual, and the identification variable name needs to be the same in each file.
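The difference between appending and merging can be sketched in Python with each file held as a list of records. This is an illustration under assumed names (append_files, merge_files, the id variable and the example values are all hypothetical); statistical packages provide their own commands for both operations.

```python
def append_files(first: list, second: list) -> list:
    """Add more rows: both files must have the same variable names."""
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("files to be appended must have the same variable names")
    return first + second


def merge_files(left: list, right: list, id_field: str = "id") -> list:
    """Add more columns: records are matched on the identification variable."""
    right_by_id = {row[id_field]: row for row in right}
    right_fields = [name for name in (right[0] if right else {}) if name != id_field]
    merged = []
    for row in left:
        match = right_by_id.get(row[id_field])
        # Individuals with no record in the second file get missing values.
        extra = {name: (match[name] if match else None) for name in right_fields}
        merged.append({**row, **extra})
    return merged


# Hypothetical data: two centres (append), then a follow-up file (merge).
centre_a = [{"id": 1, "age": 60}]
centre_b = [{"id": 2, "age": 71}]
baseline = append_files(centre_a, centre_b)
follow_up = [{"id": 1, "sbp": 140}]
print(merge_files(baseline, follow_up))
```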
Graphs for checking data (1)
• Single continuous variable
– Histogram can detect ‘outliers’, e.g. in height (also dot plot, some box and whisker plots).
Histogram of age of SLSR patients
Boxplot of age of SLSR patients
Graphs for checking data (2)
• Two continuous variables
– Scatter plot can detect ‘outliers’, e.g. in weight for
height.
• Follow with list of aberrant values.
• Graphs less useful for categorical data
The relationship between aortic pulse wave velocity (Ao-PWV) and ambulatory arterial stiffness index (AASI) in patients with type 2 diabetes, microalbuminuria and systolic hypertension at baseline
Figure: Linear relationship between Ao-PWV and AASI in patients with type 2 diabetes. Scatter plot of BasePWV (y axis, 8 to 20) against BaseAASI (x axis, 0.2 to 0.8) with fitted line and 95% confidence interval.
Tables for categorical data
Wheeze in last 12 months    Frequency (n)        %
No                                   1945     75.0
Yes                                   642     24.7
Not known                               8      0.3
Total                                2595    100.0
Tables for checking categorical data
. tab q1

         q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1945       74.95       74.95
          2 |        642       24.74       99.69
          9 |          8        0.31      100.00
------------+-----------------------------------
      Total |       2595      100.00

. list area id if q1==9

        area     id
 640.    110    640
1853.    110   1853
3280.    110   3280
3624.    110   3624
3663.    110   3663
4509.    110   4509
4623.    110   4623
4923.    110   4923
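For readers without Stata, the two steps above (tabulate a coded variable, then list the records holding a suspect code) might look like this in Python. The function names and the four example records are hypothetical; only the idea of tabulating and then listing comes from the slide.

```python
from collections import Counter


def tabulate(values: list) -> list:
    """Return (code, frequency, percent) for each code, in code order."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(code, n, round(100 * n / total, 2)) for code, n in sorted(counts.items())]


def list_ids_with_code(records: list, variable: str, code: int) -> list:
    """Identify which individuals hold a given code, so the forms can be checked."""
    return [r["id"] for r in records if r[variable] == code]


# Hypothetical records: q1 coded 1 = no, 2 = yes, 9 = not known.
records = [
    {"id": 101, "q1": 1},
    {"id": 102, "q1": 1},
    {"id": 103, "q1": 2},
    {"id": 104, "q1": 9},
]
print(tabulate([r["q1"] for r in records]))
print(list_ids_with_code(records, "q1", 9))
```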
Missing data
• Convert to program missing value code before
calculating summary statistics or plotting
graphs
• E.g. in Stata
– mvdecode gender, mv(9)
– mvdecode weight, mv(999)
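A small Python sketch of why the conversion must come first: if the code 999 is left in place it is treated as a real weight and distorts the summary. The function name decode_missing and the example weights are hypothetical; this mirrors the idea of mvdecode, not its implementation.

```python
from statistics import mean


def decode_missing(values: list, missing_code: int) -> list:
    """Replace the study's missing code with None (the program's own missing value)."""
    return [None if v == missing_code else v for v in values]


# Hypothetical weights in kg, with 999 used as the missing code.
weights = [70, 82, 999, 65]
clean = [w for w in decode_missing(weights, 999) if w is not None]

print(mean(weights))  # mean with 999 wrongly treated as a weight
print(mean(clean))    # mean of the three recorded weights
```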
Case study: Scottish Family Health Study Data
• SFHS Data Quality Report: PCQ data
– Report description: Draft Summary Report
– Prepared by: Yanzhong Wang
– Last run on: 07/11/2008 by Yanzhong Wang
– Report file name: SFHS_PCQ_check_report.doc
– Created by program: //Rcb-file-2000/Filestore/Studies/SFHS/statistics/programs/PCQ_datacheck_prog/SFHS_PCQ_check_v7.R
– Created using software: R version 2.5.1 (2007-06-27) for Windows
– Checked by: The program for producing this report has not been checked by a second statistician
Overview of SFHS PCQ data
• Duplicate records
– The analysis data set PCQ combines subjects from both Pre-clinic questionnaire version 1 and version 2.
– There are 6882 records in total in the PCQ data set and each record has 415 variables.
– The data set contains 6863 distinct subject numbers; 19 of them appear twice and none appear more than twice.
– The duplicate subject numbers are SFT0400662 SFT0400659 SFT0400656 SFT0400660 SFT0400985 SFT0434277 SFG9500780 SFT0441461 SFT0435134 SFT0441530 SFT0435427 SFT0435155 SFT0441473 SFT0435157 SFT0441602 SFT0435584 SFT0435438 SFG9501775 SFT0435630.
– The 19 duplicated records are omitted from further analysis, leaving a total of 6863 records with unique subject numbers.
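A check of this kind can be sketched in a few lines of Python. The report itself was produced in R and does not say which record of each duplicated pair was omitted, so keeping the first occurrence is an assumption here, as are the function names, the variable name sno and the example subject numbers.

```python
from collections import Counter


def duplicate_ids(records: list, id_field: str = "sno") -> list:
    """Subject numbers that appear more than once."""
    counts = Counter(r[id_field] for r in records)
    return sorted(sno for sno, n in counts.items() if n > 1)


def drop_repeats(records: list, id_field: str = "sno") -> list:
    """Keep the first record for each subject number (an assumed rule)."""
    seen = set()
    kept = []
    for r in records:
        if r[id_field] not in seen:
            seen.add(r[id_field])
            kept.append(r)
    return kept


# Hypothetical records: subject A001 was entered twice.
records = [
    {"sno": "A001", "x": 1},
    {"sno": "A002", "x": 2},
    {"sno": "A001", "x": 3},
]
print(duplicate_ids(records))
print(len(drop_repeats(records)))
```

In practice the duplicated records would be listed and compared before deciding which to keep.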
Overview of SFHS PCQ data (continued)
• Blank variables
• Five variables contain only NAs, i.e. they are missing for every subject. They are:
• "PCQCIV4" – "Prescribed injection / Suppository 4",
• "PCQCIV6" – "Prescribed injection / Suppository 6",
• "PCQEHB" – "Family Health / Breast cancer – brother (version 1)",
• "PCQEKS" – "Family Health / Prostate cancer – sister (version 1)",
• "PCQELB" – "Family Health / Hip fracture – brother (version 1)".
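Finding blank variables amounts to asking, for each variable, whether every record is missing. A minimal Python sketch, with a hypothetical function name and made-up variable names, and with None standing in for NA:

```python
def blank_variables(records: list) -> list:
    """Names of variables that are missing (None) for every record."""
    names = list(records[0].keys()) if records else []
    return [name for name in names if all(r.get(name) is None for r in records)]


# Hypothetical data set: variable "v2" was never filled in.
records = [
    {"v1": 1, "v2": None, "v3": None},
    {"v1": None, "v2": None, "v3": 4},
]
print(blank_variables(records))
```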
Overview of SFHS PCQ data (continued)
• Pre-processing data
All variables are annotated in the following format:
Variable name    Description                     Data type
e.g. PCQH1       Troubled by pain/discomfort?    Categorical
• Variables are formatted into numeric, categorical and text data accordingly.
• “NA” denotes a missing value.
• Summary tables and histograms are produced based on the pre-processed data.
Methods
Summary tables
For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables:
• Numeric variable: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points.
• Categorical variable: number (%) in each category.
Histogram
For each numeric variable, a histogram is used to show how the data are distributed (frequencies in specified categories).
Bar chart
For each categorical variable, a bar chart is used for comparing values (counts or percentages) in different categories.
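The numeric summary described above can be sketched in Python as follows. The report was produced in R; this sketch uses the "inclusive" method of statistics.quantiles, which interpolates in the same way as R's default quantile type, but the function name summarise_numeric and the example values are assumptions. It needs at least two non-missing values.

```python
from statistics import mean, quantiles


def summarise_numeric(values: list) -> dict:
    """Minimum, quartiles, median, mean, maximum and number of missing values."""
    present = [v for v in values if v is not None]
    # n=4 returns the three cut points: 1st quartile, median, 3rd quartile.
    q1, median, q3 = quantiles(present, n=4, method="inclusive")
    return {
        "min": min(present),
        "q1": q1,
        "median": median,
        "mean": mean(present),
        "q3": q3,
        "max": max(present),
        "n_missing": len(values) - len(present),
    }


# Hypothetical variable with one missing value.
print(summarise_numeric([1, 2, 3, 4, 5, None]))
```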
Output 2. Summary of variables for Chest pain (Angina). For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables. Numeric: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points. Categorical: number (%) in each category. Produced by Yanzhong Wang on Fri Nov 07 16:18:52 2008
Variable | Data type | N (%) recorded data | Summary
Pre-clinic questionnaire date | Num | 6833 (99.6%) | Min. 2002-02-20; 1st Qu. 2007-03-07; Median 2007-09-18; Mean 2007-09-16; 3rd Qu. 2008-05-13; Max. 2009-09-01
PRESENCE OF ANGINA | Cat | 6863 (100.0%) | Yes: 134 (2.0%); No: 6729 (98.0%)
SEVERITY OF ANGINA | Cat | 6863 (100.0%) | Grade 1: 107 (1.6%); Grade 2: 24 (0.3%); NA: 6732 (98.1%)
PAIN OF POSSIBLE INFARCTION | Cat | 1725 (25.1%) | Yes: 256 (14.8%); No: 1469 (85.2%)
Output 5. Summary of variables for smoking history. For each variable the type of data (numeric, categorical or text) and the number (%) of non-missing data are given. Additional summary statistics are given for numeric and categorical variables. Numeric: minimum, 1st quartile, median, mean, 3rd quartile, maximum, number of missing data points. Categorical: number (%) in each category. Produced by Yanzhong Wang on Fri Nov 07 16:18:57 2008
Variable | Data type | N (%) recorded data | Summary
EVER SMOKED TOBACCO? | Cat | 6777 (98.7%) | 1: 1201 (17.7%); 2: 222 (3.3%); 3: 1748 (25.8%); 4: 3606 (53.2%)
AGE WHEN STARTED SMOKING | Num | 3165 (46.1%) | Min. 0; 1st Qu. 14; Median 16; Mean 16.78; 3rd Qu. 18; Max. 62; NA's 3698
CIGARETTES/WEEK (MAX) | Num | 2749 (40.1%) | Min. 0; 1st Qu. 35; Median 100; Mean 95.68; 3rd Qu. 140; Max. 560; NA's 4114
PACKETS OF TOBACCO/WEEK (MAX) | Num | 622 (9.1%) | Min. 0; 1st Qu. 0; Median 2; Mean 3.322; 3rd Qu. 6; Max. 81; NA's 6241
CIGARS/WEEK (MAX) | Num | 343 (5.0%) | Min. 0; 1st Qu. 0; Median 0; Mean 6.449; 3rd Qu. 3.5; Max. 99; NA's 6520
HOW LONG SINCE GIVING UP SMOKING? | Num | 1837 (26.8%) | Min. 0; 1st Qu. 4; Median 13; Mean 15.49; 3rd Qu. 25; Max. 92; NA's 5026
WHY DID YOU GIVE UP SMOKING? | Cat | 1804 (26.3%) | 1: 60 (3.3%); 2: 1641 (91.0%); 3: 103 (5.7%)
Education and occupation
Smoking history
PCQ data: outliers/extreme values report
PCQ data: outliers/extreme values report by Yanzhong Wang on Fri Nov 07 15:27:37 2008
Columns: Variable, Description, Low_limit, Outlier_low_SNO_value, High_limit, Outlier_high_SNO_value. Outliers are listed as pairs of subject number and value.

1 PCQN1 – TOTAL YEARS IN FULL-TIME STUDY
Low_limit: 2
Outlier_low_SNO_value: SFG9500481, 2; SFT0434849, 0; SFG9500819, 1; SFG9501105, 0
High_limit: 25
Outlier_high_SNO_value: SFT0400530, 25; SFG9500408, 27; SFT0434350, 25; SFG9501237, 27; SFT0435564, 31; SFT0435386, 28; SFT0441832, 29; SFT0435764, 39; SFG9502609, 25; SFG9501887, 27; SFG9502363, 26; SFT0442041, 27; SFT0442200, 30

2 PCQN5III – HOURS/WEEK WORKING AT NIGHT 7PM-7AM
High_limit: 50
Outlier_high_SNO_value: SFT0400539, 50; SFT0400564, 60; SFT0400595, 60; SFT0400627, 50; SFT0400777, 73; SFT0400793, 50; SFT0401048, 56; SFG9500441, 54; SFT0401182, 70; SFT0434230, 50; SFT0434367, 60; SFT0434391, 60; SFT0441084, 72; SFG9500487, 55; SFG9500466, 60; SFG9500845, 60; SFG9501128, 60; SFG9501326, 50; SFT0441418, 50; SFT0435323, 50; SFT0435389, 56; SFT0441788, 60; SFG9502202, 59; SFG9502156, 65; SFT0441972, 50; SFT0435854, 54

3 PCQO1 – NO OF PEOPLE LIVE IN HOUSEHOLD?
Low_limit: 0
Outlier_low_SNO_value: SFG9500124, 0; SFT0401027, 0; SFT0434305, 0; SFT0434390, 0; SFT0434428, 0; SFT0441113, 0; SFT0434541, 0; SFT0434828, 0; SFG9500674, 0; SFG9500949, 0; SFT0435099, 0; SFT0435778, 0
High_limit: 8
Outlier_high_SNO_value: SFG9500474, 10; SFG9500998, 8; SFG9501593, 9; SFT0441511, 10; SFT0435430, 8; SFT0441995, 9
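A report of this shape can be sketched in Python as below. In the report above, values equal to a limit are listed (for example a value of 25 against a high limit of 25), so the comparison here is inclusive; that reading, the function name list_outliers, the variable name years_study and the example records are assumptions, not the original R program.

```python
def list_outliers(records: list, variable: str, low_limit, high_limit,
                  id_field: str = "sno") -> tuple:
    """Return (low, high) lists of (subject number, value) at or beyond each limit."""
    low, high = [], []
    for r in records:
        value = r.get(variable)
        if value is None:
            continue  # a missing value is not an outlier
        if value <= low_limit:
            low.append((r[id_field], value))
        elif value >= high_limit:
            high.append((r[id_field], value))
    return low, high


# Hypothetical records for total years in full-time study, limits 2 and 25.
records = [
    {"sno": "A001", "years_study": 12},
    {"sno": "A002", "years_study": 0},
    {"sno": "A003", "years_study": 27},
    {"sno": "A004", "years_study": None},
    {"sno": "A005", "years_study": 25},
]
print(list_outliers(records, "years_study", 2, 25))
```

The listed subject numbers are then checked against the original questionnaires before any value is changed.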
Lunch time
See you at 1pm for Part 2