Data Quality and Ensuring Usability …of routinely collected PC data

Presented to: Integrating Clinical and Genetic Datasets: Nirvana or Pandora’s Box
Presented by: Simon de Lusignan, slusigna@sgul.ac.uk
9th May 2006

About me
• GP in Guildford
• 11,500 patient practice
• 6.5 whole-time-equivalent GPs
• Computerised since 1988
• Senior Lecturer, St. George’s
• Primary Care Informatics (PCI) research group
  • Using routinely collected data for quality improvement + research
  • Electronic libraries
  • Computer in the consultation
  • Telemonitoring
• Chair, PCI WG of EFMI
• Developing a BSc in BMI

Overview
• Introduction
  • Benefits from linking clinical + genetic data
  • Growing volumes of accessible primary care data… for quality improvement + research …increasingly used
• Objective
  • Is it possible to define the features of a routinely collected dataset which can be integrated with genetic data?
• Method
  • Literature review + 10 years of experiential learning working with data
• Features of “quality” data:
  1. What is data quality?
  2. Unique identifiers + denominators
  3. What needs to be defined about data processing + storage
• Discussion

Introduction
• “GIVEN”: Benefits from linking clinical and genetic data
• Routinely collected clinical data is used increasingly for:
  1. Quality improvement
  2. Clinical audit
  3. Health service planning
  4. Research
References:
1. de Lusignan S, van Weel C. The use of routinely collected computer data for research in primary care: opportunities and challenges. Fam Pract. 2006 Apr;23(2):253-63.
2. de Lusignan S, Hague N, van Vlymen J, Kumarapeli P. Routinely collected general practice data are complex but with systematic processing can be used for quality improvement and research. Accepted for publication: Informatics in Primary Care.

Objective
• To define the features of clinical data which make them fit for integration with genetic data

Features of “quality” data
• Defining data quality
• Unique identifiers
• Defined process of data extraction + storage

Defining data quality
Evolving definitions:
• Completeness + accuracy (Pringle et al., BJGP 1995)
• Currency (Williams, Methods 2003)
• Sensitivity + positive predictive value (Thiru et al., BMJ 2003)
• Data Quality Probe (Brown + Warmington, IPC 2003)
• “Fit for purpose” (PCI WG EFMI, 2005)

Unique IDs
• Linkage of data
• Interoperability of systems
• Follow-up / traceability of individuals
• Population denominator + ghosts….

• England + Wales - NHS number
• Scotland - CHI number

Our system
• “MIQUEST” unique ID for one practice
• + Compound with study number + unique ID for practice
• Convert to non-case-sensitive ASCII format

Processing data
(1) Appreciation of data entry issues + contemporary perspective of system users;
(2) Defined stages of data processing + applications used at each stage, + quality controls;
(3) Archive coding systems and the look-up tables used to infer meaning or rubrics;
(4) The queries used to extract the data;
(5) A metadata system to ensure traceability of each cell of data;
(6) The ethical constraints that apply to the dataset.

(1) Data entry issues + contemporary perspective of users
• COPD and bronchitis codes are easily confused
• Recoding half of the practice asthmatics from a diagnosis to a “history of” code
Ref: Faulconer ER, de Lusignan S. An eight-step method for assessing diagnostic data quality: COPD as an exemplar. Inform Prim Care. 2004;12(4):243-54.
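
The compound identifier described under “Our system” can be sketched in Python as follows. This is a minimal illustration, not the project's actual implementation: the function name `compound_id`, the hyphen separator, and the choice of upper-casing to obtain a non-case-sensitive form are all assumptions made for the example.

```python
import re


def compound_id(study_number: str, practice_id: str, patient_id: str) -> str:
    """Build one linkage key from study number + practice ID + patient ID.

    All three argument names and the "-" separator are hypothetical.
    Each part must be plain ASCII letters or digits; the result is
    upper-cased so that keys differing only in case compare equal.
    """
    cleaned = []
    for part in (study_number, practice_id, patient_id):
        text = part.strip()
        # Reject anything outside A-Z, a-z, 0-9 (including non-ASCII text).
        if not re.fullmatch(r"[A-Za-z0-9]+", text):
            raise ValueError(f"identifier part is not ASCII alphanumeric: {part!r}")
        cleaned.append(text.upper())
    return "-".join(cleaned)


# The same patient exported with different letter case yields the same key.
key_a = compound_id("s01", "p0042", "ab12cd")
key_b = compound_id("S01", "P0042", "AB12CD")
```

Including the study number and practice ID in the key keeps a patient ID that is unique only within one practice from colliding with the same ID issued elsewhere, which matters at the migration and integration steps.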

(2) Defined stages of data processing
We have defined eight discrete steps in data processing:
(1) Design of queries + piloting
(2) Data entry (already dealt with)
(3) Extraction
(4) Migration (unique IDs essential)
(5) Integration
(6) Cleaning
(7) Processing
(8) Analysis
Ref: van Vlymen J, de Lusignan S, Hague N, Chan T, Dzregah B. Ensuring the Quality of Aggregated General Practice Data: Lessons from the Primary Care Data Quality Programme (PCDQ). Stud Health Technol Inform. 2005;116:1010-5.

(3) Archive coding systems….
• Coding systems are constantly evolving
• In general coding systems are becoming larger + more complex
  • You can go from many to few; but not from few to many…
• We archive:
  • Clinical codes look-up engine used, e.g. NHS Triset Browser
  • Each relevant version, e.g. 4- and 5-Byte Read Codes; Drug Dictionary; proprietary codes

[Figure: Example of “look-up engine”]

(4) The query library
• Re-issued by date
• Query set for each clinical programme
  • e.g. C1, C2, C3 – Cardiac programme
• Query set for each extraction type
  • e.g. E4, E5, G4, G5 (E for EMIS, G for Generic)
• Defined look-up tables + rubrics for queries

[Figure: The query library…]
[Figure: The “C2” queries]
[Figure: The “C2” EMIS 5-Byte set]

(5) Metadata system
• Follows data from query set to analysis
• Preserves original data
• Derived variables clearly identified
• Associated dates + numerics labelled
  • Rules for units used
• Look-up table used to define variable names
Ref: van Vlymen J, de Lusignan S. A system of metadata to control the process of query, aggregating, cleaning and analysing large datasets of primary care data. Inform Prim Care. 2005;13(4):281-91.
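
The metadata principles listed above (original data preserved, derived variables clearly identified, units recorded) might be sketched as follows. This is only an illustration under stated assumptions: the `Cell` class, its field names, the `derive` helper, and the example variable names are hypothetical and are not taken from the published metadata system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: an extracted cell cannot be overwritten
class Cell:
    variable: str                        # variable name (hypothetical examples below)
    value: str                           # value exactly as extracted
    query_set: str                       # originating query set, e.g. "C2"
    units: Optional[str] = None          # units for numeric values, if any
    derived_from: Optional[str] = None   # set only on derived variables


def derive(source: Cell, variable: str, value: str,
           units: Optional[str] = None) -> Cell:
    """Create a new, clearly labelled derived cell; the source is untouched."""
    return Cell(
        variable=variable,
        value=value,
        query_set=source.query_set,    # traceability back to the query set
        units=units,
        derived_from=source.variable,  # marks this cell as derived
    )


# An extracted value with stray whitespace, and a cleaned copy derived from it.
raw = Cell(variable="sbp_raw", value=" 142 ", query_set="C2", units="mmHg")
clean = derive(raw, "sbp_clean", raw.value.strip(), units=raw.units)
```

Because cleaning produces a new cell and never edits the extracted one, any analysed value can be traced back through `derived_from` and `query_set` to the query that produced it.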

Source data – metadata structure
[Figure: variable name built from underscore-separated elements, shown as C2 _ PDNP _ G3P1 _ 1 _ DI, with the labels: originating query set bigram, query file, Read code / CCC, repeat index, type bigram]

BIGRAM   MEANING
DI       Diagnosis
RX       Drugs, prescription
OC       Occupation
HO       History, symptoms
OE       Examination, signs

Linking elements:
• Query library
• Query & Core Clinical Concept
• Read code
• Core clinical concept (CCC)
• Automation

(6) Ethics
• The ethical constraints on any dataset are indexed in the query library

Summary
• Data quality is best defined in terms of
  • “Fitness for purpose” - What purpose, when?
• Transparent methods of data processing allow audit of results
• Understanding data entry issues / context is essential
• Metadata can help control processing
• Careful curation of data may allow its use beyond the timescale of the original study

Thanks for listening
Simon de Lusignan
Tel: 020 8725 5661
Fax: 020 8767 7697
Email: slusigna@sgul.ac.uk
Web: www.gpinformatics.org, www.sgul.ac.uk/informatics/