NICOR NATIONAL ADULT CARDIAC SURGERY AUDIT: RESPONSE TO PROFESSOR NICK BLACK'S REPORT INTO THE REVIEW OF DATA VALIDATION AND STATISTICAL PROCESSES FOR CONSULTANT OUTCOMES PUBLICATION 2014 (2010-2013 DATA)

November 19th 2014

NICOR Executive: Professor John Deanfield, Dr Mark de Belder, Dr Peter Ludman, Professor Adam Timmis, Professor Ben Bridgewater, Dr Julie Sanders

NICOR RESPONSE TO NICK BLACK REPORT

Professor Nick Black was asked to review the National Adult Cardiac Surgery Audit (NACSA) to provide assurance that the NACSA data were fit for purpose for publication under NHS England's consultant outcomes publication programme (COP). NICOR would like to thank Nick Black and his panel for their report (Appendix I). They have acknowledged the complexity of this process, made recommendations for potential improvements, and highlighted the chain of responsibility. They have concluded that, in spite of these complexities, the data are of sufficient completeness and accuracy to allow for a comparison of individual consultants. They have recommended further development of the processes involved and of the risk adjustment model, but have concluded that the adopted method is fit for purpose.

The panel has raised some important points. NICOR would wish to highlight a number of issues related to the report.

1. NACSA management: The NACSA is managed by the National Institute for Cardiovascular Outcomes Research (NICOR). There is extensive governance surrounding this programme: within NICOR, between NICOR and the professional society (the SCTS), and between NICOR and HQIP. In particular, the project group managing the audit has been appropriately configured with patient, lay, methodological, project management and professional representation. All minutes are made publicly available. There is limited specific national guidance on consultant data validation, data sign-off, risk adjustment or the definition of outliers; all guidance that does exist has been complied with by NICOR. In the absence of specific guidance, NICOR has developed methodologies which Nick Black and colleagues have acknowledged to be fit for purpose, although the report raises issues about a number of aspects of the methodologies applied. NICOR would like to point out that all methodologies used were developed with appropriate governance and independent statistical review. All methodologies used for both risk adjustment and the definition of outliers were published openly through peer review (Eur J Cardiothorac Surg (2014) 45(2): 225-233. doi: 10.1093/ejcts/ezt476). NICOR believes that it has been open and transparent about all these aspects of its work.

2. Case ascertainment: As described in the NICOR report submitted to Nick Black, there is no gold standard for case ascertainment for adult cardiac surgery. Following a review between NICOR and the National Advisory Group for Clinical Audit and Enquiries (NAGCAE) last year, it was agreed to work towards innovative ways of combining the datasets to achieve optimal information on case ascertainment. NICOR is working towards this, but these data are not yet available for the current round of COP.

3. Data collection: The definitions for the NACSA dataset and the EuroSCORE variables are freely available, and have been disseminated to units. Recent events have highlighted inconsistencies in the way these data definitions have been applied locally. NICOR would add that even widely accepted international benchmarking algorithms such as the EuroSCORE or EuroSCORE II have not been developed with attention to issues of scientific rigour such as examination of test-retest reliability or inter-observer variability for the risk factors used in the models, nor have these algorithms been accompanied by detailed user manuals. NICOR accepts, however, that the COP programme is making these issues increasingly important and will respond by reviewing the dataset in detail and issuing appropriate guidance to surgeons and units.
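To illustrate the kind of guidance envisaged in point 3, the following is a minimal sketch (in Python) of how range and internal-consistency rules could be applied automatically before data submission. The field names and plausibility ranges here are entirely hypothetical and do not correspond to the actual NACSA dataset specification.

    # Illustrative sketch only: automated range and consistency checks of the
    # kind that could support consistent local application of data definitions.
    # Field names and ranges are hypothetical, not the NACSA specification.
    import pandas as pd

    RANGE_RULES = {
        "age": (18, 110),                 # plausible adult age range
        "creatinine_umol_l": (20, 2000),  # plausible serum creatinine range
        "lvef_percent": (5, 80),          # plausible ejection fraction range
    }

    def check_record(row: pd.Series) -> list[str]:
        """Return a list of validation messages for one patient record."""
        problems = []
        for field, (lo, hi) in RANGE_RULES.items():
            value = row.get(field)
            if pd.isna(value):
                problems.append(f"{field}: missing")
            elif not lo <= value <= hi:
                problems.append(f"{field}: {value} outside plausible range [{lo}, {hi}]")
        # Example internal-consistency rule: a binary risk factor must be 0/1.
        if row.get("unstable_angina") not in (0, 1):
            problems.append("unstable_angina: must be coded 0 or 1")
        return problems

    records = pd.DataFrame([
        {"age": 67, "creatinine_umol_l": 95, "lvef_percent": 55, "unstable_angina": 0},
        {"age": 15, "creatinine_umol_l": None, "lvef_percent": 140, "unstable_angina": 2},
    ])
    for idx, row in records.iterrows():
        for message in check_record(row):
            print(f"record {idx}: {message}")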
4. Data completeness: NICOR recognises that the process of imputation used for missing data is only one of a number of available options, but the process chosen was selected for specific reasons that have been openly disseminated and described in detail in the peer-reviewed publications. NICOR is familiar with other techniques and will review the options for subsequent rounds of COP.

5. Data validation: The audit structures within participating hospitals should ensure that the data sent to NICOR are as complete as possible. NICOR believes that its process of data validation for the NACSA with the units (which for this round of COP involved four repeated rounds of communication with the units and positive sign-off prior to publication) has been particularly extensive compared to other national audits. Data quality must always ultimately be a local responsibility, but audit providers must support that responsibility. NICOR and the SCTS have taken significant steps to flag up to surgeons and units the shared responsibility required for data to be fit for purpose, as described in the report submitted to Nick Black.

6. Feedback to audit staff: In response to one specific issue raised by Prof Black, NICOR would add that audit staff are notified and sent their units' data, in addition to the communication with the clinical staff at each hospital. As Prof Black describes, one particular item (unstable angina) was considered as part of the revalidation process, but all data items were included in the feedback to units, which highlighted all fields with an unexpected variation from the national average, at both hospital and individual surgeon level. Finally, NICOR notes Prof Black's comments with respect to the funnel plot methodology used, but would note that the concerns raised would make the method more, rather than less, sensitive in detecting unexpected variation at potentially high incidences. Given that the concerns raised were about erroneously high incidences leading to false-negative results in a surgeon outlier analysis, the issue about the lower funnel limit is less significant (an illustrative sketch of the funnel limits follows point 7 below). NICOR will continue to review the methodologies used in this process.

7. Risk adjustment and outlier detection methodology: NICOR has taken these aspects particularly seriously and has published a series of peer-reviewed manuscripts on the methodology used, the reasons for its selection, and its potential limitations. In particular, NICOR believes that the approach to recalibration, using contemporary coefficients to give true peer-group benchmarking and recalibrating for each year of scrutiny, is particularly robust and acts in the best interests of patients in an environment of ongoing quality improvement. NICOR believes it has been particularly transparent about these aspects of its processes. All methods have limitations, but NICOR will continue to refine its processes.
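For illustration, the following minimal sketch (Python; illustrative parameter values only, not the published NACSA methodology) compares Normal-approximation funnel limits with exact binomial limits for a binary risk factor. At small case volumes the Normal lower limit falls below zero (and so is of no practical relevance), while the exact binomial upper limit is slightly wider than the Normal one, consistent with the points raised.

    # Illustrative comparison of funnel-plot control limits for a binary risk
    # factor: Normal approximation versus exact binomial. Values are invented;
    # this is not the published NACSA methodology.
    from scipy import stats

    p0 = 0.10   # assumed national average incidence of the risk factor
    z = 2.0     # roughly 95% control limits

    for n in (25, 100, 400):   # consultant case volumes
        # Normal approximation: p0 +/- z * sqrt(p0 * (1 - p0) / n)
        half_width = z * (p0 * (1 - p0) / n) ** 0.5
        lower_norm, upper_norm = p0 - half_width, p0 + half_width
        # Exact binomial limits from the inverse CDF, expressed as proportions.
        lower_bin = stats.binom.ppf(0.025, n, p0) / n
        upper_bin = stats.binom.ppf(0.975, n, p0) / n
        print(f"n={n:4d}  normal: ({lower_norm:.3f}, {upper_norm:.3f})"
              f"  binomial: ({lower_bin:.3f}, {upper_bin:.3f})")

At n = 25, for example, the Normal lower limit is negative while the exact binomial lower limit is zero, and the exact upper limit sits above the Normal one; the two converge as case volumes grow.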
NICOR welcomes any national guidance on these issues and would be happy to contribute to developing it alongside other stakeholders.

APPENDIX I: REPORT OF AN INDEPENDENT REVIEW

Comparison of consultants' outcomes in adult cardiac surgery
Report of an independent Review Group (Chair: Nick Black)

Background

In April 2014, NICOR together with the Society for Cardiothoracic Surgery (SCTS) provided consultant outcome data based on April 2010-March 2013 to their members. This included profiles of the risk factor incidences for each consultant. Concern was expressed about one surgeon having reported rather extreme values for some risk factors, which would have markedly increased his expected mortality (and thus under-estimated his risk-adjusted mortality). Enquiries followed and disciplinary action was taken by the relevant authorities. This raised the concerns of other surgeons, who challenged the validity of the data more generally. In particular there were concerns, shared by NICOR and the SCTS, about the validity of one risk factor, unstable angina. In June-August 2014, NICOR asked all consultants to repeat the validation of their own data and confirm its authenticity.

On 6 October 2014, Professor Sir Bruce Keogh requested that the Healthcare Quality Improvement Partnership (HQIP) provide assurance that consultant-level outcome data were fit to publish, particularly regarding the adequacy of data validation and analysis. To do so, he requested that HQIP secure an independent assessment.

Review method

On 8 October, as Chair of the National Advisory Group for Clinical Audit & Enquiries (NAGCAE), Nick Black agreed to undertake a rapid review and advise HQIP. Three other members of NAGCAE or NAGCAE Sub-Groups with methodological expertise were recruited (Professor Jan van der Meulen; Professor Kathy Rowan; Dr Robert Grant), plus Dr David Harrison (ICNARC), to assist in addressing three questions:

1. Are the data management and validation processes that are used to produce the adult cardiac surgery database fit-for-purpose (ie for comparing consultants' outcomes)?
2. Are the data of sufficient completeness and validity for comparing consultants?
3. Is the method of risk adjustment fit-for-purpose?

On 10 October, NICOR provided a detailed account of the processes they had used to collect, validate and analyse data from all cardiothoracic consultants in England, Wales and Scotland. After reviewing it, further clarification and additional information was requested from NICOR on 13 October. This was received on 14 October (see below).

Findings

1. Are the data management and validation processes that are used to produce the adult cardiac surgery database fit-for-purpose (ie for comparing consultants' outcomes)?

1.1 Case ascertainment

Case ascertainment (recruitment proportion) is determined by comparison with the number of cases recorded in the hospital administrative database (Hospital Episode Statistics, HES, in England). NICOR defend this approach by referring to work carried out by the Clinical Effectiveness Unit of the Royal College of Surgeons, which showed that the overall number of patients included in the audit between April 2007 and March 2009 (37,712) was similar to the corresponding number in HES (37,542). However, that report also found differences in some units (ranging from 37% fewer to 22% more patients in HES than in the audit). The differences in 30-day mortality within units were even larger (ranging from 34% fewer to 59% more deaths in HES than in the audit). The report also indicated that it is likely that, for a small number of consultants in two trusts, the number of procedures according to HES was considerably different from the number according to the audit data. It is not clear why such discrepancies arise, nor which database is the more accurate. An analysis of linked data would provide not only an estimate of case ascertainment but also a better understanding of the differences, assuming they have persisted into the 2010-13 data. It might also be worth comparing ascertainment with an alternative data source, such as operating theatre information systems.
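A minimal sketch (Python; all counts invented, and no real linkage methodology implied) of the kind of unit-level comparison between audit records and an administrative source such as HES that underpins the discussion above:

    # Illustrative sketch of a unit-level case ascertainment comparison between
    # audit records and an administrative source such as HES. Counts are
    # invented for illustration.
    import pandas as pd

    audit = pd.DataFrame({"unit": ["A", "B", "C"], "audit_cases": [910, 450, 1200]})
    hes = pd.DataFrame({"unit": ["A", "B", "C"], "hes_cases": [925, 310, 1260]})

    merged = audit.merge(hes, on="unit")
    merged["ascertainment_pct"] = 100 * merged["audit_cases"] / merged["hes_cases"]
    merged["pct_diff_hes_vs_audit"] = (
        100 * (merged["hes_cases"] - merged["audit_cases"]) / merged["audit_cases"]
    )
    print(merged)

Such a table can show, as the Royal College of Surgeons work did, units where HES holds fewer cases than the audit as well as units where it holds more.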
1.2 Data collection

There is limited knowledge of how data collection rules and definitions are actually applied locally. It is unclear what standard instructions exist regarding data collection, such as a manual with rules and definitions for each variable, how these are communicated to those responsible for data entry locally, and how individuals are trained. For example, for the ten items of data used to derive 'critical preoperative state' (one of the variables required for the EuroSCORE), it is important to know several aspects of the data collected, such as which definitions are used, how objective they are, what algorithms are used for derivation, and the validity of the raw rather than the derived data. It is also concerning that invalid data are logically mapped to sensible data on processing.

National clinical audits can be designed to be less vulnerable to manipulation by individual consultants: for example, the staging data in the cancer audits come from the pathology laboratories, and other data items are determined by multidisciplinary teams. Audits such as this one, which rely on data entered by consultants, are therefore more vulnerable to data manipulation.

1.3 Data completeness

There were very few patient records that lacked data on outcome (dead/alive), partly because records with missing outcome data are completed and validated through comparison with ONS mortality records. If after this process the survival of a patient is still unclear, the patient is assumed to have died, providing a strong disincentive for consultants to fail to provide this information.

Similarly, missing data for variables needed for risk adjustment are imputed with the values that give the least possible increase in predicted risk. This approach is taken to encourage consultants to record all data items, because not reporting them would lead to worse outcomes being reported. However, imputing missing data for variables included in the risk adjustment model with values that give the least possible increase in risk may have a detrimental impact on the ability to develop the risk adjustment model. More advanced imputation techniques, which reduce bias and increase precision, are available and should be considered.
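A minimal sketch (Python, with simulated data and invented variable names) contrasting 'least-risk' single imputation with a model-based alternative of the kind the review suggests considering. NICOR's actual imputation rules are more specific than this caricature, and a full multiple-imputation approach would additionally pool estimates across several imputed datasets; the sketch only illustrates the general idea.

    # Contrast 'least-risk' single imputation with model-based imputation.
    # Data and variable names are invented for illustration.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(0)
    age = rng.normal(65, 10, 200)
    creatinine = 60 + 0.8 * age + rng.normal(0, 10, 200)  # correlated with age
    X = np.column_stack([age, creatinine])
    X[rng.random(200) < 0.1, 1] = np.nan  # ~10% missing creatinine

    # Least-risk single imputation: replace missing values with the value
    # contributing least to predicted risk (here, the minimum observed value).
    least_risk = np.where(np.isnan(X[:, 1]), np.nanmin(X[:, 1]), X[:, 1])

    # Model-based (chained-equations style) imputation: predicts missing
    # creatinine from age, preserving the correlation structure instead of
    # pulling every imputed value towards one extreme.
    X_imputed = IterativeImputer(random_state=0).fit_transform(X)

    print("mean creatinine, least-risk imputation:", round(least_risk.mean(), 1))
    print("mean creatinine, model-based imputation:", round(X_imputed[:, 1].mean(), 1))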
1.4 Data validity

Three principal mechanisms are in place to check data validity. First, consultants are required to authenticate their own data. This approach clearly places the responsibility for validity on the consultants rather than its being assumed by NICOR, which makes it difficult for NICOR to be confident that consultants are not wilfully misleading the system. Reassurance about data validity might be increased if hospital managers or audit staff were involved. It is also unclear what action is actually taken by consultants when asked to validate their own data (eg re-entering a random sample, reviewing specific variables for which they are outliers). Second, NICOR carry out a number of range checks for data values and tests for internal consistency between variables. Third, the data items used for risk adjustment are assessed by comparing their incidence for each consultant, allowing consultants with significantly different rates to be identified. While this cannot distinguish justifiable 'outliers' (ie consultants with a specialist case-mix) from those who may be supplying erroneous data (gaming), further qualitative local enquiry can determine which is the case.

There are three concerns about the current approach to validation. First, if data on risk factor incidence for validation are provided to consultants alongside their risk-adjusted results, there is a danger of the latter influencing the former. The two activities should be divorced from each other, with the validity of the risk factor data checked first. Second, the repeat validation carried out during summer 2014 appears to have asked consultants to focus on one specific variable (unstable angina) so as to minimise the burden on consultants. Given the recent history of serious concern about this audit, a more wide-ranging and thorough revalidation might have been appropriate to help restore confidence. The third, more minor, point is that the use of funnel plots based on the Normal distribution for binary risk factor data is somewhat questionable. Certainly the lower funnel lines (which are admittedly of less interest) will be completely irrelevant, as they frequently correspond to a prevalence of zero. The upper funnels from the Normal approximation will have somewhat less dispersion than an exact binomial approach, particularly at small sample sizes.

2. Are the data of sufficient completeness and validity for comparing consultants?

Overall, despite some concerns about the processes employed to collect and validate data, it appears the data are of sufficient completeness and accuracy to compare consultants. In addition, each and every consultant has authenticated their data. Further improvements to the process could nevertheless be made.

The identification of consultants with an outlying incidence of a risk factor has been based on fairly crude analysis of funnel plots. While the Winsorization of these funnel plots may do a reasonable job of accounting for overdispersion in risk factors that are continuously (but not evenly or randomly) distributed across providers, it will be much worse at dealing with sub-populations of providers (particularly of individual consultants) that do or do not routinely deal with certain types of patients. In such cases, we may expect the individual data points to form two (or more) sub-funnels. There is possibly some suggestion of this for 'Emergency or Salvage', 'Other than isolated CABG' and 'Surgery on Thoracic Aorta'. If it were possible to categorise the surgeons in some way, by sub-specialty or by the types of operations they perform, then examining separate funnels for these categories might be more informative.

Concerns about deliberate attempts to obscure poor outcomes by gaming the risk factor data remain (ie for consultants whose results are not flagged up as outliers). If data manipulation takes place, it is most likely to concern data items that are included in the risk adjustment model, because manipulating the outcome (mortality) would be much more likely to be detected. One way to explore the extent to which any issues in the risk factor data may mask poor performance is to compare observed and adjusted outcomes, as the sketch below illustrates; large differences would suggest unusual risk factor profiles.
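A minimal sketch (Python; all figures invented) of this observed-to-expected comparison. A consultant whose observed deaths fall far below the expected deaths implied by their reported risk factors (a very low O/E ratio) could warrant enquiry into the risk factor profile itself:

    # Illustrative sketch of comparing observed with expected (risk-adjusted)
    # outcomes per consultant. All figures are invented for illustration.
    import pandas as pd

    df = pd.DataFrame({
        "consultant": ["S1", "S2", "S3"],
        "cases": [300, 250, 280],
        "observed_deaths": [6, 9, 4],
        "expected_deaths": [7.2, 6.1, 11.5],  # sum of model-predicted risks
    })
    df["oe_ratio"] = df["observed_deaths"] / df["expected_deaths"]
    df["observed_pct"] = 100 * df["observed_deaths"] / df["cases"]
    df["expected_pct"] = 100 * df["expected_deaths"] / df["cases"]
    print(df[["consultant", "observed_pct", "expected_pct", "oe_ratio"]])

In this invented example, consultant S3's O/E ratio of about 0.35 is the kind of large observed-expected divergence that would suggest an unusual risk factor profile.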
3. Is the method of risk adjustment fit-for-purpose?

The risk adjustment used by NICOR is based on a logistic model of hospital mortality that discriminates well between patients who die and those who survive, and produces mortality predictions that fit observed mortality well. While further refinement may not lead to large improvements in the risk adjustment method, the choice of a recalibrated EuroSCORE deserves further consideration. Four options have been considered by NICOR:

- Logistic EuroSCORE (as published)
- Refitted logistic EuroSCORE
- Refitted modified EuroSCORE
- Recalibrated EuroSCORE

Using the logistic EuroSCORE as published is quickly (and appropriately) rejected, as it is clearly poorly calibrated. However, the approach by which NICOR selected a cubic transformation of the logistic EuroSCORE (the recalibrated EuroSCORE) over either refitted model is unclear. The basis for the selection appears to be that the recalibrated EuroSCORE more closely follows a straight line of observed against predicted risk than the refitted approaches. However, this is, in a sense, a self-fulfilling prophecy: the method applied is designed solely to produce such a straight line, without any consideration of what it actually means in terms of the underlying risk factors.

The issue of greatest statistical concern in applying an out-of-date model, such as the EuroSCORE, is 'differential miscalibration': in other words, that the effect of some risk factors on outcome has changed (relative to the original development population) differently from that of other risk factors. The fact that a simple linear transformation of the logistic EuroSCORE did not produce good calibration, and a cubic transformation was required, strongly suggests that such differential miscalibration is present. The cubic transformation masks this (by giving good overall calibration according to predicted risk) but does not address it directly; only refitting the model can do this. Essentially, any two patients who received similar predicted risks under the original logistic EuroSCORE will still receive similar predicted risks under the recalibrated EuroSCORE, and this assumption (inherent in this approach to recalibration) appears unlikely to hold true. Therefore, while there would be some concern about the amount of miscalibration exhibited by the refitted models, either of these is likely to give a fairer comparison of consultants than the recalibrated EuroSCORE.

This limitation is very briefly acknowledged: "A disadvantage is that this approach does not account for the varying contemporaneous adjustments each risk factor might have." The only way to provide any reassurance regarding this would be to look at calibration within subgroups defined by risk factors, rather than solely by overall predicted risk. For example, does the recalibrated model perform equally well across different age groups, sexes, and those with and without chronic pulmonary disease? It is not clear whether this has been done.
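A minimal sketch (Python, on simulated data) of one plausible reading of such a recalibration (a logistic regression of observed mortality on a cubic polynomial of the published model's logit), followed by the subgroup calibration check suggested above. This is not the published NACSA model; all data, coefficients and the 'age group' variable are invented.

    # Illustrative recalibration of a published risk score via a cubic in its
    # logit, then a subgroup calibration check. Simulated data only; not the
    # published NACSA model.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 5000
    old_pred = np.clip(rng.beta(2, 20, n), 1e-4, 1 - 1e-4)  # published-model risks
    age_group = rng.integers(0, 3, n)                        # hypothetical subgroup
    # Simulate outcomes from a 'true' risk that has drifted from the old model.
    old_logit = np.log(old_pred / (1 - old_pred))
    true_risk = 1 / (1 + np.exp(-(0.7 * old_logit - 0.8)))
    died = rng.random(n) < true_risk

    # Recalibration: logistic regression of outcome on a cubic in the old logit.
    X = sm.add_constant(np.column_stack([old_logit, old_logit**2, old_logit**3]))
    fit = sm.Logit(died.astype(float), X).fit(disp=0)
    recal_pred = fit.predict(X)

    # Subgroup calibration check: observed vs mean predicted risk per subgroup.
    for g in range(3):
        mask = age_group == g
        print(f"age group {g}: observed {died[mask].mean():.3f}, "
              f"predicted {recal_pred[mask].mean():.3f}")

Note how the recalibration preserves the rank ordering of the original predictions: patients with similar old predictions still receive similar new ones, which is exactly the property the review identifies as unlikely to hold if differential miscalibration is present.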
Recommendations

It is clear that the processes for data collection and validation need to be improved, as do the governance arrangements that oversee these and other processes. Case ascertainment needs to be investigated more rigorously, and both data completeness and data validity could be further improved. The method of data collection allows for variation in how consultants interpret and define some key risk factors, a situation that needs attention.

Despite our concerns about these aspects of the current processes, the data appear to be of sufficient completeness and validity to be used to compare consultants. It is reassuring that the clinical leads in all units have confirmed their satisfaction with the accuracy of the data that their consultant colleagues have provided and have had the opportunity to validate a second time. This minimises the risk that any consultant found to be a poor outlier (more than 3 standard deviations worse than expected) has grounds to claim that the judgement is unjustified.

While we would recommend further investigation and development of the risk adjustment model, the adopted method is felt to be fit-for-purpose. The principal risk is that a consultant with poor outcomes will not be detected, as the current data collection and validation processes can allow a misleading assessment of a consultant's outcomes. Improvements to the governance and data management processes are required to reduce this risk in future years.