Research Integrity: The Importance of Data Acquisition and Management

Jennifer E. Van Eyk, Ph.D.
Prof. Medicine, Biol. Chem. and BME
Director, JHU Bayview Proteomics Center
Director, JHU ICTR Biomarker Development Center
JHU NHLBI Innovative Proteomics Center on Heart Failure

Principles of Research Data Integrity
• Research integrity depends on data integrity.
  – Includes all aspects of collection, use, storage and sharing of data.
• Data integrity is a shared responsibility.
  – Everyone involved in the research is responsible.
  – The ultimate responsibility belongs to the PI.
  – However, there is a broader role and responsibility for the institute and scientific community.
• Transparency of the research data is required.

Free and accurate information exchange is fundamental to scientific progress.
Data integrity can be compromised in numerous ways:
  i) malicious proprietors,
  ii) human mistakes and naivety,
  iii) technical error.
Data integrity is based on accurate and traceable:
  i) collection,
  ii) recording,
  iii) storage,
  iv) reporting.

The Consequences of Failure
• Personal loss
• Blocked scientific progression
• Impaired technology development
• Damage to the institution and sponsors
• Tarnished public perception of science
• Damage to or loss of patent protection

Clinical trials based on genomic selection: Duke University
Based on 2 genomic studies coming from the same multi-disciplinary group (Potti and Nevins), from which three clinical trials were undertaken. All clinical trials have ultimately been suspended.
1. Papers using cancer tissue (Potti et al., N Engl J Med. 2006;355:570) and cell-based approaches (Hsu et al., J Clin Oncol. 2007;25:4350; Potti et al., Nat Med. 2006;12:1294–1300) were published with a lot of hype.
2. Issues were raised by K. Baggerly and/or K. Coombes (M.D. Anderson Cancer Center), based on publicly available data (Annals of Applied Statistics 2009;3:1309 and Nat Med. 2007;13:1276), as well as by others.
3. Duke argues the mistakes were “clerical errors” that do not alter the fundamental conclusions of the papers.
4. NCI/CTEP required LMS to be tested in a blinded pre-validation study. It failed. The predictor was altered after being corrected for having been carried out in two different labs. Trials had to be randomized and blinded to minimize the sensitivity of the predictor to laboratory effects.
5. Due to continued concerns, NCI requests all computer code and data preprocessing in order to try to replicate the earlier finding. It failed. Use of the predictor for randomization stratification was stopped in the trial.
6. Duke carried out an internal review and reopened the trial.
7. NCI determines it is partially funding another trial based on a different paper (chemo-sensitivity). Issues were discovered with respect to differences between the data used to build the predictor and the data used for validation. Trials were ultimately suspended.
8. Potti’s academic credentials were found to be falsified. Duke acts.
9. Duke statement indicates that, with respect to the validation studies, the sensitivity labels are wrong, the sample labels are wrong and the gene labels are wrong, making it “wrong” in a way that could lead to assignment of patients to the wrong treatment.
10. The institute that co-author J. Nevins directed (Duke Institute for Genomic Science and Policy) is closed (“due to reorganization”), as is the Center for Applied Genomics and Technology, where Dr. Anil Potti was based. D. Hsu is still publishing.
11. Papers were retracted (J Clin Oncol. Dec 2010;28:5229 and N Engl J Med. Mar 2011;364:1176).
12. Institute of Medicine (IOM) committee struck for independent review and recommendations.
13. FDA audit at Duke starting in 2011.
Source: The Cancer Letter, edited and published by P. Goldberg, www.cancerletter.com

Learn From Mistakes
• Mentorship (do no harm)
• Oversight, training
• Verification and more verification
• Development of multiple processing pipelines
• Patience and wisdom in applying translation
• Ensure benefit to patients (do no harm)

Outline
• Being practical while avoiding potential errors
  – The data
• Individual Responsibilities
  – Data Management
    • Data Collection
    • Data Storage
  – Data Interpretation
• Data interpretation and publication in a changing world of translation
  – The reality of translational science
    • Challenges
    • Role of Core Facilities
  – The role of the scientific community
    • Journals
    • Scientific organizations
• Round table discussion on the research at Johns Hopkins

The Data
• Fundamental to research
• Basis for writing papers
• Important for experiment replication
• Meet contractual/funding requirements
• Settle intellectual property claims
• Defense against a charge of fraud
Images from the front covers of Circulation Research – S. Elliott (Van Eyk Lab)

Individual responsibility: Data Management
Four aspects to consider before starting your data collection:
1. Ownership
2. Collection
3. Storage/protection of confidentiality/sharing
4. Interpretation and publication

Whose data is it?
• Custody does not imply ownership.
• Custody remains with the investigator (PI), but JHU owns all data. But others have rights.
  – Funders
  – Other data sources
If there is intellectual property, or the research was funded by a sponsored research agreement with a company, who owns the data?

Data Collection
• Depends on the type of raw data
• Notebooks: day-to-day or specialized types of experiments
• Images
• Generated numbers and information
The goal is to preserve raw data, with transparent processing of data and unbiased interpretation and representation of data.

Data integrity in the digital age
“With the emergence of web-based lab notebooks, digital image ‘enhancement’, and the quick and easy (and possibly dirty) generation and dissemination of colossal amounts of data, it’s becoming increasingly clear that technology provides new challenges to maintaining scientific integrity. In an attempt to tame the beast while it still has its baby teeth, the US National Academy of Sciences released a report today that provided a framework for dealing with these challenges: ‘Ensuring the integrity, accessibility and stewardship of research data in the digital age’.”
http://blogs.nature.com/news/2009/07/data_integrity.html

“One theme that threads through many fields: the primacy of scrupulously recorded data. Because the techniques that researchers employ to ensure the integrity (the truth and accuracy) of their data are as varied as the fields themselves, there are no universal procedures for achieving technical accuracy. The term ‘integrity of data’ also has a structural meaning, related to the data’s preservation and presentation.”

“Broadly accepted practices for generating and analyzing research must be shown to be reproducible in order to be credible. Other general practices include checking and rechecking data to confirm their accuracy and validity, and also submitting data and results to peer review to ensure that the interpretation is valid.”

Gross manipulation of blots
What evidence proves the 67 kDa band is the same data as the 32 kDa band? How can you show that three lanes are the same data?
http://ori.hhs.gov/

What image should be published?
http://jcb.rupress.org/content/166/1/11/F6.expansion.html

Misrepresentation of immunogold data. The gold particles, which were actually present in the original (left), have been enhanced in the manipulated image (right). Note also that the background dot in the original data has been removed in the manipulated image. Example provided by Journal of Cell Biology.
http://jcb.rupress.org/content/166/1/11/F6.expansion.html

Data Forensics
• Can only "de-authenticate" an image (indicate discrepancies).
• Authentication requires access to the original data.
• The identification of a discrepancy is an allegation, and does not mean there was an intentional falsification of data.
• The interpretation of whether any image manipulation is serious requires familiarity with the experiment(s) and imaging instruments.
Data forensic tools are employed by journals in a manner similar to tools used to detect plagiarism.
Office of Research Integrity, US Department of Health and Human Services, http://ori.hhs.gov/

Data Storage, Protection & Sharing
• Raw data needs to be stored
  – Lab notebooks should be stored in a safe place
  – Computer files should be backed up
  – Protected and limited access to computer raw data
• Samples should be saved appropriately so they will not degrade over time.
• Data and experiment information should be available after publication.
• Data should be retained for a reasonable period of time.
Dilemma: When a postdoctoral fellow (PDF) or clinical fellow leaves a lab (especially if the paper is not written up), where does the data stay?

How long?
• Retain study records and records of disclosures of study information:
  – For IRB clinical trial
    • Retain records for 7 years after the last subject completed the study OR 7 years after the date of the last disclosure of identifiable health information from study records.
    • If the research subject is a child, retain until the subject reaches the age of 23.
  – For Investigational New Drug (IND) research
    • Retain records for 2 years after the marketing application is approved for the new drug, or until 2 years after shipment and delivery of the drug for investigational use is discontinued.
  – For Investigational Device Exemption (IDE) research
    • Retain records for 2 years after the later of the following two dates: the date on which the investigation is terminated or completed, or the date on which the records are no longer required for purposes of supporting a premarket approval application or notice of completion of a product development protocol.
• Provide adequate data and safety monitoring (if the activity represents more than minimal risk to participants)
• Complete required training
  – Human Subject Research (HSR) compliance
  – Conflict of Interest (COI)
  – Privacy issues

Traditional Lab Notebook Best Practices
• Date all entries (especially important if contesting IP)
• Title and state purpose of experiment
• Describe experiment in detail
  – Protocol
  – Calculations
  – Reagents (lot numbers, passage numbers, etc.)
  – Results (everything that does and doesn’t happen)
  – Print-outs, pictures, graphs, etc., with links to other data storage locations
• Record needs to be intact and permanent
  – All mistakes are to be left (cross out)
  – Do not remove pages
  – Write in pen
  – Clearly link connected experiments across time
Requirement: Need to be able to follow the development and execution of the experiments and all of the data analysis.

Laboratory Notebooks: Types, Advantages & Drawbacks

Type: Bound book
  Advantages: • No lost sheets • Proof against fraud
  Disadvantages: • Experiments entered as done, no logical order • Cannot keep some raw data forms

Type: Loose-leaf sheets/folders
  Advantages: • Can group by experiment, maintain order • Easy to record data during experiments • More flexible to hold various types of data
  Disadvantages: • Can lose sheets, harder to prove authenticity

Type: Electronic notebook
  Advantages: • Easy to read • Easy to do calculations
  Disadvantages: • Must back up data, harder to prove authenticity • Can be manipulated after the fact

Barker, Kathy. At the Bench: A Laboratory Navigator. Cold Spring Harbor: Cold Spring Harbor Laboratory Press (2005), 90.

Other types of data storage: Van Eyk Lab
Video, images, ELISA readouts, genomic or miRNA data, GWAS data, etc.

Processing steps for mass spectrometry data:
• Raw MS spectrum
• Processed MS spectrum
• Identifies peptides based on comparison to existing database
• Peptides clustered to identify protein name
• Proteins are clustered to remove name redundancy present

BACKGROUND
Proteomics uses mass spectrometers (MS) to identify peptides and proteins. MS accurately weighs the mass of peptides and their fragments. The observed spectrum is compared to the theoretical mass of all known amino acid sequences in a database, allowing assignment to a protein with a certain probability. Quantification can be based on the number of spectra observed per analysis (spectral count).
A single high-accuracy MS instrument (Orbitrap LC/MS/MS) produces [...] spectra in a single run. This means there are ~21-25 million spectra/yr, equal to ~1 terabyte of raw data per year.
It is challenging, as science, technology and our understanding are always advancing. Interpretation can change but RAW data never does.

[Diagram: JVE Lab data flow. Labels: Auditing Logs; Offsite and independent local back-ups; JVE Lab: uploads and downloads ONLY, NO deletes, NO overwrites; Collaborators; JVE Lab IA Storage Server (currently using 1.73 TB, multiple redundant hard disks, secured data center); Checks/Balances 1, 2, 3; Informatics Processing; Lab (Pass) Database; error flags, including a rounding error.]

Backup of raw mass spectrometry data
1. Source matches target
  i) Size and date (easy but 95% accurate)
  ii) Checksum method (time consuming but 100% accurate)
2. No file update and/or delete permitted
  i) If deletion of a file is required, written authorization is required from the PI
  ii) Overwriting is not possible
3. Easily accessible auditing of all activities (who uploaded/downloaded which files, when, from where, etc.)
4. Backup, backup, backup...
  i) Different locations
  ii) Multiple time points

Data Processing Pipeline Quality Control
Ensuring integrity of data analysis at the local level
• “Bookkeeping Checks”
  – Treat the core or, if reasonable, the entire informatics pipeline as a “black box”, then run as many “integrity tests” as possible to verify that the input (i.e. original raw files) matches the final output (e.g. reports).
  – Manual spot-checking
    • Compare final outputs (e.g. from reports) for a few select data points with the same data points in the original inputs (e.g. raw files).

What is different?
[Diagram: the same JVE Lab data flow as above, with the JHU NHLBI Proteomic Community added.]

When there is lots of data and/or fast analysis is required... how secure is the Cloud?
• 99.9% uptime guarantee for most cloud providers
• Still, good to have local “cheap” backups (e.g. 23 computers in the JVE lab)
• To ensure security...
  – Transmission over the internet can use the same security as your online banking system (not hard to do...)
  – Clouds can be “VPN-enabled”, so that those cloud machines are “behind” the JHU firewall, thus benefitting from the JHU firewall
  – Nevertheless, best idea (for any system, cloud or not): minimize capturing sensitive information unless absolutely required for stats/analysis; if that’s not possible, encrypt sensitive information and restrict access conservatively
• NB: Different rules for HIPAA protected data

Learning from Recent Warning Letters Related to Computer Validation
6th October 2011 at 11:00 to 12:00
Organizer: Dr. Ludwig Huber
Type: Webinar
This seminar will provide more than 20 examples of recent FDA warning letters and give clear recommendations for corrective and preventive actions. In the last couple of years the FDA has discovered serious fraud related to security and integrity of electronic data. As a result, FDA inspections look more than ever at computers, how they are validated and how companies comply with FDA's 21 CFR Part 11. This seminar will present more than 20 related warning letter examples together with detailed recommendations on how to avoid them.
Areas Covered in the Seminar:
• FDA inspections: preparation, conduct, follow-up
• The meaning of warning letters and 483 inspectional observations
• Learning from an FDA presentation: “Data Integrity and Fraud – Another Looming Crisis?”
• Data integrity and authenticity: FDA's new focus during inspections
• Examples of recent Part 11 related 483s and warning letters
• Examples of recent 483s and warning letters related to computer system validation
• Most obvious reasons for deviations
• Responding to 483s to avoid warning letters: going through case studies
• Writing corrective AND preventive action plans as follow-up to 483s
• Using internal audits to prepare yourself for Part 11 related FDA inspections
• Strategies and tools for compliant Part 11 implementation
• The future of Part 11 and computer system validation
http://www.nature.com/natureevents/science/events/12389

Interpretation and Publication
• Use of core facilities
• Role of collaborators
• Setting standards via professional societies
• Responsibilities of journals and reviewers

Translational Science - Big Science
Dealing with massive data sets, new technologies, and novel statistical approaches.

More Lessons from Duke
• Requirement of additional expertise outside group.
  It is still rare to have someone in the group with sufficient expertise to monitor across all aspects of the project.
• Massive amounts of data and software complexity.
• Error introduced due to data handling and poor documentation.
• Computer software may be “research grade”, highly complex, and misunderstood or used inappropriately.
• If you think or figure out that something is not right, admit it, track it down and correct it.
“Sometimes the glamour (and ease) of (some) technology makes investigators forget basic scientific (and biological) principles.”
Source: The Cancer Letter, edited and published by P. Goldberg, www.cancerletter.com

What’s wrong with this MS spectrum?
Unless you are an expert you will not know. But it is wrong.
Proteomic analysis of age dependent nitration of rat cardiac proteins by solution isoelectric focusing coupled to nanoHPLC tandem mass spectrometry. Hong SJ, Gokulrangan G, Schöneich C. Exp Gerontol. 2007;42(7):639-51.
4 proteins and post-translationally modified amino acid residues were reported; all were subsequently shown to be incorrect.
Manuscript that corrected it: Misidentification of nitrated peptides: comments on Hong, S.J., Gokulrangan, G., Schöneich, C., 2007. Exp. Gerontol. 42, 639-651. Prokai L. Exp Gerontol. 2009;44(6-7):367-9.

Use of core facilities: Can they (should they) provide required expertise?
• Concern:
  – Who is responsible for analysis?
  – Complex data still requires an understanding of technology limitations.
  – The situation is worse with emerging technologies, where data analysis is still being developed.
• Solutions
  – Cores with experts (who provide support to help with data analysis, but this is time consuming and expensive)
  – New hybrid core/academic technology development labs
  – Preservation of data transparency and storage of raw data
  – Time to learn the methods across disciplines
  – New paradigm in collaboration requires new approaches to training
“We assume the best in people.” B. Obama, Martin Luther King Memorial speech, Oct 2011.

What is the role of a collaborator?
Is being naïve or inexperienced a sufficient reason for not being responsible for data integrity and data interpretation?
How is broad (but in-depth) experience obtained when focused expertise is the norm?
How do you develop collaborative networks where in-depth cross-disciplinary learning and training is intrinsic?

Role of our scientific community: Setting data standards
HUPO Proteomics Standards Initiative, www.psidev.info/
The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification. The PSI was founded at the HUPO meeting in Washington, April 28-29, 2002 (Science 296, 827).
MIAPE: The minimum information about a proteomics experiment
• 2DE
• MS
• MS informatics
• Column chromatography
• Capillary electrophoresis
• Protein modifications
• Protein affinity
• Bioactive entities
All processed MS data can be (and should be) uploaded to a public database.

Role of journals
• State requirements in instructions to authors
• Set standards for data integrity?
• Who curates?
• Store raw or processed data?
• How is it annotated?
• Detection prior to publication?
• How many DBs?
• Who pays?
• Have expert reviewers?
• If detected, bar authors from publishing in their journal and/or inform the author’s institute?
• If detected after publication? Enforced retraction?
As a reviewer, what is your role? How do we train reviewers?

More lessons from Duke
The IOM “Committee on the Review of Omics-Based Tests for Predicting Outcomes in Clinical Trials” will determine criteria important for the analytical validation, qualification and utilization components of test evaluation for the use of models that predict clinical outcomes from genomic and other omic technologies. The report is due shortly.
Hopefully, the IOM report will set goal posts and pathways in the same manner as for drug trials.

What I have learnt.
Science is difficult. We are limited by our knowledge.
Raw data is never wrong. We may misinterpret it, be fooled by an incorrect assumption, or be limited by technology/approach, but intrinsically it is not wrong.
Your scientific reputation is based on the quality of your data.

Conclusion
Data Integrity Principle: Research data integrity is essential for advancing scientific and medical knowledge and for maintaining public trust. Researchers are ultimately responsible.
Data Access and Sharing Principle: Research (raw and processed) data, methods, and other information integral to publicly reported results should be publicly accessible.
Data Stewardship Principle: Research data should be retained to serve future use. Thus, (raw and processed) data must be documented, referenced, and indexed in order for them to be used accurately and appropriately.

Round table
• Sheila Garrity (Moderator): Director, Division of Research Integrity
• Allen Everett (Assoc Professor): Pediatrics - Cardiology
• Kathleen Barnes (Professor): Medicine–Clinical Immunology
• David Graham (Asst Professor): Molecular and Comparative Pathobiology

Johns Hopkins
• JHU Data Management Policy: http://jhuresearch.jhu.edu/Data_Management_Policy.pdf
• Overall list of JHU policies page: http://jhuresearch.jhu.edu/policies-hopkins.htm
• JHU laptop encryption: http://www.it.johnshopkins.edu/security/encryption.html
• Overall JHU IT security page: http://www.it.johnshopkins.edu/security/
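
Worked example: the "source matches target" check
The slide on backing up raw mass spectrometry data contrasts two ways of confirming that a backup matches its source: comparing size and date, or comparing checksums. The sketch below illustrates that contrast and is not the Van Eyk lab's actual tooling. The choice of SHA-256, the file names, and the function names `quick_match`, `checksum_match` and `demo` are all assumptions made for illustration; the slides do not name a checksum algorithm.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large raw files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def quick_match(source: Path, target: Path) -> bool:
    """Size-and-date check: fast, but blind to changes inside the file."""
    s, t = source.stat(), target.stat()
    return s.st_size == t.st_size and int(s.st_mtime) == int(t.st_mtime)


def checksum_match(source: Path, target: Path) -> bool:
    """Checksum check: reads every byte of both files (SHA-256 is an assumption)."""
    return sha256_of(source) == sha256_of(target)


def demo() -> dict:
    """Build one faithful and one silently corrupted backup, then run both checks."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "run_001.raw"          # hypothetical file names
        good = Path(tmp) / "backup_good.raw"
        bad = Path(tmp) / "backup_bad.raw"

        payload = b"spectrum-data" * 1000
        source.write_bytes(payload)
        good.write_bytes(payload)

        corrupted = bytearray(payload)
        corrupted[500] ^= 0xFF                      # flip one byte, same length
        bad.write_bytes(bytes(corrupted))

        # Give all three files the same timestamp so only the content differs.
        stamp = 1_300_000_000
        for path in (source, good, bad):
            os.utime(path, (stamp, stamp))

        return {
            "quick_good": quick_match(source, good),
            "quick_bad": quick_match(source, bad),
            "checksum_good": checksum_match(source, good),
            "checksum_bad": checksum_match(source, bad),
        }


if __name__ == "__main__":
    for name, result in demo().items():
        print(f"{name}: {result}")
```

In this sketch the corrupted copy has the same size and timestamp as the source, so the quick check accepts it while the checksum comparison rejects it. That is the trade-off the slide describes: the checksum method costs a full read of both files but catches changes that size and date cannot.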