Research Integrity: The Importance of Data Acquisition and Management

Jennifer E. Van Eyk, Ph.D.
Prof. Medicine, Biol. Chem. and BME
Director, JHU Bayview Proteomics Center
Director, JHU ICTR Biomarker Development Center
JHU NHLBI Innovative Proteomics Center on Heart Failure

Principles of Research Data Integrity
• Research integrity depends on data integrity.
– Includes all aspects of collection, use, storage and sharing of data.
• Data integrity is a shared responsibility.
– Everyone involved in the research is responsible.
– The ultimate responsibility belongs to the PI.
– However, there is a broader role and responsibility for the institute and scientific community.
• Transparency of the research data is required.

Free and accurate information exchange is fundamental to scientific progress.

Data integrity can be compromised in numerous ways:
i) malicious proprietors
ii) human mistakes and naivety
iii) technical error

Data integrity is based on accuracy and traceability:
i) collection
ii) recording
iii) storage
iv) reporting

The Consequences of Failure
• Personal loss
• Blocked scientific progression
• Impaired technology development
• Damage to others in the lab, the institution and sponsors
• Tarnished public perception of science
• Damage to or loss of patent protection
• Harm to patients (directly or indirectly)

If it happens or is thought to possibly have happened: be watchful, recognize it, take responsibility, sort out if it has occurred and fix it.

Clinical trials based on genomic selection: Duke University
http://en.wikipedia.org/wiki/Anil_Potti

Two genomic studies from the same multidisciplinary group (Potti and Nevins) formed the basis for three clinical trials. All of the clinical trials were ultimately suspended and then stopped in 2011. At least 10 papers were retracted.

1. Papers using cancer tissue (Potti et al., N Engl J Med 2006;355:570) and cell-based approaches (Hsu et al., J Clin Oncol. 2007;25:4350; Potti, A. et al. Nat. Med. 2006;12:1294–1300) were published with a lot of hype.
2. Issues were raised by K. Baggerly and/or K. Coombes (M.D. Anderson Cancer Center) based on publicly available data (Annals of Applied Statistics, 2009;3:1309 and Nat Med. 2007;13:1276), as well as by others.
3. Duke argued that the mistakes were “clerical errors” and did not alter the fundamental conclusions of the papers.
4. NCI/CTEP required LMS to be tested in a blinded pre-validation study. It failed. The predictor was altered after correcting for the work having been carried out in two different labs. Trials had to be randomized and blinded to minimize the sensitivity of the predictor to laboratory effects.
5. Because of continued concerns, NCI requested all computer code and data preprocessing in order to try to replicate the earlier finding. Replication failed. Use of the predictor for randomization stratification was stopped in the trial.
6. Duke carried out an internal review and reopened the trial.
7. NCI determined that it was partially funding another trial based on a different paper (chemo-sensitivity). Issues were discovered with respect to differences between the data used to build the predictor and the data used for validation. Trials were ultimately suspended.
8. Potti’s academic credentials were found to be falsified. Duke acted.
9. A Duke statement indicated that, with respect to the validation studies, the sensitivity labels are wrong, the sample labels are wrong and the gene labels are wrong, making it “wrong” in a way that could lead to assignment of patients to the wrong treatment.
10. The institute that co-author J. Nevins directed (Duke Institute for Genome Sciences and Policy) was closed (“due to reorganization”), as was the Center for Applied Genomics and Technology, where Dr. Anil Potti was based. Hsu D is still publishing.
11. Papers were retracted (JCO Dec 2010;28:5229 and N Engl J Med. Mar 2011;364:1176).
12. An Institute of Medicine (IOM) committee was struck for independent review and recommendations.
13. An FDA audit at Duke started in 2011; the medical board of North Carolina reprimanded Potti.

Source: The Cancer Letter, edited and published by P. Goldberg, www.cancerletter.com

Learn From Mistakes
• Mentorship (do no harm)
• Oversight, training
• Verification and more verification
• Development of multiple processing pipelines
• Patience and wisdom in applying translation
• Ensure benefit to patients (do no harm)

Outline
• Being practical while avoiding potential errors
– The data
  • Individual Responsibilities
– Data Management
  • Data Collection
  • Data Storage
– Data Interpretation
• Data interpretation and publication in a changing world of translation
– The reality of translational science
  • Challenges
  • Role of Core Facilities
– The role of the scientific community
  • Journals
  • Scientific organizations
• Round table discussion on the research at Johns Hopkins

What is Truth? Standards of Scientific Integrity in American Heart Association Journals
Joseph M. Miano, Arterioscler Thromb Vasc Biol. 2010;30:1-4
• Plagiarism: Not acceptable unless quotation marks and citation are used.
– Use text similarity comparison tools such as Google (http://www.google.com/), eTBLAST (http://invention.swmed.edu/etblast/etblast.shtml) or the Déjà vu database (http://spore.swmed.edu/dejavu/).
• Inaccurate Citations: Must read each one.
– Referenced bibliography = author’s understanding and thoughtful interpretation of the current and past literature, based on the primary citations.
• Manuscript Duplication, Data Duplication and Manipulation: Not allowed.
• Data Fabrication: Not allowed. “Perhaps the most egregious acts of scientific misconduct are those in which an investigator (or his/her subordinates) creates data to deceive a reader.”
• “Gel blots may be digitally cropped, but authors must retain all original primary data in case a journal reviewer or editor requests to see it during the review process. Further, it is possible that AHA journals may adopt the same practice as other journals in publishing original blots online as a supplemental figure.
– It is scientific misconduct to cut a lane from one gel and “splice” it together with another independent gel, giving the false impression that the entire gel is an original.
– When “splicing” lanes is essential, authors must clearly separate lanes from separate blots with enough space between them for readers to recognize each as separate blots. Moreover, authors must also write a brief statement in the legend of the figure acknowledging the independent nature of the blots.
– It is not acceptable to use a control blot from one experiment as the same control for an independent experiment. This form of splicing is surprisingly prevalent but nevertheless constitutes scientific misconduct. When in doubt, authors should simply repeat the entire experiment, which they need to do anyway to confirm the original finding!
– Sometimes, it is necessary to alter the contrast of a gel or blot. This is permissible so long as the adjustment is made uniformly across the entire medium, and a statement to this effect is provided. It is scientific misconduct to enhance or reduce the contrast of a selected region of a gel or blot.
– Moreover, it is scientific misconduct to increase or decrease selectively the fluorescence of only a portion of a microscopic image. As with gels and blots, any manipulation to a microscopic image must be done so evenly across the image, with full disclosure of the adjustment in the figure or in Materials and Methods, or both.
• It is scientific misconduct to show the results of an experiment that was performed only once, unless the author discloses that the experiment was performed only once.”

What is missing?
Guidelines for large “omic-like” datasets including GWAS, genetics, epigenetics, genomics, proteomics, metabolomics.
Often this is the case in biological or clinical journals, while earlier adoption of conduct standards can be found in technology-based journals.

The Data
• Fundamental to research
• Basis for writing papers
• Important for experiment replication
• Meet contractual/funding requirements
• Settle intellectual property claims
• Defense against a charge of fraud
Images from the front covers of Circulation Research – S. Elliott (Van Eyk Lab)

Individual responsibility: Data Management
Four aspects to consider before starting your data collection:
1. Ownership
2. Collection
3. Storage/protection of confidentiality/sharing
4. Interpretation and publication

Whose data is it?
• Custody does not imply ownership.
• Custody remains with the investigator (PI), but JHU owns all data. However, others have rights.
– Funders
– Other data sources
If there is intellectual property, or the research was funded by a sponsored research agreement with a company, who owns the data?

Data Collection
• Depends on the type of raw data
• Notebooks: day-to-day or specialized types of experiments
• Images and large raw and processed data sets
• Generated numbers and information
The goal is to preserve raw data, with transparent processing of data and unbiased interpretation and representation of data.
Altering even a single data point is fraud, even if it does not change significance or interpretation.

Dipak Das carried out research on resveratrol, present in red wine, and its effect on the cardiovascular system.
http://en.wikipedia.org/wiki/Dipak_K._Das
http://tafino.net/blog/red-wine-prevent-heart-disease-or-cardiac/
• In early 2012, the University of Connecticut Health Center found Das guilty of 145 counts of fabrication or falsification of data. Western blot data were manipulated in published papers.
• A special review board at the university produced a 60,000-page report from an investigation conducted over 3 years covering 7 years of research.
– Office of Research Integrity (HHS office that investigates fraud by researchers who receive NIH and other Public Health Service funding).
– Retractions in 11 journals of 36 papers. 30 with citation of >100 times.

Impact on science, health and all of us.
• In 2006 it was shown that obese mice lived longer with resveratrol treatment. Das was not involved in that research, but it does not matter as it taints perception of impact.
• Das was involved in many studies on plant biomolecules and CVD, including garlic and vitamin E.
• Das has published broadly, with over 500 manuscripts. All data becomes suspect.
http://www.nytimes.com/2012/01/12/science/fraud-charges-for-dipak-k-das-a-university-of-connecticut-researcher.html?_r=0
http://retractionwatch.wordpress.com/2012/01/11/uconn-resveratrol-researcher-dipak-das-fingered-in-sweeping-misconduct-case/
For facts about red wine and heart disease go to http://www.nlm.nih.gov/medlineplus/ency/article/001963.htm

The good and the bad of image analysis
• 10-15 years ago, generating a figure was a painstaking process involving cutting blades, rub-on letters, tracing raw data images and traditional photography.
• Today, in a digital world, scanners, imagers and software programs, some very complicated and effectively a black box, are used in figure generation.
– Speeds up data recording
– Should provide more accurate quantification
– Increases ability to manipulate quantification

Data integrity in the digital age
“With the emergence of web-based lab notebooks, digital image “enhancement”, and the quick and easy (and possibly dirty) generation and dissemination of colossal amounts of data, it’s becoming increasingly clear that technology provides new challenges to maintaining scientific integrity. In an attempt to tame the beast while it still has its baby teeth, the US National Academy of Sciences released a report today that provided a framework for dealing with these challenges: "Ensuring the integrity, accessibility and stewardship of research data in the digital age."”
http://blogs.nature.com/news/2009/07/data_integrity.html

“One theme that threads through many fields: the primacy of scrupulously recorded data. Because the techniques that researchers employ to ensure the integrity, that is, the truth and accuracy, of their data are as varied as the fields themselves, there are no universal procedures for achieving technical accuracy. The term “integrity of data” also has a structural meaning, related to the data’s preservation and presentation.”

“Broadly accepted practices for generating and analyzing research must be shown to be reproducible in order to be credible. Other general practices include checking and rechecking data to confirm their accuracy, validity and also submitting data and results to peer review to ensure that the interpretation is valid.”

Gross manipulation of blots (http://ori.hhs.gov/)
• What evidence proves the 67 kDa band is the same data as the 32 kDa band?
• How can you show that three lanes are the same data?

What image should be published?
Misrepresentation of immunogold data. The gold particles, which were actually present in the original (left), have been enhanced in the manipulated image (right). Note also that the background dot in the original data has been removed in the manipulated image. Example provided by Journal of Cell Biology.
http://jcb.rupress.org/content/166/1/11/F6.expansion.html

Data Forensics
• Can only "de-authenticate" an image (indicate discrepancies).
• Authentication requires access to the original data.
• The identification of a discrepancy is an allegation, and does not mean there was an intentional falsification of data.
• The interpretation of whether any image manipulation is serious requires familiarity with the experiment(s) and imaging instruments.
Data forensic tools are employed by journals in a manner similar to tools used to detect plagiarism.
Office of Research Integrity, US Department of Health and Human Services: http://ori.hhs.gov/

Data Storage, Protection & Sharing
• Raw data needs to be stored
– Lab notebooks should be stored in a safe place
– Computer files should be backed up
– Protected and limited access to computer raw data
• Samples should be saved appropriately so they will not degrade over time.
• Data and experiment information should be available after publication.
• Data should be retained for a reasonable period of time.
Dilemma: When a postdoctoral fellow (PDF) or clinical fellow leaves a lab (especially if the paper is not written up), where does the data stay?

How long?
• Retain study records and records of disclosures of study information:
– For IRB clinical trials
  • Retain records for 7 years after the last subject completed the study OR 7 years after the date of the last disclosure of identifiable health information from study records.
  • If the research subject is a child, retain until the subject reaches the age of 23.
– For Investigational New Drug (IND) research
  • Retain records for 2 years after the marketing application is approved for the new drug, or until 2 years after shipment and delivery of the drug for investigational use is discontinued.
– For Investigational Device Exemption (IDE) research
  • Retain records for 2 years after the later of the following two dates: the date on which the investigation is terminated or completed, or the date on which the records are no longer required for purposes of supporting a premarket approval application or notice of completion of a product development protocol.
• Provide adequate data and safety monitoring (if the activity represents more than minimal risk to participants)
• Complete required training
– Human Subject Research (HSR) compliance
– Conflict of Interest (COI)
– Privacy issues

Traditional Lab Notebook Best Practices
• Date all entries (especially important if contesting IP)
• Title and state the purpose of the experiment
• Describe the experiment in detail
– Protocol
– Calculations
– Reagents (lot numbers, passage numbers, etc.)
– Results (everything that does and doesn’t happen)
– Print-outs, pictures, graphs, etc. with links to other data storage locations.
• The record needs to be intact and permanent
– All mistakes are to be left (cross out)
– Do not remove pages
– Write in pen
– Clearly link connected experiments across time
Requirement: Need to be able to follow the development and execution of the experiments and all of the data analysis.

New technologies are also prone to:
• Naivety
• Technological, software and bioinformatics errors
• Fraud and manipulation
Video, images, ELISA readouts, genomic or miRNA data, GWAS data, etc.

Other types of data storage: Van Eyk Lab
Raw MS spectrum
Processed MS spectrum
Identifies peptides based on comparison to existing database
Peptides clustered to identify protein name
Proteins are clustered to remove name redundancy present

BACKGROUND
Proteomics uses mass spectrometers (MS) to identify peptides and proteins. MS accurately weighs the mass of peptides and their fragments. The observed spectrum is compared to the theoretical mass of all known amino acid sequences in a database, allowing assignment to a protein with a certain probability. Quantification can be based on the number of spectra observed per analysis (spectral count). One high-accuracy MS instrument (Orbitrap LC/MS/MS) produces C number of spectra in a single run. This means ~21-25 million spectra/year, equal to ~1 terabyte of raw data/year.
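At roughly a terabyte of raw data per year, a backup copy is only useful if it can be shown to match its source. The sketch below contrasts the two verification approaches this deck lists for backups: a quick size-and-date comparison and a checksum comparison. It is an illustration only: the choice of SHA-256, the function names and the file names are assumptions, not a description of the Van Eyk Lab's actual tooling.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading in chunks so large raw files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def quick_match(source: Path, target: Path) -> bool:
    """Fast check (hypothetical helper): same size and same modification time, to the second."""
    s, t = os.stat(source), os.stat(target)
    return s.st_size == t.st_size and int(s.st_mtime) == int(t.st_mtime)


def checksum_match(source: Path, target: Path) -> bool:
    """Slow check (hypothetical helper): both files hash to the same SHA-256 digest."""
    return sha256_of(source) == sha256_of(target)


def demo() -> dict:
    """Copy a stand-in raw file, then corrupt one byte of the copy while keeping its size and date."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "run_001.raw"          # hypothetical file name
        target = Path(tmp) / "backup_run_001.raw"   # hypothetical file name
        source.write_bytes(b"spectrum-data" * 1000)
        shutil.copy2(source, target)  # copy2 also copies the modification time
        before = (quick_match(source, target), checksum_match(source, target))

        # Flip one byte in the backup, then restore its timestamps to match the source.
        data = bytearray(target.read_bytes())
        data[100] ^= 0xFF
        target.write_bytes(bytes(data))
        stat = os.stat(source)
        os.utime(target, (stat.st_atime, stat.st_mtime))
        after = (quick_match(source, target), checksum_match(source, target))
    return {"before": before, "after": after}


if __name__ == "__main__":
    print(demo())
```

In this sketch the corrupted copy still passes the size-and-date check but fails the checksum check, which is the trade-off between the fast and the thorough method: the checksum costs a full read of both files but catches changes that metadata cannot.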
It is challenging, as science, technology and our understanding are always advancing. Interpretation can change, but RAW data never does.

[Diagram: JVE Lab IA storage server (currently using 1.73 TB; multiple redundant hard disks; secured data center) with auditing logs, offsite back-ups and independent local back-ups. JVE Lab and collaborators: uploads and downloads ONLY, NO deletes, NO overwrites.]

[Diagram: Checks/Balances at points 1, 2 and 3 across Lab, Informatics Processing (Pass), Database and Collaborators, with flagged errors including a rounding error.]

Backup of raw mass spectrometry data
1. Source matches target
i) Size and date (easy but 95% accurate)
ii) Checksum method (time consuming but 100% accurate)
2. No file update and/or delete permitted
i) If deletion of a file is required, written authorization is required from the PI
ii) Overwriting is not possible
3. Easily accessible auditing of all activities
4. Backup, backup, backup…
i) Different locations
ii) Multiple time points

Complete Data Processing Pipeline Quality Control
Ensuring integrity of data analysis at the local level

Interpretation and Publication
• Use of core facilities
• Role of collaborators
• Setting standards via professional societies
• Responsibilities of journals and reviewers

Translational Science - Big Science
Dealing with massive data sets, new technologies, and novel statistical approaches.

More Lessons from Duke
1. Requirement of additional expertise outside the group. It is still rare to have someone with sufficient expertise to monitor across all aspects of a project.
2. Massive amounts of data and software complexity.
3. Error introduced due to data handling and poor documentation.
4. Computer software may be “research grade”, highly complex and misunderstood or used inappropriately.
5. If you think or figure out something is not right, admit it, track it down and correct it.
“Sometimes the glamour (and ease) of (some) technology makes investigators forget basic scientific (and biological) principles.”
Source: The Cancer Letter, edited and published by P. Goldberg, www.cancerletter.com

What’s wrong with this MS spectrum?
Unless you are an expert you will not know. But it is wrong.
Proteomic analysis of age dependent nitration of rat cardiac proteins by solution isoelectric focusing coupled to nanoHPLC tandem mass spectrometry. Hong SJ, Gokulrangan G, Schöneich C. Exp Gerontol. 2007;42(7):639-51.
Four proteins and post-translationally modified amino acid residues were reported; all were subsequently shown to be incorrect.
Manuscript that corrected it: Misidentification of nitrated peptides: comments on Hong, S.J., Gokulrangan, G., Schöneich, C., 2007. Exp. Gerontol. 42, 639-651. Prokai L. Exp Gerontol. 2009;44(6-7):367-9.

Use of core facilities: Can they (should they) provide the required expertise?
• Concerns:
– Who is responsible for analysis?
– Complex data still requires an understanding of technology limitations.
– The situation is worse with emerging technologies, where data analysis is still being developed.
• Solutions:
– Cores with experts (and provided support to help with data analysis, but this is time consuming and expensive)
– New hybrid cores: academic technology development labs.
– Preservation of data transparency and storage of raw data.
– Time to learn the methods across disciplines.
– New paradigm in collaboration requires new approaches to training.

“We assume the best in people.” B. Obama, Martin Luther King Memorial speech, Oct 2011.

What is the role of a collaborator?
Many paper retractions are around authorship.

Who Deserves To Be A Co-Author?
Does a colleague who provided intellectual background, funding or samples, but did not participate in the collection of the data or the preparation of the manuscript, warrant authorship?
Is their responsibility equivalent with respect to tracking potential data integrity issues?
The International Committee of Medical Journal Editors (ICMJE)
“Authorship credit should be based on 1) substantial contributions to conception and design, acquisition of data, or analysis and interpretation of data; 2) revising it critically for important intellectual content; and 3) final approval of the version to be published. Acquisition of funding, collection of data, or general supervision of the research group alone does not constitute authorship.”
Is being naïve or inexperienced a sufficient reason for not being responsible for data integrity and data interpretation?

Reality: Collaboration is essential
• Collaboration and large scale science is encouraged.
• Collaborative science: each person/lab does their own bit.
Practical solutions:
• Each person needs to know enough of the process to ask questions.
• QC and CV% need to be in place.
• Validation should be assumed prior to publication.
Open questions:
• How is broad (but in-depth) experience obtained when focused expertise is the norm?
• How do you develop collaborative networks where in-depth cross-disciplinary learning and training is intense?
Truth: Trust is assumed but should be earned.

Role of our scientific community: Setting data standards
HUPO Proteomics Standards Initiative, www.psidev.info/
The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification. The PSI was founded at the HUPO meeting in Washington, April 28-29, 2002 (Science 296, 827).

MIAPE: The minimum information about a proteomics experiment
• 2DE
• MS
• MS informatics
• Column chromatography
• Capillary electrophoresis
• Protein modifications
• Protein affinity
• Bioactive entities
All processed MS data can be/should be uploaded to a public database.

Role of journals
• State requirements in instructions to authors
• Set standards for data integrity?
• Who curates?
• Store raw or processed data?
• How is it annotated?
• Detection prior to publication?
• How many DBs?
• Who pays?
• Have expert reviewers?
• If detected, bar authors from publishing in their journal and/or inform the author’s institute?
• If detected after publication? Enforced retraction?
As a reviewer, what is your role? How do we train reviewers?

More lessons from Duke
The IOM “Committee on the Review of Omics-Based Tests for Predicting Outcomes in Clinical Trials” will determine criteria important for the analytical validation, qualification and utilization components of test evaluation for the use of models that predict clinical outcomes from genomic and other omic technologies.
IOM sets goal posts and pathways for large science omic data in the same manner as exist for drug trials.

What I have learnt
Science is difficult. We are limited by our knowledge and perspective.
Raw data is never wrong. We may misinterpret it, be fooled by an incorrect assumption, or be limited by technology/approach, but intrinsically it is not wrong if quality control is in place.
Your scientific reputation is based on the quality of your data.
Highly collaborative technologies and fields require new training programs to ensure sufficient knowledge to question other sections in an unbiased environment.

Today’s reality
The National Institutes of Health is reducing funding levels for both new and existing grants.
Impact: More pressure to publish, but less money to carry out the science. Will the impact be fewer personnel and trainees, or less funding to reproduce or validate data?
“Could the sequester mean more business for Retraction Watch?” http://retractionwatch.wordpress.com

Conclusion
Data Integrity Principle: Research data integrity is essential for advancing scientific and medical knowledge and for maintaining public trust. Every researcher is ultimately responsible.
Data Access and Sharing Principle: Research (raw and processed) data, methods, and other information integral to publicly reported results should be publicly accessible.
Data Stewardship Principle: Research data should be retained to serve future use. Thus, (raw and processed) data must be documented, referenced, and indexed in order for them to be used accurately and appropriately.

Round table
Sheila Garrity (Moderator), Director of the Division of Research Integrity, Johns Hopkins
• JHU Data Management Policy: http://jhuresearch.jhu.edu/Data_Management_Policy.pdf
• Overall list of JHU policies page: http://jhuresearch.jhu.edu/policies-hopkins.htm
• JHU laptop encryption: http://www.it.johnshopkins.edu/security/encryption.html
• Overall JHU IT security page: http://www.it.johnshopkins.edu/security/

Panel Discussion
• Landon King, M.D.
• Frederick Luthardt, M.A., M.A.
• Ingo Ruczinski, Ph.D.
• Sheila Garrity, J.D., M.P.H., M.B.A. – moderator