Research and data integrity

Research Integrity:
The Importance of Data Acquisition
and Management.
Jennifer E. Van Eyk, Ph.D.
Prof. Medicine, Biol. Chem. and BME
Director, JHU Bayview Proteomics Center
Director, JHU ICTR Biomarker Development Center
JHU NHLBI Innovative Proteomics Center on Heart Failure
Principles of Research Data Integrity
• Research integrity depends on data integrity.
– Includes all aspects of collection, use, storage and
sharing of data.
• Data integrity is a shared responsibility.
– Everyone involved in the research is responsible.
– The ultimate responsibility belongs to the PI.
– However, there is a broader role and responsibility
for the institute and scientific community.
• Transparency of the research data is required.
Free and accurate information exchange is
fundamental to scientific progress.
Data integrity can be compromised in numerous ways:
i) malicious proprietors
ii) human mistakes and naivety
iii) technical error
Data integrity is based on accuracy
and traceability:
i) collection
ii) recording
iii) storage
iv) reporting
[Image: Stolen Chimney, Utah, from "Top Ten Less-Extreme Rock Climbing Routes", www.gorp.com/parks-guide/travel]
The Consequences of Failure
• Personal loss
• Blocked scientific progression
• Impaired technology development
• Damage to others in lab, the institution and sponsors
• Tarnished public perception of science
• Damage to or loss of patent protection
• Harm to patients (directly or indirectly)
If it happens or is thought to possibly have happened: be
watchful, recognize it, take responsibility, sort out if it
has occurred and fix it.
Clinical trials based on genomic selection:
Duke University
http://en.wikipedia.org/wiki/Anil_Potti
Based on 2 genomic studies from the same multi-disciplinary group (Potti and Nevins), from
which three clinical trials were undertaken. All clinical trials were ultimately suspended and
stopped in 2011. At least 10 papers were retracted.
1. Papers using cancer tissue (Potti et al., N Engl J Med 2006;355:570) and cell based approaches (Hsu et al., J Clin Oncol. 2007;25:4350;
Potti, A. et al. Nat. Med. 2006 12, 1294–1300) were published with a lot of hype.
2. Issues were raised by K. Baggerly and/or K. Coombes (M.D. Anderson Cancer Center) based on publicly available data (Annals of
Applied Statistics, 2009:3:1309 and Nat Med. 2007;13:1276) as well as by others.
3. Duke argues mistakes were “clerical errors” and do not alter fundamental conclusions of papers.
4. NCI/CTEP required LMS to be tested in a blinded pre-validation study. It failed. Predictor was altered after being corrected for having been
carried out in two different labs. Trials had to be randomized and blinded to minimize sensitivity of predictor to laboratory effects.
5. NCI, due to continued concerns, requests all computer code and data preprocessing in order to try to replicate the earlier finding. It
failed. Use of the predictor for randomization stratification was stopped in the trial.
6. Duke carried out internal review and reopened trial.
7. NCI determines it is partially funding another trial based on a different paper (Chemo-sensitivity). Issues were discovered with
respect to differences in data used to build predictor and data used for validation. Trials were ultimately suspended.
8. Potti’s academic credentials were found to be falsified. Duke acts.
9. Duke statement indicates that, with respect to validation studies, the sensitivity labels are wrong, sample labels are wrong and the
gene labels are wrong, making it “wrong” in a way that could lead to assignment of patients to the wrong treatment.
10. The institute that co-author J. Nevins directed (Duke Institute for Genomic Science and Policy) is closed (“due to reorganization”), as is the
Center for Applied Genomics and Technology in which Dr. Anil Potti was based. Hsu D is still publishing.
11. Papers were retracted (JCO Dec 2010;28:5229 and N Engl J Med. Mar 2011;364:1176).
12. Institute of Medicine (IOM) committee struck for independent review and recommendations.
13. FDA audit at Duke starting in 2011; Medical Board of N. Carolina reprimanded Potti.
The Cancer Letter, edited and published by P. Goldberg, www.cancerletter.com
Learn From Mistakes
• Mentorship (do no harm)
• Oversight, training
• Verification and more verification
• Development of multiple processing pipelines
• Patience and wisdom in applying translation
• Ensure benefit to patients (do no harm)
Outline
• Being practical while avoiding potential errors
– The data
• Individual Responsibilities
– Data Management
• Data Collection
• Data Storage
– Data Interpretation
• Data interpretation and publication in a changing world of translation
– The reality of translational science
• Challenges
• Role of Core Facilities
– The role of the scientific community
• Journals
• Scientific organizations
• Round table discussion on the research at Johns Hopkins
What is Truth? Standards of
Scientific Integrity in
American Heart Association
Journals
Joseph M. Miano, Arterioscler Thromb
Vasc Biol. 2010;30:1-4
• Plagiarism: Not acceptable unless
quotation marks and citation.
– Use text similarity comparison algorithms such as
Google (http://www.google.com/), eTBLAST
(http://invention.swmed.edu/etblast/etblast.shtml) or the Déjà vu
database (http://spore.swmed.edu/dejavu/).
• Inaccurate Citations: Must read
each one.
– Referenced bibliography = author’s
understanding and thoughtful interpretation
of the current and past literature based on the
primary citations.
• Manuscript Duplication, Data
duplication and manipulation: Not
allowed.
• Data Fabrication: Not allowed.
“Perhaps the most egregious acts of
scientific misconduct are those in which
an investigator (or his/her subordinates)
creates data to deceive a reader.”
• “Gel blots may be digitally cropped, but authors must retain all original primary data in
case a journal reviewer or editor requests to see it during the review process. Further, it is
possible that AHA journals may adopt the same practice as other journals in publishing
original blots online as a supplemental figure.”
– “It is scientific misconduct to cut a lane from one gel and “splice” it together with another
independent gel, giving the false impression that the entire gel is an original. When “splicing” lanes
is essential, authors must clearly separate lanes from separate blots with enough space between
them for readers to recognize each as separate blots. Moreover, authors must also write a brief
statement in the legend of the figure acknowledging the independent nature of the blots.
– It is not acceptable to use a control blot from one experiment as the same control for an
independent experiment. This form of splicing is surprisingly prevalent but nevertheless constitutes
scientific misconduct. When in doubt, authors should simply repeat the entire experiment, which
they need to do anyway to confirm the original finding!
– Sometimes, it is necessary to alter the contrast of a gel or blot. This is permissible so long as the
adjustment is made uniformly across the entire medium, and a statement to this effect is provided.
– It is scientific misconduct to enhance or reduce the contrast of a selected region of a gel or blot.
Moreover, it is scientific misconduct to increase or decrease selectively the fluorescence of only a
portion of a microscopic image. As with gels and blots, any manipulation to a microscopic image
must be done so evenly across the image, with full disclosure of the adjustment in the figure or in
Materials and Methods, or both.
– It is scientific misconduct to show the results of an experiment that was performed only
once, unless the author discloses that the experiment was performed only once.”
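The rule on contrast adjustment quoted above can be illustrated with a minimal sketch. It assumes a grayscale image held as a list of rows of 0-255 integers; the function name `adjust_contrast_uniform` and the linear gain/offset model are illustrative assumptions, not something the AHA guidance prescribes.

```python
def adjust_contrast_uniform(image, gain, offset=0):
    """Apply the same linear adjustment to every pixel of a grayscale image.

    `image` is a list of rows of integer intensities in 0..255.
    A new image is returned; the original (raw) data is never modified.
    """
    def clamp(value):
        # Keep adjusted intensities inside the valid 8-bit range.
        return max(0, min(255, int(round(value))))

    return [[clamp(gain * pixel + offset) for pixel in row] for row in image]


# Made-up 2 x 3 "blot" used only for illustration.
raw = [[10, 20, 30],
       [40, 50, 60]]
adjusted = adjust_contrast_uniform(raw, gain=2.0, offset=5)
```

Because the function takes no region argument, it cannot be used to enhance only a selected area, and the gain and offset are the values that would be disclosed in the figure legend or Materials and Methods.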
What is missing?
Guidelines for large “omic-like” datasets including GWAS, genetics, epigenetics,
genomics, proteomics, metabolomics.
Often this is the case in biological or clinical based journals while earlier
adoption of conduct standards can be found in technology-based journals.
The Data
Fundamental to research
Basis for writing papers
Important for experiment
replication
Meet contractual/funding
requirements
Settle intellectual property
claims
Defense against a charge
of fraud
Images from the front covers of Circulation Research – S. Elliott (Van Eyk Lab)
Individual responsibility
Data Management
Four aspects to consider before starting
your data collection:
1. Ownership
2. Collection
3. Storage/protection of confidentiality/sharing
4. Interpretation and publication
Whose data is it?
• Custody does not imply ownership.
• Custody remains with investigator (PI) but JHU
owns all data. But, others have rights.
– Funders
– Other data sources
If there is intellectual property or the research
was funded by a sponsored research
agreement with a company, who owns the data?
Data Collection
• Depends on the type of raw data
• Notebooks – day to day or specialized types of
experiments
• Images and large raw and processed data sets
• Generated numbers and information
Goal is to preserve raw data, transparent processing of data,
unbiased interpretation and representation of data.
Altering even a single data point is fraud, even if it does not
change significance or interpretation.
Dipak Das carried out some research
on resveratrol, present in red wine, and
its effect on the cardiovascular system.
http://en.wikipedia.org/wiki/Dipak_K._Das
http://tafino.net/blog/red-wine-prevent-heart-disease-or-cardiac/
Impact on science, health and all of us.
In 2006 it was shown that obese mice lived longer with resveratrol treatment.
Das was not involved in that research but it does not matter
as it taints perception of impact.
Das was involved in many studies on plant biomolecules and CVD including
garlic and vitamin E.
• In early 2012, the University of Connecticut Health Center found Das guilty of
145 counts of fabrication or falsification of data. Western blot data were
manipulated in published papers.
• A special review board at the university produced a 60,000-page report from
an investigation conducted over 3 years covering 7 years of research.
– Office of Research Integrity (NIH office that investigates fraud by researchers
who receive its funding).
– Retractions in 11 journals of 36 papers.
Das has published broadly with over 500 manuscripts, 30 with citation of >100 times.
All data becomes suspect.
http://www.nytimes.com/2012/01/12/science/fraud-charges-for-dipak-k-das-a-university-of-connecticutresearcher.html?_r=0
http://retractionwatch.wordpress.com/2012/01/11/uconn-resveratrol-researcher-dipak-das-fingered-insweeping-misconduct-case/
For facts about red wine and heart disease go to
http://www.nlm.nih.gov/medlineplus/ency/article/001963.htm
The good and the bad of
image analysis
• 10-15 years ago, generating a figure was a painstaking process involving cutting blades, rub-on
letters, tracing raw data images and traditional
photography.
• Today, in a digital world, scanners, imagers
and software programs, some very complicated
and a black box, are used in figure generation.
– Speeds up data recording
– Should provide more accurate quantification
– Increases ability to manipulate quantification
Data integrity in the digital age
“With the emergence of web-based lab notebooks, digital image “enhancement”, and
the quick and easy (and possibly dirty) generation and dissemination of colossal
amounts of data, it’s becoming increasingly clear that technology provides new
challenges to maintaining scientific integrity. In an attempt to tame the beast while it
still has its baby teeth, the US National Academy of Sciences released a report today
that provided a framework for dealing with these challenges: ‘Ensuring the integrity,
accessibility and stewardship of research data in the digital age.’”
http://blogs.nature.com/news/2009/07/data_integrity.html
“One theme, that threads through many fields: the primacy of scrupulously
recorded data. Because the techniques that researchers employ to ensure the
integrity—the truth and accuracy—of their data are as varied as the fields
themselves, there are no universal procedures for achieving technical accuracy. The
term “integrity of data” also has a structural meaning, related to the data’s
preservation and presentation.”
“Broadly accepted practices for generating and analyzing research must be shown
to be reproducible in order to be credible. Other general practices include checking
and rechecking data to confirm their accuracy, validity and also submitting data and
results to peer review to ensure that the interpretation is valid.”
What evidence proves the 67 kDa band is the
same data as the 32 kDa band?
How can you show that three lanes are the same data?
Gross manipulation of blots
http://ori.hhs.gov/
What image should be published?
http://jcb.rupress.org/content/166/1/11/F6.expansion.html
Misrepresentation of immunogold data. The gold particles, which were actually
present in the original (left), have been enhanced in the manipulated image (right).
Note also that the background dot in the original data has been removed in the
manipulated image. Example provided by Journal of Cell Biology.
Data Forensics
• Can only "de-authenticate" an image (indicate
discrepancies).
• Authentication requires access to the original data.
• The identification of a discrepancy is an allegation, and does
not mean there was an intentional falsification of data.
• The interpretation of whether any image manipulation is
serious requires familiarity with the experiment(s) and imaging
instruments.
Data Forensic Tools are employed by journals in a manner
similar to tools used to detect plagiarism.
Office of research integrity
US Department of Health and Human Services
http://ori.hhs.gov/
Data Storage, Protection & Sharing
• Raw data needs to be stored
– Lab notebooks should be stored in a safe place
– Computer files should be backed up
– Protected and limited access to computer raw data
• Samples should be saved appropriately so they will not degrade
over time.
• Data and experiment information should be available after
publication.
• Data should be retained for a reasonable period of time.
Dilemma: When a postdoctoral fellow (PDF) or clinical fellow leaves a lab
(especially if the paper is not written up) where does
the data stay?
How long?
• Retain study records and records of disclosures of study information:
– For IRB clinical trial
• Retain records for 7 years after last subject completed study OR 7 years after date
of last disclosure of identifiable health information from study records.
• If research subject is a child, retain until subject reaches age of 23
– For Investigational New Drug (IND) research
• Retain records for 2 years after marketing application approved for new drug or
until 2 years after shipment and delivery of drug for investigation use is
discontinued.
– For Investigational Device Exemption (IDE) research
• Retain records for 2 years after the later of the following two dates: date on which
investigation is terminated or completed or date on which records no longer
required for purposes of supporting a premarket approval application or notice of
completion of a product development protocol.
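As a rough illustration of the IRB clinical trial rule above, the sketch below computes a retention end date. The function names are hypothetical, and one point is an assumption rather than something the slide states: when more than one trigger applies (study completion, last disclosure, a child subject reaching 23), the latest resulting date is taken as the conservative choice. The governing policy should be consulted for the actual rule.

```python
from datetime import date


def add_years(d, years):
    """Return d shifted by `years`, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def irb_retention_end(last_subject_completed, last_disclosure=None,
                      child_birth_date=None):
    """Date until which IRB clinical trial records are retained.

    Assumption (not from the source): the latest applicable date wins.
    """
    candidates = [add_years(last_subject_completed, 7)]
    if last_disclosure is not None:
        candidates.append(add_years(last_disclosure, 7))
    if child_birth_date is not None:
        # Child subject: retain until the subject reaches age 23.
        candidates.append(add_years(child_birth_date, 23))
    return max(candidates)


# Made-up dates used only for illustration.
adult_case = irb_retention_end(date(2013, 6, 30), last_disclosure=date(2014, 1, 15))
child_case = irb_retention_end(date(2013, 6, 30), child_birth_date=date(2005, 3, 1))
```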
• Provide adequate data and safety monitoring (if activity represents more than
minimal risk to participants)
• Complete required training
– Human Subject Research (HSR) compliance
– Conflict of Interest (COI)
– Privacy issues
Traditional Lab Notebook
Best Practices
•Date all entries (especially important if contesting IP)
•Title and state purpose of experiment
•Describe experiment in detail
– Protocol
– Calculations
– Reagents (lot numbers, passage numbers, etc.)
– Results (everything that does and doesn’t happen)
– Print-outs, pictures, graphs, etc with links to other data storage locations.
• Record needs to be intact and permanent
– All mistakes are to be left (cross out)
– Do not remove pages
– Write in pen
– Clearly link connected experiments across time
Requirement: Need to be able to follow the development and
execution of the experiments and all of the data analysis.
New technologies are also prone to:
• Naivety
• Technological, software and bioinformatics errors
• Fraud and manipulation
Other types of data storage:
Video, Images, ELISA readouts, genomic or miRNA data, GWAS data, etc.
Van Eyk Lab
Raw MS spectrum
Processed MS spectrum
Identifies peptides based on
comparison to existing database
Peptides clustered to identify
protein name
Proteins are clustered to remove
name redundancy present
BACKGROUND
Proteomics uses mass spectrometers (MS) to
identify peptides and proteins. MS accurately
weighs the mass of peptides and their fragments.
The observed spectrum is compared to the
theoretical mass of all known amino acid
sequences in a database allowing assignment to a
protein with a certain probability. Quantification
can be based on the number of spectra observed
per analysis (spectral count).
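Spectral counting as described above can be sketched in a few lines. This is a simplified illustration, assuming each spectrum has already been assigned to at most one protein by the database search; the scan identifiers and protein names are placeholders, not real data.

```python
from collections import Counter


def spectral_counts(assignments):
    """Count identified spectra per protein for one analysis.

    `assignments` is an iterable of (spectrum_id, protein) pairs; spectra
    with no confident match carry protein None and are skipped.
    """
    return Counter(protein for _, protein in assignments if protein is not None)


# Hypothetical search results for a single run.
run = [
    ("scan_0001", "PROTEIN_A"),
    ("scan_0002", "PROTEIN_B"),
    ("scan_0003", "PROTEIN_A"),
    ("scan_0004", None),  # unassigned spectrum
    ("scan_0005", "PROTEIN_A"),
]
counts = spectral_counts(run)
```

The counts depend entirely on the upstream assignments, so a change in the database or search parameters changes the quantification while the raw spectra stay the same.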
A single highly accurate MS instrument (Orbitrap
LC/MS/MS) produces in a single run C number of
spectra. This means there are ~21-25 million
spectra/yr, equal to ~1 terabyte of raw data/year.
It is challenging as science, technology and our
understanding are always advancing.
Interpretation can change but RAW data never does.
[Diagram: JVE Lab raw data storage]
JVE Lab IA Storage Server
• Currently Using 1.73TB
• Multiple Redundant Hard Disks
• Secured Data Center
JVE Lab and Collaborators: Uploads and Downloads ONLY, NO Deletes, NO Overwrites
Auditing Logs
Offsite Independent local back-ups
[Diagram: Checks/Balances 1, 2, 3 across Informatics Processing, Lab (Pass), Database and Collaborators; failure labels: ERROR!, ERROR!, Rounding ERROR!]
Backup of raw mass spectrometry data
1. Source Matches Target
i) Size and date (easy but 95% accurate)
ii) CheckSum method (time consuming but 100% accurate)
2. No file update and/or delete permitted
i) If deletion of file required, written authorization required from PI
ii) Overwriting is not possible
3. Easily accessible auditing of all activities
4. Backup, backup, backup….
i) Different locations
ii) Multiple time points
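The two "source matches target" checks listed above can be sketched as follows. This is a minimal illustration rather than the lab's actual tooling: the function names are hypothetical, and the use of SHA-256 as the checksum is an assumption.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large raw files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def quick_match(source, target):
    """Fast check: same size and same modification time (can miss corruption)."""
    a, b = os.stat(source), os.stat(target)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def checksum_match(source, target):
    """Slow check: compare content hashes of the source and its backup copy."""
    return sha256_of(source) == sha256_of(target)
```

The quick check reads only file metadata, which is why it is fast but can pass a copy whose contents were corrupted without changing size or timestamp. The checksum check reads every byte of both files, which is where the extra time goes.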
Complete Data Processing Pipeline Quality Control
Ensuring integrity of data analysis at local level
Interpretation and Publication
• Use of core facilities
• Role of collaborators
• Setting standards via professional societies
• Responsibilities of journals and reviewers
Translational Science - Big Science
Dealing with massive data sets,
new technologies, and
novel statistical approaches.
More Lessons from Duke
1. Requirement of additional expertise outside group. It is still rare to have
someone with sufficient expertise to monitor across all aspects of a project.
2. Massive amounts of data and software complexity.
3. Error introduced due to data handling and poor documentation.
4. Computer software may be “research grade”, highly complex and
misunderstood or used inappropriately.
5. If you think or figure out something is not right, admit it and track it down
and correct it.
“Sometimes the glamour (and ease) of (some) technology
makes investigators forget basic scientific
(and biological) principles”
The Cancer Letter, edited and published by P. Goldberg,
www.cancerletter.com
What’s wrong with this MS spectrum?
Unless you are an expert you will not know. But, it is wrong.
Proteomic analysis of age dependent
nitration of rat cardiac proteins by
solution isoelectric focusing coupled
to nanoHPLC tandem mass
spectrometry. Hong SJ, Gokulrangan
G, Schöneich C. Exp Gerontol.
2007;42(7):639-51.
4 proteins and post-translationally
modified amino acid residues
were reported; all were
subsequently shown to be
incorrect.
Manuscript that corrected it:
Misidentification of nitrated
peptides: comments on Hong, S.J.,
Gokulrangan, G., Schöneich, C., 2007.
Exp. Gerontol. 42, 639-651. Prokai L.
Exp Gerontol. 2009; 44(6-7):367-9.
Use of core facilities:
Can they (should they) provide
required expertise?
• Concern:
– Who is responsible for analysis?
– Complex data still requires understanding of technology limitations?
– Situation is worse with emerging technologies where
data analysis is still being developed?
• Solutions
– Cores with experts (and provided support to help with data analysis,
but this is time consuming and expensive)
– New hybrid cores-academic technology development labs.
– Preservation of data transparency and storage of raw data.
– Time to learn the methods across disciplines.
– New paradigm in collaboration requires new approaches to training.
We assume the best in people.
B. Obama at Martin Luther King Memorial speech Oct 2011.
What is the role of a collaborator?
Many paper retractions are around authorship.
Who Deserves To Be A Co-Author?
Does a colleague who provides intellectual background, funding or
samples but did not participate in the collection of the data or the
preparation of the manuscript warrant authorship?
Is their responsibility equivalent with respect to tracking potential data
integrity issues?
The International Committee of Medical Journal Editors (ICMJE)
“Authorship credit should be based on 1) substantial contributions to
conception and design, acquisition of data, or analysis and interpretation
of data; 2) revising it critically for important intellectual content; and 3) final
approval of the version to be published.
Acquisition of funding, collection of data, or general supervision of the
research group alone does not constitute authorship.”
Is being naïve or inexperienced a sufficient
reason for not being responsible for data
integrity and data interpretation?
Reality:
Collaboration is essential.
Collaboration and large scale science is encouraged.
Collaborative science: each person/lab does their own bit.
How is broad (but in-depth) experience obtained when focused expertise is the norm?
Practical solutions:
Each person needs to know enough of the process to ask questions.
QC and CV% need to be in place.
Validation should be assumed prior to publication.
How do you develop collaborative networks where cross disciplinary learning and
in-depth training is intense?
Truth:
Trust is assumed but should be earned.
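For readers outside the measuring lab, the CV% mentioned under practical solutions is simple to compute and to ask about. The sketch below is illustrative only: the function names, the replicate values and any acceptance limit passed to `passes_qc` are assumptions, since the appropriate limit depends on the assay.

```python
import statistics


def cv_percent(replicates):
    """Coefficient of variation (%) = sample standard deviation / mean * 100."""
    mean = statistics.mean(replicates)
    if mean == 0:
        raise ValueError("CV% is undefined when the mean is zero")
    return statistics.stdev(replicates) / mean * 100


def passes_qc(replicates, max_cv_percent):
    """True when replicate measurements are within the chosen CV% limit."""
    return cv_percent(replicates) <= max_cv_percent


# Made-up replicate readings of one QC sample.
qc_replicates = [98.0, 102.0, 100.0]
qc_cv = cv_percent(qc_replicates)
```

A collaborator who receives only processed numbers can still ask what the replicate CV% was and what limit was applied, which is the kind of question the slide argues each person should be able to raise.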
Role of our scientific community
Setting data standards
HUPO Proteomics Standards Initiative
www.psidev.info/
The HUPO Proteomics Standards
Initiative (PSI) defines community
standards for data representation in
proteomics to facilitate data
comparison, exchange and
verification. The PSI was founded at
the HUPO meeting in Washington,
April 28-29, 2002 (Science 296,827).
MIAPE
The minimum information
about a proteomics experiment
• 2DE
• MS
• MS informatics
• Column chromatography
• Capillary electrophoresis
• Protein modifications
• Protein affinity
• Bioactive entities
All processed MS data can be/should be
uploaded to a public database
Role of journals
• State requirements in instruction to authors
• Set standards for data integrity?
• Who curates?
• Store raw or processed data?
• How is it annotated?
• Detection prior to publication?
• How many DBs?
• Who pays?
• Have expert reviewers?
• If detected, bar authors from publishing in their
journal and/or inform author’s institute?
• If detected after publication? Enforced
retraction?
As a reviewer, what is your role?
How do we train reviewers?
More lessons from Duke
The IOM “Committee on the Review of Omics-Based Tests for Predicting
Outcomes in Clinical Trials” will determine criteria important for
analytical validation, qualification and utilization components of test
evaluation for the use of models that predict clinical outcomes from genomic
and other omic technologies.
IOM set goal posts and pathways for large science omic
data in the same manner as there are for drug trials.
What I have learnt.
Science is difficult. We are limited by our knowledge
and perspective.
Raw data is never wrong. We may misinterpret it, be
fooled by an incorrect assumption, be limited by
technology/approach but intrinsically, it is not wrong if
quality control is in place.
Your scientific reputation is based on the quality of
your data.
Highly collaborative technologies and fields require
new training programs to ensure sufficient knowledge
to question other sections in an unbiased environment.
Today’s reality
The National Institutes of Health is reducing funding levels for both
new and existing grants.
Impact: More pressure to publish.
But, less money to carry out the science.
Will the impact be fewer personnel and trainees, or fewer funds to
reproduce or validate data?
“Could the sequester mean more business for
Retraction Watch?”
http://retractionwatch.wordpress.com
Conclusion
Data Integrity Principle: Research data integrity is
essential for advancing scientific and medical knowledge and
for maintaining public trust. Every researcher is ultimately
responsible.
Data Access and Sharing Principle: Research (raw and
processed) data, methods, and other information integral to
publicly reported results should be publicly accessible.
Data Stewardship Principle: Research data should be
retained to serve future use. Thus, (raw and processed) data
must be documented, referenced, and indexed in order for them
to be used accurately and appropriately.
Round table
Sheila Garrity (Moderator)
Director of the Division of
Research Integrity
Johns Hopkins
• JHU Data Management Policy:
http://jhuresearch.jhu.edu/Data_Management_Policy.pdf
• Overall list of JHU policies page:
http://jhuresearch.jhu.edu/policies-hopkins.htm
• JHU laptop encryption:
http://www.it.johnshopkins.edu/security/encryption.html
• Overall JHU IT security page:
http://www.it.johnshopkins.edu/security/
Panel Discussion
• Landon King, M. D.
• Frederick Luthardt, M.A., M.A.,
• Ingo Ruczinski, Ph.D.
• Sheila Garrity, J.D., M.P.H., M.B.A.
– moderator