Informatics-Summary-2007-02-02

CCEGA Informatics Project:
Developing Shared Infrastructure and
Data Models
Project Leader: Brad Hemminger
bmh@ils.unc.edu
School of Information and Library Science
University of North Carolina at Chapel Hill
Participants
• Brad Hemminger, bmh at ils.unc.edu
• Kaye Balke, balke at ils.unc.edu
• Kirk Wilhelmsen, kirk at neurology.unc.edu
• David Threadgill, dwt at med.unc.edu
• Dong Xiang, dxiang at email.unc.edu
• Min Xu, xumin at med.unc.edu
• Joel Kingsolver, jgking at bio.unc.edu
• Paul Brown, paul.brown at unc.edu
• Lavanya Ramakrishnan, lavanya at renci.org
• Roger Akers, akers at unc.edu
• Peter DeSaix, pdesaix at email.unc.edu
• Clark Jeffries, clark_jeffries at med.unc.edu
• Xiaojun Guan, xguan at renci.org
• Kevin Gamiel, kgamiel at renci.org
• Erik Scott, escott at renci.org
• Barrie Hayes, bhayes at email.unc.edu
Project Aims
Goal: Development of a common data model and
informatics infrastructure for UNC
• Determine needs of research labs on campus
• Determine applicable global standards that can
be utilized
• Determine issues that affect whether research
labs would utilize a common infrastructure and
common data model.
• Understand and address security issues
• Based on this information, develop model
Lab Surveys
• Bioinformatics Research labs at UNC
were invited to provide details of their
data infrastructure, in particular their
data models (and example data).
• PIs and database administrators from the
projects met with our full committee for
interviews, and afterwards we followed
up to obtain dumps of their data
schemas.
Labs that provided in-depth
interviews and complete data models
• Kirk Wilhelmsen (alcoholism and addiction
projects)
• Paul Brown (Cell Biology, multiple projects)
• Roger Akers (Epidemiology Specimen
Tracking)
• Lineberger (multiple cancer projects)
• Mike Knowles (Pulmonary and Cystic Fibrosis)
• Kari North (case control and family based
studies of cardiovascular disease)
• Proteomics Center (earlier project)
Global Standards
• While there are no overarching standards
that define common definitions for all the
data elements necessary, standards
exist in many individual domains
(microarrays, genetic sequences,
proteins, etc.). Additionally, larger scale
efforts are being made, such as CDISC
(clinical trials) and caBIG (cancer).
caBIG has a whole workgroup devoted to
vocabularies and common data elements
(VCDE).
Issues affecting user acceptance
• Almost all research projects prefer to have their
own database
– Specific projects
– No need to tie into other researchers' data
– No need to preserve data generated by study
– Easier to build themselves
– More control when managed themselves
• Core facilities
– Require specific control, privacy of data
• Clinical facilities
– Rigorous requirements regarding sharing of data (ELSI,
HIPAA)
Reasons for Sharing
• More studies are required to share data
between projects (larger studies,
multicenter studies)
• More projects depend on outside
resources (databanks)
• Free or inexpensive disk space
• Dependable archiving of data
• Assistance in designing data models for
study
Security
Possible security design requirements:
• Identification tables of entities (as in Trusted Broker doc)
• Translation tables among entities
• Authentication (two-way) between broker and entities
• Authorization of entities by broker
• Encrypted channels (SSL, IPSec, other)
• Protection against various denial of service attack types (limiting
multiple accesses or very frequent access requests from any one
researcher, etc.)
• Multiple types of access requirements for the human trusted broker
(something you have, you know, or you are)
• Other requirements on trusted broker (bonded staff, permission to
modify databases requiring at least two separate trusted brokers
cooperating, etc.)
• Remote backup system...
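Two of the requirements above, translation tables held by the broker and limits on very frequent requests from any one researcher, can be illustrated with a minimal Python sketch. Everything named here (`TrustedBroker`, `translate`, the window size, the pseudonym format) is a hypothetical illustration, not the design in the Trusted Broker doc. Authentication, authorization and encrypted channels are deliberately left out.

```python
import secrets
from collections import defaultdict, deque


class TrustedBroker:
    """Hypothetical broker: keeps a translation table and throttles requests."""

    def __init__(self, max_requests=5, window_seconds=60.0):
        self._pseudonyms = {}               # (lab, local id) -> pseudonym
        self._recent = defaultdict(deque)   # researcher -> request timestamps
        self.max_requests = max_requests
        self.window_seconds = window_seconds

    def _allow(self, researcher, now):
        """Sliding-window limit on requests from any one researcher."""
        stamps = self._recent[researcher]
        # Forget requests that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_requests:
            return False
        stamps.append(now)
        return True

    def translate(self, researcher, lab, local_id, now):
        """Return a stable pseudonym for a lab-local id, or None if throttled."""
        if not self._allow(researcher, now):
            return None
        key = (lab, local_id)
        if key not in self._pseudonyms:
            # Random pseudonym: the mapping exists only in the broker's table.
            self._pseudonyms[key] = secrets.token_hex(8)
        return self._pseudonyms[key]
```

The caller passes the current time as `now` (for example `time.monotonic()`), which keeps the throttling logic easy to test. The same lab-local identifier always maps to the same pseudonym, regardless of which researcher asks.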
Common Data Model
• Had a general framework from previous work
• Built new model from ground up
– Took all data elements from all the research labs and
pooled together to define overall set of elements,
including which elements from different labs mapped to
the same “common” elements.
– Produced set of core elements that were common to
many projects and important for sharing.
• Integrated new model with overall design
principles from general framework to develop
final “common data model”.
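The pooling step described above can be sketched in Python as follows. The lab names, field names and the `min_labs` threshold are invented for illustration. "Common to many projects" is approximated here by a simple count of labs, and the judgment of which elements are important for sharing is not modelled.

```python
from collections import defaultdict

# Hypothetical input: lab -> {lab-local field name: common element name}.
LAB_MAPPINGS = {
    "lab_a": {
        "subj_id": "sample.id",
        "sex": "biological_source.gender",
        "dx": "biological_source.disease_state",
    },
    "lab_b": {
        "specimen_no": "sample.id",
        "gender": "biological_source.gender",
    },
    "lab_c": {
        "sample_key": "sample.id",
        "organism": "biological_source.organism_name",
    },
}


def pool_elements(mappings):
    """Pool every lab's fields under the common element they map to."""
    pooled = defaultdict(set)
    for lab, fields in mappings.items():
        for local_name, common_name in fields.items():
            pooled[common_name].add((lab, local_name))
    return pooled


def core_elements(pooled, min_labs=2):
    """Common elements that appear in at least `min_labs` different labs."""
    return sorted(
        common_name
        for common_name, sources in pooled.items()
        if len({lab for lab, _ in sources}) >= min_labs
    )


pooled = pool_elements(LAB_MAPPINGS)
core = core_elements(pooled)
```

With the invented data above, `core` contains the two elements shared by at least two labs, and `pooled` records which lab-local fields were mapped to each one.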
Data model diagram, entities and their attributes:
• INVESTIGATOR: name; organization; department; contact person; address physical; address billing; account number; telephone; fax; e-mail
• EXTERNAL SOURCE: Hospital lab; Supplier
• BIOLOGICAL SOURCE: age; anatomical; developmental stage; disease state; gender; genetic variation; organism name (NCBI)
• SAMPLE: id; name; state; parent sample; root sample; processing; measurement
• PROCESSING: cloning; digestion; imaging; MS; MS/MS; preparation; separation; spot selection (image analysis)
• PROTOCOL: chromatographic; ESI; gel electrophoresis 1D/2D; imaging; MS; MS/MS; spot picking; spot selection
• PARAMETERS: applied filters (spot selection); column characteristics (chromatography); concentration (solvent, reagent, buffer); file format (TIFF, CSV, XML); flow rate (ESI); media composition (gel, solution, buffer); picking tip (spot picker); pressure (HPLC, ESI); proteolytic enzyme (digestion); resolution (ESI); selected spots (cutlist) (image analyzer); selection/excision (pen) size (spot picker); stage (gel); stationary phase composition (chromatography); tip internal diameter (ESI); transit time (gel, chromatography); volume (solution, wash); voltage (ESI, gel); well plate specification (digestion, MS)
• PROCESSING DEVICES: digestion station; imaging analyzer; imaging system; mass spectrometer; separation device; spot picking system; scale
• MEASURED VALUES: aliquot volume (LC, digestion); dispense volume (gel); file size; mass accuracy (MS, ESI); mass/charge (m/z) ratio (MS); molecular weight (MS); MS spectra (mass fingerprint); MS/MS spectra (fragment ion, ESI); OD rating (MS); pick shift (spot picker); pick volume (spot picker); position coordinates (spot picker); post pick image (image analyzer); resolution (image); root sample image; sample weight (gel); spot picker image
• DATA ANALYSIS: database search method; de novo sequencing; probability based matching
• PROTOCOL: analysis method; probabilistic algorithm
• PARAMETERS: algorithm scoring; confidence value
• ANALYTICAL DEVICES: software; search engine
• OBSERVED VALUES: annotated spectrum; candidate protein ID; derived monoisotopic spectra; file format/size (scoring graph); fragment characteristics; probability scores; quality assurance spreadsheet; unassigned peptides
• ANNOTATION: citation; database registration No.
Example of integrating data
• View integration spreadsheet, look at
example (samples) of before and after.
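As a rough illustration of what "before and after" could look like for a single record, here is a small Python sketch. The field names, the mapping and the sample values are all invented for illustration and are not taken from the integration spreadsheet.

```python
# Hypothetical mapping from one lab's column names to common element names.
LAB_TO_COMMON = {
    "specimen_no": "sample_id",
    "sex": "gender",
    "dx": "disease_state",
}


def to_common(record, mapping):
    """Rename mapped fields; collect unmapped ones separately for review."""
    common, unmapped = {}, {}
    for field, value in record.items():
        if field in mapping:
            common[mapping[field]] = value
        else:
            unmapped[field] = value
    return common, unmapped


# "Before": a record as one lab might store it (invented values).
before = {"specimen_no": "A-0012", "sex": "F", "dx": "asthma", "freezer": "B3"}

# "After": the same record expressed in common element names.
after, leftover = to_common(before, LAB_TO_COMMON)
```

Fields with no common counterpart are returned separately rather than dropped, so a lab can decide whether they stay local or justify a new common element.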
Final Common Model
• Developing the final model by taking the common
data elements and putting them into a database
system for testing.
– Database schema design (see printout)
– Integrate standards in definition of data
elements
– Incorporate into actual database
• Test model database by incorporating
actual data from volunteer labs (Kirk,
Roger)
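A test database of this kind might start from something like the following Python sketch, which uses the standard `sqlite3` module. The table and column names loosely follow the INVESTIGATOR and SAMPLE entities of the data model. The column types, the link from sample to investigator and the helper functions are assumptions for illustration, not the schema on the printout.

```python
import sqlite3

# Assumed schema: a small subset of the common data elements.
SCHEMA = """
CREATE TABLE investigator (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    organization TEXT,
    email        TEXT
);
CREATE TABLE sample (
    id               INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    state            TEXT,
    investigator_id  INTEGER REFERENCES investigator(id),  -- assumed link
    parent_sample_id INTEGER REFERENCES sample(id),
    root_sample_id   INTEGER REFERENCES sample(id)
);
"""


def build_test_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def load_sample(conn, name, state=None, parent_id=None):
    """Insert a sample; derived samples inherit the root of their parent."""
    root_id = None
    if parent_id is not None:
        row = conn.execute(
            "SELECT COALESCE(root_sample_id, id) FROM sample WHERE id = ?",
            (parent_id,),
        ).fetchone()
        if row is None:
            raise ValueError(f"unknown parent sample {parent_id}")
        root_id = row[0]
    cur = conn.execute(
        "INSERT INTO sample (name, state, parent_sample_id, root_sample_id) "
        "VALUES (?, ?, ?, ?)",
        (name, state, parent_id, root_id),
    )
    return cur.lastrowid
```

Storing both the parent and the root sample makes it possible to trace any aliquot back to its original specimen with a single lookup, at the cost of keeping the two columns consistent on insert.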
Next Steps
• The aim of this P20 planning project is to
prepare for further grants in this area,
and to hopefully help lay the groundwork
for building a common biomedical
informatics infrastructure at UNC
• In Jan 2007, we submitted a CTSA grant
(Clinical and Translational Science
Award). This grant aims to integrate all
biomedical informatics infrastructure on
campus.
CTSA--overview
• The TraCS Biomedical Informatics Core will unite the
silos of biomedical informatics research excellence at
UNC and across North Carolina to maximize re-use of
data, knowledge and processes. With the
establishment of the North Carolina Collaboratory for
Biomedical Informatics (NCCBI), TraCS will support
research, patient care, education and policy-making
while building upon, leveraging and extending the
current biomedical informatics infrastructure at
UNC-CH. This core involves several external partners with a
strong presence in NC and world-wide: Red Hat, IBM,
SAS, Allscripts, Quintiles and NCHICA. We are
committed to achieving a national leadership role in
the design and development of best practices for the
inclusion of clinical data into shared repositories of
biomedical data.
CTSA—tie in clinical data
• To support the goals of the TraCS Institute, the Biomedical
Informatics Core will create a statewide interdisciplinary and interinstitutional collaboratory (collaborative laboratory): the North
Carolina Collaboratory for Biomedical Informatics (NCCBI). It will
build on the transformative technology used by the NIH to create
Entrez for the NCBI. The long-term goal is to create a shared
biomedical informatics data repository connecting clinical
enterprises across the State of North Carolina to create a
demonstration project for clinical data that will be a model for
sharing and re-use of clinical data. This repository will contain
appropriately de-identified data from clinical trials and clinical
care. With the establishment of the NCCBI, the TraCS Biomedical
Informatics Core will transform the excellent but fragmented
biomedical informatics capabilities at UNC-CH into a coherent and
connected system that facilitates routine re-use of research
knowledge, data and processes throughout UNC and North
Carolina, serving as a prototype for the nation.
Example Centers Included
• General Clinical Research Center, the
Collaborative Studies Coordinating Center, the
Lineberger Comprehensive Cancer Center, the
Carolina Center for Exploratory Genetic
Analysis, the Carolina Center for Genome
Sciences, the Carolina Exploratory Center for
Cheminformatics Research, the Biomedical
Imaging Research Center, the Carolina
Environmental Bioinformatics Center, the
Center for Bioinformatics, the Renaissance
Computing Institute, and the Odum Institute
for Research in Social Science
CTSA
• In short, the CTSA proposal builds on the
work of the P20, and offers us the
potential to truly transform the way
scientists and clinicians work at UNC,
and bring about unprecedented
integration and data sharing.
Summary--Timeline
• Initial workshop beginning project (spring 2005)
• Analysis of data requirements, policies, and existing
infrastructure at UNC. Internal interviews with labs
(spring through fall 2005)
• Development of complete list of data elements, review
with labs and finalize elements for common model (fall
2005-spring 2006)
• Development of draft model (fall 2006-spring 2007)
• Testing of draft model using example labs' data (fall
2007)
• Review by labs and researchers at UNC. Share with
outside experts to solicit critiques. (fall 2007)
• Use this work to develop new grants to fund actual
deployment of common data models, policies and
infrastructure at UNC. (spring 2007-current)