CCEGA Informatics Project: Developing Shared Infrastructure and Data Models

Project Leader: Brad Hemminger, bmh@ils.unc.edu
School of Information and Library Science
University of North Carolina at Chapel Hill

Participants
• Brad Hemminger, bmh at ils.unc.edu
• Kaye Balke, balke at ils.unc.edu
• Kirk Wilhelmsen, kirk at neurology.unc.edu
• David Threadgill, dwt at med.unc.edu
• Dong Xiang, dxiang at email.unc.edu
• Min Xu, xumin at med.unc.edu
• Joel Kingsolver, jgking at bio.unc.edu
• Paul Brown, paul.brown at unc.edu
• Lavanya Ramakrishnan, lavanya at renci.org
• Roger Akers, akers at unc.edu
• Peter DeSaix, pdesaix at email.unc.edu
• Clark Jeffries, clark_jeffries at med.unc.edu
• Xiaojun Guan, xguan at renci.org
• Kevin Gamiel, kgamiel at renci.org
• Erik Scott, escott at renci.org
• Barrie Hayes, bhayes at email.unc.edu

Project Aims
Goal: Development of a common data model and informatics infrastructure for UNC
• Determine needs of research labs on campus
• Determine applicable global standards that can be utilized
• Determine issues that affect whether research labs would utilize a common infrastructure and common data model
• Understand and address security issues
• Based on this information, develop model

Lab Surveys
• Bioinformatics research labs at UNC were invited to provide details of their data infrastructure, in particular their data models (and example data).
• PIs and database administrators from the projects met with our full committee for interviews, and afterwards we followed up to obtain dumps of their data schemas.
Labs that provided in-depth interviews and complete data models
• Kirk Wilhelmsen (alcoholism and addiction projects)
• Paul Brown (Cell Biology, multiple projects)
• Roger Akers (Epidemiology Specimen Tracking)
• Lineberger (multiple cancer projects)
• Mike Knowles (Pulmonary and Cystic Fibrosis)
• Kari North (case control and family based studies of cardiovascular disease)
• Proteomics Center (earlier project)

Global Standards
• While there are no overarching standards that provide common definitions for all the necessary data elements, standards exist in many individual domains (microarrays, genetic sequences, proteins, etc). Additionally, larger scale efforts are being made, such as CDISC (clinical trials) and caBIG (cancer). caBIG has a whole workgroup devoted to vocabularies and common data elements (VCDE).

Issues affecting user acceptance
• Almost all research projects prefer to have their own database
  – Specific projects
  – No need to tie into other researchers' data
  – No need to preserve data generated by study
  – Easier to build themselves
  – More control when managed themselves
• Core facilities
  – Require specific control, privacy of data
• Clinical facilities
  – Rigorous requirements regarding sharing of data (ELSI, HIPAA)

Reasons for Sharing
• More studies are required to share data between projects (larger studies, multicenter studies)
• More projects depend on outside resources (databanks)
• Free or inexpensive disk space
• Dependable archiving of data
• Assistance in designing data models for study

Security
Possible security design requirements:
• Identification tables of entities (as in Trusted Broker doc)
• Translation tables among entities
• Authentication (two-way) between broker and entities
• Authorization of entities by broker
• Encrypted channels (SSL, IPSec, other)
• Protection against various denial of service attack types (limiting multiple accesses or very frequent access requests from any one researcher, etc.)
• Multiple types of access requirements for the human trusted broker (something you have, you know, or you are)
• Other requirements on trusted broker (bonded staff, permission to modify databases requiring at least two separate trusted brokers cooperating, etc.)
• Remote backup system...

Common Data Model
• Had a general framework from previous work
• Built new model from the ground up
  – Took all data elements from all the research labs and pooled them to define the overall set of elements, including which elements from different labs mapped to the same "common" elements.
  – Produced a set of core elements that were common to many projects and important for sharing.
• Integrated new model with overall design principles from the general framework to develop the final "common data model".

Model diagram: entities and attributes
INVESTIGATOR: name; organization; department; contact person; address physical; address billing; account number; telephone; fax; e-mail
EXTERNAL SOURCE: Hospital lab; Supplier
BIOLOGICAL SOURCE: age; anatomical; developmental stage; disease state; gender; genetic variation; organism name (NCBI)
SAMPLE: id; name; state; parent sample; root sample; processing; measurement
PROCESSING: cloning; digestion; imaging; MS; MS/MS; preparation; separation; spot selection (image analysis)
PROTOCOL: chromatographic; ESI; gel electrophoresis 1D/2D; imaging; MS; MS/MS; spot picking; spot selection
DATA ANALYSIS: database search method; de novo sequencing; probability based matching
PROTOCOL: analysis method; probabilistic algorithm
PARAMETERS: algorithm; scoring; confidence value
ANALYTICAL DEVICES: software; search engine
OBSERVED VALUES: annotated spectrum; candidate protein ID; derived monoisotopic spectra; file format/size (scoring graph); fragment characteristics; probability scores; quality assurance spreadsheet; unassigned peptides
ANNOTATION: citation; database registration No.
PROCESSING DEVICES:
digestion station; imaging analyzer; imaging system; mass spectrometer; separation device; spot picking system; scale
PARAMETERS: applied filters (spot selection); column characteristics (chromatography); concentration (solvent, reagent, buffer); file format (TIFF, CSV, XML); flow rate (ESI); media composition (gel, solution, buffer); picking tip (spot picker); pressure (HPLC, ESI); proteolytic enzyme (digestion); resolution (ESI); selected spots (cutlist) (image analyzer); selection/excision (pen); size (spot picker); stage (gel); stationary phase composition (chromatography); tip internal diameter (ESI); transit time (gel, chromatography); volume (solution, wash); voltage (ESI, gel); well plate specification (digestion, MS)
MEASURED VALUES: aliquot volume (LC, digestion); dispense volume (gel); file size; mass accuracy (MS, ESI); mass/charge (m/z) ratio (MS); molecular weight (MS); MS spectra (mass fingerprint); MS/MS spectra (fragment ion, ESI); OD rating (MS); pick shift (spot picker); pick volume (spot picker); position coordinates (spot picker); post pick image (image analyzer); resolution (image); root sample image; sample weight (gel); spot picker image

Example of integrating data
• View integration spreadsheet, look at example (samples) of before and after.

Final Common Model
• Take the common data elements and put them into a database system for testing.
  – Database schema design (see printout)
  – Integrate standards in definition of data elements
  – Incorporate into actual database
• Test model database by incorporating actual data from volunteer labs (Kirk, Roger)

Next Steps
• The aim of this P20 planning project is to prepare for further grants in this area, and to help lay the groundwork for building a common biomedical informatics infrastructure at UNC.
• In Jan 2007, we submitted a CTSA (Clinical and Translational Science Award) grant proposal. This grant aims to integrate all biomedical informatics infrastructure on campus.
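Sketch: pooling lab data elements into common elements
The pooling step described for the common data model (collect every lab's data elements, map equivalent ones to a shared "common" element, keep the widely shared ones as the core) could be sketched roughly as below. This is only an illustration under stated assumptions, not the project's actual tooling: the lab names, element names, the hand-curated mapping table, and the "used by at least two labs" threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lab schemas: lab name -> local data element names.
LAB_ELEMENTS = {
    "lab_a": ["subject_id", "sex", "sample_id", "age_years"],
    "lab_b": ["patient_no", "gender", "specimen_id"],
    "lab_c": ["participant", "gender", "age", "freezer_box"],
}

# Hypothetical hand-curated mapping: (lab, local name) -> common element.
COMMON_NAME = {
    ("lab_a", "subject_id"): "subject_id",
    ("lab_b", "patient_no"): "subject_id",
    ("lab_c", "participant"): "subject_id",
    ("lab_a", "sex"): "gender",
    ("lab_b", "gender"): "gender",
    ("lab_c", "gender"): "gender",
    ("lab_a", "sample_id"): "sample_id",
    ("lab_b", "specimen_id"): "sample_id",
    ("lab_a", "age_years"): "age",
    ("lab_c", "age"): "age",
}


def pool_elements(lab_elements, common_name):
    """Return {common element: {lab: local name}} over all labs."""
    pooled = defaultdict(dict)
    for lab, elements in lab_elements.items():
        for local in elements:
            # Elements without a mapping stay lab-specific (prefixed name).
            common = common_name.get((lab, local), f"{lab}.{local}")
            pooled[common][lab] = local
    return dict(pooled)


def core_elements(pooled, min_labs=2):
    """Common elements used by at least `min_labs` labs, sorted by name."""
    return sorted(name for name, labs in pooled.items() if len(labs) >= min_labs)


if __name__ == "__main__":
    pooled = pool_elements(LAB_ELEMENTS, COMMON_NAME)
    for name in core_elements(pooled):
        print(name, pooled[name])
```

In this sketch the mapping table stands in for the manual review with the labs; elements that no one has mapped remain visible as lab-specific entries rather than being dropped.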
CTSA--overview
• The TraCS Biomedical Informatics Core will unite the silos of biomedical informatics research excellence at UNC and across North Carolina to maximize re-use of data, knowledge and processes. With the establishment of the North Carolina Collaboratory for Biomedical Informatics (NCCBI), TraCS will support research, patient care, education and policy-making while building upon, leveraging and extending the current biomedical informatics infrastructure at UNC-CH. This core involves several external partners with a strong presence in NC and world-wide: Red Hat, IBM, SAS, Allscripts, Quintiles and NCHICA. We are committed to achieving a national leadership role in the design and development of best practices for the inclusion of clinical data into shared repositories of biomedical data.

CTSA--tie in clinical data
• To support the goals of the TraCS Institute, the Biomedical Informatics Core will create a statewide interdisciplinary and interinstitutional collaboratory (collaborative laboratory): the North Carolina Collaboratory for Biomedical Informatics (NCCBI). It will build on the transformative technology used by the NIH to create Entrez for the NCBI. The long-term goal is to create a shared biomedical informatics data repository connecting clinical enterprises across the State of North Carolina to create a demonstration project for clinical data that will be a model for sharing and re-use of clinical data. This repository will contain appropriately de-identified data from clinical trials and clinical care. With the establishment of the NCCBI, the TraCS Biomedical Informatics Core will transform the excellent but fragmented biomedical informatics capabilities at UNC-CH into a coherent and connected system that facilitates routine re-use of research knowledge, data and processes throughout UNC and North Carolina, serving as a prototype for the nation.
Example Centers Included
• General Clinical Research Center
• Collaborative Studies Coordinating Center
• Lineberger Comprehensive Cancer Center
• Carolina Center for Exploratory Genetic Analysis
• Carolina Center for Genome Sciences
• Carolina Exploratory Center for Cheminformatics Research
• Biomedical Imaging Research Center
• Carolina Environmental Bioinformatics Center
• Center for Bioinformatics
• Renaissance Computing Institute
• Odum Institute for Research in Social Science

CTSA
• In short, the CTSA proposal builds on the work of the P20, and offers us the potential to truly transform the way scientists and clinicians work at UNC, and bring about unprecedented integration and data sharing.

Summary--Timeline
• Initial workshop beginning the project (spring 2005)
• Analysis of data requirements, policies, and existing infrastructure at UNC. Internal interviews with labs (spring through fall 2005)
• Development of complete list of data elements; review with labs and finalize elements for common model (fall 2005-spring 2006)
• Development of draft model (fall 2006-spring 2007)
• Testing of draft model using example labs' data (fall 2007)
• Review by labs and researchers at UNC. Share with outside experts to solicit critiques. (fall 2007)
• Use this work to develop new grants to fund actual deployment of common data models, policies and infrastructure at UNC. (spring 2007-current)