SCIENCE-DRIVEN INFORMATICS FOR PCORI PPRN Kristen Anton UNC Chapel Hill/ White River Computing Dan Crichton White River Computing February 3, 2014 From Information Design, Nathan Shedroff White River Computing / UNC Chapel Hill Architecture – what is it? Architecture: • The fundamental organization of a system embodied in its components, their relationships to each other and to the environment, and the principles guiding its design and evolution (ANSI/IEEE Std. 1471-2000) Architecture is decomposed into four core pieces : • Process Architecture – describes the core processes for the system • Data Architecture – describes the information models and data standards for the system • Application Architecture – Portals, tools, etc. • Technology Architecture – Infrastructure elements White River Computing / UNC Chapel Hill Architecture Development Approach • Identify the drivers and requirements • Create an architectural description of the system – identified stakeholders, concerns and associated models • Identify core architectural principals • Separate the architecture into key viewpoints • Create a decomposition of the system identifying the elements and mapping to the requirements • Identified the high-level flows and analyze from the rpocess, information and application/technology perspectives • Generate the architectural models White River Computing / UNC Chapel Hill Communicating an Architecture • One of the major challenges is communicating an architecture • • Determine a useful view of the system for the stakeholder Projects have suffered because a useful view wasn’t provided • Who are the PCORI stakeholders that care about the architecture? • How do we communicate their care-abouts? The view is what you see The viewpoint is where you look from (Stakeholders) White River Computing / UNC Chapel Hill Software Development • The organization, implementation and deployment of the software should follow the identification of an architecture which aligns with the principles and needs of the stakeholders • The separation of the architecture into concerns will let us determine what capabilities exist and what capabilities need to be developed • Ultimately this will help to ensure that a system is deployed which will integrate White River Computing / UNC Chapel Hill Recommended Software Development Approach Jan 2014 – Mar 2014 Project Formulation Feb 2014 – June 2014 Project Organization, Objectives, High Level Schedule and Project Plan System Formulation/ Architecture High-Level Architecture for System and Data, Architecture, Data Flows, Initial Data Structure, etc June 2014 – June 2015 Site Development Development and deployment of the infrastructure and architecture; development of the core data model/ consistent with PCORnet “universal” data model? White River Computing / UNC Chapel Hill Supporting science-driven research needs: Case Study – Early Detection Research Network (EDRN) • Research network of collaborating scientists from more than 40 institutions – international network of networks • Focus on identifying and validating biomarkers of cancer at early stage/ preclinical Bioinformatics challenges in EDRN: Developing computing infrastructure that is “biomarker-centric.” Improve research capability by enabling real-time access to a variety of information that crosses institutional boundaries. White River Computing / UNC Chapel Hill Bioinformatics – Goals Supporting science-driven research needs • Coordinated discovery and validation of biomarkers across cancer research centers to increase accuracy of the results of studies • Accommodating various data types • Facilitation of analytics through data integration and single-point access • Support workflows associated with various types of information • Encouraging and supporting collaboration White River Computing / UNC Chapel Hill Bioinformatics – Goals Supporting science-driven research needs • Linking highly diverse systems together to integrate and present data for analytics • Defining a comprehensive information model for describing the problem space/ ontology • Providing software interfaces for capture, discovery, and access of data resources • Providing a secure transfer and distribution infrastructure • Enabling all data sources to be heterogeneous and distributed • Providing integrated portal for access to distributed data • Providing bioinformatics tools/ pipelines for uniform data processing White River Computing / UNC Chapel Hill Bioinformatics EDRN Knowledge Environment Functional architecture: Services • • • • • • Data capture Data discovery Data access Data retrieval Data processing Data distribution White River Computing / UNC Chapel Hill Bioinformatics EDRN Knowledge Environment Information architecture: Data Model across EDRN projects (“universal” data model) • Representation of information associated with data objects managed within the knowledge system • Models for: • • • • • Biomarkers Studies Participants Organs Data generated from instruments (e.g. mass spec, arrays) White River Computing / UNC Chapel Hill Bioinformatics EDRN Knowledge Environment Information architecture: Data Model • Relationships between and among objects • Standard set of metadata elements that can be used for annotating objects • Multiple metadata schemata for machine usable explanations of the metadata descriptions • Metadata descriptions describe the inception and composition of data • Common language for describing data and associated attributes: Common Data Elements (CDEs) • CDE has a Uniform Resource Identifier (URI) – URL form points to CDE definition page – used in XML standards White River Computing / UNC Chapel Hill EDRN Knowledge Environment Public Portal BIOINFORMATICS TOOLS EDRN science data results eCAS (protocol_id, participant_id) Science Warehouse (protocol_id, participant_id) ERNE Participant DB Participants and their characteristics EDRN science data results (local, distributed and varying degrees of validation) Protocol DB Protocols and their descriptions (protocol_id, participant_id) (protocol_id) Biomarker_DB Descriptions of biomarkers and their use (protocol_id) Distributed Specimen Databases CDE Repository Data elements and their descriptions VSIMS Descriptions of EDRN studies -Participants -Specimen tracking, etc EDRN Knowledge Environment Success? • Biomarker Database holds 850 curated biomarkers, including panels/ signatures of biomarkers • Biomarker Database modeled to reflect the data model: activity in multiple organs, protocols, data files – facilitate single-point data access • eSIS contains 165 protocols • eCAS holds 56 data sets, with many files in each set, and more added daily – standard metadata around each set and each product • Two bioinformatics tools implemented: Proteomics “pipeline” (generating standardized biomarker identification files); REDCap (standardized data definition and capture at the project level) – additional in progress • Common Data Elements (CDEs) contributed to the NCI repository • CDE has a Uniform Resource Identifier (URI) – URL form points to CDE definition page – used in XML standards • Portal facilitates authorized access to almost 200,000 specimens • Publications and Resources White River Computing / UNC Chapel Hill EDRN Knowledge Environment Technology • Iterative development • Open Source philosophy and tools • Apache OODT (Object Oriented Data Technology) Software components developed independent of any data model: EDRN’s computing infrastructure can be replicated White River Computing / UNC Chapel Hill EDRN Knowledge Environment Technology White River Computing / UNC Chapel Hill Bioinformatics – Goals Supporting science-driven research needs: SHARE White River Computing / UNC Chapel Hill Bioinformatics – Goals Supporting science-driven research needs: SHARE Geisel School of Medicine at Dartmouth / UNC Chapel Hill Supporting science-driven research needs: PCORI PPRN Geisel School of Medicine at Dartmouth / UNC Chapel Hill Opportunity to offer our architecture to PCORnet? Synergy in data model Query across CCFA PPRN network …network of networks? Geisel School of Medicine at Dartmouth / UNC Chapel Hill