Emerging Technologies Project Plan Template

1. Project Title: Impact of Emerging “Big Data” Sources on Clinical Research and Review

2. Project Scope:
Identify the impact of emerging data sources in clinical research. Impact includes:
- How/where this data is typically collected and stored
- Typical data volumes and rate of data collection
- Use cases for the access, analysis, viewing, and review of these data types
- Mechanisms by which this data might be viewed or analyzed “in place” instead of physically moving or copying the entire data set
The overall goal of this effort is to promote awareness of the potential changes needed to cope with these emerging data sources.

3. Project Team Members:
- Interim industry lead: Louis Norton, Covance
- Suresh Madhavan, PointCross (+individual to be assigned)
- Hugo Geerts
- Marcelina Hungria
- More members strongly encouraged
Note: The project would prefer to have an FDA co-lead named.

4. Affected Stakeholders:
- Data providers (e.g. labs, hospitals)
- CRO(s)
- Sponsor scientists & regulatory submissions groups
- Reviewers / Regulatory agencies
- System / Software vendors

5. Project Meeting Frequency: Every 2 weeks.

6. Project Objectives and Timeline:
- Collect relevant data sources and use cases. Identify complementary or similar efforts for re-use of information already available. (3 months)
- Prioritize and do more detailed analysis on 3-4 relevant cases (ones for which information is available), including consumption patterns (analysis, viewing). (5 months)
- Develop/document findings and recommendations for solution approach / architecture. (3 months)

Assessing Impact of Emerging “Big Data” Sources on Clinical Research and Review

Background

A new working group focused on “Emerging Technologies” was established at the PhUSE/FDA meeting on March 18th and 19th, 2013, in Silver Spring, MD.
One of the subgroups in this group took up the task of assessing the impact of emerging “Big Data” sources, which are expected to become prevalent in clinical research, on both researchers and reviewers at regulatory agencies such as the FDA. For the purposes of this working group, “Emerging” covers proven technologies that are already being adopted and are expected to become widely adopted in the mid-term future. In that spirit, we plan to use this forum to look ahead and forecast, as best we can, the emerging sources that will generate very large data sets, amounting in aggregate to petabytes or even exabytes.

We will consider how these “big data” will need to be:
- Stored at source
- Used at source
- Exchanged or made available to other stakeholders
- Used at the destination
- Included in part or whole within regulatory submissions
- Archived, reposed, or stored for future retrospective look-backs
- Secured from source to destination and along the way at various stores

While contemplating each of these aspects we will necessarily have to touch upon or identify, but not resolve:
- Technology issues related to storage and access
- Nature of the data and nature of access
- Nature of applications that will access these data sets
- Constraints on transmission of data among stakeholders
- Meaning or relevance of “validation” of computer systems in the context of 21 CFR Part 11
- Validation of data and how traditional rules must evolve
- The role of semantically enabled analysis and ontology management (touched on only at the surface, since another sub-group is assigned this task)
- Nature and role of enriched metadata in the world of MapReduce and specialized “apps”
- New role of validating the functionality of specialized apps when metadata is the only data that can be moved
- How “data in motion” may play a different role in areas such as pharmacovigilance, as well as in new clinical trial arms where constant monitoring on the cloud may be important
- Role of the cloud for certain types of data (not covered in much detail, since this is already addressed by another sub-group)

Separately, we will consider the following aspects for each of the big data sources:
- Source of the big data and its type, in terms of how it is represented elementally or in aggregation
- How and how frequently it is collected, and the expected triggers for collection (protocol, AE)
- Nature of the data:
  o level of structure
  o associated metadata and its essential structure
  o its inherent consistency, completeness, and quality at source
  o its usability, whether direct or only after post-processing
- Likely use in real-time by which stakeholders
- Likely use in quasi real-time by which stakeholders
- Likely use in research by which stakeholders
- Likely use for reporting by which stakeholders
- Likely use by reviewers
- Size of data at each collection event
- Expected size over a trial or investigation
- Expected size over a period of time by a large to medium scale operation
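
The three size aspects listed last (per collection event, per trial, and per operation over time) are related by simple multiplication, which may help when comparing sources. The following is a minimal Python sketch of that arithmetic, not a method defined by the working group. The `SourceProfile` fields, the helper names, and every number in the example are hypothetical assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    """Hypothetical sizing profile for one data source (all fields are assumptions)."""
    name: str
    bytes_per_event: int               # size of data at each collection event
    events_per_subject_per_day: float  # collection frequency
    subjects_per_trial: int
    trial_days: int


def trial_bytes(p: SourceProfile) -> int:
    """Expected size over one trial: event size x frequency x subjects x duration."""
    return int(p.bytes_per_event * p.events_per_subject_per_day
               * p.subjects_per_trial * p.trial_days)


def operation_bytes(p: SourceProfile, trials_per_year: int, years: int) -> int:
    """Expected size accumulated by an operation running many such trials."""
    return trial_bytes(p) * trials_per_year * years


def human(n: int) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    size = float(n)
    for unit in units:
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from any real source.
    sensor = SourceProfile("hypothetical wearable sensor", 2_000_000, 24, 500, 180)
    print(human(sensor.bytes_per_event))            # per collection event
    print(human(trial_bytes(sensor)))               # per trial
    print(human(operation_bytes(sensor, 20, 5)))    # per operation over 5 years
```

Filling in one profile per candidate source would give a rough, comparable estimate of which sources approach the petabyte range; real estimates would need measured event sizes and collection rates.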