CSE 300
Data Mining & Cyberinfrastructures in Biomedical Informatics
Ryan McGivern
CSE5095
May 1, 2011
Data Mining and Cyberinfrastructures in Biomedical Informatics - 1

Main Concepts
  Data Mining
    Knowledge Discovery
  Cyberinfrastructures
    Collaborative Research

Nature of Biomedical Data
  Health care is more than numbers and readings
    Can’t replace the subjective sense of disease severity that a physician has in moments
  Capture data in a way that best captures observation
    Data representation
    Precision

Review
  Medical datum
    Any single observation of a patient
  Knowledge
    Derived through formal/informal analysis of data
  Information
    Combine knowledge with data for new information
    Heuristics and research models
  BMI Data-Knowledge Spectrum
    What information constitutes the substance of medicine

Nature of Biomedical Data
  Knowledge at one level of abstraction might be considered data at another
  Medical database is a collection of individual patient observations
  EHR is in some sense simply a database
    Using historical patient data from the EHR system can facilitate the deduction of new knowledge related to health care strategies

Nature of Biomedical Data
  Humans can intuitively decompose information from unitary view of data
    But nothing is intuitive to computational systems
  Example
    Clinical setting
      BP of 120/80 may suffice to indicate a normal reading
    Analytical setting
      Systolic BP = 120 mm Hg
      Diastolic BP = 80 mm Hg

Nature of Biomedical Data
  Data mining in health is mainly related to
    Clinical Research Support
    Clinical Data Repositories (CDRs)
      New knowledge learned through aggregated info from a large number
        of patients
      Can be facilitated by EHRs
      Unfortunately CDRs generally limited to admin data sources
        Rarely store patient charts

Nature of Biomedical Data
  CDRs support Clinical Research Studies
    Retrospective studies
      Investigate a hypothesis that was not a subject of the study at the time the data were collected
    Prospective studies
      Clinical hypothesis known in advance
      Research protocol designed to collect future data

Nature of Biomedical Data
  Knowledge base
    Facts
    Heuristics
    Complex models
  Semantic linking
    Conduct case based problem solving
  Medical data is intrinsically heterogeneous
    Illusory to conceive ‘complete medical dataset’
    Data selective based on treatment

Data Mining in BMI
  Data mining
    Knowledge discovery technique
    Sophisticated statistical methods
    Identify trend patterns hidden amongst the sheer size of the dataset
  Data warehouse
    Multiple heterogeneous data sources
    Organized under a unified schema
    Single site
    Facilitate management and decision making

Data Mining in BMI
  CDR is essentially a data warehouse
  Architecture consists of four tiers
    External data sources
      Operational databases, flat files, etc.
    Data storage layer
      Unified schema, metadata, data marts
    OLAP layer
      Data mining engine
    Presentation layer
      GUI
      Usually web-based

Data Mining in BMI
  Figure: Clinical Data Repository

Data Mining in BMI
  Data integration mechanism
    Extraction
    Transformation
    Refresh
    Scrubbing
  Data marts
    Subsets of data tailored to a user group
    Cache resultant datasets

Data Mining in BMI
  Data integration
    Heterogeneous data under a unified schema
  Ontologies
    Link primary data expressions to structured vocabularies
    Data now available to search and algorithmic processing at different levels of abstraction
  Clinical domain
    Notorious for overwhelming presence of natural language text
    Natural language processing

Data Mining in BMI
  Data integration
    Cancer Biomedical Informatics Grid (caBIG)
      Seeks to integrate all cancer research data
      Standardize the way by which data is acquired, formatted, processed, and stored
        – Whole data ‘life cycle’
    Translational research
      No common architecture among vocabularies
      Therefore difficult to consolidate terms into a single system

Data Mining in BMI
  Communication
    HL7
      Communication standard for exchange of all information relevant to health care
      Focuses on meta-level of data integration within a clinical setting

Data Mining in BMI
  Online Analytical Processing (OLAP) layer
    Formats aggregated data in multidimensional way
    Evaluated and visualized at presentation layer
    User specifies summary technique
  Data cube
    Roll-up and drill-down operations
    Control abstraction level for each data dimension
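
The roll-up and drill-down operations on a data cube can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular CDR or OLAP engine is implemented: the fact rows, the dimension names, and the aggregate() helper are all hypothetical. Rolling up means grouping by fewer dimensions, drilling down means grouping by more, and the "summary technique" the user specifies is passed in as a function.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, department, diagnosis, encounter_count).
# All values are made up for illustration.
DIMENSIONS = ("year", "month", "department", "diagnosis")
FACTS = [
    (2010, 1, "cardiology", "hypertension", 12),
    (2010, 1, "cardiology", "arrhythmia", 5),
    (2010, 2, "cardiology", "hypertension", 9),
    (2010, 2, "oncology", "lymphoma", 4),
    (2011, 1, "oncology", "lymphoma", 7),
]


def aggregate(facts, keep, summarize=sum):
    """Group facts by the dimensions named in `keep` and summarize the measure."""
    idx = [DIMENSIONS.index(name) for name in keep]
    groups = defaultdict(list)
    for row in facts:
        key = tuple(row[i] for i in idx)
        groups[key].append(row[-1])  # last column is the measure
    return {key: summarize(values) for key, values in groups.items()}


# Roll-up: view the cube at a coarse level of abstraction (year only).
by_year = aggregate(FACTS, ["year"])

# Drill-down: add the month dimension back to see finer detail.
by_year_month = aggregate(FACTS, ["year", "month"])

# A different user-specified summary technique on another dimension.
by_department_max = aggregate(FACTS, ["department"], summarize=max)

print(by_year)
print(by_year_month)
print(by_department_max)
```

Choosing which dimensions to keep is what controls the abstraction level for each data dimension; the underlying fact rows never change.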

Data Mining in BMI
  Data mining techniques
    Descriptive methods
      Mine for relationships among attribute types with as few variables as possible
    Predictive methods
      Iterate through attributes and classify data into predefined classes
      Identify similar classes
    Other related methods
      Neural Networks
      Machine Learning
    Each provides a way of recognizing data patterns

Data Mining in BMI
  UWV & VCU (2006)
    Data mining research
    667,00 digital records
  Duke University (1997)
    Perinatal outcomes
    45,922 patient records
  Out-patient & in-patient
  De-identified
  HealthMiner® (IBM)
    CliniMiner®
      Association analysis
    THOTH
      Predictive analysis
  215,626 encounters
  3,898,887 lab results
  217,453 procedures
  3,016,313 physical findings
  SQL queries
    Average time 3 minutes
    Longest time 12 minutes
    4 million records

Data Mining in BMI
  Challenges in mining biomedical data
    Non-hypothesis driven approaches
      Combinatorial explosion
      Degree of non-reducibility
        – Minimize with sophisticated heuristics
    High dimensionality
      Sparse complex relationships
        – Spread thinly across many dimensions
    Hypotheses
      Limit inherent bias in traditional clinical data analysis

Data Mining in BMI
  Challenges in warehousing biomedical data
    IT infrastructure for CDRs
      Established for clinical trials but separated from EHR systems
    Data integration
      Map clinical terminologies to clinical research standards
    Pseudonymization
      De-identification is a ‘must’ when EHR leaves the realm of primary health care

Data Mining in BMI
  General road blocks
    Data sharing
      Researchers are protective of their data
    Language/vocabulary changes
      Due to required detail
      Bedside vs.
        laboratory
    Transdisciplinary research leads to competing standards

Data Mining in BMI
  Advantages of mining biomedical data
    New health management strategies
      Relationships among patient observations
      Understanding of disease progression
    Undetected drug events
      Prevalence through larger sample populations
    Clinical trial cohort selection
      Identify patient types that will best prove a given hypothesis

Cyberinfrastructures in BMI
  Motivations
    Computer systems are now more than essential to research
    Development of complex modeling tools
      But generally only available to a handful of clinical researchers
    Integration of data from different disciplines
      Can require specialized training in mathematics, statistics, and software
    Ideally want to provide a layer of abstraction that can make this integration transparent to the researcher

Cyberinfrastructures in BMI
  Mission
    Develop a geographically distributed virtual research community that facilitates
      Data sharing
        – Data warehousing
      Computational resource sharing
        – Distributed grid computing
      Collaboration
        – Research management
        – Research protocol sharing

Cyberinfrastructures in BMI
  Components of a cyberinfrastructure
    Data infrastructure
      Series of interconnected repositories
    Computational infrastructure
      Registered resource sharing
    Communication infrastructure
      Communication amongst architectures
    Human infrastructure
      Facilitate communication and collaboration between registered researchers

Cyberinfrastructures in BMI
  Data infrastructure
    Network of databases
      Facilitates remote storage, integration, and retrieval of data
    Databases browsed by web-based front-ends
    Can be extended to cater to
      Automatic acquisition
      Direct submission
    Allows for pulling of data into local repositories
      For private or semi-private analyses

Cyberinfrastructures in BMI
  Computational infrastructure
    Shared access to hardware and software
      Intensive computation needed for sophisticated analyses
        – e.g., image analysis software
    Essentially a computing grid
      Systems separated geographically but clustered over the web
      Provides a virtual consolidated supercomputing node
      If system is idle locally, it is raised as a resource for outsiders

Cyberinfrastructures in BMI
  Communication infrastructure
    At the low level
      Require connectivity and acceptable bandwidth between
        – Repositories
        – Computational resources
        – Researcher
    At the high level
      Responsible for maintaining syntactic and semantic harmony throughout data

Cyberinfrastructures in BMI
  Communication infrastructure continued
    Syntax and Semantics
      Suppose analysis involves data from different repositories
      Syntactic connectivity established through a common format for data organization
      Semantic connectivity maintains data interoperability by ensuring concepts captured by the data share a common terminology
        – Usually implemented using an ontology

Cyberinfrastructures in BMI
  Human infrastructure
    Ultimately, must facilitate the sociology of science
    Everyone curates communal data sets
    Encourage the sharing of
      Protocols
      Analysis algorithms
      Data sets
    Similar to CICATS at UConn
      Research toolkit

Cyberinfrastructures in BMI
  Human infrastructure
  (continued)
    Ideally researcher should be able to design experiment at a high level
      Describe datasets, relationships, etc.
      Generally high level description language
        – Workflow language
    Infrastructure then manages data retrieval, analysis, and transformation
    Constructs an environment where researchers can get an in-depth result from a high level description

Cyberinfrastructures in BMI
  There are many existing cyberinfrastructures
    Don’t necessarily implement all components
  Most common form is an online database
    GenBank
    EMBL
      European Molecular Biology Laboratory
    UniProt
      Protein database
    PDB
      Protein Data Bank

Cyberinfrastructures in BMI
  Online databases continued
    But, these lack the components to facilitate
      Collaboration
      Interdisciplinary research
    Use centralized resources and are generally managed by the owning research group
    Data centric
      Most of the computational architecture is dedicated solely to data access

Cyberinfrastructures in BMI
  Community Annotation Hubs
    Open up a centralized database to direct contribution from the research community
    SDSU Gene Wiki
      For the community annotation of gene function
    BMI Wikis have been recognized as some of the most sophisticated document repositories
      Despite being a relatively recent umbrella discipline
    Still not a complete research environment
      Could be ‘plugged in’ to the human infrastructure of a complete cyberinfrastructure

Cyberinfrastructures in BMI
  Data sharing
    Still difficult to share data on disparate information classes
      Even if they are related through a subset of attribute types
    Further difficulty of interconnecting similar repositories written by different research groups
      Differing technologies
      Differing data representations
    Recurring difficulty in integrating data
      Medical data is inherently heterogeneous
        – Massive amount of data types involved
        – Data is captured differently, because it’s used differently

Cyberinfrastructures in BMI
  Data sharing challenge
    As Dr. Kevin Sullivan said in the discussion
      It is difficult for an institution to share its data
      It can be difficult to argue a business case to do so
        – Institutions may not want people evaluating their treatments or incorrect treatments
        – Research groups get a sense of proprietary ownership over their data
        – Some institutions feel it is not theirs to share
        – Others are skeptical as to how the community would react to their health care provider exposing information to outsiders

Cyberinfrastructures in BMI
  One interoperability solution is web services
    Provide common technology for heterogeneous data and services to interoperate
    Common implementations consist of
      Web Services Description Language (WSDL)
        – Describes capabilities of services
      Simple Object Access Protocol (SOAP)
    Researchers never use services directly
      But rely on the analysis and visualization engines that run on top of these

Cyberinfrastructures in BMI
  Globus
    Open source libraries
    Industry heavyweight in web services for many domains
    Provides mechanisms for
      Announcing the availability of a computer resource
      Discovering the resource
      Invoking the resource
    Used by BIRN and caBIG
  BioMOBY
    Similar to Globus but relatively lightweight
    Used by PlaNet Consortium

Cyberinfrastructures in BMI
  Ontologies
    Web services allow heterogeneous data and services to exchange
      But this does not enforce data semantics
    Ontologies are used to ensure an unambiguous standard for data
    Most BMI
      cyberinfrastructures specify ontologies using OWL

Cyberinfrastructures in BMI
  Biomedical Informatics Research Network (BIRN)
    Developed a robust software installation & deployment system to implement a BIRN endpoint
      Host data and contribute computational resources
      Access shared datasets through web portal
      Analysis and visualization tools
      Publish datasets through BIRN data repository
      Roughly $20k for a BIRN rack
    Technologies
      Globus: grid management
      BIRNLex: ontology

Cyberinfrastructures in BMI
  Cancer Biomedical Informatics Grid
    Launched 2003
    Mission
      Provide a common information platform to support the diverse clinical and basic research of the US National Cancer Institute
        – 87 cancer institutes at the time
      Highly heterogeneous datasets

Cyberinfrastructures in BMI
  Future of BMI cyberinfrastructures
    Use of cyberinfrastructure is growing rapidly
      Grid computing is increasingly efficient
    Current weaknesses related to cross-discipline collaboration
      Each implements an internally consistent grid, but isolated from each other
      We need integration and communication among disciplines to investigate further relationships
    Data interoperability may be resolved by semantic web
      Current research in cyberinfrastructures is related to using the semantic web concept

Cyberinfrastructures in BMI
  Semantic web in cyberinfrastructures
    Web services make strong distinction between data and data operations
      User identifies service to invoke
      Formats the input data
      Invokes the service
      Unpacks and interprets the results
    Semantic web is a technology tolerant of diverse data models
      No data transformation services
      Just pieces of information and relationships between them
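
The "pieces of information and relationships between them" idea on the last slide can be sketched as a set of subject-predicate-object triples. This is a minimal sketch under stated assumptions, not an RDF or OWL implementation: the identifiers (patient:17, obs:1), the predicate names, and the match() and blood_pressure_values() helpers are all hypothetical, and plain Python tuples stand in for a real triple store. It reuses the blood pressure example from earlier in the deck.

```python
# Hypothetical triples of the form (subject, predicate, object).
# Identifiers and predicate names are invented for illustration.
TRIPLES = {
    ("patient:17", "hasObservation", "obs:1"),
    ("obs:1", "type", "SystolicBP"),
    ("obs:1", "valueMmHg", 120),
    ("patient:17", "hasObservation", "obs:2"),
    ("obs:2", "type", "DiastolicBP"),
    ("obs:2", "valueMmHg", 80),
    ("SystolicBP", "subClassOf", "BloodPressure"),
    ("DiastolicBP", "subClassOf", "BloodPressure"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        (
            t
            for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ),
        key=str,  # objects mix strings and numbers, so sort on their text form
    )


def blood_pressure_values(triples, patient):
    """Follow relationships: patient -> observation -> type -> BloodPressure."""
    results = {}
    for _, _, obs in match(triples, s=patient, p="hasObservation"):
        for _, _, kind in match(triples, s=obs, p="type"):
            if match(triples, s=kind, p="subClassOf", o="BloodPressure"):
                for _, _, value in match(triples, s=obs, p="valueMmHg"):
                    results[kind] = value
    return results


bp = blood_pressure_values(TRIPLES, "patient:17")
print(bp)
```

No service is identified or invoked and no input is reformatted; the query simply walks relationships, and a repository with a different data model could contribute further triples without a transformation step.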