Data Mining & Cyberinfrastructures in Biomedical Informatics Ryan McGivern CSE5095

CSE
300
Data Mining & Cyberinfrastructures in
Biomedical Informatics
Ryan McGivern
CSE5095
May 1, 2011
Data Mining and Cyberinfrastructures in Biomedical Informatics - 1
Main Concepts
- Data Mining
  - Knowledge Discovery
- Cyberinfrastructures
  - Collaborative Research

Nature of Biomedical Data
- Health care is more than numbers and readings
  - Can’t replace the subjective sense of disease severity that a physician has in moments
- Capture data in a way that best captures observation
  - Data representation
  - Precision

Review
- Medical datum
  - Any single observation of a patient
- Knowledge
  - Derived through formal/informal analysis of data
- Information
  - Combine knowledge with data for new information
  - Heuristics and research models
- BMI Data-Knowledge Spectrum
  - What information constitutes the substance of medicine

Nature of Biomedical Data
- Knowledge at one level of abstraction might be considered data at another
- A medical database is a collection of individual patient observations
  - An EHR is in some sense simply a database
- Using historical patient data from the EHR system can facilitate the deduction of new knowledge related to health care strategies

Nature of Biomedical Data
- Humans can intuitively decompose information from a unitary view of data
  - But nothing is intuitive to computational systems
- Example
  - Clinical setting
    - BP of 120/80 may suffice to indicate a normal reading
  - Analytical setting
    - Systolic BP = 120 mm Hg
    - Diastolic BP = 80 mm Hg

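The blood pressure example can be sketched in code. This is a minimal illustration, assuming the clinical shorthand always arrives as "systolic/diastolic"; the class and function names are hypothetical and not part of the original slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BloodPressure:
    """Analytical representation: each component is its own typed field."""
    systolic_mm_hg: int
    diastolic_mm_hg: int


def parse_blood_pressure(reading: str) -> BloodPressure:
    """Split a clinical shorthand such as '120/80' into two typed fields."""
    systolic_text, diastolic_text = reading.strip().split("/")
    return BloodPressure(int(systolic_text), int(diastolic_text))


bp = parse_blood_pressure("120/80")
print(bp.systolic_mm_hg, bp.diastolic_mm_hg)
```

Once the reading is split, each component can be queried or aggregated on its own, which the unitary "120/80" string does not allow.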
Nature of Biomedical Data
- Data mining in health is mainly related to Clinical Research Support
  - Clinical Data Repositories (CDRs)
- New knowledge learned through aggregated information from a large number of patients
  - Can be facilitated by EHRs
- Unfortunately
  - CDRs generally limited to administrative data sources
  - Rarely store patient charts

Nature of Biomedical Data
- CDRs support Clinical Research Studies
  - Retrospective studies
    - Investigate a hypothesis that was not a subject of the study at the time the data were collected
  - Prospective studies
    - Clinical hypothesis known in advance
    - Research protocol designed to collect future data

Nature of Biomedical Data
- Knowledge base
  - Facts
  - Heuristics
  - Complex models
  - Semantic linking
    - Conduct case-based problem solving
- Medical data is intrinsically heterogeneous
  - Illusory to conceive of a ‘complete medical dataset’
  - Data is selective, based on treatment

Data Mining in BMI
- Data mining
  - Knowledge discovery technique
  - Sophisticated statistical methods
  - Identify trend patterns hidden amongst the sheer size of the dataset
- Data warehouse
  - Multiple heterogeneous data sources
  - Organized under a unified schema
  - Single site
  - Facilitate management and decision making

Data Mining in BMI
- CDR is essentially a data warehouse
- Architecture consists of four tiers
  - External data sources
    - Operational databases, flat files, etc.
  - Data storage layer
    - Unified schema, metadata, data marts
  - OLAP Layer
    - Data mining engine
  - Presentation layer
    - GUI
    - Usually web-based

Data Mining in BMI
Figure: Clinical Data Repository

Data Mining in BMI
- Data integration mechanism
  - Extraction
  - Transformation
  - Refresh
  - Scrubbing
- Data marts
  - Subsets of data tailored to a user group
  - Cache resultant datasets

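The integration steps and the data mart idea can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two sources, their field names, the plausibility range of 50 to 300, and the cutoff of 140 are all hypothetical, and the refresh step (re-running the pipeline as sources change) is not shown.

```python
# Hypothetical mapping from the unified schema to source-specific field names.
FIELD_ALIASES = {
    "patient_id": ("patient_id", "pid"),
    "systolic_mm_hg": ("systolic_mm_hg", "sbp"),
}


def extract(*sources):
    """Extraction: pull raw rows from every source in turn."""
    for source in sources:
        yield from source


def transform(row):
    """Transformation: rename source-specific fields to the unified schema."""
    unified = {}
    for target, aliases in FIELD_ALIASES.items():
        unified[target] = next((row[a] for a in aliases if a in row), None)
    return unified


def scrub(rows):
    """Scrubbing: coerce types and drop rows that are missing or implausible."""
    for row in rows:
        try:
            systolic = int(row["systolic_mm_hg"])
        except (TypeError, ValueError):
            continue
        if row["patient_id"] is None or not 50 <= systolic <= 300:
            continue
        yield {"patient_id": str(row["patient_id"]), "systolic_mm_hg": systolic}


def build_data_mart(warehouse, predicate):
    """Data mart: a cached subset of the warehouse for one user group."""
    return [row for row in warehouse if predicate(row)]


clinic_rows = [{"pid": 1, "sbp": "120"}, {"pid": 2, "sbp": "n/a"}]
lab_rows = [{"patient_id": "3", "systolic_mm_hg": 150}]

warehouse = list(scrub(transform(r) for r in extract(clinic_rows, lab_rows)))
elevated_mart = build_data_mart(warehouse, lambda r: r["systolic_mm_hg"] >= 140)
print(warehouse)
print(elevated_mart)
```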
Data Mining in BMI
- Data integration
  - Heterogeneous data under a unified schema
  - Ontologies
    - Link primary data expressions to structured vocabularies
    - Data now available to search and algorithmic processing at different levels of abstraction
- Clinical domain
  - Notorious for overwhelming presence of natural language text
  - Natural language processing

Data Mining in BMI
- Data integration
  - Cancer Biomedical Informatics Grid (caBIG)
    - Seeks to integrate all cancer research data
    - Standardize the way by which data is acquired, formatted, processed, and stored
      - Whole data ‘life cycle’
  - Translational research
    - No common architecture among vocabularies
    - Therefore difficult to consolidate terms into a single system

Data Mining in BMI
- Communication
  - HL7
    - Communication standard for exchange of all information relevant to health care
    - Focuses on meta-level of data integration within a clinical setting

Data Mining in BMI
- Online Analytical Processing (OLAP) Layer
  - Formats aggregated data in a multidimensional way
  - Evaluated and visualized at presentation layer
  - User specifies summary technique
- Data Cube
  - Roll-up and drill-down operations
  - Control abstraction level for each data dimension

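Roll-up and drill-down on a data cube can be illustrated with a small aggregation function. This is a sketch, not an OLAP engine; the fact rows, the dimension names, and the admissions measure are invented for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, ward, admissions)
FACTS = [
    (2010, 1, "cardiology", 12),
    (2010, 2, "cardiology", 9),
    (2010, 1, "oncology", 7),
    (2011, 1, "oncology", 5),
]
DIMENSIONS = ("year", "month", "ward")


def aggregate(facts, keep):
    """Sum admissions grouped by the dimensions named in `keep`."""
    indexes = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[i] for i in indexes)
        totals[key] += fact[-1]
    return dict(totals)


# Roll-up: fewer dimensions, coarser summary.
by_year = aggregate(FACTS, ["year"])
# Drill-down: add a dimension back to see finer detail.
by_year_month = aggregate(FACTS, ["year", "month"])
print(by_year)
print(by_year_month)
```

Choosing which dimensions to keep is what controls the abstraction level; the summary technique here is a sum, but a count or mean would slot in the same way.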
Data Mining in BMI
- Data mining techniques
  - Descriptive methods
    - Mine for relationships among attribute types with as few variables as possible
  - Predictive methods
    - Iterate through attributes and classify data into predefined classes
    - Identify similar classes
  - Other related methods
    - Neural Networks
    - Machine Learning
- Each provides a way of recognizing data patterns

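A small sketch may help contrast the two families of techniques. It assumes association-rule measures (support and confidence) as the descriptive example and a one-nearest-neighbour classifier as the predictive example; both are swapped-in illustrations, and the patient observations and class labels are made up.

```python
def support(transactions, items):
    """Descriptive: fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= record for record in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Descriptive: how often the consequent appears when the antecedent does."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def nearest_neighbor_class(labelled, point):
    """Predictive: assign the predefined class of the closest labelled example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labelled, key=lambda example: squared_distance(example[0], point))[1]


# Hypothetical patient observations, one set per patient.
records = [
    {"hypertension", "diabetes"},
    {"hypertension", "diabetes", "obesity"},
    {"hypertension"},
    {"obesity"},
]
print(support(records, {"hypertension", "diabetes"}))
print(confidence(records, {"hypertension"}, {"diabetes"}))

# Hypothetical labelled (systolic, diastolic) readings.
labelled = [((120, 80), "normal"), ((160, 100), "elevated")]
print(nearest_neighbor_class(labelled, (150, 95)))
```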
Data Mining in BMI
- UWV & VCU (2006)
  - Data mining research
  - 667,000 digital records
- Duke University (1997)
  - Perinatal outcomes
  - 45,922 patient records
- Out-patient & in-patient
- De-identified
- HealthMiner® (IBM)
  - CliniMiner®
    - Association analysis
  - THOTH
    - Predictive analysis
- 215,626 encounters
- 3,898,887 lab results
- 217,453 procedures
- 3,016,313 physical findings
- SQL Queries
  - Average time 3 minutes
  - Longest time 12 minutes
    - 4 million records

Data Mining in BMI
- Challenges in mining biomedical data
  - Non-hypothesis driven approaches
    - Combinatorial explosion
    - Degree of non-reducibility
      - Minimize with sophisticated heuristics
  - High dimensionality
    - Sparse complex relationships
      - Spread thinly across many dimensions
  - Hypotheses
    - Limit inherent bias in traditional clinical data analysis

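The combinatorial explosion can be made concrete with a little arithmetic. The sketch below counts how many attribute combinations of size one to three a search without a guiding hypothesis could test; the attribute counts and the size limit are arbitrary illustrative choices.

```python
from math import comb


def candidate_combinations(attribute_count, max_size):
    """Count attribute subsets of size 1..max_size that a blind search could test."""
    return sum(comb(attribute_count, size) for size in range(1, max_size + 1))


for n in (10, 100, 1000):
    print(n, candidate_combinations(n, 3))
```

The count grows roughly with the cube of the number of attributes even at this small subset size, which is why heuristics are needed to prune the search.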
Data Mining in BMI
- Challenges in warehousing biomedical data
  - IT infrastructure for CDRs
    - Established for clinical trials but separated from EHR systems
  - Data integration
    - Map clinical terminologies to clinical research standards
  - Pseudonymization
    - De-identification is a ‘must’ when EHR leaves the realm of primary health care

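One way pseudonymization might look in code is a keyed hash of the patient identifier combined with dropping direct identifiers. This is a sketch under assumptions: HMAC-SHA256 is a swapped-in technique, the record fields and the list of direct identifiers are hypothetical, and nothing here is a claim of regulatory compliance.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash so records stay linkable."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def de_identify(record, secret_key, direct_identifiers=("name", "address")):
    """Drop direct identifiers and swap the patient id for its pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["patient_id"] = pseudonymize(str(record["patient_id"]), secret_key)
    return cleaned


key = b"example-key-kept-by-the-data-custodian"  # hypothetical secret
record = {"patient_id": 1001, "name": "Jane Doe", "address": "1 Main St", "systolic": 120}
print(de_identify(record, key))
```

The same patient always maps to the same pseudonym under one key, so longitudinal analysis still works, while anyone without the key cannot recompute the mapping from a known identifier.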
Data Mining in BMI
- General road blocks
  - Data sharing
    - Researchers are protective of their data
  - Language/vocabulary changes
    - Due to required detail
    - Bedside vs. laboratory
  - Transdisciplinary research leads to competing standards

Data Mining in BMI
- Advantages of mining biomedical data
  - New health management strategies
    - Relationships among patient observations
    - Understanding of disease progression
  - Undetected drug events
    - Prevalence through larger sample populations
  - Clinical trial cohort selection
    - Identify patient types that will best prove a given hypothesis

Cyberinfrastructures in BMI
- Motivations
  - Computer systems are now essential to research
    - Development of complex modeling tools
    - But generally only available to a handful of clinical researchers
  - Integration of data from different disciplines
    - Can require specialized training in mathematics, statistics, and software
    - Ideally want to provide a layer of abstraction that can make this integration transparent to the researcher

Cyberinfrastructures in BMI
- Mission
  - Develop a geographically distributed virtual research community that facilitates
    - Data sharing
      - Data warehousing
    - Computational resource sharing
      - Distributed grid computing
    - Collaboration
      - Research management
      - Research protocol sharing

Cyberinfrastructures in BMI
- Components of a cyberinfrastructure
  - Data infrastructure
    - Series of interconnected repositories
  - Computational infrastructure
    - Registered resource sharing
  - Communication infrastructure
    - Communication amongst architectures
  - Human infrastructure
    - Facilitate communication and collaboration between registered researchers

Cyberinfrastructures in BMI
- Data infrastructure
  - Network of databases
  - Facilitates remote storage, integration, and retrieval of data
  - Databases browsed by web-based front-ends
  - Can be extended to cater to
    - Automatic acquisition
    - Direct submission
  - Allows for pulling of data into local repositories
    - For private or semi-private analyses

Cyberinfrastructures in BMI
- Computational infrastructure
  - Shared access to hardware and software
    - Intensive computation needed for sophisticated analyses
      - e.g. image analysis software
  - Essentially a computing grid
    - Systems separated geographically but clustered over the web
    - Provides a virtual consolidated supercomputing node
    - If a system is idle locally, it is offered as a resource to outsiders

Cyberinfrastructures in BMI
- Communication infrastructure
  - At the low level
    - Require connectivity and acceptable bandwidth between
      - Repositories
      - Computational resources
      - Researcher
  - At the high level
    - Responsible for maintaining syntactic and semantic harmony throughout data

Cyberinfrastructures in BMI
- Communication infrastructure continued
  - Syntax and Semantics
    - Suppose analysis involves data from different repositories
    - Syntactic connectivity established through a common format for data organization
    - Semantic connectivity maintains data interoperability by ensuring concepts captured by the data share a common terminology
      - Usually implemented using an ontology

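The two kinds of connectivity can be sketched together. The example assumes two repositories with different field names (the syntactic problem) and different words for the same diagnosis (the semantic problem); the site names, field maps, and concept identifiers are hypothetical, and a plain dictionary stands in for a real ontology.

```python
# Syntactic layer: map each site's field names onto one common record format.
FIELD_MAPS = {
    "site_a": {"patient_id": "pid", "diagnosis": "dx"},
    "site_b": {"patient_id": "PatientID", "diagnosis": "Diagnosis"},
}

# Semantic layer: map local terms onto one shared concept (stand-in for an ontology).
CONCEPTS = {
    "heart attack": "concept:myocardial_infarction",
    "mi": "concept:myocardial_infarction",
    "myocardial infarction": "concept:myocardial_infarction",
}


def harmonize(site, record):
    """Return the record in the common format with a shared diagnosis concept."""
    field_map = FIELD_MAPS[site]
    common = {name: record[local] for name, local in field_map.items()}
    term = common["diagnosis"].strip().lower()
    common["diagnosis"] = CONCEPTS.get(term, f"unmapped:{term}")
    return common


a = harmonize("site_a", {"pid": "17", "dx": "Heart attack"})
b = harmonize("site_b", {"PatientID": "42", "Diagnosis": "MI"})
print(a)
print(b)
```

After harmonization both records carry the same concept identifier, so an analysis that spans the two repositories can treat them as the same diagnosis.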
Cyberinfrastructures in BMI
- Human infrastructure
  - Ultimately, must facilitate the sociology of science
  - Everyone curates communal data sets
  - Encourage the sharing of
    - Protocols
    - Analysis algorithms
    - Data sets
  - Similar to CICATS at UConn
    - Research toolkit

Cyberinfrastructures in BMI
- Human infrastructure continued
  - Ideally researcher should be able to design experiment at a high level
    - Describe datasets, relationships, etc.
    - Generally high level description language
      - Workflow language
  - Infrastructure then manages data retrieval, analysis, and transformation
  - Constructs an environment where researchers can get an in-depth result from a high level description

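A toy interpreter can show what designing an experiment at a high level might mean. The sketch assumes a workflow is a list of declarative steps that the infrastructure executes in order; the step vocabulary ("load", "filter_at_least", "mean"), the dataset name, and the values are all hypothetical and do not correspond to any particular workflow language.

```python
# Hypothetical repository content that the infrastructure can retrieve by name.
DATASETS = {"bp_readings": [118, 135, 150, 122]}


def run_workflow(workflow, datasets):
    """Execute each declarative step in order, passing the value along."""
    value = None
    for step in workflow:
        op = step["op"]
        if op == "load":
            value = list(datasets[step["dataset"]])       # data retrieval
        elif op == "filter_at_least":
            value = [v for v in value if v >= step["threshold"]]  # transformation
        elif op == "mean":
            value = sum(value) / len(value)               # analysis
        else:
            raise ValueError(f"unknown operation: {op}")
    return value


# The researcher writes only this description, not the retrieval or analysis code.
workflow = [
    {"op": "load", "dataset": "bp_readings"},
    {"op": "filter_at_least", "threshold": 130},
    {"op": "mean"},
]
result = run_workflow(workflow, DATASETS)
print(result)
```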
Cyberinfrastructures in BMI
- There are many existing cyberinfrastructures
  - Don’t necessarily implement all components
- Most common form is an online database
  - GenBank
  - EMBL
    - European Molecular Biology Lab
  - UniProt
    - Protein database
  - PDB
    - Protein data bank

Cyberinfrastructures in BMI
- Online databases continued
  - But these lack the components to facilitate
    - Collaboration
    - Interdisciplinary research
  - Use centralized resources and are generally managed by the owning research group
  - Data centric
    - Most of the computational architecture is dedicated solely to data access

Cyberinfrastructures in BMI
- Community Annotation Hubs
  - Open up a centralized database to direct contribution from the research community
  - SDSU Gene Wiki
    - For the community annotation of gene function
  - BMI Wikis have been recognized as some of the most sophisticated document repositories
    - Despite BMI being a relatively recent umbrella discipline
  - Still not a complete research environment
    - Could be ‘plugged in’ to the human infrastructure of a complete cyberinfrastructure

Cyberinfrastructures in BMI
- Data sharing
  - Still difficult to share data on disparate information classes
    - Even if they are related through a subset of attribute types
  - Further difficulty of interconnecting similar repositories written by different research groups
    - Differing technologies
    - Differing data representations
  - Recurring difficulty in integrating data
    - Medical data is inherently heterogeneous
      - Massive number of data types involved
      - Data is captured differently, because it’s used differently

Cyberinfrastructures in BMI
- Data sharing challenge
  - As Dr. Kevin Sullivan said in the discussion
    - It is difficult for an institution to share their data
    - It can be difficult to argue a business case to do so
      - Institutions may not want people evaluating their treatments or incorrect treatments
      - Research groups get a sense of proprietary ownership over their data
      - Some institutions feel it is not theirs to share
      - Others are skeptical as to how the community would react to their health care provider exposing information to outsiders

Cyberinfrastructures in BMI
- One interoperability solution is web services
  - Provide common technology for heterogeneous data and services to interoperate
  - Common implementations consist of
    - Web Services Description Language (WSDL)
      - Describes capabilities of services
    - Simple Object Access Protocol (SOAP)
  - Researchers never use services directly
    - But rely on the analysis and visualization engines that run on top of these

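To give a feel for what travels over the wire, the sketch below builds and then reads back a SOAP 1.1 style envelope using only the Python standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:cdr`, the operation `GetPatientCount`, and its parameter are hypothetical. A real client would normally be generated from the service's WSDL rather than written by hand.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:cdr"                         # hypothetical service namespace


def build_request(operation, params):
    """Wrap an operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def read_request(xml_text):
    """Unpack the operation name and parameters from a SOAP envelope."""
    envelope = ET.fromstring(xml_text)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


message = build_request("GetPatientCount", {"diagnosis": "hypertension"})
print(message)
print(read_request(message))
```

This packing and unpacking is the plumbing that analysis and visualization engines hide from the researcher.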
Cyberinfrastructures in BMI
- Globus
  - Open source libraries
  - Industry heavyweight in web services for many domains
  - Provides mechanisms for
    - Announcing the availability of a computer resource
    - Discovering the resource
    - Invoking the resource
  - Used by BIRN and caBIG
- BioMOBY
  - Similar to Globus but relatively lightweight
  - Used by PlaNet Consortium

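The announce, discover, and invoke cycle can be illustrated with a toy in-memory registry. This is not the Globus or BioMOBY API; every class, method, and resource name below is hypothetical, and a real grid would perform these steps over the network with authentication.

```python
class ResourceRegistry:
    """Toy registry that mimics the announce / discover / invoke cycle."""

    def __init__(self):
        self._resources = {}

    def announce(self, name, capability, handler):
        """A site announces that a resource with some capability is available."""
        self._resources[name] = (capability, handler)

    def withdraw(self, name):
        """A site removes its resource, for example when it is busy locally."""
        self._resources.pop(name, None)

    def discover(self, capability):
        """A client looks up the names of resources offering a capability."""
        return sorted(n for n, (cap, _) in self._resources.items() if cap == capability)

    def invoke(self, name, *args):
        """A client runs the resource it discovered."""
        _, handler = self._resources[name]
        return handler(*args)


registry = ResourceRegistry()
registry.announce("lab-a-node", "mean", lambda values: sum(values) / len(values))
registry.announce("lab-b-node", "max", max)

available = registry.discover("mean")
result = registry.invoke(available[0], [120, 130, 140])
print(available, result)
```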
Cyberinfrastructures in BMI
- Ontologies
  - Web services allow heterogeneous data sources and services to exchange information
    - But this does not enforce data semantics
  - Ontologies are used to ensure an unambiguous standard for data
  - Most BMI cyberinfrastructures specify ontologies using OWL

Cyberinfrastructures in BMI
- Biomedical Informatics Research Network (BIRN)
  - Developed a robust software installation & deployment system to implement a BIRN endpoint
    - Host data and contribute computational resources
    - Access shared datasets through web portal
    - Analysis and visualization tools
    - Publish datasets through BIRN data repository
    - Roughly $20k for a BIRN rack
  - Technologies
    - Globus: grid management
    - BIRNLex: ontology

Cyberinfrastructures in BMI
- Cancer Biomedical Informatics Grid
  - Launched 2003
  - Mission
    - Provide a common information platform to support the diverse clinical and basic research of the US National Cancer Institute
      - 87 cancer institutes at the time
  - Highly heterogeneous datasets

Cyberinfrastructures in BMI
- Future of BMI cyberinfrastructures
  - Use of cyberinfrastructure is growing rapidly
  - Grid computing is increasingly efficient
  - Current weaknesses related to cross-discipline collaboration
    - Each implements an internally consistent grid, but isolated from each other
    - We need integration and communication among disciplines to investigate further relationships
    - Data interoperability may be resolved by semantic web
  - Current research in cyberinfrastructures is related to using the semantic web concept

Cyberinfrastructures in BMI
- Semantic web in cyberinfrastructures
  - Web services make strong distinction between data and data operations
    - User identifies service to invoke
    - Formats the input data
    - Invokes the service
    - Unpacks and interprets the results
  - Semantic web is a technology tolerant of diverse data models
    - No data transformation services
    - Just pieces of information and relationships between them

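The idea of pieces of information and the relationships between them can be sketched as subject, predicate, object triples. The example is a toy in-memory store, not a semantic web toolkit; the identifiers and predicate names are invented for illustration.

```python
# Each fact is one (subject, predicate, object) triple.
TRIPLES = {
    ("patient:1", "hasDiagnosis", "hypertension"),
    ("patient:2", "hasDiagnosis", "asthma"),
    ("hypertension", "isA", "cardiovascular_disease"),
    ("asthma", "isA", "respiratory_disease"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def patients_with_disease_class(triples, disease_class):
    """Follow two relationships: patient -> diagnosis -> disease class."""
    diseases = {s for s, _, _ in match(triples, predicate="isA", obj=disease_class)}
    return sorted(
        s for s, _, o in match(triples, predicate="hasDiagnosis") if o in diseases
    )


print(match(TRIPLES, subject="patient:1"))
print(patients_with_disease_class(TRIPLES, "cardiovascular_disease"))
```

New kinds of facts are added as more triples, with no change to a schema and no transformation service in between, which is the tolerance of diverse data models that the slide describes.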