
Towards Unified Biomedical Modeling with
Subgraph Mining and Factorization Algorithms
by
Yuan Luo
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2015
© Massachusetts Institute of Technology 2015. All rights reserved.
Author: [Signature redacted]
Department of Electrical Engineering and Computer Science
August 18, 2015
Certified by: [Signature redacted]
Peter Szolovits
Professor
Thesis Supervisor
Certified by: [Signature redacted]
Ozlem Uzuner
Associate Professor, State University of New York at Albany
Thesis Supervisor
Accepted by: [Signature redacted]
Leslie A. Kolodziejski
Chair, Department Committee on Graduate Theses
Towards Unified Biomedical Modeling with Subgraph Mining
and Factorization Algorithms
by
Yuan Luo
Submitted to the Department of Electrical Engineering and Computer Science
on August 18, 2015, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
This dissertation applies subgraph mining and factorization algorithms to clinical narrative text,
ICU physiologic time series and computational genomics. These algorithms aim to build clinical models that improve both prediction accuracy and interpretability, by exploring relational
information in different biomedical data modalities including clinical narratives, physiologic
time series and exonic mutations.
This dissertation focuses on three concrete applications: implicating neurodevelopmentally coregulated exon clusters in phenotypes of Autism Spectrum Disorder (ASD), predicting mortality
risk of ICU patients based on their physiologic measurement time series, and identifying subtypes of lymphoma patients based on pathology report text. In each application, we automatically
extract relational information into a graph representation and collect important subgraphs that are
of interest. Depending on the degree of structure in the data format, heavier machinery of factorization models becomes necessary to reliably group important subgraphs. We demonstrate that
these methods lead to not only improved performance but also better interpretability in each application.
Thesis Supervisor: Peter Szolovits
Title: Professor
Thesis Supervisor: Ozlem Uzuner
Title: Associate Professor, State University of New York at Albany
Acknowledgments
I have been fortunate to have Pete Szolovits and Ozlem Uzuner as my advisers. Pete simultaneously provided the freedom to work on what I wanted and the guidance that enabled me to succeed in my work. Ozlem introduced me to the field of medical natural language processing and
has provided guidance in my pursuing this topic in depth. My PhD committee members, Sam Madden and
Effi Hochberg, provided valuable counsel on both research and writing. Ally Eran, Aliyah Sohani,
Yu Xin, Rohit Joshi, Nathan Palmer, Paul Avillach, and Isaac Kohane collaborated on part of
this work and contributed much insight. Andrew Lo, Jason Baron, Anand Dighe, Bill Long, Leo
Celi, Xiaoqian Jiang and Dahua Lin have supported me at various stages of my graduate career. I
am very grateful to all my friends at MIT, especially folks at MEDG, who made my graduate
years here exciting and pleasurable. I am deeply indebted to my family for their unconditional love
and support. The work in this thesis is supported by i2b2, by Grant Number U54LM008748 from
the National Library of Medicine, by the Scullen Center for Cancer Data Analysis, and the
MGH-MIT Strategic Partnership.
Contents
Chapter 1. Introduction
1.1 Biomedical Relations
1.1.1 Medical Natural Language Processing
1.1.2 Intensive Care Unit time series analysis
1.1.3 Next Generation Sequencing analysis
1.2 Challenges in Modeling Biomedical Relations
1.2.1 Noisy structure extraction from narrative text
1.2.2 Poor scalability and abstraction for time sequence data
1.2.3 Connecting the dots for sequencing variants
1.2.4 Correlation analysis among multiple feature modes
1.3 Contributions and Organization
Chapter 2. Related Work
2.1 Application of Biomedical Relation Extraction
2.1.1 Biomolecular information extraction
2.1.2 Clinical trial screening
2.1.3 Pharmacogenomics
2.1.4 Diagnosis categorization
2.1.5 Adverse drug reaction and drug-drug interaction
2.2 General Pipeline for Biomedical Relation Extraction
2.3 State-of-the-Art Methods for Biomedical Relation Extraction
2.3.1 Relation extraction from scientific literature
2.3.2 Relation extraction from clinical narrative text
2.3.3 Shared resources for relation extraction
2.4 Limitations of Existing Work
2.4.1 Not all parsers and dependency encodings are synergistic
2.4.2 Integrating co-reference resolution
2.4.3 General relation and event extraction and domain adaptation
2.4.4 Redundancy in subgraph patterns
2.4.5 Integrating with NER
Chapter 3. General Relation Extraction by Frequent Subgraph Mining Applied to Automatic Lymphoma Classification
3.1 Background
3.2 Task Definition
3.3 Data Collection
3.4 Methods
3.4.1 Corpus pre-processing
3.4.2 Intuition on relations among concepts
3.4.3 Representing sentence dependency parses as graphs
3.4.4 Frequent subgraph mining
3.4.5 Subgraph redundancy pruning
3.4.6 Single node frequent subgraph collection
3.5 Experiments and Results
3.6 Feature and Error Analysis
3.7 Discussion and Limitations
3.8 Conclusions
Chapter 4. Subgraph Augmented Non-negative Tensor Factorization (SANTF) Applied to Modeling Clinical Narrative Text
4.1 Methods
4.1.1 Workflow of SANTF
4.1.2 Joint modeling of higher-order features and atomic features using a tensor
4.1.3 Patient and feature group discovery using SANTF
4.1.4 SANTF algorithm
4.2 Experiments and Results
4.3 Feature Analysis
4.4 Discussion
4.5 Conclusions
Chapter 5. Subgraph Augmented Non-negative Matrix Factorization (SANMF) in Modeling ICU Physiologic Time Series
5.1 Background
5.2 Methods
5.2.1 Workflow of SANMF
5.2.2 Representing time series as graphs
5.2.3 Frequent subgraph mining
5.2.4 SANMF algorithm
5.2.5 Feature group discovery and association using SANMF
5.2.6 Evaluating the groups discovered by SANMF
5.3 Results
5.3.1 Method validation on ICU patients' mortality risk prediction
5.3.2 Important subgraph groups
5.4 Limitations and Discussion
5.5 Conclusions
Chapter 6. Integrated Genomics, Transcriptomics, Medical Records, and Insurance Claims Analyses Identify Dyslipidemia as a Strong Inherited Risk Factor in ASD
6.1 Background
6.2 Methods
6.2.1 Implication of Co-regulated Exons
6.2.2 Whole exome sequence analysis
6.2.3 Segregation pattern analysis
6.2.4 Integrated statistical significance
6.2.5 Functional enrichment analysis
6.2.6 Analysis of lipidemia profiles using lab results from individuals with ASD seen at Boston Children's Hospital
6.2.7 PheWAS of Aetna claims data
6.3 Results
6.3.1 Neurodevelopmentally co-regulated, sexually dimorphic, segregating deleterious variation in ASD
6.3.2 Convergent lipid metabolism etiology
6.3.3 Dyslipidemia in families with ASD
6.3.4 Behavioral phenotypes of mouse models of dyslipidemia
6.4 Conclusions and Discussion
Chapter 7. Conclusion and Future Work
7.1 Contributions
7.2 Future Directions
Bibliography
List of Figures
Figure 1-1 Relations from an example sentence, using graph representation
Figure 2-1 Applications of biomedical relation extraction
Figure 2-2 General workflow of biomedical relation extraction
Figure 3-1 MGH pathology reports usually contain four sections with almost all information retained as narrative text
Figure 3-2 Example sentence parsed directly by the Stanford Parser
Figure 3-3 Two-phase sentence parsing on example
Figure 3-4 Raw Stanford parsing result for example sentence 1
Figure 3-5 Stanford parsing result after pre-processing for example sentence 1
Figure 3-6 Raw Stanford parsing result for example sentence 2
Figure 3-7 Stanford parsing result after pre-processing for example sentence 2
Figure 3-8 Raw Stanford parsing result for example sentence 3
Figure 3-9 Stanford parsing result after pre-processing for example sentence 3
Figure 3-10 A variety of sentences frequently occurring in our corpus describe the relations among cells, staining, and antigens/antibodies
Figure 3-11 Constructing the sentence graph from the results of two-phase dependency parsing
Figure 3-12 Example subgraphs for the sentence graph in Figure 3-11
Figure 3-13 A hierarchical hash partition algorithm for determining subisomorphism relation among graphs in a set
Figure 4-1 The workflow of subgraph augmented non-negative tensor factorization (SANTF)
Figure 4-2 Graph generation and subgraph collection in SANTF
Figure 4-3 Tensor modeling and factorization with distributional representations of the sentence subgraphs
Figure 4-4 Word group distribution for six of the top subgraphs in the first DLBCL associated subgraph group
Figure 4-5 Correlation between six of the top subgraphs (partial sentences) in the first DLBCL associated subgraph group
Figure 5-1 The workflow of subgraph augmented non-negative matrix factorization (SANMF)
Figure 5-2 Graph generation and subgraph mining in SANMF
Figure 5-3 Subgraph augmented non-negative matrix factorization model
Figure 5-4 AUC comparisons between NMF and PCA under specification of different number of subgraph groups
Figure 5-5 ROC curves for proposed method SANMF, comparison models including subgraph, discretized & interpolated measures (D,I-measure), and organ level status, as well as the baseline using SAPS II approximation
Figure 6-1 Independent sources of information used to identify molecular networks contributing to ASD
Figure 6-2 Visualization of the BrainSpan RNA-Seq data
Figure 6-3 Distribution of the number of non-NA values in expressions of exons
Figure 6-4 Block and parallel exon correlation makes computation feasible
Figure 6-5 Distribution of R² in the BrainSpan data
Figure 6-6 Visualization of part of the entire exon graph
Figure 6-7 Distribution of padded and merged BrainSpan interval sizes
Figure 6-8 Overview of WES analysis
Figure 6-9 Distributions of the total number of variants in probands and unaffected siblings in discordant families
Figure 6-10 Distribution of number of variants per individual in the discordant family cohort at each stage of variant analysis
Figure 6-11 Distribution of number of variants per individual among multiplex families at each stage of variant analysis
Figure 6-12 Distribution of sizes of multiplex families
Figure 6-13 Pseudo code of the extended ASP test for multiplex families
Figure 6-14 The sexually dimorphic neurodevelopmentally co-regulated LDLR exon cluster
Figure 6-15 ASD-segregating deleterious variation in the sexually-dimorphic LPL exon cluster
List of Tables
Table 2-1 Summarization and characterization of relation extraction algorithms
Table 2-2 BioNLP event extraction tasks
Table 2-3 Shared resources for relation extraction
Table 3-1 Regular Expressions to Catch Lymphoma Mentions
Table 3-2 Semantic types considered as immunologic factors
Table 3-3 Multiple-hit or intermediate lymphoma cases
Table 3-4 Distribution of lymphoma cases in full corpus, training corpus and testing corpus
Table 3-5 Held-out test results on different feature groups
Table 3-6 Held-out test results on different settings of sentence subgraph feature groups
Table 4-1 Statistics of the lymphoma subtype distribution in the pathology narrative text corpus
Table 4-2 Clustering performances for MGH lymphoma dataset
Table 4-3 Per-class evaluation of clustering on the lymphoma dataset
Table 4-4 Top higher-order feature groups associated with diffuse large B-cell lymphoma
Table 4-5 Top higher-order feature groups associated with follicular lymphoma
Table 4-6 Top higher-order feature groups associated with Hodgkin lymphoma
Table 5-1 A simplified algorithm for determining subisomorphism relation among time series subgraphs
Table 5-2 Statistics of experiment data
Table 5-3 Physiologic time series predictor variables from MIMIC-II dataset
Table 5-4 Top subgraph groups associated with high mortality risks
Table 6-1 Brain region hierarchy of regions, areas, and structures included in this study
Table 6-2 Periods of brain development included in this study
Table 6-3 Distribution of cluster sizes (measured in terms of number of exons)
Table 6-4 Distribution of number of genes in exon clusters
Table 6-5 Whole exome sequence datasets used
Table 6-6 Patients used to examine the association of abnormal lipid lab results with ASD
Table 6-7 Significant clusters of sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious variation, and their molecular themes
Table 6-8 Enrichment of comorbid dyslipidemia diagnoses in individuals with ASD as compared to their unaffected siblings
Table 6-9 Significant enrichment of dyslipidemia-related diagnoses in individuals with ASD, detected in health claims data
Table 6-10 Behavioral and nervous system phenotypes shared between 42 mouse models of ASD and 7 mouse models of LDLR deficiency
Chapter 1.
Introduction
1.1 Biomedical Relations
With recent advances of the data acquisition and storage technologies in the biomedical field,
large volumes of data that have unique characteristics and multiple modalities flow into growing
archives that can be used to study and improve medical care. For example, narrative text in a pathology report may explain pathologists' interpretations of flow cytometry results, immunohistochemical patterns, or genetic karyotype profiles. Such text has moderately controlled vocabularies but generally presents high variability due to the flexibility of natural language. In a narrative
text corpus, multiple sentence constructs often express the same meaning, differing in syntactic
construction, word order, or use of abbreviations. A second example considers the vital signs and
other physiologic measurements monitored during hospital admissions, which present themselves as evolving time series, often at unevenly sampled time points. Early recognition of clinical deterioration through early warning systems is an area of active research that aims to identify actionable items for improving patient survival [1]. Scaling to a comprehensive set of clinical variables means analyzing many unevenly spaced time series, which quickly becomes computationally intensive as the number of variables increases. A third example concerns next generation
sequencing that may output multiple gigabytes of gene sequence data per individual, posing immediate throughput challenges to existing representation and learning frameworks. Growing evidence has linked the alternatively spliced isoforms and regulating pathways to distinct clinical
outcomes of multiple specific diseases such as Autism Spectrum Disorder (ASD), underscoring
the value of the ability to sift through the genetic sequence data. Although these data modalities
differ in characteristics and emphasis, meaningful and effective structure discovery has been
under active study for each of them within its respective research subfield.
The problem domains addressed by this thesis include medical natural language processing,
clinical dynamic time series analysis and next generation sequencing analysis. As vast
knowledge and data sources often exceed the capacity of human experts, we need to leverage
modern statistical analysis and machine learning algorithms to generate models that are both accurate and interpretable. We emphasize interpretability so that researchers and clinicians will understand the model and use it to advance the understanding of pathophysiology and to improve
patient care. The methods need to be broadly applicable and easily adaptable across domains.
1.1.1 Medical Natural Language Processing
Relation extraction from text documents is an important task in knowledge representation and
inference in order to create structured knowledge bases, augment existing knowledge bases and
in turn support question answering and decision making. The task generally involves annotating
unstructured text with named entities and identifying the relations between these annotated entities. State-of-the-art named entity recognizers can automatically annotate text with high accuracy
[2,3], but relation extraction is not as straightforward. General domain relation extraction has
been an active research area for decades [4]. In the biomedical and clinical domain, extracting
relations from scientific publications and clinical narratives has been gaining traction over the
past decade.
To illustrate the importance of biomedical and clinical relation extraction, consider that in lymphoma pathology reports, immunophenotypic features are expressed as relations among medical
concepts. For example, in "[large atypical cells] are positive for [CD30] and negative for
[CD15]", "large atypical cells", "CD30" and "CD15" are medical concepts; "CD30" and "CD15"
are cell surface antigens. A bag-of-words or bag-of-concepts representation of this sentence
would fail to capture whether "large atypical cells" are positive or negative for "CD30" or
"CD15". In this and many other similar cases, the biomedical concepts need to be represented as
linked through syntax and/or semantics in order to be informative, so as to enable resolution of
ambiguities by putting the concepts into context.
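The point about bag-of-words can be illustrated with a minimal sketch. The whitespace tokenizer and the relation labels `positive_for` and `negative_for` below are illustrative assumptions, not the representation used later in this thesis; the relation tuples are written by hand rather than extracted automatically.

```python
from collections import Counter


def bag_of_words(sentence):
    """Lower-case, whitespace-tokenized word counts (a deliberately naive tokenizer)."""
    return Counter(sentence.lower().split())


original = "large atypical cells are positive for CD30 and negative for CD15"
swapped = "large atypical cells are positive for CD15 and negative for CD30"

# Both sentences contain exactly the same words, so their bags are identical
# even though the clinical meaning is reversed.
same_bag = bag_of_words(original) == bag_of_words(swapped)

# Hand-written relation tuples (hypothetical labels) keep each antigen tied
# to its polarity, so the two sentences are distinguishable.
relations_original = {
    ("positive_for", "large atypical cells", "CD30"),
    ("negative_for", "large atypical cells", "CD15"),
}
relations_swapped = {
    ("positive_for", "large atypical cells", "CD15"),
    ("negative_for", "large atypical cells", "CD30"),
}
same_relations = relations_original == relations_swapped

print(same_bag, same_relations)
```

Here the two bags compare equal while the two relation sets do not, which is the ambiguity that a linked representation is meant to resolve.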
We define a relation as a tuple $r(c_1, c_2, \ldots, c_n)$, $n \geq 2$, where the $c_i$'s are concepts (named entities),
and the $c_i$'s are semantically and/or syntactically linked to form relation $r$, as expressed in text.
Thus a single named entity is generally not regarded as a relation; an assertion is also generally
not regarded as a relation. In other words, a relation involves at least two concepts. If $n$ is two
(three), we call the relation a binary (ternary) relation, and for general $n$ an $n$-ary relation. Some
researchers use the term relation to focus on triples that represent binary relations (e.g., positive-expression(large atypical cells, CD30), negative-expression(large atypical cells, CD15)). Others also consider composite relations, e.g., and(positive-expression(large atypical cells, CD30), negative-expression(large atypical cells, CD15)). We also use the term relation to include what are often referred to as events; e.g., the ternary relation $r_1$: treatedby(patient, Imatinib regimen, 5 months) as expressed in "[the patient] was put on [Imatinib
regimen] for [5 months]" can also be parsed as an event, where the event trigger is "put", theme
is "Imatinib regimen" and target argument is "patient".
Nested events may occur when one
event takes other events as arguments. Figure 1-1 shows relations from an example sentence,
including binary relations, complex relations, and nested events. We note that all these language
constructs can be universally represented and mined as graphs (e.g., with medical concepts as
nodes and syntactic/semantic links as edges).
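As a minimal sketch of the event-as-graph view, the snippet below encodes the Imatinib example with the trigger as a node and its arguments as labeled edges, then reads the ternary relation back off the graph. The edge-list encoding, the helper names, and the "duration" role label are assumptions made for illustration; the text names only the trigger, theme, and target argument.

```python
# Event view of "[the patient] was put on [Imatinib regimen] for [5 months]".
# Each edge is (source node, link label, target node).
# The "duration" label is an assumed role name, not taken from the text.
event_edges = [
    ("put", "theme", "Imatinib regimen"),
    ("put", "target", "patient"),
    ("put", "duration", "5 months"),
]

def arguments(edges, trigger):
    """Map each role label to the concept attached to the trigger node."""
    return {label: tail for head, label, tail in edges if head == trigger}

def as_relation(edges, trigger, name, role_order):
    """Flatten the event rooted at `trigger` into an n-ary relation tuple."""
    args = arguments(edges, trigger)
    return (name, *[args[role] for role in role_order])

r1 = as_relation(event_edges, "put", "treatedby", ["target", "theme", "duration"])
print(r1)
```

The same edge list therefore supports both readings: an event with a trigger and roles, or the ternary relation r1.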
[Figure 1-1 image. Example sentence: "Bone marrow biopsy was performed on the patient in order to evaluate the effect of medication for lymphoma as the cause of neutropenia." Legible edge labels: performed_on, evaluate, effect, cause, produced-by.]
Figure 1-1 Relations from an example sentence, using graph representation. Nodes are named
entities and edges indicate the relations between two nodes (or multiple named entities connected
by multiple edges can be considered as one relation). Named entities considered are in bold in
the sentence. The dashed box denotes a binary relation, i.e., with two named entities. The solid
box denotes a relation with multiple named entities, which alternatively can be viewed as a collection of three binary relations. These relations (in solid box and dashed box) can also be regarded as events, and the entire graph can be interpreted as a nested event.
1.1.2 Intensive Care Unit time series analysis
Modern ICUs generate multivariate time series data for individual patients using an increasing
number of monitoring devices and laboratory tests. There is a growing body of evidence suggesting that early recognition of clinical instability and early intervention in the development of disease processes may improve patient outcomes such as mortality [5,6]. Interpreting such data in a
timely fashion while providing high-quality care demands close attention from critical care
providers, which exposes ICU patients to the human errors known to be common in hospital admissions
[7,8]. Thus automated tools are needed to help clinicians and nurses identify clinical deterioration early on and quickly assemble an effective treatment plan. A model that understands the patient's multivariate physiologic temporal progressions may be useful to catch preludes to dangerous episodes, increase caregiver vigilance, and ultimately improve patient outcomes.
Many studies have tracked clinical variables to understand the natural history of diseases or to
monitor patient baseline progressions in response to medical intervening procedures and agents.
One such comprehensive time series archive lies in the MIMIC-II (Multiparameter Intelligent
Monitoring in Intensive Care) Databases containing physiologic signals and vital signs time series captured from patient monitors, as well as accompanying clinical data extracted from electronic medical records (EMR) systems. The database currently contains over 40,000 ICU patients,
whose data were collected between 2001 and 2008 from a variety of ICUs (medical, surgical,
coronary care, and neonatal) within a single tertiary teaching hospital.
The patient's multivariate physiologic temporal progressions are in fact relations in the temporal
domain. The ability to succinctly represent these relations and to correlate features of such representations with various aspects of diseases may offer insights into the pathogenesis, and help
physicians make informed decisions. To digest the vast amount of monitored time series and to
present them in an informative way, dynamic models have been studied which mostly fall in the
probabilistic generative model framework. Filter-based generative models such as switching
Kalman filters [9] assume that data is generated from a discrete set of transition matrices, but
discretization may limit the visibility of fine grained variability among individual patients. Models based on hierarchical Dirichlet processes (HDPs) loosen the discretization prerequisite and
accept infinite dimensional latent state space [10,11]. They typically model the time series using
a sequence of parameterized generating functions that specify the series dynamics conditioned on
the current and/or previous states and differ in the degree of overlap among topics of such generating functions. In addition to generative models, Fourier or wavelet transformations [12,13]
have been applied to extract features directly from the time series. However, these methods
generally suffer from the problem of feature interpretability.
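To make the interpretability concern concrete, the following sketch computes discrete Fourier magnitudes for a synthetic series. The series (a baseline of 80 with an oscillation of period 8 samples), the function name, and the normalization are all assumptions for illustration; this is not the feature extraction used in [12,13].

```python
import cmath
import math

def dft_magnitudes(series):
    """Magnitudes of the discrete Fourier coefficients, divided by series length."""
    n = len(series)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        magnitudes.append(abs(coeff) / n)
    return magnitudes

# Synthetic, invented signal: 32 samples, baseline 80, oscillation period 8.
series = [80 + 5 * math.sin(2 * math.pi * t / 8) for t in range(32)]
features = dft_magnitudes(series)

# Index of the strongest non-constant frequency component.
dominant = max(range(1, len(features)), key=features.__getitem__)
print(dominant, round(features[0], 3), round(features[dominant], 3))
```

Each feature here is the strength of one frequency bin. Such a number can separate patients statistically, but it does not by itself name a clinical trend such as "falling blood pressure", which is the interpretability gap noted above.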
1.1.3 Next Generation Sequencing analysis
In recent years, high-throughput sequencing techniques have enabled the identification of genetic
patterns associated with distinct clinical outcomes of specific disease entities. For example, genome-wide association studies (GWAS) expanded the assessment scope on genetic variations to
the whole genome, though they are generally limited to previously identified single-nucleotide
polymorphisms (SNPs) [14]. Exome sequencing is able to comprehensively identify and type
protein-coding variations throughout the genome, and hence is less biased towards learning what we
already know. Exome sequencing nonetheless ignores about 99% of the genome, which consists of noncoding regions that may have regulatory influence on the expression and functioning of coding
regions [14]. Personalized whole-genome sequencing is not restricted by the biases associated
with the previous two sequencing technologies. Next generation sequencing technology has produced an ever-increasing amount of genomics data at multiple resolutions, which makes it possible to characterize at the genetic level those diseases and disorders that are inherited but highly
heterogeneous. Such characterization requires deep understanding of genetic variants in relation
to each other and to the disease phenotype, through mechanisms such as regulatory networks and
signaling pathways. Thus it is important to effectively model the relations (e.g., through transcription or regulation) of genetic variants in next generation sequencing analysis.
One example is Autism Spectrum Disorder (ASD). One in every 68 children in the USA is diagnosed with ASD, a set of neurodevelopmental conditions characterized by social and communication impairments, and increased repetitive behavior. ASD has a substantial genetic component,
but the specific cause of most cases remains unknown. Today, different constellations of selected
molecular, biochemical, neurofunctional, and clinical measurements that fall outside of normal
ranges can each identify a group of individuals with ASD. However, individuals without ASD
also display measures that lie outside of the normal range for one (or possibly more) of the dimensions tested. Furthermore, recent large-scale whole exome and whole genome sequencing
studies suggest that not only do different individuals with ASD carry different deleterious variants, but a single individual may have multiple different variants in likely candidate genes [15-24]. Therefore, there might exist a spectrum of genetic variants underlying the spectrum of clinical manifestations, making ASD extremely heterogeneous on both the molecular and clinical
levels. Thus it is essential to model the relations of genetic variants in association with disease
using not only next generation sequencing data, but also personal health data from other modalities in an integrative fashion.
1.2 Challenges in Modeling Biomedical Relations
There are a few major challenges, common to each subfield and the overall field, with respect to
modeling biomedical relations.
1.2.1 Noisy structure extraction from narrative text
Much of the clinical content of EMRs is, from a computer's viewpoint, locked up in the narrative
text portions of the records. These typically include doctors' and nurses' notes, referring letters,
specialists' reports, discharge summaries, and communications between doctors and patients.
Their content adds to the data available from more structured components of the EMR such as
laboratory values, medication prescriptions and vital sign records. There are existing clinical
NLP systems such as cTAKES [25] and MetaMap [26] that can extract medical concepts and their
assertions (e.g., negated concepts [27]). However, it is still an open problem to automatically extract useful relationships between medical concepts. Much of the state of the art focuses on extracting or classifying predefined relations from biomedical narratives [2,28-34]; however, it is
uncertain whether these predefined and often binary relations are directly useful and comprehensive for complex tasks such as patient diagnosis and outcome prediction.
1.2.2 Poor scalability and abstraction for time sequence data
During hospital admissions, routinely monitored patient baseline progression includes vital signs,
chem7 and other physiologic measurements. Studies have linked early recognition of patients'
declining baseline condition to a 50% reduction in the heart attack rate, and in turn to lower mortality [5]. Common practice typically involves predictive scoring systems that
aim to identify only the few most descriptive clinical measurements for a particular outcome
[35-40]. Many attempts to perform multivariate time-series analysis are restricted to only a
handful of clinical variables (usually fewer than 20, see [10,41-43]). On the other hand, the few
approaches to unsupervised high-dimensional multivariate learning [44,45] lack the ability to
simultaneously learn temporal patterns while learning abstractions¹ over raw measurements.
1.2.3 Connecting the dots for sequencing variants
The current practice of analyzing genetic sequence variants often assumes linear models where
the relations between Single Nucleotide Variations (SNVs) and Copy Number Variations (CNVs)
are largely ignored. On the other hand, the genes that are affected by those SNVs and CNVs interact with each other functionally in the context of pathways or regulatory networks. Moving
toward whole-exome and whole-genome analysis, statistical tools face multiple challenges in
connecting those SNVs and CNVs through their functional interactions in order to better understand
pathogenic mechanisms. In particular, association between variants and disease phenotype
should be investigated in the context where variants are not treated independently, but collectively when functionally correlated. However, as next generation sequencing produces an ever-increasing amount of genomics data, it also becomes more difficult to identify the subset of genetic variants underlying a particular phenotype. Even if one focuses on the protein-encoding
exome, there are at least 25,000 distinct variants that differentiate individuals from each other.
Although graphical models have been applied to estimate the structure of functional interaction,
they are typically restricted to a small set of variants [46-48]. Relaxing such restrictions to take
advantage of whole-exome and whole-genome sequencing will pose not only computational
challenges (e.g., convergence rate and local optima) but also representational and statistical challenges (e.g., hypothesis space pruning and significance testing within a greatly increased hypothesis space).
1.2.4 Correlation analysis among multiple feature modes
In many modeling tasks, the raw data can be processed by multiple feature extraction algorithms
that generate features from different modalities or from multiple levels of analysis. For example,
in medical natural language processing, one can extract the standard bag-of-words features, or
one can extract more semantic-syntactic enriched features such as predicate argument structures
and named entities. The different levels of features are correlated and collectively reflect the
characteristics of a sentence or a document. Traditional machine learning models in medicine
mostly adopt a two-dimensional matrix view of the data in the sense that patients and features
each span one axis of a matrix. Such models cannot account for interactions between features or
groups of features at different levels. Similar challenges exist when patients' personal health data
come in multiple modalities. For example, in studying patients with Autism Spectrum Disorders,
it has been broadly hypothesized that only through combinations of multimodal measures, including genomics, transcriptomics, lab test results, and insurance claims analyses, will we obtain
the diagnostic and prognostic accuracy that permits proper assignment of each individual to the
group of ASD patients whose etiology, pathophysiology, treatment response, and clinical course
most closely resemble his or hers.
Footnote 1 (to "learning abstractions" in Section 1.2.2): Some refer to this level of learning as learning clusters, while others refer to it as learning topics.
1.3 Contributions and Organization
This dissertation contributes a generalizable framework based on subgraph mining and factorization algorithms to model biomedical relations, and further, their correlations. It develops SANTF,
a subgraph augmented non-negative tensor factorization tool that integrates atomic features
(words) to help correlate higher-order features (relations between medical concepts) in clinical
narrative text, and enables automated and interpretable lymphoma subtype categorization. As a
variation of SANTF, this dissertation also develops Subgraph Augmented Non-negative Matrix
Factorization (SANMF) that groups graph-represented temporal progression trends of physiologic variables in a way that reflects the patient pathophysiology evolution and that is predictive of
patients' mortality risks. As another variation, it develops ICE, implication of co-regulated exons,
which is a new subgraph-based method to implicate co-regulated exons with ASD phenotype and
allows identification of novel risk factors for ASD.
The rest of this dissertation is organized as follows. In Chapter 2, we provide the background
necessary to understand the motivations of applying subgraph mining and factorization algorithms to extract relations from biomedical narratives. We also describe previous work in the area. In Chapter 3, we describe in more detail the graph mining component of SANTF, which is
applied to lymphoma subtype classification. Chapter 4 continues to describe the core SANTF,
which extends the graph mining component to augment non-negative tensor factorization algorithms in order to group subgraph-mined biomedical relations and produce interpretable diagnostic panels for lymphoma subtypes. Chapter 5 describes SANMF and its application to ICU mortality risk prediction. Chapter 6 describes ICE and its application to study genetic risk factors for
ASD. Chapter 7 summarizes conclusions and future work.
Chapter 2.
Related Work
In this chapter, we review relation extraction from unstructured text using natural language processing (NLP) methods, with a focus on applications in biomedical and clinical informatics. The
representation of relations has been a subject of knowledge representation research for decades
[49], and there are various alternatives. One representation uses composed simple logical forms.
For example, Resource Description Framework (RDF) or Web Ontology Language (OWL) encodes complex relations by multiple triples, where the elements of these triples can themselves
be other composed forms. Thus a binary relation such as positive-expression(large atypical cells, CD30) has the following subject-predicate-object triple representation: large atypical cells-positively express-CD30. A more powerful alternative
is the sentential logic (or propositional logic) representation [49], in which relations are propositions or composed propositions using logical connectives (e.g., and for conjunction, or for disjunction). A third alternative is the graph-based representation in which nodes are named entities
and edges indicate relationships (or multiple named entities connected by multiple edges can be
regarded as one relation), as in Figure 1-1, which shows binary relations, n-ary relations, and
how an n-ary relation can be regarded as a composition of multiple binary relations.
Regarding alternative representations, the graph-based representation is equivalent to the sentential logic representation, differing at most perhaps in the compactness of the representation [50].
Thus, relations (including events) can be universally represented as graphs by converting biomedical concepts to nodes and syntactic/semantic links to edges. Other relation representations
can also be easily derived using such graphs as intermediary input. Furthermore, although composition leads to complexity (e.g., n-ary relations or nested relations), by adopting a graph-based
representation, we can focus on syntactic and semantic graphical patterns that are common and
that provide good ways to capture relations. In fact, as will become clear later in this chapter, almost all state-of-the-art methods for extracting relations and events use graph-based algorithms.
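The conversion between representations described above can be sketched for the conjunctive case. The nested-tuple encoding of a composed proposition and the helper names below are assumptions for illustration; only the `and` connective is handled, since flattening a disjunction into edges this way would lose information.

```python
CONNECTIVES = {"and"}  # only conjunction is flattened in this sketch

# Composite relation from Chapter 1, written as a nested prefix expression.
composite = (
    "and",
    ("positive-expression", "large atypical cells", "CD30"),
    ("negative-expression", "large atypical cells", "CD15"),
)

def atomic_relations(expr):
    """Yield the atomic (predicate, subject, object) relations inside expr."""
    head, *args = expr
    if head in CONNECTIVES:
        for sub in args:
            yield from atomic_relations(sub)
    else:
        yield tuple(expr)

def to_triples(expr):
    """Reorder each atomic relation into a subject-predicate-object triple."""
    return [(s, p, o) for p, s, o in atomic_relations(expr)]

def to_graph(triples):
    """Adjacency map: subject node -> list of (edge label, object node)."""
    graph = {}
    for s, p, o in triples:
        graph.setdefault(s, []).append((p, o))
        graph.setdefault(o, [])
    return graph

triples = to_triples(composite)
graph = to_graph(triples)
print(triples)
print(graph)
```

The graph acts as the intermediary form here: the triples and the original conjunction can both be read back from its labeled edges.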
The reader should also be aware of a body of research on creating curated structured knowledge
bases, which record manual annotations of biomedical relations by experts. Some of these
knowledge bases are biologically focused, such as KEGG [51], STRING [52], InterPro [53], and
InterDom [54]. Others are more clinically focused, such as PharmGKB [55], VARIMED [56]
and ClinVar [57]. However, the expert sourcing methods often scale poorly with the exponentially growing body of biomedical and clinical free text. Thus automated methods present a promising direction for discovering relations that can augment existing knowledge bases.
2.1 Application of Biomedical Relation Extraction
Extracting biomedical relations has numerous applications that vary from advancing basic sciences to improving clinical practices, as shown in Figure 2-1. These applications include but are
not limited to bio-molecular information extraction, clinical trial screening, pharmacogenomics,
diagnosis categorization, as well as discovery of adverse drug reactions and drug-drug interactions.
[Figure 2-1 image omitted; legible label: "Relation Extraction".]
Figure 2-1 Applications of biomedical relation extraction. The bidirectional arrows indicate that
on the one hand, automated methods for relation extraction can help biological and clinical investigations; on the other hand, these applications can in turn provide shared resources (e.g., corpora and knowledge bases).
2.1.1 Biomolecular information extraction
To keep up with the exponential growth of the literature, automated methods have been applied
to mining protein-protein interactions [58,59], gene-phenotype associations [60,61], gene ontology [62], and pathway information [63], which we collectively call biomolecular information
extraction. Such relation mining has shown its value in the prioritization of cancerous genes for
further validation from a large number of candidates [64]. Many of these approaches apply NLP
methods to extract known disease-gene relations from the literature, which are then used to predict novel disease-gene relations [65-69].
2.1.2 Clinical trial screening
Archived clinical and research data have been made available by governmental agencies and
corporations, such as ClinicalTrials.gov [70]. Clinical trials are in large part characterized by eligibility criteria, some of which can be captured via relations (e.g., no [diagnosis] for [rheumatoid
arthritis] for at least [6 months]). Electronic screening can improve efficiency in clinical trial recruitment, and intelligent query over clinical trials can support clinical research knowledge curation [71]. Recently, NLP support has proved useful in scaling up the annotation process [72-74],
enabling semantically meaningful search queries [75], and clustering similar clinical trials based
on their eligibility criteria profiles [76].
2.1.3 Pharmacogenomics
Pharmacogenomics aims to understand how different patients respond to drugs by studying relations between drug response phenotypes and patient genetic variations. Much of the knowledge
on such relations can be mined from scientific literature text and curated in databases to enable
discovery of new relationships. One such database is the Pharmacogenetics Research Network
and Knowledge Base (PharmGKB [77]). Initial efforts to populate PharmGKB included a mixture of expert annotation and rule-based approaches. Recent approaches have extended to utilizing semantic and syntactic analysis as well as statistical machine learning tools to mine targeted
pharmacogenomics relations from biomedical literature and clinical records [78-80].
2.1.4 Diagnosis categorization
Diagnosis categorization enables automated billing and patient cohort selection for secondary
research. Systems have been developed to automatically perform coding and classification of
diagnoses from Electronic Medical Records (EMRs) [81-85]. More recent approaches demonstrated the success of extracting semantic relations and using these relations as additional features in diagnosis categorization, some through better vocabulary coverage [86], others through
more expressive and informative representation of relations between medical concepts [87,88].
2.1.5 Adverse drug reaction and drug-drug interaction
Adverse drug reaction (ADR) refers to unexpected injuries caused by taking a medication. Drug-drug interaction (DDI) happens when a drug affects the activity of another drug when both are
administered together. ADR is an important cause of morbidity and mortality [89], and DDIs
may cause reduced drug efficacy or lead to drug overdose. Detecting potential ADRs and DDIs
can guide the process of drug development. Recently, an increasing number of systems have leveraged the scientific literature and clinical records using NLP. These systems often explore the
relations between drugs, genes and pathways, and discover ADRs [90-92] and DDIs [33,34] that
are stated in unstructured text.
2.2 General Pipeline for Biomedical Relation Extraction
In Figure 2-2, we first present a general pipeline, summarized from the reviewed approaches, as
a cookbook to follow either in part or as a whole for extracting biomedical relations. We present
this general pipeline before the methodology review to provide the reader a roadmap of the components discussed in the state-of-the-art methods. For completeness, we assume documents as
the input and the extracted relations as the output. The pipeline is thus self-contained, but can
also be used as a foundation for downstream applications such as logical inference with extracted
relations. The pipeline covers steps for breaking the documents into sentences, understanding the
semantic and syntactic structures of sentences and constructing a multitude of features for relation extraction. We refer the reader to the description of each step in the accompanying text of
the figure. We emphasize the role of graph mining in the pipeline as a central concept. The
common graphs provide a point of convergence for methods that combine local features, a point
of divergence from which more integrated features may be constructed, and a bridge to connect
the syntax and semantics.
[Figure 2-2 image. Legible box labels: Documents; Section recognition; Sentence breaking; Regex Pattern Matcher; Tokenization; Morphological analysis; POS tagging; Terminology; Parsing; Feature extraction (context, lexical, semantic, concept, graph (tree, path), and dictionary features, etc.); Graph representation; Semantic Role Labeling; Graph mining; Post-processing (improve recall, improve precision); Relation classification; Rule induction/optimization; Feature space classifiers; Kernels (incl. graph/tree kernels); Relations.]
Figure 2-2 General workflow of biomedical relation extraction. Section recognition distinguishes
text under different section headings (e.g., "Chief Complaints" or "Past Medical History"). Sentence breaking is to automatically decide where sentences in a paragraph begin and end. Morphological analysis investigates features such as capitalization and usage of alphanumeric characters. Stemming reduces the inflected words to the root form (e.g., performed to perform). POS
tagging assigns a part-of-speech tag for each word in the sentence (e.g., VBN for "performed" in
the sentence in Figure 1-1). Parsing is the process of assigning a syntactic structure to a sentence
(e.g., the constituency or dependency structure obtained by Stanford Parser). The results from
morphological analysis, stemming, POS tagging and parsing can provide features for recognizing
anaphora (coreference resolution) and typed concepts (concept recognition). Coreference resolution and concept resolution can also improve parsing accuracy. Together with parsing, they are
essential in generating the graph representation for a sentence and labeling semantic roles of
concepts in the graph representation (Semantic Role Labeling). The graph representation is the
foundation for graph mining, and along with upstream steps including direct regular expression
feature extraction, leads to the generation of semantically and syntactically enriched features.
These features then support rule-based, feature space-based, or kernel-based relation extraction systems. Many biomedical relation extraction systems rely on external knowledge sources
(e.g., UMLS). The shaded cloud denotes that the external resources (terminology, ontology and
knowledge bases) can be utilized by some or all of the covered steps.
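The earliest steps of the workflow can be sketched with regular expressions alone. Everything below is a deliberately naive stand-in written for illustration: the function names are hypothetical, the sentence splitter and suffix stripper are not the tools used by the reviewed systems, and the later steps (POS tagging, parsing, concept recognition) would require a real NLP toolkit.

```python
import re

def break_sentences(text):
    """Naive sentence breaking: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Words (allowing internal hyphens) and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

def morphology(token):
    """Two simple morphological features: capitalization and letter/digit mix."""
    return {
        "capitalized": token[:1].isupper(),
        "alphanumeric_mix": any(c.isdigit() for c in token)
        and any(c.isalpha() for c in token),
    }

def stem(token):
    """Crude suffix stripping; a placeholder for a real stemmer."""
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = ("Bone marrow biopsy was performed on the patient. "
        "Large atypical cells are positive for CD30.")
sentences = break_sentences(text)
tokens = [tokenize(s) for s in sentences]
stems = [[stem(t) for t in sent] for sent in tokens]
print(stems[0])
print(morphology("CD30"))
```

Each function consumes the previous one's output, mirroring the order of steps in the figure; the features they produce would feed the downstream parsing and graph construction stages.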
2.3 State-of-the-Art Methods for Biomedical Relation Extraction
As the task of biomedical relation extraction has been receiving increasing attention, so have the
methods to accomplish it. Some conventional approaches focus on using co-occurrence statistics
as a proxy for relatedness [79,93-96]. Some clinical NLP systems, such as MedLEE [97] and SemRep
[98], apply hand-crafted syntactic and semantic rules to extract pre-specified semantic relations and are hard to adapt to new subdomains. Recently, the research community has been paying more attention to the value of syntactic parsing, in order to develop generalizable methods to
extract relations that fully explore the constituency and dependency structures of natural language. In this section, we review the state-of-the-art work where graph (including tree) mining
techniques are used to derive relations from syntactic or semantic parses. We group the methods
according to whether their corpora mainly concern scientific publications or clinical narrative
text, as this content difference often has implications for the methods and resources used to extract relations. We also summarize the algorithms and systems in Table 2-1.
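For reference, the co-occurrence baseline mentioned at the start of this section can be sketched as follows. The concept names and sentences are invented toy data, and pointwise mutual information is one common choice of co-occurrence statistic, not necessarily the one used in [79,93-96].

```python
import math
from collections import Counter
from itertools import combinations

# Invented toy corpus: each set holds the concepts recognized in one sentence.
sentences = [
    {"drug_a", "event_x"},
    {"drug_a", "event_x", "drug_b"},
    {"drug_b", "symptom_y"},
    {"drug_a", "lab_z"},
]

concept_counts = Counter(c for s in sentences for c in s)
pair_counts = Counter(
    frozenset(pair) for s in sentences for pair in combinations(sorted(s), 2)
)

def pmi(a, b):
    """Pointwise mutual information of two concepts at the sentence level."""
    n = len(sentences)
    joint = pair_counts[frozenset((a, b))] / n
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((concept_counts[a] / n) * (concept_counts[b] / n)))

print(pmi("drug_a", "event_x"), pmi("drug_b", "event_x"), pmi("drug_a", "symptom_y"))
```

A high score says only that two concepts appear together more often than chance. It does not say which relation holds between them, or in which direction, which is the gap the parsing-based methods reviewed below try to close.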
26
CoRef External Resources
Graph Exploration
Methods
Parsers
Luo et al. [87,88,99]
Frequent subgraphs with No
Stanford
(augmented by redundancy removing
UMLS)
No
Shortest path
Stanford
Roberts et al. [101]
deBruijn et al.[102]
McCCJ, SD
Kay
Xu et al. [103]
Stanford,
McCCJ,
Enju
Liu et al. [105,106], McCCJ, SD
Mackinlay et al.
[107], Ravikumar et
al. [108]
Bjorne et al. [111- McCCJ, SD
114] , Hakala et al.
[115]
et
al. McCCJ, SD
Kilicoglu
[117,118]
Hakenberg et al. BioLG
[119,120]
Solt et al. [104]
Thomas et al. [121]
Bikel, SD
Riedel et al. [123]
McCCJ
Minimal trees over con- No
cept pair
Conceptual graph repre- No
sentation
Graph kernels:
kBSPS
APG, No
Exact subgraph match- No
ing, approximate subgraph matching
Shortest path, rule-based No
graph pruning
Embedding graph, postprocessing rules
Subgraph pattern matching using customized
query language, postprocessing rules
Pattern matching in dependency graphs
Candidate graph scoring
Yes
UMLS, Gaston [100]
Concept Matching
Normalized string
greedy match
CRF for concept
boundary
and
SVM for concept
type
Semi-Markov
UMLS
HMM
Kay Chart Parser
UMLS
and regular expressions
Compiled dictionaries Dictionary lookup
and graph matching rules
PDB [109], Uniprot, Yes
Biothesaurus [110]
UMLS, Wordnet
Wikipedia
Uniprot [116], Sub- No
tiWiki,
Wordnet,
DrugBank, MetaMap
Compiled dictionaries No
Yes
Compiled dictionar- BANNER, PNAT
ies, Lucene, Uniprot,
GO
Yes
GNAT [122]
No
No
Van Landeghem et Stanford
al. [125]
No
Compiled dictionar- No
ies, Stanford event
extractor [124]
Compiled dictionaries Yes
et
al.
Kaljurand
[126]
Vlachos et al. [128]
Yes 2
IntAct [127]
No
Yes
No
No
No
No
No
No
No
No
No
UMLS, Wordnet
No
PharmGKB [77]
Yes
McClosky et al.
[124,129]
Quirk et al. [130]
Miwa et al. [131]
Coulet
et
al.
Percha
et
[78,132],
2
Extraction rules based on
minimal event containing subgraph patterns
Dependency paths bePro3Gres
tween the concept pairs
Dependency paths beRASP
tween the concept pairs,
post-processing rules
McCCJ, SD
Minimum spanning tree
algorithm
SD;
Shortest
paths between
McCCJ,
Enju
the concept pairs
Enju, GDep
Dependency paths between the concept pairs
Stanford
Dependency paths between the concept pairs
Relative clause anaphora
No
I
I
al. [80]
Subtrees rooted at the
lowest common ancestors of concept pair
Wang et al. [137]
No
Association
distance
between pair of entities
in a semantic network.
Bui et al. [139]
Stanford
Grammatical rules to
traverse the tree structures
et
al. LGP, Minipar, Subtrees rooted at the
Katrenko
Charniak
lowest common ances[142]
tors of concept pair
Enju, GDep
Dependency paths beSatre et al. [143]
tween the concept pairs
In-house par- Frequent subtree patterns
Weng et al. [75]
ser
Graph kernels: APG,
Thomas et al. [146] McCCJ, SD
kBSPS
Tree kernel: MEDT
Chowdhury et al Stanford,
1
McCCJ, SD
[147-149]
Hakenberg
[133]
et
al. Stanford
No
No
UMLS, SIDER [134], BANNER [136]
DrugBank
[135],
PharmGKB, GNAT
Chem2Bio2RDF
No
[138]
Yes
HIVDB [140],
gaDB [141]
No
No
No
No
UniProt, Entrez Gene Yes
[144], GENA [145]
Yes
UMLS
No
No
No
No
No
No
1
__
Re- Pre-specified drug
names and regular
expressions
Yes
Table 2-1 Summarization and characterization of relation extraction algorithms. Abbreviations
used in this table include: CoRef - co-reference resolution, CRF - conditional random field,
HMM - hidden Markov model, APG - all paths graph kernel [58], kBSPS - k-band shortest path
spectrum kernel [150], MEDT - mildly extended dependency tree kernel [151], PDB - Protein
Data Bank [109], UMLS - Unified Medical Language System. The key for the parsers is: Stanford - Stanford Parser, McCCJ - McClosky-Charniak-Johnson Parser, Chart - Kay Chart Parser,
Enju - Enju Parser, Bikel - Bikel Parser, SD - Stanford Dependency. When Stanford Parser is
used, Stanford Dependency is automatically assumed.
2.3.1 Relation extraction from scientific literature
Over the past decade, continuous effort has been directed to extracting semantic relations from
biomedical literature text, often in the form of shared-task community challenges that aim to assess and advance NLP techniques. Notable community challenges include BioNLP shared tasks
on event mining, BioCreative shared tasks on protein-protein interaction (PPI) extraction, and
DDIExtraction challenges on drug-drug interaction (DDI) extraction. We observed that an increasing number of teams applied graph-based techniques to characterize the semantic relations
in these shared tasks. These techniques frequently place in the top-performing echelon. This
section reviews the graph-based methodologies developed for these challenges. We consider only the papers accepted into the shared task proceedings as full publications, and focus on the top
performing systems. We summarize the f-measures of the best systems in each shared task as an
evaluation of each, and refer the reader to the challenge overviews for detailed and comprehensive evaluations. Perhaps through learning the lessons from these challenges, real world applications such as the field of pharmacogenomics also saw significant momentum in development of
graph-based text mining methods. Thus we devote the last part of this section to review recent
advances in pharmacogenomics and demonstrate the transfer and adaptation of graph based algorithms from methodology oriented research to application oriented research in biomedical relation extraction.
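For readers unfamiliar with the f-measures reported in this section, the usual balanced form is given below, assuming the standard definitions of precision $P$ and recall $R$ over true positives ($TP$), false positives ($FP$), and false negatives ($FN$). The exact matching criteria that decide what counts as a true positive differ between shared tasks and are defined in the respective challenge overviews.

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
```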
2.3.1.1 BioNLP event mining shared tasks
Three BioNLP shared tasks have focused on recognizing biological events (relations) from the
literature. The shared tasks provided the protein mentions as input and asked the participating
teams to identify a predefined set of semantic relations. Teams were not required to discover the
protein mentions. BioNLP-ST 2009 consisted of three sub-tasks, including core event detection,
event argument recognition, and negation/speculation detection, all based on the GENIA corpus
[31]. BioNLP-ST 2011 expanded the tasks and resources in order to cover more text types, event
types and subject domains [28]. Besides the continued GENIA task (GE), the 2011 shared tasks
added the following sub-tasks: epigenetics and post-translational modification (EPI), infectious
diseases (ID), bacteria biotope (BB) and bacteria interaction (BI). BioNLP-ST 2013 further expanded the application domains and included the following event extraction tasks: GE, BB, cancer genetics (CG), pathway curation (PC), and gene regulation ontology (GRO) [32]. Table 2-2
describes the nature of those tasks in more detail.
Task  Task Description
GE    Extracting the bio-molecular events related to NFκB proteins.
EPI   Extracting epigenetic and post-translational modification events.
ID    Extracting events describing the biomolecular foundations of infectious diseases.
BB    Extracting the association between bacteria and their habitats.
BI    Extracting the bacterial molecular interactions and transcriptional regulations.
CG    Extracting cancer related molecular and cellular level foundations, tissue and organ level effects and organism level outcomes.
PC    Extracting signaling and metabolic pathway related biomolecular reactions.
GRO   Extracting regulatory events between genes.
Table 2-2 BioNLP event extraction tasks.
The typical event extraction workflow can be broken into two general steps: trigger detection and
argument detection. For example, in r3: [the patient] was put on [Imatinib regimen], the first step
detects the event trigger "put", and the second step detects the theme "Imatinib regimen" and
target argument "patient". Bjorne et al. [111-113] converted sentences to a dependency graph
(Stanford Dependency [152]) representation using the McClosky-Charniak-Johnson parser
[153,154] and explored the graphs to construct features for both steps. The McClosky-Charniak-Johnson parser is based on the constituency parser of Charniak and Johnson [153] and was retrained
with the biomedical domain model of McClosky [154]. Bjorne et al. generated N-gram features
connecting event arguments based on the shortest path of syntactic dependencies between the
arguments. They included as features the types and supertypes of trigger nodes from event type
generalization, in order to address feature sparsity. Bjorne et al. also applied semantic post-processing rules to prune graph edges that violate the semantic compatibility that the
event definition requires between event arguments. Their system (currently referred to as TEES)
performed best in the 2009 GE (0.52 f-measure), 2011 EPI (0.5333 f-measure), 2013 CG (0.5541
f-measure), 2013 GRO (0.215 f-measure, being the only participating system) and 2013 BB full
event extraction (0.14 f-measure). Hakala et al. [115] built on top of the TEES system and reranked its output by enriched graph-based features, including paths connecting nested events and
occurrence of gene-protein pairs in general subgraphs mined from external PubMed abstracts and
the PubMed Central full-text corpus. In addition, they applied event type generalization to augment graph-based features to combat feature sparsity. The system by Hakala et al. placed first in
2013 GE (0.5097 f-measure), whereas the TEES system placed second (0.5074 f-measure). The
strong performance of both systems highlights the importance of exploring graph-based features.
The performance increase associated with enriched graph-based features suggests directions for
improvement.
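As a minimal sketch of the shortest-path N-gram features described above (not the TEES implementation), the following Python fragment finds the shortest undirected path between two tokens of a dependency graph by breadth-first search and slides an N-gram window over that path. The edge list, the dependency labels and the function names are illustrative assumptions.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the tokens on a shortest undirected path, or None if unreachable."""
    neighbors = {}
    for head, dependent, _label in edges:
        neighbors.setdefault(head, []).append(dependent)
        neighbors.setdefault(dependent, []).append(head)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the back-pointers to rebuild the path from source to target.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


def path_ngrams(path, n):
    """Return all contiguous token N-grams along a path."""
    return [tuple(path[i:i + n]) for i in range(len(path) - n + 1)]


# Hypothetical (head, dependent, label) edges for
# "the patient was put on Imatinib regimen"; labels are illustrative only.
EDGES = [
    ("put", "patient", "nsubjpass"),
    ("put", "was", "auxpass"),
    ("put", "regimen", "prep_on"),
    ("regimen", "Imatinib", "nn"),
    ("patient", "the", "det"),
]

path = shortest_dependency_path(EDGES, "patient", "regimen")
bigrams = path_ngrams(path, 2)
```

Here the path between the two arguments passes through the trigger "put", so the bigrams pair each argument with the trigger. A real system would typically also encode edge labels and directions in the features.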
Miwa et al. [131,155] built the EventMine system that can extract not only biomedical events but
also their negations and uncertainty statements. For event extraction, they used the Enju parser
[156] and the GENIA Dependency parser (GDep) [157] to generate path features along with dictionary based features (e.g., UMLS Specialist lexicon [158] and Wordnet [159]). Their entry in
BioNLP ST 2013 placed first in the PC task. In particular, their path features include not only
paths between event arguments but also paths between event argument and non-argument named
entities. The enriched paths linking non-argument entities likely account for the strong performance by providing more local context features.
Another vein of work proposed joint models for event extraction in which event triggers and arguments for all events in the same sentence are predicted jointly. McClosky et al. [124,129] integrated event extraction into the overall dependency parsing objective, and treated flat events and
nested events similarly. For preprocessing, they applied the McClosky-Charniak-Johnson parser
and converted the parsing results to Stanford Dependency. They converted the annotated event
structures in the training data to event dependency graphs that take event arguments as nodes and
argument slot names as edge labels. They mapped the event dependency graphs to Stanford Dependency graphs and generated graph-based features to train an extended MSTParser [160] for
extracting event dependency graphs from test data. The graph-based features included paths between nodes in the Stanford Dependency graph, as well as subgraphs consisting of parents, children, and siblings of the nodes. McClosky et al. also included consistency features that impose
domain-specific soft constraints on the compatibility of edges connecting event arguments. They
also applied event type generalization to combat feature sparsity. They then converted the top-n
extracted event dependency graphs back to event structures and re-ranked event structures to get
the best one, using graph-based features similar to those in MSTParser training but extracted
from event dependency graphs. Riedel et al. first applied Markov Logic Networks to learn relational structures for event extraction [161] and later switched to graph-based methods [123,162].
They projected events to labeled graphs, and scored candidate graphs using a function that captures constraints on event triggers and event arguments. The scoring function considers token
features, dictionary features and dependency path features. Riedel et al. further used a stacking
model to combine their system with the system by McClosky et al. [124,129]. The combined system obtained first place in 2011 GE task (0.56 f-measure) and 2011 ID task (0.556 f-measure).
Most of the remaining BioNLP systems that performed competitively also used graph-based features to various extents. Liu et al. developed an Exact Subgraph Matching (ESM) method [106],
and later a more flexible Approximate Subgraph Matching (ASM) method, in order to mine
basic and nested events [105,107]. They processed sentences with the McClosky-Charniak-Johnson parser and transformed the parsing results to dependency graphs while respecting edge
directionality. They constructed the graph representation of an event by computing unions of dependency paths between event arguments. After that, Liu et al. applied exact or approximate
subgraph matching to match sentence graphs to event graphs, based on a customized distance
metric, which takes into account subgraph differences in graph structure, node labels (formed by
the words covered by a node) and edge directionality. To improve the sensitivity of subgraph
matching, Liu et al. used lemmatization to unify words [163]. This work falls along the lines of
graph kernel based methods. As with many such methods, absorbing features into the calculation
of similarity scores makes it difficult for supervised machine learning algorithms to directly
weight/rank features. Kilicoglu et al. [117,118] also adopted the McClosky-Charniak-Johnson
Parser/Stanford Dependency pipeline. They converted the dependency graphs to embedding
graphs, where nodes themselves can be small dependency graphs, in order to apply postprocessing rules to traverse embedding graphs and extract nested events. However, their embedding graphs also lead to argument error propagation and thus hurt precision.
Besides the frequently used McClosky-Charniak-Johnson Parser/Stanford Dependency pipeline,
there are a number of systems experimenting with different parsers and/or dependency representations. Hakenberg et al. [119] applied BioLG [164], a Link Grammar Parser [165] extension, to
obtain parse trees from sentences. They stored parse trees in a database and designed a query
language to match subgraph patterns, which are manually generated from training data, against
parse trees. Hakenberg et al. pointed out that generalization of event types would likely improve
their results. Van Landeghem et al. [125] analyzed dependency graphs from the Stanford Parser
[166], identified minimal event-containing subgraph patterns from training data and constructed
extraction rules based on these patterns. Their post-processing rules handled overlapping triggers
of different event types and events based on the same trigger, aiming for high precision at the
expense of recall.
The remaining systems generally used the dependency paths connecting the concept pairs as features for event extraction. For example, the dependencies were obtained through applications of
different parsers including the Pro3Gres parser [167] (used by Kaljurand et al. [126]), the RASP
parser [168] (used by Vlachos et al. [128]) or both McClosky-Charniak-Johnson parser and Enju
parser (used by Quirk et al. [130], who combined the parsing results). However, most of these
methods attained inferior performance compared to the best systems in the same shared tasks.
We believe that there are at least two reasons. First, the McClosky-Charniak-Johnson parser with the
self-trained biomedical parsing model is probably the most accurate parser in this domain. Second, the
enriched graph-based features and event type generalization used by the top performing systems likely produced more useful features for event extraction.
2.3.1.2 Protein-protein interaction extraction and BioCreative shared tasks
BioCreative shared tasks focused on automatic named entity recognition on genes and proteins in
biomedical text and on extraction of the interactions between these entities [29,30,169]. Among
the participants of the protein-protein interaction task of BioCreative II [29], most systems used
co-occurrence statistics, pattern templates and shallow linguistic features (e.g., context words
and part-of-speech tags), with either statistical machine learning or rule-based systems. Some
systems observed the need for capturing cross sentence mentions of interacting proteins. For example, Huang et al. [170] developed a profile based method that creates a vector representation
for candidate protein pairs by aggregating features from multiple sentences in the document. The
profile features included n-grams, manually constructed templates and relative positions of protein mentions. In BioCreative II.5, based on the top teams in the protein-protein interaction task,
the organizers pointed out that the BioNLP techniques using deep parsing and dependency
tree/graph mining were necessary to achieve significant results [30]. In particular, Hakenberg et
al. [120] used a system similar to their BioNLP 2009 entry system [119]. They manually generated subgraph patterns from training data and matched them against parse trees. They achieved
an f-measure of 0.30. Sætre et al. [143] applied the Enju parser and the GDep parser and considered the dependency paths between concept pairs as features for relation extraction. They
achieved an f-measure of 0.374. The protein-protein interaction tasks of BioCreative III consisted of detecting PPI related articles that provide evidence to specified PPIs, but did not include
the actual extraction of PPIs, which is the focus of this review [169]. Several follow up studies to
BioCreative II.5 concerned the usage of kernels in PPI extraction [150,171], and they categorized
kernels into the following categories: 1) kernels not using deep parsing information, including
shallow linguistic (SL) kernel [172]; 2) constituent parse tree based kernels, including subtree
(ST) [173], subset tree (SST) [174] and partial tree (PT) [175] kernels that use increasingly generalized forms of subtrees, as well as a spectrum tree (SpT) [176] kernel that uses path structures
from constituent parse trees; 3) dependency parse tree based kernels, including edit distance and
cosine similarity kernels that are based on shortest paths [177], k-band shortest path spectrum
(kBSPS) [150] that additionally allows k-band extension of shortest paths, all-path graph (APG)
kernel [58] that further differently weights shortest paths and extension paths in similarity calculation, as well as Kim's kernels [178] that use various combinations of lexical, part-of-speech,
and syntactic information along with the shortest path structures. The comparative studies and
error analyses showed that: 1) dependency tree based kernels generally outperform constituent
tree based kernels; 2) kernel method performances heavily depend on corpus-specific parameter
optimization; 3) APG, kBSPS, and SL are top performing kernels; 4) ensembles based on dissimilar kernels can significantly improve performance; 5) non-kernel based methods (e.g., rule-based method, BayesNet) can perform on par with or better than all non-top kernel methods.
From these observations, it is evident that dependency graph/tree structures richer than shortest paths (e.g., those used in APG and
kBSPS) are important to better performance of graph/tree based kernels,
which is consistent with the analysis of BioNLP participating systems. Also, the limited advantage of kernel methods over non-kernel methods and the interpretation difficulty associated with kernel methods suggest that a more fruitful direction may be investigating
novel feature sets rather than novel kernel functions.
2.3.1.3 Drug-drug interaction extraction and DDIExtraction shared tasks
The two DDIExtraction challenges (organized in 2011 and 2013) aimed at automated extraction
of drug-drug interactions (DDI) from biomedical texts [33,34]. The organizers of the two challenges recognized the extended delays in updating manually curated DDI databases. They observed that the medical literature and technical reports are the most effective sources for the detection of DDIs but contain an overwhelming amount of data. Thus DDIExtraction was motivated
by the pressing need for accurate automated text mining approaches. The 2011 challenge focused on classifying whether there is any interaction between candidate drug pairs. The 2013
challenge, in addition, pursued the detailed classification task of categorizing DDIs into one of
the four possible subtypes: advice (advice regarding the concomitant use of two drugs), effect
(effect of DDI), mechanism (pharmacodynamic or pharmacokinetic mechanism of DDI) and int
(general mention of interaction without further detail). For these two challenges, we review the
top-performing teams. In the 2011 challenge, Thomas et al. [146] applied the McClosky-Charniak-Johnson parser and converted the parses to Stanford dependencies. They used voting to
combine the following kernels to implicitly capture features for relation extraction: all-path
graph (APG) [58], k-band shortest path spectrum (kBSPS) [150], and shallow linguistic (SL)
[172] kernels. Their system achieved the best f-measure of 0.657. Chowdhury et al. [147,149]
applied the Stanford parser to obtain dependency trees and experimented with both feature based
methods and kernel based ensemble methods for relation extraction. They experimented with SL
[172], mildly extended dependency tree (MEDT) [151] (expanding shortest paths to also cover
important verbs, modifiers or subjects) and path-encoded tree (PET) [179] (based on constituency tree) kernels. By combining feature-based and kernel-based methods, Chowdhury et al.
achieved the second best result with an f-measure of 0.6398. In the 2013 challenge, Chowdhury
et al. used their previous kernel method [147,149] but switched to the McClosky-Charniak-Johnson parser and converted the parses to Stanford dependencies [148]. They attained an
f-measure of 0.80 for general classification and 0.65 for detailed classification and placed first in
the 2013 challenge. Thomas et al. [180] followed a two-step approach to first detect general
DDIs and then classify detected DDIs into subtypes. For the general DDI task, they used voting
to combine kernels including APG [58], subtree (ST) [173], subset tree (SST) [174], spectrum
tree (SpT) [176] and SL [172] kernels. For the subtype classification step, they used TEES directly [113]. Their system performed second best with an f-measure of 0.76 for general classification and 0.609 in detailed classification. It is interesting to see that adoption of systems originally developed for PPI extraction or event extraction has led to top performances in the DDI
task. This further corroborates that these tasks are closely related, and technical solutions for one
are generalizable to others.
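The voting step used to combine kernel classifiers can be illustrated with a small sketch. The text does not specify the voting scheme, so the fragment below assumes simple majority voting with ties resolved in favor of the earliest-listed classifier; the classifier names and predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) by majority vote.

    Ties are broken in favor of the classifier listed first (an assumption).
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        combined.append(next(label for label in labels if counts[label] == top))
    return combined


# Hypothetical binary DDI predictions (1 = interaction) from three kernel
# classifiers over four candidate drug pairs.
apg_predictions = [1, 0, 1, 0]
kbsps_predictions = [1, 1, 0, 0]
sl_predictions = [0, 1, 1, 0]

ensemble = majority_vote([apg_predictions, kbsps_predictions, sl_predictions])
```

Majority voting helps most when the combined classifiers make dissimilar errors, which matches the earlier observation that ensembles of dissimilar kernels improve performance.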
2.3.1.4 Pharmacogenomics
In the field of pharmacogenomics, continuous efforts from multiple research teams have centered
on the utilization of literature and clinical text in order to mine interesting relations between genetic mutations and drug response phenotypes. Although it is difficult to compare their performances because the experiments do not use shared corpora, these approaches do illuminate the translational application and adaptation of some state-of-the-art biomedical relation
extraction techniques to problems directly asked by clinicians and pharmacologists.
Some systems used path-based approaches. Coulet et al. [78] aimed at extracting binary relations
between genes, drugs and phenotypes in order to build semantic networks for pharmacogenomics.
They first converted the Stanford Parser output on sentences (from collected PubMed abstracts)
into dependency graphs. They tracked the paths starting from named entities and ending at a verb,
and merged paths ending with the same verb to form binary relations. Coulet et al. further explored frequency information to retain recurrent relations. They also performed normalization on
both the collected entities and relation types (verbs). Percha et al. [80] extended this approach to
use breadth-first search to yield the shortest path between two named entities in the dependency
graph in order to generate features for relation extraction. Wang et al. [137] used Latent Dirichlet
Allocation (LDA) to create a semantic representation of biomedical named entities and used
Kullback-Leibler (KL) divergence to calculate the association distance between pairs of entities
in the Chem2Bio2RDF [138] semantic network. They ranked candidate associations between
named entity pairs based on the summation of distances along the path connecting the pairs in
the semantic network.
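The ranking idea attributed to Wang et al. can be sketched as follows, assuming each entity already has a topic distribution (for example from LDA). The distributions, the entity names, the smoothing constant and the use of the asymmetric KL divergence are all assumptions made for illustration; the original system may differ in each of these choices.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def path_distance(path, topics):
    """Sum the divergences between consecutive entities along a path."""
    return sum(kl_divergence(topics[a], topics[b]) for a, b in zip(path, path[1:]))


# Hypothetical topic distributions over three topics.
TOPICS = {
    "drug_X": [0.7, 0.2, 0.1],
    "gene_Y": [0.6, 0.3, 0.1],
    "disease_Z": [0.1, 0.2, 0.7],
}

# Candidate associations expressed as paths in a semantic network.
CANDIDATES = [["drug_X", "gene_Y", "disease_Z"], ["drug_X", "gene_Y"]]

# Smaller summed distance means a closer association.
ranked = sorted(CANDIDATES, key=lambda path: path_distance(path, TOPICS))
```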
Other systems used tree-based approaches. Katrenko et al. [142] studied gene-disease relation
extraction and included as features the subtrees rooted at the lowest common ancestors of two
named entities in the dependency parse trees. Their experiment used several parsers including the
Link Grammar Parser [165], Minipar [181] and the Charniak Parser [182]. Compared with using
individual parsers' results separately, they reported improved performance from adopting ensemble methods (stacking and AdaBoost) and combining multiple parsers' results [183]. Hakenberg et al. [133] relied on co-occurrence for extraction of certain relations (e.g., gene-drug, gene-disease and drug-disease), but augmented co-occurrence with subtrees from the Stanford Parser
output for other types of relations. In particular, their subtrees are rooted at the lowest common
ancestors of named entity pairs in the binary relations considered. Bui et al. [139] aimed to extract causal relations on HIV drug resistance from the literature. They used the Stanford Parser to
generate constituent parse trees for sentences and developed grammatical rules that traverse the
tree structures in order to extract drug-gene relations.
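The subtree features used by several of these systems are rooted at the lowest common ancestor of two named entities. A minimal sketch of that computation is given below, assuming the parse tree is available as a child-to-parent map; the example tree and all function names are hypothetical and do not reproduce any of the cited systems.

```python
def ancestors(parent, node):
    """Return the chain from a node up to the root, starting with the node itself."""
    chain = [node]
    while parent.get(node) is not None:
        node = parent[node]
        chain.append(node)
    return chain


def lowest_common_ancestor(parent, a, b):
    """Return the deepest node that dominates both a and b, or None."""
    seen = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in seen:
            return node
    return None


def subtree_nodes(parent, root):
    """Return the sorted list of nodes in the subtree rooted at root."""
    children = {}
    for child, head in parent.items():
        if head is not None:
            children.setdefault(head, []).append(child)
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(children.get(node, []))
    return sorted(nodes)


# Hypothetical dependency tree (child -> head) for
# "mutations in BRCA1 increase the risk of cancer".
PARENT = {
    "increase": None,
    "mutations": "increase",
    "in": "mutations",
    "BRCA1": "in",
    "risk": "increase",
    "the": "risk",
    "of": "risk",
    "cancer": "of",
}

root = lowest_common_ancestor(PARENT, "BRCA1", "cancer")
feature_nodes = subtree_nodes(PARENT, lowest_common_ancestor(PARENT, "the", "cancer"))
```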
Both path-based and tree-based systems in pharmacogenomics tend to focus on precision over
recall in their evaluation, differing from the balanced f-measure used in multiple shared tasks.
This likely stems from their specific goals of harvesting reliable relations to build and grow
pharmacogenomics semantic networks. Too much noise will likely cloud the initial semantic
network, while missing relations still have a chance to be later discovered with growing literature.
In fact, reported precisions for pharmacogenomics relation extraction systems typically range
from 70% to over 80%. In addition, these systems often check extracted relations against curated
databases such as PharmGKB. We believe that these systems can further benefit from adopting
parsers trained with biomedical models and using enriched graph-based features, two of the most
recent lessons learned in shared tasks.
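The difference between a precision-oriented evaluation and the balanced f-measure can be made concrete with the general F-beta score, in which beta below 1 weights precision more heavily. This is offered only as an illustration; the reviewed pharmacogenomics systems are not stated to report F-beta, and the precision and recall values below are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 is the balanced f-measure."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_squared = beta * beta
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)


# Hypothetical high-precision, low-recall system.
balanced = f_beta(0.8, 0.4)                   # balanced f-measure
precision_weighted = f_beta(0.8, 0.4, 0.5)    # beta < 1 emphasizes precision
```

For this hypothetical system the precision-weighted score is higher than the balanced one, which reflects why a system tuned for reliable relations can look weaker under a balanced f-measure.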
2.3.2 Relation extraction from clinical narrative text
In the medical informatics community, relation extraction has also been extensively studied in
the form of shared tasks and separately motivated research. For example, significant advances in
extracting semantic relations from narrative text in Electronic Medical Records (EMR) have
been documented in the 2010 i2b2/VA challenge (i2b2 - Informatics for Integrating Biology and
the Bedside, VA - Veterans Affairs) [2].
2.3.2.1 i2b2/VA challenge
The challenge focused on three aspects of semantic relation extraction (i.e., concept extraction,
assertion classification, and relation classification) and attracted international teams to address
these shared tasks [2]. Concept extraction can be considered the basic task, as assertions and relations all refer to the extracted concepts. As the challenge allows subsequent tasks (e.g., relation
classification) to use the ground truth of preceding tasks (e.g., extracted concepts), the performance metrics for the relation classification task should be interpreted as an upper bound for the
end-to-end relation extraction task (same as the challenges from BioNLP, BioCreative and
DDIExtraction). In this section, we review only the systems in the relation classification task,
where the target relations are limited to predefined relations among medical problems, tests, and
treatments. There are eight relations including treatment improves / worsens / causes / is administered for / is not administered because of medical problem, test reveals / conducted to investigate medical problem, and medical problem indicates medical problem. As we did in reviewing
the above challenges, we review only those systems that represented sentences as graphs and explored such graphs during the feature generation step.
Roberts et al. [101] classified the semantic relations using a rather comprehensive set of features:
context features (e.g., n-grams, GENIA part-of-speech tags surrounding medical concepts), nested relation features (relations in the text span between candidate pairs of concepts), single concept features (e.g., words and concept type of medical concept), Wikipedia features (e.g., concepts matching Wikipedia titles), concept vicinity features (concept bi-grams around relation argument concepts) and similarity features. The latter were computed using edit distance on language constructs including GENIA phrase chunks and Stanford Dependency shortest paths.
Their system reached the highest f-measure on relation classification (0.737).
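A similarity feature based on edit distance over dependency paths can be sketched as below. The code computes the standard Levenshtein distance over sequences of dependency labels; the normalization into a similarity in [0, 1] and the example label sequences are assumptions, not details taken from Roberts et al.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of labels)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def path_similarity(a, b):
    """Map edit distance to a similarity score in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical dependency-label sequences along two shortest paths.
path_one = ["nsubjpass", "prep_for"]
path_two = ["nsubj", "prep_for"]
similarity = path_similarity(path_one, path_two)
```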
de Bruijn et al. [102] applied a maximum entropy classifier with down-sampling to balance the relation distribution. In addition to features from the concept extraction task, they applied the McClosky-Charniak-Johnson parser, converted the parsing results into Stanford dependencies, and included as features the labels in the minimal trees that cover the concept pairs.
They used word clusters as features to address the problem of unseen words. Their system
reached an f-measure of 0.731, the second best among relation classification participants.
Solt et al. [104] extracted concepts by identifying head terms from dictionary look up and extending concept spans by rules. For relation classification, they experimented with several
parsers including the Stanford Parser, the McClosky-Charniak-Johnson Parser and the Enju Parser. They used the resulting dependency graphs with two graph kernels including the all-path
graph (APG) kernel [58] and k-band shortest path spectrum (kBSPS) [150], which produced only
moderate performance. This likely reflects the difficulty in tuning the graph/tree kernel based
systems, consistent with the observations from the experience in relation/event extraction from
the scientific literature.
2.3.2.2 Separately motivated clinical relation extraction
After the i2b2 challenges, several authors aimed at combining the concept extraction and relation
extraction steps into an integral pipeline and/or generalizing to the extraction of complex or even
nested relations. Xu et al. [103] developed a rule-based system MedEx to extract medications
and specific relations between medications and their associated strengths, routes and frequencies.
The MedEx system converts narrative sentences in clinical notes into conceptual graph representations of medication relations. To do so, Xu et al. designed a semantic grammar directly mappable to conceptual graphs and applied a Chart Parser by Kay [184] to parse sentences according to
this grammar. They also used a regular expression based chunker to capture medications missed
by the Kay Chart Parser. Weng et al. [75] applied a customized syntactic parser on text specifying clinical eligibility criteria. They mined maximal frequent subtree patterns and manually aggregated and enriched them with the Unified Medical Language System (UMLS) to form a semantic representation for eligibility criteria, which aims to enable semantically meaningful
search queries over ClinicalTrials.gov. Luo et al. [99] extracted syntactic path features from the
Link Grammar Parser generated dependencies from PubMed abstracts. The syntactic paths are
included as features in clustering relations between noun phrase pairs.
2.3.3 Shared resources for relation extraction
The shared tasks and separately motivated research on biomedical relation extraction have not
only advanced the state-of-the-art in methodology, but also created and/or demonstrated the utilization of a repository of shared resources that range from knowledge bases to shared corpora to
graph mining toolkits. We categorize and summarize those resources in Table 2-3.
Utility Category        Data Sources
Terminology & Ontology  GO [185], UMLS [186], MeSH [187], HUGO [188], Wordnet [159], Verbnet [189], Biothesaurus [110]
Graph Miner             Gaston [100], Mofa [190], GSpan [191], FFSM [192], Graph Spider [193]
Tree/Graph Kernel       subtree (ST) kernel [173], subset tree (SST) kernel [174], partial tree (PT) kernel [175], spectrum tree (SpT) kernel [176], mildly extended dependency tree (MEDT) kernel [151], all-path graph (APG) kernel [58], k-band shortest path spectrum (kBSPS) kernel [150], path-encoded tree (PET) kernel [179]
Dependency Parsers      Enju Parser [156], GDep Parser [157], Stanford Parser [194], McCCJ Parser [153,154], RASP Parser [168], Bikel Parser [195], BioLG Parser [164], Pro3Gres Parser [167], Kay Parser [184], C&C [196]
Shared Corpora          BioNLP-09 event corpus [31], BioNLP-11 event corpus [28], BioNLP-13 event corpus [32], BioCreative II relation corpus [197], BioCreative II.5 relation corpus [30], DDIExtraction relation corpora [33,34], i2b2/VA corpus [2], AIMed [198], BioInfer [199], HPRD50 [200], IEPA [201], LLL [202], Uniprot corpus [203]
Table 2-3 Shared resources for relation extraction. The resources are organized by their utility
category. Abbreviations used include: Gene Ontology (GO), Unified Medical Language System
(UMLS), Medical Subject Heading (MeSH), Human Protein Reference Database (HPRD).
2.4 Limitations of Existing Work
Although notable progress has been made in applying graph-based algorithms to improve the
extraction of biomedical relations, barriers still exist to enabling practical relation extraction
methods that are both generalizable and sufficiently accurate. Below we discuss a few such barriers and promising directions.
2.4.1 Not all parsers and dependency encodings are synergistic
It has been pointed out repeatedly that the choice of the parser and dependency encodings may
play an important role in a relation extraction system's performance. Buyko et al. [204] performed a comparative analysis on the impact of graph encoding based on different parsers (Charniak-Johnson [153], McClosky-Charniak-Johnson, Bikel [195], GDep, MST [160], MALT [205])
and dependency representations (Stanford Dependency and CoNLL dependency). They found that
the CoNLL dependency representation performs better in combination with four parsers than the
Stanford Dependency representation, and that the McClosky-Charniak-Johnson parser frequently ranks
as the best performing parser. Miwa et al. [206] compared five syntactic parsers for BioNLP-ST
2009. They concluded that although performances from using individual parsers (GDep, C&C
[196], McClosky-Charniak-Johnson, Bikel, Enju) do not differ much, using an ensemble of
parsers and different dependency representations (Stanford Dependency, CoNLL, Predicate Argument Structure) can improve the event extraction results. As Stanford Dependency is the most
widely used dependency encoding, they also compared the performance of using different Stanford Dependency variants and found that the basic dependency variant performs best when the types of dependency edges are kept. On the other hand, when the types of dependency edges are ignored, they found that the
collapsed dependency variant performs best, which corroborates the finding by Luo et al. [87]. In
[87], the task is to extract relations as features without classification, as opposed to supervised
relation classification in the BioNLP-ST event extraction tasks. Thus recall is favored in the feature learning step, where ignoring types of dependencies helps to improve the coverage of subgraph patterns.
2.4.2 Integrating co-reference resolution
Co-reference occurs frequently in biomedical literature and clinical narrative text, arising from
the use of pronouns, anaphora and varied terms for the same concepts. Care must be exercised to
transfer the correct relation along the co-reference chain. However, many of the reviewed approaches for named entity and event recognition did not have a built-in co-reference resolution
component. Miwa et al. [207] specifically studied the impact of using a co-reference resolution
system and showed improved event extraction performance. In particular, they developed a rule-based co-reference resolution system that consists of rules for detecting mentions, antecedents and
co-referential links, respectively. They used the co-reference information to modify syntactic
parse results so that antecedent and mention share dependencies. Features were also extended
between mentions and antecedents. However, those systems that integrate co-reference resolution limited the scope to co-references within the same sentence. Recognizing the importance of
co-reference features, the organizers of BioNLP ST 2011 and 2013 integrated the co-reference
annotations into the event annotations. Use of such annotations should be encouraged to develop
and improve the co-reference component in event extraction systems and to gauge their performance. In the future, it is also worth investigating the impact of co-reference resolution across
sentences.
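One way to read "antecedent and mention share dependencies" is that every dependency edge attached to a mention is copied onto its antecedent. The sketch below implements only that reading, as an assumption; the edge representation, the example sentence and the function name are hypothetical and do not reproduce the system of Miwa et al.

```python
def share_dependencies(edges, coref_links):
    """Copy each mention's dependency edges onto its antecedent.

    edges: set of (head, dependent, label) triples.
    coref_links: iterable of (mention, antecedent) pairs.
    Self-loops on the antecedent are skipped.
    """
    augmented = set(edges)
    for mention, antecedent in coref_links:
        for head, dependent, label in edges:
            if head == mention and dependent != antecedent:
                augmented.add((antecedent, dependent, label))
            if dependent == mention and head != antecedent:
                augmented.add((head, antecedent, label))
    return augmented


# Hypothetical edges for "... IL-2 ... It activates STAT5", where "It" refers to "IL-2".
EDGES = {("activates", "It", "nsubj"), ("activates", "STAT5", "dobj")}
LINKS = [("It", "IL-2")]

augmented_edges = share_dependencies(EDGES, LINKS)
```

After this step the antecedent "IL-2" is directly connected to the trigger "activates", so path-based features between the trigger and the named entity no longer depend on the pronoun.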
2.4.3 General relation and event extraction and domain adaptation
The state-of-the-art relation and event extraction systems are all built around tasks with domain-specific definitions of relations and events, many of which are in fact binary (e.g., BioCreative
PPI challenge [30], DDIExtraction challenge [33,34], and i2b2/VA challenge [2]). However,
there is a gap between the technical advances and the demands from many real-world tasks, including building pharmacogenomics semantic networks [78], extracting clinical trial eligibility
criteria [75] and representing immunophenotypic test results for automating lymphoma subtype
classification [87,88]. In those tasks, general relation and event discovery is necessary, where the
number of nodes is flexible and even the relation/event structure is not entirely predetermined.
Another challenge brought by domain-specific relation/event definition concerns the training data. The problem of limited training data often plagues the development of NLP systems, with
those on relation extraction being no exception. To take better advantage of existing annotated
corpora, it is necessary to perform domain adaptation from external training corpora (source) to
the target corpus. Miwa et al. [207] proposed to add source instances followed by instance reweighting when source and target match on events to be extracted. When source and target corpora have a partial match on events, they proposed to train each event extraction module separately on the source corpus and used its output as additional features for the corresponding modules on the target corpus. Miwa et al. [208] further improved methods of combining corpora by
integrating heuristics to filter spurious negative examples. The heuristics target situations where
instances not annotated in one corpus due to a different focus may be treated as negative instances in another corpus. Applying this method on learning from seven event annotated corpora, they
showed improved performance on two tasks in BioNLP-ST 2011.
2.4.4 Redundancy in subgraph patterns
For automated subgraph pattern collection, such as using frequency as a cue, there is the problem
of redundancy among collected subgraph patterns. Many smaller subgraphs are subisomorphic to
other larger frequent subgraphs. Many of these larger subgraphs have the same frequencies as
their subisomorphic smaller subgraphs. This arises because, when a larger subgraph is frequent, all its
subgraphs automatically become frequent as well. Furthermore, if the smaller subgraph g_s is so
unique that it is not subisomorphic to any larger subgraph other than g_l, then the pair g_s, g_l shares
identical frequency. Therefore, one only needs to keep the larger subgraphs in such pairs. Note
that it is cost prohibitive to perform a full pairwise check because the subisomorphism comparison between two subgraphs is already NP-complete [100], and a pairwise approach would require
around a billion comparisons for a collection of several tens of thousands of subgraphs. An efficient
algorithm is needed that reduces the number of subgraph pairs to compare by several orders of
magnitude. The key idea is that one only needs to compare subgraphs whose sizes differ by one,
and one can further partition the subgraphs so that only those within the same partition need to
be compared. On the other hand, depending on the task, algorithms may be developed to collect
subgraph patterns that explore the "novelty" of the subgraphs, such as using p-significance to
assess how unexpected it is to see the subgraphs in the current corpus [209].
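The pruning idea above can be illustrated with a minimal Python sketch. It is not the algorithm from the cited work. It rests on two simplifying assumptions: each pattern is a set of labeled edges with unique node labels, so that the subisomorphism test reduces to a subset test, and patterns are partitioned by frequency, which is only one plausible choice of partition key. The function name `drop_redundant` and the toy patterns are hypothetical.

```python
from collections import defaultdict

def drop_redundant(patterns):
    """Remove smaller patterns subsumed by an equally frequent larger pattern.

    patterns maps a frozenset of labeled edges to its corpus frequency.
    Assumption: node labels are unique, so "subisomorphic" is a subset test.
    """
    # Partition by (frequency, size): only patterns with the same frequency
    # whose sizes differ by one are ever compared.
    buckets = defaultdict(list)
    for edges, freq in patterns.items():
        buckets[(freq, len(edges))].append(edges)

    redundant = set()
    for (freq, size), smaller in list(buckets.items()):
        larger = buckets.get((freq, size + 1), [])
        for small in smaller:
            if any(small < big for big in larger):  # proper-subset test
                redundant.add(small)
    return {p: f for p, f in patterns.items() if p not in redundant}

# Toy patterns (hypothetical): a is contained in ab and has the same frequency.
a = frozenset({("CD20", "positive")})
ab = frozenset({("CD20", "positive"), ("B-cells", "CD20")})
c = frozenset({("CD3", "positive")})
kept = drop_redundant({a: 12, ab: 12, c: 30})
```

Comparing only sizes that differ by one is sufficient here: if a small pattern and a much larger pattern share a frequency, every intermediate pattern between them has that frequency as well, so the redundancy is detected one step at a time.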
2.4.5 Integrating with NER
Most shared task participants were not evaluated based on their relation extraction from scratch.
Rather, their systems were evaluated given the gold standard of named entity annotations, which
is even true for challenges that include an NER task, such as the i2b2/VA shared tasks. Thus their
evaluation results are likely an upper bound of the end-to-end system performance, the tuning of
which is in fact a non-trivial task. Kabiljo et al. [210] evaluated several methods for relation extraction including a keyword-based method, a co-occurrence-based method, and a method using
dependency graph-based patterns. They noted that in general a significant performance drop will
occur when using named entities tagged by an NER system such as BANNER [136] instead of the
gold standard. In addition, it is useful but challenging to filter out named entity tuples (including
pairs) that do not have relations explicitly stated in the text [99]. Such filtering may adopt a hybrid approach that relies on both automatically checking semantic type compatibility and manually sifting through the remaining tuples. However, as the number of non-related tuples often
dominates that of related tuples, better automated filtering is necessary and is an open question.
Chapter 3.
General Relation Extraction by Frequent Subgraph
Mining Applied to Automatic Lymphoma Classification
In this chapter³ and the next, we address some of the limitations of state-of-the-art relation/event
extraction approaches including: general relation extraction, redundancy elimination, NER integration, concept unification, and parser augmentation. We use subgraph mining (focus of Chapter 3) and factorization algorithms (focus of Chapter 4) to develop a general framework for extracting relations from clinical narrative text and to explore their correlations. To test our proposed framework with a concrete real-world medical problem, we investigate automated lymphoma subtype categorization based on pathology report narrative text.
The differential diagnosis of lymphoid malignancies has long been a difficult task and a source
of debate for pathologists and clinicians [211-214]. To standardize knowledge into a widely accepted guideline, the World Health Organization (WHO) published a consensus lymphoma classification in 2001 [215], which was revised in 2008 [216]. Even with the full spectrum of clinical
and genetic features used in this guideline, uncertainty persists in pathologists' daily practice
[217,218]. Since its original publication, several case series and reviews of lymphoma have suggested refinements to the current classification scheme and additional lymphoma subtypes [219-223]. Facing this ongoing need for periodic revision, the current approach to revising the WHO
classification presents several challenges. First, the review process took more than one year, involving an eight-member steering committee and over 130 pathologists and hematologists
worldwide [216]; hence it is a time-consuming and labor-intensive task. Moreover, the cases covered for consideration of revisions are subject to selection bias from different studies. These
challenges motivated us to build an interpretable lymphoma classification model to automate the
case review process in a systematic way.
Many medical natural language processing (NLP) systems aim to extract medical problems from
text to identify patient cohorts for clinical studies (e.g., [25,26,224-227]). They rely heavily on
mentions and synonyms of the targeted problems. In contrast, we exclude all mentions and synonyms of lymphomas. The aim is to prevent oracles from telling the system the true lymphoma
type and to mimic the differential diagnosis with the pathology reports as proxies for related labs
and tests. The automatically built diagnostic models are intended to assist with expert review;
thus it is necessary not only to achieve high accuracy, but also to retain interpretable features.
3 This chapter was published as a research article in the Journal of the American Medical Informatics Association [1].
3.1 Background
As described in Chapter 2, part of the advances in the state-of-the-art specialized clinical NLP
systems for identifying medical problems have been documented in challenge workshops such as
the yearly i2b2 (Informatics for Integrating Biology to the Bedside) Workshops. The first such
challenge focused in part on identifying the smoking status of patients [225]. Features used by
the successful teams included mentioned medical entities, n-grams (up to trigrams), part of
speech (POS) tags, and task-specific regular expressions, dictionaries and assertion classification
rules. Feature engineering details contributed significantly to the best performing systems [228-230]. In a later challenge, recognizing obesity and its 15 comorbidities [227], the top four systems employed heavier feature engineering on hand-crafted rules that integrated "disease-specific, non-preventive medications and their brand names" [231], disease-related procedures
[232], and disease-specific symptoms [233,234]. However, task-specific rules and regular expressions to capture medical concepts and relations are usually subdomain specific and hard to
generalize. In contrast, standard linguistic features such as n-grams are easy to generalize but difficult to interpret: the selected n-grams may not be meaningful.
General clinical NLP systems such as cTAKES [25] and MetaMap [26] can extract negated [27]
medical concepts. Besides negations, they specify few additional relations. Other systems apply
hand-crafted rules to extract pre-specified semantic relations, such as MedLEE [97], MedEx [103]
and SemRep [98], or require supervised learning on pre-specified semantic relations [235], and
thus are hard to adapt to new subdomains. The value of syntactic parsing in concept and relation
extraction has also been explored, such as phrase chunking in cTAKES [25], shallow parsing
with the Stanford Parser [166], short syntactic link chain extraction [236], and Treebank building
such as in the MiPACQ corpus [237]. Our work features unsupervised extraction of relations
among a flexible number of medical concepts, which produces features that both improve performance over baselines and are more interpretable.
3.2 Task Definition
Pathology reports typically record four general categories of patient information: clinical presentation, morphology, immunophenotype and cytogenetics. Our corpus is rich in narrative sentences that specify complex relations among medical concepts. We accordingly design a sentence
subgraph mining framework that is suitable for capturing such relations. Using the features generated from this framework, we performed the following tasks:
1. We tested the hypothesis that an automated lymphoma classifier with sentence subgraph features can outperform the baseline classifier with standard n-gram features.
2. We tested the hypothesis that sentence subgraph features can outperform the baselines with full or filtered medical concept features extracted by the latest MetaMap.
3. We showed that sentence subgraph features are friendly to interpretation and provide insights to the diagnosis of lymphoma.
To prevent classifiers from using the explicit mentions and synonyms of the lymphoma types, we
exclude phrases overlapping with a Medical Subject Heading (MeSH) [187] of "lymphoma" or
"leukemia". We also exclude phrases that match a set of manually constructed patterns aiming to
catch abbreviations and synonyms of the target lymphomas that may be missed by MeSH, as
shown in Table 3-1.
Regular Expressions
"(?i)(burkit|burket)"
"(?i)\bBL\b"
"(?i)\bDLBCL\b"
"(?is)(follicular|follicle).*(type|origin)" // e.g., "low grade lymphoma, follicle center cell type"
"(?i)\bFL\b"
"(?i)\b(nlphl|nlphd|hl|hd)\b"
"(?i)\bNHL\b"
"(?i)hodgkin"
"(?i)lymphoma"
"(?i)leukemia"
"(?is)diffuse.*large.*b.*cell"
"(?i)T/HRBCL"
"(?is)(nodular\s+sclerosis|mixed\s+cellularity|lymphocyte-rich.*type|lymphocyte\s+predominant)" // e.g., "Hodgkin lymphoma, mixed cellularity type"
Table 3-1 Regular Expressions to Catch Lymphoma Mentions.
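As an illustration of how such patterns can be applied, the following Python sketch filters out phrases that match any of a subset of the expressions in Table 3-1. It is a minimal sketch, not the code used in this work: the function names `mentions_lymphoma` and `filter_phrases` are hypothetical, the example phrases are made up, and the MeSH-based exclusion step is not modeled.

```python
import re

# A subset of the patterns in Table 3-1; the inline flags (?i) and (?is)
# are also accepted by Python's re module when placed at the pattern start.
MENTION_PATTERNS = [re.compile(p) for p in (
    r"(?i)\bDLBCL\b",
    r"(?i)\bFL\b",
    r"(?i)hodgkin",
    r"(?i)lymphoma",
    r"(?i)leukemia",
    r"(?is)diffuse.*large.*b.*cell",
)]

def mentions_lymphoma(phrase):
    """Return True if the phrase matches any lymphoma-mention pattern."""
    return any(p.search(phrase) for p in MENTION_PATTERNS)

def filter_phrases(phrases):
    """Keep only phrases that do not reveal the lymphoma type."""
    return [p for p in phrases if not mentions_lymphoma(p)]

phrases = [
    "diffuse large B-cell lymphoma",
    "CD20 positive B cells",
    "consistent with FL",
    "follicles",
]
kept_phrases = filter_phrases(phrases)
```

The word-boundary anchors in the abbreviation patterns matter: without `\b`, a pattern such as `FL` would also fire inside ordinary words.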
3.3 Data Collection
Our corpus consists of Massachusetts General Hospital (MGH) pathology reports residing in the
Research Patient Data Registry (RPDR) [238] database. An MGH pathology report consists of
standard and semi-standard sections as shown in Figure 3-1.
For this project, we focused on the following four lymphomas: diffuse large B-cell lymphoma
(DLBCL; the most common lymphoma), Burkitt lymphoma (the most aggressive lymphoma),
follicular lymphoma (the second most common lymphoma) and Hodgkin lymphoma (the most
common lymphoma in young patients). We obtained our patient cases by having two MGH medical oncologists and one hematopathologist review pathology reports of patients diagnosed between 2000 and 2010, and collected 1038 cases whose written diagnosis (in the final diagnoses
section) had one or more of the four lymphomas.
3.4 Methods
We first preprocess our corpus using sentence breaking, tokenization, and part-of-speech tagging,
with customizations to medical corpora. We then perform a two-phase sentence parsing step,
grouping token subsequences that match to concept unique identifiers (CUIs) in the UMLS Metathesaurus [26] and merging them as a single token before applying the Stanford Parser. The next
section on corpus pre-processing gives more details.
3.4.1 Corpus pre-processing
We use two NLP packages to pre-process our corpus, OpenNLP [240] and the Stanford Parser
[194]. We use the sentence breaker from OpenNLP, which applies a maximum entropy model,
and apply rule-based post-processing customized to our corpus. After sentence breaking, we use
a home-built rule-based tokenizer that recognizes domain specific tokens such as "CD4+" or
"TdT+" as one token. Following the approach of Huang et al. [166], we use the UMLS Specialist
Lexicon (which contains lexical descriptions of over 1.1 million words) to build an extended lexicon by mapping UMLS style part-of-speech tags and linguistic features such as plural and present singular to Penn Treebank tags [241]. Unlike Huang et al. [166], we add the extended lexicon into OpenNLP's Part-of-Speech (POS) tagger dictionary. This is straightforward because the
OpenNLP tagger enumerates possible tags only from its dictionary and then evaluates their likelihood.
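A rule-based tokenizer of the kind described above can be sketched with a single regular expression. This is a hypothetical illustration, not the home-built tokenizer itself: the pattern `TOKEN_RE` and its three rules are assumptions chosen only to show how marker tokens such as "CD4+" can be kept whole.

```python
import re

# Rules are tried in order; the first rule keeps an immunologic marker and
# its trailing sign together as one token.
TOKEN_RE = re.compile(r"""
    [A-Za-z][A-Za-z0-9]*[+-](?![A-Za-z0-9])   # marker with trailing sign: CD4+, TdT+, CD3-
  | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*            # words, numbers and hyphenated words
  | \S                                        # any other single non-space character
""", re.VERBOSE)

def tokenize(sentence):
    """Split a sentence into tokens, keeping signed markers intact."""
    return TOKEN_RE.findall(sentence)

tokens = tokenize("The cells are CD4+ and TdT+.")
```

The negative lookahead in the first rule prevents a hyphenated word such as "B-cell" from being split after "B-".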
3.4.1.1 Matching token subsequences to UMLS concepts
To group token subsequences that correspond to medical terminology, we perform dictionary
look up against the UMLS Metathesaurus [26]. We investigate each of the n × (n - 1)/2 subsequences of tokens in a sentence and look them up in the UMLS Metathesaurus. For UMLS CUI
matching, we experimented with the entire set or subsets of CUIs and chose the following approach that balances the coverage and accuracy on our data. If the token subsequence has only
one CUI match, this CUI is used. If the token subsequence has multiple CUI matches, we select
the one that is confirmed by the largest number of sources. If there is a tie, we prefer the CUI supported by SNOMED CT [242] if there is one, or flip a coin otherwise. We then perform a greedy
search to find the longest token subsequences with a matching UMLS concept unique identifier
(CUI). The heuristics employed to guide the greedy search include ignoring case in matching,
eliminating subsequences that are fully contained in longer sequences, eliminating interpretations
of single tokens that fall into function-word grammatical categories, and ignoring punctuation.
After that, we look up multiple mapping tables in the UMLS Metathesaurus and obtain medical
subject headings (MeSH) and semantic type unique identifiers (TUI) from CUIs.
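The CUI selection rule and the longest-match search can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used here: the lexicon is a toy dictionary with made-up identifiers ("X1" to "X5") and source names, the function-word list is hypothetical, the search is a plain left-to-right greedy scan, and punctuation handling is omitted.

```python
import random

FUNCTION_WORDS = {"in", "of", "the", "for", "and", "to"}  # illustrative only

def choose_cui(candidates, rng=random):
    """Pick one CUI from {cui: set of supporting sources}."""
    if len(candidates) == 1:
        return next(iter(candidates))
    best = max(len(sources) for sources in candidates.values())
    tied = [c for c, sources in candidates.items() if len(sources) == best]
    if len(tied) == 1:
        return tied[0]
    # Tie: prefer a SNOMED CT-supported CUI, otherwise choose at random.
    snomed = [c for c in tied if "SNOMEDCT" in candidates[c]]
    return rng.choice(snomed or tied)

def longest_matches(tokens, lexicon):
    """Greedy left-to-right search for the longest matching subsequences."""
    spans, i, n = [], 0, len(tokens)
    while i < n:
        for j in range(n, i, -1):  # try the longest candidate first
            key = " ".join(tokens[i:j]).lower()  # case-insensitive lookup
            single_function_word = (j - i == 1 and key in FUNCTION_WORDS)
            if key in lexicon and not single_function_word:
                spans.append((i, j, choose_cui(lexicon[key])))
                i = j  # shorter matches inside this span are skipped
                break
        else:
            i += 1
    return spans

lexicon = {
    "in situ hybridization": {"X1": {"MSH", "SNOMEDCT"}},
    "hybridization": {"X2": {"MSH"}},
    "plasma cells": {"X3": {"MSH"}, "X4": {"MSH", "NCI"}},
    "in": {"X5": {"MSH"}},
}
spans = longest_matches("In situ hybridization shows plasma cells".split(), lexicon)
```

Here "hybridization" alone is never reported because it is fully contained in the longer match, and "plasma cells" resolves to the identifier with more supporting sources.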
CLINICAL DATA:
53-year-old with psoriasis, bilateral axillary
lymphadenopathy, palpable on right for one month
? lymphoma.
Immunohistochemical stains show that the follicles, as well as some
extrafollicular areas, contain Pax5+ B cells that co-express Bcl6 and Bcl2.
Numerous scattered CD2+ T cells are present.
Follicles are encompassed by
CD21+ follicular dendritic cell (FDC) aggregates, with some loss of FDC
staining in the larger follicles and among extrafollicular B cells. A stain
for CD30 highlights occasional interfollicular immunoblasts.
CD15 stains granulocytes.
There is no lymphoid staining for cyclin D1 or ALK-1.
FLOW CYTOMETRY REPORT: Hematopoietic Cell Surface Markers
SPECIMEN: Tissue - Right Axillary Lymph Node Core Biopsy
RECEIVED: 3/12/10
DIFFERENTIAL COUNT: Lymphocytes: 93%; Monocytes: <1%; Granulocytes: <1%.
RESULTS
LIGHT SCATTER GATE ANALYZED: Lymphocyte
ANTIGENS:
B CELL                        T/NK CELL       MYELOID/OTHER
CD19: 55%                     CD3: 42%        CD45: 84%
CD20: 55%                     CD3+4: 37%      CD14: <1%
surfaceCD19+KAPPA: 50%        CD3+8: 5%
surfaceCD19+LAMBDA: 6%        CD5: 34%
CD19/20+5: <1%                CD7: 39%
CD19/20+10: 42%               CD3-7+: 1%
CD19/20+23: 13%
CD19/20+43brt: <1%
INTERPRETATION:
1. CD19+, CD20bright+, CD10+, CD43-, CD5- B cells with monotypic expression of
kappa light chain amid a polytypic background.
2. CD4+ and CD8+ T cells.
KARYOTYPE:
46,XX,t(6;12)(q2?6;q2?1),t(14;18)(q32;q21)[cp7]/47,XX,+X[3]
BANDING: GTG
SCORED: 0
ANALYZED: 10
METAPHASES COUNTED: 10
INTERPRETATION:
Seven of 10 metaphases contained a translocation of chromosomes 14 and 18.
This translocation is associated with an IGH-BCL2 rearrangement, and is a
characteristic finding in B-cell non-Hodgkin's lymphomas of follicular center
cell origin.
Figure 3-1 MGH pathology reports usually contain four sections with almost all information retained as narrative text. Clinical data, the first section, includes patient age, past medical history,
and ongoing treatment procedures, etc. The second section, morphology and immunohistochemistry, describes cellular structural alterations appearing under a light microscope aided by a variety of dyes, some of which are conjugated to cell-specific antibodies. The third section is on flow
cytometry, which describes the characteristic expression of various surface antigens on cells. The
individual or combined percentages of antigens (e.g., CD20, CD5 and CD10) are reported. Also
reported are pathologists' interpretations, which characterize these numbers (e.g., +: positive or -:
negative) relative to reference values. The fourth section is on cytogenetics, which records the
presence of chromosomal aberrations such as translocations, insertions and deletions, in the form
of a "karyotype" using a standardized nomenclature [239] that is not NLP friendly. However, the
accompanying "interpretation" section describes these aberrations in narrative text. Dates,
ages, etc. are replaced with realistic surrogates for de-identification.
3.4.1.2 Two-phase sentence parsing
The medical language used in pathology reports is challenging for general domain parsers. Consider the example sentence: "In situ hybridization for kappa and lambda immunoglobulin light
chains show the plasma cells to be polytypic." Figure 3-2 shows the parse by the Stanford Parser,
in which the term "in situ hybridization" is broken and erroneous dependencies such as
amod(hybridization-3, situ-2) and prep_in(show-11, hybridization-3) are generated.
Parse
(ROOT
(S
(PP (IN In)
(NP
(NP (JJ situ) (NN hybridization))
(PP (IN for)
(NP (NN kappa)
(CC and)
(NN lambda) (NN immunoglobulin)))))
(NP (JJ light) (NNS chains))
(VP (VBP show)
(S
(NP (DT the) (NN plasma) (NNS cells))
(VP (TO to)
(VP (VB be)
(ADJP (JJ polytypic))))))
(. .)))
Typed dependencies, collapsed
amod(hybridization-3, situ-2)
prep_in(show-11, hybridization-3)
nn(immunoglobulin-8, kappa-5)
conj_and(kappa-5, lambda-7)
nn(immunoglobulin-8, lambda-7)
prep_for(hybridization-3, immunoglobulin-8)
amod(chains-10, light-9)
nsubj(show-11, chains-10)
root(ROOT-0, show-11)
det(cells-14, the-12)
nn(cells-14, plasma-13)
nsubj(polytypic-17, cells-14)
aux(polytypic-17, to-15)
cop(polytypic-17, be-16)
xcomp(show-11, polytypic-17)
Figure 3-2 Example sentence parsed directly by the Stanford Parser.
Knowing that "in situ hybridization" is one phrase, the parser not only corrects the error with "in
situ hybridization", but also respects the long phrase "kappa and lambda immunoglobulin light
chains", as shown in Figure 3-3. We therefore parse sentences in two steps: 1) we identify and
group together the non-determiner tokens that match to the concept unique identifiers (CUI) in
the UMLS Metathesaurus [186], 2) we then apply the Stanford Parser with grouped tokens as
one token. We only group token subsequences whose last token is a noun. Finally, we assign
POS tags to grouped token subsequences by using the POS tags from their last tokens during a
separate run of POS tagger on the original sentence.
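The grouping step before the second parsing phase can be sketched as below. This is a minimal illustration under assumptions, not the code used in this work: the function name `group_tokens`, the half-open span format and the hand-assigned POS tags are hypothetical, and the hyphen-joined form simply mirrors the grouped token shown in Figure 3-3.

```python
def group_tokens(tokens, pos_tags, spans):
    """Merge concept spans into single tokens before parsing.

    spans: sorted, non-overlapping (start, end) half-open index pairs for
    token subsequences that matched a CUI. A span is merged only if its last
    token is a noun; the merged token inherits that token's POS tag.
    """
    end_by_start = dict(spans)
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        end = end_by_start.get(i)
        if end is not None and pos_tags[end - 1].startswith("NN"):
            out_tokens.append("-".join(tokens[i:end]))
            out_tags.append(pos_tags[end - 1])
            i = end
        else:
            out_tokens.append(tokens[i])
            out_tags.append(pos_tags[i])
            i += 1
    return out_tokens, out_tags

# Hand-assigned tags for illustration; a real run would take them from the tagger.
toks = ["In", "situ", "hybridization", "for", "kappa"]
tags = ["IN", "FW", "NN", "IN", "NN"]
grouped = group_tokens(toks, tags, [(0, 3)])
```

The grouped tokens and tags would then be passed to the parser in place of the original sequence.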
Parse
(ROOT
(S
(NP
(NP (NNP In-situ-hybridization))
(PP (IN for)
(NP (NN kappa)
(CC and)
(NN lambda) (NN immunoglobulin) (JJ light) (NNS chains))))
(VP (VBP show)
(S
(NP (DT the) (NN plasma) (NNS cells))
(VP (TO to)
(VP (VB be)
(ADJP (JJ polytypic))))))
(. .)))
Typed dependencies, collapsed
nsubj(show-9, In-situ-hybridization-1)
nn(chains-8, kappa-3)
conj_and(kappa-3, lambda-5)
nn(chains-8, lambda-5)
nn(chains-8, immunoglobulin-6)
amod(chains-8, light-7)
prep_for(In-situ-hybridization-1, chains-8)
root(ROOT-0, show-9)
det(cells-12, the-10)
nn(cells-12, plasma-11)
nsubj(polytypic-15, cells-12)
aux(polytypic-15, to-13)
cop(polytypic-15, be-14)
xcomp(show-9, polytypic-15)
Figure 3-3 Two-phase sentence parsing on example.
3.4.1.3 Choosing CUI over TUI to group token subsequences
The relative usefulness of various dictionaries from the UMLS Metathesaurus has received
mixed reports from the research community [243]. Early in our experiments, we relied
on the UMLS semantic types to group token subsequences. The UMLS currently defines
133 semantic types that are indexed by TUIs. Our earlier approach followed a sequence of steps
called zoom-in, mine and zoom-out. In the zoom-in step, in addition to grouping token subsequences using CUIs, we mapped each CUI to a corresponding TUI and identified the semantic
types of the grouped token subsequences. In the mining step, we treated token subsequences
sharing a semantic type as identical nodes in the sentence graphs, and applied frequent subgraph
mining. The rationale was to group concepts of the same semantic types together. This would
lead to a coarser granularity of concepts, with the hope for the captured frequent subgraphs to
cover more sentences. In the zoom-out step, we took the frequent subgraphs returned by the mining step, mapped them back to the sentences and replaced TUI labels for their nodes with corresponding CUIs extracted from those sentences.
However, we later noticed that UMLS semantic categories in general provided too coarse a
granularity for our application. For example, T cells, B cells, neutrophils, and megakaryocytes all
mapped to the semantic type of "Cell" at the lowest level of the UMLS semantic types. Moreover, the UMLS semantic types sometimes led to inconsistencies with our domain knowledge. For
example, if one includes all CUIs for "CD10" and maps them to semantic types, one gets the following semantic types: molecular function, enzyme, and gene or genome. However, pathologists
see CD10 primarily as an important immunologic factor. In fact, this happens for multiple CD
antigens, including CD79a (mapping to Amino Acid, Peptide, or Protein and Receptor), CD138
(mapping to Gene or Genome, Amino Acid, Peptide, or Protein and Biologically Active Substance), etc. Note that for CD138, strictly speaking, Biologically Active Substance is a semantic
type subsuming immunologic factor. However, referring only to the semantic type hierarchy, this
does not preclude the possibility that CD138 may belong to other subsumed semantic types such
as Neuroreactive Substance or Biogenic Amine, Hormone, Enzyme, Vitamin, and Receptor. A
third problem is that the UMLS semantic type hierarchy does not form a strict taxonomy. For
example, under the type chemical, the subtypes chemical viewed functionally and chemical
viewed structurally largely overlap each other. This leads to the problem that even the same CUI
of a chemical can have two semantic types. Due to the above problems, we saw much noise coming from using the UMLS semantic types as node labels for sentence graphs, which affected discovery of frequent subgraphs and, in turn, classification performance. We tried multiple heuristics to attempt to resolve such inconsistencies, for example, only looking at upper levels of the
semantic hierarchy. However, this aggravated the coarse granularity problem and led to no obvious classification performance gain. We finally resorted to relying on the CUIs to label sentence
graph nodes.
3.4.1.4 Parse post processing
In order to increase the accuracy of the sentence graph representations, we perform post processing on the Stanford dependency parsing results. The main observation is that lists of immunologic factors often pose parsing challenges, as in the sentence, "Most interstitial lymphocytes
are CD3 positive T-cells with fewer CD20 and PAX5 positive B-cells". Even if all POS tags are
correctly assigned, the parser still has difficulty in determining that "CD20" and "PAX5" are
both connected to "positive". We observed the following list patterns that may interfere with the
parsing process and implemented rule-based post-processing systems to systematically correct
list-related errors. For each pattern, we give an example sentence along with its Stanford Parsing
results with and without pre-processing.
1. A list of nominal immunological factors:
Example sentence 1: "These large cells are positive for the B-cell markers CD20, OCT2,
BOB1 and are also MUM1 and BCL6 positive."
Figure 3-4 shows the raw Stanford parsing result. Figure 3-5 shows the parsing results after pre-processing on tokens and POS tags. It is clear that pre-processing helps correct the
POS tags for "MUM1" and "BCL6". However, dependencies involving "OCT2" and
"BOB1" are incorrect, as highlighted in Figure 3-5.
(ROOT
(S
(NP (DT These) (JJ large) (NNS cells))
(VP
(VP (VBP are)
(ADJP (JJ positive)
(PP (IN for)
(NP
(NP (DT the) (JJ B-cell) (NNS markers) (NN CD20))
(, ,)
(NP (NNP OCT2) (, ,) (NNP BOB1))))))
(CC and)
(VP (VBP are)
(ADVP (RB also))
(ADJP
(ADJP (JJ MUM1))
(CC and)
(ADJP (RB BCL6) (JJ positive)))))
(. .)))

det(cells-3, These-1)
amod(cells-3, large-2)
nsubj(positive-5, cells-3)
nsubj(MUM1-18, cells-3)
cop(positive-5, are-4)
root(ROOT-0, positive-5)
det(CD20-10, the-7)
amod(CD20-10, B-cell-8)
nn(CD20-10, markers-9)
prep_for(positive-5, CD20-10)
nn(BOB1-14, OCT2-12)
appos(CD20-10, BOB1-14)
cop(MUM1-18, are-16)
advmod(MUM1-18, also-17)
conj_and(positive-5, MUM1-18)
advmod(positive-21, BCL6-20)
conj_and(positive-5, positive-21)
conj_and(MUM1-18, positive-21)
Figure 3-4 Raw Stanford parsing result for example sentence 1.
(ROOT
(S
(NP (DT These) (JJ large) (NNS cells))
(VP
(VP (VBP are)
(ADJP (JJ positive)
(PP (IN for)
(NP
(NP (DT the) (NN B-cell-markers)
(NN CD20))
(, ,)
(NP
(NP (NN OCT2))
(, ,)
(NP (NN BOB1)))))))
(CC and)
(VP (VBP are)
(ADVP (RB also))
(NP (NN MUM1)
(CC and)
(NN BCL6)))
(ADJP (JJ positive)))
(. .)))

det(cells-3, These-1)
amod(cells-3, large-2)
nsubj(positive-5, cells-3)
cop(positive-5, are-4)
root(ROOT-0, positive-5)
det(CD20-9, the-7)
nn(CD20-9, B-cell-markers-8)
prep_for(positive-5, CD20-9)
appos(CD20-9, OCT2-11)
appos(OCT2-11, BOB1-13)
cop(MUM1-17, are-15)
advmod(MUM1-17, also-16)
conj_and(positive-5, MUM1-17)
conj_and(positive-5, BCL6-19)
conj_and(MUM1-17, BCL6-19)
acomp(positive-5, positive-20)
Figure 3-5 Stanford parsing result after pre-processing for example sentence 1. The yellow highlights mark the erroneous parsing structures.
2. A list of adjective-form immunological factors:
Example sentence 2: "Report of immunostains indicates the cells are CD79a+, CD20+,
CD3-, CD5-, BC16+, BCL2-, and CD10+ consistent with follicle center origin."
Figure 3-6 shows the raw parsing result, in which many tokens, POS tags and dependencies are incorrect. Figure 3-7 shows the parsing result after pre-processing. Improvements
in tokenization output and POS tags are seen, but dependency errors are still present, as
highlighted.
(ROOT
(S
(NP
(NP (NNP Report))
(PP (IN of)
(NP (NNS immunostains))))
(VP (VBZ indicates)
(SBAR
(S
(NP (DT the) (NNS cells))
(VP (VBP are)
(NP
(NP
(NP (NNP CD79a) (NNP +))
(, ,)
(NP (NNP CD20) (CD +) (, ,) (CD CD3))
(: -))
(, ,)
(NP
(NP (NNP CD5))
(PRN (: -)
(NP
(NP (, ,) (JJ BC16) (NN +))
(NP (NNP BCL2)))
(: -))
(CC and)
(NP
(NP (NNP CD10) (NNP +))
(ADJP (JJ consistent)
(PP (IN with)
(NP (JJ follicle) (NN center) (NN origin))))))))))
(. .)))

nsubj(indicates-4, Report-1)
prep_of(Report-1, immunostains-3)
root(ROOT-0, indicates-4)
det(cells-6, the-5)
nsubj(+-9, cells-6)
cop(+-9, are-7)
nn(+-9, CD79a-8)
ccomp(indicates-4, +-9)
appos(+-9, CD20-11)
num(CD20-11, +-12)
num(CD20-11, CD3-14)
ccomp(indicates-4, CD5-17)
conj_and(+-9, CD5-17)
amod(+-21, BC16-20)
dep(CD5-17, +-21)
appos(+-21, BCL2-23)
nn(+-28, CD10-27)
ccomp(indicates-4, +-28)
conj_and(+-9, +-28)
amod(+-28, consistent-29)
amod(origin-33, follicle-31)
nn(origin-33, center-32)
prep_with(consistent-29, origin-33)
Figure 3-6 Raw Stanford parsing result for example sentence 2.
(ROOT
(S
(NP
(NP (NN Report))
(PP (IN of)
(NP (NNS immunostains))))
(VP (VBZ indicates)
(SBAR
(S
(NP (DT the) (NNS cells))
(VP (VBP are)
(NP
(NP (JJ CD79a+))
(ADJP (JJ CD20+) (, ,) (JJ CD3-) (, ,) (JJ CD5-) (, ,) (JJ BC16+) (, ,) (JJ BCL2-) (, ,)
(CC and)
(JJ CD10+)
(JJ consistent)))
(PP (IN with)
(NP (NN follicle) (NN center) (NN origin)))))))
(. .)))

nsubj(indicates-4, Report-1)
prep_of(Report-1, immunostains-3)
root(ROOT-0, indicates-4)
det(cells-6, the-5)
nsubj(CD79a+-8, cells-6)
cop(CD79a+-8, are-7)
ccomp(indicates-4, CD79a+-8)
dep(consistent-22, CD20+-10)
amod(CD79a+-8, CD3--12)
conj_and(consistent-22, CD3--12)
amod(CD79a+-8, CD5--14)
conj_and(consistent-22, CD5--14)
amod(CD79a+-8, BC16+-16)
conj_and(consistent-22, BC16+-16)
amod(CD79a+-8, BCL2--18)
conj_and(consistent-22, BCL2--18)
amod(CD79a+-8, CD10+-21)
conj_and(consistent-22, CD10+-21)
amod(CD79a+-8, consistent-22)
nn(origin-26, follicle-24)
nn(origin-26, center-25)
prep_with(CD79a+-8, origin-26)
Figure 3-7 Stanford parsing result after pre-processing for example sentence 2. The yellow highlights mark the erroneous parsing structures.
3. A list of nominal immunological factors modifying adjectives:
Example sentence 3: "Most interstitial lymphocytes are CD3 positive T-cells with fewer
CD20 and PAX5 positive B-cells."
Figure 3-8 shows the raw parsing result with POS tag errors, such as for "CD3". Figure
3-9 shows the parsing result with pre-processing. Highlighted parts indicate the error of
not recognizing that "B-cells" are "CD20" "positive".
(ROOT
(S
(NP
(NP (NNP Immunohistochemistry))
(PP (IN of)
(NP (DT the) (NN bone) (NN marrow) (NN core))))
(VP (VBZ reveals)
(SBAR (IN that)
(S
(NP (RBS most) (JJ interstitial) (NNS lymphocytes))
(VP (VBP are)
(VP (VBG CD3)
(NP (JJ positive) (NNS T-cells))
(PP (IN with)
(NP
(NP (JJR fewer) (NN CD20)
(CC and)
(NP (CD PAX5) (JJ positive) (NNS B-cells))))))))))
(S
(NP (DT the) (NN latter))
(VP (VBP are)
(ADJP (JJ small)
(PP (IN in)
(NP (NN size))))))
(. .)))

nsubj(reveals-7, Immunohistochemistry-1)
det(core-6, the-3)
nn(core-6, bone-4)
nn(core-6, marrow-5)
prep_of(Immunohistochemistry-1, core-6)
root(ROOT-0, reveals-7)
mark(CD3-13, that-8)
advmod(lymphocytes-11, most-9)
amod(lymphocytes-11, interstitial-10)
nsubj(CD3-13, lymphocytes-11)
aux(CD3-13, are-12)
ccomp(reveals-7, CD3-13)
amod(T-cells-15, positive-14)
dobj(CD3-13, T-cells-15)
amod(CD20-18, fewer-17)
prep_with(CD3-13, CD20-18)
num(B-cells-22, PAX5-20)
amod(B-cells-22, positive-21)
prep_with(CD3-13, B-cells-22)
conj_and(CD20-18, B-cells-22)
det(latter-25, the-24)
nsubj(small-27, latter-25)
cop(small-27, are-26)
parataxis(reveals-7, small-27)
prep_in(small-27, size-29)
Figure 3-8 Raw Stanford parsing result for example sentence 3.
To correct the parsing errors introduced by the above list patterns, we perform the following
steps. We first recognize the immunologic list patterns by checking the UMLS semantic types of
parsing nodes and record those belonging to immunologic factors. The semantic types along with
their specific TUI numbers that are considered as immunologic factors are shown in Table 3-2.
Multiple semantic types are included because some cell surface markers may belong to one or
more semantic types. For example, "CD2" belongs to "Amino Acid, Peptide, or Protein", "Immunologic Factor", "Receptor", "CD10" belongs to "Enzyme", "CD138" belongs to "Amino Acid, Peptide, or Protein", "Biologically Active Substance", "BCL2" belongs to "Gene or Genome"
and "EBV" belongs to "Virus". After recognizing such list patterns, we check the POS tags of
immunologic factor parse nodes. If they are adjectives (pattern 2), we replace the whole list with
a dummy adjective "atypical". If they are nouns and the list is followed by an adjective (pattern 3), we replace the whole list and the following adjective with the dummy adjective "atypical".
If the list is not followed by an adjective (pattern 1), we replace the whole list with a dummy proper noun
"ATG"⁴. We then parse the modified sentences using the Stanford Parser. Finally, we restore
the immunologic factors of the original list. For pattern 2, we copy the dependencies of "atypical"
to each immunologic factor adjective. For pattern 3, we copy the dependencies of "atypical" to
the adjective following the list and connect each immunologic factor with that adjective. For pattern 1, we copy the dependencies of "ATG" to each immunologic factor.
4 We use a dummy proper noun so that they can fit in sentences with either singular form or plural form predicates.
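The replacement step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the token format (text, POS tag, set of TUIs), the `CONNECTORS` set, the function name `replace_immuno_lists`, and the choice to treat a run of one or more factors as a list are all assumptions made for the example. Only the TUIs of Table 3-2 and the dummy tokens "atypical" and "ATG" come from the text.

```python
# TUIs of Table 3-2 that are treated as immunologic factors.
IMMUNO_TUIS = {"T123", "T129", "T192", "T116", "T126", "T028", "T005"}
# Hypothetical set of tokens allowed between list members.
CONNECTORS = {",", "and", "or"}


def replace_immuno_lists(tokens):
    """Replace each run of immunologic factors with a dummy token.

    tokens: list of (text, pos_tag, tuis) triples, tuis being a set of TUIs.
    Returns (new_texts, saved), where saved maps the position of each dummy
    token to (original factors, absorbed following adjective or None).
    """
    out, saved = [], {}
    i, n = 0, len(tokens)
    while i < n:
        text, pos, tuis = tokens[i]
        if not (tuis & IMMUNO_TUIS):
            out.append(text)
            i += 1
            continue
        # Grow the list over further factors, skipping connectors between them.
        members, j = [text], i + 1
        while j < n:
            if tokens[j][2] & IMMUNO_TUIS:
                members.append(tokens[j][0])
                j += 1
            elif (tokens[j][0] in CONNECTORS and j + 1 < n
                  and tokens[j + 1][2] & IMMUNO_TUIS):
                j += 1
            else:
                break
        adjective = None
        if pos.startswith("JJ"):                       # pattern 2
            dummy = "atypical"
        elif j < n and tokens[j][1].startswith("JJ"):  # pattern 3
            dummy, adjective = "atypical", tokens[j][0]
            j += 1                                     # absorb the adjective
        else:                                          # pattern 1
            dummy = "ATG"
        saved[len(out)] = (members, adjective)
        out.append(dummy)
        i = j
    return out, saved
```

The `saved` map keeps what is needed to restore the original factors after parsing: the dependencies of the dummy token at each recorded position would be copied to the stored factors (and, for pattern 3, to the stored adjective).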
(ROOT
  (S
    (S
      (NP
        (NP (NN Immunohistochemistry))
        (PP (IN of)
          (NP (DT the) (NN bone-marrow) (NN core))))
      (VP (VBZ reveals)
        (NP
          (NP (NN that))
          (SBAR
            (S
              (NP (JJS most) (JJ interstitial) (NNS lymphocytes))
              (VP (VBP are)
                (NP
                  (NP (NN CD3) (JJ positive) (NNS T-cells))
                  (PP (IN with)
                    (NP
                      (NP (JJR fewer) (NN CD20))
                      (CC and)
                      (NP (NN PAX5) (JJ positive) (NNS B-cells)))))))))))
    (S
      (NP (DT the) (NN latter))
      (VP (VBP are)
        (ADJP (JJ small)
          (PP (IN in)
            (NP (NN size))))))
    (. .)))

nsubj(reveals-6, Immunohistochemistry-1)
det(core-5, the-3)
nn(core-5, bone-marrow-4)
prep_of(Immunohistochemistry-1, core-5)
root(ROOT-0, reveals-6)
dobj(reveals-6, that-7)
amod(lymphocytes-10, most-8)
amod(lymphocytes-10, interstitial-9)
nsubj(T-cells-14, lymphocytes-10)
cop(T-cells-14, are-11)
nn(T-cells-14, CD3-12)
amod(T-cells-14, positive-13)
rcmod(that-7, T-cells-14)
amod(CD20-17, fewer-16)
prep_with(T-cells-14, CD20-17)
nn(B-cells-21, PAX5-19)
amod(B-cells-21, positive-20)
prep_with(T-cells-14, B-cells-21)
conj_and(CD20-17, B-cells-21)
det(latter-24, the-23)
nsubj(small-26, latter-24)
cop(small-26, are-25)
parataxis(reveals-6, small-26)
prep_in(small-26, size-28)

Figure 3-9 Stanford parsing result after pre-processing for example sentence 3. The yellow highlights mark the erroneous parsing structures.
TUI    Semantic Type
T123   Biologically Active Substance
T129   Immunologic Factor
T192   Receptor
T116   Amino Acid, Peptide, or Protein
T126   Enzyme
T028   Gene or Genome
T005   Virus
Table 3-2 Semantic types considered as immunologic factors.
3.4.2 Intuition on relations among concepts
In a corpus of pathology reports focusing on a specific disease, certain relations among medical
concepts occur frequently. For example, Figure 3-10 shows variations of immunohistochemistry
interpretations, which describe "what kind of staining" (bold-outline blocks) is observed with
regard to "antibodies" to "what type of antigens" (dash-outline blocks). The relations among
those concepts are what characterize the immunohistochemistry results. For example, in one pathology report, "B lineage antigens" associate with "staining of most large atypical cells", and "T
lineage associated antigens" associate with "staining of most small cells". If we use only individual
findings, it is difficult to exclude the other possibilities of association. For daily pathology
practice, important relations are likely to be repeated in similar syntactic and semantic constructs.
This motivated us to use a graph representation to capture concepts and relations expressed in a
sentence, as well as to use frequent subgraph mining to identify important relations encoded by
sentence subgraphs.
[Figure 3-10 diagram: sentence templates of the form "with antibodies to <antigen type> ... <staining>". Dash-outline blocks: B lineage antigens; T lineage associated antigens (CD3); immunoglobulin light chains. Bold-outline blocks: staining of most large atypical cells and very few small cells; background staining; staining of most small cells within the tissue; bright monotypic (kappa) staining of most lymphoid cells.]
Figure 3-10 A variety of sentences frequently occurring in our corpus describe the relations
among cells, staining, and antigens/antibodies. Dash-outline blocks indicate "what type of antigens"; bold-outline blocks indicate "what kind of staining".
3.4.3 Representing sentence dependency parses as graphs
In natural language, the syntactic structure of a statement often corresponds at least approximately to the ways in which the semantic parts may be combined to aggregate the meaning of the
overall statement [152]. The two-phase sentence parsing (described above) produces the dependency
linkage structure of a sentence. This translates conveniently to a graph representation
of the relations, where the nodes are concepts and the edges are syntactic dependencies among
the concepts. We experimented with multiple parsers including the augmented Stanford Parser
[194], the Link Grammar Parser [165,244] and the ClearParser [245]. We chose the Stanford
Parser because it produced fewer systemic errors on our corpus.
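As a rough illustration of this translation, the sketch below turns Stanford-style dependency lines into a node map and a list of labeled edges. It is an assumption-laden example, not the code used in this work: the regular expression, the function name `parse_dependencies` and the use of raw tokens as node labels are illustrative choices (in our graphs the nodes carry UMLS concepts, not raw tokens).

```python
import re

# One dependency per line, e.g. "nn(core-6, bone-4)".
DEP_RE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_dependencies(lines):
    """Return (nodes, edges) for a list of dependency lines.

    nodes maps token index -> token text;
    edges is a list of (head_index, dependent_index, relation).
    """
    nodes, edges = {}, []
    for line in lines:
        match = DEP_RE.fullmatch(line.strip())
        if match is None:
            continue                      # skip lines that are not dependencies
        rel, head, head_idx, dep, dep_idx = match.groups()
        if rel == "root":
            continue                      # the artificial ROOT node is not a concept
        nodes[int(head_idx)] = head
        nodes[int(dep_idx)] = dep
        edges.append((int(head_idx), int(dep_idx), rel))
    return nodes, edges
```

Keeping edges as (head, dependent, relation) triples makes it easy to drop the relation later when untyped dependencies are wanted.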
Figure 3-11 shows the graph representation for the example sentence "Immunostains show the
large atypical cells are strongly positive for CD30 and negative for CD15, CD20, BOB1, OCT2
and CD3." Syntactic dependencies are denoted using line segments with labels (e.g., prep_for).
For each parse node (round-corner rectangle), the text in parentheses includes the tokens in the
original sentence, connected by hyphens (e.g., "atypical-cells"). The text above the parentheses
displays the preferred name of the node's CUI (e.g., CD20_Antigens for C0054946). For determiners, we exclude common functional determiners such as "a", "an" and "the" but keep the semantically meaningful ones such as "no" and "all".
The Stanford Parser supports various parsing modes. We chose the mode specifying "collapsed
dependencies with propagation of conjunct dependencies" [246], which has the most compact
graph translations. With this mode, possible cyclic graphs can arise in the dependency linkages,
such as the cycle in the middle of Figure 3-11.
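[Figure 3-11 diagram: sentence graph whose nodes, written as preferred concept name (sentence tokens), include immunostain (immunostains), show (show), Large (large), Cytologic_atypia (atypical-cells), Strong (strongly), Positive (positive), Negative (negative), Antigens,_CD30 (CD30), Antigens,_CD15 (CD15), CD20_Antigens (CD20), POU2AF1_gene (BOB1), SLC22A2_gene (OCT2) and CD3_Antigens (CD3), connected by labeled dependency edges such as prep_for.]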
Figure 3-11 Constructing the sentence graph from the results of two-phase dependency parsing.
In order to increase the accuracy of the sentence graph representations, we perform post processing on the Stanford dependency parsing results by converting lists of immunologic factors to
single tokens as described in section 3.4.1.4.
3.4.4 Frequent subgraph mining
Frequent subgraph mining is based on the notion of graph subisomorphism. Intuitively, one
graph is subisomorphic to another graph if it is part of the other. Formally, let $G_s = (V_s, E_s, l_s)$ and $G = (V, E, l)$ be two graphs, where $V$ ($V_s$) is the set of nodes, $E$
($E_s$) is the set of edges and $l$ ($l_s$) is the labeling function for nodes and edges. For $G_s$ to be
subisomorphic to $G$, there must exist a one-to-one mapping $f$
such that:
1) $f(V_s) = V_m \subseteq V$, s.t. $\forall v \in V_s,\ l_s(v) = l(f(v))$
2) $\forall v_1, v_2 \in V_s$, if $(v_1, v_2) \in E_s$, then $(f(v_1), f(v_2)) \in E$ and $l_s(v_1, v_2) = l(f(v_1), f(v_2))$
Condition 1 says that there exists a mapping from nodes in Gs to a subset of nodes in G, such that
corresponding nodes agree on their labels. Condition 2 says that each edge in Gs should also have
a counterpart in G that shares the same label. Figure 3-12 shows two example subgraphs of the
sentence graph in Figure 3-11.
[Figure 3-12 diagram: two subgraphs, labeled (1) and (2), drawn over nodes of the sentence graph in Figure 3-11, including Large (large), Cytologic_atypia (atypical-cells), Negative (negative), Positive (positive), Strong (strongly), Antigens,_CD30 (CD30), Antigens,_CD15 (CD15), CD20_Antigens, SLC22A2_gene and POU2AF1_gene (BOB1), with edges such as amod and nsubj.]
Figure 3-12 Example subgraphs for the sentence graph in Figure 3-11.
We say that a subgraph occurs once in a corpus every time it is subisomorphic to a graph in that
corpus. The frequency of a subgraph is the total number of its occurrences within the corpus.
Frequent subgraph mining tries to identify those subgraphs whose frequencies are above a given
threshold. Various graph encodings, enumeration strategies and search pruning policies have
been proposed to improve the efficiency of the mining algorithms [247,248]. In this work, we
use the open-source frequent subgraph miner Gaston [100], which has state-of-the-art speed.
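To make the definitions of subisomorphism and frequency concrete, here is a brute-force sketch for small labeled directed graphs. It is not how Gaston works (Gaston uses far more efficient enumeration and pruning); the graph encoding, the function names and the use of "at least the threshold" as the frequency test are assumptions made for this example.

```python
from itertools import permutations


def is_subisomorphic(gs, g):
    """Check conditions 1 and 2 of the definition by exhaustive search.

    Each graph is (nodes, edges) with nodes {node_id: label} and
    edges {(source_id, target_id): label}.
    """
    s_nodes, s_edges = gs
    nodes, edges = g
    s_ids = list(s_nodes)
    # Try every one-to-one mapping f from the nodes of gs into the nodes of g.
    for image in permutations(nodes, len(s_ids)):
        f = dict(zip(s_ids, image))
        labels_agree = all(s_nodes[v] == nodes[f[v]] for v in s_ids)
        edges_agree = all(edges.get((f[u], f[v])) == label
                          for (u, v), label in s_edges.items())
        if labels_agree and edges_agree:
            return True
    return False


def frequency(sub, corpus):
    """Number of corpus graphs to which sub is subisomorphic."""
    return sum(is_subisomorphic(sub, g) for g in corpus)


def frequent_subgraphs(candidates, corpus, threshold):
    """Keep the candidates whose frequency reaches the threshold (assumed >=)."""
    return [c for c in candidates if frequency(c, corpus) >= threshold]
```

The exhaustive search is exponential in the size of the smaller graph, which is acceptable only because sentence subgraphs are small; it is meant to mirror the definition, not to be fast.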
3.4.5 Subgraph redundancy pruning
We ran Gaston on our training dataset containing 17,186 sentences, with a frequency threshold
of 5, and obtained 180,863 frequent subgraphs. Analyzing these subgraphs, we found that many
smaller subgraphs are subisomorphic to other larger frequent subgraphs. Many of these larger
subgraphs have the same frequencies as their subisomorphic smaller subgraphs. This arises because, when
a larger subgraph is frequent, all of its subgraphs are also frequent. Furthermore, if the
smaller subgraph is so specific that it is not subisomorphic to any other larger subgraph, then this
pair of larger and smaller subgraphs shares an identical frequency. Therefore, we kept only the larger subgraphs in such pairs. Note that a full pairwise check is cost prohibitive, because the subisomorphism comparison between two subgraphs is already NP-complete [100],
and a pairwise approach would require around 16 billion such comparisons for our dataset. We
developed an efficient algorithm using hierarchical hash partitioning that reduces the number of
subgraph pairs to compare by several orders of magnitude. The key idea is that we only need to
compare subgraphs whose sizes differ by one, and we can further partition the subgraphs so that
only those within the same partition need to be compared. After subgraph redundancy pruning,
we are left with 9935 subgraphs.
In fact, let $G_s$ and $G$ be subgraphs with $G_s$ subisomorphic to $G$, $\#(G_s) = \#(G)$ and
$|G_s| < |G| - 1$, where $\#(\cdot)$ denotes the frequency and $|\cdot|$ denotes the number of nodes in a
graph. Then, given the subisomorphism between $G_s$ and $G$, one can construct a graph $G_1$ by
adding to $G_s$ one additional node (and its associated edges) from $G$. It is clear that $\#(G) \le \#(G_1) \le \#(G_s)$,
but because $\#(G) = \#(G_s)$, we have $\#(G) = \#(G_1) = \#(G_s)$. Thus we only need to check
subisomorphism between $G_s$ and $G_1$, and between $G_1$ and $G$, where $G_1$ differs from $G_s$ in size by
only one. Carrying on such construction, we therefore only need to check pairs of subgraphs
whose sizes differ by one. Based on this, we can first order subgraphs in descending order
according to their sizes. Then it suffices to progress down the hierarchy, checking among subgraphs
that are in the neighboring two levels.
To further reduce unnecessary subisomorphism comparisons, we make another observation: for
a graph Gs to be subisomorphic to G, the node labels of Gs must be a subset of the node labels of G. Moreover, as
we restrict ourselves to comparing only subgraphs from neighboring levels, we are able to adopt
a hash partition scheme to avoid enumerating all possible pairs from neighboring levels. Precisely speaking, at level n a subgraph has n nodes; if we consider its subgraphs of size n - 1, there are
at most n possible sets of labels. We can then construct a hash table and hash each level-n subgraph
n times using its (n - 1)-node label subsets as keys. We also hash subgraphs from level n - 1
using their node label sets (of size n - 1) as keys. We note that it is only necessary to check subgraph pairs in the same partition. Although an upper level subgraph is hashed multiple times into
the hash table, hashing has both constant amortized update time and constant amortized lookup
time. The time for multiple hashes is much less than the time for unnecessary subisomorphism
comparisons. Moreover, in practice, the size of a subgraph is often small, and multiple hashes
only multiply the total hash update and lookup times by a constant factor.
A summary of our algorithm is shown in Figure 3-13. Lines 1 and 2 sort the set of graphs so that
they are first ordered (in descending order) by their number of nodes and then by their number of
edges. This ensures that subisomorphism only needs to be checked by looking at graphs before
the current one. Line 3 partitions graphs into levels according to their sizes while keeping the
previously sorted order. Lines 5 to 29 progress down the hierarchy performing subisomorphism
check when necessary. Lines 7 to 11 hash each upper level graph into possibly multiple buckets.
Lines 12 to 15 partition lower level graphs into different hash buckets. Lines 20 to 23 check
subisomorphism within the same hash partition on the lower level. Lines 24 to 29 check
subisomorphism between corresponding lower level bucket and upper level buckets. In lines 22
and 28, we generalize from the condition requiring two subgraphs to have identical frequencies
to a condition customizable by the user.
3.4.6 Single node frequent subgraph collection
Gaston only collects frequent subgraphs having two or more nodes. Because our token subsequence grouping may group all tokens within a short sentence into one node if they are covered
by one CUI, such nodes would be ignored by Gaston. We do not want to exclude the possibility
that sometimes the presence of a meaningful medical concept in the text can be informative. We
thus also collected single node subgraphs using the same frequency threshold 5 as for multi-node
frequent subgraphs, adding 1602 single node subgraphs (11537 total).
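A minimal sketch of this collection step is given below, assuming each sentence graph is available as a collection of node labels and that a label is counted once per sentence graph, in line with the frequency definition of section 3.4.4. The function name and the input format are illustrative, not taken from our implementation.

```python
from collections import Counter


def frequent_single_nodes(sentence_graphs, threshold=5):
    """Return node labels that occur in at least `threshold` sentence graphs.

    sentence_graphs: iterable of collections of node labels, one per sentence.
    """
    counts = Counter()
    for labels in sentence_graphs:
        counts.update(set(labels))   # count each label once per sentence graph
    return {label for label, count in counts.items() if count >= threshold}
```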
subisomorphism for set of graphs
input:  S - set of graphs
effect: compute subisomorphism relation among graphs in S
 1  stable sort S in descending order of number of edges
 2  stable sort S in descending order of number of nodes
 3  levels <- put graphs of size n into levels[n]
 4  maxlevel <- length(levels)
 5  for n = maxlevel downto 2
 6    h_upper = {}; h_lower = {}
 7    if n != maxlevel
 8      ulevel = levels[n+1]
 9      for i = 1 to length(ulevel)
10        foreach key : labels of n-1 subset of nodes of ulevel[i]
11          add ulevel[i] into the list h_upper[key]
12    llevel = levels[n]
13    for i = 1 to length(llevel)
14      foreach key : set of labels llevel[i]
15        add llevel[i] into the list h_lower[key]
16    foreach key in h_lower.keys()
17      g_lower = h_lower[key]
18      for i = 1 to length(g_lower)
19        gs = g_lower[i]
20        for j = 1 to i-1
21          gb = g_lower[j]
22          if condition = true
23            subisomorphism(gs, gb)
24        if h_upper.haskey(key)
25          g_upper = h_upper[key]
26          for j = 1 to length(g_upper)
27            gb = g_upper[j]
28            if condition = true
29              subisomorphism(gs, gb)
Figure 3-13 A hierarchical hash partition algorithm for determining subisomorphism relation
among graphs in a set
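The pseudocode of Figure 3-13 can be sketched in Python as follows. This is an illustrative re-expression under simplifying assumptions, not our implementation: each graph is reduced to a dictionary with its node labels and edge count, the function only collects the candidate pairs that would be passed to the subisomorphism test, and the names `candidate_pairs` and `condition` are hypothetical. Unlike the pseudocode, the loop visits every level present in the input rather than stopping at level 2.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(graphs, condition=lambda gs, gb: True):
    """Return (small_index, big_index) pairs that still need a subisomorphism test.

    graphs: list of dicts with keys "labels" (list of node labels) and "n_edges".
    condition: user-supplied filter, e.g. equality of frequencies.
    """
    # Order by number of nodes, then number of edges, both descending.
    order = sorted(range(len(graphs)),
                   key=lambda i: (-len(graphs[i]["labels"]), -graphs[i]["n_edges"]))
    levels = defaultdict(list)
    for i in order:
        levels[len(graphs[i]["labels"])].append(i)

    pairs = []
    for n in sorted(levels, reverse=True):
        # Hash each upper-level graph under every n-label subset of its labels.
        h_upper = defaultdict(list)
        for i in levels.get(n + 1, []):
            for key in set(combinations(sorted(graphs[i]["labels"]), n)):
                h_upper[key].append(i)
        # Hash each graph of the current level under its full label multiset.
        h_lower = defaultdict(list)
        for i in levels[n]:
            h_lower[tuple(sorted(graphs[i]["labels"]))].append(i)
        # Compare only within a partition: earlier graphs of the same bucket,
        # then upper-level graphs hashed to the same key.
        for key, bucket in h_lower.items():
            for pos, s in enumerate(bucket):
                for b in bucket[:pos] + h_upper.get(key, []):
                    if condition(graphs[s], graphs[b]):
                        pairs.append((s, b))
    return pairs
```

In the small example used to check this sketch, four graphs admit six unordered pairs, but only two pairs share a partition and reach the subisomorphism test.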
3.5 Experiments and Results
For each patient case, we use the written diagnosis (in the final diagnoses section of the pathology reports) as the ground truth label. A patient may have multiple lymphomas at the same time,
or the diagnosis may be an intermediate case between multiple lymphomas. Given the relatively
small numbers of multiple-hit/intermediate cases as shown in Table 3-3, we model the classification task as multiple binary classification problems, one for each lymphoma. For the ground truth,
the positive cases for one lymphoma type also include the multiple-hit/intermediate cases involving this type. The negative cases of one lymphoma type include positive cases of the other three
types, except for multiple-hit/intermediate cases involving this type. Our task resembles the differential
diagnosis of four lymphomas, assuming that every patient in the selected population has
at least one lymphoma. By splitting the dataset randomly into halves, stratified by type of lymphoma, we obtained a training set and a testing set, whose statistics are in Table 3-4.
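The labeling scheme for the four binary problems can be illustrated as below. The representation of a diagnosis as a set of lymphoma types and the function name `binary_labels` are assumptions made for this sketch.

```python
LYMPHOMA_TYPES = ["Burkitt", "DLBCL", "Follicular", "Hodgkin"]


def binary_labels(diagnoses):
    """Build one binary label vector per lymphoma type.

    diagnoses: list of sets, each holding the lymphoma type(s) of one case;
    multiple-hit or intermediate cases carry more than one type and count
    as positive for every type they involve.
    """
    return {t: [1 if t in d else 0 for d in diagnoses] for t in LYMPHOMA_TYPES}
```

A case diagnosed as intermediate between Burkitt and DLBCL is therefore a positive example in both the Burkitt and the DLBCL problems and a negative example in the other two.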
Type                                           # Cases   Percent
Intermediate between Burkitt and DLBCL         18        1.7%
Intermediate between Burkitt and Follicular    2         0.2%
Double-hit of DLBCL and Follicular             42        4.0%
Intermediate between DLBCL and Hodgkin         7         0.7%
Table 3-3 Multiple-hit or intermediate lymphoma cases. Percentage is out of a total of 1038 cases.
              Full Corpus          Training Corpus      Testing Corpus
Lymphoma      N     P     P%       N     P     P%       N     P     P%
Burkitt       946   93    9.0%     500   55    9.9%     446   38    7.9%
DLBCL         383   656   63.2%    210   345   62.2%    173   311   64.4%
Follicular    811   228   22.0%    425   130   23.4%    386   98    20.3%
Hodgkin       908   131   12.6%    486   69    12.4%    422   62    12.8%
Table 3-4 Distribution of lymphoma cases in the full corpus, training corpus and testing corpus. N -
number of negative patients, P - number of positive patients, P% - percentage of positive patients. Note that in the full corpus, the number of positive cases does not add up to 1038
(the total number of patients) because some patients have diagnoses of multiple/intermediate lymphomas.
In our experiments, we trained three baseline classifiers on different feature types. Baseline 1
uses negation-classified medical concepts extracted by the latest MetaMap [26]. Baseline 2 further filters the concepts in Baseline 1 based on UMLS semantic types that are reported in previous studies to have good performance for medical problem extraction [249,250]. In addition to
previously used semantic categories of diseases and symptoms, we also included semantic types
that fall under the hierarchy of "Chemical" and "Anatomical Structure" as our pathology reports
largely concern the immunological factors and various types of lymphocytes. Baseline 3 uses the
standard n-grams features [251], including unigrams, bigrams and trigrams, which have been reported as most useful for document classification [252]. We experimented with multiple machine
learning algorithms including support vector machines (SVM), decision trees and Bayesian networks. We chose SVM for its better performance on our training data and its widely acknowledged generalizability. We experimented with polynomials up to degree five and radial basis
functions as candidate kernels. We performed ten-fold cross validation on training data for parameter
selection and evaluated the trained model on the held-out test dataset. Cross validation
favored a linear kernel for all the settings in our experiment.
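For reference, the n-gram features of Baseline 3 can be produced along the following lines. This is a generic sketch with whitespace tokens and raw counts; the tokenization and any weighting actually used in our experiments are not specified here, and the function name is illustrative.

```python
from collections import Counter


def ngram_features(tokens, max_n=3):
    """Count all unigrams, bigrams and trigrams (up to max_n) in a token list."""
    features = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features[" ".join(tokens[start:start + n])] += 1
    return features
```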
Table 3-5 shows the evaluation results on the subgraph features for each of the four lymphoma
categories in comparison with the three baselines. The evaluation metrics include standard precision, recall, f-measure and AUC (area under ROC curve). Let TP denote the number of true positives in the contingency table, FP the number of false positives and FN the number
of false negatives. Precision is defined as $P = TP/(TP + FP)$, recall as $R = TP/(TP + FN)$, and
f-measure as $F = 2 \times P \times R/(P + R)$. It is clear that full MetaMap features outperform
filtered MetaMap features. Thus we performed significance tests comparing the subgraph features with the full MetaMap features and with the n-gram features. We used the approximate
randomization test [253] to assess whether two system outputs were significantly different from
each other (p = 0.05) and the statistically significant changes in Table 3-5 are marked. We see
improvements on precision, recall, and f-measure across all four lymphomas compared with either baseline. For Burkitt lymphoma, all improvements are significant. For DLBCL, the improvement in recall over n-grams is not significant. For follicular lymphoma, all improvements
over n-grams are significant; the improvement in recall over MetaMap is significant. For Hodgkin lymphoma, all improvements are significant except for the recall compared with n-gram features. Overall, the sentence subgraph features significantly outperform all three baselines.
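The metric definitions and the significance test can be sketched as follows. The code is a generic illustration of precision, recall, f-measure and a paired approximate randomization test, not the evaluation script used for Table 3-5; the function names, the number of trials, the seed and the (hits + 1)/(trials + 1) p-value estimate are assumptions of this example.

```python
import random


def prf(tp, fp, fn):
    """Precision, recall and f-measure from contingency counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def f_measure(gold, pred):
    """F-measure of binary predictions against binary gold labels."""
    tp = sum(1 for g, y in zip(gold, pred) if g == 1 and y == 1)
    fp = sum(1 for g, y in zip(gold, pred) if g == 0 and y == 1)
    fn = sum(1 for g, y in zip(gold, pred) if g == 1 and y == 0)
    return prf(tp, fp, fn)[2]


def approximate_randomization(gold, pred_a, pred_b, metric=f_measure,
                              trials=1000, seed=0):
    """Estimate the p-value for the difference in `metric` between two systems."""
    rng = random.Random(seed)
    observed = abs(metric(gold, pred_a) - metric(gold, pred_b))
    hits = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:      # swap the two outputs for this case
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(metric(gold, shuffled_a) - metric(gold, shuffled_b)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

As a worked example with hypothetical counts, TP = 28, FP = 4 and FN = 10 give P = 0.875, R ≈ 0.737 and F = 0.8.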
Lymphoma      Full MetaMap* (3112)        Filtered MetaMap (1600)     n-gram† (16326)             Sentence subgraph (11537)
Class         P      R      F      AUC    P      R      F      AUC    P      R      F      AUC    P        R        F        AUC
Burkitt-N     0.965  0.978  0.971  0.778  0.959  0.989  0.973  0.744  0.969  0.984  0.977  0.808  0.978    0.991    0.984    0.864
Burkitt-P     0.688  0.579  0.629  0.778  0.792  0.5    0.613  0.744  0.774  0.632  0.696  0.808  0.875*†  0.737*   0.8*†    0.864
DLBCL-N       0.703  0.634  0.667  0.743  0.714  0.523  0.604  0.704  0.829  0.703  0.761  0.812  0.87     0.779    0.822    0.857
DLBCL-P       0.808  0.852  0.829  0.743  0.77   0.884  0.823  0.704  0.849  0.92   0.883  0.812  0.884*†  0.936*   0.909*†  0.857
Follicular-N  0.933  0.974  0.953  0.849  0.939  0.953  0.946  0.854  0.932  0.958  0.945  0.841  0.952    0.971    0.961    0.889
Follicular-P  0.877  0.724  0.793  0.849  0.804  0.755  0.779  0.854  0.816  0.724  0.768  0.841  0.878†   0.806*†  0.84†    0.889
Hodgkin-N     0.963  0.995  0.979  0.869  0.952  0.988  0.97   0.825  0.97   0.988  0.979  0.889  0.977    1        0.988    0.919
Hodgkin-P     0.958  0.742  0.836  0.869  0.891  0.661  0.759  0.825  0.907  0.79   0.845  0.889  1*†      0.839*   0.912*†  0.919
Table 3-5 Held-out test results on different feature groups. In the lymphoma class column, suffix
"-N" denotes negative cases, "-P" denotes positive cases. P - precision, R - recall, F - f-measure,
AUC - area under curve for ROC curve. Numbers in parentheses next to each feature group indicate the number of the features in that group. Evaluation metrics for each positive class are in
bold if they show significant improvements over baselines. Markers (*, †) are used to indicate
specific baselines.
To assess the effect of parse post-processing and the effect of detailed dependency types on the
performance of sentence subgraph features, Table 3-6 shows different configurations in separate
panels, in which "untyped dependency" means that all dependency types are ignored. Vertical
comparisons show that post processing in general helps to improve classification performance
with the exception of Burkitt lymphoma classification when the system uses typed dependencies.
Horizontal comparisons show that distinguishing dependency types in general does not improve
classification performance. In particular, with post processing, untyped dependencies even help
to improve the f-measures for Burkitt, DLBCL, and follicular lymphoma classifications. There
are two possible reasons. First, the Stanford Parser dependency types may distinguish relations
between concepts in unnecessary detail. For example, the partial sentences "B-cells with CD10
expression" (B-cells --prep_with--> expression --amod--> CD10) and "B-cells expressing CD10" (B-cells --partmod--> expressing --dobj--> CD10) have different syntactic parses but convey almost the same information
to pathologists. In addition, parser errors during dependency type assignment could
introduce noise that diminishes the usefulness of the dependency types.
Lymphoma      No post processing, typed dependency (7491)   No post processing, untyped dependency (8548)
Class         P      R      F      AUC                      P      R      F      AUC
Burkitt-N     0.978  0.984  0.981  0.861                    0.978  0.984  0.981  0.861
Burkitt-P     0.8    0.737  0.767  0.861                    0.8    0.737  0.767  0.861
DLBCL-N       0.819  0.762  0.789  0.834                    0.868  0.767  0.815  0.852
DLBCL-P       0.873  0.907  0.890  0.834                    0.879  0.936  0.907  0.852
Follicular-N  0.942  0.971  0.957  0.868                    0.937  0.971  0.954  0.858
Follicular-P  0.872  0.765  0.815  0.868                    0.869  0.745  0.802  0.858
Hodgkin-N     0.977  0.990  0.983  0.915                    0.974  0.993  0.984  0.908
Hodgkin-P     0.929  0.829  0.881  0.915                    0.944  0.823  0.879  0.908

Lymphoma      Post processing, typed dependency (9488)      Post processing, untyped dependency (11537)
Class         P      R      F      AUC                      P      R      F      AUC
Burkitt-N     0.969  0.989  0.979  0.810                    0.978  0.991  0.984  0.864
Burkitt-P     0.828  0.632  0.716  0.810                    0.875  0.737  0.8    0.864
DLBCL-N       0.86   0.75   0.801  0.841                    0.87   0.779  0.822  0.857
DLBCL-P       0.871  0.932  0.901  0.841                    0.884  0.936  0.909  0.857
Follicular-N  0.943  0.979  0.961  0.872                    0.952  0.971  0.961  0.889
Follicular-P  0.904  0.765  0.829  0.872                    0.878  0.806  0.84   0.889
Hodgkin-N     0.979  0.998  0.988  0.926                    0.977  1      0.988  0.919
Hodgkin-P     0.981  0.855  0.914  0.926                    1      0.839  0.912  0.919
Table 3-6 Held-out test results on different settings of sentence subgraph feature groups. In the
lymphoma class column, suffix "-N" denotes negative cases, "-P" denotes positive cases.
P - precision, R - recall, F - f-measure, AUC - area under curve for ROC
curve. Numbers in parentheses next to each feature group indicate the number of features in that
group.
3.6 Feature and Error Analysis
This section investigates the ability of sentence subgraphs to assist with human review by
providing insightful relations over a flexible number of medical concepts. The sentence subgraph
features outperform all three baselines and n-grams seem to be the best baseline overall. A closer
look at the MetaMap baseline shows that the program did not identify some important immunologic factors, such as CD30, CD15 etc. By contrast, n-gram features cover the entire text, but often do not map to medical concepts. To compare subgraph features with the baselines, we identified
in the training corpus cases that are false negatives for the n-gram baseline and the MetaMap
baseline but not for the sentence subgraph features during cross validation. We then identified
the big subgraphs (> 3 nodes) that contribute to the improved recognition of the three minority
lymphomas, by choosing those with a normalized weight above 0.01 as assigned by a linear kernel SVM. For Burkitt lymphoma, examples of interesting positive factors include:
bf1  "... with antibodies to immunoglobulin, ... there is monotypic ... kappa staining of most tumor cells ..."
bf2  "... b-cells ... negative for BCL2 ... positive for BCL6 ..."
bf3  "... CD19+, CD20+, CD10+, CD5-, CD23-, CD43+ ... B cells with monotypic expression of kappa light chain ..."
bf4  "... tumor cell is positive for CD10 ..."
For readability, we translated each subgraph into a partial sentence. Note that in bf3, although we
have listed "CD19+, CD20+, CD10+, CD5-, CD23-, CD43+" in order, when viewed in the subgraph, the individual immunologic factors are all adjective modifiers of "B cells", hence the subgraph is insensitive to their order. The factors bf1, bf2, bf3 and bf4 are consistent with the immunophenotypic
characteristics of Burkitt lymphoma in the WHO classification [216], which states that the tumor
cells are light chain-restricted with moderate to strong expression of pan-B-cell (CD19, CD20)
and germinal center (BCL6 and CD10) antigens, and are negative for CD5 and CD23.
For follicular lymphoma, examples of positive factors that are exclusively discovered by sentence subgraph features are as follows. The factors ffl, ff2 and ff3 are consistent with Table 8.01
in [216], as CD10 is usually positive and CD23 is intermittently positive on B cells in follicular
lymphoma.
ff1  "... CD20+, CD10dim, CD5-, CD23- ... B cells ..."
ff2  "... CD20+, CD10dim, CD5-, CD43- ... B cells ..."
ff3  "... CD19+, CD20+, CD23+ ... B cells with ... expression of lambda light chain ..."
One might think that Hodgkin lymphoma cases should be easy to classify because of the presence of Reed-Sternberg cells as a well-recognized diagnostic feature. However, our analysis
shows that the paucity of neoplastic Reed-Sternberg cells and the predominance of nonneoplastic cells lead to interesting associations between sentence subgraphs and Hodgkin lymphoma. In particular, we found the following positive factors discovered by sentence subgraph
features.
hf1  "... atypical large cells ... positive for ... CD30 ..."
hf2  "... with antibodies to B lineage ... antigens ... there is staining of many ... cells ..."
hf3  "... with antibodies to T lineage associated antigen ... there is staining of ... cells ..."
The factor hf1 links CD30-expressing atypical large cells to Hodgkin lymphoma and conforms to
conventional knowledge [216]. The factors hf2 and hf3 refer to staining patterns of background
T and B cells. Although hf2 and hf3 are seen to some extent in other lymphoma subtypes, Hodgkin lymphoma is particularly rich in background non-neoplastic T cells, as well as B cells to a
lesser extent, and these non-neoplastic cells vastly outnumber the neoplastic Reed-Sternberg
cells [6]. Together with other Hodgkin-related subgraph features such as hf1 or Reed-Sternberg
cells, hf2 and hf3 appear to account for these non-neoplastic cells. Our classifier placed higher
weight on hf3 than on hf2, agreeing with the aforementioned T-cell dominance. Of note, recent
work has shown varying patterns of morphology and immunophenotype in background nonneoplastic cells associated with a certain subtype of Hodgkin lymphoma [254-256], pointing to
the potential utility of our analysis in identifying variant patterns of lymphoma.
Of the four lymphomas, follicular lymphoma has a moderate number of cases but comparatively
lower f-measure than DLBCL and Hodgkin lymphoma. We thus delved into false negative cases
of follicular lymphoma in the training data and selected common features that have top negative
weights as assigned by the linear kernel SVM. Investigating those common features, we highlighted the following.
fnf1  "... large ..."
fnf2  "... erythroid maturation is normal ..."
fnf3  "... myeloid maturation is normal ..."
The factor fnf1 incorrectly associates the single-node subgraph "large" with negative classification
of follicular lymphoma. In the description of a morphological study, "large" often describes the
cell size. Although the keyword appears in the name of DLBCL (diffuse large B cell lymphoma), it is not a distinguishing feature, because a Hodgkin Reed-Sternberg cell can
be large, and centroblasts in follicular lymphoma can be large. Similarly, the keywords "diffuse"
and "follicular" are not specific to DLBCL and follicular lymphoma respectively. Although
our model successfully excluded "diffuse" from the top negative features for follicular lymphoma, it incorrectly included "large". We reason that this is because we have a majority of DLBCL
cases, which do frequently have the keyword "large", and the imbalanced ratio between DLBCL
and follicular lymphoma confused our model. The factors fnf2 and fnf3 refer to erythroid and
myeloid maturation respectively, which in reality are neither positively nor negatively associated
with the likelihood of follicular lymphoma. We think this is identified by the classifier because
lymphoma patients often undergo a staging bone marrow biopsy in which myeloid and erythroid
maturation are routinely assessed during the process of determining whether the marrow is involved by lymphoma. As a result, normal myeloid and erythroid maturation is frequently associated with most cases. Because there are more follicular lymphoma cases with uninvolved staging
bone marrow biopsies than those with involved biopsies, such association could be regarded by
the classifier as favoring negative classification of follicular lymphoma.
3.7 Discussion and Limitations
Some clinical reports are template based. In fact, our pathology reports also have template-based
sections. For example, there are disclaimers such as "By his/her signature below, the pathologist
listed as making the Final Diagnosis certifies that he/she has personally reviewed this case and
confirmed or corrected the diagnoses." We exclude these sentences from being processed, as
they do not offer clinical insights. Recognizing these sections is based on knowledge from EMR
vendors about pre-specified templates.
Patient demographics such as gender and age are usually mentioned in the clinical presentation
section. They are also part of the features captured by subgraphs. For the age features, expressions such as "year-old" are connected to the integers, which we discretize into 10-year bins. However, we did not find demographics ranked as top-weighted features in our experiments. This is
likely due to the presence of more specific predictors such as morphologic, immunophenotypic,
and genetic features, though we do not exclude the possibility that a better customized discretization
can yield a different outcome.
In addition, we note that different institutions may have different clinical documentation systems
and styles, which may bring challenges to generalizing our framework to multiple institutions.
We expect that the untyped dependencies will help mitigate some style (e.g., syntactic) differences between institutions. We also expect that the UMLS concept mapping can lessen the impact of the terminology differences between institutions. We are in fact expanding the lymphoma
classification project across institutions, and generalizability analysis is part of our future work.
Our work is predicated on the assumption that pathology reports provide a comprehensive statement of measurements, observations and interpretations made by pathologists. This seems true of
current practice, but future programs may have access to digital images of immunohistochemical
slides and raw flow cytometry counts directly from instruments. Nevertheless, we expect that for
the foreseeable future pathologists' observations and interpretations will continue to be expressed in natural language, hence the techniques we report here will continue to be helpful.
We expect to scale up our tool to assist with human expert reviews and more systematically identify unique variants and new subcategories of lymphoma, whose recognition, diagnosis and acceptance into the widely-used classification system are important for patients to receive appropriate treatment and follow-up and to further our understanding of lymphoma biology.
3.8 Conclusions
We narrowed the gap between automatic unsupervised feature generation and interpretable feature generation from clinical narrative text by building a framework that can perform unsupervised extraction of relations among a flexible number of medical concepts. Our framework represents narrative sentences in pathology reports as graphs, and automatically mines sentence subgraphs for feature generation. We perform a lymphoma classification task resembling differential
diagnosis, in which no explicit mentions or synonyms of the targeted lymphomas are available to
the classifier. Evaluation shows that the classifier with unsupervised sentence subgraph features
significantly outperforms the baselines using standard n-grams, full MetaMap concepts, or filtered MetaMap concepts respectively. With detailed feature analysis, we highlight that our system generates meaningful features and medical insights into lymphoma classification.
Chapter 4.
Subgraph Augmented Non-negative Tensor
Factorization (SANTF) Applied to Modeling Clinical
Narrative Text
This chapter5 continues to describe the core part of the Subgraph Augmented Non-negative Tensor Factorization (SANTF) algorithm, with a focus on applying non-negative tensor factorization
to group subgraphs collected from Chapter 3. We begin by motivating the need for using non-negative tensor factorization to perform such groupings, continuing with the example of lymphoma subtype categorization based on pathology reports.
Advances in machine learning have opened avenues towards more effective mining and modeling of EMRs to facilitate translational research [257,258]. However, clinicians often regard existing machine learning models as hard-to-interpret black boxes. In lymphoma pathology reports,
immunophenotypic features may be expressed in the form of relations among medical concepts
such as lymphoid cells and antigens (e.g., "[large atypical cells] express [CD30]"). We refer to
the above relations as higher-order features, and the words (e.g., "large", "cells") as atomic features. When interpreting pathology reports and evaluating lymphoma subtypes, clinicians usually
reason at the level of higher-order features (e.g., cell-antigen relations) besides atomic features
(e.g., individual words). Moreover, multiple higher-order features (such as "[large atypical cells]
express [CD30]", "[large atypical cells] express [CD15]" and "[large atypical cells] have [Reed-Sternberg appearance]") can strengthen the confidence of suspected lymphoma (Hodgkin lymphoma here). Such a group of higher-order features conveniently encodes medical knowledge as
in the WHO lymphoma classification guideline [216] (referred to as WHO guideline later),
where a panel of morphologic and immunophenotypic features are used to specify diagnostic criteria. For computational modeling, atomic features can help correlate higher-order features in
order to discover medically meaningful groupings. For example, the above relations all share the
words "large", "atypical" and "cells", which indicates that they all describe the characteristics of
tumor cells. However, extracting higher-order features is itself a difficult task and often involves
manually constructed rules and domain knowledge [27,97,103,259]. In addition, modeling interactions between higher-order features and atomic features is usually ignored by machine learning
algorithms that mostly adopt a flat patient-by-feature matrix view (patients as rows and features
as columns). Although theoretically one can add interactions as additional features or embed
graphical models to account for feature interactions, the problem quickly becomes intractable for
large feature dimensionality.
5 This chapter was published as a research article in Journal of the American Medical Informatics Association [2].
On the other hand, limited availability of expert annotation leads to the fact that most clinical
data are still either unannotated or sparsely annotated. Thus unsupervised machine learning approaches have often been used to analyze biomedical data [260,261]. Moreover, the expense of
expert engineered features also argues for unsupervised feature learning instead of manual feature engineering [87,262,263]. In particular, non-negative matrix factorization (NMF) has been a
highly effective unsupervised method [264] to cluster similar patients [265] and sample cell lines
[266], to identify subtypes of diseases [267] and to learn groups of atomic features or expert engineered features such as temporal patterns from predefined events [268] and genetic expression
patterns [269-273]. As the multi-dimensional extension of NMF, non-negative tensor factorization (NTF) [274-276] has recently been studied to model the genetic associations with phenotypes [277-279] and interaction between cellular activities [280]. However, none of these approaches models the correlations among higher-order features, and some do not even consider
higher-order features. Our work is more closely related to previous work on applying NMF and
NTF in text mining in general domains such as email and security surveillance [281-284]. In
particular, our approach differs from the NTF based text document analysis [281,284] in that we
augment the NTF with subgraphs to capture relation oriented higher-order features instead of
standalone entities. In addition, we adopted the Tucker tensor factorization model instead of the
PARAFAC model [285], where the support for factor matrices with different group numbers better serves our application purpose.
In this chapter, we develop an unsupervised framework that can generate machine learning models conveniently interpretable to clinicians. The framework adopts NTF to discover groupings of
subgraph encoded higher-order features, hence the name subgraph augmented non-negative tensor factorization (SANTF).
4.1 Methods
4.1.1 Workflow of SANTF
We first outline the SANTF workflow in Figure 4-1. Narrative text sentences are first converted to
graph representations, derived using the natural language processing (NLP) steps for pathology
reports described in section 3.4.1 and frequent subgraph mining (FSM) as described in sections
3.4.4 to 3.4.6. Figure 4-2 shows an example of higher-order features for clinical narrative text.
With such representations, subgraphs encode higher-order features, and we use "subgraphs" and
"higher-order features" interchangeably throughout the chapter. We jointly model the higher-order features and atomic features, and apply non-negative tensor factorization to discover
groups of features and patients, and then perform unsupervised learning to identify the associations between feature groups and patient groups. We next explain the tensor modeling and factorization in more detail.
[Figure 4-1 flowchart: Narrative Text -> NLP Steps -> Graphs -> Frequent Subgraph Mining -> Subgraphs (Higher-Order Features) and Words (Atomic Features) -> Non-negative Tensor Factorization -> Feature and Patient Groups (Unsupervised Learning)]
Figure 4-1 The workflow of subgraph augmented non-negative tensor factorization (SANTF).
FSM - frequent subgraph mining. NLP - natural language processing.
[Figure 4-2 diagram: the example sentence "Immunostains show the large atypical cells are positive for OCT2 and BOB1, and negative for CD10, CD15 and CD30." is converted by the NLP steps into a graph; FSM then yields example frequent subgraphs that link "atypical cells" through nsubj and prep_for edges to "positive" and OCT2, and to "negative" and CD30.]
Figure 4-2 Graph generation and subgraph collection in SANTF. The graph representation for
the example sentence: "Immunostains show the large atypical cells are positive for OCT2 and
BOB1, and negative for CD10, CD15 and CD30". Example frequent subgraphs are shown after
the frequent subgraph mining (FSM) steps.
4.1.2 Joint modeling of higher-order features and atomic features using a tensor
In clinical narrative text, higher-order features are often correlated with each other in medically
meaningful ways. For example, the two subgraphs in Figure 4-2 both describe the surface markers
expressed by the "large atypical cells" that are often tumor cells. However, as pointed out in
the introduction, with a flat matrix view and binary feature representation, such correlations are
difficult to account for. Motivated by the need to explicitly model correlations among the higher-order features, we compose a three-mode tensor, in which one mode represents the patients, a
second the higher-order features (subgraphs), and a third the atomic features. Note that in tensor
terminology [285], we speak of mode in place of dimension. Figure 4-3 shows the schematic
view of tensor modeling. We select as atomic features the words that are covered by or next to a
subgraph node (neighborhood window size was set to two for this work). The intuition is that
subgraphs that share affiliated (covered and contextual) words are likely to be conceptually related. By taking the union over all words that are affiliated with the nodes of a sentence subgraph,
we obtain the distributional representations of that sentence subgraph. Each entry of the tensor is
the count of a certain combination of patient, subgraph, and word, and is non-negative (see Figure 4-3 for an example). We then used a generalized tf-idf weighting of co-occurrence counts of
subgraph-word pairs (i.e. counting and weighting subgraph-word pairs instead of counting and
weighting words), which leads to better empirical performance.
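
The count tensor described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name `build_count_tensor`, the integer indices and the toy observations are hypothetical, and the generalized tf-idf weighting is left out because its exact form is not given in this section.

```python
import numpy as np

def build_count_tensor(observations, n_patients, n_subgraphs, n_words):
    """Accumulate (patient, subgraph, word) co-occurrence counts in a 3-mode tensor.

    `observations` is a hypothetical list of (patient, subgraph, words) triples,
    one per subgraph match, where `words` holds the indices of the words that
    are covered by or neighboring the nodes of the matched subgraph.
    """
    X = np.zeros((n_patients, n_subgraphs, n_words))
    for patient, subgraph, words in observations:
        for word in words:
            X[patient, subgraph, word] += 1.0  # counts are non-negative by construction
    return X

# Toy vocabulary (assumed): 0 large, 1 cells, 2 BCL2, 3 positive, 4 CD30, 5 negative
observations = [
    (0, 0, [0, 1, 2, 5]),  # patient 0 matches subgraph 0
    (1, 0, [0, 1, 2, 5]),  # patient 1 matches subgraph 0
    (1, 1, [0, 1, 3, 4]),  # patient 1 matches subgraph 1
]
X = build_count_tensor(observations, n_patients=2, n_subgraphs=2, n_words=6)
```

A dense array is used only to keep the toy example short; with thousands of subgraphs and words a sparse representation would be the more realistic choice.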
[Figure 4-3 schematic: the patient x higher-order feature (S) x atomic feature data tensor is factorized into a core tensor and the factor matrices A, B and C. Example subgraph groups 1 and 2 are shown, together with the atomic features 1 large, 2 cells, 3 BCL2, 4 positive, 5 CD30 and 6 negative.]
Figure 4-3 Tensor modeling and factorization with distributional representations of the sentence
subgraphs. In the figure, we show some higher-order features (i.e., sentence subgraphs), as well
as some atomic features (i.e., words). The higher-order features are numbered with the first subgraph being "[large cells] - [negative] - [BCL2]". This subgraph matches the sentence "The
large cells are negative for BCL2", where the word "cells" is one of the neighboring contextual
words for the node "[negative]". If the pathology report of patient 1 has a sentence "The large
cells are negative for BCL2", then subgraph 1 is associated with this patient. As the subgraph
covers the word "large", the first atomic feature, the tensor entry (1,1,1) is increased by 1. The
factor matrix A is the (patient, patient group) matrix, B the (subgraph, subgraph group) matrix, C
the (atomic feature, atomic feature group) matrix. The core tensor $\mathcal{G}$ captures the interactions between the patient groups, subgraph groups and atomic feature groups. We also show example
subgraph group 1 and subgraph group 2. It is desirable that some subgraph groups correspond to
panels of characteristic features for lymphoma subtypes. For example, subgraph group 1 includes
mentions of CD30 staining and Reed-Sternberg appearance of cells, and suggests Hodgkin lymphoma; subgraph group 2 includes mentions of diffuse infiltration of large cells, moderately high
Ki67 expression, and no CD10 staining, and suggests diffuse large B-cell lymphoma (DLBCL).
4.1.3 Patient and feature group discovery using SANTF
The non-negative tensor is then factorized to reduce dimensionality and obtain groups for each
mode. We follow the Tucker factorization scheme [274], where the data tensor is factorized into
a core tensor multiplied by factor matrices (one factor matrix for each mode, each of which is orthogonal in
our setting). The core tensor specifies the level of interaction between groups from different
modes. The column vectors in a factor matrix specify the grouping in the corresponding mode.
Such groupings can capture similar patients, similar sentence subgraphs and similar words;
meanwhile they allow sharing of an element among different groups as specified by its fractional
weights across groups. In Figure 4-3, two example subgraph groups are shown. The top subgraphs in the subgraph group 1 correlate with Hodgkin lymphoma and in group 2 correlate with
diffuse large B-cell lymphoma (DLBCL). Meaningful groupings will not only improve the performance of multiple machine learning tasks but also identify panels of characteristic features of
patient subcategories, in the same form as specified by the diagnostic guidelines.
SANTF differs from previous NTF [277-279] by introducing a mode that captures higher-order
features. SANTF performs group discovery over sentence subgraphs based on the intuition that
these higher-order features encode more aggregated information. In addition, SANTF simultaneously identifies the groups of the atomic features, which indirectly helps the group discovery for
higher-order features through the core tensor. This is possible as the core tensor encodes the interactions among the groups of patients, higher-order features, and atomic features. We next give
the detailed SANTF algorithm.
4.1.4 SANTF algorithm
Here we provide a mathematical formulation of the procedures depicted in Figure 4-3, following
the standard notation [285]. Let $\mathcal{X} \in \mathbb{R}^{P \times S \times W}$ be the data tensor, where P, S, W are the numbers of
patients, subgraphs, and atomic features respectively. We want to find a low rank approximation
to $\mathcal{X}$ by solving a least squares optimization problem (Tucker tensor factorization [285])

$$\min_{A, B, C, \mathcal{G}} f(A, B, C, \mathcal{G}) = \sum_{i=1}^{P} \sum_{j=1}^{S} \sum_{k=1}^{W} \Big( \mathcal{X}_{ijk} - \sum_{p=1}^{P_g} \sum_{q=1}^{S_g} \sum_{r=1}^{W_g} \mathcal{G}_{pqr} A_{ip} B_{jq} C_{kr} \Big)^2 \qquad (4\text{-}1)$$

where $P_g$, $S_g$, $W_g$ are the numbers of groups of patients, subgraphs and atomic features, respectively, and $A \in \mathbb{R}^{P \times P_g}$, $B \in \mathbb{R}^{S \times S_g}$ and $C \in \mathbb{R}^{W \times W_g}$ are factor matrices. Each column corresponds to a group of features or patients. We call the tensor $\mathcal{G}$ the core tensor, which specifies the
interactions between the groups of factor matrices and usually has much smaller size compared
to the data tensor. We refer to $\sum_{p=1}^{P_g} \sum_{q=1}^{S_g} \sum_{r=1}^{W_g} \mathcal{G}_{pqr} A_{ip} B_{jq} C_{kr}$ as the reconstructed tensor, and
the goal is to closely approximate the data tensor using the reconstructed tensor. We further constrain the factor matrices and the core tensor to be non-negative, i.e., $A_{ip}, B_{jq}, C_{kr}, \mathcal{G}_{pqr} \ge 0$. To
solve the constrained optimization problem, we follow the block alternating least squares (ALS)
algorithm [286].
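
The reconstructed tensor and the least squares objective of equation (4-1) can be written compactly with `numpy.einsum`. The sketch below only evaluates the objective for given factors; it does not implement the block ALS solver of [286], and the function names, sizes and random factors are illustrative assumptions.

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    """Reconstructed tensor: X_hat[i,j,k] = sum_{p,q,r} G[p,q,r] A[i,p] B[j,q] C[k,r]."""
    return np.einsum("pqr,ip,jq,kr->ijk", G, A, B, C)

def tucker_loss(X, G, A, B, C):
    """Sum of squared differences between the data tensor and its reconstruction."""
    return float(np.sum((X - tucker_reconstruct(G, A, B, C)) ** 2))

# Toy sizes (assumed): P patients, S subgraphs, W words and their group numbers
rng = np.random.default_rng(0)
P, S, W = 5, 4, 3
Pg, Sg, Wg = 2, 3, 2
A = rng.random((P, Pg))        # non-negative (patient, patient group) factors
B = rng.random((S, Sg))        # non-negative (subgraph, subgraph group) factors
C = rng.random((W, Wg))        # non-negative (word, word group) factors
G = rng.random((Pg, Sg, Wg))   # non-negative core tensor

X = tucker_reconstruct(G, A, B, C)   # a data tensor that the factors fit exactly
loss = tucker_loss(X, G, A, B, C)    # zero up to floating point error
```

Because the factor matrices may have different numbers of columns, the core tensor here is rectangular, which is the property of the Tucker model that the chapter relies on.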
After the groups are computed, we weight each group according to the core tensor $\mathcal{G}$. Let the
slice matrix $\mathcal{G}_{i::}$ of the core tensor $\mathcal{G}$ be obtained by fixing the mode-1 index and varying mode-2
and mode-3 indices (: indicates all indices for the corresponding mode). We choose from $\mathcal{G}$ the
slice matrix $\mathcal{G}_{p::}$ corresponding to the pth patient group and use the $\ell_2$ norm of the slice matrix as
the group weight:

$$w_p = \|\mathcal{G}_{p::}\|_2 = \sqrt{\sum_{q=1}^{S_g} \sum_{r=1}^{W_g} \mathcal{G}_{pqr}^2} \qquad (4\text{-}2)$$
Each entry of the pth column in the factor matrix A is then multiplied by $w_p$ to obtain $A'$. For
the ith patient case (ith row in $A'$), we assign it to group p if $A'_{ip} = \max_{p'} A'_{ip'}$. Intuitively, the
columns of the pre-weighted patient group matrix specify the contribution of each patient to this
group; the norm as calculated in equation (4-2) specifies the magnitude of this patient group
interacting with subgraph and word groups. Weighting according to the core tensor $\mathcal{G}$ by multiplying a column using the corresponding norm takes into account such magnitude, which is necessary when evaluating different group proportions for one patient. Although we have adopted
hard grouping for patients due to the fact that a patient can only belong to one cluster in our experiments, SANTF itself can be readily generalized to applications with soft grouping (multiple
membership) of patients.
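
A small sketch of the group weighting and hard assignment step is given below. It assumes that the norm in equation (4-2) is the entrywise (square root of sum of squares) norm of the core slice; the function names and toy numbers are hypothetical.

```python
import numpy as np

def patient_group_weights(G):
    """w[p] = sqrt(sum over q, r of G[p,q,r]^2), one weight per patient group."""
    return np.sqrt(np.sum(G ** 2, axis=(1, 2)))

def assign_patients(A, G):
    """Scale column p of A by w[p], then give each patient its largest weighted group."""
    w = patient_group_weights(G)
    A_weighted = A * w                      # broadcasting multiplies column p by w[p]
    return A_weighted.argmax(axis=1), A_weighted

# Toy core tensor with 2 patient groups, 1 subgraph group and 2 word groups
G = np.array([[[3.0, 4.0]],
              [[0.0, 1.0]]])
A = np.array([[0.2, 0.8],
              [0.1, 0.9]])

w = patient_group_weights(G)                # weights 5 and 1
labels, A_weighted = assign_patients(A, G)  # patient 0 moves to group 0 after weighting
unweighted = A.argmax(axis=1)               # both patients would fall in group 1 otherwise
```

The toy numbers are chosen so that the weighting changes the assignment of the first patient, which is the situation the paragraph above describes.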
We next give details on how to identify word groups associated with a specific subgraph from
the tensor factorization results, which is used in feature analysis. Let $\bar{\times}_2$ be the mode-2 tensor-vector product defined as

$$(\mathcal{T} \bar{\times}_2 v)_{i_1 i_3} = \sum_{i_2=1}^{I_2} \mathcal{T}_{i_1 i_2 i_3} v_{i_2} \qquad (4\text{-}3)$$

where $\mathcal{T}$ is any three-mode tensor with size $I_1 \times I_2 \times I_3$ and v is a vector of length $I_2$. For the
subgraph i, we obtain

$$A (\mathcal{G} \bar{\times}_2 B_{i:}) \qquad (4\text{-}4)$$

where $\mathcal{G}$ is the core tensor, A the patient factor matrix, B the subgraph factor matrix. We then
sum across the columns of the matrix $A(\mathcal{G} \bar{\times}_2 B_{i:})$ to get the desired word group distribution vector for the ith subgraph.
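
The two steps above can be sketched with NumPy as follows. One point is an assumption on our part: since the result must have one entry per word group, the sketch sums the P x W_g matrix over its rows (patients), giving one total per column. The function names and toy arrays are hypothetical.

```python
import numpy as np

def mode2_vector_product(T, v):
    """(T x2 v)[i1, i3] = sum over i2 of T[i1, i2, i3] * v[i2]."""
    return np.einsum("abc,b->ac", T, v)

def word_group_distribution(A, G, B, i):
    """Word group distribution for subgraph i, following equation (4-4)."""
    M = A @ mode2_vector_product(G, B[i, :])  # shape (P, Wg)
    return M.sum(axis=0)                      # assumed: one total per word group

# Toy factors: 3 patients, 2 patient groups, 2 subgraph groups, 2 word groups
G = np.arange(8.0).reshape(2, 2, 2)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
B = np.array([[1.0, 0.0],    # subgraph 0 belongs entirely to subgraph group 0
              [0.5, 0.5]])

dist = word_group_distribution(A, G, B, i=0)
```

For subgraph 0 the mode-2 product simply selects the first subgraph-group slice of the core tensor, so the result can be checked by hand.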
4.2 Experiments and Results
We experimented with SANTF on clustering lymphoma subtypes based on pathology report narrative text. SANTF itself does not require annotated training data, but in order to verify our algorithms, we use annotated datasets for ground truth. We used part of the dataset described in section 3.3, which consists of 897 patients whose written diagnosis (in the final diagnoses section)
maps to exactly one of the following three lymphomas: diffuse large B-cell lymphoma (DLBCL;
the most common lymphoma), follicular lymphoma (the second most common lymphoma) and
Hodgkin lymphoma (the most common lymphoma in young patients). The written diagnoses
themselves were excluded from being processed by the feature extraction steps, as before. In
contrast to the analysis of Chapter 3, we omit cases of Burkitt lymphoma because it had too few
cases to learn a good clustering model, and we omit cases in which the patient has multiple lymphomas because these do not fit the hard clustering paradigm. The case distribution of the ground
truth for the cases used here is shown in Table 4-1, where the dataset is partitioned roughly
equally, and stratified by type of lymphoma, into a training set (471 cases) and a testing set (426
cases).
Clinical Narrative Text
Lymphoma      All    Train    Test
DLBCL         589    305      284
Follicular    184    101      83
Hodgkin       124    65       59
Table 4-1 Statistics of the lymphoma subtype distribution in the pathology narrative text corpus.
To study the impact of being able to model the interactions among multiple types of features, we
establish three types of baselines for NMF and two configurations of k-means, a frequently used
clustering method. The two configurations of k-means differ in their distance metrics used: Euclidean distance and cosine distance [287]. The first type of baseline applies NMF or k-means on
the (patient, atomic feature) matrices. The second baseline applies NMF or k-means on the (patient, higher-order feature) matrices. The third baseline applies NMF or k-means on the (patient,
combined feature) matrices, where the combined features are generated by adjoining the atomic
features and the higher-order features, because we want to exclude the possibility that the improvements of SANTF only come from simply adding features. Under orthogonality constraints,
NMF is equivalent to simultaneous clustering of rows and columns of a matrix [288], and similar
arguments can be made for NTF. Thus for each factorization scheme, we obtain the factor matrix
of (patient, patient group), and translate this matrix into a clustering interpretation in that for
each patient case, we pick the maximum column as its cluster label. For the pathology reports,
recorded texts reflect results from tests and labs that are performed in order to make differential
diagnoses among possible subtypes of lymphoma. Thus it is reasonable to expect that clustering
based on these data will lead to patient groupings that reflect the lymphoma subtypes.
The tensor has 3773 higher-order features and 2841 atomic features. The patient group number is
set to three, the same as the number of lymphoma subtypes. Because our method is unsupervised,
there is no a priori mapping from patient groups to lymphoma subtypes. We therefore consider
the label permutation that yields the best evaluation metrics as a parameter. For SANTF, the ideal group numbers for the higher-order features and for the atomic features are also parameters.
All parameters are selected using 5-fold cross-validation on the training data and then applied to
the held-out testing data.
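
Because the cluster indices carry no intrinsic meaning, the mapping from clusters to subtypes has to be searched. A minimal sketch of that search is shown below; it assumes accuracy as the selection criterion, which is our simplification (the text only says "the best evaluation metrics"), and the function name and toy labels are hypothetical.

```python
from itertools import permutations

def best_label_mapping(true_labels, cluster_ids, n_classes):
    """Try every cluster-to-class permutation and keep the most accurate one.

    perm[c] is the class assigned to cluster c. Exhaustive search is cheap here
    because three groups give only 3! = 6 permutations.
    """
    best_perm, best_acc = None, -1.0
    for perm in permutations(range(n_classes)):
        correct = sum(1 for t, c in zip(true_labels, cluster_ids) if perm[c] == t)
        acc = correct / len(true_labels)
        if acc > best_acc:
            best_perm, best_acc = perm, acc
    return best_perm, best_acc

# Toy training labels (0 DLBCL, 1 follicular, 2 Hodgkin) and raw cluster indices
true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 0]
perm, acc = best_label_mapping(true_labels, cluster_ids, n_classes=3)
mapped = [perm[c] for c in cluster_ids]
```

In the setting described above the permutation would be chosen on the training folds and then applied unchanged to the held-out test data.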
For the evaluation metrics of clustering performance, we use the commonly adopted metrics of
averaged precision, recall, f-measure, and accuracy that all apply to multi-class clustering [289].
Averaging computes a direct arithmetic average over classes. The accuracy computes the proportions of the sum of diagonal entries out of all entries from the multi-class contingency table. Because neither the NMF nor the NTF has a global convergence guarantee [285,286,290], we use
random initialization for all factorization schemes and average the clustering evaluation metrics
from 100 runs. We show the results in Table 4-2 for the lymphoma subtype clustering. We also
perform significance testing based on Student's t-test with α = 0.05. We see that SANTF significantly outperforms all nine baselines, and in particular, by over 10% margins in average F-measure compared to all baselines. Given that the classes are highly imbalanced, the results seem
to suggest that improvements by SANTF come not only from the fact that more patient cases are
correctly grouped (better accuracy), but also from more balanced clustering among multiple classes (better averaged precision, recall and f-measure).
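
The averaged metrics can be computed from the multi-class contingency table as sketched below. The sketch assumes that cluster labels have already been mapped to classes and that the averaged F-measure is the arithmetic mean of the per-class F-measures; the function name and toy labels are hypothetical.

```python
import numpy as np

def macro_metrics(true_labels, pred_labels, n_classes):
    """Class-averaged precision, recall, F-measure, plus overall accuracy."""
    table = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        table[t, p] += 1                       # rows: true class, columns: predicted class
    tp = np.diag(table).copy()
    predicted_total = table.sum(axis=0)
    true_total = table.sum(axis=1)
    precision = np.divide(tp, predicted_total, out=np.zeros_like(tp), where=predicted_total > 0)
    recall = np.divide(tp, true_total, out=np.zeros_like(tp), where=true_total > 0)
    denom = precision + recall
    f_measure = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f_measure": float(f_measure.mean()),
        "accuracy": float(tp.sum() / table.sum()),  # diagonal entries over all entries
    }

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 1, 2, 1]
scores = macro_metrics(true_labels, pred_labels, n_classes=3)
```

Averaging over classes rather than over cases is what makes these metrics sensitive to the minority classes, which matters for the imbalanced distribution in Table 4-1.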
Methods                                  Avg. Precision   Avg. Recall   Avg. F-measure   Accuracy
(1) NMF pt x wd                          0.492            0.495         0.428            0.626
(2) NMF pt x sg                          0.621            0.765         0.601            0.605
(3) NMF pt x [sg wd]                     0.637            0.787         0.615            0.614
(4) k-means (Euclidean) pt x wd          0.483            0.420         0.398            0.664
(5) k-means (Euclidean) pt x sg          0.700            0.602         0.584            0.708
(6) k-means (Euclidean) pt x [sg wd]     0.690            0.593         0.573            0.726
(7) k-means (Cosine) pt x wd             0.620            0.694         0.618            0.617
(8) k-means (Cosine) pt x sg             0.647            0.762         0.624            0.615
(9) k-means (Cosine) pt x [sg wd]        0.648            0.759         0.626            0.617
(10) SANTF pt x sg x wd                  0.720^{1-9}      0.849^{1-9}   0.743^{1-9}      0.751^{1-9}
Table 4-2 Clustering performances for MGH lymphoma dataset. Each factorization and clustering
scheme is numbered in the "methods" column. Significant improvements (p < 0.05) are in boldface and marked with superscripts indicating the baselines over which the improvement is significant.
SANTF chose by cross-validation 3 x 180 x 60 as the core tensor size for the
lymphoma dataset.
We show the per-class breakdown of evaluations on the lymphoma dataset in Table 4-3. The detailed evaluation results further confirm the above observation that SANTF not only leads to
more patient cases being correctly grouped, as evidenced by big improvement in more populated
classes, but also leads to more balanced clustering, as evidenced by improvements in multiple
classes.
Method                              Class        Precision   Recall   F-measure
NMF pt x wd                         DLBCL        0.713       0.770    0.723
                                    Follicular   0.528       0.242    0.250
                                    Hodgkin      0.235       0.473    0.310
NMF pt x sg                         DLBCL        0.944       0.451    0.598
                                    Follicular   0.481       0.862    0.611
                                    Hodgkin      0.436       0.981    0.596
NMF pt x [sg wd]                    DLBCL        0.969       0.444    0.596
                                    Follicular   0.516       0.935    0.660
                                    Hodgkin      0.426       0.983    0.589
K-Means (Euclidean) pt x wd         DLBCL        0.696       0.920    0.791
                                    Follicular   0.443       0.068    0.115
                                    Hodgkin      0.311       0.271    0.289
K-Means (Euclidean) pt x sg         DLBCL        0.788       0.810    0.779
                                    Follicular   0.548       0.541    0.481
                                    Hodgkin      0.763       0.455    0.492
K-Means (Euclidean) pt x [sg wd]    DLBCL        0.769       0.848    0.802
                                    Follicular   0.607       0.565    0.529
                                    Hodgkin      0.696       0.366    0.389
K-Means (cosine) pt x wd            DLBCL        0.799       0.564    0.646
                                    Follicular   0.366       0.552    0.439
                                    Hodgkin      0.694       0.966    0.768
K-Means (cosine) pt x sg            DLBCL        0.920       0.476    0.612
                                    Follicular   0.566       0.831    0.669
                                    Hodgkin      0.455       0.980    0.590
K-Means (cosine) pt x [sg wd]       DLBCL        0.901       0.483    0.611
                                    Follicular   0.575       0.817    0.671
                                    Hodgkin      0.467       0.977    0.597
SANTF pt x sg x wd                  DLBCL        0.971       0.651    0.777
                                    Follicular   0.546       0.965    0.697
                                    Hodgkin      0.645       0.932    0.755
Table 4-3 Per-class evaluation of clustering on the lymphoma dataset
4.3 Feature Analysis
We performed feature analysis to identify groups of higher-order features contributing to lymphoma subtype clustering. The analyzed subgraph groups correspond to the core tensor size
of 3 x 180 x 60 selected by cross-validation. We follow the standard approach of analyzing
groups in factorization models [291], and make necessary adaptation to SANTF output. Based on
the core tensor after factorization, we associate subgraph groups with patient clusters using the
following calculation. Adopting the standard notation [285], for each slice $\mathcal{G}_{i::}$ (i = 1,2,3) corresponding to a particular patient cluster i, we sum over its word mode (mode-3) to get a vector
whose elements correspond to the subgraph groups. We then sort the vector and investigate the
top 10 subgraph groups for each patient cluster i. For each subgraph group, we sort the subgraphs according to their weights in the subgraph factor matrix and display the top subgraphs,
where the weight is the entry value in the matrix indexed by the corresponding subgraph and
subgraph group. For each patient cluster, we select its top four subgraph groups and list them in
Table 4-4, Table 4-5 and Table 4-6. For readability, we translated each subgraph into a partial
sentence. Note that in the first DLBCL-associated subgraph group, although we have listed "cells
are CD30+, MUM1+" in order in the partial sentence, the subgraph does not distinguish the order between "CD30+" and "MUM1+" as they are both linked to "cells". We analyze each cluster
and relate them in the context of the WHO guideline [216], which reflects the current consensus
knowledge.
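
The ranking procedure described above can be sketched as follows. The function names, the toy core tensor and the toy subgraph factor matrix are hypothetical; only the two operations (summing a core slice over the word-group mode, then sorting a column of the subgraph factor matrix) come from the text.

```python
import numpy as np

def top_subgraph_groups(G, cluster, k=10):
    """Rank subgraph groups for one patient cluster by summing its core slice over word groups."""
    scores = G[cluster].sum(axis=1)      # G[cluster] has shape (Sg, Wg)
    order = np.argsort(-scores)[:k]      # indices of the k largest scores
    return order, scores[order]

def top_subgraphs(B, group, n=5):
    """Rank subgraphs within one subgraph group by their weights in column `group` of B."""
    order = np.argsort(-B[:, group])[:n]
    return order, B[order, group]

# Toy core tensor: 1 patient cluster, 3 subgraph groups, 2 word groups
G = np.array([[[1.0, 1.0],
               [5.0, 0.0],
               [0.0, 3.0]]])
# Toy subgraph factor matrix: 3 subgraphs, 3 subgraph groups
B = np.array([[0.1, 0.7, 0.0],
              [0.9, 0.2, 0.0],
              [0.3, 0.1, 0.5]])

groups, group_scores = top_subgraph_groups(G, cluster=0, k=2)
subgraphs, weights = top_subgraphs(B, group=int(groups[0]), n=2)
```

The weights returned by `top_subgraphs` play the role of the numbers listed next to each partial sentence in the tables that follow.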
DLBCL 1st Subgraph Group
0.6640 atypical cells
0.0293 large lymphoid cells
0.0057 show ... positive cells
0.0040 large lymphoid cell with vesicular nuclei
0.0025 show the cells are ... B-cells co-expressing
0.0019 large cells predominate
0.0010 cells are CD30+, MUM1+
0.0005 large cells stain for CD79a
0.0005 admixed small lymphocytes
0.0004 large cells stain positively for CD20
0.0002 large atypical cell with vesicular nuclei

DLBCL 2nd Subgraph Group
0.0929 large lymphoid cells
0.0530 atypical cells
0.0240 large cells
0.0070 monotypic staining of immunoglobulin light chains
0.0059 show large atypical cells with ... vesicular nuclei
0.0051 B-lineage antibody PAX5 ... stain ... large cells
0.0049 associated cells
0.0047 a few large cells
0.0037 atypical cells are CD10-, BCL2- ...
0.0034 infiltrate of large ... cells with ... scant cytoplasm
0.0034 sheet of ... cells

DLBCL 3rd Subgraph Group
0.0385 diffuse infiltrate of large ... cells
0.0329 large lymphoid cells
0.0312 large atypical cells
0.0137 diffuse infiltrate of large ... cells with ... vesicular nuclei
0.0082 B-lineage antibody PAX5 ... stain ... large cells
0.0077 infiltrate of large ... cells with ... scant cytoplasm
0.0051 sections show ... tissue with ... infiltrate of ... cells
0.0041 positive for CD20, BCL2
0.0028 cells ... form
0.0014 atypical large cells ... positive for CD20
0.0009 monotypic staining with immunoglobulin lambda chains

DLBCL 4th Subgraph Group
0.0144 negative for cytokeratin
0.0111 stain positively for CD20
0.0104 in-situ hybridization show
0.0103 positive for immunoglobulin kappa chains
0.0101 cells show ... stain
0.0094 Ki67 proliferation index is greater than 70%
0.0086 Ki67 proliferation index is greater than 60%
0.0075 positive for CD79a
0.0060 stain for Ki67
0.0053 large cells stain positively for CD20
0.0044 positive for cytokeratin
Table 4-4 Top higher-order feature groups associated with diffuse large B-cell lymphoma. Subgraphs are translated to partial sentences. In each list item, e.g., "0.0010, ... cells are
CD30+, MUM1+ ... ", 0.0010 indicates its weight in the group. The "... cells are CD30+,
MUM1+ ... " is the partial sentence translated from the corresponding subgraph. Partial sentences that are not mentioned in feature analysis are grayed out.
For the DLBCL cluster as shown in Table 4-4, the first associated subgraph group recognizes the
following histologic (light microscope-visible) facts: the cells are atypical in appearance and are
large lymphoid cells with vesicular nuclei (the critical visual hallmarks of diffuse large B cell
lymphoma). Immunohistochemically the group appropriately identifies staining for the B cell
markers CD79a and CD20. Although the staining for CD79a, CD20 can also be seen in the scattered large lymphocyte-predominant (LP) cells in nodular lymphocyte predominant Hodgkin
lymphoma (NLPHL) (see p.324 of the WHO guideline [216]), these LP cells generally lack
CD30 staining. Also, the predominance of large cells helps to rule out NLPHL. Thus these features all together offer insights into the differential diagnosis of DLBCL (see Chapter 10 of the
WHO guideline [216]). The second DLBCL associated subgraph group is again highly consistent
with the current pathologic definition of DLBCL and in this group the additional feature of monotypic light chain expression is identified. This group appears to be directed towards the identification of the activated B cell-like subtype of DLBCL, which is CD10 negative. The third
DLBCL associated subgraph group echoes the characteristic features of DLBCL: diffuse infiltrate of neoplastic cells, expression of common B-cell lineage antibodies, and monotypic immunoglobulin expression. The second and third groups also reflect the mixed expression levels of
BCL2 in DLBCL. The fourth DLBCL associated subgraph group states the following interesting
facts: Ki67 proliferation index is moderately high. Note that when discretizing percentages, we
choose multiple dichotomy thresholds with a step size of 10%. Thus collectively the subgraphs
on Ki67 proliferation index point out that the index is moderately high in DLBCL. This in addition to the positivity of CD20 and CD79a, and the monoclonality of immunoglobulin light chains
collectively associate with the differential diagnosis of DLBCL (see Chapter 10 of the WHO
guideline [216]).
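
The percentage discretization mentioned above can be illustrated with a short sketch. The strict inequality, the threshold range from 10% to 90% and the function name are assumptions made for illustration; the text only states that multiple dichotomy thresholds with a step size of 10% are used.

```python
def dichotomy_features(percent, step=10):
    """Return one 'greater than t%' feature for every threshold t that the value exceeds."""
    return [f"greater than {t}%" for t in range(step, 100, step) if percent > t]

# A Ki67 proliferation index of 75% fires the thresholds 10% through 70%
ki67_features = dichotomy_features(75)
```

Under this scheme a single reported value contributes to several threshold features at once, which is why both "greater than 60%" and "greater than 70%" can appear in the same subgraph group.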
For the follicular lymphoma cluster as shown in Table 4-5, the first associated subgraph group is
consistent with the fact that follicular lymphoma is typically composed of both centrocytes
(small cells) and centroblasts, and in bone marrow biopsies the lymphoma characteristically localizes to the paratrabecular region in bone marrow and may spread into the interstitial area (see
p.222 of the WHO guideline [216]). The second follicular lymphoma associated subgraph group
is consistent with frequent BCL2 overexpression, accompanying sclerosis, and enlargement and
effacement in the architecture of lymph nodes in the setting of follicular lymphoma. The third
follicular lymphoma associated subgraph group summarizes typical immunophenotypic features
such as lack of expression for the cell surface marker CD5, and mixed expression levels of CD10
(together with the first and second follicular lymphoma associated subgraph groups) and CD23,
all of which are consistent with Table 8.01 in the WHO guideline [216]. The fourth follicular
lymphoma associated subgraph group reveals characteristic morphological features including
dense infiltration of small lymphoid cells, the presence of cleaved centrocytes, and the staining
of cells in follicular dendritic pattern (see p.220 of the WHO guideline [216]).
For the Hodgkin lymphoma cluster as shown in Table 4-6, the first associated subgraph group
correctly identifies the morphological feature of the large neoplastic Reed-Sternberg cells that
are usually multilobated and stain positively for CD15 (see p.327 of the WHO guideline [216]).
The second Hodgkin lymphoma associated subgraph group extracts additional essential hematopathologic features for the malignant cells of Hodgkin lymphoma: CD30 positivity, CD15 positivity, CD20 negativity, and the appearance suggestive of Reed-Sternberg cells, which often express PAX5 and occur with histiocytes (see p.328 of the WHO guideline [216]). The third Hodgkin lymphoma associated subgraph group is mostly consistent with the nodular sclerosis subtype
of classical Hodgkin lymphoma, where the lymphoma contains Reed-Sternberg cells as well as a
microenvironment of non-neoplastic inflammatory cells, the lymph nodes show a nodular growth
pattern, collagen bands often surround nodules, and necrosis may occur (see p.330 of the WHO
guideline [216]). The fourth Hodgkin lymphoma associated subgraph group is mostly consistent
with the subtype of NLPHL, in that large neoplastic cells (lymphocyte predominant cells or LP
cells) are positive for CD45, OCT2, PAX5, and immunoglobulin light (kappa and/or lambda)
chains. The subgraph group is also consistent with the co-occurrence of LP cells and CD3 positive T-cells (see p.324 of the WHO guideline [216]).
Follicular
0.0308
0.0196
0.0171
0.0149
0.0127
1st Subgraph Group
interstitial lymphoid aggregates
predominantly small ... cell
paratrabecular lymphoid aggregates
focal
Follicular 2nd Subgraph Group
0.0583
0.0213
0.0201
0.0091
0.0063
nodal architecture ... effaced
B-cells co-expressing BCL2, CD10
biopsy of lymph node
sclerotic tissue
lymph node architecture effaced by ... follicular proliferation
cells in the follicles
0.0117 large paratrabecular lymphoid aggregates 0.0061 sections show enlarged lymph nodes
diffuse infiltrate of small lymphoid cells 0.0059 cell with reduced size
infiltrate consisting of ... lymphoid cells 0.0055 sections show ... lymph nodes
CD10+/- B-cell population
0.0045 residual ... follicle center cells
0.0043 cells stain positively for ... BCL2
core needle biopsy
0.0021 flow cytometry demonstrate . . population
follicles contain ... centroblasts
Follicular 4th Subgraph Group
Follicular 3rd Subgraph Group
0.0642 lymphoid infiltration
0.0829 B-cells are negative for CD5
0.0269 atypical infiltration
0.0466 B-cells express
0.0107
0.0093
0.0080
0.0062
0.0050
0.0405 CD5-, ... , CD23
0.0315 negative for CD10
0.0267 dense lymphoid infiltration
0.0133 mucosa infiltration
0.0271 positive for CD23
0.0251 positive for CD10
0.0102 small lymphoid cells
0.0095 small lymphocytes
0.0148 positive for CD19, CD20, CD23
0.0060 containing... large atypical cells.
0.0041 positive for CD3
0.0024 show B-cells are positive for CD3. CD20
0.0018 CD5-, CD10- ... B-cells
0.0084 cleaved centrocytes
0.0082 diffuse infiltrate of small lymphoid cells
0.0060 cells ... in follicular dendritic pattern
0.0059 fibroadipose tissue
0.0044 dense infiltrate containing
lymphoid
cells
Table 4-5 Top higher-order feature groups associated with follicular lymphoma. Subgraphs are
translated to partial sentences. Partial sentences that are not mentioned in feature analysis are
grayed out.
Hodgkin 1st Subgraph Group
0.0362 large cells
0.0312 atypical cells
0.0303 large cells stain
Hodgkin 2nd Subgraph Group
0.0143 positive for CD30
0.0083 large cells are negative
0.0065 positive for CD15, CD30
0.0063 expressing PAX5
0.0263 positive for CD15
0.0063 large atypical cells
0.0196 scattered large ... cells
0.0117 infiltrate of large ... cells with lobated nuclei 0.0060 large cells are negative for CD20
0.0103
0.0064
0.0046
0.0042
0.0027
many large cells
large neoplastic cells
stain positively for CD15
multilobated ... cells
background contain ... lymphocytes
Hodgkin 3rd Subgraph Group
0.0233 necrosis
0.0142
0.0106
0.0099
0.0098
dense sclerosis
vaguely nodular pattern
collagen fibrosis
mixed inflammatory cells
0.0073 nodular pattern
0.0053 atypical infiltration
0.0043 collagen bands
0.0058
0.0058
0.0049
0.0040
0.0034
inflammatory cells
large cells are Reed-Steinberg like
rare cells are .. positive
histiocytes
irregular nuclei
Hodgkin 4th Subgraph Group
0.0237 positive for CD3
0.0209 B-cells positive for immunoglobulin lambda chains
0.0179 small CD3 positive lymphocytes
0.0169 CD3 positive T-cells
0.0140 B-cells expressing ... kappa and lambda light chains
0.0100 expression of B-cell antigens
0.0053 number of .. B-cells
0.0048 large atypical cells
0.0042 nodular lymphoid proliferation
0.0047 expressing CD45
0.0018 areas of vague nodularity
0.0017 cells ... with Reed-Sternberg forms
0.0025 positive for OCT2, PAX5
0.0020 many scattered ... T-cells
Table 4-6 Top higher-order feature groups associated with Hodgkin lymphoma. Subgraphs are
translated to partial sentences. Partial sentences that are not mentioned in feature analysis are
grayed out.
We note the advantage of using subgraph groups as features compared to using individual subgraphs as features. For example, in the third follicular lymphoma associated subgraph group,
standalone positivity or negativity on CD5, CD10, and CD23 may not be discriminative enough,
but collectively they offer medically important information favoring follicular lymphoma.
We next look into why the atomic feature groups as jointly discovered by SANTF help to better
group individual subgraphs, in order to validate our intuition that exploiting interactions between
both feature types is beneficial. Continuing from the analysis of important higher-order feature
groups, we give an analysis on word group distributions associated with individual subgraphs. In
the first DLBCL associated subgraph group in Table 4-4, the following subgraphs (partial sentences) are together ranked among the top subgraphs: "... large cells predominate ...", "... large cells stain for CD79a ...", "... large cells stain positively for CD20 ...", "... cells are CD30+, MUM1+ ...", "... large lymphoid cells ...", and "... atypical cells ...". By contrast, we did not find a
similar grouping in patterns generated by those baselines that have subgraphs as features (baselines 2 and 3 in Table 4-2; k-means clustering does not produce subgraph groups). The positivity
for the antigens CD79a and CD20 may associate with the scattered large LP cells in NLPHL, but
the group includes additional positive staining for MUM1 and CD30, which favors the differential diagnosis of DLBCL. We look into the above six subgraphs and identify word groups associated with each subgraph. Intuitively, such associations are expressed in the core tensor and one
can sum out the patient mode to explicitly associate a subgraph with the word groups (see
SANTF algorithm section on how to identify word groups associated with a specific subgraph
from the tensor factorization results). The associated word group distribution for each subgraph
is shown in Figure 4-4, and their correlation coefficients are shown in Figure 4-5. It becomes evident from Figure 4-5 that each of the subgraphs is correlated with at least one other subgraph
with a correlation coefficient above 0.5, indicating relatively strong correlation. Figure 4-4 gives
details on which word groups help to correlate subgraphs. For example, the word groups 10, 13,
16, 17, 26, 28, 33 and 52 help correlate subgraphs "... large cells stain positively for CD20 ..."
and "... large cells stain for CD79a ...". This illustrates the benefits of using word group distributions to correlate subgraphs. In summary, analysis of word groups suggests that adding the word
mode (including covered and contextual words) to the tensor and jointly learning the subgraph
groups and the word groups help to better capture the correlations between subgraph features.
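The correlation analysis described above can be sketched as follows. This is a simplified illustration, not the SANTF procedure itself: it assumes a hypothetical non-negative three-way array `weights[p][s][w]` (patient mode, subgraph, word group) standing in for the quantities obtained from the core tensor and factor matrices, sums out the patient mode, normalizes each subgraph's word group vector, and computes Pearson correlation coefficients between subgraphs. All names and the toy numbers are assumptions.

```python
import math


def word_group_distributions(weights):
    """Sum out the patient mode, then normalize per subgraph."""
    n_subgraphs = len(weights[0])
    n_word_groups = len(weights[0][0])
    distributions = []
    for s in range(n_subgraphs):
        totals = [sum(slice_[s][w] for slice_ in weights)
                  for w in range(n_word_groups)]
        norm = sum(totals)
        distributions.append([t / norm for t in totals])
    return distributions


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: 2 patient slices, 3 subgraphs, 4 word groups.
weights = [
    [[1, 0, 2, 0], [2, 0, 3, 0], [0, 3, 0, 1]],
    [[1, 0, 2, 0], [0, 0, 1, 0], [0, 1, 0, 1]],
]
dists = word_group_distributions(weights)
r_01 = pearson(dists[0], dists[1])  # subgraphs sharing word groups
r_02 = pearson(dists[0], dists[2])  # subgraphs on disjoint word groups
```

In this toy setting, subgraphs whose weight concentrates on the same word groups are strongly correlated, while subgraphs supported by disjoint word groups are negatively correlated, which mirrors how shared word groups link the CD20 and CD79a subgraphs.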
[Figure 4-4: six bar-chart panels, each titled "... word group dist" for one subgraph: "... large cells predominate ...", "... large cells stain positively for CD20 ...", "... large cells stain for CD79a ...", "... large lymphoid cells ...", "... atypical cells ...", and "... cells are CD30+, MUM1+ ...". Horizontal axis ticks run from 10 to 60 (word group index); vertical axis ticks run from 0 to 0.25.]
Figure 4-4 Word group distribution for six of the top subgraphs in the first DLBCL associated
subgraph group. For example, the word groups 10, 13, 16, 17, 26, 28, 33 and 52 help correlate
subgraphs "... large cells stain positively for CD20 ..." and "... large cells stain for CD79a ...",
as highlighted in light gray.
4.4 Discussion
Currently the selection of SANTF parameters such as core tensor size relies on cross validation.
We recognize the potential of using a non-parametric Bayesian approach to discover such parameters directly from data. For example, in the non-parametric Bayesian setting, each patient in
a dataset can be associated with hidden variables describing groups (causes) that are responsible
for generating the patient's data. Although there can be an infinite number of possible groups to
choose from, under proper prior distributions (e.g., specified using the Indian buffet process
[292]), only a finite number of groups would be selected. Care needs to be taken when defining
generative processes for multiple types of features to account for the fact that atomic features
aggregate into higher-order features and to allow for an efficient inference algorithm. Clearly,
the performance of SANTF depends on the nature of the relationships among the various modes
of the tensor. We suspect that there is an information-theoretic analysis that can shed light on
quantifying these relationships, where the suggested generative model could provide a basis for
such an analysis.
[Figure 4-5: upper-triangular matrix of pairwise correlation coefficients among the six subgraphs "... large cells predominate ...", "... large cells stain for CD79a ...", "... large cells stain positively for CD20 ...", "... large lymphoid cells ...", "... cells are CD30+, MUM1+ ...", and "... atypical cells ...".]
Figure 4-5 Correlation between six of the top subgraphs (partial sentences) in the first DLBCL
associated subgraph group. Only the upper triangular matrix is shown due to symmetry.
SANTF is currently computationally intensive. The tensor factorization on average takes 22
minutes on a computer with Intel Core 2 Duo P8600 and 8 GB RAM. The steps of document
preprocessing, including parsing, UMLS concept identification and graph/subgraph construction,
also take a considerable amount of time. We parallelize the computations into batches of 50 patients and run them on the pHPC clusters at Partners Health Care, which have 600 processing
cores in total and a maximum 100 core concurrency per user. The parallel pre-processing time is
under 30 minutes, which could be improved by parallelization into smaller batches on a larger
cluster. We also plan to explore parallelization and approximation techniques such as stochastic
gradient descent to speed up tensor factorization in future work.
Parsing challenges may arise with less formal clinical notes such as discharge summaries. For
example, many connecting parts of speech (conjunctions, articles, prepositions) may be elided,
which makes dependency parsing difficult for even statistical parsers. For less formal clinical
notes, we expect a hybrid form of NLP may work better. Namely, for longer sentences, graph
construction can be based on dependency parsing, while for shorter sentences, graph construction
can be based on co-occurrence of concepts. Choosing the threshold of longer vs. shorter sentences is non-trivial and may depend on the characteristics of clinical notes; we intend to explore
such trade-offs in future work. On the other hand, different institutions may have different clinical documentation systems and styles. Such generalizability challenges are partly addressed by
our clinical text subgraph mining approaches [87] such as using UMLS concepts as subgraph
nodes and ignoring dependency types, which can mitigate the impact of the terminology and
style differences between institutions. Using atomic features to correlate higher-order features as
done by SANTF also helps connect higher-order features whose differences are mainly in writing style.
4.5 Conclusions
We proposed a novel unsupervised framework of subgraph augmented non-negative tensor factorization (SANTF), which can automatically generate machine learning models that are easily
interpretable to clinicians. SANTF can jointly model the interactions among different types of
features by integrating them into the learning objective. We applied SANTF to unsupervised
learning tasks on clustering lymphoma subtypes based on narrative text from pathology reports.
We established nine baselines with widely-used NMF and k-means clustering methods. For each
NMF or k-means configuration, the first baseline explores the atomic features. The second baseline explores the higher-order subgraph features. The third baseline explores both types of features but not their correlations. Experimental evaluation demonstrated that SANTF significantly
outperforms all nine baselines, in particular, by over 10% margins in average F-measure over all
baselines. A closer look at the subgraph groups that are generated by SANTF offers more clinical
insights about lymphoma subtypes than atomic features or even standalone subgraphs. We also
found that the atomic feature groups as jointly discovered by SANTF help to better correlate individual subgraphs, validating our intuition that exploiting interactions between different feature
types is beneficial.
Chapter 5.
Subgraph Augmented Non-negative Matrix
Factorization (SANMF) in Modeling ICU Physiologic
Time Series
This chapter describes an extension of subgraph mining and factorization algorithms applied to
modeling ICU physiologic time series.
All monitors come with a trade-off between sensitivity and specificity. In the ICU setting, sensitivity is often favored over specificity, thus alerts based on whether the value of a single parameter crosses a threshold may result in a prevalence of false alarms [293]. A better trade-off between
sensitivity and specificity can be achieved if a model can consider multivariate time series comprehensively [294]. The assumption is that more volatile patients display concerted progressions
in multiple physiologic variables, which are associated with high risk of mortality. To this end,
data mining can play an important role in exploring archived ICU physiologic time series in order to build calibrated clinical models for mortality risk stratification. Such models should be
able to detect clinical state changes over a certain period of time, in order to help clinicians interpret ICU data more intuitively and more accurately.
Models that appear as "black boxes" to clinicians, however, form a poor basis for decision support. We need to be able to translate complex meaningful clinical events to detailed features
needed by a machine learning model. For example, vital measurements and laboratory test values
fluctuate as time progresses (e.g., a patient's glucose level may increase from 158 mg/dL to 189
mg/dL after 53 minutes then fall to 172 mg/dL after another 62 minutes). We refer to these
events as temporal trends. In contrast, the standalone numerical measurements (e.g., 158 mg/dL,
189 mg/dL and 172 mg/dL for glucose level) are snapshots with respect to single time points.
Intuitively, the higher-order features are more expressive and informative, but their extraction is
often difficult and involves manually pre-specifying rules or patterns and matching against time
series [97,103,259]. In contrast, snapshot measurements have been widely used due to their simple extraction and robust statistical properties. However, snapshot measurements are less informative and interpretable than higher-order features. In addition, higher-order features need to
be considered in groups, as the underlying pathophysiologic evolution of a patient (e.g. kidney
failure) usually manifests itself through multiple physiologic variables (e.g., abnormalities in
glomerular filtration rate, blood urea nitrogen, creatinine, etc.).
5.1 Background
Decision support tools in the ICU are receiving growing attention as critical care has become an
increasingly multidisciplinary team effort. How to integrate the entire scope of information for
improving patient outcome is complex due to ongoing evolution in clinical evidence supporting
the involvement of an expanding set of physiologic variables such as fluid composition and balance [295]. Such integration calls for automated and informative tools to model the effects of
physiologic variables on patient outcome. We focus on mortality as an outcome proxy. Previous
work in correlating ICU physiology with mortality risk generally falls into two categories. Score-based methods (e.g., SAPS-II [40], APACHE [39] and SOFA [38]) assume a resource-limited
ICU setting and aim to select a limited set of commonly measured clinical predictors that can be
aggregated into a severity score and best associated with a particular outcome. Other work adopted
a multivariate data mining perspective. Hug et al. [44] considered a comprehensive set of physiologic measurements from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-II) clinical dataset [296] and manually defined a set of trend patterns (e.g., slope of a measurement during a particular time interval). However, physiologic measurements and trends were
treated as independent features in the regression model, without explicitly accounting for the fact
that multiple measurements and trends could be attributed to the same underlying pathophysiologic states. Cohen et al. [43] used hierarchical clustering to extract 10 clusters as clinically relevant patient states from physiologic measurements, over a set of 17 patients and 14 measurements. Kshetri [297] experimented with k-means clustering and faced scalability challenges on
the MIMIC-II dataset, with over 50 physiologic variables and tens of thousands of patients.
Quinn et al. [41] developed a factorial switching linear dynamical system to model the patient
states underlying 8 physiologic measurements. However, these multivariate data mining models
require advice from practicing physicians on cluster numbers or switching states and are difficult
to scale to many more physiologic variables. Joshi et al. [45] manually clustered the physiologic
measurements into organ specific patient states by associating each measurement with the status
of a particular organ, and achieved a state-of-the-art performance on 30-day mortality prediction
from the MIMIC-II dataset. Despite partially addressing the feasibility challenge, such manual
feature clustering can be a subjective call. For example, a low hematocrit may be linked to blood
loss, bone marrow problems, or kidney problems, among a variety of other problems. In addition,
the manual clustering is on single time point measurements. Addressing the unanswered questions in previous research, we study how to group temporal progression trends instead of single
time point measurements, and how such a grouping can be performed in an evidence-driven
fashion over a comprehensive set of physiologic variables. We represent the temporal trends as
graphs and this preprocessing approach falls into the category of time-series symbolization
methods that discretize time series into sequences of symbols and attach meaning to the symbols
[298,299]. Our approach differs from existing work in that it calculates a customized z-score to
perform measurement-axis discretization and it handles time series with irregularly sampled time
points.
5.2 Methods
In this section, we develop an unsupervised feature learning algorithm in order to build machine
learning models that are interpretable to clinicians. The model adopts non-negative matrix factorization to discover groups of subgraph-encoded temporal progression trends; hence the name
subgraph augmented non-negative matrix factorization (SANMF).
5.2.1 Workflow of SANMF
We first outline the workflow of the SANMF algorithm in Figure 5-1. ICU physiologic time series are first converted to graph representations. The graph representation is derived by discretizing time and measurement axes for physiologic measurements, as shown in Figure 5-2. We use
frequent subgraph mining (FSM) [190] tools to collect important subgraphs where the subgraphs
are identified as common temporal trends of the physiologic variables. Examples of temporal
trends for physiologic time series are shown in Figure 5-2. With such representations, subgraphs
encode temporal trends, and we use "subgraphs" and "temporal trends" interchangeably within
the context of this chapter. We model the correlation between the subgraphs, and apply non-negative matrix factorization to discover groups of subgraphs and patients, and then train a logistic regression model to predict the mortality risk using subgraph groups as features. We next
explain each step in more detail.
[Figure 5-1: flow chart with blocks labeled "Time Window Selected Time Series", "Computing z'-score", "Organ Level Summarization", "RDF", "Normalized Time Series", "Discretization & Interpolation", "Graphs", "Frequent Subgraph Mining", "Subgraphs", "NMF", "Logistic Regression Based Classifier", and "Mortality Risk Stratification".]
RDF: Radial Domain Folding
NMF: Non-negative Matrix Factorization
Figure 5-1 The workflow of subgraph augmented non-negative matrix factorization
(SANMF). We focus on the physiologic time series from the second half of the first day, balancing the trade-off between early detection of clinical deterioration and data availability. In the
flow chart, shaded blocks indicate comparison models. The block with bold fonts corresponds to
the features produced by the SANMF model.
5.2.2 Representing time series as graphs
Figure 5-2 shows the steps before matrix factorization, with three example variables. To test the
ability to detect deterioration early on, we focus on the data from the second half of the first day
after patients' admissions to an ICU. We exclude the first half of the first day because many
measurements are not yet available in that time period. In Figure 5-1, it becomes clear that the
time series of different variables may have different sampling times and sampling frequencies, so
we preprocess the time series. We first fill in the missing values, using a sample-and-hold heuristic, which was also shown to be effective by previous work on MIMIC-II data [44,45]. More advanced imputation algorithms such as EM or Gaussian process inference may lead to more accurate estimation of the missing values. For this task, we stick to sample-and-hold, as we compare our model with a state-of-the-art system [45] that also followed the same heuristic on MIMIC-II data. We next convert time series into graphs so that multivariate temporal patterns can be
automatically mined. To this end, we perform discretization on both the time axis and the measurement axis. With the filled and sliced time series, we first compute a customized z-score (z'-score) where we define everything within the reference range of a certain test to be 0 [45]. For a
physiologic variable x, let x_l and x_h be the low and high ends of the reference range, let j index
different ICU patient stays, and let μ(x) and σ(x) be the mean and standard deviation of variable x
across different ICU patient stays. The z'-score is calculated using the following equations:

\[ z(x_j) = \frac{x_j - \mu(x)}{\sigma(x)} \tag{5-1} \]

\[ z'(x_j) = \begin{cases} 0 & \text{if } z(x_l) \le z(x_j) \le z(x_h) \\ z(x_j) - z(x_h) & \text{if } z(x_j) > z(x_h) \\ z(x_j) - z(x_l) & \text{if } z(x_j) < z(x_l) \end{cases} \tag{5-2} \]
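As a minimal sketch of the z'-score in equations (5-1) and (5-2), the function below takes the reference range and the population mean and standard deviation as given. The function name `z_prime` and all numbers in the usage lines (reference range 70 to 110, mean 140, standard deviation 40) are illustrative assumptions, not clinical reference values or statistics from the dataset.

```python
def z_prime(value, low, high, mean, std):
    """Customized z-score: 0 inside [low, high], signed distance
    (in standard deviations) to the nearer range end otherwise."""
    z = (value - mean) / std          # equation (5-1)
    z_low = (low - mean) / std
    z_high = (high - mean) / std
    if z > z_high:
        return z - z_high             # above the reference range
    if z < z_low:
        return z - z_low              # below the reference range
    return 0.0                        # within the reference range


# Hypothetical glucose-like variable (all parameters are made up).
above_range = z_prime(158, low=70, high=110, mean=140, std=40)
in_range = z_prime(100, low=70, high=110, mean=140, std=40)
below_range = z_prime(50, low=70, high=110, mean=140, std=40)
```

Because the mean cancels in the differences, the non-zero cases reduce to the distance from the violated range end divided by the standard deviation.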
Each individual measurement is then discretized based on whether its value is within the reference range (label 0), within one σ outside the reference range (label ±1), or beyond one σ outside the reference range (label ±2). Such discretization is essentially a thresholded round-up
from equation (5-2). We discretize the time axis by linearly interpolating the time series and resampling at regularly spaced time intervals. We determined empirically (by cross-validation over
possible choices including 2, 4, or 6 hour intervals) that two-hour time intervals were best in our
experiment. After discretization, we generate the time series graph for each measurement by
connecting the discretized measurement values that are adjacent on the time axis. We use three
types of edges to distinguish changes between adjacent nodes, namely up, down and same, and to
encode partial directionality in temporal progression. After sample-and-hold, 27.5% of the
measurements are still missing. After discretization and graph conversion, the
corresponding nodes are labeled as missing values. Note that the signal fluctuation rates vary
across different physiologic variables. We intend to pursue alternative and adaptive resampling
frequencies in future work.
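The conversion from an irregularly sampled series to a labeled path graph can be sketched as below. Several details are assumptions for illustration: z'-scores are interpolated before labeling, values outside the observed time span are held at the nearest observation, missing-value nodes are not modeled, and the function names and toy numbers are hypothetical.

```python
from bisect import bisect_right


def resample_linear(times, values, grid):
    """Linearly interpolate a series (sorted, distinct times) on a grid."""
    resampled = []
    for t in grid:
        if t <= times[0]:
            resampled.append(values[0])
        elif t >= times[-1]:
            resampled.append(values[-1])
        else:
            i = bisect_right(times, t)
            t0, t1 = times[i - 1], times[i]
            v0, v1 = values[i - 1], values[i]
            resampled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return resampled


def label(z_prime_value):
    """Thresholded round-up of a z'-score to a label in {-2,...,2}."""
    if z_prime_value == 0:
        return 0
    magnitude = 1 if abs(z_prime_value) <= 1 else 2
    return magnitude if z_prime_value > 0 else -magnitude


def to_path_graph(labels):
    """Connect time-adjacent nodes with up (u), down (d), or same (s) edges."""
    edges = []
    for a, b in zip(labels, labels[1:]):
        kind = "s" if a == b else ("u" if b > a else "d")
        edges.append((a, kind, b))
    return edges


# Toy z'-scores at irregular times (minutes since ICU admission).
times = [720, 800, 1000, 1300]
z_primes = [0.0, 0.6, 1.8, -0.4]
grid = list(range(720, 1441, 120))      # two-hour intervals, hours 12 to 24
labels = [label(z) for z in resample_linear(times, z_primes, grid)]
edges = to_path_graph(labels)
```

Each resulting edge triple reads like the trend notation used in the figures, for example a node labeled 1 followed by an "up" edge to a node labeled 2.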
5.2.3 Frequent subgraph mining
With time series graphs, we perform frequent subgraph mining to produce the time series trends
that are repeated in the dataset. The intuition is that similar patients undergo similar physiologic
trajectories during their ICU stays. We refer the reader to section 3.4.4 for definition and intuition on frequent subgraph mining. In this chapter, we use the frequent subgraph miner MoSS
[190] with frequency threshold empirically chosen (by cross validation on choices including 5,
10 or 15 as threshold) to be 10 (i.e., subgraphs must occur at least 10 times in the dataset). Example frequent subgraphs are shown in Figure 5-2. We require that frequent subgraphs must not
have missing value nodes. As we are focusing on deterioration (abnormality) detection, we also
exclude subgraphs that start with multiple zero labeled nodes or end with multiple zero labeled
nodes.
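The mining and filtering rules above can be illustrated with a simplified stand-in for a frequent subgraph miner. This sketch does not use MoSS. It relies on the observation that each time series graph is a path, so a connected subgraph corresponds to a contiguous run of node labels (edge types follow from consecutive labels). Counting support as the number of series containing a trend, the `None` marker for missing nodes, and the minimum length of two nodes are assumptions.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing-value node


def contiguous_runs(labels, min_len=2):
    """Yield every contiguous run of node labels with at least min_len nodes."""
    n = len(labels)
    for start in range(n):
        for stop in range(start + min_len, n + 1):
            yield tuple(labels[start:stop])


def passes_filters(run):
    """Reject runs with missing nodes or with multiple leading/trailing zeros."""
    if MISSING in run:
        return False
    if run[:2] == (0, 0) or run[-2:] == (0, 0):
        return False
    return True


def frequent_trends(series, min_support=10):
    """Return {run: support} for runs found in at least min_support series."""
    support = Counter()
    for labels in series:
        support.update(set(contiguous_runs(labels)))
    return {run: count for run, count in support.items()
            if count >= min_support and passes_filters(run)}


# Toy dataset with a lowered threshold so the example stays small.
series = [[0, 1, 2, 2]] * 3 + [[0, 0, 1, MISSING]] * 3
trends = frequent_trends(series, min_support=3)
```

In the toy data, the run (0, 1) is supported by all six series, while runs that begin with two zero-labeled nodes or contain a missing node are dropped even though they meet the support threshold.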
[Figure 5-2: three example time series (blood urea nitrogen, mean arterial pressure, and temperature) with time axis ticks from 840 to 1440, shown through the steps "Computing z-score", "Interpolation and discretization", "Translating graphs", and "Frequent Subgraph Mining". Graphs are drawn as node labels joined by edges marked u, d, and s, for example "0 -d- -1" and "1 -s- 1 -u- 2 -s- 2".]
Figure 5-2 Graph generation and subgraph mining in SANMF. Shown in this figure is the graph
representation for three example ICU physiologic time series. BUN is blood urea nitrogen. MAP
is mean arterial pressure. Example frequent subgraphs are shown after the frequent subgraph
mining steps. The figure shows three separate subgraphs in the end.
The above frequent subgraph mining steps generate 5534 frequent subgraphs. Among them,
smaller subgraphs are subisomorphic to other larger frequent subgraphs. As noted in section
3.4.5, when a larger subgraph is frequent, all of its subgraphs are necessarily also frequent. Furthermore, if a patient case has a larger subgraph, then both the larger and smaller subgraphs are
counted for that patient. This may cause the signal from larger subgraphs to be overwhelmed by
the signal from many smaller subgraphs. Therefore, we kept only the larger subgraphs in such
pairs when a patient case has both. Note that such filtering is different from the notion of mining
maximal frequent subgraphs, where only subgraphs that are not a part of any other frequent subgraphs at all are collected [300]. As noted in section 3.4.5, it is cost prohibitive to perform a full
pairwise check because the subisomorphism comparison between two subgraphs is already NP-complete [100], and a pairwise approach would ask for over 15 million such comparisons for our
task. In our case, we only need to compare subgraph pairs from the same physiologic variable.
Furthermore, subgraph subisomorphism comparison can be simplified into string matching, as
our subgraphs are essentially sequences. Combining the two observations, the algorithm for determining the subisomorphism relation among frequent subgraphs is shown in Table 5-1, which
is a variant of the one shown in Chapter 3. The above filtering steps in fact exclude some small
subgraphs completely, reducing the final number of subgraphs to 5387.
Subisomorphism for set of subgraphs
input:  S - set of subgraphs
output: m - adjacency matrix of subisomorphism among subgraphs in S
1   categorize subgraphs in S according to their variables
2   foreach v in variables:
3       stable sort S_v in ascending order of number of nodes
4       for i = 1 to length(S_v) - 1
5           for j = i+1 to length(S_v)
                // ids is the index of smaller subgraph in S
                // idb is the index of bigger subgraph in S
6               ids = S_v[i]
7               idb = S_v[j]
8               if subStringMatch(S[ids], S[idb])
9                   m[ids, idb] = 1
10  return m
Table 5-1 A simplified algorithm for determining subisomorphism relation among time series
subgraphs. The simplification mainly comes from variable partition (lines 1-2) and reduction of
subisomorphism to substring match (line 8) for time series subgraphs.
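A runnable sketch of the procedure in Table 5-1 is given below. It represents each subgraph as a hypothetical pair of variable name and tuple of node labels, and it returns the set of (ids, idb) index pairs instead of a dense adjacency matrix; both choices, and the helper names, are assumptions made for illustration.

```python
from collections import defaultdict


def is_contiguous_subsequence(small, big):
    """Substring match on label sequences (stands in for subStringMatch)."""
    k = len(small)
    return any(big[i:i + k] == small for i in range(len(big) - k + 1))


def subisomorphism_pairs(subgraphs):
    """Return {(ids, idb)} where subgraph ids is contained in subgraph idb.

    subgraphs: list of (variable, labels) with labels given as a tuple.
    """
    # Lines 1-2: partition subgraph indices by physiologic variable.
    by_variable = defaultdict(list)
    for index, (variable, _) in enumerate(subgraphs):
        by_variable[variable].append(index)

    pairs = set()
    for indices in by_variable.values():
        # Line 3: stable sort by number of nodes (list.sort is stable).
        indices.sort(key=lambda index: len(subgraphs[index][1]))
        # Lines 4-9: compare each smaller subgraph with every later one.
        for a in range(len(indices) - 1):
            for b in range(a + 1, len(indices)):
                ids, idb = indices[a], indices[b]
                if is_contiguous_subsequence(subgraphs[ids][1],
                                             subgraphs[idb][1]):
                    pairs.add((ids, idb))
    return pairs


# Toy subgraphs: only the BUN run (1, 2) is contained in a larger BUN run.
subgraphs = [
    ("BUN", (1, 2)),
    ("BUN", (0, 1, 2, 2)),
    ("MAP", (1, 2)),
    ("BUN", (2, 1)),
]
pairs = subisomorphism_pairs(subgraphs)
```

Note that the MAP subgraph with the same labels is never compared against the BUN subgraphs, which is where the savings from variable partitioning come from.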
5.2.4 SANMF algorithm
Non-negative matrix factorization (NMF) has been a highly effective unsupervised method [264]
to cluster similar patients [265] and sample cell lines [266], to identify subtypes of diseases [267]
and to learn genetic expression patterns [269,272,273,301]. However, none of these approaches
model the correlations among temporal trends, and some do not even consider temporal trends.
We observe that a patient's underlying pathophysiologic evolution usually manifests itself
through a group of temporal progression patterns of multiple physiologic variables. This motivates us to use NMF to group time series subgraphs by factorizing the patient-by-subgraph count
matrix, hence the name subgraph augmented NMF (SANMF). A schematic view of SANMF is
shown in Figure 5-3. Let M be the patient-by-subgraph count matrix of dimension P x S, where
P is the number of patients and S is the number of subgraphs. NMF approximates M using two
lower-rank matrices U (of dimension P x S_g, where S_g is the number of subgraph groups) and
V (of dimension S_g x S), as formalized in the following equation.

\[ \min_{U,V} \|M - UV\|_F^2 \quad \text{s.t. } U \ge 0,\ V \ge 0 \tag{5-3} \]

where \|\cdot\|_F^2 indicates the squared Frobenius norm (the sum of the squares of all entries in a matrix) and
U \ge 0 means U being entry-wise non-negative. Intuitively, each row of V gives the composition
of each subgraph group, and each row of U reveals how each patient may be viewed as having a
mixture of subgraph groups (approximating patterns of pathophysiologic evolution).
As we focus on count data that is by definition nonnegative, we use NMF instead of other grouping methods such as k-means or principal component analysis (PCA) that do not have a built-in
nonnegativity constraint. The subgraph subisomorphism filtering step in Table 5-1 weakens the
correlation between frequent subgraphs to a certain degree because the filtering step prevents
certain subgraph co-occurrences from being counted. To systematically capture the subgraph
correlation, we include single-node subgraphs in the matrix M, but multiply counts of these singleton
subgraphs by 0.5. Empirically, the factor 0.5 worked well in balancing the trade-off between preventing overwhelming signals from singleton subgraphs and capturing correlations between
other frequent subgraphs. The NMF solver we used is the projected gradient NMF [302]
implemented in Scikit-learn [303]. We used nonnegative double singular value decomposition as
a deterministic initialization method [304]. We also enforced sparsity on subgraph groups [305]
so that a group has only a limited number of non-zero weighted subgraphs and places most
weight on only a few subgraphs, which is easier to interpret for clinicians.
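The singleton down-weighting can be sketched in a few lines of Python. The helper name `weight_singletons` and the representation of the count matrix as a list of rows are assumptions for illustration; only the factor 0.5 comes from the text.

```python
def weight_singletons(counts, subgraph_sizes, singleton_weight=0.5):
    """Scale the count columns of single-node subgraphs by `singleton_weight`.

    counts: patient-by-subgraph count matrix as a list of rows.
    subgraph_sizes: number of nodes of each subgraph (one entry per column).
    """
    return [
        [c * singleton_weight if size == 1 else float(c)
         for c, size in zip(row, subgraph_sizes)]
        for row in counts
    ]


# Column 0 is a singleton subgraph, column 1 a three-node subgraph.
print(weight_singletons([[2, 3], [0, 4]], subgraph_sizes=[1, 3]))
```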
[Figure 5-3 graphic: the patient-by-subgraph count matrix with panels labeled Patient Groups, Subgraph Group 1, and Subgraph Group 2; the example subgraphs involve ArtBE, SBP, MAP, BUN, and Temperature.]
Figure 5-3 Subgraph augmented non-negative matrix factorization model. In the figure, M is the
patient-by-subgraph count matrix. Below M are some example subgraphs. We also show example subgraph group 1 and subgraph group 2 after factorization. It is often desirable to have some
subgraph groups indicate a general progression to the better state (e.g., subgraph group 1), or to
the worse state (e.g., subgraph group 2).
5.2.5 Feature group discovery and association using SANMF
In SANMF, the row vectors in the subgraph factor matrix V specify the grouping of subgraphs. Such groupings can be viewed as mixtures of subgraphs, as they allow sharing of a subgraph among different groups as specified by its fractional weights across groups. In Figure 5-3,
two example subgraph groups are shown. The top ranked subgraphs in subgraph group 1 indicate
a general progression to an improved state. The top ranked subgraphs in subgraph group 2 indicate a general progression to a worse state. Namely, Blood Urea Nitrogen (BUN) increasing
from 1 to 2 is worsening, as is Mean Arterial Pressure (MAP) decreasing from 0 to -1 or -2.
Temperature changing from 1 to -1 can be good or bad, depending on the risks of high vs. low
temperatures. But the overshoot of temperature change likely suggests problematic conditions.
The motivation is to identify some subgraph groups that can indicate concerted progression patterns
of physiologic variables as driven by the patient's underlying pathophysiologic evolution.
The subgraph groups as specified in V are used as features in logistic regression with the instance-feature matrix being U. Using the trained regression model, we rank the subgraph groups
by their regression coefficients and focus on the top subgraph groups that are associated with
high mortality risk.
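A minimal sketch of this ranking step follows. It fits an unregularized logistic regression by plain gradient descent, an assumption made only to keep the example self-contained; the text does not specify the solver or any regularization, and the function names and toy data are hypothetical.

```python
import math


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit logistic regression weights and intercept by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def rank_groups_by_risk(U, died):
    """Return subgraph group indices ordered from highest to lowest coefficient.

    U: patient-by-group matrix from the factorization (one row per patient).
    died: 1 if the patient died within the outcome window, else 0.
    """
    w, _ = fit_logistic(U, died)
    return sorted(range(len(w)), key=lambda g: w[g], reverse=True)


# Toy data: patients loading on group 0 died, those loading on group 1 survived.
U = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(rank_groups_by_risk(U, died=[1, 1, 0, 0]))
```

Ranking by raw coefficients is only meaningful when the columns of U are on comparable scales, so in practice one may want to normalize them first.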
5.2.6 Evaluating the groups discovered by SANMF
Because there is no innate way to determine whether the groupings of subgraphs discovered by
SANMF are good or poor, we evaluate their utility as features, abstracted from the raw data, in a
prediction model. We assume that good features will improve prediction and will give us some
insights into which temporal progression patterns are indicative of our predicted endpoint.
We use physiologic time series from the MIMIC-II Database [296]. The time series include laboratory test values and physiologic measurements captured from patients monitored in the ICU
at Beth Israel Deaconess Medical Center (BIDMC), as shown in Table 5-3. Our dataset is a subset of the one used by Joshi et al. [45] (patients from the year 2000 to 2008); we only include
those patients who have at least one day of time series data. The outcome we predict is whether
a patient survives or dies in the ICU or within 30 days after ICU discharge, as shown in Table
5-2, from data available about each patient during the period between 12 and 24 hours after their
admission to the ICU. Choosing a relatively long time horizon emphasizes our motivation to detect clinical deterioration early on. We partitioned the cases equally, stratified by mortality, into a
training set (3932 cases total) and a testing set (3931 cases total).
Patient ICU Stays | Number of Cases | Number of Training Cases | Number of Test Cases
Mortality ≤ 30 days | 788 | 383 (9.7%) | 405 (10.3%)
Mortality > 30 days or alive | 7075 | 3549 (90.3%) | 3526 (89.7%)
Table 5-2 Statistics of experiment data. The table includes the patients' 30-day mortality distribution in ICU (both absolute numbers and percentages). The dataset is split equally into a training set and a test set.
To evaluate the effectiveness of SANMF in abstracting raw data into more highly predictive features, we use five-fold cross-validation on only the training set to choose the number of subgraph
groups, and use these subgraph groups as the independent features to train a logistic regression
predictive model. We then evaluate the model on the held-out test set, and compare its performance against the following models: (a) as a baseline, 30-day mortality prediction by a logistic
regression model using an approximation of the SAPS II score [44] and its log-transformation as
predictors, where the SAPS II variable "chronic diseases" is approximated using ICD9 codes and
the variable "type of admission" is approximated using the ICU service type; (b) a state-of-the-art organ-level summarization model [45] modified to account for our use of a 12-hour time
window rather than a snapshot of time points by replacing a binary representation of whether an
organ system is in a specific state by a count of the number of times it is in that state during the
12 hours; (c) the D,I-measure based on our discretized (D) and interpolated (I) data values,
where we also count the number of times each physiologic variable took on a discretized value
during the 12 hours; and (d) a model based on treating each of our common subgraphs as a separate feature. The comparison models are shaded in Figure 5-1. We compare the Area Under the
ROC Curve (AUC) of our model against those of the other models.
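For reference, the AUC used to compare these models equals the probability that a randomly chosen positive case is scored above a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy scores are illustrative, and the quadratic pairwise loop is only suitable for small examples.

```python
def auc(scores, labels):
    """Area under the ROC curve from predicted scores and binary labels (1 = positive)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Count positive-negative pairs ranked correctly; ties contribute one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four positive-negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```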
Variable | Description
Age | Age of the patient upon admission
Airway Resistance | The resistance of the respiratory tract to airflow during inspiration and expiration
Albumin | Albumin in blood
ALT | Alanine aminotransferase in blood
Arterial Base Excess | Excess in the amount of base present in arterial blood
Arterial CO2 | Arterial carbon dioxide
Arterial PaCO2 | Arterial carbon dioxide tension
Arterial PaO2 | Arterial oxygen tension
Arterial pH | pH level in arterial blood
AST | Aspartate aminotransferase in blood
AST/ALT | Aspartate aminotransferase / alanine aminotransferase
BUN | Blood urea nitrogen
BUN/Creatinine | Blood urea nitrogen / Creatinine
Ca | Calcium level
Cardiac Index | Relates the cardiac output (CO) from left ventricle in one minute to body surface area
Central Venous Pressure | Blood pressure in the thoracic vena cava
Cl | Chloride level
Creatinine | Level of creatinine in the blood
Delivered Tidal Volume | Air volume of lung without extra effort
Diastolic blood pressure | Minimum blood pressure during heartbeat
Direct bilirubin | Level of bilirubin conjugated with glucuronic acid
eGFR | Estimated glomerular filtration rate
FiO2Set | Fraction of inspired oxygen set on ventilator
Glasgow Coma Scale | Glasgow coma scale
Glucose | Glucose level
Heart Rate | Heart rate per minute
Hematocrit | Hematocrit level
Hemoglobin | Hemoglobin level
INR | Prothrombin time international normalized ratio
Ion Calcium | Ion Calcium level
K | Potassium level
Lactate | Lactate level
MAP | Mean arterial pressure
Mg | Magnesium level
Minute Ventilation | Volume of gas exchanged from lung per minute
Na | Sodium level
PaO2/FiO2 | Partial pressure arterial oxygen / Fraction of inspired oxygen
Partial Thromboplastin Time | Time it takes for blood to clot
PEEPSet | Positive end-expiratory pressure set on ventilator
PIP | Peak inspiratory pressure
Plateau Pressure | Pressure applied (in positive pressure ventilation) to the small airways and alveoli
Platelets | Platelets count
Prothrombin Time | Time it takes for plasma to clot
RBC | Red blood count
Respiratory Rate | Respiratory rate per minute
RSBI | Rapid shallow breathing index*
RSBI Rate | Rapid shallow breathing index rate change
SaO2 | Saturation of arterial oxygen
Systolic blood pressure | Maximum blood pressure during heartbeat
Temperature | Body temperature
Total Bilirubin | Level of bilirubin conjugated or unconjugated
tProtein | Total protein in the blood plasma
Urine/Hour/Weight | Urine output per hour per kg of body weight
WBC | White blood count
Table 5-3 Physiologic time series predictor variables from MIMIC-II dataset. Demographic information such as age is also included.
5.3 Results
5.3.1 Method validation on ICU patients' mortality risk prediction
When using NMF to identify latent groups of features and reduce data dimensionality, the number of groups needs to be empirically determined. We chose this parameter by 5-fold cross validation on the training data and considered a range of groups between 10 and 120 (at increments
of 10), as shown in Figure 5-4 (a). For each number of groups and for each of the cross-validation runs, we build our predictive model and evaluate it on the remainder of the training
data, averaging the resulting AUC from each of the runs. In addition to NMF, we also show the
performance if we use PCA instead to group subgraphs. Figure 5-4 (b) shows the corresponding
performances when evaluated on the held-out test data, for reference. Both methods show similar
AUCs; NMF in fact outperforms PCA on the held-out test evaluation, indicating that NMF is less
prone to overfitting than PCA due to its additional non-negativity constraints. It is worth emphasizing the built-in non-negativity constraint and the additive interpretation benefit that NMF has.
Namely, the weight of each subgraph in a group is non-negative and can be interpreted as its
contribution to the group. In the PCA setting, it is not intuitive how to interpret a negative weight
of certain subgraphs within a group. From Figure 5-4 (a), we see that the AUC quickly rises and
plateaus as the number of groups increases for NMF. The maximum AUC on 5-fold cross validation is attained at the group number 100, which is used when evaluating SANMF on the held-out
test data.
The performance results of SANMF, comparison models and the baseline on held-out test data
are shown in Figure 5-5. Comparing all the models and baseline, we can see that the SAPS II approximation has an AUC of 0.673, which is lower than what is generally reported for SAPS II in the
literature [44,45] (we discuss this and other related issues in Section 5.4). All the models that
abstract the measured data by discretizing and aggregating them perform better, each with an
AUC greater than 0.8. The predictive model based on our SANMF-derived subgraph feature
groups has the best performance, at an AUC of 0.848, modestly outperforming the next-best
model, based on abstraction by organ system (AUC of 0.827), by about 0.02 in AUC.
[Figure 5-4 graphic: two line plots, panels (a) and (b), of AUC (roughly 0.80 to 0.85) against the number of groups for the NMF and PCA methods.]
Figure 5-4 AUC comparisons between NMF and PCA under specification of different number of
subgraph groups. (a) AUC for the 5-fold cross validation experiment. (b) AUC for the held-out
test experiment. Shown in panel (a) for corresponding number of groups is a single AUC by
merging all the responses from the 5 validation subsets.
[Figure 5-5 graphic: ROC curves plotted against false positive rate (0.00 to 1.00). Legend: SAPS-IIa (AUC=0.673), Subgraph (AUC=0.810), D,I-measure (AUC=0.[illegible]19), Organ (AUC=0.827), Subgraph NMF (AUC=0.848).]
Figure 5-5 ROC curves for proposed method SANMF, comparison models including subgraph,
discretized & interpolated measures (D,I-measure), and organ level status, as well as the baseline
using the SAPS II approximation.
5.3.2 Important subgraph groups
Using the method described in Section 5.2.5, we identified the top four subgraph groups that
are associated with high mortality risk and list them in Table 5-4. These subgraph groups typically contain physiologic trends that stay at or progress to more severe states. In addition, they generally indicate problematic pathophysiologic processes that involve one organ or multiple organs
simultaneously, while still retaining the temporal trend details at the physiologic variable level.
30-day Mortality 1st Subgraph Group
Coefficient | Measurement | Trend
0.1000 | Glasgow Coma Scale | -2 -2 -2 -2 -2 -2
0.0085 | Minute Ventilation | -2 -2 -2 -2 -2 -2
0.0082 | Minute Ventilation | -1 -1 -1 -1 -1 -1
0.0081 | PEEPSet | 2 2 2 2 2 2
0.0066 | Airway Resistance | 1 0
0.0060 | Airway Resistance | 0 1 1
0.0059 | Plateau Pressure | 2 2 2 2 2 2
0.0052 | PEEPSet | 1 1 1 1 1 1
0.0047 | PaO2/FiO2 | 0 2
0.0040 | Airway Resistance | 1 1 0

30-day Mortality 2nd Subgraph Group
Coefficient | Measurement | Trend
0.1634 | BUN/Creatinine | 2 2 2 2 2 2
0.0481 | BUN | 2 2 2 2 2 2
0.0155 | Albumin | -2 -2 -2 -2 -2 -2
0.0040 | Arterial CO2 | 1 1 1 1 1 1
0.0040 | Heart Rate | 0 -1
0.0038 | Na | 2 2 2 2 2 2
0.0034 | Na | 1 1 1 1 1
0.0033 | Arterial CO2 | 2 2 2 2 2
0.0032 | Arterial Base Excess | 2 1
0.0029 | Delivered Tidal Volume | -1 -1 -1 0

30-day Mortality 3rd Subgraph Group
Coefficient | Measurement | Trend
0.1650 | INR | 2 2 2 2 2 2
0.1269 | Prothrombin Time | 2 2 2 2 2 2
0.0318 | Prothrombin Time | 1 1 1 1 1 1
0.0095 | Total Bilirubin | 1 1 1 1 1 1
0.0056 | Total Bilirubin | 2 2 2 2 2 2
0.0046 | Diastolic blood pressure | -1 -1 -1 -1 -1 -1
0.0029 | ALT | 2 2 2 2 2 2
0.0025 | Prothrombin Time | 2 2 2 2 2
0.0024 | Minute Ventilation | 0 -1
0.0022 | eGFR | -2 -2 -2 -2 -2 -1

30-day Mortality 4th Subgraph Group
Coefficient | Measurement | Trend
0.0539 | Heart Rate | 2 2 2 2 2 2
0.0381 | Cardiac Index | 0 1 0
0.0142 | Respiratory Rate | 2 2 2 2 2 2
0.0140 | Heart Rate | 0 2 0
0.0075 | Cardiac Index | 1
0.0071 | Cardiac Index | 1 0
0.0069 | Lactate | -1 -1 -1 -1 -1 -1
0.0064 | RSBI Rate | 1 0
0.0062 | Cardiac Index | 0 1 1 0
0.0060 | Urine/Hour/Weight | -1 0 0 1
Table 5-4 Top subgraph groups associated with high mortality risks. Subgraphs are converted into a sequence to save space. For each subgraph such as "0.1000 Glasgow Coma Scale -2 -2 -2 -2 -2 -2",
0.1000 is the membership coefficient, Glasgow Coma Scale is the measurement label, and "-2 -2 -2 -2 -2 -2" is the trend (flat in this case). Abbreviations used in the table include: PEEPSet
- positive end-expiratory pressure set on ventilator; INR - prothrombin time international normalized ratio; ALT - alanine aminotransferase; PaO2 - arterial oxygen tension; FiO2 - fraction
of inspired oxygen; BUN - blood urea nitrogen; Na - sodium level; eGFR - estimated glomerular
filtration rate; RSBI Rate - rapid shallow breathing index rate change. Please refer to Table
5-3 for descriptions of these variables.
For example, the first associated subgraph group has several subgraphs suggesting that the patient mainly has pulmonary problems (continuously low minute ventilation, high plateau pressure,
fluctuating airway resistance, and a high level of positive end-expiratory pressure set on the ventilator).
On the other hand, this group also has Glasgow Coma Scale staying very low, meaning that the
patient is probably unconscious or sedated. Thus the entire group may be interpreted as the status
of unconscious or sedated patients with severe pulmonary problems. The second associated subgraph group displays abnormal trends related to problems in multiple organs including kidney,
lung, and heart. The third associated subgraph group displays abnormal trends in hematology,
liver, heart, kidney, and lung. Similarly, the fourth associated group involves abnormality in
heart, lung, acid base homeostasis and kidney.
An interesting observation is that top ranked subgraph groups contributing to high mortality risk
usually involve problems in multiple organs rather than a single organ; multiple organ failure is
indeed a common cause of mortality in ICU settings. This type of grouping is difficult to achieve
using manual grouping according to only organ status as done by Joshi et al. [45] and is considered one of the benefits of using NMF to automatically group temporal progression trends in an
evidence-driven fashion.
5.4 Limitations and Discussion
We observe that the AUC of our approximation to SAPS II is lower than what is previously reported [40]. This may be because of the large amount of missing data in our data set and the approximations we make because our data do not include exactly the parameters used in SAPS II.
The organ system based model also shows an AUC somewhat lower than reported, but we believe this is because we build our predictions only on data available between 12 and 24 hours
after ICU admission, whereas the previous study uses the totality of data from a patient's ICU
stay.
In this work, we use 30-day mortality (including both in-hospital mortality and mortality within
30 days after discharge) as an obtainable ground truth in order to demonstrate the efficacy of
SANMF as an unsupervised feature learning algorithm. Similar methods may be applicable to
improve not only mortality predictions but also predictions that indicate specific types of patient
deterioration (e.g., anticipating hypotension, kidney injury, hepatic failure, sepsis) and identifying therapeutic opportunities (e.g., ability to wean from a ventilator, an intra-aortic balloon pump,
vasopressors), as have been investigated by Hug [306]. Such improved models can provide decision support for treatment planning, informed staffing and operations.
Currently the selection of SANMF parameters such as number of subgraph groups relies on cross
validation. We recognize the potential of using a probabilistic Bayesian approach to define a
generative process for the time series. For example, physiologic time series can be modeled with
stochastic processes (e.g., Gaussian process). Parameters of these stochastic processes can in turn
be generated according to underlying pathophysiologic states (what we have been approximating
with subgraph groups obtained by NMF). Although there may be a large number of possible
pathophysiologic states and stochastic process parameters to choose from at each level of a generative hierarchy, under proper prior distributions (e.g., specified using the Indian buffet process
[292]), we can impose constraints so that only a limited number of them would be selected in a
particular dataset. Although the Bayesian approach enjoys good properties such as its ability to
integrate a priori clinical knowledge and its flexibility in the model size, care needs to be taken
when defining stochastic processes for modeling time series to account for issues such as nonstationarity [262]. Clearly, the performance of SANMF depends on the nature of the correlations
among multivariate temporal progression patterns, for which the suggested generative model
could provide a basis for incremental analysis.
In this study, SANMF only takes account of the physiologic time series that are "observed" from
the patients' underlying pathophysiologic evolution, in order to make a fair comparison to the
baseline model SAPS II, which does not take into account treatment information. On the other
hand, an ICU admission, and a hospital admission in general, is better characterized as the interplay
between observations and interventions. We plan to model such interplays with SANTF.
Under this setting, SANTF will integrate interventions as a third mode of a tensor (i.e., a third
dimension of a higher-order matrix). By grouping physiologic temporal patterns (corresponding
to pathophysiologic state) and grouping intervention temporal patterns (corresponding to intervention regime), we expect to be able to predict outcomes for patient groups who have similar
underlying pathophysiologic evolutions and who have undergone similar treatment regimes. This
is a promising direction of research as it may elucidate effective treatment options for a particular patient sub-cohort based on evidence from previously admitted patients.
5.5 Conclusions
We proposed a novel unsupervised feature learning algorithm named subgraph augmented non-negative
matrix factorization (SANMF), which is designed for analyzing temporal progression
patterns in clinical time series data and is shown to improve both the accuracy and the interpretability
of the learnt model for ICU mortality risk prediction. In summary, subgraph mining on
multivariate time series leads to unsupervised extraction of multivariate temporal progression
patterns, which are more informative than single time point measurements. The ensuing NMF
explores the correlations among trends of different physiologic variables and reduces dimensionality at the same time, which then leads to better interpretability and improved accuracy. We
compared SANMF to four different models using features with different granularities and time
spans. SANMF outperforms all the comparison models and in particular demonstrates an AUC
improvement from 0.827 to 0.848, compared to the state-of-the-art model that explores manual
feature engineering on the MIMIC-II dataset. A detailed feature analysis of the subgraph groups
that are generated by SANMF offers more clinical insights about multiple organ problems associated with high mortality risk.
Chapter 6. Integrated Genomics, Transcriptomics, Medical Records, and Insurance Claims Analyses Identify Dyslipidemia as a Strong Inherited Risk Factor in ASD
This chapter describes a variation of subgraph mining algorithms used to detect co-regulated
exon clusters in genomic analysis. Moreover, this chapter also demonstrates an innovative approach to perform integrative analysis with multiple data modalities including genomics, transcriptomics, laboratory test results, and insurance claims. The real-world problem we chose to
study is Autism Spectrum Disorder (ASD). In particular, our subgraph mining algorithm, Implication of Co-regulated Exons (ICE), automatically identifies clusters of exons whose expressions during brain development are highly correlated, thus implying co-regulation. To effectively
apply ICE in interpreting the massive amounts of whole exome sequence data obtained from
thousands of families with ASD, we employed an integrative analytic approach, which combines
sequence data with neurodevelopmental expression patterns, familial segregation patterns, sexually dimorphic expression patterns, expression correlation, large-scale variant frequency data,
EMR data, and healthcare claims data (Figure 6-1). In this integrative genomic analysis aggregating different modalities of patient data, the subgraph mining algorithm ICE serves as the basis
and suggests a deeper understanding of the mechanisms of genetic variations by placing them in
the context of exon clusters that harbor these variations.
Note: A version of this chapter is currently under review as a research article; its coauthors Alal Eran, Nathan Palmer,
and Paul Avillach have also contributed significantly to the analysis.
[Figure 6-1 graphic: schematic with panels (a) through (e), including example sequence variants and an expression panel.]
Figure 6-1 Independent sources of information used to identify molecular networks contributing
to ASD. (a) Deleterious variants called in whole exome sequence data from 3,531 individuals belonging to 1,704 simplex and 50 multiplex families. These include (a.ii) nonsense, (a.iii)
frameshift, and (a.iv) splice site mutations, whose impact on wild type gene (shown in a.i) is depicted. (b) Sexually dimorphic, neurodevelopmentally co-regulated exons identified by clustering correlated BrainSpan spatiotemporal RNA-Seq data of the developing human brain and comparing cluster expression between male and female samples. (c) ASD-segregation patterns in
multiplex and simplex families. (d) Information streams a-c were integrated to identify clusters
of sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious variants.
(e) Lipid dysregulation, a novel molecular theme revealed by the above analysis, was validated
using EMR and health claims data, demonstrating significant alterations in lipid profiles of children with ASD, and an increased prevalence of comorbid dyslipidemia disorders among individuals with ASD and their family members, as compared to age, gender, and socioeconomically
matched controls.
6.1 Background
One in every 68 children in the United States is diagnosed with ASD, a wide spectrum of social
and communication deficits with repetitive behaviors [307,308]. Although twin and family studies provide substantial evidence that ASD is one of the most heritable complex disorders [309-311], the specific variants causing or increasing the risk for ASD remain largely elusive. Recent
advances in ASD genetics have highlighted its extreme locus heterogeneity, revealing a role for
de novo mutations [15-18,20,22-24], copy-number variants [312-315], common variants
[316,317], and rare single nucleotide variants [20,22,318-320]. This has accelerated a growing
realization that ASD comprises a multitude of etiologies with partially overlapping symptomatology
and clinical course [321-329].
Because ASD is a common disorder with a shared phenotype, individually rare etiologies must
converge at some level. Recent ASD genomic studies have revealed several convergent etiologies,
including synaptic dysfunction [22,24,312,322,324-326,330], immune dysregulation [331-333], chromatin and transcriptional dysregulation [17,22,23,320,334,335], and growth abnormalities [322,336-338]. Despite these significant advances in characterizing the genomic landscape
of ASD, the cause of the majority of cases remains unknown. Understanding the molecular bases
of ASD is needed to enable more accurate early diagnosis, personalized treatment options, and
improved outcomes for people with ASD.
Recent accumulation of enormous quantities of molecular data in typical and atypical human
brain development (including ASD) is providing unprecedented opportunities for elucidating the
interplay within and between different layers of genomic structures and their deviations in ASD.
Integrative genomics, the study of molecular events at different levels, has been successfully applied to various cancers, revealing principal disease subtypes with characteristic distributions of
age at diagnosis, clinical behavior, and optimal treatment response, thereby offering improved
personalized care [339-343]. However, these approaches have yet to be applied to ASD.
Here we integrate large-scale genomic, transcriptomic, medical records, and insurance claims
datasets to discover and validate molecular mechanisms associated with ASD. Besides reproducing previously reported convergent etiologies, our analysis reveals a strong signal of lipid
dysregulation (26% of all exon clusters whose genetic variants are significantly implicated in
ASD). We find a significant burden of sexually dimorphic, neurodevelopmentally co-regulated,
ASD-segregating deleterious mutations in lipid metabolism genes, significantly altered lipid profiles in blood of children with ASD, and a significantly higher prevalence of comorbid
dyslipidemia disorders in individuals with ASD and their family members. These findings suggest that dyslipidemia may be a strong inherited risk factor for ASD, thereby offering means for
earlier screening, more accurate diagnoses, and rational approaches to therapy.
6.2 Methods
6.2.1 Implication of Co-regulated Exons
It is known that in the human brain each gene unit has many alternatively spliced isoforms. This
mechanism supports important fine-tuned regulation and adaptation to the changing environment.
Nowhere is this fine-tuned response to environmental stimuli more important than in the developing human brain, the most complex organ shown to have the most divergent splicing patterns
[344]. Therefore, to study co-regulated variation, examining variants at the whole gene level has
insufficient resolution. Such functional co-regulation needs to be investigated at the higher resolution isoform level, and from the perspective of spatiotemporal co-expression patterns during
human neurodevelopment. To this end, we develop the method Implication of Co-regulated Exons (ICE).
In order to understand which variants might function together, we examine exonic spatiotemporal co-expression patterns in the recently generated BrainSpan RNA-Seq data [345]. This dataset contains normalized read counts (in RPKM: Reads Per Kilobase per Million mapped reads)
for 309,223 coding and non-coding exons measured across 524 samples from 26 brain regions
throughout human neurodevelopment (Table 6-1 and Table 6-2).
Structure | Area | Region
NCX (Neocortex) | FC (Frontal cortex) | OFC (Orbital prefrontal cortex)
NCX (Neocortex) | FC (Frontal cortex) | DFC (Dorsolateral prefrontal cortex)
NCX (Neocortex) | FC (Frontal cortex) | VFC (Ventrolateral prefrontal cortex)
NCX (Neocortex) | FC (Frontal cortex) | MFC (Medial prefrontal cortex)
NCX (Neocortex) | FC (Frontal cortex) | M1C (Primary motor cortex)
NCX (Neocortex) | PC (Parietal cortex) | S1C (Primary somatosensory cortex)
NCX (Neocortex) | PC (Parietal cortex) | IPC (Posterior inferior parietal cortex)
NCX (Neocortex) | TC (Temporal cortex) | A1C (Primary auditory cortex)
NCX (Neocortex) | TC (Temporal cortex) | STC (Posterior superior temporal cortex)
NCX (Neocortex) | TC (Temporal cortex) | ITC (Inferior temporal cortex)
NCX (Neocortex) | OC (Occipital cortex) | V1C (Primary visual cortex)
HIP (Hippocampus) | |
AMY (Amygdala) | |
STR (Striatum) | |
DTH (Dorsal thalamus; embryonic and early fetal development), MD (Mediodorsal nucleus; all other periods) | MD (Mediodorsal nucleus of the thalamus) | MD
CB (Cerebellum; embryonic and early fetal development), CBC (Cerebellar cortex) | CBC (Cerebellar cortex) | CBC
Table 6-1 Brain region hierarchy of regions, areas, and structures included in this study.
Period | Age | Description
1 | 4PCW-8PCW | Embryonic
2 | 8PCW-10PCW | Early fetal
3 | 10PCW-13PCW | Early fetal
4 | 13PCW-16PCW | Early mid-fetal
5 | 16PCW-19PCW | Early mid-fetal
6 | 19PCW-24PCW | Late mid-fetal
7 | 24PCW-38PCW | Late fetal
8 | 0M (birth)-6M | Neonatal and early infancy
9 | 6M-12M | Late infancy
10 | 1M-6Y | Early childhood
11 | 6Y-12Y | Middle and late childhood
12 | 12Y-20Y | Adolescence
13 | 20Y-60Y | Young adulthood
14 | 40Y-60Y | Middle adulthood
15 | >60Y | Late adulthood
Table 6-2 Periods of brain development included in this study. PCW, post conception weeks; M,
months; Y, years.
The 524 samples in the dataset were extracted from multiple brain regions belonging to 23 males
and 19 females at multiple developmental stages (Figure 6-2). These samples provide the spatial (regarding brain structures) and temporal (regarding ages) profile for each exon in the BrainSpan data. Co-expression analysis based on those profiles can identify exons that are linked
through functional co-regulation.
[Figure 6-2 graphic: panel (b) is titled "BrainSpan individuals age and gender profile", with x-axis "Ages (days in log scale)".]
Figure 6-2 Visualization of the BrainSpan RNA-Seq data. (a) Example heatmap of the expression
profiles of 29 exons throughout neurodevelopment and across individuals. (b) The age and gender profiles of BrainSpan individuals.
We applied initial filtering steps on exons using the following criteria, keeping 292,146 exons.
1. Variability filter. If there is no change in the expression profile (i.e., expression levels
are the same for different brain areas at different developmental stages from different donors), then the exon is excluded.
2. Multi-sample filter. If an exon only has samples from a single donor, then this exon is
excluded.
3. Duplicate filter. We also found that some exonic intervals are duplicated in the BrainSpan RNA-Seq data, where duplicates may be labeled with different (and sometimes temporary) names. To consolidate these duplicated exons and label them in a meaningful and consistent manner, we first identify exons that share chromosome names, start positions and end positions. For each such exon, we then identify temporary names using the following regular expression patterns: "RP.*-.*\\..*" and "\\w{2}\\d+\\.\\d". If there are additional meaningful names, these temporary names are discarded, and the exon is named by concatenating all meaningful names.
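The duplicate filter can be illustrated with a minimal Python sketch. It reuses the two regular expressions quoted above (with the doubled backslashes reduced for Python raw strings). Three details are assumptions made only for illustration: the patterns are applied as unanchored searches, meaningful names are joined with ";", and temporary names are kept when no meaningful name exists. The function names are hypothetical.

```python
import re

# The two temporary-name patterns quoted in the text.
TEMPORARY_NAME_PATTERNS = [
    re.compile(r"RP.*-.*\..*"),
    re.compile(r"\w{2}\d+\.\d"),
]


def is_temporary(name):
    """Return True if the name matches either temporary-name pattern.

    Assumption: the patterns are applied as unanchored searches.
    """
    return any(p.search(name) for p in TEMPORARY_NAME_PATTERNS)


def consolidate_names(names, sep=";"):
    """Merge the names of duplicated exonic intervals into one label.

    Assumptions: the separator is ';', and temporary names are kept
    when no meaningful name is available.
    """
    unique = list(dict.fromkeys(names))  # drop repeats, keep first-seen order
    meaningful = [n for n in unique if not is_temporary(n)]
    kept = meaningful if meaningful else unique
    return sep.join(kept)
```

For example, an interval carrying the names "RP11-573D15.8", "CPNE1" and "NFS1" would be labeled "CPNE1;NFS1" under these assumptions.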
We next identify co-regulated exons by calculating their similarity across the BrainSpan dataset. We measure such similarity by the coefficient of determination $R^2 = \mathrm{cor}(e_1, e_2)^2$, where $\mathrm{cor}(e_1, e_2)$ is the Pearson correlation between the expression profiles of two exons $e_1$ and $e_2$ [346]. The coefficient $R^2$ measures how well $e_1$ might be constructed from $e_2$ (by creating a predictor of the form $a + \beta e_2$), and vice versa. The RNA-Seq data are preprocessed before calculating $\mathrm{cor}(e_1, e_2)$. In particular, we regard all 0 values as NA [347]. We then log2-transform the RPKM values using the formula $\log_2(x + 1)$ to reduce the effects of measurement noise. Due to the prevalence of NA values, we filter exons so that their profiles must have at least 5% non-NA values (non-NA exon filter). Requiring at least 25 values to be measured (5% out of 524) is a rather inclusive criterion, as it retains 248,898 exons (i.e., 85% of the initially filtered total of 292,146; see Figure 6-3). We also require that candidate pairs of exons share 75% of samples with non-NA measurements (pair filter). That is, $|\mathrm{intersect}(\mathrm{nna}(e_1), \mathrm{nna}(e_2))| > 75\% \times \max(|\mathrm{nna}(e_1)|, |\mathrm{nna}(e_2)|)$, where $\mathrm{nna}(e)$ refers to the samples for which $e$ has non-NA values, $\mathrm{intersect}(\cdot)$ denotes set intersection, and $|\cdot|$ returns the length of a vector.
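The preprocessing, the two filters and the $R^2$ computation described above can be sketched as follows. This is an illustrative sketch only: NA is represented by None, and the correlation is computed over the samples measured in both profiles, which is an assumption (the text does not state how NA values enter the correlation). All function names are hypothetical.

```python
import math


def preprocess(rpkm):
    """Treat 0 as missing (None) and log2(x + 1)-transform the rest."""
    return [None if x == 0 else math.log2(x + 1) for x in rpkm]


def passes_exon_filter(profile, min_fraction=0.05):
    """Non-NA exon filter: at least min_fraction of values are measured."""
    n_measured = sum(v is not None for v in profile)
    return n_measured >= min_fraction * len(profile)


def passes_pair_filter(p1, p2, min_shared=0.75):
    """Pair filter: |nna(e1) & nna(e2)| > 75% x max(|nna(e1)|, |nna(e2)|)."""
    nna1 = {i for i, v in enumerate(p1) if v is not None}
    nna2 = {i for i, v in enumerate(p2) if v is not None}
    return len(nna1 & nna2) > min_shared * max(len(nna1), len(nna2))


def r_squared(p1, p2):
    """Squared Pearson correlation over samples measured in both profiles.

    Assumption: pairwise-complete observations are used.
    """
    pairs = [(a, b) for a, b in zip(p1, p2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    sxx = sum((a - mean_a) ** 2 for a, _ in pairs)
    syy = sum((b - mean_b) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant profile
    return sxy * sxy / (sxx * syy)
```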
[Figure 6-3 histogram: Distribution of number of non-NA values in expressions; x-axis: Number of non-NA values. Quantile legend: 5%: 5, 10%: 12, 15%: 25, 20%: 48, 25%: 85.]
Figure 6-3 Distribution of the number of non-NA values in expressions of exons. We also show the lower quantiles of the number of per-exon non-NA values. For example, the 15% quantile at 25 means that 15% of the exon expressions have < 25 non-NA values. In other words, 85% of the exon expressions have ≥ 25 non-NA values.
Pairwise correlation calculation between 309,223 exons amounts to over 47 billion pairs, and is a daunting task that is intensive in both computation and storage. We therefore adopt a distributed block-wise approach to calculate pairwise exon correlations, as shown in Figure 6-4. By dividing the exons into blocks of size 10,000, the correlation calculation is parallelized in a block-wise fashion. Let the blocks be $b_1, \ldots, b_n$. We then need to compute correlations between the exons within $b_1$, between exons in $b_1$ and $b_2$, ..., between exons in $b_1$ and $b_n$, between exons in $b_2$ and $b_3$ (the $b_2$-$b_1$ block correlations can be omitted due to symmetry), etc. Each block-wise correlation is dispatched to its own computing node in a 2000-core computing cluster, thus achieving a roughly thousand-fold speedup.
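The block decomposition can be sketched as below. The sketch only enumerates the block-pair jobs and the exon pairs inside each job; how jobs are dispatched to cluster nodes is not shown, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement


def make_blocks(n_exons, block_size=10_000):
    """Split exon indices 0..n_exons-1 into consecutive half-open ranges."""
    return [(start, min(start + block_size, n_exons))
            for start in range(0, n_exons, block_size)]


def block_pair_jobs(n_exons, block_size=10_000):
    """List block pairs (b_i, b_j) with i <= j.

    Pairs with j < i are skipped because the correlation matrix is symmetric.
    """
    blocks = make_blocks(n_exons, block_size)
    return list(combinations_with_replacement(blocks, 2))


def exon_pairs_in_job(block_a, block_b):
    """Yield each unordered exon pair covered by one block-pair job once."""
    for i in range(*block_a):
        # Within a diagonal block, start after i to skip (i, i) and repeats.
        start_j = i + 1 if block_a == block_b else block_b[0]
        for j in range(start_j, block_b[1]):
            yield i, j
```

With 309,223 exons and blocks of 10,000, this enumeration gives 31 blocks and 496 block-pair jobs, each of which could be computed independently.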
[Figure 6-4 schematic: exon-by-exon correlation matrix split into upper-diagonal 10^4 x 10^4 blocks; parallel computation; exon pairs with R^2 > 0.7 are connected into an exon graph; maximally connected components form the exon clusters.]
Figure 6-4 Block and parallel exon correlation makes computation feasible.
6.2.1.1 Identification of co-regulated exons
The distribution of the coefficients of determination $R^2$ is shown in Figure 6-5. As the histogram in Figure 6-5 (a) shows, as $R^2$ increases, the frequency falls faster than exponentially. This holds for the $R^2$ distribution after applying the two filters in the previous section. Our goal is to focus on highly co-expressed exons. We thus establish the empirical criterion that two exons must have an $R^2$ of at least 0.7 to be considered co-expressed, thereby focusing on the 0.02% most tightly correlated exon pairs. We keep the exons that are co-expressed with at least one other exon, and turn them into a graph representation. This graph has exons as nodes and draws an edge between exons $e_1$ and $e_2$ if they are co-expressed ($R^2(e_1, e_2) > 0.7$). This produces a large sparse exon co-expression graph with 92,240 nodes and 6,205,327 edges. A small part of the exon graph is shown in Figure 6-6, which clearly demonstrates that the whole exon graph consists of smaller exon clusters.
[Figure 6-5 panels: (a) R^2 histogram; x-axis: R^2 from 0.10 to 1.00. (b) Distribution of mean R^2 per cluster (min=0.700, max=1.000, mean=0.816, median=0.780); x-axis: Mean R^2 per cluster.]
Figure 6-5 Distribution of $R^2$ in the BrainSpan data. (a) The distribution of $R^2$ between pairs of exons that pass the two filters: 1) exons must have at least 25 non-NA values (the four exon filters); 2) the exons in a pair share 75% of samples with non-NA measurements (pair filter). The frequency is on a logarithmic scale. (b) The distribution of per-cluster mean $R^2$ coefficients. The per-cluster mean $R^2$ is calculated by averaging the $R^2$ coefficients over all exon pairs in one cluster with $R^2 > 0.7$. In other words, the per-cluster mean $R^2$ measures the average cluster connection strength.
[Figure 6-6 network drawing: part of the exon graph, with nodes labeled by hosting gene and exon index (for example PVRIG, STAG3, SYCE1, NCAPH2, CPNE1, TRAPPC2L, CBS, LDLR, NXF1, PEBP4, ITGB1BP1, POMT1 and TMEM91 exons).]
Figure 6-6 Visualization of part of the entire exon graph. Each node represents an exon, and an edge connects nodes $e_1$ and $e_2$ if $R^2(e_1, e_2) > 0.7$, with width proportional to the magnitude of $R^2$. Nodes are labeled according to their hosting gene and exon index.
Based on the entire exon graph, we cluster co-expressed exons by finding the maximally connected components using the igraph package [348]. This procedure generates 6,242 co-expressed exon clusters with an average mean $R^2$ of 0.82 and an average exon count of 15. The collection of exon clusters is remarkably heterogeneous in size, i.e., clusters contain different numbers of exons (Table 6-3) and genes (Table 6-4). Although the distributions are skewed towards smaller exon clusters, there are numerous exon clusters representative of tight multi-gene co-expression.
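The clustering step can be illustrated without igraph by a small union-find sketch in Python. It is not the igraph implementation used in this work, only an equivalent way to obtain connected components. The threshold is applied as "at least 0.7", which is one reading of the criterion above, and the exon labels in the usage note are made up.

```python
def coexpression_edges(r2_by_pair, threshold=0.7):
    """Keep exon pairs whose R^2 reaches the threshold (assumed inclusive)."""
    return [pair for pair, r2 in r2_by_pair.items() if r2 >= threshold]


def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    # Largest clusters first; ties broken by sorted member names.
    return sorted(clusters.values(), key=lambda c: (-len(c), sorted(c)))
```

Exons without any edge never enter the graph, so they are dropped, in line with keeping only exons that are co-expressed with at least one other exon. For instance, edges A.e1-A.e2, A.e2-A.e3 and B.e1-B.e2 yield one cluster of three exons and one of two.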
[Table 6-3 body: columns Cluster Size and Count; the column alignment is not recoverable from the extracted text. Legible leading entries: size 2, count 4111; size 3, count 734.]
Table 6-3 Distribution of cluster sizes (measured in terms of number of exons).
[Table 6-4 body: rows Number of genes in cluster and Number of clusters; the column alignment is not recoverable from the extracted text.]
Table 6-4 Distribution of number of genes in exon clusters.
6.2.1.2 Tracking expression patterns of co-regulated exons
We next track the temporal expression profiles of the co-regulated exon clusters identified in the
previous step, across the BrainSpan regions. As shown in Table 6-1, the measured brain regions
and areas can be summarized into six brain structures: amygdaloid complex (AMY), cerebellar
cortex (CBC), neocortex (NCX), hippocampus (HIP), mediodorsal thalamus (MD) and striatum
(STR). Based on the mapping in the summarization, we can derive the expression profile for the
areas with formulas (6-1) to (6-8).
FC = mean(OFC, DFC, VFC, MFC, (M1C | M1C-S1C))    (6-1)
PC = mean(PCx, IPC, S1C)    (6-2)
TC = mean(TCx, ITC, A1C, STC)    (6-3)
OC = mean(OCx, V1C)    (6-4)
NCX = mean(FC, PC, TC, OC)    (6-5)
STR = mean(STR, MGE, LGE, CGE)    (6-6)
MD = mean(MD, DTH)    (6-7)
CBC = mean(CBC, CB)    (6-8)
Using these formulas, we can calculate the aggregated expression of all exons, in each brain area
across the entire cohort. To compare across individuals and brain regions, we first normalize the
expression levels using a Z transformation (i.e. centering the expression vector on the mean and
dividing by its standard deviation). We next track the temporal expression patterns in each gender using two approaches:
(1) Mean and standard error plot. In this approach, for each exon cluster, each brain structure, each gender, and each time point, we compute the mean expression and its standard error using the aggregated expressions of all exons in that exon cluster, from all matching sample donors. We plot the temporal expression profile for each exon cluster, brain area, and gender combination as a line graph, with means as values and standard errors as error bars at each time point.
(2) Mean and standard error period plot. The spatiotemporal dynamics of the human brain
transcriptome is a staged process and can be tracked as a multi-period system, as detailed in Table 6-2. For each exon cluster, each brain structure, each gender, and each neurodevelopmental
period, we compute the mean expression level and its standard error by using the aggregated expression of all exons in that exon cluster, from all matching sample donors. The temporal expression profiles are then similarly plotted as in "mean and standard error plot".
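The aggregation, normalization and summary steps of this subsection can be sketched as follows. The sketch covers only formulas (6-6) to (6-8); skipping unmeasured areas and using the sample standard deviation are assumptions, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev

# Area-to-structure mapping for formulas (6-6) to (6-8). The neocortical
# formulas (6-1) to (6-5) would be handled the same way in two levels.
STRUCTURE_AREAS = {
    "STR": ["STR", "MGE", "LGE", "CGE"],
    "MD": ["MD", "DTH"],
    "CBC": ["CBC", "CB"],
}


def aggregate(area_values, areas):
    """Mean over the areas that were measured (None if none were).

    Assumption: unmeasured areas are skipped rather than treated as zero.
    """
    measured = [area_values[a] for a in areas if area_values.get(a) is not None]
    return mean(measured) if measured else None


def z_transform(values):
    """Center on the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def mean_and_standard_error(values):
    """Mean and standard error of the mean for one plotted point."""
    if len(values) > 1:
        se = stdev(values) / math.sqrt(len(values))
    else:
        se = float("nan")  # standard error is undefined for a single value
    return mean(values), se
```

Each point of the "mean and standard error plot" or the period plot would then correspond to one call of `mean_and_standard_error` on the normalized values of a given cluster, structure, gender and time point (or period).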
6.2.1.3 Identification of sexually dimorphic co-regulated exons
The sexually dimorphic prevalence of ASD (male-to-female ratio of 4:1) increases the likelihood that the functional loss incurred by genetic mutations impairs those co-regulated exons that demonstrate differential expression patterns between males and females. To identify sexually dimorphic co-regulated exons, we compare the temporal expression profiles of an exon cluster between males and females in each brain structure, as derived in section 6.2.1.2, and select the clusters that demonstrate gender-specific differential expression in one or more brain structures.
6.2.2 Whole exome sequence analysis
Whole exome sequencing (WES) aims to identify the variants found in the coding region of
genes.
6.2.2.1 Data compilation
We compiled several familial whole exome sequencing studies from the National Database for
Autism Research (NDAR), as detailed in Table 6-5. The table also shows the number of included
families from each dataset. Inclusion criteria were families with at least two siblings that have a
similar degree of sequence coverage, as determined by the Genome Analysis Toolkit's CallableLoci analysis [349].
NDAR Collection ID | NDAR Collection Title | Family type | Number of families | Number of individuals
1918 | Human autism genetics and activity dependent gene activation | Multiplex | 45 | 111
2004 | Sequencing Autism Spectrum Disorder Extended Pedigrees | Multiplex | 5 | 12
2042 | SSC total recall project | Simplex | 1704 | 3408
Table 6-5 Whole exome sequence datasets used. For the SSC total recall project, we include only those 1704 families from [350] for which the VQSR step (see section 6.2.2.3) succeeded.
Of the families listed in Table 6-5, a total of 1,754 families were included in our analysis, comprising 50 multiplex families with 2-5 affected siblings and 1,704 simplex families with one affected and one unaffected full sibling.
The total number of individuals included in our analysis amounts to 3,531. In order to call variants accurately and consistently across all datasets, we adopt the Genome Analysis Toolkit (GATK) framework [351] for standardized preprocessing of WES data into analysis-ready reads, followed by joint variant calling.
6.2.2.2 WES Preprocessing
For each individual included in our study, multiple BAM files may have been generated by multiple sequencing runs. Furthermore, different studies used different aligners and different variant calling frameworks. To standardize variant calling and data analysis across studies, our data preprocessing began with converting BAM files back to interleaved FastQ files and re-aligning these in a standardized manner using BWA-MEM [352]. Such a back-winding step through the FastQ format ensures that all BAM files are processed in the same standard way, which improves variant calling accuracy. Before converting a BAM file to a FastQ file, we first split the BAM file by read group. We then apply the Picard toolkit [353] to undo possible post-alignment processing for each split BAM file, using the RevertSam utility. The actual conversion from BAM files to FastQ files includes two sub-steps. The first sub-step uses the "bamshuf" utility from SAMtools [354] to shuffle the reads in the BAM file so that they are not in any biased order and a subsequent aligner can correctly estimate the insert size using blocks of paired reads. The second sub-step uses the "bam2fq" utility from SAMtools to convert the BAM file to an interleaved FastQ file in which each pair of reads (forward and reverse) is in the same file. The interleaved FastQ files from all individuals were then mapped to a single human reference genome (GRCh37/hg19, version 37, including decoy contigs) using BWA-MEM. The newly aligned BAM files containing different read groups were then merged using the Picard MergeSamFiles utility. For the merged BAM file, duplicates were marked and removed using the Picard MarkDuplicates utility, and read group information was added using the Picard AddOrReplaceReadGroups utility.
For efficiency, we restrict variant calling to a limited set of chromosomal regions specified by the BrainSpan exon intervals, because we are only interested in neurodevelopmentally co-regulated variants in this study. To that end, we pad each BrainSpan exon with a 100 bp buffer. We sort the padded intervals and divide them into two collections based on whether they are on the forward or reverse strand. We then merge intervals overlapping with other intervals in the same collection to provide a non-overlapping collection of intervals on each strand. The union of the two collections of merged intervals then forms the BrainSpan reference interval. Figure 6-7 shows the distribution of padded merged BrainSpan interval sizes. The figure also categorizes the intervals by strand (forward or reverse) and depicts the distributions of those intervals, which are similar to each other and to that of all intervals.
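The padding and merging of exon intervals can be sketched as below. Two assumptions are made for illustration only: coordinates are 1-based closed intervals, and the 100 bp buffer is added on each side of an exon. The function name is hypothetical.

```python
def pad_and_merge(exons, pad=100):
    """Pad each exon by `pad` bp per side and merge overlaps per strand.

    exons: iterable of (chrom, start, end, strand) tuples, assumed to be
    1-based closed intervals. Intervals on different strands or
    chromosomes are never merged with each other.
    """
    by_group = {}
    for chrom, start, end, strand in exons:
        padded = (max(1, start - pad), end + pad)
        by_group.setdefault((chrom, strand), []).append(padded)

    merged = []
    for (chrom, strand), intervals in sorted(by_group.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps the current merged interval: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand))
    return merged
```

The union of the per-strand results corresponds to the reference interval set described above.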
[Figure 6-7 histograms; x-axis: Interval size (logarithmic scale). (a) Distribution of merged BrainSpan interval size: min=202, max=29414, mean=632, median=374. (b) Distribution of merged BrainSpan interval(+) size: min=202, max=22968, mean=633, median=373. (c) Distribution of merged BrainSpan interval(-) size: min=202, max=29414, mean=632, median=375.]
Figure 6-7 Distribution of padded and merged BrainSpan interval sizes. (a) Size distribution of
all BrainSpan intervals. (b) Distribution for intervals on the forward strand. (c) Distribution of
intervals on the reverse strand.
6.2.2.3 Joint variant calling in BrainSpan intervals
After preprocessing, we perform joint variant calling using the GATK tool. Figure 6-8 shows the
overview of this workflow. The Non-GATK box corresponds to the preprocessing steps of
6.2.2.2. The preprocessed BAM files undergo local realignment, which transforms regions with
misalignments due to Indels into clean reads with a consensus Indel model (Indel Realignment
step in Figure 6-8, using GATK RealignerTargetCreator and IndelRealigner utilities). The reads'
quality scores are then recalibrated to correct for artifact and offset bias (Base Recalibration step,
using the GATK BaseRecalibrator utility), producing analysis ready reads.
[Figure 6-8 workflow diagram. Legible labels: Analysis-Ready Reads; genotype likelihoods; SNPs, Indels; individual genotypes; comprehensively annotated variants. Annotation examples shown: chr:start-end, cytoband, gene name, variant function, pathway, molecular process; predicted variant impact (e.g. SIFT, PolyPhen); population frequency (e.g. 1000 Genomes, ESP6500); clinical significance (e.g. ClinVar, OMIM); expression patterns (e.g. GTEx, BrainSpan); transcription regulation (e.g. ENCODE TFBS, histone modifications).]
Figure 6-8 Overview of WES analysis. After rigorous quality control steps, whole exome sequence data from various NDAR collections is aligned to the reference human genome using BWA. Duplicates are then marked, a realignment step follows to account for Indel-related errors, and finally base quality score recalibration results in analysis-ready BAM files. These are then analyzed using the Haplotype Caller, resulting in per-position genotype likelihoods. Following a joint genotyping phase, raw variants are called. These are filtered using a machine-learning based variant recalibration tool that balances the sensitivity-specificity tradeoff. The resulting SNPs and Indels are then subject to annotation based on multiple considerations, including predicted variant impact, conservation, population frequency and clinical significance. The end result of this pipeline is a list of comprehensively annotated variants, and a table of their individual genotypes.
The analysis-ready reads are then processed using the GATK Haplotype Caller. This step simultaneously calls SNPs and Indels using local re-assembly of haplotypes in an active region, resulting in per-position genotype likelihoods. We use the human reference genome GRCh37/hg19 (version 37, including decoy contigs) as the reference for the Haplotype Caller, with the recommended settings for single-sample all-sites calling on DNAseq: emitRefConfidence=GVCF, variant_index_type=LINEAR, variant_index_parameter=128000.
We then combine the resulting per-sample variants and perform a joint genotyping step using the GATK GenotypeGVCFs utility. Joint genotyping aggregates multi-sample variants and merges the records in order to re-estimate the genotype likelihood by combining all records spanning the target chromosome location. Based on our joint genotyping results, we apply a machine-learning based variant filtering step, Variant Quality Score Recalibration (VQSR). VQSR uses a Gaussian mixture model to fit and cluster the called variants and compare them to known positive and negative variant sets. SNPs and Indels are recalibrated separately in two passes. The first pass recalibrates SNPs, with Indels left untouched; the second pass recalibrates Indels, with the recalibrated SNPs left untouched.
We apply the WES preprocessing and joint variant calling steps to samples from the multiplex family cohort, producing an average of 83,808 variants/individual (74,111 SNPs, 9,697 Indels). For the discordant family cohort, we use a subset of the dataset produced by Krumm et al. [350], which is based on a similar GATK pipeline and has an average of 35,164 variants/individual (31,644 SNPs, 3,520 Indels). There are two main differences between the pipeline of Krumm et al. [350] and ours: 1) Krumm et al. performed joint variant calling separately for each quad (parents, proband⁷ and unaffected sibling) instead of for the entire cohort; 2) Krumm et al. called variants within 20 bp of the NimbleGen EZ-SeqCap v2.0 targets instead of within 100 bp of BrainSpan interval targets. Difference 1) may introduce some bias when directly comparing called samples from the two cohorts. However, we performed segregation analysis separately on the two cohorts, thus avoiding such bias. Difference 2) results in disparate numbers of variants/individual between the two cohorts. However, as shown in section 6.2.2.6 and Figure 6-9 to Figure 6-11, our subsequent filtering steps (mapping to BrainSpan exon clusters in particular) resulted in comparable average numbers of variants/individual in the two cohorts. In addition, to keep the data as consistent with our pipeline as possible, we include only those 1704 quads from [350] for which VQSR succeeded.
6.2.2.4 Variant annotation
We next used the ANNOVAR toolkit [355] to comprehensively annotate called variants with a wide array of information, including their hosting gene (using several gene models such as RefSeq [356], UCSC Known Gene [357], Gencode [358]); the variant function; its predicted pathogenicity according to PolyPhen2 [359], SIFT [360], MutationTaster2 [361], MutationAssessor [362], CADD [363], LRT [364], VEST3 [365], and other meta predictors; its conservation according to PhyloP [366], SiPhy [367], and GERP++ [368]; its minor allele frequency among the 1000 Genomes populations [369], ESP6500 [370], and ExAC [371]; and its phenotype associations according to ClinVar [57] and HGMD [372].
⁷ Proband refers to the affected sibling.
6.2.2.5 Annotation-based variant filtration and deleterious variant detection
To address issues of reference mis-annotation, we resort to the recently released Exome Aggregation Consortium (ExAC) exome dataset [371], which aims to aggregate exome sequencing data sets from a wide range of large-scale sequencing projects, including the cohorts of the Myocardial Infarction Genetics Consortium, the Swedish Schizophrenia & Bipolar Studies and The Cancer Genome Atlas (TCGA). We filter out those variants whose allele frequencies are observed to be over 90% among the 60,706 individuals aggregated by ExAC. We also apply a similar 90% filtering threshold on the alternate allele frequency in our cohort. We further focus on deleterious variants, which include frame-shift insertions, frame-shift deletions, nonsense variants, and splice site mutations.
[Figure 6-9 panels (box plots comparing Proband and Sibling): passed, deleterious, and co-regulated deleterious counts, shown separately for all variants, SNPs, and Indels.]
Figure 6-9 Distributions of the total number of variants in probands and unaffected siblings in
discordant families. Shown are the per-individual SNP and Indel distributions after each of the
following analysis steps: joint variant calling, restricting to deleterious variants, and restricting to
co-regulated deleterious variants. Note the dramatic reduction from about 32,000 total SNPs per
individual to about 50 candidate SNPs, and from 3,500 total Indels per individual to about 130
candidate Indels. Importantly, the number of neurodevelopmentally co-regulated deleterious variants is similar between probands and unaffected siblings, but their distribution among clusters
differs significantly, with an enriched aggregation of deleterious variants in certain exon clusters.
6.2.2.6 Mapping variants onto co-regulated exon clusters
To identify neurodevelopmentally co-regulated variants, we next map the called variants to the exon clusters identified in section 6.2.1. In doing so, we first perform an interval search to map variants into exons using the GenomicRanges toolkit [373]. A variant maps into an exon when the variant's genomic location falls within the exon's interval. After mapping variants to their hosting exons, we assign cluster membership to each variant based on the cluster membership of its hosting exon as obtained in section 6.2.1. This mapping of deleterious variants to exon clusters allows us to identify and enumerate deleterious mutations in each co-regulated exon cluster. Figure 6-10 and Figure 6-11 show the distributions of the number of variants per individual at each stage of variant analysis, for the discordant family cohort and the multiplex family cohort, respectively. Figure 6-10 shows that restricting to deleterious variants, restricting variants to co-regulated exon clusters, and filtering for differentially variable variants all contribute to the reduction of the number of candidate variants. A similar reduction holds for multiplex families, where the last filtering step is based on shared variants among all proband siblings, as shown in Figure 6-11.
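The variant-to-cluster mapping can be illustrated with a plain Python sketch. It is a stand-in for the GenomicRanges interval search, not its API: it uses a sorted list per chromosome and a binary search on interval starts, with closed intervals assumed. All names and identifiers are hypothetical.

```python
from bisect import bisect_right


def build_index(exons):
    """Index exons by chromosome.

    exons: list of (chrom, start, end, exon_id) with closed intervals.
    """
    index = {}
    for chrom, start, end, exon_id in exons:
        index.setdefault(chrom, []).append((start, end, exon_id))
    for intervals in index.values():
        intervals.sort()
    return index


def exons_containing(index, chrom, pos):
    """Return ids of exons whose interval contains the position."""
    intervals = index.get(chrom, [])
    # Only intervals whose start is <= pos can contain the position.
    hi = bisect_right(intervals, (pos, float("inf"), ""))
    return [exon_id for start, end, exon_id in intervals[:hi] if end >= pos]


def map_variants_to_clusters(variants, index, exon_to_cluster):
    """Map (chrom, pos) variants to the clusters of their hosting exons.

    Variants outside every clustered exon are dropped.
    """
    mapping = {}
    for chrom, pos in variants:
        clusters = {exon_to_cluster[e]
                    for e in exons_containing(index, chrom, pos)
                    if e in exon_to_cluster}
        if clusters:
            mapping[(chrom, pos)] = clusters
    return mapping
```

Counting the mapped deleterious variants per cluster and per individual then gives the per-cluster mutation counts used in the segregation analysis.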
[Figure 6-10 histograms; x-axis: Variant per individual. Variants per individual: min=17473, max=47453, mean=35164, median=35908. Deleterious variants per individual: min=286, max=1092, mean=626, median=619. Coregulated deleterious variants per individual: min=81, max=412, mean=195, median=189. Differentially expressed, coregulated deleterious variants: min=14, max=210, mean=39, median=36.]
Figure 6-10 Distribution of number of variants per individual in the discordant family cohort at
each stage of variant analysis.
[Figure 6-11 histograms; x-axis: Variant per individual. Variants per individual: min=70699, max=122616, mean=83808, median=81353. Deleterious variants per individual: min=405, max=888, mean=682, median=677. Coregulated deleterious variants per individual: min=109, max=258, mean=209, median=210. Shared coregulated deleterious variants: min=24, max=120, mean=76, median=76.]
Figure 6-11 Distribution of number of variants per individual among multiplex families at each
stage of variant analysis.
Below we summarize the overall reduction of candidate variant numbers at each step. For the discordant family cohort, we start with an average of 35,164 variants/individual (31,644 SNPs, 3,520 Indels). Focusing on deleterious variants reduces the candidate pool size to 626 variants/individual (238 SNPs, 388 Indels) on average. Mapping deleterious variants to co-regulated exon clusters further trims the average number down to 195 variants/individual (61 SNPs, 134 Indels). Finally, filtering variants by differential variability between discordant sibling pairs leads to 39 variants/individual (15 SNPs, 24 Indels) on average. For the multiplex family cohort, we start with an average of 83,808 variants/individual (74,111 SNPs, 9,697 Indels). Focusing on deleterious variants reduces the candidate pool size to 682 variants/individual (296 SNPs, 386 Indels) on average. Mapping deleterious variants to co-regulated exon clusters brings the average number down to 209 variants/individual (68 SNPs, 141 Indels). Finally, keeping only the variants shared by probands in multiplex families leads to 76 variants/individual (30 SNPs, 46 Indels) on average.
6.2.3 Segregation pattern analysis
Here we examine the segregation patterns of neurodevelopmentally-co-regulated, sexually dimorphic deleterious variants in both discordant and multiplex ASD families.
6.2.3.1 Discordant ASD families
Simplex ASD families refer to those that have one child affected by ASD. We focus on discordant families, special cases of simplex ASD families that have two siblings: one proband (affected
with ASD) and one unaffected sibling. In each discordant family, discordant sibling pairs are
formed by pairing a proband with his/her own unaffected sibling. With the collection of discordant pairs, we can compare neurodevelopmentally co-regulated deleterious variants found in probands and the variants carried by siblings, in each exon cluster. By selecting the exon clusters
with excess mutation burden in probands, we filter the exon clusters to retain those that likely
harbor the pathogenic mutations of ASD.
We use permutation tests [253] to assess the statistical significance of an exon cluster's excess deleterious variants in probands as compared to their unaffected siblings. Treating families as rows and proband and sibling as columns, we fill in the entries of this matrix with the total number of mutations occurring in each individual in each exon cluster. This creates an exon cluster mutational profile among discordant families. To obtain an empirical p-value for excess mutational burden, we randomly shuffle paired probands and siblings. Repeating the permutation creates a distribution of mutational profiles that simulates mutational events in an exon cluster by chance. With this simulated distribution, we then calculate the p-value of differential variation (i.e., $\sum_i (m_{p_i,e} - m_{s_i,e})$, where $m$ denotes the mutation count, $e$ indexes the exon clusters, $i$ indexes discordant families, and $p_i$ and $s_i$ are the proband and unaffected sibling in the $i$th family, respectively).
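A minimal sketch of such a permutation test for one exon cluster is given below. Several choices are assumptions made only for illustration: the shuffle is implemented as swapping the proband and sibling labels within each family with probability 1/2, the test is one-sided (excess in probands), and an add-one correction keeps the empirical p-value above zero. The function names are hypothetical.

```python
import random


def excess_burden(counts):
    """Sum over families of (proband count - sibling count) in one cluster."""
    return sum(p - s for p, s in counts)


def permutation_p_value(counts, n_permutations=10_000, seed=0):
    """One-sided empirical p-value for excess proband burden.

    counts: list of (proband_count, sibling_count), one tuple per family.
    """
    rng = random.Random(seed)
    observed = excess_burden(counts)
    hits = 0
    for _ in range(n_permutations):
        # Swap labels within each pair with probability 1/2 (assumption).
        permuted = sum((p - s) if rng.random() < 0.5 else (s - p)
                       for p, s in counts)
        if permuted >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```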
6.2.3.2 Multiplex ASD families
Multiplex families have two or more probands. In this segregation analysis we search for neurodevelopmentally co-regulated deleterious variants that are shared among all affected siblings. As
we assume that probands have a similar cause of ASD, focusing on neurodevelopmentally coregulated exon clusters with shared deleterious variants enables us to zoom in on mutations that
more likely cause ASD. The distribution of number of siblings in multiplex families is shown in
Figure 6-12. While most multiplex families have two affected siblings, there are 16 families with
3-5 affected siblings.
[Figure 6-12 histogram: Distribution of number of siblings in multiplex family (min=2, max=5, mean=2, median=2); x-axis: Number of siblings in multiplex family.]
Figure 6-12 Distribution of sizes of multiplex families.
We use Affected Sib Pair (ASP) analysis [374] to assess the significance of variant sharing among all proband siblings. We follow an extended version of the affected sib-pair test (page 125 in [374]). The null hypothesis of this test is that variant sharing is by chance, and therefore not related to the phenotype. This hypothesis is tested using the nonparametric linkage (NPL) z-score. To deal with multiplex families of more than 2 siblings, we divide each family into sib pairs (i.e., a family with $s$ siblings results in $s(s - 1)/2$ affected sib pairs). Because the artificially created pairs are dependent, each is weighted by $2/s$ (i.e., scaled down by $s/2$, as though there were only $s - 1$ pairs in the sibship). The pseudo code of the extended affected sib pair test for multiplex families is shown in Figure 6-13, where variants are aggregated per cluster and Zclust is the cluster's extended NPL z-score.
Input: exon clusters
Output: p-values for exon clusters as evidenced by multiplex families
for cluster c=1 to all clusters
{
    Zclust=0;
    for variant v=1 to all variants in cluster c
    {
        Z[v]=0;
        n=0;
        for family f=1 to all families
        {
            //siblings with genotypes passing the selected filters
            s=#informative affected siblings in family f
            for informative sib pair p=1 to s*(s-1)/2
            {
                //sampling from siblings with genotype that passed filters
                Generate sib pair p
                //sqrt(2),0,-sqrt(2) if sib pair shares 2,1,0 non-ref alleles
                z=sqrt(2)*(#alleles shared in sib pair [0,1,2] - 1)
                Z[v]+=z*2/s
                n++
            }
        }
        Z[v]=[1/sqrt(n)]*Z[v]
        Zclust+=Z[v]
    }
    //Zclust ~ N(0,#variants in cluster)
    pVal[c]=2*pnorm(Zclust,mean=0,sd=sqrt(#variants in c),lower.tail=F)
}
Figure 6-13 Pseudo code of the extended ASP test for multiplex families. The null hypothesis is
that variant sharing is by chance (and therefore not related to ASD), and the statistic is the extended nonparametric linkage (NPL) z-score (the variable Zclust).
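The pseudo code of Figure 6-13 can be turned into a small runnable Python sketch. It follows the same weighting and normalization. Three points are assumptions: each sibling's genotype is encoded as a count of non-reference alleles (0, 1 or 2), the number of shared alleles in a pair is taken as the minimum of the two counts, and the p-value is capped at 1. The function names are hypothetical.

```python
import math
from itertools import combinations


def variant_z(families):
    """Extended NPL z-score for one variant.

    families: list of families; each family lists the non-reference allele
    count (0, 1 or 2) of every informative affected sibling.
    """
    total, n = 0.0, 0
    for genotypes in families:
        s = len(genotypes)
        if s < 2:
            continue  # no sib pair can be formed
        for g1, g2 in combinations(genotypes, 2):
            shared = min(g1, g2)  # assumption: alleles carried by both sibs
            z = math.sqrt(2) * (shared - 1)
            total += z * 2 / s  # weight 2/s for dependent pairs
            n += 1
    return total / math.sqrt(n) if n else 0.0


def cluster_p_value(variants):
    """Return (Zclust, p-value) for one cluster.

    variants: list with one `families` structure per variant in the cluster.
    Zclust is compared against N(0, #variants in cluster).
    """
    z_clust = sum(variant_z(families) for families in variants)
    sd = math.sqrt(len(variants))
    upper_tail = 0.5 * math.erfc(z_clust / (sd * math.sqrt(2)))
    return z_clust, min(1.0, 2 * upper_tail)  # cap at 1 is an assumption
```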
6.2.4 Integrated statistical significance
With the quantitative evidences from the simplex ASD families and multiplex ASD families, we
use the following statistical significance analysis to combine the two sources of association evi-
138
dences. In particular, for each exon clusters, we proceed separately with discordant families and
multiplex families respectively.
For each exon cluster, we have p-values calculated independently for the statistical significance
of excess deleterious variation in probands as compared to their unaffected siblings (in the discordant family analysis), as well as for increased deleterious allele sharing among all affected
siblings (in the multiplex family analysis). We then use Fisher's method [375] to combine pvalues from both analyses for each exon cluster. The combined p-values are then Bonferronicorrected for multiple testing of all clusters [376].
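The combination step can be illustrated with a short Python sketch. It is a generic rendering of Fisher's method and the Bonferroni correction, not the code used in this work; the function names are hypothetical, and the p-values must be strictly positive. For k independent p-values, X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for even degrees of freedom the upper tail has a closed form, so no statistics library is needed.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Uses the closed-form chi-square upper tail for 2k degrees of freedom:
    P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term
        term *= half_x / (i + 1)
    return math.exp(-half_x) * tail


def bonferroni(p_values):
    """Bonferroni-adjust a list of p-values, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Two p-values per cluster (discordant-family and multiplex-family analyses);
# the numbers are made up for illustration.
per_cluster = [(0.05, 0.05), (0.5, 0.8), (1e-4, 1e-3)]
combined = [fisher_combine(pair) for pair in per_cluster]
adjusted = bonferroni(combined)
```

With two p-values the combined value reduces to p1 * p2 * (1 - ln(p1 * p2)), which is a convenient check on the implementation.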
6.2.5 Functional enrichment analysis
To assess the function of all significant exon clusters, we used NCBI's gene2go table
(ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz) to map genes to their molecular function,
biological process, and cellular compartment. We further used GSEA's MSigDB
(http://www.broadinstitute.org/gsea/msigdb/) to identify gene membership in KEGG pathways
(http://www.genome.jp/kegg), Reactome pathways (http://www.reactome.org), BioCarta pathways (http://www.biocarta.com), and their pathway interactions, as recorded in the Pathway Interaction Database (http://pid.nci.nih.gov). SFARI Gene, an integrated catalogue of human genetic studies related to autism, was used to examine the significant cluster genes' known association with ASD (https://gene.sfari.org/autdb/HG_Home.do). Only genes belonging to evidence
categories 1-3 were considered as having a strong prior for playing a role in ASD. Furthermore,
NCBI's ClinVar [57] and OMIM [377] databases were mined for evidence implicating significant cluster
genes in schizophrenia and bipolar disorder, two related neurodevelopmental disorders whose etiologies overlap with those of ASD [378].
6.2.6 Analysis of lipidemia profiles using lab results from individuals with ASD
seen at Boston Children's Hospital
We used the i2b2/tranSMART platform [379,380] to analyze EMR data from 1,343,481 individuals seen at Boston Children's Hospital (BCH), including 101,227 children with ASD.
i2b2/tranSMART enables the cohesive analysis of heterogeneous phenotypic data, including
longitudinal diagnoses and lab results. Using this engine, we compared the results of common
lipid lab tests between individuals with ASD and matched individuals with no ASD-related diagnoses. Tests included triglyceride levels (lab 1173), total cholesterol (lab 8350), HDL (lab 1079),
and LDL (lab 8352). For each lab, a 2-by-2 contingency table was used to assess the association of abnormal lab results with ASD by counting four groups: individuals with an ICD-9
299.0 diagnosis ("Autistic disorder") and at least one abnormal test result; individuals with an ICD-9 299.0 diagnosis and all lab values within the reference range; individuals who never had a
299.0 diagnosis and whose lab values were all within the reference range; and individuals who never
had a 299.0 diagnosis but had at least one abnormal test result. Table 6-6 details the number of
individuals used for each comparison. Pearson's chi-square tests were then used to assess the statistical significance of the association between abnormal lipid lab results and ASD.
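As an illustration of the test described above, the following Python sketch computes Pearson's chi-square statistic for a 2-by-2 table. It is a generic textbook formulation rather than the analysis code used here: the function name is hypothetical, and no continuity correction is applied (the text does not say whether one was used). The example counts are the LDL counts from Table 6-6.

```python
import math


def pearson_chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns (statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2))


# LDL counts from Table 6-6.
# Rows: at least one 299.0 code / never a 299.0 code.
# Columns: at least one abnormal result / all results within range.
stat, p = pearson_chi_square_2x2(291, 352, 8628, 17289)
```

A perfectly balanced table such as (10, 10, 10, 10) gives a statistic of 0 and a p-value of 1, which makes a quick sanity check.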
Lab name                                    LDL           Total cholesterol   HDL           Triglycerides
Lab ID                                      8352          8350                1079          1173
BCH patients with at least one abnormal
test result that never had an autism
ICD9 299.0* code                            8628          2427                5899          12356
BCH patients with all test results
within the reference range that never
had an autism ICD9 299.0* code              17289         24511               33132         21918
BCH patients with at least one abnormal
test result and at least one autism
ICD9 299.0* code                            291           101                 121           273
BCH patients with all test results
within the reference range and at
least one autism ICD9 299.0* code           352           553                 523           406
Total number of individuals with at
least one test result and at least one
299.0 diagnosis                             643           654                 644           679
Total number of individuals with at
least one test result and no 299.0
diagnoses                                   25917         26938               39031         34274
Period examined                             13 years      13 years            21 years      21 years
Dates examined                              1/1/2001 -    1/1/2001 -          1/1/1993 -    1/1/1993 -
                                            12/31/2014    12/31/2014          12/31/2014    12/31/2014
Table 6-6 Patients used to examine the association of abnormal lipid lab results with ASD.
6.2.7 PheWAS of Aetna claims data
We analyzed four calendar years' (2010 - 2013) worth of medical claims and enrollment demographics for approximately 33 million Americans who were covered by Aetna Inc. policies
during that period. Data from the insurance provider were warehoused in a centralized repository,
using relational data tables managed by Microsoft SQL Server 2012 Enterprise Edition. We used
the subscriber-to-member relationships in the insurance claims data to identify approximately
30,000 families with at least one child diagnosed with ASD, indicated by the presence of one or
more ICD-9 codes in the 299 group (pervasive developmental disorders) in at least one medical
claim. Fathers, mothers, and their affected children were matched to control populations by age,
gender, and zip-code (a socioeconomic marker). These large control populations were repeatedly
subsampled (n=10,000) to compare the prevalence of comorbid diagnoses in equally sized samples of affected and unaffected populations of fathers, mothers, and offspring. Diagnoses were
mapped to PheWAS groups (http://phewas.mc.vanderbilt.edu), and the p-value of the median
statistic for each diagnostic category was taken as the representative association between that
diagnostic group and the case population.
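The repeated subsampling of the control population can be sketched as follows. This is only an illustration under assumptions: the function name, the boolean input layout, and the fixed seed are hypothetical, and the sketch stops at the median count of diagnosis-positive controls; it does not reproduce the test statistic or p-value computation used in this analysis.

```python
import random
from statistics import median


def median_matched_count(control_has_dx, sample_size, n_rounds, seed=0):
    """Median number of diagnosis-positive controls over repeated subsamples.

    control_has_dx: list of booleans, one per matched control (True if the
    control carries the diagnosis of interest).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = [
        sum(rng.sample(control_has_dx, sample_size))  # sample without replacement
        for _ in range(n_rounds)
    ]
    return median(counts)


# Toy control pool in which 30% of controls carry the diagnosis.
pool = [True] * 300 + [False] * 700
typical = median_matched_count(pool, sample_size=100, n_rounds=25)
```

Taking the median over subsamples yields one representative count per diagnostic category, which can then be compared with the count in an equally sized case sample.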
6.3 Results
6.3.1 Neurodevelopmentally co-regulated, sexually dimorphic, segregating deleterious variation in ASD
To identify neurodevelopmentally co-regulated, sexually dimorphic, segregating deleterious variation in ASD, we performed the integrative analysis as shown in Figure 6-1. Raw whole exome
sequence data from several cohorts were obtained from NDAR, and jointly processed using a
standard BWA/GATK pipeline so that variants were called consistently across cohorts. Confidently called single-nucleotide and indel variants were annotated to identify deleterious variants, namely
frameshift, nonsense, and canonical splice site altering variants, which are the focus of all subsequent analyses. We then focused on variants that segregate with ASD in 1,754 families. Specifically, we focused on variants that are shared among all affected siblings in 50 multiplex families
with 2-5 probands per family, and those that are discordant between 1,704 probands and their
unaffected siblings. We further focused on variants that function together during early human
brain development. To identify those we analyzed the BrainSpan RNA-Seq data, which summarizes normalized read counts from 524 samples of different ages, genders, and brain regions. We
analyzed exon-level pairwise correlation patterns throughout human brain development, and aggregated them to identify clusters of co-regulated exons. We then identified those clusters with
sexually dimorphic expression patterns, which more likely give rise to ASD, a male-dominant
disorder. We mapped variants back to sexually dimorphic exon clusters to identify co-regulated
deleterious variants that might have gender-specific effects during early human neurodevelopment. We employed rigorous statistics to control for multiple testing, using affected sib pair
(ASP) analysis to assess the significance of multiplex family variant sharing, and permutation
tests to assess the increased burden of deleterious, neurodevelopmentally co-regulated, sexually dimorphic variation in probands as compared to their unaffected siblings. These independent analyses were integrated to reveal 22 neurodevelopmentally co-regulated, sexually dimorphic clusters
with ASD-segregating deleterious variation (Table 6-7).
6.3.2 Convergent lipid metabolism etiology
Functional enrichment analysis of the identified exon clusters revealed several molecular themes,
most of which have been previously associated with ASD. These include chromatin and transcriptional regulation, immune function, and synaptic function. However, the analysis also revealed a
previously unknown convergent etiology, accounting for 23% of the signal: lipid regulation (Table
6-7). Lipid metabolism genes implicated by our integrative analysis include low-density lipoprotein receptor (LDLR), lipoprotein lipase (LPL), copine I (CPNE1), and globoside alpha-1,3-N-acetylgalactosaminyltransferase 1 (GBGT1). For example, the LDLR cluster includes 5 co-regulated exons with a male-dominant expression pattern during prenatal development, which
switches to female dominance postnatally (Figure 6-14). This cluster is hit by 3 ASD-segregating
deleterious variants (P = 1.93 x 10^-7). Another example is the LPL cluster, which consists of
10 tightly co-regulated exons with male-dominant prenatal expression. It is hit by 5 ASD-segregating variants (P = 1.55 x 10^-6, Figure 6-15).
Molecular theme: Lipid regulation
Cluster p-value   Gene(s)    Gene products                                              Location
7.88E-11          CPNE1      Copine I                                                   20q11.22
    Selected molecular processes: Glycerophospholipid biosynthetic process, lipid metabolic process, neuron projection extension, phospholipid metabolic process, positive regulation of neuron differentiation
2.50E-09          DDIT4L     DNA-damage-inducible transcript 4-like                     4q24
    Selected molecular processes: Upregulated by low-density lipoprotein, negative regulation of signal transduction
2.43E-08          GBGT1      Globoside alpha-1,3-N-acetylgalactosaminyltransferase 1    9q34.13-q34.3
    Selected molecular processes: Glycolipid biosynthetic process, protein glycosylation
1.93E-07          LDLR       Low density lipoprotein receptor                           19p13.2
    Selected molecular processes: Cholesterol homeostasis, cholesterol metabolic process, cholesterol transport, lipid metabolic process, lipoprotein catabolic process, low-density lipoprotein particle clearance, phospholipid transport, phototransduction, positive regulation of triglyceride biosynthetic process, receptor-mediated endocytosis
1.55E-06          LPL        Lipoprotein lipase                                         8p22
    Selected molecular processes: Fatty acid biosynthetic process, lipoprotein metabolic process, phospholipid metabolic process, phototransduction, positive regulation of cholesterol storage, positive regulation of sequestering of triglyceride, triglyceride biosynthetic process, triglyceride homeostasis, triglyceride metabolic process, very-low-density lipoprotein particle remodeling

Molecular theme: None
Cluster p-value   Gene(s)    Gene products                                              Location
2.50E-09          DDO        D-aspartate oxidase                                        6q21
    Selected molecular processes: Aspartate catabolic process, grooming behavior, hormone metabolic process, oxidation-reduction process
5.86E-06          TRAPPC2L   Trafficking protein particle complex 2-like                16q24.3
    Selected molecular processes: ER to Golgi vesicle-mediated transport
6.69E-06          PPP1R16A   Protein phosphatase 1, regulatory subunit 16A              8q24.3
    Selected molecular processes: Regulation of catalytic activity
2.72E-05          AGAP5      ArfGAP with GTPase domain, ankyrin repeat and PH domain 5  10q22.2
    Selected molecular processes: Positive regulation of GTPase activity
1.03E-04          AMZ1       Archaelysin family metallopeptidase 1                      7p22.3
    Selected molecular processes: Proteolysis
Table 6-7 Significant clusters of sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious variation, and
their molecular themes.
Figure 6-14 [plot panels: mediodorsal nucleus, striatum; co-regulated LDLR exons] ... expression switching to female dominance postnatally. (C) Multiplex family sharing of three deleterious variants hitting this cluster.
Five families with two affected siblings each share deleterious alleles in co-regulated LDLR exons (shown in red).
Figure 6-15 ASD-segregating deleterious variation in the sexually dimorphic LPL exon cluster. (A) Tight co-regulation of 10 LPL exons. The graph depicts the pairwise correlation structure among 10 LPL exons comprising this cluster, showing that all are correlated
with R^2 > 0.7. (B) Sexually dimorphic neurodevelopmental expression patterns of the LPL cluster in the neocortex and hippocampus.
Shown is the mean normalized expression pattern across sample donor ages measured in days (logarithmic scale). Note the male-dominant prenatal (before 256 days) expression. (C) Multiplex family sharing of five deleterious variants hitting this cluster. Five families
with two affected siblings each share deleterious alleles in co-regulated LPL exons (shown in red).
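The pairwise correlation criterion in panel (A) can be checked with a few lines of Python. The sketch below is not the ICE subgraph mining algorithm; it only tests whether every pair of exon expression profiles in a candidate cluster exceeds an R^2 threshold. The function names, the exon labels, and the toy profiles are hypothetical, and Pearson correlation is an assumption about how R was computed.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def min_pairwise_r2(profiles):
    """Smallest squared correlation over all pairs of exon profiles.

    profiles: dict mapping exon name -> expression values over the same samples.
    """
    return min(
        pearson_r(profiles[u], profiles[v]) ** 2
        for u, v in combinations(profiles, 2)
    )


def is_tight_cluster(profiles, threshold=0.7):
    """True if every pair of exons is correlated with R^2 above the threshold."""
    return min_pairwise_r2(profiles) > threshold


# Toy profiles: e1 and e2 rise together, e3 does not follow them.
tight = {"e1": [1, 2, 3, 4], "e2": [2, 4, 6, 8.5]}
loose = dict(tight, e3=[4, 1, 3, 2])
```

Because the criterion uses R^2, a strongly anti-correlated pair would also pass; a check on the sign of R would be needed to require positive co-regulation.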
6.3.3 Dyslipidemia in families with ASD
Using health claims data from 34,003,107 individuals, we identified 23,837 families with at least
one child diagnosed with ASD (ICD-9 code 299.x) and at least one child lacking any 299.x diagnosis. Comparing the rates of dyslipidemia between children with ASD and their unaffected siblings, we found that ASD is significantly associated with dyslipidemia (OR = 1.76, 95% CI = [1.61,
1.92], Fisher's p = 2.25 x 10^-36, Table 6-8).
            At least one dyslipidemia diagnosis    No dyslipidemia diagnosis
ASD         1083                                   23743
No ASD      999                                    38496
Where dyslipidemia is defined as having any of the below diagnoses:
Code      PheWAS Group
272.1     Hyperlipidemia
272.13    Mixed hyperlipidemia
272.11    Hypercholesterolemia
272.9     Lipoid metabolism disorder NOS
277.51    Lipoprotein disorders
277.5     Other disorders of lipoid metabolism and hyperalimentation
272       Disorders of lipoid metabolism
Table 6-8 Enrichment of comorbid dyslipidemia diagnoses in individuals with ASD as compared
to their unaffected siblings (p = 2.25 x 10^-36).
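The odds ratio reported for this comparison can be recovered from the four counts in Table 6-8. The Python sketch below is a worked check, not the original analysis code: the function name is hypothetical, and the Wald (log-scale normal approximation) interval is an assumption, since the text does not state how the confidence interval was computed. The counts are read as 1083 affected and 999 unaffected children with a dyslipidemia diagnosis, and 23743 affected and 38496 unaffected children without one.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for the 2x2 table
        [[a, b],    ASD:    dyslipidemia, no dyslipidemia
         [c, d]]    no ASD: dyslipidemia, no dyslipidemia
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    log_or = math.log(odds_ratio)
    return (
        odds_ratio,
        math.exp(log_or - z * se_log),
        math.exp(log_or + z * se_log),
    )


or_, lo, hi = odds_ratio_wald_ci(1083, 23743, 999, 38496)
print(f"OR = {or_:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the result rounds to the reported OR of 1.76 with a 95% CI of [1.61, 1.92]. The sketch treats the two groups as independent samples and ignores the sibling pairing.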
We next compared the prevalence of dyslipidemia diagnoses in 30,000 individuals diagnosed with ASD and in repeatedly sampled unrelated controls matched by age, gender, and zip code (as a marker for socioeconomic status). We found a significant enrichment of dyslipidemia-related diagnoses in individuals with ASD (P = 9.70 x 10^-66, Table 6-9). Similar findings were obtained for parents of children with ASD as compared to age-, gender-, and socioeconomically
matched controls, corroborating that dyslipidemia is an inherited risk factor for ASD. Thus independent large-scale datasets from disparate sources can provide unprecedented opportunities to
powerfully validate the implication of molecular mechanisms in ASD.
Diagnosis                                     Number of             Median number of      Median       Median
                                              individuals           matched individuals   Odds Ratio   Hypergeometric
                                              (ASD+, diagnosis+)    (ASD-, diagnosis+)                 P-value
Hyperlipidemia                                941                   425.5                 2.21         3.79 x 10^-46
Mixed hyperlipidemia                          350                   142                   2.46         9.80 x 10^-22
Hypercholesterolemia                          796                   486                   1.64         1.1 [illegible]
Lipid metabolism disorder NOS                 78                    30                    2.60         2.13 x 10^-6
Lipoprotein disorders                         59                    21                    2.81         1.25 x 10^-[illegible]
Other disorders of lipoid metabolism and
hyperalimentation                             20                    5                     4.00         2.04 x 10^-[illegible]
Any of the above                              1877                  986                   1.90         9.70 x 10^-66
Table 6-9 Significant enrichment of dyslipidemia-related diagnoses in individuals with ASD, detected in health claims data.
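The values in the odds ratio column of Table 6-9 agree with the ratio of the case count to the median matched-control count in each row. The short Python check below illustrates this observation; it is an arithmetic reading of the table, not a statement of how the column was originally computed, and the variable and function names are hypothetical.

```python
# (count in ASD sample, median count in matched control samples) per diagnosis,
# as listed in Table 6-9.
table_6_9 = {
    "Hyperlipidemia": (941, 425.5),
    "Mixed hyperlipidemia": (350, 142),
    "Hypercholesterolemia": (796, 486),
    "Lipid metabolism disorder NOS": (78, 30),
    "Lipoprotein disorders": (59, 21),
    "Other disorders of lipoid metabolism and hyperalimentation": (20, 5),
    "Any of the above": (1877, 986),
}


def count_ratio(case_count, median_control_count):
    """Ratio of diagnosis-positive cases to diagnosis-positive controls.

    With equally sized case and control samples this is a prevalence ratio.
    """
    return case_count / median_control_count


ratios = {name: round(count_ratio(*counts), 2) for name, counts in table_6_9.items()}
```

Since the case and control samples are equally sized, this ratio compares prevalences directly; when the diagnosis is rare in both groups it is numerically close to an odds ratio.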
6.3.4 Behavioral phenotypes of mouse models of dyslipidemia
The MGI database was mined to compare behavioral and nervous system phenotypes between
ASD mouse models and LDLR-deficient mice (Table 6-10). Five relevant phenotypes were
found to be significantly shared among ASD models and LDLR-deficient mice, including abnormal synapse morphology, abnormal neuronal proliferation, and abnormal spatial learning
(power > 80%, Fisher's exact test; Table 6-10). Thus there is a striking similarity between the behavioral and nervous system phenotypes of ASD and dyslipidemia mouse models.
Phenotype                                    % ASD models with    % LDLR deficient models    Power     P
                                             phenotype (n=42)     with phenotype (n=7)
Abnormal synapse morphology                  17%                  17%                        0.970     1.000
Abnormal spatial learning                    38%                  17%                        0.856     0.598
Abnormal neuronal precursor proliferation    34%                  34%                        0.9404    1.000
Increased body weight                        17%                  34%                        0.856     0.402
Abnormal hippocampus morphology              36%                  38%                        0.9404    1.000
Table 6-10 Behavioral and nervous system phenotypes shared between 42 mouse models of ASD
and 7 mouse models of LDLR deficiency.
6.4 Conclusions and Discussion
In this chapter, we developed a subgraph mining based method termed Implication of Co-regulated Exons (ICE), in order to identify exons that are co-regulated during brain development.
ICE serves as the basis of a comprehensive and integrative approach that delineates the biologic
foundations of ASD by leveraging recently available genomic, transcriptomic, EMR, and health
claims datasets. Besides reproducing previously reported convergent etiologies in ASD (e.g.,
immune, chromatin / transcriptional, synaptic, and growth dysregulation), we also discovered
and validated lipid dysregulation as a strong inherited risk factor for ASD. By integrating
streams of independent information, we identified sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious variation in several lipid metabolism genes. These include LDLR, LPL, CPNE1, PEBP4, GBGT1, and DDIT4L. All of these genes were found to be
mutated in individuals with developmental delay. LDLR knockdown mice have autistic symptoms, and DDIT4L is a component of the mTOR pathway, shown to be dysregulated in some
types of ASD. Importantly, we validated this novel etiology using both EMR and health claims
data from millions of children with ASD, their unaffected family members, and unrelated controls. Using EMR data, we demonstrated that children with ASD have lipid and cholesterol lab
values that are outside the reference ranges, which may be used to distinguish them from neurotypical children. Using health claims data, we showed that individuals diagnosed with ASD have
a significantly higher prevalence of dyslipidemia-related diagnoses as compared to age, gender,
and socioeconomically matched controls. We further showed that both fathers and mothers of
individuals with ASD are diagnosed with dyslipidemia disorders significantly more than
matched controls. Taken together, our work suggests that lipid dysregulation may be a strong
inherited risk factor for ASD.
Our results offer several practical considerations for improving early diagnosis of ASD, thereby
offering better outcomes for children with ASD [381]. First, this study suggests that families
with a history of dyslipidemia may be at increased risk for having children with ASD. They
should be counseled and monitored accordingly. Second, common lipid lab tests, including total
cholesterol, HDL, LDL, and triglyceride levels may be informative for screening newborns for
increased ASD risk. Follow-up studies should track the earliest age at which differences in lipid
profiles have sufficient sensitivity and specificity to be used as biomarkers, and design a prospective trial accordingly. Third, metabolomic studies, which include fatty acid derivatives, may
be used for early screening. This conclusion is also supported by targeted studies in small cohorts
that found altered lipid mediators in plasma from children with ASD as compared to matched
controls [382-385].
Roughly half of the human brain's weight is attributed to lipids. Rather than being used for energy storage, brain lipids are essential building blocks of cell membranes, the synaptic infrastructure of neurons, and the isolating elements of myelin [386,387]. Mutations in lipid regulators
have recently been shown to alter human brain function and growth, leading to intellectual disability and microcephaly [388-390]. Follow-up mechanistic studies in mice and cellular models of
ASD are needed to better understand how the gene disruptions described here contribute to ASD,
and how lipid augmentation therapies may normalize the ASD phenotype.
Chapter 7. Conclusion and Future Work
In this chapter, we conclude the dissertation by summarizing our contributions and proposing
directions for future work.
7.1 Contributions
This thesis proposed a series of models based on subgraph mining and factorization algorithms to
extract higher-order features (biomedical relations), temporal trends, and exon co-regulation and
to explore their correlations. A common theme of these models lies in the application of subgraph mining algorithms to extract higher-order features, temporal trends and exon co-regulation,
and application of factorization algorithms, at various depths, to model correlations between
higher-order features and temporal trends. Part of our contribution is the recognition that
subgraph structures recur across different biomedical subdomains: relations between biomedical concepts
in clinical narratives, temporal progression of patients' physiologic measurements in ICU time
series, and exons that are co-regulated during human brain development. Moreover, this dissertation demonstrated that using subgraph structures and groupings of subgraph structures (produced
by factorization algorithms) can lead not only to better accuracy, but also to better interpretability
and even novel insight into disease pathogenesis.
The above demonstrations span multiple concretely motivated medical problems. In the NLP
analysis of lymphoma pathology reports, sentence subgraphs lead to unsupervised extraction of
relations among a flexible number of medical concepts from clinical narrative text. Subgraph
Augmented Non-negative Tensor Factorization (SANTF) jointly models the interactions among
different types of features and reduces dimensionality at the same time, which then leads to better
interpretability and improved accuracy even in unsupervised learning.
In ICU mortality risk prediction, time series subgraphs lead to unsupervised extraction of multivariate temporal progression patterns, which are more informative than single time point measurements. Subgraph Augmented Non-negative Matrix Factorization (SANMF) explores the correlations among trends of different physiologic variables and reduces dimensionality at the same
time, which then leads to better interpretability and improved accuracy compared to snapshot
measurements and standalone subgraphs.
In Autism Spectrum Disorder (ASD) genetic risk analysis, Implication of Co-regulated Exons
(ICE) automatically identifies co-regulated exon clusters based on analyzing spatiotemporal profiles of exonic expressions during brain development. Expression burden analysis coupled with
segregation pattern analysis implicates variants in the identified co-regulated exon clusters with
the ASD phenotype. Together with functional analysis and clinical data analysis, ICE allows
identification of novel ASD risk factors including dyslipidemia. The integrative genomic analysis aggregating different modalities of patient data, pivoted by the subgraph mining algorithm
ICE, enables deeper understanding of the mechanisms of variations in the genome, which leads
to clinical insights and opportunities for early intervention.
We note that the graph representation generalizes across domains: it can represent relations
between concepts in medical NLP, temporal progression of physiologic measurements in ICU
time series, and co-regulated exons in ASD genomics. We also showed that the general framework of subgraph mining and factorization algorithms can be effective in supervised learning,
unsupervised learning, and association analysis.
7.2 Future Directions
By proposing a generalizable framework to mine subgraph structures and explore their correlations
in multiple biomedical subdomains, this thesis lays the foundation for several research directions that can potentially change the current practice of medicine.
Automated cancer pathology on a truly global scale: In the current lymphoma classification
guidelines, the Asian population is severely underrepresented. Incorporating Asian lymphoma patients will at least double the size of the existing patient cohort. This may lead to a better elucidated boundary between current gray-zone lymphoma subtypes, or to previously undiscovered subtypes. Moreover, examining the difference in treatment courses between Asian and
Caucasian patients at a large scale may lend insight into optimal intervention strategies. We are
genuinely interested in the influence of such integration on the understanding of the entire
terrain of lymphoma pathology.
On a broader horizon, as pathology advances, what previously constituted one cancer category is
now often regarded as multiple diseases or even a spectrum of diseases. This shift will likely
generate phenomenal impact on society if one can automatically identify sub-cohorts of cancer
patients that share Omic and phenotypic signatures and that can benefit from targeted medications. To this end, automated diagnostic guideline construction is a promising application. Moreover, integrating the Omic data and published literature will not only impose practical application demands but also raise fundamental methodology challenges to big data analysis (e.g.,
[391]).
Utilizing symbiosis among common laboratory tests to improve clinical decision making:
On the clinical monitoring side, we plan to model outcome-specific patient profiles, where outcomes can be as specific as weaning from a ventilator or response to steroids. This will enable a variety of clinical applications ranging from treatment plan selection to informed staffing to operational decisions. In addition to modeling ICU patients' conditions, the SANMF/SANTF framework
can also be utilized to study chronic conditions such as chronic kidney disease, where early
symptoms such as tiredness and troubled sleep are often ignored and physiologic variable monitoring may offer a chance of early detection and early intervention.
On the other hand, the effectiveness of SANMF on physiologic variable evolution demonstrated
the shared information among certain common laboratory tests. Exploring their correlation in mortality risk prediction is only a first step towards unlocking their hidden diagnostic utility. It is
important, in the long term, to fully investigate the extent of information redundancy and potential symbiosis among all common laboratory tests regarding their diagnostic utility. An immediate plan is to build an information theoretic framework to quantify the information shared between the actual test results and the test results predicted from other concurrent test results.
Such an improved understanding of the complex relationships and patterns within sets of laboratory tests will be incorporated into electronic clinical decision systems to enhance laboratory test
result interpretation and increase the diagnostic information that can be extracted from laboratory
testing.
Associate functionally related groups of genetic networks with phenotype: Implicating neurodevelopmentally co-regulated exon clusters with the ASD phenotype still leaves the following fact
unaccounted for: co-regulated exon clusters may in fact be functionally related to each other (e.g.,
in the same known pathway). One can use known metabolic networks, known genetic pathways,
and known protein-protein interaction networks to correlate exon clusters and integrate features
from the entire Omic hierarchy into the SANTF model. This network-wise interaction study
(NWIS) will lead to a whole new level of integrative genomic analysis and help us better understand the complete genetic mechanisms. This will likely generate more specific markers for
the early detection of ASD and elucidate targeted treatments and interventions against the developmental course of ASD.
Taking this intuition one step further, I am also interested in investigating the association between functionally related groups of genetic networks with multiple distinct but related nervous
system disorders. It has been shown that multiple neurodegenerative diseases, including Alzheimer's disease, Parkinson's disease, Huntington's disease and Amyotrophic Lateral Sclerosis
(ALS), share common genetic and metabolic pathways such as those for protein degradation.
Moreover, patients with neurodegenerative diseases often have late onsets, suggesting the progressive and cumulative effect of intracellular pathogenesis mechanisms including protein degradation abnormality and mitochondrial dysfunction. To better understand disease progression
and explore options for preventative intervention, new online methods are needed to integrate
both progressive observations and intervention outcomes into disease modeling. Pilot research
projects and cross-field collaborations have great potential to break the silos and to unlock better
therapeutic opportunities for all jointly studied diseases and disorders.
Bibliography
[1]
G. McNeill and D. Bryden, "Do either early warning systems or emergency response
teams improve hospital patient survival? A systematic review," Resuscitation, vol. 84,
2013, pp. 1652-1667.
[2]
0. Uzuner, B.R. South, S. Shen, and S.L. DuVall, "2010 i2b2/VA challenge on concepts,
assertions, and relations in clinical text," Journal of the American Medical Informatics
Association, vol. 18, 2011, pp. 552-556.
[3]
D. Nadeau and S. Sekine, "A survey of named entity recognition and classification,"
Lingvisticae Investigationes, vol. 30, 2007, pp. 3-26.
[4]
[5]
[6]
[7]
R. Grishman and B. Sundheim, "Message Understanding Conference-6: A Brief History.,"
COLING, 1996, pp. 466-471.
M.D. Buist, G.E. Moore, S.A. Bernard, B.P. Waxman, J.N. Anderson, and T.V. Nguyen,
"Effects of a medical emergency team on reduction of incidence of and mortality from
unexpected cardiac arrests in hospital: preliminary study," Bmj, vol. 324, 2002, pp. 387390.
P.S. Chan, R. Jain, B.K. Nallmothu, R.A. Berg, and C. Sasson, "Rapid response teams: a
systematic review and meta-analysis," Archives of internal medicine, vol. 170, 2010, pp.
18-26.
L.T. Kohn, J.M. Corrigan, M.S. Donaldson, and others, To Err Is Human:: Building a
Safer Health System, National Academies Press, 2000.
[8]
D.R. Levinson and I. General, "Adverse events in hospitals: national incidence among
Medicare beneficiaries," Department of Health and Human Services Office of the
Inspector General, 2010.
[9]
[10]
Y. Bar-Shalom and T.E. Fortmann, Tracking and Data Association, Academic Press,
1988.
S. Saria, A.K. Rajani, J. Gould, D.L. Koller, and A.A. Penn, "Integration of early
physiological responses predicts later illness severity in preterm infants," Science
TranslationalMedicine, vol. 2, 2010, pp. 48-65.
[11]
A.S. Willsky, E.B. Sudderth, M.I. Jordan, and E.B. Fox, "Nonparametric Bayesian
learning of switching linear dynamical systems," Advances in Neural Information
ProcessingSystems, 2008, pp. 457-464.
[12]
H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng, "Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations," Proceedings of the 26th
Annual InternationalConference on Machine Learning, ACM, 2009, pp. 609-616.
[13]
[14]
A. Mueen, E.J. Keogh, Q. Zhu, S. Cash, and M.B. Westover, "Exact Discovery of Time
Series Motifs.," SDM, 2009, pp. 473-484.
B.D. Walker and G.Y. Xu, "Unravelling the mechanisms of durable control of HIV-1,"
Nature Reviews Immunology, vol. 13, 2013, pp. 487-498.
[15]
[16]
S.J. Sanders, M.T. Murtha, A.R. Gupta, J.D. Murdoch, M.J. Raubeson, A.J. Willsey, A.G.
Ercan-Sencicek, N.M. DiLullo, N.N. Parikshak, J.L. Stein, and others, "De novo
mutations revealed by whole-exome sequencing are strongly associated with autism,"
Nature, vol. 485, 2012, pp. 237-241.
B.M. Neale, Y. Kou, L. Liu, A. Ma'Ayan, K.E. Samocha, A. Sabo, C.-F. Lin, C. Stevens,
L.-S. Wang, V. Makarov, and others, "Patterns and rates of exonic de novo mutations in
autism spectrum disorders," Nature, vol. 485, 2012, pp. 242-245.
157
[17]
B.J. O'Roak, L. Vives, S. Girirajan, E. Karakoc, N. Krumm, B.P. Coe, R. Levy, A. Ko, C.
Lee, J.D. Smith, and others, "Sporadic autism exomes reveal a highly interconnected
protein network of de novo mutations," Nature, vol. 485, 2012, pp. 246-250.
[18]
I. Iossifov, M. Ronemus, D. Levy, Z. Wang, I. Hakker, J. Rosenbaum, B. Yamrom, Y.
Lee, G. Narzisi, A. Leotta, and others, "De novo gene disruptions in children on the
autistic spectrum," Neuron, vol. 74, 2012, pp. 285-299.
[19]
Y. Jiang, R.K. Yuen, X. Jin, M. Wang, N. Chen, X. Wu, J. Ju, J. Mei, Y. Shi, M. He, and
others, "Detection of clinically relevant genetic variants in autism spectrum disorder by
whole-genome sequencing," The American Journal of Human Genetics, vol. 93, 2013, pp.
249-263.
[20]
R.K. Yuen, B. Thiruvahindrapuram, D. Merico, S. Walker, K. Tammimies, N. Hoang, C.
Chrysler, T. Nalpathamkalam, G. Pellecchia, Y. Liu, M.J. Gazzellone, L. D'Abate, E.
Deneault, J.L. Howe, R.S.C. Liu, A. Thompson, M. Zarrei, M. Uddin, C.R. Marshall, R.H.
Ring, L. Zwaigenbaum, P.N. Ray, R. Weksberg, Carter, B.A. Fernandez, W. Roberts, P.
Szatmari, and S.W. Scherer, "Whole-genome sequencing of quartet families with autism
spectrum disorder," Nature Medicine, vol. 21, 2015, pp. 185-191.
[21]
S. Nemirovsky, M. Cordoba, J. Zaiat, S. Completa, P. Vega, D. Gonzalez-Moron, N.
Medina, M. Fabbro, S. Romero, B. Brun, S. Revale, M. Ogara, A. Pecci, M. Marti, M.
Vazquez, A. Turjanski, and M. Kauffman, "Whole Genome Sequencing Reveals a De
Novo SHANK3 Mutation in Familial Autism Spectrum Disorder," PloS one, vol. 10,
2015, p. e0116358.
[22]
S. De Rubeis, X. He, A.P. Goldberg, C.S. Poultney, K. Samocha, A.E. Cicek, Y. Kou, L.
Liu, M. Fromer, S. Walker, and others, "Synaptic, transcriptional and chromatin genes
disrupted in autism," Nature, vol. 515, 2014, pp. 209-215.
[23]
I. Iossifov, B.J. O'Roak, S.J. Sanders, M. Ronemus, N. Krumm, D. Levy, H.A. Stessman,
K.T. Witherspoon, L. Vives, K.E. Patterson, and others, "The contribution of de novo
coding mutations to autism spectrum disorder," Nature, vol. 515, 2014, pp. 216-221.
[24]
S. Dong, M.F. Walker, N.J. Carriero, M. DiCola, A.J. Willsey, Y.Y. Adam, Z. Waqar,
L.E. Gonzalez, J.D. Overton, S. Frahm, and others, "De novo insertions and deletions of
predominantly paternal origin are associated with autism spectrum disorder," Cell reports,
vol. 9, 2014, pp. 16-23.
[25]
G.K. Savova, J.J. Masanz, P.V. Ogren, J. Zheng, S. Sohn, K.C. Kipper-Schuler, and C.G.
Chute, "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES):
architecture, component evaluation and applications," Journal of the American Medical
Informatics Association, vol. 17, 2010, pp. 507-513.
[26]
A.R. Aronson, "Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The
MetaMap Program," AMIA annual symposium proceedings, vol. 2001, 2001, pp. 17-21.
[27]
W.W. Chapman, W. Bridewell, P. Hanbury, G.F. Cooper, and B.G. Buchanan, "A simple
algorithm for identifying negated findings and diseases in discharge summaries," Journal
of biomedical informatics, vol. 34, 2001, pp. 301-310.
[28]
J.-D. Kim, Y. Wang, T. Takagi, and A. Yonezawa, "Overview of genia event task in
bionlp shared task 2011," Proceedings of the BioNLP Shared Task 2011 Workshop,
Association for Computational Linguistics, 2011, pp. 7-15.
[29]
M. Krallinger, F. Leitner, C. Rodriguez-Penagos, A. Valencia, and others, "Overview of
the protein-protein interaction annotation extraction task of BioCreative II," Genome
biology, vol. 9, 2008, p. S4.
[30]
F. Leitner, S.A. Mardis, M. Krallinger, G. Cesareni, L.A. Hirschman, and A. Valencia,
"An overview of BioCreative II.5," Computational Biology and Bioinformatics,
IEEE/ACM Transactions on, vol. 7, 2010, pp. 385-399.
[31]
J.-D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP'09 shared
task on event extraction," Proceedings of the Workshop on Current Trends in Biomedical
Natural Language Processing: Shared Task, Association for Computational Linguistics,
2009, pp. 1-9.
[32]
C. Nedellec, R. Bossy, J.-D. Kim, J.-J. Kim, T. Ohta, S. Pyysalo, and P. Zweigenbaum,
"Overview of BioNLP shared task 2013," Proceedings of the BioNLP Shared Task 2013
Workshop, 2013, pp. 1-7.
[33]
I. Segura-Bedmar, P. Martinez, and D. Sanchez-Cisneros, "The 1st DDIExtraction-2011
challenge task: Extraction of Drug-Drug Interactions from biomedical texts," Proceedings
of the 1st Challenge Task on Drug-Drug Interaction Extraction, vol. 761, 2011, pp. 1-9.
[34]
I. Segura-Bedmar, P. Martinez, and M. Herrero-Zazo, "Semeval-2013 task 9: Extraction
of drug-drug interactions from biomedical texts (ddiextraction 2013)," Proceedings of
Semeval, 2013, pp. 341-350.
[35]
R. Chang, "Individual outcome prediction models for intensive care units," The Lancet,
vol. 334, 1989, pp. 143-146.
[36]
L. Ohno-Machado, F.S. Resnic, and M.E. Matheny, "Prognosis in critical care," Annu.
Rev. Biomed. Eng., vol. 8, 2006, pp. 567-599.
[37]
Y. Zhang and P. Szolovits, "Patient-specific learning in real time for adaptive monitoring
in critical care," Journal of biomedical informatics, vol. 41, 2008, pp. 452-460.
[38]
D.P. Bota, C. Melot, F.L. Ferreira, V.N. Ba, and J.-L. Vincent, "The multiple organ
dysfunction score (MODS) versus the sequential organ failure assessment (SOFA) score
in outcome prediction," Intensive care medicine, vol. 28, 2002, pp. 1619-1624.
[39]
W.A. Knaus, D. Wagner, E.A. Draper, J. Zimmerman, M. Bergner, P.G. Bastos, C.
Sirio, D. Murphy, T. Lotring, and A. Damiano, "The APACHE III prognostic system.
Risk prediction of hospital mortality for critically ill hospitalized adults," CHEST
Journal, vol. 100, 1991, pp. 1619-1636.
[40]
J.-R. Le Gall, S. Lemeshow, and F. Saulnier, "A new simplified acute physiology score
(SAPS II) based on a European/North American multicenter study," JAMA: the journal of
the American Medical Association, vol. 270, 1993, pp. 2957-2963.
[41]
J.A. Quinn, C.K. Williams, and N. McIntosh, "Factorial switching linear dynamical
systems applied to physiological condition monitoring," Pattern Analysis and Machine
Intelligence, IEEE Transactions on, vol. 31, 2009, pp. 1537-1551.
[42]
A. Silva, P. Cortez, M.F. Santos, L. Gomes, and J. Neves, "Mortality assessment in
intensive care units via adverse events using artificial neural networks," Artificial
Intelligence in Medicine, vol. 36, 2006, pp. 223-234.
[43]
M.J. Cohen, A.D. Grossman, D. Morabito, M.M. Knudson, A.J. Butte, and G.T. Manley,
"Identification of complex metabolic states in critically injured patients using
bioinformatic cluster analysis," 2010.
[44]
C.W. Hug and P. Szolovits, "ICU acuity: real-time models versus daily models," AMIA
Annual Symposium Proceedings, American Medical Informatics Association, 2009, p.
260.
[45]
R. Joshi and P. Szolovits, "Prognostic Physiology: Modeling Patient Severity in Intensive
Care Units Using Radial Domain Folding," AMIA Annual Symposium Proceedings,
American Medical Informatics Association, 2012, p. 1276.
[46]
J. Yin and H. Li, "A sparse conditional Gaussian graphical model for analysis of genetical
genomics data," The annals of applied statistics, vol. 5, 2011, p. 2630.
[47]
S. Kim and E.P. Xing, "Statistical estimation of correlated genome associations to a
quantitative trait network," PLoS genetics, vol. 5, 2009, p. e1000587.
[48]
J. Bergelson and F. Roux, "Towards identifying genes underlying ecologically relevant
traits in Arabidopsis thaliana," Nature Reviews Genetics, vol. 11, 2010, pp. 867-879.
[49]
R. Brachman and H. Levesque, Knowledge representation and reasoning, Elsevier, 2004.
[50]
J.F. Sowa, "Knowledge representation: logical, philosophical, and computational
foundations," 1999.
[51]
M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, "KEGG for integration and
interpretation of large-scale molecular data sets," Nucleic acids research, vol. 40, 2012,
pp. D109-D114.
[52]
A. Franceschini, D. Szklarczyk, S. Frankild, M. Kuhn, M. Simonovic, A. Roth, J. Lin, P.
Minguez, P. Bork, C. von Mering, and others, "STRING v9.1: protein-protein interaction
networks, with increased coverage and integration," Nucleic acids research, vol. 41, 2013,
pp. D808-D815.
[53]
S. Hunter, P. Jones, A. Mitchell, R. Apweiler, T.K. Attwood, A. Bateman, T. Bernard, D.
Binns, P. Bork, S. Burge, and others, "InterPro in 2011: new developments in the family
and domain prediction database," Nucleic acids research, vol. 40, 2012, pp. D306-D312.
[54]
S.-K. Ng, Z. Zhang, S.-H. Tan, and K. Lin, "InterDom: a database of putative interacting
protein domains for validating predicted protein interactions and complexes," Nucleic
acids research, vol. 31, 2003, pp. 251-254.
[55]
M. Hewett, D.E. Oliver, D.L. Rubin, K.L. Easton, J.M. Stuart, R.B. Altman, and T.E.
Klein, "PharmGKB: the pharmacogenetics knowledge base," Nucleic acids research, vol.
30, 2002, pp. 163-165.
[56]
C.J. Patel, R. Chen, and A.J. Butte, "Data-driven integration of epidemiological and
toxicological data to select candidate interacting genes and environmental factors in
association with disease," Bioinformatics, vol. 28, 2012, pp. i121-i126.
[57]
M.J. Landrum, J.M. Lee, G.R. Riley, W. Jang, W.S. Rubinstein, D.M. Church, and D.R.
Maglott, "ClinVar: public archive of relationships among sequence variation and human
phenotype," Nucleic acids research, vol. 42, 2014, pp. D980-D985.
[58]
A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths
graph kernel for protein-protein interaction extraction with evaluation of cross-corpus
learning," BMC bioinformatics, vol. 9, 2008, p. S2.
[59]
M. Miwa, R. Sætre, Y. Miyao, and J. Tsujii, "A rich feature vector for protein-protein
interaction extraction from multiple corpora," Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing: Volume 1-Volume 1, Association
for Computational Linguistics, 2009, pp. 121-130.
[60]
H.-W. Chun, Y. Tsuruoka, J.-D. Kim, R. Shiba, N. Nagata, T. Hishiki, and J. Tsujii,
"Extraction of gene-disease relations from Medline using domain dictionaries and
machine learning," Pacific Symposium on Biocomputing, 2006, pp. 4-15.
[61]
A. Özgür, T. Vu, G. Erkan, and D.R. Radev, "Identifying gene-disease associations using
centrality on a literature mined gene-interaction network," Bioinformatics, vol. 24, 2008,
pp. i277-i285.
[62]
E. Camon, M. Magrane, D. Barrell, V. Lee, E. Dimmer, J. Maslen, D. Binns, N. Harte, R.
Lopez, and R. Apweiler, "The Gene Ontology annotation (GOA) database: sharing
knowledge in Uniprot with Gene Ontology," Nucleic acids research, vol. 32, 2004, pp.
D262-D266.
[63]
G.D. Bader, M.P. Cary, and C. Sander, "Pathguide: a pathway resource list," Nucleic
acids research, vol. 34, 2006, pp. D504-D506.
[64]
Y. Luo, G. Riedlinger, and P. Szolovits, "Text Mining in Cancer Gene and Pathway
Prioritization," Cancer informatics, vol. 13, 2014, p. 69.
[65]
J. Chen, E.E. Bardes, B.J. Aronow, and A.G. Jegga, "ToppGene Suite for gene list
enrichment analysis and candidate gene prioritization," Nucleic acids research, vol. 37,
2009, pp. W305-W311.
[66]
J. Chen, H. Xu, B.J. Aronow, and A.G. Jegga, "Improved human disease candidate gene
prioritization using mouse phenotype," BMC bioinformatics, vol. 8, 2007, p. 392.
[67]
M.A. van Driel, J. Bruggeman, G. Vriend, H.G. Brunner, and J.A. Leunissen, "A
text-mining analysis of the human phenome," European journal of human genetics, vol. 14,
2006, pp. 535-542.
[68]
T.H. Pers, P. Dworzyński, C.E. Thomas, K. Lage, and S. Brunak, "MetaRanker 2.0: a web
server for prioritization of genetic variation data," Nucleic acids research, vol. 41, 2013,
pp. W104-W108.
[69]
S. Raychaudhuri, R.M. Plenge, E.J. Rossin, A.C. Ng, S.M. Purcell, P. Sklar, E.M.
Scolnick, R.J. Xavier, D. Altshuler, M.J. Daly, and others, "Identifying relationships
among genomic disease regions: predicting genes at pathogenic SNP associations and rare
deletions," PLoS genetics, vol. 5, 2009, p. e1000534.
[70]
US National Library of Medicine, "ClinicalTrial.gov https://clinicaltrial.gov/."
[71]
S.R. Thadani, C. Weng, J.T. Bigger, J.F. Ennever, and D. Wajngurt, "Electronic screening
improves efficiency in clinical trial recruitment," Journal of the American Medical
Informatics Association, vol. 16, 2009, pp. 869-873.
[72]
R. Miotto and C. Weng, "Unsupervised mining of frequent tags for clinical eligibility text
indexing," Journal of biomedical informatics, vol. 46, 2013, pp. 1145-1151.
[73]
S.W. Tu, M. Peleg, S. Carini, M. Bobak, J. Ross, D. Rubin, and I. Sim, "A practical
method for transforming free-text eligibility criteria into computable criteria," Journal of
biomedical informatics, vol. 44, 2011, pp. 239-250.
[74]
B. deBruijn, S. Carini, S. Kiritchenko, J. Martin, and I. Sim, "Automated information
extraction of key trial design elements from clinical trial publications," AMIA Annual
Symposium Proceedings, American Medical Informatics Association, 2008, p. 141.
[75]
C. Weng, X. Wu, Z. Luo, M.R. Boland, D. Theodoratos, and S.B. Johnson, "EliXR: an
approach to eligibility criteria extraction and representation," Journal of the American
Medical Informatics Association, vol. 18, 2011, pp. i116-i124.
[76]
T. Hao, A. Rusanov, M.R. Boland, and C. Weng, "Clustering clinical trials with similar
eligibility criteria features," Journal of biomedical informatics, vol. 52, 2014, pp. 112-120.
[77]
T. Klein, J. Chang, M. Cho, K. Easton, R. Fergerson, M. Hewett, Z. Lin, Y. Liu, S. Liu, D.
Oliver, and others, "Integrating genotype and phenotype information: an overview of the
PharmGKB project," Pharmacogenomics J, vol. 1, 2001, pp. 167-170.
[78]
A. Coulet, N.H. Shah, Y. Garten, M. Musen, and R.B. Altman, "Using text to build
semantic networks for pharmacogenomics," Journal of biomedical informatics, vol. 43,
2010, pp. 1009-1019.
[79]
Y. Garten and R.B. Altman, "Pharmspresso: a text mining tool for extraction of
pharmacogenomic concepts and relationships from full text," BMC bioinformatics, vol.
10, 2009, p. S6.
[80]
B. Percha, Y. Garten, R.B. Altman, and others, "Discovery and explanation of drug-drug
interactions via text mining," Pac Symp Biocomput, World Scientific, 2012, p. 421.
[81]
S.V. Pakhomov, J.D. Buntrock, and C.G. Chute, "Automating the assignment of diagnosis
codes to patient encounters using example-based and machine learning techniques,"
Journal of the American Medical Informatics Association, vol. 13, 2006, pp. 516-525.
[82]
A.B. Wilcox and G. Hripcsak, "The role of domain knowledge in automating medical text
report classification," Journal of the American Medical Informatics Association, vol. 10,
2003, pp. 330-338.
[83]
D.B. Aronow, F. Fangfang, and W.B. Croft, "Ad hoc classification of radiology reports,"
Journal of the American Medical Informatics Association, vol. 6, 1999, pp. 393-411.
[84]
D. Aronsky and P.J. Haug, "Automatic identification of patients eligible for a pneumonia
guideline," Proceedings of the AMIA Symposium, American Medical Informatics
Association, 2000, p. 12.
[85]
M. Fiszman, W.W. Chapman, D. Aronsky, R.S. Evans, and P.J. Haug, "Automatic
detection of acute bacterial pneumonia from chest X-ray reports," Journal of the
American Medical Informatics Association, vol. 7, 2000, pp. 593-604.
[86]
H.-M. Lu, D. Zeng, L. Trujillo, K. Komatsu, and H. Chen, "Ontology-enhanced automatic
chief complaint classification for syndromic surveillance," Journal of biomedical
informatics, vol. 41, 2008, pp. 340-356.
[87]
Y. Luo, A. Sohani, E. Hochberg, and P. Szolovits, "Automatic Lymphoma Classification
with Sentence Subgraph Mining from Pathology Reports," Journal of the American
Medical Informatics Association (JAMIA), vol. 21, 2014, pp. 824-832.
[88]
Y. Luo, Y. Xin, E. Hochberg, R. Joshi, O. Uzuner, and P. Szolovits, "Subgraph
Augmented Non-Negative Tensor Factorization (SANTF) for Modeling Clinical Text,"
Journal of the American Medical Informatics Association (JAMIA), in press, 2015.
[89]
G. Onder, C. Pedone, F. Landi, M. Cesari, C. Della Vedova, R. Bernabei, and G.
Gambassi, "Adverse drug reactions as cause of hospital admissions: results from the
Italian Group of Pharmacoepidemiology in the Elderly (GIFA)," Journal of the American
Geriatrics Society, vol. 50, 2002, pp. 1962-1968.
[90]
H. Zheng, H. Wang, H. Xu, Y. Wu, Z. Zhao, and F. Azuaje, "Linking Biochemical
Pathways and Networks to Adverse Drug Reactions," NanoBioscience, IEEE
Transactions on, vol. 13, 2014, pp. 131-137.
[91]
M. Liu, Y. Wu, Y. Chen, J. Sun, Z. Zhao, X. Chen, M.E. Matheny, and H. Xu,
"Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic
properties of drugs," Journal of the American Medical Informatics Association, vol. 19,
2012, pp. e28-e35.
[92]
R. Harpaz, S. Vilar, W. DuMouchel, H. Salmasian, K. Haerian, N.H. Shah, H.S. Chase,
and C. Friedman, "Combing signals from spontaneous reports and electronic health
records for detection of adverse drug reactions," Journal of the American Medical
Informatics Association, 2012, p. amiajnl-2012.
[93]
J. Li, X. Zhu, and J.Y. Chen, "Building disease-specific drug-protein connectivity maps
from molecular interaction networks and PubMed abstracts," PLoS computational biology,
vol. 5, 2009, p. e1000450.
[94]
C. Blaschke, M.A. Andrade, C.A. Ouzounis, and A. Valencia, "Automatic extraction of
biological information from scientific text: protein-protein interactions," ISMB, 1999, pp.
60-67.
[95]
B. Rosario and M.A. Hearst, "Classifying semantic relations in bioscience texts,"
Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics,
Association for Computational Linguistics, 2004, p. 430.
[96]
B. Rosario and M.A. Hearst, "Multi-way relation classification: application to
protein-protein interactions," Proceedings of the conference on Human Language Technology
and Empirical Methods in Natural Language Processing, Association for Computational
Linguistics, 2005, pp. 732-739.
[97]
D. Hristovski, C. Friedman, T.C. Rindflesch, and B. Peterlin, "Exploiting semantic
relations for literature-based discovery," AMIA annual symposium proceedings, American
Medical Informatics Association, 2006, pp. 349-353.
[98]
T.C. Rindflesch and M. Fiszman, "The interaction of domain knowledge and linguistic
structure in natural language processing: interpreting hypernymic propositions in
biomedical text," Journal of biomedical informatics, vol. 36, 2003, pp. 462-477.
[99]
Y. Luo and O. Uzuner, "Semi-Supervised Learning to Identify UMLS Semantic
Relations," AMIA Joint Summits on Translational Science, 2014.
[100] S. Nijssen and J.N. Kok, "The gaston tool for frequent subgraph mining," Electronic
Notes in Theoretical Computer Science, vol. 127, 2005, pp. 77-87.
[101] K. Roberts, B. Rink, and S. Harabagiu, "Extraction of medical concepts, assertions, and
relations from discharge summaries for the fourth i2b2/VA shared task," Proceedings of
the 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical
Data. Boston, MA, USA: i2b2, 2010.
[102] B. deBruijn, C. Cherry, S. Kiritchenko, J. Martin, and X. Zhu, "Machine-learned solutions
for three stages of clinical information extraction: the state of the art at i2b2 2010,"
Journal of the American Medical Informatics Association, vol. 18, 2011, pp. 557-562.
[103] H. Xu, S.P. Stenner, S. Doan, K.B. Johnson, L.R. Waitman, and J.C. Denny, "MedEx: a
medication information extraction system for clinical narratives," Journal of the
American Medical Informatics Association, vol. 17, 2010, pp. 19-24.
[104] P. Anick, P. Hong, N. Xue, and D. Anick, "Concept, Assertion and Relation Extraction at
the 2010 i2b2 Relation Extraction Challenge using parsing information and dictionaries,"
Proc. of i2b2/VA Shared-Task. Washington, DC, 2010.
[105] H. Liu, L. Hunter, V. Kešelj, and K. Verspoor, "Approximate Subgraph Matching-Based
Literature Mining for Biomedical Events and Relations," PloS one, vol. 8, 2013, p.
e60954.
[106] H. Liu, R. Komandur, and K. Verspoor, "From graphs to events: A subgraph matching
approach for information extraction from biomedical text," Proceedings of the BioNLP
Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 164-172.
[107] A. MacKinlay, D. Martinez, A.J. Yepes, H. Liu, W.J. Wilbur, and K. Verspoor,
"Extracting biomedical events and modifications using subgraph matching with noisy
training data," Proceedings of the BioNLP Shared Task 2013 Workshop. Association for
Computational Linguistics, Sofia, Bulgaria, 2013, pp. 35-44.
[108] K. Ravikumar, H. Liu, J.D. Cohn, M.E. Wall, K. Verspoor, and others, "Literature mining
of protein-residue associations with graph rules learned through distant supervision.," J.
Biomedical Semantics, vol. 3, 2012, p. S2.
[109] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I.N. Shindyalov,
and P.E. Bourne, "The protein data bank," Nucleic acids research, vol. 28, 2000, pp. 235-242.
[110] H. Liu, Z.-Z. Hu, J. Zhang, and C. Wu, "BioThesaurus: a web-based thesaurus of protein
and gene names," Bioinformatics, vol. 22, 2006, pp. 103-105.
[111] J. Björne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting
complex biological events with rich graph-based feature sets," Proceedings of the
Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task,
Association for Computational Linguistics, 2009, pp. 10-18.
[112] J. Björne and T. Salakoski, "Generalizing biomedical event extraction," Proceedings of
the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics,
2011, pp. 183-191.
[113] J. Björne and T. Salakoski, "TEES 2.1: Automated annotation scheme learning in the
BioNLP 2013 shared task," Proceedings of the BioNLP Shared Task 2013 Workshop,
2013, pp. 16-25.
[114] J. Björne, A. Airola, T. Pahikkala, and T. Salakoski, "Drug-drug interaction extraction
from biomedical texts with svm and rls classifiers," Proceedings of DDIExtraction-2011
challenge task, 2011, pp. 35-42.
[115] K. Hakala, S. Van Landeghem, T. Salakoski, Y. Van de Peer, and F. Ginter, "EVEX in
ST'13: Application of a large-scale text mining resource to event extraction and network
construction," Proceedings of the BioNLP Shared Task 2013 Workshop, 2013, pp. 26-34.
[116] U. Consortium and others, "The universal protein resource (UniProt)," Nucleic acids
research, vol. 36, 2008, pp. D190-D195.
[117] H. Kilicoglu and S. Bergler, "Adapting a general semantic interpretation approach to
biological event extraction," Proceedings of the BioNLP Shared Task 2011 Workshop,
Association for Computational Linguistics, 2011, pp. 173-182.
[118] H. Kilicoglu and S. Bergler, "Syntactic dependency based heuristics for biological event
extraction," Proceedings of the Workshop on Current Trends in Biomedical Natural
Language Processing: Shared Task, Association for Computational Linguistics, 2009, pp.
119-127.
[119] J. Hakenberg, I. Solt, D. Tikk, L. Tari, A. Rheinländer, Q.L. Nguyen, G. Gonzalez, and U.
Leser, "Molecular event extraction from link grammar parse trees," Proceedings of the
Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task,
Association for Computational Linguistics, 2009, pp. 86-94.
[120] J. Hakenberg, R. Leaman, N. Ha Vo, S. Jonnalagadda, R. Sullivan, C. Miller, L. Tari, C.
Baral, and G. Gonzalez, "Efficient extraction of protein-protein interactions from full-text
articles," IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB),
vol. 7, 2010, pp. 481-494.
[121] P. Thomas, S. Pietschmann, I. Solt, D. Tikk, and U. Leser, "Not all links are equal:
exploiting dependency types for the extraction of protein-protein interactions from text,"
Proceedings of BioNLP 2011 Workshop, Association for Computational Linguistics, 2011,
pp. 1-9.
[122] J. Hakenberg, C. Plake, R. Leaman, M. Schroeder, and G. Gonzalez, "Inter-species
normalization of gene mentions with GNAT," Bioinformatics, vol. 24, 2008, pp. i126-i132.
[123] S. Riedel and A. McCallum, "Robust biomedical event extraction with dual
decomposition and minimal domain adaptation," Proceedings of the BioNLP Shared Task
2011 Workshop, Association for Computational Linguistics, 2011, pp. 46-50.
[124] D. McClosky, M. Surdeanu, and C.D. Manning, "Event extraction as dependency parsing,"
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies-Volume 1, Association for Computational Linguistics,
2011, pp. 1626-1635.
[125] S. Van Landeghem, Y. Saeys, B. De Baets, and Y. Van de Peer, "Analyzing text in search
of bio-molecular events: a high-precision machine learning framework," Proceedings of
the Workshop on Current Trends in Biomedical Natural Language Processing: Shared
Task, Association for Computational Linguistics, 2009, pp. 128-136.
[126] K. Kaljurand, G. Schneider, and F. Rinaldi, "UZurich in the BioNLP 2009 shared task,"
Proceedings of the Workshop on Current Trends in Biomedical Natural Language
Processing: Shared Task, Association for Computational Linguistics, 2009, pp. 28-36.
[127] S. Kerrien, Y. Alam-Faruque, B. Aranda, I. Bancarz, A. Bridge, C. Derow, E. Dimmer, M.
Feuermann, A. Friedrichsen, R. Huntley, and others, "IntAct-open source resource for
molecular interaction data," Nucleic acids research, vol. 35, 2007, pp. D561-D565.
[128] A. Vlachos, P. Buttery, D.O. Seaghdha, and T. Briscoe, "Biomedical event extraction
without training data," Proceedings of the Workshop on Current Trends in Biomedical
Natural Language Processing: Shared Task, Association for Computational Linguistics,
2009, pp. 37-40.
[129] D. McClosky, M. Surdeanu, and C.D. Manning, "Event extraction as dependency parsing
in BioNLP 2011," Proceedings of the BioNLP Shared Task 2011 Workshop, Association
for Computational Linguistics, 2011, pp. 41-45.
[130] C. Quirk, P. Choudhury, M. Gamon, and L. Vanderwende, "Msr-nlp entry in bionlp
shared task 2011," Proceedings of the BioNLP Shared Task 2011 Workshop, Association
for Computational Linguistics, 2011, pp. 155-163.
[131] M. Miwa, P. Thompson, J. McNaught, D.B. Kell, and S. Ananiadou, "Extracting
semantically enriched events from biomedical literature," BMC bioinformatics, vol. 13,
2012, p. 108.
[132] A. Coulet, Y. Garten, M. Dumontier, R.B. Altman, M.A. Musen, N.H. Shah, and others,
"Integration and publication of heterogeneous text-mined relationships on the Semantic
Web.," J. Biomedical Semantics, vol. 2, 2011, p. S10.
[133] J. Hakenberg, D. Voronov, V.H. Nguyen, S. Liang, S. Anwar, B. Lumpkin, R. Leaman, L.
Tari, and C. Baral, "A SNPshot of PubMed to associate genetic variants with drugs,
diseases, and adverse reactions," Journal of biomedical informatics, vol. 45, 2012, pp.
842-850.
[134] M. Kuhn, M. Campillos, I. Letunic, L.J. Jensen, and P. Bork, "A side effect resource to
capture phenotypic effects of drugs," Molecular systems biology, vol. 6, 2010, p. 343.
[135] D.S. Wishart, C. Knox, A.C. Guo, D. Cheng, S. Shrivastava, D. Tzur, B. Gautam, and M.
Hassanali, "DrugBank: a knowledgebase for drugs, drug actions and drug targets,"
Nucleic acids research, vol. 36, 2008, pp. D901-D906.
[136] R. Leaman, G. Gonzalez, and others, "BANNER: an executable survey of advances in
biomedical named entity recognition.," Pacific Symposium on Biocomputing, 2008, pp.
652-663.
[137] H. Wang, Y. Ding, J. Tang, X. Dong, B. He, J. Qiu, and D.J. Wild, "Finding complex
biological relationships in recent PubMed articles using Bio-LDA," PLoS One, vol. 6,
2011, p. e17243.
[138] B. Chen, X. Dong, D. Jiao, H. Wang, Q. Zhu, Y. Ding, and D.J. Wild, "Chem2Bio2RDF:
a semantic framework for linking and data mining chemogenomic and systems chemical
biology data," BMC bioinformatics, vol. 11, 2010, p. 255.
[139] Q.-C. Bui, B.O. Nualláin, C.A. Boucher, and P.M. Sloot, "Extracting causal relations on
HIV drug resistance from literature," BMC bioinformatics, vol. 11, 2010, p. 101.
[140] J. Vondrasek and A. Wlodawer, "HIVdb: a database of the structures of human
immunodeficiency virus protease," Proteins: Structure, Function, and Bioinformatics, vol.
49, 2002, pp. 429-431.
[141] P. Libin, G. Beheydt, K. Deforche, S. Imbrechts, F. Ferreira, K. Van Laethem, K. Theys,
A.P. Carvalho, J. Cavaco-Silva, G. Lapadula, and others, "RegaDB: community-driven
data management and analysis for infectious diseases," Bioinformatics, vol. 29, 2013, pp.
1477-1480.
[142] S. Katrenko and P. Adriaans, "Learning relations from biomedical corpora using
dependency trees," Knowledge Discovery and Emergent Complexity in Bioinformatics,
Springer, 2007, pp. 61-80.
[143] R. Sætre, K. Yoshida, M. Miwa, T. Matsuzaki, Y. Kano, and J. Tsujii, "Extracting protein
interactions from text with the unified AkaneRE event extraction system," IEEE/ACM
Transactions on Computational Biology and Bioinformatics (TCBB), vol. 7, 2010, pp.
442-453.
[144] D. Maglott, J. Ostell, K.D. Pruitt, and T. Tatusova, "Entrez Gene: gene-centered
information at NCBI," Nucleic acids research, vol. 33, 2005, pp. D54-D58.
[145] A. Koike and T. Takagi, "Gene/protein/family name recognition in biomedical literature,"
Proceedings of BioLink 2004 Workshop: Linking Biological Literature, Ontologies and
Databases: Tools for Users, 2004, p. 56.
[146] P. Thomas, M. Neves, I. Solt, D. Tikk, and U. Leser, "Relation extraction for drug-drug
interactions using ensemble learning," Proceedings of DDIExtraction-2011 challenge
task, 2011.
[147] M.F.M. Chowdhury and A. Lavelli, "Drug-drug interaction extraction using composite
kernels," Proceedings of DDIExtraction-2011 challenge task, 2011, pp. 27-33.
[148] M.F.M. Chowdhury and A. Lavelli, "FBK-irst: A multi-phase kernel based approach for
drug-drug interaction detection and classification that exploits linguistic information,"
Proceedings of SemEval 2013, 2013, pp. 351-355.
[149] M.F.M. Chowdhury, A.B. Abacha, A. Lavelli, and P. Zweigenbaum, "Two different
machine learning techniques for drug-drug interaction extraction," Challenge Task on
Drug-Drug Interaction Extraction, 2011, pp. 19-26.
[150] D. Tikk, P. Thomas, P. Palaga, J. Hakenberg, and U. Leser, "A comprehensive
benchmark of kernel methods to extract protein-protein interactions from literature,"
PLoS computational biology, vol. 6, 2010, p. e1000837.
[151] F.M. Chowdhury, A. Lavelli, and A. Moschitti, "A study on dependency tree kernels for
automatic extraction of protein-protein interaction," Proceedings of BioNLP 2011
Workshop, Association for Computational Linguistics, 2011, pp. 124-133.
[152] M.-C. De Marneffe, B. MacCartney, and C.D. Manning, "Generating typed dependency
parses from phrase structure parses," Proceedings of LREC, 2006, pp. 449-454.
[153] E. Charniak and M. Johnson, "Coarse-to-fine n-best parsing and MaxEnt discriminative
reranking," Proceedings of the 43rd Annual Meeting on Association for Computational
Linguistics, Association for Computational Linguistics, 2005, pp. 173-180.
[154] D. McClosky, "Any domain parsing: automatic domain adaptation for natural language
parsing," Brown University, 2010.
[155] M. Miwa and S. Ananiadou, "NaCTeM EventMine for BioNLP 2013 CG and PC tasks,"
Proceedings of BioNLP Shared Task 2013 Workshop, 2013, pp. 94-98.
[156] Y. Miyao, K. Sagae, R. Sætre, T. Matsuzaki, and J. Tsujii, "Evaluating contributions of
natural language parsers to protein-protein interaction extraction," Bioinformatics, vol. 25,
2009, pp. 394-400.
[157] K. Sagae and J. Tsujii, "Dependency Parsing and Domain Adaptation with LR Models
and Parser Ensembles.," EMNLP-CoNLL, 2007, pp. 1044-1050.
[158] O. Bodenreider, "The unified medical language system (UMLS): integrating biomedical
terminology," Nucleic acids research, vol. 32, 2004, pp. D267-D270.
[159] G.A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol.
38, 1995, pp. 39-41.
[160] R. McDonald, F. Pereira, K. Ribarov, and J. Hajic, "Non-projective dependency parsing
using spanning tree algorithms," Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing, Association for
Computational Linguistics, 2005, pp. 523-530.
[161] S. Riedel, H.-W. Chun, T. Takagi, and J. Tsujii, "A markov logic approach to bio-molecular event extraction," Proceedings of the Workshop on Current Trends in
Biomedical Natural Language Processing: Shared Task, Association for Computational
Linguistics, 2009, pp. 41-49.
[162] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C.D. Manning, "Model
combination for event extraction in BioNLP 2011," Proceedings of the BioNLP Shared
Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 51-55.
[163] H. Liu, T. Christiansen, W.A. Baumgartner Jr, and K. Verspoor, "BioLemmatizer: a
lemmatization tool for morphological processing of biomedical text.," J. Biomedical
Semantics, vol. 3, 2012, p. 17.
[164] S. Pyysalo, T. Salakoski, S. Aubin, and A. Nazarenko, "Lexical adaptation of link
grammar to the biomedical sublanguage: a comparative evaluation of three approaches,"
BMC bioinformatics, vol. 7, 2006, p. S2.
[165] D.D. Sleator and D. Temperley, "Parsing English with a link grammar," arXiv preprint
cmp-lg/9508004, 1995.
[166] Y. Huang, H.J. Lowe, D. Klein, and R.J. Cucina, "Improved identification of noun
phrases in clinical radiology reports using a high-performance statistical natural language
parser augmented with the UMLS specialist lexicon," Journal of the American Medical
Informatics Association, vol. 12, 2005, pp. 275-285.
[167] G. Schneider, M. Hess, and P. Merlo, "Hybrid long-distance functional dependency
parsing," PhD, University of Zurich, 2008.
[168] T. Briscoe, J. Carroll, and R. Watson, "The second release of the RASP system,"
Proceedings of the COLING/ACL on Interactive presentation sessions, Association for
Computational Linguistics, 2006, pp. 77-80.
[169] M. Krallinger, M. Vazquez, F. Leitner, D. Salgado, A. Chatr-aryamontri, A. Winter, L.
Perfetto, L. Briganti, L. Licata, M. Iannuccelli, and others, "The Protein-Protein
Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text," BMC bioinformatics, vol. 12, 2011, p. S3.
[170] M. Huang, S. Ding, H. Wang, and X. Zhu, "Mining physical protein-protein interactions
from the literature," Genome Biol, vol. 9, 2008, p. S12.
[171] D. Tikk, I. Solt, P. Thomas, and U. Leser, "A detailed error analysis of 13 kernel methods
for protein-protein interaction extraction," BMC bioinformatics, vol. 14, 2013, p. 12.
[172] C. Giuliano, A. Lavelli, and L. Romano, "Exploiting shallow linguistic information for
relation extraction from biomedical literature.," EACL, 2006, pp. 401-408.
[173] S. Vishwanathan and A.J. Smola, "Fast kernels for string and tree matching," NIPS, 2002,
pp. 569-576.
[174] M. Collins and N. Duffy, "Convolution kernels for natural language," Advances in neural
information processing systems, 2001, pp. 625-632.
[175] A. Moschitti, "Efficient convolution kernels for dependency and constituent syntactic
trees," Machine Learning: ECML 2006, Springer, 2006, pp. 318-329.
[176] T. Kuboyama, K. Hirata, H. Kashima, K.F. Aoki-Kinoshita, and H. Yasuda, "A spectrum
tree kernel," Information and Media Technologies, vol. 2, 2007, pp. 292-299.
[177] G. Erkan, A. Ozgur, and D.R. Radev, "Semi-supervised classification for extracting
protein interaction sentences using dependency parsing.," EMNLP-CoNLL, 2007, pp.
228-237.
[178] S. Kim, J. Yoon, and J. Yang, "Kernel approaches for genic interaction extraction,"
Bioinformatics, vol. 24, 2008, pp. 118-126.
[179] A. Moschitti, "A study on convolution kernels for shallow semantic parsing,"
Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics,
Association for Computational Linguistics, 2004, p. 335.
[180] P. Thomas, M. Neves, T. Rocktäschel, and U. Leser, "WBI-DDI: drug-drug interaction
extraction using majority voting," Second Joint Conference on Lexical and
Computational Semantics (*SEM), 2013, pp. 628-635.
[181] D. Lin, "Dependency-based evaluation of MINIPAR," Treebanks, Springer, 2003, pp.
317-329.
[182] M. Lease and E. Charniak, "Parsing biomedical literature," Natural Language
Processing - IJCNLP 2005, Springer, 2005, pp. 58-69.
[183] Y. Freund and R.E. Schapire, "A decision-theoretic generalization of on-line learning and
an application to boosting," Journal of computer and system sciences, vol. 55, 1997, pp.
119-139.
[184] M. Kay, "Algorithm schemata and data structures in syntactic processing," Technical
Report CSL80-12, 1980.
[185] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K.
Dolinski, S.S. Dwight, J.T. Eppig, and others, "Gene Ontology: tool for the unification of
biology," Nature genetics, vol. 25, 2000, pp. 25-29.
[186] D.A. Lindberg, B.L. Humphreys, A.T. McCray, and others, "The Unified Medical
Language System.," Methods of information in medicine, vol. 32, 1993, p. 281.
[187] National Library of Medicine, "MeSH http://www.ncbi.nlm.nih.gov/mesh."
[188] K.A. Gray, B. Yates, R.L. Seal, M.W. Wright, and E.A. Bruford, "Genenames.org: the
HGNC resources in 2015," Nucleic acids research, 2014, p. gku1071.
[189] K.K. Schuler, "VerbNet: A broad-coverage, comprehensive verb lexicon," University of
Pennsylvania, 2005.
[190] C. Borgelt and M.R. Berthold, "Mining molecular fragments: Finding relevant
substructures of molecules," Proceedings. 2002 IEEE International Conference on Data
Mining, IEEE, 2002, pp. 51-58.
[191] X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining," Proceedings. 2002
IEEE International Conference on Data Mining, IEEE, 2002, pp. 721-724.
[192] J. Huan, W. Wang, and J. Prins, "Efficient mining of frequent subgraphs in the presence
of isomorphism," Data Mining, 2003. ICDM 2003. Third IEEE International Conference
on, IEEE, 2003, pp. 549-552.
[193] A.B. Clegg and A.J. Shepherd, "Syntactic pattern matching with Graph-Spider and MPL,"
The Proceedings of the Third International Symposium on Semantic Mining in
Biomedicine (SMBM2008), Turku, Finland, 2008, pp. 129-132.
[194] Stanford NLP, "Stanford Parser http://nlp.stanford.edu:8080/parser/."
[195] D.M. Bikel, "Design of a multi-lingual, parallel-processing statistical parsing engine,"
Proceedings of the second international conference on Human Language Technology
Research, Morgan Kaufmann Publishers Inc., 2002, pp. 178-182.
[196] L. Rimell and S. Clark, "Porting a lexicalized-grammar parser to the biomedical domain,"
Journal of Biomedical Informatics, vol. 42, 2009, pp. 852-865.
[197] M. Krallinger, A. Morgan, L. Smith, F. Leitner, L. Tanabe, J. Wilbur, L. Hirschman, and
A. Valencia, "Evaluation of text-mining systems for biology: overview of the Second
BioCreative community challenge," Genome Biol, vol. 9, 2008, p. S1.
[198] R. Bunescu, R. Ge, R.J. Kate, E.M. Marcotte, R.J. Mooney, A.K. Ramani, and Y.W.
Wong, "Comparative experiments on learning information extractors for proteins and
their interactions," Artificial intelligence in medicine, vol. 33, 2005, pp. 139-155.
[199] S. Pyysalo, F. Ginter, J. Heimonen, J. Björne, J. Boberg, J. Järvinen, and T. Salakoski,
"BioInfer: a corpus for information extraction in the biomedical domain," BMC
bioinformatics, vol. 8, 2007, p. 50.
[200] K. Fundel, R. Küffner, and R. Zimmer, "RelEx-Relation extraction using dependency
parse trees," Bioinformatics, vol. 23, 2007, pp. 365-371.
[201] J. Ding, D. Berleant, D. Nettleton, and E.S. Wurtele, "Mining MEDLINE: abstracts,
sentences, or phrases?," Pacific Symposium on Biocomputing, World Scientific, 2002, pp.
326-337.
[202] C. Nédellec, "Learning language in logic-genic interaction extraction challenge,"
Proceedings of the 4th Learning Language in Logic Workshop (LLL05), 2005.
[203] K. Nagel, A. Jimeno-Yepes, and D. Rebholz-Schuhmann, "Annotation of protein residues
based on a literature analysis: cross-validation against UniProtKb," BMC bioinformatics,
vol. 10, 2009, p. S4.
[204] E. Buyko and U. Hahn, "Evaluating the impact of alternative dependency graph
encodings on solving event extraction tasks," Proceedings of the 2010 Conference on
Empirical Methods in Natural Language Processing, Association for Computational
Linguistics, 2010, pp. 982-992.
[205] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and E. Marsi,
"MaltParser: A language-independent system for data-driven dependency parsing,"
Natural Language Engineering, vol. 13, 2007, pp. 95-135.
[206] M. Miwa, S. Pyysalo, T. Hara, and J. Tsujii, "A comparative study of syntactic parsers for
event extraction," Proceedings of the 2010 Workshop on Biomedical Natural Language
Processing, Association for Computational Linguistics, 2010, pp. 37-45.
[207] M. Miwa, P. Thompson, and S. Ananiadou, "Boosting automatic event extraction from
the literature using domain adaptation and coreference resolution," Bioinformatics, vol.
28, 2012, pp. 1759-1765.
[208] M. Miwa, S. Pyysalo, T. Ohta, and S. Ananiadou, "Wide coverage biomedical event
extraction using multiple partially overlapping corpora," BMC bioinformatics, vol. 14,
2013, p. 175.
[209] S. Ranu and A.K. Singh, "Graphsig: A scalable approach to mining significant subgraphs
in large graph databases," Data Engineering, 2009. ICDE'09. IEEE 25th International
Conference on, IEEE, 2009, pp. 844-855.
[210] R. Kabiljo, A.B. Clegg, and A.J. Shepherd, "A realistic assessment of methods for
extracting gene/protein interactions from free text," BMC bioinformatics, vol. 10, 2009, p.
233.
[211] A. Robb-Smith, "US National Cancer Institute working formulation of non-Hodgkin's
lymphomas for clinical use," The Lancet, vol. 320, 1982, pp. 432-434.
[212] M. Bennett, G. Farrer-Brown, K. Henry, A. Jelliffe, R. Gerard-Marchant, I. Hamlin, K.
Lennert, F. Rilke, A. Stansfeld, and J. Van Unnik, "Classification of non-Hodgkin's
lymphomas," The Lancet, vol. 304, 1974, pp. 405-408.
[213] R.J. Lukes and R.D. Collins, "Immunologic characterization of human malignant
lymphomas," Cancer, vol. 34, 1974, pp. 1488-1503.
[214] H. Rappaport, Tumors of the Hematopoietic System, Armed Forces Institute of Pathology,
1966.
[215] E.S. Jaffe, N.L. Harris, H. Stein, and J. Vardiman, eds., WHO Classification of Tumours.
Pathology and Genetics of Tumours of Haematopoietic and Lymphoid Tissues, IARC
Press, 2001.
[216] S.H. Swerdlow, E. Campo, N.L. Harris, E.S. Jaffe, S.A. Pileri, H. Stein, J. Thiele, and
J.W. Vardiman, eds., WHO classification of tumours of haematopoietic and lymphoid tissues, IARC
Press, 2008.
[217] J. Turner, A. Hughes, A. Kricker, S. Milliken, A. Grulich, J. Kaldor, and B. Armstrong,
"Use of the WHO lymphoma classification in a population-based epidemiological study,"
Annals of oncology, vol. 15, 2004, pp. 631-637.
[218] C.A. Clarke, S.L. Glaser, R.F. Dorfman, P.M. Bracci, E. Eberle, and E.A. Holly, "Expert
Review of Non-Hodgkin's Lymphomas in a Population-Based Cancer Registry:
Reliability of Diagnosis and Subtype Classifications," Cancer Epidemiology Biomarkers
& Prevention, vol. 13, 2004, pp. 138-143.
[219] M. Snuderl, O.K. Kolman, Y.-B. Chen, J.J. Hsu, A.M. Ackerman, P. Dal Cin, J.A. Ferry,
N.L. Harris, R.P. Hasserjian, L.R. Zukerberg, and others, "B-cell lymphomas with
concurrent IGH-BCL2 and MYC rearrangements are aggressive neoplasms with clinical
and pathologic features distinct from Burkitt lymphoma and diffuse large B-cell
lymphoma," The American journal of surgical pathology, vol. 34, 2010, pp. 327-340.
[220] A.M. Gruver, M.A. Huba, A. Dogan, and E.D. Hsi, "Fibrin-associated Large B-cell
Lymphoma: Part of the Spectrum of Cardiac Lymphomas," The American Journal of
Surgical Pathology, vol. 36, 2012, pp. 1527-1537.
[221] K.J. Savage, N.L. Harris, J.M. Vose, F. Ullrich, E.S. Jaffe, J.M. Connors, L. Rimsza, S.A.
Pileri, M. Chhanabhai, R.D. Gascoyne, and others, "ALK- anaplastic large-cell
lymphoma is clinically and immunophenotypically different from both ALK+ ALCL and
peripheral T-cell lymphoma, not otherwise specified: report from the International
Peripheral T-Cell Lymphoma Project," Blood, vol. 111, 2008, pp. 5496-5504.
[222] E. Hsi, T. Singleton, L. Swinnen, C. Dunphy, and S. Alkan, "Mucosa-associated
lymphoid tissue-type lymphomas occurring in post-transplantation patients," The
American journal of surgical pathology, vol. 24, 2000, pp. 100-106.
[223] J.A. Ferry, A.R. Sohani, J.A. Longtine, R.A. Schwartz, and N.L. Harris, "HHV8-positive,
EBV-positive Hodgkin lymphoma-like large B-cell lymphoma and HHV8-positive
intravascular large B-cell lymphoma," Modern Pathology, vol. 22, 2009, pp. 618-626.
[224] K.P. Liao, T. Cai, V. Gainer, S. Goryachev, Q. Zeng-Treitler, S. Raychaudhuri, P.
Szolovits, S. Churchill, S. Murphy, I. Kohane, and others, "Electronic medical records for
discovery research in rheumatoid arthritis," Arthritis care and research, vol. 62, 2010, pp.
1120-1127.
[225] O. Uzuner, I. Goldstein, Y. Luo, and I. Kohane, "Identifying patient smoking status from
medical discharge records," Journal of the American Medical Informatics Association,
vol. 15, 2008, pp. 14-24.
[226] O. Uzuner, Y. Luo, and P. Szolovits, "Evaluating the state-of-the-art in automatic de-identification," Journal of the American Medical Informatics Association, vol. 14, 2007,
pp. 550-563.
[227] O. Uzuner, "Recognizing obesity and comorbidities in sparse data," Journal of the
American Medical Informatics Association, vol. 16, 2009, pp. 561-570.
[228] A.M. Cohen, "Five-way smoking status classification using text hot-spot identification
and error-correcting output codes," Journal of the American Medical Informatics
Association, vol. 15, 2008, pp. 32-35.
[229] E. Aramaki, T. Imai, K. Miyo, and K. Ohe, "Patient status classification by using rule
based sentence extraction and BM25 kNN-based classifier," i2b2 Workshop on
Challenges in Natural Language Processing for Clinical Data, 2006.
[230] C. Clark, K. Good, L. Jezierny, M. Macpherson, B. Wilson, and U. Chajewska,
"Identifying smokers with a medical extraction system," Journal of the American Medical
Informatics Association, vol. 15, 2008, pp. 36-39.
[231] I. Solt, D. Tikk, V. Gál, and Z.T. Kardkovács, "Semantic classification of diseases in
discharge summaries using a context-aware rule-based classifier," Journal of the
American Medical Informatics Association, vol. 16, 2009, pp. 580-584.
[232] R. Farkas, G. Szarvas, I. Hegedűs, A. Almási, V. Vincze, R. Ormándi, and R. Busa-Fekete, "Semi-automated construction of decision rules to predict morbidities from
clinical texts," Journal of the American Medical Informatics Association, vol. 16, 2009,
pp. 601-605.
[233] L.C. Childs, R. Enelow, L. Simonsen, N.H. Heintzelman, K.M. Kowalski, and R.J. Taylor,
"Description of a rule-based system for the i2b2 challenge in natural language processing
for clinical data," Journal of the American Medical Informatics Association, vol. 16, 2009,
pp. 571-575.
[234] H. Ware, C.J. Mullett, and V. Jagannathan, "Natural language processing framework to
assess clinical conditions," Journal of the American Medical Informatics Association, vol.
16, 2009, pp. 585-589.
[235] O. Uzuner, J. Mailoa, R. Ryan, and T. Sibanda, "Semantic relations for problem-oriented
medical records," Artificial Intelligence in Medicine, vol. 50, 2010, pp. 63-73.
[236] T. Sibanda, T. He, P. Szolovits, and O. Uzuner, "Syntactically-informed semantic
category recognizer for discharge summaries," AMIA annual symposium proceedings,
American Medical Informatics Association, 2006, pp. 714-718.
[237] D. Albright, A. Lanfranchi, A. Fredriksen, W.F. Styler, C. Warner, J.D. Hwang, J.D. Choi,
D. Dligach, R.D. Nielsen, J. Martin, and others, "Towards comprehensive syntactic and
semantic annotations of the clinical narrative," Journal of the American Medical
Informatics Association, vol. 20, 2013, pp. 922-930.
[238] Partners Healthcare, "Research Patient Data Registry (RPDR) http://rc.partners.org/rpdr."
[239] L.G. Shaffer and N. Tommerup, ISCN 2013: an international system for human
cytogenetic nomenclature (2013): recommendations of the International Standing
Committee on Human Cytogenetic Nomenclature, Karger, 2013.
[240] Apache OpenNLP project team, "Apache OpenNLP http://opennlp.apache.org/," Apr.
2013.
[241] B. Santorini, "Part-of-speech tagging guidelines for the Penn Treebank Project (3rd
revision)," 1990.
[242] International Health Terminology Standards Development Organisation, "SNOMED CT
http://www.ihtsdo.org/snomed-ct/."
[243] Y. Chen, H. Gu, Y. Perl, M. Halper, and J. Xu, "Expanding the extent of a UMLS
semantic type via group neighborhood auditing," Journal of the American Medical
Informatics Association, vol. 16, 2009, pp. 746-757.
[244] AbiWord, "Link Parser http://www.abisource.com/projects/link-grammar/."
[245] J.D. Choi and M. Palmer, "Getting the Most out of Transition-based Dependency
Parsing.," ACL (Short Papers), 2011, pp. 687-692.
[246] M.-C. De Marneffe and C.D. Manning, "Stanford typed dependencies manual," 2008.
[247] Y. Chi, R.R. Muntz, S. Nijssen, and J.N. Kok, "Frequent subtree mining - an overview,"
Fundamenta Informaticae, vol. 66, 2005, pp. 161-198.
[248] C. Jiang, F. Coenen, and M. Zito, "A Survey of Frequent Subgraph Mining Algorithms,"
Knowledge Engineering Review, vol. 28, 2013, pp. 75-105.
[249] I. Goldstein and O. Uzuner, "Specializing for predicting obesity and its co-morbidities,"
Journal of biomedical informatics, vol. 42, 2009, pp. 873-886.
[250] W. Long, "Extracting diagnoses from discharge summaries," AMIA annual symposium
proceedings, American Medical Informatics Association, 2005, pp. 470-474.
[251] W.B. Cavnar and J.M. Trenkle, "N-Gram-Based Text Categorization," Proceedings of
SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval,
1994, pp. 161-175.
[252] R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval, 1999.
[253] E.W. Noreen, Computer-intensive methods for testing hypotheses: an introduction, Wiley,
1989.
[254] Z. Fan, Y. Natkunam, E. Bair, R. Tibshirani, and R.A. Warnke, "Characterization of
variant patterns of nodular lymphocyte predominant Hodgkin lymphoma with
immunohistologic and clinical correlation," The American journal of surgical pathology,
vol. 27, 2003, pp. 1346-1356.
[255] A.R. Sohani, E.S. Jaffe, N.L. Harris, J.A. Ferry, S. Pittaluga, and R.P. Hasserjian,
"Nodular lymphocyte-predominant Hodgkin lymphoma with atypical T cells: a
morphologic variant mimicking peripheral T-cell lymphoma," The American journal of
surgical pathology, vol. 35, 2011, pp. 1666-1678.
[256] A. Rahemtullah, K.K. Reichard, F.I. Preffer, N.L. Harris, and R.P. Hasserjian, "A double-positive CD4+ CD8+ T-cell population is commonly found in nodular lymphocyte
predominant Hodgkin lymphoma," American journal of clinical pathology, vol. 126, 2006,
pp. 805-814.
[257] R.L. Winslow, N. Trayanova, D. Geman, and M.I. Miller, "Computational medicine:
translating models to clinical care," Science translational medicine, vol. 4, 2012, p.
158rv11.
[258] M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C. Aguiar, M. Gaasenbeek,
M. Angelo, M. Reich, G.S. Pinkus, and others, "Diffuse large B-cell lymphoma outcome
prediction by gene-expression profiling and supervised machine learning," Nature
medicine, vol. 8, 2002, pp. 68-74.
[259] J.Y. Irwin, H. Harkema, L.M. Christensen, T. Schleyer, P.J. Haug, and W.W. Chapman,
"Methodology to develop and evaluate a semantic representation for NLP," AMIA Annual
Symposium Proceedings, American Medical Informatics Association, 2009, p. 271.
[260] M.M. Gordon, A.M. Moser, and E. Rubin, "Unsupervised Analysis of Classical
Biomedical Markers: Robustness and Medical Relevance of Patient Clustering Using
Bioinformatics Tools," PloS one, vol. 7, 2012, p. e29578.
[261] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, "Cluster analysis and display of
genome-wide expression patterns," Proceedings of the National Academy of Sciences, vol.
95, 1998, pp. 14863-14868.
[262] T.A. Lasko, J.C. Denny, and M.A. Levy, "Computational Phenotype Discovery Using
Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data," PloS
one, vol. 8, 2013, p. e66341.
[263] G.N. Noren, J. Hopstadius, A. Bate, K. Star, and I.R. Edwards, "Temporal pattern
discovery in longitudinal electronic patient records," Data Mining and Knowledge
Discovery, vol. 20, 2010, pp. 361-387.
[264] D.D. Lee and H.S. Seung, "Learning the parts of objects by non-negative matrix
factorization," Nature, vol. 401, 1999, pp. 788-791.
[265] M. Hofree, J.P. Shen, H. Carter, A. Gross, and T. Ideker, "Network-based stratification of
tumor mutations," Nature methods, 2013.
[266] F.-J. Müller, L.C. Laurent, D. Kostka, I. Ulitsky, R. Williams, C. Lu, I.-H. Park, M.S. Rao,
R. Shamir, P.H. Schwartz, and others, "Regulatory networks define phenotypic classes of
human stem cell lines," Nature, vol. 455, 2008, pp. 401-405.
[267] E.A. Collisson, A. Sadanandam, P. Olson, W.J. Gibb, M. Truitt, S. Gu, J. Cooc, J.
Weinkle, G.E. Kim, L. Jakkula, and others, "Subtypes of pancreatic ductal
adenocarcinoma and their differing responses to therapy," Nature medicine, vol. 17, 2011,
pp. 500-503.
[268] F. Wang, N. Lee, J. Hu, J. Sun, and S. Ebadollahi, "Towards heterogeneous temporal
clinical event pattern discovery: a convolutional approach," Proceedings of the 18th ACM
SIGKDD international conference on Knowledge discovery and data mining, ACM, 2012,
pp. 453-461.
[269] H. Kim and H. Park, "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis," Bioinformatics, vol. 23,
2007, pp. 1495-1502.
[270] J.-P. Brunet, P. Tamayo, T.R. Golub, and J.P. Mesirov, "Metagenes and molecular pattern
discovery using matrix factorization," Proceedings of the National Academy of Sciences,
vol. 101, 2004, pp. 4164-4169.
[271] Y. Gao and G. Church, "Improving molecular cancer class discovery through sparse
non-negative matrix factorization," Bioinformatics, vol. 21, 2005, pp. 3970-3975.
[272] S. Nik-Zainal, D.C. Wedge, L.B. Alexandrov, M. Petljak, A.P. Butler, N. Bolli, H.R.
Davies, S. Knappskog, S. Martin, E. Papaemmanuil, and others, "Association of a
germline copy number polymorphism of APOBEC3A and APOBEC3B with burden of
putative APOBEC-dependent mutations in breast cancer," Nature genetics, 2014.
[273] L.B. Alexandrov, S. Nik-Zainal, D.C. Wedge, S.A. Aparicio, S. Behjati, A.V. Biankin,
G.R. Bignell, N. Bolli, A. Borg, A.-L. Børresen-Dale, and others, "Signatures of
mutational processes in human cancer," Nature, 2013.
[274] L.R. Tucker, "Some mathematical notes on three-mode factor analysis," Psychometrika,
vol. 31, 1966, pp. 279-311.
[275] J. Sun, D. Tao, S. Papadimitriou, P.S. Yu, and C. Faloutsos, "Incremental tensor analysis:
Theory and applications," ACM Transactions on Knowledge Discovery from Data
(TKDD), vol. 2, 2008, p. 11.
[276] R.A. Harshman and M.E. Lundy, "Uniqueness proof for a family of models sharing
features of Tucker's three-mode factor analysis and PARAFAC/CANDECOMP,"
Psychometrika, vol. 61, 1996, pp. 133-154.
[277] L. Omberg, G.H. Golub, and O. Alter, "A tensor higher-order singular value
decomposition for integrative analysis of DNA microarray data from different studies,"
Proceedings of the National Academy of Sciences, vol. 104, 2007, pp. 18371-18376.
[278] L. Omberg, J.R. Meyerson, K. Kobayashi, L.S. Drury, J.F. Diffley, and O. Alter, "Global
effects of DNA replication and DNA replication origin activity on eukaryotic gene
expression," Molecular systems biology, vol. 5, 2009.
[279] C. Ozcaglar, A. Shabbeer, S. Vandenberg, B. Yener, and K.P. Bennett, "Sublineage
structure analysis of Mycobacterium tuberculosis complex strains using multiple-biomarker tensors," BMC genomics, vol. 12, 2011, p. S1.
[280] B. Yener, E. Acar, P. Aguis, K. Bennett, S. Vandenberg, and G. Plopper, "Multiway
modeling and analysis in stem cell systems biology," BMC Systems Biology, vol. 2, 2008,
p. 63.
[281] B.W. Bader, A.A. Puretskiy, and M.W. Berry, "Scenario discovery using nonnegative
tensor factorization," Progress in Pattern Recognition, Image Analysis and Applications,
Springer, 2008, pp. 791-805.
[282] M.W. Berry and M. Browne, "Email surveillance using non-negative matrix factorization,"
Computational & Mathematical Organization Theory, vol. 11, 2005, pp. 249-264.
[283] F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons, "Document clustering using
nonnegative matrix factorization," Information Processing & Management, vol. 42, 2006,
pp. 373-386.
[284] B.W. Bader, M.W. Berry, and M. Browne, "Discussion tracking in Enron email using
PARAFAC," Survey of Text Mining II, Springer, 2008, pp. 147-163.
[285] T.G. Kolda and B.W. Bader, "Tensor decompositions and applications," SIAM review, vol.
51, 2009, pp. 455-500.
[286] Y. Xu and W. Yin, "A block coordinate descent method for regularized multiconvex
optimization with applications to nonnegative tensor factorization and completion," SIAM
Journal on Imaging Sciences, vol. 6, 2013, pp. 1758-1789.
[287] C.D. Manning and H. Schütze, Foundations of statistical natural language processing,
MIT press, 1999.
[288] C.H. Ding, X. He, and H.D. Simon, "On the Equivalence of Nonnegative Matrix
Factorization and Spectral Clustering.," SDM, SIAM, 2005, pp. 606-610.
[289] C.D. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval,
Cambridge University Press, 2008.
[290] J. Liu, J. Liu, P. Wonka, and J. Ye, "Sparse non-negative tensor factorization using
columnwise coordinate descent," Pattern Recognition, vol. 45, 2012, pp. 649-656.
[291] T.L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National
Academy of Sciences of the United States of America, vol. 101, 2004, pp. 5228-5235.
[292] T.L. Griffiths and Z. Ghahramani, "The Indian buffet process: An introduction and
review," The Journal of Machine Learning Research, vol. 12, 2011, pp. 1185-1224.
[293] N. McIntosh, "Intensive care monitoring: past, present and future," Clinical medicine, vol.
2, 2002, pp. 349-355.
[294] W. Zong, G. Moody, and R. Mark, "Reduction of false arterial blood pressure alarms
using signal quality assessment and relationships between the electrocardiogram and
arterial blood pressure," Medical and Biological Engineering and Computing, vol. 42,
2004, pp. 698-706.
[295] G. Martin, "State-of-the-art fluid management in critically ill patients," Current Opinion
in Critical Care, vol. 20, 2014, p. 359.
[296] M. Saeed, M. Villarroel, A.T. Reisner, G. Clifford, L.-W. Lehman, G. Moody, T. Heldt,
T.H. Kyaw, B. Moody, and R.G. Mark, "Multiparameter Intelligent Monitoring in
Intensive Care II (MIMIC-II): a public-access intensive care unit database," Critical care
medicine, vol. 39, 2011, p. 952.
[297] K.B. Kshetri, "Modelling patient states in intensive care patients," Massachusetts Institute
of Technology, 2011.
[298] Z. Syed and J.V. Guttag, "Unsupervised Similarity-Based Risk Stratification for
Cardiovascular Events Using Long-Term Time-Series Data.," Journal of Machine
Learning Research, vol. 12, 2011, pp. 999-1024.
[299] J. Lin, E. Keogh, L. Wei, and S. Lonardi, "Experiencing SAX: a novel symbolic
representation of time series," Data Mining and knowledge discovery, vol. 15, 2007, pp.
107-144.
[300] J. Huan, W. Wang, J. Prins, and J. Yang, "Spin: mining maximal frequent subgraphs from
graph databases," Proceedings of the tenth ACM SIGKDD international conference on
Knowledge discovery and data mining, ACM, 2004, pp. 581-586.
[301] E. Tjioe, M.W. Berry, and R. Homayouni, "Discovering gene functional relationships
using FAUN (Feature Annotation Using Nonnegative matrix factorization)," BMC
bioinformatics, vol. 11, 2010, p. S14.
[302] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural
computation, vol. 19, 2007, pp. 2756-2779.
[303] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, and others, "Scikit-learn: Machine learning in
Python," The Journal of Machine Learning Research, vol. 12, 2011, pp. 2825-2830.
[304] C. Boutsidis and E. Gallopoulos, "SVD based initialization: A head start for nonnegative
matrix factorization," Pattern Recognition, vol. 41, 2008, pp. 1350-1362.
[305] P.O. Hoyer, "Non-negative matrix factorization with sparseness constraints," The Journal
of Machine Learning Research, vol. 5, 2004, pp. 1457-1469.
[306] C. Hug, "Detecting hazardous intensive care patient episodes using real-time mortality
models," Massachusetts Institute of Technology, 2009.
[307] Autism and Developmental Disabilities Monitoring Network Surveillance Year 2010
Principal Investigators, "Prevalence of autism spectrum disorder among children aged 8
years-autism and developmental disabilities monitoring network, 11 sites, United States,
2010.," Morbidity and mortality weekly report. Surveillance summaries, vol. 63, 2014.
[308] C.P. Johnson, S.M. Myers, and the Council on Children With Disabilities, "Identification
and evaluation of children with autism spectrum disorders," Pediatrics, vol. 120, 2007, pp.
1183-1215.
[309] A. Bailey, A. Le Couteur, I. Gottesman, P. Bolton, E. Simonoff, E. Yuzda, and M. Rutter,
"Autism as a strongly genetic disorder: evidence from a British twin study,"
Psychological medicine, vol. 25, 1995, pp. 63-77.
[310] S. Steffenburg, C. Gillberg, L. Hellgren, L. Andersson, I.C. Gillberg, G. Jakobsson, and
M. Bohman, "A twin study of autism in Denmark, Finland, Iceland, Norway and Sweden,"
Journal of Child Psychology and Psychiatry, vol. 30, 1989, pp. 405-416.
[311] S. Folstein and M. Rutter, "Infantile autism: a genetic study of 21 twin pairs," Journal of
Child Psychology and Psychiatry, vol. 18, 1977, pp. 297-321.
[312] S.R. Gilman, I. Iossifov, D. Levy, M. Ronemus, M. Wigler, and D. Vitkup, "Rare de novo
variants associated with autism implicate a large functional network of genes involved in
formation and function of synapses," Neuron, vol. 70, 2011, pp. 898-907.
[313] D. Levy, M. Ronemus, B. Yamrom, Y. Lee, A. Leotta, J. Kendall, S. Marks, B. Lakshmi,
D. Pai, K. Ye, and others, "Rare de novo and transmitted copy-number variation in
autistic spectrum disorders," Neuron, vol. 70, 2011, pp. 886-897.
[314] S.J. Sanders, A.G. Ercan-Sencicek, V. Hus, R. Luo, M.T. Murtha, D. Moreno-De-Luca,
S.H. Chu, M.P. Moreau, A.R. Gupta, S.A. Thomson, and others, "Multiple recurrent de
novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are
strongly associated with autism," Neuron, vol. 70, 2011, pp. 863-885.
[315] Y. Sakai, C.A. Shaw, B.C. Dawson, D.V. Dugas, Z. Al-Mohtaseb, D.E. Hill, and H.Y.
Zoghbi, "Protein interactome reveals converging molecular pathways among autism
disorders," Science translational medicine, vol. 3, 2011, p. 86ra49.
[316] L.A. Weiss, D.E. Arking, M.J. Daly, A. Chakravarti, C.W. Brune, K. West, A. O'Connor,
G. Hilton, R.L. Tomlinson, A.B. West, and others, "A genome-wide linkage and
association scan reveals novel loci for autism," Nature, vol. 461, 2009, pp. 802-808.
[317] K. Wang, H. Zhang, D. Ma, M. Bucan, J.T. Glessner, B.S. Abrahams, D. Salyakina, M.
Imielinski, J.P. Bradfield, P.M. Sleiman, and others, "Common genetic variants on 5p14.1
associate with autism spectrum disorders," Nature, vol. 459, 2009, pp. 528-533.
[318] M.H. Chahrour, T.W. Yu, E.T. Lim, B. Ataman, M.E. Coulter, R.S. Hill, C.R.
Stevens, C.R. Schubert, M.E. Greenberg, S.B. Gabriel, and others, "Whole-exome
sequencing and homozygosity analysis implicate depolarization-regulated neuronal genes
in autism," PLoS genetics, vol. 8, 2012, p. e1002635.
[319] B.J. O'Roak, L. Vives, W. Fu, J.D. Egertson, I.B. Stanaway, I.G. Phelps, G. Carvill, A.
Kumar, C. Lee, K. Ankenman, and others, "Multiplex targeted sequencing identifies
recurrently mutated genes in autism spectrum disorders," Science, vol. 338, 2012, pp.
1619-1622.
[320] T.N. Turner, K. Sharma, E.C. Oh, Y.P. Liu, R.L. Collins, M.X. Sosa, D.R. Auer, H.
Brand, S.J. Sanders, D. Moreno-De-Luca, and others, "Loss of δ-catenin function in
severe autism," Nature, vol. 520, 2015, pp. 51-56.
[321] I.S. Kohane, A. McMurry, G. Weber, D. MacFadden, L. Rappaport, L. Kunkel, J. Bickel,
N. Wattanasin, S. Spence, S. Murphy, and others, "The co-morbidity burden of children
and young adults with autism spectrum disorders," PloS one, vol. 7, 2012, p. e33224.
[322] I. Voineagu, X. Wang, P. Johnston, J.K. Lowe, Y. Tian, S. Horvath, J. Mill, R.M. Cantor,
B.J. Blencowe, and D.H. Geschwind, "Transcriptomic analysis of autistic brain reveals
convergent molecular pathology," Nature, vol. 474, 2011, pp. 380-384.
[323] M.W. State, P. Levitt, and others, "The conundrums of understanding genetic risks for
autism spectrum disorders," Nature neuroscience, vol. 14, 2011, pp. 1499-1506.
[324] R. Toro, M. Konyukh, R. Delorme, C. Leblond, P. Chaste, F. Fauchereau, M. Coleman,
M. Leboyer, C. Gillberg, and T. Bourgeron, "Key role for gene dosage and synaptic
homeostasis in autism spectrum disorders," Trends in genetics, vol. 26, 2010, pp. 363-372.
[325] T. Bourgeron, "A synaptic trek to autism," Current opinion in neurobiology, vol. 19,
2009, pp. 231-234.
[326] I.S. Kohane, "An autism case history to review the systematic analysis of large-scale data
to refine the diagnosis and treatment of neuropsychiatric disorders," Biological psychiatry,
vol. 77, 2015, pp. 59-65.
[327] F. Doshi-Velez, Y. Ge, and I. Kohane, "Comorbidity clusters in autism spectrum
disorders: an electronic health record time-series analysis," Pediatrics, vol. 133, 2014, pp.
e54-e63.
[328] K.K. Ausderau, M. Furlong, J. Sideris, J. Bulluck, L.M. Little, L.R. Watson, B.A. Boyd,
A. Belger, V.A. Dickie, and G.T. Baranek, "Sensory subtypes in children with autism
spectrum disorder: Latent profile transition analysis using a national survey of sensory
features," Journal of Child Psychology and Psychiatry, vol. 55, 2014, pp. 935-944.
[329] I. Rapin, M.A. Dunn, D.A. Allen, M.C. Stevens, and D. Fein, "Subtypes of language
disorders in school-age children with autism," Developmental Neuropsychology, vol. 34,
2009, pp. 66-84.
[330] F. Hormozdiari, O. Penn, E. Borenstein, and E.E. Eichler, "The discovery of integrated
gene networks for autism and related disorders," Genome research, vol. 25, 2015, pp.
142-154.
[331] C.J. McDougle, S.M. Landino, A. Vahabzadeh, J. O'Rourke, N.R. Zurcher, B.C. Finger,
M.L. Palumbo, J. Helt, J.E. Mullett, J.M. Hooker, and others, "Toward an immune-mediated
subtype of autism spectrum disorder," Brain research, 2014.
[332] E.Y. Hsiao, "Immune dysregulation in autism spectrum disorder," Int Rev Neurobiol, vol.
113, 2013, pp. 269-302.
[333] M. Michel, M.J. Schmidt, and K. Mirnics, "Immune system gene dysregulation in autism
and schizophrenia," Developmental neurobiology, vol. 72, 2012, pp. 1277-1287.
[334] N. Krumm, B.J. O'Roak, J. Shendure, and E.E. Eichler, "A de novo convergence of
autism genetics and molecular neuroscience," Trends in neurosciences, vol. 37, 2014, pp.
95-105.
[335] E. Ben-David and S. Shifman, "Combined analysis of exome sequencing points toward a
major role for transcription regulation during brain development in autism," Molecular
psychiatry, vol. 18, 2013, pp. 1054-1056.
[336] W.F. Hu, M.H. Chahrour, and C.A. Walsh, "The diverse genetic landscape of
neurodevelopmental disorders," Annual review of genomics and human genetics, vol. 15,
2014, pp. 195-213.
[337] M.E. Talkowski, J.A. Rosenfeld, I. Blumenthal, V. Pillalamarri, C. Chiang, A. Heilbut, C.
Ernst, C. Hanscom, E. Rossin, A.M. Lindgren, and others, "Sequencing chromosomal
abnormalities reveals neurodevelopmental loci that confer risk across diagnostic
boundaries," Cell, vol. 149, 2012, pp. 525-537.
[338] A.J. Willsey, S.J. Sanders, M. Li, S. Dong, A.T. Tebbenkamp, R.A. Muhle, S.K. Reilly, L.
Lin, S. Fertuzinhos, J.A. Miller, and others, "Coexpression networks implicate human
midfetal deep cortical projection neurons in the pathogenesis of autism," Cell, vol. 155,
2013, pp. 997-1007.
[339] T. Yuan, Y. Jiao, S. de Jong, R.A. Ophoff, S. Beck, and A.E. Teschendorff, "An
integrative multi-scale analysis of the dynamic DNA methylation landscape in aging,"
PLoS genetics, vol. 11, 2015, pp. e1004996-e1004996.
[340] D. Robinson, E.M. Van Allen, Y.-M. Wu, N. Schultz, R.J. Lonigro, J.-M. Mosquera, B.
Montgomery, M.-E. Taplin, C.C. Pritchard, G. Attard, and others, "Integrative Clinical
Genomics of Advanced Prostate Cancer," Cell, vol. 161, 2015, pp. 1215-1228.
[341] E. López-Knowles, P.M. Wilkerson, R. Ribas, H. Anderson, A. Mackay, Z. Ghazoui, A.
Rani, P. Osin, A. Nerurkar, L. Renshaw, and others, "Integrative analyses identify
modulators of response to neoadjuvant aromatase inhibitors in patients with early breast
cancer," Breast Cancer Research, vol. 17, 2015, p. 35.
[342] The Cancer Genome Atlas Research Network., "Comprehensive, Integrative Genomic
Analysis of Diffuse Lower-Grade Gliomas," New England Journal of Medicine, vol. 372,
2015, pp. 2481-2498.
[343] The Cancer Genome Atlas Research Network., "Genomic Classification of Cutaneous
Melanoma," Cell, vol. 161, 2015, pp. 1681-1696.
[344] M. Melé, P.G. Ferreira, F. Reverter, D.S. DeLuca, J. Monlong, M. Sammeth, T.R. Young,
J.M. Goldmann, D.D. Pervouchine, T.J. Sullivan, and others, "The human transcriptome
across tissues and individuals," Science, vol. 348, 2015, pp. 660-665.
[345] "BrainSpan: Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards
1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.," 2011.
[346] B.S. Everitt, The Cambridge Dictionary of Statistics, Cambridge University Press, 2006.
[347] BrainSpan, Transcriptome profiling by RNA sequencing and exon microarray, Allen
Institute, 2013.
[348] G. Csardi, "Package igraph," 2010.
[349] GATK team,
"https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitutegatktoolswalkerscoverageCallableLoci.php."
[350] N. Krumm, T.N. Turner, C. Baker, L. Vives, K. Mohajeri, K. Witherspoon, A. Raja, B.P.
Coe, H.A. Stessman, Z.-X. He, and others, "Excess of rare, inherited truncating mutations
in autism," Nature genetics, vol. 47, 2015, pp. 582-588.
[351] M.A. DePristo, E. Banks, R. Poplin, K.V. Garimella, J.R. Maguire, C. Hartl, A.A.
Philippakis, G. del Angel, M.A. Rivas, M. Hanna, and others, "A framework for variation
discovery and genotyping using next-generation DNA sequencing data," Nature genetics,
vol. 43, 2011, pp. 491-498.
[352] H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM,"
arXiv preprint arXiv:1303.3997, 2013.
[353] The Picard team, "The Picard toolkit http://picard.sourceforge.net/," 2014.
[354] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis,
R. Durbin, and others, "The sequence alignment/map format and SAMtools,"
Bioinformatics, vol. 25, 2009, pp. 2078-2079.
[355] K. Wang, M. Li, and H. Hakonarson, "ANNOVAR: functional annotation of genetic
variants from high-throughput sequencing data," Nucleic acids research, vol. 38, 2010, pp.
e164-e164.
[356] K.D. Pruitt, G.R. Brown, S.M. Hiatt, F. Thibaud-Nissen, A. Astashyn, O. Ermolaeva,
C.M. Farrell, J. Hart, M.J. Landrum, K.M. McGarvey, and others, "RefSeq: an update on
mammalian reference sequences," Nucleic acids research, vol. 42, 2014, pp. D756-D763.
[357] K.R. Rosenbloom, J. Armstrong, G.P. Barber, J. Casper, H. Clawson, M. Diekhans, T.R.
Dreszer, P.A. Fujita, L. Guruvadoo, M. Haeussler, and others, "The UCSC genome
browser database: 2015 update," Nucleic acids research, vol. 43, 2015, pp. D670-D681.
[358] J. Harrow, A. Frankish, J.M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, B.L.
Aken, D. Barrell, A. Zadissa, S. Searle, and others, "GENCODE: the reference human
genome annotation for The ENCODE Project," Genome research, vol. 22, 2012, pp.
1760-1774.
[359] I. Adzhubei, D.M. Jordan, and S.R. Sunyaev, "Predicting functional effect of human
missense mutations using PolyPhen-2," Current protocols in human genetics, 2013, pp.
7-20.
[360] P. Kumar, S. Henikoff, and P.C. Ng, "Predicting the effects of coding non-synonymous
variants on protein function using the SIFT algorithm," Nature protocols, vol. 4, 2009, pp.
1073-1081.
[361] J.M. Schwarz, D.N. Cooper, M. Schuelke, and D. Seelow, "MutationTaster2: mutation
prediction for the deep-sequencing age," Nature methods, vol. 11, 2014, pp. 361-362.
[362] B. Reva, Y. Antipin, and C. Sander, "Predicting the functional impact of protein
mutations: application to cancer genomics," Nucleic acids research, 2011, p. gkr407.
[363] M. Kircher, D.M. Witten, P. Jain, B.J. O'Roak, G.M. Cooper, and J. Shendure, "A
general framework for estimating the relative pathogenicity of human genetic variants,"
Nature genetics, vol. 46, 2014, pp. 310-315.
[364] S. Chun and J.C. Fay, "Identification of deleterious mutations within three human
genomes," Genome research, vol. 19, 2009, pp. 1553-1561.
[365] H. Carter, C. Douville, P.D. Stenson, D.N. Cooper, and R. Karchin, "Identifying
Mendelian disease genes with the variant effect scoring tool," BMC genomics, vol. 14,
2013, p. S3.
[366] G.M. Cooper, E.A. Stone, G. Asimenos, E.D. Green, S. Batzoglou, and A. Sidow,
"Distribution and intensity of constraint in mammalian genomic sequence," Genome
research, vol. 15, 2005, pp. 901-913.
[367] M. Garber, M. Guttman, M. Clamp, M.C. Zody, N. Friedman, and X. Xie, "Identifying
novel constrained elements by exploiting biased substitution patterns," Bioinformatics,
vol. 25, 2009, pp. i54-i62.
[368] E.V. Davydov, D.L. Goode, M. Sirota, G.M. Cooper, A. Sidow, and S. Batzoglou,
"Identifying a high fraction of the human genome to be under selective constraint using
GERP++," PLoS Comput Biol, vol. 6, 2010, p. e1001025.
[369] 1000 Genomes Project Consortium, "An integrated map of genetic variation from 1,092
human genomes," Nature, vol. 491, 2012, pp. 56-65.
[370] Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA,
"http://evs.gs.washington.edu/EVS/," Sep. 2014.
[371] Exome Aggregation Consortium (ExAC), "ExAC Summary Data
http://exac.broadinstitute.org," Apr. 2015.
[372] P.D. Stenson, E.V. Ball, M. Mort, A.D. Phillips, K. Shaw, and D.N. Cooper, "The Human
Gene Mutation Database (HGMD) and its exploitation in the fields of personalized
genomics and molecular evolution," Current protocols in bioinformatics, 2012, pp. 1-13.
[373] M. Lawrence, W. Huber, H. Pages, P. Aboyoun, M. Carlson, R. Gentleman, M.T. Morgan,
and V.J. Carey, "Software for computing and annotating genomic ranges," PLoS
computational biology, vol. 9, 2013, p. e1003118.
[374] B. Neale, M. Ferreira, and S. Medland, Statistical Genetics, Taylor & Francis Group,
2012.
[375] R.A. Fisher, Statistical methods for research workers, Genesis Publishing Pvt Ltd, 1925.
[376] O.J. Dunn, "Multiple comparisons among means," Journal of the American Statistical
Association, vol. 56, 1961, pp. 52-64.
[377] J.S. Amberger, C.A. Bocchini, F. Schiettecatte, A.F. Scott, and A. Hamosh, "OMIM.org:
Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and
genetic disorders," Nucleic acids research, vol. 43, 2015, pp. D789-D798.
[378] Cross-Disorder Group of the Psychiatric Genomics Consortium, "Genetic relationship
between five psychiatric disorders estimated from genome-wide SNPs," Nature genetics,
vol. 45, 2013, pp. 984-994.
[379] S.N. Murphy, G. Weber, M. Mendis, V. Gainer, H.C. Chueh, S. Churchill, and I. Kohane,
"Serving the enterprise and beyond with informatics for integrating biology and the
bedside (i2b2)," Journal of the American Medical Informatics Association, vol. 17, 2010,
pp. 124-130.
[380] I.S. Kohane, S.E. Churchill, and S.N. Murphy, "A translational engine at the national
scale: informatics for integrating biology and the bedside," Journal of the American
Medical Informatics Association, vol. 19, 2012, pp. 181-185.
[381] A.S. Weitlauf, M.L. McPheeters, B. Peters, N. Sathe, R. Travis, R. Aiello, E. Williamson,
J. Veenstra-VanderWeele, S. Krishnaswami, R. Jerome, and others, "Therapies for
Children With Autism Spectrum Disorder," 2014.
[382] S.A. Brigandi, H. Shao, S.Y. Qian, Y. Shen, B.-L. Wu, and J.X. Kang, "Autistic Children
Exhibit Decreased Levels of Essential Fatty Acids in Red Blood Cells," International
journal of molecular sciences, vol. 16, 2015, pp. 10061-10076.
[383] J. Gordon Bell, D. Miller, D.J. MacDonald, E.E. MacKinlay, J.R. Dick, S. Cheseldine,
R.M. Boyle, C. Graham, and A.E. O'Hare, "The fatty acid compositions of erythrocyte
and plasma polar lipids in children with autism, developmental delay or typically
developing controls and the effect of fish oil intake," British journal of nutrition, vol. 103,
2010, pp. 1160-1167.
[384] M. Wiest, J. German, D. Harvey, S. Watkins, and I. Hertz-Picciotto, "Plasma fatty acid
profiles in autism: a case-control study," Prostaglandins, Leukotrienes and Essential
Fatty Acids, vol. 80, 2009, pp. 221-227.
[385] S. Vancassel, G. Durand, C. Barthelemy, B. Lejeune, J. Martineau, D. Guilloteau, C.
Andres, and S. Chalon, "Plasma fatty acid levels in autistic children," Prostaglandins,
Leukotrienes and Essential Fatty Acids, vol. 65, 2001, pp. 1-7.
[386] C. Betsholtz, "Lipid transport and human brain development," Nat Genet, vol. 47, 2015,
pp. 699-701.
[387] M. Aureli, S. Grassi, S. Prioni, S. Sonnino, and A. Prinetti, "Lipid membrane domains in
the brain," Biochimica et Biophysica Acta (BBA)-Molecular and Cell Biology of Lipids,
vol. 1851, 2015, pp. 1006-1016.
[388] A. Guemez-Gamboa, L.N. Nguyen, H. Yang, M.S. Zaki, M. Kara, T. Ben-Omran, N.
Akizu, R.O. Rosti, B. Rosti, E. Scott, and others, "Inactivating mutations in MFSD2A,
required for omega-3 fatty acid transport in brain, cause a lethal microcephaly syndrome,"
Nature genetics, 2015.
[389] V. Alakbarzade, A. Hameed, D.Q. Quek, B.A. Chioza, E.L. Baple, A. Cazenave-Gassiot,
L.N. Nguyen, M.R. Wenk, A.Q. Ahmad, A. Sreekantan-Nair, and others, "A partially
inactivating mutation in the sodium-dependent lysophosphatidylcholine transporter
MFSD2A causes a non-lethal microcephaly syndrome," Nature genetics, 2015.
[390] T. Papadopoulos, R. Schemm, H. Grubmüller, and N. Brose, "Lipid Binding Defects and
Perturbed Synaptogenic Activity of a Collybistin R290H Mutant That Causes Epilepsy
and Intellectual Disability," Journal of Biological Chemistry, vol. 290, 2015, pp. 8256-8270.
[391] Y. Luo, G. Riedlinger, and P. Szolovits, "Text Mining in Cancer Gene and Pathway
Prioritization," Cancer Informatics, vol. 13, 2014, pp. 69-79.