Towards Unified Biomedical Modeling with Subgraph Mining and Factorization Algorithms

by Yuan Luo

Submitted to the Department of Electrical Engineering and Computer Science on August 18, 2015, in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology, September 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Yuan Luo, Department of Electrical Engineering and Computer Science, August 18, 2015
Certified by: Peter Szolovits, Professor, Thesis Supervisor
Certified by: Ozlem Uzuner, Associate Professor, State University of New York at Albany, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Theses

Abstract

This dissertation applies subgraph mining and factorization algorithms to clinical narrative text, ICU physiologic time series, and computational genomics. These algorithms aim to build clinical models that improve both prediction accuracy and interpretability by exploring relational information in different biomedical data modalities, including clinical narratives, physiologic time series, and exonic mutations. This dissertation focuses on three concrete applications: implicating neurodevelopmentally coregulated exon clusters in phenotypes of Autism Spectrum Disorder (ASD), predicting mortality risk of ICU patients based on their physiologic measurement time series, and identifying subtypes of lymphoma patients based on pathology report text.
In each application, we automatically extract relational information into a graph representation and collect important subgraphs that are of interest. Depending on the degree of structure in the data format, heavier machinery of factorization models becomes necessary to reliably group important subgraphs. We demonstrate that these methods lead to not only improved performance but also better interpretability in each application.

Thesis Supervisor: Peter Szolovits
Title: Professor

Thesis Supervisor: Ozlem Uzuner
Title: Associate Professor, State University of New York at Albany

Acknowledgments

I have been fortunate to have Pete Szolovits and Ozlem Uzuner as my advisers. Pete simultaneously provided the freedom to work on what I wanted and the guidance that enabled me to succeed in my work. Ozlem introduced me to the field of medical natural language processing and has guided me in pursuing this topic in depth. My PhD committee members, Sam Madden and Effi Hochberg, provided valuable counsel on both research and writing. Ally Eran, Aliyah Sohani, Yu Xin, Rohit Joshi, Nathan Palmer, Paul Avillach, and Isaac Kohane collaborated on part of this work and contributed much insight. Andrew Lo, Jason Baron, Anand Dighe, Bill Long, Leo Celi, Xiaoqian Jiang and Dahua Lin have supported me at various stages of my graduate career. I am very grateful to all my friends at MIT, especially folks at MEDG, who made my graduate years here exciting and pleasurable. I am deeply indebted to my family for their unconditional love and support.

The work in this thesis is supported by i2b2, by Grant Number U54LM008748 from the National Library of Medicine, by the Scullen Center for Cancer Data Analysis, and by the MGH-MIT Strategic Partnership.

Contents

Chapter 1. Introduction
1.1 Biomedical Relations
1.1.1 Medical Natural Language Processing
1.1.2 Intensive Care Unit time series analysis
1.1.3 Next Generation Sequencing analysis
1.2 Challenges in Modeling Biomedical Relations
1.2.1 Noisy structure extraction from narrative text
1.2.2 Poor scalability and abstraction for time sequence data
1.2.3 Connecting the dots for sequencing variants
1.2.4 Correlation analysis among multiple feature modes
1.3 Contributions and Organization
Chapter 2. Related Work
2.1 Application of Biomedical Relation Extraction
2.1.1 Biomolecular information extraction
2.1.2 Clinical trial screening
2.1.3 Pharmacogenomics
2.1.4 Diagnosis categorization
2.1.5 Adverse drug reaction and drug-drug interaction
2.2 General Pipeline for Biomedical Relation Extraction
2.3 State-of-the-Art Methods for Biomedical Relation Extraction
2.3.1 Relation extraction from scientific literature
2.3.2 Relation extraction from clinical narrative text
2.3.3 Shared resources for relation extraction
2.4 Limitations of Existing Work
2.4.1 Not all parsers and dependency encodings are synergistic
2.4.2 Integrating co-reference resolution
2.4.3 General relation and event extraction and domain adaptation
2.4.4 Redundancy in subgraph patterns
2.4.5 Integrating with NER
Chapter 3. General Relation Extraction by Frequent Subgraph Mining Applied to Automatic Lymphoma Classification
3.1 Background
3.2 Task Definition
3.3 Data Collection
3.4 Methods
3.4.1 Corpus pre-processing
3.4.2 Intuition on relations among concepts
3.4.3 Representing sentence dependency parses as graphs
3.4.4 Frequent subgraph mining
3.4.5 Subgraph redundancy pruning
3.4.6 Single node frequent subgraph collection
3.5 Experiments and Results
3.6 Feature and Error Analysis
3.7 Discussion and Limitations
3.8 Conclusions
Chapter 4. Subgraph Augmented Non-negative Tensor Factorization (SANTF) Applied to Modeling Clinical Narrative Text
4.1 Methods
4.1.1 Workflow of SANTF
4.1.2 Joint modeling of higher-order features and atomic features using a tensor
4.1.3 Patient and feature group discovery using SANTF
4.1.4 SANTF algorithm
4.2 Experiments and Results
4.3 Feature Analysis
4.4 Discussion
4.5 Conclusions
Chapter 5. Subgraph Augmented Non-negative Matrix Factorization (SANMF) in Modeling ICU Physiologic Time Series
5.1 Background
5.2 Methods
5.2.1 Workflow of SANMF
5.2.2 Representing time series as graphs
5.2.3 Frequent subgraph mining
5.2.4 SANMF algorithm
5.2.5 Feature group discovery and association using SANMF
5.2.6 Evaluating the groups discovered by SANMF
5.3 Results
5.3.1 Method validation on ICU patients' mortality risk prediction
5.3.2 Important subgraph groups
5.4 Limitations and Discussion
5.5 Conclusions
Chapter 6. Integrated Genomics, Transcriptomics, Medical Records, and Insurance Claims Analyses Identify Dyslipidemia as a Strong Inherited Risk Factor in ASD
6.1 Background
6.2 Methods
6.2.1 Implication of Co-regulated Exons
6.2.2 Whole exome sequence analysis
6.2.3 Segregation pattern analysis
6.2.4 Integrated statistical significance
6.2.5 Functional enrichment analysis
6.2.6 Analysis of lipidemia profiles using lab results from individuals with ASD seen at Boston Children's Hospital
6.2.7 PheWAS of Aetna claims data
6.3 Results
6.3.1 Neurodevelopmentally co-regulated, sexually dimorphic, segregating deleterious variation in ASD
6.3.2 Convergent lipid metabolism etiology
6.3.3 Dyslipidemia in families with ASD
6.3.4 Behavioral phenotypes of mouse models of dyslipidemia
6.4 Conclusions and Discussion
Chapter 7. Conclusion and Future Work
7.1 Contributions
7.2 Future Directions
Bibliography

List of Figures

Figure 1-1 Relations from an example sentence, using graph representation.
Figure 2-1 Applications of biomedical relation extraction.
Figure 2-2 General workflow of biomedical relation extraction.
Figure 3-1 MGH pathology reports usually contain four sections with almost all information retained as narrative text.
Figure 3-2 Example sentence parsed directly by the Stanford Parser.
Figure 3-3 Two-phase sentence parsing on example.
Figure 3-4 Raw Stanford parsing result for example sentence 1.
Figure 3-5 Stanford parsing result after pre-processing for example sentence 1.
Figure 3-6 Raw Stanford parsing result for example sentence 2.
Figure 3-7 Stanford parsing result after pre-processing for example sentence 2.
Figure 3-8 Raw Stanford parsing result for example sentence 3.
Figure 3-9 Stanford parsing result after pre-processing for example sentence 3.
Figure 3-10 A variety of sentences frequently occurring in our corpus describe the relations among cells, staining, and antigens/antibodies.
Figure 3-11 Constructing the sentence graph from the results of two-phase dependency parsing.
Figure 3-12 Example subgraphs for the sentence graph in Figure 3-11.
Figure 3-13 A hierarchical hash partition algorithm for determining subisomorphism relation among graphs in a set.
Figure 4-1 The workflow of subgraph augmented non-negative tensor factorization (SANTF).
Figure 4-2 Graph generation and subgraph collection in SANTF.
Figure 4-3 Tensor modeling and factorization with distributional representations of the sentence subgraphs.
Figure 4-4 Word group distribution for six of the top subgraphs in the first DLBCL associated subgraph group.
Figure 4-5 Correlation between six of the top subgraphs (partial sentences) in the first DLBCL associated subgraph group.
Figure 5-1 The workflow of subgraph augmented non-negative matrix factorization (SANMF).
Figure 5-2 Graph generation and subgraph mining in SANMF.
Figure 5-3 Subgraph augmented non-negative matrix factorization model.
Figure 5-4 AUC comparisons between NMF and PCA under specification of different number of subgraph groups.
Figure 5-5 ROC curves for proposed method SANMF, comparison models including subgraph, discretized & interpolated measures (D,I-measure), and organ level status, as well as the baseline using SAPS II approximation.
Figure 6-1 Independent sources of information used to identify molecular networks contributing to ASD.
Figure 6-2 Visualization of the BrainSpan RNA-Seq data.
Figure 6-3 Distribution of the number of non-NA values in expressions of exons.
Figure 6-4 Block and parallel exon correlation makes computation feasible.
Figure 6-5 Distribution of R² in the BrainSpan data.
Figure 6-6 Visualization of part of the entire exon graph.
Figure 6-7 Distribution of padded and merged BrainSpan interval sizes.
Figure 6-8 Overview of WES analysis.
Figure 6-9 Distributions of the total number of variants in probands and unaffected siblings in discordant families.
Figure 6-10 Distribution of number of variants per individual in the discordant family cohort at each stage of variant analysis.
Figure 6-11 Distribution of number of variants per individual among multiplex families at each stage of variant analysis.
Figure 6-12 Distribution of sizes of multiplex families.
Figure 6-13 Pseudo code of the extended ASP test for multiplex families.
Figure 6-14 The sexually dimorphic neurodevelopmentally co-regulated LDLR exon cluster.
Figure 6-15 ASD-segregating deleterious variation in the sexually-dimorphic LPL exon cluster.

List of Tables

Table 2-1 Summarization and characterization of relation extraction algorithms.
Table 2-2 BioNLP event extraction tasks.
Table 2-3 Shared resources for relation extraction.
Table 3-1 Regular Expressions to Catch Lymphoma Mentions.
Table 3-2 Semantic types considered as immunologic factors.
Table 3-3 Multiple-hit or intermediate lymphoma cases.
Table 3-4 Distribution of lymphoma cases in full corpus, training corpus and testing corpus.
Table 3-5 Held-out test results on different feature groups.
Table 3-6 Held-out test results on different settings of sentence subgraph feature groups.
Table 4-1 Statistics of the lymphoma subtype distribution in the pathology narrative text corpus.
Table 4-2 Clustering performances for MGH lymphoma dataset.
Table 4-3 Per-class evaluation of clustering on the lymphoma dataset.
Table 4-4 Top higher-order feature groups associated with diffuse large B-cell lymphoma.
Table 4-5 Top higher-order feature groups associated with follicular lymphoma.
Table 4-6 Top higher-order feature groups associated with Hodgkin lymphoma.
Table 5-1 A simplified algorithm for determining subisomorphism relation among time series subgraphs.
Table 5-2 Statistics of experiment data.
Table 5-3 Physiologic time series predictor variables from MIMIC-II dataset.
Table 5-4 Top subgraph groups associated with high mortality risks.
Table 6-1 Brain region hierarchy of regions, areas, and structures included in this study.
Table 6-2 Periods of brain development included in this study.
Table 6-3 Distribution of cluster sizes (measured in terms of number of exons).
Table 6-4 Distribution of number of genes in exon clusters.
Table 6-5 Whole exome sequence datasets used.
Table 6-6 Patients used to examine the association of abnormal lipid lab results with ASD.
Table 6-7 Significant clusters of sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious variation, and their molecular themes.
Table 6-8 Enrichment of comorbid dyslipidemia diagnoses in individuals with ASD as compared to their unaffected siblings.
Table 6-9 Significant enrichment of dyslipidemia-related diagnoses in individuals with ASD, detected in health claims data.
Table 6-10 Behavioral and nervous system phenotypes shared between 42 mouse models of ASD and 7 mouse models of LDLR deficiency.

Chapter 1. Introduction

1.1 Biomedical Relations

With recent advances in data acquisition and storage technologies in the biomedical field, large volumes of data with unique characteristics and multiple modalities flow into growing archives that can be used to study and improve medical care. For example, narrative text in a pathology report may explain pathologists' interpretations of flow cytometry results, immunohistochemical patterns, or genetic karyotype profiles. Such text has moderately controlled vocabularies but generally presents high variability due to the flexibility of natural language.
In a narrative text corpus, multiple sentence constructs often express the same meaning, differing in syntactic construction, word order, or use of abbreviations. A second example considers the vital signs and other physiologic measurements monitored during hospital admissions, which present themselves as evolving time series, often at unevenly sampled time points. Early recognition of clinical deterioration through early warning systems is an area of active research that aims to identify actionable items for improving patient survival [1]. Scaling to a comprehensive set of clinical variables means analyzing many unevenly spaced time series, which quickly becomes computationally intensive as the number of variables increases. A third example concerns next generation sequencing, which may output multiple gigabytes of gene sequence data per individual, posing immediate throughput challenges to existing representation and learning frameworks. Growing evidence has linked alternatively spliced isoforms and regulating pathways to distinct clinical outcomes of multiple specific diseases such as Autism Spectrum Disorder (ASD), underscoring the value of the ability to sift through genetic sequence data.

Although these data modalities vary in characteristics and emphasis, meaningful and effective structure discovery for each of them has been under active study within its respective research subfield. The problem domains addressed by this thesis include medical natural language processing, clinical dynamic time series analysis, and next generation sequencing analysis. As vast knowledge and data sources often exceed the capacity of human experts, we need to leverage modern statistical analysis and machine learning algorithms to generate models that are both accurate and interpretable.
We emphasize interpretability so that researchers and clinicians will understand the model and use it to advance the understanding of pathophysiology and to improve patient care. The methods need to be broadly applicable and easily adaptable across domains.

1.1.1 Medical Natural Language Processing

Relation extraction from text documents is an important task in knowledge representation and inference: it is used to create structured knowledge bases, to augment existing knowledge bases, and in turn to support question answering and decision making. The task generally involves annotating unstructured text with named entities and identifying the relations between these annotated entities. State-of-the-art named entity recognizers can automatically annotate text with high accuracy [2,3], but relation extraction is not as straightforward. General domain relation extraction has been an active research area for decades [4]. In the biomedical and clinical domain, extracting relations from scientific publications and clinical narratives has been gaining traction over the past decade.

To illustrate the importance of biomedical and clinical relation extraction, consider that in lymphoma pathology reports, immunophenotypic features are expressed as relations among medical concepts. For example, in "[large atypical cells] are positive for [CD30] and negative for [CD15]", "large atypical cells", "CD30" and "CD15" are medical concepts; "CD30" and "CD15" are cell surface antigens. A bag-of-words or bag-of-concepts representation of this sentence would fail to capture whether "large atypical cells" are positive or negative for "CD30" or "CD15". In this and many other similar cases, the biomedical concepts need to be represented as linked through syntax and/or semantics in order to be informative, so that ambiguities can be resolved by putting the concepts into context.

We define a relation as a tuple $r(c_1, c_2, \ldots, c_n)$, $n \geq 2$, where the $c_i$ are concepts (named entities) that are semantically and/or syntactically linked to form relation $r$, as expressed in text. Thus a single named entity is generally not regarded as a relation; an assertion is also generally not regarded as a relation. In other words, a relation involves at least two concepts. If $n$ is two (three), we call the relation a binary (ternary) relation, and for general $n$ an $n$-ary relation. Some researchers use the term relation to focus on triples that represent binary relations (e.g., positive-expression(large atypical cells, CD30), negative-expression(large atypical cells, CD15)). Others also consider composite relations, e.g., and(positive-expression(large atypical cells, CD30), negative-expression(large atypical cells, CD15)). We also use the term relation to include what are often referred to as events; e.g., the ternary relation $r_1$: treatedby(patient, Imatinib regimen, 5 months), as expressed in "[the patient] was put on [Imatinib regimen] for [5 months]", can also be parsed as an event, where the event trigger is "put", the theme is "Imatinib regimen" and the target argument is "patient". Nested events may occur when one event takes other events as arguments. Figure 1-1 shows relations from an example sentence, including binary relations, complex relations, and nested events. We note that all these language constructs can be universally represented and mined as graphs (e.g., with medical concepts as nodes and syntactic/semantic links as edges).

[Figure 1-1: graph of the example sentence "Bone marrow biopsy was performed on the patient in order to evaluate the effect of medication for lymphoma as the cause of neutropenia.", with labels including performed_on, evaluate, effect, cause, and produced-by.]

Figure 1-1 Relations from an example sentence, using graph representation.
Nodes are named entities and edges indicate the relations between two nodes (or multiple named entities connected by multiple edges can be considered as one relation). Named entities considered are in bold in the sentence. The dashed box denotes a binary relation, i.e., with two named entities. The solid box denotes a relation with multiple named entities, which alternatively can be viewed as a collection of three binary relations. These relations (in solid box and dashed box) can also be regarded as events, and the entire graph can be interpreted as a nested event.

1.1.2 Intensive Care Unit time series analysis

Modern ICUs generate multivariate time series data for individual patients using an increasing number of monitoring devices and laboratory tests. There is a growing body of evidence suggesting that early recognition of clinical instability and early intervention in the development of disease processes may improve patient outcomes such as mortality [5,6]. Interpreting such data in a timely fashion and providing high quality care demand close attention from critical care providers, which exposes ICU patients to the human errors known to be common in hospital admissions [7,8]. Thus automated tools are needed to help clinicians and nurses identify clinical deterioration early on and quickly assemble an effective treatment plan. A model that understands the patient's multivariate physiologic temporal progressions may be useful to catch preludes to dangerous episodes, increase caregiver vigilance, and ultimately improve patient outcome. Many studies have tracked clinical variables to understand the natural history of diseases or to monitor patient baseline progressions in response to medical intervening procedures and agents.
One such comprehensive time series archive is the MIMIC-II (Multiparameter Intelligent Monitoring in Intensive Care) database, which contains physiologic signals and vital signs time series captured from patient monitors, as well as accompanying clinical data extracted from electronic medical records (EMR) systems. The database currently contains over 40,000 ICU patients, whose data were collected between 2001 and 2008 from a variety of ICUs (medical, surgical, coronary care, and neonatal) within a single tertiary teaching hospital. The patient's multivariate physiologic temporal progressions are in fact relations in the temporal domain. The ability to succinctly represent these relations and to correlate features of such representations with various aspects of diseases may offer insights into the pathogenesis, and help physicians make informed decisions.

To digest the vast amount of monitored time series and to present them in an informative way, dynamic models have been studied, most of which fall in the probabilistic generative model framework. Filter-based generative models such as switching Kalman filters [9] assume that data is generated from a discrete set of transition matrices, but discretization may limit the visibility of fine grained variability among individual patients. Models based on hierarchical Dirichlet processes (HDPs) loosen the discretization prerequisite and accept an infinite dimensional latent state space [10,11]. They typically model the time series using a sequence of parameterized generating functions that specify the series dynamics conditioned on the current and/or previous states and differ in the degree of overlap among topics of such generating functions. In addition to generative models, Fourier or wavelet transformations [12,13] have been applied to extract features directly from the time series. However, these methods generally suffer from the problem of feature interpretability.
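As a minimal sketch of the transform-based feature extraction mentioned above, the following computes the magnitudes of the first few discrete Fourier coefficients of a series. The function name `dft_magnitudes`, the normalization by series length, and the synthetic sine input are assumptions made for illustration; they are not taken from the cited work [12,13] or from MIMIC-II.

```python
import cmath
import math


def dft_magnitudes(series, n_coeffs):
    """Return normalized magnitudes of the first n_coeffs DFT coefficients."""
    n = len(series)
    magnitudes = []
    for k in range(n_coeffs):
        # Each coefficient is a sum over every time point in the window.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(series)
        )
        magnitudes.append(abs(coeff) / n)
    return magnitudes


# Synthetic input (assumption): a sine wave with 2 cycles over 16 samples.
SERIES = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]

if __name__ == "__main__":
    print([round(m, 3) for m in dft_magnitudes(SERIES, 4)])
```

Because every coefficient sums over the entire window, a single feature value does not point to a specific moment or trend in the patient's record, which is one way to read the interpretability concern raised above.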
1.1.3 Next Generation Sequencing analysis

In recent years, high-throughput sequencing techniques have enabled the identification of genetic patterns associated with distinct clinical outcomes of specific disease entities. For example, genome-wide association studies (GWAS) expanded the assessment scope on genetic variations to the whole genome, though they are generally limited to previously identified single-nucleotide polymorphisms (SNPs) [14]. Exome sequencing is able to comprehensively identify and type protein-coding variations throughout the genome, hence is less biased towards learning what we already know. However, the roughly 99% of the genome that exome sequencing ignores consists of noncoding regions that may have regulatory influence on the expression and functioning of coding regions [14]. Personalized whole-genome sequencing is not restricted by the biases associated with the previous two sequencing technologies.

Next generation sequencing technology has produced an ever-increasing amount of genomics data at multiple resolutions, which makes it possible to characterize at the genetic level those diseases and disorders that are inherited but highly heterogeneous. Such characterization requires deep understanding of genetic variants in relation to each other and to the disease phenotype, through mechanisms such as regulatory networks and signaling pathways. Thus it is important to effectively model the relations (e.g., through transcription or regulation) of genetic variants in next generation sequencing analysis.

One example is Autism Spectrum Disorder (ASD). One in every 68 children in the USA is diagnosed with ASD, a set of neurodevelopmental conditions characterized by social and communication impairments, and increased repetitive behavior. ASD has a substantial genetic component, but the specific cause of most cases remains unknown.
Today, different constellations of selected molecular, biochemical, neurofunctional, and clinical measurements that fall outside of normal ranges can each identify a group of individuals with ASD. However, individuals without ASD also display measures that lie outside of the normal range for one (or possibly more) of the dimensions tested. Furthermore, recent large-scale whole exome and whole genome sequencing studies suggest that not only do different individuals with ASD carry different deleterious variants, but a single individual may have multiple different variants in likely candidate genes [15-24]. Therefore, there might exist a spectrum of genetic variants underlying the spectrum of clinical manifestations, making ASD extremely heterogeneous on both the molecular and clinical levels. Thus it is essential to model the relations of genetic variants in association with disease using not only next generation sequencing data, but also personal health data from other modalities in an integrative fashion.

1.2 Challenges in Modeling Biomedical Relations

There are a few major challenges, common to each subfield and the overall field, with respect to modeling biomedical relations.

1.2.1 Noisy structure extraction from narrative text

Much of the clinical content of EMRs is, from a computer's viewpoint, locked up in the narrative text portions of the records. These typically include doctors' and nurses' notes, referring letters, specialists' reports, discharge summaries, and communications between doctors and patients. Their content adds to the data available from more structured components of the EMR such as laboratory values, medication prescriptions and vital sign records. There are existing clinical NLP systems such as cTAKES [25] and MetaMap [26] that can extract medical concepts and their assertions (e.g., negated concepts [27]). However, it is still an open problem to automatically extract useful relationships between medical concepts.
Much of the state-of-the-art focuses on extracting or classifying predefined relations from biomedical narratives [2,28-34]; however, it is uncertain whether these predefined and often binary relations are directly useful and comprehensive for complex tasks such as patient diagnosis and outcome prediction.

1.2.2 Poor scalability and abstraction for time sequence data

During hospital admissions, routinely monitored patient baseline progression includes vital signs, chem7 and other physiologic measurements. Studies have linked early recognition of patients' declining baseline condition to a 50% reduction in the heart attack rate, and in turn to lower mortality [5]. Common practice typically involves the use of predictive scoring systems that aim to identify only the few most descriptive clinical measurements for a particular outcome [35-40]. Many attempts to perform multivariate time-series analysis are restricted to only a handful of clinical variables (usually fewer than 20, see [10,41-43]). On the other hand, the few approaches to unsupervised high-dimensional multivariate learning [44,45] lack the ability to simultaneously learn temporal patterns while learning abstractions¹ over raw measurements.

1.2.3 Connecting the dots for sequencing variants

The current practice of analyzing genetic sequence variants often assumes linear models in which the relations between Single Nucleotide Variations (SNVs) and Copy Number Variations (CNVs) are largely ignored. On the other hand, the genes that are affected by those SNVs and CNVs interact with each other functionally in the context of pathways or regulatory networks. Moving toward whole-exome and whole-genome analysis, statistical tools face multiple challenges in connecting those SNVs and CNVs through their functional interactions in order to better understand pathogenic mechanisms.
In particular, association between variants and disease phenotype should be investigated in a context where variants are not treated independently, but collectively when functionally correlated. However, as next generation sequencing produces an ever-increasing amount of genomics data, it also becomes more difficult to identify a subset of genetic variants underlying a particular phenotype. Even if one focuses on the protein-encoding exome, there are at least 25,000 distinct variants that differentiate individuals from each other. Although graphical models have been applied to estimate the structure of functional interaction, they are typically restricted to a small set of variants [46-48]. Relaxing such restrictions to take advantage of whole-exome and whole-genome sequencing will pose not only computational challenges (e.g., convergence rate and local optima) but also representational and statistical challenges (e.g., hypothesis space pruning and significance testing within a greatly increased hypothesis space).

¹ Some refer to this level of learning as learning clusters, while others refer to it as learning topics.

1.2.4 Correlation analysis among multiple feature modes

In many modeling tasks, the raw data can be processed by multiple feature extraction algorithms that generate features from different modalities or from multiple levels of analysis. For example, in medical natural language processing, one can extract the standard bag-of-words features, or one can extract more semantic-syntactic enriched features such as predicate argument structures and named entities. The different levels of features are correlated and collectively reflect the characteristics of a sentence or a document. Traditional machine learning models in medicine mostly adopt a two-dimensional matrix view of the data in the sense that patients and features each span one axis of a matrix.
Such models cannot account for interactions between features or groups of features at different levels. Similar challenges exist when patients' personal health data come in multiple modalities. For example, in studying patients with Autism Spectrum Disorders, it has been broadly hypothesized that only through combinations of multimodal measures, including genomics, transcriptomics, lab test results, and insurance claims analyses, will we obtain the diagnostic and prognostic accuracy that permits proper assignment of each individual to the group of ASD patients whose etiology, pathophysiology, treatment response, and clinical course most closely resemble his or hers.

1.3 Contributions and Organization

This dissertation contributes a generalizable framework based on subgraph mining and factorization algorithms to model biomedical relations, and further, their correlations. It develops SANTF, a subgraph augmented non-negative tensor factorization tool that integrates atomic features (words) to help correlate higher-order features (relations between medical concepts) in clinical narrative text, and enables automated and interpretable lymphoma subtype categorization. As a variation of SANTF, this dissertation also develops Subgraph Augmented Non-negative Matrix Factorization (SANMF), which groups graph-represented temporal progression trends of physiologic variables in a way that reflects the evolution of patient pathophysiology and that is predictive of patients' mortality risks. As another variation, it develops ICE (implication of co-regulated exons), a new subgraph-based method that implicates co-regulated exons in the ASD phenotype and allows identification of novel risk factors for ASD.

The rest of this dissertation is organized as follows. In Chapter 2, we provide the background necessary to understand the motivations of applying subgraph mining and factorization algorithms to extract relations from biomedical narratives. We also describe previous work in the area.
In Chapter 3, we describe in more detail the graph mining component of SANTF, which is applied to lymphoma subtype classification. Chapter 4 continues to describe the core SANTF, which extends the graph mining component to augment non-negative tensor factorization algorithms in order to group subgraph-mined biomedical relations and produce interpretable diagnostic panels for lymphoma subtypes. Chapter 5 describes SANMF and its application to ICU mortality risk prediction. Chapter 6 describes ICE and its application to study genetic risk factors for ASD. Chapter 7 summarizes conclusions and future work.

Chapter 2. Related Work

In this chapter, we review relation extraction from unstructured text using natural language processing (NLP) methods, with a focus on applications in biomedical and clinical informatics. The representation of relations has been a subject of knowledge representation research for decades [49], and there are various alternatives. One representation uses composed simple logical forms. For example, the Resource Description Framework (RDF) and the Web Ontology Language (OWL) encode complex relations by multiple triples, where the elements of these triples can themselves be other composed forms. Thus a binary relation such as positive-expression(large atypical cells, CD30) has the following subject-predicate-object triple representation: large atypical cells-positively express-CD30. A more powerful alternative is the sentential logic (or propositional logic) representation [49], in which relations are propositions or composed propositions using logical connectives (e.g., and for conjunction, or for disjunction).
A third alternative is the graph-based representation in which nodes are named entities and edges indicate relationships (or multiple named entities connected by multiple edges can be regarded as one relation), as in Figure 1-1, which shows binary relations, n-ary relations, and how an n-ary relation can be regarded as a composition of multiple binary relations. The graph-based representation is equivalent to the sentential logic representation, differing at most perhaps in the compactness of the representation [50]. Thus, relations (including events) can be universally represented as graphs by converting biomedical concepts to nodes and syntactic/semantic links to edges. Other relation representations can also be easily derived using such graphs as intermediary input. Furthermore, although composition leads to complexity (e.g., n-ary relations or nested relations), by adopting a graph-based representation, we can focus on syntactic and semantic graphical patterns that are common and that provide good ways to capture relations. In fact, as will become clear later in this chapter, almost all state-of-the-art methods for extracting relations and events use graph-based algorithms.

The reader should also be aware of a body of research on creating curated structured knowledge bases, which record manual annotations of biomedical relations by experts. Some of these knowledge bases are biologically focused, such as KEGG [51], STRING [52], InterPro [53], and InterDom [54]. Others are more clinically focused, such as PharmGKB [55], VARIMED [56] and ClinVar [57]. However, the expert sourcing methods often scale poorly with the exponentially growing body of biomedical and clinical free text. Thus automated methods present a promising direction for discovering relations that can augment existing knowledge bases.
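The triple and graph representations discussed above can be sketched in a few lines. This is a minimal illustration, not the representation used by any cited system: the class `RelationGraph`, its method names, and the choice to reify an n-ary relation as an extra relation node with `arg1`, `arg2`, ... edges are all assumptions made here for illustration.

```python
from collections import defaultdict


class RelationGraph:
    """Concepts are nodes; labeled, directed edges are syntactic/semantic links."""

    def __init__(self):
        self.edges = defaultdict(list)  # source node -> [(edge label, target node)]
        self._next_id = 0               # counter for reified n-ary relation nodes

    def add_binary(self, label, source, target):
        """Store a binary relation label(source, target) as one labeled edge."""
        self.edges[source].append((label, target))

    def add_nary(self, label, *concepts):
        """Reify an n-ary relation as a relation node with one edge per argument."""
        relation_node = f"{label}#{self._next_id}"
        self._next_id += 1
        for position, concept in enumerate(concepts, start=1):
            self.add_binary(f"arg{position}", relation_node, concept)
        return relation_node

    def triples(self):
        """Return all edges as sorted (subject, predicate, object) triples."""
        return sorted(
            (s, label, t) for s, outs in self.edges.items() for label, t in outs
        )

    def nodes(self):
        """Return every node that appears as a source or a target."""
        found = set(self.edges)
        for outs in self.edges.values():
            found.update(t for _, t in outs)
        return found


def build_example():
    """Encode the two binary relations from the lymphoma example sentence."""
    graph = RelationGraph()
    graph.add_binary("positive-expression", "large atypical cells", "CD30")
    graph.add_binary("negative-expression", "large atypical cells", "CD15")
    return graph


if __name__ == "__main__":
    for triple in build_example().triples():
        print(triple)
```

Reading the edges back with `triples()` gives the subject-predicate-object form, while `add_nary` shows one possible way to express a ternary relation such as the Imatinib example as a composition of binary edges.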
2.1 Application of Biomedical Relation Extraction

Extracting biomedical relations has numerous applications that vary from advancing basic sciences to improving clinical practices, as shown in Figure 2-1. These applications include but are not limited to bio-molecular information extraction, clinical trial screening, pharmacogenomics, diagnosis categorization, as well as discovery of adverse drug reactions and drug-drug interactions.

Figure 2-1 Applications of biomedical relation extraction. The bidirectional arrows indicate that on the one hand, automated methods for relation extraction can help biological and clinical investigations; on the other hand, these applications can in turn provide shared resources (e.g., corpora and knowledge bases).

2.1.1 Biomolecular information extraction

To keep up with the exponential growth of the literature, automated methods have been applied to mining protein-protein interactions [58,59], gene-phenotype associations [60,61], gene ontology [62], and pathway information [63], which we collectively call biomolecular information extraction. Such relation mining has shown its value in the prioritization of cancerous genes for further validation from a large number of candidates [64]. Many of these approaches apply NLP methods to extract known disease-gene relations from the literature, which are then used to predict novel disease-gene relations [65-69].

2.1.2 Clinical trial screening

Archived clinical and research data have been made available by governmental agencies and corporations, such as ClinicalTrials.gov [70]. Clinical trials are in large part characterized by eligibility criteria, some of which can be captured via relations (e.g., no [diagnosis] for [rheumatoid arthritis] for at least [6 months]). Electronic screening can improve efficiency in clinical trial recruitment, and intelligent query over clinical trials can support clinical research knowledge curation [71].
Recently, NLP support has proved useful in scaling up the annotation process [72-74], enabling semantically meaningful search queries [75], and clustering similar clinical trials based on their eligibility criteria profiles [76].

2.1.3 Pharmacogenomics

Pharmacogenomics aims to understand how different patients respond to drugs by studying relations between drug response phenotypes and patient genetic variations. Much of the knowledge on such relations can be mined from scientific literature text and curated in databases to enable discovery of new relationships. One such database is the Pharmacogenetics Research Network and Knowledge Base (PharmGKB [77]). Initial efforts to populate PharmGKB included a mixture of expert annotation and rule-based approaches. Recent approaches have extended to utilizing semantic and syntactic analysis as well as statistical machine learning tools to mine targeted pharmacogenomics relations from biomedical literature and clinical records [78-80].

2.1.4 Diagnosis categorization

Diagnosis categorization enables automated billing and patient cohort selection for secondary research. Systems have been developed to automatically perform coding and classification of diagnoses from Electronic Medical Records (EMRs) [81-85]. More recent approaches demonstrated the success of extracting semantic relations and using these relations as additional features in diagnosis categorization, some through better vocabulary coverage [86], others through more expressive and informative representation of relations between medical concepts [87,88].

2.1.5 Adverse drug reaction and drug-drug interaction

Adverse drug reaction (ADR) refers to unexpected injuries caused by taking a medication. Drug-drug interaction (DDI) happens when a drug affects the activity of another drug when both are administered together. ADR is an important cause of morbidity and mortality [89], and DDIs may cause reduced drug efficacy or lead to drug overdose.
Detecting potential ADRs and DDIs can guide the process of drug development. Recently, an increasing number of systems have leveraged the scientific literature and clinical records using NLP. These systems often explore the relations between drugs, genes and pathways, and discover ADRs [90-92] and DDIs [33,34] that are stated in unstructured text.

2.2 General Pipeline for Biomedical Relation Extraction

In Figure 2-2, we first present a general pipeline, summarized from the reviewed approaches, as a cookbook to follow either in part or as a whole for extracting biomedical relations. We present this general pipeline before the methodology review to provide the reader a roadmap of the components discussed in the state-of-the-art methods. For completeness, we assume documents as the input and the extracted relations as the output. The pipeline is thus self-contained, but can also be used as a foundation for downstream applications such as logical inference with extracted relations. The pipeline covers steps for breaking the documents into sentences, understanding the semantic and syntactic structures of sentences and constructing a multitude of features for relation extraction. We refer the reader to the description of each step in the accompanying text of the figure. We emphasize the role of graph mining in the pipeline as a central concept. The common graphs provide a point of convergence for methods that combine local features, a point of divergence from which more integrated features may be constructed, and a bridge to connect the syntax and semantics.

[Figure 2-2 diagram. Recoverable components: Documents; Section recognition; Sentence breaking; Regex pattern matcher; Tokenization; Morphological analysis; Stemming; POS tagging; Terminology; Parsing; Feature extraction (context, lexical, semantic, concept, graph (tree, path), and dictionary features, etc.); Graph representation; Semantic Role Labeling; Graph mining; Post-processing (improve recall, improve precision); Relation classification (rule induction, feature space classifiers, kernels incl. graph/tree kernels); Relations.]

Figure 2-2 General workflow of biomedical relation extraction. Section recognition distinguishes text under different section headings (e.g., "Chief Complaints" or "Past Medical History"). Sentence breaking automatically decides where sentences in a paragraph begin and end. Morphological analysis investigates features such as capitalization and usage of alphanumeric characters. Stemming reduces inflected words to their root form (e.g., performed to perform). POS tagging assigns a part-of-speech tag to each word in the sentence (e.g., VBN for "performed" in the sentence in Figure 1-1). Parsing is the process of assigning a syntactic structure to a sentence (e.g., the constituency or dependency structure obtained by the Stanford Parser). The results from morphological analysis, stemming, POS tagging and parsing can provide features for recognizing anaphora (coreference resolution) and typed concepts (concept recognition). Coreference resolution and concept recognition can also improve parsing accuracy. Together with parsing, they are essential in generating the graph representation for a sentence and labeling semantic roles of concepts in the graph representation (Semantic Role Labeling). The graph representation is the foundation for graph mining, and along with upstream steps including direct regular expression feature extraction, leads to the generation of semantically and syntactically enriched features. These features then support rule based, feature space based or kernel based relation extraction systems. Many biomedical relation extraction systems rely on external knowledge sources (e.g., UMLS).
The shaded cloud denotes that the external resources (terminology, ontology and knowledge bases) can be utilized by some or all of the covered steps. 2.3 State-of-the-Art Methods for Biomedical Relation Extraction As the task of biomedical relation extraction has been receiving increasing attention, so have the methods to accomplish it. Some conventional approaches focus on using co-occurrence statistics as a proxy for relatedness [79,93-96]. Some clinical NLP systems apply hand-crafted syntactic and semantic rules to extract pre-specified semantic relations, such as MedLEE [97] and SemRep [98], and are hard to adapt to new subdomains. Recently, the research community has been paying more attention to the value of syntactic parsing, in order to develop generalizable methods to extract relations that fully explore the constituency and dependency structures of natural language. In this section, we review the state-of-the-art work where graph (including tree) mining techniques are used to derive relations from syntactic or semantic parses. We group the methods according to whether their corpora mainly concern scientific publications or clinical narrative text, as this content difference often has implications for the methods and resources used to extract relations. We also summarize the algorithms and systems in Table 2-1. 26 CoRef External Resources Graph Exploration Methods Parsers Luo et al. [87,88,99] Frequent subgraphs with No Stanford (augmented by redundancy removing UMLS) No Shortest path Stanford Roberts et al. [101] deBruijn et al.[102] McCCJ, SD Kay Xu et al. [103] Stanford, McCCJ, Enju Liu et al. [105,106], McCCJ, SD Mackinlay et al. [107], Ravikumar et al. [108] Bjorne et al. [111- McCCJ, SD 114] , Hakala et al. [115] et al. McCCJ, SD Kilicoglu [117,118] Hakenberg et al. BioLG [119,120] Solt et al. [104] Thomas et al. [121] Bikel, SD Riedel et al. 
[123] McCCJ Minimal trees over con- No cept pair Conceptual graph repre- No sentation Graph kernels: kBSPS APG, No Exact subgraph match- No ing, approximate subgraph matching Shortest path, rule-based No graph pruning Embedding graph, postprocessing rules Subgraph pattern matching using customized query language, postprocessing rules Pattern matching in dependency graphs Candidate graph scoring Yes UMLS, Gaston [100] Concept Matching Normalized string greedy match CRF for concept boundary and SVM for concept type Semi-Markov UMLS HMM Kay Chart Parser UMLS and regular expressions Compiled dictionaries Dictionary lookup and graph matching rules PDB [109], Uniprot, Yes Biothesaurus [110] UMLS, Wordnet Wikipedia Uniprot [116], Sub- No tiWiki, Wordnet, DrugBank, MetaMap Compiled dictionaries No Yes Compiled dictionar- BANNER, PNAT ies, Lucene, Uniprot, GO Yes GNAT [122] No No Van Landeghem et Stanford al. [125] No Compiled dictionar- No ies, Stanford event extractor [124] Compiled dictionaries Yes et al. Kaljurand [126] Vlachos et al. [128] Yes 2 IntAct [127] No Yes No No No No No No No No No UMLS, Wordnet No PharmGKB [77] Yes McClosky et al. [124,129] Quirk et al. [130] Miwa et al. [131] Coulet et al. Percha et [78,132], 2 Extraction rules based on minimal event containing subgraph patterns Dependency paths bePro3Gres tween the concept pairs Dependency paths beRASP tween the concept pairs, post-processing rules McCCJ, SD Minimum spanning tree algorithm SD; Shortest paths between McCCJ, Enju the concept pairs Enju, GDep Dependency paths between the concept pairs Stanford Dependency paths between the concept pairs Relative clause anaphora 27 No I I al. [80] Subtrees rooted at the lowest common ancestors of concept pair Wang et al. [137] No Association distance between pair of entities in a semantic network. Bui et al. [139] Stanford Grammatical rules to traverse the tree structures et al. 
LGP, Minipar, Subtrees rooted at the Katrenko Charniak lowest common ances[142] tors of concept pair Enju, GDep Dependency paths beSatre et al. [143] tween the concept pairs In-house par- Frequent subtree patterns Weng et al. [75] ser Graph kernels: APG, Thomas et al. [146] McCCJ, SD kBSPS Tree kernel: MEDT Chowdhury et al Stanford, 1 McCCJ, SD [147-149] Hakenberg [133] et al. Stanford No No UMLS, SIDER [134], BANNER [136] DrugBank [135], PharmGKB, GNAT Chem2Bio2RDF No [138] Yes HIVDB [140], gaDB [141] No No No No UniProt, Entrez Gene Yes [144], GENA [145] Yes UMLS No No No No No No 1 __ Re- Pre-specified drug names and regular expressions Yes

Table 2-1 Summary and characterization of relation extraction algorithms. Abbreviations used in this table include: CoRef - co-reference resolution, CRF - conditional random field, HMM - hidden Markov model, APG - all paths graph kernel [58], kBSPS - k-band shortest path spectrum kernel [150], MEDT - mildly extended dependency tree kernel [151], PDB - Protein Data Bank [109], UMLS - Unified Medical Language System. The keys for the parsers are: Stanford - Stanford Parser, McCCJ - McClosky-Charniak-Johnson Parser, Chart - Kay Chart Parser, Enju - Enju Parser, Bikel - Bikel Parser, SD - Stanford Dependency. When the Stanford Parser is used, Stanford Dependency is automatically assumed.

2.3.1 Relation extraction from scientific literature

Over the past decade, continuous effort has been directed to extracting semantic relations from biomedical literature text, often in the form of shared-task community challenges that aim to assess and advance NLP techniques. Notable community challenges include BioNLP shared tasks on event mining, BioCreative shared tasks on protein-protein interaction (PPI) extraction, and DDIExtraction challenges on drug-drug interaction (DDI) extraction. We observed that an increasing number of teams applied graph-based techniques to characterize the semantic relations in these shared tasks.
These techniques frequently place among the top-performing systems. This section reviews the graph-based methodologies developed for these challenges. We consider only the papers accepted into the shared task proceedings as full publications, and focus on the top performing systems. We summarize the f-measures of the best systems in each shared task as an evaluation of each, and refer the reader to the challenge overviews for detailed and comprehensive evaluations. Perhaps through learning the lessons from these challenges, real world applications such as the field of pharmacogenomics also saw significant momentum in the development of graph-based text mining methods. Thus we devote the last part of this section to reviewing recent advances in pharmacogenomics and to demonstrating the transfer and adaptation of graph-based algorithms from methodology oriented research to application oriented research in biomedical relation extraction.

2.3.1.1 BioNLP event mining shared tasks

Three BioNLP shared tasks have focused on recognizing biological events (relations) from the literature. The shared tasks provided the protein mentions as input and asked the participating teams to identify a predefined set of semantic relations. Teams were not required to discover the protein mentions. BioNLP-ST 2009 consisted of three sub-tasks, including core event detection, event argument recognition, and negation/speculation detection, all based on the GENIA corpus [31]. BioNLP-ST 2011 expanded the tasks and resources in order to cover more text types, event types and subject domains [28]. Besides the continued GENIA task (GE), the 2011 shared tasks added the following sub-tasks: epigenetics and post-translational modification (EPI), infectious diseases (ID), bacteria biotope (BB) and bacteria interaction (BI).
BioNLP-ST 2013 further expanded the application domains and included the following event extraction tasks: GE, BB, cancer genetics (CG), pathway curation (PC), and gene regulation ontology (GRO) [32]. Table 2-2 describes the nature of those tasks in more detail.

Tasks    Task Descriptions
GE       Extracting the bio-molecular events related to NFκB proteins.
EPI      Extracting epigenetic and post-translational modification events.
ID       Extracting events describing the biomolecular foundations of infectious diseases.
BB       Extracting the association between bacteria and their habitats.
BI       Extracting the bacterial molecular interactions and transcriptional regulations.
CG       Extracting cancer related molecular and cellular level foundations, tissue and organ level effects and organism level outcomes.
PC       Extracting signaling and metabolic pathway related biomolecular reactions.
GRO      Extracting regulatory events between genes.

Table 2-2 BioNLP event extraction tasks.

The typical event extraction workflow can be broken into two general steps: trigger detection and argument detection. For example, in r3: [the patient] was put on [Imatinib regimen], the first step detects the event trigger "put", and the second step detects the theme "Imatinib regimen" and target argument "patient". Bjorne et al. [111-113] converted sentences to a dependency graph (Stanford Dependency [152]) representation using the McClosky-Charniak-Johnson parser [153,154] and explored the graphs to construct features for both steps. The McClosky-Charniak-Johnson parser is based on the constituency parser of Charniak and Johnson [153] and retrained with the biomedical domain model of McClosky [154]. Bjorne et al. generated N-gram features connecting event arguments based on the shortest path of syntactic dependencies between the arguments. They included as features the types and supertypes of trigger nodes from event type generalization, in order to address feature sparsity. Bjorne et al.
also applied semantic postprocessing rules to prune graph edges that violate semantic compatibility that is required by the event definition to hold between event arguments. Their system (currently referred to as TEES) performed best in the 2009 GE (0.52 f-measure), 2011 EPI (0.5333 f-measure), 2013 CG (0.5541 f-measure), 2013 GRO (0.215 f-measure, being the only participating system) and 2013 BB full event extraction (0.14 f-measure). Hakala et al. [115] built on top of the TEES system and reranked its output by enriched graph-based features, including paths connecting nested events and occurrence of gene-protein pairs in general subgraphs mined from external PubMed abstracts and the PubMed Central full-text corpus. In addition, they applied event type generalization to augment graph-based features to combat feature sparsity. The system by Hakala et al. placed first in 2013 GE (0.5097 f-measure), whereas the TEES system placed second (0.5074 f-measure). The strong performance of both systems highlights the importance of exploring graph-based features. The performance increase associated with enriched graph-based features suggests directions for improvement. Miwa et al. [131,155] built the EventMine system that can extract not only biomedical events but also their negations and uncertainty statements. For event extraction, they used the Enju parser [156] and the GENIA Dependency parser (GDep) [157] to generate path features along with dictionary based features (e.g., UMLS Specialist lexicon [158] and Wordnet [159]). Their entry in BioNLP ST 2013 placed first in the PC task. In particular, their path features include not only paths between event arguments but also paths between event argument and non-argument named entities. The enriched paths linking non-argument entities likely account for the strong performance by providing more local context features. 
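The shortest-path features discussed above can be illustrated with a minimal sketch. It is not the implementation of TEES or EventMine: the dependency edges below are written by hand for the example sentence r3 rather than produced by a parser, the edge labels are only indicative of Stanford Dependency style, and the function names `shortest_path` and `path_ngrams` are hypothetical. Tokens are used directly as node identifiers, which assumes that no word repeats in the sentence.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over an undirected view of a dependency graph.

    `edges` is a list of (head, dependent, label) triples. Returns the list
    of nodes on a shortest path from `source` to `target`, or None.
    """
    neighbors = {}
    for head, dependent, _label in edges:
        neighbors.setdefault(head, []).append(dependent)
        neighbors.setdefault(dependent, []).append(head)

    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


def path_ngrams(path, n):
    """Consecutive n-grams of nodes along a path, usable as sparse features."""
    return [tuple(path[i:i + n]) for i in range(len(path) - n + 1)]


# Hand-written (hypothetical) dependency edges for
# "the patient was put on Imatinib regimen": (head, dependent, label).
EDGES = [
    ("put", "patient", "nsubjpass"),
    ("put", "was", "auxpass"),
    ("patient", "the", "det"),
    ("put", "regimen", "prep_on"),
    ("regimen", "Imatinib", "nn"),
]

path = shortest_path(EDGES, "patient", "regimen")
bigrams = path_ngrams(path, 2)
print(path)     # ['patient', 'put', 'regimen']
print(bigrams)  # [('patient', 'put'), ('put', 'regimen')]
```

In this toy graph the shortest path between the two arguments passes through the trigger "put", so the path n-grams capture the trigger together with both arguments. A real system would typically also include the dependency labels and their directions in the n-grams.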
Another vein of work proposed joint models for event extraction in which event triggers and arguments for all events in the same sentence are predicted jointly. McClosky et al. [124,129] integrated event extraction into the overall dependency parsing objective, and treated flat events and nested events similarly. For preprocessing, they applied the McClosky-Charniak-Johnson parser and converted the parsing results to Stanford Dependency. They converted the annotated event structures in the training data to event dependency graphs that take event arguments as nodes and argument slot names as edge labels. They mapped the event dependency graphs to Stanford Dependency graphs and generated graph-based features to train an extended MSTParser [160] for extracting event dependency graphs from test data. The graph-based features included paths between nodes in the Stanford Dependency graph, as well as subgraphs consisting of parents, children, and siblings of the nodes. McClosky et al. also included consistency features that impose domain-specific soft constraints on the compatibility of edges connecting event arguments. They also applied event type generalization to combat feature sparsity. They then converted the top-n extracted event dependency graphs back to event structures and re-ranked event structures to get the best one, using graph-based features similar to those in MSTParser training but extracted from event dependency graphs. Riedel et al. first applied Markov Logic Networks to learn relational structures for event extraction [161] and later switched to graph-based methods [123,162]. They projected events to labeled graphs, and scored candidate graphs using a function that captures constraints on event triggers and event arguments. The scoring function considers token features, dictionary features and dependency path features. Riedel et al. further used a stacking model to combine their system with the system by McClosky et al. [124,129].
The combined system obtained first place in 2011 GE task (0.56 f-measure) and 2011 ID task (0.556 f-measure).

Most of the remaining BioNLP systems that performed competitively also used graph-based features to various extents. Liu et al. developed an Exact Subgraph Matching (ESM) method [106], and later a more flexible Approximate Subgraph Matching (ASM) method, in order to mine basic and nested events [105,107]. They processed sentences with the McClosky-Charniak-Johnson parser and transformed the parsing results to dependency graphs while respecting edge directionality. They constructed the graph representation of an event by computing unions of dependency paths between event arguments. After that, Liu et al. applied exact or approximate subgraph matching to match sentence graphs to event graphs, based on a customized distance metric, which takes into account subgraph differences in graph structure, node labels (formed by the words covered by a node) and edge directionality. To improve the sensitivity of subgraph matching, Liu et al. used lemmatization to unify words [163]. This work falls along the lines of graph kernel based methods. As with many such methods, absorbing features into the calculation of similarity scores makes it difficult for supervised machine learning algorithms to directly weight/rank features. Kilicoglu et al. [117,118] also adopted the McClosky-Charniak-Johnson Parser/Stanford Dependency pipeline. They converted the dependency graphs to embedding graphs, where nodes themselves can be small dependency graphs, in order to apply postprocessing rules to traverse embedding graphs and extract nested events. However, their embedding graphs also lead to argument error propagation and thus hurt precision.

Besides the frequently used McClosky-Charniak-Johnson Parser/Stanford Dependency pipeline, there are a number of systems experimenting with different parsers and/or dependency representations. Hakenberg et al.
[119] applied BioLG [164], a Link Grammar Parser [165] extension, to obtain parse trees from sentences. They stored parse trees in a database and designed a query language to match subgraph patterns, which are manually generated from training data, against parse trees. Hakenberg et al. pointed out that generalization of event types would likely improve their results. Van Landeghem et al. [125] analyzed dependency graphs from the Stanford Parser [166], identified minimal event-containing subgraph patterns from training data and constructed extraction rules based on these patterns. Their post-processing rules handled overlapping triggers of different event types and events based on the same trigger, aiming for high precision at the expense of recall. The remaining systems generally used the dependency paths connecting the concept pairs as features for event extraction. For example, the dependencies were obtained through applications of different parsers including the Pro3Gres parser [167] (used by Kaljurand et al. [126]), the RASP parser [168] (used by Vlachos et al. [128]) or both McClosky-Charniak-Johnson parser and Enju parser (used by Quirk et al. [130], who combined the parsing results). However, most of these methods attained inferior performance compared to the best systems in the same shared tasks. We believe that there are at least two reasons: the McClosky-Charniak-Johnson parser with the self-trained biomedical parsing model is probably the most accurate parser in this domain; the enriched graph-based features and event type generalization as used by the top performing systems likely produced more useful features for event extraction.

2.3.1.2 Protein-protein interaction extraction and BioCreative shared tasks

BioCreative shared tasks focused on automatic named entity recognition on genes and proteins in biomedical text and on extraction of the interactions between these entities [29,30,169].
Among the participants of the protein-protein interaction task of BioCreative II [29], most systems used co-occurrence statistics, pattern templates and shallow linguistic features (e.g., context words and part-of-speech tags), with either statistical machine learning or rule-based systems. Some systems observed the need for capturing cross sentence mentions of interacting proteins. For example, Huang et al. [170] developed a profile based method that creates a vector representation for candidate protein pairs by aggregating features from multiple sentences in the document. The profile features included n-grams, manually constructed templates and relative positions of protein mentions. In BioCreative II.5, based on the top teams in the protein-protein interaction task, the organizers pointed out that the BioNLP techniques using deep parsing and dependency tree/graph mining were necessary to achieve significant results [30]. In particular, Hakenberg et al. [120] used a system similar to their BioNLP 2009 entry system [119]. They manually generated subgraph patterns from training data and matched them against parse trees. They achieved an f-measure of 0.30. Satre et al. [143] applied the Enju parser and the GDep parser and considered the dependency paths between concept pairs as features for relation extraction. They achieved an f-measure of 0.374. The protein-protein interaction tasks of BioCreative III consisted of detecting PPI related articles that provide evidence to specified PPIs, but did not include the actual extraction of PPIs, which is the focus of this review [169].
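Since the results in this review are reported as f-measures, the following minimal sketch shows how such a score is typically computed from extraction counts. It assumes the balanced setting (beta = 1) unless stated otherwise; the counts in the example are invented for illustration, and individual shared tasks may apply their own matching criteria when deciding what counts as a correct extraction.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-measure from extraction counts; beta=1 gives the balanced f-measure."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented counts: 30 correct relations, 20 spurious, 30 missed,
# i.e. precision 0.6 and recall 0.5.
score = f_measure(true_positives=30, false_positives=20, false_negatives=30)
print(round(score, 4))
```

Setting beta below 1 weights precision more heavily than recall, which corresponds to the precision-oriented evaluation favored by some application-driven systems.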
Several follow up studies to BioCreative II.5 concerned the usage of kernels in PPI extraction [150,171], and they categorized kernels into the following categories: 1) kernels not using deep parsing information, including shallow linguistic (SL) kernel [172]; 2) constituent parse tree based kernels, including subtree (ST) [173], subset tree (SST) [174] and partial tree (PT) [175] kernels that use increasingly generalized forms of subtrees, as well as a spectrum tree (SpT) [176] kernel that uses path structures from constituent parse trees; 3) dependency parse tree based kernels, including edit distance and cosine similarity kernels that are based on shortest paths [177], k-band shortest path spectrum (kBSPS) [150] that additionally allows k-band extension of shortest paths, all-path graph (APG) kernel [58] that further differently weights shortest paths and extension paths in similarity calculation, as well as Kim's kernels [178] that use various combinations of lexical, part-of-speech, and syntactic information along with the shortest path structures. The comparative studies and error analyses showed that: 1) dependency tree based kernels generally outperform constituent tree based kernels; 2) kernel method performances heavily depend on corpus-specific parameter optimization; 3) APG, kBSPS, and SL are top performing kernels; 4) ensembles based on dissimilar kernels can significantly improve performance; 5) non-kernel based methods (e.g., rule-based method, BayesNet) can perform on par with or better than all non-top kernel methods. From these observations, it is evident that richer dependency graph/tree structures (e.g., in APG, kBSPS) than shortest paths are important to better performance of graph/tree based kernels, which is consistent with the analysis of BioNLP participating systems.
Also, the limited advantage of kernel methods over non-kernel methods and the interpretation difficulty associated with kernel methods seem to suggest that a more fruitful direction may be investigating novel feature sets rather than novel kernel functions.

2.3.1.3 Drug-drug interaction extraction and DDIExtraction shared tasks

The two DDIExtraction challenges (organized in 2011 and 2013) aimed at automated extraction of drug-drug interactions (DDI) from biomedical texts [33,34]. The organizers of the two challenges recognized the extended delays in updating manually curated DDI databases. They observed that the medical literature and technical reports are the most effective sources for the detection of DDIs but contain an overwhelming amount of data. Thus DDIExtraction was motivated by the pressing need for accurate automated text mining approaches. The 2011 challenge focused on classifying whether there is any interaction between candidate drug pairs. The 2013 challenge, in addition, pursued the detailed classification task of categorizing DDIs into one of the four possible subtypes: advice (advice regarding the concomitant use of two drugs), effect (effect of DDI), mechanism (pharmacodynamic or pharmacokinetic mechanism of DDI) and int (general mention of interaction without further detail). For these two challenges, we review the top-performing teams. In the 2011 challenge, Thomas et al. [146] applied the McClosky-Charniak-Johnson parser and converted the parses to Stanford dependencies. They used voting to combine the following kernels to implicitly capture features for relation extraction: all-path graph (APG) [58], k-band shortest path spectrum (kBSPS) [150], and shallow linguistic (SL) [172] kernels. Their system achieved the best f-measure of 0.657. Chowdhury et al. [147,149] applied the Stanford parser to obtain dependency trees and experimented with both feature based methods and kernel based ensemble methods for relation extraction.
They experimented with SL [172], mildly extended dependency tree (MEDT) [151] (expanding shortest paths to also cover important verbs, modifiers or subjects) and path-encoded tree (PET) [179] (based on constituency tree) kernels. By combining feature-based and kernel-based methods, Chowdhury et al. achieved the second best result with an f-measure of 0.6398. In the 2013 challenge, Chowdhury et al. used their previous kernel method [147,149] but switched to the McClosky-Charniak-Johnson parser and converted the parses to Stanford dependencies [148]. They attained an f-measure of 0.80 for general classification and 0.65 for detailed classification and placed first in the 2013 challenge. Thomas et al. [180] followed a two-step approach to first detect general DDIs and then classify detected DDIs into subtypes. For the general DDI task, they used voting to combine kernels including APG [58], subtree (ST) [173], subset tree (SST) [174], spectrum tree (SpT) [176] and SL [172] kernels. For the subtype classification step, they used TEES directly [113]. Their system performed second best with an f-measure of 0.76 for general classification and 0.609 in detailed classification. It is interesting to see that adoption of systems originally developed for PPI extraction or event extraction has led to top performances in the DDI task. This further corroborates that these tasks are closely related, and technical solutions for one are generalizable to others.

2.3.1.4 Pharmacogenomics

In the field of pharmacogenomics, continuous efforts from multiple research teams have centered on the utilization of literature and clinical text in order to mine interesting relations between genetic mutations and drug response phenotypes.
Although it is difficult to compare their performances due to the fact that the experiments are not on shared corpora, these approaches do illuminate the translational application and adaptation of some state-of-the-art biomedical relation extraction techniques to problems directly asked by clinicians and pharmacologists.

Some systems used path-based approaches. Coulet et al. [78] aimed at extracting binary relations between genes, drugs and phenotypes in order to build semantic networks for pharmacogenomics. They first converted the Stanford Parser output on sentences (from collected PubMed abstracts) into dependency graphs. They tracked the paths starting from named entities and ending at a verb, and merged paths ending with the same verb to form binary relations. Coulet et al. further explored frequency information to retain recurrent relations. They also performed normalization on both the collected entities and relation types (verbs). Percha et al. [80] extended this approach to use breadth-first search to yield the shortest path between two named entities in the dependency graph in order to generate features for relation extraction. Wang et al. [137] used Latent Dirichlet Allocation (LDA) to create a semantic representation of biomedical named entities and used Kullback-Leibler (KL) divergence to calculate the association distance between pairs of entities in the Chem2Bio2RDF [138] semantic network. They ranked candidate associations between named entity pairs based on the summation of distances along the path connecting the pairs in the semantic network.

Other systems used tree-based approaches. Katrenko et al. [142] studied gene-disease relation extraction and included as features the subtrees rooted at the lowest common ancestors of two named entities in the dependency parse trees. Their experiment used several parsers including the Link Grammar Parser [165], Minipar [181] and the Charniak Parser [182].
Compared with using each individual parser's results separately, they reported improved performance from adopting ensemble methods (stacking and AdaBoost) and combining multiple parsers' results [183]. Hakenberg et al. [133] relied on co-occurrence for extraction of certain relations (e.g., gene-drug, gene-disease and drug-disease), but augmented co-occurrence with subtrees from the Stanford Parser output for other types of relations. In particular, their subtrees are rooted at the lowest common ancestors of named entity pairs in the binary relations considered. Bui et al. [139] aimed to extract causal relations on HIV drug resistance from the literature. They used the Stanford Parser to generate constituent parse trees for sentences and developed grammatical rules that traverse the tree structures in order to extract drug-gene relations.

Both path-based and tree-based systems in pharmacogenomics tend to focus on precision over recall in their evaluation, differing from the balanced f-measure used in multiple shared tasks. This likely stems from their specific goals of harvesting reliable relations to build and grow pharmacogenomics semantic networks. Too much noise will likely cloud the initial semantic network, while missing relations still have a chance to be later discovered with growing literature. In fact, reported precisions for pharmacogenomics relation extraction systems typically range from 70% to over 80%. In addition, these systems often check extracted relations against curated databases such as PharmGKB. We believe that these systems can further benefit from adopting parsers trained with biomedical models and using enriched graph-based features, two of the most recent lessons learned in shared tasks.

2.3.2 Relation extraction from clinical narrative text

In the medical informatics community, relation extraction has also been extensively studied in the form of shared tasks and separately motivated research.
For example, significant advances in extracting semantic relations from narrative text in Electronic Medical Records (EMR) have been documented in the 2010 i2b2/VA challenge (i2b2 - Informatics for Integrating Biology to the Bedside, VA - Veterans Affairs) [2].

2.3.2.1 i2b2/VA challenge

The challenge focused on three aspects of semantic relation extraction (i.e., concept extraction, assertion classification, and relation classification) and attracted international teams to address these shared tasks [2]. Concept extraction can be considered the basic task, as assertions and relations all refer to the extracted concepts. As the challenge allows subsequent tasks (e.g., relation classification) to use the ground truth of preceding tasks (e.g., extracted concepts), the performance metrics for the relation classification task should be interpreted as an upper bound for the end-to-end relation extraction task (same as the challenges from BioNLP, BioCreative and DDIExtraction). In this section, we review only the systems in the relation classification task, where the target relations are limited to predefined relations among medical problems, tests, and treatments. There are eight relations including treatment improves / worsens / causes / is administered for / is not administered because of medical problem, test reveals / conducted to investigate medical problem, and medical problem indicates medical problem. As we did in reviewing the above challenges, we review only those systems that represented sentences as graphs and explored such graphs during the feature generation step. Roberts et al. [101] classified the semantic relations using a rather comprehensive set of features: context features (e.g.
n-grams, GENIA part-of-speech tags surrounding medical concepts), nested relation features (relations in the text span between candidate pairs of concepts), single concept features (e.g., words and concept type of medical concept), Wikipedia features (e.g., concepts matching Wikipedia titles), concept vicinity features (concept bi-grams around relation argument concepts) and similarity features. The latter were computed using edit distance on language constructs including GENIA phrase chunks and Stanford Dependency shortest paths. Their system reached the highest f-measure on relation classification (0.737). de Bruijn et al. [102] applied a maximum entropy classifier with down sampling applied to balance the relation distribution. In addition to features from the concept extraction task, they applied the McClosky-Charniak-Johnson parser, converted the parsing results into Stanford dependencies, and included as features the labels in the minimal trees that cover the concept pairs. They used word clusters as features to address the problem of unseen words. Their system reached an f-measure of 0.731, the second best among relation classification participants. Solt et al. [104] extracted concepts by identifying head terms from dictionary look up and extending concept spans by rules. For relation classification, they experimented with several parsers including the Stanford Parser, the McClosky-Charniak-Johnson Parser and the Enju Parser. They used the resulting dependency graphs with two graph kernels including the all paths graph (APG) kernel [58] and k-band shortest path spectrum (kBSPS) [150], which produced only moderate performance. This likely reflects the difficulty in tuning the graph/tree kernel based systems, consistent with the observations from the experience in relation/event extraction from the scientific literature.
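The edit-distance similarity features mentioned above for Roberts et al. can be illustrated with a small sketch. It is not their implementation: it assumes that each dependency shortest path has already been flattened into a sequence of labels and words, the three example paths are invented, and the normalization in the hypothetical helper `path_similarity` is one simple choice among several.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of path elements."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # delete x
                               current[j - 1] + 1,       # insert y
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def path_similarity(a, b):
    """Map the edit distance to [0, 1]; identical paths score 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Invented flattened dependency paths between two concept mentions.
PATH_A = ["nsubj", "improves", "dobj"]
PATH_B = ["nsubj", "worsens", "dobj"]
PATH_C = ["nsubjpass", "administered", "prep_for", "nn"]

print(edit_distance(PATH_A, PATH_B))    # 1
print(edit_distance(PATH_A, PATH_C))    # 4
print(path_similarity(PATH_A, PATH_C))  # 0.0
```

Paths that differ only in the connecting verb stay close to each other, while structurally different paths are pushed apart. A classifier could use such scores against labeled training paths as additional features.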
2.3.2.2 Separately motivated clinical relation extraction

After the i2b2 challenges, several authors aimed at combining the concept extraction and relation extraction steps into an integral pipeline and/or generalizing to the extraction of complex or even nested relations. Xu et al. [103] developed a rule-based system MedEx to extract medications and specific relations between medications and their associated strengths, routes and frequencies. The MedEx system converts narrative sentences in clinical notes into conceptual graph representations of medication relations. To do so, Xu et al. designed a semantic grammar directly mappable to conceptual graphs and applied a Chart Parser by Kay [184] to parse sentences according to this grammar. They also used a regular expression based chunker to capture medications missed by the Kay Chart Parser. Weng et al. [75] applied a customized syntactic parser on text specifying clinical eligibility criteria. They mined maximal frequent subtree patterns and manually aggregated and enriched them with the Unified Medical Language System (UMLS) to form a semantic representation for eligibility criteria, which aims to enable semantically meaningful search queries over ClinicalTrials.gov. Luo et al. [99] extracted syntactic path features from the Link Grammar Parser generated dependencies from PubMed abstracts. The syntactic paths are included as features in clustering relations between noun phrase pairs.

2.3.3 Shared resources for relation extraction

The shared tasks and separately motivated research on biomedical relation extraction have not only advanced the state-of-the-art in methodology, but also created and/or demonstrated the utilization of a repository of shared resources that range from knowledge bases to shared corpora to graph mining toolkits. We categorize and summarize those resources in Table 2-3.
Utility Category    Data Sources
Terminology & Ontology    GO [185], UMLS [186], MeSH [187], HUGO [188], Wordnet [159], Verbnet [189], Biothesaurus [110]
Graph Miner    Gaston [100], Mofa [190], GSpan [191], FFSM [192], Graph Spider [193]
Tree/Graph Kernel    subtree (ST) kernel [173], subset tree (SST) kernel [174], partial tree (PT) kernel [175], spectrum tree (SpT) kernel [176], mildly extended dependency tree (MEDT) kernel [151], all-path graph (APG) kernel [58], k-band shortest path spectrum (kBSPS) kernel [150], path-encoded tree (PET) kernel [179]
Dependency Parsers    Enju Parser [156], GDep Parser [157], Stanford Parser [194], McCCJ Parser [153,154], RASP Parser [168], Bikel Parser [195], BioLG Parser [164], Pro3Gres Parser [167], Kay Parser [184], C&C [196]
Shared Corpora    BioNLP-09 event corpus [31], BioNLP-11 event corpus [28], BioNLP-13 event corpus [32], BioCreative II relation corpus [197], BioCreative II.5 relation corpus [30], DDIExtraction relation corpora [33,34], i2b2/VA corpus [2], AIMed [198], BioInfer [199], HPRD50 [200], IEPA [201], LLL [202], Uniprot corpus [203]

Table 2-3 Shared resources for relation extraction. The resources are organized by their utility category. Abbreviations used include: Gene Ontology (GO), Unified Medical Language System (UMLS), Medical Subject Heading (MeSH), Human Protein Reference Database (HPRD).

2.4 Limitations of Existing Work

Although notable progress has taken place in applying graph based algorithms to improve the extraction of biomedical relations, barriers still exist to enabling practical relation extraction methods that are both generalizable and sufficiently accurate. Below we discuss a few such barriers and promising directions.

2.4.1 Not all parsers and dependency encodings are synergistic

It has been pointed out repeatedly that the choice of the parser and dependency encodings may play an important role in a relation extraction system's performance. Buyko et al.
[204] performed comparative analysis on the impact of graph encoding based on different parsers (Charniak-Johnson [153], McClosky-Charniak-Johnson, Bikel [195], GDep, MST [160], MALT [205]) and dependency representations (Stanford Dependency and CoNLL dependency) and found that the CoNLL dependency representation performs better in combination with four parsers than the Stanford Dependency representation; and McClosky-Charniak-Johnson parser frequently places as the best performing parser. Miwa et al. [206] compared five syntactic parsers for BioNLP-ST 2009. They concluded that although performances from using individual parsers (GDep, C&C [196], McClosky-Charniak-Johnson, Bikel, Enju) do not differ much, using an ensemble of parsers and different dependency representations (Stanford Dependency, CoNLL, Predicate Argument Structure) can improve the event extraction results. As Stanford Dependency is the most widely used dependency encoding, they also compared the performance of using different Stanford Dependency variants and found that basic dependency performs best if keeping types of dependency edges. On the other hand, if ignoring types of dependency edges, they found that the collapsed dependency variant performs best, which corroborates the finding by Luo et al. [87]. In [87], the task is to extract relations as features without classification as opposed to supervised relation classification in the BioNLP-ST event extraction tasks. Thus recall is favored in the feature learning step, where ignoring types of dependencies helps to improve the coverage of subgraph patterns.

2.4.2 Integrating co-reference resolution

Co-reference occurs frequently in biomedical literature and clinical narrative text, arising from the use of pronouns, anaphora and varied terms for the same concepts. Care must be exercised to transfer the correct relation along the co-reference chain.
However, many of the reviewed approaches for named entity and event recognition did not have a built-in co-reference resolution component. Miwa et al. [207] specifically studied the impact of using a co-reference resolution system and showed improved event extraction performance. In particular, they developed a rule-based co-reference resolution system that consists of detection rules for mentions, antecedents and co-referential links, respectively. They used the co-reference information to modify syntactic parse results so that antecedent and mention share dependencies. Features were also extended between mentions and antecedents. However, those systems that integrate co-reference resolution limited the scope to co-references within the same sentence. Recognizing the importance of co-reference features, the organizers of BioNLP ST 2011 and 2013 integrated the co-reference annotations into the event annotations. Use of such annotations should be encouraged to develop and improve the co-reference component in event extraction systems and to gauge their performance. In the future, it is also worth investigating the impact of co-reference resolution across sentences.

2.4.3 General relation and event extraction and domain adaptation

The state-of-the-art relation and event extraction systems are all built around tasks with domain-specific definitions of relations and events, many of which are in fact binary (e.g., BioCreative PPI challenge [30], DDIExtraction challenge [33,34], and i2b2/VA challenge [2]). However, there is a gap between the technical advances and the demands from many real-world tasks, including building pharmacogenomics semantic networks [78], extracting clinical trial eligibility criteria [75] and representing immunophenotypic test results for automating lymphoma subtype classification [87,88].
In those tasks, general relation and event discovery is necessary, where the number of nodes is flexible and even the relation/event structure is not entirely predetermined. Another challenge brought by domain-specific relation/event definition concerns the training data. The problem of limited training data often plagues the development of NLP systems, with those on relation extraction being no exception. To take better advantage of existing annotated corpora, it is necessary to perform domain adaptation from external training corpora (source) to the target corpus. Miwa et al. [207] proposed to add source instances followed by instance reweighting when source and target match on events to be extracted. When source and target corpora have a partial match on events, they proposed to train each event extraction module separately on the source corpus and used its output as additional features for the corresponding modules on the target corpus. Miwa et al. [208] further improved methods of combining corpora by integrating heuristics to filter spurious negative examples. The heuristics target situations where instances not annotated in one corpus due to a different focus may be treated as negative instances in another corpus. Applying this method on learning from seven event annotated corpora, they showed improved performance on two tasks in BioNLP-ST 2011.

2.4.4 Redundancy in subgraph patterns

For automated subgraph pattern collection, such as using frequency as a cue, there is the problem of redundancy among collected subgraph patterns. Many smaller subgraphs are subisomorphic to other larger frequent subgraphs. Many of these larger subgraphs have the same frequencies as their subisomorphic smaller subgraphs. This arises because, when a larger subgraph is frequent, all its subgraphs automatically become frequent as well.
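The redundancy described above can be illustrated with a deliberately simplified sketch. It assumes that every pattern is a set of uniquely labeled edges, so that containment reduces to set inclusion and no subgraph isomorphism test is needed, and it assumes that the pattern collection is complete, i.e., every frequent sub-pattern of a collected pattern is itself collected. The function names, the toy graphs and the grouping by (frequency, size) are choices made for this illustration only and are not taken from a specific published algorithm.

```python
from collections import defaultdict


def support(pattern, graphs):
    """Number of graphs whose edge set contains every edge of the pattern."""
    return sum(1 for graph in graphs if pattern <= graph)


def prune_redundant(pattern_support):
    """Drop a pattern when a one-edge-larger super-pattern has equal support.

    Patterns are grouped by (support, size), so each pattern is only compared
    with candidates of the same support that are exactly one edge larger.
    """
    by_key = defaultdict(list)
    for pattern, count in pattern_support.items():
        by_key[(count, len(pattern))].append(pattern)

    kept = []
    for pattern, count in pattern_support.items():
        larger = by_key.get((count, len(pattern) + 1), [])
        if not any(pattern < candidate for candidate in larger):
            kept.append(pattern)
    return kept


# Toy sentence graphs, each a set of uniquely labeled edges (hypothetical).
GRAPHS = [
    frozenset({"a-b", "b-c", "c-d"}),
    frozenset({"a-b", "b-c"}),
    frozenset({"a-b", "b-c", "x-y"}),
    frozenset({"a-b"}),
]

CANDIDATES = [
    frozenset({"a-b"}),
    frozenset({"b-c"}),
    frozenset({"a-b", "b-c"}),
]

pattern_support = {p: support(p, GRAPHS) for p in CANDIDATES}
kept = prune_redundant(pattern_support)
print(sorted(sorted(p) for p in kept))  # [['a-b'], ['a-b', 'b-c']]
```

Here the pattern {b-c} occurs in exactly the same three graphs as the larger pattern {a-b, b-c}, so it carries no additional information and is dropped, whereas {a-b} is kept because it is strictly more frequent than any pattern containing it. With real subgraph patterns the set inclusion test would have to be replaced by a subisomorphism test, which is why limiting the number of compared pairs matters.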
Furthermore, if a smaller subgraph g_s is so unique that it is not subisomorphic to any larger subgraph other than g_l, then the pair (g_s, g_l) shares an identical frequency. Therefore, one only needs to keep the larger subgraph in such pairs. Note that it is cost-prohibitive to perform a full pairwise check because the subisomorphism comparison between two subgraphs is already NP-complete [100], and a pairwise approach would require around a billion comparisons for a collection of several tens of thousands of subgraphs. An efficient algorithm is needed that reduces the number of subgraph pairs to compare by several orders of magnitude. The key idea is that one only needs to compare subgraphs whose sizes differ by one, and one can further partition the subgraphs so that only those within the same partition need to be compared. On the other hand, depending on the task, algorithms may be developed to collect subgraph patterns that explore the "novelty" of the subgraphs, such as using p-significance to assess how strange it is to see the subgraphs in the current corpus [209].

2.4.5 Integrating with NER

Most shared task participants were not evaluated on relation extraction from scratch. Rather, their systems were evaluated given the gold standard named entity annotations, which is true even for challenges that include an NER task, such as the i2b2/VA shared tasks. Thus their evaluation results are likely an upper bound on end-to-end system performance, the tuning of which is in fact a non-trivial task. Kabiljo et al. [210] evaluated several methods for relation extraction, including a keyword-based method, a co-occurrence-based method, and a method using dependency graph-based patterns. They noted that in general a significant performance drop occurs when using named entities tagged by an NER system such as BANNER [136] instead of the gold standard.
In addition, it is useful but challenging to filter out named entity tuples (including pairs) that do not have relations explicitly stated in the text [99]. Such filtering may adopt a hybrid approach that relies on both automatically checking semantic type compatibility and manually sifting through the remaining tuples. However, as the number of non-related tuples often dominates that of related tuples, better automated filtering is necessary and remains an open question.

Chapter 3. General Relation Extraction by Frequent Subgraph Mining Applied to Automatic Lymphoma Classification

In this chapter³ and the next, we address some of the limitations of state-of-the-art relation/event extraction approaches, including general relation extraction, redundancy elimination, NER integration, concept unification, and parser augmentation. We use subgraph mining (the focus of Chapter 3) and factorization algorithms (the focus of Chapter 4) to develop a general framework for extracting relations from clinical narrative text and for exploring their correlations. To test our proposed framework on a concrete real-world medical problem, we investigate automated lymphoma subtype categorization based on pathology report narrative text. The differential diagnosis of lymphoid malignancies has long been a difficult task and a source of debate for pathologists and clinicians [211-214]. To standardize knowledge into a widely accepted guideline, the World Health Organization (WHO) published a consensus lymphoma classification in 2001 [215], which was revised in 2008 [216]. Even with the full spectrum of clinical and genetic features used in this guideline, uncertainty persists in pathologists' daily practice [217,218]. Since its original publication, several case series and reviews of lymphoma have suggested refinements to the current classification scheme and additional lymphoma subtypes [219-223].
Facing this ongoing need for periodic revision, the current approach to revising the WHO classification presents several challenges. First, the review process took more than one year and involved an eight-member steering committee and over 130 pathologists and hematologists worldwide [216]; hence it is a time-consuming and labor-intensive task. Moreover, the cases covered for consideration of revisions are subject to selection bias from different studies. These challenges motivated us to build an interpretable lymphoma classification model to automate the case review process in a systematic way. Many medical natural language processing (NLP) systems aim to extract medical problems from text to identify patient cohorts for clinical studies (e.g., [25,26,224-227]). They rely heavily on mentions and synonyms of the targeted problems. In contrast, we exclude all mentions and synonyms of lymphomas. The aim is to prevent oracles from telling the system the true lymphoma type and to mimic the differential diagnosis with the pathology reports as proxies for related labs and tests. The automatically built diagnostic models are intended to assist with expert review; thus it is necessary not only to achieve high accuracy, but also to retain interpretable features.

³ This chapter was published as a research article in Journal of the American Medical Informatics Association [1].

3.1 Background

As described in Chapter 2, some of the advances in state-of-the-art specialized clinical NLP systems for identifying medical problems have been documented in challenge workshops such as the yearly i2b2 (Informatics for Integrating Biology to the Bedside) Workshops. The first such challenge focused in part on identifying the smoking status of patients [225]. Features used by the successful teams included mentioned medical entities, n-grams (up to trigrams), part-of-speech (POS) tags, and task-specific regular expressions, dictionaries and assertion classification rules.
Feature engineering details contributed significantly to the best performing systems [228-230]. In a later challenge, recognizing obesity and its 15 comorbidities [227], the top four systems employed heavier feature engineering on hand-crafted rules that integrated "disease-specific, non-preventive medications and their brand names" [231], disease-related procedures [232], and disease-specific symptoms [233,234]. However, task-specific rules and regular expressions to capture medical concepts and relations are usually subdomain specific and hard to generalize. In contrast, standard linguistic features such as n-grams are easy to generalize but difficult to interpret, because the selected n-grams may not be meaningful. General clinical NLP systems such as cTAKES [25] and MetaMap [26] can extract negated [27] medical concepts. Besides negations, they specify few additional relations. Other systems apply hand-crafted rules to extract pre-specified semantic relations, such as MedLEE [97], MedEx [103] and SemRep [98], or require supervised learning on pre-specified semantic relations [235], and thus are hard to adapt to new subdomains. The value of syntactic parsing in concept and relation extraction has also been explored, such as phrase chunking in cTAKES [25], shallow parsing with the Stanford Parser [166], short syntactic link chain extraction [236], and Treebank building such as in the MiPACQ corpus [237]. Our work features unsupervised extraction of relations among a flexible number of medical concepts, which produces features that both improve performance over baselines and are more interpretable.

3.2 Task Definition

Pathology reports typically record four general categories of patient information: clinical presentation, morphology, immunophenotype and cytogenetics. Our corpus is rich in narrative sentences that specify complex relations among medical concepts. We accordingly design a sentence subgraph mining framework that is suitable for capturing such relations.
Using the features generated from this framework, we performed the following tasks:

1. We tested the hypothesis that an automated lymphoma classifier with sentence subgraph features can outperform the baseline classifier with standard n-gram features.
2. We tested the hypothesis that sentence subgraph features can outperform the baselines with full or filtered medical concept features extracted by the latest MetaMap.
3. We showed that sentence subgraph features are amenable to interpretation and provide insights into the diagnosis of lymphoma.

To prevent classifiers from using the explicit mentions and synonyms of the lymphoma types, we exclude phrases overlapping with a Medical Subject Heading (MeSH) [187] of "lymphoma" or "leukemia". We also exclude phrases that match a set of manually constructed patterns aiming to catch abbreviations and synonyms of the target lymphomas that may be missed by MeSH, as shown in Table 3-1.

Regular Expressions
"(?i)(burkit|burket)"
"(?i)\bBL\b"
"(?i)\bDLBCL\b"
"(?is)(follicular|follicle).*(type|origin)" // e.g., "low grade lymphoma, follicle center cell type"
"(?i)\bFL\b"
"(?i)\b(nlphl|nlphd|hl|hd)\b"
"(?i)\bNHL\b"
"(?i)hodgkin"
"(?i)lymphoma"
"(?i)leukemia"
"(?is)diffuse.*large.*b.*cell"
"(?i)T/HRBCL"
"(?is)(nodular\s+sclerosis|mixed\s+cellularity|lymphocyte-rich.*type|lymphocyte\s+predominant)" // e.g., "Hodgkin lymphoma, mixed cellularity type"

Table 3-1 Regular Expressions to Catch Lymphoma Mentions.

3.3 Data Collection

Our corpus consists of Massachusetts General Hospital (MGH) pathology reports residing in the Research Patient Data Registry (RPDR) [238] database. An MGH pathology report consists of standard and semi-standard sections as shown in Figure 3-1.
For this project, we focused on the following four lymphomas: diffuse large B-cell lymphoma (DLBCL; the most common lymphoma), Burkitt lymphoma (the most aggressive lymphoma), follicular lymphoma (the second most common lymphoma) and Hodgkin lymphoma (the most common lymphoma in young patients). We obtained our patient cases by having two MGH medical oncologists and one hematopathologist review pathology reports of patients diagnosed between 2000 and 2010, and collected 1038 cases whose written diagnosis (in the final diagnoses section) had one or more of the four lymphomas.

3.4 Methods

We first preprocess our corpus using sentence breaking, tokenization, and part-of-speech tagging, with customizations for medical corpora. We then perform a two-phase sentence parsing step, grouping token subsequences that match concept unique identifiers (CUIs) in the UMLS Metathesaurus [26] and merging each group into a single token before applying the Stanford Parser. The next section on corpus pre-processing gives more details.

3.4.1 Corpus pre-processing

We use two NLP packages to pre-process our corpus, OpenNLP [240] and the Stanford Parser [194]. We use the sentence breaker from OpenNLP, which applies a maximum entropy model, and apply rule-based post-processing customized to our corpus. After sentence breaking, we use a home-built rule-based tokenizer that recognizes domain-specific tokens such as "CD4+" or "TdT+" as one token. Following the approach of Huang et al. [166], we use the UMLS Specialist Lexicon (which contains lexical descriptions of over 1.1 million words) to build an extended lexicon by mapping UMLS style part-of-speech tags and linguistic features such as plural and present singular to Penn Treebank tags [241]. Unlike Huang et al. [166], we add the extended lexicon into OpenNLP's part-of-speech (POS) tagger dictionary. This is straightforward because the OpenNLP tagger enumerates possible tags only from its dictionary and then evaluates their likelihood.
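The tokenization rule mentioned above (keeping a marker name and its expression sign together) can be illustrated with a minimal sketch. The regular expression and the function name `tokenize` are assumptions made for illustration only; the actual rules of the home-built tokenizer are not given in the text.

```python
import re

# Hypothetical pattern (an assumption, not the thesis's rule set):
# a marker name immediately followed by one '+' or '-' sign, e.g. "CD4+",
# "TdT+" or "CD5-", is kept as a single token.
TOKEN_RE = re.compile(
    r"[A-Za-z][A-Za-z0-9]*[+-](?![A-Za-z0-9])"  # marker with expression sign
    r"|[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"          # plain or hyphenated word
    r"|[^\sA-Za-z0-9]"                          # any other single symbol
)

def tokenize(sentence):
    """Split a sentence into tokens, keeping signed markers intact."""
    return TOKEN_RE.findall(sentence)

print(tokenize("The blasts are CD4+, TdT+ and CD5-."))
```

The negative lookahead keeps hyphenated words such as "B-cell" from being split after the hyphen, so only a trailing sign is attached to the marker name.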
3.4.1.1 Matching token subsequences to UMLS concepts

To group token subsequences that correspond to medical terminology, we perform dictionary lookup against the UMLS Metathesaurus [26]. We investigate each of the n × (n − 1)/2 subsequences of tokens in a sentence and look them up in the UMLS Metathesaurus. For UMLS CUI matching, we experimented with the entire set or subsets of CUIs and chose the following approach, which balances coverage and accuracy on our data. If the token subsequence has only one CUI match, this CUI is used. If the token subsequence has multiple CUI matches, we select the one that is confirmed by the largest number of sources. If there is a tie, we prefer the CUI supported by SNOMED CT [242] if there is one, or flip a coin otherwise. We then perform a greedy search to find the longest token subsequences with a matching UMLS concept unique identifier (CUI). The heuristics employed to guide the greedy search include ignoring case in matching, eliminating subsequences that are fully contained in longer sequences, eliminating interpretations of single tokens that fall into function-word grammatical categories, and ignoring punctuation. After that, we look up multiple mapping tables in the UMLS Metathesaurus and obtain medical subject headings (MeSH) and semantic type unique identifiers (TUIs) from the CUIs.

CLINICAL DATA: 53-year-old with psoriasis, bilateral axillary lymphadenopathy, palpable on right for one month ? lymphoma.

Immunohistochemical stains show that the follicles, as well as some extrafollicular areas, contain Pax5+ B cells that co-express Bcl6 and Bcl2. Numerous scattered CD2+ T cells are present. Follicles are encompassed by CD21+ follicular dendritic cell (FDC) aggregates, with some loss of FDC staining in the larger follicles and among extrafollicular B cells. A stain for CD30 highlights occasional interfollicular immunoblasts. CD15 stains granulocytes. There is no lymphoid staining for cyclin D1 or ALK-1.
FLOW CYTOMETRY REPORT: Hematopoietic Cell Surface Markers
SPECIMEN: Tissue - Right Axillary Lymph Node Core Biopsy
RECEIVED: 3/12/10
DIFFERENTIAL COUNT: Lymphocytes: 93%; Monocytes: <1%; Granulocytes: <1%.
RESULTS
LIGHT SCATTER GATE ANALYZED: Lymphocyte
ANTIGENS:
B CELL                      T/NK CELL       MYELOID/OTHER
CD19: 55%                   CD3: 42%        CD45: 84%
CD20: 55%                   CD3+4: 37%      CD14: <1%
surfaceCD19+KAPPA: 50%      CD3+8: 5%
surfaceCD19+LAMBDA: 6%      CD5: 34%
CD19/20+5: <1%              CD7: 39%
CD19/20+10: 42%             CD3-7+: 1%
CD19/20+23: 13%
CD19/20+43brt: <1%
INTERPRETATION: 1. CD19+, CD20bright+, CD10+, CD43-, CD5- B cells with monotypic expression of kappa light chain amid a polytypic background. 2. CD4+ and CD8+ T cells.

KARYOTYPE: 46,XX,t(6;12)(q2?6;q2?1),t(14;18)(q32;q21)[cp7]/47,XX,+X[3]
BANDING: GTG SCORED: 0 ANALYZED: 10 METAPHASES COUNTED: 10
INTERPRETATION: Seven of 10 metaphases contained a translocation of chromosomes 14 and 18. This translocation is associated with an IGH-BCL2 rearrangement, and is a characteristic finding in B-cell non-Hodgkin's lymphomas of follicular center cell origin.

Figure 3-1 MGH pathology reports usually contain four sections with almost all information retained as narrative text. Clinical data, the first section, includes patient age, past medical history, and ongoing treatment procedures, etc. The second section, morphology and immunohistochemistry, describes cellular structural alterations appearing under a light microscope aided by a variety of dyes, some of which are conjugated to cell-specific antibodies. The third section is on flow cytometry, which describes the characteristic expression of various surface antigens on cells. The individual or combined percentages of antigens (e.g., CD20, CD5 and CD10) are reported. Also reported are pathologists' interpretations, which characterize these numbers (e.g., +: positive or -: negative) relative to reference values.
The fourth section is on cytogenetics, which records the presence of chromosomal aberrations such as translocations, insertions and deletions, in the form of a "karyotype" using a standardized nomenclature [239] that is not NLP friendly. However, the accompanying "interpretation" section describes these aberrations in narrative text. Dates, ages, etc. are replaced with realistic surrogates for de-identification.

3.4.1.2 Two-phase sentence parsing

The medical language used in pathology reports is challenging for general domain parsers. Consider the example sentence: "In situ hybridization for kappa and lambda immunoglobulin light chains show the plasma cells to be polytypic." Figure 3-2 shows the parse by the Stanford Parser, in which the term "in situ hybridization" is broken and erroneous dependencies such as amod(hybridization-3, situ-2) and prep_in(show-11, hybridization-3) are generated.

Parse:
(ROOT (S (PP (IN In) (NP (NP (JJ situ) (NN hybridization)) (PP (IN for) (NP (NN kappa) (CC and) (NN lambda) (NN immunoglobulin))))) (NP (JJ light) (NNS chains)) (VP (VBP show) (S (NP (DT the) (NN plasma) (NNS cells)) (VP (TO to) (VP (VB be) (ADJP (JJ polytypic)))))) (. .)))

Typed dependencies, collapsed:
amod(hybridization-3, situ-2)
prep_in(show-11, hybridization-3)
nn(immunoglobulin-8, kappa-5)
conj_and(kappa-5, lambda-7)
nn(immunoglobulin-8, lambda-7)
prep_for(hybridization-3, immunoglobulin-8)
amod(chains-10, light-9)
nsubj(show-11, chains-10)
root(ROOT-0, show-11)
det(cells-14, the-12)
nn(cells-14, plasma-13)
nsubj(polytypic-17, cells-14)
aux(polytypic-17, to-15)
cop(polytypic-17, be-16)
xcomp(show-11, polytypic-17)

Figure 3-2 Example sentence parsed directly by the Stanford Parser.

When informed that "in situ hybridization" is one phrase, the parser not only corrects the error with "in situ hybridization", but also respects the long phrase "kappa and lambda immunoglobulin light chains", as shown in Figure 3-3.
We therefore parse sentences in two steps: 1) we identify and group together the non-determiner tokens that match concept unique identifiers (CUIs) in the UMLS Metathesaurus [186]; 2) we then apply the Stanford Parser, treating each group of tokens as one token. We only group token subsequences whose last token is a noun. Finally, we assign POS tags to grouped token subsequences by using the POS tags of their last tokens, obtained from a separate run of the POS tagger on the original sentence.

Parse:
(ROOT (S (NP (NP (NNP In-situ-hybridization)) (PP (IN for) (NP (NN kappa) (CC and) (NN lambda) (NN immunoglobulin) (JJ light) (NNS chains)))) (VP (VBP show) (S (NP (DT the) (NN plasma) (NNS cells)) (VP (TO to) (VP (VB be) (ADJP (JJ polytypic)))))) (. .)))

Typed dependencies, collapsed:
nsubj(show-9, In-situ-hybridization-1)
nn(chains-8, kappa-3)
conj_and(kappa-3, lambda-5)
nn(chains-8, lambda-5)
nn(chains-8, immunoglobulin-6)
amod(chains-8, light-7)
prep_for(In-situ-hybridization-1, chains-8)
root(ROOT-0, show-9)
det(cells-12, the-10)
nn(cells-12, plasma-11)
nsubj(polytypic-15, cells-12)
aux(polytypic-15, to-13)
cop(polytypic-15, be-14)
xcomp(show-9, polytypic-15)

Figure 3-3 Two-phase sentence parsing on example.

3.4.1.3 Choosing CUI over TUI to group token subsequences

The relative usefulness of various dictionaries from the UMLS Metathesaurus has received mixed reports from the research community [243]. Earlier in our experiments, we relied on the UMLS semantic types to group token subsequences. The UMLS currently defines 133 semantic types that are indexed by TUIs. Our earlier approach followed a sequence of steps called zoom-in, mine and zoom-out. In the zoom-in step, in addition to grouping token subsequences using CUIs, we mapped each CUI to a corresponding TUI and identified the semantic types of the grouped token subsequences.
In the mining step, we treated token subsequences sharing a semantic type as identical nodes in the sentence graphs, and applied frequent subgraph mining. The rationale was to group concepts of the same semantic types together. This would lead to a coarser granularity of concepts, with the hope that the captured frequent subgraphs would cover more sentences. In the zoom-out step, we took the frequent subgraphs returned by the mining step, mapped them back to the sentences and replaced the TUI labels of their nodes with the corresponding CUIs extracted from those sentences. However, we later noticed that UMLS semantic categories in general provided too coarse a granularity for our application. For example, T cells, B cells, neutrophils, and megakaryocytes all mapped to the semantic type of "Cell" at the lowest level of the UMLS semantic types. Moreover, the UMLS semantic types sometimes led to inconsistencies with our domain knowledge. For example, if one includes all CUIs for "CD10" and maps them to semantic types, one gets the following semantic types: molecular function, enzyme, and gene or genome. However, pathologists see CD10 primarily as an important immunologic factor. In fact, this happens for multiple CD antigens, including CD79a (mapping to Amino Acid, Peptide, or Protein and Receptor), CD138 (mapping to Gene or Genome, Amino Acid, Peptide, or Protein and Biologically Active Substance), etc. Note that for CD138, strictly speaking, Biologically Active Substance is a semantic type subsuming immunologic factor. However, referring only to the semantic type hierarchy, this does not preclude the possibility that CD138 may belong to other subsumed semantic types such as Neuroreactive Substance or Biogenic Amine, Hormone, Enzyme, Vitamin, and Receptor. A third problem is that the UMLS semantic type hierarchy does not form a strict taxonomy.
For example, under the type chemical, the subtypes chemical viewed functionally and chemical viewed structurally largely overlap each other. This leads to the problem that even a single CUI of a chemical can have two semantic types. Due to the above problems, we saw much noise coming from using the UMLS semantic types as node labels for sentence graphs, which affected the discovery of frequent subgraphs and, in turn, classification performance. We tried multiple heuristics to resolve such inconsistencies, for example, only looking at upper levels of the semantic hierarchy. However, this aggravated the coarse granularity problem and led to no obvious classification performance gain. We finally resorted to relying on the CUIs to label sentence graph nodes.

3.4.1.4 Parse post-processing

In order to increase the accuracy of the sentence graph representations, we perform post-processing on the Stanford dependency parsing results. The main observation is that lists of immunologic factors often pose parsing challenges, as in the sentence, "Most interstitial lymphocytes are CD3 positive T-cells with fewer CD20 and PAX5 positive B-cells". Even if all POS tags are correctly assigned, the parser still has difficulty in determining that "CD20" and "PAX5" are both connected to "positive". We observed the following list patterns that may interfere with the parsing process and implemented rule-based post-processing systems to systematically correct list-related errors. For each pattern, we give an example sentence along with its Stanford parsing results with and without pre-processing.

1. A list of nominal immunological factors:

Example sentence 1: "These large cells are positive for the B-cell markers CD20, OCT2, BOB1 and are also MUM1 and BCL6 positive." Figure 3-4 shows the raw Stanford parsing result. Figure 3-5 shows the parsing results after pre-processing on tokens and POS tags.
It is clear that pre-processing helps correct the POS tags for "MUM1" and "BCL6". However, dependencies involving "OCT2" and "BOB1" are incorrect, as highlighted in Figure 3-5.

Typed dependencies, collapsed:
det(cells-3, These-1)
amod(cells-3, large-2)
nsubj(positive-5, cells-3)
nsubj(MUM1-18, cells-3)
cop(positive-5, are-4)
root(ROOT-0, positive-5)
det(CD20-10, the-7)
amod(CD20-10, B-cell-8)
nn(CD20-10, markers-9)
prep_for(positive-5, CD20-10)
nn(BOB1-14, OCT2-12)
appos(CD20-10, BOB1-14)
cop(MUM1-18, are-16)
advmod(MUM1-18, also-17)
conj_and(positive-5, MUM1-18)
advmod(positive-21, BCL6-20)
conj_and(positive-5, positive-21)
conj_and(MUM1-18, positive-21)

Figure 3-4 Raw Stanford parsing result for example sentence 1.

Typed dependencies, collapsed:
det(cells-3, These-1)
amod(cells-3, large-2)
nsubj(positive-5, cells-3)
cop(positive-5, are-4)
root(ROOT-0, positive-5)
det(CD20-9, the-7)
nn(CD20-9, B-cell-markers-8)
prep_for(positive-5, CD20-9)
appos(CD20-9, OCT2-11)
appos(OCT2-11, BOB1-13)
cop(MUM1-17, are-15)
advmod(MUM1-17, also-16)
conj_and(positive-5, MUM1-17)
conj_and(positive-5, BCL6-19)
conj_and(MUM1-17, BCL6-19)
acomp(positive-5, positive-20)

Figure 3-5 Stanford parsing result after pre-processing for example sentence 1. The yellow highlights mark the erroneous parsing structures.

2. A list of adjective form immunological factors:
Example sentence 2: "Report of immunostains indicates the cells are CD79a+, CD20+, CD3-, CD5-, BCL6+, BCL2-, and CD10+ consistent with follicle center origin." Figure 3-6 shows the raw parsing result, in which many tokens, POS tags and dependencies are incorrect. Figure 3-7 shows the parsing result after pre-processing. Improvements in tokenization output and POS tags are seen, but dependency errors are still present, as highlighted.

Typed dependencies, collapsed:
nsubj(indicates-4, Report-1)
prep_of(Report-1, immunostains-3)
root(ROOT-0, indicates-4)
det(cells-6, the-5)
nsubj(+-9, cells-6)
cop(+-9, are-7)
nn(+-9, CD79a-8)
ccomp(indicates-4, +-9)
appos(+-9, CD20-11)
num(CD20-11, +-12)
num(CD20-11, CD3-14)
ccomp(indicates-4, CD5-17)
conj_and(+-9, CD5-17)
amod(+-21, BCL6-20)
dep(CD5-17, +-21)
appos(+-21, BCL2-23)
nn(+-28, CD10-27)
ccomp(indicates-4, +-28)
conj_and(+-9, +-28)
amod(+-28, consistent-29)
amod(origin-33, follicle-31)
nn(origin-33, center-32)
prep_with(consistent-29, origin-33)

Figure 3-6 Raw Stanford parsing result for example sentence 2.
Typed dependencies, collapsed:
nsubj(indicates-4, Report-1)
prep_of(Report-1, immunostains-3)
root(ROOT-0, indicates-4)
det(cells-6, the-5)
nsubj(CD79a+-8, cells-6)
cop(CD79a+-8, are-7)
ccomp(indicates-4, CD79a+-8)
dep(consistent-22, CD20+-10)
amod(CD79a+-8, CD3--12)
conj_and(consistent-22, CD3--12)
amod(CD79a+-8, CD5--14)
conj_and(consistent-22, CD5--14)
amod(CD79a+-8, BCL6+-16)
conj_and(consistent-22, BCL6+-16)
amod(CD79a+-8, BCL2--18)
conj_and(consistent-22, BCL2--18)
amod(CD79a+-8, CD10+-21)
conj_and(consistent-22, CD10+-21)
amod(CD79a+-8, consistent-22)
nn(origin-26, follicle-24)
nn(origin-26, center-25)
prep_with(CD79a+-8, origin-26)

Figure 3-7 Stanford parsing result after pre-processing for example sentence 2. The yellow highlights mark the erroneous parsing structures.

3. A list of nominal immunological factors modifying adjectives:

Example sentence 3: "Most interstitial lymphocytes are CD3 positive T-cells with fewer CD20 and PAX5 positive B-cells." Figure 3-8 shows the raw parsing result with POS tag errors such as for "CD3". Figure 3-9 shows the parsing result with pre-processing. Highlighted parts indicate the error in not recognizing that "B-cells" are "CD20" "positive".
Typed dependencies, collapsed:
nsubj(reveals-7, Immunohistochemistry-1)
det(core-6, the-3)
nn(core-6, bone-4)
nn(core-6, marrow-5)
prep_of(Immunohistochemistry-1, core-6)
root(ROOT-0, reveals-7)
mark(CD3-13, that-8)
advmod(lymphocytes-11, most-9)
amod(lymphocytes-11, interstitial-10)
nsubj(CD3-13, lymphocytes-11)
aux(CD3-13, are-12)
ccomp(reveals-7, CD3-13)
amod(T-cells-15, positive-14)
dobj(CD3-13, T-cells-15)
amod(CD20-18, fewer-17)
prep_with(CD3-13, CD20-18)
num(B-cells-22, PAX5-20)
amod(B-cells-22, positive-21)
prep_with(CD3-13, B-cells-22)
conj_and(CD20-18, B-cells-22)
det(latter-25, the-24)
nsubj(small-27, latter-25)
cop(small-27, are-26)
parataxis(reveals-7, small-27)
prep_in(small-27, size-29)

Figure 3-8 Raw Stanford parsing result for example sentence 3.

To correct the parsing errors introduced by the above list patterns, we perform the following steps. We first recognize the immunologic list patterns by checking the UMLS semantic types of parsing nodes and record those belonging to immunologic factors. The semantic types, along with their specific TUI numbers, that are considered as immunologic factors are shown in Table 3-2. Multiple semantic types are included because some cell surface markers may belong to one or more semantic types. For example, "CD2" belongs to "Amino Acid, Peptide, or Protein", "Immunologic Factor" and "Receptor"; "CD10" belongs to "Enzyme"; "CD138" belongs to "Amino Acid, Peptide, or Protein" and "Biologically Active Substance"; "BCL2" belongs to "Gene or Genome"; and "EBV" belongs to "Virus".
After recognizing such list patterns, we check the POS tags of immunologic factor parse nodes. If they are adjectives (pattern 2), we replace the whole list with a dummy adjective "atypical". If they are nouns and the list is followed by an adjective (pattern 3), we replace the whole list and the following adjective with a dummy adjective "atypical". If the list is not followed by an adjective (pattern 1), we replace the whole list with a dummy proper noun "ATG"⁴. We then parse those modified sentences using the Stanford Parser. Finally, we restore the immunologic factors from the original list. For pattern 2, we copy the dependencies of "atypical" to each immunologic factor adjective. For pattern 3, we copy the dependencies of "atypical" to the adjective following the list and connect each immunologic factor with that adjective. For pattern 1, we copy the dependencies of "ATG" to each immunologic factor.

⁴ We use a dummy proper noun so that it can fit in sentences with either singular form or plural form predicates.
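The replace-parse-restore procedure described above can be sketched as follows. This is an illustrative simplification, not the thesis implementation: the function names, the record layout, the use of plain token strings instead of indexed tokens, the predicate `is_marker` (standing in for the UMLS semantic type check), and the generic "dep" label used to connect each factor to the adjective in pattern 3 are all assumptions. The parser call itself is omitted.

```python
SEPARATORS = {",", "and"}

def _next_marker(tagged, j, is_marker):
    """Skip separators from position j; return the index of the marker
    that follows them, or None if the list has ended."""
    while j < len(tagged) and tagged[j][0] in SEPARATORS:
        j += 1
    if j < len(tagged) and is_marker(tagged[j][0]):
        return j
    return None

def collapse_marker_lists(tagged, is_marker):
    """Replace each immunologic factor list with a dummy token.

    tagged: list of (token, pos_tag) pairs for one sentence.
    Returns (new_tagged, records), one record per replaced list.
    """
    out, records, i = [], [], 0
    while i < len(tagged):
        if not is_marker(tagged[i][0]):
            out.append(tagged[i])
            i += 1
            continue
        members, j = [], i
        while j is not None:
            members.append(tagged[j][0])
            end = j + 1  # first position after the last marker seen
            j = _next_marker(tagged, end, is_marker)
        record = {"members": members, "adjective": None}
        if tagged[i][1].startswith("JJ"):            # pattern 2
            record["pattern"], dummy = 2, ("atypical", "JJ")
        elif end < len(tagged) and tagged[end][1].startswith("JJ"):
            record["pattern"], dummy = 3, ("atypical", "JJ")
            record["adjective"] = tagged[end][0]     # absorbed adjective
            end += 1
        else:                                        # pattern 1
            record["pattern"], dummy = 1, ("ATG", "NNP")
        out.append(dummy)
        records.append(record)
        i = end
    return out, records

def restore_dependencies(deps, record):
    """Copy the dummy token's dependencies back onto the original tokens.

    deps: list of (relation, governor, dependent) triples.
    """
    dummy = "atypical" if record["pattern"] in (2, 3) else "ATG"
    # Pattern 3 hands the dependencies to the adjective after the list;
    # patterns 1 and 2 copy them to every factor in the list.
    targets = [record["adjective"]] if record["pattern"] == 3 else record["members"]
    restored = []
    for rel, gov, dep in deps:
        if dummy not in (gov, dep):
            restored.append((rel, gov, dep))
            continue
        for t in targets:
            restored.append((rel, t if gov == dummy else gov,
                             t if dep == dummy else dep))
    if record["pattern"] == 3:
        # Connect each factor to the adjective ("dep" label is an assumption).
        restored.extend(("dep", record["adjective"], m) for m in record["members"])
    return restored
```

For the fragment "fewer CD20 and PAX5 positive B-cells", the sketch collapses "CD20 and PAX5 positive" into "atypical" (pattern 3); after parsing, a dependency such as amod(B-cells, atypical) would be rewritten to amod(B-cells, positive), and both CD20 and PAX5 would be linked to "positive".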
Typed dependencies, collapsed:
nsubj(reveals-6, Immunohistochemistry-1)
det(core-5, the-3)
nn(core-5, bone-marrow-4)
prep_of(Immunohistochemistry-1, core-5)
root(ROOT-0, reveals-6)
dobj(reveals-6, that-7)
amod(lymphocytes-10, most-8)
amod(lymphocytes-10, interstitial-9)
nsubj(T-cells-14, lymphocytes-10)
cop(T-cells-14, are-11)
nn(T-cells-14, CD3-12)
amod(T-cells-14, positive-13)
rcmod(that-7, T-cells-14)
amod(CD20-17, fewer-16)
prep_with(T-cells-14, CD20-17)
nn(B-cells-21, PAX5-19)
amod(B-cells-21, positive-20)
prep_with(T-cells-14, B-cells-21)
conj_and(CD20-17, B-cells-21)
det(latter-24, the-23)
nsubj(small-26, latter-24)
cop(small-26, are-25)
parataxis(reveals-6, small-26)
prep_in(small-26, size-28)

Figure 3-9 Stanford parsing result after pre-processing for example sentence 3. The yellow highlights mark the erroneous parsing structures.

TUIs    Semantic Types
T123    Biologically Active Substance
T129    Immunologic Factor
T192    Receptor
T116    Amino Acid, Peptide, or Protein
T126    Enzyme
T028    Gene or Genome
T005    Virus

Table 3-2 Semantic types considered as immunologic factors.

3.4.2 Intuition on relations among concepts

In a corpus of pathology reports focusing on a specific disease, certain relations among medical concepts occur frequently. For example, Figure 3-10 shows variations of immunohistochemistry interpretations, which describe "what kind of staining" (bold-outline blocks) is observed with regard to "antibodies" to "what type of antigens" (dash-outline blocks).
The relations among those concepts are what characterize the immunohistochemistry results. For example, in one pathology report, "B lineage antigens" associate with "staining of most large atypical cells", and "T lineage associated antigens" associate with "staining of most small cells". If we use only individual findings, it is difficult to exclude the other possibilities of association. For daily pathology practice, important relations are likely to be repeated in similar syntactic and semantic constructs. This motivated us to use a graph representation to capture concepts and relations expressed in a sentence, as well as to use frequent subgraph mining to identify important relations encoded by sentence subgraphs.

Figure 3-10 A variety of sentences frequently occurring in our corpus describe the relations among cells, staining, and antigens/antibodies. Dash-outline blocks indicate "what type of antigens"; bold-outline blocks indicate "what kind of staining".

3.4.3 Representing sentence dependency parses as graphs
In natural language, the syntactic structure of a statement often corresponds at least approximately to the ways in which the semantic parts may be combined to aggregate the meaning of the overall statement [152]. The two-phase sentence parsing (described above) produces the dependency linkage structure of a sentence. This translates conveniently to a graph representation of the relations, where the nodes are concepts and the edges are syntactic dependencies among the concepts.
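As a small illustration of this representation, the sketch below turns dependency triples into a labeled graph. The function name `build_sentence_graph`, the node ids, and the three example concepts are hypothetical choices for the example; they are not taken from the thesis implementation.

```python
def build_sentence_graph(concepts, dependencies):
    """Build a labeled directed graph from a dependency parse.

    concepts:     {node_id: concept label}
    dependencies: iterable of (relation, governor_id, dependent_id)
    Returns (nodes, edges) with edges as {(governor_id, dependent_id): relation}.
    """
    nodes = dict(concepts)
    edges = {}
    for relation, governor, dependent in dependencies:
        # Keep only dependencies whose two ends are both concept nodes.
        if governor in nodes and dependent in nodes:
            edges[(governor, dependent)] = relation
    return nodes, edges


# Illustrative input: node 0 is a token that was not mapped to a concept node.
concepts = {1: "atypical-cells", 2: "positive", 3: "CD30"}
dependencies = [("nsubj", 2, 1), ("prep_for", 2, 3), ("det", 1, 0)]
nodes, edges = build_sentence_graph(concepts, dependencies)
```

Storing edges as a dictionary keyed by node pairs keeps the edge label lookup constant-time, which is convenient for the label checks used in subgraph matching.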
We experimented with multiple parsers including the augmented Stanford Parser [194], the Link Grammar Parser [165,244] and the ClearParser [245]. We chose the Stanford Parser because it produced fewer systemic errors on our corpus. Figure 3-11 shows the graph representation for the example sentence "Immunostains show the large atypical cells are strongly positive for CD30 and negative for CD15, CD20, BOB1, OCT2 and CD3." Syntactic dependencies are denoted using line segments with labels (e.g., prep_for). For each parse node (round-corner rectangle), the text in parentheses includes the tokens in the original sentence, connected by hyphens (e.g., "atypical-cells"). The text above the parentheses displays the preferred name of the node's CUI (e.g., CD20_Antigens for C0054946). For determiners, we exclude common functional determiners such as "a", "an" and "the" but keep the semantically meaningful ones such as "no" and "all". The Stanford Parser supports various parsing modes. We chose the mode specifying "collapsed dependencies with propagation of conjunct dependencies" [246], which has the most compact graph translations. With this mode, cycles can arise in the dependency linkages, such as the cycle in the middle of Figure 3-11.

Figure 3-11 Constructing the sentence graph from the results of two-phase dependency parsing.

In order to increase the accuracy of the sentence graph representations, we perform post processing on the Stanford dependency parsing results by converting lists of immunologic factors to single tokens as described in section 3.4.1.4.

3.4.4 Frequent subgraph mining
Frequent subgraph mining is based on the notion of graph subisomorphism.
Intuitively, one graph is subisomorphic to another graph if it is part of the other. Formally, let $G_s = (V_s, E_s, l_s)$ and $G = (V, E, l)$ be two graphs, where $V$ ($V_s$) is the set of nodes, $E$ ($E_s$) is the set of edges and $l$ ($l_s$) is the labeling function for nodes and edges. For $G_s$ to be subisomorphic to $G$, there must exist a one-to-one mapping $f$ such that:

1) $f(V_s) = V_m \subseteq V$, such that $\forall v \in V_s$, $l_s(v) = l(f(v))$
2) $\forall v_1, v_2 \in V_s$, if $(v_1, v_2) \in E_s$, then $(f(v_1), f(v_2)) \in E$ and $l_s(v_1, v_2) = l(f(v_1), f(v_2))$

Condition 1 says that there exists a mapping from nodes in $G_s$ to a subset of nodes in $G$, such that corresponding nodes agree on their labels. Condition 2 says that each edge in $G_s$ should also have a counterpart in $G$ that shares the same label. Figure 3-12 shows two example subgraphs of the sentence graph in Figure 3-11.

Figure 3-12 Example subgraphs for the sentence graph in Figure 3-11.

We say that a subgraph occurs once in a corpus every time it is subisomorphic to a graph in that corpus. The frequency of a subgraph is the total number of its occurrences within the corpus. Frequent subgraph mining tries to identify those subgraphs whose frequencies are above a given threshold. Various graph encodings, enumeration strategies and search pruning policies have been proposed to improve the efficiency of the mining algorithms [247,248]. In this work, we use the open-source frequent subgraph miner Gaston [100], which has state-of-the-art speed.

3.4.5 Subgraph redundancy pruning
We ran Gaston on our training dataset containing 17,186 sentences, with a frequency threshold of 5, and obtained 180,863 frequent subgraphs.
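For intuition about what a miner such as Gaston computes, the subisomorphism test and the frequency count defined in Section 3.4.4 can be sketched by brute force. This is not Gaston's algorithm; it is an exponential-time illustration, and the graph encoding (`nodes` as `{id: label}`, `edges` as `{(u, v): label}`) and the example graphs are assumptions made for the example.

```python
from itertools import permutations


def is_subisomorphic(gs, g):
    """Return True if gs is subisomorphic to g (brute force over injective maps)."""
    s_nodes, s_edges = gs
    nodes, edges = g
    s_ids = list(s_nodes)
    # Each permutation of len(s_ids) distinct nodes of g is a one-to-one mapping f.
    for image in permutations(nodes, len(s_ids)):
        f = dict(zip(s_ids, image))
        # Condition 1: mapped nodes must agree on their labels.
        if any(s_nodes[v] != nodes[f[v]] for v in s_ids):
            continue
        # Condition 2: every edge of gs needs a counterpart in g with the same label.
        if all(edges.get((f[u], f[v])) == label for (u, v), label in s_edges.items()):
            return True
    return False


def frequency(sub, corpus):
    """Number of corpus graphs to which `sub` is subisomorphic."""
    return sum(is_subisomorphic(sub, g) for g in corpus)


sentence = (
    {1: "atypical-cells", 2: "positive", 3: "CD30", 4: "negative", 5: "CD15"},
    {(2, 1): "nsubj", (2, 3): "prep_for", (2, 4): "conj_and", (4, 5): "prep_for"},
)
positive_cd30 = ({"a": "positive", "b": "CD30"}, {("a", "b"): "prep_for"})
positive_cd15 = ({"a": "positive", "b": "CD15"}, {("a", "b"): "prep_for"})
```

Here `positive_cd30` matches the sentence graph, while `positive_cd15` does not, because CD15 is attached to "negative" rather than "positive".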
Analyzing these subgraphs, we found that many smaller subgraphs are subisomorphic to other larger frequent subgraphs. Many of these larger subgraphs have the same frequencies as their subisomorphic smaller subgraphs. This arises because when a larger subgraph is frequent, all of its subgraphs are also frequent. Furthermore, if the smaller subgraph is so unique that it is not subisomorphic to any other larger subgraph, then this pair of larger and smaller subgraphs shares identical frequency. Therefore, we only kept the larger subgraphs in such pairs. Note that it is cost prohibitive to perform a full pairwise check because the subisomorphism comparison between two subgraphs is already NP-complete [100], and a pairwise approach would require around 16 billion such comparisons for our dataset. We developed an efficient algorithm using hierarchical hash partitioning that reduces the number of subgraph pairs to compare by several orders of magnitude. The key idea is that we only need to compare subgraphs whose sizes differ by one, and we can further partition the subgraphs so that only those within the same partition need to be compared. After subgraph redundancy pruning, we are left with 9935 subgraphs.

In fact, let $G_s$ and $G$ be subgraphs with $G_s$ subisomorphic to $G$, $\#(G_s) = \#(G)$ and $|G_s| < |G| - 1$, where $\#(\cdot)$ denotes the frequency and $|\cdot|$ denotes the number of nodes in a graph. Then, given the subisomorphism between $G_s$ and $G$, one can construct a graph $G_1$ by simply adding to $G_s$ one additional node of $G$ (and its associated edges). It is clear that $\#(G) \le \#(G_1) \le \#(G_s)$, but because $\#(G) = \#(G_s)$, we have $\#(G) = \#(G_1) = \#(G_s)$. Thus we only need to check subisomorphism between $G_s$ and $G_1$, and between $G_1$ and $G$, where $G_1$ differs from $G_s$ in size by only one. Carrying on such construction, we therefore only need to check pairs of subgraphs whose sizes differ by one. Based on this observation, we can first order subgraphs in descending order according to their sizes.
Then it suffices to progress down the hierarchy, checking among subgraphs that are in two neighboring levels. To further reduce unnecessary subisomorphism comparisons, we make another observation: for a graph $G_s$ to be subisomorphic to $G$, the node labels of $G_s$ must be a subset of the node labels of $G$. Moreover, as we restrict ourselves to comparing only subgraphs from neighboring levels, we are able to adopt a hash partition scheme to avoid enumerating all possible pairs from neighboring levels. More precisely, a subgraph at level $n$ has $n$ nodes, so its subgraphs of size $n - 1$ have at most $n$ possible sets of labels. We can then construct a hash table and hash the level-$n$ subgraphs $n$ times using their $(n - 1)$-node label subsets as keys. We also hash subgraphs from level $n - 1$ using their node label sets (of size $n - 1$) as keys. We note that it is only necessary to check subgraph pairs in the same partition. Although an upper level subgraph is hashed multiple times into the hash table, hashing has both constant amortized update time and constant amortized lookup time. The time for multiple hashes is much less than the time for unnecessary subisomorphism comparisons. Moreover, in practice, the size of the subgraph is often small, and multiple hashes only multiply the total hash update and lookup times by a constant factor.

A summary of our algorithm is shown in Figure 3-13. Lines 1 and 2 sort the set of graphs so that they are first ordered (in descending order) by their number of nodes and then by their number of edges. This ensures that subisomorphism only needs to be checked by looking at graphs before the current one. Line 3 partitions graphs into levels according to their sizes while keeping the previously sorted order. Lines 5 to 29 progress down the hierarchy, performing the subisomorphism check when necessary. Lines 7 to 11 hash each upper level graph into possibly multiple buckets. Lines 12 to 15 partition lower level graphs into different hash buckets.
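The label-based hash partitioning described in this section can be sketched independently of the line numbering of Figure 3-13. The sketch below is an assumption-laden simplification: graphs are reduced to their node label lists, sorted label tuples stand in for label sets so that repeated labels are handled, the function name `candidate_pairs` is hypothetical, and only pairs from neighboring levels are produced (same-level pairs and the actual subisomorphism test are left out).

```python
from collections import defaultdict


def candidate_pairs(label_lists):
    """label_lists[i] holds the node labels of graph i.

    Returns sorted (smaller, larger) index pairs whose sizes differ by one and
    whose labels are compatible, i.e. the only pairs worth a full
    subisomorphism check between neighboring levels.
    """
    by_labels = defaultdict(list)  # label key -> graphs whose labels equal the key
    for i, labels in enumerate(label_lists):
        by_labels[tuple(sorted(labels))].append(i)

    pairs = set()
    for j, labels in enumerate(label_lists):
        ordered = sorted(labels)
        for k in range(len(ordered)):  # drop one node label at a time
            key = tuple(ordered[:k] + ordered[k + 1:])
            for i in by_labels.get(key, []):
                pairs.add((i, j))
    return sorted(pairs)


graphs = [
    ["positive", "CD30"],
    ["positive", "CD30", "atypical-cells"],
    ["negative", "CD15"],
    ["negative", "CD15", "CD20"],
]
pairs = candidate_pairs(graphs)
```

For these four toy graphs only two of the six possible pairs survive, namely graph 0 with graph 1 and graph 2 with graph 3; each larger graph is hashed once per dropped label, which mirrors the multiple hashing of upper level graphs described above.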
Lines 20 to 23 check subisomorphism within the same hash partition on the lower level. Lines 24 to 29 check subisomorphism between the corresponding lower level bucket and upper level buckets. In lines 22 and 28, we generalize from the condition requiring two subgraphs to have identical frequencies to a condition customizable by the user.

subisomorphism for set of graphs
input: S - set of graphs
effect: compute subisomorphism relation among graphs in S
 1  stable sort S in descending order of number of edges
 2  stable sort S in descending order of number of nodes
 3  levels[n] <- put graphs of size n into levels
 4  maxlevel = length(levels)
 5  for n = maxlevel downto 2
 6      h_upper = {}; h_lower = {}
 7      if n != maxlevel
 8          ulevel = levels[n+1]
 9          for i = 1 to length(ulevel)
10              foreach key : labels of each n-node subset of nodes of ulevel[i]
11                  add ulevel[i] into the list h_upper[key]
12      llevel = levels[n]
13      for i = 1 to length(llevel)
14          foreach key : set of labels of llevel[i]
15              add llevel[i] into the list h_lower[key]
16      foreach key in h_lower.keys()
17          g_lower = h_lower[key]
18          for i = 1 to length(g_lower)
19              gs = g_lower[i]
20              for j = 1 to i-1
21                  gb = g_lower[j]
22                  if condition = true
23                      subisomorphism(gs, gb)
24              if h_upper.haskey(key)
25                  g_upper = h_upper[key]
26                  for j = 1 to length(g_upper)
27                      gb = g_upper[j]
28                      if condition = true
29                          subisomorphism(gs, gb)
Figure 3-13 A hierarchical hash partition algorithm for determining subisomorphism relation among graphs in a set

3.4.6 Single node frequent subgraph collection
Gaston only collects frequent subgraphs having two or more nodes. Because our token subsequence grouping may group all tokens within a short sentence into one node if they are covered by one CUI, such nodes would be ignored by Gaston. We do not want to exclude the possibility that sometimes the presence of a meaningful medical concept in the text can be informative. We thus also collected single node subgraphs using the same frequency threshold of 5 as for multi-node frequent subgraphs, adding 1602 single node subgraphs (11537 total).

3.5 Experiments and Results
For each patient case, we use the written diagnosis (in the final diagnoses section of the pathology reports) as the ground truth label. A patient may have multiple lymphomas at the same time, or the diagnosis may be an intermediate case between multiple lymphomas. Given the relatively small numbers of multiple-hit/intermediate cases as shown in Table 3-3, we model the classification task as multiple binary classification problems, one for each lymphoma. For the ground truth, the positive cases for one lymphoma type also include the multiple-hit/intermediate cases involving this type. The negative cases of one lymphoma type include positive cases of the other three types, except for multiple-hit/intermediate cases involving this type. Our task resembles the differential diagnosis of four lymphomas, assuming that every patient in the selected population has at least one lymphoma. By splitting the dataset randomly into halves, stratified by type of lymphoma, we obtained a training set and a testing set, whose statistics are in Table 3-4.

Type                                          # Cases   Percent
Intermediate between Burkitt and DLBCL        18        1.7%
Intermediate between Burkitt and Follicular   2         0.2%
Double-hit of DLBCL and Follicular            42        4.0%
Intermediate between DLBCL and Hodgkin        7         0.7%
Table 3-3 Multiple-hit or intermediate lymphoma cases. Percentage is out of a total of 1038 cases.

              Full Corpus          Training Corpus      Testing Corpus
Lymphoma      N     P     P%       N     P     P%       N     P     P%
Burkitt       946   93    9.0%     500   55    9.9%     446   38    7.9%
DLBCL         383   656   63.2%    210   345   62.2%    173   311   64.4%
Follicular    811   228   22.0%    425   130   23.4%    386   98    20.3%
Hodgkin       908   131   12.6%    486   69    12.4%    422   62    12.8%
Table 3-4 Distribution of lymphoma cases in full corpus, training corpus and testing corpus. N - number of negative patients, P - number of positive patients, P% - percentage of positive patients.
We show these three statistics in the full corpus, in the training corpus and in the testing corpus. Note that in the full corpus, the number of positive cases does not add up to 1038 (the total number of patients) because there are patients with diagnoses for multiple/intermediate lymphomas.

In our experiments, we trained three baseline classifiers on different feature types. Baseline 1 uses negation-classified medical concepts extracted by the latest MetaMap [26]. Baseline 2 further filters the concepts in Baseline 1 based on UMLS semantic types that are reported in previous studies to have good performance for medical problem extraction [249,250]. In addition to previously used semantic categories of diseases and symptoms, we also included semantic types that fall under the hierarchy of "Chemical" and "Anatomical Structure", as our pathology reports largely concern the immunological factors and various types of lymphocytes. Baseline 3 uses the standard n-gram features [251], including unigrams, bigrams and trigrams, which have been reported as most useful for document classification [252]. We experimented with multiple machine learning algorithms including support vector machines (SVM), decision trees and Bayesian networks. We chose SVM for its better performance on our training data and its widely acknowledged generalizability. We experimented with polynomials up to degree five and radial basis functions as candidate kernels. We performed ten-fold cross validation on training data for parameter selection and evaluated the trained model on the held-out test dataset. Cross validation favored a linear kernel for all the settings in our experiment. Table 3-5 shows the evaluation results on the subgraph features for each of the four lymphoma categories in comparison with the three baselines. The evaluation metrics include standard precision, recall, f-measure and AUC (area under ROC curve).
Let TP denote the number of true positives in the contingency table, FP the number of false positives and FN the number of false negatives. Precision is defined as $P = TP/(TP + FP)$, recall as $R = TP/(TP + FN)$, and f-measure as $F = 2 \times P \times R/(P + R)$. It is clear that full MetaMap features outperform filtered MetaMap features. Thus we performed significance tests comparing the subgraph features with the full MetaMap features and with the n-gram features. We used the approximate randomization test [253] to assess whether two system outputs were significantly different from each other (p = 0.05), and the statistically significant changes in Table 3-5 are marked. We see improvements on precision, recall, and f-measure across all four lymphomas compared with either baseline. For Burkitt lymphoma, all improvements are significant. For DLBCL, the improvement in recall over n-grams is not significant. For follicular lymphoma, all improvements over n-grams are significant; the improvement in recall over MetaMap is significant. For Hodgkin lymphoma, all improvements are significant except for the recall compared with n-gram features. Overall, the sentence subgraph features significantly outperform all three baselines.
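The metric definitions and the idea behind an approximate randomization test can be sketched as follows. This is a generic textbook-style version, not the evaluation code used for Table 3-5: the function names, the number of trials, the add-one smoothing of the p-value and the random seed are all assumptions. The example counts (TP = 28, FP = 4, FN = 10 out of 38 positive test cases) are chosen only because they are consistent with the rounded Burkitt-P precision and recall reported for the subgraph features; the actual contingency table is not given in the text.

```python
import random


def prf(gold, pred):
    """Precision, recall and f-measure for binary labels (1 = positive)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


def approximate_randomization(gold, pred_a, pred_b, metric=2, trials=1000, seed=0):
    """Estimate a p-value for the difference between two systems on one metric.

    metric indexes the tuple returned by prf (0 = P, 1 = R, 2 = F). In each
    trial the two systems' outputs are swapped per case with probability 0.5.
    """
    rng = random.Random(seed)
    observed = abs(prf(gold, pred_a)[metric] - prf(gold, pred_b)[metric])
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(prf(gold, shuffled_a)[metric] - prf(gold, shuffled_b)[metric])
        if diff >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (trials + 1)


# 38 positive and 446 negative cases; predictions give TP = 28, FN = 10, FP = 4.
gold = [1] * 38 + [0] * 446
pred = [1] * 28 + [0] * 10 + [1] * 4 + [0] * 442
precision, recall, f_measure = prf(gold, pred)  # about 0.875, 0.737, 0.8
```

With these counts the f-measure equals 2TP/(2TP + FP + FN) = 56/70 = 0.8, which matches the value obtained from the precision and recall formulas.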
Lymphoma      Full MetaMap* (3112)       Filtered MetaMap (1600)    n-gram† (16326)            Sentence subgraph (11537)
Class         P     R     F     AUC      P     R     F     AUC      P     R     F     AUC      P        R        F        AUC
Burkitt-N     0.965 0.978 0.971 0.778    0.959 0.989 0.973 0.744    0.969 0.984 0.977 0.808    0.978    0.991    0.984    0.864
Burkitt-P     0.688 0.579 0.629 0.778    0.792 0.5   0.613 0.744    0.774 0.632 0.696 0.808    0.875*†  0.737*   0.8*†    0.864
DLBCL-N       0.703 0.634 0.667 0.743    0.714 0.523 0.604 0.704    0.829 0.703 0.761 0.812    0.87     0.779    0.822    0.857
DLBCL-P       0.808 0.852 0.829 0.743    0.77  0.884 0.823 0.704    0.849 0.92  0.883 0.812    0.884*†  0.936*   0.909*†  0.857
Follicular-N  0.933 0.974 0.953 0.849    0.939 0.953 0.946 0.854    0.932 0.958 0.945 0.841    0.952    0.971    0.961    0.889
Follicular-P  0.877 0.724 0.793 0.849    0.804 0.755 0.779 0.854    0.816 0.724 0.768 0.841    0.878†   0.806*†  0.84†    0.889
Hodgkin-N     0.963 0.995 0.979 0.869    0.952 0.988 0.97  0.825    0.97  0.988 0.979 0.889    0.977    1        0.988    0.919
Hodgkin-P     0.958 0.742 0.836 0.869    0.891 0.661 0.759 0.825    0.907 0.79  0.845 0.889    1*†      0.839*   0.912*†  0.919
Table 3-5 Held-out test results on different feature groups. In the lymphoma class column, suffix "-N" denotes negative cases, "-P" denotes positive cases. P - precision, R - recall, F - f-measure, AUC - area under curve for ROC curve. Numbers in parentheses next to each feature group indicate the number of features in that group. Evaluation metrics for each positive class are in bold if they show significant improvements over baselines. Markers (*†) are used to indicate specific baselines.

To assess the effect of parse post-processing and the effect of detailed dependency types on the performance of sentence subgraph features, Table 3-6 shows different configurations in separate panels, in which "untyped dependency" means that all dependency types are ignored. Vertical comparisons show that post processing in general helps to improve classification performance, with the exception of Burkitt lymphoma classification when the system uses typed dependencies.
Horizontal comparisons show that distinguishing dependency types in general does not improve classification performance. In particular, with post processing, untyped dependencies even help to improve the f-measures for Burkitt, DLBCL, and follicular lymphoma classifications. There are two possible reasons. First, the Stanford Parser dependency types may distinguish relations between concepts in unnecessary detail. For example, the partial sentences "B-cells with CD10 expression" (B-cells -prep_with-> expression -amod-> CD10) and "B-cells expressing CD10" (B-cells -partmod-> expressing -dobj-> CD10) have different syntactic parses but convey almost the same information to pathologists. In addition, parser errors during dependency type assignment could introduce noise that diminishes the usefulness of the dependency types.

Lymphoma      No post processing, typed dependency (7491)   No post processing, untyped dependency (8548)
Class         P      R      F      AUC                      P      R      F      AUC
Burkitt-N     0.978  0.984  0.981  0.861                    0.978  0.984  0.981  0.861
Burkitt-P     0.8    0.737  0.767  0.861                    0.8    0.737  0.767  0.861
DLBCL-N       0.819  0.762  0.789  0.834                    0.868  0.767  0.815  0.852
DLBCL-P       0.873  0.907  0.890  0.834                    0.879  0.936  0.907  0.852
Follicular-N  0.942  0.971  0.957  0.868                    0.937  0.971  0.954  0.858
Follicular-P  0.872  0.765  0.815  0.868                    0.869  0.745  0.802  0.858
Hodgkin-N     0.977  0.990  0.983  0.915                    0.974  0.993  0.984  0.908
Hodgkin-P     0.929  0.829  0.881  0.915                    0.944  0.823  0.879  0.908

Lymphoma      Post processing, typed dependency (9488)      Post processing, untyped dependency (11537)
Class         P      R      F      AUC                      P      R      F      AUC
Burkitt-N     0.969  0.989  0.979  0.810                    0.978  0.991  0.984  0.864
Burkitt-P     0.828  0.632  0.716  0.810                    0.875  0.737  0.8    0.864
DLBCL-N       0.86   0.75   0.801  0.841                    0.87   0.779  0.822  0.857
DLBCL-P       0.871  0.932  0.901  0.841                    0.884  0.936  0.909  0.857
Follicular-N  0.943  0.979  0.961  0.872                    0.952  0.971  0.961  0.889
Follicular-P  0.904  0.765  0.829  0.872                    0.878  0.806  0.84   0.889
Hodgkin-N     0.979  0.998  0.988  0.926                    0.977  1      0.988  0.919
Hodgkin-P     0.981  0.855  0.914  0.926                    1      0.839  0.912  0.919
Table 3-6 Held-out test results on different settings of sentence subgraph feature groups. In the lymphoma class column, suffix "-N" denotes negative cases, "-P" denotes positive cases. P - precision, R - recall, F - f-measure, AUC - area under curve for ROC curve. Numbers in parentheses next to each feature group indicate the number of features in that group.

3.6 Feature and Error Analysis
This section investigates the ability of sentence subgraphs to assist with human review by providing insightful relations over a flexible number of medical concepts. The sentence subgraph features outperform all three baselines, and n-grams seem to be the best baseline overall. A closer look at the MetaMap baseline shows that the program did not identify some important immunologic factors, such as CD30 and CD15. By contrast, n-gram features cover the entire text, but often do not map to medical concepts. To compare subgraph features with the baselines, we identified in the training corpus cases that are false negatives for the n-gram baseline and the MetaMap baseline but not for the sentence subgraph features during cross validation. We then identified the large subgraphs (> 3 nodes) that contribute to the improved recognition of the three minority lymphomas, by choosing those with a normalized weight above 0.01 as assigned by a linear kernel SVM. For Burkitt lymphoma, examples of interesting positive factors include:

bf1 "... with antibodies to immunoglobulin, ... there is monotypic ... kappa staining of most tumor cells ..."
bf2 "... b-cells ... negative for BCL2 ... positive for BCL6 ..."
bf3 "... CD19+, CD20+, CD10+, CD5-, CD23-, CD43+ ... B cells with monotypic expression of kappa light chain ..."
bf4 "... tumor cell is positive for CD10 ..."

For readability, we translated each subgraph into a partial sentence.
Note that in bf3, although we have listed "CD19+, CD20+, CD10+, CD5-, CD23-, CD43+" in order, when viewed in the subgraph, individual immunologic factors are all adjective modifiers of "B cells", hence the subgraph is order-independent. The factors bf1, bf2, bf3 and bf4 are consistent with immunophenotypic characteristics of Burkitt lymphoma in the WHO classification [216], which states that the tumor cells are light chain-restricted with moderate to strong expression of pan-B-cell (CD19, CD20) and germinal center (BCL6 and CD10) antigens, and are negative for CD5 and CD23.

For follicular lymphoma, examples of positive factors that are exclusively discovered by sentence subgraph features are as follows.

ff1 "... CD20+, CD10dim, CD5-, CD23- ... B cells ..."
ff2 "... CD20+, CD10dim, CD5-, CD43- ... B cells ..."
ff3 "... CD19+, CD20+, CD23+ ... B cells with ... expression of lambda light chain ..."

The factors ff1, ff2 and ff3 are consistent with Table 8.01 in [216], as CD10 is usually positive and CD23 is intermittently positive on B cells in follicular lymphoma.

One might think that Hodgkin lymphoma cases should be easy to classify because of the presence of Reed-Sternberg cells as a well-recognized diagnostic feature. However, our analysis shows that the paucity of neoplastic Reed-Sternberg cells and the predominance of non-neoplastic cells lead to interesting associations between sentence subgraphs and Hodgkin lymphoma. In particular, we found the following positive factors discovered by sentence subgraph features.

hf1 "... atypical large cells ... positive for ... CD30 ..."
hf2 "... with antibodies to B lineage ... antigens ... there is staining of many ... cells ..."
hf3 "... with antibodies to T lineage associated antigen ... there is staining of ... cells ..."

The factor hf1 links CD30-expressing atypical large cells to Hodgkin lymphoma and conforms to conventional knowledge [216].
The factors hf2 and hf3 refer to staining patterns of background T and B cells. Although hf2 and hf3 are seen to some extent in other lymphoma subtypes, Hodgkin lymphoma is particularly rich in background non-neoplastic T cells, as well as B cells to a lesser extent, and these non-neoplastic cells vastly outnumber the neoplastic Reed-Sternberg cells [6]. Together with other Hodgkin-related subgraph features such as hf1 or Reed-Sternberg cells, hf2 and hf3 appear to account for these non-neoplastic cells. Our classifier placed higher weight on hf3 than on hf2, agreeing with the aforementioned T-cell dominance. Of note, recent work has shown varying patterns of morphology and immunophenotype in background non-neoplastic cells associated with a certain subtype of Hodgkin lymphoma [254-256], pointing to the potential utility of our analysis in identifying variant patterns of lymphoma.

Of the four lymphomas, follicular lymphoma has a moderate number of cases but a comparatively lower f-measure than DLBCL and Hodgkin lymphoma. We thus delved into false negative cases of follicular lymphoma in the training data and selected common features that have top negative weights as assigned by the linear kernel SVM. Investigating those common features, we highlighted the following.

fnf1 "... large ..."
fnf2 "... erythroid maturation is normal ..."
fnf3 "... myeloid maturation is normal ..."

The factor fnf1 incorrectly associates the single-node subgraph "large" with negative classification of follicular lymphoma. In the description of a morphological study, "large" often describes the cell size. Although the keyword corresponds to the name of DLBCL (diffuse large B cell lymphoma), it is not a distinguishing feature, because a Hodgkin Reed-Sternberg cell can be large, and centroblasts in follicular lymphoma can be large. Similarly, the keywords "diffuse" and "follicular" are not specific to DLBCL and follicular lymphoma respectively.
Although our model successfully excluded "diffuse" from the top negative features for follicular lymphoma, it incorrectly included "large". We reason that this is because we have a majority of DLBCL cases, which do frequently have the keyword "large", and the imbalanced ratio between DLBCL and follicular lymphoma confused our model. The factors fnf2 and fnf3 refer to erythroid and myeloid maturation respectively, which in reality are neither positively nor negatively associated with the likelihood of follicular lymphoma. We think these are identified by the classifier because lymphoma patients often undergo a staging bone marrow biopsy in which myeloid and erythroid maturation are routinely assessed during the process of determining whether the marrow is involved by lymphoma. As a result, normal myeloid and erythroid maturation is frequently associated with most cases. Because there are more follicular lymphoma cases with uninvolved staging bone marrow biopsies than those with involved biopsies, such association could be regarded by the classifier as favoring negative classification of follicular lymphoma.

3.7 Discussion and Limitations
Some clinical reports are template based. In fact, our pathology reports also have template-based sections. For example, there are disclaimers such as "By his/her signature below, the pathologist listed as making the Final Diagnosis certifies that he/she has personally reviewed this case and confirmed or corrected the diagnoses." We exclude these sentences from being processed, as they do not offer clinical insights. Recognizing these sections is based on knowledge from EMR vendors about pre-specified templates.

Patient demographics such as gender and age are usually mentioned in the clinical presentation section. They are also part of the features captured by subgraphs. For the age features, expressions such as "year-old" are connected to the integers that we discretize by every 10 years.
However, we did not find demographics ranked as top-weighted features in our experiments. This is likely due to the presence of more specific predictors such as morphologic, immunophenotypic, and genetic features, though we do not exclude the possibility that a better customized discretization could yield a different outcome.

In addition, we note that different institutions may have different clinical documentation systems and styles, which may bring challenges to generalizing our framework to multiple institutions. We expect that the untyped dependencies will help mitigate some style (e.g., syntactic) differences between institutions. We also expect that the UMLS concept mapping can lessen the impact of the terminology differences between institutions. We are in fact expanding the lymphoma classification project across institutions, and generalizability analysis is part of our future work.

Our work is predicated on the assumption that pathology reports provide a comprehensive statement of measurements, observations and interpretations made by pathologists. This seems true of current practice, but future programs may have access to digital images of immunohistochemical slides and raw flow cytometry counts directly from instruments. Nevertheless, we expect that for the foreseeable future pathologists' observations and interpretations will continue to be expressed in natural language, hence the techniques we report here will continue to be helpful. We expect to scale up our tool to assist with human expert reviews and more systematically identify unique variants and new subcategories of lymphoma, whose recognition, diagnosis and acceptance into the widely-used classification system are important for patients to receive appropriate treatment and follow-up and to further our understanding of lymphoma biology.
3.8 Conclusions
We narrowed the gap between automatic unsupervised feature generation and interpretable feature generation from clinical narrative text by building a framework that can perform unsupervised extraction of relations among a flexible number of medical concepts. Our framework represents narrative sentences in pathology reports as graphs, and automatically mines sentence subgraphs for feature generation. We perform a lymphoma classification task resembling differential diagnosis, in which no explicit mentions or synonyms of the targeted lymphomas are available to the classifier. Evaluation shows that the classifier with unsupervised sentence subgraph features significantly outperforms the baselines using standard n-grams, full MetaMap concepts, or filtered MetaMap concepts respectively. With detailed feature analysis, we highlight that our system generates meaningful features and medical insights into lymphoma classification.

Chapter 4. Subgraph Augmented Non-negative Tensor Factorization (SANTF) Applied to Modeling Clinical Narrative Text
This chapter (footnote 5) continues by describing the core part of the Subgraph Augmented Non-negative Tensor Factorization (SANTF) algorithm, with a focus on applying non-negative tensor factorization to group subgraphs collected from Chapter 3. We begin by motivating the need for using non-negative tensor factorization to perform such groupings, continuing with the example of lymphoma subtype categorization based on pathology reports.

Advances in machine learning have opened avenues towards more effective mining and modeling of EMRs to facilitate translational research [257,258]. However, clinicians often regard existing machine learning models as hard-to-interpret black boxes. In lymphoma pathology reports, immunophenotypic features may be expressed in the form of relations among medical concepts such as lymphoid cells and antigens (e.g., "[large atypical cells] express [CD30]").
We refer to the above relations as higher-order features, and the words (e.g., "large", "cells") as atomic features. When interpreting pathology reports and evaluating lymphoma subtypes, clinicians usually reason at the level of higher-order features (e.g., cell-antigen relations) in addition to atomic features (e.g., individual words). Moreover, multiple higher-order features (such as "[large atypical cells] express [CD30]", "[large atypical cells] express [CD15]" and "[large atypical cells] have [Reed-Sternberg appearance]") can strengthen the confidence of a suspected lymphoma (Hodgkin lymphoma here). Such a group of higher-order features conveniently encodes medical knowledge as in the WHO lymphoma classification guideline [216] (referred to as WHO guideline later), where a panel of morphologic and immunophenotypic features is used to specify diagnostic criteria. For computational modeling, atomic features can help correlate higher-order features in order to discover medically meaningful groupings. For example, the above relations all share the words "large", "atypical" and "cells", which indicates that they all describe the characteristics of tumor cells.

However, extracting higher-order features is itself a difficult task and often involves manually constructed rules and domain knowledge [27,97,103,259]. In addition, modeling interactions between higher-order features and atomic features is usually ignored by machine learning algorithms that mostly adopt a flat patient-by-feature matrix view (patients as rows and features as columns). Although theoretically one can add interactions as additional features or embed graphical models to account for feature interactions, the problem quickly becomes intractable for large feature dimensionality.

⁵ This chapter was published as a research article in Journal of the American Medical Informatics Association [2].
On the other hand, limited availability of expert annotation leads to the fact that most clinical data are still either unannotated or sparsely annotated. Thus unsupervised machine learning approaches have often been used to analyze biomedical data [260,261]. Moreover, the expense of expert engineered features also argues for unsupervised feature learning instead of manual feature engineering [87,262,263]. In particular, non-negative matrix factorization (NMF) has been a highly effective unsupervised method [264] to cluster similar patients [265] and sample cell lines [266], to identify subtypes of diseases [267] and to learn groups of atomic features or expert engineered features such as temporal patterns from predefined events [268] and genetic expression patterns [269-273]. As the multi-dimensional extension of NMF, non-negative tensor factorization (NTF) [274-276] has recently been studied to model the genetic associations with phenotypes [277-279] and interaction between cellular activities [280]. However, none of these approaches model the correlations among higher-order features, and some even do not consider higher-order features. Our work is more closely related to previous work on applying NMF and NTF in text mining in general domains such as email and security surveillance [281-284]. In particular, our approach differs from the NTF based text document analysis [281,284] in that we augment the NTF with subgraphs to capture relation oriented higher-order features instead of standalone entities. In addition, we adopted the Tucker tensor factorization model instead of the PARAFAC model [285], where the support for factor matrices with different group numbers better serves our application purpose. In this chapter, we develop an unsupervised framework that can generate machine learning models conveniently interpretable to clinicians. 
The framework adopts NTF to discover groupings of subgraph-encoded higher-order features, hence the name subgraph augmented non-negative tensor factorization (SANTF).

4.1 Methods

4.1.1 Workflow of SANTF

We first outline the SANTF workflow in Figure 4-1. Narrative text sentences are first converted to graph representations, derived using the natural language processing (NLP) steps for pathology reports described in section 3.4.1 and frequent subgraph mining (FSM) as described in sections 3.4.4 to 3.4.6. Figure 4-2 shows an example of higher-order features for clinical narrative text. With such representations, subgraphs encode higher-order features, and we use "subgraphs" and "higher-order features" interchangeably throughout the chapter. We jointly model the higher-order features and atomic features, and apply non-negative tensor factorization to discover groups of features and patients, and then perform unsupervised learning to identify the associations between feature groups and patient groups. We next explain the tensor modeling and factorization in more detail.

[Figure 4-1 flowchart: Narrative Text → NLP Steps → Graphs → Frequent Subgraph Mining → Subgraphs (Higher-Order Features) and Words (Atomic Features) → Non-negative Tensor Factorization → Feature and Patient Groups → Unsupervised Learning]

Figure 4-1 The workflow of subgraph augmented non-negative tensor factorization (SANTF). FSM - frequent subgraph mining. NLP - natural language processing.

Figure 4-2 Graph generation and subgraph collection in SANTF.
The graph representation for the example sentence: "Immunostains show the large atypical cells are positive for OCT2 and BOB1, and negative for CD10, CD15 and CD30". Example frequent subgraphs are shown after the frequent subgraph mining (FSM) steps.

4.1.2 Joint modeling of higher-order features and atomic features using a tensor

In clinical narrative text, higher-order features are often correlated with each other in medically meaningful ways. For example, the two subgraphs in Figure 4-2 both describe the surface markers expressed by the "large atypical cells" that are often tumor cells. However, as pointed out in the introduction, with a flat matrix view and binary feature representation, such correlations are difficult to account for. Motivated by the need to explicitly model correlations among the higher-order features, we compose a three-mode tensor, in which one mode represents the patients, a second the higher-order features (subgraphs), and a third the atomic features. Note that in tensor terminology [285], we speak of mode in place of dimension. Figure 4-3 shows the schematic view of tensor modeling.

We select as atomic features the words that are covered by or next to a subgraph node (neighborhood window size was set to two for this work). The intuition is that subgraphs that share affiliated (covered and contextual) words are likely to be conceptually related. By taking the union over all words that are affiliated with the nodes of a sentence subgraph, we obtain the distributional representation of that sentence subgraph. Each entry of the tensor is the count of a certain combination of patient, subgraph, and word, and is non-negative (see Figure 4-3 for an example). We then apply a generalized tf-idf weighting to the co-occurrence counts of subgraph-word pairs (i.e., counting and weighting subgraph-word pairs instead of counting and weighting words), which leads to better empirical performance.
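The construction of the count tensor described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation: the helper names `affiliated_words` and `build_count_tensor`, the token-span representation of subgraph nodes, and the reading of the window size as "up to two tokens on either side of a node" are all assumptions made for the example. The tf-idf weighting step is omitted because its exact form is not specified here.

```python
import numpy as np

def affiliated_words(tokens, node_spans, window=2):
    """Collect words covered by, or within `window` tokens of, any subgraph node.

    node_spans holds (start, end) token ranges, end exclusive (assumed format).
    """
    keep = set()
    for start, end in node_spans:
        keep.update(range(max(0, start - window), min(len(tokens), end + window)))
    return [tokens[i] for i in sorted(keep)]

def build_count_tensor(matches, n_patients, n_subgraphs, vocab):
    """Accumulate counts of (patient, subgraph, word) combinations."""
    X = np.zeros((n_patients, n_subgraphs, len(vocab)))
    for patient, subgraph, words in matches:
        for word in words:
            X[patient, subgraph, vocab[word]] += 1
    return X

# Toy example: one sentence, one matching subgraph [large cells]-[negative]-[BCL2].
tokens = "The large cells are negative for BCL2".split()
spans = [(1, 3), (4, 5), (6, 7)]  # hypothetical node positions in the sentence
words = affiliated_words(tokens, spans)
vocab = {w: k for k, w in enumerate(sorted(set(words)))}

# Patient 0 has the sentence; patient 1 does not.
X = build_count_tensor([(0, 0, words)], n_patients=2, n_subgraphs=1, vocab=vocab)
```

With this reading of the window, the word "cells" falls within two tokens of the node "[negative]", so it is counted as a contextual word for that node, in line with the example in the Figure 4-3 caption.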
[Figure 4-3 schematic: the patient × higher-order feature × atomic feature data tensor, its factor matrices and core tensor, and two example subgraph groups. The example atomic features are numbered 1 "large", 2 "cells", 3 "BCL2", 4 "positive", 5 "CD30", 6 "negative".]

Figure 4-3 Tensor modeling and factorization with distributional representations of the sentence subgraphs. In the figure, we show some higher-order features (i.e., sentence subgraphs), as well as some atomic features (i.e., words). The higher-order features are numbered with the first subgraph being "[large cells] - [negative] - [BCL2]". This subgraph matches the sentence "The large cells are negative for BCL2", where the word "cells" is one of the neighboring contextual words for the node "[negative]". If the pathology report of patient 1 has a sentence "The large cells are negative for BCL2", then subgraph 1 is associated with this patient. As the subgraph covers the word "large", the first atomic feature, the tensor entry (1,1,1) is increased by 1. The factor matrix A is the (patient, patient group) matrix, B the (subgraph, subgraph group) matrix, C the (atomic feature, atomic feature group) matrix. The core tensor G captures the interactions between the patient groups, subgraph groups and atomic feature groups. We also show example subgraph group 1 and subgraph group 2. It is desirable that some subgraph groups correspond to panels of characteristic features for lymphoma subtypes.
For example, subgraph group 1 includes mentions of CD30 staining and Reed-Sternberg appearance of cells, and suggests Hodgkin lymphoma; subgraph group 2 includes mentions of diffuse infiltration of large cells, moderately high Ki67 expression, and no CD10 staining, and suggests diffuse large B-cell lymphoma (DLBCL).

4.1.3 Patient and feature group discovery using SANTF

The non-negative tensor is then factorized to reduce dimensionality and obtain groups for each mode. We follow the Tucker factorization scheme [274], where the data tensor is factorized into a core tensor multiplied by factor matrices (one factor matrix for each mode, each orthogonal in our setting). The core tensor specifies the level of interaction between groups from different modes. The column vectors in a factor matrix specify the grouping in the corresponding mode. Such groupings can capture similar patients, similar sentence subgraphs and similar words; meanwhile they allow sharing of an element among different groups as specified by its fractional weights across groups. In Figure 4-3, two example subgraph groups are shown. The top subgraphs in subgraph group 1 correlate with Hodgkin lymphoma and those in group 2 correlate with diffuse large B-cell lymphoma (DLBCL). Meaningful groupings will not only improve the performance of multiple machine learning tasks but also identify panels of characteristic features of patient subcategories, in the same form as specified by the diagnostic guidelines.

SANTF differs from previous NTF [277-279] by introducing a mode that captures higher-order features. SANTF performs group discovery over sentence subgraphs based on the intuition that these higher-order features encode more aggregated information. In addition, SANTF simultaneously identifies the groups of the atomic features, which indirectly helps the group discovery for higher-order features through the core tensor.
This is possible as the core tensor encodes the interactions among the groups of patients, higher-order features, and atomic features. We next give the detailed SANTF algorithm.

4.1.4 SANTF algorithm

Here we provide a mathematical formulation of the procedures depicted in Figure 4-3, following the standard notation [285]. Let $\mathcal{X} \in \mathbb{R}^{P \times S \times W}$ be the data tensor, where $P$, $S$, $W$ are the numbers of patients, subgraphs, and atomic features respectively. We want to find a low rank approximation to $\mathcal{X}$ by solving a least squares optimization problem (Tucker tensor factorization [285]) with objective

$$f(A, B, C, \mathcal{G}) = \left\| \mathcal{X} - \mathcal{G} \times_1 A \times_2 B \times_3 C \right\|^2 = \sum_{i=1}^{P} \sum_{j=1}^{S} \sum_{k=1}^{W} \Big( \mathcal{X}_{ijk} - \sum_{p=1}^{P_g} \sum_{q=1}^{S_g} \sum_{r=1}^{W_g} \mathcal{G}_{pqr} A_{ip} B_{jq} C_{kr} \Big)^2 \qquad (4\text{-}1)$$

where $P_g$, $S_g$, $W_g$ are the numbers of groups of patients, subgraphs and atomic features, respectively, and $A \in \mathbb{R}^{P \times P_g}$, $B \in \mathbb{R}^{S \times S_g}$ and $C \in \mathbb{R}^{W \times W_g}$ are factor matrices. Each column corresponds to a group of features or patients. We call the tensor $\mathcal{G}$ the core tensor, which specifies the interactions between the groups of factor matrices and usually has much smaller size compared to the data tensor. We refer to $\sum_{p} \sum_{q} \sum_{r} \mathcal{G}_{pqr} A_{ip} B_{jq} C_{kr}$ as the reconstructed tensor, and the goal is to closely approximate the data tensor using the reconstructed tensor. We further constrain the factor matrices and the core tensor to be non-negative, i.e., $A_{ip}, B_{jq}, C_{kr}, \mathcal{G}_{pqr} \geq 0$. To solve the constrained optimization problem, we follow the block alternating least squares (ALS) algorithm [286].

After the groups are computed, we weight each group according to the core tensor $\mathcal{G}$. Let the slice matrix $\mathcal{G}_{i::}$ of the core tensor $\mathcal{G}$ be obtained by fixing the mode-1 index and varying the mode-2 and mode-3 indices (: indicates all indices for the corresponding mode). We choose from $\mathcal{G}$ the slice matrix $\mathcal{G}_{p::}$ corresponding to the $p$th patient group and use the $\ell_2$ norm of the slice matrix as the group weight:

$$w_p = \left\| \mathcal{G}_{p::} \right\|_2 = \sqrt{\sum_{q=1}^{S_g} \sum_{r=1}^{W_g} \mathcal{G}_{pqr}^2} \qquad (4\text{-}2)$$

Each entry of the $p$th column in the factor matrix $A$ is then multiplied by $w_p$ to obtain $A'$.
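Equations (4-1) and (4-2) can be checked numerically on a small random example. The sketch below is illustrative only: it does not implement the non-negative ALS solver, the dimensions are arbitrary toy values, the function names are hypothetical, and it reads the norm in equation (4-2) as the square root of the sum of squared slice entries.

```python
import numpy as np

rng = np.random.default_rng(0)
P, S, W = 6, 5, 4        # toy numbers of patients, subgraphs, words
Pg, Sg, Wg = 3, 2, 2     # toy numbers of groups per mode

# Random non-negative core tensor and factor matrices.
G = rng.random((Pg, Sg, Wg))
A = rng.random((P, Pg))
B = rng.random((S, Sg))
C = rng.random((W, Wg))

def reconstruct(G, A, B, C):
    """Reconstructed tensor: sum over p, q, r of G[p,q,r] * A[i,p] * B[j,q] * C[k,r]."""
    return np.einsum("pqr,ip,jq,kr->ijk", G, A, B, C)

def objective(X, G, A, B, C):
    """Least squares objective of equation (4-1)."""
    return float(np.sum((X - reconstruct(G, A, B, C)) ** 2))

def group_weights(G):
    """Equation (4-2): one weight per patient group from its core tensor slice."""
    return np.sqrt(np.sum(G ** 2, axis=(1, 2)))

X = reconstruct(G, A, B, C)   # a data tensor that the factors fit exactly
w = group_weights(G)
A_weighted = A * w            # broadcasting scales column p of A by w[p]
```

Because `X` is built from the factors themselves, `objective(X, G, A, B, C)` is zero here; on real data the solver would iterate to reduce this value under the non-negativity constraints.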
For the $i$th patient case ($i$th row in $A'$), we assign it to group $p$ if $A'_{ip} = \max_{p'} A'_{ip'}$. Intuitively, the columns of the pre-weighted patient group matrix specify the contribution of each patient to each group; the norm as calculated in equation (4-2) specifies the magnitude of this patient group interacting with subgraph and word groups. Weighting according to the core tensor $\mathcal{G}$ by multiplying a column by the corresponding norm takes into account such magnitude, which is necessary when evaluating different group proportions for one patient. Although we have adopted hard grouping for patients due to the fact that a patient can only belong to one cluster in our experiments, SANTF itself can be readily generalized to applications with soft grouping (multiple membership) of patients.

We next give details on how to identify word groups associated with a specific subgraph from the tensor factorization results, which is used in feature analysis. Let $\bar{\times}_2$ be the mode-2 tensor-vector product defined as

$$(\mathcal{T} \,\bar{\times}_2\, v)_{i_1 i_3} = \sum_{i_2=1}^{I_2} \mathcal{T}_{i_1 i_2 i_3} \, v_{i_2} \qquad (4\text{-}3)$$

where $\mathcal{T}$ is any three-mode tensor with size $I_1 \times I_2 \times I_3$ and $v$ is a vector of length $I_2$. For the subgraph $i$, we obtain

$$A \, (\mathcal{G} \,\bar{\times}_2\, B_{i:}) \qquad (4\text{-}4)$$

where $\mathcal{G}$ is the core tensor, $A$ the patient factor matrix, $B$ the subgraph factor matrix. We then sum down the columns of the matrix $A (\mathcal{G} \,\bar{\times}_2\, B_{i:})$, that is, over the patient mode, to get the desired word group distribution vector for the $i$th subgraph.

4.2 Experiments and Results

We experimented with SANTF on clustering lymphoma subtypes based on pathology report narrative text. SANTF itself does not require annotated training data, but in order to verify our algorithms, we use annotated datasets for ground truth.
We used part of the dataset described in section 3.3, which consists of 897 patients whose written diagnosis (in the final diagnoses section) maps to exactly one of the following three lymphomas: Diffuse large B-cell lymphoma (DLBCL; the most common lymphoma), follicular lymphoma (the second most common lymphoma) and Hodgkin lymphoma (the most common lymphoma in young patients). The written diagnoses themselves were excluded from being processed by the feature extraction steps, as before. In contrast to the analysis of Chapter 3, we omit cases of Burkitt lymphoma because it had too few cases to learn a good clustering model, and we omit cases in which the patient has multiple lymphomas because these do not fit the hard clustering paradigm. The case distribution of the ground truth for the cases used here is shown in Table 4-1, where the dataset is partitioned roughly equally, and stratified by type of lymphoma, into a training set (471 cases) and a testing set (426 cases).

Clinical Narrative Text
Lymphoma      All    Train   Test
DLBCL         589    305     284
Follicular    184    101     83
Hodgkin       124    65      59

Table 4-1 Statistics of the lymphoma subtype distribution in the pathology narrative text corpus.

To study the impact of being able to model the interactions among multiple types of features, we establish three types of baselines for NMF and two configurations of k-means, a frequently used clustering method. The two configurations of k-means differ in the distance metric used: Euclidean distance and cosine distance [287]. The first type of baseline applies NMF or k-means on the (patient, atomic feature) matrices. The second baseline applies NMF or k-means on the (patient, higher-order feature) matrices.
The third baseline applies NMF or k-means on the (patient, combined feature) matrices, where the combined features are generated by adjoining the atomic features and the higher-order features, because we want to exclude the possibility that the improvements of SANTF only come from simply adding features. Under orthogonality constraints, NMF is equivalent to simultaneous clustering of rows and columns of a matrix [288], and similar arguments can be made for NTF. Thus for each factorization scheme, we obtain the factor matrix of (patient, patient group), and translate this matrix into a clustering interpretation in that for each patient case, we pick the maximum column as its cluster label. For the pathology reports, recorded texts reflect results from tests and labs that are performed in order to make differential diagnoses among possible subtypes of lymphoma. Thus it is reasonable to expect that clustering based on these data will lead to patient groupings that reflect the lymphoma subtypes. The tensor has 3773 higher-order features and 2841 atomic features. The patient group number is set to three, the same as the number of lymphoma subtypes. Because our method is unsupervised, there is no a priori mapping from patient groups to lymphoma subtypes. We therefore consider the label permutation that yields the best evaluation metrics as a parameter. For SANTF, the ideal group numbers for the higher-order features and for the atomic features are also parameters. All parameters are selected using 5-fold cross-validation on the training data and then applied to the held-out testing data. For the evaluation metrics of clustering performance, we use the commonly adopted metrics of averaged precision, recall, f-measure, and accuracy that all apply to multi-class clustering [289]. Averaging computes a direct arithmetic average over classes. The accuracy computes the proportions of the sum of diagonal entries out of all entries from the multi-class contingency table. 
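The evaluation protocol described above (search over label permutations, then class-averaged precision, recall, F-measure and accuracy from the contingency table) can be sketched as follows. This is an illustrative reading rather than the evaluation code used in the thesis: the function names are hypothetical, and choosing the permutation by averaged F-measure is an assumption, since the text only says the permutation yielding the best evaluation metrics is selected.

```python
import itertools
import numpy as np

def contingency_table(true_labels, cluster_labels, n_classes):
    """Rows index true classes, columns index cluster ids."""
    table = np.zeros((n_classes, n_classes), dtype=int)
    for t, c in zip(true_labels, cluster_labels):
        table[t, c] += 1
    return table

def macro_metrics(table):
    """Arithmetic averages over classes, plus accuracy from the diagonal."""
    tp = np.diag(table).astype(float)
    predicted = table.sum(axis=0).astype(float)
    actual = table.sum(axis=1).astype(float)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f_measure = np.divide(2 * precision * recall, denom,
                          out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / table.sum()
    return precision.mean(), recall.mean(), f_measure.mean(), accuracy

def best_label_permutation(true_labels, cluster_labels, n_classes):
    """Try every cluster-to-class mapping; keep the one with the best averaged F-measure."""
    table = contingency_table(true_labels, cluster_labels, n_classes)
    best_perm, best_scores = None, None
    for perm in itertools.permutations(range(n_classes)):
        # perm[j] is the cluster id that is mapped to class j.
        scores = macro_metrics(table[:, list(perm)])
        if best_scores is None or scores[2] > best_scores[2]:
            best_perm, best_scores = perm, scores
    return best_perm, best_scores

# Toy example: clusters are a pure relabeling of the true classes.
true_labels = [0, 0, 0, 1, 1, 2]
cluster_labels = [2, 2, 2, 0, 0, 1]
perm, (precision, recall, f_measure, accuracy) = best_label_permutation(
    true_labels, cluster_labels, n_classes=3)
```

With three patient groups there are only six permutations, so exhaustive search is cheap; for many more groups an assignment solver would be the more practical choice.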
Because neither the NMF nor the NTF has a global convergence guarantee [285,286,290], we use random initialization for all factorization schemes and average the clustering evaluation metrics from 100 runs. We show the results in Table 4-2 for the lymphoma subtype clustering. We also perform significance testing based on the Student's t-test with α = 0.05. We see that SANTF significantly outperforms all nine baselines, and in particular, by over 10% margins in average F-measure compared to all baselines. Given that the classes are highly imbalanced, the results seem to suggest that improvements by SANTF come not only from the fact that more patient cases are correctly grouped (better accuracy), but also from more balanced clustering among multiple classes (better averaged precision, recall and F-measure).

Methods                                 Avg. Precision   Avg. Recall   Avg. F-measure   Accuracy
(1) NMF pt × wd                         0.492            0.495         0.428            0.626
(2) NMF pt × sg                         0.621            0.765         0.601            0.605
(3) NMF pt × [sg wd]                    0.637            0.787         0.615            0.614
(4) k-means (Euclidean) pt × wd         0.483            0.420         0.398            0.664
(5) k-means (Euclidean) pt × sg         0.700            0.602         0.584            0.708
(6) k-means (Euclidean) pt × [sg wd]    0.690            0.593         0.573            0.726
(7) k-means (Cosine) pt × wd            0.620            0.694         0.618            0.617
(8) k-means (Cosine) pt × sg            0.647            0.762         0.624            0.615
(9) k-means (Cosine) pt × [sg wd]       0.648            0.759         0.626            0.617
(10) SANTF pt × sg × wd                 0.720^(1-9)      0.849^(1-9)   0.743^(1-9)      0.751^(1-9)

Table 4-2 Clustering performances for MGH lymphoma dataset. Each factorization and clustering scheme is numbered in the "Methods" column. Significant improvements (p < 0.05) are in boldface and marked with superscripts indicating the baselines over which they significantly improved.

Cross-validation selected 3 × 180 × 60 as the SANTF core tensor size for the lymphoma dataset. We show the per-class breakdown of evaluations on the lymphoma dataset in Table 4-3.
The detailed evaluation results further confirm the above observation that SANTF not only leads to more patient cases being correctly grouped, as evidenced by big improvement in more populated classes, but also leads to more balanced clustering, as evidenced by improvements in multiple classes.

Method                              Class        Precision   Recall   F-measure
NMF pt × wd                         DLBCL        0.713       0.770    0.723
                                    Follicular   0.528       0.242    0.250
                                    Hodgkin      0.235       0.473    0.310
NMF pt × sg                         DLBCL        0.944       0.451    0.598
                                    Follicular   0.481       0.862    0.611
                                    Hodgkin      0.436       0.981    0.596
NMF pt × [sg wd]                    DLBCL        0.969       0.444    0.596
                                    Follicular   0.516       0.935    0.660
                                    Hodgkin      0.426       0.983    0.589
K-Means (Euclidean) pt × wd         DLBCL        0.696       0.920    0.791
                                    Follicular   0.443       0.068    0.115
                                    Hodgkin      0.311       0.271    0.289
K-Means (Euclidean) pt × sg         DLBCL        0.788       0.810    0.779
                                    Follicular   0.548       0.541    0.481
                                    Hodgkin      0.763       0.455    0.492
K-Means (Euclidean) pt × [sg wd]    DLBCL        0.769       0.848    0.802
                                    Follicular   0.607       0.565    0.529
                                    Hodgkin      0.696       0.366    0.389
K-Means (cosine) pt × wd            DLBCL        0.799       0.564    0.646
                                    Follicular   0.366       0.552    0.439
                                    Hodgkin      0.694       0.966    0.768
K-Means (cosine) pt × sg            DLBCL        0.920       0.476    0.612
                                    Follicular   0.566       0.831    0.669
                                    Hodgkin      0.455       0.980    0.590
K-Means (cosine) pt × [sg wd]       DLBCL        0.901       0.483    0.611
                                    Follicular   0.575       0.817    0.671
                                    Hodgkin      0.467       0.977    0.597
SANTF pt × sg × wd                  DLBCL        0.971       0.651    0.777
                                    Follicular   0.546       0.965    0.697
                                    Hodgkin      0.645       0.932    0.755

Table 4-3 Per-class evaluation of clustering on the lymphoma dataset

4.3 Feature Analysis

We performed feature analysis to identify groups of higher-order features contributing to lymphoma subtype clustering. The analyzed subgraph groups correspond to the core tensor size of 3 × 180 × 60 selected by cross-validation. We follow the standard approach of analyzing groups in factorization models [291], and make necessary adaptations to the SANTF output. Based on the core tensor after factorization, we associate subgraph groups with patient clusters using the following calculation.
Adopting the standard notation [285], for each slice gi:: (i = 1,2,3) corresponding to a particular patient cluster i, we sum over its word mode (mode-3) to get a vector whose elements correspond to the subgraph groups. We then sort the vector and investigate the top 10 subgraph groups for each patient cluster i. For each subgraph group, we sort the subgraphs according to their weights in the subgraph factor matrix and display the top subgraphs, where the weight is the entry value in the matrix indexed by the corresponding subgraph and subgraph group. For each patient cluster, we select its top four subgraph groups and list them in Table 4-4, Table 4-5 and Table 4-6. For readability, we translated each subgraph into a partial sentence. Note that in the first DLBCL-associated subgraph group, although we have listed "cells are CD30+, MUM1+" in order in the partial sentence, the subgraph does not distinguish the order between "CD30+" and "MUMl+" as they are both linked to "cells". We analyze each cluster and relate them in the context of the WHO guideline [216], which reflects the current consensus knowledge. DLBCL 2d Subgraph Group DLBCL I' Subgraph Group 0.6640 atypical cells 0.0929 large lymphoid cells 0.0530 atypical cells 0.0293 large lymphoid cells 0.0057 show ... positive cells 0.0240 large cells 0.0040 0.0025 0.0019 0.0010 0.0005 0.0005 0.0004 0.0002 0.0385 0.0329 0.0312 0.0137 0.0082 0.0077 0.0051 large lymphoid cell with vesicular nuclei 0.0070 monotypic staining of immunoglobulin light chains show the cells are ... B-cells co-expressing large cells predominate 0.0059 show large atypical cells with ... vesicular nuclei 0.0051 B-lineage antibody PAX5 ... stain ... large cells cells are CD30+, MUM1+ large cells stain for CD79a admixed small lymphocytes large cells stain positively for CD20 large atypical cell with vesicular nuclei DLBCL 3' Subgraph Group diffuse infiltrate of large ... 
cells large lymphoid cells large atypical cells diffuse infiltrate of large ... cells with ... vesicular nuclei B-lineage antibody PAX5 ... stain ... large cells infiltrate of large ... cells with ... scant cytoplasm sections show ... tissue with ... infiltrate of ... cells 0.0049 0.0047 0.0037 0.0034 0.0034 0.0144 0.0111 0.0104 0.0103 0.0101 associated cells a few large cells atypical cells are CD10-, BCL2-... infiltrate of large ... cells with ... scant cytoplasm sheet of ... cells DLBCL 4th Subgraph Group negative for cytokeratin stain positively for CD20 in-situ hybridization show positive for immunoglobulin kappa chains cells show ... stain 0.0041 positive for CD20, BCL2 0.0094 Ki67 proliferation index is greater than 70% 0.0086 Ki67 proliferation index is greater than 60% 0.0075 positive for CD79a 0.0028 cells ... form 0.0014 atypical large cells ... positive for CD20 0.0060 stain for Ki67 0.0053 large cells stain positively for CD20 0.0009 monotypic staining with immunoglobulin lambda chains 0.0044 positive for cytokeratin

Table 4-4 Top higher-order feature groups associated with diffuse large B-cell lymphoma. Subgraphs are translated to partial sentences. In each list item, e.g., "0.0010, ... cells are CD30+, MUM1+ ... ", 0.0010 indicates its weight in the group. The "... cells are CD30+, MUM1+ ... " is the partial sentence translated from the corresponding subgraph. Partial sentences that are not mentioned in feature analysis are grayed out.

For the DLBCL cluster as shown in Table 4-4, the first associated subgraph group recognizes the following histologic (light microscope-visible) facts: the cells are atypical in appearance and are large lymphoid cells with vesicular nuclei (the critical visual hallmarks of diffuse large B cell lymphoma). Immunohistochemically the group appropriately identifies staining for the B cell markers CD79a and CD20.
Although the staining for CD79a, CD20 can also be seen in the scattered large lymphocyte-predominant (LP) cells in nodular lymphocyte predominant Hodgkin lymphoma (NLPHL) (see p.324 of the WHO guideline [216]), these LP cells generally lack CD30 staining. Also, the predominance of large cells helps to rule out NLPHL. Thus these features all together offer insights into the differential diagnosis of DLBCL (see Chapter 10 of the WHO guideline [216]). The second DLBCL associated subgraph group is again highly consistent with the current pathologic definition of DLBCL and in this group the additional feature of monotypic light chain expression is identified. This group appears to be directed towards the identification of the activated B cell-like subtype of DLBCL, which is CD10 negative. The third DLBCL associated subgraph group echoes the characteristic features of DLBCL: diffuse infiltrate of neoplastic cells, expression of common B-cell lineage antibodies, and monotypic immunoglobulin expression. The second and third groups also reflect the mixed expression levels of BCL2 in DLBCL. The fourth DLBCL associated subgraph group states the following interesting facts: Ki67 proliferation index is moderately high. Note that when discretizing percentages, we choose multiple dichotomy thresholds with a step size of 10%. Thus collectively the subgraphs on Ki67 proliferation index point out that the index is moderately high in DLBCL. This in addition to the positivity of CD20 and CD79a, and the monoclonality of immunoglobulin light chains collectively associate with the differential diagnosis of DLBCL (see Chapter 10 of the WHO guideline [216]). 
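The percentage discretization mentioned above (multiple dichotomy thresholds with a step size of 10%) can be illustrated with a short sketch. The function name, the threshold range of 10% to 90%, and the use of "greater than" features only are assumptions made for this example; the thesis does not spell out these details here.

```python
def dichotomize_percentage(name, value, step=10):
    """Return one binary feature per threshold that `value` exceeds.

    Thresholds run over step, 2*step, ..., below 100 (assumed range).
    """
    thresholds = range(step, 100, step)
    return [f"{name} is greater than {t}%" for t in thresholds if value > t]

# A reported Ki67 proliferation index of 75% yields features for 10% through 70%.
features = dichotomize_percentage("Ki67 proliferation index", 75)
```

Under this reading, a single reported value contributes several overlapping features, which is how both "greater than 60%" and "greater than 70%" can appear together in the same subgraph group.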
For the follicular lymphoma cluster as shown in Table 4-5, the first associated subgraph group is consistent with the fact that follicular lymphoma is typically composed of both centrocytes (small cells) and centroblasts, and in bone marrow biopsies the lymphoma characteristically localizes to the paratrabecular region in bone marrow and may spread into the interstitial area (see p.222 of the WHO guideline [216]). The second follicular lymphoma associated subgraph group is consistent with frequent BCL2 overexpression, accompanying sclerosis, and enlargement and effacement in the architecture of lymph nodes in the setting of follicular lymphoma. The third follicular lymphoma associated subgraph group summarizes typical immunophenotypic features such as lack of expression for the cell surface marker CD5, and mixed expression levels of CD10 (together with the first and second follicular lymphoma associated subgraph groups) and CD23, all of which are consistent with Table 8.01 in the WHO guideline [216]. The fourth follicular lymphoma associated subgraph group reveals characteristic morphological features including dense infiltration of small lymphoid cells, the presence of cleaved centrocytes, and the staining of cells in follicular dendritic pattern (see p.220 of the WHO guideline [216]).

For the Hodgkin lymphoma cluster as shown in Table 4-6, the first associated subgraph group correctly identifies the morphological feature of the large neoplastic Reed-Sternberg cells that are usually multilobated and stain positively for CD15 (see p.327 of the WHO guideline [216]). The second Hodgkin lymphoma associated subgraph group extracts additional essential hematopathologic features for the malignant cells of Hodgkin lymphoma: CD30 positivity, CD15 positivity, CD20 negativity, and the appearance suggestive of Reed-Sternberg cells, which often express PAX5 and occur with histiocytes (see p.328 of the WHO guideline [216]).
The third Hodgkin lymphoma associated subgraph group is mostly consistent with the nodular sclerosis subtype of classical Hodgkin lymphoma, where the lymphoma contains Reed-Sternberg cells as well as a microenvironment of non-neoplastic inflammatory cells, the lymph nodes show a nodular growth pattern, collagen bands often surround nodules, and necrosis may occur (see p.330 of the WHO guideline [216]). The fourth Hodgkin lymphoma associated subgraph group is mostly consistent with the subtype of NLPHL, in that large neoplastic cells (lymphocyte predominant cells or LP cells) are positive for CD45, OCT2, PAX5, and immunoglobulin light (kappa and/or lambda) chains. The subgraph group is also consistent with the co-occurrence of LP cells and CD3 positive T-cells (see p.324 of the WHO guideline [216]). 86 Follicular 0.0308 0.0196 0.0171 0.0149 0.0127 1! Subraph Group interstitial lymphoid aggregates predominantly small ... cell paratrabecular lymphoid aggregates focal Follicular 2' Subgraph Group 0.0583 0.0213 0.0201 0.0091 0.0063 nodal architecture ... effaced B-cells co-expressing BCL2, CD10 biopsy of lymph node sclerotic tissue lymph node architecture effaced by ... follicular proliferation cells in the follicles 0.0117 large paratrabecular lymphoid aggregates 0.0061 sections show enlarged lymph nodes diffuse infiltrate of small lymphoid cells 0.0059 cell with reduced size infiltrate consisting of ... lymphoid cells 0.0055 sections show ... lymph nodes CD10+/- B-cell population 0.0045 residual ... follicle center cells 0.0043 cells stain positively for ... BCL2 core needle biopsy 0.0021 flow cytometry demonstrate . . population follicles contain ... centroblasts Follicular 4' Subgraph Group Follicular 3 Subrph Group 0.0642 lymphoid infiltration 0.0829 B-cells are negative for CD5 0.0269 atypical infiltration 0.0466 B-cells express 0.0107 0.0093 0.0080 0.0062 0.0050 0.0405 CD5-, ... 
, CD230.0315 negative for CD10 0.0267 dense lymphoid infiltration 0.0133 mucosa infiltration 0.0271 positive for CD23 0.0251 positive for CD10 0.0102 small lymphoid cells 0.0095 small lymphocytes 0.0148 positive for CD19, CD20, CD23 0.0060 containing... large atypical cells. 0.0041 positive for CD3 0.0024 show B-cells are positive for CD3. CD20 0.0018 CD5-, CD10- ... B-cells 0.0084 cleaved centrocytes 0.0082 diffuse infiltrate of small lymphoid cells 0.0060 cells ... in follicular dendritic pattern 0.0059 fibroadipose tissue 0.0044 dense infiltrate containing lymphoid cells Table 4-5 Top higher-order feature groups associated with follicular lymphoma.Subgraphs are translated to partial sentences. Partial sentences that are not mentioned in feature analysis are grayed out. Hodgkin 1g Subgraph Group 0.0362 large cells 0.0312 atypical cells 0.0303 large cells stain Hodgkin 2" Subgraph Group 0.0143 positive for CD30 0.0083 large cells are negative 0.0065 positive for CD15, CD30 0.0063 expressing PAX5 0.0263 positive for CD15 0.0063 large atypical cells 0.0196 scattered large ... cells 0.0117 infiltrate of large ... cells with lobated nuclei 0.0060 large cells are negative for CD20 0.0103 0.0064 0.0046 0.0042 0.0027 many large cells large neoplastic cells stain positively for CD15 multilobated ... cells background contain ... lymphocytes Hodgkin 3' Subgraph Group 0.0233 necrosis 0.0142 0.0106 0.0099 0.0098 dense sclerosis vaguely nodular pattern collagen fibrosis mixed inflammatory cells 0.0073 nodular pattern 0.0053 atypical infiltration 0.0043 collagen bands 0.0058 0.0058 0.0049 0.0040 0.0034 inflammatory cells large cells are Reed-Steinberg like rare cells are .. positive histiocytes irregular nuclei Hodgkin 4t Subgraph Group 0.0237 positive for CD3 0.0209 B-cells positive for immunogiobulin lambda chains 0.0179 small CD3 positive lymphocytes 0.0169 CD3 positive T-cells 0.0140 B-cells expressing ... 
kappa and lambda light chains 0.0100 expression of B-cell antigens 0.0053 number of .. B-cells 0.0048 large atypical cells 0.0042 nodular lymphoid proliferation 0.0047 expressing CD45 0.0018 areas of vague nodularity 0.0017 cells ... with Reed-Sternberg forms 0.0025 positive for OCT2, PAX5 0.0020 many scattered ... T-cells

Table 4-6 Top higher-order feature groups associated with Hodgkin lymphoma. Subgraphs are translated to partial sentences. Partial sentences that are not mentioned in feature analysis are grayed out.

We note the advantage of using subgraph groups as features compared to using individual subgraphs as features. For example, in the third follicular lymphoma associated subgraph group, standalone positivity or negativity on CD5, CD10, and CD23 may not be discriminative enough, but collectively they offer medically important information favoring follicular lymphoma.

We next look into why the atomic feature groups as jointly discovered by SANTF help to better group individual subgraphs, in order to validate our intuition that exploiting interactions between both feature types is beneficial. Continuing from the analysis of important higher-order feature groups, we analyze the word group distributions associated with individual subgraphs. In the first DLBCL associated subgraph group in Table 4-4, the following subgraphs (partial sentences) are together ranked among the top subgraphs: "... large cells predominate ...", "... large cells stain for CD79a ...", "... large cells stain positively for CD20 ...", "... large lymphoid cells ...", "... cells are CD30+, MUM1+ ...", "... atypical cells ...". By contrast, we did not find a similar grouping in patterns generated by those baselines that have subgraphs as features (baselines 2 and 3 in Table 4-2; k-means clustering does not produce subgraph groups).
The positivity for the antigens CD79a and CD20 may associate with the scattered large LP cells in NLPHL, but the group includes additional positive staining for MUM1 and CD30, which favors the differential diagnosis of DLBCL. We look into the above six subgraphs and identify word groups associated with each subgraph. Intuitively, such associations are expressed in the core tensor and one can sum out the patient mode to explicitly associate a subgraph with the word groups (see SANTF algorithm section on how to identify word groups associated with a specific subgraph from the tensor factorization results). The associated word group distribution for each subgraph is shown in Figure 4-4, and their correlation coefficients are shown in Figure 4-5. It becomes evident from Figure 4-5 that each of the subgraphs is correlated with at least one other subgraph with a correlation coefficient above 0.5, indicating relatively strong correlation. Figure 4-4 gives details on which word groups help to correlate subgraphs. For example, the word groups 10, 13, 16, 17, 26, 28, 33 and 52 help correlate subgraphs "... large cells stain positively for CD20 ..." and "... large cells stain for CD79a ...". This illustrates the benefits of using word group distribution to correlate subgraphs. In summary, analysis of word groups suggests that adding the word mode (including covered and contextual words) to the tensor and jointly learning the subgraph groups and the word groups help to better capture the correlations between subgraph features.

[Figure 4-4: six panels, one per subgraph, each titled with the subgraph and "word group dist": "... large cells predominate ...", "... large cells stain positively for CD20 ...", "... large cells stain for CD79a ...", "... large lymphoid cells ...", "... cells are CD30+, MUM1+ ...", "... atypical cells ...". Horizontal axis ticks run from 10 to 60; vertical axis ticks run from 0.05 to 0.25.]

Figure 4-4 Word group distribution for six of the top subgraphs in the first DLBCL associated subgraph group. For example, the word groups 10, 13, 16, 17, 26, 28, 33 and 52 help correlate subgraphs "... large cells stain positively for CD20 ..." and "... large cells stain for CD79a ...", as highlighted in light gray.

4.4 Discussion

Currently the selection of SANTF parameters such as core tensor size relies on cross validation. We recognize the potential of using a non-parametric Bayesian approach to discover such parameters directly from data. For example, in the non-parametric Bayesian setting, each patient in a dataset can be associated with hidden variables describing groups (causes) that are responsible for generating the patient's data. Although there can be an infinite number of possible groups to choose from, under proper prior distributions (e.g., specified using the Indian buffet process [292]), only a finite number of groups would be selected. Care needs to be taken when defining generative processes for multiple types of features to account for the fact that atomic features aggregate into higher-order features and to allow for an efficient inference algorithm. Clearly, the performance of SANTF depends on the nature of the relationships among the various modes of the tensor. We suspect that there is an information-theoretic analysis that can shed light on quantifying these relationships, where the suggested generative model could provide a basis for such an analysis.

[Figure 4-5: upper triangular correlation matrix whose rows and columns are the six subgraphs "... large cells predominate ...", "... large cells stain for CD79a ...", "... large cells stain positively for CD20 ...", "... large lymphoid cells ...", "... cells are CD30+, MUM1+ ...", "... atypical cells ...".]
Figure 4-5 Correlation between six of the top subgraphs (partial sentences) in the first DLBCL associated subgraph group. Only the upper triangular matrix is shown due to symmetry.

SANTF is currently computationally intensive. The tensor factorization on average takes 22 minutes on a computer with Intel Core 2 Duo P8600 and 8 GB RAM. The steps of document preprocessing, including parsing, UMLS concept identification and graph/subgraph construction, also take a considerable amount of time. We parallelize the computations into batches of 50 patients and run them on the pHPC clusters at Partners Health Care, which have 600 processing cores in total and a maximum concurrency of 100 cores per user. The parallel pre-processing time is under 30 minutes, which could be improved by parallelization into smaller batches on a larger cluster. We also plan to explore parallelization and approximation techniques such as stochastic gradient descent to speed up tensor factorization in future work.

Parsing challenges may arise with less formal clinical notes such as discharge summaries. For example, many connecting parts of speech (conjunctions, articles, prepositions) may be elided, which makes dependency parsing difficult even for statistical parsers. For less formal clinical notes, we expect a hybrid form of NLP may work better. Namely, for longer sentences, graph construction can be based on dependency parsing, while for shorter sentences, graph construction can be based on co-occurrence of concepts. Choosing the threshold of longer vs. shorter sentences is non-trivial and may depend on the characteristics of clinical notes; we intend to explore such trade-offs in future work. On the other hand, different institutions may have different clinical documentation systems and styles.
Such generalizability challenges are partly addressed by our clinical text subgraph mining approaches [87] such as using UMLS concepts as subgraph nodes and ignoring dependency types, which can mitigate the impact of the terminology and style differences between institutions. Using atomic features to correlate higher-order features as done by SANTF also helps connect higher-order features whose differences are mainly in writing style.

4.5 Conclusions

We proposed a novel unsupervised framework of subgraph augmented non-negative tensor factorization (SANTF), which can automatically generate machine learning models that are easily interpretable to clinicians. SANTF can jointly model the interactions among different types of features by integrating them into the learning objective. We applied SANTF to unsupervised learning tasks on clustering lymphoma subtypes based on narrative text from pathology reports. We established nine baselines with widely-used NMF and k-means clustering methods. For each NMF or k-means configuration, the first baseline explores the atomic features. The second baseline explores the higher-order subgraph features. The third baseline explores both types of features but not their correlations. Experimental evaluation demonstrated that SANTF significantly outperforms all nine baselines, with margins of over 10% in average F-measure. A closer look at the subgraph groups that are generated by SANTF offers more clinical insights about lymphoma subtypes than atomic features or even standalone subgraphs. We also found that the atomic feature groups as jointly discovered by SANTF help to better correlate individual subgraphs, validating our intuition that exploiting interactions between different feature types is beneficial.

Chapter 5.
Subgraph Augmented Non-negative Matrix Factorization (SANMF) in Modeling ICU Physiologic Time Series

This chapter describes an extension of subgraph mining and factorization algorithms applied to modeling ICU physiologic time series. All monitors come with a trade-off between sensitivity and specificity. In the ICU setting, sensitivity is often favored over specificity, thus alerts based on whether the value of a single parameter crosses a threshold may result in a prevalence of false alarms [293]. A better trade-off between sensitivity and specificity can be achieved if a model can consider multivariate time series comprehensively [294]. The assumption is that more volatile patients display concerted progressions in multiple physiologic variables, which are associated with high risk of mortality. To this end, data mining can play an important role in exploring archived ICU physiologic time series in order to build calibrated clinical models for mortality risk stratification. Such models should be able to detect clinical state changes over a certain period of time, in order to help clinicians interpret ICU data more intuitively and more accurately.

Models that appear as "black boxes" to clinicians, however, form a poor basis for decision support. We need to be able to translate complex meaningful clinical events to detailed features needed by a machine learning model. For example, vital measurements and laboratory test values fluctuate as time progresses (e.g., a patient's glucose level may increase from 158 mg/dL to 189 mg/dL after 53 minutes then fall to 172 mg/dL after another 62 minutes). We refer to these events as temporal trends. In contrast, the standalone numerical measurements (e.g., 158 mg/dL, 189 mg/dL and 172 mg/dL for glucose level) are snapshots with respect to single time points.
Intuitively, the higher-order features (temporal trends) are more expressive and informative, but their extraction is often difficult and involves manually pre-specifying rules or patterns and matching against time series [97,103,259]. In contrast, snapshot measurements have been widely used due to their simple extraction and robust statistical properties. However, snapshot measurements are less informative and interpretable than higher-order features. In addition, higher-order features need to be considered in groups, as the underlying pathophysiologic evolution of a patient (e.g. kidney failure) usually manifests itself through multiple physiologic variables (e.g., abnormalities in glomerular filtration rate, blood urea nitrogen, creatinine, etc.).

5.1 Background

Decision support tools in the ICU are receiving growing attention as critical care has become an increasingly multidisciplinary team effort. How to integrate the entire scope of information for improving patient outcome is complex due to ongoing evolution in clinical evidence supporting the involvement of an expanding set of physiologic variables such as fluid composition and balance [295]. Such integration calls for automated and informative tools to model the effects of physiologic variables on patient outcome. We focus on mortality as an outcome proxy.

Previous work in correlating ICU physiology with mortality risk generally falls into two categories. Score-based methods (e.g., SAPS-II [40], APACHE [39] and SOFA [38]) assume a resource-limited ICU setting and aim to select a limited set of commonly measured clinical predictors that can be aggregated into a severity score and best associated with a particular outcome. Other work adopted a multivariate data mining perspective. Hug et al.
[44] considered a comprehensive set of physiologic measurements from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-II) clinical dataset [296] and manually defined a set of trend patterns (e.g., slope of a measurement during a particular time interval). However, physiologic measurements and trends were treated as independent features in the regression model, without explicitly accounting for the fact that multiple measurements and trends could be attributed to the same underlying pathophysiologic states. Cohen et al. [43] used hierarchical clustering to extract 10 clusters as clinically relevant patient states from physiologic measurements, over a set of 17 patients and 14 measurements. Kshetri [297] experimented with k-means clustering and faced scalability challenges on the MIMIC-II dataset, with over 50 physiologic variables and tens of thousands of patients. Quinn et al. [41] developed a factorial switching linear dynamical system to model the patient states underlying 8 physiologic measurements. However, these multivariate data mining models require advice from practicing physicians on cluster numbers or switching states and are difficult to scale to many more physiologic variables. Joshi et al. [45] manually clustered the physiologic measurements into organ specific patient states by associating each measurement with the status of a particular organ, and achieved state-of-the-art performance on 30-day mortality prediction from the MIMIC-II dataset. Despite partially addressing the feasibility challenge, such manual feature clustering can be a subjective call. For example, a low hematocrit may be linked to blood loss, bone marrow problems, or kidney problems, among a variety of other problems. In addition, the manual clustering is on single time point measurements.
Addressing the unanswered questions in previous research, we study how to group temporal progression trends instead of single time point measurements, and how such a grouping can be performed in an evidence-driven fashion over a comprehensive set of physiologic variables. We represent the temporal trends as graphs; this preprocessing approach falls into the category of time-series symbolization methods that discretize time series into sequences of symbols and attach meaning to the symbols [298,299]. Our approach differs from existing work in that it calculates a customized z-score to perform measurement-axis discretization and it handles time series with irregularly sampled time points.

5.2 Methods

In this section, we develop an unsupervised feature learning algorithm in order to build machine learning models that are interpretable to clinicians. The model adopts non-negative matrix factorization to discover groups of subgraph-encoded temporal progression trends; hence the name subgraph augmented non-negative matrix factorization (SANMF).

5.2.1 Workflow of SANMF

We first outline the workflow of the SANMF algorithm in Figure 5-1. ICU physiologic time series are first converted to graph representations. The graph representation is derived by discretizing time and measurement axes for physiologic measurements, as shown in Figure 5-2. We use frequent subgraph mining (FSM) [190] tools to collect important subgraphs where the subgraphs are identified as common temporal trends of the physiologic variables. Examples of temporal trends for physiologic time series are shown in Figure 5-2. With such representations, subgraphs encode temporal trends, and we use "subgraphs" and "temporal trends" interchangeably within the context of this chapter.
We model the correlation between the subgraphs, and apply nonnegative matrix factorization to discover groups of subgraphs and patients, and then train a logistic regression model to predict the mortality risk using subgraph groups as features. We next explain each step in more detail.

[Figure 5-1 flow chart. Recoverable block labels: Time Window Selected Time Series; Computing z'-score; Organ Level Summarization; RDF; Normalized Time Series; Discretization & Interpolation; D,I-measure; Graphs; Frequent Subgraph Mining; Subgraphs; NMF; Logistic Regression Based Classifier; Mortality Risk Stratification. Legend: RDF: Radial Domain Folding; NMF: Non-negative Matrix Factorization.]

Figure 5-1 The workflow of subgraph augmented non-negative matrix factorization (SANMF). We focus on the physiologic time series from the second half of the first day, balancing the trade-off between early detection of clinical deterioration and data availability. In the flow chart, shaded blocks indicate comparison models. The block with bold fonts corresponds to the features produced by the SANMF model.

5.2.2 Representing time series as graphs

Figure 5-2 shows the steps before matrix factorization, with three example variables. To test the ability to detect deterioration early on, we focus on the data from the second half of the first day after patients' admissions to an ICU. We exclude the first half of the first day because many measurements are not yet available in that time period. In Figure 5-1, it becomes clear that the time series of different variables may have different sampling times and sampling frequencies, so we preprocess the time series. We first fill in the missing values, using a sample-and-hold heuristic, which was also shown to be effective by previous work on MIMIC-II data [44,45].
More advanced imputation algorithms such as EM or Gaussian process inference may lead to more accurate estimation of the missing values. For this task, we stick to sample-and-hold, as we compare our model with a state-of-the-art system [45] that also followed the same heuristic on MIMIC-II data.

We next convert time series into graphs so that multivariate temporal patterns can be automatically mined. To this end, we perform discretization on both the time axis and the measurement axis. With the filled and sliced time series, we first compute a customized z-score (z'-score) where we define everything within the reference range of a certain test to be 0 [45]. For a physiologic variable $x$, let $x_l$ and $x_h$ be the low and high ends of the reference range, let $j$ index different ICU patient stays, and let $\mu(x)$ and $\sigma(x)$ be the mean and standard deviation of variable $x$ across different ICU patient stays. The z'-score is calculated using the following equations:

$$z(x_j) = \frac{x_j - \mu(x)}{\sigma(x)} \tag{5-1}$$

$$z'(x_j) = \begin{cases} 0 & \text{if } z(x_l) \le z(x_j) \le z(x_h) \\ z(x_j) - z(x_h) & \text{if } z(x_j) > z(x_h) \\ z(x_j) - z(x_l) & \text{if } z(x_j) < z(x_l) \end{cases} \tag{5-2}$$

Each individual measurement is then discretized based on whether its value is within the reference range (label 0), within one $\sigma$ outside the reference range (label ±1), or beyond one $\sigma$ outside the reference range (label ±2), with the sign indicating whether the value lies above or below the range. Such discretization is essentially a thresholded round-up from equation (5-2). We discretize the time axis by linearly interpolating the time series and resampling at regularly spaced time intervals. We determined empirically (by cross-validation over possible choices including 2, 4, or 6 hour intervals) that two-hour time intervals were best in our experiment. After discretization, we generate the time series graph for each measurement by connecting the discretized measurement values that are adjacent on the time axis.
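As a rough illustration of equations (5-1) and (5-2) and of the discretization described above, the following Python sketch computes the z'-score for one variable, resamples it on a two-hour grid and maps it to labels. The function names, the reference range and the population statistics are hypothetical values chosen for the example, and interpolating the z'-scores (rather than the raw values) before labeling is an assumption about the order of the steps.

```python
import numpy as np

def z_prime(x, x_low, x_high, mu, sigma):
    """Customized z-score: 0 inside the reference range, signed excess outside."""
    z = (x - mu) / sigma              # equation (5-1)
    z_low = (x_low - mu) / sigma
    z_high = (x_high - mu) / sigma
    if z > z_high:                    # above the reference range
        return z - z_high
    if z < z_low:                     # below the reference range
        return z - z_low
    return 0.0

def label(zp):
    """0 in range, +/-1 within one sigma outside the range, +/-2 beyond."""
    if zp == 0:
        return 0
    magnitude = 1 if abs(zp) <= 1 else 2
    return magnitude if zp > 0 else -magnitude

def resample(times_min, zps, start=720, stop=1440, step=120):
    """Linearly interpolate onto a regular grid (minutes since admission)."""
    grid = np.arange(start, stop + 1, step)
    return grid, np.interp(grid, times_min, zps)

# Hypothetical BUN readings (mg/dL); range and statistics are made up.
times = [720, 800, 1000, 1300, 1440]
values = [10.0, 30.0, 50.0, 30.0, 4.0]
zps = [z_prime(v, x_low=7, x_high=20, mu=25, sigma=15) for v in values]
grid, zp_grid = resample(times, zps)
labels = [label(v) for v in zp_grid]
```

The resulting label sequence, one label per two-hour grid point, is the node sequence from which a time series graph for this variable would be built.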
We use three types of edges to distinguish changes between adjacent nodes, namely up, down and same, and to encode partial directionality in temporal progression. After sample-and-hold, 27.5% of measurements are still missing. As a result, after discretization and graph conversion, the corresponding nodes are labeled as missing values. Note that the signal fluctuation rates vary across different physiologic variables. We intend to pursue alternative and adaptive resampling frequencies in future work.

5.2.3 Frequent subgraph mining

With time series graphs, we perform frequent subgraph mining to produce the time series trends that are repeated in the dataset. The intuition is that similar patients undergo similar physiologic trajectories during their ICU stays. We refer the reader to section 3.4.4 for definition and intuition on frequent subgraph mining. In this chapter, we use the frequent subgraph miner MoSS [190] with frequency threshold empirically chosen (by cross validation on choices including 5, 10 or 15 as threshold) to be 10 (i.e., subgraphs must occur at least 10 times in the dataset). Example frequent subgraphs are shown in Figure 5-2. We require that frequent subgraphs must not have missing value nodes. As we are focusing on deterioration (abnormality) detection, we also exclude subgraphs that start with multiple zero labeled nodes or end with multiple zero labeled nodes.

[Figure 5-2: three example time series panels (Blood Urea Nitrogen, Mean Arterial Pressure, Temperature) with time axis ticks from 840 to 1440, followed by the steps Computing z-score, Interpolation and discretization, Translating graphs, and Frequent Subgraph Mining, ending in example subgraphs for BUN, MAP and Temperature whose nodes carry labels from -2 to 2 and whose edges are marked u (up), d (down) or s (same).]

Figure 5-2 Graph generation and subgraph mining in SANMF. Shown in this figure is the graph representation for three example ICU physiologic time series. BUN is blood urea nitrogen. MAP is mean arterial pressure. Example frequent subgraphs are shown after the frequent subgraph mining steps. The figure shows three separate subgraphs in the end.

The above frequent subgraph mining steps generate 5534 frequent subgraphs. Among them, some smaller subgraphs are subisomorphic to other larger frequent subgraphs. As noted in section 3.4.5, when a larger subgraph is frequent, all of its subgraphs are necessarily also frequent. Furthermore, if a patient case has a larger subgraph, then both the larger and smaller subgraphs are counted for that patient. This may cause the signal from larger subgraphs to be overwhelmed by the signal from many smaller subgraphs. Therefore, we kept only the larger subgraphs in such pairs when a patient case has both. Note that such filtering is different from the notion of mining maximal frequent subgraphs, where only subgraphs that are not a part of any other frequent subgraphs at all are collected [300]. As noted in section 3.4.5, it is cost-prohibitive to perform a full pairwise check because the subisomorphism comparison between two subgraphs is already NP-complete [100], and a pairwise approach would ask for over 15 million such comparisons for our task.
In our case, we only need to compare subgraph pairs from the same physiologic variable. Furthermore, subgraph subisomorphism comparison can be simplified into string matching, as our subgraphs are essentially sequences. Combining the two observations, the algorithm for determining the subisomorphism relation among frequent subgraphs is shown in Table 5-1, which is a variant of the one shown in Chapter 3. The above filtering steps in fact exclude some small subgraphs completely, reducing the final number of subgraphs to 5387.

Subisomorphism for set of subgraphs
input: S - set of subgraphs
output: m - adjacency matrix of subisomorphism among subgraphs in S
 1  categorize subgraphs in S according to their variables
 2  foreach v in variables:
 3      stable sort S_v in ascending order of number of nodes
 4      for i = 1 to length(S_v) - 1
 5          for j = i+1 to length(S_v)
                // ids is the index of smaller subgraph in S
                // idb is the index of bigger subgraph in S
 6              ids = S_v[i]
 7              idb = S_v[j]
 8              if subStringMatch(S[ids], S[idb])
 9                  m[ids, idb] = 1
10  return m

Table 5-1 A simplified algorithm for determining subisomorphism relation among time series subgraphs. The simplification mainly comes from variable partition (line 1-2) and reduction of subisomorphism to substring match (line 8) for time series subgraphs.

5.2.4 SANMF algorithm

Non-negative matrix factorization (NMF) has been a highly effective unsupervised method [264] to cluster similar patients [265] and sample cell lines [266], to identify subtypes of diseases [267] and to learn genetic expression patterns [269,272,273,301]. However, none of these approaches model the correlations among temporal trends, and some do not even consider temporal trends. We observe that a patient's underlying pathophysiologic evolution usually manifests itself through a group of temporal progression patterns of multiple physiologic variables.
This motivates us to use NMF to group time series subgraphs by factorizing the patient-by-subgraph count matrix, hence the name subgraph augmented NMF (SANMF). A schematic view of SANMF is shown in Figure 5-3. Let $M$ be the patient-by-subgraph count matrix of dimension $P \times S$, where $P$ is the number of patients and $S$ is the number of subgraphs. NMF approximates $M$ using two lower ranked matrices $U$ (of dimension $P \times S_g$, where $S_g$ is the number of subgraph groups) and $V$ (of dimension $S_g \times S$), as formalized in the following equation.

$$\min_{U,V} \|M - UV\| \quad \text{s.t. } U \ge 0,\ V \ge 0 \tag{5-3}$$

where $\|\cdot\|$ indicates the squared Frobenius norm (the sum of the squares of all entries in a matrix) and $U \ge 0$ means $U$ being entry-wise non-negative. Intuitively, each row of $V$ gives the composition of a subgraph group, and each row of $U$ reveals how a patient may be viewed as having a mixture of subgraph groups (approximating patterns of pathophysiologic evolution). As we focus on count data that is by definition nonnegative, we use NMF instead of other grouping methods such as k-means or principal component analysis (PCA) that do not have a built-in nonnegativity constraint.

The subgraph subisomorphism filtering step in Table 5-1 weakens the correlation between frequent subgraphs to a certain degree because the filtering step prevents certain subgraph co-occurrences from being counted. To systematically capture the subgraph correlation, we include single node subgraphs in the matrix $M$, but multiply counts of these singleton subgraphs by 0.5. Empirically, the factor 0.5 worked well in balancing the trade-off between preventing overwhelming signals from singleton subgraphs and capturing correlations between other frequent subgraphs. The NMF solver we used is the projected gradient NMF [302] implemented in Scikit-learn [303]. We used nonnegative double singular value decomposition as a deterministic initialization method [304].
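To make the factorization in equation (5-3) concrete, here is a minimal NumPy sketch. It swaps in the classic multiplicative-update rule in place of the projected gradient solver used in this work, uses random rather than NNDSVD initialization, and omits any sparsity constraint. The toy matrix, the function names and the choice of which columns count as singleton subgraphs are all hypothetical.

```python
import numpy as np

def downweight_singletons(M, singleton_cols, factor=0.5):
    """Scale the count columns of single-node subgraphs (factor 0.5 in the text)."""
    M = M.astype(float).copy()
    M[:, singleton_cols] *= factor
    return M

def nmf_multiplicative(M, n_groups, n_iter=200, seed=0, eps=1e-9):
    """Approximate M (P x S) by U (P x n_groups) times V (n_groups x S), both >= 0."""
    rng = np.random.default_rng(seed)
    P, S = M.shape
    U = rng.random((P, n_groups)) + eps
    V = rng.random((n_groups, S)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative by construction.
        V *= (U.T @ M) / (U.T @ U @ V + eps)
        U *= (M @ V.T) / (U @ V @ V.T + eps)
    return U, V

# Toy patient-by-subgraph count matrix: 8 patients, 6 subgraphs.
rng = np.random.default_rng(1)
M = rng.poisson(2.0, size=(8, 6))
M = downweight_singletons(M, singleton_cols=[0, 1])  # pretend columns 0, 1 are singletons
U, V = nmf_multiplicative(M, n_groups=3)
error = np.sum((M - U @ V) ** 2)                      # squared Frobenius norm
```

In this sketch row g of V holds the subgraph weights of group g, and row p of U holds the group mixture of patient p, matching the dimensions given for equation (5-3).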
We also enforced sparsity on subgraph groups [305] so that a group has only a limited number of non-zero weighted subgraphs and places most weight on only a few subgraphs, which is easier for clinicians to interpret.

[Figure 5-3 schematic: the patient-by-subgraph count matrix M factorized into patient groups and subgraph groups, with example subgraphs over ArtBE, SBP, MAP, BUN and Temperature listed under M, under Subgraph Group 1 and under Subgraph Group 2.]

Figure 5-3 Subgraph augmented non-negative matrix factorization model. In the figure, M is the patient-by-subgraph count matrix. Below M are some example subgraphs. We also show example subgraph group 1 and subgraph group 2 after factorization. It is often desirable to have some subgraph groups indicate a general progression to the better state (e.g., subgraph group 1), or to the worse state (e.g., subgraph group 2).

5.2.5 Feature group discovery and association using SANMF

In SANMF, the row vectors in the subgraph factor matrix V specify the grouping of subgraphs. Such groupings can be viewed as mixtures of subgraphs, as they allow sharing of a subgraph among different groups as specified by its fractional weights across groups. In Figure 5-3, two example subgraph groups are shown. The top ranked subgraphs in subgraph group 1 indicate a general progression to an improved state. The top ranked subgraphs in subgraph group 2 indicate a general progression to a worse state. Namely, Blood Urea Nitrogen (BUN) increasing from 1 to 2 is worsening, as is Mean Arterial Pressure (MAP) decreasing from 0 to -1 or -2. Temperature changing from 1 to -1 can be good or bad, depending on the risks of high vs. low temperatures. But the overshoot of temperature change likely suggests problematic conditions.
The motivation is to identify some subgraph groups that can indicate concerted progression patterns of physiologic variables as driven by the patient's underlying pathophysiologic evolution. The subgraph groups as specified in V are used as features in logistic regression with the instance-feature matrix being U. Using the trained regression model, we rank the subgraph groups by their regression coefficients and focus on the top subgraph groups that are associated with high mortality risk.

5.2.6 Evaluating the groups discovered by SANMF

Because there is no innate way to determine whether the groupings of subgraphs discovered by SANMF are good or poor, we evaluate their utility as features, abstracted from the raw data, in a prediction model. We assume that good features will improve prediction and will give us some insights into which temporal progression patterns are indicative of our predicted endpoint.

We use physiologic time series from the MIMIC-II Database [296]. The time series include laboratory test values and physiologic measurements captured from patients monitored in the ICU at Beth Israel Deaconess Medical Center (BIDMC), as shown in Table 5-3. Our dataset is a subset of the one used by Joshi et al. [45] (patients from the year 2000 to 2008); we only include those patients who have at least one day of time series data. The outcome we predict is whether a patient survives or dies in the ICU or within 30 days after ICU discharge, as shown in Table 5-2, from data available about each patient during the period between 12 and 24 hours after their admission to the ICU. Choosing a relatively long time horizon emphasizes our motivation to detect clinical deterioration early on. We partitioned the cases equally, stratified by mortality, into a training set (3932 cases total) and a testing set (3931 cases total).
Patient ICU Stays
Mortality             Number of Cases   Number of Training Cases   Number of Test Cases
≤ 30 days             788               383 (9.7%)                 405 (10.3%)
> 30 days or alive    7075              3549 (90.3%)               3526 (89.7%)

Table 5-2 Statistics of experiment data. The table includes the patients' 30-day mortality distribution in ICU (both absolute numbers and percentages). The dataset is split equally into a training set and a test set.

To evaluate the effectiveness of SANMF in abstracting raw data into more highly predictive features, we use five-fold cross-validation on only the training set to choose the number of subgraph groups, and use these subgraph groups as the independent features to train a logistic regression predictive model. We then evaluate the model on the held-out test set, and compare its performance against the following models: (a) as a baseline, 30-day mortality prediction by a logistic regression model using an approximation of the SAPS-II score [44] and its log-transformation as predictors, where the SAPS-II variable "chronic diseases" is approximated using ICD9 codes and the variable "type of admission" is approximated using the ICU service type; (b) a state-of-the-art organ-level summarization model [45] modified to account for our use of a 12-hour time window rather than a snapshot of time points by replacing a binary representation of whether an organ system is in a specific state by a count of the number of times it is in that state during the 12 hours; (c) the D,I-measure based on our discretized (D) and interpolated (I) data values, where we also count the number of times each physiologic variable took on a discretized value during the 12 hours; and (d) a model based on treating each of our common subgraphs as a separate feature. The comparison models are shaded in Figure 5-1. We compare the Area Under the ROC Curve (AUC) of our model against those of the other models.
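For reference, the AUC used in this comparison equals the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor, with ties counted as one half. The small Python function below computes it directly from that definition. The function name and the toy labels and scores are illustrative only and are not results from this study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0       # positive case ranked above negative case
            elif p == n:
                wins += 0.5       # tie
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]            # 1 = died within 30 days (made-up data)
toy_scores = [0.1, 0.4, 0.35, 0.8]   # predicted mortality risk
toy_auc = auc(toy_labels, toy_scores)  # 3 of the 4 pairs are ranked correctly
```

The pairwise form is quadratic in the number of cases, which is acceptable for a test set of a few thousand stays; a rank-based formulation would give the same value more efficiently.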
Variable - Description
Age - Age of the patient upon admission
Airway Resistance - The resistance of the respiratory tract to airflow during inspiration and expiration
Albumin - Albumin in blood
ALT - Alanine aminotransferase in blood
Arterial Base Excess - Excess in the amount of base present in arterial blood
Arterial CO2 - Arterial carbon dioxide
Arterial PaCO2 - Arterial carbon dioxide tension
Arterial PaO2 - Arterial oxygen tension
Arterial pH - pH level in arterial blood
AST - Aspartate aminotransferase in blood
AST/ALT - Aspartate aminotransferase / alanine aminotransferase
BUN - Blood urea nitrogen
BUN/Creatinine - Blood urea nitrogen / Creatinine
Ca - Calcium level
Cardiac Index - Relates the cardiac output (CO) from left ventricle in one minute to body surface area
Central Venous Pressure - Blood pressure in the thoracic vena cava
Cl - Chloride level
Creatinine - Level of creatinine in the blood
Delivered Tidal Volume - Air volume of lung without extra effort
Diastolic blood pressure - Minimum blood pressure during heartbeat
Direct bilirubin - Level of bilirubin conjugated with glucuronic acid
eGFR - Estimated glomerular filtration rate
FiO2Set - Fraction of inspired oxygen set on ventilator
Glasgow Coma Scale - Glasgow coma scale
Glucose - Glucose level
Heart Rate - Heart rate per minute
Hematocrit - Hematocrit level
Hemoglobin - Hemoglobin level
INR - Prothrombin time international normalized ratio
Ion Calcium - Ion Calcium level
K - Potassium level
Lactate - Lactate level
MAP - Mean arterial pressure
Mg - Magnesium level
Minute Ventilation - Volume of gas exchanged from lung per minute
Na - Sodium level
PaO2/FiO2 - Partial pressure arterial oxygen / Fraction of inspired oxygen
Partial Thromboplastin Time - Time it takes for blood to clot
PEEPSet - Positive end-expiratory pressure set on ventilator
PIP - Peak inspiratory pressure
Plateau Pressure - Pressure applied (in positive pressure ventilation) to the small airways and alveoli
Platelets - Platelets count
Prothrombin Time - Time it takes for plasma to clot
RBC - Red blood count
Respiratory Rate - Respiratory rate per minute
RSBI - Rapid shallow breathing index*
RSBI Rate - Rapid shallow breathing index rate change
SaO2 - Saturation of arterial oxygen
Systolic blood pressure - Maximum blood pressure during heartbeat
Temperature - Body temperature
Total Bilirubin - Level of bilirubin conjugated or unconjugated
tProtein - Total protein in the blood plasma
Urine/Hour/Weight - Urine output per hour per kg of body weight
WBC - White blood count

Table 5-3 Physiologic time series predictor variables from MIMIC-II dataset. Demographic information such as age is also included.

5.3 Results

5.3.1 Method validation on ICU patients' mortality risk prediction

When using NMF to identify latent groups of features and reduce data dimensionality, the number of groups needs to be determined empirically. We chose this parameter by 5-fold cross validation on the training data and considered between 10 and 120 groups (in increments of 10), as shown in Figure 5-4 (a). For each number of groups and for each of the cross-validation runs, we build our predictive model and evaluate it on the remainder of the training data, averaging the resulting AUC from each of the runs. In addition to NMF, we also show the performance obtained when PCA is used instead to group subgraphs. Figure 5-4 (b) shows the corresponding performances when evaluated on the held-out test data, for reference. Both methods show similar AUCs; NMF in fact outperforms PCA on the held-out test evaluation, suggesting that NMF is less prone to overfitting than PCA due to its additional non-negativity constraints.

It is worth emphasizing the built-in non-negativity constraint of NMF and the additive interpretation it affords. Namely, the weight of each subgraph in a group is non-negative and can be interpreted as its contribution to the group. In the PCA setting, it is not intuitive how to interpret a negative weight of certain subgraphs within a group.
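To make the non-negativity and additive interpretation concrete, the following is a minimal sketch of NMF on a toy patient-by-subgraph count matrix. It uses the standard multiplicative update rules for the Frobenius loss as an assumption, since the solver is not specified here; the matrix `X`, the helper names, and the iteration count are hypothetical.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(X, U, V):
    """Squared Frobenius norm of X - UV."""
    R = matmul(U, V)
    return sum((x - r) ** 2 for xr, rr in zip(X, R) for x, r in zip(xr, rr))

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factor X (instances x subgraphs) into U (instances x groups) and
    V (groups x subgraphs) with multiplicative updates (assumed solver)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    V = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # V <- V * (U^T X) / (U^T U V), element-wise
        Ut = transpose(U)
        num, den = matmul(Ut, X), matmul(matmul(Ut, U), V)
        V = [[v * a / (d + eps) for v, a, d in zip(vr, ar, dr)]
             for vr, ar, dr in zip(V, num, den)]
        # U <- U * (X V^T) / (U V V^T), element-wise
        Vt = transpose(V)
        num, den = matmul(X, Vt), matmul(U, matmul(V, Vt))
        U = [[u * a / (d + eps) for u, a, d in zip(ur, ar, dr)]
             for ur, ar, dr in zip(U, num, den)]
    return U, V

# Toy data: 4 patients x 4 subgraph counts with two obvious groups (made up).
X = [[2, 2, 0, 0], [4, 4, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
U, V = nmf(X, k=2)
```

Because the updates only multiply by non-negative ratios, every entry of U and V stays non-negative, so each row of V can be read as a list of subgraph contributions to a group, and each row of U as a patient's loading on those groups.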
From Figure 5-4 (a), we see that for NMF the AUC rises quickly and then plateaus as the number of groups increases. The maximum AUC on 5-fold cross validation is attained at 100 groups, which is the setting used when evaluating SANMF on the held-out test data. The performance results of SANMF, the comparison models and the baseline on held-out test data are shown in Figure 5-5. Comparing all the models and the baseline, we can see that the SAPS II approximation has an AUC of 0.673, which is lower than what is generally reported for SAPS II in the literature [44,45] (we discuss this and other related issues in Section 5.4). All the models that abstract the measured data by discretizing and aggregating them perform better, each with an AUC greater than 0.8. The predictive model based on our SANMF-derived subgraph feature groups has the best performance, at an AUC of 0.848, modestly outperforming the next-best model, which is based on abstraction by organ system (AUC of 0.827), by about 0.02 in AUC.

[Figure 5-4: two panels, (a) and (b), each plotting AUC against the number of groups for the methods NMF and PCA.]
Figure 5-4 AUC comparisons between NMF and PCA under specification of different numbers of subgraph groups. (a) AUC for the 5-fold cross validation experiment. (b) AUC for the held-out test experiment. For each number of groups, panel (a) shows a single AUC obtained by merging all the responses from the 5 validation subsets.

[Figure 5-5: ROC curves plotted against false positive rate. Legend: SAPS-IIa (AUC=0.673), Subgraph, D,I-measure, Organ (AUC=0.827), Subgraph NMF (AUC=0.848).]
Figure 5-5 ROC curves for the proposed method SANMF, comparison models including subgraph, discretized & interpolated measures (D,I-measure), and organ level status, as well as the baseline using the SAPS II approximation.
5.3.2 Important subgraph groups

Using the method described in Section 5.2.5, we identified the top four subgraph groups that are associated with high mortality risk and list them in Table 5-4. These subgraph groups typically contain physiologic trends that stay at or progress to more severe states. In addition, they generally indicate problematic pathophysiologic processes that involve one organ or multiple organs simultaneously, while still retaining the temporal trend details at the physiologic variable level.

30-day Mortality 1" Subgraph Group Glasgow Coma Scale -2 -2 -2 -2 -2 -2 Minute Ventilation -2 -2 -2 -2 -2 -2 Minute Ventilation -1 -1 -1 -1 -1 -1 PEEPSet 2222 22 Airway Resistance 10 Airway Resistance 0 11 Plateau Pressure 2222 22 PEEPSet 1I I I 1 1 PaO2/FiO2 02 Airway Resistance 11 0 30-day Mortality 3rd Subgraph Group 0.1650 INR 222222 0.1269 Prothrombin Time 2 222 2 2 0.0318 Prothrombin Time 1I I I 1 1 0.1000 0.0085 0.0082 0.0081 0.0066 0.0060 0.0059 0.0052 0.0047 0.0040 30-day Mortality 2 "d Subgraph Group BUN/Creatinine 2 2 2 22 2 BUN 22 2222 Albumin -2 -2 -2 -2 -2 -2 Arterial C02 1I I I 1 1 Heart Rate 0 -1 Na 222 222 Na I I I 11 Arterial C02 222 22 Arterial Base Excess 2 1 Delivered Tidal Volume -1 -1 -1 0 30-day Mortality 4th Subgraph Group 0.0539 Heart Rate 222222 0.038 1 Cardiac Index 0 10 0.0142 Respiratory Rate 222222 0.1634 0.0481 0.0155 0.0040 0.0040 0.0038 0.0034 0.0033 0.0032 0.0029 0.0095 Total Bilirubin 1I I I 1 1 0.0140 Heart Rate 020 0.0056 Total Bilirubin 2222 22 0.0075 Cardiac Index 1 0.0046 Diastolic blood pressure -1 -1 -1 -1 -1 -1 0.0071 Cardiac Index 1 0 0.0029 0.0025 0.0024 0.0022 ALT Prothrombin Time Minute Ventilation eGFR 2222 22 2222 2 0 -1 -2 -2 -2 -2 -2 -1 0.0069 0.0064 0.0062 0.0060 Lactate RSBI Rate Cardiac Index Urine/Hour/Weight -1 -1 -1 -1 -1 -1 1 0 0 11 0 -1 0 01

Table 5-4 Top subgraph groups associated with high mortality risks. Subgraphs are converted into a sequence to save space.
For each subgraph such as "0.1000 Glasgow Coma Scale -2 -2 -2 -2 -2 -2", 0.1000 is the membership coefficient, Glasgow Coma Scale is the measurement label, and "-2 -2 -2 -2 -2 -2" is the trend (flat in this case). Abbreviations used in the table include: PEEPSet - positive end-expiratory pressure set on ventilator; INR - prothrombin time international normalized ratio; ALT - alanine aminotransferase; PaO2 - arterial oxygen tension; FiO2 - fraction of inspired oxygen; BUN - blood urea nitrogen; Na - sodium level; eGFR - estimated glomerular filtration rate; RSBI Rate - rapid shallow breathing index rate change. Please refer to Table 5-3 for descriptions of these variables.

For example, the first associated subgraph group has several subgraphs suggesting that the patient mainly has pulmonary problems (continuously low minute ventilation, high plateau pressure, fluctuating airway resistance, and a high level of positive end-expiratory pressure set on the ventilator). This group also has the Glasgow Coma Scale staying very low, meaning that the patient is probably unconscious or sedated. Thus the entire group may be interpreted as the status of unconscious or sedated patients with severe pulmonary problems. The second associated subgraph group displays abnormal trends related to problems in multiple organs including kidney, lung, and heart. The third associated subgraph group displays abnormal trends in hematology, liver, heart, kidney, and lung. Similarly, the fourth associated group involves abnormality in heart, lung, acid-base homeostasis and kidney. An interesting observation is that the top-ranked subgraph groups contributing to high mortality risk usually involve problems in multiple organs rather than a single organ; multiple organ failure is indeed a common cause of mortality in ICU settings. This type of grouping is difficult to achieve using manual grouping according to only organ status as done by Joshi et al.
[45] and is considered one of the benefits of using NMF to automatically group temporal progression trends in an evidence-driven fashion.

5.4 Limitations and Discussion

We observe that the AUC of our approximation to SAPS II is lower than what was previously reported [40]. This may be because of the large amount of missing data in our data set and the approximations we make because our data do not include exactly the parameters used in SAPS II. The organ system based model also shows an AUC somewhat lower than reported, but we believe this is because we build our predictions only on data available between 12 and 24 hours after ICU admission, whereas the previous study uses the totality of data from a patient's ICU stay.

In this work, we use 30-day mortality (including both in-hospital mortality and mortality within 30 days after discharge) as an obtainable ground truth in order to demonstrate the efficacy of SANMF as an unsupervised feature learning algorithm. Similar methods may be applicable to improve not only mortality predictions but also predictions that indicate specific types of patient deterioration (e.g., anticipating hypotension, kidney injury, hepatic failure, sepsis) and to identify therapeutic opportunities (e.g., ability to wean from a ventilator, an intra-aortic balloon pump, vasopressors), as have been investigated by Hug [306]. Such improved models can provide decision support for treatment planning, informed staffing and operations.

Currently the selection of SANMF parameters such as the number of subgraph groups relies on cross validation. We recognize the potential of using a probabilistic Bayesian approach to define a generative process for the time series. For example, physiologic time series can be modeled with stochastic processes (e.g., Gaussian processes). Parameters of these stochastic processes can in turn be generated according to underlying pathophysiologic states (what we have been approximating with subgraph groups obtained by NMF).
Although there may be a large number of possible pathophysiologic states and stochastic process parameters to choose from at each level of a generative hierarchy, under proper prior distributions (e.g., specified using the Indian buffet process [292]), we can impose constraints so that only a limited number of them would be selected in a particular dataset. Although the Bayesian approach enjoys good properties such as its ability to integrate a priori clinical knowledge and its flexibility in the model size, care needs to be taken when defining stochastic processes for modeling time series to account for issues such as nonstationarity [262]. Clearly, the performance of SANMF depends on the nature of the correlations among multivariate temporal progression patterns, for which the suggested generative model could provide a basis for incremental analysis.

In this study, SANMF only takes account of the physiologic time series that are "observed" from the patients' underlying pathophysiologic evolution, in order to make a fair comparison to the baseline model SAPS II, which does not take into account treatment information. On the other hand, ICU admission, and hospital admission in general, can be better characterized as the interplay between observations and interventions. We plan to model such interplay with SANTF. Under this setting, SANTF will integrate interventions as a third mode of a tensor (i.e., a third dimension of a higher-order matrix). By grouping physiologic temporal patterns (corresponding to pathophysiologic state) and grouping intervention temporal patterns (corresponding to intervention regime), we expect to be able to predict outcomes for patient groups who have similar underlying pathophysiologic evolutions and who have undergone similar treatment regimes. This is a promising direction of research as it may elucidate effective treatment options for a particular patient sub-cohort based on evidence from previously admitted patients.
5.5 Conclusions

We proposed a novel unsupervised feature learning algorithm named subgraph augmented non-negative matrix factorization (SANMF), which is designed for analyzing temporal progression patterns in clinical time series data and is shown to improve both the accuracy and the interpretability of the learnt model for ICU mortality risk prediction. In summary, subgraph mining on multivariate time series leads to unsupervised extraction of multivariate temporal progression patterns, which are more informative than single time point measurements. The ensuing NMF explores the correlations among trends of different physiologic variables and reduces dimensionality at the same time, which then leads to better interpretability and improved accuracy. We compared SANMF to four different models using features with different granularities and time spans. SANMF outperforms all the comparison models and in particular demonstrates an AUC improvement from 0.827 to 0.848, compared to the state-of-the-art model that relies on manual feature engineering on the MIMIC-II dataset. A detailed feature analysis of the subgraph groups that are generated by SANMF offers more clinical insights about multiple organ problems associated with high mortality risk.

Chapter 6. Integrated Genomics, Transcriptomics, Medical Records, and Insurance Claims Analyses Identify Dyslipidemia as a Strong Inherited Risk Factor in ASD

This chapter describes a variation of subgraph mining algorithms used to detect co-regulated exon clusters in genomic analysis. Moreover, this chapter also demonstrates an innovative approach to perform integrative analysis with multiple data modalities including genomics, transcriptomics, laboratory test results, and insurance claims. The real-world problem we chose to study is Autism Spectrum Disorder (ASD).
In particular, our subgraph mining algorithm, Implication of Co-regulated Exons (ICE), automatically identifies clusters of exons whose expressions during brain development are highly correlated, thus implying co-regulation. To effectively apply ICE in interpreting the massive amounts of whole exome sequence data obtained from thousands of families with ASD, we employed an integrative analytic approach, which combines sequence data with neurodevelopmental expression patterns, familial segregation patterns, sexually dimorphic expression patterns, expression correlation, large-scale variant frequency data, EMR data, and healthcare claims data (Figure 6-1). In this integrative genomic analysis aggregating different modalities of patient data, the subgraph mining algorithm ICE serves as the basis and suggests a deeper understanding of the mechanisms of genetic variations by placing them in the context of exon clusters that harbor these variations.

A version of this chapter is currently under review as a research article whose coauthors Alal Eran, Nathan Palmer and Paul Avillach have also contributed significantly to the analysis.

Figure 6-1 Independent sources of information used to identify molecular networks contributing to ASD. (a) Deleterious variants called in whole exome sequence data from 3,531 individuals belonging to 1,704 simplex and 50 multiplex families. These include (a.ii) nonsense, (a.iii) frameshift, and (a.iv) splice site mutations, whose impact on the wild type gene (shown in a.i) is depicted. (b) Sexually dimorphic, neurodevelopmentally co-regulated exons identified by clustering correlated BrainSpan spatiotemporal RNA-Seq data of the developing human brain and comparing cluster expression between male and female samples. (c) ASD-segregation patterns in multiplex and simplex families.
(d) Information streams a-c were integrated to identify clusters of sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious variants. (e) Lipid dysregulation, a novel molecular theme revealed by the above analysis, was validated using EMR and health claims data, demonstrating significant alterations in lipid profiles of children with ASD, and an increased prevalence of comorbid dyslipidemia disorders among individuals with ASD and their family members, as compared to age, gender, and socioeconomically matched controls.

6.1 Background

One in every 68 children in the United States is diagnosed with ASD, a wide spectrum of social and communication deficits with repetitive behaviors [307,308]. Although twin and family studies provide substantial evidence that ASD is one of the most heritable complex disorders [309-311], the specific variants causing or increasing the risk for ASD remain largely elusive. Recent advances in ASD genetics have highlighted its extreme locus heterogeneity, revealing a role for de novo mutations [15-18,20,22-24], copy-number variants [312-315], common variants [316,317], and rare single nucleotide variants [20,22,318-320]. This has accelerated a growing realization that ASD comprises a multitude of etiologies with partially overlapping symptomatology and clinical course [321-329].

Because ASD is a common disorder with a shared phenotype, individually rare etiologies must converge at some level. Recent ASD genomic studies have revealed several convergent etiologies, including synaptic dysfunction [22,24,312,322,324-326,330], immune dysregulation [331-333], chromatin and transcriptional dysregulation [17,22,23,320,334,335], and growth abnormalities [322,336-338]. Despite these significant advances in characterizing the genomic landscape of ASD, the cause of the majority of cases remains unknown.
Understanding the molecular bases of ASD is needed to enable more accurate early diagnosis, personalized treatment options, and improved outcomes for people with ASD. The recent accumulation of enormous quantities of molecular data on typical and atypical human brain development (including ASD) is providing unprecedented opportunities for elucidating the interplay within and between different layers of genomic structures and their deviations in ASD. Integrative genomics, the study of molecular events at different levels, has been successfully applied to various cancers, revealing principal disease subtypes with characteristic distributions of age at diagnosis, clinical behavior, and optimal treatment response, thereby offering improved personalized care [339-343]. However, these approaches have yet to be applied to ASD.

Here we integrate large-scale genomic, transcriptomic, medical records, and insurance claims datasets to discover and validate molecular mechanisms associated with ASD. Besides reproducing previously reported convergent etiologies, our analysis reveals a strong signal of lipid dysregulation (26% of all exon clusters whose genetic variants are significantly implicated in ASD). We find a significant burden of sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious mutations in lipid metabolism genes, significantly altered lipid profiles in the blood of children with ASD, and a significantly higher prevalence of comorbid dyslipidemia disorders in individuals with ASD and their family members. These findings suggest that dyslipidemia may be a strong inherited risk factor for ASD, thereby offering means for earlier screening, more accurate diagnoses, and rational approaches to therapy.

6.2 Methods

6.2.1 Implication of Co-regulated Exons

It is known that in the human brain each gene unit has many alternatively spliced isoforms. This mechanism supports important fine-tuned regulation and adaptation to the changing environment.
Nowhere is this fine-tuned response to environmental stimuli more important than in the developing human brain, the most complex organ, which has been shown to have the most divergent splicing patterns [344]. Therefore, examining variants at the whole gene level has insufficient resolution for studying co-regulated variation. Such functional co-regulation needs to be investigated at the higher-resolution isoform level, and from the perspective of spatiotemporal co-expression patterns during human neurodevelopment. To this end, we develop the method Implication of Co-regulated Exons (ICE).

In order to understand which variants might function together, we examine exonic spatiotemporal co-expression patterns in the recently generated BrainSpan RNA-Seq data [345]. This dataset contains normalized read counts (in RPKM: Reads Per Kilobase per Million mapped reads) for 309,223 coding and non-coding exons measured across 524 samples from 26 brain regions throughout human neurodevelopment (Table 6-1 and Table 6-2).

Structure (description)                        Area (description)      Region   Region description
NCX (Neocortex)                                FC (Frontal cortex)     OFC      Orbital prefrontal cortex
                                                                       DFC      Dorsolateral prefrontal cortex
                                                                       VFC      Ventrolateral prefrontal cortex
                                                                       MFC      Medial prefrontal cortex
                                                                       M1C      Primary motor cortex (M1C)
                                               PC (Parietal cortex)    S1C      Primary somatosensory cortex
                                                                       IPC      Posterior inferior parietal cortex
                                               TC (Temporal cortex)    A1C      Primary auditory cortex
                                                                       STC      Posterior superior temporal cortex
                                                                       ITC      Inferior temporal cortex
                                               OC (Occipital cortex)   V1C      Primary visual cortex
HIP (Hippocampus)
AMY (Amygdala)
STR (Striatum)
MD (Mediodorsal nucleus of the thalamus)       MD                      DTH      Dorsal Thalamus (embryonic and early fetal development)
                                                                       MD       Mediodorsal nucleus (all other periods)
CBC (Cerebellar cortex)                        CBC                     CB       Cerebellum (embryonic and early fetal development)
                                                                       CBC      Cerebellar cortex

Table 6-1 Brain region hierarchy of regions, areas, and structures included in this study.
Period   Age               Description
1        4PCW-8PCW         Embryonic
2        8PCW-10PCW        Early fetal
3        10PCW-13PCW       Early fetal
4        13PCW-16PCW       Early mid-fetal
5        16PCW-19PCW       Early mid-fetal
6        19PCW-24PCW       Late mid-fetal
7        24PCW-38PCW       Late fetal
8        0M (birth)-6M     Neonatal and early infancy
9        6M-12M            Late infancy
10       1M-6Y             Early childhood
11       6Y-12Y            Middle and late childhood
12       12Y-20Y           Adolescence
13       20Y-60Y           Young adulthood
14       40Y-60Y           Middle adulthood
15       >60Y              Late adulthood

Table 6-2 Periods of brain development included in this study. PCW, post conception weeks; M, months; Y, years.

The 524 samples in the dataset were extracted from multiple brain regions belonging to 23 males and 19 females at multiple developmental stages (Figure 6-2). These samples create the spatial (regarding brain structures) and temporal (regarding ages) profile for each exon in the BrainSpan data. Co-expression analysis based on those profiles can identify exons that are linked through functional co-regulation.

[Figure 6-2: two panels. Panel (b) is titled "BrainSpan individuals age and gender profile", with x-axis "Ages (days in log scale)".]
Figure 6-2 Visualization of the BrainSpan RNA-Seq data. (a) Example heatmap of the expression profiles of 29 exons throughout neurodevelopment and across individuals. (b) The age and gender profiles of BrainSpan individuals.

We applied initial filtering steps on exons using the following criteria, keeping 292,146 exons.
1. Variability filter. If there is no change in the expression profile (i.e., expression levels are the same for different brain areas at different developmental stages from different donors), then the exon is excluded.
2. Multi-sample filter. If an exon only has samples from a single donor, then this exon is excluded.
3. Duplicate filter. We also find that some exonic intervals are duplicated in the BrainSpan RNA-Seq data, where duplicates may be labeled with different (and sometimes temporary) names.
To consolidate these duplicated exons and label them in a meaningful and consistent manner, we first identify exons that share chromosome names, start positions and end positions. For each such exon, we then identify temporary names using the following regular expression patterns: "RP.*-.*\\..*" and "\\w{2}\\d+\\.\\d". If there are additional meaningful names, these temporary names are discarded, and the exon is named by concatenating all meaningful names.

We next identify co-regulated exons by calculating their similarity across the BrainSpan dataset. We measure such similarity by the coefficient of determination $R^2 = \mathrm{cor}(e_1, e_2)^2$, where $\mathrm{cor}(e_1, e_2)$ is the Pearson correlation between the expression profiles of two exons $e_1$ and $e_2$ [346]. The coefficient $R^2$ measures how well $e_1$ might be constructed from $e_2$ (by creating a predictor of the form $\alpha + \beta e_2$), and vice versa. Preprocessing of the RNA-Seq data is applied before calculating $\mathrm{cor}(e_1, e_2)$. In particular, we regard all 0 values as NA [347]. We then log2-transform the RPKM values using the formula $\log_2(x + 1)$ to reduce the effects of measurement noise. Due to the prevalence of NA values, we keep only those exons whose profiles have at least 5% non-NA values (non-NA exon filter). Requiring at least 25 values to be measured (5% out of 524) is a rather inclusive criterion, as it retains 248,898 exons (i.e., 85% of the initially filtered total of 292,146; see Figure 6-3). We also require that candidate pairs of exons share 75% of samples with non-NA measurements (pair filter). That is,
$|\mathrm{intersect}(\mathrm{nna}(e_1), \mathrm{nna}(e_2))| > 75\% \times \max(|\mathrm{nna}(e_1)|, |\mathrm{nna}(e_2)|)$,
where $\mathrm{nna}(e)$ refers to the samples for which $e$ has non-NA values, $\mathrm{intersect}(\cdot)$ denotes the operation of set intersection, and $|\cdot|$ returns the length of a vector.
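The preprocessing, pair filter, and similarity measure above can be sketched as follows. This is a minimal illustration, assuming that the correlation is computed over the samples in which both exons have non-NA values (the text does not state this explicitly); the helper names and the toy RPKM vectors are hypothetical.

```python
import math

def preprocess(rpkm):
    """Treat 0 as missing (None) and apply log2(x + 1) to the remaining values."""
    return [None if x == 0 else math.log2(x + 1) for x in rpkm]

def non_na(profile):
    """Indices of samples with a non-NA value, i.e. nna(e)."""
    return {i for i, v in enumerate(profile) if v is not None}

def passes_pair_filter(e1, e2, frac=0.75):
    """Pair filter: shared non-NA samples must exceed frac of the larger non-NA set."""
    n1, n2 = non_na(e1), non_na(e2)
    return len(n1 & n2) > frac * max(len(n1), len(n2))

def r_squared(e1, e2):
    """Squared Pearson correlation over samples where both exons are non-NA
    (pairwise-complete handling is an assumption of this sketch)."""
    shared = sorted(non_na(e1) & non_na(e2))
    n = len(shared)
    if n < 2:
        return None
    xs = [e1[i] for i in shared]
    ys = [e2[i] for i in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant profile has no defined correlation
    return sxy * sxy / (sxx * syy)

# Toy RPKM profiles over 5 samples (made-up values).
e1 = preprocess([1, 3, 7, 15, 0])
e2 = preprocess([3, 7, 15, 31, 1])
e3 = preprocess([1, 0, 0, 0, 1])
```

In this toy case `e1` and `e2` share four of at most five non-NA samples, so they pass the pair filter, whereas `e3` shares only two samples with `e2` and is rejected before any correlation is computed.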
[Figure 6-3: histogram titled "Distribution of number of non-NA values in expressions"; x-axis "Number of non-NA values". Quantiles shown: 5%: 5; 10%: 12; 15%: 25; 20%: 48; 25%: 85.]
Figure 6-3 Distribution of the number of non-NA values in expressions of exons. We also show the lower quantiles of the number of per-exon non-NA values. For example, the 15% quantile at 25 means that 15% of the exon expressions have fewer than 25 non-NA values. In other words, 85% of the exon expressions have at least 25 non-NA values.

Pairwise correlation calculation between 309,223 exons amounts to over 47 billion pairs, and is a daunting task that is intensive in both computation and storage. Thus we adopt a distributed block-wise approach to calculate pairwise exon correlations, as shown in Figure 6-4. By dividing the exons into blocks of size 10,000, the correlation calculation is parallelized in a block-wise fashion. Let the blocks be $b_1, \ldots, b_n$. We then need to compute correlations among the exons within $b_1$, correlations between exons in $b_1$ and $b_2$, ..., correlations between exons in $b_1$ and $b_n$, correlations between exons in $b_2$ and $b_3$ (the $b_2$-$b_1$ block correlations can be omitted due to symmetry), etc. Each block-wise correlation is dispatched to its own computing node in a 2000-core computing cluster, thus achieving a thousand-fold speed-up.

[Figure 6-4: workflow diagram. Exons are divided into upper-diagonal 10^4 x 10^4 blocks for parallel computation; exon pairs with R^2 > 0.7 are connected into an exon graph; maximally connected components of the exon graph give the exon clusters.]
Figure 6-4 Block and parallel exon correlation makes computation feasible.

6.2.1.1 Identification of co-regulated exons

The distribution of the coefficients of determination $R^2$ is shown in Figure 6-5. As the histogram in Figure 6-5 (a) shows, as $R^2$ increases, the frequency falls faster than exponentially. This holds for the $R^2$ distribution after applying the two filters in the previous section.
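The block-wise decomposition described in Section 6.2.1 (Figure 6-4) can be sketched as an enumeration of upper-diagonal block pairs, each of which would be dispatched as one job. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the actual job scheduling on the cluster is not shown.

```python
def block_ranges(n_items, block_size):
    """Half-open index ranges [start, end) covering 0..n_items in blocks."""
    return [(start, min(start + block_size, n_items))
            for start in range(0, n_items, block_size)]

def upper_triangular_block_pairs(n_items, block_size):
    """All (b_i, b_j) with i <= j; the mirrored pairs are skipped by symmetry."""
    blocks = block_ranges(n_items, block_size)
    return [(blocks[i], blocks[j])
            for i in range(len(blocks))
            for j in range(i, len(blocks))]

def pairs_in_job(job):
    """Number of distinct exon pairs covered by one block-pair job."""
    (s1, e1), (s2, e2) = job
    if (s1, e1) == (s2, e2):
        size = e1 - s1
        return size * (size - 1) // 2  # within-block pairs, excluding self-pairs
    return (e1 - s1) * (e2 - s2)

# Exon count and block size as stated in the text.
n_exons = 309_223
jobs = upper_triangular_block_pairs(n_exons, 10_000)
total_pairs = sum(pairs_in_job(job) for job in jobs)
```

With 31 blocks this yields 496 block-pair jobs that together cover every unordered exon pair exactly once, which is the property that allows the jobs to run independently.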
It is our goal to focus on highly co-expressed exons. Thus we establish the empirical criterion that two exons must have an $R^2$ of at least 0.7 to be considered co-expressed, thereby focusing on the 0.02% most tightly correlated exon pairs. We keep the exons that are co-expressed with at least one other exon, and turn them into a graph representation. This graph has exons as nodes and draws an edge between exons $e_1$ and $e_2$ if they are co-expressed ($R^2(e_1, e_2) > 0.7$). This produces a large sparse exon co-expression graph with 92,240 nodes and 6,205,327 edges. A small part of the exon graph is shown in Figure 6-6, which clearly demonstrates that the whole exon graph consists of smaller exon clusters.

[Figure 6-5: two panels. (a) R^2 histogram. (b) "Distribution of mean R^2 per cluster" (min=0.700, max=1.000, mean=0.816, median=0.780), with x-axis "Mean R^2 per cluster".]
Figure 6-5 Distribution of $R^2$ in the BrainSpan data. (a) The distribution of $R^2$ between pairs of exons that pass the two filters: 1) exons must have at least 25 non-NA values (the four exon filters); 2) the exons in a pair share 75% of samples with non-NA measurements (pair filter). The frequency is on a logarithmic scale. (b) The distribution of per-cluster mean $R^2$ coefficients. The per-cluster mean $R^2$ is calculated by averaging the $R^2$ coefficients over all exon pairs in one cluster with $R^2 > 0.7$. In other words, the per-cluster mean $R^2$ measures the average cluster connection strength.

[Figure 6-6: network diagram of exon nodes labeled by hosting gene and exon index (e.g., PVRIG, STAG3, SYCE1, NCAPH2, TRAPPC2L, CPNE1, LDLR, CBS, PEBP4, POMT1).]
Figure 6-6 Visualization of part of the entire exon graph. Each node represents an exon, and an edge connects nodes $e_1$ and $e_2$ if $R^2(e_1, e_2) > 0.7$, with width proportional to the magnitude of $R^2$. Nodes are labeled according to their hosting gene and exon index.

Based on the entire exon graph, we cluster co-expressed exons by finding the maximally connected components using the igraph package [348]. This procedure generates 6,242 co-expressed exon clusters with an average mean $R^2$ of 0.82 and an average exon count of 15. The collection of exon clusters is remarkably heterogeneous in size, i.e., clusters contain different numbers of exons (Table 6-3) and genes (Table 6-4). Although the distributions are skewed towards smaller exon clusters, there are numerous exon clusters representative of tight multi-gene co-expression.

Cluster Size Count 2 4111 3 734 Count 25 Count 3 19 1 348 218 167 109 79 78 60 11 65 33 26 28 22 15 14 7 13 15 10 9 6 1 4 5 2 4 5 6 2 1 8 1 697 1 40 11,47 cluster 63 5 47 44 43 41 38S 37 36, Se 1 1 1 1 3 1 3 2 2 Count

Table 6-3 Distribution of cluster sizes (measured in terms of number of exons).

Number of clute r 2_ 3 4- Number of 4 10 16 154 3202 '2851 clusters

Table 6-4 Distribution of number of genes in exon clusters.
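The clustering step that produces the exon clusters summarized in Tables 6-3 and 6-4 can be illustrated with a small example. The analysis above uses the igraph package; the sketch below swaps in a plain union-find so that it is self-contained, and the exon pairs and $R^2$ values are made up for illustration only.

```python
def connected_components(edges):
    """Group nodes into connected components with a union-find structure.

    Stand-in for a graph library call; returns clusters sorted by
    decreasing size, then alphabetically.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))

# Hypothetical exon pairs with made-up R^2 values.
scored_pairs = [
    ("PVRIG.e2", "PVRIG.e4", 0.91),
    ("PVRIG.e4", "STAG3.e34", 0.75),
    ("LDLR.e3", "LDLR.e5", 0.88),
    ("LDLR.e5", "PVRIG.e2", 0.40),  # below threshold, so no edge is drawn
]
edges = [(a, b) for a, b, r2 in scored_pairs if r2 > 0.7]
exon_clusters = connected_components(edges)
```

Note that a cluster can span several genes when exons of different genes are linked by a chain of strong correlations, which is how multi-gene co-expression clusters arise from a purely pairwise threshold.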
6.2.1.2 Tracking expression patterns of co-regulated exons

We next track the temporal expression profiles of the co-regulated exon clusters identified in the previous step across the BrainSpan regions. As shown in Table 6-1, the measured brain regions and areas can be summarized into six brain structures: amygdaloid complex (AMY), cerebellar cortex (CBC), neocortex (NCX), hippocampus (HIP), mediodorsal thalamus (MD) and striatum (STR). Based on this mapping, we derive the expression profile of each area with formulas (6-1) to (6-8).

\begin{align}
\mathrm{FC} &= \operatorname{mean}(\mathrm{OFC}, \mathrm{DFC}, \mathrm{VFC}, \mathrm{MFC}, (\mathrm{M1C} \mid \mathrm{M1C\text{-}S1C})) \tag{6-1} \\
\mathrm{PC} &= \operatorname{mean}(\mathrm{PCx}, \mathrm{IPC}, \mathrm{S1C}) \tag{6-2} \\
\mathrm{TC} &= \operatorname{mean}(\mathrm{TCx}, \mathrm{ITC}, \mathrm{A1C}, \mathrm{STC}) \tag{6-3} \\
\mathrm{OC} &= \operatorname{mean}(\mathrm{OCx}, \mathrm{V1C}) \tag{6-4} \\
\mathrm{NCX} &= \operatorname{mean}(\mathrm{FC}, \mathrm{PC}, \mathrm{TC}, \mathrm{OC}) \tag{6-5} \\
\mathrm{STR} &= \operatorname{mean}(\mathrm{STR}, \mathrm{MGE}, \mathrm{LGE}, \mathrm{CGE}) \tag{6-6} \\
\mathrm{MD} &= \operatorname{mean}(\mathrm{MD}, \mathrm{DTH}) \tag{6-7} \\
\mathrm{CBC} &= \operatorname{mean}(\mathrm{CBC}, \mathrm{CB}) \tag{6-8}
\end{align}

Using these formulas, we calculate the aggregated expression of all exons in each brain area across the entire cohort. To compare across individuals and brain regions, we first normalize the expression levels using a Z transformation (i.e. centering the expression vector on its mean and dividing by its standard deviation). We next track the temporal expression patterns in each gender using two approaches:

(1) Mean and standard error plot. In this approach, for each exon cluster, each brain structure, each gender, and each time point, we compute the mean expression and its standard error using the aggregated expression of all exons in that exon cluster, from all matching sample donors. We plot the temporal expression profile for each combination of exon cluster, brain area, and gender as a line graph, with means as values and standard errors as error bars at each time point.

(2) Mean and standard error period plot. The spatiotemporal dynamics of the human brain transcriptome is a staged process and can be tracked as a multi-period system, as detailed in Table 6-2.
For each exon cluster, each brain structure, each gender, and each neurodevelopmental period, we compute the mean expression level and its standard error using the aggregated expression of all exons in that exon cluster, from all matching sample donors. The temporal expression profiles are then plotted in the same way as in the mean and standard error plot.

6.2.1.3 Identification of sexually dimorphic co-regulated exons

The sexually dimorphic prevalence of ASD (male-to-female ratio of 4:1) makes it more likely that the functional loss caused by genetic mutations affects co-regulated exons whose expression patterns differ between males and females. To identify sexually dimorphic co-regulated exons, we compare the male and female temporal expression profiles of each exon cluster in each brain structure, as derived in section 6.2.1.2, and select the clusters that show gender-specific differential expression in one or more brain structures.

6.2.2 Whole exome sequence analysis

Whole exome sequencing (WES) aims to identify the variants found in the coding regions of genes.

6.2.2.1 Data compilation

We compiled several familial whole exome sequencing studies from the National Database for Autism Research (NDAR), as detailed in Table 6-5. The table also shows the number of included families from each dataset. We included families with at least two siblings that have a similar degree of sequence coverage, as determined by the Genome Analysis Toolkit's CallableLoci analysis [349].
NDAR Collection ID | NDAR Collection Title | Family type | Number of families | Number of individuals
1918 | Human autism genetics and activity dependent gene activation | Multiplex | 45 | 111
2004 | Sequencing Autism Spectrum Disorder Extended Pedigrees | Multiplex | 5 | 12
2042 | SSC total recall project | Simplex | 1704 | 3408

Table 6-5 Whole exome sequence datasets used. For the SSC total recall project, we include only those 1704 families from [350] for which the VQSR step (see section 6.2.2.3) succeeded.

Of the families listed in Table 6-5, a total of 1,754 families were included in our analysis, comprising 50 multiplex families with 2-5 affected siblings, and 1,704 simplex families with one affected and one unaffected full sibling. The total number of individuals included in our analysis amounts to 3,531. To call variants accurately and consistently across all datasets, we adopt the Genome Analysis Toolkit (GATK) framework [351] for standardized preprocessing of WES data into analysis-ready reads, followed by joint variant calling.

6.2.2.2 WES Preprocessing

For each individual included in our study, multiple BAM files may have been generated by multiple sequencing runs. Furthermore, different studies used different aligners and different variant calling frameworks. To standardize variant calling and data analysis across studies, our data preprocessing began with converting BAM files back to interleaved FastQ files and aligning these in a standardized manner using BWA-MEM [352]. This round trip through the FastQ format ensures that all BAM files are processed in the same way, which improves variant calling accuracy. Before converting a BAM file to a FastQ file, we first split the BAM file by read group. We then apply the Picard toolkit [353] to undo possible post-alignment processing for each split BAM file, using the RevertSam utility.
The actual conversion from BAM files to FastQ files consists of two sub-steps. The first sub-step uses the "bamshuf" utility from SAMtools [354] to shuffle the reads in the BAM file so that they are not in any biased order, which allows a subsequent aligner to correctly estimate the insert size using blocks of paired reads. The second sub-step uses the "bam2fq" utility from SAMtools to convert the BAM file to an interleaved FastQ file in which each pair of reads (forward and reverse) is in the same file. The interleaved FastQ files from all individuals were then mapped to a single human reference genome (GRCh37/hg19, version 37, including decoy contigs) using BWA-MEM. The newly aligned BAM files containing different read groups were then merged using the Picard MergeSamFiles utility. For the merged BAM file, duplicates were marked and removed using the Picard MarkDuplicates utility, and read group information was added using the Picard AddOrReplaceReadGroups utility.

For efficiency, we restrict variant calling to a limited set of chromosomal regions specified by the BrainSpan exon intervals, because this study is only concerned with neurodevelopmentally co-regulated variants. To that end, we pad each BrainSpan exon with a 100 bp buffer. We sort the padded intervals and divide them into two collections based on whether they are on the forward or reverse strand. We then merge intervals that overlap with other intervals in the same collection, which yields a non-overlapping collection of intervals on each strand. The union of the two collections of merged intervals forms the BrainSpan reference interval. Figure 6-7 shows the distribution of padded merged BrainSpan interval sizes. The figure also breaks the intervals down by strand (forward or reverse); the two per-strand distributions are similar to each other and to that of all intervals.
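The padding and merging of exon intervals described above can be sketched as follows. This is an illustrative Python version under stated assumptions: exons are given as (chromosome, strand, start, end) tuples with 1-based inclusive coordinates, and the function name `pad_and_merge` is hypothetical rather than part of any toolkit used in this work.

```python
from collections import defaultdict


def pad_and_merge(exons, pad=100):
    """Pad each exon by `pad` bp on both sides, then merge overlaps per strand.

    exons: iterable of (chrom, strand, start, end), 1-based inclusive (assumed).
    Returns a sorted list of non-overlapping (chrom, strand, start, end) tuples.
    """
    groups = defaultdict(list)
    for chrom, strand, start, end in exons:
        # Clamp at 1 so padding never produces a coordinate before the contig start.
        groups[(chrom, strand)].append((max(1, start - pad), end + pad))

    merged = []
    for (chrom, strand), intervals in sorted(groups.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlaps the interval being built
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, strand, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, strand, cur_start, cur_end))
    return merged
```

Under these coordinate assumptions a 2 bp exon becomes a 202 bp interval, which matches the minimum interval size reported in Figure 6-7.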
[Figure 6-7: three histograms of interval size (logarithmic scale). (a) Distribution of merged BrainSpan interval size: min=202, max=29414, mean=632, median=374. (b) Distribution of merged BrainSpan interval(+) size: min=202, max=22968, mean=633, median=373. (c) Distribution of merged BrainSpan interval(-) size: min=202, max=29414, mean=632, median=375.]

Figure 6-7 Distribution of padded and merged BrainSpan interval sizes. (a) Size distribution of all BrainSpan intervals. (b) Distribution for intervals on the forward strand. (c) Distribution of intervals on the reverse strand.

6.2.2.3 Joint variant calling in BrainSpan intervals

After preprocessing, we perform joint variant calling using GATK. Figure 6-8 gives an overview of this workflow. The Non-GATK box corresponds to the preprocessing steps of section 6.2.2.2. The preprocessed BAM files undergo local realignment, which transforms regions with misalignments due to Indels into clean reads with a consensus Indel model (Indel Realignment step in Figure 6-8, using the GATK RealignerTargetCreator and IndelRealigner utilities). The reads' quality scores are then recalibrated to correct for artifact and offset bias (Base Recalibration step, using the GATK BaseRecalibrator utility), producing analysis-ready reads.

[Figure 6-8: workflow diagram from analysis-ready reads through genotype likelihoods to comprehensively annotated SNPs and Indels and individual genotypes. Annotation examples shown in the diagram: chr:start-end, cytoband; gene name, variant function; pathway, molecular process; predicted variant impact (SIFT, PolyPhen); population frequency (1000 Genomes, ESP 6500); clinical significance (ClinVar, OMIM); expression (GTEx, BrainSpan); transcriptional regulation (ENCODE TFBS, histone modifications).]

Figure 6-8 Overview of WES analysis. After rigorous quality control steps, whole exome sequence data from various NDAR collections are aligned to the reference human genome using BWA. Duplicates are then marked, a realignment step follows to account for Indel-related errors, and finally base quality score recalibration results in analysis-ready BAM files. These are then analyzed using the HaplotypeCaller, resulting in per-position genotype likelihoods. Following a joint genotyping phase, raw variants are called. These are filtered using a machine-learning based variant recalibration tool that balances the sensitivity-specificity tradeoff. The resulting SNPs and Indels are then annotated based on multiple considerations, including predicted variant impact, conservation, population frequency and clinical significance. The end result of this pipeline is a list of comprehensively annotated variants, and a table of their individual genotypes.

The analysis-ready reads are then processed using the GATK HaplotypeCaller. This step simultaneously calls SNPs and Indels using local re-assembly of haplotypes in an active region, resulting in per-position genotype likelihoods. We use the human reference genome GRCh37/hg19 (version 37, including decoy contigs) as the reference for the HaplotypeCaller, with the recommended settings for single-sample all-sites calling on DNAseq: emitRefConfidence=GVCF, variant_index_type=LINEAR, variant_index_parameter=128000. We then combine the resulting per-sample variants and perform a joint genotyping step using the GATK GenotypeGVCFs utility. Joint genotyping aggregates multi-sample variants and merges the records in order to re-estimate the genotype likelihoods by combining all records spanning the target chromosome location. Based on our joint genotyping results, we apply a machine-learning based variant filtering step, Variant Quality Score Recalibration (VQSR).
VQSR uses a Gaussian mixture model to fit and cluster the called variants and compare them to known positive and negative variant sets. SNPs and Indels are recalibrated separately in two passes. The first pass recalibrates SNPs, with Indels left untouched; the second pass recalibrates Indels, with the recalibrated SNPs left untouched.

We apply the WES preprocessing and joint variant calling steps to samples from the multiplex family cohort, producing an average of 83,808 variants/individual (74,111 SNPs, 9,697 Indels). For the discordant family cohort, we use a subset of the dataset produced by Krumm et al. [350], which is based on a similar GATK pipeline and has an average of 35,164 variants/individual (31,644 SNPs, 3,520 Indels). There are two main differences between the pipeline of Krumm et al. [350] and ours: 1) Krumm et al. performed joint variant calling separately for each quad (parents, proband (footnote 7), and unaffected sibling) instead of for the entire cohort; 2) Krumm et al. called variants within 20 bp of the NimbleGen EZ-SeqCap v2.0 targets instead of within 100 bp of BrainSpan interval targets. Difference 1) may introduce some bias when directly comparing called samples from the two cohorts. However, we performed segregation analysis separately on the two cohorts, which avoids such bias. Difference 2) results in disparate numbers of variants/individual between the two cohorts. However, as shown in section 6.2.2.6 and Figure 6-9 to Figure 6-11, our subsequent filtering steps (mapping to BrainSpan exon clusters in particular) resulted in comparable average numbers of variants/individual in the two cohorts. In addition, to keep the discordant family cohort as consistent with our pipeline as possible, we include only those 1704 quads from [350] for which VQSR succeeded.
6.2.2.4 Variant annotation

We next used the ANNOVAR toolkit [355] to comprehensively annotate called variants with a wide array of information, including their hosting gene (using several gene models such as RefSeq [356], UCSC Known Gene [357], Gencode [358]); the variant function; its predicted pathogenicity according to PolyPhen2 [359], SIFT [360], MutationTaster2 [361], MutationAssessor [362], CADD [363], LRT [364], VEST3 [365], and other meta predictors; its conservation according to PhyloP [366], SiPhy [367], and GERP++ [368]; its minor allele frequency among the 1000 Genomes populations [369], ESP6500 [370], and ExAC [371]; and its phenotype associations according to ClinVar [57] and HGMD [372].

Footnote 7: Proband refers to the affected sibling.

6.2.2.5 Annotation-based variant filtration and deleterious variant detection

To address issues of reference mis-annotation, we use the recently released Exome Aggregation Consortium (ExAC) exome dataset [371], which aggregates exome sequencing data sets from a wide range of large-scale sequencing projects, including the cohorts of the Myocardial Infarction Genetics Consortium, the Swedish Schizophrenia & Bipolar Studies and The Cancer Genome Atlas (TCGA). We filter out variants whose allele frequencies exceed 90% among the 60,706 individuals aggregated by ExAC. We also apply the same 90% filtering threshold to the alternate allele frequency in our cohort. We further focus on deleterious variants, which include frameshift insertions, frameshift deletions, nonsense variants, and splice site mutations.
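The two filters in section 6.2.2.5 amount to a simple predicate over annotated variants. The Python below is a sketch under stated assumptions: the dictionary keys (`exac_af`, `cohort_af`, `effect`) and the effect labels are hypothetical stand-ins for the corresponding ANNOVAR annotation fields, and a variant absent from ExAC is treated here as having frequency 0.

```python
# Hypothetical effect labels standing in for the annotation categories in the text.
DELETERIOUS_EFFECTS = {
    "frameshift_insertion",
    "frameshift_deletion",
    "nonsense",
    "splice_site",
}


def keep_variant(variant, max_af=0.90):
    """Return True if the variant passes both frequency filters and is deleterious."""
    exac_af = variant.get("exac_af") or 0.0  # assumption: missing in ExAC means 0
    if exac_af > max_af:
        return False  # "alternate" allele is the common one: likely reference mis-annotation
    if variant["cohort_af"] > max_af:
        return False  # same threshold applied within the study cohort
    return variant["effect"] in DELETERIOUS_EFFECTS


def filter_variants(variants, max_af=0.90):
    """Keep only deleterious variants that pass the allele-frequency filters."""
    return [v for v in variants if keep_variant(v, max_af)]
```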
[Figure 6-9: box plots comparing probands and siblings. Panel titles: Passed variant counts, Deleterious variant counts, Coregulated deleterious variant counts; Passed SNP counts, Deleterious SNP counts, Coregulated deleterious SNP counts; Passed Indel counts, Deleterious Indel counts, Coregulated deleterious Indel counts.]

Figure 6-9 Distributions of the total number of variants in probands and unaffected siblings in discordant families. Shown are the per-individual SNP and Indel distributions after each of the following analysis steps: joint variant calling, restricting to deleterious variants, and restricting to co-regulated deleterious variants. Note the dramatic reduction from about 32,000 total SNPs per individual to about 50 candidate SNPs, and from 3,500 total Indels per individual to about 130 candidate Indels. Importantly, the number of neurodevelopmentally co-regulated deleterious variants is similar between probands and unaffected siblings, but their distribution among clusters differs significantly, with an enriched aggregation of deleterious variants in certain exon clusters.

6.2.2.6 Mapping variants onto co-regulated exon clusters

To identify neurodevelopmentally co-regulated variants, we next map the called variants to the exon clusters identified in section 6.2.1. We first perform an interval search to map variants to exons using the GenomicRanges toolkit [373]. A variant maps to an exon when the variant's genomic location falls within the exon's interval. After mapping variants to their hosting exons, we assign each variant the cluster membership of its hosting exon, as obtained in section 6.2.1.
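The interval search and cluster assignment can be illustrated with a small stand-in for the GenomicRanges-based step, written in Python. It is a sketch under stated assumptions: exon coordinates are 1-based inclusive, the names `build_index` and `clusters_for_variant` are hypothetical, and the sorted-start lookup with a maximum-length bound is a simple substitute for a proper interval tree.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(exons):
    """exons: iterable of (chrom, start, end, exon_id), 1-based inclusive (assumed)."""
    by_chrom = defaultdict(list)
    for chrom, start, end, exon_id in exons:
        by_chrom[chrom].append((start, end, exon_id))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _, _ in intervals]
        max_len = max(e - s for s, e, _ in intervals)
        index[chrom] = (starts, intervals, max_len)
    return index


def clusters_for_variant(index, cluster_of, chrom, pos):
    """Return the set of cluster ids whose exons contain position `pos`."""
    if chrom not in index:
        return set()
    starts, intervals, max_len = index[chrom]
    hits = set()
    i = bisect_right(starts, pos) - 1  # last exon starting at or before pos
    # An exon containing pos must start no earlier than pos - max_len.
    while i >= 0 and starts[i] >= pos - max_len:
        start, end, exon_id = intervals[i]
        if start <= pos <= end and exon_id in cluster_of:
            hits.add(cluster_of[exon_id])
        i -= 1
    return hits
```

Returning a set covers the case where overlapping exons from different clusters contain the same position; variants in exons that belong to no cluster map to the empty set.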
This mapping of deleterious variants to exon clusters allows us to identify and enumerate deleterious mutations in each co-regulated exon cluster. Figure 6-10 and Figure 6-11 show the distributions of the number of variants per individual at each stage of variant analysis, for the discordant family cohort and the multiplex family cohort, respectively. Figure 6-10 shows that restricting to deleterious variants, restricting variants to co-regulated exon clusters, and filtering for differentially variable variants each reduce the number of candidate variants. A similar reduction holds for multiplex families, where the last filtering step keeps variants shared among all proband siblings, as shown in Figure 6-11.

[Figure 6-10: four histograms of variants per individual. Variants per individual: min=17473, max=47453, mean=35164, median=35908. Deleterious variants per individual: min=286, max=1092, mean=626, median=619. Coregulated deleterious variants per individual: min=81, max=412, mean=195, median=189. Differentially expressed, coregulated deleterious variants: min=14, max=210, mean=39, median=36.]

Figure 6-10 Distribution of number of variants per individual in the discordant family cohort at each stage of variant analysis.

[Figure 6-11: four histograms of variants per individual. Variants per individual: min=70699, max=122616, mean=83808, median=81353. Deleterious variants per individual: min=405, max=888, mean=682, median=677. Coregulated deleterious variants per individual: min=109, max=258, mean=209, median=210. Shared coregulated deleterious variants: min=24, max=120, mean=76, median=76.]

Figure 6-11 Distribution of number of variants per individual among multiplex families at each stage of variant analysis.

Below we summarize the overall reduction of candidate variant numbers at each step. For the discordant family cohort, we start with an average of 35,164 variants/individual (31,644 SNPs, 3,520 Indels). Focusing on deleterious variants reduces the candidate pool to 626 variants/individual (238 SNPs, 388 Indels) on average. Mapping deleterious variants to co-regulated exon clusters further trims the average number to 195 variants/individual (61 SNPs, 134 Indels). Finally, filtering variants by differential variability between discordant sibling pairs leads to 39 variants/individual (15 SNPs, 24 Indels) on average. For the multiplex family cohort, we start with an average of 83,808 variants/individual (74,111 SNPs, 9,697 Indels). Focusing on deleterious variants reduces the candidate pool to 682 variants/individual (296 SNPs, 386 Indels) on average. Mapping deleterious variants to co-regulated exon clusters brings the average number down to 209 variants/individual (68 SNPs, 141 Indels). Finally, keeping only the variants shared by probands in multiplex families leads to 76 variants/individual (30 SNPs, 46 Indels) on average.

6.2.3 Segregation pattern analysis

Here we examine the segregation patterns of neurodevelopmentally co-regulated, sexually dimorphic deleterious variants in both discordant and multiplex ASD families.

6.2.3.1 Discordant ASD families

Simplex ASD families are those that have one child affected by ASD. We focus on discordant families, a special case of simplex ASD families that have two siblings: one proband (affected with ASD) and one unaffected sibling. In each discordant family, a discordant sibling pair is formed by pairing the proband with his or her own unaffected sibling.
With the collection of discordant pairs, we can compare, in each exon cluster, the neurodevelopmentally co-regulated deleterious variants found in probands with those carried by their siblings. By selecting the exon clusters with excess mutation burden in probands, we retain the exon clusters that likely harbor pathogenic mutations of ASD. We use permutation tests [253] to assess the statistical significance of an exon cluster's excess deleterious variants in probands as compared to their unaffected siblings. For each exon cluster, we build a matrix with one row per family and two columns (proband and sibling), and fill each entry with the total number of mutations that the individual carries in that exon cluster. This creates an exon cluster mutational profile among discordant families. To obtain an empirical p-value for excess mutational burden, we randomly shuffle the proband and sibling labels within pairs. Repeating the permutation creates a distribution of mutational profiles that simulates mutational events in an exon cluster by chance. With this simulated distribution, we then calculate the p-value of differential variation (i.e., $\sum_i (m^e_{p_i} - m^e_{s_i})$, where $m$ denotes the mutation count, $e$ indexes the exon clusters, $i$ indexes discordant families, and $p_i$ and $s_i$ are the proband and unaffected sibling in the $i$th family respectively).

6.2.3.2 Multiplex ASD families

Multiplex families have two or more probands. In this segregation analysis we search for neurodevelopmentally co-regulated deleterious variants that are shared among all affected siblings. Because we assume that probands in the same family have a similar cause of ASD, focusing on neurodevelopmentally co-regulated exon clusters with shared deleterious variants lets us narrow in on mutations that are more likely to cause ASD. The distribution of the number of siblings in multiplex families is shown in Figure 6-12. While most multiplex families have two affected siblings, there are 16 families with 3-5 affected siblings.
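The permutation test of section 6.2.3.1 can be sketched for a single exon cluster as follows. This is an illustrative Python version, not the implementation used in this work: it assumes that shuffling means swapping the proband and sibling labels within each family with probability 0.5, that the test is one-sided (excess in probands), and that the empirical p-value uses the common add-one correction. The function name and parameters are hypothetical.

```python
import random


def permutation_pvalue(pairs, n_perm=10000, seed=0):
    """Empirical p-value for excess mutation burden in probands within one cluster.

    pairs: list of (m_proband, m_sibling) mutation counts, one tuple per family.
    Returns (observed statistic, one-sided empirical p-value).
    """
    rng = random.Random(seed)
    observed = sum(p - s for p, s in pairs)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Swap proband/sibling labels within each family at random.
        stat = sum((p - s) if rng.random() < 0.5 else (s - p) for p, s in pairs)
        if stat >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive (assumption).
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)
```

Swapping labels within families, rather than across the whole cohort, preserves the pairing between a proband and his or her own sibling.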
[Figure 6-12: histogram of the number of siblings in multiplex families (x-axis: number of siblings in multiplex family, 2 to 5): min=2, max=5, mean=2, median=2.]

Figure 6-12 Distribution of sizes of multiplex families.

We use Affected Sib Pair (ASP) analysis [374] to assess the significance of variant sharing among all proband siblings. We follow an extended version of the affected sib-pair test (page 125 in [374]). The null hypothesis of this test is that variant sharing is by chance, and therefore not related to the phenotype. This hypothesis is tested using the nonparametric linkage (NPL) z-score. To deal with multiplex families of more than 2 siblings, we divide each family into sib pairs (i.e. a family with s siblings yields s(s - 1)/2 affected sib pairs). Because these artificially created pairs are dependent, each is weighted by 2/s (i.e. scaled down by s/2, as though there were only s - 1 pairs in the sibship). The pseudo code of the extended affected sib pair test for multiplex families is shown in Figure 6-13, where variants are aggregated per cluster and Zclust is the cluster's extended NPL z-score.

    Input: exon clusters
    Output: p-values for exon clusters as evidenced by multiplex families
    for cluster c = 1 to all clusters {
        Zclust = 0;
        for variant v = 1 to all variants in cluster c {
            Z[v] = 0; n = 0;
            for family f = 1 to all families {
                // siblings with genotypes passing the selected filters
                s = #informative affected siblings in family f
                for informative sib pair p = 1 to s*(s-1)/2 {
                    // sampling from siblings with genotype that passed filters
                    Generate sib pair p
                    // sqrt(2), 0, -sqrt(2) if sib pair shares 2, 1, 0 non-ref alleles
                    z = sqrt(2) * (#alleles shared in sib pair [0,1,2] - 1)
                    Z[v] += z * 2/s
                    n++
                }
            }
            Z[v] = [1/sqrt(n)] * Z[v]
            Zclust += Z[v]
        }
        // Zclust ~ N(0, #variants in cluster)
        pVal[c] = 2 * pnorm(Zclust, mean=0, sd=sqrt(#variants in c), lower.tail=F)
    }

Figure 6-13 Pseudo code of the extended ASP test for multiplex families.
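For readers who prefer executable code, the pseudo code of Figure 6-13 can be rendered in Python roughly as follows. This is a sketch under stated assumptions rather than the original implementation: genotypes are given as non-reference allele counts (0, 1 or 2) per informative affected sibling, the number of alleles shared by a pair is taken to be the minimum of the two counts, n is incremented once per sib pair, and the two-sided p-value is computed from the absolute value of Zclust so that it stays within [0, 1]. All function names are hypothetical.

```python
from itertools import combinations
from math import erfc, sqrt


def variant_z(families):
    """Extended NPL z-score for one variant.

    families: list of lists; each inner list holds the non-reference allele
    counts (0, 1 or 2) of the informative affected siblings in one family.
    Returns None when no informative sib pair exists.
    """
    z_sum, n = 0.0, 0
    for sibs in families:
        s = len(sibs)
        for a, b in combinations(sibs, 2):  # s*(s-1)/2 pairs
            shared = min(a, b)              # assumption: shared non-ref alleles
            z = sqrt(2) * (shared - 1)      # sqrt(2), 0, -sqrt(2) for 2, 1, 0
            z_sum += z * 2 / s              # down-weight dependent pairs
            n += 1
    return z_sum / sqrt(n) if n else None


def cluster_pvalue(variants):
    """Aggregate per-variant z-scores into Zclust and a two-sided p-value.

    variants: list of per-variant family genotype lists (see variant_z).
    Under the null, Zclust is treated as N(0, number of variants).
    """
    zs = [z for z in (variant_z(fams) for fams in variants) if z is not None]
    if not zs:
        return None, None
    z_clust = sum(zs)
    sd = sqrt(len(zs))
    # 2 * P(N(0, sd^2) > |z_clust|), written with the complementary error function.
    p_value = erfc(abs(z_clust) / (sd * sqrt(2)))
    return z_clust, p_value
```

For example, a single variant carried homozygously by both affected siblings in two families yields Zclust = 2 in this sketch, since each pair contributes sqrt(2) and the sum is divided by sqrt(2).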
The null hypothesis is that variant sharing is by chance (and therefore not related to ASD), and the statistic is the extended nonparametric linkage (NPL) z-score (the variable Zclust).

6.2.4 Integrated statistical significance

Given the quantitative evidence from the simplex and multiplex ASD families, we use the following statistical significance analysis to combine the two sources of association evidence. For each exon cluster, we analyze discordant families and multiplex families separately. This gives, for each exon cluster, two independently calculated p-values: one for the statistical significance of excess deleterious variation in probands as compared to their unaffected siblings (in the discordant family analysis), and one for increased deleterious allele sharing among all affected siblings (in the multiplex family analysis). We then use Fisher's method [375] to combine the p-values from both analyses for each exon cluster. The combined p-values are then Bonferroni-corrected for multiple testing of all clusters [376].

6.2.5 Functional enrichment analysis

To assess the function of all significant exon clusters, we used NCBI's gene2go table (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz) to map genes to their molecular function, biological process, and cellular compartment. We further used GSEA's MSigDB (http://www.broadinstitute.org/gsea/msigdb/) to identify gene membership in KEGG pathways (http://www.genome.jp/kegg), Reactome pathways (http://www.reactome.org), BioCarta pathways (http://www.biocarta.com), and their pathway interactions, as recorded in the Pathway Interaction Database (http://pid.nci.nih.gov). SFARI Gene, an integrated catalogue of human genetic studies related to autism, was used to examine the significant cluster genes' known association with ASD (https://gene.sfari.org/autdb/HG_Home.do).
Only genes belonging to evidence categories 1-3 were considered as having a strong prior for playing a role in ASD. Furthermore, NCBI's ClinVar [57] and OMIM [377] databases were mined for the significant cluster genes' implication in schizophrenia and bipolar disorder, two related neurodevelopmental disorders whose etiologies overlap with those of ASD [378].

6.2.6 Analysis of lipidemia profiles using lab results from individuals with ASD seen at Boston Children's Hospital

We used the i2b2/tranSMART platform [379,380] to analyze EMR data from 1,343,481 individuals seen at Boston Children's Hospital (BCH), including 101,227 children with ASD. i2b2/tranSMART enables the cohesive analysis of heterogeneous phenotypic data, including longitudinal diagnoses and lab results. Using this engine, we compared the results of common lipid lab tests between individuals with ASD and matched individuals with no ASD-related diagnoses. Tests included triglyceride levels (lab 1173), total cholesterol (lab 8350), HDL (lab 1079, as listed in Table 6-6), and LDL (lab 8352). For each lab, a 2-by-2 contingency table was used to test the association of abnormal lab results with ASD. Its four cells count the individuals with an ICD-9 299.0 diagnosis ("Autistic disorder") and at least one abnormal test result, the individuals with an ICD-9 299.0 diagnosis and normal lab values, the individuals who never had a 299.0 diagnosis and whose lab values are all within the reference range, and the individuals who never had a 299.0 diagnosis but had at least one abnormal test result. Table 6-6 details the number of individuals used for each comparison. Pearson's chi-square tests were then used to assess the statistical significance of the association between abnormal lipid lab results and ASD.
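As a worked illustration of the 2-by-2 test, the Python below computes Pearson's chi-square statistic and its p-value (one degree of freedom) for the LDL counts reported in Table 6-6. It is a sketch under stated assumptions: no continuity correction is applied, which may differ from the setting used for the reported results, and the function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    a: ASD, abnormal result      b: ASD, all results normal
    c: no ASD, abnormal result   d: no ASD, all results normal
    Returns (statistic, p-value) with one degree of freedom, no Yates correction.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# LDL counts from Table 6-6.
ldl_statistic, ldl_p = chi_square_2x2(291, 352, 8628, 17289)
```

The closed-form shortcut used here holds only for 2-by-2 tables; larger tables need the general sum over observed and expected counts.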
Lab name | LDL | Total cholesterol | HDL | Triglycerides
Lab ID | 8352 | 8350 | 1079 | 1173
BCH patients with at least one abnormal test result that never had an autism ICD9 299.0* code | 8628 | 2427 | 5899 | 12356
BCH patients with all test results within the reference range that never had an autism ICD9 299.0* code | 17289 | 24511 | 33132 | 21918
BCH patients with at least one abnormal test result and at least one autism ICD9 299.0* code | 291 | 101 | 121 | 273
BCH patients with all test results within the reference range and at least one autism ICD9 299.0* code | 352 | 553 | 523 | 406
Total number of individuals with at least one test result and at least one 299.0 diagnosis | 643 | 654 | 644 | 679
Total number of individuals with at least one test result and no 299.0 diagnoses | 25917 | 26938 | 39031 | 34274
Period examined | 13 years | 13 years | 21 years | 21 years
Dates examined | 1/1/2001 - 12/31/2014 | 1/1/2001 - 12/31/2014 | 1/1/1993 - 12/31/2014 | 1/1/1993 - 12/31/2014

Table 6-6 Patients used to examine the association of abnormal lipid lab results with ASD.

6.2.7 PheWAS of Aetna claims data

We analyzed four calendar years (2010-2013) of medical claims and enrollment demographics for approximately 33 million Americans who were covered by Aetna Inc. policies during that period. Data from the insurance provider were warehoused in a centralized repository, using relational data tables managed by Microsoft SQL Server 2012 Enterprise Edition. We used the subscriber-to-member relationships in the insurance claims data to identify approximately 30,000 families with at least one child diagnosed with ASD, indicated by the presence of one or more ICD-9 codes in the 299 group (pervasive developmental disorders) in at least one medical claim. Fathers, mothers, and their affected children were matched to control populations by age, gender, and zip code (a socioeconomic marker).
These large control populations were repeatedly subsampled (n=10,000) to compare the prevalence of comorbid diagnoses in equally sized samples of affected and unaffected populations of fathers, mothers, and offspring. Diagnoses were mapped to PheWAS groups (http://phewas.mc.vanderbilt.edu), and the p-value of the median statistic for each diagnostic category was taken as the representative association between that diagnostic group and the case population. 6.3 Results 6.3.1 Neurodevelopmentally co-regulated, sexually dimorphic, segregating deleterious variation in ASD To identify neurodevelopmentally co-regulated, sexually dimorphic, segregating deleterious variation in ASD, we performed the integrative analysis as shown in Figure 6-1. Raw whole exome sequence data from several cohorts were obtained from NDAR, and jointly processed using a standard BWA/GATK pipeline for standardized powerful variant calling. Confidently called single nucleotide and Indel variants were annotated to identify deleterious variants, namely frameshift, nonsense, and canonical splice site altering variants, which are the focus of all subsequent analyses. We then focused on variants that segregate with ASD in 1,754 families. Specifically, we focused on variants that are shared among all affected siblings in 50 multiplex families with 2-5 probands per family, and those that are discordant between 1,704 probands and their unaffected siblings. We further focused on variants that function together during early human brain development. To identify those we analyzed the BrainSpan RNA-Seq data, which summarizes normalized read counts from 524 samples of different ages, genders, and brain regions. We analyzed exon-level pairwise correlation patterns throughout human brain development, and aggregated them to identify clusters of co-regulated exons. We then identified those clusters with sexually dimorphic expression patterns, which more likely give rise to ASD, a male-dominant disorder. 
We mapped variants back to sexually dimorphic exon clusters to identify co-regulated deleterious variants that might have gender-specific effects during early human neurodevelopment. We employed rigorous statistics to control for multiple testing, using affected sib pair (ASP) analysis to assess the significance of multiplex family variant sharing, and permutation tests to assess increased burden of deleterious, neurodevelopmentally co-regulated, sexually dimorphic variation in probands as compared to their unaffected siblings. These independent analyses were integrated to reveal 22 neurodevelopmentally co-regulated, sexually dimorphic clusters with ASD-segregating deleterious variation (Table 6-7).

6.3.2 Convergent lipid metabolism etiology

Functional enrichment analysis of the identified exon clusters revealed several molecular themes, most of which have been previously associated with ASD. These include chromatin and transcriptional regulation, immune function, and synaptic function. However, it also revealed a previously unknown convergent etiology, which accounts for 23% of the signal: lipid regulation (Table 6-7). Lipid metabolism genes implicated by our integrative analysis include low-density lipoprotein receptor (LDLR), lipoprotein lipase (LPL), copine I (CPNE1), and globoside alpha-1,3-N-acetylgalactosaminyltransferase (GBGT1). For example, the LDLR cluster includes 5 co-regulated exons with a male-dominant expression pattern during prenatal development, which switches to female dominance postnatally (Figure 6-14). This cluster is hit by 3 ASD-segregating deleterious variants ($P = 1.93 \times 10^{-7}$). Another example is the LPL cluster, which consists of 10 tightly co-regulated exons with male-dominant prenatal expression. It is hit by 5 ASD-segregating variants ($P = 1.55 \times 10^{-6}$, Figure 6-15).
Molecular theme | Cluster p-value | Gene(s) | Gene products | Location | Selected molecular processes
Lipid regulation | 7.88E-11 | CPNE1 | Copine I | 20q11.22 | Glycerophospholipid biosynthetic process, lipid metabolic process, neuron projection extension, phospholipid metabolic process, positive regulation of neuron differentiation
Lipid regulation | 2.50E-09 | DDIT4L | DNA-damage-inducible transcript 4-like | 4q24 | Upregulated by low-density lipoprotein, negative regulation of signal transduction
Lipid regulation | 2.43E-08 | GBGT1 | Globoside alpha-1,3-N-acetylgalactosaminyltransferase 1 | 9q34.13-q34.3 | Glycolipid biosynthetic process, protein glycosylation
Lipid regulation | 1.93E-07 | LDLR | Low density lipoprotein receptor | 19p13.2 | Cholesterol homeostasis, cholesterol metabolic process, cholesterol transport, lipid metabolic process, lipoprotein catabolic process, low-density lipoprotein particle clearance, phospholipid transport, phototransduction, positive regulation of triglyceride biosynthetic process, receptor-mediated endocytosis
Lipid regulation | 1.55E-06 | LPL | Lipoprotein lipase | 8p22 | Fatty acid biosynthetic process, lipoprotein metabolic process, phospholipid metabolic process, phototransduction, positive regulation of cholesterol storage, positive regulation of sequestering of triglyceride, triglyceride biosynthetic process, triglyceride homeostasis, triglyceride metabolic process, very-low-density lipoprotein particle remodeling
[Rows from the intervening table page are missing in the source.]
None | 2.50E-09 | DDO | D-aspartate oxidase | 6q21 | Aspartate catabolic process, grooming behavior, hormone metabolic process, oxidation-reduction process
None | 5.86E-06 | TRAPPC2L | Trafficking protein particle complex 2-like | 16q24.3 | ER to Golgi vesicle-mediated transport
None | 6.69E-06 | PPP1R16A | Protein phosphatase 1, regulatory subunit 16A | 8q24.3 | Regulation of catalytic activity
None | 2.72E-05 | AGAP5 | ArfGAP with GTPase domain, ankyrin repeat and PH domain 5 | 10q22.2 | Positive regulation of GTPase activity
None | 1.03E-04 | AMZ1 | Archaelysin family metallopeptidase 1 | 7p22.3 | Proteolysis

Table 6-7 Significant clusters of sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious variation, and their molecular themes.

[Figure 6-14: LDLR exon cluster. Legible panel labels: LDLR exons, mediodorsal nucleus, striatum, normalized expression. The opening of the caption is not recoverable from the source.] ... expression switching to female dominance postnatally. (C) Multiplex family sharing of three deleterious variants hitting this cluster. Five families with two affected siblings each share deleterious alleles in co-regulated LDLR exons (shown in red).

[Figure 6-15: LPL exon cluster. Panels: (A) exon correlation graph; (B) male and female expression in the hippocampus and neocortex, x-axis: ages (days in log scale).]

Figure 6-15 ASD-segregating deleterious variation in the sexually dimorphic LPL exon cluster. (A) Tight co-regulation of 10 LPL exons. The graph depicts the pairwise correlation structure among the 10 LPL exons comprising this cluster, showing that all are correlated with R^2 > 0.7. (B) Sexually dimorphic neurodevelopmental expression patterns of the LPL cluster in the neocortex and hippocampus. Shown is the mean normalized expression pattern across sample donor ages measured in days (logarithmic scale). Note the male-dominant prenatal (before 256 days) expression. (C) Multiplex family sharing of five deleterious variants hitting this cluster. Five families with two affected siblings each share deleterious alleles in co-regulated LPL exons (shown in red).

6.3.3 Dyslipidemia in families with ASD

Using health claims data from 34,003,107 individuals, we identified 23,837 families with at least one child diagnosed with ASD (ICD-9 code 299.x) and at least one child lacking any 299.x diagnosis.
Comparing the rates of dyslipidemia between children with ASD and their unaffected siblings, we found that ASD is significantly associated with dyslipidemia (OR = 1.76, 95% CI = [1.61, 1.92], Fisher's p = 2.25 x 10^-36, Table 6-8).

 | ASD | No ASD
At least one dyslipidemia diagnosis | 1083 | 999
No dyslipidemia diagnosis | 23743 | 38496

Where dyslipidemia is defined as having any of the below diagnoses:

Code | PheWAS Group
272 | Disorders of lipoid metabolism
272.1 | Hyperlipidemia
272.11 | Hypercholesterolemia
272.13 | Mixed hyperlipidemia
272.9 | Lipoid metabolism disorder NOS
277.5 | Other disorders of lipoid metabolism and hyperalimentation
277.51 | Lipoprotein disorders

Table 6-8 Enrichment of comorbid dyslipidemia diagnoses in individuals with ASD as compared to their unaffected siblings (p = 2.25 x 10^-36).

We next compared the prevalence of dyslipidemia diagnoses in 30,000 individuals diagnosed with ASD and in repeatedly sampled unrelated controls matched by age, gender, and zip code (as a marker for socio-economic status). We found a significant enrichment of dyslipidemia-related diagnoses in individuals with ASD (P = 9.70 x 10^-66). Similar findings were obtained for parents of children with ASD as compared to age, gender, and socio-economically matched controls, corroborating that dyslipidemia is an inherited risk factor for ASD. Thus, independent large-scale datasets from disparate sources can provide unprecedented opportunities to validate, with high statistical power, the implication of molecular mechanisms in ASD.
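The sibling comparison in Table 6-8 can be checked with a few lines of arithmetic. The sketch below computes the odds ratio and a Wald 95% confidence interval on the log odds ratio. Two things are assumptions here: the assignment of the four counts to cells (chosen because it is the one that reproduces the reported OR), and the use of a Wald interval, since the text does not say how the interval was computed. The function name is hypothetical, and the Fisher's exact p-value is not recomputed.

```python
from math import exp, log, sqrt


def odds_ratio_wald_ci(a, b, c, d, z=1.959964):
    """Odds ratio and Wald 95% CI for the 2x2 table [[a, b], [c, d]].

    a: cases with the diagnosis,    b: cases without it,
    c: controls with the diagnosis, d: controls without it.
    """
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)


# Counts as read from Table 6-8 (cell assignment is an assumption):
# ASD with / without dyslipidemia, unaffected siblings with / without.
or_, lo, hi = odds_ratio_wald_ci(a=1083, b=23743, c=999, d=38496)
print(f"OR = {or_:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
# prints: OR = 1.76, 95% CI = [1.61, 1.92]
```

The result agrees with the values reported in the text, which supports the cell assignment used above.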
Diagnosis | Number of individuals (ASD+, diagnosis+) | Median number of matched individuals (ASD-, diagnosis+) | Median Odds Ratio | Median Hypergeometric P-value
Hyperlipidemia | 941 | 425.5 | 2.21 | 3.79 x 10^-46
Mixed hyperlipidemia | 350 | 142 | 2.46 | 9.80 x 10^-22
Hypercholesterolemia | 796 | 486 | 1.64 | 1.1 [remainder lost in source]
Lipid metabolism disorder NOS | 78 | 30 | 2.60 | 2.13 x 10^-6
Lipoprotein disorders | 59 | 21 | 2.81 | 1.25 x 10^[exponent lost in source]
Other disorders of lipoid metabolism and hyperalimentation | 20 | 5 | 4.00 | 2.04 x 10^[exponent lost in source]
Any of the above | 1877 | 986 | 1.90 | 9.70 x 10^-66

Table 6-9 Significant enrichment of dyslipidemia-related diagnoses in individuals with ASD, detected in health claims data.

6.3.4 Behavioral phenotypes of mouse models of dyslipidemia

The MGI database was mined to compare behavioral and nervous system phenotypes between ASD mouse models and LDLR-deficient mice (Table 6-10). Five relevant phenotypes were found to be significantly shared among ASD models and LDLR-deficient mice, including abnormal synapse morphology, abnormal neuronal proliferation, and abnormal spatial learning (power > 80%, Fisher's exact test; Table 6-10). Thus there is a striking similarity between the behavioral and nervous system phenotypes of ASD and dyslipidemia mouse models.

Phenotype | % ASD models with phenotype (n=42) | % LDLR deficient models with phenotype (n=7) | Power | P
Abnormal synapse morphology | 17% | 17% | 0.970 | 1.000
Abnormal spatial learning | 38% | 17% | 0.856 | 0.598
Abnormal neuronal precursor proliferation | 34% | 34% | 0.9404 | 1.000
Increased body weight | 17% | 34% | 0.856 | 0.402
Abnormal hippocampus morphology | 36% | 38% | 0.9404 | 1.000

Table 6-10 Behavioral and nervous system phenotypes shared between 42 mouse models of ASD and 7 mouse models of LDLR deficiency.

6.4 Conclusions and Discussion

In this chapter, we developed a subgraph mining based method termed Implication of Co-regulated Exons (ICE), in order to identify exons that are co-regulated during brain development.
ICE serves as the basis of a comprehensive and integrative approach that delineates the biologic foundations of ASD by leveraging recently available genomic, transcriptomic, EMR, and health claims datasets. Besides reproducing previously reported convergent etiologies in ASD (e.g., immune, chromatin/transcriptional, synaptic, and growth dysregulation), we also discovered and validated lipid dysregulation as a strong inherited risk factor for ASD.

By integrating streams of independent information, we identified sexually dimorphic, neurodevelopmentally co-regulated, ASD-segregating deleterious variation in several lipid metabolism genes. These include LDLR, LPL, CPNE1, PEBP4, GBGT1, and DDIT4L. All of these genes were found to be mutated in individuals with developmental delay. LDLR knockdown mice have autistic symptoms, and DDIT4L is a component of the mTOR pathway, shown to be dysregulated in some types of ASD.

Importantly, we validated this novel etiology using both EMR and health claims data from millions of children with ASD, their unaffected family members, and unrelated controls. Using EMR data, we demonstrated that children with ASD have lipid and cholesterol lab values that are outside the reference ranges, which may be used to distinguish them from neurotypical children. Using health claims data, we showed that individuals diagnosed with ASD have a significantly higher prevalence of dyslipidemia-related diagnoses as compared to age, gender, and socioeconomically matched controls. We further showed that both fathers and mothers of individuals with ASD are diagnosed with dyslipidemia disorders significantly more often than matched controls. Taken together, our work suggests that lipid dysregulation may be a strong inherited risk factor for ASD.

Our results offer several practical considerations for improving early diagnosis of ASD, thereby offering better outcomes for children with ASD [381].
First, this study suggests that families with a history of dyslipidemia may be at increased risk for having children with ASD. They should be counseled and monitored accordingly. Second, common lipid lab tests, including total cholesterol, HDL, LDL, and triglyceride levels, may be informative for screening newborns for increased ASD risk. Follow-up studies should track the earliest age at which differences in lipid profiles have sufficient sensitivity and specificity to be used as biomarkers, and design a prospective trial accordingly. Third, metabolomic studies, which include fatty acid derivatives, may be used for early screening. This conclusion is also supported by targeted studies in small cohorts that found altered lipid mediators in plasma from children with ASD as compared to matched controls [382-385].

Roughly half of the human brain's weight is attributed to lipids. Rather than being used for energy storage, brain lipids are essential building blocks of cell membranes, the synaptic infrastructure of neurons, and the insulating elements of myelin [386,387]. Mutations in lipid regulators have recently been shown to alter human brain function and growth, leading to intellectual disability and microcephaly [388-390]. Follow-up mechanistic studies in mice and cellular models of ASD are needed to better understand how the gene disruptions described here contribute to ASD, and how lipid augmentation therapies may normalize the ASD phenotype.

Chapter 7. Conclusion and Future Work

In this chapter, we conclude the dissertation by summarizing our contributions and proposing directions for future work.

7.1 Contributions

This thesis proposed a series of models based on subgraph mining and factorization algorithms to extract higher-order features (biomedical relations), temporal trends, and exon co-regulation and to explore their correlations.
A common theme of these models lies in the application of subgraph mining algorithms to extract higher-order features, temporal trends, and exon co-regulation, and the application of factorization algorithms, at various depths, to model correlations between higher-order features and temporal trends. Part of our contribution is the recognition that subgraph structures recur across different biomedical subdomains: relations between biomedical concepts in clinical narratives, temporal progression of patients' physiologic measurements in ICU time series, and exons that are co-regulated during human brain development. Moreover, this dissertation demonstrated that using subgraph structures and groupings of subgraph structures (produced by factorization algorithms) can lead not only to better accuracy, but also to better interpretability and even novel insight into disease pathogenesis.

The above demonstrations span multiple concretely motivated medical problems. In NLP analysis of lymphoma pathology reports, sentence subgraphs lead to unsupervised extraction of relations among a flexible number of medical concepts from clinical narrative text. Subgraph Augmented Non-negative Tensor Factorization (SANTF) jointly models the interactions among different types of features and reduces dimensionality at the same time, which leads to better interpretability and improved accuracy even in unsupervised learning.

In ICU mortality risk prediction, time series subgraphs lead to unsupervised extraction of multivariate temporal progression patterns, which are more informative than single time point measurements. Subgraph Augmented Non-negative Matrix Factorization (SANMF) explores the correlations among trends of different physiologic variables and reduces dimensionality at the same time, which leads to better interpretability and improved accuracy compared to snapshot measurements and standalone subgraphs.
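To make the factorization step more concrete, the sketch below runs a generic non-negative matrix factorization with the standard multiplicative updates for the squared-error objective on a small count matrix. It is only meant to show how such a factorization groups correlated columns while reducing dimensionality; it is not the SANMF or SANTF formulation used in this thesis. The layout of the toy matrix (rows as patients, columns as subgraph counts), the helper names, and all parameter values are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical patient-by-subgraph count matrix with two visible groups.
V = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 3, 4],
    [1, 0, 4, 3],
]
W, H = nmf(V, rank=2)
print(round(frobenius_error(V, W, H), 3))
```

Each row of H can be read as a group of columns that tend to occur together, and each row of W as a patient's loading on those groups, which is the sense in which a non-negative factorization supports interpretation.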
In Autism Spectrum Disorder (ASD) genetic risk analysis, Implication of Co-regulated Exons (ICE) automatically identifies co-regulated exon clusters by analyzing spatiotemporal profiles of exonic expression during brain development. Expression burden analysis coupled with segregation pattern analysis implicates variants in the identified co-regulated exon clusters in the ASD phenotype. Together with functional analysis and clinical data analysis, ICE allows identification of novel ASD risk factors, including dyslipidemia. The integrative genomic analysis, which aggregates different modalities of patient data and pivots on the subgraph mining algorithm ICE, enables a deeper understanding of the mechanisms of variation in the genome, which leads to clinical insights and opportunities for early intervention.

We note that the graph representation offers generalizability: it can represent relations between concepts in medical NLP, temporal progression of physiologic measurements in ICU time series, and co-regulated exons in ASD genomics. We also showed that the general framework of subgraph mining and factorization algorithms can be effective in supervised learning, unsupervised learning, and association analysis.

7.2 Future Directions

By proposing a generalizable framework to mine subgraph structures and explore their correlations in multiple biomedical subdomains, this thesis lays the foundation for several research directions that can potentially change the current practice of medicine.

Automated cancer pathology on a truly global scale: In the current lymphoma classification guidelines, the Asian population is severely underrepresented. Incorporating Asian lymphoma patients will at least double the size of the existing patient cohort. This may lead to a better elucidated boundary between currently gray-zone lymphoma subtypes, or to previously undiscovered subtypes.
Moreover, looking at the difference in treatment courses between Asian and Caucasian patients at a large scale may lend insight into optimal intervention strategies. We are genuinely interested in how such integration will influence the understanding of the entire terrain of lymphoma pathology.

On a broader horizon, as pathology advances, what previously constituted one cancer category is now often regarded as multiple diseases or even a spectrum of diseases. This shift is likely to have a major impact on society if one can automatically identify sub-cohorts of cancer patients that share Omic and phenotypic signatures and that can benefit from targeted medications. To this end, automated diagnostic guideline construction is a promising application. Moreover, integrating the Omic data and published literature will not only impose practical application demands but also raise fundamental methodological challenges for big data analysis (e.g., [391]).

Utilizing symbiosis among common laboratory tests to improve clinical decision making: On the clinical monitoring side, we plan to model outcome-specific patient profiling, where outcomes can be specific, such as weaning from a ventilator or response to steroids. This will enable a variety of clinical applications ranging from treatment plan selection to informed staffing to operational decisions. In addition to modeling ICU patients' conditions, the SANMF/SANTF framework can also be used to study chronic conditions such as chronic kidney disease, where early symptoms such as tiredness and troubled sleep are often ignored and physiologic variable monitoring may offer a chance of early detection and early intervention.

On the other hand, the effectiveness of SANMF on physiologic variable evolution demonstrated that certain common laboratory tests share information. Exploring their correlation in mortality risk prediction is only a first step towards unlocking their hidden diagnostic utility.
It is important, in the long term, to fully investigate the extent of information redundancy and potential symbiosis among all common laboratory tests regarding their diagnostic utility. An immediate plan is to build an information-theoretic framework to quantify the information shared between the actual test results and the test results predicted from other concurrent test results. Such an improved understanding of the complex relationships and patterns within sets of laboratory tests will be incorporated into electronic clinical decision systems to enhance laboratory test result interpretation and increase the diagnostic information that can be extracted from laboratory testing.

Associate functionally related groups of genetic networks with phenotype: Implicating neurodevelopmentally co-regulated exon clusters in the ASD phenotype still leaves the following fact unaccounted for: co-regulated exon clusters may in fact be functionally related to each other (e.g., in the same known pathway). One can use known metabolic networks, known genetic pathways, and known protein-protein interaction networks to correlate exon clusters and integrate features from the entire Omic hierarchy into the SANTF model. This network-wise interaction study (NWIS) will lead to a new level of integrative genomic analysis and help us better understand the complete genetic mechanisms. This will likely generate more specific markers for early detection of ASD and elucidate targeted treatments and interventions against the developmental course of ASD.

Taking this intuition one step further, I am also interested in investigating the association between functionally related groups of genetic networks and multiple distinct but related nervous system disorders.
It has been shown that multiple neurodegenerative diseases, including Alzheimer's disease, Parkinson's disease, Huntington's disease, and Amyotrophic Lateral Sclerosis (ALS), share common genetic and metabolic pathways such as those for protein degradation. Moreover, patients with neurodegenerative diseases often have late onsets, suggesting the progressive and cumulative effect of intracellular pathogenesis mechanisms including protein degradation abnormality and mitochondrial dysfunction. To better understand disease progression and explore options for preventative intervention, new online methods are needed to integrate both progressive observations and intervention outcomes into disease modeling. Pilot research projects and cross-field collaborations have great potential to break the silos and to unlock better therapeutic opportunities for all jointly studied diseases and disorders.

Bibliography

[1] G. McNeill and D. Bryden, "Do either early warning systems or emergency response teams improve hospital patient survival? A systematic review," Resuscitation, vol. 84, 2013, pp. 1652-1667.
[2] Ö. Uzuner, B.R. South, S. Shen, and S.L. DuVall, "2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text," Journal of the American Medical Informatics Association, vol. 18, 2011, pp. 552-556.
[3] D. Nadeau and S. Sekine, "A survey of named entity recognition and classification," Lingvisticae Investigationes, vol. 30, 2007, pp. 3-26.
[4] R. Grishman and B. Sundheim, "Message Understanding Conference-6: A Brief History," COLING, 1996, pp. 466-471.
[5] M.D. Buist, G.E. Moore, S.A. Bernard, B.P. Waxman, J.N. Anderson, and T.V. Nguyen, "Effects of a medical emergency team on reduction of incidence of and mortality from unexpected cardiac arrests in hospital: preliminary study," BMJ, vol. 324, 2002, pp. 387-390.
[6] P.S. Chan, R. Jain, B.K. Nallmothu, R.A. Berg, and C. Sasson, "Rapid response teams: a systematic review and meta-analysis," Archives of internal medicine, vol. 170, 2010, pp. 18-26.
[7] L.T. Kohn, J.M. Corrigan, M.S. Donaldson, and others, To Err Is Human: Building a Safer Health System, National Academies Press, 2000.
[8] D.R. Levinson and I. General, "Adverse events in hospitals: national incidence among Medicare beneficiaries," Department of Health and Human Services Office of the Inspector General, 2010.
[9] Y. Bar-Shalom and T.E. Fortmann, Tracking and Data Association, Academic Press, 1988.
[10] S. Saria, A.K. Rajani, J. Gould, D.L. Koller, and A.A. Penn, "Integration of early physiological responses predicts later illness severity in preterm infants," Science Translational Medicine, vol. 2, 2010, pp. 48-65.
[11] A.S. Willsky, E.B. Sudderth, M.I. Jordan, and E.B. Fox, "Nonparametric Bayesian learning of switching linear dynamical systems," Advances in Neural Information Processing Systems, 2008, pp. 457-464.
[12] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 609-616.
[13] A. Mueen, E.J. Keogh, Q. Zhu, S. Cash, and M.B. Westover, "Exact Discovery of Time Series Motifs," SDM, 2009, pp. 473-484.
[14] B.D. Walker and G.Y. Xu, "Unravelling the mechanisms of durable control of HIV-1," Nature Reviews Immunology, vol. 13, 2013, pp. 487-498.
[15] S.J. Sanders, M.T. Murtha, A.R. Gupta, J.D. Murdoch, M.J. Raubeson, A.J. Willsey, A.G. Ercan-Sencicek, N.M. DiLullo, N.N. Parikshak, J.L. Stein, and others, "De novo mutations revealed by whole-exome sequencing are strongly associated with autism," Nature, vol. 485, 2012, pp. 237-241.
[16] B.M. Neale, Y. Kou, L. Liu, A. Ma'Ayan, K.E. Samocha, A. Sabo, C.-F. Lin, C. Stevens, L.-S. Wang, V.
Makarov, and others, "Patterns and rates of exonic de novo mutations in autism spectrum disorders," Nature, vol. 485, 2012, pp. 242-245.
[17] B.J. O'Roak, L. Vives, S. Girirajan, E. Karakoc, N. Krumm, B.P. Coe, R. Levy, A. Ko, C. Lee, J.D. Smith, and others, "Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations," Nature, vol. 485, 2012, pp. 246-250.
[18] I. Iossifov, M. Ronemus, D. Levy, Z. Wang, I. Hakker, J. Rosenbaum, B. Yamrom, Y. Lee, G. Narzisi, A. Leotta, and others, "De novo gene disruptions in children on the autistic spectrum," Neuron, vol. 74, 2012, pp. 285-299.
[19] Y. Jiang, R.K. Yuen, X. Jin, M. Wang, N. Chen, X. Wu, J. Ju, J. Mei, Y. Shi, M. He, and others, "Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing," The American Journal of Human Genetics, vol. 93, 2013, pp. 249-263.
[20] R.K. Yuen, B. Thiruvahindrapuram, D. Merico, S. Walker, K. Tammimies, N. Hoang, C. Chrysler, T. Nalpathamkalam, G. Pellecchia, Y. Liu, M.J. Gazzellone, L. D'Abate, E. Deneault, J.L. Howe, R.S.C. Liu, A. Thompson, M. Zarrei, M. Uddin, C.R. Marshall, R.H. Ring, L. Zwaigenbaum, P.N. Ray, R. Weksberg, Carter, B.A. Fernandez, W. Roberts, P. Szatmari, and S.W. Scherer, "Whole-genome sequencing of quartet families with autism spectrum disorder," Nature Medicine, vol. 21, 2015, pp. 185-191.
[21] S. Nemirovsky, M. Cordoba, J. Zaiat, S. Completa, P. Vega, D. Gonzalez-Moron, N. Medina, M. Fabbro, S. Romero, B. Brun, S. Revale, M. Ogara, A. Pecci, M. Marti, M. Vazquez, A. Turjanski, and M. Kauffman, "Whole Genome Sequencing Reveals a De Novo SHANK3 Mutation in Familial Autism Spectrum Disorder," PloS one, vol. 10, 2015, p. e0116358.
[22] S. De Rubeis, X. He, A.P. Goldberg, C.S. Poultney, K. Samocha, A.E. Cicek, Y. Kou, L. Liu, M. Fromer, S. Walker, and others, "Synaptic, transcriptional and chromatin genes disrupted in autism," Nature, vol. 515, 2014, pp. 209-215.
[23] I. Iossifov, B.J. O'Roak, S.J. Sanders, M. Ronemus, N. Krumm, D. Levy, H.A. Stessman, K.T. Witherspoon, L. Vives, K.E. Patterson, and others, "The contribution of de novo coding mutations to autism spectrum disorder," Nature, vol. 515, 2014, pp. 216-221.
[24] S. Dong, M.F. Walker, N.J. Carriero, M. DiCola, A.J. Willsey, Y.Y. Adam, Z. Waqar, L.E. Gonzalez, J.D. Overton, S. Frahmn, and others, "De novo insertions and deletions of predominantly paternal origin are associated with autism spectrum disorder," Cell reports, vol. 9, 2014, pp. 16-23.
[25] G.K. Savova, J.J. Masanz, P.V. Ogren, J. Zheng, S. Sohn, K.C. Kipper-Schuler, and C.G. Chute, "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications," Journal of the American Medical Informatics Association, vol. 17, 2010, pp. 507-513.
[26] A.R. Aronson, "Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program," AMIA annual symposium proceedings, vol. 2001, 2001, pp. 17-21.
[27] W.W. Chapman, W. Bridewell, P. Hanbury, G.F. Cooper, and B.G. Buchanan, "A simple algorithm for identifying negated findings and diseases in discharge summaries," Journal of biomedical informatics, vol. 34, 2001, pp. 301-310.
[28] J.-D. Kim, Y. Wang, T. Takagi, and A. Yonezawa, "Overview of genia event task in bionlp shared task 2011," Proceedings of the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 7-15.
[29] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, A. Valencia, and others, "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome biology, vol. 9, 2008, p. S4.
[30] F. Leitner, S.A. Mardis, M. Krallinger, G. Cesareni, L.A. Hirschman, and A. Valencia, "An overview of BioCreative II.5," Computational Biology and Bioinformatics, IEEE/ACM Transactions on, vol. 7, 2010, pp. 385-399.
[31] J.-D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J.
Tsujii, "Overview of BioNLP'09 shared task on event extraction," Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, Association for Computational Linguistics, 2009, pp. 1-9.
[32] C. Nedellec, R. Bossy, J.-D. Kim, J.-J. Kim, T. Ohta, S. Pyysalo, and P. Zweigenbaum, "Overview of BioNLP shared task 2013," Proceedings of the BioNLP Shared Task 2013 Workshop, 2013, pp. 1-7.
[33] I. Segura-Bedmar, P. Martinez, and D. Sanchez-Cisneros, "The 1st DDIExtraction-2011 challenge task: Extraction of Drug-Drug Interactions from biomedical texts," Proceedings of the 1st Challenge Task on Drug-Drug Interaction Extraction, vol. 761, 2011, pp. 1-9.
[34] I. Segura-Bedmar, P. Martinez, and M. Herrero-Zazo, "Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013)," Proceedings of Semeval, 2013, pp. 341-350.
[35] R. Chang, "Individual outcome prediction models for intensive care units," The Lancet, vol. 334, 1989, pp. 143-146.
[36] L. Ohno-Machado, F.S. Resnic, and M.E. Matheny, "Prognosis in critical care," Annu. Rev. Biomed. Eng., vol. 8, 2006, pp. 567-599.
[37] Y. Zhang and P. Szolovits, "Patient-specific learning in real time for adaptive monitoring in critical care," Journal of biomedical informatics, vol. 41, 2008, pp. 452-460.
[38] D.P. Bota, C. Melot, F.L. Ferreira, V.N. Ba, and J.-L. Vincent, "The multiple organ dysfunction score (MODS) versus the sequential organ failure assessment (SOFA) score in outcome prediction," Intensive care medicine, vol. 28, 2002, pp. 1619-1624.
[39] W.A. Knaus, D. Wagner, E.A. Draper, J. Zimmerman, M. Bergner, P.G. Bastos, C. Sirio, D. Murphy, T. Lotring, and A. Damiano, "The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults," CHEST Journal, vol. 100, 1991, pp. 1619-1636.
[40] J.-R. Le Gall, S. Lemeshow, and F.
Saulnier, "A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study," JAMA: the journal of the American Medical Association, vol. 270, 1993, pp. 2957-2963.
[41] J.A. Quinn, C.K. Williams, and N. McIntosh, "Factorial switching linear dynamical systems applied to physiological condition monitoring," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, 2009, pp. 1537-1551.
[42] A. Silva, P. Cortez, M.F. Santos, L. Gomes, and J. Neves, "Mortality assessment in intensive care units via adverse events using artificial neural networks," Artificial Intelligence in Medicine, vol. 36, 2006, pp. 223-234.
[43] M.J. Cohen, A.D. Grossman, D. Morabito, M.M. Knudson, A.J. Butte, and G.T. Manley, "Research Identification of complex metabolic states in critically injured patients using bioinformatic cluster analysis," 2010.
[44] C.W. Hug and P. Szolovits, "ICU acuity: real-time models versus daily models," AMIA Annual Symposium Proceedings, American Medical Informatics Association, 2009, p. 260.
[45] R. Joshi and P. Szolovits, "Prognostic Physiology: Modeling Patient Severity in Intensive Care Units Using Radial Domain Folding," AMIA Annual Symposium Proceedings, American Medical Informatics Association, 2012, p. 1276.
[46] J. Yin and H. Li, "A sparse conditional Gaussian graphical model for analysis of genetical genomics data," The annals of applied statistics, vol. 5, 2011, p. 2630.
[47] S. Kim and E.P. Xing, "Statistical estimation of correlated genome associations to a quantitative trait network," PLoS genetics, vol. 5, 2009, p. e1000587.
[48] J. Bergelson and F. Roux, "Towards identifying genes underlying ecologically relevant traits in Arabidopsis thaliana," Nature Reviews Genetics, vol. 11, 2010, pp. 867-879.
[49] R. Brachman and H. Levesque, Knowledge representation and reasoning, Elsevier, 2004.
[50] J.F. Sowa, "Knowledge representation: logical, philosophical, and computational foundations," 1999.
[51] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, "KEGG for integration and interpretation of large-scale molecular data sets," Nucleic acids research, vol. 40, 2012, pp. D109-D114.
[52] A. Franceschini, D. Szklarczyk, S. Frankild, M. Kuhn, M. Simonovic, A. Roth, J. Lin, P. Minguez, P. Bork, C. von Mering, and others, "STRING v9.1: protein-protein interaction networks, with increased coverage and integration," Nucleic acids research, vol. 41, 2013, pp. D808-D815.
[53] S. Hunter, P. Jones, A. Mitchell, R. Apweiler, T.K. Attwood, A. Bateman, T. Bernard, D. Binns, P. Bork, S. Burge, and others, "InterPro in 2011: new developments in the family and domain prediction database," Nucleic acids research, vol. 40, 2012, pp. D306-D312.
[54] S.-K. Ng, Z. Zhang, S.-H. Tan, and K. Lin, "InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes," Nucleic acids research, vol. 31, 2003, pp. 251-254.
[55] M. Hewett, D.E. Oliver, D.L. Rubin, K.L. Easton, J.M. Stuart, R.B. Altman, and T.E. Klein, "PharmGKB: the pharmacogenetics knowledge base," Nucleic acids research, vol. 30, 2002, pp. 163-165.
[56] C.J. Patel, R. Chen, and A.J. Butte, "Data-driven integration of epidemiological and toxicological data to select candidate interacting genes and environmental factors in association with disease," Bioinformatics, vol. 28, 2012, pp. i121-i126.
[57] M.J. Landrum, J.M. Lee, G.R. Riley, W. Jang, W.S. Rubinstein, D.M. Church, and D.R. Maglott, "ClinVar: public archive of relationships among sequence variation and human phenotype," Nucleic acids research, vol. 42, 2014, pp. D980-D985.
[58] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC bioinformatics, vol. 9, 2008, p. S2.
[59] M. Miwa, R. Sætre, Y. Miyao, and J.
Tsujii, "A rich feature vector for protein-protein interaction extraction from multiple corpora," Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, Association [60] for Computational Linguistics, 2009, pp. 121-130. H.-W. Chun, Y. Tsuruoka, J.-D. Kim, R. Shiba, N. Nagata, T. Hishiki, and J. Tsujii, "Extraction of gene-disease relations from Medline using domain dictionaries and machine learning.," Pacific Symposium on Biocomputing, 2006, pp. 4-15. 160 [61] [62] [63] [64] [65] [66] [67] A. Ozgtr, T. Vu, G. Erkan, and D.R. Radev, "Identifying gene-disease associations using centrality on a literature mined gene-interaction network," Bioinformatics, vol. 24, 2008, pp. i277-i285. E. Camon, M. Magrane, D. Barrell, V. Lee, E. Dimmer, J. Maslen, D. Binns, N. Harte, R. Lopez, and R. Apweiler, "The Gene Ontology annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology," Nucleic acids research, vol. 32, 2004, pp. D262-D266. G.D. Bader, M.P. Cary, and C. Sander, "Pathguide: a pathway resource list," Nucleic acids research,vol. 34, 2006, pp. D504-D506. Y. Luo, G. Riedlinger, and P. Szolovits, "Text Mining in Cancer Gene and Pathway Prioritization," Cancer informatics, vol. 13, 2014, p. 69. J. Chen, E.E. Bardes, B.J. Aronow, and A.G. Jegga, "ToppGene Suite for gene list enrichment analysis and candidate gene prioritization," Nucleic acids research, vol. 37, 2009, pp. W305-W311. J. Chen, H. Xu, B.J. Aronow, and A.G. Jegga, "Improved human disease candidate gene prioritization using mouse phenotype," BMC bioinformatics, vol. 8, 2007, p. 392. M.A. van Driel, J. Bruggeman, G. Vriend, H.G. Brunner, and J.A. Leunissen, "A textmining analysis of the human phenome," Europeanjournal of human genetics, vol. 14, 2006, pp. 535-542. [68] [69] [70] [71] T.H. Pers, P. Dworzyski, C.E. Thomas, K. Lage, and S. Brunak, "MetaRanker 2.0: a web server for prioritization of genetic variation data," Nucleic acids research, vol. 
41, 2013, pp. W104-W108. S. Raychaudhuri, R.M. Plenge, E.J. Rossin, A.C. Ng, S.M. Purcell, P. Sklar, E.M. Scolnick, R.J. Xavier, D. Altshuler, M.J. Daly, and others, "Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions," PLoS genetics, vol. 5, 2009, p. e1000534. US National Library of Medicine, "ClinicalTrial.gov https://clinicaltrial.gov/." S.R. Thadani, C. Weng, J.T. Bigger, J.F. Ennever, and D. Wajngurt, "Electronic screening improves efficiency in clinical trial recruitment," Journal of the American Medical InformaticsAssociation, vol. 16, 2009, pp. 869-873. [72] R. Miotto and C. Weng, "Unsupervised mining of frequent tags for clinical eligibility text indexing," Journalof biomedicalinformatics, vol. 46, 2013, pp. 1145-1151. [73] S.W. Tu, M. Peleg, S. Carini, M. Bobak, J. Ross, D. Rubin, and I. Sim, "A practical method for transforming free-text eligibility criteria into computable criteria," Journal of biomedical informatics, vol. 44, 2011, pp. 239-250. [74] [75] B. deBruijn, S. Carini, S. Kiritchenko, J. Martin, and I. Sim, "Automated information extraction of key trial design elements from clinical trial publications," AMIA Annual Symposium Proceedings,American Medical Informatics Association, 2008, p. 141. C. Weng, X. Wu, Z. Luo, M.R. Boland, D. Theodoratos, and S.B. Johnson, "EliXR: an approach to eligibility criteria extraction and representation," Journal of the American Medical Informatics Association, vol. 18, 2011, pp. il 16-i124. [76] T. Hao, A. Rusanov, M.R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of biomedical informatics, vol. 52, 2014, pp. 112120. 161 [77] T. Klein, J. Chang, M. Cho, K. Easton, R. Fergerson, M. Hewett, Z. Lin, Y. Liu, S. Liu, D. Oliver, and others, "Integrating genotype and phenotype information: an overview of the [78] PharmGKB project," PharmacogenomicsJ, vol. 1, 2001, pp. 167-170. A. 
Coulet, N.H. Shah, Y. Garten, M. Musen, and R.B. Altman, "Using text to build semantic networks for pharmacogenomics," Journal of biomedical informatics, vol. 43, 2010, pp. 1009-1019. [79] [80] [81] Y. Garten and R.B. Altman, "Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text," BMC bioinformatics, vol. 10, 2009, p. S6. B. Percha, Y. Garten, R.B. Altman, and others, "Discovery and explanation of drug-drug interactions via text mining," Pac Symp Biocomput, World Scientific, 2012, p. 421. S.V. Pakhomov, J.D. Buntrock, and C.G. Chute, "Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques," Journalof the American Medical InformaticsAssociation, vol. 13, 2006, pp. 516-525. [82] A.B. Wilcox and G. Hripcsak, "The role of domain knowledge in automating medical text report classification," Journal of the American Medical Informatics Association, vol. 10, 2003, pp. 330-338. [83] D.B. Aronow, F. Fangfang, and W.B. Croft, "Ad hoc classification of radiology reports," Journalof the American Medical InformaticsAssociation, vol. 6, 1999, pp. 393-411. [84] D. Aronsky and P.J. Haug, "Automatic identification of patients eligible for a pneumonia guideline.," Proceedings of the AMIA [85] Symposium, American Medical Informatics Association, 2000, p. 12. M. Fiszman, W.W. Chapman, D. Aronsky, R.S. Evans, and P.J. Haug, "Automatic detection of acute bacterial pneumonia from chest X-ray reports," Journal of the American MedicalInformatics Association, vol. 7, 2000, pp. 593-604. [86] [87] H.-M. Lu, D. Zeng, L. Trujillo, K. Komatsu, and H. Chen, "Ontology-enhanced automatic chief complaint classification for syndromic surveillance," Journal of biomedical informatics, vol. 41, 2008, pp. 340-356. Y. Luo, A. Sohani, E. Hochberg, and P. 
Szolovits, "Automatic Lymphoma Classification with Sentence Subgraph Mining from Pathology Reports," Journal of the American [88] Medical InformaticsAssociation (JAMIA) 2014, vol. 21, 2014, pp. 824-832. Y. Luo, Y. Xin, E. Hochberg, R. Joshi, 0. Uzuner, and P. Szolovits, "Subgraph Augmented Non-Negative Tensor Factorization (SANTF) for Modeling Clinical Text," Journalof the American Medical Informatics Association (JAMI) in press, 2015. [89] [90] G. Onder, C. Pedone, F. Landi, M. Cesari, C. Della Vedova, R. Bernabei, and G. Gambassi, "Adverse drug reactions as cause of hospital admissions: results from the Italian Group of Pharmacoepidemiology in the Elderly (GIFA)," Journal of the American GeriatricsSociety, vol. 50, 2002, pp. 1962-1968. H. Zheng, H. Wang, H. Xu, Y. Wu, Z. Zhao, and F. Azuaje, "Linking Biochemical Pathways and Networks to Adverse Drug Transactions on, vol. 13, 2014, pp. 131-137. [91] Reactions," NanoBioscience, IEEE M. Liu, Y. Wu, Y. Chen, J. Sun, Z. Zhao, X. Chen, M.E. Matheny, and H. Xu, "Largescale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs," Journal of the American Medical Informatics Association, vol. 19, 2012, pp. e28-e35. 162 [92] R. Harpaz, S. Vilar, W. DuMouchel, H. Salmasian, K. Haerian, N.H. Shah, H.S. Chase, and C. Friedman, "Combing signals from spontaneous reports and electronic health records for detection of adverse drug reactions," Journal of the American Medical InformaticsAssociation, 2012, p. amiajnl-2012. [93] [94] [95] J. Li, X. Zhu, and J.Y. Chen, "Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts," PLoS computationalbiology, vol. 5, 2009, p. e1000450. C. Blaschke, M.A. Andrade, C.A. Ouzounis, and A. Valencia, "Automatic extraction of biological information from scientific text: protein-protein interactions.," Ismb, 1999, pp. 60-67. B. Rosario and M.A. 
Hearst, "Classifying semantic relations in bioscience texts," Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, [96] Association for Computational Linguistics, 2004, p. 430. B. Rosario and M.A. Hearst, "Multi-way relation classification: application to proteinprotein interactions," Proceedings of the conference on Human Language Technology and EmpiricalMethods in NaturalLanguage Processing,Association for Computational [97] Linguistics, 2005, pp. 732-739. D. Hristovski, C. Friedman, T.C. Rindflesch, and B. Peterlin, "Exploiting semantic relations for literature-based discovery," AMIA annual symposium proceedings,American [98] Medical Informatics Association, 2006, pp. 349-353. T.C. Rindflesch and M. Fiszman, "The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text," Journalof biomedical informatics, vol. 36, 2003, pp. 462-477. [99] Y. Luo and 0. Uzuner, "Semi-Supervised Learning to Identify UMLS Semantic Relations," AMA Joint Summits on TranslationalScience, 2014. [100] S. Nijssen and J.N. Kok, "The gaston tool for frequent subgraph mining," Electronic Notes in Theoretical Computer Science, vol. 127, 2005, pp. 77-87. [101] K. Roberts, B. Rink, and S. Harabagiu, "Extraction of medical concepts, assertions, and relations from discharge summaries for the fourth i2b2/VA shared task," Proceedings of the 2010 i2b2/VA Workshop on Challenges in Natural Language Processingfor Clinical Data. Boston, MA, USA: i2b2, 2010. [102] B. deBruijn, C. Cherry, S. Kiritchenko, J. Martin, and X. Zhu, "Machine-learned solutions for three stages of clinical infonnation extraction: the state of the art at i2b2 2010," Journal of the American Medical InformaticsAssociation, vol. 18, 2011, pp. 557-562. [103] H. Xu, S.P. Stenner, S. Doan, K.B. Johnson, L.R. Waitman, and J.C. 
Denny, "MedEx: a medication information extraction system for clinical narratives," Journal of the American Medical InformaticsAssociation, vol. 17, 2010, pp. 19-24. [104] P. Anick, P. Hong, N. Xue, and D. Anick, "Concept, Assertion and Relation Extraction at the 2010 i2b2 Relation Extraction Challenge using parsing information and dictionaries," Proc. of i2b2/VA Shared-Task. Washington, DC, 2010. [105] H. Liu, L. Hunter, V. Kegelj, and K. Verspoor, "Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations," PloS one, vol. 8, 2013, p. e60954. [106] H. Liu, R. Komandur, and K. Verspoor, "From graphs to events: A subgraph matching approach for information extraction from biomedical text," Proceedings of the BioNLP 163 Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 164172. [107] A. MacKinlay, D. Martinez, A.J. Yepes, H. Liu, W.J. Wilbur, and K. Verspoor, "Extracting biomedical events and modifications using subgraph matching with noisy training data," Proceedings of the BioNLP Shared Task 2013 Workshop. Association for ComputationalLinguistics, Sofia, Bulgaria, 2013, pp. 35-44. [108] K. Ravikumar, H. Liu, J.D. Cohn, M.E. Wall, K. Verspoor, and others, "Literature mining of protein-residue associations with graph rules learned through distant supervision.," J. Biomedical Semantics, vol. 3, 2012, p. S2. [109] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne, "The protein data bank," Nucleic acids research, vol. 28, 2000, pp. 235242. [110] H. Liu, Z.-Z. Hu, J. Zhang, and C. Wu, "BioThesaurus: a web-based thesaurus of protein and gene names," Bioinformatics, vol. 22, 2006, pp. 103-105. [111] J. Bjrne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. 
Salakoski, "Extracting complex biological events with rich graph-based feature sets," Proceedings of the Workshop on Current Trends in Biomedical NaturalLanguage Processing:Shared Task, Association for Computational Linguistics, 2009, pp. 10-18. [112] J. Bjrne and T. Salakoski, "Generalizing biomedical event extraction," Proceedings of the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 183-191. [113] J. BjOrne and T. Salakoski, "TEES 2.1: Automated annotation scheme learning in the BioNLP 2013 shared task," Proceedings of the BioNLP Shared Task 2013 Workshop, 2013, pp. 16-25. [114] J. Bjbme, A. Airola, T. Pahikkala, and T. Salakoski, "Drug-drug interaction extraction from biomedical texts with svm and rls classifiers," Proceedings of DDIExtraction-2011 challenge task, 2011, pp. 35-42. [115] K. Hakala, S. Van Landeghem, T. Salakoski, Y. Van de Peer, and F. Ginter, "EVEX in ST'13: Application of a large-scale text mining resource to event extraction and network construction," Proceedings of the BioNLP Shared Task 2013 Workshop, 2013, pp. 26-34. [116] U. Consortium and others, "The universal protein resource (UniProt)," Nucleic acids research, vol. 36, 2008, pp. D190-D195. [117] H. Kilicoglu and S. Bergler, "Adapting a general semantic interpretation approach to biological event extraction," Proceedings of the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 173-182. [118] H. Kilicoglu and S. Bergler, "Syntactic dependency based heuristics for biological event extraction," Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing:Shared Task, Association for Computational Linguistics, 2009, pp. 119-127. [119] J. Hakenberg, I. Solt, D. Tikk, L. Tari, A. Rheinlinder, Q.L. Ngyuen, G. Gonzalez, and U. 
Leser, "Molecular event extraction from link grammar parse trees," Proceedings of the Workshop on Current Trends in Biomedical NaturalLanguage Processing:Shared Task, Association for Computational Linguistics, 2009, pp. 86-94. [120] J. Hakenberg, R. Leaman, N. Ha Vo, S. Jonnalagadda, R. Sullivan, C. Miller, L. Tari, C. Baral, and G. Gonzalez, "Efficient extraction of protein-protein interactions from full-text 164 articles," IEEE/ACM Transactionson ComputationalBiology and Bioinformatics (TCBB), vol. 7, 2010, pp. 481-494. [121] P. Thomas, S. Pietschmann, I. Solt, D. Tikk, and U. Leser, "Not all links are equal: exploiting dependency types for the extraction of protein-protein interactions from text," Proceedings ofBioNLP 2011 Workshop, Association for Computational Linguistics, 2011, pp. 1-9. [122] J. Hakenberg, C. Plake, R. Leaman, M. Schroeder, and G. Gonzalez, "Inter-species normalization of gene mentions with GNAT," Bioinformatics, vol. 24, 2008, pp. i126i132. [123] S. Riedel and A. McCallum, "Robust biomedical event extraction with dual decomposition and minimal domain adaptation," Proceedingsof the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 46-50. [124] D. McClosky, M. Surdeanu, and C.D. Manning, "Event extraction as dependency parsing," Proceedings of the 49th Annual Meeting of the Associationfor ComputationalLinguistics: Human Language Technologies-Volume 1, Association for Computational Linguistics, 2011, pp. 1626-1635. [125] S. Van Landeghem, Y. Saeys, B. De Baets, and Y. Van de Peer, "Analyzing text in search of bio-molecular events: a high-precision machine learning framework," Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, Association for Computational Linguistics, 2009, pp. 128-136. [126] K. Kaljurand, G. Schneider, and F. 
Rinaldi, "UZurich in the BioNLP 2009 shared task," Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing:SharedTask, Association for Computational Linguistics, 2009, pp. 28-36. [127] S. Kerrien, Y. Alam-Faruque, B. Aranda, I. Bancarz, A. Bridge, C. Derow, E. Dimmer, M. Feuermann, A. Friedrichsen, R. Huntley, and others, "IntAct-open source resource for molecular interaction data," Nucleic acids research,vol. 35, 2007, pp. D561-D565. [128] A. Vlachos, P. Buttery, D.O. Seaghdha, and T. Briscoe, "Biomedical event extraction without training data," Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, Association for Computational Linguistics, 2009, pp. 37-40. [129] D. McClosky, M. Surdeanu, and C.D. Manning, "Event extraction as dependency parsing in BioNLP 2011," Proceedings of the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 41-45. [130] C. Quirk, P. Choudhury, M. Gamon, and L. Vanderwende, "Msr-nlp entry in bionlp shared task 2011," Proceedings of the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 155-163. [131] M. Miwa, P. Thompson, J. McNaught, D.B. Kell, and S. Ananiadou, "Extracting semantically enriched events from biomedical literature," BMC bioinformatics, vol. 13, 2012, p. 108. [132] A. Coulet, Y. Garten, M. Dumontier, R.B. Altman, M.A. Musen, N.H. Shah, and others, "Integration and publication of heterogeneous text-mined relationships on the Semantic Web.," J. Biomedical Semantics, vol. 2, 2011, p. S10. [133] J. Hakenberg, D. Voronov, V.H. Nguyen, S. Liang, S. Anwar, B. Lumpkin, R. Leaman, L. Tari, and C. Baral, "A SNPshot of PubMed to associate genetic variants with drugs, diseases, and adverse reactions," Journal of biomedical informatics, vol. 45, 2012, pp. 842-850. 165 [134] M. Kuhn, M. Campillos, I. Letunic, L.J. Jensen, and P. 
Bork, "A side effect resource to capture phenotypic effects of drugs," Molecular systems biology, vol. 6, 2010, p. 343. [135] D.S. Wishart, C. Knox, A.C. Guo, D. Cheng, S. Shrivastava, D. Tzur, B. Gautam, and M. Hassanali, "DrugBank: a knowledgebase for drugs, drug actions and drug targets," Nucleic acids research, vol. 36, 2008, pp. D901-D906. [136] R. Leaman, G. Gonzalez, and others, "BANNER: an executable survey of advances in biomedical named entity recognition.," Pacific Symposium on Biocomputing, 2008, pp. 652-663. [137] H. Wang, Y. Ding, J. Tang, X. Dong, B. He, J. Qiu, and D.J. Wild, "Finding complex biological relationships in recent PubMed articles using Bio-LDA," PLoS One, vol. 6, 2011, p. e17243. [138] B. Chen, X. Dong, D. Jiao, H. Wang, Q. Zhu, Y. Ding, and D.J. Wild, "Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data," BMC bioinformatics, vol. 11, 2010, p. 255. [139] Q.-C. Bui, B.O. Nuallin, C.A. Boucher, and P.M. Sloot, "Extracting causal relations on HIV drug resistance from literature," BMC bioinformatics,vol. 11, 2010, p. 101. [140] J. Vondrasek and A. Wlodawer, "HIVdb: a database of the structures of human immunodeficiency virus protease," Proteins:Structure, Function, and Bioinformatics, vol. 49, 2002, pp. 429-431. [141] P. Libin, G. Beheydt, K. Deforche, S. Imbrechts, F. Ferreira, K. Van Laethem, K. Theys, A.P. Carvalho, J. Cavaco-Silva, G. Lapadula, and others, "RegaDB: community-driven data management and analysis for infectious diseases," Bioinformatics, vol. 29, 2013, pp. 1477-1480. [142] S. Katrenko and P. Adriaans, "Learning relations from biomedical corpora using dependency trees," Knowledge Discovery and Emergent Complexity in Bioinformatics, Springer, 2007, pp. 61-80. [143] R. Sxtre, K. Yoshida, M. Miwa, T. Matsuzaki, Y. Kano, and J. 
Tsujii, "Extracting protein interactions from text with the unified AkaneRE event extraction system," IEEE/ACM Transactions on Computational Biology andtiiUformatics (TCBB), vol. 7, 2010, pp. 442-453. [144] D. Maglott, J. Ostell, K.D. Pruitt, and T. Tatusova, "Entrez Gene: gene-centered information at NCBL," Nucleic acids research, vol. 33, 2005, pp. D54-D58. [145] A. Koike and T. Takagi, "Gene/protein/family name recognition in biomedical literature," Proceedings of BioLink 2004 Workshop: Linking Biological Literature, Ontologies and Databases: Tools for Users, 2004, p. 56. [146] P. Thomas, M. Neves, I. Solt, D. Tikk, and U. Leser, "Relation extraction for drug-drug interactions using ensemble learning," Proceedings of DDIExtraction-2011 challenge task, 2011. [147] M.F.M. Chowdhury and A. Lavelli, "Drug-drug interaction extraction using composite kernels," ProceedingsofDDIExtraction-2011challenge task, 2011, pp. 27-33. [148] M.F.M. Chowdhury and A. Lavelli, "FBK-irst: A multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information," Proceedings ofSemEval 2013, 2013, pp. 351-355. [149] M.F.M. Chowdhury, A.B. Abacha, A. Lavelli, and P. Zweigenbaum, "Two different machine learning techniques for drug-drug interaction extraction," Challenge Task on Drug-DrugInteractionExtraction, 2011, pp. 19-26. 166 [150] D. Tikk, P. Thomas, P. Palaga, J. Hakenberg, and U. Leser, "A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature," PLoS computationalbiology, vol. 6, 2010, p. e1000837. [151] F.M. Chowdhury, A. Lavelli, and A. Moschitti, "A study on dependency tree kernels for automatic extraction of protein-protein interaction," Proceedings of BioNLP 2011 Workshop, Association for Computational Linguistics, 2011, pp. 124-133. [152] M.-C. De Marneffe, B. MacCartney, and C.D. 
Manning, "Generating typed dependency parses from phrase structure parses," ProceedingsofLREC, 2006, pp. 449-454. [153] E. Charniak and M. Johnson, "Coarse-to-fine n-best parsing and MaxEnt discriminative reranking," Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2005, pp. 173-180. [154] D. McClosky, "Any domain parsing: automatic domain adaptation for natural language parsing," Brown University, 2010. [155] M. Miwa and S. Ananiadou, "NaCTeM EventMine for BioNLP 2013 CG and PC tasks," Proceedings of BioNLP Shared Task 2013 Workshop, 2013, pp. 94-98. [156] Y. Miyao, K. Sagae, R. Sotre, T. Matsuzaki, and J. Tsujii, "Evaluating contributions of natural language parsers to protein-protein interaction extraction," Bioinformatics, vol. 25, 2009, pp. 394-400. [157] K. Sagae and J. Tsujii, "Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles.," EMNLP-CoNLL, 2007, pp. 1044-1050. [158] 0. Bodenreider, "The unified medical language system (UMLS): integrating biomedical terminology," Nucleic acids research, vol. 32, 2004, pp. D267-D270. [159] G.A. Miller, "WordNet: a lexical database for English," Communicationsof the ACM, vol. 38, 1995, pp. 39-41. [160] R. McDonald, F. Pereira, K. Ribarov, and J. Hajic, "Non-projective dependency parsing using spanning tree algorithms," Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2005, pp. 523-530. [161] S. Riedel, H.-W. Chun, T. Takagi, and J. Tsujii, "A markov logic approach to biomolecular event extraction," Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, Association for Computational Linguistics, 2009, pp. 41-49. [162] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C.D. 
Manning, "Model combination for event extraction in BioNLP 2011," Proceedings of the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics, 2011, pp. 51-55. [163] H. Liu, T. Christiansen, W.A. Baumgartner Jr, and K. Verspoor, "BioLemmatizer: a lemmatization tool for morphological processing of biomedical text.," J. Biomedical Semantics, vol. 3, 2012, p. 17. [164] S. Pyysalo, T. Salakoski, S. Aubin, and A. Nazarenko, "Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches," BMC bioinformatics, vol. 7, 2006, p. S2. [165] D.D. Sleator and D. Temperley, "Parsing English with a link grammar," arXiv preprint cmp-lg/9508004, 1995. [166] Y. Huang, H.J. Lowe, D. Klein, and R.J. Cucina, "Improved identification of noun phrases in clinical radiology reports using a high-performance statistical natural language 167 parser augmented with the UMLS specialist lexicon," Journal of the American Medical Informatics Association, vol. 12, 2005, pp. 275-285. [167] G. Schneider, M. Hess, and P. Merlo, "Hybrid long-distance functional dependency parsing," PhD, University of Zurich, 2008. [168] T. Briscoe, J. Carroll, and R. Watson, "The second release of the RASP system," Proceedings of the COLING/ACL on Interactive presentation sessions, Association for Computational Linguistics, 2006, pp. 77-80. [169] M. Krallinger, M. Vazquez, F. Leitner, D. Salgado, A. Chatr-aryamontri, A. Winter, L. Perfetto, L. Briganti, L. Licata, M. lannuccelli, and others, "The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bioontology concepts to full text," BMC bioinformatics, vol. 12, 2011, p. S3. [170] M. Huang, S. Ding, H. Wang, and X. Zhu, "Mining physical protein-protein interactions from the literature," Genome Biol, vol. 9, 2008, p. S12. [171] D. Tikk, I. Solt, P. Thomas, and U. 
Leser, "A detailed error analysis of 13 kernel methods for protein-protein interaction extraction," BMC bioinformatics,vol. 14, 2013, p. 12. [172] C. Giuliano, A. Lavelli, and L. Romano, "Exploiting shallow linguistic information for relation extraction from biomedical literature.," EACL, 2006, pp. 401-408. [173] S. Vishwanathan and A.J. Smola, "Fast kernels for string and tree matching," NIPS, 2002, pp. 569-576. [174] M. Collins and N. Duffy, "Convolution kernels for natural language," Advances in neural informationprocessingsystems, 2001, pp. 625--632. [175] A. Moschitti, "Efficient convolution kernels for dependency and constituent syntactic trees," Machine Learning: ECML 2006, Springer, 2006, pp. 318-329. [176] T. Kuboyama, K. Hirata, H. Kashima, K.F. Aoki-Kinoshita, and H. Yasuda, "A spectrum tree kernel," Information and Media Technologies, vol. 2, 2007, pp. 292-299. [177] G. Erkan, A. Ozgur, and D.R. Radev, "Semi-supervised classification for extracting protein interaction sentences using dependency parsing.," EAINLP-CoNLL, 2007, pp. 228-237. [178" . Kim1, J. Yoon, and T. Yang, "Kernel approaches for genic iteraction extraction" Bioinformatics, vol. 24, 2008, pp. 118-126. [179] A. Moschitti, "A study on convolution kernels for shallow semantic parsing," Proceedings of the 42nd Annual Meeting on Association for ComputationalLinguistics, Association for Computational Linguistics, 2004, p. 335. [180] P. Thomas, M. Neves, T. Rocktaschel, and U. Leser, "WBI-DDI: drug-drug interaction extraction using majority voting," Second Joint Conference on Lexical and ComputationalSemantics (* SEM), 2013, pp. 628-635. [181] D. Lin, "Dependency-based evaluation of MINIPAR," Treebanks, Springer, 2003, pp. 317-329. [182] M. Lease and E. Charniak, "Parsing biomedical literature," Natural Language Processing-IJCNLP2005, Springer, 2005, pp. 58-69. [183] Y. Freund and R.E. 
Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of computer and system sciences, vol. 55, 1997, pp. 119-139. [184] M. Kay, "Algorithm schemata and data structures in syntactic processing," Technical Report CSL80-12, 1980. 168 [185] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, and others, "Gene Ontology: tool for the unification of biology," Nature genetics, vol. 25,2000, pp. 25-29. [186] D.A. Lindberg, B.L. Humphreys, A.T. McCray, and others, "The Unified Medical Language System.," Methods of information in medicine, vol. 32, 1993, p. 281. [187] National Library of Medicine, "MeSH http://www.ncbi.nlm.nih.gov/mesh." [188] K.A. Gray, B. Yates, R.L. Seal, M.W. Wright, and E.A. Bruford, "Genenames. org: the HGNC resources in 2015," Nucleic acids research, 2014, p. gku1071. [189] K.K. Schuler, "VerbNet: A broad-coverage, comprehensive verb lexicon," University of Pennsylvania, 2005. [190] C. Borgelt and M.R. Berthold, "Mining molecular fragments: Finding relevant substructures of molecules," Proceedings. 2002 IEEE InternationalConference on Data Mining, IEEE, 2002, pp. 51-58. [191] X. Yan and J. Han, "gspan: Graph-based substructure pattern mining," Proceedings. 2002 IEEE InternationalConference on DataMining, IEEE, 2002, pp. 721-724. [192] J. Huan, W. Wang, and J. Prins, "Efficient mining of frequent subgraphs in the presence of isomorphism," Data Mining, 2003. ICDM 2003. Third IEEE InternationalConference on, IEEE, 2003, pp. 549-552. [193] A.B. Clegg and A.J. Shepherd, "Syntactic pattern matching with Graph-Spider and MPL," The Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM2008), Turku, Finland, 2008, pp. 129-132. [194] Stanford NLP, "Stanford Parser http://nlp.stanford.edu:8080/parser/." [195] D.M. 
Bikel, "Design of a multi-lingual, parallel-processing statistical parsing engine," Proceedings of the second international conference on Human Language Technology Research, Morgan Kaufmann Publishers Inc., 2002, pp. 178-182. [196] L. Rimell and S. Clark, "Porting a lexicalized-grammar parser to the biomedical domain," Journal ofBiomedical Informatics, vol. 42, 2009, pp. 852-865. [197] M. Krallinger, A. Morgan, L. Smith, F. Leitner, L. Tanabe, J. Wilbur, L. Hirschman, and A. Valencia, "Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge," Genome Biol, vol. 9, 2008, p. 51. [198] R. Bunescu, R. Ge, R.J. Kate, E.M. Marcotte, R.J. Mooney, A.K. Ramani, and Y.W. Wong, "Comparative experiments on learning information extractors for proteins and their interactions," Artificial intelligence in medicine, vol. 33, 2005, pp. 139-155. [199] S. Pyysalo, F. Ginter, J. Heimonen, J. Bjorne, J. Boberg, J. Jarvinen, and T. Salakoski, "Biolnfer: a corpus for information extraction in the biomedical domain," BMC bioinformatics, vol. 8, 2007, p. 50. [200] K. Fundel, R. K~ffner, and R. Zimmer, "RelEx-Relation extraction using dependency parse trees," Bioinformatics, vol. 23, 2007, pp. 365-371. [201] J. Ding, D. Berleant, D. Nettleton, and E.S. Wurtele, "Mining MEDLINE: abstracts, sentences, or phrases?," Pacific Symposium on Biocomputing, World Scientific, 2002, pp. 326-337. [202] C. Nddellec, "Learning language in logic-genic interaction extraction challenge," Proceedingsof the 4th LearningLanguage in Logic Workshop (LLL05), 2005. [203] K. Nagel, A. Jimeno-Yepes, and D. Rebholz-Schuhmann, "Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb," BMC bioinformatics, vol. 10, 2009, p. S4. 169 [204] E. Buyko and U. 
Hahn, "Evaluating the impact of alternative dependency graph encodings on solving event extraction tasks," Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2010, pp. 982-992. [205] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. KUbler, S. Marinov, and E. Marsi, "MaltParser: A language-independent system for data-driven dependency parsing," NaturalLanguage Engineering, vol. 13, 2007, pp. 95-13 5. [206] M. Miwa, S. Pyysalo, T. Hara, and J. Tsujii, "A comparative study of syntactic parsers for event extraction," Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, 2010, pp. 37-45. [207] M. Miwa, P. Thompson, and S. Ananiadou, "Boosting automatic event extraction from the literature using domain adaptation and coreference resolution," Bioinformatics, vol. 28, 2012, pp. 1759-1765. [208] M. Miwa, S. Pyysalo, T. Ohta, and S. Ananiadou, "Wide coverage biomedical event extraction using multiple partially overlapping corpora," BMC bioinformatics, vol. 14, 2013, p. 175. [209] S. Ranu and A.K. Singh, "Graphsig: A scalable approach to mining significant subgraphs in large graph databases," Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on, IEEE, 2009, pp. 844-855. [210] R. Kabiljo, A.B. Clegg, and A.J. Shepherd, "A realistic assessment of methods for extracting gene/protein interactions from free text," BMC bioinformatics, vol. 10, 2009, p. 233. [211] A. Robb-Smith, "US National Cancer Institute working formulation of non-Hodgkin's lymphomas for clinical use," The Lancet, vol. 320, 1982, pp. 432-434. [212] M. Bennett, G. Farrer-Brown, K. Henry, A. Jelliffe, R. Gerard-Marchant, I. Hamlin, K. Lennert, F. Rilke, A. Stansfeld, and J. Van Unnik, "Classification of non-Hodgkin's lymphomas," The Lancet, vol. 304, 1974, pp. 405-408. [213] R.J. Lukes and R.D. 
Collins, "Immunologic characterization of human malignant lymphomas," Cancer, vol. 34, 1974, pp. 1488-1503.
[214] H. Rappaport, Tumors of the Hematopoietic System, Armed Forces Institute of Pathology, 1966.
[215] E.S. Jaffe, N.L. Harris, H. Stein, and J. Vardiman, eds., WHO Classification of Tumours. Pathology and Genetics of Tumours of Haematopoietic and Lymphoid Tissues, IARC Press, 2001.
[216] S.H. Swerdlow, E. Campo, N.L. Harris, E.S. Jaffe, S.A. Pileri, H. Stein, J. Thiele, and J.W. Vardiman, eds., WHO classification of tumours of haematopoietic and lymphoid tissues, IARC Press, 2008.
[217] J. Turner, A. Hughes, A. Kricker, S. Milliken, A. Grulich, J. Kaldor, and B. Armstrong, "Use of the WHO lymphoma classification in a population-based epidemiological study," Annals of oncology, vol. 15, 2004, pp. 631-637.
[218] C.A. Clarke, S.L. Glaser, R.F. Dorfman, P.M. Bracci, E. Eberle, and E.A. Holly, "Expert Review of Non-Hodgkin's Lymphomas in a Population-Based Cancer Registry: Reliability of Diagnosis and Subtype Classifications," Cancer Epidemiology Biomarkers & Prevention, vol. 13, 2004, pp. 138-143.
[219] M. Snuderl, O.K. Kolman, Y.-B. Chen, J.J. Hsu, A.M. Ackerman, P. Dal Cin, J.A. Ferry, N.L. Harris, R.P. Hasserjian, L.R. Zukerberg, and others, "B-cell lymphomas with concurrent IGH-BCL2 and MYC rearrangements are aggressive neoplasms with clinical and pathologic features distinct from Burkitt lymphoma and diffuse large B-cell lymphoma," The American journal of surgical pathology, vol. 34, 2010, pp. 327-340.
[220] A.M. Gruver, M.A. Huba, A. Dogan, and E.D. Hsi, "Fibrin-associated Large B-cell Lymphoma: Part of the Spectrum of Cardiac Lymphomas," The American Journal of Surgical Pathology, vol. 36, 2012, pp. 1527-1537.
[221] K.J. Savage, N.L. Harris, J.M. Vose, F. Ullrich, E.S. Jaffe, J.M. Connors, L. Rimsza, S.A. Pileri, M. Chhanabhai, R.D.
Gascoyne, and others, "ALK- anaplastic large-cell lymphoma is clinically and immunophenotypically different from both ALK+ ALCL and peripheral T-cell lymphoma, not otherwise specified: report from the International Peripheral T-Cell Lymphoma Project," Blood, vol. 111, 2008, pp. 5496-5504.
[222] E. Hsi, T. Singleton, L. Swinnen, C. Dunphy, and S. Alkan, "Mucosa-associated lymphoid tissue-type lymphomas occurring in post-transplantation patients," The American journal of surgical pathology, vol. 24, 2000, pp. 100-106.
[223] J.A. Ferry, A.R. Sohani, J.A. Longtine, R.A. Schwartz, and N.L. Harris, "HHV8-positive, EBV-positive Hodgkin lymphoma-like large B-cell lymphoma and HHV8-positive intravascular large B-cell lymphoma," Modern Pathology, vol. 22, 2009, pp. 618-626.
[224] K.P. Liao, T. Cai, V. Gainer, S. Goryachev, Q. Zeng-treitler, S. Raychaudhuri, P. Szolovits, S. Churchill, S. Murphy, I. Kohane, and others, "Electronic medical records for discovery research in rheumatoid arthritis," Arthritis care and research, vol. 62, 2010, pp. 1120-1127.
[225] O. Uzuner, I. Goldstein, Y. Luo, and I. Kohane, "Identifying patient smoking status from medical discharge records," Journal of the American Medical Informatics Association, vol. 15, 2008, pp. 14-24.
[226] O. Uzuner, Y. Luo, and P. Szolovits, "Evaluating the state-of-the-art in automatic deidentification," Journal of the American Medical Informatics Association, vol. 14, 2007, pp. 550-563.
[227] O. Uzuner, "Recognizing obesity and comorbidities in sparse data," Journal of the American Medical Informatics Association, vol. 16, 2009, pp. 561-570.
[228] A.M. Cohen, "Five-way smoking status classification using text hot-spot identification and error-correcting output codes," Journal of the American Medical Informatics Association, vol. 15, 2008, pp. 32-35.
[229] E. Aramaki, T. Imai, K. Miyo, and K.
Ohe, "Patient status classification by using rule based sentence extraction and BM25 kNN-based classifier," i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data, 2006.
[230] C. Clark, K. Good, L. Jezierny, M. Macpherson, B. Wilson, and U. Chajewska, "Identifying smokers with a medical extraction system," Journal of the American Medical Informatics Association, vol. 15, 2008, pp. 36-39.
[231] I. Solt, D. Tikk, V. Gál, and Z.T. Kardkovács, "Semantic classification of diseases in discharge summaries using a context-aware rule-based classifier," Journal of the American Medical Informatics Association, vol. 16, 2009, pp. 580-584.
[232] R. Farkas, G. Szarvas, I. Hegedűs, A. Almási, V. Vincze, R. Ormándi, and R. Busa-Fekete, "Semi-automated construction of decision rules to predict morbidities from clinical texts," Journal of the American Medical Informatics Association, vol. 16, 2009, pp. 601-605.
[233] L.C. Childs, R. Enelow, L. Simonsen, N.H. Heintzelman, K.M. Kowalski, and R.J. Taylor, "Description of a rule-based system for the i2b2 challenge in natural language processing for clinical data," Journal of the American Medical Informatics Association, vol. 16, 2009, pp. 571-575.
[234] H. Ware, C.J. Mullett, and V. Jagannathan, "Natural language processing framework to assess clinical conditions," Journal of the American Medical Informatics Association, vol. 16, 2009, pp. 585-589.
[235] O. Uzuner, J. Mailoa, R. Ryan, and T. Sibanda, "Semantic relations for problem-oriented medical records," Artificial Intelligence in Medicine, vol. 50, 2010, pp. 63-73.
[236] T. Sibanda, T. He, P. Szolovits, and O. Uzuner, "Syntactically-informed semantic category recognizer for discharge summaries," AMIA annual symposium proceedings, American Medical Informatics Association, 2006, pp. 714-718.
[237] D. Albright, A. Lanfranchi, A. Fredriksen, W.F. Styler, C. Warner, J.D. Hwang, J.D. Choi, D. Dligach, R.D. Nielsen, J.
Martin, and others, "Towards comprehensive syntactic and semantic annotations of the clinical narrative," Journal of the American Medical Informatics Association, vol. 20, 2013, pp. 922-930.
[238] Partners Healthcare, "Research Patient Data Registry (RPDR) http://rc.partners.org/rpdr."
[239] L.G. Shaffer and N. Tommerup, ISCN 2013: an international system for human cytogenetic nomenclature (2013): recommendations of the International Standing Committee on Human Cytogenetic Nomenclature, Karger, 2013.
[240] Apache OpenNLP project team, "Apache OpenNLP http://opennlp.apache.org/," Apr. 2013.
[241] B. Santorini, "Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision)," 1990.
[242] International Health Terminology Standards Development Organisation, "SNOMED CT http://www.ihtsdo.org/snomed-ct/."
[243] Y. Chen, H. Gu, Y. Perl, M. Halper, and J. Xu, "Expanding the extent of a UMLS semantic type via group neighborhood auditing," Journal of the American Medical Informatics Association, vol. 16, 2009, pp. 746-757.
[244] AbiWord, "Link Parser http://www.abisource.com/projects/link-grammar/."
[245] J.D. Choi and M. Palmer, "Getting the Most out of Transition-based Dependency Parsing," ACL (Short Papers), 2011, pp. 687-692.
[246] M.-C. De Marneffe and C.D. Manning, "Stanford typed dependencies manual," 2008.
[247] Y. Chi, R.R. Muntz, S. Nijssen, and J.N. Kok, "Frequent subtree mining-an overview," Fundamenta Informaticae, vol. 66, 2005, pp. 161-198.
[248] C. Jiang, F. Coenen, and M. Zito, "A Survey of Frequent Subgraph Mining Algorithms," Knowledge Engineering Review, vol. 28, 2013, pp. 75-105.
[249] I. Goldstein and O. Uzuner, "Specializing for predicting obesity and its co-morbidities," Journal of biomedical informatics, vol. 42, 2009, pp. 873-886.
[250] W. Long, "Extracting diagnoses from discharge summaries," AMIA annual symposium proceedings, American Medical Informatics Association, 2005, pp. 470-474.
[251] W.B. Cavnar and J.M.
Trenkle, "N-Gram-Based Text Categorization," Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, pp. 161-175.
[252] R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval, 1999.
[253] E.W. Noreen, Computer-intensive methods for testing hypotheses: an introduction, Wiley, 1989.
[254] Z. Fan, Y. Natkunam, E. Bair, R. Tibshirani, and R.A. Warnke, "Characterization of variant patterns of nodular lymphocyte predominant Hodgkin lymphoma with immunohistologic and clinical correlation," The American journal of surgical pathology, vol. 27, 2003, pp. 1346-1356.
[255] A.R. Sohani, E.S. Jaffe, N.L. Harris, J.A. Ferry, S. Pittaluga, and R.P. Hasserjian, "Nodular lymphocyte-predominant Hodgkin lymphoma with atypical T cells: a morphologic variant mimicking peripheral T-cell lymphoma," The American journal of surgical pathology, vol. 35, 2011, pp. 1666-1678.
[256] A. Rahemtullah, K.K. Reichard, F.I. Preffer, N.L. Harris, and R.P. Hasserjian, "A double-positive CD4+ CD8+ T-cell population is commonly found in nodular lymphocyte predominant Hodgkin lymphoma," American journal of clinical pathology, vol. 126, 2006, pp. 805-814.
[257] R.L. Winslow, N. Trayanova, D. Geman, and M.I. Miller, "Computational medicine: translating models to clinical care," Science translational medicine, vol. 4, 2012, p. 158rv11.
[258] M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, and others, "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning," Nature medicine, vol. 8, 2002, pp. 68-74.
[259] J.Y. Irwin, H. Harkema, L.M. Christensen, T. Schleyer, P.J. Haug, and W.W. Chapman, "Methodology to develop and evaluate a semantic representation for NLP," AMIA Annual Symposium Proceedings, American Medical Informatics Association, 2009, p. 271.
[260] M.M. Gordon, A.M. Moser, and E.
Rubin, "Unsupervised Analysis of Classical Biomedical Markers: Robustness and Medical Relevance of Patient Clustering Using Bioinformatics Tools," PloS one, vol. 7, 2012, p. e29578.
[261] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proceedings of the National Academy of Sciences, vol. 95, 1998, pp. 14863-14868.
[262] T.A. Lasko, J.C. Denny, and M.A. Levy, "Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data," PloS one, vol. 8, 2013, p. e66341.
[263] G.N. Noren, J. Hopstadius, A. Bate, K. Star, and I.R. Edwards, "Temporal pattern discovery in longitudinal electronic patient records," Data Mining and Knowledge Discovery, vol. 20, 2010, pp. 361-387.
[264] D.D. Lee and H.S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, 1999, pp. 788-791.
[265] M. Hofree, J.P. Shen, H. Carter, A. Gross, and T. Ideker, "Network-based stratification of tumor mutations," Nature methods, 2013.
[266] F.-J. Müller, L.C. Laurent, D. Kostka, I. Ulitsky, R. Williams, C. Lu, I.-H. Park, M.S. Rao, R. Shamir, P.H. Schwartz, and others, "Regulatory networks define phenotypic classes of human stem cell lines," Nature, vol. 455, 2008, pp. 401-405.
[267] E.A. Collisson, A. Sadanandam, P. Olson, W.J. Gibb, M. Truitt, S. Gu, J. Cooc, J. Weinkle, G.E. Kim, L. Jakkula, and others, "Subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy," Nature medicine, vol. 17, 2011, pp. 500-503.
[268] F. Wang, N. Lee, J. Hu, J. Sun, and S. Ebadollahi, "Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach," Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2012, pp. 453-461.
[269] H. Kim and H.
Park, "Sparse non-negative matrix factorizations via alternating nonnegativity-constrained least squares for microarray data analysis," Bioinformatics, vol. 23, 2007, pp. 1495-1502.
[270] J.-P. Brunet, P. Tamayo, T.R. Golub, and J.P. Mesirov, "Metagenes and molecular pattern discovery using matrix factorization," Proceedings of the National Academy of Sciences, vol. 101, 2004, pp. 4164-4169.
[271] Y. Gao and G. Church, "Improving molecular cancer class discovery through sparse nonnegative matrix factorization," Bioinformatics, vol. 21, 2005, pp. 3970-3975.
[272] S. Nik-Zainal, D.C. Wedge, L.B. Alexandrov, M. Petljak, A.P. Butler, N. Bolli, H.R. Davies, S. Knappskog, S. Martin, E. Papaemmanuil, and others, "Association of a germline copy number polymorphism of APOBEC3A and APOBEC3B with burden of putative APOBEC-dependent mutations in breast cancer," Nature genetics, 2014.
[273] L.B. Alexandrov, S. Nik-Zainal, D.C. Wedge, S.A. Aparicio, S. Behjati, A.V. Biankin, G.R. Bignell, N. Bolli, A. Borg, A.-L. Borresen-Dale, and others, "Signatures of mutational processes in human cancer," Nature, 2013.
[274] L.R. Tucker, "Some mathematical notes on three-mode factor analysis," Psychometrika, vol. 31, 1966, pp. 279-311.
[275] J. Sun, D. Tao, S. Papadimitriou, P.S. Yu, and C. Faloutsos, "Incremental tensor analysis: Theory and applications," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 2, 2008, p. 11.
[276] R.A. Harshman and M.E. Lundy, "Uniqueness proof for a family of models sharing features of Tucker's three-mode factor analysis and PARAFAC/CANDECOMP," Psychometrika, vol. 61, 1996, pp. 133-154.
[277] L. Omberg, G.H. Golub, and O. Alter, "A tensor higher-order singular value decomposition for integrative analysis of DNA microarray data from different studies," Proceedings of the National Academy of Sciences, vol. 104, 2007, pp. 18371-18376.
[278] L. Omberg, J.R. Meyerson, K. Kobayashi, L.S. Drury, J.F. Diffley, and O.
Alter, "Global effects of DNA replication and DNA replication origin activity on eukaryotic gene expression," Molecular systems biology, vol. 5, 2009.
[279] C. Ozcaglar, A. Shabbeer, S. Vandenberg, B. Yener, and K.P. Bennett, "Sublineage structure analysis of Mycobacterium tuberculosis complex strains using multiple-biomarker tensors," BMC genomics, vol. 12, 2011, p. S1.
[280] B. Yener, E. Acar, P. Aguis, K. Bennett, S. Vandenberg, and G. Plopper, "Multiway modeling and analysis in stem cell systems biology," BMC Systems Biology, vol. 2, 2008, p. 63.
[281] B.W. Bader, A.A. Puretskiy, and M.W. Berry, "Scenario discovery using nonnegative tensor factorization," Progress in Pattern Recognition, Image Analysis and Applications, Springer, 2008, pp. 791-805.
[282] M.W. Berry and M. Browne, "Email surveillance using non-negative matrix factorization," Computational & Mathematical Organization Theory, vol. 11, 2005, pp. 249-264.
[283] F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons, "Document clustering using nonnegative matrix factorization," Information Processing & Management, vol. 42, 2006, pp. 373-386.
[284] B.W. Bader, M.W. Berry, and M. Browne, "Discussion tracking in Enron email using PARAFAC," Survey of Text Mining II, Springer, 2008, pp. 147-163.
[285] T.G. Kolda and B.W. Bader, "Tensor decompositions and applications," SIAM review, vol. 51, 2009, pp. 455-500.
[286] Y. Xu and W. Yin, "A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion," SIAM Journal on Imaging Sciences, vol. 6, 2013, pp. 1758-1789.
[287] C.D. Manning and H. Schütze, Foundations of statistical natural language processing, MIT press, 1999.
[288] C.H. Ding, X. He, and H.D. Simon, "On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering," SDM, SIAM, 2005, pp. 606-610.
[289] C.D. Manning, P. Raghavan, and H.
Schütze, Introduction to information retrieval, Cambridge University Press, Cambridge, 2008.
[290] J. Liu, J. Liu, P. Wonka, and J. Ye, "Sparse non-negative tensor factorization using columnwise coordinate descent," Pattern Recognition, vol. 45, 2012, pp. 649-656.
[291] T.L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National academy of Sciences of the United States of America, vol. 101, 2004, pp. 5228-5235.
[292] T.L. Griffiths and Z. Ghahramani, "The indian buffet process: An introduction and review," The Journal of Machine Learning Research, vol. 12, 2011, pp. 1185-1224.
[293] N. McIntosh, "Intensive care monitoring: past, present and future," Clinical medicine, vol. 2, 2002, pp. 349-355.
[294] W. Zong, G. Moody, and R. Mark, "Reduction of false arterial blood pressure alarms using signal quality assessment and relationships between the electrocardiogram and arterial blood pressure," Medical and Biological Engineering and Computing, vol. 42, 2004, pp. 698-706.
[295] G. Martin, "State-of-the-art fluid management in critically ill patients," Current Opinion in Critical Care, vol. 20, 2014, p. 359.
[296] M. Saeed, M. Villarroel, A.T. Reisner, G. Clifford, L.-W. Lehman, G. Moody, T. Heldt, T.H. Kyaw, B. Moody, and R.G. Mark, "Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database," Critical care medicine, vol. 39, 2011, p. 952.
[297] K.B. Kshetri, "Modelling patient states in intensive care patients," Massachusetts Institute of Technology, 2011.
[298] Z. Syed and J.V. Guttag, "Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data," Journal of Machine Learning Research, vol. 12, 2011, pp. 999-1024.
[299] J. Lin, E. Keogh, L. Wei, and S. Lonardi, "Experiencing SAX: a novel symbolic representation of time series," Data Mining and knowledge discovery, vol. 15, 2007, pp. 107-144.
[300] J. Huan, W. Wang, J. Prins, and J.
Yang, "Spin: mining maximal frequent subgraphs from graph databases," Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2004, pp. 581-586.
[301] E. Tjioe, M.W. Berry, and R. Homayouni, "Discovering gene functional relationships using FAUN (Feature Annotation Using Nonnegative matrix factorization)," BMC bioinformatics, vol. 11, 2010, p. S14.
[302] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural computation, vol. 19, 2007, pp. 2756-2779.
[303] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and others, "Scikit-learn: Machine learning in Python," The Journal of Machine Learning Research, vol. 12, 2011, pp. 2825-2830.
[304] C. Boutsidis and E. Gallopoulos, "SVD based initialization: A head start for nonnegative matrix factorization," Pattern Recognition, vol. 41, 2008, pp. 1350-1362.
[305] P.O. Hoyer, "Non-negative matrix factorization with sparseness constraints," The Journal of Machine Learning Research, vol. 5, 2004, pp. 1457-1469.
[306] C. Hug, "Detecting hazardous intensive care patient episodes using real-time mortality models," Massachusetts Institute of Technology, 2009.
[307] Autism and Developmental Disabilities Monitoring Network Surveillance Year 2010 Principal Investigators, "Prevalence of autism spectrum disorder among children aged 8 years-autism and developmental disabilities monitoring network, 11 sites, United States, 2010," Morbidity and mortality weekly report. Surveillance summaries, vol. 63, 2014.
[308] C.P. Johnson, S.M. Myers, and the Council on Children With Disabilities, "Identification and evaluation of children with autism spectrum disorders," Pediatrics, vol. 120, 2007, pp. 1183-1215.
[309] A. Bailey, A. Le Couteur, I. Gottesman, P. Bolton, E. Simonoff, E. Yuzda, and M.
Rutter, "Autism as a strongly genetic disorder: evidence from a British twin study," Psychological medicine, vol. 25, 1995, pp. 63-77.
[310] S. Steffenburg, C. Gillberg, L. Hellgren, L. Andersson, I.C. Gillberg, G. Jakobsson, and M. Bohman, "A twin study of autism in Denmark, Finland, Iceland, Norway and Sweden," Journal of Child Psychology and Psychiatry, vol. 30, 1989, pp. 405-416.
[311] S. Folstein and M. Rutter, "Infantile autism: a genetic study of 21 twin pairs," Journal of Child psychology and Psychiatry, vol. 18, 1977, pp. 297-321.
[312] S.R. Gilman, I. Iossifov, D. Levy, M. Ronemus, M. Wigler, and D. Vitkup, "Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses," Neuron, vol. 70, 2011, pp. 898-907.
[313] D. Levy, M. Ronemus, B. Yamrom, Y. Lee, A. Leotta, J. Kendall, S. Marks, B. Lakshmi, D. Pai, K. Ye, and others, "Rare de novo and transmitted copy-number variation in autistic spectrum disorders," Neuron, vol. 70, 2011, pp. 886-897.
[314] S.J. Sanders, A.G. Ercan-Sencicek, V. Hus, R. Luo, M.T. Murtha, D. Moreno-De-Luca, S.H. Chu, M.P. Moreau, A.R. Gupta, S.A. Thomson, and others, "Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism," Neuron, vol. 70, 2011, pp. 863-885.
[315] Y. Sakai, C.A. Shaw, B.C. Dawson, D.V. Dugas, Z. Al-Mohtaseb, D.E. Hill, and H.Y. Zoghbi, "Protein interactome reveals converging molecular pathways among autism disorders," Science translational medicine, vol. 3, 2011, p. 86ra49.
[316] L.A. Weiss, D.E. Arking, M.J. Daly, A. Chakravarti, C.W. Brune, K. West, A. O'Connor, G. Hilton, R.L. Tomlinson, A.B. West, and others, "A genome-wide linkage and association scan reveals novel loci for autism," Nature, vol. 461, 2009, pp. 802-808.
[317] K. Wang, H. Zhang, D. Ma, M. Bucan, J.T. Glessner, B.S.
Abrahams, D. Salyakina, M. Imielinski, J.P. Bradfield, P.M. Sleiman, and others, "Common genetic variants on 5p14.1 associate with autism spectrum disorders," Nature, vol. 459, 2009, pp. 528-533.
[318] M.H. Chahrour, T.W. Yu, E.T. Lim, B. Ataman, M.E. Coulter, R.S. Hill, C.R. Stevens, C.R. Schubert, M.E. Greenberg, S.B. Gabriel, and others, "Whole-exome sequencing and homozygosity analysis implicate depolarization-regulated neuronal genes in autism," PLoS genetics, vol. 8, 2012, p. e1002635.
[319] B.J. O'Roak, L. Vives, W. Fu, J.D. Egertson, I.B. Stanaway, I.G. Phelps, G. Carvill, A. Kumar, C. Lee, K. Ankenman, and others, "Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders," Science, vol. 338, 2012, pp. 1619-1622.
[320] T.N. Turner, K. Sharma, E.C. Oh, Y.P. Liu, R.L. Collins, M.X. Sosa, D.R. Auer, H. Brand, S.J. Sanders, D. Moreno-De-Luca, and others, "Loss of δ-catenin function in severe autism," Nature, vol. 520, 2015, pp. 51-56.
[321] I.S. Kohane, A. McMurry, G. Weber, D. MacFadden, L. Rappaport, L. Kunkel, J. Bickel, N. Wattanasin, S. Spence, S. Murphy, and others, "The co-morbidity burden of children and young adults with autism spectrum disorders," PloS one, vol. 7, 2012, p. e33224.
[322] I. Voineagu, X. Wang, P. Johnston, J.K. Lowe, Y. Tian, S. Horvath, J. Mill, R.M. Cantor, B.J. Blencowe, and D.H. Geschwind, "Transcriptomic analysis of autistic brain reveals convergent molecular pathology," Nature, vol. 474, 2011, pp. 380-384.
[323] M.W. State, P. Levitt, and others, "The conundrums of understanding genetic risks for autism spectrum disorders," Nature neuroscience, vol. 14, 2011, pp. 1499-1506.
[324] R. Toro, M. Konyukh, R. Delorme, C. Leblond, P. Chaste, F. Fauchereau, M. Coleman, M. Leboyer, C. Gillberg, and T. Bourgeron, "Key role for gene dosage and synaptic homeostasis in autism spectrum disorders," Trends in genetics, vol. 26, 2010, pp. 363-372.
[325] T.
Bourgeron, "A synaptic trek to autism," Current opinion in neurobiology, vol. 19, 2009, pp. 231-234.
[326] I.S. Kohane, "An autism case history to review the systematic analysis of large-scale data to refine the diagnosis and treatment of neuropsychiatric disorders," Biological psychiatry, vol. 77, 2015, pp. 59-65.
[327] F. Doshi-Velez, Y. Ge, and I. Kohane, "Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis," Pediatrics, vol. 133, 2014, pp. e54-e63.
[328] K.K. Ausderau, M. Furlong, J. Sideris, J. Bulluck, L.M. Little, L.R. Watson, B.A. Boyd, A. Belger, V.A. Dickie, and G.T. Baranek, "Sensory subtypes in children with autism spectrum disorder: Latent profile transition analysis using a national survey of sensory features," Journal of Child Psychology and Psychiatry, vol. 55, 2014, pp. 935-944.
[329] I. Rapin, M.A. Dunn, D.A. Allen, M.C. Stevens, and D. Fein, "Subtypes of language disorders in school-age children with autism," Developmental Neuropsychology, vol. 34, 2009, pp. 66-84.
[330] F. Hormozdiari, O. Penn, E. Borenstein, and E.E. Eichler, "The discovery of integrated gene networks for autism and related disorders," Genome research, vol. 25, 2015, pp. 142-154.
[331] C.J. McDougle, S.M. Landino, A. Vahabzadeh, J. O'Rourke, N.R. Zurcher, B.C. Finger, M.L. Palumbo, J. Helt, J.E. Mullett, J.M. Hooker, and others, "Toward an immune-mediated subtype of autism spectrum disorder," Brain research, 2014.
[332] E.Y. Hsiao, "Immune dysregulation in autism spectrum disorder," Int Rev Neurobiol, vol. 113, 2013, pp. 269-302.
[333] M. Michel, M.J. Schmidt, and K. Mirnics, "Immune system gene dysregulation in autism and schizophrenia," Developmental neurobiology, vol. 72, 2012, pp. 1277-1287.
[334] N. Krumm, B.J. O'Roak, J. Shendure, and E.E. Eichler, "A de novo convergence of autism genetics and molecular neuroscience," Trends in neurosciences, vol. 37, 2014, pp. 95-105.
[335] E. Ben-David and S.
Shifman, "Combined analysis of exome sequencing points toward a major role for transcription regulation during brain development in autism," Molecular psychiatry, vol. 18, 2013, pp. 1054-1056.
[336] W.F. Hu, M.H. Chahrour, and C.A. Walsh, "The diverse genetic landscape of neurodevelopmental disorders," Annual review of genomics and human genetics, vol. 15, 2014, pp. 195-213.
[337] M.E. Talkowski, J.A. Rosenfeld, I. Blumenthal, V. Pillalamarri, C. Chiang, A. Heilbut, C. Ernst, C. Hanscom, E. Rossin, A.M. Lindgren, and others, "Sequencing chromosomal abnormalities reveals neurodevelopmental loci that confer risk across diagnostic boundaries," Cell, vol. 149, 2012, pp. 525-537.
[338] A.J. Willsey, S.J. Sanders, M. Li, S. Dong, A.T. Tebbenkamp, R.A. Muhle, S.K. Reilly, L. Lin, S. Fertuzinhos, J.A. Miller, and others, "Coexpression networks implicate human midfetal deep cortical projection neurons in the pathogenesis of autism," Cell, vol. 155, 2013, pp. 997-1007.
[339] T. Yuan, Y. Jiao, S. de Jong, R.A. Ophoff, S. Beck, and A.E. Teschendorff, "An integrative multi-scale analysis of the dynamic DNA methylation landscape in aging," PLoS genetics, vol. 11, 2015, p. e1004996.
[340] D. Robinson, E.M. Van Allen, Y.-M. Wu, N. Schultz, R.J. Lonigro, J.-M. Mosquera, B. Montgomery, M.-E. Taplin, C.C. Pritchard, G. Attard, and others, "Integrative Clinical Genomics of Advanced Prostate Cancer," Cell, vol. 161, 2015, pp. 1215-1228.
[341] E. López-Knowles, P.M. Wilkerson, R. Ribas, H. Anderson, A. Mackay, Z. Ghazoui, A. Rani, P. Osin, A. Nerurkar, L. Renshaw, and others, "Integrative analyses identify modulators of response to neoadjuvant aromatase inhibitors in patients with early breast cancer," Breast Cancer Research, vol. 17, 2015, p. 35.
[342] The Cancer Genome Atlas Research Network, "Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas," New England Journal of Medicine, vol. 372, 2015, pp. 2481-2498.
[343] The Cancer Genome Atlas Research Network, "Genomic Classification of Cutaneous Melanoma," Cell, vol. 161, 2015, pp. 1681-1696.
[344] M. Melé, P.G. Ferreira, F. Reverter, D.S. DeLuca, J. Monlong, M. Sammeth, T.R. Young, J.M. Goldmann, D.D. Pervouchine, T.J. Sullivan, and others, "The human transcriptome across tissues and individuals," Science, vol. 348, 2015, pp. 660-665.
[345] "BrainSpan: Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01," 2011.
[346] B.S. Everitt, The Cambridge Dictionary of Statistics, Cambridge University Press, 2006.
[347] BrainSpan, Transcriptome profiling by RNA sequencing and exon microarray, Allen Institute, 2013.
[348] G. Csardi, "Package igraph," 2010.
[349] GATK team, "https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_coverage_CallableLoci.php."
[350] N. Krumm, T.N. Turner, C. Baker, L. Vives, K. Mohajeri, K. Witherspoon, A. Raja, B.P. Coe, H.A. Stessman, Z.-X. He, and others, "Excess of rare, inherited truncating mutations in autism," Nature genetics, vol. 47, 2015, pp. 582-588.
[351] M.A. DePristo, E. Banks, R. Poplin, K.V. Garimella, J.R. Maguire, C. Hartl, A.A. Philippakis, G. del Angel, M.A. Rivas, M. Hanna, and others, "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature genetics, vol. 43, 2011, pp. 491-498.
[352] H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM," arXiv preprint arXiv:1303.3997, 2013.
[353] The Picard team, "The Picard toolkit http://picard.sourceforge.net/," 2014.
[354] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and others, "The sequence alignment/map format and SAMtools," Bioinformatics, vol. 25, 2009, pp. 2078-2079.
[355] K. Wang, M. Li, and H.
Hakonarson, "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data," Nucleic acids research, vol. 38, 2010, p. e164.
[356] K.D. Pruitt, G.R. Brown, S.M. Hiatt, F. Thibaud-Nissen, A. Astashyn, O. Ermolaeva, C.M. Farrell, J. Hart, M.J. Landrum, K.M. McGarvey, and others, "RefSeq: an update on mammalian reference sequences," Nucleic acids research, vol. 42, 2014, pp. D756-D763.
[357] K.R. Rosenbloom, J. Armstrong, G.P. Barber, J. Casper, H. Clawson, M. Diekhans, T.R. Dreszer, P.A. Fujita, L. Guruvadoo, M. Haeussler, and others, "The UCSC genome browser database: 2015 update," Nucleic acids research, vol. 43, 2015, pp. D670-D681.
[358] J. Harrow, A. Frankish, J.M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, B.L. Aken, D. Barrell, A. Zadissa, S. Searle, and others, "GENCODE: the reference human genome annotation for The ENCODE Project," Genome research, vol. 22, 2012, pp. 1760-1774.
[359] I. Adzhubei, D.M. Jordan, and S.R. Sunyaev, "Predicting functional effect of human missense mutations using PolyPhen-2," Current protocols in human genetics, 2013, pp. 7-20.
[360] P. Kumar, S. Henikoff, and P.C. Ng, "Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm," Nature protocols, vol. 4, 2009, pp. 1073-1081.
[361] J.M. Schwarz, D.N. Cooper, M. Schuelke, and D. Seelow, "MutationTaster2: mutation prediction for the deep-sequencing age," Nature methods, vol. 11, 2014, pp. 361-362.
[362] B. Reva, Y. Antipin, and C. Sander, "Predicting the functional impact of protein mutations: application to cancer genomics," Nucleic acids research, 2011, p. gkr407.
[363] M. Kircher, D.M. Witten, P. Jain, B.J. O'Roak, G.M. Cooper, and J. Shendure, "A general framework for estimating the relative pathogenicity of human genetic variants," Nature genetics, vol. 46, 2014, pp. 310-315.
[364] S. Chun and J.C.
Fay, "Identification of deleterious mutations within three human genomes," Genome research, vol. 19, 2009, pp. 1553-1561.
[365] H. Carter, C. Douville, P.D. Stenson, D.N. Cooper, and R. Karchin, "Identifying Mendelian disease genes with the variant effect scoring tool," BMC genomics, vol. 14, 2013, p. S3.
[366] G.M. Cooper, E.A. Stone, G. Asimenos, E.D. Green, S. Batzoglou, and A. Sidow, "Distribution and intensity of constraint in mammalian genomic sequence," Genome research, vol. 15, 2005, pp. 901-913.
[367] M. Garber, M. Guttman, M. Clamp, M.C. Zody, N. Friedman, and X. Xie, "Identifying novel constrained elements by exploiting biased substitution patterns," Bioinformatics, vol. 25, 2009, pp. i54-i62.
[368] E.V. Davydov, D.L. Goode, M. Sirota, G.M. Cooper, A. Sidow, and S. Batzoglou, "Identifying a high fraction of the human genome to be under selective constraint using GERP++," PLoS Comput Biol, vol. 6, 2010, p. e1001025.
[369] 1000 Genomes Project Consortium, "An integrated map of genetic variation from 1,092 human genomes," Nature, vol. 491, 2012, pp. 56-65.
[370] Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA, "http://evs.gs.washington.edu/EVS/," Sep. 2014.
[371] Exome Aggregation Consortium (ExAC), "ExAC Summary Data http://exac.broadinstitute.org," Apr. 2015.
[372] P.D. Stenson, E.V. Ball, M. Mort, A.D. Phillips, K. Shaw, and D.N. Cooper, "The Human Gene Mutation Database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution," Current protocols in bioinformatics, 2012, pp. 1-13.
[373] M. Lawrence, W. Huber, H. Pages, P. Aboyoun, M. Carlson, R. Gentleman, M.T. Morgan, and V.J. Carey, "Software for computing and annotating genomic ranges," PLoS computational biology, vol. 9, 2013, p. e1003118.
[374] B. Neale, M. Ferreira, and S. Medland, Statistical Genetics, Taylor & Francis Group, 2012.
[375] R.A. Fisher, Statistical methods for research workers, Genesis Publishing Pvt Ltd, 1925.
[376] O.J.
Dunn, "Multiple comparisons among means," Journal of the American Statistical Association, vol. 56, 1961, pp. 52-64. [377] J.S. Amberger, C.A. Bocchini, F. Schiettecatte, A.F. Scott, and A. Hamosh, "OMIM. org: Online Mendelian Inheritance in Man (OMIM@), an online catalog of human genes and genetic disorders," Nucleic acids research, vol. 43, 2015, pp. D789-D798. [378] C.-D.G. of the Psychiatric Genomics Consortium and others, "Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs," Nature genetics, vol. 45, 2013, pp. 984-994. [379] S.N. Murphy, G. Weber, M. Mendis, V. Gainer, H.C. Chueh, S. Churchill, and I. Kohane, "Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2)," Journalof the American Medical Informatics Association, vol. 17, 2010, pp. 124-130. [380] I.S. Kohane, S.E. Churchill, and S.N. Murphy, "A translational engine at the national scale: informatics for integrating biology and the bedside," Journal of the American Medical Informatics Association, vol. 19, 2012, pp. 181-185. 180 [381] A.S. Weitlauf, M.L. McPheeters, B. Peters, N. Sathe, R. Travis, R. Aiello, E. Williamson, J. Veenstra-VanderWeele, S. Krishnaswami, R. Jerome, and others, "Therapies for Children With Autism Spectrum Disorder," 2014. [382] S.A. Brigandi, H. Shao, S.Y. Qian, Y. Shen, B.-L. Wu, and J.X. Kang, "Autistic Children Exhibit Decreased Levels of Essential Fatty Acids in Red Blood Cells," International journal of molecular sciences, vol. 16, 2015, pp. 10061-10076. [383] J. Gordon Bell, D. Miller, D.J. MacDonald, E.E. MacKinlay, J.R. Dick, S. Cheseldine, R.M. Boyle, C. Graham, and A.E. O'Hare, "The fatty acid compositions of erythrocyte and plasma polar lipids in children with autism, developmental delay or typically developing controls and the effect of fish oil intake," Britishjournalof nutrition, vol. 103, 2010, pp. 1160-1167. [384] M. Wiest, J. German, D. Harvey, S. Watkins, and I. 
Hertz-Picciotto, "Plasma fatty acid profiles in autism: a case-control study," Prostaglandins, Leukotrienes and Essential FattyAcids, vol. 80, 2009, pp. 221-227. [385] S. Vancassel, G. Durand, C. Barthelemy, B. Lejeune, J. Martineau, D. Guilloteau, C. Andres, and S. Chalon, "Plasma fatty acid levels in autistic children," Prostaglandins, Leukotrienes and EssentialFattyAcids, vol. 65, 2001, pp. 1-7. [386] C. Betsholtz, "Lipid transport and human brain development," Nat Genet, vol. 47, 2015, pp. 699-701. [387] M. Aureli, S. Grassi, S. Prioni, S. Sonnino, and A. Prinetti, "Lipid membrane domains in the brain," Biochimica et Biophysica Acta (BBA)-Molecular and Cell Biology of Lipids, vol. 1851, 2015, pp. 1006-1016. [388] A. Guemez-Gamboa, L.N. Nguyen, H. Yang, M.S. Zaki, M. Kara, T. Ben-Omran, N. Akizu, R.O. Rosti, B. Rosti, E. Scott, and others, "Inactivating mutations in MFSD2A, required for omega-3 fatty acid transport in brain, cause a lethal microcephaly syndrome," Nature genetics, 2015. [389] V. Alakbarzade, A. Hameed, D.Q. Quek, B.A. Chioza, E.L. Baple, A. Cazenave-Gassiot, L.N. Nguyen, M.R. Wenk, A.Q. Ahmad, A. Sreekantan-Nair, and others, "A partially inactivating mutation in the sodium-dependent lysophosphatidylcholine transporter MFSD2A causes a non-lethal microcephaly syndrome," Nature genetics, 2015. [390] T. Papadopoulos, R. Schemm, H. Grubmiller, and N. Brose, "Lipid Binding Defects and Perturbed Synaptogenic Activity of a Collybistin R290H Mutant That Causes Epilepsy and Intellectual Disability," Journal of Biological Chemistry, vol. 290, 2015, pp. 82568270. [391] Y. Luo, G. Riedlinger, and P. Szolovits, "Text Mining in Cancer Gene and Pathway Prioritization," Cancer Informatics, vol. 13, 2014, pp. 69-79. 181