Research Survey: Automatic Hypothesis Generation using Literature-Based Discovery
Kathleen Padova, June 2015

Undiscovered Public Knowledge
  Don R. Swanson (1924–2012)
  Concerned about increasing domain specialization [1]
  Focused on computer-aided information retrieval
  Uncovering unseen links between two distinct areas of study (disjoint literatures) could yield new discoveries, which he called “undiscovered public knowledge”
  Pioneered the field of literature-based discovery (LBD)

ABCs of Literature-Based Discovery
  ABC Method: A->B + B->C = A->C (?)
  Open vs. Closed discovery

“Fish Oil and Raynaud’s Disease”
  Swanson hypothesized a connection between dietary fish oil and Raynaud’s syndrome
  1986: two papers on the same topic
    Library Quarterly [1]
    Perspectives in Biology and Medicine [2]
  Validated 3 years later in clinical trials [3]
  1988: magnesium deficiency and migraines, also supported later by clinical trials [4]

Search Process
  [Figure: the Migraine literature and the Magnesium literature linked through intermediate literatures on Vascular Reactivity, Spreading Depression, Calcium Channel Blockers, Inflammation, Prostaglandins, Platelet Aggregation, Serotonin, Cerebral Anoxia, and Epilepsy. Image reproduced from [4]]

ARROWSMITH
  Previous LBD studies were “partially systematic”
  Swanson joined with Neil R. Smalheiser, Department of Psychiatry, University of Illinois
  Together they developed a set of interactive software and database search strategies to facilitate discovery [5]
  Available: http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi
  Applications:
    Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neuroscience Research Communications [6]
    Indomethacin and Alzheimer's Disease. Neurology [7]
    Linking Estrogen to Alzheimer's Disease: An informatics approach. Neurology [8]
    Calcium-independent phospholipase A2 and schizophrenia.
    Archives of General Psychiatry [9]

ARROWSMITH
  [Figure: Target Literature A and Source Literature C linked through Intermediate Literatures B1–B4 (links AB1–AB4 and B1C–B4C). Image reproduced from [5]]

Basic LBD methodology
  Information/Entity Retrieval/Extraction
    Query the literature for areas of interest
    Identify key concepts/terms
    Characterize the literature
  Hypothesis Generation
    Find connections between retrieved literatures
  Evaluation / Vetting
    Review connections for novelty, feasibility

Information/Entity Retrieval/Extraction
  Swanson’s method was very manual
  Query formulation is essential [10]
  As the literature grew, search results became overwhelming
  Further research attempted to resolve this through:
    Controlled vocabularies [11][12]
    Established subject headings/ontologies (e.g. MeSH) [13][14][15]
    Text mining techniques incl. ranking, weighting, clustering [16]
    Information modeling [17][18]
    Relationship extraction [19]

Hypothesis Generation
  Early LBD relied on manually creating a second query and on co-occurrence of query terms
  Later research improvements include:
    Semantics/NLP to extract relationships [20][21][22]
    Latent Semantic Indexing [23]
    Vector Space Modeling [24][25][26][27]
    Lexical Statistics [28]
    Fuzzy Set Theory [29][30]
    Bayesian Nets [31]
    Ranking, weighting [14][32][33]

Hypothesis Evaluation / Vetting
  Dependent on domain expert review
  Results typically presented in lists and tables, which are slow to review
  Later LBD methods include ranking and, more recently, visualizations [34][35]
  Still requires a domain expert, but evaluation is easier

Beyond ABC
  Discovery patterns [27]
  Multiple intermediary steps [36]

Applications
  LBD seen largely in biomedical sciences (mining MEDLINE, PubMed, newer gene databases)
    Drug repositioning [37]
    Drug-disease linkages [38]
    Gene-disease linkages [39][40][41]
  A few non-medical science applications
    Water purification [42]
    Technology and social issues [43]

References
[1] D. R. Swanson, “Undiscovered Public Knowledge,” Libr. Q., vol. 56, no. 2, pp. 103–118, 1986.
[2] D. R. Swanson, “Fish oil, Raynaud’s syndrome, and undiscovered public knowledge,” Perspect. Biol. Med., vol. 30, no. 1, pp. 7–18, 1986.
[3] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, “Fish-oil dietary supplementation in patients with Raynaud’s phenomenon: a double-blind, controlled, prospective study,” Am. J. Med., vol. 86, no. 2, pp. 158–164, 1989.
[4] D. R. Swanson, “Migraine and magnesium: Eleven neglected connections,” Perspect. Biol. Med., vol. 31, no. 4, pp. 526–557, 1988.
[5] N. R. Smalheiser and D. R. Swanson, “Using ARROWSMITH: A computer-assisted approach to formulating and assessing scientific hypotheses,” Comput. Methods Programs Biomed., vol. 57, no. 3, pp. 149–153, 1998.
[6] N. R. Smalheiser and D. R. Swanson, “Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease,” Neurosci. Res. Commun., vol. 15, no. 1, pp. 1–9, 1994.
[7] N. R. Smalheiser and D. R. Swanson, “Indomethacin and Alzheimer’s disease,” Neurology, vol. 46, no. 2, p. 583, 1996.
[8] N. R. Smalheiser and D. R. Swanson, “Linking estrogen to Alzheimer’s disease: An informatics approach,” Neurology, vol. 47, pp. 809–810, 1996.
[9] N. R. Smalheiser and D. R. Swanson, “Calcium-independent phospholipase A2 and schizophrenia,” Arch. Gen. Psychiatry, vol. 55, no. 8, pp. 752–753, 1998.
[10] R. Kostoff, M. Briggs, J. Solka, and R. Rushenberg, “Literature-related discovery (LRD): Methodology,” Technol. Forecast. Soc. Change, vol. 75, no. 2, pp. 186–202, 2008.
[11] A. Holzinger, R. Geierhofer, F. Mödritscher, and R. Tatzl, “Semantic Information in Medical Information Systems: Utilization of Text Mining Techniques to Analyze Medical Diagnoses,” J. Univers. Comput. Sci., vol. 14, no. 22, pp. 3781–3795, 2008.
[12] R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa, and L. V. Subramaniam, “Text analytics for life science using the Unstructured Information Management Architecture,” IBM Syst. J., vol. 43, no. 3, pp. 490–515, 2004.
[13] X. Hu, “Mining novel connections from large online digital library using biomedical ontologies,” Libr. Manag., vol. 26, no. 4/5, pp. 261–270, 2005.
[14] P. Srinivasan, “MeSHmap: a text mining tool for MEDLINE,” Proc. AMIA Symp., pp. 642–646, 2001.
[15] J. Demaine, J. Martin, and B. De Bruijn, “Haystacks and Hypotheses,” in Proceedings of the ASIST Annual Meeting, 2003, vol. 40, pp. 59–64.
[16] P. Srinivasan, “Text mining: Generating hypotheses from MEDLINE,” J. Am. Soc. Inf. Sci. Technol., vol. 55, no. 5, pp. 396–413, 2004.
[17] J. R. Katukuri, Y. Xie, V. V. Raghavan, and A. Gupta, “Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks,” BMC Genomics, vol. 13, no. Suppl 3, p. S5, 2012.
[18] A. Z. Ijaz, M. Song, and D. Lee, “MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge,” BMC Bioinformatics, vol. 11, no. Suppl 2, p. S3, 2010.
[19] J. M. Vicente-Gomila, “The contribution of syntactic-semantic approach to the search for complementary literatures for scientific or technical discovery,” Scientometrics, vol. 100, no. 3, pp. 659–673, 2014.
[20] M. Weeber, H. Klein, A. R. Aronson, J. G. Mork, L. T. de Jong-van den Berg, and R. Vos, “Text-based discovery in biomedicine: the architecture of the DAD-system,” Proc. AMIA Symp., pp. 903–907, 2000.
[21] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, “Using Concepts in Literature-Based Discovery: Simulating Swanson’s Raynaud–Fish Oil and Migraine–Magnesium Discoveries,” J. Am. Soc. Inf. Sci. Technol., vol. 52, no. 7, pp. 548–557, 2001.
[22] K. M. Hettne, M. Weeber, M. L. Laine, H. Ten Cate, S. Boyer, J. A. Kors, and B. G. Loos, “Automatic mining of the literature to generate new hypotheses for the possible link between periodontitis and atherosclerosis: Lipopolysaccharide as a case study,” J. Clin. Periodontol., vol. 34, no. 12, pp. 1016–1024, 2007.
[23] M. D. Gordon and S. Dumais, “Using latent semantic indexing for literature-based discovery,” J. Am. Soc. Inf. Sci. Technol., vol. 49, pp. 674–685, 1998.
[24] W. D. Maciel, A. C. Faria-Campos, M. A. Gonçalves, and S. V. Campos, “Can the vector space model be used to identify biological entity activities?,” BMC Genomics, vol. 12, no. Suppl 4, p. S1, 2011.
[25] I. N. Sarkar, “A vector space model approach to identify genetically related diseases,” J. Am. Med. Informatics Assoc., vol. 19, no. 2, pp. 249–254, 2012.
[26] S. Lee, J. Choi, K. Park, M. Song, and D. Lee, “Discovering context-specific relationships from biological literature by using multi-level context terms,” BMC Med. Inform. Decis. Mak., vol. 12, no. Suppl 1, p. S1, 2012.
[27] T. Cohen, D. Widdows, R. W. Schvaneveldt, P. Davies, and T. C. Rindflesch, “Discovering discovery patterns with predication-based Semantic Indexing,” J. Biomed. Inform., vol. 45, no. 6, pp. 1049–1065, 2012.
[28] R. K. Lindsay, “Literature-based discovery by lexical statistics,” J. Am. Soc. Inf. Sci., vol. 50, no. 7, pp. 574–587, 1999.
[29] J. D. Wren, “Using fuzzy set theory and scale-free network properties to relate MEDLINE terms,” Soft Comput., vol. 10, no. 4, pp. 374–381, 2006.
[30] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, “Knowledge discovery by automated identification and ranking of implicit relationships,” Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.
[31] J. Atkinson and A. Rivas, “Discovering novel causal patterns from biomedical natural-language texts using Bayesian nets,” IEEE Trans. Inf. Technol. Biomed., vol. 12, no. 6, pp. 714–722, 2008.
[32] T. Miyanishi, K. Seki, and K. Uehara, “Hypothesis generation and ranking based on event similarities,” Proc. 2010 ACM Symp. Appl. Comput. - SAC ’10, p. 1552, 2010.
[33] V. I. Torvik and N. R. Smalheiser, “A quantitative model for linking two disparate sets of articles in MEDLINE,” Bioinformatics, vol. 23, no. 13, pp. 1658–1665, 2007.
[34] Y. Tsuruoka, M. Miwa, K. Hamamoto, J. Tsujii, and S. Ananiadou, “Discovering and visualizing indirect associations between biomedical concepts,” Bioinformatics, vol. 27, no. 13, pp. 111–119, 2011.
[35] S. Spangler and A. Wilkins, “Automated hypothesis generation based on mining scientific literature,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 1877–1886.
[36] M. S. Hossain, J. Gresock, Y. Edmonds, R. Helm, M. Potts, and N. Ramakrishnan, “Connecting the dots between PubMed abstracts,” PLoS One, vol. 7, no. 1, 2012.
[37] M. Weeber, “Drug discovery as an example of literature-based discovery,” Comput. Discov. Sci. Knowl., pp. 290–306, 2007.
[38] R. Frijters, M. van Vugt, R. Smeets, R. van Schaik, J. de Vlieg, and W. Alkema, “Literature mining for the discovery of hidden connections between drugs, genes and diseases,” PLoS Comput. Biol., vol. 6, no. 9, pp. 1–11, 2010.
[39] D. Hristovski, B. Peterlin, J. A. Mitchell, and S. M. Humphrey, “Using literature-based discovery to identify disease candidate genes,” Int. J. Med. Inform., vol. 74, no. 2–4, pp. 289–298, 2005.
[40] D. Hristovski, B. Peterlin, and S. Dzeroski, “Literature-based Discovery Support System and Its Application to Disease Gene Identification,” Proc. AMIA Symp., p. 928, 2001.
[41] J. S. Wu, E. F. Kao, and C. N. Lee, “Discovering hidden connections among diseases, genes and drugs based on microarray expression profiles with negative-term filtering,” PLoS One, vol. 9, no. 6, 2014.
[42] R. N. Kostoff, J. L. Solka, R. L. Rushenberg, and J. A. Wyatt, “Literature-related discovery (LRD): Water purification,” Technol. Forecast. Soc. Change, vol. 75, no. 2, pp. 256–275, 2008.
[43] V. Ittipanuvat, K. Fujita, I. Sakata, and Y. Kajikawa, “Finding linkage between technology and social issue: A Literature Based Discovery approach,” J. Eng. Technol. Manag. - JET-M, vol. 32, pp. 160–184, 2014.
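
Appendix: a minimal sketch of the ABC method

The ABC method from the slides (A->B + B->C = A->C) can be illustrated with a small term co-occurrence example in Python. This is only a sketch under stated assumptions, not ARROWSMITH or any published system: each document is reduced to a set of terms, two terms count as linked if they appear in the same document, and candidates are ranked by the number of linking B terms. The function names (build_index, neighbours, open_discovery, closed_discovery) and the five toy documents are hypothetical, invented for illustration, and are not real literature data.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document positions that mention it."""
    index = defaultdict(set)
    for i, terms in enumerate(docs):
        for term in terms:
            index[term].add(i)
    return index


def neighbours(docs, index, term):
    """Return every other term that shares at least one document with `term`."""
    found = set()
    for i in index.get(term, ()):
        found |= docs[i]
    found.discard(term)
    return found


def open_discovery(docs, a_term):
    """Open discovery: start from A, collect B terms, then propose C terms.

    A C term is kept only if it never co-occurs with A directly, so the
    A->C link is implied by the literature but not stated in it.
    """
    docs = [set(d) for d in docs]
    index = build_index(docs)
    b_terms = neighbours(docs, index, a_term)
    candidates = defaultdict(set)
    for b in b_terms:
        for c in neighbours(docs, index, b):
            # Skip A itself and anything already directly linked to A.
            if c != a_term and c not in b_terms:
                candidates[c].add(b)
    # Rank by number of distinct linking B terms, ties broken alphabetically.
    ranked = sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(c, sorted(bs)) for c, bs in ranked]


def closed_discovery(docs, a_term, c_term):
    """Closed discovery: given A and C, list the B terms linked to both."""
    docs = [set(d) for d in docs]
    index = build_index(docs)
    shared = neighbours(docs, index, a_term) & neighbours(docs, index, c_term)
    shared -= {a_term, c_term}
    return sorted(shared)


# Hypothetical toy corpus (invented for illustration, not real data).
TOY_DOCS = [
    {"migraine", "serotonin"},
    {"migraine", "platelet aggregation"},
    {"magnesium", "serotonin"},
    {"magnesium", "platelet aggregation"},
    {"serotonin", "sleep"},
]

if __name__ == "__main__":
    for c_term, b_links in open_discovery(TOY_DOCS, "migraine"):
        print(c_term, "via", b_links)
    print(closed_discovery(TOY_DOCS, "migraine", "magnesium"))
```

Open discovery starts from A alone and proposes C terms; closed discovery starts from a given A and C and lists the B terms that connect them. Raw co-occurrence is the simplest possible link test; the approaches listed under Hypothesis Generation replace it with controlled vocabularies, extracted semantic relations, and more careful ranking.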