Research Survey: Automatic Hypothesis Generation using Literature-Based Discovery
Kathleen Padova, June 2015
Undiscovered Public Knowledge
 Don R. Swanson (1924 – 2012)
 Concern over increased domain specialization [1]
 Focused on computer-aided information retrieval
 Uncovering unseen links between two distinct areas of study (disjoint literatures)
could yield new discoveries, which he called “undiscovered public knowledge”
 Pioneered the field of literature-based discovery (LBD)
ABCs of Literature-based Discovery
 ABC Method: A->B + B->C = A->C ?
 Open vs. Closed
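A minimal sketch of the open-discovery form of the ABC method, assuming each article is reduced to a set of concepts and that "A->B" simply means the two concepts appear in the same article. The toy records and the names `cooccurrence` and `open_discovery` are illustrative assumptions, not Swanson's actual procedure or data.

```python
from collections import defaultdict

# Toy "literature": each record is the set of concepts mentioned in one article.
# These records are made up for illustration; they are not real MEDLINE data.
ARTICLES = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud's syndrome"},
    {"platelet aggregation", "raynaud's syndrome"},
    {"fish oil", "triglycerides"},
]

def cooccurrence(articles):
    """Map each concept to the set of concepts it shares an article with."""
    links = defaultdict(set)
    for concepts in articles:
        for concept in concepts:
            links[concept] |= concepts - {concept}
    return links

def open_discovery(a, articles):
    """Return {C: sorted linking B terms} for each C reachable from A only via some B."""
    links = cooccurrence(articles)
    direct = set(links[a])  # B terms: concepts already linked to A
    candidates = defaultdict(set)
    for b in direct:
        for c in links[b]:
            # Keep C only if it is not A and has no direct A-C article.
            if c != a and c not in direct:
                candidates[c].add(b)
    return {c: sorted(bs) for c, bs in candidates.items()}

hypotheses = open_discovery("fish oil", ARTICLES)
print(hypotheses)
```

Open discovery starts from A alone and searches outward for candidate C terms; closed discovery starts with both A and C fixed and looks only for the B terms that connect them.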
“Fish Oil and Raynaud’s Disease”
 Swanson hypothesized a connection between
dietary fish oil and Raynaud’s Syndrome
 1986 - Two papers on the same topic:
 Library Quarterly [1]
 Perspectives in Biology and Medicine [2]
 Validated 3 years later in clinical trials [3]
 1988 - Magnesium Deficiency and Migraines, also
supported later by clinical trials [4]
Search Process
[Figure: the migraine literature and the magnesium literature, linked through intermediate topics: vascular reactivity, spreading depression, calcium channel blockers, inflammation, prostaglandins, platelet aggregation, serotonin, cerebral anoxia, epilepsy. Image reproduced from [4]]
ARROWSMITH

 Previous LBD studies were “partially systematic”
 Swanson joined with Neil R. Smalheiser, Department of Psychiatry, University of Illinois
 Together they developed a set of interactive software and database search
strategies to facilitate discovery [5]
 Available: http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi
 Applications:
 Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic
disease. Neuroscience Research Communications [6]
 Indomethacin and Alzheimer's Disease. Neurology [7]
 Linking Estrogen to Alzheimer's Disease: An informatics approach. Neurology [8]
 Calcium-independent phospholipase A2 and schizophrenia. Archives of General
Psychiatry [9]
ARROWSMITH
[Figure: source literature C and target literature A connected through intermediate literatures B1 to B4, with overlaps AB1 to AB4 on the target side and B1C to B4C on the source side. Image reproduced from [5]]
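A rough sketch of the two-literature (closed discovery) idea in the figure: given a set of titles for literature A and a set for literature C, list the terms they share as candidate B links. The tokenizer, stopword list, sample titles, and function names are assumptions made for illustration; this is not the ARROWSMITH implementation.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"and", "in", "of", "the", "on", "a", "with", "by", "for"}

def title_terms(titles):
    """Collect the lowercase, non-stopword words used across a set of titles."""
    terms = set()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in STOPWORDS:
                terms.add(word)
    return terms

def closed_discovery(a_titles, c_titles, a_query, c_query):
    """Return candidate B terms: words shared by both literatures, minus the query terms."""
    shared = title_terms(a_titles) & title_terms(c_titles)
    return sorted(shared - set(a_query) - set(c_query))

# Made-up titles standing in for the results of two literature searches.
A_TITLES = [
    "Magnesium and serotonin release",
    "Magnesium deficiency and platelet aggregation",
]
C_TITLES = [
    "Serotonin in migraine",
    "Platelet aggregation in migraine patients",
]

b_terms = closed_discovery(A_TITLES, C_TITLES, {"magnesium"}, {"migraine"})
print(b_terms)
```

In practice the resulting B list is long and noisy, so it still has to be filtered and reviewed by a person.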
Basic LBD methodology
 Information/Entity Retrieval/Extraction
 Query the literature for areas of interest
 Identify key concepts/terms
 Characterize the literature
 Hypothesis Generation
 Find connections between retrieved literatures
 Evaluation / Vetting
 Review connections for novelty, feasibility
Information/Entity Retrieval/Extraction
 Swanson’s method was very manual
 Query formulation is essential [10]
 As the literature grew, search results became overwhelming
 Further research attempted to resolve this by use of:
 Controlled Vocabularies [11][12]
 Established Subject Headings/Ontologies (e.g. MeSH) [13][14][15]
 Text mining techniques incl. ranking, weighting, clustering [16]
 Information modeling [17][18]
 Relationship Extraction [19]
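As one example of term weighting, the sketch below scores terms with plain TF-IDF so that words common to every retrieved record drop out and distinctive ones rise. TF-IDF is a generic stand-in chosen here for illustration; the cited works may use different weighting schemes. The tokenized records and function names are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

def top_terms(docs, k=1):
    """Return the k highest-weighted terms per document (ties broken alphabetically)."""
    return [sorted(w, key=lambda t: (-w[t], t))[:k] for w in tf_idf(docs)]

# Made-up, pre-tokenized records standing in for retrieved abstracts.
DOCS = [
    ["migraine", "serotonin", "serotonin", "patients"],
    ["migraine", "aura", "patients"],
    ["magnesium", "magnesium", "patients", "serotonin"],
]

key_terms = top_terms(DOCS)
print(key_terms)
```

A term such as "patients", which occurs in every record, receives a weight of zero and so never surfaces as a key concept.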
Hypothesis Generation
 Early LBD relied on manually creating a second query and
on co-occurrence of query terms
 Later research improvements include:
 Semantics/NLP to extract relationships [20][21][22]
 Latent Semantic Indexing [23]
 Vector Space Modeling [24][25][26][27]
 Lexical Statistics [28]
 Fuzzy Set Theory [29][30]
 Bayesian Nets [31]
 Ranking, weighting [14][32][33]
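To illustrate the vector space idea, the sketch below represents each concept by counts of the intermediate terms it co-occurs with and ranks other concepts by cosine similarity. The vectors are hypothetical and the function names are assumptions; the cited systems build and weight their vectors in their own ways.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical context vectors: how often each concept co-occurs with a B term.
VECTORS = {
    "magnesium": {"serotonin": 3, "platelet aggregation": 2, "epilepsy": 1},
    "migraine": {"serotonin": 4, "platelet aggregation": 1, "aura": 2},
    "water hardness": {"calcium": 5, "scaling": 2},
}

def rank_candidates(source, vectors):
    """Rank every other concept by similarity of its context vector to the source's."""
    scores = [
        (other, cosine(vectors[source], vec))
        for other, vec in vectors.items()
        if other != source
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = rank_candidates("magnesium", VECTORS)
print(ranked)
```

Concepts that share many intermediate terms with the source score high even if they never co-occur with it directly, which is what makes the ranking useful for proposing A->C links.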
Hypothesis Evaluation / Vetting
 Dependent on domain expert review
 Results are typically presented as lists and tables, which are slow to review
 Later LBD methods include ranking and, more recently,
visualizations [34][35]
 A domain expert is still required, but evaluation is easier
Beyond ABC
 Discovery patterns [27]
 Multiple intermediary steps [36]
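One simple way to picture multiple intermediary steps is as a path search over a concept graph, where A reaches C through a chain of B terms rather than a single one. The sketch below uses breadth-first search on a made-up graph; it is an illustrative assumption, not the method of [36].

```python
from collections import deque

def shortest_chain(graph, start, goal):
    """Breadth-first search for the shortest chain of linked concepts from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sort neighbours so the result is deterministic.
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # No chain connects the two concepts.

# Hypothetical concept graph: edges mean "co-mentioned in at least one article".
GRAPH = {
    "A": {"B1"},
    "B1": {"A", "B2"},
    "B2": {"B1", "C"},
    "C": {"B2"},
}

chain = shortest_chain(GRAPH, "A", "C")
print(chain)
```

Longer chains multiply the number of candidate hypotheses, so ranking and expert review matter even more here than in the single-B case.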
Applications
 LBD is seen largely in the biomedical sciences (mining
MEDLINE, PubMed, newer gene databases)
 Drug repositioning [37]
 Drug-Disease linkages [38]
 Gene-Disease linkages [39][40][41]
 A few non-medical science applications
 Water purification [42]
 Technology and social issues [43]
References
[1] D. R. Swanson, “Undiscovered Public Knowledge,” Libr. Q., vol. 56, no. 2, pp. 103–118, 1986.
[2] D. R. Swanson, “Fish oil, Raynaud’s syndrome, and undiscovered public knowledge,” Perspect. Biol. Med., vol. 30, no. 1,
pp. 7–18, 1986.
[3] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, “Fish-oil dietary supplementation in patients with
Raynaud’s phenomenon: a double-blind, controlled, prospective study,” 1989.
[4] D. R. Swanson, “Migraine and magnesium: Eleven neglected connections,” Perspect. Biol. Med., vol. 31,
no. 4, pp. 526–557, 1988.
[5] N. R. Smalheiser and D. R. Swanson, “Using ARROWSMITH: A computer-assisted approach to formulating and assessing
scientific hypotheses,” Comput. Methods Programs Biomed., vol. 57, no. 3, pp. 149–153, 1998.
[6] N. R. Smalheiser and D. R. Swanson, “Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic
disease,” Neurosci. Res. Commun., vol. 15, no. 1, pp. 1–9, 1994.
[7] N. R. Smalheiser and D. R. Swanson, “Indomethacin and Alzheimer’s disease,” Neurology, vol. 46, no. 2, p. 583, Feb. 1996.
[8] N. R. Smalheiser and D. R. Swanson, “Linking estrogen to Alzheimer’s disease: An informatics approach,”
Neurology, vol. 47, pp. 809–810, 1996.
[9] N. R. Smalheiser and D. R. Swanson, “Calcium-independent phospholipase A2 and schizophrenia,” Arch. Gen. Psychiatry,
vol. 55, no. 8, pp. 752–753, 1998.
[10] R. Kostoff, M. Briggs, J. Solka, and R. Rushenberg, “Literature-related discovery (LRD): Methodology,” Technol.
Forecast. Soc. Change, vol. 75, no. 2, pp. 186–202, 2008.
[11] A. Holzinger, R. Geierhofer, F. Mödritscher, and R. Tatzl, “Semantic Information in Medical Information Systems: Utilization
of Text Mining Techniques to Analyze Medical Diagnoses,” J. Univers. Comput. Sci., vol. 14, no. 22, pp. 3781–3795, 2008.
[12] R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa,
and L. V. Subramaniam, “Text analytics for life science using the Unstructured Information Management Architecture,” IBM
Syst. J., vol. 43, no. 3, pp. 490–515, 2004.
[13] X. Hu, “Mining novel connections from large online digital library using biomedical ontologies,” Libr. Manag., vol. 26, no.
4/5, pp. 261–270, 2005.
[14] P. Srinivasan, “MeSHmap: a text mining tool for MEDLINE,” Proc. AMIA Symp., pp. 642–646, 2001.
[15] J. Demaine, J. Martin, and B. De Bruijn, “Haystacks and Hypotheses,” in Proceedings of the ASIST Annual
Meeting, 2003, vol. 40, pp. 59–64.
[16] P. Srinivasan, “Text mining: Generating hypotheses from MEDLINE,” J. Am. Soc. Inf. Sci. Technol., vol. 55, no. 5, pp. 396–
413, 2004.
[17] J. R. Katukuri, Y. Xie, V. V. Raghavan, and A. Gupta, “Hypotheses generation as supervised link discovery with automated
class labeling on large-scale biomedical concept networks,” BMC Genomics, vol. 13, no. Suppl 3, p. S5, 2012.
[18] A. Z. Ijaz, M. Song, and D. Lee, “MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public
knowledge,” BMC Bioinformatics, vol. 11, no. Suppl 2, p. S3, 2010.
[19] J. M. Vicente-Gomila, “The contribution of syntactic-semantic approach to the search for complementary literatures for
scientific or technical discovery,” Scientometrics, vol. 100, no. 3, pp. 659–673, 2014.
[20] M. Weeber, H. Klein, A. R. Aronson, J. G. Mork, L. T. de Jong-van den Berg, and R. Vos, “Text-based discovery in
biomedicine: the architecture of the DAD-system,” Proc. AMIA Symp., pp. 903–907, 2000.
[21] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, “Using Concepts in Literature-Based
Discovery: Simulating Swanson’s Raynaud–Fish Oil and Migraine–Magnesium Discoveries,” J. Am. Soc. Inf. Sci.
Technol., vol. 52, no. 7, pp. 548–557, 2001.
[22] K. M. Hettne, M. Weeber, M. L. Laine, H. Ten Cate, S. Boyer, J. A. Kors, and B. G. Loos, “Automatic mining of the literature
to generate new hypotheses for the possible link between periodontitis and atherosclerosis: Lipopolysaccharide as a case
study,” J. Clin. Periodontol., vol. 34, no. 12, pp. 1016–1024, 2007.
[23] M. D. Gordon and S. Dumais, “Using latent semantic indexing for literature-based discovery,” J.
Am. Soc. Inf. Sci. Technol., vol. 49, pp. 674–685, 1998.
[24] W. D. Maciel, A. C. Faria-Campos, M. A. Gonçalves, and S. V. Campos, “Can the vector space model be used to identify
biological entity activities?,” BMC Genomics, vol. 12, no. Suppl 4, p. S1, 2011.
[25] I. N. Sarkar, “A vector space model approach to identify genetically related diseases,” J. Am. Med. Informatics Assoc., vol.
19, no. 2, pp. 249–254, 2012.
[26] S. Lee, J. Choi, K. Park, M. Song, and D. Lee, “Discovering context-specific relationships from biological literature by using
multi-level context terms,” BMC Med. Inform. Decis. Mak., vol. 12, no. Suppl 1, p. S1, 2012.
[27] T. Cohen, D. Widdows, R. W. Schvaneveldt, P. Davies, and T. C. Rindflesch, “Discovering discovery patterns with
predication-based Semantic Indexing,” J. Biomed. Inform., vol. 45, no. 6, pp. 1049–1065, 2012.
[28] R. K. Lindsay, “Literature-based discovery by lexical statistics,” J. Am. Soc. Inf. Sci., vol. 50, no. 7, pp. 574–587, 1999.
[29] J. D. Wren, “Using fuzzy set theory and scale-free network properties to relate MEDLINE terms,” Soft Comput., vol. 10, no.
4, pp. 374–381, 2006.
[30] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, “Knowledge discovery by automated
identification and ranking of implicit relationships,” Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.
[31] J. Atkinson and A. Rivas, “Discovering novel causal patterns from biomedical natural-language texts using Bayesian nets,”
IEEE Trans. Inf. Technol. Biomed., vol. 12, no. 6, pp. 714–722, 2008.
[32] T. Miyanishi, K. Seki, and K. Uehara, “Hypothesis generation and ranking based on event similarities,” Proc. 2010 ACM
Symp. Appl. Comput. - SAC ’10, p. 1552, 2010.
[33] V. I. Torvik and N. R. Smalheiser, “A quantitative model for linking two disparate sets of articles in MEDLINE,”
Bioinformatics, vol. 23, no. 13, pp. 1658–1665, 2007.
[34] Y. Tsuruoka, M. Miwa, K. Hamamoto, J. Tsujii, and S. Ananiadou, “Discovering and visualizing indirect associations
between biomedical concepts,” Bioinformatics, vol. 27, no. 13, pp. 111–119, 2011.
[35] S. Spangler and A. Wilkins, “Automated hypothesis generation based on mining scientific literature,” in Proceedings of the
20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 1877–1886.
[36] M. S. Hossain, J. Gresock, Y. Edmonds, R. Helm, M. Potts, and N. Ramakrishnan, “Connecting the dots between PubMed
abstracts,” PLoS One, vol. 7, no. 1, 2012.
[37] M. Weeber, “Drug discovery as an example of literature-based discovery,” Comput. Discov. Sci. Knowl., pp. 290–306,
2007.
[38] R. Frijters, M. van Vugt, R. Smeets, R. van Schaik, J. de Vlieg, and W. Alkema, “Literature mining for the discovery of
hidden connections between drugs, genes and diseases,” PLoS Comput. Biol., vol. 6, no. 9, pp. 1–11, 2010.
[39] D. Hristovski, B. Peterlin, J. A. Mitchell, and S. M. Humphrey, “Using literature-based discovery to identify disease
candidate genes,” Int. J. Med. Inform., vol. 74, no. 2–4, pp. 289–298, 2005.
[40] D. Hristovski, B. Peterlin, and S. Dzeroski, “Literature-based Discovery Support System and Its Application to Disease
Gene Identification,” Proc. AMIA Symp., p. 928, 2001.
[41] J. S. Wu, E. F. Kao, and C. N. Lee, “Discovering hidden connections among diseases, genes and drugs based on
microarray expression profiles with negative-term filtering,” PLoS One, vol. 9, no. 6, 2014.
[42] R. N. Kostoff, J. L. Solka, R. L. Rushenberg, and J. A. Wyatt, “Literature-related discovery (LRD): Water purification,”
Technol. Forecast. Soc. Change, vol. 75, no. 2, pp. 256–275, 2008.
[43] V. Ittipanuvat, K. Fujita, I. Sakata, and Y. Kajikawa, “Finding linkage between technology and social issue: A Literature
Based Discovery approach,” J. Eng. Technol. Manag. - JET-M, vol. 32, pp. 160–184, 2014.