Measuring Scholarly Impact and Beyond
Ying Ding
Indiana University
dingying@indiana.edu
http://info.slis.indiana.edu/~dingying/index.html

Outline
• Assessing the credibility of researchers and institutions
• Identifying significant scientific advancement/emerging trends
• Other Research Analytics
• Tool: Data2Knowledge Platform
• Next Steps

Assessing the credibility of researchers and institutions

Productivity
[Figures: Top Journals; Top Researchers]
Measuring Scholarly Impact in the field of Semantic Web
Data: 4,157 papers with 651,673 citations from Scopus (1975-2009), and 22,951 papers with 571,911 citations from WOS (1960-2009)

Impact through citation
[Figures: Top Journals; Top Researchers]

Rising Stars
• In WOS, M. A. Harris (Gene Ontology-related research), T. Harris (design and implementation of programming languages) and L. Ding (Swoogle – Semantic Web Search Engine) are the three authors with the largest increase in citations.
• In Scopus, D. Roman (Semantic Web Services), J. De Bruijn (logic programming) and L. Ding (Swoogle) are the top three by increase in number of citations.
Ding, Y. (2010). Semantic Web: Who is Who in the field, Journal of Information Science, 36(3): 335-356.

Popular vs. Prestigious
Data: 15,370 IR papers with 341,871 cited references from 1956 to 2008.
[Figure: Academic Career Peak]
Ding, Y., & Cronin, B. (2011). Popular and/or Prestigious? Measures of Scholarly Esteem, Information Processing and Management, 47(1), 80-96.

PageRank and Weighted PageRank
$$PR_W(p) = (1-d)\,\frac{W(p)}{\sum_{i=1}^{N} W(p_i)} + d \sum_{i=1}^{k} \frac{PR_W(p_i)}{C(p_i)}$$
Here d is the damping factor, W(p) is the weight assigned to node p (normalized over all N nodes), and the second sum runs over the k nodes p_i that link to p, each with C(p_i) outgoing links.
• The weighted PageRanks bring finer granularity to ranking experts under various situations by including different contextual information as weight vectors in the PageRank algorithm.
– with an author’s total publications as the weight vector, PageRank can calculate a contextualized ranking reflecting the scholar’s productivity;
– with an author’s expertise as the weight vector, PageRank can calculate a contextualized ranking reflecting the scholar’s domain knowledge and research interests.
Ding, Y. (2011). Applying weighted PageRank to author citation network. Journal of the American Society for Information Science and Technology, 62(2), 236-245.

PageRank and other ranks
Columns 0.05 to 0.95 give the author's PageRank rank at that damping factor; the remaining columns give the rank by citation count and by degree, betweenness, and closeness centrality.

Author             0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95  Citation  Degree  Betweenness  Closeness
SALTON G              1    1    1    1    1    1    1    1    1    1         1       1            1          1
ROBERTSON SE          3    2    2    2    2    2    2    2    2    2         2      11           21         10
ABITEBOUL S           2    3    3    3    3    4    4    5    8   13         3       4            7         71
BELKIN NJ             4    4    4    4    4    3    3    3    3    3         4      22           24         20
VANRIJSBERGEN CJ      5    5    5    5    5    5    5    4    4    4         5       5           13          4
RUI Y                 6    6    6    6    7    8    9    9   10   15         6      24           31         23
SARACEVIC T           7    8    7    7    6    6    6    6    5    5         7      25           51         25
CROFT WB              9    9    9    9    9    7    7    7    6    6         8      15           14         14
SPINK A              13   13   13   12   12   11   11   11    9    8         9      57           93         57
JONES KS             12   10   10   10   10   10    8    8    7    7        10      17           30         18
SMITH JR              8    7    8    8    8    9   10   10   11   16        11      38           32         35
FALOUTSOS C          10   11   11   13   13   13   15   16   19   26        12      12            2         11
HARMAN D             11   12   12   11   11   12   12   12   12   11        13       3           14          3
VOORHEES EM          43   16   16   17   17   17   17   18   17   17        14      20           38         19
FLICKNER M           18   18   19   20   20   21   21   24   25   34        15      16           11         15
BATES MJ             15   17   17   15   16   16   16   15   15   12        16      78           81         78
CODD EF              14   14   14   16   18   19   20   22   28   38        17      74           23         74
BAEZAYATES R         34   43   45   48   47   50   53   54   55   55        18       6           16          6
FUHR N               19   19   20   19   19   18   18   19   18   21        19       2            9          2
JAIN AK              42   34   40   39   39   40   40   42   45   49        20      39           33         38

Ding, Y., Yan, E., Frazho, A., & Caverlee, J. (2009). PageRank for ranking authors in co-citation networks. Journal of the American Society for Information Science and Technology, 60(11), 2229-2243.

Scatterplots

Topic-based Rank (Productivity Rank)
Online IR/Web IR 1956-1980 1981-1990 1991-2000 2001-2008 D.T. Hawkins, N.A. Stokolova, E. Eisenbach, K. Yamanaka, T. Radecki, R. Fugmann, J. Eyre, D.H. Kraft, Z. Mazur, K. Hosono P. Willett, S.P. Harter, C. Batt, D.
Ellis, M. Keen, S.E. Hocker, L. Bronars, P.G. Enser, S. Stigleman, B. Vickery W.B. Corft, H.C. Chen, W. Umstatter, C.A. Lynch, P. Martin, D. Samson, N.J. Santora, C. Womserhacker, N.J. Belkin, R. Wagnerdobler J. Han, D. Suciu, H.P. Kriegel, S.Y. Su, K.L. Tan, G. Graefe, L. Wong, L. Libkin, J.W. Su, P.Z. Revesz M. Thelwall, C.C. Yang, A. Spink, P. Jacso, I. Fourie, H.C. Chen, N. Ford, H. Xie, G.G. Chowdhury, B. Hjorland Database and Query Processing Evaluation Medical IR Multimedia IR G. Salton, A.G. Pickford, W. Goffman, E. Garfield, G.K. Thompson, W.S. Cooper, K. Janda, F.W. Lancaster, R. Fugmann, P. Willett S.J. Martinez, M.G. Manzone, C.M. Bowman, F.A. Landee, J. Frome, I. Berghans, S.L. Visser, H. Skolnik, Y.J. Lee, T.K.S. Engar D.W. Stemple, R.H. Guting, A. Sernadas, C. Katzeff, S.Y. Su, W. Perrizo, J.S. Davis, C.T. Yu, B.S. Goldshteyn, I.A. Macleod C.L. Borgman, T. Radecki, G. Salton, W.B. Croft, J.S. Ro, J. Panyr, D.C. Blair, M.E. Maron, P. Thompson, C.A. Lynch J.Z. Li, F. Bry, H.J. Kim, D. Papadias, K. Subieta, J. Van den Bussche, D. Taniar, F. Geerts, M. Song, Y.D. Chung A. Spink, R.M. Losee, E. Levine, C. Cole, P. Willett, W.R. Hersh, C.T. Meadow, B. Hjorland, E. Garfield, T. Cawkell S.G. Aiken, I. Soutar, S. Barcza, C.C. Tsai, W. Hersh, S.J. Westerman, H.H. Emurian, L.L. Consaul, H.J. Markowitsch, D. Roberts H.C. Chen, F. Crestani, A.K. Jain, E. Wilhelm, J.I. Khan, B.S. Manjunath, H.K. Kim, H.M. Wang, S.F. Chang, S. Levialdi R.N. Kostoff, U.J. Balis, G. Eysenbach, R.B. Haynes, G. Nilsson, H. Shatkay, N.L. Wilczynski, C.R. Shyu, J.I. Westbrook, G.O. Babnett T.S. Huang, H.J. Zhang, G.J. Lu, J. Li, C.C. Chang, E. Izquierdo, J. Lassksonen, H. Burkhardt, C.J. Liu, D. 
Ziou Using Author-Topic Modeling Algorithm to rank author based on different topics (Productivity Rank) Data: Information Retrieval articles from 1956 to 2008 (15,367 papers with 350,750 citations) Topic-based PageRank (Citation Rank) Online IR/Web IR 1956-1980 I_PR PR_t(.85) PR_t(.5) PR_t(.15) D.T. Hawkins, N.A. Stokolova, R.K. Summit, M.E. Williams, T. Radecki, A. Macleodi, T. Saracevic, R.S. Marcus, R. Fugmann, C.T. Yu G. Salton, D.T. Hawkins, M.E. Williams, R.K. Summit, F.W. Lancaster, A. Kent, N.A. Stokolova, R. Fugmann, C.W. Cleverdon, W.S. Cooper D.T. Hawkins, N.A. Stokolova, R. Fugmann, T. Radecki, G. Salton, R.K. Summit, I.A. Macleod, J. Farradane, M.E. Williams, A.M. Rees 1981-1990 S.E. Robertson, D. Ellis, P. Willett, P. Ingwersen, B.C. Vickery, A.S. Pollitt, D.H. Kraft, H.M. Brooks, A.F. Smeaton, E.A. Fox N.J. Belkin, W.B. Frakes, T. Imielinski, G.W. Furnas, T. Catarci, T. Kohonen, R. Agrawal, S.K. Chang, H.C. Chen, P. Valduriez A. Spink, T. Saracevic, B. Hjorland, S.E. Roberston, B.J. Jansen, N.J. Belkin, E.M. Voorhees, W.R. Hersh, P. Ingwersen, P. Vakkari G. Salton, A. Kent, M.E. Williams, F.W. Lancaseter, R.K. Summit, D.T. Hawkins, C.W. Cleverdon, D.B. Mccarn, W.S. Cooper, H. Martint, C.P. Bourne G. Salton, A. Bookstein, S.E. Robertson, T. Radecki, W.B. Croft, C.J. Vanrijsbergen, C.T. Yu, W.S. Coopwer, P. Willett, K.S. Jones G. Salton, N.J. Belkin, S.E. Robertson, S. Abiteboul, T. Saracevic, C.J. Vanrijsbergen, W.B. Croft, M.J. Bates, K.S. Jones, D. Harman G. Salton, A. Spink, N.J. Belkin, T. Saracevic, S.E. Roberston, Y. Rui, E.M. Voorhees, B.J. Jansen, J.R. Smith, K.S. Jones G. Salton, P. Willett, S.E. Robertson, A. Bookstein, S.P. Harter, W.B. Croft, T. Radecki, C.J. Vanrijsbergen, C.T. Yu, D. Ellis G. Salton, N.J. Belkin, S. Abiteboul, S.E. Robertson, S.K. Chang, T. Saracevic, H.C. Chen, C.J. Vanrijsbergen, W.B. Croft, M.J. Bates A. Spink, T. Saracevic, G. Salton, H.C. Chen, B.J. Jansen, B. Hjorland, N.J. Belkin, S.E. Robertson, P. Vakkari, E.M. 
Voorhees P. Willett, S.P. Harter, D. Ellis, S.E. Robertson, G. Salton, A.F. Smeaton, P. Ingwersen, B.C. Vickery, M.J. Bates, A. Bookstein H.C. Chen, N.J. Belkin, G. Salton, S.K. Chang, N. Fuhr, T. Saracevic, S.K.M. Wong, M.J. Bates, S. Abiteboul, T. Catarci 1991-2000 2001-2008 A. Spink, H.C. Chen, B. Hjorland, T. Saracevic, B.J. Jansen, P. Vakkari, P. Borlund, S.E. Robertson, F. Crestani, N.J. Belkin
Ding, Y. (2011). Topic-based PageRank for author cocitation networks. Journal of the American Society for Information Science and Technology, 62(3), 449-466.

Citing vs. Mentioning
Data: Full text JASIST journal articles (2000-2011), 866 articles and 32,496 references

Counting methods matter!
[Figure: scatterplot of CountX rank (y-axis) against CountOne rank (x-axis); Spearman r=0.589, p<0.01 (2-tailed)]
Ding, Y., Liu, X., Guo, C., & Cronin, B. (2013). The distribution of references across texts: Some implications for citation analysis. Journal of Informetrics, 7(3), 583-592.

Diversity Subgraph
Paths, with the score shown in the "current" column:
Tim Berners-Lee → James Hendler → V. S. Subrahmanian → Laks V. S. Lakshmanan → Jiawei Han    0.0653
Tim Berners-Lee → James Hendler → Qiang Yang → Hongjun Lu → Jiawei Han                       0.0513
Tim Berners-Lee → James Hendler → Qiang Yang → Bing Liu → Jiawei Han                         0.0428
Tim Berners-Lee → James Hendler → Qiang Yang → Ke Wang → Jiawei Han                          0.0428
Tim Berners-Lee → James Hendler → Qiang Yang → Jian Pei → Jiawei Han                         0.0428
The constraint-based subgraph between Jiawei Han and Tim Berners-Lee, with a constraint on James Hendler.
He, B., Ding, Y., Tang, J., Reguramalingam, V., & Bollen, J. (2013). Mining diversity subgraph in multidisciplinary scientific collaboration networks: A meso perspective. Journal of Informetrics, 7(1), 117-128.

Institutions
Data: A total of 50,920 LIS articles written by 42,991 researchers, published from 1955 to 2009
Country
He, B., Ding, Y., Yan, E. (2012). Mining patterns of author orders in scientific publications.
Journal of Informetrics, 6(3), 359-367.

Identifying significant scientific advancement/emerging trends

Mapping the field of IR
Author Co-citation Map in the field of Information Retrieval (1992-1997)
Data: 1,466 IR-related papers were selected from 367 journals, with 44,836 citations.
Ding, Y., Chowdhury, G. and Foo, S. (1999). Mapping Intellectual Structure of Information Retrieval: An Author Cocitation Analysis, 1987-1997, Journal of Information Science, 25(1): 67-78.

Popular Topics in IR
Data: Information Retrieval articles from 1956 to 2008 (15,367 papers with 350,750 citations)
[Table: ten terms per topic and period; cells are listed in extraction order]
Online IR/Web IR 1956-1980 1981-1990 1991-2000 2001-2008
system, online, language, theory, query, computerized, thesaurus, evaluation, semantic, bibliography
online, systems, text, concepts, reference, principles, proceedings, practice, knowledge, services
system, web, knowledge, database, data, query, design, text, management, distributed
web, search, digital, searching, knowledge, system, query, user, model, internet
query, language, query-processing, database, relational, system, distributed, data, database-system, comparison
query, database, databases, data, object-oriented, queries, processing, relational, model, language
query, data, xml, processing, queries, databases, database, efficient, web, querying
systems, document, full-text, model, evaluation, fuzzy, effectiveness, search, user, expert
text, evaluation, systems, searching, search, online, relevance, library, user, hypertext
Database and Query Processing Evaluation
system, document, storage, evaluation, data, automatic, model, relevance, indexing, online
Medical IR
system, data, storage, computerized, chemical, medical, literature, biomedical, evaluation, management
Multimedia IR
database, medical, system, clinical, patient, management, health, identification, automated, optical
database, medical, health, clinical, management, search, design, study, support, knowledge
image, content-based, system, indexing, databases, multimedia, images, visual, video, color
image, content-based, learning, images, relevance, color, feedback, video, semantic, similarity
Ding, Y. (2011). Topic-based PageRank for author cocitation networks. Journal of the American Society for Information Science and Technology, 62(3), 449-466.

Evolution of topics and communities
Li, D., Ding, Y., Sugimoto, C., He, B., Tang, J., Yan, E., Lin, N., Qin, Z. & Dong, T. (2011). Modeling Topic and Community Structure in Social Tagging: the TTR-LDA-Community Model. Journal of the American Society for Information Science and Technology, 62(9), 1849-1866.

Topic evolution in IR
Data: Information retrieval (IR) papers from Scopus for 2001-2007 (2001-2003: 12,194; 2004-2005: 19,145; 2006-2007: 21,423).
Community Matching
Communities and Topics in IR
Yan, E., Ding, Y., Milojevic, S., & Sugimoto, C. R. (2012). Topics in dynamic research communities: An exploratory study for the field of information retrieval. Journal of Informetrics, 6(1), 140-153.

Lead-Lag Analysis
Data: astrophysics papers collected from WoS (166,191) and arXiv (117,913) for the last 20 years (1992-2011).
Lead-Lag Patterns
Hu, B., Dong, X., Zhang, C., Bowman, T., Ding, Y., Yan, E., Milojevic, S., Ni, C., & Lariviere, V. (forthcoming). A lead-lag analysis of the topic evolution patterns for preprints and publications. Journal of the Association for Information Science and Technology.
Topic popularity
Mathematical Measures for Topic Popularity f’(t): Gaining Popularity; Losing Popularity; Regaining Popularity; Duration of Popularity
Most topics in arXiv and WoS affect each other's popularity: when a topic becomes popular in either arXiv or WoS, it becomes popular in the other in less than 4 years.
Very few topics become popular in one channel without becoming popular in the other. Only 10 topics have a lead time between arXiv and WoS longer than 10 years (i.e., the topic becomes popular in arXiv or WoS but is not popular in the other channel in the next 10 years). This work demonstrates that open-access preprints have a stronger growth tendency than traditional printed publications in astrophysics.

Subject Categories

Our Data
• Scopus data
• Journal citation
• 15 years, 5 slices
• 18,500+ journals
• 4+ million links
• 13+ million citations
Yan, E., Ding, Y., Cronin, B., & Leydesdorff, L. (2013). A bird's-eye view of scientific trading: Dependency relations among fields of science. Journal of Informetrics, 7(2), 249-264.

Our Metaphor
[Figure: Exported knowledge (fastest growths)]
[Figure: Export/import ratio (highest ratios)]
[Figure: Source of incoming citations (who cites you)]
[Figure: Source of outgoing citations (whom you cite)]
[Figure: Shortest path length (lowest shortest paths)]
[Figure: Shortest path matrix (2011 data), knowledge source (from) by knowledge destination (to)]
[Figure: Critical knowledge path (2011 data)]

Other Research Analytics

Next Generation of Bibliometrics
• Newly developed methods allow in-depth analysis of scholarly communication
  – Topic modeling (e.g., Latent Dirichlet Allocation)
  – Information Extraction (e.g., Entity Extraction)
  – Social Network Analysis (e.g., Community Detection)
• Big data demonstrates the power of connected data to enable knowledge discovery
  – Structured data
  – Unstructured data
  – Social media data

Bibliometrics and Beyond

Content-based Impact Analysis
• There are two levels:
  – Syntactic level (position)
    • Papers cited in different sections of the articles
    • How many times papers are mentioned in one article
  – Semantic level (semantics)
    • Citation Sentiment
Analysis: sentence level, window-size
    • Concept-based: knowledge concept level (e.g., topic, knowledge unit/entity, or bio-entities)
Ding, Y., Song, M., Wang, X., Zhang, G., Zhai, C., & Chambers, T. (2014). Content-based citation analysis: The next generation of citation analysis. Journal of the American Society for Information Science & Technology, 65(9), 1820-1833.

Semantic Level
• Concepts
  – Topics
  – Major entities in research
    • Bio entities
    • Knowledge unit (e.g., domain theories, well-established algorithms)
• Why not keywords
  – Ambiguous literal words
  – Not normalized
  – But can be a starting point to extract concepts.
  – Concept (keywords, synonyms (students vs. pupils), antonyms (birth vs. death), homonyms (pupil (student) vs. pupil (part of eye)), etc.)
Jeong, Y., Song, M., & Ding, Y. (2014). Content-based Author co-citation analysis. Journal of Informetrics, 8(1), 197-211.

Topic Modeling
• IR papers (1956-2008)
Network sizes as (No. of nodes, No. of edges):
Period                 Coauthorship network   Citation network
1956-1980 (Phase 1)    (930, 4256)            (6054, 11192)
1981-1990 (Phase 2)    (961, 2252)            (5978, 17084)
1991-2000 (Phase 3)    (6650, 24184)          (36411, 171814)
2001-2008 (Phase 4)    (13640, 63140)         (62636, 444203)

Collaboration strength of productive authors within topics (% of collaboration)
Indirect: path <= 6; loose: path > 6
Period       Direct   Indirect   Loose    No Collaboration
1956-1980    0.70%    0.11%      0        99.19%
1981-1990    0.32%    0          0        99.68%
1991-2000    1.54%    8.46%      2.76%    87.24%
2001-2008    1.79%    36.84%     11.05%   50.32%

Collaboration strength of productive authors across topics (% of collaboration)
Period       Direct   Indirect   Loose    No Collaboration
1956-1980    0        0.08%      0        99.92%
1981-1990    0.05%    0.03%      0        99.93%
1991-2000    0        2.01%      2.16%    95.83%
2001-2008    0.13%    29.65%     13.83%   56.40%

Citation strength of productive authors within topics (% of citation)
Indirect: path <= 3; loose: path > 3
Period       Direct   Indirect   Loose    No Citation
1956-1980    3.31%    12.69%     3.10%    80.91%
1981-1990    3.89%    20.54%     5.28%    70.29%
1991-2000    7.28%    41.90%     7.36%    43.47%
2001-2008    9.47%    69.37%     9.79%    14.95%

Citation strength of productive authors across topics (% of citation)
Period       Direct   Indirect   Loose    No Citation
1956-1980    0.98%    13.18%     18.35%   72.49%
1981-1990    0.83%    13.90%     8.80%    76.47%
1991-2000    1.20%    49.39%     8.51%    40.89%
2001-2008    1.45%    81.15%     4.05%    13.35%

Ding, Y. (2011). Scientific collaboration and endorsement: Network analysis of coauthorship and citation networks. Journal of Informetrics, 5(1), 187-203.

EntityMetrics
Entitymetrics is defined as using entities (i.e., evaluative entities or knowledge entities) in the measurement of impact, knowledge usage, and knowledge transfer, to facilitate knowledge discovery.
Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics: Measuring the impact of entities. PLoS One, 8(8): 1-14.
[Figure: EntityMetrics diagram with labels Drug, Disease, Protein, Pathway, Gene, Cite, PubMed, Entities]

Entity Graph
• Heterogeneous Entity Graph
[Figure: example graph with nodes Bcl-2 Inhibitor, Diabetes, p53, Cancer, STAT3, Metformin, Breast Cancer, AMPK, PI3K; node types Drug, Disease, Protein, Pathway, Gene; edge labels ci (cite), co (co-occur), ci/co]

Metformin related entity-entity citation network
Data: 4,770 articles retrieved from PubMed Central with 134,844 references, and 1,969 bio-entities (i.e., 880 genes, 376 drugs, and 713 diseases)

Entity Citation Network vs. Entity Co-Occurrence Network
• Gene Gene Co-Occurrence Network (GG) vs.
Gene Cite Gene Network (GCG)
  – The GCG network shares many genes with the GG network and as a result is a competitive complement to the GG network
  – Using gene relationships based on citation relations relaxes the assumption that gene interaction is limited to the same article, and opens up a new opportunity to analyze gene interaction across a wider spectrum of datasets.
  – 1,149 gene pairs from GCG were found in GG. A total of 164 pairs out of 1,149 were not found in GG before 2005, but were found in GCG before 2005. In particular, the PARK2 and PINK1 gene pair ranks fifth by co-occurrence frequency in the GG network, implying that the pair has been studied intensively since 2005.
Song, M., Han, N., Kim, Y., Ding, Y., & Chambers, T. (2013). Discovering implicit entity relation with the gene-citation-gene network. PLoS One, 8(12), e84639.

Big Data in Life Sciences
• There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, there is in the public domain information on:
  – 69 million compounds and 449,392 bioassays (PubChem)
  – 59 million compound bioactivities (PubChem Bioassay)
  – 4,763 drugs (DrugBank)
  – 9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
  – 14 million human nucleotide sequences (EMBL)
  – 22 million life sciences publications, with 800,000 new each year (PubMed)
  – Multitude of other sets (drugs, toxicogenomics, chemogenomics, metagenomics …)
• Even more important are the relationships between these entities. For example, a chemical compound can be linked to a gene or a protein target in a multitude of ways:
  – Biological assay with percent inhibition, IC50, etc.
  – Crystal structure of ligand/protein complex
  – Co-occurrence in a paper abstract
  – Computational experiment (docking, predictive model)
  – Statistical relationship
  – System association (e.g., involved in the same pathways or cellular processes)
Wild, D. J., Ding, Y., Sheth, A. P., Harland, L., Gifford, E. M., & Lajiness, M. S. (2012). Systems chemical biology and the Semantic Web: What they mean for the future of drug discovery research. Drug Discovery Today (impact factor=6.422), 17(9-10), 469-474.
[Figure: data formats (Text, CSV, Table, HTML, XML) and entity types (Patient, Disease, Tissue, Cell, Pathway, DNA, RNA, Protein, Drug)]

Chem2Bio2RDF
• NCI Human Tumor Cell Lines Data
• PubChem Compound Database
• PubChem Bioassay Database
• PubChem Descriptions of all PubChem bioassays
• Pub3D: A similarity-searchable database of minimized 3D structures for PubChem compounds
• Drugbank
• MRTD: An implementation of the Maximum Recommended Therapeutic Dose set
• Medline: IDs of papers indexed in Medline, with SMILES of chemical structures
• ChEMBL chemogenomics database
• KEGG Ligand pathway database
• Comparative Toxicogenomics Database
• PhenoPred Data
• HuGEpedia: an encyclopedia of human genetic variation in health and disease.
31m chemical structures; 59m bioactivity data points; 3m/19m publications; ~5,000 drugs
Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y. and Wild, D. (2010). Chem2Bio2RDF: A semantic framework for linking and mining chemogenomic and systems chemical biology data. BMC Bioinformatics, 11, 255.

[Figure: Chem2Bio2RDF architecture, with components Dereferenceable URI; PlotViz: Visualization; Bio2RDF; Browsing; Cytoscape Plugin; Chem2Bio2RDF RDF triple store; Linked Path Generation and Ranking; LODD; uniprot; Others; SPARQL endpoints; Third party tools]
Chen, B., Ding, Y., & Wild, D. J. (2012). Improving integrative searching of systems chemical biology data using semantic annotation. Journal of Cheminformatics, 4:6 (doi:10.1186/1758-2946-4-6).
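
To make the "Linked Path Generation and Ranking" component more concrete, here is a minimal sketch of shortest-path search over a small linked-data graph using Dijkstra's algorithm, the algorithm named on the path-finding slide that follows. The node names, edge weights, and the `shortest_path` helper are invented for illustration; they are not taken from Chem2Bio2RDF, and the actual system may generate and rank paths differently.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights.

    Returns (path, cost), or (None, inf) if target is unreachable.
    """
    dist = {source: 0.0}          # best known cost to each node
    prev = {}                     # predecessor on the best known path
    heap = [(0.0, source)]        # priority queue of (cost, node)
    visited = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in visited:
            continue              # stale queue entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical toy graph: every node and weight below is made up.
edges = [
    ("drug:D1", "substructure:S1", 1.0),
    ("substructure:S1", "drug:D2", 1.0),
    ("drug:D2", "target:T1", 1.0),
    ("drug:D1", "pathway:W1", 2.0),
    ("pathway:W1", "target:T1", 2.0),
]
graph = {}
for a, b, w in edges:             # store each edge in both directions
    graph.setdefault(a, {})[b] = w
    graph.setdefault(b, {})[a] = w

path, cost = shortest_path(graph, "drug:D1", "target:T1")
print(" -> ".join(path), cost)
```

In this toy graph the route through the shared substructure (total weight 3.0) beats the route through the pathway (total weight 4.0). How edge weights should be chosen for real linked data is a separate modeling question that this sketch does not address.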
Semantic Graph Mining: Path Finding Algorithm
[Figure: example graph with nodes numbered 1 to 26]
Dijkstra’s algorithm

Bio-LDA
• Latent Dirichlet Allocation (LDA)
  – The core of a group of powerful statistical modeling techniques for automated extraction of latent topics from large document collections
• Bio-LDA
  – Extended LDA model with bio-terms as latent variables
  – Bio-terms: compound, gene, drug, disease, protein, side effect, pathways
• Calculate bio-term entropies over topics
• Use the Kullback-Leibler divergence as the non-symmetric distance measure for two bio-terms over topics
Example: Topic 10
Apply Bio-LDA to 336,899 PubMed article abstracts from 2009 and extract 50 topics
Wang, H., Ding, Y., Tang, J., Dong, X., He, B., Qiu, J. & Wild, D. (2011). Finding complex biological relationships in recent PubMed articles using Bio-LDA. PLoS One, 6(3): e17243. doi:10.1371/journal.pone.0017243.

Thiazolidinediones (TZDs) – revolutionary treatment for type II diabetes
Troglitazone (Rezulin): withdrawn in 2000 (liver disease)
Rosiglitazone (Avandia): restricted in 2010 (cardiac disease)
Pioglitazone: ???? (does decrease blood sugar levels, was associated with bladder tumors and has been withdrawn in some countries.)
[Figure: Rosiglitazone bound into PPAR-γ]
PPARG: TZD target
SAA2: Involved in inflammatory response implicated in cardiovascular disease (Current Opinion in Lipidology, 15(3), 269-278, 2004)
APOE: Apolipoprotein E3, essential for lipoprotein catabolism. Implicated in cardiovascular disease.
ADIPOQ: Adiponectin, involved in fatty acid metabolism. Implicated in metabolic syndrome, diabetes and cardiovascular disease
CYP2C8: Cytochrome P450, present in cardiovascular tissue and involved in metabolism of xenobiotics
CDKN2A: Tumor suppression gene
SLC29A1: Membrane transporter

Semantic Prediction
http://chem2bio2rdf.org/slap
Chen, B., Ding, Y., & Wild, D. (2012). Assessing Drug Target Association using Semantic Linked Data.
PLoS Computational Biology, 8(7): e1002574. doi:10.1371/journal.pcbi.1002574

Example: Troglitazone and PPARG
Association score: 2385.9
Association significance: 9.06 × 10^-6 => missing link predicted

Topology is important for association
[Figure: two path topologies linking Cmpd 1 to Protein 1 through Cmpd 2, built from hasSubstructure and bind edges]
Semantics is important for association
[Figure: example paths between Cmpd 1 and Protein 1 that pass through Cmpd 2 or Protein 2 via different relation types (bind, hasSubstructure, hasSideeffect, hasGO, PPI)]

SLAP Pipeline
Path filtering

Cross-check with SEA
• SEA analysis (Nature 462, 175-181, 2009) predicts 184 new compound-target pairs, 30 of which were experimentally tested
• 23 of these pairs were experimentally validated (<15 µM), including 15 aminergic GPCR targets and 8 that crossed major receptor classification boundaries
• 9 of the aminergic GPCR target pairings were correctly predicted by SLAP (p<0.05); for the other 6, the compounds were not present in our set
• 1 of the 8 cross-boundary pairs was predicted

Assessing drug similarity from biological function
• Took 157 drugs with 10 known therapeutic indications, and created SLAP profiles against 1,683 human targets
• A Pearson correlation > 0.9 between SLAP profiles was used to create associations between drugs
• Drugs with the same therapeutic indication unsurprisingly cluster together
• Some drugs with similar profiles have different indications – potential for use in drug repurposing?
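
The profile-correlation step described above can be sketched as follows. This is only an illustration under stated assumptions: the drug names and the five-target score vectors are made up, and `pearson` and `associate` are hypothetical helpers, not part of SLAP. The only element taken from the slide is the rule itself, linking two drugs when the Pearson correlation between their profiles exceeds 0.9.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0                # constant profile: treat as uncorrelated
    return cov / (sd_x * sd_y)


def associate(profiles, threshold=0.9):
    """Return (drug_a, drug_b, r) for every pair with r above the threshold."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r > threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical profiles: association scores of each drug against 5 targets.
profiles = {
    "drug_a": [9.0, 7.5, 0.2, 0.1, 3.0],
    "drug_b": [8.5, 7.0, 0.4, 0.3, 2.8],
    "drug_c": [0.1, 0.3, 9.2, 8.8, 0.5],
}

pairs = associate(profiles)
for a, b, r in pairs:
    print(f"{a} ~ {b} (r = {r:.3f})")
```

With these toy vectors only `drug_a` and `drug_b` are linked. In the setting on the slide each profile would have one entry per human target (1,683 of them), and the resulting links form the drug similarity network that is then inspected for clusters.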
Tool: Data2Knowledge Platform

Data2Knowledge platform…
AMiner: mining knowledge from articles
• Researcher profiling
• Expert search
• Topic analysis
• Reviewer suggestion
PMiner: mining knowledge from patents
• Competitor analysis
• Company search
• Patent summarization
SLAP: mining drug discovery data
• Predicting targets
• Repurposing drugs
• Heterogeneous graph search
... Mining more data…

AMiner
• Research profiling
• Integration
• Interest analysis
• Topic analysis
• Course search
• Expert search
• Association
• Disambiguation
• Suggestion
• Geo search
• Collaboration recommendation
Researchers: 31,222,410
Publications: 69,962,333
Conferences/Journals: 330,236
Citations: 133,196,029
Knowledge Concepts: 7,854,301

Expert Search
Basic Info.; Citation statistics; Research Interests; Publications; Social Network

Expertise Search
Finding top experts, top conferences, and highly cited papers for “data mining”

Geographic Search
Finding the hottest regions for “data mining”

Conference Analysis
Which year is the most successful in KDD’s history? Who are the highly cited authors? What is the author nationality distribution for the highly cited KDD papers of past years?

Reviewer Suggestion
Interest matching; COI avoidance; Load balancing; Forecasting review quality

Cross-domain Collaboration Recommendation
What are the cross-domain topics on which you can work in the target domain? Who are the best collaborators on each of these topics?

Topic Browser
200 topics have been discovered automatically from the academic articles

Academic Performance Measure
Academic Statistics
New Stars

Widely used..
• The largest publisher: Elsevier
• Conferences: KDD 2010, KDD 2011, KDD 2012, KDD 2013, KDD 2014, WSDM 2011, WSDM 2014, ICDM 2011, ICDM 2012, SocInfo 2011, ICMLA 2011, WAIM 2011, etc.

What is PMiner?
• Current patent analysis systems focus on search
  – Google Patent, WikiPatent, FreePatentsOnline
• PMiner is designed for an in-depth analysis of patent activity at the topic level
  – Topic-driven modeling of patents
  – Heterogeneous network co-ranking
  – Intelligent competitive analysis
  – Patent summarization
* Patent data: > 3.8M patents; > 2.4M inventors; > 400K companies; > 10M citation relationships
* Journal data: > 2k journal papers; > 3.7k authors
The crawled data is increasing to >300 Gigabytes.
J. Tang, B. Wang, Y. Yang, P. Hu, Y. Zhao, X. Yan, B. Gao, M. Huang, P. Xu, W. Li, and A. K. Usadi. PatentMiner: Topic-driven Patent Analysis and Patent Search
[Screenshot: Topics of search results; Top Patents; Top Inventors; Top Companies]

Topic-based Analysis for “Microsoft”
• A court decision in 08/2012 found that Samsung’s Galaxy smartphone infringed a series of Apple iPhone patents. Besides 4 appearance design patents, these included 3 software patents, known as 381, 915, and 163, which respectively cover "bounce back", “pinch-to-zoom”, and “tap-to-zoom”.
• These 3 software patents all belong to the following three patent categories: active solid-state devices (touch screen), computer graphics processing (graph scaling), and selective visual display systems (tap to select).
Demo
Y. Yang, J. Tang, J. Keomany, Y. Zhao, Y. Ding, J. Li, and L. Wang. Mining Competitive Relationships by Learning across Heterogeneous Networks.

Next Steps

Challenges for Identifying Emerging Trends
• Paradigm-changing discoveries have notoriously limited early impact, because the more a discovery deviates from the current paradigm, the longer it takes to be appreciated by the community
  – Dashun Wang, Chaoming Song, and Albert-László Barabási. 2013. Quantifying Long-Term Scientific Impact.
Science 342 (6154). Three variables:
  – Preferential attachment (# of citations)
  – Citation decay (aging of citations)
  – Community recognition (scholarly conformity, controlled by domain leaders)
• Domain leader recognition is critical (peer review, or who cites the article)
• In the long term (t → infinity), the number of citations is determined only by community recognition
• Highly innovative papers usually still cite conventional approaches or knowledge. For example, Newton presented his laws of gravitation using geometry rather than calculus, and Darwin's Origin of Species used conventional examples such as the breeding of dogs. – Brian Uzzi et al. (2013), Atypical combinations and scientific impact, Science, 342, 468

Proposed Approaches
• Step 1: Identify early features of scientific innovation, for example by studying the features of highly cited articles in their first 10 years:
  – yearly citation patterns, citation increase rates (popularity vs. prestige)
  – citation (author, venue), reference (author, venue), coauthor, venue (z-score)
  – adding topics (transdisciplinary vs. within discipline)
  – collaboration: transdisciplinary team-authored articles, features for high impact and high innovation (z-score)
• Step 2: Build mathematical models
  – Categorize learned features
  – Identify variables
  – Build math models
  – Test and evaluate
• Step 3: Supervised machine learning methods
  – High impact articles vs. low impact articles
  – High impact patents vs. low impact patents

Knowledge is power!

Acknowledgements
Thanks to all the collaborators:
David Wild, Kyle Stirling, Judy Qiu (Indiana)
Jie Tang, Juanzi Li, Jing Zhang, Zhanpeng Fang, Yang Yang (Tsinghua)
Jim Walson (Panoscopix)
Bing He (Johns Hopkins)
Chengxiang Zhai, Chi Wang, Brian Foote (UIUC)
Eric M. Gifford, Huijun Wang (Merck)
Bin Chen (Stanford)
Michael S. Lajiness (Eli Lilly)