The Workshops of the Thirtieth AAAI Conference on Artificial Intelligence
Scholarly Big Data: AI Perspectives, Challenges, and Ideas: Technical Report WS-16-13

A Feasibility Study of an Approach to Extend Research Footprints

Francisco Osuna, Bhanukiran Gurijala, Patricia Esparza, Monika Akbar, and Ann Gates
Department of Computer Science and Cyber-ShARE Center of Excellence
University of Texas at El Paso, El Paso, Texas, USA
{fjosuna, bgurijala, pesparza3, makbar, agates}@utep.edu

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Funding agencies and the National Academies of Science, Engineering, and Medicine have been promoting the importance of interdisciplinary research (IDR). Supporting team-based IDR requires the ability to discover the expertise needed to solve complex problems. Many universities have adopted expertise systems, which include the presentation of keywords or concepts to identify experts. The efforts at the University of Texas at El Paso (UTEP) have focused on building “communities of practice” that support diverse faculty who have an affinity for a particular topic and facilitate the ability to identify researchers with diverse expertise, knowledge, and skills who can contribute to new initiatives on campus. Our premise is that the university can facilitate the identification of potential contributors to communities of practice by correlating their associated ontologies to the concepts associated with researchers’ publications and proposal submissions. This paper presents the results of a preliminary study to examine the feasibility of the approach.

Introduction

There has been an increased emphasis on interdisciplinary research (IDR) and activities that support interactions needed to solve problems that cross disciplinary boundaries and advance education and research (Committee on Key Challenge Areas for Convergence and Health, Board on Life Sciences, Division on Earth and Life Studies, & National Research Council, 2014; Cooke & Hilton, 2015; Stokols, Hall, Taylor, & Moser, 2008). Indeed, today’s scientific and social challenges are complex and require engaging individuals who can contribute different perspectives, experiences, knowledge, and skills. One challenge in supporting IDR opportunities is the ability to identify researchers who have expertise, or even peripheral knowledge or interest, to contribute to an initiative, considering that discoveries often occur at the boundaries of different disciplines. There are a number of different expertise systems, e.g., Influuent, Vivo, Team Science Toolkit. In this paper, we define an expertise system as a web-based system that publishes expertise and resources at an institution or across institutions through a distributed network.

The purpose of the paper is to present results of a study that evaluates how key concepts extracted from publications and proposal submissions can be used to identify potential membership in communities of practice (CoPs), i.e., groups of people with a shared domain of interest. The effort supports UTEP’s long-term goal of developing CoPs and associated ontologies to extend the researcher’s footprint, i.e., concepts that define a researcher’s interests based on his or her research activities.

The paper presents a Background section that compares various expertise systems. The next section describes the methodology used to extract the concepts associated with publications and proposal submissions and assess their alignment with CoPs. The approach was applied to publications and proposals of over 90 researchers. The paper presents a case study that analyzes the results for three of those researchers. The researchers have unique experiences, e.g., one is highly focused on a particular area of research and the others work across areas with involvement in one or more CoPs. The paper ends with a summary.

Background

Efforts to connect people across disciplines and institutions have centered on the adoption of expertise systems. We briefly describe three exemplars: Vivo, which supports research and other creative collaborations within an institution and across institutions; Influuent, which supports collaborations with the University of Texas (UT) System institutions; and the Team Science Toolkit, which supports the practice and study of team science.

Vivo (Krafft et al., 2010) provides a portal built on Semantic Web technologies to support the acquisition and management of data. Member institutions represent the structure of their data through their own ontology, which can then be mapped to the VIVO Web ontology.

Influuent (The University of Texas System Office of Public Affairs, 2015) was developed in collaboration with Elsevier. Influuent populates the expertise of University of Texas (UT) System researchers through analysis of their publications in Scopus, an abstract and citation database. Influuent displays the researcher’s “fingerprint” with the option to view collaborators and their associated departments and institutions.

The Team Science Toolkit (Vogel et al., 2013) allows people to register their contact information, expertise statement, and keywords. The website provides resources to help users manage, support, and conduct team-based research.

UTEP developed the Expertise Connector (EC) system (Expertise Connector, 2015) using Semantic Web technologies to support collaborations by helping users identify and search for experts that shape scholarly and educational research at the university. The portal also aims to align potential grants with communities.
The source of expertise for faculty and professional staff profiles is Digital Measures (DM), an information system that allows faculty members to store information related to their professional activities and accomplishments for annual, tenure, and promotion evaluations (Digital Measures, 2015). Recognizing that expertise comes from multiple sources, UTEP extended DM to include links to personal websites, curriculum vitae, associated centers, research stories, communities (described later), and social networks. Research stories are fed from the Communications Office on a daily basis.

EC provides a portal to CoPs (Wenger, 1998) that connect people who can learn from others with similar interests through joint activities (virtual or in-person) that target all or part of the membership. The portal supports the ability to share information by linking related efforts and resources. Each community has a set of keywords associated with it. We are in the process of relating an ontology with each CoP that captures concepts and relationships to which members can associate.

Methodology

This section describes the methodology used to conduct the feasibility study. It is organized around the following steps: data acquisition, footprint generation, community-of-practice matching, and the case study.

Data Acquisition

The study utilized titles and abstracts from two data sources that document publications and proposal submissions of faculty members within the university. Researchers’ publication data were retrieved from DM. The source of proposal submissions was UTEP’s ORSP. Because the intent of the effort is to identify potential interest in a CoP, this study does not make a distinction between types of publications, the status of the proposal, i.e., whether it is pending, funded, or not funded, or the level of contribution to such artifacts. Other data used in the study were the EC keywords describing research expertise identified by the researcher, as well as ontologies associated with the communities of practice. In the study, we targeted three CoPs: Smart Cities, Cyber Security, and Undergraduate Research.

It is important to note that researchers who self-describe keywords may consider broader audiences of the expertise system, or those from their discipline. In the former case, keywords or concepts would more likely be general, while in the latter case, they would be more discipline specific.

Footprint Generation

Expertise systems typically use publications to define a footprint. This paper evaluates the use of proposal submissions to define the footprint with the aim of ultimately including CoP membership. The methodology extracted keywords from the titles and abstracts of researchers’ publications and submitted proposals. Later, these keywords were used to identify prevailing concepts in the research activities. The following subsections detail each step.

Keyword Extraction

For each faculty member, two different footprints were created using the set of keywords extracted from the titles and abstracts of publication records and proposals as described above. The extraction was done with the Rapid Automatic Keyword Extraction (RAKE) algorithm (Berry & Kogan, 2010). RAKE is an unsupervised and domain-independent method for extracting keywords from individual documents. RAKE was chosen for its simplicity and efficiency in automatically extracting keywords in a single pass, providing advantages in high-volume collections by freeing up computing resources for other analytical methods. Parsed data were cleaned using stop-lists of keywords that presented noise, e.g., professor, teacher, paper, or results. Only keywords of length three or greater were considered as valid keywords. This work also considered phrases containing at most six keywords.

Concept Identification

The set of keywords and phrases resulting from the extraction contained a large number of keywords at different levels of abstraction. To gain meaningful insight into the research areas of a faculty member, it was not sufficient to depend solely on the extracted keywords. Thus, the next step was to use the keywords to create concept-based footprints for both publications and proposal submissions. A concept is a high-level abstraction of related keywords that aids in the identification of notions not necessarily explicitly described in a document. There are a number of approaches available for deriving meaning from keywords, including clustering (Jain, Murty, & Flynn, 1999), topic modeling (Blei, 2012), semantic similarity (Jiang & Conrath, 1997), concept identification (Bower & Trabasso, 1964), and document summarization (Carbonell & Goldstein, 1998).

This research uses the AlchemyAPI (AlchemyAPI, 2015) Web service to derive concepts from a set of given keywords. If the number of users is scaled up, the potential cost of AlchemyAPI, which is software-as-a-service, could limit the amount of information processed because of constraints on requests. For the purposes of this study, such limitations did not hinder the computation or analyses. The natural language processing REST Web service was used because it is specifically geared for semantic textual analysis. Given a set of keywords, it provides a set of concepts identified in the keywords along with their relevance score. The relevance score describes the importance of each concept on a scale from 0 to 1, where a higher relevance score indicates a more significant concept. This step results in two sets of concepts from two types of activities (i.e., publications and proposals) for each faculty member.

Community-of-Practice Matching

This step focused on evaluating how key concepts extracted from publications and proposal submissions could be used to identify potential membership in communities of practice. In the study, the concepts associated with a CoP’s ontology were used to identify the number of matches between the terms belonging to each of the concept-based footprints. The ontologies were translated into a flat list of concepts, which gave equal weight to each term. The list of matched concepts was ordered by the frequency of concepts appearing in different resources in descending order. For this study, a higher match of concepts between the CoP ontology and a researcher’s footprint denotes a closer affinity of the researcher to that CoP.

Case Study

We conducted a case study on three of the over 90 researchers examined using the methodology described in this section. The case study addressed the following research question: How effective are a researcher’s publications and proposal submissions in identifying a researcher’s alignment with a community of practice? The study also examined the differences in keywords or concepts between those stored in EC, which is populated by researchers, and those stored in Influuent, which is populated by Scopus. The study examined researchers from three disciplinary areas: Civil Engineering (Faculty A), Electrical and Computer Engineering (Faculty B), and Anthropology (Faculty C). The analysis was anonymized to preserve privacy.

Results

Each case includes two figures. The first figure shows the top concepts identified in DM publications and proposal submissions through ORSP. The concepts are presented along the X-axis and the frequency of the concepts is presented along the Y-axis. The second figure shows the percentage of matches for concepts and keywords of EC, Influuent records, publications, and proposals for the three CoPs. The keywords marked EC are the researcher-defined keywords from EC. The remaining concepts are labeled as Influuent (originating from Scopus), DM Pubs (originating from DM publications), and ORSP Props (originating from ORSP).

Case 1: Faculty A

Faculty A is a member of the Department of Civil Engineering. The self-described keywords of this faculty from the EC included cross-border transportation, freight and transportation logistics, intelligent transportation systems, traffic engineering, transportation engineering, and transportation planning. Influuent identified the following set of top concepts based on his/her publication data: highway systems, travel time, neural networks, genetic algorithms, traffic signals, global positioning system, commercial vehicles, trucks, rapid transit, and costs.

Among the concepts identified in the faculty’s publications from DM are Neural network, Interstate highway system, and Public transport (Figure 1a). Analysis of the proposals linked to Faculty A revealed a different set of concepts (Figure 1b). Similar to the publication footprint, Transportation planning and Freeway are dominant concepts, linked to at least three proposals. The proposal footprint also identified contextual information such as location (e.g., Texas), as well as different research interests (e.g., Higher education). In particular, the keyword Texas is relevant because of the extensive work that the researcher conducts in the state regarding transportation.

Figure 1: Concepts extracted for Faculty A. (a) Publications. (b) Proposals.

Figure 3: Matching research footprints of Faculty A across communities of practices.

Observations: Faculty A is a member of the Smart Cities CoP. As shown in Figure 3, the expertise systems (i.e., EC and Influuent) were able to match Faculty A with this CoP. More than 30% of the EC keywords (self-described) were a match with the Smart Cities CoP ontology. In terms of publications, DM identified more matches than Influuent (more than 20% matches related to Smart Cities compared to less than 15% matches). Proposal submissions through ORSP revealed diverse concepts related to Smart Cities.
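The match percentages quoted in these observations come from the step described under Community-of-Practice Matching. The following is a minimal sketch of that computation, not the authors' implementation. It assumes that the percentage is taken over the distinct concepts in a footprint, that terms are compared by case-insensitive exact match, and that the CoP ontology has already been flattened into a list of terms. The function names and the sample concept lists are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from collections import Counter


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so that equivalent terms compare equal."""
    return " ".join(term.lower().split())


def match_footprint(
    footprint: list[str], cop_terms: list[str]
) -> tuple[float, list[tuple[str, int]]]:
    """Match one concept-based footprint against a flattened CoP ontology.

    Returns the percentage of distinct footprint concepts that appear in the
    CoP term list (an assumed definition of "percentage of matches"), and the
    matched concepts ordered by descending frequency in the footprint.
    """
    cop = {normalize(t) for t in cop_terms}  # flat list: every term has equal weight
    counts = Counter(normalize(c) for c in footprint)  # frequency across resources
    matched = [(c, n) for c, n in counts.most_common() if c in cop]
    pct = 100.0 * len(matched) / len(counts) if counts else 0.0
    return pct, matched


# Hypothetical data, for illustration only (not taken from the study).
smart_cities_terms = [
    "Transportation planning",
    "Public transport",
    "Traffic signal",
    "Smart grid",
]
dm_pubs_footprint = [
    "Neural network",
    "Public transport",
    "Interstate Highway System",
    "Public transport",
    "Transportation planning",
]

pct, matched = match_footprint(dm_pubs_footprint, smart_cities_terms)
print(f"{pct:.1f}% of footprint concepts match: {matched}")
```

Exact string matching keeps the sketch simple; a real ontology would likely need synonym or hierarchy handling, which the flat-list translation described above deliberately leaves out.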
These matches suggest that the publications and proposals of this faculty cover areas that contain concepts related to Smart Cities. As presented in Figure 3, although none of the EC keywords of Faculty A matched with the Cyber-Security CoP, the publications (i.e., Influuent, DM Pubs) indicated there might be some areas of shared interest. DM Pubs in particular identified concepts that align with this CoP (e.g., Computer simulation, Intelligent agent, and Control system), making Faculty A a potential candidate for collaboration with the Cyber-Security community. In terms of the Undergraduate Research CoP, the DM Pubs and ORSP Props were able to identify a possible alignment. Note that, while EC keywords show a high percentage of matches with the Smart Cities CoP (more than 30%), they fail to deliver any match with the two other CoPs. This could be due to the fact that the keywords are self-described by the faculty and hence more likely to match with the CoPs where the faculty chose to become a member.

Case 2: Faculty B

Faculty B is a member of the Department of Electrical and Computer Engineering. In EC, Faculty B identified his/her research interests in the following areas: Control systems, Cyber-physical systems, Electric power and energy systems, Hyperspectral remote sensing, Remote sensing, Signal processing, and Machine learning. Influuent identified the top concepts appearing in the publications of this faculty as Imagery, Remote sensing, Factorization, Parameter estimation, Image analysis, Pixels, Data reduction, Bathymetry, Imaging techniques, and Optical engineering.

DM publications of Faculty B (Figure 2a) identified concepts related to imaging (e.g., Hyperspectral imaging, Image Processing, Multi-spectral image), Numerical analysis, and Machine learning. Figure 2b shows results of similar analysis on proposal data. When compared to publications data, proposals identified similar concepts. Because the faculty is new to the university, he/she does not have a deep proposal record.

Figure 2: Concepts extracted for Faculty B. (a) Publications. (b) Proposals.

Observations: Faculty B is a member of the Cyber-Security CoP. Indeed, in Figure 4, we observe that Cyber-Security concepts match with both the publication and proposal footprint of this faculty. This area was also identified by the EC and Influuent profile. Although this faculty is not a member of the Smart Cities CoP, the EC keywords and publications indicate a possible alignment of concepts between Faculty B and the Smart Cities CoP. The exception in this case is the proposals, as there was no match between any concepts appearing in the proposals of Faculty B and the Smart Cities CoP. A small set of concepts related to Undergraduate Research was identified in the publications of this faculty through Influuent and DM (less than 5%). This suggests that the faculty is covering some areas of Undergraduate Research in his/her research activities that appear in the publications.

Figure 4: Matching research footprints of Faculty B across communities of practices.

Case 3: Faculty C

Faculty C is from the Sociology & Anthropology program. Some of the research interests identified by Faculty C in EC are: Anthropology, Borders, Community engagement, Culture, Health, Human rights, Immigration, and Society. Top concepts identified in Influuent based on the publications of Faculty C include United States of America, Mexico, Immigration, Anthropology, Illegality, Labor, Border region, Call center, and Linguistics.

DM publications of Faculty C revealed concepts related to Sociology (e.g., Culture, Human Rights), Anthropology, Psychology, and concepts related to immigration and border (Figure 5a). The concepts are in alignment with the listed keywords in EC and Influuent-detected concepts. However, the ORSP Props detected a different set of concepts that are more related to local and regional issues (e.g., Rio Grande, Irrigation, Water) (Figure 5b). Although many of the concepts are related to water, neither the EC keywords nor the concepts appearing in the publications (i.e., Influuent and DM) were able to identify an area of interest that addresses regional water issues. Further investigation of the timeline of proposals identifying water issues reveals that it is indeed a new research interest of this faculty.

Figure 5: Concepts extracted for Faculty C. (a) Publications. (b) Proposals.

Observations: Faculty C is not a member of any of the CoPs considered in this study. Figure 6 shows that some of the concepts appearing in the proposals of this faculty match with those of the Smart Cities CoP (more than 12%). Fewer concepts of this CoP matched when EC keywords and publication concepts were considered. EC keywords of this faculty had the most matches with the Cyber-Security CoP. Influuent concepts yielded the least number of matches with this CoP. Concepts and keywords related to Undergraduate Research were identified by both EC and publications. While the concepts from proposals of Faculty C resulted in more matches with the Smart Cities CoP, the EC keywords indicated interest towards the Cyber-Security CoP. The analysis suggests that Faculty C can potentially contribute to all three CoPs considered in this study.

Figure 6: Matching research footprints of Faculty C across communities of practices.

Related Work

Closely related work includes Cross-domain Topic Modeling (Tang, Wu, Sun, & Su, 2012), which uses research publications to address sparse connection, complementary expertise, and topic skewness challenges involved in recommending interdisciplinary collaborations. Similar work includes a recommendation algorithm for scientific articles based on both content and users’ ratings (Wang & Blei, 2011). This work combines collaborative filtering based on latent factor models to recommend articles to a particular user from other users’ libraries and content analysis based on probabilistic topic modeling for recommending unrated articles. Probabilistic topic models such as Latent Dirichlet Allocation are designed to discover and annotate vast unstructured collections of documents to infer hidden thematic information (Blei, 2012). The Language Model Approach (Tomokiyo & Hurst, 2003) combines the extraction of candidate keyphrases and their ranking. Maui (Medelyan, 2015) automatically determines main topics in documents by extraction of keywords from text with or without use of a reference to a controlled vocabulary. One of the closest works addressing keyword extraction is CiteTextRank (Gollapalli & Caragea, 2014), a graph-based algorithm for keyword extraction using a document’s content and context within a citation network.

Summary

Identifying researchers who can support IDR opportunities should be extended beyond identifying expertise. One should consider peripheral knowledge, or affinity to a topic. This paper presents a preliminary study to examine the effects of relating knowledge from disparate data sources to identify potential membership in communities of practice. The study showed that Influuent and DM Pubs yielded different concepts (e.g., Figure 3, Undergraduate Research). This may be partially due to the data sources: our preliminary observation is that Scopus (the source of Influuent data) does not capture all the publications of a researcher. The algorithms used for concept extraction by Influuent may also contribute to the differences. Self-described keywords given by EC show the variability in how researchers described themselves. Using concept abstraction provides a richer set of researcher footprints.

The study also raises questions. Additional investigations are needed to understand the differences in Influuent concepts and the concepts identified through DM Pubs and ORSP Props. Another important aspect is how to classify the keywords based on the needs of the user. For example, a user looking for researchers in cancer research would want the concepts at a higher level of abstraction than those who want to know who does research on a particular type of precursor cell, e.g., B-lymphoid. Another potential classification is those who are on the periphery of a discipline, as shown in the work for those who have interests in undergraduate research. The latter supports the ability to extend CoPs by identifying researchers who have the potential to contribute to a community, e.g., Faculty B, who is not part of the Smart Cities CoP but has publications that align well with this community. The approach also supports the efforts of UTEP’s ORSP to convene researchers in particular areas to build collaborations.

Future work includes comparing different approaches for concept identification (e.g., keywords vs. sentence-level parsing), investigating a statistical approach for assigning weights to the concepts extracted from different sources, and extending the ontologies to support emerging communities and IDR opportunities.

Our long-term goal is to extend the researcher’s footprint. As more researchers are associated with a CoP, they will be able to use its ontology to refine and directly associate with the community’s keywords, enriching their footprint and the community’s ontology. This will facilitate the university’s ability to identify collaborators and expertise. EC can be extended to identify funding opportunities associated with the communities.

Acknowledgements

This work is supported in part by the National Science Foundation (NSF) grants HRD-1242122 and DUE #0963648. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the author(s) and do not necessarily reflect the views of the NSF.

References

AlchemyAPI. (2015). Alchemy API. Retrieved October 23, 2015, from http://www.alchemyapi.com
Berry, M. W., & Kogan, J. (2010). Text mining: Applications and theory. John Wiley & Sons.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
Bower, G. H., & Trabasso, T. R. (1964). Concept identification. Studies in Mathematical Psychology, 32-94.
Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. Paper presented at the Proceedings of the 21st Annual International ACM SIGIR, pp. 335-336.
Committee on Key Challenge Areas for Convergence and Health, Board on Life Sciences, Division on Earth and Life Studies, & National Research Council. (2014). Convergence: Facilitating transdisciplinary integration of life sciences, physical sciences, engineering, and beyond. The National Academies Press.
Cooke, N. J., & Hilton, M. L. (2015). Enhancing the effectiveness of team science. National Academies Press.
Digital Measures. (2015). Digital measures. Retrieved October 30, 2015, from http://www.digitalmeasures.com/
Expertise Connector. (2015). Expertise connector. Retrieved October 30, 2015, from http://expertise.utep.edu/
Gollapalli, S. D., & Caragea, C. (2014). Extracting keyphrases from research papers using citation networks. Paper presented at the Proceedings of the 28th AAAI, pp. 1629-1635.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3), 264-323.
Jiang, J. J., & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008.
Krafft, D. B., Cappadona, N. A., Caruso, B., Corson-Rikert, J., Devare, M., Lowe, B. J., et al. (2010). Vivo: Enabling national networking of scientists.
Medelyan, A. (2015). Maui - multi-purpose automatic topic indexing. Retrieved December 04, 2015, from https://code.google.com/p/maui-indexer/
Stokols, D., Hall, K. L., Taylor, B. K., & Moser, R. P. (2008). The science of team science: Overview of the field and introduction to the supplement. American Journal of Preventive Medicine, 35(2), S77-S89.
Tang, J., Wu, S., Sun, J., & Su, H. (2012). Cross-domain collaboration recommendation. Paper presented at the Proceedings of the 18th ACM SIGKDD, pp. 1285-1293.
The University of Texas System Office of Public Affairs. (2015). UT system launches free online database to connect industry with thousands of world-class researchers. Retrieved October 29, 2015, from https://www.utsystem.edu/news/2015/05/14/utsystem-launches-free-online-database-connect-industrythousands-world-class-resea
Tomokiyo, T., & Hurst, M. (2003). A language model approach to keyphrase extraction. Paper presented at the Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment-Volume 18, pp. 33-40.
Vogel, A. L., Hall, K. L., Fiore, S. M., Klein, J. T., Bennett, L. M., Gadlin, H., et al. (2013). The team science toolkit: Enhancing research collaboration through online knowledge sharing. American Journal of Preventive Medicine, 45(6), 787-789.
Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. Paper presented at the Proceedings of the 17th ACM SIGKDD, pp. 448-456.
Wenger, E. (1998). Communities of practice: Learning as a social system. Systems Thinker, 9(5), 2-3.