A Feasibility Study of an Approach to Extend Research Footprints

The Workshops of the Thirtieth AAAI Conference on Artificial Intelligence
Scholarly Big Data: AI Perspectives, Challenges, and Ideas:
Technical Report WS-16-13
Francisco Osuna, Bhanukiran Gurijala, Patricia Esparza, Monika Akbar, and Ann Gates
Department of Computer Science and Cyber-ShARE Center of Excellence
University of Texas at El Paso, El Paso, Texas, USA
{fjosuna, bgurijala, pesparza3, makbar, agates}@utep.edu
Abstract
Funding agencies and the National Academies of Sciences, Engineering, and Medicine have been promoting the importance of interdisciplinary research (IDR). Supporting team-based IDR requires the ability to discover the expertise needed to solve complex problems. Many universities have adopted expertise systems, which include the presentation of keywords or concepts to identify experts. The efforts at the University of Texas at El Paso (UTEP) have focused on building “communities of practice” that support diverse faculty who have an affinity for a particular topic and facilitate the ability to identify researchers with diverse expertise, knowledge, and skills who can contribute to new initiatives on campus. Our premise is that the university can facilitate the identification of potential contributors to communities of practice by correlating their associated ontologies to the concepts associated with researchers’ publications and proposal submissions. This paper presents the results of a preliminary study to examine the feasibility of the approach.
Introduction
There has been an increased emphasis on interdisciplinary research (IDR) and activities that support interactions needed to solve problems that cross disciplinary boundaries and advance education and research (Committee on Key Challenge Areas for Convergence and Health, Board on Life Sciences, Division on Earth and Life Studies, & National Research Council, 2014; Cooke & Hilton, 2015; Stokols, Hall, Taylor, & Moser, 2008). Indeed, today’s scientific and social challenges are complex and require engaging individuals who can contribute different perspectives, experiences, knowledge, and skills.
One challenge in supporting IDR opportunities is the ability to identify researchers who have expertise, or even peripheral knowledge or interest, to contribute to an initiative, considering that discoveries often occur at the boundaries of different disciplines. There are a number of different expertise systems, e.g., Influuent, Vivo, and the Team Science Toolkit. In this paper, we define an expertise system as a web-based system that publishes expertise and resources at an institution or across institutions through a distributed network. The purpose of the paper is to present results of a study that evaluates how key concepts extracted from publications and proposal submissions can be used to identify potential membership in communities of practice (CoPs), i.e., groups of people with a shared domain of interest. The effort supports UTEP’s long-term goal of developing CoPs and associated ontologies to extend the researcher’s footprint, i.e., concepts that define a researcher’s interests based on his or her research activities.
The paper presents a Background section that compares various expertise systems. The next section describes the methodology used to extract the concepts associated with publications and proposal submissions and assess their alignment with CoPs. The approach was applied to publications and proposals of over 90 researchers. The paper presents a case study that analyzes the results for three of those researchers. The researchers have distinct profiles, e.g., one is highly focused on a particular area of research, while others work across areas with involvement in one or more CoPs. The paper ends with a summary.
Background
Efforts to connect people across disciplines and institutions
have centered on the adoption of expertise systems. We
briefly describe three exemplars: Vivo, which supports
research and other creative collaborations within an institution and across institutions; Influuent, which supports collaborations with the University of Texas (UT) System institutions; and the Team Science Toolkit, which supports
the practice and study of team science.
Vivo (Krafft et al., 2010) provides a portal built on Semantic Web technologies to support the acquisition and management of data. Member institutions represent the structure of their data through their own ontology, which can then be mapped to the VIVO Web ontology. Influuent (The University of Texas System Office of Public Affairs, 2015) was developed in collaboration with Elsevier. Influuent populates the expertise of University of Texas (UT) System researchers through analysis of their publications in Scopus, an abstract and citation database. Influuent displays the researcher’s “fingerprint” with the option to view collaborators and their associated departments and institutions. The Team Science Toolkit (Vogel et al., 2013) allows people to register their contact information, expertise statement, and keywords. The website provides resources to help users manage, support, and conduct team-based research.
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
UTEP developed the Expertise Connector (EC) system
(Expertise Connector, 2015) using Semantic Web technologies to support collaborations by helping users identify
and search for experts that shape scholarly and educational
research at the university. The portal also aims to align
potential grants with communities. The source of expertise
for faculty and professional staff profiles is Digital
Measures (DM), an information system that allows faculty
members to store information related to their professional
activities and accomplishments for annual, tenure, and
promotion evaluations (Digital Measures, 2015). Recognizing that expertise comes from multiple sources, UTEP extended DM to include links to personal websites, curricula vitae, associated centers, research stories, communities (described later), and social networks. Research stories are fed from the Communications Office on a daily basis.
EC provides a portal to CoPs (Wenger, 1998) that connect people who can learn from others with similar interests through joint activities (virtual or in-person) that target all or part of the membership. The portal supports the ability to share information by linking related efforts and resources. Each community has a set of keywords associated with it. We are in the process of associating an ontology with each CoP that captures concepts and relationships with which members can associate.
Methodology
This section describes the methodology used to conduct the feasibility study. It is organized around the following steps: data acquisition, footprint generation, community-of-practice matching, and the case study.
Data Acquisition
The study utilized titles and abstracts from two data sources that document publications and proposal submissions of faculty members within the university. Researchers’ publication data were retrieved from DM. The source of proposal submissions was UTEP’s ORSP. Because the intent of the effort is to identify potential interest in a CoP, this study does not make a distinction between types of publications, the status of the proposal, i.e., whether it is pending, funded, or not funded, or the level of contribution to such artifacts.
Other data used in the study were the EC keywords describing research expertise identified by the researcher, as well as ontologies associated with the communities of practice. In the study, we targeted three CoPs: Smart Cities, Cyber Security, and Undergraduate Research.
It is important to note that researchers who self-describe keywords may have in mind either the broader audience of the expertise system or readers from their own discipline. In the former case, keywords or concepts would more likely be general, while in the latter case, they would be more discipline specific.
Footprint Generation
Expertise systems typically use publications to define a footprint. This paper evaluates the use of proposal submissions to define the footprint with the aim of ultimately including CoP membership. The methodology extracted keywords from the titles and abstracts of researchers’ publications and submitted proposals. Later, these keywords were used to identify prevailing concepts in the research activities. The following subsections detail each step.
Keyword Extraction
For each faculty member, two different footprints were created using the set of keywords extracted from the titles and abstracts of publication records and proposals as described above. The extraction was done with the Rapid Automatic Keyword Extraction (RAKE) algorithm (Berry & Kogan, 2010). RAKE is an unsupervised and domain-independent method for extracting keywords from individual documents. RAKE was chosen for its simplicity and efficiency in automatically extracting keywords in a single pass, which provides advantages in high-volume collections by freeing up computing resources for other analytical methods. Parsed data were cleaned using stop-lists of keywords that presented noise, e.g., professor, teacher, paper, or results. Only keywords of length three or greater were considered as valid keywords. This work also considered phrases containing at most six keywords.
Concept Identification
The set of keywords and phrases resulting from the extraction contained a large number of keywords at different levels of abstraction. To gain meaningful insight into the research areas of a faculty member, it was not sufficient to depend solely on the extracted keywords. Thus, the next step was to use the keywords to create concept-based footprints for both publications and proposal submissions.
A concept is a high-level abstraction of related keywords that aids in the identification of notions not necessarily explicitly described in a document. There are a number of approaches available for deriving meaning from keywords including clustering (Jain, Murty, & Flynn, 1999), topic modeling (Blei, 2012), semantic similarity (Jiang & Conrath, 1997), concept identification (Bower & Trabasso, 1964), and document summarization (Carbonell & Goldstein, 1998). This research uses the AlchemyAPI (AlchemyAPI, 2015) Web service to derive concepts from a set of given keywords. If the number of users is scaled up, the potential cost of AlchemyAPI, which is software-as-a-service, could limit the amount of information processed because of constraints on requests. For the purposes of this study, such limitations did not hinder the computation or analyses.
AlchemyAPI’s natural language processing REST Web service was used because it is specifically geared for semantic textual analysis. Given a set of keywords, it provides a set of concepts identified in the keywords along with their relevance scores. The relevance score describes the importance of each concept and ranges from 0 to 1, where a higher score indicates a more significant concept. This step results in two sets of concepts from two types of activities (i.e., publications and proposals) for each faculty member.
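The extraction and cleaning rules described in the Keyword Extraction subsection can be illustrated with a short sketch. It is a minimal illustration of RAKE-style scoring (word degree divided by word frequency, summed per phrase) and not the implementation used in the study. The stop-word list is a small stand-in, the function names are hypothetical, and we assume that the length-three rule refers to characters per word and the limit of six refers to words per phrase.

```python
import re
from collections import defaultdict

# Tiny stand-in stop-word list (assumption; real RAKE stop lists are much larger).
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "on",
              "that", "the", "this", "to", "we", "with"}
# Noise terms named in the text, treated here like stop words.
NOISE_WORDS = {"professor", "teacher", "paper", "results"}


def candidate_phrases(text, min_chars=3, max_words=6):
    """Split text into candidate phrases at punctuation, stop words, and noise terms."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            is_delimiter = (word in STOP_WORDS or word in NOISE_WORDS
                            or len(word) < min_chars)
            if is_delimiter:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    # Assumed reading of the text: keep phrases of at most six words.
    return [p for p in phrases if len(p) <= max_words]


def extract_keywords(text):
    """Score each candidate phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, counting the word itself
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}


abstract = ("This paper presents results on hyperspectral remote sensing "
            "and machine learning for image analysis.")
for phrase, score in sorted(extract_keywords(abstract).items(), key=lambda kv: -kv[1]):
    print(f"{score:4.1f}  {phrase}")
```

Longer phrases made of words that co-occur with many other words receive higher scores, which is why multi-word technical terms tend to rank above isolated words.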
Results
Each case includes two figures. The first figure shows the top concepts identified in DM publications and in proposal submissions through ORSP. The concepts are presented along the X-axis and the frequency of the concepts is presented along the Y-axis. The second figure shows the percentage of matches for concepts and keywords of EC, Influuent records, publications, and proposals for the three CoPs.
The keywords marked EC are the researcher-defined keywords from EC. The remaining concepts are labeled as
Influuent (originating from Scopus), DM Pubs (originating
from DM publications), and ORSP Props (originating from
ORSP).
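The concept frequencies plotted in the first figure of each case can be computed along the following lines. This is a sketch under stated assumptions rather than the study's code: each document is represented as a hypothetical list of (concept, relevance) pairs, a concept is counted once per document, and the `min_relevance` cutoff is an illustrative parameter that the study does not specify. The sample data are invented for the example.

```python
from collections import Counter


def concept_footprint(documents, min_relevance=0.0, top_n=10):
    """Count in how many documents each concept appears and return the top concepts."""
    counts = Counter()
    for concepts in documents:
        # Count a concept at most once per document, ignoring low-relevance hits.
        seen = {name for name, relevance in concepts if relevance >= min_relevance}
        counts.update(seen)
    return counts.most_common(top_n)


# Hypothetical concept lists for three proposals of one researcher.
proposals = [
    [("Transportation planning", 0.90), ("Freeway", 0.70)],
    [("Transportation planning", 0.80), ("Texas", 0.40)],
    [("Freeway", 0.60), ("Transportation planning", 0.95)],
]
for concept, frequency in concept_footprint(proposals, min_relevance=0.5):
    print(f"{frequency}  {concept}")
```

Running the same function separately on the publication records and on the proposal records yields the two concept-based footprints that are compared in each case.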
Case 1: Faculty A
Faculty A is a member of the Department of Civil Engineering. The self-described keywords of this faculty member from
EC included cross-border transportation, freight and
transportation logistics, intelligent transportation systems,
traffic engineering, transportation engineering, and transportation planning. Influuent identified the following set of
top concepts based on his/her publication data: highway
systems, travel time, neural networks, genetic algorithms,
traffic signals, global positioning system, commercial vehicles, trucks, rapid transit, and costs.
Among the concepts identified in the faculty’s publications from DM are Neural network, Interstate highway
system, and Public transport (Figure 1a). Analysis of the
proposals linked to Faculty A revealed a different set of
concepts (Figure 1b). Similar to the publication footprint,
Transportation planning and Freeway are dominant concepts, linked to at least three proposals. The proposal footprint also identified contextual information such as location (e.g., Texas), as well as different research interests
(e.g., Higher education). In particular, the keyword Texas
is relevant because of the extensive work that the researcher conducts in the state regarding transportation.
Community-of-Practice Matching
This step focused on evaluating how key concepts extracted from publications and proposal submissions could be
used to identify potential membership in communities of
practice. In the study, the concepts associated with a CoP’s
ontology were used to identify the number of matches between the terms belonging to each of the concept-based
footprints. The ontologies were translated into a flat list of
concepts, which gave equal weight to each term. The list of
matched concepts was ordered by the frequency of concepts appearing in different resources in descending order.
For this study, a higher match of concepts between the CoP
ontology and a researcher’s footprint denotes a closer affinity of the researcher to that CoP.
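A minimal sketch of this matching step follows. It assumes case-insensitive exact matching between footprint terms and the flattened ontology, and it reports matches as a percentage of the footprint terms; the study does not specify either detail, so both are assumptions. The function names and the sample ontologies are hypothetical.

```python
def match_percentage(footprint_terms, cop_concepts):
    """Percentage of footprint terms that also appear in the flattened CoP ontology."""
    footprint = {term.strip().lower() for term in footprint_terms}
    ontology = {concept.strip().lower() for concept in cop_concepts}
    if not footprint:
        return 0.0
    return 100.0 * len(footprint & ontology) / len(footprint)


def rank_cops(footprint_terms, cops):
    """Order CoPs by how closely the footprint matches each flat concept list."""
    scores = {name: match_percentage(footprint_terms, concepts)
              for name, concepts in cops.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical flattened ontologies; every term carries equal weight.
cops = {
    "Smart Cities": ["transportation planning", "freeway", "public transport"],
    "Cyber Security": ["intelligent agent", "control system"],
}
footprint = ["Transportation planning", "Freeway", "Neural network", "Texas"]
for name, percent in rank_cops(footprint, cops):
    print(f"{percent:5.1f}%  {name}")
```

Because every ontology term has equal weight, the ranking depends only on the overlap between the two term sets; a weighted variant would need a weight per concept.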
Case Study
We conducted a case study on three of the over 90 researchers examined using the methodology described in
this section. The case study addressed the following research question: How effective are a researcher’s publications and proposal submissions in identifying a researcher’s alignment with a community of practice? The study
also examined the differences in keywords or concepts
between those stored in EC, which is populated by researchers, and those stored in Influuent, which is populated
by Scopus. The study examined researchers from three
disciplinary areas: Civil Engineering (Faculty A), Electrical and Computer Engineering (Faculty B), and Anthropology (Faculty C). The analysis was anonymized to preserve privacy.
Figure 1: Concepts extracted for Faculty A: (a) publications, (b) proposals.
Figure 3: Matching research footprints of Faculty A across communities of practice.
Observations: Faculty A is a member of the Smart Cities CoP. As shown in Figure 3, the expertise systems (i.e., EC and Influuent) were able to match Faculty A with this CoP. More than 30% of the EC keywords (self-described) were a match with the Smart Cities CoP ontology. In terms of publications, DM identified more matches than Influuent (more than 20% matches related to Smart Cities compared to less than 15% matches). Proposal submissions through ORSP revealed diverse concepts related to Smart Cities. This suggests that the publications and proposals of this faculty member cover areas that contain concepts related to Smart Cities.
As presented in Figure 3, although none of the EC keywords of Faculty A matched with the Cyber-Security CoP, the publications (i.e., Influuent, DM Pubs) indicated there might be some areas of shared interest. DM Pubs in particular identified concepts that align with this CoP (e.g., Computer simulation, Intelligent agent, and Control system), making Faculty A a potential candidate for collaboration with the Cyber-Security community. In terms of the Undergraduate Research CoP, the DM Pubs and ORSP Props were able to identify a possible alignment.
Note that, while the EC keywords show a high percentage of matches with the Smart Cities CoP (more than 30%), they fail to deliver any match with the two other CoPs. This could be due to the fact that the keywords are self-described by the faculty member and hence more likely to match with the CoPs where the faculty member chose to become a member.
Case 2: Faculty B
Faculty B is a member of the Department of Electrical and Computer Engineering. In EC, Faculty B identified his/her research interests in the following areas: Control systems, Cyber-physical systems, Electric power and energy systems, Hyperspectral remote sensing, Remote sensing, Signal processing, and Machine learning. Influuent identified the top concepts appearing in the publications of this faculty member as Imagery, Remote sensing, Factorization, Parameter estimation, Image analysis, Pixels, Data reduction, Bathymetry, Imaging techniques, and Optical engineering.
Figure 2: Concepts extracted for Faculty B: (a) publications, (b) proposals.
DM publications of Faculty B (Figure 2a) identified concepts related to imaging (e.g., Hyperspectral imaging, Image Processing, Multi-spectral image), Numerical analysis, and Machine learning. Figure 2b shows results of a similar analysis on proposal data. When compared to publications data, proposals identified similar concepts. Because the faculty member is new to the university, he/she does not have a deep proposal record.
Observations: Faculty B is a member of the Cyber-Security CoP. Indeed, in Figure 4, we observe that Cyber-Security concepts match with both the publication and proposal footprints of this faculty member. This area was also identified by the EC and Influuent profiles.
Although this faculty member is not a member of the Smart Cities CoP, the EC keywords and publications indicate a possible alignment of concepts between Faculty B and the Smart Cities CoP. The exception in this case is the proposals, as there was no match between any concepts appearing in the proposals of Faculty B and the Smart Cities CoP. A small set of concepts related to Undergraduate Research was identified in the publications of this faculty member through Influuent and DM (less than 5%). This suggests that the faculty member is covering some areas of Undergraduate Research in his/her research activities that appear in the publications.
Figure 4: Matching research footprints of Faculty B across communities of practice.
Case 3: Faculty C
Faculty C is from the Sociology & Anthropology program. Some of the research interests identified by Faculty C in EC are: Anthropology, Borders, Community engagement, Culture, Health, Human rights, Immigration, and Society. Top concepts identified in Influuent based on the publications of Faculty C include United States of America, Mexico, Immigration, Anthropology, Illegality, Labor, Border region, Call center, and Linguistics.
Figure 5: Concepts extracted for Faculty C: (a) publications, (b) proposals.
DM publications of Faculty C revealed concepts related to Sociology (e.g., Culture, Human Rights), Anthropology, Psychology, and concepts related to immigration and border (Figure 5a). The concepts are in alignment with the listed keywords in EC and Influuent-detected concepts. However, the ORSP Props detected a different set of concepts that are more related to local and regional issues (e.g., Rio Grande, Irrigation, Water) (Figure 5b). Although many of the concepts are related to water, neither the EC keywords nor the concepts appearing in the publications (i.e., Influuent and DM) were able to identify an area of interest that addresses regional water issues. Further investigation of the timeline of proposals identifying water issues reveals that this is indeed a new research interest of this faculty member.
Observations: Faculty C is not a member of any of the CoPs considered in this study. Figure 6 shows that some of the concepts appearing in the proposals of this faculty member match those of the Smart Cities CoP (more than 12%). Fewer concepts of this CoP matched when EC keywords and publication concepts were considered. EC keywords of this faculty member had the most matches with the Cyber-Security CoP. Influuent concepts yielded the least number of matches with this CoP. Concepts and keywords related to Undergraduate Research were identified by both EC and publications.
While the concepts from proposals of Faculty C resulted in more matches with the Smart Cities CoP, the EC keywords indicated interest towards the Cyber-Security CoP. The analysis suggests that Faculty C can potentially contribute to all three CoPs considered in this study.
Figure 6: Matching research footprints of Faculty C across communities of practice.
Related Work
Closely related work includes Cross-domain Topic Modeling (Tang, Wu, Sun, & Su, 2012), which uses research publications to address sparse connection, complementary expertise, and topic skewness challenges involved in recommending interdisciplinary collaborations. Similar work includes a recommendation algorithm for scientific articles based on both content and users’ ratings (Wang & Blei, 2011). This work combines collaborative filtering based on latent factor models to recommend articles to a particular user from other users’ libraries and content analysis based on probabilistic topic modeling for recommending unrated articles. Probabilistic topic models such as Latent Dirichlet Allocation are designed to discover and annotate vast unstructured collections of documents to infer hidden thematic information (Blei, 2012). The Language Model Approach (Tomokiyo & Hurst, 2003) combines the extraction of candidate keyphrases and their ranking.
Maui (Medelyan, 2015) automatically determines main topics in documents by extraction of keywords from text with or without use of a reference to a controlled vocabulary. One of the closest works addressing keyword extraction is CiteTextRank (Gollapalli & Caragea, 2014), a graph-based algorithm for keyword extraction using a document’s content and context within a citation network.
Summary
Identifying researchers who can support IDR opportunities should be extended beyond identifying expertise. One should consider peripheral knowledge, or affinity to a topic. This paper presents a preliminary study to examine the effects of relating knowledge from disparate data sources to identify potential membership in communities of practice. The study showed that Influuent and DM Pubs yielded different concepts (e.g., Figure 3, Undergraduate Research). This may be partially due to the data sources used in our preliminary observations. Scopus (source of Influuent data) does not capture all the publications of a researcher.
The algorithms used for concept extraction by Influuent may also contribute to the differences. Self-described keywords given by EC show the variability in how researchers described themselves. Using concept abstraction provides a richer set of researcher footprints.
The study also raises questions. Additional investigations are needed to understand the differences in Influuent concepts and the concepts identified through DM Pubs and ORSP Props. Another important aspect is how to classify the keywords based on the needs of the user. For example, a user looking for researchers in cancer research would want the concepts at a higher level of abstraction than a user who wants to know who does research on a particular type of precursor cell, e.g., B-lymphoid. Another potential classification is those who are on the periphery of a discipline, as shown in the work for those who have interests in undergraduate research. The latter supports the ability to extend CoPs by identifying researchers who have the potential to contribute to a community, e.g., Faculty B, who is not part of the Smart Cities CoP but has publications that align well with this community. The approach also supports the efforts of UTEP’s ORSP to convene researchers in particular areas to build collaborations.
Future work includes comparing different approaches for concept identification (e.g., keywords vs. sentence level parsing), investigating a statistical approach for assigning weights to the concepts extracted from different sources, and extending the ontologies to support emerging communities and IDR opportunities. Our long-term goal is to extend the researcher’s footprint. As more researchers are associated with a CoP, they will be able to use its ontology to refine and directly associate with the community’s keywords, enriching their footprint and the community’s ontology. This will facilitate the university’s ability to identify collaborators and expertise. EC can be extended to identify funding opportunities associated with the communities.
Acknowledgements
This work is supported in part by the National Science Foundation (NSF) grants HRD-1242122 and DUE #0963648. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the author(s) and do not necessarily reflect the views of the NSF.
References
AlchemyAPI. (2015). Alchemy API. Retrieved October 23, 2015, from http://www.alchemyapi.com
Berry, M. W., & Kogan, J. (2010). Text mining: Applications and theory. John Wiley & Sons.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
Bower, G. H., & Trabasso, T. R. (1964). Concept identification. Studies in Mathematical Psychology, 32-94.
Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. Paper presented at the Proceedings of the 21st Annual International ACM SIGIR, pp. 335-336.
Committee on Key Challenge Areas for Convergence and Health, Board on Life Sciences, Division on Earth and Life Studies, & National Research Council. (2014). Convergence: Facilitating transdisciplinary integration of life sciences, physical sciences, engineering, and beyond. The National Academies Press.
Cooke, N. J., & Hilton, M. L. (2015). Enhancing the effectiveness of team science. National Academies Press.
Digital Measures. (2015). Digital measures. Retrieved October 30, 2015, from http://www.digitalmeasures.com/
Expertise Connector. (2015). Expertise connector. Retrieved October 30, 2015, from http://expertise.utep.edu/
Gollapalli, S. D., & Caragea, C. (2014). Extracting keyphrases from research papers using citation networks. Paper presented at the Proceedings of the 28th AAAI, pp. 1629-1635.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3), 264-323.
Jiang, J. J., & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008.
Krafft, D. B., Cappadona, N. A., Caruso, B., Corson-Rikert, J., Devare, M., Lowe, B. J., et al. (2010). Vivo: Enabling national networking of scientists.
Medelyan, A. (2015). Maui - multi-purpose automatic topic indexing. Retrieved December 04, 2015, from https://code.google.com/p/maui-indexer/
Stokols, D., Hall, K. L., Taylor, B. K., & Moser, R. P. (2008). The science of team science: Overview of the field and introduction to the supplement. American Journal of Preventive Medicine, 35(2), S77-S89.
Tang, J., Wu, S., Sun, J., & Su, H. (2012). Cross-domain collaboration recommendation. Paper presented at the Proceedings of the 18th ACM SIGKDD, pp. 1285-1293.
The University of Texas System Office of Public Affairs. (2015). UT system launches free online database to connect industry with thousands of world-class researchers. Retrieved October 29, 2015, from https://www.utsystem.edu/news/2015/05/14/utsystem-launches-free-online-database-connect-industrythousands-world-class-resea
Tomokiyo, T., & Hurst, M. (2003). A language model approach to keyphrase extraction. Paper presented at the Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment-Volume 18, pp. 33-40.
Vogel, A. L., Hall, K. L., Fiore, S. M., Klein, J. T., Bennett, L. M., Gadlin, H., et al. (2013). The team science toolkit: Enhancing research collaboration through online knowledge sharing. American Journal of Preventive Medicine, 45(6), 787-789.
Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. Paper presented at the Proceedings of the 17th ACM SIGKDD, pp. 448-456.
Wenger, E. (1998). Communities of practice: Learning as a social system. Systems Thinker, 9(5), 2-3.