Assessing Author Research Focus Using Vector Space Modeling

Kun Lu
School of Information Studies
U. of Wisconsin-Milwaukee
P.O. Box 413, Milwaukee, WI USA 53201
e-mail: kunlu@uwm.edu

Dietmar Wolfram
School of Information Studies
U. of Wisconsin-Milwaukee
P.O. Box 413, Milwaukee, WI USA 53201
e-mail: dwolfram@uwm.edu

ASIST 2011, October 9-13, 2011, New Orleans, LA, USA.

ABSTRACT
A method for comparing the research focus of a group of
authors using vector space modeling is presented. The body
of published work by an author or group of authors in a
given field may be represented as a vocabulary based on
language use. Using vector space modeling, publications
may be represented as vectors, resulting in a topic space.
The works of a given author may be mapped onto the space,
resulting in an author space. The density of a given author
space provides an indication of the coherence of the
author’s body of work, where a high-density author space is
indicative of a more focused research agenda and a
low-density author space indicates more varied areas of
investigation or language use. This concept was applied to a
set of publications appearing in eight high-impact-factor
journals in information science, from which the 100 most
prolific authors were identified. Their author space
characteristics were determined and compared to identify
those with the most focused research. The findings
demonstrate significant correlations among authors’ numbers
of publications, vocabulary usage, and resulting author space
density, with some exceptions.
Keywords
Informetrics, Author space, Vector space modeling
INTRODUCTION
The purpose of this research is to demonstrate the
application of word analysis to publications in a vector
space environment to identify the level of focus of authors’
research. To date, studies have relied on word analysis for
mapping of disciplinary structure (e.g., Braam, Moed, & van
Raan, 1991a, 1991b; Janssens, Leta, Glänzel, & De Moor,
2006), but to the best of our knowledge not for author
analysis using vector space modeling and author space
density outcomes.
In the vector space model, publications related to a certain
topic that are included in an indexed data set constitute a
topic space. The focus of an author’s research output can be
represented by similarities among his/her works. If an
author’s publications are similar to each other, his/her
research agenda is assumed to be more focused and vice
versa. A mathematical representation of this measure can be
thought of as a density, where a highly dense space
represents higher homogeneity in the objects and thus more
focus in an author’s research output. A low density space
indicates greater variation. The space density is defined as
the average similarity between each object in the space
and the centroid, or average document vector. The
mathematical representation of the density is:
$$\mathrm{Density}(S) = \frac{\sum_{D \in S} \mathrm{Sim}(D, C)}{n} \tag{1}$$
where Sim() is the similarity function (cosine similarity in
the current study), S is the author space, C is the centroid
of that space, D is a document in the space, and n is the
number of documents in the space. We note that vocabulary
conventions vary from field to field; some fields have more
limited vocabularies than others. However, we believe that
the consistent use of a similar vocabulary in a similar way
provides evidence of a focused research agenda.
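To make the density definition concrete, the following is a minimal Python sketch of equation (1) using NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `space_density`) and the toy vectors are hypothetical, and the rows are assumed to be already-weighted document vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors; 0.0 if either has zero length."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def space_density(doc_vectors):
    """Average cosine similarity between each document and the centroid (Eq. 1)."""
    docs = np.asarray(doc_vectors, dtype=float)
    centroid = docs.mean(axis=0)  # C: the average document vector
    return float(np.mean([cosine(d, centroid) for d in docs]))

# Hypothetical toy author spaces (rows are documents, columns are terms).
focused = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [1.0, 1.0, 0.0]]
varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(space_density(focused))  # close to 1: documents share vocabulary
print(space_density(varied))   # lower: documents share no terms
```

In this toy case the author whose documents share most of their terms obtains the higher density, which matches the interpretation of density as research focus.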
Note that the density calculation averages over the number
of publications n, so the measure is not inflated simply
because an author has produced more publications. More
information about densities in an indexed space can be
found in Wolfram & Zhang (2008).
The author space density provides a measure by which
different authors may be compared for the extent of their
research focus. We hypothesize that highly prolific authors
will have a lower author space density because they are
more likely to contribute to a broader range of topics.
Authors with a higher density space are believed to have a
more focused or narrower research agenda.
THE VECTOR SPACE MODEL FOR INFORMATION RETRIEVAL
The vector space model is one of the most influential
models in information retrieval (Salton & McGill, 1983).
In this model, each document is represented as a vector
whose elements correspond to the words appearing in the
collection. The document vectors in a collection constitute a
document-term matrix. The value of each element
represents the significance of the term in the document.
Once documents are represented as vectors, traditional
measures such as angle (e.g., based on a cosine measure) and
distance (e.g., Euclidean distance) can be used to measure
the similarity between documents. In the vector space, a
number of documents constitute a document space. In the
current study, each author is viewed as a document space
consisting of the articles he/she has written. This space is
named the author space.
METHOD
Journals with the highest impact factor in the category of
“information science & library science” appearing in the
Journal Citation Reports 2009 Social Sciences Edition were
identified. Journals associated with allied subject areas, such
as Management Information Systems and Medical
Informatics, were excluded. Eight journals were selected
for inclusion in the study: Journal of Informetrics, Annual
Review of Information Science and Technology, Journal of
the American Society for Information Science and
Technology (not including JASIS), Scientometrics,
Information Processing & Management, Journal of
Information Science, Online Information Review, and
Journal of Documentation. Bibliographic records for
documents published in these journals between 2000 and
2010 were downloaded. The records were further limited to
three document types: articles, proceedings papers, and
reviews. The other document types were less likely to
represent research contributions by the authors. In total,
5227 records were downloaded from Thomson Reuters Web
of Science (WoS). The raw WoS records were processed and
only three fields were kept, namely the article title (“TI”
field), the Keywords Plus (“ID” field), and the abstract
(“AB” field). The records were then indexed with the widely
used Lemur information retrieval toolkit. Stop words were
removed and Porter stemming was applied.
From the 5227 records downloaded, we were able to
identify 6282 different author names using string matching.
Because it is impractical to list all of the authors in our
collection, we focused on the 100 most prolific authors
according to the WoS “analyze results” function. We selected
the most prolific authors because the more an author writes,
the better the algorithm used “understands” his/her
interests, and thus the more accurate our measure will be.
The 5227 records served as the basis of the topic space.
TF*IDF term weighting was employed to assign term
significance in the space. Terms that were single characters
or consisted only of digits (e.g., “2001”) were filtered out,
because these terms are believed to add noise to the space
rather than meaning. The author space for each of the 100
most prolific authors was then generated, consisting of all
the articles by that author represented in the topic space.

Table 1. Author Space Density Outcomes –
25 Most Prolific Authors

Author            Pubs. (Rank)   Density (Rank)   VocSize (Rank)
THELWALL, M       97 (1)         0.318 (91)       1482 (1)
EGGHE, L          74 (2)         0.287 (98)       937 (6)
GLANZEL, W        61 (3)         0.310 (92)       965 (5)
ROUSSEAU, R       61 (3)         0.244 (100)      1008 (4)
LEYDESDORFF, L    60 (5)         0.328 (87)       1092 (2)
JACSO, P          56 (6)         0.291 (94)       799 (11)
OPPENHEIM, C      48 (7)         0.252 (99)       1020 (3)
SPINK, A          48 (7)         0.378 (72)       907 (7)
BAR-ILAN, J       40 (9)         0.322 (90)       875 (9)
NICHOLAS, D       34 (10)        0.384 (64)       889 (8)
BORNMANN, L       31 (11)        0.377 (73)       724 (16)
DANIEL, HD        30 (12)        0.386 (62)       682 (21)
HUNTINGTON, P     30 (12)        0.392 (58)       832 (10)
JANSEN, BJ        27 (14)        0.438 (25)       690 (19)
SCHUBERT, A       27 (14)        0.290 (96)       522 (41)
JARVELIN, K       26 (16)        0.339 (85)       790 (12)
CHEN, HC          24 (17)        0.291 (95)       760 (13)
FORD, N           24 (17)        0.384 (65)       738 (15)
WILSON, CS        24 (17)        0.308 (93)       721 (17)
ZHANG, J          24 (17)        0.322 (89)       580 (28)
BURRELL, QL       23 (21)        0.391 (59)       498 (47)
CRONIN, B         23 (21)        0.288 (97)       510 (43)
VAUGHAN, L        22 (23)        0.440 (24)       602 (25)
MEYER, M          21 (24)        0.481 (11)       670 (21)
MOED, HF          21 (24)        0.373 (76)       661 (23)
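The indexing steps described in the Method section can be sketched roughly as follows. This is an illustrative Python approximation, not the Lemur toolkit's behavior: the stop-word list is a tiny placeholder, Porter stemming is omitted, the IDF variant log(N/df) is an assumption (the paper does not state which TF*IDF formula was used), and all function names and sample records are hypothetical.

```python
import math
import re
from collections import Counter

import numpy as np

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "for"}

def tokenize(text):
    """Lowercase and split text, then drop stop words, single characters,
    and digit-only terms such as "2001". Stemming is not applied here."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens
            if len(t) > 1 and not t.isdigit() and t not in STOP_WORDS]

def build_topic_space(records):
    """Return (vocab, matrix); matrix[i, j] is the TF*IDF weight of term j in record i."""
    token_lists = [tokenize(r) for r in records]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    index = {t: j for j, t in enumerate(vocab)}
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    n_docs = len(records)
    matrix = np.zeros((n_docs, len(vocab)))
    for i, tokens in enumerate(token_lists):
        for term, tf in Counter(tokens).items():
            idf = math.log(n_docs / doc_freq[term])  # assumed IDF variant
            matrix[i, index[term]] = tf * idf
    return vocab, matrix

def author_space(matrix, authors_per_record, author):
    """Select the rows of the topic space for records that list the given author."""
    rows = [i for i, names in enumerate(authors_per_record) if author in names]
    return matrix[rows]

# Hypothetical records (title + keywords + abstract text would be concatenated).
records = [
    "Web link analysis of university sites 2001",
    "Link analysis and web impact factors",
    "Citation analysis of h index",
]
authors = [["AUTHOR_A"], ["AUTHOR_A", "AUTHOR_B"], ["AUTHOR_B"]]

vocab, topic = build_topic_space(records)
space_a = author_space(topic, authors, "AUTHOR_A")
print(vocab)
print(space_a.shape)
```

Term weights are computed over the whole collection before any author's rows are selected, so every author space shares the same term dimensions and weighting, which is what makes densities comparable across authors.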
RESULTS
A summary table for the 25 most prolific authors identified
appears in Table 1. Pubs represents the number of
publications included in the author space. VocSize
represents the number of indexed content-bearing terms or
vocabulary size for an author. Density is defined as in
equation (1). Ranks are included in parentheses.
There is a significant negative correlation (Kendall’s tau =
-0.466, p < 0.01) between productivity and author space
density, where more prolific authors are more likely to have
a less dense author space (Figure 1). There is also a
significant positive correlation between vocabulary size and
number of publications (Kendall’s tau = 0.628, p < 0.01) and
a significant negative correlation between vocabulary size
and author space density (Kendall’s tau = -0.655, p < 0.01).
Among the 100 most prolific authors, the authors with the
10 lowest density values all appear in the top 25 list for
productivity, lending evidence that these authors have
contributed to a broader range of areas. This agrees with
our hypothesis; however, there are exceptions among
prolific authors. Of note, three prolific authors who appear
in the top 25 for productivity (Jansen, Vaughan, Meyer) also
have author space density values that rank in the top 25,
suggesting a more focused research agenda or a more
consistent use of vocabulary in their research output. In
addition, several authors (Schubert, Burrell, Cronin)
demonstrate that a large number of publications and a
relatively low author space density need not translate to an
equally large indexed vocabulary.
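For readers who wish to reproduce this kind of rank correlation, a minimal pure-Python sketch of Kendall's tau follows. It assumes the tau-b variant (which corrects for ties); the paper does not state which variant was used. The function name `kendall_tau_b` is hypothetical. The input is the 25 rows of Table 1 only, so the result will differ from the values reported above for all 100 authors.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom > 0 else float("nan")

# Publication counts and densities for the 25 authors listed in Table 1.
pubs = [97, 74, 61, 61, 60, 56, 48, 48, 40, 34, 31, 30, 30,
        27, 27, 26, 24, 24, 24, 24, 23, 23, 22, 21, 21]
density = [0.318, 0.287, 0.310, 0.244, 0.328, 0.291, 0.252, 0.378, 0.322,
           0.384, 0.377, 0.386, 0.392, 0.438, 0.290, 0.339, 0.291, 0.384,
           0.308, 0.322, 0.391, 0.288, 0.440, 0.481, 0.373]

print(kendall_tau_b(pubs, density))  # negative: more publications, lower density
```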
Figure 1. Scatter plot for author productivity and
author space density (x-axis: # of Publications; y-axis:
Density)
DISCUSSION & CONCLUSION
This study has demonstrated how one may use vector space
modeling, classically applied in information retrieval, to
study scholarly communication by comparing the breadth of
prolific authors’ contributions to a topic area. The density of
a given author space should not be considered a value
judgment for a given author’s contributions. It is a
reflection of the language use in a given author’s research,
which can serve as a means to identify the breadth of topics
he or she has undertaken. Furthermore, the density value
should not be considered in absolute terms, but instead as a
comparative measure. Density outcomes will depend on the
documents and authors included in the topic space, which
influence the characteristics of the author space. In the case
of the current study, the data may not reflect the full oeuvre
(body of work) of the identified authors because only eight
journals were used. The developed approach can also be
applied to explore how authors’ breadth of language use in
their research changes over time. This aspect may be
studied in a future investigation.
REFERENCES
Braam, R.R., Moed, H.F., & van Raan, A.F.J. (1991a).
Mapping of science by combined co-citation and word
analysis. 1. Structural aspects. Journal of the American
Society for Information Science, 42(4), 233-251.
Braam, R.R., Moed, H.F., & van Raan, A.F.J. (1991b).
Mapping of science by combined co-citation and word
analysis. 2. Dynamical aspects. Journal of the American
Society for Information Science, 42(4), 252-266.
Janssens, F., Leta, J., Glänzel, W., & De Moor, B. (2006).
Towards mapping library and information science.
Information Processing & Management, 42, 1614-1642.
Salton, G., & McGill, M. J. (1983). Introduction to modern
information retrieval. New York: McGraw-Hill.
Wolfram, D., & Zhang, J. (2008). The influence of indexing
practices and term weighting algorithms on document
spaces. Journal of the American Society for Information
Science and Technology, 59(1), 3-11.