
Roy B. Clariana
Pennsylvania State University
Patricia E. Wallace
The College of New Jersey
Veronica M. Godshalk
University of South Carolina, Beaufort
Essays are an important measure of complex learning but pronouns in text can confound the author’s intended meaning.
Our interest here is in automatic essay scoring. How do pronouns affect computer-based text analysis? Participants in an
undergraduate business course (N = 49, but 4 with missing data, final sample N = 45) completed an essay as part of the
course final examination and investigators manually edited every occurrence of pronouns in these essays to their
antecedents. The original unedited and the edited essays were processed by ALA-Reader software using linear aggregate
and sentence aggregate methods. These data were then analyzed using a Pathfinder network (PFNET) approach. The
sentence aggregate approach obtained substantially different PFNET representations of the unedited and edited essays;
the presence of pronouns negatively impacted the quality of sentence aggregate data. However, there was little difference
between the PFNETs obtained for the unedited and edited essays for the linear aggregate method. The linear aggregate
method appears to be relatively robust to pronoun confounding at least for the narrow purposes of establishing group
knowledge structure and for expert referent pattern matching for determining individual essay scores.
Acquisition of expertise, Computer-based text analysis
Understanding and measuring the acquisition of expertise and the progress of learning in complex domains is
an important issue for instructional designers, teachers, and researchers. A common and reasonable way to
represent and measure this change from novice to expert is by comparing an individual’s or a group’s mental
model before and after instruction to an expert’s mental model (Seel, 1999). With effective learning, the
individual and the group become more like the expert. A number of recent technology-based approaches for
measuring aspects of individual and group mental models are under development (Johnson et al, 2006), such
as Analysis of Constructed Shared Mental Model (ACSMM), Surface, Matching, and Deep Structure (SMD),
and Model Inspection Trace of Concepts and Relations (MITOCAR). This investigation considers one
method called Analysis of Lexical Aggregates (ALA-Reader software) that uses students’ essays as the data
source for deriving individual and group knowledge structures (Clariana and Wallace, 2007).
Following Anderson (1983), we view structure as the essence of knowledge (p. 5). We consider neural
networks as biologically plausible representations of knowledge structure (Elman, 1993; Rumelhart and
McClelland, 1986). In our connectionist view, knowledge structure is a precursor of meaning and as such is
the underpinning of knowledge (Goldsmith and Johnson, 1990; Goldsmith et al, 1991). Specifically, an
individual’s knowledge structure is the expression of their language association neural network. Unique in
each individual, this network is their personal lexicon (i.e., as a multi-dimensional mental representation) that
is at the most fundamental level of memory (but pre-meaningful). We hold that measures of knowledge
structure are primarily measures of this association network (Deese, 1965) but are not necessarily measures
of semantic meaning (e.g., word association approaches), although some do also measure meaning (e.g.,
concept maps). So ALA-Reader measures of structural knowledge are not precisely measures of an
individual’s mental model, but we hold that they represent a critical aspect of that mental model and
so can serve as a placeholder for it.
This experimental investigation considers the likely negative effects of pronouns on the quality of ALA-Reader
measures of knowledge structure. Pronouns in text present a substantive problem for text processing
software because pronouns carry meaning forward and backward in the text, but often must be treated as stop
words because most software is not ‘clever’ enough to know what antecedents the pronouns refer to.
For example, here is a passage from a student’s essay in the present investigation: “Total Quality
Management encompasses all of these factors, with an emphasis on quality leadership. It allows for
employee-employer relations as well as getting the task at hand accomplished. Also, it includes contingency
plans if things go haywire.” The pronoun ‘it’ in succeeding sentences is an example of anaphoric reference, a
linguistic unit that refers to another nearby linguistic unit. ‘It’ probably refers to ‘Total Quality Management’
in the first sentence although it could also refer to ‘quality leadership’. How critical are pronouns to the
quality of the text analysis output? This investigation uses human readers to edit all pronouns in students’
essays to the appropriate antecedent in order to compare the original unedited set of essays (with pronouns)
to their equivalent edited set of essays without pronouns. By comparing the unedited essay data output to the
edited essay data output, we can consider the relative effects of pronouns on the quality of text processing
output of ALA-Reader software.
The analysis of lexical aggregate approach used in this investigation utilizes a concept map scoring approach
described by Taricani and Clariana (2006) but applies it to text passages. Essentially the software converts
the text passage into a network map data representation that can then be further analyzed by various other
tools. Their approach is based on previous research on free association norms (Deese, 1965), on structural
analyses of text propositions (Frase, 1969), on the matrix model of memory (Humphreys et al, 1989; Pike,
1984), on Kintsch’s discourse processing model (e.g., Kintsch, 2002), and on current knowledge
representation and neural network views of cognition (e.g., McClelland et al, 1995).
ALA-Reader translates a text passage into a proximity data file (e.g., the lower triangle of an n*n array
containing n(n+1)/2 elements). This proximity data file can be analyzed by Knowledge Network Organizing
Tool software (KNOT, Schvaneveldt, 1990) to convert it into a Pathfinder network (PFNET) representation of
that text. Based on graph theory, KNOT uses an algorithm to determine a least-weighted path that links all of
the terms. The resulting PFNET is based on a data reduction approach that is purported to represent the most
salient relationships in the raw proximity data.
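The data reduction step can be illustrated with a minimal sketch. It assumes the commonly used Pathfinder parameters r = infinity and q = n - 1, and it assumes the proximity data have already been expressed as distances (smaller means more closely related); neither assumption is stated for KNOT in this paper, and the four-term matrix is hypothetical. Under these assumptions a direct link survives only if no indirect path connects the same two terms with a smaller largest step.

```python
from itertools import combinations

INF = float("inf")

def pathfinder_links(dist):
    """Links of PFNET(r=inf, q=n-1) for a symmetric distance matrix (list of lists)."""
    n = len(dist)
    # minimax[i][j]: the smallest achievable "largest single step" on any path from i to j
    minimax = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i][k], minimax[k][j])
                if via_k < minimax[i][j]:
                    minimax[i][j] = via_k
    # Keep a direct link only when no indirect path beats it
    return {(i, j) for i, j in combinations(range(n), 2)
            if dist[i][j] < INF and dist[i][j] <= minimax[i][j]}

# Hypothetical distances among four key terms (illustrative values only)
terms = ["management", "employee", "quality", "leadership"]
dist = [
    [0, 1, 2, 4],
    [1, 0, 3, 2],
    [2, 3, 0, 5],
    [4, 2, 5, 0],
]
links = pathfinder_links(dist)
print(sorted((terms[i], terms[j]) for i, j in links))
```

In this toy matrix three of the six possible links are retained; the other three are dropped because a path through intermediate terms has a smaller largest step.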
The analysis method is described in detail in Clariana and Wallace (2007) but in brief, ALA-Reader
aggregates researcher selected key terms from the text at the sentence level (Shavelson, 1974) and also
linearly across sentences (Clariana and Wallace, 2007). Then the aggregate data can be analyzed by various
approaches, for example by multi-dimensional scaling, cluster analysis, and in this case, by Pathfinder
network scaling approaches. For example, the key-term rich passage below from a participant’s essay can be
represented as a linear aggregate PFNET (left panel of Figure 1) or as a sentence aggregate PFNET (right
panel of Figure 1). The key terms in the sample passage are underlined and pronouns are shown in italics.
Sample passage: “Humanists believed that job satisfaction was related to productivity. They found that
if employees were given more freedom and power in their jobs, then they produced more.”
Figure 1. PFNET representations (force-directed graphs) of the passage sample using the linear aggregate approach (left
panel) and the sentence aggregate approach (right panel).
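The two aggregation approaches can be sketched as follows. This is an illustration only, not the ALA-Reader implementation: it assumes that the sentence aggregate approach links every pair of key terms that share a sentence and that the linear aggregate approach links each key term to the next key term in reading order, across sentence boundaries. The word-to-term mapping in `KEY_TERMS` is hypothetical, chosen for the sample passage.

```python
import re
from itertools import combinations

# Hypothetical mapping of surface words to key terms (illustrative only)
KEY_TERMS = {
    "humanists": "humanistic",
    "job": "work",
    "jobs": "work",
    "employees": "employee",
    "productivity": "product",
}

def key_terms_by_sentence(text):
    """Split text into sentences and list the key terms found in each, in order."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [[KEY_TERMS[w] for w in re.findall(r"[a-z]+", s.lower()) if w in KEY_TERMS]
            for s in sentences]

def sentence_aggregate(text):
    """Link every pair of distinct key terms that occur in the same sentence."""
    links = set()
    for found in key_terms_by_sentence(text):
        links.update(combinations(sorted(set(found)), 2))
    return links

def linear_aggregate(text):
    """Link each key term to the next key term in reading order, across sentences."""
    sequence = [t for found in key_terms_by_sentence(text) for t in found]
    return {tuple(sorted(pair)) for pair in zip(sequence, sequence[1:])
            if pair[0] != pair[1]}

passage = ("Humanists believed that job satisfaction was related to productivity. "
           "They found that if employees were given more freedom and power in "
           "their jobs, then they produced more.")

sentence_links = sentence_aggregate(passage)
linear_links = linear_aggregate(passage)
```

Under these assumptions the linear sketch links 'product' to 'employee' across the sentence boundary, while the sentence sketch links 'humanistic' to 'product' within the first sentence; each yields a link the other does not.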
In this case and in most cases, the sentence aggregate approach tends to over-specify pairwise associations
relative to the linear aggregate approach when applied to key-term rich sentences that are indicative of expert
responses but tends to relatively under-specify associations when key terms in sentences are sparse (i.e., poor
or novice responses) and also when key terms are ‘disguised’ as anaphoric references (i.e., pronouns).
If the pronouns ‘they’ and ‘their’ in the sample passage above are replaced with their antecedents, a
different sentence aggregate PFNET will be obtained (see Figure 2). Compare these PFNETs of edited essays
in Figure 2 below (pronouns replaced with referents) with the PFNETs of original unedited essays in Figure 1
above to note the effects of pronouns on the PFNET representations of this sample text passage. Replacing
pronouns with their antecedents in this key-term rich sample text passage tends to increase the number of
associations in the PFNET representations.
Figure 2. Linear aggregate (left panel) and sentence aggregate (right panel) PFNETs of the sample passage
with the pronoun referents of ‘they’ and ‘their’ added.
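The effect of pronoun replacement on sentence-level associations can be sketched with hand-coded key-term lists. The coding below is hypothetical: it assumes that 'They' refers to the humanists and that 'their' and 'they' refer to the employees, and it assumes that every pair of key terms sharing a sentence is linked.

```python
from itertools import combinations

def sentence_links(sentences):
    """All unordered pairs of distinct key terms that share a sentence."""
    links = set()
    for found in sentences:
        links.update(combinations(sorted(set(found)), 2))
    return links

# Key terms per sentence, hand-coded for illustration (hypothetical coding)
unedited = [
    ["humanistic", "work", "product"],   # sentence 1
    ["employee", "work"],                # sentence 2: pronouns carry no key term
]
edited = [
    ["humanistic", "work", "product"],
    ["humanistic", "employee", "employee", "work", "employee"],  # pronouns resolved
]

added = sentence_links(edited) - sentence_links(unedited)
print(added)
```

With this coding, resolving the pronouns adds a 'humanistic' to 'employee' association that the unedited second sentence cannot supply.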
Although the sentence aggregate approach may relatively over-specify keyword pairwise associations,
this may not be an issue for the purposes of generating group knowledge representations or for scoring
individual essays because the Pathfinder network scaling approach is a data reduction approach that seeks the
most salient information in proximity data (Cooke, 1992; Johnson et al., 1994).
The Pathfinder network approach is a well established method for measuring knowledge structure
(Jonassen et al., 1993) that has been applied to predict course performance, compare individuals to groups,
represent group consensus, predict combat pilot performance, and to compare naïve, novice, intermediate,
and expert computer programmers (Villachica, 2000). Also Pathfinder KNOT software can be used to
average together multiple proximity data files in order to establish a single ‘group’ PFNET representation of
those essays. When multiple files are combined into one file, idiosyncratic spurious and error association
frequencies are less than the frequencies of correct associations, and so error and spurious associations drop
out of the averaged group representation, unless the error association is a common misconception. Common
misconceptions will tend to be included in the group average PFNET representation. But because spurious
over-specification in an individual’s proximity data by ALA-Reader drops out during the averaging process,
over-specification is less of a problem for averaged group representation purposes, although it
may be a problem when representing an individual student’s knowledge structure.
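The averaging argument can be illustrated with a small sketch. It assumes a plain element-wise arithmetic mean of equally sized proximity matrices; how KNOT combines files internally is not specified here, and the three student matrices are hypothetical.

```python
def average_proximity(files):
    """Element-wise mean of equally sized proximity matrices (lists of lists)."""
    n = len(files[0])
    count = len(files)
    return [[sum(f[i][j] for f in files) / count for j in range(n)]
            for i in range(n)]

# Three hypothetical students, three terms; 1 means the pair was associated in the essay
student_a = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
student_b = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
student_c = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # one idiosyncratic link between terms 0 and 2

group = average_proximity([student_a, student_b, student_c])
print(group[0][1], group[0][2])
```

The association shared by all three students keeps its full weight in the group matrix, while the association made by only one student is diluted to one third and so is less likely to survive the subsequent Pathfinder reduction.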
When using ALA-Reader and KNOT for generating an individual student’s essay score relative to an
expert referent, of the two types of PFNET comparison measures, ‘links in common’ have been shown to be
a better predictor of rater essay scores than has the ‘similarity’ measure (Taricani and Clariana, 2006). This
may be because errors and spurious associations count in similarity scores but do not count in common
scores. Only associations that match the expert are counted towards the links in common score and so
spurious and incorrect associations are disregarded. Thus over-specification of a student’s PFNET pairwise
associations by ALA-Reader is only a small problem when the PFNET common score is used for essay
scoring, and this may account for why common scores are superior to similarity scores for scoring essays while
similarity scores are generally better than common scores for other purposes. As a side note,
over-specification is potentially a big problem for the expert referent essay used as the baseline to score the
students’ essays. To handle this problem, it is best to meticulously design the expert referent as an ideal
PFNET rather than just convert an expert essay into a PFNET, thus avoiding the effects of unintended
spurious and error associations in the expert referent essay data set.
We pointed out above that the density of key terms per sentence has a profound effect on the
resulting PFNETs. Novice or poor essays are therefore more likely to be under-specified and thus negatively impacted
by the presence of many pronouns relative to the same essays that have been edited to include the pronoun
referents; pronouns should not have as great an effect on good essays that have plenty of key terms to
overcome any negative effects. To consider this possibility, the first analysis consists of
comparisons of the average group representations of the original unedited and the edited essays of the 15 bottom
performing students to those of the 15 top performing students. In the second analysis, individual students’
unedited and edited essays are processed by the ALA-Reader software with KNOT analysis and those scores
are compared to three human rater essay scores to consider the effects of pronouns on the ALA-Reader
scoring method.
2.1 Method
Participants were undergraduate students (N = 49, but 4 with missing data, final sample N = 45) enrolled in
several sections of a required business course in an eastern university in the USA. Students completed an
essay as part of the course final examination. The essay prompt stated,
“Describe and contrast in an essay of 300 words or less the following management theories:
Classical/Scientific Management, Humanistic/Human Resources, Contingency, and Total Quality
Management. Please use the terms below in your essay: administrative principles, benchmarking,
bureaucratic organizations, contingency, continuous improvement, customers, customer focus, efficiency,
employee, empowerment, feelings, Hawthorne studies, human relations, humanistic, leadership,
management (i.e., bosses), Management by Objectives, motivate, needs, organization (i.e., corporation),
plan, product, quality, relationship, scientific management (classical), service, situation (or
environment), TQM, and work (or job, task).”
These 29 terms above (plus their synonyms and metonyms) were the key terms used by the ALA-Reader
software during text processing.
The 45 student essays contained 14,761 total words (1,798 unique words) which is an average of 301
words per essay (range from 170 to 479 words, standard deviation of 70.1). The fifteen most common words
account for 31% of all of the text (4,526 words) and include in order the (929 occurrences), and (500), of
(491), to (470), management (358), a (289), is (273), in (249), that (221), employees (211), this (170), on
(167), are (139), as (137), and their (133). The three most common pronouns are their (ranked 15th with 133
occurrences), it (18th with 128 occurrences), and they (19th with 127 occurrences). Students’ use of ‘their’ and
‘they’ in their essays indicates that they were writing from the perspective of either a manager or of an
employee and not always as an independent observer.
2.2 Comparing group average data representations
ALA-Reader software was used to process the unedited essays (with pronouns present) and the edited
essays (pronouns replaced with their antecedents) using both a linear aggregate approach and a sentence
aggregate approach. To identify the best and worst essays for grouping purposes, human rater essay scores
for these 45 essays were used to rank the student essays; then Pathfinder KNOT software was used to average
together the proximity files of the 15 top performing and the 15 bottom performing students for each of the
four data sets. Using “T” for top group and “B” for the bottom group, “L” for linear analysis and “S” for
sentence analysis, and “U” for unedited essays and “E” for edited essays, the eight data sets are:
- TLU and BLU: top group and bottom group, linear aggregate, unedited
- TLE and BLE: top group and bottom group, linear aggregate, edited
- TSU and BSU: top group and bottom group, sentence aggregate, unedited
- TSE and BSE: top group and bottom group, sentence aggregate, edited
The correlations between these eight group average proximity raw data files and also the PFNET similarity
scores (calculated as PFNET intersection divided by PFNET union) are shown below in Table 1.
Table 1. Pearson correlations (above the diagonal) and PFNET similarity data (below the diagonal) of the average
proximity raw data for the top group (n = 15) and the bottom group (n = 15) with either linear or
sentence aggregate approaches for both unedited essays and edited essays.
There are very strong correlations between the proximity raw data (values above the diagonal in Table 1)
within the top group and within the bottom group, but not across top and bottom groups. Editing the
pronouns to their referents (in Table 1, compare U ‘unedited’ to E ‘edited’) had almost no effect on the top
group or the bottom group proximity data (see the four underlined values above the diagonal, all r > .97).
PFNET similarity scores however were impacted by editing the pronouns to their referents (values in
italics below the diagonal in Table 1), but less so for the linear aggregate approach (TLU to TLE r = .81 and
BLU to BLE r = .83) than for the sentence aggregate approach (TSU to TSE r = .78 and BSU to BSE r =
.71). Even so, these are remarkably similar PFNETs with many links in common. For example, the TLU
PFNET contains 31 links and the TLE PFNET contains 34 links and these two PFNETs share 29 links in
common. The graphical representations of these two PFNETs as force-directed graphs are also visually quite
similar (see Figure 3).
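The reported link counts can be checked against the similarity definition (intersection divided by union) with a short sketch. The link labels below are placeholders sized to match the reported counts, not the actual links in these PFNETs.

```python
def links_in_common(a, b):
    """Number of links that appear in both PFNETs."""
    return len(a & b)

def pfnet_similarity(a, b):
    """Intersection divided by union of the two link sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Placeholder link labels sized to match the reported counts
shared = {("shared", i) for i in range(29)}
tlu = shared | {("tlu_only", i) for i in range(2)}   # 31 links
tle = shared | {("tle_only", i) for i in range(5)}   # 34 links

common = links_in_common(tlu, tle)
similarity = pfnet_similarity(tlu, tle)
print(common, round(similarity, 2))
```

With 29 shared links and a union of 31 + 34 - 29 = 36 links, the similarity is 29/36, or about .81, which agrees with the TLU to TLE value reported above.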
Figure 3. Linear aggregate PFNET group knowledge representation of the top 15 student essays (TLU, unedited essays –
left panel and TLE, essays edited to replace pronoun referents – right panel).
It is fascinating to us to compare the PFNET visual depictions of the averaged essays. For example, the
terms ‘management’, ‘employee’, ‘organization’, ‘customer’, and ‘product’ are well connected and so are
central terms in both representations. Three of the four super-ordinate management theory categories in the
essay writing prompt, ‘scientific management’, ‘contingency’, and ‘TQM’, were all associated with
‘management’ while the fourth category, ‘humanistic’, was associated with both ‘management’ and
‘employee’. Emotion-related terms such as ‘feelings’, ‘needs’, ‘relationship’, and ‘motivation’ were all
associated with ‘employee’ in both PFNET representations as were the action words ‘benchmarking’, ‘work’,
and ‘product’.
However, comparing the term ‘leadership’ in the left and right panel of Figure 3, in the PFNET of the
edited essays (right panel of Figure 3) the concept of ‘leadership’ is relatively more connected and so takes a
relatively more central position. In contrast, in the PFNET of the unedited essays (left panel of Figure 3), the
term ‘leadership’ just hangs off of the term ‘need’. Be sure to note that these differences in structure for these
two PFNETs in Figure 3 relate to how students used pronouns, probably ‘it’ to refer to ‘leadership’, in their
essay passages, and not necessarily to their views of the centrality of leadership in these theories that are
discussed. In fact, the edited essay PFNET representation (right panel) shows that ‘leadership’ is a central
aspect of their views.
2.3 Comparing individual data representations
ALA-Reader software was used to process the unedited and edited essays using both a linear aggregate
approach and a sentence aggregate approach. This produced four separate data sets each containing 45
proximity files, one for each essay. All 45 of the proximity files in these four separate data sets (linear
aggregate unedited, linear aggregate edited, sentence aggregate unedited, and sentence aggregate edited)
were transformed into PFNETs by KNOT software and then all PFNETs were compared to an expert referent
PFNET to obtain scores that consisted of the number of links in common between the student and the referent.
To establish comparison benchmark scores, the three human essay scores for each essay were converted
to a single factor score using the SPSS version 15.0 factor analysis regression option. Pearson correlations
were conducted for each of the four data sets between this human benchmark factor score and the ALA-Reader
essay scores. The linear aggregate method obtained better correlations with the human raters than did
the sentence aggregate approach (see Table 2), although the correlations are not very large. Replacing
pronouns with their referents had almost no effect on linear aggregate scores but had a strong positive effect
on sentence aggregate scores. This suggests that sentence-level analysis approaches are more negatively
impacted by the presence of pronouns and other anaphoric references relative to approaches that analyze
across rather than within sentences.
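The comparison of software scores to a human benchmark can be sketched as below. Two things here are assumptions: the mean of standardized rater scores is used as a simple stand-in for the SPSS regression factor score (it is not the same computation), and all scores shown are hypothetical rather than taken from this study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def z_scores(values):
    """Standardize values using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def composite(*rater_scores):
    """Mean z-score per essay; a stand-in for a factor score, not the SPSS method."""
    standardized = [z_scores(r) for r in rater_scores]
    return [sum(column) / len(column) for column in zip(*standardized)]

# Hypothetical scores for five essays (illustrative values only)
rater_1 = [6, 7, 5, 9, 8]
rater_2 = [5, 7, 6, 9, 7]
rater_3 = [6, 8, 5, 8, 9]
links_in_common = [10, 12, 9, 15, 13]

benchmark = composite(rater_1, rater_2, rater_3)
r = pearson_r(benchmark, links_in_common)
```

The same `pearson_r` call would be repeated for each of the four data sets to fill a table like Table 2.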
Table 2. Pearson correlations between the three-rater human factor score and the four ALA-Reader essay scores.
Analysis Approach
3 raters
linear aggregate unedited
linear aggregate edited
sentence aggregate unedited
sentence aggregate edited
* p < .05, ** p < .01
Post hoc analysis showed a moderate correlation between the human rater essay scores and essay length,
but this was less so for ALA-Reader essay scores. The biasing effect of essay length is a construct-irrelevant
influence on human essay scores that has been regularly observed (Attali, 2007; Powers, 2005), although it
seems reasonable that better students will write longer essays. Would adjusting ALA-Reader scores for essay
length improve the relative quality of the scores? The SPSS factor score regression method was again used to
combine the ALA-Reader linear aggregate unedited scores with the essay length variable to obtain length
adjusted ALA-Reader scores (see LU+ in Table 3 below). For comparison purposes, a benchmark “universal”
essay score was established using the SPSS factor score combination of the three human essay scores plus the
linear aggregate unedited essay score. Correlations between all of these scoring approaches are shown in
Table 3. The unadjusted LU score (r = 0.68) was not as related to the universal benchmark as the human rater
scores but the length adjusted LU score (LU+) was almost as good as the rater scores for estimating the
universal essay score (r = 0.84).
Table 3. Post hoc analysis using Pearson correlation comparing the human raters,
selected ALA-Reader essay scores, final examination score, and essay length.
Variables: Universal essay score; Rater 1 (the course instructor); Rater 2 (investigator);
Rater 3 (graduate assistant); Linear aggregate score (LU); Length adjusted LU score (LU+);
Final examination score; Essay length.
All are significant at the p < .05 level or better.
The ALA-Reader sentence aggregate approach obtained substantially different PFNET representations for
unedited compared to edited essays; thus pronouns in the text passages strongly affected the sentence
aggregate method results and based on the analyses of individual essay scores, pronouns in text passages
have a negative effect on sentence aggregate PFNET representations. On the other hand, there was little
difference between the PFNETs obtained for edited and unedited essays using the linear aggregate method.
The analysis of the group knowledge representations and the individuals’ essay scores indicate that the linear
aggregation approach is relatively robust to pronoun interference and is generally superior to the sentence
aggregate method for the narrow purposes of average group knowledge representation and for scoring
individual essays.
Human essays are grammatically imperfect and sometimes incoherent; our experience gained from
editing the pronouns in these 45 essays is that even humans can’t always agree on a pronoun’s antecedent.
Pronouns in text can be a substantial issue for text analysis software, and routines for handling pronouns are
expensive to develop and generally are not perfectly accurate; thus these routines may add more error to the
data. Although this is a small sample (i.e., 45 well-constrained essays), these encouraging results suggest that
the ALA-Reader linear aggregation method is robust to pronoun interference and so would not benefit
much from a pronoun handling subroutine; the development cost of a pronoun handler is therefore not warranted.
Perhaps these results regarding the effects of anaphoric reference may generalize to other text analysis
software that analyzes in a linear way across sentences, although the results here indicate that a pronoun
handling subroutine would benefit text analysis software that analyzes only within sentences.
It is important to note that many concept map and essay scoring approaches depend on the analysis of
propositions because propositions are held to be the lowest atomistic level of meaning. Propositions in essays
are subject-verb-object and in concept maps are concept-link-concept, and these are essentially sentences,
and so a majority of scoring approaches should be classified as within sentence approaches. But we hold that
propositions are formed on the fly from the structural knowledge lexicon. And inferentially, the results of this
investigation favoring linear over sentence analysis provide some support for our notion that structural knowledge
is at a pre-meaning level, not at the proposition level.
One more note: it is fairly common to use human rater scores when determining the concurrent
criterion-related validity of text processing output. So the likely biasing effect of essay length on human
scores must be considered when conducting any type of text analysis studies involving human raters (Hoyt,
2000); otherwise the software will acquire the humans’ bias. In this case, it is simple to add an essay length
scoring adjustment to ALA-Reader essay scores so that the adjusted scores will have a stronger correlation
with human rater scores, but it seems critical to honestly report both the unadjusted essay score as well as the
length adjusted essay score so that users and researchers will be mindful of possible human bias built in to
the software.
