DERIVING AND MEASURING GROUP KNOWLEDGE STRUCTURE VIA COMPUTER-BASED ANALYSIS OF ESSAY QUESTIONS: THE EFFECTS OF CONTROLLING ANAPHORIC REFERENCE Roy B. Clariana Pennsylvania State University USA RClariana@psu.edu Patricia E. Wallace The College of New Jersey USA PWallace@tcnj.edu Veronica M. Godshalk University of South Carolina, Beaufort USA Godshalk@gwm.sc.edu ABSTRACT Essays are an important measure of complex learning but pronouns in text can confound the author’s intended meaning. Our interest here is in automatic essay scoring. How do pronouns affect computer-based text analysis? Participants in an undergraduate business course (N = 49, but 4 with missing data, final sample N = 45) completed an essay as part of the course final examination and investigators manually edited every occurrence of pronouns in these essays to their antecedents. The original unedited and the edited essays were processed by ALA-Reader software using linear aggregate and sentence aggregate methods. These data were then analyzed using a Pathfinder network (PFNET) approach. The sentence aggregate approach obtained substantially different PFNET representations of the unedited and edited essays; the presence of pronouns negatively impacted the quality of sentence aggregate data. However, there was little difference between the PFNETs obtained for the unedited and edited essays for the linear aggregate method. The linear aggregate method appears to be relatively robust to pronoun confounding at least for the narrow purposes of establishing group knowledge structure and for expert referent pattern matching for determining individual essay scores. KEYWORDS Acquisition of expertise, Computer-based text analysis 1. INTRODUCTION Understanding and measuring the acquisition of expertise and the progress of learning in complex domains is an important issue for instructional designers, teachers, and researchers. A common and reasonable way to represent and measure this change from novice to expert is by comparing an individual’s or a group’s mental model before and after instruction to an expert’s mental model (Seel, 1999). With effective learning, the individual and the group become more like the expert. A number of recent technology-based approaches for measuring aspects of individual and group mental models are under development (Johnson et al, 2006), such as Analysis of Constructed Shared Mental Model (ACSMM), Surface, Matching, and Deep Structure (SMD), and Model Inspection Trace of Concepts and Relations (MITOCAR). This investigation considers one method called Analysis of Lexical Aggregates (ALA-Reader software) that uses students’ essays as the data source for deriving individual and group knowledge structures (Clariana and Wallace, 2007). Following Anderson (1983), we view structure as the essence of knowledge (p. 5). We consider neural networks as biologically plausible representations of knowledge structure (Elman, 1993; Rumelhart and McClelland, 1986). In our connectionist view, knowledge structure is a precursor of meaning and as such is the underpinning of knowledge (Goldsmith and Johnson, 1990; Goldsmith et al, 1991). Specifically, an individual’s knowledge structure is the expression of their language association neural network. Unique in each individual, this network is their personal lexicon (i.e., as a multi-dimensional mental representation) that is at the most fundamental level of memory (but pre-meaningful). We hold that measurers of knowledge structure are primarily measures of this association network (Deese, 1965) but are not necessarily measures of semantic meaning (e.g., word association approaches), although some do also measure meaning (e.g., concept maps). So ALA-Reader data measures of structural knowledge are not precisely measures of an individual’s mental model, but we hold that it is a representation of a critical aspect of their mental model and so can serve as a place holder for their mental model. This experimental investigation considers the likely negative effects of pronouns on the quality of ALAReader measures of knowledge structure. Pronouns in text present a substantive problem for text processing software because pronouns carry meaning forward and backward in the text, but often must be treated as stop words because most software is not ‘clever’ enough to know what antecedents the pronouns refer to. For example, here is a passage from a student’s essay in the present investigation: “Total Quality Management encompasses all of these factors, with an emphasis on quality leadership. It allows for employee-employer relations as well as getting the task at hand accomplished. Also, it includes contingency plans if things go haywire.” The pronoun ‘it’ in succeeding sentences is an example of anaphoric reference, a linguistic unit that refers to another nearby linguistic unit. ‘It’ probably refers to ‘Total Quality Management’ in the first sentence although it could also refer to ‘quality leadership’. How critical are pronouns to the quality of the text analysis output? This investigation uses human readers to edit all pronouns in students’ essays to the appropriate antecedent in order to compare the original unedited set of essays (with pronouns) to their equivalent edited set of essays without pronouns. By comparing the unedited essay data output to the edited essay data output, we can consider the relative effects of pronouns on the quality of text processing output of ALA-Reader software. 2. ALA-READER SOFTWARE The analysis of lexical aggregate approach used in this investigation utilizes a concept map scoring approach described by Taricani and Clariana (2006) but applies it to text passages. Essentially the software converts the text passage into a network map data representation that can then be further analyzed by various other tools. Their approach is based on previous research on free association norms (Deese, 1965), on structural analyses of text propositions (Frase, 1969), on the matrix model of memory (Humphreys et al, 1989; Pike, 1984), on Kintsch’s discourse processing model (e.g., Kintsch, 2002), and on current knowledge representation and neural network views of cognition (e.g., McClelland et al, 1995). ALA-Reader translates a text passage into a proximity data file (e.g., the lower triangle of an n*n array containing n(n+1)/2 elements). This proximity data file can be analyzed by Knowledge Network Organizing Tool software (KNOT, Schvaneveldt, 1990) to convert it into a Pathfinder network (PFNET) representation of that text. Based on graph theory, KNOT uses an algorithm to determine a least-weighted path that links all of the terms. The resulting PFNET is based on a data reduction approach that is purported to represent the most salient relationships in the raw proximity data. The analysis method is described in detail in Clariana and Wallace (2007) but in brief, ALA-Reader aggregates researcher selected key terms from the text at the sentence level (Shavelson, 1974) and also linearly across sentences (Clariana and Wallace, 2007). Then the aggregate data can be analyzed by various approaches, for example by multi-dimensional scaling, cluster analysis, and in this case, by Pathfinder network scaling approaches. For example, the key-term rich passage below from a participant’s essay can be represented as a linear aggregate PFNET (left panel of Figure 1) or as a sentence aggregate PFNET (right panel of figure 1). The key terms in the sample passage are underlined and pronouns are shown in italics. Sample passage: “Humanists believed that job satisfaction was related to productivity. They found that if employees were given more freedom and power in their jobs, then they produced more.” satisfaction humanist humanist job job empowered productivity satisfaction empowered productivity employee employee Figure 1. PFNET representations (force-directed graphs) of the passage sample using the linear aggregate approach (left panel) and the sentence aggregate approach (right panel). In this case and in most cases, the sentence aggregate approach tends to over specify pairwise associations relative to the linear aggregate approach when applied to key-term rich sentences that are indicative of expert responses but tends to relatively under specify associations when key terms in sentences are sparse (i.e., poor or novice responses) and also when key terms are ‘disguised’ as an anaphoric references (i.e., pronouns). If the pronouns ‘they’ and ‘their’ in the sample passage above are replaced with their antecedents, a different sentence aggregate PFNET will be obtained (see Figure 2). Compare these PFNETs of edited essays in Figure 2 below (pronouns replaced with referents) with the PFNETs of original unedited essays in Figure 1 above to note the effects of pronouns on the PFNET representations of this sample text passage. Replacing pronouns with their antecedents in this key-term rich sample text passage tends to increase the number of associations in the PFNET representations. satisfaction job humanist humanist satisfaction productivity empowered job employee empowered employee productivity Figure 2. Linear aggregate (left panel) and sentence aggregate (right panel) PFNETs of the sample passage with the pronoun referents of ‘they’ and ‘their’ added. Although the sentence aggregate approach may relatively over-specify keyword pair wise associations, this may not be an issue for the purposes of generating group knowledge representations or for scoring individual essays because the Pathfinder network scaling approach is a data reduction approach that seeks the most salient information in proximity data (Cooke, 1992; Johnson et al., 1994). The Pathfinder network approach is a well established method for measuring knowledge structure (Jonassen et al., 1993) that has been applied to predict course performance, compare individuals to groups, represent group consensus, predict combat pilot performance, and to compare naïve, novice, intermediate, and expert computer programmers (Villachica, 2000). Also Pathfinder KNOT software can be used to average together multiple proximity data files in order to establish a single ‘group’ PFNET representation of those essays. When multiple files are combined into one file, idiosyncratic spurious and error association frequencies are less than the frequencies of correct associations, and so error and spurious associations drop out of the averaged group representation, unless the error association is a common misconception. Common misconceptions will tend to be included in the group average PFNET representation. But because spurious over specification in an individual’s proximity data by ALA-Reader drops out during the averaging process, over specification is less of a problem when used for averaged group representation purposes; although it may be a problem when representing individual student’s knowledge structure. When using ALA-Reader and KNOT for generating an individual student’s essay score relative to an expert referent, of the two types of PFNET comparison measures, ‘links in common’ have been shown to be a better predictor of rater essay scores than has the ‘similarity’ measure (Taricani and Clariana, 2006). This may be because errors and spurious associations count in similarity scores but do not count in common scores. Only associations that match the expert are counted towards the links in common score and so spurious and incorrect associations are disregarded. Thus over-specification of a student’s PFNET pair wise associations by ALA-Reader is only a small problem when the PFNET common score is used for essay scoring, and this may account for why common scores are superior to similarity scores for scoring essays and similarity scores are generally better than common scores for other purposes. As a side note, overspecification is potentially a big problem for the expert referent essay used as the baseline to score the students’ essays. To handle this problem, it is best to meticulously design the expert referent as an ideal PFNET rather than just convert an expert essay into a PFNET, thus avoiding the effects of unintended spurious and error associations in the expert referent essay data set. We pointed out above that since the density of key terms per sentence has a profound effect on the resulting PFNETs, novice or poor essays are more likely to be under specified and thus negatively impacted by the presence of many pronouns relative to the same essays that have been edited to include the pronoun referents; and this should not have as great of an effect on good essays that have plenty of keywords to overcome any negative effects of pronouns. To consider this possibility, the first analysis consists of comparisons of the original unedited and the edited essays average group representations of the 15 bottom performing students to those of the 15 top performing students. In the second analysis, individual students’ unedited and edited essays are processed by the ALA-Reader software with KNOT analysis and those scores are compared to three human rater essay scores to consider the effects of pronouns on the ALA-Reader scoring method. 2.1 Method Participants were undergraduate students (N = 49, but 4 with missing data, final sample N = 45) enrolled in several sections of a required business course in an eastern university in the USA. Students completed an essay as part of the course final examination. The essay prompt stated, “Describe and contrast in an essay of 300 words or less the following management theories: Classical/Scientific Management, Humanistic/Human Resources, Contingency, and Total Quality Management. Please use the terms below in your essay: administrative principles, benchmarking, bureaucratic organizations, contingency, continuous improvement, customers, customer focus, efficiency, employee, empowerment, feelings, Hawthorne studies, human relations, humanistic, leadership, management (i.e., bosses), Management by Objectives, motivate, needs, organization (i.e., corporation), plan, product, quality, relationship, scientific management (classical), service, situation (or environment), TQM, and work (or job, task).” These 29 terms above (plus their synonyms and metonyms) were the key terms used by the ALA-Reader software during text processing. The 45 student essays contained 14,761 total words (1,798 unique words) which is an average of 301 words per essay (range from 170 to 479 words, standard deviation of 70.1). The fifteen most common words account for 31% of all of the text (4,526 words) and include in order the (929 occurrences), and (500), of (491), to (470), management (358), a (289), is (273), in (249), that (221), employees (211), this (170), on (167), are (139), as (137), and their (133). The three most common pronouns are their (ranked 15th with 133 occurrences), it (18th with 128 occurrences), and they (19th with 127 occurrences). Students use of ‘their’ and ‘they’ in their essays indicates that they were writing from the perspective of either a manager or of an employee and not always as an independent observer. 2.2 Comparing group average data representations ALA-Reader software was used to process the unedited essays (with pronouns present) and the edited essays (pronouns replaced with their antecedents) using both a linear aggregate approach and a sentence aggregate approach. To identify the best and worst essays for grouping purposes, human rater essay scores for these 45 essays were used to rank the student essays; then Pathfinder KNOT software was used to average together the proximity files of the 15 top performing and the 15 bottom performing students for each of the four data sets. Using “T” for top group and “B” for the bottom group, “L” for linear analysis and “S” for sentence analysis, and “U” for unedited essays and “E” for edited essays, the eight data sets are: TLU and BLU – top group and bottom group, linear aggregate, unedited TLE and BLE – top group and bottom group, linear aggregate, edited TSU and BSU – top group and bottom group, sentence aggregate, unedited TSE and BSE – top group and bottom group, sentence aggregate, edited The correlations between these eight group average proximity raw data files and also the PFNET similarity scores (calculated as PFNET intersection divided by PFNET union) are shown below in Table 1. Table 1. Pearson correlations (above the diagonal) and PFNET similarity data (below the diagonal) of the average proximity raw data for the top group (n =15) and the bottom group (n =15) with either linear or sentence aggregate approaches for both unedited essays and edited essays. TLU TLE TSU TSE TLU 1 0.81 0.56 0.49 TLE 0.99 1 0.52 0.46 TSU 0.98 0.97 1 0.78 TSE 0.99 0.98 0.99 1 BLU 0.70 0.66 0.61 0.64 BLE 0.79 0.75 0.73 0.74 BSU 0.70 0.67 0.66 0.71 BSE 0.69 0.73 0.67 0.74 BLU BLE BSU BSE 0.41 0.37 0.30 0.29 0.42 0.38 0.29 0.35 0.43 0.36 0.35 0.33 0.40 0.33 0.32 0.31 1 0.83 0.47 0.45 0.98 1 0.43 0.40 0.92 0.95 1 0.71 0.90 0.91 0.99 1 There are very strong correlations between the proximity raw data (values above the diagonal in Table 1) within the top group and within the bottom group, but not across top and bottom groups. Editing the pronouns to their referents (in Table 1, compare U ‘unedited’ to E ‘edited’) had almost no effect on the top group or the bottom group proximity data (see the four underlined values above the diagonal, all r >.97). PFNET similarity scores however were impacted by editing the pronouns to their referents (values in italics below the diagonal in Table 1), but less so for the linear aggregate approach (TLU to TLE r = .81 and BLU to BLE r = .83) than for the sentence aggregate approach (TSU to TSE r = .78 and BSU to BSE r = .71). Even so, these are remarkably similar PFNETS with many links in common. For example, the TLU PFNET contains 31 links and the TLE PFNET contains 34 links and these two PFNETs share 29 links in common. The graphical representations of these two PFNETs as force-directed graphs are also visually quite similar (see Figure 3). Figure 3. Linear aggregate PFNET group knowledge representation of the top 15 student essays (TLU, unedited essays – left panel and TLE, essays edited to replace pronoun referents – right panel). It is fascinating to us to compare the PFNET visual depictions of the averaged essays. For example, the terms ‘management’, ‘employee’, ‘organization’, ‘customer’, and ‘product’ are well connected and so are central terms in both representations. Three of the four super-ordinate management theory categories in the essay writing prompt, ‘scientific management’, ‘contingency’, and ‘TQM’, were all associated with ‘management’ while the fourth category, ‘humanistic’, was associated with both ‘management’ and ‘employee’. Emotion-related terms such as ‘feelings’, ‘needs’, ‘relationship’, and ‘motivation’ were all associated with ‘employee’ in both PFNET representations as were the action words ‘benchmarking’, ‘work’, and ‘product’. However, comparing the term ‘leadership’ in the left and right panel of Figure 3, in the PFNET of the edited essays (right panel of Figure 3) the concept of ‘leadership’ is relatively more connected and so takes a relatively more central position. In contrast, in the PFNET of the unedited essays (left panel of Figure 3), the term ‘leadership’ just hangs off of the term ‘need’. Be sure to note that these differences in structure for these two PFNETs in Figure 3 relate to how students used pronouns, probably ‘it’ to refer to ‘leadership’, in their essay passages, and not necessarily to their views of the centrality of leadership in these theories that are discussed. In fact, the edited essay PFNET representation (right panel) shows that ‘leadership’ is a central aspect of their views. 2.3 Comparing individual data representations ALA-Reader software was used to process the unedited and edited essays using both a linear aggregate approach and a sentence aggregate approach. This produced four separate data sets each containing 45 proximity files, one for each essay. All 45 of the proximity files in these four separate data sets (linear aggregate unedited, linear aggregate edited, sentence aggregate unedited, and sentence aggregate edited) were transformed into PFNETs by KNOT software and then all PFNETs were compared to an expert referent PFNET to obtain scores that consisted of the number of links in common between the student and the referent PFNET. To establish comparison benchmark scores, the three human essay scores for each essay were converted to a single factor score using the SPSS version 15.0 factor analysis regression option. Pearson correlations were conducted for each of the four data sets between this human benchmark factor score and the ALAReader essay scores. The linear aggregate method obtained better correlations with the human raters than did the sentence aggregate approach (see Table 2), although the correlations are not very large. Replacing pronouns with their referents had almost no effect on linear aggregate scores but had a strong positive effect on sentence aggregate scores. This suggests that sentence-level analysis approaches are more negatively impacted by the presence of pronouns and other anaphoric references relative to approaches that analyze across rather than within sentences. Table 2. Pearson correlations between the three human-rater factor score and the four ALA-Reader essay scores. Analysis Approach 3 raters linear aggregate unedited linear aggregate edited sentence aggregate unedited sentence aggregate edited * p <. 05, ** p < .01 0.58** 0.56** 0.34** 0.41** Post hoc analysis showed a moderate correlation between the human rater essay scores and essay length, but this was less so for ALA-Reader essay scores. The biasing effect of essay length is a construct-irrelevant influence on human essay scores that has been regularly observed (Attali, 2007; Powers, 2005), although it seems reasonable that better students will write longer essays. Would adjusting ALA-Reader scores for essay length improve the relative quality of the scores? The SPSS factor score regression method was again used to combine the ALA-Reader linear aggregate unedited scores with the essay length variable to obtain length adjusted ALA-Reader scores (see LU+ in Table 3 below). For comparison purposes, a benchmark “universal” essay score was established using the SPSS factor score combination of the three human essay scores plus the linear aggregate unedited essay score. Correlations between all of these scoring approaches are shown in Table 3. The unadjusted LU score (r = 0.68) was not as related to the universal benchmark as the human rater scores but the length adjusted LU score (LU+) was almost as good as the rater scores for estimating the universal essay score (r = 0.84). Table 3. Post hoc analysis using Pearson correlation comparing the human raters, selected ALA-Reader essay scores, final examination score, and essay length. Univ. Universal essay score 1 Rater 1 (the course instructor) 0.86 Rater 2 (investigator) 0.91 Rater 3 (graduate assistant) 0.89 Linear aggregate score (LU) 0.68 Length adjusted LU score (LU+) 0.84 Final examination score 0.53 Essay length 0.69 All are significant at the p<.05 level or better. Rater 1 0.86 1 0.69 0.71 0.43 0.62 0.51 0.58 Rater 2 0.91 0.69 1 0.76 0.60 0.72 0.42 0.57 Rater 3 0.89 0.71 0.76 1 0.56 0.62 0.48 0.45 LU 0.68 0.43 0.60 0.56 1 0.81 0.34 0.33 LU+ 0.84 0.62 0.72 0.62 0.81 1 0.44 0.81 Exam 0.53 0.51 0.42 0.48 0.34 0.44 1 0.37 length 0.69 0.58 0.57 0.45 0.33 0.81 0.37 1 3. CONCLUSION The ALA-Reader sentence aggregate approach obtained substantially different PFNET representations for unedited compared to edited essays; thus pronouns in the text passages strongly affected the sentence aggregate method results and based on the analyses of individual essay scores, pronouns in text passages have a negative effect on sentence aggregate PFNET representations. On the other hand, there was little difference between the PFNETs obtained for edited and unedited essays using the linear aggregate method. The analysis of the group knowledge representations and the individuals’ essay scores indicate that the linear aggregation approach is relatively robust to pronoun interference and is generally superior to the sentence aggregate method for the narrow purposes of average group knowledge representation and for scoring individual essays. Human essays are grammatically imperfect and sometimes incoherent; our experience gained from editing the pronouns in these 45 essays is that even humans can’t always agree on a pronoun’s antecedent. Pronouns in text can be a substantial issue for text analysis software and routines for handing pronouns are expensive to develop and generally are not perfectly accurate; thus these routines may add more error to the data. Although this is a small sample (i.e., 45 well-constrained essays), these encouraging results suggest that the ALA-Reader linear aggregation method is robust to pronoun interference, and so would NOT benefit much from a pronoun handling subroutine, thus the development cost of a pronoun handler is not warranted. Perhaps these results regarding the effects of anaphoric reference may generalize to other text analysis software that analyzes in a linear way across sentences, although the results here indicate that a pronoun handling subroutine would benefit text analysis software that analyzes only within sentences. It is important to note that many concept map and essay scoring approaches depend on the analysis of propositions because propositions are held to be the lowest atomistic level of meaning. Propositions in essays are subject-verb-object and in concept maps are concept-link-concept, and these are essentially sentences, and so a majority of scoring approaches should be classified as within sentence approaches. Bit we hold that propositions are formed on the fly from the structural knowledge lexicon. And inferentially, the results of this investigation for linear over sentence analysis provides some support for our notion that structural knowledge is at a pre-meaning level, not at the proposition level. And one more note, it is fairly common to use human rater scores when determining the concurrent criterion related validity of text processing output. So the likely biased effects of essay length on human scores must be considered when conducting any type of text analysis studies involving human raters (Hoyt, 2000); otherwise the software will acquire the humans’ bias. In this case, it is simple to add an essay length scoring adjustment to ALA-Reader essay scores so that the adjusted scores will have a stronger correlation with human rater scores, but it seems critical to honestly report both the unadjusted essay score as well as the length adjusted essay score so that users and researchers will be mindful of possible human bias built in to the software. REFERENCES Anderson, J.R., 1983. The architecture of cognition. Harvard University Press, Cambridge, Massachusetts, USA. Attali, Y., 2007. Construct Validity of e-rater in Scoring TOEFL Essays (ETS Research Memorandum No. RR-07-21). Educational Testing Service, Princeton, New Jersey, USA. Clariana, R.B., and Wallace, P.E., 2007. A computer-based approach for deriving and measuring individual and team knowledge structure from essay questions. Journal of Educational Computing Research, Vol. 37, No. 3, pp. 209-225. Cooke, N.M., 1992. Predicting judgment time from measures of psychological proximity. Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol. 18, No. 3, pp. 640-653. Deese, J., 1965. The structure of associations in language and thought. The Johns Hopkins Press, Baltimore, Maryland, USA. Elman, J. L., 1993. Learning and development in neural networks: The importance of starting small. Cognition, Vol. 48, pp. 71-99. Frase, L.T., 1969. Structural analysis of the knowledge that results from thinking about text. Journal of Educational Psychology, Vol. 60, No. 6, pp. 1-16. Goldsmith, T. E., and Johnson, P. J., 1990. A structural assessment of classroom learning. In Schvaneveldt, R.W. (Ed.), Pathfinder associative networks: studies in knowledge organization. Ablex Publishing Corporation Norwood, New Jersey, USA, pp. 241-253. Goldsmith, T. E., & Johnson, P. J., & Acton, W.H., 1991. Assessing knowledge structure. Journal of Educational Psychology, Vol. 83, No. 1, pp. 88-96. Hoyt, W.T., 2000. Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, Vol. 5, No. 1, pp. 64-86. Humphreys, M.S., Bain, J.D., and Pike, R., 1989. Different ways to cue a coherent memory system: A theory for episodic, semantic and procedural tasks. Psychological Review, Vol. 96, No. 2, pp. 208-233. Johnson, P.J., Goldsmith, T.E., and Teague, K.W., 1994. Locus of the predictive advantage in Pathfinder-based representations of classroom knowledge. Journal of Educational Psychology, Vol. 86, No. 4, pp. 617-626. Johnson, T.E., O’Connor, D.L., Pirnay-Dummer, P.N., Ifenthaler, D., Spector, J.M., and Seel, N., 2006. Comparative study of mental model research methods: relationships among ACSMM, SMD, MITOCAR and DEEP methodologies. Proceedings of the Second International Conference on Concept Mapping, San José, Costa Rica, pp. 177-184. Jonassen, D.H., Beissner, K., and Yacci, M. (1993). Structural knowledge: techniques for representing, conveying, and acquiring structural knowledge. Lawrence Erlbaum Associates, Hillsdale, New Jersey, USA. Kintsch, W.A., 2002. The role of knowledge in discourse comprehension: A construction-integration model. In Polk, T.A. and Seifert, C.M. (eds.), Cognitive Modeling. MIT Press, Boston, Massachusetts, USA, pp. 5-47. McClelland, J.L., McNaughton, B.L., and O'Reilly, R.C., 1995. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models. Psychological Review, Vol. 102, No. 3, pp. 419-457. Pike, R., 1984. A comparison of convolution and matrix distributed memory systems. Psychological Review, Vol. 91, No. 3, pp. 281-294. Powers, D. E., 2005. “Wordiness”: A selective review of its influence, and suggestions for investigating its relevance in tests requiring extended written responses (ETS Research Memorandum No. RM-04-08). Educational Testing Service, Princeton, New Jersey, USA. Rumelhart, D. and McClelland, J., 1986. Parallel distributed processing, Volumes I and II. MIT Press, Cambridge, Massachusetts, USA. Schvaneveldt, R.W., 1990. Pathfinder Associative Networks: Studies in Knowledge Organization. Ablex Publishing Corporation, Norwood, New Jersey, USA. Seel, N.M., 1999. Educational diagnosis of mental models: Assessment problems and technology-based solutions. Journal of Structural Learning and Intelligent Systems, Vol. 14, No. 2, pp. 153-185. Shavelson, R.J., 1974. Some methods for examining content structure and cognitive structure in instruction. Educational Psychologist, Vol. 11, No. 1, pp. 110-122. Taricani, E. M. and Clariana, R.B., 2006. A technique for automatically scoring open-ended concept maps. Educational Technology Research and Development, Vol. 54, No. 1, pp. 61-78. Villachica, S.W. (2000). An investigation of the stability of Pathfinder-related measures. Doctoral dissertation, University of Northern Colorado. Dissertation Abstracts International, Vol. 60, No. 12, pp. A4393.