Anthony Cocciolo

Experimenting with Latent Semantic Analysis (LSA)

Abstract: The aim of this paper is to test both the practical use value and the psychological underpinnings of Latent Semantic Analysis (LSA), a statistical theory and method for extracting and representing the contextual meaning of words. To test the practical use value, we will use LSA to analyze a large corpus of articles within a particular discourse and ask: can LSA decide which category each article belongs in? Is LSA able to categorize as well as a human editor? To test LSA’s ability to simulate psychological processes, we will experiment with Kintsch, Patel, and Ericsson’s (1999) hypothesis that the semantic space created by LSA is similar to an expert’s long-term working memory (LTWM).

Introduction to Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA), introduced by Landauer, Foltz, and Laham (1998), is a statistical theory and method for extracting and representing the contextual meaning of words. This form of analysis is most useful in finding which texts are most related within a large corpus. For example, LSA is used by websites that produce a large quantity of content (e.g., newspapers) as a way of directing users to more information on some topic of interest. Since this process can be integrated into website content management systems, LSA offers a low-cost way of suggesting additional content to a user. It is comparable to more high-touch methods of suggesting content, in which an editor manually selects the articles most related to a given article. That manual process can be costly and error prone because it requires that the person doing the selecting have a broad knowledge of the contents of the archive.

In addition to text relatedness, LSA can be used to create keyword-searchable indexes. Such keyword searches would differ from some of the more prominent search engines. For example, they would differ from Google in not depending on hyperlinks as a ranking mechanism. They would also differ from cruder search engines, such as many online library catalogs (OPACs), which only return results based on keyword matches and do little to rank them. Most OPACs, for instance, make no distinction as to how much a book is about something. One 500-page book could be entirely concerned with the philosophy of John Dewey, while another 500-page book might devote only 100 pages to Dewey’s philosophy. However, because most OPACs do not have the full text of the books at their disposal, they will not give precedence to the more “Deweyian” of the two. The value of having full text available is consequently fueling mass digitization projects by such Internet companies as Amazon and Google.

Given that full-text availability is important, how do searches that use LSA differ from the aforementioned searches? According to Landauer, Foltz, and Laham (1998), LSA is different in that it “represents the meaning of a word as a kind of average of the meaning of all the passages in which it appears, and the meaning of a passage as a kind of average of the meaning of all the words it contains” (6). Hence, the first item in one’s search results should be the text that is most about the search term, relative to all other texts in the corpus. This does not depend on “simple contiguity frequencies, co-occurrence counts, or correlations in usage”, which would privilege longer texts that frequently use a given term.
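To make this averaging idea concrete, here is a minimal sketch of the computation at the heart of LSA: a term-document count matrix is reduced by singular value decomposition (SVD), and similarity is measured as the cosine between vectors. The toy corpus, the variable names, and the number of dimensions k are all invented for illustration; this is not the Infomap software or the TC Record data.

```python
import numpy as np

def build_lsa(docs, k=2):
    """Term-document counts -> truncated SVD -> a k-dimensional semantic space."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            counts[vocab.index(w), j] += 1
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]    # one vector per word
    doc_vecs = Vt.T[:, :k] * s[:k]  # one vector per passage
    return vocab, word_vecs, doc_vecs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A toy corpus of four one-sentence "articles" (hypothetical text).
docs = [
    "charter schools receive public vouchers",
    "vouchers give parents school choice",
    "students learn literacy in language arts",
    "language arts teachers assess literacy",
]
vocab, word_vecs, doc_vecs = build_lsa(docs)

# "charter" and "choice" never share a passage, but both co-occur with
# "vouchers", so the reduced space places their vectors close together.
print(cosine(word_vecs[vocab.index("charter")],
             word_vecs[vocab.index("choice")]))
```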
Rather, LSA captures “only how differences in word choice and differences in passage meaning are related” (5). For example, assume that we were concerned with finding documents about “violence”. An LSA-backed search would not necessarily return the document with the highest occurrence of the word “violence” in its text, but rather the document that is most about “violence” compared to the other texts in the corpus.

Although LSA “allows us to closely approximate human judgment of meaning similarity between words and to objectively predict the consequences of overall word-based similarity between passages”, it has certain inherent limitations (Landauer 4). The most striking is that its results are “somewhat sterile and bloodless” in that none “of its knowledge comes directly from perceptual information about the physical world, from instinct, or from experiential intercourse with bodily functions, feelings and intentions” (4). It also makes no use of word order or the logical arrangement of sentences (5). Although it works “quite well without these aids”, “it must still be suspected of resulting incompleteness or likely error on some occasions” (5). Landauer, Foltz, and Laham analogize LSA’s knowledge of the world in the following way: one might consider LSA’s maximal knowledge of the world to be analogous to a well-read nun’s knowledge of sex, a level of knowledge often deemed a sufficient basis for advising the young (5). Hence, LSA’s knowledge is based on word counts and vector arithmetic over very large semantic spaces, and is deprived of more sense-driven information.

In addition to viewing LSA as a practical means of obtaining text similarity and performing keyword searches, Landauer, Foltz, and Laham claim that LSA is also a “model of the [human] computational process and representations underlying substantial portions of the acquisition and utilization of knowledge” (4). Thus, along with having a practical component, LSA is hypothesized to underlie human cognitive processes. This claim does not seem to be based on neurological evidence, but rather on how well LSA works in practice: “It is hard to imagine that LSA could have simulated the impressive range of meaning-based human cognitive phenomena that it has unless it is doing something analogous to what humans do” (33). Hence, the authors claim that since it works most of the time, it must have some underlying basis in human cognitive processes. Although the authors admit that “LSA’s psychological reality is certainly still open”, it seems difficult to comprehend how it could represent human cognitive processes. Although Landauer, Foltz, and Laham concede that it is unlikely that the human brain performs the LSA/SVD algorithm, they do believe that the “brain uses as much analytic power as LSA to transform its temporally local experiences to global knowledge” (34). Could we conclude that all constructions, whether mathematical or statistical, that mimic human behavior have a basis in the constitution of the human brain? To avoid the issue of psychological basis, which may require neurological investigation, this paper will instead look at LSA’s ability to simulate psychological phenomena. The relationship between LSA and long-term working memory (LTWM) will be dealt with in the “Psychology and LSA” section.
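Returning to the “violence” example above, the following sketch (continuing the toy space built earlier, and reusing docs, vocab, word_vecs, doc_vecs, and cosine from that listing) shows how passages can be ranked by cosine similarity to a query term’s vector rather than by raw occurrence counts. The query term is chosen only to fit the toy corpus.

```python
# Rank passages by latent "aboutness" rather than raw term frequency.
def rank_docs(term):
    q = word_vecs[vocab.index(term)]
    return sorted(range(len(docs)),
                  key=lambda i: cosine(q, doc_vecs[i]),
                  reverse=True)

# The two literacy articles should outrank the voucher articles,
# regardless of how many times each literally repeats "literacy".
print(rank_docs("literacy"))
```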
Rationale

To test the practical applicability of LSA, we will perform an experiment that tests its ability to aid in information search and retrieval. This experiment will use the Teachers College Record, a journal of education research, as its corpus of text. The experiment begins with a set of categories or topics, such as “Charters” and “Literacy” (topics of keen interest to educators), and asks whether LSA is able to decide which article should go in which category. This will then be compared against the categories assigned by the editors. The question is thus: can LSA decide which articles go in which categories?

Methodology

1) The first step in performing the LSA is to prepare the full text that will be read into the analyzer. This case involves 1508 distinct articles, book reviews, commentaries, and editorials that were available in HTML format. Although there are 8528 full-text HTML articles available, only 1508 of those have been manually categorized. Since we are only interested in comparing LSA’s results to those that were manually categorized, we must set aside the articles outside these 1508. The articles span a wide range of years (Table 1) and have been placed into 2825 categories in total, meaning that some texts were placed in multiple categories.

Decade      Articles
2000s          756
1990s          240
1980s           40
1970s          267
1960s           24
1950s            9
1940s           51
1930s           12
1920s           18
1910s           50
1900s           33
unknown          8
Total         1508

Table 1: Breakdown by Decade

2) The next step was to prepare and install the LSA software: Infomap NLP, developed at Stanford University, a variation of LSA based on Landauer, Foltz, and Laham’s model. The software was installed on a Windows 2000 platform running Cygwin (a Unix-like environment for Windows).

3) After installing Infomap, it was necessary to modify the stop-list, the list of words that the analyzer will ignore in its analysis. Such words include numeric characters, punctuation, and very common words such as “and” and “this”. The stop-list also includes words that are often used in conjunction with other words and that alone do not carry clear semantic value. For example, the word “studies” was on the stop-list because on its own it is ambiguous (studies as a verb, social studies, studies about some subject area, etc.). A selection of words on the original stop-list had to be removed because they are critical terms for the field of education. The stop-list included with Infomap was designed primarily for analyzing psychology texts, where words such as “learning”, “schools”, “students”, “social”, and “studies” are not part of the core vocabulary. These words were removed from the stop-list because in education they can represent the subject matter at hand, unlike in psychology, where they may be peripheral to some other issue.

4) After modifying the stop-list, I ran the infomap-build command, which reads in all the text files within the corpus and builds a model. This process took about two hours to run; the duration will vary depending on the amount of text being read and the speed of the computer. The infomap-build program creates a series of word vectors and indexes that provide a basis for rapid searches.

5) The next step is to install the model so that it is permanently accessible. This is accomplished by running the infomap-install program.

6) After installing the model, one may begin making keyword searches against the word-vector indexes.
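As a rough illustration of such a search, the sketch below (again continuing the toy space from the introduction) handles a multi-word topic by averaging the topic words’ vectors, filtering against a stop-list, and returning a fixed number of hits. This is only one plausible way of doing it; I am not claiming it reproduces Infomap’s actual query mechanics.

```python
import numpy as np

# Query a multi-word topic such as "language arts": filter the topic words
# against a stop-list, average their vectors, and rank passages against
# the resulting centroid. (A guess at the general technique only.)
def rank_docs_for_topic(topic, n, stoplist=frozenset()):
    terms = [w for w in topic.lower().split()
             if w in vocab and w not in stoplist]
    q = np.mean([word_vecs[vocab.index(w)] for w in terms], axis=0)
    order = sorted(range(len(docs)),
                   key=lambda i: cosine(q, doc_vecs[i]),
                   reverse=True)
    return order[:n]  # return only as many hits as were assigned manually

# In the real experiment n would be, e.g., 92 for "Language Arts";
# in this toy space it is 2.
print(rank_docs_for_topic("language arts", n=2))
```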
For each topic area, I searched for the documents most related to that phrase, making sure to return the same number of results as were assigned manually. For example, if 92 articles were manually classified under “Language Arts” by the editors, I asked Infomap to return only 92 entries, so that I could compare how similar the two sets of 92 entries are, if at all. The search is achieved by running the associate program and piping the results into a text file.

7) The results of the analysis were fed back into the SQL Server database containing the articles and the manually selected categorizations. SQL commands were used to compare which documents the editors included manually and which documents the analysis decided to include. Certain topics could not be included in the analysis because hyphenated words are not supported; these include “high-stakes” and “at-risk”. The topic “Special Needs” was also omitted because each word analyzed singly would not produce meaningful results.

Results

Of the 1508 distinct articles that formed the semantic space, a total of 2825 results were requested from the Infomap software (matching the number of manual category assignments). Of these 2825, Infomap correctly matched 750 documents to their proper category (26.5%). The most successful categories include “Charters & Vouchers” (75%), “Race and Ethnicity” (62%), and “Libraries” (56%) (see Table 2).

Name                                 LSA   Manual   % Correct
Charters & Vouchers                   42       56       75.00
Race and Ethnicity                    79      127       62.20
Philanthropy                           5        9       55.56
Libraries                             10       18       55.56
Tracking                               2        4       50.00
Supply and Demand                      8       16       50.00
Online Publishing                      2        4       50.00
Standardized Testing                  19       39       48.72
Teaching Profession                   67      148       45.27
Philosophy                            50      114       43.86
Language Arts                         39       92       42.39
Literacy                              17       41       41.46
Reform                                54      145       37.24
Publishing and Communication           6       17       35.29
Qualitative Methods                    9       26       34.62
War and Education                     22       64       34.38
Politics                              59      179       32.96
Distance Learning                      9       28       32.14
Gender and Sexuality                  15       47       31.91
Leadership                            14       44       31.82
Early Childhood Development           13       42       30.95
Faculty                               15       50       30.00
Religion                               4       14       28.57
Standards                             15       53       28.30
Science                                4       15       26.67
Parental Influence                    10       39       25.64
Technology in the Classroom            9       37       24.32
Arts                                   8       35       22.86
Cognition                              7       31       22.58
Pre-School and Child Care              2        9       22.22
Transformative Learning                6       28       21.43
Program Evaluation                    14       66       21.21
Health and Nutrition                   2       10       20.00
Professional Development              13       66       19.70
Comparative Education                  7       39       17.95
Athletes and Academics                 1        6       16.67
History of Schooling                  11       68       16.18
Experimental Research                  3       21       14.29
Urban Education                        9       64       14.06
Educational Psychology                12       91       13.19
Educational Development                8       61       13.11
Student Teaching                       1        8       12.50
Sociology of Education                20      169       11.83
Violence                               2       18       11.11
Alternative Assessment                 5       50       10.00
Admissions and Tuition                 4       40       10.00
Survey Research                        1       10       10.00
Policy and Teaching Practice           5       52        9.62
Social Studies                         2       26        7.69
Technology Education                   1       14        7.14
Mathematics                            3       50        6.00
Economics and School-to-work           1       29        3.45
Control                                1       29        3.45
Working Conditions                     2       60        3.33
Student and Community Life             1       42        2.38
Supervision                            0       12        0.00
Corp. Training and Continuing Ed.      0       12        0.00
Grading                                0        7        0.00
Rural Education                        0        5        0.00
Secondary Schools                      0       11        0.00
College Advising                       0        9        0.00
Peace Studies                          0       32        0.00
Dropouts                               0        7        0.00
Independent Schools                    0        4        0.00
Schools of Education                   0       46        0.00
Writing for Publication                0        9        0.00
Post-Industrial Education              0       11        0.00
Totals                               750     2825

Table 2: Results of LSA

What might these results indicate about LSA, both as to its practical use value and as to its plausible psychological foundations? We will first address its use value. It seems clear that LSA works best with an explicit vocabulary. Many prior tests of LSA were conducted using introductory psychology texts, which tend to have a more defined vocabulary than the education discourse. However, in the cases where education’s most explicit vocabulary was in use, as illustrated by the “Charters & Vouchers” category, high match rates resulted (75%). This coincides with other high matches, such as “Philanthropy” (56%), “Libraries” (56%), “Tracking” (50%), and “Supply and Demand” (50%). “Race and Ethnicity” had the second-highest match rate (62%, 79/127); however, one would not necessarily conclude that “Race and Ethnicity” is particular to the education discourse, at least not in the same way that “Tracking” may be. This may illustrate divisions among authors’ conceptual viewpoints: one either writes a paper that is richly laden with issues of ethnicity and diversity, or one avoids the issue altogether. Hence, within the education discourse, “Race and Ethnicity” is a subject unto itself, much as “Charters and Vouchers” is for education, or “Operant Conditioning” is for psychology.

Some of the categories that turned up the fewest matches seem to be limited by both vocabulary and phraseology. For example, categories formed by combining two or more words, where each word alone is highly general, such as “Independent Schools”, resulted in low matches. Because of the way the analysis software decomposes the semantic space into word vectors, it is highly likely that “schools” and “independent” were individually too general to turn up significant results. One could conclude that LSA works best where individual words are specific and do not require phrasing (combining multiple words) to derive the correct semantic meaning. Combining multiple words where each word is specific, as in “Charters and Vouchers” or “Race and Ethnicity”, matters less than the specificity of a single word (e.g., the specificity of “charter” or “voucher”). If LSA were to be used in a practical context, the algorithm would need to be altered so that it treated general words that constitute a phrase (e.g., “Independent Schools”) as a single unit.
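For reference, the comparison performed in step 7 and summarized in Table 2 amounts to a per-category overlap computation. A self-contained sketch follows; all article IDs are hypothetical.

```python
# Score each topic as in Table 2: ask LSA for exactly as many articles as
# the editors assigned, then count the overlap. IDs below are made up.
manual = {
    "Charters & Vouchers": {101, 102, 103, 104},
    "Literacy": {201, 202, 203},
}
lsa_ranked = {
    "Charters & Vouchers": [101, 103, 104, 550],
    "Literacy": [201, 610, 620],
}

total_hits = total_assigned = 0
for topic, assigned in manual.items():
    hits = len(assigned & set(lsa_ranked[topic][:len(assigned)]))
    total_hits += hits
    total_assigned += len(assigned)
    print(f"{topic}: {100 * hits / len(assigned):.2f}% correct")

print(f"Overall: {100 * total_hits / total_assigned:.1f}%")
```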
Psychology and LSA

In considering the conducted experiment, can we draw any conclusion as to the psychological basis of LSA? Although we can make no strong conclusions, given the lack of neurological evidence, there are some interesting interpretations to be made. One particularly compelling perspective is introduced by Kintsch, Patel, and Ericsson (1999), which relates LSA to long-term working memory (LTWM). Their work concerns the ways in which experts employ long-term working memory, which is available within a domain of expertise, in conjunction with short-term working memory, which is capacity limited. The authors argue that “superior memory in expert domains is due to LTWM, whereas in non-expert domains LTWM can be of no help” (4).

How is LSA related to LTWM? The authors argue that the LSA space and LTWM are similar:

    We assume that the LSA space is the long-term working memory structure within which automatic knowledge activation in comprehension takes place. Thus, the memory structure that is responsible for the generation of LTWM is the semantic space: items close in the semantic space are accessible in LTWM, whereas items removed in the semantic space require more elaborate and resource-consuming retrieval operations. (10)

The authors further this notion by introducing the concept of the semantic neighborhood: the set of closely related concepts in the semantic space, which are the most readily recalled in LTWM. They argue that “the semantic space itself functions like a retrieval structure, making close neighbors of newly generated text propositions automatically available in LTWM, as long as they can be successfully integrated with the existing episodic structure” (15).

In considering the education semantic space constructed from the TC Record corpus, can we see correlations between the way semantic neighbors are arranged and the way an education expert would recall such information? To answer this question, we must look at some of the semantic neighborhoods. Because we know that “Charters and Vouchers” carries high semantic value owing to the atomicity and specificity of each word, let us view its semantic neighborhood. The neighborhood is retrieved by running associate -c all2 voucher charter, which yields the following results:

Word          Relevance
vouchers      0.85533
voucher       0.8553
charter       0.8553
choice        0.82545
hoxby         0.7766
cleveland     0.73171
tax           0.72868
private       0.72352
public        0.66732
foul          0.66232
competition   0.65939
charters      0.65783
cost          0.64018
operate       0.63993
opponents     0.63858
cps           0.63445
diversion     0.63192
moe           0.62014
metcalf       0.61714
viteritti     0.61088

Table 3: The Semantic Neighborhood for “Charters and Vouchers”

The semantic neighborhood for “Charters and Vouchers” is highly compelling, especially when considering the ways in which an expert would use LTWM in relation to this topic. For example, the most relevant entry after the variations of “charter” and “voucher” is the word “choice”, which is really the core of the issue. The next word is “hoxby”, the surname of Caroline M. Hoxby, a professor at Harvard University who specializes in school choice. This is followed by “cleveland”, which refers to the Cleveland voucher study, and then by “tax”, “public”, and “private”, all highly relevant terms in the area. Hence, it does seem that the semantic space derived from LSA mimics the way an expert would use LTWM: if someone asked an expert about charters and vouchers, she would be quick to recall issues of choice, taxation, and public versus private, as well as important research (the Cleveland study) and important people in the field (Caroline M. Hoxby).
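For illustration, a neighborhood query of this kind can be sketched as a nearest-neighbor search over the word vectors. The sketch below reuses vocab, word_vecs, and cosine from the toy space built in the introduction; it is in the spirit of associate, not a reimplementation of it.

```python
import numpy as np

# A neighborhood query in the spirit of `associate -c`: the words whose
# vectors lie closest to the centroid of the query terms. Toy corpus,
# so the words and scores are illustrative only.
def neighborhood(terms, n=5):
    q = np.mean([word_vecs[vocab.index(t)] for t in terms], axis=0)
    scored = sorted(((cosine(q, word_vecs[i]), w)
                     for i, w in enumerate(vocab)), reverse=True)
    return scored[:n]

for score, word in neighborhood(["vouchers", "charter"]):
    print(f"{word}\t{score:.5f}")
```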
A similar pattern holds for “Race and Ethnicity”, which produces a highly relevant semantic space with many related concepts (identity, stereotypes, status, ethnocultural):

Word            Relevance
ethnicity       0.931
race            0.931
identity        0.81423
racial          0.80893
differences     0.73978
dominant        0.72314
native          0.70456
stereotypes     0.69805
racism          0.69333
ethnic          0.68762
status          0.68694
inferiorized    0.68104
pseudo          0.68065
african         0.67231
identities      0.66333
culture         0.66236
disadvantage    0.6622
ses             0.65739
feinberg's      0.65686
ethnocultural   0.65322

Table 4: The Semantic Neighborhood for “Race and Ethnicity”

For categories where LSA did not correspond closely with the manual selection, such as “Schools of Education”, the semantic neighborhood was much weaker and less relevant:

Word            Relevance
public          0.81857
education       0.80009
schools         0.80009
advocates       0.6241
private         0.62262
rugby           0.60444
marshaling      0.58259
unaccountable   0.57646
mightily        0.56911
popular         0.55268
nonrepression   0.54957
subsidize       0.54627
lenin           0.54432
school          0.54382
privatization   0.54269
schooling       0.54218
borrowman       0.53971
proponents      0.53948
secondary       0.53218
campaigns       0.53132

Table 5: The Semantic Neighborhood for “Schools of Education”

This semantic space is so weak because the word “of” is not considered, nor is word ordering, which clearly makes a large difference in this case. This limitation is acknowledged by Landauer, Foltz, and Laham, who note that LSA does not take word order into account (5).

Conclusion

In conclusion, LSA has practical use value and can simulate psychological processes, particularly long-term working memory. The operative term in this description is “simulate”: although LSA can create semantic spaces that mirror an expert’s LTWM in certain cases, this does not imply that LSA has a true psychological basis. Uncovering such a basis would at least require neurological study, which is beyond the scope (and expertise) of this writer. Both LSA’s use value and its ability to simulate an expert’s LTWM are affected by many factors, most notably vocabulary and phraseology. Words that are specific and do not require phrasing to derive semantic value tend to produce better semantic neighbors and work better at organizing documents within categories (e.g., “Charters & Vouchers”, “Race and Ethnicity”). Phrases whose individual words are highly general, whose word arrangement is critical, or that rely on connectors (e.g., “of”), such as “Schools of Education”, produce less compelling results.

References

Kintsch, W., Patel, V. L., & Ericsson, K. A. (1999). The role of long-term working memory in text comprehension. Psychologia, 42, 186-198.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.