A Preliminary Probabilistic Model of Language Processing

Hadjar Homaei, Michael Mozer, James Martin
Department of Computer Science and Institute of Cognitive Science
University of Colorado at Boulder, Boulder, CO 80309
hadjar.homaei@colorado.edu, mozer@colorado.edu, martin@colorado.edu

ABSTRACT
The time course of language processing is a central issue in psycholinguistics. The timing of eye movements, ERPs, button presses, and similar measures is used to adjudicate between several major classes of theories, and a number of computational models attempt to explicitly model the dynamics of interacting information sources. The time course of processing is thus a question of considerable interest.

Despite decades of research on language, we are far from definitive answers to such questions. Having only recently begun to appreciate the importance of the statistical properties of the language environment, we now face an unfortunate lack of sufficient data, particularly on the typical exposure of children in everyday life at various stages of development. Although we have acquired extensive knowledge about tasks such as priming and lexical decision, we have relatively little information about the major language abilities of comprehension and production.

Keywords
Language processing, Sequential processing, Probabilistic models

1. INTRODUCTION
Symbolic, rule-based models of human language understanding have long been the preferred approach in linguistics, mainly because they are straightforward and can handle the recursive nature of language. Because of their hand-structured knowledge, these models were usually unable to learn: they were simply our knowledge of language, hard-coded for the hardware to process. This fits naturally with the popularity of the Chomskyan school of linguistics and the belief that the main engines of language processing are innate in the human brain, leaving little to be learned. Over the past three decades, however, there has been rising interest in connectionist models of language processing. At first, these models were essentially straightforward reimplementations of symbolic models using simple neural network processing units. As the representational and learning abilities of connectionist systems improved, however, connectionist models took on a noticeably different character from symbolic approaches.

2. Issues in Sentence Processing
The term "sentence processing" refers to the series of analyses that lead to understanding sentences while reading. Classic empirical and theoretical issues in sentence processing include ambiguity resolution, garden path effects, top-down vs. bottom-up parsing, and modularity. But in order to understand how the human brain learns to solve these problems, we must understand what information is available in the language learner's environment, what abilities the language learner acquires, what behaviors the learner shows in using those abilities, and finally whether a system can learn those abilities in an appropriate environment and what properties such a system must have to show the same behaviors as human learners.
3. Approaches to Sentence Processing
The traditional approach to language research has focused largely on the task of parsing; that is, constructing a labeled hierarchical representation of the structure of a sentence, usually reflecting the grammatical rules used to generate it. Symbolic models excel at manipulating such representations, but they struggle to incorporate meaning into sentence processing. It is not easy to capture slight variations in meaning with variables and values. One particularly problematic aspect of language is the way that small aspects of word meaning can have a major influence on the correct syntactic interpretation.

Most sentence processing models are designed to address one of four major language tasks: parsing, comprehension, word prediction, or production. Because the main goal of our research is word prediction, our brief review focuses on the word prediction literature.

3.1 Connectionist Models
Our focus here is on models that try to explain the semantic and syntactic issues involved in processing multi-word utterances. Connectionist models in this area use a wide range of approaches, from localist implementations of symbolic systems, to systems of interacting localist units, to distributed representations with multi-layer learning rules, to recurrent learning systems.

Some of the best connectionist models of sentence processing are those that perform word prediction. Word prediction is a surprisingly useful ability. It can serve as the basis for a language model, which computes the likelihood of a particular utterance occurring in the language, and that likelihood is used in most speech recognition systems to resolve ambiguity. Accurate word prediction is in fact sufficient to define a language model, and it therefore also implies knowledge of the grammar of the language. For this reason, word prediction models are sometimes called parsers, although we usually reserve that term for a system that can explicitly reveal the syntactic structure of a sentence.

The best-known connectionist prediction models are Elman's (Elman, 1990, 1991, 1993), which use simple recurrent networks (SRNs), also known as Elman networks. Elman (1990) used an SRN to do letter prediction on a letter sequence of concatenated words. He also showed that this model could be used to detect word boundaries by identifying locations of high entropy in the sequence, where prediction becomes difficult. Later, Elman extended this model to do word prediction in a simple language. The word representations produced in the network's hidden layer can be clustered to yield a syntactic and semantic classification of words, which shows that much of the knowledge required for parsing and comprehension can be extracted from a word prediction system.
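To make the prediction task concrete, the following is a minimal sketch of an Elman-style simple recurrent network trained to predict the next word. It is not Elman's original implementation; the toy vocabulary, layer sizes, learning rate, and training sentences are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy vocabulary and dimensions (assumptions, not from the paper).
vocab = ["mike", "hadjar", "eats", "throws", "aush", "pizza", "."]
V, H = len(vocab), 16
Wxh = rng.normal(0, 0.1, (H, V))   # input -> hidden
Whh = rng.normal(0, 0.1, (H, H))   # context (previous hidden state) -> hidden
Why = rng.normal(0, 0.1, (V, H))   # hidden -> output

def one_hot(i):
    x = np.zeros(V); x[i] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(x, h_prev):
    # One forward step: new hidden state and distribution over the next word.
    h = np.tanh(Wxh @ x + Whh @ h_prev)
    return h, softmax(Why @ h)

def train_step(sentence, lr=0.1):
    # Elman-style training: error is propagated only one step back; the
    # previous hidden state is treated as a fixed "context" input.
    global Wxh, Whh, Why
    h = np.zeros(H)
    for cur, nxt in zip(sentence[:-1], sentence[1:]):
        x, h_prev = one_hot(vocab.index(cur)), h
        h, p = srn_step(x, h_prev)
        dy = p - one_hot(vocab.index(nxt))      # d(cross-entropy)/d(logits)
        dh = (Why.T @ dy) * (1 - h**2)          # backprop through tanh only
        Why -= lr * np.outer(dy, h)
        Wxh -= lr * np.outer(dh, x)
        Whh -= lr * np.outer(dh, h_prev)

for _ in range(500):
    train_step(["mike", "eats", "aush", "."])
    train_step(["hadjar", "eats", "pizza", "."])

h, p = srn_step(one_hot(vocab.index("mike")), np.zeros(H))
h, p = srn_step(one_hot(vocab.index("eats")), h)
print({w: round(float(p[i]), 2) for i, w in enumerate(vocab)})  # next-word distribution after "mike eats"

The same network, read at every time step, yields the kind of next-word distributions discussed throughout this section.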
Elman (1991) extended the word prediction model, originally trained on a simple language, to process sentences that could include multiple embedded clauses. The main point of this work was to show that connectionist models are able to learn and represent complex, hierarchical structure. This is an essential property of natural languages that any language processing model must accommodate if it is to deal with natural language. As Elman puts it, "The important result of this work is to suggest that the sensitivity to context which is characteristic of many connectionist models, and which is built-in to the architecture of SRNs, does not preclude the ability to capture generalizations which are at a high level of abstraction." One particularly interesting finding of this work was that the networks could learn highly complex sentences only if they first learned simpler structures. This idea was developed further in Elman (1993), which shows that these networks can also learn complex structures if their memory spans are restricted at the beginning of training and allowed to expand gradually.

Most other connectionist word prediction models are more or less based on Elman's networks. Chater and Conkey (1992) compared Elman's SRN training method to a more complex alternative, backpropagation through time (Rumelhart et al., 1986), which extends the propagation of error derivatives back to the beginning of the sentence. Not surprisingly, they found that backpropagation through time, a slower and far less "biologically plausible" method, can produce better results than Elman's SRN training. Christiansen (1994) tested the ability of SRNs to learn simple languages with three types of recursion: counting recursion, center-embeddings, and cross-dependencies, the last of which context-free grammars cannot account for. However, the results of these experiments were quite poor compared with statistical bigram models; in some cases the networks performed worse than unigram models. In a subsequent experiment, Christiansen extended the language used by Elman (1991) to include prepositional phrases, left-recursive genitives, conjunction of noun phrases, and sentential complements. In general, the networks performed better on these extended languages and exhibited behaviors that reproduce human comprehension performance on similar sentences. In more recent work, Christiansen and Chater (1999) extended these preliminary results and provided more detailed comparisons with human performance.

Finally, Tabor, Juliano, and Tanenhaus (1997) carried out a number of experiments comparing human and SRN reading times on sentences containing structural ambiguities. Although they used a simple recurrent prediction network in these studies, relative reading times were derived using an interesting dynamical-systems analysis. The hidden representations produced by the network at different stages of processing sentences are plotted in a high-dimensional space and treated as physical masses exerting a gravitational force. To determine the network's reading time on a word, the corresponding hidden representation is placed in this space and allowed to drift among the attractor masses until it reaches a stable state. The settling time is used as a substitute for reading time.
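The gravitational analysis can be illustrated schematically: attractor points stand in for clusters of hidden representations, a test state drifts down the resulting potential, and the number of steps until it effectively stops moving serves as the reading-time proxy. The attractor positions, masses, softening constant, and step size below are invented for illustration and are not the values used by Tabor et al.

import numpy as np

def settling_time(state, attractors, masses, lr=0.01, soften=0.5,
                  tol=1e-4, max_steps=10000):
    # Let a test point drift downhill on a softened gravitational potential
    # U(x) = -sum_i m_i / sqrt(|x - a_i|^2 + soften^2); the softening keeps
    # the force finite near an attractor. Steps to rest ~ reading time.
    x = np.array(state, dtype=float)
    for step in range(1, max_steps + 1):
        grad = np.zeros_like(x)
        for a, m in zip(attractors, masses):
            d = x - np.asarray(a, dtype=float)
            grad += m * d / (d @ d + soften**2) ** 1.5
        move = lr * grad
        x -= move
        if np.linalg.norm(move) < tol:
            return step
    return max_steps

# Attractors stand in for clusters of hidden representations; a heavier mass
# might correspond to a more frequent or more strongly learned structure.
attractors = [[1.0, 0.0], [-1.0, 0.5]]
masses = [3.0, 1.0]

print(settling_time([0.2, 0.1], attractors, masses))    # one starting hidden state
print(settling_time([-0.6, 0.45], attractors, masses))  # another starting state

Comparing settling times for hidden states produced at different points in a sentence is what allows such a model to be lined up against human reading-time data.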
3.2 Probabilistic Models
Probabilistic methods provide new approaches to the problem of sentence processing, and they also provide a means of investigating fundamental cognitive science questions about how humans structure, process, and acquire language. In such models, language comprehension and production involve probabilistic inference, and acquisition involves choosing the best model, based on innate constraints together with linguistic and other perceptual input.

One advantage of probabilistic models over connectionist models is that they can account for the learning and processing of language while still maintaining the sophistication of symbolic models. The recent growth of theoretical developments and of online corpora such as PropBank, WordBank, and FrameNet has made it possible to test large statistical and probabilistic models, revealing probabilistic constraints in processing and suggesting important links with probabilistic theories of categorization and of ambiguity resolution in perception [2]. We briefly review probabilistic models, and justify why we use them, in Section 4.1.

4. Our Approach
Our current research plan is focused on three main areas of language processing: production, prediction, and comprehension. It aims to provide a plausible account of how different constraints (syntax, semantics, and pragmatics) are combined and interact in a probabilistic system to accomplish these tasks. Ultimately, the model should be judged by its ability to explain human behavior. Although some progress has been made, many technical questions remain to be answered before the model is complete and ready for evaluation.

4.1 A Probabilistic Model
There is ample evidence that human language processing has an underlying probabilistic structure, in particular evidence for the role of frequency and probability in language comprehension. Evidence gathered throughout the second half of the 20th century showed that high-frequency words are accessed more quickly, more easily, and with less input signal than low-frequency words (see [2] for a review). Although this evidence motivates the use of probabilistic methods to model language understanding, and although we know that many kinds of knowledge must interact probabilistically in building the interpretation of a sentence, we still do not know much about how this probabilistic process takes place: how different aspects of linguistic knowledge are represented, how these probabilities are combined, how some interpretations come to be favored over others and selected, and what the relationship is between probability and behavioral measures such as reading time. A detailed understanding of the time course of the use of different types of knowledge, and of the roles that memory limitations, interference, and locality play in sentence processing, is key to understanding the architecture of the human sentence processing mechanism. A thorough account of sentence processing needs to combine these results into a single comprehensive model. Unfortunately, most existing work does not tell us enough about the earlier question: how to understand the role of probability in representing linguistic knowledge, combining evidence, and selecting interpretations.

Constraint-based (sometimes called constraint-based lexicalist) approaches to modeling the probabilistic aspects of language processing address some of these questions about the role of probability. The main idea of constraint-based models is that all interpretations of a partially read or ambiguous sentence are maintained and processed in parallel, and the choice among these competing interpretations is made by integrating a large number of constraints over a wide range of types of knowledge.
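As a rough illustration of this idea (anticipating the competition-integration models discussed below), the sketch that follows integrates several weighted constraints over two competing interpretations and iterates until one interpretation dominates, with cycles-to-criterion standing in for processing time. The constraint values, weights, and criterion are invented for illustration; this is not the published algorithm of any particular model.

import numpy as np

def integrate_constraints(constraints, weights, criterion=0.9, max_cycles=200):
    # constraints: rows are constraints, columns are competing interpretations
    # (non-negative support values). Returns the index of the winning
    # interpretation and the number of integration cycles it took to win.
    c = np.asarray(constraints, dtype=float)
    w = np.asarray(weights, dtype=float)
    for cycle in range(1, max_cycles + 1):
        c = c / c.sum(axis=1, keepdims=True)        # normalize each constraint
        interp = w @ c                              # weighted combination
        interp = interp / interp.sum()              # activation of each interpretation
        if interp.max() >= criterion:
            return int(interp.argmax()), cycle
        c = c + (w[:, None] * c) * interp[None, :]  # feed activation back to the constraints
    return int(interp.argmax()), max_cycles

# Two interpretations of an ambiguous fragment, supported by three hypothetical
# constraints (e.g. a frequency bias, a thematic-fit bias, a contextual bias).
constraints = [[0.80, 0.20],
               [0.40, 0.60],
               [0.55, 0.45]]
weights = [0.5, 0.3, 0.2]

print(integrate_constraints(constraints, weights))

Strongly biased, consistent constraints drive a quick win (few cycles), while conflicting constraints prolong the competition, which is how such models relate constraint strength to reading time.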
There have been a number of computational implementations of the constraint-based framework, mainly neural network models that take various frequency-based and contextual features as input and combine these features through activation dynamics to converge on one particular interpretation as the winner (Burgess and Lund, 1994; Kim et al., 2002; Spivey-Knowlton, 1996; Pearlmutter, Daugherty, MacDonald, & Seidenberg, 1994) [1]. The most completely implemented of these models, and the one that makes the clearest claims about the integration of probabilistic constraints and about how probabilities map onto processing time, is Spivey and colleagues' competition-integration model (Spivey-Knowlton, 1996; McRae et al., 1998) [1]. This model uses a normalized recurrence algorithm for modeling constraint integration.

The main problem with these constraint-satisfaction models is deciding on their structure: how structured interpretations should be built probabilistically, how structural knowledge plays a role, how probabilities over structures are set, and how constraints are combined on the basis of structure. There are other probabilistic models that focus on exactly these questions of structure. For example, Jurafsky (1996) and Crocker and Brants (2000) both propose sentence processing models based on probabilistic grammars, which by their nature provide a principled foundation for probabilistic structure. Jurafsky's model is a probabilistic parser that maintains multiple parses of an ambiguous sentence, ranking each possible parse tree by its probability. The probability of an interpretation is computed by multiplying two probabilities: the stochastic context-free grammar (SCFG) "prefix" probability of the portion of the sentence seen so far, and the "valence" (syntactic/semantic subcategorization) probability for each verb [1].

However, despite their probabilistic nature, neither of these two models, Jurafsky (1996) nor Crocker and Brants (2000), modeled the probabilistic relation between individual words, often known as word transition or word bigram probability, which is the focus of an important class of behavioral studies. In recent work, McDonald et al. (2001) studied the effect of this probability on reading time by running eye-tracking experiments and recording eye fixations at the points where subjects read verb-noun pairs embedded in otherwise ordinary sentences. Each verb-noun pair had either a high or a low transition probability.

High probability: One way to avoid confusion is to make the changes during vacation.
Low probability: One way to avoid discovery is to make the changes during vacation.

Other aspects of the sentence pairs, such as their length, the frequency of the noun in the corpus, the neutral context, and sentence plausibility, were all matched. McDonald et al. (2001) showed that the duration of subjects' average initial fixation on the noun was shorter when the noun appeared in a high-transition-probability verb-noun pair.

Our goal in this paper is to build a model that meets these principles. The fundamental insight of our model is the use of graphical models (specifically, dynamic Bayes nets) to model the probabilistic nature of human sentence processing. The advantage of using Bayes nets is that they can represent the causal relationships between different probabilistic knowledge sources, how those sources can be combined, and what we know about the independence of the probabilities involved.
In our Bayesian model of sentence processing, we construct dynamic Bayes nets incrementally as a sentence is being processed (or produced). Each Bayes net integrates lexical, syntactic, and semantic probabilistic knowledge in an online manner. Our proposal is thus that humans combine structure and evidence probabilistically, computing and incrementally recomputing the probability of each interpretation of an utterance as it is processed. The model is online and incremental; it assigns structure word by word as the sentence is read, revising that structure as new information reaches the parser. Like most sentence processing models, our model is sensitive to various constraints, including syntactic structure, thematic biases, and lexical structure. The model is also probabilistic, incrementally computing the probability of each interpretation conditioned on the input words so far and on lexical, syntactic, and semantic constraints and knowledge; the preferred interpretation at any time is the one with the highest probability. The fundamental idea is to build multiple interpretations of the input in parallel, compute the probability of each, and choose the interpretation with the maximum probability. This probability also plays a role in reading time: words or structures that are unexpected (low probability) take longer to read.

4.2 Grammar
Rather than studying language understanding in the abstract, we create a micro-world and interpret language in terms of interpretations of actions in that micro-world.

S -> A V B | A W | B is X by A | A threw P with I
V -> eats | throws | addresses
X -> eaten | thrown | addressed
W -> V | runs
A -> Student(s) | Professor | Secretary | Hadjar | Mike | Pitcher
B -> Pizza | Aush
C -> Student(s) | Letter
I -> Hand | Bat
P -> Party | Ball

(This is a simplified grammar; the complete grammar is not shown here.)

4.3 Task
The basic issues on which we assess our model are pragmatics and two types of ambiguity resolution. We expect the model to capture the pragmatics of the world. For example, in the micro-world used to test the model, Mike likes Aush and does not eat Pizza, while Hadjar likes Pizza more than Aush. So when the model is given the utterance "Mike eats", the probability distribution over the next word should show generally higher probabilities for words that represent food (Aush and Pizza), and in particular a higher probability for Aush than for Pizza.

(1) Hadjar eats … (Pizza). Mike eats … (Aush).

The first type of ambiguity we address concerns the case role types in a sentence, given the verb. For example, the verb "throw" in its different senses can accept different case roles. Consider the following examples from our grammar:

(2) The pitcher threw a ball with a bat. Hadjar threw a party.

In the first sentence "threw" means "toss" and can take an instrument case role, while in the second sentence it means "cast" and cannot take an instrument.

Another task we expect our model to accomplish is to capture higher-order dependencies, where the system has to decide what object follows based not only on the verb or the agent but on both. Consider the following examples, also from our grammar:

(3) The professor addresses … (students). The secretary addresses … (the letter).

In neither of these sentences can the verb "address" alone specify what object might come next, nor can the agent "professor" or "secretary" alone; it is the combination of agent and action that determines the object.
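As an illustration of how such a micro-world corpus might be generated, the sketch below samples sentences from the simplified grammar above with agent-conditioned preferences. The production weights, in particular the food preferences, are hypothetical values added only to show how the pragmatic biases described in this section (Mike preferring Aush, Hadjar preferring Pizza) could be built into the training data; they are not the weights used in our actual corpus.

import random

# Hypothetical preferences: P(food | agent) for the verb "eats".
FOOD_PREFS = {
    "Mike":   {"Aush": 0.8, "Pizza": 0.2},
    "Hadjar": {"Pizza": 0.7, "Aush": 0.3},
}

def weighted_choice(dist):
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def sample_sentence():
    # Sample one sentence of the "A V B" form from the micro-world. The object
    # is conditioned jointly on agent and verb, which is exactly the kind of
    # higher-order dependency the model is meant to capture.
    agent = random.choice(["Mike", "Hadjar", "The professor", "The secretary", "The pitcher"])
    if agent in FOOD_PREFS:
        verb, obj = "eats", weighted_choice(FOOD_PREFS[agent])
    elif agent == "The professor":
        verb, obj = "addresses", "students"
    elif agent == "The secretary":
        verb, obj = "addresses", "the letter"
    else:  # the pitcher
        verb, obj = "threw", "a ball with a bat"
    return f"{agent} {verb} {obj}"

corpus = [sample_sentence() for _ in range(10)]
print("\n".join(corpus))

A weighted grammar over the full rule set would serve the same purpose; the point is simply that agent-conditioned preferences in the generator are what give rise to the pragmatic and higher-order regularities the model is later tested on.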
4.4 Architecture
Our Bayesian model builds multiple interpretations of the input, each associated with a probability, and chooses the interpretation with the maximum probability. Assuming that words or structures that are less probable take more time to read, this probability can later be used to model reading time.

Suppose we are given an input sequence of words $W = \{w_1, w_2, \ldots\}$ and a set of potential interpretations $I = \{s_1, s_2, \ldots, s_n\}$ for sentence S. Our task is to find the most probable interpretation $s^*$ given the input sequence W. This computation can be stated by the following formula:

$s^* = \operatorname*{argmax}_{s} P(s \mid W)$

This equation tells us that we could find the best interpretation for a sequence of words, and consequently the best sequence of words to express a meaning, if we knew how to compute $P(s \mid W)$. To capture this probability, instead of counting over all word sequences and all interpretations, which is impractical, we decompose the knowledge in the model into a number of components (Figure 1). These components fall into three categories: semantics, syntax, and context. Each category contains components that convey different parts of the knowledge in the model.

Figure 1. Generative Model

Semantics consists of Se, the random variable for the ultimate semantics of the utterance (i.e., ranging over all interpretations); CR, a set of random variables, each representing one specific type of case role; and WM, which stands for word meanings and is also a set of random variables, one for each CR. Syntax consists of Sy, the random variable for the overall syntactic structure, and F, which represents the form, aspect, and tense of each abstract concept that fills a role. Context is represented by C, which is simply a counter of where we are in the sentence. The dashed line around CR, WM and F indicates that this part of the Bayesian network is repeated a constant number of times; it remains naive Bayes and is not part of the dynamic structure of the network.

The quantity we ultimately want is $P(Se \mid w_1, w_2, \ldots)$. The knowledge in the model consists of the following distributions:

$P(w_i \mid C_i, Se, Sy, \{CR, WM, F\})$ : the temporal constraint on each word.
$P(C_i \mid C_{i-1})$ : the context-update function. In a trivial, localist implementation, this can simply be a counter, with the context state incrementing by 1 at each time step.
$P(Sy \mid Se)$ : the syntactic structure given the semantics. We could decide to make syntax independent of semantics if we wished.
$P(\{F\} \mid Se)$ : the probability of the form/tense/aspect of the filler of each role. We separate this from syntax so that the syntax node encodes only the structure of the parse tree; in the future we may be able to replace the syntax node with some version of an incremental statistical parser.
$P(\{CR\} \mid Se)$ : the probability of each case role participating in the sentence.
$P(\{WM\} \mid Se)$ : the probability distribution over word meanings filling each case role.
$P(Se)$ : the prior over semantics.

The generative process first selects the semantics of the utterance, represented by the random variable Se. Given the semantics, the syntactic structure, denoted by the random variable Sy, is determined. Se, CR, WM, F and Sy then jointly determine each of the words in the utterance. Each {CR, WM, F}_i set, as depicted in Figure 1, is related to one case role; for example, {CR, WM, F}_patient jointly determines the surface form that might fill the patient role in the sentence, and the contribution of C, Se and Sy is to find the appropriate place in the sentence for this surface form.
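The following is a highly reduced sketch of this computation, not the full dynamic Bayes net of Figure 1: the semantic values and probabilities are invented, and the syntax, case-role, and word-meaning nodes are collapsed into a single word-emission table. It is meant only to show how a posterior over interpretations, and the argmax above, can be computed from the component distributions.

import numpy as np

# Hypothetical priors over semantic interpretations (the pragmatics of the
# micro-world); these numbers are illustrative, not our trained values.
PRIOR = {
    "mike-eats-aush":    0.35,
    "mike-eats-pizza":   0.10,
    "hadjar-eats-pizza": 0.35,
    "hadjar-eats-aush":  0.20,
}

def emission(word, position, se):
    # P(w_i | C_i = position, Se = se) for the reduced model: word i is simply
    # the i-th element of the interpretation, with a little mass left over
    # for any other word.
    agent, verb, patient = se.split("-")
    expected = [agent, verb, patient][position]
    return 0.9 if word == expected else 0.05

def posterior(words):
    # P(Se | w_1..w_k): multiply the prior by the word emissions, renormalize.
    scores = {se: p * np.prod([emission(w, i, se) for i, w in enumerate(words)])
              for se, p in PRIOR.items()}
    z = sum(scores.values())
    return {se: s / z for se, s in scores.items()}

def best_interpretation(words):
    post = posterior(words)
    return max(post, key=post.get)          # s* = argmax_s P(s | W)

print(posterior(["mike"]))
print(posterior(["mike", "eats"]))
print(best_interpretation(["mike", "eats"]))

In the actual model, the emission table is replaced by the conditional distributions listed above, and the posterior is recomputed by standard Bayes-net inference as each word arrives.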
To encode the sequential structure of a sentence, we imagine a context representation that evolves over time, denoted by c1, c2, ..., cn, where time serves as the index. The context tells the model where it is in the production of a sentence, and could represent something like which constituent of the sentence is currently being processed. The context is much like the context representation in an Elman or Jordan net: it specifies where we are in the sequence. Although in the generative model the evolution of the context depends only on the previous context, when the model is used for recognition the individual words can modulate the context representation. For example, if ci = a, and from state a transitions can be made to either state b or c, then the inferred value of ci+1 will depend not only on ci but also on wi+1.

The generative model above can be used for recognition, and the formal statement of recognition is to estimate $P(Se \mid w_1, w_2, \ldots)$, the probability of a given semantic representation given the sequence of words. Pragmatics comes from priors over semantics, and from syntax conditioned on semantics. Consider the verb "eat". The patient must be a physical object, so we need to be able to represent the semantics of "X eat Y", where X is any animate agent and Y is any physical object. Pragmatics comes in when we impose additional constraints on the eating, e.g., Mike eats eggs and salad but not pizza. That can be expressed in terms of priors over the semantic representations.

5. Results
To test our model, we first generated a training corpus from the predefined grammar. Each record in the training set contains the semantic meaning of the sentence (Se), the meaning of each word (WM), the case roles (CR), and the word surface form and aspect (F). Essentially, the only hidden information is the syntactic structure of the sentence. We made all the semantic information visible because, during language learning, most semantic information is accessible to the learner through perception (and through instruction, in the case of thematic roles and of different word meanings sharing the same surface form). Syntax, however, is not explicitly accessible to language learners. Infants learn to speak without any notion of the existence of a grammar; still, they can pick up some surface aspects of grammar through perception, such as how to make a singular noun plural. That is why we made the form and aspect information (F) visible to the learning algorithm.

We trained our system on this corpus and tested it on the tasks described in Section 4.3. First, however, we examined how the system incrementally updates its interpretation of a sentence. The graph in Figure 2.a shows how the distribution over semantic values in the semantics node changes as new words are given to the model. Points 1, 2 and 3 in this graph represent the semantic values for "Mike eats Aush", "Hadjar eats Aush" and "Jeff eats Aush". The graph reveals both the semantics and the pragmatics of the world: by giving high probability to these first three semantic values, the model shows that it understands that Aush can only be eaten, and only by people (when testing this sample, professor, student and the other animate objects were not part of the corpus). It also shows that if Aush is going to be eaten, it is most likely being eaten by Mike, because Mike likes Aush more than Hadjar and Jeff do.
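Stated compactly, the quantities examined here and in the following subsections are, under our reading of the model, the incrementally recomputed posterior over semantics and the next-word distribution obtained by marginalizing over the hidden nodes. In the following, $H$ abbreviates the pair $(Sy, \{CR, WM, F\})$ (the abbreviation is ours, not part of the model specification), and $C_i$ is assumed to be the deterministic counter described above:

$P(Se \mid w_1, \ldots, w_k) \;\propto\; P(Se) \sum_{H} P(H \mid Se) \prod_{i=1}^{k} P(w_i \mid C_i, Se, H)$

$P(w_{k+1} \mid w_1, \ldots, w_k) \;=\; \sum_{Se,\,H} P(Se, H \mid w_1, \ldots, w_k)\, P(w_{k+1} \mid C_{k+1}, Se, H)$

The first quantity corresponds to what Figure 2.a tracks word by word; the second corresponds to the next-word probabilities reported in Section 5.1.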
5.1 Word prediction, Pragmatics
To test our model on word prediction we used the examples in (1) from Section 4.3:

Mike eats …
Hadjar eats …

We trained the model with a corpus whose priors over semantics make "Mike eats Aush" more probable than "Mike eats Pizza", and "Hadjar eats Pizza" more probable than "Hadjar eats Aush", although all four sentences are semantically correct. We gave the model the first two words of each sentence and then computed the probability of the next word given the two previous words. The results are shown in Figures 2.a and 2.b; point 4 in the graph shows W4, which is "Aush", and point 5 is W5, "Pizza". The graph shows that the model's predictions clearly reflect what was favored in the corpus.

Figure 2. Word prediction

5.2 Case role assignment
To test our model on case role assignment we again use the examples in (2) from Section 4.3:

The pitcher threw a ball with a bat.
Hadjar threw a party.

Recall that the verb "threw" in the first sentence means "to toss", which can easily take an associated instrument, while in the second sentence it means "to cast", which does not usually take an instrument among its arguments. We test the model by feeding it each of these sentences word by word and monitoring the changes in the distributions in the CR nodes. As shown in Figure 3, our belief in the presence of the instrument case role (point 4) changes after the word in parentheses is given to the model. Green and yellow bars respectively show the belief in each CR before and after the new word is fed to the model. Points 1 through 4 represent the CRs for Action, Agent, Patient and Instrument. It is clear that after the model has seen "Hadjar threw a party", it has disambiguated the meaning of "threw" and no longer expects an instrument. On the other hand, when the model reads "The pitcher threw the ball" it still expects the sentence to contain an instrument role, and as it receives the next word, "with", this belief becomes even stronger.

Figure 3. Case roles

5.3 Higher order dependencies
Our example for higher-order dependencies was the following:

The professor addressed students.
The secretary addressed the letter.

To test our model on this phenomenon we gave these two sentences to the model incrementally and monitored the changes in WMpatient, which gives a prediction of what word meaning will fill the patient case role. Figure 4 shows the conditional probability distribution over WMpatient values after the model has seen parts of each sentence.

6. ACKNOWLEDGMENTS
Our thanks to Jeff Elman.

7. REFERENCES
[1] Narayanan, S., and Jurafsky, D. A Bayesian model of human sentence processing. In preparation, 2005.
[2] Chater, N., and Manning, C. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, Vol. 10, No. 7, July 2007.
[3] Grodner, D., and Gibson, E. Consequences of the Serial Nature of Linguistic Input for Sentential Complexity. Cognitive Science, 29, 2005, 261-291.
[4] Elman, J. Finding structure in time. Cognitive Science, 14, 1990, 179-211.
Figure 4. Higher order dependencies

Dark and light brown bars respectively show the probabilities before and after the model has seen the word in parentheses. Note that for simplicity we show probabilities for only four values of WMpatient; they refer, respectively, to "Professor", "Secretary", "Student" and "Letter". The results show that after the model has seen "the professor" or "the secretary", there is no preference between "student" and "letter" for the patient role. However, once the model receives the next word, "addresses", it clearly selects "student" to fill the patient role in the first sentence and "letter" in the second. (This higher-order-dependency example was suggested to us by Jeff Elman.)

[5] Griffiths, T., and Tenenbaum, J. Two proposals for causal grammars. 2005.
[6] Jurafsky, D., and Martin, J. Speech and Language Processing. Upper Saddle River, NJ: Prentice Hall, 2000.
[7] Hale, J. The Information Conveyed by Words in Sentences. Journal of Psycholinguistic Research, Vol. 32, No. 2, March 2003.
[8] Rohde, D. A Connectionist Model of Sentence Comprehension and Production. PhD thesis, School of Computer Science, Carnegie Mellon University and the Center for the Neural Basis of Cognition, 2002.