PROBABILISTIC LEXICALIZED CONTEXT-FREE GRAMMARS

İbrahim Alız
Department of Computer Engineering, Middle East Technical University (METU)
ibrahim_aliz@yahoo.com

“With enough knowledge we can figure out the probability of just about anything.”

In that spirit, it did not take long for computational linguists to bring the power of probability to parsing, in order to deal with the ambiguities of the natural language understanding task. Probabilistic parsing is a key contribution to disambiguation: choose the most probable parse as the answer. Better results can be obtained, however, by also using subcategorization and lexical dependency information, that is, by using probabilistic lexicalized context-free grammars (PLCFG), an extension of probabilistic context-free grammars (PCFG). This paper gives a brief description of the principles of PLCFG and then suggests an implementation of a PLCFG for a limited Turkish lexicon and grammar.

An easy way to think of a lexicalized grammar is as a context-free grammar with many more rules: it is as if we created many copies of each rule, one copy for each possible headword of each constituent. In general it would be too costly to keep all these rules around, but thinking of lexicalized grammars this way makes it clear that they can be parsed with standard CFG parsing algorithms.

As an example, for a sentence like “what does your student want to write” we have the following parsing result.

Lexicalized parse tree:
(write, what, S → what S)
(write, does, S → does S)
(write, student, S → NP VP)
(student, your, NP → your student)
(write, want, VP → want VP)
(write, to, VP → to write)

The main idea in extending a PCFG to a lexicalized PCFG is to use a lexical head (the most important item of the constituent) for each constituent. For example, the head of a noun phrase is its main noun, typically the rightmost one (e.g. “student” for “your student”).
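The "one copy of each rule per headword" view described above can be made concrete with a small sketch. The toy grammar, the head-child indices and the headword lists below are all hypothetical and chosen only for illustration; they are not the Turkish grammar and lexicon this paper aims at.

```python
from itertools import product

# Hypothetical toy grammar: (parent, children, index of the head child).
RULES = [
    ("S", ("NP", "VP"), 1),
    ("NP", ("Det", "N"), 1),
]

# Hypothetical headwords that each category may carry.
HEADWORDS = {
    "NP": ["student", "teacher"],
    "VP": ["write", "want"],
    "Det": ["your", "the"],
    "N": ["student", "teacher"],
}

def lexicalize(parent, children, head_index):
    """Yield one lexicalized copy of a rule per choice of child headwords."""
    for words in product(*(HEADWORDS[c] for c in children)):
        head = words[head_index]  # the parent carries the head child's word
        rhs = " ".join(f"{c}({w})" for c, w in zip(children, words))
        yield f"{parent}({head}) -> {rhs}"

def lexicalized_rules():
    """Collect the lexicalized copies of every rule in the toy grammar."""
    return [copy for rule in RULES for copy in lexicalize(*rule)]

if __name__ == "__main__":
    for rule in lexicalized_rules():
        print(rule)
```

Even with two rules and two headwords per category, the grammar grows to eight lexicalized rules, which is why such copies are normally not stored explicitly.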
More generally, heads are computed bottom-up, and the head of a constituent c is a deterministic function of the rule used to expand c. For example, if c is expanded using S → NP VP, the function indicates that the head of c is found by taking the head of the VP.

Lexicalized statistical parsers collect, to a first approximation, two kinds of statistics. One relates the head of a phrase to the rule used to expand the phrase, denoted p(r | h); the other relates the head of a phrase to the head of a subphrase, denoted p(h | m, t), where h is the head of the subphrase, m the head of the mother phrase, and t the type of the subphrase.

A lexicalized parser therefore assigns a probability to a parse as follows. If s is the entire sentence, π is a particular parse of s, c ranges over the constituents of π, and r(c) is the rule used to expand c, then

p(s, π) = ∏_c p(h(c) | m(c)) · p(r(c) | h(c))

That is, for each constituent c we first take the probability of its head h(c) given the head of its mother m(c), and then the probability of the rule r(c) given the head of c.

Before parsing, however, the parser has to be trained on a pre-parsed training corpus so that the rules receive the necessary probabilities. Charniak’s work* on statistical parsing does this with two further equations, one for the probabilities of individual rules and one for their dependencies. Once these primitive probabilities have been approximated from the lexical dependencies between the words in the training corpus, subcategorized by word affinities, the probability of each parse can be calculated with the formula above.

With these basics of PLCFG in place, my aim is to develop a PLCFG parser for a specified Turkish grammar and lexicon.
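As an illustration of the head function and the parse-probability formula described above, the following is a minimal sketch, not the implementation proposed here. The head table, the example tree and every probability value are invented for illustration; a real parser would estimate them from a treebank. Two simplifications are assumptions of this sketch rather than statements from the text: a head child shares its mother's head and so contributes a head factor of 1, and preterminals (tag, word) contribute no rule factor.

```python
# Hypothetical head table: rule -> index of the child that supplies the head.
HEAD_CHILD = {
    ("S", ("NP", "VP")): 1,
    ("NP", ("Det", "N")): 1,
    ("VP", ("V",)): 0,
}

# Hypothetical p(r | h): probability of a rule given the head of the phrase.
P_RULE = {
    (("S", ("NP", "VP")), "writes"): 0.5,
    (("NP", ("Det", "N")), "student"): 0.4,
    (("VP", ("V",)), "writes"): 0.2,
}

# Hypothetical p(h | m, t): head h given mother head m and phrase type t.
P_HEAD = {
    ("writes", None, "S"): 0.01,
    ("student", "writes", "NP"): 0.1,
    ("your", "student", "Det"): 0.3,
}

# A tree is (label, children); a preterminal is (tag, word).
TREE = ("S", [
    ("NP", [("Det", "your"), ("N", "student")]),
    ("VP", [("V", "writes")]),
])

def rule_of(tree):
    """Return the rule (label, child labels) used to expand a phrase."""
    label, children = tree
    return (label, tuple(child[0] for child in children))

def head(tree):
    """Find the head word as a deterministic function of the rules used."""
    label, children = tree
    if isinstance(children, str):  # preterminal: the word is its own head
        return children
    return head(children[HEAD_CHILD[rule_of(tree)]])

def parse_probability(tree, mother_head=None, is_head_child=False):
    """Multiply p(h(c) | m(c), t) * p(r(c) | h(c)) over the constituents."""
    label, children = tree
    h = head(tree)
    # Assumption: a head child inherits its mother's head with probability 1.
    p = 1.0 if is_head_child else P_HEAD[(h, mother_head, label)]
    if isinstance(children, str):
        return p  # assumption: no rule factor for preterminals
    rule = rule_of(tree)
    p *= P_RULE[(rule, h)]
    for i, child in enumerate(children):
        p *= parse_probability(child, h, i == HEAD_CHILD[rule])
    return p

if __name__ == "__main__":
    print(head(TREE), parse_probability(TREE))
```

With these made-up tables, `head(TREE)` is "writes" and the product of the factors is about 1.2e-5. Choosing the most probable parse then amounts to computing this value for each candidate tree and keeping the largest.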
Psychological results on human parsing suggest that probabilities play a role in it, and lexical dependency is another determining factor. It is therefore not surprising that parsers implementing lexicalized PCFGs reach a success rate of 88%. With more intelligent machines pushing the limits of natural language processing every day, PLCFGs are an important step toward understanding human speech recognition and parsing.

References

Statistical Techniques for Natural Language Parsing. Eugene Charniak, Department of Computer Science, Brown University.
* Statistical Parsing with a Context Free Grammar and Word Statistics. Eugene Charniak, Department of Computer Science, Brown University.
Speech and Language Processing. Jurafsky and Martin, Prentice Hall, 2000.
A Model of Syntactic Disambiguation Based on Lexicalized Grammars. Yusuke Miyao and Jun’ichi Tsujii, Department of Computer Science, University of Tokyo.