G22.2590 - Natural Language Processing - Spring 2001
Lecture 1 Outline
Prof. Grishman
January 17, 2001

Introduction

Centrality of Natural Language

    a primary (and natural) mode of human communication
    representation for most recorded human knowledge
    a very rich and flexible representation (when compared to most formal representations)

Applications of Natural Language Processing (NLP)

    machine translation
    interactive systems … data base query, expert systems, help systems
        limited appeal with written input: people don't like to type a lot
    information extraction and document retrieval
    grammar checking
    speech applications: dictation; interactive querying

Our Goal

    create systems which can perform such applications: an engineering problem
        the term "language engineering" has become popular, especially in Europe, to reflect this orientation
    natural language processing systems are complex, and require good design techniques
        modular approaches to break the problem up at appropriate points
        formal models which reflect aspects of the structure of language

Relation to other Fields

Linguistics

    goal of linguistics is to
        describe language
        provide simple models which can predict language behavior
        understand what is universal about language
        through these formal models, understand how language can be acquired
    formal models from linguistics have been of value in NLP
    but the goals of linguistics are not the same as those of NLP:
        a single counterexample can invalidate a model as a linguistic theory, but would not significantly lessen its value for NLP
        NLP must address all phenomena which arise in an application, while linguistics may focus on select phenomena which give insight into the language faculty

Psycholinguistics

    goal is to understand human performance in generating and analyzing language
    little influence on NLP to date
        some syntactic analyzers which try to mimic human performance and difficulties

Artificial Intelligence & Machine Learning

    AI is concerned primarily with generic problem solving strategies and suitable knowledge representations
    there is an inherent link between AI and NLP: some NLP problems require the sort of deep reasoning addressed by AI
    but NLP has found increasing success through avoiding deep reasoning, and so the link has weakened

Statistics

    statistical methods and models, originally used in signal processing, information theory, and physics, have become more widely used in NLP
    easily trainable and easily computable models are more attractive now that lots of training data is available

Analyzing our Needs: setting our agenda

What functionality do we require in order to address NLP applications?

Machine Translation

People have been interested in machine translation since the earliest days of computing. At first, people imagined that machine translation is mostly a "data processing" task … a system looks up the words one at a time in a bilingual dictionary, and then maybe has to fix up the translation a bit. However, there is a lot more to do for machine translation:

    word segmentation: for some languages (such as Japanese and Chinese) there are no spaces between words, so it's not clear what the words are
    morphology: words appear in different forms, indicating singular vs. plural (for nouns), present tense vs. past tense (for verbs), nominative vs. accusative case, etc. English has only a few morphological forms, so it's possible to put them all in a dictionary. This isn't true of most Western languages; for example, a Spanish verb could have over 50 forms.
    syntax: word-for-word translation only works if the word order in the two languages is about the same; if it's not, we need to understand enough about the structure of the two languages (their syntax) to change from one word order to another. English has a rather fixed subject-verb-object order ("SVO"), while many more inflected languages have more variable word order.
    lexical semantics: many words are polysemous … they have multiple meanings.
    A word will have to be translated differently depending on its meaning in a particular context; otherwise the translation is likely to make little sense. For example, "bill" means both a statement of charges (an "invoice") and a part of a duck (its "beak"). It's not likely that any foreign language has a word with both these senses. If we chose the wrong sense in translating "bill" into a foreign language, it would be like reading the English sentence "At the end of the meal, the waiter presented the beak."
    discourse: in order to create a proper translation we sometimes have to look beyond the individual sentence. That can be true in selecting word senses. The need also arises in translating into English from languages where subject pronouns can be omitted; we need to figure out what the subject actually is, so that we can supply a "he" or "she" or "it" in English.

Information Extraction

An information extraction system processes text and extracts information about a specific type of event or relationship. For example, one extraction system we've worked on reads newspaper articles and builds a database of executives who were hired for new management jobs. If it reads a sentence such as

    IBM hired Fred Smith as president.

it would create a table entry

    person      Fred Smith
    company     IBM
    position    president

We can find some of these items by simple pattern matching, looking for something like

    <word> hired <word> <word> as <word>

but this won't get us very far. For better performance, we need

    name recognition: a company name may be several words ("General Motors"), a person may have a title or middle name ("Mr. Smith", "Fred X. Smith")
    syntax: the information may appear in the passive ("Fred Smith was hired by IBM") or in a relative clause ("Fred Smith, who was hired by IBM"); also, there may be extra modifiers ("IBM yesterday hired Fred Smith as president")
    lexical semantics: there may be lots of synonyms for hired ("appointed", "named", …) which the system should recognize
    discourse - pronouns: if a pronoun appears in a relevant sentence, the system has to figure out what the pronoun refers to ("Fred Smith left Compaq last week. IBM hired him yesterday as president.")

We will be studying (and testing) information extraction applications over the course of the semester.

Interactive Command and Query

A number of systems have been constructed to serve as natural language front ends for data base query. For complex queries, they spare the user the need to learn a formal query language. On the other hand, they run into difficulty if the user keeps asking questions the system cannot answer.

The system has to translate a natural language query into a formal data base query. It has to be able to accept a wide range of queries or it will be worse than useless (it will be much more frustrating than a formal query language; it won't really be "natural language"). To do so it needs to analyze

    syntax: it needs to divide the query into phrases which correspond, roughly, to different data base attributes:
        Which customers | bought | green widgets | last week?
    lexical semantics: it needs to figure out how these phrases map into data base relations
    quantifier semantics: if the user says "List all the customers who bought more than five widgets.", the system has to figure out the quantifier structure
    discourse - pronouns and sentence fragments: in asking a sequence of questions, the user is likely to use pronouns and fragments to keep the queries short: "How many widgets did General Motors buy? How many kumquats? Did it buy any tangelos?"
    dialog: a good interactive system will be responsive in its replies, pointing out false assumptions and asking for clarification:
        "How many programmers in the child care department make over $50,000?"
        "There are no programmers in the child care department."

        "How many people live in Washington?"
        "Washington, D.C. or the State of Washington?"

Summary

Natural language is a very rich and powerful communication medium. If we are to build systems which can utilize this medium, we must analyze language at several levels:

    syntax: what is the structure of a sentence?
    semantics: what is the meaning of a sentence (in isolation)?
    discourse: how can a sentence be interpreted in context?
    dialog: how is language used to exchange information?
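
Illustration: the simple pattern-matching approach

As a rough illustration of the simple pattern described under Information Extraction, "<word> hired <word> <word> as <word>", the sketch below writes that pattern as a Python regular expression and builds the person / company / position table entry from a match. The names HIRE_PATTERN and extract_hire are hypothetical, and treating a <word> as a run of non-space characters is an assumption made for this example; none of this is taken from the extraction system mentioned in the lecture.

```python
import re

# Literal rendering of "<word> hired <word> <word> as <word>".
# Assumption: a <word> is any run of non-space characters; the final
# word uses \w+ so that a trailing period is not captured.
HIRE_PATTERN = re.compile(
    r"(?P<company>\S+) hired (?P<first>\S+) (?P<last>\S+) as (?P<position>\w+)"
)


def extract_hire(sentence):
    """Return a table entry (a dict) for a hiring event, or None if no match."""
    match = HIRE_PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "person": f"{match['first']} {match['last']}",
        "company": match["company"],
        "position": match["position"],
    }


if __name__ == "__main__":
    examples = [
        "IBM hired Fred Smith as president.",
        "Fred Smith was hired by IBM.",
        "IBM yesterday hired Fred Smith as president.",
        "General Motors hired Fred Smith as president.",
    ]
    for sentence in examples:
        print(sentence, "->", extract_hire(sentence))
```

On the first sentence the sketch recovers the intended entry. The others show why the pattern "won't get us very far": the passive sentence is not matched at all, the modifier "yesterday" is mistaken for the company, and only "Motors" is captured from "General Motors". These are the name recognition and syntax problems listed in the Information Extraction section.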