Information Systems
Natural Language Processing
Advanced Higher
9020
Summer 2001
HIGHER STILL
Information
Systems
Natural Language Processing
Advanced Higher
Support Materials
This publication may be reproduced in whole or in part for educational purposes provided that no profit
is derived from the reproduction and that, if reproduced in part, the source is acknowledged.
First published 2001
Learning + Teaching Scotland
Northern College
Gardyne Road
Broughty Ferry
Dundee
DD5 1NY
Tel. 01382 443 600
CONTENTS
Section 1: Motivation and context for Natural Language Processing
Section 2: Analysing Natural Language Processing
Section 3: Parsing and General Techniques
Section 4: Further Techniques used in Natural Language Processing
Information Systems: Natural Language Processing (AH)
1. MOTIVATION AND CONTEXT FOR NATURAL LANGUAGE PROCESSING
1.1 The Nature and Role of Natural Language Processing
Natural Language Processing (NLP) is a sub-field of Artificial Intelligence. It is
also known as Computational Linguistics.
NLP is concerned with the production and comprehension of natural languages
such as English or Russian. It deals largely with written language or text, but there is
some consideration of spoken language, including phonology, the study of the sounds
that make up a language.
You can find an interesting glossary of terms used in NLP on the World Wide Web at:
http://www.cs.bham.ac.uk/~pxc/nlpa/nlpgloss.html
1.1.1 Functions of language in human communication
Language is the principal method of communication between humans. We use it for a
number of different purposes:
• Transferring information by making statements, e.g. ‘It’s raining outside.’ This allows us to benefit from the experience of others and reduces the amount of exploration or information seeking each individual has to carry out.
• Querying others about some aspect of the world or their experience of it. This is done by asking questions, e.g. ‘What’s the weather like in Spain?’
• Answering questions, e.g. ‘The rain in Spain falls mainly on the plain.’ This can also be regarded as a form of information transfer.
• Making requests or giving commands, i.e. asking others to carry out actions: ‘Peel me a grape’. Direct commands are sometimes considered impolite, so requests are often phrased indirectly, e.g. ‘Could you peel a grape for me?’ or ‘I could use some help peeling these grapes.’
• Promising to do something or offering deals, e.g. ‘I’ll buy the grapes if you’ll peel them.’
• Acknowledging offers or requests, e.g. ‘OK’, ‘Fine’ etc.
• Sharing feelings or experiences, e.g. ‘I really enjoy listening to Mozart.’
Notice that some of these (informing, answering, acknowledging and sharing) are
intended to transfer information to the listener, while others (requesting, commanding,
querying) are intended to prompt the listener to take some action.
Some communications such as greetings ‘Good morning. How are you today?’ ‘I’m
fine thanks. How are you?’ are intended only to build and reinforce social links and
convey little or no real information.
1.1.2 Language as a sign of intelligence
No one really knows if we use language because we’re intelligent or if we’re
intelligent because we use language. Jerison suggests that human language arises
from the need for better ‘cognitive maps’ of our territory. He points out that dogs and
other carnivores rely largely on scent marking and their sense of smell to tell them
where they are and what other animals have been there. The early primates (30
million years ago) lacked this well-developed sense of smell and substituted sounds
for scent marking. Language may simply be a means of compensating for our
inadequate noses!
1.1.3 Natural language and other forms of communication
Natural language is not the only form of communication which exists: we’ll look
briefly at four others: sign language, non-human communication, programming
languages and formal logic.
Sign Language
Sign languages, such as British Sign Language (BSL) and American Sign
Language (ASL) are true languages, with vocabularies of thousands of words, and
grammars as complex and sophisticated as those of any spoken or written language.
BSL is now the fourth most widely used language in the UK and a major campaign is
under way to have it officially recognised by the government, as has already happened
in most EU countries.
ASL is quite different from BSL - it is more closely related to French Sign Language,
due to the influence of Laurent Clerc, the first teacher of the deaf in the United States.
ASL has a Topic-Comment syntax, while English uses Subject-Verb-Object order. In terms
of syntax, ASL shares more with spoken Japanese than it does with English.
Sign languages are not an invented system like Esperanto. They are linguistically
complete, natural languages and are the native languages of many deaf men and
women, as well as some hearing children born into deaf families.
Sign languages are sometimes described as gestural languages. This is not absolutely
correct because hand gestures are only one component. Facial features such as
eyebrow motion and lip-mouth movements are also significant and form a crucial part
of the grammatical system. Sign languages also make use of the space surrounding the
signer to describe places and persons that are not present.
Sign languages have a very complex grammar. Unlike spoken languages where there
is only a single stream of sounds, sign languages can have several things happening at
the same time. For instance, the concept of ‘very big’ is conveyed by the simultaneous
use of a hand gesture for ‘big’ and a mouth/cheek shape for ‘very’. Sign languages
have their own morphology (rules for the creation of words), phonetics (rules for hand
shapes), and grammar that are very unlike those found in spoken languages.
Sign languages should not be confused with Signed English, which is a word-for-word signed equivalent of English. Deaf people tend to find it tiring, because its grammar, like that of spoken languages, is linear, while that of sign languages is primarily spatial.
Non-Human Communication
It is often stated that one of the great differences between humans and animals is the
ability of humans to use language. This has frequently been challenged, particularly
in studies of the use of language by primates. These studies have generally followed
one of two paths: the use of language by primates in the wild and attempts to teach
some form of language (not necessarily spoken) to primates in captivity.
Wild primates use a variety of methods of communication. Many use scents to mark
their territory and they use touch to indicate relationships: mothers carry their young
and adults may sit and/or sleep together or groom each other. The higher primates
look at whatever they are paying attention to. Important visual cues include facial
expression, hair erection, general posture, and tail position.
Primates use vocal communication, from soft grunts to whoops, when they want to
attract the attention of others. Sounds may be used to signal danger of an attack or the
location of a food source. The meaning of primate communication depends on the
social and environmental context as well as the particular signals being used. Most
animals use a fixed set of signals to represent messages, which are important to their
survival (food here, predator nearby, approach, withdraw etc.).
Vervet monkeys have the most sophisticated animal communication that we know of.
The sounds they use are learned, rather than instinctive. They have a variety of calls
for different predators: a loud bark for leopards, a short cough for eagles and a chatter
for snakes. They also use one type of grunt to communicate with dominant members
of their own group, another to communicate with subordinate members and a third
type to communicate with members of other groups. They are even capable of lying!
A vervet that is losing a fight may make the leopard alarm, causing the whole group to
run for the trees and forget the fight.
There have been numerous attempts to teach some kind of language to primates.
Researchers argue that projects of this nature can provide valuable information, not
only about the nature of language and cognitive and intellectual capacities, but also
about such issues as the uniqueness of human language and thought. Such projects
also shed light on the early development of language in humans. Another reason for
teaching language to primates is the hope of discovering better methods for training
children with learning difficulties who fail to develop linguistic skills during their
early years.
Allen and Beatrice Gardner began teaching American Sign Language to an infant
chimpanzee named Washoe in 1966. They provided a friendly environment that they
believed would be conducive to learning. The people who looked after Washoe used
only sign language in her presence. She was able to transfer her signs spontaneously
to a new situation, e.g. she used the word ‘more’ in a variety of contexts, not just for
more tickling, which was the first context.
The Gardners reported that Washoe began to use combinations of signs spontaneously
after learning only about eight or ten of them. At one stage Washoe ‘adopted’ an
infant chimpanzee named Loulis. For the next five years, no sign language was used
by humans in Loulis' presence; however, Loulis still managed to learn over 50 signs
from the other chimpanzees.
The year after Project Washoe began, David and Ann Premack started an experiment
with a different kind of language. They used plastic tokens, which represented words
and varied in shape, size, texture, and colour to train a chimpanzee named Sarah.
Sentences were formed by placing the tokens in a line. Sarah was taught nouns,
verbs, adjectives, pronouns, quantifiers, same-difference, negation, and compound
sentences. To show that she was not simply responding to cues from her trainers, she
was introduced to a new trainer who didn’t know her language. When this trainer
presented her with questions, she gave the correct answers less frequently than usual,
but still well above chance.
A chimpanzee named Lana learned to use another language system, a keyboard with
keys for various lexigrams, each representing one word. When Lana pressed a key
with a lexigram on it, the key would light up and the lexigram would appear on a
projector. If keys were pressed accidentally, Lana used the period key as an eraser so
that she could restart the sentence - she did this on her own before it occurred to the
researchers.
Lana started using ‘no’ as a protest (e.g. when someone else was drinking a Coke and
she did not have one) after having learned it as a negation. Lana acquired many skills
which showed her ability to abstract and generalise, e.g. she spontaneously used ‘this’
to refer to things for which she had no name, and she invented names for things by
combining lexigrams in novel ways.
However, many linguists, including the highly influential Noam Chomsky, argue that
language is a uniquely human gift. According to this school, chimpanzees and other
close relatives cannot use language because they lack the human brain structures that
make language work. Chomsky argues that trying to teach language to a chimpanzee
is a bit like teaching a human being to fly. An athlete may be able to jump 20 feet, but
it’s a crude imitation of flying.
Programming Languages
Programming languages have a number of features in common with natural
languages, but there are also significant differences. Programming languages have a
lexicon (or vocabulary) and rules governing how sentences in the languages are
constructed. Most languages allow two different kinds of words, usually referred to as
keywords and identifiers. There are a fixed number of keywords, e.g. begin, end, do,
while etc. and these have a fixed function. There are an infinite number of identifiers.
These are usually associated with a fixed function at the time of declaration, e.g.
procedure name, variable name etc. In general, computer programmers have far more
ability to generate new words than the speakers of a natural language, although their
new words are often influenced by natural languages, e.g. CustName, TotPrice etc.
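The keyword/identifier distinction described above can be sketched in a few lines. This is a toy classifier, not a full lexer, and the keyword set is a small illustrative subset of a Pascal-like language:

```python
import re

# A fixed, finite set of keywords (illustrative subset only).
KEYWORDS = {"begin", "end", "do", "while", "if", "then"}

def classify(word: str) -> str:
    """Classify a word as a language keyword or a programmer-chosen identifier."""
    return "keyword" if word in KEYWORDS else "identifier"

source = "while TotPrice do begin CustName end"
tokens = [(w, classify(w)) for w in re.findall(r"\w+", source)]
print(tokens)
# 'while', 'do', 'begin' and 'end' are keywords with fixed functions;
# 'TotPrice' and 'CustName' are identifiers invented by the programmer.
```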
The syntax (grammatical rules) of modern programming languages can be rigorously
defined and can often be expressed in a formal notation such as BNF (Backus
Normal Form) or Syntax Diagrams. Unfortunately it’s not quite as easy to describe
the semantics (meaning) of a programming language in a formal manner and this is
normally still done by means of an English description. However, it’s still
considerably more rigorous than a natural language.
Formal Logic
One important area of NLP is the study of the semantics or meaning of natural
language statements. In many cases, the most important aspect of semantics is
determining whether a sentence is true or false. We can simplify this task by defining
a formal language with simple semantics and mapping natural language sentences on
to it.
This formal language should be unambiguous, have simple rules of interpretation and
inference and have a logical structure determined by the form of the sentence. Two
commonly used formal languages are propositional logic and predicate logic.
Formal Logic is covered in greater detail in section 4.1.4.
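As a sketch of this mapping (the predicate names are illustrative choices, not a standard notation): in propositional logic, whole clauses become atomic symbols, so ‘It is raining and it is cold’ maps simply to $p \wedge q$. Predicate logic exposes more internal structure, so ‘Every student reads some book’ can be rendered as:

```latex
\forall x\,\bigl(\mathit{Student}(x) \rightarrow
    \exists y\,(\mathit{Book}(y) \wedge \mathit{Reads}(x, y))\bigr)
```

Once a sentence is in this form, truth can be evaluated against a model, and new facts can be derived by mechanical rules of inference.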
1.1.4 Natural Language Modalities (Speech and Text)
Natural language occurs in two distinct forms, text and speech. Although these can be
considered as two different ways of expressing the same information there are
important distinctions. Speech is usually less formal than text, but it can convey
important additional information by means of volume, tone of voice etc. that are
absent in text. It can also be more confusing as a result of accent, mispronunciation
etc.
NLP has traditionally focused on text with Speech Recognition and Speech
Generation being regarded as relatively disparate fields. However, in recent years
there has been a degree of convergence as researchers have realised that a knowledge
of language structure can assist in recognition or generation. We’ll look briefly at
these fields.
Speech Recognition
Speech recognition is the process by which a computer converts an acoustic speech
signal to text. It should be distinguished from speech understanding, the process by
which a computer converts an acoustic speech signal to some form of abstract
meaning.
Speech recognition systems can be speaker-dependent or speaker-independent. A
speaker-dependent system is designed to operate for a single speaker. These systems
are usually easier to develop, cheaper to buy and more accurate, but not as flexible as
speaker-adaptive or speaker-independent systems.
A speaker-independent system is designed to operate for any speaker of a particular
language. These systems are the most difficult to develop, most expensive and
accuracy is lower than speaker dependent systems. However, they are more flexible.
A speaker-adaptive system is developed to adapt its operation to the characteristics
of new speakers. Its difficulty lies somewhere between speaker-independent and
speaker dependent systems.
The size of vocabulary of a speech recognition system affects its complexity,
processing requirements and accuracy. Some applications only require a few words
(e.g. numbers only), others require very large dictionaries (e.g. dictation machines).
An isolated-word system operates on single words at a time - requiring a pause
between each word. This is the simplest form of recognition to perform because the
end points are easier to find and the pronunciation of a word tends not to affect
others. Thus, because the occurrences of words are more consistent, they are easier to
recognise.
A continuous speech system operates on speech in which words are not separated by
pauses. Continuous speech is more difficult to handle for a variety of reasons. It is
difficult to find the start and end points of words. Another problem is coarticulation - the production of each phoneme is affected by the production of surrounding
phonemes, and similarly the start and end of words are affected by the preceding and
following words. The recognition of continuous speech is also affected by the rate of
speech. Rapid speech tends to be harder.
Speech recognition starts with the digital sampling of speech, followed by acoustic
signal processing. The next stage is recognition of phonemes, groups of phonemes
and words. Most systems utilise some knowledge of the language to aid the
recognition process. Some systems try to ‘understand’ speech, i.e. they try to convert
the words into a representation of what the speaker intended to mean or achieve.
Speech Synthesis
Speech synthesis programs convert written input to spoken output by automatically
generating synthetic speech. Speech synthesis is often referred to as ‘Text-to-Speech’
conversion (TTS). There are several algorithms available. The easiest way is simply to
record the voice of a person speaking the desired phrases. This is useful if only a
restricted set of phrases and sentences is used, e.g. messages in a train station, or
schedule information via phone. The quality depends on the way the recording is done.
More sophisticated, but poorer in quality, are algorithms that split the speech into
smaller pieces. The smaller those units are, the fewer of them there are, but the
quality also decreases. One frequently used unit is the phoneme, the smallest
linguistic element. Depending on the language used there are about 35-50 phonemes
in western European languages, i.e. there are 35-50 single recordings. The problem is
combining them, as fluent speech requires fluent transitions between the elements.
The intelligibility is therefore lower, but the memory required is small.
One solution to this dilemma is the use of diphones. Instead of splitting at the
transitions, the cut is done at the centre of the phonemes, leaving the transitions
themselves intact. This gives about 400 elements (20 × 20) and the quality increases.
The longer the units become, the more elements there are, and the quality increases
along with the memory required. Other units that are widely used are half-syllables,
syllables, words, or combinations of them, e.g. word stems and inflectional endings.
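Since every phoneme can in principle be followed by any other, the diphone inventory grows with the square of the phoneme count. A quick sketch (the figures are illustrative, following the numbers used above):

```python
# Diphone inventory size grows quadratically with the number of phonemes,
# because each phoneme may be followed by any other.
def diphone_count(n_phonemes: int) -> int:
    return n_phonemes * n_phonemes

print(diphone_count(20))   # 400, the figure quoted above
print(diphone_count(45))   # over 2000 for a typical western European language
```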
1.1.5 Ambiguity
Natural languages, including English, display a remarkable amount of ambiguity. The
same word, phrase or sentence can have a whole variety of different meanings.
Consider a simple sentence like ‘He made her duck’. This has at least four different
meanings:
• He cooked duck for her
• He cooked a duck, which she had provided
• He caused her to lower her head
• He made a duck for her (presumably from wood, plaster or some other substance).
Ambiguity, and the process of resolving it (disambiguation) are considered in detail in
Section 2.
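The lexical side of this ambiguity can be illustrated with a short sketch. The mini-lexicon below is a hypothetical toy, not a real tagger: each word is assigned its possible parts of speech, and the number of candidate readings multiplies across the sentence before any grammar has filtered them.

```python
from itertools import product

# A hypothetical mini-lexicon: each word maps to its possible parts of speech.
LEXICON = {
    "he":   ["pronoun"],
    "made": ["verb"],
    "her":  ["possessive", "pronoun"],   # 'her duck' (the bird) vs 'made her ...'
    "duck": ["noun", "verb"],            # the bird vs the action of ducking
}

def tag_sequences(sentence):
    """Enumerate every possible part-of-speech assignment for the sentence."""
    words = sentence.lower().split()
    choices = [LEXICON[w] for w in words]
    return [list(zip(words, combo)) for combo in product(*choices)]

readings = tag_sequences("He made her duck")
print(len(readings))  # 4 candidate tag sequences before any grammar filtering
```

Real sentences, with longer words lists and richer lexicons, generate far more candidates, which is why disambiguation is such a central problem.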
1.1.6 Complexity and Scale
One of the problems faced by NLP researchers is the sheer complexity and scale of
natural language, in terms of the amounts of knowledge of different kinds needed to
describe it. We may justifiably regard certain types of computing systems as large or
complex, for example, airline reservation systems or banking systems, but they are far
less complex than natural languages.
The complexity of natural languages arises largely from three areas: the range of
words available in the language (its lexicon), the grammar of the language (its syntax)
and the meaning of sentences within the language (its semantics).
The Oxford English Dictionary lists more than half a million words, but the English
language is generally reckoned to have a lexicon of around a million words, although
this is difficult to establish precisely because of the number of related words, e.g. are
‘climb’, ‘climbing’ and ‘climber’ three different words, or one word with several
suffixes?
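The ‘climb/climbing/climber’ question can be made concrete with a crude suffix-stripping sketch. This is a toy stemmer, far simpler than real morphological analysers, and the suffix list is an illustrative assumption:

```python
# A toy stemmer: strip a few common English suffixes to find a shared stem.
# Real morphology is far messier (consider 'running' -> 'run', 'flies' -> 'fly').
SUFFIXES = ["ing", "er", "ed", "s"]

def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Only strip if a plausible stem (3+ letters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["climb", "climbing", "climber"]
print({crude_stem(w) for w in words})  # all three reduce to {'climb'}
```

Whether the three surface forms count as one lexical entry or three then becomes a question about where stemming like this is applied.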
This lexicon isn’t fixed. We keep adding to it by borrowing words from other
languages. You can find an interesting list of these at:
http://www.krysstal.com/borrow.html
New words (or neologisms) are also coined continuously, often in technical areas.
Recent additions to the English language include ‘cyberspace’ and ‘gameboy’.
A further level of complexity is added by the syntax of natural languages. The rules
by which words may be combined to produce phrases and sentences are highly
complex and allow for the generation of an infinite number of possible sentences.
Some words can belong to more than one grammatical category (e.g. ‘flying’ can be
an adjective or a verb), adding to the number of possible meanings of sentences.
Another level of complexity comes from the meaning, or semantics, of words and
phrases. Many English words have more than one meaning, and they keep gaining
new ones, e.g. the use of ‘house’ or ‘garage’ to describe types of music.
1.2 Goals for the Development of NLP
NLP has two primary goals. The technological goal is to build intelligent computer
systems, such as natural language interfaces to databases, machine-translation
systems, text analysis systems, speech understanding systems, or computer-aided
instruction systems. This goal cannot be achieved without using sophisticated theories
of the type being developed by theoretical linguists.
The linguistic, or cognitive science, goal is to gain a better understanding of how
humans communicate by using natural language. This second motivation is not unique
to NLP, but is shared with theoretical linguistics and psycholinguistics. The current
state of knowledge about natural language processing is so preliminary that it is not
yet feasible to build a complete model of human communication – this would require
major advances in both NLP and the experimental techniques used by
psycholinguistics.
1.3 Potential Application Areas for NLP
1.3.1 Machine Translation
Machine Translation (MT) can be defined as the use of computer systems to
translate text from one natural language to another, with or without human assistance.
An old joke tells of the scientist who devised a machine to translate English into
Russian and vice versa. To test his machine he input the proverb ‘Out of sight, out of
mind’. The machine translated it into Russian and translated the Russian back into
English. The final output was ‘Blind lunatic’.
Although there is an element of truth here, the joke underestimates the importance
which machine translation has now acquired. There are many types of MT system,
ranging from MAHT (Machine Aided Human Translation) where the emphasis is on
the human element, through to FAHQMT (Fully Automated Human Quality Machine
Translation).
There are three broad categories of MT strategy:
• Direct systems: these carry out a word-for-word translation between source and target language with no intermediate representation. One major limitation of this type of system is poor handling of long-distance dependencies.
• Transfer systems: language-dependent intermediate representations are used between each language pair to be translated between (e.g. English-to-French, French-to-English etc.). Thus, information derived in the analysis stage may provide input to the synthesis stage.
• Interlinguas: the target text is generated from an intermediate language-neutral representation, itself built up from the source text. The analysis and synthesis stages are completely separate.
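A direct system can be sketched in a few lines (the English-to-French glossary below is a tiny illustrative stand-in, not a real bilingual dictionary). The word-for-word output also shows why such systems struggle as soon as word order differs between the languages:

```python
# A minimal 'direct' MT sketch: word-for-word lookup, no analysis stage.
# The glossary is a tiny illustrative stand-in for a bilingual dictionary.
GLOSSARY = {
    "the": "le", "cat": "chat", "black": "noir", "sleeps": "dort",
}

def direct_translate(sentence: str) -> str:
    # Unknown words are passed through unchanged, as many early systems did.
    return " ".join(GLOSSARY.get(w, w) for w in sentence.lower().split())

print(direct_translate("the black cat sleeps"))
# -> 'le noir chat dort': the adjective order is wrong ('le chat noir'
# is correct French), because a direct system never reorders words.
```

Transfer and interlingua systems exist precisely to repair this kind of failure by analysing structure before generating the target text.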
MT has been a major research area since the 1950s. However, MT systems are still far
short of the overall quality achieved by human translators, although they can offer
major benefits in terms of cost and management.
There are three standard ways of improving the quality of output from MT systems:
• Text mark-up facilitates easier translation through the addition of helpful information to the source text.
• Controlled languages are easy-to-translate, general-purpose languages, based on a simple grammar and vocabulary.
• Sublanguages focus on a particular field, allowing more complex grammar and vocabulary within a small range of documents.
The use of sublanguages allows restrictions to be placed on the range of text for
translation and improves the MT output quality without any significant increase in
processing demand. For many types of subject-specific knowledge there will be an
associated sublanguage, e.g. weather forecasting, software manuals.
The restrictions in syntax complexity and size of vocabulary offer many benefits
towards MT system design, including simpler analysis and synthesis modules, a
smaller lexicon, and the avoidance of difficult constructs such as idioms.
In general, system complexity is reduced. However, these benefits should be seen as a
trade-off with the ability of the system to act in a general-purpose way and handle
novel constructs. It may also be difficult to reuse components within other
sublanguage applications.
Sublanguages have provided some major successes. One example is the Météo
system, which is possibly the most successful MT system to date and has been
translating weather reports from English to French for the Canadian Office of
Meteorology for nearly two decades. Météo is based on the specialised sublanguage
of weather forecasts, uses a simple set of temporal dimensions and avoids idioms.
Researchers at the University of Montreal realised that Météo's success was due to the
restricted nature of the texts it worked on and looked at eliminating the input text
altogether in favour of data gathered directly from weather stations. This approach led
to the development of a system that produces parallel English and French weather
bulletins for the Canadian eastern seaboard. The planning of what will be said and in
what order is done once for both languages. It is only towards the end that the
processes diverge. The same approach is now being taken with reports on labour
statistics.
Over the last few years, several web-sites have started offering automatic translations
of short documents, such as web pages. Take a look at the following:
http://babelfish.altavista.com/translate.dyn
http://www.freetranslation.com/
If you want to find out more about machine translation there is an entertaining and
easy-to-read book about it available on the World Wide Web at:
http://clwww.essex.ac.uk/~doug/book/book.html
1.3.2 Text Retrieval and Question Answering
Text retrieval is the process of matching a user query against free-text records, such
as bibliographic records, newspaper articles or sections of a manual. Queries can
range from multi-sentence descriptions of the information required to a few words.
Text retrieval systems currently in use range from simple Boolean systems through to
systems making extensive use of NLP.
The recent growth of the Internet, and especially the World Wide Web, has led to new
search requirements from users who want effective and user-friendly searching
systems. At the same time, computer hardware has become capable of running
complex searches against massive amounts of data with acceptable response times.
This combination of factors has produced a demand for more effective search
methodologies, making greater use of NLP techniques.
Reasons for using NLP in text retrieval are mostly intuitive: users normally decide on
the relevance of documents by reading and analysing them, so if we can automate
document analysis this should help in the process of deciding on document relevance.
However, the use of NLP techniques has not yet significantly improved performance
in text retrieval.
Most researchers believe that it is easier to improve the effectiveness of text retrieval
by means of statistical methods than by NLP-based approaches. Only a small
proportion of current research is based on NLP techniques, but NLP resources like
thesauri, lexicons, dictionaries and proper name databases, are used regularly. It
seems that NLP resources are having more of an impact than NLP techniques on text
retrieval at present. One reason for this is that NLP techniques are not generally
designed to handle large amounts of text from different subject areas.
There is, however, an inherent mismatch between the statistical techniques used in
text retrieval and the linguistic techniques used in NLP. The statistical techniques
attempt to match the rough statistical approximation of a record to a query. Further
refinement of this process using NLP techniques often adds only noise to the
matching process, or fails because of the inconsistencies of language use.
The proper integration of these two techniques is difficult. What we really need are
NLP techniques designed specifically for text retrieval along with text retrieval
techniques developed specifically for taking advantage of NLP techniques.
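The statistical side of this trade-off can be sketched with a minimal bag-of-words matcher. This ranks documents purely by shared-term counts, ignoring all linguistic structure; real systems add weighting schemes such as tf-idf, and the documents below are invented examples:

```python
from collections import Counter

# Three invented example documents for illustration.
DOCUMENTS = {
    "doc1": "machine translation of weather reports",
    "doc2": "statistical methods for text retrieval",
    "doc3": "grammar and parsing of natural language",
}

def score(query: str, text: str) -> int:
    """Count terms shared between query and document (a crude match score)."""
    q = Counter(query.lower().split())
    d = Counter(text.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def rank(query: str):
    """Return document names ordered from best match to worst."""
    return sorted(DOCUMENTS, reverse=True,
                  key=lambda name: score(query, DOCUMENTS[name]))

print(rank("statistical text retrieval"))  # doc2 ranks first
```

Nothing here knows that ‘retrieval’ and ‘searching’ mean similar things, which is exactly the gap NLP resources such as thesauri are used to fill.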
1.3.3 Command and Control
Command and control systems involve interaction with devices, often via speech
recognition. They are widely used in defence-related areas but also provide useful
aids for the handicapped and speech driven applications such as word processing. A
number of significant products have appeared on the market in recent years. Some of
them are described briefly below:
Dragon NaturallySpeaking
Dragon NaturallySpeaking Standard lets you communicate with your PC by speaking.
You can write reports, letters, and e-mails in virtually any Windows application with
your voice. Specialist versions are available for the legal and medical professions.
Product features include:
• revise, edit and format text: with select-and-say editing features.
• manage e-mail by voice: create and send e-mail by voice; listen to your messages read aloud.
• customise vocabulary: with names and terms you use.
• quick correct list: proofread and make changes to your work as you go.
• intuitive commands: say commands that make sense to you in Word 97/2000.
• switch between applications: launch programs by voice and switch between applications just by saying so.
You can get more information about Dragon NaturallySpeaking at:
http://www.dragonsys.com/
and you can read a complete online book about Dragon NaturallySpeaking at:
http://www.sayican.com/lib/sayican/onlinebook.html
IBM ViaVoice
IBM's ViaVoice for Windows allows you to dictate, format and edit text directly in
popular word processing applications like Microsoft Word 97 and 2000. Or dictate
into ViaVoice SpeakPad and transfer your words into other Windows applications -
such as e-mail - with a single voice command. Personalised Attention Words help
dictation and voice commands work more smoothly, and the Text-To-Speech feature
reads back what you've already dictated for quick, easy editing and correction of
spelling and grammar. Product features include the following:
• Direct Dictation in Microsoft Word 97 and 2000: now you can speak, format and edit directly into this popular application.
• Natural Commands in Microsoft Word 97 and 2000: can make working with Word easier as you use everyday language rather than 'computer speak' to command functions within this popular word processor.
• SpeakPad: the simplest way to dictate, format and edit text, directly into the ViaVoice speech-enabled word processor.
• Improved accuracy and correction: greater accuracy and quicker correction means your letters, e-mail and homework can be finished faster and more easily and with greater product satisfaction.
• Text-To-Speech: editing made easy. Listen as ViaVoice reads back the text you have dictated and formatted so you know just what it is you've dictated.
• Attention Words: choose whatever words you like to ensure ViaVoice knows when you are issuing a command rather than dictating text.
• 60,000-word vocabulary: the greater the vocabulary, the greater the accuracy as ViaVoice recognises the words you use.
• Teach ViaVoice new words: greater accuracy when you use special words, terms, nicknames, acronyms, pet phrases, addresses and idioms; ViaVoice can learn to recognise them, and add them to the vocabulary.
• Command the Web: your voice is the easy and natural way to quickly navigate basic Web commands in selected browsers.
You can get further information about ViaVoice from IBM's ViaVoice web site at:
http://www-4.ibm.com/software/speech/desktop/w8-win.html
Game Commander
This program is a bit different, as it's designed to provide voice control for computer
games. The original Game Commander won numerous awards for bringing voice
control to games. Game Commander 2 breaks new ground with lightning fast
command response and even more control over your games. You can even run it
concurrently with popular voice chat programs. It can take command of Windows
applications too.
• Voice commands with no training: put the power of speaker-independent voice control to work immediately without tedious voice training.
• Customisable audible feedback: assign your own sounds and recorded speech to hear your commands being acknowledged and enhance the gaming experience.
• Global commands: common commands are available across all applications.
• Automatic command file loading: the right commands are always ready as soon as you need them. No need to fuss with files while you work and play.
• Powerful command editing: the Game Commander Studio gives you full access to all your commands and supports cut, copy, and paste operations to make editing easy.
• Multi-channel auto fire: say a command and have it repeated until you tell it to stop. Issue more commands while auto fire runs, including more auto fire commands!
• Massive macro capabilities: unleash up to 256 keystrokes per voice command.
• Easy keystroke entry: just press the key as you would in the game. Many special Windows keys and combinations are also supported.
• Adjustable actions: fine-tune any keystroke or action for maximum control.
• Extended actions: configurable delay, key up, and key down actions, and step sequencing add more control capabilities than ever before.
• Works with many voice chat programs: use push-to-talk to switch between Game Commander and popular voice chat programs (Windows 9x and Me only) or use push-to-talk alone to enable command recognition only when you need it.
• Voice training: for special cases, strong accents, or non-English commands, voice training takes only three utterances, not ‘War and Peace’.
You can get more information about Game Commander from:
http://www.gamecommander.com
The Game Commander 2 documentation is available for download in Adobe Acrobat
format from:
http://www.gamecommander.com/misc/
1.3.4 Text Analysis
We might want to use computers to analyse texts for a variety of reasons, e.g. to
determine the readability of a piece of text or its suitability for readers of a specified
level; to determine the authorship of a text, etc.
Readability Analysis
A number of word processing programs, e.g. Word and Word Perfect, now have
grammar checkers built in to them and other grammar and style checkers (such as
Correct Grammar and Grammatik) are available separately. These programs can be
useful for some kinds of textual analysis by providing statistics revealing average
sentence length, average number of syllables per word, percentage of sentences in
passive voice, and a readability index such as the Flesch-Kincaid Grade Level or the
Gunning Fog Index.
Microsoft Word can be used to provide readability statistics for documents. Click
Tools on the menu bar, then choose Options, followed by Spelling and Grammar,
then check the ‘readability statistics’ box under Grammar.
The Flesch-Kincaid grade level arose from a book entitled ‘The Art of Readable Writing’, published by Dr. Rudolf Flesch in 1949, in which he described a simple method of analysing readability. He analysed text samples of about 100 words,
assigning each sample a readability index based upon the average number of syllables
per word and the average number of words per sentence. Most scores range from 0 to
100. College graduates should be able to follow prose in the 0 - 30 range. Scores of 50
- 60 are high-school level and 90 - 100 should be readable by fourth graders, i.e.
children who have completed four years of primary education.
The General Motors Corporation automated Flesch’s algorithm in the early 1970s. The program, called GM-STAR (General Motors Simple Test Approach for Readability), was used to ensure that workshop manuals were readable. The key to this program is a very simple algorithm for counting the number of syllables in a word.
The Flesch Index (F) for a given text sample is calculated from three statistics:
• The total number of sentences (N)
• The total number of words (W)
• The total number of syllables (L)
Though crude, since it is designed simply to reward short words and sentences, the
index is useful. It gives a basic, objective idea of how hard prose is to wade through.
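The text does not state the formula itself; the standard Flesch Reading Ease calculation combines the three statistics as F = 206.835 - 1.015(W/N) - 84.6(L/W). A minimal sketch in Python, with illustrative sample figures:

```python
def flesch_index(n, w, l):
    """Flesch Reading Ease score from the three statistics:
    n sentences, w words, l syllables. Higher scores mean easier prose."""
    return 206.835 - 1.015 * (w / n) - 84.6 * (l / w)

# An illustrative 100-word sample: 5 sentences, 130 syllables.
print(flesch_index(5, 100, 130))  # roughly 76.6: plain, readable prose
```

On this scale, scores of 0 - 30 correspond to graduate-level prose and 90 - 100 to text readable by fourth graders, matching the ranges quoted above.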
Fog Indexes are also used to provide a broad estimate of the grade level (number of
years of education) required to understand written material. (Some writers suggest
that ‘fog’ is an acronym for ‘frequency of gobbledygook’.) Even if grade level
standards change, an index can still be used to estimate the relative difficulty of the
material.
There are several fog indexes, but the one developed by Robert Gunning in
‘Technique of Clear Writing’ is one of the simplest and most effective.
A simplified version of Gunning's procedure is as follows:
a. Pick any 100-word segment of text.
b. Count the sentences in the segment (a fragment of a sentence at beginning or
end counts as a whole sentence).
c. Divide 100 by the number of sentences (= average sentence length).
d. Count the words in the segment with 3 or more syllables.
e. Add the two numbers (average sentence length + number of 3-or-more-syllable words).
f. Multiply the sum by 0.4.
g. Round off the result to the nearest whole number. This is the GFI.
The GFI is not an absolute measure, e.g. reader familiarity with the subject matter is
also important in determining readability.
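Steps a - g above can be sketched in Python. The syllable counter here is an assumption, a crude vowel-group heuristic of the kind GM-STAR used, since the text does not specify one:

```python
def count_syllables(word):
    """Crude heuristic: each run of consecutive vowels is one syllable."""
    vowels = "aeiouy"
    count, prev = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev:
            count += 1
        prev = is_vowel
    return max(count, 1)

def gunning_fog(segment):
    """Simplified Gunning Fog Index for a (roughly 100-word) segment."""
    words = segment.split()
    sentences = max(sum(segment.count(p) for p in ".!?"), 1)
    avg_sentence_length = len(words) / sentences
    hard_words = sum(1 for w in words
                     if count_syllables(w.strip(".,;:!?")) >= 3)
    return round(0.4 * (avg_sentence_length + hard_words))
```

For example, a segment of short, simple sentences yields a GFI of around 2, i.e. readable after about two years of schooling.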
Computer-Based Text Analysis
The first tool of computer-based text analysis was the ‘Concordance’ - a tool that was
transformed by the computer, but originated in methods that go back to the middle
ages. The concordance grew out of medieval biblical scholarship that tried to find
parallels between the Old and New Testaments by finding places where the words in
the text of the Old Testament foreshadowed a passage in the New. Scholars became
aware that it would be useful to group words into categories, and develop indexes that
pointed to all (or at least many) occurrences of those words in the different books of
the Bible. Thus was started the early thematic concordance, which named the major
people, places, things and ideas that appeared in the Bible.
In the late 1940s Roberto Busa, an Italian philosopher, decided to produce a concordance of the complete writings of the medieval scholar Thomas Aquinas. Although computers were virtually unknown in Italy at the time, it was clear that the task required some kind of machinery. The work was begun with punched cards and card sorting machines and was completed (33 years later) in the 1970s, using
large IBM mainframe computers with computer-driven typesetting equipment. With
various indexes and other associated information, the Index consists of about 70,000
typeset pages.
There were two full concordances. One, produced directly by the computer, was a
complete list of the occurrences of all word forms. This type of concordance, called
unlemmatised, lists all word forms under separate entries. Busa's concordance also included a lemmatised concordance, where the headwords are standardised as they might appear in a dictionary - different forms of each verb or noun are gathered under a single entry.
For the lemmatised concordance, the computer could not automatically bring all related forms together on its own. Thus, the lemmatised version was a machine-assisted concordance, requiring significant human interaction.
Early on the KWIC (Keyword in Context) concordance format was developed. The
entire vocabulary of the work is listed in alphabetical order. Each word form, called a
headword, is followed by its occurrences. Each occurrence, in turn, is given on a
separate line consisting, first, of some ‘reference information’ that helps the KWIC
user locate the occurrence in the full text, and then by a brief excerpt that shows the
word in its context - hence the name.
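A KWIC index of this kind is easy to sketch in Python. Here the ‘reference information’ is simply the word's position in the text - an assumption for illustration, since real concordances use book, chapter and line references:

```python
def kwic(text, width=3):
    """Build a KWIC concordance: each headword is mapped to its
    occurrences, given as (position, context excerpt) pairs."""
    words = text.lower().split()
    index = {}
    for i, w in enumerate(words):
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        index.setdefault(w, []).append((i, f"{left} [{w}] {right}".strip()))
    return dict(sorted(index.items()))  # headwords in alphabetical order

for head, occurrences in kwic("to be or not to be").items():
    print(head, occurrences)
```

Each headword appears once, in alphabetical order, followed by one line per occurrence - the KWIC format described above.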
Most of the early computer work in text analysis was in the production of
unlemmatised concordances. Obviously, if any work was to be done with a computer
on a text, the text itself had to be in ‘machine-readable form’. By the end of the 1970s
it was clear that more texts needed to be available in electronic form and standard
software was required.
Oxford University became an early leader in both these areas with the establishment of the Oxford Text Archive, a repository for electronic texts, and the Oxford Concordance Program (OCP), a mainframe-based text analysis system. Both OCP
and the Archive are still important today. Nowadays, a significant part of the
Archive's holdings are available via the Internet. OCP has been overshadowed by
other computing developments, but the software is still in use.
Alastair McKinnon, a philosopher at McGill University, began working in the 1960s
with the goal of publishing a complete printed concordance of the published writings
of the Danish philosopher Soren Kierkegaard (1.9 million words) - based not only on
the Danish text, but also on the standard German, English and French translations.
After the concordance was published he continued to experiment, and by the early
1980s he realised the benefits of having access to a large and significant text in
electronic form, and having computer tools that could answer sophisticated queries.
When his work became known, he published the electronic form of the Kierkegaard
corpus, along with his collection of computer programs (called Textmap) used to
manipulate it.
The appearance of the personal computer in the early 1980s led to further advances in
text analysis. Word processing software was probably the most important
development, but more specialised software was also written. The first widely known
software package was the Brigham Young Concordance program (BYC) – later sold
as WordCruncher. Another widely used package is TACT, developed at the
University of Toronto.
In addition to allowing scholars to search for words or phrases throughout a text, the availability of electronic texts has made possible increased use of statistical methods - often a particular collection of methods called Multivariate Analysis (MVA). Such
methods promise to allow the computer to do more than just find words. Gerard
Ledger used statistical methods to work out the chronology of Plato's dialogues. He
ignored traditional critical methods and used multivariate analysis techniques to
identify which dialogues were likely to have been written earlier or later.
However, for both the literary scholar and the average reader, working with the ideas
found in a text is much more interesting than working with the words themselves.
Making the computer move from identifying individual words to the ideas that these
words represent is difficult. For this reason there is still scepticism about the use of
computers in text analysis and a corresponding need to improve on the existing tools.
Text Analysis Software
TextQuest is a well-known program for text analysis. Full details can be obtained
from:
http://www.textquest.de/tqe.htm
A demo version can also be downloaded from this site. TextQuest can perform the following operations:
• list of words, sorted by alphabet or by frequency, ascending or descending, also with exclusion lists (STOP-words)
• list of word sequences
• list of word permutations
• KWICs - key word in context with variable line length
• SITs - search patterns in text unit
• content analysis with powerful features like interactive coding, control files, and negation detection
• control of multiple search patterns
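The first of these operations - a word list sorted by frequency, with an exclusion list of STOP-words - can be sketched in a few lines of Python (the punctuation handling and the stop-word list are simplifying assumptions, not TextQuest's actual behaviour):

```python
from collections import Counter

def word_list(text, stop_words=frozenset()):
    """Frequency-sorted word list, excluding any STOP-words."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    counts = Counter(w for w in words if w and w not in stop_words)
    return counts.most_common()  # (word, frequency) pairs, descending

print(word_list("The cat saw the cat and the dog.",
                stop_words={"the", "and"}))
```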
1.3.5 Scanning Newspaper Stories
NLP techniques have been successfully applied in sorting text into fixed topic
categories. One aspect of this is scanning newspaper stories. There are several commercial services which provide subscribers with access to news stories on specified topics, e.g. news on a particular industry, company or geographical area.
Categorisation has traditionally been carried out by human experts, but in recent years
NLP software has been shown to be just as effective, categorising 90% of stories
correctly. This may seem surprising, considering the lack of success in using NLP
techniques in information retrieval. However, in this case, the categories are fixed, so
researchers can spend their time addressing other problems.
1.3.6 Intelligent Tutoring Systems
Computers have been used in education since the late 1970s. The earliest systems
were known as Computer-Based Training (CBT) systems. These systems had a
major drawback - instruction was not tailored to the learner's needs. Instead, the
decisions about how to progress through the material were programmed: ‘if question
19 has been answered correctly, proceed to question 56; otherwise go to question 35.’
The learner's abilities were not taken into account.
While CBT may be effective in helping learners, it does not provide the same kind of individual attention that a student would receive from a human tutor. For a computer-based system to provide such attention, it must reason about the subject matter and the learner. This has led to research into Intelligent Tutoring Systems (ITSs).
These offer considerable flexibility in presentation of material along with a greater
ability to respond to student needs. ITSs have been shown to be effective at increasing
student motivation and performance, e.g. students using Smithtown, an ITS for
economics, performed as well as students taking a traditional economics course, but
only spent half as much time covering the material.
You can download the Smithtown software from:
http://www.pitt.edu/~akatz/akatz.htm
If you want to see another example of an ITS, you can find one for Algebra at:
www.algebratutor.org
Many systems attempt to simulate a realistic working environment in which the
student can learn. One example is the Advanced Cardiac Life Support (ACLS)
Tutor in which a student takes the role of team leader in providing emergency life
support for heart attack patients. The system not only monitors student actions, but
runs a realistic simulation of the patient's condition and maintains an environment that
is reasonably faithful to the real life situation.
Some systems take a less rigorous approach to representing the environment. The
situations presented are similar to real world scenarios, but they are not exact
simulations. Smithtown takes this approach by providing a simulated setting for
students to test hypotheses about economics. However, the underlying model is not an
exact simulation of how the laws of economics would operate in the real world.
Systems tend to concentrate on teaching one type of knowledge. The most common
type of ITS teaches procedural skills; the goal is for students to learn how to perform
a particular task. Systems that are designed according to these principles are often
called cognitive tutors.
Other ITSs concentrate on teaching concepts and mental models. These systems encounter two main difficulties. First, a more substantial body of domain knowledge is needed. Second, since the learning of concepts and frameworks is less well understood than the learning of procedures, there is less cognitive theory to guide knowledge representation and the pedagogical module. ITSs of this type require a larger domain knowledge base and are sometimes referred to as knowledge-based tutors.
As a result of not having a strong model of skill acquisition or expert performance,
these systems are forced to use general teaching strategies. They also place more
emphasis on the communication and presentation system in order to achieve learning
gains. An example of such a system is the Pedagogical Explanation Generation
(PEG) system which uses a substantial domain knowledge base to construct answers
to student queries about electrical circuits.
Generally, tutors that teach procedural skills use a cognitive task analysis of expert
behaviour, while tutors that teach concepts and frameworks use a larger knowledge
base and place more emphasis on communication. There are exceptions to these rules,
but they are useful guidelines for classifying ITSs.
A good survey of Intelligent Tutoring Systems can be found at:
http://www.dis.port.ac.uk/~callear/CBT64203.htm
1.4 Relationship to Other Disciplines and Other Fields of AI
1.4.1 Other Disciplines
Computer Science
NLP is a sub-field of Artificial Intelligence, itself a field of Computer Science. As a result it uses many of the same techniques and procedures as mainstream Computer Science. These include programming, data structures and algorithms. One field which
is closely connected is compiler construction. This is hardly surprising, since
compiler construction deals with programming languages, which share many of the
characteristics of natural languages, although they are considerably less complex.
Compiler construction involves parsers, lexical analysers, syntactic analysers and
other tools used in NLP. It also makes extensive use of algorithms and data structures.
The traffic between NLP and Computer Science hasn’t all been one way. AI research, including NLP, has given Computer Science time sharing, interactive interpreters, the linked-list data type and some of the key concepts of object-oriented programming and graphical user interfaces.
Linguistics
The main goal of linguistics (or more correctly, theoretical linguistics) is to produce a
structural description of natural language. Linguists do not usually consider how
actual sentences are processed (parsed) or how sentences can be generated from
structural descriptions.
Linguistic theories should generally hold true across different languages, so linguists
tend to concentrate on the general principles that underlie all natural languages, and
devote less time and effort to examining any particular language. The aim of
linguistics is a formal specification of language structure, in the form of rules that
define the range of possible structures and the constraints on these.
Psychology
The area of psychology that deals with language is known as psycholinguistics. Like
computational linguists, psycholinguists are interested in how people produce and
understand natural language. In psychological terms, a linguistic theory is only useful if it explains actual behaviour.
Psycholinguists are interested in both the representation of linguistic structures and
the processes by which a person can produce such structures from actual sentences.
The primary tool of psycholinguistics is experimentation, i.e. actual measurements
made on people as they produce and understand language.
Areas studied include the time needed to read each word in a sentence or to decide
whether a given item is a valid word or not, the types of errors people make as they
perform various linguistic tasks, and so on. Experimental data is used to validate or
reject specific hypotheses about language. These hypotheses are often derived from
the theories of theoretical or computational linguists.
Language Teaching
For many years, foreign language teaching has been supplemented by the use of
language laboratories. These are rooms, often divided into booths, where students can
listen individually to recordings of foreign language material, and record and play
back their own responses, while being monitored by a teacher.
When language laboratories were introduced, they were hailed as a technique that
would greatly improve the rate and quality of foreign language learning by removing
the burden of repetitive drills from the teacher and providing students with more
opportunities to practice listening and speaking. Although many schools were quick to
install expensive language laboratory equipment, it became obvious within a few
years that there would be no major breakthrough. The expected improvements were
not realized and the popularity of the language laboratory began to wane.
There were several reasons for this failure. Recorded materials were often poorly
designed, leading to frustration and boredom. Materials were not matched to the other
work students were doing in class and few teachers were properly trained in materials
design or laboratory use. Nowadays we have a better appreciation of the strengths
and limitations of the language lab. We also have access to better hardware and
software. Modern language laboratories which make use of interactive multimedia
have proven to be extremely effective.
When used properly, language laboratories can provide a valuable extra dimension.
Recorded material can supply a variety of authentic and well-recorded models for
improving listening comprehension. Laboratories can be used as libraries of material, giving learners extra opportunities to practise at an appropriate level.
However, the limitations of language laboratories must always be kept in mind. Their
value depends on the development of suitable teaching materials which reinforce what
has been taught in class and provide opportunities for creative use. As in many other
areas, developments in software have failed to keep pace with developments in
hardware.
Modern language labs incorporate Computer-Assisted Language Learning (CALL)
workstations. PCs complement the audio and video facilities, enabling interactive
teaching of written language skills. Several kinds of exercise, such as sentence
restructuring, checking of translation or dictation tasks, and cloze testing can be
controlled by the computer, using texts displayed on the screen. Increasingly clever
interactive games are available. In Storyboard, for example, learners are given a
passage of blanks; they have to ‘buy’ words and complete the passage before their
supply of money runs out.
The use of computers in CALL offers many benefits to students. Computers can offer
real advantages in composing text, and on-line help such as dictionary support can be
very useful. Whilst simple technologies like these are practical and useful to students,
more ambitious assistance, such as providing support through video, has limitations
due to bandwidth requirements. These difficulties will disappear as bandwidth
increases. A history of CALL and a summary of the current state of the field can be
found at:
http://www.gse.uci.edu/markw/call.html
1.4.2 Other Fields of AI
Robotics
Robotics and AI are often seen as totally distinct fields, with Robotics deriving from
Mechanical Engineering and AI from Computer Science. However the two are closely
related and Robotics can be seen to some extent as the physical implementation of AI
principles.
A robot is defined as ‘a reprogrammable, multifunctional manipulator designed to
move material, parts, tools, or specialized devices through various programmed
motions for the performance of a variety of tasks’ (Robot Institute of America, 1979).
The word ‘robot’ was coined by the Czech playwright Karel Capek from the Czech
word for forced labour or serf. He used it in his 1921 play R.U.R. (Rossum's
Universal Robots) which was a huge success throughout Europe. Oddly enough, the
robots in the play were not mechanical in nature but were created through chemical
means.
The term 'robotics' refers to the study and use of robots. It was coined by scientist and
writer Isaac Asimov (1920 - 1992), best known for his many works of science fiction.
He first used the word ‘robotics’ in ‘Runaround’, a short story published in 1942. ‘I,
Robot’, a collection of several short stories about robots, was published in 1950.
Asimov proposed the three ‘Laws of Robotics’ and later added a 'zeroth law'.
• Law Zero: A robot may not injure humanity, or, through inaction, allow humanity to come to harm.
• Law One: A robot may not injure a human being, or, through inaction, allow a human being to come to harm, unless this would violate a higher order law.
• Law Two: A robot must obey orders given it by human beings, except where such orders would conflict with a higher order law.
• Law Three: A robot must protect its own existence as long as such protection does not conflict with a higher order law.
Within the research community the first robots were probably Grey Walter's Machina Speculatrix (1940s) and the Johns Hopkins Beast. Remote-controlled devices had been built even earlier, with at least the first radio-controlled vehicles built by Nikola Tesla in the 1890s. Tesla is better known as the inventor of the induction motor, AC power
transmission, and numerous other electrical devices.
The first modern industrial robots were the Unimates developed by Devol and Engelberger in the late 50s and early 60s. The first patents were by Devol for parts transfer machines. Engelberger formed Unimation and was the first to market robots. As a result, he has been called the ‘father of robotics’. Modern industrial arms have
increased in capability and performance through controller and language
development, improved mechanisms, sensing and drive systems.
The robot industry grew rapidly in the 1980s, primarily due to large investments by
the automotive industry. However, the quick leap into the factory of the future turned
into a plunge when the integration and economic viability of these efforts proved
disastrous.
One of the main applications of AI is in the area of robot control. By using evolving
control architectures, the robot can 'learn' the best way to do a task. Designers can use
neural networks and genetic algorithms to enable the robot to cope with complicated
tasks, such as navigation in a complex environment. Another area is image, sound and pattern recognition - three capabilities that any anthropomorphic robot would need. Again, neural networks could be used to analyse data from the robot's optical or audio devices.
Robotics is in many respects mechanical AI. It is also a lot more complicated, since the data the robot receives is real-time, real-world data - far more complex than the data most software-based AI programs have to deal with. On top of the more complicated programming required, algorithms are needed to drive motors and respond to sensors.
Some researchers believe that the field of robotics is where AI is ultimately heading - that most AI research is intended one day to become part of a robot.
Machine Vision
Many robotics applications require machine vision. Machine vision replaces human
vision with video cameras and specialised computers, and can improve on human
vision where precise and repeatable visual measurements and inspections are required.
Machine vision is primarily used for guiding robot movement for automated assembly
and for automated quality control. Machine vision poses a number of complex
technical problems, including edge detection, depth perception and dealing with
shadows.
In automated assembly, components are selected and placed on an assembly by the
robot. If the robot lacks vision, the components and assembly must be precisely
positioned so that the robot can locate them. This requires expensive fixtures. A robot
with vision can use cheaper and more general fixtures and can be taught to find and
place the components on the assembly. Visual guidance can compensate for some
variations in the components and the assembly and therefore carry out tasks that are
impossible with blind placement. The additional cost of a vision system can be
recovered from improvements in manufacturing flexibility and quality.
One example of automated assembly that requires machine vision is the placement of
surface mount components on printed circuit boards. The vision system determines
the precise location of the printed circuit board using reference marks on the circuit
board. The robot picks up each component and holds it in front of the camera. The
vision system verifies that the component's leads are correctly positioned and
determines the component's precise location. From this information the robot arm
places the component precisely on the printed circuit board.
Machine vision systems are replacing human vision for quality control inspection of
manufactured objects. These inspections are often too fast or precise for human
vision, and the demand for quality requires repeatable visual inspection on each
object. This can be accomplished with a high-speed machine vision system that is integrated into the flow of the manufacturing process, either as part of robot assembly or at a station specifically designed for inspection.
Planning and Searching
In AI, searching usually crops up in the context of problem solving. One simple
example is the missionaries and cannibals problem, usually stated as follows: ‘three
missionaries and three cannibals are on one side of a river, along with a boat that can
hold one or two people. Find a way to get everyone to the other side, without ever leaving a group of missionaries in one place outnumbered by the cannibals in that place’.
A problem consists of four parts: an initial state, a set of operators, a goal test
function, and a path cost function. The environment of the problem is represented by
a state space. A path through the state space from the initial state to a goal state is a
solution.
The first step is to decide what the right operator set is. We know that the operators
will involve taking one or two people across the river in the boat, but we have to
decide if we need a state to represent the time when they are in the boat, or just when
they get to the other side. Because the boat holds only two people, no ‘outnumbering’
can occur in it; hence, only the end points of the crossing are important.
For the purposes of the solution, when it comes time for a cannibal to get into the
boat, it does not matter which one it is. Any permutation of the three missionaries or
the three cannibals leads to the same outcome.
These considerations lead to the following formal definition of the problem:
• States: a state consists of an ordered sequence of three numbers representing the number of missionaries, cannibals and boats on the bank of the river from which they started. Thus, the start state is (3,3,1).
• Operators: from each state the possible operators are to take either one missionary, one cannibal, two missionaries, two cannibals, or one of each across in the boat. There are at most five operators, although most states have fewer because it is necessary to avoid illegal states.
• Goal test: reached state (0,0,0).
• Path cost: number of crossings.
This state space is small enough to make it a trivial problem for a computer to solve.
People find it more difficult because some of the moves involve backtracking and we
react intuitively against this.
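The formal definition above translates directly into a short program. The following Python sketch (the function names and data structures are our own choices, not part of the original problem statement) uses breadth-first search over (missionaries, cannibals, boat) states to find a shortest solution:

```python
from collections import deque

def safe(m, c):
    # A bank is safe if it has no missionaries, or at least as many
    # missionaries as cannibals; (3 - m, 3 - c) is the far bank.
    return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

def solve():
    # State: (missionaries, cannibals, boat) on the starting bank.
    start, goal = (3, 3, 1), (0, 0, 0)
    moves = [(1, 0), (0, 1), (2, 0), (0, 2), (1, 1)]  # the five operators
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        m, c, b = path[-1]
        if path[-1] == goal:
            return path
        for dm, dc in moves:
            # The boat carries people away from whichever bank it is on.
            if b == 1:
                nm, nc, nb = m - dm, c - dc, 0
            else:
                nm, nc, nb = m + dm, c + dc, 1
            state = (nm, nc, nb)
            if 0 <= nm <= 3 and 0 <= nc <= 3 and safe(nm, nc) and state not in visited:
                visited.add(state)
                frontier.append(path + [state])
    return None

solution = solve()
print(len(solution) - 1)  # path cost: the number of crossings
```

Because breadth-first search expands states in order of path length, the first solution found is guaranteed to use the fewest crossings (eleven, for this puzzle).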
This is an artificial problem, but similar techniques can be applied to solving a
number of real-world problems.
Route finding
Route finding is defined in terms of specified locations and transitions along links
between them. Route-finding algorithms are used in a variety of applications, such as
routing in computer networks, automated travel advisory systems, and airline travel
planning systems.
The travelling salesman problem
The travelling salesman problem, or TSP for short, is this: given a finite number of
cities, along with the cost of travel between each pair of them, find the cheapest way
of visiting all the cities and returning to your starting point.
The aim is to find the shortest tour. An enormous amount of effort has been expended
to improve the capabilities of TSP algorithms. In addition to planning trips for
travelling salespersons, these algorithms have been used for tasks such as planning
movements of automatic circuit board drills.
You can find a detailed description of the different versions of the problem and the
attempts made to solve it at:
http://www.keck.caam.rice.edu/tsp/index.html
Robot navigation can be reduced to a variation of the travelling salesman problem.
VLSI layout
The design of silicon chips is one of the most complex engineering design tasks
currently undertaken. A typical VLSI chip can have as many as a million gates, and
the positioning and connections of every gate are crucial to its operation. Two of the
most difficult tasks are cell layout and channel routing.
In cell layout, the components of the circuit are grouped into cells, each of which
performs some recognised function. Each cell has a fixed footprint (size and shape)
and requires a certain number of connections to each of the other cells. The aim is to
place the cells on the chip so that they do not overlap and so that there is room for the
connecting wires to be placed between the cells.
Channel routing finds a specific route for each wire using the gaps between the cells.
These search problems are extremely complex, but definitely worth solving.
Assembly sequencing
Automatic assembly of complex objects by a robot was first demonstrated by a robot
called FREDDY in 1972. Progress since then has been slow but sure, to the point
where assembly of objects such as electric motors is economically feasible. In
assembly problems, the problem is to find an order in which to assemble the parts of
some object. If the wrong order is chosen, there will be no way to add some part later
in the sequence without undoing some of the work already done.
SEARCH ALGORITHMS
A single general search algorithm can be used to solve any problem; specific variants
of the algorithm embody different strategies. Search algorithms are judged on the
basis of:
• Completeness: is the strategy guaranteed to find a solution, if there is one?
• Time complexity: how long does it take to find a solution?
• Space complexity: how much memory is required to find a solution?
• Optimality: does the strategy find the best solution if there are several possible solutions?
Breadth-first search expands the shallowest node in the search tree first. It is
complete, optimal if all operators cost the same, but has high time and space
complexity. The space complexity makes it impractical in most complex cases.
Depth-first search expands the deepest node in the search tree first. It is neither
complete nor optimal, and has high time complexity and low space complexity. In
search trees of large or infinite depth, the time complexity makes it impractical.
(Depth-first and breadth-first strategies with respect to parsing are considered in
greater detail in Section 3.4 of these notes.)
Depth-limited search places a limit on how deep a depth-first search can go. If the
limit happens to be equal to the depth of the shallowest goal state, then time and space
complexity are minimized.
Iterative deepening search repeats a depth-limited search with increasing limits until
a goal is found. It is complete and optimal (when all operators cost the same),
combining the low space complexity of depth-first search with time complexity close
to that of breadth-first search.
Uniform-cost search expands the least-cost leaf node first. It is complete, and unlike
breadth-first search is optimal even when operators have differing costs. Its space and
time complexity are the same as for breadth-first search.
Bi-directional search can enormously reduce time complexity, but is not always
applicable. Its memory requirements may be impractical.
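The strategies above differ mainly in the order in which nodes are expanded. The sketch below (the small graph is an invented example) shows that breadth-first and depth-first search can share a single algorithm, differing only in whether the frontier is treated as a queue or a stack:

```python
from collections import deque

def search(start, goal, successors, depth_first=False):
    """Generic search: breadth-first expands the oldest path on the
    frontier (a queue); depth-first expands the newest (a stack)."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.pop() if depth_first else frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in successors(node):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

# An invented route-finding graph: two routes from A to E.
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['E'], 'E': []}
bfs_path = search('A', 'E', lambda n: graph[n])
dfs_path = search('A', 'E', lambda n: graph[n], depth_first=True)
print(bfs_path, dfs_path)
```

Both strategies find a path here, but they explore the graph in different orders, which is exactly why their completeness, optimality and complexity properties differ.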
PLANNING
Planning and problem solving use different approaches to the representation of goals,
states and actions, and the representation and construction of action sequences. We’ll
now look at some of the difficulties encountered by search-based problem-solving
approaches, and the methods used by planning systems to overcome these.
Let us see how these factors affect the ability of a theoretical intelligent device (which
we’ll refer to as an ‘agent’) to solve the following simple problem: ‘Get a pint of milk,
a brown loaf and a lawnmower’. Treating this as a problem-solving exercise, we need
to specify the initial state: the agent is at home but without any of the desired objects,
and the operator set: all the things that the agent can do.
It is obvious that there are too many actions and too many states to consider. The
agent can only choose among states to decide which is closer to the goal; it cannot
eliminate actions from consideration. Even if we could get the agent into the
supermarket, the agent would then resort to a guessing game, by considering actions,
such as buying an orange, buying corn flakes, buying milk and ranking these as good
or bad. It then knows that buying milk is a good idea, but has no idea what to try next
and must start guessing again.
The fact that the problem-solving agent considers sequences of actions starting from
the initial state also contributes to its difficulties. It forces the agent to decide first
what to do in the initial state, where the relevant choices are essentially to go to any of
a number of other places. Until the agent has figured out how to obtain the various
items, by buying, borrowing or stealing etc., it can’t really decide where to go. It
needs a more flexible way of structuring its thoughts, so that it can work on whichever
part of the problem is most likely to be solvable given the current information.
The first key idea behind planning concerns the representation of states, goals, and actions.
Planning algorithms use descriptions in some formal language, usually first-order
logic. States and goals are represented by sets of sentences, and actions are
represented by logical descriptions of preconditions and effects. This enables the
planner to make direct connections between states and actions. For example, if the
agent knows that the goal includes Have(Milk), and that Buy(x) achieves Have(x),
then it knows that it is worthwhile to consider a plan that includes Buy(Milk). It need
not consider irrelevant actions such as Buy(ShoePolish) or GoToSleep.
The second key idea is that the planner is free to add actions to the plan wherever they
are needed, rather than always starting at the initial state. For example, the agent may
decide that it is going to have to Buy(Milk), even before it has decided where to buy
it, how to get there, or what to do afterwards. There is no necessary connection
between the order of planning and the order of execution. The representation of states
as sets of logical sentences plays a crucial role in making this freedom possible. For
example, when adding the action Buy(Milk) to the plan, the agent can represent the
state in which the action is executed as, say, At(Supermarket). Search algorithms that
require complete state descriptions do not have this option.
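These ideas can be sketched in code. In this hypothetical representation (the dictionaries and literal strings below are illustrative, not a standard planning library), an action lists its preconditions and effects, and the planner considers only actions whose effects overlap the goal:

```python
# A minimal STRIPS-style sketch: states and goals are sets of literal
# strings; actions carry preconditions and effects.

def buy(x):
    return {
        'name': f'Buy({x})',
        'preconditions': {'At(Supermarket)'},
        'effects': {f'Have({x})'},
    }

def go_to_sleep():
    return {'name': 'GoToSleep', 'preconditions': set(), 'effects': {'Rested'}}

def relevant(action, goal):
    # An action is worth considering only if it achieves part of the goal.
    return bool(action['effects'] & goal)

goal = {'Have(Milk)'}
actions = [buy('Milk'), buy('ShoePolish'), go_to_sleep()]
worthwhile = [a['name'] for a in actions if relevant(a, goal)]
print(worthwhile)
```

The relevance test is what lets the planner ignore Buy(ShoePolish) and GoToSleep without ever simulating them.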
The third key idea behind planning is that most parts of the world are independent of
most other parts. This makes it feasible to take a combined goal like ‘Get a pint of
milk, a brown loaf and a lawnmower’ and solve it with a divide-and-conquer strategy.
A sub-plan involving going to the supermarket can be used to achieve the first two
objectives, and another subplan (e.g., going to the hardware store or borrowing from a
neighbour) can be used to achieve the third. The supermarket subplan can be further
divided into a milk subplan and a bread subplan. We can then put all the subplans
together to solve the whole problem. This works because there is little interaction
between the two subplans: going to the supermarket does not interfere with borrowing
from a neighbour, and buying milk does not interfere with buying bread.
Divide-and-conquer algorithms are efficient because it is almost always easier to
solve several small sub-problems than one big problem. However, divide-and-conquer
fails in cases where the cost of combining the solutions to the sub-problems is high.
For tricky puzzles, planning techniques will not do any better than problem-solving
techniques. Fortunately, real-world sub-goals tend to be nearly independent. If this
were not the case, the sheer size of the real world would make problem solving
impossible.
Expert Systems
In recent years, expert systems have received a great deal of attention. Expert systems
are AI-based programs which have been used to solve a range of problems in a wide
variety of fields, including computer system design and medical diagnosis. An
expert system stores the knowledge of one or more human experts in a particular field.
The field is called a domain. The experts are called domain experts. A user asks the
expert system about a problem within the domain. The system applies its stored
knowledge to solve it.
A domain expert is a person who, because of training and experience, can do things
the rest of us cannot. Domain experts know a great many things and have tricks and
techniques for applying what they know to problems and tasks. They are also good at
discarding irrelevant information to get at basic issues, and at recognising new
problems as instances of types with which they are already familiar.
The part of the expert system that stores the knowledge is called the knowledge base.
The part that holds details of the problem to be solved is known as the global
database. The part that applies the knowledge to the problem is called the inference
engine.
Like most modern computer programs, expert systems usually have a friendly
user interface. This doesn't make the system work any better, but it does allow
inexperienced users to specify problems and understand the system's conclusions.
Expert systems are produced by knowledge engineers, who begin by reading
domain-related literature to become familiar with issues and terminology. When this
foundation is established, the knowledge engineer holds extensive interviews with one
or more domain experts to acquire their knowledge. The results of these interviews
are organised and translated into software that a computer can use.
The interviews take the most time and effort of any of these stages, and are often the
longest part of the system development.
The format used to capture the knowledge is called a knowledge representation. The
most popular knowledge representation is the production rule (also called the if-then
rule). Production rules reflect the ‘rules of thumb’ or heuristics that experts use in
their day-to-day work. A knowledge base consisting of rules is sometimes called a
rule base.
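A rule base and inference engine can be sketched very simply. The following Python fragment (the rules themselves are invented for illustration) implements forward chaining: it repeatedly fires any production rule whose conditions are all present in the global database, until no new facts can be added:

```python
# Each production rule is (conditions, conclusion): IF all conditions
# hold THEN add the conclusion. The rules below are invented examples.
rules = [
    ({'engine_wont_start', 'lights_dim'}, 'battery_flat'),
    ({'battery_flat'}, 'recharge_battery'),
]

def forward_chain(facts, rules):
    facts = set(facts)          # the global database of known facts
    changed = True
    while changed:              # keep firing rules until nothing new appears
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

derived = forward_chain({'engine_wont_start', 'lights_dim'}, rules)
print(sorted(derived))
```

Note that the second rule fires only because the first rule's conclusion has been added to the database, which is exactly how chains of inference are built from individual rules of thumb.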
Machine Learning
Machine learning is the study of algorithms that enable computers to improve their
performance and increase their knowledge base. Research in this area has taken place
since the mid-1950s. One early development was a program, written by Arthur
Samuel, which learned to play draughts well enough to beat skilled human players.
The study of machine learning has provided information about how adults learn concepts and rules and how children's abilities develop, thus contributing to
developmental and cognitive psychology. Some of these ideas concern language
acquisition, making machine learning relevant to linguistics. However, not all work in
machine learning is directly relevant to psychology, since artificial intelligence
researchers sometimes develop algorithms that bear little relation to human learning.
The most successful machine-learning programs produce expert systems that perform
even better than humans. Recently, machine learning has become a useful technology
for mining large databases for information of commercial and scientific importance.
There are six major approaches to machine learning: learning from examples,
artificial neural networks, genetic algorithms, explanation-based learning, evaluating
hypotheses, and analogical inference (case-based reasoning). The first three
approaches can all function in systems that have very little information and need to
learn from experience, without guidance from previously acquired knowledge. The
final three approaches involve learning in systems that already have extensive
knowledge bases.
LEARNING FROM EXAMPLES
The most active area of research in machine learning has been the learning of
concepts from examples. The input to such learning programs consists of descriptions
of examples, and the output consists of different kinds of representations that
generalise about those examples. These output representations can be new concepts,
new rules, or decision trees that provide a convenient means of classifying new examples.
One of the first projects on learning from examples was a program by Patrick Winston
that learned the concept of an arch. Given examples of arches and examples of things
that were not arches, the program's task was to produce a general description of arches
that would correctly classify additional examples of arches while excluding non-arches.
Some concept-learning algorithms start from specific descriptions that are expanded
to be more general, while others initially produce general descriptions that are then
modified to handle more specific information. For example, a program might learn
the concept of an arch by assuming that every arch is exactly like the first one it
encounters, and then gradually generalise as it receives more examples. Alternatively,
a program might be designed to make an immediate generalisation, such as that an
arch consists of two vertical blocks with an object on top, and then expand this to
encompass new examples.
Another kind of learning from examples produces general rules instead of concepts,
while a third method produces decision trees rather than rules or concepts. Learning
concepts, rules, and decision trees from examples all produce symbolic descriptions,
but a very different kind of output is produced by another approach to machine
learning using artificial neural networks.
Neural Networks
The inputs to a neural network learning program are a network consisting of a set of
nodes connected by excitatory and inhibitory links, along with a set of training
examples. The links represent the weightings of the connections, positive and
negative, between the nodes. Learning algorithms modify these weightings to improve
the performance of the network at some task - for example, predicting the weather or
classifying plants. The output of the learning algorithm is not a new representation
such as a concept or a rule, but rather a modification of the weightings of the
connections in the network.
The commonest learning algorithm in neural networks is called ‘backpropagation’,
which trains a network by adjusting the weights that connect the different nodes. The
network consists of input nodes representing features of examples (e.g. hot, humid,
windy days) and output nodes representing a conclusion (e.g. sunny or rainy).
The input and output nodes are linked by hidden nodes that have no initial
interpretation.
Random weights are assigned to the links between nodes, then the input nodes are
activated for features that the network is meant to learn about. Activation spreads
through the network to the hidden nodes and then to the output nodes.
Errors are determined by calculating the difference between the computed activation
of the output nodes and the expected activation, e.g. an input of hot and dry may
erroneously predict that the day will be rainy.
To enhance future performance, errors are propagated backwards down the links,
changing the weights in such a way that the errors are reduced. After sufficient
examples have been presented to the network, it should perform its task reliably.
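The training cycle just described can be sketched as follows. This is a minimal, illustrative backpropagation example (the 2-2-1 architecture, sigmoid units, learning rate and epoch count are our own choices): a tiny network learns the exclusive-or function, and the total error after training is lower than before:

```python
import math
import random

random.seed(0)
sig = lambda x: 1.0 / (1.0 + math.exp(-x))

# Random initial weights: w[i][j] links input i to hidden node j,
# v[j] links hidden node j to the single output; bh and bo are biases.
w = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
bh = [random.uniform(-1, 1) for _ in range(2)]
v = [random.uniform(-1, 1) for _ in range(2)]
bo = random.uniform(-1, 1)

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def forward(x):
    # Activation spreads from inputs through hidden nodes to the output.
    h = [sig(x[0] * w[0][j] + x[1] * w[1][j] + bh[j]) for j in range(2)]
    return h, sig(sum(v[j] * h[j] for j in range(2)) + bo)

def total_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

before = total_error()
lr = 0.5
for _ in range(5000):
    for x, t in data:
        h, o = forward(x)
        # Errors are propagated backwards, adjusting weights to reduce them.
        do = (t - o) * o * (1 - o)
        dh = [do * v[j] * h[j] * (1 - h[j]) for j in range(2)]
        for j in range(2):
            v[j] += lr * do * h[j]
            bh[j] += lr * dh[j]
            for i in range(2):
                w[i][j] += lr * dh[j] * x[i]
        bo += lr * do

after = total_error()
print(round(before, 4), '->', round(after, 4))
```

The hidden nodes start with no interpretation, as the text says: their weights are random, and only through repeated error propagation do they come to encode something useful about the task.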
Genetic Algorithms
In the 1970s, John Holland used an analogy between learning and biological
adaptation to develop an approach to machine learning called ‘genetic algorithms’.
Species adapt to their environments by mutation, which introduces new genetic
combinations, and by natural selection, which favours members of the species who
are genetically suited to survive and reproduce.
The inputs to a genetic algorithm program are structures for performing tasks. These
can be simple structures, such as binary strings representing the presence or absence
of features, or more complex structures such as computer programs. The outputs are
modified structures that should perform the desired task more effectively. To obtain
this improvement, genetic algorithms alter the input structures by randomly modifying
them (mutation) and by combining them (crossover). The resulting structures are then
evaluated for their effectiveness, and variation and selection are repeated. Genetic
algorithms have many useful applications, such as generating expert systems for
designing computer circuits.
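A genetic algorithm can be sketched in a few lines. The task below (evolving a binary string of all 1s) and the parameter values are invented for illustration; the mutation, crossover and selection steps follow the description above:

```python
import random

random.seed(1)
LENGTH, POP, GENS = 20, 30, 60

def fitness(s):
    return sum(s)                       # count the 1s in the string

def mutate(s, rate=0.02):
    # Mutation: randomly flip bits, introducing new combinations.
    return [1 - bit if random.random() < rate else bit for bit in s]

def crossover(a, b):
    # Single-point crossover: combine two parent structures.
    point = random.randrange(1, LENGTH)
    return a[:point] + b[point:]

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
best_start = max(fitness(s) for s in pop)
for _ in range(GENS):
    # Selection: the fitter half survives and breeds to refill the population.
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children
best_end = max(fitness(s) for s in pop)
print(best_start, '->', best_end)
```

Because the fittest structures survive each generation unchanged, the best fitness never decreases, mirroring the variation-and-selection cycle described above.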
Explanation-Based Learning
The methods described above can generate new knowledge using very little
background knowledge. By contrast, explanation-based learning makes extensive use
of existing knowledge. Explanation-based learning can start from a single example
instead of the large sets of training examples typically used in learning from
examples. The input to an explanation-based learning algorithm consists of an
example plus a database of general rules or schemas.
The output is a new concept, formed by constructing an explanation of why the
example is an instance of the goal concept, and by generalising the explanation to
obtain the goal concept. The new concept does not arise from generalisation from
numerous examples, but from the knowledge-intensive attempt to understand what is
happening in a particular example.
Evaluating Hypotheses
Hypothesis evaluation usually takes place in information rich areas such as medical
diagnosis. A doctor needs to generate and evaluate hypotheses that can explain why a
patient has a particular collection of symptoms. The inputs are a set of facts to be
explained and a knowledge base that can be used to generate hypotheses to explain
them. The outputs are judgments about the acceptability of various hypotheses.
The algorithms required need to generate hypotheses by searching the knowledge base
for explanations, then choose among competing hypotheses. Some systems use
Bayesian networks, in which facts and hypotheses are represented by nodes, connected by links that represent probabilities.
Other systems operate more qualitatively, e.g. by treating hypothesis choice as a
process of constraint satisfaction that can be computed by artificial neural networks.
Analogical Inference (Case-Based Reasoning)
Analogical inference is useful when general information about a problem is not
available but there is a stock of similar, previously solved problems. The required
inputs are a problem to be solved (the target) and a knowledge base of previously
solved problems. Problems can include plans, such as how to get to the airport;
designs, such as how to build an aircraft; and explanations, such as why a plane
crashed. A new plan, design, or explanation can be formed by adapting a previous one
(the source) that already exists in the knowledge base.
To achieve this, we need algorithms for retrieving a potentially relevant case,
mapping it to the target problem, establishing correspondences and adapting the case
to provide a solution. Once the target problem has been solved, another kind of
learning can take place by noting the common properties of the target and source to
provide a basis for subsequent problem solving.
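Retrieval of the most similar source case can be sketched as a nearest-neighbour search. The cases and features below are invented examples; a real system would use a much richer similarity measure and would also adapt the retrieved case:

```python
# A stock of previously solved problems: (problem description, solution).
cases = [
    ({'engine': 'off', 'lights': 'dim', 'fuel': 'full'}, 'flat battery'),
    ({'engine': 'off', 'lights': 'bright', 'fuel': 'empty'}, 'out of fuel'),
    ({'engine': 'running', 'lights': 'bright', 'fuel': 'full'}, 'no fault'),
]

def similarity(a, b):
    # Count the features on which the two problem descriptions agree.
    return sum(1 for k in a if a.get(k) == b.get(k))

def retrieve(target):
    # Map the target onto the most similar stored source case.
    return max(cases, key=lambda case: similarity(case[0], target))

target = {'engine': 'off', 'lights': 'dim', 'fuel': 'half'}
source, diagnosis = retrieve(target)
print(diagnosis)
```

Once the target has been solved, the new (problem, solution) pair can itself be added to the case base, which is the learning step described above.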
2. ANALYSING NLP
2.1 Levels of Processing
The processes involved in the comprehension and production of language can be
analysed at several different levels:
Phonological: the sounds made in the language. The units include phonemes and
syllables. At this level we should know:
• How to classify sounds as belonging to specific linguistic categories.
• How to use patterns of pitch and volume to determine boundaries between units.
• What sounds can legally follow other sounds.
• How units at one level (e.g. phonemes) combine to form units at a higher level (e.g. syllables).
Morphological: the shape or structure of the words used in the language, based on the
prefixes, suffixes and root forms making them up. The units are morphemes. At this level
we should know:
• What order to put, or expect, the morphemes in a polymorphemic word.
• How the sounds in a morpheme change when it combines with another.
• How to interpret or produce a word consisting of a novel combination of morphemes.
There are several different types of morphology:
• Inflectional morphology: refers to the changes to a word that are required in a particular grammatical context, e.g. most nouns add an ‘s’ to signify a plural.
• Derivational morphology: refers to the derivation of a new word from another word, often of a different category, e.g. ‘baldness’ is derived from the adjective ‘bald’ plus the suffix ‘ness’.
• Compound morphology: refers to taking two different words and combining them into one, e.g. ‘goalkeeper’ is a compound of ‘goal’ and ‘keeper’. (‘Keeper’ is itself derived from the verb ‘keep’ by derivational morphology.)
Syntactic: the grammatical rules of the language. The units are phrases, clauses and
sentences. It is often difficult to separate syntax completely from semantics (see
below). At this level we should know:
• How to order the components of a sentence to indicate a particular meaning.
• What roles the components of a sentence play in the state or event described.
• Where to expect gaps in particular patterns.
Semantic: the meaning of utterances in the language. The units are objects, relations,
variables and worlds. At this level we should know:
• How words and sentences relate to objects and relations in the world.
• How to interpret the scope of negation.
• How to ‘find’ the thing that is referred to by a definite noun phrase (such as ‘the woman who always parks in front of my house’).
Pragmatic: the context within which the language is used. Units include utterances,
discourses, turns. At this level we should know:
• How to get somebody to do something for you.
• How the meanings of words such as ‘go’ and ‘come’ change with the context.
• How to begin and end a phone conversation without offending the hearer.
• What forms sound more formal than others and how to select the forms that are appropriate to the situation.
• When someone has said something that is politically incorrect.
2.2 Categories of Words
Words can be classified into different categories according to their function. The same
classes have been used since the days of the Ancient Greeks.
• Adjective: a word that qualifies a noun, e.g. ‘white’ in ‘white horse’;
• Adverb: a word that qualifies a verb, e.g. ‘boldly’ in ‘to go boldly’;
• Auxiliary: a verb which occurs along with another verb to add some further meaning, e.g. ‘will’ in ‘will fly’ adds the notion of future action;
• Determiner: a word which specifies whether we are dealing with a definite object or an indefinite one, e.g. ‘the’ in ‘the book’ versus ‘a’ in ‘a book’;
• Noun: a word that refers to an object or entity, e.g. ‘horse’;
• Preposition: a word that denotes position, e.g. ‘on’ in ‘on the shelf’;
• Pronoun: a word that refers to a person or thing without naming it, e.g. ‘he’, ‘you’;
• Verb: a word that refers to an action, e.g. ‘flying’, ‘washing’, ‘thinking’;
• Conjunction: a word which joins parts of a sentence together, e.g. ‘and’, ‘or’.
We distinguish between content and function words, and between words of open and
closed classes:
Content words (such as nouns and verbs) carry the meaning of a sentence. They are
an open class, since the number of members of the class can be extended almost
indefinitely.
Function words, such as prepositions or conjunctions, serve a grammatical function
in the sentence. They are a closed class since there is only a limited number of each
and it is difficult to add new ones. However, closed classes do change over time. Up
until fairly recently ‘thee’ and ‘thou’ were in common use as pronouns. Nowadays
they only appear in poetry or archaic works.
2.3 Ambiguity
Ambiguity is one of the biggest problem areas in processing natural language by
computer. No-one is really sure why natural language is so ambiguous, although some
linguists believe that ambiguity arises from the variety of sources from which a
language is derived.
Many jokes rely on ambiguity for their effect. We are tricked into expecting a
particular ending, only to find that we are wrong. English is reckoned to be the only
language in the world where your nose can run and your feet can smell.
Ambiguity is a problem in NLP because it increases the range of possible
interpretations of an utterance. If each word in a 10-word sentence has 3 possible
meanings, then the whole sentence could have 3¹⁰ = 59,049 different meanings! This
problem is known as combinatorial explosion. The process of removing ambiguity
is often referred to as disambiguation.
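The arithmetic is easy to check, and shows how quickly the number of possible readings grows with sentence length:

```python
# With 3 senses per word, the number of possible readings of a sentence
# multiplies by 3 for every word added.
readings = {length: 3 ** length for length in (5, 10, 15)}
for length, count in sorted(readings.items()):
    print(length, 'words:', count, 'possible readings')
```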
Ambiguity can be local or global. Local ambiguity means that part of a sentence can
have more than one interpretation, but the whole sentence cannot. Global ambiguity
means that the whole sentence can have more than one interpretation.
Consider the sentence:
‘I know more beautiful women than Julie Rodgers’
This can have two distinct meanings:
‘I know women who are more beautiful than Julie Rodgers’
or
‘I know a larger number of beautiful women than Julie Rodgers knows.’
The sentence is globally ambiguous. However, if we change it to:
‘I know more beautiful women than Julie Rodgers, although she knows quite a few.’
the global ambiguity is removed. The first phrase is still locally ambiguous, and it
is only when the second phrase is added that the ambiguity is resolved.
Local ambiguity can often be resolved by syntactic analysis. Consider the following
sentences.
‘The old train the young’
‘The old train pulled out of the station’
Once we realise that ‘train’ is a verb in the first sentence and a noun in the second, the
ambiguity is resolved.
Resolution of global ambiguity requires semantic or pragmatic analysis. Look at the
following sentences:
‘I saw the Houses of Parliament flying over London’
‘I saw a Boeing 747 flying over London’
Here we can differentiate between the two sentences because we know what can and
cannot fly. The first sentence must mean:
‘I saw the Houses of Parliament (while I was) flying over London’
However, the second sentence remains globally ambiguous as it can have either of the
following meanings:
‘I saw a Boeing 747 (while I was) flying over London’
‘I saw a Boeing 747 (that was) flying over London’
Types of Ambiguity
Structural ambiguity
This occurs where there is more than one way of parsing a given sentence, e.g.
‘You can have peas and beans or carrots with your dinner.’
This can be parsed as:
‘You can have (peas) and (beans or carrots) with your dinner.’
or as
‘You can have (peas and beans) or (carrots) with your dinner.’
The ambiguity can sometimes be resolved by the use of intonation or pauses, if the
sentence is spoken, or by the use of punctuation, if it is written. However, what
about:
‘I saw a man on a hill with binoculars’
There are no commas and there are unlikely to be pauses in speech. We may have to
rely on world knowledge to derive the most likely meaning, e.g. how far away is the
hill, how many men are on it, where are the binoculars?
Form class ambiguity
This occurs when a word can belong to more than one syntactic class, i.e. it can be
represented by more than one terminal symbol. This is sometimes known as syntactic
ambiguity or categorical ambiguity.
E.g. the word ‘time’ can be used as a noun, a verb or an adjective:
‘Time is money’
‘Time me on the last lap’
‘Time travel is impossible’
Or the word ‘flies’ can be a noun or a verb:
‘Time flies like the wind’
‘Fruit flies like bananas’
The ambiguity can be resolved by syntactic analysis, i.e. by deciding which syntactic
class the ambiguous word belongs to.
Word sense ambiguity
This occurs when a word has only one terminal symbol, but can refer to different
concepts. It is sometimes called lexical ambiguity, and arises because a word has
more than one meaning: e.g. ‘lead’ – is it the lead in a pencil, the metal lead or the
dog’s lead?
Consider the use of the word ‘charged’ in the following sentences:
‘The battery was charged with jump leads.’
‘The thief was charged with breaking and entering.’
‘The lecturer was charged with student guidance.’
‘The atmosphere was charged with excitement.’
The ambiguity can be resolved by semantic analysis: we know that jump leads and
student guidance aren’t criminal offences. But what about sentences like:
‘The boy ran away from the bank’
How do we tell whether the bank in question is the side of a river or a financial
institution? We can’t without further information. (See Section 2.4)
Word sense ambiguity can cut across syntactic categories, e.g. the word ‘back’ is an
adverb in ‘go back’, an adjective in ‘back door’, a noun in ‘I have a sore back’ and a
verb in ‘back up your data regularly’.
Referential ambiguity
This occurs where more than one object is referred to by a noun phrase, e.g.
‘Jack was cycling with Colin when he fell off his bike.’
Who fell off the bike?
Referential ambiguity may be resolved in a variety of ways:
‘Jack gave Colin a present and he thanked him profusely.’
We know that it’s most likely that the recipient of the present is doing the thanking,
but what if the present was given in thanks for something else?
An ambiguous pronoun in a second phrase or sentence is more likely to apply to the
subject of the first phrase than to its object.
‘The supervisor fired the worker. He was known to be aggressive.’
It is more likely that the supervisor was ‘known to be aggressive’.
Where there is only one sentence involved, an ambiguous pronoun is more likely to
refer to the person closest to it in the sentence.
‘Kirsten gave Jill a coat because she was cold.’
The most likely interpretation is that Jill was cold. Similar logic would suggest that it
was Colin who fell off the bike in the first example given.
But what about:
‘Jack and Kirsten gave Colin and Jill some computer games because they liked
them.’
Should we assume that ‘them’ refers to Jack and Kirsten, or is this too obvious? We
don't really need to say that we like someone if we give something to them.
2.4 Sequences of Sentences
Sometimes we need to refer to a previous sentence in order to understand the meaning
of a given sentence, e.g. ‘He walked away from the bank.’ It is unclear whether bank
is the side of a river, or a financial institution. However, the following are not
ambiguous:
‘John threw a stone in the river. He walked away from the bank.’
‘John withdrew £50 from the ATM. He walked away from the bank.’
2.5 Garden Path Sentences
Garden path sentences are those which lead the listener up the garden path to an
incorrect parse. The following sentences may look as though they are incorrect, but
each of them is grammatically correct and each has a clear meaning.
1. The horse raced past the barn fell.
2. The man who hunts ducks out on weekends.
3. The cotton clothing is usually made of grows in Mississippi.
4. The prime number few.
5. Fat people eat accumulates.
None of these sentences has random words tacked on; none of them are sentence
fragments stitched together; none of them are incomplete.
Here are the sentences with a bit of explanation that should clarify what they mean.
The horse raced past the barn fell.
The horse (that was) raced past the barn fell (down).
The man who hunts ducks out on weekends.
What does the man who hunts do on weekends? He ducks out on weekends.
The cotton clothing is usually made of grows in Mississippi.
Where do they grow the cotton that that clothing is usually made of? Mississippi.
The prime number few.
The mediocre number many; the prime number few.
Fat people eat accumulates.
The fat that people eat accumulates.
3. PARSING AND GENERATION TECHNIQUES
3.1 A Simple Formal Grammar
Context-Free Grammars (CFGs) are a method of describing language and other
hierarchical structures. They are related to Phrase Structure Trees.
CFGs have the following characteristics:
• a left-hand and right-hand side, separated by the symbol ::= (read as ‘consists of’)
• one symbol only on the left-hand side
• at least one symbol on the right-hand side
• symbols on the left-hand side of rules are always non-terminals (that is, they
  never appear as leaves on trees)
• symbols on the right-hand side of rules may be either terminals or non-terminals.
In this section we’ll define a formal grammar for a small subset of the English
language which we’ll call Innglish. Natural languages use fixed sets of letters (in
written form) or sounds (in spoken form), which combine to form a fixed set of
words, the lexicon or vocabulary of the language.
Words can be regarded as the symbols of a formal language. These symbols can be
combined to form strings. Some strings such as ‘This is a sentence’ are valid English
sentences, while others, such as ‘A this sentence is’ are not. We can devise a
grammar, or set of rules which allow us to generate or recognise reasonably complex
sentences.
There are a number of different systems for describing grammars, but most of them
use the concept of grouping symbols together to form phrases and are thus known as
phrase structure grammars.
Phrases such as ‘a boy’, ‘the kitchen’ and ‘the seat in the garden’ are noun phrases.
As we will see later, phrases help us to describe the semantics, or meaning of a
sentence. Using different types of phrase and specifying the permissible
relationships between them helps us to define the allowable strings of the language.
For example, we can say that a noun phrase can combine with a verb phrase (such as
‘sat on the chair’) to form a sentence.
A formal language is defined as a set of strings, each of which consists of a sequence
of symbols chosen from a finite set of terminal symbols. Terminal symbols cannot be
further subdivided, but they can be categorised into groups according to their
function, e.g. nouns, verbs, adjectives etc.
Other symbols, such as <noun phrase>, <verb phrase> and <sentence> consist of
groupings of terminal symbols and are known as non-terminal symbols. The highest
permitted level of non-terminal symbol (in this case, <sentence>) is known as the
distinguished non-terminal symbol.
The first step in defining a grammar is to define a lexicon, or list of acceptable words.
These words are grouped according to their category or part of speech. A brief lexicon
for Innglish is given below. The ellipsis (…) after the members of a category
indicates that it is possible to add more members.
Category        Members
Noun            boy, seat, garden, kitchen, firemen …
Verb            flying, visiting, sitting, running …
Adjective       big, green …
Adverb          quickly, slowly …
Preposition     in, on, at …
Conjunction     and, or …
Article         a, an, the …
We then specify the rules which describe the grammar:

<sentence>             ::= <noun phrase> <verb phrase>

<verb phrase>          ::= <verb group>
                         | <verb phrase> <noun phrase>
                         | <verb phrase> <adjectival phrase>
                         | <verb phrase> <prepositional phrase>
                         | <verb phrase> <adverb>

<verb group>           ::= <verb>
                         | <auxiliary> <verb>

<noun phrase>          ::= <pronoun>
                         | <noun>
                         | <noun phrase> <conjunction> <pronoun>
                         | <adjectival phrase> <noun>
                         | <article> <noun>
                         | <noun phrase> <prepositional phrase>

<adjectival phrase>    ::= <adjective>
                         | <article> <adjective>

<prepositional phrase> ::= <preposition> <noun phrase>
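For later reference, the grammar can be transcribed directly into a data structure. The Python sketch below is ours, not part of the course text; the prepositional-phrase rule is assumed to be completed by a noun phrase (as in ‘in the garden’):

```python
# The Innglish grammar as a Python dictionary: each non-terminal maps to a
# list of alternative right-hand sides (one list of symbols per alternative).
GRAMMAR = {
    '<sentence>': [['<noun phrase>', '<verb phrase>']],
    '<verb phrase>': [
        ['<verb group>'],
        ['<verb phrase>', '<noun phrase>'],
        ['<verb phrase>', '<adjectival phrase>'],
        ['<verb phrase>', '<prepositional phrase>'],
        ['<verb phrase>', '<adverb>'],
    ],
    '<verb group>': [['<verb>'], ['<auxiliary>', '<verb>']],
    '<noun phrase>': [
        ['<pronoun>'],
        ['<noun>'],
        ['<noun phrase>', '<conjunction>', '<pronoun>'],
        ['<adjectival phrase>', '<noun>'],
        ['<article>', '<noun>'],
        ['<noun phrase>', '<prepositional phrase>'],
    ],
    '<adjectival phrase>': [['<adjective>'], ['<article>', '<adjective>']],
    # assumed completion: a preposition is followed by a noun phrase
    '<prepositional phrase>': [['<preposition>', '<noun phrase>']],
}

print(len(GRAMMAR['<noun phrase>']))   # 6 alternatives for a noun phrase
```

Any symbol that never appears as a key here is a terminal or pre-terminal category.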
3.2 Drawing Phrase-Structure Trees
Suppose that we have an input string such as ‘they are visiting firemen’. We can use
the grammar to recognise whether the sentence is a member of the language described
by the grammar or not. Indeed, if the sentence is a member of the language, then we
shall be able to draw the phrase-structure tree or trees. We draw trees with the root at
the top. The root is labelled with the distinguished symbol of the grammar:
<sentence>
|
‘They are visiting firemen’
We can then draw in the non-terminals that make up the sentence:
<sentence>
    <noun phrase>
        ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then pick one of the non-terminals <noun phrase> and expand it:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then repeat the process by expanding another non-terminal <verb phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            ‘are visiting’
        <noun phrase>
            ‘firemen’
We then repeat the process again by expanding <verb group>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            ‘firemen’
Finally, we expand <noun phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            <noun>
                ‘firemen’
We have now expanded all the non-terminal symbols of the sentence. However this
sentence can also be parsed in another way. The first 3 steps are the same:
<sentence>
|
‘They are visiting firemen’
We can then draw in the non-terminals that make up the sentence:
<sentence>
    <noun phrase>
        ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then pick one of the non-terminals <noun phrase> and expand it:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then repeat the process by expanding another non-terminal <verb phrase>.
However, this time the word ‘visiting’ is placed in the <noun phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            ‘are’
        <noun phrase>
            ‘visiting firemen’
We then expand <verb group>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <verb>
                ‘are’
        <noun phrase>
            ‘visiting firemen’
We then repeat the process again by expanding <noun phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <verb>
                ‘are’
        <noun phrase>
            <adjectival phrase>
                ‘visiting’
            <noun>
                ‘firemen’
Then we expand <adjectival phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <verb>
                ‘are’
        <noun phrase>
            <adjectival phrase>
                <adjective>
                    ‘visiting’
            <noun>
                ‘firemen’
As you can see, the sentence can have two different parse trees (and therefore, two
different meanings) depending on whether we treat the word ‘visiting’ as a verb, or as
an adjective. Note that we cannot tell from the sentence itself which of these meanings
is correct. We may be able to tell if the sentence is placed in context:
‘Who are they? They are visiting firemen.’
‘Where are they? They are visiting firemen.’
This example shows a clear case of global ambiguity. Ambiguity is where there is
more than one interpretation possible. In NLP we distinguish between global
ambiguity, where there is more than one possible interpretation of a whole utterance
(e.g. a sentence) and local ambiguity, where part of an utterance seems ambiguous
in isolation until later information allows us to discard all or most of the competing
interpretations.
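We can check mechanically that the sentence yields exactly these two trees. The sketch below (in Python, not part of the original materials) uses a simplified version of the grammar, keeping only the rules needed for this sentence and using <verb phrase> ::= <verb group> <noun phrase> in place of the left-recursive rule. It enumerates every parse of every span:

```python
from itertools import product

# Simplified, illustrative encoding of the Innglish grammar: only the rules
# needed for 'they are visiting firemen', with left recursion removed.
LEXICON = {
    'they': {'Pron'}, 'are': {'Aux', 'V'},
    'visiting': {'V', 'Adj'}, 'firemen': {'N'},
}
BINARY = [('S', 'NP', 'VP'), ('VP', 'VG', 'NP'),
          ('VG', 'Aux', 'V'), ('NP', 'AdjP', 'N')]
UNARY = [('VP', 'VG'), ('VG', 'V'), ('NP', 'Pron'),
         ('NP', 'N'), ('AdjP', 'Adj')]

def parses(words, cat, i, j):
    """Return every bracketed tree covering words[i:j] rooted at cat."""
    trees = []
    if j - i == 1 and cat in LEXICON[words[i]]:
        trees.append(f"({cat} {words[i]})")
    for parent, child in UNARY:                 # unary rules, e.g. NP -> Pron
        if parent == cat:
            trees += [f"({cat} {t})" for t in parses(words, child, i, j)]
    for parent, left, right in BINARY:          # binary rules, e.g. S -> NP VP
        if parent == cat:
            for k in range(i + 1, j):           # try every split point
                for l, r in product(parses(words, left, i, k),
                                    parses(words, right, k, j)):
                    trees.append(f"({cat} {l} {r})")
    return trees

words = 'they are visiting firemen'.split()
for tree in parses(words, 'S', 0, len(words)):
    print(tree)   # prints the two possible parse trees
```

One printed tree analyses ‘visiting’ as a verb after the auxiliary ‘are’; the other analyses ‘are’ as the main verb and ‘visiting’ as an adjective.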
3.3 Terminology
To those coming to NLP without much of a background in linguistics or traditional
grammar, the terms used in syntax can seem very confusing. We've already
encountered the words non-terminal, pre-terminal and terminal. The good news is that
terminals are simply words from the lexicon, so we need concern ourselves with
these no further.
Non-terminals
In Context-Free Grammars (CFGs), we write rules to describe phrases. Phrases are
simply groups of words; a phrase usually takes its name from some important word
within it, e.g.
<noun phrase>
<verb phrase>
<prepositional phrase>
<sentence>
There are other phrases that are sometimes used, such as <verb group>, used in the
previous example.
Pre-terminals
In the CFG we have used so far, we used categories such as: adjective, adverb,
auxiliary, determiner, noun, preposition, pronoun, verb. These are known as pre-terminals.
3.4 Search and Control in Parsing
We’ll now use the grammar and lexicon presented earlier in this section to show how
phrase-structure trees can be developed top-down or bottom-up. We’ll use Noam
Chomsky’s famous example ‘They are visiting firemen’ to show how the search can
be controlled either depth-first or breadth-first.
The grammar presented earlier was:

<sentence>             ::= <noun phrase> <verb phrase>

<verb phrase>          ::= <verb group>
                         | <verb phrase> <noun phrase>
                         | <verb phrase> <adjectival phrase>
                         | <verb phrase> <prepositional phrase>
                         | <verb phrase> <adverb>

<verb group>           ::= <verb>
                         | <auxiliary> <verb>

<noun phrase>          ::= <pronoun>
                         | <noun>
                         | <noun phrase> <conjunction> <pronoun>
                         | <adjectival phrase> <noun>
                         | <article> <noun>
                         | <noun phrase> <prepositional phrase>

<adjectival phrase>    ::= <adjective>
                         | <article> <adjective>

<prepositional phrase> ::= <preposition> <noun phrase>
The lexicon included:

adjective  ->  visiting
auxiliary  ->  are
noun       ->  firemen
pronoun    ->  they
verb       ->  are
verb       ->  flying
We have already seen that we can derive phrase-structure trees from the grammar for
the sentence ‘They are visiting firemen’. Now we’ll describe how to search for
solutions in the form of phrase-structure trees. There are two main search strategies:
top-down and bottom-up.
The Top-Down Search Strategy
The top-down search strategy is sometimes known as hypothesis-driven search,
because it operates by proposing that the input string (‘they are visiting firemen’), is
covered by the distinguished non-terminal of the grammar. In the example given
above, we would start by proposing <sentence> as the distinguished non-terminal:
<sentence>
|
‘They are visiting firemen’
We can then draw in the non-terminals that make up the sentence:
<sentence>
    <noun phrase>
        ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then pick one of the non-terminals <noun phrase> and expand it:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        ‘are visiting firemen’
We then repeat the process by expanding another non-terminal <verb phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            ‘are visiting’
        <noun phrase>
            ‘firemen’
We then repeat the process again by expanding <verb group>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            ‘firemen’
Finally, we expand <noun phrase>:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            <noun>
                ‘firemen’
The Bottom-Up Search Strategy
The bottom-up search strategy is sometimes known as data-driven search, because it
essentially operates by building upwards from the input string (in this example, ‘they
are visiting firemen’), to the distinguished non-terminal symbol of the grammar.
We start off from the sentence itself, broken down into its constituent parts:
<pronoun>
    ‘they’
<auxiliary>
    ‘are’
<verb>
    ‘visiting’
<noun>
    ‘firemen’
We can then try to add a non-terminal that dominates one or more ‘lower’ nodes:
<noun phrase>
    <pronoun>
        ‘they’
<auxiliary>
    ‘are’
<verb>
    ‘visiting’
<noun>
    ‘firemen’
We then repeat the process:
<noun phrase>
    <pronoun>
        ‘they’
<verb group>
    <auxiliary>
        ‘are’
    <verb>
        ‘visiting’
<noun>
    ‘firemen’
and again:
<noun phrase>
    <pronoun>
        ‘they’
<verb group>
    <auxiliary>
        ‘are’
    <verb>
        ‘visiting’
<noun phrase>
    <noun>
        ‘firemen’
and again:
<noun phrase>
    <pronoun>
        ‘they’
<verb phrase>
    <verb group>
        <auxiliary>
            ‘are’
        <verb>
            ‘visiting’
    <noun phrase>
        <noun>
            ‘firemen’
until we can finally add the distinguished non-terminal symbol to the root of the tree:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        <verb group>
            <auxiliary>
                ‘are’
            <verb>
                ‘visiting’
        <noun phrase>
            <noun>
                ‘firemen’
Control of the search strategy
As we saw earlier, more than one phrase-structure tree may be derived from the
grammar when applied to ‘they are visiting firemen’. We want our NLP systems to
find all trees of globally ambiguous sentences such as this. (We may want to choose
between the trees on the basis of some other kind of information, such as semantic
information.) When we're writing algorithms, we must have a method of ensuring that
we've considered every possibility. The process of handling alternatives in searches is
known as control. There are essentially two kinds of control: depth-first control and
breadth-first control.
Depth-first control of the search strategy
Control is the process of handling alternatives in search: depth-first control pursues
one alternative as far as possible until it is successful or blocks. Only then does it
consider the next alternative. For example, consider the top-down search when it has
got as far as:
<sentence>
    <noun phrase>
        <pronoun>
            ‘they’
    <verb phrase>
        ‘are visiting firemen’
There are two <verb phrase> rules which can be used:
<verb phrase> ::= <verb group>
                | <verb phrase> <noun phrase>
Depth-first control uses a stack, a last-in, first-out (LIFO) data structure which
operates like a railway siding. Items are always added (pushed) or removed (popped)
from the top of the stack. In depth-first control, the alternatives are placed on a stack
and the first item popped. This is expanded by placing alternatives on the stack. So
our stack would look like:
<verb group>  ::= <auxiliary> <verb>
<verb phrase> ::= <verb phrase> <noun phrase>
When the <auxiliary> and <verb> have been consumed, the alternatives rules for
<noun phrase> are placed on the stack:
<noun phrase> ::= <pronoun>
<noun phrase> ::= <noun>
<noun phrase> ::= <noun phrase> <conjunction> <pronoun>
<noun phrase> ::= <adjectival phrase> <noun>
<noun phrase> ::= <article> <noun>
<noun phrase> ::= <noun phrase> <prepositional phrase>
<verb phrase> ::= <verb phrase> <noun phrase>
Each alternative would be popped from the stack in turn, until only the following
remains:

<verb phrase> ::= <verb phrase> <noun phrase>
It is only now that this alternative <verb phrase> rule can be used.
Breadth-first control of the search strategy
Breadth-first control uses a queue rather than a stack data structure. A queue is a
first-in first-out (FIFO) data structure, i.e. elements are added at one end of the data
structure and taken away from the other. If you think about it, this is exactly how a
queue in a supermarket operates. Customers join at the tail of the queue, move
gradually up to the head, pay for their goods and leave.
If we look again at the handling of the alternative <verb phrase> rules, we'll see the
difference. Again, there are two VP rules which can be used:
<verb phrase> ::= <verb group>
                | <verb phrase> <noun phrase>
The first entry is removed from the queue and expanded, but the new entry goes at the
end of the queue:
<verb phrase> ::= <verb phrase> <noun phrase>
<verb group>  ::= <auxiliary> <verb>
This means that the next entry to be expanded is the alternative <verb phrase> rule,
not the <verb group>.
With one important exception, the two strategies produce the same results, although
not in the same order. If there are two solutions, one involving more ‘steps’ than the
other, then breadth-first search will find the solution with fewer ‘steps’ first. The order
in which depth-first search will find solutions depends on the order of rules in the
grammar.
The important exception is where the search never halts because there is an infinite
branch in the ‘search tree’. Consider a situation where there is one solution and an
infinite branch in the search tree. Breadth-first search will find the solution before
disappearing into infinity (because the solution must have fewer ‘steps’ than the
infinite branch). Depth-first search may find the solution before disappearing into
infinity, but only if the rules in the grammar are ordered so that the ‘finite’ solution is
found before the infinite branch.
If there are no infinite branches, which control is preferable? There is no obvious
answer to this question. Depth-first used to be the only practical option because on
average there are fewer entries on the stack at any one time. When computers were
very restricted as to the amount of memory available, this was an important criterion.
These days, memory is very much larger and so doesn't affect the choice of control to
the same extent.
We have shown that it is possible to have either a top-down or bottom-up search
strategy, depending on whether we start at the distinguished symbol (hypothesis) or
the words (data). We've also shown that it is possible to have either depth-first control
or breadth-first control. This implies that we can have four kinds of search algorithm:
1. Top-down, depth-first
2. Top-down, breadth-first
3. Bottom-up, depth-first
4. Bottom-up, breadth-first.
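The effect of the two control regimes can be sketched directly: the only difference between them is whether the frontier of the search is kept on a stack or in a queue. The search tree below is invented for illustration and is not one of the grammars above (Python, not part of the original materials):

```python
# Depth-first vs breadth-first control over the same illustrative search
# tree: the only change is which end of the frontier we take entries from.
TREE = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
}

def search(start, depth_first=True):
    frontier = [start]
    order = []
    while frontier:
        # stack: take the most recently added entry; queue: take the oldest
        node = frontier.pop() if depth_first else frontier.pop(0)
        order.append(node)
        frontier.extend(TREE.get(node, []))
    return order

print(search('A', depth_first=True))    # ['A', 'C', 'F', 'B', 'E', 'D']
print(search('A', depth_first=False))   # ['A', 'B', 'C', 'D', 'E', 'F']
```

Depth-first follows one branch to the bottom before backtracking; breadth-first visits every node at one level before moving deeper.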
4. FURTHER TECHNIQUES USED IN NLP
4.1 Key Phrase Matching: ELIZA and the Chatterbots
ELIZA is a computer program which emulates the behaviour of a client-centred
psychotherapist. The original program was written by Joseph Weizenbaum at MIT in
1966 for a mainframe computer, but there are numerous PC versions available, as well
as web-based versions. Details of how to obtain some of these are given later.
The following is a typical conversation (Eliza’s responses are in capitals):
Men are all alike.
IN WHAT WAY?
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE?
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
It's true. I am unhappy
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY
My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
My father.
YOUR FATHER
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
You don't argue with me.
WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO YOUR MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE
YOU COME HERE
The operation of the program is straightforward. The input is read and inspected for
the presence of a keyword. If one is found, the sentence is transformed according to a
rule associated with the keyword or, under certain conditions, an earlier
transformation is retrieved. The generated text is then displayed.
Keywords can have a rank or precedence number. The program will abandon a
keyword already found in the left-to-right scan of the text in favour of one having a
higher rank. Commas and periods are treated as delimiters. If one is encountered after
a keyword has been found, all subsequent text is deleted from the input line. If no
keyword has yet been located, the text prior to the delimiter is deleted and the scan
continues. As a result, only single phrases or sentences are ever transformed.
The fundamental technical problems with which ELIZA has to deal are the
following:
1. The identification of the ‘most important’ keyword in the input message.
2. The identification of some minimal context within which the chosen
keyword appears; e.g., if the keyword is ‘you’, is it followed by the word
‘are’ (in which case an assertion is probably being made).
3. The choice of an appropriate transformation rule, and, of course, the
making of the transformation itself.
4. The provision of a mechanism that will permit ELIZA to respond
‘intelligently’ when the input text contains no keywords.
Eliza often appears to behave in an ‘intelligent’ fashion, but there is no real analysis
of the input text and certainly no attempt to understand it. Weizenbaum himself went
to great pains to point out how trivial the program was. You can read his original
paper at:
http://acf5.nyu.edu/~mm64/x52.9265/january1966.html
You can find a web-based version of Eliza at:
http://www.uib.no/People/hhiso/eliza.html
A similar program (Dr Werner Wilhelm Webowitz!) can be found at:
http://www.parnasse.com/drwww.shtml
Various versions of Eliza can be downloaded from:
http://www-cgi.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/classics/eliza/0.html
Eliza spawned a whole generation of programs known collectively as chatterbots.
You can find links to most of them here:
http://www.simonlaven.com
One of the most interesting chatterbots was Julia, who emulates a real player in multi-user role-playing games. Julia was sufficiently realistic to encourage at least one
other participant to ask for a date. You can read a hilarious account of her adventures
at:
http://foner.www.media.mit.edu/people/foner/Julia/Julia.html
Another interesting chatterbot is the Chomskybot, which can be found at:
http://rubberducky.org/cgi-bin/chomsky.pl
This program generates paragraphs of text in the style of the noted linguist, Noam
Chomsky. An example is given below:
‘Conversely, any associated supporting element is rather different from the
requirement that branching is not tolerated within the dominance scope of a
complex symbol. If the position of the trace in (99c) were only relatively
inaccessible to movement, the appearance of parasitic gaps in domains
relatively inaccessible to ordinary extraction cannot be arbitrary in the system
of base rules exclusive of the lexicon. I suggested that these results would
follow from the assumption that a descriptively adequate grammar does not
affect the structure of a general convention regarding the forms of the
grammar. So far, the descriptive power of the base component raises serious
doubts about the ultimate standard that determines the accuracy of any
proposed grammar. However, this assumption is not correct, since an
important property of these three types of EC may remedy and, at the same
time, eliminate non-distinctness in the sense of distinctive feature theory.’
If you think this is difficult to follow, you should try reading some real examples of
Chomsky’s work!
4.1.2 Finite State Automata
A Finite State Automaton (FSA) or Finite State Machine is a mathematical model of
a system which has discrete inputs and outputs and can be in any one of a finite
number of states. A state summarises the information concerning past inputs that is
needed to determine the behaviour of the system on subsequent inputs. Typical
examples are the control mechanism for an elevator, or the lexical analyser
component of a programming language compiler.
Let’s look at a very simple FSA which can be used to recognise strings written in the
sheep language, Sheeptalk, i.e. any string from the following set:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa! etc.
We can show the automaton as a directed graph: a finite set of vertices (or nodes),
together with a set of directed links (or arcs) between pairs of vertices. We'll represent
vertices with circles and arcs with arrows. Written out arc by arc, the automaton is:

q0 --b--> q1
q1 --a--> q2
q2 --a--> q3
q3 --a--> q3   (a self-loop, allowing any number of extra a’s)
q3 --!--> q4
This automaton has five states, represented by nodes in the graph. State 0 is the start
state, represented by the incoming arrow. State 4 is the final state or accepting state,
represented by the double circle. The automaton also has five transitions, represented
by arcs in the graph.
The FSA can be used for accepting or recognising strings as follows. Think of the
input as being written on a long tape broken up into cells, with one symbol written in
each cell of the tape:
| b | a | a | a | ! |
The machine begins in the start state (q0), and goes through the following process:
Check the next letter of the input. If it matches the symbol on an arc leaving the
current state, then cross that arc, move to the next state, and advance one symbol in
the input. If the machine is in the accepting state (q4) when it runs out of input, it has
successfully recognised an instance of Sheeptalk.
If the machine never gets to the final state, either because it runs out of input, or it
gets some input that doesn't match an arc, or if it gets stuck in some non-final state,
we say it rejects or fails to accept an input.
We can also represent an automaton by a state-transition table. As with the directed
graph notation, the state-transition table represents the start state, the accepting states,
and what transitions leave each state with which symbols.
              Input
State      b     a     !
  0        1     0     0
  1        0     2     0
  2        0     3     0
  3        0     3     4
  4:       0     0     0
State 4 is marked with a colon to indicate that it's a final state (you can have as many
final states as you want), and the 0 indicates an illegal or missing transition. The first
row should be read as: ‘if we're in state 0 and we see the input b we go to state 1. If
we're in state 0 and we see the input a or !, we fail’.
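The state-transition table maps directly onto code. The sketch below (ours, not from the text) drives the recogniser from the table, with state 0 as the start state and state 4 as the only accepting state:

```python
# Sheeptalk recogniser driven by the state-transition table above.
# A missing (state, symbol) pair corresponds to a 0 entry in the table.
TABLE = {
    (0, 'b'): 1,
    (1, 'a'): 2,
    (2, 'a'): 3,
    (3, 'a'): 3,   # self-loop: any number of extra a's
    (3, '!'): 4,
}
ACCEPTING = {4}

def accepts(string):
    state = 0
    for symbol in string:
        if (state, symbol) not in TABLE:
            return False                     # illegal transition: reject
        state = TABLE[(state, symbol)]
    return state in ACCEPTING                # must finish in a final state

print(accepts('baaa!'))   # True
print(accepts('ba!'))     # False
```

Running out of input in a non-final state, or meeting a symbol with no matching arc, both cause the machine to reject, exactly as described above.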
It is possible to use Finite State Automata to recognise many acceptable English
sentences. However, they cannot model some language constructs, for instance centre-embedded phrases. Also, Finite State Automata descriptions of the syntax of natural
languages are repetitious and long-winded. The following is a description of part of
English that could account for phrases such as:
• the dog
• the large dog
• the very large dog
• the very very large dog
• the very very very large dog
etc.
q0 --article--> q1
q1 --adjective--> q1   (a self-loop)
q1 --noun--> q2        (q2 is the final state)
Instead of labelling the arcs with pre-terminals, such as article, adjective and noun,
we'll use strings of letters to illustrate our examples. We could draw an FSA that
could recognise the following strings:
• cde
• ccde
• cdee
• ccdee
• cccccccdeee
q0 --c--> q1
q1 --c--> q1   (a self-loop)
q1 --d--> q2
q2 --e--> q3   (q3 is the final state)
q3 --e--> q3   (a self-loop)
This language allows any number of c and any number of e. However, suppose that
we want to model a recogniser that will accept any of the following:
• cde
• ccdee
• cccdeee
but reject any of the following:
• ccde
• cdee
• cccdee
• cccccccdeee
• cdeeeeeeeeeeee
i.e. the number of c and the number of e must be the same.
It is impossible to draw a finite state automaton that will recognise this language and
only this language.
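The limitation is one of memory: a recogniser with a single counter (more memory than any finite set of states provides) handles the language easily. The sketch below is ours, not from the text:

```python
# Recognise strings of the form c^n d e^n (equal numbers of c's and e's),
# which is beyond any finite state automaton but easy with one counter.
def recognises(s):
    i = 0
    cs = 0
    while i < len(s) and s[i] == 'c':        # count the leading c's
        cs += 1
        i += 1
    if cs == 0 or i == len(s) or s[i] != 'd':
        return False                         # need some c's, then exactly one d
    i += 1
    es = 0
    while i < len(s) and s[i] == 'e':        # count the trailing e's
        es += 1
        i += 1
    return i == len(s) and es == cs          # nothing left over, counts equal

print(recognises('ccdee'))   # True
print(recognises('ccde'))    # False
```

An FSA would need a separate state for every possible count of c's, and no finite number of states suffices.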
It may seem that these strings of arbitrary letters have nothing to do with English.
However, English includes sentences which are similar in structure. Consider a
sentence such as:
‘The girl whose mother told me that she'd been painted by Van Gogh at the party
shouted.’
We can split this into several parts:
The girl
whose mother told me
that she'd been painted
by Van Gogh
at the party
shouted.
We could go on extending this sentence indefinitely by simply embedding more and
more phrases in the centre. This type of structure is known as centre-embedded.
Centre-embedded sentences cannot be modelled by an FSA, so we can reasonably
conclude that it is impossible to model the whole of English grammar by an FSA.
4.1.3 Using Templates
Many web sites wish to display live data (such as the current date or time) or collect
data from users, e.g. payment or delivery information. This can be done by directly
coding the required commands into the web site, usually in HTML. However, the
problem is often more easily solved by the use of templates, written in a scripting language
such as Perl. The big advantage of this approach is that all pages depending on a
specific template can be amended simply by changing the template. This is a complex
area, but you can obtain more information from:
http://www.zdnet.com/devhead/stories/articles/0,4413,2184927,00.html
Templates can also be used to generate text. A humorous example can be found at:
http://after.logos.uwaterloo.ca/~tjdonald/harpo/harpo.html
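The idea can be sketched in a few lines (Python's string.Template here, though the text mentions Perl; the page layout is invented): the fixed text lives in one template, and live data is substituted into named slots.

```python
from string import Template

# A page template: the fixed layout is written once, and live data such as
# the current date is substituted into the $-named slots at request time.
page = Template('<html><body>Hello $name, today is $date.</body></html>')

html = page.substitute(name='Alice', date='1 June 2001')
print(html)   # <html><body>Hello Alice, today is 1 June 2001.</body></html>
```

Changing the Template string changes every page generated from it, which is the maintenance advantage described above.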
4.1.4 Representations using Logic
One important area of NLP is the study of the semantics or meaning of natural
language statements. In many cases, the most important aspect of semantics is
determining whether a sentence is true or false. We can simplify this task by defining
a formal language with simple semantics and mapping natural language sentences on
to it.
This formal language should be unambiguous, have simple rules of interpretation and
inference and have a logical structure determined by the form of the sentence. Two
commonly used formal languages are propositional logic and predicate logic.
Propositional logic is concerned with the logic of truth functional, sentential or
propositional operators such as and, or and not.
Sentential operators are those which operate on one or more complete sentences to
give a new sentence. If they are also truth functional operators then the truth of the
resulting sentence can be determined knowing only the truth values of the sentences
from which it was constructed.
An example is the construction known as conjunction. This consists of joining two
sentences with the connective and, e.g. the conjunction of the two sentences:

• Grass is green
• Pigs don't fly

is the sentence:

• Grass is green and pigs don't fly.
The conjunction of two sentences will be true if, and only if, each of the two
sentences from which it was formed is true.
Other propositional connectives include:

p or q     known as the disjunction of ‘p’ and ‘q’.
not p      known as the negation of ‘p’.
In natural languages, words whose primary role is truth functional often have other roles as well. This is one of many ways in which natural languages fall short for some logical or technical purposes, and formal languages may be helpful in overcoming these difficulties. Where a logic is concerned only with sentential connectives, it is usually called a propositional logic.
The best known, and probably the simplest, of these logics is classical or Boolean propositional logic, in which it is assumed that all propositions have a definite truth value: a proposition is either true or it is false.
A predicate is a feature of language that you can use to make a statement about something, e.g. to attribute a property to that thing. If you say, ‘Peter is tall’, then you have applied the predicate ‘is tall’ to Peter. A predicate may be thought of as a kind of function that applies to things and yields a proposition; predicates are therefore sometimes known as propositional functions.
Analysing the predicate structure of sentences permits us to make use of the internal structure of atomic sentences, and to understand the structure of arguments that cannot be accounted for by propositional logic alone.
Predicates (or relations):
• are operators that yield atomic sentences
• operate on things other than sentences
• are therefore not truth functional operators
• yield atomic sentences whose truth can be determined knowing only the identity of the things to which the predicate is applied.
The term relation is typically used of a predicate that is applied to more than one thing, e.g. ‘greater than’, which is applied to two things to make a comparison, but it can also be used for predicates taking one or zero things. The number of ‘things’ involved (as arguments) is called the arity of the predicate or relation.
Though predicates are one of the features that distinguish predicate logic from propositional logic, they are really just the extra structure needed to permit the study of quantifiers. The two important features of natural languages captured in predicate logic are the terms ‘every’ and ‘some’, sometimes called the universal and existential quantifiers. These features of language refer to one or more individuals or things, which are not by themselves propositions, and they therefore force some kind of analysis of the structure of ‘atomic’ propositions.
Where a logic is concerned not only with sentential connectives but also with the internal structure of atomic propositions, it is usually called a predicate logic. The best known, and probably the simplest, of these logics is classical or Boolean first-order predicate logic.
‘Classical’ or ‘Boolean’ simply means that propositions are either true or false.
‘First-order’ means that we consider predicates (or relations) on the one hand, and
individuals on the other; that atomic sentences are constructed by applying the former
to the latter; and that quantification is permitted only over the individuals.
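Over a finite domain of individuals, the two quantifiers can be modelled directly: ‘every’ corresponds to checking all individuals and ‘some’ to checking at least one. The following sketch uses an invented toy domain and predicates purely for illustration.

```python
# A toy domain of individuals and two one-place predicates (arity 1).
domain = ["socrates", "plato", "fido"]

def is_man(x):
    return x in ("socrates", "plato")

def is_mortal(x):
    return True  # in this toy model, every individual is mortal

# Universal quantification: "every man is mortal".
every_man_is_mortal = all(is_mortal(x) for x in domain if is_man(x))

# Existential quantification: "some individual is a man".
some_man_exists = any(is_man(x) for x in domain)

print(every_man_is_mortal, some_man_exists)
```

Both statements come out true in this model; changing the predicates or the domain changes the truth values, which is exactly what ‘interpretation in a model’ means.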
A classic example of what can be done with predicate logic is the inference from the premises:
• All men are mortal.
• Socrates is a man.
to the conclusion:
• Socrates is mortal.
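One simple way to mechanise this kind of inference is forward chaining: facts are stored as (predicate, individual) pairs and rules derive new facts until nothing changes. The sketch below is a toy illustration, not a full theorem prover; the single rule encodes the premise ‘All men are mortal’.

```python
# Known facts as (predicate, individual) pairs.
facts = {("man", "socrates")}

# "All men are mortal" as a rule: if man(X) then mortal(X).
rules = [("man", "mortal")]

def forward_chain(facts, rules):
    """Apply each rule to every matching fact until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            for pred, individual in list(derived):
                if pred == antecedent and (consequent, individual) not in derived:
                    derived.add((consequent, individual))
                    changed = True
    return derived

print(("mortal", "socrates") in forward_chain(facts, rules))
```

The conclusion ("mortal", "socrates") is derived mechanically from the two premises, mirroring the inference above.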
4.2 Examples of NLP Techniques in Use
4.2.1 Using Logic in Answering Questions From A Database
Database searching is based on the principles of Boolean logic, a symbolic logic system named after George Boole, the British mathematician who invented it. Most search engines use a set of connecting words (AND, OR, NOT and NEAR) to make your search more useful.
The Internet can be regarded as a vast computer database, whose contents can
be searched according to the rules of computer database searching. On many
Internet search engines, the options to construct logical relationships among
search terms extend beyond the traditional practice of Boolean searching. Full
details can be found at:
http://library.albany.edu/internet/boolean.html
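A minimal sketch of how Boolean searching works over an inverted index follows. The documents are made up, and NEAR is omitted because it requires word-position information.

```python
# Tiny invented document collection.
docs = {
    1: "grammar rules for english parsing",
    2: "parsing russian text with templates",
    3: "boolean logic and search engines",
}

# Inverted index: word -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

all_ids = set(docs)

def AND(a, b):
    return a & b   # documents matching both terms

def OR(a, b):
    return a | b   # documents matching either term

def NOT(a):
    return all_ids - a   # documents not matching the term

# "parsing AND NOT russian": documents about parsing but not Russian.
result = AND(index["parsing"], NOT(index["russian"]))
print(sorted(result))
```

The query narrows the two parsing documents down to the one that does not mention Russian, showing how the connectives combine sets of matching documents.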
4.2.2 Key Phrase Matching in Text Retrieval
Browsing accounts for much of the time users spend interacting with online text
collections or digital libraries, but it is poorly supported by standard search engines.
Conventional systems often operate at the wrong level, indexing words when people
think in terms of topics, and returning documents when people want a broader view.
As a result, users cannot easily determine what is in a collection, how well a particular
topic is covered, or what kinds of queries will provide useful results.
Gutwin, Paynter et al. built a new kind of search engine, Keyphind, explicitly designed to support browsing. Automatically extracted keyphrases form the basic unit
of both indexing and presentation, allowing users to interact with the collection at the
level of topics and subjects rather than words and documents. The keyphrase index
also provides a simple mechanism for clustering documents, refining queries, and
previewing results.
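The idea of a keyphrase index can be sketched in a few lines. Here the keyphrases and document names are supplied by hand; extracting the phrases automatically is the hard part that Keyphind itself addresses.

```python
# Invented keyphrases per document (Keyphind extracts these automatically).
doc_keyphrases = {
    "paper1": ["machine learning", "text classification"],
    "paper2": ["machine learning", "neural networks"],
    "paper3": ["information retrieval", "text classification"],
}

# Invert the mapping: keyphrase -> documents, so users browse by topic.
phrase_index = {}
for doc, phrases in doc_keyphrases.items():
    for phrase in phrases:
        phrase_index.setdefault(phrase, set()).add(doc)

# Browsing: list every topic with its coverage, then drill into one topic.
for phrase, matched in sorted(phrase_index.items()):
    print(f"{phrase}: {len(matched)} document(s)")
print(sorted(phrase_index["machine learning"]))
```

Because the index is keyed by topic rather than by individual word, a user can see at a glance which subjects the collection covers and how well.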
The authors compared Keyphind to a traditional query engine in a small usability
study. Users reported that certain kinds of browsing tasks were much easier with the
new interface, indicating that a keyphrase index would be a useful supplement to
existing search tools.
This is described in some detail in their paper ‘Improving Browsing in Digital
Libraries with Keyphrase Indexes’. The complete paper can be found at:
http://www.cs.usask.ca/homepages/faculty/gutwin/1999/keyphind-journal/keyphind-12-final-TR.html
4.2.3 Using Templates in Intelligent Tutoring Systems
We have already given some consideration to Intelligent Tutoring Systems (ITS) in
Section 1. One major problem with ITSs is the expense of developing them. This
expense can be reduced considerably if we do not have to start afresh every time. One
way of doing this is to make use of shells or templates that can be used across a range of systems.
This area was one of several addressed at an ITS Workshop held in Montreal in 1996.
A collection of the papers from this workshop is available online at:
http://advlearn.lrdc.pitt.edu/its-arch/papers/index.html
The following papers are of particular interest:
http://advlearn.lrdc.pitt.edu/its-arch/papers/blumenthal.html
http://advlearn.lrdc.pitt.edu/its-arch/papers/brusilovsky.html
http://advlearn.lrdc.pitt.edu/its-arch/papers/fleming.html
http://advlearn.lrdc.pitt.edu/its-arch/papers/goodkov.html
4.3 NLP Software
There are a number of sources on the World Wide Web for NLP software. Try the
following:
Natural Language Software Registry:
http://registry.dfki.de/
Summer Institute of Linguistics:
http://www.sil.org/computing/catalog/