EECS 595 / LING 541 / SI 661&761
Natural Language Processing
Fall 2005
Lecture Notes #1

Introduction

Course logistics
• Instructor: Prof. Dragomir Radev (radev@umich.edu)
  Ph.D., Computer Science, Columbia University
  Formerly at IBM TJ Watson Research Center
• Times: Thursdays 2:40-5:25 PM, in 411, West Hall
• Office hours: TBA, 3080 West Hall Connector
Course home page: http://www.si.umich.edu/~radev/NLP-fall2005

Example (from a famous movie)
Dave Bowman: Open the pod bay doors, HAL.
HAL: I’m sorry Dave. I’m afraid I can’t do that.

Example
I saw her fall
• How many different interpretations does the above sentence have? How many of them are reasonable/grammatical?

Example 1
The Standard and Poor's 500 and the Nasdaq composite index both reached four-year highs Thursday as investors, unfazed by oil prices nearing $70 per barrel, welcomed a raft of strong earnings reports.
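One rough way to see why "I saw her fall" is ambiguous is to enumerate the part-of-speech assignments each word allows. The sketch below is a toy illustration, not a real tagger: the lexicon, the tag names, and the function `tag_sequences` are all assumptions made up for this example.

```python
from itertools import product

# Hypothetical toy lexicon: these tag sets are illustrative assumptions,
# not the output of any real tagger or dictionary.
LEXICON = {
    "i": ["PRON"],
    "saw": ["VERB_PAST_OF_SEE", "VERB_PRESENT_OF_SAW"],
    "her": ["PRON_OBJECT", "DET_POSSESSIVE"],
    "fall": ["VERB", "NOUN"],
}


def tag_sequences(sentence):
    """Return every combination of tags the toy lexicon allows."""
    words = sentence.lower().rstrip(".").split()
    options = [LEXICON[word] for word in words]
    # product() yields one tuple of tags per candidate reading.
    return [list(zip(words, tags)) for tags in product(*options)]


readings = tag_sequences("I saw her fall.")
for reading in readings:
    print(" ".join(f"{word}/{tag}" for word, tag in reading))
print(len(readings), "candidate tag sequences")
```

Enumeration only gives the candidate tag sequences. Deciding which of them are grammatical (for example, ruling out an object pronoun directly followed by a noun) takes syntactic knowledge that this sketch does not model.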
Example 2
Accenture posts higher earnings
Consulting and technology services firm beats estimates; stock gains in after-hours trading.
July 7, 2005: 4:35 PM EDT
NEW YORK (Reuters) - Accenture Ltd., one of the world's largest consulting and technology services firms, posted a higher quarterly profit Thursday boosted by a rebound in consulting demand.
Fiscal third-quarter net income more than doubled to about $484 million, or 51 cents a share, from $210 million, or 37 cents a share, a year earlier, the company said.
Analysts had expected earnings of 43 cents a share, according to First Call.
Accenture stock rose about 2 percent in after-hours trading after falling nearly 6 percent in regular New York Stock Exchange trading.

• Gary Larson (“The Far Side”) cartoon:
• What we say to dogs:
  – “Okay Ginger! I’ve had it! You stay out of the garbage! Understand, Ginger?”
• What they hear:
  – “Blah Ginger! blah blah blah blah blah blah blah blah blah blah blah Ginger?”

Time Warner to hold off on Cablevision
But top Time Warner execs said it may eventually be interested in the cable assets.
July 8, 2005: 7:20 PM EDT
SUN VALLEY, Idaho (Reuters) - A top Time Warner Inc. executive said Friday it could not bid for Cablevision until it completes a deal to buy Adelphia Communications Corp., splashing cold water on early buyout speculation.
Time Warner is in a joint deal with Comcast Corp. to buy bankrupt cable provider Adelphia Communications Corp.
"We can't do anything else until we get it (Adelphia) integrated," said Don Logan, chairman of Time Warner's media and communications group. But he added, "We've always said we are interested in Cablevision. ... Anything is possible."
In June, the Dolan family offered Cablevision shareholders about $33.50 per share in a $7.9 billion deal to take the company private.
Analysts and one of Cablevision's top investors have said the offer is too low and could put the cable system, which serves 3 million customers in the New York area, into play for other suitors, including Time Warner Cable and Comcast. Wall Street analysts said in June that Time Warner, if it were to bid, could top the offer with a $35 to $40 per share bid. Time Warner is the parent company of this Web site. Time Warner chief executive Dick Parsons said on Friday his company's decision about whether to buy Cablevision Corp. rests on whether the Dolan family decides to put it up for sale. "Chuck (Dolan) controls it and it's not as if we could take it away from him," Parsons said during a break at the Allen & Co. conference in Sun Valley, Idaho. "When he's ready to bring that asset to market he knows we're here." Parsons would not comment on whether he has had recent conversations with Dolan about buying Cablevision. Parsons said he and Dolan agree that cable assets are undervalued and that now is a good time to buy them. Time Warner is the parent company of CNN/Money. Stocks edge up Major gauges make tentative gains at Friday's open after steep Fed-inspired selloff. July 1, 2005: 9:46 AM EDT NEW YORK (CNN/Money) - Stocks inched higher early Friday, recovering some from the big selloff after the Federal Reserve boosted interest rates again, and signaled it didn't intend to pause anytime soon. The Dow Jones industrial average (down 99.51 to 10,274.97, Charts), the broader Standard & Poor's 500 (up 2.50 to 1,193.83, Charts) index and the Nasdaq composite (up 4.84 to 2,061.80, Charts) all added a few points in the early going, with the Nasdaq lagging the blue chip indicators a bit. Stocks ended a mixed quarter on a down note Thursday, with the Dow losing more than 100 points after the Fed raised the target for its fed funds rate, an overnight bank lending rate, another quarter point to 3.25 percent. 
In the closely watched statement, the central bankers acknowledged the impact of higher energy prices and other negatives, but said the economic expansion remains on track. They also pledged to keep raising rates at a "measured" pace, all of which suggested that they don't plan to pause in the near term. Gains early Friday were broad based, with 27 out of 30 Dow issues rising. In corporate news, Microsoft (up $0.02 to $24.86, Research) has settled antitrust claims made by IBM (unchanged at $74.20, Research), the companies said Friday. The software leader will pay IBM $775 million as part of the deal. A number of economic reports were due around 10 a.m. ET. The Institute for Supply Management's manufacturing index for June was expected to have risen to 51.5 in the month from 51.4 in May, according to a consensus of economists surveyed by Briefing.com. The revised read on June consumer sentiment from the University of Michigan was also due, as was the May read on construction spending. Treasury prices slipped after Thursday's big rally. The fall raised the yield on the 10-year note to 3.94 percent from 3.92 percent late Thursday. Treasury prices and yields move in opposite directions. In currency trading, the dollar jumped versus the euro and the yen. U.S. light crude oil for August delivery rose 32 cents to trade at $56.82 a barrel in electronic trading. Crude set a record closing price for a nearby futures contract at $60.54 on Monday. COMEX gold fell $1.20 to $435.90 an ounce. In global trade, Asian-Pacific markets ended mostly lower, and European markets rose at midday. Google cracks $300 Shares of the popular search engine pass $300 for the first time and are now up 260% since IPO. June 27, 2005: 5:52 PM EDT By Paul R. La Monica, CNN/Money senior writer NEW YORK (CNN/Money) - Shares of Google, the popular search-engine company, surpassed the $300 level for the first time on Monday, sparking memories of the dot-com stock craze of the late 1990s. 
Google gained 2.3 percent to finish at $304.10, slightly below its high for the day of $304.30. The stock has now gained nearly 260 percent since it went public last August at $85 a share. Much of the optimism surrounding Google comes from the fact that it is the leader in the white-hot online advertising industry. The company reported much better than expected sales and earnings for the first quarter, thanks to a booming market for online advertising, particularly ads tied to specific keyword searches. And during the past few weeks, Google has released several new features -- including a desktop search function for businesses and a test version of a personalized home page tool -- that should help the company remain competitive against rivals Yahoo! and Microsoft. Several analysts have also speculated that Google will soon launch an online payment service that could compete against eBay's PayPal. In addition, many investors have been betting that the company, which now has a market value of nearly $85 billion, will soon be added to the benchmark S&P 500 index. But the stock's meteoric rise as of late -- shares have surged more than 50 percent since the company reported first-quarter results in mid-April -- has some analysts thinking that the stock could take a hit in the near future. "You might see the stock pause temporarily," said Marianne Wolk, an analyst with Susquehanna Financial Group. "For the longer term, we're still very bullish but in the very short term it wouldn't be a surprise to see the stock stabilize or pull back." The key for Google will be how strong its second quarter results are. Google is set to report these numbers on July 21. Analysts expect Google's sales, excluding revenues it shares with affiliates, a figure known as traffic acquisition costs or TAC, to come in at $840 million, nearly double last year's levels. Earnings, excluding certain one-time charges, are forecast at $1.21, an increase of 121 percent from a year ago. 
Wolk thinks that Google should meet these targets but does not believe the company will report results that are significantly better than consensus projections. And if Google does not continue to beat estimates, the stock could take a bath. "For Google to keep heading higher, it's absolutely critical that they keep hitting numbers. Everyone now believes the story," said John Tinker, an analyst with ThinkEquity Partners. Still, many investors are finding it hard to bet against Google because it has been posting extremely strong levels of sales growth and healthy profit margins as a public company. So the comparisons to the late 1990s, when shares of many unprofitable Internet companies soared solely due to hype, may not be apt. To that end, Google is expected to generate nearly $3.6 billion in sales, excluding TAC and revenue of $5 billion next year as the company continues to benefit from a shift of advertising dollars from more mainstream media sources such as television, radio, and newspapers, to the Web. In addition to its ubiquitous search engine, Google has branched out into related areas in order to capitalize on the boom in online advertising. The company has a comparison shopping site, Froogle, a free e-mail service called Gmail which features ads embedded in e-mails, and a local search site that operates as kind of a Web version of the Yellow Pages. Google also has expanded rapidly abroad, with sales from outside the U.S. accounting for nearly 40 percent of total sales in the first quarter. What's more, some argue that Google is not overvalued, since it continues to trade at a discount to its top rival, Yahoo. However, this gap has narrowed significantly as of late. Google's price-to-earnings ratio, based on 2005 earnings estimates, is 58. Yahoo trades at 61.5 times earnings estimates for this year. "Google is not an undiscovered stock any more," said Tinker. "It's no longer inefficiently priced." 
And Google also potentially faces the issue of the summer sluggishness that typically affects Internet stocks. Last year, shares of several Internet companies plunged in July as results did not live up to lofty expectations.

Silly sentences
• Children make delicious snacks
• Stolen painting found by tree
• I saw the Grand Canyon flying to New York
• Court to try shooting defendant
• Ban on nude dancing on Governor’s desk
• Red tape holds up new bridges
• Iraqi head seeks arms
• Blair wins on budget, more lies ahead
• Local high school dropouts cut in half
• Hospitals are sued by seven foot doctors
• In America a woman has a baby every 15 minutes. How does she do that?

Main problems in language
• Novel words and usages
  – Blogs, little “r” me, 7342.67
  – Spam as verb, email
• Inconsistencies
  – Beverly Hills, Beverly Sills
  – junior college, college junior
  – pet spray, pet llama
• Parsing problems
  – Cup holder
  – Federal Reserve Board Chairman
• Implicature/reasoning
• World knowledge
• Subjectivity, scoping, negation

Types of ambiguity
• Morphological: Joe is quite impossible. Joe is quite important.
• Phonetic: Joe’s finger got number.
• Part of speech: Joe won the first round.
• Syntactic: Call Joe a taxi.
• PP attachment: Joe ate pizza with a fork. Joe ate pizza with meatballs. Joe ate pizza with Mike. Joe ate pizza with pleasure.
• Sense: Joe took the bar exam.
• Modality: Joe may win the lottery.
• Subjectivity: Joe believes that stocks will rise.
• Scoping: Joe likes ripe apples and pears.
• Negation: Joe likes his pizza with no cheese and tomatoes.
• Referential: Joe yelled at Mike. He had broken the bike. Joe yelled at Mike. He was angry at him.
• Reflexive: John bought him a present. John bought himself a present.
• Ellipsis and parallelism: Joe gave Mike a beer and Jeremy a glass of wine.
• Metonymy: Boston called and left a message for Joe.

Synonyms/paraphrases
The S&P 500 climbed 6.93, or 0.56 percent, to 1,243.72, its best close since June 12, 2001.
The Nasdaq gained 12.22, or 0.56 percent, to 2,198.44 for its best showing since June 8, 2001.
The DJIA rose 68.46, or 0.64 percent, to 10,705.55, its highest level since March 15.

What is Natural Language Processing
• Natural Language Processing (NLP) is the study of the computational treatment of natural language.
• NLP draws on research in Linguistics, Theoretical Computer Science, Mathematics and Statistics, Artificial Intelligence, Psychology, etc.

NLP
• Information extraction
• Named entity recognition
• Trend analysis
• Subjectivity analysis
• Text classification
• Anaphora resolution, alias resolution
• Cross-document crossreference
• Parsing
• Semantic analysis
• Word sense disambiguation
• Word clustering
• Question answering
• Summarization
• Document retrieval (filtering, routing)
• Structured text (relational tables)
• Paraphrasing and paraphrasing/entailment ID
• Text generation
• Machine translation

What is needed: (1) linguistic knowledge
• Examples:
  – Zipf’s law: rank(w_i) * freq(w_i) = const
  – Collocations:
    • Strong beer but *powerful beer
    • Big sister but *large sister
    • Stocks rise but ?stocks ascend (225,000 hits on Google vs. 47 hits)
  – Constituents:
    • Children eat pizza.
    • They eat pizza.
    • My cousin’s neighbor’s children eat pizza.
    • _ Eat pizza!
  – Burstiness
    • P(ct=2|ct>=1)
• How to get it:
  – Manual rules
  – Automatically acquired from large text collections (corpora)

Linguistics
• Knowledge about language:
  – Phonetics and phonology - the study of sounds
  – Morphology - the study of word components
  – Syntax - the study of sentence and phrase structure
  – Lexical semantics - the study of the meanings of words
  – Compositional semantics - how to combine words
  – Pragmatics - how to accomplish goals
  – Discourse conventions - how to deal with units larger than utterances

What is needed: (2) mathematical and computational tools
• Language models
• Estimation methods
• Hidden Markov Models (HMM): for sequences
• Context-free grammars (CFG): for trees
• Conditional Random Fields (CRF)
• Generative/discriminative models
• Maximum entropy models
• Random walks
• Latent semantic indexing (LSI)
• + Representation issues
• + Feature engineering

Theoretical Computer Science
• Automata
  – Deterministic and non-deterministic finite-state automata
  – Push-down automata
• Grammars
  – Regular grammars
  – Context-free grammars
  – Context-sensitive grammars
• Complexity
• Algorithms
  – Dynamic programming

Mathematics and Statistics
• Probabilities
• Statistical models
• Hypothesis testing
• Linear algebra
• Optimization
• Numerical methods

Artificial Intelligence
• Logic
  – First-order logic
  – Predicate calculus
• Agents
  – Speech acts
• Planning
• Constraint satisfaction
• Machine learning

Existing applications
• Web search
• Natural language interfaces to databases
• Parsing job postings
• Military intelligence
• Summarizing medical records
• Information extraction for databases
• Wrapper induction

Potential applications
• Trend recognition
• DB conversion + named entity extraction + classification + relation extraction
• Detecting change
• Summarization
• Social network analysis
• Assigning subjectivity scores (stars)
• Sentiment classification
• Alignment of text w/ other signal (time series)
• Record linkage

Current work at CLAIR
• Semi-supervised entity and relation extraction
• Subjectivity analysis + factuality extraction
• Protein interaction recognition
• Text summarization
• Text mining from the Web
• Lexical network models of the Web
• Syntactic alignment
• Chronology recovery
• Classification

Final remarks
• Language is not adversarial
• It is used to convey useful information
• Hard to extract this information automatically
• Need to use NLP
  – Inference: mathematics, statistics, machine learning
  – Networks/fields
  – Graph theory
  – Differential equations
  – Statistics/optimization
  – Linguistics/KR/AI
  – Sequence alignment
  – Linear algebra/vector analysis

Ambiguity
I saw her fall.
• The categories of knowledge of language can be thought of as ambiguity-resolving components
• How many different interpretations does the above sentence have?
• How can each ambiguous piece be resolved?
• Does speech input make the sentence even more ambiguous?
Time flies like an arrow.

The alphabet soup (NLP vs. CL vs. SP vs. HLT vs. NLE)
• NLP (Natural Language Processing)
• CL (Computational Linguistics)
• SP (Speech Processing)
• HLT (Human Language Technology)
• NLE (Natural Language Engineering)
• Other areas of research: Speech and Text Generation, Speech and Text Understanding, Information Extraction, Information Retrieval, Dialogue Processing, Inference
• Related areas: Spelling Correction, Grammar Correction, Text Summarization

Some demos
• AT&T Labs Text to Speech (http://www.research.att.com/projects/tts/demo.html)
• Babelfish (http://babelfish.altavista.com)
• OneAcross (http://www.oneacross.com)
• AskJeeves (http://www.ask.com)
• IONaut (http://www.ionaut.com:8400) – seems to be down
• NSIR (http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi)
• AnswerBus (http://www.answerbus.com)
• NewsInEssence (http://www.newsinessence.com)

The Turing Test
• Alan Turing: the Turing test (language as test for intelligence)
• Three participants: a computer and two humans (one is an interrogator)
• Interrogator’s goal: to tell the machine and human apart
• Machine’s goal: to fool the interrogator into believing that a person is responding
• Other human’s goal: to help the interrogator reach his goal
Q: Please write me a sonnet on the topic of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: 105621 (after a pause)

Some brief history
• Foundational insights (40’s and 50’s): automaton (Turing), probabilities, information theory (Shannon), formal languages (Backus and Naur), noisy channel and decoding (Shannon), first systems (Davis et al., Bell Labs)
• Two camps (57-70): symbolic and stochastic.
  Transformation grammar (Harris, Chomsky), artificial intelligence (Minsky, McCarthy, Shannon, Rochester), automated theorem proving and problem solving (Newell and Simon)
  Bayesian reasoning (Mosteller and Wallace)
  Corpus work (Kučera and Francis)

Some brief history
• Four paradigms (70-83): stochastic (IBM), logic-based (Colmerauer, Pereira and Warren, Kay, Bresnan), NLU (Winograd, Schank, Fillmore), discourse modelling (Grosz and Sidner)
• Empiricism and finite-state models redux (83-93): Kaplan and Kay (phonology and morphology), Church (syntax)
• Late years (94-03): strong integration of different techniques, different areas (including speech and IR), probabilistic models, machine learning

The state of the art and the near-term future
• World-Wide Web (WWW)
• Sample scenarios:
  – generate weather reports in two languages
  – teach deaf people to speak
  – translate Web pages into different languages
  – speak to your appliances
  – find restaurants
  – answer questions
  – grade essays (?)
  – closed-captioning in many languages
  – automatic description of a soccer game

Structure of the course
• Three major parts:
  – Linguistic, mathematical, and computational background
  – Computational models of morphology, syntax, semantics, discourse, pragmatics
  – Applications: text generation, machine translation, information extraction, etc.
• Three major goals:
  – Learn the basic principles and theoretical issues underlying natural language processing
  – Learn techniques and tools used to develop practical, robust systems that can communicate with users in one or more languages
  – Gain insight into many open research problems in natural language

Readings
• Speech and Language Processing (Daniel Jurafsky and James Martin)
  Prentice-Hall, 2000
  ISBN: 0-13-095069-6
• Handouts given in class
• 1-2 chapters per week
Optional readings:
Natural Language Understanding by Allen
Foundations of Statistical Natural Language Processing by Manning and Schütze.
Grading
• Four homework assignments (40%)
• Midterm (15%)
• Final project (20%)
• Final exam (25%)
• Additional requirements for SI761

Assignments
• (subject to change)
  – Finite-state modeling, part of speech tagging, and information extraction
    • Fsmtools/lextools/JMX (Bell Labs, Penn)
  – Tagging and parsing
    • Brill tagger/Charniak parser (JHU, Brown)
  – Machine translation
    • GIZA++/Rewrite decoder (Aachen, JHU, ISI)
  – Text generation
    • FUF/Surge (Columbia)

Syllabus
Introduction (JM1)
Linguistic Fundamentals
Regular Expressions and Automata (JM2)
Morphology and Finite-State Transducers (JM3)
Word Classes and Part of Speech Tagging (JM8)
Context-Free Grammars for English (JM9)
Parsing with Context-Free Grammars (JM10)
Features and Unification (JM11)
Lexicalized and Probabilistic Parsing (JM12)
Natural Language Generation (JM20)
(Cont’d)
The Functional Unification Formalism (Handout)
Language and Complexity (JM13)
Representing Meaning (JM14)
Semantic Analysis (JM15)
Discourse (JM18)
Rhetorical Analysis (Handout)
Dialogue and Conversational Agents (JM19)

Other meetings
• CLAIR meeting (TBA)
• Artificial Intelligence Seminar (Tuesdays 4-5:30)
• STIET (Thursdays 4-5:30)

Projects
Each student will be responsible for designing and completing a research project that demonstrates the ability to use concepts from the class in addressing a practical problem. A significant part of the final grade will depend on the project assignment. Students can elect to do a project on an assigned topic, or to select a topic of their own. The final version of the project will be put on the World Wide Web, and will be defended in front of the class at the end of the semester (procedure TBA). In some cases (and only with instructor’s approval), students may be allowed to work in pairs when the project’s scope is significant.
Sample projects
• Noun phrase parser
• Paraphrase identification
• Question answering
• NL access to databases
• Named entity tagging
• Rhetorical parsing
• Anaphora resolution, entity crossreference
• Document and sentence alignment
• Using bioinformatics methods
• Encyclopedia
• Information extraction
• Speech processing
• Sentence normalization
• Text summarization
• Sentence compression
• Definition extraction
• Crossword puzzle generation
• Prepositional phrase attachment
• Machine translation
• Generation
• Semi-structured document parsing
• Semantic analysis of short queries
• User-friendly summarization
• Number classification
• Domain-specific PP attachment
• Time-dependent fact extraction

Main research forums and other pointers
• Conferences: ACL/NAACL, SIGIR, AAAI/IJCAI, ANLP, Coling, HLT, EACL/NAACL, AMTA/MT Summit, ICSLP/Eurospeech
• Journals: Computational Linguistics, Natural Language Engineering, Information Retrieval, Information Processing and Management, ACM Transactions on Information Systems, ACM TALIP, ACM TSLP
• University centers: Columbia, CMU, JHU, Brown, UMass, MIT, UPenn, USC/ISI, NMSU, Michigan, Maryland, Edinburgh, Cambridge, Saarland, Sheffield, and many others
• Industrial research sites: IBM, SRI, BBN, MITRE, MSR, (AT&T, Bell Labs, PARC)
• Startups: Language Weaver, Ask.com, LCC
• The Anthology: http://www.aclweb.org/anthology

What this course is NOT
• EECS 597 / LING 792 / SI 661 “Language and Information”, last taught in Winter 2005, essentially an introduction to corpus-based and statistical NLP.
  – Topics covered: introduction to computational linguistics, information theory, data compression and coding, N-gram models, clustering, lexicography, collocations, text summarization, information extraction, question answering, word sense disambiguation, analysis of style, and other topics.
• SI 760 “Information Retrieval”, last taught Winter 2005.
  – Topics covered: information need, IR models, documents, queries, query languages, relevance, retrieval evaluation, reference collections, query expansion and relevance feedback, indexing and searching, XML retrieval, language modeling approaches, crawling the Web, hyperlink analysis, measuring the Web, similarity and clustering, social network analysis for IR, hubs and authorities, PageRank and HITS, focused crawling, relevance transfer, question answering
• The new advanced NLP/IR course, to be offered Winter 2006.
• An undergraduate Linguistics course such as Ling 212 “Intro to the Symbolic Analysis of Language” or Ling 320 “Programming for Linguistics and Language Studies”

Other sites
• Johns Hopkins University (Jason Eisner) http://www.cs.jhu.edu/~jason/465/
• Cornell University (Lillian Lee) http://courses.cs.cornell.edu/cs674/2002SP/
• Stanford University (Chris Manning) http://www.stanford.edu/class/cs224n/
• JHU Summer workshop http://www.clsp.jhu.edu/ws2003/calendar/preliminary.shtml

Readings
• J&M Chapters 1, 2
• “What is Computational Linguistics” by Hans Uszkoreit http://www.coli.uni-sb.de/~hansu/what_is_cl.html
• Lecture notes #1
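
Worked example: Zipf's law. The relation quoted on the linguistic-knowledge slide, rank(w_i) * freq(w_i) = const, can be checked by counting words and multiplying each word's frequency rank by its frequency. The following is a minimal sketch under stated assumptions: the function name `zipf_table`, the tokenizing regular expression, and the sample text are all made up for illustration, and the sample is constructed to follow the law exactly, which real corpora do only approximately.

```python
import re
from collections import Counter


def zipf_table(text, top=10):
    """Return (rank, word, frequency, rank * frequency) for the top words."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# Synthetic sample built so that rank * frequency is the same for every word.
sample = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)
table = zipf_table(sample)
for rank, word, freq, product_value in table:
    print(rank, word, freq, product_value)
```

To try it on real text, pass the contents of a large plain-text corpus as `text`. The product column should then be roughly, not exactly, constant across the middle ranks.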