I256 Applied Natural Language Processing
Fall 2009
Lecture 12
Projects
Barbara Rosario

Today
• Special guest: Rob Ennals, Intel Labs Berkeley
• More project ideas
• Next class
  – Finish up classification
  – Information extraction

Announcements
• Tuesday October 20: Assignment 4 due
  – 5% more if submitted at least 24 hours in advance
  – We'll accept late submissions if:
    1) You haven't submitted a previous homework late, and
    2) You let me know in advance (by the day before)
• Thursday October 15: Project proposal due
  http://courses.ischool.berkeley.edu/i256/f09/assignments/project_proposal.html
  – 1 page
  – General idea/topic
  – (If you know already) what kind of data/resources would you like to use?
  – (If you know already) what methods do you think you'll use?

Projects important dates
• Thursday October 15: Proposal due
• Thursday October 22: Receive feedback on proposal
• Thursday October 29: Turn in revised proposal (if required)
• Thursday November 12: Check point (more information later)
• December 1 and 3: Class presentations
• Thursday December 10 (subject to change): Final project writeup due

Rob Ennals

Project ideas
• Whatever you like and are interested in!
• Ideally, it should have at least one of the following elements:
• Interesting, novel application and/or data
  – e.g., topic classification on Reuters wouldn't count…
  – Twitter?
• New algorithm
  – Then you can use Reuters data…
• Linguistic analysis
  – To inform the NLP! (i.e., analysis that is useful to an NLP task/algorithm)
• Implementation for novel use (iPhone?)

Scaling Up to Large Datasets
System calls to external software
• Python is not able to perform the numerically intensive calculations required by machine learning methods nearly as quickly as lower-level languages such as C.
• On large datasets, you may find that the learning algorithm takes an unreasonable amount of time and memory to complete if you use the pure-Python machine learning implementations.
• NLTK provides facilities for interfacing with external machine learning packages.
• Once these packages have been installed, NLTK can transparently invoke them (via system calls) to train classifier models significantly faster than the pure-Python classifier implementations.
• See the NLTK webpage for a list of recommended machine learning packages that are supported by NLTK.

Software
• If you need some fancy (i.e., expensive) software, let me know ASAP
  – I may be able to buy it and let you use it for the projects
• An annotated list of resources: http://nlp.stanford.edu/links/statnlp.html

Final Project Ideas
• NLP with me all the time: Interfaces 90% useful 90% of the time
• What are the NLP problems for a speech interface that is always with me?
• Take an audio recorder with you for a whole day. Record all the speech commands you would give to your perfect interface
  – Call Mike
  – Write this message to Sally: hi Sally movie tonight?
  – Remind me to buy milk when I go to the store
  – Put dentist on Tue on the calendar
  – Where can I buy a Bluetooth device nearby?
  – Set Facebook status class today sucked glad is over
  – Twitter class today sucked glad is over

NLP with me all the time
• Analysis
  – Analyze the commands
  – How many types of actions/classes?
  – What NLP apps (translation, extraction, etc.)?
  – Call [Mike]: action/class = phone, argument = Mike
    • NLP tasks: classification and extraction
  – Set Facebook status [class today sucked glad is over]: action/class = facebook, argument = [class today sucked glad is over]
    • NLP tasks: classification and extraction
• Build an NLP algorithm for this data

NLP with me all the time
• Additional: note the context of what you were doing while you said the commands (we are interested in how the context can inform the NLP)
  – For example: send this picture to Annette
  – Context: Annette is in front of me

Final Project Ideas
• NLP summarization for audio interfaces
  – Summarize email, blogs, news articles
  – Different lengths or incremental (tell me more, or tell me less: get to the point!)
  – (Are audio summaries different from written ones?)

Final Project Ideas
• Intel® Reader
• To assist people with various disabilities (blindness, dyslexia)
• The Intel Reader performs text-to-speech (TTS) on captured images (with OCR) and downloaded text files

Intel® Reader
• Text to speech: Improved Speech Output
  – Contextual Pronunciation
    • TTS engines are still relatively poor on context-based pronunciation variations
  – Examples: “LIVE”, “LEAD”
    • “I live in California” vs. “I watched the live performance of the concert”
    • “That battery is made from lead” vs. “I will lead the troops into battle”

Final Project Ideas
• Two NLP problems for Intel® Reader
• Contextual Pronunciation
  – Identify words that have ambiguous pronunciation
  – Choose the right pronunciation
• OCR errors
  – Identify words that are mistakes (o-c, miso, misc)
  – Choose the right words

Final Project Ideas
• Blog analysis
  – Categorize blog topics (maybe including link analysis)
  – Segment blogs into pieces based on topics
  – Do blog author analysis
  – Summarize blog reaction to some event, e.g., what did people think of “An Inconvenient Truth”?
• There is a contest on this:
  – http://www.icwsm.org/

Final Project Ideas
• Create a Negativity/Emotion/Flame Recognizer
  – There is some related work, but this is somewhat under-explored
  – Emotions in email, blogs, Facebook statuses…

Previous Final Projects
• HomeSkim (2005)
  – Chan, Lib, Mittal, Poon
  – Apartment search mashup
  – Extracted fields from Craigslist listings
  – http://www.ischool.berkeley.edu/programs/masters/projects/2006/homeskim
• Orpheus (2004)
  – Maury, Viswanathan, Yang
  – Tool for discovering new and independent recording artists
  – Extracted artists, links, reviews from music websites
  – http://groups.sims.berkeley.edu/orpheus/demo/orpheus_demo.swf
• Breaking Story (2002)
  – Reffell, Fitzpatrick, Aydelott
  – Summarize trends in news feeds
  – Categories and entities assigned to all news articles
  – http://dream.sims.berkeley.edu/newshound/

HomeSkim Craigslist Analysis
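
Appendix sketch: classification and extraction for spoken commands

The "NLP with me all the time" slides describe two tasks for each recorded command: classify the action (e.g., phone, facebook) and extract the argument (e.g., Mike). One way to start is a small rule-based baseline, sketched below in Python. This is only an illustration under stated assumptions: the trigger phrases, the `reminder` and `twitter` class names, and the `parse_command` function are hypothetical, and a real project would more likely train a classifier on the recorded commands instead of hand-writing patterns.

```python
import re

# Hypothetical action classes and trigger patterns. Only "phone" and
# "facebook" appear as classes in the slides; the rest are assumptions.
PATTERNS = [
    ("phone", re.compile(r"^call\s+(?P<arg>.+)$", re.IGNORECASE)),
    ("facebook", re.compile(r"^set\s+facebook\s+status\s+(?P<arg>.+)$", re.IGNORECASE)),
    ("twitter", re.compile(r"^twitter\s+(?P<arg>.+)$", re.IGNORECASE)),
    ("reminder", re.compile(r"^remind\s+me\s+to\s+(?P<arg>.+)$", re.IGNORECASE)),
]


def parse_command(utterance):
    """Return (action, argument) for a transcribed command.

    Commands that match no pattern are returned as ("unknown", text).
    """
    text = utterance.strip()
    for action, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return action, match.group("arg").strip()
    return "unknown", text


if __name__ == "__main__":
    commands = [
        "Call Mike",
        "Set Facebook status class today sucked glad is over",
        "Where can I buy a bluetooth device nearby?",
    ]
    for command in commands:
        print(parse_command(command))
```

Commands that fall through to "unknown" (such as the Bluetooth question) show where hand-written rules run out, which is a reasonable place to begin the analysis of how many action classes the data really needs.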