I256: Applied Natural Language Processing

Fall 2009
Lecture 12
Projects
Barbara Rosario
Today
• Special guest: Rob Ennals, Intel Labs Berkeley
• More project ideas
• Next class
– Finish up classification
– Information extraction
Announcements
• Tuesday, October 20: Assignment 4 due
– 5% bonus if submitted at least 24 hours in advance
– We’ll accept late submissions if:
1) You haven’t submitted a previous homework late, and
2) You let me know in advance (by the day before)
• Thursday, October 15: Project proposal due
http://courses.ischool.berkeley.edu/i256/f09/assignments/project_proposal.html
– 1 page
– General idea/topic
– (If you know already) What kind of data/resources would you like to use?
– (If you know already) What methods do you think you'll use?
Projects important dates
• Thursday, October 15: Proposal due
• Thursday, October 22: Receive feedback on proposal
• Thursday, October 29: Turn in revised proposal (if required)
• Thursday, November 12: Checkpoint (more information later)
• December 1 and 3: Class presentations
• Thursday, December 10 (subject to change): Final project writeup due
Rob Ennals
Project ideas
• Whatever you like and are interested in!
• Ideally, it should have at least one of the following elements:
• Interesting, novel application and/or data
– e.g., topic classification on Reuters data wouldn’t count
– Twitter?
• New algorithm
– Then you can use Reuters data
• Linguistic analysis
– To inform the NLP! (i.e., the analysis should be useful to an NLP task/algorithm)
• Implementation for novel use (iPhone?)
Scaling Up to Large Datasets
System calls to external software
• Python cannot perform the numerically intensive calculations required by machine learning methods nearly as quickly as lower-level languages such as C.
• On large datasets, the learning algorithm may take an unreasonable amount of time and memory to complete if you use the pure-Python machine learning implementations.
• NLTK provides facilities for interfacing with external machine learning packages.
• Once these packages have been installed, NLTK can transparently invoke them (via system calls) to train classifier models significantly faster than the pure-Python classifier implementations.
• See the NLTK webpage for a list of recommended machine learning packages that are supported by NLTK.
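A minimal sketch of the system-call pattern described above: write the training data to a file, hand it to a separate program, and read the result back. The "external learner" here is only a stand-in (a tiny script run through `sys.executable`) so the example runs anywhere; a real setup would call a compiled package instead. The file format and the names `LEARNER_SOURCE` and `train_externally` are assumptions for illustration, not NLTK's actual interface.

```python
import json
import os
import subprocess
import sys
import tempfile

# Stand-in for an external learner: counts how often each word occurs with
# each label and prints the counts as JSON. Hypothetical, for illustration.
LEARNER_SOURCE = """
import collections, json, sys
counts = collections.defaultdict(collections.Counter)
with open(sys.argv[1]) as f:
    for line in f:
        label, _, text = line.rstrip("\\n").partition("\\t")
        counts[label].update(text.split())
print(json.dumps({label: dict(c) for label, c in counts.items()}))
"""


def train_externally(examples):
    """Write (text, label) pairs to disk, run the external program, parse its output."""
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "train.tsv")
        script_path = os.path.join(tmp, "learner.py")
        with open(data_path, "w") as f:
            for text, label in examples:
                f.write(f"{label}\t{text}\n")
        with open(script_path, "w") as f:
            f.write(LEARNER_SOURCE)
        # The system call: the heavy lifting happens in another process.
        result = subprocess.run(
            [sys.executable, script_path, data_path],
            capture_output=True, text=True, check=True,
        )
    return json.loads(result.stdout)


model = train_externally(
    [("call mike", "phone"), ("call sally", "phone"), ("buy milk", "reminder")]
)
print(model["phone"]["call"])  # 2
```

The calling code only sees a Python function; where the computation runs is hidden behind it, which is the sense in which such calls can be "transparent".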
Software
• If you need some fancy (i.e., expensive) software, let me know ASAP
– I may be able to buy it and let you use it for the projects
• An annotated list of resources:
http://nlp.stanford.edu/links/statnlp.html
Final Project Ideas
• NLP with me all the time: interfaces 90% useful 90% of the time
• What are the NLP problems for a speech interface that is always with me?
• Take an audio recorder with you for a whole day. Record all the speech commands you would give to your perfect interface
– Call Mike
– Write this message to Sally hi Sally movie tonight?
– Remind me to buy milk when I go to the store
– Put dentist on Tue on the calendar
– Where can I buy a Bluetooth device nearby?
– Set Facebook status class today sucked glad is over
– Twitter class today sucked glad is over
NLP with me all the time
• Analysis
– Analyze the commands
– How many types of actions/classes?
– What NLP apps (translation? extraction? etc.)
– Call [Mike]: action/class = phone, argument = Mike
• NLP tasks: classification and extraction
– Set Facebook status [class today sucked glad is over]: action/class = facebook, argument = [class today sucked glad is over]
• NLP tasks: classification and extraction
• Build an NLP algorithm for this data
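As a starting point, the classification-plus-extraction analysis above can be sketched with a few hand-written patterns. This is only a baseline under stated assumptions: the `phone` and `facebook` class names come from the slide, while the `twitter` and `reminder` classes, the trigger phrases, and the function name `parse_command` are hypothetical. A project would replace the patterns with a classifier trained on the recorded commands.

```python
import re

# Hypothetical trigger patterns: each maps a command prefix to an action class.
PATTERNS = [
    (re.compile(r"^call\s+(?P<arg>.+)$", re.I), "phone"),
    (re.compile(r"^set facebook status\s+(?P<arg>.+)$", re.I), "facebook"),
    (re.compile(r"^twitter\s+(?P<arg>.+)$", re.I), "twitter"),
    (re.compile(r"^remind me to\s+(?P<arg>.+)$", re.I), "reminder"),
]


def parse_command(utterance):
    """Return (action_class, argument), or ("unknown", utterance) if nothing matches."""
    text = utterance.strip()
    for pattern, action in PATTERNS:
        match = pattern.match(text)
        if match:
            # Classification = which pattern fired; extraction = the captured argument.
            return action, match.group("arg")
    return "unknown", text


print(parse_command("Call Mike"))
# ('phone', 'Mike')
print(parse_command("Set Facebook status class today sucked glad is over"))
# ('facebook', 'class today sucked glad is over')
```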
NLP with me all the time
• Additional: note the context of what you were
doing while you said the commands (we are
interested in how the context can inform the
NLP)
– For example: send this picture to Annette
– Context: Annette is in front of me
Final Project Ideas
• NLP summarization for audio interfaces
– Summarize email, blogs, news articles
– Different lengths or incremental (tell me more, or tell me less: get to the point!)
– (Are audio summaries different from written ones?)
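One way to picture the "different lengths or incremental" idea is an extractive summarizer whose length is a parameter. The sketch below is an assumption-laden baseline, not a recommended method: it scores each sentence by the average frequency of its words in the text, with no stopword handling, and the names `rank_sentences` and `summarize` are made up for illustration.

```python
import re
from collections import Counter


def rank_sentences(text):
    """Split text into sentences and rank their indices by average word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return sentences, ranked


def summarize(text, n):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences, ranked = rank_sentences(text)
    keep = sorted(ranked[:n])
    return " ".join(sentences[i] for i in keep)


text = "The meeting is on Tuesday. The meeting covers the budget. Lunch follows."
print(summarize(text, 1))  # The meeting covers the budget.
print(summarize(text, 2))  # The meeting is on Tuesday. The meeting covers the budget.
```

"Tell me more" then corresponds to calling `summarize` again with `n + 1`, and "tell me less" with `n - 1`.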
Final Project Ideas
• Intel® Reader
• To assist people with various disabilities
(blindness, dyslexia)
• The Intel Reader performs text-to-speech
(TTS) on captured images (with OCR) and
downloaded text files
Intel® Reader
• Text to speech: Improved Speech Output
– Contextual Pronunciation
• TTS engines are still relatively poor at context-based pronunciation variations
– Examples: “LIVE”, “LEAD”
• “I live in California” vs. “I watched the live performance of the concert”
• “That battery is made from lead” vs. “I will lead the troops into battle”
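The four example sentences can be separated with a very crude context rule: look at the word just before the ambiguous one. The sketch below assumes a hand-picked list of cue words and uses informal respellings ("liv", "lyve", "leed", "led") as pronunciation labels; all of these names are hypothetical, and a real system would more likely use a part-of-speech tagger or a trained classifier.

```python
import re

# Informal respellings, chosen for illustration only.
HOMOGRAPHS = {
    "live": {"verb": "liv", "other": "lyve"},
    "lead": {"verb": "leed", "other": "led"},
}
# Hypothetical cue words: if one of these precedes the homograph, guess "verb".
VERB_CUES = {"i", "you", "we", "they", "will", "to", "can", "would"}


def pronounce_homographs(sentence):
    """Return (word, pronunciation) for each ambiguous word found in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    results = []
    for i, word in enumerate(words):
        if word in HOMOGRAPHS:
            prev = words[i - 1] if i > 0 else ""
            sense = "verb" if prev in VERB_CUES else "other"
            results.append((word, HOMOGRAPHS[word][sense]))
    return results


print(pronounce_homographs("I live in California"))             # [('live', 'liv')]
print(pronounce_homographs("I watched the live performance"))   # [('live', 'lyve')]
print(pronounce_homographs("That battery is made from lead"))   # [('lead', 'led')]
print(pronounce_homographs("I will lead the troops into battle"))  # [('lead', 'leed')]
```

A one-word window is easy to break (for example "They live broadcast it"), which is part of what makes this a project-sized problem.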
Final Project Ideas
• Two NLP problems for Intel® Reader
• Contextual Pronunciation
– Identify words that have ambiguous pronunciation
– Choose the right pronunciation
• OCR errors
– Identify words that are mistakes (e.g., o/c confusion: “miso” for “misc”)
– Choose the right words
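For the OCR half, a simple baseline is to treat any out-of-vocabulary token as a suspected error and replace it with the closest vocabulary entry. The sketch below uses `difflib.get_close_matches` from the Python standard library; the vocabulary, the cutoff of 0.7, and the function name `correct_ocr` are assumptions for illustration. Note that it cannot catch errors that produce a real word the vocabulary already contains; those need context.

```python
import difflib

# Hypothetical vocabulary; a real one would come from a dictionary or corpus.
VOCABULARY = ["misc", "battery", "concert", "performance", "troops"]


def correct_ocr(tokens, vocabulary=VOCABULARY, cutoff=0.7):
    """Replace out-of-vocabulary tokens with their closest vocabulary match, if any."""
    corrected = []
    for token in tokens:
        if token.lower() in vocabulary:
            corrected.append(token)  # known word: keep as is
            continue
        candidates = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(candidates[0] if candidates else token)
    return corrected


print(correct_ocr(["miso", "batlery", "xyz"]))  # ['misc', 'battery', 'xyz']
```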
Final Project Ideas
• Blog analysis
– Categorize blog topics (maybe including link analysis)
– Segment blogs into pieces based on topics
– Do blog author analysis
– Summarize blog reaction to some event, e.g., what did people think of “An Inconvenient Truth”?
• There is a contest on this:
– http://www.icwsm.org/
Final Project Ideas
• Create a Negativity/Emotion/Flame Recognizer
– There is some related work, but this is somewhat under-explored
– Emotions in email, blogs, Facebook statuses…
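A word-list baseline shows both how such a recognizer might start and why the problem is under-explored. The word lists and the names `emotion_score` and `label` below are hypothetical; the sketch simply counts positive words minus negative words.

```python
import re

# Tiny hypothetical lexicons; real ones would be much larger or learned from data.
NEGATIVE = {"sucked", "hate", "awful", "terrible", "stupid"}
POSITIVE = {"glad", "love", "great", "happy", "thanks"}


def emotion_score(text):
    """Positive-word count minus negative-word count."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text):
    score = emotion_score(text)
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"


print(label("I hate this awful class"))          # negative
print(label("thanks, great lecture"))            # positive
print(label("class today sucked glad is over"))  # neutral
```

The last example comes out neutral because "sucked" and "glad" cancel, even though a human reads the status as negative; handling cases like this is where a learned model and context come in.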
Previous Final Project
• HomeSkim (2005)
– Chan, Lib, Mittal, Poon
– Apartment search mashup
– Extracted fields from Craigslist listings
– http://www.ischool.berkeley.edu/programs/masters/projects/2006/homeskim
• Orpheus (2004)
– Maury, Viswanathan, Yang
– Tool for discovering new and independent recording artists
– Extracted artists, links, reviews from music websites
– http://groups.sims.berkeley.edu/orpheus/demo/orpheus_demo.swf
• Breaking Story (2002)
– Reffell, Fitzpatrick, Aydelott
– Summarize trends in news feeds
– Categories and entities assigned to all news articles
– http://dream.sims.berkeley.edu/newshound/
HomeSkim Craigslist Analysis