Keynote 2: Ming Zhou - Reflections on Natural Language Processing Research in the Internet Era

NLP Research in the Internet Age
An Overview of NLP at Microsoft Research Asia
Ming Zhou
Manager of Natural Language Group
Microsoft Research Asia
Trends of Internet Services
• Ecosystems that work with third-party apps
– Apple Apps, Facebook, Twitter, Baidu, Sina, QQ
• Real time content collection and search
– Twitter, Facebook, Del.icio.us, NYT, YouTube
• Mobile search
– Contextual intent understanding
– Towards decision making and action taking
• Social power
– Social tags (like) for general search engines
– Search engines in SNS
– Social QA
Impact and Challenge to NLP Research
• Impact
– Biggest database ever – connects data
– Biggest social network – connects people
– Harnessing collective intelligence
– Contextual information processing: user, user’s social network, location, time
– Real-time information processing: collection, indexing, operation without delay
• Challenge
– How to leverage data, people, and contextual information to reach real-time information processing?
Problems of Traditional NLP Approaches (NLP 1.0)
• Deep in individual component technologies, but these have reached their upper bounds
• Little consideration of scenarios, users’ needs, and market needs
• Serious data sparseness with human annotation
• Evaluation bottleneck
• Slow deployment
• Lack of an effective framework for incorporating users’ feedback
New Strategy of NLP (NLP 2.0)
• Data collection from the web
• Domain-specific and open IE
• Contextual NLP
• Maximize at the system level, not at the individual component
• Earlier deployment on the Internet
• Make the best use of social factors
Our Vision and Task
Understand users and documents in any language, on any device, and for any application
• Advanced NLP technologies
– Word breaker, POS tagging, chunking, syntactic parser, semantic role
labeling, speller, query suggestion, summarization
– Chinese, Japanese, English
• Multi-language information access
– Statistical machine translation
– Multi-language search
• Semantic computing
– Sentiment analysis, event extraction, ontology learning
– Understanding query intent and document
– Contextual NLP
MSRA NLP Research Overview
[Figure: layered overview of research for Chinese, Japanese, and English (C, J, E)]
• Applications: Chinese IME, English writing wizard, news search, comparison shopping, Japanese IME, pocket translator, Twitter search, chatbot, query speller, couplet generation, resume routing, general web search, text analysis, machine translation
• Component techs: skeleton parser, translation evaluation, named entity identification, translation knowledge acquisition, POS tagging, information extraction, information retrieval, metadata extraction, paraphrasing, term extraction, web mining for MT, annotation tool, SMT, machine learning, vertical search, cross-language IR, NLP-enriched indexing, SLM
• Research areas: NLP (C, J, E); MT (C, J, E); IR and IE (C, J, E); text mining and search
• Data: MRD, parsing lexicon, bilingual corpus, balanced corpus, tagged corpus, query-doc relevance, translation lexicon, bilingual tagged corpus
Research Accomplishment
• Awards
– MSRA Best Research Team (2010)
– Finalist of WSJ Asian Innovation Awards (2010)
– MS ARD Best Project (Engkoo)
– MSRA Best Innovation (1998-2008): IME and Chinese couplets
• Academic impact
– Best result in NIST 2008 SMT, CWMT 2008 and CWMT 2009
– Best result in SIGHAN 2006 bakeoff on Chinese word segmentation
– Best result in cross-language information retrieval in TREC-9, NTCIR-III
– 40 ACL papers, 9 SIGIR papers, 17 Coling papers (2000-2010)
– PC Chair, area chair of ACL
• Collaboration with universities
– HIT Joint lab on NLP, Speech and Search; Tsinghua Joint lab on Media and Network
– 400 interns in 12 years
– Summer schools since 2001
– PhD supervisors at universities
Summer School on Information Extraction (Harbin, June 2005)
• Cheng Niu: Information extraction
• Frank Seide: Speech information extraction and search
• Hwee Tou Ng: Advanced topics of information extraction
• Chin-Yew Lin: Information extraction for automatic summarization
Projects based on NLP 2.0
• Engkoo: Web-based English learning service
– Data mining from the web
• Chinese couplets
– Bring users’ contributions into the system’s evolution
• Semantic analysis and search of microblogging
– Move to SNS, mobile
Engkoo
Parallel data mining from the web
Video:
http://video.sina.com.cn/v/b/37417609-1286528122.html
Rapidly Changing Language
• Approximately 1.5 billion people speak English as a
primary, secondary or business language
• China: The largest “English speaking” country with
250 million English learners and USD 60 billion annual
expenses
• Problem: English is a living language that keeps gaining new words and new meanings
Key Insight:
With billions of translated web pages and sharable repositories
of language data growing every day, the Internet holds the
sum of human language knowledge
www.engkoo.com
Major Features:
• Endless Lexicon with Native Definitions
• Human-Like TTS & Phonetic Search
• State-of-the-Art Machine Translation (NIST OpenMT Winner)
• Real-time Interactive Alignment
Microsoft Products:
• Bing
• Office
• MSN
Massive Dictionary Mined from the Web
Fresh and Diverse Examples
Advanced Search with Sentence Analysis
Sentence Classification
Learn Contextual Usage with Word Alignment
Hints for Easily Confused Words
1. Word’s idiomatic usage
• Verb~Noun (decline~offer)
• Verb~Adv (greatly~improve)
• Adj~Noun (arduous~task)
• Adv~Adj (extremely~bad)
2. Paraphrasing
• turn_on~light, switch_on~light
• laborious~task, hard~task
• deeply~moved, deeply~touched
3. Collocation translations
• 订~计划, make~plan
• 订~旅馆, book~room
• 订~杂志, subscribe to~magazine
Knowledge Mining Pipeline
[Figure: pipeline diagram]
• Stages: web mining, linguistic parsing, knowledge mining, multilevel indexing
• Data: mined data, parsed data, linguistic knowledge, indexed data
• Models: machine translation model, paraphrasing model
Parallel sentence:
He could hardly afford to waste that golden time.
他无法浪费那样的好时光。
Tokenizing:
he could hardly afford to waste that golden time.
他 无法 浪费 那样 的 好 时光。
Skeleton parsing:
(Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time)
(Tsub~他~浪费) (ModAdv~无法~浪费) (Tobj~浪费~时光) (AdjAttrib~好~时光)
Alignment:
he(他) could hardly afford to(无法) waste(浪费) that(那样的) golden(好) time(时光)
Index keys at three levels:
1. Single word
“he”, “could”, “hardly”, “afford” etc.
“他”, “无法”, “浪费” etc.
2. Single word with its POS
“he_Pron”, “could_Verb”, “hardly_Adv” etc.
“他_Pron”, “无法_Adv”, “浪费_Verb” etc.
3. Collocation
“Tsub~he~afford”, “Tobj~time~waste” etc.
“Tsub~他~浪费”, “ModAdv~无法~浪费” etc.
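The three index levels named on this slide (single word, single word with its POS, collocation) can be illustrated with a small sketch. It assumes the parser output is already available as (word, POS) pairs and (relation, word, word) triples copied from the slide's example. The function names `index_keys` and `build_index` and the inverted-index layout are hypothetical, not a description of how Engkoo is actually built.

```python
from collections import defaultdict

def index_keys(tagged_words, triples):
    """Return index keys at three levels for one parsed sentence."""
    keys = []
    for word, pos in tagged_words:
        keys.append(word)                 # level 1: single word
        keys.append(f"{word}_{pos}")      # level 2: single word with its POS
    for rel, w1, w2 in triples:
        keys.append(f"{rel}~{w1}~{w2}")   # level 3: collocation
    return keys

def build_index(sentences):
    """Map each key to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sid, (tagged_words, triples) in enumerate(sentences):
        for key in index_keys(tagged_words, triples):
            index[key].add(sid)
    return index

# Example data copied from the slide (only the items the slide lists).
english = (
    [("he", "Pron"), ("could", "Verb"), ("hardly", "Adv")],
    [("Tsub", "he", "afford"), ("Tobj", "time", "waste")],
)
chinese = (
    [("他", "Pron"), ("无法", "Adv"), ("浪费", "Verb")],
    [("Tsub", "他", "浪费"), ("ModAdv", "无法", "浪费")],
)
index = build_index([english, chinese])
print(sorted(index["Tobj~time~waste"]))
```

Keeping all three levels in one index lets a single lookup serve plain word queries, POS-restricted queries, and collocation queries.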
Chinese Couplets
Bring users’ contributions into the system’s evolution
Chinese Couplets (http://duilian.msra.cn)
http://video.sina.com.cn/v/b/10937201-1452530713.html
FS (first sentence) and SS (second sentence) Share the Same Style
Repetition of pronunciations (音韵联)
FS: 风吹荞动桥未动
SS: 水使舟流洲不流
风 (wind) ---- 水 (water)
吹 (blow) ---- 使 (make)
荞 (buckwheat) ---- 舟 (ship)
动 (wave) ---- 流 (go)
桥 (bridge) ---- 洲 (island)
未 (not) ---- 不 (not)
动 (wave) ---- 流 (go)
FS and SS Share the Same Style
Decomposition of characters (拆字联)
FS: 有子有女方称好
SS: 缺鱼缺羊敢叫鲜
有 (have) ---- 缺 (lack)
子 (son) ---- 鱼 (fish)
有 (have) ---- 缺 (lack)
女 (daughter) ---- 羊 (mutton)
方 (so) ---- 敢 (dare)
称 (call) ---- 叫 (call)
好 (good) ---- 鲜 (fresh)
Decomposition: 好 = 女 + 子; 鲜 = 鱼 + 羊
FS and SS Share the Same Style
Person name (人名联) and palindrome (回文联)
FS: 板桥造桥板
SS: 东坡居坡东
板桥 (Banqiao) ---- 东坡 (Dongpo)
造 (produce) ---- 居 (live)
桥 (bridge) ---- 坡 (mountain)
板 (board) ---- 东 (east)
• Banqiao (板桥) and Dongpo (东坡) are famous literary figures
• Each sentence reads the same from top to bottom as from bottom to top
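The palindrome property of this couplet is easy to check mechanically. The sketch below is only an illustration: the helper name `is_palindrome_line` is hypothetical, and the two sentences are assembled from the character pairs on the slide.

```python
def is_palindrome_line(line: str) -> bool:
    """True if the line reads the same in both directions, character by character."""
    return line == line[::-1]

fs = "板桥造桥板"  # first sentence, read down the left column of the slide
ss = "东坡居坡东"  # second sentence, read down the right column

print(is_palindrome_line(fs), is_palindrome_line(ss))
```

A check like this could serve as one style constraint when the first sentence is itself a palindrome, so that the generated second sentence matches it.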
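SS Generation Process
FS: 海阔凭鱼跃 (sea, wide, allow, fish, jump)
Candidate translations for each FS character or phrase:
• 海 (sea): 山 (hill), 天 (sky)
• 阔 (wide): 高 (high), 深 (deep)
• 海阔: 天高 (sky high), 山高 (hill high)
• 凭 (allow): 任 (permit), 倚 (depend)
• 鱼 (fish): 虫 (insect), 鸟 (bird), 虎 (tiger)
• 跃 (jump): 飞 (fly), 舞 (dance), 鸣 (chirp)
• 鱼跃: 鸟飞 (bird fly), 虎啸 (tiger roar)
Candidate lists after each step:
1. SMT decoding: 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山高靠虎啸, 山高任虎啸, 山深任鸟飞, 天高任花香, ……
2. Linguistic filtering: 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山深任鸟飞, 天高任花香, 天高任鸟舞, 山高任花香, ……
3. Reranking: 天高任鸟飞, 山高任鸟飞, 天高任鸟鸣, 天高任鸟舞, 山深任鸟飞, 山高任花香, 天高任花香, ……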
SS Generation Approach
• A multi-phase SMT approach
– Phase 1: a phrase-based log-linear model
– Phase 2: some linguistic filters
– Phase 3: a Ranking SVM
[Figure: FS input, phrase-based log-linear model, N-best candidates, linguistic filters, Ranking SVM model, SS output]
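A minimal sketch of the three phases may make the flow concrete. Everything numeric here is an assumption: the phrase table probabilities, the set of fluent bigrams, and the feature weights are invented for illustration. The filters (equal length, matching repetition pattern, no characters shared with the FS) are assumed examples of linguistic filters, and a fixed linear scorer stands in for a trained Ranking SVM. Only the characters come from the slides.

```python
import itertools
import math

# Hypothetical phrase table: FS character -> {SS character: P(ss | fs)}.
# Characters come from the slide's example; the probabilities are invented.
PHRASE_TABLE = {
    "海": {"天": 0.5, "山": 0.4},
    "阔": {"高": 0.7, "深": 0.2},
    "凭": {"任": 0.6, "倚": 0.2},
    "鱼": {"鸟": 0.5, "虎": 0.3, "虫": 0.1},
    "跃": {"飞": 0.5, "鸣": 0.2, "舞": 0.2},
}
# Hypothetical stand-in for a language model: bigrams assumed to be fluent.
FLUENT_BIGRAMS = {"天高", "山高", "高任", "任鸟", "鸟飞", "鸟鸣"}

def lm_score(sentence):
    """Count fluent bigrams (a stand-in for a real language model score)."""
    return sum(sentence[i:i + 2] in FLUENT_BIGRAMS for i in range(len(sentence) - 1))

def decode(fs, n_best=10, w_tm=1.0, w_lm=0.5):
    """Phase 1: score every candidate with a log-linear model, keep the n best."""
    options = [PHRASE_TABLE[ch].items() for ch in fs]
    scored = []
    for combo in itertools.product(*options):
        ss = "".join(ch for ch, _ in combo)
        tm = sum(math.log(p) for _, p in combo)
        scored.append((w_tm * tm + w_lm * lm_score(ss), ss))
    scored.sort(reverse=True)
    return scored[:n_best]

def repetition_pattern(sentence):
    """Encode which positions repeat a character, e.g. 'ABA' -> (0, 1, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in sentence)

def passes_filters(fs, ss):
    """Phase 2 (assumed filters): equal length, same repetition pattern, no shared characters."""
    return (len(fs) == len(ss)
            and repetition_pattern(fs) == repetition_pattern(ss)
            and not set(fs) & set(ss))

def rerank(candidates, weights=(1.0, 2.0)):
    """Phase 3: a fixed linear scorer standing in for a trained Ranking SVM."""
    def score(item):
        model_score, ss = item
        features = (model_score, lm_score(ss))
        return sum(w * f for w, f in zip(weights, features))
    return [ss for _, ss in sorted(candidates, key=score, reverse=True)]

def generate_ss(fs):
    """Run the three phases in order and return ranked second sentences."""
    n_best = decode(fs)
    kept = [(s, ss) for s, ss in n_best if passes_filters(fs, ss)]
    return rerank(kept)

print(generate_ss("海阔凭鱼跃")[:3])
```

Separating the phases this way lets hard constraints remove invalid candidates cheaply before the final ranker compares the survivors.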
Great Examples
• FS:月落乌啼霜满天
• SS:风吹雁过雨连宵
• FS:千江有水千江月
• SS:万里无云万里星
• FS:秦淮河桨声灯影
• SS:松花江水色月光
• FS:此木为柴山山出 (此+木=柴;山+山=出)
• SS:白水作泉日日昌 (白+水=泉;日+日=昌)
User log for Model Enhancement
• Motivation
– Training data is not adequate
– The user log is large (60k/m), growing, and diverse
• What we log
– User inputs
– User finalized couplets
• Second sentences selected out of the candidates provided by our system
• User modified second sentences
User’s Log Analysis
Item                                              Count
Number of input sentences                         12,322
Number of unique input sentences                  6,698
Users directly select from system output          3,459
Users manually modify system output               606
Save as favorite couplets                         109
Invalid user input                                618
No second sentence generated                      2,211
Banner generation                                 2,687
Select the generated banner as favorite           428
No banner output                                  265
Data source: log from http://couplet.msra.cn
Time period: Aug. 31-Oct. 9, 2006
New Framework with Log Data
[Figure: framework diagram]
• Flow: first sentence input, source-channel model (translation model, language model), N-best candidates, re-ranking, second sentence output
• Training data: translation model, language model, mutual information
• Log data: translation model, language model, mutual information
• User operation feeds the log data
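One way mutual information could act as a re-ranking feature is sketched below. The slide does not say how mutual information is defined in the system, so this is an assumption: pointwise mutual information between FS and SS characters at the same position, estimated from finalized couplets. The function names are hypothetical, and couplets quoted elsewhere in the slides stand in for real training and log data.

```python
import math
from collections import Counter

def train_pmi(couplets):
    """Estimate position-aligned PMI between FS and SS characters."""
    pair_counts, fs_counts, ss_counts = Counter(), Counter(), Counter()
    for fs, ss in couplets:
        for x, y in zip(fs, ss):
            pair_counts[(x, y)] += 1
            fs_counts[x] += 1
            ss_counts[y] += 1
    total = sum(pair_counts.values())

    def pmi(x, y):
        joint = pair_counts[(x, y)]
        if joint == 0:
            return 0.0  # unseen pair: treated as neutral in this sketch
        # log( P(x, y) / (P(x) * P(y)) ) with all probabilities as count / total
        return math.log(joint * total / (fs_counts[x] * ss_counts[y]))
    return pmi

def mi_feature(pmi, fs, ss):
    """Sum PMI over aligned character positions."""
    return sum(pmi(x, y) for x, y in zip(fs, ss))

def rerank(pmi, fs, candidates):
    """Order candidates by the mutual-information feature alone."""
    return sorted(candidates, key=lambda ss: mi_feature(pmi, fs, ss), reverse=True)

# Couplets quoted in the slides, standing in for training data plus
# user-finalized couplets from the log.
couplets = [
    ("月落乌啼霜满天", "风吹雁过雨连宵"),
    ("千江有水千江月", "万里无云万里星"),
    ("海阔凭鱼跃", "天高任鸟飞"),
]
pmi = train_pmi(couplets)
print(rerank(pmi, "海阔凭鱼跃", ["山高任虎啸", "天高任鸟飞"]))
```

Because user-finalized couplets keep arriving, counts like these can be refreshed from the log without relabeling any data, which fits the motivation given for using the log.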
Twitter Search
Move to social internet and mobile
[Figure: Twitter search processing diagram]
• Raw data: noise filtering, text normalization, sentence boundary detection
• Individual tweet: NE recognition, dependency parsing, semantic role labeling, sentiment analysis, co-reference, classification
• A collection of tweets: tweet clustering, news and image link extraction, user influence measure, hot tag and topic extraction, popular tweet extraction, top video, music, and artist extraction
• Other components: semantic search, community extraction, multi-level indexing, statistical relationship learning
Conclusion
• Internet trends and impacts to NLP
• NLP 2.0 strategy
• Web data mining: Engkoo
• User’s power: Couplets
• SNS and mobile: Twitter search