Introduction

Multilingual and Crosslingual
Information System
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering
National Cheng Kung University
2014/02/17
Contact Information
• Room: 4261, Monday 09:10 - 12:00
• Instructor: Prof. Wen-Hsiang Lu (盧文祥)
– Office: 4216
– Office hours: Monday 12:10 - 2:10 PM
– Phone: 62545
– Web page: http://myweb.ncku.edu.tw/~whlu/mis.htm
– Email: whlu@mail.ncku.edu.tw
– Teaching assistant: 王廷軒
• Email: playif@gmail.com
Course Grading
• Class participation/presentation: 30%
• Tests: 25%
• Project: 25%
• Homework: 20%
Source Textbooks
• Christopher D. Manning and Hinrich Schütze, Foundations of Statistical
Natural Language Processing, The MIT Press, 1999. (全華科技圖書 :
02-23717725)
• Daniel Jurafsky and James H. Martin, Speech and Language Processing:
An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition, Prentice Hall, 2000.
• James Allen, Natural Language Understanding, Benjamin/Cummings
Publishing Co, 1995.
• Gregory Grefenstette, Cross-Language Information Retrieval, Kluwer,
1998.
• Jean Véronis, Parallel Text Processing: Alignment and Use of
Translation Corpora, Kluwer, 2000.
Other Useful Sources (1)
• Reference Books
– Charniak, E. Statistical Language Learning.
– Cover, T. M., Thomas, J. A. Elements of Information Theory.
– Jelinek, F. Statistical Methods for Speech Recognition.
• Major Conferences:
– ACL (Association for Computational Linguistics)
– COLING (International Conference on Computational Linguistics)
– HLT (Human Language Technology Conference)
– IJCNLP (International Joint Conference on Natural Language Processing)
• Journals
– Computational Linguistics
– Natural Language Engineering
– TALIP (ACM Transactions on Asian Language Information Processing)
– TSLP (ACM Transactions on Speech and Language Processing)
Other Useful Sources (2)
• Resource URL
– http://www.aclclp.org.tw/res_other_c.php (ACLCLP, the Association for Computational Linguistics and Chinese Language Processing, 中華民國計算語言學學會)
– http://nlp.stanford.edu/software/index.shtml (Stanford NLP Group)
– http://www.phontron.com/nlptools.php (Graham Neubig)
• Tools/Software
– Online dictionaries
• WordNet: http://wordnet.princeton.edu/
• HowNet: http://www.keenage.com/html/c_index.html
• The Academia Sinica Bilingual Ontological Wordnet (BOW): http://bow.sinica.edu.tw/
CKIP (Chinese Knowledge and Information Processing group, Academia Sinica; 中研院詞庫小組)
• Parser: http://140.109.19.112/main.exe?id=6833
• POS (part of speech) tagger: http://ckipsvr.iis.sinica.edu.tw/
Eric Brill's POS Tagger
• Website: http://cst.dk/online/pos_tagger/uk/
This/DT is/VBZ a/DT book/NN ./.
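The tagger output above uses a word/TAG format. A minimal sketch of reading that format back into (word, tag) pairs follows; the function name `parse_tagged` is hypothetical and is not part of Brill's tagger.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so './.' becomes ('.', '.')
        # and a word that itself contains '/' keeps its slash.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("This/DT is/VBZ a/DT book/NN ./.")
print(pairs)
```

Splitting on the last slash rather than the first is a design choice that keeps the punctuation token `./.` intact.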
Stanford Parser
• Website
– http://nlp.stanford.edu/software/lex-parser.shtml
• Tools
– Online version
• Stanford Parser version 1.5.1
• English & Chinese
• http://josie.stanford.edu:8080/parser/
[Homework 1]
• Use the CKIP POS (part-of-speech) tagger, Eric
Brill's POS tagger, and the Stanford Parser to tag and
parse at least three sentences.
Course Topics
• Probability and Information Theory
– basics: definitions, formulas, examples.
• Language Modeling
– n-gram models, parameter estimation
– smoothing (EM algorithm)
• Some Linguistics
– phonology, morphology, syntax, semantics, discourse
• Words and the Lexicon
– word classes, mutual information, lexicography.
Course Topics (cont.)
• Hidden Markov Models
– background, algorithms, parameter estimation
• Tagging: methods, algorithms, evaluation
– tag sets, HMM tagging, transformation-based, feature-based
• Grammars and Parsing: data, algorithms
– statistical parsing: algorithms, parameterization, evaluation
Course Topics (cont.)
• Applications
– Machine Translation (MT)
– Automatic Speech Recognition (ASR)
– Information Retrieval (IR)
– Cross-Language Information Retrieval (CLIR)
– Question Answering (QA)
– Cross-Language Question Answering (CLQA)
– Summarization
– Information Extraction
– …
Course Introduction
• Lecture 1: Introduction
• Lecture 2: Mathematical Foundations
• Lecture 3: Linguistics Essentials
• Lecture 4: Corpus-based Work
• Lecture 5: Collocations
• Lecture 6: Statistical Inference: n-gram Models over Sparse Data
• Lecture 7: Word Sense Disambiguation
• Lecture 8: Statistical Alignment and Machine Translation
• Lecture 9: Markov Models
• Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval
• Lecture 11: Statistical/Probabilistic Models for Word Alignment & CLIR
• Lecture 12: Part-of-Speech Tagging
• Lecture 13: Probabilistic Context Free Grammars
• Lecture 14: Question Answering
The Ultimate Research Goal in
Natural Language Processing (NLP)
• To develop an automated language understanding
system
• Why is this important?
– Easy for everyone to use language
– Natural human interface for a variety of applications
(e.g., database access, on-line tutor, robot control, etc.)
– Language seems fundamental for developing an
intelligent system
• iPhone Siri
• IBM's DeepQA project
Natural Language is VERY Useful
OCR Problems
Aspects of Computational
Linguistics
• Description of the Language: universals, cross-linguistic
research
• Implementation of Computer Model: algorithms and
data structures, formal models to represent knowledge,
model of the reasoning process
• Psycho-Linguistic Aspect: humans are an existence proof
of the computability of language comprehension;
psychological research can be used to justify a computer
model; obtain human processing parameters
NLP Issues
• Why is NLP difficult?
– Many “words”, many “phenomena”, many “rules”
• OED (Oxford English Dictionary): 400k words;
Finnish lexicon (of forms): ~2 × 10^7
• sentences, clauses, phrases, constituents, coordination, negation,
imperatives/questions, inflections, parts of speech, pronunciation,
topic/focus, and much more!
– irregularity (exceptions, exceptions to the exceptions, ...)
• potato → potatoes (tomato, hero, ...); photo → photos, and even:
both mango → mangos or → mangoes
• Adjective / Noun order: new book, electrical engineering, general
regulations, flower garden, garden flower
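The plural examples above suggest one common way to handle irregularity: look the word up in an exception lexicon first, and fall back to a general rule only when the lexicon has no entry. The sketch below is a toy illustration under that assumption; the `IRREGULAR` table and `plural_forms` function are hypothetical, and the fallback rule is deliberately naive (it ignores endings such as -y or -ch).

```python
# Toy exception lexicon built only from the examples on the slide.
IRREGULAR = {
    "potato": {"potatoes"},
    "tomato": {"tomatoes"},
    "hero": {"heroes"},
    "mango": {"mangos", "mangoes"},  # both forms are accepted
}


def plural_forms(noun):
    """Return the set of acceptable plural forms for a noun."""
    if noun in IRREGULAR:
        # Known exception: do not guess, use the lexicon entry.
        return IRREGULAR[noun]
    # Default rule: append "s" (covers photo -> photos).
    return {noun + "s"}


for noun in ["potato", "photo", "mango"]:
    print(noun, "->", sorted(plural_forms(noun)))
```

Returning a set rather than a single string lets the lexicon express cases such as mango, where more than one plural is acceptable.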
Difficulties in NLP (cont.)
– Ambiguity
• books: NOUN or VERB?
– you need many books vs. she books her flights online
• Thank you for not smoking, drinking, eating or playing
radios without earphones. (MTA bus)
– Thank you for not eating without earphones??
– Thank you for drinking?? …
• Fred’s hat was blown off by the wind. He tried to catch it.
– ...catch the wind or ...catch the hat ?
Rules or Statistics?
• Preferences:
– context clues: she books → books is a verb
– rule: if an ambiguous word (verb/nonverb) is preceded by a
matching personal pronoun → word is a verb
– pronoun reference:
– she/he/it often refers to the most recent noun or pronoun (but
there are certainly exceptions)
– selectional restrictions:
– catching hat is better than catching wind (but not always)
– semantics:
– We thank people for doing helpful things or not doing annoying
things
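The pronoun rule listed above can be written down directly. The following is a minimal sketch, not a real tagger: the word lists are hypothetical, "matching" is interpreted here as third-person-singular agreement (she/he/it with an -s form), and every ambiguous word that the rule does not fire on simply defaults to NOUN.

```python
# Hypothetical word lists, for illustration only.
THIRD_SINGULAR_PRONOUNS = {"he", "she", "it"}
AMBIGUOUS_S_FORMS = {"books"}  # words that can be a plural noun or a 3sg verb


def tag_ambiguous(tokens):
    """Tag each ambiguous token as VERB or NOUN using the pronoun rule."""
    result = []
    for i, token in enumerate(tokens):
        if token.lower() not in AMBIGUOUS_S_FORMS:
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        # Rule: preceded by a matching personal pronoun -> verb.
        tag = "VERB" if previous in THIRD_SINGULAR_PRONOUNS else "NOUN"
        result.append((token, tag))
    return result


print(tag_ambiguous("you need many books".split()))
print(tag_ambiguous("she books her flights online".split()))
```

A rule like this is cheap and transparent, but it only looks one word back, which is one reason the slides go on to ask when statistics should take over.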
Solutions
• Don't guess if you know:
– morphology (inflections)
– lexicons (word information)
– unambiguous names
– perhaps some (really) fixed phrases
– syntactic rules?
• Use statistics (based on real-world data) for
preferences (only?)
• No doubt: but this is an important question!
Types of Linguistic Knowledge
• Acoustic/Phonetic Knowledge: How words are
related to their sounds. (transliteration)
– Ericsson <=> 易利信 (pronounced "Yi-li-xin")
• Morphological Knowledge: How words are
constructed out of basic meaning units.
un + friend + ly → unfriendly
love + past tense → loved
object + oriented → object-oriented
More Types of Linguistic Knowledge
• Lexical Knowledge (or Dictionary): This should
include information on parts of speech, features
(e.g., number, case), typical usage, and word
meaning.
• Syntactic Knowledge: How words are put
together to make legal sentences (or constituents
of sentences).
More Types of Linguistic Knowledge
• Semantic Knowledge: Word meanings, how
words combine into sentence meaning.
– e.g., Fred tossed the ball.
[Figure: semantic roles in the example sentence]
More Types of Linguistic Knowledge
• Pragmatic Knowledge: How context affects the
interpretation of a sentence. Examples:
– Louise loves him.
[Context 1:] Who loves Fred?
[Context 2:] Louise has a cat.
– What time is it?
[Context 1:] Fred is fidgeting (坐立不安)
and staring at his watch.
[Context 2:] Louise has no watch.
More Types of Linguistic Knowledge
• World Knowledge: How other people's minds
work, what a listener knows or believes, the
etiquette (成規) of language. Examples:
– Will you pass the salt?
– I read an article about the war in the paper.
– Fred saw the bird with his binoculars.
– Tim was invited to Tom's birthday party. He went to
the store to buy him a present.
Multilingualism Issues in the Web Age
• Language barrier
– There are about 6,700 languages listed in the Ethnologue
(http://www.ethnologue.com/)
• Information overload
– Scaling up of language resources
• Webpages
• News
• Weblogs
• Microblogs
Multilingual Understanding??
Real World Situation
• Use a statistical model based on REAL WORLD DATA and care
about the best sentence only
• Imagine:
– Each sentence W = { w_1, w_2, ..., w_n } gets a probability P(W|X) in a
context X
– For every possible context X, sort all the imaginable sentences W
according to P(W|X)
– Ideal situation:
[Figure: sentences ranked by P(W), from W_best (the best sentence, most probable in context X) down to W_worst]
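The idea of sorting candidate sentences by probability can be sketched with a toy model. The code below is an illustration under stated assumptions, not the course's method: it ignores the context X and scores P(W) only, it uses a bigram model with add-one smoothing as a stand-in for whatever model is actually used, and the three-sentence corpus and all function names are made up for the example.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count bigrams and their history words over a list of sentences."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words[1:])
        histories.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return histories, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log P(W) under the bigram model with add-one smoothing."""
    histories, bigrams, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + vocab_size))
    return total


# Hypothetical toy corpus standing in for real-world data.
corpus = [
    "she books her flights online",
    "you need many books",
    "she reads many books",
]
model = train_bigram(corpus)

candidates = ["she books her flights online", "online flights her books she"]
ranked = sorted(candidates, key=lambda s: log_prob(s, model), reverse=True)
print("W_best: ", ranked[0])
print("W_worst:", ranked[-1])
```

Both candidates contain the same words, so the ranking here is driven entirely by word order, which is the kind of preference a statistical model can learn from data without categorical grammar rules.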
Real World Situation
• It is not possible to specify the set of grammatical sentences using fixed
“categorical” rules
• (disregarding the “grammaticality” issue)
[Figure: sentences ranked by P(W), from W_best (the best sentence, most probable in context X) down to W_worst]