INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Truc-Vien T. Nguyen
Lab: Named Entity Recognition
Download
• Slides
http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf
• Software
http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar
Natural Language Processing (NLP)
• Main purpose of NLP
– Build systems that can analyze, understand, and generate the languages humans use naturally
• Involved Tasks
– Automatic Summarization
– Information Extraction
– Speech Recognition
– Machine Translation
– …
Information Extraction (1)
[Figure: each news article (News 1, News 2, News 3) is mapped to a form (Form 1, Form 2, Form 3) with the slots WHO, WHAT, and WHEN]
Mapping of texts into a fixed structure representing the key information
Information Extraction (2)
Sam Brown retired as executive vice president of
the famous hot dog manufacturer, Hupplewhite Inc.
He will be succeeded by Harry Jones.
EVENT: leave job
Person: Sam Brown
Position: executive vice president
Company: Hupplewhite Inc.
EVENT: start job
Person: Harry Jones
Position: executive vice president
Company: Hupplewhite Inc.
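The two filled templates above can be pictured as records with a fixed set of slots. The following is only a minimal sketch: the class name `JobEvent` and the helper `people_by_event` are hypothetical names chosen for illustration, and the records are filled in by hand rather than extracted automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobEvent:
    """One filled template: a fixed set of slots extracted from free text."""
    event: str      # "leave job" or "start job"
    person: str
    position: str
    company: str


# The two templates from the example, written out by hand.
# An IE system would be expected to produce such records from the text.
events = [
    JobEvent("leave job", "Sam Brown", "executive vice president", "Hupplewhite Inc."),
    JobEvent("start job", "Harry Jones", "executive vice president", "Hupplewhite Inc."),
]


def people_by_event(records):
    """Index the extracted records by event type."""
    return {record.event: record.person for record in records}
```

Once the text is in this form, a question such as "who left a job?" becomes a simple lookup: `people_by_event(events)["leave job"]`.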
Entity and Relation
• Entity
– An object in the world
– Ex. President Bush was in Washington today
– Example types: Person, Organization, Location, GPE (geo-political entity)
• Relation
– A relationship between two entities
– Ex. LocatedIn(“Bush”, “Washington”)
– Example types: LocatedIn, Family, Employment
Named Entity Recognition
• Named Entity Recognition
– Subtask of information extraction
– Locate and classify elements in text into predefined
categories: names of persons, organizations,
locations, time expressions, etc.
• Example
– James Clarke (Person), director of ABC company (Organization)
CoNLL2003 shared task (1)
• English and German languages
• 4 types of NEs:
– LOC Location
– MISC Names of miscellaneous entities
– ORG Organization
– PER Person
• Training Set for developing the system
• Test Data for the final evaluation
CoNLL2003 shared task (2)
• Data
– Columns separated by a single space
– One word per line
– An empty line after each sentence
– Tags in IOB format
• An example
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
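The format described above is simple enough to read with a few lines of code. The sketch below is illustrative only: the function names `read_sentences` and `extract_entities` are hypothetical, and it assumes the tagging variant visible in the example, where an entity may start with an `I-` tag and `B-` is only needed to separate two adjacent entities of the same type.

```python
SAMPLE = """\
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
"""


def read_sentences(text):
    """Split the data into sentences; each token is a tuple of its columns."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # an empty line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split(" ")))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Group consecutive tagged words into (text, type) pairs.

    The word is taken from the first column, the NE tag from the last.
    """
    entities = []
    words, etype = [], None
    for row in sentence:
        word, tag = row[0], row[-1]
        prefix, _, label = tag.partition("-")
        starts_new = tag != "O" and (prefix == "B" or label != etype)
        if (tag == "O" or starts_new) and words:
            entities.append((" ".join(words), etype))
            words, etype = [], None
        if tag != "O":
            words.append(word)
            etype = label
    if words:                             # entity that runs to the sentence end
        entities.append((" ".join(words), etype))
    return entities
```

For example, `extract_entities(read_sentences(SAMPLE)[0])` groups "George" and "Weah" into a single PER entity because both carry the same tag type on consecutive lines.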
CoNLL2003 shared task (3)
English     precision   recall    F
[FIJZ03]    88.99%      88.54%    88.76%
[CN03]      88.12%      88.51%    88.31%
[KSNM03]    85.93%      86.21%    86.07%
[ZJ03]      86.13%      84.88%    85.50%
--------------------------------------------------
[Ham03]     69.09%      53.26%    60.15%
baseline    71.91%      50.90%    59.61%
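The slide does not spell out how F is computed. Assuming it is the balanced F-measure (the harmonic mean of precision and recall), which reproduces the figures in the table, a minimal sketch looks like this; the function name `f_measure` is a hypothetical choice.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) copied from the table.
results = {
    "[FIJZ03]": (88.99, 88.54),
    "[Ham03]": (69.09, 53.26),
    "baseline": (71.91, 50.90),
}

for system, (p, r) in results.items():
    print(system, round(f_measure(p, r), 2))
```

Because the harmonic mean is pulled toward the smaller value, the baseline's low recall keeps its F below 60% even though its precision is above 70%.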
Dataset
• Italian NER: Evalita 2009 (PER/ORG/LOC/GPE)
– Development set: 223,706 tokens
– Test set: 90,556 tokens
• English NER: CoNLL 2003 (PER/ORG/LOC/MISC)
– Training set: 203,621 tokens
– Development set: 51,362 tokens
– Test set: 46,435 tokens
• Mention Detection: ACE 2005
– 599 documents
CRF++ (1)
• Can redefine feature sets
• Written in C++ with STL
• Fast training based on L-BFGS for large-scale data
• Less memory usage in both training and testing
• Encoding/decoding in practical time
• Available as open source software
http://crfpp.googlecode.com/svn/trunk/doc/index.html
CRF++ (2)
• Uses Conditional Random Fields (CRFs)
• CRF methodology: use statistically correlated features
and train them discriminatively
• Simple, customizable, and open source
implementation
• For segmenting/labeling sequential data
• Can define
– unigram/bigram features
– relative positions (window size)
Template basics
• An example:
He       PRP  B-NP
reckons  VBZ  B-VP
the      DT   B-NP   << CURRENT TOKEN
current  JJ   I-NP
account  NN   I-NP

Template           Expanded feature
%x[0,0]            the
%x[0,1]            DT
%x[-1,0]           reckons
%x[-2,1]           PRP
%x[0,0]/%x[0,1]    the/DT
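The expansion shown above can be reproduced with a short script. This is a sketch of the idea only, not the CRF++ implementation: `expand` is a hypothetical helper, and the `<PAD>` placeholder used when a macro points outside the sentence is an assumption made here for illustration.

```python
import re

# The example sentence: one row per token, columns are word, POS, chunk tag.
ROWS = [
    ("He", "PRP", "B-NP"),
    ("reckons", "VBZ", "B-VP"),
    ("the", "DT", "B-NP"),       # current token (index 2)
    ("current", "JJ", "I-NP"),
    ("account", "NN", "I-NP"),
]

# %x[row,col]: row is relative to the current token, col is an absolute column.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, position, pad="<PAD>"):
    """Replace every %x[row,col] macro with the value it points to."""
    def lookup(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return pad               # assumed placeholder for out-of-range rows
    return MACRO.sub(lookup, template)


for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(template, "->", expand(template, ROWS, 2))
```

Moving `position` to another index gives the features for a different token, which is how one template line yields one feature per token in the sentence.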
A Case Study
• Installing CRF++
• Data for Training and Test
• Making the baseline
• Training CRF++ on the
– NER dataset: English CoNLL2003, Italian EVALITA
– Mention classification: ACE 2005 dataset
• Annotating the test corpus with CRF++
• Evaluating results
• Exercise
Installing CRF++
• First, ssh compute-0-x where x=1..10
• Unzip the lab--NER.tar.gz file (tar -xvzf lab-NER.tar.gz)
• Enter the lab--NER directory
– Unzip the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz)
– Enter the CRF++-0.54 directory
– Run ./configure
– Run make
Training/Classification (1)
• Notations
– xxx  train_it.dat/train_en.dat/train_mention.dat
– nnn  it.model/en.model/mention.model
– yyy  test_it.dat/test_en.dat/test_mention.dat
– zzz  test_it.tagged/test_en.tagged/test_mention.tagged
– ttt  test_it.eval/test_en.eval/test_mention.eval
• Note that test_it.dat already contains the correct
NE tags, but the system does not use this information
when tagging the data
Training/Classification (2)
• Enter the CRF++-0.54 directory
• Training
./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn
• Classification
./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz
• Evaluation
perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt
• See the results
cat ../corpus/ttt
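To avoid retyping the three commands for each dataset, the placeholders xxx, nnn, yyy, zzz, and ttt can be filled in programmatically. The sketch below only mirrors the commands and paths listed above; the function names `pipeline` and `run` are hypothetical, and it assumes the script is started from inside the CRF++-0.54 directory.

```python
import subprocess


def pipeline(name):
    """Return (argv, output_file) pairs for name in {"it", "en", "mention"}.

    output_file is None when the command's output is not redirected.
    """
    train = f"../corpus/train_{name}.dat"       # xxx
    model = f"../models/{name}.model"           # nnn
    test = f"../corpus/test_{name}.dat"         # yyy
    tagged = f"../corpus/test_{name}.tagged"    # zzz
    report = f"../corpus/test_{name}.eval"      # ttt
    return [
        (["./crf_learn", "../templates/template_4", train, model], None),
        (["./crf_test", "-m", model, test], tagged),
        (["perl", "../eval/conlleval.pl", tagged], report),
    ]


def run(name):
    """Execute training, classification, and evaluation in order."""
    for argv, output_file in pipeline(name):
        if output_file is None:
            subprocess.run(argv, check=True)
        else:
            # Same effect as the shell redirection "> file".
            with open(output_file, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)
```

Calling `run("en")` would then train on the English CoNLL2003 data and leave the evaluation report in ../corpus/test_en.eval, provided the lab files are laid out as described.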
THANKS
• I used material from
– Text Processing II: Bernardo Magnini
– Lab Text Processing II: Roberto Zanoli