INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Truc-Vien T. Nguyen
Lab: Named Entity Recognition

Download
• Slides
  http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf
• Software
  http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar

Natural Language Processing (NLP)
• Main purpose of NLP
  – Build systems able to analyze, understand, and generate the languages that humans use naturally
• Tasks involved
  – Automatic Summarization
  – Information Extraction
  – Speech Recognition
  – Machine Translation
  – ...

Information Extraction (1)
[Figure: a stack of news articles mapped onto a stack of fixed forms, each with WHO/WHAT/WHEN slots]
• Mapping of texts into a fixed structure representing the key information

Information Extraction (2)
• Sam Brown retired as executive vice president of the famous hot dog manufacturer, Hupplewhite Inc. He will be succeeded by Harry Jones.

    EVENT:    leave job
    Person:   Sam Brown
    Position: executive vice president
    Company:  Hupplewhite Inc.

    EVENT:    start job
    Person:   Harry Jones
    Position: executive vice president
    Company:  Hupplewhite Inc.

Entity and Relation
• Entity
  – An object in the world
  – Ex. President Bush was in Washington today
  – Example types: Person, Organization, Location, GPE
• Relation
  – A relationship between two entities
  – Ex. 
LocatedIn(“Bush”, “Washington”)
  – Example types: LocatedIn, Family, Employment

Named Entity Recognition
• Named Entity Recognition
  – Subtask of information extraction
  – Locates and classifies elements in text into predefined categories: names of persons, organizations, locations, expressions of time, etc.
• Example
  – James Clarke (Person), director of ABC company (Organization)

CoNLL2003 shared task (1)
• English and German languages
• 4 types of NEs:
  – LOC   Location
  – MISC  Names of miscellaneous entities
  – ORG   Organization
  – PER   Person
• Training set for developing the system
• Test data for the final evaluation

CoNLL2003 shared task (2)
• Data
  – Columns separated by a single space
  – One word per line
  – An empty line after each sentence
  – Tags in IOB format
• An example

    Milan   NNP  B-NP  I-ORG
    's      POS  B-NP  O
    player  NN   I-NP  O
    George  NNP  I-NP  I-PER
    Weah    NNP  I-NP  I-PER
    meet    VBP  B-VP  O

CoNLL2003 shared task (3)

    English     precision   recall      F
    [FIJZ03]    88.99%      88.54%      88.76%
    [CN03]      88.12%      88.51%      88.31%
    [KSNM03]    85.93%      86.21%      86.07%
    [ZJ03]      86.13%      84.88%      85.50%
    --------------------------------------------
    [Ham03]     69.09%      53.26%      60.15%
    baseline    71.91%      50.90%      59.61%

Dataset
• Italian NER -- Evalita 2009 - PER/ORG/LOC/GPE
  – Development set: 223,706 tokens
  – Test set: 90,556 tokens
• English NER -- CoNLL 2003 - PER/ORG/LOC/MISC
  – Training set: 203,621 tokens
  – Development set: 51,362 tokens
  – Test set: 46,435 tokens
• Mention Detection -- ACE 2005
  – 599 documents

CRF++ (1)
• Allows redefining feature sets
• Written in C++ with STL
• Fast training based on L-BFGS for large-scale problems
• Low memory usage in both training and testing
• Encoding/decoding in practical time
• Available as open-source software
  http://crfpp.googlecode.com/svn/trunk/doc/index.html

CRF++ (2)
• Uses Conditional Random Fields (CRFs)
• CRF methodology: use statistically correlated features and train them discriminatively
• Simple, customizable, open-source implementation
• For segmenting/labeling sequential data
• Can define
  – 
unigram/bigram features
  – relative positions (window size)

Template basic
• An example:

    He       PRP  B-NP
    reckons  VBZ  B-VP
    the      DT   B-NP   << CURRENT TOKEN
    current  JJ   I-NP
    account  NN   I-NP

Template Expanded

    feature             expanded value
    %x[0,0]             the
    %x[0,1]             DT
    %x[-1,0]            reckons
    %x[-2,1]            PRP
    %x[0,0]/%x[0,1]     the/DT

A Case Study
• Installing CRF++
• Data for training and test
• Making the baseline
• Training CRF++ on the
  – NER datasets: English CoNLL2003, Italian EVALITA
  – Mention classification: ACE 2005 dataset
• Annotating the test corpus with CRF++
• Evaluating results
• Exercise

Installing CRF++
• First, ssh compute-0-x, where x = 1..10
• Unzip the lab-NER.tar.gz file (tar -xvzf lab-NER.tar.gz)
• Enter the lab-NER directory
  – Unzip the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz)
  – Enter the CRF++-0.54 directory
  – Run ./configure
  – Run make

Training/Classification (1)
• Notations
  – xxx   train_it.dat / train_en.dat / train_mention.dat
  – nnn   it.model / en.model / mention.model
  – yyy   test_it.dat / test_en.dat / test_mention.dat
  – zzz   test_it.tagged / test_en.tagged / test_mention.tagged
  – ttt   test_it.eval / test_en.eval / test_mention.eval
• Note that test_it.dat already contains the right NE tags, but the system does not use this information when tagging the data

Training/Classification (2)
• Enter the CRF++-0.54 directory
• Training
  ./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn
• Classification
  ./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz
• Evaluation
  perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt
• See the results
  cat ../corpus/ttt

THANKS
• I used material from
  – Text Processing II: Bernardo Magnini
  – Lab Text Processing II: Roberto Zanoli
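The %x[row,col] macros in the template slides can be illustrated with a minimal Python sketch. This is hypothetical illustration code, not part of CRF++: each macro selects a column of the token a given number of rows away from the current position, and the out-of-range marker here only approximates CRF++'s actual behavior.

```python
# Sketch of how CRF++-style %x[row,col] template macros expand.
# A sentence is a list of token rows (word, POS, ...); for the current
# position, %x[r,c] picks column c of the token r rows away.
import re

def expand(template, sentence, pos):
    """Expand one template string for the token at index `pos`."""
    def repl(m):
        row, col = int(m.group(1)), int(m.group(2))
        i = pos + row
        if 0 <= i < len(sentence):
            return sentence[i][col]
        return "_B" + str(row)   # out-of-range marker (approximation)
    return re.sub(r"%x\[(-?\d+),(\d+)\]", repl, template)

# The slide's example sentence; current token is "the" (index 2).
sentence = [("He", "PRP"), ("reckons", "VBZ"), ("the", "DT"),
            ("current", "JJ"), ("account", "NN")]

print(expand("%x[0,0]", sentence, 2))           # the
print(expand("%x[0,1]", sentence, 2))           # DT
print(expand("%x[-1,0]", sentence, 2))          # reckons
print(expand("%x[-2,1]", sentence, 2))          # PRP
print(expand("%x[0,0]/%x[0,1]", sentence, 2))   # the/DT
```

This reproduces the expanded features shown in the "Template Expanded" slide; composite templates such as %x[0,0]/%x[0,1] yield bigram-like combined features.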
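conlleval.pl scores at the entity level rather than the token level: a predicted entity counts as correct only when both its span and its type match the gold annotation. A simplified sketch of that idea (illustrative only, not a substitute for the official script):

```python
# Simplified entity-level precision/recall/F1 in the spirit of
# conlleval.pl: an entity is correct only if span and type both match.

def entities(tags):
    """Extract (type, start, end) entity spans from an IOB tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel flushes last span
        if tag == "O" or tag.startswith("B-") or tag[2:] != etype:
            if etype is not None:
                spans.append((etype, start, i))    # close the open span
            etype, start = (tag[2:], i) if tag != "O" else (None, None)
    return set(spans)

def prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1."""
    gold, pred = entities(gold_tags), entities(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = ["B-ORG", "O", "O", "B-PER", "I-PER", "O"]
pred = ["B-ORG", "O", "O", "B-PER", "O", "O"]     # PER span truncated
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Note how the truncated PER span costs both precision and recall at once, which is why entity-level F scores (as in the CoNLL2003 results table) are stricter than token accuracy.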