GATE Evaluation Tools
GATE Training Course
October 2006
Kalina Bontcheva

System development cycle
1. Collect corpus of texts
2. Manually annotate gold standard
3. Develop system
4. Evaluate performance
5. Go back to step 3, until desired performance is reached

Corpora and System Development
• “Gold standard” data created by manual annotation
• Corpora are typically divided into a training and a testing portion
• Rules and/or learning algorithms are developed or trained on the training part
• Tuned on the testing portion in order to optimise
  – Rule priorities, rule effectiveness, etc.
  – Parameters of the learning algorithm and the features used (typical routine: 10-fold cross validation)
• Evaluation set – the best system configuration is run on this data and the system performance is obtained
• No further tuning once evaluation set is used!

Some NE Annotated Corpora
• MUC-6 and MUC-7 corpora - English
• CONLL shared task corpora
  – http://cnts.uia.ac.be/conll2003/ner/ NEs in English and German
  – http://cnts.uia.ac.be/conll2002/ner/ NEs in Spanish and Dutch
• TIDES surprise language exercise (NEs in Cebuano and Hindi)
• ACE – English - http://www.ldc.upenn.edu/Projects/ACE/

The MUC-7 corpus
• 100 documents in SGML
• News domain
Named Entities:
• 1880 Organizations (46%)
• 1324 Locations (32%)
• 887 Persons (22%)
• Inter-annotator agreement very high (~97%)
• http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf

The MUC-7 Corpus (2)
<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures
<TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
<p>
Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.

ACE – Towards Semantic Tagging of Entities
• MUC NE tags segments of text whenever that text represents the name of an entity
• In ACE (Automated Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) the mentions in the text of the entities themselves
• Rolls together the NE and CO tasks
• Domain- and genre-independent approaches
• ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)

ACE Entities
• Dealing with
  – Proper names – e.g., England, Mr. Smith, IBM
  – Pronouns – e.g., he, she, it
  – Nominal mentions – the company, the spokesman
• Identify which mentions in the text refer to which entities, e.g.,
  – Tony Blair, Mr. Blair, he, the prime minister, he
  – Gordon Brown, he, Mr.
Brown, the chancellor

ACE Example
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION">
  <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"> </entity_mention>
  <entity_mention ID="M004" TYPE = "NAME" string = "NATS"> </entity_mention>
  <entity_mention ID="M005" TYPE = "PRO" string = "its"> </entity_mention>
  <entity_mention ID="M006" TYPE = "NAME" string = "Nats"> </entity_mention>
</entity>

Annotate Gold Standard – Manual Annotation in GATE GUI

Ontology-Based Annotation (coming in GATE 4.0)

Two GATE evaluation tools
• AnnotationDiff
• Corpus Benchmark Tool

AnnotationDiff
• Graphical comparison of 2 sets of annotations
• Visual diff representation, like tkdiff
• Compares one document at a time, one annotation type at a time
• Gives scores for precision, recall, F-measure, etc.

Annotation Diff

Corpus Benchmark Tool
• Compares annotations at the corpus level
• Compares all annotation types at the same time, i.e. gives an overall score, as well as a score for each annotation type
• Enables regression testing, i.e.
comparison of 2 different versions against gold standard
• Visual display, can be exported to HTML
• Granularity of results: user can decide how much information to display
• Results in terms of Precision, Recall, F-measure

Corpus structure
• Corpus benchmark tool requires a particular directory structure
• Each corpus must have a clean and a marked directory
• Clean holds the unannotated versions, while marked holds the marked (gold standard) versions
• There may also be a processed subdirectory – this is a datastore (unlike the other two)
• Corresponding files in each subdirectory must have the same name

How it works
• Clean, marked, and processed
• Corpus_tool.properties – must be in the directory from which GATE is executed
• Specifies configuration information about
  – What annotation types are to be evaluated
  – Threshold below which to print out debug info
  – Input set name and key set name
• Modes
  – Default – regression testing
  – Human marked against already stored, processed
  – Human marked against current processing results

Corpus Benchmark Tool
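
The precision, recall and F-measure figures that both tools report can be illustrated with a minimal sketch. It is not GATE code: the `Annotation` tuple, the `evaluate` function and the sample offsets are hypothetical, and it assumes strict matching only, so a response annotation counts as correct only if its type, start and end all equal those of a gold standard (key) annotation. Partially overlapping annotations are simply counted as wrong here.

```python
from collections import namedtuple

# Hypothetical annotation record: type plus character offsets.
Annotation = namedtuple("Annotation", ["type", "start", "end"])


def evaluate(key, response):
    """Return (precision, recall, f_measure) under strict matching.

    key      -- gold standard annotations (manually created)
    response -- annotations produced by the system under evaluation
    """
    key_set, response_set = set(key), set(response)
    correct = len(key_set & response_set)

    # Precision: fraction of system annotations that are correct.
    precision = correct / len(response_set) if response_set else 0.0
    # Recall: fraction of gold standard annotations that were found.
    recall = correct / len(key_set) if key_set else 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up example: 3 key annotations, 3 response annotations, 2 in common.
key = [
    Annotation("Location", 0, 14),
    Annotation("Location", 16, 20),
    Annotation("Organization", 60, 64),
]
response = [
    Annotation("Location", 0, 14),
    Annotation("Organization", 60, 64),
    Annotation("Person", 70, 79),
]

precision, recall, f_measure = evaluate(key, response)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

A per-type score, as given by the Corpus Benchmark Tool alongside the overall score, could be obtained in this sketch by filtering both lists on `type` before calling `evaluate`.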