NER for the RCAHMS Data
Malvina Nissim, Jochen Leidner
WS on Bootstrapping Methods for NER
Edinburgh, January 2004

RCAHMS
• Royal Commission on the Ancient and Historical Monuments of Scotland
• mission: record sites, monuments and buildings of Scotland’s past
• National Monuments Record of Scotland (NMRS): searchable database to obtain information on architectural, archaeological and maritime sites throughout Scotland
• Computer Application for National MOnuments Record Enquiries (CANMORE)
• on-line access to the NMRS
• data to be searched by location (place name, area or Ordnance Survey 1:10,000 map sheet), by type (the classification or function of a site, monument or building) or by keyword
• http://www.rcahms.gov.uk/

What can we do for them?
• provide tools to aid the population of their database
• improve their internal search engine (CANMORE)
• help standardise the terminology
• help clean/enrich their thesaurus

What can they do for us?
• recognition of new entities (beyond the standard “person”, “location”, and “organisation”)
• granularity: different levels of specification
• entities (referring to specific objects) vs terms (referring to classes)
• assess feasibility and costs of porting existing methods and technology to a different domain for recognising new entities

Some NMRS Figures
• 56422 documents (one for each site, with a specific ID)
• largest document = 5398 words
• smallest document = 1 word
• mean = 71 words; median = 46 words
• 17723 documents (31.4%) have ≤ 20 words
• 8839 documents (15.7%) have 4 words

Example Texts
• “Now demolished”
• “ARCHITECT: Thomas McRae, alterations to nos. 8 and 10, 1926.”
• “[ . . . ] The Innocent Railway was one of Scotland’s first freight railways and was opened in 1831 to carry coal from the mines in Dalkeith into the city. It earned its name as the carriages were originally drawn by horses rather than a steam engine. The line incorporates one of the earliest surviving railway tunnels (NT27SE 589) at its NW end, extending from the Wells o’ Wearie to the station at St Leonards (NT27SE 2735), as well as a fine example of a cast iron bridge across the Braid Burn to the SE of Bawsinch Nature Reserve (NT27SE 553); a large mound of spoil to the N of the railway is probably associated with the construction of the tunnel (NT 2762 7236). The popularity of the line was such that coaches were added to carry passengers, and the line continued in use until the 1960s. The route is maintained as a cycleway and footpath. Visited by RCAHMS (ARG), 15 December 1998 C R Wickham-Jones 1996. [ . . .
]”

Target Entities
• location – Aberdeenshire, Dalkeith
• site – tunnel, mound
• artefact – sword
• person – J W Hedges
• animal – fox
• date – 1831, 1960s, 15 December 1998
• number – 10
• organization – RCAHMS

RCAHMS Thesaurus
• 1670 entries for locations and sites
• 17 top nodes for site types
• automatic annotation (TTT and NITE tools)
• terms and entities (e.g. “railway” and “Innocent Railway”)

Terms vs Entities
• mark both
• coreference issues

Location and Site Subclassification

location
• county
• council
• region
• parish
• specific locations?
• natural objects?

site
• agriculture and subsistence
• civil
• commemorative
• commercial
• communications
• defence
• domestic
• education
• gardens parks and urban spaces
• health and welfare
• industrial
• maritime
• monument (by form)
• recreational
• religious ritual and funerary
• transport
• water supply and drainage

Issues with the RCAHMS Data
• texts are very telegraphic
• very little syntax
• inconsistencies in transcriptions (both in style and in terminology)
• skewed distribution of entities
• regular polysemy
• match in automatic annotation
• entities vs terms
• granularity
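
Sketch: thesaurus matching, terms vs entities

The matching and "entities vs terms" issues can be illustrated with a minimal sketch of thesaurus lookup. It is not the TTT/NITE annotation pipeline used for the project. The names `SITE_TERMS`, `DETERMINERS` and `annotate` are hypothetical, the four-word term list stands in for the 1670-entry thesaurus, and the capitalisation heuristic is an assumption made only for illustration.

```python
import re

# Hypothetical stand-in for site entries of the thesaurus (assumption).
SITE_TERMS = {"railway", "tunnel", "mound", "bridge"}
# Capitalised words that should not be absorbed into a name (assumption).
DETERMINERS = {"The", "A", "An"}


def annotate(text):
    """Return (phrase, label, kind) triples for thesaurus matches.

    kind is "entity" when a capitalised match is preceded by capitalised
    modifiers (e.g. "Innocent Railway"), otherwise "term" (e.g. "railway").
    """
    tokens = re.findall(r"\w+", text)  # crude tokenisation, drops punctuation
    results = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in SITE_TERMS:
            continue
        start = i
        if tok[0].isupper():
            # Extend leftwards over capitalised modifiers, skipping determiners.
            while (start > 0
                   and tokens[start - 1][0].isupper()
                   and tokens[start - 1] not in DETERMINERS):
                start -= 1
        phrase = " ".join(tokens[start:i + 1])
        kind = "entity" if start < i else "term"
        results.append((phrase, "site", kind))
    return results


if __name__ == "__main__":
    sample = ("The Innocent Railway was opened in 1831; "
              "a mound lies to the N of the railway.")
    for triple in annotate(sample):
        print(triple)
```

Exact lookup of this kind misses inflected forms such as "railways", and it ignores sentence boundaries, which is one reason matching appears among the issues listed above.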