Title: FCS extension meeting
Version: 2
Author(s): Dieter Van Uytvanck
Date: 2016-02-08
Status: Draft
Distribution: FCS taskforce
ID: CE-2014-0472

1 Context

Date: 2014-12-04, immediately after the CLARIN-D developer meeting
Location: MPI Nijmegen
Participants: Oliver Schonefeld, Daniel Jettka, Dirk Goldhahn, Thomas Eckart, Jörg Knappen, Olha Shkaravska, Dieter Van Uytvanck

2 Action Points

- All: read the SRU 2.0 spec: http://docs.oasis-open.org/searchws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3sru2.0.html
- All: gather examples of CQP queries using the universal POS tags [see example in section 3.3]
- All: search for comprehensive documentation and a grammar/parser for the Corpus Query Language (the CQP-related CQL; referred to as CQP in the rest of this document)

3 Meeting notes

3.1 Protocol

First, the need for a more powerful FCS spec is explained: simple text search is nice, but does not offer compelling search functionality. The purpose of this meeting is to look into possibilities for setting up advanced search functionality for the FCS (in addition to the currently existing basic protocol, which will remain).

Two examples of concrete advanced search:
Searching in phonetic transcriptions (using a specific CQL index?)
Searching in POS-tagged texts (more complicated; needs a more powerful language than just the Contextual Query Language)

An attempt is made to enumerate possibly interesting search tiers (non-exhaustive):
- full text
- text-token
- POS
- lemma
- orthographic transcription
- orthographic normalisation
- phonetic transcription
- discourse transcription
- named entities
- translation
- gesture
- signs
- dependency trees
- constituents
- gloss

Search tiers with the highest priority seem to be:
- POS
- text-token
- full-text
- lemma
- phonetic transcription
- translation

The respective data types:
- POS > Controlled Vocabulary (enumeration)
- text-token > string
- full-text > string
- lemma > string
- phonetic transcription > string with a specific alphabet (e.g. IPA or (X-)SAMPA)
- translation > string

Necessary operators:
- identity
- regex-match
- any_of <list> (= regex)
- sounds_like
- proximity: t1 prox t2

An extensive discussion follows. When it comes to more powerful queries, it is clear that the Contextual Query Language has several problems:
- quite some effort is necessary to tailor a flavour of CQL to fit the requirements for FCS, as it was originally designed for queries in bibliographic metadata
- it is pretty far away from the query languages used at most of the CLARIN endpoints, so using it would require quite some translation

At the same time it is concluded that:
- the SRU protocol by itself is very suitable for federated search
- as used in FCS today, the Contextual Query Language does address the need for a simple fulltext search language

Thus the question is: is it possible to rely on a more powerful query language to extend the basic search? The protocol we are currently using (SRU 1.2) does not allow this; however, a newer version (SRU 2.0) allows the use of arbitrary query languages in addition to the Contextual Query Language. Oliver indicates that SRU 2.0 is more complex than 1.2, so adopting it would mean adjusting the FCS client- and server-side libraries.
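To make the idea of switching query languages more concrete: SRU 2.0 requests carry a `queryType` parameter next to `query`, which names the language the query is written in. The following is only a minimal client-side sketch that builds request URLs. The endpoint URL and the `cqp` query type identifier are hypothetical placeholders and were not agreed on in this meeting.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, used only for illustration.
EXAMPLE_ENDPOINT = "https://fcs.example.org/sru"


def build_search_url(base_url, query, query_type="cql", maximum_records=10):
    """Build an SRU 2.0 searchRetrieve URL for a query in the given language."""
    params = {
        "query": query,
        "queryType": query_type,  # SRU 2.0 parameter naming the query language
        "maximumRecords": str(maximum_records),
    }
    return base_url + "?" + urlencode(params)


# Basic search, as today: a Contextual Query Language query.
basic_url = build_search_url(EXAMPLE_ENDPOINT, "cat")

# Advanced search: "cqp" is an assumed identifier, not a registered query type.
advanced_url = build_search_url(
    EXAMPLE_ENDPOINT, '[pos="VERB"] [pos="NOUN"]', query_type="cqp"
)

# Round-trip check: the query survives URL encoding unchanged.
decoded = parse_qs(urlsplit(advanced_url).query)
print(decoded["queryType"][0], decoded["query"][0])
```

In such a setup the basic and the advanced search would share the same protocol and differ only in the `queryType` value, so an endpoint that does not support the advanced language could keep serving plain CQL requests.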
Decision: we need to look into SRU 2.0 as a serious candidate for the FCS.

3.2 Advanced Query Language

Next comes the question of which advanced query language to use. It is clear that developing such a language (including the grammar and parsers for it) is not trivial. At the same time many CLARIN endpoints are using some form of the Corpus Query Language (as used in the Corpus Query Processor, http://cwb.sourceforge.net/; referred to as CQP from here on, to avoid confusion with the Contextual Query Language) for their local search engines. If we could use CQP (or a subset of it) as lingua franca for the more powerful searches it would:
- minimize adoption effort at the side of some endpoints [1]
- minimize implementation efforts at the side of the SRU clients and servers, as it would take away the need to design such a query language ourselves

Decision: we should seriously consider using CQP as an (optional but more powerful) query language within the Federated Content Search, especially in combination with SRU 2.0.

3.3 POS tags

As mentioned before (see data types), most of the query parameters are strings, for which no further harmonization is needed (they can be passed as is to the endpoints). The exception is Part of Speech tags, where different sets are in use, mostly depending on the corpora/language context. It is clear that if we want to support queries over multiple corpora (each using a different tagset), it is necessary to have a lingua franca for the POS tags too.
Oliver suggests looking into the Universal Tagset for this purpose: http://universaldependencies.github.io/docs/u/pos/index.html
Mappings are available for almost all widespread tagsets: https://code.google.com/p/universal-pos-tags/

Example mapping (verb followed by a noun):
CQP query using universal POS tags:
[pos="VERB"] [pos="NOUN"]
After (automatic) translation to the tagset used for Polish (IPIPAN) this becomes:
[pos="aglt|bedzie|imps|impt|pact|pant|pcon|ppas|praet|winien"] [pos="depr|subst"]

Decision: the universal POS tagset should be seriously considered as the minimum requirement for FCS searches; so each endpoint supporting POS queries should map the universal POS tags to the tagset it uses locally.

Obviously, it is not possible to achieve the same level of precision as with the original tagset. Therefore it should be investigated whether it is possible to support, next to the universal POS set, the locally used sets (like the IPIPAN one for Polish corpora). The FCS search engine (aggregator) then needs to know which tagset can be used where and needs to show this to the user.

[1] Some, however, need to adopt CQP from scratch, and are thus required to do more work. Furthermore, there is still an implementation effort for the endpoints, even if they use CQP. They still need to perform an "adjustment" step from FCS-CQP to their local flavour, e.g. expanding Universal-POS, etc.
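The tag translation step from section 3.3 can be illustrated with a small sketch. It rewrites the universal POS tags in a CQP query into a regex alternation of local tags, using only the VERB and NOUN mappings from the Polish (IPIPAN) example above. The function and variable names are hypothetical, and the regular expression only handles the simple `pos="TAG"` form; a real endpoint would presumably need a proper CQP parser and the complete mapping table.

```python
import re

# Partial mapping, copied from the Polish (IPIPAN) example in section 3.3.
UNIVERSAL_TO_IPIPAN = {
    "VERB": ["aglt", "bedzie", "imps", "impt", "pact", "pant",
             "pcon", "ppas", "praet", "winien"],
    "NOUN": ["depr", "subst"],
}

# Matches only the simple attribute form pos="TAG" (an assumption of this sketch).
POS_ATTRIBUTE = re.compile(r'pos="([^"]+)"')


def expand_universal_pos(query, mapping):
    """Replace each universal POS tag with an alternation of local tags."""
    def replace(match):
        tag = match.group(1)
        if tag not in mapping:
            raise ValueError("no local mapping for universal tag: " + tag)
        return 'pos="' + "|".join(mapping[tag]) + '"'
    return POS_ATTRIBUTE.sub(replace, query)


local_query = expand_universal_pos('[pos="VERB"] [pos="NOUN"]', UNIVERSAL_TO_IPIPAN)
print(local_query)
```

Raising an error for unmapped tags, rather than passing them through unchanged, is one possible design choice: it keeps a universal tag from being silently matched against a local tagset in which it has no meaning.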