
FCS extension meeting
Dieter Van Uytvanck
FCS taskforce
1 Context
Date: 2014-12-04, immediately after CLARIN-D developer meeting
Location: MPI Nijmegen
Participants: Oliver Schonefeld, Daniel Jettka, Dirk Goldhahn, Thomas Eckart, Jörg
Knappen, Olha Shkaravska, Dieter Van Uytvanck
2 Action Points
All: read the SRU 2.0 spec: http://docs.oasis-open.org/searchws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3sru2.0.html
All: Gather examples of CQP queries using the universal POS tags [see example in
section 3.3]
All: Search for comprehensive documentation and grammar/parser for Corpus
Query Language (the CQP-related CQL, further on in this document: CQP)
3 Meeting notes
3.1 Protocol
First, the need for a more powerful FCS spec is explained: simple text search is nice, but
does not offer compelling search functionality. Purpose of this meeting is to look into
possibilities of setting up advanced search functionality for the FCS (in addition to the
currently existing basic protocol, which will remain).
Two examples of concrete advanced search:
- Searching in phonetic transcriptions (using a specific CQL index?)
- Searching in POS-tagged texts (more complicated, needs a more powerful
language than just Contextual Query Language)
An attempt is made to enumerate possibly interesting search tiers (non-exhaustive):
full text
orthographic transcription
orthographic normalisation
phonetic transcription
discourse transcription
named entities
dependency trees
named entity
Search tiers with the highest priority seem to be:
phonetic transcription
The respective data types:
POS > Controlled Vocabulary (enumeration)
text-token > string
full-text > string
lemma > string
phonetic transcription > string with specific alphabet (e.g. IPA or (X-)SAMPA)
translation > string
Necessary operators:
any_of <list> (= regex)
proximity: t1 prox t2
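As a rough illustration of the two operators above, the following Python sketch treats any_of as a regular-expression alternation and prox as a check on token positions. The function names, the whole-token matching, and the max_distance parameter are assumptions made for this example; the meeting did not fix any of these details.

```python
import re


def any_of(values):
    """Build a regex alternation that matches exactly one of the given values."""
    # re.escape keeps characters such as "." or "|" inside a value literal.
    return "|".join(re.escape(value) for value in values)


def prox(tokens, t1, t2, max_distance=1):
    """Return True if t1 and t2 occur within max_distance tokens of each other.

    max_distance is a hypothetical parameter; the notes only say "t1 prox t2".
    """
    positions1 = [i for i, token in enumerate(tokens) if token == t1]
    positions2 = [i for i, token in enumerate(tokens) if token == t2]
    return any(
        i != j and abs(i - j) <= max_distance
        for i in positions1
        for j in positions2
    )


pattern = any_of(["depr", "subst"])
print(pattern)
print(bool(re.fullmatch(pattern, "subst")))
print(prox(["the", "old", "dog"], "the", "dog", max_distance=2))
```

Matching with re.fullmatch mirrors the idea that a tag value has to match as a whole, so "subst" matches but a longer string that merely starts with it does not.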
An extensive discussion follows. In order to address more powerful queries it is clear
that the Contextual Query Language has several problems:
- quite some effort is necessary to tailor a flavour of CQL to fit the requirements
for FCS, as it was originally designed for queries in bibliographic metadata
- it is pretty far away from the query languages used at most of the CLARIN
endpoints, requiring (if it were to be used) quite some translation
At the same time it is concluded that:
- the SRU protocol by itself is very suitable for federated search
- as used in FCS today, Contextual Query Language does address the need for a
simple fulltext search language
Thus the question is: is it possible to rely on a more powerful query language to extend
the basic search? The protocol we are currently using (SRU 1.2) does not allow this; however, a
newer version (SRU 2.0) allows the use of arbitrary query languages on top of
Contextual Query Language. Oliver indicates that SRU 2.0 is more complex than 1.2, so
adopting it would mean adjusting the FCS client- and server-side libraries.
Decision: we need to look into SRU 2.0 as a serious candidate for the FCS.
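To make the SRU 2.0 option more concrete, here is a minimal Python sketch of how a client might build a searchRetrieve request that names its query language. SRU 2.0 defines a queryType request parameter for this purpose (defaulting to cql). The endpoint URL and the queryType value "cqp" are hypothetical; no such value was agreed on at the meeting.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, query_type="cql", maximum_records=10):
    """Build an SRU 2.0 searchRetrieve URL that declares its query language."""
    params = {
        "query": query,
        # SRU 2.0 request parameter naming the query language in use.
        "queryType": query_type,
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and hypothetical queryType value "cqp".
url = build_search_url(
    "https://example.org/sru",
    '[pos="VERB"] [pos="NOUN"]',
    query_type="cqp",
)
print(url)
```

A basic full-text search would keep the default queryType, so existing Contextual Query Language queries would not need to change in this sketch.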
3.2 Advanced Query Language
Next comes the point of the exact advanced query language to use. It is clear that
developing such a language (including the grammar and parsers for it) is not trivial. At
the same time many CLARIN endpoints are using some form of the Corpus Query
Language (as used in the Corpus Query Processor, http://cwb.sourceforge.net/ –
further on referred to as CQP in order to avoid confusion with the Contextual Query
Language) for their local search engines. If we could use CQP (or a subset of it) as lingua
franca for the more powerful searches it would:
- minimize adoption effort at the side of some endpoints [1]
- minimize implementation efforts at the side of the SRU clients and servers, as it
would take away the need to design such a query language ourselves
Decision: we should seriously consider using CQP as an (optional but more powerful)
query language within the Federated Content Search – especially in combination with
SRU 2.0.
3.3 POS tags
As mentioned before (see data types) most of the query parameters are strings, for which
there is no need for closer harmonization (they can be passed as is to the endpoints).
The exception is part-of-speech (POS) tags, for which different sets are in use, mostly
depending on the corpus and language context.
It is clear that if we want to support queries over multiple corpora (each using different
tag sets), it is necessary to have a lingua franca for the POS tags too. Oliver suggests
looking into the Universal Tagset for this purpose.
Mappings are available for almost all widespread tagsets.
Example mapping (verb followed by a noun):
CQP query using universal POS tags:
[pos="VERB"] [pos="NOUN"]
After (automatic) translation to the tagset used for Polish (IPIPAN) this becomes:
[pos="aglt|bedzie|imps|impt|pact|pant|pcon|ppas|praet|winien"] [pos="depr|subst"]
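The translation step in this example could be sketched in Python as follows. The mapping table contains only the two entries from the example above; the function name and the regex-based rewriting are assumptions for illustration. A real endpoint would more likely work on a parsed query than on the raw string.

```python
import re

# Universal POS tag -> local tags, taken from the Polish example above.
UNIVERSAL_TO_LOCAL = {
    "VERB": ["aglt", "bedzie", "imps", "impt", "pact", "pant",
             "pcon", "ppas", "praet", "winien"],
    "NOUN": ["depr", "subst"],
}

# Matches constraints of the form pos="TAG" inside a CQP token expression.
POS_CONSTRAINT = re.compile(r'pos="([^"]+)"')


def expand_pos_tags(query, mapping):
    """Replace each universal POS tag with an alternation of local tags."""

    def replace(match):
        tag = match.group(1)
        if tag not in mapping:
            raise KeyError(f"no local mapping for universal POS tag {tag!r}")
        return 'pos="{}"'.format("|".join(mapping[tag]))

    return POS_CONSTRAINT.sub(replace, query)


print(expand_pos_tags('[pos="VERB"] [pos="NOUN"]', UNIVERSAL_TO_LOCAL))
```

Raising an error for unmapped tags is one possible design choice; an endpoint could equally report the unsupported tag back to the aggregator.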
Decision: the universal POS tagset should be seriously considered as the minimum
requirement for FCS searches, so each endpoint supporting POS queries should map the
universal POS tags to the tagset it uses locally.
Obviously, it is not possible to achieve the same level of precision as when someone uses
the original tagset. Therefore it should be investigated whether it is possible to support,
next to the universal POS set, the locally used sets (like the IPIPAN one for Polish
corpora). The FCS search engine (aggregator) then needs to know which tagset can be
used where and needs to show this to the user.

[1] Some, however, need to adopt CQP from scratch and are thus required to do more work.
Furthermore, there is still an implementation effort for the endpoints, even if they use CQP: they
still need to perform an “adjustment” step from FCS-CQP to their local flavour, e.g. expanding
Universal-POS, etc.