
Dorte Haltrup Hansen
16/7 2014
textCorpusProfile
– metadata for text corpora in CLARIN-DK
Profile id: clarin.eu:cr1:p_1386164908461
Introduction
Our goal for the metadata set for text corpora was to create a framework that allows a wealth of detailed metadata to be
specified while leaving as many of the metadata elements optional as possible. With this strategy the profile should cover
the details of the current text corpora and allow information from future text corpora to be expressed.
We inspected different profiles in the Component Registry [1] covering collections, corpora, written corpora and text
corpora. After comparing coverage, granularity and information types, we chose to take our point of departure in the
META-SHARE profile resourceInfo v3.0 – corpora (clarin.eu:cr1:p_1361876010571). This profile is widely used both on the
META-SHARE platform, which currently contains 1032 corpora of which 503 are text corpora, and in the VLO, where 163
resources use this profile [2]. The other profiles for corpora were each used for at most 10 resources. The resourceInfo
profile is extensive, covering all sorts of corpora, and it has very tight restrictions in the form of a long list of mandatory
fields, which makes the profile less flexible to use. However, we assume that using the metadata elements from resourceInfo,
and thereby sharing ISOcat references, will help future exchange of resources.
We have restructured the resourceInfo profile to have a clear distinction between:
• general information,
• corpus specific information and
• text corpus specific information.
In this way the different components are reusable for different collection profiles.
We have also simplified the profile by leaving out all components dealing with non-text corpora, so that the profile covers
text corpora only. The changes resulted in a new, leaner textCorpusProfile containing 103 components and 234 elements,
compared to the 419 components with 1587 elements in resourceInfo v3.0 - corpora. Of the 234 elements, 215 overlap with
resourceInfo elements, sharing Data Categories with definitions in ISOcat. Only 15 of the metadata elements
are mandatory.
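To put the reduction in proportion, the counts quoted above can be turned into percentages. The following is only a small arithmetic sketch: the numbers are taken from the text, while the variable and function names are chosen here for illustration.

```python
# Counts as reported in the text above.
RESOURCE_INFO = {"components": 419, "elements": 1587}
TEXT_CORPUS_PROFILE = {"components": 103, "elements": 234}
SHARED_ELEMENTS = 215      # elements overlapping with resourceInfo
MANDATORY_ELEMENTS = 15    # mandatory elements in textCorpusProfile


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal."""
    return round(100 * part / whole, 1)


summary = {
    "components_kept_pct": share(TEXT_CORPUS_PROFILE["components"],
                                 RESOURCE_INFO["components"]),
    "elements_kept_pct": share(TEXT_CORPUS_PROFILE["elements"],
                               RESOURCE_INFO["elements"]),
    "elements_shared_pct": share(SHARED_ELEMENTS,
                                 TEXT_CORPUS_PROFILE["elements"]),
    "elements_mandatory_pct": share(MANDATORY_ELEMENTS,
                                    TEXT_CORPUS_PROFILE["elements"]),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In other words, the new profile keeps roughly a quarter of the components and about 15% of the elements, about 92% of its elements are shared with resourceInfo, and about 6% are mandatory.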
“Hints for filling in the metadata”
We recommend that you fill in the metadata in English, except for title, description and subject_topic, which can be given
in both the language of the text corpus and in English if needed.
If you do not want to express a lot of information about your corpus in metadata and would rather leave it in a
documentation file, you only need to fill in the 15 mandatory metadata elements, see Appendix I. Be aware that metadata is
used for search and filtering: information that is not given as metadata cannot be searched later on.
Size:
Besides the mandatory information about the corpus size in 2.4 sizeInfo under 2 generalCorpusInfo, sizeInfo can be
stated in many places in the profile: language, modality, time and geographical coverage, text format, text classification,
validation and annotation. This is to accommodate collections that are heterogeneous with respect to e.g. language or
classification. The components containing sizeInfo can be repeated for each language, classification etc.
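As an illustration of how sizeInfo can be repeated per language, here is a minimal sketch in Python. It is not the CMDI serialization of the profile. The class and field names (SizeInfo, LanguageInfo, size, size_unit, language_id) and the token counts are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SizeInfo:
    size: int        # hypothetical field name
    size_unit: str   # hypothetical field name, e.g. "tokens"


@dataclass
class LanguageInfo:
    language_id: str                      # hypothetical field name
    size_info: Optional[SizeInfo] = None  # optional size for this language


@dataclass
class GeneralCorpusInfo:
    size_info: SizeInfo                   # mandatory overall size (2.4 sizeInfo)
    languages: List[LanguageInfo] = field(default_factory=list)


def size_by_language(info: GeneralCorpusInfo) -> Dict[str, int]:
    """Collect the sizes stated on the repeated language components."""
    return {
        lang.language_id: lang.size_info.size
        for lang in info.languages
        if lang.size_info is not None
    }


# Invented example: a bilingual corpus with one language component per language.
corpus = GeneralCorpusInfo(
    size_info=SizeInfo(size=1_000_000, size_unit="tokens"),
    languages=[
        LanguageInfo("da", SizeInfo(700_000, "tokens")),
        LanguageInfo("en", SizeInfo(300_000, "tokens")),
    ],
)

print(size_by_language(corpus))
```

The overall size is stated once, and each repeated language component carries its own optional sizeInfo, which is what makes filtering by language and size possible.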
[1] http://catalog.clarin.eu/ds/ComponentRegistry
[2] http://clarin.eu/content/virtual-language-observatory
Creation:
There are two places where information about creation is stated:
• 1.2 resourceCreatorInfo with date and information about the creator
• 2.5 creationInfo with information about the creation process (the creation mode and methods).
Classification:
There are also different places to state classification information:
• 2.6 timeCoverageInfo and 2.7 geographicalCoverageInfo stated under 2 generalCorpusInfo
• 3.1 textClassification stated under 3 textCorpusInfo.
The division is due to the fact that some classification parameters can only be stated for texts, e.g. genre, text type and
register, while time and geographical coverage can be applied generally to all kinds of corpora.
Contact information:
There are several places in the profile where contact information in terms of personInfo, communicationInfo and
organizationInfo can be stated. The only place where it is mandatory to give this information is, however, 1.2.1
resourceCreator.
In many places in the profile, pick lists are provided to guide you in filling in the metadata elements. Appendix III shows
where.
Profile structure
Most components can occur from 0 to an unbounded number of times, giving room for repetition of the information types in
the component. The elements in the components, on the other hand, can mostly be stated only 0 – 1 times, meaning that
elements are repeated by repeating the enclosing component, which keeps a block of information together.
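The cardinality rule can be sketched as a small check. This is a minimal illustration of the typical case only (components 0 to unbounded, elements 0 to 1), since individual components and elements in the profile may deviate from it. The function names and the languageId element used in the usage lines are assumptions for this example.

```python
from typing import Dict, List, Optional, Tuple

# (minimum, maximum) occurrences; a maximum of None means unbounded.
Cardinality = Tuple[int, Optional[int]]

COMPONENT: Cardinality = (0, None)  # typical component: 0..unbounded
ELEMENT: Cardinality = (0, 1)       # typical element: 0..1


def within(count: int, card: Cardinality) -> bool:
    """Return True if count lies inside the given cardinality."""
    low, high = card
    return count >= low and (high is None or count <= high)


def check_component(instances: List[Dict[str, List[str]]]) -> bool:
    """Check the repeated instances of one component.

    Each instance maps an element name to the list of values given for it.
    """
    if not within(len(instances), COMPONENT):
        return False
    return all(
        within(len(values), ELEMENT)
        for instance in instances
        for values in instance.values()
    )


# Not allowed: one component instance holding the same element twice.
print(check_component([{"languageId": ["da", "en"]}]))

# Allowed: the component is repeated, one value per instance.
print(check_component([{"languageId": ["da"]}, {"languageId": ["en"]}]))
```

Repeating the whole component rather than the single element means that any sibling elements stay grouped with the value they belong to.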
The profile consists of the following general structure (see Appendix II for definition of all major components):
1 generalInfo
All general information about the text corpus, such as title, description, license etc. More specific corpus information must
be stated in generalCorpusInfo and textCorpusInfo.
2 generalCorpusInfo
General information like linguality, language, modality, size, creation, time and geographic coverage.
3 textCorpusInfo
Information specific to a text corpus, namely text classification, text type, encoding and annotation.
Appendices
Appendix I - Mandatory elements
Appendix II - All major components
Appendix III - All metadata elements
Appendix IV - Example
(Information is missing in some places. Marked with XX XX XX)