Dorte Haltrup Hansen 16/7 2014

textCorpusProfile – metadata for text corpora in CLARIN-DK

Profile id: clarin.eu:cr1:p_1386164908461

Introduction

Our goal for the text corpus metadata set was to create a framework that allows a wealth of detailed metadata to be specified while leaving as many of the metadata elements optional as possible. With this strategy the profile should cover the details of the current text corpora and allow information from future text corpora to be expressed.

We inspected different profiles in the Component Registry [1] covering collections, corpora, written corpora and text corpora. After comparing coverage, granularity and information types, we chose the META-SHARE profile resourceInfo v3.0 – corpora (clarin.eu:cr1:p_1361876010571) as our point of departure. This profile is widely used, both on the META-SHARE platform, which currently contains 1032 corpora of which 503 are text corpora, and in the VLO, where 163 resources use this profile [2]. The other profiles for corpora were used for no more than 10 resources.

The resourceInfo profile is extensive and covers all sorts of corpora. It also has very tight restrictions in the form of a long list of mandatory fields, which makes the profile less flexible to use. However, we assume that using the metadata elements from resourceInfo, and thereby sharing ISOcat references, will help future exchange of resources.

We have restructured the resourceInfo profile to make a clear distinction between general information, corpus specific information and text corpus specific information. In this way the different components are reusable for different collection profiles. We have also simplified the profile by leaving out all components dealing with non-text corpora, so that the profile covers text corpora only.

The changes resulted in a new, leaner textCorpusProfile containing 103 components and 234 elements, compared to the 419 components with 1587 elements in resourceInfo v3.0 - corpora.
Of the 234 elements, 215 overlap with resourceInfo elements and share Data Categories with definitions in ISOcat. Only 15 of the metadata elements are mandatory.

[1] http://catalog.clarin.eu/ds/ComponentRegistry
[2] http://clarin.eu/content/virtual-language-observatory

Hints for filling in the metadata

We recommend that you fill in the metadata in English, except for title, description and subject_topic, which can be given both in the language of the text corpus and in English if needed.

If you do not want to express a lot of information about your corpus in the metadata and would rather leave it in a documentation file, you only need to fill in the 15 mandatory metadata elements (see Appendix I). Be aware that metadata is used for search and filtering: information that you do not give as metadata cannot be searched later on.

Size: Besides the mandatory information about the corpus size in 2.4 sizeInfo under 2 generalCorpusInfo, sizeInfo can be stated in many places in the profile: language, modality, time and geographical coverage, text format, text classification, validation and annotation. This accommodates collections that are heterogeneous with respect to e.g. language or classification. The components containing sizeInfo can be repeated for each language, classification etc.

Creation: Information about creation is stated in two places:
1.2 resourceCreatorInfo, with date and information about the creator
2.5 creationInfo, with information about the creation process (the creation mode and methods)

Classification: Classification information is also stated in different places:
2.6 timeCoverageInfo and 2.7 geographicalCoverageInfo, stated under 2 generalCorpusInfo
3.1 textClassification, stated under 3 textCorpusInfo
The division is due to the fact that some classification parameters, e.g. genre, text type and register, can only be stated for texts, while time and geographical coverage can be applied generally to all kinds of corpora.
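As a quick reference, the placement of creation and classification information described above can be written as a small lookup table. This is only an illustrative sketch in Python: the dictionary keys and the helper `where_to_state` are hypothetical names introduced here, and only the component numbers and names are taken from the profile.

```python
# Illustrative lookup: kind of information -> component where it is stated.
# The keys are hypothetical labels; the values are the component numbers and
# names given in the hints above.
PLACEMENT = {
    "creator and creation date": "1.2 resourceCreatorInfo",
    "creation process": "2.5 creationInfo",
    "time coverage": "2.6 timeCoverageInfo",
    "geographical coverage": "2.7 geographicalCoverageInfo",
    "text classification": "3.1 textClassification",
}

# The three top-level components of the profile, keyed by their number.
TOP_LEVEL = {
    "1": "generalInfo",
    "2": "generalCorpusInfo",
    "3": "textCorpusInfo",
}


def where_to_state(kind):
    """Return (top-level component, component) for a kind of information."""
    component = PLACEMENT[kind]
    # The digit before the first dot identifies the top-level component.
    section_number = component.split(".")[0]
    return TOP_LEVEL[section_number], component


if __name__ == "__main__":
    for kind in PLACEMENT:
        print(kind, "->", where_to_state(kind))
```

The table makes the division visible: coverage information sits under generalCorpusInfo because it applies to all kinds of corpora, while text classification sits under textCorpusInfo.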
Contact information: There are several places in the profile where contact information in terms of personInfo, communicationInfo and organizationInfo can be stated. However, the only place where this information is mandatory is 1.2.1 resourceCreator.

In many places in the profile, pick lists are provided to guide you in filling in the metadata elements. Appendix III shows where.

Profile structure

Most components can occur from 0 to an unbounded number of times, leaving room for repetition of the information types in the component. The elements in the components, on the other hand, can mostly be stated only 0 or 1 times, meaning that elements are repeated by repeating the whole component, which keeps a block of information together.

The profile has the following general structure (see Appendix II for definitions of all major components):

1 generalInfo
All general information about the text corpus, such as title, description, license etc. More specific corpus information must be stated in generalCorpusInfo and textCorpusInfo.

2 generalCorpusInfo
General information like linguality, language, modality, size, creation, time and geographic coverage.

3 textCorpusInfo
Information specific to a text corpus, namely text classification, text type, encoding and annotation.

Appendices
Appendix I - Mandatory elements
Appendix II - All major components
Appendix III - All metadata elements
Appendix IV - Example (Some information is missing in places. Marked with XX XX XX)
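The occurrence rules described under Profile structure (components repeat, elements mostly occur at most once) can be sketched with a few Python dataclasses. This is a minimal illustration, not the profile's actual schema: the class and field names are hypothetical, the languages and token counts are made-up example data, and only the idea of repeating a component that carries its own sizeInfo comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SizeInfo:
    # Elements: stated at most once, so each is a single optional value.
    size: Optional[str] = None
    size_unit: Optional[str] = None


@dataclass
class LanguageInfo:
    language_name: Optional[str] = None
    # Components: may occur 0..unbounded times, so they are lists.
    size_info: List[SizeInfo] = field(default_factory=list)


@dataclass
class GeneralCorpusInfo:
    language_info: List[LanguageInfo] = field(default_factory=list)
    size_info: List[SizeInfo] = field(default_factory=list)


def tokens_per_language(info: GeneralCorpusInfo) -> Dict[str, int]:
    """Collect the token counts stated in each repeated language component."""
    return {
        lang.language_name: int(s.size)
        for lang in info.language_info
        for s in lang.size_info
        if s.size_unit == "tokens" and s.size is not None
    }


# Made-up example: a bilingual corpus. The language component is repeated
# once per language, keeping each language together with its own size.
corpus = GeneralCorpusInfo(
    size_info=[SizeInfo("1500000", "tokens")],
    language_info=[
        LanguageInfo("Danish", [SizeInfo("1000000", "tokens")]),
        LanguageInfo("English", [SizeInfo("500000", "tokens")]),
    ],
)

if __name__ == "__main__":
    print(tokens_per_language(corpus))
```

Repeating the whole language component, rather than a single element inside it, is what keeps each language paired with its own size, which is the "block of information" idea described in the Profile structure section.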