From textual checklist to information systems

Corresponding author:
Stefano Martellos
Dept. of Life Sciences, University of Trieste,
Via L. Giorgieri 10 I-34100 Trieste
e-mail: martelst@units.it
tel: +39 (0)40558 3889
fax: +39 (0)40 57 88 55
From textual checklist to information systems: the case study of ITALIC
Stefano Martellos
Abstract
Keywords
biodiversity informatics, checklists, database, lichens
Preferred reviewers: de Felici, Holetschek, Guntsch, Attorre; non-preferred reviewer: Dave Roberts
Introduction
Checklists are a fundamental tool for accessing the information produced during
centuries of biological research. They summarise part of the biological diversity of a given area,
and provide the basis for specimen revision, critical re-appraisal of poorly known taxa, and
further exploration of under-investigated areas. For this reason, they are an endless work in
progress, and can be continuously updated with new information. Checklists are structured in
different formats, from lists of names to detailed reports on the distribution and ecology of the
listed taxa, and are normally published as printed books, or as special issues of
scientific journals. As a consequence, they cannot be updated without printing a new edition, or
publishing notes and/or updates. Printed checklists lack some of the most interesting
features of online databases, which 1) are easily accessible through the web, 2) are updated by a
continuous flow of new data, and 3) organise data in an effective way, e.g. by returning complex
elaborations rather than simple lists of taxa.
The possibility of converting textual checklists into digital formats has been explored since the
beginning of the internet age. One approach consists in publishing on the Web the original files or,
when the original files do not exist or are missing, a digitised version of the printed book
(two interesting examples: Wetter & al., 2001; Meades & al., 2000). This approach is
also used effectively by important international repositories, e.g. the BHL (Biodiversity Heritage
Library, http://www.biodiversitylibrary.org/), which are fundamental in improving the
accessibility of scientific literature. However, this approach cannot raise the performance of
a digital checklist to the level of a database, which requires the conversion of the text into
structured data.
Normally, checklists have a more or less regular structure, in which taxa and their data
(distribution in the area, ecology, etc.) are organised in “textual records”. These records can be
delimited by symbols, strings or formatting marks, such as carriage returns, which are normally
present in texts. Further delimiters can be found inside the “records”, and can be used to atomise
them into “fields”. This atomisation process can produce structured data files, which can be
used in different scenarios: 1) exported to existing facilities, e.g. the GBIF (Global Biodiversity
Information Facility, http://www.gbif.org); 2) used in collaborative virtual research environments
(e.g. the Scratchpads, http://scratchpads.eu/); 3) organised in information systems developed
ad hoc. An interesting example of the first scenario is given by Remsen & al. (2012), who
converted a textual checklist of ca. 4100 vascular plants into structured data following the
Darwin Core Archive (DwC-A) standard, which was recently developed by the GBIF and the
Biodiversity Information Standards (TDWG). The data were then published online through the
GBIF.
In this paper, the whole process of conversion of the Checklist of Italian Lichens (Nimis, 1993)
into an information system (Nimis & Martellos, 2002) is discussed.
The Checklist of Italian Lichens
The checklist (Nimis, 1993) was originally written with Microsoft Word for DOS 5.0 on a
personal computer running the DOS operating system, and then converted into the Microsoft Word for
Windows format.
It is organised in sections, each containing the information on a taxon. A section (fig. 1) is
divided into several paragraphs, each devoted to a different kind of information:
1. taxon name, followed by the author(s);
2. reference to the first paper reporting the taxon and, when present, its basionym,
followed by the reference to its publication;
3. synonyms, listed alphabetically;
4. distribution in the country, divided into three areas (Northern Italy, N; Central Italy, C;
Southern Italy, S). For each area there is a further subdivision into administrative regions
(Table 1 – division of the three districts into administrative regions). For each administrative
region, all references to scientific papers reporting the presence of the taxon are listed;
5. ecology, expressed as a sequence of ecological indicator values (Table 2 – ecological
indicators and their values), and a note providing further comments on the presence and
distribution of the taxon in the country.
+Hypocenomyce scalaris (Ach.) M.Choisy
Bull. Mens. Soc. Linn. Lyon, 22: 103, 1953 - Lichen scalaris [Lilj. ex] Ach., Utkast. Sv.
Fl.: 422, 1792.
Syn.: Biatora ostreata (Hoffm.) Th.Fr., Lecidea ostreata (Hoffm.) Schaer., Lecidea scalaris (Ach.) Ach., Psora ostreata Hoffm., Psora
scalaris (Ach.) Hook.
N - VG, Frl, Ven (Nascimbene & Caniglia 1997, 2000b, 2002c, 2003c, Caniglia & al. 1999, Nascimbene & al. 2006e), TAA (Lecid. Exs. 262:
Hertel 1992b, Nascimbene & Caniglia 2000b, 2002c, Caniglia & al. 2002, Nascimbene 2005b, 2006b, Nascimbene & al. 2005, 2006, 2006e), Lomb
(Alessio & al. 1995, Valcuvia & al. 2003, Nascimbene & al. 2006e), Piem (Caniglia & al. 1992, Isocrono & al. 2004, 2007), VA (Piervittori &
Isocrono 1999, Valcuvia & al. 2000b), Emil (Nimis & al. 1996, Tretiach & al. 2008). C - Tosc (Benesperi 2007), Umbr (Ravera 1998, Ravera & al.
2006), Marc (Nimis & Tretiach 1999), Laz (Massari & Ravera 2003), Abr (Nimis & Tretiach 1999), Sar (Zedda 1995, Zedda & Sipman 2001). S - Bas (Potenza 2006), Cal (Puntillo 1996), Si.
Sq/ Ch/ A.s/ Epiph-Lign/ 1-2, 3-5, 3-4, 1/ Alt: 2-4/ A: a, A1: vc, B: a, C: rr, D: vr,
E: a, F: vr, G: a, H: a/ PF: 1-2/ Note: a temperate to boreal-montane, circumpolar lichen,
found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in
the north than in the mountains of the south.
The first version of the checklist was printed in 1993, and the original file was continuously
updated by the author. In 1997 it was decided to convert the text file into the first version
of ITALIC, the Information System on Italian Lichens. The process involved the conversion of
the text into structured data, their storage in an Oracle 8 database, and the development of
several query interfaces (Nimis & Martellos, 2002). This first version was then improved over
the years, by adding new functions and modernising its layout. A new version of the whole
information system is expected for spring 2013.
The second version of the checklist (Nimis & Martellos, 2003) originates from the information
system, whose structured data were converted into a textual format and published on paper.
Conversion
First step: looking for separators among records and blocks of information
The checklist is structured in chapters, delimited by two carriage returns, each beginning with the
name of a taxon preceded by a “+” character. Each chapter is divided into sections separated by a
single carriage return, and the section devoted to synonyms always starts with the string
“Syn.: ”.
The conversion begins by dividing the text into records (one for each taxon), and each record into five
fields. The process, for which Microsoft Word for Windows was used, consists in replacing all the
carriage returns (^p) with a symbol which is not present in the text, in this case “@”. The result
is a text without carriage returns. Each double “@” which precedes a “+” symbol is then
replaced with a single carriage return (^p). At the end, the text is divided into paragraphs
separated by carriage returns, and each paragraph is divided into five sections by the “@” symbol.
Hypocenomyce scalaris (Ach.) M.Choisy@Bull. Mens. Soc. Linn. Lyon, 22: 103, 1953 - Lichen
scalaris [Lilj. ex] Ach., Utkast. Sv. Fl.: 422, 1792.@Syn.: Biatora ostreata (Hoffm.) Th.Fr., Lecidea ostreata (Hoffm.)
Schaer., Lecidea scalaris (Ach.) Ach., Psora ostreata Hoffm., Psora scalaris (Ach.) Hook.@N - VG, Frl, Ven (Nascimbene & Caniglia 1997,
2000b, 2002c, 2003c, Caniglia & al. 1999, Nascimbene & al. 2006e), TAA (Lecid. Exs. 262: Hertel 1992b, Nascimbene & Caniglia 2000b, 2002c,
Caniglia & al. 2002, Nascimbene 2005b, 2006b, Nascimbene & al. 2005, 2006, 2006e), Lomb (Alessio & al. 1995, Valcuvia & al. 2003,
Nascimbene & al. 2006e), Piem (Caniglia & al. 1992, Isocrono & al. 2004, 2007), VA (Piervittori & Isocrono 1999, Valcuvia & al. 2000b), Emil
(Nimis & al. 1996, Tretiach & al. 2008). C - Tosc (Benesperi 2007), Umbr (Ravera 1998, Ravera & al. 2006), Marc (Nimis & Tretiach 1999), Laz
(Massari & Ravera 2003), Abr (Nimis & Tretiach 1999), Sar (Zedda 1995, Zedda & Sipman 2001). S - Bas (Potenza 2006), Cal (Puntillo 1996),
Si.@Sq/ Ch/ A.s/ Epiph-Lign/ 1-2, 3-5, 3-4, 1/ Alt: 2-4/ A: a, A1: vc, B: a, C: rr, D: vr, E: a,
F: vr, G: a, H: a/ PF: 1-2/ Note: a temperate to boreal-montane, circumpolar lichen, found on
acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north
than in the mountains of the south.
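The two search-and-replace passes described above can also be sketched outside a word processor. The following Python fragment is only an illustration, not the procedure used for ITALIC (which relied on Microsoft Word). The function name split_records and the two sample taxa are hypothetical, and the sketch assumes that each section is a single line and that “@” never occurs in the source text.

```python
def split_records(text: str, marker: str = "@") -> list[list[str]]:
    """Split a checklist text into records (one per taxon) and fields.

    Assumes chapters are separated by a blank line and start with "+",
    and that each section inside a chapter is a single line.
    """
    if marker in text:
        raise ValueError(f"{marker!r} already occurs in the text")
    # Pass 1: replace every carriage return with the marker.
    flat = text.strip().replace("\r\n", "\n").replace("\n", marker)
    # Pass 2: a double marker followed by "+" starts a new record.
    flat = flat.replace(marker * 2 + "+", "\n")
    # The first record has no preceding marker, so drop its "+" separately.
    flat = flat.lstrip("+")
    return [line.split(marker) for line in flat.split("\n")]


# Hypothetical two-taxon sample with the five sections described in the text.
sample = (
    "+Taxon one Auth.\nRef. one\nSyn.: A, B\nN - VG\nSq/ Ch/ Note: x\n"
    "\n"
    "+Taxon two Auth.\nRef. two\nSyn.: C\nS - Cal\nCr/ Ch/ Note: y\n"
)
records = split_records(sample)
print(len(records), [len(r) for r in records])
```

Checking that every record has exactly five fields is a cheap first quality control: a record with more or fewer fields points to a missing or extra carriage return in the source.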
Second step: from Microsoft Word to Microsoft Access
The second step converts the text file into a Microsoft Access data table. The process requires
the conversion of the Word file into a plain text (.txt) file. The file can then be imported into an
Access table with five columns, by using the symbol “@” as column separator. This data table
can already be used in an information system to perform simple queries. However, while
developing ITALIC, it was decided to continue the conversion process, in order to obtain a
further atomisation of the text, separating taxonomic and distributional information from
synonyms and from ecological information, and splitting the data into three different tables.
During the process, three copies of the original table were made. The first, named “taxonomy”,
hosted the first two columns and the fourth (name, basionym and distribution). The second,
named “synonyms”, hosted the first and the third column (name and synonyms). The third,
named “ecology”, hosted the first and the fifth column (name and ecology). Each table
underwent further elaboration separately.
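The split into three tables amounts to three projections of the five-column table. A minimal sketch in Python follows; it stands in for the copy-and-delete operations performed in Access, and the helper name project, the column constants and the sample row are hypothetical.

```python
# Column positions in the five-field record (0-based), in the order
# given in the text: name, basionym, synonyms, distribution, ecology.
NAME, BASIONYM, SYNONYMS, DISTRIBUTION, ECOLOGY = range(5)


def project(records, columns):
    """Return a copy of the table restricted to the given columns."""
    return [[rec[c] for c in columns] for rec in records]


# Hypothetical one-row table.
records = [
    ["Taxon one Auth.", "Ref. one", "Syn.: A, B", "N - VG", "Sq/ Ch/ Note: x"],
]

taxonomy = project(records, [NAME, BASIONYM, DISTRIBUTION])
synonyms = project(records, [NAME, SYNONYMS])
ecology = project(records, [NAME, ECOLOGY])
```

The taxon name is repeated in all three tables, so that it can act as the common key when the tables are queried together.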
Third step: taxonomy and distribution
The table “taxonomy” is made of three columns, and does not require any further elaboration. The
distribution, contained as text in a single column, could have been split into several
columns, one for each administrative region. However, databases can easily perform complex
queries on textual columns, and keeping all the distributional information in a single field does
not represent a drawback for the functionality of the information system.
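As an illustration of a query on a textual distribution column, the sketch below uses an in-memory SQLite database from the Python standard library. This is an assumption made for the sake of a self-contained example: ITALIC itself used Access and Oracle, and the table layout, the function name taxa_in_region and the two sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxonomy (name TEXT, basionym TEXT, distribution TEXT)")
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?, ?)",
    [
        # Hypothetical rows; literature references omitted for brevity.
        ("Taxon one Auth.", "Ref. one", "N - VG, Frl. C - Tosc."),
        ("Taxon two Auth.", "Ref. two", "S - Bas, Cal."),
    ],
)


def taxa_in_region(conn, region_code):
    """Names of taxa whose distribution text mentions the region code."""
    rows = conn.execute(
        "SELECT name FROM taxonomy WHERE distribution LIKE ? ORDER BY name",
        (f"%{region_code}%",),
    )
    return [name for (name,) in rows]


print(taxa_in_region(conn, "Tosc"))
```

A plain substring match of this kind is case-insensitive in SQLite and can return false positives for very short region codes (e.g. “Si” also matches inside author names), so a production query would need a pattern that takes the surrounding delimiters into account.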
Fourth step: atomising ecology
The third table, “ecology”, required complex elaborations. At the beginning, it is made of two
columns, the first containing the name of the taxon, the second a long and complex string, which
is composed of text, of numerical data (the ecological indicator values), and of the
commonness-rarity status of the taxon in 9 bioclimatic belts (Nimis & Martellos, 2003 – Italic,
the info system etc.). This column can be divided into several parts by using two separators: the
string “Note: ”, which separates the author's notes from the other information, and the slash (“/”) symbol,
which separates: 1) growth form, 2) type of photobiont, 3) reproductive strategy, 4)
substrata, 5) ecological indicator values, 6) altitudinal range, 7) commonness-rarity in the
bioclimatic districts, and 8) poleophoby. In practice, “Note: ” and “/ ” were replaced with the
symbol “@”. At the end, the second column of the table “ecology” contained strings like:
Sq@Ch@A.s@Epiph-Lign@1-2, 3-5, 3-4, 1@Alt: 2-4@A: a, A1: vc, B: a, C: rr, D: vr, E: a, F:
vr, G: a, H: a@PF: 1-2@a temperate to boreal-montane, circumpolar lichen, found on acid bark,
esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the
mountains of the south.
The table was then exported into a text file, by using “@” as separator. The resulting text file was
then re-imported into Access, again by using “@” as separator. The result is a table with ten
columns.
Hypocenomyce scalaris (Ach.) M.Choisy | Sq | Ch | A.s | Epiph-Lign | 1-2, 3-5, 3-4, 1 | Alt: 2-4 | A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a | PF: 1-2 | a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
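The separation of the ecology string into its nine parts can be sketched as follows. This is an illustrative Python equivalent of the replace, export and re-import cycle, not the procedure actually used. The function name split_ecology is hypothetical, and the note of the sample string is abridged. Unlike a blind replacement of every “/”, the sketch cuts off the note first, so that a slash inside the free text cannot create a spurious field.

```python
def split_ecology(ecology: str) -> list[str]:
    """Split an ecology string into eight coded parts plus the note."""
    # Cut off the free-text note first: it may itself contain "/".
    head, _, note = ecology.partition("Note: ")
    parts = [p.strip() for p in head.split("/")]
    # The "/" that precedes "Note: " leaves an empty trailing part.
    if parts and parts[-1] == "":
        parts.pop()
    return parts + [note.strip()]


ecology = (
    "Sq/ Ch/ A.s/ Epiph-Lign/ 1-2, 3-5, 3-4, 1/ Alt: 2-4/ "
    "A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a/ PF: 1-2/ "
    "Note: a temperate to boreal-montane, circumpolar lichen."
)
fields = split_ecology(ecology)
print(len(fields))
```

Together with the taxon name, the nine parts give the ten columns of the table.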
Further refinement is made by removing the “Alt: ” and “PF: ” strings from the seventh and
ninth columns, and the codes defining the bioclimatic districts (A:, A1:, B:, etc.) from the eighth.
The process continues with the conversion of some fields (ecological indicator values,
altitude, commonness-rarity status and poleophoby) from textual to numerical, in order to allow
the Information System to perform complex elaborations on this information.
First, all the ecological indicator values and the commonness-rarity status in the 9
bioclimatic districts are separated into different columns, by replacing the commas in columns six
and eight with “@”, and by exporting and re-importing the table using “@” as separator.
The result is a table with 21 columns.
Hypocenomyce scalaris (Ach.) M.Choisy | Sq | Ch | A.s | Epiph-Lign | 1-2 | 3-5 | 3-4 | 1 | 2-4 | a | vc | a | rr | vr | a | vr | a | a | 1-2 | a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
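The passage from ten to 21 columns can be sketched in Python as follows. The function name refine is hypothetical, the note is abridged, and the regular expression assumes that every district code is one capital letter optionally followed by a digit, as in the example record. The str.removeprefix method requires Python 3.9 or later.

```python
import re


def refine(row: list[str]) -> list[str]:
    """Turn a ten-column row into the 21-column row described above."""
    (name, growth, photobiont, repro, substrata,
     indicators, alt, status, pf, note) = row
    alt = alt.removeprefix("Alt: ")
    pf = pf.removeprefix("PF: ")
    # Drop the district codes ("A:", "A1:", "B:", ...) and keep the status codes.
    status_codes = [re.sub(r"^[A-Z]\d?:\s*", "", s.strip())
                    for s in status.split(",")]
    # One column per ecological indicator value.
    indicator_values = [v.strip() for v in indicators.split(",")]
    return [name, growth, photobiont, repro, substrata,
            *indicator_values, alt, *status_codes, pf, note]


row = [
    "Hypocenomyce scalaris (Ach.) M.Choisy", "Sq", "Ch", "A.s", "Epiph-Lign",
    "1-2, 3-5, 3-4, 1", "Alt: 2-4",
    "A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a",
    "PF: 1-2", "a temperate to boreal-montane, circumpolar lichen.",
]
wide = refine(row)
print(len(wide))
```

Because only the sixth and eighth columns are split at commas, the commas in the note are left untouched.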
Single values (e.g. 1) of the ecological indicators were converted into double values (e.g. 1-1),
so that all the ecological indicator values, altitudes and poleophoby scores were expressed as
ranges. Then, the “-” symbols were replaced by “@”. This operation was limited to columns 6-10
and 20, because the symbol “-” could be present in other columns (e.g. in the fifth column,
“Epiph-Lign”, and in the 21st column). The table was then exported and re-imported again. The
result was a table with 27 columns.
Hypocenomyce scalaris (Ach.) M.Choisy | Sq | Ch | A.s | Epiph-Lign | 1 | 2 | 3 | 5 | 3 | 4 | 1 | 1 | 2 | 4 | a | vc | a | rr | vr | a | vr | a | a | 1 | 2 | a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
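The expansion of the ranges can be sketched in Python as follows. The function names and the abridged note are hypothetical, and the values are converted to integers directly, whereas in the procedure described above the type conversion was handled by the database.

```python
def as_range(value: str) -> tuple[int, int]:
    """Return (minimum, maximum); a single value "1" becomes (1, 1)."""
    low, sep, high = value.partition("-")
    return (int(low), int(high)) if sep else (int(low), int(low))


def expand_ranges(row, range_columns):
    """Replace each range column with two columns, minimum and maximum."""
    out = []
    for i, cell in enumerate(row):
        if i in range_columns:
            out.extend(as_range(cell))
        else:
            out.append(cell)  # e.g. "Epiph-Lign" keeps its hyphen
    return out


# 0-based positions of columns 6-10 and 20 of the 21-column table.
RANGE_COLUMNS = {5, 6, 7, 8, 9, 19}

row21 = [
    "Hypocenomyce scalaris (Ach.) M.Choisy", "Sq", "Ch", "A.s", "Epiph-Lign",
    "1-2", "3-5", "3-4", "1", "2-4",
    "a", "vc", "a", "rr", "vr", "a", "vr", "a", "a",
    "1-2", "a temperate to boreal-montane, circumpolar lichen.",
]
row27 = expand_ranges(row21, RANGE_COLUMNS)
print(len(row27))
```

Restricting the operation to an explicit set of columns plays the same role as limiting the search-and-replace to columns 6-10 and 20.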
Each ecological indicator value was thus represented by two columns, a minimum and a maximum. The
commonness-rarity status was stored in nine columns, but expressed by textual codes, which
needed to be converted into numbers, ranging from 0 (absent, a) to 8 (extremely common, ec).
This was done by a search-and-replace process, column by column. The result is shown in fig. XX
Hypocenomyce scalaris (Ach.) M.Choisy | Sq | Ch | A.s | Epiph-Lign | 1 | 2 | 3 | 5 | 3 | 4 | 1 | 1 | 2 | 4 | 0 | 7 | 0 | 4 | 2 | 0 | 2 | 0 | 0 | 1 | 2 | a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
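The conversion of the textual codes into numbers can be sketched with a lookup table. In the Python fragment below only the scores for a and ec are stated in the text, and those for vc, rr and vr are read from the worked example. The remaining codes and scores (er, r, rc, c) are assumptions introduced to complete the scale, and the names STATUS_SCORE and score_status are hypothetical.

```python
STATUS_SCORE = {
    "a": 0,   # absent (stated in the text)
    "er": 1,  # assumed code and score
    "vr": 2,  # read from the worked example
    "r": 3,   # assumed code and score
    "rr": 4,  # read from the worked example
    "rc": 5,  # assumed code and score
    "c": 6,   # assumed code and score
    "vc": 7,  # read from the worked example
    "ec": 8,  # extremely common (stated in the text)
}


def score_status(codes):
    """Convert commonness-rarity codes to numbers; unknown codes raise KeyError."""
    return [STATUS_SCORE[code] for code in codes]


codes = ["a", "vc", "a", "rr", "vr", "a", "vr", "a", "a"]
scores = score_status(codes)
print(scores)
```

A lookup on whole cell values avoids a pitfall of successive search-and-replace passes, where replacing a short code such as “r” could corrupt longer codes such as “rr” or “vr”. The KeyError raised for an unexpected code also works as a simple quality check.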
The three tables - “taxonomy”, “ecology” and “synonyms” - were then imported into an Oracle
10g database.
Fifth step: the Information System
The Information System (available at http://dbiodbs.units.it/) was developed on the
data stored in the three tables, and was written in the PL/SQL language. It can be queried by using
three query interfaces (Nimis & Martellos, 2002):
1. Taxonomic interface, which retrieves all the information on a taxon, extracting
data from all the tables and from all the related archives (images, maps, etc.) which have
been added to the Information System.
2. Floristic interface, which builds “virtual” releves of lichen vegetation by
combining ecological indicator values and other data, hence reconstructing certain
environmental conditions, and returning lists of taxa which potentially occur under those
conditions (Nimis & Martellos, 2001).
3. Statistic interface, which returns matrices of data for two selected parameters. This
interface permits complex elaborations, such as returning the matrix of epiphytic lichens
occurring in shady situations in the different bioclimatic districts of the country in
relation to eutrophication. The results of this interface were used, as an example, by
Nimis & Martellos (2003).
Sixth step: atomisation of synonyms
The table “synonyms” underwent a further elaboration after the Information System was
completed: each synonym was placed in a separate record. This transformation, while not
fundamental for the query systems, was performed in order to return a single synonym, instead
of a whole list, when the system is queried for a string in taxon names. Each row of the table
“synonyms” was extracted and split at the commas which separate the synonyms.
The process created as many records as there were synonyms, and inserted them into a new table,
“synonyms2”. At the end, the table “synonyms2” was used in the information system and the
original table “synonyms” was dropped.
A serious problem, in this case, can arise from the use of the comma as a separator. In fact,
when a taxon name has several authors, they are separated by commas. This is rare in lichens,
but common e.g. in vascular plants. For this reason, this process required a thorough manual
review of the results, and may not be easy to perform on other checklists.
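The atomisation of synonyms, and the problem caused by commas between authors, can be illustrated with the following Python sketch. It is not the PL/SQL procedure used for ITALIC. The function names, the BINOMIAL pattern and the name “Genus species A.Author, B.Author & C.Author” are hypothetical, and the pattern is a rough heuristic which only selects candidates for the manual review. The str.removeprefix method requires Python 3.9 or later.

```python
import re

# Heuristic (an assumption): a synonym starts with a capitalised genus
# name followed by a lower-case epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z-]+ [a-z]")


def split_synonyms(name: str, synonyms: str) -> list[tuple[str, str]]:
    """Return one (accepted name, synonym) pair per comma-separated item."""
    body = synonyms.strip().removeprefix("Syn.:")
    return [(name, s.strip()) for s in body.split(",") if s.strip()]


def needs_review(pairs):
    """Fragments that do not look like a name, e.g. the tail of an author list."""
    return [syn for _, syn in pairs if not BINOMIAL.match(syn)]


accepted = "Hypocenomyce scalaris (Ach.) M.Choisy"
pairs = split_synonyms(
    accepted,
    "Syn.: Biatora ostreata (Hoffm.) Th.Fr., Lecidea ostreata (Hoffm.) Schaer., "
    "Lecidea scalaris (Ach.) Ach., Psora ostreata Hoffm., Psora scalaris (Ach.) Hook.",
)

# Invented name with several authors separated by a comma.
broken = split_synonyms(accepted, "Syn.: Genus species A.Author, B.Author & C.Author")
print(len(pairs), needs_review(broken))
```

In the second call the comma between the authors splits one synonym into two fragments, and the heuristic flags the second fragment for manual correction.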
Discussion
Nowadays, one of the most challenging tasks in biodiversity informatics is exposing in the digital
domain the literature produced in centuries of scientific research. One successful approach to
the problem consists in publishing scanned versions of original papers and books in large online
repositories, e.g. the Biodiversity Heritage Library (******* cite). However, in some cases,
such as that of checklists, this process can go further, strongly enhancing the use of the original
data.
******While this process can be difficult starting from paper printed texts, when the original
digital files are available, bla blabla**************
Converting a text into a structured data format can be a difficult process. Even in consistently
structured texts, where all paragraphs have the same organisation, some differences can be
present, hence creating problems during the process. Separators can be missing, or some
information can be absent, thus creating gaps in the data structure. For this reason, each
conversion should be followed by a careful quality control, to verify the structure of the data.
However, the conversion of a checklist into a structured data format is often feasible, and can
strongly enhance the usability of the information it contains. Once structured, data can be
exported in different standards (e.g., Darwin Core Archive), thus contributing to different
projects or repositories of biodiversity information. Furthermore, structured data can be used in
complex information systems, hence returning results far more complex than lists of taxa. In the
example provided here, structured data deriving from a checklist are used to produce virtual
releves of lichen vegetation, predictive distributional maps, data matrices depicting the
distribution of lichens in different ecological scenarios, etc. These information systems can be
published on the web as stand-alone resources, or be integrated into national (Martellos & al.,
2011) and/or international networks of biodiversity data.
*** should BioCASe and ViBRANT also be cited among the resources into which the data can be integrated??
Acknowledgments
This research was funded by the Italian Ministry of Environment (MATTM) in the framework of
the National project “Sistema Ambiente 2010” for the development of the Italian National
Biodiversity Node (NNB). The author is grateful to Prof. P.L. Nimis for his useful comments on
the paper.
References
Martellos S, Attorre F, De Felici S, Cesaroni D, Sbordoni V, Blasi C, Nimis PL. 2011. Plant
sciences and the Italian National Biodiversity Network. Plant Biosystems 145(4): 758-761.
DOI: 10.1080/11263504.2011.620342
Meades SJ, Stuart G, Brouillet L. 2000. Annotated Checklist of the Vascular Plants of
Newfoundland and Labrador. [cited 2012 May 28]. Available from:
http://www.digitalnaturalhistory.com/meades.htm
Nimis PL. 1993. The Lichens of Italy. An Annotated Catalogue. Mus. Reg. Sci. Nat. Torino,
Monogr. XII, 897 pp.
Nimis PL, Martellos S. 2001. Testing the predictivity of ecological indicator values. A
comparison of real and virtual releves of lichen vegetation. Plant Ecology 157: 165-172.
Nimis PL, Martellos S. 2002. ITALIC, a database on Italian Lichens. Bibliotheca Lichenologica
82: 271-282.
Nimis PL, Martellos S. 2003. On the ecology of sorediate lichens in Italy. Bibliotheca
Lichenologica 86.
Nimis PL, Martellos S. 2003. A second checklist of the lichens of Italy, with a thesaurus of
synonyms. Mus. Reg. Sci. Nat. Saint-Pierre, Valle d’Aosta, Monogr. 4, 192 pp.
Remsen D, Knapp S, Georgiev T, Stoev P, Penev L. 2012. From text to structured data:
Converting a word-processed floristic checklist into Darwin Core Archive format. PhytoKeys
9: 1–13. DOI: 10.3897/phytokeys.9.2770
Wetter MA, Cochrane TS, Black MR. 2001. Checklist of the vascular plants of Wisconsin
(Watermolen DJ, editor). Technical Bulletin No. 192. Wisconsin Department of Natural
Resources, 258 pp. [cited 2012 May 28].
Available from: http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull192
Tables
Legend to figures
Figures
Addresses of the author
Stefano Martellos
Dept. of Life Sciences
University of Trieste,
Via L. Giorgieri 10 I-34100 Trieste, Italy