Rapid digitization of P Herbarium
Switching to the fast track:
Rapid digitization of the
world's largest herbarium
TDWG 2011 - New Orleans
Simon Chagnoux, Henri Michiels
The French Museum
An old institution
• Founded in 1635 (then the Royal
Garden of Medicinal Plants)
• In 1793, the French Revolution turned the
garden into the national Museum
• Now: 15 locations in France, 2000 people
20 Oct. 2011
TDWG - Orleans
Renovating the Herbarium
An opportunity to digitize the
entire collection
The Paris Herbarium
The Renovation Project (1)
• Two main drivers for this project:
– the herbarium, designed for 6 million
specimens, was packed with 10 million
sheets and fitted with outdated storage
– raising the storage density required
reinforcing the floors
The Renovation Project (2)
• The only way to do this was to move the
entire collection out and bring it back to
the renovated premises after the works
• An opportunity for
– Re-sorting, from geographic to phylogenetic
order (APG III)
– Reconditioning
– Digitizing
Renovation Calendar
2006 – Start of the project
2009 – Start of the works
2010 (June) – Start of digitization
2011 (Nov) – Opening of the first rearranged spaces to researchers
2012 – End of the project
Budget
• Overall project cost: 24.5 million €
– Building renovation: 12 000 000 €
– Movers: 900 000 €
– Attaching specimens (partial): 3 200 000 €
– Reconditioning, digitization and sorting: 6 700 000 €
– Supplies: 1 600 000 €
– Storage: 100 000 €
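The budget lines can be checked against the stated overall cost with a few lines of Python. This is only an arithmetic sketch: the amounts are those on the slide, while the dictionary and the `share` helper are illustrative names, not part of the project.

```python
# Budget lines as listed on the slide, in euros.
budget_eur = {
    "Building renovation": 12_000_000,
    "Movers": 900_000,
    "Attaching specimens (partial)": 3_200_000,
    "Reconditioning, digitization and sorting": 6_700_000,
    "Supplies": 1_600_000,
    "Storage": 100_000,
}

# Sum of all lines, to compare with the quoted 24.5 million EUR.
total_eur = sum(budget_eur.values())


def share(item: str) -> float:
    """Return the fraction of the total budget spent on one line."""
    return budget_eur[item] / total_eur


if __name__ == "__main__":
    print(f"Total: {total_eur:,} EUR")
    for item, amount in budget_eur.items():
        print(f"{item:42s} {amount:>12,} EUR  {share(item):6.1%}")
```

The lines add up to the 24.5 million € quoted as the overall project cost.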
The renovation cycle
[Diagram: floor-by-floor renovation cycle linking the Herbarium, the industrial partner (digitization, reconditioning, sorting) and the warehouse]
Before ....
... And after
Why digitize?
• Because every sheet has to be handled
in the course of the project anyway
• Digitization gives us:
– a virtual copy of the specimens
– the possibility to share and study specimens
without touching them
• More than an electronic copy of the
collection catalog, we’ll have a collaborative
tool for managing scientific knowledge
inside as well as outside the institution
2D digitization is cheap
• the cost of digitization is marginal
compared to the full project
• full specimen processing (moving,
sorting, reconditioning, new
furniture): $1.5
• digitization and name processing: $0.1
• digitization is attractive to funders
A new paradigm
• For 15 years we entered the complete
information for a selection of specimens:
– 1 million entries in the database (rich information)
– One fifth of them (200 000) were photographed
• Since summer 2010, we have used a mass
approach in which digitization precedes data entry
– 2 million records digitized in one year
– limited information in the database (name and
geographic area)
– The scientific information can be added later without
handling the specimens themselves
The workflow
Digitizing, reconditioning and
sorting
An industrial process (1)
• We chose a contractor with industrial
know-how
• A dedicated site had to be set up and
equipped by the contractor
• Two teams of 20 workers in two shifts,
working from 6 am to 9 pm
• The process had to follow the schedule of
the renovation works, floor by floor
An industrial process (2)
• Planned production rate: 17 000 sheets
per day over 24 months
→ ca. 15 seconds / sheet
• At this rate, a variation of ± 1 second per
specimen has an impact of ± 300 k€ on
the project cost
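The orders of magnitude behind these figures can be explored with a short calculation. This is a sketch under stated assumptions, not the project's own costing: the talk does not say how the 15 seconds or the 300 k€ were derived, so the code reads the 15 seconds as the time one workstation spends per sheet and assumes the 300 k€ is spread evenly over the 10 million sheets. All variable names are illustrative.

```python
SHEETS_TOTAL = 10_000_000        # collection size quoted in the talk
SHEETS_PER_DAY = 17_000          # planned production rate
SECONDS_PER_SHEET = 15           # "ca. 15 seconds / sheet"
IMPACT_PER_SECOND_EUR = 300_000  # quoted impact of +/- 1 s per specimen
SHIFT_SECONDS = 15 * 3600        # two shifts covering 6 am to 9 pm

# Working days needed to process the whole collection at the planned rate.
working_days = SHEETS_TOTAL / SHEETS_PER_DAY

# Interval at which the facility as a whole must output one sheet.
takt_seconds = SHIFT_SECONDS / SHEETS_PER_DAY

# ASSUMPTION: 15 s is the time one workstation spends on a sheet, so this
# many workstations must run in parallel at each step to hold the rate.
parallel_stations = SECONDS_PER_SHEET / takt_seconds

# ASSUMPTION: the 300 k EUR is spread evenly over all 10 million sheets,
# which implies this cost for one hour of handling time.
implied_eur_per_hour = IMPACT_PER_SECOND_EUR / SHEETS_TOTAL * 3600


def cost_impact_eur(delta_seconds: float) -> float:
    """Project cost change for a per-sheet time change, using the quoted figure."""
    return delta_seconds * IMPACT_PER_SECOND_EUR


if __name__ == "__main__":
    print(f"working days at planned rate : {working_days:.0f}")
    print(f"one sheet every              : {takt_seconds:.1f} s")
    print(f"parallel stations per step   : {parallel_stations:.1f}")
    print(f"implied cost per hour        : {implied_eur_per_hour:.0f} EUR")
```

Under these assumptions, shaving two seconds per sheet would be worth `cost_impact_eur(-2)`, i.e. a saving of 600 k€, which illustrates why the process was tuned so tightly.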
The Bussy-St-Georges site
Workflow overview
[Diagram: flow from the Herbarium and back to the Herbarium. Stations shown: unpacking and adding barcode; image capture (300 DPI); data entry; reconditioning and sorting. The links between stations are labeled UP or DOWN.]
How to reduce data entry
• We take advantage of the physical
ordering of specimens
• We provide a name list to the contractor
(APG III classification)
• The contractor enriches the list with the
information generated during the process
and provides us with a table containing
consolidated information (image number,
barcode numbers, classification,…)
1 – Delivery (1)
A carting company
transports the
specimens to the
facility where they
arrive in clearly
labeled boxes.
Boxes receive a
tracking barcode
1 – Delivery (2)
• The Museum provides two files:
1. a “logistics” file
– number of boxes
– family name and number
– genus name and number
– geographic area
2. a “taxonomy” file
– List of available taxon names with family,
genus, species, authors, ID (taxon number)
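As an illustration, the two files could be modelled as follows. This is a minimal sketch: the talk lists the contents of each file but not their format, so the CSV layout, the class names and the column names are assumptions, and the taxon ID in the sample row is invented.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class LogisticsRow:
    """One line of the "logistics" file (hypothetical field names)."""
    box_count: int
    family_name: str
    family_number: int
    genus_name: str
    genus_number: int
    geographic_area: str


@dataclass(frozen=True)
class TaxonName:
    """One line of the "taxonomy" file (hypothetical field names)."""
    taxon_id: int
    family: str
    genus: str
    species: str
    authors: str


def read_taxonomy(text: str) -> list[TaxonName]:
    """Parse a CSV export of the taxonomy file; the CSV layout is assumed."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        TaxonName(int(r["taxon_id"]), r["family"], r["genus"],
                  r["species"], r["authors"])
        for r in reader
    ]


# Sample row built from the name shown later in the talk; the ID is made up.
SAMPLE = (
    "taxon_id,family,genus,species,authors\n"
    "1,Fabaceae,Abrus,aureus,R. Vig.\n"
)
names = read_taxonomy(SAMPLE)
```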
1 – Delivery (3)
• This information is ingested by the
contractor’s Information System and used
throughout the industrial process (labeling,
sorting, quality assurance)
2 – Folder processing
For each folder, the operator:
1. replaces the jacket (color according to
region)
2. reads the species name and types its
first letters on the computer
3. selects the name from a list
4. prints a label with a barcode and
identification information, and sticks it on
the folder
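Steps 2 and 3 amount to a prefix lookup in the name list supplied by the Museum. The sketch below shows one common way to do it, a binary search over a sorted list; it is not the contractor's software, and `NamePicker` and the sample names are illustrative.

```python
from bisect import bisect_left


class NamePicker:
    """Suggest taxon names from the first letters typed by the operator."""

    def __init__(self, names):
        # Sort case-insensitively so typed prefixes match regardless of case.
        self._pairs = sorted((n.casefold(), n) for n in names)
        self._keys = [key for key, _ in self._pairs]

    def candidates(self, typed, limit=10):
        """Return up to `limit` names starting with `typed`."""
        prefix = typed.casefold()
        # Binary search for the first key that is >= the prefix.
        start = bisect_left(self._keys, prefix)
        found = []
        for key, name in self._pairs[start:]:
            if not key.startswith(prefix) or len(found) == limit:
                break
            found.append(name)
        return found


# Illustrative list; a real one would come from the "taxonomy" file.
picker = NamePicker(["Acacia dealbata", "Abrus precatorius", "Abrus aureus"])

if __name__ == "__main__":
    print(picker.candidates("abr"))
```

Restricting the operator to names already in the list is what keeps keystrokes, and therefore seconds per folder, to a minimum.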
3 – Specimen Digitization (1)
• A Data Matrix code and a barcode are stuck on
each sheet
– Data Matrix: for tracking purposes
– Barcode: specific to the Muséum and to the int’l
herbarium standard
• The specimens are placed three by three
on a tray
3 - Specimen Digitization (2)
• The tray is placed on a conveyor belt
• The sheet is scanned
• The scan is checked (framing and focus)
• At the end of the chain, the barcode is
read to check if all specimens are back in
the folder
The Digitization Bench
4 - Reconditioning
• After scanning, each sheet is inserted in a
sulfurized paper liner
• The barcode of each specimen is read,
allowing the system to check if all
specimens are back in the right folder
• The folders are stored in a “cut box”
before sorting
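The folder check described above is essentially a set comparison between the barcodes registered for a folder and those read back after scanning. A minimal sketch, with hypothetical function names and made-up barcode values:

```python
def check_folder(expected_barcodes, scanned_barcodes):
    """Compare the barcodes registered for a folder with those read back."""
    expected = set(expected_barcodes)
    scanned = set(scanned_barcodes)
    return {
        # Sheets that left the folder and did not come back.
        "missing": sorted(expected - scanned),
        # Sheets that belong to another folder.
        "unexpected": sorted(scanned - expected),
    }


def folder_is_complete(expected_barcodes, scanned_barcodes):
    """True when every sheet is back and no foreign sheet slipped in."""
    report = check_folder(expected_barcodes, scanned_barcodes)
    return not report["missing"] and not report["unexpected"]


if __name__ == "__main__":
    # Hypothetical barcode values, for illustration only.
    registered = ["P00000001", "P00000002", "P00000003"]
    read_back = ["P00000001", "P00000003", "P00000099"]
    print(check_folder(registered, read_back))
```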
5 - Sorting 1 (by genus)
• This first sort groups specimens
by family and genus
• The operator puts the jackets in boxes
and places them on shelves according to
the family and genus numbers (the
shelves are labelled in advance by the
contractor)
6 - Sorting 2 (by species)
• The operator takes a box, reads the
barcode on each jacket
• The system displays the species name
and assigns a number which is printed on
a label
• The label is stuck on the folder, which is
then stored on the shelf with the same
number
7 – Packing, transport and
final storage
• The folders are put in boxes and sent to
the Museum
• The contractor stores the folders in the
Museum’s herbarium
How to ensure quality in
mass digitization?
1. 60 000 images produced each week
2. 1% of the production checked (ca. 600 images)
3. Samples are distributed among botanical staff
4. Checking:
• Focus
• Data quality
• Barcode number
• Barcode location
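The sampling scheme (1% of 60 000 weekly images, shared out among botanical staff) can be sketched as a random draw followed by a round-robin split. The talk does not describe how samples were actually drawn or assigned, so both choices, and all names below, are assumptions.

```python
import random


def draw_qc_sample(image_ids, fraction=0.01, seed=None):
    """Draw a random sample of the week's images without replacement."""
    rng = random.Random(seed)
    sample_size = round(len(image_ids) * fraction)
    return rng.sample(image_ids, sample_size)


def distribute(sample, reviewers):
    """Share the sample out among reviewers in round-robin order."""
    batches = {reviewer: [] for reviewer in reviewers}
    for index, image_id in enumerate(sample):
        batches[reviewers[index % len(reviewers)]].append(image_id)
    return batches


if __name__ == "__main__":
    # Hypothetical image identifiers for one week of production.
    week = [f"img{i:06d}" for i in range(60_000)]
    sample = draw_qc_sample(week, seed=2011)
    batches = distribute(sample, ["botanist A", "botanist B", "botanist C"])
    print(len(sample), {name: len(ids) for name, ids in batches.items()})
```

A fixed `seed` makes a given week's draw reproducible, which helps when a disputed sample has to be re-examined with the contractor.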
Scanning Resolution and
Image Format
Production of images
• The conveyor belt passes the specimens
under a bidirectional scanner which
produces 11x17” (A3), 300 dpi, ca. 5000 x
3300 pixel images
• TIFF files are saved offline (one
production day per 1 TB disk)
• JPEGs are made for online use
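The image size and the "one production day per disk" rule follow from the scan format. The calculation below is a sketch that assumes uncompressed 24-bit RGB (the talk does not state the bit depth) and uses the planned rate of 17 000 sheets per day quoted earlier in the talk.

```python
DPI = 300
WIDTH_IN, HEIGHT_IN = 11, 17   # scan format in inches
BYTES_PER_PIXEL = 3            # ASSUMPTION: uncompressed 24-bit RGB
SHEETS_PER_DAY = 17_000        # planned production rate from the talk

# Pixel dimensions of one scan.
width_px = WIDTH_IN * DPI
height_px = HEIGHT_IN * DPI

# Raw size of one image in bytes, and one day of production in TB (10**12 bytes).
tiff_bytes = width_px * height_px * BYTES_PER_PIXEL
daily_tb = tiff_bytes * SHEETS_PER_DAY / 1e12

if __name__ == "__main__":
    print(f"{width_px} x {height_px} px, {tiff_bytes / 1e6:.1f} MB per image")
    print(f"{daily_tb:.2f} TB per production day")
```

Under these assumptions one image is 3300 x 5100 pixels and about 50 MB, and a full day at the planned rate comes to roughly 0.86 TB, which fits on a 1 TB disk.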
Scanning resolution and
image size
• One TIFF image is 50 MB
• One JPEG is 5 MB. This compression rate
was chosen to keep the same level of detail
as the TIFF (only the colour changes slightly)
• This choice is a technical and economic trade-off
• For 10 million images:
– TIFF represents 500 TB
– JPEG represents 50 TB
– Data represents <100 GB
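The collection-wide totals are straightforward multiplications, shown here with decimal units (1 TB = 1 000 000 MB). The disk count is an extrapolation that assumes the planned rate of 17 000 sheets per day and one 1 TB disk per production day, as stated elsewhere in the talk; the variable names are illustrative.

```python
import math

N_IMAGES = 10_000_000
TIFF_MB = 50
JPEG_MB = 5
SHEETS_PER_DAY = 17_000  # planned rate; one production day per 1 TB disk


def total_tb(n_images: int, mb_per_image: float) -> float:
    """Total volume in TB, using decimal units (1 TB = 1,000,000 MB)."""
    return n_images * mb_per_image / 1_000_000


tiff_tb = total_tb(N_IMAGES, TIFF_MB)
jpeg_tb = total_tb(N_IMAGES, JPEG_MB)

# ASSUMPTION: every production day runs at the planned rate.
disks_needed = math.ceil(N_IMAGES / SHEETS_PER_DAY)

if __name__ == "__main__":
    print(f"TIFF: {tiff_tb:.0f} TB, JPEG: {jpeg_tb:.0f} TB")
    print(f"1 TB disks for the TIFF archive: {disks_needed}")
```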
Why do we keep TIFF?
• Partners seek lossless data (Reflora,
Mellon)
• Standard for print publishing
• Native scan output, which can be used for
any future use or transformation
Handling TIFF data
• We cannot afford “live” storage
of 500 TB
• … let alone 1 PB with redundancy! $$$
• It would also mean a lot of energy consumption and heat
dissipation for rarely accessed images
• We are planning to start using tape
storage next year, with HSM software
• For the time being, USB disks are stored
in the collection warehouse
Exception for the types
• The types are not part of this industrial
process
• They are digitized manually on the premises
at 600 dpi (200 MB in compressed TIFF)
• This process was initiated by the Mellon
Foundation in 2004
• We now have about 100 000 type images
What we’ve achieved
and learned …
… after 12 months of collaboration
between scientists and industry (out of
an anticipated duration of 24 months)
Achievements
• 2.1 million specimens processed between
June 2010 and August 2011
• Images and data are of good quality
• The new premises comply with today’s
standards (space, safety, light, air conditioning, …)
Fast but ... not fast enough
[Chart: Forecast vs. Actual production, vertical axis from 0 to 18,000]
Reasons for being behind
schedule
• The logistics planners underestimated the
sorting work
• Only two digitization lines are
operational, instead of three (due to lack
of staff)
Software and quality
assurance
• More software is needed for
ensuring traceability and detecting failures
than for data acquisition.
• Fast web publication of images allows a
broader audience to perform quality
control.
• Continuous control is mandatory
People
• Working under constant time pressure
for two years is really difficult in an
academic context
• The contractor must be treated as a
service provider and not just as the team
next door (not obvious in an academic
context)
Working with a contractor
• Culture clash (diagram keywords: ROI, speed,
robustness, quality, exhaustiveness, specificity)
• Many parameters were not known at the
beginning of the project (processes,
numbers, ...)
• Quality control is key to making sure
that scientific excellence governs the
industrial throughput (to be defined upfront)
• Put everything in writing and always refer to the
contract
Digitizing other objects
• Digitizing a herbarium is “easy”:
– same dimensions for all objects
– Easy handling and scanning
– The plant itself is not touched – only the
paper
• Digitizing 3D objects is a lot more complex
and generally requires handling the
specimen itself
Is it over?
Digitization is just a very first step…
Virtual herbarium
• The amount of information available online will lower the number of physical visits
to the Herbarium
• … but visitors leave Post-it notes on the
sheets → How do we replace this?
– Annotation systems
– “virtual visit” website
Spot the differences …
AFM
FABACEAE
?
Abrus aureus R. Vig.
Differences are
• Occurrence
– occurrenceID | catalogNumber | occurrenceDetails | occurrenceRemarks | recordNumber | recordedBy | individualID |
individualCount | sex | lifeStage | reproductiveCondition | behavior | establishmentMeans | occurrenceStatus | preparations |
disposition | otherCatalogNumbers | previousIdentifications | associatedMedia | associatedReferences | associatedOccurrences |
associatedSequences | associatedTaxa
• Event
– eventID | samplingProtocol | samplingEffort | eventDate | eventTime | startDayOfYear | endDayOfYear | year | month | day |
verbatimEventDate | habitat | fieldNumber | fieldNotes | eventRemarks
• Location
– locationID | higherGeographyID | higherGeography | continent | waterBody | islandGroup | island | country | countryCode |
stateProvince | county | municipality | locality | verbatimLocality | verbatimElevation | minimumElevationInMeters |
maximumElevationInMeters | verbatimDepth | minimumDepthInMeters | maximumDepthInMeters |
minimumDistanceAboveSurfaceInMeters | maximumDistanceAboveSurfaceInMeters | locationAccordingTo | locationRemarks |
verbatimCoordinates | verbatimLatitude | verbatimLongitude | verbatimCoordinateSystem | verbatimSRS | decimalLatitude |
decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | coordinatePrecision | pointRadiusSpatialFit | footprintWKT |
footprintSRS | footprintSpatialFit | georeferencedBy | georeferenceProtocol | georeferenceSources |
georeferenceVerificationStatus | georeferenceRemarks
• Identification
– identificationID | identifiedBy | dateIdentified | identificationReferences | identificationRemarks | identificationQualifier | typeStatus
• Taxon
– taxonID | scientificNameID | acceptedNameUsageID | parentNameUsageID | originalNameUsageID | nameAccordingToID |
namePublishedInID | taxonConceptID | scientificName | acceptedNameUsage | parentNameUsage | originalNameUsage |
nameAccordingTo | namePublishedIn | higherClassification | kingdom | phylum | class | order | family | genus | subgenus |
specificEpithet | infraspecificEpithet | taxonRank | verbatimTaxonRank | scientificNameAuthorship | vernacularName |
nomenclaturalCode | taxonomicStatus | nomenclaturalStatus | taxonRemarks
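To make the gap concrete, the sketch below builds the kind of record the mass pass yields (name and geographic area only) and lists a few Darwin Core terms that still have to be transcribed from the label. It is an illustration, not the Museum's data model: the mapping of the folder information to specific terms, the barcode value, the reading of "AFM" as a geographic area code, and the choice of `LABEL_TERMS` are all assumptions.

```python
# Record as it might look right after mass digitization.
mass_record = {
    "catalogNumber": "P00000000",          # hypothetical barcode value
    "family": "Fabaceae",
    "genus": "Abrus",
    "specificEpithet": "aureus",
    "scientificNameAuthorship": "R. Vig.",
    "scientificName": "Abrus aureus R. Vig.",
    # ASSUMPTION: "AFM" on the slide is a geographic area code, stored here
    # under higherGeography for illustration.
    "higherGeography": "AFM",
}

# A few Darwin Core terms usually read from the specimen label itself
# (an illustrative selection, not a required set).
LABEL_TERMS = [
    "recordedBy",
    "recordNumber",
    "eventDate",
    "country",
    "locality",
    "identifiedBy",
    "typeStatus",
]


def missing_terms(record, terms):
    """Return the terms that are absent or empty in the record."""
    return [term for term in terms if not record.get(term)]


if __name__ == "__main__":
    print("Still to transcribe:", missing_terms(mass_record, LABEL_TERMS))
```

Every term returned by `missing_terms` is information that is visible on the image but not yet in the database, which is the gap the following projects aim to fill.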
OCR / NLP ?
Projects to fill the gap
• Remote Taxonomists
– Yack web tool
• Citizen Science / CrowdSourcing
– « les collecteurs » project
• Repatriation project
– Reflora (Brazil)
Thank you!
A project managed by:
• Direction of Collections
– Michel Guiraud mguiraud (at) mnhn (.) fr
– Pascale Joannot joannot (at) mnhn (.) fr
• DSI (Information Systems)
– Henri Michiels michiels (at) mnhn (.) fr
– Simon Chagnoux chagnoux (at) mnhn (.) fr