Rapid digitization of P Herbarium

Switching to the fast track: Rapid digitization of the world's largest herbarium
TDWG 2011 - New Orleans
Simon Chagnoux, Henri Michiels

The French Museum

An old institution
• Founded in 1635 (at that time the Royal garden of medicinal plants)
• In 1793, the French Revolution turned the garden into the national Museum
• Now: 15 locations in France, 2000 people

20 Oct. 2011 TDWG - Orleans

Renovating the Herbarium
An opportunity to digitize the entire collection

The Paris Herbarium

The Renovation Project (1)
• Two main drivers for this project:
– the herbarium, designed for 6 million specimens, was packed with 10 million sheets and fitted with old storage
– raising the storage density required reinforcing the floors

The Renovation Project (2)
• The only way of doing this was to move the entire collection out and put it back in the renovated premises after the works
• An opportunity for
– New sorting, from geographic to phylogenetic (APG3)
– Reconditioning
– Digitizing

Renovation Calendar
2006 – Start of the project
2009 – Start of the works
2010 (June) – Start of digitization
2011 (Nov) – Opening of the first rearranged spaces to researchers
2012 – End of the project

Budget
• Overall project cost: 24.5 million €
– Building renovation 12 000 000
– Movers 900 000
– Attaching specimens (partial) 3 200 000
– Reconditioning, digitization and sorting 6 700 000
– Supplies 1 600 000
– Storage 100 000

The renovation cycle
[Diagram labels: Floor by floor renovation, Herbarium, Digitization, Reconditioning, Sorting, Industrial Partner, Warehouse]

Before ...

... And after

Why digitize?
• Because every sheet has to be handled in the course of the project anyway
• Digitization gives us:
– a virtual copy of specimens
– the possibility to share and study specimens without touching them
• More than an electronic copy of the collection catalog, we'll have a collaborative tool for managing scientific knowledge inside as well as outside the institution

2D Digitization is cheap
• the cost of digitization is marginal compared to the full project
• full specimen processing (moving, sorting, reconditioning, new furniture): $1.5
• digitization and name processing: $0.1
• digitization appeals to funders

A new paradigm
• For 15 years we entered the full information for a subset of specimens
– 1 million entries in the database (rich information)
– One fifth (200 000 images) was photographed
• Since summer 2010, we have used a mass approach where digitization precedes data entry
– 2 million records digitized in one year
– limited information in the database (name and geographic area)
– The scientific information can be added without handling the specimens themselves

The workflow
Digitizing, reconditioning and sorting

An industrial process (1)
• We chose a contractor with industrial know-how
• A dedicated facility had to be set up and equipped by the contractor
• Two teams of 20 workers in two shifts, working from 6am to 9pm
• The process had to align with the schedule of the renovation works, floor by floor

An industrial process (2)
• Planned production rate: 17 000 sheets per day over 24 months, ca. 15 seconds / sheet
• At this rate, a variation of ± 1 second per specimen has an impact of ± 300 k€ on the project cost

The Bussy-St-Georges site

Workflow overview
[Diagram labels: From Herbarium; Unpacking and adding barcode; Data entry; Image capture 300 DPI; Reconditioning and Sorting; To Herbarium; UP / DOWN markers between stations]

How to alleviate data entry
• We take advantage of the physical ordering of specimens
• We provide a name list to the contractor (APG 3 classification)
• The contractor enriches the list with the information generated during the process and provides us with a table containing consolidated information (image number, barcode numbers, classification, …)

1 – Delivery (1)
A carting company transports the specimens to the facility, where they arrive in clearly labeled boxes. Boxes receive a tracking barcode.

1 – Delivery (2)
• The Museum provides two files:
1. a “logistics” file
– number of boxes
– family name and number
– genus name and number
– geographic area
2. a “taxonomy” file
– List of available taxon names with family, genus, species, authors, ID (taxon number)
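
As a rough illustration, the two delivery files could be modeled as simple records. This is only a sketch under assumptions: the class names, field names and taxon numbers below are hypothetical and do not describe the Museum's or the contractor's actual file layout. The prefix lookup shows how a supplied name list can reduce typing to the first few letters of a name.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class LogisticsRow:
    """One row of the "logistics" file (field names are assumptions)."""
    box_count: int
    family_name: str
    family_number: int
    genus_name: str
    genus_number: int
    geographic_area: str


@dataclass(frozen=True)
class Taxon:
    """One row of the "taxonomy" file (field names are assumptions)."""
    taxon_id: int
    family: str
    genus: str
    species: str
    authors: str

    @property
    def name(self) -> str:
        return f"{self.genus} {self.species}"


def names_starting_with(taxa: list[Taxon], prefix: str) -> list[Taxon]:
    """Return the taxa whose "Genus species" name starts with prefix (case-insensitive)."""
    ordered = sorted(taxa, key=lambda t: t.name.lower())
    keys = [t.name.lower() for t in ordered]
    wanted = prefix.lower()
    start = bisect_left(keys, wanted)  # first name that sorts at or after the prefix
    matches = []
    for taxon, key in zip(ordered[start:], keys[start:]):
        if not key.startswith(wanted):
            break  # names are sorted, so no later name can match
        matches.append(taxon)
    return matches


# Illustrative rows only; the taxon numbers are made up.
sample_taxa = [
    Taxon(1, "Fabaceae", "Abrus", "aureus", "R. Vig."),
    Taxon(2, "Fabaceae", "Abrus", "precatorius", "L."),
    Taxon(3, "Fagaceae", "Quercus", "robur", "L."),
]
for hit in names_starting_with(sample_taxa, "abr"):
    print(hit.taxon_id, hit.name, hit.authors)
```

Sorting once and using a binary search keeps the lookup cheap even for a long name list; a production system would presumably build the sorted index once rather than on every call.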

1 – Delivery (3)
• This information is ingested by the contractor’s Information System and used throughout the industrial process (labeling, sorting, quality assurance)

2 – Folder processing
For each folder, the operator:
1. replaces the jacket (color according to region)
2. reads the species name and types the first letters on their computer
3. selects the name in a list
4. prints a label with barcode and identification information, and sticks it on the folder

3 – Specimen Digitization (1)
• A Datamatrix and a barcode are stuck on each sheet
– Datamatrix: for tracking purposes
– Barcode: specific to the Muséum and to the int’l herbarium standard
• The specimens are placed three by three on a tray

3 – Specimen Digitization (2)
• The tray is placed on a conveyor belt
• The sheet is scanned
• The scan is checked (framing and focus)
• At the end of the chain, the barcode is read to check that all specimens are back in the folder

The Digitization Bench

4 – Reconditioning
• After scanning, each sheet is inserted in a sulfurized paper liner
• The barcode of each specimen is read, allowing the system to check that all specimens are back in the right folder
• The folders are stored in a “cut box” before sorting

5 – Sorting 1 (by genus)
• This sorting consists of storing specimens by family and genus names
• The operator puts the jackets in boxes and places them on shelves according to the family and genus numbers (the shelves are labelled in advance by the contractor)
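
The barcode check mentioned in steps 3 and 4 (are all specimens back in the right folder?) can be pictured as a set comparison. The sketch below is hypothetical: the function name, the report layout and the barcode values are invented for illustration and are not taken from the contractor's system.

```python
def check_folder(expected: set[str], scanned: list[str]) -> dict[str, list[str]]:
    """Compare the barcodes read at the end of the chain with those registered for a folder.

    expected: barcodes assigned to the folder when the sheets were labeled
    scanned:  barcodes actually read when the folder is reassembled
    """
    seen: set[str] = set()
    duplicates: list[str] = []
    for code in scanned:
        if code in seen and code not in duplicates:
            duplicates.append(code)  # same sheet read twice
        seen.add(code)
    return {
        "missing": sorted(expected - seen),      # sheets that did not come back
        "unexpected": sorted(seen - expected),   # sheets that belong to another folder
        "duplicates": duplicates,
    }


# Made-up barcodes, for illustration only.
report = check_folder(
    expected={"P00000001", "P00000002", "P00000003"},
    scanned=["P00000001", "P00000003", "P00000003", "P00000009"],
)
print(report)
```

An empty report (three empty lists) would mean the folder is complete; anything else would flag the folder for a manual check before it moves on to sorting.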

6 – Sorting 2 (by species)
• The operator takes a box and reads the barcode on each jacket
• The system displays the species name and assigns a number, which is printed on a label
• The label is stuck on the folder, which is then stored on the shelf with the same number

7 – Packing, transport and final storage
• The folders are put in boxes and sent to the Museum
• The contractor stores the folders in the Museum’s herbarium

How to ensure quality in mass digitization?
1. 60 000 images produced each week
2. 1% of the production checked (ca. 600 images)
3. Samples are distributed among botanical staff
4. Checking:
• Focus
• Data quality
• Barcode number
• Barcode location

Scanning Resolution and Image Format

Production of images
• The conveyor belt passes the specimens under a bidirectional scanner which produces 11x17” (A3), 300 dpi, 5000 x 3300 pixel images
• TIFF files are saved offline (one production day per 1 TB disk)
• JPEGs are made for online use

Scanning resolution and image size
• One TIFF image is 50 MB
• One JPEG is 5 MB. This compression level was chosen to keep the same level of detail as the TIFF (only colour is slightly changed)
• This choice is a technical and economic trade-off
• For 10 million images:
– TIFF represents 500 TB
– JPEG represents 50 TB
– Data represents <100 GB

Why do we keep TIFF?
• Partners seek lossless data (Reflora, Mellon)
• Standard for physical publishing
• Native scan output, which can be used for any future use or transformation
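
The storage figures on the image-size slide can be cross-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 3 bytes per pixel (uncompressed 24-bit RGB) is an assumption, not something the slides state, and decimal units (1 MB = 10^6 bytes) are assumed throughout.

```python
# Figures quoted in the slides
WIDTH_PX, HEIGHT_PX = 5000, 3300
TIFF_MB, JPEG_MB = 50, 5
TOTAL_IMAGES = 10_000_000
PLANNED_SHEETS_PER_DAY = 17_000

# Assumptions made for this sketch
BYTES_PER_PIXEL = 3  # uncompressed 24-bit RGB
MB = 10**6
TB = 10**12


def total_tb(n_images: int, mb_per_image: float) -> float:
    """Total volume in terabytes for n_images of the given size."""
    return n_images * mb_per_image * MB / TB


raw_image_mb = WIDTH_PX * HEIGHT_PX * BYTES_PER_PIXEL / MB
tiff_total_tb = total_tb(TOTAL_IMAGES, TIFF_MB)
jpeg_total_tb = total_tb(TOTAL_IMAGES, JPEG_MB)
daily_tiff_tb = total_tb(PLANNED_SHEETS_PER_DAY, TIFF_MB)

print(f"raw image size:  {raw_image_mb:.1f} MB")
print(f"all TIFFs:       {tiff_total_tb:.0f} TB")
print(f"all JPEGs:       {jpeg_total_tb:.0f} TB")
print(f"one planned day: {daily_tiff_tb:.2f} TB of TIFF")
```

Under these assumptions the raw pixel data comes to 49.5 MB, in line with the quoted 50 MB per TIFF, and one planned production day (17 000 sheets) comes to 0.85 TB, which fits the "one production day per 1 TB disk" practice.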

Handling TIFF data
• We cannot afford “live” storage of 500 TB
• … let alone 1 PB with redundancy! $$$
• It would also mean a lot of energy consumption and heat dissipation for rarely accessed images
• We are planning to start using tape storage next year, with HSM software
• For the time being, USB disks are stored in the collection warehouse

Exception for the types
• The types are not part of this industrial process
• They are manually digitized on-premises at 600 dpi (200 MB in compressed TIFF)
• This process was initiated by the Mellon Foundation in 2004
• We now have about 100 000 type images

What we’ve achieved and learned …
… after 12 months of collaboration between scientists and industry (over an anticipated duration of 24 months)

Achievements
• 2.1 million specimens processed between June 2010 and August 2011
• Images and data are of good quality
• The new premises comply with today’s standards (space, safety, light, air conditioning, …)

Fast but ... not fast enough
[Chart: Forecast vs. Actual production; vertical axis from 0 to 18,000]

Reasons for being behind schedule
• Logisticians under-estimated the sorting work
• Only two digitization chains are operational, instead of three (due to lack of staff)

Software and quality assurance
• More software is needed for ensuring traceability and detecting failures than for data acquisition
• Fast web publication of images allows a broader audience to perform quality control
• Continuous control is mandatory

People
• Working under constant time pressure for two years is really difficult in an academic context
• The contractor must be considered as a service provider and not just the team next door (not obvious in an academic context)

Working with a contractor
• Culture clash [diagram labels: ROI, speed, robustness, quality, exhaustiveness, specificity]
• Many parameters were not known at the beginning of the project (processes, numbers, ...)
• Quality control is a key point to make sure that scientific excellence governs the industrial throughput (to be defined upfront)
• Write everything down and always refer to the contract

Digitizing other objects
• Digitizing herbarium sheets is “easy”:
– same dimensions for all objects
– easy manipulation and scanning
– the plant itself is not touched, only the paper
• Digitizing 3D objects is a lot more complex and generally requires handling the specimen itself

Is it over?
Digitization is just a very first step…

Virtual herbarium
• The amount of information available online will lower the number of physical visits to the Herbarium
• … but visitors leave post-it notes on the sheets. How to replace this?
– Annotation systems
– “virtual visit” website

Spot the differences …
AFM FABACEAE ? Abrus aureus R. Vig.

Differences are
• Occurrence
– occurrenceID | catalogNumber | occurrenceDetails | occurrenceRemarks | recordNumber | recordedBy | individualID | individualCount | sex | lifeStage | reproductiveCondition | behavior | establishmentMeans | occurrenceStatus | preparations | disposition | otherCatalogNumbers | previousIdentifications | associatedMedia | associatedReferences | associatedOccurrences | associatedSequences | associatedTaxa
• Event
– eventID | samplingProtocol | samplingEffort | eventDate | eventTime | startDayOfYear | endDayOfYear | year | month | day | verbatimEventDate | habitat | fieldNumber | fieldNotes | eventRemarks
• Location
– locationID | higherGeographyID | higherGeography | continent | waterBody | islandGroup | island | country | countryCode | stateProvince | county | municipality | locality | verbatimLocality | verbatimElevation | minimumElevationInMeters | maximumElevationInMeters | verbatimDepth | minimumDepthInMeters | maximumDepthInMeters | minimumDistanceAboveSurfaceInMeters | maximumDistanceAboveSurfaceInMeters | locationAccordingTo | locationRemarks | verbatimCoordinates | verbatimLatitude | verbatimLongitude | verbatimCoordinateSystem | verbatimSRS | decimalLatitude | decimalLongitude | geodeticDatum | coordinateUncertaintyInMeters | coordinatePrecision | pointRadiusSpatialFit | footprintWKT | footprintSRS | footprintSpatialFit | georeferencedBy | georeferenceProtocol | georeferenceSources | georeferenceVerificationStatus | georeferenceRemarks
• Identification
– identificationID | identifiedBy | dateIdentified | identificationReferences | identificationRemarks | identificationQualifier | typeStatus
• Taxon
– taxonID | scientificNameID | acceptedNameUsageID | parentNameUsageID | originalNameUsageID | nameAccordingToID | namePublishedInID | taxonConceptID | scientificName | acceptedNameUsage | parentNameUsage | originalNameUsage | nameAccordingTo | namePublishedIn | higherClassification | kingdom | phylum | class | order | family | genus | subgenus | specificEpithet | infraspecificEpithet | taxonRank | verbatimTaxonRank | scientificNameAuthorship | vernacularName | nomenclaturalCode | taxonomicStatus | nomenclaturalStatus | taxonRemarks

OCR / NLP?

Projects to fill the gap
• Remote Taxonomists
– Yack web tool
• Citizen Science / Crowdsourcing
– “les collecteurs” project
• Repatriation project
– Reflora (Brazil)

Thank you!
A project managed by:
• Direction of Collections
– Michel Guiraud mguiraud (at) mnhn (.) fr
– Pascale Joannot joannot (at) mnhn (.) fr
• DSI (Information Systems)
– Henri Michiels michiels (at) mnhn (.) fr
– Simon Chagnoux chagnoux (at) mnhn (.) fr