Digital Technology in Service of Access to Rare Library Materials
Adolf Knoll, National Library of the Czech Republic
One of the most acute problems facing Czech libraries, archives, and museums has been the safeguarding and preservation of rare and endangered library materials so that future generations retain long-term access to them. While in the early 1990s the attention of collection curators went mostly to old manuscripts and important archival materials, the new millennium is beginning with great alarm about the condition of acid-paper materials, especially newspapers and journals.
Once storage conditions had been improved and selected paper materials had received careful mass treatment, long-term access became the primary concern. Since the experience with direct open use of rare publications and manuscripts in Czech libraries was very negative, appropriate technologies were sought to keep users' handling of rare originals to a minimum. For example, more than 200 years of very open access to the National Library's manuscripts and old printed books left irreparable traces on many frequently used volumes: cut-out folios or even illuminations, torn-out pages, worn bindings and page edges, broken spines of books and codices, faded colours, etc. Add to this the natural ageing of the carriers, combined with ink corrosion and brittle paper, and the picture was far from optimistic.
In earlier years the library had tried to film certain titles, but production volumes were very low and no serious preservation microfilming practice was in place. The nineties brought a real boom of digitization devices and with it the idea that digital technology could help the library world safeguard the documentary heritage. In parallel, however, preservation microfilming strengthened its position, and much discussion was devoted to the question of which of the two technologies was more suitable for preservation.
Fortunately, life is richer than any artificial scheme, and today the emerging digital world coexists with the analogue realm of classical documents. Digitization is regarded primarily as an access tool, even though its preservation role cannot be disregarded. It brings an extraordinary enhancement of access opportunities, making it unnecessary to expose originals to direct use, and, with the fast development of technology, it can also play a certain role that may be called digital preservation.
Digital files are, however, very fragile, especially because of rapidly changing hardware and software platforms and formats; for this reason the digital copy has not yet been given clear preference as a true replacement preservation tool. Digital documents cannot be deciphered without technical equipment and many other high-technology products; they are not humanly readable in the way that written or printed materials, or even their microfilm copies, are.
All this, and many other considerations, were on our minds when, in 1992, we first considered applying digitization of originals to work with old manuscripts and books.
We liked the certain touch of eternity in the unchangeable substance of the digital image, and at the same time we were very much concerned about its fragility. Our concern was not so much with the digital information carriers themselves, because these can be replaced quickly and the recorded information refreshed. Of course, this point of view related mostly to what we intended to produce and store digitally ourselves. Today we must admit certain worries about the industrial production of documents on various information carriers, namely compact discs, since some six years ago we were charged with archiving audio CDs.
I Data
When we started routine digitization of old manuscripts in 1996, we were aware that we should not have to return to the originals again and that the products of our work had to last. In the beginning much attention was paid to the digital image as the main container of the information in which users were interested. This concerned the parameters of the image, but the images also had to be bound into a virtual book that would be identifiable as a unique document. Where was the stable point of view from which we could take correct decisions? Where should we look for it? And who should look for it and decide: librarians and archivists, or technicians?
How could we bridge the gap that existed in those years between technological thinking and the classical library profession, especially in manuscript and old printed book circles? There was almost no common denominator: the two groups were unable to talk to each other.
Finally, however, the common point was found, and its role today is even more decisive than before: the user. It is for the user that digital access is being prepared, and it is the user who will or will not be satisfied with the replacement surrogate. The user's point of view is projected into the technical parameters of the digital image and into the character and structure of the metadata framework.
I.1 The digital image
The digital image has two basic parameters: spatial resolution and brightness resolution; the latter is most frequently called colour depth or number of colours. The digital image with which we work in our digitization programmes is a bitmap, which can be imagined as a grid filled with coloured dots. Together these dots give the impression of a continuous colour space. Each dot has its place in this grid, i.e. it can be defined by its vertical and horizontal position.
I.1.1 Resolution
The spatial resolution, simply called resolution, tells how many dots or picture elements are used to express a unit of length. The picture element is simply called a pixel, and the resolution is usually given as the number of pixels or dots per inch (dpi).
Scanning devices and digital cameras frequently state their resolution as a combination of two numbers, e.g. 2000 x 3000 pixels, or as the total number of available pixels, e.g. 6 million = 6 megapixels for the same case. The point is that any scanned original will be expressed with at most this number of dots or pixels. If the scanned original is an A5 book page, it can be expressed by 6 million dots. In practice this means that the longer side of such a page, 210 mm, is covered at best by the longer side of the scanning element, i.e. we will have 3000 dots per 210 mm. Recalculated, this gives a resolution of about 360 dpi.
However, if the scanned original is a large newspaper with, for example, 800 mm on its longer side, then we can have at most 3000 dots per 800 mm, i.e. about 94 dpi. And if the original is a large map with 1600 mm on its longer side, the resolution will be only about 47 dpi. Of course, all this holds provided the shorter sides also fit into the scanning window.
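This arithmetic is easy to write down; the short sketch below merely recomputes the three examples above for a hypothetical sensor with a 3000-pixel long side (the document sizes are those given in the text).

```python
# A minimal sketch of the dpi arithmetic discussed above; the sensor size
# and the document dimensions are the examples from the text.
MM_PER_INCH = 25.4

def effective_dpi(sensor_pixels_long_side: int, original_mm_long_side: float) -> float:
    """Resolution obtained when the long side of the original fills the long side of the sensor."""
    return sensor_pixels_long_side / (original_mm_long_side / MM_PER_INCH)

for name, mm in [("A5 book page", 210), ("large newspaper", 800), ("large map", 1600)]:
    print(f"{name:16s} {effective_dpi(3000, mm):5.0f} dpi")
# Prints roughly 363, 95, and 48 dpi for a 3000-pixel long side.
```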
After some testing of various originals against the available resolution, we will quickly discover that some types of originals can be digitized successfully, while larger ones, if digitized at all, will give images of very poor, unusable quality. The available pixels will not cover fine details sufficiently, and the quality will not satisfy our users.
To solve this problem, it is necessary to know from the beginning which originals will be digitized and which will not. Deciding to exclude some documents from digitization now, because they form only a very small percentage of the materials prepared for processing, can save a substantial amount of the money that must be invested in equipment. We should also bear in mind that technological development is very fast and that the future will certainly bring better devices at more affordable prices, making it possible to digitize the set-aside originals at a substantially higher level than we could manage now.
We started with a 2000 x 3000 digital camera, and after two or three years we bought another device with a resolution of 6000 x 8000 dots, which enables us to process other types of documents as well. For the time being, such a resolution covers most of the digitization needs of our libraries and other cultural institutions. Recently, the company AIP Beroun, with which we have been digitizing, installed a third device of comparable resolution to the second one.
Nobody speaks today about indirect digitization via high-quality colour slides, which was relatively common in the beginning. It is true that it is expensive, but in some cases this technology can still meet certain requirements, for example the production of high-quality images of very large rare posters.
I.1.2 Colour depth
Each pixel has not only its two-dimensional position; it can also be seen as a three-dimensional element whose third dimension carries colour information. If there is no third dimension, a pixel represented by a single bit of information can have only two values, one or zero, meaning it can be either black or white. Such an image is called a black-and-white, 1-bit, bitonal, or bi-level image. A typical example is the classical black-and-white fax or photocopy.
The 1-bit image often cannot express all the necessary information; the pixel therefore needs to be able to express colour. The number of colours a pixel can express depends on how many bits we assign to it. If we assign only 4 bits, we can handle 16 colours; if we assign 8 bits, i.e. 1 byte of information, we have 256 colours. The so-called true colour image has 24 bits = 3 bytes per pixel, which means more than 16.7 million colours. A pixel can be assigned even more bits, but these usually do not express colour itself; rather, they express colour attributes such as transparency.
In practice, the number of colours substantially increases the size of the computer file containing the image. If we use, for example, all 2000 x 3000 pixels of our digital camera for an image, then the black-and-white image will contain 6 million bits, i.e. about 750 KB. The same image in true colour, however, will take 6,000,000 pixels x 24 bits = 144 million bits, i.e. about 18 MB.
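The following short sketch recomputes these uncompressed sizes for the bit depths discussed in this section (decimal megabytes, no compression applied).

```python
# Uncompressed file sizes of a 2000 x 3000 pixel image at various bit depths.
PIXELS = 2000 * 3000

for label, bits_per_pixel in [("1-bit (bitonal)", 1),
                              ("8-bit (256 greys or colours)", 8),
                              ("24-bit (true colour)", 24)]:
    size_bytes = PIXELS * bits_per_pixel / 8
    print(f"{label:30s} {size_bytes / 1_000_000:6.2f} MB")
# 0.75 MB, 6.00 MB, and 18.00 MB respectively.
```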
This must be taken very seriously, especially for large originals scanned at high resolution. The digitizing team may decide to reduce the number of colours for some originals, especially for monotone textual documents such as newspapers or even textual manuscripts. In that case the preference goes to digitization in 256 shades of grey rather than in 256 colours. The reason is simple: with 256 shades of grey the homogeneity of the images is preserved, unlike with an individual, different 256-colour palette for each image, and efficient compression remains possible, as explained below. In some cases even a 1-bit image can serve users' needs very well, especially for simpler documents such as textual periodicals.
The 1-bit image has its problems too, economical as it is. The problem lies in establishing a threshold that decides which dot should become black and which should remain white. There are many techniques for converting richer images into black-and-white images, and there are also tools for easier definition of acceptable thresholds for automated processing, which is important, for example, when scanning microfilm.
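As an illustration of such a threshold, the following sketch, using the Pillow library, converts a greyscale scan into a 1-bit image; the file names and the fixed threshold of 128 are assumptions for the example only, whereas in practice the threshold is tuned per document type.

```python
# A minimal thresholding sketch with Pillow; file names and the threshold
# value are assumptions for the example.
from PIL import Image

grey = Image.open("page_scan.png").convert("L")            # 256 shades of grey
THRESHOLD = 128                                            # tuned per document type in practice
bitonal = grey.point(lambda value: 255 if value >= THRESHOLD else 0, mode="1")
bitonal.save("page_scan_1bit.tif", compression="group4")   # CCITT Fax Group 4 in TIFF
```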
Growing digital storage capacities and the poor condition of many damaged and partly deteriorated originals make us prefer 256 shades of grey today over the 1-bit image, whose role is gradually shifting towards post-processing of the image during delivery.
I.2 The production processing of the digital image
Many digital images produced as the direct output of scanning or digital photography need further processing to respond better to the needs of users.
First of all, unnecessary parts of the image should be cropped away, and then sets of practically usable images should be produced from the resulting source image.
The images should be stored in standardized computer files according to a chosen format. But which format should be selected?
The selection of the appropriate format should be based on an analysis of the relevant parameters confronted with our needs and possibilities. Furthermore, the format is the unit that must be handled by viewing tools to enable access. In addition, we should rely on formats that will last in time and thereby lower the additional costs of conversion into new ones.
The graphic format is a framework above all for colour depth and possible compression. Each format supports a certain range of colour depths, and it either supports compression or it does not. If it supports compression, it does so for concrete compression schemes; we should therefore also know their properties.
For example, the well-known GIF format allows only 256 colours, while compressing the image with the LZW algorithm. The equally well-known JPEG format, on the other hand, supports the true colour image, or better said, the RGB image. Such an image cannot be reduced to only 256 colours and then stored in JPEG to save storage space, because JPEG does not support fewer colours than 16.7 million. In JPEG the applied compression scheme is DCT, which stands for discrete cosine transform. In contrast to these two examples, the TGA true colour format supports no compression.
I.2.1 Compression
Compression can be lossless or lossy. In a losslessly compressed image, every bit returns to its former position after decompression; with lossy compression it does not.
Compression schemes have specialized into two domains: the 1-bit image and the colour image. Development in these two domains proceeds rather separately, because the applied techniques differ substantially. As everywhere in the computer world, there are ISO standards and de facto standards. Some ISO standards are used, for example DCT in JPEG, but some are not, for example JBIG for the compression of the 1-bit image, even though it has been around for some seven years. Nor is it true that the better schemes are used and the worse ones are not. Practical life follows another line, dominated by availability and comfort of use.
The first condition for a compression scheme to be used is its implementation in a widely accepted and used format. If that does not happen, the scheme is not used, because it cannot be handled: there are (almost) no tools for production and none for access. Even though JBIG is superior to the best practically used, ISO-standardized 1-bit compressor, CCITT Fax Group 4 in TIFF, it is not used because it has never been supported in a widely used format. It will be interesting to see what future awaits JBIG2, which is excellent and is now becoming another ISO standard.
On the other hand, not everything that has been implemented in a known format, especially TIFF, is practically usable or makes sense to use. For example, TIFF 6 enabled, among other things, JPEG DCT compression and as such was, six years ago, even recommended for some Memory of the World applications. Nobody uses JPEG DCT in TIFF, but almost everybody uses JPEG, which is also an ISO standard. The explanation is simple: a rich choice of tools for JPEG and almost no tools for TIFF/DCT. Moreover, JPEG is one of the native Internet formats, and the Internet is changing the world.
Another interesting example is the LZW compression scheme, which was enabled in TIFF 5 for the true colour image. It is used, but rarely. Why? There are two reasons. First, LZW is licensed by Unisys, and any developer must buy a license to compress or decompress LZW images in their editors or viewers, and the user must pay the developer. Second, in the meantime the PNG format was developed, replacing LZW with its own lossless compressor. PNG is more efficient than LZW for both true colour and 1-bit images, and even though it is only a de facto standard, it is a solution with a very promising future. It has been accepted as the third native Internet graphic format, and it is free.
Nowadays the most reliable lossless compression solutions are TIFF/CCITT Fax Group 4 for the 1-bit image and PNG for the colour image.
Lossless compression, however, cannot reduce the size of the image file radically. For the true colour image the saving may be up to 30-50%, while in the 1-bit domain it is more. To enable faster transfer of image files, especially over networks, and to allow more efficient and cheaper storage, much work has been done in the lossy compression domain. The best widely used lossy compression scheme for the RGB image, i.e. true colour and 256-shades-of-grey images, is currently the DCT in JPEG. The image file size can be reduced several times over, but the image suffers a loss of information compared with the source image.
There is no objective rule for setting the compression ratio, frequently called the quality factor, for JPEG. It is software dependent and differs for various types of image. If the image is photorealistic, with many rich and varied objects, the compression ratio can be higher than for images with larger smooth areas. The threshold dividing quality factor values into acceptable and unacceptable ranges should be established individually for each group or type of image. Our decision is a compromise between our wish to have smaller files and the user's wish for acceptable comfort. We think, nevertheless, that rules can be defined for characteristic types of original documents by testing sets of images with various parameters. This work is now being done in a project of ours in which the most important players are our users and their subjective evaluation of the tested images.
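A simple way to support such testing is to save the same image at several quality factors and compare the resulting sizes before showing the files to users; the sketch below does this with Pillow (the file name and the tested quality values are assumptions for the example).

```python
# A JPEG quality-factor sweep with Pillow; the input file and the quality
# values are assumptions for the example.
import os
from PIL import Image

source = Image.open("manuscript_page.png").convert("RGB")
for quality in (90, 75, 60, 45, 30):
    out_name = f"manuscript_page_q{quality}.jpg"
    source.save(out_name, "JPEG", quality=quality)
    print(f"quality {quality:3d}: {os.path.getsize(out_name) / 1_000:8.1f} KB")
# Users then judge at which quality factor readability becomes unacceptable.
```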
JPEG DCT divides the image into small squares, each of which is handled separately. The lossy compression algorithm always tries to accentuate prominent lines and edges, because their acceptable rendering can dominate the image and thus mask the imperfect rendering of less dominant objects. If compression is pushed very far, this processing takes place largely separately within each square; the image can then show disturbing square artefacts, best seen in smooth areas or on textual pages.
Lossy compression is currently an area of intense development. Again the direction is twofold, covering the 1-bit and RGB areas, and the concrete efforts aim at two ISO standards: JBIG2 for the 1-bit image and JPEG2000 for the RGB image.
The novelty is lossy compression of the 1-bit image, in contrast to the lossless CCITT Fax Group 4 scheme. The success of the new solutions lies mainly in the pre-processing phase before the compression itself: small bitmap clusters, frequently representing individual characters, are compared with one another with the help of dictionaries. Using various threshold algorithms, the bitmap clusters are optimized so that their number is reduced and the same patterns are displayed several times at various positions within the image. Unnecessary noise is also removed. In this way the bitmap segment of the image encoding becomes less voluminous, while the repeating clusters are assigned only their positions.
The best solutions must be based on the best pre-processing algorithms. Our tests have shown that currently the best marketable, and already relatively widely applied, compression scheme of this type is JB2, developed by AT&T and distributed today by Lizardtech Inc. within the DjVu format. It is in fact one of the schemes related to JBIG2. Furthermore, DjVu can use the same cluster dictionary for several images if they are grouped in one file. In this way the total size of the multipage image is smaller than the sum of the sizes of the same images saved separately. This is also a substantial improvement compared with, for example, the multipage version of TIFF 6 or other similar solutions.
For the colour image, the direction also taken in the JPEG2000 draft is a new wavelet compression algorithm. The image is no longer split into squares compressed separately; the wavelet approach can rather be imagined as distilling the characteristic dominants of the image into a dramatically smaller archetypal representation. The idea is not so far from so-called fractal compression, but the algorithms applied are different. After decompression the image expands into its full representation; it is no longer assembled from squares as before. If compressed heavily, the wavelet image also shows artefacts, but their character differs from JPEG's. They are smoother and more continuous to the user's eye than the blocky ones from JPEG and therefore do not disturb as much at the same compression ratio. The efficiency of wavelet compression is far superior to DCT, with savings several times greater. Our tests have shown that currently the best marketable solution is the LWF format of Luratech Inc., which is now also very active in the implementation of JPEG2000. Another good, though slightly less efficient, solution is IW44, the wavelet component of the DjVu format discussed above.
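The principle of keeping only the dominant wavelet coefficients can be illustrated with the PyWavelets library; this is only a conceptual sketch, not the JPEG2000, LWF, or IW44 codec, and the file names and the 10% threshold are assumptions for the example.

```python
# A conceptual wavelet sketch: decompose, keep only the largest detail
# coefficients, reconstruct. It illustrates the principle only; it is not
# the JPEG2000, LWF, or IW44 codec.
import numpy as np
import pywt
from PIL import Image

image = np.asarray(Image.open("page_scan.png").convert("L"), dtype=float)
approx, (horiz, vert, diag) = pywt.dwt2(image, "haar")     # one decomposition level

def keep_largest(coeffs, fraction=0.1):
    """Zero out all but the largest `fraction` of coefficients (by magnitude)."""
    threshold = np.quantile(np.abs(coeffs), 1 - fraction)
    return np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)

details = tuple(keep_largest(c) for c in (horiz, vert, diag))
restored = pywt.idwt2((approx, details), "haar")           # expands back to full size
Image.fromarray(np.clip(restored, 0, 255).astype(np.uint8)).save("page_wavelet_preview.png")
```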
I.2.2 Graphic format
The graphic format is an envelope in which the image is stored; it carries information about resolution, colour depth, and the applied compression scheme. There are dozens of graphic formats, but if we work for the future we must seek greater stability in their application. It seems today that formats which are efficient and frequently used have a good future. This usage is to a certain extent also evident in the Internet world. The web prefers several formats, and several others are planned for the web or enabled especially thanks to plug-in or ActiveX technology.
I.2.2.1 Vector graphics
In the previous discussion we have left aside the simpler (computer-generated) colour graphics: various diagrams and schemes. For this domain the photorealistic compressors such as DCT or wavelets are not appropriate. If we stick to a bitmap representation, it is much better to limit the colour depth and compress them losslessly in GIF or PNG. Sometimes it is even useful to dither them down to 1-bit images and handle them in that domain.
Nevertheless, this sort of image is better suited to vector graphics, in which the image is no longer stored as a grid of pixels but as vector formulas. It is appropriate to mention here a new development connected with the Internet: the so-called SVG format, which stands for Scalable Vector Graphics. SVG is a vector image stored as an XML file; much work is now being done to implement SVG in Internet tools. In this case a very slim text file represents the image, so its transfer is quite efficient.
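To give an idea of how slim such a representation can be, the sketch below writes out a complete, hypothetical SVG file for a trivial diagram; the whole image amounts to a few hundred bytes of text.

```python
# A hypothetical minimal SVG diagram written out as a small text file.
svg_markup = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
  <rect x="10" y="10" width="180" height="100" fill="none" stroke="black"/>
  <line x1="10" y1="60" x2="190" y2="60" stroke="blue"/>
  <text x="20" y="40" font-size="14">simple diagram</text>
</svg>"""

with open("diagram.svg", "w", encoding="utf-8") as svg_file:
    svg_file.write(svg_markup)   # a few hundred bytes instead of a bitmap
```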
I.2.2.2 Raster graphics (bitmaps)
I.2.2.2.1 Traditional solutions
The Internet world has also had a great impact on the usage of bitmap image formats. Three of them are implemented directly in web browsers: GIF, JPEG, and PNG. Other formats can be used provided their MIME type is specified on the servers and within the browsers. In addition, direct viewing of other formats in web browsers requires the above-mentioned add-ons, such as plug-ins or ActiveX components, to be installed. With these we can easily work, for example, with TIFF in the browser window, or with other formats. The add-ons can also bring additional handling features to the browsers: zooming, sharpening, cropping, printing, etc.
The most appropriate format for efficient lossy storage of digital images of rare library materials is JPEG. It is also a good presentation format. If we insist on lossless compression, PNG should be considered; it is the best solution available, and it seems that not even lossless wavelet compression will outperform it significantly. PNG is so far only a de facto standard, but it is very good and it seems it will gain even wider applicability in the future than it has today. If we do not want compression at all, we can stay with uncompressed TIFF, but we will need a great deal of storage space, even though the compression algorithms are well described and there is no danger in using them.
GIF cannot be recommended for serious work with graphic representations of rare materials, because it is limited to only 256 colours; it is acceptable for preview images or simple graphics.
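In practice this choice can be checked quickly by saving the same scan in the candidate formats and comparing the resulting sizes; the sketch below uses Pillow, and the input file name and quality factor are assumptions for the example.

```python
# Comparing the storage cost of the recommended formats with Pillow;
# the input file and the JPEG quality factor are assumptions.
import os
from PIL import Image

scan = Image.open("manuscript_page.png").convert("RGB")
scan.save("page_lossy.jpg", "JPEG", quality=75)    # lossy storage/presentation copy
scan.save("page_lossless.png", "PNG")              # lossless compressed copy
scan.save("page_uncompressed.tif", "TIFF")         # uncompressed copy (Pillow default: no compression)

for name in ("page_lossy.jpg", "page_lossless.png", "page_uncompressed.tif"):
    print(f"{name:24s} {os.path.getsize(name) / 1_000_000:6.2f} MB")
```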
I.2.2.2.2 Emerging solutions
We have already mentioned wavelet compression, as well as some of the formats on the market that enable wavelet encoding.
As this domain still lacks standardization, the wavelet formats will for now play mainly an important presentation role. They can form a presentation layer existing alongside the source images.
It has also been observed that many images consist of various objects, some of which are mostly textual, while others are more photorealistic or represent accompanying graphics. The content is then said to be mixed; this is called Mixed Raster Content (MRC). If tools exist that can zone the image into textual and true colour segments, and if these segments are compressed separately by the most efficient compression tools and then stored in one single format, the result should be a very efficient and slim image file.
We can call such formats document formats. At present there are two good document-format solutions on the market: DjVu and LDF. Both work separately with foreground and background segments of the image, and the dominant foreground layer is always a 1-bit image. An efficient 1-bit compressor always compresses this 1-bit image: JB2 for DjVu and the LDF 1-bit compressor for LDF. It should be noted that these two 1-bit compressors are the best on the market today, even if new ones are starting to appear, e.g. the JBIG2-compatible compressor from CVision.
The second foreground layer contains the colour information of the textual layer only; it is always compressed by a wavelet algorithm, as is the coloured rest of the image, called the background. The compressors applied are IW44 for DjVu and LWF for LDF, again among the best on the market today.
The results are astonishing: a textual newspaper page in 256 shades of grey takes 2,152 KB as JPEG but only 130 KB as a typical DjVu image, and it remains well readable; thanks to the DjVu plug-in its integration into web browsers is also excellent. Moreover, it is displayed progressively, so it is readable even when a considerable part of the bits contained in the image file has not yet arrived.
This DjVu performance should also be seen in the context of the 1-bit representation of the same page: it takes 255 KB when stored in TIFF/CCITT Fax Group 4, but if converted to 1-bit and compressed in DjVu as 1-bit text, it takes only 73 KB. All these numbers are very eloquent and indicate how much can be done for document delivery in libraries.
Both DjVu and LDF can also switch off the zoning and compress the image as 1-bit (including the corresponding dithering), or compress and store it purely as a photorealistic image in wavelets.
I.3 The delivery of the digital image
Together, the digital images form a virtual representation of the original rare book. The user is no longer given the original for study, and in most cases the digital copy is a good replacement. To be a really good replacement, it must offer the user some comfort when working with the pages of the virtual document.
I.3.1 The preparation of the source image
When the digitization device in our programme produces the image of a manuscript page, this image serves as the source for a set of five images fulfilling various roles in communication with the user.
Two preview images are made and stored in GIF: the smaller one, of about 10 KB, is used to create a gallery representation of the entire manuscript. It gives the user a very basic orientation in the whole document: whether there is only text, or also images or illuminations, etc. The larger preview image, of about 50 KB, assists the user's decision whether to access the page for further study.
Another three images of different quality and resolution are stored in JPEG: an Internet-quality image of about 150 KB, a user-quality image of about 1 MB for normal work and research, and an excellent-quality image for archiving or special uses such as printing a facsimile copy.
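The production of such a derivative set can be scripted along the following lines with Pillow; the concrete pixel sizes and quality factors are assumptions for the sketch, since the real values are tuned so that the files approach the target sizes mentioned above.

```python
# A sketch of producing the set of five derivative images from one source
# scan; pixel sizes and quality factors are assumptions for the example.
from PIL import Image

source = Image.open("manuscript_page_source.tif")

# Two GIF previews: a small one for the gallery view and a larger one
# that helps the user decide whether to open the page.
for name, box in (("preview_small.gif", (120, 120)), ("preview_large.gif", (320, 320))):
    preview = source.copy()
    preview.thumbnail(box)
    preview.convert("P").save(name)

# Three JPEG derivatives: Internet, user, and excellent quality.
rgb = source.convert("RGB")
for name, quality in (("page_internet.jpg", 40),
                      ("page_user.jpg", 75),
                      ("page_excellent.jpg", 95)):
    rgb.save(name, "JPEG", quality=quality)
```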
The launch of Internet access to our Digital Library in particular has shown the need to reshape and optimize this set. It has been decided to abandon the Internet quality level because of its low readability and instead to optimize the user quality. This will be done by better tuning the JPEG quality factor for various types of documents with the help of professional users who need the documents for research. The outcome will be faster image processing when creating digital copies and probably also savings in storage space. These savings will be welcomed both by Internet users and by CD users, because in the latter case even larger manuscripts will fit on a single CD, which has not been the case for about 10% of digitized documents to date.
Special image processing tools are also going to be developed, and the Adobe Photoshop used until now will be replaced by faster programs optimized for this sort of work. It seems that all the manuscript scanning devices, of which there will shortly be three, will use one dedicated image processing station.
These sets of images are not used for periodicals, where the requirements are different.
I.3.2 The delivery of the source image
Users must be able to manipulate the image: to investigate details, to enhance the readability of some passages of text, to keep a good orientation within the image, etc. The basic presentation tool, today generally a web browser, does not allow this. If the image is larger than the monitor screen, working with it can become very unpleasant, especially for larger formats such as newspapers or for pages with extremely fine details scanned at very high resolution.
Furthermore, we must also be aware that any bitmap image is displayed on our computer screen at the screen resolution we have selected. Thus with finer scans, made at higher resolution, the displayed image is larger than the original.
To solve this problem, even for the web-recognized image formats JPEG, PNG, and GIF, various plug-ins have been developed to handle the images. The plug-ins, or ActiveX components for some solutions in MSIE, are installed into the browser and associated with the file extension. When an image file, for example a PNG, is called separately into the browser window or into a frame, it is displayed via these viewing tools, which add value to the presentation of the image.
There are several tools of this kind, and almost all of them are at least shareware. They also differ in their functions, and today they share one problem: they cannot handle JPEG in Microsoft Internet Explorer, because for this format only the native DLL is used and it cannot be replaced. This is a great pity, because JPEG is, and will certainly remain, one of the best presentation formats.
As the quality of various other features also varies among plug-ins, it was decided to develop our own. It is called ImGear and can handle several formats including TIFF, PNG, TGA, BMP, etc. There are separate plug-ins for Netscape and for Internet Explorer; we, too, have not succeeded in implementing any enhanced JPEG viewing within Internet Explorer.
If the user accesses our images with a browser, Netscape is recommended for easier work. The user can then work with the manuscripts and periodicals over the Internet, using our Digital Library, or locally with copies written to compact discs. For local access to manuscripts there is also another choice: a special viewing tool called ManuFreT, which also enables more sophisticated work with the descriptive metadata. As an image viewer and editor, ManuFreT handles access to JPEG easily and offers many added graphic functions.
The earlier viewer in ManuFreT was a solution separate from the ImGear tool, but in the new 32-bit ManuFreT software the image handling is provided through the same plug-in as in the web browsers. Such viewing better reflects the working habits the user has acquired on the Internet.
The ImGear interface is now being rewritten to respond better to users' comments and especially to open additional functions for other types of image files. Multipage TIFF viewing and work with LZW-compressed files, such as GIF and TIFF/LZW, are being enabled, as well as some post-processing of 1-bit images to enhance their readability: filling in the rough edges of partial bitmap clusters with grey pixels.
Users of our Digital Library may use the ImGear software free of charge for non-commercial purposes.
I.3.3 The dynamic post-processing of the source image for delivery
When the user visits our Digital Library, JPEG and TIFF images are available for research. Even when compressed very efficiently while preserving good readability, the files may be quite large: 0.8-1 MB for manuscripts and 1-3 MB for periodicals at normal user quality. With slow connections or very busy network traffic, this can hamper the work. In addition, Internet Explorer users have problems accessing larger JPEG images.
We have performed quite comprehensive tests of mixed raster content approaches and of emerging compression schemes in all these domains, and we have decided to offer the DjVu format as a presentation option in our Digital Library. At present the user can choose whether to view the desired image of a manuscript or newspaper page in JPEG or TIFF/G4, or whether to ask the system to convert it into DjVu and send it in that format.
We have installed a special DjVu server with an efficient DjVu command-line tool. The conversion is very fast, the DjVu images are very slim, and they can be handled easily in both browsers. This conversion option has also been integrated into our Digital Library interface. The short delay at the start of the transfer is negligible, because it is easily outweighed by the very short time needed for the file transfer itself.
If there is a delay in image delivery from the Digital Library, it is rather a matter of the slower operation of the robotic mass storage library when the image is not pre-cached on the fast disk arrays and has to be requested and retrieved from magnetic tape.
The user has several options for the on-the-fly DjVu conversion: the full DjVu philosophy in the typical option, or one of the two compressors switched off in the photorealistic or bitonal options. The background quality factor can even be set.
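Scripted on the server side, the conversion step might look roughly like the sketch below; "djvu_encode" and its options are only placeholders for whichever DjVu command-line encoder is installed, since the concrete tool and its syntax differ from product to product.

```python
# A sketch of the on-the-fly conversion step; "djvu_encode" and its options
# are placeholders, not a real command-line syntax.
import subprocess

def convert_to_djvu(source_path: str, target_path: str, mode: str = "typical") -> None:
    """Convert one delivered image to DjVu before sending it to the user.

    mode mirrors the options offered in the interface: 'typical' (full
    foreground/background separation), 'photorealistic' (wavelet only),
    or 'bitonal' (1-bit compressor only).
    """
    command = ["djvu_encode", f"--mode={mode}", source_path, target_path]  # placeholder syntax
    subprocess.run(command, check=True)

convert_to_djvu("newspaper_page.jpg", "newspaper_page.djvu", mode="typical")
```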
II Metadata
Metadata are added value to the data: they describe and classify the contents, structure the digital document, and bind its components together. They are also needed to provide technical information for easier processing and access.
Our metadata philosophy is based on two alternative ways of accessing the digital documents:
- basic rudimentary access;
- sophisticated access.
Basic rudimentary access must be possible through generally available tools, especially web browsers; therefore the metadata arrangement follows the basic capability of web browsers: HTML formatting.
As HTML does not allow the mark-up and description of contents, it has been supplemented with additional elements that make this possible. Thus the SGML-based DOBM language and approach were developed to handle metadata management in our digitization programmes. This combines the direct readability of HTML with content mark-up of the relevant objects.
The file structure of the digital document is tree-like: the main levels of the manuscript description are the entire book and the individual pages, while for periodicals the tree is richer, going from the title as a whole down to individual pages and even articles.
The metadata tree is a gateway through which the user gains access to the desired data files, which are referenced from it. In general, the data files can be image, text, sound, or video files. They are in principle external to the metadata container.
The content-oriented mark-up is based on minimal cataloguing requirements as well as on good practice in describing various objects in the documents, such as illuminations, musical notation, incipits, etc.
The language was developed in 1996, and in the same year the mandatory content categories were also set up. Since then, however, more consistent approaches have been agreed in various communities in Europe, for example the definition of a manuscript catalogue record in SGML in the MASTER project, or the electronic document format for digitized periodicals in the DIEPER project.
Even if these projects have goals different from ours, we consider it necessary to co-operate with them and to share our data, in some cases even as by-products. We have participated in MASTER and have therefore decided to adopt its bibliographic record definition for the bibliographic part of our document format for manuscripts. The necessary tools are now being developed to bridge the two programmes. Our aim is that what we do for digitization will be used for MASTER, and what we have done for MASTER will be used as a component part of the description of digitized manuscripts.
With DIEPER the situation is different: both sides have a document format, but the aim of DIEPER is to provide access above all to scholarly journal articles, while the aim of our programme is to preserve and provide access to endangered and rare historical periodicals. To share our digitized periodicals with DIEPER, we need to deliver their metadata descriptions in the DIEPER format, which we plan to do as soon as the DIEPER structure is declared final.
It is envisaged that the central DIEPER service will read our input and that of other libraries and act as a uniform gateway to digitized periodicals in the co-operating institutions. When users mark what they are interested in and request access to it, be it an article or an annual set, the central resolution service will redirect them to the on-line service where the document is stored. Thanks to the envisaged unique identifiers of the component parts of periodicals, access to the requested part will be provided directly from the source library if it is free, or the user will have to pass through the prescribed entrance procedures of the particular provider.
II.1 Formal and content rules of the metadata description
II.1.1 Contents
There are various points of view from which contents can be described, but the described objects are always given certain attributes whose values are then classified and identified in concrete situations. Different people may consider different things important, even for the same object. Practical work in various areas, however, has led to a certain standardization. Thus we have, for example, cataloguing rules; they say which elements a catalogue description should consist of, which attributes these elements can have, and how they can be identified. Nevertheless, more than one set of cataloguing rules can be defined.
If, for example, two different approaches are used to describe the same element of content, the descriptions will not be comparable with each other, and no system built on both will be efficient. Thus a very important precondition for any metadata description is the application of broadly accepted content-description rules, because the value of our work can be enhanced through co-operation, and co-operation is always based on the same or a similar understanding of things.
Although this is evident, practice is not so easy, because various institutions build even their new approaches on their own traditions in order to maintain the internal compatibility of their tools. In this framework, starting anything new, for digitization for example, may mean reshaping a great deal of other work and procedures within the institution. Very frequently, co-operation means additional effort. Even this effort, however, is easier if well-defined content-description rules are respected, because various compatibility bridges have already been built between the major approaches.
In other words, it is important to understand the same things under the same category of objects. This is the most important requirement for any description. If, for example, we all agree on the definition of the author of a document, then all the names put into this category will be comparable, provided their writing also follows the same rules, which is not always the case.
Even if we agree on this precondition, it is not certain that we will be able to communicate, because our systems for saying, that is for marking up, that an object is this or that can be rather different. Broadly speaking, however, we should be able to export a standardized output from them regardless of the tools they use to mark up the described objects.
II.1.2 Formal framework
It is evident that at least the communicated metadata output or files should be static and well structured to allow easier exchange of information. We can again look to library cataloguing for a model: if libraries are able to export their records into MARC or UNIMARC, they can exchange them, provided of course that they have followed comparable rules for identifying the described objects.
The so-called electronic formats for digitized documents are much more complex, so the danger of failed communication is also much higher. Furthermore, there is almost no tradition: the formats are being created now on the basis of available standards, rules, and good practices. More than in traditional library work, the de facto standardization of various good practices takes place in individual applications. In many cases there is a lack of even formal common denominators.
An exchange format, and a storage format too, must be built on clearly shaped systems: they must allow very good structuring of very complex documents and the description of various kinds of objects. It could even be said that they should allow the description of almost everything that can be encountered in documents at any of their structural levels.
Such a platform on which to do these things exists: it is called SGML. The question, however, is what to build on it so as to be able to communicate and, at the same time, to use existing access tools. By writing our enhanced HTML, called DOBM, we solved the situation for a long time to come. A much bigger problem for us today is accepting various modifications of the content description than making a quick change in the syntax of the description environment.
The emergence and importance of XML is also being taken into consideration, and our electronic document formats are going to be redefined in this syntax. At the same time, the sets of description elements will be enlarged, following the requirements of the programmes with which we intend to co-operate as well as our own new requirements arising from research into, for example, other methods of calibrating scanning devices.
The main components of the metadata description of a digitized manuscript today are: the bibliographic description of the whole document, the descriptions of individual pages, and the technical description. The new XML DTD will adopt the MASTER bibliographic record DTD for the bibliographic description, while the mandatory elements for the description of individual pages will be preserved. The technical description will be enlarged to allow better reproduction of the original. XML will first be applied to manuscripts and only afterwards to periodicals.
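Purely as an illustration of the direction, and not as the actual DOBM or MASTER DTD, such an XML description of a digitized manuscript could be assembled along these lines; all element names and values here are invented for the example.

```python
# A purely illustrative sketch (not the actual DOBM or MASTER DTD) of an
# XML description of a digitized manuscript: bibliographic part, per-page
# descriptions, and technical part. All names and values are invented.
import xml.etree.ElementTree as ET

document = ET.Element("digitized-manuscript", id="example-shelf-mark")

biblio = ET.SubElement(document, "bibliographic-description")
ET.SubElement(biblio, "title").text = "Example title"
ET.SubElement(biblio, "repository").text = "National Library of the Czech Republic"

pages = ET.SubElement(document, "pages")
page = ET.SubElement(pages, "page", number="1r")
ET.SubElement(page, "image", quality="user", href="page_0001_user.jpg")
ET.SubElement(page, "note").text = "illumination in the upper margin"

technical = ET.SubElement(document, "technical-description")
ET.SubElement(technical, "scanner").text = "digital camera, 6000 x 8000 pixels"

ET.ElementTree(document).write("manuscript_metadata.xml", encoding="utf-8", xml_declaration=True)
```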
III Access to digitized documents
The user may require two types of access: local access to documents or their parts stored on compact discs or other off-line media, and Internet access. Both types of access should be built on the same digital document. Users may also require additional services, for example printing, delivery of certain parts of large documents on CD, etc.
Sophisticated access in both environments is built on enhanced use of the metadata descriptions. To a certain extent these must be based, together with our entire digitization approach, on the foreseeable requirements of potential users.
Nowadays all our documents are structured according to our DOBM SGML language, but their structures are dissolved in the Digital Library environment, where the documents are indexed to make them searchable; the indexed metadata and the graphic data are then stored separately. They can, however, be exported into static structures at any time.
The interface of the Digital Library is the AIP SAFE system, developed by Albertina icome Praha Ltd. It is a document delivery system that provides, in particular, sophisticated document delivery from digital archives, the user interface to the digital library, control of the production workflow of the digital documents, and off-line and on-line enhancement of metadata descriptions.
Fig. 1 - Functional scheme of the Digital Library (producer and user connected through the AIP SAFE document delivery system, RetrievalWare, the DjVu server, the MrSID image server, and the SAM-FS Sun Solaris server with fibre channel disks and ADIC Scalar 1000 AIT magnetic tape mass storage)
AIP SAFE integrates with web browsers and allows the use of various plug-ins for easier data handling. Its heart is the AiPServer, which consists of the following logical entities:
- an SQL server connected through TCP/IP to the system database;
- the Sirius web server, which creates the interface between external users, web servers, and the AIP SAFE system;
- a Storage Server, which enables the storage of data in large robotic libraries;
- system applications for administration and other tools.
Whereas some years ago we could hardly imagine access to manuscripts on the web in particular, today this is a reality requested by users. We are now in the middle of a testing period covering all components of the Digital Library. During this period the interface will be optimized, along with some other tools, including those for administration.
The heart of the Digital Library is a Sun server with the Storage and Archive Manager File System (SAM-FS) installed. It works with mass storage capacity on magnetic tapes (ADIC Scalar 1000 AIT) and with fibre channel disk arrays for the storage of indexed metadata and for pre-caching data from the mass storage device.
The Digital Library is fed by the two digitization programmes described below. Three special servers or added services are foreseen for the greater comfort of the user:
- the DjVu server, as explained above, to reduce the volume of transferred image data;
- the MrSID server for access to larger source files, e.g. images of maps;
- RetrievalWare (Excalibur) for searching texts read by OCR.
As of June 2001, the DjVu server is fully deployed and integrated into the Digital Library, while the integration of the other two services is still under development. It should also be mentioned that OCR will be applied to older documents and is foreseen rather as a hidden source for searching.
IV National Framework
Our digitization activities have been developed in close relationship with the UNESCO Memory of the World programme. They have led to the establishment of two separate programmes:
- Memoriae Mundi Series Bohemica, for digital access to rare library materials, especially manuscripts and old printed books, in routine operation since 1996;
- Kramerius, for the preservation microfilming of acid-paper materials and the digitization of microfilm, in routine operation since 1999.
The volume of digital data available in the programmes is today about 230,000 mostly manuscript pages and about 200,000 pages of endangered periodicals. This volume has depended on the funding that could be allocated for digitization.
For many years there was no systematic funding in this area; only in 2000 did the libraries succeed in building national frameworks for various activities such as retrospective conversion, the union catalogue, the digital library, and also the above-mentioned programmes. These now exist as component parts of the Public Information Services of Libraries programme, which is in turn a part of the national information policy.
Czech institutions can now apply under the Calls for Proposals launched by the Ministry of Culture to receive support for digitization. They must respect the standards of the National Library and agree to provide access to the digitized documents through the Digital Library. The interest of various types of institutions now seems to be growing; they include libraries, but also museums, archives, and various church institutions.
It is expected that, thanks to these national sub-programmes, the volume of digitized pages will grow by about 80,000-100,000 pages of manuscripts and old printed books and about 200,000 pages of periodicals annually.
V Contents

I DATA
I.1 The digital image
I.1.1 Resolution
I.1.2 Colour depth
I.2 The production processing of the digital image
I.2.1 Compression
I.2.2 Graphic format
I.2.2.1 Vector graphics
I.2.2.2 Raster graphics (bitmaps)
I.2.2.2.1 Traditional solutions
I.2.2.2.2 Emerging solutions
I.3 The delivery of the digital image
I.3.1 The preparation of the source image
I.3.2 The delivery of the source image
I.3.3 The dynamic post-processing of the source image for delivery
II METADATA
II.1 Formal and content rules of the metadata description
II.1.1 Contents
II.1.2 Formal framework
III ACCESS TO DIGITIZED DOCUMENTS
IV NATIONAL FRAMEWORK
V CONTENTS