Class 8

Digital Reformatting of Text
Aaron Choate
Digital Library Production Services
The University of Texas Libraries
From last time:

Calculating potential file size
(no really… this time we got it!)
file size = (height x width x bit-depth x dpi²) / 8 bits per byte
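A quick sketch of the file size arithmetic, assuming height and width are the page dimensions in inches and the image is stored uncompressed. The function name is made up for illustration.

```python
def file_size_bytes(height_in, width_in, bit_depth, dpi):
    """Uncompressed image size in bytes (height and width assumed to be in inches)."""
    pixels = (height_in * dpi) * (width_in * dpi)  # height x width x dpi^2
    bits = pixels * bit_depth                      # bits needed for every pixel
    return bits / 8                                # 8 bits per byte


# An 8.5 x 11 inch page scanned bitonal (1 bit) at 600 dpi
size = file_size_bytes(11, 8.5, 1, 600)
print(f"{size:,.0f} bytes")  # 4,207,500 bytes

# The same page in 24-bit color at 300 dpi
print(f"{file_size_bytes(11, 8.5, 24, 300):,.0f} bytes")
```

Compression (for example Group 4 for bitonal TIFFs) will reduce the stored size, so this is an upper bound rather than what ends up on disk.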
imaging
Benchmarking

Subjective evaluation becomes more
problematic when the goal is legibility
rather than fidelity.

Physical Type, size and presentation

Physical condition
• Darkening pages
• Fading ink
• Stains
• Bleed-through
• Uneven printing
• Fold lines
• Smearing

Document classification
• Simple text / printed line art
  • Distinct-edge-based representation
  • Bitonal?
• Manuscripts
  • Soft-edge-based
  • Grayscale / color
• Mixed material

Medium and support
• Support (paper, clay tablet, etc.)
  • Thin paper? (bleed-through)
• Medium (graphite pencil, inks, etc.)
  • Fading of ink
  • Variations in color or density

Tonal Representation

Color Appearance
• Is color reproduction necessary to the document’s meaning?
• What purpose does the color serve?
• How important is maintaining the color appearance?

Detail
• Printed text
  • Measure the height of the smallest lowercase letter that typifies the item or group of items.
• Manuscripts, line art
  • Measure the finest stroke width that must be represented and characterize the needed level of quality.

QI…(Quality Index)
• Defining detail as character height
• ANSI/AIIM preservation microfilming standard for determining requirements for text legibility
• Defines a range from barely legible through excellent that maps to technical test targets

Line pairs
Excellent = 8 line pairs
Good = 5 line pairs
Marginal = 3.6 line pairs
Barely legible = 3.0 line pairs
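The scale above can be turned into a small lookup. This is only a sketch: treating each value as the lower bound of its band is an assumption, and the function name is hypothetical.

```python
# Values copied from the slide; using them as lower bounds is an assumption.
QI_LEVELS = [
    (8.0, "excellent"),
    (5.0, "good"),
    (3.6, "marginal"),
    (3.0, "barely legible"),
]


def qi_rating(qi):
    """Return the legibility label for a QI value, checking the highest band first."""
    for threshold, label in QI_LEVELS:
        if qi >= threshold:
            return label
    return "below barely legible"


for value in (8.2, 5.0, 3.7, 3.0, 2.5):
    print(value, "->", qi_rating(value))
```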
Digital QI
• Bitonal (black or white pixels only)
QI = (dpi x 0.039h) / 3
h = 3QI / (0.039 x dpi)
dpi = 3QI / (0.039h)

• Tonal images (grayscale for printed text)
QI = (dpi x 0.039h) / 2
h = 2QI / (0.039 x dpi)
dpi = 2QI / (0.039h)
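A minimal sketch of the digital QI formulas, using the 0.039 factor and the divisors 3 (bitonal) and 2 (tonal). It assumes h is the character height in millimetres, which the slide does not state; 0.039 is roughly the number of inches in a millimetre. Function names are illustrative only.

```python
MM_TO_INCH = 0.039  # conversion factor from the QI formulas


def _divisor(bitonal):
    # 3 for bitonal capture, 2 for tonal (grayscale) capture
    return 3 if bitonal else 2


def digital_qi(dpi, h_mm, bitonal=True):
    """QI achieved at a given resolution and character height."""
    return (dpi * MM_TO_INCH * h_mm) / _divisor(bitonal)


def required_dpi(qi, h_mm, bitonal=True):
    """Resolution needed to reach a target QI for a given character height."""
    return (_divisor(bitonal) * qi) / (MM_TO_INCH * h_mm)


def smallest_height_mm(qi, dpi, bitonal=True):
    """Smallest character height that reaches a target QI at a given resolution."""
    return (_divisor(bitonal) * qi) / (MM_TO_INCH * dpi)


# Resolution needed for QI 8 on 1 mm characters
print(round(required_dpi(8, 1.0, bitonal=True)))   # bitonal
print(round(required_dpi(8, 1.0, bitonal=False)))  # grayscale
```

The grayscale divisor is smaller, so the same QI is reached at a lower resolution than with bitonal capture.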
Text Capture

Methods

Accuracy …
• Rekeying
• OCR
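One common way to put a number on accuracy is to compare OCR output with a rekeyed (trusted) transcription and count character errors. The sketch below uses plain edit distance; this is an assumed measure for illustration, not necessarily the one intended on the slide, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_text):
    """Share of ground-truth characters that did not need correcting (floored at 0)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1 - errors / len(ground_truth))


rekeyed = "Digital Reformatting"
ocr = "Digita1 Reforrnatting"  # typical OCR confusions: l -> 1, m -> rn
print(f"{character_accuracy(rekeyed, ocr):.0%}")
```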
Software




ScanSoft - OmniPage Pro
ABBYY - FineReader
Adobe Acrobat …
PrimeOCR – Prime Recognition
Encoding
XML vs SGML



• SGML (Standard Generalized Markup Language) is the grand-daddy of all markup languages.
• XML is a subset of SGML intended as the format for use on the Internet.
• XML attempts to fill the gap between SGML, which can be used for just about anything, and HTML, which is severely limited and is currently being abused because of this (table structures for layout, clear 1-pixel GIFs, etc.).
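One practical consequence of XML's stricter rules is that any generic parser can check whether a document is well-formed (every element closed and properly nested). A small sketch using Python's standard library; the sample markup is invented.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<letter><p>Dear <name>Reader</name>,</p></letter>"
sloppy = "<letter><p>Dear <name>Reader,</p></letter>"  # <name> is never closed

print(is_well_formed(good))
print(is_well_formed(sloppy))
```

This checks well-formedness only. Checking a document against a DTD or schema is a separate validation step that the standard library parser does not perform.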
xml
DTDs vs Schemas
TEI

Text Encoding Initiative
• Initially launched in 1987, the TEI is an
international and interdisciplinary standard
that helps libraries, museums, publishers, and
individual scholars represent all kinds of
literary and linguistic texts for online research
and teaching, using an encoding scheme that
is maximally expressive and minimally
obsolescent.
TEI

Levels of encoding
• Level 1: Fully Automated Conversion and Encoding
• Level 2: Minimal Encoding
• Level 3: Simple Analysis
• Level 4: Basic Content Analysis
• Level 5: Scholarly Encoding Projects
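To give a feel for the low end of this scale, here is a rough sketch that wraps page text (for example, raw OCR output) in a bare-bones TEI document with page breaks. It assumes TEI P5 element names and namespace, which may differ from the TEI version the levels were written against; `pages_to_tei` and the header text are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # TEI P5 namespace (assumed version)


def pages_to_tei(title, pages):
    """Wrap a list of page texts in a minimal TEI document, one <pb/> and <p> per page."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

    def el(parent, tag, text=None, **attrs):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrs)
        node.text = text
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Sample for illustration.")
    el(el(file_desc, "sourceDesc"), "p", "Converted from page images.")

    body = el(el(root, "text"), "body")
    for number, page_text in enumerate(pages, start=1):
        el(body, "pb", n=str(number))  # page break marker
        el(body, "p", page_text)       # uncorrected page text
    return ET.tostring(root, encoding="unicode")


print(pages_to_tei("Sample pamphlet", ["Text of page one.", "Text of page two."]))
```

Higher levels add progressively more human analysis (structure, names, editorial markup) on top of a skeleton like this.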
Character sets

Unicode –
Unicode provides a unique number for
every character, no matter what the
platform, no matter what the program, no
matter what the language.
Unicode

Greek & Coptic
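A short illustration of "a unique number for every character", using a few characters from the Greek and Coptic block (U+0370 to U+03FF). The helper names are made up for this example.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def in_greek_and_coptic(ch):
    """True if the character falls in the Greek and Coptic block (U+0370..U+03FF)."""
    return 0x0370 <= ord(ch) <= 0x03FF


for ch in ("\u03b1", "\u03a9", "\u03e3"):  # alpha, Omega, Coptic shei
    print(ch, describe(ch), in_greek_and_coptic(ch))

# The number is the same everywhere; only the byte encoding varies.
print("\u03b1".encode("utf-8"))
```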
Software



XMetaL
oXygen
Cooktop
Software

MetaE
Download