Bring documents to mobile devices – the new ePUB output

Bring documents to mobile devices – the new ePUB output for eBooks
Zoltan Urban, Director of R&D
Nuance Document Imaging Developers Conference 2013
© 2002-2013 Nuance Communications, Inc. All rights reserved. Page 1
Agenda
– E-books
  – Popular formats
  – Issues with PDF and image formats
– New technology in OmniPage Capture SDK 19
  – Workflow
  – Scanning
  – API and settings
Popular e-book formats
– In order of popularity
  – PDF (normal, scanned)
  – ePub
  – KF8 / PRC (Amazon Kindle)
  – DJVU (scans with high compression)
– Usability on mobile devices?
Original PDF on iPhone
– Readable font size, but text must scroll horizontally line by line
– Full text line can be seen, but small font size
ePub created by OmniPage SDK 19
E-book conversion problems
Original e-book as PDF
Line End
E-book conversion problems
Converted to ePub using a free PC application
Line End
E-book conversion problems – Solved!
Converted using OmniPage SDK 19
Line End
E-book conversion problems
Original e-book as PDF
Title
Picture
Caption
E-book conversion problems
Converted by a popular PC application
Title
Picture
Caption is on the next page
E-book conversion problems – Solved!
Converted by OmniPage SDK 19
Title
Line End
Picture
Caption
New technology in OmniPage Capture SDK 19
ePub 3.0
– This is the latest version of the ePub standard
– Supported by iOS
– Almost every ePub reader can display it
  – PC, Mac, Android
– Necessary for fancy floating footnotes (see later)
– Amazon’s Kindle readers can consume it after conversion (send ePub as e-mail to the device)
Technology challenges
– The basis for creating a good-quality flowing representation of a document is to correctly determine its logical units and their properties, such as
  – paragraphs (even if they cross pages)
  – columns
  – tables
  – graphics (non-textual areas)
  – headers, footers, page numbering
  – headings (also needed for table-of-contents generation)
  – footnotes
Workflow for ePub creation
– Setting the new Book mode
– Finding headers and footers
– Finding footnotes and their references
– Finding the headings and their level
– Generating table-of-contents from the headings
– Using the new ePub output converter
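The slides do not say how the SDK finds headers and footers. As an illustration only, a common heuristic is to look for first and last lines that repeat on most pages once digits are masked. The function names and the 60% threshold below are assumptions for this sketch, not part of the OmniPage API.

```python
import re
from collections import Counter


def _normalize(line):
    """Mask digit runs so 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def find_running_lines(pages, min_share=0.6):
    """Return (headers, footers): normalized first/last lines that recur
    on at least `min_share` of the pages. Each page is a list of lines."""
    first = Counter(_normalize(p[0]) for p in pages if p)
    last = Counter(_normalize(p[-1]) for p in pages if p)
    threshold = min_share * len(pages)
    headers = {text for text, n in first.items() if n >= threshold}
    footers = {text for text, n in last.items() if n >= threshold}
    return headers, footers


def strip_running_lines(pages):
    """Drop detected running headers and footers from every page."""
    headers, footers = find_running_lines(pages)
    cleaned = []
    for page in pages:
        body = list(page)
        if body and _normalize(body[0]) in headers:
            body = body[1:]
        if body and _normalize(body[-1]) in footers:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

A real detector would also use position and font information from the page layout; this sketch works on text lines only.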
Book mode
– A new approach to determining the logical units of the document
  – Switches to flowing mode as opposed to page layout fidelity
    – The latter is the usual default for text outputs
  – Paragraph line spacing is normalized rather than precisely defined
    – Single, 1.5x, double
  – Finds paragraphs crossing pages and links the parts together
  – Finds footnotes and their references
– Can be selected using a new setting
  – ON: set "Kernel.Processing.Mode" to PROCESSING_BOOK
  – OFF: set "Kernel.Processing.Mode" to PROCESSING_NORMAL
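To make the line spacing normalization concrete, here is a minimal sketch that snaps a measured spacing to single, 1.5x or double. The slides give only the three target values; the function name and the assumption that single spacing is about 1.2 times the font size are illustrative and not taken from the SDK.

```python
ALLOWED_SPACINGS = (1.0, 1.5, 2.0)  # single, 1.5x, double (from the slide)

# Assumption for this sketch: single-spaced text has a baseline-to-baseline
# distance of roughly 1.2 times the font size.
SINGLE_SPACING_FACTOR = 1.2


def normalize_line_spacing(baseline_distance_pt, font_size_pt):
    """Snap a measured baseline distance to the nearest allowed spacing."""
    if font_size_pt <= 0:
        raise ValueError("font size must be positive")
    ratio = baseline_distance_pt / (font_size_pt * SINGLE_SPACING_FACTOR)
    return min(ALLOWED_SPACINGS, key=lambda spacing: abs(spacing - ratio))
```

Snapping to a few values hides small measurement differences between pages, so paragraphs that were typeset identically come out identical in the flowing output.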
ePub-specific formatting
– Multi-columns cannot be used
  – Tables can be utilized for placing objects side by side, like text and graphics
– Font size scaling is relative by default (no absolute point sizes)
  – Detected sizes of 9-13 pt map to 100%; sizes are expressed as 50%, 75%, 100%, 133%, 150% or 200%. This makes the output more uniform
– Uses special properties to control the formatting details
  – CharFont, CharSize, CharStyle, ParAlignment, ParIndent, ParFirstIndent, ParLinespacing, ParSpacing, FootNotes, Cross-references, Bullets/Numbering, LineBreaks.
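The relative font scaling can be sketched as a mapping from a detected point size to one of the listed percentages. The 9-13 pt range and the six steps come from the slide. How sizes outside that range are assigned to a step is not stated, so the 11 pt reference size and the nearest-step rule below are assumptions.

```python
SCALE_STEPS = (50, 75, 100, 133, 150, 200)  # percentages listed on the slide
BODY_RANGE_PT = (9.0, 13.0)                 # detected sizes that map to 100%
REFERENCE_PT = 11.0                         # assumption: midpoint of the body range


def relative_font_size(detected_pt):
    """Map a detected point size to a relative size in percent."""
    low, high = BODY_RANGE_PT
    if low <= detected_pt <= high:
        return 100
    percent = detected_pt / REFERENCE_PT * 100
    # Assumption: pick the listed step closest to the raw ratio.
    return min(SCALE_STEPS, key=lambda step: abs(step - percent))
```

For example, `relative_font_size(16)` yields 150 under these assumptions, which a stylesheet could emit as `font-size: 150%`.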
ePub-specific formatting (continued)
– Do not assume specific dimensions for the target device
  – Use relative scaling where possible, e.g. for column widths, indents and picture sizes
  – Dimensions are relative to each other or to the page size, e.g. table columns are defined as percentages of the full width
– Do not specify the right margin in left-aligned paragraphs or vice versa
– Do not specify horizontal or vertical cell alignment
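As an illustration of defining table columns as percentages of the full width, the sketch below converts measured column widths (in any unit) to whole percentages that sum to 100 and renders them as an HTML `colgroup`. The function names and the rounding rule are assumptions for this example; the slides do not show the markup the converter emits.

```python
def column_widths_as_percent(widths):
    """Convert absolute column widths to whole percentages summing to 100."""
    total = sum(widths)
    if total <= 0:
        raise ValueError("total width must be positive")
    percents = [round(w * 100 / total) for w in widths]
    # Put any rounding drift on the widest column so the sum stays 100.
    drift = 100 - sum(percents)
    widest = max(range(len(percents)), key=percents.__getitem__)
    percents[widest] += drift
    return percents


def table_colgroup(widths):
    """Render a <colgroup> with relative widths only (no absolute sizes)."""
    cols = "".join(
        f'<col style="width: {p}%"/>' for p in column_widths_as_percent(widths)
    )
    return f"<colgroup>{cols}</colgroup>"
```

Because only ratios survive, the same table adapts to a phone or a tablet screen without horizontal scrolling.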
Table sample
Original PDF type e-book
Table sample
Converted using a common free PC application
Text is laid out as individual paragraphs for each line of each cell, no table structure
Table sample
Converted using OmniPage SDK 19
Text is laid out in a table, cells contain multiple lines
Table sample
Table and column widths are relative to the page width
Handling of footnotes
Original PDF type e-book
Footnote flags
Footnotes
Handling of footnotes
Converted by OmniPage SDK 19
Footnote flags with link
Footnotes
Handling of footnotes
In iBooks on iPhone
Handling of footnotes
In iBooks on iPad
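For reference, linked footnotes in ePub 3 are usually marked up with the `epub:type` values `noteref` and `footnote`, which reading systems may display as pop-ups. The sketch below builds that standard markup. The helper names are hypothetical, and the slides do not state that the SDK emits exactly these elements.

```python
from html import escape


def footnote_reference(note_id, label):
    """Footnote flag in the running text, linked to the note body."""
    return f'<a epub:type="noteref" href="#{escape(note_id)}">{escape(label)}</a>'


def footnote_body(note_id, text):
    """Note body that a reader can show as a pop-up or at the end of a chapter."""
    return (
        f'<aside epub:type="footnote" id="{escape(note_id)}">'
        f"<p>{escape(text)}</p></aside>"
    )
```

The enclosing XHTML file must declare the `epub` namespace prefix (`xmlns:epub="http://www.idpf.org/2007/ops"`) for these attributes to be valid.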
Dual table-of-contents (TOC)
– The existing table of contents cannot be used
  – Its entries are not links in scanned documents
  – The page numbers do not apply in ePub readers
– We need to create two TOCs – ePub2 and ePub3
  – Necessary for older readers
– We analyze paragraph properties and assign heading levels (h1, h2) to them
  – The heading levels control the generation of the table-of-contents
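A minimal sketch of how one list of detected headings can drive both tables of contents: an ePub 3 navigation document (`nav` element) and an ePub 2 NCX `navMap`. The input format, a list of `(level, title, href)` tuples, and all function names are assumptions for illustration. Only the element names come from the ePub specifications.

```python
from html import escape


def build_tree(headings):
    """Nest (level, title, href) tuples: h2 entries go under the preceding h1."""
    tree = []
    for level, title, href in headings:
        node = {"title": title, "href": href, "children": []}
        if level >= 2 and tree:
            tree[-1]["children"].append(node)
        else:
            tree.append(node)
    return tree


def epub3_nav(tree):
    """Render the ePub 3 TOC as a nested ordered list inside <nav>."""
    def render(nodes):
        items = []
        for n in nodes:
            child = render(n["children"]) if n["children"] else ""
            items.append(
                f'<li><a href="{escape(n["href"])}">{escape(n["title"])}</a>{child}</li>'
            )
        return "<ol>" + "".join(items) + "</ol>"
    return f'<nav epub:type="toc" id="toc">{render(tree)}</nav>'


def epub2_ncx_navmap(tree):
    """Render the ePub 2 TOC as an NCX navMap with sequential playOrder."""
    counter = 0

    def render(nodes):
        nonlocal counter
        out = []
        for n in nodes:
            counter += 1
            order = counter  # parent is numbered before its children
            children = render(n["children"])
            out.append(
                f'<navPoint id="nav{order}" playOrder="{order}">'
                f'<navLabel><text>{escape(n["title"])}</text></navLabel>'
                f'<content src="{escape(n["href"])}"/>{children}</navPoint>'
            )
        return "".join(out)

    return f"<navMap>{render(tree)}</navMap>"
```

Both renderers read the same tree, so the two TOCs cannot drift apart.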
Table-of-contents
Original PDF type e-book
Table-of-contents
Converted to ePub using a free PC application
Generated TOC
Original text of the TOC
Table-of-contents
Converted to ePub using OmniPage SDK 19
Generated TOC
Original text of the TOC
ePub output converter – file splitting
– ePub files are zipped packages
  – The actual text of the book is contained in several HTML files
  – We impose a limit of 256 kB
    – Some readers have memory problems with larger files
    – It speeds up the loading of the text
  – If we detect chapter headings we try to break the text there
    – But we don’t go below 8 kB (except for the end of document)
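The splitting rules can be sketched as a single pass over the text blocks. The 256 kB and 8 kB limits come from the slide. The input format, a list of `(html_text, is_chapter_heading)` pairs, the function name, and the exact order in which the rules are checked are assumptions, since the slides do not describe the converter's actual algorithm.

```python
MAX_BYTES = 256 * 1024  # upper limit per HTML file (from the slide)
MIN_BYTES = 8 * 1024    # do not break at a heading below this size


def split_into_files(blocks, max_bytes=MAX_BYTES, min_bytes=MIN_BYTES):
    """Group (html_text, is_chapter_heading) blocks into per-file lists."""
    files, current, size = [], [], 0
    for text, is_chapter_heading in blocks:
        block_size = len(text.encode("utf-8"))
        # Preferred break: a chapter heading, once the file is big enough.
        at_heading = is_chapter_heading and bool(current) and size >= min_bytes
        # Forced break: adding this block would exceed the size limit.
        too_big = bool(current) and size + block_size > max_bytes
        if at_heading or too_big:
            files.append(current)
            current, size = [], 0
        current.append(text)
        size += block_size
    if current:
        files.append(current)  # the last file may be smaller than min_bytes
    return files
```

A single block larger than the limit still ends up alone in one file here; a production converter would need to split inside such a block.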
Improvements to help document scanning
– Improved handling of facing pages (“open book”)
– Camera input
  – Auto detection of the source
    – Using EXIF information in JPEG files
    – Can be overridden with a new camera flag setting
  – Resolution estimation for optimal text size
    – Resolution enhancement for small characters
    – Subsampling for zoomed-in text
  – Improved binarization
    – Method selected automatically
    – Correction for uneven lighting
Processing image only PDF
Note
• the facing pages
• the headers and footers
• the paragraph split by the page break
API and settings
– Setting of the book mode
– Three ePub output converter modes
  – Regular
  – Simple
  – Poem
ePub Simple mode
– Recommended for lower quality scans or for documents with a very simple structure
– Does not try to preserve too much formatting, so the output is even more portable between readers
– Its output format name is "Converters.Text.ePubSimple"
– Internally defaults to the following
  – Converters.Text.ePubSimple.CharFont = FALSE
  – Converters.Text.ePubSimple.CharSize = FALSE
  – Converters.Text.ePubSimple.CharStyle = FALSE
  – Converters.Text.ePubSimple.ParAlignment = FALSE
  – Converters.Text.ePubSimple.ParIndent = FALSE
  – Converters.Text.ePubSimple.ParLinespacing = FALSE
  – Converters.Text.ePubSimple.ParSpacing = FALSE
ePub Poem mode
– Recommended for poems, where the line endings are significant
– Its output format name is "Converters.Text.ePubPoem"
– Internally defaults to the following
  – Converters.Text.ePubPoem.LineBreaks = TRUE
ePub Poem mode
Original e-book as PDF
ePub Poem mode
Converted to ePub using OmniPage SDK 19
ePub Regular mode
– For good quality input (like normal PDFs) this delivers the best fidelity
– Especially useful if the formatting is complex with varying font sizes, indentations etc.
– Its output format name is "Converters.Text.ePub"
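The recommendations for the three modes can be summarized as a small selection helper. The output format names and the default setting values are the ones listed on the slides. The function, its parameters and the decision order are hypothetical, and the sketch deliberately stops short of calling the SDK, since the slides do not show the calls used to apply a setting.

```python
EPUB_REGULAR = "Converters.Text.ePub"
EPUB_SIMPLE = "Converters.Text.ePubSimple"
EPUB_POEM = "Converters.Text.ePubPoem"

# Internal defaults as listed on the Simple and Poem mode slides.
SIMPLE_DEFAULTS = {
    f"{EPUB_SIMPLE}.{name}": False
    for name in (
        "CharFont", "CharSize", "CharStyle", "ParAlignment",
        "ParIndent", "ParLinespacing", "ParSpacing",
    )
}
POEM_DEFAULTS = {f"{EPUB_POEM}.LineBreaks": True}


def choose_epub_output(is_poem=False, low_quality_scan=False, simple_structure=False):
    """Return (output format name, settings that mode defaults to).

    Hypothetical helper: the decision order is an assumption of this sketch.
    """
    if is_poem:
        return EPUB_POEM, dict(POEM_DEFAULTS)
    if low_quality_scan or simple_structure:
        return EPUB_SIMPLE, dict(SIMPLE_DEFAULTS)
    return EPUB_REGULAR, {}
```

For a clean, born-digital PDF, `choose_epub_output()` returns the Regular converter name with no overrides.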
Thank you