Making Portable Document Format (PDF) Files Work for You

Tuomas S. Kostiainen
and
Jill R. Sommer
ATA Conference 2009
PDF File Basics

What are PDF files and why do we use them?
– Stands for "Portable Document Format." PDF is a multiplatform file format developed by Adobe Systems. A PDF file captures document text, fonts, images, and even formatting of documents from a variety of applications. You can e-mail a PDF document to your friend and it will look the same way on his screen as it looks on yours, even if he has a Mac and you have a PC. Since PDFs contain color-accurate information, they should also print the same way they look on your screen. (Source: TechTerms.com)
PDF File Basics

Where do we as translators encounter PDF files?
– Translation projects: source text, proofreading, reference
– Forms: registration forms, IRS W-9
– Creating PDFs: resume, invoices, file sharing, printing/publishing
Problems associated with PDF files
– Rigid (not meant to be editable)
– Converting from PDF > DOC etc.
Adobe Acrobat

Versions: Adobe Reader, Adobe Acrobat Standard, Adobe Acrobat Pro, Adobe Acrobat Pro Extended
Product comparison: www.adobe.com/products/acrobat/matrix.html
Compatibility with earlier versions
– Update to the current version (9)
Should be part of your toolbox
Other “Comparable” Products

Nitro PDF (Express, Professional, and free version): www.nitropdf.com
Foxit PDF Tools (Reader, Editor, Creator, Phantom, etc.): www.foxitsoftware.com/pdf/
Solid PDF Tools: www.soliddocuments.com
DocuCom PDF Gold: www.pdfwizard.com
Pdf995 Suite: www.pdf995.com
Others
Working with PDF Files
1. Editing and Commenting PDF Files
2. Searching for Text in PDF Files
3. Creating and Filling Electronic Forms
4. Using Electronic Signatures
5. Creating TMs from PDF Files Using LogiTerm AlignFactory
Editing and Commenting PDF Files

Using Commenting Tools in a normal review cycle (don’t use only sticky note comments)
– Available in Reader only if the PDF file author has enabled the document for commenting using Acrobat Pro (Advanced > Extend Features in Adobe Reader)
Comment & Markup tools (Tools > Comment & Markup / Tools > Customize Toolbars)
– Text Edits, Highlight Text, Callout, Arrow, Rectangle, etc.
– Show/Hide Comments
– Comments List (View > Navigation Panels > Comments)
– Spell checking of notes (Edit > Check Spelling)
Editing and Commenting PDF Files

Touching up text (changing text and text properties)
– Tools > Advanced Editing / Advanced Editing toolbar
– TouchUp Text, Crop
Typewriter tool
– Available in Reader only if the PDF author has enabled it
Inserting/extracting/rearranging pages
– Document > Insert/Extract/Replace/Delete Pages, or use the Pages navigation pane on the left
E-mail-based review or shared review (on acrobat.com) for multi-party reviews
Searching for Text in PDF Files

Edit > Find: text within the current document
Edit > Search: text in one or more files
Indexing (only in Professional): possibility to index hundreds of files for quick searching
– Select Advanced > Document Processing > Full Text Index with Catalog > New Index
– Name the index, select directories to be included, click Build and specify location for the index file
– Use the resulting .pdx file for searching
Creating a Searchable Image
– With the image file open in Acrobat, select Document > OCR Text Recognition > Recognize Text Using OCR
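
The indexing idea above can be illustrated outside Acrobat. The following is a minimal sketch, not Acrobat's Catalog feature or its .pdx format: it assumes the text of each PDF has already been exported to a plain-text .txt file in one folder, and the function names build_index and search are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path


def build_index(folder):
    """Map each lowercase word to the set of .txt file names that contain it."""
    index = defaultdict(set)
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(path.name)
    return index


def search(index, query):
    """Return the sorted file names that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    # Intersect the file sets of all query words; unknown words give an empty set.
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

Building the index once and querying it many times is what makes searching hundreds of files quick: each search only looks up a few words instead of rereading every file.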
Creating and Filling Electronic Forms

Simple filling with Typewriter tool
Using text boxes
Converting electronic files to forms using Form Wizard
Blueberry PDF Form Filler; FREE application for filling in and printing PDF forms (www.bbconsult.co.uk/Resources/PDFFormFiller.aspx)
– Note: deselect “Lock All Controls” button
Using Electronic Signatures

Inserting a scanned signature
– Copy and paste via clipboard (gets inserted as a “stamp”)
Creating and using a digital ID
– Creating a digital ID: Advanced > Security Settings > Digital IDs > Add ID > “A new digital ID I want to create now” > Next > New PKCS#12 digital ID file > Next. Fill in the information fields as needed and click Next. Select a location for the ID and define a password. Click Finish to return to the Security Settings dialog box. Click Close.
– Signing a PDF document: Advanced > Sign & Certify > Place Signature. Drag a rectangle where you want to place the signature. Choose a digital ID, type the password, choose an appearance and click Sign.
Creating Translation Memories from PDF Files Using LogiTerm AlignFactory

LogiTerm AlignFactory Light (http://www.terminotix.com)
– Quick and easy tool to create TMs from PDF files
Other tools:
– YouAlign by LogiTerm; online tool, FREE (for a limited time), limited selection of languages (www.youalign.com)
– NoBabel AutoAligner by KCSL; online tool, limited selection of languages (http://nobabel.com/)
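
To show what an alignment tool hands over to a translation environment, here is a minimal sketch that writes already-aligned segment pairs to a TMX file. It assumes TMX 1.4 as the exchange format (the slides do not name one), it does not perform the alignment itself, and the function name pairs_to_tmx and the tool name in the header are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_to_tmx(pairs, src_lang="en", tgt_lang="de"):
    """Build a minimal TMX 1.4 document from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        if not source.strip() or not target.strip():
            continue  # skip pairs where one side is empty
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text.strip()
    return ET.tostring(tmx, encoding="unicode")
```

The returned string can be saved with a .tmx extension and imported into a translation environment that accepts TMX; the language codes are parameters because they depend on the project.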
Additional PDF-related links



www.adobe.com/support/
www.planetpdf.com
www.pdfstore.com
Part Two: Creating PDFs and OCR
Reasons translators might need to create a PDF

Résumés
Invoices
Letters of certification
Various protected files
Creating a PDF from Word or Excel

Choose Print, then select a PDF tool (Acrobat Distiller, win2pdf, etc.) as the printer
Select the menu button
Optical Character Recognition


Optical character recognition (or OCR) is the
conversion of images of handwritten, typewritten or
printed text into machine-editable
text.
PDFs can be either straight text or a graphic.
OCR can handle both.
Optical Character Recognition


OCR tools use pattern recognition, artificial
intelligence and computer vision as well as digital
character recognition.
The accurate recognition of Latin-script, typewritten
text is now considered largely a solved problem on
applications where clear imaging is available such as
scanning of printed documents. Some tools can now
easily recognize Cyrillic and Asian characters as
well.
OCR Tools

ABBYY FineReader (http://www.abbyy.com/)
PDF Transformer (by ABBYY)
OmniPage (http://www.nuance.com/imaging/products/omnipage.asp)
Microsoft Office Document Imaging (part of MS Office 2007)
ExperVision (http://www.expervision.com/)
ABBYY FineReader



ABBYY FineReader is the clear favorite among
translators (although PDF Transformer is a
close second), because it creates fewer text
boxes than other OCR programs.
The spellcheck feature helps ensure that the
document you are working on doesn’t have
any spelling errors that would corrupt the TM.
ABBYY supports the most languages (184 at
last count).

Abkhaz, Adyghian, Afrikaans, Agul, Albanian, Altai,
Armenian (Eastern, Western, Grabar), Avar, Aymara,
Azerbaijani (Cyrillic), Azerbaijani (Latin), Bashkir,
Basic, Basque, Belarusian, Bemba, Blackfoot,
Breton, Bugotu, Bulgarian, Buryat, C/C++, COBOL,
Catalan, Cebuano, Chamorro, Chechen, Chinese
Simplified, Chinese Traditional, Chukchee, Chuvash,
Corsican, Crimean Tatar, Croatian, Crow, Czech,
Dakota, Danish, Dargwa, Dungan, Dutch
(Netherlands and Belgium), English, Eskimo
(Cyrillic), Eskimo (Latin), Esperanto, Estonian, Even,
Evenki, Faroese, Fijian, Finnish, Fortran, French,
Frisian, Friulian, Gagauz, Galician, Ganda, German
(Luxemburg), German (new and old spelling), Greek,
Guarani, Hani, Hausa, Hawaiian, Hebrew,
Hungarian, Icelandic, Ido, Indonesian, Ingush,
Interlingua, Irish, Italian, JAVA, Japanese, Jingpo,
Kabardian, Kalmyk, Karachay-balkar, Karakalpak,
Kasub, Kawa, Kazakh, Khakass, Khanty, Kikuyu,
Kirghiz, Kongo, Koryak, Kpelle, Kumyk, Kurdish,
Lak, Latin, Latvian, Lezgi, Lithuanian, Luba,
Macedonian, Malagasy, Malay, Malinke, Maltese,
Mansy, Maori, Mari, Maya, Miao, Minangkabau,
Mohawk, Moldavian, Mongol, Mordvin, Nahuatl,
Nenets, Nivkh, Nogay,
Norwegian (nynorsk and bokmål), Nyanja,
Occidental, Ojibway, Ossetian, Papiamento, Pascal,
Polish, Portuguese (Portugal and Brazil), Provencal,
Quechua, Rhaeto-romanic, Romanian, Romany,
Rundi, Russian, Russian (old spelling), Rwanda,
Sami (Lappish), Samoan, Scottish Gaelic, Selkup,
Serbian (Cyrillic), Serbian (Latin), Shona, Simple
chemical formulas, Slovak, Slovenian, Somali,
Sorbian, Sotho, Spanish, Sunda, Swahili, Swazi,
Swedish, Tabasaran, Tagalog, Tahitian, Tajik, Tatar,
Thai, Tok Pisin, Tongan, Tswana, Tun, Turkish,
Turkmen, Tuvinian, Udmurt, Uighur (Cyrillic), Uighur
(Latin), Ukrainian,
Uzbek (Cyrillic), Uzbek (Latin), Welsh, Wolof, Xhosa,
Yakut, Zapotec, Zulu
Spellchecking



ABBYY supports pre- and post-reform German
orthography, Old German script, programming languages,
and simple chemical formulas.
ABBYY can also sometimes replicate graphics and
logos that you can paste into your file.
The text recognition software includes dictionaries
with spell-checking capabilities for 38 languages,
allowing verification of recognized text directly in the
FineReader Editor.
Potential Problems



Page setups can vary and create
inconsistent margins and page layouts.
OCR tools have problems with handwriting,
bullet lists, check boxes, static from "fuzzy"
fax transmissions, and tables.
Formatting sometimes needs to be cleaned
up (double spaces, text boxes, columns, etc.)
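
Some of the cleanup mentioned above can be scripted once the OCR output is saved as plain text. This is a minimal sketch under that assumption; the function name clean_ocr_text is hypothetical, and the rules are deliberately conservative. Note that rejoining hyphenated line breaks will also merge genuine compounds that happen to break at the hyphen, so the result still needs a human check.

```python
import re


def clean_ocr_text(text):
    """Apply a few conservative cleanups to plain text exported from an OCR tool."""
    # Rejoin words hyphenated across a line break: "docu-\nment" becomes "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces; blank lines are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Text boxes, columns, and tables cannot be repaired this way; they still have to be fixed in the OCR tool or in Word.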
ABBYY FineReader
Using OCR to create Word files





Open image (PDF, TIF, etc.)
Read file using OCR tool
Check spelling (allows you to verify words
that the OCR program misread or did not
recognize)
Save as Word file
Create a clean file and copy and paste the
text into it.
Troubleshooting



Use Edit > Paste Special to paste the text into
a fresh Word file as unformatted text and format it by hand.
Delete the illegible pages in Adobe Acrobat and run
the legible pages through the tool.
OCR isn’t a magic bullet. If the source text is
very illegible you may want to just give up
and type it in by hand.
CodeZapper

“CodeZapper” is a set of Word macros designed to “clean
up” Word files before they are imported into a translation
environment program such as Deja Vu DVX or MemoQ.
Word documents are often strewn with junk or “rogue”
tags (so-called “smart tags”, language tags, track changes
tags, soft hyphenations, scaling and spacing
changes, redundant bookmarks, etc.).
This tagged information shows up in the DVX or MemoQ
grid as spurious {1}codes{2} around, or even in the middle
of, words, making sentences difficult to read and translate
and generally negating many of the productivity benefits
of the program.
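
To get a feel for how badly a file is affected, the numbered codes can be counted in segment text copied out of the grid. This is only an illustrative sketch: it is not how CodeZapper works (CodeZapper is a set of Word macros that clean the Word file before import), it assumes codes appear literally as {1}, {2}, and so on, and the names strip_codes and count_mid_word_codes are hypothetical.

```python
import re

CODE = re.compile(r"\{\d+\}")  # numbered placeholder codes such as {1} or {12}
MID_WORD = re.compile(r"(?<=\w)\{\d+\}(?=\w)")  # a code with word characters on both sides


def strip_codes(segment):
    """Return the segment with all numbered codes removed, for easier reading."""
    return CODE.sub("", segment)


def count_mid_word_codes(segment):
    """Count codes that split a word, the most disruptive kind."""
    return len(MID_WORD.findall(segment))
```

A high count of mid-word codes across a file suggests it is worth cleaning the Word source before importing it again, rather than translating around the codes.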
CodeZapper



OCR’d files or files converted from PDF are
even worse.
CodeZapper tries to remove as many of
these tags as possible while retaining
formatting and layout.
It also contains a number of other macros
which may be useful before and after
importing files into DVX or MemoQ.
Final Words



Do not use online OCR services, such as web
front ends to the Tesseract OCR Engine from Google, if your
documents are confidential.
Try several demos to determine which tool
best suits your needs.
Shop around for the best price ($399.99 vs.
EUR 139/GBP 89 or even EUR 90 ($116)).
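
Comparing the quoted prices means putting them in one currency. The sketch below uses only the exchange rate implied by the slide's own "EUR 90 ($116)"; no GBP rate is given in the presentation, so the GBP 89 offer is left out until a rate is supplied. The slides do not say whether the offers are for the same product or edition, and the function names to_usd and cheapest are hypothetical.

```python
def to_usd(amount, currency, rates):
    """Convert an amount to US dollars using a {currency: USD per unit} table."""
    if currency not in rates:
        raise KeyError(f"no exchange rate supplied for {currency}")
    return round(amount * rates[currency], 2)


def cheapest(offers, rates):
    """Return (usd_price, amount, currency) for the lowest-priced offer."""
    return min((to_usd(amount, currency, rates), amount, currency)
               for amount, currency in offers)


# EUR rate implied by the slide ("EUR 90 ($116)"); add a GBP entry to compare GBP 89.
rates = {"USD": 1.0, "EUR": 116 / 90}
offers = [(399.99, "USD"), (139, "EUR"), (90, "EUR")]
best = cheapest(offers, rates)
```

Passing the rates in as a table keeps the comparison honest: an offer in a currency without a supplied rate raises an error instead of being silently skipped.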
Links


CodeZapper:
http://www.transir.cn/redirect.php?tid=992&goto=lastpost
Presentation:
http://translationmusings.com/2009/11/05/presentation-from-ata-conference/