Making Portable Document Format (PDF) Files Work for You
Tuomas S. Kostiainen and Jill R. Sommer
ATA Conference 2009

PDF File Basics
What are PDF files and why do we use them?
– Stands for "Portable Document Format." PDF is a multiplatform file format developed by Adobe Systems. A PDF file captures document text, fonts, images, and even formatting of documents from a variety of applications. You can e-mail a PDF document to your friend and it will look the same way on his screen as it looks on yours, even if he has a Mac and you have a PC. Since PDFs contain color-accurate information, they should also print the same way they look on your screen. (Source: TechTerms.com)
Where do we as translators encounter PDF files?
– Translation projects: source text, proofreading, reference
– Forms: registration forms, IRS W-9
– Creating PDFs: résumés, invoices, file sharing, printing/publishing
Problems associated with PDF files
– Rigid (not meant to be editable)
– Converting from PDF to DOC etc.

Adobe Acrobat
Versions: Adobe Reader, Adobe Acrobat Standard, Adobe Acrobat Pro, Adobe Acrobat Pro Extended
Product comparison: www.adobe.com/products/acrobat/matrix.html
Compatibility with earlier versions
– Update to the current version (9)
Should be part of your toolbox

Other “Comparable” Products
Nitro PDF (Express, Professional, and free version): www.nitropdf.com
Foxit PDF Tools (Reader, Editor, Creator, Phantom, etc.): www.foxitsoftware.com/pdf/
Solid PDF Tools: www.soliddocuments.com
DocuCom PDF Gold: www.pdfwizard.com
Pdf995Suite: www.pdf995.com
Others

Working with PDF Files
1. Editing and Commenting PDF Files
2. Searching for Text in PDF Files
3. Creating and Filling Electronic Forms
4. Using Electronic Signatures
5. Creating TMs from PDF Files Using LogiTerm AlignFactory

Editing and Commenting PDF Files
Using Commenting Tools in a normal review cycle
– Don’t use only sticky note comments
– Available in Reader only if the PDF file author has enabled the document for commenting using Acrobat Pro (Advanced > Extend Features in Adobe Reader)
Comment & Markup tools (Tools > Comment & Markup / Tools > Customize Toolbars)
– Text Edits, Highlight Text, Callout, Arrow, Rectangle, etc.
– Show/Hide Comments
– Comments List (View > Navigation Panels > Comments)
– Spell checking of notes (Edit > Check Spelling)
Touching up text (changing text and text properties)
– Tools > Advanced Editing / Advanced Editing toolbar: TouchUp Text, Crop
– Typewriter tool: available in Reader only if the PDF author has enabled it
Inserting/extracting/rearranging pages
– Document > Insert/Extract/Replace/Delete Pages, or use the Pages navigation pane on the left
E-mail-based review or shared review (on acrobat.com) for multi-party reviews

Searching for Text in PDF Files
Edit > Find: text within the current document
Edit > Search: text in one or more files
Indexing (only in Professional): lets you index hundreds of files for quick searching
– Select Advanced > Document Processing > Full Text Index with Catalog > New Index
– Name the index, select the directories to be included, click Build, and specify a location for the index file
– Use the resulting .pdx file for searching
Creating a Searchable Image
– With the image file open in Acrobat, select Document > OCR Text Recognition > Recognize Text Using OCR

Creating and Filling Electronic Forms
Simple filling with the Typewriter tool
Using text boxes
Converting electronic files to forms using Form Wizard
Blueberry PDF Form Filler: free application for filling in and printing PDF forms (www.bbconsult.co.uk/Resources/PDFFormFiller.aspx)
– Note: deselect the “Lock All Controls” button

Using Electronic Signatures
Inserting a scanned signature: copy and paste via the clipboard (gets inserted as a “stamp”)
Creating and using a digital ID
– Creating a digital ID: Advanced > Security Settings > Digital IDs > Add ID > “A new digital ID I want to create now” > Next > New PKCS#12 digital ID file > Next. Fill in the information fields as needed and click Next. Select a location for the ID and define a password. Click Finish to return to the Security Settings dialog box. Click Close.
– Signing a PDF document: Advanced > Sign & Certify > Place Signature. Drag a rectangle where you want to place the signature. Choose a digital ID, type the password, choose an appearance, and click Sign.

Creating Translation Memories from PDF Files Using LogiTerm AlignFactory
LogiTerm AlignFactoryLight (http://www.terminotix.com)
– Quick and easy tool to create TMs from PDF files
Other tools:
– YouAlign by LogiTerm: online tool, free (for a limited time), limited selection of languages (www.youalign.com)
– NoBabel AutoAligner by KCSL: online tool, limited selection of languages (http://nobabel.com/)

Additional PDF-related links
www.adobe.com/support/
www.planetpdf.com
www.pdfstore.com

Part Two: Creating PDFs and OCR

Reasons translators might need to create a PDF
Résumés
Invoices
Letters of certification
Various protected files

Creating a PDF from Word or Excel
Choose Print and pick a PDF tool as the printer (Acrobat Distiller, win2pdf, etc.)
Select the menu button

Optical Character Recognition
Optical character recognition (OCR) is the conversion of images of handwritten, typewritten, or printed text into machine-editable text.
A PDF can contain either straight text or a graphic. OCR can handle both.
OCR tools use pattern recognition, artificial intelligence, and computer vision as well as digital character recognition.
The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem in applications where clear imaging is available, such as scanning of printed documents. Some tools can now easily recognize Cyrillic and Asian characters as well.

OCR Tools
ABBYY FineReader (http://www.abbyy.com/)
PDF Transformer (by ABBYY)
OmniPage (http://www.nuance.com/imaging/products/omnipage.asp)
Microsoft Office Document Imaging (part of MS Office 2007)
ExperVision (http://www.expervision.com/)

ABBYY FineReader
ABBYY FineReader is the clear favorite among translators (although PDF Transformer is a close second) because it creates fewer text boxes than other OCR programs.
The spellcheck feature helps you catch spelling errors in the document you are working on before they corrupt the TM.
ABBYY supports the most languages (184 at last count):
Abkhaz, Adyghian, Afrikaans, Agul, Albanian, Altai, Armenian (Eastern, Western, Grabar), Avar, Aymara, Azerbaijani (Cyrillic), Azerbaijani (Latin), Bashkir, Basic, Basque, Belarusian, Bemba, Blackfoot, Breton, Bugotu, Bulgarian, Buryat, C/C++, COBOL, Catalan, Cebuano, Chamorro, Chechen, Chinese Simplified, Chinese Traditional, Chukchee, Chuvash, Corsican, Crimean Tatar, Croatian, Crow, Czech, Dakota, Danish, Dargwa, Dungan, Dutch (Netherlands and Belgium), English, Eskimo (Cyrillic), Eskimo (Latin), Esperanto, Estonian, Even, Evenki, Faroese, Fijian, Finnish, Fortran, French, Frisian, Friulian, Gagauz, Galician, Ganda, German (Luxemburg), German (new and old spelling), Greek, Guarani, Hani, Hausa, Hawaiian, Hebrew, Hungarian, Icelandic, Ido, Indonesian, Ingush, Interlingua, Irish, Italian, JAVA, Japanese, Jingpo, Kabardian, Kalmyk, Karachay-balkar, Karakalpak, Kasub, Kawa, Kazakh, Khakass, Khanty, Kikuyu, Kirghiz, Kongo, Koryak, Kpelle, Kumyk, Kurdish, Lak, Latin, Latvian, Lezgi, Lithuanian, Luba, Macedonian, Malagasy, Malay, Malinke, Maltese, Mansy, Maori, Mari, Maya, Miao, Minangkabau, Mohawk, Moldavian, Mongol, Mordvin, Nahuatl, Nenets, Nivkh, Nogay, Norwegian (nynorsk and bokmål), Nyanja, Occidental, Ojibway, Ossetian, Papiamento, Pascal, Polish, Portuguese (Portugal and Brazil), Provencal, Quechua, Rhaeto-romanic, Romanian, Romany, Rundi, Russian, Russian (old spelling), Rwanda, Sami (Lappish), Samoan, Scottish Gaelic, Selkup, Serbian (Cyrillic), Serbian (Latin), Shona, Simple chemical formulas, Slovak, Slovenian, Somali, Sorbian, Sotho, Spanish, Sunda, Swahili, Swazi, Swedish, Tabasaran, Tagalog, Tahitian, Tajik, Tatar, Thai, Tok Pisin, Tongan, Tswana, Tun, Turkish, Turkmen, Tuvinian, Udmurt, Uighur (Cyrillic), Uighur (Latin), Ukrainian, Uzbek (Cyrillic), Uzbek (Latin), Welsh, Wolof, Xhosa, Yakut, Zapotec, Zulu

Spellchecking
ABBYY supports pre- and post-reform German orthography, Old German script, programming languages, and simple chemical formulas.
ABBYY can also sometimes replicate graphics and logos that you can paste into your file.
The text recognition software includes dictionaries with spell-checking capabilities for 38 languages, allowing verification of recognized text directly in the FineReader Editor.

Potential Problems
Page setups can vary and create inconsistent margins and page layouts.
OCR tools have problems with handwriting, bullet lists, check boxes, static from "fuzzy" fax transmissions, and tables.
Formatting sometimes needs to be cleaned up (double spaces, text boxes, columns, etc.).

ABBYY FineReader: Using OCR to create Word files
1. Open the image (PDF, TIF, etc.)
2. Read the file using the OCR tool
3. Check spelling (allows you to verify words that the OCR program misread or did not recognize)
4. Save as a Word file
5. Create a clean file and copy and paste the text into it

Troubleshooting
Use Edit > Paste Special to paste the text into a fresh Word file and format it by hand.
Delete the illegible pages in Acrobat and run the legible pages through the tool.
OCR isn't a magic bullet. If the source text is very illegible, you may want to just give up and type it in by hand.
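Part of the clean-up listed under Potential Problems (double spaces and similar debris in OCR output) can be automated on the plain text before it is pasted into a fresh Word file. The following is a minimal Python sketch, not part of the original presentation and not a feature of any tool named here. The function name `clean_ocr_text` and the choice of rules (removing soft hyphens, rejoining words hyphenated across line breaks, collapsing runs of spaces) are illustrative assumptions.

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible hyphenation hint that often survives PDF conversion


def clean_ocr_text(text: str) -> str:
    """Apply a few conservative clean-up rules to plain text exported from an OCR tool.

    Hypothetical helper for illustration only; adjust the rules to your own files.
    """
    # Remove soft hyphens, which would otherwise split words inside a TM segment.
    text = text.replace(SOFT_HYPHEN, "")
    # Rejoin words that were hyphenated across a line break ("transla-\ntion").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of two or more spaces/tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Strip whitespace left dangling at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "The  transla-\ntion was   scanned from a fax.  \nSecond line."
    print(clean_ocr_text(sample))
```

The rule that rejoins hyphenated words is a trade-off: it also merges genuine compounds that happen to break at the hyphen (for example "well-" at the end of one line and "known" on the next), so the result still needs a spell check and a read-through.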
CodeZapper
"CodeZapper" is a set of Word macros designed to “clean up” Word files before they are imported into a translation environment program such as Deja Vu DVX or MemoQ.
Word documents are often strewn with junk or “rogue” tags (so-called “smart tags”, language tags, track changes tags, soft hyphenations, scaling and spacing changes, redundant bookmarks, etc.).
This tagged information shows up in the DVX or MemoQ grid as spurious {1}codes{2} around, or even in the middle of, words, making sentences difficult to read and translate and generally negating many of the productivity benefits of the program.
OCR'd files or files converted from PDF are even worse.
CodeZapper tries to remove as many of these tags as possible while retaining formatting and layout.
It also contains a number of other macros that may be useful before and after importing files into DVX or MemoQ.

Final Words
Do not use online OCR tools (for example, web services built on Google's Tesseract OCR engine) if your documents are confidential.
Try several demos to determine which tool best suits your needs.
Shop around for the best price ($399.99 vs. EUR 139/GBP 89 or even EUR 90 ($116)).

Links
CodeZapper: http://www.transir.cn/redirect.php?tid=992&goto=lastpost
Presentation: http://translationmusings.com/2009/11/05/presentation-from-ata-conference/