Encoding Specifications: TEI Documents
Our Americas Archive Partnership project
Fondren Library, Rice University
Last updated July 15, 2009
General Statements
These guidelines are intended for printed materials, such as books, pamphlets, broadsides and
other early government publications.
Texts were sent to a vendor, who converted the page images to text using the double-keying method at
99.95% accuracy.
Specifications
Encoding of text should adhere to the recommendations specified in TEI in Libraries Level 4
Guidelines (http://www.diglib.org/standards/tei.htm) with the following modifications:
1) TEI Schema / Syntax
• Convert text using TEI-P5 Schema
• Please use the following beginning code for each TEI/XML document:
<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="xml"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
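Example (illustrative only): a minimal sketch of how a complete file might continue after the opening code above. All bracketed content is a placeholder assumption, not project-supplied metadata. TEI P5 expects a <teiHeader> containing a title statement, publication statement and source description, followed by <text>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- the <?oxygen ...?> schema line given above goes here -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>[title of the source document]</title>
      </titleStmt>
      <publicationStmt>
        <p>[publication information]</p>
      </publicationStmt>
      <sourceDesc>
        <p>[description of the printed source]</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- placeholder division; see 4) Divisions for the attribute rules -->
      <div1 type="section" xml:id="div1001" n="1">
        <head>[division title]</head>
        <p>[transcribed text]</p>
      </div1>
    </body>
  </text>
</TEI>
```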
2) Page images
• Please include coding for presentation of page images with the following attributes,
in the following sequence: facs, xml:id, n.
Example
<pb facs="aa00135_0004" xml:id="p0004" n="4"/>
where facs is the filename of page image (without extension)
xml:id is a unique identifier for each page,
n is the page number as printed in the text.
o Please note that this means you will not need to transcribe the printed page
numbers from the text, as the style sheet will generate pagination from the above
coding.
3) Hyphenation
• Hyphenated, non-compound words that appear at the end of lines should be closed up
to facilitate searching and retrieval.
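Example (illustrative only): a sketch of a word broken across two printed lines and closed up in the transcription. The sentence is hypothetical, and placing <lb/> after the rejoined word is an assumption; these guidelines do not state where the line break should go in this case.

```xml
<!-- Printed text (hypothetical):
       ... the annexation of Texas by the govern-
       ment of the United States ...
     The end-of-line hyphen is dropped and the word is closed up. -->
<p>... the annexation of Texas by the government<lb/>
of the United States ...</p>
```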
4) Divisions
• Please use numbered rather than unnumbered divs
• Assign a type attribute to each <div>
• Use a unique xml:id for each <div>
• For every <div> within body text, build the xml:id using the numbering convention: div level + the div
number (e.g. xml:id="div1003" for the 3rd div1 in a file)
i. Exclude n values for front and back matter
• Follow the attribute sequence shown in the correct example below (type, xml:id, n)
• Please transcribe division titles using the <head> tag
Incorrect
<div type="section">
Correct
<div1 type="section" xml:id="div1003" n="3" >
<head>Annexation of Texas and Boundary with Mexico.</head>
<head type="sub"> Message from the President.</head>
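Example (illustrative only): a sketch of a complete, closed division containing one nested division. The div2 identifier assumes that the same convention (div level + div number) extends to lower levels. The div2 head and the paragraph content are placeholders.

```xml
<div1 type="section" xml:id="div1003" n="3">
  <head>Annexation of Texas and Boundary with Mexico.</head>
  <head type="sub">Message from the President.</head>
  <p>...</p>
  <!-- assumed: the first div2 in the file is numbered div2001 -->
  <div2 type="subsection" xml:id="div2001" n="1">
    <head>[title of the first div2 in the file]</head>
    <p>...</p>
  </div2>
</div1>
```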
Common div types may include:
• appendix
• article
• chapter
• contents (for table of contents)
• copyright (e.g. treat section titles such as AVISO OFICIAL as copyright div types)
• cover (book cover images: front, back or spine)
• entry
• dedication
• index
• letter
• list of illustrations
• poem
• section
• subchapter
• subsection
• title page
• volume
5) Special characters
Where possible, please use the Unicode character for special characters (particularly any
diacritics in Spanish texts) rather than the entity form, e.g. è rather than &egrave; or
&#232;
6) Inline graphics
Any decorative elements such as photographs, illustrations, etc. appearing within the body
of text should be recorded with a <figure> tag at the point in the text where they appear.
However, please do not include the <graphic url> subtag.
Example
<figure xml:id="illaa00135_0003a">
<head>caption, descriptive heading or title </head>
<p>any other text</p>
</figure>
The xml:id should be composed of the letters “ill” plus the filename of the source page
image plus a lowercase letter as shown in this example.
For full-page illustrations, please center the entire <figure> element
Example
<p rend="center">
<figure xml:id="illaa00030_0002ca">
<head>ISLA DE CUBA</head>
<p>[Señora de la Habana]</p>
</figure>
</p>
OR, when a full-page illustration occurs within text that spans multiple pages
(i.e. within a single paragraph), place the rend attribute within an <hi> tag.
<p>…
<pb facs="m001_a228a" xml:id="pa228a"/>
<hi rend="center"><figure xml:id="illm001_a228aa">
<head>CLEOPATRA DANDOSE MUERTE CON EL ÁSPID</head>
</figure></hi>…</p>
7) Unidentifiable or Illegible text
If a keyer cannot transcribe any part of the text because the original page is missing or physically
damaged, please use the <gap> tag with attribute “missing”. If there is any other
cause of illegibility (such as blurred text in the source material), please use the <unclear>
tag with attribute “illegible”.
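Example (illustrative only): a sketch of both tags in running text. It assumes the values “missing” and “illegible” are carried in the TEI reason attribute, which these guidelines do not name explicitly. The sentence and the keyer's best reading inside <unclear> are hypothetical.

```xml
<!-- assumed: the values go in the reason attribute -->
<p>The treaty was signed at <gap reason="missing"/> on the
<unclear reason="illegible">12th</unclear> of April.</p>
```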
8) Mark up any deletions/overstrikes using the <del> tag
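Example (illustrative only): a sketch of a struck-through word followed by its replacement. The sentence is hypothetical, and the rend value "overstrike" is an assumption; these guidelines do not say whether a rend value is required on <del>.

```xml
<!-- hypothetical text; rend="overstrike" is an assumed value -->
<p>The sum of <del rend="overstrike">five</del> six thousand dollars.</p>
```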
9) Use the <seg type="letterhead"> tag to mark up letterheads
10) Title pages
When a formal title page exists, please use the <titlePage> and related subtags to encode
the title page information.
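Example (illustrative only): a sketch of a title page using common TEI P5 subtags. Which subtags apply depends on what is printed on the page; all bracketed content is a placeholder.

```xml
<titlePage>
  <docTitle>
    <titlePart type="main">[MAIN TITLE AS PRINTED]</titlePart>
    <titlePart type="sub">[Subtitle as printed]</titlePart>
  </docTitle>
  <docAuthor>[Author as printed]</docAuthor>
  <docImprint>
    <pubPlace>[Place of publication]</pubPlace>:
    <publisher>[Printer or publisher]</publisher>,
    <docDate>[Year]</docDate>
  </docImprint>
</titlePage>
```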
11) Table of contents
If there is a formal table of contents, it should be encoded using
a <div1> tag with an attribute type="contents". The table of contents should be encoded
as a <table>, typically with the title of the chapter in one cell and the page number in
another. Use the <ref> element to provide a link to the appropriate page, as in <ref
target="p0001">1</ref>.
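Example (illustrative only): a sketch of a short table of contents. The xml:id value, head and chapter titles are hypothetical. The n attribute is omitted on the assumption that the table of contents is front matter, and the target values follow the form used in the <ref> example above.

```xml
<div1 type="contents" xml:id="div1001">
  <head>[CONTENTS]</head>
  <table>
    <row>
      <cell>[Title of first chapter]</cell>
      <cell><ref target="p0001">1</ref></cell>
    </row>
    <row>
      <cell>[Title of second chapter]</cell>
      <cell><ref target="p0015">15</ref></cell>
    </row>
  </table>
</div1>
```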
12) Please encode all line and paragraph breaks as they appear in the text (except for
hyphenated words as described in #3 above)
13) Please encode any typographic features such as bold, italic, underlined, or superscript text
using the rend attribute. Please make the following exceptions for readability purposes:
• If heads are centered in the original document, please do not mark up this alignment.
Do encode other alignments (we will modify the stylesheet so that right-aligned text is
not set far off from the rest of the page).
• Please do not mark up any text in SMALL CAPS; just use all uppercase.
• In Spanish documents, if the character n with macron (straight line symbol over the n)
is used, please substitute n with tilde (ñ) instead.
14) Tables. Use table tags only for tabular data or for tables of contents and indexes (i.e.
<div1 type="contents"/>).
• Please use <table rend="rules"> to display gridlines or borders for tables
(except for tables of contents or indexes).
• For cells containing numeric data, please right-align the data (e.g. <cell
rend="right">1,500</cell>).
• Header cells: For cells that contain a label or heading, rather than data, use <cell
role="label">. (For cells containing data, there is no need to include the role
attribute; "data" is the default.)
• For any table that spans multiple pages, close the table prior to inserting the <pb> tag
and reopen the table following the <pb> tag. Please repeat any header cells for table
data continuing onto the next page.
• To place data in the bottom right of a table cell, use
<cell rend="right-bottom">. By default, everything is at the top right of the cell.
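Example (illustrative only): a sketch of a ruled table that continues onto a second page, with the header cells repeated after the page break. The table data and the facs, xml:id and n values on <pb> are hypothetical.

```xml
<table rend="rules">
  <row>
    <cell role="label">Port</cell>
    <cell role="label">Bales</cell>
  </row>
  <row>
    <cell>Galveston</cell>
    <cell rend="right">1,500</cell>
  </row>
</table>
<!-- the table is closed before the page break and reopened after it -->
<pb facs="aa00135_0005" xml:id="p0005" n="5"/>
<table rend="rules">
  <row>
    <!-- header cells repeated for the continued table -->
    <cell role="label">Port</cell>
    <cell role="label">Bales</cell>
  </row>
  <row>
    <cell>Matagorda</cell>
    <cell rend="right">320</cell>
  </row>
</table>
```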
15) Lists.
• All lists should contain a type attribute: Bulleted, Ordered (for automated numbered
lists) or Simple
• If special characters are used to denote list items, please use the <label> tag rather
than transcribing the character directly within the <item> tag.
• For lists that span multiple pages, place the <pb> tag within the preceding <item> tag,
separated by <lb/>, prior to where the page break should occur.
Example
<list type="ordered">
<label>1.<hi rend="sup">o</hi></label>
<item>
<hi rend="italic">Post Office and residence of the Postmaster.</hi><lb/>
<pb facs="aa00265_0004" xml:id="p0004" n="4"/>
</item>
<label>2.<hi rend="sup">o</hi></label>
<item>The goods to be accompanied…
16) Poetry.
• Line groups: All verse, including poems without distinct stanzas as well as verse
quoted within a block of prose, should be encoded with the <lg> element.
• The type attribute is required on both the <div> and the <lg> tag.
• Line breaks: When encoding verse it is important to distinguish between logical lines
of verse and the physical presentation of those lines on the printed page. In cases
where a line of verse is too long to fit on the printed page, and for that reason is
continued on a second line, use <l> to mark the logical line of verse and <lb/> to
mark the physical line break.
• Indentation: If a line of verse is indented more than the surrounding lines, use <l
rend="indent">...</l>.
Example
<div1 type="poem" xml:id="n005" >
<pb facs="aa00145_0002" xml:id="p0002" n="2"/>
<head>America.</head>
<lg type="stanza">
<l>My country, 'tis of thee,</l>
<l>Sweet land of Liberty.</l>
<l rend="indent1">Of thee I sing;</l>
<l>Land where my fathers died,</l>
<l>Land of the pilgrim's pride,</l>
<l>From every mountain side</l>
<l rend="indent1">Let Freedom ring.</l>
</lg>
</div1>
Source: UVA's TEI Guidelines for Keyboarding Vendors. See example code at
http://pogo.lib.virginia.edu/dlps/public/text/vendor/vendor.html#verse
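Example (illustrative only): a sketch of one logical line of verse that runs over onto a second printed line, as described under Line breaks above. The bracketed verse text is a placeholder.

```xml
<lg type="stanza">
  <!-- one logical line, printed on two physical lines -->
  <l>[first part of a long line of verse, as printed]<lb/>
  [remainder of the same line, printed on the next line]</l>
  <l>[next line of verse]</l>
</lg>
```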
17) Column breaks: typically used with side-by-side translations (where both English and Spanish text are
laid out in columns on a single page) or in poems.
• Given TEI's requirements for nesting, you will need to force a closing paragraph tag
before the column break and immediately follow it with an opening paragraph
tag for the continuing text. Though not technically accurate, this does allow the use of
the column break tag and is a workaround to retain the layout of a page while also
retaining the actual text per page image.
• For a translation, place both the English and Spanish columns of text in the same
div.
• Do not use <cols> tags.
Example
<cb n="1" xml:lang="eng"/>
<p> ...
the said republic, who, after a<lb/>
</p> <!-- forced paragraph closer -->
<cb n="2" xml:lang="spa"/>
<p>En el nombre de Dios Todo-Poderoso:</p>
18) Running heads on government documents (example: item aa00310): treat the running
head as <fw>. This head does not need to be repeated throughout the rest of the document;
encode it only the first time it appears. (For more details, please see
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-fw.html)
Example:
<pb facs="aa00310_0001" xml:id="p001" n="1" />
<fw place="top">28th CONGRESS, <hi rend="italic">2d Session.</hi>
[SENATE] [30]</fw>
<head>MESSAGE FROM THE PRESIDENT OF THE UNITED STATES...
</head>
19) For any letters (e.g. div type="letter"), please include a <dateline> tag (including the place
and <date>), a <signed> tag for the signature, etc.
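Example (illustrative only): a sketch of a letter division. The place, date, names and xml:id are hypothetical, and wrapping <dateline> in <opener> and <signed> in <closer> follows common TEI P5 practice rather than anything stated in these guidelines.

```xml
<div1 type="letter" xml:id="div1004" n="4">
  <opener>
    <dateline>[Place], <date>[date as printed]</date>.</dateline>
    <salute>Sir:</salute>
  </opener>
  <p>[body of the letter]</p>
  <closer>
    <salute>I have the honor to be, &amp;c.,</salute>
    <signed>[SIGNATURE AS PRINTED]</signed>
  </closer>
</div1>
```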