All four TCP text projects are produced in the same way and to the same standards, which are documented, at least in part, on the TCP web site.

  1. Accuracy. The TCP strives to produce texts that are as accurately transcribed as possible, with a specified overall accuracy rate of 99.995% or better (i.e. one error or fewer per 20,000 characters).
  2. Keying. Given the nature of the material, the only method found to deliver such accuracy economically has been to have the books keyed by data conversion firms under contract.
  3. Quality control. Accuracy of transcription and aptness of markup are assessed in all cases by a group of library-based proofers and reviewers managed by the University of Michigan DLPS.
  4. Encoding. All resultant text files are marked up in valid SGML or XML (SGML is archived, XML is exported) conforming to a proprietary "Document Type Definition" (DTD) derived from the P3/P4 version of the Text Encoding Initiative (TEI) standard.
  5. Purposeful markup. Compared to the full TEI, the TCP DTD is very simple and intended to capture only the features most useful for intelligible display, intelligent navigation, and productive searching. The TCP practice is to capture, so far as feasible, the overall hierarchical structure of each book (parts, sections, chapters, etc.); the features that tend to mark the beginnings and ends of divisions (headings, explicits, salutations, valedictions, datelines, bylines, epigraphs, etc.); the most significant elements of discourse and organization (paragraphs in prose, lines and stanzas in verse, speeches, speakers, and stage directions in drama, notes, block quotes, sequential numerations of all kinds); and only the most essential aspects of physical formatting (page breaks, lists, tables, font changes).
  6. Fidelity to the original. In each case, the text is intended to represent the book as originally printed, so far as that is possible. Printer's errors are preserved, hand-written changes are ignored, duplicate scans are omitted, out-of-order images are keyed in the intended order, and most of the unusual characters of the original are preserved.
  7. Ease of reading and searching. At the same time, although the transcriptions are carried out character by character, the TCP tends to define characters more by their meaning than by their form, on the theory that all transcription is a kind of translation from one symbolic system to another. It therefore maps eccentric letter-forms to meaningful modern equivalents, generally in keeping with the Unicode definition of "character."
  8. Languages. Though most of the TCP texts are in English, many are not. Books and divisions of books not in English are tagged with an appropriate language code, but are not otherwise distinguished.
  9. Omitted material. The TCP produces Latin-alphabet text. Non-textual material such as musical notation, mathematical formulae, and illustrations (except for any text they may contain) is omitted and its location marked with a special tag. Extended text in non-Latin alphabets (Greek, Hebrew, Persian, etc.) is also omitted.
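
The accuracy figure in item 1 can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of any TCP tooling; the function names max_errors_allowed and meets_target are hypothetical and introduced only for illustration. It shows that 99.995% accuracy is the same as one error per 20,000 characters, and how that rate would translate into an error allowance for a text of a given length.

```python
from fractions import Fraction

# Target stated in item 1: 99.995% accuracy, i.e. an error rate of 1/20,000.
TARGET_ACCURACY = Fraction(99995, 100000)
ERROR_RATE = 1 - TARGET_ACCURACY  # exactly Fraction(1, 20000)


def max_errors_allowed(char_count: int) -> int:
    """Largest whole number of errors that keeps a text at or above the target.

    Hypothetical helper for illustration only.
    """
    # int() on a non-negative Fraction truncates, which is a floor here.
    return int(char_count * ERROR_RATE)


def meets_target(char_count: int, errors: int) -> bool:
    """True if the observed accuracy is at least the stated target.

    Hypothetical helper for illustration only.
    """
    return Fraction(char_count - errors, char_count) >= TARGET_ACCURACY


if __name__ == "__main__":
    print(ERROR_RATE)                     # the error rate as an exact fraction
    print(max_errors_allowed(1_000_000))  # allowance for a million-character text
    print(meets_target(20_000, 1))        # one error in 20,000 characters
    print(meets_target(20_000, 2))        # two errors in 20,000 characters
```

Exact fractions are used instead of floating-point numbers so that the boundary case (exactly one error in 20,000 characters) compares cleanly against the target.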

