
Stating the obvious: lately we have been having a lot of heat about what I will describe broadly, and hopefully without perceived prejudice, as "encoding choices" -- a subject which is often described [incorrectly] as "file formats." I.e., a volunteer transcriber, "at" PG or DP or somewhere else, using a combination of OCR, blood, sweat and tears, and other "helpful" tools, makes choices, independently or as directed, about what to encode and how to encode it: the information they see in the book which they feel is important to record, and the information which they decide is unimportant and should not be recorded.

At this point one person or another on this forum immediately jumps up and down and says "I have THE solution!" I would like to suggest that any such proposed "solution" must include a discussion and decision about what to actually encode, and what not to encode. I would also like to suggest that any proposed "solution" [including those currently in use at PG and DP] should encode whatever it chooses to encode *unambiguously.* I.e., if a particular format and/or encoding scheme claims to encode a particular piece of information, but does not do so unambiguously, then in fact it is NOT successfully capturing that piece of information.

Below, as a reminder, is a list of things that various volunteers choose to encode, or not to encode, and which various encoding schemes do, or do not, support encoding unambiguously.

To state the obvious: at least part of the problem is that the people on this forum have wide DISAGREEMENT about which of the following items should be encoded unambiguously (which is to say, successfully), which should be ignored, and which should be recorded ambiguously [???].
[I have cribbed a lot of this from DP, but I could just as well have cribbed it from the TEI community, or from most good books on typography.]

As an example, and to start this throw-down, I will simply state that, from my point of view, "how discolored with age the paper stock is" is really NOT an important thing to encode! [I would rather see a representation of the book "as-published," not "as-found."]

PS: "unambiguous coding" means: "those who propose that file format or encoding can write a computer program to extract that particular piece of information from their files 100% successfully, 100% of the time, without error and without heroic coding efforts."

======

book-title
author
illustrator
translator
publisher
publication date
glyph
paragraph
blockquote
poem-stanza
poem-line
poem-indentation-level
epigrams
thought break
page break
horizontal rule
spelling (choice of)
capitalization
bold
italic
underline
small-cap
gesperrt
illustrations
color of paper stock
how discolored with age that paper stock is
captions
drop-cap
alternative font use
distinctive font size changes
author-specific eccentricities in use of punctuation
author-specific eccentricities in use of spelling
nationality of spelling, punctuation and/or style of typesetting
tables
superscripts
subscripts
page numbers
line numbers
page references
TOC
index
Preface
left-vs-right alignedness
justification
letter
signature
salutation
"foreign" words
language (of original work)
language (of this publication)
language (of a subsection of this publication)
handedness (of punctuation)
Foreword
Introduction
Front Material
Prologue
Epilogue
Appendix
References
Conclusion
Glossary
Summary
Acknowledgements
Bibliography
Advertisements
Front Matter
Back Matter
Ex Biblio
decorative marks
decorative rules
printers' [stock] art
illuminated-letter
black-letter
first-letter-color
punctuation spacing
line-breaks
foot-note
side-note
end-note
all-caps
chapter title
chapter subtitle
section title
section subtitle
hyphenation
half-spaces
contractions (form of)
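
To make the "write a computer program to extract it" test from the PS concrete, here is a minimal sketch for one item on the list, italic. Everything in it is an assumption made up for illustration: the underscore convention, the <i> markup, the sample sentence and the two function names are hypothetical, and neither is meant to represent what PG or DP actually does. It only shows what failing and passing the test might look like.

```python
import re
import xml.etree.ElementTree as ET


def italics_from_underscores(text):
    """Extract italics from a hypothetical plain-text convention: _like this_."""
    # Any pair of underscores is treated as italic markers, so a literal
    # underscore in the text cannot be told apart from markup.
    return re.findall(r"_([^_]+)_", text)


def italics_from_markup(xml_text):
    """Extract italics from a hypothetical explicit markup: <i>like this</i>."""
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()) for el in root.iter("i")]


# The same (made-up) sentence, encoded both ways. Only "Iliad" is italic.
plain = "See the file my_old_notes for the _Iliad_ draft."
marked = "<p>See the file my_old_notes for the <i>Iliad</i> draft.</p>"

print(italics_from_underscores(plain))  # ['old', 'Iliad'] -- 'old' is a false hit
print(italics_from_markup(marked))      # ['Iliad']
```

On this one sentence the underscore convention reports an italic word that was never italic, so by the definition above it does not encode italics unambiguously. The explicit markup passes here, though a single passing case is of course not proof of "100% of the time."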