
On Tue, January 24, 2012 6:33 pm, James Adcock wrote:
Don>Starting at the most basic level, is there any good reason not to use utf-8 as the basic encoding standard for everything including plain-text?
No.
BOM or no BOM?
Depends on the file. XML files (XHTML, TEI, etc.) are guaranteed to be ASCII in their first line, and that first line declares the encoding, so no BOM is necessary (and would probably confuse some tools). Subtle markup languages like reStructuredText, which have no prolog, need some mechanism to indicate that they contain UTF-8 (as opposed to Latin-1 or MacRoman), so they may need a BOM.
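To make the BOM question concrete, here is a minimal Python sketch of how a tool might check for a UTF-8 BOM and decode a file either way. The helper names (has_utf8_bom, decode_text) are made up for illustration and are not part of any PG tooling; the sketch assumes the input really is UTF-8.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present.

    The "utf-8-sig" codec strips the BOM when it is there and otherwise
    behaves like plain "utf-8".
    """
    return data.decode("utf-8-sig")


# Two hypothetical inputs: the same text with and without a BOM.
sample_with_bom = codecs.BOM_UTF8 + "café".encode("utf-8")
sample_without_bom = "café".encode("utf-8")
```

A tool that reads with "utf-8-sig" accepts both kinds of file, so the BOM only matters for tools that have to guess the encoding from the bytes alone.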
Unix or Windows style line breaks?
Don't know that it matters, but my preference would be Unix.
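If Unix line breaks were chosen, converting is cheap. A rough sketch, assuming the text is already decoded to a Python string (to_unix_newlines is a hypothetical helper name):

```python
def to_unix_newlines(text: str) -> str:
    """Convert Windows (CRLF) and old Mac (CR) line breaks to Unix (LF)."""
    # Replace CRLF first so that its CR is not converted a second time.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```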
Line breaks meaning paragraph separations or line breaks meaning, well, whatever it is that PG means by line breaks?
All lines will wrap when displayed, so a mechanism is needed to indicate "this is not just whitespace, it really is a new line!" All markups have a mechanism for this purpose. For ease of proofreading, I recommend that text lines be broken with insignificant newline characters at the same point as in the original text to the extent possible (hyphenated lines cannot follow this rule, and should be broken at the next, or previous, available whitespace).
"uft-8" meaning that "we" use the interpretation of the code points as defined by Unicode, or meaning "we" invent our own meanings for those code points?
Unicode without composition. Most have argued that UTF-8 requires Unicode. Technically you can UTF-8 encode any set of code points, but for this project it would serve no purpose.
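For anyone unsure what composition changes in practice, the following Python sketch shows the same visible character as one precomposed code point and as a base letter plus a combining mark, along with the UTF-8 bytes for each. It only illustrates the distinction; which of the two forms a project rule should require is not settled by this example.

```python
import unicodedata

precomposed = "\u00e9"   # é as the single code point U+00E9
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# The two strings display the same but are different code point sequences.
strings_differ = precomposed != decomposed

# Unicode normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)    # composed form
nfd = unicodedata.normalize("NFD", precomposed)   # decomposed form

# UTF-8 simply encodes whichever code points it is given.
utf8_precomposed = precomposed.encode("utf-8")    # bytes C3 A9
utf8_decomposed = decomposed.encode("utf-8")      # bytes 65 CC 81
```

The point is that UTF-8 is only the byte encoding; the choice of code points is made before encoding and has to be stated separately.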