
1) Preserves the hard-won effort I have already put into content creation, such that a future volunteer can build on my work without having to "reverse engineer" the gratuitous errors currently introduced by PG's use of TXT and HTML.
Please give some real-world examples.
OK. My point being that IF PG were to accept a "proper" book INPUT encoding format that preserves the hard-won knowledge of the original encoding volunteer, then there would be no need for a future volunteer to completely re-check that encoding against the original book scans in order to make another pass looking for errors, etc. So what is "wrong" with TXT and HTML in this regard, as stored in the PG databases?

Both formats throw away the original volunteers' knowledge about the common parts of books: TOC, author info, pub info, copyright pages, index, chapters, etc. Yes, one can code this information in HTML, but there is no unambiguous way to do so, which means that PG HTML encodings all take different paths, as one rapidly discovers if one tries to automagically convert PG HTML into other reflow file formats. You could follow common h1, h2, h3 settings by convention -- if PG were to establish and require such -- but then you end up with really ugly rendered HTML on common displays. You can overcome this with style sheets -- but then you are defeating many tools which automagically convert HTML into a variety of other reflow file formats for the various e-readers.

Both formats as stored by PG gratuitously throw away hard-won line-by-line alignments between the scan text and the hand-scanno-corrected text. These alignments are needed if a future volunteer wants to make another pass at "fixing" errors in the text, for example by running it through DP again, or by running it against a future automagic tool that compares a new scan to the PG text. I submit my HTML to PG WITH the original line-by-line alignments -- because it doesn't in any way hurt the HTML and allows a future volunteer to make another pass on my work -- but then PG insists on throwing this information away anyway before posting their HTML files.

Both formats throw away page numbers and page breaks, which again are necessary to make another volunteer pass against the original scans, and also to make future passes against broken link info, etc. They would also be useful for some college courses, where you need page number refs even if reading on a reflow reader device. I'm NOT suggesting that page numbers should typically be displayed in an OUTPUT reflow file format rendering, rather that this represents hard-won information that ought to be retained in a well-designed INPUT file format encoding.

TXT files seem to me to almost always have some glyphs outside of the 8-bit char set. Unicode text files would at least overcome this limitation. HTML in theory doesn't have this limitation, but in practice, when submitting "acceptable" HTML to PG and running it through their battery of acceptance tools, I find some glyphs I can't get through, so I end up punting and throwing away "correct" glyph information, dumbing down the representation of some glyphs.

PG and DP *in practice* have a dumbed-down concept of punctuation, such that it's impossible to maintain and retain the "original author's intent" as expressed in the printed work. For example, the M-dash is commonly found in three contexts: lead-in, lead-out, and connecting, similar to how ellipses are used in at least three different ways: ...lead in, lead out, and ... connecting. But in practice all one can get through PG and DP is the connecting M-dash. Also consider all the [correct] variety of Unicode quotation marks which needlessly get reduced in PG and DP to only U+0022 or U+0027.
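
To make that last point concrete, here is a minimal sketch in Python -- purely illustrative, not any actual PG or DP tool, and every name in it is my own invention -- of the kind of one-way fold-to-ASCII reduction I am describing:

    # A sketch of the "dumbing down" step: folding typographically correct
    # Unicode punctuation into the small ASCII subset that survives in practice.

    ASCII_FOLD = {
        "\u2018": "'",   # left single quotation mark  -> U+0027
        "\u2019": "'",   # right single quotation mark -> U+0027
        "\u201C": '"',   # left double quotation mark  -> U+0022
        "\u201D": '"',   # right double quotation mark -> U+0022
        "\u2014": "--",  # em-dash (lead-in, lead-out, or connecting) -> "--"
        "\u2026": "...", # ellipsis (any of its three uses) -> "..."
    }

    def dumb_down(text: str) -> str:
        """Collapse distinct punctuation glyphs into a few ASCII stand-ins."""
        for glyph, ascii_form in ASCII_FOLD.items():
            text = text.replace(glyph, ascii_form)
        return text

    original = "\u201CWell\u2014\u201D she began. \u2026and then silence\u2026"
    print(dumb_down(original))   # "Well--" she began. ...and then silence...

    # Going the other way is ambiguous: a bare '"' no longer says whether it
    # opened or closed a quotation, and "--" no longer says which of the three
    # em-dash contexts it stood for.

The forward direction is a dozen lines and trivially automatic; the reverse direction requires a human with the page scans in hand.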
In general PG has a dumbed-down concept of punctuation -- that near enough is good enough -- and is actively hostile to accurately encoding the punctuation as rendered in the original print document. Again, it is EASY to dumb down an INPUT file format, for example if you need to output to a 7-bit or even a 5-bit teletypewriter, if that is what you want. So why insist that the input file encoder get it wrong in the first place? It is easy to throw away information when going from an INPUT file encoding to an OUTPUT file rendering. It is VERY DIFFICULT to correctly fix the introduced errors when going back from a reduced OUTPUT file rendering to a correctly encoded INPUT file encoding.

What I am imagining is some simple-to-use file encoding format where a volunteer can correctly and unambiguously code the common things and conventions one finds in everyday books, such that another volunteer can pick up the book and make another pass on it some years hence -- without having to reinvent or rediscover the work that the previous volunteer has already put into understanding and coding the book. Such an INPUT file encoding would have little or nothing to do with how the output will be displayed in an eventual OUTPUT file rendering.

DP already has much of this distinction in their work flow. Unfortunately, their page-by-page conventions and simplifications -- "dumbing down" for the sake of the multiple levels of volunteers -- guarantee loss of information. Not to mention that they also throw away the correctly encoded, hard-won INPUT file knowledge in favor of more ambiguous OUTPUT file renderings in HTML and TXT.

The end result is that both PG and DP end up being "write once" efforts that are hostile to future improvements by future volunteers -- instead of encouraging on-going efforts to improve what we've got. Which is also indicative of a general culture of quantity over quality. PG pretends that part of why we do what we do is to protect and preserve books in perpetuity. This implies in exchange that information gratuitously thrown away during input file encoding [or directly in an output file rendering] is potentially lost for eternity. Why insist via policy that volunteer input file encoders must throw away this information?
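
To be clear about what I mean by that separation, here is a rough sketch in Python, with entirely hypothetical names and no claim to be a concrete format proposal: the INPUT encoding carries page breaks and line-by-line scan alignment as data, and only the OUTPUT rendering step throws them away:

    # Sketch of an input encoding that retains scan alignment, and an output
    # renderer that is free to discard it. Hypothetical structures, not a spec.

    from dataclasses import dataclass

    @dataclass
    class Line:
        scan_page: int    # page number in the original scan
        scan_line: int    # line number on that scan page
        text: str         # hand-corrected text for that scan line

    @dataclass
    class Chapter:
        title: str
        lines: list[Line]

    def render_plain_text(chapters: list[Chapter]) -> str:
        """OUTPUT rendering: reflow to plain text, silently dropping the
        page/line alignment that the INPUT encoding still retains."""
        out = []
        for ch in chapters:
            out.append(ch.title)
            out.append(" ".join(line.text for line in ch.lines))
        return "\n\n".join(out)

    book = [Chapter("CHAPTER I.", [
        Line(scan_page=7, scan_line=1, text="It was a dark and"),
        Line(scan_page=7, scan_line=2, text="stormy night."),
    ])]

    print(render_plain_text(book))

The rendering threw the alignment away, as any output rendering is welcome to do -- but the input encoding did not, so a future volunteer could still diff the encoded book against a fresh scan, page by page and line by line.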