
"don" == don kretz <dakretz@gmail.com> writes:
don> Here is a PG project that needs some formatting help.
don> The HTML conversion would probably best be termed "fail".
don> Yet it should be a simple case.
don> How would you want proofers to submit it after "matching the scan"?
don> What would you consider to be a properly marked up copy in
don> the format of your choice? How should it get there from the previous
don> step?
don> http://www.gutenberg.org/cache/epub/14668/pg14668.html

The formatting of this book can be improved at will, but its worst
failure is in the txt version. Look at its main page,

http://www.gutenberg.org/ebooks/14668

and compare the PDF (which has images) or the original

http://archive.org/details/mcguffeysseconde00mcgu

with the UTF-8 txt that PG provides as the default txt. In the txt you
have the following:

        TABLE OF VOCALS.

        Long Sounds

    Sound  as in    Sound  as in
    a      ate      e      err
    a      care     i      ice
    a      arm      o      ode
    a      last     u      use
    a      all      u      burn
    e      eve      oo     fool

        SHORT SOUNDS.

    Sound  as in    Sound  as in
    a      am       o      odd
    e      end      u      up
    i      in       oo     look

In the original, every one of these vocals carries a diacritic mark.
Such marks are easy to represent in Unicode, but here everything is
ASCII. The diacritics were evidently considered irrelevant, yet they are
the essential feature of this reader: without them the five rows for "a"
under Long Sounds are indistinguishable.

There are also transcription errors. For example, in this

        TABLE 0F ASPIRATES.

    Sound  as in    Sound  as in
    f      fifi     t      tat
    h      him      sh     she
    k      kite     ch     chat
    p      pipe     th     thick
    s      same     wh     why

"fifi" should be "fife", and "kite" should be "cake".

Carlo
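
As a rough illustration of the Unicode point above, here is a minimal Python sketch. It assumes, purely for the example, that long "a" (as in "ate") takes a macron and short "a" (as in "am") takes a breve; the actual marks must be read off the scan, and the helper names `mark` and `strip_marks` are made up for this sketch. It shows that the marked letters are ordinary Unicode characters, and that stripping the marks collapses distinct sounds into the same ASCII letter.

```python
import unicodedata

# Combining marks. Which mark belongs to which sound is an ASSUMPTION
# made for illustration, not something taken from the scan.
MACRON = "\u0304"  # COMBINING MACRON
BREVE = "\u0306"   # COMBINING BREVE


def mark(letter, combining):
    """Attach a combining mark and normalize to composed form (NFC)."""
    return unicodedata.normalize("NFC", letter + combining)


def strip_marks(text):
    """Decompose (NFD) and drop every combining mark, leaving base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


long_a = mark("a", MACRON)   # assumed mark for "a" as in "ate"
short_a = mark("a", BREVE)   # assumed mark for "a" as in "am"

# The two vowels are distinct characters in a UTF-8 file...
print(long_a, short_a, long_a == short_a)

# ...but become the same letter once the marks are stripped.
print(strip_marks(long_a) == strip_marks(short_a))
```

Since the txt is already served as UTF-8, keeping the marked characters costs nothing in encoding terms. The information is lost only when the marks are dropped during transcription.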