
On 10/9/2012 10:56 PM, Carlo Traverso wrote: [snip]
In this book one can improve the formatting at will, but its worst failure is in txt.
Well, I wouldn't say that the /worst/ failure is in the text, but you are right that it is pretty bad.
Look at its main page, http://www.gutenberg.org/ebooks/14668, and compare the PDF (which has images) or the original, http://archive.org/details/mcguffeysseconde00mcgu, with the UTF-8 txt that PG provides as the default txt.
I didn't need to go to archive.org, I just went down into the basement and pulled the McGuffey's Readers from the box where they were stored. My edition was printed (not published or copyrighted) by the Fairfax Christian Bookstore in Tyler, Texas. The McGuffey's Readers are still quite popular among home schoolers, presumably because they make no mention of uncomfortable topics such as evolution or climate change. [snip]
All these vowels carry a diacritic sign, easy to represent in Unicode, but here everything is ASCII. These diacritics have been considered irrelevant, but they are the essential feature of this reader.
This was one of the first things I noticed; you don't even need to look at the page scans to realize that this has happened. Looking at the scans (or, in my case, the paper) you realize that these diacritical marks are also missing from the word lists that begin each lesson--a fatal flaw.
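A rough way to confirm the loss without opening the scans is to check whether the file contains any non-ASCII characters at all. The following is only a minimal Python sketch: the function names are hypothetical, and the marked sample word is an illustration I made up, not a transcription from the book. It also shows that a vowel with a macron is a single, ordinary Unicode code point.

```python
import unicodedata

def non_ascii_chars(text):
    """Return the sorted list of distinct non-ASCII characters in text."""
    return sorted({ch for ch in text if ord(ch) > 127})

def with_macron(vowel):
    """Attach COMBINING MACRON (U+0304) and compose to a single code point (NFC)."""
    return unicodedata.normalize("NFC", vowel + "\u0304")

# Illustrative strings only; neither is copied from the PG file or the scans.
ascii_only = "f fife  h him  k cake"
marked = "c" + with_macron("a") + "ke"

print(non_ascii_chars(ascii_only))  # an empty list means no diacritics survived
print(non_ascii_chars(marked))      # lists the marked vowel
```

An empty result for a book whose whole point is its diacritical marks is a strong hint that they were stripped during transcription.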
There are also transcription errors. For example, in this
TABLE 0F ASPIRATES.
Sound  as in    Sound  as in
f      fifi     t      tat
h      him      sh     she
k      kite     ch     chat
p      pipe     th     thick
s      same     wh     why
"fifi" should be "fife", and "kite" should be "cake".
I don't know if you noticed, but the 'O' in "OF" was also replaced by a zero. And in at least one spot (I haven't reviewed the entire file yet) a page header ("ECLECTIC SERIES") was included as part of the text.

I also found the transcriber's note at the beginning of the file a bit gratuitous. I have no problem with notes like that, which are subjective and have no bearing on the actual transcription process, being included in the catalog, but they really shouldn't be stuck in the file.

This file obviously needs a lot of work, and probably is more deserving of attention than many of the recent obscure and esoteric works for which raw page scans are adequate to serve the academic community. I'll try to fix it as best I can, but this will probably take me several weeks. I'll post iterations on my web server as I go along, and provide comments here as iterations are completed.
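For the zero-for-'O' kind of error, a quick scan can flag candidates before a full proofread. This is a minimal Python sketch under my own assumptions: the pattern and the function name are hypothetical, it only looks for the digits 0 and 1 mixed into otherwise alphabetic words, and it will not catch errors like "fifi" or "kite" that are valid-looking words.

```python
import re

# A word is suspect when it consists only of letters and the digits 0/1
# and contains at least one of each (for example "0F" for "OF").
SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*[01])[A-Za-z01]+\b")

def find_digit_letter_mixups(text):
    """Return (line_number, word) pairs for words mixing letters with 0 or 1."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in SUSPECT.finditer(line):
            hits.append((lineno, match.group()))
    return hits

sample = "TABLE 0F ASPIRATES.\nLesson 10 of 71"
print(find_digit_letter_mixups(sample))
```

Pure numbers such as "10" are left alone because they contain no letter, so lesson numbers should not produce noise.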