
"don" == don kretz <dakretz@gmail.com> writes:

    don> Why would text files not be derived from the master like
    don> other formats? I have expected that derivation can only be
    don> automated from greater information density to lower.

    don> I think however that testing ideas early against real data as
    don> you suggest is important.

"Jim" == Jim Adcock <jimad@msn.com> writes:

    Carlo> I also assume that most errata, the important ones, report
    Carlo> a correction to the text, mostly fixing a typo. I suppose that
    Carlo> errata for markup are relatively unusual,

    Jim> This is where we disagree. What I see *overwhelmingly* is
    Jim> that the errors in the files of PG are *overwhelmingly*
    Jim> massively errors of formatting. I would be hard-put to find
    Jim> a half dozen scannos in a given PG file. I can often find
    Jim> 100s if not 1000s of formattos in the same PG file.

    Jim> It's just that somehow the people at PG have become blind and
    Jim> tone-deaf to issues of formatting. Again, things tend to
    Jim> "work" in HTML. It's just the other formats that fall-down
    Jim> so badly.

I see that I expressed myself poorly, since I have been misunderstood (though not by Greg, I believe). Let me try again.

First, what is a master format? It is not a format for distribution; it is the format from which all other formats are derived. Hence it implies a toolchain to derive those formats, and it should be defined so that future formats can be derived from it as well. Master formats are important because a modification (fix) to the master can be reflected in all the distributed formats. Moreover, when epub4 and Zoox formats (based on HTML6) are released, it will be easy to provide good epub4 and Zoox for all the books that have a good, rich master file, taking advantage of the cool new features of HTML6 and epub4, just by adding new formats to the toolchain.

But PG has some 40000 books that don't have a good master format, and fixing a typo in a book that has the standard 4 hand-crafted formats (HTML, txt-UTF-8, txt-8 and txt-7 (ASCII)) requires fixing 4 files and regenerating the other ones. And the problem will become worse if we allow hand-crafted epub and kindle files.

My proposal is a way to simplify the maintenance of the legacy formats for the requests sent to errata-MMX@pglaf.org.

Errata like this one (many are much less clearly stated):

===========================
Title: Astounding Stories of Super-Science, October, 1930
Author: Various
Release Date: September 1, 2009 [EBook #29882]
Language: English

Page 7: "then low whirring noise" should be "then a low whirring noise".
"You can take of your gas mask" should be "You can take off your gas mask".

Page 103: Should "The fumes might attract prowlers" be "The flames might attract prowlers"? The image is very unclear.

Page 118: "subterranean action shock the electron" should be "subterranean action shook the electron".

Page 123: "with what the knew already" should be "with what she knew already".

Page 139: "To the left Is the better path" should be "To the left is the better path" - wrong case.
=========================

I have access to the errata now, so I will be in a position to tell how many formattos are submitted as errata, and to suggest modifications to the errata procedures that would allow automatic correction of all the formats just by correcting the UTF-8 txt.

Of course this does not address the "formattos" (nor, for example, the splitting of a paragraph in 2), but if (as I suspect) errata receives mainly typos, this might be a substantial reduction of the workload for the errata team. And it might allow PG to accept, e.g., hand-crafted epub, and to replace some "bad" autogenerated epubs with your "good" fixed epubs.

Carlo

PS: several posts arrived while I was composing this one. I especially agree with David's last post.
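
To make the idea of correcting all the formats from one text fix a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not an existing PG tool: the function names (`parse_errata`, `propagate`, `to_ascii`) are hypothetical, the errata are assumed to follow the `"wrong" should be "right"` wording of the report above, and the txt-7 spelling is approximated by simply stripping accents. A fix is applied only when the wrong phrase occurs exactly once in a file; anything else (a phrase that is missing, repeated, or split across a line break) is left for a human.

```python
import re
import unicodedata

# Matches the wording used in the report above: "wrong" should be "right".
# Questions such as: Should "X" be "Y"? are deliberately not matched.
ERRATUM_RE = re.compile(r'"([^"]+)"\s+should be\s+"([^"]+)"')


def parse_errata(report):
    """Return a list of (wrong, right) pairs found in a free-text report."""
    return ERRATUM_RE.findall(report)


def identity(text):
    """Spelling used by the UTF-8 file: unchanged."""
    return text


def to_ascii(text):
    """Crude stand-in (assumption) for the txt-7 spelling: strip accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def propagate(report, formats):
    """Apply each erratum to every format.

    formats maps a name to (text, encode), where encode converts a UTF-8
    phrase into the spelling that format uses. Returns the fixed texts and
    a list of (format name, wrong phrase) pairs needing manual attention.
    """
    fixed = {name: text for name, (text, _) in formats.items()}
    manual = []
    for wrong, right in parse_errata(report):
        for name, (_, encode) in formats.items():
            old, new = encode(wrong), encode(right)
            if fixed[name].count(old) == 1:
                fixed[name] = fixed[name].replace(old, new)
            else:
                manual.append((name, wrong))
    return fixed, manual


if __name__ == "__main__":
    report = '"with what the knew already" should be "with what she knew already".'
    formats = {
        "txt-UTF-8": ("She compared it with what the knew already.", identity),
        "txt-7": ("She compared it with what the knew already.", to_ascii),
    }
    fixed, manual = propagate(report, formats)
    for name, text in fixed.items():
        print(name, "->", text)
    print("manual:", manual)
```

The exactly-once rule is a deliberately conservative choice: it trades coverage for safety, so unclear reports (like the Page 103 question above) and formattos still go to the errata team.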