
I would like to propose a new approach. As Greg said, the core problem is how to maintain several formats of ebooks, generated in several ways, when we get errata reports. A master format is useful for this, since a toolchain can then regenerate all the derived versions. My proposal is to have different toolchains for generation and for updates.

I assume that all the formats contain a text, which is substantially the txt70, and that all the formats contain the same text, possibly subdivided into different parts or organized differently (e.g. the footnotes). The encoding is not important: it does not matter if the HTML contains &lt; while the txt contains <, or if the HTML contains &mdash; while the txt contains a dash. So I assume the txt to be UTF-8. (The dash example shows that this might be a bit simplified, but we may consider that "--" encodes a dash in non-UTF-8 PG txt files, exactly as &mdash; encodes a dash in HTML.)

I also assume that most errata, and the important ones, report a correction to the text, mostly fixing a typo. I suppose that errata for markup are relatively unusual.

The problem is that, if the correction is accepted, we want to retrofit it to all the formats. Currently this is done either by manually editing the master, or by manually editing the txt and HTML files and then regenerating the derived formats. If we can automatically retrofit the changes from the txt to the derived formats, then we can allow all the formats that we want.

This is the same type of problem that version control software like svn deals with for concurrent modifications, except that here it should be handled at the character level, not at the line level. The only difference is that we need to operate with wdiff (or better dwdiff, or maybe even at a finer level) instead of diff (and patch). Problems should appear only when changes are made at the interface between words and markup.
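
To make the retrofit idea a bit more concrete, here is a rough Python sketch. It is only an illustration under several assumptions, not a tested tool: it uses the standard difflib module in place of wdiff/dwdiff, it assumes that entities in the HTML have already been decoded so that the text inside the HTML is identical to the old txt, and the names text_positions and retrofit are made up for this example. A change that falls entirely inside plain text is applied to the HTML; a change that touches markup is not applied but reported, so that a human can decide.

```python
import difflib
import re

TAG = re.compile(r"<[^>]*>")  # naive tag matcher; assumes no "<" or ">" in the text itself


def text_positions(html):
    """Strip tags; return the plain text and, for each plain character, its index in html."""
    chars, positions = [], []
    i = 0
    while i < len(html):
        m = TAG.match(html, i)
        if m:
            i = m.end()  # skip over the whole tag
            continue
        chars.append(html[i])
        positions.append(i)
        i += 1
    return "".join(chars), positions


def retrofit(html, old_txt, new_txt):
    """Apply the old_txt -> new_txt character changes to html.

    Returns (new_html, ambiguous), where ambiguous lists the changes that
    touch markup and were therefore left for a human to decide.
    """
    plain, pos = text_positions(html)
    if plain != old_txt:
        raise ValueError("the text inside the HTML differs from the old txt")
    matcher = difflib.SequenceMatcher(None, old_txt, new_txt, autojunk=False)
    ambiguous = []
    # Work from right to left so that earlier html indices stay valid.
    for op, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):
        if op == "equal":
            continue
        if i1 == i2:  # pure insertion between two plain characters
            start = pos[i1 - 1] + 1 if i1 > 0 else 0
            end = pos[i1] if i1 < len(pos) else len(html)
        else:  # replacement or deletion of old_txt[i1:i2]
            start, end = pos[i1], pos[i2 - 1] + 1
        if end - start != i2 - i1:  # markup lies inside or next to the change
            ambiguous.append((i1, i2, new_txt[j1:j2]))
            continue
        html = html[:start] + new_txt[j1:j2] + html[end:]
    return html, ambiguous


# A fix inside a word, far from markup, is applied automatically.
print(retrofit("He said <i>arid</i> then", "He said arid then", "He said and then"))
# A fix next to markup is left alone and reported as (start, end, new text).
print(retrofit("<i>go</i>", "go", "go!"))
```

The second call leaves the HTML unchanged and reports the pending insertion, which is exactly the kind of case that would need a policy or a manual decision.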

If you change "arid" to "and" and in the HTML you have <i>arid</i>, it is clear that it has to become <i>and</i> (we have changed "ri" to "n", far from markup). Problems arise if you change "go" to "go!": should you change <i>go</i> to <i>go</i>! or to <i>go!</i>?

I am willing to investigate the issue on the basis of the errata reports that we receive, and possibly to design software that can automatically correct all the formats (HTML, epub, mobi, whether automatically generated or hand-crafted) from patches for the txt.

I was formerly on the errata@pglaf.org list; could I be added to the errata2010 mailing list for the purpose of studying the problem? Possibly with access to the archives, or at least part of them.

Thanks.

Carlo