[gutvol-d] A new viewpoint (was: Re: gee)

5 Feb 2012

I would like to propose a new approach. As Greg said, the core problem
is how to maintain several formats of ebooks generated in several ways
when we get errata reports. The master format is useful for this since
a toolchain can then regenerate all the derived versions.

My proposal is to have different toolchains for generation and
updates.

I assume that all the formats contain a text, and this is
substantially the txt70; and that all the formats contain the same
text, possibly subdivided into different parts, organized differently
(e.g. the footnotes). The encoding is not important: it does not
matter if the HTML contains &lt; while the txt contains <, or if the
HTML contains &mdash; while the text contains a dash. So I assume the
txt to be UTF-8 (the dash example shows that this might be a bit
simplified, but we may consider that "--" encodes a dash in non-UTF8
PG txt files, exactly as &mdash; encodes a dash in HTML).
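
A minimal sketch, in Python, of what "the same text regardless of
encoding" could mean in practice. The function names and the sample
lines are hypothetical, and the tag stripping is deliberately naive
(it assumes no literal "<" or ">" inside the HTML text itself):

```python
import html
import re


def normalize_txt(txt: str) -> str:
    """Treat the ASCII convention "--" as a real dash (U+2014)."""
    return txt.replace("--", "\u2014")


def normalize_html(fragment: str) -> str:
    """Drop the tags, then decode entities such as &lt; and &mdash;."""
    without_tags = re.sub(r"<[^>]+>", "", fragment)
    return html.unescape(without_tags)


# Hypothetical sample: the same sentence as it might appear in txt and HTML.
txt_line = "He said 1 < 2--and left."
html_line = "<p>He said 1 &lt; 2&mdash;and <i>left</i>.</p>"

# After normalization both formats carry the same text.
assert normalize_txt(txt_line) == normalize_html(html_line)
```

Once both sides are reduced to the same character sequence, a
correction located in the txt can be located in the HTML as well.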

I also assume that most errata, the important ones, report a
correction to the text, mostly fixing a typo. I suppose that errata
for markup are relatively unusual.

The problem is that, if the correction is accepted, we want to
retrofit the correction to all the formats. Currently, this is done
either by manually editing the master, or by manually editing the txt
and HTML files, and then regenerating the derived formats.

If we can automatically retrofit the changes from txt to the derived
formats, then we can allow all the formats that we want. This is the
same type of problem that arises with concurrent modifications in
version control software like svn, except that here the problem should
be handled at the character level, not at the line level. The only
difference is that we need to operate with wdiff (or better dwdiff, or
maybe even at a finer level) instead of diff (and patch). The
problems might appear only if changes are made at the interface of
words and markup. If you change "arid" to "and" and in HTML you have
<i>arid</i>, it is clear that it has to become <i>and</i> (we have
changed "ri" to "n" far from markup). Problems arise if you change
"go" to "go!": should you change <i>go</i> to <i>go!</i> or to
<i>go</i>! ?
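
A rough sketch of the idea in Python, not a finished design. It uses
the standard difflib module for the character-level diff rather than
wdiff or dwdiff, and it assumes the word is wrapped in an <i> tag in
the HTML. The names retrofit, text_positions and AmbiguousEdit are
hypothetical; entities and literal "<" or ">" in the text are not
handled.

```python
import difflib


class AmbiguousEdit(Exception):
    """Raised when a change touches the boundary between text and markup."""


def text_positions(html_src):
    """Return the offsets in html_src of every character outside a tag."""
    positions = []
    in_tag = False
    for offset, ch in enumerate(html_src):
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
        elif not in_tag:
            positions.append(offset)
    return positions


def retrofit(old_txt, new_txt, html_src):
    """Apply the character-level change old_txt -> new_txt to html_src."""
    pos = text_positions(html_src)
    if "".join(html_src[p] for p in pos) != old_txt:
        raise ValueError("HTML text does not match the old txt")
    matcher = difflib.SequenceMatcher(None, old_txt, new_txt, autojunk=False)
    result = html_src
    # Work from right to left so that earlier offsets stay valid.
    for op, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):
        if op == "equal":
            continue
        if i1 == i2:
            # Pure insertion: safe only if no tag sits at the insertion point.
            start = pos[i1 - 1] + 1 if i1 > 0 else 0
            end = pos[i1] if i1 < len(pos) else len(html_src)
            if start != end:
                raise AmbiguousEdit(f"insert {new_txt[j1:j2]!r} next to markup")
            end = start
        else:
            # Replace or delete: safe only if the span contains no tag.
            start, end = pos[i1], pos[i2 - 1] + 1
            if end - start != i2 - i1:
                raise AmbiguousEdit(f"change {old_txt[i1:i2]!r} spans markup")
        result = result[:start] + new_txt[j1:j2] + result[end:]
    return result


# "ri" -> "n" happens away from the tags, so it can be applied directly.
print(retrofit("arid", "and", "<i>arid</i>"))  # prints <i>and</i>

# "go" -> "go!" inserts a character next to </i>: left for a human to decide.
try:
    retrofit("go", "go!", "<i>go</i>")
except AmbiguousEdit as exc:
    print("needs review:", exc)
```

The sketch refuses to guess in the ambiguous case instead of picking
one of the two readings; how often that case occurs in real errata is
exactly what would need to be measured.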

I am willing to investigate the issue on the basis of the errata
reports that we receive, and possibly design software that can
automatically correct all the formats (HTML, epub, mobi, whether
automatically derived or hand-crafted) from patches for txt.

I was formerly on the errata@pglaf.org list. Could I be added to the
errata2010 mailing list for the purpose of studying the problem,
possibly with access to the archives, or at least part of them?

Thanks. Carlo

traverso＠posso.dm.unipi.it