Re: [gutvol-d] RST/PGTEI/etc

14 Feb 2012

"Jim" == Jim Adcock <jimad@msn.com> writes:
    Don> What I find remarkable is that after 2 decades anyone would
    Don> expect the Project Gutenberg old guard to do anything other
    Don> than the same thing they've been doing.

    Greg> ....Which is to leave such decisions to the eBook's
    Greg> submitter(s).

    Jim> Again, I have written software that would allow one to
    Jim> back-align PG works to "the original text" even when they are
    Jim> not "identical" texts, and can reintroduce page numbers and
    Jim> "original" line breaks.  It's in a crude state right now,
    Jim> because no one has actually expressed an interest. My intent
    Jim> was that it could be used by DP to reprocess old crufty PG
    Jim> files back through their system (which it could be used for)
    Jim> if they wanted to [so that no one at DP really has ANY excuse
    Jim> to complain about independently produced books] or it could
    Jim> be used by someone wanting to back-submit to archive.org. Or
    Jim> it could be used to pursue "more scholarly" versions.

    Jim> The software "works" by taking one "polished" PG text and one
    Jim> "unpolished" text, say raw OCR, Levenshtein-matching them on
    Jim> word tokens, and then cloning the formatting whitespace from
    Jim> the one to the other. It can also clone over the page numbers.

    Jim> In general, obviously, if you want to, say, produce a
    Jim> "scholarly" edition from a PG text you're going to have to
    Jim> re-proof your book after performing such back matching.  My
    Jim> software can help with that too.
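
The matching step described above (align the two texts on word tokens,
then clone the whitespace across) can be sketched roughly as follows.
This is not Jim's software, which has not been shown here. It is a
minimal Python illustration that swaps in difflib's SequenceMatcher for
a true Levenshtein alignment; the function name clone_whitespace and the
handling of unmatched words are assumptions made for the example, and
page numbers are not handled.

```python
import difflib
import re


def clone_whitespace(polished: str, raw: str) -> str:
    """Return the words of `polished` laid out with the whitespace of `raw`.

    Hypothetical helper: difflib stands in for a Levenshtein word alignment.
    """
    pol_words = polished.split()
    # Each raw word together with the whitespace run that follows it.
    raw_pairs = re.findall(r"(\S+)(\s*)", raw)
    raw_words = [word for word, _ in raw_pairs]
    raw_gaps = [gap for _, gap in raw_pairs]

    matcher = difflib.SequenceMatcher(None, raw_words, pol_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # Words found only in the raw text are dropped.
            continue
        if tag == "insert":
            # Words found only in the polished text get a plain space.
            gaps = [" "] * (j2 - j1)
        elif i2 - i1 == j2 - j1:
            # Same number of words on both sides: copy gaps one for one.
            gaps = raw_gaps[i1:i2]
        else:
            # Uneven replacement: keep only the gap that closes the raw span.
            gaps = [" "] * (j2 - j1 - 1) + [raw_gaps[i2 - 1]]
        for word, gap in zip(pol_words[j1:j2], gaps):
            out.append(word + gap)
    return "".join(out)
```

For example, clone_whitespace("The quick brown fox jumps",
"The qnick\nbrown fox\njumps\n") keeps the corrected word "quick" but
breaks the lines where the raw text did. The result would still need
re-proofing wherever the two texts disagree.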

A much simpler approach: run wdiff (or dwdiff) on the two texts, then use
a regexp to keep the whitespace from one and the non-whitespace from the
other. I would be interested in comparing your tool with my wdiff approach.
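
As an illustration only, the regexp step could look something like the
Python sketch below. It assumes wdiff's default markers, [-...-] for
deleted words and {+...+} for inserted ones, and it works on text that
wdiff has already produced. The function name keep_inserted is made up
for the example, and which file's whitespace survives in the wdiff
output is not checked here.

```python
import re

# wdiff's default markers; the tempered patterns stop at the first closer.
DELETED = r"\[-(?:(?!-\]).)*-\]"
INSERTED = r"\{\+((?:(?!\+\}).)*)\+\}"


def keep_inserted(wdiff_output: str) -> str:
    """Resolve wdiff markers in favour of the second file's words."""
    # A replaced word: keep the new word, drop the old one and the gap
    # between the two markers.
    text = re.sub(DELETED + r"\s*" + INSERTED, r"\1", wdiff_output, flags=re.S)
    # A pure deletion: drop the word and one following whitespace character.
    text = re.sub(DELETED + r"\s?", "", text, flags=re.S)
    # A pure insertion: keep the word, drop the markers.
    return re.sub(INSERTED, r"\1", text, flags=re.S)
```

Given "The [-qnick-] {+quick+} brown\nfox", this returns
"The quick brown\nfox": the word comes from the second file and the line
break is left where wdiff put it.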

Carlo

traverso＠posso.dm.unipi.it