
On Tue, December 13, 2011 8:35 pm, Jim Adcock wrote:
What I am not clear about is why BB insists that what one starts from must be "an impoverished text-file", because I never work with text files per se until I am forced to derive one at the end of my html development, as a needless extra step, in order to get the PG WWers to accept my html work. I do not start with "an impoverished text-file" for the simple reason that my OCR gives me better file format choices which help preserve more of the information available in the original page images, such that I do not have to rediscover and re-enter that information again later manually -- after needlessly throwing that information away in the first place just to reduce the OCR result to txt70.
I don't get this either. I /never/ start with impoverished text. (Well, okay, I did one once. But it was sooooo painful that I vowed I would never do it again.) If FineReader is offering to save my OCR as HTML (class 2 tag soup), why would I not accept the offer? Use a tool like Tidy to convert to XHTML and I have a file that can easily be manipulated with scripts as well as plain text editors.

This is one of the reasons I want so badly to see kenh's script from archive.org, or at least to have it running. BB is right that the OCR text at Internet Archive is unusable as a starting point for e-books, but I think I could work with the HTML output from that script. If I knew how it worked, I could probably even replicate it in a different programming language for off-line use.

Heck, even the stuff at Distributed Proofreaders nowadays has a modicum of HTML embedded in it. About the only reason you would need to start with plain text is if you're trying to fix the early e-texts in the PG corpus -- not a bad idea, but there are better places to start if that's really what you're trying to do.
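For anyone who hasn't tried it, the Tidy-plus-scripts step looks roughly like the sketch below. This is just an illustration, not my exact workflow: it assumes the OCR program has saved its HTML to a file, that HTML Tidy and Python with the lxml library are available, and that the file names are placeholders.

    # A sketch of cleaning OCR tag soup into something scriptable.
    # First run HTML Tidy to get well-formed XHTML:
    #
    #   tidy -q -asxhtml -utf8 -o book.xhtml book.html
    #
    # Then strip the inline presentation the OCR tends to leave behind.
    from lxml import html

    doc = html.parse("book.xhtml")              # placeholder file name
    root = doc.getroot()

    # Drop inline style/class attributes scattered through the OCR output.
    for el in root.iter():
        if not isinstance(el.tag, str):         # skip comments and PIs
            continue
        el.attrib.pop("style", None)
        el.attrib.pop("class", None)

    # Unwrap <span> elements that no longer carry any attributes.
    for span in root.findall(".//span"):
        if not span.attrib:
            span.drop_tag()                     # keeps the text, drops the tag

    doc.write("book-clean.xhtml", method="xml", encoding="utf-8")

After something like that, the file opens cleanly in a plain text editor and is regular enough to keep hacking on with further scripts, which is the whole point of not throwing the markup away.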