Re: [gutvol-d] a review of some digitization tools -- 018

14 Dec 2011

      Hi Jim,

I cannot tell you why BB does it, but I might be able to explain some of the caveats
of the approach.

The first is that it is hard to separate the markup from the text, process the text,
and put the two back together correctly. Why do you think that MS et al. produce such
crappy code?

The other is that any automatic processing needs a known state or structure.
The more you know, or can correctly assume, about the input, the better an algorithm
will work.

In the end it does not actually matter what format it is. But PG does want a simple text
file, and it is easy to work with these "impoverished" files. It is easier to go from these
to something PG will expect than from a more complex layout, where you have more work to do.

It is always easier to go from simple to more complex than from complex to simple.
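
To make the simple-to-complex direction concrete, here is a minimal Python sketch. It is
only an illustration, not anything PG or BB actually uses: the function names and the
sample text are made up, and it assumes the plain text marks paragraphs with blank lines.
Because the input has a known structure, both the HTML and the 70-column text fall out
of the same few lines.

```python
import html
import re
import textwrap


def paragraphs(text):
    """Split plain text into paragraphs on blank lines, joining wrapped lines.

    Assumption: a blank line is the only paragraph marker in the input.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


def to_html(text):
    """Simple -> complex: escape each paragraph and wrap it in <p> tags."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs(text))


def to_wrapped_text(text, width=70):
    """Simple -> wrapped text: re-wrap each paragraph at `width` columns."""
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs(text))


# Hypothetical sample input, two paragraphs with a hard line break in the first.
sample = "It was a dark & stormy\nnight.\n\nThe rain fell in torrents."
print(to_html(sample))
print(to_wrapped_text(sample))
```

Going the other way, from HTML back to plain text, means first deciding which tags carry
information and which can be thrown away, and that is where the extra work comes in.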

regards
	Keith.

On 14 Dec 2011, at 04:35, Jim Adcock wrote:
...
What I am not clear about is why BB insists that what one starts from must
be "an impoverished text-file" because I never work with text files per
se until I am forced to derive one at the end of my html development as a
needless extra step in order to get the PG WWers to accept my html work.  I
do not start with an "an impoverished text-file" for the simple reason that
my OCR gives me better file format choices which help preserve more of the
information available in the original page images, such that I do not have
to rediscover and re-enter that information again later manually -- after
needlessly throwing that information away in the first place just to reduce
the OCR result to txt70.
PS: I call it "txt70" for the simple reason that I wish to distinguish that
what PG insists one submit is not a text file in any normal sense, anymore
than ZML is a normal text file in any normal sense.  At least ZML has the
arguable advantage that it retains the original line breaks -- but I have
shown how these can be easily rederived.  And the txt70 has a PG-specific
requirement to put in manual line breaks at about every 70 chars, not to
mention reimagining some of the standard ASCII code points as prosodic
markers. PG'ers tend to spend so much time smelling their own roses that
they forget that that which they call a text file really isn't a text file,
anymore than the contents of an html file, or of a ZML file, is a text file.

Keith J. Schultz