hey jim, if you're still around (and you should be),
then you'll want to know i've started on that book,
"the wings of the dove", by henry james.
i've mounted it on my website; you can find it here:
> http://z-m-l.com/go/wotdj/wotdjp123.html
that's a page-by-page rendering, with a form at
the bottom of each page for _reporting_errors_...
you will see that the text for each page is given,
as well as the scan for that page, for verification.
this is in keeping with your excellent suggestion
that p.g. should mount a book in such a way that
other people can come along later to improve it.
both the text and the scans are from archive.org.
(i used the scan from the university of california.)
as many people here undoubtedly know, the o.c.r.
done by the o.c.a. is dreadful. most people likely
think it's dreadful in the way that much o.c.r. is,
namely, that it's filled with misrecognition errors.
but the o.c.r. from the o.c.a. is worse. much worse.
that's because the tech people there mishandle it.
specifically, they _lose_ the em-dashes in the text!
oh, the o.c.r. recognizes the em-dashes, but then
-- somewhere in their file-handling workflow --
the o.c.a. "tech people" there lose the em-dashes!
for example, look at page 9 from the book:
> http://z-m-l.com/go/wotdj/wotdjp009.html
you'll see that the em-dashes in the last paragraph
have been dropped from the text. it's unbelievable!
this problem _alone_ is enough to make the o.c.r.
totally unworkable. but this isn't the only problem.
(i tried to restore the em-dashes programmatically,
by coding a tool, but it's less work to redo the o.c.r.)
that's not all. there are more problems...
if you look at the text more closely, you will see that
the techies also lost the apostrophes in contractions!
i did a few global changes to restore _some_ of 'em,
like in the contraction "i'm", but i didn't fix them all...
this is not a problem that is _common_ to o.c.a. books,
but it's not a _rare_ occurrence either. stunning idiocy.
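for anyone curious, here is a rough python sketch of what a global
change like that might look like. it is an illustration only, not the
actual changes that were run on this book. the word table and the name
restore_apostrophes are made up, and it assumes the apostrophe was simply
dropped with no space left behind (so "I'm" shows up as "Im").

```python
import re

# hypothetical table: apostrophe-less forms that are not ordinary words,
# so replacing them everywhere is reasonably safe
SAFE_FIXES = {
    "im": "i'm",
    "ive": "i've",
    "dont": "don't",
    "didnt": "didn't",
    "doesnt": "doesn't",
    "isnt": "isn't",
    "wasnt": "wasn't",
    "couldnt": "couldn't",
    "wouldnt": "wouldn't",
    "youre": "you're",
}

# match any of the forms above, but only as whole words
PATTERN = re.compile(r"\b(" + "|".join(SAFE_FIXES) + r")\b", re.IGNORECASE)


def restore_apostrophes(text):
    """put apostrophes back into the contractions listed in SAFE_FIXES."""
    def fix(match):
        word = match.group(0)
        fixed = SAFE_FIXES[word.lower()]
        # keep an initial capital if the text had one
        if word[0].isupper():
            fixed = fixed[0].upper() + fixed[1:]
        return fixed
    return PATTERN.sub(fix, text)


print(restore_apostrophes("Im sure she didnt know"))
```

forms like "its", "well", "shell" and "wont" are left out of the table
on purpose, since they are real words too. those are the ones that
can't be fixed with a blind global change and need a human to look.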
and further, the hyphens on end-line hyphenates are
missing as well! this sometimes happens in the o.c.r.,
so i'm not sure if that's what happened with this book
or if end-line hyphens were lost in the o.c.a. workflow,
but whatever it was, damage to the text is considerable.
and like i said, these problems are rather pervasive...
it's really ridiculous -- and quite sad -- that the people
in charge of the technology over at the o.c.a. are idiots
who have built a workflow that actually damages text...
what's even worse is that -- when i've brought this to
their attention -- they've responded with ad hominem
attacks on _me_, as if _i_ were the guilty perpetrator...
eventually, when i persisted, they finally consented to
solve the worst of the problems -- the em-dashes --
but i don't know if they ever did solve the problem...
meanwhile, they've banned me from their listservs,
so they wouldn't have to listen to my persistent posts.
talk about killing the messenger! it's appalling...
the main reason this is so troubling is that the o.c.a.
are supposed to be "the good guys", who are the only
competitor to google. and they are badly incompetent.
plus they have thin skins to boot, and they would rather
_silence_ the people who point out their problems than
do the work that will solve their self-induced problems.
this does not bode well for our future. not well at all...
anyway...
the main benefit of the o.c.r. from the o.c.a. is that
it has retained the structure of the original book,
meaning its linebreaks and its pagebreaks.
that means we can use a clever mash-up tool (requiring
lots of elbow-grease) to take the cleaned-up text from jim
and hang it on the structural scaffolding of the o.c.a. text.
first, let's look at the o.c.a. text:
> http://z-m-l.com/go/wotdj/wotdj.zml
this is the single-file version of the .zml text for this book,
the file that was used to generate the page-by-page view...
now, in a separate window side-by-side with the above,
let's load in (a slightly reworked version of) jim's text:
> http://z-m-l.com/go/wotdj/wotdj.txt
you'll see that you're able to match up the paragraphs...
for instance, do a search for "she looked about her and"
to find that paragraph in both windows, to sync them...
so essentially, what we want to do is take the linebreaks
and pagebreaks from the o.c.a. file and inject them into
the (clean) e-text. we want to reintroduce the structure.
(which p.g. should have never stripped in the first place.)
or, to look at it in the other direction, we want to replace
all of the incorrect lines of text in the o.c.a. version with
the good, cleaned equivalent text from the p.g. version...
once we do that, we'll have a good clean structured e-text.
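to make the idea concrete, here is a minimal python sketch of the
linebreak half of that job. it is one possible approach, not the
mash-up tool itself: it lines up the words of the two texts with the
standard difflib module and then cuts the clean text wherever the
o.c.a. text had a line ending. the function name inject_linebreaks is
made up, and the sketch assumes both texts cover the same passage and
that pagebreak markers have been set aside beforehand.

```python
import difflib


def inject_linebreaks(ocr_lines, clean_text):
    """re-break clean_text so its lines follow the lines in ocr_lines."""
    ocr_words = []
    line_ends = []  # index into ocr_words just past each line's last word
    for line in ocr_lines:
        ocr_words.extend(line.split())
        line_ends.append(len(ocr_words))

    clean_words = clean_text.split()
    matcher = difflib.SequenceMatcher(
        None, ocr_words, clean_words, autojunk=False
    )

    # map each word boundary in the ocr text to a boundary in the clean text
    boundary = {0: 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for p in range(i1 + 1, i2 + 1):
            boundary[p] = min(j1 + (p - i1), j2)

    out = []
    start = 0
    for n, end in enumerate(line_ends):
        # the last line soaks up any clean words that are left over
        last = n == len(line_ends) - 1
        stop = len(clean_words) if last else boundary[end]
        out.append(" ".join(clean_words[start:stop]))
        start = stop
    return out


ocr = [
    "she looked about her and",
    "saw the room the narrow",
    "street; she didnt move",
]
clean = ("she looked about her and saw the room--the narrow "
         "street; she didn't move")
for line in inject_linebreaks(ocr, clean):
    print(line)
```

in that made-up sample the o.c.a. lines have lost an em-dash and an
apostrophe, and the output keeps the same three-line layout while
carrying the corrected words. where the damage falls right on a line
ending, the cut can land a word off, which is where the elbow-grease
comes in.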
we'll get to that this week. gotta get some vitamin d now...
-bowerbird