
hey jim, if you're still around (and you should be), then you'll want to know i've started on that book, "the wings of the dove", by henry james. i've mounted it on my website; you can find it here:
that's a page-by-page rendering, with a form at the bottom of each page for _reporting_errors_... you will see that the text for each page is given, as well as the scan for that page, for verification. this is in keeping with your excellent suggestion that p.g. should mount a book in such a way that other people can come along later to improve it.

both the text and the scans are from archive.org. (i used the scan from the university of california.)

as many people here undoubtedly know, the o.c.r. done by the o.c.a. is dreadful. most people likely think it's dreadful in the way that much o.c.r. is, namely, that it's filled with misrecognition errors. but the o.c.r. from the o.c.a. is worse. much worse. that's because the tech people there mishandle it.

specifically, they _lose_ the em-dashes in the text! oh, the o.c.r. recognizes the em-dashes, but then -- somewhere in their file-handling workflow -- the o.c.a. "tech people" lose them! for example, look at page 9 from the book:
you'll see that the em-dashes in the last paragraph have been dropped from the text. it's unbelievable! this problem _alone_ is enough to make the o.c.r. totally unworkable. (i tried to restore the em-dashes programmatically, by coding a tool, but it's less work to redo the o.c.r.)

but that's not the only problem... if you look at the text more closely, you will see that the techies also lost the apostrophes in contractions! i did a few global changes to restore _some_ of 'em, like in the contraction "i'm", but i didn't fix them all... this is not a problem that is _common_ to o.c.a. books, but it's not a _rare_ occurrence either. stunning idiocy.

and further, the hyphens on end-line hyphenates are missing as well! this sometimes happens in the o.c.r. itself, so i'm not sure if that's what happened with this book or if the end-line hyphens were lost in the o.c.a. workflow, but whatever it was, the damage to the text is considerable. and like i said, these problems are rather pervasive...

it's really ridiculous -- and quite sad -- that the people in charge of the technology over at the o.c.a. are idiots who have built a workflow that actually damages text...

what's even worse is that -- when i've brought this to their attention -- they've responded with ad hominem attacks on _me_, as if _i_ were the guilty perpetrator... eventually, when i persisted, they finally consented to solve the worst of the problems -- the em-dashes -- but i don't know if they ever did solve it... meanwhile, they've banned me from their listserves, so they wouldn't have to listen to my persistent posts. talk about killing the messenger! it's appalling...

the main reason this is so troubling is that the o.c.a. are supposed to be "the good guys", who are the only competitor to google. and they are badly incompetent.
plus they have thin skins to boot, and they would rather _silence_ the people who point out their problems than do the work that will solve their self-induced problems. this does not bode well for our future. not well at all...

anyway...

the main benefit of the o.c.r. from the o.c.a. is that it has retained the structure from the original book. that means we can use a clever mash-up tool (requiring lots of elbow-grease) to take the cleaned-up text from jim and hang it on the structural scaffolding of the o.c.a. text.

first, let's look at the o.c.a. text:
this is the single-file version of the .zml text for this book, the file that was used to generate the page-by-page view... now, in a separate window side-by-side with the above, let's load in (a slightly reworked version of) jim's text:
you'll see that you're able to match up the paragraphs... for instance, do a search for "she looked about her and" to find that paragraph in both windows, to sync them...

so essentially, what we want to do is take the linebreaks and pagebreaks from the o.c.a. file and inject them into the (clean) e-text. we want to reintroduce the structure. (which p.g. should have never stripped in the first place.)

or, to look at it in the other direction, we want to replace all of the incorrect lines of text in the o.c.a. version with the good, cleaned equivalent text from the p.g. version...

once we do that, we'll have a good clean structured e-text.

we'll get to that this week. gotta get some vitamin d now...

-bowerbird
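
p.s. a rough sketch of that linebreak-injection idea, in python (standard library only). this is not the actual mash-up tool, just one way it might go, and it rests on some assumptions: the name rewrap() is made up for illustration, the word-by-word alignment via difflib.SequenceMatcher is a stand-in technique, and pagebreak markers are not handled at all, since the marker format of the .zml file isn't shown here. where the two texts disagree, the break position is only a proportional guess.

```python
import difflib


def rewrap(ocr_text: str, clean_text: str) -> str:
    """Re-flow clean_text so its line breaks fall where ocr_text's do.

    Hypothetical helper: aligns the two texts word by word and carries
    each line break in the o.c.r. text over to the clean text.
    """
    # Flatten the o.c.r. text to words, noting how many words precede
    # each line break (a blank line just repeats the previous count).
    ocr_words, breaks = [], []
    for line in ocr_text.splitlines():
        ocr_words.extend(line.split())
        breaks.append(len(ocr_words))
    clean_words = clean_text.split()

    # Map every word boundary in the o.c.r. text to one in the clean text.
    matcher = difflib.SequenceMatcher(None, ocr_words, clean_words,
                                      autojunk=False)
    boundary = {0: 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        span_i, span_j = i2 - i1, j2 - j1
        for k in range(span_i + 1):
            if tag == "equal" or span_i == 0:
                j = j1 + min(k, span_j)
            else:
                # Inside a mismatch: spread the breaks proportionally.
                j = j1 + round(k * span_j / span_i)
            boundary.setdefault(i1 + k, j)

    # Cut the clean words at the mapped boundaries.
    lines, prev = [], 0
    for b in breaks:
        cut = boundary.get(b, prev)
        lines.append(" ".join(clean_words[prev:cut]))
        prev = cut
    if prev < len(clean_words):          # safety net for leftovers
        lines.append(" ".join(clean_words[prev:]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample: the o.c.r. side has lost its dash and its casing.
    ocr = "she looked about her and\nsaw nothing it was all\nquite still"
    clean = "She looked about her and saw nothing -- it was all quite still."
    print(rewrap(ocr, clean))
```

the dash that is missing from the o.c.r. side simply rides along with the clean words, so the damaged text only has to be good enough to align against. blank lines in the o.c.r. text come through as blank lines in the output, which keeps the paragraph breaks.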