
here is a jumble of responses to various things that were said recently. i think it's got a pretty good flow, so i haven't bothered to put it in chronological order. but if you think that misrepresents anything, say so... *** don said:
I've scanned and reviewed the proofing on a lot of books, all ABBYY-OCR'ed, and one error per 10 pages exhibits good book selection, Abbyy management skills I wish I had, or maybe books with 12 lines per page.
i was talking about _after_ the preprocessing, not before... raw o.c.r. has a typical rate of about 1 error per page on the relatively clean and simple and straightforward books. on a simple book, like the "betty lee" that roger was using, you can fix about 90% of those errors in roughly one hour. (assuming you have a good tool, which ain't hard to build.) another hour (or two) will fix 80% of the remaining errors, meaning you are left with ~2% of the original o.c.r. errors. at that point, the wise course of action is to send the book to the smoothreaders, and count on them to find the rest... for the encyclopedia stuff you do, don, it takes more work. more time, and more passes, and less accuracy at the end. how much, i can't say, because i don't have the experience.
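to make the arithmetic concrete, here is a minimal sketch of those two passes. the 240-page count and the 1-error-per-page rate are the rough figures from this thread, and the function name is just something made up for illustration, so treat the output as a ballpark, not a measurement.

```python
def remaining_errors(initial_errors, pass_fix_rates):
    """Return the number of errors left after each correction pass."""
    remaining = float(initial_errors)
    history = []
    for rate in pass_fix_rates:
        # each pass removes a fraction of whatever is still left
        remaining *= (1.0 - rate)
        history.append(remaining)
    return history


# assumptions: a 240-page book with about 1 raw o.c.r. error per page
pages = 240
raw_errors = pages * 1

# pass 1 fixes ~90% of the errors, pass 2 fixes ~80% of the remainder
after = remaining_errors(raw_errors, [0.90, 0.80])
for n, left in enumerate(after, start=1):
    share = left / raw_errors
    print(f"after pass {n}: {left:.1f} errors left ({share:.0%} of original)")
```

under those assumptions, you'd go from about 240 errors to about 24, and then to about 5, which is the ~2% that gets handed to the smoothreaders.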
I'm pretty sure we could do better work, quicker, if we proofed in parallel, then compared (and merged) results
i don't think the "serial/parallel" distinction matters much. as long as the work of both individuals gets incorporated -- which it does, in both -- i can see no quality difference. as for parallel being "quicker", it will be, _while_proofing,_ since person "a" can start at the same time as person "1". but once you add in the time for "comparison and merge", you've likely lost any advantage you had, and maybe more. i don't think anyone here has summoned up more evidence for the value of comparison/merge than i have, but if you are arguing that it's quicker and better, i probably disagree. (most of my research involved a pre-existing digitization, meaning that there was _no_ cost to generate that corpus. and sure, in _that_ type of situation, it is clearly "quicker".) *** roger said:
there are several more regex checks that would have helped and I'll bake those in.
a reg-ex path is almost certainly the wrong way to go. yeah, a small number are necessary and efficient, but as you get more and more, their signal-to-noise ratio starts to rot, and you waste your time on false alarms. the most effective means is also the easiest one of all, namely a straightforward spellcheck. i looked closely, and documented it carefully and extensively on this list, so i can tell you spellcheck does most of what you need. where reg-ex shines is with punctuation abnormalities. but if you're having those kinds of problems, then you will need to have help from smoothreaders _anyway_... *** don said:
Are you trying to proof and format at the same time? I did that experiment several years ago with the same effect.
sadly, it will take a long time to erase the d.p. myths... the reason proofing and formatting do not get along at d.p. is because their system mixes pseudo-markup with plain-text which is shown in an .html text-area... the pseudo-markup _interferes_ with the plain-text -- since, face it, markup is inherently obstructive -- so yes, that makes "proofing-while-formatting" hard. but also causing the difficulty is the simple fact that plain-text does not resemble the p-book page-scan. the font is monospaced... also, the paragraphs are (a) not indented, and (b) separated by a blank line... italics are not displayed. blockquotes aren't set off. headings aren't centered. poetry isn't formatted right. in short, there is _nothing_ about their interface that aids in the task of _either_ proofing _or_ formatting! so if you try to do _both_, plus deal with the obstacles of in-line markup, you're really fighting a losing battle. and the answer is as obvious as the nose on your face:
look at the middle pane. it looks similar to the scan. it's not an "exact" match, but it's getting pretty close. the paragraphs _look_like_ the ones on the page-scan. the blocks of text between the two match quite nicely. the fonts are fairly equivalent, the leading is identical, there are curly-quotes, matching the ones on the scan. and the italics, if there were any, would _be_ italicized. so the only real discrepancy is the lack of justification. with this interface, you can indeed proof-and-format at the same time, and actually succeed. at least i can. i think results would be better than the d.p. interface. because there's no drag on proofing-and-formatting. that doesn't mean i think it's a good idea to do both... as i said earlier, it's important to keep a laser-focus... but a person _could_ do both, if they really wanted to. so yes, don, i agree with you that one shouldn't try to proof-and-format at the same time. but _not_ due to the "it can't be done" reasoning that you have implied, a message that's been promulgated by d.p. ineptitude. d.p. just uses a bad interface that makes it difficult... *** roger said:
Yes, because removing formatting only to have to put it back in seems counterproductive, especially since Abbyy's output has a higher catch percentage than I believe a foofer does.
well, ok, first of all, part of your reasoning is based on _your_ failure on this task, and i don't think your performance is too useful. as far as i know, you do too little formatting on d.p. to be _experienced_ at it... and yes, spotting italics is a task that takes practice... i'd also expect that, having been humbled, you'd be careful next time, and thus better. so before too long at all, you'd be great at it. (even so, even the great ones do miss some.) to be sure, i am _not_ saying that it helps to "try harder" or "be careful" or "pay attention". to me, it seems like explicit consciousness might well hinder the task, not facilitate it... spotting italics might be a right-brain thing; i've noticed that sometimes i will view a page, and "see nothing", so i click to the next page, but then i "have a feeling" that i missed one, so i'll turn back, and sure enough, there it is. so, like i said, it feels like a right-brain thing. and thus it might be better to be "unfocused". you can just look at a page and get a gestalt that tells you if the slant of an italic is there. whatever the case, it's definitely the situation that you get better the more often you do it. further, when i was doing it frequently, i got _faster_ at it, even as my accuracy improved. (so i tried to force myself to go even faster, but it didn't work. had to happen naturally.) but anyway, now i am out of practice again. having said all of that, however, i recommend that people use the .rtf version, since it does do a fairly good job these days on the italics... you still need to check, though, so on a book like this one, with easy and infrequent italics, it won't matter too much if you do .rtf or not. it will be the same amount of work either way. but on a book that has tons of difficult italics, it's well worth running a test to see how well abbyy recognizes them, because it could save a lot of time and energy, and boost accuracy. 
again, i am out of the habit these days, but i went through the whole book, to see how accurate i would be, and how long it takes. i spent an hour doing it, and found only 110 or 120 of the 160 italics you said are there. (that includes the ones i had marked earlier, whenever i encountered 'em "along the way".) so that's an accuracy-rate of a mere 70%-75%. even with another go, i might not find 'em all. so, from my perspective, it's a no-brainer to at least let abbyy _try_ to find them for me... moreover, i'd defer this task to collaborators if i was working in a distributed environment.
And having two foofers go through each page of a book like this seems to be a waste of resources.
well, that's a separate question. it depends on how much energy you are willing to expend to _guarantee_ that you've got an error-free book. personally, i don't think italics are _that_ vital; nonetheless, i wouldn't want to miss more than a half-dozen in a 240-page book like this one. and i cannot say, without doing some research, how good smoothreaders are at catching these. (although i hasten to point out that, unlike d.p., my smoothreaders view text and scan together, so it'd be easier for them to detect-and-check.) heck, to be honest, i can't even say that i know what kind of accuracy the d.p. formatters have. i just assume their 2 rounds must be adequate. but i don't really have _research_ to back it up. nobody does. it's too hard to create a criterion. *** don said:
It will be interesting to see what your testing reveals. I remember that, at the time, abbyy's accuracy with formatting location and boundaries was not good, and that one could spend more time fixing what's there than to put it there in the first place.
gosh, more b.s. posing as "accumulated wisdom" from that flaccid-fountain-of-fail known as d.p. here's the truth, rather than a recycled d.p. myth. abbyy is now _pretty_good_ at recognizing italics. even back in the day, when this myth took shape, abbyy did a _fair_ job, if you gave it decent scans. the problem was that some people at d.p. threw crap scans at abbyy, then said "it doesn't work"... of course o.c.r. isn't gonna give you good results from a scan that looks like your baby kutzed on it! it's optical character recognition, not mind-reading! on some of those scans, _i_ couldn't read the type. so how could anyone expect o.c.r. to do the trick? i am continually amazed at how _well_ abbyy does. even on the current "betty lee" book, if you look at the places where there were glitches in the o.c.r., then look at the corresponding place on the scan, you usually see the splotch that caused the glitch. this scan-set was not up to the usual quality level for scans done by roger. must've been a bad copy. and here's another tip which many of you know: use scan-tailor to clean up your scans for o.c.r. (don knows -- as he always advises it to others.) i don't know all the ins-and-outs of scan-tailor, so i can't give any advice on optimizing results, but it proved itself during my initial experiments. so do a test-run of a few scans to see if your o.c.r. gets good results. if not, do mods with scan-tailor.
Also consider how the abbyy formatting data could be used by software after your proofing, to consider each location where abbyy found formatting and whether you had put formatting there; and resolve the differences efficiently.
now _this_ is a good suggestion. myself, i would use the abbyy italic information right off the bat, without any hesitation at all... but if someone is reluctant to do that, they can certainly use that information _after_the_fact_. for instance, as i reported above, i located 113-119 places where "betty lee" has italics. (119 places, to be exact, found in 113 lines.) so i've appended a list of those lines with italics. a similar list can be generated from abbyy info, and the two lists then compared to test for diffs. the diff results could also be used for _training_, i.e., i could view lines where i missed the italics, and perhaps pick up a clue on how to improve... *** don said:
The best way to get past the tedium is to come up with an objective means for comparing results derived from different, and differences in, the process.
doesn't seem like that should be too difficult...
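one possible starting point, offered only as a sketch: count the lines on which two processes disagree for the same page. this uses python's standard difflib module. the function name and the two sample texts are made up for illustration, and the count only tells you _where_ the outputs differ, not which one is right, so someone still has to check each difference against the scan.

```python
import difflib


def count_disagreements(text_a, text_b):
    """Count lines that differ between two versions of the same page."""
    a_lines = text_a.splitlines()
    b_lines = text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    differing = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # a changed block counts as the larger of its two sides
            differing += max(i2 - i1, j2 - j1)
    return differing


# hypothetical outputs of two processes for the same two lines:
# process "a" marked the italic on the first line, process "b" missed it
process_a = 'Probably it _was_ "nerve," but she had not meant\nShe _liked_ Jack.\n'
process_b = 'Probably it was "nerve," but she had not meant\nShe _liked_ Jack.\n'

print(count_disagreements(process_a, process_b))
```

divide that count by the number of lines on the page, and you have a rate that can be compared across books and across processes.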
Until then we're stuck with our prejudices.
doesn't seem like a very appealing alternative... -bowerbird p.s. cases of italics that i found in "betty lee"...
to have _everything_ talked over in the family! I "True enough. But you don't have to tell _all_ Lucia (pronounced 'Lu _chee_ a,' in Italian and as for her she was going to _choose_ her I _am_ expecting it to put them to flight! I thank most of them fine, and both as _worthy_ and as _interested_ in their own team's winning. Do not haven't lost it. Oh, wouldn't that be _awful?"_ Mother's that has _grand_ notes in it and I use it "Oh, _do_ you?" laughed Betty. "This is so "He gives the _full_ address inside. It's haps just to joke me -- and while I _thought_ that other things, I never found out till the _middle_of_ _my_sophomore_year_ that junior orchestra only meant _second_ to the senior orchestra, sort of a Probably it _was_ "nerve," but she had not meant Clara probably thought that Betty was _gloating_ enough tact to be president of _anything"_ lunch room side by side. "I _thought_ it was Woods, rather proudly. "As soon as _he_ gets stuff _ing._ But I'd never get tired of it and I "She'll copy Lucia, and it will do _her_ good to dent of Lyon 'Y', _you_ would be going home _junior_ for president!" Come over here on the _chaise_longue_ and let me broader than the ordinary _chaise_longue._ Golden "If you only understood Italian, Betty! _Che_ _peccato!_ That means 'What a pity' -- for I'll he is on an African _safari;_ he wrote me just "I _casually_mention_ hearing from my father mother. She doesn't care a _centime_ for him; stuff that I'm most scared for fear she _will_ start him and tell him just to come over and _get_ her!" Betty. I believe I _will_ write and tell my father _were_ connected with Ramon -- and she had everybody could select her own doll. And _these_ _"I_ wouldn't care for such a friend," said Dotty. I _knew,_ I should scarcely give any information that died. Isn't it _pitiful?_ So I just sent Bessie the kind that _she_ has to _help."_ the kind that _she_ has to _help."_ _rather_ you wouldn't tell Mother. 
I don't know changed, provided she _could_ change it now. She at first and they _are_ good folks, Betty -- at least They were so _good_ to me! I'm going to give in the _little_ basket till Christmas morning, Mrs. chance that _he_ may get work the first of the sang Latin hymns and songs, the _Adeste_Fidelis,_ "It's a shame you aren't going to _your_ grand- to _make_anything,_ I don't see, but I'll appre- _never_ forget you. Couldn't if I tried!" called it in the old days when _she_ was a little _"We_ read things," importantly said Amy "Funny she wants _you_ Mumsy, when she has in trouble. Well, Lucia wanted _her;_ perhaps "Did you see that _tr-ragic_ face, Betty?" asked _your_ side," she told Carolyn Gwynne. "It wasn't _anything!_ I got knocked down and "We certainly could, and _crede_mihi,_ Betty, to do, and you can't do everything, _crede_mihi!"_ _"'Crede_mihi'_ -- I can't," laughed Betty. "Do you suppose _'mihi'_ ought to come before _'crede?'_ Oh, yes, imperative first!" _"'O_tempora,_o_mores!'"_ replied Carolyn, my word for it' isn't enough like 'believe _me'_ like _ubi_est?_ or _Quid_loquor?_ like _ubi_est?_ or _Quid_loquor?_ _Quid_agis?_--_O_miserabile_me!_--_horribile_ _Quid_agis?_--_O_miserabile_me!_--_horribile_ _dictu_--_age_vero_--_da_operam,_ and other expres- centrated on basketball "to beat _us."_ And beat think that he owned her! She _liked_ Jack. couldn't help it. Jack _must_ be all right! Why, Betty had wondered if her own parents _could_ where she was. It _was_ nonsense. She would go home when she got ready. But she _would_ near dinner time it is, and Mother _will_ be "Oh, you _did!"_ exclaimed Doris. "Tell us facetious about it. Isn't that so, _mon_cher_ _papa?"_" "" _wild_set_ of the _society_bunch!_ What do you _wild_set_ of the _society_bunch!_ What do you "Don't _ask_ me, Doris. I don't like _any_ of "Don't _ask_ me, Doris. I don't like _any_ of cally. "Would she _let_ me, do you suppose?" her. 
Perhaps it was Chet's remark about _not_ _forgetting_ that misled her! She was dressed, if there _was_ a pretty good current in the river, frowned again. _Could_ her father tell him? Then "Oh, Father, you _told_ him! And I know he's "Ramon Balinsky Sevilla is not _in_ Detroit!" friend met him. _Now_ are you satisfied?" new dress, _"so_ becoming," she said. "Betty, I for it in others, and _have_a_care,_ Betty. That _a_la_ victrola, though Jack apologized for not de cook. Dat young lady in dere's done _passed_ _out!_ An' de butleh -- he gone, too." "No, honey, not daid. No, you jus' _keep_out._ me, she would know the way around _this_ house _Mother,_ was still up! Then there _was_ something in the remarks "I thought he _might_ call up to see if I had gotten Betty was not quite sure just _how_ she could think it was _terrible!"_" " " cause _you_ didn't know what _was_ happening." cause _you_ didn't know what _was_ happening." just _saw_ going to pieces little by little and hav- reach Italy and how he was having the _palazzo_ opened in Milan. Now _that_ may mean some- so _lonely,_ Mrs. Lee!" And your mother asked _"Would_ you, Lucia? I wish you would stay. to have a real old Italian _palazzo_ to come to them the Sevillas. At least _he_ had not found daughter -- _perhaps,_ Lucia thought. But the perhaps, some day. I thank you for _trusting_ annual Lyon High _Star._ It was the custom to with her _Star_ under her arm, Lucia in the thing. And she's _got_ to!" not. Of course, we could not mention _that_ before
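the comparison of this list against an abbyy-derived one could be sketched as below. this is a minimal sketch under loud assumptions: both lists hold one text line per entry, and the text of a line is identical in both lists (same line breaks, same italic markup). the "abbyy" list here is invented purely for illustration, and the line breaks in the sample entries are approximate.

```python
def compare_italic_lines(human_lines, ocr_lines):
    """Group italic-bearing lines by who flagged them."""
    human = {line.strip() for line in human_lines if line.strip()}
    ocr = {line.strip() for line in ocr_lines if line.strip()}
    return {
        "both": sorted(human & ocr),
        "human_only": sorted(human - ocr),
        "ocr_only": sorted(ocr - human),
    }


# three entries taken from the list above (line breaks approximate)
mine = [
    'Probably it _was_ "nerve," but she had not meant',
    "think that he owned her! She _liked_ Jack.",
    "couldn't help it. Jack _must_ be all right! Why,",
]

# hypothetical abbyy-derived list, made up for this example only
abbyy = [
    'Probably it _was_ "nerve," but she had not meant',
    "couldn't help it. Jack _must_ be all right! Why,",
    "a made-up line with an _italic_ that the human missed",
]

result = compare_italic_lines(mine, abbyy)
for group, lines in result.items():
    print(group, len(lines))
    for line in lines:
        print("   ", line)
```

the "ocr_only" group is the one to study for training purposes, since those are the lines where the human missed something (or where abbyy invented an italic that isn't there). the "human_only" group shows what abbyy failed to recognize.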