pontifications from mount high horse -- #1498

ok, i went back and did a better job on "betty lee", and then compared the product to the original o.c.r.

well, what i _actually_ have is what i scraped from his editor demo, and some of that text had been edited... but i got to it pretty early, so i think i minimized that. (it would be nice if roger put out his _actual_ o.c.r. it would also be great if roger put out his .rtf copy. but i'm not sure how interested he is in this stuff.)

at any rate, i will shy away from hard numbers, and just report the pattern of results, which is very clear.

in general, i found this digitization looks exactly like the dozens of other ones that i have reported on here.

as usual, the o.c.r. was good. quite surprisingly good. (i guess it's time that we should no longer be surprised. still, these scans were _murky_; but no, it didn't matter.)

there were right around 256 errors, on a 256-page book, on the raw o.c.r., which is pretty much what you'd expect. considering the number of lines -- over 7000 of them -- those 256 errors constitute an accuracy rate of over 95%.

roger hasn't released his text yet, so i can't do a compare, which will probably reveal more errors that i missed, but even if i missed twice as many as i found (quite unlikely), the accuracy of the raw o.c.r. is still gonna be about 90%.

***

i spent more time finding and fixing errors in this book than i wanted to, more time than i would have otherwise, because i was using it as the content for a new program.

so i can't give a good estimate of the cost-benefit ratio of the time i spent, but i can say that i did indeed catch a pretty good percentage of the errors on my first pass.

i made errors, lots of them! -- many more than the 22 which roger reported a while back -- but i can also say that i woulda caught almost all of the errors originally _if_ i had done a careful job, and done it more fully...
i did a rush job, because i didn't know how fast roger was going to act, and i wanted to get my stuff out first, so roger and everyone else would know that i _hadn't_ used his results to create mine, i did mine on my own.

and i wasn't thorough, because i didn't know if people would care. heck, i didn't even know if _i_ would care.

but the project ended up being fun. it was _nostalgic_, coming in at the end of the year, plus i hadn't done an analysis of a digitization in a long time. and i'm rarely able to assess my own performance, so that's a blast...

maybe i'll make a list of the errors later, for you to see. or maybe not.

either way, the results are unmistakable. the o.c.r. was good. many of the "errors" were due to scan-spots -- which o.c.r. is duty-bound to report -- or outright errors in the p-book. it was full of errors! this is one of those e-books that is _more_accurate_ -- out of the chute -- than the p-book it came from.

and, to repeat, this result is _the_typical_finding_...

across the board, i have demonstrated, over and over, that the o.c.r. is good, and the vast majority of errors can be fixed by using extremely simple preprocessing, the type that you can do in one hour for a simple book.

correcting the o.c.r. -- and even doing the formatting -- for a book is easy. it doesn't take rounds and rounds of volunteers wasting time and energy poring over a book. all it takes is one or two people using a good tool, and a couple of smoothreaders to catch the stealthy stuff...

if you want to split up the job, you can have 10 or 20 people using that same "good tool" to do the job, and 10 or 20 smoothreaders, probably giving better results. but even one person and one smoothreader can do fine.

and if you solicit error-reports and act on 'em diligently, you can execute a very smart march toward perfection...

***

the things i just said apply to d.p. and p.g., obviously, but they also apply to some points roger has made...

for instance, roger said this:
I've heard from some people that solo process that actually like to go through the book page at a time because they enjoy following the story as they go, which doesn't happen when someone is in production mode at the book level.
now, first of all, of course, this is another case where roger exhibits fundamental misunderstanding about the essence of "production mode at the book level"...

rather, it is page-oriented systems like the one at d.p. which make it difficult for people to "follow the story".

in a system like the ones i make, the entire book is available to a person at all times, so they can surely "follow the story" if they choose to read it in order.

the main difference is that, in a book-oriented system, you will _begin_ by cleaning up the big errors first -- the ones that are simple for the system to auto-detect -- so that you can then "settle in" to read each page, during which process you can look for _subtle_ errors, without being distracted by a need to fix any big ones, as that does indeed detract from "following the story"...

in a page-oriented system, an absence of preprocessing means you might need to fix a bug on nearly every page, and that hurts both your accuracy _and_ comprehension.

so roger has not just "failed to get things right" here... he has actually gotten it _completely_backward_, sadly.

and, like i said, he's one of the smarter guys here. sadly.

***

if anyone wants to see my analysis of my performance on "betty lee", let me know. or view the product online.
http://z-m-l.com/go/betle/betle.zml
http://z-m-l.com/go/betle/betlep123.html
oh yeah, i almost forgot to tell you... i've programmed yet another book-digitization editor. once again, it's in python, like the one i built recently. but it's rather full-fledged, like the one i built in perl, back in 2010, when i was working on roger's "sitka"... it's not all finished yet, but you can look at it here:
that is targeted at the ipad right now, but i can also make it work on an iphone by sizing the text smaller. it's _increasingly_ important to offer people the chance to contribute to your digitization project when they are using a mobile form-factor, like the ipad or the iphone.

***

have a nice day.

-bowerbird
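
[editorial aside] The accuracy figures in the message above (about 256 errors, over 7000 lines, "over 95%", and "about 90%" if twice as many errors were missed as were found) can be checked with a few lines of arithmetic. This is only a sketch: it assumes accuracy is measured per line, that each error lands on a different line, and it uses 7000 as the line count even though the post says "over 7000". The function name is made up for illustration.

```python
# Line-level accuracy implied by the figures quoted in the post.
# Assumptions (not stated in the post): one error per line at most,
# and exactly 7000 lines (the post says "over 7000").
errors_found = 256
total_lines = 7000


def line_accuracy(errors: int, lines: int) -> float:
    """Fraction of lines that contain no error."""
    return 1 - errors / lines


found_only = line_accuracy(errors_found, total_lines)

# "missed twice as many as i found" means found + 2 * found errors overall.
with_missed = line_accuracy(errors_found * 3, total_lines)

print(f"{found_only:.1%}")   # 96.3%
print(f"{with_missed:.1%}")  # 89.0%
```

Under those assumptions the two claims hold up: a little over 96% with only the found errors, and roughly 89% in the pessimistic case.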

I'm close enough to finishing my Betty Lee, Junior project to compare it to BB's posted version at http://z-m-l.com/go/betle/betle.zml. Most of these diffs are from the first 80 pages or so.

I'll be posting my version of Betty Lee, Junior somewhere as soon as I run the last few checks. Unfortunately, Project Gutenberg won't take it. Before the Rule 6 freeze, I posted the first two books in this series as PG texts #34605 and #34728. I did not get clearance on Betty Lee Jr. or Betty Lee Sr. in time. I'm open to suggestions as to where to put the final version of this book.

BB wrote:
it would be nice if roger put out his actual o.c.r. it would also be great if roger put out his .rtf copy. but i'm not sure how interested he is in this stuff.

I *am* interested in "this stuff." Not so much this one book, but in the processes. I'm still learning (and re-learning) a lot on this project. So I'll do what I can to accommodate BB's request. My "actual ocr" on this wasn't RTF, which is why I missed so many italics originally. In Betty Lee, Sr., I went from the RTF and retained the markup. But in this project, I didn't use it. Best I can do is go back to the original batch I used with Abbyy and save the text as a RTF, which I've done. I'll make that available to BB.

I post these diffs not to compare processes, since mine has had the benefit of a smoothreader. I post it because there are some diffs which perhaps BB would have caught if he improved his pre-smoothie process. Some are regex-catchable. Some are guiguts-catchable or detectable with one of my analysis programs. Many are smoothreader catches.

-----

One common error is in scannos, which as BB pointed out, should be caught by a smoothie. Here are some examples, "RF" is me:

RF: some of them, and give Ramon's message, but I just can't show
BB: some of them, and give Earn on's message, but I just can't show

RF: Betty and next to Peggy Pollard, who, it
BB: Betty and nest to Peggy Pollard, who, it

RF: a thing to work for that being president
BB: a tiling to work for that being president

RF: the back. Mary Emma could not go with
BB: the back. Mary Emma could hot go with

RF: problems. From Lucia's manner, she
BB: problems. From Lucia's manlier, she

RF: of the page and below was a brief resume
BB: of the page and below war; a brief resume

And my favorite scanno in this text:

RF: I'm the crossest girl you ever saw, so far as mere looks
BB: I'm the Grossest girl you ever saw, so far as mere looks

-----

There are spelling discrepancies, which are findable with spellcheck:

RF: of those still, quiet stiletto exchanges
BB: of those still, quiet stilletto

RF: tonsillitis. Betty saw her and overheard
BB: tonsilitis. Betty saw her and overheard

-----

This scan had lots of stray marks which made it into the OCR text:

RF: packed a thin chiffon dress, while
BB: packed, a thin chiffon dress, while

RF: this, Miss Betty Lee!"
BB: this,' Miss Betty Lee!"

-----

Hard to find are missed italics:

RF: wouldn't do _one thing_. She is sweet
BB: wouldn't do one thing. She is sweet

RF: other times too, but _always_ then,
BB: other times too, but always then, before

-----

There are missing quote marks:

RF: won't you?"
BB: won't you?

RF: who sat down. "How is your mother
BB: who sat down. How is your mother

And there are extra quote marks:

RF: little habit of dropping in when
BB: little habit of 'dropping in' when

-----

Catchable by my analysis program, which does Levenshtein checks. This one is one edit distance away:

RF: are the Sevillas and where do they live?
BB: are the Savillas and where do they live?

-----

Guiguts-catchable errors:

RF: sometimes! I can't study! Come over here
BB: sometimes! I can't study I Come over

RF: who reads the sport page."
BB: who reads the! sport page."

RF: know."
BB: know," (at end of paragraph)

-----

There was one error that appears to be a bug in BB's generator, putting spaced double quotes at the end of a line:

RF: like my residence here.
BB: like my residence here." "

There are many of these.

-----

That's it for my sampling of the kind of errors left behind by BB's process as far as it went. With the addition of a good smoothreader, many of these diffs would have disappeared from BB's version.

Hope this helps.

--Roger
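
[editorial aside] Several of the diffs listed above are described as regex-catchable or Guiguts-catchable. The sketch below shows how such checks might look as plain regular expressions. It is not Guiguts's or Roger's actual rule set, which the message does not show; the patterns, labels, and the `find_suspects` name are assumptions chosen to match the sample lines quoted above, and they only flag lines for a human to look at.

```python
import re

# Hypothetical per-line checks: (label, compiled pattern).
CHECKS = [
    ("exclamation mark before lowercase word", re.compile(r"!\s+[a-z]")),
    ("capital I before capitalized word", re.compile(r"\bI\s+[A-Z][a-z]")),
    ("spaced double quotes at end of line", re.compile(r'"\s+"\s*$')),
]
# A comma directly before a closing quote is suspect only at paragraph end.
PARA_END_COMMA_QUOTE = re.compile(r',"\s*$')


def find_suspects(text: str) -> list[tuple[int, str]]:
    """Return (line number, label) for every line that trips a check."""
    suspects = []
    lines = text.split("\n")
    for number, line in enumerate(lines, start=1):
        for label, pattern in CHECKS:
            if pattern.search(line):
                suspects.append((number, label))
        # A paragraph ends at a blank line or at the end of the text.
        # lines[number] is the following line, since number is 1-based.
        is_para_end = number == len(lines) or not lines[number].strip()
        if line.strip() and is_para_end and PARA_END_COMMA_QUOTE.search(line):
            suspects.append((number, "comma-quote at end of paragraph"))
    return suspects


sample = (
    "sometimes! I can't study I Come over\n"
    'who reads the! sport page."\n'
    'know,"\n'
    "\n"
    'like my residence here." "\n'
)
for number, label in find_suspects(sample):
    print(number, label)
```

On the four sample lines taken from the diffs, each one is flagged once. Checks like the capital-I rule will also produce false positives on clean text, so the output is a list of places to look, not a list of corrections.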

Roger - if Betty Lee's author died before the end of December, 1960 (or 1961 as of Jan 1, 2012), you could check with Mark Akrigg of PGCanada to see if he's interested. (If the author is a pseudonym, he'll want to know the real name.) (I leave the question of an American citizen using Canadian copyright laws to you <g>.)

Another possibility--a few months ago, someone obtained Murray Leinster's "The Wailing Asteroid" from Manybooks (http://www.manybooks.net/) and offered it to PG via PG's Errata system. I had to tell the reporter that they'd have to go through PG's Rule 6 copyright clearance process, but that that process has been withdrawn for further development. It may be that Manybooks has a different copyright clearance process.

Re scannos like "tiling", "modem", etc - I add that sort of thing to Gutcheck's gutcheck.typ file. The file that comes with Gutcheck has many common scannos, but I've added a number that I encounter fairly regularly, e.g. cither (either), coining (coming), conies (comes), denned (defined), gaming (gaining), etc, etc. If there's interest, I'll post my list.

Re tonsillitis/tonsilitis, etc. - personally, I wouldn't change an archaic spelling to a modern one, if that's what happened here.

Al

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Roger Frank
Sent: Saturday, December 31, 2011 5:46 AM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] pontifications from mount high horse -- #1498

I'm close enough to finishing my Betty Lee, Junior project to compare it to BB's posted version at http://z-m-l.com/go/betle/betle.zml. Most of these diffs are from the first 80 pages or so.

I'll be posting my version of Betty Lee, Junior somewhere as soon as I run the last few checks. Unfortunately, Project Gutenberg won't take it. Before the Rule 6 freeze, I posted the first two books in this series as PG texts #34605 and #34728. I did not get clearance on Betty Lee Jr. or Betty Lee Sr. in time. I'm open to suggestions as to where to put the final version of this book.

BB wrote:
it would be nice if roger put out his actual o.c.r. it would also be great if roger put out his .rtf copy. but i'm not sure how interested he is in this stuff.

I *am* interested in "this stuff." Not so much this one book, but in the processes. I'm still learning (and re-learning) a lot on this project. So I'll do what I can to accommodate BB's request. My "actual ocr" on this wasn't RTF, which is why I missed so many italics originally. In Betty Lee, Sr., I went from the RTF and retained the markup. But in this project, I didn't use it. Best I can do is go back to the original batch I used with Abbyy and save the text as a RTF, which I've done. I'll make that available to BB.

I post these diffs not to compare processes, since mine has had the benefit of a smoothreader. I post it because there are some diffs which perhaps BB would have caught if he improved his pre-smoothie process.
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d
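
[editorial aside] Roger's message mentions an analysis program that does Levenshtein checks and gives Sevillas/Savillas as a pair one edit apart. His program is not shown in the thread, so the following is only a sketch of the general idea: compute edit distance with the standard dynamic-programming method and report pairs of capitalized words that are suspiciously close. The function names and the `min_length` cutoff are assumptions for illustration.

```python
import re
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def near_duplicate_names(text: str, max_distance: int = 1,
                         min_length: int = 5) -> list[tuple[str, str]]:
    """Pairs of distinct capitalized words within max_distance edits."""
    found = re.findall(r"\b[A-Z][a-z]+\b", text)
    # Short words are skipped because they collide far too often.
    names = sorted({word for word in found if len(word) >= min_length})
    return [(a, b) for a, b in combinations(names, 2)
            if levenshtein(a, b) <= max_distance]


sample = "Where are the Sevillas? The Savillas live here. The Sevillas moved."
print(near_duplicate_names(sample))  # [('Savillas', 'Sevillas')]
```

Comparing every pair is quadratic in the number of distinct capitalized words, which is usually tolerable for a single book. Deciding which spelling of a flagged pair is the right one still needs a look at the page scan.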

On Dec 31, 2011, at 12:03 PM, Al Haines wrote:
The file that comes with Gutcheck has many common scannos, but I've added a number that I encounter fairly regularly, e.g. cither (either), coining (coming), conies (comes), denned (defined), gaming (gaining), etc, etc. If there's interest, I'll post my list.
Al, I for one am interested in your list. Also recently over at DP there was a thread about "n-u" transpositions and I grabbed some from that list. I don't know how many I'll find; many of those may be academic. But please, I'd like to improve my scanno list by incorporating yours. --Roger
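
[editorial aside] The "n-u" transpositions mentioned above are cases where OCR reads an n as a u or the reverse. The DP thread and its list are not reproduced here, so this is just a sketch of one way to generate candidates: take a word list and keep the pairs where flipping a single n or u turns one valid word into another, since those are the ones a spellchecker cannot catch. The function names and the tiny word list are hypothetical.

```python
def nu_variants(word: str) -> set[str]:
    """Words made by flipping exactly one 'n' to 'u' or one 'u' to 'n'."""
    swap = {"n": "u", "u": "n"}
    return {word[:i] + swap[ch] + word[i + 1:]
            for i, ch in enumerate(word) if ch in swap}


def stealth_pairs(wordlist: set[str]) -> set[tuple[str, str]]:
    """Pairs of valid words that are one n/u flip apart."""
    pairs = set()
    for word in wordlist:
        for variant in nu_variants(word):
            if variant in wordlist:
                # Sort so each pair is stored once, whichever word we met first.
                pairs.add(tuple(sorted((word, variant))))
    return pairs


# A tiny stand-in for a real dictionary file.
words = {"month", "mouth", "bean", "beau", "house", "and"}
print(sorted(stealth_pairs(words)))  # [('bean', 'beau'), ('month', 'mouth')]
```

Run against a full dictionary, this would likely produce many pairs that rarely matter in practice, which fits the remark above that many of them may be academic.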

Roger, here's my gutcheck.typ list. Some of its entries may look odd (e.g. signer), so I'll be happy to explain them. I'd like to see your list, too.

For those who don't know, gutcheck.typ is a standard text file, one word per line.

11 44 ms ail alien arc arid bar bat bo borne bow bum bumbled
carnage carne cither coining comer comers conies cur denned docs
eraser eve eves gaming gram guru hag hare haying ho lier lime loth
m modem nagged naming nave nicker nickered nock nocked nourish
nourished nutter nuttered ringer ringers riot rioted signer snore
spam stem tho tier tile tiling tram tum tune u vas wag wen yon

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Roger Frank
Sent: Saturday, December 31, 2011 11:25 AM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] pontifications from mount high horse -- #1498

On Dec 31, 2011, at 12:03 PM, Al Haines wrote:

The file that comes with Gutcheck has many common scannos, but I've added a number that I encounter fairly regularly, e.g. cither (either), coining (coming), conies (comes), denned (defined), gaming (gaining), etc, etc. If there's interest, I'll post my list.

Al,

I for one am interested in your list. Also recently over at DP there was a thread about "n-u" transpositions and I grabbed some from that list. I don't know how many I'll find; many of those may be academic. But please, I'd like to improve my scanno list by incorporating yours.

--Roger

Al, Roger, Bird, anyone else, For your list of scannos, do you examine every instance of every word on the list? Many of those words would be more frequently correct than a misscan.

Gutcheck checks all the words on my gutcheck.typ list, but reports only those that it encounters in the text being checked.

True, some of the words are spelled correctly (e.g. alien, nicker, nocked, rioted, ringer) but they're sometimes scannos for Allen, flicker, flocked, noted, finger, respectively. (Gutcheck's check is case-insensitive).

I'd rather have a correct word flagged as a possible error, than have an incorrect word not flagged.

Al

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of don kretz
Sent: Saturday, December 31, 2011 7:06 PM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] pontifications from mount high horse -- #1498

Al, Roger, Bird, anyone else,

For your list of scannos, do you examine every instance of every word on the list? Many of those words would be more frequently correct than a misscan.
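
[editorial aside] The behaviour described above (a one-word-per-line list, case-insensitive matching, and a report covering only the listed words that actually occur in the text) can be imitated in a few lines. This is not Gutcheck's own code, which is a separate program; it is a sketch with made-up function names, and it tokenizes on letters and digits only, which is an assumption.

```python
import re
from collections import defaultdict


def load_typo_list(contents: str) -> set[str]:
    """Parse a one-word-per-line list into a lowercase set, skipping blanks."""
    return {line.strip().lower()
            for line in contents.splitlines() if line.strip()}


def flag_scannos(text: str, typo_words: set[str]) -> dict[str, list[int]]:
    """Map each listed word found in the text to the lines where it occurs."""
    hits = defaultdict(list)
    for number, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"[A-Za-z0-9]+", line):
            # Case-insensitive: "Alien" matches the list entry "alien".
            if token.lower() in typo_words:
                hits[token.lower()].append(number)
    return dict(hits)


typo_list = load_typo_list("alien\ncither\nmodem\ntiling\n")
book = "a tiling to work for\nThen Alien came in.\nnothing odd here\n"
print(flag_scannos(book, typo_list))  # {'tiling': [1], 'alien': [2]}
```

Every occurrence is reported with its line number, so a word that is usually correct still has to be eyeballed each time it appears. That is the trade-off named above: a correct word flagged is preferred over an incorrect word missed.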

BB>all it takes is one or two people using a good tool, and a couple of smoothreaders to catch the stealthy stuff...

This, of course, for anyone who has tried it, depends entirely on the book, the quality of the pages and the decay thereof, how old in general the text is, how hard the text is, and the quality of the scan, if one does not have access to the original book.

Take a shot at Bibliotheca Britannica if, god forbid, you get too full of yourself:

http://www.archive.org/details/bibliothecabrit00wattgoog
participants (5): Al Haines, Bowerbird@aol.com, don kretz, Jim Adcock, Roger Frank