Re: ok, let's take a look at gardner's book, just for the exercise

ok, you might remember that gardner invited me to take a look at a book he posted -- "the advocate"...
http://www.gutenberg.org/dirs/3/1/2/1/31212/31212-8.txt
http://www.archive.org/details/advocateanovel00heavgoog
http://www.canadiana.org/ECO/ItemRecord/48293?id=16c79d4f15394e51
http://books.google.com/books?id=ot4OAAAAYAAJ&oe=UTF-8
***

gardner's since told me that he just expected that i'd run his text through my clean-up script and report things that it found. but, as i told him a while back, i don't really have any such general clean-up script.

besides... when i _really_ want to find out if a text is accurate, i compare it against another digitization, and then resolve any differences by looking at the page-scan. i've found it's an excellent way to approach perfection.

i've also documented the process quite thoroughly, but i haven't gone through that exercise lately, so i thought i'd use gardner's invitation to remedy that.

it's also true that i knew it'd be a good opportunity to demonstrate the importance of retaining the original linebreaks and pagebreaks when you digitize a book, so i did it for that reason as well.

***

i cleaned up the o.c.r. i got from internet archive, and then compared it against gardner's proofing. i resolved the diffs by looking at the page-scan, and ended up with my final version, which is here:
comparing this against the p.g. e-text, i found about 79 places where the p.g. text is incorrect. i appended the list to this post, plus it is here:
(there _are_ 79 differences, but somebody _might_ challenge a few of them as _my_text_ being wrong, and not the p.g. e-text. i welcome dialog on them. there are about 70 cases where there is no doubt.)

the vast majority of the diffs were on punctuation. abbyy v5 read many marks on the page as commas, and often garbled the semicolon/colon distinction. it recognized most letters correctly, except for h/b. despeckling might (or might not) help performance, but upgrading to abbyy v7 is probably the best bet...

i salute gardner's courage in seeking to _improve_ his workflow. because of his openness to criticism, he's learned that his o.c.r. app does him no favors, which is a lesson that shouldn't really hurt much... especially since he ends up with a nice little list that points to some 79 possible errors in his digitization.

***

it's worth noting that my cleaned-up version of the o.c.r. from archive.org was as bad as gardner's file. in the same way that the diffs helped me to find the errors in _his_ file, the diffs helped locate the errors in _mine_ -- the beauty of the comparison method is that two flawed files can produce one that is better, to the point that the merged product can be nearly-perfect.

given the plethora of flawed digitizations in the world, the comparison method gives us real hope for change. the fact that it is boatloads more efficient than using the word-by-word proof method is icing on the cake. if you don't understand this is the way of the future, you're not paying sufficient attention, my friend, and i suggest you get out of the way or you'll get run over.

***

now, on to my last point, about pagination/linebreaks.

gardner has said he will fix the errors that were found, and resubmit the book to project gutenberg; that's fine. but now there are two versions out there in the world: the one i created with original pagination/linebreaks, and a p.g. version where the text has been rewrapped.

now, a person can -- with some degree of difficulty -- verify that the two versions have the exact same text... but what i have done is to juxtapose _my_ version with the page-scans of the book, so as to highlight the fact that the text i've presented does indeed match the scan:
with the p.g. version, the user must take your word that the version actually does correspond well to the p-book.

given the starkness of their choice, the answer is clear...

in the past, when the p.g. version was the _only_ version that was available as digital text, people were _happy_ to accept that it was a copy that was accurate enough...

with a choice, however, and users now _have_ a choice, i'm not sure they'll be so willing to take it on pure faith.

i _am_ sure michael will want to argue about this. fine. but i don't see anything to argue about; it's a slam dunk.

***

i'll do a follow-up report (tomorrow?) discussing how you can generate other versions from the .zml master, as well as how to remove the pagination and rewrap...

-bowerbird

p.s. ok, here are the 79 diffs between me and p.g. use a monospaced font, so the pointer-line works:

looked up and sighed, and he continued:
looked up and sighed, and he continued: <-- missing paragraph break here
=======================================xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

soon after their union, his wife died in giving
soon, after their union, his wife died in giving
====x===========================================

to hide her shame.
to hide her shame,
=================x

jest, all sat to listen to and applaud their host's inimitable stories, his
jest, all sat to listen to and applaud, their host's inimitable stories, his
======================================x=====================================

or,--leaving both revelation and the pandects,--become the
or,--leaving both revelation, and the pandects,--become the
============================x==============================

[Illustration: "As if at the jests of another
[Illustration "As if at the jests of another <-- missing colon after illustration tag
=============x===============================

of birds, the lowing of cattle, the barking of dogs, the churr of the bullfrog,
of birds, the lowing of cattle, the barking, of dogs, the churr of the bullfrog,
===========================================x====================================

creeper half concealed its rugged exterior, and clothed in
creeper half concealed; its rugged exterior, and clothed in
======================x====================================

done;--what is it that alarms you?
done;--what is it that alarms you?"
==================================x

swearing, into the vestibule.
swearing, into the vestibule
============================x

and attempted to bear her off, as of old the treacherous
and attempted to tear her off, as of old the treacherous
=================x======================================

and the affair that had been begun in vulgar, aimless, frolic, might
and the affair that had been, begun in vulgar, aimless, frolic, might
============================x========================================

in which the gallant stranger had disappeared.
in which the gallant stranger, had disappeared.
=============================x=================

fool, that rent the kingdom,--Rehoboam
fool, that rent the kingdom,--Rehoboam.
======================================x

him for a block-head, a little black-browed beetle, a
him for a block head, a little black-browed beetle, a
===============x=====================================

my blister, my settled, ceaseless source of irritation: the cause, the cause--of
my blister, my settled, ceaseless source, of irritation: the cause, the cause--of
========================================x========================================

against its cause, the gallant stranger.
against its cause, the gallant stranger,
=======================================x

and exagerrated form. Anger and shame contended in the old law
and exagerrated form. Auger and shame contended in the old law
=======================x======================================

sir, come,--ah, will you resist your"--father he was about to say, but he recoiled
sir, come,--ah, will you resist your--" father he was about to say, but he recoiled
====================================xxxx===========================================

thou shalt suffer, thou shalt bend, or
thou shall suffer, thou shalt bend, or
=========x============================

with the cup his own hands have fashioned?
with the cup his own hands have, fashioned?
===============================x===========

at last growing calmer he exclaimed: "Down, down, ye cruel thoughts, ye
at last growing calmer he exclaimed; "Down, down, ye cruel thoughts, ye
===================================x===================================

not have you murdered in your old age.
not have you murdered in your old, age.
=================================x=====

her eyes dwelt abstractedly on the sight, then,
her eves dwelt abstractedly on the sight, then,
=====x=========================================

"Can I have heard aright, or do
"Can have heard aright, or do
=====xx========================

am the seignieur
am the seigneur
============x===

of the beloved in rock, not sand."
of the beloved in rock, not sand.'
=================================x

nothing heard but you," replied
nothing heard but you." replied
=====================x=========

"Nothing, I hope," she answered, falteringly.
"Nothing. I hope," she answered, falteringly.
========x====================================

beware, beware, Amanda.
beware, beware, Amanda,
======================x

a gentleman, who announced himself as
a "gentleman, who announced himself as
==x===================================

are giddy," remarked the seigneur gravely.
are giddy." remarked the seigneur gravely.
=========x================================

sir, no; we are fumigated, ventilated, scented,
sir, no: we are fumigated, ventilated, scented,
=======x=======================================

assert, 'All flesh is glass.'"
assert. 'All flesh is glass.'"
======x=======================

have as little for your son," said the lawyer sarcastically.
have as little for your son," said, the lawyer sarcastically.
==================================x==========================

the manner of one who is going to make a confidential proposal: "Either remove
the manner of one who is going to make, a confidential proposal: "Either remove
======================================x========================================

parent made no answer, but secretly groaned in his dilemma, and at length excl
parent made no answer, but secretly, groaned in his dilemma, and at length excl
===================================x===========================================

buy brass from you at the price of gold; I will not subsidize you to avoid your ward."
buy brass from you at the price of gold; will not subsidize you to avoid your ward."
=========================================x============================================

of life you keep a care,
of life you keep a care.
=======================x

midst of three sons and a daughter; the former being dissipated and
midst of three sons and a daughter: the former being dissipated and
==================================x================================

and not ill-disposed youth; whom his
and Hot ill-disposed youth; whom his
====x===============================

their peccadilloes, and entertained other ideas of foreign travel than that
their peccadilloes, and entertained, other ideas of foreign travel than that
===================================x========================================

had told her of the predilection and hopes of
had told her of the predilection, and hopes of
================================x=============

as of one who intends no longer to be checked, nor submit to unmerited
as of one who intends no longer to lie checked, nor submit to unmerited
===================================xx==================================

his breast charged with a spiteful purpose; and going straight to the lodgings of
his breast charged with a spiteful purpose: and going straight to the lodgings of
==========================================x======================================

that night, and in the morning, having obtained leave of absence, rode
that night, and in the morning, haying obtained leave of absence, rode
==================================x===================================

solitary, and filled with chequered thoughts, continued his way,
solitary, and filled with checquered thoughts, continued his way,
=============================x===================================

you are there unto me will still be heaven.
you are there unto me will still he heaven.
=================================x=========

before you could have asked forgiveness," replied
before you could have asked forgiveness." replied
=======================================x=========

boundless goodness and free grace, remits the debts and manifold
boundless goodness and, free grace, remits the debts and manifold
======================x==========================================

not libel love, nor our sweet fortunes," cried
not libel love, nor our sweet fortunes." cried
======================================x=======

It _is_ because _it must_ be; it is unselfish; nay, unto itself
It is because _it must_ be; it is unselfish; nay, unto itself
===xxxx========================================================

its riders; and prompted as it seemed by fear of a rescue, the
its riders; and prompted is it seemed by fear of a rescue, the
=========================x====================================

to swallow the ground, until again over all burst,
to swallow the ground, until again overall burst,
=======================================x==========

gnome, from beneath her envelopement,
gnome, from beneath her envelopment,
===============================x=====

will not require to find your way back this year."
will not require to find your way hack this year."
==================================x===============

gazed wistfully around to discover some glimpse of dawn.
gazed wistfully around to discover, some glimpse of dawn.
==================================x======================

implore you, from this man," and with the words she sprang towards
implore you, from this man," find with the words she sprang towards
=============================xx====================================

"Keep quiet, gentle lady; have patience, bashful beauty; sit down, sit down; come p
"Keep quiet, gentle lady; have patience, bashful, beauty; sit down, sit down; come p
================================================x===================================

akin to contempt, he demanded:
akin to contempt, he demanded: <-- missing paragraph break here
==============================xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

No. Besides at present you have taken a long-
No, Besides at present you have taken a long-
==x==========================================

and taking to their heels concealed themselves amongst the trees that covere
and taking to their heels concealed, themselves amongst the trees that covere
===================================x=========================================

answered; "worse than your worst suspicions,
answered "worse than your worst suspicions,
========x===================================

serve? who, except the father of yon boy, the
serve? who, except the father of you boy, the
===================================x=========

this, thinking it to be but the jest or boast, or, at furthest, merely the loose anno
this, thinking it to be but the jest or toast, or, at furthest, merely the loose anno
========================================x============================================

dubious dwelling where, some hours before,
dubious dwelling where, some hours before.
=========================================x

have her face flayed; her hair shall be plucked up by the roots;" and she
have her face flayed; her hair, shall be plucked up by the roots;" and she
==============================x===========================================

are, as you seem to be, a gentleman, do not leave me;" she exclaimed be
are, as you seem to be, a gentleman. do not leave me;" she exclaimed be
===================================x===================================

calm thyself, girl," echoed the ponderous
calm thyself, girl." echoed the ponderous
==================x======================

advanced, and bending over her whilst his voice fell,
advanced and bending over her whilst his voice fell,
========x============================================

who shall blame the sun and moon
who shall blame the sun and moon <-- missing paragraph break here
================================xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

in surprise and terror. "No,
in surprise and terror. "No.
===========================x

called for anything of earth, but
called for any thing of earth, but
==============x===================

shall ooze out drop by drop, and each drop be a portion of your life."
shall ooze out drop by drop, and each, drop be a portion of your life."
=====================================x=================================

missing, had been caught by the advocate's keen eye, and convinced
missing, had been caught by, the advocate's keen eye, and convinced
===========================x=======================================

until, detected, he stood, too nigh to retreat, too terrified to advance, an
until, detected, he stood, too nigh, to retreat, too terrified to advance, an
===================================x=========================================

he cried: "Demon, degenerate dog, where hast thou been walking to and fro in the
he cried: "Demon, degenerate dog, where bast thou been walking to and fro in the
========================================x=======================================

my son, my son," he cried in agony; "Oh,
my son, my son." he cried in agony; "Oh,
==============x=========================

did it, all you curious crowd.
did it, all you curious crowd
=============================x
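For readers who want to try the comparison method themselves, here is a minimal sketch of how a pointer-line like the ones in the list above could be generated. It is not the tool that produced the list; the function names `pointer_line` and `report` are hypothetical, and the sketch assumes both texts have already been split into lists of lines that correspond one-to-one (which is only straightforward when the original linebreaks have been kept).

```python
import difflib


def pointer_line(mine: str, theirs: str) -> str:
    """Build a line of '=' with 'x' under each stretch where the two lines differ.

    Uses difflib's alignment, so one inserted comma yields a single 'x'
    instead of marking everything after it as different.
    """
    matcher = difflib.SequenceMatcher(None, mine, theirs, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Width of this stretch: the longer of the two sides.
        width = max(i2 - i1, j2 - j1)
        parts.append(("=" if tag == "equal" else "x") * width)
    return "".join(parts)


def report(mine_lines, theirs_lines):
    """Return one three-line entry (mine, theirs, pointer) per differing line pair.

    Assumption: the two lists are already aligned line for line.
    """
    entries = []
    for mine, theirs in zip(mine_lines, theirs_lines):
        if mine != theirs:
            entries.append("\n".join([mine, theirs, pointer_line(mine, theirs)]))
    return entries


if __name__ == "__main__":
    for entry in report(["to hide her shame.", "no change here"],
                        ["to hide her shame,", "no change here"]):
        print(entry)
        print()
```

Each differing pair still has to be resolved by a person looking at the page-scan; the script only says where to look.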

Thank you BB.

I've made a good share of these changes already based on the line by line comparison you did earlier. I'll merge all the rest and re-post as time allows.

On 25-Feb-2010 16:55, Bowerbird@aol.com wrote:
comparing this against the p.g. e-text, i found about 79 places where the p.g. text is incorrect. i appended the list to this post, plus it is here:
(there _are_ 79 differences, but somebody _might_ challenge a few of them as _my_text_ being wrong,
You missed the place where I corrected the spelling of Ste Hélène. That's 80.
am the seignieur
am the seigneur
============x===
Here, the correct spelling "seigneur" was used several other places in the text. It seemed reasonable to correct this as a typo.

Definitely an improvement all 'round though.

See you,
============================================================
Gardner Buchanan <gbuchana@teksavvy.com>
Ottawa, ON
FreeBSD: Where you want to go. Today.