March 2005 - gutvol-d - lists.pglaf.org

Re: [gutvol-d] lest the message be missed
by Bowerbird＠aol.com 12 Mar '05

12 Mar '05

michael said: > Of course we should not forget people such as David Widger, > who has produced nearly 3,000 eBooks, about one per day, > over a period of years, nor David Price who sent us > one eBook per week for years, or > several others who prefer to remain anonymous. that's right! :+) of course, david widger is super-human, not "an average person". ;+) but -- with the right tool -- now an _average_ person can do it too! -bowerbird

3 2

No part 2 of newsletter
by Tony Baechler 11 Mar '05

11 Mar '05

Hi. I'm sure I'm not the only one to notice this, but neither George, Greg or Michael commented, so I'm asking. What happened to part 2 of the newsletter? I got part 1 as always. George said that he would no longer be editing so part 2 would now be automated, but I think something must have happened because I never got anything. I did not fully read part 1 but I think based on length it is too short to contain a full list of new books. Any thoughts? Any idea when it will be sent out? No big rush, but I'm curious to see the apparently new, automated format if there is one.

2 1

ok, let's wrap this up, folks
by Bowerbird＠aol.com 11 Mar '05

11 Mar '05

kent said: > At the risk of coming into the middle: ain't _that_ the truth! ;+) unless i am prodded further, however, this will be my last post on this thread. and this will also be my last thread before i take a long break from here, with the exception of my final report on "my antonia", and reports on the book that miranda asked me to do... and yes, people, it's a long post, because it's full of detailed thinking and analyses. if that ain't for you, not your cup'o'tea this afternoon, hit the 'delete' key, don't go running off complaining to michael and greg... > My experience is that the time consuming part > of going from book to E-book is the proofreading. ok, let's take a look at what you have to say. > I use a Canon S230 3 megapixel camera um, it is unlikely that's good enough. this very issue of using a digital camera rather than a scanner is being discussed right this second on another listserve, but the people there are talking about 5-megapixel and up, even a 10-megapixel. i seriously doubt a 3-megapixel works well. there are other concerns with a camera too. are you using external lighting on the book? if not, then your images will be substandard. do you use a tripod? do you focus manually? as always, photography can be a tricky thing. > I use a Canon S230 3 megapixel camera > in a copy stand to get about 300 dpi scans. there are also issues with the "copy stand". some stands can be good. others, not at all. are your scans showing curvature problems? if so, that can be a killer to o.c.r. recognition. unevenness in the brightness across the page? that can significantly impair the o.c.r. too. > I use a Canon S230 3 megapixel camera > in a copy stand to get about 300 dpi scans. 300dpi ain't giving your o.c.r. app the best you could. and isn't really creating what you'd want for archives. it's much more time-consuming to scan at 600dpi, and i think it's an open-question whether we want to ask individuals like you to take that extra time, or whether we wait to re-do scans until we have the equipment that will make that process fly by. but if we do take the 300dpi shortcut in scanning, or by using a digital camera rather than a scanner, then we need to do it with the full knowledge that that decision _might_ impact o.c.r. accuracy, which in turn _might_ result in more proofing work, which _might_ end up actually _costing_ us time overall... given the differences obtained from different scanners, and different source-texts, and different o.c.r. programs, and even from different _people_ doing the imaging -- if you've looked at a range of scanned books, you'll know that different people exhibit a wide range of variability in how carefully, e.g., straight, they position each page -- it's very difficult to do the research we'd need to do to find out exactly _how_much_ time we're wasting by creating images at less-than-ideal resolution. but we are _certainly_ wasting some time, in some situations -- and perhaps a _lot_ of time in more than we know... i'll put this as plainly as i can: if we use inferior tools, we _will_ get inferior results. if you take care to notice it, my statements about "one evening" are hedged carefully with qualifiers about "the right scanner", "the right manipulations", "the right tools", and of course, "an average book"... a lot of the people who scoff are people who are using inferior tools, and getting inferior results. people once thought heavier-than-air flight impossible. it is, if you do it wrong. if you do _anything_ wrong. and there are lots and lots of things you can do wrong. but do _everything_right_ and flying is certainly possible. now people fly every day in a plane, with no second thought. and, to be clear, i'm talking about the amount of time that it takes _after_ the page-scans are cleaned up. as people have confirmed, the scanning and clean-up will often take a very long time, all by themselves. compared to _that_, proofing should be much faster. before i leave the arena of the image-creation process, i should say there is only _one_ "right" scanner out there currently, in the range of personal affordability anyway. it's that optic3600 that other people have mentioned here. if you're using another scanner, you're wasting your time. maybe you're not wasting a _lot_ of your time, perhaps not enough to consider a $250 scanner as an "investment", but you need to know that you _are_ wasting some time. and if you use inferior tools, you will get inferior results. one more thing, since carlo mentioned that sometimes he gets inferior results because the p-book is shoddy. hey, no question that a bad original will make bad scans. the best answer to that problem, though, is very simple: go find a cleaner copy of the book to get your scans from. _somewhere_ out in the world, there _is_ a cleaner copy. (if not, let that rare book be scanned by a professional!) and if those bad scans are coming from somewhere else? the same answer: go find a cleaner copy and scan _that_. don't waste valuable time dealing with inferior images! jon noring keeps talking about how wonderful it is that distributed proofreaders keeps the scans for their books. and it is. but the truth of the matter is that previous few of those scans can be considered good enough for archival. so those books will have to be rescanned in the future too. let's hope that brewster and/or google are doing it right... > I use Abby FineReader 5.0 v5 won't give you the accuracy that v7 will. that's likely the _main_ reason that proofing is taking you longer than it should. version 7 does a much better job that 5. you will find the upgrade price _is_ an excellent investment, even if your time isn't really worth very much... if you use inferior tools, blah blah blah... > then comes a first pass proofreading, > also fixing headers and footers. > this is often 30 seconds per page. um, no. you're getting way ahead of yourself. after scanning, you _first_ need to clean up the page-scans -- which means deskewing them, standardizing placement, etc. almost every page is skewed to some degree. even though this might not be apparent to you without careful analysis, it _is_ a factor with big impact on the o.c.r. accuracy. and furthermore, when a person views page after page of the images, to read 'em, even a small skewing causes a subconscious weirdness to them. as for placement, i mean the left and top margins of each scan are identical. it's another factor effecting reader subconscious. while it's less important to o.c.r. accuracy, it does sometimes exert an impact there too, specifically in regard to the "zoning". (and yes, you _do_ have to zone the pages to get the best o.c.r.) there are a whole slew of other ways to manipulate the images. i don't have any experience with some of them, to discuss them, but there are some people over at distributed proofreaders who seem to know a lot, including one person whose name escapes me, who has formulated his "recipe" for enhancing page-scan images. interestingly, it includes "blurring" the image at one point, which certainly seems counterintuitive, but has the effect of converting the one-pixel dots into two-pixel dots (or some such), which means they don't get deleted in a later step where the image is downsized. (d.p. resizes many scans to a size that works well in their system; that also might be considered a shortcoming in their scan-archive.) now some of the skeptics out there are probably muttering now that adding time to the imaging process to save it on the proofing process isn't really "saving" us any time. and there is a little truth to that. however, many of these image-cleanup steps can be _automated_, so they are great candidates for inclusion in our ideal work-flow. even more importantly, it's vital that we start considering the scans as a product in and of themselves. i fully agree with michael hart that "a picture of a book is not an e-book". i too want raw, editable text. but that doesn't mean a high-quality "picture of a book" isn't useful. indeed, as pointed out here, it's the first step on the way to getting the raw, editable text. and even after that, it continues to be useful. people _will_ -- in the future -- desire to _replicate_ older books. they will want print-outs that "look exactly like" the original book. (_especially_ with books like those by william blake, for instance.) and the best way to fill that demand is to have high-quality scans. tomorrow's low-end printers will be 600dpi (if they aren't already). so that's the resolution that we need to be aiming at with our scans. yes, i fully realize that that is ridiculous in terms of the present, when that kind of resolution overwhelms our memory and bandwidth, as soon as we stop thinking about books at the individual-book level and start thinking about them as collections in the tens of thousands. which is precisely why i tell people now that 300dpi is acceptable, even for the "archive" versions we're building for the here-and-now, just as long as the 300dpi scans give us acceptable o.c.r. recognition. but i give louder applause to the foresight to go to 600dpi right now. (me, though, i'll go 300dpi unless/until i have a high-speed scanner, expecting that _every_ book i'll scan will eventually be rescanned.) > then comes a first pass proofreading, > also fixing headers and footers. > this is often 30 seconds per page. ok, after you've cleaned up the scans, you can start the "proofing". but there are lots and lots of different ways of "doing the proofing", so let's be perfectly clear about exactly what we we're talking about. my software tool guides you through the processes a certain way, so i'll be discussing that path. like i said, i plan to release my tool in late spring, about the same time that the internet archive begins to release scan-books from their toronto project, so if you prefer, save this post until then, when my tool is out. that's fine with me. on the other hand, if you want to consider my alternative processes, to see which ones you can incorporate into your work-flow, read on. i don't mean to frustrate anyone by saying "i've got a tool to do that" before the tool is released. but if this advance information helps... the first thing to do is a quick check that you got all the scans right. my tool allows you to "thumb through" all of them, from start to end; it displays them 2-up, so they look exactly like a p-book page-spread. on the first pass, you'll just look at each spread, ensure it looks good. on the second pass, you'll be looking at the text instead of the scans. here, the 2-up view shows the text on one side, the scan on the other. (my tool uses this 2-up view -- text next to its scan -- throughout.) in this pass, you'll be formatting the text, to make it match the scan. i'm still in the process of figuring the best way to save o.c.r. output, i hope my tool will do most of the formatting right automatically, but when it doesn't, you will have to do the formatting yourself, manually. "manually" doesn't mean "editing", like you'd do with a word-processor. while that may be necessary on some rare occasions here, in general there will be buttons that you can click to do most of the formatting. for instance, say there's a block-quote that didn't get auto-formatted. you would select the lines of the quote, and hit a "block-quote" button. same for a poem that didn't get indented, or to right-justify an epigraph. if your book is like most -- one boring page after another boring page -- there will be very little for you to do. for "my antonia", for instance, the only real excitement here was with the occasional chapter heading. for books that need heavy formatting, you should save that for later, and move to the next step, which is where the tool starts "proofing". my tool -- and the ones that are being developed by other people too -- takes the o.c.r. results and automatically makes some changes _before_ ever presenting them to you "for proofing". for the most part, these are changes due to known recurring errors in the o.c.r. recognition routines, so a person generally needs to build a list idiosyncratic to their setup. (one person doing this had a list of over 400 rules with his old scanner, but when he bought the optic3600, he was able to drop _half_ of them.) there are also some checks that are generic to all setups. an example would be replacing any "tbe" word with "the". undoubtedly a flyspeck caused that nonsense error, so we would just change it automatically. remember that all of these changes are taking place _before_ the text has even been viewed yet by a human being, so if -- for some reason -- it _really_was_ "tbe" instead of "the" (because, for instance, it was _this_ message that was being scanned), the human can change it back! (well, if it actually was _this_ message being scanned, then the change wouldn't be _automatic_, not with my tool anyway, because any "scanno" that is in quotes is _not_ changed automatically, for just that reason. but you get my point: it's safe to make automatic changes at this time, because we know that human beings are still going to review the text.) there are a number of other checks that happen at this time as well, based on analyses of the text. i won't say much about these, because that would give away too much about my program before its release, but some of the obvious ones would include the one to "close up" the spaces that o.c.r. often injects around punctuation. (or which, like in "my antonia", are _really_ right there in the paper-book. an example is on the very first page -- page 3 -- where "hands" is surrounded by such floating quotemarks; it's clearly printed as " hands ". even jon, with his focus on "fidelity", tightened up those floating quotemarks.) this is where the o.c.r. of "mr," and "mrs," -- followed by a comma, instead of a period (which i mentioned before) -- would get fixed. all of these automatic changes are logged to a file, so they can be reviewed by a human. except that review is often a waste of time, because these changes are (or at least should be) totally obvious. and if your review _does_ show an auto-change that was incorrect, and therefore shouldn't have been made, you would seriously consider _the_removal_of_ the rule that was responsible for that auto-change. also, kent, since you specifically mentioned headers and footers, a good tool will let you retain those right up until the last minute. they don't hurt anything -- and they help you keep your bearing -- so there's no need to delete 'em. the tool should de-emphasize them -- mine displays them in gray, which makes 'em unobtrusive _and_ has the benefit of letting you know it identified them correctly -- but they're something you shouldn't have to spend time on in any way. after the automatic changes comes the fun part. at this time, the app does the hard work. again, i don't wanna steal thunder from my tool, but the aim at this point in time is to present to you _each_line_ that will need your attention (accompanied by the page-scan containing it), and _only_ those lines that need your attention (i.e., no false-alarms). that is, the tool seeks to find every line that has an _error_ in it, and present it to you, alongside a page-scan, so you can correct the error; and it seeks to show you _only_ those lines that really have an error, so it doesn't waste your time showing you lines you don't need to fix. that is the "secret sauce" in the tool -- to show you _every_ line that you'll need to fix, and _only_ the lines that need fixing, and no others... of course, that's the _ideal_, and we can only hope to _approach_ that. after all, if the tool knew for certain where each and every error was, we could just tell it to correct the errors itself, while we ate lunch. so we scale our expectations back to something a bit more reasonable, and have the program bring up -- to the best of its ability to do so -- each line for which it has some good reason to think we need to check. to put this into a phrase, we have the tool look for _probable_ errors. some of them might not actually be errors, but we go on probability... we do want to find _all_ the errors, or as many as we reasonably can, so we'll accept _some_ "false alarms". they're preferable to _missing_ an actual error. but at the same time, too many of 'em wastes our time. after all, the tool could just show us _every_ line and say "check it"; but that wouldn't be buying us any improved efficiency now, would it? so the closer we get to the ideal -- show us every line we need to see, and not one line that we _don't_ need to see -- the better we like it. and if the tool tells us what is wrong with the line, and suggests the correct fix, with a "yes, fix it" button we can click, so much the better. to use an example from above, let's say that it offered to close up those floating quotemarks around "hands" with just the click of button. slick! if we get _close_enough_ to the ideal -- where we are shown only lines that have errors, and no others -- then we will have just sat there and button-clicked, while our text became easily and adequately "proofed". once we've corrected every line that needs to be corrected, we are done! but we don't really have to get all the way to the ideal to be successful. again, my "standard" is 1 error every 10 pages. and i expect to do better. but if i attain that rate, i will consider my tool to have been "successful". i should say specifically that _spell-check_ is an important part of this. i find it laughable and ridiculous that distributed proofreaders does _not_ do a spell-check on the o.c.r. results before shipping them off to proofers. your first reaction might be "why do a spell-check, since that is exactly the job proofers are gonna be doing anyway?", plus then go on to point out how much time a spell-check would take, and various other considerations, perhaps even launch into your spiel about "what a distributed process is". (spare me; as a social psychologist, i understand it far better than most.) heck, there is actually some debate over at distributed proofreaders about whether a spell-check must be done _after_ the text comes out of proofing. which explains why some e-texts are actually being posted now that have obvious spelling errors in them that will _not_ pass a spell-check! awful! except i'm talking about a very specific form of limited spell-check, namely an analysis of the text that creates a list of all the words used in the book. again, i won't explain how it works, but the purpose is to compile the words that are _unique_ to the book. the best example is _names_of_characters_, another good example is _words_and_phrases_from_a_foreign_language_. and there are other categories. here are some examples from "my antonia": > kolaches > mamenka > misterioso > patria > tatinek > amour propre > noblesse oblige > Optima dies… prima fugit > palatia Romana > Primus ego in patriam mecum… deducam Musas these words are used to create a _book-specific_spell-check_dictionary_: words not in a normal spell-check dictionary, but which _are_ in the book. i believe that every e-text should include such a word-list in an appendix. first, it's useful, from the standpoint of end-users running a spell-check; once this book-specific word-list is specified as an additional dictionary, the entire file should pass through spell-check without pausing even once. but moreover, it's just plain _fascinating_ to browse this list for a book. it is a quickie road-map to the freakish extremes of that particular book. back to the job at hand... the word-list _is_ very useful to spell-check text right out of o.c.r., and _before_ you commence the job of "proofing". as a good example, remember those character-names? when you browse an alphabetized version of the word-list, you'll see a name popping up in a variety of variant forms, such as the possessive, the plural, and so on. what you'll _also_ see, though, is an occasional place where the name was misrecognized. boom! my tool allows you to click on it, and then immediately jumps you to it in the text -- right alongside the image -- so you can verify that it's an error, and change to the correct spelling. (my plan is to have a button you can just click to make the correction.) and if the error is obvious enough, you might not even go to the bother of jumping to its location in the text, but rather just fix it immediately. (remember, you can review these changes if you want down the line.) one of the test-books i used to develop my tool, way back when i first started putting it together, was "the hawaiian romance of laieikawai". (some of you know this e-text was in the group issued for dp#5000.) i might've spelled that name wrong; face it, it's a pretty difficult one. and, as you can imagine, the o.c.r. yielded quite a few variations of it! there were literally _dozens_ of 'em, off by a letter or two (or more). and not surprisingly, there were many hawaiian names, long and short, in this text, and the o.c.r. came up with a number of variants on each! although it was a pleasant story, and the o.c.r. was relatively clean for the pages -- remarkably so, considering how bad the scans were -- those difficult names made the task of proofing a terrible nightmare, so this text took a fairly long time to make it through all the rounds. using my tool, however, all of the various scannos on those names were easy to locate, and to correct, and that task was done quickly. thinking about individual proofers, going to the trouble of correcting each of those name scannos, independently, manually, i am appalled! imagine how much of a hassle that was! what a tremendous waste! but the scenario is even worse, at least for proofers who were careful, and took their job seriously, because in order to check _whether_ the name is spelled correctly or not, you must examine _every_instance_. and that process is extremely error-prone. and fatiguing. and boring. if the name was _at_least_ in the spell-check-dictionary for the file, the spell-check on the d.p. page would show it was correctly spelled (when it was) by failing to highlight it. and flag incorrect spellings. but until it's in the dictionary, every occurrence must be scrutinized. think how much of the proofer's time and energy could've been saved if the instructions would have said, "hey, ignore the hawaiian names, we fixed them all in a global operation before you got these pages...". to subject proofers to those difficulties, when such a simpler method isn't being developed and utilized, is almost an abuse of the good-will those fine volunteers are giving you by donating their time and energy along about now, someone will say, "d.p. plans to install the capability for a proofer to add a word to the spell-check dictionary for a book." well, gee, after 6,000 books, i would _hope_ you finally got the idea! and if you did it _right_, you'd create the book-specific dictionary _automatically_, before the first page is sent to the first proofer. i don't mean to sound high-handed and morally indignant and all that, because i fully realize this is an ongoing learning process for everyone, but hey, i guess it's easy to waste volunteer time if you have lots of it. and it would address my concerns _greatly_ if the people-in-charge (and the loudmouths who _act_ like they are) would be _accepting_ when well-intentioned people try to advise them on their processes. but there is an active hostility over there to constructive criticism. and i find that tragic. but i digress... getting back to the matter of an _individual_ doing a book, though, my objective for that situation is to make that person _efficient_. so _this_ is the type of spell-checking that you need to do _first_, one whose essential operating philosophy is a _book-wide_basis_. and then, only after that, yes, if you are an individual doing a book, the next thing to do is a _regular_ old spell-check, the type that goes from one questionable word to the next. the difference here -- and yes, one that my tool facilitates, of course -- is that when you come to a questionable word, the _page-scan_ is shown right there. some people actually say, "you should never do a spell-check, because some words that will pop up are actually as they were in the original, and they need to be left that way. so a spell-check is a waste of time, because what you really need to do instead is a line-by-line comparison." that's poppycock. _of_course_ that situation _can_ happen. sometimes. and that's why you've got the scan there, to check the questionable word. i don't advocate a blind "correction" to each and every questionable word. and you must be able to easily add a word to the book-wide dictionary, if you find that my tool is continually popping up a word that it shouldn't. (but odds are that it would've been put in the dictionary in the prior step.) but _nonetheless_, if you want to find words the o.c.r. _misrecognized_ -- and remember, that's the objective, to isolate _probable_ errors -- the best bet is to look at words that aren't in the spell-check dictionary. all right, so that takes care of spell-check. a final set of checks is then done that looks for anomalous situations; some of these involve punctuation, infrequent juxtapositions, and so on. there are some words that pass spell-check that you still want to view -- they are called "stealth scannos" over at distributed proofreaders -- and they are one of the things that are checked in this final set. and at that point, you're done with the text-cleanup. congratulations. all in all, as well as i can tell from the testing that i've done so far, you can expect the tool will present between 1% and 5% of the lines in the text-file to you for one kind of close examination or another, and perhaps 75% of those will require a "fix" of some kind or another, assuming that you got relatively clean o.c.r. results in the first place. that's a lot better than looking at 100% of the lines to "proof" them. and that, my friends, is how you can do a whole book in a few hours. unless you put aside that heavy markup earlier. if so, it's time to do it. once again, you will page through the book, text and scan side by side, doing whatever editing needs to be done so the text is formatted right. without knowing what kind of formatting you'll need to do, it's hard to tell you how you'll go about doing it. so you'll have to wait until you can get some hands-on experience with the tool to see exactly how it'll work. but it definitely will not be anything like the pseudo-markup over at d.p. -- where, for example, /* and */ are used to bracket poetry and stuff -- and it will most certainly not be any form of x.m.l. or h.t.m.l. markup . it _will_ be z.m.l. -- invisible markup that mimics the p-book page. and as my tool gets more and more advanced, it will actually _display_ the text just exactly as it will be shown by the z.ml. viewer-program. and sooner or later, the two apps will morph into one. (bet on sooner.) how complex can formatting get using z.m.l.? we'll have to see... ;+) so now that you've gone through all the post-o.c.r. cleanup my tool does, and the pages are nicely formatted so they resemble the original p-book, what next? well, it's probably the case now that your text is _already_ clean enough to meet or exceed our standard of 1 error every 10 pages. but i assume that if you're doing this book as an individual, it's because _you_actually_have_an_honest_desire_to_read_or_re-read_this_book._ because _that_ is really the absolute _best_ reason to digitize a p-book. so read it! read it in my tool, which allows you to display the image of the page right alongside the o.c.r. text for that page. keep in mind that you are reading for the express purpose of catching any errors in the text, so read carefully. at the same time, though, read for your enjoyment too! it's only by being engrossed in the story that you'll catch some errors, such as a word or a line inadvertently dropped. so become engrossed! if you find an error, first _log_it_! keep records, to improve the tool. _then_ use your word-processor to search the text for _similar_ errors. if that search yields other instances, see what you can learn from them, and expand your search based on anything you can generalize about them. some errors are flukes -- a coffee-stain on the page, or what have you. but others can be recurrent, and if you can pin down a recurrent error, you will become much more efficient in your efforts to clean up a text. finally, i will mention again that _text-to-speech_ can be _amazing_ in helping you to locate errors in a text you might never have _seen_ my tool will do text-to-speech; it'll even pronounce the punctuation, if you select that option, so you can verify that in your text as well. so i highly recommend that -- rather than reading the text to check it for that final "proof" -- you _listen_ to it instead, via text-to-speech. this has the added benefit that you can do it away from your computer. a lot of people enjoy putting a book onto a walkman, or even an ipod, and listening to it in the car, or at the exercise club, or out jogging. that's fine. (just be conscientious about _remembering_ any errors!) once you have done this final check, your "proofing" job is all finished. say what? does this mean i don't advocate a line-by-line comparison? isn't that what most people, like d.p., consider to _be_ "the proofing". well, let me put it this way: if you _want_ to do that, by all means, do! do i think it's absolutely necessary? well, in most cases, absolutely not! doesn't a failure to do that mean that you might release a text that has some small errors in it? well, yes, it certainly does, but that is exactly why i build the "continuous proofreading" step into my overall processes. no matter how good a job you might do, certainty requires more eyeballs. so if you're really feeling insecure, have other people read your file too. better yet, have someone else process the book completely independently, and compare their final file to yours. that should catch _every_ error. but if an error hides through all of the tools, and withstands a reading by an engrossed human and/or wasn't noticeable during text-to-speech, then that error is insignificant enough that i'm not gonna worry about it. i think it _should_ be corrected, and (due to "continuous proofreading") that it eventually _will_be_ corrected. but i ain't gonna worry about it. and considering the care i put into listserve posts, it's obvious i'm anal. there are 6,272 words in 707 lines in this message. find the typo in it. i circle the mistakes in everything i read, for the sheer fun of doing it. so if i can live with that error, hey, you can probably live with it too... at the point of insignificant errors, our attention is much better spent with a focus on digitizing additional books. i'll repeat, so it sinks in, that if someone _wants_ to do line-by-line comparison, that's _great_. but if we can get texts that are far-and-away error-free without it, then _i_ have far better ways to spend my time, thank you very much. and don't try to make that out that i don't care about finding errors, or that i'm talking about "something different" than what you mean, and that's the only reason i say it can be done in just one evening. because my processes will give just as accurate results as yours. and i'll be happy to prove it by finding the errors in _your_ e-texts. anyway, now you're done _proofing_, but you're not _completely_ done. because there's just one more step before you can send your e-text out. up until now, you might have had the text from each page in its own file. (or maybe you had it all in one file, since my tool can work either way.) but if you had them in separate files, they'll now need to be combined. we also want to get rid of the headers and footers and make it all nice. these are things my tool does for you -- mostly automatically -- but there are a few that do require some input from you, and some others you have to monitor to make sure they are done correctly. one example would be footnotes, which are moved to be end-notes. another example is to make sure all headings are at the right level. and when the end-line hyphenation is removed, you might be asked to make decisions for the tool when it seeks your guidance on that job. but for the most part, the tool will step you through all these tasks. it assumes that you're not an expert at doing this, and it helps you. there isn't that much more for me to explain about this final step, other than to mention that you _might_ want to execute this step before you read through the book or listen to it via text-to-speech. once you've concluded these steps, your file is a bona fide e-book. congratulations! you've moved a book into the realm of cyberspace! you can load your e-text into my z.m.l. viewer-program, and boom!, you'll see that what you created is a high-powered electronic-book! the headings are big and bold! your table-of-contents is hot-linked! words that were italicized in the p-book, which my tool marked with underscores like _this_, are again shown in all their italicized glory! illustrations are displayed on the appropriate page, automatically, and all you did was make sure their file-name was nearby that text. after this step, future versions of my tool might perform conversions of the e-text to other formats, like .html and .pdf and .rtf, if you want. plans in that regard are still fairly tentative, and i might decide that i will leave that matter to the end-reader using my viewer-program. your time might be better allocated by proceeding on to the next book. after all, it was fun to do it, wasn't it? and it only took one evening! > The real problem is my day job is using up most of my available > concentration, so I don't feel up to spending too much time proofing. well, yeah, there's no question that this job does take concentration. there's really no way around that. i will say, however, that my tool helps to _conserve_ your concentration by helping you to _focus_ on the things that require your attention, and not the things that don't. and that's really the big secret in making people more efficient here. indeed, that's what enables you to do an average book in just one evening. anyway, i have exposed enough flaws and gored enough sacred cows in this post that i can feel the vilification efforts building already. like i said, unless i am prodded, this is my last post in this thread. and except for a few final reports on the other threads, i'm all done. if those vilification efforts break out, though, and i am challenged, i _will_ remain here to defend myself, as i stand behind this post... otherwise, i'll be out of here until one of these tools is released, either from me or from one of the other people working on them, or until someone comes on here trying to tell you this job is hard. it ain't, folks. it's easy. and people have been flying for decades... the choice is up to you, people... -bowerbird

1 0

Re: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting
by Bowerbird＠aol.com 10 Mar '05

10 Mar '05

jon said: > Not possible, unless one bought the *big buck* (above office-level) > sheet feed or page turning scanners, or one simply used > a photocopy machine, and captured the low-rez images it produces. my girlfriend's office has a $10,000 lanier just down the hall. that's the kind of machine i was talking about. their website says that their high-end machines can scan 60+ pages an hour. but i grant you that a scanning time of a few hours (or more) is much more in line with what most normal people can attain, even those with lots of experience like yourself... > Yes, but this is not for the average, ordinary Joe > working in his basement. This requires a lot of $$$ > in upfront investment to get this fancy equipment > and software. i think you might be surprised in the coming months, jon. > There's still need for the whiz-bang scan cleanup software, > which I know is expensive. donovan was working on some open-source deskewing routines. might want to check that out. and i'm told that abbyy does a fairly good job setting brightness and contrast automatically. so the other thing that needs to be done is to standardize the placement of each scan relative to each other, which isn't hard. (removing curvature is a bear, but the best new scanner out -- the optik? -- lets you lay the book on the edge of the bed, which i understand effectively cures the curvature problems.) > in "My Antonia" a lot of pages were not numbered at all that's not uncommon. > (such as the last page in each chapter). yes, i noticed that. _that_ is a little uncommon. but like i said earlier, publishers can be weird. > I had to be especially careful not to mess up > and lose which page is which. it's _fairly_ easy to do each page in sequence -- just have to pay some attention turning the page -- and then using the auto-increment-name option will ensure that all of the files are named correctly. > Hmmm, this is a lot like what James Linden is developing, > which may be incorporated into PG of Canada's operations. <smile/> if you check the archives you'll find i'm the one who posted it. i also offered to write all the software. all that was ignored. doesn't matter though, i'm proceeding to build my own system. if james took my post to heart, then he's smart. :+) > Doing 15,000 texts, or a million texts, > still needs some manual processing. if you're manually opening every file, and manually summoning every scan you need to check, you're going to burn yourself out. _plus_ expose yourself to the reality of inadvertent changes. you have to have a system that tracks every change that's made, so you can review the log to make sure it was the correct change, and that nothing else was changed. reviewing the log is "manual", and so is the decision as to _approval/rejection_ of the change, but the change itself should be totally automated. > That's why, to me, it is more important to redo the collection, > put it on a common, surer footing (including building trust), > before launching into doing a lot more texts. the library needs to be _corrected_, yes, but _not_ "redone". and i think you do more damage than good when you talk about e-texts being done "incorrectly", when what you _really_ mean is that an edition was used that you don't happen to approve of, or that metadata isn't included, just to use some most examples. there are _real_ errors in the e-texts. honest-to-goodness mistakes. we need to concentrate on _those_, not on some edition that uses the british spellings instead of american ones. (even if that _was_ silly.) but distributed proofreaders is more interested in doing new books than fixing old ones. they're volunteers who set their own priorities. > Imagine how difficult it would be to process one million texts > if they were produced in the same ad-hoc fashion, > without following some common standards. i don't have to "imagine" it. that's the way the library is now. and i made my fair share of efforts to try and convince the powers that that situation needed to be addressed with some standardization. but the difficulty of doing it with the type of heavy-markup that you like has held up that whole darn process. if we would have proceeded with the "zen markup language" that i like, the library would have been clean now. > PG's ad hoc approach up to now (which DP has partly fixed) the d.p. e-texts still exhibit a large degree of inconsistencies. and contrary to what you imply, they are not generally error-free. some are, but others are not. the same is true of earlier e-texts. the quality has improved, yes, surely. it is still not highest quality. but they are volunteers, and thus they set their own bar for quality. and they certainly deliver quality that is high enough that we could use "continuous proofreading" and have the public zoom us to perfect. > it can't be done using any plain text regularization scheme you're wrong. dead wrong. *** anyway, jon, thanks for the information on your scanning experience. i come away from hearing it with an even more firm conclusion that scanning and image-cleanup is indeed the biggest part of the process. -bowerbird

4 3

re: [gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting
by Bowerbird＠aol.com 10 Mar '05

10 Mar '05

jon said: > But what resolution? their website tells you. in a $10,000 machine, it better be good. > There's a web site somewhere giving a review > of the model you describe, but don't have the URL handy. we don't need a review on a website, as there's plenty of d.p. people here who'll vouch it's an amazing machine. > Yes, that is true. There is a lot of interest in DP > to redo a lot of the pre-DP classics in the PG corpus, > from what I understand, so it may get done anyway > even if PG does not encourage it. you didn't read what i wrote. it is _distributed_proofreaders_ that -- as a whole -- is more interested in doing new books than re-doing old ones. if they wanted to do it before now, they would have. but they haven't... (a few of 'em have redone old books. including some html versions that jim recently asked them to fix up. but as a course of action, not much.) michael doesn't tell d.p. what to do. he doesn't tell _anyone_ what to do. even if you _ask_ him for guidance, he's usually too stubborn to give it. -bowerbird

2 1

a wiki-like mechanism for continuous proofreading and error-reporting
by Bowerbird＠aol.com 09 Mar '05

09 Mar '05

here's one from last week that never got mailed out... i'll be leaving here again very shortly, since i have been reminded just why i had stayed away, because this place can be so negative and destructive and poisonous... ick! *** jon, you said the scanning took "much more than four hours". so how long _did_ it take? and if you were to do it again, with your present scanner, how long would it take you? also, how long did it take you to manipulate the images? and how did you do that? what specific steps did you take, in what order, and what program did you use to do all that? is there anything of all that which you'd do differently now? *** jon said: > OCR is quite fast. It's making and cleaning up the scans > which is the human and CPU intensive part. well, it all depends, jon, it all depends... with the right hardware -- like office-level machinery -- 60 pages a minute can get swallowed by the gaping maw. that's right. one page per second. that seems fast to me. that means your 450-page scan-job would take 7.5 minutes. probably took you more time than that to cut the cover off. and the machine will automatically straighten those pages, o.c.r., and upload to the net, while you stare dumbfounded... likewise with the kirtas 1200, geared to scanning books. http://www.kirtas-tech.com/ it does "only" 20 pages a minute, but hey, 1000 pages/hour ain't nothing to sneeze at. they estimate that in a full-scale production environment, the price-per-scan is 3 cents a page. sounds like brewster should buy a half-dozen of these babies. so it all depends. the bottom line, though, is that if a person has experience, good equipment, solid software, and a concentrated focus, they can open a paper-book to start scanning it and move it all the way through to finished, high-power, full-on e-book in one evening, maybe two. *** i said: > third, you used a reasonable naming-scheme for your image-files! > the scan for page 3, for instance, is named 003.png! fantastic! > and when you had a blank page, your image-file says "blank page"! > please pardon me for making a big deal out of something so trivial > -- and i'm sure some lurkers wrongly think i'm being sarcastic -- > but most people have no idea how uncommon this common sense is! > when you're working with hundreds of files, it _really_ helps you > if you _know_ that 183.png is the image of page 183. immensely. > even the people over at distributed proofreaders, in spite of their > immense experience, haven't learned this first-grade lesson yet. i forgot to mention earlier that my processing tool can automatically rename your image and text-files, based on the page-numbers that it finds right in the text-files (which it extends in sequence for those files without a page-number -- usually the section-heading pages). so even if you're dealing with someone else's scans, and _they_ didn't name their files wisely, you don't have to deal with the consequences. *** jon said: > I believe as you do that an error reporting system is a good idea > so readers may submit errors they find in the texts they use -- > sort of an ongoing post-DP proofing process. i didn't elaborate earlier that it goes much deeper than that. a very important point here is that an error-reporting system -- over and above the obvious effect of getting errors fixed -- will actively incorporate readers into the entire infrastructure, making them active participants cumulating a world of e-books. if you have ever edited a page on a wiki, you're likely aware that the experience gives a very strong feeling of _empowerment_ -- because you can "leave your mark" right on a page, quite literally. if we set up a wiki-page to collect the error-reports for an e-text, in a system allowing people to check the text against a page-image, they'll be much more motivated to report errors than they are now, with the "send an e-mail" system. the feedback is more immediate, and compelling, with a wiki. furthermore, by collecting the reports, in the change-log right on the wiki, you can avoid duplicate reports. you can also give rational for rejecting any submitted error-reports, and/or engage people in a discussion about whether to act on a report. all of this makes your readers feel _responsible_ for the e-texts. a lifetime of experience with printed matter has made people very _passive_ about typographic errors. there's no reason to "report" an error they find in a newspaper, for instance, because hey, it's already been printed. the same with a magazine or a printed book. water under the bridge. and they translate that same attitude over to e-books, even though it _does_ do good to report errors there. so we need to do something to shake them out of their passivity, something to make them feel _responsible_ for helping fix errors. (just for the record, although i use the term "wiki", i don't mean it literally. what i have in mind is more of a "guestbook" type method, where people can _add_ their text to the page, but not necessarily _delete_ what other people have added. it's thus more like a blog, where everyone can add their comments to the bottom of the page, but the top part stays constant, to list the "official" information. but i'll still use the term "wiki" to connote a free-flowing attitude.) in addition to the wiki, you can build an error-reporting capability into the viewer-program that you give people to display the e-texts. if they doubt something in the e-text, they click a button and boom!, that page-image is downloaded into the program so they can see it. if they have indeed found an error, they copy the line in its bad form, correct it to its good form, and then click another button and boom!, the error-report is e-mailed right off to the proper e-mail address. this symbolic (and real!) incorporation of readers into our processes is a rad thing to do. but it's not the _only_ benefit of such a system; it also facilitates the automation of the error-correction procedures. the error-report can be formatted such that your software can automatically summon the e-text _and_ the relevant page-scan. so you see a screen with the page-scan _and_ the error-report. you check its merit, and if it's good, click the "approve" button and the e-text is automatically edited. further, the change-log is updated right on the wiki-page for that e-text, and anyone who requested error-notification gets an e-mail describing the change. auxiliary versions of the e-text -- like the .html and .pdf files -- are automatically updated. and all you did was click one button... face it, if you're dealing with 15,000+ e-texts, doing it manually is a sure-fire way to burn yourself out. who needs that hassle? i mocked up a demo up this, using a simple a.o.l. guestbook script. i'm sure you versatile script-kiddies here could do something that was much more sophisticated, but my version will give you the idea: http://users.aol.com/bowerbird/proof_wiki.html -bowerbird

2 1

Re: [gutvol-d] lest the message be missed
by Bowerbird＠aol.com 09 Mar '05

09 Mar '05

miranda said: > The scans are done and if you like I will mail you a copy. that would be great, miranda! i'd love to help you out! and you don't even have to mail them! just put them online, in one zip file, and let me know where they are. i'll go get 'em. oh yeah, i only have the _english_ module for abbyy finereader, and that won't work well with accented text, so you'll have to do the o.c.r. on the images with the correct language modules, ok? so put the o.c.r. files in a zip file too, so i can grab those. and i should say that my tool facilitates the proofing process, but can't help much if you're proofing a language you don't know, and i only know english. so you might not get the best results from me. so far, i'm just concentrating on doing _english_ books. once i get those down, then i can do work on helping people who speak other languages extend the tool for their purposes as well. oh yeah, one more thing. please include spell-check dictionaries for the languages that are contained in the text, because i only have an english one. marcello can probably help you find those... > I'd like to have the proofed book back before the weekend, > if that's not too much trouble. my schedule is full for the next week or more. i can't even get to the second half of "my antonia" until friday at the earliest, and probably next week. so it'll be a couple weeks before i can get to yours. and this sounds like it's not really "an average book", so it might take me two or three evenings, not just one. but i'd still love to take on your project sometime! so put those scans somewhere where i can grab 'em, and i'll get to them at my very first opportunity, ok? oh yeah, i do hope they are 600-dpi scans, like jon's. those were really fine. they gave very clean o.c.r., and they're very pleasant to look at, as well. nice. and let me know if you decide to start doing the project, because there's no reason for us to duplicate our efforts. it won't hurt my feelings if you get impatient waiting! but since there's lots of books over at d.p. for you to do, if you want to just hold off on this one and let me take it, feel free! -bowerbird

1 0

hey marcello
by Bowerbird＠aol.com 09 Mar '05

09 Mar '05

hey marcello, since you recently noted that some of the e-texts are subsets of other e-texts -- like the separate e-texts for books of the bible -- how about if you continue your constructive streak and give us a summary of these duplicated e-texts? best would be to delete the subsets and give us just the larger "collection" -- so we would have the smallest possible list of all the unique books in the whole library -- but if it would be easier to do it the other way -- delete the collections -- that would be fine too. whenever you get a chance... thanks... -bowerbird

1 0

Re: [gutvol-d] lest the message be missed
by Joshua Hutchinson 09 Mar '05

09 Mar '05

----- Original Message ----- From: Bowerbird(a)aol.com > > you can take an average p-book from scans to e-book in one evening. > one evening. HAHAHAHAHAHAHA *gasp* *wheeze* HAHAHAHAHAHAHA That's the most laughable thing I've read in a long time. If laughter helps us live longer, you just added 5 years to my life. Josh

3 2

re: [gutvol-d] stop changing the message-headers
by Bowerbird＠aol.com 09 Mar '05

09 Mar '05

marcello said: > Prove this assertion. history is the proof. study it. -bowerbird

2 1