a wiki-like mechanism for continuous proofreading and error-reporting

here's one from last week that never got mailed out... i'll be leaving here again very shortly, since i have been reminded just why i had stayed away, because this place can be so negative and destructive and poisonous... ick!

***

jon, you said the scanning took "much more than four hours". so how long _did_ it take? and if you were to do it again, with your present scanner, how long would it take you? also, how long did it take you to manipulate the images? and how did you do that? what specific steps did you take, in what order, and what program did you use to do all that? is there anything of all that which you'd do differently now?

***

jon said:
OCR is quite fast. It's making and cleaning up the scans which is the human and CPU intensive part.
well, it all depends, jon, it all depends...

with the right hardware -- like office-level machinery -- 60 pages a minute can get swallowed by the gaping maw. that's right. one page per second. that seems fast to me. that means your 450-page scan-job would take 7.5 minutes. probably took you more time than that to cut the cover off. and the machine will automatically straighten those pages, o.c.r., and upload to the net, while you stare dumbfounded...

likewise with the kirtas 1200, geared to scanning books. http://www.kirtas-tech.com/ it does "only" 20 pages a minute, but hey, 1000 pages/hour ain't nothing to sneeze at. they estimate that in a full-scale production environment, the price-per-scan is 3 cents a page. sounds like brewster should buy a half-dozen of these babies.

so it all depends. the bottom line, though, is that if a person has experience, good equipment, solid software, and a concentrated focus, they can open a paper-book to start scanning it and move it all the way through to finished, high-power, full-on e-book in one evening, maybe two.

***

i said:
third, you used a reasonable naming-scheme for your image-files! the scan for page 3, for instance, is named 003.png! fantastic! and when you had a blank page, your image-file says "blank page"! please pardon me for making a big deal out of something so trivial -- and i'm sure some lurkers wrongly think i'm being sarcastic -- but most people have no idea how uncommon this common sense is! when you're working with hundreds of files, it _really_ helps you if you _know_ that 183.png is the image of page 183. immensely. even the people over at distributed proofreaders, in spite of their immense experience, haven't learned this first-grade lesson yet.
i forgot to mention earlier that my processing tool can automatically rename your image and text-files, based on the page-numbers that it finds right in the text-files (which it extends in sequence for those files without a page-number -- usually the section-heading pages). so even if you're dealing with someone else's scans, and _they_ didn't name their files wisely, you don't have to deal with the consequences.

***

jon said:
I believe as you do that an error reporting system is a good idea so readers may submit errors they find in the texts they use -- sort of an ongoing post-DP proofing process.
i didn't elaborate earlier that it goes much deeper than that. a very important point here is that an error-reporting system -- over and above the obvious effect of getting errors fixed -- will actively incorporate readers into the entire infrastructure, making them active participants in cumulating a world of e-books.

if you have ever edited a page on a wiki, you're likely aware that the experience gives a very strong feeling of _empowerment_ -- because you can "leave your mark" right on a page, quite literally. if we set up a wiki-page to collect the error-reports for an e-text, in a system allowing people to check the text against a page-image, they'll be much more motivated to report errors than they are now, with the "send an e-mail" system. the feedback is more immediate, and compelling, with a wiki. furthermore, by collecting the reports in the change-log right on the wiki, you can avoid duplicate reports. you can also give a rationale for rejecting any submitted error-reports, and/or engage people in a discussion about whether to act on a report.

all of this makes your readers feel _responsible_ for the e-texts. a lifetime of experience with printed matter has made people very _passive_ about typographic errors. there's no reason to "report" an error they find in a newspaper, for instance, because hey, it's already been printed. the same with a magazine or a printed book. water under the bridge. and they translate that same attitude over to e-books, even though it _does_ do good to report errors there. so we need to do something to shake them out of their passivity, something to make them feel _responsible_ for helping fix errors.

(just for the record, although i use the term "wiki", i don't mean it literally. what i have in mind is more of a "guestbook" type method, where people can _add_ their text to the page, but not necessarily _delete_ what other people have added. it's thus more like a blog, where everyone can add their comments to the bottom of the page, but the top part stays constant, to list the "official" information. but i'll still use the term "wiki" to connote a free-flowing attitude.)

in addition to the wiki, you can build an error-reporting capability into the viewer-program that you give people to display the e-texts. if they doubt something in the e-text, they click a button and boom!, that page-image is downloaded into the program so they can see it. if they have indeed found an error, they copy the line in its bad form, correct it to its good form, and then click another button and boom!, the error-report is e-mailed right off to the proper e-mail address.

this symbolic (and real!) incorporation of readers into our processes is a rad thing to do. but it's not the _only_ benefit of such a system; it also facilitates the automation of the error-correction procedures. the error-report can be formatted such that your software can automatically summon the e-text _and_ the relevant page-scan. so you see a screen with the page-scan _and_ the error-report. you check its merit, and if it's good, click the "approve" button and the e-text is automatically edited. further, the change-log is updated right on the wiki-page for that e-text, and anyone who requested error-notification gets an e-mail describing the change. auxiliary versions of the e-text -- like the .html and .pdf files -- are automatically updated. and all you did was click one button...

face it, if you're dealing with 15,000+ e-texts, doing it manually is a sure-fire way to burn yourself out. who needs that hassle?

i mocked up a demo of this, using a simple a.o.l. guestbook script. i'm sure you versatile script-kiddies here could do something that was much more sophisticated, but my version will give you the idea:

http://users.aol.com/bowerbird/proof_wiki.html

-bowerbird
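[To make that one-click workflow concrete, here is a minimal sketch of what a machine-readable error-report and the "approve" step might look like. The file names, report fields, and change-log format are all illustrative assumptions, not anyone's actual system:]

# sketch: a machine-readable error-report and the one-click "approve" step
# (file names, report fields, and the change-log format are hypothetical)
import datetime
import json
from pathlib import Path

report = {
    "etext": "my_antonia.txt",             # which e-text the reader was viewing
    "page": 183,                           # so the matching page-scan (183.png) can be shown alongside
    "bad":  "it was a beatiful morning",   # the line as it stands in the e-text (made-up example)
    "good": "it was a beautiful morning",  # the reader's proposed correction
}

def approve(report, log_path="changelog.json"):
    """Apply an approved error-report to the e-text and append it to the change-log."""
    etext = Path(report["etext"])
    text = etext.read_text(encoding="utf-8")
    if report["bad"] not in text:
        return "rejected: quoted line not found (already fixed? duplicate report?)"
    etext.write_text(text.replace(report["bad"], report["good"], 1), encoding="utf-8")
    log_file = Path(log_path)
    log = json.loads(log_file.read_text()) if log_file.exists() else []
    log.append({**report, "applied": datetime.datetime.now().isoformat()})
    log_file.write_text(json.dumps(log, indent=2))
    # a full system would also e-mail the notification list here and
    # regenerate the auxiliary .html/.pdf versions
    return "approved: e-text edited, change-log updated"

[Carrying the page number in the report is what lets the review screen pull up the matching page-scan next to the report, as described above.]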

Bowerbird asked:
jon, you said the scanning took "much more than four hours". so how long _did_ it take? and if you were to do it again, with your present scanner, how long would it take you?
It took about a minute or so to carefully place each page on the flat bed scanner, close the top, initiate the scanning, open the top, and replace the page with a new one. While one page was being scanned, I could do some related work such as naming and saving the previous scanned images. It got old pretty fast. So with a manual flat bed scanner, with an already chopped book, it took me about ten hours, spread over a few days, to do the 450 or so pages in "My Antonia" (I did it in cracks of time).

If I had chosen 300 dpi scanning (rather than 600 dpi), it would have gone faster, but not four times faster -- maybe 20-30% faster as a rough guess. Of course, one goal was archival-quality scans -- I could have cut corners to make it go faster.

Obviously, a fairly new model, professional-grade sheet feed scanner would have made life a lot easier. But lots of people, the average Joe, generally only have the el cheapo flat bed scanners which are *slow*, plus they may not have the necessary knowledge of scanning and image processing fundamentals to do a good job.

I have a strong background in image processing (plus being an engineer helps in general, as does being an amateur photographer), so I caught on quite fast after talking with a few of the pros on scanner newsgroups. As an aside, I'm used to processing giant images, on the order of 24000x18000 in pixel dimensions (fractal art printing using Kodak LVT -- now it's Durst Lambda and equivalent machines) -- and I did this a few years ago on lower-horsepower PCs.
also, how long did it take you to manipulate the images?
One needs enough *horsepower* to manipulate 600 dpi images (300 dpi images are *four* times smaller), plus some knowledge. Fortunately, most of today's basic Win XP boxes and laptops, and latest Mac OS X hardware, have sufficient horsepower (lots of memory helps.)
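[To put rough numbers on the horsepower point -- the 6" x 9" page size and 8-bit grayscale depth below are just illustrative assumptions:]

# rough arithmetic: why 600 dpi scans need horsepower
# (a 6" x 9" page and 8-bit grayscale are assumptions for illustration)
width_in, height_in = 6, 9

for dpi in (300, 600):
    w_px, h_px = width_in * dpi, height_in * dpi
    megabytes = w_px * h_px / 1e6   # 1 byte per pixel, uncompressed
    print(f"{dpi} dpi: {w_px} x {h_px} px, ~{megabytes:.0f} MB uncompressed")

# 300 dpi: 1800 x 2700 px, ~5 MB
# 600 dpi: 3600 x 5400 px, ~19 MB -- four times the pixels, as noted above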
and how did you do that? what specific steps did you take, in what order, and what program did you use to do all that? is there anything of all that which you'd do differently now?
There are "all-in-one" professional-level application tools that straighten out misaligned images, and crops them accordingly. I did this processing mostly by hand using Paint Shop Pro plus another tool for semi-automated alignment whose name eludes me at the moment (it was a 15-day trial software, and it expired the day after I completed the job -- they want $400 for that sucker. :^( ) For all of the above, this is why I'm advocating a semi-centralized project to scan public domain texts, working in parallel with other scanning projects, such as IA's: 1) We will use volunteers who have access to higher-end scanners (if not ones we supply), plus the knowledge on how to use them properly for books. 2) We probably can get $$$ to buy sheet feed scanners (which are not that expensive, less than 1% the cost of the automated page turning scanners IA is using in Canada, as will be discussed below.) 3) We will be able to afford the professional-level "all-in-one" scan processing software to do the automated alignment, consistent cropping, and image clean-up. 4) We will establish sufficient guidelines, plus QC procedures, to maintain a minimum scanned image quality.
OCR is quite fast. It's making and cleaning up the scans which is the human and CPU intensive part.
with the right hardware -- like office-level machinery -- 60 pages a minute can get swallowed by the gaping maw. that's right. one page per second. that seems fast to me.
The fairly good quality sheet feed scanners, which are "office-quality", may be able to do 5-7 archival-quality scans per minute (this includes down time due to setting up, stuck pages, etc.) So for scanning alone, not including keeping track of pages, page numbering, and other administrative details associated with scanning, the average 300 page book could be raw-scanned, by someone experienced, in about 45 minutes. This assumes 600 dpi optical (archival quality). It may go a little faster with 300 dpi optical settings -- not sure...
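[The arithmetic behind those figures, for anyone who wants to plug in their own scanner's throughput -- the numbers below are just the rough estimates quoted in this thread, not measurements:]

# back-of-envelope scan-time arithmetic, using the throughput figures above
def scan_minutes(pages, pages_per_minute):
    return pages / pages_per_minute

for ppm in (5, 7):   # office-quality sheet feed scanner at archival settings
    print(f"300 pages at {ppm} ppm: ~{scan_minutes(300, ppm):.0f} minutes")
print(f"450 pages at 60 ppm: ~{scan_minutes(450, 60):.1f} minutes")  # the high-end machine case

# 300 pages at 5 ppm: ~60 minutes
# 300 pages at 7 ppm: ~43 minutes  -- roughly where "about 45 minutes" comes from
# 450 pages at 60 ppm: ~7.5 minutes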
that means your 450-page scan-job would take 7.5 minutes. probably took you more time than that to cut the cover off.
Not possible, unless one bought the *big buck* (above office-level) sheet feed or page turning scanners, or one simply used a photocopy machine, and captured the low-rez images it produces. If you want to increase speed for a given technology, the scan quality (dpi and maybe color depth) has to be reduced. (Well, except maybe for photographic-type scanners, which are coming down in price, where a high-rez snapshot is taken at one moment of each page rather than running a scan head over the page. I see this as the long-term savior to produce archival quality scans, and do it more quickly. It may also be possible to autorotate the book to assure alignment, rather than doing alignment by image processing after-the-fact.)
and the machine will automatically straighten those pages, o.c.r., and upload to the net, while you stare dumbfounded...
The software exists, but this is *expensive*. You are not going to find the average person able to afford to buy the software. However, for the proposed "Distributed Scanners", we'll get the needed hardware and software to speed up the process, plus the book chopper for those books which can be chopped (both Charles and Juliet at DP have these guillotine-type page choppers -- they are quite impressive. <smile/>)
likewise with the kirtas 1200, geared to scanning books. http://www.kirtas-tech.com/ it does "only" 20 pages a minute, but hey, 1000 pages/hour ain't nothing to sneeze at. they estimate that in a full-scale production environment, the price-per-scan is 3 cents a page. sounds like brewster should buy a half-dozen of these babies.
Brewster is already using something like the Kirtas for the Canada book scanning project. Not sure if it is a Kirtas or some other brand, though. I was told, or read somewhere, that the page turning scanner cost IA about $100,000. This is *major* bucks. Whether such machines will come down a lot in cost remains to be seen -- I doubt they will come down very much. These are fairly complex robotic machines, designed to handle all kinds of variations found in books, and to be very gentle on them -- yet produce a reasonably good image. I don't see a big enough market for these machines to substantially come down in cost by the power of competition. The Kirtas cost quote of 3 cents per page (which I assume includes labor, but unsure whether it includes capital equipment amortization) works out to about $10/book, which is IA's goal, btw. It requires a trained person to operate it.
the bottom line, though, is that if a person has experience, good equipment, solid software, and a concentrated focus, they can open a paper-book to start scanning it and move it all the way through to finished, high-power, full-on e-book in one evening, maybe two.
Yes, but this is not for the average, ordinary Joe working in his basement. This requires a lot of $$$ in upfront investment to get this fancy equipment and software.

For books which can be chopped (such as books where the cover is falling off, or very common old printings), one can use $1000 (or less) sheet feed scanners, which may run at an average of 5-7 pages per minute. Of course, with a "fleet" of sheet feed scanners, and the right image capture system, it is possible to run them in parallel -- above two machines, though, it probably requires two people to keep the machines properly fed (I don't think one person can operate any more than two sheet feed scanners and keep them occupied -- just a guess.)

There's still a need for the whiz-bang scan cleanup software, which I know is expensive. It can be done by hand, but it is laborious. (This cleanup could be centralized at one place, but there's the issue of moving the raw scans to the central location.)
i forgot to mention earlier that my processing tool can automatically rename your image and text-files, based on the page-numbers that it finds right in the text-files (which it extends in sequence for those files without a page-number -- usually the section-heading pages).
so even if you're dealing with someone else's scans, and _they_ didn't name their files wisely, you don't have to deal with the consequences.
Well, yes. However, in "My Antonia" a lot of pages were not numbered at all (such as the last page in each chapter). I had to be especially careful not to mess up and lose which page is which. Of course, with the Kirtas or a sheet feed scanner properly run, it is possible to keep all the scans in the proper order (which, for a monoplex sheet feed scanner, means just running the ordered stack through once, and then once again.)
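[For what it's worth, here is a minimal sketch of the kind of renaming pass bowerbird describes. It is not his actual tool: the assumption that the page number sits at the foot of each OCR text file, the .txt/.png pairing, and the expectation that the existing names don't already collide with the new NNN-style names are all illustrative simplifications:]

# sketch: rename OCR text files (and their matching page images) by the page
# number found in the text, filling in the sequence for unnumbered pages
# (not bowerbird's actual tool -- the page-number-at-the-foot pattern and the
#  .txt/.png pairing are assumptions for illustration)
import re
from pathlib import Path

def renumber(folder="scans"):
    last_page = 0
    for txt in sorted(Path(folder).glob("*.txt")):
        text = txt.read_text(encoding="utf-8").strip()
        match = re.search(r"(\d{1,4})\s*$", text)                 # page number printed at the foot of the page
        page = int(match.group(1)) if match else last_page + 1    # unnumbered page: extend the sequence
        last_page = page
        for old in (txt, txt.with_suffix(".png")):                # rename the text file and its page image together
            if old.exists():
                old.rename(old.with_name(f"{page:03d}{old.suffix}"))

[The "extend the sequence" fallback is what would handle the unnumbered pages -- chapter-ending and section-heading pages -- mentioned above.]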
i didn't elaborate earlier that it goes much deeper than that.
a very important point here is that an error-reporting system -- over and above the obvious effect of getting errors fixed -- will actively incorporate readers into the entire infrastructure, making them active participants in cumulating a world of e-books.
This is *exactly* what we have in mind for LibraryCity's role in this, Bowerbird. We planned for this at least six months ago, but have not implemented anything yet -- we have bigger fish to fry at the moment. But we envision enabling readers to build community around digital texts, and this includes mechanisms for error reporting/correction -- but not limited to just that.
if you have ever edited a page on a wiki, you're likely aware that the experience gives a very strong feeling of _empowerment_ -- because you can "leave your mark" right on a page, quite literally.
Yes, LibraryCity plans to use wiki, or wiki-like, technology in various of its processes to build community, to enable people to become an integral part of the texts themselves, and to create new content -- to make the old texts come alive.
if we set up a wiki-page to collect the error-reports for an e-text, in a system allowing people to check the text against a page-image, they'll be much more motivated to report errors than they are now, with the "send an e-mail" system. the feedback is more immediate, and compelling, with a wiki. furthermore, by collecting the reports in the change-log right on the wiki, you can avoid duplicate reports. you can also give a rationale for rejecting any submitted error-reports, and/or engage people in a discussion about whether to act on a report.
all of this makes your readers feel _responsible_ for the e-texts.
Yes. This, btw, is also the power of Distributed Proofreaders -- it is an environment which not only increases trust in the work product, but it helps volunteers to feel like they are a part of something big.
in addition to the wiki, you can build an error-reporting capability into the viewer-program that you give people to display the e-texts. if they doubt something in the e-text, they click a button and boom!, that page-image is downloaded into the program so they can see it. if they have indeed found an error, they copy the line in its bad form, correct it to its good form, and then click another button and boom!, the error-report is e-mailed right off to the proper e-mail address.
With our XML-based approach, we have the power of XPointer/etc. to enable not only error reporting, but full annotation, interpublication linking and so on. We're going to let the public annotate the books they read (the annotations will point to the XML internally, not alter the documents themselves.) This is just one of many things we are thinking of. (Btw, one has to be careful in how to reconcile error correction of texts with their usefulness in a full hypertext setting -- we don't want error corrections to break the already-established links for annotations, interpublication linking, RDF/topic maps for indexing, and so forth.)
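[To make that last concern concrete, here is a tiny sketch -- purely illustrative, not LibraryCity's actual XPointer-based design -- of one way to keep corrections from silently breaking annotation links: anchor each annotation to a stable element id plus the exact text it points at, and re-check the quote whenever the underlying XML is edited.]

# sketch: detect when an error correction has invalidated an annotation anchor
# (element ids, file layout, and the annotation record are illustrative only)
import xml.etree.ElementTree as ET

annotation = {
    "target_id": "p183-s2",                     # stable id on the XML element being annotated
    "quote": "it was a beatiful morning",       # the exact text the note was attached to (made-up example)
    "note": "possible typo: 'beatiful'",
}

def check_annotation(xml_path, ann):
    root = ET.parse(xml_path).getroot()
    target = ann["target_id"]
    elem = root.find(f".//*[@id='{target}']")
    if elem is None:
        return "broken: target element no longer exists"
    if ann["quote"] not in "".join(elem.itertext()):
        return "stale: the quoted text was edited -- the annotation needs review"
    return "ok"

[An anchor that carries its own quoted text degrades gracefully: a correction that touches the annotated passage flags the annotation for review instead of leaving a silently dangling pointer.]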
the error-report can be formatted such that your software can automatically summon the e-text _and_ the relevant page-scan. so you see a screen with the page-scan _and_ the error-report. you check its merit, and if it's good, click the "approve" button and the e-text is automatically edited. further, the change-log is updated right on the wiki-page for that e-text, and anyone who requested error-notification gets an e-mail describing the change. auxiliary versions of the e-text -- like the .html and .pdf files -- are automatically updated. and all you did was click one button... face it, if you're dealing with 15,000+ e-texts, doing it manually is a sure-fire way to burn yourself out. who needs that hassle?
Hmmm, this is a lot like what James Linden is developing, which may be incorporated into PG of Canada's operations. <smile/> It is a good idea to maintain change tracking of all texts.

And to answer your last point: doing 15,000 texts, or a million texts, still needs some manual processing. It is also important to produce them correctly and uniformly in the first place, gather full metadata about them and put the metadata into a library-acceptable form (e.g., MARC), and for various fields, such as author name, to maintain a single authority database as librarians do. PG's collection has been assembled in such an ad-hoc fashion that trying to consistently autoprocess the collection is nigh impossible. That's why, to me, it is more important to redo the collection, put it on a common, surer footing (including building trust), before launching into doing a lot more texts. Imagine how difficult it would be to process one million texts if they were produced in the same ad-hoc fashion, without following some common standards.

In the meanwhile, while most of the pre-DP portion of the collection is redone, a strong focus can be made on the archival scanning and *public access* of public domain books (including tackling the 1923-63 era in the U.S.) and getting them online as soon as possible (including properly done metadata and copyright clearance). Then, when the next-gen systems are in place to resume major text production, the scans will be there, available, and already online for associating with the SDT versions.

And this is where we diverge -- I don't believe the full process can be done totally by machine; there's still a need for people to go over every text to make sure the markup for document structure and inline text semantics is correctly done. This is *very* important for the more advanced usages of the digital texts: indexing, interpublication linking, multiple output formats and presentation types, cataloging, data mining, and Michael Hart's dream of eventual language translation. PG's ad hoc approach up to now (which DP has partly fixed) works against making the text collection capable of meeting these very advanced needs. XML (or some other text structuring technology with similar fine granularity) is necessary -- it can't be done using any plain text regularization scheme, unless the scheme is made very complex, whereupon going to XML simply makes sense because it follows the general trends of XML in the publishing workflow.

Jon Noring