Page Scans versus Real eText?

[I posted the following to The eBook Community. But clearly most of the real experts in digitizing texts are found right here on gutvol-d, so I'm reposting the message here. I'm especially curious about the economics associated with commercially doing what Michael Hart and PG have done for years and years.]

There are essentially two ways to digitize and place online textual works which exist only on paper. This applies, for example, to older public domain books.

1) Digitally scan the publication, and place the resultant page scan images online as the final product. Optionally, these page scans can be OCR'd to produce a raw (uncorrected) searchable index, which users can search to find the page scans that may be of interest to them. Additionally, the scanned images can be "packed" into a PDF document for online distribution and viewing, and for offline printing.

2) Convert the publication into real digital text, using either OCR or keying in by hand to produce the raw text. Then, significant human effort is expended to proofread the digital text and correct any transcription, OCR, and other errors. The resultant cleaned-up text can either be kept in plain text form (traditional Project Gutenberg text), or marked up into XML documents using some markup vocabulary. (Optionally, the original page scans can be kept alongside the cleaned-up/marked-up text, thereby accruing whatever advantages the first method gives.)

Clearly, digital text is superior in many respects to page scan images. The biggest downside is the need to do the laborious human proofing. Online proofing systems such as Distributed Proofreaders have made proofing much easier to do, mobilizing many willing volunteer proofers and providing a convenient Internet interface.

However, in discussions with various people on this topic I've not been able to explain, in a cogent and compelling way, all the reasons why the additional effort should be expended to produce high-quality digital text. Some of these people believe that putting the scanned page images online is more than sufficient.

So, this is an "Ask TeBC" request for better arguments to use. I hope, too, that it catalyzes interesting discussion on the various aspects associated with the general issue of getting our printed heritage online, which is obviously an ebook-related topic. And this applies not only to books, but also to periodicals, newspapers, and many other types of historical documents.

*****

Another related question: If I have a typical printed book (say, a 300-page work of fiction), and I hire a commercial company to convert it into a clean, high-quality digital text with XML markup (e.g., using TEI-Lite), how much would it cost? In the U.S.? Overseas (such as in India)? Anyone know?

*****

Jon Noring

Replying to both lists. . .mh

On Sat, 13 Nov 2004, Jon Noring wrote:
[I posted the following to The eBook Community. But clearly most of the real experts in digitizing texts are found right here on gutvol-d, so I'm reposting the message here.
I'm especially curious about the economics associated with commercially doing what Michael Hart and PG have done for years and years.]
My guess is that by the time there is a serious commercial eBook industry, say to the point where the next David Letterman and Jay Leno are joking about eBooks the way they were about the Web 5-10 years ago, most of the public domain books will already have been scanned, OCRed, and placed online, and will be going through translation into the various languages of the world. If not most, then certainly most of the ones that are easy to find and of general interest.

But still, at the rate things have been going over the past 15 years, another 15 years should put us only a few years from this goal, well within sight, before the commercial eBook industry is developed enough to be part of cultural awareness.

Michael S. Hart
Project Gutenberg