
[I posted the following to The eBook Community. But clearly most of the real experts in digitizing texts are found right here on gutvol-d, so I'm reposting the message here. I'm especially curious about the economics of doing commercially what Michael Hart and PG have done for years and years.]

There are essentially two ways to digitize and place online textual works which exist only on paper. This applies, for example, to older public domain books.

1) Digitally scan the publication, and place the resulting page scan images online as the final product. Optionally, these page scans can be OCR'd to produce a raw (uncorrected) text index that users can search to find page scans of interest. Additionally, the scanned images can be "packed" into a PDF document for online distribution and viewing, and for offline printing.

2) Convert the publication into real digital text, using either OCR or keying in by hand to produce the raw text. Then, significant human effort is expended to proofread and correct the digital text for transcription, OCR, and other errors. The resulting cleaned-up text can either be kept in plain text form (traditional Project Gutenberg text), or marked up into XML documents using some markup vocabulary. (Optionally, the original page scans can be kept alongside the cleaned-up/marked-up text, thereby accruing whatever advantages the first method gives.)

Clearly, digital text is superior in many respects to page scan images. The biggest downside is the need to do the laborious human proofing. Online proofing systems such as Distributed Proofreaders have made proofing much easier to do, mobilizing many willing volunteer proofers and providing a convenient Internet interface.

However, in discussions with various people on this topic I've not been able to explain, in a cogent and compelling way, all the reasons why the additional effort should be expended to produce high-quality digital text.
Some of these people believe that putting the scanned page images online is more than sufficient.

So, this is an "Ask TeBC" request for better arguments to use. I hope, too, that it catalyzes interesting discussion on the various aspects of the general issue of getting our printed heritage online, which is obviously an ebook-related topic. And this applies not only to books, but also to periodicals, newspapers, and many other types of historical documents.

*****

Another related question: If I have a typical printed book (say, a 300-page work of fiction), and I hire a commercial company to convert it into a clean, high-quality digital text with XML markup (e.g., using TEI-Lite), how much would it cost? In the U.S.? Overseas (such as in India)? Anyone know?

*****

Jon Noring