
Karl Eichwalder wrote:
Joshua Hutchinson wrote:
I realize that scans are important to you, but they simply are NOT to most of us, beyond their use as OCR sources.
The "us" seems to be a very limited crowd ;) Scans are very important to all of "us" who rely on texts they can verify. Without scans the scientific world (universities and similar intitutes) will simply ignore the Gutenberg texts.
I don't have a problem with archiving the images, but I will not change my workflow (or recommend changing DP's) to get "archival quality" scans. It simply is not something I deem important to the work we do here.
Texts and scans are both important; combining them makes a valuable e-book. Because it isn't that difficult to combine them and offer them side by side for reading, we should work on this issue and make it happen.
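(As a rough illustration of "side by side": the pairing can be as simple as a generated HTML page that shows each proofed text page next to its scan. Here is a minimal sketch in Python; the directory layout, file names and page template are purely illustrative, not any actual PG/DP convention.)

import html
from pathlib import Path

# One table row per page: scan image on the left, proofed text on the right.
# Assumes text pages (*.txt) and scans (*.png) share base file names.
PAGE_TEMPLATE = """<tr>
  <td><img src="{scan}" alt="scan of page {n}"/></td>
  <td><pre>{text}</pre></td>
</tr>"""

def build_side_by_side(text_dir, scan_dir, out_file):
    rows = []
    for n, txt in enumerate(sorted(Path(text_dir).glob("*.txt")), start=1):
        scan = Path(scan_dir) / (txt.stem + ".png")   # matching base name assumed
        rows.append(PAGE_TEMPLATE.format(
            scan=scan, n=n, text=html.escape(txt.read_text(encoding="utf-8"))))
    body = "\n".join(rows)
    Path(out_file).write_text(
        f"<html><body><table>\n{body}\n</table></body></html>",
        encoding="utf-8")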
It is interesting that many who volunteer for PG and DP don't view scans as important in and of themselves, while the Internet Archive (and maybe this has changed) doesn't view structured, proofed digital text (which DP produces) as vital -- IA's focus has been on acquiring, archiving and making available page scans (which is a GOOD thing; I'm not knocking it). I do agree with Karl that both are equally important, and technology improvements over the last few years have made it feasible to archive and deliver both side by side.

As a result of the experimental/demo "My Antonia", a few of us are now exploring redoing the top 500 or 1000 classic English-language Works. We would mobilize scholars and enthusiasts (including those expert in bibliography) and, on a case-by-case basis, determine the particular public domain edition(s) of each Work from which we'd like to secure high-resolution, high-quality page scans *done right*. (We'll probably have to work with participating libraries for source material -- a very good reason to establish the close ties to the library world which PG has not yet managed to build.)

Then the scans would be submitted to DP for conversion into structured digital text (SDT) -- of course, PG would get the work product in the form they want. We would take the SDT and mark it up using a selected subset of TEI (probably PGTEI or something close to it). Meanwhile, the scans would be cleaned up and provided at various resolutions by another team of volunteers, as well as encapsulated within DjVu and PDF. The SDT-TEI would be converted via XSLT to XHTML 1.1 for web display (allowing end users to pick their own preferred CSS stylesheets), and to other formats as demand arises.

We'd put together a "Distributed Catalogers" group, mostly librarian volunteers, to assemble high-quality, authoritative and uniform metadata for each edition, probably (but not necessarily -- we have to decide on this) encoded in MARC and/or MARC-XML. Archiving and organizing the editions would follow the WEMI model (Work-Expression-Manifestation-Item). Of course, the full database of SDT would be searchable with at least a Google-quality search engine.

We will also work hard to make it much easier to correct the texts immediately if any errors are found; having the original page scans available makes it much easier to check for errors, and hopefully the DP process will keep the error rate very low. (Bowerbird is thanked for pointing out the need for a robust system of continuous, "post-publication" error correction -- having the scans available and online is *critical* for this functionality.)

Since the markup will include paragraph-level unique identifiers, it will be possible for the world to link into each edition down to the paragraph level (producing uniform links), and, when XLink becomes more common, down to individual words. It will also be possible to build a community around each Work (at least where there are enough interested people to organize one for a particular Work or author), which can then annotate, interlink, blog about and discuss each edition -- and for this it is *necessary* that we use XML markup *done right*. (This is an answer to those who believe plain text is sufficient -- it is NOT sufficient if we want to integrate the texts at a high level with various human endeavors, build a more robust knowledge management system, and so on; those who view the only purpose of digital texts as casual, private reading are taking a very limited view of the many possibilities. Anyway, XML is self-describing text, so we fulfill the requirement for longevity which Michael Hart has preached from the very beginning.)
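To make the markup-and-conversion steps above a bit more concrete, here is a minimal sketch in Python using lxml: it gives every TEI paragraph a stable xml:id (the basis for paragraph-level linking) and then applies an XSLT stylesheet to produce XHTML. The file names and the stylesheet name are hypothetical placeholders, not existing project artifacts; a real PGTEI-to-XHTML conversion would of course involve a much richer stylesheet plus validation against the chosen TEI subset.

from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # the xml:id attribute

def add_paragraph_ids(tree):
    """Give every TEI <p> a stable, sequential xml:id (p00001, p00002, ...)."""
    for n, p in enumerate(tree.iter(f"{{{TEI_NS}}}p"), start=1):
        p.set(XML_ID, f"p{n:05d}")

def tei_to_xhtml(tei_path, xslt_path, out_path):
    """Parse the TEI source, add paragraph ids, then transform to XHTML."""
    tei = etree.parse(tei_path)
    add_paragraph_ids(tei)
    transform = etree.XSLT(etree.parse(xslt_path))
    xhtml = transform(tei)
    xhtml.write(out_path, pretty_print=True, xml_declaration=True, encoding="utf-8")

if __name__ == "__main__":
    # hypothetical file names -- not actual project artifacts
    tei_to_xhtml("my_antonia.tei.xml", "pgtei-to-xhtml.xsl", "my_antonia.xhtml")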
There's more, but that gives a rough flavor. If anyone here is interested in becoming a part of this project, let me know in private email. We're now working on funding this as a non-profit so there will be adequate funds to *hire* full-time developers to build the necessary infrastructure. It will be independent of academia, to ensure the work product is completely and totally open and free to the world. (It disturbs me to no end that so many academically sponsored digitization projects over the years, including those at publicly funded universities, keep their work product under wraps -- I won't name names. This openness is the beauty and the power of the PG vision.)

This project is not intended to compete with or replace PG/DP -- it will be limited in scope -- but rather to work synergistically with PG/DP to properly redo the most popular English-language works. Of course, others can copy the process for doing or redoing the most popular works in other languages. We will probably also welcome DP to submit to us the scans of works they've already done and for which they can release the scans. For a primitive, largely unfinished view of what could be done, see the "My Antonia" project at http://www.openreader.org/myantonia/ .

Jon Noring