
Karl Eichwalder wrote:
Joshua Hutchinson wrote:
I realize that scans are important to you, but they simply are NOT to most of us, beyond their use as OCR sources.
The "us" seems to be a very limited crowd ;) Scans are very important to all of "us" who rely on texts they can verify. Without scans the scientific world (universities and similar intitutes) will simply ignore the Gutenberg texts.
I don't have a problem with archiving the images, but I will not change my workflow (or recommend changing DP's) to get "archival quality" scans. It simply is not something I deem important to the work we do here.
Texts and scans are both important; combining them makes a valuable e-book. Because it isn't that difficult to combine them and offer them side by side for reading, we should work on this issue and make it happen.
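(As a rough illustration of "side by side": the pairing can be as simple as a generated HTML page that shows each proofed text page next to its scan. Here is a minimal sketch in Python; the directory layout, file names and page template are purely illustrative, not any actual PG/DP convention.)

import html
from pathlib import Path

# One table row per page: scan image on the left, proofed text on the right.
# Assumes text pages (*.txt) and scans (*.png) share base file names.
PAGE_TEMPLATE = """<tr>
  <td><img src="{scan}" alt="scan of page {n}"/></td>
  <td><pre>{text}</pre></td>
</tr>"""

def build_side_by_side(text_dir, scan_dir, out_file):
    rows = []
    for n, txt in enumerate(sorted(Path(text_dir).glob("*.txt")), start=1):
        scan = Path(scan_dir) / (txt.stem + ".png")   # matching base name assumed
        rows.append(PAGE_TEMPLATE.format(
            scan=scan, n=n, text=html.escape(txt.read_text(encoding="utf-8"))))
    body = "\n".join(rows)
    Path(out_file).write_text(
        f"<html><body><table>\n{body}\n</table></body></html>",
        encoding="utf-8")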
It is interesting that many who volunteer for PG and DP don't view scans as important in and of themselves, while the Internet Archive (and maybe this has changed) doesn't view structured, proofed digital text (which DP produces) as vital -- IA's focus has been on acquiring, archiving and making available page scans (which is a GOOD thing; I'm not knocking it). I do agree with Karl that both are equally important, and technology improvements over the last few years have made it feasible to archive and deliver both side by side.

As a result of the experimental/demo "My Antonia", a few of us are now exploring redoing the top 500 or 1000 classic English-language Works. We would mobilize scholars and enthusiasts (including those expert in bibliography) and, on a case-by-case basis, determine the particular public domain edition(s) of each Work from which we'd like to secure high-resolution, high-quality page scans *done right*. (We'll probably have to work with participating libraries for source material -- a very good reason to establish the close ties to the library world which PG has not yet managed to build.)

Then the scans would be submitted to DP for conversion into structured digital text (SDT) -- of course, PG would get the work product in the form they want. We would take the SDT and mark it up using a selected subset of TEI (probably PGTEI or something close to it). Meanwhile, the scans would be cleaned up and provided at various resolutions by another team of volunteers, as well as encapsulated within DjVu and PDF. The SDT-TEI would be converted via XSLT to XHTML 1.1 for web display (allowing end users to pick their own preferred CSS stylesheets), and to other formats as demand arises.

We'd put together a "Distributed Catalogers" group, mostly librarian volunteers, to assemble high-quality, authoritative and uniform metadata for each edition, probably (but not necessarily -- we have to decide on this) encoded in MARC and/or MARC-XML. Archiving and organizing the editions would follow the WEMI model (Work-Expression-Manifestation-Item). Of course, the full database of SDT would be searchable with at least a Google-quality search engine.

We will also work hard to make it much easier to correct the texts immediately if any errors are found; having the original page scans available makes it much easier to check for errors, and hopefully the DP process will keep the error rate very low. (Bowerbird is thanked for pointing out the need for a robust system of continuous, "post-publication" error correction -- having the scans available and online is *critical* for this functionality.)

Since the markup will include paragraph-level unique identifiers, it will be possible for the world to link into each edition down to the paragraph level (producing uniform links), and, when XLink becomes more common, down to individual words. It will also be possible to build a community around each Work (at least where there are enough interested people to organize one for a particular Work or author), which can then annotate, interlink, blog about and discuss each edition -- and for this it is *necessary* that we use XML markup *done right*. (This is an answer to those who believe plain text is sufficient -- it is NOT sufficient if we want to integrate the texts at a high level with various human endeavors, build a more robust knowledge management system, and so on; those who view the only purpose of digital texts as casual, private reading are taking a very limited view of the many possibilities. Anyway, XML is self-describing text, so we fulfill the requirement for longevity which Michael Hart has preached from the very beginning.)
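To make the markup-and-conversion steps above a bit more concrete, here is a minimal sketch in Python using lxml: it gives every TEI paragraph a stable xml:id (the basis for paragraph-level linking) and then applies an XSLT stylesheet to produce XHTML. The file names and the stylesheet name are hypothetical placeholders, not existing project artifacts; a real PGTEI-to-XHTML conversion would of course involve a much richer stylesheet plus validation against the chosen TEI subset.

from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # the xml:id attribute

def add_paragraph_ids(tree):
    """Give every TEI <p> a stable, sequential xml:id (p00001, p00002, ...)."""
    for n, p in enumerate(tree.iter(f"{{{TEI_NS}}}p"), start=1):
        p.set(XML_ID, f"p{n:05d}")

def tei_to_xhtml(tei_path, xslt_path, out_path):
    """Parse the TEI source, add paragraph ids, then transform to XHTML."""
    tei = etree.parse(tei_path)
    add_paragraph_ids(tei)
    transform = etree.XSLT(etree.parse(xslt_path))
    xhtml = transform(tei)
    xhtml.write(out_path, pretty_print=True, xml_declaration=True, encoding="utf-8")

if __name__ == "__main__":
    # hypothetical file names -- not actual project artifacts
    tei_to_xhtml("my_antonia.tei.xml", "pgtei-to-xhtml.xsl", "my_antonia.xhtml")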
There's more, but that gives a rough flavor. If anyone here is interested in becoming a part of this project, let me know in private email. We're now working on funding this as a non-profit so there will be adequate funds to *hire* full-time developers to build the necessary infrastructure. It will be independent of academia, to ensure the work product is completely and totally open and free to the world. (It disturbs me to no end that so many academically sponsored digitization projects over the years, including those at publicly funded universities, keep their work product under wraps -- I won't name names. This openness is the beauty and the power of the PG vision.)

This project is not intended to compete with or replace PG/DP -- it will be limited in scope -- but rather to work synergistically with PG/DP to properly redo the most popular English-language works. Of course, others can copy the process for doing or redoing the most popular works in other languages. We will probably also welcome DP to submit to us the scans of works they've already done and for which they can release the scans. For a primitive, largely unfinished view of what could be done, see the "My Antonia" project at http://www.openreader.org/myantonia/ .

Jon Noring