Re: [gutvol-d] Final PGTEI... page numbers

28 Oct 2004

      Bowerbird said:
...
brad said:
...
...
Detractors of XML on this list have brought up the fact that the
TEI manual is 1400 pages long as a negative.  Why?
...
but since you asked, the reason this is seen as a "negative"
is because we think that precious few of the volunteers who
have traditionally shouldered the effort of creating e-texts
will continue to do so if an understanding of those 1400 pages
of t.e.i. documentation were to become a prerequisite.
but maybe now distributed proofreaders has enough people
on-board that they feel less uncomfortable taking that risk...

The "1400" pages are for the full-blown TEI spec, which includes some
pretty obscure stuff. Interspersed within it are long and (to me)
fascinating general discourses on the structure of textual documents,
with copious examples. In essence, it is probably one of the better
"textbooks" ever written on this topic, even if it is only there to
support the description of the TEI markup.
...
or maybe not, as their tentative plan thus far involves
adding two "markup" rounds (at least, and maybe more)
to their existing two "proofing" rounds, so as to minimize
the number of people who need to be concerned with markup.

Essentially yes. Distributed Proofreaders' longer-term vision, as I
understand it (and Juliet can correct me where I'm off anywhere in
this message), is to settle upon some subset of TEI to apply to all
documents: either TEI-Lite or some other comparable subset. For the
occasional oddball document, the more extended TEI will be used in
"manual" mode. In addition, for most of DP's volunteers, the markup
will be "under-the-hood" and largely invisible. Most of the volunteer
work is copyediting the text (correcting OCR errors), not markup
insertion, so there is no need to require these volunteers to learn
the gory details of TEI. Only the most experienced and interested of
the DP volunteers, who do the final cleanup/finishing stages, will
actually play with the markup itself.
...
as you put it, the learning of a complex system like t.e.i. is often
"a gradual process of incremental epiphanies".  can we _survive_
the situation where thousands of volunteers are put through that?
with perhaps many becoming alienated in the course of doing so?

Well, as I noted above, DP, where the action is for large-scale
production of e-texts (they are now the actual engine which drives
PG's growth), does not plan to inflict TEI on the general first-level
volunteers (this is what I inferred from my talks a while back with
Charles). With regard to the specifics of the markup which DP will
eventually use (likely a subset of TEI, as previously noted), that will
ultimately be determined by them based on compatibility with the
production interface as well as what works best for the various uses
(note the plural) of the texts.

[Aside: the DP-produced XML Master texts will certainly be used for
many purposes, all of which impose requirements on the markup
specification that must be considered -- this is the biggest area
not being discussed on gutvol-*. The most exciting of these is
archiving the DPXML texts in a special library-like repository which
allows a very high level of end-user interface and customizability
for the collection (e.g., bookmarking, annotation, interlinking
within the repository and to other content repositories, blogging,
etc. -- all things several associates and I are now working on).
Of course, the other uses are to generate
portable digital formats as the end-user wants, higher-quality
text-to-speech capability, and Michael Hart's dream of language
translation. These, too, guide the nature of the Master markup
vocabulary. Of course, there must be library-compatible and properly
designed catalog, metadata, and identifier information for each e-text
in the repository. And where they exist, the original page scans of
the source documents will also be available and interlinked with the
XML versions. Brewster Kahle at the Internet Archive will *gladly*
archive the page scans for DP/PG. I envision that most of the earlier
portion of the PG collection, which contains most of the classics,
will be redone by DP from source documents to ensure proper metadata
collection, uniformity and conformity with the rest of the DPXML texts
and to have the page scans available. Once DP gets into major
production with many more volunteers, redoing the earlier texts won't
be a big deal -- it needs to eventually be done anyway, in my view.]

I would think and hope that DP will convene a formalized working group
of the various experts and enthusiasts here and elsewhere to hammer
out the DP Markup Specification based on requirements gathering and
analysis, which is the proper way to do this. The DPMWG will have a
more formalized and committed leadership structure, with weekly
teleconference calls. From my standards working group experience, it's
amazing how much gets done during weekly teleconferences and the
occasional face-to-face meeting (biannual or annual), while written
listserv exchanges in a group like gutvol-* usually end up going
around and around in circles. I expect it won't take that long to
hammer out the "beta" of the DP Markup Vocabulary when the working
group is organized properly and committed to generating and then
resolving the various requirements.

I would even ask someone like C. Michael Sperberg-McQueen to be an
advisor to the working group (his brother Roger Sperberg and I have
worked closely together on various projects in the past <smile/>). I
would think that DP's vision to include TEI in its next-generation
system so as to do *large-scale* production of e-texts (possibly up to
a few hundred *per day* to begin the process of producing one million
texts in a decade or two) will greatly excite the TEI community, and
we will attract some pretty smart and dedicated working group members to add
to the several already here. Volunteerism is not only for the "rank
and file" (those who will do the basic copyediting), but also includes
those who are more technically minded and understand the markup issues
as they relate to the production environment.

Jon Noring