Andrew wrote:
I must say that from the point of view of technical details about the texts themselves, James' preference for XML as the master file format does make me wonder somewhat. As we don't have another project (that I am aware of) to base our initial efforts in this format on, we will be working out the bugs on our own... Also, I believe the collection will grow much more slowly than if we were to use un-marked-up text files.
Why the rush? Initially it may grow more slowly, but it is much better to do it right from the start, including proper metadata collection and structuring. Notice how PG-MS (MS == Mother Ship) is spinning its wheels cleaning up both the text and the metadata of its 10,000+ texts (especially the older, pre-DP-era texts). The well-known adage applies here: "If you don't have the time to do it right the first time, when are you ever going to have the time?"

PG-MS has learned a lot over 30 years, especially in the realm of "what NOT to do" (such as ignoring source metadata and allowing several sources to be combined into one text -- things which are BAD for several reasons). PG should NOT be a publisher per se, but should rather preserve the past and make the texts useful in the present and future through accurate digitization. (This "what NOT to do" reminds me of the Monty Python TV sketch: "How Not To Be Seen" <laugh/>)

Note that DP itself is seriously considering switching to an XML-based system, likely built on a selected subset of TEI or TEI-Lite. I know James will disagree, but PG-Canada should do likewise. So long as the chosen subset is well defined and structurally/semantically oriented (with its own RelaxNG schema and its own namespace), things can be kept under strict control while still taking advantage of all the tools and expertise out there in TEI-Land. I'm not sure how PGTEI fits into this equation, but it certainly merits study.

Of course, working with DP is important. And it is important that trained librarians, such as Alev, play a central role in the design and collection of metadata/catalog-info.
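To make that last point concrete, here is a minimal, hypothetical sketch (in Python, using the lxml library) of how a master file could be checked against a small RelaxNG schema defining a TEI-like subset under its own namespace. The namespace URI, element names, and sample document are placeholders of my own invention, not an actual PG-Canada, PGTEI, or TEI-Lite schema:

from lxml import etree

# A tiny RelaxNG schema describing a TEI-like subset under its own namespace.
# The namespace URI and element names are illustrative only; a real PG-Canada
# subset would be drawn from TEI/TEI-Lite and agreed upon first.
RELAXNG_SCHEMA = b"""
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://example.org/ns/pgcanada-tei">
  <start>
    <element name="TEI">
      <element name="teiHeader">
        <element name="title"><text/></element>
        <element name="source"><text/></element>
      </element>
      <element name="text">
        <oneOrMore>
          <element name="p"><text/></element>
        </oneOrMore>
      </element>
    </element>
  </start>
</grammar>
"""

# A candidate master file (normally read from disk); note that the source
# metadata stays with the text instead of being discarded.
SAMPLE_DOC = b"""
<TEI xmlns="http://example.org/ns/pgcanada-tei">
  <teiHeader>
    <title>An Example Title</title>
    <source>Scanned from a 1923 Toronto printing</source>
  </teiHeader>
  <text>
    <p>First paragraph of the transcribed text.</p>
  </text>
</TEI>
"""

relaxng = etree.RelaxNG(etree.fromstring(RELAXNG_SCHEMA))
doc = etree.fromstring(SAMPLE_DOC)

if relaxng.validate(doc):
    print("Document conforms to the subset.")
else:
    # error_log reports which elements broke the schema, and where.
    print(relaxng.error_log)

The point is not this particular toolchain, but that a strictly controlled subset plus a schema lets every submitted master file be checked automatically before it enters the collection.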
However, the alluring vision of a collection of consistently marked-up texts is tempting. Perhaps I lack the prior experience to really know how long it will take to get to that point.
Yes, it is tempting. And there's nothing wrong with going about things slowly and methodically, to debug the process and to build a strong and robust system (such as DP-Canada?) that will eventually speed things up. Here's how I see the process going, which differs in some ways from how PG-MS does it now:

1) Select and find the texts (books, periodicals, etc.) which are relevant to PG-Canada's interests (it is important that PG-Canada define what its focus will be).

2) Copyright clear them.

3) Scan these texts, collect the metadata/catalog-info, and place the page scans online. (Optionally, OCR can be run on these scans, and the raw, uncorrected OCR text can be used to provide a "temporary" full-text-search capability over the collection of page scans; a rough sketch of this appears at the end of this message.)

4) Start converting selected texts into XML, prioritizing them by various criteria (to be determined). Eventually this will be done via the next-generation DP, but at the start do it manually (maybe run the text through the current DP to remove scanning errors, and then mark it up afterwards).

It is clear that PG-Canada may build a big library of page scans while the production of XML texts from them lags behind. That's not a problem: copyright clearing and scanning are themselves very important, and the intermediate product, the page scans, is itself useful in the interim. PG-Canada can work with Brewster Kahle and the Internet Archive on its "Million Book Project" (to scan one million books and place the scans online). It might even be possible to acquire scans from Brewster, in addition to supplying scans to Brewster. Note that Brewster is now focusing his book-scanning efforts in, where else: CANADA!

Jon Noring
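P.S. To make the step-3 idea above concrete, here is a rough, hypothetical sketch (Python again, with made-up file names and data) of a crude "temporary" full-text index built from raw, uncorrected OCR. Nothing like this is an existing PG or DP tool:

import re
from collections import defaultdict

def build_index(ocr_pages):
    """Map each lowercased word to the set of page-scan ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in ocr_pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the page-scan ids containing every word of the query (a simple AND search)."""
    words = [w.lower() for w in query.split()]
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for w in words[1:]:
        hits &= index.get(w, set())
    return hits

# Raw OCR keyed by (hypothetical) page-scan file name.  OCR errors such as
# "Tor0nto" are tolerable here: a hit only needs to lead the reader to the
# right page image, which remains the authoritative record.
ocr_pages = {
    "book042_p001.png": "CHAPTER I. The expedition left Tor0nto in early spring.",
    "book042_p002.png": "the river was frozen, and the party waited for the thaw",
}

index = build_index(ocr_pages)
print(search(index, "river frozen"))   # prints {'book042_p002.png'}

Once a text has been properly converted to XML, this stopgap index for it can simply be thrown away.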