Andrew wrote:
I must say that from the point of view of technical details about the texts themselves, James' preference for XML as the master file format does make me wonder somewhat. As we don't have another project (that I am aware of) to base our initial efforts in this format on, we will be working out the bugs on our own... Also, I believe the collection will grow much more slowly than if we were to use un-marked-up text files.
Why the rush? Initially it may grow more slowly, but it is much better to do it right from the start, including proper metadata collection and structuring. Notice how PG-MS (MS == Mother Ship) is spinning its wheels cleaning up both the text and the metadata of its 10,000+ texts (especially the older, pre-DP-era texts). The well-known adage applies here: "If you don't have the time to do it right the first time, when are you ever going to have the time?"

PG-MS has learned a lot over 30 years, especially in the realm of "what NOT to do" (such as ignoring source metadata and allowing several sources to be combined into one text -- things which are BAD for several reasons). PG should NOT be a publisher per se, but should rather preserve the past and make the texts useful in the present and future through accurate digitization. (This "what NOT to do" reminds me of the Monty Python TV sketch: "How Not To Be Seen" <laugh/>)

Note that DP itself is seriously considering switching to an XML-based system, likely built on a selected subset of TEI or TEI-Lite. I know James will disagree, but PG-Canada should do likewise. So long as the chosen subset is well defined and structurally/semantically oriented (with its own RelaxNG schema and its own namespace), things can be kept under strict control while still taking advantage of all the tools and expertise out there in TEI-Land. I'm not sure how PGTEI fits into this equation, but it certainly merits study.

Of course, working with DP is important. And it is important that trained librarians, such as Alev, play a central role in the design and collection of metadata/catalog-info.
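To make that last point concrete, here is a minimal, hypothetical sketch (in Python, using the lxml library) of how a master file could be checked against a small RelaxNG schema defining a TEI-like subset under its own namespace. The namespace URI, element names, and sample document are placeholders of my own invention, not an actual PG-Canada, PGTEI, or TEI-Lite schema:

from lxml import etree

# A tiny RelaxNG schema describing a TEI-like subset under its own namespace.
# The namespace URI and element names are illustrative only; a real PG-Canada
# subset would be drawn from TEI/TEI-Lite and agreed upon first.
RELAXNG_SCHEMA = b"""
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://example.org/ns/pgcanada-tei">
  <start>
    <element name="TEI">
      <element name="teiHeader">
        <element name="title"><text/></element>
        <element name="source"><text/></element>
      </element>
      <element name="text">
        <oneOrMore>
          <element name="p"><text/></element>
        </oneOrMore>
      </element>
    </element>
  </start>
</grammar>
"""

# A candidate master file (normally read from disk); note that the source
# metadata stays with the text instead of being discarded.
SAMPLE_DOC = b"""
<TEI xmlns="http://example.org/ns/pgcanada-tei">
  <teiHeader>
    <title>An Example Title</title>
    <source>Scanned from a 1923 Toronto printing</source>
  </teiHeader>
  <text>
    <p>First paragraph of the transcribed text.</p>
  </text>
</TEI>
"""

relaxng = etree.RelaxNG(etree.fromstring(RELAXNG_SCHEMA))
doc = etree.fromstring(SAMPLE_DOC)

if relaxng.validate(doc):
    print("Document conforms to the subset.")
else:
    # error_log reports which elements broke the schema, and where.
    print(relaxng.error_log)

The point is not this particular toolchain, but that a strictly controlled subset plus a schema lets every submitted master file be checked automatically before it enters the collection.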
However, the alluring vision of a collection of consistently marked-up texts is tempting. Perhaps I lack the prior experience to really know how long it will take to get to that point.
Yes, it is tempting. And there's nothing wrong with going about things slowly and methodically, to debug the process and to build a strong and robust system (such as DP-Canada?) that will eventually speed things up. Here's how I see the process going, which differs in some ways from how PG-MS does it now:

1) Select and find the texts (books, periodicals, etc.) which are relevant to PG-Canada's interests (it is important that PG-Canada define what its focus will be).

2) Copyright clear them.

3) Scan these texts, collect the metadata/catalog-info, and place the page scans online. (Optionally, OCR can be run on these scans, and the raw, uncorrected OCR text can be used to provide a "temporary" full-text-search capability over the collection of page scans; a rough sketch of this appears at the end of this message.)

4) Start converting selected texts into XML, prioritizing them by various criteria (to be determined). Eventually this will be done via the next-generation DP, but at the start do it manually (maybe run the text through the current DP to remove scanning errors, and then mark it up afterwards).

It is clear that PG-Canada may build a big library of page scans while the production of XML texts from them lags behind. That's not a problem: copyright clearing and scanning are themselves very important, and the intermediate product, the page scans, is itself useful in the interim. PG-Canada can work with Brewster Kahle and the Internet Archive on its "Million Book Project" (to scan one million books and place the scans online). It might even be possible to acquire scans from Brewster, in addition to supplying scans to Brewster. Note that Brewster is now focusing his book-scanning efforts in, where else: CANADA!

Jon Noring
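P.S. To make the step-3 idea above concrete, here is a rough, hypothetical sketch (Python again, with made-up file names and data) of a crude "temporary" full-text index built from raw, uncorrected OCR. Nothing like this is an existing PG or DP tool:

import re
from collections import defaultdict

def build_index(ocr_pages):
    """Map each lowercased word to the set of page-scan ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in ocr_pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the page-scan ids containing every word of the query (a simple AND search)."""
    words = [w.lower() for w in query.split()]
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for w in words[1:]:
        hits &= index.get(w, set())
    return hits

# Raw OCR keyed by (hypothetical) page-scan file name.  OCR errors such as
# "Tor0nto" are tolerable here: a hit only needs to lead the reader to the
# right page image, which remains the authoritative record.
ocr_pages = {
    "book042_p001.png": "CHAPTER I. The expedition left Tor0nto in early spring.",
    "book042_p002.png": "the river was frozen, and the party waited for the thaw",
}

index = build_index(ocr_pages)
print(search(index, "river frozen"))   # prints {'book042_p002.png'}

Once a text has been properly converted to XML, this stopgap index for it can simply be thrown away.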