Some humble suggestions... Re: [PGCanada] Re: PG-Canada / List of tasks to do

Thu Jan 27 18:10:17 PST 2005

Wallace wrote:
> Jon Noring wrote:

> A lot of books we won't be able to get choppable copies of, and in a 
> lot of cases, won't even need to: I think a key priority should be to 
> beg, bum, borrow, or steal microform scanning capability, and start 
> working our way through the CIHM back-catalogue, supplemented with 
> proofraiding of the main existing Canadian image-libraries (ECO, BNQ, 
> ourroots, etc.)

Certainly, other sources are welcome, so long as the quality of the
page images meets minimal standards (to be established by the project)
and they are fully unencumbered (when the works are public
domain) so the page images may freely be placed online.

DP, as an example, sometimes gets scans from institutions under
agreements which prevent DP from placing the scans publicly online.
I've voiced my displeasure at this. Except in the
most extraordinary circumstances, where it is otherwise impossible to
ever find an unencumbered scan of a particular work, and this is rare,
PGCan should avoid such restrictions as much as possible, and actively
work to acquire only unencumbered scans of public domain works. I'd
even refuse offers for encumbered scans of PD works, with the hope
that once PGCan gets big enough, it can go back to the institution and
get the scans with no encumbrances.

This brings up the new talking point as to the requirements for scans.
I believe the scans are just as important as the structured digital
text (SDT) to be produced from them, and should be made public,
referenced from the final SDT. This means the master scans need to
have a minimum resolution and color depth suitable for not only OCR
purposes, but also for scholarly-quality online reading (and other)
purposes.

My current view, subject to change, is that all master scans of B&W
pages should be done at 600 dpi (optical) and grey-scale. Current work
I'm doing on a project indicates that indeed 300 dpi is insufficient
for comfortable online viewing, especially for texts which have a lot
of fine print. If 300 dpi is chosen anyway, then the scans *must* be
grey-scale. But 600 dpi grey-scale is better (in a few rare
cases it may be wise to go even higher -- and of course color requires
full-color scans.) The downside, of course, is that the file sizes
for the scans are much larger. One could compress them using DjVu
(which is impressive), but I still believe the original scans should
be preserved in the original lossless form. Do not produce JPGs as
part of the scan acquisition process (IA is experimenting with using
digital cameras for scanning.)
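
To put rough numbers on the file-size downside, here is a minimal
sketch of the uncompressed-size arithmetic. The 6 x 9 inch page and
the function name are assumptions chosen for illustration, not
project figures; real master files (e.g. losslessly compressed TIFF
or PNG) will usually be smaller than these raw sizes.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a single page scan."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed page size in inches (hypothetical, for illustration only).
PAGE_W, PAGE_H = 6.0, 9.0

# (label, optical dpi, bits per pixel)
OPTIONS = [
    ("300 dpi grey-scale", 300, 8),
    ("600 dpi grey-scale", 600, 8),
    ("600 dpi full-color", 600, 24),
]

for label, dpi, bpp in OPTIONS:
    size = raw_scan_bytes(PAGE_W, PAGE_H, dpi, bpp)
    print(f"{label}: {size / 1e6:.1f} MB per page (uncompressed)")
```

Under these assumptions a page comes to roughly 4.9 MB at 300 dpi
grey-scale, 19.4 MB at 600 dpi grey-scale, and 58.3 MB at 600 dpi
full-color. Doubling the resolution quadruples the raw size.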

> We can further distribute that task through my "cells" idea, which 
> takes the "team" concept over at PGDP one step further.
>
> Cells would be groups which would work more closely together to collect 
> works in a geographical area (the Halifax cell), a given library (the 
> Acadia University cell) or a given field of interest (the Canadian 
> Incunabula cell; the LOTE cell; the Genealogy cell, etc.)

Good idea.

I'm definitely all for encouraging/catalyzing special-interest groups
to digitally scan works of interest to them. Such groups usually
bring in enthusiastic volunteers who will not only help to scan the
works, but will help in the proofing process to produce SDT.

Local historical societies, and genealogy groups (both by family
surname and by locality), are the notable groups which come to mind.
A startup project I'm working with, LibraryCity, has planned for a
while to mobilize these local special interest groups to digitize
their holdings and to get them online. LC plans to focus on the
usability and enhanceability of the final digital products. Blogs,
annotation, and collection interlinking are major features of the
LC focus.

> They would bootstrap themselves into existence, both on our site, and 
> through outside contacts, and make conscious efforts to assimilate 
> everything and anything that interests them and is clearable. Think LDS 
> genealogists meet the Borg.

Laugh. I live in Salt Lake City, and am an avid amateur genealogist.
I often do research in the Family History Library downtown. (I am NOT
LDS.)

>> The second, cataloging/copyright clearance, will take the scans which
>> have been done, and put together MARC (or equivalent) records for the
>> works (a lot of data can be taken from other libraries.) In addition,
>> the group can research the copyright status of the works, for which
>> the cataloging information is of course important. And
>> finally, this group can look over the scans to determine if any pages
>> are missing or badly scanned (a sort of QC function).

> Again, provisional publication of the scans could help accelerate and 
> distribute that process.

Certainly. Placing the scans online (which I assume is what you mean
by "publication") *requires* that cataloging records first be
generated for them, and that copyright clearance be done. It is my
belief that whoever accesses the PGCan repository of finished works
should be able to push a button and get the catalog record in the
format of interest to them, such as MARC-XML (Lars Aronsson talked
about using FRBR -- don't know much about that.)
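
As a rough illustration of the "push a button" idea, the sketch below
builds a bare-bones MARC-XML record using only Python's standard
library. It is not a complete or validated record: it omits the
leader and control fields, the function name is hypothetical, and the
sample values are placeholders rather than a real catalog entry. Only
the MARC21 "slim" namespace and the field tags 100 (main author), 245
(title) and 260 (publication date) come from the MARC standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Library of Congress MARC21 "slim" XML schema.
MARC_NS = "http://www.loc.gov/MARC21/slim"


def to_marcxml(title, author, year):
    """Return a minimal MARC-XML record as a string (illustrative only)."""
    ET.register_namespace("", MARC_NS)
    record = ET.Element(f"{{{MARC_NS}}}record")

    def datafield(tag, ind1, ind2, code, value):
        # One datafield holding a single subfield.
        field = ET.SubElement(
            record,
            f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": ind1, "ind2": ind2},
        )
        sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        sub.text = value

    datafield("100", "1", " ", "a", author)  # main entry, personal name
    datafield("245", "1", "0", "a", title)   # title statement
    datafield("260", " ", " ", "c", year)    # date of publication
    return ET.tostring(record, encoding="unicode")


# Placeholder values, not a real work.
print(to_marcxml("Example Title", "Author, Example", "1900"))
```

A real repository would more likely keep one internal record per work
and convert it on request to whichever format the reader asks for.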

Jon Noring