[PGCanada] PG of Canada details

James Linden jlinden at projectgutenberg.ca
Thu Apr 15 18:51:54 PDT 2004

> I am excited about the potential of Project Gutenberg of Canada.
> But so far, I don't know much about it.

  Andrew, thanks for starting the discussion! I'm officially the "founder"
of PG-Canada, and I know only a little more than you at this point. That's
the idea behind this discussion list. :-)

> So I've put together a list of questions regarding various
> aspects of it, generally trying to see if there is anything
> we can learn by comparison with PG and PG-Au.

  I think we can learn a lot by comparing to PG and PGAU... mostly, what
_NOT_ to do.

> I believe it would be worth-while to consider these issues
> and have a consistent plan ready rather than just wait until
> a situation comes up and then deal with it on an ad hoc basis.

  I concur.

> Some of these topics have shown themselves to cause strong
> differences of opinion on the gutvol-d mailing list. I
> would request that we avoid flame wars about them here.

  Again, I concur. :o)

>   Filenames and Directory structure
> As the post-10,000 changeover at PG shows, this is worth
> taking the time to consider carefully. I like the way
> the current PG system puts all the files relating to
> one eBook in the same subdirectory. PG-au is using
> one subdirectory for each release year, with each filename
> including the year, ebook number in that year, and version.
> However, that will not scale upwards well if ebook
> production increases. Given a choice between these two,
> I personally prefer the post-10,000 PG method.

  Between the two options you stated, I much prefer PGAFS (Project Gutenberg
Archive File System) to the PGAU approach.

  To be honest, I'm hoping to come up with an even better idea, as I don't
really like PGAFS much either.

>   Will we have one basic, authoritative file format?
> From postings I've seen on the gutvol-d mailing list, I would
> hazard a guess that some form of pgxml will likely be proposed
> by James. If so, that would seem to me to indicate a likelihood
> of using UTF-8 as a default encoding. Another seemingly logical
> choice would be to use mostly ISO-Latin-1, and other parts of
> ISO 8859 when appropriate.

  Yes, we will have one authoritative file format - XML of some sort, and
probably UTF-16 encoding.

>   Do we restrict or encourage certain formats or encodings?
> At PG-US the preference is still to go with having a plain 7-bit
> ASCII version available whenever possible. At PG-AU, Col has said
> posting just an ISO-Latin-1 version is fine. My own preference is
> to go for consistency in whatever route is taken.

  ONLY the XML will be considered authoritative. All other formats will be
generated automatically from that XML master. This is _one_ concept that is
not really open to discussion. PG's biggest problem is the format of the
ebooks, a problem we will _not_ have.

  Other formats will be accepted, but only if they can be converted to the
XML format, preferably via automated means, but at worst, manually.
Naturally, we want to keep such submissions to a minimum, so it will be key
to provide tools to create the XML format (or at least a pseudo XML that we
can easily import) for people to use.
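
  As a rough illustration of the "one XML master, everything else generated"
idea, here is a minimal sketch that derives a plain-text edition from a
master file. The element names (book, title, chapter, heading, p) and the
function name to_plain_text are purely hypothetical; no schema has been
chosen yet, so treat this as a sketch under those assumptions only.

```python
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical master markup; the real PG-Canada schema is not yet defined.
MASTER = """<book>
  <title>Sample Title</title>
  <chapter>
    <heading>Chapter 1</heading>
    <p>First paragraph of the sample text.</p>
    <p>Second paragraph.</p>
  </chapter>
</book>"""


def to_plain_text(xml_source: str, width: int = 72) -> str:
    """Generate a plain-text edition from the (assumed) XML master."""
    root = ET.fromstring(xml_source)
    # Title in capitals, then each chapter heading and its paragraphs.
    parts = [root.findtext("title", default="").strip().upper()]
    for chapter in root.iter("chapter"):
        parts.append(chapter.findtext("heading", default="").strip())
        for para in chapter.iter("p"):
            # Collapse whitespace from the markup, then re-wrap the text.
            text = " ".join("".join(para.itertext()).split())
            parts.append(textwrap.fill(text, width=width))
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    print(to_plain_text(MASTER))
```

  Other derived formats (HTML and so on) would be further functions over the
same parsed master, which is the point of keeping a single source.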

>   How will we handle corrections?
> At PG-US, when there are a small number of corrections, the file is
> changed with just a "last updated" line to show it was changed. For a
> larger number of corrections a new file is posted to supersede the old
> one, and the old one is still kept in the archive. At PG-Au, a corrected
> eBook has "last updated" info added and the old file is deleted, with
> the new one taking its place.

   ALL changes to a text will be logged, and the proper versioning noted in
the file. We will make a backup of the master before updating it. Versions
will be in _standard_ notation: major.minor.revision

  In this context, each part of the version number means:

  Major: the first "release" will always be major version 1 (1.0.0).
Changing a file's primary structure (e.g., fixing a missing chapter or page)
will result in a major version change.

  Minor: a minor version change will be done when 10 or more (approx)
changes are done at once, so long as they are not changes as noted in the
"major" definition. The minor changes would include adjusting metadata,
fixing multiple typographic errors, etc.

  Revision: a revision change will be done when small corrections are made
such as fixing a broken paragraph or adjusting the indenting on a stanza of
poetry, etc.

  ALL corrections to a text will be done by automated systems (i.e., a web
interface), so versioning will be handled by the system as well.
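
  Since the system would assign version numbers itself, the rules above can
be sketched as a small function. The names (Version, bump, MINOR_THRESHOLD)
are hypothetical, and two things here are assumptions not stated above:
lower components reset to zero when a higher one changes, and the
minor/revision split is decided by change count alone (metadata adjustments
are not treated specially).

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    revision: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.revision}"


FIRST_RELEASE = Version(1, 0, 0)   # the first release is always 1.0.0
MINOR_THRESHOLD = 10               # "10 or more (approx)" changes at once


def bump(current: Version, structural: bool, change_count: int) -> Version:
    """Return the next version for a batch of corrections.

    structural   -- True if the primary structure changed (missing chapter, etc.)
    change_count -- number of corrections applied in this batch
    """
    if structural:
        # Assumption: minor and revision reset to 0 on a major change.
        return Version(current.major + 1, 0, 0)
    if change_count >= MINOR_THRESHOLD:
        # Assumption: revision resets to 0 on a minor change.
        return Version(current.major, current.minor + 1, 0)
    return Version(current.major, current.minor, current.revision + 1)


if __name__ == "__main__":
    v = bump(FIRST_RELEASE, structural=False, change_count=1)
    print(v)
```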

  Additionally, a 3 character language code will ALWAYS be appended to the
version. For example, 1.0.0-eng would mean the text is the first release in
English. This means that a text with the version of 1.0.0-spa would be a
direct Spanish translation of 1.0.0-eng.
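
  A minimal sketch of reading such version tags follows. The pattern assumes
exactly three lowercase letters after the hyphen (the examples look like
ISO 639-2 codes, but that is an assumption), and the function names
parse_version and is_translation_of are hypothetical.

```python
import re

# major.minor.revision-lang, e.g. "1.0.0-eng"
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-([a-z]{3})$")


def parse_version(tag: str):
    """Split a tag like '1.0.0-eng' into (major, minor, revision, lang)."""
    match = VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"not a valid version tag: {tag!r}")
    major, minor, revision, lang = match.groups()
    return int(major), int(minor), int(revision), lang


def is_translation_of(candidate: str, original: str) -> bool:
    """True if both tags share the numeric version but differ in language."""
    *cand_numbers, cand_lang = parse_version(candidate)
    *orig_numbers, orig_lang = parse_version(original)
    return cand_numbers == orig_numbers and cand_lang != orig_lang


if __name__ == "__main__":
    print(parse_version("1.0.0-eng"))
    print(is_translation_of("1.0.0-spa", "1.0.0-eng"))
```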

>   Selection criteria: exclusively Canadiana?
> As I understand PG-Ca is hoping to have funding (from a
> government source?) does that mean we will have a mandate
> to pursue just Canadiana?

  I think we should concentrate on Canadiana initially, because
it will be easier to get funding with such content, but I have no intention
of limiting it. That would be repository suicide, eh? Also, it would be nice
to get a definitive collection of digital Canadiana online -- a good
flagship project for a group called "PG of Canada". :-)

  Personally, I plan to work on locating and digitizing all of Nikola
Tesla's work (ranging from 1890s to 1930s) at some point, even though he was
Croatian / American. It's of personal interest to me, and said content is
not generally available already, at least not a lot of it. Another
"flagship" project that will, hopefully, put PG Canada "on the map".

>   Selection criteria: Overlap with PG and PG-au
> PG-au is likely to have some similar texts as their copyright laws
> are similar to ours. Would we want to have a policy of trying to
> include everything found there--or keeping a distinct collection
> with no duplication--or just letting things happen as they may?
> Assuming a focus on Canadiana, would we want to make an effort
> to include all Canadiana in PG in this collection?

  We will overlap existing work if ANY of the following conditions are NOT
met by the existing edition:

  1) Full book scan images must be available.
  2) _Legal_ copyright clearance assured and recorded.
  3) Content is properly structured and marked up.
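
  Spelled out as logic, the rule above reads as follows: a text already held
elsewhere gets redone only when the existing edition fails at least one of
the three conditions. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExistingEdition:
    """What we know about an edition already held by PG or PG-Au."""
    has_full_scans: bool        # 1) full book scan images available
    clearance_recorded: bool    # 2) legal copyright clearance on record
    properly_marked_up: bool    # 3) content structured and marked up


def should_redo(edition: ExistingEdition) -> bool:
    """Overlap (redo the text) if ANY condition is NOT met."""
    return not (
        edition.has_full_scans
        and edition.clearance_recorded
        and edition.properly_marked_up
    )


if __name__ == "__main__":
    print(should_redo(ExistingEdition(True, True, False)))
```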

>   What metadata stored in database
> I would love to have, right from the start, a clear expectation
> of what metadata we hope to record for each item, and as
> consistent as possible a way to record it.

  We will start with a fully automated system, including cataloging. The
catalog will exceed standard library catalog standards and will extend to
include any metadata we need to store. At minimum, metadata will be:

  Published Year(s)
  LOC call number (if available)
  LOC subject heading
  Dewey Decimal assignment (if available)
  ISBN (if available/applicable)
  Description (of original print media, e.g., hardcover, dimensions, etc.)

  If item is from a periodical the following will be added to above list:

  Publication Month(s)
  ISSN (if available/applicable)
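
  To make the minimum field list concrete, here is a sketch of one catalog
record. Only the fields listed above are included; the class name, field
names, types, and the idea of detecting a periodical from its publication
months are all assumptions, and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogRecord:
    # Always expected
    published_years: List[int]
    loc_subject_heading: str
    description: str                      # original print media details
    # "If available" fields
    loc_call_number: Optional[str] = None
    dewey_decimal: Optional[str] = None
    isbn: Optional[str] = None
    # Periodical-only additions
    publication_months: List[int] = field(default_factory=list)
    issn: Optional[str] = None

    @property
    def is_periodical(self) -> bool:
        # Assumption: an item with publication months is a periodical item.
        return bool(self.publication_months)


if __name__ == "__main__":
    record = CatalogRecord(
        published_years=[1908],
        loc_subject_heading="Placeholder subject",
        description="hardcover",
    )
    print(record.is_periodical)
```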

>   Will we have an official copyright clearance procedure?
> Just recently, I was thinking that one of the strong points
> of the Project Gutenberg collection is that it may be the
> largest collection of documented public domain material
> in existence. From what I understand, all copyright
> clearances are saved, so they can be referred back to
> if needed. In contrast, at PG-au, from what I've seen,
> you can just submit an eBook without getting formal
> clearance before-hand.

   Yes, we will have an official clearance system as part of the automated
system. Anyone know of a Canadian lawyer or ten who might be interested in
pro bono work for PG Canada?

>   Mirroring? Backup? Long-term stability.
> One reason I feel my time is well-spent as a PG volunteer
> is that the collection appears very permanent, and I feel
> that my contributions are not going to be lost over the years.
> I could not envision the PG archive disappearing as long as
> there is an internet. What plans would be ideal to encourage
> the same for PG-ca?

  I already have tentative agreements with two well-known, very permanent
places. I'll give out names when I have things more firmly worked out, but I
will say they are already familiar with PG.

  I'm also working on a couple of Canadian mirrors. Any other locations that
might be interested would be good - a project like this can never have too
many backups/mirrors.

   James Linden
   Founder, Project Gutenberg of Canada
   jlinden at projectgutenberg.ca
