
Maybe of interest to some ongoing discussions here: I have been maintaining all the work I do for PG/PGDP in my private Bazaar (bzr) repository. I have prepared almost 500 books and have a little over 100 in progress. For every book I prepare, I have a TEI master file, an HTML and a text version, and a few supplementary files, as well as the illustrations where applicable. This boils down to about 20,000 files in about 2100 folders, and almost 6 GB of disk space, including the history of all edits since January 2008 (about 3400 commits). This excludes storage for scans, which take up about 160 GB.

To scale this up to Project Gutenberg with (let's say) 40,000 books, we would have to multiply those figures by 80, i.e. a 480 GB repository and about 1.6 million files... I am not sure any of the popular source control systems can deal with that. I do not face serious problems with my 6 GB bzr repository -- commits and pulls are fast; the only exception is the Windows Explorer integration, which lags and keeps locks on files far too long.

I've been experimenting with a subset of my work (dictionaries) in a Google Code repository (https://code.google.com/p/phildict/). This works fine, but some other problems arise:

1. Current diff software is not really made for files with very long lines (I keep each paragraph on a single line for ease of processing); see the sketch in the PS below.

2. The on-line diffs do not handle files of more than roughly 100 kB; above that you simply see nothing. (And dictionary files typically grow to several MB apiece; an average book is about 500 kB.)

A radically different approach would be to treat every book as a separate project, and just have a forest of them...

Jeroen.
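
PS: One way around the long-line problem is to compare word by word rather than line by line. Below is a minimal sketch of that idea using Python's difflib; the helper name and the [-...-]/{+...+} markers are just for illustration, not part of my actual toolchain.

    import difflib

    def word_diff(old_line, new_line):
        """Compare two paragraph-long lines word by word, marking
        deletions as [-...-] and insertions as {+...+}."""
        old_words = old_line.split()
        new_words = new_line.split()
        matcher = difflib.SequenceMatcher(None, old_words, new_words)
        parts = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged run of words: copy it through as-is.
                parts.append(" ".join(old_words[i1:i2]))
            else:
                # Changed run: show what was removed and what was added.
                if i1 < i2:
                    parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
                if j1 < j2:
                    parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
        return " ".join(parts)

    if __name__ == "__main__":
        before = "The quick brown fox jumps over the lazy dog."
        after = "The quick red fox leaps over the lazy dog."
        print(word_diff(before, after))
        # The quick [-brown-] {+red+} fox [-jumps-] {+leaps+} over the lazy dog.

Something along these lines keeps a change to a single word readable, even when the whole paragraph sits on one line.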