
Maybe of interest to some ongoing discussions here: I have been maintaining all the work I do for PG/PGDP in my private Bazaar (bzr) repository. I have prepared almost 500 books and have a little over 100 in progress. For every book I prepare, I have a TEI master file, an HTML and a text version, and a few supplementary files, as well as the illustrations where applicable. This boils down to about 20,000 files in about 2100 folders, and almost 6 GB of disk space, including the history of all edits since January 2008 (about 3400 commits). This excludes storage for scans, which take up about 160 GB.

To scale this up to Project Gutenberg with (let's say) 40,000 books, we would have to multiply those figures by 80, i.e. a 480 GB repository and about 1.6 million files... I am not sure any of the popular source control systems can deal with that. I do not face serious problems with my 6 GB bzr repository -- commits and pulls are fast; the only exception is the Windows Explorer integration, which lags and keeps locks on files far too long.

I've been experimenting with a subset of my work (dictionaries) in a Google Code repository (https://code.google.com/p/phildict/). This works fine, but some other problems arise:

1. Current diff software is not really made for files with very long lines (I keep each paragraph on a single line for ease of processing); see the sketch in the PS below.

2. The on-line diffs do not handle files of more than roughly 100 kB; above that you simply see nothing. (And dictionary files typically grow to several MB apiece; an average book is about 500 kB.)

A radically different approach would be to treat every book as a separate project, and just have a forest of them...

Jeroen.
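
PS: One way around the long-line problem is to compare word by word rather than line by line. Below is a minimal sketch of that idea using Python's difflib; the helper name and the [-...-]/{+...+} markers are just for illustration, not part of my actual toolchain.

    import difflib

    def word_diff(old_line, new_line):
        """Compare two paragraph-long lines word by word, marking
        deletions as [-...-] and insertions as {+...+}."""
        old_words = old_line.split()
        new_words = new_line.split()
        matcher = difflib.SequenceMatcher(None, old_words, new_words)
        parts = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged run of words: copy it through as-is.
                parts.append(" ".join(old_words[i1:i2]))
            else:
                # Changed run: show what was removed and what was added.
                if i1 < i2:
                    parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
                if j1 < j2:
                    parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
        return " ".join(parts)

    if __name__ == "__main__":
        before = "The quick brown fox jumps over the lazy dog."
        after = "The quick red fox leaps over the lazy dog."
        print(word_diff(before, after))
        # The quick [-brown-] {+red+} fox [-jumps-] {+leaps+} over the lazy dog.

Something along these lines keeps a change to a single word readable, even when the whole paragraph sits on one line.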