
On Tue, Sep 18, 2012 at 02:22:07PM -0400, Alex Buie wrote:
This is _very_ cool, and would allow people to report problems with things or check in new revisions...
Yes, it seems to have potential, but is just an experiment - it is currently just a static copy of the collection.
What's the VCS behind it? Subversion?
TRAC lets you choose. It is Subversion currently. -- Greg
On Tue, Sep 18, 2012 at 12:41 PM, Greg Newby <gbnewby@pglaf.org> wrote:
On Fri, Sep 14, 2012 at 11:28:20PM +0200, Jeroen Hellingman wrote:
The core "trouble" of Project Gutenberg is its massive size. With over 40,000 ebooks, produced over several decades by countless volunteers, keeping to any standard will be hard. Even a large and well-funded organization would have trouble setting consistent standards for such a large and diverse collection of texts.
Jeroen,
I looked into TRAC, and actually got it to ingest the whole collection (it took a few days).
http://trac.readingroo.ms/gutenberg/
It does revision control, issue tracking, etc. Unfortunately I have not had time (and don't have sufficient expertise) to take it much further than that. If anyone is interested, I'd be happy to provide access.
Ultimately, part of my goal (which I expressed here earlier in the year) is exactly what you wrote about: better ability to crowdsource production and errata handling, and to more easily allow variations & derivative works. I wrote a fair amount about it then, so won't go into detail in this thread.
-- Greg
You can basically propose any master format, any quality standard, or whatever else for the Project Gutenberg collection, but unless you are willing and able to contribute significantly to your own proposals, it won't get you anywhere.
Over a period of about 15 years, I've contributed over 500 ebooks to Project Gutenberg, and even as a single individual, I have trouble maintaining that bulk. I have the luxury of working with a decent master format (TEI, which from a strictly technical point of view is the best choice, but suffers from an extremely steep learning curve, which makes it a bridge too far for most volunteers) and a tool-set tuned to my requirements (my tei2html scripts). Even so, I am reaching a point where just maintaining that sub-collection (fixing errors, improving tagging of early texts, etc.) is becoming a considerable task. (And that is while I can regenerate HTML and ePub files with a few keypresses, and have everything in a revision control system.)
What I would currently like to see most to move Project Gutenberg forward are (in order of priority):
1. A decent issue tracking system that can cope with the number of books we have, so readers can report possible issues with ebooks.
2. A revision control system (preferably distributed) that can handle the massive size of the PG collection, so we can keep track of what is going on.
3. An integrated production environment, a kind of PGDP 2.0, to help streamline the production of new "text-based" (as opposed to scans only) ebooks.
Only after those things are in place can we work towards a suitable master format, and a publishing pipeline that can produce the desired output formats, such as HTML, ePub, etc., from that.
I've been looking around for suitable tools to make this possible, but probably nothing that is currently available can handle this, so it will require a considerable effort.
Take, for example, the size of the collection. My own 500+ books are stored in a bazaar repository of about 4 gigabytes. Scaling that up a hundred times to hold PG's 40,000 books would result in about half a terabyte of controlled data, several orders of magnitude larger than the largest open source code projects I know of.
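The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes repository size grows linearly with the number of books, and the function name `estimated_repo_size_gb` is made up for illustration.

```python
# Rough linear extrapolation of repository size from a known sample.
# Assumption: size scales linearly with book count (history and binary
# assets could easily make the real figure larger or smaller).

def estimated_repo_size_gb(sample_books, sample_size_gb, target_books):
    """Scale a sample repository size up to a larger number of books."""
    return sample_size_gb * target_books / sample_books

# Figures from the message: about 500 books in about 4 GB.
for label, books in [("40,000 books", 40_000), ("100 x 500 books", 50_000)]:
    size = estimated_repo_size_gb(500, 4.0, books)
    print(f"{label}: about {size:.0f} GB")
```

Under that assumption the figure lands between roughly 320 GB (for exactly 40,000 books) and 400 GB (for a hundredfold scale-up), so "half a terabyte" is a reasonable round number.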
Jeroen Hellingman
On 2012-09-14 22:37, Steve White wrote:
But as I've observed in so many other projects and jobs, the problem ain't the hardware, and it ain't the software, and it ain't the firmware and it ain't the middleware. It's some other ware.
I have mostly looked at some of the older books in the collection (for no particular reason). There, the most egregious problem is that the book was transcribed into ASCII, which inevitably results in some degree of data loss--sometimes very unfortunate losses. Some human intervention would be necessary to repair that, but there are ways to reduce the amount of intervention required.
I'm cognizant of the necessity to keep a master copy, and build other formats from that. It's not clear to me what the preferred format of the master copy is, or should be.
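To illustrate why an ASCII transcription cannot simply be reversed by a script, here is a minimal Python sketch. It uses one possible ASCII reduction (Unicode decomposition followed by dropping non-ASCII characters); this is an assumption for illustration, not a description of how the older PG texts were actually produced, and `to_ascii` is a made-up helper name.

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Reduce text to ASCII: split accented letters into base letter plus
    combining mark, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Two different spellings collapse to the same ASCII string, so the
# original cannot be recovered from the ASCII text alone.
print(to_ascii("Brontë"), to_ascii("Bronte"))

# Characters with no ASCII base letter vanish entirely under this scheme.
print(to_ascii("Æsop"))
```

Because distinct inputs map to the same output, restoring the lost characters needs outside information (the scans, or a human reader), which is where the intervention mentioned above comes in.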
I am quite sure that very legible and attractive books could be automatically generated with a suitable base format. There are all sorts of technical possibilities -- again, this is not the problem.
Besides that, I think some direction and clarity could be helpful regarding modern formats. Example: it seems that many e-books are formatted to retain the ASCII look-and-feel, at least regarding the Gutenberg project text. This is always pointlessly ugly, and is often totally illegible. A little thought, and some examples, clearly documented somewhere, would go a long way.
Cheers!
Dr. Gregory B. Newby
Chief Executive and Director
Project Gutenberg Literary Archive Foundation www.gutenberg.org
A 501(c)(3) not-for-profit organization with EIN 64-6221541
gbnewby@pglaf.org
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d