Re: [gutvol-d] procedures for contributions, forum for questions and ideas

18 Sep 2012


On Tue, Sep 18, 2012 at 02:22:07PM -0400, Alex Buie wrote:
...
This is _very_ cool, and would allow people to report problems with
things or check in new revisions...
Yes, it seems to have potential, but is just an experiment - it is currently
just a static copy of the collection.
...
What's the VCS behind it? Subversion?
TRAC lets you choose.  It is Subversion currently.
  -- Greg
...
On Tue, Sep 18, 2012 at 12:41 PM, Greg Newby <gbnewby@pglaf.org> wrote:
...
On Fri, Sep 14, 2012 at 11:28:20PM +0200, Jeroen Hellingman wrote:
...
The core "trouble" of Project Gutenberg is its massive size. With
over 40,000 ebooks, produced over several decades by countless
volunteers, keeping to any standard will be hard. Even a large
and well-funded organization would have trouble setting consistent
standards for such a large and diverse collection of texts.
Jeroen,
I looked into TRAC, and actually got it to ingest the whole
collection (it took a few days).
http://trac.readingroo.ms/gutenberg/
It does revision control, issue tracking, etc.  Unfortunately I have
not had time (and don't have sufficient expertise) to take it much
further than that.  If anyone is interested, I'd be happy to provide
access.
Ultimately, part of my goal (which I expressed here earlier in the
year) is exactly what you wrote about: better ability to crowdsource
production and errata handling, and to more easily allow variations &
derivative works.  I wrote a fair amount about it then, so won't go
into detail in this thread.
-- Greg
...
You can basically propose any master format, any quality standard, or
whatsoever, to do with the Project Gutenberg collection, but unless you
are willing and able to contribute significantly to your own proposals,
it won't get you anywhere.
Over a period of about 15 years, I've contributed over 500 ebooks to
Project Gutenberg, and even as a single individual, I have trouble
maintaining that bulk. Even though I have the luxury of working with a
decent master format (TEI, which from a strictly technical point of
view is the best choice, but suffers from an extremely steep learning
curve, which makes it a bridge too far for most volunteers) and a
tool-set tuned to my requirements (my tei2html scripts), I am reaching
the point where just maintaining that sub-collection (fixing errors,
improving tagging of early texts, etc.) is becoming a considerable
task. (And that is while I can regenerate HTML and ePub files with a
few keypresses, and have everything in a revision control system.)
What I would currently most like to see, to move Project Gutenberg
forward, is (in order of priority):
1. A decent issue tracking system that can cope with the number of books
we have, so readers can report possible issues with ebooks
2. A revision control system (preferably distributed) that can handle
the massive size of the PG collection, so we can keep track of what is
going on.
3. An integrated production environment, a kind of PGDP 2.0, to help
stream-line the production of new "text-based" (as opposed to scans
only) ebooks.
Only after those things are in place can we work towards a suitable
master format, and a publishing pipeline that can produce the desired
output formats, such as HTML, ePub, etc., from that.
I've been looking around for suitable tools to make this possible, but
probably nothing that is currently available can handle this, so it will
require a considerable effort.
Take, for example, the size of the collection. My own 500+ books are
stored in a Bazaar repository of about 4 gigabytes. Scaling that up
a hundred times to hold PG's 40,000 books will result in about half a
terabyte of controlled data; several orders of magnitude larger than
the largest open source code projects I know of.
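As a rough check of that arithmetic, here is a minimal Python sketch. It
uses only the figures quoted above (4 gigabytes, 500 books, 40,000 books);
the function name `extrapolate_gb` is hypothetical, and the linear scaling
assumes the rest of the collection resembles these 500 books, which may
not hold. A strict ratio gives a factor of 80 rather than 100, so the
estimate lands somewhere between about 320 and 400 gigabytes.

```python
# Back-of-envelope check, using only the figures quoted in the message above.
REPO_SIZE_GB = 4        # Bazaar repository holding the 500+ books
REPO_BOOKS = 500
PG_BOOKS = 40_000


def extrapolate_gb(repo_gb: float, repo_books: int, total_books: int) -> float:
    """Linear extrapolation; assumes the average book elsewhere in the
    collection needs as much space as the average book in this repository."""
    return repo_gb * total_books / repo_books


scale = PG_BOOKS / REPO_BOOKS                                   # 80.0
strict_gb = extrapolate_gb(REPO_SIZE_GB, REPO_BOOKS, PG_BOOKS)  # 320.0
rounded_gb = REPO_SIZE_GB * 100                                 # 400

print(f"scale factor: {scale:.0f}x")
print(f"strict estimate: {strict_gb:.0f} GB")
print(f"rounded 'hundred times' estimate: {rounded_gb} GB")
```

Either way the order of magnitude is the same: hundreds of gigabytes under
revision control.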
Jeroen Hellingman
On 2012-09-14 22:37, Steve White wrote:
...
But as I've observed in so many other projects and jobs,
the problem ain't the hardware, and it ain't the software, and it ain't
the firmware and it ain't the middleware.  It's some other ware.
I have mostly looked at some of the older books in the collection (for
no particular reason).
There, the most egregious problem is that the book was transcribed
into ASCII, which inevitably results in some degree of data
loss--sometimes very unfortunate losses.  Some human intervention would
be necessary to repair that, but there are ways to reduce the amount
of intervention required.
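To illustrate the kind of loss meant here, a minimal Python sketch. The
sample words and the helper `to_ascii` are hypothetical, chosen only for
illustration; the original transcriptions were done by hand, not by this
routine.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose accented letters, then drop everything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


for word in ["café", "naïve", "Æsop"]:
    print(f"{word!r} -> {to_ascii(word)!r}")
```

The first two words merely lose an accent, but the ligature in the third
has no ASCII decomposition and disappears entirely. Nothing in the ASCII
result says what was there, which is why repair needs a human (or the
original scans).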
I'm cognizant of the necessity to keep a master copy, and build other
formats from that.
It's not clear to me what the preferred format of the master copy is,
or should be.
I am quite sure that very legible and attractive books could be
automatically generated with a suitable base format.  There are all
sorts of technical possibilities -- again, this is not the problem.
Besides that, I think some direction and clarity could be helpful
regarding modern formats.
Example: it seems that many e-books are formatted to retain the ASCII
look-and-feel, at least regarding the Gutenberg project text.  This is
always pointlessly ugly, and is often totally illegible.  A little
thought, and some examples, clearly documented somewhere, would go a
long way.
Cheers!
Dr. Gregory B. Newby
Chief Executive and Director
Project Gutenberg Literary Archive Foundation www.gutenberg.org
A 501(c)(3) not-for-profit organization with EIN 64-6221541
gbnewby@pglaf.org