David Widger and I worked through PG#1 to about PG#5000, 2-3 years ago, reposting
many of them into the new structure. What's left (some hundreds of etexts)
is all over the place as far as standards (or the lack thereof) go. Footnotes
are done many different ways, italics are either absent or represented by
all-caps, publisher information is missing, there are no illustrations, line
lengths are too short or too long, page numbers are handled oddly, etc. Some
(maybe most) might be easier to redo from scratch than to try to fix.
Most etexts from #5000-#9999 were done by DP, weren't looked at by David and me,
and these days are only looked at (and maybe reposted into the new structure) if
there's an errata report. I can't say how many are text-only, and how many
are text+HTML. I reposted all the audiobooks out of the old structure into
the new one, a year or two ago, but there are still upwards of 4000 etexts left
in the old folder structure in the 5000-9999 range.
It'll be interesting to see who, if anyone, steps up to this project.
Al
So, if someone were to start "refactoring" old PG texts into TEI or RST and
working with a WWer to repost them ... is this a workable idea?
I'd love to see the PG corpus redone as a "master format" system (and the
current filesystem supports "old" format files in a subdirectory, so if
someone wanted to get the old original hand-made files, they could). I'm
not particularly wedded to any master format. Hell, if someone came up
with a sufficiently constrained HTML vocabulary that could be easily used to
"generate" the additional formats necessary, I'm good with that.
But before anyone will start doing this work, there needs to be a
consensus from PG (I'm looking at you, Greg!) that the work will be
acceptable. A half-assed "master format" system is no master format
system at all.
I'm even ok with working up the system as you go (i.e., start with
"simple" fiction works and make sure the system handles them before throwing
more and more complex works at it, tweaking and fixing in the time honored
method of "incremental development").
Maybe we start this process on a semi-private mirror of the PG corpus and
move it over only once it reaches a critical mass of some sort.
But an official notice that this project has some backing is necessary
or we'll just keep seeing everything running around in ten different
directions and nothing ever getting done.
Josh