
David Widger and I worked through PG#1 to about PG#5000, 2-3 years ago, reposting many of them into the new structure. What's left (some hundreds of etexts) is all over the place in terms of standards (or the lack thereof). Footnotes are done many different ways, italics are either absent or represented by all-caps, publisher information is missing, there are no illustrations, line lengths are too short or too long, page numbers are handled oddly, etc., etc., etc. Some (maybe most) might be easier to redo from scratch than to try to fix.

Most etexts from #5000-#9999 were done by DP, weren't looked at by David and me, and these days are only looked at (and maybe reposted into the new structure) if there's an errata report. I can't say how many are text-only and how many are text+HTML. I reposted all the audiobooks out of the old structure into the new one a year or two ago, but there are still upwards of 4000 etexts left in the old folder structure in the 5000-9999 range.

It'll be interesting to see who, if anyone, steps up to this project.

Al

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Joshua Hutchinson
Sent: Tuesday, January 24, 2012 2:08 PM
To: gutvol-d@lists.pglaf.org
Subject: Re: [gutvol-d] Producing epub ready HTML

So, if someone were to start "refactoring" old PG texts into TEI or RST and working with a WWer to repost them ... is this a workable idea?

I'd love to see the PG corpus redone as a "master format" system (and the current filesystem supports "old" format files in a subdirectory, so if someone wanted to get the old original hand-made files, they could).

I'm not particularly wedded to any master format. Hell, if someone came up with a sufficiently constrained HTML vocabulary that could be easily used to "generate" the additional formats necessary, I'm good with that.

But before anyone will start doing this work, there needs to be a consensus from PG (I'm looking at you, Greg!) that the work will be acceptable. A half-assed "master format" system is no master format system at all.

I'm even OK with working up the system as you go (i.e., start with "simple" fiction works and make sure the system handles them before throwing more and more complex works at it, tweaking and fixing in the time-honored method of "incremental development"). Maybe we start this process on a semi-private mirror of the PG corpus, and only when it reaches a critical mass of some sort does it get moved over.

But an official notice that this project has some backing is necessary, or we'll just keep seeing everything running around in ten different directions and nothing ever getting done.

Josh