
On Fri, December 9, 2011 1:26 pm, James Adcock wrote:
"Making our lives easier" would mean removing the requirement to submit separate hand-tooled lightweight markup in the first place!
I don't believe there /is/ any requirement "to submit separate hand-tooled lightweight markup." What there /is/ is a requirement that whatever you submit must include an impoverished text version alongside any other format. At some point in the past, Project Gutenberg's policy was that it would not be a repository for multiple variations of the same e-text. Because an impoverished text version was required, and no variations were allowed, the impoverished text version became the de facto master canonical version. I think it is obvious to the most casual observer that "lightweight markup languages" which have the same expressive power as a "heavy markup language" are in fact as complex as their counterparts, and are probably harder to work with. The sole benefit of a "lightweight markup language" is that the resulting document is presumably easier for a human to understand in its raw form (and consequently harder to process algorithmically). In the context of Project Gutenberg, the ambiguity and opacity of a "lightweight markup language" had the benefit that it could fool the whitewashers into believing that a marked-up text was nothing more than impoverished text, allowing it to become the master version of a particular book while retaining sufficient expressive power to be upgraded to something approaching useful. Given the (relatively) new openness at PG to accepting HTML files, the need to fool whitewashers is gone. If I were preparing a new book to be submitted to PG I'd do everything in HTML and make sure /that/ version was canonical. I'd then use some kind of automated conversion tool to strip the markup from the HTML, and submit /that/ as the text version along with the canonical HTML version. And because fixing errors in PG texts is so difficult, it's unlikely that the two files will ever get out of sync.
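To make the "automated conversion tool" idea concrete, here is a rough sketch in Python of stripping the markup from an HTML master to get a plain text version. It uses only the standard library's html.parser module. The names TextExtractor and html_to_text, and the choice of which tags count as paragraph breaks, are my own assumptions for illustration; this is not any tool PG actually uses, and a real conversion would also need to deal with line wrapping, tables, footnotes, and so on.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, dropping all markup."""

    # Hypothetical choice: tags treated as paragraph-level breaks.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote",
                  "h1", "h2", "h3", "h4", "h5", "h6"}
    # Tags whose content never belongs in the text version.
    SKIP_TAGS = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Line breaks in the HTML source are not meaningful; fold them.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Return the text of `html`, one blank line between blocks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).splitlines()
    blocks = [line.strip() for line in lines if line.strip()]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>Hello <i>world</i>,\n this &amp; that.</p>"
    print(html_to_text(sample))
```

The point of the sketch is only that the text version becomes a mechanical by-product of the HTML master: rerun the conversion whenever the HTML changes, and the two cannot drift apart.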
It would be a ton less effort to write a style guide for HTML, and/or to write a style tool to tag "features" of submitted HTML of which the transcriber is very proud but which are just very probably going to look crappy on one or more target devices.
Yes, and several of those style guides have been written in the past: see, e.g., http://www.hwg.org/opcenter/gutenberg/ and the Gutenberg wiki at http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ#H.4._What_are_the_PG_rules_.... Marcello Perathoner has also written such a guide, but I can't put my finger on a reference to it quickly. The problem with all these style guides is not that they are ineffective or technically incorrect (with some exceptions) but rather that the people who wrote them and those who consult them tend to approach the question of markup with religious fervor. They are the George W. Bushes of HTML markup: "no compromise, God is on my side." (This same religious fervor is not limited to HTML; there are some people who evangelize other, less useful, markup systems -- such as z.m.l. -- as well.) Presumably Project Gutenberg will continue Mr. Hart's commitment to anarchy; apart from "good" rules (e.g. Project Gutenberg must always have an impoverished text version), no rules will be imposed. (If you can't detect my sarcasm, you're not trying hard enough.) This means that if there is going to be any standardization at all in the format of the Project Gutenberg corpus it will require the /unanimous/ agreement of every volunteer, including those from the past who are no longer contributors. Project Gutenberg has no practical leaders; it is less well-organized than the Occupy Wall Street protesters. But you can lead by example. Pick someone's style guide; it doesn't matter whose, or whether or not you like it -- as long as its output can be converted to something you /do/ like, you're fine. Then follow that guide and encourage others to do the same. The principals at Project Gutenberg will offer no leadership -- we're on our own. If the masses can't agree on being limited to a few styles there is no hope.