Re: [gutvol-d] Producing epub ready HTML

There is a golden opportunity here for someone to create an automated tool to assess the use/misuse of HTML which PG could use to screen submissions--see my earlier post with some simple tests written in Perl. I would be happy to host development of the tool under the Guiguts sourceforge project. Also badly needed is an update of gutcheck, the corresponding tool to evaluate .txt submissions--it has not been changed since 2005 and cannot even count the number of characters per line if the line contains Unicode. Provide me with your sourceforge ID and I will give you access rights. If it were up to me I would write these tools in Perl, not C (since Guiguts is in Perl), or it could be an enhancement to HTML Tidy, but my only request is that the errors/warnings generated look like "linenum:columnnum This line is wrong: <h1>Chapter</h2>" (this allows Guiguts users to double click on an error report to jump to the errors). Guiguts has a "Check All" button which runs W3C validation, W3C CSS check, Link Check, HTML Tidy, Image Check, and the rudimentary Epub Friendly check I sent earlier--perhaps these could be part of an HTMLGutcheck tool. I could bundle the tools with Guiguts so they would reach a broad audience even if PG does not immediately adopt them.
Hunter
Why aren't WWers sending back projects that include destructive layout tagging, or don't include important structural tagging? I can think of any number of reasons for rejection that are less disruptive to the reader's satisfaction.
Because we have automated checks for validity and good spelling. We don't have automated checks for (mis-) use of HTML for layout. If we had some sort of automated and relatively unambiguous checks for such things, I'm sure that many submitters would strive to comply. -- Greg
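As a rough illustration of the two requests above (a per-line character count that copes with Unicode, and reports in the "linenum:columnnum message" form), here is a minimal Python sketch. It is not gutcheck and not part of Guiguts; the function name long_lines and the default limit of 72 are assumptions made for the example. It counts code points in already-decoded text rather than bytes, which is the distinction that trips up a byte-oriented checker.

```python
def long_lines(text, limit=72):
    """Return 'linenum:columnnum message' reports for over-long lines.

    `text` must already be decoded (a Python str), so len() counts
    Unicode code points rather than bytes. `limit` is a hypothetical
    threshold chosen for this example.
    """
    reports = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        length = len(line)
        if length > limit:
            # Column is 1-based and points at the first character past the limit.
            reports.append(
                f"{lineno}:{limit + 1} Line is {length} characters long (limit {limit})"
            )
    return reports


if __name__ == "__main__":
    sample = "short line\n" + "é" * 80
    for report in long_lines(sample):
        print(report)
```

Combining accents still count as separate code points in this sketch, so a stricter tool might normalize the text first.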

Please feel encouraged! I'm not sure the coding is the hard part (especially given that the HTML will be known to be valid). The policy/rules might be the hard part. -- Greg
On Sun, Jan 22, 2012 at 10:53:36PM -0500, hmonroe.pglaf@huntermonroe.com wrote:
There is a golden opportunity here for someone to create an automated tool to assess the use/misuse of HTML which PG could use to screen submissions--see my earlier post with some simple tests written in Perl. I would be happy to host development of the tool under the Guiguts sourceforge project. Also badly needed is an update of gutcheck, the corresponding tool to evaluate .txt submissions--it has not been changed since 2005 and cannot even count the number of characters per line if the line contains Unicode. Provide me with your sourceforge ID and I will give you access rights. If it were up to me I would write these tools in Perl, not C (since Guiguts is in Perl), or it could be an enhancement to HTML Tidy, but my only request is that the errors/warnings generated look like "linenum:columnnum This line is wrong: <h1>Chapter</h2>" (this allows Guiguts users to double click on an error report to jump to the errors). Guiguts has a "Check All" button which runs W3C validation, W3C CSS check, Link Check, HTML Tidy, Image Check, and the rudimentary Epub Friendly check I sent earlier--perhaps these could be part of an HTMLGutcheck tool. I could bundle the tools with Guiguts so they would reach a broad audience even if PG does not immediately adopt them.
Hunter
Why aren't WWers sending back projects that include destructive layout tagging, or don't include important structural tagging? I can think of any number of reasons for rejection that are less disruptive to the reader's satisfaction.
Because we have automated checks for validity and good spelling. We don't have automated checks for (mis-) use of HTML for layout. If we had some sort of automated and relatively unambiguous checks for such things, I'm sure that many submitters would strive to comply.
-- Greg
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d

The policy/rules might be the hard part.
A first step might be to reconcile the current HTML FAQ to better align with the expectations that the WWers actually enforce. It would also be helpful to recognize that there is more than one actual workflow that volunteers use to create HTML, and the requisite "txt70." http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ

On Sun, January 22, 2012 8:53 pm, hmonroe.pglaf@huntermonroe.com wrote: [snip]
Why aren't WWers sending back projects that include destructive layout tagging, or don't include important structural tagging? I can think of any number of reasons for rejection that are less disruptive to the reader's satisfaction.
Because we have automated checks for validity and good spelling. We don't have automated checks for (mis-) use of HTML for layout. If we had some sort of automated and relatively unambiguous checks for such things, I'm sure that many submitters would strive to comply.
-- Greg
There is a golden opportunity here for someone to create an automated tool to assess the use/misuse of HTML which PG could use to screen submissions--see my earlier post with some simple tests written in Perl.
The problem here is two-fold:
1. PG has no standards for submission of HTML (other than the obvious one that it must be valid HTML), and no one can code to a non-existent standard. If PG had unambiguous published standards I'm sure that most submitters would strive to comply even /without/ automated checks.
2. The evaluation of a file for the existence of destructive layout tagging and the non-existence of structural tagging cannot be automated. (The first part of this statement is probably an exaggeration. I could probably easily write a tool that would check for "style" attributes, but Perl wouldn't be the best language for the job.) Tidy can check for well-formedness, and there are tools like Jing (http://www.thaiopensource.com/relaxng/jing.html) that can test for compliance with a schema, but I can't think of a single tool that can tell you that '<p align="center">Chapter One</p>' is wrong on so many levels.
The solution is also two-fold:
1. Develop a consensus HTML coding style for PG. Heck, it doesn't even need to be a consensus; a mandate from TPTB would serve just as well, but a consensus is more likely to be adopted.
2. Build a small set of individuals who are familiar with PG's HTML coding style and could review HTML submissions. For example, I'm familiar enough with the use of HTML for encoding e-books that I'll bet I could judge whether a file is acceptable in less than 10 minutes. You could call this group of examiners "white washers" for lack of a better term.
But without solution #1, nothing else is possible.
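To make the "check for style attributes" idea concrete, here is a minimal Python sketch built on the standard library's html.parser. The rule set (LAYOUT_ATTRS, LAYOUT_TAGS) and the names LayoutChecker and check_layout are assumptions invented for this example, not a PG standard; choosing the real rules is exactly the policy question raised above. Reports use the "linenum:columnnum message" form requested earlier in the thread.

```python
from html.parser import HTMLParser

# Hypothetical rule set: names treated as layout markup for this example only.
LAYOUT_ATTRS = {"style", "align", "bgcolor", "color", "face"}
LAYOUT_TAGS = {"font", "center"}


class LayoutChecker(HTMLParser):
    """Collect reports about presentational tags and attributes."""

    def __init__(self):
        super().__init__()
        self.reports = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives a 1-based line and a 0-based column offset.
        line, col = self.getpos()
        where = f"{line}:{col + 1}"
        if tag in LAYOUT_TAGS:
            self.reports.append(f"{where} Presentational element <{tag}>")
        for name, _value in attrs:
            if name in LAYOUT_ATTRS:
                self.reports.append(
                    f"{where} Presentational attribute {name!r} on <{tag}>"
                )


def check_layout(html):
    """Return a list of 'linenum:columnnum message' strings for `html`."""
    checker = LayoutChecker()
    checker.feed(html)
    checker.close()
    return checker.reports


if __name__ == "__main__":
    page = '<html>\n<body>\n<p align="center">Chapter One</p>\n</body></html>'
    for report in check_layout(page):
        print(report)
```

A check like this can flag the align attribute, but it cannot tell that the paragraph ought to have been a heading, which is the part of the problem that seems to need either a published standard or a human reviewer.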

Lee>1. PG has no standards for submission of HTML (other than the obvious one that it must be valid HTML), and no one can code to a non-existent standard.
The problem isn't that PG has no standards for HTML, the problem is that the WWers tell them to you after you have attempted to post to PG, and that those standards are different than given in the FAQ: http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ
Lee>1. Develop a consensus HTML coding style for PG. Heck, it doesn't even need to be a consensus, a mandate from TPTB would serve just as well, but a consensus is more likely to be adopted.
Just so long as PG follows the same set of rules for the PG tools which generate HTML, including the HTML that gets put into the generated "EPUB" and "MOBI" files.
Lee>2. Build a small set of individuals who are familiar with PG's HTML coding style and could review HTML submissions. For example, I'm familiar enough with the use of HTML for encoding e-books that I'll bet I could judge whether a file is acceptable in less than 10 minutes. You could call this group of examiners "white washers" for lack of a better term.
Set them first to take a good hard look at the "HTML" code being generated by the PG tool set, and have them fix that first. Secondly I would hope that the WWers wouldn't be accepting "HTML" based "on form," when that "HTML" produces books which are unreadable in practice.
Lead by example.

On Mon, January 23, 2012 11:42 am, Jim Adcock wrote:
The problem isn't that PG has no standards for HTML, the problem is that the WWers tell them to you after you have attempted to post to PG, and that those standards are different than given in the FAQ: http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ
Well, this degrades into a semantic argument. I would argue that unpublished requirements which depend on the whims of an individual are not standards at all; they are mere whims. The allegation that you have been given directions that are not published is troubling; I would suggest that you need to document these discrepancies, here if nowhere else.
Lee>1. Develop a consensus HTML coding style for PG. Heck, it doesn't even need to be a consensus, a mandate from TPTB would serve just as well, but a consensus is more likely to be adopted.
Just so long as PG follows the same set of rules for the PG tools which generate HTML, including the HTML that gets put into the generated "EPUB" and "MOBI" files.
This goes without saying. A standard is a standard is a standard. Indeed, most of the ... bizarre ... HTML that you have pointed out in the past is not an indictment of HTML files posted to PG, but a demonstration of the flawed tool set which produced it. I think it is important to distinguish between flawed HTML and flawed tools, because the solutions to the two problems are vastly different.
Lee>2. Build a small set of individuals who are familiar with PG's HTML coding style and could review HTML submissions. For example, I'm familiar enough with the use of HTML for encoding e-books that I'll bet I could judge whether a file is acceptable in less than 10 minutes. You could call this group of examiners "white washers" for lack of a better term.
Set them first to take a good hard look at the "HTML" code being generated by the PG tool set, and have them fix that first. Secondly I would hope that the WWers wouldn't be accepting "HTML" based "on form," when that "HTML" produces books which are unreadable in practice.
I disagree with your sequence. The incredibly messy HTML being generated complies completely with the PG standard, which involves pretty much just (1) moving inline CSS to an in-file <style> block and (2) making sure the HTML complies with the DTD. When your standards are that broad, just about anything passes the test. I think PG needs to back up and build a comprehensive set of standards that produce useful HTML, then ensure that the automated tools build HTML that satisfies the standard. Looking at the problems with the generated HTML may be a good place to start in developing the standards, but one can't "fix" the current tool set if there is no basis by which to judge the efficacy of the "fix."
Lead by example.
Okay. My example is to produce a good HTML file without regard to what PG wants, post it to the Internet Archive, then tell someone at PG where it is and that they can go get it if they want it.

Lee>My example is to produce a good HTML file without regard to what PG wants, post it to the Internet Archive, then tell someone at PG where it is and that they can go get it if they want it.
LOL, well I'm getting there, trust me!
participants (4)
- Greg Newby
- hmonroe.pglaf@huntermonroe.com
- Jim Adcock
- Lee Passey