blah blah blah feasibility blah

Bowerbird＠aol.com

5 Oct 2012

9:37 p.m.

don said:

...

So what are you suggesting? ... What is a constructive suggestion?

as per usual, for this _particular_ topic, we have the otherwise uncommon phenomenon where jim adcock is one of the only people making sense in the dialog.

so let me translate for you.

jim recommends that p.g. should _require_and_enforce_ submitted .html to be convertible to quality .mobi/.epub.

as _part_ of this, jim recommends that the converter-tool should be improved, since its current performance sucks.

jim suggests that an improved converter coupled with just a _little_ bit of bending on the part of the producer's whims will mean that _everyone_ will end up with a better outcome.

he would probably also suggest that the converter-tool be made more widely available, so that producers could use it to check the viability of their files _while_ working on them. (the current p.g. web-app tool is extremely clumsy to use.)

it is clearly _possible_ to implement all these suggestions, because the systems i've built have routinely included them.

so i would certainly agree with all of jim's suggestions.

indeed, i can't see how there could be much opposition.

oh, right, now i remember. you don't like the messenger.

***

as for the simple/complex dimension...

i'll ignore for a moment that the label has its problems...

i believe it is quite easy to make a concrete suggestion:

use the simple-basic workflow for the 99%+ of books in the library which can be finished with that approach.

portion out the remaining less-than-1% of the books to people who are in love with their complex methodology, and see who comes back with the best-working product. for each e-book, give the winning contender a gold-star sticker to put on their forehead, for the bragging rights.

***

...

1. Choose a book to digitize. 2. Collect and record metadata, and confirm Public Domain status. 2. Obtain images of all the pages. 3. Produce text with OCR. 4. Refine the text to match the OCR. 5. Locate and mark significant artifacts in the text. 6. Submit package of text(s) and images to PG. 7. PG review and post to catalog. 9. Select layout and format for desired device or application. 10. Select or create layout and format for text and artifacts. 11. Issue ebook per spec (possibly including production.) 12. Identify errors or other revisions to text. 13. Locate and describe revisions. 14. Submit revision to PG. 15. Back to step 7 to validate and apply revision.

well, first off, you might wanna check your numbering.

more importantly, can you please explain how this list sheds any light or understanding on contended issues?

it's reminiscent of your earlier call for "a feasibility study". we're about a decade past the time for such suggestions...

there are simple, _obvious_ problems right in front of us -- one example is a whitewasher who posted an "update" less than 6 weeks ago with headers tagged as paragraphs, and not with some "header" class, but bracket-p-bracket -- yet you want to fuzz this all up with a kindergarten outline?

is p.g. paying you to be a "consultant", or what? i don't get it.

-bowerbird

don kretz

6 Oct

7:02 a.m.

PG could be keeping itself busy forever digitizing some order of magnitude more ebooks than it does today and still not be catching up. The list is to describe the scope of what all would need to scale up if PG were to take the true magnitude of the job seriously.

Take a ridiculously low number - say a rate of 10 times more books than are being added now.

I can't even imagine trying to handle that with most of the processes in place now. But I think it's reasonable to imagine better processes that could scale that far. It's probably going to involve supporting multiple ways of doing the same thing to suit different work style preferences. It absolutely needs to be simpler.

I think it's less a matter of better tools at this point than deciding what the core requirements are in each of the items on that list and agreeing on what needs to be done in order to support the rest of the process. Almost none of them are specified with enough clarity to enable a first-timer to know what to do, how to do it, and most importantly when they've done it correctly. We spend way too much time on that point alone - person A thinks they've done a magnificent job and person B accuses them of producing junk in the pursuit of vanity. There's no objective standard so the animosity just grows and PG doesn't.

It's pointless to demand someone else do it better, or more correctly, or less dumbed-down or more portably, when the nature of the job to be done, much less the proper way to do it, is subjective to each of us.

On Fri, Oct 5, 2012 at 2:37 PM, <Bowerbird@aol.com> wrote:

...

_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Greg Newby

2:32 p.m.

Very well said, Don. This agrees with my sentiments exactly.

-- Greg

On Sat, Oct 06, 2012 at 12:02:44AM -0700, don kretz wrote:

...


James Adcock

8 Oct

5:34 a.m.

...

It's pointless to demand someone else do it better, or more correctly, or less dumbed-down or more portably, when the nature of the job to be done, much less the proper way to do it, is subjective to each of us.

By such an argument it is impossible to define any error -- except that WW'ers do so all the time -- and catch submitters off guard when they do so.

By this standard formatting errors are actually pretty easy to define, because most HTML looks pretty similar when run on one or another of the desktop browsers. When that same HTML then generates something radically different, "scrambled" and inferior on one of the other target platforms, what one sees then is pretty clearly a formatting bug, which usually can be fixed relatively easily, if one cares to do so. Of course, some people don't care to do so, presumably because they do not use one of these other platforms. If they did, then being asked to try to read "scrambled eggs" presumably *would* bug them.

So, again, the possibility exists to allow HTML to be fixed by people who care, without hurting the people who don't care, including the submitter - unless the reality is that the submitter is deliberately working to sabotage the HTML so that it will not run correctly on one or the other platforms, in which case fixing the HTML presumably *would* bug them. But given PG's charter, that would not seem to be an excuse not to fix the code.

don kretz

6:57 a.m.

What happened when you offered to help the submitters fix their texts, so they could learn how to avoid the problems in future?

On Sun, Oct 7, 2012 at 10:34 PM, James Adcock <jimad@msn.com> wrote:

...


James Adcock

7:21 a.m.

...

What happened when you offered to help the submitters fix their texts, so they could learn how to avoid the problems in future?

I have offered solutions to common problems on this forum, and on DP forums, and I have repeatedly offered suggestions about good books and authors who actually know something about the reality of writing HTML that "works" on portable ebook readers, and have been told to go stuff it.

I can say, basically, "KISS and it will work" and everybody comes back and says "No, no, I want to see how much complicated hackery I can put in it." Greg and PG have basically been saying "KISS" for many years - and everybody just ignores them.

Elizabeth Castro knows what she is talking about, but she tends to be Apple centric. Rufus Deuchler is good - because he actually tests what he preaches. The Amazon Kindle publishing guidelines are good - if only to warn one how screwed up Mobi7 is - but guess what, we get to play the ball where it lies. Joshua Tallent is good.

traverso＠posso.dm.unipi.it

7:45 a.m.

Jim, why don't you do instead of preaching? Offer to prepare HTML for post-processors that don't like preparing HTML, but can give you files marked with a reasonable extension of DP markup of your choice.

I can provide you work for hundreds of books (literally). Of course, I will check your homework before posting, and take full responsibility towards DP.

Carlo

James Adcock

3:24 p.m.

...

Jim, why don't you do instead of preaching? Offer to prepare HTML for post-processors that don't like preparing HTML, but can give you files marked with a reasonable extension of DP markup of your choice.

It would be much more beneficial to PG if I, and others, were allowed to choose which books we see as most needing of having their formatting fixed, rather than having Carlo dictate to us which books to fix. This is consistent with PG philosophy that volunteers work on the books they think are important, rather than being told what books to work on.

I would be happy if Greg, or others at PG made suggestions of which books most need their formatting fixed based on frequency of download. This is certainly easy to accomplish. Certainly Michael Hart used to make suggestions about what books are worth doing, and I was happy to do a number of books at his suggestion -- because I found when I explored the issue that he did have good taste in books -- the books he suggested were historically significant, were good reads, and were books that could be done in a reasonable amount of time and effort, including HTML formatting.

I personally am not interested in fixing the formatting on the next yet-another DP "Tom and Sally ride their Pony to the Moon." Nor am I interested in tackling someone's Italian Mathematical Treatise.

If trying to use epubmaker to reverse every act of foolishness is a fool's errand, then doubly so to think that I or anyone can fix every act of foolishness by hand -- if PG is not willing to tackle the problem of vetting the continuing new HTML being submitted in the first place.

PG could just implement their implied threat as a test, namely have the WW (temporarily) yank the CSS, and then see if a submission still made the remotest sense. I think that answer should be obvious.

traverso＠posso.dm.unipi.it

4:39 p.m.

...

"James" == James Adcock <jimad@msn.com> writes:

>> Jim, why don't you do instead of preaching? Offer to prepare HTML for post-processors that don't like preparing HTML, but can give you files marked with a reasonable extension of DP markup of your choice.

James> It would be much more beneficial to PG if I, and others, were allowed to choose which books we see as most needing of having their formatting fixed, rather than having Carlo dictate to us which books to fix. This is consistent with PG philosophy that volunteers work on the books they think are important, rather than being told what books to work on.

I don't ask you to fix the formatting; I ask you to show us how you would do "correct" HTML for books that are not yet in PG. And possibly show how to start a toolchain alternative to RST+epubmaker, possibly to install at a new DP site that I am currently testing.

James> I personally am not interested in fixing the formatting on the next yet-another DP "Tom and Sally ride their Pony to the Moon." Nor am I interested in tackling someone's Italian Mathematical Treatise.

I don't have any Tom and Sally, nor mathematical treatises in any language (I recently had a classical mathematical paper in English by C. L. Dodgson, aka Lewis Carroll, but this has been done), but I can propose to you, for example, a play of James Joyce, or a collection of stories of Bram Stoker, or a few books of Stendhal, George Sand, Giovanni Pascoli, Gabriele D'Annunzio, and even a few Shakespeare (edited by T. Bowdler) and several other important books. All waiting for HTML to be posted.

May I at least ask you for a list of your recent contributions to PG, so that we can learn from your examples?

Carlo

James Adcock

5:46 p.m.

...

May at least ask you a list of your recent contributions to PG, so that we can learn from your examples?

Examples include 32325, 28948, 38561, 35105, etc.

jeroen＠bohol.ph

12:12 p.m.

Please inspect one of my most recent submissions (http://www.gutenberg.org/ebooks/40867, in German, so the text won't distract), and tell me what I can improve here to guarantee better cross-browser functionality.

The HTML is generated from TEI (except for the PG standard header and footer), and I can adjust the tools that generated it at will... (http://code.google.com/p/tei2html/), so that will immediately improve all my following submissions as well.

Or, if you can't tell how to improve, just tell me what causes you problems, so I can figure something out to fix that.

Jeroen.

Quoting don kretz <dakretz@gmail.com>:

...

What happened when you offered to help the submitters fix their texts, so they could learn how to avoid the problems in future?

James Adcock

5:14 p.m.

...

Please inspect one of my most recent submissions (http://www.gutenberg.org/ebooks/40867, in German, so the text won't distract), and tell me what I can improve here to guaranty better cross-browser functionality. The HTML is generated from TEI (except for the PG standard header and footer), and I can adjust the tools that generated it at will... (http://code.google.com/p/tei2html/), so that will immediately improve all my following submissions as well.

Starting by looking at a desktop browser:

I try downsizing the default window, to see if there are any issues where some aspect of the design is "hardwired" to the assumption of display size and shape. I find immediately that the cover image isn't scaling nor centering correctly. As I downsize the margins appear to be excessive. This will prevent reading the HTML on small display devices which directly understand HTML. For example I open the HTML directly in my tablet, and the margins consume excessive screen real estate. Horizontal rules I would think were intended to be centered, but are showing up left-aligned. Chapter headings are left aligned, where I suspect in the original book and in common usage they should be centered. I see gesperrt, which I suspect will be troublesome on other platforms; we will see later. I also see use of margin page numbers, which will typically cause problems; again we will check this later.

Next I try opening the EPUB in ADE. EPUB comes close to implementing most of HTML, and ADE is a pretty robust implementation of EPUB (with a few quirks) so this should not be a hard test:

The PG boilerplate comes out "scrambled eggs" like always, but I don't see how one can blame a submitter for that which WW'ers provide "automagically." Rather, after how many years? maybe it's time for PG to get its act together. "Druckfehler und Berichtigungen" table doesn't work in EPUB. Gesperrt doesn't come through to epub, but you have italicized the entries anyway, so no great loss. Footnote numbering is showing up line-before the footnote. Some of the alignment in other tables is a bit off, but still for EPUB this looks like a good effort.

Now onto a harder test, let's see how you are doing in mobi7 and mobi8. I open the mobi in Kindle Previewer, since these checks can be performed quickly there:

Test mobi8 (which is close to being pure epub) by setting the previewer to Device="Kindle Fire". I see: Tables are scrambled. Footnote numbers are coming in again line-before. Epubmaker seems to have successfully killed your page numbers, which arguably is a good thing.

Test mobi7 (which is a harder test) by setting the previewer to Device="Kindle". I see: "Karten" section doesn't format correctly. Tables don't format correctly, don't scale, and "run off the right edge of the display." Paragraphs all run together without indent or spacing. Horizontal rules on this machine do center, which I think was what you were trying to do in the first place. Rules seem to be rendering unattractively long. Missing vertical white space after chapter headings. Also between chapter headings and subheadings. Also before these things.

Next, let's test how much your code relies on unportable aspects which are being masked by having epubmaker "fix" these problems for you. This is interesting because we want PG books to be "write once read everywhere" which means that your HTML should be portable directly, and should not require epubmaker "fix-ups."

We can test portability by directly opening the HTML in Kindle Previewer. This compiles the EPUB and the mobi7 and mobi8 directly, without giving epubmaker a chance to try to cover for any mistakes:

Mobi8 (epub, device="Kindle Fire"): The margins are WAY excessive, consuming most of the screen real estate. The cover image doesn't work, there is some kind of image format problem. TOC has a fixed vs. float formatting problem where "Seite" and page number "1" are trying to occupy the same place and are writing over each other. In general the TOC has fixed vs. float formatting problems causing overwrites. Tables don't work. Images inside tables don't work. The gesperrt came through correctly, which is a nice surprise.

Mobi7 (Kindle Klassic, Device="Kindle"): "Inhalt" renders multiple times. "Karten" runs off the edges of the display. Tables don't display correctly. No indenting or spacing between paragraphs. "[Inhalt]" keeps showing up "everywhere". Footnote numbers render unattractively and intrusively in body text. No vertical whitespace before or after chapter headings.

OK, now let's go back and take a look at the actual HTML code, and see if we can learn anything:

Hmm, the CSS runs to about 700 lines of code! Cover size is being "hardwired."

Out of curiosity, what happens if the 700 lines of CSS are removed? I recompile again directly using Kindle Previewer. The appearance doesn't change greatly, for better or for worse, but now at least "sensible" paragraph formatting is working both in mobi8 (epub) and mobi7.

Still, I want to be clear that this book shows fewer formatting errors than most PG books I look at.
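For anyone who wants to repeat the "remove the CSS and recompile" experiment on their own submission, here is a rough sketch of one way to produce the stripped file. It is not a tool anyone in this thread has described; the function name and the sample markup are made up for illustration, and the regular expressions are deliberately crude (a real HTML parser would be safer on unusual markup).

```python
import re

def strip_css(html: str) -> str:
    """Return html without <style> blocks, stylesheet links, or inline style attributes."""
    # Drop embedded stylesheets, including their contents.
    html = re.sub(r"<style\b[^>]*>.*?</style\s*>", "", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop <link> elements that point at external stylesheets.
    html = re.sub(r"<link\b[^>]*\brel\s*=\s*[\"']?stylesheet[\"']?[^>]*>", "", html,
                  flags=re.IGNORECASE)
    # Drop inline style="..." attributes.
    html = re.sub(r"\s+style\s*=\s*(\"[^\"]*\"|'[^']*')", "", html,
                  flags=re.IGNORECASE)
    return html

# Hypothetical sample input, not taken from any PG book.
sample = ('<html><head><style>p { margin: 0 20% }</style></head>'
          '<body><p style="text-indent:0">Text</p></body></html>')
print(strip_css(sample))
```

The stripped output can then be opened in a browser or fed to a converter to see whether headings, paragraphs, and tables still make sense from the markup alone.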

Jeroen Hellingman

6:52 p.m.

James,

Thanks for taking your time. Let me address some of your concerns in-between the lines.

On 2012-10-08 19:14, James Adcock wrote:

...

Starting by looking at a desktop browser:

I try downsizing the default window, to see if there are any issues where some aspect of the design is "hardwired" to the assumption of display size and shape.

The HTML is indeed designed to work well on reasonable modern monitors, without consideration for smaller devices.

...

I find immediately that the cover image isn't scaling nor centering correctly.

Scaling of pictures is not well supported in older browsers, nor possible without resort to non-HTML features such as javascript. Still, using modern browsers, I can scale without issues.

...

As I downsize the margins appear to be excessive. This will prevent reading the HTML on small display devices which directly understand HTML. For example I open the HTML directly in my tablet, and the margins consume excessive screen real estate.

The problem I try to address here is the landscape aspect ratio of most computer monitors, which leads to excessively long lines by default. Tablets and smartphones come closer to the traditional book aspect ratio, and thus won't need it.

...

Horizontal rules I would think were intended to be centered, but are showing up left-aligned.

The separators that split footnotes are not centered. They were not present in the original book, but as footnotes do not translate to a non-paged medium, I placed them as endnotes at the end of the chapter, and needed a separator to make them stand out. Note that asymmetric footnote separators can be seen in old books for footnotes continued on the next page in traditional typography.

...

Chapter headings are left aligned, where I suspect in the original book and in common usage they should be centered.

That is a matter of taste and convention. I normally use this stylesheet, but have others that allow centering. We've of course all read Jan Tschichold's Asymmetric Typography.

...

I see gesperrt, which I suspect will be troublesome on other platforms, we will see later. I also see use of margin page numbers, which will typically cause problems, again we will check this later.

Gesperrt is nasty, but very common in old Dutch and German books. (This book was originally in antiqua, i.e. normal roman font, but fraktur fonts didn't have italics at all.) A simple stylesheet change can fix it, but then I lose the contrast with italics. Older typography often goes over the top with use of letter-spacing, italics, and small caps.

...

Next I try opening the EPUB in ADE. EPUB comes close to implementing most of HTML, and ADE is a pretty robust implementation of EPUB (with a few quirks) so this should not be a hard test:

The ePub was made by epubmaker, for which I take no responsibility. I've now also uploaded my own ePub version to this location: http://www.gutenberg.ph/previews/blumentritt/EthnographiePhilippinen.epub, for some further review.

...

"Druckfehler und Berichtigungen" table doesn't work in EPUB.

The table uses cells spanning multiple columns. If ADE can't handle that, they should fix it.

...

Gesperrt doesn't come through to epub, but you have italicized the entries anyway, so no great loss.

This seems like an issue with the stylesheet falling back to default rendering.

...

Epubmaker seems to have successfully killed your page numbers, which arguably is a good thing.

In my own ePub I kill them as well.

...

Tables don't format correctly, don't scale, and "run off the right edge of the display."

For which it is sometimes hard to find a reasonable solution. Sometimes, original tables span two columns in small print. Compressing that to a small window device will require re-designing the entire table.

...

Next, let's test how much your code relies on unportable aspects which are being masked by having epubmaker "fix" these problems for you. This is interesting because we want PG books to be "write once read everywhere" which means that your HTML should be portable directly, and should not require epubmaker "fix-ups."

My master is TEI, not HTML, from the TEI I should be able to generate reasonable ePub. (I won't do Kindle directly, as that format is proprietary, and closely linked to the Amazon infrastructure)

...

We can test portability by directly opening the HTML in Kindle Previewer. This compiles the EPUB and the mobi7 and mobi8 directly, without giving epubmaker a chance to try to cover for any mistakes:

Mobi8 (epub, device="Kindle Fire"):

The margins are WAY excessive, consuming most of the screen real estate.

Agreed for ebook readers.

...

The cover image doesn't work, there is some kind of image format problem.

There is no standard way of dealing with cover images in ePub 2.0; 3.0 addresses this.

...

TOC has a fixed vs. float formatting problem where "Seite" and page number "1" are trying to occupy the same place and are writing over each other.

Seen the issue; I need to fix that by giving the line that "Seite" stands on some positive height. Will investigate.

...

In general the TOC has fixed vs. float formatting problems causing overwrites.

The TOC is formatted as a table using spans; the table implementation seems broken.

...

"[Inhalt]" keeps showing up "everywhere"

In HTML, which has no standard navigation aids (unlike some ePub readers), those help to give quick access. They are superfluous in ePub, and thus not present in my ePub version.

...

Footnote numbers render unattractively and intrusively in body text.

Here I have a usability issue: to be able to follow links to footnotes, they need to be large enough to tap upon, but that looks ugly (I haven't resolved this issue yet.)

...

No vertical whitespace before or after chapter headings.

OK, now let's go back and take a look at the actual HTML code, and see if we can learn anything:

Hmm, the CSS runs to about 700 lines of code!

Correct, I use a standard stylesheet for all my books, and I still have to code a tool that removes unused CSS code. That problem is surprisingly hard.
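To illustrate why the unused-CSS tool mentioned above is harder than it sounds, here is a minimal Python sketch of the naive approach: collect the class names used in the HTML and compare them with the class names that appear in CSS selectors. The function and class names are hypothetical and are not part of Jeroen's tooling; the sample HTML and CSS are made up.

```python
import re
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name used in class attributes of a document."""

    def __init__(self):
        super().__init__()
        self.used = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.used.update(value.split())


def unused_classes(html_text, css_text):
    """Return class names found in CSS selectors but never used in the HTML."""
    collector = ClassCollector()
    collector.feed(html_text)
    # Drop the declaration blocks so only selector text remains; otherwise a
    # value such as "1.5em" would be mistaken for a class called "5em".
    selectors_only = re.sub(r"\{[^}]*\}", " ", css_text)
    declared = set(re.findall(r"\.([A-Za-z_][\w-]*)", selectors_only))
    return declared - collector.used


html_text = '<p class="first">Text <span class="sc gesperrt">spaced</span></p>'
css_text = (
    ".first { text-indent: 0 } "
    ".sc { font-variant: small-caps } "
    ".marginnote { float: right; margin: 1.5em }"
)
print(sorted(unused_classes(html_text, css_text)))
```

This only handles flat rules keyed on class names. Element and descendant selectors, nested `@media` blocks, and rules that apply only in combination are not covered, which is presumably where the real difficulty lies.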

...

Cover size is being "hardwired."

All image sizes are specified in pixels to match the actual images. This works great in the original HTML as it allows the browser to set the formatting before all images are downloaded.

...

Out of curiosity, what happens if the 700 lines of CSS are removed? I recompile again directly using Kindle Previewer. The appearance doesn't change greatly, for better or for worse, but now at least "sensible" paragraph formatting is working both in mobi8 (epub) and mobi7.

Still, I want to be clear that this book shows fewer formatting errors than most PG books I look at.

I am glad to hear that. Jeroen.

James Adcock

9 Oct

12:36 p.m.

OK, but this demonstrates, once again, the on-going conflict (with Jeroen's being one common point of view).

Jeroen says basically "I want to write HTML for large desktop computer displays, not HTML for EPUB or MOBI on small displays. In fact, *I* want to pick and choose which customers PG gets to support with my efforts. By doing this I can format more conveniently and attractively by optimizing for the large desktop computer displays."

PG says (explicitly and implicitly, via their postings) "We want to support not just HTML on large desktop computer displays, but also the small ebook reader devices, including both the EPUB and the MOBI file formats. We want to support ALL customers, not just some of them."

These are explicitly mutually exclusive goals, and when PG tries to accommodate both goals at once, then epubmaker gets sent on a fool's errand, and the PG end customer sees formatting bugs in the EPUB and MOBI file formats which prevent the PG books from being read in a reasonable manner on those small machines.

And it ends up making PG look foolish, and further, then others simply take the PG code, strip out most or all of that "gratuitous" HTML formatting optimized for large displays, retaining (if you are lucky) <i> and <b>, and redistribute that reduced formatting effort from other locations having removed the PG name.

PG Customers want to read in EPUB and MOBI. HTML which is intentionally -- as Jeroen's example describes -- written for large desktop displays will not run correctly on EPUB and MOBI devices. The current PG approach does not resolve this dilemma. PG says formatters are supposed to be submitting HTML that is "write once read everywhere," but in practice PG doesn't enforce that requirement.

There are many ways to resolve this problem.

One would be to actually enforce the requirement that the HTML be "write once read everywhere."

One would be to allow volunteers to directly submit EPUB, and not require that the EPUB and MOBI be derived from HTML.

One would be to allow volunteers to submit two versions of HTML, one optimized for the big desktop displays, and another for the small EPUB and MOBI displays.

Another would be to allow submitters at submission time to say "This HTML is designed only to work on large desktop displays." Then either you take an approach where PG simply doesn't try to generate the EPUB or the MOBI, or you take the approach where PG ignores the HTML and tries instead to generate EPUB and MOBI directly from the submitted txt files. But then in either case, I think you will find the problem again that all that happens is that other volunteers step up to the bat, rework the HTML to work on the small EPUB and MOBI displays, take out the PG name, and distribute the book via other forums. And then PG customers are left searching around the web to find a copy of the "PG" book which actually works on their EPUB or MOBI devices.

In which case, again, why not simply allow these "fixed" versions for EPUB and MOBI be distributed via PG in the first place ???

jeroen＠bohol.ph

12:58 p.m.

Quoting James Adcock <jimad@msn.com>:

...

OK, but this demonstrates, once again, the on-going conflict (with Jeroen's being one common point of view)

Jeroen says basically "I want to write HTML for large desktop computer displays, not HTML for EPUB or MOBI on small displays. In fact, *I* want to pick and choose which customers PG gets to support with my efforts. By doing this I can format more conveniently and attractively by optimizing for the large desktop computer displays."

I think, in this you misrepresent what I said. I prepare my HTML for what was historically the only target display: a (more-or-less) large landscape monitor connected to a desktop computer. Back in 2009, I added an alternative to that: ePub aimed at small portrait screens on mobile devices. I upgraded my tooling, and now generate both. PG still only accepts the HTML, and thinks it can generate the ePub itself from that HTML. (I gave a link to one of my own ePubs for review in a previous post in response to your review. I hope you will still have a look at it to indicate problems with it.) Yes, I want to support all readers, but don't believe the current process is good enough for it. If you produce one HTML to serve both desktop and mobile devices, you will do a disservice to both. They are just too far apart to have something that really works nice. Each platform needs mutually exclusive tweaks. epubmaker is just not good enough at this yet. (May I remind you of the discussions around Apps made for 10", 7" and 4" devices.)

...

And it ends up making PG look foolish, and further, then others simply take the PG code, strip out most or all of that "gratuitous" HTML formatting optimized for large displays, retaining (if you are lucky) <i> and <b>, and redistribute that reduced formatting effort from other locations having removed the PG name.

Most PG copycats don't bother at all with the HTML versions, and start with the text versions directly....

...

PG Customers want to read in EPUB and MOBI. HTML which is intentionally -- as Jeroen's example describes -- written for large desktop displays will not run correctly on EPUB and MOBI devices. The current PG approach does not resolve this dilemma. PG says formatters are supposed to be submitting HTML that is "write once read everywhere," but in practice PG doesn't enforce that requirement.

There are many ways to resolve this problem.

One would be to actually enforce the requirement that the HTML be "write once read everywhere."

One would be to allow volunteers to directly submit EPUB, and not require that the EPUB and MOBI be derived from HTML.

Good as a stop-gap measure, I do have such ePubs for all my submissions waiting for the great day. (But am aware of the maintenance issues this introduces, the main reason why it is not in place.)

...

One would be to allow volunteers to submit two versions of HTML, one optimized for the big desktop displays, and another for the small EPUB and MOBI displays.

Would not resolve the maintenance issue, and would decrease the quality of the generated ePubs against what is possible in the first option.

...

Another would be to allow submitters at submission time to say "This HTML is designed only to work on large desktop displays." Then either you take an approach where PG simply doesn't try to generate the EPUB or the MOBI, or you take the approach where PG ignores the HTML and tries instead to generate EPUB and MOBI directly from the submitted txt files. But then in either case, I think you will find the problem again that all that happens is that other volunteers step up to the bat, rework the HTML to work on the small EPUB and MOBI displays, take out the PG name, and distribute the book via other forums. And then PG customers are left searching around the web to find a copy of the "PG" book which actually works on their EPUB or MOBI devices.

This issue is basically a licensing issue. Since PG requires the complete removal of its name, we lose a lot of credit here. I would require the opposite (where legally possible given that most of our work isn't covered by copyright), and use the CC-BY license instead.

...

In which case, again, why not simply allow these "fixed" versions for EPUB and MOBI be distributed via PG in the first place ???

As indicated, the maintenance nightmare of having to fix issues in multiple versions of the same text. Jeroen.

James Adcock

2:03 p.m.

...

(I gave a link to one of my own ePubs for review in a previous post in response to your review. I hope you will still have a look at it to indicate problems with it.)

Sorry, I can't find the link, can you please send it again.

...

Yes, I want to support all readers, but don't believe the current process is good enough for it. If you produce one HTML to serve both desktop and mobile devices, you will do a disservice to both. They are just too far apart to have something that really works nice. Each platform needs mutually exclusive tweaks. epubmaker is just not good enough at this yet. (May I remind you of the discussions around Apps made for 10", 7" and 4" devices.)

If it's just "tweaks" then perhaps a conditional compilation approach such as @media is the way to go [but NOT PG's "@media handheld" which is an exactly backwards anti-standard.]
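A build-time variant of the same idea, sketched in Python rather than in CSS `@media` itself: keep one shared base stylesheet and apply a small set of per-target overrides when generating each output. The rule values and the target names ("desktop", "ebook") are illustrative assumptions, not anything PG or epubmaker actually uses.

```python
# Shared rules used for every output format (illustrative values only).
BASE_RULES = {
    "body": {"margin-left": "10%", "margin-right": "10%"},
    "p": {"text-indent": "1em", "margin": "0"},
}

# Hypothetical per-target tweaks layered on top of the base rules.
TARGET_TWEAKS = {
    "desktop": {},
    "ebook": {"body": {"margin-left": "0", "margin-right": "0"}},
}


def build_stylesheet(target):
    """Merge the base rules with the overrides for one target and emit CSS."""
    merged = {selector: dict(decls) for selector, decls in BASE_RULES.items()}
    for selector, decls in TARGET_TWEAKS[target].items():
        merged.setdefault(selector, {}).update(decls)
    lines = []
    for selector, decls in merged.items():
        body = " ".join(f"{prop}: {value};" for prop, value in decls.items())
        lines.append(f"{selector} {{ {body} }}")
    return "\n".join(lines)


print(build_stylesheet("ebook"))
```

The design choice is that the book's markup stays identical for every target; only the generated stylesheet differs, so a fix to the text never has to be made twice.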

...

...
And it ends up making PG look foolish, and further, then others simply take the PG code, strip out most or all of that "gratuitous" HTML formatting optimized for large displays, retaining (if you are lucky) <i> and <b>, and redistribute that reduced formatting effort from other locations having removed the PG name.

...

Most PG copycats don't bother at all with the HTML versions, and start with the text versions directly....

...

So again, because some submitters submit HTML which only works on large machines, what happens is that the submitters *who do* write code which works on all machines have THEIR "beautiful and sincere" efforts thrown away because some submitters selfishly only want to support some machines, and most ebook readers never see these good efforts, because the "copycat distributors" have to now assume the least-common-denominator "txt" format. Where "copycat" includes Amazon, and Apple, and feedbooks, and ... I don't think PG's position is that we oppose "copycats" -- we are trying to get these books read and distributed as widely as possible. What we would like is to have the PG name retained so that PG can at least get credit for the good work, and the very hard work, being done -- and that only happens if the formatting actually works. If the formatting doesn't work, then the PG name and credit WILL be removed.

...

Good as a stop-gap measure, I do have such ePubs for all my submissions waiting for the great day. (But am aware of the maintenance issues this introduces, the main reason why it is not in place.)

The "maintenance issues" is not a reasonable argument, given that PG is happy to send out books year after year which are seriously broken, without fixing them.

Jeroen Hellingman

5:49 p.m.

On 2012-10-09 16:03, James Adcock wrote:

...

Sorry, I can't find the link, can you please send it again.

http://www.gutenberg.ph/previews/blumentritt/EthnographiePhilippinen.epub

...

...
Yes, I want to support all readers, but don't believe the current process is good enough for it. If you produce one HTML to serve both desktop and mobile devices, you will do a disservice to both. They are just too far apart to have something that really works nice. Each platform needs mutually exclusive tweaks. epubmaker is just not good enough at this yet. (May I remind you of the discussions around Apps made for 10", 7" and 4" devices.)

If it's just "tweaks" then perhaps a conditional compilation approach such as @media is the way to go [but NOT PG's "@media handheld" which is an exactly backwards anti-standard.]

That might work, I think you can get quite some results by tweaking CSS alone (which is what I do mainly). Once it boils down to a need to redesign tables, it gets more complicated. Note that of course the ePub generation also requires some metadata, and things as OPF, NCX, and other stuff to be generated, and places some limitations on the XHTML being used.
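As a rough illustration of the extra files mentioned above, the following Python sketch packs pre-generated OPF, NCX and XHTML strings into an EPUB-style zip archive. The function name, the `OEBPS/` directory and the file names are assumptions chosen for the example, and the placeholder strings passed in at the end are not valid OPF or NCX, so the result is a container skeleton rather than a readable book.

```python
import io
import zipfile

# Tells the reading system where to find the package (OPF) file.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub_skeleton(opf_xml, ncx_xml, chapters):
    """Pack OPF, NCX and XHTML strings into EPUB-style zip bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("OEBPS/content.opf", opf_xml)
        archive.writestr("OEBPS/toc.ncx", ncx_xml)
        for name, xhtml in chapters.items():
            archive.writestr(f"OEBPS/{name}", xhtml)
    return buffer.getvalue()


# Placeholder content only; a real book needs a full OPF manifest and NCX.
epub_bytes = build_epub_skeleton("<package/>", "<ncx/>", {"ch1.xhtml": "<html/>"})
with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
    print(archive.namelist())
```

Generating the OPF manifest and the NCX navigation from the source text is the part that actually takes work, which is consistent with the point that CSS tweaks alone do not produce an ePub.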

...

So again, because some submitters submit HTML which only works on large machines, what happens is that the submitters *who do* write code which works on all machines have THEIR "beautiful and sincere" efforts thrown away because some submitters selfishly only want to support some machines, and most ebook readers never see these good efforts, because the "copycat distributors" have to now assume the least-common-denominator "txt" format. Where "copycat" includes Amazon, and Apple, and feedbooks, and ... I don't think PG's position is that we oppose "copycats" -- we are trying to get these books read and distributed as widely as possible. What we would like is to have the PG name retained so that PG can at least get credit for the good work, and the very hard work, being done -- and that only happens if the formatting actually works. If the formatting doesn't work, then the PG name and credit WILL be removed.

I agree fully with the second half of this paragraph. Jeroen.

James Adcock

10 Oct

1:11 p.m.

...

Once it boils down to a need to redesign tables, it gets more complicated.

I certainly agree that how to support tables on small devices -- many of which do not support horizontal scrolling -- is difficult. But this is not the major issue of ebook reader support that I am seeing problems with. The most common problem, frankly, is paragraph formatting in mobi7 -- a problem which ought to be "duck soup" for PG to solve -- if "PG" cared to solve the problem. Of course, one option would be for PG simply to declare "We are no longer supporting mobi7, only mobi8" and then by fiat the problem becomes much easier. But that would leave many customers in the lurch.

James Adcock

2:13 p.m.

...

http://www.gutenberg.ph/previews/blumentritt/EthnographiePhilippinen.epub,

OK, well to start with you are hardwiring about an additional 2em margins on each side of the "body", which may not sound like much, but it "throws away" about 10% of my display space, and in turn leaves so few words per line as to make the justification spacing between words look ugly. My devices all allow me to increase margin sizes if I want more margin. They do not allow me to remove margins which you have hardwired in for me. I, like most ebook users, would rather that you left the control of margins, fonts, font sizes, etc, up to me and my device, rather than taking away these decisions from me.

Again, my ebook reader comes with an external design margin of about 1/2 inch, plus they have designed into their display software about 1/4 inch margin each side, plus you are adding about yet another 1/4 inch "body" margin each side, which now means that my device, which is 4.5 inches wide, only has about 2.5 inches left of usable display width -- i.e. almost half the potential display width has been wasted -- after you have added your additional margin to the additional margin already designed into the software of the machine. And again this breaks justification, making everything look ugly and pretty much unreadable. Again: do not add body margins, it breaks books on real-world ebook readers.

"Druckfehler" table doesn't format sensibly.

Footnote markings in body text are larger than in the actual footnotes themselves, probably better if they match in size.

Again, we disagree about using K&R style chapter and section headings on books. I would say: K&R style on computer manuals, Book style on Books.

Excess spacing between chapter and subchapter headings.

Chapter and subchapter headings appear to be "justified" which clearly doesn't work.

In Mobi7 (in addition to the previous problems):

Title page formatting problems

Paragraph demarcation becomes even more difficult to discern. I would suggest for texts like this which have extremely long paragraphs go to the "zero indent, 1em space between" style of paragraphs as being much easier to read on ebook readers. And again, paragraph demarcation is also difficult to understand because your margin choices are breaking the justification algorithm [which then switches over temporarily to "ragged right"], which also makes it more difficult to discern paragraph boundaries -- since the eye also tends to catch "ragged right" within "justified" text as being a paragraph boundary.

Footnote markers aren't rendering as attractively as they did on epub devices, not sure what you are doing.

Mobi7 doesn't have the margins problem -- since it ignores body margins -- but the word justification routine still breaks down because of so many long German words. Suggest perhaps switch to unjustified ?

Spacing between Chapter and subchapter headings now collapse completely. I.e. the vertical whitespace associated with chapter and subchapter headers disappeared. Did you implement them using top and bottom margins?

The bracket labels on the vertical curly brackets in tables which you implemented using images in the tables -- those bracket labels aren't positioning correctly with respect to the bracket images. (In general positioning of text with respect to images or vice versa is an ill-defined problem in any flavor of html.)

But again, this work is much better than most I see published by PG.
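The margin arithmetic in the message above can be checked with a few lines of Python. This sketch assumes that the 1/2 inch "external design margin" applies on each side, like the other two margins; that reading is what makes the figures add up to the quoted 2.5 inches.

```python
def usable_width(screen_in, side_margins_in):
    """Width left for text after subtracting every margin on both sides."""
    return screen_in - 2 * sum(side_margins_in)


# Per-side margins in inches, taken from the figures in the message.
margins = {
    "device bezel": 0.5,
    "reader software": 0.25,
    "stylesheet body margin": 0.25,
}

screen = 4.5
remaining = usable_width(screen, margins.values())
lost_fraction = 1 - remaining / screen
print(f"{remaining:.2f} in usable, {lost_fraction:.0%} of the width lost")
```

Under that assumption the stylesheet's own body margin accounts for half an inch of the two inches lost, and it is the only part of the loss the reader cannot adjust.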

Jeroen Hellingman

8:48 p.m.

Thanks for taking the trouble once again to look at my product. I know my ePubs are far from perfect, as, since I am unable to submit them to PG, I lack some motivation to improve them. Also, I've only been able to test them on PC based readers, which are in general far more forgiving than the small devices they are meant to go on. I normally use Calibre to read, now have also installed nook for PC and ADE. I have regenerated the file with some fixes you suggest, and re-uploaded to the same location. On 2012-10-10 16:13, James Adcock wrote:

...

...
http://www.gutenberg.ph/previews/blumentritt/EthnographiePhilippinen.epub,

OK, well to start with you are hardwiring about an additional 2em margins on each side of the "body", which may not sound like much, but it "throws away" about 10% of my display space, and in turn leaves so few words per line as to make the justification spacing between words look ugly. My devices all allow me to increase margin sizes if I want more margin. They do not allow me to remove margins which you have hardwired in for me. I, like most ebook users, would rather that you left the control of margins, fonts, font sizes, etc, up to me and my device, rather than taking away these decisions from me.

I've now removed the margin completely from the stylesheet. A few (exceptional) books I have use marginal notes; for those I can re-introduce it (after some testing).

...

"Druckfehler" table doesn't format sensibly.

Doesn't reproduce for me on any of the PC based readers. Probably has to do with the width of the table.

...

Footnote markings in body text are larger than in the actual footnotes themselves, probably better if they match in size.

Footnotes are in their entirety made slightly smaller, including the numbers. I can set them in the same size, but want to keep the distinction.

...

Again, we disagree about using K&R style chapter and section headings on books. I would say: K&R style on computer manuals, Book style on Books.

A simple matter of switching stylesheets. I don't know what you mean by K&R style (Kernighan and Ritchie maybe, but that is about C programming); anyway, I switched to a stylesheet I call classic (and I think I will make it the default from now on for PG submissions).

...

Excess spacing between chapter and subchapter headings.

Chapter and subchapter headings appear to be "justified" which clearly doesn't work.

That is not my intention, but I do not see it here.

...

In Mobi7 (in addition to the previous problems):

Title page formatting problems

Paragraph demarcation becomes even more difficult to discern. I would suggest for texts like this which have extremely long paragraphs go to the "zero indent, 1em space between" style of paragraphs as being much easier to read on ebook readers. And again, paragraph demarcation is also difficult to understand because your margin choices are breaking the justification algorithm [which then switches over temporarily to "ragged right"], which also makes it more difficult to discern paragraph boundaries -- since the eye also tends to catch "ragged right" within "justified" text as being a paragraph boundary.

I always use ragged right. For languages with long words like German, that becomes extremely ragged, so some hyphenation should be introduced (but that is really a task of the rendering device, I can only add hyphenation hints...)

...

Footnote markers aren't rendering as attractively as they did on epub devices, not sure what you are doing.

Mobi7 doesn't have the margins problem -- since it ignores body margins -- but the word justification routine still breaks down because of so many long German words. Suggest perhaps switch to unjustified ?

Spacing between Chapter and subchapter headings now collapse completely. IE the vertical whitespace associated with chapter and subchapter headers disappeared. Did you implement them using top and bottom margins?

The bracket labels on the vertical curly brackets in tables which you implemented using images in the tables -- those bracket labels aren't positioning correctly with respect to the bracket images. (In general positioning of text with respect to images or vice versa is an ill-defined problem in any flavor of html.)

Again, I do not observe this with the browsers I use to test. Alignment between text and images is not always pixel perfect, but well enough if massaged into a table to be workable on all desktop browsers I've seen. This must be an issue with devices.

...

But again, this work is much better than most I see published by PG.

Thanks once more.

James Adcock

10:30 p.m.

...

I normally use Calibre to read, now have also installed nook for PC and ADE

Suggest also that "Amazon Kindle Previewer" is a worthwhile emulator to take a look at -- because Amazon is such a pain in the rear to target, and because so many people have Kindles. The Previewer, however, is a dream to work with in terms of file formats: it is happy to accept a book as html, epub, or mobi, and it emulates 5 or 10 different sizes of machine in either portrait or landscape orientation. If you are only interested in EPUB issues, run it in "Kindle Fire" mode. Last time I checked, I was not sure how compatible Calibre was really interested in being with anything -- it has some unusual internal format it silently converts to.

Greg Newby

6 Oct

2:36 p.m.

On Fri, Oct 05, 2012 at 05:37:39PM -0400, Bowerbird@aol.com wrote:

...

don said:

...
So what are you suggesting? ... What is a constructive suggestion?

as per usual, for this _particular_ topic, we have the otherwise uncommon phenomenon where jim adcock is one of the only people making sense in the dialog.

so let me translate for you.

jim recommends that p.g. should _require_and_enforce_ submitted .html to be convertible to quality .mobi/.epub.

as _part_ of this, jim recommends that the converter-tool should be improved, since its current performance sucks.

jim suggests that an improved converter coupled with just a _little_ bit of bending on the part of the producer's whims will mean that _everyone_ will end up with a better outcome.

he would probably also suggest that the converter-tool be made more widely available, so that producers could use it to check the viability of their files _while_ working on them. (the current p.g. web-app tool is extremely clumsy to use.)

This is, more or less, exactly what I said we needed. There is no resistance to any of this. I even asked for input on figuring out what the requirements & enforcement would look like. -- Greg

Lee Passey

8 Oct

6:51 p.m.

New subject: enforcement feasibility plan

On 10/6/2012 8:36 AM, Greg Newby wrote

...

This is, more or less, exactly what I said we needed. There is no resistance to any of this. I even asked for input on figuring out what the requirements & enforcement would look like

Last question first: "enforcement." This is the easy one. Standards are enforced by "the community."

When a person or organization uploads a document to PG, he/she/it should be required to accept a disclaimer that the text is being donated to the public domain, and that it may be modified in any way by any other volunteer; no more claims of "ownership".

Any volunteer may make any change to any document. All changes will be tracked (you use Subversion; I think CVS would be a better choice, but that's a conversation for another day), so any change can be backed-out. No anonymous volunteers allowed (registration required, validated by e-mail confirmation -- not perfect, but helps keep out the 'bots).

Errata is handled by the Trac bug tracking system. The state of all errata can be viewed by the world at large, and new errata can be entered by any person. Any registered volunteer can fix any problem, and change the state of errata.

A bug is entered into the issue tracking system for every document that does not meet formatting standards. (It might be interesting to develop categories of non-compliance, and allow volunteers to add a document to one or more categories.)

In the case of format wars, where one volunteer thinks a thing should be done one way, and another thinks the thing should be done a different way, the issue can be presented on this list. If the standard is clear and unambiguous the resolution should be straight-forward. If it is not, then we will need a consensus on how to clarify the standard. If the standard does not cover the issue, we will need to enhance the standard.

Rogue volunteers can have their write privileges suspended.

In any case, non-compliant files can still be hosted at PG, they will simply be labeled as non-compliant, and bugs will be entered into the issue tracking system indicating what changes will be required to make a file compliant.

James Adcock

9 Oct

12:47 p.m.

New subject: enforcement feasibility plan

...

In any case, non-compliant files can still be hosted at PG, they will simply be labeled as non-compliant, and bugs will be entered into the issue tracking system indicating what changes will be required to make a file compliant.

My one suggestion here would be that PG not try to generate the EPUB and MOBI from HTML which is deliberately being submitted as being "non-compliant." You can make epubmaker jump through whatever hoops you want, but if the submitter is deliberately trying to write code that knowingly doesn't work on EPUB and MOBI, then again, epubmaker is being sent on a fool's errand. You are basically trying to make better design decisions about how to bridge the gap between large machines and small machines than the pros who design these small machines are making -- and that is exactly the business that the EPUB and MOBI [cough cough] standards writing committees are trying to accomplish, which is not something that PG can profitably second-guess.

Again, another thing to think seriously about: Should perhaps PG simply "write off" support for MOBI7 ? This would hose many PG customers, but since MOBI8 is much more compatible with EPUB, and since EPUB is more closely compatible with modern HTML, this would greatly reduce the support burden, and would position PG to make progress in the future. I personally would like to see PG continue to support MOBI7, at least for a couple more years. But I can see an argument for blowing it off and just supporting EPUB and MOBI8.

Lee Passey

8 Oct

7:56 p.m.

New subject: requirements feasibility plan

On 10/6/2012 8:36 AM, Greg Newby wrote:

...

This is, more or less, exactly what I said we needed. There is no resistance to any of this. I even asked for input on figuring out what the requirements & enforcement would look like.

First question last: "requirements." This is the tough one, not because coming up with a list of requirements is hard, but because getting consensus /is/.

The first, and I think non-negotiable, requirement is that whatever standard is selected it must have a reasonably complete set of markup to capture all the features of a book. Figuring out just what this "reasonably complete" set of features is is nonetheless problematic. Given its ivory tower origins, I believe that TEI should be the standard against which all other markups should be judged.

So, the basic requirement is that whatever standard is selected, it should be possible to losslessly convert from the standard to TEI and back again, and likewise convert losslessly from any TEI file to the agreed-upon standard and back to TEI. Any markup scheme which can meet this requirement is a candidate.

ePub is little more than a zip archive containing XHTML files. Even Kindle's .mobi format is a compressed XHTML format (although CSS is not supported). Thus, I have decided that XHTML is probably the best standard format, as it will require the least conversion and can probably be used natively without any conversion (TEI can also be viewed natively in a browser with the addition of an appropriate stylesheet, but it still makes some people nervous).

So, here is a list of some of my requirements for XHTML files, more-or-less in order of importance:

0. Files should be created using HTML markup with the XML syntax.

1. Paragraphs should be marked with the <p> element. Anything that is /not/ a paragraph should /not/ be marked with the <p> element. A paragraph is one or more complete sentences which together relate to a common thought or purpose. If you don't know what a complete sentence is, just drop the book and back away.

2. Lists should be marked as lists. Numbered lists should use the <ol> element, and unnumbered lists should use the <ul> element. Tables of Contents are not tables, they are lists, and should be marked as such.

3. <table> should only be used for tabular data. Tabular data is data that obviously exists as rows and columns. Tables should /NEVER/ be used to force a specific presentation. Sometimes, user agents do not display tables well. This is the fault of the user agent, and not the fault of the markup. If accommodation for a specific user agent is possible using CSS, create a specific stylesheet file to be used with the document, but /don't/ try to alter the table to account for user agent deficiencies.

4. Book titles should be marked with the <h1> element, "part" titles should be marked with the <h2> element, chapter titles should be marked with the <h3> element, "section" titles should be marked with the <h4> element, and "sub-section" titles should be marked with the <h5> element. If a title is composed of both a main title and a subtitle, the subtitle should be distinguished from the main title by adding "class='subtitle'" as an attribute of the title element. Author's names in book titles should be indicated by <h1 class="author">.

5. Indented blocks should be marked with the <blockquote> element. A <blockquote> can contain multiple <p>aragraphs, so long as they meet the requirement of item 1.

6. Blocks of text that do not fit into any current HTML category should be marked with the <div> element. Because <div> is a generic block marker, whenever a <div> element is used it should be appropriately classified: e.g. <div class='chapter'>. Classification values of <div> blocks should be drawn from a controlled set of values, TBD. A <div> block should /never/ be used when it can appropriately be replaced with some other HTML element. On the other hand, do not use other HTML elements inappropriately just to avoid using the <div> element.

7. Style attributes should not be applied to any element. If an element needs a specific style, create a special class for that style and place the style specification into an external style sheet.

7a. Place all style rules in external style sheets, not internal style blocks. This way it is possible to change style effects without editing the file itself.

7b. Files should be created in such a way that presentation is adequate, even if not optimal, without the application of any styles.

To be continued...

I collected this article many years ago: http://www.passkeysoft.com/~lee/HTMLeBooks.html. It is dated, and I don't agree with all the recommendations, but it is quite readable and is a good foundation for making e-books using HTML markup.
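Some of the rules above are mechanical enough that a tool could check them. What follows is only a rough sketch in Python, not anything PG or the poster actually uses: the function name check_xhtml and the choice of rules 0, 6, 7 and 7a are assumptions made for illustration, and it relies on the standard-library XML parser, which will reject files that use HTML named entities without a DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def local(tag):
    """Strip the XHTML namespace, if present, from an element tag."""
    return tag[len(XHTML_NS):] if tag.startswith(XHTML_NS) else tag


def check_xhtml(source):
    """Return a list of rule violations found in an XHTML string.

    Hypothetical checker: only rules 0, 6, 7 and 7a are covered.
    """
    try:
        root = ET.fromstring(source)  # rule 0: the file must parse as XML
    except ET.ParseError as err:
        return [f"rule 0: not well-formed XML ({err})"]
    problems = []
    for elem in root.iter():
        name = local(elem.tag)
        if "style" in elem.attrib:
            problems.append(f"rule 7: style attribute on <{name}>")
        if name == "style":
            problems.append("rule 7a: internal style block")
        if name == "div" and not elem.get("class"):
            problems.append("rule 6: <div> without a class")
    return problems


sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title><style>p { margin: 0 }</style></head>
<body>
<div class="chapter"><h3>Chapter 1</h3><p style="color: red">Text.</p></div>
<div><p>Unclassified block.</p></div>
</body></html>"""

for problem in check_xhtml(sample):
    print(problem)
```

Rules such as 1 through 3 (what counts as a paragraph, a list or tabular data) are judgments about content and would still need a human reviewer.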

don kretz

11:03 p.m.

New subject: requirements feasibility plan

On Mon, Oct 8, 2012 at 12:56 PM, Lee Passey <lee@passkeysoft.com> wrote:

On 10/6/2012 8:36 AM, Greg Newby wrote:

This is, more or less, exactly what I said we needed. There is no resistance to any of this. I even asked for input on figuring out what the requirements & enforcement would look like.

The first, and I think non-negotiable, requirement is that whatever standard is selected it must have a reasonably complete set of markup to capture all the features of a book. Figuring out just what this "reasonably complete" set of features is is nonetheless problematic.

This is the first and only requirement. And not reasonably complete, absolutely complete. You can't format what you can't identify. But, if you identify everything, you can format it any way you want with software.

And the list of things to identify can be done easily if your markup is extensible, because you keep adding markup identifiers until you don't need any more.

The rest of your requirements is just details and there are any number of equivalent schemes; they are interchangeable as long as the things requiring identification are unambiguously tagged or otherwise clearly identifiable.

James Adcock

9 Oct

1:16 p.m.

New subject: requirements feasibility plan

...

This is the first and only requirement.

Sorry, but in practice there are a couple more requirements:

You have to find enough volunteers who are willing to write to your pet language requirements that PG / DP can afford to kiss off the volunteers who are still interested in writing in other formats. Volunteers may be interested in writing in other formats supported by the rest of the industry, for example, because they may not want to be "locked in" to solely writing for PG.

You have to find tool-makers, including editor creators, to make the tools that make writing books to this standard reasonably pleasant and relatively foolproof. And these tools need to be supported on all the major OS's.

You have to find a systems "build master" who is willing to write the tools and actually support them to accurately compile your markup language to the formats which customers' devices actually support, including html, epub, and mobi, and who is actually willing to fix these tools when they generate broken output.

Perhaps hardest of all, you have to find someone who is actually willing to write support manuals and online help for all this stuff.

And all this has to be done indefinitely into the future, for at least the next couple decades, so that PG is not left with an incompatible and unsupported pile of rotting compost, wasting 100,000s of volunteer hours.

Lee Passey

4:05 p.m.

New subject: requirements feasibility plan

On 10/8/2012 5:03 PM, don kretz wrote:

...

On Mon, Oct 8, 2012 at 12:56 PM, Lee Passey <lee@passkeysoft.com> wrote:

On 10/6/2012 8:36 AM, Greg Newby wrote:

This is, more or less, exactly what I said we needed. There is no resistance to any of this. I even asked for input on figuring out what the requirements & enforcement would look like.

The first, and I think non-negotiable, requirement is that whatever standard is selected it must have a reasonably complete set of markup to capture all the features of a book. Figuring out just what this "reasonably complete" set of features is is nonetheless problematic.

...
This is the first and only requirement. And not reasonably complete, absolutely complete. You can't format what you can't identify. But, if you identify everything, you can format it any way you want with software.

I do not believe a markup language exists that can capture the complete essence of a book, thus my use of the "weasel word," 'reasonably.' Rest assured, my bar of "reasonableness" is quite high, but it is not unrealistic.

...

...
And the list of things to identify can be done easily if your markup is extensible, because you keep adding markup identifiers until you don't need any more.

A very salient point, with which I completely agree. Because XHTML is the basis of all modern e-book formats, whatever markup is chosen must at some point be reducible to XHTML. XHTML has two generic elements, <div> and <span>, to which semantic inflection can be added by use of the "class" attribute, and the "class" attribute can be added to any other element allowing refinement of their semantics. For this reason, XHTML meets your requirement of an extensible markup language, but also satisfies the goal of being a base language which can be used directly without transformation.

Note that TEI also has generic block-level and inline elements, and semantic inflection can be added using the "type" attribute. Thus, TEI is also an extensible language even though the core elements are predefined and presumably immutable.
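The parallel between XHTML's "class" and TEI's "type" can be illustrated with a small Python sketch. This is an illustration only, under loud assumptions: the GENERIC table (pairing <span> with TEI's <seg>) and the "booktitle" class value are hypothetical choices, namespaces are ignored, and a real converter would need a mapping for every element, not just the generic ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical pairing of XHTML generic elements with TEI generic elements.
GENERIC = {"div": "div", "span": "seg"}


def xhtml_to_tei_generic(elem):
    """Recursively copy an element, turning class="x" into type="x"."""
    new = ET.Element(GENERIC.get(elem.tag, elem.tag))
    for key, value in elem.attrib.items():
        new.set("type" if key == "class" else key, value)
    new.text, new.tail = elem.text, elem.tail
    for child in elem:
        new.append(xhtml_to_tei_generic(child))
    return new


src = ET.fromstring(
    '<div class="chapter"><p>He read '
    '<span class="booktitle">Huck Finn</span>.</p></div>'
)
converted = ET.tostring(xhtml_to_tei_generic(src), encoding="unicode")
print(converted)
```

The semantic label travels unchanged; only the attribute that carries it differs, which is why a controlled list of class values matters more than the choice of container.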

...

...
The rest of your requirements is just details and there are any number of equivalent schemes; they are interchangeable as long as the things requiring identification are unambiguously tagged or otherwise clearly identifiable.

True, but I obviously have not made myself clear. The primary purpose for a standard is to be predictable. To allow documents to be submitted in /any/ markup language makes it virtually impossible to develop tool sets to generate common output, or to maintain those documents. Further, standards provide a yard-stick that can not only measure compliance, but which can become a learning and training tool, so that when someone like Mr. Salzer comes along and asks, "how [does one] properly prepare HTML files for PG?" we can say, "here you go, follow these rules and you will be compliant, and if something doesn't make sense or isn't covered, we will clarify or modify the rules so it /is/ covered."

Development of a standard is primarily a political endeavor, not a technical one. While there /are/ a number of equivalent schemes a standard means that you pick one and stick with it. Frankly, if I ruled the world, that standard would be TEI as it is the most complete of markup languages for text encoding, and being XML is easy to work with. But as a general rule, people's irrational fear of TEI is even greater than their irrational fear of XHTML, so as a practical matter HTML is a better /political/ choice.

I don't care if paragraphs are marked with <p> or <para> or {\pard} or two [CR/LF] pairs following a non-whitespace character and terminated by a [CR/LF] pair, just so long as I know that when I encounter that markup I am guaranteed that the text is /always/ a paragraph and /nothing but/ a paragraph. Whatever the consensus is, I will happily adopt it and develop tools for it. But I must have a single rule.

This kind of a process will require compromise, and I'm afraid that those who are unwilling to compromise will simply have to be left out of the process.

If you don't like my rules, fine, suggest alternatives. We'll go with whatever gets the most support. I don't mind losing so long as at the end of the day everyone gets behind the winner.
(For a very interesting exploration of the value of crowd-sourcing, listen to the Radio Lab episode "Emergence" at http://www.radiolab.org/2007/aug/14/).

don kretz

7:20 p.m.

New subject: requirements feasibility plan

It seems we are essentially in agreement. I would be ok with using any variant of html, but it has some shortcomings in my mind.

1. Everything is built on those <div>s and <span>s (and some others, like <p> and <a>), but they are inherently unsemantic, so getting people to restrict themselves to marking up a master text with only semantic tags will be difficult.

2. Similarly, I find it helpful to have an easy visual distinction between structural markup and presentation markup.

3. Also, since it's only divs and spans, the markup tends to become verbose and obscure pretty quickly, with a lot of noise to the signal.

4. And consequently marked-up XHTML is less transparent when you want to see only the text.

5. I would prefer to be able to tell at a glance whether I'm looking at a master format text or an output format text.

6. (This is a nit.) I think begin and end tags should visually match; XHTML only provides ending </div> and </span>.

Yes, it's nice (I would say necessary) to be able to have your master text previewable. I can do that with my markup because the mapping to HTML is built into WordPress, accessible by hitting the Preview button.

Probably what I'm using is closer to TEI, but with less overhead. If PG were to standardize on XHTML I would not need to abandon mine because the mapping to XHTML is there from the start.

I'm not advocating that PG accept any format; only saying that any qualifying format is OK to me as a master format; just pick one.

Your point about not caring how a paragraph is marked: I agree, except that HTML's <p> has some implications that shouldn't apply to every usage of a paragraph; and if people were to see the <p> markup and believe it can only be an HTML-type <p> then that's a problem.

On Tue, Oct 9, 2012 at 9:05 AM, Lee Passey <lee@passkeysoft.com> wrote:

...

On 10/8/2012 5:03 PM, don kretz wrote:

On Mon, Oct 8, 2012 at 12:56 PM, Lee Passey <lee@passkeysoft.com> wrote:

On 10/6/2012 8:36 AM, Greg Newby wrote:

This is, more or less, exactly what I said we needed. There is no resistance to any of this. I even asked for input on figuring out what the requirements & enforcement would look like.

The first, and I think non-negotiable, requirement is that whatever standard is selected it must have a reasonably complete set of markup to capture all the features of a book. Figuring out just what this "reasonably complete" set of features is is nonetheless problematic.

This is the first and only requirement. And not reasonably complete, absolutely complete. You can't format what you can't identify. But, if you identify everything, you can format it any way you want with software.

I do not believe a markup language exists that can capture the complete essence of a book, thus my use of the "weasel word," 'reasonably.' Rest assured, my bar of "reasonableness" is quite high, but it is not unrealistic.

And the list of things to identify can be done easily if your markup is extensible, because you keep adding markup identifiers until you don't need any more.

A very salient point, with which I completely agree. Because XHTML is the basis of all modern e-book formats, whatever markup is chosen must at some point be reducible to XHTML. XHTML has two generic elements, <div> and <span> to which semantic inflection can be added by use of the "class" attribute, and the "class" attribute can be added to any other element allowing refinement of their semantics. For this reason, XHTML meets your requirement of an extensible markup language, but also satisfies the goal of being a base language which can be used directly without transformation.

Note that TEI also has generic block-level and inline elements, and semantic inflection can be added using the "type" attribute. Thus, TEI is also an extensible language even though the core elements are predefined and presumably immutable.

The rest of your requirements is just details and there are any number of equivalent schemes; they are interchangeable as long as the things requiring identification are unambiguously tagged or otherwise clearly identifiable.

True, but I obviously have not made myself clear. The primary purpose for a standard is to be predictable. To allow documents to be submitted in /any/ markup language makes it virtually impossible to develop tool sets to generate common output, or to maintain those documents. Further, standards provide a yard-stick that can not only measure compliance, but which can become a learning and training tool, so that when someone like Mr. Salzer comes along and asks, "how [does one] properly prepare HTML files for PG?" we can say, "here you go, follow these rules and you will be compliant, and if something doesn't make sense or isn't covered, we will clarify or modify the rules so it /is/ covered."

Development of a standard is primarily a political endeavor, not a technical one. While there /are/ a number of equivalent schemes a standard means that you pick one and stick with it. Frankly, if I ruled the world, that standard would be TEI as it is the most complete of markup languages for text encoding, and being XML is easy to work with. But as a general rule, people's irrational fear of TEI is even greater than their irrational fear of XHTML, so as a practical matter HTML is a better /political/ choice.

I don't care if paragraphs are marked with <p> or <para> or {\pard} or two [CR/LF] pairs following a non-whitespace character and terminated by a [CR/LF] pair, just so long as I know that when I encounter that markup I am guaranteed that the text is /always/ a paragraph and /nothing but/ a paragraph. Whatever the consensus is, I will happily adopt it and develop tools for it. But I must have a single rule.

This kind of a process will require compromise, and I'm afraid that those who are unwilling to compromise will simply have to be left out of the process.

If you don't like my rules, fine, suggest alternatives. We'll go with whatever gets the most support. I don't mind losing so long as at the end of the day everyone gets behind the winner.

(For a very interesting exploration of the value of crowd-sourcing, listen to the Radio Lab episode "Emergence" at http://www.radiolab.org/2007/aug/14/).


James Adcock

10 Oct

5:10 p.m.

New subject: requirements feasibility plan

...

2. Similarly, I find it helpful to have an easy visual distinction between structural markup and presentation markup.

How about if PG supported external style sheets - to help encourage people to separate their thinking about these tasks? In which case how about if PG also eases their one-file-only html restriction, so that we can submit books organized by chapter, for example?

Lee Passey

5:15 p.m.

New subject: requirements feasibility plan

On 10/10/2012 11:10 AM, James Adcock wrote:

...

How about if PG supported external style sheets – to help encourage people to separate their thinking about these tasks?

External style sheets should not be supported--they should be mandated. And internal style sheets should be prohibited.

...

In which case how about if PG also eases their one-file-only html restriction, so that we can submit books organized by chapter, for example?

The one-file-only rule was ill-advised from the moment it was adopted. If you're downloading you can download a zip file as easily as a single .txt file, and if you're viewing online the browser can assemble multiple files transparently. There is no practical reason for this rule's continued enforcement.

James Adcock

12:55 p.m.

New subject: requirements feasibility plan

...

"how [does one] properly prepare HTML files for PG?" we can say, "here you go, follow these rules and you will be compliant, and if something doesn't make sense or isn't covered, we will clarify or modify the rules so it /is/ covered."

Your ideas for using xhtml "class" to mark semantics (which I do not disagree with the concept) then has the problem that many [x]html tools which volunteers may want to use to tackle the job don't support your "class" semantics. Thus volunteers are stuck using your tools, or generic text editors such as notepad++. What, in practice do the WW'ers do when presented with what looks like an otherwise acceptable xhtml which doesn't have your "class" semantics? What should DP do to implement this? Certainly having a well-defined marking scheme for "obvious" things like TOC, title, author, chapter headings, etc, would be an obvious huge step forward.

...

Development of a standard is primarily a political endeavor, not a technical one.

Thank you -- yes. The politics, frankly, lie between PG and DP. If they agreed to anything, they have enough "critical mass" [for better or for worse] to stick it to the rest of us, because they would have us outvoted 10 to 1. Again, if you want to win the politics, go get DP on your side. Without them, you are not going to get anywhere.

Lee Passey

4:56 p.m.

New subject: requirements feasibility plan

On 10/10/2012 6:55 AM, James Adcock wrote:

...

Your ideas for using xhtml "class" to mark semantics (which I do not disagree with the concept) then has the problem that many [x]html tools which volunteers may want to use to tackle the job don't support your "class" semantics. Thus volunteers are stuck using your tools, or generic text editors such as notepad++.

Every XML/HTML editor I am familiar with has the ability to set attributes on elements. An editor without that capability is pretty much worthless. It takes a little bit of effort, but I can even make Microsoft Word do it. If volunteers choose to use tools that make their job harder, I can do nothing about it. But selecting an inadequate markup scheme just because it's hard to use your preferred tools otherwise, is as big a mistake as deciding to support only markup-free text because some volunteer, somewhere, might not understand markup. There are lots of volunteers out there in the world, and if someone decides not to participate because it's too hard for him/her I'm sure they can find some other worthy cause--perhaps Distributed Proofreaders, where things have been made even simpler than is possible.

...

What, in practice do the WW'ers do when presented with what looks like an otherwise acceptable xhtml which doesn't have your "class" semantics?

Don't know, don't care. In the system I envision the WW priesthood would be defrocked. I don't think the answer is to reform the current PG system, I think the answer is to replace it.

...

What should DP do to implement this?

Whatever it needs/wants to. Project Gutenberg Next Generation creates an archive, creates a catalog, and creates a standard for how documents need to be structured. If DP wants to participate, it can just meet the standards like everyone else. Maybe it can develop some processes that can make this happen better than an individual volunteer, maybe it can't, maybe it's just not interested. Their problem, not mine.

In any case, I suspect the focus of PGNG would be PG1000, which is the top 1000 most popular books at Project Gutenberg. Distributed Proofreaders (and Project Gutenberg as well) has consistently demonstrated resistance to upgrading e-texts already in the archive, preferring instead to produce such in-demand works as _The History and Antiquities of Horsham_ instead of fixing _Huck Finn_. I doubt that DP would have any interest at all in this project.

James Adcock

5:54 p.m.

New subject: requirements feasibility plan

...

Every XML/HTML editor I am familiar with has the ability to set attributes on elements. An editor without that capability is pretty much worthless. It takes a little bit of effort, but I can even make Microsoft Word do it. If volunteers choose to use tools that make their job harder, I can do nothing about it. But selecting an inadequate markup scheme just because it's hard to use your preferred tools otherwise, is as big a mistake as deciding to support only markup-free text because some volunteer, somewhere, might not understand markup.

It's not a question of what I would or would not be willing to use; it's a question of what the volunteers, especially at DP, are willing to use. Go check out the DP forums. My understanding is that a lot of people use things like notepad, notepad++, msword, sigil, etc. Dedicated XML/HTML editors are either not cheap or not good, last time I checked some months ago.

...

There are lots of volunteers out there in the world, and if someone decides not to participate because it's too hard for him/her I'm sure they can find some other worthy cause--perhaps Distributed Proofreaders, where things have been made even simpler than is possible.

DP would tell you they are already hurting for people who are capable and willing to format code. They have lots of people happy to play in P1. The problem is getting books finished and pushed out the back end.

...

In the system I envision the WW priesthood would be defrocked. I don't think the answer is to reform the current PG system, I think the answer is to replace it.

Well, I won't ask what caused them to become defrocked. What would you replace them with? PG does have reasonable legal requirements to vet submissions before posting them.

...

Whatever it needs/wants to. Project Gutenberg Next Generation creates an archive, creates a catalog, and creates a standard for how documents need to be structured. If DP wants to participate, it can just meet the standards like everyone else. Maybe it can develop some processes that can make this happen better than an individual volunteer, maybe it can't, maybe it's just not interested. Their problem, not mine.

Your problem, actually, because PG isn't going anywhere without DP support. Could happen the other way around though -- DP could go somewhere without PG.

David Starner

9 Oct

12:01 a.m.

New subject: requirements feasibility plan

On Mon, Oct 8, 2012 at 12:56 PM, Lee Passey <lee@passkeysoft.com> wrote:

...

So, the basic requirement is that whatever standard is selected, it should be possible to losslessly convert from the standard to TEI and back again, and likewise convert losslessly from any TEI file to the agreed-upon standard and back to TEI. Any markup scheme which can meet this requirement is a candidate.

It's virtually never true that two formats can meaningfully be losslessly converted into each other. In the smaller, more defined realm of images: JPEG and PNG can be reasonably losslessly converted to TIFF files, and the results converted back (if you know the conventions used in the original conversion), but general TIFF files can't be losslessly converted to JPEG or PNG. GIF can be converted to APNG (but not standard PNG) and back, but in general APNG can't be converted to GIF.

You can't perfectly convert plain text back and forth to TEI; you're forced either to keep your tables as monospaced or accept that they're going to lose their hand-set qualities along the way, for example. I'm pretty sure converting a modern HTML website to TEI and back won't come out anywhere near losslessly*. You can't really convert TEI to XHTML and back; with conventions perhaps, but I think it'd be a matter of storing TEI in XHTML comments and names as much as actually converting the TEI to XHTML.

Basically, that sentence to me means that TEI is a candidate, and nothing else. There are no interesting formats that are isomorphic to TEI; all formats have chosen to handle certain features in a distinct way, don't include some features that TEI does, and often include some features that TEI doesn't.

-- Kie ekzistas vivo, ekzistas espero.

don kretz

12:16 a.m.

New subject: requirements feasibility plan

...

Basically, that sentence to me means that TEI is a candidate, and nothing else. There are no interesting formats that are isomorphic to TEI; all formats have chosen to handle certain features in a distinct way, don't include some features that TEI does, and often include some features that TEI doesn't.

--

You can pass text among formats as long as you provide an unambiguous mapping between equivalent entities and confine your markup to those entities which are mapped. In any case we will need to create a well-defined list of semantic entities and attributes we expect PG to support. As long as those entities are distinct the list can be as extensible as we want. But you are correct, if there is any entity or attribute definition that isn't mapped, information is lost. We can't build a master format on any markup that doesn't support the full set of definitions. There's no existing standard for which we can agree to support everything that particular standard supports. Even within a standard the same semantic construction can be represented in various ways to various degrees of detail.
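The two conditions described here, an unambiguous mapping and markup confined to mapped entities, can be stated as simple checks. The Python below is a toy sketch under stated assumptions: the entity names and the (element, class) targets in MAPPING are invented for illustration and are not a proposed PG vocabulary.

```python
# Hypothetical master-format -> XHTML mapping; each target is (element, class).
MAPPING = {
    "paragraph": ("p", None),
    "chapter": ("div", "chapter"),
    "book-title": ("h1", None),
    "subtitle": ("h1", "subtitle"),
}


def is_invertible(mapping):
    """A mapping can round-trip only if no two entities share a target."""
    targets = list(mapping.values())
    return len(targets) == len(set(targets))


def unmapped(entities, mapping):
    """Entities used in a text that the mapping cannot carry across."""
    return sorted(set(entities) - set(mapping))


def round_trip(entities, mapping):
    """Convert entity names to their targets and back again."""
    inverse = {target: name for name, target in mapping.items()}
    return [inverse[mapping[name]] for name in entities]


used = ["chapter", "paragraph", "sidenote"]
print(is_invertible(MAPPING))
print(unmapped(used, MAPPING))
print(round_trip(["chapter", "paragraph"], MAPPING))
```

An entity reported by unmapped is exactly the case where information would be lost; extending the list means adding a new, distinct target for it.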

James Adcock

12:59 p.m.

New subject: requirements feasibility plan

...

Kindle's .mobi format is a compressed XHTML format (although CSS is not supported).

MOBI8 is close to being EPUB, including CSS. MOBI7, generated with modern versions of kindlegen does "support" CSS, just not as well as we like. Perhaps most troublesome MOBI7 still retains backwards compatibility with earlier prehistoric versions of MOBI (which followed prehistoric browser standards) by not collapsing top and bottom margins, and by rounding them to the closest 1em. MOBI8 doesn't have these problems -- it's basically EPUB with a different compression scheme.


36 comments

9 participants

participants (9)

Bowerbird＠aol.com
David Starner
don kretz
Greg Newby
James Adcock
Jeroen Hellingman
jeroen＠bohol.ph
Lee Passey
traverso＠posso.dm.unipi.it