a review of some digitization tools -- 011

ok, here we go, in our series on digitization tools... we're concentrating now on the _tagging_ part of the digitization process. we have cleaned the text and marked the italicized words in previous steps. so we need some sample books to work with now... you might recall that i used "books and culture" in our "sweet grapes" series back in september, when discussing the clean-up phase, so we can now use that e-text to demo the next phase in our process. the file, after the bulk of the clean-up, was here:
we then did a transform of that file, to get this:
and now, after yet another transform, we have this:
all the versions are more-or-less reminiscent of:

1. the output you could expect from abbyy o.c.r.
2. the text you find for each book at archive.org
3. a c.t.f. (concatenated text file) from pgdp.net
4. the .txt format utilized by project gutenberg

to that extent, then, we're on very familiar ground. still, i encourage you to examine "grapes008.txt" closely, so you know i have nothing up my sleeve.

here are some of the notable things you might see, which adhere to the rules of zen markup language:

01. headers are preceded by at least 4 blank lines.
02. headers are followed by exactly 2 blank lines.
03. structural elements are bounded by blank lines.
04. paragraphs are not indented (in the input file).
05. the linebreaks are consistent with the p-book.
06. linebreaks don't necessarily need to be retained.
07. end-of-line hyphens are considered to be "soft",
08. but a tilde indicates a preceding hyphen is "hard".
09. italics are indicated by surrounding underscores.
10. lines beginning with " > " indicate a blockquote.
11. the first section is considered as "the title page".
12. the table-of-contents must be the second section.

and again, none of this "tagging" would be hard to do in a typical text-editor, or your favorite wordprocessor.

this, again, is one of the bare-bones books that has practically no structural elements aside from headers, front-matter, and plain old paragraphs in the body... the only other structures here are a few blockquotes.

so take a good look at that file...

-bowerbird
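p.s. for the curious, here's a tiny invented sketch of input that follows those rules -- the title, chapter name, and sentences below are made up for illustration, not taken from the actual file. four blank lines precede the chapter header, and exactly two follow it:

a sample title
by a sample author




chapter i.


this opening paragraph is not indented, and its linebreaks
follow the p-book, so a hyphen at the end of a line, as in
"culti-
vation", is soft and gets merged away, while "well-~
known" keeps its hyphen because of the tilde.

 >  a line beginning with " > " marks a blockquote.

and a phrase in _italics_ is surrounded by underscores.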

On Mon, December 5, 2011 10:46 am, Bowerbird@aol.com wrote:
here are some of the notable things you might see, which adhere to the rules of zen markup language:
Together with those same things in HTML.
01. headers are preceded by at least 4 blank lines.
Headers are preceded by one of <h1>, <h2>, <h3>, <h4>, <h5> or <h6>
02. headers are followed by exactly 2 blank lines.
Headers are followed by </h1>, </h2>, </h3>, </h4>, </h5> or </h6> depending on how they were started.
03. structural elements are bounded by blank lines.
Structural elements are bounded by <div>...</div>
04. paragraphs are not indented (in the input file).
Paragraphs are bounded by <p>...</p>
05. the linebreaks are consistent with the p-book. 06. linebreaks don't necessarily need to be retained.
Line breaks may be consistent with the p-book, but will not necessarily be retained on display.
07. end-of-line hyphens are considered to be "soft", 08. but a tilde indicates a preceding hyphen is "hard".
End-of-line hyphens must be merged with the following word. If a hyphen is only used for syllabification it should be replaced with "&shy;".
09. italics are indicated by surrounding underscores.
Italics are bounded by <i>...</i> or <em>...</em>
10. lines beginning with " > " indicate a blockquote.
Blockquotes are bounded by <blockquote>...</blockquote>, and may contain paragraphs (<p>...</p>) or any other structural division (<div>...</div>).
11. the first section is considered as "the title page".
The title page is that structural element which has a class definition of "title_page" (e.g. <div class="title_page">...</div>).
12. the table-of-contents must be the second section.
The table of contents is that structural element which has a class definition of "toc" (e.g. <div class="toc">...</div>). The actual table of contents entries are bounded by <ol>...</ol> or <ul>...</ul>, and each individual entry is bounded by <li>...</li>.
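Put together, a minimal sketch of a skeleton that follows these correspondences (the ids and headings are invented for illustration, not taken from the actual book):

<div class="title_page">
  <h1>A Sample Title</h1>
</div>

<div class="toc">
  <ul>
    <li><a href="#ch01">Chapter One</a></li>
    <li><a href="#ch02">Chapter Two</a></li>
  </ul>
</div>

<div id="ch01">
  <h2>Chapter One</h2>
  <p>An ordinary paragraph, with one word in <i>italics</i>.</p>
  <blockquote>
    <p>A quoted passage.</p>
  </blockquote>
</div>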
and again, none of this "tagging" would be hard to do in a typical text-editor, or your favorite wordprocessor.
And again, none of this "tagging" would be hard to do in a typical text-editor, or your favorite wordprocessor. There is nothing fundamentally "wrong" with inventing a new markup language like z.m.l. (although I have doubts as to its completeness), but it's an attempt to reinvent the wheel -- and a fifth wheel at that. I think that we should stick with what works.

End-of-line hyphens must be merged with the following word. If a hyphen is only used for syllabification it should be replaced with "&shy;".
A problem with &shy; is that it is frequently mis-implemented by devices, resulting in a rendering of a hard hyphen on the display when that hyphen should have been suppressed except at end-of-line. Thus in practice the recommendation is that &shy; not be used. Reader devices are starting to implement auto-hyphenation at rendering time without the requirement of &shy;, which is probably a step in the right direction, although these efforts seem pretty rudimentary at this time. The hyphenation problem is greatly compounded by PG and DP's requirements on using non-typographical rules on hard hyphenation and em-dashes, resulting in non-breakable strings 30-40 characters long, almost guaranteeing ugly display rendering.
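For concreteness, an invented one-line illustration of the entity at issue:

<p>An extra&shy;ordinarily long word.</p>

A conforming renderer shows "extraordinarily" unbroken, displaying the hyphen only when the word is actually split at the end of a line; the mis-implementations described above display the hyphen unconditionally.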
12. the table-of-contents must be the second section.
A requirement to place the TOC at a specific location would clearly seem to be an error. If the original author/pub chose to place the TOC in an "unusual" location, it would still seem sensible to retain that location. Conversely, if the original author/pub chose NOT to implement a TOC, but the PG volunteer transcriber feels a dying need to implement a TOC in spite of the original author's intent, then the logical place to put the TOC would seem to be at the very beginning of the transcription, prior to the actual transcription, where the transcriber can make it clear that the TOC has been added by the transcriber for the convenience of readers trying to navigate the transcription on electronic devices, and that such a TOC *does not* in fact represent part of the work being transcribed. The transcriber can also highlight this fact by implementing the TOC in a non-matching choice of font.

Hi Jim,

There are actually three problems below:

1) the original TOC has no true meaning in a digital version. At least in the sense of the original text, unless we have a 1-to-1 representation. Of course, if there is one, one should adjust accordingly. The problem is what to do about the page numbers, if existent.

2) a user-defined TOC must be identifiable as just that. it actually does not matter where it goes! Yet, see 3.

3) certain e-book formats require a "TOC":
a) do we use the original? yes. do we leave the original in place, too, or place a transcriber's note that it has been moved?
b) the format generally has a requirement for its placement.
c) any user-defined TOC should be this "TOC".

of course there might be reasons to do it differently, yet I can not think of any reason to do it.

regards
Keith.

P.S. Who are you citing?????

On 08.12.2011, at 21:40, James Adcock wrote:
[snip, snip]
12. the table-of-contents must be the second section.
A requirement to place the TOC at a specific location would clearly seem to be an error. If the original author/pub chose to place the TOC in an "unusual" location, it would still seem sensible to retain that location. Conversely, if the original author/pub chose NOT to implement a TOC, but the PG volunteer transcriber feels a dying need to implement a TOC in spite of the original author's intent, then the logical place to put the TOC would seem to be at the very beginning of the transcription, prior to the actual transcription, where the transcriber can make it clear that the TOC has been added by the transcriber for the convenience of readers trying to navigate the transcription on electronic devices, and that such a TOC *does not* in fact represent part of the work being transcribed. The transcriber can also highlight this fact by implementing the TOC in a non-matching choice of font.

Keith:
1) the original TOC has no true meaning in a digital version.

The job of a TOC is to tell the reader the major subparts of the whole that the reader is reading, and how to get to one of those subparts if the reader doesn't want to read the whole. This has the same meaning in a digital book as in a paper book.

2) a user-defined TOC must be identifiable as just that. it actually does not matter where it goes! Yet, see 3.

Today a TOC is frequently recognized as being the TOC by including a tag on it which says "TOC".

b) the format generally has a requirement for its placement.

Not aware of real ebook formats that place a requirement on the placement of a TOC. "real ebook formats" meaning in practice EPUB and MOBI. People, including most of PG, are confused by TOC and EPUB because the official EPUB documentation is horribly written and confusing, referring to two distinct things as being TOCs, and by the fact that Adobe in their implementation of EPUB chose to ignore one of these two "TOC"s covered by the EPUB documentation, highlighting something as being a TOC which isn't really a TOC but rather an "on the side" digital navigational aid. In MOBI a TOC is placed at a location simply tagged as being a "TOC".

PS: re "Who am I quoting" -- if he whom I am quoting wanted to be part of a discussion then this nicety of distinction would seem to matter.

Hi James,

On 09.12.2011, at 21:11, James Adcock wrote:
Keith: 1) the original TOC has no true meaning in a digital version.
The job of a TOC is to tell the reader the major subparts of the whole that the reader is reading, and how to get to one of those subparts if the reader doesn't want to read the whole. This has the same meaning in a digital book as in a paper book.

Not quite! For a PDF this may be true (often it is off by a page or two). For HTML and epub it does not tell where to go, but offers a link. There is quite a bit of semantic and pragmatic difference.
2) a user-defined TOC must be identifiable as just that. it actually does not matter where it goes! Yet, see 3.
Today a TOC is frequently recognized as being the TOC by including a tag on it which says “TOC”
b) the format generally has a requirement for its placement.
Not aware of real ebook formats that place a requirement on the placement of a TOC. "real ebook formats" meaning in practice EPUB and MOBI. People, including most of PG, are confused by TOC and EPUB because the official EPUB documentation is horribly written and confusing, referring to two distinct things as being TOCs, and by the fact that Adobe in their implementation of EPUB chose to ignore one of these two "TOC"s covered by the EPUB documentation, highlighting something as being a TOC which isn't really a TOC but rather an "on the side" digital navigational aid. In MOBI a TOC is placed at a location simply tagged as being a "TOC".
Evidently, you have not read the documentation well enough, or you are not recognizing the fact that the "EPUB TOC" does "tell the reader the major subparts of the whole".

regards
Keith.

Keith: Not quite! For a PDF this may be true (often it is off by a page or two). For HTML and epub it does not tell where to go, but offers a link. There is quite a bit of semantic and pragmatic difference.

I don't see it? The difference between "telling" how to get there vs. "pointing" how to get there?

Evidently, you have not read the documentation well enough or you are not recognizing the fact that the "EPUB TOC" does "tell the reader the major subparts of the whole".

Assuming by "EPUB" we mean version 2, since version 3 isn't out in the market yet, I read, direct from IDPF, explaining their transitioning from version 2 to version 3, quote:

Need for enhanced navigation support. There is currently no ability to represent preferred instantiations of navigational elements; as well, presently any rendering of page-to-page navigation as well as Table of Contents and other navigation elements is optional and entirely Reading System-dependent. Version 2.0.1 makes "end user" presentation of the NCX TOC a "should" with statement that the next version will likely transform to a "must" and that other NCX sections may receive similar "upgraded" treatment.

End-quote.

IE an NCX is NOT a TOC under version 2.0 -- under version 3.0 they are moving to require the NCX to be a TOC.

Early EPUBs included explicit TOCs until Adobe Digital Editions started supporting NCX "on the side" in a way that appeared similar to a TOC, at which point in time EPUBs appeared to include two TOCs in slightly different incompatible ways that looked stupid, so then publishers began to remove the explicit TOCs from their books and started referring to the NCX as if it were the "TOC." Conversely, Kindle retained NCX as in fact a "Navigational Aid" which did not look like a TOC, and so instead retained TOCs explicitly in the book where they belong and where they in fact look like a TOC. But then people who naively translated EPUBs to MOBI using Kindlegen would complain that the generated MOBI doesn't include a TOC -- of course not, since it never had a TOC; rather it had an NCX navigational aid! The NCX is still in there, it just isn't implemented "on the side" like Adobe chose to do, making it look like a TOC and not like an NCX. Version 3.0 EPUB basically shrugs its shoulders and says "Oh well, the Adobe approach has won out by now."

On 09.12.2011, at 22:26, James Adcock wrote:
Keith: Not quite! For a PDF this may be true (often it is off by a page or two). For HTML and epub it does not tell where to go, but offers a link. There is quite a bit of semantic and pragmatic difference.
I don't see it? The difference between "telling" how to get there vs. "pointing" how to get there?

O.K., let's take this slowly so that we are all in the same context.

1) a link is identified by some text that is in one way or the other emphasized to indicate that it is a link.
2) the text identifying the link will not necessarily contain the location of where to go in the text. Depending on the software used for display, one might be given said location. This may be dependent on the exact syntax used for the link.

3) I have not seen an e-reader that actually shows the location the link is pointing to. Of course, I admit that I do not know many ereaders myself, yet I would assume that they do not display the true location of the link, as this would be a great distraction while reading.

4) I do not know of an ereader that allows you to enter a link verbatim and jump to that position! In other words you have no way of manually accessing any particular location in an etext/ebook.

5) Etexts/ebooks do not per se have a concept of pages, so one cannot actually flip to a particular page. Moreover, the text is at most divided into "screens of text", and the exact size will vary from device to device and from setting to setting (font, font size, orientation).

So, as you can plainly see from the above, the person that sees an entry in a TOC is not told at all where to go. At most they only have the information that there exists a major subpart somewhere in the book. In other words they have no navigational means of going there as one would with a traditional book of printed matter. What they do have is a navigational means by clicking/activating the link, whereby they are moved to that pre-defined position in the etext/ebook without actually truly knowing where it is located -- that is, in which file, chapter, section, etc.

So semantically, even if there were a traditional TOC, it does not tell you where to go. Pragmatically, you navigate via link, and can not move easily to this position as you would using a printed book by thumbing to the position. You could, though, if supported, try scrolling to this position, but you have no true means of orientation.

A link does not necessarily tell a user where to go, nor does it actually point the user to a location (though a link does actually point/link to a location); it merely indicates that s/he can go to a different location by clicking it (thereby pointing out that s/he can jump/link without necessarily knowing where the location is). [Please note the use of necessarily, because depending on the software used the user has the ability to see the actual location being pointed/linked to.]

Below, you try to advocate that the "should" of NCX does not make it a true TOC for an EPUB. Yet, in the light of other missing navigational means there is hardly any truly easily accessible alternative to it. Furthermore, you try to differentiate between the functionality of navigational means and the informational value of a TOC. In a printed book you only have "the turning of pages" as navigational means. Yet, as the TOC gives you orientation of where to find that information, it is a navigational means too. I find your distinction (differentiation) between purely informational value and navigational value very contrived, especially in this discussion. A TOC is almost always navigational, in this day and age. I do admit that in the past it was customary to have books in which the TOC did not have page numbers and was, thereby, purely informational in nature, and you can find this style used today, yet it is not widespread. Whether the author of an ebook should add a TOC, or add navigational content to a traditional TOC, is irrelevant, as it is a matter of style.
According to the documentation, the "EPUB TOC" (NCX) is the true TOC of the EPUB that is easily accessible to the human reader. The NCX is a must-have file, and must contain at least one entry (the beginning of the text).

So, in the light of the above, and if one wishes to follow traditional book-making practices, the "EPUB TOC" has a fixed position in the EPUB format, if one wishes to have a TOC for the ebook. This "EPUB TOC" does not necessarily have to mimic the original TOC of a book being transcribed, yet for all practical purposes of a TOC -- mentioning the major subparts of a work -- this is where it should go! Of course, in traditional book making there is no set location for the TOC, though convention has it that it is either somewhere in the front of a book or near the end of it.

regards
Keith.
Keith: Evidently, you have not read the documentation well enough or you are not recognizing the fact that the "EPUB TOC" does "tell the reader the major subparts of the whole".
Assuming by "EPUB" we mean version 2, since version 3 isn't out in the market yet, I read, direct from IDPF, explaining their transitioning from version 2 to version 3, quote:
Need for enhanced navigation support. There is currently no ability to represent preferred instantiations of navigational elements; as well, presently any rendering of page-to-page navigation as well as Table of Contents and other navigation elements is optional and entirely Reading System-dependent. Version 2.0.1 makes “end user” presentation of the NCX TOC a “should” with statement that the next version will likely transform to a “must” and that other NCX sections may receive similar “upgraded” treatment.
End-quote.
IE an NCX is NOT a TOC under version 2.0 -- under version 3.0 they are moving to require the NCX to be a TOC.
Early EPUBs included explicit TOCs until Adobe Digital Editions started supporting NCX "on the side" in a way that appeared similar to a TOC, at which point in time EPUBs appeared to include two TOCs in slightly different incompatible ways that looked stupid, so then publishers began to remove the explicit TOCs from their books and started referring to the NCX as if it were the "TOC." Conversely, Kindle retained NCX as in fact a "Navigational Aid" which did not look like a TOC, and so instead retained TOCs explicitly in the book where they belong and where they in fact look like a TOC. But then people who naively translated EPUBs to MOBI using Kindlegen would complain that the generated MOBI doesn't include a TOC -- of course not, since it never had a TOC; rather it had an NCX navigational aid! The NCX is still in there, it just isn't implemented "on the side" like Adobe chose to do, making it look like a TOC and not like an NCX. Version 3.0 EPUB basically shrugs its shoulders and says "Oh well, the Adobe approach has won out by now."

So semantically, even if there were a traditional TOC, it does not tell you where to go. Pragmatically, you navigate via link, and can not move easily to this position as you would using a printed book by thumbing to the position. You could, though, if supported, try scrolling to this position, but you have no true means of orientation.
Sorry, but I think we are talking past each other. By a "traditional TOC" I mean something within the body of an e-book, at the traditional location of a TOC, which has the appearance of a traditional paper-book TOC, but which has been augmented with hot-links, such that clicking on say a chapter title, or clicking on what was the traditional paper page number, results in vectoring the reader's reading page to that location. About half the TOCs one sees within PG have such hot-links; the other half are simply static renderings, or are not provided at all.

As opposed to say an NCX, which kind-of presents TOC information "on the side", IE outside of the normal reading experience of reading a book. Again, in early EPUB efforts one often saw EPUB e-books which included both an active hot-linked "traditional" TOC within the body of the book AND an Adobe Digital Editions NCX "TOC" "on the side", resulting in the appearance of two separate but slightly different TOCs in one book -- which looked stupid. With the result that the EPUB ebook publishing community seems to be heading rapidly in the direction of including the Adobe Digital Editions NCX approach "on the side" and excluding putting a "real" copy of a TOC within the body of the text -- even if the paper version of the book originally came with a real TOC in the body of the text.

Hi Jim,

I am afraid then that all your arguments are moot, as you can access a traditional TOC, as you would call it, without "jumping" to the beginning and paging to it, when you use the so-called "navigational" means. You have just given the best argument for the NCX's ease of use and accessibility!

As for my personal opinion, I would prefer that both are implemented.

Please do not forget that any time you move from reading to the TOC you are going "outside of the normal reading experience". So, using the NCX is part of the "normal digital reading experience".

In light of the below mentioned, then, you were wrong in using the term "telling" when you meant "pointing", as a link does not tell you where to go (see earlier post). Thereby the confusion.

regards
Keith.

On 10.12.2011, at 19:59, Jim Adcock wrote:
So semantically, even if there were a traditional TOC, it does not tell you where to go. Pragmatically, you navigate via link, and can not move easily to this position as you would using a printed book by thumbing to the position. You could, though, if supported, try scrolling to this position, but you have no true means of orientation.
Sorry, but I think we are talking past each other. By a "traditional TOC" I mean something within the body of an e-book, at the traditional location of a TOC, which has the appearance of a traditional paper-book TOC, but which has been augmented with hot-links, such that clicking on say a chapter title, or clicking on what was the traditional paper page number, results in vectoring the reader's reading page to that location. About half the TOCs one sees within PG have such hot-links; the other half are simply static renderings, or are not provided at all.
As opposed to say an NCX, which kind-of presents TOC information "on the side", IE outside of the normal reading experience of reading a book. Again, in early EPUB efforts one often saw EPUB e-books which included both an active hot-linked "traditional" TOC within the body of the book AND an Adobe Digital Editions NCX "TOC" "on the side", resulting in the appearance of two separate but slightly different TOCs in one book -- which looked stupid. With the result that the EPUB ebook publishing community seems to be heading rapidly in the direction of including the Adobe Digital Editions NCX approach "on the side" and excluding putting a "real" copy of a TOC within the body of the text -- even if the paper version of the book originally came with a real TOC in the body of the text.

Keith> I am afraid then that all your arguments are moot, as you can access a traditional TOC, as you would call it, without "jumping" to the beginning and paging to it, when you use the so-called "navigational" means. You have just given the best argument for the NCX's ease of use and accessibility!

Keith> As for my personal opinion, I would prefer that both are implemented.

Well, my arguments may well be moot, but not for the reasons you suggest -- rather because in EPUB 3 "real" TOCs are in practice going away, now that NCX "on the side" "TOCs" have become officially sanctioned.

On 12/9/2011 2:26 PM, James Adcock wrote: [snip]
IE an NCX is NOT a TOC under version 2.0 -- under version 3.0 they are moving to require the NCX to be a TOC.
[snip]
Version 3.0 EPUB basically shrugs its shoulders and says “Oh well the Adobe approach has won out by now.”
As a frequently ignored member of the ePub 3 working group, I can definitively say that you have this backwards. In ePub 3.0 the .ncx file has been deprecated (actually, I think the word the specification uses is "superseded").

At the outset it must be acknowledged that Adobe is the 800-pound gorilla in this fight. Virtually every suggestion that Adobe makes 1. is adopted, and 2. is designed to make ePub behave more like PDF. I don't believe that the Adobe representatives are really trying to turn ePub into PDF, just that they have lived so long in the "pages as pictures" world they just have a hard time changing their mindset (I think that this kind of ossification goes a long ways towards explaining Project Gutenberg as well.)

As near as I can tell ePub 1, based on OEB 1.0, did not address the Table of Contents issue at all. In OEB 1.0 Microsoft started using the OPF "tours" element to construct a table of contents. (This approach actually still makes a lot of sense to me.) Even today you can find Microsoft .lit books that have a "toctour" identified in the <guide> element.

Adobe didn't like this approach, and started casting about for an alternative. They decided they liked the Navigation Control Center (.ncx) file that the Daisy Consortium used for Digital Talking Books, and kind of rammed that decision through the IDPF when ePub 1 was refined into ePub 2.

With ePub 3, the working group realized that they had been "played" by Adobe, and moved back to an XHTML solution. The ePub working group is, however, enamored with HTML 5, so instead of recommending that the Table of Contents be encapsulated in a <div> element they adopted the new HTML 5 <nav> element instead. The remainder of the Table of Contents, however, is much as you would expect it to be: a header, followed by a list, where each item in the list is a link to an anchor in the contents.

Ordinarily, I would create a Table of Contents thusly:

<div class="toc">
  <h3>Table of Contents</h3>
  <ul>
    <li><a href="#ch01">Chapter One</a></li>
    <li><a href="#ch02">Chapter Two</a></li>
    <li><a href="#ch03">Chapter Three</a></li>
    ...

The new ePub 3.0 spec would have me create it like this:

<nav epub:type="toc">
  <h3>Table of Contents</h3>
  <ol>
    <li><a href="#ch01">Chapter One</a></li>
    <li><a href="#ch02">Chapter Two</a></li>
    <li><a href="#ch03">Chapter Three</a></li>
    ...

Now HTML requires that User Agents (that which many call "browsers") ignore unknown elements and continue processing, so if I built a Table of Contents like /this/:

<div class="toc">
  <nav epub:type="toc">
    <h3>Table of Contents</h3>
    <ol>
      <li><a href="#ch01">Chapter One</a></li>
      <li><a href="#ch02">Chapter Two</a></li>
      <li><a href="#ch03">Chapter Three</a></li>
      ...

I would have a Table of Contents that should be rendered correctly in all HTML User Agents, /and/ is a valid ePub 3 Table of Contents.

On 12/10/2011 4:28 PM, Lee Passey wrote:
Now HTML requires that User Agents (that which many call "browsers") ignore unknown elements and continue processing, so if I built a Table of Contents like /this/:
<div class="toc">
  <nav epub:type="toc">
    <h3>Table of Contents</h3>
    <ol>
      <li><a href="#ch01">Chapter One</a></li>
      <li><a href="#ch02">Chapter Two</a></li>
      <li><a href="#ch03">Chapter Three</a></li>
      ...
I would have a Table of Contents that should be rendered correctly in all HTML User Agents, /and/ is a valid ePub 3 Table of Contents.
P.S. An .ncx file is still allowed going forward for backwards compatibility. For that reason, in ePubEditor I have created a function that parses a Table of Contents list like the one illustrated, and creates a valid .ncx file from it.
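For reference, a hedged sketch of the sort of .ncx such a function might emit for the three-chapter list above (the uid and file name are invented for illustration):

<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <!-- must match the dc:identifier declared in the .opf -->
    <meta name="dtb:uid" content="sample-book-id"/>
  </head>
  <docTitle><text>A Sample Title</text></docTitle>
  <navMap>
    <navPoint id="np-1" playOrder="1">
      <navLabel><text>Chapter One</text></navLabel>
      <content src="book.html#ch01"/>
    </navPoint>
    <navPoint id="np-2" playOrder="2">
      <navLabel><text>Chapter Two</text></navLabel>
      <content src="book.html#ch02"/>
    </navPoint>
    <navPoint id="np-3" playOrder="3">
      <navLabel><text>Chapter Three</text></navLabel>
      <content src="book.html#ch03"/>
    </navPoint>
  </navMap>
</ncx>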

Hi Lee, Everybody,

Please excuse me if this seems OT, but we are in general talking about creating ebooks.

If I understand you correctly, the NCX is being deprecated/superseded in favor of the HTML 5 <nav>-tag. Now, I can see the elegance of the approach, but isn't this going to cause a big mess when authoring an ebook, and cause older EPUBs not to function properly? That is, where do I put the tag (that is, does it go into the ncx or just in a file)? [Yes, I could look it up! ;-)) ]

How should I develop a new ebook?

1) still use a NCX
2) just use the HTML5 <nav>-tag
3) use both methods

This may seem a trivial question, yet it is important if you expect an ebook to work with different ereaders or develop a tool. Any insight into what the working group thinks ereader producers should implement would be helpful. Otherwise there will be ebooks to redo and redo!

regards
Keith.

On 11.12.2011, at 00:28, Lee Passey wrote:
On 12/9/2011 2:26 PM, James Adcock wrote:
[snip]
IE an NCX is NOT a TOC under version 2.0 -- under version 3.0 they are moving to require the NCX to be a TOC.
[snip]
Version 3.0 EPUB basically shrugs its shoulders and says “Oh well the Adobe approach has won out by now.”
As a frequently ignored member of the ePub 3 working group, I can definitively say that you have this backwards. In ePub 3.0 the .ncx file has been deprecated (actually, I think the word the specification uses is "superseded").
At the outset it must be acknowledged that Adobe is the 800-pound gorilla in this fight. Virtually every suggestion that Adobe makes 1. is adopted, and 2. is designed to make ePub behave more like PDF. I don't believe that the Adobe representatives are really trying to turn ePub into PDF, just that they have lived so long in the "pages as pictures" world they just have a hard time changing their mindset (I think that this kind of ossification goes a long ways towards explaining Project Gutenberg as well.)
As near as I can tell ePub 1, based on OEB 1.0, did not address the Table of Contents issue at all. In OEB 1.0 Microsoft started using the OPF "tours" element to construct a table of contents. (This approach actually still makes a lot of sense to me). Even today you can find Microsoft .lit books that have a "toctour" identified in the <guide> element.
Adobe didn't like this approach, and started casting about for an alternative. They decided they liked the Navigation Control Center (.ncx) file that the Daisy Consortium used for Digital Talking Books, and kind of rammed that decision through the IDPF when ePub 1 was refined into ePub 2.
With ePub 3, the working group realized that they had been "played" by Adobe, and moved back to an XHTML solution. The ePub working group is, however, enamored with HTML 5, so instead of recommending that the Table of Contents be encapsulated in a <div> element they adopted the new HTML 5 <nav> element instead. The remainder of the Table of Contents, however, is much as you would expect it to be: a header, followed by a list, where each item in the list is a link to an anchor in the contents.
Ordinarily, I would create a Table of Contents thusly:
<div class="toc">
  <h3>Table of Contents</h3>
  <ul>
    <li><a href="#ch01">Chapter One</a></li>
    <li><a href="#ch02">Chapter Two</a></li>
    <li><a href="#ch03">Chapter Three</a></li>
    ...
The new ePub 3.0 spec would have me create it like this:
<nav epub:type="toc">
  <h3>Table of Contents</h3>
  <ol>
    <li><a href="#ch01">Chapter One</a></li>
    <li><a href="#ch02">Chapter Two</a></li>
    <li><a href="#ch03">Chapter Three</a></li>
    ...
Now HTML requires that User Agents (that which many call "browsers") ignore unknown elements and continue processing, so if I built a Table of Contents like /this/:
<div class="toc">
  <nav epub:type="toc">
    <h3>Table of Contents</h3>
    <ol>
      <li><a href="#ch01">Chapter One</a></li>
      <li><a href="#ch02">Chapter Two</a></li>
      <li><a href="#ch03">Chapter Three</a></li>
      ...
I would have a Table of Contents that should be rendered correctly in all HTML User Agents, /and/ is a valid ePub 3 Table of Contents.

On 12/11/2011 5:51 AM, Keith J. Schultz wrote:
Now, I can see the elegance of the approach, but isn't this going to cause a big mess when authoring an ebook, and cause older EPUBs not to function properly?
That is, where do I put the tag (that is, does it go into the ncx or just in a file)? [Yes, I could look it up! ;-)) ]
The ePub group took sort of an odd approach this time. Their goal was to produce a specification that was not only backward compatible, but was forward compatible as well. In other words, not only should an ePub 3 User Agent be able to display an ePub 2 document, but an ePub 2 User Agent should be able to display an ePub 3 document. Overall, I'm not real impressed with just how well they have managed that goal.

But in order to try and meet the goal they changed the way the Table of Contents is identified in the publication definition file (.opf). For ePub 2, the NCX file is identified by adding the "toc" attribute to the <spine> element, having as a value the id of the item in the manifest which points to the NCX file. In ePub 3, a User Agent finds the document file containing the HTML5 <nav> element by scanning the manifest for an <item> element which has the property "nav" (e.g. <item properties="nav" ...>). It then becomes an issue of opening that file to find the <nav> element.
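To make that concrete, a hedged sketch of the two declarations in the .opf (file names and ids invented for illustration):

<manifest>
  <!-- ePub 2: the NCX is an ordinary manifest item... -->
  <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
  <!-- ePub 3: the navigation document is flagged with the "nav" property -->
  <item id="nav" href="nav.xhtml" properties="nav" media-type="application/xhtml+xml"/>
</manifest>
<!-- ePub 2: ...and the spine's toc attribute points at it by id -->
<spine toc="ncx">
  <itemref idref="ch01"/>
</spine>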
How should I develop a new ebook?
1) still use a NCX
2) just use the HTML5 <nav>-tag
3) use both methods
This may seem a trivial question, yet it is important if you expect an ebook to work with different ereaders or develop a tool.
You should be able to use both methods.

In my mind, the most reliable method of building a Table of Contents is by using a list element (<ol> or <ul>) at or near the beginning of the document as a whole. I tend to like placing the Table in its own file, but it's really up to you. As I have been working on ePubEditor, I discovered that it was fairly easy to create a Daisy NCX file from a nested list, so I built that functionality into my program. Thus, my approach is to build a Table of Contents using HTML lists and list items. I identify the TOC with a <div> element with a "toc" class. When encapsulating my HTML files into an ePub I identify the TOC file in the guide as being the "toc" type, and then just generate the NCX file just before I build the ePub.

You could add the HTML5 <nav> element as the first (and only) child of the <div> tag for ePub 3 compatibility, but as of right now there is not a single User Agent in the world that can actually read a version 3 ePub, so it seems that adding the element is unnecessary at the moment. So long as the <div> element identifies the TOC, the <nav> tag can be added back in later if and when necessary.

One of my major design goals in developing ePubEditor was to build a tool that could be used to fix the truly horrendous markup that commercial publishers use to create ePub files (even BowerBird has more of a clue about how to create an ePub than Penguin Books). As part of that process I have been checking out hundreds of books from my public library, even ones I have no personal interest in, and examining their structures. In the process I have discovered that, despite Mr. Adcock's claim to the contrary, almost every commercial book I have looked at has an HTML Table of Contents in addition to the Daisy NCX file. Usually they are not implemented as lists (in some instances each TOC entry is marked as a paragraph with a class of "list-item"!), but usually they exist, so even commercial publishers have learned that while an NCX file is good, an HTML Table of Contents is even better.
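For what it's worth, the guide entry described above would look something like this (file name invented for illustration):

<guide>
  <reference type="toc" title="Table of Contents" href="toc.html"/>
</guide>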

Lee:

<div class="toc">
  <nav epub:type="toc">
    <h3>Table of Contents</h3>
    <ol>
      <li><a href="#ch01">Chapter One</a></li>
      <li><a href="#ch02">Chapter Two</a></li>
      <li><a href="#ch03">Chapter Three</a></li>
      ...

Well then, pray be to God that Amazon gets its act together re kf8 and the associated tools, because if people start writing this code into their HTML currently it won't do any good on many Kindles -- which hate links in tables. Or else Marcello has to pull another rabbit out of the hat ;-)

On Mon, December 12, 2011 6:55 am, Jim Adcock wrote:
Lee:

<div class="toc">
  <nav epub:type="toc">
    <h3>Table of Contents</h3>
    <ol>
      <li><a href="#ch01">Chapter One</a></li>
      <li><a href="#ch02">Chapter Two</a></li>
      <li><a href="#ch03">Chapter Three</a></li>
      ...
Well then, pray be to God that Amazon gets its act together re kf8 and the associated tools, because if people start writing this code into their HTML currently it won't do any good on many Kindles -- which hate links in tables.
I understand that Kindle has had problems with tables in the past; in fact, <table>, <tr> and <td> are not listed as supported tags in the Kindle Publishing Guidelines. But of course these objections are irrelevant here, as the example I provided uses lists (<ol>) not tables...

Italics are bounded by <i>...</i> or <em>...</em>
That which transforms something which looks like plain-text with markups into something more complicated which a machine can further process in interesting ways is typically called a parser. Parsers have known problems depending on the design of the markup language which they are fed. Two of the most central problems (plus a third hidden problem to be mentioned later) are error detection and error recovery. Error detection, meaning that it is generally recognized that if the writer makes a mistake on input then it is better to detect that error and report on it rather than continuing silently. Error recovery, meaning that even if the input has an error, giving up immediately rather than trying to parse the rest of the input file and report on all the other errors is probably also a bad design. If the user makes 12 errors in their input file you would like to report all 12 errors, or at least most of them, so the user doesn't have to make 12 separate submissions to the parser to detect all input errors.

One simple input language design aspect, which is well-known, is to design your open-symbols to be distinct from your close-symbols. Then an open-symbol followed by an open-symbol is detectable as an error -- there should have been a close-symbol in there:

<i> .... <i>

Error: two italic <i> symbols with no closing </i> symbol.

Compare to say using underbar to mean both the open and close italic symbol:

_ ... _

Is this an error, or is it correct? One cannot diagnose the situation: it could be correct, or it could be an error. The desire to be able to detect input errors is thus a reason why SGML-like markup languages have gained in preference over the last 40 years over K&R troff-like markup languages.

What's the third problem? The input language OUGHT to be designed such that something which is an error LOOKS like an error. When the parser points out an error to the user, the user ought to say "Oh yes, that is an error, I recognize that I made an error."

Consider instead an input language which relies on a lot of hidden unspecified meanings and implied design rules. Now one of two things happens on "input error":

1) What most likely happens is that the output is silently generated but looks ugly and fails to "match" the input submitted, for reasons that remain unexplained and undiagnosed. The user is left confused, frustrated, and bewildered, and has to try making changes to the input "at random" to try to fix the problem -- or to give up and accept the erroneous output as a given.

2) Better would be if the parser issues errors even though superficially to the user the input "looks good." Then the user goes back, looks at their input and says: "What do you mean that's an error -- that input looks absolutely beautiful to me!" The end result is still the same: the user is left angry, frustrated, and confused, with no good prospects for making forward progress.

Hi James,

I still do not know who you are citing! Though, you are basically describing what any parser should do (common knowledge). You also have a few conceptual errors in your description.

What you fail to mention is that practically all systems that take input and create some kind of output parse! Even a simple find is parsing. True enough, I would not necessarily call that a parser. You just describe a syntactic parser. Parsers can be designed to also give information on semantic and pragmatic usage. (Take a look at XCode.) Of course, the amount of useful information will depend on the design of the parser.

Finally, and not least, any error message is only meaningful to those that know what they are doing and understand the system they are using. In other words a JADU will not make heads or tails of an error message, no matter how much information you offer. (See many MS error messages.)

Please do not mention troff and the like; it gives me shivers and besides makes me feel old. I do not know (or knew) anybody that liked troff.

Having said all that, what is the actual point you are trying to make, and in what context? It is completely opaque to me. Sorry. [BB, please no sly remarks ;-)) ]

regards
Keith.

On 08.12.2011, at 22:10, James Adcock wrote:
Italics are bounded by <i>...</i> or <em>...<em>
That which transforms something which looks like plain-text with markups into something more complicated which a machine can further process in interesting ways is typically called a parser. Parsers have known problems depending on the design of the markup language which they are fed. Two of the most central problems (plus a third hidden problem to be mentioned later) are error detection and error recovery. Error detection, meaning that it is generally recognized that if the writer makes a mistake on input then it is better to detect that error and report on it rather than continuing silently. Error recovery, meaning that even if the input has an error, giving up immediately rather than trying to parse the rest of the input file and report on all the other errors is probably also a bad design. If the user makes 12 errors in their input file you would like to report all 12 errors, or at least most of them, so the user doesn't have to make 12 separate submissions to the parser to detect all input errors.
One simple input language design aspect, which is well-known, is to design your open-symbols to be distinct from your close-symbols. Then an open-symbol followed by an open-symbol is detectable as an error -- there should have been a close-symbol in there:
<i> .... <i>
Error: two italic <i> symbols with no closing </i> symbol.
Compare to say using underbar to mean both the open and close italic symbol:
_ ... _
Is this an error, or is it correct?
One cannot diagnose the situation: it could be correct, or it could be an error.
The desire to be able to detect input errors is thus a reason why SGML-like markup languages have gained in preference over the last 40 years over K&R troff-like markup languages.
What's the third problem? The input language OUGHT to be designed such that something which is an error LOOKS like an error. When the parser points out an error to the user the user ought to say "Oh yes that is an error, I recognize that I made an error."
Consider instead an input language which relies on a lot of hidden unspecified meanings and implied design rules. Now one of two things happens on "input error"
1) What most likely happens is that the output is silently generated but looks ugly and fails to "match" the input submitted for reasons that remain unexplained and undiagnosed. The user is left confused, frustrated, and bewildered, and has to try making changes to the input "at random" to try to fix the problem -- or to give up and accept the erroneous output as a given.
2) Better would be if the parser issues errors even though superficially to the user the input "looks good." Then the user goes back, looks at their input and says: "What do you mean that's an error -- that input looks absolutely beautiful to me!" The end result is still the same: the user is left angry, frustrated, and confused, with no good prospects for making forward progress.

Keith:
Having said all that, what is the actual point you are trying to make, and in what context? It is completely opaque to me. Sorry.

We all "know" what it takes to make good input languages and good parsers, and then "we" all forget that stuff and go off half-cocked pursuing light-weight markup schemes. Then we ask ourselves why we have to tediously create all this "light weight" stuff by hand, hand-counting invisible vertical whitespace etc. using tools which don't understand any of this stuff and which therefore cannot help us -- while all the time the proponents of light-weight markup keep telling us how they are helping us make our lives easier, and the WWers are standing cop at the gates, because unlike a real input language the light-weight stuff doesn't even come with its own style checkers to get the WWers out of traffic-cop mode.

"Making our lives easier" would mean removing the requirement to submit separate hand-tooled lightweight markup in the first place! It would be a ton less effort to write a style guide for HTML, and/or to write a style tool to tag "features" of submitted HTML of which the transcriber is very proud, but which are just very probably going to look crappy on one or more target devices.

On Fri, December 9, 2011 1:26 pm, James Adcock wrote:
"Making our lives easier" would mean removing the requirement to submit separate hand-tooled lightweight markup in the first place!
I don't believe there /is/ any requirement "to submit separate hand-tooled lightweight markup." What there /is/ is a requirement that whatever you submit must have an impoverished text version as well as any other.

At some point in the past, Project Gutenberg's policy was that it would not be a repository for multiple variations of the same e-text. Because an impoverished text version was required, and no variations were allowed, the impoverished text version became the de facto master canonical version.

I think it is obvious to the most casual observer that "lightweight markup languages" which have the same expressive power as a "heavy markup language" are in fact as complex as their counterparts, and are probably harder to work with. The sole benefit of a "lightweight markup language" is that the resulting document is presumably easier for a human to understand in its raw form (and consequently harder to process algorithmically). In the context of Project Gutenberg, the ambiguity and opaqueness of a "lightweight markup language" had the benefit that it could fool the whitewashers into believing that a marked-up text was nothing more than impoverished text, allowing it to become the master version of a particular book but with sufficient expressive power to be upgraded to something approaching useful.

Given the (relatively) new openness at PG to accept HTML files, the need to fool whitewashers is gone. If I were preparing a new book to be submitted to PG I'd do everything in HTML and make sure /that/ version was canonical. I'd then use some kind of automated conversion tool to strip the markup from the HTML, and submit /that/ as the text version along with the canonical HTML version. And because fixing errors in PG texts is so difficult, it's unlikely that the two files will ever get out of sync.
It would be a ton less effort to write a style guide for HTML, and/or to write a style tool to tag "features" of submitted HTML of which the transcriber is very proud but which are just very probably going to look crappy on one or more target devices.
Yes, and several of those style guides have been written in the past: see, e.g., http://www.hwg.org/opcenter/gutenberg/ and the Gutenberg wiki at http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ#H.4._What_are_the_PG_rules_.... Marcello Perathoner has also written such a guide, but I can't put my finger on a reference to it quickly.

The problem with all these style guides is not that they are ineffective or technically incorrect (with some exceptions) but rather that the people who wrote them, and those that consult them, tend to approach the question of markup with religious fervor. They were the George W. Bushes of HTML markup: "no compromise, God is on my side." (This same religious fervor is not limited to HTML; there are some people who evangelize other, less useful, markup systems -- such as z.m.l. -- as well.)

Presumably Project Gutenberg will continue Mr. Hart's commitment to anarchy; other than "good" rules (e.g. Project Gutenberg must always have an impoverished text version) no other rules will be imposed. (If you can't detect my sarcasm, you're not trying hard enough.) This means that if there is going to be any standardization at all in the format of the Project Gutenberg corpus it will require the /unanimous/ agreement of every volunteer, including those from the past who are no longer contributors. Project Gutenberg has no practical leaders; it is less well-organized than the Occupy Wall Street protesters.

But you can lead by example. Pick someone's style guide; it doesn't matter whose, or whether or not you like it -- as long as it can be converted to something you /do/ like, you're fine. Then follow that guide and encourage others to do the same. The principals at Project Gutenberg will offer no leadership -- we're on our own. If the masses can't agree on being limited to a few styles there is no hope.

Given the (relatively) new openness at PG to accept HTML files, the need to fool whitewashers is gone. If I were preparing a new book to be submitted to PG I'd do everything in HTML and make sure /that/ version was canonical.

Lee> I don't believe there /is/ any requirement "to submit separate hand-tooled lightweight markup." What there /is/ is a requirement that whatever you submit must have an impoverished text version as well as any other.

Your beliefs unfortunately are in error, as the WWers will inform you after you violate the "unwritten rules that everyone knows about" -- if you submit a txt70 file which doesn't exactly match their unwritten rules, part of which is using the vertical whitespace markup correctly, and using the "_" and "*" markup correctly. And BB will go nonlinear, because he recognizes that the rules represent a firm markup requirement.

Lee> I'd then use some kind of automated conversion tool to strip the markup from the HTML, and submit /that/ as the text version along with the canonical HTML version. And because fixing errors in PG texts is so difficult, it's unlikely that the two files will ever get out of sync.

I haven't found a flawless tool that will strip HTML, even easy HTML, and reduce it flawlessly without wiping out some of the utf-8, and that will wrap anything close to the ridiculous txt70 wrap standards that are expected. So this all ends up being a day or two of needless grunt work tacked on at the end of what had previously been a long, but otherwise fruitful, grind. And god forbid you make a mistake while grunting your HTML back down to txt70 -- you will be crucified!
It would be a ton less effort to write a style guide for HTML, and/or
Lee> Yes, and several of those style guides have been written in the past: see, e.g., http://www.hwg.org/opcenter/gutenberg/ and the Gutenberg wiki at http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ#H.4._What_are_the_PG_rules_for_HTML_texts.3F. Marcello Perathoner has also written such a guide, but I can't put my finger on a reference to it quickly.

Strange, whenever *I* cannot find one of the vaunted guides I am told I am an idiot. [Granted, but the guide writers' point exactly being?] And that these guides totally miss the mark can be ascertained by comparing the rendering of PG html, epub, and mobi files, in order to see just how successful these style guides have been in practice.
The problem with all these style guides is not that they are ineffective or technically incorrect (with some exceptions) but rather that the people who wrote them and those that consult them tend to approach the question of markup with religious fervor.
While missing the mark: you can mark up according to these guides and what PG ends up publishing will still look like crap. Again, please take off the rose-colored glasses and look at, say, the last 10 published books in all of html, epub, and mobi. Do they LOOK the same? Do they LOOK equally good? Are "real world customers" going to have an equally happy experience reading them on any of the three classes of devices? Why not? It is certainly possible, if not relatively easy, to make an ebook that looks equally good on html, epub, and mobi devices. PG just isn't doing it. Other people take PG books, fix them, and republish them in a form that real-world customers can actually read -- on other forums. Why can't PG do this with their own books? Why shouldn't PG be doing this with their own books? What is the point in publishing stuff which looks like crap?

On 12/9/2011 5:29 PM, Jim Adcock wrote:
Other people take PG books, fix them, and republish them in a form that real world customers can actually read -- on other forums. Why can't PG do this with their own books?
The question you ask is rhetorical in nature. I would encourage you to go back and think about it as a /non/-rhetorical question. To re-write the question to perhaps make this a bit more obvious: PG can't do this with their own books. Why? I'll leave it to you to develop your own answers to this question. But once you have them you need to ask yourself, "what can I do to change the answers to that question?" Personally, I don't think the answer to the question is technical, so a technical solution is probably not going to resolve the issue.

Lee> Personally, I don't think the answer to the question is technical, so a technical solution is probably not going to resolve the issue.

Well, I agree with you about there not being a good technical solution -- but -- Marcello is already rewriting code on the fly to try to work around the greatest mismatches between what people submit and what PG would like to ship. At which point the question becomes: to what extent should PG try to fix problems "on the fly"? I personally believe that PG needs to try to inform people about what the technical problems are. If not, then I don't see how PG can expect people to submit code that tries to avoid those problems.

On Mon, December 12, 2011 7:01 am, Jim Adcock wrote:
Lee> Personally, I don't think the answer to the question is technical, Lee> so a technical solution is probably not going to resolve the issue.
Well, I agree with you about there not being a good technical solution
You may have missed my point. There is no technical solution because it is not a technical problem: it is a political/institutional problem.
I personally believe that PG needs to try to inform people about what the technical problems are. If not, then I don't see how PG can expect people to submit code that tries to avoid those problems?
Those individuals at PG's reins /don't/ expect people to submit code to avoid the (technical) problems. All they expect is for people to submit impoverished text versions of public domain books; other formats are tolerated simply because they're so popular they cannot be ignored, but I'm sure that most of the principals at PG would prefer that /no/ alternative versions be made available. Michael Hart always believed that /no/ guidance should be provided to volunteers, because some of them would be so resentful at being told what to do that they would abandon PG altogether (I guess those that abandon PG due to /lack/ of guidance don't count). How can you overcome that legacy?

There is no technical solution because it is not a technical problem: it is a political/institutional problem.
Not sure there is any disagreement here -- I was just trying to be kind in my choice of phraseology since it is really not clear to me what is going on re the insides of the PG technical/political/institutional "powers that be" and why they do do what they do do -- or more often why they don't do what they don't do. Not that these problems are unique to PG relative to any other NFP.
Those individuals at PG's reins /don't/ expect people to submit code to avoid the (technical) problems. All they expect is for people to submit impoverished text versions of public domain books; other formats are tolerated simply because they're so popular they cannot be ignored, but I'm sure that most of the principals at PG would prefer that /no/ alternative versions be made available.
It is not clear to me anymore what the intent of the PG impoverished text versions is, or what the PG impoverished-text-version "mission" is, since it is clear to me that the txt70 versions don't really "preserve" a text -- for example, I would think that a commercial house, if they wanted to create a new paper edition of a historical text, would want to work from the HTML, and/or maybe the txt70, PLUS an archive.org-style or google-books-style full-page photo-digitization. So, what does it really mean then to "preserve" a book electronically nowadays? Too bad archive.org and PG haven't figured out a way to work closer together to really reintegrate the PG work back into the archive.org posts, by correcting the Djvu and the PDF, for example, combining the more correct PG text with the archival images. I guess my naïve hope is that someday PG will see their mission as being to create high-quality "archival" transcriptions into reflow forms for real people to actually *read* -- people who actually care about the difference -- as opposed to, say, more popular efforts such as feedbooks, who only care that something superficially "looks nice."
Michael Hart always believed that /no/ guidance should be provided to volunteers, because some of them would be so resentful at being told what to do that they would abandon PG altogether (I guess those that abandon PG due to /lack/ of guidance don't count). How can you overcome that legacy?
The reality is that the WWers have always provided "lots of guidance" whatever Michael thought, because, whether one agrees with the WWers or not [and I plead guilty to being one of the chief whiners] *somebody* has to play the role of the adult keeping *some* degree of "play nice" on the playground.

Hi James,

I have been pushing this idea for decades! So I agree. Yet, I disagree that it could not be a lightweight mark-up! As you probably know, it would be all in the parser. I have the ideas and the know-how, just no time for doing it.

regards
Keith.

On 09.12.2011, at 21:26, James Adcock wrote:
It would be a ton less effort to write a style guide for HTML, and/or to write a style tool to tag "features" of submitted HTML of which the transcriber is very proud but which are just very probably going to look crappy on one or more target devices.
participants (5)

- Bowerbird@aol.com
- James Adcock
- Jim Adcock
- Keith J. Schultz
- Lee Passey