
lee said:
When you say "o.c.f." am I to assume you mean the ePub file?
oh, my apologies, i screwed up. i meant the .opf file, required in an .epub, much like the .ncx file. sorry...
Where do you want me to upload them?
anywhere is fine. you could even copy it into a post here.
Be warned, the .epub file is almost 6 megabytes in size.
oh, heavens no, i wouldn't ask you to do anything like that. i mean the simple file that lists the manifest of other files...

(but now i'm curious as to what this 6-meg file would be?)

again, though, my apology for the error.

-bowerbird

On Wed, October 19, 2011 11:36 pm, Bowerbird@aol.com wrote:
lee said:
Where do you want me to upload them?
anywhere is fine. you could even copy it into a post here.
I didn't want to clutter up the mailing list with files that probably only BowerBird is interested in, so I sent them to him back channel. But I thought maybe I ought to share some information here that some /might/ have a passing interest in. At any rate, this will become a portion of my program's Frequently Asked Questions.

You see, I cheated.

ePubEditor is not a tool that is used to take simple text files and convert them to HTML. There are other programs out there which do that at least as well as I could (not the least of which is FineReader itself). BowerBird's own perl scripts are probably a good alternative. If you want to use ePubEditor to build ePubs, you have to start with HTML.

Now the web site that BB pointed me to, http://www.archive.org/details/artofbook00holm, doesn't contain any HTML at all. (Alex -- if nothing else, could you convince the Powers That Be at archive.org to at least configure ABBYY to spit out simple HTML as well as bare text? I know it can be done. It's not great, but at least it's a starting point.) The site /did/ however contain an ePub version, which as we all know is just zipped up HTML, so I downloaded and extracted that, and that became my base input.

Creating an .ncx file in ePubEditor can be (and in this case was) a two-step process. Long before Adobe shoved the .ncx file down the throats of the IDPF, it was my contention that in the electronic world a "Table of Contents" is not a table at all, but rather a "List of Contents." Thus, in HTML a "Table of Contents" is most appropriately expressed as <ol> or <ul> elements, not as a <table>, and certainly not as a sequence of <p>s or <div>s. Interestingly, in ePub 3.0 the .ncx file has been deprecated in favor of HTML lists.

ePubEditor's "CreateNCX" function relies on the existence of a Table of Contents made up of one or more lists, optionally nested. This file may be created by hand, and in the case of small works it is trivially easy to do so. All that is necessary is to create a numbered (<ol>) or unnumbered (<ul>) list with anchor elements inside the list item (<li>) elements. Of course, the new TOC file must be added to the publication and the manifest list. ePubEditor uses the publication's <guide> section to discover the HTML TOC file, so if you want to take advantage of the "CreateNCX" function, you must add a reference to the file in the Guides section with a type of "toc." If the HTML TOC is well-formed, you should be able to create an .ncx file with two mouse clicks.

In the case of "The Art of the Book", there was no existing Table of Contents, but there was an image of a page with the words "LIST OF ARTICLES" at the top (I have no idea why IA didn't OCR this page as well, but it's hard to figure out why IA does many of the things it does). Referring to the image I was able to create an HTML TOC in about 15 minutes, e.g.:

    <h3>LIST OF ARTICLES</h3>
    <ul>
      <li><a href="part0001.html#page-3">British Types for Printing Books.<br /> by Bernard H. Newdigate</a></li>
      <li><a href="part0001.html#page-69">Fine Bookbinding in England.<br /> by Douglas Cockerell</a></li>
      <li><a href="part0001.html#page-127">The Art of the Book in Germany.<br /> by L. Deubner</a></li>

etc. From this file, ePubEditor was able to create a valid .ncx file.

But what if your TOC is complex, and you don't have a simple image you can use?

ePubEditor also has a feature to scan the HTML contents of an ePub and create an HTML Table of Contents. This feature relies on the assumption that every element you would want to include in a Table of Contents is also a section header. So what the program does is traverse the contents of the publication in reading order and create a <li> for each header it encounters. Lists are nested according to the level of the headers, so <h3> headers may be nested inside an <h2> list item, and may include <h4> list items as a sub list. This generated TOC may not be perfect, so it is recommended to review the generated TOC in a text editor before attempting to generate an .ncx file from it.

A sample generated TOC file would be similar to:

    <ul>
      <li><a href="part1.html#part1">Part I</a>
        <ul>
          <li><a href="part1.html#chap1">Chapter One</a></li>
          <li><a href="part1.html#chap2">Chapter Two</a></li>
          <li><a href="part1.html#chap3">Chapter Three</a></li>
          ...
        </ul></li>
      <li><a href="part2.html#part2">Part II</a>

etc.

In the case of the "Art of the Book", there were no headers marked. In fact, every block of text was marked as though it were a paragraph. (If you're going to call everything a paragraph, why add any markup at all? It makes no sense.) However, everywhere a new page scan started the autobuild process injected "<div class="newpage" id="page-nn"></div>" (where nn was the scan number). (How can you have a division of the text that has no content? I'm beginning to think the people at Internet Archive simply don't understand how to use HTML.) I used ePubEditor's "Replace" function to convert every one of these phony <div>s to <h3>, autogenerated an HTML TOC, then autogenerated a different .ncx from that. I ended up with a Navigation Control file that pointed to each page instead of actual divisions of the book, but when you start with the kind of crap that IA provides, there's not a lot you can do with automation.
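The header-scanning approach described above can be sketched in a few lines of Python. This is only an illustration of the idea (walk the files in reading order, emit one list item per header, nest by header level), not ePubEditor's actual code. The names HeaderCollector, build_tree and render are hypothetical, and the sketch assumes each header carries an id attribute that can serve as the link target.

```python
from html import escape
from html.parser import HTMLParser


class HeaderCollector(HTMLParser):
    """Collect (level, id, text) for every <h1>..<h6>, in document order."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._current = None  # [level, id, text parts] while inside a header

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._current = [int(tag[1]), dict(attrs).get("id"), []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == "h%d" % self._current[0]:
            level, anchor, parts = self._current
            title = " ".join("".join(parts).split())  # normalise whitespace
            self.headers.append((level, anchor, title))
            self._current = None


def build_tree(files):
    """files: list of (filename, html_text) pairs in reading order."""
    root = {"title": None, "href": None, "children": []}
    stack = [(0, root)]  # (header level, node); the root sits at level 0
    for filename, text in files:
        collector = HeaderCollector()
        collector.feed(text)
        for level, anchor, title in collector.headers:
            href = filename + ("#" + anchor if anchor else "")
            node = {"title": title, "href": href, "children": []}
            # Pop back to the nearest header of a strictly higher rank.
            while stack[-1][0] >= level:
                stack.pop()
            stack[-1][1]["children"].append(node)
            stack.append((level, node))
    return root


def render(node, indent=0):
    """Return the nested <ul> for a node's children as a list of lines."""
    pad = "  " * indent
    lines = [pad + "<ul>"]
    for child in node["children"]:
        link = '<a href="%s">%s</a>' % (escape(child["href"]), escape(child["title"]))
        if child["children"]:
            lines.append(pad + "  <li>" + link)
            lines.extend(render(child, indent + 2))
            lines.append(pad + "  </li>")
        else:
            lines.append(pad + "  <li>" + link + "</li>")
    lines.append(pad + "</ul>")
    return lines


# Made-up sample input, two content files in reading order.
sample = [
    ("part1.html", '<h2 id="part1">Part I</h2><h3 id="chap1">Chapter One</h3>'),
    ("part2.html", '<h2 id="part2">Part II</h2>'),
]
print("\n".join(render(build_tree(sample))))
```

If every page-break <div> has first been turned into an <h3>, as in the workaround above, a scan like this yields one flat list item per page, which is why the resulting navigation points at pages rather than at real divisions of the book.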

Lee,

There's a very alpha script we're testing you can try out. (No guarantee, may burn the house down, etc.)

http://www-kenh.archive.org/download/artofbook00holm/artofbook00holm_abbyy.h...

--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391 +1(480)253-9640 +1(703)919-8090
abuie@kwdservices.com

On 10/21/2011 9:58 AM, Alex Buie wrote:
Lee,
There's a very alpha script we're testing you can try out. (No guarantee, may burn the house down, etc)
http://www-kenh.archive.org/download/artofbook00holm/artofbook00holm_abbyy.h...
What we have here is not the script, but the output of the script as applied to "The Art of the Book." As you might expect, it's a mess. Like all FineReader output it is in SGML HTML, which XML parsers cannot use.

It appears that in standard usage at the Internet Archive, the primary (perhaps only) reason for doing OCR is to support searching inside the FlipBooks (or the PDF files which appear to be the same thing in a different format). What I see in this file is that /every/ word in the file has a surrounding <span> of class "abbyyword" and a title attribute containing the coordinates indicating where that word appeared in the image. The end of every line that was detected is indicated by <br class="abbyybreak">, and every line begins with an <a>nchor of class "abbyyline" that not only has an identifier but also carries data indicating where the line starts in the image. The file is full of this sort of cruft; if we want data that is not inextricably tied to a picture, we don't need any of this stuff.

BUT...

If this is the best I can get, I'll take it. I think I can write some XSL scripts, maybe together with some real programming, that can reduce the garbage to a manageable level. I would much rather have too much data than too little, because it's easier to ignore irrelevant data than it is to guess at the existence of unknown data.

Whoever wrote this should be encouraged to continue improving it; don't let her/him become discouraged. And if IA doesn't have the resources to pursue it, please make it available in the condition it's in now. If you could provide a pointer to the actual script, that would be really great.
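As a rough idea of what that kind of cleanup might look like, here is a minimal sketch. It uses Python's html.parser rather than XSL, because that parser tolerates HTML that is not well-formed XML. The class names "abbyyword", "abbyybreak" and "abbyyline" come from the description above; everything else (the AbbyyStripper and strip_abbyy names, the exact attribute layout, the choice to turn a detected line end into a newline) is an assumption, since the actual file has not been checked against this code. Comments and the doctype are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "meta", "link", "input"}  # tags with no end tag


class AbbyyStripper(HTMLParser):
    """Re-emit HTML, unwrapping abbyyword spans and abbyyline anchors."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        # Per tag, remember whether each open element's start tag was dropped.
        self._dropped = {"span": [], "a": []}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "br" and "abbyybreak" in classes:
            self.out.append("\n")  # keep the line end as plain whitespace
            return
        drop = (tag == "span" and "abbyyword" in classes) or (
            tag == "a" and "abbyyline" in classes
        )
        if tag in self._dropped:
            self._dropped[tag].append(drop)
        if not drop:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # void elements were already written by handle_starttag
        if tag in self._dropped and self._dropped[tag]:
            if self._dropped[tag].pop():
                return  # matching start tag was dropped, so drop this too
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def strip_abbyy(html_text):
    parser = AbbyyStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


# Made-up sample line; the title values are placeholders, not real data.
sample = (
    '<p><a class="abbyyline" id="line1" title="placeholder"></a>'
    '<span class="abbyyword" title="placeholder">The</span> '
    '<span class="abbyyword" title="placeholder">Art</span>'
    '<br class="abbyybreak"></p>'
)
print(strip_abbyy(sample))
```

Because the coordinate data is discarded rather than converted, this is a one-way reduction; anyone who wants to keep the link back to the page image would need to preserve the title attributes somewhere else first.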
participants (3)
- Alex Buie
- Bowerbird@aol.com
- Lee Passey