
Lee,

There's a very alpha script we're testing you can try out. (No guarantee, may burn the house down, etc)

http://www-kenh.archive.org/download/artofbook00holm/artofbook00holm_abbyy.h...

--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391
+1(480)253-9640
+1(703)919-8090
abuie@kwdservices.com

On Fri, Oct 21, 2011 at 9:50 AM, Lee Passey <lee@novomail.net> wrote:
On Wed, October 19, 2011 11:36 pm, Bowerbird@aol.com wrote:
lee said:
Where do you want me to upload them?
anywhere is fine. you could even copy it into a post here.
I didn't want to clutter up the mailing list with files that probably only BowerBird is interested in, so I sent them to him back channel. But I thought maybe I ought to share some information here that some /might/ have a passing interest in. At any rate, this will become a portion of my program's Frequently Asked Questions.
You see, I cheated.
ePubEditor is not a tool that is used to take simple text files and convert them to HTML. There are other programs out there which do that at least as well as I could, (not the least of which is FineReader itself). BowerBird's own perl scripts are probably a good alternative. If you want to use ePubEditor to build ePubs, you have to start with HTML.
Now the web site that BB pointed me to, http://www.archive.org/details/artofbook00holm, doesn't contain any HTML at all. (Alex -- if nothing else, could you convince the Powers That Be at archive.org to at least configure ABBYY to spit out simple HTML as well as bare text? I know it can be done. It's not great, but at least it's a starting point). The site /did/ however contain an ePub version, which as we all know is just zipped up HTML, so I downloaded and extracted that, and that became my base input.
Creating an .ncx file in ePubEditor can be (and in this case was) a two-step process. Long before Adobe shoved the .ncx file down the throats of the IDPF, it was my contention that in the electronic world a "Table of Contents" is not a table at all, but rather a "List of Contents." Thus, in HTML a "Table of Contents" is most appropriately expressed as <ol> or <ul> elements, not as a <table>, and certainly not as a sequence of <p>s or <div>s. Interestingly, in ePub 3.0 the .ncx file has been deprecated in favor of HTML lists.
ePubEditor's "CreateNCX" function relies on the existence of a Table of Contents made up of one or more lists, optionally nested. This file may be created by hand, and in the case of small works it is trivially easy to do so. All that is necessary is to create a numbered (<ol>) or unnumbered (<ul>) list with anchor elements inside the list item (<li>) elements. Of course, the new TOC file must be added to the publication and the manifest list. ePubEditor uses the publication's <guide> section to discover the HTML TOC file, so if you want to take advantage of the "CreateNCX" function, you must add a reference to the file in the Guides section with a type of "toc". If the HTML TOC is well-formed, you should be able to create an .ncx file with two mouse clicks.
In the case of "The Art of the Book", there was no existing Table of Contents, but there was an image of a page with the heading "LIST OF ARTICLES" at the top (I have no idea why IA didn't OCR this page as well, but it's hard to figure out why IA does many of the things it does). Referring to the image I was able to create an HTML TOC in about 15 minutes, e.g.:
  <h3>LIST OF ARTICLES</h3>
  <ul>
    <li><a href="part0001.html#page-3">British Types for Printing Books.<br />
      by Bernard H. Newdigate</a></li>
    <li><a href="part0001.html#page-69">Fine Bookbinding in England.<br />
      by Douglas Cockerell</a></li>
    <li><a href="part0001.html#page-127">The Art of the Book in Germany.<br />
      by L. Deubner</a></li>
etc. From this file, ePubEditor was able to create a valid .ncx file.
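The list-to-.ncx step can be sketched in a few lines of Python. This is only an illustration of the idea, not ePubEditor's actual code: the names ListTocParser and nav_map are hypothetical, and the output is just the <navMap> portion of an .ncx file (a complete file also needs the <head> and <docTitle> elements around it).

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class ListTocParser(HTMLParser):
    """Collect (depth, href, label) for each <a> found inside nested lists."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <ul>/<ol> nesting level
        self.href = None    # href of the anchor being read, if any
        self.text = []      # text pieces of the anchor being read
        self.entries = []   # (depth, href, label) in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.href = dict(attrs).get("href")
            self.text = []
        elif tag == "br" and self.href is not None:
            self.text.append(" ")   # a line break inside a label becomes a space

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth -= 1
        elif tag == "a" and self.href is not None:
            label = " ".join("".join(self.text).split())
            self.entries.append((self.depth, self.href, label))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


def nav_map(entries):
    """Render the collected entries as a nested NCX <navMap> string."""
    out = ["<navMap>"]
    open_depths = []    # depths of navPoints that are still open
    for order, (depth, href, label) in enumerate(entries, start=1):
        # Close every open navPoint that is not an ancestor of this entry.
        while open_depths and open_depths[-1] >= depth:
            open_depths.pop()
            out.append("</navPoint>")
        out.append('<navPoint id="nav%d" playOrder="%d">' % (order, order))
        out.append("<navLabel><text>%s</text></navLabel>" % escape(label))
        out.append("<content src=%s/>" % quoteattr(href))
        open_depths.append(depth)
    out.extend("</navPoint>" for _ in open_depths)
    out.append("</navMap>")
    return "\n".join(out)
```

To use it, feed the HTML TOC file's text to a ListTocParser instance and pass its entries to nav_map(). Nested lists come out as nested <navPoint> elements, and playOrder follows the order of the anchors in the file.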
But what if your TOC is complex, and you don't have a simple image you can use?
ePubEditor also has a feature to scan the HTML contents of an ePub and create an HTML Table of Contents. This feature relies on the assumption that every element you would want to include in a Table of Contents is also a section header. So what the program does is traverse the contents of the publication in reading order and create a <li> for each header it encounters. Lists are nested according to the level of the headers, so <h3> headers may be nested inside an <h2> list item, and may include <h4> list items as a sub list. This generated TOC may not be perfect, so it is recommended to review the generated TOC in a text editor before attempting to generate an .ncx file from it.
A sample generated TOC file would be similar to:
  <ul>
    <li><a href="part1.html#part1">Part I</a>
      <ul>
        <li><a href="part1.html#chap1">Chapter One</a></li>
        <li><a href="part1.html#chap2">Chapter Two</a></li>
        <li><a href="part1.html#chap3">Chapter Three</a></li>
        ...
      </ul></li>
    <li><a href="part2.html#part2">Part II</a>
etc.
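For anyone curious how a header scan like this might work, here is a rough Python sketch. It is an assumption about the general technique, not ePubEditor's implementation: HeaderScanner and build_toc are made-up names, and the sketch takes the content files as (filename, html) pairs already in reading order and assumes each header carries an id attribute to link to.

```python
from html import escape
from html.parser import HTMLParser

HEADERS = ("h1", "h2", "h3", "h4", "h5", "h6")


class HeaderScanner(HTMLParser):
    """Collect (level, id, text) for every <h1>..<h6> element in a file."""

    def __init__(self):
        super().__init__()
        self.current = None   # [level, id, text pieces] while inside a header
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADERS:
            self.current = [int(tag[1]), dict(attrs).get("id"), []]

    def handle_endtag(self, tag):
        if tag in HEADERS and self.current is not None:
            level, ident, pieces = self.current
            text = " ".join("".join(pieces).split())
            self.found.append((level, ident, text))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current[2].append(data)


def build_toc(files):
    """Build a nested <ul> TOC from (filename, html) pairs in reading order."""
    out = []
    stack = []   # header levels of the lists that are currently open
    for name, html in files:
        scanner = HeaderScanner()
        scanner.feed(html)
        for level, ident, text in scanner.found:
            href = name + ("#" + ident if ident else "")
            # Leave any sub lists that are deeper than this header.
            while stack and stack[-1] > level:
                stack.pop()
                out.append("</li></ul>")
            if stack and stack[-1] == level:
                out.append("</li>")        # sibling of the previous item
            else:
                out.append("<ul>")         # first item of a new (sub) list
                stack.append(level)
            out.append('<li><a href="%s">%s</a>' % (escape(href, quote=True), escape(text)))
    out.extend("</li></ul>" for _ in stack)
    return "\n".join(out)
```

A header that is deeper than the previous one opens a sub list inside the still-open <li>, a header at the same level becomes a sibling, and a shallower header closes sub lists until the levels match. The result still deserves a look in a text editor, for the same reasons given above.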
In the case of the "Art of the Book", there were no headers marked. In fact, every block of text was marked as though it were a paragraph. (If you're going to call everything a paragraph, why add any markup at all? It makes no sense.) However, everywhere a new page scan started the autobuild process injected "<div class="newpage" id="page-nn"></div>" (where nn was the scan number). (How can you have a division of the text that has no content? I'm beginning to think the people at Internet Archive simply don't understand how to use HTML.) I used ePubEditor's "Replace" function to convert every one of these phony <div>s to <h3>, autogenerated an HTML TOC, then autogenerated a different .ncx from that. I ended up with a Navigation Control file that pointed to each page instead of actual divisions of the book, but when you start with the kind of crap that IA provides, there's not a lot you can do with automation.

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d