Re: [gutvol-d] epubeditor.sourceforge.net

21 Oct 2011

      Lee,

There's a very alpha script we're testing you can try out. (No
guarantee, may burn the house down, etc)

http://www-kenh.archive.org/download/artofbook00holm/artofbook00holm_abbyy.h...
--
Alex Buie
Network Coordinator / Server Engineer
KWD Services, Inc
Media and Hosting Solutions
+1(703)445-3391
+1(480)253-9640
+1(703)919-8090
abuie@kwdservices.com

On Fri, Oct 21, 2011 at 9:50 AM, Lee Passey <lee@novomail.net> wrote:
...
On Wed, October 19, 2011 11:36 pm, Bowerbird@aol.com wrote:
...
lee said:
...
   Where do you want me to upload them?
anywhere is fine.   you could even copy it into a post here.
I didn't want to clutter up the mailing list with files that probably only
BowerBird is interested in, so I sent them to him back channel. But I
thought maybe I ought to share some information here that some /might/ have
a passing interest in. At any rate, this will become a portion of my
program's Frequently Asked Questions.
You see, I cheated.
ePubEditor is not a tool that is used to take simple text files and convert
them to HTML. There are other programs out there which do that at least as
well as I could (not the least of which is FineReader itself). BowerBird's
own perl scripts are probably a good alternative. If you want to use
ePubEditor to build ePubs, you have to start with HTML.
Now the web site that BB pointed me to,
http://www.archive.org/details/artofbook00holm, doesn't contain any HTML at
all. (Alex -- if nothing else, could you convince the Powers That Be at
archive.org to at least configure ABBYY to spit out simple HTML as well as
bare text? I know it can be done. It's not great, but at least it's a
starting point). The site /did/ however contain an ePub version, which as we
all know is just zipped up HTML, so I downloaded and extracted that, and
that became my base input.
Creating an .ncx file in ePubEditor can be (and in this case was) a two-step
process. Long before Adobe shoved the .ncx file down the throats of the
IDPF, it was my contention that in the electronic world a "Table of
Contents" is not a table at all, but rather a "List of Contents." Thus, in
HTML a "Table of Contents" is most appropriately expressed as <ol> or <ul>
elements, not as a <table>, and certainly not as a sequence of <p>s or
<div>s. Interestingly, in ePub 3.0 the .ncx file has been deprecated in
favor of HTML lists.
ePubEditor's "CreateNCX" function relies on the existence of a Table of
Contents made up of one or more lists, optionally nested. This file may be
created by hand, and in the case of small works it is trivially easy to do
so. All that is necessary is to create a numbered (<ol>) or unnumbered
(<ul>) list with anchor elements inside the list item (<li>) elements. Of
course, the new TOC file must be added to the publication and the manifest
list. ePubEditor uses the publication's <guide> section to discover the HTML
TOC file, so if you want to take advantage of the "CreateNCX" function, you
must add a reference to the file in the Guides section with a type of "toc."
If the HTML TOC is well-formed, you should be able to create an .ncx file
with two mouse clicks.
In the case of "The Art of the Book", there was no existing Table of
Contents, but there was an image of a page with the heading "LIST OF ARTICLES"
at the top (I have no idea why IA didn't OCR this page as well, but it's
hard to figure out why IA does many of the things it does). Referring to the
image I was able to create an HTML TOC in about 15 minutes, e.g.:
<h3>LIST OF ARTICLES</h3>
<ul>
 <li><a href="part0001.html#page-3">British Types for Printing Books.<br />
   by Bernard H. Newdigate</a></li>
 <li><a href="part0001.html#page-69">Fine Bookbinding in England.<br />
   by Douglas Cockerell</a></li>
 <li><a href="part0001.html#page-127">The Art of the Book in Germany.<br />
   by L. Deubner</a></li>
etc. From this file, ePubEditor was able to create a valid .ncx file.
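
For readers who want to see the list-to-.ncx idea in concrete terms, here is a rough Python sketch. It is not ePubEditor's code; the names TocListParser and build_nav_map are made up for illustration, and it only builds the <navMap> portion of an .ncx (no <head>, <docTitle>, or namespace declaration). It assumes the TOC is well-formed nested <ol>/<ul> lists with one <a href="..."> per entry, as described above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TocListParser(HTMLParser):
    """Collect (depth, href, label) for each <a> found inside <ol>/<ul> lists."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # current list nesting depth
        self.entries = []    # (depth, href, label) in document order
        self._href = None    # href of the anchor being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self._href = dict(attrs).get("href")
            self._text = []
        elif tag == "br" and self._href is not None:
            self._text.append(" ")   # treat a line break as a space

    def handle_endtag(self, tag):
        if tag in ("ol", "ul"):
            self.depth -= 1
        elif tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split())
            self.entries.append((self.depth, self._href, label))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def build_nav_map(toc_html):
    """Hypothetical helper: turn a nested-list HTML TOC into a <navMap> element."""
    parser = TocListParser()
    parser.feed(toc_html)
    nav_map = ET.Element("navMap")
    parents = {0: nav_map}   # depth -> most recent element at that depth
    for order, (depth, href, label) in enumerate(parser.entries, start=1):
        parent = parents.get(depth - 1, nav_map)
        point = ET.SubElement(parent, "navPoint",
                              id=f"nav-{order}", playOrder=str(order))
        nav_label = ET.SubElement(point, "navLabel")
        ET.SubElement(nav_label, "text").text = label
        ET.SubElement(point, "content", src=href)
        parents[depth] = point
    return nav_map
```

Calling ET.tostring(build_nav_map(toc_html), encoding="unicode") gives the XML text. Nested lists become nested <navPoint> elements, and playOrder follows reading order.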
But what if your TOC is complex, and you don't have a simple image you can
use?
ePubEditor also has a feature to scan the HTML contents of an ePub and
create an HTML Table of Contents. This feature relies on the assumption that
every element you would want to include in a Table of Contents is also a
section header. So what the program does is traverse the contents of the
publication in reading order, and creates a <li> for each header it
encounters. Lists are nested according to the level of the headers, so <h3>
headers may be nested inside an <h2> list item, and may include <h4> list
items as a sub list. This generated TOC may not be perfect, so it is
recommended to review the generated TOC in a text editor before attempting
to generate an .ncx file from it.
A sample generated TOC file would be similar to:
<ul>
 <li><a href="part1.html#part1">Part I</a>
   <ul>
     <li><a href="part1.html#chap1">Chapter One</a></li>
     <li><a href="part1.html#chap2">Chapter Two</a></li>
     <li><a href="part1.html#chap3">Chapter Three</a></li>
     ...
   </ul></li>
 <li><a href="part2.html#part2">Part II</a>
etc.
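
The header scan described above can be sketched roughly as follows. Again this is an illustration, not ePubEditor's implementation: HeaderCollector and build_toc are hypothetical names, the input is assumed to be a list of (filename, html) pairs already in reading order, and a header without an id attribute is simply linked by filename.

```python
from html import escape
from html.parser import HTMLParser

HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderCollector(HTMLParser):
    """Collect (level, href, title) for every <h1>..<h6> in one file."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename
        self.headers = []
        self._current = None   # (level, href) of the header being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADERS:
            anchor = dict(attrs).get("id")
            href = f"{self.filename}#{anchor}" if anchor else self.filename
            self._current = (int(tag[1]), href)
            self._text = []

    def handle_endtag(self, tag):
        if tag in HEADERS and self._current is not None:
            level, href = self._current
            title = " ".join("".join(self._text).split())
            self.headers.append((level, href, title))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)


def build_toc(files):
    """Hypothetical helper: nested <ul> TOC from (filename, html) pairs."""
    headers = []
    for filename, html in files:
        collector = HeaderCollector(filename)
        collector.feed(html)
        headers.extend(collector.headers)

    lines = []
    stack = []   # header levels of the lists that are currently open
    for level, href, title in headers:
        while stack and level < stack[-1]:   # back out of deeper lists
            lines.append("</li></ul>")
            stack.pop()
        if stack and level == stack[-1]:     # sibling of the previous item
            lines.append("</li>")
        else:                                # first item, or a deeper level
            lines.append("<ul>")
            stack.append(level)
        lines.append(f'<li><a href="{escape(href)}">{escape(title)}</a>')
    lines.extend("</li></ul>" for _ in stack)
    return "\n".join(lines)
```

The stack of open levels is what produces the nesting: a deeper header opens a sub list inside the current item, and a shallower one closes lists until it finds its own level.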
In the case of the "Art of the Book", there were no headers marked. In
fact, every block of text was marked as though it were a paragraph. (If
you're going to call everything a paragraph, why add any markup at all? It
makes no sense.) However, everywhere a new page scan started the autobuild
process injected "<div class="newpage" id="page-nn"></div>" (where nn was
the scan number). (How can you have a division of the text that has no
content? I'm beginning to think the people at Internet Archive simply don't
understand how to use HTML.) I used ePubEditor's "Replace" function to
convert every one of these phony <div>s to <h3>, autogenerated an HTML TOC,
then autogenerated a different .ncx from that. I ended up with a Navigation
Control file that pointed to each page instead of actual divisions of the
book, but when you start with the kind of crap that IA provides, there's not
a lot you can do with automation.
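
The replace step can be approximated outside ePubEditor with a regular expression. This is only a sketch of the idea, not the "Replace" function itself: promote_page_markers is a hypothetical name, and the "Page nn" heading text is an assumption, added so that each generated TOC entry has a label.

```python
import re

# Matches the empty page-marker divs described above, capturing the scan number.
PAGE_DIV = re.compile(r'<div class="newpage" id="page-(\d+)"></div>')


def promote_page_markers(html):
    """Hypothetical helper: turn each empty page <div> into an <h3> header."""
    # The visible "Page nn" text is an assumption; the original divs were empty.
    return PAGE_DIV.sub(r'<h3 class="newpage" id="page-\1">Page \1</h3>', html)
```

The id values are kept unchanged, so links such as part0001.html#page-3 still resolve after the substitution.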