Re: [gutvol-d] epubeditor.sourceforge.net
lee said:
I didn't want to clutter up the mailing list with files that probably only BowerBird is interested in, so I sent them to him back channel.
never got 'em. and yes, i looked in my spam folder, because that's where your posts go, automatically... (you know, that kill-filter thing.) but that's ok, i don't really need the files. seriously. i'd much rather have you provide people with a primer on them, as well as some easy-to-understand templates, since -- if you will remember -- you had _criticized_ my unwillingness to provide source-code that would create such files. if i could've shown that source-code without giving too much information to the people here who have made me their "enemy", i certainly would have. so that's why i did not. but if you would be willing to do some tutoring of them, so i knew that they were actually in possession of the same knowledge that i already have, then i _might_ be willing to part with my source-code... or, ya know, you can always give 'em _your_ source-code. which they could turn into perl or python. (yeah, right.)
You see, I cheated.
well, what you did wasn't really "cheating", but...
ePubEditor is not a tool that is used to take simple text files and convert them to HTML.
...that was pretty much what i wanted you to admit. because without .html, one cannot make an .epub. so your "epubeditor" cannot _create_ a new .epub -- at least not using a text-file as its input-file -- it can only _edit_ an existing .epub or .html, right? so it's not as useful or powerful as we might have reasonably expected, given the name of the app... and certainly not very useful to the d.p. people... i'm not saying it's useless. it still has a purpose... but turning plain-text into .html is "the hard part". after that, it's just generating some metadata files.
There are other programs out there which do that at least as well as I could
well, maybe you under-rate your own skills, lee, since many of those "other programs out there" get thumbs-down from most e-book-designers, since the .html they create is considered crappy. _you_ wouldn't turn out crappy .html, would you?
(not the least of which is FineReader itself).
well, first of all, not everyone owns finereader. more importantly, it is one of those programs that creates .html that e-book-designers hate. just so you know...
ePubEditor's "CreateNCX" function relies on the existence of a Table of Contents made up of one or more lists, optionally nested.
so if a person makes an .html table of contents, your program can create an .ncx version from it. that's better than nothing, to be sure, but it would be even better if it could create _both_. (and yes, i see where you later noted that it can create both, provided the person has marked up their .html file appropriately, but all that is just resetting the place where you put responsibility on the person, rather than having the tool do it.)
This file may be created by hand, and in the case of small works it is trivially easy to do so.
um, yeah, but isn't it the "trivially easy" tasks that we want our computers to be performing _for_ us? just sayin'...
I have no idea why IA didn't OCR this page as well
they did... but their quality-assessment routines said the results were too shabby to use, so they substituted the page-scan instead... not ideal, but understandable. they would sub in every page this way, if they could, but the e-book would be too bloated to do anyone any good.
Referring to the image I was able to create an HTML TOC in about 15 minutes
i really didn't mean for you to go to all of that work. i'm sorry now that i asked... it won't happen again... oh, and you kids out there, don't try this at home! if you want to _create_ and/or _edit_ an .epub file, your best bet is with "sigil" -- a free/free program that behaves quite like a regular word-processor, and then saves its files in .epub format. sigil does a bang-up job of creating your .opf and .ncx files _for_ you, based on the text that you have entered (or retrieved from an o.c.r. file), all _automatically_. (although it also lets you go in and edit manually.) there are other .epub creators out there nowadays, including "legend maker", but that one costs $40, so why not just go with the free/free "sigil" instead? and remember, if you want .pdf and .kindle as well, you'll be able to get that from my "jaguar" editor... it's _not_ open-source free, but _is_ cost-free free.
In the case of the "Art of the Book", there were no headers marked. In fact, every block of text was marked as though it were a paragraph.
welcome to the second-biggest book-scanning operation on the entire planet, lee. and yes, we're in deep doo-doo.
(If you're going to call everything a paragraph, why add any markup at all?
because you can't make an .epub without having markup. but nobody said that the markup needed to be any good! that was one of the things you "forgot" to put in the spec.
It makes no sense.
on the contrary, if you need to be able to "offer .epubs", it makes perfect sense. it's the only thing you _can_ do.
a division of the text that has no content? I'm beginning to think the people at Internet Archive simply don't understand how to use HTML.
surely you must be mistaken, lee. and _badly_ mistaken. the internet archive is one of the _leading_authorities_ on "books in browsers". indeed, they run a yearly conference with that very title. and -- as a matter of fact, since i have just mentioned their yearly conference -- i should tell you that this year's version is actually being held _very_ soon! here, let me google that for you... ok, yes, it's next week:
and lee, one of the "themes" of the event is "beautiful books". which introduces me to a post i was going to make next week, but will probably make today, now that i stole my own thunder. anyway, lee, good luck with the app. when you have the mods necessary so it can run in toto on a mac, please let me know... (i got it to execute ok, and could do some things, but not all.) -bowerbird
On Fri, October 21, 2011 11:35 am, Bowerbird@aol.com wrote: [snip]
but if you would be willing to do some tutoring of them, so i knew that they were actually in possession of the same knowledge that i already have, then i _might_ be willing to part with my source-code...
or, ya know, you can always give 'em _your_ source-code.
But that's exactly what I did! All of my code is in the repository at SourceForge; just go to the project summary page, select "Code," then "Browse CVS Repository." Everything is there, and I try to update it whenever the code is stable. I tried to use file names that are suggestive of their contents, even if someone else might find them confusing. You can even check differences between revisions, and the comments (if I remembered to add any). I'd be happy to make a source zip for you and put it on the file download page, but I probably won't pay a lot of attention to keeping it up to date.
lee said:
[snip]
ePubEditor is not a tool that is used to take simple text files and convert them to HTML.
...that was pretty much what i wanted you to admit.
because without .html, one cannot make an .epub.
I'm sorry, I thought I made that clear when I said:
ePubEditor allows a person to create a new electronic publication, then add HTML, JPEG, and other OEB "core" media type files to the ePub's manifest. Once "manifested," these files can be added to the reading order of the publication (the content, or "spine") where the content can be reordered. Metadata such as author identities and subject matter identifiers can be added, and the compiled document can be saved as an ePub.
so your "epubeditor" cannot _create_ a new .epub -- at least not using a text-file as its input-file -- it can only _edit_ an existing .epub or .html, right?
No, ePubEditor can create new .epubs. It cannot use a text file as its input file, because it does not use "input files;" at least not in the sense that a command-line program uses an input file.
When you start the program some of the menu items are disabled, because no publication has been selected. If you select File->New, ePubEditor will start a new publication in the directory/folder you specify. At this point, the current tab will be "Authors and Contributors," with some example data in the table. This needs to be changed and augmented according to the creators of the book you will be generating.
If you select the "Manifest" tab it will show a list of all the files that will be included in the publication. You should see three files, ebook.css, toc.html, and cover.jpg. All three of these files should be marked with a red "stop" symbol indicating that while they are listed in the manifest, they do not exist on the file system. You can remove them from the manifest, or better yet you can create them. Once the files exist (and the window is refreshed) the red "stop" will turn to a green "check."
On this same tab there is an "Add" button. When you press this button, a file system browser will appear and allow you to add any file to the package. You can, if you wish, add a text file to the manifest. When you save the publication as an ePub, the text file will be included in the ePub. It's possible that there's even a User Agent out there that will attempt to display it, if it is also included in the content (all manifested files are included in the ePub, only content files are displayed by a User Agent as part of the regular linear presentation). You can even lie to the program about the nature of the file by changing the "media-type" attribute to something other than text. The publication obviously won't pass epubcheck, but it would be interesting to see what something like the Nook would do with it.
Once your publication is internally consistent you may save it as an ePub.
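A minimal sketch of the manifest check described above, in Python rather than ePubEditor's own Java. The function names are hypothetical, and the definition of "internally consistent" used here (every manifested file exists, and every content entry is also manifested) is an assumption, not the program's actual rule.

```python
from pathlib import Path
import tempfile

def manifest_status(package_dir, manifest):
    """Map each manifested href to True (green "check") or False (red "stop")."""
    root = Path(package_dir)
    return {href: (root / href).is_file() for href in manifest}

def is_consistent(package_dir, manifest, spine):
    """Assumed rule: all manifested files exist and the spine only
    references files that are listed in the manifest."""
    status = manifest_status(package_dir, manifest)
    return all(status.values()) and all(href in status for href in spine)

# Usage: a new publication lists three files that do not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    manifest = ["ebook.css", "toc.html", "cover.jpg"]
    before = manifest_status(tmp, manifest)      # nothing on disk yet
    for name in manifest:
        (Path(tmp) / name).write_bytes(b"")      # create empty placeholders
    after = manifest_status(tmp, manifest)
    ok = is_consistent(tmp, manifest, ["toc.html"])
```

The distinction between `manifest` and `spine` mirrors the one drawn above: everything manifested is packaged, but only spine entries are part of the linear reading order.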
so it's not as useful or powerful as we might have reasonably expected, given the name of the app...
and certainly not very useful to the d.p. people...
Not my target audience. In fact, anyone trying to turn degraded text into ePubs is not my target audience. I have two target audiences: people who have created an e-book in HTML format, and now want a quick and easy way to turn it into an ePub, and people who have ePubs that do not display well on their devices because of "over-styling" and who want to clean up those ePubs for their own purposes.
This is not to say that the program is without editing capabilities. One thing I have noticed about editing tools is that people tend to be very...committed...to their editors of choice. Furthermore, there are plenty of editors out there and to create another would be re-inventing a wheel. So what I did was to build into ePubEditor the ability to launch an external editor, passing the full file path of a manifested file as a command line parameter. One of the buttons at the bottom of the Manifest tab is "Edit." When this button is pressed, ePubEditor attempts to launch the editor associated with the declared media-type. Media-type associations are stored in the properties file associated with the program, ePubEditor.ini. By default, the only editor referenced is "C:\Program Files (x86)\Windows NT\Accessories\wordpad.exe". Obviously, that's not going to work well on your Mac, so you would need to edit the .ini file to reflect your preferences. (An interface to do this messy work is high on the priority list, but hey, this is a work in progress.)
I note that when you select a file that doesn't exist, the Edit button is still enabled. That's obviously a bug that's going to have to be fixed.
On my installation, text and text/css point to TextPad 5, image/jpeg points to IrfanView, and application/xhtml+xml, text/html, and text/x-oeb1-document all point to Microsoft's Visual Web Developer Express (that's the free one). Everyone gets to pick the editor they're most comfortable using. I /have/ added a couple of extra editing functions specifically for HTML files.
There is a replace function that alters elements without affecting their children, so that "bad" HTML like <p align="center"> can be changed to <div style="text-align:center">, and an insert function so that important elements like <br style="page-break-before:always"> can be inserted before every <h3 class="chapter">. (A lot could be done to refine these functions). I also can launch a "clean" process that uses XSL to fix a lot of common errors. For commercial ePubs, it is also possible for a user (or group of users) to create an XSL script to fix common errors on a per-publisher basis. I would appreciate it if someone could point me to a relatively complex HTML file on Project Gutenberg so I can start developing an XSL script for them as well. It might even be possible to construct an XSL script to clean up that mess that IA produces. (But Alex, I'd still rather have that than the degraded text that is otherwise available :-)).
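The replace and insert functions described above can be sketched as follows. This is an illustration in Python, not ePubEditor's code: the function names are hypothetical, and it assumes well-formed, namespace-free markup (real XHTML carries a namespace, so tags would need to be matched as `{http://www.w3.org/1999/xhtml}p` and so on).

```python
import xml.etree.ElementTree as ET

def replace_centered_paragraphs(root):
    """Turn <p align="center"> into <div style="text-align:center">,
    leaving the element's text and children untouched."""
    for el in list(root.iter("p")):
        if el.get("align") == "center":
            el.tag = "div"
            del el.attrib["align"]
            el.set("style", "text-align:center")

def insert_page_breaks(root):
    """Insert <br style="page-break-before:always"/> before every
    <h3 class="chapter">."""
    for parent in list(root.iter()):
        for child in list(parent):           # snapshot, since we insert below
            if child.tag == "h3" and child.get("class") == "chapter":
                br = ET.Element("br", {"style": "page-break-before:always"})
                parent.insert(list(parent).index(child), br)

# Usage
doc = ET.fromstring(
    '<body><p align="center">Title <i>one</i></p>'
    '<h3 class="chapter">I</h3><p>text</p>'
    '<h3 class="chapter">II</h3></body>'
)
replace_centered_paragraphs(doc)
insert_page_breaks(doc)
```

Renaming the element in place, rather than building a new one, is what keeps the children unaffected.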
i'm not saying it's useless. it still has a purpose... but turning plain-text into .html is "the hard part". after that, it's just generating some metadata files.
I think that turning impoverished text into HTML is harder even than turning PDF into HTML, because you simply have no reference to turn to when you have a question about markup. Turning "good" HTML into ePub is straightforward, but tedious and error-prone; in addition to being a central point for editing and constructing ePubs, one purpose of my tool is to relieve the tedium and avoid simple errors. [snip]
_you_ wouldn't turn out crappy .html, would you?
Well /I/ wouldn't, because everyone knows /my/ interpretation of HTML is the best and most accurate in the world ;-).
(not the least of which is FineReader itself).
well, first of all, not everyone owns finereader.
more importantly, it is one of those programs that creates .html that e-book-designers hate.
For good reason. In the first place, FineReader doesn't produce XHTML at all, it produces SGML/HTML according to version 3.2, which is not allowed in ePubs. So the first thing you have to do is run it through Tidy or some other "Tag Soup" parser just to get it in the right format. It also has this notion that the point size of the font is somehow important, so <font> tags are scattered everywhere in the output (or <span class="abbyy2"> or whatever, which is just as bad). And there's no indication of which "paragraph"s are really chapter titles and which are really paragraphs (although I have had some luck by counting which fonts are "normal" and converting "paragraphs" with fonts that are /bigger/ than normal to headers.) But it does offer a pretty good starting point, and I /really/ like the fact that it preserves line-ending hyphens with an indication of whether it thinks they should be hard hyphens or soft hyphens. I do have some code I wrote to clean and transform ABBYY output--maybe I'll enter an enhancement request to integrate that code with ePubEditor...
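The font-counting heuristic mentioned above might look something like this sketch. It is not the ABBYY clean-up code referred to in the message; the function name, the `(font_size, text)` input shape, the choice of `h2`, and weighting each size by the amount of text set in it are all assumptions made for illustration.

```python
from collections import Counter

def promote_large_fonts(blocks):
    """blocks: list of (font_size, text) pairs, one per "paragraph".
    The size carrying the most text is treated as "normal"; any block
    set in a larger size is relabelled as a header."""
    if not blocks:
        return []
    weight = Counter()
    for size, text in blocks:
        weight[size] += len(text)
    normal = weight.most_common(1)[0][0]
    return [("h2" if size > normal else "p", text) for size, text in blocks]

# Usage
blocks = [
    (18, "CHAPTER I"),
    (11, "It was a dark and stormy night; the rain fell in torrents."),
    (11, "Second paragraph of body text here."),
    (9, "a footnote"),
]
tagged = promote_large_fonts(blocks)
```

Smaller-than-normal text (footnotes, captions) is deliberately left as a paragraph, since only /bigger/ fonts are treated as a header signal.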
just so you know...
Just so you know, I accept no responsibility for the HTML that goes into ePubs created by my tool. If someone throws some HTML at it, doesn't clean it, and doesn't run epubcheck (integrated), s/he'll get a crappy ePub, just like those produced by Penguin and other commercial publishers.
ePubEditor's "CreateNCX" function relies on the existence of a Table of Contents made up of one or more lists, optionally nested.
so if a person makes an .html table of contents, your program can create an .ncx version from it.
that's better than nothing, to be sure, but it would be even better if it could create _both_.
(and yes, i see where you later noted that it can create both, provided the person has marked up their .html file appropriately, but all that is just resetting the place where you put responsibility on the person, rather than having the tool do it.)
Well, you kind of have a chicken and egg problem here. No, really just an egg problem. To create a TOC you have to have some indication of where TOC items exist in the files. And as I hope is by now well established, every e-book will require human intervention at some point in its development. Even z.m.l. requires that a certain number of blank lines be inserted before a header section. This TOC detection is definitely subject to improved heuristics, and suggestions would be welcomed.
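As a rough illustration of deriving an NCX from a nested-list TOC, here is a sketch in Python. It is not the "CreateNCX" implementation: `build_nav_map` is a hypothetical name, the input is assumed to be a well-formed, namespace-free list whose items each contain a direct `<a href>` child, and the NCX namespace and the surrounding `<ncx>`/`<head>` elements are omitted.

```python
import xml.etree.ElementTree as ET

def build_nav_map(list_el):
    """Build a <navMap> whose navPoint nesting mirrors the <ul>/<ol> nesting."""
    nav_map = ET.Element("navMap")
    counter = [0]                      # running playOrder, in document order

    def add(parent, lst):
        for li in lst.findall("li"):
            target = parent
            a = li.find("a")           # direct child link only (assumption)
            if a is not None:
                counter[0] += 1
                n = str(counter[0])
                point = ET.SubElement(parent, "navPoint",
                                      {"id": "np" + n, "playOrder": n})
                label = ET.SubElement(point, "navLabel")
                ET.SubElement(label, "text").text = "".join(a.itertext()).strip()
                ET.SubElement(point, "content", {"src": a.get("href")})
                target = point
            for sub in li:
                if sub.tag in ("ul", "ol"):
                    add(target, sub)   # nested list becomes nested navPoints

    add(nav_map, list_el)
    return nav_map

# Usage
toc = ET.fromstring(
    '<ul><li><a href="ch1.html">Chapter 1</a>'
    '<ol><li><a href="ch1.html#s1">Section 1</a></li></ol></li>'
    '<li><a href="ch2.html">Chapter 2</a></li></ul>'
)
nav = build_nav_map(toc)
```

The human still has to supply the list; the sketch only covers the mechanical translation from that list to navPoints.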
This file may be created by hand, and in the case of small works it is trivially easy to do so.
um, yeah, but isn't it the "trivially easy" tasks that we want our computers to be performing _for_ us?
just sayin'...
No, I don't think so. First you have to understand that there are tasks that are trivially easy /for a human bean/ that are extraordinarily complex for a computer. And there are tasks that are enormously complex for a "human bean" (primarily because they are so detailed) that are trivially easy for a computer. What I think we really want is for our computers to do /every/ task for us that they are capable of doing, leaving us to do the "human" things for which we are so admirably suited. "God grant me the power to do the things a computer can't do, the ability to delegate to a computer the things it can do, and the wisdom to know the difference."
I have no idea why IA didn't OCR this page as well
they did... but their quality-assessment routines said the results were too shabby to use, so they substituted the page-scan instead... not ideal, but understandable. they would sub in every page this way, if they could, but the e-book would be too bloated to do anyone any good.
Actually, I think that's what their FlipBooks actually are: just pictures, with the OCR text in the background pointing to spots on the picture in case you want to do a search.
Referring to the image I was able to create an HTML TOC in about 15 minutes
i really didn't mean for you to go to all of that work. i'm sorry now that i asked... it won't happen again...
oh, and you kids out there, don't try this at home!
if you want to _create_ and/or _edit_ an .epub file, your best bet is with "sigil" -- a free/free program that behaves quite like a regular word-processor, and then saves its files in .epub format. sigil does a bang-up job of creating your .opf and .ncx files _for_ you, based on the text that you have entered (or retrieved from an o.c.r. file), all _automatically_. (although it also lets you go in and edit manually.)
Well, I for one am not happy with Sigil, and many I have talked to are not happy either, which is why I have created an alternative. One of the problems I have with Sigil is that it produces files that are absodamnlutely compliant with the ePub spec, and work great on the most ePub-compliant User Agents on the market, but fail on less capable software. Extracting just the HTML for use in browsers and User Agents like µBook yields files that are less than usable. I would be kind of interested in knowing what would happen if someone used Sigil to create a highly styled ePub, and then converted that to Kindle. But Sigil is a useful tool, as is Calibre. I say, let a thousand flowers bloom.
there are other .epub creators out there nowadays, including "legend maker", but that one costs $40, so why not just go with the free/free "sigil" instead?
The absolute worst is Adobe's InDesign now that it can create ePubs natively. Indeed, I'm hoping a use for ePubEditor might be to clean up after tools like that. [snip]
(If you're going to call everything a paragraph, why add any markup at all?
because you can't make an .epub without having markup.
but nobody said that the markup needed to be any good!
that was one of the things you "forgot" to put in the spec.
Good point. epubcheck has come under a fair amount of criticism for checking files against schemas, but not checking whether the use of those schemas is appropriate. As I have integrated epubcheck, maybe I can throw some extra checks in there to highlight areas that might not be quite right: like TOC anchors that point to paragraphs instead of headers, extremely short paragraphs that use a large font, a series of short paragraphs without punctuation which probably indicates a list... I think some heuristics for checking an HTML file are something I'd better add to the "Feature Request" list. Good idea, thanks.
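One of the checks floated above, a series of short unpunctuated paragraphs that probably indicates a list, could be sketched like this. Nothing here comes from epubcheck or ePubEditor: the function name and the thresholds (`max_words`, `min_run`) are assumptions chosen only to make the idea concrete.

```python
def probable_lists(paragraphs, max_words=8, min_run=3):
    """Return (start, end) index pairs for runs of at least min_run
    short paragraphs that do not end in sentence punctuation."""
    def suspicious(text):
        text = text.strip()
        return (bool(text)
                and len(text.split()) <= max_words
                and text[-1] not in ".!?:;\"'")

    runs, start = [], None
    # An empty sentinel paragraph closes a run that reaches the end.
    for i, text in enumerate(list(paragraphs) + [""]):
        if suspicious(text):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    return runs

# Usage
paras = [
    "Ingredients for the paste.",
    "two cups flour",
    "one cup water",
    "a pinch of salt",
    "Mix them well and let the paste stand overnight before use.",
]
flagged = probable_lists(paras)
```

A check like this can only flag candidates for a human to review; it cannot decide on its own that the markup is wrong.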
It makes no sense.
on the contrary, if you need to be able to "offer .epubs", it makes perfect sense. it's the only thing you _can_ do.
a division of the text that has no content? I'm beginning to think the people at Internet Archive simply don't understand how to use HTML.
surely you must be mistaken, lee. and _badly_ mistaken.
the internet archive is one of the _leading_authorities_ on "books in browsers". indeed, they run a yearly conference with that very title. and -- as a matter of fact, since i have just mentioned their yearly conference -- i should tell you that this year's version is actually being held _very_ soon!
here, let me google that for you... ok, yes, it's next week:
and lee, one of the "themes" of the event is "beautiful books".
Hopefully, you have your tongue planted firmly in your cheek.
which introduces me to a post i was going to make next week, but will probably make today, now that i stole my own thunder.
anyway, lee, good luck with the app. when you have the mods necessary so it can run in toto on a mac, please let me know... (i got it to execute ok, and could do some things, but not all.)
I have no Mac, no access to a Mac, and little interest in the Mac. The promise of Java was "write once run everywhere." Well, I wrote it once, now we'll just have to hope that some Mac developer out there can troubleshoot the problem (and tell me what the solution is once s/he figures it out).
On 10/21/2011 6:47 PM, Lee Passey wrote:
But that's exactly what I did! All of my code is in the repository at SourceForge; just go to the project summary page, select "Code," then "Browse CVS Repository." Everything is there, and I try to update it whenever the code is stable. I tried to use file names that are suggestive of their contents, even if someone else might find them confusing. You can even check differences between revisions, and the comments (if I remembered to add any). I'd be happy to make a source zip for you and put it on the file download page, but I probably won't pay a lot of attention to keeping it up to date.
Oops, message from SourceForge: 2011-10-21: CVS issues October 21st, 2011 Greetings, We are experiencing a CVS outage affecting projects starting with the letters a, b, c, e, g, h, i, l, m, o, r, s, w, and z. Some repos are read-only, while others are offline completely. The team is working on a resolution; updates to follow. Best Regards, Chris Tsai, SourceForge.net Support
On 10/21/2011 6:50 PM, Lee Passey wrote:
Oops, message from SourceForge:
2011-10-21: CVS issues October 21st, 2011
Greetings,
We are experiencing a CVS outage affecting projects starting with the letters a, b, c, e, g, h, i, l, m, o, r, s, w, and z. Some repos are read-only, while others are offline completely. The team is working on a resolution; updates to follow.
It's back as of 8:30 Pacific time.
participants (2): Bowerbird@aol.com, Lee Passey