
On 10/30/2011 12:35 PM, Bowerbird@aol.com wrote [massive snippage]
you know who's gonna work on your app, lee? you. and only you, lee. you. and nobody else.
Yes, that is what I have always believed. My main purpose for creating an "open source" project is not to attract "adherents," but simply for purposes of transparency (and backups in "the cloud"). If someone wants to help out, and it's someone who shares my vision I'm happy for the help. If someone just wants to "steal" my ideas (for whatever they're worth) that's fine too. [much, much more snippage]
What /I/ want is the output from FineReader as though the "Save as HTML" option was selected, with all the markup that FineReader was able to intuit
if i get "all the markup that finereader was able to intuit", then i can do the job just as well as you can. maybe better.
Then we're both in luck. As I've been looking more carefully at the HTML output produced by the IA script, I'm discovering more and more useful information. When one uses FineReader, the post-recognition process brings up a side-by-side view of the image and the recognized text. The recognized text highlights words that FineReader is uncertain about or which do not appear in its dictionary. In the IA output, I'm discovering that that data has been preserved. I think with some effort, it would be possible to use this data to build a web interface substantially identical to the proofreading interface provided by FineReader. So Alex, all that talk a while back about how I wanted a "leaner, meaner" file? Forget about it. I think I like it just the way it is. I can select out what I need, and it has some potential. Why don't you talk IA into hiring me, so I can work on this full time? :-) [yet more snippage]
what do the professionals advise us amateurs to do?
they advise us to save the file as plain-ascii text, and then to apply the .html to that plain text, including the reapplication of styling (e.g., italics) which gets _lost_ when the file is saved in plain-ascii format...
An interesting assertion, although a bit thin on actual evidence. Apparently Liz Castro advocates using Adobe's InDesign (shudder) to generate the HTML to create ePubs, and Josh Tallent talks about using Microsoft Word's "Save as HTML" as the first step, and then cleaning up the resultant HTML (he goes on to point out that "HTML is a very simple language to learn"). I, personally, have started with "Just Bare" ASCII only one time, and gave it up before I was done because it was just too painful. Of course, I'm obviously not a professional, but I would /never/ advocate that an amateur start with Just Bare ASCII; what with macros and global search and replace, even cleaning up clumsy HTML is easier than adding it all back in by hand.
the application of good solid .html, though, is wise, so _that_ part of the advice i can thoroughly second...
[snipped assertion I happen to disagree with]
now, the truth is that those pros have "scripts" that apply the markup automatically. plus they _know_ .html already, well, so this comes naturally to 'em, even if they have to do some of the work manually
You make a good point here. 99.9% of the time when I'm creating an e-book I start with the HTML output from FineReader. But as has been pointed out elsewhere, FineReader produces SGML/HTML not the XML/HTML required by ePub. So the very first thing I have to do is convert the FineReader output to XHTML. (It's possible to use HTMLTidy to accomplish this, but I wrote my own program derived from the Tidy code base which not only does the conversion, but it does some other useful transformations as well). When I first started designing ePubEditor, I made the conscious decision /not/ to try and write or integrate Yet Another HTML Editor (or Yet Another CSS Editor, or Yet Another JPEG Editor, or Yet Another SVG Editor, or Yet Another NCX editor, ...). While not a participant in the great vi vs. emacs religious war, I am aware of the history; I wanted a tool that could incorporate virtually any user's preferred HTML editor and not force her to accept my preferences. Thus, ePubEditor has an editable set of preferences where a specific editor could be specified for each file's media-type. Your comment got me thinking that perhaps in addition to media-type-specific editors I should have user-configurable, media-type-specific /transformers/ as well. So I added this configuration preference, together with a "Transform" button on the Manifest pane; these additions are included in the most recent changes I uploaded to SourceForge tonight. The configuration dialog provides for a media-type, a transformation program, a program command line, a /transformed/ media-type, and a new extension. For example, in the case of FR output, I can set "text/html+vnd.abbyy" as the media-type, fr2html.exe as the program and "text/html" as the new media-type. Then, I add the FR output file to the Manifest and set the media-type to "text/html+vnd.abbyy". When that file is selected, and the "Transform" button is activated, my selected tool runs which transforms the file from FR format to XHTML. 
The media-type is then reset to "text/html" and I can either perform more transformations, or open it in the associated "text/html" editor (in my case, Microsoft's Visual Web Developer Express). As another example, if you had a corpus of works marked up with z.m.l., and had a script that would convert from z.m.l. markup to HTML markup, you could set that script as the transformer for files with the media-type of "text/plain+x-zml". You could add .zml files to the Manifest (setting the media-type appropriately, of course), and convert them to HTML using the transform function, resetting the media-type and renaming the file with a ".html" extension. You could then perform other transformations on those files, edit them with your choice of editor, or split them into more manageable segments using the "Split HTML" function. A simple "Save As..." and you would have an ePub file (although it would not be guaranteed to have valid content; for that, you would want to run the built-in "epubcheck" report). Thanks for the idea. [last of the snippage]
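For anyone who wants to picture the transformer preference described above, here is a minimal Python sketch of the idea. It is not ePubEditor's actual code: the class and function names, the "{input}" and "{output}" placeholders in the command line, and the arguments that fr2html.exe accepts are all assumptions made for illustration. Only the five configuration fields (media-type, program, command line, transformed media-type, new extension) come from the description.

```python
import shlex
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Transformer:
    media_type: str       # media-type this transformer accepts
    program: str          # executable to run
    command_line: str     # argument template; {input}/{output} are assumed placeholders
    new_media_type: str   # media-type of the transformed file
    new_extension: str    # extension given to the transformed file, e.g. ".html"


@dataclass(frozen=True)
class ManifestItem:
    path: Path
    media_type: str


def build_command(transformer, item):
    """Return (argv, output_path) for running the transformer on a manifest item."""
    output = item.path.with_suffix(transformer.new_extension)
    # Split the template first, then substitute, so paths with spaces stay one argument.
    args = [
        part.format(input=str(item.path), output=str(output))
        for part in shlex.split(transformer.command_line)
    ]
    return [transformer.program, *args], output


def transform(item, transformers, run=subprocess.run):
    """Run the transformer registered for the item's media-type.

    Returns a new ManifestItem with the media-type reset and the file renamed.
    `run` is injectable so the logic can be exercised without the real program.
    """
    transformer = transformers.get(item.media_type)
    if transformer is None:
        raise KeyError(f"no transformer configured for {item.media_type!r}")
    command, output = build_command(transformer, item)
    run(command, check=True)
    return ManifestItem(output, transformer.new_media_type)


if __name__ == "__main__":
    # Hypothetical configuration mirroring the FineReader example above.
    fr = Transformer("text/html+vnd.abbyy", "fr2html.exe",
                     "{input} {output}", "text/html", ".html")
    registry = {fr.media_type: fr}
    item = ManifestItem(Path("book.htm"), "text/html+vnd.abbyy")
    # Dry run: print the command instead of executing it.
    print(transform(item, registry, run=lambda cmd, check: print(cmd)))
```

Because the result is itself a manifest item with a new media-type, transformations can be chained: a z.m.l. script registered under "text/plain+x-zml" would hand back a "text/html" item, which could then go through any transformer registered for "text/html".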
so if you're one of the amateurs who are _struggling_ with the proper creation of these files, lee's program would be a _godsend_ to you, saving time and hassle.
Thank you. That's what I am attempting to accomplish.