
"Greg" == Greg Newby <gbnewby@pglaf.org> writes:
Greg> On Sun, Apr 18, 2010 at 05:05:09PM +0200, Carlo Traverso wrote:
>> Is PG ready to accept ePub as a submission format? (i.e. one
>> submits a valid ePub from which the other formats are derived)?
>> If so, one can target ePub; otherwise, at best one is forced to
>> submit HTML or txt that converts not-too-badly with current PG
>> tools, and this might be extremely challenging.
>>
>> Carlo

Greg> I don't think we're ready for this except in rare cases
Greg> where ePub is the best format for display for a particular
Greg> item (we just released a book where PDF was the best format,
Greg> believe it or not).

Greg> The challenge is that when books are fixed, someone
Greg> (typically the whitewasher, seldom the original submitter)
Greg> needs to regenerate all the files for that book.

Greg> Since there is not yet any standard processing stream to
Greg> generate static ePub files, this makes it hard for fixes (to
Greg> HTML & text) to be applied to ePubs.

Greg> I would, of course, love to see something become our
Greg> "standard" conversion tool, usable by anyone. Right now,
Greg> the closest for PG is Marcello's software to build the
Greg> cached ePub files. It's wonderful and functional, but is it
Greg> ready for all envisioned purposes? I think not, due at
Greg> least in part to shortcomings of the input HTML.

That's the whole point of my proposal. Starting from hand-crafted HTML we are likely to end up with a poor ePub, since the inference of metadata might be wrong, and many features of the HTML need to be tuned to ePub and might not come out correctly; obtaining reasonable HTML from an ePub, on the other hand, is just a matter of unzipping and discarding the metadata. Maybe it will be harder to have "nicely handcrafted" HTML, but we have to give the best available product in the standard format that most users are likely to use (and of course a reasonable product in every other format). To maintain an ePub (e.g. to correct typos) one has to unzip the ePub, correct the HTML, and re-zip.
Another issue is automating the creation of txt from HTML. Currently, the output of w3m -dump (or links -dump, lynx -dump, etc.) is pretty good for txt, except that font changes (mainly, underscores for italics) are lost. It shouldn't be difficult to pre-process the HTML so that italics are rendered with underscores, in such a way that one obtains a reasonable PG txt file. This is likely to work better on the HTML unzipped from an ePub (in which the HTML is more constrained) than on hand-crafted HTML. It might be a bit more challenging to downgrade from UTF-8 (as generated by -dump) to iso-8859-1 or to ASCII, for example to handle the Unicode characters that are used to draw tables, but this too could very well be automated.

For my part, this is an offer to work towards the production of a toolchain along these lines, if it is not discarded a priori.

Carlo
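Both pre-processing steps suggested above can be sketched quickly. The function names are mine, and the regex pass is only plausible for the constrained HTML unzipped from an ePub; hand-crafted HTML would want a real parser. The box-drawing replacement table is likewise a hand-picked assumption about which characters a -dump tool emits.

```python
# Sketch of the two pre/post-processing passes: (1) rewrite <i>/<em>
# spans as _underscored_ text before feeding the HTML to w3m -dump,
# so italics survive in the txt; (2) downgrade the UTF-8 dump to
# ASCII, mapping Unicode box-drawing characters (used for tables)
# and curly quotes to ASCII equivalents.
import re
import unicodedata


def mark_italics(html):
    """Replace <i>/<em> tags with underscores so a text dumper keeps them."""
    html = re.sub(r"<(?:i|em)\b[^>]*>", "_", html)
    html = re.sub(r"</(?:i|em)>", "_", html)
    return html


# Hand-picked mapping; extend as needed for other dump output.
_ASCII_MAP = str.maketrans({
    "\u2500": "-", "\u2502": "|",                  # box lines
    "\u250c": "+", "\u2510": "+", "\u2514": "+",   # box corners
    "\u2518": "+", "\u251c": "+", "\u2524": "+",   # box tees
    "\u252c": "+", "\u2534": "+", "\u253c": "+",
    "\u2018": "'", "\u2019": "'",                  # curly quotes
    "\u201c": '"', "\u201d": '"',
})


def downgrade_to_ascii(text):
    """Map table-drawing characters, then strip accents via NFKD.
    Characters with no ASCII fallback are silently dropped."""
    text = text.translate(_ASCII_MAP)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode()
```

A pipeline would then be roughly: mark_italics on the HTML, run it through w3m -dump, and apply downgrade_to_ascii (or an iso-8859-1 variant) to the result.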