
The question of producing epub-friendly HTML came to the WWers from a DP submitter the other day. They were referred to this article: http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB#How_to_Author_HTML_That_Converts_Gracefully_to_EPUB

Al

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of hmonroe1@verizon.net
Sent: Sunday, January 15, 2012 9:37 PM
To: gutvol-d@lists.pglaf.org
Subject: [gutvol-d] Producing epub ready HTML

I would appreciate some guidance on what adjustments would need to be made to the HTML produced by Guiguts to make it more amenable to conversion to epub format by epubmaker or otherwise. That is, what works well in terms of how the table of contents is presented, use of CSS versus inline styles, and page numbering? I have seen some discussion on this list but did not manage to find it in the list archives. Also, is there an automated tool that could warn of features that will not convert to epub well?

As context, I am the lead developer for Guiguts, although I will be putting in less time than I have been. The latest version of Guiguts (1.0.3) is much less buggy and runs out of the box on Windows and Mac. The Windows version has bundled with it epubmaker, the Gnutenberg Press, and the OpenJade SGML parser (onsgmls.exe) to validate HTML and PGTEI. It would be helpful to get pointers to some examples of files that convert well and those that do not. Incidentally, I need some assistance in identifying exactly which DTD and other files are needed for onsgmls.exe to validate XHTML 1.1. Perl developers are welcome to help.

Hunter

Hunter,

My biggest peeve with Guiguts HTML is the use of <span> tags for indents. These may or may not work in EPUB readers, but when you run the EPUBs through Kindlegen they don't work at all in the resulting MOBI files. If I were writing the code, I would make it so that an indent of 4 spaces translates to a <blockquote> section. For poetry (any text where the indenting is uneven), ideally the least-indented text would be represented as a blockquote section and further indents would be accomplished using multiple non-breaking spaces (&nbsp;). Not the easiest thing to code, I know. This is the only way to format poetry that works with the Kindle.

Any line with no indent but with multiple spaces before and two spaces after could be an <h2>. Single underscores on a line would be converted to <i> and </i>, for odd and even occurrences. That would not be correct every time, but it would help. Right now I'm using one underscore to represent <i> and two for </i>, and while that works, gutcheck reports it as an error.

For the Kindle you might make a paragraph style that has a top margin of 1em. Or you might choose not to use <p> tags at all, but use <div> tags throughout. The Kindle by default indents the first line of each paragraph and puts no space between paragraphs. If you want to preserve the look of the text file, then <div> with a top margin of 1em would work. The Kindle rounds things up to the next em, so you don't want to use fractional ems.

You might give us a way to specify dropcaps. For the Kindle and Nook this works: <span style="font-size:3em">F</span>irst letter. Kindlegen requires everything to have no more than one class; second and subsequent classes are ignored.

You might give us a way to indicate that we want text surrounded by <pre> tags. This would be helpful for family trees and other ASCII art, code listings, etc.

Bowerbird has been suggesting using light markup in text files so we can derive other formats automatically and reliably.
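The blockquote-plus-non-breaking-spaces scheme suggested above might look something like this (a sketch of the idea only; the exact spacing counts are illustrative):

```html
<!-- Sketch of the blockquote + &nbsp; scheme for uneven poetry indents.
     The least-indented lines get the blockquote's own indent; deeper
     indents use runs of &nbsp; rather than spans or CSS margins, so
     the indentation survives Kindlegen's conversion to MOBI. -->
<blockquote>
  A poem line at the base indent level<br/>
  &nbsp;&nbsp;&nbsp;&nbsp;a line indented one level deeper<br/>
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a line two levels deeper<br/>
  and back to the base indent
</blockquote>
```

The design point is that both <blockquote> and &nbsp; are among the few indentation mechanisms old MOBI renderers honor, whereas <span>-based indents are silently dropped.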
He will doubtless have something to say on the subject.

I've been using Guiguts for a while, including the HTML generating feature, but I agree it needs improving. A file you could use for testing is this one: http://www.gutenberg.org/ebooks/38174 It has lots of interestingly formatted poetry.

Thanks,

James Simmons

On Sun, Jan 15, 2012 at 11:37 PM, <hmonroe1@verizon.net> wrote:
I would appreciate some guidance on what adjustments would need to be made to the HTML produced by Guiguts to make it more amenable to conversion to epub format by epubmaker or otherwise. That is, what works well in terms of how the table of contents is presented, use of CSS versus inline styles, and page numbering? I have seen some discussion on this list but did not manage to find it in the list archives. Also, is there an automated tool that could warn of features that will not convert to epub well?
As context, I am the lead developer for Guiguts, although I will be putting in less time than I have been. The latest version of Guiguts (1.0.3) is much less buggy and runs out of the box on Windows and Mac. The Windows version has bundled with it epubmaker, the Gnutenberg Press, and the OpenJade SGML parser (onsgmls.exe) to validate HTML and PGTEI. It would be helpful to get pointers to some examples of files that convert well and those that do not. Incidentally, I need some assistance in identifying exactly which DTD and other files are needed for onsgmls.exe to validate XHTML 1.1. Perl developers are welcome to help.
Hunter
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

hmonroe> I would appreciate some guidance on what adjustments would need to be made to the HTML produced by Guiguts to make it more amenable to conversion to epub format by epubmaker or otherwise.

With apologies, it's been a couple of years since I looked at Guiguts, so what I will offer instead of Guiguts-specific comments are comments in general about what I see "broken" with much of the HTML files on PG re "epub" -- including recent HTML submissions. I hope, perversely, that some of these problems are in fact coming from Guiguts -- in which case "we" have some hope of solving them!

Now by "epub" I will take as meaning "anything that is submitted to PG in HTML form, which PG then converts to some flavor of epub and presents to the PG customer, OR which PG further converts from epub via Kindlegen to target some flavor of mobi reader before presenting to the customer." If you read the "epub" specs carefully, I think you will find that "epub" includes devices such as mobi devices, where that "epub" is in turn compiled to an intermediate form (mobi, or KF8) [via Kindlegen] before being presented to the final customer device.

1) There is one very mobi-specific problem, which can in turn be decomposed into two parts:

a) Many mobi devices round vertical distances to the closest 1.0em, whether or not those distance specifications were written in ems.

b) Many mobi devices DO NOT merge top and bottom margins, but rather follow the archaic approach of ADDING those margins.

Where this bites PG customers very, very frequently is when submitted HTML contains some form of "splitting the baby," i.e. specifying both top and bottom margins -- especially when those margins are applied to the ubiquitous case of <p> paragraphs. For example, a "p { margin-top: 0.5em; margin-bottom: 0.5em; }" specification LOOKS innocuous, but in fact results in a *** 2.0em *** spacing between paragraphs on most mobi devices! (Each 0.5em margin rounds up to 1.0em, and the top and bottom margins are then added rather than merged.)
While ugly, even that is still not a killer unless the book being converted contains lots of dialog exchanges, in which case one gets a book containing one line of dialog, followed by two lines of nothingness, another line of dialog, followed by two lines of nothingness, etc. -- which in practice one reads in the prosody of extremely slow, sluggish, stupid protagonists -- which is presumably NOT the original author's intent!

The only "reasonable" way around this problem is to understand the limits of these devices. For example, it is easy to "prove" in practice that changing the original specification of p { margin-top: 0.5em; margin-bottom: 0.5em; } to p { margin-top: 0.51em; margin-bottom: 0.49em; } is sufficient to "completely" solve the problem on many HTML books using this particular "split the baby" approach [which was really not needed in the first place!].

1b) Actually, there are many "epub" [Android (say)] EPUB reader apps which choose not to implement more or less of the "HTML" spec, resulting in many of the same problems described as "mobi-specific" in 1) above.

1c) I will not comment on how Apple chooses to implement "epub." Read Elizabeth Castro's blogs and books about this subject: she says it better than I could hope to.

2) The great majority of the remaining problems I commonly see are not mobi-specific problems, nor epub-specific problems; rather, they are simply problems where the original author of the HTML mentally conceptualized the rendering of that HTML ONLY on their desktop monitor of, say, 20" width and 2048 pixels of horizontal resolution, and didn't bother to think what would happen to that conceptualization when rendered on a device with, say, a 3" horizontal width and 480 pixels of resolution. And/or they don't think about how their color choices will render on monochrome devices. Or, frankly, they don't care.
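Given the round-then-add behavior described above, the fix can be sketched directly in CSS:

```css
/* Problematic: on devices that round each vertical margin to the
   nearest 1.0em and then ADD top and bottom margins, each 0.5em
   becomes 1.0em, giving 2.0em between paragraphs. */
p { margin-top: 0.5em; margin-bottom: 0.5em; }

/* Workaround: nudge the values across the rounding boundary.
   0.51em rounds up to 1.0em; 0.49em rounds down to 0em.
   Rendered gap: 1.0em -- the spacing actually intended. */
p { margin-top: 0.51em; margin-bottom: 0.49em; }
```

On conforming browsers the 0.01em difference is invisible, so this workaround costs nothing on desktop renderers.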
Or, frankly, they are openly hostile to PG customers who own one or another different "flavor" of machine, and are going out of their way to sabotage owners of such machines rather than trying to "write once, read everywhere."

They could discover most of these problems for themselves, should they choose to fix them, simply by setting the window size of their HTML browser to, say, literally 3" wide by 4" high *without* making any other changes, such as changes in font size. Is what you see there still beautiful? No? Then you have written broken HTML. Good HTML still looks beautiful even when displayed in that small a window, without any font size changes or any other changes, and without any horizontal scrolling. Not "passable." Not "broken but who cares." But rather: "still beautiful even when viewed through a non-scrolling window port 3" wide by 4" high, without changing font sizes."

3) A common problem: The original HTML author has a really wide 20" screen where the width is much greater than the height. They maximize their browser and then try to read their work -- which looks really, really ugly on their screen with its 16-wide-by-9-high aspect ratio. So they stick in a "body { margin-left: 16%; margin-right: 16% }" statement, which makes them feel better about how their work looks on their machine. They could have specified these margins in their own browser preferences -- but most people don't know how to do that. Or they could demaximize their browser window and choose a window shape more closely simulating the shape of a typical printed book page. But they don't do that. OK, now how does this HTML display on our hypothetical small machine? The original 3"-wide display is now effectively only 2" wide -- guaranteeing an unhappy PG customer! The 480-pixel horizontal resolution is now only 320 pixels -- NOT a happy camper!
4) Related problem: Gee, now that I, the HTML author, have all this extra left and right margin space -- how about if I get really clever and stick page numbers in that space? Wouldn't that be a contribution to humanity? Answer: Now the author has created a justification for mandating those large margins, making them very difficult to remove. Some small machines automatically remove these margins, or give the device owner manual control to remove them. Do those page numbers then silently and thankfully get dumped into the big bit bucket of bad ideas in the sky? Nope. Instead, those page numbers now get rendered ON TOP OF the text the customer is trying to read. OR those page numbers get rendered randomly INTO the body of the text being read: "Four score and seven years ago our fathers brought forth on this continent, a new natPage 6ion, conceived in Liberty, and dedicated to the proposition that all men are...."

5) Related problem: I, the HTML author, have this text from the mid-1800s which contains a quasi-illuminated letter as the first letter of each chapter's opening paragraph, and I think it is really important to render my HTML as close to the original text as possible -- in spite of the fact that the quasi-illuminated letter is a stock printer's "clip art of the day" image which has no relationship to the text in question -- the book being transcribed is an "Everyman's Library" edition where the publisher thought he could gin up Xmas sales by sticking in "Red Letters" and gilding the binding for people buying Xmas presents to provide pretty padding for their library shelves. So I will stick in a GIF of the quasi-illuminated letter doing a "float left" at the start of my paragraph. Q: Does this work? A: Not if you have any artistic sense.
In HTML there is no fixed relationship between the float and the rendered "normal" paragraph text, so even if it "looks good" on your particular choice of browser on your particular desktop machine, that's just "dumb luck." And this has no chance of looking attractive on small machines, simply because the screen-size relationships are totally different. And "float left" is really just another variation on the left-and-right-margins problem listed above -- the three or four lines of text floating right of the float-left are now reduced to 2" in length, guaranteeing they will look stupid, if not actively broken. And this will probably screw up accessibility.

6) Related problem: OK, not a GIF letter, but I still want to make that first letter red and 3em in height to make it really dramatic, and I want it to be a "real" dropped cap, so I will again do a float-left and play around with negative margins to make it drop. Q: Good idea, right? A: Again, there is no fixed relation in HTML between your float-left and your following paragraph, so if it "works" on your desktop and browser it's just "dumb luck"; it won't work on anyone else's browser, certainly NOT on their small machine, and it will look ugly and silly on those other machines -- if not totally broken. And it will probably screw up accessibility.

What you *can* probably "make work" is to enlarge the first letter (and/or word) of the first paragraph of each chapter using proper CSS, if you make it *marginally* larger, say 2em, and you don't make it red, and you don't attempt to "drop" it, and if you use the CSS to tweak the line spacing a little bit for the first line of the first paragraph after a chapter break. But is it *really* worth it? And does it *really* make "artistic" sense??
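The modest "enlarged initial" alternative described above might be sketched like this (class names are illustrative, not from any PG house style):

```html
<!-- Sketch of a modest enlarged initial: no float, no drop, no color,
     no negative margins. Class names here are purely illustrative. -->
<style type="text/css">
  p.chapfirst span.initial {
      font-size: 2em;   /* marginally larger, not 3em */
      line-height: 1;   /* keep the first line from pushing apart */
  }
</style>

<p class="chapfirst"><span class="initial">F</span>irst words of the
chapter follow in the normal run of text...</p>
```

Because the enlarged letter stays in the normal text flow rather than floating, its position relative to the paragraph is fixed on every renderer, which is the whole point.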
[Aside: A "real" dropped cap, if one were to read typographers' texts, whether of the "illuminated" or plain variety, is an enlarged letter whose baseline exactly matches the baseline of the 2nd or 3rd subsequent line in the paragraph. Its spacing has to be carefully adjusted so that the visual space to the right of that dropcap "looks right" in relationship to the following paragraph text. Even in the "good old days" typesetters had trouble making this look right, relying literally on a metal rasp to file the dropped cap to fit correctly. Good luck making it "fit right" in HTML -- there are no metal files in CSS.]

7) Related problem: The original text I am transcribing is an 1800s "photocopy" of a 1600s text in 9" x 14" size which contains editorial notes "in the margin," which I want to reproduce literally in the left margin of my HTML text. And of course I still want my page numbers in the right margin. I am not greedy -- I will take only 1" of left margin for my editorial notes and 1" of right margin for my page numbers, because I want my HTML to look just like the original. Q: Good idea, right? A: On our hypothetical 3"-wide small device, your decision to take 1" of left margin for float-left editorial notes and 1" of right margin for float-right page numbers leaves the owner of our 3" device with exactly 1" of effective display in which to actually read their choice of book. PG customer not happy -- what an ingrate!

8) Related problem: Producers of small devices, or of reader software for those devices, who run into these problems often enough decide they are simply better off NOT implementing some of this HTML "stupidity," and silently ignore the most troublesome HTML tags. With the tags ignored, the body text associated with those tags now gets inserted "at random" -- but at least the customer can read it!
And now the more technological customers (and authors) for those devices complain about how those devices "...don't implement all of the HTML tags in version 9.99.999 of the global HTML standard. How dare they!"

9) Related problem: poetry. There being no good way to display poetry in HTML. There being no good way to line-wrap-and-poetry-indent. Many people propose that they have "solved" this problem using negative margins. The behavior of negative margins is extremely problematic on many small machines. For example, you assume that you have a positive margin into which to poke back your negative margin. Except the small machine has been set in a mode to discard margins, so now your negative margin pokes off the screen. Except the small machine's software isn't written to even conceive of that possibility. Or maybe in theory all the margin settings should add together, negatives and positives, to form an overall positive margin "which ought to work right." Except someone writing the device software has decided that there is no way some intermediate term in the margin calculations should ever go negative, and if it does, that's a bug to be caught, so that negative intermediate result gets turned into "0" -- and again, your carefully crafted poetry scheme is busted.

I have seen some schemes that use block inside of span which look like they might work if developed carefully to avoid negative margins. But one would have to test these ideas, while developing them, on a large number of actual machines. And the transcriber would have to understand the poem in question well enough to understand that they shouldn't be trying to transcribe literally all the line breaks and indents of the printed-page poem in the first place -- because some of those indents and line breaks are simply limitations of the original paper page width, imposed by the printer and publisher on the original poet.
Except sometimes poetry *IS* literally the form of the visual image on the page.

10) And finally, but not finally: PG automagically inserts archaic boilerplate which horribly breaks on many if not most customer machines in the first place -- and insists on making this boilerplate the first thing the customer sees. Negative impressions first, please! Etc.

PS: The really sad thing is that it is really simple to do a really good job of rendering most books into HTML that looks attractive and is quite readable on "all" machines, including epub and mobi machines. The problems arise when transcribers try too hard to be too "clever" and too "literal" in their transcription. Leading to the over-general observation: "Naïve transcriptions look good, sophisticated transcriptions look bad."

PPS: Literally try opening some recent PG submissions in HTML mode on your desktop or laptop, setting the window port size to 3" wide by 4" high, and see what happens -- horizontal scrolling is cheating! Hmm, I grab, literally at random: http://www.gutenberg.org/files/38593/38593-h/38593-h.htm -- and does it work? Nope. Ugly and broken.

hmonroe> I would appreciate some guidance on what adjustments would need to be made to the HTML produced by Guiguts to make it more amenable to conversion to epub format by epubmaker or otherwise.

OK, please see my previous email. I don't really remember how to run Guiguts, but I installed it fresh, imported the last txt72 that I submitted to PG, auto-ginned HTML using Guiguts, and found most of the classes of problems I described earlier.

Putting my previous email even more simply: If you think about it, a small, say 3"-wide, machine has the visual space to do exactly ONE thing at a time; but via vertical scrolling or page flips it, like all modern display devices, can do an essentially infinite number of things vertically. So if you want to make it "work" you have to stack visual elements vertically, not horizontally.

So, on a small display device, at any moment in time, speaking in terms of horizontal real estate, you only have space to do ONE of the following things on the display:

a) Display a picture.
b) Display a page number.
c) Display left margin whitespace.
d) Display right margin whitespace.
e) Display a note [a "footnote" or what have you]. OR
f) Display the actual body text of the book that the customer might actually read.

Now I'm not saying that a) through e) might not be interesting things to display on the customer's device. I'm just saying that if you want to display more than one of these things "at any moment in time," you have to stack them vertically, not horizontally, or it just ain't going to work -- because the real estate simply doesn't exist on the customer's device's display.

Specific comments looking at the ginned HTML code:

* Specifying left and right body margins doesn't work.
* Specifying <p> top and bottom margins of 0.75em doesn't work.
* position: absolute doesn't work.
* Things of 0.5em don't work.
* Sidenotes don't work.
* Smallcaps don't work on any machine I know of -- but they're typically relatively harmless when they don't work.
* float: left doesn't work, and float: right doesn't work.
* margin-right really doesn't work.
* Background colors typically don't work, but when they don't work it's typically relatively harmless.
* Reliance on spans to move things right is probably not a good idea.

Etc.

I realize what I say will probably make you unhappy, and you will probably go away mad without ever changing anything, but the reality remains: what you are trying to do is not going to work on small machines, for the simple reason that, well, they are small. Which leaves PG in an interesting position.
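Pulling the "doesn't work" list above together, a conservative baseline stylesheet for small-device targets might look something like this (a sketch distilled from the complaints in this thread, not an official PG or Guiguts recommendation):

```css
/* Conservative baseline for small devices -- a sketch only.
   The guiding rule: stack everything vertically, in whole ems. */
body {
    margin-left: 0;        /* no wide body margins */
    margin-right: 0;
}
p {
    margin-top: 1em;       /* whole-em spacing on ONE side only; never
                              split spacing across top and bottom */
    margin-bottom: 0;
    text-indent: 0;
}
/* Avoid entirely: position: absolute; float: left / float: right;
   negative margins; fractional-em spacing such as 0.5em or 0.75em;
   more than one class per element; and spans used for indentation. */
```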

Jim, are you happy with the epubs and mobi generated through RST, any of these PG numbers: 34605 34654 35031 35034 35077 35078 35079 35080 35090 35091 35150 35189 35321 35352 35412 35449 35484 35503 35504 35518 35538 35594 35600 35660 35680 35694 35728 35729 35730 35749 35759 35818 35819 35819 35821 35822 35823 35831 35857 35858 35859 35879 35896 35914 36020 36021 36060 36061 36062 36063 36165 36239 36335 36379 36727 36737 36792 37102 37123 37378 37405 37412 37413 37466 37467 37476 37477 37482 37551 37719 37720 37776 37776 37783 37819 37849 37931 37932 37933 37934 37935 37936 37939 37986 38082 38216 38217 38218 38287 38298 38302 38338 38360 38509 38578 and/or with those generated through TEI, 6130 15573 15697 15775 16495 16523 16663 16697 16780 16939 16940 16941 16983 16984 16985 16986 17001 17260 17309 17310 17756 18475 18779 18827 18828 19236 19237 19238 19239 19240 19241 19242 19243 19244 19245 19250 19251 19252 19253 19254 19267 19268 19269 19270 19271 19273 19274 19275 19276 19277 19278 19279 19280 19281 19282 19283 19284 19285 19286 19287 19289 19290 19291 19292 19296 19297 19298 19299 19300 19312 19331 19332 19380 19449 19457 19460 19464 19465 19486 19487 19503 19518 19628 19800 19810 19890 19903 19963 19964 19965 19967 19978 20093 20102 20141 20147 20159 20313 20485 20840 20888 20944 20990 21194 21195 22099 22492 22659 22700 22761 22823 23065 23132 23466 23500 23672 24016 24172 24238 24315 24346 24526 24585 24746 24817 24869 24921 24979 25012 25018 25019 25080 25092 25096 25274 25311 25386 25457 25539 25585 25623 25645 25833 25929 25993 26027 26134 26145 26328 26341 26495 26572 26638 26639 26849 26881 27114 27196 27197 27226 27270 27351 27435 27520 27531 27552 27587 27616 27698 27755 27810 27869 27942 28015 28023 28024 28054 28066 28075 28104 28126 28208 28239 28240 28277 28280 28282 28297 28312 28357 28369 28416 28558 28598 28601 28656 28674 28736 28938 28957 28964 29054 29143 29247 29363 29540 29786 29798 29855 29912 30042 30096 30097 30107 30132 30227 
30241 30323 30334 30339 30499 30531 30579 30634 30664 30684 30695 30718 30883 30895 31111 31111 31127 31183 31294 31309 31321 31427 31429 31461 31510 31711 31794 31891 31935 31960 32066 32136 32158 32510 32624 32722 32856 32883 32945 32983 33011 33090 33447 33452 33598 33598 33598 33743 34010 34022 34049 34050 34064 34070 34081 34122 34163 34172 34201 34520 34736 34839 34839 34989 35041 35137 35167 35172 35192 35230 35269 35269 35298 35365 35434 35501 35501 35539 35563 35580 35722 35750 35751 35758 35856 35915 36227 36296 36342 36548 36549 36722 36828 37120 37241 37404 37494 37756 37841 37881 38044 38116 38126 38241 38326 38427 And do you see a difference between the two sets? Thanks. Carlo

Carlo> Jim, are you happy with the epubs and mobi generated through RST, any of these PG numbers:

Well, if you ask me about "any" of these numbers, I will just look at one of each, if you don't mind.

RST 34605 is a junior reader, which usually doesn't exercise the full set of problems one sees in an adult book, but let me continue.

The HTML version has wide margins, which will tend to make PG customers unhappy when they access this file via a web browser from a small device, like an Android tablet.

The HTML version doesn't show page numbers, but the EPUB version does -- so I don't understand how you manage to do that? The EPUB version seems to somehow have the margins trimmed. Did Marcello's software trim the margins for you? Now "magically" there are page numbers in the small right-hand margin which didn't exist in the HTML version. I don't understand how this is possible. If PG requires that submissions for the EPUB and MOBI versions be via HTML, then how is this possible? I'm sure I don't know.

The EPUB version is something I could live with, i.e. I think it is "readable" on most EPUB devices, even though personally I would prefer not having ragged right, not having a small right-hand margin, and not having page numbers there -- at least not in a junior reader. But still, the EPUB is successful enough that I don't consider it embarrassing. I just don't know how -- if PG requires EPUB and MOBI to be via HTML submission -- this file ends up with page numbers when the HTML version does not?

The MOBI version needs to be looked at in two different ways: 1) under a "traditional" Kindle device and 2) on a "Kindle Fire" new-generation device.

Under Kindle Fire: It seems like you have a way to not use the "Blue PDA" as the cover image. I don't see how you do this, because I don't see how PG allows specification of the cover image to be used with an HTML submission?
In any case, on many small devices a cover image like yours, which includes title and author, is much more useful than the PG default "Blue PDA." Or is it that you don't include a cover at all, and Amazon is automagically providing me with one? But when I submit HTML, PG gives my EPUB a "Blue PDA" -- whether I want it or not!

Title page images are margin-left; in HTML they are centered. Is it possible that the image location isn't being specified and is just picking up the web browser defaults?

Hitting "Back Link" on a chapter title takes one to a scrambled version of the TOC, which is presumably related to MOBI file lazy evaluation. Is your TOC link target placed *after* the format specification for the TOC? MOBI might work better if the TOC link target is placed *prior to* the TOC format specification, if possible.

You are doing "no margin-top, first-line indent" on paragraphs; I find "1em margin-top, no first-line indent" easier to read on small devices, but this is mainly a personal-preference item.

"CHAPTER V: JANET HEARS FROM BETTY" title formatting is not very successful. It would probably format better if you specified it as:

CHAPTER V

JANET HEARS FROM BETTY

i.e. chapters with titles in two lines, each line centered. Many of the chapter titles suffer in appearance from the choice of implementation.

"But when the old woman got home in the dark," etc.: This poem formats poorly due to poem line-break problems. Not to imply that I personally know a good solution to "HTML"-based poetry formatting. But it does seem like you could easily choose to indent poems much less than this, leading to poem line-break problems less frequently.

"Bye little Betty, ..." seems to be indented *way* too much.

"THE END" and the subsequent PG "advertisements" are formatted poorly. It would be nice if the reader could savor finishing the book for a few moments before being immediately hit by the "pgegalize." Put in a page break before the "pgegalize"?
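The back-link suggestion above can be sketched as follows (ids and class names are illustrative; only the ordering is the point):

```html
<!-- Fragile on MOBI: the link target sits inside the styled TOC block,
     so a back-link may land in a partially-evaluated TOC. -->
<div class="toc">
  <a id="toc"></a>
  <h2>CONTENTS</h2>
  ...
</div>

<!-- Likelier to work: place the target BEFORE the TOC formatting begins,
     so a jump to #toc arrives ahead of the format specification. -->
<a id="toc"></a>
<div class="toc">
  <h2>CONTENTS</h2>
  ...
</div>
```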
-----

"Traditional Kindle": Seems to have the same exact issues as "Kindle Fire" -- which is encouraging, given that presumably PG is still compiling "HTML" to MOBI7 using the previous version of Kindlegen, and not to "KF8" using the current generation of Kindlegen.

Picking TEI 34520 "at random" -- just to have a book of a similar vintage to RST 34605 -- I see:

The HTML version (reading HTML on a small device): Has wide margins, which will tend to make PG customers unhappy when they access this file via a web browser from a small device, like an Android tablet. Page numbers are rendered in the left margin, which probably explains why the margins were set so wide. Page numbers 2 and 3 are rendered literally exactly on top of each other, which is presumably a TEI-to-HTML implementation bug.

"PGegalize" is formatted in no-wrap mode, so it runs off the right-hand edge of the screen, looking ugly. It is not clear why the "PGegalize" is placed at the front of some (RST) books and at the end of others (TEI)?

Excess margins in the TOC cause more entries than necessary to wrap unattractively. And in general the TOC is hard to read, since it wraps so much (many times a single entry over multiple lines), which again is due to excessive margins.

Images are not set to scale correctly to fit the (small) device page size, so that on a small device one sees only a tiny upper-left-corner nibble of the images.

<p> formatting is chosen to be 1em spacing between paragraphs, no paragraph indent, which personally I prefer, because I think on small devices it is somewhat easier to read.

[pg 014] doesn't fit the margin, so it renders: [pg 0 14], with the "]" somewhat overlapping the first letter of the body text. Same for all margin page numbers, basically.

Topic titles are formatted margin-left rather than centered, as I would expect of a traditional book. If this were a computer manual instead of a biblical manual, then I would expect to see topic titles formatted left, a la K&R troff style.
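For the image-scaling complaint above, the usual remedy on reflowable readers is to let images shrink to the rendering area rather than be clipped (a sketch; older MOBI renderers ignore much of CSS, so results vary by device):

```css
/* Sketch: scale images down to fit the device page width instead of
   showing only an upper-left-corner nibble. Renderer support varies;
   MOBI7-era devices may ignore these properties entirely. */
img {
    max-width: 100%;   /* never wider than the rendering area */
    height: auto;      /* preserve the aspect ratio when shrunk */
}
```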
"Note.-God did this by backing His promises with an oath based upon Himself. Heb. 6:13, 14. By this He pledged and placed at stake His name, or character, for the fulfillment of His word." -- which actually looks even worse in the small web browser than I have represented above, since the formatting has been chosen to try to "justify," making the gaps between the above words go all over the place. But the point being: when you start using "big" margins on a small device, the results are necessarily tragic. In this case, were the left and right margins needlessly set in ems or something, instead of using %?

"O Word of God incarnate, O Wisdom from on high, O Truth unchanged, unchanging, O Light of our dark sky! We praise Thee for the radiance That from the hallowed page, A lamp to guide our footsteps, Shines on from age to age." shows a "textbook example" of the difficulty of HTML "poem" formatting, especially on small devices when large margins and indents are chosen. [The actual poem, should you fail to recognize what it should have been, is:]

O Word of God incarnate,
O Wisdom from on high,
O Truth unchanged, unchanging,
O Light of our dark sky!
We praise Thee for the radiance
That from the hallowed page,
A lamp to guide our footsteps,
Shines on from age to age.

The index seems very long for a small device. Is this really what was in the original book?

End "pgegalize": While generally readable, the http links have been formatted no-wrap, which means one cannot read the whole http link, making them generally unusable.

EPUB: The NCX, or whatever it's called, seems to be way overpopulated. Page numbers have now magically jumped from the left margin to the right margin, and what used to read "[pg 014]" now reads just a small grey "14" -- how do they do that? How does one submit HTML which looks so different in an HTML browser than in EPUB? I wish I knew how they do that, because I was told that if one wants to target EPUB or MOBI, the ONLY option is to submit an HTML file.
Hm, except now I see a line that says: [pg 002][pg 013] ... 7 So now I am confused: is this page 2, or page 13, or page 7?

- Margins seem large for the small device.
- Images do not scale to fit the device.
- Some images are compressed using a scheme which does not display correctly on the device.
- Some page numbers [pg 016] are now showing up in the body text, interrupting the reading, which I personally never think represents a reasonable tradeoff.
- The same kinds of severe poetry formatting issues make the poems unreadable.
- The same kinds of severe left and right margin "indenting" on blockquotes make them unreadable.
- Etc.

The EPUB shows basically the same kinds of extreme problems present when trying to read the HTML on a small device.

Credits: The credits are formatted so that they show up farther to the right than the center of the display, and therefore of course are badly clipped off by the right edge of the physical display.

Kindle: No image versions of the files are provided at all? Did a gen tool silently fail? Or does no one bother to fix gen failures?

Kindle Fire: Unfortunately these files have been "successfully" provided with the "Blue PDA" default PG cover image, meaning that one has 20 identical-appearing "Blue PDA" PG books on one's Kindle, and one does not know which book is which. The author name is apparently not being correctly provided by TEI, because it doesn't show up on Kindles, where sorting on author name is an important choice. The content header runs up against the previous formatting. In general all the lead-in pages' formatting is incorrectly run together, and left-aligned instead of centered -- but left-aligned against an excessive left margin width. TOC items are bulleted, which is uncommon in TOCs, I believe, and looks strange to my eyes. The missing images are replaced with the left-aligned words: Illustration. The Word of God.... -- except in reality there is no vertical whitespace being used to offset these words.
For accessibility reasons, I would think that if one takes out the images and replaces them with something, that something ought to be a description of what was left out?

No vertical space after subject headings. Traditionally one would expect to see some spacing between a subject heading and the following paragraph. Ideally the first paragraph after a subject heading shouldn't have an indentation.

Poems are now miraculously left-aligned without an excessive margin. But the even-numbered poem lines *ought* to be indented relative to the odd-numbered poem lines, and somehow they are not. In general, something seems to have swallowed ALL the vertical "whitespace" formatting that *ought* to be there. Everything simply "runs together" vertically.

End "pgegalize" is now readable, but something has turned what used to be active http links into inactive text.

Credits show up "OK" on Kindle, but somewhat excessively indented (though not as bad as on EPUB).

Kindle "Traditional": TOC entries: the bullets and the TOC entries show up on different lines in the display, looking pretty silly. "How Readest Thou?": this section shows up with some kind of "pre" formatting or something which doesn't wrap correctly, and lines break unattractively. In general, beyond this, the Kindle "Traditional" shows the same kinds of problems as Kindle Fire.

In general, I would say this particular TEI is no more successful in practice on a Kindle than simply sending a txt72-unwrapped version of the plain text to a Kindle. If that. Sorry -- Jim

Jim, none of the submissions that I listed are mine; they have been submitted as PG-RST or PG-TEI master files, from which all the formats have been created by Marcello's software without further manual intervention. In particular, for RST two different HTML files are created in the process, one for browsers and one for epub. So it is unfair to view the browser HTML on a small device. The software to handle TEI is older, and probably its procedures should be updated. And the files were produced in 2010, so no wonder that they used the 2010 kindlegen. PG doesn't routinely redo all the file generation. Carlo

Carlo>Jim, none of the submissions that I listed are mine, they have been submitted as PG-RST or PG-TEI master files, from which all the formats have been created by Marcello's software without further manual intervention. Sorry, I didn't mean to imply that you were the actual submitter of the books in question.
In particular, for RST two different HTML files are created in the process, one for browsers and one for epub. So it is unfair to view HTML on a small device.
Actually, what I think is unfair is that PG has told submitters, including me, repeatedly, that the ONLY way PG allows one to submit EPUB or MOBI is by sending in a generic HTML file, but then PG "secretly, behind the scenes" allows other people other paths to create customized EPUB and MOBI rather than having to rely on "One Generic HTML For All". Re: the general unfairness of reading HTML on a small device: Have you read the numbers on what people out there in the "real world" are actually doing? They have largely stopped buying desktops. They have largely stopped buying laptops. What they ARE buying are tablets. Those tablets now all come with decent HTML browsers. Many of these customers have no idea what an EPUB or MOBI file is. So they end up reading the books in HTML format, which looks crappy on their small devices, since the authors of that HTML really didn't know what they were doing, or didn't care. Now, is it somehow "unfair" that these customers read what PG provides on what the customer actually has in hand? Or is it "unfair" that PG takes books written by good authors, transcribes them incorrectly into bad HTML, and then presents them in a totally scrambled manner on the end user's device? It seems like the people being treated "unfairly" are the original authors, and the PG customers, NOT "PG" !!!
PG doesn't redo routinely all the file generation.
IF PG doesn't routinely redo all file generation, then PG has basically been lying to us submitters about their rationale for restricting our contributions to being "HTML ONLY!" Even IF PG insisted on doing their own gen of EPUB and MOBI, PG could still allow us submitters to also submit EPUB-HTML and MOBI-HTML -- which you point out PG is ALREADY allowing in the "special case" of RST! So why one set of rules for some people and other set of rules for all us others ???

On Thu, 19 Jan 2012, Jim Adcock wrote:
Even IF PG insisted on doing their own gen of EPUB and MOBI, PG could still allow us submitters to also submit EPUB-HTML and MOBI-HTML -- which you point out PG is ALREADY allowing in the "special case" of RST!
So why one set of rules for some people and other set of rules for all us others ???
Just because you were not aware of RST being used on PG does not mean that it has been "hidden" from you. The idea of using it came from some volunteers from pgdp. It took a good while to work out a tool-chain so that a WW could process it, and lots of discussion at PGDP. As near as I can tell, it is not being used very widely yet. If you would like to give it a try, please go ahead. To see more, try reading: http://www.pgdp.net/wiki/RST --Andrew

Just because you were not aware of RST being used on PG does not mean that it has been "hidden" from you.
You haven't addressed the issue. WHY is "PG" allowing "RST" and "Tei" to bypass the requirement to submit one and only one version of HTML while telling the rest of us submitters that we are not allowed to "submit" alternate versions for "EPUB" nor "MOBI" ? There are no "tool chain" issues here. A submitter could just say to PG "This HTML is for big browsers, this HTML is for EPUB, and this HTML is for MOBI." Or PG could allow us to submit one HTML with conditional compilation statements, even using the CPP for * sake, and then generate *useful* versions of the "big browser HTML", "EPUB HTML", and "MOBI HTML" from that one submitted HTML that includes the conditional compilation statements.
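Jim's conditional-compilation idea could be sketched with the ordinary C preprocessor. The file names, macro names, and stylesheet names below are all hypothetical; nothing like this is part of any actual PG workflow:

```html
<!-- book.html.in: one master, three targets, selected at build time,
     e.g.:  cpp -P -DEPUB book.html.in > book-epub.html  -->
<head>
  <title>Example</title>
#ifdef EPUB
  <link rel="stylesheet" type="text/css" href="epub.css" />
#elif defined(MOBI)
  <link rel="stylesheet" type="text/css" href="mobi.css" />
#else
  <link rel="stylesheet" type="text/css" href="browser.css" />
#endif
</head>
```

One caveat: cpp complains about apostrophes and backslashes in ordinary prose, so a real pipeline would more likely use a purpose-built template tool than the literal C preprocessor.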

Perhaps one point is not clear. TEI and RST are both "single source" options. That is, a volunteer submits only *one* file, and all others are generated from it. --Andrew On Thu, 19 Jan 2012, Jim Adcock wrote:
Just because you were not aware of RST being used on PG does not mean that it has been "hidden" from you.
You haven't addressed the issue. WHY is "PG" allowing "RST" and "Tei" to bypass the requirement to submit one and only one version of HTML while telling the rest of us submitters that we are not allowed to "submit" alternate versions for "EPUB" nor "MOBI" ?
There are no "tool chain" issues here. A submitter could just say to PG "This HTML is for big browsers, this HTML is for EPUB, and this HTML is for MOBI." Or PG could allow us to submit one HTML with conditional compilation statements, even using the CPP for * sake, and then generate *useful* versions of the "big browser HTML", "EPUB HTML", and "MOBI HTML" from that one submitted HTML that includes the conditional compilation statements.

Andrew>Perhaps one point is not clear. TEI and RST are both "single source" options. That is, a volunteer submits only *one* file, and all others are generated from it. OK, how about we try this one more time: let me submit *one* EPUB, and PG can generate all the other formats from it -- especially the txt70 requirement -- now *that* would be one giant step forward!

On 01/19/2012 06:23 PM, Jim Adcock wrote:
IF PG doesn't routinely redo all file generation, then PG has basically been lying to us submitters about their rationale for restricting our contributions to being "HTML ONLY!" Even IF PG insisted on doing their own gen of EPUB and MOBI, PG could still allow us submitters to also submit EPUB-HTML and MOBI-HTML -- which you point out PG is ALREADY allowing in the "special case" of RST!
Wrong. You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files. -- Marcello Perathoner webmaster@gutenberg.org

You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
Well, we are getting down to Clinton-esque parsing, but in that case allow me to post *one* HTML file which includes CPP conditional-processing statements (or the equivalent), such that it is "allowed" to magically split into three Hydra heads and generate separately targeted "Big HTML", EPUB, and MOBI versions, just like RST is allowed to do.

On 1/19/2012 5:06 PM, Jim Adcock wrote:
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
Heck, you could probably even post *one* z.m.l. file, just so long as it has a .txt extension ;-). If you did, you couldn't also post an impoverished text file, but in that case there wouldn't be any need would there?
Well, we are getting down to Clinton-esque parsing, but in that case allow me to post *one* HTML file which includes CPP conditional-processing statements (or the equivalent), such that it is "allowed" to magically split into three Hydra heads and generate separately targeted "Big HTML", EPUB, and MOBI versions, just like RST is allowed to do.
This horse doesn't yet seem to be dead, so... The three basic rules for making ePub-ready HTML files:

1. Don't specify styles inside the HTML document, either inline or in the <head>. Do include a link to an external style sheet with the name "pg.css".
2. Use HTML elements only as designed; no tag abuse.
3. Use a standardized set of classification attributes ("class='standard'"). Help contribute to building a standard library of attribute values. Be willing to compromise in the naming of the attributes; standardization is more important than "correctness".

If you were to follow these rules you would have exactly what you are asking for: a single HTML file that would satisfy all three purposes. The conditional processing system would be implemented by the application of various external style sheets. ePub and the two Kindle formats could simply be generated using two (or three, if .kf8 needs it) different style sheet files. You could even distinguish between "21 inch monitor" HTML files and "3.7 inch monitor" HTML files by serving different style sheets according to the user's preferences. For example, it would be a simple thing to serve "little screen" HTML files from "m.gutenberg.org" and "big screen" HTML files from "www.gutenberg.org".
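A minimal sketch of what a submission following these three rules might look like. The file name "pg.css" is Lee's proposal; the class names here are hypothetical, since the standard library of attribute values he calls for does not yet exist:

```html
<!-- chapter1.html: all presentation lives in the external sheet -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Chapter I</title>
  <!-- rule 1: no inline styles, no <style> element; one external sheet -->
  <link rel="stylesheet" type="text/css" href="pg.css" />
</head>
<body>
  <!-- rule 2: real structural elements, not styled <div>/<span> abuse -->
  <h2 class="chapter">Chapter I</h2>
  <p class="noindent">First paragraph after the heading.</p>
  <!-- rule 3: standardized class names, so a converter can swap pg.css
       for an epub or mobi sheet without touching this file -->
  <div class="poem">
    <p class="line">O Word of God incarnate,</p>
    <p class="line indent1">O Wisdom from on high,</p>
  </div>
</body>
</html>
```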

Lee>If you were to follow these rules you would have exactly what you are asking for: a single HTML file that would satisfy all three purposes. If *I* were to follow these rules I would produce the same HTML input to PG as I am already doing 99% of the time. I will admit that on the last day or two before shipping I am willing to compromise and "tweak" locally what goes out to fix any small remaining problems without taking the risk of wholesale damage to what I have already SR'ed two or three times before. The problem is with the great mess of other posters, and with a PG/DP mentality which says every last comma is important, but producing books that real customers can actually read on their real devices is not important. One misplaced comma is important, but literally 1000 formatting errors in one file doesn't count, because "We" don't care. With this caveat: I do not use external CSS files because PG doesn't accept them. And I am happy to put a couple lines into my CSS in order that the same HTML will render "correctly", and more or less consistently, on HTML, EPUB, MOBI, and KF8 devices. Agreed that I personally would probably rather see external CSS files.

On 1/19/2012 4:43 PM, Marcello Perathoner wrote:
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
Is this an exclusive 'or' or an inclusive 'or'? If I submit an HTML file am I precluded from posting a TEI or an RST file? Or may I submit one of each? Mr. Haines or Mr. Newby, would you care to confirm or deny this restriction?

Marcello is correct. It was decided in the early discussions of RST that additional custom files, e.g. a custom HTML file or a custom text file, would not be allowed. If custom files were desired/required, then the submission should be a normal text+HTML submission. The first discussion/submission of TEI files predates my time as WWer, but the same principle applies. Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Lee Passey Sent: Saturday, January 21, 2012 4:52 PM To: gutvol-d@lists.pglaf.org Subject: Re: [gutvol-d] Producing epub ready HTML
On 1/19/2012 4:43 PM, Marcello Perathoner wrote:
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
Is this an exclusive 'or' or an inclusive 'or'?
If I submit an HTML file am I precluded from posting a TEI or an RST
file? Or may I submit one of each?
Mr. Haines or Mr. Newby, would you care to confirm or deny this restriction? _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Yes, confirmed. Having numerous formats derived from a single master is a long-time goal. We've had some success with RST and TEI, and I've encouraged new projects to consider RST. There are still some limitations, though... On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure. Enough said. -- Greg On Sat, Jan 21, 2012 at 07:28:30PM -0800, Al Haines wrote:
Marcello is correct.
It was decided in the early discussions of RST that additional custom files, e.g. a custom HTML file or a custom text file, would not be allowed. If custom files were desired/required, then the submission should be a normal text+HTML submission.
The first discussion/submission of TEI files predates my time as WWer, but the same principle applies.
Al

On Sunday, January 22, 2012, Greg Newby <gbnewby@pglaf.org> wrote:
Yes, confirmed.
Having numerous formats derived from a single master is a long-time goal. We've had some success with RST and TEI, and I've encouraged new projects to consider RST. There are still some limitations, though...
On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure. Enough said. -- Greg
Au contraire. I think there are plenty of producers who don't understand the distinction. Why aren't WWers sending back projects that include destructive layout tagging, or don't include important structural tagging? I can think of any number of reasons for rejection that are less disruptive to the reader's satisfaction.

On Sun, Jan 22, 2012 at 02:52:48PM -0800, don kretz wrote:
On Sunday, January 22, 2012, Greg Newby <gbnewby@pglaf.org> wrote:
Yes, confirmed.
Having numerous formats derived from a single master is a long-time goal. We've had some success with RST and TEI, and I've encouraged new projects to consider RST. There are still some limitations, though...
On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure. Enough said. -- Greg
Au contraire. I think there are plenty of producers who don't understand the distinction.
Why aren't WWers sending back projects that include destructive layout tagging, or don't include important structural tagging? I can think of any number of reasons for rejection that are less disruptive to the reader's satisfaction.
Because we have automated checks for validity and good spelling. We don't have automated checks for (mis-) use of HTML for layout. If we had some sort of automated and relatively unambiguous checks for such things, I'm sure that many submitters would strive to comply. -- Greg
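Greg's point is that no such check exists; a first cut at an automated layout-abuse lint is nevertheless easy to sketch. The heuristics below are mine, not any actual PG or DP tool:

```python
# Sketch of a layout-abuse lint for submitted HTML.
# The heuristics are illustrative only.
import re

LAYOUT_SMELLS = [
    (re.compile(r'style\s*=', re.I), "inline style attribute"),
    (re.compile(r'<font\b', re.I), "presentational <font> tag"),
    (re.compile(r'<(center|big|small)\b', re.I), "presentational tag"),
    (re.compile(r'&nbsp;\s*&nbsp;', re.I), "runs of &nbsp; used for indenting"),
    (re.compile(r'<br\s*/?>\s*<br\s*/?>', re.I), "<br><br> used instead of <p>"),
]

def lint_layout(html):
    """Return a list of (line_number, message) warnings for layout abuse."""
    warnings = []
    for n, line in enumerate(html.splitlines(), 1):
        for pattern, message in LAYOUT_SMELLS:
            if pattern.search(line):
                warnings.append((n, message))
    return warnings
```

A check like this is noisy (some inline styles are harmless), but it would give submitters something "automated and relatively unambiguous" to aim at, which is exactly what Greg says is missing.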

On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure. Enough said.
Well, I think I work extremely hard to avoid using HTML to "impact layout", AND I work very hard to create HTML that will render "identically" in EPUB and MOBI, AND I check my work using a local copy of epubmaker, AND I check my work on multiple physical EPUB and MOBI devices, YET still I find that I end up being disappointed in the EPUB and MOBI that ends up being posted on PG from my well-intentioned efforts. Funny how pushing on one end of a rope doesn't lead to satisfying control over the other end of the rope. Again, if you allow posting in RST or TEI, why not allow posting in EPUB? EPUB comes much closer to being "a real complete book" than HTML ever will. And it's much easier to convert EPUB back to HTML than to convert HTML to EPUB. And EPUB goes straight into Kindlegen to make MOBI and now KF8, with KF8 being in practice very similar to EPUB.

On Sun, Jan 22, 2012 at 08:44:26PM -0800, Jim Adcock wrote:
On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure. Enough said.
Well, I think I work extremely hard to avoid using HTML to "impact layout", AND I work very hard to create HTML that will render "identically" in EPUB and MOBI, AND I check my work using a local copy of epubmaker, AND I check my work on multiple physical EPUB and MOBI devices, YET still I find that I end up being disappointed in the EPUB and MOBI that ends up being posted on PG from my well-intentioned efforts. Funny how pushing on one end of a rope doesn't lead to satisfying control over the other end of the rope.
:( Evidence that HTML as a master format cannot be satisfying, or at least not universally so. Seems we already knew that, but it's still sad to hear that your toils didn't have the intended outcome.
Again, if you allow posting in RST or TEI, why not allow posting in EPUB? EPUB comes much closer to being "a real complete book" than HTML ever will. And it's much easier to convert EPUB back to HTML than to convert HTML to EPUB. And EPUB goes straight into Kindlegen to make MOBI and now KF8, with KF8 being in practice very similar to EPUB.
I think this could be done. But we'd need some sort of processing chain (i.e., to validate, then create the various derivative formats). -- Greg

Evidence that HTML as a master format cannot be satisfying, or at least not universally so. Seems we already knew that, but it's still sad to hear that your toils didn't have the intended outcome.
I have happy results when I process that HTML on my own machines using my own tool chains, in part because I can, and do, easily check the output from those tool chains. In fact it has gotten to the point where if I want to read some book from PG by some other submitter, I find that I can *easily* create a more satisfying reading experience locally:

1) Download the PG HTML file(s).
2) Edit the CSS to fix the half dozen formatting "errors" that almost all submissions from DP fall prey to.
3) Compile locally using the latest version of Kindlegen, submitting the HTML files DIRECTLY -- no epubmaker required!
4) Read and enjoy a book that *isn't* scrambled.

If I can do this with 5 or 10 minutes of effort, then PG could too. It's just a question of whether PG keeps defending the status quo, even though it's clearly not working [1], or when or whether PG is willing to acknowledge: "You know, the approach we have been trying to take simply isn't working; we need to rethink our approach."

[1] How do "we" know that the current approach is not working? Simple: 1) Open a PG book in a web browser. 2) Open the same book on an epub device. 3) Open the same book on a Kindle device. Are the results similar, and similarly satisfying? Nope. Are others able to accomplish this same task using PG books as feedstock? Yes. Conclusion: PG is doing something wrong.
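Jim doesn't enumerate his half dozen CSS "errors", but a patch addressing the complaints made earlier in this thread (em-based margins, non-scaling images, fractional-em spacing, span-based indents) might look like this sketch. The selectors are hypothetical, not taken from any actual PG file:

```css
/* Sketch only: illustrative fixes, not an actual PG stylesheet. */

/* Percentage margins instead of wide em margins, so small screens
   keep a usable text column. */
body { margin-left: 5%; margin-right: 5%; }

/* Let images shrink to the device page instead of clipping. */
img { max-width: 100%; }

/* Whole-em paragraph spacing: Kindle rounds fractional ems up. */
p { margin-top: 1em; margin-bottom: 0; text-indent: 0; }

/* Indent poetry with margins on block elements, not <span> tags,
   which older Kindlegen ignores. */
.poem { margin-left: 10%; }
.poem .indent1 { margin-left: 2em; }
```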

On 1/19/2012 4:43 PM, Marcello Perathoner wrote:
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
I wrote:
Is this an exclusive 'or' or an inclusive 'or'? If I submit an HTML file am I precluded from posting a TEI or an RST file? Or may I submit one of each?
[snip]
On Sat, Jan 21, 2012 at 07:28:30PM -0800, Al Haines wrote:
Marcello is correct.
[snip] On Sun, January 22, 2012 1:21 pm, Greg Newby wrote:
Yes, confirmed.
I think this is a strong demonstration of why it is important for technical people to take more classes in technical writing. Mr. Perathoner started this thread by making an ambiguous statement. The ambiguity was compounded when he stated "In no case has anybody been posting multiple HTML files," leading one to conclude that the issue here relates to multiple postings of a single markup format. I then asked for clarification of that statement, to remove the ambiguity. Mr. Haines and Mr. Newby each responded saying, in effect, "yes, Marcello's original ambiguity is the policy of Project Gutenberg." I am left to conclude that either the Project Gutenberg policy in this regard is still in a state of flux and the ambiguity is intentional, that Mr. Haines or Mr. Newby did not feel comfortable enough with their writing skills to actually respond to the question, or that neither of them gave anything more than cursory attention to the question actually posed. Given the usual practice on this list of paying little, or no, attention to what other people have to say, I suspect that the last alternative is the correct one. In any case, the ambiguity remains.

On 01/23/2012 05:51 PM, Lee Passey wrote:
I think this is a strong demonstration of why it is important for technical people to take more classes in technical writing.
No, it is not. It is a textbook example of why technical people should take classes in English. You are the typical nerd who, when asked "Do you want tea or coffee?", answers "Yes". To save time I'll just rephrase what I said for nerds who don't quite understand English: you can post exactly one element of

HTMLTXT ⋃ TEI ⋃ RST

where:

HTMLTXT = { (h, t): h ∈ HTML, t ∈ TXT },
TXT = the set of all plain text files that encode the book,
HTML = the set of all valid HTML files that encode the book,
TEI = the set of all valid TEI files that encode the book,
RST = the set of all valid RST files that encode the book.

-- Marcello Perathoner webmaster@gutenberg.org

Marcello:
HTMLTXT = { (h, t): h ∈ HTML, t ∈ TXT }, TXT = the set of all plain text files that encode the book, HTML = the set of all valid HTML files that encode the book, TEI = the set of all valid TEI files that encode the book, RST = the set of all valid RST files that encode the book.
Well, I am sorry, but it is bizarre that PG would pretend that TEI or RST qualify as a "lasting format" not requiring the input of a separate txt70 file, but that HTML isn't a "lasting format" and does require submitting a separate txt70 file. Again, please everyone take a good hard look at the "HTML" code which PG itself is generating before criticizing the HTML efforts of ANY volunteer submitter!

I've seen Marcello's response to this, and to put it into plain English, a submission can be one, and only one, of the following:

- one (and only one) TEI file, plus illustrations. No other files are allowed.
- one (and only one) RST file, plus illustrations. No other files are allowed.
- one (X)HTML file, plus illustrations, plus from one to three text files (UTF8, Latin1, ASCII), depending on the submission's requirements. (This kind of HTML+text submission *may* also include such binary formats as doc, rtf, and pdf, but since they have to be handled manually, and PDF files are difficult, if not impossible, to correct, they're not encouraged.)
- one LaTeX file - these are rare, and are usually used only for source material having considerable math content, e.g. "Calculus Made Easy" (PG#33283)

Al
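Al's list amounts to a small decision procedure. Here is a sketch of it in Python; the function and format names are mine, not any actual PG tool, and the optional doc/rtf/pdf extras are ignored for simplicity:

```python
# Sketch of Al's submission rules as a checker. Illustrative only.
def classify_submission(files):
    """Return the submission type, or raise if the mix is not allowed.

    `files` maps each filename to a format tag such as "tei", "rst",
    "latex", "html", "txt", or "image".
    """
    # Illustrations are always allowed alongside any submission type.
    formats = [f for f in files.values() if f != "image"]
    for master in ("tei", "rst", "latex"):
        # Single-source masters: one file, and no other files allowed.
        if formats == [master]:
            return master
    htmls = formats.count("html")
    txts = formats.count("txt")
    # One HTML file plus one to three text files, and nothing else.
    if htmls == 1 and 1 <= txts <= 3 and htmls + txts == len(formats):
        return "html+text"
    raise ValueError("not an allowed PG submission mix")
```

For example, classify_submission({"book.tei": "tei", "fig1.png": "image"}) returns "tei", while mixing an HTML file with a TEI file raises an error, which is exactly the exclusivity Lee was asking about.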

On Mon, January 23, 2012 12:43 pm, Al Haines wrote:
I've seen Marcello's response to this, and to put it into plain English, a submission can be one, and only one, of the following:
See, I never would have gotten /this/ out of Mr. Perathoner's statement.
- one (and only one) TEI file, plus illustrations. No other files are allowed.
This seems bizarre to me. What is the rationale for not allowing, let alone requiring, an impoverished text file when the other submission is TEI? [snip]
- one (X)HTML file, plus illustrations, plus from one to three text files (UTF8, Latin1, ASCII), depending on the submission's requirements.
The language in the FAQ that one may submit a HTML version without a plain ASCII version (#H.3._Can_I_submit_a_HTML_version_without_a_plain_ASCII_version.3F) needs to be updated with this new requirement.
(This kind of HTML+text submission *may* also include such binary formats as doc, rtf, and pdf, but since they have to be handled manually, and PDF files are difficult, if not impossible, to correct, they're not encouraged.)
How in the world can these kinds of documents be included in HTML? Or do you mean you can package them up in a zip file along side the HTML?

On 01/23/2012 10:48 PM, Lee Passey wrote:
On Mon, January 23, 2012 12:43 pm, Al Haines wrote:
I've seen Marcello's response to this, and to put it into plain English, a submission can be one, and only one, of the following:
See, I never would have gotten /this/ out of Mr. Perathoner's statement.
- one (and only one) TEI file, plus illustrations. No other files are allowed.
This seems bizarre to me. What is the rationale for not allowing, let alone requiring, an impoverished text file when the other submission is TEI?
That you can generate the plain txt file out of the TEI. -- Marcello Perathoner webmaster@gutenberg.org

On Mon, January 23, 2012 3:01 pm, Marcello Perathoner wrote:
On 01/23/2012 10:48 PM, Lee Passey wrote
- one (and only one) TEI file, plus illustrations. No other files are allowed.
This seems bizarre to me. What is the rationale for not allowing, let alone requiring, an impoverished text file when the other submission is TEI?
That you can generate the plain txt file out of the TEI.
You can generate the impoverished text just as easily out of HTML. So why the distinction?

On 01/23/2012 11:21 PM, Lee Passey wrote:
On Mon, January 23, 2012 3:01 pm, Marcello Perathoner wrote:
On 01/23/2012 10:48 PM, Lee Passey wrote
- one (and only one) TEI file, plus illustrations. No other files are allowed.
This seems bizarre to me. What is the rationale for not allowing, let alone requiring, an impoverished text file when the other submission is TEI?
That you can generate the plain txt file out of the TEI.
You can generate the impoverished text just as easily out of HTML. So why the distinction?
Please show me how. The generated text must be ready to post, eg. word wrap, pg header, pg footer, lines between chapters, etc. all must be there and adhere to pg standard. There must be no post-generation edits required. -- Marcello Perathoner webmaster@gutenberg.org

On Mon, January 23, 2012 3:48 pm, Marcello Perathoner wrote:
Please show me how.
You do it the same way you do TEI, except you map the tags differently. We know, for example, that XHTML can easily be converted to TEI, so from there it seems the process could be the same. What I did was to load the HTML into an in-memory DOM (best to use a tag-soup parser, because you can't guarantee well-formedness of the input file [unless you've pre-processed with Tidy]). Then walk the tree, spitting out appropriate text as you go. Some tags get translated output both before and after their children (e.g. <i> and <b>); other tags only need something before or after. If you want to be really careful about line lengths, buffer words at a time, and then decide whether a new-line needs to be added before the word.
The generated text must be ready to post, eg. word wrap, pg header, pg footer, lines between chapters, etc. all must be there and adhere to pg standard. There must be no post-generation edits required.
Do you want me to send you the code (based on the 2005? Tidy code base)? It doesn't spit out the PG garbage text, but that could easily be added. I can't say that it adheres to the PG standard, because I am still unaware that there /is/ any PG standard, but if you were to tell me explicitly what /you/ think the standard is I could tell you whether it satisfies it.
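The tree-walk described above can be sketched in a few lines using Python's stdlib html.parser; this is illustrative, not the actual Tidy-based html2txt code, and the tag tables and marker strings are my own assumptions:

```python
# Sketch of the DOM-walk HTML-to-text idea: some tags emit text both
# before and after their children (<i>, <b>), others only need a
# paragraph break before them. Hypothetical tag tables, not PG's.
from html.parser import HTMLParser

class Html2Txt(HTMLParser):
    # tags whose children get wrapped with markers on both sides
    PAIRED = {"i": ("_", "_"), "em": ("_", "_"), "b": ("*", "*")}
    # tags that only need something before them: a blank line
    BLOCK = {"p", "div", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.PAIRED:
            self.out.append(self.PAIRED[tag][0])
        elif tag in self.BLOCK:
            self.out.append("\n\n")

    def handle_endtag(self, tag):
        if tag in self.PAIRED:
            self.out.append(self.PAIRED[tag][1])

    def handle_data(self, data):
        self.out.append(data)

    def text(self):
        return "".join(self.out).strip()

parser = Html2Txt()
parser.feed("<p>A <i>quiet</i> word.</p><p>Next paragraph.</p>")
print(parser.text())  # note the blank line between the two paragraphs
```

A real converter would also need the word-buffering wrap step and the PG header/footer text, but the walk itself is the core of the method.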

On 01/24/2012 12:25 AM, Lee Passey wrote:
On Mon, January 23, 2012 3:48 pm, Marcello Perathoner wrote:
Please show me how.
You do it the same way you do TEI, except you map the tags differently. We know, for example, that XHTML can easily be converted to TEI, so from there it seems the process could be the same.
What I did was to load the HTML into an in-memory DOM (best to use a tag-soup parser, because you can't guarantee well-formedness of the input file [unless you've pre-processed with Tidy]). Then walk the tree, spitting out appropriate text as you go. Some tags get translated output both before and after their children (e.g. <i> and <b>); other tags only need something before or after. If you want to be really careful about line lengths, buffer words at a time, and then decide whether a new-line needs to be added before the word.
Ok. Now put that into code, runnable on an ubuntu box, and give it to the WWers to evaluate.
The generated text must be ready to post, eg. word wrap, pg header, pg footer, lines between chapters, etc. all must be there and adhere to pg standard. There must be no post-generation edits required.
Do you want me to send you the code (based on the 2005? Tidy code base)? It doesn't spit out the PG garbage text, but that could easily be added. I can't say that it adheres to the PG standard, because I am still unaware that there /is/ any PG standard, but if you were to tell me explicitly what /you/ think the standard is I could tell you whether it satisfies it.
Take a hundred random samples from the archive and pipe the HTML file thru your device and see if something very close to the posted txt file comes out. (You may safely ignore where the lines break, but not the number of empty lines between blocks.) -- Marcello Perathoner webmaster@gutenberg.org

On 1/24/2012 1:46 AM, Marcello Perathoner wrote:
On 01/24/2012 12:25 AM, Lee Passey wrote:
On Mon, January 23, 2012 3:48 pm, Marcello Perathoner wrote:
Please show me how.
You do it the same way you do TEI, except you map the tags differently
Ok. Now put that into code, runnable on an ubuntu box,
I see. You don't just want me to show you how, you actually want me to do all the work. I'm not sure I want to go to that much effort simply to demonstrate to you that it's possible. What I will do for you is send you the C++ code that you can compile and install yourself. Attached are two zip files; one contains an early version of a C++ version of Tidy (circa 2002), the other additional files used to create the html2txt executable. html2txt.zip contains a file named "filelist.txt" which lists the files from each archive necessary to build the program. I neglected at that time to add the "readme.txt" to the zip file, so I am attaching it here separately. I don't know if the gutvol-d list software strips off attachments or not (if it doesn't, it should). If anyone else would like this code and it doesn't come through, contact me directly. The theory of operation of converting HTML to text is really quite simple, and there are plenty of ways to skin this particular cat. If I were doing this again I would probably use Java as it has all the DOM parsing and manipulating functions necessary, if not built in then readily available. With Java it could easily be done in a couple of hundred lines of code and would "run everywhere." The method is so simple and straightforward that probably even BowerBird could do it in Python, and I'm sure it's doable as an XSL script as well.
and give it to the WWers to evaluate.
LOL! I'm not convinced that any of the white-washers could even spell ubuntu, let alone compile, install and use a Linux program. For them I've attached a MSWindows executable built from the attached code.
Take a hundred random samples from the archive and pipe the HTML file thru your device and see if something very close to the posted txt file comes out. (You may safely ignore where the lines break, but not the number of empty lines between blocks.)
This is an exercise left to the reader. Of course, the real test is not equivalency, but whether the output is something the white-washers would accept; no one can judge whether this requirement is met except the white-washers themselves.

On 01/24/2012 06:25 PM, Lee Passey wrote:
I see. You don't just want me to show you how, you actually want me to do all the work. I'm not sure I want to go to that much effort simply to demonstrate to you that it's possible.
I am sure I will *not* do the work for you. You want HTML as master, you write the code.
LOL! I'm not convinced that any of the white-washers could even spell ubuntu, let alone compile, install and use a Linux program. For them I've attached a MSWindows executable built from the attached code.
pglaf.org is an ubuntu box. -- Marcello Perathoner webmaster@gutenberg.org

On 1/24/2012 11:03 AM, Marcello Perathoner wrote:
On 01/24/2012 06:25 PM, Lee Passey wrote:
I see. You don't just want me to show you how, you actually want me to do all the work. I'm not sure I want to go to that much effort simply to demonstrate to you that it's possible.
I am sure I will *not* do the work for you. You want HTML as master, you write the code.
I did, and gave it to you. All you have to do is compile and install it. Or do you want me to be your system administrator as well?
LOL! I'm not convinced that any of the white-washers could even spell ubuntu, let alone compile, install and use a Linux program. For them I've attached a MSWindows executable built from the attached code.
pglaf.org is an ubuntu box.
A classic example of a non-sequitor.

On 01/24/2012 08:21 PM, Lee Passey wrote:
I did, and gave it to you. All you have to do is compile and install it.
That code you gave me doesn't even word wrap according to its help file. Let me explain again: your tool has to automatically produce a PG plain text file that a WWer cannot distinguish from a hand-crafted one. That's the first step to convince a WWer to use your tool.
Or do you want me to be your system administrator as well?
Ask greg for an account at pglaf.org, install your tool there, and convince at least one WWer to use it. You don't need me at all.
pglaf.org is an ubuntu box.
A classic example of a non-sequitor.
That's non-sequitur with 'u'. It's only a non-sequitur because you don't know (don't care?) how PG internally works. pglaf.org is the server the WWers use for posting. Your program has to run there before anything else. -- Marcello Perathoner webmaster@gutenberg.org
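The word-wrap complaint raised above is, on its own, a small amount of code. A minimal greedy sketch, buffering words and starting a new line when the next word would overflow; the 72-column default is my assumption about PG's conventional width:

```python
# Greedy word wrap: accumulate words into a line until adding the next
# word would exceed the width, then flush. Width of 72 is assumed.
def wrap(text: str, width: int = 72) -> str:
    lines, line = [], ""
    for word in text.split():
        if line and len(line) + 1 + len(word) > width:
            lines.append(line)   # current line is full; start a new one
            line = word
        else:
            line = f"{line} {word}" if line else word
    if line:
        lines.append(line)
    return "\n".join(lines)

print(wrap("the quick brown fox jumps over the lazy dog", width=15))
```

This ignores hyphenation and hard line breaks, which a real PG text generator would also have to handle.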

On Tue, Jan 24, 2012 at 12:08 PM, Joshua Hutchinson <joshua@hutchinson.net> wrote:
So, if someone were to start "refactoring" old PG texts into TEI or RST and working with a WWer to repost them ... is this a workable idea?
YES! A clean, rational collection with good metadata. In addition, reworking the "copyright notice" so that it's a Creative Commons licence, which now has a track record and legal support. If we can get consensus on redoing things, we could perhaps do a "summer of code" and raise money for it on Kickstarter. -- Karen Lofstrom

On Tue, January 24, 2012 3:20 pm, Karen Lofstrom wrote:
In addition, reworking the "copyright notice" so that it's a Creative Commons licence, which now has a track record and legal support.
The problem with this idea is that PG's "copyright notice" isn't. It's really a trademark claim. There is (probably) nothing in any PG text that is copyrightable other than the copyright notice itself. The PG notice says, in essence, "the name 'Project Gutenberg' is our trademark. It is intended to assure a certain level of quality (or lack thereof) in documents produced for Project Gutenberg. We can't keep you from modifying the text you get from us, but if you do you can't use our trademark on it because we can't ensure that it satisfies our requirements." This kind of trademark notice is qualitatively different from the copyright notices produced by Creative Commons. Clearly, the PG trademark notice needs to be simplified, probably moved to the end of the document, and in HTML versions included in a classified <div> (e.g. <div class="pgclaims">) so that it can easily be hidden by CSS. But it cannot be replaced by any kind of license from Creative Commons, which all rely on a valid copyright.
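The hideable claims block suggested above could look something like the following; the markup and CSS rule are illustrative, and only the pgclaims class name comes from the message itself:

```html
<!-- trademark/claims text at the end of the document, in a classified
     div so that readers or repackagers can hide it with one CSS rule -->
<div class="pgclaims">
  <p>The name "Project Gutenberg" is a trademark... (claims text here)</p>
</div>

<style>
  .pgclaims { display: none; }  /* hide the claims block entirely */
</style>
```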

On Tue, January 24, 2012 3:20 pm, Karen Lofstrom wrote:
YES! A clean, rational collection with good metadata.
I notice that so far no one has responded to this request. So, Ms. Lofstrom, what kind of metadata did you have in mind, and do you have any proposals for storing and accessing it?

On Mon, Jan 30, 2012 at 10:02 AM, Lee Passey <lee@novomail.net> wrote:
So, Ms. Lofstrom, what kind of metadata did you have in mind, and do you have any proposals for storing and accessing it?
NO, because I'm not a computer-savvy librarian. However, my own experience is that PG (and other online collections, for that matter) fails WOEFULLY at presenting their collections in a manner that is both accessible to the casual browser, and high-powered and accurate enough for researchers looking for specific information. It's not just me; I've certainly read enough online complaints by librarians and academics trying to navigate the e-maze. MARC records would be a start. However, I understand that the software is fussy about the wording of queries, which is a big drawback. Some changes needed there. Also, something like the Amazon and Netflix rating systems (if you liked X, you might like Y) would also be useful. -- Karen Lofstrom

On Mon, Jan 30, 2012 at 12:33 PM, Karen Lofstrom <klofstrom@gmail.com> wrote:
Also, something like the Amazon and Netflix rating systems (if you liked X, you might like Y) would also be useful.
Where's the data coming from? Amazon has a huge amount of data in terms of user JoeBob bought these books, claimed to own those books, and put that third group of books on his Wishlist. We don't have that. We might be able to derive data from LibraryThing, with permission. It wouldn't quite fit our needs, however, being based on a LT "work" instead of a more specific Project Gutenberg "ebook". -- Kie ekzistas vivo, ekzistas espero.

On Mon, Jan 30, 2012 at 10:07 PM, David Starner <prosfilaes@gmail.com> wrote:
Where's the data coming from? Amazon has a huge amount of data in terms of user JoeBob bought these books, claimed to own those books, and put that third group of books on his Wishlist. We don't have that.
The same way Netflix built up its recommendations. Ask people to rate. (And Amazon goes by user ratings as well as purchases.) Sure, it would be slow, and the first iterations would be riddled with errors. But it's not as if it were a new technology at this point. -- Karen Lofstrom

Hi, Besides the naivety of what is involved, why should PG want to offer ratings or suggestions? PG is not selling anything! regards Keith. On 31.01.2012 at 10:26, Karen Lofstrom wrote:
On Mon, Jan 30, 2012 at 10:07 PM, David Starner <prosfilaes@gmail.com> wrote:
Where's the data coming from? Amazon has a huge amount of data in terms of user JoeBob bought these books, claimed to own those books, and put that third group of books on his Wishlist. We don't have that.
The same way Netflix built up its recommendations. Ask people to rate. (And Amazon goes by user ratings as well as purchases.)
Sure, it would be slow, and the first iterations would be riddled with errors. But it's not as if it were a new technology at this point.
-- Karen Lofstrom _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

why should PG want to offer ratings or suggestions?
How does a real-world "customer" of PG who doesn't already know a title and an author, and who already knows that that book is at PG, find a book they want to read? Currently PG would be ahead even to offer an "I'm feeling lucky" button that simply throws a book at the "customer" at random. Magiccatalog is an example of that kind of "I'm feeling lucky" browsing, where "at random" the customer says "H'm, that one might look interesting" and clicks on one button to download to their reader device.
"PG is not selling anything."
PG is "selling" their contribution to the world. The price may be free, but that doesn't mean you get anywhere if you are not willing to "sell it." If PG doesn't "advertise" their efforts then would-be volunteers never realize that there is an opportunity to make a contribution to the effort. Others are quite happy to take PG's "advertisements" off the books and substitute their own advertisements so that the end users don't know where the books are actually coming from. It's not *just* that these "competitors" want to use the space to advertise themselves. It is also more profitable to them if PG's efforts fail.

David Widger and I worked through PG#1 to about PG#5000, 2-3 years ago, reposting many of them into the new structure. What's left (some hundreds of etexts) are all over the place, standards (or the lack thereof)-wise. Footnotes are done many different ways, italics are either absent or represented by all-caps, publisher information is missing, no illustrations, line lengths too short/long, odd handling of page numbers, etc, etc, etc. Some (maybe most) might be easier to re-do from scratch than to try and fix. Most etexts from #5000-#9999 were done by DP, weren't looked at by David or me, and these days are only looked at (and maybe reposted into the new structure) if there's an errata report. I can't say how many are text-only, and how many are text+HTML. I reposted all the audiobooks out of the old structure into the new one, a year or two ago, but there are still upwards of 4000 etexts left in the old folder structure in the 5000-9999 range. It'll be interesting to see who, if anyone, steps up to this project. Al -----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Joshua Hutchinson Sent: Tuesday, January 24, 2012 2:08 PM To: gutvol-d@lists.pglaf.org Subject: Re: [gutvol-d] Producing epub ready HTML So, if someone were to start "refactoring" old PG texts into TEI or RST and working with a WWer to repost them ... is this a workable idea? I'd love to see the PG corpus redone as a "master format" system (and the current filesystem supports "old" format files in a subdirectory, so if someone wanted to get the old original hand-made files, they could). I'm not particularly wedded to any master format. Hell, if someone came up with a sufficiently constrained HTML vocabulary that could be easily used to "generate" the additional formats necessary, I'm good with that. But before anyone will start doing this work, there needs to be a consensus from PG (I'm looking at you, Greg!)
that the work will be acceptable. A half-assed "master format" system is no master format system at all. I'm even ok with working up the system as you go (i.e., start with "simple" fiction works and make sure the system handles them before throwing more and more complex works at it, tweaking and fixing in the time honored method of "incremental development"). Maybe we start this process on a semi-private mirror of the PG corpus and only when it reaches a critical mass of some sort it gets moved over. But an official notice that this project has some backing is necessary or we'll just keep seeing everything running around in ten different directions and nothing ever getting done. Josh

Al>It'll be interesting to see who, if anyone, steps up to this project. Again, I tried "stepping up to the bat" some time ago. As a test of the receptiveness of PG to people actually trying to rework old and crusty versions, I redid 76 as 32325, using a slightly different edition to avoid stepping on anyone's toes and to test PG's response. PG posted it, but continues to "advertise" the old version via download count such that almost no real customer ever finds the new version. Go try and find it without looking at the number above, and see what I mean. Word diff comparisons show that the old version, 76, is in error in about 1500+ places, by today's standards, at least. Where by "in error" I mean 1500 places 76 clearly differs from the text found in the original book. Again, it's not clear to me how anyone would, in practice, be allowed to rework an old crusty version such as 76, even if they wanted to. I believe, in practice, the WWers would reject the effort after the volunteer has put in the 40+ hours of work.

"James" == James Adcock <jimad@msn.com> writes:
Al> It'll be interesting to see who, if anyone, steps up to this Al> project. James> Again, I tried "stepping up to the bat" some time ago. As James> a test of the receptiveness of PG to people actually trying James> to rework old and crusty versions, I redid 76 as 32325 James> using a slightly different edition to avoid stepping on James> anyone's toes and to test PG's response. James> PG posted it, but continues to "advertise" the old version James> via download count such that almost no real customer ever James> finds the new version. Go try and find it without looking James> at the number above, and see what I mean. PG 76 (Huckleberry Finn) has been reposted on September 21, 2011. It has an illustrated HTML version. In this case the newest version is advertised first. Carlo

Carlo>PG 76 (Huckleberry Finn) has been reposted on September 21, 2011. It has an illustrated HTML version. In this case the newest version is advertised first. If you actually take a look inside what is posted there you will find that which is "new" is still old again.

On 01/24/2012 11:08 PM, Joshua Hutchinson wrote:
So, if someone were to start "refactoring" old PG texts into TEI or RST and working with a WWer to repost them ... is this a workable idea?
More than a technical challenge it would be a political one. I can convert a novel the size of Pride and Prejudice into RST in about an hour. More if there is formatting or images to recover. But I'd prefer to avoid the riot that will ensue if we start to reformat DP texts. We could start redoing the top 100 list excluding everything that is too hard and everything made by DP.
Maybe we start this process on a semi-private mirror of the PG corpus and only when it reaches a critical mass of some sort it gets moved over. But an official notice that this project has some backing is necessary or we'll just keep seeing everything running around in ten different directions and nothing ever getting done.
A semi-official branch would be a good occasion to ditch the old WWer workflow in favor of a source repository (git or mercurial) that holds all the masters. Should we reserve a range of ebook nos. or shadow the existing ones? -- Marcello Perathoner webmaster@gutenberg.org

Starting at the most basic level, is there any good reason not to use utf-8 as the basic encoding standard for everything including plain-text? Transforming to and from there into any other encoding is I think a pretty well-trod path by now.
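That path is indeed well trodden. In Python, for instance, converting a legacy-encoded plain-text file to UTF-8 is a decode/encode pair; latin-1 here is just an example of an older PG encoding:

```python
# Hypothetical round trip: decode bytes with the legacy encoding, then
# re-encode as UTF-8. Latin-1 as the source encoding is an assumption.
latin1_bytes = "Çà et là, naïve café".encode("latin-1")

text = latin1_bytes.decode("latin-1")   # interpret the legacy bytes
utf8_bytes = text.encode("utf-8")       # store everything as UTF-8

assert utf8_bytes.decode("utf-8") == text  # lossless round trip
```

The hard part, as the following messages show, is not the transform but agreeing on the conventions (BOM, line endings, code-point semantics) around it.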

What kind of automated build process is used by PG to generate and stage derivative documents from sources? Are the makefiles in source control somewhere? I'm presuming there must be some kind of repeatable make process, with hopefully automated unit tests to validate the build. Otherwise it seems to me you would be stuck with a stack of unique one-off versions you couldn't even duplicate from source, much less upgrade or migrate.

Speaking of sources, I remember writing a utility for DP to rename and upload the project source materials, including images, for DP projects, properly named and assembled to PG's wishes. Having not heard from anyone lately, either it's working perfectly (which is unlikely), or PG isn't staying current on source materials. One of the consequences for DP is that it's not possible for them to recreate plain-text and html files from single sources because everything between the last of the Rounds and what is given to PG is discarded (or never leaves the hands of the post-processor in the first place.) The workflow is pretty broken, isn't it?

On Tue, Jan 24, 2012 at 05:15:27PM -0800, don kretz wrote:
Speaking of sources, I remember writing a utility for DP to rename and upload the project source materials, including images, for DP projects, properly named and assembled to PG's wishes. Having not heard from anyone lately, either it's working perfectly (which is unlikely), or PG isn't staying current on source materials.
I count around 7000 projects with page images. Just look for page-images subdirectories. I don't know the relative proportion of DP vs. non-DP projects that choose to provide images. Page images are very much welcome, and have been for a number of years (since 2004 or so). There is a simple file naming scheme. In cases where IA or GB provides page images, page images might not be seen as so important to include with PG titles. -- Greg
One of the consequences for DP is that it's not possible for them to recreate plain-text and html files from single sources because everything between the last of the Rounds and what is given to PG is discarded (or never leaves the hands of the post-processor in the first place.)
The workflow is pretty broken, isn't it?

Gutcheck/Jeebies/Gutspell - their sources are at http://gutcheck.sourceforge.net/etc.html PG's posting software - no, but since the only person who's modified it since I've been WWing (four years) is Jim Tinsley, its author, version control is probably overkill. PG's server-based stuff - Greg/Marcello will have to answer that. Al -----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of don kretz Sent: Tuesday, January 24, 2012 4:48 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Producing epub ready HTML Not to mention the source for all the PG software?

Don>Starting at the most basic level, is there any good reason not to use utf-8 as the basic encoding standard for everything including plain-text? BOM or no BOM? Unix or Windows style line breaks? Line breaks meaning paragraph separations or line breaks meaning, well, whatever it is that PG means by line breaks? "utf-8" meaning that "we" use the interpretation of the code points as defined by Unicode, or meaning "we" invent our own meanings for those code points?

On Tue, Jan 24, 2012 at 5:33 PM, James Adcock <jimad@msn.com> wrote:
Don>Starting at the most basic level, is there any good reason not to use utf-8 as the basic encoding standard for everything including plain-text?
BOM or no BOM?
Pick one - but it's pretty unusual to use a BOM any more.
Unix or Windows style line breaks?
Pick one and filter the other.
Line breaks meaning paragraph separations or line breaks meaning, well, whatever it is that PG means by line breaks?
Would we change what we're doing now?
"utf-8" meaning that "we" use the interpretation of the code points as defined by Unicode, or meaning "we" invent our own meanings for those code points?
Would you seriously consider defining your own?

Line breaks meaning paragraph separations or line breaks meaning, well, whatever it is that PG means by line breaks?
Don>Would we change what we're doing now? Only if PG intended to rejoin the rest of the world.
"utf-8" meaning that "we" use the interpretation of the code points as defined by Unicode, or meaning "we" invent our own meanings for those code points?
Don>Would you seriously consider defining your own? PG already has redefined its own meaning for already-defined utf-8 code points, but PG "lifers" are so ingrained in smelling their own flower garden that they don't even notice this.

On Tue, January 24, 2012 6:33 pm, James Adcock wrote:
Don>Starting at the most basic level, is there any good reason not to use utf-8 as the basic encoding standard for everything including plain-text?
No.
BOM or no BOM?
Depends on the file. XML files (XHTML, TEI, etc.) are guaranteed to be ASCII in their first line, and that first line declares the encoding, so no BOM is necessary (and would probably confuse some tools). Subtle markup languages like reStructuredText which have no prolog need some mechanism to indicate that they contain UTF-8 encodings (to distinguish between that, latin-1 or MacRoman) so may need to have a BOM.
Unix or Windows style line breaks?
Don't know that it matters, but my preference would be Unix.
Line breaks meaning paragraph separations or line breaks meaning, well, whatever it is that PG means by line breaks?
All lines will wrap when displayed, so a mechanism is needed to indicate "this is not just whitespace, it really is a new line!" All markups have a mechanism for this purpose. For ease of proofreading, I recommend that text lines be broken with insignificant new-line characters at the same points as in the original text to the extent possible (hyphenated lines cannot follow this rule, and should be broken at the next, or previous, available whitespace).
"utf-8" meaning that "we" use the interpretation of the code points as defined by Unicode, or meaning "we" invent our own meanings for those code points?
Unicode without composition. Most have argued that UTF-8 requires Unicode. Technically you can UTF-8 encode any set of code points, but for this project it would serve no purpose.

Lee>Unicode without composition. Most have argued that UTF-8 requires Unicode. Technically you can UTF-8 encode any set of code points, but for this project it would serve no purpose.
I suggest that PG has historically "reserved" some commonly used Unicode code points for their own special purposes, and it would at least be wise to take the opportunity to choose much less commonly used code points for those special purposes, or alternatively use uncommon code sequences for those special purposes. The way the situation sits right now, one cannot write automatic tools to reliably process PG "utf-8" files.

Hi Lee, On 25.01.2012 at 20:22, Lee Passey wrote:
On Tue, January 24, 2012 6:33 pm, James Adcock wrote:
Don>Starting at the most basic level, is there any good reason not to use utf-8 as the basic encoding standard for everything including plain-text?
No.
BOM or no BOM?
Depends on the file. XML files (XHTML, TEI, etc.) are guaranteed to be ASCII in their first line, and that first line declares the encoding, so no BOM is necessary (and would probably confuse some tools). Subtle markup languages like reStructuredText which have no prolog need some mechanism to indicate that they contain UTF-8 encodings (to distinguish between that, latin-1 or MacRoman) so may need to have a BOM.
Just to be picky, but you err here: the above-mentioned files are not guaranteed to be ASCII, only txt. Yet, as you state, the first lines can contain encoding information. BOMs should generally not affect processing unless one is accessing the file on the byte level. Of course this depends on the system and how the programs are compiled to interact with the file system.
regards Keith.
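For what it's worth, tolerating an optional UTF-8 BOM when reading a text file takes only a couple of lines in most languages. A minimal sketch in Python (the function name is illustrative):

```python
import codecs

def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 text, stripping a leading BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

Python's built-in "utf-8-sig" codec does the same thing when opening a file, so a BOM need not break any tool that is willing to look for it.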

On Thu, January 26, 2012 1:57 am, Keith J. Schultz wrote:
Hi Lee,
Am 25.01.2012 um 20:22 schrieb Lee Passey:
Depends on the file. XML files (XHTML, TEI, etc.) are guaranteed to be ASCII in their first line, and that first line declares the encoding, so no BOM is
Just to be picky. But, you err here. The above mentioned files are not guaranteed to be ASCII. only txt. Yet, as you state the first lines can contain encoding information.
XML files are not guaranteed to be "txt." In fact, the term "txt" has virtually no meaning, so you can't say anything is guaranteed to be "txt" unless you first define the meaning of that term. It is true that XML files can be encoded using UTF-16, in which case the first line will /not/ be ASCII, and a BOM should be required (what's the default byte ordering of UTF-16 if there's no BOM, I wonder?) But if the file is not UTF-16 then at least the first line is guaranteed to be ASCII, as it must be the "<?xml" signature, and that signature is composed only of ASCII characters. This is possible because ASCII, Latin-1, Windows-1252, and UTF-8 are all identical for values in the ASCII range. So if the first two bytes of an XML file are not a UTF-16 BOM, an automated process can safely read the first line of text and determine the encoding of the entire file from there.

Am 26.01.2012 um 19:53 schrieb Lee Passey:
On Thu, January 26, 2012 11:29 am, Lee Passey wrote:
(what's the default byte ordering of UTF-16 if there's no BOM, I wonder?)
Big endian.
Wrong! Little endian is wrong, too!
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

On 01/26/2012 07:29 PM, Lee Passey wrote:
It is true that XML files can be encoded using UTF-16, in which case the first line will /not/ be ASCII, and a BOM should be required (what's the default byte ordering of UTF-16 if there's no BOM, I wonder?)
You can detect a lot by reading the first 4 bytes and knowing they must represent a prefix of '<?xm'.
But if the file is not UTF-16 then at least the first line is guaranteed to be ASCII
Wrong. It can also be UCS-4, UCS-2, or EBCDIC. -- Marcello Perathoner webmaster@gutenberg.org

On Thu, January 26, 2012 12:16 pm, Marcello Perathoner wrote:
On 01/26/2012 07:29 PM, Lee Passey wrote:
But if the file is not UTF-16 then at least the first line is guaranteed to be ASCII
Wrong. It can also be UCS-4, UCS-2, or EBCDIC.
I stand corrected. The first line is guaranteed to start with "<?xml", so what a program needs to do is read the first 5 bytes and then figure out which of the "other encodings" yields that string. Would PG ever consider using one of these "other encodings" or is this discussion just academic?
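The sniffing logic Lee describes is spelled out in Appendix F of the XML 1.0 specification: check for a BOM first, then match the first four bytes against the possible encodings of "<?xm". A rough sketch (not exhaustive, and the returned labels are just illustrative):

```python
def sniff_xml_encoding(first_bytes: bytes) -> str:
    """Guess an XML file's encoding family from its leading bytes,
    following Appendix F of the XML 1.0 spec (BOM, then '<?xm' signature)."""
    boms = {
        b"\x00\x00\xfe\xff": "UCS-4 (big-endian)",
        b"\xff\xfe\x00\x00": "UCS-4 (little-endian)",
        b"\xfe\xff": "UTF-16 (big-endian)",
        b"\xff\xfe": "UTF-16 (little-endian)",
        b"\xef\xbb\xbf": "UTF-8",
    }
    for bom, name in boms.items():
        if first_bytes.startswith(bom):
            return name
    sigs = {
        b"\x00\x00\x00<": "UCS-4 (big-endian)",
        b"<\x00\x00\x00": "UCS-4 (little-endian)",
        b"\x00<\x00?": "UTF-16 (big-endian)",
        b"<\x00?\x00": "UTF-16 (little-endian)",
        b"<?xm": "ASCII-compatible (read encoding= from the declaration)",
        b"\x4c\x6f\xa7\x94": "EBCDIC",  # '<?xm' in EBCDIC
    }
    for sig, name in sigs.items():
        if first_bytes.startswith(sig):
            return name
    return "unknown (no declaration; assume UTF-8)"
```

For the ASCII-compatible case the sniffer only narrows the family; the exact charset still has to be read out of the `encoding=` attribute in the declaration itself.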

On 01/25/2012 01:45 AM, don kretz wrote:
Starting at the most basic level, is there any good reason not to use utf-8 as the basic encoding standard for everything including plain-text? Transforming to and from there into any other encoding is I think a pretty well-trod path by now.
No. The other encodings are just a big PITA. The last OS that didn't support Unicode was Windows 3.1. Anybody still using that one? The PG website already offers UTF-8 text only. -- Marcello Perathoner webmaster@gutenberg.org

On Jan 25, 2012, at 3:28 AM, Marcello Perathoner wrote:
The PG website already offers UTF-8 text only.
For a while I was sending up UTF-8 text only along with HTML. I stopped when it seemed that was causing a lot of work for the WWers. As of now, I send ASCII if that's sufficient, Latin-1 if it has characters in the Latin-1 set, and UTF-8 if it has characters not in Latin-1. There are two consequences: (1) everything that goes up in Latin-1 could go up in UTF-8 instead but doesn't, and (2) I don't send up plain text with curly quotes at all.

I work in UTF-8. It would be easiest for me to send up UTF-8. But the WW tools to check the submission, like gutcheck, struggle with UTF-8. The HTML I send up is UTF-8, and that survives because the WWers don't have to check it. They check the text file, which should be ASCII if it can be, and only UTF-8 instead of Latin-1 if there are characters that are absolutely necessary and not in Latin-1. Curly quotes are not viewed as necessary to the text version. Even an oe ligature isn't strong enough to justify UTF-8.

Seems to me it's all about the tools. The WWers always seem overloaded, and sending UTF-8 up makes that worse. If they had better tools to do their job, actually getting UTF-8 up on the PG website wouldn't be as problematic.

This is an outside view of the problem. I am not a WWer. I'd love to hear from a WWer about Marcello's comment. Though correct as stated, perhaps it would be more accurate to say "The PG website already offers UTF-8 text only, but please don't send it to us unless absolutely necessary." I hope someday that statement will not be true, but I believe it is now.

--Roger

-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Roger Frank Sent: Wednesday, January 25, 2012 6:30 AM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Producing epub ready HTML

I work in UTF-8. It would be easiest for me to send up UTF-8. But the WW tools to check the submission, like gutcheck, struggle with UTF-8.

Custom UTF8 files aren't difficult to handle. I'll assume English-language submissions for the following. If a UTF8 file arrives *with* an accompanying Latin1 or ASCII file (DP's normal practice), I'll check the latter file(s) with Gutcheck/Jeebies/Gutspell, and if any problems are found, I (and the other WWers) will correct the submission's text/HTML files accordingly. (I use Windows Notepad for Latin1/ASCII text/HTML files, and SCUnipad for UTF8 text/HTML files.) If a UTF8 file arrives *without* an accompanying Latin1/ASCII equivalent, I'll first use Unitame in its non-convert mode to see what Unicode characters are in the file. If there's nothing it can't convert, I'll use Unitame in its convert mode to generate a Latin1 file from the UTF8 file, and do the normal checks on the Latin1 file, making any corrections as above. Both text files will be posted. PG's posting software will generate an ASCII file from the Latin1 file, except for most foreign-language files, when it will ask if an ASCII file should be generated.

As Roger mentions, curly quotes and oe-ligatures are insufficient reason for creating a UTF8 file. This is mentioned somewhere in DP's forums--I forget where. Use normal keyboard quotes, and "oe" or "[oe]" in text files; curly quotes and oe-lig (or their entities) in HTML files. UTF8 files are pretty much mandatory only when a text is in languages such as Greek, Hebrew, Chinese, Japanese, Cyrillic, etc., and a Latin1/ASCII conversion isn't practical. For English texts containing smatterings of Greek, the submitter has two options: transliterate the Greek characters as per PG's Greek How-To, submitting only a Latin1 file, or submit a UTF8 text with the actual Greek characters along with a transliterated Latin1 file.

As for UTF8 files in Oriental/Cyrillic/etc. languages, Roger is correct in saying that Gutcheck and the other normal tools don't work very well on these files, but in the cases of the aforesaid languages it probably doesn't matter, since none of the WWers speak/read them anyway, so we have no hope of telling if, for example, something is misspelled.

I should mention that one thing that does slow things down for the WWers is text/HTML files with CR-only line endings. CR/LF is far preferable, since all the WWers use Windows. (I'm aware that there are fix utilities, but CR/LFs make things easier to start with.) It's also helpful to the WWers that text files are identified as to their character set. UTF8 files should have "-utf8" embedded in their filename, e.g. "somefile-utf8.txt". Latin1 and ASCII files much the same: "-ltn1", "-lt1", "-latin1", "-iso" or similar for Latin1 files, and "-asc" or "-ascii" for ASCII text files.

Al
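The "non-convert mode" check Al describes, i.e. listing which characters in a UTF-8 file fall outside Latin-1, is easy to approximate. Unitame's actual behavior may differ; this is just a sketch:

```python
from collections import Counter

def non_latin1_chars(text: str) -> Counter:
    """Count the characters that cannot be represented in Latin-1,
    i.e. the ones that would be lost in a UTF-8 -> Latin-1 conversion."""
    return Counter(ch for ch in text if ord(ch) > 0xFF)
```

If the resulting counter is empty (or contains only easily transliterated characters such as the oe-ligature), a Latin-1 companion file can be generated safely.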

On 01/25/2012 08:35 PM, Al Haines wrote:
As Roger mentions, curly quotes and oe-ligatures are insufficient reason for creating a UTF8 file.
As I already said I strongly disagree. You don't need no `reason´ to create an utf8 file. It's the standard encoding all over the world. It is the one file you should create first. You need a reason to create those other encodings that nobody uses any more. But what that reason could be is beyond me. I'd like to hear one, *only one*, good argument in favour of having those other encodings around. (The argument: "Because our tools are too decrepit to handle unicode." is not a good argument.) -- Marcello Perathoner webmaster@gutenberg.org

As Roger mentions, curly quotes and oe-ligatures are insufficient reason for creating a UTF8 file.
Marcello>As I already said I strongly disagree. OK, but your "vote" doesn't count, Marcello, because it is the WWers who "throw it back in our faces" when we do something to make them unhappy.

Hi All, Marcello is right: utf8 is pretty much the standard these days. It should also be the standard used by PG. I do disagree that the other formats are not used anymore; plenty of websites are not using utf8. If the tools at PG do not handle utf8 well, they need to be reworked. regards Keith. Am 25.01.2012 um 21:05 schrieb Marcello Perathoner:
On 01/25/2012 08:35 PM, Al Haines wrote:
As Roger mentions, curly quotes and oe-ligatures are insufficient reason for creating a UTF8 file.
As I already said I strongly disagree.
You don't need no `reason´ to create an utf8 file. It's the standard encoding all over the world. It is the one file you should create first.
You need a reason to create those other encodings that nobody uses any more. But what that reason could be is beyond me.
I'd like to hear one, *only one*, good argument in favour of having those other encodings around. (The argument: "Because our tools are too decrepit to handle unicode." is not a good argument.)
-- Marcello Perathoner webmaster@gutenberg.org

On Wed, 25 Jan 2012, Al Haines wrote:
UTF8 files are pretty much mandatory only when a text is in languages such as Greek, Hebrew, Chinese, Japanese, Cyrillic, etc., and a Latin1/ASCII conversion isn't practical.
Don't forget other languages that use the roman alphabet, but characters outside of latin-1. Examples that have been posted to PG include Polish, Czech, Hungarian and Esperanto. --Andrew

Hi Al, Am 25.01.2012 um 20:35 schrieb Al Haines:
I should mention that one thing that does slow things down for the WWers are text/HTML files with CR-only line endings. CR/LF is far preferable, since all the WWers use Windows. (I'm aware that there are fix utilities, but CR/LF's make things easier to start with.)

You can't be serious! For over 10 years I have not needed to worry about the line endings inside of a file. Any decent editor worth its fiddles will identify the line endings, internally convert to the native format, and either save in the native format or convert back to the original before saving. Furthermore, these files are processed ahead of time and could be automagically converted to CR/LF. If the tools at PG cannot do this, it is about time they did.
To my knowledge HTML is supposed to be submitted as a zipped file. Are zippers expected to convert the line endings into the native format!
It's also helpful to the WWers that text files are identified as to their character set. UTF8 files should have "-utf8" embedded in their filename, e.g. "somefile-utf8.txt". Latin1 and ASCII files much the same: "-ltn1", "-lt1", "-latin1", "-iso" or similar for Latin1 files, and "-asc" or "-ascii" for ASCII text files.
Good idea. regards Keith.

On 01/26/2012 04:24 AM, Keith J. Schultz wrote:
To my knowledge HTML is supposed to be submitted as a zipped file. Are zippers expected to convert the line endings into the native format!
The vast majority of zip utilities are capable of converting the line-endings in the files given as input (or to be output) back-and-forth from CRLF/LF upon request. In practice, this capability is rarely needed, because the vast majority of text editors are capable of detecting line ending style at file open, of optionally converting them, and of saving in any of the typical styles (CRLF/LF/CR). David
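Normalizing line endings is also trivial to script ahead of submission; a minimal sketch that maps CR-only, LF-only, or mixed endings to CR/LF:

```python
def to_crlf(data: bytes) -> bytes:
    """Normalize CR-only, LF-only, or mixed line endings to CR/LF."""
    # Collapse CRLF to LF first so existing CRLFs aren't doubled,
    # then map lone CRs to LF, then expand every LF to CRLF.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return data.replace(b"\n", b"\r\n")
```

Working on bytes rather than decoded text keeps the conversion independent of whether the file is ASCII, Latin-1, or UTF-8.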

Hi David, I believe I mentioned modern editors in the rest of my post. Furthermore, the standard mode of zippers is to convert to the native format. I was indirectly raising the question of why line endings pose a problem for the WWers. regards Keith. Am 26.01.2012 um 15:55 schrieb D Garcia:
On 01/26/2012 04:24 AM, Keith J. Schultz wrote:
To my knowledge HTML is supposed to be submitted as a zipped file. Are zippers expected to convert the line endings into the native format!
The vast majority of zip utilities are capable of converting the line-endings in the files given as input (or to be output) back-and-forth from CRLF/LF upon request. In practice, this capability is rarely needed, because the vast majority of text editors are capable of detecting line ending style at file open, of optionally converting them, and of saving in any of the typical styles (CRLF/LF/CR).
David

On 01/25/2012 03:30 PM, Roger Frank wrote:
I don't send up plain text with curly quotes at all.
That is a mistake. There are lots of automated checks you could do in gutcheck et al. if you kept double quotes, single quotes and apostrophes separate. -- Marcello Perathoner webmaster@gutenberg.org

... and primes, double primes, dittos, minutes, seconds, ... On Wed, Jan 25, 2012 at 11:38 AM, Marcello Perathoner < marcello@perathoner.de> wrote:
On 01/25/2012 03:30 PM, Roger Frank wrote:
I don't send up plain text with
curly quotes at all.
That is a mistake. There are lots of automated checks you could do in gutcheck et al. if you kept double quotes, single quotes and apostrophes separate.
-- Marcello Perathoner webmaster@gutenberg.org

I don't send up plain text with curly quotes at all.
Marcello>That is a mistake.... I'm curious if anyone has any idea what percentage of submitted texts use "curlies" vs. just punting and going with ASCII-style "straights" ? My assumption has been that the great majority of submitted texts just use "straights."

"Jim" == Jim Adcock <jimad@msn.com> writes:
>>> I don't send up plain text with curly quotes at all. Marcello> That is a mistake.... Jim> I'm curious if anyone has any idea what percentage of Jim> submitted texts use "curlies" vs. just punting and going with Jim> ASCII-style "straights" ? My assumption has been that the Jim> great majority of submitted texts just use "straights." When I checked 10 consecutive random ones, 9 used straight quotes. IIRC they were 10 different submitters. Some of the DP perfectionists say that they like them better. Carlo

When I checked 10 consecutive random ones, 9 used straight quotes. IIRC they were 10 different submitters.
I did curlies for a while but at least on the early Kindles the left curly double was really really ugly and didn't even mirror the right curly double! And curlies look really stupid when you get them wrong -- like when you rely on the assumption that automated software can convert straights to curlies "reliably."

Ideally it would be the reader's choice, not the producer's choice. On Wed, Jan 25, 2012 at 2:35 PM, Carlo Traverso <traverso@posso.dm.unipi.it>wrote:
"Jim" == Jim Adcock <jimad@msn.com> writes:
I don't send up plain text with curly quotes at all.
Marcello> That is a mistake....
Jim> I'm curious if anyone has any idea what percentage of Jim> submitted texts use "curlies" vs. just punting and going with Jim> ASCII-style "straights" ? My assumption has been that the Jim> great majority of submitted texts just use "straights."
When I checked 10 consecutive random ones, 9 used straight quotes. IIRC they were 10 different submitters.
Some of the DP perfectionists say that they like them better.
Carlo

On Wed, Jan 25, 2012 at 2:51 PM, don kretz <dakretz@gmail.com> wrote:
Ideally it would be the reader's choice, not the producer's choice.
That doesn't make any sense. They're still curly quotes, even if the font used to render them is straight. And they're not done, because they're not cheap to do; there's always some cost in going the extra step, and I think there's a valid question here about whether it's worth it. -- Kie ekzistas vivo, ekzistas espero.

If, at any point in your process, you decide it's valuable to consider what the text is, as compared to what it looks like from point to point, then it might comprise some amount of additional work. OTOH, if you begin to suspect that the structure of the document will help you construct a better product, then I think the identification of sentences and quotations leads to a fair number of validation opportunities that can be automated to provide exceptions to inspect, with good chances for improvement. Once you have those, curling the quotes is cake. On Wed, Jan 25, 2012 at 5:08 PM, David Starner <prosfilaes@gmail.com> wrote:
On Wed, Jan 25, 2012 at 2:51 PM, don kretz <dakretz@gmail.com> wrote:
Ideally it would be the reader's choice, not the producer's choice.
That doesn't make any sense. They're still curly quotes, even if the font used to render them is straight. And they're not done, because they're not cheap to do; there's always some cost in going the extra step, and I think there's a valid question here about whether it's worth it.
-- Kie ekzistas vivo, ekzistas espero.

David>And they're not done, because they're not cheap to do; there's always some cost in going the extra step, and I think there's a valid question here about whether it's worth it. A char count on a recent representative novel I did had 11,000+ straights in it, so truly checking all this stuff is not trivial. This compares to 6,500+ scannos corrected in that novel. There are bad algorithms for changing straights to curlies, and good algorithms for changing straights to curlies, but there are not infallible algorithms, which means you're going to have to re-check every quote mark and apostrophe. [This assumes that one is using the typical OCR which returns straights] And again there are lots of reader devices out there which are hard-wired to fonts which have really ugly implementations of curlies. And I have yet to see a hardware device which allows one to "turn off" curlies once they're in there. Still, there are books out there that really *do* need to be done in curlies. "The apostrophe is different from the closing single quotation mark (usually rendered identically but serving a different purpose), from the similar-looking prime ( ′ ), which is used to indicate measurement in feet or arcminutes, as well as for various mathematical purposes, and from the ʻokina ( ʻ ), which represents a glottal stop in Polynesian languages" "Sometimes quotations are nested in more levels than inner and outer quotation. Nesting levels up to five can be found in the Christian Bible. In these cases, questions arise about the form (and names) of the quotation marks to be used. The most common way is to simply alternate between the two forms...."
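To illustrate why the good algorithms still aren't infallible, here is a naive regex-based sketch of straight-to-curly conversion (purely illustrative). It handles the common cases but, as the comment notes, cannot tell a leading apostrophe ('tis, '90s) from an opening single quote, which is exactly why every quote mark still needs a human re-check:

```python
import re

def curl_quotes(text: str) -> str:
    """Naive straight-to-curly conversion. Not infallible: a leading
    apostrophe ('tis, '90s) looks exactly like an opening single quote."""
    # Apostrophes between word characters: don't -> don't with U+2019.
    text = re.sub(r"(\w)'(\w)", "\\1\u2019\\2", text)
    # A quote at start of text or after whitespace opens; the rest close.
    text = re.sub(r'(^|(?<=\s))"', "\u201c", text)
    text = text.replace('"', "\u201d")
    text = re.sub(r"(^|(?<=\s))'", "\u2018", text)
    text = text.replace("'", "\u2019")
    return text
```

Nested quotations, quotes spanning paragraphs, and primes/dittos all defeat rules this simple, so the 11,000+ marks in a typical novel really do have to be reviewed.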

On 01/26/2012 04:16 PM, Jim Adcock wrote:
And again there are lots of reader devices out there which are hard-wired to fonts which have really ugly implementations of curlies.
A master format should never be designed with current devices in mind. You are making the same error as Michael when he hard-wired the width of texts to 80 characters. -- Marcello Perathoner webmaster@gutenberg.org

A master format should never be designed with current devices in mind.
Well, I guess a master format can include markup such as <q> ... </q> and still leave submitters the option not to do the work. Which I admit is somewhat of a cop-out, when I acknowledge that a book I just did could have used the curlies, but I didn't do them -- because no one else does them either. I also did a book that really needed half-spaces, and I didn't do those either. But I have a better excuse there, in that PG isn't ready to accept half-spaces. PG/DP also have a bunch of bunged-up rules about how to compromise other punctuation which grate on me, but how many fights are you going to fight, especially when you know up front you're going to lose them, and by not fighting the fight you end up having to do a lot less tiresome work? My bottom line excuse is that I'm way happy to help anyone get set up for another pass on any of the works I have submitted if they claim they want to do a much more detailed job.

On Wed, January 25, 2012 6:08 pm, David Starner wrote: [snip]
there's always some cost in going the extra step, and I think there's a valid question here about whether it's worth it.
Yes, that is a valid question. Not that many years ago I felt that curly quotes in a document was a silly affectation. Over time that view has moderated, and I now prefer reading curly quotes, so long as I can pick a font that does them fairly well (I can't even imagine using a device or software that doesn't let me select the font and install new ones). I've finally twisted Abbyy's arm far enough that it will preserve curly quotes, and I'm working on a system using Hunspell that will do a spell check on documents that contain them. So far, I think the cost/benefit ratio /for me/ favors preserving (or even creating) curly quotes, but I can definitely appreciate it if others make a different calculation.

On Wed, January 25, 2012 3:29 pm, Jim Adcock wrote:
I'm curious if anyone has any idea what percentage of submitted texts use "curlies" vs. just punting and going with ASCII-style "straights" ? My assumption has been that the great majority of submitted texts just use "straights."
I suspect that you are correct, and I further suspect that the decision is due in large part to the tools chosen. In my experience, it is like pulling teeth without anesthetic to get Abbyy to recognize curly quotes. I had to go in and define a new language explicitly removing " and ' from the allowed character set, and adding u+2018, u+2019, u+201C and u+201D to the set. When you do this, the spell checker doesn't recognize the words so delimited. To get curly quotes the best approach is probably to use a set of scripts like BB brags about and then smooth-read the resulting text to make sure the conversion was appropriate. Most editors also have a hard time dealing with curly quotes, so the conversion is one of the last things you would want to do. Most people probably don't want to make the effort, particularly when straight quotes are "good enough."

Lee>Most people probably don't want to make the effort, particularly when straight quotes are "good enough." Straights were probably *not* good enough in my latest effort, but, oh well. IF hypothetically PG were to want to go curly I'd probably be willing to step up to the bat re fixing the texts I've already submitted.

Don>Starting at the most basic level, is there any good reason not to use utf-8 as the basic encoding standard for everything including plain-text? One reason might be that we still have WWers who complain if you choose to use utf-8 and feel you should have chosen a "simpler" encoding scheme.

On Tue, January 24, 2012 5:01 pm, Marcello Perathoner wrote: [snip]
We could start redoing the top 100 list excluding everything that is too hard and everything made by DP.
I could agree to this. I would point out, however, that as you yourself have pointed out the most "important" works are probably those that a volunteer feels most passionate about, whether it's in the top 100 list or not. I would suggest that the top 100 list be considered a suggestion, but tell volunteers that they could re-work /any/ book from the early days that is important to them. As you may recall, a few years back I built a top 100 list based on monthly download lists culled from the Wayback Machine at archive.org. I'd be happy to repost that list if anyone is interested. [snip]
A semi-official branch would be a good occasion to ditch the old WWer workflow in favor of a source repository (git or mercurial) that holds all the masters.
Agreed. I think that both git and mercurial are overkill for this project (this will not be a large project requiring branching and merging, non-linear development, authenticated history, high performance and low bandwidth), but if one of these SCM systems is the only one you're willing to support, I can live with that. From an end-user's standpoint I shouldn't need to learn anything more about these systems than how to get a file and check in modifications. Having a system like this would allow documents to evolve, and as modifications are made, issues encountered and proposed solutions can be retrieved from the history. White-washers should be unnecessary, as no book will ever be in a permanently "complete" state, and fixes can always be applied. This would be an implementation of the "continuous proofreading" that BowerBird frequently advocates.
Should we reserve a range of ebook nos. or shadow the existing ones?
Shadow the existing. That way everyone knows what the source was, and what the eventual replacement will be.

Marcello> We could start redoing the top 100 list excluding everything that is
too hard and everything made by DP.
Why would you avoid DP works? If it's not broken, then reworking it turns into a no-op. If it is broken, then why should the fact that it comes from DP "save it" from getting fixed? Why would DP want "special privileges" if in fact they submitted something to PG which is broken?

On Tue, January 24, 2012 3:08 pm, Joshua Hutchinson wrote:
So, if someone were to start "refactoring" old PG texts into TEI or RST and working with a WWer to repost them ... is this a workable idea?

I'd love to see the PG corpus redone as a "master format" system (and the current filesystem supports "old" format files in a subdirectory, so if someone wanted to get the old original hand-made files, they could). I'm not particularly wedded to any master format. Hell, if someone came up with a sufficiently constrained HTML vocabulary that could be easily used to "generate" the additional formats necessary, I'm good with that.

But before anyone will start doing this work, there needs to be a consensus from PG (I'm looking at you, Greg!) that the work will be acceptable. A half-assed "master format" system is no master format system at all.

I'm even ok with working up the system as you go (i.e., start with "simple" fiction works and make sure the system handles them before throwing more and more complex works at it, tweaking and fixing in the time-honored method of "incremental development").

Maybe we start this process on a semi-private mirror of the PG corpus, and only when it reaches a critical mass of some sort does it get moved over. But an official notice that this project has some backing is necessary or we'll just keep seeing everything running around in ten different directions and nothing ever getting done. Josh
I'm in. I think. Just to be sure, let me reiterate what I think I'm agreeing to. 1. A semi-official mirror of PG will be created. 2. Texts in the mirror will be refactored into a single file format which can be used to automatically create every delivery format offered by Project Gutenberg. 3. The focus of the project will be to re-work the most popular PG texts. At the outset simple works will be preferred to more complex works. 4. The project will evolve as new knowledge is gained. 5. The controlling principals at PG agree that if this effort is successful the refactored works will eventually replace the original works in the PG repository. Am I wrong in any of these broad points? If so, please clarify.

Marcello>The generated text must be ready to post, eg. word wrap, pg header, pg footer, lines between chapters, etc. all must be there and adhere to pg standard. There must be no post-generation edits required. This is simply defending the status quo by stating the status quo must be defended. There is no reasonable rationale for the continuing insistence on the submission, or creation, of a file in the PG "txt70" format. There is a plausible rationale for requiring *some* kind of txt file -- namely that the txt file represents some kind of doomsday scenario fallback where all the computers in the world suddenly forget how to parse HTML and we have to recreate a post-apocalypse world. Then the txt files reduce the need to redo everything from raw scans. But insisting on continuing to maintain the difficult peculiarities of the *particular* current PG "txt70" requirements has really no basis, any more than insisting that all HTML files must be formatted *exactly* one way.

Lee>> This seems bizarre to me. What is the rationale for not allowing, let
alone requiring, an impoverished text file when the other submission is TEI?
Marcello>That you can generate the plain txt file out of the TEI. Following the PG FAQ on this subject: 1) Open HTML file in web browser. 2) Do a Select All 3) Open a text editor 4) Do a Paste 5) Save under a file name ending in ".txt"
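Jim's five manual steps can of course be scripted. A bare-bones sketch using only Python's standard library (real text generation, with word wrap, PG headers, and chapter spacing, needs far more than this):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping
    <script>/<style> and starting a new line at block elements."""
    BLOCKS = {"p", "div", "h1", "h2", "h3", "br", "li", "tr"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in self.BLOCKS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

This is roughly what the browser copy-and-paste does, which is Jim's point: deriving a plain-text rendering from the markup is the easy part; matching PG's txt70 conventions exactly is where the manual effort goes.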

Funny, I would have thought Lee already knew the answers to these questions... Be that as it may, see below... Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Lee Passey Sent: Monday, January 23, 2012 1:49 PM To: 'Project Gutenberg Volunteer Discussion' Subject: Re: [gutvol-d] Producing epub ready HTML
On Mon, January 23, 2012 12:43 pm, Al Haines wrote:
I've seen Marcello's response to this, and to put it into plain English, a submission can be one, and only one, of the following:
See, I never would have gotten /this/ out of Mr. Perathoner's statement.
- one (and only one) TEI file, plus illustrations. No other files are allowed.
This seems bizarre to me. What is the rationale for not allowing, let alone requiring, an impoverished text file when the other submission is TEI?
Because the TEI process creates text (UTF8, Latin1, ASCII) and HTML files from the TEI file. Ditto with an RST submission. Isn't that what a master file is supposed to do? If a supposed master file also requires an accompanying text file, what's the point of the master file? More below...
[snip]
- one (X)HTML file, plus illustrations, plus from one to three text files (UTF8, Latin1, ASCII), depending on the submission's requirements.
The language in the FAQ that one may submit a HTML version without a plain ASCII version (#H.3._Can_I_submit_a_HTML_version_without_a_plain_ASCII_version.3F) needs to be updated with this new requirement.
The requirement for a text file isn't new--it's been around at least since I started producing etexts, eight years ago. Much of PG's assorted FAQs might be candidates for update, but the bulk of them have stood the test of time. The questions are, who's going to update them, how much argument from the cheap seats is going to ensue, and since two of the three main complainers in this forum don't submit anything anyway, why should any updater put up with those complaints? More below...
(This kind of HTML+text submission *may* also include such binary formats as doc, rtf, and pdf, but since they have to be handled manually, and PDF files are difficult, if not impossible, to correct, they're not encouraged.)
How in the world can these kinds of documents be included in HTML? Or do you mean you can package them up in a zip file alongside the HTML?
They aren't "included in the HTML", whatever that means. A submitted project is usually a zip file containing the submitter-prepared files (text, HTML, images), and the WWers process those into a zip file containing an etext-numbered set of files, e.g. 12345.txt (ASCII), 12345-8.txt (Latin1), 12345-0.txt (UTF8), 12345-h.htm (HTML), plus images. Not all the text files are produced for a given project, and if a submission doesn't include an HTML file, there won't be one for that project, until Marcello's software auto-generates one. If you follow the "Other Files..." link on an ebook's download page, you'll see the contents of the zip file uploaded by the WWers. (Note that files still in the old etextnn folders won't have this link.)
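The suffix convention Al describes is simple enough to capture in a small table. A sketch (the mapping is inferred from his description above; the WWers' actual tooling may differ in details):

```python
# File-name suffixes per format, as described for etext 12345:
# 12345.txt (ASCII), 12345-8.txt (Latin1), 12345-0.txt (UTF8),
# 12345-h.htm (HTML).
SUFFIXES = {
    "ascii": "{n}.txt",
    "latin1": "{n}-8.txt",
    "utf8": "{n}-0.txt",
    "html": "{n}-h.htm",
}

def pg_filename(etext_no, fmt):
    """Return the conventional posted file name for one format."""
    return SUFFIXES[fmt].format(n=etext_no)
```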
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

On Mon, Jan 23, 2012 at 09:51:50AM -0700, Lee Passey wrote:
On 1/19/2012 4:43 PM, Marcello Perathoner wrote:
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
I wrote:
Is this an exclusive 'or' or an inclusive 'or'? If I submit an HTML file am I precluded from posting a TEI or an RST file? Or may I submit one of each?
The answer is that you may submit one of each, as Marcello and Al and I have since clarified. And I now say again, "Confirmed."
[snip]
On Sat, Jan 21, 2012 at 07:28:30PM -0800, Al Haines wrote:
Marcello is correct.
[snip]
On Sun, January 22, 2012 1:21 pm, Greg Newby wrote:
Yes, confirmed.
I think this is a strong demonstration of why it is important for technical people to take more classes in technical writing.
Mr. Perathoner started this thread making an ambiguous statement. The ambiguity was compounded when he stated "In no case has anybody been posting multiple HTML files," leading one to conclude that the issue here relates to multiple postings of a single markup format.
I then asked for clarification of that statement, to remove the ambiguity. Mr. Haines and Mr. Newby each responded saying, in effect, "yes, Marcello's original ambiguity is the policy of Project Gutenberg."
I am left to conclude that either the Project Gutenberg policy in this regard is still in a state of flux and the ambiguity is intentional, or that Mr. Haines and Mr. Newby did not feel comfortable enough with their writing skills to actually respond to the question, or that neither of them gave anything more than cursory attention to the question actually posed.
Thanks for this idea. I will endeavor to give cursory attention to your future messages.
Given the usual practice on this list of paying little, or no, attention to what other people have to say, I suspect that the last alternative is the correct one.
In any case, the ambiguity remains.
You are being pedantic. The full message I responded to follows. The relevant extracts: Marcello:
You can submit either *one* HTML or *one* TEI or *one* RST file.
Al:
Marcello is correct.
It was decided in the early discussions of RST that additional custom files, e.g. a custom HTML file or a custom text file, would not be allowed. If custom files were desired/required, then the submission should be a normal text+HTML submission.
Me:
Yes, confirmed.
-- Greg

** Full extract:

Yes, confirmed. Having numerous formats derived from a single master is a long-time goal. We've had some success with RST and TEI, and I've encouraged new projects to consider RST. There are still some limitations, though...

On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure. Enough said. -- Greg

On Sat, Jan 21, 2012 at 07:28:30PM -0800, Al Haines wrote:
Marcello is correct.
It was decided in the early discussions of RST that additional custom files, e.g. a custom HTML file or a custom text file, would not be allowed. If custom files were desired/required, then the submission should be a normal text+HTML submission.
The first discussion/submission of TEI files predates my time as WWer, but the same principle applies.
Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Lee Passey Sent: Saturday, January 21, 2012 4:52 PM To: gutvol-d@lists.pglaf.org Subject: Re: [gutvol-d] Producing epub ready HTML
On 1/19/2012 4:43 PM, Marcello Perathoner wrote:
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
Is this an exclusive 'or' or an inclusive 'or'?
If I submit an HTML file am I precluded from posting a TEI or an RST
file? Or may I submit one of each?
Mr. Haines or Mr. Newby, would you care to confirm or deny this restriction?

On Mon, January 23, 2012 4:27 pm, Greg Newby wrote:
On Mon, Jan 23, 2012 at 09:51:50AM -0700, Lee Passey wrote:
On 1/19/2012 4:43 PM, Marcello Perathoner wrote:
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
I wrote:
Is this an exclusive 'or' or an inclusive 'or'? If I submit an HTML file am I precluded from posting a TEI or an RST file? Or may I submit one of each?
The answer is that you may submit one of each, as Marcello and Al and I have since clarified. And I now say again, "Confirmed."
Thus illustrating the confusion. You say "one of each" but Mr. Perathoner and Mr. Haines have both said "only one of all". As I understand your position, I could submit a total of 4 files: one HTML, one TEI, one RST and one .txt. As I understand Mr. Perathoner's position, if I chose to submit a TEI file I could submit only that one file. According to Mr. Haines there are special rules for HTML files, but let's not cloud the issue. Apparently, the ambiguity continues to be confirmed.

I give up. I can't put it any plainer than I did earlier, when I described the four types of submission. Obviously, the four are mutually exclusive. Duh... Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Lee Passey Sent: Monday, January 23, 2012 3:28 PM To: gbnewby@pglaf.org; Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Producing epub ready HTML
On Mon, January 23, 2012 4:27 pm, Greg Newby wrote:
On Mon, Jan 23, 2012 at 09:51:50AM -0700, Lee Passey wrote:
On 1/19/2012 4:43 PM, Marcello Perathoner wrote:
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
I wrote:
Is this an exclusive 'or' or an inclusive 'or'? If I submit an HTML file am I precluded from posting a TEI or an RST file? Or may I submit one of each?
The answer is that you may submit one of each, as Marcello and Al and I have since clarified. And I now say again, "Confirmed."
Thus illustrating the confusion. You say "one of each" but Mr. Perathoner and Mr. Haines have both said "only one of all". As I understand your position, I could submit a total of 4 files: one HTML, one TEI, one RST and one .txt. As I understand Mr. Perathoner's position if I chose to submit a TEI file I could submit only that one file.
According to Mr. Haines there are special rules for HTML files, but let's not cloud the issue.
Apparently, the ambiguity continues to be confirmed.

On 1/23/2012 4:50 PM, Al Haines wrote:
I give up. I can't put it any plainer than I did earlier, when I described the four types of submission.
Yes, once you took the time to actually answer the question you were very clear. Indeed, you were doing really well right up to the point where you started talking about RTF, PDF and Microsoft Word as part of an (X)HTML submission. Frankly, I don't see how it could be done, so I'll just pretend it didn't happen; if you don't say anything I won't...
Obviously, the four are mutually exclusive. Duh...
How is it obvious? In the very post you're replying to Mr. Newby seemed to think that the answer was "one of each." On the other hand, I'm guessing that Mr. Newby has very little practical impact on or review of the nature of the files that actually get accepted, so I'm going to view yours as the definitive answer. If and when you change your mind, it would be useful if you could post a message here, as apparently the FAQ is no longer being maintained, and in at least some cases is downright wrong.

OK, I won't say anything...
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Lee Passey Sent: Monday, January 23, 2012 6:20 PM To: 'Project Gutenberg Volunteer Discussion' Subject: Re: [gutvol-d] Producing epub ready HTML
On 1/23/2012 4:50 PM, Al Haines wrote:
I give up. I can't put it any plainer than I did earlier, when I described the four types of submission.
Yes, once you took the time to actually answer the question you were
very clear. Indeed, you were doing really well right up to the point
where you started talking about RTF, PDF and Microsoft Word as part of an (X)HTML submission. Frankly, I don't see how it could be done, so
I'll just pretend it didn't happen; if you don't say anything I won't...
Obviously, the four are mutually exclusive. Duh...
How is it obvious? In the very post you're replying to Mr. Newby seemed to think that the answer was "one of each." On the other hand, I'm guessing that Mr. Newby has very little practical impact on or review of the nature of the files that actually get accepted, so I'm going to view yours as the definitive answer.
If and when you change your mind, it would be useful if you could post a message here, as apparently the FAQ is no longer being maintained, and in at least some cases is downright wrong.

Hi Guys,

I am really getting a gas out of this thread. Well, I decided to do some research and look into the HowTos and FAQs. I do now understand the complaint about ambiguity. It is there in the pages of the PG site -- actually, even contradictory on the same page. Source: http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ

1) H.3. Can I submit a HTML version without a plain ASCII version? "You can submit it, but the Posting Team will then consider whether we should also make an ASCII, or perhaps ISO-8859 or Unicode version of it. We really do want our texts to be viewable by everybody, under every circumstances, and we do not want to start posting texts that are in any way inaccessible to anyone. See also the FAQ [G.17] 'Why is PG so set on using Plain Vanilla ASCII?'"

Yet:

2) [H.4.]5. Requirement: HTML and plain-text. "Project Gutenberg does publish well-formatted, standards compliant HTML. However, we insist that a plain text version be available for all HTML documents we publish (even if images or formatting are absent), except when ASCII can't reasonably be used at all, for example with Arabic, or mathematical texts."

So the answer should generally be NO! I have found other such confusing contradictions. Furthermore, from the statements here there seem to be policies in place which are not mentioned. AFAIK, TEI is not mentioned; it is not in the FAQs or HowTos. So the FAQs and HowTos must be updated to mirror the commonly used policies and practices, and be more consistent.

I have one question, though. If one has gone through the work of producing ePub, mobi, HTML and text files that conform to PG requirements, can they be submitted to PG? The reason I ask is that, as can be seen from the posts on this list, production is moving more toward producing mobi and epub formats directly. These formats more or less contain HTML already, and if they follow the guidelines for good ebooks, the HTML should fit PG requirements for HTML.
With the above said, all PG now needs is a conversion tool to create a plain text version. Regards, Keith.

On 1/24/2012 3:14 AM, Keith J. Schultz wrote:
So the FAQs and HowTos must be updated to mirror the commonly used policies and practices, and be more consistent.
That would be desirable and advisable, but apparently it will not happen as no one is willing to do it (I doubt a consensus could even be reached as to the answers published in such a document). The next best thing would be a disclaimer at the beginning saying that the information contained therein may or may not be the policy of Project Gutenberg, and that questions should be addressed directly to Mr. Haines. As a last resort, the FAQ should simply be withdrawn.
I have one question though. If one has gone through the work of producing ePub, mobi, HTML and text files that conform to PG requirements, can they be submitted to PG?
According to Mr. Haines, they may not. ePub and .mobi may not be submitted under any circumstance, and HTML may only be submitted if it is accompanied by an associated .txt file that meets his requirements.
The reason I ask is that, as can be seen from the posts on this list, production is moving more toward producing mobi and epub formats directly. These formats more or less contain HTML already, and if they follow the guidelines for good ebooks, the HTML should fit PG requirements for HTML.
PG has no requirements for HTML, other than that it be HTML. Perhaps it is more accurate to say that PG has no /published/ requirements for HTML. I've little doubt that if you were to submit an HTML document it would be rejected for some reason. I think it would be an interesting project to collect these rejection slips, from which a de facto requirements list could be constructed.
With the above said all PG now needs is a conversion tool to create a plain text version.
Mr. Perathoner is unwilling to consider such a tool. If you would like, you are welcome to use my rather dated tool. Be aware that I think the text file requirement is, as Mr. Adcock pointed out, a requirement to preserve the status quo because it /is/ the status quo, and therefore I am unlikely to put any effort into maintaining the tool. Impoverished text is dead, let the rich text formats flower!

Hi, I find your lamenting counterproductive. On 24.01.2012 at 20:30, Lee Passey wrote:
On 1/24/2012 3:14 AM, Keith J. Schultz wrote:
So the FAQs and HowTos must be updated to mirror the commonly used policies and practices, and be more consistent.
That would be desirable and advisable, but apparently it will not happen as no one is willing to do it (I doubt a consensus could even be reached as to the answers published in such a document). The next best thing would be a disclaimer at the beginning saying that the information contained therein may or may not be the policy of Project Gutenberg, and that questions should be addressed directly to Mr. Haines. As a last resort, the FAQ should simply be withdrawn.

Though I just skimmed the FAQs and HowTos, they are not written that badly. They were written by persons knowledgeable about the process, but those persons forgot to cross-read, and forgot that the documents should be for readers who are not knowledgeable and want to be informed about what PG requires and expects. It actually should not be hard to clean them up.
I have one question though. If one has gone through the work of producing ePub, mobi, HTML and text files that conform to PG requirements, can they be submitted to PG?
According to Mr. Haines, they may not. ePub and .mobi may not be submitted under any circumstance, and HTML may only be submitted if it is accompanied by an associated .txt file that meets his requirements.
I have followed this thread and am aware of what has been written. I believe Marcello and Al should be the ones to respond to this question.
The reason I ask is that, as can be seen from the posts on this list, production is moving more toward producing mobi and epub formats directly. These formats more or less contain HTML already, and if they follow the guidelines for good ebooks, the HTML should fit PG requirements for HTML.
PG has no requirements for HTML, other than that it be HTML.
This is not true; read the HowTo on HTML.
Perhaps it is more accurate to say that PG has no /published/ requirements for HTML. I've little doubt that if you were to submit an HTML document it would be rejected for some reason. I think it would be an interesting project to collect these rejection slips, from which a de facto requirements list could be constructed.
With the above said all PG now needs is a conversion tool to create a plain text version.
Mr. Perathoner is unwilling to consider such a tool. If you would like, you are welcome to use my rather dated tool. Be aware that I think the text file requirement is, as Mr. Adcock pointed out, a requirement to preserve the status quo because it /is/ the status quo, and therefore I am unlikely to put any effort into maintaining the tool. Impoverished text is dead, let the rich text formats flower!
regards Keith.

Greg>Having numerous formats derived from a single master is a long-time goal. We've had some success with RST and TEI, and I've encouraged new projects to consider RST. There are still some limitations, though...

Greg>On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure. Enough said.

This combination of paragraphs doesn't make sense. I recently posted in this forum an analysis, based on a request from this forum, that indicated that RST in practice *is not* working as a "single master." To wit, the EPUB and MOBI being generated from RST are not particularly successful, possibly no more so than the DP "HTML" efforts.

On 1/24/2012 10:01 AM, Jim Adcock wrote:
This combination of paragraphs doesn't make sense.
I think if you parsed it carefully it would.
Greg>Having numerous formats derived from a single master is a long-time goal.
Yes, there are many individuals at PG who have had this goal. In the past there have been institutional impediments to the goal, and much of the past effort to achieve the goal can best be characterized as "routing around damage." This is one of the reasons that the current mechanism is so convoluted.
We've had some success with RST and TEI, and I've encouraged new projects to consider RST.
I think it's indisputable that PG has had /some/ success with RST and TEI. The success is more along the lines of a proof-of-concept than actual production-ready code, but the success is there nonetheless.
There are still some limitations,though...
In my view, the biggest limitation of RST is the difficulty of producing it. While more mature, RST suffers from the same main drawbacks as BowerBird's s.m.l.: it requires that the producer have a strong understanding of subtle markup rules, the distinction between markup and content is not obvious and therefore easily confounded, and there is no automated way to detect errors in markup. Of course, both of these formats are susceptible to a flawed implementation of the tool chain that produces other formats, but this potential exists no matter what format is chosen. It is a mistake to equate flawed tools with a flawed format.
Greg>On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure.
What Mr. Newby is talking about here is /not/ the HTML that is a result of Mr. Perathoner's design decisions, but the HTML which is posted by volunteers, primarily of late by DP. It is certainly true that bad decisions by volunteers can make it hard to use HTML as a master format, although to be fair bad decisions by volunteers can also make RST and TEI hard to use as well. It is simply more likely that HTML will be flawed than RST or TEI because anyone sophisticated enough to use one of these last two formats is likely to know enough to use them correctly, whereas anyone with Microsoft Word or Adobe Dreamweaver /thinks/ s/he knows how to use HTML, often incorrectly. Should RST gain the same popularity as HTML, I'm sure it would be just as problematic if not more so. But because HTML is the source for all e-book file formats, that is where the focus should be. The solution? 1. Define what constitutes good HTML. 2. Judge the quality of conversion tools by how well they satisfy 1.
I recently posted in this forum an analysis, based on a request from this forum, that indicated that RST in practice *is not* working as a "single master." To wit, the EPUB and MOBI being generated from RST are not particularly successful, possibly no more so than the DP "HTML" efforts.
Yes, /in practice/. So it seems that the reforms that need to be made are reforms to the /practice/. Mr. Perathoner has an advantage over all the rest of us in that he is the only one with access to the PG servers. Thus, he pretty much gets to do what he wants, and the tool chain pretty much reflects his tastes and biases. If you want any improvements to be made to the PG tool chain /in practice/ you effectively need to convince Mr. Perathoner, and no one else, that the improvements need to be made. As you may have noticed, Mr. Perathoner is as prickly as BowerBird, so suggestions need to be made with much more subtlety, tact and finesse than I am capable of.

On the other hand, it may be possible to take a page from PG's book and route around the damage. I've looked at the HTML output you've provided from PG and I haven't seen anything that can't be repaired. It should be possible to build a web interface that sits in front of PG, forwards requests, rewrites the HTML to meet industry standards, then either delivers /that/ HTML or compiles it into ePub or .mobi. I don't know that I have the time to build anything like that (I'm not particularly committed to the future of Project Gutenberg) but it would be interesting to see how much interest there is in such a project.

On what is perhaps an unrelated note, is anyone capturing the output of Distributed Proofreaders and transferring it to the Internet Archive before it gets degraded for Project Gutenberg? Are there any private archives at Distributed Proofreaders that could be transferred as well?
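The front-end idea is concrete enough to sketch. Its core is just an HTML rewrite pass; here is a minimal, stdlib-only Python sketch (the function name, the regex heuristics, and the "pg.css" file name are my assumptions, not any existing PG tool, and a real implementation would use a proper HTML parser and cache its results rather than hit PG's servers repeatedly):

```python
import re

def rewrite_html(html, css_href="pg.css"):
    """Rewrite one PG HTML page: drop the embedded <style> block and any
    inline style attributes, then link one shared external style sheet.
    Regex-based, so strictly a sketch; real pages want a real parser."""
    # Strip <style>...</style> blocks out of the document.
    html = re.sub(r"<style\b[^>]*>.*?</style>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Strip inline style="..." attributes from individual tags.
    html = re.sub(r'\sstyle="[^"]*"', "", html, flags=re.IGNORECASE)
    # Point the page at one external style sheet instead.
    link = '<link rel="stylesheet" type="text/css" href="%s"/>' % css_href
    return re.sub(r"</head>", link + "</head>", html, count=1,
                  flags=re.IGNORECASE)
```

A front-end would then serve the rewritten HTML directly, or hand it to an ePub/.mobi packager.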

Lee>On the other hand, it may be possible to take a page from PG's book and route around the damage. I've looked at the HTML output you've provided from PG and I haven't seen anything that can't be repaired. It should be possible to build a web interface that sits in front of PG, forwards requests, rewrites the HTML to meet industry standards, then either delivers /that/ HTML or compiles it into ePub or .mobi.

In practice, if/when people do this kind of thing they cache the results so as not to keep hitting the PG website repeatedly and unnecessarily. If one does that, then consider simply getting the books in bulk via http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project so as not to have to hit the servers at all. One needs more knowledge of (and love for) DB work than I have to make a strong website, but http://freekindlebooks.org still runs as an example of this approach that predates PG's willingness to host EPUB and MOBI, and it still draws a couple hundred thousand downloads a month -- even though I've been trying to steer customers back to PG now that PG -- more or less -- hosts EPUB and MOBI file formats. It's not clear to me that the PG legalese would allow one to retain the PG tm on the files once one has internally tweaked the formatting problems, though.

Bottom line, though: the books most often downloaded from PG *ought* to be reworked in any case -- whether one wants to read those books in txt70, html, epub or mobi format. It is crazy that in practice PG doesn't have a way to rework that which gets read most often -- and which needs a rework.

On 1/19/2012 4:43 PM, Marcello Perathoner wrote:
On 01/19/2012 06:23 PM, Jim Adcock wrote:
IF PG doesn't routinely redo all file generation, then PG has basically been lying to us submitters about their rationale for restricting our contributions to being "HTML ONLY!" Even IF PG insisted on doing their own gen of EPUB and MOBI, PG could still allow us submitters to also submit EPUB-HTML and MOBI-HTML -- which you point out PG is ALREADY allowing in the "special case" of RST!
Wrong.
You can submit either *one* HTML or *one* TEI or *one* RST file. In no case has anybody been posting multiple HTML files.
Except for David Widger. See: http://www.gutenberg.org/files/74/74-h/74-h.htm. <blockquote> THIS EBOOK HAS BEEN REFORMATTED FOR BETTER APPEARANCE IN MOBILE VIEWERS SUCH AS KINDLES AND OTHERS. THE ORIGINAL FORMAT, WHICH THE EDITOR BELIEVES HAS A MORE ATTRACTIVE APPEARANCE FOR LAPTOPS AND OTHER COMPUTERS, MAY BE VIEWED BY CLICKING ON THIS BOX. </blockquote>

On Wed, January 18, 2012 9:41 pm, Carlo Traverso wrote:
Jim, none of the submissions that I listed are mine, they have been submitted as PG-RST or PG-TEI master files, from which all the formats have been created by Marcello's software without further manual intervention.
Hey, /you/ asked.
In particular, for RST two different HTML are created in the process, one for browsers and one for epub. So it is unfair to view HTML on a small device.
It is not unfair, indeed it is desirable. What is unfair is to say "this HTML can only be adequately viewed on devices having these characteristics."

In my mind the major flaw in all of the PG HTML files is the inclusion of a <style> block. Individuals who include a <style> block in an HTML file are attempting to create a document that looks "pretty" on their preferred HTML User Agent utterly without regard to the needs or desires of others. It should go without saying that beauty is in the eye of the beholder, and that one man's treasure is another man's trash.

The best way to create ePub-ready HTML is to strip out /all/ of the style definitions from the <head> of the HTML files, replacing them with a single link to an external style sheet. The HTML markup should focus on the /structure/ of the document with little or no regard to its presentation; that is the job of the external style sheet. Standard HTML tags should be used, and should be used appropriately: no marking chapter titles with <p> tags!

A standard set of classification attributes should be used. If I know that a chapter title is always marked with <h3 class="chapter"> then I know I can change the presentation of chapter titles in /my/ version by simply providing a style sheet file with the selector:

h3.chapter { font-size: humongous; text-align: justify; color: stop-light-red; page-break-before: always }

without changing it for anyone else (remember, treasure/trash). If a user doesn't have a style sheet and can't figure out how to download one of the standard ones from PG, the book will still look pretty good because the file is free from tag abuse.

So the three basic rules for making ePub-ready HTML files:

1. Don't specify styles inside the HTML document, either inline or in the <head>. Do include a link to an external style sheet with the name "pg.css".

2. Use HTML elements only as designed; no tag abuse.

3. Use a standardized set of classification attributes ("class='standard'"). Help contribute to building a standard library of attribute values. Be willing to compromise in the naming of the attributes; standardization is more important than "correctness".

The advice at http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB is all good.
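Rules this mechanical can be machine-checked, which speaks to the earlier wish for an automated tool that warns of features that won't convert well. A hypothetical lint sketch (the function name, warning texts, and heuristics are mine; rule 3 would need an agreed-upon library of class names, so only rules 1 and 2 are checked here):

```python
import re

def check_epub_ready(html):
    """Flag violations of the rules above. Returns a list of warning
    strings; an empty list means nothing suspicious was found."""
    warnings = []
    # Rule 1: no styles inside the document, inline or in the <head>...
    if re.search(r"<style\b", html, re.IGNORECASE):
        warnings.append("embedded <style> block in the document")
    if re.search(r'\bstyle="', html, re.IGNORECASE):
        warnings.append("inline style= attribute on a tag")
    # ...and do link the external "pg.css" style sheet.
    if not re.search(r'<link\b[^>]*href="pg\.css"', html, re.IGNORECASE):
        warnings.append("no <link> to the external style sheet pg.css")
    # Rule 2: one obvious kind of tag abuse -- a chapter title
    # marked up as an ordinary paragraph instead of a heading.
    if re.search(r"<p\b[^>]*>\s*chapter\b", html, re.IGNORECASE):
        warnings.append("chapter title marked with <p> instead of a heading")
    return warnings
```

A real checker would grow its tag-abuse heuristics over time, much like the rejection-slip collection proposed earlier in the thread.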

Re this:
3. Use a standardized set of classification attributes ("class='standard'"). Help contribute to building a standard library of attribute values. Be willing to compromise in the naming of the attributes; standardization is more important than "correctness".
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Lee Passey Sent: Thursday, January 19, 2012 6:16 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Producing epub ready HTML
On Wed, January 18, 2012 9:41 pm, Carlo Traverso wrote:
Jim, none of the submissions that I listed are mine, they have been submitted as PG-RST or PG-TEI master files, from which all the formats have been created by Marcello's software without further manual intervention.
Hey, /you/ asked.
In particular, for RST two different HTML are created in the process, one for browsers and one for epub. So it is unfair to view HTML on a small device.
It is not unfair, indeed it is desirable. What is unfair is to say "this HTML can only be adequately viewed on devices having these characteristics."
About 10 years or so ago, I remember reading about an initiative to produce a standard set of CSS classes. It may have been aimed at government/business documentation. I don't remember who the initiators were (W3C or whoever), and don't remember ever hearing that anything came of it. Does anyone remember anything like this? Al

In my mind the major flaw in all of the PG HTML files is the inclusion of a <style> block. Individuals who include a <style> block in an HTML file are attempting to create a document that looks "pretty" on their preferred HTML User Agent utterly without regard to the needs or desires of others. It should go without saying that beauty is in the eye of the beholder, and that one man's treasure is another man's trash.
The best way to create ePub-ready HTML is to strip out /all/ of the style definitions from the <head> of the HTML files, replacing them with a single link to an external style sheet. The HTML markup should focus on the /structure/ of the document with little or no regard to its presentation; that is the job of the external style sheet.
Standard HTML tags should be used, and should be used appropriately: no marking chapter titles with <p> tags!
A standard set of classification attributes should be used. If I know that a chapter title is always marked with <h3 class="chapter"> then I know I can change the presentation of chapter titles in /my/ version by simply providing a style sheet file with the selector:
h3.chapter { font-size: humongous; text-align: justify; color: stop-light-red; page-break-before: always }
without changing it for anyone else (remember, treasure/trash).
If a user doesn't have a style sheet and can't figure out how to download one of the standard ones from PG, the book will still look pretty good because the file is free from tag abuse.
So, the three basic rules for making ePub-ready HTML files are:
1. Don't specify styles inside the HTML document, either inline or in the <head>. Do include a link to an external style sheet with the name "pg.css".
2. Use HTML elements only as designed; no tag abuse.
3. Use a standardized set of classification attributes ("class='standard'"). Help contribute to building a standard library of attribute values. Be willing to compromise in the naming of the attributes; standardization is more important than "correctness".
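A minimal sketch of a file that follows these three rules (the stylesheet name "pg.css" comes from rule 1; the class name "chapter" is only illustrative):

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>A Sample Book</title>
  <!-- Rule 1: no <style> block, no inline style attributes;
       just one link to an external sheet. -->
  <link rel="stylesheet" type="text/css" href="pg.css" />
</head>
<body>
  <!-- Rule 2: elements used as designed -- a heading, not a styled <p>.
       Rule 3: a standardized class name that any reader's own
       style sheet can target. -->
  <h3 class="chapter">Chapter I</h3>
  <p>It was a dark and stormy night.</p>
</body>
</html>
```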
The advice at http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB is all good. _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

On 1/19/2012 8:49 PM, Al Haines wrote:
About 10 years or so ago, I remember reading about an initiative to produce a standard set of CSS classes. It may have been aimed at government/business documentation. I don't remember who the initiators were (W3C or whoever), and don't remember ever hearing that anything came of it. Does anyone remember anything like this?
Are you perhaps thinking of the initiative of the HTML Writers Guild to produce some standards for Project Gutenberg texts? (http://www.hwg.org/opcenter/gutenberg/tutorials.html. See especially, http://www.hwg.org/opcenter/gutenberg/markupXHTML2.html.) IIRC, that initiative was created back in 2000-2001, which coincides with your recollection.

Lee> Are you perhaps thinking of the initiative of the HTML Writers Guild to produce some standards for Project Gutenberg texts? (http://www.hwg.org/opcenter/gutenberg/tutorials.html. See especially, http://www.hwg.org/opcenter/gutenberg/markupXHTML2.html.)

An interesting site. Why not try the recommended DTDs and see where it gets you?

On 1/20/2012 10:48 PM, Jim Adcock wrote:
Are you perhaps thinking of the initiative of the HTML Writers Guild to produce some standards for Project Gutenberg texts? (http://www.hwg.org/opcenter/gutenberg/tutorials.html. See especially, http://www.hwg.org/opcenter/gutenberg/markupXHTML2.html.)
An interesting site. Why not try the recommended DTDs and see where it gets you?
Because I'm not interested in DTDs. A DTD can tell you that you can't put a <p>aragraph inside a <span>, or that a <body> or <blockquote> is no longer considered a block element when it comes to <a> or <br> tags. But it can't tell you that <p> elements shouldn't be used for titles (use <h?>), or that lists shouldn't use a collection of nested <div> tags, but should use <ol> or <ul> or <dl> and <li> instead. When it comes to building "good" HTML, DTDs are of limited value (some, but not much). What is needed is good old-fashioned plain language, like that found at http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB (but more detailed), a list of recommended classes and why you should use them, and respect and cooperation among the people willing to follow the guidelines. The list of tags at the HTML Writers Guild is a good starting point, as is the Wiki page at DP. Respect, cooperation and acceptance are nowhere to be seen...
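As a sketch of the kind of plain-language guidance meant here, a hedged bad/good pair for the two examples Lee gives (the class names are invented for illustration):

```html
<!-- Tag abuse: a chapter title faked with a styled paragraph,
     and a list faked with nested <div> tags. A DTD accepts both. -->
<p style="font-size: 2em; font-weight: bold">CHAPTER I</p>
<div>
  <div>First item</div>
  <div>Second item</div>
</div>

<!-- The same content with elements used as designed: -->
<h3 class="chapter">CHAPTER I</h3>
<ul class="plain">
  <li>First item</li>
  <li>Second item</li>
</ul>
```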

Hi Jim, I haven't read your whole e-mail, just thought I'd reply to this snippet: On Jan 19, 2012, at 04:15, Jim Adcock wrote:
It seems like you have a way to not use the "Blue PDA" as the cover image. I don't see how you do this, because I don't see how PG allows specification of the cover image to be used with an HTML submission? In any case, on many small devices a cover image like you have which includes title and author is much more useful than the PG default "Blue PDA." Or is it that you don't include a cover at all, and that Amazon is automagically providing me with one? But when I submit an HTML PG gives my EPUB a "Blue PDA" -- whether I want it or not!
There are several ways of specifying a cover in your HTML version that epubmaker will use for the epub (and Kindle?) version(s) of your book. See Marcello's advice here: http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB#Cover_Page Jana

Jana> There are several ways of specifying a cover in your HTML version that epubmaker will use for the epub (and Kindle?) version(s) of your book. See Marcello's advice here: http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB#Cover_Page

OK thanks -- it didn't occur to me to look in DP for advice on how to target a PG tool.

Hopefully people who are providing an alt image for the "Blue PDA" realize that this image needs to include Title and Author. Which it will if people, say, just provide a jpg of the title page. But if they use the "first pretty picture" in the book, such as a frontal plate, then they need to use some kind of picture-editing software to place title and author over that image.

Thanks again.

Strangely enough, I made some nice cover images for my three PG submissions so I could put somewhat better formatted versions in the Kindle Store. I would have been happy to include these with my PG submissions if I had known it was possible!

https://plus.google.com/photos/114689250113657220289/albums/5681329822764797...
https://plus.google.com/photos/114689250113657220289/albums/5676125941560802...
http://www.amazon.com/Ancient-Manners-Illustrated-ebook/dp/B0055OMGN0/ref=sr_1_6?s=digital-text&ie=UTF8&qid=1326996235&sr=1-6

It would be good if this was better known. Maybe the White Washers could ask if someone wanted to include a cover image with his submission. They are not hard to make. A few minutes with The GIMP is all it takes.

James Simmons

On Thu, Jan 19, 2012 at 11:30 AM, Jim Adcock <jimad@msn.com> wrote:
Jana>There are several ways of specifying a cover in your HTML version that epubmaker will use for the epub (and Kindle?) version(s) of your book. See Marcello's advice here: http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB#Cover_Page
OK thanks -- it didn't occur to me to look in DP for advice on how to target a PG tool.
Hopefully people who are providing an alt image for the "Blue PDA" realize that this image needs to include Title and Author. Which it will if people, say, just provide a jpg of the title page. But if they use the "first pretty picture" in the book, such as a frontal plate, then they need to use some kind of picture-editing software to place title and author over that image.
Thanks again.

Custom epub covers were discussed among the WWers and Greg Newby several weeks ago. (The question was raised by DP-Canada.)

With Greg to confirm/deny/clarify as needed, the general guidelines decided on were:

- the book's original cover is preferred, even if it has only the title and author. (Obviously, not all covers have graphical content; some are completely blank. I've even seen one with a full-cover illustration, with no title or author on it, those being only on the spine.)

- they must be a reasonable size--50K-150K; 300dpi is plenty, most likely in jpg or png format. A 1M, 600dpi image, for example, if not rejected outright, will probably elicit a request for something reasonable, which, if not forthcoming, means the original will probably be reduced considerably by the WWers (and perhaps not used at all).

- a custom cover should be provided only when a submission is first uploaded. They will not, repeat, NOT, be applied retroactively, i.e. to already-posted submissions. The WWers haven't time for a bunch of vanity reposts.

- there must be a mention in the Notes to WWers field, and in a Transcriber's Note in the HTML file, that the image was produced by the submitter and is being placed into the public domain. Obviously, PG wants no hassle because a submitter helped themselves, from some other website, to something *not* in the public domain.

Al

-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of James Simmons Sent: Thursday, January 19, 2012 10:10 AM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Producing epub ready HTML

Strangely enough, I made some nice cover images for my three PG submissions so I could put somewhat better formatted versions in the Kindle Store. I would have been happy to include these with my PG submissions if I had known it was possible!
https://plus.google.com/photos/114689250113657220289/albums/5681329822764797041?authkey=COCMmsfqwNDOqgE
https://plus.google.com/photos/114689250113657220289/albums/5676125941560802081?authkey=CPDP0daf1ee9aw
http://www.amazon.com/Ancient-Manners-Illustrated-ebook/dp/B0055OMGN0/ref=sr_1_6?s=digital-text&ie=UTF8&qid=1326996235&sr=1-6

It would be good if this was better known. Maybe the White Washers could ask if someone wanted to include a cover image with his submission. They are not hard to make. A few minutes with The GIMP is all it takes.

James Simmons

On Thu, Jan 19, 2012 at 11:30 AM, Jim Adcock <jimad@msn.com> wrote: Jana>There are several ways of specifying a cover in your HTML version that epubmaker will use for the epub (and Kindle?) version(s) of your book. See Marcello's advice here: http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB#Cover_Page OK thanks -- it didn't occur to me to look in DP for advice on how to target a PG tool. Hopefully people who are providing an alt image for the "Blue PDA" realize that this image needs to include Title and Author. Which it will if people, say, just provide a jpg of the title page. But if they use the "first pretty picture" in the book, such as a frontal plate, then they need to use some kind of picture-editing software to place title and author over that image. Thanks again.

Al,

This sounds reasonable. I'm not asking for a retroactive posting; I just regret that I didn't know what was possible when I made my submissions. Two out of three of my cover images would have passed your guidelines because they make use of illustrations from the books. The Vidyapati cover is questionable because, while the illustration is an Indian miniature that is in the public domain, I got it from a Google image search, not from photographing the actual painting.

I would prefer a book's original cover too, but for pre-1923 books they can be hard to come by. For PG Canada using the original cover won't always be possible. For example, Robert C. Benchley died over 50 years ago, but his illustrator Gluyas Williams did not. I don't know what the original cover of The Big Sleep looked like, but I'm sure it would be risky to use it. Custom covers might be the only safe way to go for PG Canada.

James Simmons

On Thu, Jan 19, 2012 at 1:49 PM, Al Haines <ajhaines@shaw.ca> wrote:
Custom epub covers were discussed among the WWers and Greg Newby several weeks ago. (The question was raised by DP-Canada.)
With Greg to confirm/deny/clarify as needed, the general guidelines decided on were:
- the book's original cover is preferred, even if it has only the title and author. (Obviously, not all covers have graphical content; some are completely blank. I've even seen one with a full-cover illustration, with no title or author on it, those being only on the spine.)
- they must be a reasonable size--50K-150K; 300dpi is plenty, most likely in jpg or png format. A 1M, 600dpi image, for example, if not rejected outright, will probably elicit a request for something reasonable, which, if not forthcoming, means the original will probably be reduced considerably by the WWers (and perhaps not used at all).
- a custom cover should be provided only when a submission is first uploaded. They will not, repeat, NOT, be applied retroactively, i.e. to already-posted submissions. The WWers haven't time for a bunch of vanity reposts.
- there must be a mention in the Notes to WWers field, and in a Transcriber's Note in the HTML file, that the image was produced by the submitter, and is being placed into the public domain. Obviously, PG wants no hassle because a submitter helped themselves, from some other website, to something *not* in the public domain.
Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of James Simmons Sent: Thursday, January 19, 2012 10:10 AM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Producing epub ready HTML
Strangely enough, I made some nice cover images for my three PG submissions so I could put somewhat better formatted versions in the Kindle Store. I would have been happy to include these with my PG submissions if I had known it was possible!
https://plus.google.com/photos/114689250113657220289/albums/5681329822764797041?authkey=COCMmsfqwNDOqgE
https://plus.google.com/photos/114689250113657220289/albums/5676125941560802081?authkey=CPDP0daf1ee9aw
http://www.amazon.com/Ancient-Manners-Illustrated-ebook/dp/B0055OMGN0/ref=sr_1_6?s=digital-text&ie=UTF8&qid=1326996235&sr=1-6
It would be good if this was better known. Maybe the White Washers could ask if someone wanted to include a cover image with his submission. They are not hard to make. A few minutes with The GIMP is all it takes.
James Simmons
On Thu, Jan 19, 2012 at 11:30 AM, Jim Adcock <jimad@msn.com> wrote:
Jana>There are several ways of specifying a cover in your HTML version that
epubmaker will use for the epub (and Kindle?) version(s) of your book. See Marcello's advice here: http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB#Cover_Page
OK thanks -- it didn't occur to me to look in DP for advice on how to target a PG tool.
Hopefully people who are providing an alt image for the "Blue PDA" realize that this image needs to include Title and Author. Which it will if people, say, just provide a jpg of the title page. But if they use the "first pretty picture" in the book, such as a frontal plate, then they need to use some kind of picture-editing software to place title and author over that image.
Thanks again.

Al> - the book's original cover is preferred, even if it has only the title and author.

I would be concerned that on many recent small devices the cover image is used as a small thumbnail that is the sole identifier for the book -- at least in many situations -- and the historical covers [as one can see demonstrated in the "Recent Books" section of the PG website] often offer, when scanned, negligible contrast between the main color of the cover and the (often) gilded title+author.

At 11:49 AM 1/19/2012, Al Haines wrote:
Custom epub covers were discussed among the WWers and Greg Newby several weeks ago. (The question was raised by DP-Canada.)
With Greg to confirm/deny/clarify as needed, the general guidelines decided on were:
- they must be a reasonable size--50K-150K, 300dpi is plenty, most likely in jpg or png format. A 1M, 600dpi image, for example, if not rejected outright, will probably elicit a request for something reasonable, which if not forthcoming, the original would probably be reduced considerably by the WWers (and perhaps not used at all).
Al, thank you for posting this, as I was in the process of writing to Greg about the possibility of PGDP using custom covers. (PGDP members, please wait for our board to discuss it and guidelines for covers from our site to be established before using this option.) Kind regards, /louise/ PGDP Gen. Mgr.

On 1/18/2012 8:15 PM, Jim Adcock wrote:
EPUB:
The ngx or whatever it's called seems to be way overpopulated.
NCX. Navigation Control file for XML. Designed by DAISY to support Digital Talking Books. Rammed through the IDPF with little discussion by Adobe, which is the IDPF's 800-lb gorilla now that Microsoft has stopped caring. Adobe preferred the NCX to the existing "tours" element, probably as a result of the NIH syndrome. NCX is a rather chatty format, due in large part to the fact that it was designed as a navigation aid for audio books for the blind rather than as a simple table of contents. Fortunately, end users never need look at it, and it can easily be built using automated tools rather than by hand.
Page numbers now have magically jumped from the left margin to the right margin and what used to read "[PG 014]" now just reads a small grey "14" -- how do they do that?
Magic.

Seriously, I think one needs to be cautious when trying to judge the quality of an ePub file by how it looks in any particular User Agent. Due to the influence of Adobe on the ePub specification, I believe that the Adobe Digital Editions reader is probably the most compliant of the ePub readers available, but even that software is not fully compliant with the specification. If you find problems with the way an ePub looks in any particular software, it is as likely that the problem is with the software as that the problem is in the file.

I'm guessing from your description that ADE is what you were using to look at the ePub. The developers of ADE seem to be committed to making ePubs as much like PDF as possible. Thus, ADE automatically puts a page number in the right-hand margin of the display. If page numbers are not actually included in the ePub (more on this later), ADE will make some up. The only relation these made-up numbers have to the underlying text is that they are sequential, i.e. 2 is guaranteed to come after 1.

BTW, ADE is not the only User Agent that carries on this charade. I know that Aldiko does it, and it's possible that the Nook does it as well. In Aldiko I've learned how to turn it off; so far, I haven't been able to figure out how to do that in ADE. [snip]
Hm, except now I see a line that says:
[pg 002][pg 013] ... 7
So now I am confused: Is this page 2, or page 13, or page 7 ???
The NCX file that is part of ePub is used to create the only Table of Contents that ADE recognizes (remember, NCX was Adobe's idea in the first place). But that is not its only purpose. In addition to the <navMap> section, which defines what is approximately a Table of Contents, an NCX file can also contain a <pageList> section, which is intended to contain a list of page anchors that point to the beginning of each page in the document (because, after all, PDF is page oriented). In ADE, if this page list is present in the NCX file it is used instead of the made-up numbers.

I think that what you're seeing is page numbers which were unadvisedly placed into the source HTML file and not suppressed by your User Agent (many ePub User Agents don't do a good job with the { display: none } style), plus the page markers auto-generated either from the NCX file or made up just because ADE (or equivalent) seemed to think you would want them (I don't, but YMMV).

The HTML ePub maker used by PG does a pretty good job of constructing an NCX file from the structure of an HTML file, based upon the ordering and levels of <h?> headers. It could probably be enhanced to build a page list in the NCX file as well, by finding the embedded page anchors in the HTML, adding references to those anchors to the NCX, and removing any visible component of the anchor before packaging the ePub.

Anyone who is evaluating the quality of the ePubs produced by the PG tool would be well advised to learn the ePub and XHTML specifications, unzip the subject ePub and examine the markup by hand. There's simply too much variation among User Agents to be able to come to any valid conclusion any other way.
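A rough sketch of that enhancement (this is not epubmaker's actual code; the anchor pattern id="Page_NN" and the file name content.html are assumptions for the example):

```python
import re

def build_page_list(html, content_href="content.html"):
    """Find embedded page anchors in HTML and emit an NCX <pageList>
    section referencing them (a sketch under assumed anchor naming)."""
    targets = []
    for order, match in enumerate(re.finditer(r'id="(Page_(\d+))"', html), 1):
        anchor_id, number = match.group(1), match.group(2)
        targets.append(
            '  <pageTarget type="normal" id="pt-%s" value="%s" playOrder="%d">\n'
            '    <navLabel><text>%s</text></navLabel>\n'
            '    <content src="%s#%s"/>\n'
            '  </pageTarget>' % (number, number, order, number,
                                 content_href, anchor_id))
    return '<pageList>\n%s\n</pageList>' % '\n'.join(targets)

# Small demonstration on a fragment with two page anchors:
print(build_page_list('<a id="Page_13"></a> text <a id="Page_14"></a>'))
```

A real implementation would parse the HTML properly rather than using regexes, and would also strip any visible page-number text from the anchors before packaging.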

Anyone who is evaluating the quality of the ePubs produced by the PG tool would be well advised to learn the ePub and XHTML specifications, unzip the subject ePub and examine the markup by hand. There's simply too much variation among User Agents to be able to come to any valid conclusion any other way.
Let us remember that you are defending a pile of weirdness that PG is generating from one and/or another PG tool. One doesn't really need to unzip and reverse-engineer the generated EPUB, because one can start by looking at the generated HTML. The generated EPUB, and the further-generated MOBI, can only go downhill from there. One can start by taking a look at the CSS used for the HTML, and maybe a little bit of the generated HTML code, and see what one thinks about this particular approach:

CSS:
=====
/* Any generated element will have a class "tei" and a class "tei-elem", where elem is the element name in TEI. The order of statements is important !!! */
.tei { margin: 0; padding: 0; font-size: 100%; font-weight: normal; font-style: normal }
.block { display: block; }
.inline { display: inline; }
.floatleft { float: left; margin: 1em 2em 1em 0; }
.floatright { float: right; margin: 1em 0 1em 2em; }
.shaded { margin-top: 1em; margin-bottom: 1em; padding: 1em; background-color: #eee; }
.boxed { margin-top: 1em; margin-bottom: 1em; padding: 1em; border: 1px solid black; }
body.tei { margin: 4ex 10%; text-align: justify }
div.tei { margin: 2em 0em }
p.tei { margin: 0em 0em 1em 0em; text-indent: 0em; }
blockquote.tei { margin: 2em 4em }
div.tei-lg { margin: 1em 0em; }
div.tei-l { margin: 0em; text-align: left; }
div.tei-tb { text-align: center; }
div.tei-epigraph { margin: 0em 0em 1em 10em; }
div.tei-dateline { margin: 1ex 0em; text-align: right }
div.tei-salute { margin: 1ex 0em; }
div.tei-signed { margin: 1ex 0em; text-align: right }
div.tei-byline { margin: 1ex 0em; }
/* calculate from size of body = 80% */
div.tei-marginnote { margin: 0em 0em 0em -12%; width: 11%; float: left; }
div.tei-sp { margin: 1em 0em 1em 2em }
div.tei-speaker { margin: 0em 0em 1em -2em; font-weight: bold; text-indent: 0em }
div.tei-stage { margin: 1em 0em; font-weight: normal; font-style: italic }
span.tei-stage { font-weight: normal; font-style: italic }
div.tei-eg { padding: 1em; color: black; background-color: #eee }
hr.doublepage { margin: 4em 0em; height: 5px; }
hr.page { margin: 4em 0em; height: 2px; }
ul.tei-index { list-style-type: none }
dl.tei { margin: 1em 0em }
dt.tei-notelabel { font-weight: normal; text-align: right; float: left; width: 3em }
dd.tei-notetext { margin: 0em 0em 1ex 4em }
span.tei-pb { position: absolute; left: 1%; width: 8%; font-style: normal; }
span.code { font-family: monospace; font-size: 110%; }
ul.tei-castlist { margin: 0em; list-style-type: none }
li.tei-castitem { margin: 0em; }
table.tei-castgroup { margin: 0em; }
ul.tei-castgroup { margin: 0em; list-style-type: none; padding-right: 2em; border-right: solid black 2px; }
caption.tei-castgroup-head { caption-side: right; width: 50%; text-align: left; vertical-align: middle; padding-left: 2em; }
*.tei-roledesc { font-style: italic }
*.tei-set { font-style: italic }
table.rules { border-collapse: collapse; }
table.rules caption, table.rules th, table.rules td { border: 1px solid black; }
table.tei { border-collapse: collapse; }
table.tei-list { width: 100% }
th.tei-head-table { padding: 0.5ex 1em }
th.tei-cell { padding: 0em 1em }
td.tei-cell { padding: 0em 1em }
td.tei-item { padding: 0; font-weight: normal; vertical-align: top; text-align: left; }
th.tei-label, td.tei-label { width: 3em; padding: 0; font-weight: normal; vertical-align: top; text-align: right; }
th.tei-label-gloss, td.tei-label-gloss { text-align: left }
td.tei-item-gloss, th.tei-headItem-gloss { padding-left: 4em; }
img.tei-formula { vertical-align: middle; }
</style>
==========
Start of the Generated Code (verbatim):

<body class="tei">
<div lang="en" class="tei tei-text" style="margin-bottom: 2.00em; margin-top: 2.00em" xml:lang="en">
<div class="tei tei-front" style="margin-bottom: 6.00em; margin-top: 2.00em">
<div class="tei tei-div" style="margin-bottom: 5.00em; margin-top: 5.00em">
<div id="pgheader" class="tei tei-div" style="margin-bottom: 4.00em; margin-top: 4.00em">
<div class="tei tei-div" style="margin-bottom: 3.00em; margin-top: 3.00em"><p class="tei tei-p" style="margin-bottom: 2.00em">The Project Gutenberg EBook of Bible Readings for the Home Circle</p></div>
<div class="tei tei-div" style="margin-bottom: 3.00em; margin-top: 3.00em"><p class="tei tei-p" style="margin-bottom: 1.00em">This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License <a href="#pglicense" class="tei tei-ref">included with this eBook</a> or online at <a href="http://www.gutenberg.org/license" class="tei tei-xref">http://www.gutenberg.org/license</a></p></div>
<pre class="pre tei tei-div" style="margin-bottom: 3.00em; margin-top: 3.00em">Title: Bible Readings for the Home Circle

On 1/20/2012 10:32 PM, Jim Adcock wrote:
Anyone who is evaluating the quality of the ePubs produced by the PG tool would be well advised to learn the ePub and XHTML specifications, unzip the subject ePub and examine the markup by hand. There's simply too much variation among User Agents to be able to come to any valid conclusion any other way.
Let us remember that you are defending a pile of weirdness that PG is generating from one and/or another PG tool.
I can't see anything in my post that could possibly be construed as "defending" this stinky, steaming pile of ... What I /said/ was that when you're evaluating the quality of an ePub you need to look at its contents, not its behavior in any given User Agent. Which is apparently just what you did this second time.
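Unzipping an ePub to examine its contents, as suggested above, takes only a few lines of scripting; a sketch ("book.epub" is a placeholder path):

```python
import zipfile

def inspect_epub(path):
    """List an ePub's members and read its required container file.
    An ePub is an ordinary zip archive, so no special tools are needed."""
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
        # META-INF/container.xml is required by the ePub container spec
        # and points at the OPF package file, which in turn lists the
        # NCX, XHTML and CSS members worth reading by hand.
        container = z.read('META-INF/container.xml').decode('utf-8')
    return names, container
```

From there one can open the listed .opf, .ncx, .xhtml and .css members in any text editor and check them against the specifications.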
participants (19)
- Al Haines
- Andrew Sly
- D Garcia
- David Starner
- don kretz
- Greg Newby
- hmonroe1@verizon.net
- James Adcock
- James Simmons
- Jana Srna
- Jim Adcock
- Joshua Hutchinson
- Karen Lofstrom
- Keith J. Schultz
- Lee Passey
- Louise Davies
- Marcello Perathoner
- Roger Frank
- traverso@posso.dm.unipi.it