!@! #17135 Twas The Night Before Christmas

***** This file should be named 17135-h.htm or 17135-h.zip *****
This and all associated files of various formats will be found in: http://www.gutenberg.org/1/7/1/3/17135/
Produced by Janet Blenkinship, Suzanne Shell and the Online Distributed Proofreading Team at http://www.pgdp.net
Where the first characters of stanzas were "illuminated," they were eliminated and never replaced in the HTML version.

Actually, it's fairly common practice that if a paragraph/verse starts with some kind of graphical/illuminated character, the actual character it stands for is not included in the HTML version.
----- Original Message -----
From: "Michael S. Hart" <hart@pglaf.org>
To: "The gutvol-d Mailing List" <gutvol-d@lists.pglaf.org>
Sent: Wednesday, April 14, 2010 6:17 AM
Subject: [gutvol-d] !@! #17135 Twas The Night Before Christmas
***** This file should be named 17135-h.htm or 17135-h.zip ***** This and all associated files of various formats will be found in: http://www.gutenberg.org/1/7/1/3/17135/
Produced by Janet Blenkinship, Suzanne Shell and the Online Distributed Proofreading Team at http://www.pgdp.net
Where the first characters of stanza were "illuminated" they were eliminated and never replaced in the htm. version.
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Al Haines (shaw) wrote:
Actually, it's fairly common practice that if a paragraph/verse starts with some kind of graphical/illuminated character, the actual character it stands for is not included in the HTML version.
And that makes the HTML pretty useless for further processing like conversion to mobile formats.
It should be made a requirement that the stream of non-markup-characters be identical in all versions of an ebook:
lynx --dump
should produce a text that wdiffs equal with the text version.
--
Marcello Perathoner
webmaster@gutenberg.org
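The check proposed above can be approximated without lynx or wdiff. The following is only a minimal sketch under stated assumptions: it uses Python's standard `html.parser` and `difflib` rather than the tools named in the message, it ignores everything lynx adds when rendering (alt text, link lists, indentation), and the names `TextExtractor`, `html_words` and `word_diff` are hypothetical helpers invented for illustration.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect non-markup character data, skipping <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_words(html):
    """Return the words of the non-markup character stream of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join without separators so "<span>T</span>he" becomes the word "The".
    return "".join(parser.chunks).split()


def word_diff(html, plain_text):
    """List word-level differences between the HTML and plain-text versions."""
    a = html_words(html)
    b = plain_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

An empty result means the two word streams agree. A stanza whose first letter exists only as an image would show up as a `replace` entry for its first word.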

At 03:39 PM 14/04/2010, you wrote:
Al Haines (shaw) wrote:
Actually, it's fairly common practice that if a paragraph/verse starts with some kind of graphical/illuminated character, the actual character it stands for is not included in the HTML version.
And that makes the HTML pretty useless for further processing like conversion to mobile formats.
It should be made a requirement that the stream of non-markup-characters be identical in all versions of an ebook:
lynx --dump
should produce a text that wdiffs equal with the text version.
and it does. stripping this:
<p><span class="dropcapc"> </span><span class="dropcap">T</span>he children were nestled all snug in their beds,<br /> While visions of sugar-plums danced in their heads;<br /> And mamma in her kerchief, and I in my cap,<br /> Had just settled our brains for a long winter's nap,<br /><br /> </p>
gives you this:
The children were nestled all snug in their beds, While visions of sugar-plums danced in their heads; And mamma in her kerchief, and I in my cap, Had just settled our brains for a long winter's nap,
JHowse
================================================================================
"Turning a Picture into a thousand words"
Preserving History One Page at a Time!!
Celebrating more than 17,350 books posted to Project Gutenberg!
Join Project Gutenberg's Distributed Proofreaders
http://www.pgdp.net/c/
================================================================================

If the between-stanza illustrations are so easily included, then why not the illuminated characters? Six of one, half a dozen of the other... eh?
Me... I would include both the plain ASCII letter AND the graphic letter.
"Fairly common practice" = "UNfairly common practice"...
On Wed, 14 Apr 2010, Jeannie Howse wrote:
At 03:39 PM 14/04/2010, you wrote:
Al Haines (shaw) wrote:
Actually, it's fairly common practice that if a paragraph/verse starts with some kind of graphical/illuminated character, the actual character it stands for is not included in the HTML version.
And that makes the HTML pretty useless for further processing like conversion to mobile formats.
It should be made a requirement that the stream of non-markup-characters be identical in all versions of an ebook:
lynx --dump
should produce a text that wdiffs equal with the text version.
and it does. stripping this:
<p><span class="dropcapc"> </span><span class="dropcap">T</span>he children were nestled all snug in their beds,<br /> While visions of sugar-plums danced in their heads;<br /> And mamma in her kerchief, and I in my cap,<br /> Had just settled our brains for a long winter's nap,<br /><br /> </p>
gives you this:
The children were nestled all snug in their beds, While visions of sugar-plums danced in their heads; And mamma in her kerchief, and I in my cap, Had just settled our brains for a long winter's nap,
JHowse

PG texts seem to be distributed to readers through a number of different channels. In a sense, PG has become the dominant wholesaler, with a number of retailers, and it also provides direct distribution.
Source texts are provided to PG by DP (with trivial exceptions) in two formats: plain text and HTML. But PG and other mediators distribute ebooks in a variety of different formats, and given the variety of devices, readers require a number of other formats. If anything, this will be increasingly true. All these ebook formats must somehow be derived, through one or more transformation processes, from one or the other of the two originals.
Here are my naive, uninformed perceptions of the trends among four different segments: untransformed plain text, transformed plain text, untransformed HTML, and transformed HTML.
1. Readers who read ebooks using the original plain-text versions, distributed directly or indirectly, are a significant but declining proportion of the whole.
2. Readers who read ebooks using the original HTML versions, distributed directly or indirectly, are a significant proportion of the whole, not declining as rapidly, but still declining (because they require a real browser and a large enough screen to read them with any level of fidelity).
3. Some proportion of readers are reading ebooks derived from plain-text versions but transformed using some kind of software to infer formatting. I suspect this proportion is declining as well; it's hard to do, and readers increasingly expect more from ebooks on their increasingly sophisticated devices.
4. That leaves the rest, who are reading ebooks derived from the original HTML versions. My suspicion is that the majority of ebooks are already provided this way, and (especially with the increasing acceptance of de jure and de facto sub-HTML standards) this will only increase.
How accurate is this assessment?
Based on the distribution among these four segments, should PG and DP make any changes in the way ebooks are prepared and supplied?

On Wed, 14 Apr 2010, Marcello Perathoner wrote:
And that makes the HTML pretty useless for further processing like conversion to mobile formats.
I've bitched about this before at DP and it got me nowhere. I didn't really jump up and down either though.
--
Greg Weeks
http://durendal.org:8080/greg/

Even more spectacular than the "illuminated letter" problem [which is bad enough] [and which I would hope most transcribers would avoid nowadays by choosing NOT to include GIFs for illuminated letters and other trivial "printer's art"], you also have texts where the transcriber has chosen to leave some text in GIF-only mode, and/or other text in GIF mode AND OCR'ed mode, such that the MOBI and EPUB versions may have 0, 1 or 2 copies of a particular entire paragraph of text.
And/or the HTML was written in a non-linear form, in which case the MOBI and EPUB versions may have 0, 1, 2 or N copies of any particular passage in the text.
And captions on images may be retained in the image, included in the HTML, and/or included in the alt-tag, meaning that a particular user with a particular reading device may see or hear the image caption 0, 1, 2 or 3 times.
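Part of the caption problem described above can be checked mechanically. The sketch below is an illustration only, not an existing PG or DP tool: it assumes captions appear as the `alt` attribute of `<img>` tags and possibly again as ordinary text, and it counts how often each alt text recurs in the visible text. A caption baked into the image pixels cannot be detected this way. The names `CaptionAudit` and `caption_counts` are hypothetical.

```python
from html.parser import HTMLParser


class CaptionAudit(HTMLParser):
    """Record non-empty <img alt="..."> values and all visible character data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.alts = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt") or ""
            if alt.strip():
                # Normalize internal whitespace before storing.
                self.alts.append(" ".join(alt.split()))

    def handle_data(self, data):
        self.text.append(data)


def caption_counts(html):
    """Map each alt text to the number of times it also occurs as visible text."""
    audit = CaptionAudit()
    audit.feed(html)
    audit.close()
    visible = " ".join("".join(audit.text).split()).lower()
    return {alt: visible.count(alt.lower()) for alt in audit.alts}
```

A count of 1 or more means a device that renders both alt text and body text would present that caption at least twice.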

I have no objection to having both the illuminated GIF file and the ASCII equivalent character. I see these as just fine, with no impediment to either reading or searching or quoting, other, of course, than any artifact of the GIF file, which is usually not really much of a problem when I cut and paste.
As for the MOBI, EPUB, etc., formats, as long as it's easy from the average reader's POV, it should be acceptable.
Michael
On Fri, 16 Apr 2010, James Adcock wrote:
Even more spectacular than the "illuminated letter" problem [which is bad enough] [and which I would hope most transcribers would avoid nowadays by choosing NOT to include GIFs for illuminated letters and other trivial "printer's art"], you also have texts where the transcriber has chosen to leave some text in GIF-only mode, and/or other text in GIF mode AND OCR'ed mode, such that the MOBI and EPUB versions may have 0, 1 or 2 copies of a particular entire paragraph of text.
And/or the HTML was written in a non-linear form, in which case the MOBI and EPUB versions may have 0, 1, 2 or N copies of any particular passage in the text.
And captions on images may be retained in the image, included in the HTML, and/or included in the alt-tag, meaning that a particular user with a particular reading device may see or hear the image caption 0, 1, 2 or 3 times.

I have no objection to having both the illuminated GIF file and the ASCII equivalent character. I see these as just fine, with no impediment to either reading or searching or quoting, other, of course, than any artifact of the GIF file, which is usually not really much of a problem when I cut and paste.
As for the MOBI, EPUB, etc., formats, as long as it's easy from the average reader's POV, it should be acceptable.
OK, then someone needs to think this through and come up with standards and expectations, because what is happening now is "not working." Again, it is not infrequently the case in one or more of the file formats that PG is distributing that a particular item of text shows up 0, 1, 2, or 3 times, where the "right answer" is once -- or maybe twice -- if, as you suggest, one accepts redundancy in the case of illuminated letters.
As you suggest, probably the simplest answer is that if someone wants to put in illuminated letters they also include the plain-text version of the letter, and then presumably one should NOT include an alt-tag on the "illustration" [when it is actually just an illuminated letter]. What one *ought* to do for a no-illustration distribution given a "real" illustration with an alt-tag is yet another matter that needs to be thought out.
I also suggest it would be nice if we had a naming convention for illuminated letters, or some such equivalent, such that the file format conversion software, and/or other software, can tell whether a particular HTML file "really" has illustrations, or if it just contains illuminated letters.
For example, in the text in question, when I ask PG for the MOBI version with *no images* this is what I currently get (which is not quite what one would hope for!):
...
Saying her Prayers
T was the night before Christmas, when all through the house
Not a creature was stirring, not even a mouse;
The stocking were hung by the chimney with care
In hopes that St. Nicholas soon would be there;
Sleeping Mouse
Stocking in the Fireplace
The children were nestled all snug in their beds,
While visions of sugar-plums danced in their heads;
And mamma in her kerchief, and I in my cap,
Had just settled our brains for a long winter's nap,
The children were nestled
When out on the lawn there arose such a clatter,
I sprang from the bed to see what was the matter
....
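To make the naming-convention suggestion concrete, here is a rough sketch of how conversion software might use one. No such convention exists in the thread: the `illum-` filename prefix, and the names `ImageCollector`, `classify_images` and `has_real_illustrations`, are all assumptions made up for this example.

```python
from html.parser import HTMLParser
from posixpath import basename

# Hypothetical convention: illuminated-letter images are named "illum-<x>.<ext>".
ILLUMINATED_PREFIX = "illum-"


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def classify_images(html, prefix=ILLUMINATED_PREFIX):
    """Split image sources into (illuminated letters, real illustrations)."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    letters = [s for s in collector.sources if basename(s).startswith(prefix)]
    illustrations = [s for s in collector.sources if not basename(s).startswith(prefix)]
    return letters, illustrations


def has_real_illustrations(html, prefix=ILLUMINATED_PREFIX):
    """True if the page contains any image that is not an illuminated letter."""
    return bool(classify_images(html, prefix)[1])
```

With something like this, a "no images" build could drop the illuminated-letter images silently (the plain-text letter being present alongside them) and decide separately what to do with the alt text of real illustrations.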
participants (9)
- Al Haines (shaw)
- don kretz
- Greg Weeks
- James Adcock
- Jeannie Howse
- Jim Adcock
- Marcello Perathoner
- Michael S. Hart
- Scott Olson