Basic simple test case.

Here is a PG project that needs some formatting help. The HTML conversion would probably best be termed "fail". Yet it should be a simple case. How would you want proofers to submit it after "matching the scan"? What would you consider to be a properly marked up copy in the format of your choice? How should it get there from the previous step? http://www.gutenberg.org/cache/epub/14668/pg14668.html

It appears these are the images: http://archive.org/details/mcguffeysseconde00mcgu

There are a number of McGuffey Readers in the catalog:

+---------+---------------------------------------------------+
| etextid | title                                             |
+---------+---------------------------------------------------+
| 14668   | McGuffey's Second Eclectic Reader                 |
| 14640   | McGuffey's First Eclectic Reader, Revised Edition |
| 14766   | McGuffey's Third Eclectic Reader                  |
| 16751   | McGuffey's Sixth Eclectic Reader                  |
| 1490    | The New McGuffey Fourth Reader                    |
| 15040   | McGuffey's Fifth Eclectic Reader                  |
| 15577   | A History of the McGuffey Readers                 |
| 1489    | The New McGuffey First Reader                     |
| 14642   | McGuffey's Eclectic Primer, Revised Edition       |
| 14880   | McGuffey's Fourth Eclectic Reader                 |
| 15456   | McGuffey's Eclectic Spelling Book                 |
+---------+---------------------------------------------------+

On Tue, Oct 9, 2012 at 8:53 AM, don kretz <dakretz@gmail.com> wrote:
Here is a PG project that needs some formatting help.
The HTML conversion would probably best be termed "fail". Yet it should be a simple case.
How would you want proofers to submit it after "matching the scan"?
What would you consider to be a properly marked up copy in the format of your choice? How should it get there from the previous step?

"don" == don kretz <dakretz@gmail.com> writes:
don> Here is a PG project that needs some formatting help.
don> The HTML conversion would probably best be termed "fail".
don> Yet it should be a simple case.
don> How would you want proofers to submit it after "matching the scan"?
don> What would you consider to be a properly marked up copy in
don> the format of your choice? How should it get there from the previous
don> step?
don> http://www.gutenberg.org/cache/epub/14668/pg14668.html

In this book one can improve the formatting at will, but its worst failure is in txt. Look at its main page, http://www.gutenberg.org/ebooks/14668 and compare the pdf (that has images) or the original http://archive.org/details/mcguffeysseconde00mcgu with the UTF-8 txt as provided by PG as default txt. In txt you have the following:

    TABLE OF VOCALS.

    Long Sounds

    Sound as in      Sound as in
    a     ate        e     err
    a     care       i     ice
    a     arm        o     ode
    a     last       u     use
    a     all        u     burn
    e     eve        oo    fool

    SHORT SOUNDS.

    Sound as in      Sound as in
    a     am         o     odd
    e     end        u     up
    i     in         oo    look

All these vocals carry a diacritic sign, easy to represent in unicode, but here everything is ASCII. These diacritics have been considered irrelevant, but they are the essential feature of this reader.

There are also transcription errors. For example, in this

    TABLE 0F ASPIRATES.

    Sound as in      Sound as in
    f     fifi       t     tat
    h     him        sh    she
    k     kite       ch    chat
    p     pipe       th    thick
    s     same       wh    why

"fifi" is "fife", and "kite" is "cake".

Carlo

On 10/9/2012 10:56 PM, Carlo Traverso wrote: [snip]
In this book one can improve the formatting at will, but its worst failure is in txt.
Well, I wouldn't say that the /worst/ failure is in the text, but you are right that it is pretty bad.
Look at its main page, http://www.gutenberg.org/ebooks/14668 and compare the pdf (that has images) or the original http://archive.org/details/mcguffeysseconde00mcgu with the UTF-8 txt as provided by PG as default txt.
I didn't need to go to archive.org, I just went down into the basement and pulled the McGuffey's Readers from the box where they were stored. My edition was printed (not published or copyrighted) by the Fairfax Christian Bookstore in Tyler, Texas. The McGuffey's Readers are still quite popular among home schoolers, presumably because they make no mention of uncomfortable topics such as evolution or climate change. [snip]
All these vocals carry a diacritic sign, easy to represent in unicode, but here everything is ASCII. These diacritics have been considered irrelevant, but they are the essential feature of this reader.
This was one of the first things I noticed; you don't even need to look at the page scans to realize that this has happened. Looking at the scans (or, in my case, the paper) you realize that these diacritical marks are also missing from the word lists that begin each lesson--a fatal flaw.
There are also transcription errors. For example, in this
TABLE 0F ASPIRATES.
    Sound as in      Sound as in
    f     fifi       t     tat
    h     him        sh    she
    k     kite       ch    chat
    p     pipe       th    thick
    s     same       wh    why
"fifi" is "fife", and "kite" is "cake"
I don't know if you noticed, but the 'O' in "OF" was also replaced by a zero. And in at least one spot (I haven't reviewed the entire file yet) a page header ("ECLECTIC SERIES") was included as part of the text.

I also found the transcriber's note at the beginning of the file a bit gratuitous. I have no problem with notes like that, which are subjective and have no bearing on the actual transcription process, being included in the catalog, but they really shouldn't be stuck in the file.

This file obviously needs a lot of work, and probably is more deserving of attention than many of the recent obscure and esoteric works for which raw page scans are adequate to serve the academic community. I'll try to fix it as best I can, but this will probably take me several weeks. I'll post iterations on my web server as I go along, and provide comments here as iterations are completed.

On 10/10/2012 9:09 AM, Lee Passey wrote: [snip]
This file obviously needs a lot of work, and probably is more deserving of attention than many of the recent obscure and esoteric works for which raw page scans are adequate to serve the academic community. I'll try to fix it as best I can, but this will probably take me several weeks. I'll post iterations on my web server as I go along, and provide comments here as iterations are completed.
Iteration one is available at http://www.passkeysoft.com/~lee/pg14668.html. In this iteration I did the following:

Removed all "style=''" attributes. Added a link to "gutenberg.css" for all styling.

Encapsulated the Gutenberg boilerplate in <div class="gutenblurb">...</div>. My style sheet sets that class to "display:none" so I don't ever have to look at it again.

Encapsulated the transcriber's notes into <div class="fm notes">.

Encapsulated the title page presentation into <div class="tp">. Changed book title headers to <h1>, pursuant to rule 4. Removed <p> from non-paragraphs on the title page. Some header normalization throughout the remainder of the file has been performed, but is not yet complete.

Encapsulated the copyright information into <div class="copyright">.

Made some headway in applying rule no. 1: <p> is reserved for paragraphs. E.g. "Preface" was changed from <p> to <h3>; the dateline at the end of the preface was changed from <p> to <div class="closer">.

The blob of text starting at <h2 id="id00032"> through the end of <p id="id00035">, which is obviously a table of contents, was converted to a table of contents using <ul>, <ol> and <li> elements, pursuant to rule no. 2.

Ersatz paragraphs id00041, id00043, id00045, id00047, id00049, and id00051 were obviously tabular data, so I converted them to tables, pursuant to rule no. 3.

As best I could, I restored diacritical markings to the words contained in these 6 tables. Letters carrying diacritical marks can be represented in a number of ways in Unicode. I chose to represent them as ASCII characters + combining diacritical marks because some of the characters had no other representation in Unicode and I wanted to use a consistent code page. If the combining diacritical marks are ignored, the collation order of the words will be preserved (for whatever that's worth). The drawback to this approach is that I don't know how widespread support for combining diacritical marks is among HTML user agents.

<p id="id00062">[Illustration:...]</p> was changed to <div id="id00062" class="illustration" title="..."><img src="" alt="..."/></div>. When an image cannot be located (which is always for this file), the alternate text will be displayed; when an image becomes available, the alternate text will not be displayed. This ought to be a rule, but probably not in the top 20. Some further illustrations have been converted, but not yet all.

The vocabulary word list at the beginning of each chapter (e.g. ersatz paragraphs id00065, id00066, id00067, id00068) is problematic. Technically this is a list, but HTML has really bad support for floating columns. For the time being I have implemented them as tables, but I don't feel good about it. Suggestions are solicited.

Words in the word list are obviously divided on syllables; the apostrophe in those words is a replacement for an acute accent indicating stress. I replaced those apostrophes with the unicode acute accent symbol (´).

In the reading exercises, the paragraphs are numbered. I was tempted to place each paragraph as an item in an ordered list for automatic numbering, but decided not to. Opinions are welcomed.

Other changes were made, relating to handwriting and correspondence, but I'm running out of time. More to follow in the next installment.
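P.S. To make the table conversion concrete, here is roughly what one row of the first vocals table now looks like (a sketch only; the class name, the macron convention for long vowels, and the entity spellings are illustrative, not pasted from the file):

    <table class="vocals">
      <tr><th>Sound</th><th>as in</th></tr>
      <!-- long a: ASCII "a" + U+0304 COMBINING MACRON -->
      <tr><td>a&#x304;</td><td>a&#x304;te</td></tr>
    </table>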

On 10/10/2012 9:09 AM, Lee Passey wrote:
This file obviously needs a lot of work, and probably is more deserving of attention than many of the recent obscure and esoteric works for which raw page scans are adequate to serve the academic community.
Hey, I read obscure 19th century novels for FUN, and I couldn't read them on my iPod as raw page scans. I don't think I'm the only one. As for the "obscure and esoteric" -- it may be that no one will read those works in their entirety, but good texts make searching possible (as the horrid Google OCR doesn't) and allow researchers to pick out the relevant nugget of info to suggest or confirm a hypothesis. Also, you never know what could be useful. I recall someone posting at DP that an early 20th century periodical on nut tree cultivation had proved very helpful in practice. PG has a history of animosity to scholars and scholarly tastes, which I deplore. -- Karen Lofstrom

PG has a history of animosity to scholars and scholarly tastes, which I deplore.
I think the animosity is more directed to the idea that public domain books should be coded in painstaking detail -- and then locked up in a digital locker somewhere where no one can actually read or use them.

On 10/11/2012 12:33 AM, Karen Lofstrom wrote:
PG has a history of animosity to scholars and scholarly tastes, which I deplore.
The animosity I have is against making books unusable for the 99.9% (who just want books that work. everywhere. anytime. on whatever device they choose to read them.) for the sake of the 0.1% that want some special feature. Real academics don't use PG books anyway. They use google scans or get a grant and digitize the book themselves. And they want TEI and not some silly home-brewed HTML. Regards -- Marcello Perathoner webmaster@gutenberg.org

And they want TEI and not some silly home-brewed HTML.
And then they claim their semantic analysis entitles what they just did to new copyright status. And then they lock that TEI up in digital vaults that only other academics in their same consortium can even look at. PG should become concerned when people say they are doing semantic analysis, and are not simply typesetting a book.

On Fri, Oct 12, 2012 at 12:33 PM, James Adcock <jimad@msn.com> wrote:
And they want TEI and not some silly home-brewed HTML.
And then they claim their semantic analysis entitles what they just did to new copyright status. And then they lock that TEI up in digital vaults that only other academics in their same consortium can even look at.
PG should become concerned when people say they are doing semantic analysis, and are not simply typesetting a book.
No. It's none of PG's business how other groups handle their transcribing. Tearing other people down doesn't get anything done. If people say they are doing semantic analysis, and decide to lock it up, there's no point in us caring one way or the other; we may as well treat it as if it had never been done in the first place. If they are doing semantic analysis and are making it available to the world, why on Earth would we want to be jerks and cop an attitude because other people doing semantic analysis are doing things we don't approve of? -- Kie ekzistas vivo, ekzistas espero.

No. It's none of PG's business how other groups handle their transcribing. Tearing other people down doesn't get anything done. If people say they are doing semantic analysis, and decide to lock it up, there's no point in us caring one way or the other; we may as well treat it as if it had never been done in the first place. If they are doing semantic analysis and are making it available to the world, why on Earth would we want to be jerks and cop an attitude because other people doing semantic analysis are doing things we don't approve of?
I disagree. To me this is just another way to engage in copyfraud, just like all the dimestore editions which include an introductory chapter of "scholarly analysis" from Prof. Joe Schmoe and then slap a new copyright notice back on the "derivative" work.

PG needs to make sure that what they publish represents "typesetting" and NOT "semantic analysis." PG should speak out strongly against all people and all communities which engage in copyfraud -- including much of the TEI "scholarly" community.

PG should speak out strongly for returning copyright back to the sensible origins that the founding fathers intended -- namely to allow that an author could make a reasonable profit off his own efforts during his own lifetime -- and *not* the current Mickey Mouse copyright laws which are effectively allowing printing corporations to hold a monopoly in perpetuity. One of the rights of the purchaser of a work -- at the time they made that purchase -- was the reasonable expectation that copy prohibitions on that purchased work would expire in a reasonable amount of time -- and not be subject to constantly increasing durations.

On Mon, Oct 15, 2012 at 8:18 AM, James Adcock <jimad@msn.com> wrote:
PG needs to make sure that what they publish represents "typesetting" and NOT "semantic analysis." PG should speak out strongly against all people and all communities which engage in copyfraud -- including much of the TEI "scholarly" community.
Those two sentences have no connection. You do not further the cause of the public domain by attacking certain derivative works that some people find useful. If people need semantically analyzed works, then they should make semantically analyzed works. -- Kie ekzistas vivo, ekzistas espero.

Those two sentences have no connection. You do not further the cause of the public domain by attacking certain derivative works that some people find useful. If people need semantically analyzed works, then they should make semantically analyzed works.
Nonsense. PG has a long history of running into problems with people who claim to have made a useful contribution to a "derivative work" when in fact all they are trying to do is bung up the works to keep copies from being made. The academic community is no more immune to this behavior than anyone else.

The keyword you use above is "useful." Is this "academic work" useful? Nope. It's just another excuse to find another way to lock up books which are out of copyright so that the public cannot in practice use that which is already theirs -- namely everything that has risen to the public domain. And you cannot use the book in the university's library anymore either -- because they dispose of the physical copies once they have digitized them.

On Wed, Oct 17, 2012 at 9:02 PM, James Adcock <jimad@msn.com> wrote:
Is this "academic work" useful? Nope.
Right, and James Adcock is the final arbiter of what academic work is useful or not. You've yet to explain why someone who wants to give a semantically marked-up version to Project Gutenberg should be attacked, or what immoral crime they've committed.
And you cannot use the book in the university's library anymore either -- because they dispose of the physical copies once they have digitized them.
Never have I seen them do that to a book they've semantically analyzed. They've scanned books and disposed of them; perhaps you should object to anyone scanning books, because someone has done that in the past. -- Kie ekzistas vivo, ekzistas espero.

Lee,

I don't remember seeing your numbered rules. Where are they? They all look right to me so far.

Do you want a blog id on readingroo.ms? This would be easier to follow if it accumulates somewhere. Otherwise I'll start a new category and copy them over.

Don

On Wed, Oct 10, 2012 at 3:20 PM, Lee Passey <lee@passkeysoft.com> wrote:
[snip]

On 10/10/2012 6:56 PM, don kretz wrote:
Lee,
I don't remember seeing your numbered rules. Where are they? They all look right to me so far.
Thread with subject "requirements feasibility plan" dated October 8.
Do you want a blog id on readingroo.ms? This would be easier to follow if it accumulates somewhere.
I've never been much of a blogger, but capturing these rules, the objections thereto, and the reformulation thereof is probably a good idea. Go ahead and give me an id (and the URL), but don't start complaining if I start posting articles about the proper role of government :-).

Iteration one is available at http://www.passkeysoft.com/~lee/pg14668.html.
Things you are doing which make people (and products) unhappy:

Inclusion of javascript in a book.

Use of italic for body text.

Combining diacriticals seem to seriously hose a lot of products.

On 10/11/2012 10:52 AM, James Adcock wrote:
Iteration one is available at http://www.passkeysoft.com/~lee/pg14668.html.
Things you are doing which make people (and products) unhappy:
Inclusion of javascript in a book.
There is no javascript in the file, or in its included files.
Use of italic for body text.
Italics are used only once in the file, and then only for a single word.
Combining diacriticals seem to seriously hose a lot of products.
Combining diacriticals are not well supported among user agents, especially the less capable ones like the Kindle. However, no alternative to their use is proposed here, so I can do nothing to address this concern.

Italics are used only once in the file, and then only for a single word.

Well, call it what you like but the "Dear Santa" stuff -- whatever you want to call it -- drives real readers nuts.

Combining diacriticals are not well supported among user agents, especially the less capable ones like the Kindle.

Not sure a Kindle even knows what a combining diacritical is. However, they hose many more "capable" products too. Haven't seen any product anywhere which supports them well. Do you know of one?

Quoting James Adcock <jimad@msn.com>:
Combining diacriticals are not well supported among user agents, especially the less capable ones like the Kindle.
Not sure a Kindle even knows what a combining diacritical is. However, they hose many more "capable" products too. Haven't seen any product anywhere which supports them well. Do you know of one?
The Kindle Previewer handles them reasonably well (but then that may rely on OS services, and not be a feature of the Kindle itself).

Note that combining diacritics are essential in some scripts (dotted Hebrew, all Indic scripts), and in those scripts, if supported, they do work. Unfortunately, combining diacritics are only something of an afterthought with Latin script, and depend on some information being available in fonts and tools, but the situation is improving, and I no longer hesitate to put combining diacritics into PG texts.

BTW, leaving them out is not an option, as they are in the source, and serve a purpose there. (The older Tagalog books I've worked on used a g with tilde; of course modern Tagalog no longer uses that, but removing it would require me to update the entire spelling, which is a considerable investment in time -- distributed antiquated spelling revision, anybody? I've done it for one Dutch book in old orthography, and it took me weeks, including writing tools, and reading through the entire work to catch further remnants of old spelling. English speakers probably don't realize the benefits of not having changed their orthography for several centuries...)

Jeroen.

On Wed, October 17, 2012 10:43 pm, James Adcock wrote:
Italics are used only once in the file, and then only for a single word.
Well, call it what you like but the "Dear Santa" stuff -- whatever you want to call it -- drives real readers nuts.
Don't blame me, blame McGuffey. Back when I was in elementary school one of the things we were taught was how to do cursive writing (I understand there is a movement in education to quit teaching this subject). This is also one of the subjects the McGuffey reader attempts to teach. The "Dear Santa" section you refer to is presented as an exercise in orthography; students are presented with a letter in cursive, ostensibly written to Santa Claus, which they are instructed to imitate.

The cursive nature of this section is problematic. Presumably, the /best/ solution would be to include an image of the letter, as was done in the Microsoft Word and PDF versions of the document at Project Gutenberg, and this is an alternative I may return to. In the interim, what I did was to transcribe the letter, then mark it with <div class="write"> to indicate that it should be rendered as handwriting. There are other "slate exercises" in the book that I have not yet marked up, and I will probably treat them in the same way.

The associated CSS file specifies the font to use (cursive), which you are apparently unhappy with. Just download the file, and change your standard gutenberg.css file to use the font that /you/ are most comfortable with. Or don't use a CSS file at all, and get a less appealing version, which is still superior to that provided by Project Gutenberg. (You can do this online with IE and Firefox by selecting "no styles" in your browser options.) The whole point of standard CSS usage is that you can obtain a CSS file that most closely represents your tastes, and those choices will be reflected in all files.
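The rule itself is trivial; in my gutenberg.css it amounts to something like this (the named face is illustrative; the generic "cursive" fallback is the point):

    div.write { font-family: "Lucida Handwriting", cursive; }

and anyone who hates handwriting faces can override it with, say, div.write { font-family: serif; }.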
Combining diacriticals are not well supported among user agents, especially the less capable ones like the Kindle.
Not sure a Kindle even knows what a combining diacritical is. However, they hose many more "capable" products too. Haven't seen any product anywhere which supports them well. Do you know of one?
I used the free Microsoft Visual Web Developer (http://www.microsoft.com/visualstudio/eng/downloads#d-express-windows-deskto...) to edit this file, and it handled the combining diacriticals just fine. IE 8 and Firefox 16 recognize and display the marks, but don't back up quite far enough to be crystal clear. I haven't yet created the ePub version, so I can't comment yet on how they are handled in ADE or CoolReader.

The main problem is what to use as alternatives. About 3/4 of the combined characters have single-value code points scattered throughout the Unicode charts. But the remaining 1/4 of the characters have no single-value alternatives; combining diacriticals are the /only/ way to represent them. By using combining diacriticals I can represent /all/ of the characters used by the book, and I limit myself to a single Unicode range (0300-036F). By using combining diacriticals I have preserved the content of the original book, and if there are devices that do not support this portion of the Unicode standard it should be trivial to run a transformation on the file to convert those characters to something that the device in question supports. Life is full of trade-offs.
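A concrete pair of examples (a sketch; the entity spellings are illustrative): the long "a" has a precomposed alternative, but the marked "oo", if I recall the tables correctly, does not:

    <!-- precomposed: U+0101 LATIN SMALL LETTER A WITH MACRON -->
    &#x101;te
    <!-- decomposed, as in my file: a + U+0304 COMBINING MACRON -->
    a&#x304;te
    <!-- combining-only: o + U+035E COMBINING DOUBLE MACRON + o;
         no precomposed form of this pair exists -->
    o&#x35E;o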

The associated CSS file specifies the font to use (cursive), which you are apparently unhappy with. Just download the file, and change your standard gutenberg.css file to use the font that /you/ are most comfortable with. Or don't use a CSS file at all, and get a less appealing version, which is still superior to that provided by Project Gutenberg. (You can do this online with IE and Firefox by selecting "no styles" in your browser options.)
This is stupid. The job of the typesetter is to make intelligent typographical decisions -- including understanding how the limitations of the media he is targeting are going to affect those choices. The idea that the end customer is going to edit your file simply doesn't work. If you want to get people to buy into your view of how the world ought to be coded -- at least make something that looks half attractive and is readable.

On 10/19/2012 6:17 AM, James Adcock wrote:
This is stupid.
Ahh, I see you are a graduate of the BowerBird/Perathoner school of rhetoric. Do go on...
The job of the typesetter is to make intelligent typographical decisions -- including understanding how the limitations of the media he is targeting is going to affect those choices.
Perhaps, but it raises the question, "who is the typesetter?"

Back in ancient times, books were read from flimsy sheets of processed wood pulp, mass-produced by applying dyes from large machines called "printing presses." Back in those days, a quasi-priesthood developed called "typesetters." Like their forerunners, the masons, this society spent years developing the arcane knowledge of how best to represent words on pages. These "typesetters" existed because setting up a printing press was an expensive proposition, and once the presses started rolling you would have thousands of copies before they would stop again. If you were going to successfully sell the book, it was important that the layout and typography would appeal, if not to the majority of readers, at least to the plurality. Printing a book to appeal to the tastes of a single reader was the ultimate vanity publishing, and simply impractical.

I have noticed a tendency among human beings--certainly not universal, but common--to become dogmatic about their beliefs. Eventually, the practical lessons about typesetting best practices, learned from observation and empirical evidence, became articles of faith: "There is but one way to lay out a book, and it is /my/ way! If you, lowly reader, don't like the way your book looks you are a sinner, and need to return to the values of the masses, which /I/, of course, represent!" This is, of course, a bit of an exaggeration (but only a bit). Most professional typesetters have a practical understanding of the competing needs of media and audience (and business) and on the whole produce generally acceptable products. It's typically only the wannabe typesetters who are dogmatic about the rules.

The advent of the transformative power of computers truly upset that applecart. And in a triumph of close-mindedness, it is amazing that so many people have not only failed to grasp the extent of the upset, they deny that the apple cart has tilted at all. Suddenly, the ultimate vanity publishing was no longer impractical, it became common (see, e.g., the much discussed Huck Finn, and just about any other work produced by David Widger). Yet these vanity publishers still clung to the 19th century typesetting dogma: "If it looks good to me, it will look good to everyone! And anyone who does not agree is just an outlier and of no consequence!" These vanity publishers quickly learned that computers have lowered the cost of entry which had previously barred them from publishing, but they still don't understand the algorithmic power of the computer, which would allow them to segregate the layout of a book from the structure of a book (or perhaps they are simply too committed to their own layout dogma to care).
The idea that the end customer is going to edit your file simply doesn't work.
True. Thankfully, I don't believe in that idea. I believe that it is possible to create an e-book file which has embedded within it all the hooks necessary to allow it to be /programmatically/ typeset according to a wide range of needs and tastes. I clearly don't have all the answers yet, indeed I may have very few of them, but I'm relying on certain clear-headed individuals like Don Kretz, Carlo Traverso, Jeroen Hellingman and even Joshua Hutchinson, all of whose opinions I respect even when I disagree, to help me find those answers.

I am not an adherent of the typesetters' sect. I don't pretend to be, I don't want to be, and I'm probably not even qualified to be. I'm simply trying to build a framework where a real typesetter can come along and say, "add this style here, this style there, and this style over there, and you will have an e-book that satisfies your personal esthetic."

Whether an end user is going to edit a file depends, I guess, on just how sophisticated an end user he is. I don't expect, or even want, an end user to edit the framework file. Tweaking a standard CSS file is fine for the sophisticated user. What I really would hope for is for computer-literate members of the Society of Typesetters to come along and build a significant number of CSS files, all of which apply layout to the same elements, but in different ways. The end user's job is then to simply say, "I like Adcock formatting, or Widger formatting, or some other, better formatting, so I will get one of the corresponding CSS files and put it next to the downloaded file; I will then see things how I like." A default CSS file can be provided for people who aren't picky, and those who like the current VT-52 format can skip the formatting altogether.

The ePub format is a bit challenging in this regard, as most ePub user agents do not let you replace a CSS file included in the zip file (although many, not including ADE, let you disable internal styles; Cool Reader lets you not only disable the publisher's CSS, it also allows you to specify a user CSS file that overrides the internal CSS). I currently have a prototype web-based ePub creator running at readingroo.ms. That plain-jane user interface allows you to specify a Gutenberg e-text number, and then to upload your own preferred CSS file. The resulting ePub will then conform to your own preferences as expressed in /your/ CSS file. /That/ is what my idea is.
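For instance, two hypothetical taste files (the file names and specific values are mine, for illustration only) that lay out the very same markup differently:

    /* hart.css -- block paragraphs, boilerplate hidden */
    p { text-indent: 0; margin: 1em 0; }
    .gutenblurb { display: none; }

    /* print.css -- book-style paragraphs, boilerplate shown small */
    p { text-indent: 1.5em; margin: 0; }
    .gutenblurb { display: block; font-size: smaller; }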
If you want to get people to buy into your view of how the world ought to be coded -- at least make something that looks half attractive and is readable.
You forgot to add the words "to me" at the end of your sentence. Fortunately, I created something that looks attractive /to me/ and which /I/ find highly readable. And because everyone in the world likes what I like, the current state of the file is obviously in its highest and best state. I see no reason to create a file that appeals to /you/, because my values represent the world at large, and if you don't find them appealing, then you are obviously living in a state of sin, and you should repent and accept my dogma. (Is there an emoticon for "tongue-in-cheek"?)

You know those references above to "vanity publishers" who are blindly committed to their own dogma? Just so we're clear, I want you to know that I include you, Mr. Adcock, in that group.

All that has happened is that Lee has once again confirmed that there are two camps of belief among readers of this forum.

One group believes that they can simply do semantic markup "of everything" and then someone else can apply the formatting decisions later and everything will just "magically work."

The other group doesn't believe this works; that, on the contrary, someone needs to make good "engineering judgment" decisions on formatting now rather than expecting someone else to do it later; and that in general there is not a limited set of semantic markups that can be uniformly applied such that the problem can be well-divided and the "engineering judgment" decisions about formatting can be made later.

This second group says: "Hey, if the first group cannot even show good 'engineering judgment' about displaying their own semantic markup in an attractive and sensible way using ONE set of formatting decisions -- namely their own -- then how likely is it that their semantic markup is going to prove to be useful to someone else implementing a different set of formatting decisions at some later date?"

Some of your facts are OK, but I think you might put them together a little more constructively.

There's a legitimate argument that, if we hope many people are going to be involved proofing a text, the instructions need to be clear, simple, and few; and they should be matching the image, plus as little markup as possible to capture the essential semantics (to be defined.) Few would expect DP proofers to be applying complex formal markup like TEI, ReST, or LaTeX. ZML is the only markup that anyone has suggested might have a chance, but it has other issues. So positive suggestions, and especially working examples, should be welcome.

---------------------------

On the other hand, once the handwork is done, a version of the text that is exhaustively constructed with all the essential semantics (to be defined) clearly identified and located would, it is supposed, make it possible to write mapping modules to generate formatted texts that convey the structure and meaning of texts clearly and attractively. We have few examples of such a "master text" that has proven capable of providing the source for clear and attractive ebooks in various formats.

---------------------------

I think no one is advocating any form of text that serves both purposes simultaneously, so that apparently leaves another step in the process - converting proofed text into some master text. Not many have offered to discuss how this would be done. Which leaves the suspicion that some might expect the DP "post-proofer" volunteers to fill this role. That seems an unlikely scenario to me. But however it is accomplished, it would probably (it seems to me) depend heavily on the path provided by the essential semantics (to be defined.)

--------------------------

I mentioned the McGuffey text as a possible test case for a discussion with real examples of alternatives. I thank Lee Passey for picking up the cue. I find it tedious to endure advocates who only discuss their own ideas, so this will have at least two of us for a while, anyway.

I'm currently using the McGuffey text, and learning from how Lee has applied XHTML as markup, presumably as an example of text format #2 above; and I'm building a list of candidates for "essential semantics". It will be interesting to me to see how Lee views his construction from the same perspective.

In the meantime, I'm using the resources of the WordPress installation on readingroo.ms to see how much of the semantic content of McGuffey is available (as bowerbird would like it to be) inherently in the plain text, and how much needs to be added, by constructing views that allow different versions of text to be viewed in parallel, and edited.

The host has the essential capabilities of a linux server, including utilities for I think all the formats commonly mentioned (except ZML), so if anyone thinks their preference would be interesting and helpful, I think that could be pretty easily accommodated. But two of us is enough for me to start.

On Sat, Oct 20, 2012 at 12:52 PM, James Adcock <jimad@msn.com> wrote:
All that has happened is that Lee has once again confirmed that there are two camps of belief among readers of this forum.

One group believes that they can simply do semantic markup "of everything" and then someone else can apply the formatting decisions later and everything will just "magically work."

The other group doesn't believe this works; that, on the contrary, someone needs to make good "engineering judgment" decisions on formatting now rather than expecting someone else to do it later; and that in general there is not a limited set of semantic markups that can be uniformly applied such that the problem can be well-divided and the "engineering judgment" decisions about formatting can be made later.

This second group says: "Hey, if the first group cannot even show good 'engineering judgment' about displaying their own semantic markup in an attractive and sensible way using ONE set of formatting decisions -- namely their own -- then how likely is it that their semantic markup is going to prove to be useful to someone else implementing a different set of formatting decisions at some later date?"

On 10/20/2012 11:43 PM, don kretz wrote: [snip]
I think no one is advocating any form of text that serves both purposes simultaneously,
No, that is precisely what I am advocating. If you carefully segregate the semantics/structure of a text from its presentational elements, and have a technology that allows these two elements to be merged at the time of use, it is possible to mark up a text so that it can be used directly by an end user as well as being the source for transformation into other markups.

As it turns out, we do have a technology that allows semantics and presentation to be merged at the time of use: XML + CSS. And most available user agents, such as HTML browsers and ePub-based readers, are capable of performing this merging (MobiPocket and first-generation Kindles are notable exceptions to this rule).

HTML is not the only markup language that can serve this function. In the past I have created a CSS file that could be attached to a TEI file and which resulted in browser rendering equivalent to HTML + CSS. HTML simply has the advantage of a set of predefined styles that are used as basic styles which can then be overridden by internal or external style sheets. In the absence of style sheets HTML can provide a mostly acceptable presentation, whereas without explicit style sheets other XML vocabularies have no presentation whatsoever. (Although, this does not preclude someone from creating a TEI user agent that /does/ have predefined styles for TEI elements. I think I may be able to force Cool Reader to do just that; yet more experimentation is needed....)

So yes, I want to define a uniform set of semantic inflection (class attributes) that can be applied to HTML files that will allow them to be used as a master file that can be transformed into any other markup language, used in combination with a CSS file to provide rich presentation in HTML-based user agents, and which can degrade gracefully in HTML-based user agents when no CSS styles are present.
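For a taste of what I mean, a few CSS rules (a sketch using standard TEI element names; not my complete file) are enough to give a raw TEI document a serviceable browser rendering:

    /* give TEI's structural elements block behavior */
    text, front, body, back, div, p, lg, l { display: block; }
    /* section headings */
    head { display: block; font-weight: bold; font-size: 1.3em; margin: 1em 0; }
    p { margin: 0.5em 0; }
    /* inline highlighting */
    hi[rend="italic"] { font-style: italic; }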

On 10/21/2012 07:44 PM, Lee Passey wrote:
So yes, I want to define a uniform set of semantic inflection (class attributes) that can be applied to HTML files that will allow them to be used as a master file [...]
You want to specify a set of PG-proprietary extensions to HTML that mimic the semantic tags of TEI. How is the shoehorning of PG texts into Yet Another Proprietary Markup Language going to help the spread of PG ebooks? -- Marcello Perathoner webmaster@gutenberg.org

Let's back up a step, and see if there are some areas we can agree on, namely there are a variety of areas that pretty much every book has which are "simple as dirt" yet the books PG posts are still not handling these "simple everyday items" in a robust and reader-enjoyable manner.

1) Paragraphs. PG is still serving books, frequently, like all the time, with broken paragraph formatting.

2) TOC. Hard to implement in HTML and to get it to work "right" on the major platforms.

3) Blockquotes. Similar issues to paragraphs.

4) Illustrations. Hard to get right. Will work on some platforms and not others. Some submitted illustrations don't work on some platforms.

5) Covers. Every book should have them, and they ought to be easy to implement.

6) Title page. Again, pretty much every book should have them, and they should be easy to implement. The title page info should not have to also be submitted redundantly 12 other places.

7) PG Boilerplate. Should be implemented in an attractive and non-obtrusive manner, which does not scare off the readers, nor make PG look like idiots, and should be written in such a manner as to convince most readers that the boilerplate is actually a good thing to their advantage.

8) Statement that this book is "risen to the public domain" and what that means. The implication that PG is giving away this book is false, because the book is not PG's to give away. Rather, the book belongs to the public in the first place.

9) A clean, fun, positive-feedback way to submit books to PG, such that people WANT to submit books to PG, rather than do so grudgingly.

10) A clean, fun, easy, robust way to preview and "smooth read" one's work and formatting and "conformance to standards" testing before submitting it to PG.

So what I claim we need is a simple method, using universally available and well-supported tools, to do these kinds of "dirt simple all the time" things, and do them in a way that actually works on BOTH the submitter's side, and PG's side of things. Such support need not be implemented in terms of this, that or the other language. Nor does it matter much whether it is implemented client side or server side or a mix thereof.
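To put points 5 and 6 in concrete terms, a title page need not be more than this (a sketch; the class names are borrowed from Lee's iteration one, and the author markup is illustrative):

    <div class="tp">
      <h1>McGuffey's Second Eclectic Reader</h1>
      <div class="author">William Holmes McGuffey</div>
    </div>

and an auto-generated cover needs nothing more than that same title and author.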

Let's take the McGuffey book as an example to review these points and provide an example for doing them properly.

On Sun, Oct 21, 2012 at 10:02 PM, James Adcock <jimad@msn.com> wrote:
Let's back up a step, and see if there are some areas we can agree on, namely there are a variety of areas that pretty much every book has which are "simple as dirt" yet the books PG posts are still not handling these "simple everyday items" in a robust and reader-enjoyable manner.

1) Paragraphs. PG is still serving books, frequently, like all the time, with broken paragraph formatting.
I can think of cases I've seen where paragraphs have not been handled properly; but I have no idea if they are the same as yours. As it stands, your assertion about broken paragraphs wouldn't be helpful for me as a text provider because I probably think I have marked all my paragraphs properly; you may disagree and you may be right; but there's no information I can use to fix what's wrong.
2) TOC. Hard to implement in HTML and to get it to work "right" on the major platforms.
I don't understand why we are manually constructing tables of contents in the first place. All the chapters need to be located; all the chapter headings need to be identified and positioned; and all it takes is some software to create a guaranteed accurate TOC from the text.
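For instance (a sketch; the ids and lesson numbering are made up), once every heading carries an id, the TOC is nothing but generated anchors:

    <ul class="toc">
      <li><a href="#lesson01">LESSON I.</a></li>
      <li><a href="#lesson02">LESSON II.</a></li>
    </ul>
    ...
    <h2 id="lesson01">LESSON I.</h2>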
3) Blockquotes. Similar issues to paragraphs.
Same comment as paragraphs. A problem I see is that some things are marked as block quotes that aren't block quotes, simply because they are formatted like block quotes.
4) Illustrations. Hard to get right. Will work on some platforms and not others. Some submitted illustrations don't work on some platforms.
Agreed, and this one will need discussion.
5) Covers. Every book should have them, and they ought to be easy to implement.
Shouldn't they be easy to auto-generate from metadata?
6) Title page. Again, pretty much every book should have them, and they should be easy to implement. The title page info should not have to also be submitted redundantly 12 other places.
I don't understand the "12 other places" comment.
7) PG Boilerplate. Should be implemented in an attractive and non-obtrusive manner, which does not scare off the readers, nor make PG look like idiots, and should be written in such a manner as to convince most readers that the boilerplate is actually a good thing to their advantage.
Agreed. But Greg needs to agree and commit to do something about it.
8) Statement that this book is "risen to the public domain" and what that means. The implication that PG is giving away this book is false, because the book is not PG's to give away. Rather, the book belongs to the public in the first place.
Probably true but I'm not sure who would disagree, or what the significance is.
9) A clean, fun, positive-feedback way to submit books to PG, such that people WANT to submit books to PG, rather than do so grudgingly.
Yes.
10) A clean, fun, easy, robust way to preview and "smooth read" one's work and formatting and "conformance to standards" testing before submitting it to PG.
Yes.
So what I claim we need is a simple method, using universally available and well-supported tools, to do these kinds of "dirt simple all the time" things, and do them in a way that actually works on BOTH the submitter's side, and PG's side of things. Such support need not be implemented in terms of this, that or the other language. Nor does it matter much whether it is implemented client side or server side or a mix thereof.
Agreed. That's why I want to challenge any proposers to provide at least a prototype of what is being proposed, in the form of examples collected in one place so they can be remembered and compared. A list server like this is a very poor medium for refining and comparing anything. It would also help to avoid reactions that categorize proposals as categorically bad, make general statements about the proposer's character or intelligence, or fail to provide a better alternative with real examples we can compare with each other. Yes, bowerbird, I know you're providing the last item. Not so much on the others.

I can think of cases I've seen where paragraphs have not been handled properly; but I have no idea if they are the same as yours. As it stands, your assertion about broken paragraphs wouldn't be helpful for me as a text provider because I probably think I have marked all my paragraphs properly; you may disagree and you may be right; but there's no information I can use to fix what's wrong.
Historically, there are three ways paragraphs have been formatted: 1) the highest-quality books used the formatting Hart followed: no first-line indent, line between; 2) medium-quality books used first-line indent, no line between; and 3) the lowest-quality books used no indent and no line between, paragraphs demarcated only by the ragged right of the last line. The PG sausage-making chain serves paragraphs in many ways, but often "belt-and-suspenders-and-suspenders," where what the end reader sees is: line between AND first-line indent AND yet another line between "just for good luck." Even this isn't the end of the world for books with very long paragraphs, but for books with lots of short dialogs it is deadly. And *why* do this in the first place?

I don't understand why we are manually constructing tables of contents in the first place. All the chapters need to be located; all the chapter headings need to be identified and positioned; and all it takes is some software to create a guaranteed accurate TOC from the text.

I use such software. It is quite successful and helps me make and proof complete and useful books. Unfortunately PG doesn't support it, requiring instead the "partial solutions" that HTML gives us.

3) Blockquotes. Similar issues to paragraphs.

A problem I see is that some things are marked as block quotes that aren't block quotes, simply because they are formatted like block quotes.

This again is arguing semantics rather than typesetting. Those of us who say "typesetting" would claim that the HTML labels don't seriously represent semantic labels in the first place, but rather serve as simple mnemonics referring to a particular formatting option.

5) Covers. Every book should have them, and they ought to be easy to implement.
Shouldn't they be easy to auto-generate from metadata?
Yes, if we really had metadata, and not the current rag-tag collection of info dragged haphazardly from the four corners of the world.

6) Title page. Again, pretty much every book should have them, and they should be easy to implement. The title page info should not have to also be submitted redundantly 12 other places.
I don't understand the "12 other places" comment.
Copyright clearance, and submission, and "title page", and metadata, and file names, and WW input, etc.

7) PG Boilerplate. Should be implemented in an attractive and non-obtrusive manner, which does not scare off the readers, nor make PG look like idiots, and should be written in such a manner as to convince most readers that the boilerplate is actually a good thing to their advantage.
Agreed. But Greg needs to agree and commit to do something about it.
Well, realistically Greg would need to agree and commit to do anything.

8) Statement that this book is "risen to the public domain" and what that means. The implication that PG is giving away this book is false, because the book is not PG's to give away. Rather, the book belongs to the public in the first place.
Probably true but I'm not sure who would disagree, or what the significance is.
OK but currently PG does not say that a book is "risen to the public domain."
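In CSS terms the three historical conventions are trivial (a sketch; each rule is an alternative, not one file):

    /* 1) no first-line indent, line between */
    p { text-indent: 0; margin: 1em 0; }

    /* 2) first-line indent, no line between */
    p { text-indent: 1.5em; margin: 0; }

    /* 3) no indent, no line between; paragraphs show only
       in the ragged right of the last line */
    p { text-indent: 0; margin: 0; }

which makes the belt-and-suspenders-and-suspenders output all the harder to excuse.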

Re: 1) Paragraphs. PG is still serving books, frequently, like all the time, with broken paragraph formatting.

Explain "broken paragraph formatting". Give examples.

5) Covers. Every book should have them, and they ought to be easy to implement.

They're easy to create, but it's up to the book's submitter to create/provide them. (Not sure if custom covers have been discussed in this forum, but I think they have.) Don't expect PG to create them for you. I've done a number of custom covers over the past few months, e.g. 41020, 41021, 41022. I keep it simple--a solid color background, with title and author, and if necessary, volume information.

6) Title page. Again, pretty much every book should have them, and they should be easy to implement. The title page info should not have to also be submitted redundantly 12 other places.

I'm baffled as to what this means. Every source book has one, or it probably couldn't be copyright-cleared. As a WWer, I can safely say that submissions without all of a title page's info are rare, and usually from beginners. As to submitting info in 12 other places--rubbish. Info has to be entered only once, in the copyright clearance submission form. That info is carried over to when the finished book is uploaded for WWing, at which time the uploader can make corrections to the info. The only time I have to enter info again is in the metadata section of one of my RST-based projects (e.g. 41020, 41021, 41022).

9) A clean, fun, positive-feedback way to submit books to PG, such that people WANT to submit books to PG, rather than do so grudgingly.

A brass band, maybe?

10) A clean, fun, easy, robust way to preview and "smooth read" one's work and formatting and "conformance to standards" testing before submitting it to PG.

Another baffler. How (and why) is PG supposed to provide smooth-read facilities to submitters? Independent producers are expected to proof/SR their submissions, and make sure they meet PG's standards, before uploading. (Those standards are well documented, but whether you like them or not, or are willing to meet them or not, is up to you.) I've SRed every one of my 1200+ submissions (just over 1100 to PG-US, over 100 to PG-Canada), and enjoyed, or was interested by, almost all of them. (I did learn early in my PG activity that I'm only a so-so on-screen SRer, so I do all my SRing from a paper print-out, away from my computer.)

Al

Don't expect PG to create them for you.
Well, PG does create them for me -- it drops the "Big Blue PDA of Death" in there for me -- it would be better if it left off covers entirely, or auto-generated a simple cover with title and author on it.
I've done a number of custom covers over the past few months, e.g. 41020, 41021, 41022. I keep it simple--a solid color background, with title and author, and if necessary, volume information.
What tool, if any, do you use to generate these?
As to submitting info in 12 other places--rubbish. Info has to be entered only once, in the copyright clearance submission form. That info is carried over to when the finished book is uploaded for WWing, at which time the uploader can make corrections to the info.
Take a look at what epubmaker does, for instance.

9) A clean, fun, positive-feedback way to submit books to PG, such that people WANT to submit books to PG, rather than do so grudgingly. A brass band, maybe?

There is something seriously wrong when it is easier to submit a book commercially to Amazon than donate one's work to PG.
10) A clean, fun, easy, robust way to preview and "smooth read" one's work and formatting and "conformance to standards" testing before submitting it to PG. How (and why) is PG supposed to provide smooth-read facilities to submitters?
Again, I have tools that I can use on my end that work pretty well for me in this regard, but then PG won't allow me to use these tools, and instead requires me to down-grade my efforts to their "standards." The "guess what will happen" aspect of epubmaker is a particular pain.
Independent producers are expected to proof/SR their submissions, and make sure they meet PG's standards, before uploading. (Those standards are well documented, but whether you like them or not, or are willing to meet them or not, is up to you.)
The "World of Al" is one where the high priests preach an unchanging world with an unchanging religion where nothing can change, and nothing can get better. It must always remain the same. Certainly "well documented" is total BS as many conversations here have discussed. One *might* be able to find a particular piece of documentation somewhere on some PG or DP server if one knows about it to begin with, but one will find that the documentation(s) contradict each other everywhere. And unless you know in great detail what you are looking for to begin with, you will never find the documentation.

To answer the question about custom cover creation, I use Irfanview. I create a new empty image (under the Image menu), and choose a background color. I then select the area for the text to be inserted, and use Irfanview's Edit, Insert Text into Selection function (Ctrl-T) to add the text, using whichever font/color/size/alignment I choose. Any graphics software should be able to do much the same.
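For those who prefer a script to a GUI, the same sort of cover can be generated programmatically. Below is a minimal sketch using Python and the Pillow imaging library; the size, colors, text, and font path are illustrative assumptions on my part, not PG requirements or anyone's actual workflow.

    # make_cover.py -- a simple solid-color cover with title and author.
    # A sketch only: the size, colors, strings, and font path below are
    # assumptions, not PG requirements.
    from PIL import Image, ImageDraw, ImageFont

    WIDTH, HEIGHT = 600, 800             # common portrait proportions
    BACKGROUND = (25, 52, 95)            # solid dark-blue background
    FOREGROUND = (240, 240, 230)         # off-white text

    img = Image.new("RGB", (WIDTH, HEIGHT), BACKGROUND)
    draw = ImageDraw.Draw(img)

    # Any TrueType font on your system will do; this name is an example.
    title_font = ImageFont.truetype("DejaVuSerif.ttf", 48)
    author_font = ImageFont.truetype("DejaVuSerif.ttf", 32)

    def centered(text, font, y):
        """Draw one line of text horizontally centered at height y."""
        width = draw.textlength(text, font=font)
        draw.text(((WIDTH - width) / 2, y), text, font=font, fill=FOREGROUND)

    centered("McGuffey's Second", title_font, 250)
    centered("Eclectic Reader", title_font, 310)
    centered("W. H. McGuffey", author_font, 460)

    img.save("cover.jpg", quality=90)

Run once, this produces a cover.jpg in the spirit Al describes: solid background, title, author, nothing fancy.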

On Mon, Oct 22, 2012 at 7:26 AM, James Adcock <jimad@msn.com> wrote:
There is something seriously wrong when it is easier to submit a book commercially to Amazon than donate one's work to PG.
Why? Amazon doesn't care. Amazon makes everything the publisher's problem, whereas PG wants to take responsibility for the works it publishes. -- Kie ekzistas vivo, ekzistas espero. (Where there is life, there is hope.)

On 10/20/2012 1:52 PM, James Adcock wrote:
All that has happened is that Lee has once-again confirmed that there are two camps of belief among readers of this forum.
Boy, do I wish that were true. In fact, there are almost as many "camps" on this list as there are participants.

The largest group believes that presentational markup of PG texts is unnecessary or even wrong. This group apparently feels that if a human being can intuit the structure of a text, nothing further is desirable--with the consequence that a human being must always be "in the loop." These texts are not usable by both humans and computers; they are usable /only/ by humans.

There is an almost equally large group of users who feel that presentation is very important, but that a single presentation is adequate--indeed absolutely required--for each text. This group is highly fractured by the fact that each member seems to think that his/her own vision of what that presentation should be is perfect and inviolable.

A third group believes that every PG text ought to be marked up in such a way that various formats (although not necessarily other presentations) can be derived from the master copy, but that the master copy need not be usable by itself in any meaningful way.

I believe that the argument of how best to present a text is simply irrelevant. No two people will agree on the best presentation, nor should they. The power of computers gives us the ability to easily and conveniently create as many "vanity publications" as we want, and to do it /algorithmically/.

I believe that a text can be marked up using HTML in such a way that:

1. it can be viewed in any HTML user agent with an adequate, even if not perfect, presentation;
2. it can be combined with one of a number of external CSS files to create a presentation good enough to satisfy any individual reader; and
3. it can be programmatically transformed into any other non-HTML based format (e.g. PDF, RTF).

Unfortunately, I don't believe that there is anyone on this list who agrees with me, so please don't accuse others of holding beliefs that are mine alone. But so far, no one has offered any evidence that I am wrong, so I guess I will go on being a "voice crying in the wilderness."
One group believes that they can simply do semantic markup "of everything" and then someone else can apply the formatting decisions later and everything will just "magically work."
One of the many things that Arthur C. Clarke is known for is the statement that "Any sufficiently advanced technology is indistinguishable from magic." Now, I'm a programmer; I've read the specifications, I've participated in working groups, and I've written thousands of lines of code to manipulate and display HTML, CSS, and ePub files. I can see how you might see browsers as magic, but to me it's just technology.
The other group doesn't believe this works; on the contrary, someone needs to make good "engineering judgment" decisions about formatting now rather than expecting someone else to do it later, and in general there is no limited set of semantic markups that can be uniformly applied such that the problem can be well divided and the "engineering judgment" decisions about formatting made later.
I call this group the "ill-informed nay-sayers."
This second group says: "Hey, if the first group cannot even show good 'engineering judgment' about displaying their own semantic markup in an attractive and sensible way using ONE set of formatting decisions -- namely their own -- then how likely is it that their semantic markup is going to prove to be useful to someone else implementing a different set of formatting decisions at some later date?"
Thankfully, this criticism does not apply to me, because the example I provided of McGuffy's reader is presented in an immaculate, pleasant, elegant and sensible way. It is inconceivable that anyone could not view that presentation as the epitome of fine markup.

On 10/21/2012 06:38 PM, Lee Passey wrote:
I believe that a text can be marked up using HTML in such a way that, 1. it can be viewed in any HTML user agent with an adequate, even if not perfect, presentation; 2. it can be combined with one of a number of external CSS files to create a presentation good enough to satisfy any individual reader; and 3. it can be programmatically transformed into any other non-HTML based format (e.g. PDF, RTF).
3. can only be achieved if you duct-tape semantic extensions onto HTML. Those semantic extensions would be PG's proprietary ones; nobody else in the world would use them. You admit yourself that transformation comes cheap, so what makes you think that a proprietary markup used only at PG is better than an open markup used by the whole academic world, like TEI? The learning curve toward your proprietary extensions would be just as steep as toward TEI--minus the extensive documentation, the many use cases, and an active discussion list stuffed with the world's best professionals, which TEI has and you have not.
But so far, no one has offered any evidence that I am wrong, so I guess I will go on being a "voice crying in the wilderness."
The lack of falsification doesn't confirm a hypothesis. Some evidence would confirm it. So how about showing us some?
Thankfully, this criticism does not apply to me, because the example I provided of McGuffy's reader is presented in an immaculate, pleasant, elegant and sensible way. It is inconceivable that anyone could not view that presentation as the epitome of fine markup.
For all the sensational work you did with it, you didn't even manage to spell "McGuffey" correctly. But I'm sure your markup is awesome! -- Marcello Perathoner webmaster@gutenberg.org

In the interim, what I did was to transcribe the letter, then mark it with <div class="write"> to indicate that it should be rendered as handwriting. There are other "slate exercises" in the book that I have not yet marked up, and I will probably treat them in the same way.
The associated CSS file specifies the font to use (cursive), which you are apparently unhappy with.
For what it's worth, my browser renders the "Dear Santa" letter in Italic Comic Sans. :-/

On 10/19/2012 7:47 AM, Scott Olson wrote:
For what it's worth, my browser renders the "Dear Santa" letter in Italic Comic Sans. :-/
The rule in the default CSS file is:

    .write { font-family: cursive; font-style: italic }

This rule will query your operating system for the default cursive font. On MS Windows systems the default cursive font is Comic Sans MS. For information on how to change default fonts on Windows on a per-user basis, see: http://www.shallowsky.com/blog/tech/web/firefox-cursive-fantasy.html.

If you are a sophisticated user you can download the .html file and gutenberg.css, and change the above line to whatever floats your boat. To make it non-italic, remove the "font-style: italic" clause. Instead of "font-family: cursive", change it to one of the cursive fonts you have installed on your system. (Note: These are edits to the .css file, not the .html file. Don't touch the .html file unless you're certain you know what you're doing.)

I'm pretty sure there is a way to change the default font families on MS Windows in a global fashion, and I'm sure there is a way to set the default font families on all other desktop OSs as well. But I'm not interested enough in the problem to do the necessary research. Perhaps someone else here is.

I'm not going to dictate your choice of font, as the net result would probably be pissing off someone else (Mr. Adcock seems to get pissed off easily about trivial matters). I want to give you the tools to easily make the decision yourself.

On 10/19/2012 12:28 PM, Lee Passey wrote:
I'm pretty sure there is a way to change the default font families on MSWindows in a global fashion, and I'm sure there is a way to set the default font families on all other desktop OS's as well. But I'm not interested enough in the problem to do the necessary research. Perhaps someone else here is.
One of the interesting things about ePub is that it allows you to embed fonts in the publication package. When I get around to it (which means not anytime soon), I may find a free handwriting font that meets the needs of this publication and embed it in an ePub package. In my online ePub maker software, I probably ought to find a way to allow a user to upload his/her own /legally obtained/ font when creating an ePub. I'll add that as a Trac ticket on readingroo.ms when I get the chance (and the ability).
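Mechanically, embedding is straightforward, since an ePub is just a zip container: the font file goes under the content directory, gets an entry in the OPF manifest, and is referenced from the CSS with an @font-face rule. Here is a rough sketch in Python; the OEBPS/content.opf and OEBPS/gutenberg.css paths, the font filename, and the media type are assumptions about one common package layout, and real ePubs (including epubmaker's output) may be laid out differently.

    # embed_font.py -- add a handwriting font to an existing ePub.
    # Sketch only: the internal paths below are assumptions about one
    # common package layout, not a description of any PG tool.
    import zipfile

    SRC, DST = "14668.epub", "14668-font.epub"
    FONT = "handwriting.ttf"            # a legally obtained font file

    FACE = '''
    @font-face { font-family: "Handwriting";
                 src: url("fonts/handwriting.ttf"); }
    .write { font-family: "Handwriting", cursive; }
    '''
    ITEM = ('<item id="hand-font" href="fonts/handwriting.ttf" '
            'media-type="application/vnd.ms-opentype"/>')

    with zipfile.ZipFile(SRC) as zin, zipfile.ZipFile(DST, "w") as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == "OEBPS/gutenberg.css":
                data += FACE.encode("utf-8")         # declare the font
            elif info.filename == "OEBPS/content.opf":
                data = data.replace(b"</manifest>",  # register the font
                                    ITEM.encode("utf-8") + b"</manifest>")
            # The OCF spec requires "mimetype" to be stored uncompressed
            # as the first entry; copying entries in order preserves that.
            method = (zipfile.ZIP_STORED if info.filename == "mimetype"
                      else zipfile.ZIP_DEFLATED)
            zout.writestr(info, data, compress_type=method)
        zout.write(FONT, "OEBPS/fonts/handwriting.ttf")

A user-upload feature would do essentially this on the server side, with the uploaded font in place of the hard-coded filename.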

On 10/10/2012 4:20 PM, Lee Passey wrote:
On 10/10/2012 9:09 AM, Lee Passey wrote:
[snip]
This file obviously needs a lot of work, and probably is more deserving of attention than many of the recent obscure and esoteric works for which raw page scans are adequate to serve the academic community. I'll try to fix it as best I can, but this will probably take me several weeks. I'll post iterations on my web server as I go along, and provide comments here as iterations are completed.
Iteration one is available at http://www.passkeysoft.com/~lee/pg14668.html.
I spent the evening doing some servlet programming on readingroo.ms. In case anyone cares, the most recent version of my reworking of the McGuffey reader is now found at http://readingroo.ms:8080/PG2ePub/cvsget/14668/14668.html. If you want an ePub version, derived from this file, you can get it by going to http://readingroo.ms:8080/PG2ePub/GetEPub.jsp and entering the eText number 14668.

If you want an ePub version, derived from this file, you can get it by going to http://readingroo.ms:8080/PG2ePub/GetEPub.jsp and entering the eText number 14668.
Suggest you check your generated epub against: validator.idpf.org

On 10/25/2012 8:38 AM, James Adcock wrote:
If you want an ePub version, derived from this file, you can get it by going to http://readingroo.ms:8080/PG2ePub/GetEPub.jsp and entering the eText number 14668.
Suggest you check your generated epub against: validator.idpf.org
Why?

1. The software is a work in progress, and obviously still error-prone;
2. ePub depends almost exclusively on the correctness of the underlying HTML, which will likely always contain errors;
3. epubcheck, run locally, is a more effective and efficient way to validate ePubs; and
4. most errors detected by the validator are harmless schema errors.

Feel free to renew your suggestion this time next year, at which time #1 may be resolved.
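For anyone following along: epubcheck is the same engine behind validator.idpf.org, distributed as a Java jar, so local validation is a one-liner. A trivial harness in Python, assuming Java is installed and epubcheck.jar has been downloaded separately (the jar path here is a placeholder):

    # check_epub.py -- run epubcheck locally over one or more ePubs.
    # Assumes Java and a downloaded epubcheck.jar; the jar path is a
    # placeholder, not a standard location.
    import subprocess
    import sys

    JAR = "epubcheck.jar"

    for path in sys.argv[1:]:
        print("==", path, "==")
        result = subprocess.run(["java", "-jar", JAR, path],
                                capture_output=True, text=True)
        print(result.stdout, result.stderr, sep="")
        # epubcheck exits non-zero when the file fails validation
        print("FAILED" if result.returncode else "OK")

Usage: python check_epub.py 14668.epub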

Re 14668: Well, the first question would be: Why? Contrary to the idea that PG needs to scale up efforts 10X and "do everything", maybe the right answer is to scale DOWN things by 10X and fix the books that people actually want to read, but which have currently gone hopelessly moldy, rather than offer more kiddie readers?

Secondly, one needs to get page scans, which are at least available from Google in a variety of editions; you'd have to pick one.

In terms of the current "automagic" HTML conversion from txt, this txt shows the problem that PG isn't even currently "correctly" specifying that similar <p> formatting be used on each device. Given the PG txt conventions, PG should be specifying "no indent, 1em of white space between paragraphs" for the <p> styling, so that at least the basics match the txt styling. This is important because txt "formatters" are implicitly using the txt formatting rules as an element of the formatting--i.e., syntax vs. semantics *cannot* be uniquely determined automagically by examining a PG txt file, so the best one can hope to do is to emulate the PG txt layout.

In terms of hand-recoding the html/epub/mobi, there appear to be no great problems other than understanding and dealing with the issue of merged/rounded top/bottom margins or not, which can be dealt with in the standard manner of using top margins only.

In terms of design issues, there appear to be minor issues with poetry--not hard, since the poetry lines are short. (How to "correctly" autowrap lines of poetry remains problematic in html, since html doesn't support poetry.) There are issues of quasi-table listings of words, where the traditional solution is simply to linearize the lists. That is, these word lists were "packed" on paper to save paper, but on ebook devices vertical landscape is "free" [horizontal landscape, however, definitely *is not*], so the word lists can simply be "unpacked" (see the sketch below). And there appears to be a minor issue of plain rules vs. decorative rules.

But all this would still beg the first question: Why? Who is the customer? What parent would want this for their kid today? Seriously? Is some researcher interested in this for historical reasons? Well, frankly, they would be better off examining the bitmap scans. Fundamentally, one can't code anything reasonable unless one decides who the customer is and how they are going to actually be using one's efforts.
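As an illustration of the "unpacking" mentioned above, here is a rough sketch that flattens a packed word list into one word per line. The two-or-more-spaces column delimiter and the sample words are my assumptions about the txt layout, not a description of any PG tool:

    # unpack_wordlist.py -- flatten a multi-column word list into one
    # word per line. The two-or-more-spaces column delimiter is an
    # assumption about the txt layout, not a PG convention.
    import re

    packed = """\
    cat    hen    sun
    dog    fox    sky
    ant    owl    sea"""

    words = []
    for line in packed.splitlines():
        # columns in the txt are separated by runs of two or more spaces
        words.extend(re.split(r" {2,}", line.strip()))

    print("\n".join(words))   # cat, hen, sun, dog, ... one per line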
participants (11):
- Al Haines
- David Starner
- don kretz
- James Adcock
- jeroen@bohol.ph
- Karen Lofstrom
- Lee Passey
- Lee Passey
- Marcello Perathoner
- Scott Olson
- traverso@posso.dm.unipi.it