
(New subject+thread) On Mon, Jan 30, 2012 at 08:54:18PM -0800, James Adcock wrote:
... Again, the direction I hear you guys heading in is that you-all want to reinvent DP because you-all think you can improve upon DP. Not saying you can, not saying you can't. I'm just saying that is not *what I heard* Greg talking about.
From that standpoint, I've left many other (important!) things out of scope: proofreading (distributed or otherwise), tools for auto-conversion (we have some; others will plug-in but will be developed separately), and even arguments about master format (we have some, and better exposure of conversion tools and options will create a positive feedback loop to guide contributors to do a better job. And/or, open capabilities for OTHERS to fix the masters). Cataloging,
Thanks. This is just what I was thinking of pointing out (catching up on today's emails...). Below are details on what I was envisioning. The need I'm trying to address is reformatting or editing eBooks, not proofreading them. For starters, let's consider books that get into the PG collection in the usual way (i.e., culminating with a WWer posting them). I think we can do what's below while leaving the existing workflow out of scope. (It's not holy or untouchable or anything, just out of scope for what follows.)

What I'd like is (as someone else nicely put it) a continual improvement opportunity, provided to essentially anyone, for eBooks in the PG collection. This boils down to a handful of critical activities. It's mainly the third one (III) that involves crowdsourcing and new tools.

I. making changes to the master file(s) [let's imagine that we retain the practice of every PG eBook having a small number of master files, in a small number of master formats]. The short list of master formats includes RST, HTML, TeX/TEI, and plain text (perhaps with light markup). Maybe this list will grow in the future; maybe it will shrink. The main feature here is that typos or fixes or additional master formats can be contributed. Challenges have been noted (revision wars; concurrent editing; bogus fixes; spam/inappropriate additions; inconsistent files...)

II. from those master files, various other file formats can be [and are, currently] derived automatically. These include EPUB, Kindle variants, variations on HTML or text (especially if they were not previously provided), RTF, and a few others. Again, maybe this list will grow, maybe it will shrink. I do hope to offer conversion on demand, which will let people select conversion options, and maybe even different conversion programs, for their purposes. The main features here mostly exist, but not as flexibly as I'd like to see. For example, applying a variant CSS. Or making a PDF with a specific font and paper size. Many challenges are technical, such as increased sophistication in dealing with text and HTML as master formats. Others need to be addressed by policy or social means, such as the ongoing tendency to use HTML for layout that is difficult to automatically convert. These, and others, have also been discussed deeply.

III. from those master files, various other file formats that are created/contributed by individuals. I get offered these (via help@) practically every day. Usually EPUB, but also RTF/DOC, PDF. Often with typos applied. These are what I called "lovingly prepared," though of course some are better than others. These can be better than automatically-generated versions in various ways. They might have advantages over master files (for example, improved HTML). The main feature is that these would, in many cases, provide an improved reading experience (at least for some people, on some devices).

If we accept that anyone could contribute such a new file (or set of files) for an existing PG eBook, then the main challenges I see are (a) how to help readers select among them, and (b) dealing with the fact that, over time, master formats will be fixed, but not these hand-crafted derivatives. I believe the solutions are related, and fairly easy. For (a), we need a community recommender system. Stars, batting average, +1, new/novel, etc. And, "dislike," "report a problem," "report abuse," etc. For (b), "time is on our side" (yes, it is).
For a derivative format, we simply need to note when a master format was updated, but hand-derived ones were not. Plus, perhaps, a metric of how different the master format is from when the hand-derived one was created. Combined with (a), a recommendation would be attenuated based on such a metric. So, for example, a hand-derived file that gets a 90% quality rating from readers would slowly lose quality points as the master format is increasingly different. You get the idea... details TBD.

Soooooooooooooooooo.... my main starting point was to ask about existing software for group editing. Version control systems seem a reasonable fit for this. Plus, sophisticated systems like TRAC also take care of managing users and their passwords. We do need to think of all the other stuff, but the basic idea was crowdsourcing for eBook formats/conversions/presentations, metadata and supplemental info (such as author bios).

Note especially that I don't envision developing tools to help potential contributors do the conversion. Not our bailiwick: there are entire ecosystems of tools already, and all we need to do is support the community of interested donors of resulting files (including pointing them at recommended tools). The PG toolchain for automated conversion should remain available (i.e., http://epubmaker.pglaf.org).

I hope this helps clarify my original suggestion a little better. There has been some great discussion on this and related topics. -- Greg

Dr. Gregory B. Newby Chief Executive and Director Project Gutenberg Literary Archive Foundation www.gutenberg.org A 501(c)(3) not-for-profit organization with EIN 64-6221541 gbnewby@pglaf.org
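A minimal sketch of that attenuation idea (the function name, similarity measure, and numbers are invented for illustration; nothing here is existing PG code):

import difflib

def attenuated_rating(reader_rating, master_then, master_now):
    """Scale a 0..1 reader rating by how similar the current master text
    still is to the master revision the hand-derived file was built from."""
    similarity = difflib.SequenceMatcher(None, master_then, master_now).ratio()
    return reader_rating * similarity

# A file rated 0.90 keeps most of that while the master is unchanged,
# and slowly loses points as fixes accumulate in the master.
print(attenuated_rating(0.90, "best of times, worst of times",
                              "best of times; worst of times!"))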

(I have a hunch I'm going to be quoting this message a lot in the future...) On Tue, January 24, 2012 3:08 pm, Joshua Hutchinson wrote:
I'd love to see the PG corpus redone as a "master format" system (and the current filesystem supports "old" format files in a subdirectory, so if someone wanted to get the old original hand-made files, they could). I'm not particularly wedded to any master format. Hell, if someone came up with a sufficiently constrained HTML vocabulary that could be easily used to "generate" the additional formats necessary, I'm good with that.
But before anyone will start doing this work, there needs to be a consensus from PG (I'm looking at you, Greg!) that the work will be acceptable. A half-assed "master format" system is no master format system at all.
On Tue, January 31, 2012 1:22 am, Greg Newby wrote:
The need I'm trying to address is reformatting or editing eBooks, not proofreading them.
Okay, we're on the same page so far...
What I'd like is (as someone else nicely put it) a continual improvement opportunity, provided to essentially anyone, for eBooks in the PG collection.
Still good...
This boils down to a handful of critical activities. It's mainly the third one (III) that involves crowdsourcing and new tools.
This is where we start to diverge...
I. making changes to the master file(s) [let's imagine that we retain the practice of every PG eBook having a small number of master files, in a small number of master formats]. The short list of master formats includes RST, HTML, TeX/TEI, and plain text (perhaps with light markup). Maybe this list will grow in the future; maybe it will shrink.
No, according to Mr. Hutchinson's proposal there can be only one...
The main feature here is that typos or fixes or additional master formats can be contributed.
The main feature here is that a single fix to the master file will automatically propagate to all derived formats; syncing between "masters" will not be required. [little snip]
II. from those master files, various other file formats can be [and are, currently] derived automatically.
Mister Hutchinson's vision, which I am trying to follow, is that /all/ other file formats will be derived automatically from the /one/ master version. Caching is certainly advisable, but on-demand creation would be the first-step.
Many challenges are technical, such as increased sophistication in dealing with text and HTML as master formats.
The primary technical challenge is in developing a tool chain which can produce quality instances of all derived formats, and in adopting/developing a master format with the richness necessary to support that tool chain.
Others need to be addressed by policy or social means, such as the ongoing tendency to use HTML for layout that is difficult to automatically convert.
Policy means include deciding on a master format, developing rules for the use of that format, widespread publication of those rules and, to the extent possible, automated means to detect violations of those rules. Social means primarily include getting buy-in from participants to the established rules, and attracting volunteers who are willing to work with them.
III. from those master files, various other file formats that are created/contributed by individuals.
At this point we're not only not on the same page, we're not even in the same book. This suggestion is completely at odds with what Mr. Hutchinson proposed, and which I support. [bigger snip]
If we accept that anyone could contribute such a new file (or set of files) for an existing PG eBook, then the main challenges I see are (a) how to help readers select among them, and (b) dealing with the fact that, over time, master formats will be fixed, but not these hand-crafted derivatives.
I'm not saying you shouldn't pursue this vision; I'm simply saying it's not mine, and I'm completely uninterested in pursuing it with you. My vision is to develop a system where existing PG works can be reworked into a single master format, from which all other formats can be automatically derived. Proof-reading and upgrading the master files is certainly a desirable part of that vision, but it is secondary to the main goal. I'm beginning to think that Mr. Hutchinson's earlier question remains unresolved:
there needs to be a consensus from PG (I'm looking at you, Greg!) that the work will be acceptable. A half-assed "master format" system is no master format system at all.
So Mr. Newby, can we expect some support in building a repository of master format reworkings of existing PG works? Infrastructure support would be nice, but moral support is what is most needed. [big snip]
I hope this helps clarify my original suggestion a little better. There has been some great discussion on this and related topics.
Ditto. Cheers, Lee

On Tue, Jan 31, 2012 at 11:32:51AM -0700, Lee Passey wrote:
On Tue, January 24, 2012 3:08 pm, Joshua Hutchinson wrote: ...
II. from those master files, various other file formats can be [and are, currently] derived automatically.
Mister Hutchinson's vision, which I am trying to follow, is that /all/ other file formats will be derived automatically from the /one/ master version. Caching is certainly advisable, but on-demand creation would be the first-step.
Many challenges are technical, such as increased sophistication in dealing with text and HTML as master formats.
The primary technical challenge is in developing a tool chain which can produce quality instances of all derived formats, and in adopting/developing a master format with the richness necessary to support that tool chain.
I put huge backing into addressing this challenge, and so did other people on this list and elsewhere. The answer was: TEI. Then, a couple of years later, ditto. The answer was: RST.

There are scarce few things that TEI or RST are not suitable for, though I would not say we have a ton of experience. Today, I count fewer than 400 files that are TEI or RST (I didn't check how many separate eBook titles those files are associated with). The WWers have procedures to process such files - bring it on.

Remaining problems include:
- how to convince contributors to make new submissions in these master formats
- consideration of alternate master formats, as desired
- providing advice on better workflows and tool sets (including at DP) for these master formats, so contributors can be comfortable with them

Solved problems include:
- automatically generating all derived formats, with a far higher level of integrity than other master formats
- applying fixes to the master and then regenerating derived formats
- having an easily editable master format

Looking at weaknesses in these choices, from all possible angles, is certainly worthwhile. As is seeking improvements or alternatives. But the unfortunate fact is that the "better mousetrap" (or, at least, one that purports to have many of the improvements over HTML+text as master formats) was built and delivered years ago. But not too many folks have taken the bait.

I can already hear a few voices calling (or writing), "that's because your bait sucks!" and "but your bait cannot do X" and so forth. It would be contrary to past experience for anyone to think they can, indeed, come up with a solution set that will be above criticism. -- Greg

It's just like we're doing again with source control software. We're starting from a technical problem definition and working back toward the user, who is expected to conform. Entirely wrong direction. On Tue, Jan 31, 2012 at 4:14 PM, Greg Newby <gbnewby@pglaf.org> wrote:
On Tue, Jan 31, 2012 at 11:32:51AM -0700, Lee Passey wrote:
On Tue, January 24, 2012 3:08 pm, Joshua Hutchinson wrote: ...
II. from those master files, various other file formats can be [and are, currently] derived automatically.
Mister Hutchinson's vision, which I am trying to follow, is that /all/ other file formats will be derived automatically from the /one/ master version. Caching is certainly advisable, but on-demand creation would be the first-step.
Many challenges are technical, such as increased sophistication in dealing with text and HTML as master formats.
The primary technical challenge is in developing a tool chain which can produce quality instances of all derived formats, and in adopting/developing a master format with the richness necessary to support that tool chain.
I put huge backing into addressing this challenge, and so did other people on this list and elsewhere. The answer was: TEI.
Then, a couple of years later, ditto. The answer was: RST.
There are scarce few things that TEI or RST are not suitable for, though I would not say we have a ton of experience. Today, I count fewer than 400 files that are TEI or RST (I didn't check how many separate eBook titles those files are associated with).
The WWers have procedures to process such files - bring it on.
Remaining problems include:
- how to convince contributors to make new submissions in these master formats
- consideration of alternate master formats, as desired
- providing advice on better workflows and tool sets (including at DP) for these master formats, so contributors can be comfortable with them
Solved problems include:
- automatically generating all derived formats, with a far higher level of integrity than other master formats
- applying fixes to the master and then regenerating derived formats
- having an easily editable master format
Looking at weaknesses in these choices, from all possible angles, is certainly worthwhile. As is seeking improvements or alternatives. But the unfortunate fact is that the "better mousetrap" (or, at least, one that purports to have many of the improvements over HTML+text as master formats), was built and delivered years ago. But not too many folks have taken the bait.
I can already hear a few voices calling (or writing), "that's because your bait sucks!" and "but your bait cannot do X" and so forth. It would be contrary to past experience for anyone to think they can, indeed, come up with a solution set that will be above criticism.
-- Greg

I can already hear a few voices calling (or writing), "that's because your bait sucks
Look, it's not like "we" who submit books are not aware of the limitations of HTML. BUT, we just take a look at the TEI files and say "Good lord, I am not going to do that!" [because TEI is targeted at Grad Student Academics] and then we look at RST and we say "Good lord yet another set of 1970s little troff-like escape codes to memorize!" And then the HTML generated is a mess, and the rendering on the end user's device, while not horrible, is not great either, so what has been accomplished at the end? It is one thing to say "well, the rendering could be improved" but if you wanted to make the format(s) attractive to anyone the *least* one would need to do would be to create competitively attractive results. It is hard for these things to compete against HTML because HTML is, well, like everywhere. Unfortunately HTML has some really nasty bits when it comes to writing books. Among other things people "use the wrong tags" because the "right" tags are simply not there.

People respond to feedback. Positive feedback produces positive reinforcement; you'll get more the next time. Negative feedback produces negative reinforcement; you'll get less. Stop and think what kind of day-to-day feedback the DP workers get.
From PG they get very little or no feedback. From DP they get little or no feedback, and when it comes it's weeks too late.
They're wandering in the wilderness, and they know it. But hey, no one seems to care one way or the other about how they did what they do, or want to help them do better; so the ones that stick around, please themselves. No one else notices. They're going through another phase of RST despair right now. They aren't getting out what they thought they are putting in. Things break. But they are supposed to keep trying to make the system happy. They know the system isn't trying to make them happy. That's not its job.

On 02/01/2012 06:54 AM, don kretz wrote:
They're going through another phase of RST despair right now. They aren't getting out what they thought they are putting in. Things break. But they are supposed to keep trying to make the system happy. They know the system isn't trying to make them happy. That's not its job.
You are all right in saying that we have to please the volunteers, but you forget a much more important thing: We have to please the readers.

The readers do not want pretty formatting for desktop PCs, they want books they can carry in their pockets. DP could have learnt that by comparing the sales of hardcovers vs. paperbacks. Also they could have learnt that by comparing the downloads of the 'oh-so-ugly' EPUB and Kindle formats vs. the 'oh-so-pretty' HTML format. The 'ugly' formats are already more popular than HTML just a few years after they were introduced at PG.

But DP's agenda deliberately pushes people to produce the product nobody wants, because an elaborately formatted but non-functional book gives more DP bragging rights than a simply formatted and functional book. People are unsatisfied with RST because they are brainwashed into thinking that a simply formatted book is not good enough. It's not RST's fault, it's DP's fault. -- Marcello Perathoner webmaster@gutenberg.org

On Wed, Feb 1, 2012 at 2:46 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
You are all right in saying that we have to please the volunteers, but you forget a much more important thing:
We have to please the readers.
Actually, we don't. Without volunteers, Project Gutenberg will disappear; as long as it has volunteers, Project Gutenberg will be around. That is the dirty secret of non-profits; they have more motivation to get volunteers and donations than to do something useful with them.
It's not RST's fault, it's DP's fault.
Then stop using DP's files. You can very quickly have 400 files formatted the way you want, and we can see if that will make readers happier than if we have 30,000 files formatted DP's way.
People are unsatisfied with RST because they are brainwashed into thinking that a simply formatted book is not good enough.
If you're going to treat people as brainless zombies, then you have to take a little responsibility. It is RST's fault if RST can't successfully counterpropagandize and brainwash people into using it. In reality, I'm not an active member of DP, nor have I ever PPed much, but when one of the first things I read about it is "The primary goal of reStructuredText is to define and implement a markup syntax for use in Python docstrings and other documentation domains", it makes me wonder why I'd ever try to translate a 400-year-old book to a tool that's obviously not designed for it. Given the apparent lifespan of TEI-Lite, I'm having to wonder if learning this would just mean learning another new format a couple years down the road. -- Kie ekzistas vivo, ekzistas espero.

PG-RST is bad because it is badly documented, and epubmaker is not user-friendly. It is too difficult to do semantic markup (no way to nest markup) and to understand what is wrong when nothing happens. To do some simple things one has to go through strange contortions. It is too difficult to control the output from the input, and one has to do it in unnatural ways. Carlo

Carlo>...epubmaker is not user-friendly... It took me several tries of more than a day, plus installing several flavors of Python, plus reading some Python books, plus installing a dev environment, plus Marcello's help, to get epubmaker running on my Windows machine. And I had to pick up code bits from at least one site that didn't seem all that savory to me. I just don't see many people trying to install and run epubmaker as it stands today.
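For comparison, the basic RST-to-HTML step can be had from stock docutils in a few lines (a minimal sketch; this is plain docutils, which epubmaker builds on for RST input if I understand correctly, and it says nothing about PG-specific extensions or EPUB output):

from docutils.core import publish_string

rst_source = """\
The Title
=========

A paragraph with *emphasis* and a footnote [1]_.

.. [1] The footnote text.
"""

# Convert the RST snippet to HTML and show the start of the output.
html = publish_string(source=rst_source, writer_name="html")
print(html.decode("utf-8")[:300])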

On Feb 1, 2012, at 19:19, Jim Adcock wrote:
I just don't see many people trying to install and run epubmaker as it stands today.
Well, then it's rather neat that they don't have to, isn't it? You can just use the online version, which is, allegedly, also always the version that will be run on your files once they're posted: http://epubmaker.pglaf.org/ Jana

Jana>Well, then it's rather neat that they don't have to, isn't it? You can just use the online version, which is, allegedly, also always the version that will be run on your files once they're posted: http://epubmaker.pglaf.org/ Interesting. Go to Gutenberg.org, type "epubmaker" into the "search site" field, and see what you find. And/or follow that up with a Google search on "epubmaker Marcello" [since there is more than one program called "epubmaker" on the internet]

On Tue, January 31, 2012 10:54 pm, don kretz wrote:
People respond to feedback.
Positive feedback produces positive reinforcement; you'll get more the next time. Negative feedback produces negative reinforcement; you'll get less.
Stop and think what kind of day-to-day feedback the DP workers get.
From PG they get very little or no feedback. From DP they get little or no feedback, and when it comes it's weeks too late.
My immediate reaction to this statement was, "what a bunch of Dale Carnegie bunk." My second reaction was, "maybe some people are motivated by positive feedback, but not me." My third reaction to this statement was, "Hmm, there's a real kernel of truth in there." All this got me thinking, so I thought I'd share my preliminary thoughts. I'm not sure my conclusions are cohesive yet -- I'll need some time to flesh everything out.

People want to feel valued. They want to feel like the world is a better place due to their efforts. Sincere praise helps people feel valued. (Insincere praise can also help people feel valued, but only if you can convince them that it was actually sincere.)

Most people will feel they are valued, even in the absence of recognition, if they feel that they created a high-quality work product that is available to others (to a large extent the pirated versions of current books have much higher production values than those of PG books, and those people are very careful not to be "recognized").

Most people feel valued if their work is accepted. Acceptance by a group is evidence that the work is of value, and that it will be available to others. One might say that acceptance is a substitute for the kind of personal standards implied in the foregoing paragraph.

Most people feel their efforts are valued if they are empowered. If I can do my job without obtaining permission for every detail or being subjected to constant second-guessing, I will believe not only that I am trusted but also that my efforts are worth the trust.

My involvement with Distributed Proofreaders was very short (their processes didn't make me feel valued) so I can't speak knowledgeably about the user experience there. But as for PG, it fails just about every measure. I won't go through and illustrate these points with the many anecdotes that have surfaced over the past few days; that is left as an exercise to the reader. I will say that the existence of the apparatchiks that jealously control the contents of the PG repositories is particularly troublesome. I can certainly sympathize with Mr. Adcock's tirades; it would appear that his experience in trying to be an individual contributor to Project Gutenberg has been an endless stream of "you have no value" messages.

Soooooo.... On Tue, January 24, 2012 3:08 pm, Joshua Hutchinson wrote:
I'd love to see the PG corpus redone as a "master format" system (and the current filesystem supports "old" format files in a subdirectory, so if someone wanted to get the old original hand-made files, they could). I'm not particularly wedded to any master format. Hell, if someone came up with a sufficiently constrained HTML vocabulary that could be easily used to "generate" the additional formats necessary, I'm good with that.
But before anyone will start doing this work, there needs to be a consensus from PG (I'm looking at you, Greg!) that the work will be acceptable. A half-assed "master format" system is no master format system at all.
In support of Mr. Hutchinson's vision I would like to see a system where master formats are created through crowd-sourcing. Everyone should be empowered to submit changes without having to go through a gatekeeper. The current state of the repository should be open to the world at all times so a contributor will know that their work will forever be available to the public at large. I would follow the SourceForge model where files can be updated, but changes will never be deleted; everyone will always have the option to go back and get a previous version.

A clear set of standards should be developed so people can /know/ they are doing a good job, even without a "pat on the back." I don't know that this group could ever be expected to give sincere praise, but at least we could all agree to be respectful. I don't expect Project Gutenberg as an organization to change its institutional behavior, to value its volunteers more (or at all). But just maybe we could get this side project going with a different set of parameters as an example of just what could be accomplished.

Lee>it would appear that his experience in trying to be an individual contributor to Project Gutenberg has been an endless stream of "you have no value" messages.

Let me be clear that *I* know the contributions I make: The last book I submitted to PG had more than 6,500 corrections over what is posted at IA. And when PG refused to support EPUB and MOBI some years ago I created a little website to support MOBI users which still draws 200,000 downloads a month, 400,000 at Xmas time, in spite of the fact that I have tried to point those users back to PG. Maybe because the books I posted there actually "work"?

My latest set of "tirades" started when a submission that tested "clean" on my end of the submission process "crapped out" at the PG end of the process, so I sent email to PG saying "okay, help me out here, tell me what tools you are ***actually*** using to test the submission on your end of the process so that I can check it out on my end and see what is going wrong" and the WW'er responded "positively" to me by simply telling me that there was no way in hell he was going to accept my submission. Thanks for the help. PG has since accepted the work and it is drawing about 650 downloads a month. Conversely, PG has one well-known text which has been downloaded millions of times which still contains about 1500 errors. Why?

I suggest you consider putting a stop to this RST experiment and step back and come up with some kind of plan that can possibly succeed.
From what I can tell, you're asking non-technical people to essentially learn a new programming language, with little or no documentation (certainly not current), no debugger, no IDE, no or crappy error messages, and all this while the language is still being designed and implemented.
And telling them "Just trust us". I doubt the developers involved here would agree to work under the same conditions.

On 02/01/2012 11:56 PM, don kretz wrote:
I suggest you consider putting a stop to this RST experiment and step back and come up with some kind of plan that can possibly succeed.
I'm surely not doing that.

Instead of trying to prevent other people (me) from implementing their ideas and visions, you should come up with your own (possibly better) ideas and visions and implement them. When you've done that we may compare notes and maybe ditch some of the worse ideas we had. As it stands now, you are just complaining and have nothing to offer instead.

Re. the use of a VCS, I've finally decided to use hg and have already set it up on pglaf.org. Over the next days I'll redo some of the older and more popular texts in RST. I will then patch the epubmaker at gutenberg.org to pull RST directly from hg@pglaf.org. Everybody who wants to participate and is not afraid of RST and a VCS can send me their RSA public keys so I can give them SSH access to the repos.

Everybody who thinks that RST and/or a VCS are bad ideas, is encouraged to implement their own (possibly better) ideas as an individual or a group. Just stop the fruitless complaining and get your act together. -- Marcello Perathoner webmaster@gutenberg.org

Won't there be an abstraction of some kind for the user interface? Are all users to be required to get and register an SSH ticket? Does someone need to download the entire book to fix a comma? How do you get them to avoid the temptation to "really fix this baby up"? Who is going to verify the changes? Against what images?

Are the whitewashers pretty familiar with Hg? I at least find it conceptually non-trivial - I'm not sure I'd want to train a non-programmer to use it. I've seen some pretty good developers take a while to get comfortable. But I suppose you've already discussed this pretty thoroughly with everyone who will be affected... On Wed, Feb 1, 2012 at 3:24 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 02/01/2012 11:56 PM, don kretz wrote:
I suggest you consider putting a stop to this RST experiment
and step back and come up with some kind of plan that can possibly succeed.
I'm surely not doing that.
Instead of trying to prevent other people (me) from implementing their ideas and visions, you should come up with your own (possibly better) ideas and visions and implement them. When you've done that we may compare notes and maybe ditch some of the worse ideas we had.
As it stands now, you are just complaining and have nothing to offer instead.
Re. the use of a VCS, I've finally decided to use hg and have already set it up on pglaf.org. Over the next days I'll redo some of the older and more popular texts in RST. I will then patch the epubmaker at gutenberg.org to pull RST directly from hg@pglaf.org.
Everybody who wants to participate and is not afraid of RST and a VCS can send me their RSA public keys so I can give them SSH access to the repos.
Everybody who thinks that RST and/or a VCS are bad ideas, is encouraged to implement their own (possibly better) ideas as an individual or a group. Just stop the fruitless complaining and get your act together.
-- Marcello Perathoner webmaster@gutenberg.org

On 02/02/2012 04:20 AM, don kretz wrote:
Won't there be an abstraction of some kind for the user interface?
hg is very easy, not harder than zipping and unzipping. There are graphical front-ends. If anybody wants, he can write a front-end tailored to PG.
Are all users to be required to get and register an SSH ticket?
All users who want to write to the repository. There will be anonymous read access. Ideally, there will be a hierarchy. The WWers (or whatevers) sit on top and can write into the PG repository. Every WWer can work on his own or have his trusted lieutenants.
Does someone need to download the entire book to fix a comma?
Yes. Once. But since hg transfers are diffed and compressed all subsequent transfers will be very fast. N.B. I don't see hg as a crowdsourcing tool. You can use a crowdsourcing tool (trac ?) on top of hg. But since I think that crowdsourcing is a bad idea, I'll leave that as an exercise for people who want to prove that crowdsourcing is good.
How do you get them to avoid the temptation to "really fix this baby up"?
You don't. Same as now. Same as with a web interface. You have the WWers check before they commit.
Who is going to verify the changes? Against what images?
Same as now. You take work from your trusted lieutenants and check it before you commit.
Are the whitewashers pretty familiar with Hg? I at least find it conceptually non-trivial - I'm not sure I'd want to train a non-programmer to use it. I've seen some pretty good developers take a while to get comfortable. But I suppose you've already discussed this pretty thoroughly with everyone who will be affected..
Nope. The first one affected will be me. I'll have to figure many things out as I go. When it works well enough for me I'll offer it to the WWers to use. I'll run this in parallel to the traditional workflow, so nothing changes for anybody except those who voluntarily participate in the experiment. -- Marcello Perathoner webmaster@gutenberg.org

Who is going to verify the changes? Against what images?
Same as now. You take work from your trusted lieutenants and check it before you commit.
Isn't that the process that started out this discussion - the process that Greg said is the problem we're trying to fix? Come to think of it, what is the benefit of what you are proposing? Is it related to the problems the rest of us have been discussing?

On 02/02/2012 11:13 AM, don kretz wrote:
Isn't that the process that started out this discussion - the process that Greg said is the problem we're trying to fix? Come to think of it, what is the benefit of what you are proposing? Is it related to the problems the rest of us have been discussing?
The problem I'm fixing is to redo the library using master files. While at it, I'm also building a posting process that is widely automated and easy on bandwidth, especially when editing existing books. I'm not building a bug tracking system or a crowd sourcing system. But both of them can be built on top of hg. You can't even think about crowdsourcing without a VCS. -- Marcello Perathoner webmaster@gutenberg.org

The problem I'm fixing is to redo the library using master files.
While at it, I'm also building a posting process that is widely automated and easy on bandwidth, especially when editing existing books.
I'm not building a bug tracking system or a crowd sourcing system. But both of them can be built on top of hg. You can't even think about crowdsourcing without a VCS.
You are redoing the library using master files. Ok.
While you are at it, you are also building a posting process that is widely automated and easy on bandwidth, especially when editing existing books. Is the bandwidth used when editing existing books even noticeable compared to all the other bandwidth consumption? What will be significantly more automated about the posting process because of VCS? What users are benefitted, how? This is going to be a lot easier to understand if you can tell us what measurable difference someone will experience when you're done. So far we just see the painful parts.

On 02/02/2012 04:51 PM, don kretz wrote:
Is the bandwidth used when editing existing books even noticeable compared to all the other bandwidth consumption? What will be significantly more automated about the posting process because of VCS? What users are benefitted, how?
It's very noticeable whether, after a one-comma fix, you have to upload a zip file containing all formats and all pictures, or just a one-line diff that hg will apply automatically. And God forbid you work on a mobile link.
This is going to be a lot easier to understand if you can tell us what measurable difference someone willl experience when you're done. So far we just see the painful parts.
If you have already worked with a VCS, and still you cannot imagine how a VCS can improve the current workflow, then nothing I can tell you can make you see. -- Marcello Perathoner webmaster@gutenberg.org
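A rough sketch of that one-comma round trip, driving hg from a script (the repository URL and file name are invented for illustration, and it assumes Mercurial is installed locally):

import subprocess

REPO = "ssh://hg@pglaf.org/ebooks/12345"   # hypothetical repository path

def run(*args):
    # Run an hg command and fail loudly if it does not succeed.
    subprocess.run(args, check=True)

run("hg", "clone", REPO, "12345")    # the full history downloads once
# ... edit 12345/12345.rst to fix the comma ...
run("hg", "commit", "-R", "12345", "-m", "Fix comma in chapter 3")
run("hg", "push", "-R", "12345")     # only a small compressed diff travels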

I can't imagine how a kindle reader with a one-comma change will now be able to submit that, no. Especially since he's working from a copy three versions old that's been corrupted by intermediaries. You perhaps expect him to acquire an SSH ticket, register with you personally, download a completely different version, add his comma, and install an Hg client to send it back? You're right, that's certainly not a workable crowd-sourcing scenario. I can't see how to implement crowdsourcing that way either. On Thu, Feb 2, 2012 at 9:37 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 02/02/2012 04:51 PM, don kretz wrote:
Is the bandwidth used when editing existing books even noticeable compared
to all the other bandwidth consumption? What will be significantly more automated about the posting process because of VCS? What users are benefitted, how?
It's very noticeable whether, after a one-comma fix, you have to upload a zip file containing all formats and all pictures, or just a one-line diff that hg will apply automatically. And God forbid you work on a mobile link.
This is going to be a lot easier to understand if you can tell us what
measurable difference someone will experience when you're done. So far we just see the painful parts.
If you have already worked with a VCS, and still you cannot imagine how a VCS can improve the current workflow, then nothing I can tell you can make you see.
-- Marcello Perathoner webmaster@gutenberg.org

On 02/02/2012 06:47 PM, don kretz wrote:
I can't imagine how a kindle reader with a one-comma change will now be able to submit that, no. Especially since he's working from a copy three versions old that's been corrupted by intermediaries.
I'm not interested in crowd sourcing. Period. You want it. You implement it. Don't ask me how. I don't know. -- Marcello Perathoner webmaster@gutenberg.org

On Thu, February 2, 2012 3:13 am, don kretz wrote:
Isn't that the process that started out this discussion - the process that Greg said is the problem we're trying to fix?
Two different things. Mr. Hutchinson started one discussion when he said:
I'd love to see the PG corpus redone as a "master format" system (and the current filesystem supports "old" format files in a subdirectory, so if someone wanted to get the old original hand-made files, they could). I'm not particularly wedded to any master format. Hell, if someone came up with a sufficiently constrained HTML vocabulary that could be easily used to "generate" the additional formats necessary, I'm good with that.
Mr. Newby responded to that post by taking the conversation in a totally different direction than what Mr. Hutchinson had posted. Personally, I'm interested in Mr. Hutchinson's original proposal, not Mr. Newby's unrelated concerns.
Come to think of it, what is the benefit of what you are proposing?
The biggest benefit is that it will allow sophisticated users (uploaders, not downloaders) to do an end run around the PG apparatchiks. It will provide a method for the master format to evolve as our understanding of the automated creation process improves. It will provide a history of changes to documents so evolutionary dead ends can be backed out, and it will provide a record of who is responsible for which changes.
Is it related to the problems the rest of us have been discussing?
For me, it's right on point.

Good - then exactly what does it have to do with using a VCS to store document versions (compared to, say, the way they are stored now where anyone can acquire any format they want, but can't submit any new versions except to I guess a whitewasher?) On Thu, Feb 2, 2012 at 9:22 AM, Lee Passey <lee@novomail.net> wrote:
On Thu, February 2, 2012 3:13 am, don kretz wrote:
Isn't that the process that started out this discussion - the process that Greg said is the problem we're trying to fix?
Two different things. Mr. Hutchinson started one discussion when he said:
I'd love to see the PG corpus redone as a "master format" system (and the current filesystem supports "old" format files in a subdirectory, so if someone wanted to get the old original hand-made files, they could). I'm not particularly wedded to any master format. Hell, if someone came up with a sufficiently constrained HTML vocabulary that could be easily used to "generate" the additional formats necessary, I'm good with that.
Mr. Newby responded to that post by taking the conversation in a totally different direction than what Mr. Hutchinson had posted.
Personally, I'm interested in Mr. Hutchinson's original proposal, not Mr. Newby's unrelated concerns.
Come to think of it, what is the benefit of what you are proposing?
The biggest benefit is that it will allow sophisticated users (uploaders, not downloaders) to do an end run around the PG apparatchiks. It will provide a method for the master format to evolve as our understanding of the automated creation process improves. It will provide a history of changes to documents so evolutionary dead ends can be backed out, and it will provide a record of who is responsible for which changes.
Is it related to the problems the rest of us have been discussing?
For me, it's right on point.

On Thu, February 2, 2012 10:41 am, don kretz wrote:
Good - then exactly what does it have to do with using a VCS to store document versions (compared to, say, the way they are stored now where anyone can acquire any format they want, but can't submit any new versions except to I guess a whitewasher?)
When you say "document versions" do you mean subsequent iterations of a single document, or multiple variations of a document ("snowflakes")? If the former, then a VCS records all the iterations, who was responsible, what the changes were, and allows a user to pick any historical version. It enables a smooth evolution of the file as our understanding of the requirements improve. If the latter, then it may or may not have anything to do with it at all, but I don't care. I'm not interested in a system to build a snowdrift. That's Mr. Newby's vision, not Mr. Hutchinson's. As for getting a new and improved version out of the master control system and into the traditional repository, that's also a problem I'm not interested in for the moment. I want to build a system that the White Washers could turn to if they were so inclined. If they don't want to take advantage of it, that's not my problem. You can lead a horse to water, but you can't make him drink.

Everybody who wants to participate and is not afraid of RST and a VCS can send me their RSA public keys so I can give them SSH access to the repos.
Um, what I think I'm hearing is that you have decided to ignore Greg and just declare victory to yourself and your little choice of programming languages which basically no one else uses?

Hi Marcello, This is just the wrong paradigm. I have to make a superior system to improve the system, yet I cannot improve the system already in place. Even if I wanted to, I would have to work through the files and figure out how things are being done. A little more guidance is necessary. How about pointing those who want to help to some information on how you want the RST implemented, so that it works with the tools in place and will convert to the other formats? RST is extensible, as you know. So when I convert, do I use my own extensions, or are there extensions that I should use instead? regards Keith. On 02.02.2012 at 00:24, Marcello Perathoner wrote:
On 02/01/2012 11:56 PM, don kretz wrote:
I suggest you consider putting a stop to this RST experiment and step back and come up with some kind of plan that can possibly succeed.
I'm surely not doing that.
Instead of trying to prevent other people (me) from implementing their ideas and visions, you should come up with your own (possibly better) ideas and visions and implement them. When you've done that we may compare notes and maybe ditch some of the worse ideas we had.
As it stands now, you are just complaining and have nothing to offer instead.
Re. the use of a VCS, I've finally decided to use hg and have already set it up on pglaf.org. Over the next days I'll redo some of the older and more popular texts in RST. I will then patch the epubmaker at gutenberg.org to pull RST directly from hg@pglaf.org.
Everybody who wants to participate and is not afraid of RST and a VCS can send me their RSA public keys so I can give them SSH access to the repos.
Everybody who thinks that RST and/or a VCS are bad ideas, is encouraged to implement their own (possibly better) ideas as an individual or a group. Just stop the fruitless complaining and get your act together.

Everybody who wants to participate and is not afraid of RST and a VCS can send me their RSA public keys so I can give them SSH access to the repos.
Everybody who thinks that RST and/or a VCS are bad ideas, is encouraged to implement their own (possibly better) ideas as an individual or a group. Just stop the fruitless complaining and get your act together.
Seems like a bit of a silly comment, when you are literally holding the only keys to the door, and are allowing only your own ideas in. Open up the VCS to other languages and other approaches, so that the competing ideas can be compared next to your own.

On Thu, February 2, 2012 8:23 am, Jim Adcock wrote:
Seems like a bit of a silly comment, when you are literally holding the only keys to the door, and are allowing only your own ideas in.
This seems a bit harsh. I'm sure that if you asked nicely Mr. Perathoner would give you a shell account onto pglaf.org with enough rights to install any software you would like, including web-facing software.

On Thu, Feb 02, 2012 at 09:19:01AM -0700, Lee Passey wrote:
On Thu, February 2, 2012 8:23 am, Jim Adcock wrote:
Seems like a bit of a silly comment, when you are literally holding the only keys to the door, and are allowing only your own ideas in.
This seems a bit harsh. I'm sure that if you asked nicely Mr. Perathoner would give you a shell account onto pglaf.org with enough rights to install any software you would like, including web-facing software.
Actually, I'm the only one currently with that oversight for pglaf.org, and that's not where we will be doing experiments. I have other systems for that. -- Greg

Actually, I'm the only one currently with that oversight for pglaf.org, and that's not where we will be doing experiments. I have other systems for that.
OK, so should I/we out here ignore for right now the instructions Marcello just posted at: http://www.gutenberg.org/wiki/Mercurial_Repository_How-To because I was starting to try to learn Hg and public keys and the like, and I'd rather not do that if Marcello's approach isn't what PG is going to be doing???

I'm still just trying to figure out "the ground rules", which right now I hear Marcello saying is "RST Only", whereas what I would like to try is "Minimalist Tweak to Existing HTML" to get it running correctly on EPUB and MOBI. And it is not clear to me how a build chain gets invoked and how a DP/PG "end customer" who wants to try out the result of these flavors can find them and try them.

On Thu, February 2, 2012 12:35 pm, Greg Newby wrote:
Actually, I'm the only one currently with that oversight for pglaf.org, and that's not where we will be doing experiments. I have other systems for that.
In which case I am troubled that Mr. Perathoner has taken it upon himself to install a VCS in that domain. I think it should be moved to the "experimental" section so as not to disturb the pglaf.org web site.

On Thu, Feb 02, 2012 at 01:26:56PM -0700, Lee Passey wrote:
On Thu, February 2, 2012 12:35 pm, Greg Newby wrote:
Actually, I'm the only one currently with that oversight for pglaf.org, and that's not where we will be doing experiments. I have other systems for that.
In which case I am troubled that Mr. Perathoner has taken it upon himself to install a VCS in that domain. I think it should be moved to the "experimental" section so as not to disturb the pglaf.org web site.
pglaf.org doesn't have a copy of the PG collection, and doesn't have a lot of free space. But it *is* integrated with the eBook uploading/errata/etc. process. It's a fine choice for some purposes, but not where I would direct developers who need to login. I've got some other systems better suited for that purpose. -- Greg

I've put some more recent examples of the "tweak the HTML" approach as opposed to "rewrite everything in another language" up at: http://freekindlebooks.org/KF8 where these files should "work" on all Kindles, and look particularly good on Kindle Fires and recent "software" Kindles. Again, I don't think it's very hard to "tweak the HTML" to get most of it working pretty well on epub and mobi devices. Mind you, some books *are* basket cases -- typically very old books from the early days of HTML.

On 2/2/2012 7:14 PM, Jim Adcock wrote:
I've put some more recent examples of the "tweak the HTML" approach as opposed to "rewrite everything in another language" up at:
One of the things that is most frustrating about trying to converse with BowerBird is that he absolutely refuses to give a straight answer to a straight question; instead he just gives a whole bunch of examples and says, "See how great I am! Now figure out how I did it!" On that note, I'm much less interested in seeing what happens when you "tweak the HTML," than in getting a straight answer explaining exactly what you did to each file to get the output you are so proud of.

For the foreseeable future the only reasonable way to make a dent in the quality of current projects is to repair what's already there with software. I know it's possible to infer a lot of the intended display elements from the patterns applied by the PP software, mainly guiguts. Fixed-dimension images can be converted to %s. Or remove sizes entirely and let them be constrained by container elements that are already there. The infamous page markup is especially possible to refactor into something useful, hidden, and/or removed.

I'll contribute the methods I've developed for EB (which is only a subset). Others should be discoverable with a reasonable amount of effort.

It will take some work and cooperation. The critical question still remains: will PG allow existing projects to be altered this way? Under what conditions? With what verification requirements? On Thu, Feb 2, 2012 at 6:22 PM, Lee Passey <lee@novomail.net> wrote:
On 2/2/2012 7:14 PM, Jim Adcock wrote:
I've put some more recent examples of the "tweak the HTML" approach as opposed to "rewrite everything in another language" up at:
One of the things that is most frustrating about trying to converse with BowerBird is that he absolutely refuses to give a straight answer to a straight question; instead he just gives a whole bunch of examples and says, "See how great I am! Now figure out how I did it!"
On that note, I'm much less interested in seeing what happens when you "tweak the HTML," than in getting a straight answer explaining exactly what you did to each file to get the output you are so proud of.

On Thu, Feb 02, 2012 at 06:46:56PM -0800, don kretz wrote:
... It will take some work and cooperation. The critical question still remains: will PG allow existing projects to be altered this way? Under what conditions? With what verification requirements?
I already answered that in this thread, and the answer is that we do have a procedure to get fixed files back (i.e., the errata process, with a WWer in the loop).

A theme that is not well-handled by the errata process is, what if only the HTML is tweaked, to make the file more epub (etc.) friendly? That is, when the "fix" is not typos/scannos/missing pages, etc., etc., but simply formatting or markup?

The short answer is a rephrasing of the starting point from a few days ago: I'd like to go ahead and make a way to get these back into the collection, replacing the originals, *en masse*. (Actually, we keep the originals, in an 'old' subdirectory.) I don't anticipate opposition to this idea, assuming we're tweaking, not redoing the look and feel crafted by the submitter. How to tell which is which?

One thing we've done with a few people who were very active in posting/reposting/augmenting is give them direct access to upload. This is something we do AFTER the procedure is very clear. It's easy to screw things up, trust me....

My emphasis in this discussion has been to look at ways to make this type of process more efficient and scalable. We don't want to have a lot of back and forth discussion for every file, if we want to eventually re-do thousands. This interest is at least partially selfish, since I'd rather not be part of a decision process for every such fixed eBook that comes along, and I'm pretty sure the current WWers have similar feelings. -- Greg

Some portions of the changes are, I expect, going to be 100% automatable, and will be 100% beneficial in 90% of projects. Stuff like taking images/captions out of fixed-size tables and putting them into %-sized divs. With EB I write regexes that can get those right almost all the time. Probably other candidates are footnotes, chapter headings, page numbers. I only have EB and three or four others to work from.

But we could automate that pretty quickly, run it against a sample of the corpus, and check over the results thoroughly. The key is to do no other tweaks but the automated ones so we find out how close we can come. Then we may know enough to plan the next step. I'd like to hear what Lee and the others think first, though. They're better judges than I am. On Thu, Feb 2, 2012 at 11:15 PM, Greg Newby <gbnewby@pglaf.org> wrote:
On Thu, Feb 02, 2012 at 06:46:56PM -0800, don kretz wrote:
... It will take some work and cooperation. The critical question still remains: will PG allow existing projects to be altered this way? Under what conditions? With what verification requirements?
I already answered that in this thread, and the answer is that we do have a procedure to get fixed files back (i.e., the errata process, with a WWer in the loop).
A theme that is not well-handled by the errata process is, what if only the HTML is tweaked, to make the file more epub (etc.) friendly? That is, when the "fix" is not typos/scannos/missing pages, etc., etc., but simply formatting or markup?
The short answer is a rephrasing of the starting point from a few days ago: I'd like to go ahead and make a way to get these back into the collection, replacing the originals, *en masse*. (Actually, we keep the originals, in an 'old' subdirectory.) I don't anticipate opposition to this idea, assuming we're tweaking, not redoing the look and feel crafted by the submitter. How to tell which is which?
One thing we've done with a few people who were very active in posting/reposting/augmenting is give them direct access to upload. This is something we do AFTER the procedure is very clear. It's easy to screw things up, trust me....
My emphasis in this discussion has been to look at ways to make this type of process more efficient and scalable. We don't want to have a lot of back and forth discussion for every file, if we want to eventually re-do thousands. This interest is at least partially selfish, since I'd rather not be part of a decision process for every such fixed eBook that comes along, and I'm pretty sure the current WWers have similar feelings.
-- Greg
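To make the kind of regex pass Don describes above concrete, here is a minimal, hypothetical sketch in Python. The class name and table shape are assumptions made up for illustration, not DP's actual output, and a real pass would need to handle far more variants.

    import re

    # Hypothetical input: an illustration wrapped in a fixed-width table, the way
    # some post-processed PG HTML does it.  The "caption" class and the two-row
    # table layout below are assumptions, not an actual DP convention.
    FIXED_TABLE = re.compile(
        r'<table[^>]*\bwidth="\d+"[^>]*>\s*'
        r'<tr>\s*<td[^>]*>\s*(?P<img><img[^>]+>)\s*</td>\s*</tr>\s*'
        r'<tr>\s*<td[^>]*>\s*(?P<caption><p class="caption">.*?</p>)\s*</td>\s*</tr>\s*'
        r'</table>',
        re.IGNORECASE | re.DOTALL)

    def fixed_tables_to_divs(html):
        """Rewrite fixed-width image tables as percentage-sized divs."""
        def repl(m):
            # Drop the pixel width entirely and let the container scale instead.
            return ('<div class="figure" style="width: 80%; margin: 0 auto;">\n'
                    '  ' + m.group("img") + '\n'
                    '  ' + m.group("caption") + '\n'
                    '</div>')
        return FIXED_TABLE.sub(repl, html)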

I don't think you need to worry about how I fix html files. Unless you can come up with something positive or helpful. On Fri, Feb 3, 2012 at 12:09 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 02/03/2012 08:44 AM, don kretz wrote:
With EB I write regexes that can get those right almost all the time.
Are you seriously proposing to fix HTML files using regexes?
-- Marcello Perathoner webmaster@gutenberg.org

On 02/03/2012 09:18 AM, don kretz wrote:
I don't think you need to worry about how I fix html files. Unless you can come up with something positive or helpful.
Yes, I can. Look at the sources of epubmaker if you want to see how it's done. Regexes are like UFOs. Think I'm going to leave this discussion now ... -- Marcello Perathoner webmaster@gutenberg.org

Hi Don, The problem at PG is you have to produce the software and prove its value. Another problem is that the software used is poorly documented and IMHO not very well written from a software maintenance point of view. There is not much emphasis on modularity and extensibility. Most code must be either refactored or, better yet, completely rewritten from the bottom up. I believe we both know what good software should look like. If we want to do this we will have to spearhead it ourselves and hope it will be adopted. regards Keith. On 03.02.2012 at 03:46, don kretz wrote:
For the foreseeable future the only reasonable way to make a dent in the quality of current projects is to repair what's already there with software. I know it's possible to infer a lot of the intended display elements from the patterns applied by the PP software, mainly guiguts. Fixed-dimension images can be converted to %s, or their sizes removed entirely so they are constrained by container elements that are already there. The infamous page markup in particular can be refactored into something useful, hidden, and/or removed.
I'll contribute the methods I've developed for EB (which is only a subset). Others should be discoverable with a reasonable amount of effort.
It will take some work and cooperation. The critical question still remains: will PG allow existing projects to be altered this way? Under what conditions? With what verification requirements?
On Thu, Feb 2, 2012 at 6:22 PM, Lee Passey <lee@novomail.net> wrote: On 2/2/2012 7:14 PM, Jim Adcock wrote:
I've put some more recent examples of the "tweak the HTML" approach as opposed to "rewrite everything in another language" up at:
http://freekindlebooks.org/KF8
One of the things that is most frustrating about trying to converse with BowerBird is that he absolutely refuses to give a straight answer to a straight question; instead he just gives a whole bunch of examples and says, "See how great I am! Now figure out how I did it!"
On that note, I'm much less interested in seeing what happens when you "tweak the HTML," than in getting a straight answer explaining exactly what you did to each file to get the output you are so proud of.

On that note, I'm much less interested in seeing what happens when you "tweak the HTML," than in getting a straight answer explaining exactly what you did to each file to get the output you are so proud of.
I take each html file, compile it to epub using epubmaker, and from there to mobi using Kindlegen v2, take a look at it on multiple Kindles, see where it is "breaking" -- displaying totally unreasonable things, things one does not see happening in the HTML version in HTML browsers and which therefore *do not* represent the HTML author's intent -- and then I "pop the top" on the epub and take a look in there and try to find what is going wrong, which after five years of playing around with Kindles, EPUBs, PG, DP files etc. is by now usually pretty obvious for me, and I fix it.
Sometimes the encoding on the images is wrong. Usually the problem is in the CSS, which has typically been designed by somebody who has a 20" wide monitor, who is sure they "know" HTML, and is trying to figure out how to fill up that huge space. Which will almost certainly "kill the book" when the display device is 3" not 20" wide. Sometimes the problem is what is NOT in the CSS -- the CSS just "happened" to work on the author's web browser, but there was not a reasonable expectation that it *should* have worked. And then I have to fill in what should have been in there, but is not. Sometimes the problem cannot be fixed in the CSS -- as presumably Marcello discovered when he decided he had to explicitly kill page numbers in epubmaker. And sometimes the HTML is so screwed up I cannot come up with any reasonable explanation of why I see what I see in there, and then I just have to give up and declare defeat: "I'm sorry I just don't see any easy way to save this book -- maybe Marcello *should* rewrite this one." But most books can be easily "saved", hopefully without hurting the original HTML author's feelings too much.
For that matter, *I* wouldn't know how to write to a 20" wide monitor -- if I was asked to "fill it up please." Note that if PG wants to keep both "big and small" versions of the HTML in the same file there are @media pragmas that can help one do this -- just not the @medias one normally hears about. And usually the amount of changes that need to be made are small enough to keep this from getting really clunky.
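For readers who want to try the same convert-and-inspect loop, here is a rough Python sketch. The epubmaker command-line invocation is an assumption (the real entry point and options may differ); kindlegen's basic usage is the standard one, and the epub inspection relies only on the fact that an epub is a zip file.

    import subprocess, zipfile, pathlib

    def build_and_inspect(html_path):
        """Rough sketch of the convert-then-inspect loop described above."""
        book = pathlib.Path(html_path)
        epub = book.with_suffix(".epub")

        # Assumption: epubmaker is installed and exposes a command-line entry
        # point roughly like this, producing an .epub next to the source file.
        subprocess.run(["epubmaker", str(book)], check=True)

        # kindlegen takes an .epub and writes a .mobi alongside it.
        # It exits nonzero on warnings, hence check=False.
        subprocess.run(["kindlegen", str(epub)], check=False)

        # "Pop the top" on the epub: it is just a zip, so print any CSS inside
        # for manual inspection of margins, floats, fixed widths, and so on.
        with zipfile.ZipFile(epub) as z:
            for name in z.namelist():
                if name.endswith(".css"):
                    print("--- " + name + " ---")
                    print(z.read(name).decode("utf-8", errors="replace"))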

Hi Lee, You forgot the main paradigm here on the list and PG. If you have a better way — prove it! Nobody is interested in the process or willing to think much about it. Besides, nerds cannot explain how to do anything. It is a proven fact that "experts" are very bad candidates for explaining how they get things done or what is involved. regards Keith. On 03.02.2012 at 03:22, Lee Passey wrote:
On 2/2/2012 7:14 PM, Jim Adcock wrote:
I've put some more recent examples of the "tweak the HTML" approach as opposed to "rewrite everything in another language" up at:
One of the things that is most frustrating about trying to converse with BowerBird is that he absolutely refuses to give a straight answer to a straight question; instead he just gives a whole bunch of examples and says, "See how great I am! Now figure out how I did it!"
On that note, I'm much less interested in seeing what happens when you "tweak the HTML," than in getting a straight answer explaining exactly what you did to each file to get the output you are so proud of.

Hi Jim, It is nice to have examples. They can help us nerds, BUT would it not be better to have some sort of manual that can be distributed? Furthermore, it is nice that they look good on a Kindle, but would it not be better if they looked good on other readers as well? Of course, there is much benefit from your examples, since they can be used to tweak conversion tools. regards Keith. On 03.02.2012 at 03:14, Jim Adcock wrote:
I've put some more recent examples of the "tweak the HTML" approach as opposed to "rewrite everything in another language" up at:
http://freekindlebooks.org/KF8
where these files should "work" on all Kindles, and look particularly good on Kindle Fires and recent "software" Kindles.
Again, I don't think it's very hard to "tweak the HTML" to get most of it working pretty well on epub and mobi devices. Mind you, some books *are* basket cases -- typically very old books from the early days of HTML.

would it not be better to have some sort of manual that can be distributed?
Maybe, but I don't know how to get it actually read, and if it does get read there are people who won't believe it, even though if they tested what that manual says they could see the problems with their own eyes. And there is another class of people (particularly at DP) who are extremely hostile to the idea of small machines and "All I want to do is make beautiful HTML so leave me alone!" -- where "beautiful HTML" actually means "on my 20 inch monitor on my particular operating system running my particular choice of browser." [I could probably be arm-twisted into writing such a manual, but doing so is going to make *some* people at DP very unhappy, whereas others there *are* recognizing these problems and *are* honestly searching for solutions -- which is not trivial given the DP "groupthink" -- as opposed to the PG "groupthink" [where "groupthink" is clearly the wrong word for what is going on on the PG side ;-]
If you think about it, HTML really doesn't provide the capabilities to code these issues well. For example, on 4" wide machines the customer doesn't want to have body margins set *at all* -- no margin please. Plus the small machines all have margins built into the physical design of the machine, so if you write in HTML margins then you are putting margins inside of margins. Whereas on a 20" monitor you really do want margins, in part because your monitor is much wider than it is high -- and "everybody" agrees that a higher-than-wide format is better for reading. And because the typical web browser isn't designed for reading, it is designed for browsing, and it doesn't have margins designed into it anywhere, so your HTML code without margins sets glyphs smack flush with the edges of the surrounding window frame of the web browser.
Okay, so setting a % margin works, right? No it doesn't. 10% margins left and right on a 4" machine still "eat up" 0.8 inches of already-small real estate, which makes the user of the small machine really really unhappy -- in part because the 4" screen is so small that justification routines were just barely working to begin with, until you messed with it, and now justification has become really really ugly.
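A minimal sketch of the kind of stylesheet override being described here, expressed as a small Python helper that appends a narrow-screen @media block to an existing CSS file. The 30em breakpoint and the selectors are assumptions chosen for illustration, not a PG or DP convention.

    # The CSS below keeps whatever generous desktop margins a book already has,
    # but zeroes them out on narrow screens where the device frame is the margin.
    NARROW_SCREEN_OVERRIDE = """
    @media screen and (max-width: 30em) {
      body { margin: 0; padding: 0; }   /* no margins inside the device's margins */
      p    { text-align: left; }        /* justification gets ugly at ~4in widths */
    }
    """

    def add_narrow_screen_rules(css_text):
        """Append the narrow-screen override after the existing desktop rules."""
        return css_text.rstrip() + "\n" + NARROW_SCREEN_OVERRIDE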
Furthermore, it is nice that they look good on a Kindle, but would it not be better if they looked good on other readers as well?
In practice if one can get it to "work" on the Kindles then it will "work" on the other EPUB devices too. [I was just showing "Kindles" to try to demonstrate to people that "Kindles" are already becoming effectively "epub" machines with the introduction of Kindlegen V2. And "freekindlebooks" is a mobi site, not an epub site. [an early decision where I was pushing back against other sites who were pretending to be "all things to all people"]] With the possible exception of really weak "EPUB" readers such as some of the 3rd party epub apps one finds for android, and/or one computer vendor I can think of that has deliberately trashed their implementation of epub to make it really incompatible, to try to force authors to target that platform exclusively, and who is now creating tools for that platform whose EULA says "If you use this tool then we own exclusive distribution rights to anything you make with this tool [aka 'we own the copyright.']"
Again, when we are talking about "Kindles" nowadays there are really two classes of Kindles: There are "legacy" hardware Kindles, which are the ones we all know about, which cause great grief because they don't support floats and absolute positioning and a bunch of other stuff which really doesn't work on small machines anyway. And there are "modern" Kindles which support KF8, which really is pretty much just epub stuffed into a mobi wrapper. So the "modern" Kindles *are* effectively "epub" devices -- except they take a different file format wrapper. Now the only "modern" hardware Kindle at the moment is Kindle Fire -- but Amazon has been claiming "any day now" they will be releasing updated software for "recent generation" Kindles to make them KF8, i.e. "epub friendly." [The "software" Kindles such as Kindle for PC, Kindle for Android, etc. update automatically and probably have support for KF8 "epub" already, whether their owners know it or not.] [See mobi_unpack.py if you want to check out the claim that KF8 really is "just" "epub"]
The major heartache here, well there are two actually, is 1) legacy Kindles don't do floats, and DP implements almost all page numbers as floats, and sticks those floats in the middle of paragraphs, so, if the page number isn't allowed to "float" out of position, then the reader gets stuck with reading "trash" page numbers in the middle of their paragraphs exactly where the HTML author put them in the first place. Now, if you think about it, putting page numbers in the middle of paragraphs is a really really bad idea to begin with, and doesn't really follow the model of "HTML" to begin with, but, there you have it. And many people at DP and a few at PG are really wedded to the idea of page numbers (I think the "official" position at PG is that "we" don't like page numbers, if anyone looks up the directions to submitters.) [PS: "real" books put page numbers up in the upper right hand or left hand corners of the page where the reader doesn't have to read them.] [[PPS: epubmaker already throws away common implementations of page numbers when targeting Kindles.]]
And 2) the second major heartache is that Kindlegen V2 sticks "both" mobi7 and kf8 in one file for distribution. Now when such gets fed through the Amazon system, Amazon says they know the capabilities of the end recipient hardware and strip out the unused component.
But serving these from PG makes a big file even bigger -- unless PG were to turn on compression, which is an option for the mobi file format [epub always gets served with compression by definition, since epub is just a zip file format, with the result that PG epub files are much smaller than even the current PG mobi files].
And Kindlegen V2 now supports some @media pragmas which would allow PG/DP to fix Kindle-specific problems while maintaining one HTML source code. Given that it usually only takes a half dozen CSS changes [excepting page numbers] to get things "to work," this is maybe not too bad an option. Unfortunately I don't think there are any effective EPUB-specific @media pragmas out there. Some people say "@media handheld" but I *think* the epub hardware community refused that -- being afraid their devices would get stuck in the "mobile version" ghetto with the ancient crusty PDA devices such as Blackberries. PG *could* invent a "@media epub" pragma that epubmaker could chunk on [epubmaker already contains a bunch of these undocumented "pragmatic" hacks] which would allow PG/DP to continue to have one HTML source with a couple little pragmatic sections dedicated to "fixing" things that don't work on the smaller machines.
But it's not clear to me what things, if any, PG/DP should be trying to "fix" via epubmaker vs. asking submitters to think about these issues before submitting HTML code to PG. [epubmaker does do other "useful" things to HTML code, such as trying to autogen a TOC, put in a dummy "cover", splitting and repacking the HTML in epub-compatible format, etc. Also it does a lot of rewriting to try to get around the limitations of kindlegen V1 -- not clear to me how much of that rewriting is really necessary to target kindlegen v2 -- but Marcello knows these issues much better than I, having been the one bit.]
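To illustrate the kind of "strip page numbers before targeting legacy Kindles" pass mentioned above: a minimal Python sketch using BeautifulSoup. The span class names are assumptions (many DP files use something like <span class="pagenum">[Pg 123]</span>, but conventions vary), and epubmaker's real handling is more sophisticated than this.

    from bs4 import BeautifulSoup  # tolerant, tag-soup-friendly parser

    def strip_page_numbers(html, classes=("pagenum", "pageno")):
        """Remove floated page-number spans from an HTML string.

        Only intended as an illustration of the idea; the class names here
        are assumed, not a PG/DP standard.
        """
        soup = BeautifulSoup(html, "html.parser")
        for cls in classes:
            for span in soup.find_all("span", class_=cls):
                span.decompose()   # drop the node and its contents entirely
        return str(soup)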

If you have floats, you can use inset page numbers with spans and an appropriate stylesheet. But fundamentally what you're running into, and I don't know how you avoid it if you insist on XML/XHTML, is that books simply aren't well-formed in the way that XML defines and requires. You can't embed everything 100% in all its containers. Page numbers are a reflection of this problem, because conceptually they are boundaries between page elements. But page elements simply aren't well-formed because their tops and bottoms can cut right through paragraphs (and everything within which paragraphs are embedded). I think what happened to some devices is that they had to decide between supporting HTML and XHTML, and since writers can't be constrained to create well-formed XML documents (nor should they be), the devices had to choose HTML.

Example of inset page number. http://jsfiddle.net/Ap27K/2/ But of course only useful where there are floats.

Don> http://jsfiddle.net/Ap27K/2/ Not sure this makes it any better for me. Giving it a border just draws my eyes to it even more. The whole point is that I want to be able to read without being distracted by page numbers. Which is why traditionally books stick them way way up there at the top corners where your eyes don't see them unless you really really go looking for them. On my ebook reader the progress meter (think page numbers) automatically hides itself unless I explicitly ask it where I am in the book.

I imagine they are showing their own page number anyway, aren't they? I don't imagine the device wants yours. So leave it out for those devices that choose not to make it displayable other than in ways you find objectionable. You can't turn a pig into a pony by putting a saddle on it. On Fri, Feb 3, 2012 at 6:21 PM, James Adcock <jimad@msn.com> wrote:
Don> http://jsfiddle.net/Ap27K/2/
Not sure this makes it any better for me. Giving it a border just draws my eyes to it even more. The whole point is that I want to be able to read without being distracted by page numbers. Which is why traditionally books stick them way way up there at the top corners where your eyes don't see them unless you really really go looking for them. On my ebook reader the progress meter (think page numbers) automatically hides itself unless I explicitly ask it where I am in the book.

Josh, I think single-master-format is the only workable option. I'd add to that, though, that the texts should incorporate all reasonably collectable content during the original project, even if it's expected that it will be superfluous. Then strip it out for all devices; but it will still be there in the original. BTW, this is as much for simplicity as for comprehensiveness. It's just easier for everyone if the instructions are the same for all projects. This in turn implies that it may be a mistake to use an output format that we actually intend to use for some devices as the master format.

On Fri, February 3, 2012 2:14 pm, don kretz wrote:
This in turn implies that it may be a mistake to use an output format that we actually intend to use for some devices as the master format.
Yes, but I would stress /may/ be, not /will/ be. I think there is some value in having a master format that is usable on its own without preprocessing. I don't think that this feature is a "must have," but I do think it is a "should have."

But you sacrifice the elements your chosen format doesn't include. For instance, html doesn't include chapters. Or page numbers. If you fake them, you introduce ambiguity and inconsistency. On Fri, Feb 3, 2012 at 1:59 PM, Lee Passey <lee@novomail.net> wrote:
On Fri, February 3, 2012 2:14 pm, don kretz wrote:
This in turn implies that it may be a mistake to use an output format that we actually intend to use for some devices as the master format.
Yes, but I would stress /may/ be, not /will/ be. I think there is some value in having a master format that is usable on its own without preprocessing. I don't think that this feature is a "must have," but I do think it is a "should have."

I appreciate the elegance and economy of having one master format, but it does seem that there would be a great deal of benefit to having three or four.
Before the book is put through the DP process, or approved for an independent preparer, how about sorting it into one of several categories:
RST -- uncomplicated fiction and non-fiction. Everyone here seems to agree that RST works for most texts.
LaTeX -- math books. This is standard for math.
TEI -- complicated fiction and non-fiction.
Perhaps XHTML if someone can argue convincingly that it's necessary too. But I do fear that HTML tempts people into hand-tweaking for the nicest appearance on a particular ereader.
Different workflows for each format. A few post-processors and whitewashers could specialize in the more esoteric formats, and everyone else could work in RST. This seems more practical than demanding that everything be stored in a format that few know how to prepare OR that difficult texts be mutilated to remove everything that RST can't handle.
The effort to define the one true master format seems like the effort to define the one true DTD. If the one true DTD handles everything, it's too unwieldy to use. Better to have different DTDs (or schemas) for different tasks.
You'd have to have a different suite of tools for each format, for converting it into the various end-user formats, but that would be easier, in the long run, than forcing everything into the one true format.
-- Karen Lofstrom only a gamma geek, but practical

-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Karen Lofstrom Sent: Friday, February 03, 2012 7:22 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Goals and scope
I appreciate the elegance and economy of having one master format, but it does seem that there would be a great deal of benefit to having three or four.
Before the book is put through the DP process, or approved for an independent preparer, how about sorting it into one of several categories:
RST -- uncomplicated fiction and non-fiction. Everyone here seems to agree that RST works for most texts.
LaTeX - math books. This is standard for math.
TEI -- complicated fiction and non-fiction
Perhaps XHTML if someone can argue convincingly that it's necessary too. But I do fear that HTML tempts people into hand-tweaking for the nicest appearance on a particular ereader.
Different workflows for each format. A few post-processors and whitewashers could specialize in the more esoteric formats, and everyone else could work in RST. This seems more practical than demanding that everything be stored in a format that few know how to prepare OR that difficult texts be mutilated to remove everything that RST can't handle.
Text+(X)HTML submissions are not going to go away. You can't expect an independent producer, perhaps with limited text/HTML expertise/experience, to start submitting in RST. Al
The effort to define the one true master format seems like the effort to define the one true DTD. If the one true DTD handles everything, it's too unwieldy to use. Better to have different DTDs (or schemas) for different tasks.
You'd have to have a different suite of tools for each format, for converting it into the various end-user formats, but that would be easier, in the long run, than forcing everything into the one true format.
-- Karen Lofstrom only a gamma geek, but practical

On Fri, Feb 3, 2012 at 6:22 PM, Al Haines <ajhaines@shaw.ca> wrote:
Text+(X)HTML submissions are not going to go away. You can't expect an independent producer, perhaps with limited text/HTML expertise/experience, to start submitting in RST.
Then you add a step wherein someone who CAN manage RST converts the text or the HTML to RST. Surely a volunteer could be found to do that. The independent submissions can't be all that numerous. -- Karen Lofstrom

Karen>Then you add a step wherein someone who CAN manage RST converts the text or the HTML to RST. Surely a volunteer could be found to do that. I think the management over at DP will tell you as soon as you start "throwing away" the HTML formatting that DP submitters provide and "dumbify" their submission down to RST then you are chasing away volunteers in droves. Even being asked to think about condescending to support epub devices totally pisses them off.

-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Karen Lofstrom Sent: Friday, February 03, 2012 8:26 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Goals and scope
On Fri, Feb 3, 2012 at 6:22 PM, Al Haines <ajhaines@shaw.ca> wrote:
Text+(X)HTML submissions are not going to go away. You can't expect an independent producer, perhaps with limited text/HTML expertise/experience, to start submitting in RST.
Then you add a step wherein someone who CAN manage RST converts the text or the HTML to RST. Surely a volunteer could be found to do that.
And then what happens to their text/HTML files? Do they get dumped? I can think of at least four independents (one of whom is me) that would strongly resent, even flatly reject, that being done to their files.
The independent submissions can't be all that numerous.
-- Karen Lofstrom

On Fri, Feb 3, 2012 at 7:51 PM, Al Haines <ajhaines@shaw.ca> wrote:
And then what happens to their text/HTML files? Do they get dumped? I can think of at least four independents (one of whom is me) that would strongly resent, even flatly reject, that being done to their files.
Are you so wedded to having YOUR files published? Your files can't be the last step before the master format? I'm trying to get back into post-processing -- I haven't done it for years. Perhaps I'll feel the same sort of pride in my output and unwillingness to let anyone else meddle with it. But for the last five? six? years I've worked in P3, as one of the high-count proofers. I'm just one stage of a process, and I'm OK with that. It doesn't matter that my name isn't on the product. It's enough that there's a book to read. -- Karen Lofstrom

Karen>Are you so wedded to having YOUR files published? Your files can't be the last step before the master format? It saddens me when I see my efforts replaced with something that is clearly inferior by any rational measure of typographical success, by techno-nerds who are more interested in inventing and playing with their own little languages than in producing books which are fun to read. It's not just the people on this forum who do this. There are a ton of repackaging houses out there, including Amazon and Apple and the rest, who take the PG files, strip out the PG legalese, strip out the PG/DP acknowledgements, strip out the effort that has been put into making something look reasonably intelligent and reasonably responsive to the original, and replace it with some kind of gimcrack boilerplate which anyone with any background in typography, even just reading one book on typography, will tell you is just plain ugly and stupid and ignores the last 400+ years of agreement about how one designs an attractive and readable thing called "a book." And when they do so they inevitably go beyond this and step on the linguistic intentions of the original author. Because they simply don't understand, or are not willing to acknowledge, or really don't care, that formatting *also* has linguistic intent, and books typically *are not* simply "a string of undifferentiated words." And the reader doesn't realize how much has been lost in the reading experience by this constant dumbing-down process unless they go back and look at an actual copy of a first edition and see for themselves how much more sense that copy makes compared to today's techno-nerd dumbed-down version.

On 02/04/2012 06:51 AM, Al Haines wrote:
And then what happens to their text/HTML files? Do they get dumped? I can think of at least four independents (one of whom is me) that would strongly resent, even flatly reject, that being done to their files.
Why would you feel so? Lots of *authors* are happy with the house style of their publishing houses. I can't understand why a simple digitizer should feel so strongly. Why would you prefer your own formatting style over a PG house style? Think of the advantages of having a huge corpus formatted in the same style. Or a huge corpus that can be converted into any house style at the push of a button. -- Marcello Perathoner webmaster@gutenberg.org

Why would you prefer your own formatting style over a PG house style?
It's not *my* formatting style. It is an attempt to try to honestly record, in a reasonable amount of reasonably portable effort, the formatting style of the original author/publishing house. Because formatting has meaning. And what you are calling a "PG house style" is inevitably a "house lack of style," because the techno-nerds who foist these things on the world are inevitably the people who are the most tone-deaf to the issue of taste, because if they had any taste they wouldn't be doing this stuff in the first place. What you-all are ignoring is the PG rules against blind-formatting of HTML, because blind-formatting of HTML makes *no* contribution to the world, and that is what you are doing: you are foisting blind-formatting of HTML on the world via an intermediate dumb-down language, rather than taking a good honest look at, and trying to be responsive to, what the original author wrote and the original publisher published. And frankly feedbooks and the lot have been doing what you propose for many years already, and have been doing it much better.

"Marcello" == Marcello Perathoner <marcello@perathoner.de> writes:
Marcello> On 02/04/2012 06:51 AM, Al Haines wrote: >> And then what happens to their text/HTML files? Do they get >> dumped? I can think of at least four independents (one of whom >> is me) that would strongly resent, even flatly reject, that >> being done to their files. Marcello> Why would you feel so? Lots of *authors* are happy with Marcello> the house style of their publishing houses. I can't Marcello> understand why a simple digitizer should fell so strong. Marcello> Why would you prefer your own formatting style over a PG Marcello> house style? Think of the advantages of having a huge Marcello> corpus formatted in the same style. Or a huge corpus Marcello> that can be converted into any house style at the push Marcello> of a button. Authors that don't like a publisher's house style change publisher, if they have the choice. If a publisher forces an house style that users and authors don't like, they switch to other publishers. And apparently it is what is happening to PG. Carlo

On 02/04/2012 03:09 PM, Carlo Traverso wrote:
Authors that don't like a publisher's house style change publisher, if they have the choice. If a publisher forces an house style that users and authors don't like, they switch to other publishers.
I never heard of that. Can you provide some references? -- Marcello Perathoner webmaster@gutenberg.org

Carlo> Authors that don't like a publisher's house style change publisher, if
they have the choice. If a publisher forces an house style that users and authors don't like, they switch to other publishers.
Marcello> I never heard of that. Can you provide some references? feedbooks.com manybooks.net amazon.com bn.com apple.com freekindlebooks.org archive.org openlibrary.org etc. The *one* thing that sets PG apart from pretty much all these other sites (from the viewpoint of the end user) is that, up until now, most PG books (html, epub, mobi) made *some* "reasonable" effort to try to maintain the publishing style of the original book, rather than rendering them as if they were an auto-genned generic Python users' manual.

On 2/3/2012 10:51 PM, Al Haines wrote:
And then what happens to their text/HTML files? Do they get dumped? I can think of at least four independents (one of whom is me) that would strongly resent, even flatly reject, that being done to their files.
Just on a fluke, I went out to Project Gutenberg to look at our old friend Huck Finn (etext 76). When I checked the HTML version, it announced that "This file has been formatted [badly] for use with tablet readers." The original version produced by Mr. Widger has been relegated to the /old/orig76-h folder; the new version is dated this Feb 6. This is exactly what Mr. Adcock had asked that you do with /his/ version. It appears you do not have nearly as much control as you thought.

-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Lee Passey Sent: Wednesday, February 22, 2012 6:05 PM To: gutvol-d@lists.pglaf.org Subject: Re: [gutvol-d] Goals and scope
On 2/3/2012 10:51 PM, Al Haines wrote:
And then what happens to their text/HTML files? Do they get dumped? I can think of at least four independents (one of whom is me) that would strongly resent, even flatly reject, that being done to their files.
David Widger produced *all* the HTML versions of #76. (Credit was given in the text versions to the original producers of those files.)
Just on a fluke, I went out to Project Gutenberg to look at our old friend Huck Finn (etext 76). When I checked the HTML version, it announced that "This file has been formatted [badly] for use with tablet readers." The original version produced by Mr. Widger has been relegated to the /old/orig76-h folder; the new version is dated this Feb 6.
This is exactly what Mr. Adcock had asked that you do with /his/ version.
It appears you do not have nearly as much control as you thought.

Lee>Just on a fluke, I went out to Project Gutenberg to look at our old friend Huck Finn (etext 76). When I checked the HTML version, it announced that "This file has been formatted [badly] for use with tablet readers." The original version produced by Mr. Widger has been relegated to the /old/orig76-h folder; the new version is dated this Feb 6.
This is exactly what Mr. Adcock had asked that you do with /his/ version.
Sorry, but people can and do write whatever they like. In practice if you open 76 on a Kindle device you will find that it is NOT particularly well formatted for a Kindle. So, whatever has been done to 76, don't blame it on Kindles or Kindle users. More particularly, no matter how 76 is formatted, it leaves aside the issue that 76 has literally about 1,500 errors in it. See by way of comparison 32325 on a Kindle or EPUB machine (which is from a later edition which has somewhat fewer images than 76 and has some differences in name-spelling).

Karen>But I do fear that HTML tempts people into hand-tweaking for the nicest appearance on a particular ereader. You will then have the problem that people say what PG makes is ugly and I can do a better job and they pop the top on the epub or tweak it in Sigil or they mobi-unpack.py or whatever and then they party hardy and put it up on a competitive website and say how stupid and ugly PG is. Not saying we can do anything about that. Just saying that consistency is the what of little minds? Seems to me you could go a long long way just by having "official" DP and/or PG external css style sheets and tell people "hey start with this we've worked a lot of the bugs out of it already" and "if you want to override any of these styles just make your own css file which changes the things you want changed and then load that personal css file after the dp/pg 'official' style sheet."

Hi Karen, It would be better to have one master format. It actually does not matter which, as basically all of them can be adapted to support the features needed; of course, in certain instances you will then not be conforming to the standard. But since we are just creating a master format, we will have tools for generating the different output formats from it. The same goes for editing or any other workflow. The formats you have mentioned can all be extended, so in the end it does not matter which is used. There seems to be some confusion about what a master format should be and what is involved. I will be going into this soon in a thread called "A New Approach". regards Keith. On 04.02.2012 at 04:21, Karen Lofstrom wrote:
I appreciate the elegance and economy of having one master format, but it does seem that there would be a great deal of benefit to having three or four.
Before the book is put through the DP process, or approved for an independent preparer, how about sorting it into one of several categories:
RST -- uncomplicated fiction and non-fiction. Everyone here seems to agree that RST works for most texts.
LaTeX - math books. This is standard for math.
TEI -- complicated fiction and non-fiction
Perhaps XHTML if someone can argue convincingly that it's necessary too. But I do fear that HTML tempts people into hand-tweaking for the nicest appearance on a particular ereader.
Different workflows for each format. A few post-processors and whitewashers could specialize in the more esoteric formats, and everyone else could work in RST. This seems more practical than demanding that everything be stored in a format that few know how to prepare OR that difficult texts be mutilated to remove everything that RST can't handle.
The effort to define the one true master format seems like the effort to define the one true DTD. If the one true DTD handles everything, it's too unwieldy to use. Better to have different DTDs (or schemas) for different tasks.
You'd have to have a different suite of tools for each format, for converting it into the various end-user formats, but that would be easier, in the long run, than forcing everything into the one true format.
-- Karen Lofstrom only a gamma geek, but practical

"Keith" == Keith J Schultz <schultzk@uni-trier.de> writes:
Keith> Hi Karen, It would be better to have one master format. it Keith> actually does not matter which as basically all can be Keith> adapted to support the features need. of course then in Keith> certain instances you will not be conforming to the Keith> standard. I rather think that we cannot use one master format, we need a few. At least, you cannot reasonably have a master format for mathematics different from LaTeX, and we don't want to exclude contributors and contributions that cannot or don't want to prepare a master file in a format that you arbitrarily like. Carlo

I'm seeing two tendencies here. Some people want a disciplined, organized collection, with one/a few master formats, better metadata and error correction. Some people want PG to follow Michael Hart's dream of big collection, no rules, everything is welcome, do your own thing. As a researcher and former academic, I subscribe wholeheartedly to the former vision. I can't but see the latter camp as something like many self-publishers: people who want THEIR book released THEIR way, with no gatekeepers barring entry. But perhaps PG could accommodate both camps by distinguishing between PG-standard texts (generated from a master format, corrected and re-generated as necessary) and PG-alternative texts (hand-tweaked for particular e-readers, less popular formats, older versions, etc.) Search results could generate standard texts on top, alternative texts in separate category. Everything to have release dates, so that users could see which were the earlier versions and how recently the text had been corrected or updated. Users would have a way to judge the reliability of the text. If volunteers who submit alternative texts want to do error correction themselves, keeping their version in sync with the standard text, fine. But I don't think that PG should be stuck with that role. It would be enough work just to keep the standard texts updated. There would have to be SOME rules for the alternative texts, but they could be much less stringent than the rules for the standard texts. Right now, many of PG's texts would only qualify as alternative texts. But that would be OK. They would be placeholders, so that users could read something while standard texts were being prepared. -- Karen Lofstrom

Karen>Some people want PG to follow Michael Hart's dream of big collection, no rules, everything is welcome, do your own thing. Michael's dream is and will happen anyway. It's just a question of *who* wants to host the party. If PG wants to be a party of just "academics" well then go for it. I don't think many of us "non-academics" who are in this simply *because we love reading books and want to share that love with other people* will hang around long. It's just too easy to occupy some other website where more can happen faster and with more fun and with fewer "academics" clogging the pipes.

On Sat, Feb 4, 2012 at 5:08 PM, Jim Adcock <jimad@msn.com> wrote:
Michael's dream is and will happen anyway. It's just a question of *who* wants to host the party. If PG wants to be a party of just "academics" well then go for it. I don't think many of us "non-academics" who are in this simply *because we love reading books and want to share that love with other people* will hang around long.
I don't know at all what you're arguing against here. One master format is most important for those who want to share books with other people, because academics will happily read HTML or PDF files on their 20" screens. It's all the people with their Kindles and cell phones and other random junk that need other formats.
It's just too easy to occupy some other website where more can happen faster and with more fun and with fewer "academics" clogging the pipes.
Then why don't you go there? I work with PG because I have reason to believe my work will still be around after a while, and because people can reuse PG's material without stressing so much about copyright. Ease of reuse would be a nice feature there. -- Kie ekzistas vivo, ekzistas espero.

Hoi Karen, As you most likely know, PG was originally intended to be a repository. Furthermore, it was believed that the best format for longevity was the "Plain Vanilla Text". This was the sole master format. Gradually, technological advances opened the door to new ways of reading books digitally, and it was noticed that the "Plain Vanilla Texts" were suboptimal, in that these new technologies could produce more appealing output. So DP was spun off. From that day on there has been much debate about master formats. PG policy was to allow contributors a free hand, but that is a very bad idea, as PG did not and does not accept everything.
If PG wanted to, it could stick by its original purpose, be an archival repository, and accept a wide range of formats. It would not be that hard to administer and offer access. But then you would have tons of files and texts for all kinds of devices, most likely texts strictly for just one device. That is not in line with the philosophy of PG: the texts are to be available to a large (if not all) portion of the public. So the idea of a master format that accommodates most was born. What was not done was to properly design such a format to fit the task it was (is) to have; the quick and dirty road was chosen. Any programmer knows that this approach creates quick and at first acceptable results. The problems crop up when things need to be changed. THAT IS THE PRESENT DAY SITUATION.
What is truly needed, as you mention, is some form of standardization. The problem at PG is that there are no true standards to adhere to that would support a master format. We do not even need to constrain contributors to the master format; all that is needed is that they constrain their formatting information to a standard. Then the contributions can easily be converted to the master format, and the master format can be used to distribute to the rest.
YES, I can hear all the hand crafters and artists. Yet, you gals and girls forget PG was never dedicated to creating artistic works. What PG is dedicated to is offering etexts in an acceptable quality for most. PG is not a publishing house, as some have come to think of it. PG is a repository and it is not PG's responsibility to offer the etexts therein in any particular format. PG is trying to facilitate a broad base of technologies, which is a lot of hard work. I believe PG is getting some of the best mileage out of their resources. Sure they could get more, but it is about time we try to help them do this by developing a markup standard for etexts and ebooks. STOP SCREAMING, PEOPLE. I am not talking about a particular implementation. More in an upcoming thread, "A New Approach". regards Keith. On 04.02.2012 at 21:57, Karen Lofstrom wrote:
I'm seeing two tendencies here. Some people want a disciplined, organized collection, with one/a few master formats, better metadata and error correction. Some people want PG to follow Michael Hart's dream of big collection, no rules, everything is welcome, do your own thing.
As a researcher and former academic, I subscribe wholeheartedly to the former vision. I can't but see the latter camp as something like many self-publishers: people who want THEIR book released THEIR way, with no gatekeepers barring entry.
But perhaps PG could accommodate both camps by distinguishing between PG-standard texts (generated from a master format, corrected and re-generated as necessary) and PG-alternative texts (hand-tweaked for particular e-readers, less popular formats, older versions, etc.) Search results could generate standard texts on top, alternative texts in separate category. Everything to have release dates, so that users could see which were the earlier versions and how recently the text had been corrected or updated. Users would have a way to judge the reliability of the text.
If volunteers who submit alternative texts want to do error correction themselves, keeping their version in sync with the standard text, fine. But I don't think that PG should be stuck with that role. It would be enough work just to keep the standard texts updated.
There would have to be SOME rules for the alternative texts, but they could be much less stringent than the rules for the standard texts.
Right now, many of PG's texts would only qualify as alternative texts. But that would be OK. They would be placeholders, so that users could read something while standard texts were being prepared.
-- Karen Lofstrom

Keith>PG is a repository and it is not PGs responsibility to offer the etexts therein in any particular format. Sorry, but what I always heard Michael talk about was books for people to share and read. Not books locked up behind closed doors in a repository. If you don't want people to read the books, then the archive.org photocopies of pages are a more than adequate solution. If you do want people to read the books, then you quickly realize that what archive.org is doing (or Google Books) really doesn't cut it. Which is why some of us volunteer to do the hard work to make something which can actually -- in practice -- be read on real machines by real people. *Does* PG offer books in acceptable format for reading? Yes, if you want to read HTML in an HTML browser on a desktop computer with a 20" monitor. Almost all PG texts are of "high quality" when read under that criteria. However, not many of the general public are interested in curling up in bed with a nice 20" monitor. Does then PG offer adequate quality for small personal reading devices? EPUB, Android, Mobi devices? Nope, most of the files display as "scrambled eggs" -- because a half dozen lines of their CSS doesn't make any sense, having been hard-wired to the assumption that this file will only be read on a 20" screen.

On 06.02.2012 at 05:54, Jim Adcock wrote:
Keith>PG is a repository and it is not PGs responsibility to offer the etexts therein in any particular format.
Sorry, but what I always heard Michael talk about was books for people to share and read. Not books locked up behind closed doors in a repository. TRUE! BUT, he also said plain vanilla text, for the repository! You have not refuted the fact that PG is not responsible for any particular format. PG does offer formats for reading, just like any good repository should.
If you don't want people to read the books, then the archive.org photocopies of pages is a more than adequate solution. If you do want people to read the books, then you quickly realize that what archive.org is doing (or Google books) really doesn't cut it. Which is why some of volunteer to do the hard work to make something which can actually -- in practice -- be read on real machines by real people.
*Does* PG offer books in acceptable format for reading? Yes, if you want to read HTML in an HTML browser on a desktop computer with a 20" monitor. Almost all PG texts are of "high quality" when read under that criteria. But the problem is that what you want is for PG to bow to your machine, or a particular one, or all reading machines on earth, or anything in between.
That is not the purpose of PG. PG is there to preserve books and offer them to those willing to read.
However, not many of the general public is interesting in curling up in bed with a nice 20" monitor.
Does then PG offer adequate quality for small personal reading devices? EPUB, Android, Mobi devices? Nope, most of the files display as "scrambled eggs" -- because a half dozen lines of their CSS doesn't make any sense, having been hard-wired to the assumption that this file will only be read on a 20" screen.
Is that the fault of PG or someone else? You could just refactor the CSS, and VOILA. At least that is what the standards say, and the formats for those devices should conform to them. Talk about the REAL WORLD. regards Keith.

Keith> You have not refuted the fact that PG is not responsible for any particular format. PG does offer formats for reading, just like any good repository should.
A good repository offers multiple formats well-formatted and representative of the original book for reading on the devices that real-world customers want to read on. A goal which PG and most other "repositories" fail at once they start seeing themselves as "repositories" and not as active sources of real books for real people to actually read.
Keith> That is not the purpose of PG. PG is there to preserve books and offer them to those willing to read.
"Willingness to read" depends on the flavor of the dog-food. When a non-profit loses a charismatic leader there is often an upheaval where the organization takes a look at itself and asks "what is our mission?" Now seems to be PG's turn.
Keith> Is that the fault of PG or someone else? You could just refactor the CSS, and VOILA.
Yes we could just refactor the CSS, and HTML5 and CSS3 provide the tools to do most of what we need to do cleanly and simply. Yet it is not happening at PG. Why not? Because the people who are in the position to in practice allow this to happen are instead blocking it from happening so that they can pursue their own agenda.

Hi Jim, On 06.02.2012 at 19:43, Jim Adcock wrote:
Keith> You have not refuted the fact that PG is not responsible for any particular format. PG does offer formats for reading, just like any good repository should.
A good repository offers multiple formats well-formatted and representative of the original book for reading on the devices that real-world customers want to read on. A goal which PG and most other "repositories" fail at once they start seeing themselves as "repositories" and not as active sources of real books for real people to actually read. From what I read so far from you there are no good repositories that fit your criteria!
Keith> That is not the purpose of PG. PG is there to preserve books and offer them to those willing to read.
"Willingness to read" depends on the flavor of the dog-food. When a non-profit loses a charismatic leader there is often an upheaval where the organization takes a look at itself and asks "what is our mission?" Now seems to be PG's turn.
Mr Hart fulfilled his dream. What he wanted was plain vanilla etexts. According to your definitions here, that is/was dog-food. As for reading the etexts/books from PG, I can say my first was "The Twin Cities" and the DEVICE was a Newton. I converted and transferred it myself. Since then I have taken these texts and loaded them into different text processors and created my own PDFs. As of late I have taken up interest in the Ereader formats. They are not that bad compared to the commercial books you can buy. That is, not books out of copyright, but freshly published books. I would say in many cases the PG style is better. Yes, PG is in transition. As in all transitions, things will become worse at first and there will be bumps in the road. Yet, in the end things get better.
Keith> Is that the fault of PG or someone else. You could just refactor the CSS. and VOILA.
Yes we could just refactor the CSS, and HTML5 and CSS3 provides the tools to do most of what we need to do cleanly and simply. Yet it is not happening at PG. Why not? Because the people who are in the position to in practice allow this to happen are instead blocking it from happening so that they can pursue their own agenda.
No! Because it is not the HTML and CSS that is the problem. It is the devices and their formats. The devices do a poor job of rendering because they do not even try to implement half of the HTML or CSS standards. PG simply cannot hand-tweak files for all of them! Who is to do the work? What PG can do, though, is create a master format that will allow the greatest flexibility and output EPUBs, MOBIs, and KF8s that are readable, at an absolutely unbeatable price. regards Keith.

Hi Carlo, You seem to think that it can be done. RST is extendable, so components could be added. Please do not get me wrong: I would hold LaTeX to be a far better master format, with two reservations. First, we would use the "language" for markup but not its full scope, as it would not otherwise map well; second, the use of custom output engines. regards Keith. On 04.02.2012 at 19:56, Carlo Traverso wrote:
"Keith" == Keith J Schultz <schultzk@uni-trier.de> writes:
Keith> Hi Karen, It would be better to have one master format. it Keith> actually does not matter which as basically all can be Keith> adapted to support the features need. of course then in Keith> certain instances you will not be conforming to the Keith> standard.
I rather think that we cannot use one master format, we need a few. At least, you cannot reasonably have a master format for mathematics different from LaTeX, and we don't want to exclude contributors and contributions that cannot or don't want to prepare a master file in a format that you arbitrarily like.
Carlo

As you point out, "Now, if you're trying to represent /pages/ as XML container objects then you do have a problem, because pages and paragraphs are probably not contiguous. But the document structure (paragraphs) and the manifestation structure (paged book) are totally different paradigms, and can't be represented as the same structural object."
Nonetheless, pages are in fact legitimate characteristics of a book. And they can't be accommodated in an XML structure. So instead we use the "pagenum" positional element as a substitute. If you want examples, every html page that is not xhtml-compliant is a legitimate document that hasn't fit the model. They can be forced into the model, but only by changing the document.

On Fri, February 3, 2012 3:16 pm, don kretz wrote:
If you want examples, every html page that is not xhtml-compliant is a legitimate document that hasn't fit the model. They can be forced into the model, but only by changing the document.
I don't know what you mean by this. I can take any HTML document, and by use of a "tag-soup" parser I can build a complete in-memory DOM. There is no part of the HTML model that I can't represent as XHTML. I can take the in-memory DOM and serialize it out as tag-soup HTML (in fact, that is what Tidy does when you don't select XHTML output). To my knowledge there is a complete one-to-one correspondence in data models between SGML/HTML and XHTML. The distinctions are syntactic only.
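Lee's tag-soup round trip is easy to sketch concretely. The example below uses lxml purely as an illustration of the idea; it is not the code Tidy or epubmaker actually uses.

    from lxml import etree, html

    def tag_soup_to_xhtml(raw_html):
        """Parse arbitrary tag-soup HTML into an in-memory tree, then
        serialize it back out as well-formed XML.  Unclosed tags, stray
        end tags, and uppercase element names are repaired by the parser;
        the data model survives, only the syntax changes."""
        doc = html.fromstring(raw_html)   # forgiving, tag-soup tolerant parser
        return etree.tostring(doc, method="xml", encoding="unicode")

    # Example: typical pre-XHTML markup with unclosed <p> and <i> tags.
    soup = "<html><body><p>One paragraph<p>Another, with an unclosed <i>italic</body></html>"
    xhtml = tag_soup_to_xhtml(soup)   # every element comes back properly closed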

How would you propose marking up a footnote that extends across two pages? On Fri, Feb 3, 2012 at 2:26 PM, Lee Passey <lee@novomail.net> wrote:

But serving these from PG makes a big file even bigger
A big file? What definition of big are we talking about? What's 500k between friends any more? Both Epub and Mobi support page numbers; why don't we translate page numbers from HTML to those formats? It might take a standardized format of page numbers in HTML, but at least you're supporting the people who want page numbers. RST doesn't support page numbers, it doesn't support sidenotes, it doesn't support math. And given the nature of the spec, whatever new book-thing comes up, it probably won't support. I don't particularly feel well-served by how TEI-Lite has been dealt with at PG, but I'd much rather have an incredibly rich format like TEI-Lite that has to be distilled down to fit HTML, than a format that limits me to the features that are convenient in Epub today. -- Kie ekzistas vivo, ekzistas espero.
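For what it's worth, EPUB 3 (newly published at the time of this thread) does define a page-break vocabulary and an optional page-list. A hedged sketch of that markup, with illustrative file and id names, checked for well-formedness with Python's lxml:

    # EPUB 3 page-number markup; "chapter03.xhtml" and "page54" are made up.
    from lxml import etree

    # A page boundary recorded inline in the content document:
    content_fragment = (
        '<p xmlns:epub="http://www.idpf.org/2007/ops">text ending page 53 '
        '<span epub:type="pagebreak" id="page54" title="54"/>'
        'text starting page 54</p>'
    )

    # The matching entry in the navigation document's optional page-list:
    page_list = (
        '<nav xmlns:epub="http://www.idpf.org/2007/ops" epub:type="page-list">'
        '<ol><li><a href="chapter03.xhtml#page54">54</a></li></ol></nav>'
    )

    for fragment in (content_fragment, page_list):
        etree.fromstring(fragment)  # raises XMLSyntaxError if malformed
    print("both fragments are well-formed")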

David>A big file? What definition of big are we talking about? What's 500k between friends any more? 500K between friends is a royal pain in the behind to a user of a small epub or kindle device which is being served books using 3G phone service but is not getting even that in practice because they live out in the boonies. Trust me, I've been there -- now all the epub and kindle devices I actually use are running wifi. And maybe this is why more downloads from PG are without images than with images? One guy I know who loves his kindle is 80+ years old and lives on a boat, and downloads books via 3G -- whenever he's in port. I tell him all the things I do and he looks at me like I'm stupid (and I guess I am 'cuz BB keeps telling me so) and says "I just like reading books."
Both Epub and Mobi support page numbers;
Sorry, where exactly in the epub and mobi standards does it say they support page numbers? Amazon at least I would think is the definition of mobi nowadays and they say in big bold letters "Don't Do Page Numbers!"

On Fri, February 3, 2012 3:03 pm, Joshua Hutchinson wrote:
My preference is for RST to win simply because of the lower entry bar, but ... *shrug*
In my mind, RST has a /higher/ entry bar than any other markup. Part of this is because it has a fair share of uniqueness in its markup that users are required to learn to use it effectively. Another bar to adoption is the lack of skill transferability. If I learn RST to contribute to PG, it will also help me if I become a Python programmer, but not much else; therefore, I am somewhat disincentivized to learn RST. Lastly, RST suffers from the ambiguity inherent in all "light" markup languages. While the markup is technically "unambiguous," it is still very difficult for a human being to remember and recognize the rules of RST. Does this line start with a space or a tab? Does ===== indicate a first level header or a second level header? Where was the first declaration so I can figure it out? The subtlety of the language /may/ be easier for the end user (reader) if the document has not been pre-processed (personally I find reading an RST document anything but "restful"), but the subtlety can make it difficult for an original creator and horrible for a maintainer who wasn't the original creator.
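To make the heading-level point concrete, here is a small sketch assuming Python with the docutils package: the two sources below use the same underline characters with opposite meanings, so only the order of first use decides what ===== marks.

    from docutils.core import publish_string

    doc_a = "Chapter\n=======\n\nSection\n-------\n"   # "=" is level 1 here
    doc_b = "Chapter\n-------\n\nSection\n=======\n"   # "=" is level 2 here

    for source in (doc_a, doc_b):
        print(publish_string(
            source=source,
            writer_name="pseudoxml",                   # debug view of the tree
            settings_overrides={"doctitle_xform": False},
        ).decode())
    # Both documents produce the same nested section structure, even though
    # the underline characters are assigned to opposite levels.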

Joshua>My preference is for RST to win simply because of the lower entry bar, but ... *shrug* One might think you-all would ask DPers what they think about all this. I think what they would tell you is that: “Most people write in HTML, and a few write in RST, and DPers really don’t like either epub nor kindles because that’s just poop in the party, and what kind of stupid smack are you guys talking about when it comes to inventing even *more* bizarre languages? Don’t you realize how hard we work to try to keep the PP’ers we already have!?” That said a number of DPers “get” EPUB as a “real book” source language and are pushing to make that happen. Just like the big boys. DP’ers who learn to EPUB might actually be able to get a job and do something real with their skill set. Even knowing Sigil is enough to get you somewhere nowadays.

On Fri, February 3, 2012 1:44 pm, don kretz wrote:
If you have floats, you can use inset page numbers with spans and an appropriate stylesheet.
If you have floats ...
But fundamentally what you're running into, and I don't know how you avoid it if you insist on XML/XHTML, is that books simply aren't well-formed in the way that XML defines and requires. You can't embed everything 100% in all its containers.
You've made this assertion before. I don't agree with it, and I've yet to see any examples or evidence that it's true. It's obvious that the eggheads who came up with TEI seem to think you can, and from what I've observed even though they're eggheads they're not techies; they're more like linguists and English professors. I believe in climate change. Not because this has been a particularly warm winter (it has been) but because virtually every climate scientist on the planet says it's happening. I believe in TEI as a text encoding standard. Not because I have fully tested or exercised it, but because some really smart and really educated people put it together. With all due respect, I don't think anyone here can come close to designing a system as good as what they developed; we don't have the expertise, and we haven't had the time. Now about 5 years ago (my reports are in the list archives in the 2006-2007 range) I did some testing about TEI and XHTML. The results of that testing demonstrated that I could do "round-trip" conversions between TEI and XHTML; that is, I took a TEI file and programmatically converted it to HTML (HTML that displayed well on a browser without CSS) and back to TEI. Thus, I can conclude that there is nothing in TEI that cannot be encoded in XHTML. While TEI is "best-of-breed" it is not ubiquitous. If a volunteer learns TEI to create texts for PG, that skill is not necessarily transferable; but if I know XHTML I can publish web pages, or blog, or do any other job for which XHTML is the standard. Thus, on the whole I think that appropriately constrained XHTML is the best practical choice, even if not the best technical choice.
Page numbers are a reflection of this problem, because conceptually they are boundaries between page elements. But page elements simply aren't well-formed because their tops and bottoms can cut right through paragraphs (and everything within which paragraphs are embedded.)
I fail to see how this example proves your point. For example, assume a paragraph which is split over a page. You've started your <p> container and start throwing your phrasing content into the container. Then, in the middle of your text you encounter some metadata; this metadata just happens to be an indication that the current physical manifestation is changing, that the nature of the metadata is that it's a page number, and that the actual metadata is "217." Drop a hidden metadata object into the phrasing content at the point where it exists (in this case, an <a> tag) and go on. Now, if you're trying to represent /pages/ as XML container objects then you do have a problem, because pages and paragraphs are probably not contiguous. But the document structure (paragraphs) and the manifestation structure (paged book) are totally different paradigms, and can't be represented as the same structural object.
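A minimal sketch of that approach, assuming Python with lxml; the class and id names ("pagenum", "page217") are conventions used in this thread, not part of any standard:

    # Dropping an empty <a> "pagenum" anchor into the middle of a paragraph.
    from lxml import etree, html

    paragraph = html.fragment_fromstring(
        "<p>text that ends page 216 and text that begins page 217</p>"
    )

    # The empty anchor that records where page 217 starts:
    anchor = etree.SubElement(paragraph, "a", id="page217")
    anchor.set("class", "pagenum")

    # Splice it into the phrasing content at the page boundary.
    paragraph.text = "text that ends page 216 "
    anchor.tail = "and text that begins page 217"

    print(etree.tostring(paragraph, method="xml").decode())
    # <p>text that ends page 216 <a id="page217" class="pagenum"/>and text ...</p>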
I think what happened to some devices is that they had to decide between supporting HTML and XHTML, and since writers can't be constrained to create well-formed XML documents (nor should they be), the devices had to choose HTML.
I believe that virtually all devices (and perhaps even absolutely all devices) require XHTML. This is certainly a requirement of ePub. If you try to load SGML/HTML onto Adobe's Digital Editions it will blow chunks, and refuse to display anything. Because the Kindle is based on the old HTML 3.2 spec it may not /require/ XHTML but it certainly /accepts/ XHTML. The KindleGen program /may/ require XHTML; I don't know, Mr. Adcock is in a much better position than I to evaluate that question. In any event, requiring XHTML as a master format will certainly have no adverse effects.

Lee>The KindleGen program /may/ require XHTML; I don't know, Mr. Adcock is in a much better position than I to evaluate that question. In any event, requiring XHTML as a master format will certainly have no adverse effects. Well, I can't talk to kindlegen except to say what I have seen it do with things I feed it. First of all, I think you guys are all still thinking "Kindlegen Version 1" whereas the rest of the "dev" world has already moved on to "Kindlegen Version 2". Kindlegen Version 2 outputs a "mobi" file which actually contains two, two, two formats in one. The first format is the old "mobi7" version which everybody knows and loves. The second format "kf8" is basically just epub but Amazon doesn't want to call it epub because they have stuck it inside their own mobi wrapper. But, as mobi_unpack.py demonstrates, one can pop that epub right back out of there. [This is all assuming we are talking about public domain unencrypted mobi content of course.] So, old crufty Kindles are still effectively "mobi7" devices with all that that means. New Kindles are effectively epub2 machines, with all that means. And Kindles of recent vintage are still running as mobi7 devices but are supposed to be automatically updated to become KF8 "epub2" machines sometime soon. But nobody seems to know which Kindles qualify for updates and which don't. But, in terms of Kindlegen Version 2, which is not what I think PG is running yet, I find I can send it HTML not XHTML and it screams like a banshee over and over again "I cannot take it Captain, she's gunna blow!" but, guess what, a working kf8 file always seems to squirt out the back end. Don't know what happens if you send it really buggy HTML. Kindlegen is supposed to like XHTML 5. And EPUB3 is supposed to like XHTML 5. So, XHTML5 is certainly what I would recommend to anyone who cares about the future. Which, by definition, no one here does.

Hi Don, I am very sorry, but you are completely wrong. XML is perfectly capable of handling the structures of a book. The problem is how you handle and define the structure, which is not predefined by the XML standard. As far as the output is concerned, that is a matter of parsing the tree and reacting to the semantics of the entities defined therein. Please do not forget that a paragraph is an entity of a text, whereas a page is an entity of a book! On another note: HTML actually has no concept of a page, except maybe the rendering of the file itself, meaning that I would have to have a file per page. Then you have the same problem. Naturally, that is not how it is done. You can simulate the same semantics for page breaks in XML as in HTML, because the page break is just simulated. regards Keith. Am 03.02.2012 um 21:44 schrieb don kretz:

Hi Keith, Of course XML can handle a book if you constrain your definition of your book to only include things that can be described by XML. What if I choose to include footnotes that extend across multiple pages, including both where they are referenced in the text flow and on what physical pages they are found? You can't just declare the possibility to be invalid because it can't be described in a strict hierarchy. On Fri, Feb 3, 2012 at 4:22 PM, Keith J. Schultz <schultzk@uni-trier.de> wrote:

By the way, David, my markup already includes page numbers, sidenotes and math via LaTeX. Of course, since it's only conceptual markup it's up to the exporter and the display device to instantiate the realization. But that is exactly the way eb.tbicl.org stores, transforms and displays them now. Look up the Fourier article I referenced earlier (should be eb.tbicl.org/fouriers-series, I think) for an extensive LaTeX example. All the articles that had page numbers marked somehow in the PG texts have been transformed upon loading to this markup and are functional. Same for sidenotes. It's just a stylesheet change, as you would expect, to have offset or inset page numbers. There's a lot of unsimplified HTML still left from the PG HTML files. Tables are pretty much untouched. Even most of the illustrations are not easily converted yet. But what's used for the article sources is substantially simpler yet more malleable than what PG provides, given software that understands the markup - none of which exists other than what I've got in the app. On Fri, Feb 3, 2012 at 6:08 PM, don kretz <dakretz@gmail.com> wrote:

Hi Don, You err very badly, and do not understand XML. XML is not a language of containers. It can be done. I do not have the time to teach what XML really is and how it can be used. On the other hand, HTML is no less capable of what you are describing! So, as you define structures, neither HTML nor XHTML nor XML would be viable candidates for a master format. RST, I believe, is out too. So basically, you want TEI or languages along the lines of TeX, where the pages are not marked up, but the text itself is, and an engine turns out the pages. But you forget you can do that with XML. regards Keith. Am 04.02.2012 um 03:08 schrieb don kretz:
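A hedged sketch of the milestone approach being pointed at here, using TEI-flavoured element names (pb, ref, note) purely for illustration, and Python's lxml to show the fragment is well-formed; the same pattern works in XHTML with empty anchors:

    # A footnote that starts on page 217 and finishes on page 218, marked up
    # with empty <pb/> milestones instead of page containers.
    from lxml import etree

    sample = """
    <body>
      <pb n="217"/>
      <p>Running text with a footnote reference<ref target="#fn12">[12]</ref>
         that continues <pb n="218"/> onto the next page.</p>
      <note xml:id="fn12" place="foot">The note begins on page 217
         <pb n="218"/> and finishes on page 218.</note>
    </body>
    """

    tree = etree.fromstring(sample)
    # Milestones can sit inside a paragraph or a note without breaking the
    # hierarchy, so no strict page container is ever required.
    print([pb.get("n") for pb in tree.iter("pb")])   # ['217', '218', '218']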

Don>Page numbers are a reflection of this problem, because conceptually they are boundaries between page elements. But page elements simply aren't well-formed because their tops and bottoms can cut right through paragraphs (and everything within which paragraphs are embedded.) The resulting location of the float is ill-defined by HTML anyway, in which case one might as well move the page number OUT of the middle of the paragraph.

To where? The point isn't to visualize exactly where the page breaks in the text, it is to give the user some idea which page he's on. If I wanted to display exactly the page break (which isn't very useful to the user in most cases) I'd use a vertical colored bar or something. In the case of EB I use it to give the user something to click on to view the TIA page image. On Fri, Feb 3, 2012 at 5:44 PM, James Adcock <jimad@msn.com> wrote:

To where?
Well, I would move the page tags to before the start of the next paragraph, so that a paragraph is understood to belong to the page that it started on. If you move the page tags out of the middle of the paragraph then you have more attractive, less intrusive options for placement, which "work" on more devices.
The point isn't to visualize exactly where the page breaks in the text, it is to give the user some idea which page he's on.
Why would the user care? The argument I've heard is that college students need to be able to ref the page when they write a "book review" for their profs. Not that the PG/DP page numbers are currently accurate enough to allow that in any case.
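A sketch of the relocation described above, assuming Python with lxml; the markup and the "pagenum" class are illustrative only:

    # Move each "pagenum" anchor out of mid-paragraph and re-insert it just
    # before the following paragraph.
    from lxml import etree, html

    doc = html.fromstring(
        '<div><p>First paragraph ends here '
        '<a class="pagenum" id="page54"></a>and spills onto page 54.</p>'
        '<p>Second paragraph.</p></div>'
    )

    for anchor in doc.findall('.//a[@class="pagenum"]'):
        paragraph = anchor.getparent()
        # Re-attach the text that followed the anchor to the paragraph ...
        if anchor.tail:
            previous = anchor.getprevious()
            if previous is not None:
                previous.tail = (previous.tail or "") + anchor.tail
            else:
                paragraph.text = (paragraph.text or "") + anchor.tail
            anchor.tail = None
        # ... then move the anchor so it sits between the two paragraphs.
        paragraph.addnext(anchor)

    print(etree.tostring(doc, method="xml").decode())
    # The anchor now precedes the second paragraph instead of splitting the
    # first one, which is the placement suggested above.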

No, it's so if I tell you that the passage I'm reading is on page 54, you can go to page 54 and find it. Or if you switch from your phone on the train to the paper copy at home, you can tell where to pick up. Or if you go to the library to look at the real illustrations, or maps, or drawings, you'll be able to find them. On Fri, Feb 3, 2012 at 8:38 PM, James Adcock <jimad@msn.com> wrote:

we're also not so good at replacing indexes with equivalent lists of links to anchors at the identical places in the text (and search is no substitute, even if you did happen to download the entire book instead of the normal current chapter and the adjacent ones.) Or cross-references - we don't do those either.

No, it's so if I tell you that the passage I'm reading is on page 54, you can go to page 54 and find it.
I've already got that technology and better built into my reader device.
Or if you switch from your phone on the train to the paper copy at home, you can tell where to pick up.
I've already got that technology and better built into my reader device except if you really mean "paper" in which case the answer is "I don't kill trees any more than I wear fur coats."
Or if you go to the library to look at the real illustrations, or maps, or drawings, you'll be able to find them.
My library burnt those books, and was happy to do so, just as soon as Google offered to digitize them.

My mistake. I thought you were asking why other people needed page numbers. On Fri, Feb 3, 2012 at 9:09 PM, James Adcock <jimad@msn.com> wrote:

Don>My mistake. I thought you were asking why other people needed page numbers. Well, it's just a question of whether PG/DP are addressing the right needs for today, or responding to perceived needs of the past. I'm not against having the page numbers documented in the source file somewhere, as long as it doesn't get in the way of the primary need: people have got to have something they can actually read "like a book" without being constantly interrupted by the techno-desires of the techno-nerds to stuff their ugly complexifications in everywhere -- whether the readers want it or not. The hardware designers seem to be beginning to understand this better, and are doing the "auto-hide" stuff, with unblemished "full screen reading" until the reader actively indicates they actually want to be exposed to the techno-nerd complexifications again. For the same reason, if I want to watch a movie on a computer (and I do not) I like having the option of "Full Screen Mode" -- so that I can actually *watch* the movie, rather than being constantly exposed to all the widgets that some techno-nerd has decided someone might have fun playing with -- instead of watching the movie.

Hi Don, You are not really serious here. You have a paper copy of all the books you read on your phone??? C'mon. This is ridiculous. Just like the remarks that academics need to cite page numbers, so page numbers have to be in the etext. Please do not get me wrong. I believe original page numbers have their value in the master format and are useful for some formats and display devices/agents, yet not all and not for all reasons. We simply cannot serve everybody's pet whims. Furthermore, as others have stated, the display of page numbers on e-reader agents is problematic and varies from device to device. PG tries to serve as many devices as possible, yet it has to make compromises in order not to have a format for every device in the wild. Agreed, the texts will not look perfect on every device, yet they look good enough on most devices. I refer you to an upcoming thread by me, "A New Approach". regards Keith. Am 04.02.2012 um 05:43 schrieb don kretz:
No, it's so if I tell you that the passage I'm reading is on page 54, you can go to page 54 and find it. Or if you switch from your phone on the train to the paper copy at home, you can tell where to pick up. Or if you go to the library to look at the real illustrations, or maps, or drawings, you'll be able to find them.

Somehow this has assumed the form of a discussion about whether or not to use a VCS. It isn't (or never was to me, anyway.) At least I'm not "afraid" of either RST or VCS. I am second only to bowerbird in using the term ReStructured Text in this forum. I'm pretty intimate with several VCS systems. My suggestion has been from the start that we take the specific technology out of the discussion until we have clearly identified the problem we intend to solve, and then we can discuss alternative ways to solve it. One of which may be a VCS. The title of the thread is "Goals and Scope", which seems to have turned into "we're installing a vcs because we need a vcs". On Thu, Feb 2, 2012 at 7:23 AM, Jim Adcock <jimad@msn.com> wrote:
Everybody who wants to participate and is not afraid of RST and a VCS can send me their RSA public keys so I can give them SSH access to the repos.
Everybody who thinks that RST and/or a VCS are bad ideas, is encouraged to implement their own (possibly better) ideas as an individual or a group. Just stop the fruitless complaining and get your act together.
Seems like a bit of a silly comment, when you are literally holding the only keys to the door, and are allowing only your own ideas in.
Open up the VCS to other languages and other approaches, so that the competing ideas can be compared next to your own.

"Jim" == Jim Adcock <jimad@msn.com> writes:
>> Everybody who wants to participate and is not afraid of RST and a VCS can send me their RSA public keys so I can give them SSH access to the repos.
>> Everybody who thinks that RST and/or a VCS are bad ideas, is encouraged to implement their own (possibly better) ideas as an individual or a group. Just stop the fruitless complaining and get your act together.
Jim> Seems like a bit of a silly comment, when you are literally holding the only keys to the door, and are allowing only your own ideas in.
Jim> Open up the VCS to other languages and other approaches, so that the competing ideas can be compared next to your own.
"Lee" == Lee Passey <lee@novomail.net> writes:
Lee> On Thu, February 2, 2012 8:23 am, Jim Adcock wrote:
>> Seems like a bit of a silly comment, when you are literally holding the only keys to the door, and are allowing only your own ideas in.
Lee> This seems a bit harsh. I'm sure that if you asked nicely Mr. Perathoner would give you a shell account onto pglaf.org with enough rights to install any software you would like, including web-facing software.
Remark however that the gatekeeper of pglaf is Greg, not Marcello, and his initial post says
"Greg" == Greg Newby <gbnewby@pglaf.org> writes:
Greg> I can think of several major details, and many minor ones. Concerns about copyright, spamming, whether anonymous edits are permitted, a review/revision/recision cycle, character sets, forking, searching, etc., etc. I would love to see multiple "master" files, created lovingly by hand in any or all of RST, LaTeX, or yfm (Your Favorite Markup) -- then allow users to select which master to use to generate their, say, EPUB. And, which tools to use for the conversion. Of course, people who wanted to lovingly craft an EPUB would be able to upload that, too.
So to me it seems that the gatekeeper promises to open the gates, including the possibility of installing other toolchains besides Gnutenberg Press (PGTEI) and epubmaker (PGRST). In particular, I plan to work on a LaTeX master towards the production of good HTML and epub. Carlo

Hi Carlo, what do you think about LuaTeX? I had personally suggested using LaTeX for PG already in the early days. regards Keith. Am 02.02.2012 um 18:41 schrieb Carlo Traverso:

"Keith" == Keith J Schultz <schultzk@uni-trier.de> writes:
Keith> Hi Carlo, what do you think about LuaTeX?
I have never tried it. I am looking at hevea (for LaTeX->HTML conversion).
Keith> I had personally suggested using LaTeX for PG already in the early days.
Me too. I have some experience with a slim macro package that, with a very light markup (e.g. _italic markup_), allows direct PDF creation via LaTeX. Carlo

Are you familiar with jqmath? <http://mathscribe.com/author/jqmath.html> hevea appears to be similar. It's also a hassle because everything else in the PG file I can accept as-is. On Thu, Feb 2, 2012 at 2:53 PM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
"Keith" == Keith J Schultz <schultzk@uni-trier.de> writes:
Keith> Hi Carlo,
Keith> what do you think about LuaTeX?
I have never tried it. I am looking at hevea (for LaTeX->HTML conversion).
Keith> I had personally suggested using LaTeX for PG already in Keith> the early days.
Me too. I have some experience on a slim macro package that with a very light markup (e.g. _italic markup_) allows direct pdf creation via LaTeX.
Carlo
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

"don" == don kretz <dakretz@gmail.com> writes:
don> Are you familiar with jqmath? <http://mathscribe.com/author/jqmath.html> hevea appears to be similar.
No, but my focus is to handle texts without math. LaTeX is used as a master suitable for producing high-quality PDF, and possibly good HTML, while PDF derived from RST or HTML is often of inferior typographical quality unless manually tweaked. Carlo
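A minimal sketch of that dual-output pipeline, assuming Python, a placeholder book.tex, and pdflatex and hevea installed on the PATH:

    # Same LaTeX master, two derived outputs.
    import subprocess

    master = "book.tex"   # hypothetical LaTeX master file

    subprocess.run(["pdflatex", master], check=True)   # high-quality book.pdf
    subprocess.run(["hevea", master], check=True)      # book.html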

On 02/02/2012 04:23 PM, Jim Adcock wrote:
Seems like a bit of a silly comment, when you are literally holding the only keys to the door, and are allowing only your own ideas in.
Greg will be happy to give you an account on pglaf.org. When you have installed your tools there and convinced the WWers to use them, I can copy them over to gutenberg.org. -- Marcello Perathoner webmaster@gutenberg.org

Marcello>Greg will be happy to give you an account on pglaf.org. When you have installed your tools there and convinced the WWers to use them, I can copy them over to gutenberg.org. Again, that's not what I think I heard Greg saying. "The tools" are the current PG sausagemaker chain plus the VCS to store the "tweak" changes to the HTML. The whole point of Greg's crowdsourcing idea was to keep the WWers out of the day-to-day -- until such time as the WWers believe they see something that is worth their time and effort to back-propagate. You're still saying "my way -- my choice of language -- or the highway." ???

Don>...learn a new programming language, with little or no documentation.... As a point of reference I just looked up the official W3C documentation on one particular HTML feature -- a page which happened to have a page counter on it: 1300 people had looked up the same piece of documentation -- for HTML, something that probably has literally billions of people using it? Not saying HTML is great stuff, just saying having a critical mass of users is critically important, especially when it comes to documentation.

Hi Don, I have not looked at RST that closely, but the master format is not the problem. The problem is the lack of guidelines on what should be inside the RST or what it should handle. RST is just the back-end. What is needed is an editor that is simple to use. The user does not need to know RST. Or do you know what is inside .doc and .docx files and understand it? regards Keith. Am 01.02.2012 um 23:56 schrieb don kretz:
I suggest you consider putting a stop to this RST experiment and step back and come up with some kind of plan that can possibly succeed.
From what I can tell, you're asking non-technical people to essentially learn a new programming language, with little or no documentation (certainly not current), no debugger, no IDE, no or crappy error messages, and all this while the language is still being designed and implemented.
And telling them "Just trust us".
I doubt the developers involved here would agree to work under the same conditions.

All those who are comma fucking and mosquito sifting right now - what does all that mean to the general public? please enlighten? On 31 January 2012 19:32, Lee Passey <lee@novomail.net> wrote:
(I have a hunch I'm going to be quoting this message a lot in the future...)
On Tue, January 24, 2012 3:08 pm, Joshua Hutchinson wrote:
I'd love to see the PG corpus redone as a "master format" system (and the current filesystem supports "old" format files in a subdirectory, so if someone wanted to get the old original hand-made files, they could). I'm not particularly wedded to any master format. Hell, if someone came up with a sufficiently constrained HTML vocabulary that could be easily used to "generate" the additional formats necessary, I'm good with that.
But before anyone will start doing this work, there needs to be a consensus from PG (I'm looking at you, Greg!) that the work will be acceptable. A half-assed "master format" system is no master format system at all.
On Tue, January 31, 2012 1:22 am, Greg Newby wrote:
The need I'm trying to address is reformatting or editing eBooks, not proofreading them.
Okay, we're on the same page so far...
What I'd like is (as someone else nicely put it) a continual improvement opportunity, provided to essentially anyone, for eBooks in the PG collection.
Still good...
This boils down to a handful of critical activities. It's mainly the third one (III) that involves crowdsourcing and new tools.
This is where we start to diverge...
I. making changes to the master file(s) [let's imagine that we retain the practice of every PG eBook having a small number of master files, in a small number of master formats]. The short list of master formats includes RST, HTML, TeX/TEI, and plain text (perhaps with light markup). Maybe this list will grow in the future; maybe it will shrink.
No, according to Mr. Hutchinson's proposal there can be only one...
The main feature here is that typos or fixes or additional master formats can be contributed.
The main feature here is that a single fix to the master file will automatically propagate to all derived formats; syncing between "masters" will not be required.
[little snip]
II. from those master files, various other file formats can be [and are, currently] derived automatically.
Mister Hutchinson's vision, which I am trying to follow, is that /all/ other file formats will be derived automatically from the /one/ master version. Caching is certainly advisable, but on-demand creation would be the first-step.
Many challenges are technical, such as increased sophistication in dealing with text and HTML as master formats.
The primary technical challenge is in developing a tool chain which can produce quality instances of all derived formats, and in adopting/developing a master format with the richness necessary to support that tool chain.
Others need to be addressed by policy or social means, such as the ongoing tendency to use HTML for layout that is difficult to automatically convert.
Policy means include deciding on a master format, developing rules for the use of that format, wide-spread publication of those rules and, to the extent possible, automated means to detect violations of those rules. Social means primarily include getting buy-in from participants to the established rules, and attracting volunteers who are willing to work with them.
III. from those master files, various other file formats that are created/contributed by individuals.
At this point we're not only not on the same page, we're not even in the same book. This suggestion is completely at odds with what Mr. Hutchinson proposed, and which I support.
[bigger snip]
If we accept that anyone could contribute such a new file (or set of files) for an existing PG eBook, then the main challenges I see are (a) how to help readers select among them, and (b) dealing with the fact that, over time, master formats will be fixed, but not these hand-crafted derivatives.
I'm not saying you shouldn't pursue this vision; I'm simply saying it's not mine, and I'm completely uninterested in pursuing it with you. My vision is to develop a system where existing PG works can be reworked into a single master format, from which all other formats can be automatically derived.
Proof-reading and upgrading the master files is certainly a desirable part of that vision, but it is secondary to the main goal. I'm beginning to think that Mr. Hutchinson's earlier question remains unresolved:
there needs to be a consensus from PG (I'm looking at you, Greg!) that the work will be acceptable. A half-assed "master format" system is no master format system at all.
So Mr. Newby, can we expect some support in building a repository of master format reworkings of existing PG works? Infrastructure support would be nice, but moral support is what is most needed.
[big snip]
I hope this helps clarify my original suggestion a little better. There has been some great discussion on this and related topics.
Ditto.
Cheers, Lee
-- Marc FreeLiterature.org <http://www.freeliterature.org>

On 01/31/2012 09:22 AM, Greg Newby wrote:
I. making changes to the master file(s) [let's imagine that we retain the practice of every PG eBook having a small number of master files, in a small number of master formats]. The short list of master formats includes RST, HTML, TeX/TEI, and plain text (perhaps with light markup). Maybe this list will grow in the future; maybe it will shrink.
Having more than one master format per book does not make sense. Decide which format is best for that book and stick to it. Every typo should have exactly one location that needs fixing.
II. from those master files, various other file formats can be [and are, currently] derived automatically. These include EPUB, Kindle variants, variations on HTML or text (especially if they were not previously provided), RTF, and a few others. Again, maybe this list will grow, maybe it will shrink. I do hope to offer conversion on-demand, which will let people select conversion options, and maybe even different conversion programs, for their purposes.
Conversion on demand will not be possible with the cycles available at ibiblio. If you want that, you'll have to organize some very beefy servers that do nothing but crunch books.
III. from those master files, various other file formats that are created/contributed by individuals. I get offered these (via help@) practically every day. Usually EPUB, but also RTF/DOC, PDF. Often with typos applied. These are what I called "lovingly prepared," though of course some are better than others.
These can be better than automatically-generated versions in various ways. They might have advantages over master files (for example, improved HTML). The main feature is that these would, in many cases, provide an improved reading experience (at least for some people, on some devices).
If we accept that anyone could contribute such a new file (or set of files) for an existing PG eBook, then the main challenges I see are (a) how to help readers select among them, and (b) dealing with the fact that, over time, master formats will be fixed, but not these hand-crafted derivatives.
The whole idea seems to me very ill-conceived. We just don't have the resources to handle that kind of workload. It will just divert our resources away from posting more master files to posting lots of nearly identical vanity editions.
Every user contribution will have to be checked for external site links or other SEO optimizations, malicious text edits, etc., or PG will turn very quickly into a link farm for spammers or an exchange point for `corrected´ editions of the Origin of Species. Checking some proprietary formats could be expensive. Some formats could even be impossible to check except via eyeball grep. Every `just one small typo fixed´ version will have to be checked completely anew.
Typos will not be first reported to errata any more, but a new edition will be sent in. Every fixed edition will have a slightly different set of typos fixed. People reporting typos will not state which edition they have. Vanity editions will invariably fall out of sync with the master format.
Discontent will ensue about the ranking of multiple vanity editions. Same about the extent of allowed customizations ("It's a book about cats and I have added just a dozen pics of my cat ..."). Users will be confused about which edition to download.
I would redirect vanity editions to MobileRead or any other web site that already posts them. We could even link to them if they gave us landing pages. -- Marcello Perathoner webmaster@gutenberg.org

It's not clear to me how a reader would even be able to distinguish which version they were reading (and wanted to improve.) I can't conceive of orchestrating simultaneous corrections of one error to multiple custom targets where the linkage (shared content) among them isn't explicitly identified. If a desired format isn't available for a given device or style (probably what is called in CMS-speak a theme), then the work needs to be directed to improving the generalization of the master format and the exporting software that converts that master format to the target format. One warning about HTML. Remember that its heritage comes from SGML, whose intent was to encode all syntax and display information; HTML's purpose was to remove the syntax part and provide just the display part. So it's not accidental that we end up needing to implant syntax. Also remember that XHTML requires "well-formedness", and that books are often not syntactically "well-formed" in that sense. (One reason XHTML is losing favor: it's designed for software developers, not creative writers, whose organizational instincts are more fluid.)

On Tue, Jan 31, 2012 at 08:32:15PM +0100, Marcello Perathoner wrote:
On 01/31/2012 09:22 AM, Greg Newby wrote:
I. making changes to the master file(s) [let's imagine that we retain the practice of every PG eBook having a small number of master files, in a small number of master formats]. The short list of master formats includes RST, HTML, TeX/TEI, and plain text (perhaps with light markup). Maybe this list will grow in the future; maybe it will shrink.
Having more than one master format per book does not make sense. Decide which format is best for that book and stick to it. Every typo should have exactly one location that needs fixing.
In principle I agree. In practice, we often have 2 (HTML + text). I don't think it is very burdensome to edit 2 files rather than 1.
II. from those master files, various other file formats can be [and are, currently] derived automatically. These include EPUB, Kindle variants, variations on HTML or text (especially if they were not previously provided), RTF, and a few others. Again, maybe this list will grow, maybe it will shrink. I do hope to offer conversion on-demand, which will let people select conversion options, and maybe even different conversion programs, for their purposes.
Conversion on demand will not be possible with the cycles available at ibiblio. If you want that, you'll have to organize some very beefy servers that do nothing but crunch books.
I guess we need to find someone who runs a supercomputing center or something... seriously, I do not see this as an impediment, though you are right to point out that even a modest bump in consumption at ibiblio might make us unsustainable with their current server farm. -- Greg

On 01/31/2012 10:36 PM, Greg Newby wrote:
Having more than one master format per book does not make sense. Decide which format is best for that book and stick to it. Every typo should have exactly one location that needs fixing.
In principle I agree. In practice, we often have 2 (HTML + text). I don't think it is very burdensome to edit 2 files rather than 1.
The way forward should be to make it simpler on the WWers than before. If we had multiple masters per book, a maintainer would need to learn all master formats, and in the end we'd probably lose sync between them. -- Marcello Perathoner webmaster@gutenberg.org

Anyone who has dealt with this stuff knows it's a geometric relationship. Two formats is 4X as much work as one. Three is 9X. Etc. I don't understand the overhead of on-demand. As long as you have proper dependencies and caching set up, you only run the generation exercise once, and then everyone gets a cached copy until one of the dependencies changes. It's less overhead than building every format whenever a master changes, where you build formats that may never get requested. If you have a common dependency, like a common CSS stylesheet, then do you currently trigger the rebuild of every project that uses it, all at once? On Tue, Jan 31, 2012 at 2:26 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
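A hedged sketch of the dependency-and-cache idea, in Python; the converter command and file names are placeholders, not PG's actual toolchain:

    # Rebuild a derived file only when it is older than any of its
    # dependencies; otherwise serve the cached copy.
    import os
    import subprocess

    def ensure_derived(master, derived, converter, shared_deps=()):
        deps = [master, converter, *shared_deps]
        if os.path.exists(derived):
            derived_mtime = os.path.getmtime(derived)
            if all(os.path.getmtime(dep) <= derived_mtime for dep in deps):
                return derived                 # cache hit: nothing to rebuild
        # Cache miss: run the converter once; later requests reuse the result.
        subprocess.run([converter, master, derived], check=True)
        return derived

    # Only the first request after a master (or shared stylesheet) change pays
    # the conversion cost; every later request is a plain file read.
    # ensure_derived("12345.rst", "12345.epub", "./convert-to-epub", ["pg.css"])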

Below...
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Marcello Perathoner Sent: Tuesday, January 31, 2012 2:27 PM To: gbnewby@pglaf.org; Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Goals and scope (Re: Version control systems)
On 01/31/2012 10:36 PM, Greg Newby wrote:
Having more than one master format per book does not make sense. Decide which format is best for that book and stick to it. Every typo should have exactly one location that needs fixing.
In principle I agree. In practice, we often have 2 (HTML + text). I don't think it is very burdensome to edit 2 files rather than 1.
Based on my experience with the Errata system, this depends on the number of errors being reported. Correcting a handful of errors in 2 or 3 files (2 text, 1 HTML) is one thing, but a report of hundreds of errors is something else. As an extreme example, I've got an errata report on my hands that's 3400 lines long, that I haven't had the courage to plow through yet. The reporter lists something wrong on almost every one of the book's nearly 400 pages. On top of that, he'd like an HTML version created with the footnotes cross-linked and from what I can tell, the page numbers inserted because there are internal references to them. The reported-on text is one volume of a series, so fixing/reposting it will take it out of sync with the rest. Probably simpler to figure out the source edition and run the whole series through DP, to replace the current files. (Question: where's an Errata Team when you want one? Answer: as happened a year or so ago, they find out what they're up against, and vanish. <g>) Al

Al, I may be able to help you with this. If the text includes the page numbers and footnotes using one of the standard DP markup schemes, I can convert that automatically into better page numbers and footnotes, and we may be able to handle cross-references the same way I do with Encyclopedia Britannica. The text errors still need someone with an editor, of course.
As an extreme example, I've got an errata report on my hands that's 3400 lines long, that I haven't had the courage to plow through yet. The reporter lists something wrong on almost every one of the book's nearly 400 pages. On top of that, he'd like an HTML version created with the footnotes cross-linked and from what I can tell, the page numbers inserted because there are internal references to them. The reported-on text is one volume of a series, so fixing/reposting it will take it out of sync with the rest. Probably simpler to figure out the source edition and run the whole series through DP, to replace the current files.

The etext in question is #3441. The underlying files are located in etext02, and named 71001107.txt (7-bit) and 71001108.txt (8-bit), plus their zip files. There's no HTML version, hence no illustrations. Footnotes are indicated by [FN#1], [FN#2], etc. The footnotes themselves (all 460 of them) are collected at file-end. If page numbers are indicated in the file, I can't figure out where or how, so they're probably not there. I mentioned that this was one of a multi-volume set. There are 16 volumes, of which this is volume 7. The entire set's etext numbers are 3435-3450, the file names are from 11001107.* to g1001107.*, all files in etext02. The last number of the file name is 7 for a 7-bit file or 8 for an 8-bit file, plus a zip file for each, i.e. 4 files for each volume. The 7-bit files can be ignored, since the corrected 8-bit text file (and a new HTML file, if one is prepared) would be processed by PG's posting software, which will generate a new 7-bit text file. As I mentioned in another thread, the existing credits will be transferred to any new version, with correctors' names added. Gutvol-d doesn't handle attachments, so I'm prepared to send the list of proposed corrections off-list as an attachment, or I can post it as part of this thread (which is probably better, since anyone interested can get it, and I won't be bothered with sending it to bunches of people). Al -----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of don kretz Sent: Tuesday, January 31, 2012 3:51 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] Goals and scope (Re: Version control systems)
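A hedged sketch (not an existing DP/PG tool) of the kind of automatic conversion offered above, in Python: in-text [FN#n] markers become links to the notes collected at file-end, which are assumed here to start with the same marker.

    import re

    def link_fn_markers(body_text):
        """In-text [FN#12] -> a link down to the matching note."""
        return re.sub(
            r"\[FN#(\d+)\]",
            r'<a id="fnref\1" href="#fn\1">[\1]</a>',
            body_text,
        )

    def anchor_fn_bodies(notes_text):
        """[FN#12] at the start of a note line -> anchor plus backlink."""
        return re.sub(
            r"^\[FN#(\d+)\]",
            r'<a id="fn\1" href="#fnref\1">[\1]</a>',
            notes_text,
            flags=re.MULTILINE,
        )

    print(link_fn_markers("a statement[FN#7] needing support"))
    # a statement<a id="fnref7" href="#fn7">[7]</a> needing support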
participants (14)
- Al Haines
- David Starner
- don kretz
- Greg Newby
- James Adcock
- Jana Srna
- Jim Adcock
- Joshua Hutchinson
- Karen Lofstrom
- Keith J. Schultz
- Lee Passey
- Marc D'Hooghe
- Marcello Perathoner
- traverso@posso.dm.unipi.it