Crowdsourcing (Re: Producing epub ready HTML)

Whew. I counted 100 messages in 2 days. Thanks for the lively discussion. I changed the subject for the theme that Joshua and others mentioned (below). The idea I am very interested in fostering is capable online tools to let essentially anyone make edits, add formats, or prepare derivative files, from any PG eBook. Then, to easily add those changes back into (an area of) the PG collection.

My view is that we could very easily have an additional major category of file, for a given work. Currently, we have two major categories: first are files that go through WWers to get online, and second are those that are automatically generated from the first type. (While this is a gross simplification, in fact at www.gutenberg.org it's really easy to tell which is which -- they are in a different set of subdirectories, with a different file naming scheme.)

A third (new) type would be those files that are, in some way, modified, derived, or produced by other people and their tools. Not necessarily WWers or the original producers/submitters. In a word, crowdsourcing. Or community editing. Or version control. Or whatever you want to call it: the point would be that ANYONE with desire and some basic capability could make changes to existing files, or provide derivative files.

I can think of several major details, and many minor ones. Concerns about copyright, spamming, whether anonymous edits are permitted, a review/revision/recision cycle, character sets, forking, searching, etc., etc. I would love to see multiple "master" files, created lovingly by hand in any or all of RST, LaTeX, or yfm (Your Favorite Markup) -- then allow users to select which master to use to generate their, say, EPUB. And, which tools to use for the conversion. Of course, people who wanted to lovingly craft an EPUB would be able to upload that, too.

A capable crowdsourcing tool - preferably one that already exists, is well-maintained, is free, and will require relatively few modifications - is the starting point I'd most like to see. Whether we start with one book or 100 or 38000 doesn't matter to me, though it matters a whole lot that the solution is scalable to the full collection.

As for the questions about whether this would be allowed, or would pollute the essence of whatever, or piss off whomever: no, PG doesn't work that way, and never did. The answer is, and has been, "yes, go for it. It's all good, and on-mission." I am sensitive to not removing or undoing others' work, but my view is that current files of the first type, above, would remain, and be easy to find, and that for the main collection at www.gutenberg.org, the WWer process (perhaps as modified, thanks to the new tools that have been under discussion) would still apply.

Last year, I tried to deploy TRAC for group editing and version control of PG eBooks. It couldn't handle the directory count, and never finished, though I'm ready to try again. Or, a different tool.

As many subscribers have heard before, I have some hefty servers that can be used for experimentation and proof of concept. That's not the hard part.

If people are aware of good tools we could base this on, please speak up. I can elaborate on why I prefer to start with an existing tool, but in a nutshell it is because (as many have pointed out), the fundamentals of crowdsourcing and file revisions are already covered by a bunch of excellent tools. Let's not reinvent the parts that others are doing well since, after all, there are plenty of challenges that are unique to Project Gutenberg or to eBooks in general.
-- Greg

On Wed, Jan 25, 2012 at 01:01:08AM +0100, Marcello Perathoner wrote:
On 01/24/2012 11:08 PM, Joshua Hutchinson wrote:
So, if someone were to start "refactoring" old PG texts into TEI or RST and working with a WWer to repost them ... is this a workable idea?
More than a technical challenge it would be a political one. I can convert a novel the size of Pride and Prejudice into RST in about an hour. More if there is formatting or images to recover. But I'd prefer to avoid the riot that will ensue if we start to reformat DP texts.
We could start redoing the top 100 list excluding everything that is too hard and everything made by DP.
Maybe we start this process on a semi-private mirror of the PG corpus and only when it reaches a critical mass of some sort it gets moved over. But an official notice that this project has some backing is necessary or we'll just keep seeing everything running around in ten different directions and nothing ever getting done.
A semi-official branch would be a good occasion to ditch the old WWer workflow in favor of a source repository (git or mercurial) that holds all the masters.
Should we reserve a range of ebook nos. or shadow the existing ones?
-- Marcello Perathoner webmaster@gutenberg.org

Greg,

There is a tool created for FLOSS Manuals called Booki. It enables collaboration over the web to create books, both printed books like you can do with Lulu.com and EPUBs. One feature it has is the ability to import EPUBs, and there is an interface that lets you import EPUBs created by archive.org. In fact, archive.org is a sponsor of Booki. They see it as a way to get the EPUBs they now generate using OCR proofed and corrected. This tool is not perfect, but it has already been used to create manuals for a lot of Free Software projects, including two manuals for the One Laptop Per Child project that I wrote, plus a translation into Spanish of the first of my manuals that was done by volunteers in South America. You can check it out here: http://en.flossmanuals.net/

James Simmons

On Fri, Jan 27, 2012 at 1:58 AM, Greg Newby <gbnewby@pglaf.org> wrote:
Whew. I counted 100 messages in 2 days. Thanks for the lively discussion. I changed the subject for the theme that Joshua and others mentioned (below). The idea I am very interested in fostering is capable online tools to let essentially anyone make edits, add formats, or prepare derivative files, from any PG eBook. Then, to easily add those changes back into (an area of) the PG collection.
My view is that we could very easily have an additional major category of file, for a given work. Currently, we have two major categories: first are files that go through WWers to get online, and second are those that are automatically generated from the first type. (While this is a gross simplification, in fact at www.gutenberg.org it's really easy to tell which is which -- they are in a different set of subdirectories, with a different file naming scheme.)
A third (new) type would be those files that are, in some way, modified, derived, or produced by other people and their tools. Not necessarily WWers or the original producers/submitters. In a word, crowdsourcing. Or community editing. Or version control. Or whatever you want to call it: the point would be that ANYONE with desire and some basic capability could make changes to existing files, or provide derivative files.
I can think of several major details, and many minor ones. Concerns about copyright, spamming, whether anonymous edits are permitted, a review/revision/recision cycle, character sets, forking, searching, etc., etc. I would love to see multiple "master" files, created lovingly by hand in any or all of RST, LaTeX, or yfm (Your Favorite Markup) -- then allow users to select which master to use to generate their, say, EPUB. And, which tools to use for the conversion. Of course, people who wanted to lovingly craft an EPUB would be able to upload that, too.
A capable crowdsourcing tool - preferably one that already exists, is well-maintained, is free, and will require relatively few modifications - is the starting point I'd most like to see. Whether we start with one book or 100 or 38000 doesn't matter to me, though it matters a whole lot that the solution is scalable to the full collection.
As for the questions about whether this would be allowed, or would pollute the essence of whatever, or piss off whomever: no, PG doesn't work that way, and never did. The answer is, and has been, "yes, go for it. It's all good, and on-mission." I am sensitive to not removing or undoing others' work, but my view is that current files of the first type, above, would remain, and be easy to find, and that for the main collection at www.gutenberg.org, the WWer process (perhaps as modified, thanks to the new tools that have been under discussion) would still apply.
Last year, I tried to deploy TRAC for group editing and version control of PG eBooks. It couldn't handle the directory count, and never finished, though I'm ready to try again. Or, a different tool.
As many subscribers have heard before, I have some hefty servers that can be used for experimentation and proof of concept. That's not the hard part.
If people are aware of good tools we could base this on, please speak up. I can elaborate on why I prefer to start with an existing tool, but in a nutshell it is because (as many have pointed out), the fundamentals of crowdsourcing and file revisions are already covered by a bunch of excellent tools. Let's not reinvent the parts that others are doing well since, after all, there are plenty of challenges that are unique to Project Gutenberg or to eBooks in general.
-- Greg
On Wed, Jan 25, 2012 at 01:01:08AM +0100, Marcello Perathoner wrote:
On 01/24/2012 11:08 PM, Joshua Hutchinson wrote:
So, if someone were to start "refactoring" old PG texts into TEI or RST and working with a WWer to repost them ... is this a workable idea?
More than a technical challenge it would be a political one. I can convert a novel the size of Pride and Prejudice into RST in about an hour. More if there is formatting or images to recover. But I'd prefer to avoid the riot that will ensue if we start to reformat DP texts.
We could start redoing the top 100 list excluding everything that is too hard and everything made by DP.
Maybe we start this process on a semi-private mirror of the PG corpus and only when it reaches a critical mass of some sort it gets moved over. But an official notice that this project has some backing is necessary or we'll just keep seeing everything running around in ten different directions and nothing ever getting done.
A semi-official branch would be a good occasion to ditch the old WWer workflow in favor of a source repository (git or mercurial) that holds all the masters.
Should we reserve a range of ebook nos. or shadow the existing ones?
-- Marcello Perathoner webmaster@gutenberg.org

On 27 January 2012 07:58, Greg Newby <gbnewby@pglaf.org> wrote:
A third (new) type would be those files that are, in some way, modified, derived, or produced by other people and their tools. Not necessarily WWers or the original producers/submitters. In a word, crowdsourcing. Or community editing. Or version control. Or whatever you want to call it: the point would be that ANYONE with desire and some basic capability could make changes to existing files, or provide derivative files.
Like on Wikisource? http://en.wikisource.org/wiki/Index:A_Desk-Book_of_Errors_in_English.djvu -- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you

This one is interesting for a couple of reasons. They are providing at least some PG work in this format - there is an Encyclopedia Britannica project that starts with PG (from DP) work. They are building in some form of semantic structure: http://en.wikisource.org/wiki/1911_Encyclop%C3%A6dia_Britannica

They recently displayed a prototype of a new markup-less editing interface. (But that's not it.)

On Fri, Jan 27, 2012 at 7:43 AM, Jimmy O'Regan <joregan@gmail.com> wrote:
On 27 January 2012 07:58, Greg Newby <gbnewby@pglaf.org> wrote:
A third (new) type would be those files that are, in some way, modified, derived, or produced by other people and their tools. Not necessarily WWers or the original producers/submitters. In a word, crowdsourcing. Or community editing. Or version control. Or whatever you want to call it: the point would be that ANYONE with desire and some basic capability could make changes to existing files, or provide derivative files.
Like on Wikisource? http://en.wikisource.org/wiki/Index:A_Desk-Book_of_Errors_in_English.djvu
-- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you

More on the Wikisource editor: http://www.mediawiki.org/wiki/Special:VisualEditorSandbox http://www.mediawiki.org/wiki/Visual_editor/Features

This is only slightly peripherally interesting: http://strategy.wikimedia.org/wiki/Product_Whitepaper#Framework_for_Strategi...

On Fri, Jan 27, 2012 at 8:40 AM, don kretz <dakretz@gmail.com> wrote:
This one is interesting for a couple of reasons.
They are providing at least some PG work in this format - there is an Encyclopedia Britannica project that starts with PG (from DP) work.
They are building in some form of semantic structure:
http://en.wikisource.org/wiki/1911_Encyclop%C3%A6dia_Britannica
They recently displayed a prototype of a new markup-less editing interface. (But that's not it.)
On Fri, Jan 27, 2012 at 7:43 AM, Jimmy O'Regan <joregan@gmail.com> wrote:
On 27 January 2012 07:58, Greg Newby <gbnewby@pglaf.org> wrote:
A third (new) type would be those files that are, in some way, modified, derived, or produced by other people and their tools. Not necessarily WWers or the original producers/submitters. In a word, crowdsourcing. Or community editing. Or version control. Or whatever you want to call it: the point would be that ANYONE with desire and some basic capability could make changes to existing files, or provide derivative files.
Like on Wikisource? http://en.wikisource.org/wiki/Index:A_Desk-Book_of_Errors_in_English.djvu
-- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you

On Fri, Jan 27, 2012 at 03:43:17PM +0000, Jimmy O'Regan wrote:
On 27 January 2012 07:58, Greg Newby <gbnewby@pglaf.org> wrote:
A third (new) type would be those files that are, in some way, modified, derived, or produced by other people and their tools. Not necessarily WWers or the original producers/submitters. In a word, crowdsourcing. Or community editing. Or version control. Or whatever you want to call it: the point would be that ANYONE with desire and some basic capability could make changes to existing files, or provide derivative files.
Like on Wikisource? http://en.wikisource.org/wiki/Index:A_Desk-Book_of_Errors_in_English.djvu
Wouldn't that be more of a replacement for the proofreading/editing chain? One main difference from what I described is that I'm focused on entire books, which have already been proofread and formatted. Wikisource seems to be focused on a page at a time. Using Wikisource as an alternate pathway to DP and the type of tool set provided there would be a good option for some contributors. I don't know that I've seen items get into the PG collection from that direction. -- Greg

On 27 January 2012 17:59, Greg Newby <gbnewby@pglaf.org> wrote:
On Fri, Jan 27, 2012 at 03:43:17PM +0000, Jimmy O'Regan wrote:
On 27 January 2012 07:58, Greg Newby <gbnewby@pglaf.org> wrote:
A third (new) type would be those files that are, in some way, modified, derived, or produced by other people and their tools. Not necessarily WWers or the original producers/submitters. In a word, crowdsourcing. Or community editing. Or version control. Or whatever you want to call it: the point would be that ANYONE with desire and some basic capability could make changes to existing files, or provide derivative files.
Like on Wikisource? http://en.wikisource.org/wiki/Index:A_Desk-Book_of_Errors_in_English.djvu
Wouldn't that be more of a replacement for the proofreading/editing chain? One main difference from what I described is that I'm focused on entire books, which have already been proofread and formatted. Wikisource seems to be focused on a page at a time.
No, that's just the proofreading view. If you click the link beside 'Title', it'll give you the regular view. Here's a completed work: http://en.wikisource.org/wiki/Index:Picturesque_New_Guinea.djvu The corresponding title: http://en.wikisource.org/wiki/Picturesque_New_Guinea
Using Wikisource as an alternate pathway to DP and the type of tool set provided there would be a good option for some contributors. I don't know that I've seen items get into the PG collection from that direction.
-- <Sefam> Are any of the mentors around? <jimregan> yes, they're the ones trolling you

Here's a completed work: http://en.wikisource.org/wiki/Index:Picturesque_New_Guinea.djvu The corresponding title: http://en.wikisource.org/wiki/Picturesque_New_Guinea
I went there, and it seems to be a "read here online" type of interface -- except it gives you an option of "exporting" in PDF format. I asked the site to perform that PDF export and have posted the result here: http://freekindlebooks.org/Dev/NewGuinea.pdf

Like on Wikisource? http://en.wikisource.org/wiki/Index:A_Desk-Book_of_Errors_in_English.djvu
In my opinion these suggestions all have the disadvantage that they try to provide the editing tools, file format and workflow FOR the PG volunteer, rather than being a versioning and source control system over that which volunteers choose to submit based on their *own* choices of tools, file formats, and work flow.

On Fri, January 27, 2012 11:48 am, James Adcock wrote:
Like on Wikisource?
In my opinion these suggestions all have the disadvantage that they try to provide the editing tools, file format and workflow FOR the PG volunteer, rather than being a versioning and source control system over that which volunteers choose to submit based on their *own* choices of tools, file formats, and work flow.
While I sympathize with your viewpoint, you have to remember the old adage "different strokes for different folks." /You/ want a system where /you/ can choose to submit a refactored file which you have created using /your/ tools and /your/ work flow. If /I/ were refactoring a file I would probably want a tool where /I/ can look at a single page scan and edit the associated file simultaneously, similar to the one hosted at Wikisource. The challenge is to design a system that can accommodate /both/ of our needs. You shouldn't be allowed to dictate my work flow any more than I should be allowed to dictate yours.

For example, let's start by agreeing to a few ground rules:

1. The master format will be HTML.

2. Every page break will be indicated by an anchor tag of the form: <a class="pageNum" id="pg0007" title="7"></a> (technically, this tag should be able to be self-closing, but experience suggests that not all UAs deal with that correctly).

3. The main text will be a single file, which will be tracked by a version control system (to be agreed upon later).

Now, you may install Tortoise[*] to access the VCS, and may commit changes you have made to the file as a whole. I could write (and have written) a software tool that provides a web-based interface to the VCS. Using the anchor tags I could extract a single page for editing in a Wikisource-like tool alongside a page image, merge the changes back into the main file, and use my web service to commit that change into the VCS. You do your thing, I do mine, and we all get along fine.

Remember the Perl motto, "Tim Toady" (TMTOWTDI, or "There's more than one way to do it.") Just because someone suggests one way, don't think you have to be constrained by it; figure out a way to achieve the same result with /your/ work flow. And don't dismiss another person's methods just because they're not the ones you would have chosen.
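As a purely illustrative sketch of that split/merge mechanism -- assuming the pageNum anchor convention in rule 2 above, with the function names and regular expression invented for this example rather than taken from any existing PG or DP tool -- page extraction and re-injection could look something like this:

import re

# Rule 2: every page break is marked like <a class="pageNum" id="pg0007" title="7"></a>
PAGE_ANCHOR = re.compile(r'<a class="pageNum" id="pg(\d+)" title="[^"]*"></a>')

def extract_page(master_html, page_id):
    """Return the HTML between the anchor for page_id and the next page anchor
    (or end of file), so it can be edited alongside its page scan."""
    anchors = list(PAGE_ANCHOR.finditer(master_html))
    for i, m in enumerate(anchors):
        if m.group(1) == page_id:
            start = m.end()
            end = anchors[i + 1].start() if i + 1 < len(anchors) else len(master_html)
            return master_html[start:end]
    raise KeyError("no page anchor pg" + page_id)

def inject_page(master_html, page_id, new_fragment):
    """Replace the segment for page_id with an edited fragment and return the
    whole file, ready to be committed back to the version control system."""
    anchors = list(PAGE_ANCHOR.finditer(master_html))
    for i, m in enumerate(anchors):
        if m.group(1) == page_id:
            start = m.end()
            end = anchors[i + 1].start() if i + 1 < len(anchors) else len(master_html)
            return master_html[:start] + new_fragment + master_html[end:]
    raise KeyError("no page anchor pg" + page_id)

A web front end could serve the extracted fragment next to the page image and commit the injected result, while another volunteer edits the whole file locally -- the "you do your thing, I do mine" point above.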

Lee> For example, let's start by agreeing to a few ground rules: ...

LOL, are you serious, or are you only putting forth this set of "ground rules" as a hypothetical example ... or as a joke???

On 1/27/2012 12:48 PM, James Adcock wrote:
Lee>For example, let's start by agreeing to a few ground rules:...
LOL, are you serious, or are you only putting forth this set of "ground rules" as a hypothetical example ... or as a joke???
I'm as serious as a heart attack. The problem was to build a system where we could both contribute to a single text while each maintaining our own unique work flows. I suggested three rules that would allow us to do just that if we both agreed to the rules. Can you put forth any reason why adherence to these three rules would not permit simultaneous editing, or is ridicule your only form of argument?

The problem was to build a system where we could both contribute to a single text while each maintaining our own unique work flows. I suggested three rules that would allow us to do just that if we both agreed to the rules.
I don't believe that was the project statement. What I believe Greg was putting forth was a system where there *was not* a single text format, or rather there are still the HTML and txt70 formats, but contributors are allowed to also submit epub or mobi or what have you which have been "hand tweaked."

We've just had a year+ of "conversations" on this forum where there is widespread disagreement about what an input format should look like, the two most common variations being some flavor of XML-like, and the other some kind of troff/BB/txt70-like, and then you just want to throw out a set of rules and declare victory? There is no victory unless a ton of people actually agree to your suggestions and further they actually submit books to your suggestions, and realistically the only place where that ton of people could come from is DP.

Realistically the greatest current point of convergence in the DP/PG community is the markup language DP uses prior to sending the texts to PP, but that markup language is so loosely defined as to be unusable directly as a source language, so if you want to try to come up with an agreement on what "the" source language should look like then your efforts should be directed at DP, not at the people on this forum. If you can get DP to agree to markup more rigorously in something much closer to XML, such that an automated tool can be run at those files without heroic effort, and such that multiple flavors of tools can be run at those files -- when you have won DP support for such an idea -- then I will stop laughing and start supporting. And you still would be stuck with this little nagging issue of 30,000+ files which don't support your ideas of how things *ought* to be done.

We agree that source code control systems are good at managing files. My experience suggests that, to do this effectively, the file paradigm is not very useful. A document is a collection of a number of interconnected chunks of text that altogether comprise a (possibly) single, long string of characters. But for various purposes we need to slice and dice in a lot of ways that don't relate well to pages (or chapters or paragraphs or whatever other segmentation device you want to use.)

And "checking out" something might be a little imprecise for some - to me it suggests sole ownership for some period of time. What is your intended use of the word in this context?

In order to track the status of a project over time, one needs to determine cumulatively who has done what to which pieces. It needs to know who performed which tasks on which pieces of a project - whether they change anything or not. It needs to be able to distribute uncompleted units of work that vary based on what the task is, from a table to a set of illustrations to a paragraph. And synchronize the text assigned with the images associated with it. It needs to be able to know about users who try to perform the same task on the same text repeatedly - which may or may not be appropriate depending on the task.

I'm not sure that I'm ready to agree that it's appropriate to make technology decisions based on an assumption that crowd-sourcing in any form is unworkable and won't be considered. I think I'm as strong an advocate of Agile Development as anyone, but at least the strains of it that I'm familiar with still advocate working from user requirements back into software requirements that determine technology choices. Itchy fingers want to make code, but I'd like to try to keep the focus on the user end so we don't end up letting the system determine what users are required to do, rather than vice versa.

OTOH, if we can feed Lee the requirements for a user-side-based API for issuing text-with-images based on a realistic variety of tasks, and accept back the results and incorporate them into a flexible workflow process using Hg or git or whatever, I'd love to be second in line behind him.

On 1/30/2012 3:23 PM, don kretz wrote: [snip]
And "checking out" something might be a little imprecise for some - to me it suggests sole ownership for some period of time. What is your intended use of the word in this context?
"Checking out" is one of those phrases which unfortunately has been used inconsistently in this context. For RCS/SCC systems one "gets" a set of files from the repository in a read-only state, one "checks out" a file setting a lock on the file and making it read/write, and one "checks in" a file which replaces the old version with the new version and releases the lock. In concurrent versioning systems one "checks out" a file or set of files as a working copy; the files are read/write, but there is no lock on them. when you want to refresh your local working copy with the most current version you "update" your files, and when you want to store your changes in the repository you "commit" your changes, which merges them back into file in the repository. I cut my teeth on RCS, so my inclination is to understand "check out" in the RCS sense which means, as you suggest, obtaining an exclusive lock for a limited period of time. (Kind of like copyright grants a monopoly for a "limited time.") Because I don't think RCS is a good model for this project, I think we should all use "check out" in the CVS sense, which means "getting a non-exclusive working copy." When comparing the two systems, however, it's hard to find a consistent use.
In order to track the status of a project over time, one needs to determine cumulatively who has done what to which pieces.
Agreed. This is what a version control system gives us.
It needs to know who performed which tasks on which pieces of a project - whether they change anything or not.
I don't agree; I see no need for any statements to the effect of "looks good to me." They're harmless, but they serve no real purpose.
It needs to be able to distribute uncompleted units of work that vary based on what the task is from a table to a set of illustrations to a paragraph.
I don't think there is any such thing as an uncompleted unit of work -- or perhaps more accurately I don't think there is any such thing as a /completed/ unit of work. A work should always be in a state of continuous improvement, and that which is served to the public will always be just the state of the work at the time it was served.
And synchronize the text assigned with the images associated with it.
This is the job of the project files, not the version control system. As an example, in ePub the .opf file brings all the files together, and defines their relationship to each other; in our case it will probably be the structure of the files in the repository, but it is /never/ the job of the version control system.
It needs to be able to know about users who try to perform the same task on the same text repeatedly - which may or may not be appropriate depending on the task.
Perhaps metrics can be gathered -- I'm just not sure what they would be good for, or how we would use them.
I'm not sure that I'm ready to agree that it's appropriate to make technology decisions based on an assumption that crowd-sourcing in any form is unworkable and won't be considered.
I'm convinced that crowd-sourcing /is/ workable given leadership. I'm just hoping someone out there is prepared to give me enough rope to hang myself. ;-)
I think I'm as strong an advocate of Agile Development as anyone, but at least the strains of it that I'm familiar with still advocate working from user requirements back into software requirements that determine technology choices. Itchy fingers want to make code, but I'd like to try to keep the focus on the user end so we don't end up letting the system determine what users are required to do, rather than vice versa.
I can agree with this, although I think there are devils in the details. I'm just hoping to get a bit of a playground so I can start shaking those little devils out of the dirt.
OTOH, if we can feed Lee the requirements for a user-side-based API for issuing text-with-images based on a realistic variety of tasks, and accept back the results and incorporate them into a flexible workflow process.using Hg or git or whatever, I'd love to be second in line behind him
Hang on, because it's going to be a bumpy ride :-).

I've used Google Code's subversion for a number of projects; we're currently using it for DP-IT. I remember porting RCS to a System V box a long time ago ...

OK Agile, start writing User Stories! (You can't argue against that as a pre-coding requirement.) And I suppose next would be the acceptance tests ... but that might tip you over the edge ...

On Mon, Jan 30, 2012 at 9:34 PM, Lee Passey <lee@novomail.net> wrote:
On 1/30/2012 3:23 PM, don kretz wrote:
[snip]
And "checking out" something might be a little imprecise for some - to
me it suggests sole ownership for some period of time. What is your intended use of the word in this context?

It needs to
know who performed which tasks on which pieces of a project - whether they change anything or not.
I don't agree; I see no need for any statements to the effect of "looks good to me." They're harmless, but they serve no real purpose.
Even in this first exercise it might be useful to anticipate people sharing work on a project; both horizontally (one task for all pages) and vertically (all tasks for some pages.) I think it would be essential to plan for it in the future unless you believe in total anarchy overwhelmed by massive redundancy. One person checking out some page or pages for some unknown reason doesn't help that much. Especially when many tasks produce no diffs.
It needs to be able to distribute
uncompleted units of work that vary based on what the task is from a table to a set of illustrations to a paragraph.
I don't think there is any such thing as an uncompleted unit of work -- or perhaps more accurately I don't think there is any such thing as a /completed/ unit of work. A work should always be in a state of continuous improvement, and that which is served to the public will always be just the state of the work at the time it was served.
If I say I'll check the chapter headings, or the illustrations through chapter 5, then that's a unit of work. It comprises a specific scope and probably has a beginning and an end. But you're right, if someone says "we need to update projects 400 to 600" (or even one project), that's not a unit of work.
And synchronize the text
assigned with the images associated with it.
This is the job of the project files, not the version control system. As an example, in ePub the .opf file brings all the files together, and defines their relationship to each other; in our case it will probably be the structure of the files in the repository, but it is /never/ the job of the version control system.
It will be interesting to see how you orchestrate a project-oriented structured filesystem simultaneously with a repository on the same projects.
It needs to be able to know about users who try to perform the same task
on the same text repeatedly - which may or may not be appropriate depending on the task.
Perhaps metrics can be gathered -- I'm just not sure what they would be good for, or how we would use them.
At a minimum, what tasks have been accomplished for what pieces of which projects, by whom, when. More importantly, what hasn't been done, so people can choose non-redundant work (which I think will be preferred by many.) Assuming again non-anarchy.
I'm not sure that I'm ready to agree that it's appropriate to make
technology decisions based on an assumption that crowd-sourcing in any form is unworkable and won't be considered.
I'm convinced that crowd-sourcing /is/ workable given leadership. I'm just hoping someone out there is prepared to give me enough rope to hang myself. ;-)
I see, for instance, someone working on a table from a textbook during the ride to work on the train on their iPad. (Which, by the way, is a unit of work you need to be able to identify, issue, and merge on a non-random, sufficiently-repeating basis.) Or checking all the Greek phrases in a Bible commentary (of which we have a few, sometimes done at a substandard level.)

My concern would be that when "the crowd" finds "real" problems in a PG text, such as remaining scannos, or coding of txt70 or html files which clearly falls nowhere near close to current PG standards, then there needs to be a well-sanctioned way to fold these "real fixes of real bugs" back into the "official" PG source code. Otherwise the crowdsourced stuff, while it may be better than what PG "officially" provides, simply continues to spin out of control. You've got to have a way to "merge back" eventually that which gets learned, fixed and improved by the crowd.

On Fri, Jan 27, 2012 at 10:40:50AM -0800, James Adcock wrote:
My concern would be that when "the crowd" finds "real" problems in a PG text, such as remaining scannos, or coding of txt70 or html files which clearly falls nowhere near close to current PG standards, then there needs to be a well-sanctioned way to fold these "real fixes of real bugs" back into the "official" PG source code. Otherwise the crowdsourced stuff, while it may be better than what PG "officially" provides, simply continues to spin out of control. You've got to have a way to "merge back" eventually that which gets learned, fixed and improved by the crowd.
Agreed. Note that the errata process we use currently is not very efficient, but it *does* result in updates to the hand-posted files and all derivatives. So, there is already a mechanism in place (perhaps it will be improved). The challenge, which you correctly identify, is to build in the feedback loop so that there is continual improvement to the "main" source(s), not just derivatives.

This is where something like a traditional source code development cycle might apply. If you have branches and trunks or the equivalent, then there will be a smaller number of people who can commit to the main branch, but anyone will be able to fork. Getting more people than the few existing WWers able to commit to the main branch will be a major benefit. I favor some sort of meritocracy-based system for getting access to such elevated levels of responsibility.
-- Greg
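Purely as an illustration of that branch-and-merge-back loop, here is a minimal sketch assuming a git repository with a "master" branch holding the canonical files; the repository layout, remote URL and function names are hypothetical, not an existing PG workflow:

import subprocess

def git(*args):
    # Run a git command in the current repository; stop loudly on error.
    subprocess.run(["git"] + list(args), check=True)

def merge_contribution(fork_url, branch, ebook_path):
    """Fetch a contributor's fork, show what changed in one ebook's master file,
    and, if the reviewer approves, merge it into the main branch."""
    git("fetch", fork_url, branch)
    git("diff", "master...FETCH_HEAD", "--", ebook_path)   # review the proposed changes
    if input("Merge this contribution? [y/N] ").lower() == "y":
        git("checkout", "master")
        git("merge", "--no-ff", "FETCH_HEAD")
        git("push", "origin", "master")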

On Fri, January 27, 2012 12:19 pm, Greg Newby wrote:
I favor some sort of meritocracy-based system for getting access to such elevated levels of responsibility.
I favor a democratic-based system where everyone gets all access until they are "voted off the island." If you have a version control system in place you will lose nothing if someone attempts vandalism. There might be some inconvenience involved in backing out changes, but the resulting innovation and vigorous participation will more than make up for it. One should not have to prove one's competence; one should have to prove one's incompetence.

Contributors should have to register with a system that at least validates that they control a specific e-mail address. IP addresses for changes should be tracked. If a user consistently violates the rules (remember the rules?), revoke his/her commit privileges. If Wiki wars break out, call in the benevolent dictator. Less bureaucracy is a good thing.

On Fri, Jan 27, 2012 at 11:43 AM, Lee Passey <lee@novomail.net> wrote:
On Fri, January 27, 2012 12:19 pm, Greg Newby wrote:
I favor some sort of meritocracy-based system for getting access to such elevated levels of responsibility.
I favor a democratic-based system where everyone gets all access until they are "voted off the island."
The meritocracy cruft is, I think, one of the lessons to learn from DP. It by definition reduces the resource pool. It's a demotivator. It's built on negative feedback (a constraint) rather than positive (an enablement). Better to focus development on providing immediate, accurate, objective feedback so people can make their mistakes, learn from them, and advance up the learning curve.

It would also be well to figure out ways to avoid creating ownership. Anyone should be able to work on any part of any text, anytime. Which probably means multiple people work from the same text in parallel and at some point, or periodically, their work is compared and the majority wins (perhaps a weighted majority?). This is in contrast to serial editing where each builds on the previous (which requires extensive orchestration mechanisms, and takes a long time.)

Pages are the natural unit of work for only some tasks.

Success should be an improving text, not a perfect text.
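As a toy sketch of that "majority wins" comparison -- assuming each contributor's version of a page has the same number of lines (real texts would need an alignment step first), and with the optional weighting mentioned above -- the merge could be as simple as:

from collections import Counter

def majority_merge(versions, weights=None):
    """Given several parallel edits of the same page (each a list of lines of
    equal length), keep, for every line, the variant backed by the most
    contributors; weights can bias the vote toward trusted contributors."""
    weights = weights or [1] * len(versions)
    merged = []
    for line_variants in zip(*versions):
        tally = Counter()
        for variant, weight in zip(line_variants, weights):
            tally[variant] += weight
        merged.append(tally.most_common(1)[0][0])
    return merged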

On Fri, January 27, 2012 2:35 pm, don kretz wrote:
It would also be well to figure out ways to avoid creating ownership.
I don't know if this would be possible. People will naturally gravitate towards works that they have some special interest in, which will create a sort of de facto ownership no matter what we do. I'm not saying it's not a good idea, I'm just not sure if it's feasible.
Anyone should be able to work on any part of any text, anytime.
This is the "ownership" problem you raised above. No one should feel like "this book is /my precious/!" and feel entitled to back out subsequent improvements. I just don't know how it could be done. The first step, of course, is to have a fairly complete and explicit set of rules about which markup is acceptable and which is not. That way, if there is a dispute there is also a set of standards by which the dispute can be resolved. I suggested a system whereby every user would register with a valid e-mail address. Perhaps the version control system could be configured to send out e-mails to the last three (or so) committers when a file is modified? That way, other people who have a vested interest in the quality of that file could at least review changes made, and object if the changes are non-conformant. If the changes /are/ conformant, but raise other issues, then an argument in a public forum would be appropriate.
Which probably means multiple people work from the same text in parallel and at some point, or periodically, their work is compared and the majority wins (perhaps a weighted majority?) This is in contrast to serial editing where each builds on the previous (which requires extensive orchestration mechanisms, and takes a long time.)
I don't think this follows from your postulate. It should be possible to work in parallel, but I think it unlikely that it /would/ happen; no individual work is so popular that it would attract simultaneous attention from our small group of volunteers. I think it much more likely that serial editing would be the norm. Any kind of concurrent versioning system (as opposed to "check-out" systems like RCS or VSS) should be able to accommodate the small amount of parallel processing that would occur. A highly distributed version control system like git or mercurial is probably not required, although either could certainly do the job. Nevertheless, a mechanism for resolving disputes should be developed.
Pages are the natural unit of work for only some tasks.
Agreed. But they /are/ the natural unit of work for some tasks. I believe a split/merge mechanism could be developed which would not only permit the two disparate processes, but would permit them simultaneously. My initial take is to preserve the books in the repository as a single file with a mechanism to extract and inject page segments dynamically.
Success should be an improving text, not a perfect text.
Beautifully said.

Lee>...and feel entitled to back out subsequent improvements. I just don't know how it could be done.

Without prejudice, please note that this is already what Marcello's rewriting tools do. DP puts in "precious" improvements over the txt70, such as page numbers, which the people who put in those page numbers are very vested in, because trust me it adds a ton of work, at least if you want to get the page numbers "right", and then Marcello's code throws away the page numbers -- at least on mobi devices, where the typical DP page numbering scheme really really doesn't work. [And the DP scheme for turning off page numbers really really doesn't work either.]

Also, people submit well-intentioned HTML which includes helpful comments for future developers, and which retains the line-breaks of the original text, so that someone in the future can go back and re-proof the book against the original, and then the WW'ers "improve" the text by running it through Tidy, throwing away the original line-breaks and the hard-won comments which the original HTML developer had put in to make life easier on future developers.

So, what one person considers a gift to PG and society, others consider garbage to be removed. Welcome to modern society. If you have multiple people partying on literally the same file you have to expect and live with the kinds of edit wars one sees on wikis, and someone is going to have to eventually traffic cop, and distinguish "honest disagreement" from truly malicious behavior [and please note the continuing disagreements about even *that* major distinction even on this minor forum!]

Jim Adcock said:
the WW'ers "improve" the text by running it through Tidy, throwing away the original line-breaks and the hard-won comments which the original HTML developer had put in to make life easier on future developers.
This is a complete fabrication. (I'd say "lie", but I'm a polite WWer, albeit somewhat outraged right now.) I've *never* done this, for the simple reason that I've never seen or used Tidy. I've canvassed the other WWers, and none of them have run Tidy on user submissions, either.

Al (one of the WWers)

Your recollections are wrong. The WWers are not "claiming"; they're stating categorically--Tidy has never been used on a submitted HTML file. You'll have to show me an example of an HTML file as it existed at submission and how it looked after posting.
On Sunday, January 29, 2012, Jim Adcock wrote:
This is a complete fabrication.
I stand by my recollections. If the WW'ers claim they are not now doing this, then that is a good thing.

You'll have to show me an example of an HTML file as it existed at submission and how it looked after posting.
I don't actually have to show you anything Al. I'd rather not spend my time and effort engaging in a pissing match. A subset of your claim is that the WW'ers are not doing this at this point in time. That claim *should* be enough for both of us to move forward on in a positive manner.

Hi Jim,

You might recollect looking at the epub code as generated by epubmaker? Marcello uses tidy to convert/reformat HTML when generating mobile formats.

Jana

On Jan 29, 2012, at 19:09, Jim Adcock wrote:
This is a complete fabrication.
I stand by my recollections. If the WW'ers claim they are not now doing this, then that is a good thing.

You might recollect looking at the epub code as generated by epubmaker? Marcello uses tidy to convert/reformat HTML when generating mobile formats.
I don't think so. I hadn't started doing much active "popping the top" on epubmaker until recently.

On 1/28/2012 11:55 AM, Jim Adcock wrote:
Lee>...and feel entitled to back out subsequent improvements. I just don't know how it could be done.
Without prejudice, please note that this is already what Marcello's rewriting tools already do. DP puts in "precious" improvements over the txt70, such as page numbers, which the people who put in those page numbers are very vested in, because trust me it adds a ton of work, at least if you want to get the page numbers "right", and then Marcello's code throws away the page numbers -- at least on mobi devices where the typical DP page numbering scheme really really doesn't work.
Please understand that what I am talking about is Mr. Hutchinson's proposal to re-work the current PG corpus into a single master format, mirroring, not replacing, the current PG work flow.

The fact is I do not give a fig about what you call txt70 and I call Impoverished Text Format. I don't suppose there are more than about 6 people in the entire world who care about ITF, and three of them are PG white washers. So the fact that the WWs do unspeakable things to files to meet their own prejudices, and that Mr. Perathoner's attempts to restore these files to their former glory are flawed, is completely irrelevant to me.

I understand that you have your own axe to grind against the current situation at Project Gutenberg. But I am not addressing this current state, so don't think for a moment that you and I are speaking about the same thing.

I understand that you have your own axe to grind against the current situation at Project Gutenberg. But I am not addressing this current state, so don't think for a moment that you and I are speaking about the same thing.
Please pardon my naïve mistake. I thought we were all talking about PG's vague strawman proposal about how to add the ability for individuals singularly or collectively to add "hand tweaked" versions to the code base and have PG host those versions, and I thought there would probably need to be *some* level of agreement about what this means and how it works in order for PG on any level to be able to successfully implement such a plan. I didn't realize you were talking about going off and doing your own thing independent of what PG is planning on doing. My bad.

Revised:

It would be well to figure out ways to avoid creating exclusive ownership of a document entirely, or in any part. Any work is done with the understanding that it will be incorporated by harmonizing it with other work that may be being done by others simultaneously.

Workers will be registered and uniquely identified by a valid e-mail address, in order to associate them with their contributions and make available selective optional advisory when certain events of interest occur. (Possible addition: No public display (meaning general public or other workers) will be made of individual information, whether provided by the worker or derived from their work.) Personal interest and commitment of individuals should be reinforced with some kind of incentive, maybe weighted influence, maybe acknowledgment, ...)

To that end, and thinking in source control terms, the basis for the document at any point in time might be a single version, or "release", that is produced periodically (period to be determined, and maybe automatic, maybe on a schedule, maybe by someone explicitly, maybe based on some set of conditions.)

A precondition is to have a fairly complete and explicit set of rules about which markup is acceptable, necessary, and sufficient (which may if necessary differ by project, but differences should be avoided.) The accomplishment of the application of the necessary and sufficient markup should produce a single version which would comprise a canonical source for all final published versions by the application of stylesheets, translators, XSLT transformations, unix-style filters, manual adjustments and enhancements, or other means. (Additional resources may require inclusion, such as images for illustrations, etc.) When there are disputes there will also be a set of standards by which they can be resolved.

"Units of work" need to be defined. These determine how much text can be submitted in one chunk. The assumption is that, for the entire chunk, the submitter believes they have completely accomplished some task. (Tasks are another topic for discussion, but I'm thinking of such things as "I proofed it" or "I proofed and formatted it" or "I checked and marked all the chapters" or "all the footnotes" or "this table" or "all the greek content" ...)

Any document has some implicit or explicit structure. This might be as simple as title page, toc, and chapters. It could also include sections/subsections, tables, footnotes, etc. This structure needs to be captured and defined as to content and relative location in the text.

The fundamental goal is to preserve the books in the repository as a single file with a mechanism to extract, inject, and harmonize units of work dynamically.

Success should be an improving text, not a perfect text.
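To make the "units of work" idea concrete, here is one possible shape for the record such a system might keep; every field name below is illustrative only, not an agreed schema:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class UnitOfWork:
    """One chunk a contributor claims to have completed: a task applied to a
    defined slice of the text, as described above."""
    project_id: str                    # e.g. a PG ebook number
    task: str                          # "proofed", "formatted", "checked footnotes", ...
    scope: str                         # "pages 40-55", "all tables", "chapter 3", ...
    contributor: str                   # registered e-mail address
    submitted: datetime = field(default_factory=datetime.utcnow)
    revision: str = ""                 # repository revision the work was based on

def work_log_entry(unit):
    """Render a one-line record for a project status log."""
    return "{:%Y-%m-%d} {} {} [{}] on {}@{}".format(
        unit.submitted, unit.contributor, unit.task,
        unit.scope, unit.project_id, unit.revision)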

A precondition is to have a fairly complete and explicit set of rules about which markup is acceptable, necessary, and sufficient (which may if necessary differ by project, but differences should be avoided.) The accomplishment of the application of the necessary and sufficient markup should produce a single version which would comprise a canonical source for all final published versions by the application of stylesheets, translators, XSLT transformations, unix-style filters, manual adjustments and enhancements, or other means. (Additional resources may require inclusion, such as images for illustrations, etc.)
I don't see where this is any different than the "rules" which are in place for HTML right now. PG says what it wants, DP submits something else, and PG accepts what DP offers because PG wants what DP offers. Then Marcello tries to fix up that which was offered so it can run on something smaller than a 20" computer monitor.

If you can get DP to agree to submit a flavor of "HTML" or "XML" or the like which is closer to their internal informal formatting markup language, and less page-image-formatting-oriented than their finally submitted HTML, then you would be ahead in the game. In that case, you are basically co-opting that critical mass of people already using a somewhat-reasonable markup language internal to DP to try to generate agreement about what should be in or out of that markup language, and what the syntax of that markup language should be. And DP is in a much better situation to enforce those rules than the people active on this forum.

http://www.pgdp.net/c/faq/document.php

Unfortunately, what DP is doing right now in their internal-format formatting files is a combination of troff-style formatting and XML markup. It would be much easier to create tools if it was all just 100% XML-style markup. People thinking about this issue would do well to look at their set of agreed-upon formatting markup rules, which literally contains about 100 markup rules. Agreeing on a formatting standard is NOT a simple task.

On 1/29/2012 11:08 AM, Jim Adcock wrote:
A precondition is to have a fairly complete and explicit set of rules about which markup is acceptable, necessary, and sufficient (which may if necessary differ by project, but differences should be avoided.) The accomplishment of the application of the necessary and sufficient markup should produce a single version which would comprise a canonical source for all final published versions by the application of stylesheets, translators, XSLT transformations, unix-style filters, manual adjustments and enhancements, or other means. (Additional resources may require inclusion, such as images for illustrations, etc.)
I don't see where this is any different than the "rules" which are in place for HTML right now.
The difference is that right now there practically are /no/ rules in place for HTML (or perhaps there are /unwritten/ rules, but I have no idea what they are). According to the HTML FAQ, http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ#1._The_only_absolute_rule_i...: "The only absolute rule is that the HTML should be valid according to one of the W3C HTML standards, and, if used, CSS must also be valid." The FAQ goes on to provide some guidance on how to use HTML, but it barely scratches the surface of the rules needed for HTML to be a master format.

The pgdp Wiki has some more pertinent advice at http://www.pgdp.net/wiki/The_Proofreader%27s_Guide_to_EPUB#How_to_Author_HTM... but even this good advice is incomplete, and hardly qualifies as PG "rules."

So, if Project WOPR is going to do a better job at creating a master format than currently exists, a much more complete set of rules will need to be created, probably rules that would address most, if not all, of the constructs in the list you posted a few days ago. Such a set of rules does not now exist, but I am convinced that it could be created if people are committed to compromise.

On 1/28/2012 1:03 PM, dakretz@gmail.com wrote:
Revised:
[snip]
The fundamental goal is to preserve the books in the repository as a single file with a mechanism to extract, inject, and harmonize units of work dynamically.
Success should be an improving text, not a perfect text.
I have a few quibbles about this manifesto, but I'm not going to mention them here and distract from the larger message. This is an extremely well-thought-out and well-expressed vision. I could be happy working under these conditions. Kudos, Mr. Kretz.

Note that the errata process we use currently is not very efficient, but it *does* result in updates to the hand-posted files and all derivatives.
Correct me if I'm wrong, but I think the errata process corrects very localized errors such as found scannos. It does not correct bigger problems such as "hey, this txt70 and/or html file is really not up to snuff by today's PG standards."

On 01/27/2012 08:19 PM, Greg Newby wrote:
This is where something like a traditional source code development cycle might apply. If you have branches and trunks or the equivalent, then there will be a smaller number of people who can commit to the main branch, but anyone will be able to fork. Getting more people than the few existing WWers able to commit to the main branch will be a major benefit. I favor some sort of meritocracy-based system for getting access to such elevated levels of responsibility.
That's why I suggested git or mercurial, both of which allow a hierarchical organisation of committers. PG would pull from the WWers and every WWer will pull from his group of PPers etc. -- Marcello Perathoner webmaster@gutenberg.org

On Fri, Jan 27, 2012 at 11:13:50PM +0100, Marcello Perathoner wrote:
On 01/27/2012 08:19 PM, Greg Newby wrote:
This is where something like a traditional source code development cycle might apply. If you have branches and trunks or the equivalent, then there will be a smaller number of people who can commit to the main branch, but anyone will be able to fork. Getting more people than the few existing WWers able to commit to the main branch will be a major benefit. I favor some sort of meritocracy-based system for getting access to such elevated levels of responsibility.
That's why I suggested git or mercurial, both of which allow a hierarchical organisation of committers. PG would pull from the WWers and every WWer will pull from his group of PPers etc.
That's why I tried TRAC. It uses subversion rather than git or hg, but the core capability of branches, hierarchies, etc. seemed a good fit for our purposes. -- Greg

On 1/27/2012 5:22 PM, Greg Newby wrote:
That's why I tried TRAC. It uses subversion rather than git or hg, but the core capability of branches, hierarchies, etc. seemed a good fit for our purposes.
At your instigation I went out to the TRAC web site and looked at this product. TRAC is an integrated project management tool, not merely a Version Control System. In fact, TRAC is not a VCS at all; it merely provides a front end to Subversion, and apparently, through the use of plugins, to git and Mercurial as well. TRAC has a number of attractive features that would be useful to project WOPR, including a bug tracking system, a wiki, and an RSS feed. However, I don't see a current need for the full suite that TRAC offers. Because TRAC is not a VCS, but only a portal into a VCS, I think that the right thing to do at this point is to install a stand-alone VCS with an eye towards integrating it into TRAC at some future point. For our purposes, I think just about any concurrent version system would be adequate, so you should pick the one that you think would be easiest to administer.

When you write of standalone version control systems, do you mean using svn/hg/git out of the box, and developing all the other software for our needs? If not, what examples of standalone VCS are you writing about? -- Greg On Sun, Jan 29, 2012 at 04:21:03PM -0700, Lee Passey wrote:
On 1/27/2012 5:22 PM, Greg Newby wrote:
That's why I tried TRAC. It uses subversion rather than git or hg, but the core capability of branches, hierarchies, etc. seemed a good fit for our purposes.
At your instigation I went out to the TRAC web site and looked at this product.
TRAC is an integrated project management tool, not merely a Version Control System. In fact, TRAC is not a VCS at all; it merely provides a front end to Subversion, and apparently, through the use of plugins, to git and Mercurial as well.
TRAC has a number of attractive features that would be useful to project WOPR, including a bug tracking system, a wiki, and an RSS feed.
However, I don't see a current need for the full suite that TRAC offers. Because TRAC is not a VCS, but only a portal into a VCS, I think that the right thing to do at this point is to install a stand-alone VCS with an eye towards integrating it into TRAC at some future point. For our purposes, I think just about any concurrent version system would be adequate, so you should pick the one that you think would be easiest to administer.

Let me point out that trying to use an sccs for this has been tried, and you may want to think twice. When vasa and I originally conceived DP50 it was our (or at least my) intention to use subversion or git as the text repository. The further we got into it, the less desirable it seemed to become - for a number of reasons. I'm not saying it can't be made to work, and work successfully. But it's not a slam dunk. For one, maintaining text is not the same as maintaining source code. And in particular, the work flow for software development is not the same as ours. And if there's any guiding principle behind sccs, it's to support the depth and breadth of software development. We found we could get along fine for a long time storing simple versions of text in a structured directory design that was easier to use and monitor. And we could always get to our text in another obvious way. We ended up spending a lot of time just figuring out the sccs APIs (which again are designed to support software development) and they aren't simple or really very flexible. We had to mostly adapt our conceptual models to theirs - there isn't much room to fiddle with theirs. Sccs transactions can be very slow. I realize we had different needs - here we're talking about at least mostly-completed projects, not page-or-less components. But I don't see how we easily avoid at least some extension in the proofing direction if we really want to do continuous semi-open-access improvement of texts. I think it requires administrative resources we'll never have to do it any other way. Could we just call it "the repository" or something for a while? I think we should maybe spend more time coming at it from the user's direction and refining some requirements before we make technology choices. And in that direction I think automating the build process to be more dependency-sensitive might pay off more in the short run. Maybe it's there and I don't know it, but I haven't heard much of that flavor to the discussion so far.

On 01/30/2012 03:40 AM, don kretz wrote:
For one, maintaining text is not the same as maintaining source code. And in particular, the work flow for software development is not the same as ours.
But it is close enough. All revision control systems store text files and work in a line oriented fashion. I don't see any difference between program source files and text files. (Here I'm thinking about assembled books, not single pages. Keep the line endings put, and we are already there.) I've considered alternatives, but the best suited VCS seem to be either git or mercurial (hg), with a slight advantage for mercurial. git is blindingly fast and because it only transmits compressed diffs, a multi-megabyte book can be edited in seconds if you already have the book checked out. Very interesting if you are on a GPRS link. But the main thing git lacks is a way to check out parts of a project, which is of paramount importance for us. You don't want to check out the whole archive to edit one typo in one book. hg does have this. (From reading the docs, not from actual testing.) So with hg you can check out one book. Another advantage is that hg is written in python (the PG conversion software and web application server are written in python) and has a very good python interface. On the down side hg is a bit slower than git, but not very much, and not as widely deployed.
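As a rough illustration of the "check out one book, fix one typo" workflow under a one-repository-per-book layout (an assumption on my part, not a decided structure), something like the following would be the entire round trip. The repository URL, working path, and ebook number are invented, and the hg commands are just the stock clone/commit/push cycle:

    # Sketch: per-book Mercurial repositories, driven from Python via subprocess.
    # Repository URL and ebook number are hypothetical placeholders.
    import subprocess

    def hg(*args, cwd=None):
        subprocess.run(["hg", *args], cwd=cwd, check=True)

    book_repo = "https://example.org/pg/repos/12345"   # one repository per ebook
    work_dir = "/tmp/12345"

    hg("clone", book_repo, work_dir)    # pulls only this book, not the whole archive
    # ... edit /tmp/12345/12345-h.htm to fix the typo ...
    hg("commit", "-m", "Fix scanno in chapter 3", cwd=work_dir)
    hg("push", cwd=work_dir)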
I realize we had different needs - here we're talking about at least mostly-completed projects, not page-or-less components. But I don't see how we easily avoid at least some extension in the proofing direction if we really want to do continuous semi-open-access improvement of texts. I think it requires administrative resources we'll never have to do it any other way.
I'm very much against this `crowdsourcing´ of text improvement. We'll end up with the few good volunteers we have patrolling and reverting the edits of hundreds of clueless or malign individuals. Our task is much more similar to software development than writing articles for wikipedia. We have a strict original to follow and strict rules to apply. What we could implement is a system to flag potential text errors for revision. This system should ideally be integrated into the text itself (javascript). If any text location accumulates enough error reports, it will be presented to the errata team. But the first thing we'll need is page images for every book linked to the text and publicly available. -- Marcello Perathoner webmaster@gutenberg.org
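A sketch of what the server side of such a flagging system might look like; the threshold, the in-memory storage, and the function name are all placeholders, not an existing PG facility:

    # Sketch only: accumulate error reports per (ebook, location) and queue the
    # location for the errata team once enough independent reports arrive.
    from collections import defaultdict

    REPORT_THRESHOLD = 3                  # placeholder value
    reports = defaultdict(set)            # (ebook_no, location) -> reporter ids
    errata_queue = []

    def flag_error(ebook_no, location, reporter, note):
        key = (ebook_no, location)
        reports[key].add(reporter)        # one vote per reporter per location
        if len(reports[key]) >= REPORT_THRESHOLD:
            errata_queue.append((ebook_no, location, note))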

Hi Marcello, basically I can agree. Yet, below you have written: "We have a strict original to follow and strict rules to apply." Could you direct me to the strict rules that apply? If they do not yet exist, that is O.K. If they are not that strict or concise, that is fine, too. I will be starting a series that I hope will lead to a concise and consistent system which will give guidance to those wishing to submit etexts/ebooks to PG, as well as guidance for developing tools and/or a tool chain. regards Keith. On 30.01.2012 at 14:03, Marcello Perathoner wrote:
On 01/30/2012 03:40 AM, don kretz wrote:
For one, maintaining text is not the same as maintaining source code. And in particular, the work flow for software development is not the same as ours.
But it is close enough.
All revision control systems store text files and work in a line oriented fashion. I don't see any difference between program source files and text files. (Here I'm thinking about assembled books, not single pages. Keep the line endings put, and we are already there.)
I've considered alternatives, but the best suited VCS seem to be either git or mercurial (hg), with a slight advantage for mercurial.
git is blindingly fast and because it only transmits compressed diffs, a multi-megabyte book can be edited in seconds if you already have the book checked out. Very interesting if you are on a GPRS link.
But the main thing git lacks is a way to check out parts of a project, which is of paramount importance for us. You don't want to check out the whole archive to edit one typo in one book. hg does have this. (From reading the docs, not from actual testing.) So with hg you can check out one book.
Another advantage is that hg is written in python (the PG conversion software and web application server are written in python) and has a very good python interface.
On the down side hg is a bit slower than git, but not very much, and not as widely deployed.
I realize we had different needs - here we're talking about at least mostly-completed projects, not page-or-less components. But I don't see how we easily avoid at least some extension in the proofing direction if we really want to do continuous semi-open-access improvement of texts. I think it requires administrative resources we'll never have to do it any other way.
I'm very much against this `crowdsourcing´ of text improvement. We'll end up with the few good volunteers we have patrolling and reverting the edits of hundreds of clueless or malign individuals.
Our task is much more similar to software development than writing articles for wikipedia. We have a strict original to follow and strict rules to apply.
What we could implement is a system to flag potential text errors for revision. This system should ideally be integrated into the text itself (javascript). If any text location accumulates enough error reports, it will be presented to the errata team. But the first thing we'll need is page images for every book linked to the text and publicly available.
-- Marcello Perathoner webmaster@gutenberg.org

Note new subject line On Sun, January 29, 2012 7:40 pm, don kretz wrote:
For one, maintaining text is not the same as maintaining source code. And in particular, the work flow for software development is not the same as ours. And if there's any guiding principle behind sccs, it's to support the depth and breadth of software development.
We found we could get along fine for a long time storing simple versions of text in a structured directory design that was easier to use and monitor. And we could always get to our text in another obvious way.
Like Mr. Perathoner I'm a little confused about this statement. A source code file is a text file just like an XML file is, and a version control system ought to be able to handle both of them equally well. I've spent the last 1.5 years working with an XHTML policy manual system using Subversion as the repository and for version control (corporate standard, not my choice). The VCS component of this project was the most straightforward part and was the one part that "just worked." I would be very interested in hearing more about your experiences, off-list if you would like, so I can be ahead of the curve if problems arise in my system.
We ended up spending a lot of time just figuring out the sccs APIs (which again are designed to support software development) and they aren't simple or really very flexible. We had to mostly adapt our conceptual models to theirs - there isn't much room to fiddle with theirs.
You mention the SCC API. I know that Microsoft purchased Visual SourceSafe and then created the Source Code Control interface that Visual Studio used to integrate the IDE with VSS. I know that Adobe has adopted this interface exclusively for its Dreamweaver and RoboHelp products, and presumably for its Creative Suite. Presumably other companies have also implemented or consumed the SCC interface, but I have no experience with them. When you speak of SCC, are you referring to the Microsoft API? Visual SourceSafe and the SCC API are RCS-like systems. They do not support concurrent versioning, but rather use the sequential paradigm where a file is locked on check-out and cannot be locked by any other user until the file is unlocked by being checked in. A user cannot submit changes to the repository until a lock is obtained. These kinds of systems seem to require a lot of administrative attention to break stale locks. We were using the RoboHelp product, and integration between Adobe's SCC interface and the corporate standard Subversion repository was quite challenging. There are a few SCC/SVN products out there, but most are quite long in the tooth. We ended up using the commercial PushOK product to convert SCC calls to SVN, and vice versa. If, when you say SCC, you are referring to the Microsoft Source Code Control interface, then I can understand your frustrations. But for this particular project I think we shouldn't face these problems if we simply stick with a concurrent version control system, and eschew any RCS-like systems.
Sccs transactions can be very slow.
Again, it depends on what system you're talking about. My experience with Adobe suggests that even SCC transactions can be very quick if you have well-written software running on a 100-base-T Local Area Network ;-). I don't think it will be a problem to find a version control system fast enough for our needs. I /do/ think that Mr. Perathoner's concerns about users on GPRS systems are valid, and we need to think about how to address those concerns.
I realize we had different needs - here we're talking about at least mostly-completed projects, not page-or-less components. But I don't see how we easily avoid at least some extension in the proofing direction if we really want to do continuous semi-open-access improvement of texts. I think it requires administrative resources we'll never have to do it any other way.
Yes, I think some extension into the proofing direction is inevitable, and we should be prepared for it. This is why I suggest a rule that some sort of unambiguous page marker be inserted into the master file so that a single page can be programmatically extracted. This leads me around to the "unit of work" question. Mr. Perathoner suggests that Git may not be the best solution for a VCS as you have to check out the entire "project" before the efficiencies of diff merging kick in. So what is a "project?" I had always conceived a project as being a single "work," whatever that means. I get the impression that others conceive of the project as encompassing all 5000 works that we choose as our starting point. I propose that for version control purposes, each "work" will have its own "project." Each project must contain the master file(s) and page scans of the work. (Would a simple reference to the page scans at IA be sufficient? Do we need to bust open IA's archive files so each page image can be viewed individually?)
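As a sketch of how a single page could be pulled out of a master file once such a marker rule exists: the <!-- page NN --> comment syntax here is only an example, not an agreed rule, and the file name is invented.

    # Sketch: extract one page's worth of text from a master HTML file,
    # assuming an unambiguous page-boundary comment like <!-- page 17 -->.
    import re

    def extract_page(master_html, wanted):
        pieces = re.split(r"<!--\s*page\s+(\d+)\s*-->", master_html)
        # re.split() yields [front matter, "1", page 1 text, "2", page 2 text, ...]
        for i in range(1, len(pieces), 2):
            if int(pieces[i]) == wanted:
                return pieces[i + 1]
        return None

    with open("12345-h.htm", encoding="utf-8") as fh:
        page_17 = extract_page(fh.read(), 17)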
Could we just call it "the repository" or something for a while? I think we should maybe spend more time coming at it from the user's direction and refining some requirements before we make technology choices.
The reason I would prefer choosing /something/ is that I tend to use agile development methodologies in my own work. If a repository were available now, I would start by generating an HTML file using Mr. Perathoner's text-to-HTML scripts, doing easy tweaks to the file, and checking it in as a first version. I would continue to work with the file adding complexity and generating new ideas as I go, doing interim check-ins. With each revision I will have learned something, which will cause me to propose a new rule, or the modification of an old rule. If it turns out that the work I would have done is incorrect or unnecessary we'd just throw it away and start over. If it turns out the VCS system we have chosen is inadequate to the task we just import the files into a new VCS system (I think they all have some sort of import/export function). The most fundamental proposition of agile programming is that it's okay to throw away work if it's wrong.
And in that direction I think automating the build process to be more dependency-sensitive might pay off more in the short run. Maybe it's there and I don't know it, but I haven't heard much of that flavor to the discussion so far.
I'm opposed to a "build process." With the exception of CVS every VCS I've been talking about has an HTTP interface, and in the case of CVS the project document directory (not CVSROOT) can be mounted as a web server document directory. The most recent version of any document should be available through a browser call. As for derived formats, a web server interface would be provided to serve derived formats on demand. Caching would be appropriate, so any particular format would be cached on generation, but time-stamping should be observed, so a cached file would be discarded and regenerated when a change to the project files occurs.
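A minimal sketch of that "generate on demand, cache by timestamp" behavior; generate_epub() stands in for whatever converter is eventually chosen and is not an existing function:

    # Sketch: reuse the cached derivative only if it is newer than the master.
    import os

    def get_cached_epub(master_path, cache_path):
        if (os.path.exists(cache_path)
                and os.path.getmtime(cache_path) >= os.path.getmtime(master_path)):
            return cache_path                       # cache still valid
        generate_epub(master_path, cache_path)      # placeholder converter; regenerates the file
        return cache_path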

I propose that for version control purposes, each "work" will have its own "project." Each project must contain the master file(s) and page scans of the work. (Would a simple reference to the page scans at IA be sufficient? Do we need to bust open IA's archive files so each page image can be viewed individually?)
Are you not now heading in the direction of reinventing DP? If you already have an existing say "DP"ed work then isn't referencing (or hosting) a single Djvu or PDF page image file [such as is already hosted at IA] with indexing sufficient for people to check their [incremental] changes to that work? If you are going to start resourcing page images ala DP, even those harvested from IA, then you are adding a ton more work.

How can you possibly not host the images? You can't count on all your other sources making them available forever; or tomorrow; or at a predictable url. And something has to be canonical or people can (will) post whatever they please for content. Seems like it's the least effort to put them where you want them and you know where they'll be. Then it's time to consider whether to include a permalink in the distributed texts. Some readers would appreciate it (think illustrations, diagrams, tables, ...) and it's not like we're trying to hide anything. On Mon, Jan 30, 2012 at 7:24 PM, James Adcock <jimad@msn.com> wrote:
I propose that for version control purposes, each "work" will have its own "project." Each project must contain the master file(s) and page scans of the work. (Would a simple reference to the page scans at IA be sufficient? Do we need to bust open IA's archive files so each page image can be viewed individually?)
Are you not now heading in the direction of reinventing DP? If you already have an existing say "DP"ed work then isn't referencing (or hosting) a single Djvu or PDF page image file [such as is already hosted at IA] with indexing sufficient for people to check their [incremental] changes to that work? If you are going to start resourcing page images ala DP, even those harvested from IA, then you are adding a ton more work.

How can you possibly not host the images? You can't count on all your other sources making them available forever; or tomorrow; or at a predictable url.
And something has to be canonical or people can (will) post whatever they please for content. Seems like it's the least effort to put them where you want them and you know where they'll be. Again, the direction I hear you guys heading in is that you-all want to reinvent DP because you-all think you can improve upon DP. Not saying you can, not saying you can't. I'm just saying that is not *what I heard* Greg talking about. A requirement to post images increases the submitters' work load over what the PG/WW'ers are asking for already. If I wanted to make post-ready page images then I could just go through the ugly process of making page images ready for DP and then wash my hands of it. At the very least you ought to invent a pull-system from IA to automate this for when submitters ref IA in the first place. Not sure you aren't making a mess.

Another view is that we have a mess that defies improving texts; and it's reasonable that a less messy situation will necessarily involve providing images for people to legitimize doing anything with the texts. On Mon, Jan 30, 2012 at 8:54 PM, James Adcock <jimad@msn.com> wrote:
How can you possibly not host the images? You can't count on all your other sources making them available forever; or tomorrow; or at a predictable url.
And something has to be canonical or people can (will) post whatever they please for content.
Seems like it's the least effort to put them where you want them and you know where they'll be.
Again, the direction I hear you guys heading in is that you-all want to reinvent DP because you-all think you can improve upon DP. Not saying you can, not saying you can't. I'm just saying that is not *what I heard* Greg talking about.
A requirement to post images increases the submitters' work load over what the PG/WW'ers are asking for already. If I wanted to make post-ready page images then I could just go through the ugly process of making page images ready for DP and then wash my hands of it. At the very least you ought to invent a pull-system from IA to automate this for when submitters ref IA in the first place. Not sure you aren't making a mess.

A pull system from IA should be helpful. I bet they have one from PG.

Another view is that we have a mess that defies improving texts;
An interesting view but one which cannot be defended: For the great majority of book files PG provides today, I, or anyone who cares, can greatly improve the texts on EPUB and MOBI devices with a few minutes' work, taking those files from "virtually unreadable" to "about as good an EPUB or MOBI file as one can find anywhere." And one can do this without looking at page images, simply because the great majority of PG files currently contain the same half-dozen "formattos" over and over again. Why obsess over a half-dozen scannos when the typical PG file contains literally about 1,000 easily-fixed "formattos" which keep the books from being read?

OK, I'm willing to listen to how I can improve any "book files" (whatever they are) PG provides today. We get that question on DP all the time. We tell them to email PG. What great secret do you share with them that the rest of us aren't privy to? On Tue, Jan 31, 2012 at 7:14 AM, Jim Adcock <jimad@msn.com> wrote:
Another view is that we have a mess that defies improving texts;
An interesting view but one which cannot be defended: For the great majority of book files PG provides today, I, or anyone who cares, can greatly improve the texts on EPUB and MOBI devices with a few minutes' work, taking those files from "virtually unreadable" to "about as good an EPUB or MOBI file as one can find anywhere." And one can do this without looking at page images, simply because the great majority of PG files currently contain the same half-dozen "formattos" over and over again. Why obsess over a half-dozen scannos when the typical PG file contains literally about 1,000 easily-fixed "formattos" which keep the books from being read?

This makes for an interesting Use Case analysis. Say an average PG reader finds several mistranscriptions of Greek text in an article in one of my DP Encyclopedia Britannica projects. How do they go about improving the texts?
a. How do they find out, from their downloaded third-party ebook, where the ebook came from?
b. Say they find out it originally came from PG (which is apparent to maybe 25% of readers, to be generous) and they go to the PG website. What next?
c. The most likely action would be to look up their project. How would they find it? (That itself is not as easy as you think.)
d. When they find their project, how do they find out about errata?
e. Then what happens?
I suggest this is at least one way to approach your requirements that is at least as effective as our discussion so far.

What great secret do you share with them that the rest of us aren't privy to?
No great secret: I've shared them here ad nauseam for the last couple of years. People just don't listen.
1) One can talk all one wants about HTML and how it is *supposed* to render, but one needs to understand instead how it *does* render on different devices, especially the small devices such as EPUB devices and MOBIs. Also, if one actually reads the HTML specs carefully one finds out that basically no rendering device is ever under any obligation to render HTML the way you think you've specified. CSS is basically *hints*, not page layout "demands". This is especially true of the EPUB spec. And Mobi/Kindle? -- the only "spec" is basically the actual behavior of this or that Kindle device.
2) If in the HTML one chooses to horizontally stack elements, then that is not going to work on small devices, which physically only have the space to attractively display ONE element horizontally at a time. Vertically stacking elements basically works on all devices. Examples of horizontally stacked elements: i) text, ii) images, iii) page numbers, iv) left margin, v) right margin. On a small device: Choose One. Because that is all that fits.
3) It is much easier for people who want to have margins, or who want to "throw away" excess horizontal real estate they have on their display, to add those themselves than it is for people who don't have the real estate to get it back once you've thrown it away.
4) To wit, <body> specifications should never be made.
5) The process of creating books to read is one that has developed slowly over the course of the last 400+ years. If you want your readers to believe they are "reading a book" you need to respect that tradition and follow it closely. If you do not follow that 400+ years of tradition then the reader *will not* "suspend disbelief" and accept that they are in fact "reading a book." Inventing your own system of typography is very probably a very bad idea -- even if K&R started it.
6) Read some books on typography. The De Vinne series from the turn of the century (free on Google Books) is really good. The formatting of books is not "arbitrary." It follows in a 400+ year tradition.
7) Characters, "glyph points" or code points in Unicode, have meaning. Try to follow that established meaning. Don't invent your own meanings for existing code points. Doing so will cause breakage.
8) Sentences have meaning. If it was a sentence in the original book, it should be a sentence in this book.
9) Paragraphs have meaning. Traditionally there are three ways to format a paragraph, from "cheapest" -- using up the least amount of paper, to "classiest" -- using more paper but much easier to read, and which traditionally was used in the highest quality printings, such as first editions. Since there is no paper cost to creating e-books, it behooves us to use the "classiest" method, which, surprise surprise, is the same method that Michael Hart recommended. (Strange how that works.) Cheapest method to "classiest" method:
a) No indent at start of paragraph, no spacing between paragraphs, "notch" the right end of the last line of the paragraph.
b) Notch, i.e. indent the start of the paragraph, no spacing between paragraphs, allow the last line of the paragraph to run ragged.
c) Do not notch but put a 1em vertical spacing between paragraphs, last sentence runs ragged. This is basically the same paragraph formatting which is used on the PG txt70, so if you do c) then you have the advantage of following what has been PG "style" for quite some while. This style also has the practical advantage of being the easiest to work with, and to read!
Things which look stupid (at least if you have some background in typography):
e) Notching AND putting a space between paragraphs (a common error).
f) And/or putting 2em of spacing between paragraphs. Which you may be doing without realizing it.
10) Don't try to specify font families. It really doesn't work. If you think it works you need to read #10 again.
11) Try to follow the meaning of HTML tags. If you use <p> tags on something which is not a paragraph, that is probably a mistake.
12) Illuminated letters don't work. If you try to do them you will break many readers' experiences. If you insist on retaining them, do so as images, not by trying to do "letter placement" on them. See "Floats don't work."
13) Drop caps don't work. Don't do it.
14) Colored lettering, grey lettering, etc., doesn't work. Don't do it.
15) "Literalism" -- trying to literally recreate the layout of the original text using HTML doesn't work. Don't be literal.
16) Poems: There is lots of great advice on how to format poetry in various places on the internet. That advice doesn't work. Suggest at this point in time trying something simple such as indent plus <br> to terminate individual lines of the poem. The poem won't wrap exactly correctly, but that is a softer failure mode than the other "really cool" suggestions one finds on the internet.
17) Background colorings don't work.
18) Try taking out your CSS and see if your coding is still attractive and 100% readable. If not, you are doing something wrong.
19) Common PG/DP methods of encoding page numbers in HTML don't work. Don't ask me how to do it "correctly" because I haven't seen anything yet which isn't "busted."
20) If you think it works, test it. When you test it, you will find out that contrary to what you have read "everywhere" it doesn't in fact work.
21) Behavior exhibited in *your* copy of IE, Moz, or Chrome is NOT the behavior the end customer will experience in *their* copy of IE, Moz, Chrome, EPUB reader, MOBI reader, Android tab, etc. Test it on various devices, and if it doesn't work it is because *you* are doing something wrong, not because the *device* is doing something wrong. Do not blame the device, and do not blame the end reader. They are not under any obligation to read your effort. On the contrary, you are under an obligation to make something worthy of their reading. If you don't believe this, no problem: don't do the work -- and they won't read it.
22) Measurements in terms of ems usually work except on very old copies of IE. Measurements in terms of % may work. Other units of measurement tend not to work, and a desire to use those units indicates you are doing something wrong. In particular, measurements in units of pixels don't measure pixels, and measurements in units of points and picas do not measure those things. Specifying absolute measurements in inches or cm is a disaster.
23) Absolute placement doesn't work.
24) Floats don't work.
25) "Comment this out to make page numbers disappear" doesn't work. In fact, contrary to your expectations, the page numbers will not only not disappear, but they will show up in unpleasant places.
26) Almost anything one can read about HTML in books or on the internet is destructive to one's ability to make good books. People who write on the subject, and who on some level tend to know what they are talking about, are Elizabeth Castro ["EPUB Straight to the Point", "Pigs, Gourds, and Wikis"], Joshua Tallent, and Rufus Deuchler -- and even then one needs to look at what they are saying with a jaundiced eye.
27) Top and bottom margins may or may not merge depending on the device. If you want your work to look attractive and not ugly and broken, choose to use one or the other, not both.
28) Top and bottom margins may round to the closest 1em. Combined with 27) this means that if you "split the baby" and set <p> top and bottom margins of 0.5em (or the equivalent in other units) you have just specified a really ugly 2.0em spacing between paragraphs on some devices.
29) Read the Amazon formatting suggestions. It should give you some feeling of what you are up against. http://kindlegen.s3.amazonaws.com/AmazonKindlePublishingGuidelines.pdf If you can't write "correctly" against the limitations of a Kindle, then you are breaking on many other machines also. Not every EPUB device runs ADE. Not every PC runs Moz, IE, or Chrome.
30) If you don't care, then neither will the reader -- they will simply choose to read something else.
31) Page numbers really don't work on small devices, but, oh well.
32) The basic rule about rules: Rules don't work. Especially if you use decorative rules. [[Please understand that the rules we are talking about here are "horizontal rules".]] However, if the original book used rules you may feel some obligation to retain them. Please understand, however, that in traditional typography a rule is a publishing house's cheap substitute for spending more money on more paper by providing a real page break or a large <tb>. I.e., understand when and why you are propagating a "cheat" on the current reader *for no good reason.* Rules really don't work on many EPUB devices, which will do something really ugly where you thought you were specifying a rule.
33) Old books of lower quality often use "printers' art" -- the "clip art" of the day -- to gin up sales of shelf stuffers at Xmas time. Retaining that clip art may not actually represent a contribution. If the artwork *does* relate closely to the content of the book and/or if there is a credited illustrator, then it's probably worth keeping the artwork. Most famously, Hans Christian Andersen took a "clip art" and built one of his most famous stories around it -- the "clip art" came first, the story came second.
33) Things which you put in HTML "as documentation", such as programmers' comments and unused page anchors, tend to get automagically thrown away by one or another automated tool before you even realize it.
34) Things which you slavishly retain, such as original line breaks, tend to quickly get discarded in the real world.
35) Right margins don't work.
36) Check your images on real devices, particularly EPUB devices. I'm not sure what is going on, but commonly used image storage formats at DP/PG break horribly on many EPUB devices, leading to un-viewable images.
37) PG efforts are reprocessed, repackaged, and redistributed from dozens if not 100s of different sites, often without giving PG any credit (but if you are the creator of that PG book, you will certainly still recognize it). These redistribution efforts are often *inferior* to the versions offered on PG's site. Why? Because these other sites cannot afford to take the time and effort to "fix" "broken" HTML CSS efforts by *one or another* PG submitter. The end result is that they choose to throw away some or ALL of PG's formatting efforts -- because they cannot afford to track down and fix the individually broken submissions! Conclusion: "We" all live and die together. One person making bad CSS formatting choices drags down *everyone's* efforts, leading to inferior copies of PG books being read "everywhere."
38) Literalism: Not all books fit well into the reflow mechanism implied by HTML and small devices. If your book really needs fixed layout, then don't bother trying to use HTML for your effort. Consider using PDF instead, for example. But what we really see is people who don't know what they are doing submitting in PDF when HTML reflow would have worked perfectly well for their book, and other people trying to force their book into HTML when fixed-format PDF would have been a better choice for that book.
39) Justification: Choice thereof is best left to the end user. Do not override their freedom of choice.
40) Font sizes: Many devices DO NOT have an infinite variety of font sizes. Assuming that they do will fail, often in an ugly manner. Relative sizing, i.e. <small> <medium> <large>, does tend to work -- most of the time.
41) Font choices: Most devices support four fonts: regular, italic, bold, and a "teletype" font such as used in pre. The only reliable ways to specify these fonts are no tag, <i>, <b>, and the "pre" family of tags. If you try to go beyond this set of assumptions your assumptions will fail in an ugly manner on many, many devices.
42) Formats, including choice of HTML, EPUB, and MOBI "flavor", change rapidly, and if you think that you can just specify which flavor you want and that is all it takes, then you are wrong, because someone is going to want to reuse your effort in a different "flavor" and/or process it through a tool which assumes a different flavor. If your coding depends on the "flavor of the month club" then your coding will soon be broken, if not already.
43) The actually implemented "Unicode" code points vary widely from device to device and from code release to code release. If you need a code point, use it. If you don't need it, then why use it? That "optional" code point is probably something the device manufacturer decided was "optional" also.
44) Tables often don't work. Links inside of tables often don't work. Tables of more than a few columns will probably fail.
45) Horizontal scrolling doesn't work.
46) Pre without wrap doesn't work. See #45.
47) Frames don't work.
48) Colors don't work. Take a look on an old monochrome device to see what you are doing to the reader.
49) CSS: Don't specify that which you don't need to specify. Many devices come with a perfectly well designed set of CSS choices which actually work on that machine, until you go out of your way to break them.
50) CSS: Don't do "CSS resets." It doesn't work. See #49 above.
51) Don't use massive CSS "cookbooks." When something is found broken in your HTML, trying to track down what is broken within your massive CSS "cookbook" becomes prohibitive. Include in the CSS that which you actually use and need in this book.
What else?
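Several of these are mechanically detectable, which suggests one place tooling could help. Here is a rough sketch of an automated check for a handful of the items above; the patterns are illustrative only, not an agreed PG rule set:

    # Sketch: crude "formatto" lint over an HTML/CSS file for a few of the
    # items listed above. The pattern list is illustrative, not exhaustive.
    import re, sys

    CHECKS = [
        (r"\bbody\s*\{", "a body {...} rule is present (item 4)"),
        (r":\s*\d+(\.\d+)?(px|pt|pc|in|cm|mm)\b", "absolute units used (item 22)"),
        (r"\bfloat\s*:", "float used (item 24)"),
        (r"\bposition\s*:\s*absolute", "absolute placement used (item 23)"),
        (r"\bfont-family\s*:", "font family specified (item 10)"),
    ]

    source = open(sys.argv[1], encoding="utf-8", errors="replace").read()
    for pattern, message in CHECKS:
        if re.search(pattern, source, re.IGNORECASE):
            print("possible formatto:", message)

Run against a submission's .htm file it will not catch everything on the list, but it would flag some of the most common offenders before a human ever looks.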

Great discussion of what to (mainly) not do to the text in the privacy of your home. I'm interested in hearing how such an exercise helps improve the PG corpus. My assertion, to which you took exception, was
Another view is that we have a mess that defies improving texts;
I'm not sure which part of my assertion you are addressing.

On 1/29/2012 6:03 PM, Greg Newby wrote:
When you write of standalone version control systems, do you mean using svn/hg/git out of the box, and developing all the other software for our needs?
Yes, sort of. Even TRAC requires a stand-alone Subversion installation; TRAC just interfaces into it. CVS is probably sufficient to our needs, and one of your servers probably has CVS already installed on it (it tends to be part of a standard Linux distribution). If that's what you want to do, it's just a matter of setting up users, collecting the users' public SSH keys, building a CVS repository, etc. CVS is what I used for www.ebookcoop.net. If you think we would be moving to TRAC eventually, maybe a stand-alone installation of Subversion would work. If you, or others, think that a more distributed, high-maintenance, low-bandwidth solution is better, choose Git or Mercurial. I'm primarily a Windows workstation user and Tortoise makes clients for all four. I'm happy with whatever you pick, and I think most of the others would be as well. Don't pick RCS or VSS as they are not designed for concurrent access. When I said "sort of," I was referring specifically to the "developing all the other software" part. For page-at-a-time editing I think the MediaWiki engine could be adapted. For development wikis, defect tracking, and RSS feeds, we may want to adopt TRAC in the future. I'm a firm believer in not reinventing wheels. A Frankenstein's monster approach could work.

What I was imagining is something far different than what you guys are imagining. Here's some thoughts about what I am imagining: 1) I submit a book in txt and html form. PG generates epub and mobi versions. I am not happy with those versions, because they do not end up fairly representing what I submitted in html form. So I make my own "hand crafted" epub and mobi version, perhaps by fixing the mistakes in the generated epub and mobi versions, and PG also hosts these alternative "hand crafted" versions. 2) After a couple years I have "improved" my conceptualization of how I think one should write HTML code, and what I submitted back then now looks old and crufty even to me. I submit an updated HTML coding which I think will be more useful to people in the future, and PG hosts that alternative version. 3) Looking back at an ancient PG txt and an ancient PG HTML coding effort, I see things which in no way represent current PG best coding practices, which *have* improved over the years. I go back and make updated versions of these files in order to best represent current PG coding practices, and PG hosts these alternative versions. In summary, what with 30,000 plus texts, "crowdsourcing" to me means mainly having one person fixing up one text, rather than having 100 people busy fixing up *one* text. Now granted, if it's a popular text, then maybe a year or two after I tackle a text someone else might want to come along and polish it up further. Or maybe they want to rework my EPUB2 effort into an EPUB3 effort.... Now, god forbid, if Greg or the WW'ers find merit in any of these alternative versions, then maybe in the fullness of time PG decides to use them for the basis of making a new "official" version. Or they find bits that they like, and back-incorporate them.

Jim, I agree with this. Granted, an EPUB starts out as an HTML document but to make a good EPUB you really need to do tweaks that cannot be done automatically. I want the option to submit a hand crafted EPUB along with my HTML and TXT. I want it to be my own work, not the efforts of a bunch of people. Rather than have an HTML file that avoids features of HTML so it can make a decent EPUB, I want to have both HTML and EPUB be the best they can be. James Simmons On Fri, Jan 27, 2012 at 10:02 PM, Jim Adcock <jimad@msn.com> wrote:
What I was imagining is something far different than what you guys are imagining. Here's some thoughts about what I am imagining:
1) I submit a book in txt and html form. PG generates epub and mobi versions. I am not happy with those versions, because they do not end up fairly representing what I submitted in html form. So I make my own "hand crafted" epub and mobi version, perhaps by fixing the mistakes in the generated epub and mobi versions, and PG also hosts these alternative "hand crafted" versions.
2) After a couple years I have "improved" my conceptualization of how I think one should write HTML code, and what I submitted back then now looks old and crufty even to me. I submit an updated HTML coding which I think will be more useful to people in the future, and PG hosts that alternative version.
3) Looking back at an ancient PG txt and an ancient PG HTML coding effort, I see things which in no way represent current PG best coding practices, which *have* improved over the years. I go back and make updated versions of these files in order to best represent current PG coding practices, and PG hosts these alternative versions.
In summary, what with 30,000 plus texts, "crowdsourcing" to me means mainly having one person fixing up one text, rather than having 100 people busy fixing up *one* text. Now granted, if it's a popular text, then maybe a year or two after I tackle a text someone else might want to come along and polish it up further. Or maybe they want to rework my EPUB2 effort into an EPUB3 effort....
Now, god forbid, if Greg or the WW'ers find merit in any of these alternative versions, then maybe in the fullness of time PG decides to use them for the basis of making a new "official" version. Or they find bits that they like, and back incorporate them.

On 01/28/2012 02:53 PM, James Simmons wrote:
I agree with this. Granted, an EPUB starts out as an HTML document but to make a good EPUB you really need to do tweaks that cannot be done automatically.
What HTML? HTML 4, HTML 5, HTML 6? What EPUB? EPUB 2, EPUB 3, EPUB 4? This is so WYSIWYG, so pedestrian, so typewriter, so oblivious of the capacities of computers and the rapid change of ereading landscape. This is the complete opposite of having a master format.
I want the option to submit a hand crafted EPUB along with my HTML and TXT. I want it to be my own work, not the efforts of a bunch of people.
This is so 90's, so `me´-generation, so Girl Scout Merit Badge. I want *my* name all over the place, I want *my* bragging rights, I want nobody to mess with my work because I'm the paragon of evolution. This is the complete opposite of crowdsourcing.
Rather than have an HTML file that avoids features of HTML so it can make a decent EPUB, I want to have both HTML and EPUB be the best they can be.
This is so `best viewed with IE4´ and `best viewed with Netscape´. This is the complete opposite of future proof. -- Marcello Perathoner webmaster@gutenberg.org

This is the complete opposite of having a master format.
When "we" as a group cannot agree on even basic things such as whether to use xml style tagging <p> verses troff-style "implied markup via teletype-style visual formatting plus various escape clauses" and if "we" cannot even agree on the worthiness of keeping page numbers or not, and when "we" have 30,000+ old files which I don't see anyone volunteering to rewrite in "master format" not to mention put back in page numbers, then I don't see any chance of a "master format" working. I do see merit in making this suggestion -- if you are the person who seems themselves as being "the master." But even then the suggestion lacks merit because PG volunteers are not forced slaves, nor are they stamp-lickers. If they don't want to do it, they go do it elsewhere, or they go do something else.
This is so 90's, so `me´-generation, so Girl Scout Merit Badge. I want *my* name all over the place, I want *my* bragging rights, I want nobody to mess with my work because I'm the paragon of evolution.
This is simply excuse-making for failing to take a good hard look at the epub and mobi which PG is currently providing customers, the great majority of which files are barely readable on those small machines, to put it kindly, whereas other organizations ARE providing high quality highly readable epub and mobi files to their customers, and are doing it from PG source files, and are often doing so without spending a huge amount of time fixing up those PG files. If you and/or the WW'ers can do it, then why pray tell have you not been doing it? The truth is PG is having its own version of the problems DP has been having: namely a small number of old-timer insiders are more interested in having the power of being "the dogs guarding the straw" rather than providing real books for real customers to really enjoy reading.

Marcello, I own both a Nook and a Kindle. You can make an attractive and usable book for either or both, but it won't take advantage of everything you can do with HTML and style sheets because those devices support a limited subset of HTML. When I make an HTML document I want to make it look like the original book. If I have that then I can reissue a book using a print on demand service like Lulu. As far as HTML is concerned, I just want the basic structural tags with decent style sheet support. Nothing fancy. A web page should look good on a full size screen or a printed page. An EPUB needs to look good on a small device. It is natural to start out with a good looking web page and then tweak it to look good on a Kindle. It only takes a few hours to do, but it isn't something you can just do automatically. You probably could do some kind of simple HTML that would look decent on both, but to make full use of each platform you need to tweak by hand. "Crowdsourcing" is one of those things that sounds good in theory but doesn't work in practice. I have worked on books for FLOSS Manuals which are a good example of crowdsourcing. One of the manuals I wrote was translated into Spanish by a team of South American volunteers. In all cases we made sure that there was a way to identify who did what work so that person got credit. Wanting credit is not wrong. Wanting to take control over what an entire book looks like is not wrong. It's the way the world works. Having the OPTION of submitting a hand crafted EPUB in addition to my web page gives me a way of delivering a quality product on every platform. If you do a good EPUB you can generate a good MOBI so there is no need to submit those separately. As things currently stand, I do my best on the two documents I submit, but what most readers will download will NOT be my best, and in some cases will look downright sloppy. James Simmons On Sat, Jan 28, 2012 at 9:10 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 01/28/2012 02:53 PM, James Simmons wrote:
I agree with this. Granted, an EPUB starts out as an HTML document but to make a good EPUB you really need to do tweaks that cannot be done automatically.
What HTML? HTML 4, HTML 5, HTML 6?
What EPUB? EPUB 2, EPUB 3, EPUB 4?
This is so WYSIWYG, so pedestrian, so typewriter, so oblivious of the capacities of computers and the rapid change of ereading landscape.
This is the complete opposite of having a master format.
I want the option to submit a hand crafted EPUB along with my HTML and TXT. I want it to be my own work, not the efforts of a bunch of people.
This is so 90's, so `me´-generation, so Girl Scout Merit Badge. I want *my* name all over the place, I want *my* bragging rights, I want nobody to mess with my work because I'm the paragon of evolution.
This is the complete opposite of crowdsourcing.
Rather than have an HTML file that avoids features of HTML so it can make a decent EPUB, I want to have both HTML and EPUB be the best they can be.
This is so `best viewed with IE4´ and `best viewed with Netscape´.
This is the complete opposite of future proof.
-- Marcello Perathoner webmaster@gutenberg.org

If you do a good EPUB you can generate a good MOBI so there is no need to submit those separately.
Not sure I understand the claim being made here. If you are claiming that given a good EPUB you can just run that through Kindlegen and generate a good MOBI, then that is certainly not true, and the existing PG EPUB and MOBI book postings give many examples of this. There are many PG EPUB generations that end up looking pretty good, but the MOBI, which is more-or-less generated from the EPUB [1], does not look good. Conversely, there are some cases where PG generates an EPUB and then generates the MOBI [1] from that, and it ends up the MOBI looks good, but the EPUB does not. [1] Not exactly true, since Marcello makes a somewhat-custom version of the EPUB used to generate the MOBI from that version of EPUB via Kindlegen, but the simplified principle expressed is more-or-less the same. Now the reality of my "work-chain" -- which essentially consists of trying to push a rope from the wrong end -- is that I try to create a good MOBI file, but conceptually I do that by trying to understand how Kindlegen creates a MOBI from an EPUB, and I try to understand how Marcello tweaks the EPUB that is generated specifically to be converted into MOBI so that I can create a good MOBI, but to generate a good EPUB I have to understand how Marcello creates an EPUB from an HTML so that I can create a good HTML to make a good EPUB to make a good MOBI. And I test all this stuff before I send it in to PG with my own local copy of the sausagemaker software. And then I send my stuff into PG, get yelled at for a while until someone at PG actually decides they might actually want the book, PG posts it, and then surprise, what PG posts is not entirely what I expected from my own local tests. Why try to make a good MOBI in the first place? Because in my experience if I can get the far end of sausagemaker to "work" then the intermediate stage of EPUB and the primary stage of HTML is pretty easy to get to work. And the txt70? God only knows. That part really doesn't fit into my "work-chain." I make it last, and extremely reluctantly, as a reluctant precondition to submission. And because neither I nor anyone else apparently has any decent tools to make the txt70 format. And then the WW'ers howl because they still pray at the altar of txt70. But I am so not there.
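For what it's worth, the local "test before submitting" step can be partly scripted. A sketch, assuming epubcheck and kindlegen are installed locally (the jar path and sample file name are placeholders); kindlegen's exit code is printed rather than used for pass/fail, since its warning behavior varies between versions:

    # Sketch: run a candidate EPUB through epubcheck and kindlegen locally.
    import subprocess

    def test_epub(epub_path):
        # validate the EPUB container and content
        subprocess.run(["java", "-jar", "epubcheck.jar", epub_path], check=True)
        # convert to MOBI; inspect the log and the resulting .mobi by hand
        result = subprocess.run(["kindlegen", epub_path])
        print("kindlegen exit status:", result.returncode)

    test_epub("12345-epub.epub")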

On 01/29/2012 04:18 AM, James Simmons wrote:
When I make an HTML document I want to make it look like the original book.
Then you can save yourself a ton of work and just use the page scans at IA. The idea behind a reflowable format is to make it indeed: reflow. So even if your HTML looks `like the original book,´ all the user has to do is hit Ctrl + or make the browser window smaller and it will look nothing like the original book. -- Marcello Perathoner webmaster@gutenberg.org

On Fri, Jan 27, 2012 at 08:02:16PM -0800, Jim Adcock wrote:
What I was imagining is something far different than what you guys are imagining. Here's some thoughts about what I am imagining:
This is all quite consistent with my view of the process, too. Not a page at a time, but a book at a time. Or, possibly, a file at a time. Note that we already do this type of thing, but with the WWers as ultimate gatekeepers. The idea is to make it much easier for updated items (books, files, formats) to become available, but with a different approach to the gatekeeping. On the last point:
merit in any of these alternative versions, then maybe in the fullness of time PG decides to use them for the basis of making a new "official" version. Or they find bits that they like, and back incorporate them.
We do this all the time with text and HTML, see our "errata" procedure. We don't do it for derivative formats (i.e., epub & mobi). But it's not scalable, and unfriendly to small incremental improvements. -- Greg
1) I submit a book in txt and html form. PG generates epub and mobi versions. I am not happy with those versions, because they do not end up fairly representing what I submitted in html form. So I make my own "hand crafted" epub and mobi version, perhaps by fixing the mistakes in the generated epub and mobi versions, and PG also hosts these alternative "hand crafted" versions.
2) After a couple years I have "improved" my conceptualization of how I think one should write HTML code, and what I submitted back then now looks old and crufty even to me. I submit an updated HTML coding which I think will be more useful to people in the future, and PG hosts that alternative version.
3) Looking back at an ancient PG txt and an ancient PG HTML coding effort, I see things which in no way represent current PG best coding practices, which *have* improved over the years. I go back and make updated versions of these files in order to best represent current PG coding practices, and PG hosts these alternative versions.
In summary, what with 30,000 plus texts, "crowdsourcing" to me means mainly having one person fixing up one text, rather than having 100 people busy fixing up *one* text. Now granted, if it's a popular text, then maybe a year or two after I tackle a text someone else might want to come along and polish it up further. Or maybe they want to rework my EPUB2 effort into an EPUB3 effort....
Now, god forbid, if Greg or the WW'ers find merit in any of these alternative versions, then maybe in the fullness of time PG decides to use them for the basis of making a new "official" version. Or they find bits that they like, and back incorporate them.
participants (12)
- Al Haines
- dakretz@gmail.com
- don kretz
- Greg Newby
- James Adcock
- James Simmons
- Jana Srna
- Jim Adcock
- Jimmy O'Regan
- Keith J. Schultz
- Lee Passey
- Marcello Perathoner