Re: [gutvol-d] blah blah blah blah blah
jon said:
I'm up for this, but I would suggest that we use a scan that Greg nominates and uploads to PG as a master scan.
if greg finds some scan-set other than those 3, then i will be happy to upload that one as well...
That way the result will be objectively better than the extant text as it will be a version of the master scan rather than a version of an unknown edition,
p.g. doesn't consider its current e-texts to be inferior just because they come from "an unknown edition". on the contrary, p.g. often tries to cast this as a "benefit". your mileage might or might not vary...
and there will be no excuse for not replacing the old version with the new.
unfortunately, you're thinking inside _your_ head instead of inside the whitewashers' heads. the whitewashers will not replace the _current_ version of an e-text with one that you submit... (so your "old" and "new" terminology doesn't fit.) they'll assign your new text a brand-new number, and it will start with zero downloads, and thus it will automatically be at a _distinct_disadvantage_.

their thinking is: why should your version "benefit" from the downloads accrued by the earlier version, done by a pioneering volunteer? and you can't argue that that is unreasonable. (and even if you think you can, you will not be the judge as to what constitutes "reasonable"; the whitewashers will be making that decision. and you can bet that they will find _their_ logic to be much more compelling than _your_ logic.)

the whitewashers make an exception to this rule when _they_ update their _own_ submissions, but you would expect them to do that, wouldn't you? now, maybe since you've made a stink here, they will also grant an exception to your project, but if i were you, i'd wanna get the promise in public.

otherwise, just get used to the fact that your work will largely be ignored. if you're ok with that, great. you will still know, underneath, that you did a good thing.
I would also like to do a LaTeX version, so will need a stage before you add your formatting -- it doesn't have to be my RTT, but the lines need to correspond.
i'm just mounting the projects. somebody else will have to clean the text. and add the formatting. i mean, i _might_ do some of that stuff, but i'm not making any guarantees that i _will_... but i doubt i will. i have bigger fish to fry...

also, i think it's silly not to add the formatting, so you'll have to _remove_ any that i've added. (but that would strike me as immensely stupid, because then you'll just have to add it back in.)

and, as for the lines needing to correspond, that is an acceptable format for a z.m.l. file, and it has been a thing for which i've agitated for a good many years now. just so you know. i think rewrapping lines is a stupid bad idea...

***

the scans are up now, for that first mounted text, meaning we can now make the outline of a system.
adjust your browser's window-size and zoom level until you get your best presentation of the display...

-bowerbird

p.s. another interface should be created to _diff_ this o.c.r. against a clean text to correct the errors. i might do it, if i think it'll be fun, and i have time; but i invite anyone else who wants to play to build it. if you do so, here's a clean text that you can use:
On 2012-09-25, Bowerbird@aol.com wrote:
I would also like to do a LaTeX version, so will need a stage before you add your formatting -- it doesn't have to be my RTT, but the lines need to correspond.
i'm just mounting the projects.
somebody else will have to clean the text.
and add the formatting.
i mean, i _might_ do some of that stuff, but i'm not making any guarantees that i _will_...
but i doubt i will. i have bigger fish to fry...
also, i think it's silly not to add the formatting, so you'll have to _remove_ any that i've added. (but that would strike me as immensely stupid, because then you'll just have to add it back in.)
So here we have a demonstration of the exact set of problems that I was trying to solve. You don't want to clean the text, due to your big fish frying activities. I don't particularly want to spend my (admittedly fishless) limited time cleaning the text either. Pretty much everyone on this list doesn't want to clean text that is not for their own solo project.

There are 1000-odd people who will be glad to clean text for you and who know what they are doing, but they live at PG, and if you want to play there, you have to play by their rules. Greg has a hose full of people that will clean text for you, but if he turns it your way, you will have to train them first, and in your case you will not just have to train them to check that the word in the text is the same as the word in the image; you will have to train them in all things ZML as well.

I don't want to start from your ZML. I'm sure ZML is lovely, but I want to work from the corrected text and images, where the number of introduced errors is lowest. That is my preferred workflow, and I'm not prepared to compromise on it. Consider it a religious thing that I am going to be utterly unreasonable about, and that nothing you say will ever change my mind on. A workflow can go from RTT to ZML very easily, but the reverse is not necessarily the case. That was the point of the RTT. If you mandate that what comes out of your system is ZML and only ZML, then I'm not interested in working within your system.

Finally, unless I am reasonably convinced that something useful will come of it, I am not going to proofread word 1. Useful means default PG download, and that, at the very least, means a preferred PG edition and the existence of some sort of default download protocol. Posting a small selection of books on your own site, while laudable, means pretty much nothing at all in the grand scheme of things. So...
We agree on the necessity of master scans, and we agree that a default download protocol is a requirement before it is worth commencing any work. Let's narrow our focus to these two seemingly simple items. With these in place, people can roll up their sleeves. Maybe some will do redos as solo projects, maybe some will do them through your system, maybe one of the DPs can help make some RTTs. Without them, it is simply not worth starting anything... which, you may recall, was what I was trying to determine right at the start of all this dialogue, and was more or less the first piece of advice you gave me.

Cheers

Jon
Jon, I feel overwhelmed by the RTT discussion at this point. I wonder if this paragraph summarizes the basics of what has been proposed:

The main task is that PG would host a master scan particular to one preferred edition. Along with that master scan would be a text-based version, the RTT, marked up in some currently undefined way, that represents a starting point for anyone who wants to produce a book in any media using any further tools or markup they choose. A related task is that DP would be involved in making the RTT. Also, tools would be developed to take the RTT to various output formats.

If that's the heart of it, then at least I understand the goal. If I have it wrong, please correct me. I'm still unclear on the RTT and how it can hold all the information of the original, such as italics, superscripts, etc., but this seems less important than the logistics questions:

1. Some have warned that there are at least two miracles in the critical path for this project: one is buy-in from DP/Louise that DP would generate the RTT. There is a sense that DP would not join this work, which is why there is a discussion, I think, of other proofing tools that presumably would be hosted elsewhere and used by other volunteers.

2. The other unlikely critical event is buy-in from PG/Marcello to host/feature the RTT and well-formatted final versions based on that RTT. There is a sense that PG would not join this work, which is why there is a discussion, I think, of having the final versions, derived from the RTT, hosted on other sites.

3. Assuming PG would at least host the RTT, I think the plan included the WWers only having to approve the RTT as it came to them. Are they prepared to do that and relinquish control of the final formatted versions? Is this what they want? Have they been asked?
It seems to me that the likely result, if 1-3 were to be taken to their probable conclusions, would be not using DP to do the work and not using PG to host the RTT-derived final versions. Is that the direction this proposal is going?

--Roger
On 2012-09-26, Roger wrote:
The main task is that PG would host a master scan particular to one preferred edition. Along with that master scan would be a text-based version, the RTT, marked up in some currently undefined way, ...
From my perspective, the proposal has evolved to this:
1. Develop a methodology for nominating master scans where no record of the provenance of the extant text exists. This is PG policy creation and above the pay grade of any individual volunteer. If this is not in place, nothing further can be achieved.

2. Source master scans and upload to PG.

3. Run master scans through DP until P2, using the LOTE non-clothing exception. Capture the P2 output and diff it against the extant text to produce the RTT. Note that if the P2 output is perfect, the RTT _is_ the P2 output.

4. Diff the RTT against the extant text to produce a comprehensive errata list, and deliver it to the WWs so they can update the e-texts.

And that, for the moment, is the limit of my ambition. As Bowerbird points out, I have more than likely just described an impossible scenario, even though from a technical point of view it is completely trivial.

Now that it has been explained to me, I understand that final formatted versions uploaded to PG _will_ be buried, and are therefore pointless. A change of policy here is not possible. In the unlikely event we reach step 4, I suggest we harvest the MS and RTT and work up final versions at FadedPage. The 1000 most popular ebooks in demonstrably higher quality has a small chance of becoming consequential, and that would be enough for me.

Step 1 requires "someone else with sufficient influence" to drive it: no single volunteer can. It is therefore unlikely to happen. Nothing else can happen until it does. Therefore I am reverting to lurk mode.

Step 3 requires DP buy-in. What I am talking about are bog-standard DP projects that happen to terminate after P2. Louise will very likely ignore any request, and if pushed will block it without explanation. She will do this even though it would help balance the rounds and would likely help with retention. I personally do not believe it is worth commencing step 3 without DP support.

For the moment, then, I lurk.

Cheers

Jon
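Step 4 above is mechanically simple: line up the extant PG text against the proofed RTT and report every changed reading. Here is a minimal sketch in Python of what such an errata generator might look like; the function name, the errata format, and the two-line sample texts are all hypothetical, and a real run would of course operate on whole books, not two lines.

```python
import difflib

def errata_from_diff(extant_lines, rtt_lines):
    """Pair each changed line of the extant text with its RTT
    reading, producing an errata list of old -> new readings."""
    errata = []
    matcher = difflib.SequenceMatcher(None, extant_lines, rtt_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            pairs = zip(extant_lines[i1:i2], rtt_lines[j1:j2])
            for k, (old, new) in enumerate(pairs):
                errata.append(f'line {i1 + k + 1}: "{old}" -> "{new}"')
    return errata

# Hypothetical sample: one OCR-style error in the extant text.
extant = ["It was a dark and storny night.", "The rain fell in torrents."]
rtt    = ["It was a dark and stormy night.", "The rain fell in torrents."]
print(errata_from_diff(extant, rtt))
```

A report in this shape (line number, old reading, new reading) is the sort of thing the WWs could act on directly, while leaving the edition-difference-versus-error judgement call to them.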
On Wed, Sep 26, 2012 at 06:06:25PM +0100, Jon Hurst wrote:
On 2012-09-26, Roger wrote:
The main task is that PG would host a master scan particular to one preferred edition. Along with that master scan would be a text-based version, the RTT, marked up in some currently undefined way, ...
As mentioned, it is already FullySupportedAndEncouraged that scans be provided with eBooks. DP generally doesn't do this, but they have the scans archived (not always immediately available). I counted over 7000 "page-images" subdirectories in the PG collection, though I don't know how many are actually complete scan sets. In other words: There ARE master scans available for a number of eBooks. I recommend getting the "ls-lR" file from ftp://ftp.ibiblio.org/pub/docs/gutenberg/ls-lR to choose a title that seems to have a complete scan set.
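Greg's count of "page-images" subdirectories can be reproduced from the ls-lR file itself. A minimal sketch, assuming the standard `ls -lR` convention that each directory section begins with the directory's path followed by a colon; the sample fragment and its paths are invented for illustration, not taken from the real listing.

```python
def count_page_image_dirs(lslr_lines):
    """Count directory headers naming a page-images subdirectory.
    In `ls -lR` output, each directory section starts with a line
    like `./etext05/12345/page-images:` -- a path ending in a colon."""
    return sum(1 for line in lslr_lines
               if line.rstrip().endswith("page-images:"))

# Hypothetical fragment of an ls-lR listing.
sample = [
    "./etext05/12345:",
    "total 2",
    "drwxr-xr-x 2 pg pg 4096 page-images",
    "./etext05/12345/page-images:",
    "-rw-r--r-- 1 pg pg 81234 001.png",
    "./etext06/67890/page-images:",
]
print(count_page_image_dirs(sample))  # prints 2
```

Note this counts directories, not complete scan sets; as Greg says, a page-images directory may be present but incomplete, so a count like this is only an upper bound on usable masters.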
From my perspective, the proposal has evolved to this:
1. Develop a methodology for nominating master scans where no record of the provenance of the extant text exists. This is PG policy creation and above the pay grade of any individual volunteer. If this is not in place, nothing further can be achieved.
This is two separate topics:

1) For many items, especially older items (pre #5000 or so), it will be challenging or impossible to identify the scan set. We didn't have a procedure to receive the scans, and didn't have the space to save them.

2) PG does not enforce adherence to a particular master (scan set, dead trees, etc.). So, for a title where there IS a scan set, there may be inconsistencies which are not errors. Already discussed.

My suggestion has been to start from scratch with a new scan set, rather than trying to fix existing eBooks. It's OK to choose a title we already have. I'm going to visit B&N today to see whether they might have a suitable dead-trees edition for this purpose.

-- Greg
On 2012-09-26, Greg Newby wrote:
As mentioned, it is already FullySupportedAndEncouraged that scans be provided with eBooks. DP generally doesn't do this, but they have the scans archived (not always immediately available).
My suggestion has been to start from scratch with a new scan set, rather than trying to fix existing eBooks. It's OK to choose a title we already have. I'm going to visit B&N today to see whether they might have a suitable dead trees edition for this purpose.
Thanks Greg, I think I see where you are coming from now. Sorry for being a bit dense and despondent. So... the two "facts of PG life" that affect us are:

1) Replacing an extant project is not realistically possible, but doing the same title twice is fine. You will be assigned a new number, and you will have first-mover prerogative. You therefore get to pick an edition, and it is actively encouraged that you include the scan set.

2) Your new version _will_ be buried by the current version. Therefore, to make a difference in the short term you need to improve the current version. The only approved method for improving current versions is to send errata to the WWs.

Fact 2 means that the only possible way to make a difference in the short term is to generate comprehensive errata and give it to the WWs. Anything else requires policy change. Nothing is forcing the WWs to do anything with the errata you send them, of course, but it would be reasonable to suppose that they would generally welcome it in the same way that they would welcome errata from any user, and it would also be reasonable to suppose that an improvement in the text would be a likely outcome.

To generate comprehensive errata, we generate a clean text and diff it against the extant text. What is an edition difference and what is an error is the WWs' prerogative -- if the version that they are working on has no provenance, then it's a judgement call that only they have the right to make.

We can generate the clean text in a new project by selecting and publishing a scan set and using DP to create an RTT. Assuming an RTT is an acceptable submission format, all this is, as far as I understand it, bog-standard PG procedure and doesn't require anyone's permission. So we would have done some work, and it may have resulted in _some_ of the improvements we would like to see. Chalk that up as a win. Our version, even with its scan set, accuracy and provenance, will, of course, be buried.
Most normal users will never see it, and therefore there is no point doing final formatted versions for PG. There may, however, be a point in using them to do final versions for FadedPage. FadedPage is buried by Google rather than by PG, and there is a slim chance that a large selection of superior-quality texts would make a difference -- it is not against Google policy for FadedPage to bubble up, after all.

Cheers

Jon
Fact 2 means that the only possible way to make a difference in the short term is to generate comprehensive errata and give it to the WWs.
The errata one submits can in practice only address textual errors. Formatting errors cannot be fixed. Bad "engineering design" decisions cannot be fixed. Many of the earlier texts have bad formatting errors and bad "engineering design" errors -- since how and what to code into HTML had not been well thought out back then (if it is even now).
It's OK to choose a title we already have.
Tell that to your webmaster and WW'ers, who are sending a different message. Again, it is not always trivial for volunteers to provide images, as sometimes the organizations making the images available, rightly or wrongly, claim to put restrictions on the use of those scans.

Clearly the text of a "risen to the public domain" work is something PG can accept and post, and the text input of a "risen to the public domain" work is certainly something a volunteer can make and send to PG, so I think that part is well understood. What seems to be less well understood is that providing the actual images could in some cases prove to be more troublesome, in practice, for volunteers.

When images can be found in some public location, like archive.org, or are available in multiple places, I usually provide that information as part of the copyright clearance request. If PG thinks that getting and storing a copy of those resources is appropriate, PG should do so.
A change of policy here is not possible.
What has been shown in the past to work is to make enough substantial progress in a different forum that PG starts to feel its turf threatened. Then the high priesthood makes just enough change, not more, to address that perceived threat.
participants (5)
- Bowerbird@aol.com
- Greg Newby
- James Adcock
- Jon Hurst
- Roger Frank