
Bowerbird said:
i don't want them to think of this as "the formal proofing process".
i believe we need to put the books in front of them _to_read_...
Which is exactly why a book that is ready for release to PG should not force the original scan upon the reader. When readers spot something they think is an error, they should be able to call up the scan to check whether it is an actual error before reporting it. Otherwise the scan distracts from their reading. A proper interface would let users select what they want to see; an interface should be about options. In any case, I have no desire to host an archive that duplicates the content of PG and others. The purpose in coming to my project site would be to prepare books for release to PG and to polish released books so that they can be read for leisure and pleasure off of _my_ bandwidth.
I think calling it a wiki makes it easier for quick comprehension rather than explaining the whole process of storage, editing, etc.
we should never _have_to_ "explain the whole process" to the public. for them, it should be as easy as clicking a button to say "check this!" they shouldn't even need to explain the error. (we can find it, right?)
Of course, but it is hard to develop it and request assistance with the development of it without explaining the basic concepts, etc. A project must be explained and defined or it cannot be realized.
Rather than working an editing interface into an out-of-the-box wiki system
we don't need to make "an editing interface"...
It's rather hard for people to make/submit corrections without having an interface for doing so.
Has anyone ever just asked google if they could use the scans and OCR text for PG?
michael had a meeting with google when they were just starting out...
he reported that they treated him rather rudely, hustling them out of their famous cafeteria before he had even finished his lunch. it's sad, but i don't think they really appreciate project gutenberg very much...
That is too bad.
XML makes a nice standard for storage and conversion
it's obtuse, and unnecessarily obstructive to the text itself...
It's also one of many methods for the storage of 'intelligent' data and is easy to do. Once put in place, it would never be seen again by anyone except someone correcting the markup (which one hopes would be rare). InDesign uses markup, MS Word uses markup, and on and on. I don't think the users of this software find the markup obstructive to the text itself.
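To make the "never be seen again" point concrete, here is a minimal sketch of what I have in mind, written in Python with its standard xml.etree.ElementTree module. The element names (book, chapter, title, p, emph) and the sample sentence are invented for this illustration; they are not any PG or DP schema. The stored form carries the markup, and the reader only ever gets the text that falls out of the conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored form of a tiny book. The tag names are made up
# for this example and do not follow any particular schema.
STORED = (
    '<book><chapter n="1"><title>Chapter One</title>'
    '<p>It was a <emph>very</emph> dark night.</p></chapter></book>'
)

def to_plain_text(xml_source: str) -> str:
    """Return the reader-facing text with all markup removed."""
    root = ET.fromstring(xml_source)
    lines = []
    for element in root.iter():
        # Only block-level elements become lines of output.
        if element.tag in ("title", "p"):
            # itertext() yields the element's text and that of its
            # children, so inline markup such as <emph> disappears.
            lines.append("".join(element.itertext()))
    return "\n".join(lines)

if __name__ == "__main__":
    print(to_plain_text(STORED))
```

The same stored file could feed other converters (HTML, for instance) without the text itself ever being retyped, which is the whole attraction of keeping the markup in storage.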
My only protest is that I believe a human proofing stage is still a good choice. [snip]
i can't think of one single reason for a word-by-word proofing, let alone two or three rounds of it. it's simply not efficient, and anybody who explores the options can learn that for themselves.
I don't believe I said anything about multiple rounds of proofing, etc. I believe I discussed the main processing of a document being done through human interaction with scripts and then people actually looking at the results of the process before the output is released (proofing). I feel that a human looking at a smaller subset of a large document is a good thing in the error finding process. You apparently do not think it is. Neither of us is right or wrong: It is a matter of perspective and opinion.
between aggressive clean-up of o.c.r., and the comparison method, and "smoothreading" by people from the general public, we're good.
Yes and no. It depends on your definition of 'good.'
Otherwise, I would say to have two people run the same correction process, diff the results and then present the mis-matches to a third set of eyes.
close... but it's unnecessary for "people" to do the digitizations. two sets of cleaned-up o.c.r., diffed against each other, is enough...
That depends on a lot of factors including the assumption that two OCR programs would not make the same mistake and that, if they did, it would be a mistake that could be caught by another process. Etc.
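For what it is worth, the comparison step itself is simple to sketch. The following is a rough illustration only, using Python's standard difflib module; the function name mismatches and the sample strings are my own inventions for the example, not part of any existing tool. It takes two independently cleaned-up versions of the same text, compares them word by word, and returns only the places where they disagree, which is the list a third set of eyes would review.

```python
import difflib

def mismatches(text_a: str, text_b: str):
    """Return (word_index_in_a, words_a, words_b) for each disagreement."""
    a = text_a.split()
    b = text_b.split()
    # autojunk=False turns off the heuristic that ignores very common
    # items in long sequences, so frequent words are still aligned.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    found = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            found.append((i1, a[i1:i2], b[j1:j2]))
    return found

if __name__ == "__main__":
    # Two hypothetical clean-up results for the same line of a scan.
    version_a = "It was the best of tirnes"
    version_b = "It was the hest of times"
    for position, words_a, words_b in mismatches(version_a, version_b):
        print(position, words_a, words_b)
```

Note what this cannot do: if both versions contain the same wrong word, the comparison reports nothing at that spot, so the error passes through unflagged. That is the assumption I am questioning above.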
(but yes, it's necessary to have a sentient human doing the clean-up.)
Exactly. And, if that human makes a major error, the entire text is now a mess. If two people do the same process it reduces, but does not eliminate, the risk for 'the human error factor.'
o.c.r. clean-up is a fast-paced, active, and energetic task. which does _not_ mean that it's "harder". just more fun...
People actually do enjoy proofing. You do not. Again, neither is wrong in their viewpoint. Volunteers should be involved in a manner that gives them pleasure and satisfaction, etc. I say to give people tools and options and let them be the judge of what is best for them.
i get the impression you think these clean-up tools are just a bunch of scripts that people turn on, and the results fall out. but that's not anything like what actually goes on.
Since I am planning to write scripts for such an environment, one would hope that I do not "think the results will just fall out." I think much of this will be very difficult to program and that the software must be capable of 'learning' and that it will require a human to interact with it so that the job can be done well and the software upgraded to do the job even better in future. I actually wrote several scripts like this many years ago. They are archived on my CDs, but my manifest was damaged in a fire, so I am not sure where to even begin to retrieve them. It would be faster and easier to just start from scratch than to dig up outdated software anyway, so I do not regard it as a great loss.
I'm not sure I'm ready to program an AI engine capable of "listening to the book."
the software can't listen to the book. _you_ have to do that.
I pretty much said that in what you snipped. :)
Which is why I do not believe in leaving people out of the process
again, you have a profound misunderstanding of the process.
I'll just laugh and pretend you didn't say that.... Perhaps I have a profound misunderstanding of _your_ proposed process, but I have a very firm understanding of my own. There is no _the_ process for me to understand. In any case, if you would like to create your own project, feel free. "The road is free to all." And Michael and PG would be happy to receive the texts submitted through any process. And I would be happy to receive constructive feedback on my efforts as well from any source that cares to give it.
I am just interested in creating a method for people to complete initial editions of books rapidly, to avoid putting features in place that 'waste' the time of volunteers in doing tasks that could be done more rapidly and easily through the use of software, and to have an environment that offers the opportunity to correct any errors that were missed in the initial release. Also, it should offer a sense of community for volunteers to interact. The _process_ is not a complex one by any means. As a human, I expect that my project process will not be perfect, nor will it agree with the opinions of all others on how it _should_ be done, but it will be a process that produces books. So, it will be part of the solution no matter how many problems it has. :)
Nor do I feel that the process at DP is flawed. They are doing the process the way they have chosen to do it, and it is a project that actually exists and is productive. They simply created a bottleneck in their workflow that they need to deal with. I actually hope that some of my tools are of assistance to those who use DP to process books. I am about getting books into PG and care very little how one goes about doing it so long as one _is_ going about doing it. :)
Carel