
Bowerbird:
carel said:
You don't need to talk to me about having your time wasted. I am the somewhat infamous Carel who left DP nearly a decade ago (after spending around 500 hours creating a front-end and revamping the horrific code that resided on the back-end).
oh, hey, glad to meet you. :+)
i do believe i am the one who has grabbed the "infamous" crown at this point in time, however. i see they still let you post there. myself, i've been banished, so as not to corrupt the youth there.
LOL. It's nice to meet you as well. I've read your posts for many years, but I don't believe we have ever directly 'spoken.'
DP has managed to produce a huge number of etexts
and they could have produced 3 times as many if their workflow were 3 times more efficient, something it could _easily_ attain... according to brewster, internet archive scans 1,000 books a day. (google probably scans twice that many before their lunch break.) d.p. digitizes 2400 books a _year_. i predict they cannot keep up.
I'm not sure the most efficient system on earth could keep up with the scan rate. No matter how much you automate pre- and post-processing and formatting, it comes down to the speed of humans when it comes to proofing. And if you speed things up by skipping manual proofing and using automation completely for the first release, then you will have text little or no better than what google already has. I realize that is not what you are proposing, but what google has done is the most efficient approach, with little regard for the quality that human interaction could add. Wikimedia then takes this further by adding the ability for humans to modify the text in a never-ending cycle.
Their etexts also have a decent level of quality.
i'm with michael hart on this. (even if he refuses to see that.)
i think the first mission is to get text up with acceptable quality. (which, by the way, most of the in-process stuff at d.p. is _not_)
and _then_ we can use the _public_ to drive it toward perfection.
I am very much with you on this one. However, it rather duplicates the efforts of the wikimedia foundation (at least the last time I looked into their efforts), so I am not sure of the 'need' for it. In what manner do you suggest this would be differentiated from the current wiki book editing process at wikimedia, enough to justify creating the project for the purpose of PG? I ask because I am considering giving it a try if I can find a single reason to do so that would not make me feel as though I am reinventing the wheel. I have not been able to find an acceptable reason on my own. I love PG and promote it to all who will listen, but love alone is not enough: there must be logic as well. :)

For that matter, it wouldn't be that hard to cull the data from google books, cut their 'plain text' into pages (and/or re-OCR and diff), and just slap the result into a wiki (which is about what it looks like wikimedia has done). I believe the google use license allows this for non-commercial entities such as PG? Then:

- Someone verifies the google 'plain text' against the scans, typing in missing text and annotating content that could not be processed by the OCR (such as inline images).
- Tidy it up (scannos, etc.) and add some assumed markup for lines, paragraphs, pages, etc. using scripts with no human interaction.
- Have a human-assisted 'spell check' fix the majority of remaining errors, end-of-line hyphenation issues, etc., and add some basic markup for italics and whatnot (although the google OCR text appears to already contain this).
- Run the text through an 'X number of people say this page matches the scan' proofing process (a roundless process).
- Have a human-assisted markup process wrap the content in properly formed (but easy) XML.
- Convert the document to plain text (and any other formats that are desired).
On a 300-page work of uncomplicated fiction, depending on the scan quality and the speed of the humans doing the interactive processes, I would estimate this would take about 2-6 hours _plus_ the proofing time to create a document ready for first release and wiki storage. Then you let it sit in a wiki for eternity, being edited into perfection (or until a very large X number of people again say a page is perfect; create a final edition when all pages in a book pass this test, post the output, archive the XML version and scans, and consider the book done for all time - or at least until a PG reader reports an error that must be checked). For user-supplied scans, same process, with OCR done on the server (and, if desired, a diff against the scan submitter's OCR, if supplied).

I haven't thought about the process in a while, and the new press for speed cuts out some of my original ideas as I recall them. I'm probably completely forgetting a topic or two, but this would pretty much work (after a few logic tweaks) to accelerate the release process and should retain some quality factors in the first release. As you know, ultimately, in regards to quality, you will always have to rely on the quality judgements of the individuals working on a page or project to determine the quality of the final output. Quality is a human concept, so there is no way to automate it. I wish there were. :) I am very interested in your text-processing scripts, though, and always have been. I just haven't been processing any OCR text for a while.
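[Editor's note: the "re-OCR and diff" step described above can be sketched in a few lines. This is a minimal illustration, not anything from the thread: it assumes two OCR passes of the same page are available as plain-text strings, and the function name and line-pairing heuristic are my own. Lines where the two passes agree are trusted; only disagreements get flagged for human attention, which is the labor-saving point of the step.]

```python
import difflib

def flag_ocr_disagreements(pass_a: str, pass_b: str) -> list[tuple[int, str, str]]:
    """Compare two OCR passes of the same page line by line.

    Returns (line_number, text_a, text_b) tuples for lines where the
    passes disagree, so only those lines need a human proofer.
    """
    lines_a = pass_a.splitlines()
    lines_b = pass_b.splitlines()
    flagged = []
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # both OCR engines agree; trust the line
        # Pair up the differing lines; pad the shorter side with "".
        span = max(i2 - i1, j2 - j1)
        for k in range(span):
            a = lines_a[i1 + k] if i1 + k < i2 else ""
            b = lines_b[j1 + k] if j1 + k < j2 else ""
            flagged.append((i1 + k + 1, a, b))
    return flagged
```

For example, two passes that differ only in a "tirnes"/"times" scanno would yield a single flagged tuple for that line, and the rest of the page would skip straight to the later cleanup stages.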
If the volunteers at DP are blissfully unaware that their time is being wasted then they do not feel like victims of the process
but they _are_ victims, whether they _feel_like_ victims or not...
They are not victims in the sense that an abused child is abused whether the child realizes it or not. We are talking about proofreading, and about wasting manual hours on tasks that could have been done (or assisted) by automation. I may lose sleep over the former, but I'll never lose sleep over the latter. It's not that I don't regret the waste of volunteer time; it just isn't at the top of my 'causes' list.
and, like i said, despite a constant influx of volunteers sent there by p.g., the steady d.p. base seems to be a fairly constant number, which means they are experiencing severe "churn" and "burnout", neither of which bodes well for the future. the well _will_ go dry. (especially since more and more people will come to see google as "the source" of the books that they try to find online, since google offers people several million more titles than project gutenberg.)
Loss of volunteers is not good, obviously. Just as with any venture, it is twice as hard to win someone back as it was to gain them in the first place. The last I checked, the wiki book project also appears to have very few active members, so it is possible that book proofing has a cap on the number of people willing to do it with any regularity - a cap that has nothing to do with the format of presentation or the process. As I said before, I don't believe it is possible to catch up to google. If DP produces 2,400 books per year, then even if the process I outlined above (or your own, or anyone else's) increased production by 5x, 10x, or even 25x (for first releases), we would still be far behind the scanners. (Although I do believe your scan figures for daily output by IA and google include a lot of material that is not public domain.) I would very much love for someone to prove me wrong on this point and have a million texts in PG in no time flat.
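[Editor's note: the "still far behind the scanners" claim is easy to check with back-of-envelope arithmetic, using only the figures quoted in this thread - brewster's 1,000 scans/day for the Internet Archive and DP's 2,400 books/year. The sketch below is illustrative only; the real daily scan rate and its public-domain fraction are disputed in the thread itself.]

```python
# Figures as quoted in the thread (not independently verified).
IA_SCANS_PER_DAY = 1_000
DP_BOOKS_PER_YEAR = 2_400

scanned_per_year = IA_SCANS_PER_DAY * 365  # 365,000 books scanned per year

# Even a 25x proofing speedup (60,000/year) stays well behind scanning.
for speedup in (5, 10, 25):
    proofed_per_year = DP_BOOKS_PER_YEAR * speedup
    print(f"{speedup}x -> {proofed_per_year:,}/yr vs {scanned_per_year:,}/yr scanned")
```

Break-even would require roughly a 150x speedup on those numbers, which is the substance of the "cannot keep up" point.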
If the inefficiency bothers you and me (and others), it's a moot point because it is not our time that is being wasted.
but the potential of our society _is_ being spent inefficiently, and the time of _good_people_ is being wasted, for no good reason.
so i think the point is _not_ moot. i believe i should _speak_up_.
Yes, you should do as you please. We all should. You do fight the good fight as they say. As much as I may agree with you most of the time, it just isn't in me to voice out about it. After all, you took my crown. ;)
And, we have no power to change what is at DP. At least that is the way I look at it.
i'm not looking for "power". i'm looking to enlighten people.
'Power' is often the easier goal to attain. ;) I merely meant that we are not in a position to have any say in what DP does with itself. Nor in what PG does with itself. Carel