
Bowerbird:
carel said:
You don't need to talk to me about having your time wasted. I am the somewhat infamous Carel who left DP nearly a decade ago (after spending around 500 hours creating a front-end and revamping the horrific code that resided on the back-end).
oh, hey, glad to meet you. :+)
i do believe i am the one who has grabbed the "infamous" crown at this point in time, however. i see they still let you post there. myself, i've been banished, so as not to corrupt the youth there.
LOL. It's nice to meet you as well. I've read your posts for many years, but I don't believe we have ever directly 'spoken.'
DP has managed to produce a huge number of etexts
and they could have produced 3 times as many if their workflow were 3 times more efficient, something it could _easily_ attain... according to brewster, internet archive scans 1,000 books a day. (google probably scans twice that many before their lunch break.) d.p. digitizes 2400 books a _year_. i predict they cannot keep up.
I'm not sure the most efficient system on earth could keep up with the scan rate. No matter how much you automate pre- and post-processing and formatting, it comes down to the speed of humans when it comes to proofing. And if you speed things up by skipping manual proofing and using automation completely for the first release, then you will have text little or no better than what google already has. I realize that is not what you are proposing, but what google has done is the most efficient approach, with little regard for the quality that human interaction could add. Wikimedia then takes this further by adding the ability for humans to modify the text in a never-ending cycle.
Their etexts also have a decent level of quality.
i'm with michael hart on this. (even if he refuses to see that.)
i think the first mission is to get text up with acceptable quality. (which, by the way, most of the in-process stuff at d.p. is _not_)
and _then_ we can use the _public_ to drive it toward perfection.
I am very much with you on this one. However, it rather duplicates the efforts of the wikimedia foundation (at least the last time I looked into their efforts), so I am not sure of the 'need' for it. In what manner do you suggest this would be differentiated from the current wiki book editing process at wikimedia, enough to justify creating the project for the purpose of PG? I ask because I am considering giving it a try if I can find a single reason to do so that would not make me feel as though I am reinventing the wheel. I have not been able to find an acceptable reason on my own. I love PG and promote it to all who will listen, but love alone is not enough: there must be logic as well. :)

For that matter, it wouldn't be that hard to cull the data from google books, cut their 'plain text' into pages (and/or re-OCR and diff), and just slap the result into a wiki (which is about what it looks like wikimedia has done). I believe the google use license allows this for non-commercial entities such as PG? Then:

- Someone verifies the google 'plain text' against the scans, typing in missing text and annotating content that could not be processed by the OCR (such as inline images).
- Tidy it up (scannos, etc.) and add some assumed markup for lines, paragraphs, pages, etc. using scripts with no human interaction.
- Have a human-assisted 'spell check' fix the majority of remaining errors, end-of-line hyphenation issues, etc., and add some basic markup for italics and whatnot (although the google OCR text appears to already contain this).
- Run the text through an 'X number of people say this page matches the scan' proofing process (a roundless process).
- Have a human-assisted markup process wrap the content in properly formed (but easy) XML.
- Convert the document to plain text (and any other formats that are desired).
On a 300-page work of uncomplicated fiction, depending on the scan quality and the speed of the humans doing the interactive processes, I would estimate this would take about 2-6 hours _plus_ the proofing time to create a document ready for first release and wiki storage. Then you let it sit in a wiki for eternity, being edited into perfection (or until a very large X number of people again say a page is perfect; create a final edition when all pages in a book pass this test, post the output, archive the XML version and scans, and consider the book done for all time - or at least until a PG reader reports an error that must be checked). For user-supplied scans, same process, with OCR done on the server (and, if desired, a diff against the scan submitter's OCR, if supplied).

I haven't thought about the process in a while, and the new press for speed cuts out some of my original ideas as I recall them. I'm probably completely forgetting a topic or two, but this would pretty much work (after a few logic tweaks) to accelerate the release process and should retain some quality factors in the first release. As you know, ultimately, in regards to quality, you will always have to rely on the quality judgements of the individuals working on a page or project to determine the quality of the final output. Quality is a human concept, so there is no way to automate it. I wish there were. :) I am very interested in your text-processing scripts, though, and always have been. I just haven't been processing any OCR text for a while.
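[Editor's note: the "re-OCR and diff" step described above can be sketched in a few lines. This is a minimal illustration, not anything from the thread: it assumes two OCR passes of the same page are available as plain-text strings, and the function name and line-pairing heuristic are my own. Lines where the two passes agree are trusted; only disagreements get flagged for human attention, which is the labor-saving point of the step.]

```python
import difflib

def flag_ocr_disagreements(pass_a: str, pass_b: str) -> list[tuple[int, str, str]]:
    """Compare two OCR passes of the same page line by line.

    Returns (line_number, text_a, text_b) tuples for lines where the
    passes disagree, so only those lines need a human proofer.
    """
    lines_a = pass_a.splitlines()
    lines_b = pass_b.splitlines()
    flagged = []
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # both OCR engines agree; trust the line
        # Pair up the differing lines; pad the shorter side with "".
        span = max(i2 - i1, j2 - j1)
        for k in range(span):
            a = lines_a[i1 + k] if i1 + k < i2 else ""
            b = lines_b[j1 + k] if j1 + k < j2 else ""
            flagged.append((i1 + k + 1, a, b))
    return flagged
```

For example, two passes that differ only in a "tirnes"/"times" scanno would yield a single flagged tuple for that line, and the rest of the page would skip straight to the later cleanup stages.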
If the volunteers at DP are blissfully unaware that their time is being wasted then they do not feel like victims of the process
but they _are_ victims, whether they _feel_like_ victims or not...
They are not victims in the sense that an abused child is abused whether the child realizes it or not. We are talking about proofreading, and about wasting manual hours on tasks that could have been done (or assisted) by automation. I may lose sleep over the former, but I'll never lose sleep over the latter. It's not that I don't regret the waste of volunteer time; it just isn't at the top of my 'causes' list.
and, like i said, despite a constant influx of volunteers sent there by p.g., the steady d.p. base seems to be a fairly constant number, which means they are experiencing severe "churn" and "burnout", neither of which bodes well for the future. the well _will_ go dry. (especially since more and more people will come to see google as "the source" of the books that they try to find online, since google offers people several million more titles than project gutenberg.)
Loss of volunteers is not good, obviously. Just as with any venture, it is twice as hard to win someone back as it was to gain them in the first place. The last I checked, the wiki book project also appears to have very few active members, so it is possible that book proofing has a cap on the number of people willing to do it with any regularity - a cap that has nothing to do with the format of presentation or the process. As I said before, I don't believe it is possible to catch up to google. If DP produces 2,400 books per year, then even if the process I outlined above (or your own, or anyone else's) increased production by 5x, 10x, or even 25x (for first releases), we would still be far behind the scanners. (Although I do believe your scan figures for daily output by IA and google include a lot of material that is not public domain.) I would very much love for someone to prove me wrong on this point and have a million texts in PG in no time flat.
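[Editor's note: the "still far behind the scanners" claim is easy to check with back-of-envelope arithmetic, using only the figures quoted in this thread - brewster's 1,000 scans/day for the Internet Archive and DP's 2,400 books/year. The sketch below is illustrative only; the real daily scan rate and its public-domain fraction are disputed in the thread itself.]

```python
# Figures as quoted in the thread (not independently verified).
IA_SCANS_PER_DAY = 1_000
DP_BOOKS_PER_YEAR = 2_400

scanned_per_year = IA_SCANS_PER_DAY * 365  # 365,000 books scanned per year

# Even a 25x proofing speedup (60,000/year) stays well behind scanning.
for speedup in (5, 10, 25):
    proofed_per_year = DP_BOOKS_PER_YEAR * speedup
    print(f"{speedup}x -> {proofed_per_year:,}/yr vs {scanned_per_year:,}/yr scanned")
```

Break-even would require roughly a 150x speedup on those numbers, which is the substance of the "cannot keep up" point.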
If the inefficiency bothers you and me (and others), it's a moot point because it is not our time that is being wasted.
but the potential of our society _is_ being spent inefficiently, and the time of _good_people_ is being wasted, for no good reason.
so i think the point is _not_ moot. i believe i should _speak_up_.
Yes, you should do as you please. We all should. You do fight the good fight as they say. As much as I may agree with you most of the time, it just isn't in me to voice out about it. After all, you took my crown. ;)
And, we have no power to change what is at DP. At least that is the way I look at it.
i'm not looking for "power". i'm looking to enlighten people.
'Power' is often the easier goal to attain. ;) I merely meant that we are not in a position to have any say in what DP does with itself. Nor in what PG does with itself. Carel