re: [gutvol-d] lest the message be missed

carlo said:
How do you manage words that are written in the same way, except the accent? This is quite common in french and italian.
i wouldn't strip away high-bit characters on a french or italian book; they are an essential part of those languages. i've said that repeatedly. jon noring seems to turn every listserv into an endless merry-go-round.

***

robert said:
With some of the OCR I've seen lately, this is probably about 90% right, for 90% of books.
if you want me to estimate some hard numbers, i'd say 75% of the e-texts in the library now could be done to our standard in one evening. the "standard" we've been talking about thus far is 1 error every 10 pages. if an individual can take an e-text to that level of accuracy, then it can be turned over to a quasi-public process i call "continuous proofreading", discussed below.

another 20% of the e-texts in the library now would take two or three evenings, and might require the help of a "specialist" of some kind, someone knowledgeable about a certain arena, like greek, or tables, or indexes, or graphics, etc. here's where the value of distributed proofreaders will most come into play in the future, in my opinion, being able to "fix up" the work of independent proofers.

the remaining 5% of the e-texts in the library now (remember, i'm just guessing at the numbers here) might be too difficult for any individual to take on, for reasons of size, complexity, or what have you. again, distributed proofreaders will shine here...
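(just to make the "1 error every 10 pages" standard concrete, here's a throwaway python check; the function name and the page counts are illustrative only, they don't come from any actual tool:)

    # a throwaway sketch of the "1 error every 10 pages" standard;
    # the function name and the page counts below are illustrative only.
    def meets_standard(pages, errors_found, max_errors_per_page=1 / 10):
        """true if the e-text has no more than 1 known error per 10 pages."""
        return errors_found <= pages * max_errors_per_page

    print(meets_standard(300, 30))   # True: a 300-page book may carry up to 30 known errors
    print(meets_standard(300, 31))   # False: one over the line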
provided you can get good scans,
jon noring promises us good scans, at a high resolution. i guess he's got a lotta money, to be able to afford the maintenance costs of storage and bandwidth for them. if he delivers on that promise, we don't need to worry...

if he can't -- and he has no track-record in this arena, so maybe we shouldn't count on him -- then we have the scans that internet archive is making for toronto. they've promised us "thousands" of scanned books by this spring, but the last thing i heard was that they were "pausing to do an evaluation of their quality", which probably means they've come to realize that they must do a much better job than they have been. i've heard that some books are done very well and that others leave a lot to be desired. consistency in this regard is often a difficult goal to achieve. but i believe brewster will eventually get it right.

and of course, there's always google. we still don't know what kind of job they'll do, or if we can use their scans. but if they do a good job, and release their scans freely, that will provide us with a ton of scanned-image books.

some people reading this undoubtedly work in an office that has one of those "multi-functionality" machines. some of those babies can scan over 60 pages a minute, straighten the scans, and upload them to the internet, all while you sit there and whistle and pick your nose. at that rate, the 450 pages of "my antonia" would take 7.5 minutes. i don't know how crafty _you_ might be, but my fellow poets are _highly_ skilled at hijacking office machinery for our own nefarious purposes... ;+) (heck, that's the only reason some of us get a job at all.)

finally, there are millions of home computers out there that were sold with an all-in-one printer/scanner/fax. and, as before, the quickest and easiest way to get to a million scan-books is for a million people to scan _one_.

so, it's fairly easy to predict that, from one or more of the above factors, there will soon be an _avalanche_ of books that have been scanned and need to be converted into text. and every scanned-book will have at least _one_ person who will want its text badly enough to do a little work. what we need to do is _give_that_person_a_good_tool_ that enables their little bit of work to get good results. that's what i intend to give them.

all i ask of you is that you stop telling people that this job is difficult. it's not.
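(and just so nobody has to take the 7.5-minute figure on faith, the arithmetic is trivial; the page count and scan rate are the ones quoted above, nothing else is assumed:)

    # scan-time arithmetic for the example above; numbers are from the paragraph.
    pages = 450              # "my antonia"
    pages_per_minute = 60    # a fast office scanner, per the figure quoted above
    print(pages / pages_per_minute)   # 7.5 minutes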
and provided you are willing to let a few hard-to-detect classes of error
you'll have to explain what you mean by "hard-to-detect". in my experience, if an error is serious (in any meaningful way), then it'll be detected by a person who's actually reading the book. some errors are unforgivable, such as an incorrect word that won't even pass spellcheck. those should _always_ be caught. trivial punctuation errors, like a missing comma, are... well, trivial. (although i haven't mentioned anything about it until now, a great way to catch some errors is to have the computer speak the text aloud to you as you follow along reading it; stealth scannos, for instance, are handily exposed by this.)
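(for what it's worth, here's a rough python sketch of the kind of check i mean; the word-list path and the scanno pairs are purely illustrative, and a real stealth-scanno list would be much longer:)

    # rough sketch: flag words that fail spellcheck, plus a few "stealth
    # scannos" (real words that are often o.c.r. misreadings of other words).
    import re

    # classic o.c.r. confusions; these particular pairs are only illustrative.
    STEALTH_SCANNOS = {"arid": "and", "carne": "came", "modem": "modern"}

    def load_wordlist(path="/usr/share/dict/words"):
        # assumes a plain one-word-per-line word list is available at `path`
        with open(path) as f:
            return {line.strip().lower() for line in f}

    def flag_suspects(text, wordlist):
        for token in re.findall(r"[A-Za-z']+", text):
            low = token.lower()
            if low in STEALTH_SCANNOS:
                print("possible scanno:", token, "-> maybe", STEALTH_SCANNOS[low])
            elif low not in wordlist:
                print("fails spellcheck:", token)

    flag_suspects("He carne in from the storm, arid sat down.", load_wordlist())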
and provided you are willing to let a few hard-to-detect classes of error go until post-production.
"post-production" has no meaning in my scenario. i repeat myself, again and again, by saying that once a person gets the error-level on an e-text down to 1 error in 10 pages, we can make it available via "continuous proofreading" and let readers-from-the-general-public zoom it to perfection. 1. scan. 2. "fix" the scans. 3. do the o.c.r. on them. 4. run the post-o.c.r. tool. 5. do quasi-public "continuous proofreading". 6. consolidate the corrections into a public release. 7. release the e-text out to the public as a single file. 8. continue doing a full-public "continuous proofreading". 9. take error-reports from the people reading the file offline. i will also state, for the record, that i think step #4 can do _far_ better than 1 error in 10 pages if we sharpen our tools. look at the "my antonia" example. jon had a team of _seven_ proofreading it. i don't know how many looked at each page, but, according to my analysis, they took the error-rate down to about 1 every 70 pages. when i used my bag of tricks on it, i removed 3 errors from the 210 pages i subjected to scrutiny. (which leads us to predict 3 more errors in the second half.) to the best of my knowledge, there are no errors in my file, i.e., the first half of the book. (full book by friday, hopefully.) i'm not saying it _is_ free of errors even now, since someone with a different set of tricks in _their_ bag might be able to locate 2 more remaining errors i couldn't find, but i will say that it is more than clean enough to turn loose on the public... (that is, i think we could skip steps #5 and #6 on this e-text.) -bowerbird

Bowerbird wrote:
carlo said:
How do you manage words that are written in the same way, except the accent? This is quite common in french and italian.
i wouldn't strip away high-bit characters on a french or italian book; they are an essential part of those languages. i've said that repeatedly.
Yes you have said that -- repeatedly. But I believe it is also essential to preserve all accented Latin and other non-ASCII characters found in *all* books. This is where the differences of view arise. Throwing them out because they are "inconvenient" (which seems to be your motive, but I'm not sure) is not a valid excuse. Since your tool set (and viewing software) can handle any character set you want, not supporting non-ASCII characters is even more confusing.
jon noring seems to turn every listserv into an endless merry-go-round.
<laugh/> Jon

On Tue, 8 Mar 2005 13:04:18 EST, Bowerbird@aol.com <Bowerbird@aol.com> wrote:
carlo said:
How do you manage words that are written in the same way, except the accent? This is quite common in french and italian.
i wouldn't strip away high-bit characters on a french or italian book; they are an essential part of those languages. i've said that repeatedly.
Then why take the time to remove the high-bit characters in an English book? There are lots of books that have important French quotations or use accents to denote unpredictable stress; why strip them from the books where you can, just because? It's not hard at all to deal with the handful of accents the average English book has.
if you want me to estimate some hard numbers, i'd say 75% of the e-texts in the library now could be done to our standard in one evening.
Not by someone new to the job. To do a book in an evening requires that you be experienced with the job and the tools. And I get real tired of you using the average book in PG as a metric. The average book in PG was chosen because it was relatively easy to do. Out of the three floors of books in the library I'm sitting in (excluding the governmental depository), the basement is full of science, math or technology books, and will require complex graphical or mathematical work. Of the remaining 66% (probably more like 55% or 60%, since the third floor is small), many of them are art or music books, or dictionaries and grammars, or are in archaic languages or archaic fonts that OCR doesn't handle well.
here's where the value of distributed proofreaders will most come into play in the future, in my opinion, being able to "fix up" the work of independent proofers.
It's funny that, if the value of DP is so limited, the percentage of texts that have come in from DP is so high. Why don't we have more people doing books by hand alone?
that's what i intend to give them. all i ask of you is that you stop telling people that this job is difficult. it's not.
When you upload books to PG, what name do you put on them? For all your words, I can't recall ever seeing a book credited to you. I have no samples of what you've worked on alone, and no way to judge what your quality standards are.
participants (3)
- Bowerbird@aol.com
- David Starner
- Jon Noring