Processing eTexts (was RE: Re: d.p.'s undeserved superior attitude about "quality")

Bowerbird said:
carel said:
I've read your posts for many years
so _you're_ the one who's been reading them! ;+)
I'm sure many people read them. :)
I'm not sure the most efficient system on earth could keep up with the scan rate.
sure we can. just have 10 million people do 1 book each.
piece'a'cake.
It sounds easy when you put it that way.... ;)
And if you speed things up by skipping manual proofing and relying entirely on automation for the first release, you will have text little or no better than what Google already has.
oh, don't kid yourself. google's text is _much_ better than what they're showing to us members of the general public.
Their OCR software requires training, so I agree with you that they always have a version at least one step better on the back burner. They probably have a stepped-up version of everything we see....
i haven't been impressed by the wiki approach, honestly...
The standard wiki layout is not a proper environment for editing. It would be better to have something that lets the vast majority of readers just read the page, and click a button to get an interface for making and submitting corrections when one is needed. Direct comparison to the scans is part of the formal proofing process and, although available during the 'quest for perfection,' should not be forced on readers. I think calling it a wiki makes it easier to comprehend quickly than explaining the whole process of storage, editing, etc. Rather than working an editing interface into an out-of-the-box wiki system, it would be better to write a proprietary system with some of the social networking features of a wiki. That part would take a very firm second place to just getting some tools out there for people to produce decent-quality texts in a minimal amount of time. Because of that, I haven't really given much thought to the 'wiki' interface, as it will evolve by itself from wants/needs.
[snip]
in re: google:
people on this listserve have been cut off for scraping... you can still do it, of course, but you have to be careful.
Has anyone ever just asked Google if they could use the scans and OCR text for PG? With regard to the scans, I believe they just require you to leave the branding.
i would bother to nitpick certain parts of it, like the x.m.l., but really, my objection is at the more fundamental level.
XML makes a nice standard for storage and conversion and can be human readable. I'm not one of those people who believes that XML is the answer to everything, but I do think it works well for this use case. Opinions on this will vary, etc.
yes, i believe every page of every book should be online, on its own webpage with unchanging u.r.l., text and scan, with an error-report form that the general public can use to detail any problems with the page, or ask questions, or even make annotations and have dialog about that page...
Agreed.
but i also believe that the text for the vast majority of these pages should be _perfect_ at the time that text is mounted.
We are in agreement on this point as well. Most of the correction work should be done through automation and human interaction with a software process, so that the error level is reduced to a minimum before the text is put in the hands of a proofreader (or the general public).
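As a rough illustration of software doing the first pass and handing only the doubtful spots to a person, here is a minimal Python sketch. The heuristic (a token that mixes letters and digits is probably an OCR misread) and the names `MIXED` and `flag_suspects` are assumptions made up for this example, not anything from an existing tool.

```python
import re

# Assumed heuristic: a token that mixes letters and digits ("c0me", "1ike")
# is a likely OCR misread. Pure numbers such as "1850" are left alone.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

def flag_suspects(lines):
    """Return (line_number, token) pairs for a human to review."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        for match in MIXED.finditer(line):
            suspects.append((number, match.group()))
    return suspects

page = [
    "He had c0me to the house",
    "in the year 1850 and",
    "did not 1ike it.",
]
for number, token in flag_suspects(page):
    print(f"line {number}: check '{token}'")
```

A rule this crude will also flag legitimate tokens such as "2nd", which is the reason the output is a review list for a person rather than an automatic fix.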
and i've demonstrated, repeatedly, that this is fully possible.
It is very possible.
[snip]
so here's how i see things going...
when you find a scanset online of a book you want to digitize, you'll download it, along with the o.c.r. (if it comes from a site that won't give you the o.c.r., you can upload it to archive.org, and they'll o.c.r. it for you. providing you have no o.c.r. app.)
you'll use a handy-dandy tool to digitize it in an hour or two, and then you'll make the results available to everyone online.
My only protest is that I believe a human proofing stage is still a good choice. For one thing, it would be a way of catching errors introduced by the human-assisted automation process. Otherwise, I would say to have two people run the same correction process, diff the results, and then present the mismatches to a third set of eyes. If that doesn't result in a book that is ready for release, I'm not sure what would. Ready for release != perfect. :) On a point in favor of the proofing round that really has nothing to do with the potential for automation, the existence of a proofing round creates a sense of belonging to the project and could assist in recruiting people to try the 'harder' process of running the 'automation' scripts and submitting books, etc. People need a sense that they are not just assisting tools to create book editions, but are a fundamental part of the process. If you try to proof after a release has been done, there will be little interest in it because you lose the motivation to 'get the book out there.'
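The "two people run the same process, diff the results" step could be sketched roughly as follows in Python, using the standard library's `difflib`. The function name `mismatches` and the sample lines are assumptions for illustration only; a real tool would read the two corrected files from disk.

```python
import difflib

def mismatches(text_a, text_b):
    """Compare two independently corrected texts line by line.

    Returns a list of (line_number_in_a, lines_from_a, lines_from_b),
    one entry per region where the two versions disagree.
    """
    a = text_a.splitlines()
    b = text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    regions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            regions.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return regions

first = (
    "It was the best of times,\n"
    "it was the worst of tirnes,\n"
    "it was the age of wisdom."
)
second = (
    "It was the best of times,\n"
    "it was the worst of times,\n"
    "it was the age of wisdom."
)

# Only the disagreements go to the third set of eyes.
for line_number, from_a, from_b in mismatches(first, second):
    print(f"line {line_number}: A={from_a} B={from_b}")
```

Lines on which both people agree are never shown to the third reader, so an error that both made identically would still slip through.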
and carel, since you're interested in programming such tools, we'll have plenty to talk about...
Yes. :) It's very much a 'will do' at this point. I just have to survive my mid-terms. I look forward to your feedback and your input on the project.
well, like i told gardner just the other day, i really don't _have_ any collection of scripts. i use a text-editor and start looking at the book. i see an error, and devise a global search-and-replace to deal with that type of error. when they're all fixed, i look again to find the next error.
That brings up the point that pretty much all of what my scripts would be doing could be done in word processing software....
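For what it's worth, the "see an error, devise a global search-and-replace" loop looks roughly like this as a script. This is only a sketch in Python: the two rules in `RULES` and the name `apply_rules` are invented for the example, and in practice each book would supply its own rules.

```python
import re

# Hypothetical rules, each written after spotting one kind of error in a book.
RULES = [
    (re.compile(r"\btbe\b"), "the"),        # common scanno: "tbe" for "the"
    (re.compile(r" +([,;:.!?])"), r"\1"),   # stray space before punctuation
]

def apply_rules(text, rules=RULES):
    """Apply each rule to the whole text; report how many fixes each made."""
    counts = []
    for pattern, replacement in rules:
        text, made = pattern.subn(replacement, text)
        counts.append(made)
    return text, counts

sample = "He went to tbe door , and tbe dog followed ."
fixed, counts = apply_rules(sample)
print(fixed)
print(counts)
```

The per-rule counts give a quick sanity check: a rule that fires far more often than expected is probably matching something it shouldn't.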
i think of this process as "listening to the book", in that the book itself will tell you what kinds of errors it has...
I like that. I'm not sure I'm ready to program an AI engine capable of "listening to the book," which is why I do not believe in leaving people out of the process, but rather in giving them tools to assist them with it. And the tools can 'learn' from the people who use them. I'll be busy finishing up a group project for my classes, so it'll be late March before I get any real time to work on scripts for this project. I'll work on some of the research, logic, and flow in the meantime so that I am ready to roll when I do have the time. Back to my programming....
Carel
cmiske@ashzfall.com