
Bowerbird wrote:
jon said networker said:
maybe not. but having done so, it is _refreshing_ to know that -- when that's factored in -- only "a handful" of errors surface.
But the bigger issue is not confined to errors (differences) relative to the source text used, which is what you continue to focus on. The issues concern the larger areas of trust, verifiability, proper digital preservation of the Public Domain, and the use of acceptable sources (with proper documentation), rather than blindly grabbing anything off the shelf, as appears to have happened with the PG version of Frankenstein, which now exposes PG to legal liability. The lack of proper processes, procedures and guidelines for building the non-DP portion of the PG library (which comprises about half the collection and is heavily skewed towards the more classic works) is leading to serious questions about the integrity and trustworthiness of the whole PG library. (I've discussed this at length over the last couple of weeks on The eBook Community.) It can certainly be fixed, but the fix will require: 1) redoing most of the non-DP works using DP, 2) proper selection of sources so they are acceptable, both legally and to those knowledgeable about the better sources to use, and 3) proper documentation of the source, including making available all the original page scans (and not just the title page, which proves nothing.) (Btw, NetWorker presented evidence in his message indicating that PG's version of Frankenstein was taken from a copyrighted edition that itself had a significant number of emendations from the original, which in essence act like a "fingerprint" of its pedigree. This is NOT good. It casts PG's archive in a negative light, and may even lead to a legal demand by Bantam for PG to remove the current version of Frankenstein. It also calls into question the provenance of a large number of other pre-DP texts where there's no source metadata given and no page scans to prove proper provenance.
NetWorker himself is a former attorney, and he has thoroughly researched copyright law as it relates to ebooks over the last couple of years, so Michael and Greg should seriously sit up and take notice of the problem with PG's version of Frankenstein, and with many of its other texts where an acceptable source cannot be demonstrated. Even if PG is "right" in a legal sense, that it could use the 1981 Bantam Classics edition as it *might* have done, does it really want to fight this in court, or to try to explain it away to the trusting public?)
so once again, in spite of some very big noises, it ends up that this fails to stand as a good example of an error-ridden e-text.
Well, your interest in very-low-error-rate OCR at least seems to indicate that you think every etext PG includes in its archive should be a textually faithful reproduction of some known source. That is, if any later emendations are made, they should be properly documented; otherwise, the text should be left as it is in the print source. Is this your thinking, or do you believe that textual faithfulness and proper source identification and verification are not necessary at all? That is, just let people take any text in the PG library and then "edit it" as they see fit?
if you do the scanning properly, manipulate those scans correctly, use abbyy in the best way, and subject its results to the right tools, you will reduce the errors in your text to a relatively small number.
I don't believe anyone disagrees with you here in general. But NetWorker was interested not only in OCR errors, but also in the bigger issues mentioned above -- they are all interlinked.
(the number we've been kickin' around is 1 error for every 10 pages, and at that point, proofreading by the public becomes very viable.)
I doubt this error rate (let's say for even half of the public domain printings out there) is achievable without sentient-level AI. But if proofreading is to be done anyway by the public, as is *now done* by DP, what difference is there between an OCR error rate of one every 10 pages and one every page? The key is that, for building *trust* in the final product, it is a very good idea to have the volunteer proofreaders go over the texts, even if *you don't have to*. Having at least two independent people proof every page (and being able to prove it to anyone who asks) adds to its trustworthiness. Include source metadata and access to the original page scans used as the source, and the highest level of trust is built (as well as greater immunity to legal challenge.) That's what makes DP's system so powerful.
But look at PG's edition of Frankenstein: 1) Which original edition it represents is not documented (Mary Shelley issued two substantially different editions). I think the reader should be able to tell which one it is from the PG cataloging information. This lack of care about different editions is troubling. 2) The source document is not given at all. I'm not sure if the person who did the first etext version is even recorded anywhere (or even known.) (Btw, this person, should Bantam press the issue, which I hope they don't, would probably become a co-defendant. This shows that the lack of proper guidelines, processes and verification methods in the building of the non-DP portion of PG's collection exposes the volunteer donors of texts to potential legal liability! This is another demonstration that if a project is to do something, it needs to *do it right* from the start, and not just take the "ready-fire-aim" approach to everything.) 3) It is unknown what subsequent "edits" were done along the way -- they are not documented, as far as I know. (How do we know whether whole paragraphs were removed or inserted?) 4) It now appears, but is not proven, that the source document was the 1981 Bantam Classics edition. This certainly does not give one warm fuzzies as to the trustworthiness of the non-DP portion of the PG collection.
As a user of PG texts, I consider it important, for moral, legal and aesthetic reasons, that the texts are: 1) textually faithful reproductions of *known* sources, 2) provable as such (with access to the full page scans, and not just the title page), and 3) drawn from sources that are themselves acceptable to use, both legally and in the eyes of those knowledgeable (both professional and amateur) about the Work in question. (For Works which were only published once and never republished by anyone, this last point does not apply, provided the source is itself Public Domain.)
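To make the documentation points above concrete, here is a rough sketch in Python of what a per-text provenance record and a simple check against it might look like. Everything in it is my own illustration: the class name, the field names and the wording of the gaps are hypothetical, and this is not how PG or DP actually catalog anything.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EtextRecord:
    """Hypothetical provenance record for one etext (field names are illustrative)."""
    title: str
    edition: str = ""                      # which edition of the Work the etext represents
    source: str = ""                       # the print copy that was actually transcribed
    source_is_public_domain: bool = False
    full_page_scans: bool = False          # every page, not just the title page
    proofreaders: List[str] = field(default_factory=list)
    edit_log: Optional[List[str]] = None   # None means edits were never documented


def trust_gaps(record: EtextRecord) -> List[str]:
    """List the documentation gaps that would undermine trust in this etext."""
    gaps = []
    if not record.edition:
        gaps.append("edition not documented")
    if not record.source:
        gaps.append("source document not given")
    elif not record.source_is_public_domain:
        gaps.append("source not shown to be Public Domain")
    if not record.full_page_scans:
        gaps.append("full page scans not available")
    if len(set(record.proofreaders)) < 2:
        gaps.append("fewer than two independent proofreaders")
    if record.edit_log is None:
        gaps.append("subsequent edits not documented")
    return gaps
```

A record with nothing but a title, e.g. trust_gaps(EtextRecord(title="Frankenstein")), reports all five gaps, while a fully documented record returns an empty list. An empty edit_log (a list with no entries) is deliberately treated differently from None: the first says "no edits were made", the second says "nobody wrote down what was done".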
if you then have the rare luxury of evaluating your output against an existing version of the book -- like a project gutenberg e-text -- with the right tool (which networker obviously does not yet have), the comparison between the two, alongside the page-images, should make the process of coming to an error-free version simply a breeze.
There will always be hand work necessary to compare two different etexts of the same Work (note that oftentimes there are multiple editions of multiple versions: the Work/Expression/Manifestation (WEM) principle.) Even the hyphenation of compound words requires a human being to ascertain what the author intended. Of course, if this is not important to you, then what can I say?
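As a rough illustration of where the hand work comes in, here is a minimal sketch in Python using the standard difflib module. It is not Bowerbird's tool or anyone else's; the function names are my own, and the tokenizing rule is a simplifying assumption. It compares two transcriptions word by word and marks which differences are only a matter of hyphenation or word breaks. The software can find those, but it cannot say which form the author intended.

```python
import difflib
import re


def tokenize(text):
    """Split text into word tokens, keeping internal hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)


def compare_etexts(text_a, text_b):
    """Return (tag, words_a, words_b) for each span where the two texts differ."""
    a, b = tokenize(text_a), tokenize(text_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def is_hyphenation_variant(words_a, words_b):
    """True when two spans differ only by hyphens or word breaks."""
    def squash(words):
        return "".join(words).replace("-", "").lower()
    return squash(words_a) == squash(words_b)


if __name__ == "__main__":
    version_a = "the fellow-creatures of my native town"
    version_b = "the fellow creatures of my natural town"
    for tag, wa, wb in compare_etexts(version_a, version_b):
        kind = "hyphenation only" if is_hyphenation_variant(wa, wb) else "wording"
        print(tag, wa, wb, "->", kind)
```

On the two made-up sample strings, the first difference ("fellow-creatures" against "fellow creatures") is flagged as hyphenation only and the second ("native" against "natural") as a wording difference. Both still need a human being with the page scans in hand to decide which reading is right.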
since this is _exactly_ what will need to be done _increasingly_, as the page-images from the internet archive and (we hope) google -- plus the work done by individual people scanning everywhere -- emerge into cyberspace, that's where my tool-development efforts are now being focused. i suggest networker start reading my blog; it should start being updated on a daily basis starting next week...
Tools such as yours will likely work for some types of texts and not for others, where human beings will be needed not only to proof for errors but also to structure the document properly. I'm now assessing the digitizing of records of historical and genealogical significance, and these documents usually have quite complex table layouts and very poor quality printing (and oftentimes handwriting). Scans of these records are insufficient on their own, so it is necessary to have human beings read them and transcribe the information into properly structured etext form. I'll post an announcement of your blog to TeBC if you'd like me to (although I don't know the address of your blog -- had it and then lost it.) Jon