book vs. page proofing

I've looked at this some more and I've come to the conclusion that working at the page-level is sufficient but inefficient, and that working at the book-level is efficient but insufficient. Here's what I base that on. In my latest experiment on a new book, I concatenated all the text files first and did the global corrections. A lot of improvements were made quickly. When I was convinced I had done all I could, I burst that back into individual pages and loaded it into PPE, the page-at-a-time editor which presents the image, an edit window, and an analysis window simultaneously.

I found two types of errors that I could not have found at the book level and which would have also likely been missed by the smoothies. The first was where the text of two paragraphs was combined into one, without an intervening blank line. This happened in the middle of two separate pages. Abbyy doesn't preserve the start-of-paragraph indent, so especially if the first paragraph ends its last line near the right margin, this error is invisible unless you are looking at the image. The other was words incorrectly marked as italic. Typically they are short words ("I" and "we" most commonly), but not always. If you can't see the image, you can't expect to get these right. I had to undo about 8 of these.

For the next book, I'll likely do this same procedure: full text first and then page at a time in PPE. The first makes it easier; the second makes it right. --Roger
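A minimal Python sketch of the concatenate-then-burst workflow Roger describes; the separator format and file naming here are assumptions for illustration, not what Roger or PPE actually uses:

    # Sketch only: concatenate numbered page files with a separator line,
    # then burst the corrected whole-book file back into per-page files.
    import glob
    import re

    SEP = "-----File: %s-----"          # assumed separator format

    def concatenate(page_dir, out_path):
        with open(out_path, "w", encoding="utf-8") as out:
            for path in sorted(glob.glob(page_dir + "/*.txt")):
                out.write(SEP % path + "\n")
                with open(path, encoding="utf-8") as page:
                    out.write(page.read())

    def burst(whole_book_path):
        current = None
        with open(whole_book_path, encoding="utf-8") as book:
            for line in book:
                m = re.match(r"-----File: (.+)-----$", line.rstrip("\n"))
                if m:
                    if current:
                        current.close()
                    current = open(m.group(1), "w", encoding="utf-8")
                elif current:
                    current.write(line)
        if current:
            current.close()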

On Tue, December 27, 2011 11:24 am, Roger Frank wrote:
I've looked at this some more and I've come to the conclusion that working at the page-level is sufficient but inefficient, and that working at the book-level is efficient but insufficient.
This tends to mirror my own experience, although I would characterize it slightly differently. Picture a graph where the y-axis represents the efficacy of work, and the x-axis represents the life-cycle of a digitization project. Now picture two logarithmic curves with a minimum value greater than zero. One curve starts at a maximum on the left and approaches the minimum to the right, and the other begins at the minimum value on the left and approaches the maximum on the right. The first (descending) curve represents the efficacy (not necessarily the efficiency) of working with a book as a whole, and the second (ascending) curve represents the efficacy of working with the book a page at a time.

When a book first emerges from transcription, whether via OCR or human action, it is most susceptible to whole-document transformations. These whole-document transformations could include things like regular expressions coupled with search and replace, computer-assisted spell checking, editor macros, or even tools like the fr2html tool I wrote several years ago. As an e-book moves through its life-cycle, the usefulness of these whole-document processes decreases (although it never reaches zero; there will always be some efficacy associated with whole-document actions). At the same time, the usefulness of page-at-a-time proofreading increases. Page-at-a-time proofreading processes can include a contextless comparison of a segment of text with an image (something I am simply incapable of doing), smooth-reading, and smooth-reading with page images. Of course, these page-at-a-time processes include checking both formatting and content.

At this point I think we need to add a third curve to our graph: the ability of software tools to make corrections/transformations without human intervention. I suspect that this curve will mirror the whole-book curve to some extent, although I can't say just how closely. To illustrate, when a book comes out of FineReader as SGML/HTML, fr2html can be run without human intervention to correct many of the errors (or to improve the document). Running a spell check against the entire book requires much more human involvement (to verify proposed changes) and should be performed at some point further into the life-cycle, but before the page-at-a-time process takes over. If you have been paying attention to BowerBird's sample processes you may have noted that while he has written a number of Python, Perl and sed (regex) scripts, very few of them are fully automated, and most of them require some tweaking over multiple iterations. Even the most automated of these scripts require some human intervention, if only to decide which scripts are appropriate for a particular book, and which are not. I suspect that the ability of software tools to run more-or-less autonomously decreases much faster than the usefulness of whole-document editing.
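A toy Python sketch of the first two curves, just to pin down the shapes being described; exponential decay/saturation stands in for "logarithmic" here, and the floor and rate constants are arbitrary, not measured values:

    # Toy numbers only: two curves bounded below by a non-zero floor, one
    # falling from 1.0 toward the floor, one rising from the floor toward
    # 1.0, over the project life-cycle t in [0, 1].
    import math

    FLOOR = 0.1                         # arbitrary minimum efficacy

    def whole_book_efficacy(t):
        return FLOOR + (1.0 - FLOOR) * math.exp(-3.0 * t)

    def page_at_a_time_efficacy(t):
        return 1.0 - (1.0 - FLOOR) * math.exp(-3.0 * t)

    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print("t=%.2f  whole-book=%.2f  page-at-a-time=%.2f"
              % (t, whole_book_efficacy(t), page_at_a_time_efficacy(t)))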
Here's what I base that on.
In my latest experiment on a new book, I concatenated all the text files first and did the global corrections. A lot of improvements were made quickly. When I was convinced I had done all I could, I burst that back into individual pages and loaded it into PPE, the page-at-a-time editor which presents the image, an edit window, and an analysis window simultaneously.
This makes sense to me. If I were designing an e-book proofing system (which I guess in some sense I am), I would start with OCR output that preserves as much document structure as possible. I would then run as many automated processes as I can find (or develop) against that document, preferring first those that require little human intervention, and then moving towards those that require more. At some point the law of diminishing returns on whole-document editing will come into play, at which point it is time to move to a page-at-a-time review and edit. At this point in time, I can't provide any heuristics to suggest when this transition should occur, but I suspect that it is much later in the production life-cycle than most people think.

It appears to me that DP's process is to take a text, strip it of most meaningful structural elements, and then throw it right into the page-at-a-time process. I believe that this is the basis of most of BowerBird's polemics against DP: page proofers shouldn't have to deal with errors that could have been easily identified, and repaired, at the whole-document editing level. As I indicated earlier, however, I don't think the efficacy of whole-document editing ever really approaches zero. So even when the document life-cycle moves into the page-at-a-time phase we still need a mechanism to do whole-document editing.

When I first created my own DP-like prototype (http://www.ebookcoop.net -- not usable in IE 8) I broke documents up according to pages; a page of HTML represented a single page image. If a person wanted to read or download an entire document, a background process would merge all the individual HTML files into a finished document. Processes that would benefit from a document context (e.g. global search and replace) would also be provided by the controlling server process. I'm reconsidering that strategy. I'm certain that given an HTML file with page-break milestones embedded in it, I can tease apart each page and display it separately. Putting page-at-a-time edits back into the master document is a little more difficult, but I think I can do that as well. A master document would be placed in a CVS repository where it could be checked out and in by anyone needing to work at the whole-document level. It would also be accessible by a web application which could tease apart the pages, presenting them one at a time with the associated page image. When editing is complete, the web application would merge the changes back into the master document, then check it in to CVS on behalf of the user.
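A rough Python sketch of the tease-apart/merge-back idea, assuming page-break milestones written as HTML comments of the form <!--page:12-->; that milestone syntax is an assumption for illustration, not anything specified above:

    # Sketch only: split a master HTML file on assumed <!--page:N--> milestones,
    # and write one page's edited fragment back without disturbing the markers.
    import re

    MILESTONE = re.compile(r"<!--page:(\d+)-->")

    def split_pages(master_html):
        """Return {page_number: html_fragment} keyed on the milestone comments."""
        parts = MILESTONE.split(master_html)
        # parts = [preamble, "1", frag1, "2", frag2, ...]
        return {int(num): frag for num, frag in zip(parts[1::2], parts[2::2])}

    def merge_page(master_html, page_number, new_fragment):
        """Replace one page's fragment in the master document."""
        pattern = re.compile(
            r"(<!--page:%d-->).*?(?=<!--page:\d+-->|\Z)" % page_number, re.S)
        return pattern.sub(lambda m: m.group(1) + new_fragment,
                           master_html, count=1)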
I found two types of errors that I could not have found at the book level and which would have also likely been missed by the smoothies.
The first was where the text of two paragraphs was combined into one, without an intervening blank line. This happened in the middle of two separate pages. Abbyy doesn't preserve the start-of-paragraph indent, so especially if the first paragraph ends its last line near the right margin, this error is invisible unless you are looking at the image.
Actually, Abbyy does preserve the start-of-paragraph indents -- if you know where to look. Given the thunderous silence from our brethren at Internet Archive about the 'fromabbyy' PHP script, I've started looking more closely at the *_abbyy.xml files from archive.org to see what I could do to perform the same function. One of the many things I've discovered is that most of the <par> elements have a "left-indent" attribute, which seems to be the number of pixels that a paragraph is indented. I have every reason to believe that Abbyy strips this value when saving as RTF or HTML, but if you start from the .xml output you might be able to identify those paragraphs which are /not/ significantly indented, and thus likely to need to be joined to the preceding paragraph. I'm hoping that by the first of February I might have something that will allow us to actually make the files at Internet Archive useful. Perhaps I might have a servlet that could display HTML segments by that time as well. [remainder snipped]
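A rough Python sketch of that indent check against a *_abbyy.xml file; the file name, the attribute name (written here as leftIndent), and the pixel threshold are all guesses that would need verifying against real Abbyy output:

    # Sketch only: flag <par> elements in a *_abbyy.xml file whose indent is
    # small, i.e. paragraphs that may need joining to the one before them.
    import xml.etree.ElementTree as ET

    def flag_unindented_pars(abbyy_xml_path, attr="leftIndent", threshold=20):
        suspects = []
        for event, elem in ET.iterparse(abbyy_xml_path):
            if elem.tag == "par" or elem.tag.endswith("}par"):  # ignore namespace
                indent = int(elem.get(attr, "0"))
                if indent < threshold:
                    snippet = "".join(elem.itertext())[:60]  # help a human find it
                    suspects.append((indent, snippet))
                elem.clear()
        return suspects

    for indent, snippet in flag_unindented_pars("book_abbyy.xml"):
        print(indent, repr(snippet))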