[gutvol-d] ok then, let's go to work (part 2)

7 Sep 2005

      today's lesson will bear on distributed proofreaders.

as a reminder, this thread is about the scanset
created by jon noring for his "my antonia" demo:
...
http://www.openreader.org/myantonia
since the scans themselves are a fairly hefty download,
at 31 megs, you can grab the 5-meg djvu version instead.
that allows you to follow along if i refer to certain pages.

i've uploaded the text-file that finereader v7
generates as o.c.r. output for "my antonia".
this text-file can be downloaded from this u.r.l.:
...
http://snowy.arsc.alaska.edu/bowerbird/myantonia1.txt
or, for those of you who would prefer a .zip version instead:
...
http://snowy.arsc.alaska.edu/bowerbird/myantonia1-txt.zip
i will probably make some references to this text-file
in coming days, so if any of you wanna follow along,
you should download it and become familiar with it...
(and it's only a 500k download, under 200k for the .zip.)

one of the most immediate ways that online scans
can facilitate the goals of project gutenberg is to
make it possible for people to check the text against
the scans themselves, to make sure that it's accurate.

generally, this is what distributed proofreaders does --
present the text alongside the scan so it can be proofed.

but the relevance of the parallel is much more specific;
for that we need to delve deeper into the o.c.r. output...

as i understand it, d.p. has recently switched to a new
methodology, which splits proofing and formatting
into separate rounds (with 2 rounds for each of them).

part of the formatting involves "meta-formatting".
(i think this is part of a formatting round, anyway,
not a distinct and separate "round" of its own, but
if i am wrong about any of this, i assume that one of
the people from d.p. will step in and correct my error.)

one aspect of the "meta-formatting" is a checklist that,
among other things, indicates if a chapter-heading exists
on a specific page.   this checklist is used to ensure that
the proper markup gets applied to that chapter-heading.

that is, each page-scan is pushed out to a human being,
who then marks an item on a list if it has a header on it.
(and other items on the list if it has those other features.)

as i have indicated in the past, it is not difficult to write
computerized routines to sniff out these section-headers;
so it is simply a ridiculous waste of valuable resources to
have human beings make this determination instead.

it is much smarter to use the computer to do the bulk of 
the work at the outset, and then have a human check it.

appended is the output from such a routine i wrote in
just a short time -- the program is under 50 lines long.

if you check against the "my antonia" text, or the scans,
you will see that this routine has successfully identified
the pages in the book on which there is a section-break.
it gives you the page-number, and does the best it can
to tell you the actual _text_ of the header on that page.

programs like these only take a few minutes to run, so
it's easy to see this is a more efficient way to proceed,
compared to having a person drudge through every page.

i won't tell you exactly _how_ this program operates,
because it would do you good to look at pages where
there is a section-break, and come up with an answer.

a hint for you is that there are _numerous_ indicators,
any one of which is sufficient in this current example...

you might remember that i have a 30-item checklist
consisting of dimensions that are indicative of headers.
how many of the 30 items can _you_ come up with?

feel free to share your answers with the whole listserve.
if enough people come up with enough of the indicators,
i will share the source-code of the routine i wrote here...

in fact, just for fun, i wrote another quick little routine,
using another one of my 30 indicators, and that routine
gave me the output that i appended in the second p.s.
as you see, this routine produced excellent results too.

i have said it before, but i'll repeat it again here now:
headers are specifically _designed_ to draw attention,
so it is easy to locate them, even in raw o.c.r. output.
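
to make the claim concrete, here is a minimal sketch of one possible
header sniffer. it is not the routine described in this post, whose
method is deliberately left unstated. it assumes that pages in the
o.c.r. text are separated by a form-feed character (the real file may
mark pages differently), and it uses just three guessed indicators:
a short line that is a roman numeral, a "Book N" line, or a short line
in all capitals. the names PAGE_BREAK, looks_like_header and
find_headers are made up for this illustration.

```python
import re

# ASSUMPTION: pages are separated by a form feed; the real o.c.r. file
# may use a different page marker.
PAGE_BREAK = "\f"

# Roman numerals as o.c.r. tends to render them; a lowercase "l" is
# allowed because "III" is often misread as "Ill".
ROMAN = re.compile(r"^[IVXLl]{1,6}$")
BOOK = re.compile(r"^Book\s+[IVXLl]+$", re.IGNORECASE)


def looks_like_header(line):
    """Return True if a single line resembles a section header."""
    text = line.strip()
    if not text or len(text) > 30:
        return False  # headers are short
    if ROMAN.match(text) or BOOK.match(text):
        return True
    # A short line whose letters are all capitals, e.g. CONTENTS.
    letters = [c for c in text if c.isalpha()]
    return len(letters) >= 4 and all(c.isupper() for c in letters)


def find_headers(ocr_text, scan_lines=3):
    """Return (page_number, header_text) for pages that open with a header."""
    hits = []
    pages = ocr_text.split(PAGE_BREAK)
    for page_number, page in enumerate(pages, start=1):
        nonblank = [ln for ln in page.splitlines() if ln.strip()]
        # Only the first few non-blank lines of a page are inspected.
        for line in nonblank[:scan_lines]:
            if looks_like_header(line):
                hits.append((page_number, line.strip()))
                break
    return hits
```

a sketch like this will produce false positives (a roman-numeral page
number at the top of a page, for instance), which is exactly why the
result should go to a human for a final check.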

but heck, before you know it, you'll be smart enough to
figure out how to determine _other_ structures as well...

another item on the "meta-formatting" checklist is 
_footnotes_.   there is only one footnote in this text,
but based on that, how would you write a routine to
identify any footnotes?   how about block quotations?
expressions in a foreign language?   tables?   lists?
poems?   all of the various aspects contained in plays?

it's not hard.   give it a try...
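
as one possible starting point for the footnote question, here is a
rough sketch. it rests on two guessed indicators only: a footnote sits
in the last few lines of a page, and it begins with a marker (an
asterisk, a dagger, or a small number) that also appears somewhere
in the body of that same page. the function name, the marker set and
the tail_lines default are all assumptions made for illustration.

```python
import re

# ASSUMPTION: a footnote line starts with one of these markers.
FOOTNOTE_START = re.compile(r"^\s*([*†‡]|\d{1,2})\s*\S")


def footnote_candidates(page_text, tail_lines=6):
    """Return lines near the bottom of a page that look like footnotes."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    body, tail = lines[:-tail_lines], lines[-tail_lines:]
    body_text = "\n".join(body)
    hits = []
    for ln in tail:
        match = FOOTNOTE_START.match(ln)
        # Keep the line only if its marker is also used in the body text.
        if match and match.group(1) in body_text:
            hits.append(ln.strip())
    return hits
```

numeric markers will match any digit in the body, and o.c.r. sometimes
turns an apostrophe into an asterisk, so the candidates need a human
check before any markup is applied.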

-bowerbird

p.s.   here's that output...

3               Book I
9               II
21              Ill
31              IV
36              V
42              VI
48              VII
57              VIII
70              IX
80                              For several weeks after my sleigh-ride, we
91              XI
96              XII
101             XIII
108             XIV
119             XV
131             XVI
137             XVII
145             XVIII
156             XIX
160             Book II
162             Book II
168             II
176             Ill
181             IV
193             V
197             VI
206             VII
220             VIII
225             IX
233                             It was at the Vannis* tent that Antonia was
238             XI
244             XII
258             XIII
264             XIV
280             XV
288             Book III
290             Book III
298             II
307             Ill
315             IV
332             Book IV
334             Book IV
342             II
346             Ill
361             IV
366             Book V
368             Book V
399             II
415             Ill
441                             COPYRIGHT, 1918, BY WILLA SIBKRT CATHKR
443             CONTENTS
445             INTRODUCTION

p.p.s. and here's the output from the second routine.   i've left a big
hint in here about how this routine operates; can you figure it out?

 1=2              2=2
 3=24             4=30
 8=29             9=26
 20=20            21=26
 30=28            31=25
 35=29            36=26
 41=23            42=26
 47=17            48=25
 56=7             57=26
 69=17            70=26
 79=6             80=25
 90=16            91=26
 95=18            96=25
 100=13           101=26
 107=10           108=27
 118=23           119=26
 130=8            131=25
 135=28           136=22
 137=26           138=30
 144=25           145=26
 155=26           156=26
 159=27           160=2
 161=3            162=2
 163=24           164=30
 167=27           168=26
 175=16           176=25
 180=24           181=26
 192=13           193=26
 196=12           197=26
 205=25           206=26
 219=10           220=27
 224=29           225=25
 232=18           233=25
 237=22           238=26
 243=19           244=26
 257=6            258=26
 263=17           264=26
 279=28           280=26
 288=21           289=3
 290=2            291=24
 297=18           298=26
 306=28           307=26
 314=29           315=26
 331=29           332=17
 333=3            334=2
 335=25           336=30
 340=29           341=16
 342=26           343=30
 345=10           346=26
 359=29           360=8
 361=26           362=30
 365=20           366=2
 367=3            368=2
 369=24           370=30
 398=20           399=26
 414=8            415=25
 419=25           420=4
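
one reading of the listing above, and it is only a guess, is that each
entry is page=number-of-lines: a full page seems to run to about 30
lines, and pages around a section-break come up short. a minimal sketch
of that idea follows. it assumes form-feed page separators and a
30-line full page, it uses made-up function names, and it does not
try to reproduce the exact two-column listing.

```python
# ASSUMPTION: pages are separated by a form feed; the real o.c.r. file
# may use a different page marker.
PAGE_BREAK = "\f"


def lines_per_page(ocr_text):
    """Map each page number to its count of non-blank lines."""
    counts = {}
    pages = ocr_text.split(PAGE_BREAK)
    for page_number, page in enumerate(pages, start=1):
        counts[page_number] = sum(1 for ln in page.splitlines() if ln.strip())
    return counts


def short_pages(counts, full_page=30):
    """Return (page_number, line_count) for pages below a full page."""
    return [(n, c) for n, c in sorted(counts.items()) if c < full_page]
```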

Bowerbird＠aol.com