today's lesson will bear on distributed proofreaders.

as a reminder, this thread is about the scanset
created by jon noring for his "my antonia" demo:
>   http://www.openreader.org/myantonia

since the scans themselves are a fairly hefty download,
at 31 megs, you can grab the 5-meg djvu version instead.
that allows you to follow along if i refer to certain pages.

i've uploaded the text-file that finereader v7
generates as o.c.r. output for "my antonia".
this text-file can be downloaded from this u.r.l.:
>   http://snowy.arsc.alaska.edu/bowerbird/myantonia1.txt

or, for those of you who would prefer a .zip version instead:
>   http://snowy.arsc.alaska.edu/bowerbird/myantonia1-txt.zip

i will probably make some references to this text-file
in coming days, so if any of you wanna follow along,
you should download it and become familiar with it...
(and it's only a 500k download, under 200k for the .zip.)

one of the most immediate ways that online scans
can facilitate the goals of project gutenberg is to
make it possible for people to check the text against
the scans themselves, to make sure that it's accurate.

generally, this is what distributed proofreaders does --
present the text alongside the scan so it can be proofed.

but the relevance of the parallel is much more specific;
for that we need to delve deeper into the o.c.r. output...

as i understand it, d.p. has recently switched to a new
methodology, which splits proofing and formatting
into separate rounds (with 2 rounds for each of them).

part of the formatting involves "meta-formatting".
(i think this is part of a formatting round, anyway,
not a distinct and separate "round" of its own, but
if i am wrong about any of this, i assume that one of
the people from d.p. will step in and correct my error.)

one aspect of the "meta-formatting" is a checklist that,
among other things, indicates if a chapter-heading exists
on a specific page.  this checklist is used to ensure that
the proper markup gets applied to that chapter-heading.

that is, each page-scan is pushed out to a human being,
who then marks an item on a list if it has a header on it.
(and other items on the list if it has those other features.)

as i have indicated in the past, it is not difficult to write
computerized routines to sniff out these section-headers;
so it is simply a ridiculous waste of valuable resources to
have human beings make this determination instead.

it is much smarter to use the computer to do the bulk of
the work at the outset, and then have a human check it.

appended is the output from such a routine i wrote in
just a short time -- the program is under 50 lines long.

if you check against the "my antonia" text, or the scans,
you will see that this routine has successfully identified
the pages in the book on which there is a section-break.
it gives you the page-number, and does the best it can
to tell you the actual _text_ of the header on that page.

programs like these only take a few minutes to run, so
it's easy to see this is a more efficient way to proceed,
compared to having a person drudge through every page.

i won't tell you exactly _how_ this program operates,
because it would do you good to look at pages where
there is a section-break, and come up with an answer.

a hint for you is that there are _numerous_ indicators,
any one of which is sufficient in this current example...

you might remember that i have a 30-item checklist
consisting of dimensions that are indicative of headers.
how many of the 30 items can _you_ come up with?

feel free to share your answers with the whole listserve.
if enough people come up with enough of the indicators,
i will share the source-code of the routine i wrote here...

in fact, just for fun, i wrote another quick little routine,
using another one of my 30 indicators, and that routine
gave me the output that i appended in the second p.s.
as you see, this routine produced excellent results too.

i have said it before, but i'll repeat it here:
headers are specifically _designed_ to draw attention,
so it is easy to locate them, even in raw o.c.r. output.
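
to make that concrete, here is a minimal sketch in python of what
one such indicator could look like.  it is not the routine that
produced the appended output, and it is not necessarily one of the
30 items; it is just an illustration.  it assumes the o.c.r. text has
already been split into a list of page strings, and the names
HEADER and find_headers are hypothetical.  the indicator is simply
"a short line near the top of the page that is nothing but a roman
numeral, optionally preceded by the word 'Book'".

```python
import re

# hypothetical indicator: the whole line is a roman numeral, optionally
# preceded by "Book".  o.c.r. often reads "III" as "Ill", so a lowercase
# "l" is accepted as a stand-in for "I".
HEADER = re.compile(r"^(Book\s+)?[IVXl]+$")

def find_headers(pages, top=3):
    """pages: list of page texts, where index 0 is page 1 (an assumption).

    returns a list of (page_number, header_text) for every page whose
    first `top` non-blank lines contain a line matching HEADER.
    """
    found = []
    for number, text in enumerate(pages, start=1):
        # drop blank lines and surrounding whitespace
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        for line in lines[:top]:
            if HEADER.match(line):
                found.append((number, line))
                break  # one header per page is enough for a checklist
    return found
```

a sketch like this only proposes candidates; a human would still
glance at the list to confirm it, which is the whole point.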

but heck, before you know it, you'll be smart enough to
figure out how to determine _other_ structures as well...

another item on the "meta-formatting" checklist is
_footnotes_.  there is only one footnote in this text,
but based on that, how would you write a routine to
identify any footnotes?  how about block quotations?
expressions in a foreign language?  tables?  lists?
poems?  all of the various aspects contained in plays?

it's not hard.  give it a try...
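
as a starting point for the footnote question, here is a rough
sketch in python.  it rests on assumptions i am making purely for
illustration: that pages arrive as a list of strings, that a footnote
sits in the last few lines of its page, that it starts with a marker
such as "*" or "†", and that the same marker is attached to a word
in the body above it.  the names MARK and find_footnotes are
hypothetical, and a real book may well use other conventions.

```python
import re

# assumed footnote markers; a line at the foot of the page must start
# with one of them, followed by some text
MARK = re.compile(r"^([*†])\s*\S")

def find_footnotes(pages, bottom=5):
    """return (page_number, footnote_line) candidates.

    a line counts as a footnote only if it is among the last `bottom`
    non-blank lines AND its marker also appears glued to a word in
    the body text above those lines.
    """
    found = []
    for number, text in enumerate(pages, start=1):
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        cut = max(len(lines) - bottom, 0)
        body = " ".join(lines[:cut])
        for line in lines[cut:]:
            m = MARK.match(line)
            # require the matching reference in the body as well
            if m and re.search(r"\w" + re.escape(m.group(1)), body):
                found.append((number, line))
    return found
```

requiring both halves (the reference in the body and the note at
the foot) is a design choice: a lone stray marker in the o.c.r.
will not trigger it by itself.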

-bowerbird

p.s.  here's that output...

3              Book I
9              II
21             Ill
31             IV
36             V
42             VI
48             VII
57             VIII
70             IX
80                             For several weeks after my sleigh-ride, we
91             XI
96             XII
101            XIII
108            XIV
119            XV
131            XVI
137            XVII
145            XVIII
156            XIX
160            Book II
162            Book II
168            II
176            Ill
181            IV
193            V
197            VI
206            VII
220            VIII
225            IX
233                            It was at the Vannis* tent that Antonia was
238            XI
244            XII
258            XIII
264            XIV
280            XV
288            Book III
290            Book III
298            II
307            Ill
315            IV
332            Book IV
334            Book IV
342            II
346            Ill
361            IV
366            Book V
368            Book V
399            II
415            Ill
441                            COPYRIGHT, 1918, BY WILLA SIBKRT CATHKR
443            CONTENTS
445            INTRODUCTION


p.p.s. and here's the output from the second routine.  i've left a big
hint in here about how this routine operates; can you figure it out?

1=2             2=2
3=24            4=30
8=29            9=26
20=20           21=26
30=28           31=25
35=29           36=26
41=23           42=26
47=17           48=25
56=7            57=26
69=17           70=26
79=6            80=25
90=16           91=26
95=18           96=25
100=13          101=26
107=10          108=27
118=23          119=26
130=8           131=25
135=28          136=22
137=26          138=30
144=25          145=26
155=26          156=26
159=27          160=2
161=3           162=2
163=24          164=30
167=27          168=26
175=16          176=25
180=24          181=26
192=13          193=26
196=12          197=26
205=25          206=26
219=10          220=27
224=29          225=25
232=18          233=25
237=22          238=26
243=19          244=26
257=6           258=26
263=17          264=26
279=28          280=26
288=21          289=3
290=2           291=24
297=18          298=26
306=28          307=26
314=29          315=26
331=29          332=17
333=3           334=2
335=25          336=30
340=29          341=16
342=26          343=30
345=10          346=26
359=29          360=8
361=26          362=30
365=20          366=2
367=3           368=2
369=24          370=30
398=20          399=26
414=8           415=25
419=25          420=4
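
for anyone who wants to experiment, here is one possible reading
of the numbers above, sketched in python.  this is a guess and not
the routine that produced them: it treats each entry as
"page=number of lines on that page".  under that reading, the pages
the first routine flagged (9, 21, 31 and so on) show 25 or 26, while
values around 30 show up elsewhere in the list.  the function names
and the `slack` parameter are hypothetical, and pages are assumed to
be a list of strings.

```python
from statistics import median

def line_counts(pages):
    """number of non-blank lines on each page, in page order."""
    return [sum(1 for ln in text.splitlines() if ln.strip()) for text in pages]

def short_pages(pages, slack=2):
    """return (page_number, line_count) for pages noticeably shorter
    than a typical page, where "typical" is the median line count."""
    counts = line_counts(pages)
    if not counts:
        return []
    full = median(counts)
    return [(n, c) for n, c in enumerate(counts, start=1) if c < full - slack]
```

a page that ends a chapter early and a page that starts one lower
down would both come out short under this measure, so the two
would tend to show up side by side.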