
today's lesson will bear on distributed proofreaders. as a reminder, this thread is about the scanset created by jon noring for his "my antonia" demo:
since the scans themselves are a fairly hefty download, at 31 megs, you can grab the 5-meg djvu version instead. that allows you to follow along if i refer to certain pages.

i've uploaded the text-file that finereader v7 generates as o.c.r. output for "my antonia". this text-file can be downloaded from this u.r.l.:
or, for those of you who would prefer a .zip version instead:
i will probably make some references to this text-file in coming days, so if any of you wanna follow along, you should download it and become familiar with it... (and it's only a 500k download, under 200k for the .zip.)

one of the most immediate ways that online scans can facilitate the goals of project gutenberg is to make it possible for people to check the text against the scans themselves, to make sure that it's accurate.

generally, this is what distributed proofreaders does -- present the text alongside the scan so it can be proofed. but the relevance of the parallel is much more specific; for that we need to delve deeper into the o.c.r. output...

as i understand it, d.p. has recently switched to a new methodology, which splits proofing and formatting into separate rounds (with 2 rounds for each of them). part of the formatting involves "meta-formatting". (i think this is part of a formatting round, anyway, not a distinct and separate "round" of its own, but if i am wrong about any of this, i assume that one of the people from d.p. will step in and correct my error.)

one aspect of the "meta-formatting" is a checklist that, among other things, indicates if a chapter-heading exists on a specific page. this checklist is used to ensure that the proper markup gets applied to that chapter-heading.

that is, each page-scan is pushed out to a human being, who then marks an item on a list if the page has a header on it. (and other items on the list if it has those other features.)

as i have indicated in the past, it is not difficult to write computerized routines to sniff out these section-headers; so it is simply a ridiculous waste of valuable resources to have human beings make this determination instead. it is much smarter to use the computer to do the bulk of the work at the outset, and then have a human check it.

appended is the output from such a routine i wrote in just a short time -- the program is under 50 lines long.
if you check against the "my antonia" text, or the scans, you will see that this routine has successfully identified the pages in the book on which there is a section-break. it gives you the page-number, and does the best it can to tell you the actual _text_ of the header on that page.

programs like these only take a few minutes to run, so it's easy to see this is a more efficient way to proceed, compared to having a person drudge through every page.

i won't tell you exactly _how_ this program operates, because it would do you good to look at pages where there is a section-break, and come up with an answer. a hint for you is that there are _numerous_ indicators, any one of which is sufficient in this current example...

you might remember that i have a 30-item checklist consisting of dimensions that are indicative of headers. how many of the 30 items can _you_ come up with? feel free to share your answers with the whole listserv. if enough people come up with enough of the indicators, i will share the source-code of the routine i wrote here...

in fact, just for fun, i wrote another quick little routine, using another one of my 30 indicators, and that routine gave me the output that i appended in the second p.s. as you see, this routine produced excellent results too.

i have said it before, but i'll repeat it here: headers are specifically _designed_ to draw attention, so it is easy to locate them, even in raw o.c.r. output.

but heck, before you know it, you'll be smart enough to figure out how to determine _other_ structures as well...

another item on the "meta-formatting" checklist is _footnotes_. there is only one footnote in this text, but based on that, how would you write a routine to identify any footnotes?

how about block quotations? expressions in a foreign language? tables? lists? poems? all of the various aspects contained in plays?

it's not hard. give it a try...

-bowerbird

p.s. here's that output...
3 Book I
9 II
21 Ill
31 IV
36 V
42 VI
48 VII
57 VIII
70 IX
80 For several weeks after my sleigh-ride, we
91 XI
96 XII
101 XIII
108 XIV
119 XV
131 XVI
137 XVII
145 XVIII
156 XIX
160 Book II
162 Book II
168 II
176 Ill
181 IV
193 V
197 VI
206 VII
220 VIII
225 IX
233 It was at the Vannis* tent that Antonia was
238 XI
244 XII
258 XIII
264 XIV
280 XV
288 Book III
290 Book III
298 II
307 Ill
315 IV
332 Book IV
334 Book IV
342 II
346 Ill
361 IV
366 Book V
368 Book V
399 II
415 Ill
441 COPYRIGHT, 1918, BY WILLA SIBKRT CATHKR
443 CONTENTS
445 INTRODUCTION

p.p.s. and here's the output from the second routine. i've left a big hint in here about how this routine operates; can you figure it out?

1=2 2=2 3=24 4=30 8=29 9=26 20=20 21=26 30=28 31=25
35=29 36=26 41=23 42=26 47=17 48=25 56=7 57=26 69=17 70=26
79=6 80=25 90=16 91=26 95=18 96=25 100=13 101=26 107=10 108=27
118=23 119=26 130=8 131=25 135=28 136=22 137=26 138=30 144=25 145=26
155=26 156=26 159=27 160=2 161=3 162=2 163=24 164=30 167=27 168=26
175=16 176=25 180=24 181=26 192=13 193=26 196=12 197=26 205=25 206=26
219=10 220=27 224=29 225=25 232=18 233=25 237=22 238=26 243=19 244=26
257=6 258=26 263=17 264=26 279=28 280=26 288=21 289=3 290=2 291=24
297=18 298=26 306=28 307=26 314=29 315=26 331=29 332=17 333=3 334=2
335=25 336=30 340=29 341=16 342=26 343=30 345=10 346=26 359=29 360=8
361=26 362=30 365=20 366=2 367=3 368=2 369=24 370=30 398=20 399=26
414=8 415=25 419=25 420=4
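
one possible reading of the hint in that second list, offered as a guess and not as the routine that produced it: each entry looks like page=line-count, and no count above 30 appears, so the list may simply be the pages that fall short of a full page. a short page tends to sit at a section-break, either because a chapter ends partway down or because a heading and its whitespace eat into the next one. the sketch below tries that idea in python. the form-feed page separator, the function names, and the "most common count is a full page" rule are all assumptions made for illustration; the actual finereader text-file may mark its pages differently.

```python
from collections import Counter


def lines_per_page(text, page_sep="\f"):
    """Map page number (starting at 1) to its count of non-blank lines.

    Assumes pages are separated by `page_sep`; the form feed default
    is a guess, not something known about the FineReader output.
    """
    counts = {}
    for number, page in enumerate(text.split(page_sep), start=1):
        counts[number] = sum(1 for line in page.splitlines() if line.strip())
    return counts


def short_pages(counts):
    """Return the pages holding fewer lines than a typical full page.

    Assumes the most common line count in the book is a full page.
    """
    if not counts:
        return {}
    full = Counter(counts.values()).most_common(1)[0][0]
    return {page: n for page, n in counts.items() if n < full}


def format_report(pages):
    """Render the result as page=count pairs, one per line."""
    return "\n".join(f"{page}={n}" for page, n in sorted(pages.items()))
```

to try it, read the text-file into a string, pass it to lines_per_page, and print format_report(short_pages(...)). a human would still need to check the candidates, since a page can also run short for reasons that have nothing to do with a heading.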