these grapes are sweet -- lesson #07

this thread explains how to digitize a book quickly and easily.
search archive.org for "booksculture00mabiuoft" for the book.

***

benjamin said:
I use groovy.

cool... ;+)

feel free to port any of my python code over to that platform.

also, benjamin, since you might be the only "new" guy here, i will be open to any feedback you'd have on a web interface for proofing and editing o.c.r. text alongside its pagescan... you saw one possibility, in the links i gave recently:

http://z-m-l.com/go/mabie/mabiep123.html
http://z-m-l.com/go/myant/myantp123.html
http://z-m-l.com/go/sgfhb/sgfhbp123.html

i have also floated others, but i'm interested in any designs that you might've generated and would be willing to share...

***

ok, so here we go...

since we're using o.c.r. text from archive.org, it'll be good to review some of the basic information about their typical files.

1. you can get to the files for a book from its "details" page:

http://www.archive.org/details/booksculture00mabiuoft

click the "all files: http" link you find in the leftmost column. this will take you to an "index" page that lists all of the files.

the file with "djvu.txt" at the end of its name holds the o.c.r. this is the file we'll be working on, so take a good look at it. does the text look like o.c.r. did a good job of recognition?

in some of its early books, archive.org "lost" the em-dashes, so that output is totally worthless; don't even bother with it! some files are also missing quotemarks or end-line hyphens. archive.org also scanned some files where library patrons had scrawled notes in the margins, or underlined some passages, which -- as you might imagine -- wreaks havoc with the o.c.r.

before using any archive.org text, ensure it's not fatally flawed; your time is much better spent seeking a _worthwhile_ copy... (archive.org itself often has multiple copies of a specific book.) if you can't find any better o.c.r., do some other book instead! or, if you have to do _that_ book, just re-do the o.c.r. yourself. life is too short to waste your time correcting a bad o.c.r. file. thank goodness, most of the recent archive.org files are good.

2. we also need to be able to review the actual page-scans, to resolve questions. archive.org hosts the scans, both in a high-resolution form (which is far too bulky to bother with) and a lower-resolution (which is more than good enough)... they're in a .zip file there on the "index" page for the book. because i have a tool that shows me each scan alongside the o.c.r. text for that page, i usually download the actual scans.

but in addition to the scans, archive.org offers .djvu and .pdf _output_ formats that collect the scan-set so you can view it, and a one-file format is often better than hundreds of scans. .djvu is pretty good, because it's smaller than the .pdf, plus _sometimes_ the o.c.r. text is included too, so it's searchable. (realistically, though, all we usually need is the pagenumber.) .pdf often comes in either black-and-white or "color", which means a sepia background was inserted (for some reason)...

3. so now let's concentrate on the text in the actual o.c.r. file. after you've downloaded it, there is some basic cleaning to do. there's a space at the end of every line, including "blank" ones, so we must do a global search-and-replace to delete them. i usually do this first, just to get it out of the way right away...

4. pagebreaks are indicated by 3 blank lines. unfortunately, sometimes a 3-blank-line sequence occurs elsewhere as well. which means that we can't _depend_ on it being a pagebreak. which is quite sad, yes. still, for the most part, it is one. we'll often need to look at a scan to resolve an o.c.r. glitch, so pagenumbers will be important to us, so pagebreaks are too. i replace these 3-blank-line sequences with a distinctive line. (a line full of tildes, for instance, if there are none in the file.) this way i know that's where the pagebreaks (probably) were.

5. the runheads (at the top of a page) _are_ retained, as are pagenumbers when they got printed at the bottom of a page. this is good news. (d.p. always decapitates the runheads and pagenumbers; thus their scans are devoid of identifying info; just one of several idiotic things at distributed proofreaders.) the runheads and pagenumbers keep us grounded in the file; otherwise it can be too easy to get lost in the vast sea of text.

but the runheads and pagenumbers often have garbled o.c.r. (this might be because their fonts often differ from the body.) luckily, there is enough redundancy that we can control this; runheads almost always exhibit heavily-repetitive elements, and pagenumbers have an extremely-predictable sequence.

one of the first things that i generally do is fix pagenumbers; it speeds up later changes that require any review of a scan... a few easy reg-ex searches can do wonders on pagenumbers. the best situation is when pagenumbers were on the bottom of the p-book pages, because then they are followed by the pagebreak, which is in turn followed by the runhead, so we have an extremely predictable sequence of lines going on... (and luckily for us, that's the situation we have in this book.)

6. front-matter is almost always recognized badly by o.c.r. i've spent a lot of time analyzing o.c.r. output to determine how it can be corrected automatically. but sometimes you just have to throw in the towel and admit you are defeated. such is the case with front-matter. you must fix it by hand.

this is not necessarily a bad thing, since the contents page gives you an overview of the book that's very good to have. for example, for this thread, it informed me that our book had _24_ chapters, so when a search for "chapter" ended at chapter 23, i knew i had a problem that i needed to check. and sure enough, i found it, right there on "chapler xxiv".

which reminds me that i always check the chapter-headers, to make sure they were recognized correctly, but also that they're on the same page as listed in the table of contents. (this is one of the redundancy checks that it's good to do.)

when i check the chapter-headers, i give 'em proper z.m.l., which means 4 blank lines above them, and 2 below them. i also give them a pseudo-runhead and pagenumber, since those were typically eliminated from chapter-header pages, whereas our process will count on 'em being on every page.

also when i do this job, i often fix the drop-caps/small-caps (which are a _staple_ of the chapter-header pages), because o.c.r. always fails, in a spectacular fashion, to get 'em right.

7. "scene-breaks" from the print-book are _not_ recorded as empty lines in the o.c.r. output... so they are essentially lost; we'll need to do a page-by-page look-through to restore 'em before we're finished. italics are also lost, so ditto for them. and the third leg of this final look-through on the full book is to make sure that all the paragraphing was done correctly, with particular focus on those that start at the top of a page.

happily, most paragraphs _do_ get separated by blank lines... (but it's also the case that tight leading on the original page -- a typographical tactic often employed on block-quotes -- sometimes results in mid-paragraph intervening blank lines.)

again, this 3-pronged check is done at the end of the process, so this information is merely a heads-up at this point in time.

8. i forget what 8 was for.

9. archive.org generally scans _all_ of the pages in a book, including the blank pages at the end, which often have stuff on them, such as the library form showing the date on which the book had to be returned... this extraneous back-matter is scrubbed, of course, but _do_ look at the scans, and enjoy! let your imagination roam, about the people who borrowed that book, and read it, and what they might've thought of it. it can make you quite nostalgic...

10. i'm gonna return for a minute to discuss scan-viewing... as i said above, both the .djvu and .pdf options are very good. and you can always download the .zip file of all of the scans; loading them into a file-viewer makes them quite accessible, by choosing the one you want to view based on its filename.

one glaring problem with all of these approaches, however, is that archive.org never adopted a _smart_ policy regarding their filenames, meaning the number you see in the filename usually is _not_ the pagenumber for that scan. instead, it is the sequential number of that scan within the full scan-set...

this stupid filenaming-policy means that you have to know the "offset" for each pagenumber in order to know its filename. if you're lucky -- and we were, in this book -- the "offset" is an unchanging number through the entire book. in this book, the offset is "6", meaning that if you want to see the scan for page 123, for example, you need to view the scan for "129".

likewise with the .djvu and .pdf files. however, on the mac, at least, the "preview.app" system graphics-viewer-program can _insert_and_delete_ pages from a .pdf, meaning that you can delete some of those unnecessary front-matter pages so the pagenumbers of the .pdf will match the real pagenumbers. this makes your life easier.

finally, there is one more option. for every book, archive.org offers a streaming viewer-program. for instance, for the mabie book we're using here, the u.r.l. is:

http://www.archive.org/stream/booksculture00mabiuoft#page/123

that "page 123" part of the u.r.l. there actually shows page 123, and _not_ page 117, as you might expect, given the "6" offset... also, the finger-pointer slider at the bottom of the display is based upon the "real" pagenumbers too, so you can use that. so this is a way of getting around the pesky "offset" nonsense.

but this app is only available if you have an internet connection, and it downloads each pagescan as needed, so it can be _slow_, especially if/when your internet connection isn't all too speedy. plus every time you view a page again, it has to be downloaded yet again (unless it happens to still be in your browser cache)... and realize that eventually we're going to "step through" each and every page in the whole book, and probably more than once.

so i certainly can't unequivocally recommend this option to you. i believe it's better to download the scans once, and work offline. but, for the sake of being complete, i've informed you about it...

11. so much for the overview on the files you'll encounter... so let's review the process we'll go through.

we will first do the first-pass cleaning of the o.c.r.

then we'll check that the pagenumbers and runheads are correct throughout the text.

after that, we'll do the rest of the "obvious" cleaning needed, referring back to the scans whenever it is required to do so...

next, we'll do a spellcheck and fix the warts.

then, after that, we'll smooth any rough edges, and do our last-minute check.

the final step will be to convert into various output formats...

***

our second program, a slight mod from the first, is appended. you can run -- and view -- this program here:
http://zenmarkuplanguage.com/grapes102.py
http://zenmarkuplanguage.com/grapes102.txt
this second script auto-fixes the floating punctuation and skips over the unnecessary front-matter and back-matter.

-bowerbird

p.s. here's the python source-code for our second program...

#!/usr/bin/env python3
import re
import urllib.request
from html import escape

# fetch the o.c.r. text for the book from archive.org
url = ("http://ia700300.us.archive.org/1/items/booksculture00mabiuoft/"
       "booksculture00mabiuoft_djvu.txt")
with urllib.request.urlopen(url) as f:
    s = f.read().decode("utf-8", "replace")

# normalize all line-endings to "\n"
s = s.replace("\r\n", "\n")
s = s.replace("\r", "\n")

# delete the spaces at the end of every line
s = re.sub(r" +\n", "\n", s)

# make sure the text starts and ends with a newline
s = "\n" + s + "\n"

# close up "floating" punctuation (a space before , ; : ! or ?)
s = re.sub(r" ([,;:!?])", r"\1", s)

print("Content-type: text/html\n\n")
print("<html><head><title>")
print("grapes102.py")
print("</title></head><body><pre>")

pg = s.split("\n")
maxpg = len(pg)

# skip over the unnecessary front-matter and back-matter
startat = 236
endat = min(7764, maxpg)

# print each line, prefixed with its zero-padded line-number
for i in range(startat, endat):
    print(str(i).zfill(4) + " " + escape(pg[i]))

print("</pre><hr></body></html>")
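
a possible next step, sketched in python 3. the script above only covers step 3 (the trailing spaces); the helpers below illustrate step 4 (marking pagebreaks), the pagenumber-sequence check from step 5, and the offset arithmetic from step 10. none of this comes from the grapes scripts: the function names, the 40-tilde marker line, and the assumption that the last non-blank line of each page is its printed pagenumber are all hypothetical choices made for illustration. the offset of 6 is the one given above for this book.

```python
import re

# hypothetical marker: a line of 40 tildes (any distinctive line would do)
PAGEBREAK_MARKER = "~" * 40


def mark_pagebreaks(text, marker=PAGEBREAK_MARKER):
    """step 4: turn each run of exactly 3 blank lines into one marker line."""
    # the marker is only distinctive if the file does not already contain it
    if marker in text:
        raise ValueError("text already contains the marker; pick another")
    # step 3 again, so that "blank" lines really are empty
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # 3 blank lines between two text lines means 4 newlines in a row
    return re.sub(r"(?<!\n)\n{4}(?!\n)", "\n" + marker + "\n", text)


def check_pagenumbers(text, marker=PAGEBREAK_MARKER):
    """step 5: flag pages whose last line breaks the pagenumber sequence.

    assumes pagenumbers were printed at the bottom of the page, so the
    last non-blank line before each marker should be the pagenumber.
    returns a list of (page_index, last_line, expected_number) tuples.
    """
    problems = []
    expected = None  # unknown until the first clean pagenumber is seen
    for index, page in enumerate(text.split("\n" + marker + "\n")):
        lines = [ln.strip() for ln in page.split("\n") if ln.strip()]
        last = lines[-1] if lines else ""
        if re.fullmatch(r"[0-9]+", last):
            if expected is not None and int(last) != expected:
                problems.append((index, last, expected))
            expected = int(last) + 1
        elif expected is not None:
            # garbled or missing pagenumber: report it, keep counting
            problems.append((index, last, expected))
            expected += 1
    return problems


def scan_number_for_page(pagenumber, offset=6):
    """step 10: sequential scan number that holds a printed pagenumber."""
    return pagenumber + offset
```

since a 3-blank-line sequence is only _probably_ a pagebreak, anything check_pagenumbers() reports still needs a look at the scan; it just narrows down where to look. and scan_number_for_page(123) gives 129, matching the page-123 example above.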