a review of some digitization tools -- 018

first, let's just cut directly to the chase today... we're going to compare my converted .html to the .html version of "books and culture" which project gutenberg has posted...
here are a bunch of head-to-head screenshots:
http://zenmagiclove.com/misc/bac-html001.jpg
http://zenmagiclove.com/misc/bac-html002.jpg
http://zenmagiclove.com/misc/bac-html003.jpg
http://zenmagiclove.com/misc/bac-html004.jpg
http://zenmagiclove.com/misc/bac-html005.jpg
http://zenmagiclove.com/misc/bac-html006.jpg
http://zenmagiclove.com/misc/bac-html007.jpg
http://zenmagiclove.com/misc/bac-html008.jpg
http://zenmagiclove.com/misc/bac-html009.jpg
http://zenmagiclove.com/misc/bac-html010.jpg
http://zenmagiclove.com/misc/bac-html011.jpg
http://zenmagiclove.com/misc/bac-html012.jpg
across the board, we see that the two versions are very much the same, with only occasional differences. and most of the differences are "fixed" now, which you can verify, if you like, by running the program.
it was a simple matter of tweaking my script a bit. for instance, i had to revert my curly-quotes back to straight-quotes, repair some typos, delete the spaces i'd put around my em-dashes, and so on. so now the two look almost completely identical...

this makes it abundantly clear that -- at least for this simple book -- my script works just as well as the "hand-formatting" performed by this volunteer. so that's the main takeaway of today's lesson...

that's unsurprising. if/when a book consists mostly of headers and plain paragraphs, then all you have to do to make different versions look the same is to ensure that the basic setup of those two structures is equivalent.

***

so now let's circle back around for the exposition...

***

we'll start with a short refresher on the workflow. we whipped the text into shape, and now we're ready to convert it into .html -- "and beyond", as buzz lightyear has been known to proclaim.

you can see the text file here:
it looks very much like an o.c.r. output file, or a "concatenated text file" (c.t.f.) from d.p. it's also quite reminiscent of a p.g. e-text... you know, the kind that jim labels "txt70", and lee calls "an impoverished text-file"...

it's also a .zml file. it follows the z.m.l. rules. my thanksgiving-day python script has been converting that .zml file into our .html output.

lots and lots of people -- including jim and lee -- have suggested that p.g. should mandate .html, or that _any_ cyberlibrary should use .html as its "master-format" -- the one it actively maintains. what these guys are never clear about is exactly _how_ volunteers are supposed to transform the _text_ files (of the type we mentioned up above, the type you end up with after your digitization) _into_ .html. they're strangely mum about that. but _i_ have an answer -- i.e., a script like mine!

are there other converters which could be used? well, of course there are. but the usual situation is that the other guy's converter doesn't convert the same way that _you_ would prefer to convert... (we see that keith said that very thing just today... indeed, he just assumes that that'll be the case.)

can't you do text-markup "by hand"? using just a text-editor? that's what these guys usually say. and yes, you certainly _can_ do markup by hand. not everyone can, or wants to, but "you" can. it's just not a lot of fun. a lot of it is grunt-work. so you find yourself using "regular expressions" -- if you know what they do and how they work. and then you find yourself with _lots_ of reg-ex, and the next thing you know, you've written a script... so you might as well _start_off_ writing a script... especially since i am showing that it's so easy...

***

now, you couldn't be blamed if you thought that the appeal of a script is that _it_makes_life_easy_ regarding the work involved in coding the .html. but that's only half of it. the other benefit of a light-markup approach is that the output will be _dependably_consistent_.

from the perspectives of _re-use_ and _re-mix_, the ability to _know_ that the output is _consistent_ is extremely valuable, because you can depend on the output structure and treat it programmatically. and that gives us _immense_power_ down the line.

this contrasts _significantly_ with the reality of p.g., as it is currently construed, which is _inconsistency_. very little p.g. .html has "dependable consistency"... even files from the same producer can -- and do! -- take different approaches, sometimes varying widely. and, as you'd imagine, files from different producers often have almost nothing in common between them.

so these .html files can't be treated programmatically. you can try, and you might have some success, more or less, depending on how ambitious what you're doing is, but you can't go in _expecting_ to have 100% success. or even 70%. on some things, you might get only 20%. essentially, every .html version is a "snowflake", in that it is different from every other .html file in the library.

and we see this demonstrated in the very book that we've been using in this series -- "books and culture" -- so let's discuss this specific example next...

***

as we have seen, "books and culture" is a simple book, from the standpoint of digitization, since it consists of plain old paragraphs and chapter-headers, with only a couple of blockquotes to break up the monotony... so you'd think "the markup" on it would be predictable. i certainly thought so; but as it turned out, i was wrong.
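(before we dig into that, here is a bare-bones sketch of the kind of converter i keep talking about. to be clear, this is _not_ my actual script -- the rules it assumes, e.g. "## " marking a chapter header and blank lines separating paragraphs, are just stand-ins for the real z.m.l. rules -- but it shows how few lines it takes to get dependably consistent headers and paragraphs out of a plain text file, plus the punctuation tweaks mentioned at the top of this post.)

    import re, sys

    def convert(text):
        # the tweaks mentioned above: curly quotes back to straight quotes,
        # and no spaces around the double-hyphen em-dashes
        text = text.replace("\u201c", '"').replace("\u201d", '"')
        text = text.replace("\u2018", "'").replace("\u2019", "'")
        text = re.sub(r" *-- *", "--", text)

        chunks = re.split(r"\n\s*\n", text.strip())   # blank line = new block
        out = []
        for chunk in chunks:
            if chunk.startswith("## "):               # assumed header rule
                out.append("<h2>%s</h2>" % chunk[3:].strip())
            else:
                # keep the p-book linebreaks inside the paragraph
                out.append("<p>%s</p>" % chunk)
        return "<html><body>\n%s\n</body></html>" % "\n".join(out)

    if __name__ == "__main__":
        print(convert(open(sys.argv[1], encoding="utf-8").read()))

run it as "python sketch.py mybook.txt > mybook.html" and every book processed this way comes out with the same structure -- which is exactly the "dependable consistency" point made above.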
in order to make my version "look like" the p.g. version, i was originally going to simply copy in their .css file... but as soon as i did that, i realized there'd be trouble... i've appended the .css, so you can see for yourself...

for a straightforward book such as this one, we'd expect the .css file to be sparse, if there was one at all. about the only thing that _routinely_ needs to be altered for the look-and-feel of old books is to center headings. (and it's often better to achieve that with a "style" attribute inside the header tag itself, most especially since the kindle is infamous for not even using the .css file.)

but the .css for this book from p.g.? it's _not_ "sparse". it defines some styles, a few of 'em quite questionable... the most troubling, in that vein, is called ".chaptertitle", but ".chapter" and ".quote" are also of concern. the reason the "chapter" classes are a red flag is that you generally want the chapters to get [h] header tags. structurally, that is what they "are", and a number of tools are geared toward this "semantic" interpretation.

sure enough, when i looked at the .html tags within, there was indeed trouble. as feared, chapter headers were not marked with [h] header tags, but with a [p]. that's right. the headers were declared as paragraphs. oops! i will let someone else give you the "tag abuse" lecture, and dole out the appropriate flogging here... but you don't have to be compulsive to see that's bad.

we also find the blockquotes were marked with a [p], with the class defined as ".quote" -- more "tag abuse". ditto with the footnote, although that one is less clear-cut.

so, you ask, was anything actually tagged as a header? well, as a matter of fact, yes. here's the full list of them:
[h1] BOOKS AND CULTURE
[h4] By
[h2] HAMILTON WRIGHT MABIE
[h4] NEW YORK: PUBLISHED BY
[h4] MDCCCCVII
[h4] [i] Copyright, 1896
[h4] [span class="sc"] By Dodd, Mead and Company
[h4] University Press:
[h4] To
[h3] EDMUND CLARENCE STEDMAN
[h3] CONTENTS
oops! they did it again. aside from the [h1] and maybe the [h2] -- or should i say "mabie" the [h2], hahaha! -- that's pretty much presentational tagging throughout... meaning more lecturing and some additional flogging. all in all, even on this very "simple" book, sorry to say, the tagging would have to be assigned a failing grade. i guess if you wanted to be kind, you could give it a "d".

***

now some people might offer up, as an explanation, that this book was done a long time ago -- october of 2005. but if you think about it, that "excuse" really backfires... because the problem is _not_ that the postprocessor was incapable of handling a relatively difficult task... on the contrary, the postprocessor tried to "get fancy", and basically tripped over themselves on a simple task. it's like the person who misses the very first question on "who wants to be a millionaire?". what were they thinking?

and believe me, in the ensuing years, the postprocessors over at d.p. have gotten even _more_ fancy with their code. they've accumulated lots of crud, and a ton of tricks as well, and they wanna throw _all_ of it into every project they do. so the snowflake problem is getting _worse_, not better...

***

it is also worth noting, before we leave this book behind, that this was a book which had _extremely_high_visibility_ -- an excruciating amount -- at the time it was posted... it has the distinction of being the first public-domain book which google released as part of its scanning project. so it was a major milestone when d.p. picked it up to do it. that -- all by itself -- would've made it a memorable book. but this book also marked another historical turning-point.

people who were around back then might remember that the distributed proofreading site had come into its own... the volunteers over there were feeling very smug about it, claiming loudly that their accuracy trumped everyone else's. they were superior. and their arrogance knew few bounds. pride goeth before destruction, a haughty spirit before the fall.

the truth is that they _were_ "superior", at least compared to the rather flawed e-texts that comprised the p.g. library... but their accuracy was far short of the perfection they claimed. nonetheless, they proclaimed far and wide that their method of _collaboration_ meant that more eyes checked the text and -- ipso facto -- that the output was more accurate. it's a neat argument, in the sense that it has intuitive appeal...

the thing was that it pissed off some people, who thought that they did a darn good job all by themselves, working all alone... one of those people was a guy by the name of jose menendez. so, when google released "books and culture", jose proofed it. and then he waited for p.g. to proof it. and waited and waited.

and when d.p. _finally_ released its version, and p.g. posted it, some 6 months later, jose pounced on it. and he found juice... it turns out that d.p. didn't do a "perfect" job on its digitization. it didn't even come close, with 48+ errors -- yes, over 48! -- in a 279-page book (i.e., 1 every 6 pages) that is under 210k. and jose detailed all of those errors. painstakingly. ouch.

but that number wasn't the most embarrassing thing, not at all. the most embarrassing thing was that d.p. _dropped_an_entire_page_! it mighta been google's scanning error originally, i can't remember, but it was still true that d.p. failed to notice that missing page... jose hadn't. nor did he miss the 49 errors that d.p. missed. d.p. was aghast.
all of their claims of superiority were dashed. their vaunted collaborative system had been out-performed by a single individual, working all alone, with just his own two eyes. d.p. rushed to fix its errors -- jose helped them very graciously. and p.g. posted the corrected version just as soon as it could. if you look at the "old" folder, you'll see that the update came within a week, which -- as most of you know -- is astounding.

and even with that, there's some history that's been rewritten, since not even the "old" file is missing that big chunk of text. (it ended up that p.g. got the missing page of text right away, but then it neglected to fix the other 48+ errors jose found. when that was pointed out, they did an update in one week.)

the history of all this was documented online at the time... the main "battlefield" for it was the "bookpeople" listserve:
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-09-30,3
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-10-05,4
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-10-06,7
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-10-11,3
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-10-21,4

right around the time of this disillusionment, d.p. decided that it needed to have more than two rounds in its system, splitting the task into "proofing" and "formatting" rounds. and -- at least for a short while, anyway -- d.p. stopped making big claims about how "superior" its system was, and acknowledged that "some" individuals did good work.

gradually, however, that d.p. overconfidence has returned. once again, it got smug, claiming a degree of "perfection" which -- at least the last i checked -- was not warranted... the standard line at d.p. is that "our product has improved", and they conveniently split their history in half, admitting that "the first half" of the books they've done contain flaws, whereas "the last half" reflect the better results they now attain.

this slippery scale is very convenient. back when there were 30,000 books in the library, #16736 was in "the good half". now, however, it has been safely relegated to "the bad half", and any flaws in it -- flaws that've been there since 2005 -- have now been "explained away" as mere historical artifact. but i am sure that lee will be all too happy to tell you that he's been giving his "tag abuse" lecture since before 2005 -- heck, i bet you can find it in the archives of this list -- and that this .html was just as faulty back then as it is now.

my point is a bit different. i stress that this was a high-profile e-text, even back then... so the questionable .html is _not_ due to "a lack of attention". you can bet that the whitewashers looked closely at this one, especially when they had to re-do it twice within two weeks... nevertheless, it _still_ ended up being badly flawed... for a drop-dead simple book like this one, that's shocking. and like i said, things have only gotten "fancier" since then... so if you wanted an illustration of the "snowflake" problem, you'd be hard-pressed to find a better example than this...

***

as i noted, the body of the book looked almost identical between the two versions, which is as we would expect... but there were visual differences in the _frontmatter_, so i will show them to you for the sake of completeness... for instance, here is the very top of the file:
the p.g. legalese is really a turn-off. when will p.g. get smart, and condense the stuff at the very top to one line, embedded in the middle of the cover-page, which just points to the license at the end of the file? after the legalese, the p.g. titlepage is quite nice:
it's nicer than mine. (i didn't put effort into mine.) especially because it has that wee flower doodad... i don't like the fact that the p.g. titlepage bleeds into the following pages, which also bleed into each other, with no separation at all, but maybe that's just me... i also like the look of their table-of-contents:
again, i might need to put some work in... or not... as noted above, the only wrinkle in the book's body was the presence of a couple of blockquote paragraphs. the flow varies, since i retained the p-book linebreaks:
http://zenmagiclove.com/misc/bac-html016.jpg http://zenmagiclove.com/misc/bac-html017.jpg
pedants will also note that my blockquote does _not_ have the indent present in the p-book. (yes, i looked.) in return, i will smile at them, mocking their pettiness. which brings us to the close of today's post...

***

in sum, let's repeat the takeaway of this experiment: for this simple book, my conversion worked as well as the "hand-formatting" performed by a p.g. volunteer. indeed, from the standpoint of _re-mix_consistency_, the product of my conversion comes out far superior...

-bowerbird

p.s. here is the .css file from "books and culture" at p.g.
[style type="text/css"] [!-- body {margin-left: 3em; margin-right: 3em;} p {text-indent: 1em; text-align: justify;} .ctr {text-align: center; text-indent: 0em;} .noindent {text-indent: 0em;} .sc {font-variant: small-caps;} .chapter {text-indent: 0em; margin-top: 2.5em; margin-bottom: 1em; text-align: center; font-size: 115%; font-weight: bold;} .chaptertitle {text-indent: 0em; margin-top: .5em; margin-bottom: 1.5em; text-align: center; font-size: 115%; font-weight: bold;} .quote {text-align: justify; margin-left: 2.5em; margin-right: 2em; text-indent: .75em;} h1,h2,h3,h4,h5,h6 {text-align: center; margin-top: 1.5em; margin-bottom: 1em;} hr.long {text-align: center; width: 95%; margin-top: 2em; margin-bottom: 2em;} hr.med {text-align: center; width: 60%; margin-top: 1.5em; margin-bottom: 2em;} .footnote {font-size: 96%; text-indent: 3em;} ul.nameofchapter {list-style-type: none; position: relative; width: 75%; margin-left: 4em; line-height: 150%; font-size: 76%;} ul.TOC {list-style-type: upper-roman; position: relative; width: 75%; margin-left: 4em; line-height: 150%; font-size: 96%;} .ralign {position: absolute; right: 0;} --] [/style]

BB> you know, the kind that jim labels "txt70", and lee calls "an impoverished text-file"...

And what is perhaps most surprising is that ZML is not in a form that the WWers would be willing to even accept.

BB> what these guys are never clear about is exactly _how_ volunteers are supposed to transform the _text_ files (of the type we mentioned up above, the type you end up with after your digitization) _into_ .html.

What I am not clear about is why BB insists that what one starts from must be "an impoverished text-file", because I never work with text files per se until I am forced to derive one at the end of my html development, as a needless extra step, in order to get the PG WWers to accept my html work. I do not start with "an impoverished text-file" for the simple reason that my OCR gives me better file-format choices which help preserve more of the information available in the original page images, such that I do not have to rediscover and re-enter that information again later manually -- after needlessly throwing that information away in the first place just to reduce the OCR result to txt70.

PS: I call it "txt70" for the simple reason that I wish to distinguish that what PG insists one submit is not a text file in any normal sense, any more than ZML is a normal text file in any normal sense. At least ZML has the arguable advantage that it retains the original line breaks -- but I have shown how these can be easily rederived. And txt70 has a PG-specific requirement to put in manual line breaks at about every 70 characters, not to mention reimagining some of the standard ASCII code points as prosodic markers. PG'ers tend to spend so much time smelling their own roses that they forget that that which they call a text file really isn't a text file, any more than the contents of an html file, or of a ZML file, is a text file.
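(For what it is worth, the rewrapping half of that "needless extra step" is only a few lines of Python. This is just a sketch with a hypothetical filename, not PG's actual tooling, and it ignores the poetry, block quotes, and other special cases that make the real step more annoying than it looks.)

    import sys
    import textwrap

    # rewrap each blank-line-separated paragraph at roughly 70 characters,
    # the way a PG "txt70" submission expects
    text = open(sys.argv[1], encoding="utf-8").read()
    wrapped = []
    for para in text.split("\n\n"):
        flat = " ".join(para.split())        # throw away the existing line breaks
        wrapped.append(textwrap.fill(flat, width=70))
    print("\n\n".join(wrapped))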

Hi Jim,

I cannot tell you why BB does it, but I might be able to explain some of the caveats of the approach. The first is that it is hard to disassociate the markup from the text you want to process, process the text, and put the two back together correctly. Why do you think that MS et al. produce such crappy code?

The other is that for any automatic processing you have to have a known state or structure. The more you know, or the more correct assumptions you can make, the better an algorithm will work. It actually does not matter in the end what format it is. But PG does want a simple text file, and it is easy to use these "impoverished" files. It is easier to go from these to something PG will expect than to do it from a more complex layout, where you have more work to do. It is always easier to go from simple to more complex than from complex to simple.

regards
Keith.

On 14.12.2011 at 04:35, Jim Adcock wrote:
> What I am not clear about is why BB insists that what one starts from must be "an impoverished text-file", because I never work with text files per se until I am forced to derive one at the end of my html development, as a needless extra step, in order to get the PG WWers to accept my html work. I do not start with "an impoverished text-file" for the simple reason that my OCR gives me better file-format choices which help preserve more of the information available in the original page images, such that I do not have to rediscover and re-enter that information again later manually -- after needlessly throwing that information away in the first place just to reduce the OCR result to txt70.
>
> PS: I call it "txt70" for the simple reason that I wish to distinguish that what PG insists one submit is not a text file in any normal sense, any more than ZML is a normal text file in any normal sense. At least ZML has the arguable advantage that it retains the original line breaks -- but I have shown how these can be easily rederived. And txt70 has a PG-specific requirement to put in manual line breaks at about every 70 characters, not to mention reimagining some of the standard ASCII code points as prosodic markers. PG'ers tend to spend so much time smelling their own roses that they forget that that which they call a text file really isn't a text file, any more than the contents of an html file, or of a ZML file, is a text file.

On Tue, December 13, 2011 8:35 pm, Jim Adcock wrote:
> What I am not clear about is why BB insists that what one starts from must be "an impoverished text-file", because I never work with text files per se until I am forced to derive one at the end of my html development, as a needless extra step, in order to get the PG WWers to accept my html work. I do not start with "an impoverished text-file" for the simple reason that my OCR gives me better file-format choices which help preserve more of the information available in the original page images, such that I do not have to rediscover and re-enter that information again later manually -- after needlessly throwing that information away in the first place just to reduce the OCR result to txt70.
I don't get this either. I /never/ start with impoverished text. (Well, okay, I did one once. But it was sooooo painful that I vowed I would never do it again.) If FineReader is offering to save my OCR as HTML (class 2 tag soup), why would I not accept the offer? Use a tool like Tidy to convert to XHTML and I have a file that can easily be manipulated with scripts as well as plain text editors.

This is one of the reasons I want so badly to see kenh's script from archive.org, or at least to have it running. BB is right that the OCR text at Internet Archive is unusable as a starting point for e-books, but I think I could work with the HTML output from that script. If I knew how it worked, I could probably even replicate it in a different programming language for off-line use. Heck, even the stuff at Distributed Proofreaders nowadays has a modicum of HTML embedded in it.

About the only reason you would need to start with plain text is if you're trying to fix the early e-texts in the PG corpus -- not a bad idea, but there are better places to start if that's really what you're trying to do.
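(To make "easily manipulated with scripts" concrete: once Tidy has produced well-formed XHTML -- e.g. with "tidy -asxhtml -numeric" so that named entities don't trip up the parser -- Python's standard library can walk the tree directly. The filename and the specific cleanup rule below are hypothetical; this is only a sketch of the kind of pass I have in mind.)

    import sys
    import xml.etree.ElementTree as ET

    # XHTML elements carry this namespace once the file is well-formed
    NS = "http://www.w3.org/1999/xhtml"
    XHTML = "{%s}" % NS
    ET.register_namespace("", NS)       # keep the output free of ns0: prefixes

    tree = ET.parse(sys.argv[1])        # a file previously cleaned up by Tidy
    root = tree.getroot()

    # example pass: promote <p class="chapter"> pseudo-headers to real <h2> tags
    for elem in root.iter(XHTML + "p"):
        if "chapter" in (elem.get("class") or ""):
            elem.tag = XHTML + "h2"
            del elem.attrib["class"]

    tree.write(sys.stdout.buffer, encoding="utf-8")

Swap in a different rule and the same skeleton handles most other per-book cleanups.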

BB> but _i_ have an answer -- i.e., a script like mine!

Actually, I would respect BB's efforts if he would turn his scripts into a stand-alone tool, posted on the web, that volunteers could download and use if they so choose -- just like some volunteers might choose to use Sigil, or (god forbid) MS Word, or even (hypothetically) Adobe InDesign -- and if BB's tools output standards-conforming "txt70" and "html" just like the WW'ers expect the rest of us to make. Then we would have a true marketplace of ideas, and volunteers could simply pick the tool, for better or worse, which they personally believe (rightly or wrongly) works best *for them personally.*

If a particular volunteer believes that ZML is the best format for their effort to write to -- then go for it -- because BB will provide you the tools you need to do the job in ZML! Mm, how about it BB? Where's the answer?
participants (4):
- Bowerbird@aol.com
- Jim Adcock
- Keith J. Schultz
- Lee Passey