a review of some digitization tools -- 018

13 Dec 2011

      first, let's just cut directly to the chase today...

we're going to compare my converted .html to
the .html version of "books and culture" which
project gutenberg has posted as their version...
...
http://www.gutenberg.org/files/16736/16736-h/16736-h.htm
here are a bunch of head-to-head screenshots:
...
http://zenmagiclove.com/misc/bac-html001.jpg
   http://zenmagiclove.com/misc/bac-html002.jpg
   http://zenmagiclove.com/misc/bac-html003.jpg
   http://zenmagiclove.com/misc/bac-html004.jpg
   http://zenmagiclove.com/misc/bac-html005.jpg
   http://zenmagiclove.com/misc/bac-html006.jpg
   http://zenmagiclove.com/misc/bac-html007.jpg
   http://zenmagiclove.com/misc/bac-html008.jpg
   http://zenmagiclove.com/misc/bac-html009.jpg
   http://zenmagiclove.com/misc/bac-html010.jpg
   http://zenmagiclove.com/misc/bac-html011.jpg
   http://zenmagiclove.com/misc/bac-html012.jpg
across the board, we see that the two versions are
very much the same, with only occasional differences.

and most of the differences are "fixed" now, which
you can verify, if you like, by running the program.
...
http://zenmagiclove.com/cgi-bin/tday2011.py
it was a simple matter of tweaking my script a bit.
for instance, i had to revert my curly-quotes back
to straight-quotes, repair some typos, delete the
spaces i'd put around my em-dashes, and so on.

so now the two look almost completely identical...

this makes it abundantly clear that -- at least for
this simple book -- my script works just as well as
the "hand-formatting" performed by this volunteer.

so that's the main takeaway of today's lesson...

that's unsurprising.   if/when a book consists mostly of
headers and plain paragraphs, then all you have to do
to make different versions look the same is to ensure
their basic setup of those two structures is equivalent.

***

so now let's circle back around for the exposition...

***

we'll start with a short refresher on the workflow.

we whipped the text into shape, and now we're
ready to convert it, into .html -- "and beyond",
as buzz lightyear has been known to proclaim.

you can see the text file here:
...
http://zenmagiclove.com/grapes009.txt
it looks very much like an o.c.r. output file,
or a "concatenated text file" (c.t.f.) from d.p.

it's also quite reminiscent of a p.g. e-text...

you know, the kind that jim labels "txt70",
and lee calls "an impoverished text-file"...

it's also a .zml file.   it follows the z.m.l. rules.

my thanksgiving-day python script has been
converting that .zml file into our .html output.

lots of lots of people -- including jim and lee --
have suggested that p.g. should mandate .html.

or that _any_ cyberlibrary should use .html as its
"master-format" -- the one it actively maintains.

what these guys are never clear about is exactly
_how_ volunteers are supposed to transform the
_text_ files (of the type we mentioned up above,
the type you end up with after your digitization)
_into_ .html.   they're strangely mum about that.

but _i_ have an answer -- i.e., a script like mine!

are there other converters which could be used?
well, of course there are.   but the usual situation
is that the other guy's converter doesn't convert
the same way that _you_ would prefer to convert..
(we see that keith said that very thing just today...
indeed, he just assumes that that'll be the case.)

can't you do text-markup "by hand"?   using just
a text-editor?   that's what these guys usually say.
and yes, you certainly _can_ do markup by hand.
and not everyone can, or wants to.   but "you" can.
it's just not a lot of fun.   a lot of it is grunt-work.

so you find yourself using "regular expressions"
-- if you know what they do and how they work.

and then you find yourself with _lots_ of reg-ex,
and the next thing you know, you write a script...

so you might as well _start_off_ writing a script...

especially since i am showing that it's so easy...

***

now, you couldn't be blamed if you thought that
the appeal of a script is that _it_makes_life_easy_
regarding the work involved in coding the .html.

but that's only the half of it.

the other benefit from a light-markup approach
is that the output will be _dependably_consistent_.

from the perspectives of _re-use_ and _re-mix_,
an ability to _know_ that the output is _consistent_
is extremely valuable, because you can depend on
the output structure, and treat it programmatically.
and that gives us _immense_power_ down the line.

this contrasts _significantly_ with the reality of p.g.,
as it is currently construed, which is _inconsistency_.

very little p.g. .html has "dependable consistency"...

even files from the same producer can -- and do! --
take different approaches, sometimes varying widely.

and, as you'd imagine, files from different producers
often have almost nothing in common between them.

so these .html files can't be treated programmatically.
you can try, and you might experience some success,
more or less, depending on how wild it is what you do,
but you can't go in _expecting_ to have 100% success.
or even 70%.   on some things, you might get only 20%.

essentially, every .html version is a "snowflake", in that
it is different from every other .html file in the library.

and we see this demonstrated on this very book that
we've been using in this series -- "books and culture",
so let's discuss this specific example next...

***

as we have seen, "books and culture" is a simple book,
from the standpoint of digitization, since it consists of
plain old paragraphs and chapter-headers, containing
only a couple blockquotes to break up the monotony...

so you'd think "the markup" on it would be predictable.

i certainly thought so; but as it turned out, i was wrong.

in order to make my version "look like" the p.g. version,
i was originally going to simply copy in their .css file...

but as soon as i did that, i realized there'd be trouble...

i've appended the .css, so you can see for yourself...

for a straightforward book such as this one, we'd expect
that the .css file would be sparse, if there was one at all.

about the only thing that _routinely_ needs to be altered
for the look-and-feel of old books is to center headings.

(and it's often better to achieve that by using a "style"
command inside the header tag, most especially since
the kindle is infamous for not even using the .css file.)

but the .css for this book from p.g.?   it's _not_ "sparse".

it defines some styles, a few of 'em quite questionable...

the most troubling, in that vein, is called ".chaptertitle".
but others also of concern are ".chapter", and ".quote".

the reason the "chapter" classes are a red-flag is that
you generally want the chapters to get [h] header tags.
structurally, that is what they "are", and a number of
tools are geared toward this "semantic" interpretation.

sure enough, when i looked at the .html tags within,
there was indeed trouble.   as feared, chapter headers
were not marked with [h] header tags, but with a [p].

that's right.   the headers were declared as paragraphs.
oops!   i will let someone else give you the "tag abuse"
lecture, and dole out the appropriate flogging here...
but you don't have to be compulsive to see that's bad.

we also find the blockquotes were marked with a [p],
with the class defined as ".quote", more "tag abuse".
ditto with the footnote, although that'll be less clear.

so, you ask, was anything actually tagged as a header?

well, as a matter of fact, yes.   here's the full list of them:
...
[h1] BOOKS AND CULTURE
   [h4] By
   [h2] HAMILTON WRIGHT MABIE
   [h4] NEW YORK: PUBLISHED BY
   [h4] MDCCCCVII
   [h4] [i] Copyright, 1896
   [h4] [span class="sc"] By Dodd, Mead and Company
   [h4] University Press:
   [h4] To
   [h3] EDMUND CLARENCE STEDMAN
   [h3] CONTENTS
...
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&
...
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&
...
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&
...
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&
...
http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&
oops!   they did it again.   aside from the [h1] and maybe
the [h2] -- or should i say "mabie" the [h2], hahaha! --
that's pretty much presentational tagging throughout...
meaning more lecturing and some additional flogging.

all in all, even on this very "simple" book, sorry to say,
the tagging would have to be assigned a failing grade.

i guess if you wanted to be kind, you could give it a "d".

***

now some people might offer up as an explanation that
this book was done a long time ago -- october of 2005.

but if you think about it, that "excuse" really backfires...

because the problem is _not_ that the postprocessor
was incapable of handling a relatively difficult task...

on the contrary, the postprocessor tried to "get fancy",
and basically tripped over themselves on a simple task.

it's like the person who misses the very first question on
"who wants to be a millionaire?".   what were they thinking?

and believe me, in the ensuing years, the postprocessors
over at d.p. have gotten even _more_ fancy with their code.
they've accumulated lots of crud and a ton of tricks as well,
and they wanna throw _all_ of it into every project they do.

so the snowflake problem is getting _worse_, not better...

***

it is also worth noting, before we leave this book behind,
that this was a book which had _extremely_high_visibility_
-- an excruciating amount -- at the time it was posted...

it has the distinction of being the first public-domain book
which was released by google as part of its scanning project.

so it was a major milestone when d.p. picked it up to do it.
that -- all by itself -- would've made it a memorable book.

but this book also marked another historical turning-point.

people who were around back then might remember that
the distributed proofreading site had come into its own...

the volunteers over there were feeling very smug about it,
claiming loudly that their accuracy trumped everyone else.
they were superior.   and their arrogance knew few bounds.

pride goeth before destruction, a haughty spirit before the fall.

the truth is that they _were_ "superior", at least if compared
to the rather flawed e-texts that comprised the p.g. library...

but their accuracy was far short of the perfection they claimed.

nonetheless, they proclaimed far and wide that their method
of _collaboration_ meant that more eyes checked the text and
-- ipso facto -- that meant that the output was more accurate.

it's a neat argument, in the sense that it has intuitive appeal...

the thing was that it pissed off some people, who thought that
they did a darn good job all by themselves, working all alone...

one of those people was a guy by the name of jose menendez.

so, when google released "books and culture", jose proofed it.

and then he waited for p.g. to proof it.   and waited and waited.

and when d.p. _finally_ released its version, and p.g. posted it,
some 6 months later, jose pounced on it.   and he found juice...

it ends up that d.p. didn't do a "perfect" job on its digitization.
it didn't even come close, with 48+ errors -- yes, over 48! --
on a 279-page book (i.e., 1 every 6 pages) that is under 210k.

and jose detailed all of those errors.   painstakingly.   ouch.

but that number wasn't the most embarrassing thing, not at all.
the most embarrassing was that d.p. _dropped_an_entire_page_!
mighta been google's scanning error originally, can't remember,
but it was still true that d.p. failed to notice that missing page...

but jose hadn't.   nor did he miss the 49 errors that d.p. missed.

d.p. was aghast.   all of their claims of superiority were dashed.
their vaunted collaborative system had been out-performed by
a single individual, working all alone, with just his own two eyes.

d.p. rushed to fix its errors -- jose helped them very graciously.

and p.g. posted the corrected version just as soon as it could.

if you look at the "old" folder, you'll see that the update came
within a week, which -- as most of you know -- is astounding.
and even with that, there's some history that's been rewritten,
since not even the "old" file is missing that big chunk of text.

(it ended up that p.g. got the missing page of text right away,
but then it neglected to fix the other 48+ errors jose found.
when that was pointed out, they did an update in one week.)

the history of all this was documented online at the time...
the main "battlefield" for it was the "bookpeople" listserve:

post=2005-09-30,3
post=2005-10-05,4
post=2005-10-06,7
post=2005-10-11,3
post=2005-10-21,4

right around the time of this disillusionment, d.p. decided
that it needed to have more than two rounds in its system,
splitting the task into "proofing" and "formatting" rounds.

and -- at least for a short while, anyway -- d.p. stopped
making big claims about how "superior" their system was,
and acknowledged that "some" individuals did good work.

gradually, however, that d.p. overconfidence has returned.
once again, it got smug, claiming a degree of "perfection"
which -- at least the last i checked -- was not warranted...

the standard line at d.p. is that "our product has improved",
and they conveniently split their history in half, admitting
"the first half" of books they've done contain flaws, whereas
"the last half" reflect the better results that they now attain.

this slippery scale is very convenient.   back when there were
30,000 books in the library, #16736 was in "the good half".

now, however, it has been safely relegated to "the bad half",
and any flaws in it -- flaws that've been there since 2005 --
have now been "explained away" as mere historical artifact.

but i am sure that lee will be all too happy to tell you that
he's been giving his "tag abuse" lecture since before 2005
-- heck, i bet you can find it in the archives of this list --
and that this .html was just as faulty back then as it is now.

my point is a bit different.

i stress that this was a high-profile e-text, even back then...
so the questionable .html is _not_ due to "a lack of attention".

you can bet that the whitewashers looked closely at this one.
especially when they had to re-do it twice within two weeks...

nevertheless, it _still_ ended up being badly flawed...

for a drop-dead simple book like this one, that's shocking.

and like i said, things have only gotten "fancier" since then...

so if you wanted an illustration of the "snowflake" problem,
you'd be hard-pressed to find a better example than this...

***

as i noted, the body of the book looked almost identical
between the two versions, which is as we would expect...

but there were visual differences in the _frontmatter_,
so i will show them to you for the sake of completion...

for instance, here is the very top of the file:
...
http://zenmagiclove.com/misc/bac-html013.jpg
the p.g. legalese is really a turnoff.   when will p.g.
get smart, and condense the stuff at the very top to
one line, embedded in the middle of the cover-page,
which just points to the license to the end of the file?

after the legalese, the p.g. titlepage is quite nice:
...
http://zenmagiclove.com/misc/bac-html014.jpg
it's nicer than mine.   (i didn't put effort into mine.)
especially because it has that wee flower doodad...

i don't like the fact that the p.g. titlepage bleeds into
the following pages, which also bleed into each other,
with no separation at all, but maybe that's just me...

i also like the look of their table-of-contents:
...
http://zenmagiclove.com/misc/bac-html15.jpg
again, i might need to put some work in...   or not...

as noted above, the only wrinkle in the book's body
was the presence of a couple blockquote paragraphs.
the flow varies, since i retained the p-book linebreaks:
...
http://zenmagiclove.com/misc/bac-html016.jpg
   http://zenmagiclove.com/misc/bac-html017.jpg
pedants will also note that my blockquote does _not_
have the indent present in the p-book.   (yes, i looked.)
in return, i will smile at them, mocking their pettiness.

which brings us to the close of today's post...

***

in sum, let's repeat the takeaway of this experiment:

for this simple book, my conversion worked as well as
the "hand-formatting" performed by a p.g. volunteer.
indeed, from the standpoint of _re-mix_consistency_,
the product of my conversion is seen as far superior...

-bowerbird

p.s.   here is the .css file from "books and culture" at p.g.
...
http://www.gutenberg.org/files/16736/16736-h/16736-h.htm
[style type="text/css"]
     [!--

     body {margin-left: 3em;
          margin-right: 3em;}

     p {text-indent: 1em;
          text-align: justify;}

     .ctr {text-align: center;
          text-indent: 0em;}

     .noindent {text-indent: 0em;}

     .sc {font-variant: small-caps;}

     .chapter {text-indent: 0em;
          margin-top: 2.5em;
          margin-bottom: 1em;
          text-align: center;
          font-size: 115%;
          font-weight: bold;}

     .chaptertitle {text-indent: 0em;
          margin-top: .5em;
          margin-bottom: 1.5em;
          text-align: center;
          font-size: 115%;
          font-weight: bold;}

     .quote {text-align: justify;
          margin-left: 2.5em;
          margin-right: 2em;
          text-indent: .75em;}

     h1,h2,h3,h4,h5,h6 {text-align: center;
          margin-top: 1.5em;
          margin-bottom: 1em;}

     hr.long {text-align: center;
          width: 95%;
          margin-top: 2em;
          margin-bottom: 2em;}

     hr.med {text-align: center;
          width: 60%;
          margin-top: 1.5em;
          margin-bottom: 2em;}

     .footnote {font-size: 96%;
          text-indent: 3em;}

     ul.nameofchapter {list-style-type: none;
          position: relative;
          width: 75%;
          margin-left: 4em;
          line-height: 150%;
          font-size: 76%;}

     ul.TOC {list-style-type: upper-roman;
          position: relative;
          width: 75%;
          margin-left: 4em;
          line-height: 150%;
          font-size: 96%;}

     .ralign {position: absolute;
          right: 0;}

 --]
[/style]

Bowerbird＠aol.com

Jim Adcock

Keith J. Schultz

Lee Passey

Jim Adcock

tags

participants (4)