let's step through this process.011

4 Oct 2012

      now for the debriefing on our update walk-through...

***

we re-digitized "pride and prejudice", using this source:
http://archive.org/details/harvardclassicss03elio
all of our work-files are here:
http://zenmagiclove.com/prhpr
the final "updated" master z.m.l. file is here:
http://zenmagiclove.com/prhpr/prhpr.zml
the derivative versions are found here:
http://zenmagiclove.com/prhpr/prhpr.html
http://zenmagiclove.com/prhpr/prhprp222.html
***

ok, let's start this debriefing with the main headline:
digitization isn't as hard as some of you seem to think.

you might remember that i did this project because
jon hurst wanted to re-do the top-10 p.g. downloads.

he offered up some grand plan that would've involved
d.p. in some complicated workflow, and it was obvious
to me that he was overestimating the work required to
re-do one of these classics...   thanks to plentiful scans
and a variety of digitizations out in the wild, it can be
relatively quick and easy to do books from start to end.

so i decided to prove it with a bit of my own pudding.

now, one week later, we have a newly-done "update",
including an extensive documentation of the process,
from little old me working a small number of hours on
a handful of days.   (it took more time to write it all up.)

by doing it myself, i have shown that you can do it too.

but if you _do_ choose to take action, take my advice:
it is _not_ the smartest way to just jump in like i did, 
and start it up, not if you don't have the _experience_
that i have, working with those archive.org materials...

so here's some advice on choosing your material...
this is about archive.org, since i'm stuck with them,
being that i gave up on google a while back because
google seemed to be making it harder and harder to
scrape their material, whereas archive.org is _open_.

but archive.org stuff can suck, so you must take care.
particularly avoid the stuff that archive.org scraped
from google, because then you get the worst quality
of both institutions, and that can be a lethal combo.

(it might also be the case that now that google has
brought out their commercial arm, the quality of
the work they offer up to the public has improved.
i haven't checked, so i don't know, but it _is_ a possibility.)

but back to internet archive...

my first piece of advice about archive.org material is
to choose it very carefully.   some of their scans are
not-good.   if a book has bad scans, _skip_the_book_.
life is too short to spend time staring at lousy scans.
if you can't find good scans, move on to another title.

you should particularly avoid scan-sets archive.org
has "liberated" from google, because they are awful.

sometimes the _scans_ can look good enough to use,
but the _o.c.r._ turned out lousy.   skip those as well...

the o.c.r. which i used in this project was _terrible._
the letter "e" was frequently misrecognized as "c",
and as you might know, the letter "e" is _common._

and there were lots of other misrecognitions too...
i was able to deal with it because i know the ropes,
but until you clean a dozen archive.org o.c.r. files,
you won't, so you should avoid the hassle entirely...
(and once you _have_ cleaned a dozen such messes,
you will know quite well you don't want to do more.)
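
to make that "e"-as-"c" problem concrete, here is a minimal
sketch of one way to flag it. this is not the tool used on
this project. the function names and the word-list are
hypothetical, and it assumes you supply your own list of
known words. it flags any unknown word that turns into a
known word when one or more "c" letters become "e".

```python
import re
from itertools import product


def c_to_e_candidates(word, known_words, max_c=4):
    """Return known words reachable from `word` by changing some 'c' to 'e'."""
    positions = [i for i, ch in enumerate(word) if ch == "c"]
    # Skip words with no 'c', or with so many that the search gets silly.
    if not positions or len(positions) > max_c:
        return []
    found = []
    for flips in product([False, True], repeat=len(positions)):
        if not any(flips):
            continue  # the unchanged word is not a correction
        chars = list(word)
        for pos, flip in zip(positions, flips):
            if flip:
                chars[pos] = "e"
        candidate = "".join(chars)
        if candidate in known_words:
            found.append(candidate)
    return found


def flag_c_for_e(text, known_words):
    """List (suspect_word, possible_fixes) pairs, in order of appearance."""
    flagged = []
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        if word in known_words:
            continue  # already a good word, leave it alone
        fixes = c_to_e_candidates(word, known_words)
        if fixes:
            flagged.append((word, fixes))
    return flagged
```

this only _flags_ suspects. a human still has to look at
each one against the scan before changing anything.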

you should especially watch out for some o.c.r. files
from archive.org which have _lost_ their em-dashes!
restoring those em-dashes is a huge pain in the ass.
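
a quick way to check a text before you commit to it is
to count its dashes. the sketch below is an assumption
about what "lost" looks like: a text of reasonable length
with no em-dashes and no double-hyphens at all. the
function names and the 2000-word threshold are made up
for illustration.

```python
EM_DASH = "\u2014"


def dash_report(text):
    """Count em-dashes, double-hyphen stand-ins, and words in a text."""
    return {
        "em_dashes": text.count(EM_DASH),
        "double_hyphens": text.count("--"),
        "words": len(text.split()),
    }


def looks_dashless(text, min_words=2000):
    """True if a reasonably long text contains no dashes of either kind."""
    report = dash_report(text)
    return (
        report["words"] >= min_words
        and report["em_dashes"] == 0
        and report["double_hyphens"] == 0
    )
```

a novel with thousands of words and zero dashes is
a strong hint that the o.c.r. file dropped them.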

speaking of the o.c.r., the easy thing to do is to grab
the djvu.txt file that's listed right on the "files" page.
that's what i did, but it is _not_ the right thing to do,
not if you want to minimize the time it will take you.

it's much better to pull the text out of the x.m.l. file
that's saved by abbyy finereader in the o.c.r. process.

that's because _styling_ information gets saved there,
and in many cases, it might well take a ton of time to
reconstitute it.   so i recommend you extract it initially.

the problem with my recommendation is that you will
have to write yourself an app to grab that information.
(yes, i wrote such an app, but you may not use mine.)

if you can't write an app, or use somebody else's, then
i guess you'll just have to use the djvu.txt file instead.
it won't be the end of the world.   heck, i did it myself,
for this demo project, because i knew i already _have_
all of the styling information for "pride and prejudice".

but if you can write an app, then go ahead and do that.
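
for anyone who wants a starting point, here is a minimal
sketch of such an app. it is _not_ the app mentioned above.
it assumes the usual abbyy finereader x.m.l. layout, where
each "line" element holds "formatting" elements, and each
of those holds one "charParams" element per character,
with italics marked by an italic="true" or italic="1"
attribute. check your own file before trusting that.
marking italics with underscores is also my assumption.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def abbyy_lines(xml_data):
    """Return the text lines of an ABBYY XML document, italics as _..._."""
    root = ET.fromstring(xml_data)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "line":
            continue
        pieces = []
        for fmt in elem:
            if local_name(fmt.tag) != "formatting":
                continue
            # One charParams element per recognized character.
            chars = "".join(
                cp.text or ""
                for cp in fmt
                if local_name(cp.tag) == "charParams"
            )
            stripped = chars.strip()
            if fmt.get("italic") in ("1", "true") and stripped:
                # Keep surrounding spaces outside the underscores.
                start = chars.index(stripped)
                end = start + len(stripped)
                chars = chars[:start] + "_" + stripped + "_" + chars[end:]
            pieces.append(chars)
        lines.append("".join(pieces))
    return lines
```

the abbyy files are large and usually gzip-compressed,
so you would decompress first (python's gzip module can
do that), and for a whole book you might prefer to parse
incrementally rather than load everything at once.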

another benefit of using the abbyy.xml file is that you
will see that it contains those _em-dashes_ which the
incompetent technocrats over there lost in the deja-vu.

also, if you will be working in a web-based application,
you can squirt the scans from archive.org to your site,
without doing all the dirty-work of downloading them
to your machine and then uploading them to your site.
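
as a rough sketch of what that can look like: archive.org
serves an item's files under /download/ and describes
them in a json record under /metadata/, so a web-app can
point straight at the files. the helper names below are
hypothetical, and the format labels you filter on (such as
"Abbyy GZ") should be checked against the item's own
metadata rather than taken from me.

```python
from urllib.parse import quote

DOWNLOAD_BASE = "https://archive.org/download"
METADATA_BASE = "https://archive.org/metadata"


def download_url(identifier, filename):
    """Build the direct URL for one file of an archive.org item."""
    return f"{DOWNLOAD_BASE}/{quote(identifier)}/{quote(filename)}"


def metadata_url(identifier):
    """Build the URL of the JSON record that lists an item's files."""
    return f"{METADATA_BASE}/{quote(identifier)}"


def files_with_format(metadata, wanted_format):
    """Pick file names of one format out of a parsed metadata record."""
    return [
        entry["name"]
        for entry in metadata.get("files", [])
        if entry.get("format") == wanted_format
    ]
```

you would fetch metadata_url(...) once, parse the json,
and then hand the resulting download urls to your site.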

those are the big points.   if anyone has any questions
about smaller-grained issues, i'll be happy to answer,
to the best of my ability.   most glitches have bitten me.

as far as the step-by-step workflow of doing the edits,
i have belabored that many, many times in the past, so
your best bet in that regard will be to hit the archives.

***

just a couple more notes.   i programmed all my tools
for this "pride and prejudice" project _from_scratch_...
so even if you have to start from scratch, it ain't hard.

i don't think i stressed enough that this last round of
page-by-page .html files has an error-report form
right on each page, which makes it _very_convenient_
for someone to report an error, unlike the p.g. way...

and yes, the text i created does need smoothreading.
i'm not vouching for its full accuracy, by _any_ means.
i miss the good old days, when jose menendez would
appear every time i posted something, and do a reveal
of the 197 errors he found in it.   that kept me humble.
so go ahead and find the errors that i could not find!
it'll make me happy to have my human-ness exposed.

because, and this wraps up my summary, i think that i did
a pretty good job.   i think this text is amazingly clean.
if there is a cleaner "pride and prejudice" _anywhere_,
i would very much like to see it, so do please tell me...

and not only that, but i did the entire job in one week.

***

so, once again, the executive summary:

digitization isn't as hard as some of you seem to think.

-bowerbird

Bowerbird＠aol.com
