[gutvol-d] a review of some digitization tools -- 007

28 Nov 2011

      i hope you all had a nice thanksgiving weekend...        :+)

now let's ease back into our series on digitization tools.

***

our overall objective here is designing a workflow that
can be used in a _collaborative_system_ of digitization,
which does not preclude digitization by individuals too.

offline programs are fine -- indeed _preferred_ --
with the proviso that they have online equivalents,
and these online versions can work in batch mode,
since the emphasis is on the workflow of a _system_.

it is worth restating, however, that neither p.g. nor
internet archive is open for improvements, meaning
we're unfettered by any pesky legacy considerations.

we are free to start anew...

(which is a good thing, since both of those systems
now have workflows which are significantly flawed.)

***

you will recall our three overarching requirements
for the process of digitizing an existing paper-book:

1.   obtain the text (via scanners, and o.c.r. software)
2.   clean the text (the first set of tasks for our tools)
3.   turn the text into e-books (the second set of tasks)

the first requirement is largely taken care of, thanks to
the big scanning efforts of google and internet archive.

under the second of those requirements, we covered:

2a.   do a spellcheck
2b.   fix spacey punctuation
2c.   restore styling, e.g., italics

all of these tasks can be managed with a text-editor,
whether offline or online.   nothing here is "difficult"...
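
to make task 2b concrete, here is a minimal sketch in python.
it assumes that "spacey punctuation" means the stray spaces
that o.c.r. leaves before closing punctuation and just inside
parentheses or brackets. the function name and the exact set
of punctuation marks are my own illustration, not a spec.

```python
import re

# assumption: "spacey punctuation" = stray spaces before , . ; : ! ?
# and just inside ( ) [ ] -- adjust the character sets to taste.
# the lookbehind (?<=\S) leaves leading indentation untouched.
SPACE_BEFORE_MARK = re.compile(r"(?<=\S)[ \t]+([,.;:!?])")
SPACE_AFTER_OPEN = re.compile(r"([(\[])[ \t]+")
SPACE_BEFORE_CLOSE = re.compile(r"(?<=\S)[ \t]+([)\]])")


def fix_spacey_punctuation(line):
    """return the line with stray spaces around punctuation removed."""
    line = SPACE_BEFORE_MARK.sub(r"\1", line)
    line = SPACE_AFTER_OPEN.sub(r"\1", line)
    line = SPACE_BEFORE_CLOSE.sub(r"\1", line)
    return line


if __name__ == "__main__":
    print(fix_spacey_punctuation("well , that was easy ( mostly ) !"))
```

note that this also closes up a spaced-out ellipsis (". . ."),
so a human should still eyeball the changes before saving them.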

so we're working on the tasks for the third requirement:

3a.   tagging the structural aspects of the text
3b.   converting the text into .html (intermediate and final)
3c.   converting intermediate .html into e-book output-files

our biggest task here is 3a, tagging the structural aspects...

and now, just to tell you where i am going with all of this,
i'll inform you that i already programmed such a system...

indeed, i've previously shown you the various parts of it.

one part was the "editing/proofing" system, a demo that
i did using the "sitka" book that rfrank was also digitizing,
using his system.   this part takes care of requirement #2.
yes, i can put that back up online if anyone wants to see it.

the other part was the online demo for creating e-books
from a z.m.l. text-file, the one jana "helped" with...      ;+)
this web-app performed the tasks under requirement #3.
and yes, i can put that back up, if anyone wants to see it...

***

again, the "human" need here occurs largely on task 3a,
the tagging task.   once that's done, scripts can take over.

in order to know what "tags" we will be needing to apply,
we are examining some books that have been digitized...

just for fun, we're using books that jim adcock digitized.

so far, we've found that quite a few of the books we want
to digitize require only the barest minimum of tags.
most books consist largely of paragraphs within chapters.
(this really shouldn't be a major surprise to anyone here.)
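
to show how little is needed for "paragraphs within chapters",
here is a rough python sketch. it is _not_ z.m.l., and not my
program. it assumes a made-up convention: blank lines separate
blocks, and a block that starts with "chapter " is a header.
the function name and the choice of [h2] are hypothetical too.

```python
import html
import re


def text_to_html(text):
    """wrap blank-line-separated blocks in <h2> or <p> tags."""
    # assumption: one or more blank lines separate the blocks.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    out = []
    for block in blocks:
        # rejoin hard-wrapped lines, then escape &, <, > and quotes.
        body = html.escape(" ".join(block.split()))
        # assumption: a block starting with "chapter " is a chapter header.
        if block.lower().startswith("chapter "):
            out.append(f"<h2>{body}</h2>")
        else:
            out.append(f"<p>{body}</p>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "chapter 1\n\nit was a dark\nand stormy night.\n\nthe end."
    print(text_to_html(sample))
```

anything beyond those two cases is where the real tags come in.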

of the books from jim that we've analyzed so far, we find
he's used tags to label the following structural features:

a.   headers -- [h1]-[h6]
b.   paragraphs and breaks -- [p] and [br]
c.   styling -- [i] and [b]
d.   blocks -- [pre] and [blockquote]
e.   horizontal rules -- [hr]
f.   images -- [img]
g.   links -- [a name] and [a href] mostly for the t.o.c.
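
an inventory like the one above can be gathered by script.
what follows is a small sketch using python's standard-library
html parser. it is only an illustration, not the method i used,
and the class and function names are invented for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """count opening tags, splitting <a> into 'a name' and 'a href'."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # the parser hands us lowercased tag and attribute names.
        if tag == "a":
            names = {name for name, _ in attrs}
            if "name" in names:
                self.counts["a name"] += 1
            if "href" in names:
                self.counts["a href"] += 1
            if not names & {"name", "href"}:
                self.counts["a"] += 1
        else:
            self.counts[tag] += 1


def count_tags(html_text):
    """return a Counter of the tags used in the given html text."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    sample = '<h1>title</h1><p>some <i>text</i></p><a href="#c1">toc</a>'
    for tag, n in sorted(count_tags(sample).items()):
        print(tag, n)
```

run it over a whole book and the short list of tags falls right out.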

we're gonna look at a few more books from jim, where we
will find a need for tags for a few more structural elements,
but what we will learn is that this job is pretty darn simple.

after that, we're going to see how to generate our output,
in the form of e-book files (such as .pdf, .epub, and .mobi),
from a simple plain-ascii input-text z.m.l. "master" source.

as i said, i already did a demo of an _online_ app to do this,
so the new twist for this series is that i am now releasing an
_offline_ program to do the same thing, with arbitrary text.
(the earlier online demo was for a given text -- "the jungle",
by upton sinclair, the influential classic from the early 1900s.)

yep, i am turning up the heat on all you lobsters in my pot...

-bowerbird

Bowerbird＠aol.com