ok, here we go, in our series on digitization tools...
we're concentrating now on the _tagging_ part of
the digitization process. we have cleaned the text
and marked the italicized words in previous steps.
so we need some sample books to work with now...
you might recall that i used "books and culture" in
our "sweet grapes" series back in september, when
discussing the clean-up phase, so we can now use
that e-text to demo the next phase in our process.
the file, after the bulk of the clean-up, was here:
> http://zenmarkuplanguage.com/grapes006.txt
we then did a transform of that file, to get this:
> http://zenmarkuplanguage.com/grapes007.txt
and now, after yet another transform, we have this:
> http://zenmarkuplanguage.com/grapes008.txt
all the versions are more-or-less reminiscent of:
1. the output you could expect from abbyy o.c.r.
2. the text you find for each book at archive.org
3. a c.t.f. (concatenated text file) from pgdp.net
4. the .txt format utilized by project gutenberg
to that extent, then, we're on very familiar ground.
still, i encourage you to examine "grapes008.txt"
closely, so you know i have nothing up my sleeve.
here are some of the notable things you might see,
which adhere to the rules of zen markup language:
01. headers are preceded by at least 4 blank lines.
02. headers are followed by exactly 2 blank lines.
03. structural elements are bounded by blank lines.
04. paragraphs are not indented (in the input file).
05. the linebreaks are consistent with the p-book.
06. linebreaks don't necessarily need to be retained.
07. end-of-line hyphens are considered to be "soft",
08. but a tilde indicates a preceding hyphen is "hard".
09. italics are indicated by surrounding underscores.
10. lines beginning with " > " indicate a blockquote.
11. the first section is considered as "the title page".
12. the table-of-contents must be the second section.
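
to make rules 06 through 09 concrete, here is a minimal python sketch of how a program might unwrap one paragraph and mark its italics. this is not the actual converter, just an illustration. the names "unwrap" and "italics" are made up, the html <i> tag is only an assumed output target, and i am assuming that the tilde sits right after the hyphen at the very end of the line, and that a doubled hyphen at end-of-line is a dash that should be kept.

```python
import re

def unwrap(lines):
    """join the lines of one paragraph into a single string."""
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith("-~"):
            # hard hyphen: keep the hyphen, drop the tilde, no space
            out = out[:-1] + line
        elif out.endswith("--"):
            # assumed: a doubled hyphen is a dash, so keep it as-is
            out = out + line
        elif out.endswith("-"):
            # soft hyphen: drop it and rejoin the two halves of the word
            out = out[:-1] + line
        elif out:
            # ordinary linebreak: it becomes a single space
            out = out + " " + line
        else:
            out = line
    return out

def italics(text):
    """turn _underscored_ spans into <i>...</i> (assumed output format)."""
    return re.sub(r"_([^_]+)_", r"<i>\1</i>", text)
```

for example, unwrap(["jum-", "ped over the well-~", "known dog"]) gives "jumped over the well-known dog", since the first hyphen is soft and the second one is marked hard.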
and again, none of this "tagging" would be hard to do
in a typical text-editor, or your favorite wordprocessor.
this, again, is one of those bare-bones books that have
practically no structural elements aside from headers,
front-matter, and plain old paragraphs in the body...
the only other structures here are a few blockquotes.
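
since a book like this has so few structures, telling them apart takes very little logic. here is a rough python sketch, under assumptions, of how rules 01, 02, 03 and 10 could sort the chunks of a file into headers, blockquotes, and paragraphs. the function names are hypothetical, and the sketch ignores the title-page and table-of-contents rules (11 and 12), so the title page just comes out as ordinary paragraphs.

```python
def split_blocks(text):
    """split text into (blank_lines_before, lines) chunks."""
    found, current, blanks, before = [], [], 0, 0
    for line in text.split("\n"):
        if line.strip():
            if not current:
                # starting a new chunk: remember the blank lines above it
                before = blanks
                blanks = 0
            current.append(line)
        else:
            if current:
                found.append((before, current))
                current = []
            blanks += 1
    if current:
        found.append((before, current))
    return found

def classify(text):
    """label each chunk as 'header', 'blockquote', or 'paragraph'."""
    found = split_blocks(text)
    kinds = []
    for i, (before, lines) in enumerate(found):
        # blank lines after this chunk = blank lines before the next one
        after = found[i + 1][0] if i + 1 < len(found) else 0
        if before >= 4 and after == 2:
            kinds.append("header")        # rules 01 and 02
        elif all(l.startswith(" > ") for l in lines):
            kinds.append("blockquote")    # rule 10
        else:
            kinds.append("paragraph")
    return kinds
```

the blank-line counts do all the work here, which is why rule 03 (every structure bounded by blank lines) matters so much.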
so take a good look at that file...
-bowerbird