ok, you'll remember our three overarching requirements:
1. obtain the text (via scanners, and o.c.r. software)
2. clean the text (the first set of tasks for our tools)
3. turn the text into e-books (the second set of tasks)
under the second of those requirements, we covered:
2a. do a spellcheck
2b. fix spacey punctuation
2c. restore styling, e.g., italics
now we'll work on the list of tasks for the third one:
3a. tagging the structural aspects of the text
3b. converting the text into .html (intermediate and final)
3c. converting intermediate .html into e-book output-files
our biggest task here is 3a, tagging the structural aspects...
so let's talk about what it takes to "tag" the elements of a book,
where the tags describe the _structure_ of the specific element.
the first thing you need to know is that -- for many books --
the number of types of tags needed is embarrassingly small...
indeed, here are the big 5...
1. paragraphs
you need a paragraph tag -- [p] -- at the beginning of paragraphs.
(i'll use square brackets in this post, rather than angle-brackets,
just so nobody gets confused, as i'm speaking in general terms
about marking the elements, and not about .html coding per se.)
depending on the flavor of the markup, you might also need
_closing_ tags, such as a [/p] at the end of each paragraph.
similar in function is the [br] break tag, which forces a linebreak.
2. blocks which should not be rewrapped
some blocks -- such as poetry -- should not be rewrapped,
so you need to mark them, so they'll retain their formatting.
a quick-and-dirty way is a [pre] tag before, and [/pre] after.
3. headers
you need to mark headers, usually indicating the _level_ as well.
so [h1] might start a level-one header, with [/h1] ending it, and
[h2] and [/h2] surrounding a level-two header, and continuing...
4. table-of-contents links
the table of contents in an e-book should link to the proper spots.
in .html, these links consist of an [a] anchor with an href attribute.
5. italics
words/phrases which are to be italicized must be marked as such.
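to make items 1, 2, 3, and 5 concrete, here's a minimal python sketch.
it's an illustration only, and it leans on assumptions that are not
part of this post: blocks are separated by blank lines, a one-line
block starting with "chapter " is a header (tagged [h3] here),
a block whose lines are all indented must not be rewrapped, and
italics are marked _like this_ in the plain text. the names tag_block
and tag_text are made up for the example. item 4 is left out.

```python
import re

def tag_block(block):
    """tag one blank-line-delimited block with square-bracket tags."""
    lines = block.split("\n")
    # assumption: a one-line block starting with "chapter " is a header
    if len(lines) == 1 and lines[0].lower().startswith("chapter "):
        return "[h3]" + lines[0].strip() + "[/h3]"
    # assumption: a block whose lines are all indented is not rewrapped
    if all(line.startswith(" ") for line in lines):
        return "[pre]" + block + "[/pre]"
    # otherwise it is an ordinary paragraph, so join it onto one line
    text = " ".join(line.strip() for line in lines)
    # assumption: italics are marked _like this_ in the plain text
    text = re.sub(r"_([^_]+)_", r"[i]\1[/i]", text)
    return "[p]" + text + "[/p]"

def tag_text(text):
    """split the text on blank lines and tag each block in turn."""
    blocks = [b for b in re.split(r"\n\s*\n", text.strip("\n")) if b.strip()]
    return "\n".join(tag_block(b) for b in blocks)

# made-up sample, not taken from any real book
sample = ("chapter i\n\nshe was _very_ happy,\nand said so.\n\n"
          "  a line of verse\n  and another")
print(tag_text(sample))
```

a real tool would need smarter rules for spotting headers and
unwrappable blocks, but the shape of the job stays about the same.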
***
and really, that's all you need for a _surprising_ number of books!
i know, just 5 easy things, but believe me, that's often _sufficient_!
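as for item 4, the table-of-contents links, each one is really a pair:
a link in the contents and a named target on the chapter header.
here's a rough python sketch that builds both halves, still using
square brackets. the naming scheme ("ChIX" on the link, "CIX" on the
target) copies the pattern in the tag list appended at the end of
this post; the function names to_roman, toc_entry, and chapter_header
are made up for illustration.

```python
def to_roman(n):
    """convert a positive integer to an uppercase roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for value, numeral in pairs:
        while n >= value:
            out += numeral
            n -= value
    return out

def toc_entry(n, title):
    """the link in the table of contents, pointing at chapter n."""
    r = to_roman(n)
    return '[a name="Ch%s" href="#C%s"]%s[/a]' % (r, r, title)

def chapter_header(n, title):
    """the header at the start of chapter n, carrying the target name."""
    r = to_roman(n)
    return '[h3 align="center"][a name="C%s"]%s[/a][/h3]' % (r, title)

print(toc_entry(9, "chapter ix"))
print(chapter_header(9, "chapter ix"))
```

generating both halves from the same chapter number means the link
and its target can't drift apart.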
***
for instance, since nobody has suggested an example book to use,
let us examine "the rainbow", by d.h. lawrence, which was digitized
for project gutenberg by jim adcock; this book is known as #28948.
> http://www.gutenberg.org/files/28948
first of all, jim, thanks for digitizing this important book for p.g.
i've taken a cursory look at it, and found a few errors for you to fix.
in the .html version, chapter 9 is unfortunately called "chapter ixix".
and chapter 11 is called "chapter xii"; that's true in the .txt files too.
no need for thanks, jim, i'm happy to help, and won't see it anyway.
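slips like those are the kind of thing a tool can catch. a rough
python sketch, assuming the chapter headings have already been pulled
out into a list, in order, each one ending in its roman numeral.
the names roman_to_int and check_chapters are made up for illustration,
and the sample headings are invented to mirror the two errors above.

```python
import re

# strict pattern for well-formed roman numerals from 1 to 3999
ROMAN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral):
    """return the value of a well-formed roman numeral, else None."""
    s = numeral.upper()
    if not s or not ROMAN.match(s):
        return None
    total = 0
    # a smaller value standing before a larger one is subtracted
    for ch, nxt in zip(s, s[1:] + " "):
        v = VALUES[ch]
        total += -v if VALUES.get(nxt, 0) > v else v
    return total

def check_chapters(headings):
    """list (position, heading) pairs whose numeral doesn't match."""
    problems = []
    for position, heading in enumerate(headings, start=1):
        numeral = heading.split()[-1]  # assumes a non-empty heading
        if roman_to_int(numeral) != position:
            problems.append((position, heading))
    return problems

# invented sample mirroring the two errors described above
headings = ["chapter i", "chapter ii", "chapter iii", "chapter iv",
            "chapter v", "chapter vi", "chapter vii", "chapter viii",
            "chapter ixix", "chapter x", "chapter xii", "chapter xii"]
print(check_chapters(headings))
# -> [(9, 'chapter ixix'), (11, 'chapter xii')]
```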
oh yeah, jim, by the way, your .txt files are badly in error, because
you forgot to indicate all of the italicized words with underscores!
that's over 160 errors throughout the book, just on those italics...
luckily, the italics _are_ indicated in your .html file, so you can just
use that file to find where to make corrections to your .txt versions.
because i'm sure you don't want to do a disservice to d.h. lawrence.
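for anyone wondering how to use the .html file that way, here is one
possible approach, sketched in python with the standard library's
html.parser module: collect every phrase found inside italics tags,
in document order. the names ItalicCollector and italic_phrases are
made up for this example, and the sample markup is invented.

```python
from html.parser import HTMLParser

class ItalicCollector(HTMLParser):
    """gather the text found inside <i>...</i> tags, in order."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <i> tags are currently open
        self.current = []   # pieces of the phrase being collected
        self.phrases = []   # finished phrases

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "i" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.phrases.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)

def italic_phrases(html_text):
    """return the list of italicized phrases in html_text."""
    collector = ItalicCollector()
    collector.feed(html_text)
    collector.close()
    return collector.phrases

# invented sample markup
print(italic_phrases("<p>she was <i>very</i> happy, <i>n'est-ce pas</i>?</p>"))
```

each phrase would then be looked up in the .txt file and wrapped in
underscores. that last step probably wants a human eye, since a short
word might also occur elsewhere without italics.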
anyway, back to the topic of the thread...
i've appended a list of the .html tags that jim used. as you can see,
the entire list consists of only the 5 types of tags i mentioned above.
of course, any system we build will need to handle more than just
the basic structures of a straightforward book,
but it's nice to know you can _start_ with something simple and still
be able to digitize an important book like this one by d.h. lawrence.
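if you want to make a tag list like that for some other book, a short
python sketch along these lines should do it, again using the standard
html.parser module. the names TagInventory and tag_inventory are made
up for the example, and the sample markup is invented. on a full .html
file, wrapper tags such as html, head, and body will show up too.

```python
from collections import Counter
from html.parser import HTMLParser

class TagInventory(HTMLParser):
    """count how often each opening tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase
        self.counts[tag] += 1

def tag_inventory(html_text):
    """return a Counter mapping each tag name to its number of uses."""
    parser = TagInventory()
    parser.feed(html_text)
    parser.close()
    return parser.counts

# invented sample markup
sample_html = '<h1 align="center">t</h1><p>a <i>b</i><br>c</p><p>d</p>'
for tag, count in sorted(tag_inventory(sample_html).items()):
    print(tag, count)
```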
-bowerbird
> [p][/p]
> [pre][/pre]
> [i][/i]
> [br]
>
> [h1 align="center"][/h1]
> [h2 align="center"][/h2]
> [h3 align="center"][/h3]
> [h4 align="center"][/h4]
> [h5 align="center"][/h5]
>
> [a name="ChI" href="#CI"][/a]
> [a name="ChII" href="#CII"][/a]
> [a name="ChIII" href="#CIII"][/a]
> [a name="ChIV" href="#CIV"][/a]
> [a name="ChV" href="#CV"][/a]
> [a name="ChVI" href="#CVI"][/a]
> [a name="ChVII" href="#CVII"][/a]
> [a name="ChVIII" href="#CVIII"][/a]
> [a name="ChIX" href="#CIX"][/a]
> [a name="ChX" href="#CX"][/a]
> [a name="ChXI" href="#CXI"][/a]
> [a name="ChXII" href="#CXII"][/a]
> [a name="ChXIII" href="#CXIII"][/a]
> [a name="ChXIV" href="#CXIV"][/a]
> [a name="ChXV" href="#CXV"][/a]
> [a name="ChXVI" href="#CXVI"][/a]
>
> [h3 align="center"][a name="CI"][/a][/h3]
> [h3 align="center"][a name="CII"][/a][/h3]
> [h3 align="center"][a name="CIII"][/a][/h3]
> [h3 align="center"][a name="CIV"][/a][/h3]
> [h3 align="center"][a name="CV"][/a][/h3]
> [h3 align="center"][a name="CVI"][/a][/h3]
> [h3 align="center"][a name="CVII"][/a][/h3]
> [h3 align="center"][a name="CVIII"][/a][/h3]
> [h3 align="center"][a name="CIX"][/a][/h3]
> [h3 align="center"][a name="CX"][/a][/h3]
> [h3 align="center"][a name="CXI"][/a][/h3]
> [h3 align="center"][a name="CXII"][/a][/h3]
> [h3 align="center"][a name="CXIII"][/a][/h3]
> [h3 align="center"][a name="CXIV"][/a][/h3]
> [h3 align="center"][a name="CXV"][/a][/h3]
> [h3 align="center"][a name="CXVI"][/a][/h3]