Re: [gutvol-d] a review of some digitization tools -- 003

ok, you'll remember our three overarching requirements:
1. obtain the text (via scanners, and o.c.r. software)
2. clean the text (the first set of tasks for our tools)
3. turn the text into e-books (the second set of tasks)

under the second of those requirements, we covered:
2a. do a spellcheck
2b. fix spacey punctuation
2c. restore styling, e.g., italics

now we'll work on the list of tasks for the third one:
3a. tagging the structural aspects of the text
3b. converting the text into .html (intermediate and final)
3c. converting intermediate .html into e-book output-files

our biggest task here is 3a, tagging the structural aspects...

so let's talk about what it takes to "tag" the elements of a book,
where the tags describe the _structure_ of the specific element.

the first thing you need to know is that -- for many books --
the number of types of tags needed is embarrassingly small...

indeed, here are the big 5...

1. paragraphs
you need a paragraph tag -- [p] -- at the beginning of paragraphs.
(i'll use square brackets in this post, rather than angle-brackets,
just so nobody gets confused, as i'm speaking in general terms
about marking the elements, and not about .html coding per se.)
depending on the flavor of the markup, you might need to have
_closing_ tags, which might be a [/p] at the end of a paragraph.
similar in function is the [br] break command, to force a linebreak.

2. blocks which should not be rewrapped
some blocks -- such as poetry -- should not be rewrapped, so
you need to mark them, so they'll retain their formatting.
a quick-and-dirty way is a [pre] tag before, and [/pre] after.

3. headers
you need to mark headers, usually indicating the _level_ as well.
so [h1] might start a level-one header, with [/h1] ending it, and
[h2] and [/h2] surrounding a level-two header, and continuing...

4. table-of-contents links
the table of contents in an e-book should link to the proper spots.
in .html, these links consist of an [a] anchor with an href attribute.

5. italics
words/phrases which are to be italicized must be marked as such.

***

and really, that's all you need for a _surprising_ number of books!

i know, just 5 easy things, but believe me, that's often _sufficient_!

***

for instance, since nobody has suggested an example book to use,
let us examine "the rainbow", by d.h. lawrence, which was digitized
for project gutenberg by jim adcock; this book is known as #28948.

first of all, jim, thanks for digitizing this important book for p.g.

i've taken a cursory look at it, and found a few errors for you to fix.
in the .html version, chapter 9 is unfortunately called "chapter ixix".
and chapter 11 is called "chapter xii"; that's true in the .txt files too.

no need for thanks, jim, i'm happy to help, and won't see it anyway.

oh yeah, jim, by the way, your .txt files are badly in error, because
you forgot to indicate all of the italicized words with underscores!
that's over 160 errors throughout the book, just on those italics...

luckily, the italics _are_ indicated in your .html file, so you can just
use that file to find where to make corrections to your .txt versions.
because i'm sure you don't want to do a disservice to d.h. lawrence.

anyway, back to the topic of the thread...

i've appended a list of the .html tags that jim used. as you can see,
the entire list consists of only the 5 types of tags i mentioned above.

of course, we'll want to be able to handle more than just the simple
basic structures of a straightforward book with any system we build,
but it's nice to know you can _start_ with something simple and still
be able to digitize an important book like this one by d.h. lawrence.

-bowerbird

[p][/p]
[pre][/pre]
[i][/i]
[br]

[h1 align="center"][/h1]
[h2 align="center"][/h2]
[h3 align="center"][/h3]
[h4 align="center"][/h4]
[h5 align="center"][/h5]

[a name="ChI" href="#CI"][/a]
[a name="ChII" href="#CII"][/a]
[a name="ChIII" href="#CIII"][/a]
[a name="ChIV" href="#CIV"][/a]
[a name="ChV" href="#CV"][/a]
[a name="ChVI" href="#CVI"][/a]
[a name="ChVII" href="#CVII"][/a]
[a name="ChVIII" href="#CVIII"][/a]
[a name="ChIX" href="#CIX"][/a]
[a name="ChX" href="#CX"][/a]
[a name="ChXI" href="#CXI"][/a]
[a name="ChXII" href="#CXII"][/a]
[a name="ChXIII" href="#CXIII"][/a]
[a name="ChXIV" href="#CXIV"][/a]
[a name="ChXV" href="#CXV"][/a]
[a name="ChXVI" href="#CXVI"][/a]

[h3 align="center"][a name="CI"][/a][/h3]
[h3 align="center"][a name="CII"][/a][/h3]
[h3 align="center"][a name="CIII"][/a][/h3]
[h3 align="center"][a name="CIV"][/a][/h3]
[h3 align="center"][a name="CV"][/a][/h3]
[h3 align="center"][a name="CVI"][/a][/h3]
[h3 align="center"][a name="CVII"][/a][/h3]
[h3 align="center"][a name="CVIII"][/a][/h3]
[h3 align="center"][a name="CIX"][/a][/h3]
[h3 align="center"][a name="CX"][/a][/h3]
[h3 align="center"][a name="CXI"][/a][/h3]
[h3 align="center"][a name="CXII"][/a][/h3]
[h3 align="center"][a name="CXIII"][/a][/h3]
[h3 align="center"][a name="CXIV"][/a][/h3]
[h3 align="center"][a name="CXV"][/a][/h3]
[h3 align="center"][a name="CXVI"][/a][/h3]
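
a tag inventory like the one above could be produced with something like the python sketch below. it is only an illustration under stated assumptions: the names FIVE_TYPES, TagInventory, inventory and classify are made up here, the sample string is invented rather than taken from #28948, and counting "em" as italics and "h6" as a header goes beyond what the list above contains. it uses the standard-library html.parser module to count start tags, then sorts the tag names into the five types from the post.

```python
from collections import Counter
from html.parser import HTMLParser

# hypothetical grouping of .html tags into the five types from the post;
# "em" and "h6" are assumptions, they do not appear in the list above
FIVE_TYPES = {
    "paragraphs": {"p", "br"},
    "no-rewrap blocks": {"pre"},
    "headers": {"h1", "h2", "h3", "h4", "h5", "h6"},
    "links": {"a"},
    "italics": {"i", "em"},
}


class TagInventory(HTMLParser):
    """count every start tag seen while parsing (tag names arrive lowercased)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def inventory(html_text):
    """return a Counter mapping tag name to number of occurrences."""
    parser = TagInventory()
    parser.feed(html_text)
    parser.close()
    return parser.counts


def classify(counts):
    """sort the tag names into the five types, plus a list of anything else."""
    by_type = {
        name: sorted(t for t in counts if t in tags)
        for name, tags in FIVE_TYPES.items()
    }
    known = set().union(*FIVE_TYPES.values())
    other = sorted(t for t in counts if t not in known)
    return by_type, other


# invented sample, not text from the actual book
sample = (
    '<h3 align="center"><a name="CI"></a>chapter i</h3>'
    "<p>some <i>styled</i> text<br>more</p>"
    "<pre>a poem</pre>"
)
counts = inventory(sample)
by_type, other = classify(counts)
```

run against a complete .html file, the "other" list would also pick up wrapper tags such as html, head and body, which describe the file rather than the structure of the book.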

Hi All,

On this list there have been many rants about how to do things and how not to. What is more, we have almost everything needed to set up a complete toolchain for creating ebooks. The only problem is getting the best of foes/friends to cooperate and make a few compromises.

1) BB has the tools for doing the initial proofing
2) Lee has an ebook editor
3) Don has a system for setting up a knowledge base and distributing resources
4) Marcello has resources at his disposal to provide the internet infrastructure.

Sure, not every single part is completely perfect, and some glue is needed. But it does have potential. How about it, guys? The whole system would have no fluff and no unnecessary formats. You have divided to conquer; now combine forces and we can make a truly easy-to-use e-book production system for the benefit of all.

regards
Keith.

participants (2)
- Bowerbird@aol.com
- Keith J. Schultz