
ok, let's regroup...

you will recall our three overarching requirements for the process of digitizing an existing paper-book:

1. obtain the text (via scanners, and o.c.r. software)
2. clean the text (the first set of tasks for our tools)
3. turn the text into e-books (the second set of tasks)

the first requirement is largely taken care of, thanks to the big scanning efforts of google and internet archive. it keeps getting harder to find a book they haven't done.

under the second of those requirements, we covered:

2a. do a spellcheck
2b. fix spacey punctuation
2c. restore styling, e.g., italics

all of these tasks can be managed with a text-editor, whether offline or online. nothing here is "difficult"...

so we're working on the tasks for the third requirement:

3a. tagging the structural aspects of the text
3b. converting the text into .html (intermediate and final)
3c. converting intermediate .html into e-book output-files

tasks 3b and 3c will be the product of automatic processes, so our biggest task is 3a, the tagging of structural aspects...

we have reviewed 5 simple books, to see what they involved.

a. headers -- [h1]-[h6]
b. paragraphs and breaks -- [p] and [br]
c. styling -- [i] and [b], plus cover-page prettiness
d. blocks -- [pre] and [blockquote]
e. horizontal rules -- [hr]
f. images -- [img]
g. links -- [a name] and [a href], mostly t.o.c., but also index

this is a bare-bones list, to be sure, but it's also the case that a surprisingly large number of books only require bare-bones.

i believe internet archive currently claims ~3 million volumes... i'd say that 1 million of those books only require bare-bones, while another million of 'em would need only a little bit more. and even the final million are not _that_ difficult, to be honest, not if you're willing to bend a little on what you want from 'em. and if you won't bend, nobody will be able to make you happy...

so, starting next week, i'll show you how you can tag structure.
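for a rough sense of how small that bare-bones list (a through g) really is, here is a minimal python sketch. it is not the tool promised for next week, just an illustration: the names `BARE_BONES`, `VOID_TAGS`, and `render` are hypothetical, and the only assumption beyond the list itself is that [br], [hr], and [img] take no closing tag in .html.

```python
from html import escape

# the seven bare-bones categories from the list, with their html tags
BARE_BONES = {
    "headers": ["h1", "h2", "h3", "h4", "h5", "h6"],
    "paragraphs and breaks": ["p", "br"],
    "styling": ["i", "b"],
    "blocks": ["pre", "blockquote"],
    "horizontal rules": ["hr"],
    "images": ["img"],
    "links": ["a"],
}

# tags that never take a closing tag in html
VOID_TAGS = {"br", "hr", "img"}

# flat set of every tag the bare-bones list permits
ALLOWED = {tag for tags in BARE_BONES.values() for tag in tags}


def render(tag, text="", **attrs):
    """hypothetical helper: wrap text in one bare-bones tag."""
    if tag not in ALLOWED:
        raise ValueError(f"not a bare-bones tag: {tag}")
    # attribute values are escaped, so quotes cannot break the markup
    attr_text = "".join(
        f' {key}="{escape(str(value), quote=True)}"'
        for key, value in attrs.items()
    )
    if tag in VOID_TAGS:
        return f"<{tag}{attr_text}>"
    return f"<{tag}{attr_text}>{escape(text)}</{tag}>"


if __name__ == "__main__":
    print(render("h1", "chapter 1"))
    print(render("p", "it was a dark & stormy night."))
    print(render("a", "contents", href="#toc"))
    print(render("hr"))
```

anything outside the seven categories (a table, say) raises an error here, which is one blunt way of marking where "bare-bones" ends and "a little bit more" begins.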
we will review how various tools can help us, offline and online, and i'll show you how easily i can build a tool that does the job.

we'll start with the simple structures we already found, above, and then work in any additional structures we learn we need...

to that end, if there are some books you want to point me to -- with complicated structures -- do please feel free to do so. it might take a little time to work up to the complicated books, but there's no reason we can't put 'em on the table right away.

have a nice weekend! :+)

-bowerbird