
i hope you all had a nice thanksgiving weekend... :+)

now let's ease back into our series on digitization tools.

***

our overall objective here is designing a workflow that can be used in a _collaborative_system_ of digitization, which does not preclude digitization by individuals too.

offline programs are fine -- indeed _preferred_ -- with the proviso that they have online equivalents, and these online versions can work in batch mode, since the emphasis is on the workflow of a _system_.

it is worth restating, however, that neither p.g. nor internet archive is open for improvements, meaning we're unfettered by any pesky legacy considerations. we are free to start anew...

(which is a good thing, since both of those systems now have workflows which are significantly flawed.)

***

you will recall our three overarching requirements for the process of digitizing an existing paper-book:

1. obtain the text (via scanners, and o.c.r. software)
2. clean the text (the first set of tasks for our tools)
3. turn the text into e-books (the second set of tasks)

the first requirement is largely taken care of, thanks to the big scanning efforts of google and internet archive.

under the second of those requirements, we covered:

2a. do a spellcheck
2b. fix spacey punctuation
2c. restore styling, e.g., italics

all of these tasks can be managed with a text-editor, whether offline or online. nothing here is "difficult"...

so we're working on the tasks for the third requirement:

3a. tagging the structural aspects of the text
3b. converting the text into .html (intermediate and final)
3c. converting intermediate .html into e-book output-files

our biggest task here is 3a, tagging the structural aspects...

and now, just to tell you where i am going with all of this, i'll inform you that i already programmed such a system... indeed, i've previously shown you the various parts of it.
one part was the "editing/proofing" system, a demo that i did using the "sitka" book that rfrank was also digitizing, using his system. this part takes care of requirement #2. yes, i can put that back up online if anyone wants to see it.

the other part was the online demo for creating e-books from a z.m.l. text-file, the one jana "helped" with... ;+) this web-app performed the tasks under requirement #3. and yes, i can put that back up, if anyone wants to see it...

***

again, the "human" need here occurs largely on task 3a, the tagging task. once that's done, scripts can take over.

in order to know what "tags" we will be needing to apply, we are examining some books that have been digitized... just for fun, we're using books that jim adcock digitized.

so far, we've found that quite a few of the books we want to digitize require only the barest minimum of tags. most books consist largely of paragraphs within chapters. (this really shouldn't be a major surprise to anyone here.)

of the books from jim that we've analyzed so far, we find he's used tags to label the following structural features:

a. headers -- [h1]-[h6]
b. paragraphs and breaks -- [p] and [br]
c. styling -- [i] and [b]
d. blocks -- [pre] and [blockquote]
e. horizontal rules -- [hr]
f. images -- [img]
g. links -- [a name] and [a href], mostly for the t.o.c.

we're gonna look at a few more books from jim, where we will find need for tags for a few more structural elements, but what we will learn is that this job is pretty darn simple.

after that, we're going to see how to generate our output, in the form of e-book files (such as .pdf, .epub, and .mobi), from a simple plain-ascii input-text z.m.l. "master" source.

as i said, i already did a demo of an _online_ app to do this, so the new twist for this series is that i am now releasing an _offline_ program to do the same thing, with arbitrary text.
(the earlier online demo was for a given text -- "the jungle", by upton sinclair, the influential classic from the early 1900s.)

yep, i am turning up the heat on all you lobsters in my pot...

-bowerbird
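
p.s. for anyone who wants to repeat the tag inventory on their own .html files, here is a minimal sketch of one way to do it. it is not the tool used for the analysis above, just an illustration that assumes python and its standard html.parser module. the names TagCounter and count_tags, and the sample text, are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """tally every opening tag seen in an .html document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser hands us the tag name already lowercased.
        # self-closing tags like <br/> also arrive here by default.
        self.counts[tag] += 1


def count_tags(html_text):
    """return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# hypothetical sample, not taken from any of the books discussed
sample = """<h1>a made-up title</h1>
<p>a first <i>short</i> paragraph.</p>
<hr>
<p>a second paragraph,<br>with a break.</p>"""

for tag, n in sorted(count_tags(sample).items()):
    print(tag, n)
```

on the sample it reports one each of br, h1, hr, and i, plus two p. run over a whole book, a tally like this shows at a glance how few distinct tags are really in play.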