
ok, so let's regroup once again for this new week... this thread's purpose is to review digitization tools. so we need to know what's involved in "digitization". our analysis gave us this:
1. obtain the text (via scanners, and o.c.r. software)
2. clean the text (the first set of tasks for our tools)
   2a. do a spellcheck
   2b. fix spacey punctuation
   2c. restore styling, e.g., italics
3. turn the text into e-books (the second set of tasks)
   3a. tagging the structural aspects of the text
   3b. converting the text into .html (intermediate and final)
   3c. converting intermediate .html into e-book output-files
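to make task #2b concrete, here is a minimal sketch in python. it is not the tool used in this thread, just an illustration; the function name and the exact set of punctuation marks it handles are my own assumptions, and spaced-out ellipses (". . .") would need separate handling.

```python
import re

def fix_spacey_punctuation(text):
    """hypothetical helper: tighten stray spaces that o.c.r. leaves around punctuation."""
    # drop spaces or tabs sitting just before closing punctuation: "word ," -> "word,"
    text = re.sub(r"[ \t]+([,.;:!?])", r"\1", text)
    # drop spaces or tabs just inside an opening bracket: "( word" -> "(word"
    text = re.sub(r"([(\[])[ \t]+", r"\1", text)
    # drop spaces or tabs just inside a closing bracket: "word )" -> "word)"
    text = re.sub(r"[ \t]+([)\]])", r"\1", text)
    return text

if __name__ == "__main__":
    print(fix_spacey_punctuation("hello , world ; ( a note ) it works !"))
```

the same regular expressions can be pasted into the search-and-replace of any wordprocessor that supports them, which is how task #2 was handled earlier.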
task #1 is largely taken care of for us, thanks to the public output of the major scanning projects. (although sometime this week i'll discuss the issue of how to get the best results from abbyy's output.)

task #2 is something we discussed earlier, finding that the jobs can be done fine in a wordprocessor.

that leaves us with task #3.

on some day this week, i will discuss the specifics of applying zen markup, and that post will cover the main thrust of task #3a.

and then, of course, the two "conversion" tasks -- #3b and #3c -- are being performed by the script which we're in the process of discussing right now.

***

so we need an application that does #3b and #3c.

i do programming. so my first instinct is to code. so if i have a job which needs to be accomplished, once i've coded something -- or tried to do so -- i can compare it with the tools that exist out there, and do an assessment of how everything stacks up.

what we'd like, for this app, is one that:
1. creates good html. (gotta do the job, and well.)
2. is cost-free. (because we don't have a budget.)
3. is user-friendly. (because we rely on volunteers.)
4. runs online and off. (so we have multiple options.)
5. is open-source. (because why not ask for it all?)

those are listed in decreasing order of importance. but everything through #4 is pretty much required.

we need an online component, because we will want our system to be capable of being used in a setting like project gutenberg or distributed proofreaders, a cyberlibrary meant to digitize and display e-books. an online presence is required by such a cyberlibrary.

at the same time, it would be best if the tools used by this cyberlibrary could also be run by people offline...

if we can meet all of our desires with our own code, it's gonna be hard for any other tool to get traction. so let's see how we do...
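as a rough picture of what task #3b involves, here is a minimal sketch in python. it is not the script discussed in this thread, and it does not implement zen markup; it only assumes that paragraphs are separated by blank lines, and the function name and the page skeleton are hypothetical.

```python
import html

def text_to_html(text, title="untitled"):
    """hypothetical sketch: wrap blank-line-separated paragraphs in <p> tags."""
    # split on blank lines, ignoring empty chunks
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # rejoin each paragraph's hard-wrapped lines, and escape &, < and >
    body = "\n".join(
        "<p>" + html.escape(" ".join(p.split())) + "</p>" for p in paragraphs
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        "<title>" + html.escape(title) + "</title>\n"
        "</head>\n<body>\n" + body + "\n</body>\n</html>\n"
    )

if __name__ == "__main__":
    print(text_to_html("first paragraph\nwraps here\n\nsecond & last", "demo"))
```

because this uses only the python standard library, the same function could sit behind a web form or be run from a command line, which is the sort of "online and off" flexibility asked for in #4.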

***

up to this point, the sample content for this thread was "books and culture", by hamilton wright mabie, a book that is distinguished by the fact that it was the very first public-domain book released by the big library scanning project being done by google.

it was a simple book, so it was good starter content. but now we want to work on more complicated stuff.

what we need, as a "proving ground" for our script, is a good test-suite, with the structures we require.

fortunately -- if you ascribe luck to elbow grease -- we have such a test-suite, one that i created in 2005. i've updated it a bit for this thread, and put it here:
so now let's give our script a new name as well:
that's just a first pass, so everything might not look exactly right just yet while we get started, but we'll work our way up to handle everything we need.

so look at that input test-suite, satisfying yourself that it doesn't have any sneaky tricks up its sleeves, and then take a look at the .html the script creates -- feel free to view the source too, and comment -- and then you'll be ready as we depart on this ride...

***

z.m.l. input file:
python script to create .html output:
-bowerbird