
so let's review some digitization tools, shall we?

the task of digitizing a book is now fairly simple, thanks to cheap scanners and good o.c.r. today...

scanners have been incorporated into printers for a while now, and even the cheaper ones are able to do a satisfactory job on most p-books...

also, the best-of-breed o.c.r. app -- abbyy -- is now cross-platform _and_ fairly affordable...

but the best news is that you don't even need to do the scanning or o.c.r. yourself nowadays -- if you want to do a public-domain book -- because the chances are very good that google or internet archive has already done it for you.

so search around to see if scans already exist. (if you want a book in a non-english language, you might find it at google or internet archive, but if not, there are other sites you can search, and you likely know more about 'em than i do.)

you might even discover that a book has been scanned multiple times. that's a good thing... what you should do is look at all of the copies, and find the one with the best-looking scans and (more important) the most-accurate o.c.r.

download the scans and the o.c.r., as well as a djvu copy (if available) and/or the .pdf copy.

the first thing to do is to spellcheck the o.c.r. when doing this, you must look at the scans! keep in mind that you aren't "correcting" the spelling, per se, but rather matching the book.

now, do not take this to a ridiculous extreme. if there was clearly an _error_ in the p-book, you will want to fix it. you might even wanna update some anachronisms, such as changing "to-day" to the version that we all use, "today". (i typically delete periods from chapter-heads.) and there might be some changes you'll make in order to make the text work as an _e-book_. (i make sure that the table-of-contents entries _exactly_match_ the chapter-heads themselves, so my automatic link-creator works correctly.) but be conservative in making these changes, because you might change the book's flavor...
for instance, in the early days, p.g. would strip out the pagenumbers from an index, "because an e-book doesn't have pages"... well, that's stupid, because those numbers didn't _hurt_ anything, and they might have proven to be quite useful down the line, so they _should've_ been left in, "just in case". don't make that kind of mistake.

while you are doing your spellcheck, it is very important that you _add_new_words_ to your spellcheck dictionary -- such as character names, place names, and so on. not only does this mean you only have to deal with them _once_, it also means that when a similar-looking variant comes up, it's probable that it's gonna be a "scanno".

adding words also means you can re-do the spellcheck after any/all future edits, with a result that's both quick and clean. that helps catch any errors that you have inadvertently introduced during editing, something which happens to everybody...

_after_ spellcheck, depending on the text you have downloaded, you might have to _restore_any_text_styling_, such as italics. if o.c.r. output gets saved as "plain-text", it loses its formatting, including _italics_. so you need to restore it.

in most books, there aren't many italicized words, so this is a manageable task, albeit very mundane.

if your book _does_ have lots of italics -- and you should check this at the outset -- you might want to re-do the o.c.r. for it... specify that the output be saved as .rtf... and then use your wordprocessor to place _markers_ around all the italicized words. you can do this by using a global change. you can use underscores as markers, or the bracket-i/bracket-slash-i html tags, or another convention you might invent.

you'll also want to mark any _bold_ too, but most books contain very little bold, except in their headers, which we will be marking separately in a later process, so you don't need to worry about them now.

in this vein, note that what you're marking right now are _individual_ italicized words or phrases.
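
if your italics already carry the bracket-i tags (say, because you saved the o.c.r. as html instead of .rtf, which is an assumption and might not be an option in your o.c.r. app), switching from one marker convention to the other can be scripted instead of done by hand. here's a rough python sketch of that global change, not a finished tool. the name tags_to_underscores is made up for illustration.

```python
import re

# matches <i>...</i> (any letter-case), as long as the open and
# close tags sit on the same line. "." does not match a newline
# here, so a tag pair that spans lines is left untouched on purpose.
ITALIC_TAG = re.compile(r"<i>(.+?)</i>", re.IGNORECASE)


def tags_to_underscores(text: str) -> str:
    """swap <i>phrase</i> markers for _phrase_ markers.

    hypothetical helper: it only handles short, single-line spans,
    which is what you are marking at this stage anyway.
    """
    return ITALIC_TAG.sub(lambda m: "_" + m.group(1) + "_", text)


if __name__ == "__main__":
    sample = "she read <i>Typee</i> twice, and <I>loved</I> it."
    print(tags_to_underscores(sample))
```

anything the pattern skips (like a tag pair that spans a line-break) stays as it was, so you can search for leftover tags afterward and handle those by looking at the scans.
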
if a whole paragraph is italics, or bold, make a note about it for the future.

the other thing you'll have to do, again in your wordprocessor, is to "close up" any "spacey punctuation". for the most part, these are double-quotemarks (and periods and commas) with a space on _both_ sides, so it's rather easy to do a search for them.

so the first tool you'll use for digitization is your old friend, your trusty wordprocessor...

-bowerbird

p.s. if you have a book you'd like me to use as an example during this series, suggest it.
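
p.p.s. the "spacey punctuation" search described above can also be done outside the wordprocessor. here's a minimal python sketch under some assumptions: periods and commas with a space on both sides always belong to the word on their left, so they can be closed up automatically, while a double-quotemark with a space on both sides could be an opening _or_ a closing quote, so the sketch only reports where those are, and you decide by looking at the scans. the function names are invented for illustration.

```python
import re

# one or more spaces, then a period or comma, then a space or end-of-line
SPACEY_STOP = re.compile(r" +([.,])(?= |$)", re.MULTILINE)

# a straight double-quotemark with a space on both sides
SPACEY_QUOTE = re.compile(r'(?<= )"(?= )')


def close_up_stops(text: str) -> str:
    """delete the space(s) in front of a spacey period or comma."""
    return SPACEY_STOP.sub(r"\1", text)


def find_spacey_quotes(text: str) -> list[int]:
    """return the character offset of each spacey double-quotemark.

    these are reported rather than fixed, because the text alone
    does not say which side the quotemark belongs to.
    """
    return [m.start() for m in SPACEY_QUOTE.finditer(text)]


if __name__ == "__main__":
    line = 'he said , " come in " and left .'
    print(close_up_stops(line))
    print(find_spacey_quotes(line))
```

one caveat: an ellipsis that was typeset with spaces between its dots will get closed up by close_up_stops too, so check how your book does its ellipses before you run anything like this across the whole text.
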