let's step through this process.011

now for the debriefing on our update walk-through...

***

we re-digitized "pride and prejudice", using this source:
all of our work-files are here:
the final "updated" master z.m.l. file is here:
the derivative versions are found here:
http://zenmagiclove.com/prhpr/prhpr.html
http://zenmagiclove.com/prhpr/prhprp222.html
***

ok, let's start this debriefing with the main headline:
digitization isn't as hard as some of you seem to think.

you might remember that i did this project because jon hurst wanted to re-do the top-10 p.g. downloads. he offered up some grand plan that would've involved d.p. in some complicated workflow, and it was obvious to me that he was overestimating the work required to re-do one of these classics...

thanks to plentiful scans and a variety of digitizations out in the wild, it can be relatively quick and easy to do books from start to end. so i decided to prove it with a bit of my own pudding.

now, one week later, we have a newly-done "update", including extensive documentation of the process, from little old me working a small number of hours on a handful of days. (it took more time to write it all up.)

by doing it myself, i have shown that you can do it too.

but if you _do_ choose to take action, take my advice: it is _not_ the smartest way to just jump in like i did, and start it up, not if you don't have the _experience_ that i have, working with those archive.org materials...

so here's some advice on choosing your material...

this is about archive.org, since i'm stuck with them, being that i gave up on google a while back because google seemed to be making it harder and harder to scrape their material, whereas archive.org is _open_.

but archive.org stuff can suck, so you must take care. particularly avoid the stuff that archive.org scraped from google, because then you get the worst quality of both institutions, and that can be a lethal combo.

(it might also be the case that now that google has brought out their commercial arm, the quality of the work they offer up to the public has improved. i haven't checked, so i don't know, but it _is_ a possibility.)

but back to internet archive...

my first piece of advice about archive.org material is to choose it very carefully. some of their scans are not-good. if a book has bad scans, _skip_the_book_.
life is too short to spend time staring at lousy scans. if you can't find good scans, move on to another title. you should particularly avoid scan-sets archive.org has "liberated" from google, because they are awful.

sometimes the _scans_ can look good enough to use, but the _o.c.r._ turned out lousy. skip those as well...

the o.c.r. which i used in this project was _terrible._ the letter "e" was frequently misrecognized as "c", and as you might know, the letter "e" is _common._ and there were lots of other misrecognitions too...

i was able to deal with it because i know the ropes, but until you clean a dozen archive.org o.c.r. files, you won't, so you should avoid the hassle entirely... (and once you _have_ cleaned a dozen such messes, you will know quite well you don't want to do more.)

you should especially watch out for some o.c.r. files from archive.org which have _lost_ their em-dashes! restoring those em-dashes is a huge pain in the ass.

speaking of the o.c.r., the easy thing to do is to grab the djvu.txt file that's listed right on the "files" page. that's what i did, but it is _not_ the right thing to do, not if you want to minimize the time it will take you.

it's much better to pull the text out of the x.m.l. file that's saved by abbyy finereader in the o.c.r. process. that's because _styling_ information gets saved there, and in many cases, it might well take a ton of time to reconstitute that styling by hand. so i recommend you extract it initially.

the problem with my recommendation is that you will have to write yourself an app to grab that information. (yes, i wrote such an app, but you may not use mine.)

if you can't write an app, or use somebody else's, then i guess you'll just have to use the djvu.txt file instead. it won't be the end of the world. heck, i did it myself, for this demo project, because i knew i already _have_ all of the styling information for "pride and prejudice".

but if you can write an app, then go ahead and do that.
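for anyone who does want to write such an app, here is a minimal sketch in python of what pulling styled text out of an abbyy x.m.l. file might look like. this is not the app mentioned above, just an illustration. it assumes the usual finereader layout: "par" elements holding "line" elements, which hold "formatting" runs, which hold one "charParams" element per character. it also assumes italics are flagged with an "italic" attribute on the formatting run. check those names against your own file before trusting them. marking italics with underscores is only a convention picked for this example, and the namespace in the sample is made up.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix so we can match on the bare tag name."""
    return tag.rsplit("}", 1)[-1]


def is_true(value):
    """Accept both spellings an XML boolean attribute may use."""
    return value in ("true", "1")


def line_to_text(line):
    """Join the formatting runs of one line, wrapping italic runs in underscores."""
    pieces = []
    for run in line:
        if local(run.tag) != "formatting":
            continue
        # Assumption: each charParams element carries exactly one character.
        chars = "".join(
            (cp.text or "") for cp in run if local(cp.tag) == "charParams"
        )
        if is_true(run.get("italic", "")) and chars.strip():
            # Keep surrounding spaces outside the underscores.
            lead = chars[: len(chars) - len(chars.lstrip())]
            trail = chars[len(chars.rstrip()):]
            chars = f"{lead}_{chars.strip()}_{trail}"
        pieces.append(chars)
    return "".join(pieces)


def abbyy_to_text(xml_string):
    """Return plain text, one blank line between paragraphs, italics marked."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for par in root.iter():
        if local(par.tag) != "par":
            continue
        lines = [line_to_text(ln) for ln in par if local(ln.tag) == "line"]
        lines = [ln for ln in lines if ln.strip()]
        if lines:
            paragraphs.append("\n".join(lines))
    return "\n\n".join(paragraphs)


# Tiny hand-made sample; the namespace is hypothetical and ignored by the code.
SAMPLE = (
    '<document xmlns="urn:example:abbyy"><page><block><text><par><line>'
    "<formatting><charParams>I</charParams><charParams>t</charParams>"
    "<charParams> </charParams></formatting>"
    '<formatting italic="true"><charParams>i</charParams>'
    "<charParams>s</charParams></formatting>"
    "</line></par></text></block></page></document>"
)

if __name__ == "__main__":
    print(abbyy_to_text(SAMPLE))
```

on a real book you would read the file from disk first (archive.org usually ships it gzipped, so something like gzip.open(path, "rt", encoding="utf-8").read() should do, with the path being whatever your download is called). this sketch loads the whole file into memory and does nothing about end-of-line hyphens or bold and small-caps, so treat it as a starting point only.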
another benefit of using the abbyy.xml file is that you will see that it contains those _em-dashes_ which the incompetent technocrats over there lost in the deja-vu.

also, if you will be working in a web-based application, you can squirt the scans from archive.org to your site, without doing all the dirty-work of downloading them to your machine and then uploading them to your site.

those are the big points. if anyone has any questions about smaller-grained issues, i'll be happy to answer, to the best of my ability. most glitches have bitten me.

as far as the step-by-step workflow of doing the edits, i have belabored that many, many times in the past, so your best bet in that regard will be to hit the archives.

***

just a couple more notes.

i programmed all my tools for this "pride and prejudice" project _from_scratch_... so even if you have to start from scratch, it ain't hard.

i don't think i hit hard enough that this last round of page-by-page .html files has an error-report form, right on each page, so it makes it _very_convenient_ for someone to report an error, unlike the p.g. way...

and yes, the text i created does need smoothreading. i'm not vouching for its full accuracy, by _any_ means.

i miss the good old days, when jose menendez would appear every time i posted something, and do a reveal of the 197 errors he found in it. that kept me humble.

so go ahead and find the errors that i could not find! it'll make me happy to have my human-ness exposed.

because, and this wraps up my summary, i think that i did a pretty good job. i think this text is amazingly clean. if there is a cleaner "pride and prejudice" _anywhere_, i would very much like to see it, so do please tell me...

and not only that, but i did the entire job in one week.

***

so, once again, the executive summary:
digitization isn't as hard as some of you seem to think.

-bowerbird