[for the graphics wizards] Cleaning up original Burton "Kama Sutra" page scans -- need advice/help

[There are no doubt a few graphics wizards here who might have some good advice to share on the following message, which I've been posting on various forums and groups. Of course, let me know if PG would be interested in archiving the cleaned-up page scans. As noted below, I've already started the process at DP to produce structured digital text of this book, which will then be fed back to PG. Thanks!]
I'm now working to "clean up" the 182 page images from a recent scan of a very rare and noteworthy public domain book. The cleaned-up scans will be released to the public (for example, given to the Internet Archive) for free access. [For those interested, the book is the 1885 second printing of the second edition of Sir Richard F. Burton's "Kama Sutra of Vatsyayana".]
The scans were done at 600 dpi (optical), 256-level greyscale (there's no color in the book), to capture sufficient fine detail to aid in the cleanup process. Of course, the book was chopped (the binding was falling apart anyway) and each page scanned on a flatbed, so there's no page distortion caused by trying to scan a bound book. There are no illustrations -- it's all black and white text.
I've already deskewed, cropped, centered and size-normalized all 182 pages. (For those interested, links to two sample partially-cleaned pages are given below.)
In the cleanup process, I'd like to convert what I now have into 600-dpi *bitonal* (black and white) with uniform and nicely readable character density, removal of "pepper", cleanup of larger blotches, etc. I recognize there will be some handwork required, particularly to remove larger "pepper" and blotches, repair a few characters, etc., but of course I want to minimize handwork.
[Note that the purpose of the cleanup is direct human use of the scans, and not solely OCR, which doesn't require the planned level of cleanup. For example, I plan to produce a DjVu version for direct reading. For those who will probably ask, the raw page scans have already been uploaded to Distributed Proofreaders for conversion to structured digital text.]
Unfortunately, what complicates the cleanup process is that the original book is in poor and variable condition. The paper is quite yellowed and darkened, and many pages are quite faded. Were the original in mint condition with good, uniform ink-to-paper contrast, I wouldn't be posting this request for advice. But the overall poor quality and page-to-page variation are taxing my graphics abilities to produce a clean finished product with reasonably readable and uniform character density (at 600-dpi bitonal).
Here are two sample pages, each about 4.5 megs in size (2550x3900 greyscale):
http://www.openreader.org/kamasutra/page031.png (good condition)
http://www.openreader.org/kamasutra/page106.png (poor condition)
I would assume that others have had similar needs and have come up with various processing tricks, and even built special tools, to aid in the cleanup process (e.g., how to auto-remove small "pepper", the one-to-few-pixel-wide black spots on the white background?). I look forward to your advice, and even your help if you are interested (I will upload all the partially-cleaned images somewhere if you want to help with the actual cleanup process -- the whole set of images totals 680 megs).
[As a final note, I use Paint Shop Pro 9, but do not have Photoshop. But since PSP9 is fairly powerful, I assume that many, if not all, recommended Photoshop processes will map over to PSP9.]
Thanks!
Jon Noring
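On the question of auto-removing small "pepper": one common approach (not something from this thread, and not tied to PSP9 or Photoshop) is to find connected groups of black pixels in the bitonal image and erase any group below a size cutoff. Here is a minimal pure-Python sketch, assuming the page is already bitonal and held as a list of rows of 0/1 values. The function name `remove_pepper` and the `max_size` cutoff are hypothetical choices made for illustration only.

```python
from collections import deque


def remove_pepper(bitmap, max_size=4):
    """Return a copy of `bitmap` with small black specks turned white.

    `bitmap` is a list of rows; each pixel is 1 (black ink) or 0 (white paper).
    A speck is an 8-connected group of black pixels containing at most
    `max_size` pixels. The default cutoff is illustrative, not tuned.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    result = [row[:] for row in bitmap]          # leave the input untouched
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected component of black pixels.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the component only if it is small enough to be noise.
            if len(component) <= max_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


# Tiny made-up example: one isolated speck and one larger block of "ink".
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_pepper(page, max_size=2)  # speck removed, block kept
```

On a real 2550x3900 page a pure-Python loop like this will be slow; it is meant to show the idea rather than to be used as is. The cutoff has to stay well below the pixel count of the smallest real marks (periods, commas, the dots on an "i"), or they will be erased along with the pepper.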

I have found that scanning in 600 dpi color and then saving only the red channel in Paint Shop Pro gets rid of most of the foxing and gives a very clean greyscale image, much better than scanning in greyscale to start with.
Alternatively, Capio from Kofax (available as a 30-day trial) has lots of cleanup options, including removing background colors (foxing). XP is required, though.
The usual procedure is to use unsharp masking to sharpen the edges of the letters. Another approach for bitonal is to scan at 14 or 16 bit and then chop off the background in Photoshop's histogram, after which you can still convert to 256-level greyscale without loss of data. Then adjust the break point for the best bitonal image.
Where have you posted your images, or some sample ones?
----- Original Message -----
From: "Jon Noring" <jon@noring.name>
To: <gutvol-d@lists.pglaf.org>
Sent: Wednesday, May 25, 2005 11:29 AM
Subject: [gutvol-d] [for the graphics wizards] Cleaning up original Burton "Kama Sutra" page scans -- need advice/help
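The red-channel, background-chopping and break-point suggestions in the reply can be sketched in a few lines. This is a rough illustration in plain Python on lists of pixel values, not a description of how Paint Shop Pro or Photoshop implement these steps; the function names and the sample black and white points are assumptions made up for the example. The red channel is a plausible choice because yellow-brown foxing reflects a lot of red light and so lands close to the paper tone there, while black ink stays dark.

```python
def red_channel(rgb_rows):
    """Keep only the red component of each (r, g, b) pixel as a grey value."""
    return [[pixel[0] for pixel in row] for row in rgb_rows]


def chop_background(grey_rows, black_point, white_point, in_max=65535):
    """Stretch [black_point, white_point] to 0..255 and clip everything outside.

    Values at or above `white_point` (the paper background) become pure white
    and values at or below `black_point` become pure black. `in_max` is the
    largest possible input value: 65535 for 16-bit scans, 255 for 8-bit ones.
    """
    if not 0 <= black_point < white_point <= in_max:
        raise ValueError("need 0 <= black_point < white_point <= in_max")
    span = white_point - black_point
    return [
        [min(255, max(0, round((value - black_point) * 255 / span)))
         for value in row]
        for row in grey_rows
    ]


def to_bitonal(grey_rows, break_point=128):
    """Pixels darker than `break_point` become 1 (ink), the rest 0 (paper)."""
    return [[1 if value < break_point else 0 for value in row]
            for row in grey_rows]


# Made-up 2x2 colour sample: yellowed paper, one ink pixel, two foxing spots.
rgb = [[(200, 170, 110), (40, 35, 30)],
       [(215, 150, 90), (210, 180, 120)]]
red = red_channel(rgb)                               # [[200, 40], [215, 210]]
levels = chop_background(red, 40, 200, in_max=255)   # [[255, 0], [255, 255]]
bitonal = to_bitonal(levels)                         # [[0, 1], [0, 0]]
```

In practice the black point, white point and break point would be picked per page from its histogram, since the pages vary so much in fading and yellowing; the fixed numbers here are only placeholders.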
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d
participants (2): Jon Noring, N Wolcott