
On 12/21/2011 8:53 AM, James Simmons wrote:
1). I don't have corrected pages. I started doing the page-at-a-time thing and gave up. Your pages are already better than mine because I used Tesseract and archive.org uses ABBYY FineReader.
2). This book really requires a way to enter UTF-8 characters. If I could just stick a circumflex above a's, u's, and i's (both lower and upper case), that would be 99% of what I need. That's why I use jEdit: there is a plugin that makes a docked window for entering these characters. As you can see, the OCR actually worked pretty well. Most of what I'm doing now (after de-hyphenating, joining split paragraphs, and re-wrapping everything) is putting in those circumflexes.
Take a look at the files from http://www.passkeysoft.com/~lee/studyofbhagavata00benaiala_abbyy.zip. Each page has its own file, encoded in UTF-8. Em dashes are preserved; all soft hyphens (the ones you would want to get rid of) have been replaced by . These files were transformed from the archive.org output, so they benefit from ABBYY's OCR. Unfortunately, I don't think IA turned on foreign-character recognition (yes, it can be done), so diacritical marks will probably still be missing.
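
For what it's worth, the cleanup steps mentioned above (de-hyphenating, joining split paragraphs, re-wrapping) could be scripted over the per-page files rather than done by hand. What follows is only a rough sketch in Python, not a tested tool for this book. The function names, the "*.txt" file pattern, and the 72-column width are all made up for illustration, and the soft-hyphen marker is passed in as a parameter because it has to match whatever character the files actually use. Putting in the circumflexes is not covered; that still needs a human or a word list.

```python
import re
import textwrap
from pathlib import Path


def load_pages(directory, pattern="*.txt"):
    """Read every page file in name order as UTF-8 text.

    The "*.txt" pattern is an assumption; adjust it to the real file names.
    """
    return [p.read_text(encoding="utf-8") for p in sorted(Path(directory).glob(pattern))]


def join_pages(pages, soft_hyphen_marker, width=72):
    """Join page texts, undo soft-hyphen line breaks, and re-wrap paragraphs."""
    # Drop newlines at page edges so a paragraph that runs across a page
    # boundary is not split by an accidental blank line.
    text = "\n".join(page.strip("\n") for page in pages)

    # A marker at the end of a line means the word continues on the next line,
    # so remove the marker together with the line break.
    text = re.sub(re.escape(soft_hyphen_marker) + r"[ \t]*\n[ \t]*", "", text)

    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = re.split(r"\n\s*\n", text)

    # Collapse the whitespace inside each paragraph, then wrap it again.
    wrapped = [
        textwrap.fill(" ".join(p.split()), width=width)
        for p in paragraphs
        if p.strip()
    ]
    return "\n\n".join(wrapped)
```

Usage would be something like join_pages(load_pages("pages"), marker), with marker set to the replacement character found in the files. Hard hyphens and em dashes are left alone, since only a marker sitting at the end of a line is removed.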