Re: [gutvol-d] book of james -- 001

22 Dec 2011

      On 12/21/2011 8:53 AM, James Simmons wrote:
...
1).  I don't have corrected pages.  I started doing the page-at-a-time
thing and gave up.  Your pages are already better than mine because I
used Tesseract and archive.org uses ABBYY FineReader.
2).  This book really requires a way to enter UTF-8 characters.  If I
could just stick a circumflex above a's, u's, and i's (both lower and
upper case) that would be 99% of what I need.  That's why I use jEdit:
there is a plugin that makes a docked window for entering these
characters.  As you can see the OCR actually worked pretty well.  Most
of what I'm doing now (after de-hyphenating, joining split paragraphs,
and re-wrapping everything) is putting in those circumflexes.
Take a look at the files from 
http://www.passkeysoft.com/~lee/studyofbhagavata00benaiala_abbyy.zip. 
Each page has its own file, encoded in UTF-8. Em dashes are 
preserved; all soft hyphens (the ones you would want to get rid of) have 
been replaced by .
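
For anyone who wants to work with the per-page files in bulk, here is a minimal Python sketch that loads every page from the zip into memory as UTF-8 text. It assumes the archive has already been downloaded locally; the function name read_pages is hypothetical, and the sketch makes no attempt to handle the soft-hyphen placeholder.

```python
import io
import zipfile


def read_pages(zip_source):
    """Return {member_name: text} for each file in the archive, decoded as UTF-8.

    zip_source may be a path or a binary file-like object.
    """
    pages = {}
    with zipfile.ZipFile(zip_source) as archive:
        # Sort by name so pages come back in a predictable order.
        for name in sorted(archive.namelist()):
            if name.endswith("/"):
                continue  # skip directory entries
            pages[name] = archive.read(name).decode("utf-8")
    return pages
```

Calling read_pages on the downloaded zip file gives a dictionary keyed by member name, which can then be joined or corrected one page at a time.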

These files were transformed from archive.org's output, so they benefit 
from ABBYY's OCR. Unfortunately, I don't think IA enabled foreign-language 
characters during recognition (yes, it can be done), so diacritical marks 
will probably still be missing.
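
Since the circumflexes will have to be put back by hand either way, here is a rough Python sketch of one way to do it without a special input plugin. It assumes a purely hypothetical convention of typing a caret after the vowel (for example "a^") while proofreading; the function names are mine, not part of any existing tool. It relies on Unicode NFC normalization to combine a letter with the combining circumflex (U+0302).

```python
import re
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"


def add_circumflex(letter):
    """Return the precomposed circumflex form of a letter, e.g. 'a' -> 'â'."""
    return unicodedata.normalize("NFC", letter + COMBINING_CIRCUMFLEX)


def carets_to_circumflexes(text):
    """Replace a, i, u (either case) followed by '^' with the accented letter.

    The 'vowel + caret' markup is an assumed convention for this sketch only.
    """
    return re.sub(r"([aiuAIU])\^", lambda m: add_circumflex(m.group(1)), text)
```

With this, a line typed as "Bha^gavata Pura^na" comes out with a circumflex on each marked vowel, and any caret that does not follow one of those six letters is left alone.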

Lee Passey