Re: [gutvol-d] book of james -- 001

for want of better text, i've now loaded in the original o.c.r. for the "book of james".
this isn't the edit interface, it's just a skeleton to display the o.c.r./scan combo for each page.
resizing your browser-window resizes the scan.
so james, if you want me to pursue this at all, send me your corrected files. otherwise, i will just be done with it.
i'd love to see an xhtml tutorial on this book...
***
roger said:
yes! roger's back! :+)
different users with different browsers.
iphone, no. ipad, also no, unfortunately.
doesn't appear to size itself very well...
http://z-m-l.com/misc/cinema-screen-shot.png lotta unused space on my 23-inch screen.
anyway, all you people who bellyache about "open source" need to step up to the plate... go over and help roger code something cool. show that you really _deserve_ open source...
***
don said:
Is this closer?
i don't think so. i'd say you're getting colder. but i _will_ read that story about the turnip...
***
keith said:
elves
aha! that's great. i didn't think the gerbils would have enough manual dexterity... but _elves_ are very handy with their... hands...
Now, if I could only get them to program.
oh, they don't have to be able to code at all.
just tell them to run these commands:
pandoc -i criterion.html -o markdown.md
pandoc -i criterion.html -o rst.rst
pandoc -i markdown.md -o markdown.html
pandoc -i rst.rst -o rst.html
if criterion.html and markdown.html and rst.html are not "sufficiently similar", let me know why not.
***
carlo said:
djvutxt --page=123 djvufile.djvu 123.txt
ok, i get it now. it's on _my_ machine. i thought you issued it on archive.org.
i can't thank you enough for that, carlo; this ends a years-long headache for me.
-bowerbird
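
One possible way to put a number on "sufficiently similar" for the pandoc round trip above is to compare only the text content of the HTML files, ignoring the markup. This is a minimal sketch under that assumption; the helper names `visible_text` and `similarity`, and the choice of Python's `difflib`, are illustrative and do not come from the thread.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags.

    Note: the contents of <script> and <style> elements are collected
    too; that is good enough for a rough comparison of simple pages.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of `html` with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def similarity(html_a, html_b):
    """Ratio between 0 and 1; 1.0 means the extracted text is identical."""
    matcher = difflib.SequenceMatcher(
        None, visible_text(html_a), visible_text(html_b), autojunk=False
    )
    return matcher.ratio()
```

To use it on the files named above, read criterion.html, markdown.html and rst.html into strings and call `similarity` on each pair. Which ratio counts as "sufficiently similar" is a judgment call that this sketch does not settle, and it says nothing about differences in the markup itself.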

Bowerbird,
This looks pretty neat. There are a couple of points to mention, though.
1). I don't have corrected pages. I started doing the page-at-a-time thing and gave up. Your pages are already better than mine because I used Tesseract and archive.org uses ABBYY FineReader.
2). This book really requires a way to enter UTF-8 characters. If I could just stick a circumflex above a's, u's, and i's (both lower and upper case) that would be 99% of what I need. That's why I use jEdit: there is a plugin that makes a docked window for entering these characters.
As you can see the OCR actually worked pretty well. Most of what I'm doing now (after de-hyphenating, joining split paragraphs, and re-wrapping everything) is putting in those circumflexes.
James Simmons
On Wed, Dec 21, 2011 at 4:14 AM, <Bowerbird@aol.com> wrote:
this isn't the edit interface, it's just a skeleton to display the o.c.r./scan combo for each page.
so james, if you want me to pursue this at all, send me your corrected files. otherwise, i will just be done with it.
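
The cleanup steps James lists (de-hyphenating, joining split paragraphs, re-wrapping) could be sketched roughly as below. This is not his actual workflow. It assumes that a hyphen directly before a line break always splits a single word, and that blank lines separate paragraphs. The function name `clean_ocr_text` and the width of 72 are made up for illustration.

```python
import re
import textwrap


def clean_ocr_text(raw, width=72):
    """De-hyphenate, join the lines of each paragraph, and re-wrap."""
    # Assumption: "word-\nrest" is one word broken across two lines.
    # A genuinely hyphenated compound such as "well-\nknown" would be
    # joined wrongly, so the result still needs proofreading.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for para in paragraphs:
        joined = " ".join(para.split())  # collapse line breaks and spaces
        if joined:
            cleaned.append(textwrap.fill(joined, width=width))
    return "\n\n".join(cleaned)
```

A paragraph that continues across a page boundary still has to be joined by hand, or by concatenating the page files before running this, since the sketch only sees what it is given.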

"James" == James Simmons <nicestep@gmail.com> writes:
James> 2). This book really requires a way to enter UTF-8
James> characters. If I could just stick a circumflex above a's,
James> u's, and i's (both lower and upper case) that would be 99%
James> of what I need.
Then use US international keyboard.
http://en.wikipedia.org/wiki/Keyboard_layout
There is an alternative layout that uses the physical US keyboard to type diacritics in some operating systems (including Windows). ... uses keys ', `, ", ^ and ~ as dead keys used to generate characters with diacritics by pressing the appropriate key, then the letter on the keyboard. The international keyboard is a software setting installed from the Windows control panel
The disadvantage is that to type ^ you need ^-space.
Mac and Linux also allow you to define a compose key. You type compose-^-a to get a with circumflex.
Carlo
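
For text that is already typed, the same "^ then letter" idea can be applied after the fact. The sketch below assumes a hypothetical plain-ASCII convention in which a caret is typed in front of the letter, as in "Bh^agavata". That convention and the function names are illustrative only. The composition itself uses Unicode's combining circumflex (U+0302) and NFC normalization from Python's standard `unicodedata` module.

```python
import re
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"


def with_circumflex(letter):
    """Return `letter` with a circumflex, as one precomposed character."""
    # NFC normalization merges the base letter and the combining mark
    # when Unicode has a precomposed form (true for a, i, u, A, I, U).
    return unicodedata.normalize("NFC", letter + COMBINING_CIRCUMFLEX)


def expand_carets(text):
    """Turn the hypothetical '^a' notation into circumflexed letters."""
    # Only a, i and u (both cases) are handled, since those are the
    # letters James mentions; any other caret is left untouched.
    return re.sub(
        r"\^([aiuAIU])",
        lambda match: with_circumflex(match.group(1)),
        text,
    )
```

Writing the result to a file with `encoding="utf-8"` gives the UTF-8 characters the book needs. A literal caret that happens to precede one of these letters would also be converted, so the output should be checked.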

Carlo,
I'll try this tonight. If it works you've saved me a heap of time!
James Simmons
On Wed, Dec 21, 2011 at 1:50 PM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
Then use US international keyboard.
Mac and linux also allow to define a compose key. You type compose-^-a to get a with circumflex.

Carlo,
I defined a compose key on my Linux box last night and it works great. I can finish the book twice as fast now. Thanks again!
James Simmons
On Wed, Dec 21, 2011 at 1:50 PM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
Mac and linux also allow to define a compose key. You type compose-^-a to get a with circumflex.

On 12/21/2011 8:53 AM, James Simmons wrote:
1). I don't have corrected pages. I started doing the page-at-a-time thing and gave up. Your pages are already better than mine because I used Tesseract and archive.org uses ABBYY FineReader.
2). This book really requires a way to enter UTF-8 characters. If I could just stick a circumflex above a's, u's, and i's (both lower and upper case) that would be 99% of what I need. That's why I use jEdit: there is a plugin that makes a docked window for entering these characters. As you can see the OCR actually worked pretty well. Most of what I'm doing now (after de-hyphenating, joining split paragraphs, and re-wrapping everything) is putting in those circumflexes.
Take a look at the files from http://www.passkeysoft.com/~lee/studyofbhagavata00benaiala_abbyy.zip. Each page has its own file, and is encoded in UTF-8. Em-dashes are preserved; all soft hyphens (the ones you would want to get rid of) have been replaced by . These files were transformed from the archive.org output, so they benefit from ABBYY's OCR. Unfortunately, I don't think IA turned on foreign characters during recognition (yes, it can be done), so diacritical marks will probably still be missing.
participants (4):
- Bowerbird@aol.com
- James Simmons
- Lee Passey
- traverso@posso.dm.unipi.it