
Bowerbird,

I wrote a Python utility to do the page-at-a-time correction, and I have done that with previous contributions where the OCR was decent. (I did The Big Sleep, Benchley Beside Himself, and Ancient Manners that way.) That approach seemed less than optimal for my last two contributions, so I did the process as described.

What I may be able to do is get the original file from archive.org and put in the lines of equals using regex search and replace. There will of course be no corrections. I'd like to stay the course with what I am working on, but if I can give you something useful to experiment with I certainly will. If I can use ZML formatting on my text and run it through your converter I'll do that too.

James Simmons

On Tue, Dec 20, 2011 at 5:10 PM, <Bowerbird@aol.com> wrote:
james said:
I do have the book as 1 text page per file.
ok. (i think.)
I got it this way by downloading the page images, making TIFFs out of them, then running tesseract
oh dear. that was a waste of time.
I can give you the separate text files in a Zip archive if you wish.
are these the files after you made your corrections? if so, then yes, those are exactly the files that i need. zip 'em up, and put it in your dropbox.
My work method is to use guiguts to remove page numbers and reformat paragraphs first.
oh dear. more wasted time. oh well.
(also, removing pagenumbers is the _last_ thing to do. they help let you be aware where you are in the book.)
The link you gave gives me a 404 error.
yes, here's the correct one:
sorry about that...
I'm not sure what you mean by online.
i mean you do your corrections on the web... which means that other people can help you. (at least if you give them the web-address.)
but if you prefer to work offline, you can do that.
I thought you would provide a command line utility
i'm a mac person, james. we believe in a friendly interface. only a sadist seeks to saddle you with command-line crap...
that would convert ZML to the various formats.
but first you have to get your text _into_ .zml format.
Were you thinking of something like DP uses?
"something like" that is a fairly accurate description. my system isn't nearly as convoluted or bureaucratic.
-bowerbird
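
The regex step James mentions above (putting lines of equals signs into the archive.org text) might look roughly like the sketch below. It is not his actual utility. The separator width, the function name, and above all the page-break pattern are assumptions: the thread does not say how page boundaries are marked in the archive.org file, so a form feed is used here purely as a placeholder.

```python
import re

# Hypothetical separator: the thread only says "lines of equals";
# the width of 40 characters is an assumption.
SEPARATOR = "=" * 40


def insert_page_separators(text, page_break_pattern=r"\f"):
    """Replace each page-break match with a separator line of equals signs.

    page_break_pattern is an assumption (form feed by default); the real
    file may mark page boundaries differently, or not at all.
    """
    return re.sub(page_break_pattern, "\n" + SEPARATOR + "\n", text)


if __name__ == "__main__":
    sample = "text of page one\ftext of page two\ftext of page three"
    print(insert_page_separators(sample))
```

If the file marks pages some other way, only the pattern argument would need to change, for example a pattern matching a line that holds nothing but a page number.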
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d