
On Tue, December 13, 2011 8:35 pm, Jim Adcock wrote:
What I am not clear about is why BB insists that what one starts from must be "an impoverished text-file", because I never work with text files per se until I am forced to derive one at the end of my html development, as a needless extra step, in order to get the PG WWers to accept my html work. I do not start with "an impoverished text-file" for the simple reason that my OCR gives me better file format choices which help preserve more of the information available in the original page images, such that I do not have to rediscover and re-enter that information again later manually -- after needlessly throwing that information away in the first place just to reduce the OCR result to txt70.
I don't get this either. I /never/ start with impoverished text. (Well, okay, I did one once. But it was sooooo painful that I vowed I would never do it again.) If FineReader is offering to save my OCR as HTML (class 2 tag soup), why would I not accept the offer? Use a tool like Tidy to convert to XHTML and I have a file that can easily be manipulated with scripts as well as plain text editors.

This is one of the reasons I want so badly to see kenh's script from archive.org, or at least to have it running. BB is right that the OCR text at Internet Archive is unusable as a starting point for e-books, but I think I could work with the HTML output from that script. If I knew how it worked, I could probably even replicate it in a different programming language for off-line use.

Heck, even the stuff at Distributed Proofreaders nowadays has a modicum of HTML embedded in it. About the only reason you would need to start with plain text is if you're trying to fix the early e-texts in the PG corpus -- not a bad idea, but there are better places to start if that's really what you're trying to do.
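For anyone who hasn't tried it, the Tidy-plus-scripts step looks roughly like the sketch below. This is just an illustration, not my exact workflow: it assumes the OCR program has saved its HTML to a file, that HTML Tidy and Python with the lxml library are available, and that the file names are placeholders.

    # A sketch of cleaning OCR tag soup into something scriptable.
    # First run HTML Tidy to get well-formed XHTML:
    #
    #   tidy -q -asxhtml -utf8 -o book.xhtml book.html
    #
    # Then strip the inline presentation the OCR tends to leave behind.
    from lxml import html

    doc = html.parse("book.xhtml")              # placeholder file name
    root = doc.getroot()

    # Drop inline style/class attributes scattered through the OCR output.
    for el in root.iter():
        if not isinstance(el.tag, str):         # skip comments and PIs
            continue
        el.attrib.pop("style", None)
        el.attrib.pop("class", None)

    # Unwrap <span> elements that no longer carry any attributes.
    for span in root.findall(".//span"):
        if not span.attrib:
            span.drop_tag()                     # keeps the text, drops the tag

    doc.write("book-clean.xhtml", method="xml", encoding="utf-8")

After something like that, the file opens cleanly in a plain text editor and is regular enough to keep hacking on with further scripts, which is the whole point of not throwing the markup away.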