
"James" == James Simmons <nicestep@gmail.com> writes:
James> There are page numbers in the original text file from
James> archive.org. They usually start with a left square
James> bracket, but not always as the OCR messes some of them up.

James> I do have the book as 1 text page per file. I got it this
James> way by downloading the page images, making TIFFs out of
James> them, then running tesseract on them. The results of this
James> process are generally good, but in this case the text file
James> provided by archive.org was a lot better so that is what I
James> chose to use as my starting point. I can give you the
James> separate text files in a Zip archive if you wish.

You seem to be unaware that you can get the text for a single page
from the Internet Archive djvu files, through the djvutxt command and
the -page option. Or get any range of pages, separated by FormFeed
characters.

The *_djvu.txt files provided at TIA are just these same files with
some non-printing characters transformed into something else (in pure
z.m.l. style, e.g. FormFeed is replaced by three blank lines....) thus
making recovering the page breaks more difficult.

Carlo Traverso
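
For illustration, here is a minimal Python sketch of the djvutxt approach described above: extract a page or a range of pages, then split the output at FormFeed characters. The file name book.djvu and the helper names (djvutxt_args, split_pages, read_pages) are hypothetical and not taken from the message. The sketch assumes djvutxt from DjVuLibre is on the PATH and emits UTF-8.

```python
import subprocess

FORM_FEED = "\f"  # page separator in djvutxt output, per the message above


def djvutxt_args(djvu_path, first, last=None):
    """Build the djvutxt argument list for one page or a page range."""
    spec = str(first) if last is None else f"{first}-{last}"
    return ["djvutxt", f"-page={spec}", djvu_path]


def split_pages(text):
    """Split djvutxt output into a list of page texts at FormFeed characters."""
    pages = text.split(FORM_FEED)
    # A FormFeed at the very end leaves one empty chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def read_pages(djvu_path, first, last=None):
    """Run djvutxt (assumed to be installed) and return one string per page."""
    result = subprocess.run(
        djvutxt_args(djvu_path, first, last),
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",  # assumption: djvutxt writes UTF-8
    )
    return split_pages(result.stdout)
```

Calling read_pages("book.djvu", 1, 10) would then return a list of page texts for pages 1 to 10, with blank pages kept as empty strings so that list positions still line up with page numbers.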