
james said:
Carlo, I defined a compose key on my Linux box last night and it works great. I can finish the book twice as fast now.
carlo is handing out the holiday gifts, isn't he? :+)
thanks again, carlo, for the djvu-text-extraction info...
***
james, i just wanna make sure we're on the same page. (see how i did that?) ;+)
i'm cleaning the text now, and i will hand it over to you sometime next week. that's my understanding, anyway. is it your understanding as well?
i want to be positively _certain_ that we do _not_ have the situation where we're working on separate tracks, because merging 2 sets of corrections is a nightmare.
good version control is an absolute requirement.
-bowerbird

Bowerbird,
I am working on and will continue to work on the file here:
http://dl.dropbox.com/u/8919415/AStudyOfTheBhagavataPurana.txt
I've put in much too much work on it to just abandon it, and since it has been de-hyphened, rewrapped and de-paged there is no good way to get my work into your system.
However, what I *could* do is work on my book from a certain point onwards in your system. Everything before that point would remain uncorrected in your system but would be corrected in my file. I would use your system to create a complete ZML marked up file, then I would overlay the tail end of that file on my original file, add ZML markup where needed on my original, then run the result through your converters.
I would guess this is not the workflow you had in mind. I have done previous books with the page-at-a-time method, and if I had known how to extract individual pages from DJVUs I would have used it for this one too.
There is another possibility, which would be to do a different book. PG currently has the Bhagavad Gita only in German:
http://www.gutenberg.org/ebooks/33186
There are a couple of good possibilities for a public domain English translation:
http://www.archive.org/details/srimadbhagavadg00swamgoog
http://www.archive.org/details/bhagavadgitason00johngoog
This is a MUCH shorter book than the Bhagavata Purana, but it should demonstrate the value of your approach. I favor the first link over the second one. It seems to be more authentic. I'm sure the finished product would get a lot of downloads.
If you like I can set aside my current project when your system is ready, and work on the Gita instead. There should be no problem getting copyright clearances on either one of these.
James Simmons
On Thu, Dec 22, 2011 at 2:33 PM, <Bowerbird@aol.com> wrote:
james said:
Carlo, I defined a compose key on my Linux box last night and it works great. I can finish the book twice as fast now.
carlo is handing out the holiday gifts, isn't he? :+)
thanks again, carlo, for the djvu-text-extraction info...
***
james, i just wanna make sure we're on the same page. (see how i did that?) ;+)
i'm cleaning the text now, and i will hand it over to you sometime next week. that's my understanding, anyway. is it your understanding as well?
i want to be positively _certain_ that we do _not_ have the situation where we're working on separate tracks, because merging 2 sets of corrections is a nightmare.
good version control is an absolute requirement.
-bowerbird
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

I am working on and will continue to work on the file here:
http://dl.dropbox.com/u/8919415/AStudyOfTheBhagavataPurana.txt
I've put in much too much work on it to just abandon it, and since it has been de-hyphened, rewrapped and de-paged there is no good way to get my work into your system.
If one has two reasonably clean versions of the txt files of the same images, file1 with page separators and file2 without, it is reasonably simple to merge the two taking the separators from the first one and the text from the second one.
The trick is to use wdiff (GNU wdiff, possibly dwdiff, not wdiff.com wdiff); assume for example that the separators are a line that cannot appear in the text, e.g.
<--------Page 123------>
then do
wdiff file1 file2 > file3
file3 consists of file2 in which differing parts (sequence of different words, thus disregarding whitespace) are represented like
This is the [-first-] {+second+} version
One has just to preserve the common parts, the text of the second version and the separators. The line breaks and other whitespace are preserved from the second version.
Of course this is not so simple as I said, since one has to take care of extra spaces and newlines, and be sure that the difference separator strings [- , -] , {+ , +} , do not appear in either text; but these strings can be changed with wdiff options -w, -x, -y, -z.
Handling whitespace might be a bit trickier, (maybe use wdiff -n), but you can at the end run
diff file2 file3
to see what went wrong (if everything is OK you see just the separators and the lines that are split between the pages).
The position of the separators in the result might be a bit off if there are differences in the words immediately before or after the separators. For example, a word split between pages goes to the second page.
I don't have much experience, or ready-to-use scripts, since I always preserve dearly my separators, (I have simple scripts to join files preserving the filenames in the separators and split recovering the filename), but I have sometimes helped others to recover their lost page separators.
Carlo
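The clean-up step Carlo describes (keep the common parts, keep the text of the second version, keep the separators) could be sketched in Python roughly as follows. This is only an illustration, not Carlo's script: the function name keep_separators is hypothetical, it assumes wdiff's default markers [- -] and {+ +}, it assumes separators look like the example line above, and it assumes neither text contains the marker strings. It works on the contents of file3 as a string; real wdiff output may place whitespace a little differently, so the result should still be checked with diff as suggested.

```python
import re

# Assumed separator shape, modelled on the example "<--------Page 123------>".
SEPARATOR = re.compile(r"<-+Page \d+-+>")
# Text present only in file1 (default wdiff deletion markers), plus one
# run of trailing blanks so that no double space is left behind.
DELETED = re.compile(r"\[-(.*?)-\][ \t]*", re.DOTALL)
# Text present only in file2 (default wdiff insertion markers).
INSERTED = re.compile(r"\{\+(.*?)\+\}", re.DOTALL)


def keep_separators(wdiff_text):
    """Return file2's text with file1's page separators re-inserted."""

    def handle_deleted(match):
        # Drop file1-only words, but keep any separators found among them,
        # each on a line of its own.
        separators = SEPARATOR.findall(match.group(1))
        return "".join("\n" + sep + "\n" for sep in separators)

    text = DELETED.sub(handle_deleted, wdiff_text)
    # Unwrap file2-only words so the corrected text survives.
    text = INSERTED.sub(lambda match: match.group(1), text)
    # Remove blanks left dangling at the end of a line.
    return re.sub(r"[ \t]+\n", "\n", text)


if __name__ == "__main__":
    sample = ("end of page one [-<--------Page 2------>-] "
              "start of [-pagc-] {+page+} two\n")
    print(keep_separators(sample), end="")
```

A word split across a page boundary will still end up on one side of the separator, as noted above, so the lines around each separator deserve a manual look.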

Carlo,
I appreciate the suggestion, but the file I'm working on has pages combined, hyphenated words re-joined, and paragraphs re-wrapped. The only thing left to do is to correct spellings and add accents (mostly circumflexes), add [Illustration] tags, format ASCII art family trees, etc. If Bowerbird can give me a conversion utility I'll put in ZML markup, otherwise I'll do what I've done in the past.
It's a 400+ page book and I'm nearing page 80 at the moment. The OCR at archive.org came out pretty well so the corrections are not too numerous. I would have done the corrections one page at a time with page images alongside if I had known how to get separate text files for each page.
I had intended to do the Bhagavad Gita next, a much shorter book, and that might be a better one to use with Bowerbird. It has footnotes, poetry line numbers, chapter headings, etc. and might be a good proof of concept for what Bowerbird is proposing. I can always put my current book on the back burner for a while.
James Simmons
On Fri, Dec 23, 2011 at 12:13 AM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
I am working on and will continue to work on the file here:
http://dl.dropbox.com/u/8919415/AStudyOfTheBhagavataPurana.txt
I've put in much too much work on it to just abandon it, and since it has been de-hyphened, rewrapped and de-paged there is no good way to get my work into your system.
If one has two reasonably clean versions of the txt files of the same images, file1 with page separators and file2 without, it is reasonably simple to merge the two taking the separators from the first one and the text from the second one.
The trick is to use wdiff (GNU wdiff, possibly dwdiff, not wdiff.com wdiff); assume for example that the separators are a line that cannot appear in the text, e.g.
<--------Page 123------>
then do
wdiff file1 file2 > file3
file3 consists of file2 in which differing parts (sequence of different words, thus disregarding whitespace) are represented like
This is the [-first-] {+second+} version
One has just to preserve the common parts, the text of the second version and the separators. The line breaks and other whitespace are preserved from the second version.
Of course this is not so simple as I said, since one has to take care of extra spaces and newlines, and be sure that the difference separator strings [- , -] , {+ , +} , do not appear in either text; but these strings can be changed with wdiff options -w, -x, -y, -z.
Handling whitespace might be a bit trickier, (maybe use wdiff -n), but you can at the end run
diff file2 file3
to see what went wrong (if everything is OK you see just the separators and the lines that are split between the pages).
The position of the separators in the result might be a bit off if there are differences in the words immediately before or after the separators. For example, a word split between pages goes to the second page.
I don't have much experience, or ready-to-use scripts, since I always preserve dearly my separators, (I have simple scripts to join files preserving the filenames in the separators and split recovering the filename), but I have sometimes helped others to recover their lost page separators.
Carlo
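Carlo mentions simple scripts that join page files while keeping each filename in its separator, and split the combined file again by recovering the filenames. His scripts are not shown in the thread, so the following is only a guess at the idea in Python. The separator format "<--------File NAME------>" and the function names join_pages and split_pages are assumptions made for illustration.

```python
import re
from pathlib import Path

# Hypothetical separator format that carries the page's filename.
SEP_TEMPLATE = "<--------File {name}------>"
SEP_RE = re.compile(r"^<-+File (.+?)-+>$")


def join_pages(page_files, combined):
    """Concatenate page files, preceding each with a filename separator."""
    with open(combined, "w", encoding="utf-8") as out:
        for path in page_files:
            path = Path(path)
            out.write(SEP_TEMPLATE.format(name=path.name) + "\n")
            text = path.read_text(encoding="utf-8")
            # Make sure the next separator starts on its own line.
            out.write(text if text.endswith("\n") else text + "\n")


def split_pages(combined, out_dir):
    """Split a combined file back into pages named by the separators."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    current = None
    with open(combined, encoding="utf-8") as src:
        for line in src:
            match = SEP_RE.match(line.rstrip("\n"))
            if match:
                if current is not None:
                    current.close()
                # Keep only the bare filename so pages stay inside out_dir.
                target = out_dir / Path(match.group(1)).name
                current = open(target, "w", encoding="utf-8")
                written.append(target)
            elif current is not None:
                current.write(line)
    if current is not None:
        current.close()
    return written
```

For example, join_pages(["001.txt", "002.txt"], "book.txt") followed by split_pages("book.txt", "pages") should recreate the two page files under pages/, provided no line of the text itself looks like a separator.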
participants (3)
- Bowerbird@aol.com
- James Simmons
- traverso@posso.dm.unipi.it