book of james -- 016

ok, james, it's nose-to-the-grindstone time... :+)
today we attack your diacritics, most of 'em on names...
***
james said:
Your list of words and names has no accents or diacritical marks, and neither does the HTML. Even with global search and replace that's not going to be fun to put back.
there must be something you don't understand...
but that's ok. that's why i'm here... :+)
in a nutshell, changing between versions of the file which _do_ or _do_not_ have the diacritics is merely a matter of converting all of 'em with a single script.
or, if we arrange it another way, clicking a button...
click it once to replace non-diacritics with diacritics, and click again to change diacritics to non-diacritics.
back. and forth. now you see 'em; now you don't.
i won't try to make the claim that clicking a button is "fun", but i'd certainly dispute any claim it's "difficult", which is what i believe you were trying to imply above.
maybe you thought you'd be doing 'em one-at-a-time?
think again. we're doing things the easy way now...
***
first, take a good long look at the diacritics you did:
i pulled those out of your edited version of the file...
***
next, here's the demo code for changing to diacritics:
***
and likewise, the demo code for removing diacritics:
***
notice that, to develop the script, i had this demo code actually perform its conversion _on_ the file containing the _data_ which it uses, the file containing the words...
you can find that dummy file here:
(if that file doesn't display properly, you will need to force your web-browser to recognize the file as utf8.)
***
of course, eventually, we'll have it use the _real_ file:
but first, james, you're gonna need to clean that file...
there are a couple of different ways you could do that, and i don't know which of them will be easiest for you.
can you recognize the correct version of each word by just looking at it? or do you need to look at the scans?
if you can do it by just looking at the list, that's great. if you need to look at scans, i'll code a routine for you.
***
but... either way, before you start, let's look at the file...
(an important note's at the end of this, so don't skip it.)
there are some 777 words on it, give or take a few, but i'd guess a couple hundred of them are _errors_,
for instance, up top, we have two similar ones:
Akuti Akriti
those might be two different names. or one could be a scanno for the other, in which case it must be fixed. there are lots of these cases. further down, we have:
Vyasa Vyana
again, could be two names. or one could be a scanno. pay attention, as some of them look very close:
Surasa Surama ... Sinivati Sinivali ... Sinhaya Sinhaka ... Sayujya Saynjya ... Satarupa Satarapa
***
there are 25 cases that particularly need attention. you can locate these cases by searching for "yyy"...
for instance, here's the top one:
Avirmukhiyyy Avirmukhizzz
the "yyy" and "zzz" were added by me, as markers...
in such cases, you'll see that the non-diacritic versions are exactly the same (or would be, without my markers).
however, the diacritic versions _differ_, in some way...
i take it that these are _not_ different names, but rather that you did their diacritics differently. perhaps that is because you made a mistake, or maybe the p-book had it different, maybe because they made a mistake, or...
but whatever the case, if these names are supposed to be _the_same_person_, then they should be made the same...
(unless i misunderstand something about diacritics, which is totally a possibility, as i claim to know zilch about them.)
***
one last thing, james, and it's extremely important to note. when you make these corrections, you must do the edits to the _diacritic_ side of the pair, not the non-diacritic side... (the non-diacritic side of the pair is what's in the file now.)
however...
although the _edit_ must be made to the diacritic _side_, the actual edit _can_ be the correct non-diacritic version.
this will work because we'll be running the converter twice.
so the first time through, the incorrect non-diacritic will be changed to the correct non-diacritic. then in the second run, the correct(ed) non-diacritic is changed to the correct diacritic.
if you have questions about this, feel free to backchannel me...
-bowerbird
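a minimal sketch of why two runs do the job, in python. it is not the actual converter: the pair layout, the function name, and the sample entries are assumptions, it assumes "Saynjya" is the scanno, and the accented spelling is a placeholder that has not been checked against the scans. each run looks every word up once in a table built from the (diacritic side, non-diacritic side) pairs:

```python
import re

# (diacritic_side, non_diacritic_side) pairs; this layout is an assumption.
PAIRS = [
    # scanno entry: the diacritic side was edited to hold the correct PLAIN spelling
    ("Sayujya", "Saynjya"),
    # ordinary entry: placeholder accent (a-acute), not verified against the scans
    ("S\u00e1yujya", "Sayujya"),
]

WORD = re.compile(r"\w+")


def add_diacritics(text, pairs):
    """One converter run: replace each whole word found on the non-diacritic side."""
    table = {plain: marked for marked, plain in pairs}
    return WORD.sub(lambda m: table.get(m.group(0), m.group(0)), text)


text = "Saynjya and Sayujya"
first_run = add_diacritics(text, PAIRS)        # scanno becomes the correct plain form
second_run = add_diacritics(first_run, PAIRS)  # correct plain form gets its diacritic
print(first_run)
print(second_run)
```

because each run replaces a word only once, the scanno needs the first run to reach the corrected plain spelling and the second run to pick up its diacritics.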

Bowerbird,
This looks good, mostly. I should be able to devote some time to this on Saturday. Not before then, unfortunately.
One thing I spotted: "S with an accent acute above it" (Compose+'+s) shows up in your file as S', like this:
S´ridhara -- Sridhara
That won't do. The accent needs to be above the "s". The sound is identical to "sh" in English, and some books do it that way, but the author chose to do it the way he did. This might be one of those "combined diacriticals" in UTF-8, but most of the time I used my compose key to make them. I could do a search and replace in JEdit to fix them all if that's what is needed. However, I thought you should be aware that your word extraction program is doing this and it is wrong.
I have a thought on looking up the words. PDFs and DjVus from archive.org have text contained in them. I should be able to put in a questionable word from the right column and see what it should be on the left, then fix it.
Thanks,
James Simmons
On Thu, Jan 19, 2012 at 3:43 PM, <Bowerbird@aol.com> wrote:
ok, james, it's nose-to-the-grindstone time... :+)
today we attack your diacritics, most of 'em on names...
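On the S´ problem James describes, a hedged sketch in Python of one possible repair. It assumes the stray mark is the free-standing acute accent (U+00B4) sitting after its letter, which is a guess from how the pair displays; the function name is made up for illustration. The idea is to swap that mark for the combining acute (U+0301) and let Unicode NFC normalization fold letter plus mark into a single precomposed character where one exists:

```python
import re
import unicodedata

SPACING_ACUTE = "\u00b4"    # free-standing accent, as assumed in "S´ridhara"
COMBINING_ACUTE = "\u0301"  # combining accent, attaches to the preceding letter


def attach_acute(text: str) -> str:
    """Move a spacing acute that follows a letter onto that letter."""
    # letter + spacing acute  ->  letter + combining acute
    fixed = re.sub("([A-Za-z])" + SPACING_ACUTE, r"\1" + COMBINING_ACUTE, text)
    # NFC composes letter + combining mark into one character when Unicode has one
    return unicodedata.normalize("NFC", fixed)


repaired = attach_acute("S\u00b4ridhara")
print(repaired)
```

For S the precomposed character exists (U+015A), so the result is one character with the accent above the letter. Letters without a precomposed form keep the separate combining mark, which should still render above them.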
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d