book of james -- 016

19 Jan 2012

      ok, james, it's nose-to-the-grindstone time...              :+)

today we attack your diacritics, most of 'em on names...

***

james said:
...
Your list of words and names has no accents
   or diacritical marks, and neither does the HTML.
   Even with global search and replace
   that's not going to be fun to put back.
there must be something you don't understand...

but that's ok.   that's why i'm here...            :+)

in a nutshell, changing between versions of the file
which _do_ or _do_not_ have the diacritics is merely
a matter of converting all of 'em with a single script.

or, if we arrange it another way, clicking a button...

click it once to replace non-diacritics with diacritics,
and click again to change diacritics to non-diacritics.

back.   and forth.   now you see 'em; now you don't.
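
to make the back-and-forth concrete, here's a minimal sketch in python. it is _not_ the code in waxon.py or waxoff.py, just an illustration under assumptions: the two sample pairs (and their diacritic spellings), the name `convert`, and the whole-word matching are all made up for this example. the real pairs come from the data file.

```python
import re

# hypothetical sample pairs: (non-diacritic, diacritic).
# the real list lives in the data file, not here.
PAIRS = [
    ("Vyasa", "Vyāsa"),
    ("Satarupa", "Śatarūpā"),
]

def convert(text, pairs, to_diacritics=True):
    """swap every listed word for its partner, in one direction or the other."""
    mapping = {}
    for plain, marked in pairs:
        src, dst = (plain, marked) if to_diacritics else (marked, plain)
        mapping[src] = dst
    if not mapping:
        return text
    # longest words first, so a long name is tried before a shorter one
    words = sorted(mapping, key=len, reverse=True)
    # \b keeps us from touching a name buried inside a longer word
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)

line = "Vyasa met Satarupa."
with_marks = convert(line, PAIRS, to_diacritics=True)
without_marks = convert(with_marks, PAIRS, to_diacritics=False)
print(with_marks)
print(without_marks)
```

the same pair list drives both directions, which is why flipping back and forth costs nothing once the list is right.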

i won't try to make the claim that clicking a button is
"fun", but i'd certainly dispute any claim it's "difficult",
which is what i believe you were trying to imply above.

maybe you thought you'd be doing 'em one-at-a-time?

think again.   we're doing things the easy way now...

***

first, take a good long look at the diacritics you did:
...
http://zenmagiclove.com/bhaga/diacritics777.html
i pulled those out of your edited version of the file...

***

next, here's the demo code for changing to diacritics:
...
http://zenmagiclove.com/bhaga/waxon.py

***

and likewise, the demo code for removing diacritics:
...
http://zenmagiclove.com/bhaga/waxoff.py

***

notice that, to develop the script, i had this demo code
actually perform its conversion _on_ the file containing
the _data_ which it uses, the file containing the words...

you can find that dummy file here:
...
http://zenmagiclove.com/bhaga/waxer.txt
(if that file doesn't display properly, you will need to
force your web-browser to recognize the file as utf8.)
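
the same caution applies if you read the file from a script instead of a browser. a small sketch, assuming one entry per line (the real layout of waxer.txt may well differ) and a hypothetical helper named `load_words`:

```python
def load_words(path):
    """read one entry per line, decoding the bytes explicitly as utf-8."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# what a wrong encoding guess does to a word with a diacritic
word = "Vyāsa"
raw = word.encode("utf-8")
garbled = raw.decode("latin-1")                    # wrong guess: the marked letter turns to junk
repaired = garbled.encode("latin-1").decode("utf-8")  # undo the wrong guess
print(repr(garbled))
print(repaired == word)
```

naming the encoding in `open()` avoids depending on whatever default the machine happens to use.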

***

of course, eventually, we'll have it use the _real_ file:
...
http://zenmagiclove.com/bhaga/diacritics777.html
but first, james, you're gonna need to clean that file...

there are a couple of different ways you could do that,
and i don't know which of them will be easiest for you.

can you recognize the correct version of each word by
just looking at it?   or do you need to look at the scans?

if you can do it by just looking at the list, that's great.
if you need to look at scans, i'll code a routine for you.

***

but...   either way, before you start, let's look at the file...

(an important note's at the end of this, so don't skip it.)

there are some 777 words on it, give or take a few,
but i'd guess a couple hundred of them are _errors_...

for instance, up top, we have two similar ones:
...
Akuti
   Akriti
those might be two different names.   or one could be
a scanno for the other, in which case it must be fixed.

there are lots of these cases.   further down, we have:
...
Vyasa
   Vyana
again, could be two names.   or one could be a scanno.

pay attention, as some of them look very close:
...
Surasa
   Surama
...
   Sinivati
   Sinivali
...
   Sinhaya
   Sinhaka
...
   Sayujya
   Saynjya
...
   Satarupa
   Satarapa
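
pairs like the ones above can be flagged by machine, so nobody has to spot them by eye. here's a rough sketch, assuming the words are already loaded into a python list. the function names and the distance threshold are my own choices for illustration, not part of the demo scripts, and the routine only _flags_ candidates: it can't tell a scanno from a genuinely different name.

```python
from itertools import combinations

def edit_distance(a, b):
    """classic levenshtein distance: inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

def lookalikes(words, max_distance=1):
    """return every pair of distinct words within max_distance edits."""
    unique = sorted(set(words))
    return [(a, b) for a, b in combinations(unique, 2)
            if edit_distance(a, b) <= max_distance]

sample = ["Vyasa", "Vyana", "Surasa", "Surama", "Akuti", "Akriti"]
print(lookalikes(sample))                  # one-letter differences only
print(lookalikes(sample, max_distance=2))  # also catches Akuti / Akriti
```

comparing every word against every other is slow in principle, but for a list of several hundred words it's a few hundred thousand comparisons, which is nothing.
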
***

there are 25 cases that particularly need attention.
you can locate these cases by searching for "yyy"...

for instance, here's the top one:
...
Avirmukhiyyy
   Avirmukhizzz
the "yyy" and "zzz" were added by me, as markers...

in such cases, you'll see that the non-diacritic versions
are exactly the same (or would be, without my markers).

however, the diacritic versions _differ_, in some way...

i take it that these are _not_ different names, but rather
that you did their diacritics differently.   perhaps that is
because you made a mistake, or maybe the p-book had
it different, maybe because they made a mistake, or...

but whatever the case, if these names are supposed to be
_the_same_person_, then they should be made the same...

(unless i misunderstand something about diacritics, which
is totally a possibility, as i claim to know zilch about them.)
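
for the curious, cases like these can be found automatically: strip the marks off each diacritic form, then see which stripped forms were reached from more than one spelling. a sketch under assumptions: the function names are hypothetical, the two spellings of "Avirmukhi" below are invented for the example (they are _not_ taken from your file), and "removing diacritics" is assumed to mean nothing more than dropping unicode combining marks.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word):
    """decompose each letter, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def conflicting_forms(marked_words):
    """map each plain form to its diacritic spellings, keeping only conflicts."""
    groups = defaultdict(set)
    for w in marked_words:
        # normalize to NFC so the same spelling typed two ways counts once
        groups[strip_diacritics(w)].add(unicodedata.normalize("NFC", w))
    return {plain: sorted(forms)
            for plain, forms in groups.items() if len(forms) > 1}

# invented spellings, for illustration only
sample = ["Avirmukhī", "Āvirmukhī", "Vyāsa"]
print(conflicting_forms(sample))
```

anything this reports is a name whose diacritics were done more than one way, which is exactly the kind of case the markers point at.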

***

one last thing, james, and it's extremely important to note.
when you make these corrections, you must do the edits to
the _diacritic_ side of the pair, not the non-diacritic side...
(the non-diacritic side of the pair is what's in the file now.)

however...

although the _edit_ must be made to the diacritic _side_,
the actual edit _can_ be the correct non-diacritic version.

this will work because we'll be running the converter twice.

so the first time through, the incorrect non-diacritic will be
changed to the correct non-diacritic.   then in the second run,
the correct(ed) non-diacritic is changed to the correct diacritic.
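
here's the two-run idea in miniature. this is a sketch, not the real converter: `run_converter` is a hypothetical stand-in that assumes each run makes a single pass over the text, and the diacritic spelling in the second pair is just for illustration. the first pair is a scanno whose "diacritic side" was edited to hold the correct non-diacritic; the second pair is the ordinary entry for that corrected word.

```python
import re

def run_converter(text, pairs):
    """one pass: replace each left-hand word with its right-hand partner."""
    mapping = dict(pairs)
    words = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)

PAIRS = [
    ("Saynjya", "Sayujya"),   # scanno -> corrected non-diacritic
    ("Sayujya", "Sāyujya"),   # correct non-diacritic -> diacritic (illustrative)
]

text = "Saynjya and Sayujya"
once = run_converter(text, PAIRS)    # scanno fixed, good word gets its mark
twice = run_converter(once, PAIRS)   # the fixed word now gets its mark too
print(once)
print(twice)
```

after the second run both occurrences agree, and a third run would change nothing, since no left-hand word remains in the text.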

if you have questions about this, feel free to backchannel me...

-bowerbird
