Re: Name lists and Big-endianism

and we see yet another excellent example of how the "metadata" b.s. is such an unproductive path. the o.c.d. people love to focus on these minute details, which make very little difference at all -- who cares how "van holst" is sorted?, or if the "van" is capitalized or not?, or indeed whether it is "capitalised" or not?, because a search for "holst" is gonna find it no matter what you do -- and, as if this insignificance wasn't bad enough, such compulsiveness usually causes full paralysis. you can tie yourself up worrying about that crap... or you can cut the gordian knot and be productive. -bowerbird

Hi There, On 20.09.2009 at 23:52, Bowerbird@aol.com wrote:
and we see yet another excellent example of how the "metadata" b.s. is such an unproductive path.
Not true. It is a matter of how the metadata is used or structured. See below.
the o.c.d. people love to focus on these minute details, which make very little difference at all -- who cares how "van holst" is sorted?, or if the "van" is capitalized or not?, or indeed whether it is "capitalised" or not?, because a search for "holst" is gonna find it no matter what you do -- and, as if this insignificance wasn't bad enough, such compulsiveness usually causes full paralysis.
Here BB is right on the point. Basically, the metadata is a database, so we have the field for the name and then one or several fields for indexing that field. Furthermore, in a typical library catalogue you will find "Walter van Holst" under "Walter van Holst", "van Holst, Walter" and "Holst, van, Walter".
So where does it leave us? With the development of a structured database. Which means that we will have to compromise, that is, cover the basic cases and in certain cases hand-edit the fields involved. These special cases will be harder to find, but there will be a set of rules which will help us look for them. To make things easier we could use cross-references as in library catalogues. There is no magic bullet.
As an example, take a look at iTunes. It has a field for sorting Artist. They use a DB, and for my own CDs the information is fetched from a different DB. I have my own notion of how things should be sorted, so I edit the "sort for Artist" field. The only problem here is that for classical music, sorting/indexing by Artist is not viable. I prefer to use the Composer field, so I have to use a different index.
So what should be done is to say: our index follows these rules for names. If you cannot find a name where you expect it to be, do a full-text search of field X and you should find what you are looking for; if not, use the full name field!
regards Keith.
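To make the "one name field plus several index fields" idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `AuthorName` class, its three fields, and the `build_index` helper are hypothetical names invented here, not anything in the PG catalogue, and the sketch assumes every name has a given part. It generates the three catalogue forms mentioned above for "Walter van Holst" and files the record under each of them, roughly like cross-references in a library catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorName:
    """Hypothetical record: one name split into hand-editable parts."""
    given: str      # e.g. "Walter"
    particle: str   # e.g. "van"; empty string when there is none
    family: str     # e.g. "Holst"

    def full_name(self) -> str:
        # Join only the parts that are present.
        return " ".join(p for p in (self.given, self.particle, self.family) if p)

    def index_entries(self) -> list[str]:
        """All forms under which the name should be findable."""
        entries = [self.full_name()]
        if self.particle:
            entries.append(f"{self.particle} {self.family}, {self.given}")
            entries.append(f"{self.family}, {self.particle}, {self.given}")
        else:
            entries.append(f"{self.family}, {self.given}")
        return entries


def build_index(names):
    """Map every case-folded entry form to its record (cross-references)."""
    index = {}
    for name in names:
        for entry in name.index_entries():
            index[entry.casefold()] = name
    return index


walter = AuthorName(given="Walter", particle="van", family="Holst")
index = build_index([walter])
print(walter.index_entries())
```

Because the parts are stored separately, a special case can be corrected by hand in one field without touching the rule that generates the index entries.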

On Mon, 21 Sep 2009 09:14:20 +0200, "Keith J. Schultz" <schultzk@uni-trier.de> wrote:
the o.c.d. people love to focus on these minute details, which make very little difference at all -- who cares how "van holst" is sorted?, or if the "van" is capitalized or not?, or indeed whether it is "capitalised" or not?, because a search for "holst" is gonna find it no matter what you do -- and, as if this insignificance wasn't bad enough, such compulsiveness usually causes full paralysis.
Here BB is right on the point.
Not quite. If I am looking for a book written by a particular author, I want to be able to search for his or her name and not for all books about that particular author. Therefore metadata has a purpose, albeit a somewhat reduced one in this era of sophisticated search algorithms. And to that particular bird that is usually relegated to my spambox: I really do care whether the 'van' part in my family name is capitalised or not. I'm rather proud of it and do not need beastly pseudonyms to cower behind.
Regards, Walter

Hi Walter Generally I agree, though I don't think that most of the extant search algorithms are so sophisticated. Most packages use brute force, relying on fast hardware. "Throwing silicon at the problem." It works to a point, but in data sets that grow far enough to run into exponential problems (even large quadratic problems ftm) a decent design relying on an appropriate algorithm can do nice things for nice people. CU Jon
Not quite. If I am looking for a book written by a particular author, I want to be able to search for his or her name and not for all books about that particular author. Therefore metadata has a, albeit in this era of sophisticated search algorithms, somewhat reduced, purpose.
And to that particular bird that is usually relegated to my spambox: I really do care whether the 'van' part in my family names is capitalised or not. I'm rather proud of it and do not need beastly pseudonyms to cower behind.
Regards,
Walter _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Hi Keith
With the development of a structured database. Which means that we will have to compromise, that is, cover the basic cases and in certain cases hand-edit the fields involved. These special cases will be harder to find, but there will be a set of rules which will help us look for them. To make things easier we could use cross-references as in library catalogues.
There is no magic bullet. As an example, take a look at iTunes. It has a field for sorting Artist. They use a DB, and for my own CDs the information is fetched from a different DB. I have my own notion of how things should be sorted, so I edit the "sort for Artist" field. The only problem here is that for classical music, sorting/indexing by Artist is not viable. I prefer to use the Composer field, so I have to use a different index.
I take your point, but I reckon that with a bit of definition of canonical fields and formats one should be able to clean the lot up, with the exception of cases where previous manual record entry had violated sensible rules. Most of the problems could be cleaned up automatically, and only the horrible examples (basically errors) need get special manual treatment. Trying to construct special rules for your data base to negotiate would fall foul of the ingenuity of fools. Whether you really need a "formal data base" or not is an open question. Some direct access to properly sorted and indexed files can be startlingly effective.
Jon
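As a rough illustration of what "direct access to properly sorted and indexed files" can look like without any formal database, the Python sketch below keeps canonically formatted keys in a sorted list and uses binary search from the standard `bisect` module. The "family, given" key format, the sample records, and the helper names are assumptions made for the example only.

```python
import bisect


def canonical_key(family, given):
    """Assumed canonical form: 'family, given', case-folded."""
    return f"{family}, {given}".casefold()


# Illustrative records only; a real list would be read from a file.
records = [
    ("Holst", "Walter van"),
    ("Richfield", "Jon"),
    ("Schultz", "Keith J."),
    ("Adcock", "James"),
]
sorted_keys = sorted(canonical_key(family, given) for family, given in records)


def find_prefix(keys, prefix):
    """Return all keys starting with prefix, using binary search to find the start."""
    prefix = prefix.casefold()
    pos = bisect.bisect_left(keys, prefix)
    matches = []
    # Keys sharing the prefix are adjacent in sorted order, so stop at the first miss.
    while pos < len(keys) and keys[pos].startswith(prefix):
        matches.append(keys[pos])
        pos += 1
    return matches


print(find_prefix(sorted_keys, "Holst"))
```

The lookup costs a logarithmic number of comparisons plus the number of matches, which is why it stays fast as the list grows, provided the keys really are in one canonical format.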

Hi Jon, On 21.09.2009 at 17:39, Jon Richfield wrote:
Hi Keith
I take your point, but I reckon that with a bit of definition of canonical fields and formats one should be able to clean the lot up with the exception of cases where previous manual record entry had violated sensible rules. Most of the problems could be cleaned up automatically, and only the horrible examples (basically errors) need get special manual treatment. Trying to construct special rules for your data base to negotiate, would fall foul of the ingenuity of fools.
Whether you really need a "formal data base" or not is an open question. Some direct access to properly sorted and indexed files can be startlingly effective.
Basically, I was not saying we need a "formal database" or system. The fact is that the information in the files basically constitutes a database, albeit the information is structured. As I mentioned, due to the restrictions defined for the metadata, the desired features are not possible in the present form, though those restrictions could easily be overcome.
regards Keith.

Really BB! You of all people! You can do better than that! I had assumed that you were IT-savvy. What you say suggests that you may be a DB user, but you sure as Sherridan don't talk like a DB designer, much less a systems designer.
Capitalised or not? Weeeellll... maybe if the distinction is built into your software and hard to leave out. Have you considered what difference it makes to the mechanics of sorting, classification, or access? Whether you see it happen or not? For little toy kilo-record files it might be trivial, but we don't all work on those all the time.
How anything is sorted? Oh boy... BB, in a certain large corporation which here shall be nameless, I got lumbered with a job of indexing the world-wide email and phone list after some other people repeatedly failed to do it. (Their software tools kept dying when fed the full files.) I wrote the application from scratch with no pain in an unfamiliar language in a few days, partly because I saw to it that a temp got hired to re-format all the names canonically. A year or two later Global HQ decreed a new, commercial-DB-based (Again no names of which large corporation's DB package it was based on!) package, and so we used that instead. Except that the savvy seniors clandestinely loaded and retained my version for years afterward because it was easier to use, more often successful in searching, and faster than the off-the-shelf even when there was a first-time hit. Canonically formatted files are VERY efficiently handleable. But you knew that BB, didn't you?
How about this? A certain file-checking job involved cross-checking two files against each other. (Again, never mind which international corporation's files those were!) The job had been manual, but rapidly became infeasible as the files grew. Someone wrote a quick-and-dirty to help, but it took a week to run (5-day week, but still!) and only partly did the job. Someone (maybe the same guy; I don't remember) did the job better, and it ran in a day, still partly successfully. Someone else did a totally different job and it ran in a couple of hours, almost successfully, but it didn't work. Then to get me out of someone's hair I got the job. I began by reformatting the input file every run. Stupid, but whoever expected anything else. Run time, including the sort (Which I also had to write myself) and selection match pass: 49 seconds. Several orders of magnitude improvement in performance plus perfect results. And best of all, it didn't take a lot of sexy programming, just competent design. I probably could have halved the times for both jobs if I had written in low level code, but it wasn't really necessary.
Now BB, I reckon that when proper attention changes a job from not worth running, to so trivial that at first the user thinks that the job hadn't run, it is not a "minute detail, which makes very little difference at all", but a very important detail, which makes enough difference to get management respect -- till the next toughie comes along! You see BB, 'who cares how "van holst" is sorted? --a search for "holst" is gonna find it no matter what you do' is exactly the sort of detail that made the difference in the real life cases.
Would you believe, BB, that I could go on for some time in this vain vein? My Gordian (Note the Caps BB!) gnot was nicely productive once I kut it with proper knit-picking design (as in untangling rather than depediculotic activity). Not a louse-egg of "full paralysis" in sight, or in anyone's hair! It is not a matter of bottom-up vs top-down; it is knowing when and why which is appropriate.
Cheers, Jon
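The message above only says the fast version consisted of a sort followed by a "selection match pass"; the actual program is not shown. One plausible reading is a classic sort-merge comparison, sketched below in Python. The function name and the sample keys are hypothetical, and the sketch assumes both files have already been reduced to canonically formatted keys.

```python
def cross_check(left, right):
    """Compare two collections of canonical keys with a sort-merge pass.

    Returns (only_in_left, in_both, only_in_right).
    """
    a = sorted(set(left))
    b = sorted(set(right))
    only_left, both, only_right = [], [], []
    i = j = 0
    # Walk both sorted lists in step; each key is examined once.
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            both.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            only_left.append(a[i])
            i += 1
        else:
            only_right.append(b[j])
            j += 1
    # Whatever remains in either list has no partner in the other.
    only_left.extend(a[i:])
    only_right.extend(b[j:])
    return only_left, both, only_right


result = cross_check(
    ["holst", "adcock", "starner"],
    ["starner", "schultz", "holst"],
)
print(result)
```

After the sort, the match pass is linear in the combined size of the two files, whereas comparing every record of one file against every record of the other grows with the product of their sizes. That difference is the kind of gap that can separate a week-long run from one measured in seconds.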
and we see yet another excellent example of how the "metadata" b.s. is such an unproductive path.
the o.c.d. people love to focus on these minute details, which make very little difference at all -- who cares how "van holst" is sorted?, or if the "van" is capitalized or not?, or indeed whether it is "capitalised" or not?, because a search for "holst" is gonna find it no matter what you do -- and, as if this insignificance wasn't bad enough, such compulsiveness usually causes full paralysis.
you can tie yourself up worrying about that crap... or you can cut the gordian knot and be productive.
-bowerbird

and we see yet another excellent example of how the "metadata" b.s. is such an unproductive path.
the o.c.d. people love to focus on these minute details, which make very little difference at all -- who cares how "van holst" is sorted?, .
You make great big assumptions about the nature of the machines that people are reading on, and then draw incorrect conclusions based on those assumptions. Yes, if all readers are reading on desktop computers running some flavor of *nix then your conclusions may be correct. But not all readers of PG books are running *nix, or even desktops. Many of these machines have a very different notion of "sorting" than you have in mind. Which is why we just had this conversation a couple of days ago, but I guess many people didn't get it.
On my favorite class of machine, which something like a million+ other readers are reading on, and more every day, "sorts" are typically done on authorlastname, where authorlastname is something provided within the book file. That part which does not correspond to authorlastname is stored by convention in authorfirstname. This sort information is displayed to the reader in one of two ways, both of which ought to appear sensible:
Authorlastname, authorfirstname
and
Authorfirstname authorlastname
In either case the actual sort should be on authorlastname.
This class of machine has no notion of the idea that you can type in part of an author's name and search on that. Rather, all the books on the machine are sorted and displayed in order by authorlastname, and you find a book by scrolling for the authorlastname in sort order within that list.
Why does this matter? Consider the famous author name Sun Tzu.
What is the last name? Sun
What is the first name? Well, no one actually knows, but historically "Tzu", which is actually an honorific, is stuck in the authorfirstname slot.
But now look what happens. In the authorlastname, authorfirstname case you get:
Sun, Tzu
which is not a bad result. In the authorfirstname authorlastname case you get:
Tzu Sun
which is an error.
Thus, perhaps, one concludes that for names where the family name needs to display first, the encoding has to be:
Authorlastname: Sun Tzu
Authorfirstname: null
In which case both displays work out right.
How does one write an automatic algorithm to figure these things out from an existing gut authorlist? The answer, again, is that one cannot write an automatic algorithm to figure these things out, because currently there isn't enough information stored about author names, and further, how author names are sorted and displayed is based in part on library tradition, perhaps best found by researching the Library of Congress for a particular author.
Another way of saying this: let's say you make the mistake of wandering into a Barnes and Noble when you were actually trying to enter the Starbucks next door. But while in there you decide to look at the fiction stacks just for fun to see if they have your favorite author. Where in the stacks do you look? Well, that depends on how B&N sorts on your favorite author, which in turn is based on library tradition for that particular author.
Yes, you can try to write an algorithm to do this, but then you will find that surprisingly often it breaks, because it seems that having an unusual family name is a prereq for writing a book.
You can then say "oh well this is PG we really don't care why be o.c.d.?" But then you are producing books that work worse, in practice, for customers, on customers' machines, compared to the other publishing houses, making PG look like amateur hour.
You might say "well then they shouldn't have bought that machine rather they should buy my favorite choice of machine." But customers tend to consider that attitude towards their choice of machine a sign of hostility towards the customer by PG - which I guess is why PG already provides literally about 80 different file formats for customers.
I believe PG needs to remain agnostic towards the customers' choice of machine if PG wants to retain the customer, which means that PG needs to understand how the differing classes of machines actually work, and what their constraints are. Getting authors, titles, and sort orders "correct" IS pretty basic. Not easy, but basic.
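The Sun Tzu case above can be reproduced with a small Python sketch. The two display functions and the sort key are assumptions about how such a reading device might behave, based only on the description in this message; no real device firmware or e-book library API is being quoted.

```python
def display_last_first(authorlastname, authorfirstname):
    """Render 'Authorlastname, authorfirstname'; omit the comma when there is no first name."""
    if authorfirstname:
        return f"{authorlastname}, {authorfirstname}"
    return authorlastname


def display_first_last(authorlastname, authorfirstname):
    """Render 'Authorfirstname authorlastname'; fall back to the last name alone."""
    if authorfirstname:
        return f"{authorfirstname} {authorlastname}"
    return authorlastname


def sort_key(authorlastname, authorfirstname):
    """The device is described as sorting on authorlastname; first name breaks ties here."""
    return (authorlastname.casefold(), (authorfirstname or "").casefold())


# Encoding 1: "Tzu" stuck in the first-name slot.
naive = ("Sun", "Tzu")
# Encoding 2: whole name in the last-name slot, first name left empty (null).
fixed = ("Sun Tzu", None)

for label, (last, first) in (("naive", naive), ("fixed", fixed)):
    print(label, "|", display_last_first(last, first), "|", display_first_last(last, first))
```

With the first encoding the second display comes out as "Tzu Sun"; with the second encoding both displays read "Sun Tzu". Nothing in the code can decide which encoding a given author needs, which is the point being made: that information has to be stored with the name.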

On Mon, Sep 21, 2009 at 1:13 PM, James Adcock <jimad@msn.com> wrote:
Why does this matter? Consider the famous author name Sun Tzu
Let's consider it; why do you think the general audience will search for Sun Tzu and not Tzu, Sun? A system that just gives an unsearchable list of names and doesn't have Tzu, Sun, even if only as an alias, is unusable, correct or not. Not to mention that his name is Sūn Zǐ, or 孫子, or 孙子 or Sunzi, and that doesn't even start to approach the problem of spelling questions. -- Kie ekzistas vivo, ekzistas espero.
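One way to read the alias suggestion is a lookup table that maps every known variant of a name to a single canonical heading. The Python sketch below is an assumption-laden illustration: the `fold` normalization (strip diacritics, ignore commas and case), the table, and the choice of "Sun Tzu" as the canonical form are all invented for the example. It uses only `unicodedata` from the standard library.

```python
import unicodedata


def fold(name):
    """Normalize a name for matching: drop diacritics, commas, case, extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().replace(",", " ").split())


ALIASES = {}


def add_aliases(canonical, variants):
    """File the canonical heading and each variant under its folded form."""
    for variant in (canonical, *variants):
        ALIASES[fold(variant)] = canonical


def lookup(query):
    """Return the canonical heading for a query, or None if it is unknown."""
    return ALIASES.get(fold(query))


# Variants taken from the message above; the canonical choice is an assumption.
add_aliases("Sun Tzu", ["Tzu, Sun", "Sūn Zǐ", "孫子", "孙子", "Sunzi"])

for query in ("tzu sun", "Sun Zi", "孙子", "SUNZI", "Holst"):
    print(query, "->", lookup(query))
```

Folding handles only mechanical variation such as accents and case. Transliteration variants like "Sunzi" still have to be entered by someone who knows they refer to the same author.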

Let's consider it; why do you think the general audience will search for Sun Tzu and not Tzu, Sun? A system that just gives an unsearchable list of names and doesn't have Tzu, Sun, even if only as an alias, is unusable, correct or not.
I assure you that I and about a million other people have, as our primary reading machine, a machine which only provides a library of books sorted and listed by authorlastname, and which does not in fact have a "search" capability on authornamepart. While I agree with you that I would prefer a machine with a stronger search capability, the reason that we put up with this machine is that it is so many light years ahead of other machines that we might want to read on as to make that decision a "no brainer" -- even given the shortcomings of the user shell design. In fact, after "putting up with" computers and having to print out documents for the last 35 years of my life, I now find that I almost never print out anything, and I almost never buy a book or magazine in print anymore. And the machine goes with me everywhere and I read it every night in bed until I fall asleep. So this is by far the most useful, most favorite machine I have ever had in my life. But then again, I do a LOT of reading!
A better counter-question is: why would PG WANT to implement a system that prevents easy and correct implementation of common e-book formats such as EPUB and MOBI?
participants (7)
- Bowerbird@aol.com
- David Starner
- James Adcock
- Jim Adcock
- Jon Richfield
- Keith J. Schultz
- Walter van Holst