re: [gutvol-d] Categorizing PG content

karen said:
Suggestion: have a competition to design an open-source cataloging system for e-books, where there are no physical constraints on "shelving." Publicize it in library schools. Major ego-boo for the teacher/graduate student whose scheme is accepted, free design for PG.
um, i don't know that i'm seeing much quality thinking coming out of the library schools, i am chagrined to say.
besides, it's not so much the "design" that is so difficult, but rather the _implementation_, and the grunt work of assigning e-texts within the system. so it'd be far better to have the competition at the _programming_ level...
and again, much of the design work has already been done, when this thread had an earlier incarnation on this listserv. if no one is willing to check the archives, what's the point?...
finally, i'm not sure that y'all understand the major need here. and i'm quite certain that library-school students will miss it.
answer this question: why should we categorize the e-texts?
i'm serious. formulate an answer. i'll wait... got one? ok, great...
if your response runs along the lines of "so end-users can find the book they want, and download it", you're on the wrong path.
that's the function catalogs used to serve, in the dead-tree world. after all, since a person had made a trip to a library to get a book, and would have to be making another trip to bring it back, it made a lot of sense for that person to find a book that they would enjoy. in that scenario, the catalog helped avoid the cost of a wrong choice. the physical nature of bound pages creates a situation of obligation.
but in our new era of high-bandwidth and terabyte hard-drives, it's silly for a person to spend even mere seconds trying to decide _whether_or_not_ to download a book. it's _far_ more convenient to download vast portions of the library, since they can have their computer do it automatically while they are partying, or sleeping...
even the dial-up people can request the d.v.d., for free, and have the entire p.g. library sitting on their hard-disk in a week or so...
not only is it not wise to make people spend any time "choosing", it's at odds with the important concept of _unlimited_distribution_.
and that's why the library-school people don't understand this. because unlike them, we _want_ people to take a whole bunch! it's not just that there's "no shortage of shelf-space" with e-books, it's that we have an endless source of production. so take 'em all!
we are all still trapped, to a large degree, by our history of scarcity, so it's difficult for us to realize how deeply it pervades our thinking. (especially since we all live in the real world too, where scarcity still is a hard fact of life.) but this is one place where we can shed that...
these implications of unlimited-production-and-distribution turn our thinking on its head. instead of helping users choose what to _pick_ in the library, we have to help 'em choose what to _discard_. in many ways, this is a much easier task.
human genome project files? ya, you probably won't want 'em. e-texts in a language that you cannot read? you can skip those. text-to-speech files? videos? magazines? maybe yes, maybe no.
they start with 20,000 "possibles", weeding 'em out to their taste, thanks to our handy-dandy program, which then auto-downloads the ones that are left, in the background, with zero input needed. at that point, the cost of selecting a book is double-clicking it and starting to read it. and if it doesn't appeal to you, just stop reading and go on to the next one. you don't have much need for a catalog.
oh sure, it might still be kind of handy to be guided to e-texts, so some means of categorizing an e-text as being "similar to" others would be nice. but that's how we need to _approach_ this project, from the get-go, and not from our implicit notions about "a catalog", because those are outdated and irrelevant to the task now at hand. you're barking up the wrong tree if you don't rearrange that thinking.
but anyway, as i said, a system of categorization would be handy, and i'll have some work to show in that regard in a separate post. i believe it's important to start out with the philosophical point...
-bowerbird

Bowerbird wrote:
karen said:
Suggestion: have a competition to design an open-source cataloging system for e-books, where there are no physical constraints on "shelving." Publicize it in library schools. Major ego-boo for the teacher/graduate student whose scheme is accepted, free design for PG.
answer this question: why should we categorize the e-texts?
Actually, I think what we'd like to do is to "categorize" the texts using one or more categorical systems, and then embed that information right into the book (which is a digital object).
This is essentially adding metadata, or what the Yahoo folk call "microformats" (which is a terrible name), right into the object. This is done now in many kinds of digital objects, such as audio, video and some ebook formats.
This way no external categorization needs to be applied -- it is all recorded internally, meaning each book can become autonomous of the others since it carries its own metadata. Particular "libraries" can build a lookup table of their choosing by simply sniffing through all the texts they hold. It doesn't really matter where the text files are placed or organized in a file structure. Multiple categorization systems can be supported in parallel provided the texts carry the requisite information.
In XML, there are a number of ways this info could be embedded. In plain text documents, some sort of machine recognizable "plain text" syntax has to be developed -- it'd be quite simple, actually. I think those who advocate plain text should develop a "plain text" metadata system (such as one based on Dublin Core) to insert somewhere in the file.
Jon Noring

Jon: One point to take into account is that the upcoming wiki categorizing will be flexible, never "finished", changing as needed. Embedding this in the files would take a much larger amount of effort, and remove much of the possibility for collaborative effort.
Andrew
On Thu, 13 Jul 2006, Jon Noring wrote:
Actually, I think what we'd like to do is to "categorize" the texts using one or more categorical systems, and then embed that information right into the book (which is a digital object).
This is essentially adding metadata, or what the Yahoo folk call "microformats" (which is a terrible name), right into the object. This is done now in many kinds of digital objects, such as audio, video and some ebook formats.
This way no external categorization needs to be applied -- it is all recorded internally, meaning each book can become autonomous of the others since it carries its own metadata. Particular "libraries" can build a lookup table of their choosing by simply sniffing through all the texts they hold. It doesn't really matter where the text files are placed or organized in a file structure. Multiple categorization systems can be supported in parallel provided the texts carry the requisite information.

Jon Noring wrote:
Actually, I think what we'd like to do is to "categorize" the texts
using one or more categorical systems, and then embed that information right into the book (which is a digital object).
Instead of embedding it into the e-book, I think it would work better as a separate file. If you embed it into the ebooks, you will need to put it in all the versions (html, text, pdf, tei, etc.), and keep ALL of them up-to-date. Also, you would have to search the entire text of the book to find all the meta-data. As a separate file, it would also be easier to download just that when you want to be able to do "local" searches, without needing to download the full text of every e-book. Also, if you want to make it "user" editable, however you want to define "user", it would be better as a separate file, so that the original files don't constantly get flagged as modified. Also, make it easy to join the meta-files into a single file (cat *.meta > all.meta would be ideal) so that large numbers of books could be munged at once, or catalogues of specific groupings could easily be created (e.g. science-fiction/German). This would just require having a header in each file specifying which book it applies to. The format could be text, or XML, or even tei. If you use an XML-based version, a text version could be easily created.
This is essentially adding metadata, or what the Yahoo folk call "microformats" (which is a terrible name), right into the object. This is done now in many kinds of digital objects, such as audio, video and some ebook formats.
Instead of just category, you could store all sorts of information in the "meta" file: author's name, copyright date(s), categories (science fiction, horticulture, cook-book), available formats (text, html, tei, pdf, etc.), language(s), links to web sites, link to author meta file, and any other information you would like to find in a card catalog.
This way no external categorization needs to be applied -- it is all recorded internally, meaning each book can become autonomous of the others since it carries its own metadata. Particular "libraries" can build a lookup table of their choosing by simply sniffing through all the texts they hold. It doesn't really matter where the text files are placed or organized in a file structure. Multiple categorization systems can be supported in parallel provided the texts carry the requisite information.
I think that it could become a problem if the meta-data in the different formats were found to be different. Which one has the most correct information, the text version or the html one?
In XML, there are a number of ways this info could be embedded. In plain text documents, some sort of machine recognizable "plain text" syntax has to be developed -- it'd be quite simple, actually. I think those who advocate plain text should develop a "plain text" metadata system (such as one based on Dublin Core) to insert somewhere in the file.
If you wanted to search for all Polish math books, how would you write the query program so that you would get all of them, without duplicates because of the different formats, and without wasting a lot of CPU cycles? Not all texts have a .txt version.
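A minimal sketch of the separate-meta-file idea, in Python. The thread does not define a file format, so everything here is an assumption for illustration: the "book:" header line, the field names, and the three sample records are all invented. Because each record starts with its own header, files joined with cat *.meta > all.meta can still be split apart, and because formats are fields of one record per book, a query returns each book once however many formats it has.

```python
from collections import defaultdict

# Hypothetical record layout (not defined anywhere in the thread):
# every record starts with a "book:" header line, followed by
# "field: value" lines. Repeated fields (e.g. format) are allowed.
SAMPLE_ALL_META = """\
book: 1001
title: Example geometry text
language: pl
category: mathematics
format: text
format: html
book: 1002
title: Example algebra text
language: de
category: mathematics
format: pdf
book: 1003
title: Example short stories
language: pl
category: fiction
format: text
"""


def parse_meta(text):
    """Split concatenated meta records into {book_id: {field: [values]}}."""
    records = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "book":
            # A "book:" header starts (or resumes) the record for that id.
            current = records.setdefault(value, defaultdict(list))
        elif current is not None:
            current[key].append(value)
    return records


def find_books(records, **wanted):
    """Return sorted ids of books whose fields contain every wanted value."""
    return sorted(
        book_id
        for book_id, fields in records.items()
        if all(value in fields.get(key, []) for key, value in wanted.items())
    )


records = parse_meta(SAMPLE_ALL_META)
polish_math = find_books(records, language="pl", category="mathematics")
print(polish_math)
```

The query only reads the small meta records, never the book texts, which is the point of keeping the metadata separate.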

Jon Noring <jon@noring.name> writes:
In plain text documents, some sort of machine recognizable "plain text" syntax has to be developed -- it'd be quite simple, actually. I think those who advocate plain text should develop a "plain text" metadata system (such as one based on Dublin Core) to insert somewhere in the file.
I would suggest using YAML -- there are a number of applications for processing it, and it can be mapped to dublin core elements easily. The following is a complete YAML Dublin Core document:
---
- title: <scalar>
- creator: <scalar>
- subject: <scalar>
- description: <scalar>
- publisher: <scalar>
- contributor: <scalar>
- date: <iso 8601>
- type: <uri>
- format: <mime-type>
- identifier: <scalar>
- source: <scalar>
- language: <rfc-3066_iso639>
- relation: <scalar>
- coverage: <scalar>
- rights: <scalar>
This can easily be parsed, it's human readable and maps well to html/xml structures.
b/
--
Brad Collins <brad@chenla.org>, Banqwao, Thailand
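To illustrate the "can easily be parsed" claim, here is a small Python sketch that reads a record laid out as a flat list of "- element: value" lines. It is deliberately not a YAML parser: it handles only that one flat shape so the example needs no third-party library, and a real YAML library would be the more sensible choice in practice. The function name and the sample values are invented; the fifteen element names are the ones listed in the message.

```python
# The fifteen Dublin Core element names listed in the message above.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Invented sample values, for illustration only.
SAMPLE_RECORD = """\
---
- title: Example Title
- creator: Example Author
- language: en
- format: text/plain
- date: 2006-07-14
"""


def read_flat_dublin_core(text):
    """Read '- element: value' lines into {element: [values]}.

    Hypothetical helper: supports only the flat list layout, not YAML
    in general. Unknown elements and other line shapes are rejected.
    """
    record = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "---":
            continue  # skip blank lines and the document marker
        if not line.startswith("- "):
            raise ValueError(f"unsupported line: {raw!r}")
        # Split at the first colon only, so values may contain colons.
        key, sep, value = line[2:].partition(":")
        key = key.strip()
        if not sep or key not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"not a Dublin Core element line: {raw!r}")
        record.setdefault(key, []).append(value.strip())
    return record


dc = read_flat_dublin_core(SAMPLE_RECORD)
print(dc["title"], dc["language"])
```

Values are kept as lists because Dublin Core elements may repeat, for example several creator lines for one book.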

On 7/14/06, Brad Collins <brad@chenla.org> wrote:
I would suggest using YAML -- there are a number of applications for processing it, and it can be mapped to dublin core elements easily.
To my untutored eye, it looks good. I wonder if we could make a start at adding that info to all PG texts and ALSO develop an extra-textual cataloguing system that might contain more detail.
My dream library would also contain info on how often a text had been downloaded and a rating/recommendation system like the various book and movie rating systems out there. You know, "Readers who liked "Campfire Girls Go Bananas" also liked "Campfire Girls Make Whoopee".
--
Karen Lofstrom

On Fri, Jul 14, 2006 at 04:24:39PM -1000, Karen Lofstrom wrote:
On 7/14/06, Brad Collins <brad@chenla.org> wrote:
To my untutored eye, it looks good. I wonder if we could make a start at adding that info to all PG texts and ALSO develop an extra-textual cataloguing system that might contain more detail.
I don't think this is a bad idea at all.
My dream library would also contain info on how often a text had been downloaded and a rating/recommendation system like the various book and movie rating systems out there. You know, "Readers who liked "Campfire Girls Go Bananas" also liked "Campfire Girls Make Whoopee".
This is one of the reasons I wanted to go with a tagging/folksonomy model. I've already begun reading some of the available research on how to create collaborative filtering engines using folksonomies as seeds.

Before everybody goes all warm and fuzzy about his/her pet categorization scheme, let me remind you that the discussion started about how to use the wiki for categorizing.
A wiki has no built-in authority control. If we want to end up with useful categories we need to develop a restricted vocabulary. The good news is: if we use one page per category, the vocabulary will build itself. Pages can easily be split or merged whenever the vocabulary changes.
Also, it is very easy to harvest sites that already have categorized PG books and convert their data into a wiki list.
The easiest way to start is to:
1. Create an account
2. Create a page containing a list of books
3. Add the page to the "Bookshelf" category
like here: http://www.gutenberg.org/wiki/Detective_Fiction
Remember that this is a wiki. Don't expect things you edit to stay edited. If you want to express a personal opinion use a subpage of your user page like this: http://www.gutenberg.org/wiki/User:Marcello/Marcello's_Tops_and_Flops
--
Marcello Perathoner webmaster@gutenberg.org

On 7/16/06, Marcello Perathoner <marcello@perathoner.de> wrote:
The easiest way to start is to:
1. Create an account
2. Create a page containing a list of books
3. Add the page to the "Bookshelf" category
like here:
Is this ready to be announced? It doesn't seem that anyone has noticed the wiki yet.
One thing missing: Recent Changes (http://www.gutenberg.org/wiki/Special:Recentchanges) isn't linked from the sidebar. It's a very useful tool for seeing what new cool stuff is going on and what vandalism is being created.

Hopefully it will get more contributions when we go live.
Sincerely
Aaron Cannon
--
Skype: cannona MSN/Windows Messenger: cannona@hotmail.com (don't send email to the hotmail address.)
----- Original Message -----
From: "David Starner" <prosfilaes@gmail.com>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Sent: Sunday, August 06, 2006 2:22 PM
Subject: Re: [gutvol-d] Categorizing in Wiki
On 7/16/06, Marcello Perathoner <marcello@perathoner.de> wrote:
The easiest way to start is to:
1. Create an account
2. Create a page containing a list of books
3. Add the page to the "Bookshelf" category
like here:
Is this ready to be announced? It doesn't seem that anyone has noticed the wiki yet.
One thing missing: Recent Changes (http://www.gutenberg.org/wiki/Special:Recentchanges) isn't linked from the sidebar. It's a very useful tool for seeing what new cool stuff is going on and what vandalism is being created.

Hi bb. I'm replying to some statements from a few different messages of yours here.
extensive discussions on this topic were already held here on gutvol-d. why go through it all again? and again and again and again?
You've hinted a few times that I'm just starting from scratch here, ignoring what has already been done. This is not true. I've read with interest the previous discussion you mentioned. However, the ideas all too often seem to be of the variety that would only be practicable if we had a couple of highly-trained, professional librarians who decided to donate their full-time services to PG. (I can dream, can't I?) So, I've actually tried doing something productive. I've spent countless hours editing parts of the PG online catalog (focusing mostly on author headings, having given up on trying to make title statements that would be acceptable to the Library Sciences community).
besides, it's not so much the "design" that is so difficult, but rather the _implementation_, and the grunt work of assigning e-texts within the system. so it'd be far better to have the competition at the _programming_ level...
In this you are very correct. "The grunt work" is a very real factor here. What I am hoping is to eventually get these wiki pages working in a way that will _invite_ people to contribute, making it more a collaborative effort. If I just go ahead and do as much as I can myself, there will really be no advantage over what I could have done just in editing the PG online catalog.
answer this question: why should we categorize the e-texts?
if your response runs along the lines of "so end-users can find the book they want, and download it", you're on the wrong path.
You then argue that:
in our new era of high-bandwidth and terabyte hard-drives, it's silly for a person to spend even mere seconds trying to decide _whether_or_not_ to download a book. it's _far_ more convenient to download vast portions of the library, since they can have their computer do it automatically while they are partying, or sleeping...
So, let's assume that someone is interested in the Science Fiction books that we've posted a decent number of lately. Should this person have to download a few hundred books and then do his own time-consuming search of these books now on his own system, trying to identify which ones might be science fiction? I think the need for something like categorizing is apparent, because I've seen a decent number of independent web sites which present a subset of PG books relating to a certain subject. Ones that spring to mind are collections of Australiana, Canadiana, Esperanto-related topics (not necessarily _in_ Esperanto), and books related to the Philippines. Also, not long ago, I had someone ask if there was some way he could look through just 18th century books. I would argue that having general categories was one reason that Blackmask was so popular.
these implications of unlimited-production-and-distribution turn our thinking on its head. instead of helping users choose what to _pick_ in the library, we have to help 'em choose what to _discard_.
I must disagree here. People, by nature, prefer to have a smaller number of choices. (How many people will look at an extensive menu in a restaurant, be intimidated, and just pick something off the small "feature" list? Having worked in such a place, I can tell you: lots of people.) Would you rather have a selection of items in one particular category that may be of interest, or have a massive list where you have to go "Nope, don't want that one. Nope, don't want that one" two hundred times?
from the get-go, and not from our implicit notions about "a catalog", because those are outdated and irrelevant to the task now at hand.
Careful now. The traditional library catalog is still an extremely useful resource (for those who know how to use it). I might be susceptible to the argument, however, that its limits get stretched uncomfortably trying to describe digital material.
Andrew

Sorry for the length, everyone, but I wanted to try and cover in words what I was unable to cover in production software.
On Thu, Jul 13, 2006 at 05:42:29PM -0400, Bowerbird@aol.com wrote: ...
finally, i'm not sure that y'all understand the major need here. and i'm quite certain that library-school students will miss it.
answer this question: why should we categorize the e-texts?
if your response runs along the lines of "so end-users can find the book they want, and download it", you're on the wrong path.
that's the function catalogs used to serve, in the dead-tree world. ... but in our new era of high-bandwidth and terabyte hard-drives, it's silly for a person to spend even mere seconds trying to decide _whether_or_not_ to download a book. it's _far_ more convenient to download vast portions of the library, since they can have their computer do it automatically while they are partying, or sleeping...
I disagree. I have a 100Mb/s municipal fiber connection and almost 2 terabytes of disk space available, and "download[ing] vast portions of the library" is not an option for me. I don't find it difficult to imagine that if I have a hard time accepting this answer, there are going to be others who do so as well, with far fewer resources at their command.
even the dial-up people can request the d.v.d., for free, and have the entire p.g. library sitting on their hard-disk in a week or so...
I also don't agree with the implied assertion here that having the full (or even "vast portions of the") library means that users don't want help identifying and locating content within that collection. Of course, this means that we'll want to help people who download the library get the catalog data that matches their portion of the library!
not only is it not wise to make people spend any time "choosing", it's at odds with the important concept of _unlimited_distribution_.
Having a catalog does not equate to making people use it. It's a tool for those who want to make use of it. That said, let's make sure that whatever tool(s) we come up with fit as many of the perceived needs as we possibly can!
You clearly have different ideas of the use of a catalog than do I. As you've already enumerated some of the points of *my* use, perhaps you could elaborate on your ideas? (On the other hand, if you already did this, ignore this request. I generally avoid topics once you start weighing in on them, so I may have missed the applicable portions from the last time this topic came up.)
---
So, on to my proposal. I had hoped to actually be able to provide a tool demonstrating it, but my day job interfered too much this week to allow me to realize that hope. So instead, let me see if I can lay out the concept.
It's based on the tagging system known as the "Debian Package Browser" [1]. Some important parts of the idea that might be missed initially:
* Every book gets tagged initially with a placeholder value
* Wherever we can identify existing valuable tags, they are added to the initial load. Some examples of tags I'd want include: year published in PG; Author/Creator; Language; LoC Class; Copyright Status (sounding familiar to anyone?)
* Tags need to be nestable. This is something the Debian system is not able to support, but I think it's very important. One example Bowerbird already pointed out is the Amazon.com categorization scheme.
* The default behaviour of the tagging system should be marking which of the existing tags are best applied to this book, but it also needs to be flexible enough to add new tags (and hierarchies thereof). Setting the default behaviour this way is one way of preventing the "del.icio.us syndrome" found in many folksonomies, where there are as many different ways of tagging a piece of content as there are users of the system.
* It should be easy, when viewing a particular ebook, to do any of the following actions: view tags already on this book; see a list of "suggested tags", based on a weighted list of tags attached to content that has other tags in common with the current content; view other content tagged in common; add / remove tags.
* It needs to be easy to see all content with a particular tag or tagset. I'm envisioning something akin to the Flamenco [2] system here.
I envision a lot of things coming out of this effort, including an easier way for people to suggest content for the "Best Of" DVDs so that Greg doesn't have to do so much of the leg-work himself. As people come across suggestions, they tag them, then Greg can just pull a list of ebooks with that tag.
I've done some work on a prototype, but as I said, the real world invaded and sapped my time. Then again, I know there are many others on this list that are talented software developers, so perhaps one of you will beat me to it...or propose an even better system.
[1]: http://debian.vitavonni.de/packagebrowser/
[2]: http://flamenco.berkeley.edu/
If you'd like to see Flamenco at work, but don't have the resources to set it up yourself, drop me a line off-list and I'll provide you with a URL to one I've set up.
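Two of the points in the proposal above, nestable tags and "suggested tags" weighted by tags in common, can be sketched briefly in Python. This is not the prototype mentioned in the message, which is not shown in the thread. The "/" nesting convention, the function names, the scoring rule (each candidate tag scores the number of tags its book shares with the current one) and the sample data are all assumptions made for illustration.

```python
from collections import Counter

# Invented sample data: ebook id -> set of tags. Nesting is expressed
# with "/" (an assumed convention, not one taken from the thread).
TAGS = {
    "book-a": {"language/en", "genre/fiction/detective"},
    "book-b": {"language/en", "genre/fiction/detective", "era/19th-century"},
    "book-c": {"language/en", "genre/fiction/science-fiction"},
    "book-d": {"language/de", "genre/nonfiction/mathematics"},
}


def books_with_tag(tags, tag):
    """Books carrying the tag itself or any tag nested beneath it."""
    prefix = tag + "/"
    return sorted(
        book
        for book, book_tags in tags.items()
        if any(t == tag or t.startswith(prefix) for t in book_tags)
    )


def suggest_tags(tags, book):
    """Rank tags the book lacks, weighted by overlap with other books."""
    own = tags[book]
    scores = Counter()
    for other, other_tags in tags.items():
        if other == book:
            continue
        shared = len(own & other_tags)
        if shared == 0:
            continue  # nothing in common, so no evidence for a suggestion
        for candidate in other_tags - own:
            scores[candidate] += shared
    # Highest score first; ties broken alphabetically for stable output.
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


fiction = books_with_tag(TAGS, "genre/fiction")
suggestions = suggest_tags(TAGS, "book-a")
print(fiction)
print(suggestions)
```

Querying a parent tag such as genre/fiction picks up everything filed under its children, which is what nesting buys over a flat tag list.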
participants (10)
- Aaron Cannon
- Andrew Sly
- Bowerbird@aol.com
- Brad Collins
- David Starner
- joey
- Jon Noring
- Karen Lofstrom
- Kevin Handy
- Marcello Perathoner