From i30817 at gmail.com Sat Oct 30 05:02:47 2010
From: i30817 at gmail.com (Paulo Levi)
Date: Sat, 30 Oct 2010 13:02:47 +0100
Subject: [gutvol-p] Quick question about file formats
Message-ID:

All books in the catalog have a zipped counterpart, right?
I mean specifically, if there is a "text/plain" file on the site, there is also
an equivalent one with
text/plain and application/zip, right?

From marcello at perathoner.de Sat Oct 30 06:01:12 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 30 Oct 2010 15:01:12 +0200
Subject: [gutvol-p] Re: Quick question about file formats
In-Reply-To:
References:
Message-ID: <4CCC1718.80306@perathoner.de>

Paulo Levi wrote:

> All books in the catalog have a zipped counterpart, right?
> I mean specifically, if there is a "text/plain" file on the site, there is also
> an equivalent one with
> text/plain and application/zip, right?

Not necessarily. Some formats, like epub, are already zipped and have no
zip file.

In the RDF/XML catalog you'll find entries for all files we have.

--
Marcello Perathoner
webmaster at gutenberg.org

From i30817 at gmail.com Sat Oct 30 10:00:23 2010
From: i30817 at gmail.com (Paulo Levi)
Date: Sat, 30 Oct 2010 18:00:23 +0100
Subject: [gutvol-p] Re: Quick question about file formats
In-Reply-To: <4CCC1718.80306@perathoner.de>
References: <4CCC1718.80306@perathoner.de>
Message-ID:

Another quick question :)
Are the rules for creating a download url from the "file" tag in the rdf
catalog consistent?
let's say i have an html and zipped file, of #etext15560, in this case the
download link is
&f;dirs/1/5/5/6/15560/15560-h.zip

The way to recreate from the arguments is obvious (if lengthy), but i seem
to recall it changed before - is there an algorithm that always gives a
"valid" link, or should i just give up compressing this and include the link
in my db?

From i30817 at gmail.com Sat Oct 30 10:03:32 2010
From: i30817 at gmail.com (Paulo Levi)
Date: Sat, 30 Oct 2010 18:03:32 +0100
Subject: [gutvol-p] Re: Quick question about file formats
In-Reply-To:
References: <4CCC1718.80306@perathoner.de>
Message-ID:

I don't care about epub files and such, at least, not yet.

On Sat, Oct 30, 2010 at 6:00 PM, Paulo Levi wrote:

> Another quick question :)
> Are the rules for creating a download url from the "file" tag in the rdf
> catalog consistent?
>
> let's say i have an html and zipped file, of #etext15560, in this case the
> download link is
> &f;dirs/1/5/5/6/15560/15560-h.zip
>
> The way to recreate from the arguments is obvious (if lengthy), but i seem
> to recall it changed before - is there an algorithm that always gives a
> "valid" link, or should i just give up compressing this and include the link
> in my db?
>

From marcello at perathoner.de Sat Oct 30 11:56:50 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sat, 30 Oct 2010 20:56:50 +0200
Subject: [gutvol-p] Re: Quick question about file formats
In-Reply-To:
References: <4CCC1718.80306@perathoner.de>
Message-ID: <4CCC6A72.4060500@perathoner.de>

Paulo Levi wrote:

> Another quick question :)
> Are the rules for creating a download url from the "file" tag in the rdf catalog
> consistent?
>
> let's say i have an html and zipped file, of #etext15560, in this case the
> download link is
> &f;dirs/1/5/5/6/15560/15560-h.zip
>
> The way to recreate from the arguments is obvious (if lengthy), but i seem to
> recall it changed before - is there an algorithm that always gives a "valid"
> link, or should i just give up compressing this and include the link in my db?

The "algorithm" is the expansion of XML entities, which any common
run-of-the-mill xml parser will do for you.

I think we had this discussion already. This is an XML file and should
be processed thru an XML parser. If you don't, every little cosmetic
change to the file structure will break your program. You have been warned.

--
Marcello Perathoner
webmaster at gutenberg.org

From Bowerbird at aol.com Sat Oct 30 12:26:43 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Sat, 30 Oct 2010 15:26:43 -0400 (EDT)
Subject: [gutvol-p] Re: Quick question about file formats
Message-ID: <9368e.68f11d08.39fdcb73@aol.com>

paulo's questions arise out of a more complex context,
which has a big bunch of people scratching their heads.

here's the "discussion" of it at distributed proofreaders:
>   http://www.pgdp.net/phpBB2/viewtopic.php?p=706358#706358

i put "discussion" in quotes because -- as usual, eh? --
marcello simply ends up calling everyone else an idiot.
even lucy24, one of the smartest workers over at d.p.

in a nutshell, the "more complex context" here is that
marcello has been making "executive decisions" lately.

ostensibly aimed at changing the p.g. site for "mobil",
these decisions are impacting p.g. policy in a big way.

marcello isn't even trying to hide that fact these days.

for instance, over in that forum thread at d.p., he says
that "nobody understands encodings" which is why he
"scrapped them for good"... evidently, everything gets
served up from p.g. these days as utf8, because that is
what marcello wants.
so the ascii and latin1 files are described as utf8,
which might be technically so, but is also misleading.

perhaps the powers that be at p.g. approved the shift,
but marcello sure doesn't make it sound like anything
other than that _he_ "decided" it...

marcello goes on:
>   As for why PG still requires ASCII and other
>   extinct encodings, ask the WWers, not me.
>   I've been fighting that requirement for years.

so i'd guess that once he consolidates all his power plays,
marcello will get around to his next agenda and gut this
"requirement" which he has been "fighting" all these years.

after all, since you no longer call the ascii file an ascii file,
nobody would miss it if it were to quietly just not be there.

i just wonder if that'll happen _after_ michael hart is dead,
or whether it will be the thing that puts him in his grave...

either way, p.g. will never be the same...

-bowerbird

From i30817 at gmail.com Sat Oct 30 15:06:32 2010
From: i30817 at gmail.com (Paulo Levi)
Date: Sat, 30 Oct 2010 23:06:32 +0100
Subject: [gutvol-p] Re: Quick question about file formats
In-Reply-To: <20101030202230.GZ76507@styx.org>
References: <4CCC1718.80306@perathoner.de> <4CCC6A72.4060500@perathoner.de>
	<20101030202230.GZ76507@styx.org>
Message-ID:

Thanks for the answers, i guess i will save it.
The reason i'm using an xml parser is that the libraries for rdf are huge in
java, about 30mb, in contrast with stax or sax, which are in the jdk. Also i'm
doing a client program, even if the only client is me ;)
From marcello at perathoner.de Sat Oct 30 15:39:28 2010
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sun, 31 Oct 2010 00:39:28 +0200
Subject: [gutvol-p] Re: Quick question about file formats
In-Reply-To: <20101030202230.GZ76507@styx.org>
References: <4CCC1718.80306@perathoner.de> <4CCC6A72.4060500@perathoner.de>
	<20101030202230.GZ76507@styx.org>
Message-ID: <4CCC9EA0.4040403@perathoner.de>

William Waites wrote:

> On Sat, Oct 30, 2010 at 08:56:50PM +0200, Marcello Perathoner wrote:
>> Paulo Levi wrote:
>>> Another quick question :)
>>> Are the rules for creating a download url from the "file" tag in the rdf
>>> catalog consistent?
>> The "algorithm" is the expansion of XML entities, which any common
>> run-of-the-mill xml parser will do for you.
>
> RDF != XML
>
>> I think we had this discussion already. This is an XML file and should
>> be processed thru an XML parser. If you don't, every little cosmetic
>> change to the file structure will break your program. You have been warned.
>
> If you're trying to interpret RDF data, it's better
> to use a library, they exist for just about all
> programming languages. If you try to interpret it
> as XML you are asking for trouble.
>
> It is too bad that the RDF you get from here,
>
>     http://www.gutenberg.org/ebooks/12345.rdf
>
> is different from the catalogue.

This is intended and documented.

http://www.gutenberg.org/wiki/Gutenberg:Feeds

The old catalog.rdf is a legacy format we keep for compatibility.

>
> This is because you have
>
>     xml:base="http://www.gutenberg.org/feeds/catalog.rdf
>
> and then, e.g.
>
>     rdf:ID="etext12345"
>
> This amounts to giving the URI
>
>     http://www.gutenberg.org/feeds/catalog.rdfetext12345
>
> to that book which is not what you intend.

Wrong.
This gives

    http://www.gutenberg.org/feeds/catalog.rdf#etext12345

"The rdf:ID attribute on a node element (not property element, that has
another meaning) can be used instead of rdf:about and gives a relative
RDF URI reference equivalent to # concatenated with the rdf:ID attribute
value."

>
> If on the other hand you had used
>
>     rdf:about="http://www.gutenberg.org/ebooks/12345"
>
> the data would be the same (which I guess is
> what you intend).
>
> where lower down you talk about formats,
> you use
>
>     rdf:resource="#etext12345"
>
> which refers to
>
>     http://www.gutenberg.org/feeds/catalog.rdf#etext12345
>
> which if it weren't for the error with rdf:ID would
> at least be consistent within the catalogue.
>
> But supposing this is fixed, I still have two
> URIs for one text:
>
>     http://www.gutenberg.org/feeds/catalog.rdf#etext12345
>     http://www.gutenberg.org/ebooks/12345
>
> and you've given no way of knowing that they are
> in fact the same.

Because you are not supposed to mix the old catalog.rdf with the new
catalog.rdf which will be put online when I get to finish it.

--
Marcello Perathoner
webmaster at gutenberg.org

From william.waites at okfn.org Sat Oct 30 16:35:40 2010
From: william.waites at okfn.org (William Waites)
Date: Sun, 31 Oct 2010 01:35:40 +0200
Subject: [gutvol-p] Re: Quick question about file formats
In-Reply-To: <4CCC9EA0.4040403@perathoner.de>
References: <4CCC1718.80306@perathoner.de> <4CCC6A72.4060500@perathoner.de>
	<20101030202230.GZ76507@styx.org> <4CCC9EA0.4040403@perathoner.de>
Message-ID: <20101030233540.GA76507@styx.org>

On Sun, Oct 31, 2010 at 12:39:28AM +0200, Marcello Perathoner wrote:
>
> Wrong. This gives
>
> http://www.gutenberg.org/feeds/catalog.rdf#etext12345
>
> "The rdf:ID attribute on a node element (not property element, that has
> another meaning) can be used instead of rdf:about and gives a relative
> RDF URI reference equivalent to # concatenated with the rdf:ID attribute
> value."

Quite right.
I was confusing nodeID with ID (now will a user
trying to use an XML parser also get this right? Kind of
proves my point: RDF != XML, don't look at the XML unless
you have a good reason to be writing a parser).

In any event I stand corrected on this.

-w

From william.waites at okfn.org Sat Oct 30 16:39:16 2010
From: william.waites at okfn.org (William Waites)
Date: Sun, 31 Oct 2010 01:39:16 +0200
Subject: [gutvol-p] Re: Quick question about file formats
In-Reply-To:
References: <4CCC1718.80306@perathoner.de> <4CCC6A72.4060500@perathoner.de>
	<20101030202230.GZ76507@styx.org>
Message-ID: <20101030233916.GB76507@styx.org>

On Sat, Oct 30, 2010 at 11:06:32PM +0100, Paulo Levi wrote:
> Thanks for the answers, i guess i will save it.
> The reason i'm using a xml parser is that the libraries for rdf are huge in
> java, 30mb like, in contrast with stax or sax that is in jdk. Also i'm doing
> a client program, even if the only client is me ;)

Paulo, maybe this will help. I've taken the catalogue
and put it in the laboratory triple store
(http://river.styx.org/sparql) for you. You can try a
query like,

    PREFIX rdf:
    PREFIX dc:
    PREFIX dcterms:
    PREFIX gut:
    SELECT DISTINCT ?download, ?mimetype
    FROM
    WHERE {
        ?download dcterms:isFormatOf gut:etext12345 .
        ?download dc:format ?format .
        ?format rdf:value ?mimetype
    }

and get output in any number of formats like JSON
and such. Canned version of this query:

    http://bit.ly/cUIPqm

Let me know if this is helpful to you.

Cheers,
-w
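
The two technical points settled in this thread can be illustrated with a
minimal Python sketch. It is not the catalog's real structure: the element
names (catalog, etext, file), the use of rdf:about on the file element, the
value given to the "f" entity (http://example.org/) and the function names
are all placeholders assumed for illustration. It shows only that a stock
XML parser expands the &f; entity on its own, and that an rdf:ID resolves
against xml:base as base + "#" + ID, as in the passage Marcello quotes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical miniature catalog. The entity value and the element names
# are assumptions for this sketch, not taken from the real catalog.rdf.
SAMPLE = """<!DOCTYPE catalog [
  <!ENTITY f "http://example.org/">
]>
<catalog xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xml:base="http://www.gutenberg.org/feeds/catalog.rdf">
  <etext rdf:ID="etext15560"/>
  <file rdf:about="&f;dirs/1/5/5/6/15560/15560-h.zip"/>
</catalog>"""


def file_urls(xml_text):
    """Return rdf:about of every <file>; the parser has already expanded &f;."""
    root = ET.fromstring(xml_text)
    return [el.get(f"{{{RDF_NS}}}about") for el in root.iter("file")]


def node_uris(xml_text):
    """Resolve each rdf:ID against xml:base as a '#fragment' reference."""
    root = ET.fromstring(xml_text)
    base = root.get(f"{{{XML_NS}}}base")
    return [
        urljoin(base, "#" + el.get(f"{{{RDF_NS}}}ID"))
        for el in root.iter("etext")
    ]


if __name__ == "__main__":
    print(file_urls(SAMPLE))
    print(node_uris(SAMPLE))
```

With this approach no path-building rule needs to be stored or guessed: the
download link is whatever the parser returns after entity expansion, so a
change in directory layout on the server side would not require a code change.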