From i30817 at gmail.com  Wed Jul 15 17:25:53 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 16 Jul 2009 01:25:53 +0100
Subject: [gutvol-p] Programmatic fetching books from Gutenberg
Message-ID: <212322090907151725x194f5361j2741f331dbf775f1@mail.gmail.com>

I made an ebook reader (here)
http://code.google.com/p/bookjar/downloads/list
and I'd like to search and download Gutenberg books. I already have a
searcher prototype using LuceneSail, a library that uses Lucene to index rdf
documents, indexing only what I want from the catalog.rdf.zip.

Now I'd like to know how, from the url inside the catalog, I can fetch the
book itself, and what the variants for the formats are.

An example query result:
author: Shakespeare, William, 1564-1616
url: http://www.gutenberg.org/feeds/catalog.rdf#etext1802
title: King Henry VIII

So I'd like to know how, from the etext1802 number, I can get a working url
to download the book, and how to construct variants for each format.

Thank you in advance.

From marcello at perathoner.de  Thu Jul 16 00:05:51 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 16 Jul 2009 09:05:51 +0200
Subject: [gutvol-p] Re: Programmatic fetching books from Gutenberg
In-Reply-To: <212322090907151725x194f5361j2741f331dbf775f1@mail.gmail.com>
References: <212322090907151725x194f5361j2741f331dbf775f1@mail.gmail.com>
Message-ID: <4A5ED14F.3050704@perathoner.de>

Paulo Levi wrote:
> I made an ebook reader (here)
> http://code.google.com/p/bookjar/downloads/list
> and I'd like to search and download Gutenberg books. I already have a
> searcher prototype using LuceneSail, a library that uses Lucene to index rdf
> documents, indexing only what I want from the catalog.rdf.zip.
>
> Now I'd like to know how, from the url inside the catalog, I can fetch the
> book itself, and what the variants for the formats are.
>
> An example query result:
> author: Shakespeare, William, 1564-1616
> url: http://www.gutenberg.org/feeds/catalog.rdf#etext1802
> title: King Henry VIII
>
> So I'd like to know how, from the etext1802 number, I can get a working url
> to download the book, and how to construct variants for each format.

In the second half of the rdf file you will find records for all the files
in different formats we offer for an ebook. Use the #etext1802 as link
between book record and file records.

From i30817 at gmail.com  Thu Jul 16 10:55:51 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 16 Jul 2009 18:55:51 +0100
Subject: [gutvol-p] Re: Programmatic fetching books from Gutenberg
In-Reply-To: <4A5ED14F.3050704@perathoner.de>
References: <212322090907151725x194f5361j2741f331dbf775f1@mail.gmail.com>
 <4A5ED14F.3050704@perathoner.de>
Message-ID: <212322090907161055x5c582668yb8f2ee249e10cf4f@mail.gmail.com>

Thing is, indexing takes a long time and occupies quite a lot of space. I
managed to reduce it by filtering the "important" parts of the rdf to index.
If the location of the text can be inferred from #etext1802 I prefer to use
that.

From i30817 at gmail.com  Thu Jul 16 11:19:07 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 16 Jul 2009 19:19:07 +0100
Subject: [gutvol-p] Re: Programmatic fetching books from Gutenberg
In-Reply-To: <212322090907161055x5c582668yb8f2ee249e10cf4f@mail.gmail.com>
References: <212322090907151725x194f5361j2741f331dbf775f1@mail.gmail.com>
 <4A5ED14F.3050704@perathoner.de>
 <212322090907161055x5c582668yb8f2ee249e10cf4f@mail.gmail.com>
Message-ID: <212322090907161119w6e2c101dra5f41c4546e7cbf5@mail.gmail.com>

It appears that the format follows a rule, sort of:
#etext1802
"&f;1/8/0/1802/1802.txt

Though it doesn't appear to be consistent. I saw something about an old
indexing scheme for files older than 10000. What is the scheme (can it be
guessed from the #number?) and is it going to disappear from the Gutenberg
server? Or are you going to make redirects?

From ockerblo at pobox.upenn.edu  Thu Jul 16 11:53:37 2009
From: ockerblo at pobox.upenn.edu (John Mark Ockerbloom)
Date: Thu, 16 Jul 2009 14:53:37 -0400
Subject: [gutvol-p] Re: Programmatic fetching books from Gutenberg
Message-ID: <4A5F7731.5090904@pobox.upenn.edu>

(sending this time to the proper list address)

Prior to etext #10000, Gutenberg filenames were strings that didn't have
anything to do with the etext numbers (except that the number indirectly
determined which etext* directory files would be in, since both the numbers
and the etext* directories were tied to release date). As these early
etexts get revised, they typically move into the new
numerically-based-filename system.
However, there are still slightly more than 6000 Gutenberg etexts that use
the old-style filenames.

Information about the filenames, new and old, is included in the RDF file.
For instance, if you are interested in etext #3167, you'll find a metadata
record for it early in the file in a pgterms:etext element with ID
"etext3167". Later in the RDF file, you'll see two pgterms:file elements
that have an isFormatOf relationship with the ID "etext3167". Those elements
in turn specify information about the two files associated with that etext,
including name, MIME type, length, and last-modified date.

The name of the file, in particular, is in the rdf:about attribute of the
pgterms:file element. In this case, you'll find that the two files
associated with etext #3167 are in etext02/wsxpm10.txt and
etext02/wsxpm10.zip (relative to the top-level Gutenberg text directory).

If for some reason the RDF itself is too big for you to handle easily, it
looks like it's auto-generated, so you could probably write a script in Perl
or some other suitable text-crunching language to extract only the
information you're interested in, in some more compact form.

I have my own independent copy of some of this information, in my own
format, which I assembled before the RDF directory was made available. But
if I were to start over again, I'd probably just pull straight from the RDF.
(And try to be more proactive about getting Gutenberg to fix its metadata at
various spots instead of just fixing it on my end.)

I hope this helps.
John

From marcello at perathoner.de  Thu Jul 16 13:07:37 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Thu, 16 Jul 2009 22:07:37 +0200
Subject: [gutvol-p] Re: Programmatic fetching books from Gutenberg
In-Reply-To: <212322090907161119w6e2c101dra5f41c4546e7cbf5@mail.gmail.com>
References: <212322090907151725x194f5361j2741f331dbf775f1@mail.gmail.com>
 <4A5ED14F.3050704@perathoner.de>
 <212322090907161055x5c582668yb8f2ee249e10cf4f@mail.gmail.com>
 <212322090907161119w6e2c101dra5f41c4546e7cbf5@mail.gmail.com>
Message-ID: <4A5F8889.1070903@perathoner.de>

Paulo Levi wrote:
> It appears that the format follows a rule, sort of:
> #etext1802
> "&f;1/8/0/1802/1802.txt

&f; is an xml entity which is defined at the top of the rdf file.
Retrieve the url &f;1/8/0/1802/1802.txt (with the entity expanded) and you
get the file.

This is an xml file and you *really* should parse it with an xml parser.
That will take care of all these problems for you. If you are concerned
about memory, use a sax parser.

> Though it doesn't appear to be consistent. I saw something about an old
> indexing scheme for files older than 10000. What is the scheme (can it be
> guessed from the #number?) and is it going to disappear from the Gutenberg
> server? Or are you going to make redirects?

If you parse the rdf file the xml way you get the current url of the ebook
files regardless of old/new directory scheme and any future file moves.
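To make the advice above concrete, here is a minimal sketch of streaming the catalog with a SAX parser and collecting the file URLs for each etext. It is an illustration under assumptions, not the catalog's documented layout: the element and attribute names (pgterms:file, rdf:about, isFormatOf) are taken from the descriptions in this thread, while the dcterms: prefix, the rdf:resource attribute, the namespace URIs, the .zip record, and the value of the &f; entity in the sample are placeholders chosen for the example.

```python
import xml.sax
from collections import defaultdict

# Tiny stand-in for catalog.rdf. The entity value, the namespace URIs and
# the .zip record are placeholders for illustration, not real catalog data.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY f "http://example.org/files/">
]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://example.org/pgterms#">
  <pgterms:etext rdf:ID="etext1802"/>
  <pgterms:file rdf:about="&f;1/8/0/1802/1802.txt">
    <dcterms:isFormatOf rdf:resource="#etext1802"/>
  </pgterms:file>
  <pgterms:file rdf:about="&f;1/8/0/1802/1802.zip">
    <dcterms:isFormatOf rdf:resource="#etext1802"/>
  </pgterms:file>
</rdf:RDF>
"""


class FileRecordHandler(xml.sax.ContentHandler):
    """Collects file URLs per etext ID while streaming through the RDF."""

    def __init__(self):
        super().__init__()
        self.files_by_etext = defaultdict(list)
        self._current_url = None  # rdf:about of the file record being read

    def startElement(self, name, attrs):
        # Namespace processing is off by default, so names arrive as
        # prefixed strings such as "pgterms:file".
        if name == "pgterms:file":
            # The parser has already expanded &f; in the attribute value.
            self._current_url = attrs.get("rdf:about")
        elif name == "dcterms:isFormatOf" and self._current_url is not None:
            etext_id = attrs.get("rdf:resource", "").lstrip("#")
            self.files_by_etext[etext_id].append(self._current_url)

    def endElement(self, name):
        if name == "pgterms:file":
            self._current_url = None


def index_file_records(rdf_bytes):
    """Return a dict mapping an etext ID to the list of its file URLs."""
    handler = FileRecordHandler()
    xml.sax.parseString(rdf_bytes, handler)
    return dict(handler.files_by_etext)
```

Calling index_file_records(SAMPLE)["etext1802"] gives the two expanded URLs from the sample. For the real file, xml.sax.parse(path, handler) streams from disk without loading the whole document, and the resulting mapping could then be fed to an external index instead of indexing the raw RDF.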
From i30817 at gmail.com  Thu Jul 16 13:33:11 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Thu, 16 Jul 2009 21:33:11 +0100
Subject: [gutvol-p] Re: Programmatic fetching books from Gutenberg
In-Reply-To: <4A5F8889.1070903@perathoner.de>
References: <212322090907151725x194f5361j2741f331dbf775f1@mail.gmail.com>
 <4A5ED14F.3050704@perathoner.de>
 <212322090907161055x5c582668yb8f2ee249e10cf4f@mail.gmail.com>
 <212322090907161119w6e2c101dra5f41c4546e7cbf5@mail.gmail.com>
 <4A5F8889.1070903@perathoner.de>
Message-ID: <212322090907161333p5b7ee941xf75ac3ad2e9fee2b@mail.gmail.com>

Not using xml, and am not going to - I need fast search so I'm indexing the
files using Lucene.

From i30817 at gmail.com  Thu Jul 16 19:55:57 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 17 Jul 2009 03:55:57 +0100
Subject: [gutvol-p] Bizarre.
Message-ID: <212322090907161955q54c141bci6c5c9bc6d926f3e2@mail.gmail.com>

What are the obligatory rdf triples? Apparently editors of collections have
no predefined tag and omit creator. What about books with various authors?

From gbnewby at pglaf.org  Thu Jul 16 22:10:00 2009
From: gbnewby at pglaf.org (Greg Newby)
Date: Thu, 16 Jul 2009 22:10:00 -0700
Subject: [gutvol-p] Re: Bizarre.
In-Reply-To: <212322090907161955q54c141bci6c5c9bc6d926f3e2@mail.gmail.com>
References: <212322090907161955q54c141bci6c5c9bc6d926f3e2@mail.gmail.com>
Message-ID: <20090717050959.GA31087@mail.pglaf.org>

On Fri, Jul 17, 2009 at 03:55:57AM +0100, Paulo Levi wrote:
> What are the obligatory rdf triples? Apparently editors of collections have
> no predefined tag and omit creator. What about books with various authors?

I used to edit in the header to RDF files, but gave up... I think
essentially all RDF files are copyrighted, so we leave it to the contributor
to insert whatever metadata they desire. My practice these days is to add a
README.TXT with the triples & header & license.

I don't think anyone but me has posted an RDF file in a while, but could be
wrong... you are correct that they are not standard.

  -- Greg

Dr. Gregory B. Newby
Chief Executive and Director
Project Gutenberg Literary Archive Foundation http://gutenberg.org
A 501(c)(3) not-for-profit organization with EIN 64-6221541
gbnewby at pglaf.org

From i30817 at gmail.com  Sat Jul 18 19:55:04 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Sun, 19 Jul 2009 03:55:04 +0100
Subject: [gutvol-p] Isn't it strange that the creators have birth and death
 dates included in the tag?
Message-ID: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>

I suppose this is going to cause me problems later, since I'm using the Open
Library web service to fetch book covers, and it's very sensitive to false
information.

I can't think of a way to remove it because of the separator used, just
about the worst possible choice since "," also appears as the delimiter of
the first and last names. Not to speak of the names in French and Latin, or
organizations, that don't have "," either. The other possible match, "-",
also appears in names... possibly a regex like this
"[0-9]+* (B\.C\.)? [0-9]+*" ?

Then there are cases like "Sunzi, 6th cent. B.C."
Bad metadata!

From i30817 at gmail.com  Sat Jul 18 20:08:24 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Sun, 19 Jul 2009 04:08:24 +0100
Subject: [gutvol-p] Re: Isn't it strange that the creators have birth and
 death dates included in the tag?
In-Reply-To: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
References: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
Message-ID: <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>

I'm dumb. I didn't notice the friendlytitle triple. Disregard this.

From i30817 at gmail.com  Sat Jul 18 20:10:36 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Sun, 19 Jul 2009 04:10:36 +0100
Subject: [gutvol-p] Re: Isn't it strange that the creators have birth and
 death dates included in the tag?
In-Reply-To: <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>
References: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
 <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>
Message-ID: <212322090907182010w5d4b3876y3de43ef9fb0ba857@mail.gmail.com>

Then again, don't. The friendly title doesn't have the last name. Bah.

From i30817 at gmail.com  Sat Jul 18 20:12:11 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Sun, 19 Jul 2009 04:12:11 +0100
Subject: [gutvol-p] Re: Isn't it strange that the creators have birth and
 death dates included in the tag?
In-Reply-To: <212322090907182010w5d4b3876y3de43ef9fb0ba857@mail.gmail.com>
References: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
 <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>
 <212322090907182010w5d4b3876y3de43ef9fb0ba857@mail.gmail.com>
Message-ID: <212322090907182012y4c8794beua9d11119bd572726@mail.gmail.com>

In fact it has what appear to be automated parsing errors. For instance:

The Strange Adventures of Captain Dangerous, Vol. 2 of 3
Who was a sailor, a soldier, a merchant, a spy, a slave among the moors...
Sala, George Augustus, 1828-1895
The Strange Adventures of Captain Dangerous, Vol.

From i30817 at gmail.com  Sat Jul 18 20:17:21 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Sun, 19 Jul 2009 04:17:21 +0100
Subject: [gutvol-p] Re: Isn't it strange that the creators have birth and
 death dates included in the tag?
In-Reply-To: <212322090907182012y4c8794beua9d11119bd572726@mail.gmail.com>
References: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
 <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>
 <212322090907182010w5d4b3876y3de43ef9fb0ba857@mail.gmail.com>
 <212322090907182012y4c8794beua9d11119bd572726@mail.gmail.com>
Message-ID: <212322090907182017u5c0fe57hee051975c2c6805f@mail.gmail.com>

It appears to be a hard-coded length limit, but it appears to work correctly
in the other cases... What's the algorithm you use?

From i30817 at gmail.com  Sun Jul 19 00:27:39 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Sun, 19 Jul 2009 08:27:39 +0100
Subject: [gutvol-p] Re: Isn't it strange that the creators have birth and
 death dates included in the tag?
In-Reply-To: <212322090907182017u5c0fe57hee051975c2c6805f@mail.gmail.com>
References: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
 <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>
 <212322090907182010w5d4b3876y3de43ef9fb0ba857@mail.gmail.com>
 <212322090907182012y4c8794beua9d11119bd572726@mail.gmail.com>
 <212322090907182017u5c0fe57hee051975c2c6805f@mail.gmail.com>
Message-ID: <212322090907190027n2520655dhe41456cd481bfdf@mail.gmail.com>

I managed, but the special cases are driving me crazy.
Some of the names are like this, a clear error I believe:

Headley, P. C. (Phineas Camp), 1819-1903, 1819-1903
Combs, Josiah Henry, 1886-1960, 1886-1960
Algie, R. M. (Ronald Macmillan), 1888-1978, 1888-1978

Then there are the various eastern people that only have a date, no leading
- or anything.
And the other abbreviations like
*d. ca. fl. cent*

From i30817 at gmail.com  Sun Jul 19 00:50:01 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Sun, 19 Jul 2009 08:50:01 +0100
Subject: [gutvol-p] Re: Isn't it strange that the creators have birth and
 death dates included in the tag?
In-Reply-To: <212322090907190027n2520655dhe41456cd481bfdf@mail.gmail.com>
References: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
 <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>
 <212322090907182010w5d4b3876y3de43ef9fb0ba857@mail.gmail.com>
 <212322090907182012y4c8794beua9d11119bd572726@mail.gmail.com>
 <212322090907182017u5c0fe57hee051975c2c6805f@mail.gmail.com>
 <212322090907190027n2520655dhe41456cd481bfdf@mail.gmail.com>
Message-ID: <212322090907190050q1de86dc0g93895fc623a525d0@mail.gmail.com>

Actually just searching for a digit in the last ","-separated string seems
to do it, except in the three cases above. Can you fix the rdf?

From marcello at perathoner.de  Sun Jul 19 04:34:06 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sun, 19 Jul 2009 13:34:06 +0200
Subject: [gutvol-p] Re: Isn't it strange that the creators have birth and
 death dates included in the tag?
In-Reply-To: <212322090907190027n2520655dhe41456cd481bfdf@mail.gmail.com>
References: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
 <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>
 <212322090907182010w5d4b3876y3de43ef9fb0ba857@mail.gmail.com>
 <212322090907182012y4c8794beua9d11119bd572726@mail.gmail.com>
 <212322090907182017u5c0fe57hee051975c2c6805f@mail.gmail.com>
 <212322090907190027n2520655dhe41456cd481bfdf@mail.gmail.com>
Message-ID: <4A6304AE.20309@perathoner.de>

Paulo Levi wrote:
> I managed, but the special cases are driving me crazy.
> Some of the names are like this, a clear error I believe:
>
> Headley, P. C. (Phineas Camp), 1819-1903, 1819-1903
> Combs, Josiah Henry, 1886-1960, 1886-1960
> Algie, R. M. (Ronald Macmillan), 1888-1978, 1888-1978

Fixed these.

If you find errors in the catalog, report to catalog at pglaf.org *AFTER*
doing a diligent LoC search on http://catalog.loc.gov/ to make sure that you
are reporting a real error.

> Then there are the various eastern people that only have a date, no leading
> - or anything.
> And the other abbreviations like
> *d. ca.
> fl.
> cent

The catalog has been edited by many different people over a period of nearly
40 years. We know it is not as consistent as we'd like it, and will probably
never be so.

Welcome to the world of real programming ...

From marcello at perathoner.de  Sun Jul 19 04:40:26 2009
From: marcello at perathoner.de (Marcello Perathoner)
Date: Sun, 19 Jul 2009 13:40:26 +0200
Subject: [gutvol-p] Re: Isn't it strange that the creators have birth and
 death dates included in the tag?
In-Reply-To: <212322090907182017u5c0fe57hee051975c2c6805f@mail.gmail.com>
References: <212322090907181955n3e0145b8xce049fcb602c2cee@mail.gmail.com>
 <212322090907182008w3d13dc20p89421e60147e7c26@mail.gmail.com>
 <212322090907182010w5d4b3876y3de43ef9fb0ba857@mail.gmail.com>
 <212322090907182012y4c8794beua9d11119bd572726@mail.gmail.com>
 <212322090907182017u5c0fe57hee051975c2c6805f@mail.gmail.com>
Message-ID: <4A63062A.9070604@perathoner.de>

Paulo Levi wrote:
> It appears to be a hard-coded length limit, but it appears to work correctly
> in the other cases... What's the algorithm you use?

Friendlytitle is the title that appears on the bibrec page, so that users
who bookmark the page will see something meaningful in their bookmark list.

It's the first line of the book title followed by as many authors as will
fit into 80 chars.

From i30817 at gmail.com  Fri Jul 31 11:12:49 2009
From: i30817 at gmail.com (Paulo Levi)
Date: Fri, 31 Jul 2009 19:12:49 +0100
Subject: [gutvol-p] If you don't mind testing a program
Message-ID: <212322090907311112y207a9bdh6126cc57baf2d94a@mail.gmail.com>

This is my reader with the Gutenberg indexing now. It's so very large
because I'm including the tts and the built index in the download. I'd like
to know what you think of the Gutenberg download (right click, toggle
Gutenberg).

http://rapidshare.com/files/262224026/BookJar.7z.html

Also, after you tire of searching, I'd like to know how efficient the
reindexing process is on other machines. So I'd ask you to delete what's
inside the directory named cache, so that the next time the Gutenberg
download list is accessed it reindexes.
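
Returning to the creator-field question raised earlier in this thread, the heuristic described there (look for a digit in the last ","-separated piece) can be sketched in a few lines. This is a rough illustration, not the catalog's own logic: the function name strip_dates is hypothetical, and repeating the check in a loop is an addition made here so that doubled date fields, like the three entries reported above, are also stripped.

```python
def strip_dates(creator):
    """Drop trailing comma-separated pieces that contain a digit.

    Heuristic only: a trailing piece with any digit in it is treated as a
    date or period such as "1564-1616" or "6th cent. B.C.".
    """
    name = creator.strip()
    while True:
        head, sep, tail = name.rpartition(",")
        # Stop when there is no comma left or the last piece has no digit.
        if not sep or not any(ch.isdigit() for ch in tail):
            return name
        name = head.strip()
```

For example, strip_dates("Shakespeare, William, 1564-1616") returns "Shakespeare, William", and a name without any comma is returned unchanged. A name whose last real component happens to contain a digit would be cut incorrectly, so results are worth spot-checking against the catalog.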