I made an ebook reader (here: http://code.google.com/p/bookjar/downloads/list) and I'd like to search and download Gutenberg books. I already have a searcher prototype using LuceneSail, a library that uses Lucene to index RDF documents, and I only index what I want from catalog.rdf.zip.

Now I'd like to know how I can fetch the book itself from the URL inside the catalog, and what the variants for the formats are. An example query result:

author: Shakespeare, William, 1564-1616
url: http://www.gutenberg.org/feeds/catalog.rdf#etext1802
title: King Henry VIII

So I'd like to know how I can get a working URL to download the book from the etext1802 number, and how to construct variants for each format. Thank you in advance.
In the second half of the rdf file you will find records for all the files in different formats we offer for an ebook. Use the #etext1802 as link between book record and file records.
Thing is, indexing takes a long time and occupies quite a lot of space. I managed to reduce it by filtering the "important" parts of the RDF to index. If the location of the text can be inferred from #etext1802, I prefer to use that.
It appears that the format follows a rule of sorts: #etext1802 maps to "&f;1/8/0/1802/1802.txt". Though it doesn't appear to be consistent. I saw something about an old indexing scheme for files older than 10000. What is the scheme (can it be guessed from the #number?) and is it going to disappear from the Gutenberg server? Or are you going to make redirects?
Paulo Levi wrote:
It appears that the format follows a rule of sorts: #etext1802 maps to "&f;1/8/0/1802/1802.txt".
&f; is an XML entity which is defined at the top of the RDF file. Expand it, retrieve the resulting URL &f;1/8/0/1802/1802.txt, and you get the file. This is an XML file and you *really* should parse it with an XML parser. That will take care of all these problems for you. If you are concerned about memory, use a SAX parser.
Though it doesn't appear to be consistent. I saw something about an old indexing scheme for files older than 10000. What is the scheme (can it be guessed from the #number?) and is it going to disappear from the Gutenberg server? Or are you going to make redirects?
If you parse the rdf file the xml way you get the current url of the ebook files regardless of old/new directory scheme and any future file moves.
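A minimal sketch of the SAX approach suggested above, in Python. It streams the file and collects, for each book id such as etext1802, the URLs of its file records. The element and attribute names (pgterms:file, rdf:about, dcterms:isFormatOf, rdf:resource), the pgterms namespace URI, and the value given to the &f; entity are assumptions made for illustration; the thread only states that file records link back to #etext1802 and that &f; is defined at the top of the RDF file. Check them against the real catalog before relying on this.

```python
import xml.sax
from collections import defaultdict

# Hypothetical miniature catalog. The real value of &f; is defined at the
# top of catalog.rdf; the one below is a placeholder, and the element
# names are assumptions about the catalog layout.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY f "http://example.org/files/">
]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://example.org/pgterms/">
  <pgterms:etext rdf:ID="etext1802"/>
  <pgterms:file rdf:about="&f;1/8/0/1802/1802.txt">
    <dcterms:isFormatOf rdf:resource="#etext1802"/>
  </pgterms:file>
  <pgterms:file rdf:about="&f;1/8/0/1802/1802.zip">
    <dcterms:isFormatOf rdf:resource="#etext1802"/>
  </pgterms:file>
</rdf:RDF>
"""


class FileRecordHandler(xml.sax.ContentHandler):
    """Collects file URLs per book id while streaming through the RDF."""

    def __init__(self):
        super().__init__()
        self.files_by_book = defaultdict(list)
        self._current_url = None  # URL of the file record being read

    def startElement(self, name, attrs):
        if name == "pgterms:file":
            # The parser has already expanded &f; in the attribute value.
            self._current_url = attrs.get("rdf:about")
        elif name == "dcterms:isFormatOf" and self._current_url:
            # "#etext1802" -> "etext1802"
            book_id = attrs.get("rdf:resource", "").lstrip("#")
            self.files_by_book[book_id].append(self._current_url)

    def endElement(self, name):
        if name == "pgterms:file":
            self._current_url = None


def index_files(rdf_bytes):
    """Return {book id: [file URLs]} for the given RDF document."""
    handler = FileRecordHandler()
    xml.sax.parseString(rdf_bytes, handler)
    return dict(handler.files_by_book)


if __name__ == "__main__":
    for book_id, urls in index_files(SAMPLE).items():
        print(book_id, urls)
```

Because the handler keeps only an id-to-URL mapping, the result could be stored next to a filtered search index rather than indexing the full file records, and no URL has to be guessed from the number.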
I'm not using XML, and am not going to. I need fast search, so I'm indexing the files using Lucene.
participants (2)
- Marcello Perathoner
- Paulo Levi