The thing is, indexing takes a long time and uses quite a lot of space. I managed to reduce both by indexing only the "important" parts of the RDF. If the location of the text can be inferred from #etext1802, I would prefer to use that.
In the second half of the RDF file you will find records for all the files, in the different formats we offer, for each ebook. Use #etext1802 as the link between the book record and the file records.
Paulo Levi wrote:
I made an ebook reader
(here) http://code.google.com/p/bookjar/downloads/list
and I'd like to search and download Gutenberg books. I already have a search prototype using LuceneSail, a library that uses Lucene to index RDF documents, and it indexes only the parts I want from catalog.rdf.zip.
Now I'd like to know how I can fetch the book itself from the URL inside the catalog, and what the variants for the different formats are.
An example query result:
author: Shakespeare, William, 1564-1616
url: http://www.gutenberg.org/feeds/catalog.rdf#etext1802
title: King Henry VIII
So, I'd like to know how I can get a working URL to download the book from the etext1802 number, and how to construct the variants for each format.
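
For what it's worth, the linking step suggested in the reply (use #etext1802 to match the book record to its file records) could look roughly like the Python sketch below. It is only an illustration under assumptions: the element names pgterms:file, dcterms:isFormatOf and dc:format, the pgterms namespace URI, and the example.org file URLs in the sample are all my guesses at the catalog layout, not copied from the real catalog.rdf, and dc:format is simplified to plain text here. Check them against the actual file before relying on this.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# The rdf, dc and dcterms URIs are the standard ones.
# The pgterms URI is an ASSUMPTION about the catalog's namespace.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/rdfterms/",
}

# Hypothetical, heavily simplified sample; the file URLs are placeholders.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1802">
    <dc:title>King Henry VIII</dc:title>
  </pgterms:etext>
  <pgterms:file rdf:about="http://example.org/files/1802/1802.txt">
    <dc:format>text/plain</dc:format>
    <dcterms:isFormatOf rdf:resource="#etext1802"/>
  </pgterms:file>
  <pgterms:file rdf:about="http://example.org/files/1802/1802-h.htm">
    <dc:format>text/html</dc:format>
    <dcterms:isFormatOf rdf:resource="#etext1802"/>
  </pgterms:file>
  <pgterms:file rdf:about="http://example.org/files/100/100.txt">
    <dc:format>text/plain</dc:format>
    <dcterms:isFormatOf rdf:resource="#etext100"/>
  </pgterms:file>
</rdf:RDF>
"""


def book_fragment(record_url):
    """Return the fragment part of a catalog URL, e.g. 'etext1802'."""
    return urlparse(record_url).fragment


def files_for_book(rdf_text, record_url):
    """Return (file_url, format) pairs for file records linked to the book."""
    target = "#" + book_fragment(record_url)
    # ElementTree stores namespaced attributes as '{uri}localname'.
    resource_attr = "{%s}resource" % NS["rdf"]
    about_attr = "{%s}about" % NS["rdf"]
    root = ET.fromstring(rdf_text)
    results = []
    for file_el in root.findall("pgterms:file", NS):
        link = file_el.find("dcterms:isFormatOf", NS)
        if link is not None and link.get(resource_attr) == target:
            fmt = file_el.findtext("dc:format", default="", namespaces=NS)
            results.append((file_el.get(about_attr), fmt.strip()))
    return results


if __name__ == "__main__":
    url = "http://www.gutenberg.org/feeds/catalog.rdf#etext1802"
    for file_url, fmt in files_for_book(SAMPLE, url):
        print(fmt, file_url)
```

The same lookup could be done at indexing time instead: store the isFormatOf target alongside each file record, then query the index for records whose target equals the fragment of the book's URL. For the full catalog, a streaming parser such as ET.iterparse would probably be a better fit than loading the whole document at once.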