[gutvol-d] Re: Fwd: Programmatic fetching books from Gutenberg

9 Aug 2009

      Paulo Levi wrote:
...
Tell me, please, whether the Gutenberg RDF index file, besides being
autogenerated, is also sorted.
What I mean is: I'm indexing parts of the file, and I gain a major
speedup by treating the file as
a massive list of pgterms:etext definitions
followed by a (more massive) list of pgterms:file definitions.
This cuts the number of string comparisons I have to do (to about
n/2), but at the cost that any etext record sitting inside the second
list is not picked up.
That won't change; I'm asking only for my peace of mind.
Also, in the pgterms:file records, are the records referring to the
same file consecutive? I ask because, if so, I could do the same sort
of filtering Aaron Cannon is doing in his DVD project, to speed up the
index some more and remove duplicates.
(If they aren't consecutive, I would have to issue queries while
building the index to see whether they were already inserted.)
I have nothing against XPath; indeed, I think the scanning of the file
in Lucene already uses something similar. But I need free-text
searches, and they have to be fast (I'm already experimenting with a
memory cache after the query too, and it works okay-ish for my
application).

All future changes to the file will be backward-compatible in an XML way.

Meaning: the same XPath queries will yield (supersets of) the same 
result sets.
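
As an illustration of what that guarantee means in practice, here is a
minimal sketch using Python's xml.etree.ElementTree. The namespace URI,
the "id" attribute, and the tiny inline documents are hypothetical
stand-ins, not the layout of the real index file. The point is only that
a path query selects the same etext elements whether or not they are
grouped together, and that adding records can only enlarge the result.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and attribute names, for illustration only;
# check the real index file for the actual ones.
NS = {"pgterms": "http://example.org/pgterms#"}

WRAP = '<catalog xmlns:pgterms="http://example.org/pgterms#">{}</catalog>'
ETEXT_1 = '<pgterms:etext id="etext1"/>'
ETEXT_2 = '<pgterms:etext id="etext2"/>'
ETEXT_3 = '<pgterms:etext id="etext3"/>'
FILE_1 = '<pgterms:file id="file1"/>'


def etext_ids(xml_text):
    """Return the set of ids of all top-level pgterms:etext elements."""
    root = ET.fromstring(xml_text)
    return {elem.get("id") for elem in root.findall("pgterms:etext", NS)}


# Same records, two different orderings.
grouped = WRAP.format(ETEXT_1 + ETEXT_2 + FILE_1)      # etexts first
interleaved = WRAP.format(ETEXT_1 + FILE_1 + ETEXT_2)  # etext after a file

# A later version of the file with one more record added.
extended = WRAP.format(ETEXT_1 + FILE_1 + ETEXT_2 + ETEXT_3)

same_result = etext_ids(grouped) == etext_ids(interleaved)
is_superset = etext_ids(extended) >= etext_ids(grouped)
```

A reader that splits the file into "etexts first, files second" by
position would miss etext2 in the interleaved document; the path query
does not.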

I do NOT guarantee that the sequence of entities will be sorted the same 
way. Your sorting needs will probably be much different from the next 
guy's, so the sorting is up to you.
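
One way to cope with an unsorted file is to group and sort on the
reader's side. The sketch below, again in Python, collects file records
per etext in a dictionary of sets, so duplicates disappear and nothing
depends on records being consecutive; the result is then sorted by a key
of the reader's choosing. The namespace URI, the "id", "title", "etext",
and "name" attributes, and the sample data are all hypothetical and only
serve the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical namespace and attribute names, for illustration only.
PG = "http://example.org/pgterms#"

SAMPLE = """<catalog xmlns:pgterms="http://example.org/pgterms#">
  <pgterms:file etext="etext2" name="2.txt"/>
  <pgterms:etext id="etext2" title="Beta"/>
  <pgterms:file etext="etext1" name="1.txt"/>
  <pgterms:etext id="etext1" title="Alpha"/>
  <pgterms:file etext="etext2" name="2.txt"/>
  <pgterms:file etext="etext2" name="2.zip"/>
</catalog>"""


def build_index(xml_text):
    """Group file names under their etext, ignoring record order."""
    root = ET.fromstring(xml_text)
    titles = {}               # etext id -> title
    files = defaultdict(set)  # etext id -> set of file names (deduplicated)
    for elem in root:
        if elem.tag == f"{{{PG}}}etext":
            titles[elem.get("id")] = elem.get("title")
        elif elem.tag == f"{{{PG}}}file":
            files[elem.get("etext")].add(elem.get("name"))
    # Sort by etext id here; any other key would work just as well.
    return [(eid, titles[eid], sorted(files[eid])) for eid in sorted(titles)]


index = build_index(SAMPLE)
```

The set membership test replaces the "query the index to see if it was
already inserted" step. For a file too large to hold in memory, a
streaming parser such as ET.iterparse would be worth considering.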

Marcello Perathoner