[gutvol-d] Re: Fwd: Programmatic fetching books from Gutenberg

29 Jul 2009

Could you tell me whether the Gutenberg RDF index file, besides being
autogenerated, is also sorted?

What I mean is: I'm indexing parts of the file, and I get a major
speedup by treating the file as:

a massive list of pgterms:etext definitions

followed by a (more massive) list of pgterms:file definitions

This lowers the number of string comparisons I have to do (to about
n/2), but at the cost that any etext record appearing inside the second
list is not picked up.

I don't plan to change that; the question is only for my peace of mind.
Also, in the pgterms:file records, are the records referring to the
same file consecutive? I ask because, if so, I could do the same sort
of filtering Aaron Cannon is doing in his DVD project, to speed up the
indexing some more and remove duplicates.
(If they aren't consecutive, I would have to issue queries while
building the index to see whether they were already inserted.)
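To make the difference concrete, here is a rough sketch of the two cases. The record layout (a tuple whose first element identifies what the record refers to) and both function names are assumptions for illustration only; the set in the second function stands in for querying the index during the build.

```python
from itertools import groupby


def dedupe_consecutive(records, key):
    """Keep the first record of each run of equal keys.

    Single pass, no lookups, but only correct if equal keys are adjacent.
    """
    return [next(group) for _, group in groupby(records, key=key)]


def dedupe_anywhere(records, key):
    """Keep the first record per key, wherever duplicates appear.

    Needs a membership check per record (here a set; in the real
    indexer this would be a query against the partially built index).
    """
    seen = set()
    kept = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            kept.append(record)
    return kept


# Made-up records: ("10", ...) shows up again after ("11", ...).
records = [("10", "a"), ("10", "b"), ("11", "c"), ("10", "d")]
first = lambda r: r[0]
print(dedupe_consecutive(records, first))
print(dedupe_anywhere(records, first))
```

If the records are not consecutive, the first function lets the late duplicate through, so the cheaper approach is only safe when the ordering is guaranteed.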

I have nothing against XPath; indeed, I think the scanning of the file
in Lucene already uses something similar. But I need free-text
searches, and they have to be fast (I'm already experimenting with an
in-memory cache after the query too, and it works OK-ish for my
application).

Paulo Levi