
Could you tell me, please, whether the Gutenberg RDF index file, besides being autogenerated, is also sorted?

What I mean is this: I'm indexing parts of the file, and I gain a major speedup by treating it as a massive list of pgterms:etext definitions followed by a (more massive) list of pgterms:file definitions. This lowers the number of string comparisons I have to do (to about n/2), but at the cost that any etext record sitting inside the second list is not picked up. I won't be changing that; I ask only for my peace of mind.

Also, in the pgterms:file records, are the records referring to the same file consecutive? I ask because, if so, I could do the same sort of filtering Aaron Cannon is doing in his DVD project, to speed up the index some more and remove duplicates. (If they aren't consecutive, I would have to issue queries while building the index to see whether they were already inserted.)

I have nothing against XPath; indeed, I think the scanning of the file in Lucene already uses something similar. But I need free-text searches, and they have to be fast (I'm already experimenting with an in-memory cache after the query too, and it works okay-ish for my application).
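To make the two assumptions concrete, here is a minimal Python sketch of the scan I have in mind. It is not my actual indexing code: the function names are made up for illustration, and it assumes each record opens on a line containing `<pgterms:etext` or `<pgterms:file`. The first function stops testing for etext records once the first file record appears; the second drops duplicates without querying the index, which is only correct if records for the same file are adjacent.

```python
from itertools import groupby

# Record-opening markers; the tag names come from the index file,
# the line-oriented matching is an assumption of this sketch.
ETEXT_TAG = "<pgterms:etext"
FILE_TAG = "<pgterms:file"


def classify_lines(lines):
    """Yield ("etext", line) or ("file", line) for record-opening lines.

    Assumes every etext record precedes every file record. After the
    first file record is seen, lines are tested for the file tag only,
    so an etext record appearing later is silently skipped.
    """
    in_file_section = False
    for line in lines:
        if in_file_section:
            # Second phase: a single test per line.
            if FILE_TAG in line:
                yield ("file", line)
        elif ETEXT_TAG in line:
            yield ("etext", line)
        elif FILE_TAG in line:
            # First file record: switch phases for the rest of the scan.
            in_file_section = True
            yield ("file", line)


def drop_consecutive_duplicates(records, key):
    """Keep the first record of each run of records sharing the same key.

    Only removes all duplicates if equal keys are adjacent in the input;
    otherwise a lookup against the index would still be needed.
    """
    for _, run in groupby(records, key=key):
        yield next(run)
```

If the file turns out not to be sorted this way, `classify_lines` loses the stray etext records and `drop_consecutive_duplicates` lets non-adjacent duplicates through, which is exactly why I am asking.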