
Paulo Levi wrote:
Tell me, please, if the Gutenberg RDF index file, besides being autogenerated, is also sorted.
What I mean is, I'm indexing parts of the file, and I gain a major speedup by treating the file as:
a massive list of pgterms:etext definitions
followed by a (more massive) list of pgterms:file definitions
This lowers the number of string comparisons I have to do (to about n/2), but at the cost that any etext record appearing inside the second list is not picked up.
That won't be changed, but is only for my peace of mind. Also, in the pgterms:file records, are the records referring to the same file consecutive? I ask because if so, I could do the same sort of filtering Aaron Cannon is doing in his DVD project, to speed up the index some more and remove duplicates. (If they aren't consecutive, I would have to issue queries while building the index to see if they were already inserted.)
I have nothing against XPath; indeed, I think the scanning of the file in Lucene already uses something similar. But I need free-text searches, and they have to be fast (I'm already experimenting with a memory cache after the query too, and it works okay-ish for my application).
All future changes to the file will be backward-compatible in an XML way. Meaning: the same XPath queries will yield (supersets of) the same result sets. I do NOT guarantee that the sequence of entities will be sorted the same way. Your sorting needs will probably be much different from the next guy's, so the sorting is up to you.
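Since the order of records is not guaranteed, one option is a single streaming pass that does not care where etext and file records appear or whether duplicates are adjacent. The following is only a rough sketch under assumptions: the namespace URIs, the `rdf:ID` / `rdf:about` attributes, and the `dcterms:isFormatOf` link from a file record to its etext are my guesses at the catalog layout and should be checked against the real file. `index_catalog` and the sample data are hypothetical names made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed namespace URIs; verify them against the header of the real file.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
PGTERMS = "http://www.gutenberg.org/rdfterms/"


def index_catalog(source):
    """Collect etext titles and file URLs in one pass, in any record order."""
    etexts = {}               # etext id -> title
    files = defaultdict(set)  # etext id -> set of file URLs (a set drops duplicates)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{PGTERMS}}}etext":
            etexts[elem.get(f"{{{RDF}}}ID")] = elem.findtext(f"{{{DC}}}title", default="")
            elem.clear()      # release memory held by the finished record
        elif elem.tag == f"{{{PGTERMS}}}file":
            ref = elem.find(f"{{{DCTERMS}}}isFormatOf")
            if ref is not None:
                target = ref.get(f"{{{RDF}}}resource", "").lstrip("#")
                files[target].add(elem.get(f"{{{RDF}}}about"))
            elem.clear()
    return etexts, files


# Made-up sample: records are interleaved and one file record is repeated.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1"><dc:title>First</dc:title></pgterms:etext>
  <pgterms:file rdf:about="1/1.txt"><dcterms:isFormatOf rdf:resource="#etext1"/></pgterms:file>
  <pgterms:file rdf:about="2/2.txt"><dcterms:isFormatOf rdf:resource="#etext2"/></pgterms:file>
  <pgterms:etext rdf:ID="etext2"><dc:title>Second</dc:title></pgterms:etext>
  <pgterms:file rdf:about="1/1.txt"><dcterms:isFormatOf rdf:resource="#etext1"/></pgterms:file>
</rdf:RDF>"""

etexts, files = index_catalog(io.BytesIO(SAMPLE))
```

The per-record tag check is one string comparison per element, and the in-memory set removes duplicate file records without querying the index being built. The trade-off is that the id-to-files map has to fit in memory during the pass.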