
Just to clarify, you wrote:

"Also, in the pgterms:file records, are the records referring to the same file consecutive? I ask because if so, I could do the same sort of filtering Aaron Cannon is doing in his DVD project, to speed up the index some more and remove duplicates. (If they aren't consecutive I would have to issue queries between building the index to see if they were already inserted.)"

There aren't really any duplicates that I'm aware of in the RDF catalog; it's just that most books are in the archive in more than one encoding or format. What I have been filtering out are the lossier encodings (like ASCII) when a less lossy one (like UTF-8) is available.

As for the sorting, I don't know for sure, but the current ordering is likely an artifact of the way the RDF was generated. Whether or not you want to rely on that never changing is up to you.

I haven't followed the thread closely enough to know exactly what you're trying to do, but it sounds as though you might be using the RDF in a way it was never intended. What I mean is that you seem to be reading directly from it, like a database, each time someone does a search, rather than loading the RDF into an actual database and reading from that. Having recently worked on a Python app which parses the RDF into memory, I can tell you that parsing the XML is the slowest part of the process, at least in my application. Your mileage may vary, but when you have to do tens of thousands of string comparisons against a file roughly 100 MB in size before you can return a result in a web app (I'm assuming it's a web app), you're likely going to have problems.

Good luck.

Aaron

On 7/29/09, Paulo Levi <i30817@gmail.com> wrote:
> Of course, if they were sorted by priority, say, most-featureful free
> format first: html -> rtf -> text UTF-8 -> ASCII, that would be very
> nice too.

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d
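The filtering Aaron describes, combined with the priority order Paulo suggests, can be done in one pass without relying on same-book records being consecutive. A minimal sketch in Python; the format strings and the `(book_id, fmt)` record shape are illustrative assumptions, not the actual pgterms vocabulary:

```python
# Hypothetical sketch: given (book_id, fmt) pairs parsed from the RDF,
# keep only the highest-priority file per book, regardless of record order.

# Earlier in the list = preferred (Paulo's suggested order).
PRIORITY = ["html", "rtf", "text/utf-8", "text/ascii"]
RANK = {fmt: i for i, fmt in enumerate(PRIORITY)}

def best_files(records):
    """records: iterable of (book_id, fmt) tuples, in any order."""
    best = {}
    for book_id, fmt in records:
        rank = RANK.get(fmt, len(PRIORITY))  # unknown formats rank last
        if book_id not in best or rank < best[book_id][0]:
            best[book_id] = (rank, fmt)
    return {bid: fmt for bid, (rank, fmt) in best.items()}

records = [
    (11, "text/ascii"),
    (11, "text/utf-8"),
    (11, "html"),
    (158, "text/utf-8"),
    (158, "rtf"),
]
print(best_files(records))  # {11: 'html', 158: 'rtf'}
```

Because the winner is tracked per book id in a dict, this sidesteps the question of whether the RDF's ordering is stable, at the cost of holding one entry per book in memory.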