
Just to clarify, you wrote:

"Also, in the pgterms:file records, are the records referring to the same file consecutive? I ask because if so, I could do the same sort of filtering Aaron Cannon is doing in his DVD project, to speed up the index some more and remove duplicates. (If they aren't consecutive I would have to issue queries between building the index to see if they were already inserted.)"

There aren't really any duplicates that I'm aware of in the RDF catalog; it's just that most books are in the archive in more than one encoding or format. What I have been filtering out are the lossier encodings (like ASCII) when a less lossy one (like UTF-8) is available.

As for the sorting, I don't know for sure, but the current ordering is likely an artifact of the way the RDF was generated. Whether or not you want to rely on that never changing is up to you.

I haven't followed the thread closely enough to know exactly what you're trying to do, but it sounds as though you might be using the RDF in a way it was never intended. What I mean is that you seem to be reading directly from it, like a database, each time someone does a search, rather than loading the RDF into an actual database and reading from that. Having recently worked on a Python app which parses the RDF into memory, I can tell you that parsing the XML is the slowest part of the process, at least in my application. Your mileage may vary, but when you have to do tens of thousands of string comparisons against a file roughly 100 MB in size before you can return a result in a web app (I'm assuming it's a web app), you're likely going to have problems.

Good luck.

Aaron

On 7/29/09, Paulo Levi <i30817@gmail.com> wrote:
> Of course, if they were sorted by priority, say, most-featureful free
> format first: html -> rtf -> text UTF-8 -> ASCII, that would be very
> nice too.

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d
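The filtering Aaron describes, combined with the priority order Paulo suggests, can be done in one pass without relying on same-book records being consecutive. A minimal sketch in Python; the format strings and the `(book_id, fmt)` record shape are illustrative assumptions, not the actual pgterms vocabulary:

```python
# Hypothetical sketch: given (book_id, fmt) pairs parsed from the RDF,
# keep only the highest-priority file per book, regardless of record order.

# Earlier in the list = preferred (Paulo's suggested order).
PRIORITY = ["html", "rtf", "text/utf-8", "text/ascii"]
RANK = {fmt: i for i, fmt in enumerate(PRIORITY)}

def best_files(records):
    """records: iterable of (book_id, fmt) tuples, in any order."""
    best = {}
    for book_id, fmt in records:
        rank = RANK.get(fmt, len(PRIORITY))  # unknown formats rank last
        if book_id not in best or rank < best[book_id][0]:
            best[book_id] = (rank, fmt)
    return {bid: fmt for bid, (rank, fmt) in best.items()}

records = [
    (11, "text/ascii"),
    (11, "text/utf-8"),
    (11, "html"),
    (158, "text/utf-8"),
    (158, "rtf"),
]
print(best_files(records))  # {11: 'html', 158: 'rtf'}
```

Because the winner is tracked per book id in a dict, this sidesteps the question of whether the RDF's ordering is stable, at the cost of holding one entry per book in memory.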