Re: Fwd: Programmatic fetching books from Gutenberg

On Tue, Jul 28, 2009 at 12:17:23PM -0400, David A. Desrosiers wrote:
On Tue, Jul 28, 2009 at 10:33 AM, Greg Newby<gbnewby@pglaf.org> wrote:
A more general approach would be to let visitors to www.gutenberg.org put their selected files (including those generated on-the-fly) on a bookshelf (i.e., shopping cart), then download in one big file, or several small ones.
If you're looking at it at that level, why not just offer some streaming audio of the books as well? You could do this very simply with any number of dozens of dynamic content streaming applications in whatever language you choose (Perl, PHP, Python, Java, etc.)
This is a good point. I don't know why we don't have streaming, especially since iBiblio does have streaming (I think). If you could suggest some software that seems likely to work on the iBiblio server (Apache, PHP, Perl all on Linux; free), especially that could just be dropped into bibrec.php that I sent earlier, that would be a tremendous help. The funny part is that I get inquiries all the time via help@ on "how do I save an audio file locally?" It seems the most common audio listening experience is to download & play back (perhaps with a delay for the download to complete), so people are doing the same thing as streaming (i.e., immediate listening), but needing to wait for the download to complete. It would be nice to offer streaming, instead. -- Greg
I actually used one to demo for a DJ/Amtrak train conductor several months back. He wanted a way to pull the tags/artists out of his enormous mp3 collection, and in 15 minutes on the train (with 'net), I found one that would let him "radio-enable" his entire mp3 collection, including a web interface to stream, play, download, view, sort, browse all of the artists by collection, tag, album art, date, etc. all in Perl.
It should be a simple matter to have something similar latched onto the Gutenberg audio collection, so anyone can click on the audiobook to either download, stream, convert, etc. the book in whatever format they prefer.
Just an idea...

Tell me, please, if the gutenberg rtf index file, besides being autogenerated, is also sorted. What i mean is, i'm indexing parts of the file, and i gain a major speed up of treating the file as: a massive list of pgterms:etext definitions followed by a (more massive) list of pgterms:file definitions This allows the string comparisons i have to do to be lower (about n/2), but at the cost that any etext record in between the second list is not picked up. That won't be changed, but is only for my peace of mind. Also, in the pgterms:file records, are the records referring to the same file consecutive ? I ask because if so, i could do the same sort of filtering Aaron Cannon is doing in its dvd project, to speed up the index some more and remove duplicates. (If they aren't consecutive i would have to issue queries between building the index to see if they were already inserted). I have nothing against xpath, indeed i think the scanning of the file in lucene already uses something similar. But i need free text searches, and they have to be fast (i'm already experimenting with a memory cache after the query too, and it works okish for my application)

Of course if they were sorted by priory say, most feature full free format : html -> rtf -> text UTF-8 -> ASCII would be very nice too.

Just to clarify, you wrote: "Also, in the pgterms:file records, are the records referring to the same file consecutive ? I ask because if so, i could do the same sort of filtering Aaron Cannon is doing in its dvd project, to speed up the index some more and remove duplicates. (If they aren't consecutive i would have to issue queries between building the index to see if they were already inserted)." There are actually not really any duplicates that I am aware of in the RDF catalog. It's just that most books are in the archive in more than one encoding or format. What I have been filtering out are the more lossey encodeings (like ASCII) when there is a less lossey one available (like UTF-8). As for the sorting, I don't know for sure, but it seems like the current ordering is likely an artifact of the way the RDF was generated. Whether or not you want to rely on that never changing is up to you. I haven't followed the thread closely enough to know what you're trying to do, but it sounds as though you might be using the RDF in a way which it was never intended. What I mean by that is you seem to be trying to read directly from it like a database when someone does a search, rather than loading the RDF into an actual database, and reading from that. Having just recently worked on a Python app which parses the RDF into memory, I can tell you that parsing the XML is the slowest part of the process, at least in my application. Your mileage may vary, but when you have to do tens-of-thousands of string comparisons from a file which is roughly 100MB in size before you can return a result in a web app (I'm assuming it's a web app), you're likely going to have problems. Good luck. Aaron On 7/29/09, Paulo Levi <i30817@gmail.com> wrote:
Of course if they were sorted by priory say, most feature full free format : html -> rtf -> text UTF-8 -> ASCII would be very nice too. _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

But i am reading the rdf into a (file) database. That is more or less what Lucene is. What i am filtering is just what insert into the database, so that its creation is faster / searches only on the fields that interest me. Sure its a lot of code, that will break if the format changes, but it reduced the creation step from 5 minutes or so to 40 seconds (this on a fast dual-core computer - i shudder to think what would happen if a user tried to re-index in a 1000 hertz machine). The index is at about 33.5 mb, and should compress into < 10mb. Probably enough to be included into the application.

Hi Paulo, Am 30.07.2009 um 06:40 schrieb Paulo Levi:
But i am reading the rdf into a (file) database. That is more or less what Lucene is. What i am filtering is just what insert into the database, so that its creation is faster / searches only on the fields that interest me.
Sure its a lot of code, that will break if the format changes, but it If you program modularely that would be no problem
reduced the creation step from 5 minutes or so to 40 seconds (this on If you are filtering and just get a factor of 13 I said it your system that is slow. If I remember correctly you are just requestíng the certain information so somebody else is doing the work!
a fast dual-core computer - i shudder to think what would happen if a user tried to re-index in a 1000 hertz machine). Let's see. My Mac SE was a 1 Mega Hertz machine. That was twenty years ago. It would handle something like this in about ten minutes. I do not know what dbase system I was using.
The index is at about 33.5 mb, and should compress into < 10mb. Probably enough to be included into the application.
Hardcoding data of that size into the program is not feasible. Though most newer computers can load it into memory quite quickly. gives you a factor of 100 if everything is in memory that is why perl is so fast. regard Keith

reduced the creation step from 5 minutes or so to 40 seconds (this on
If you are filtering and just get a factor of 13 I said it your system that is slow. If I remember correctly you are just requestíng the certain information so somebody else is doing the work!
It's not a server application, so the client is (potentially) doing the indexing if he want to update the catalog. Its the indexing that takes 40 s.
The index is at about 33.5 mb, and should compress into < 10mb. Probably enough to be included into the application.
Hardcoding data of that size into the program is not feasible. Though most newer computers can load it into memory quite quickly. gives you a factor of 100 if everything is in memory that is why perl is so fast.
Including everything in memory would more than double my program heap, and don't forget that this is a java application, so it would never be released before it ending (or at least a sub process ending). Besides, as lucene uses files, i think i can't use the in memory index to search the rdf (using LuceneSail, that uses Sesame and Lucene)

Paulo Levi wrote:
Tell me, please, if the gutenberg rtf index file, besides being autogenerated, is also sorted.
What i mean is, i'm indexing parts of the file, and i gain a major speed up of treating the file as:
a massive list of pgterms:etext definitions
followed by a (more massive) list of pgterms:file definitions
This allows the string comparisons i have to do to be lower (about n/2), but at the cost that any etext record in between the second list is not picked up.
That won't be changed, but is only for my peace of mind. Also, in the pgterms:file records, are the records referring to the same file consecutive ? I ask because if so, i could do the same sort of filtering Aaron Cannon is doing in its dvd project, to speed up the index some more and remove duplicates. (If they aren't consecutive i would have to issue queries between building the index to see if they were already inserted).
I have nothing against xpath, indeed i think the scanning of the file in lucene already uses something similar. But i need free text searches, and they have to be fast (i'm already experimenting with a memory cache after the query too, and it works okish for my application)
All future changes to the file will be backward-compatible in an XML way. Meaning: the same XPath queries will yield (supersets of) the same result sets. I do NOT guarantee that the sequence of entities will be sorted the same way. Your sorting needs will probably be much different from the next guy's, so the sorting is up to you.
participants (5)
-
Aaron Cannon
-
Greg Newby
-
Keith J. Schultz
-
Marcello Perathoner
-
Paulo Levi