I fixed about a dozen bugs and made about the same number of improvements in the request processor, and I have added it to the CD project page. If anyone wants to take another look, you can visit http://www.gutenberg.org/cdproject/ and click on either the international or United States form. It's still OK to test; just make sure you say so in the comments field.

Most of the improvements were in data validation. However, I also changed the program so that, in addition to e-mailing the request to cd@pglaf.org, it now writes the data to a flat file. This is a good thing, because it will eventually allow us to make things more distributed. What I envision is a system where volunteers can log on to a web page at Gutenberg.org and check out requests, somewhat like the Distributed Proofreaders site. It shouldn't be too hard to implement.

I've also been working on getting access to the PG database, so that we can start compiling a new CD and DVD image. If anyone has experience with parsing RDF/XML in Perl, I could use your help.

To give you all an idea of the current size of the archive, here's a breakdown of file types and their total size in bytes:

 fk_filetypes | fk_compressions | files |        bytes
--------------+-----------------+-------+--------------
 ?            | none            |     1 |         6148
 ?            | zip             |    56 |   1158610659
 avi          | zip             |     1 |      9671667
 css          | none            |    73 |       129309
 doc          | none            |     5 |     16335360
 doc          | zip             |     5 |      3530238
 dvi          | gz              |     1 |       145672
 eps          | none            |     5 |       667758
 eps          | zip             |     1 |        50481
 gif          | none            |  3038 |     61544329
 gif          | zip             |     3 |       820419
 html         | none            |  3826 |   1289404590
 html         | zip             |  3578 |   3726062915
 index        | none            |   339 |       882064
 iso          | none            |   277 |   4852684800
 iso          | zip             |     1 |    388439680
 jpg          | none            | 26280 |   1947495197
 jpg          | zip             |     3 |      1928065
 license      | none            |    23 |       253207
 lit          | none            |    57 |      5728949
 lit          | zip             |    55 |      4477759
 ly           | none            |     9 |        59063
 ly           | zip             |     1 |         2423
 md5          | none            |     4 |        15020
 mid          | none            |    46 |      3195924
 mid          | zip             |     7 |       574353
 mp3          | none            | 12751 |  93537630338
 mp3          | zip             |    54 |   1717503076
 mpg          | none            |     4 |     16441408
 mpg          | zip             |     7 |     30644113
 mus          | none            |     9 |      1853407
 mus          | zip             |     8 |      4154954
 nfo          | none            |     1 |      4222976
 nfo          | zip             |     1 |      3063405
 pageimages   | zip             |     1 |     12875805
 pdf          | none            |    93 |    117202451
 pdf          | zip             |    37 |     47006708
 png          | none            | 10897 |    779262971
 prc          | none            |    54 |      7266524
 prc          | zip             |    55 |      8775500
 ps           | none            |     5 |     17086684
 ps           | zip             |     1 |      4210628
 qt           | none            |     1 |      1399639
 qt           | zip             |     1 |      7758161
 readme       | none            |   573 |      9154766
 readme       | zip             |     3 |      6718762
 rtf          | none            |    41 |     47205486
 rtf          | zip             |    53 |     28563481
 sib          | none            |    38 |      1799503
 sib          | zip             |     5 |       904124
 svg          | none            |     2 |        24120
 tex          | none            |    24 |     10556350
 tex          | zip             |    29 |      6468205
 tiff         | none            |    34 |      9129666
 tr           | zip             |     1 |      2591514
 txt          | none            | 17253 |  14194087555
 txt          | zip             | 17200 |   4711928510
 wav          | none            |     2 |     29144452
 xml          | none            |    47 |    111411597
 xml          | zip             |    42 |     25677549
 xsl          | none            |    14 |       345476
              | none            |   115 |     25679187
              | rar             |    23 |    338860727
              | zip             |    49 |    583288192
(64 rows)

As you can see, we're going to have to leave a lot off of any DVD we create. The big question is what gets kept and what gets excluded. Even if we only include zipped HTML and zipped text, we're still over budget. Of course, there's always the possibility that we could create two images, a volume 1 and a volume 2, or something similar.

Thoughts?

Sincerely
Aaron Cannon

--
E-mail: cannona@fireantproductions.com
Skype: cannona
MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.)
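(To make the "over budget" remark concrete, here is a rough back-of-the-envelope check. This is just an illustrative Python sketch: the byte counts come straight from the table above, while the 4.7 GB single-layer DVD capacity is an assumption about the target media.)

    # Rough check: zipped text plus zipped HTML vs. one single-layer DVD.
    # Byte counts are taken from the table above; the DVD capacity of
    # 4,700,000,000 bytes (4.7 GB) is an assumed single-layer disc.
    zipped_txt = 4711928510     # txt  | zip
    zipped_html = 3726062915    # html | zip
    dvd_capacity = 4700000000

    total = zipped_txt + zipped_html
    print("zipped txt + html: %.2f GB" % (total / 1e9))                   # about 8.44 GB
    print("over budget by:    %.2f GB" % ((total - dvd_capacity) / 1e9))  # about 3.74 GB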
My first thought, Aaron, would be to group the zipped text and HTML into "fiction" and "everything else" and see if that provides a reasonable division. The PG database has some subject information, but there have been discussions of stripping it all out unless someone comes forward to keep it up-to-date. Because subject data would make our job that much easier (especially as PG continues to add texts), maybe we need to consider becoming involved in that effort as well.
Hi John,

I actually have had similar thoughts. The biggest hurdle to creating the new images is the catalog. Anything we can do to improve it is going to make compiling new ISOs that much easier. I must say, though, that it has come a long way since last year. Whoever has been maintaining it has been doing an incredible job!

I'm not sure how familiar you are with the situation, but do you happen to know how this information was gathered in the first place? Did we just look it up on the LoC web site, or is it more involved than that? I'm CCing Marcello for his possible input as well.

Thanks.

Sincerely
Aaron Cannon

At 06:09 AM 10/16/2004, John Hagerson wrote:
My first thought, Aaron, would be to group the zipped text and HTML into "fiction" and "everything else" and see if that provides a reasonable division.
The PG database has some subject information but there have been discussions of stripping it all out unless someone comes forward to keep it up-to-date. Because subject data would make our job that much easier (especially as PG continues to add texts), maybe we need to consider becoming involved in that effort as well.
--
E-mail: cannona@fireantproductions.com
Skype: cannona
MSN Messenger: cannona@hotmail.com (Do not send E-mail to the hotmail address.)
Aaron Cannon wrote:
I actually have had similar thoughts. The biggest hurdle to creating the new images is the catalog. Anything we can do to improve it is going to make compiling new ISOs that much easier. I must say, though, that it has come a long way since last year. Whoever has been maintaining it has been doing an incredible job!
I'm not sure how familiar you are with the situation, but do you happen to know how this information was gathered in the first place? Did we just look it up on the LoC web site, or is it more involved than that?
I think Alev used the LoC site to get subject and LoC class information. This was all done manually and at a certain point she wasn't able to keep up with the increasing flood of new books. Subject and LoC class information is available until around ebook 8000 and only very sparsely afterwards.

--
Marcello Perathoner
webmaster@gutenberg.org
I once wrote a program that used PHP and a programming interface known as PHP/YAZ to query the Library of Congress database for book information automatically. This was for a used-textbook sale operated by my university, but a similar approach would go a long way toward helping to automate this process.

- Scott

Marcello Perathoner wrote:
Aaron Cannon wrote:
I actually have had similar thoughts. The biggest hurdle to creating the new images is the catalog. Anything we can do to improve it is going to make compiling new ISOs that much easier. I must say, though, that it has come a long way since last year. Whoever has been maintaining it has been doing an incredible job!
I'm not sure how familiar you are with the situation, but do you happen to know how this information was gathered in the first place? Did we just look it up on the LoC web site, or is it more involved than that?
I think Alev used the LoC site to get subject and LoC class information. This was all done manually and at a certain point she wasn't able to keep up with the increasing flood of new books. Subject and LoC class information is available until around ebook 8000 and only very sparsely afterwards.
Scott Schmucker wrote:
I once wrote a program that used PHP and a programming interface known as PHP/YAZ to query the Library of Congress database for book information automatically. This was for a used-textbook sale operated by my university, but a similar approach would go a long way toward helping to automate this process.
Granted, but how do you identify the book? We have no ISBN or similar key into the LoC database. The LoC database contains a record for every copy of a book they hold, which could be many hundreds for a popular work. And copies acquired in, say, 1950 are often cataloged very differently from copies acquired in 2000. If we just had some sort of unique key into the LoC database, `borrowing' the data would be child's play.

--
Marcello Perathoner
webmaster@gutenberg.org
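(For what it's worth, a title/author lookup of the kind Scott describes might look roughly like the sketch below. It is a Python illustration using the PyZ3950 ZOOM bindings rather than PHP/YAZ, and the host, port, database name, and query syntax are all assumptions about the LoC Z39.50 gateway rather than verified values. It also sidesteps Marcello's real objection: a title/author search typically returns many candidate records, and picking the right one still needs a human or a smarter matching step.)

    # Hypothetical title/author lookup against an assumed LoC Z39.50 server.
    # Server details and the CCL query syntax are assumptions for illustration;
    # disambiguating the (often many) returned records is not handled here.
    from PyZ3950 import zoom

    conn = zoom.Connection('z3950.loc.gov', 7090)  # assumed LoC gateway
    conn.databaseName = 'VOYAGER'                  # assumed database name
    conn.preferredRecordSyntax = 'USMARC'

    query = zoom.Query('CCL', 'ti="Pride and Prejudice" and au="Austen"')
    results = conn.search(query)
    print(len(results), "candidate records")       # usually more than one

    # Dump the first few records for manual inspection.
    for i in range(min(5, len(results))):
        print(results[i])

    conn.close()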
participants (4)

- Aaron Cannon
- John Hagerson
- Marcello Perathoner
- Scott Schmucker