May 2006 - gutvol-d - lists.pglaf.org

re: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
by Bowerbird＠aol.com 01 Jun '06


yahoo has enough money to generate press any time they want. but you have to take that money out of your pocket to do it... so the answer is that they haven't wanted it enough. indeed, the only contribution to o.c.a. that i've seen acknowledged is a $5-million one from microsoft. which is pretty much peanuts, coming from a company as rich as microsoft, and we all know it... i get the impression everyone at o.c.a. -- except maybe brewster -- is trying to go the cheap route. might work that way, or might not, but o.c.a. certainly ain't gonna get much publicity without buyin' it. -bowerbird


re: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
by Bowerbird＠aol.com 01 Jun '06


sebastien said:
> Most of the time the original typesetting does not matter much.

different people can disagree on that.

> I believe you are missing the point.
> Michael doesn't care as much about collections of pictures
> as he does about digitalized text.

different people disagree with michael.

> As long as scans and/or OCR technologies are so disappointing,
> we'll have to rely on higher-level human brains with initiatives
> such as PGDP or ebooksgratuits.com or methodologies which are better.
> Of course having easy access to pictures is useful and
> much better than nothing and serves you well, but
> that's not what PG and ebooks are about.

different people can disagree on that too.

> ebooks are much more than photographs of regular analog books.

yes, but photographs of regular analog books _might_ qualify as e-books, for _some_ people. different people can disagree on that too.

> 3. is the top we are heading for. 2. is just a step on the way.

but #2 might serve the needs of person x just fine.

> I did that and got
> 20845628 bytes for 604 pages.

scans are resource hogs. nobody disagrees about that. one argument is that since these resources are now plentiful, it doesn't matter that scans are resource hogs. different people can disagree on that too.

as long as we can easily move scan-sets to digitized text, i don't see much purpose in continuing to debate these two as if they were competitors. they're not. they're complimentary.

-bowerbird


Books containing works from several sources
by Dave Fawthrop 01 Jun '06


One of ?my? Authors, John Hartley, wrote some 1000 short pieces of poetry and prose. Only about 1/2 were published in books, some of which are already on PG. The rest were published in the Clock Almanack, Penney Broadsheets, dated up to 1915, and the like. All are out of Copyright under both the US and EU (life+70) systems.

The Almanacks were sort of Desk Calendars, which along with the dialect works had *many* pages of adverts, "A Chronological Table of the Principal Events in Yorkshire History, Changes of the moon, Festivals ... Original puns and conundrums, etc", Rambling Remarks, etc. which IMO will no longer be of interest to the modern reader. All the above were always omitted from books which were compiled from works in the Clock Almanacks.

I would hope at some time to collect those short works into groups of about 50 for PG. This would involve me, as editor, in choosing/collecting/assembling the works from various sources. Working title would be "Lost Works of John Hartley", or "John Hartley's unrepublished Works from the Clock Almanacks ???? to ????". It would clearly be a good idea to state in which dated copy of Clock Almanack each work could be found. It should take two to five Clock Almanacks to make a PG book.

As PG is set up for whole books this would leave a problem with copyright clearance. I have copies of the Contents Page for each yearly Almanack which would clear OK, but any subsequent book edited by me would cause problems, because it would contain work from more than one source.

Any ideas?

--
Dave Fawthrop <dave hyphenologist co uk>
"Intelligent Design?" my knees say *not*.
"Intelligent Design?" my back says *not*.
More like "Incompetent design". Sig (C) Copyright Public Domain


re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
by Bowerbird＠aol.com 31 May '06


i said:
> as long as we can easily move scan-sets to digitized text,
> i don't see much purpose in continuing to debate these two
> as if they were competitors. they're not. they're complimentary.

obviously, i meant "complementary". and a good checker-program would've flagged that.

-bowerbird


who knew?
by Bowerbird＠aol.com 31 May '06


so, it seems the secret of e-books is putting ads in them. who knew? -bowerbird


a policy for scan-sets
by Bowerbird＠aol.com 31 May '06


i'm glad the topic of formulating a policy for scan-sets is being discussed again. even though donovan made it sound like it would be a while before scan-sets would be made available, i think it's wise to decide on a policy now.

in that vein, here's a response i made to marcello's proposal. (i probably made much the same response back when he first presented his proposal, but i didn't go back and check, i just wrote it up anew.)

anyway, like i said, i'm glad this is being discussed again, and i offer this -- as usual -- with a constructive spirit...

****

marcello said:
> PG allows the posting of page images along an existing ebook.
> The only currently accepted format is tiff files collected in a zip archive.

i didn't read that "should" as a _requirement_ for .tiff files, but it should be stated clearly what image-formats are "allowed". my opinion is that any widely-used one should be permitted.

> This new format will not replace the old format. Page images
> for any book can be posted in whichever of the 2 formats
> the poster chooses.

i believe strongly in "lockss" -- "lots of copies keeps stuff safe". i believe strongly p.g. should store its content in multiple formats. one form of storage for the scans should be _as_individual_files_, and that form should be thought of as the _primary_ method...

but yes, i'm all in favor of bundling them up in other formats too, such as .zip and .djvu, as well as .pdf and even as .quicktime movies, so that they could be more easily handled on machines like the psp...

> The format specified here will permit online viewing of the page images
> (with djvu plugin). It is not required to download the file and unpack it
> as with the old format, although it is still possible to do so.
> The new format also allows for linking from an html document
> to an arbitrary page image, so that a click on a link will
> open the right page image in the djvu browser plugin.

the ability to deep-link to a specific page-scan in a browser is good.
in fact, it's absolutely necessary. but it doesn't go far enough. browsers simply are _not_ the only way people will access this content. every obstacle that you put between the user and the content is bad, and wrapping the content up into a bundle _is_ an obstacle, because it means that that bundle has to be unwrapped.

look around, and you'll see that websites with straightforward a.p.i.'s are the ones that programmers are gravitating toward. understandably. p.g. needs to make it _easy_ for programmers to access this content, because that's what's going to drive independent development forward.

even if it's possible, down the line or even now, for programmers to include a library that will negotiate a djvu file, it's not a good thing to make them do. it just raises the level of difficulty, and bloats their applications. and _really_, there's no good reason for it!

don't make programmers jump through hoops to get to your content! remember those two lines of code that i posted that pull a file from the web? that's how _easy_ it can be, but only if your site is free of access-obstacles...

and when i'm talking about automatic processes to convert the scan-sets into digital text, it's absolutely crucial that access to each individual scan be unobstructed. to have it any other way is to demand unnecessary labor.

let me give an example. when you direct a browser to a google scan -- in some books, anyway, and maybe all, i just haven't verified that -- the initial download from google to your website is a _redirect_ page, which the browser automatically resolves by calling that other webpage.

what this "redirection" workflow means, though, is that any programs that are used to grab the scans have to know how to resolve redirection. this is an example of the unnecessary obstacle that should be removed.
we want to make it as easy as copying a 2-line routine to do a download; we don't want every programmer who wants to access the library to have to bloat their code with routines to resolve redirection.

> Numbering / Naming of page files

ok, here we go.

> A book usually contains 2 page number sequences, a roman one followed by
> an arabic one. We considered the cover pages as yet another sequence.

this is good.

> A filename for a single-page djvu file MUST follow this pattern:
> <prefix><page number>.djvu
> The prefix for the cover pages is: "c".
> The prefix for the roman pages is: "f".
> The prefix for the arabic pages is: "p".

so far, so good. it's not really necessary to _require_ these letters, as a "strong recommendation" should do the job, but whatever...

> If there are more page number sequences in the book, they MUST
> be handled in a similar fashion, using an arbitrary free letter.

ok.

> The <page number> is the true page number as seen on the physical page
> (or inferred from the previous / next pages) expressed in arabic numerals
> and left-padded with zeroes to a length of 4 digits.

i don't see any need to require any particular number of digits. i myself generally pad to 3, but there's no need to be dogmatic.

> For blank pages there should be no file
> and the page number should be skipped.
> Optionally an image saying:
> "This page is blank in the original."
> may be inserted.
> Missing pages MUST be replaced by
> an image saying: "This page is missing."

this is just wrong. you need to have a file for each page, even blank ones; otherwise, there's no way of knowing if you have inadvertently lost a file...

also, there's really no reason to require _what_ a blank-page image says, as long as it communicates the message adequately.

as an aside, sometimes it's a good thing to "put a blank page to good use", for instance, by copying a graphic from elsewhere in the book to that spot.
it is sufficient to tell the user that the page was blank, then put it to work...

> A filename for a single-page djvu file containing an illustration
> scanned in a different resolution or color depth MUST follow this pattern:
> <prefix><page number>-<image position on the page>.djvu

there's no need to scan illustrations on a page in addition to the page itself. some people might try to tell you that you need to do that for an .html version. tell 'em that you'll have software crop out the illustration when that time comes. (this practice would also screw up recto/verso order, which is discussed below.)

if a page has illustrations that require a different resolution or color depth, just scan the page that way.

> If present, front cover, back cover and spine MUST be named as follows:
> front cover outside: c0001.djvu
> front cover inside: c0002.djvu
> back cover inside: c0003.djvu
> back cover outside: c0004.djvu
> spine: c0005.djvu

um, no. this breaks the rule of "alphabetic filename sort = print/bind order".

> Example of file naming:
> front cover c0001.djvu
> back cover c0004.djvu
> spine c0005.djvu

if you have a front-cover scan saved as "c0001", you _must_ have its verso -- even if it was blank -- saved as "c0002". this must be an _ironclad_ rule, because it is absolutely essential to the reprinting of the scans as a p-book (which is one of the main things people will want to do with the scan-sets). every recto has to have a verso. it is a basic, fundamental fact about paper.

> i title page f0001.djvu
> ii title verso f0002.djvu
> iii dedication f0003.djvu
> iv is blank
> v contents f0005.djvu

you need an "f0004" file, and an "f0006" one too.

> page 1 p0001.djvu
> page 2 p0002.djvu
> image on page 2 p0002-1.djvu
> image on page 2 p0002-2.djvu

i've already said we don't need to scan illustrations separately. but an issue that does arise involves unnumbered plate pages.
in such a case, i use naming that goes like this:
> myantp157.png
> myantp158.png
> myantp158x6a.png
> myantp158x6b.png
> myantp159.png
> myantp160.png

the "x6" indicates that this is the 6th illustration in the book. i haven't convinced myself that that's a _necessary_ piece of info, but for the sake of a confidence-doublecheck, it seems useful...

oh yeah, notice the "myant" prefix, which serves to make the filenames for this book _unique_ across the library... once again, naming files "f0001.tif" is _asking_ for trouble. you want to know exactly what's in a file just by knowing its _name_ -- you _need_ to know that -- because the alternative (i.e., opening up the file to look at it to tell) is -- for want of a more accurate description -- idiotic. if people can't mix files from several books into one folder, because that results in filename crashes, you've hobbled them.

oh yeah, as i've said for the 493rd time now, i need to retain the recto/verso order, which is why i have a "myantp158x6b.png" file, even though in this particular case the verso was blank...

> page 9999 p9999.djvu

if you have a page 9999, you must have a page 10000. oops! count=494.

> All cover pages and the book spine should appear at the front.

i disagree. only front-cover pages should go at the front. the back-cover images should go at the end, naturally. if the spine was digitized, it should be the very last image. (and -- because every recto has to have a verso -- it must too.)

> The naming scheme was chosen so that saying:
> djvm -c 12345.djvu *djvu
> in a directory containing all single-page djvu files will
> assemble the multi-page djvu file in the correct sequence.

well, that's the right _idea_. (and kudos for picking it up from me from our previous thread.)
but it also shows you why you need to have a verso file for every recto file, or else your multi-page .djvu file will have some even-numbered pages on the right, and some odd-numbered pages on the left, which is a very big no-no in book typography...

***

greg said:
> At this point, let's try to get a few dozen done,
> in a few different ways, and take a look at
> accessibility/utility.

you don't need to "do experiments" to see that some of the suggestions that have been made have flaws in them. you just need to analyze them carefully.

***

jon said:
> it is best if both PG/DP align its practices as best as
> practicable with OCA-developed standards and conventions.

i do believe that it would be good if there was some consistency. but the o.c.a. is doing some things wrong too. to "synchronize" with them would be a mistake. to correct their policy is better.

i would also like to take the opportunity to learn from them, because i'm sure they are aware of some gotchas i don't know.

i've been meaning to volunteer for the appropriate committee, but i would _much_ rather they just held an open discussion, because in order to formulate a wise policy, you have to be knowledgeable about the _uses_ to which people will put the content, and the only way to find out the wide range of ideas out there in the world is to open up your ears and _listen_...

and when i look at the names of the o.c.a. heavy hitters -- microsoft, adobe, rlg, and so on -- it makes me _wonder_ if they are even capable of listening to the world at large...

i don't want to be a pessimist. but i do want to be a realist.

if anyone gets a sense they _are_ open to input, let me know, and i'll be happy to tell 'em exactly what they should do... ;+)

-bowerbird
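[editorial sketch, not part of the original message] The "a file for every page, a verso for every recto" rule argued for above can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the regular expression, the function name `missing_pages`, and the choices that every sequence starts at page 1 and that plate files (such as the "x6a" suffix) are skipped are all hypothetical, chosen only to illustrate the idea. It is not a tool from the thread or from PG.

```python
import re
from collections import defaultdict

# Assumed filename shape: optional lowercase book prefix ("myant"),
# one sequence letter (c = cover, f = roman pages, p = arabic pages),
# a zero-padded page number, and an optional plate suffix like "x6a".
NAME = re.compile(
    r"^(?P<book>[a-z]*?)(?P<seq>[cfp])(?P<num>\d+)(?P<plate>x\d+[a-z])?\.\w+$"
)

def missing_pages(filenames):
    """Return {sequence letter: sorted page numbers that have no file}."""
    present = defaultdict(set)
    for name in filenames:
        m = NAME.match(name)
        if m is None or m.group("plate"):
            continue  # unrecognised names and plate inserts are skipped here
        present[m.group("seq")].add(int(m.group("num")))

    report = {}
    for seq, pages in present.items():
        last = max(pages)
        last += last % 2  # round up to even so the final recto has a verso
        gaps = sorted(set(range(1, last + 1)) - pages)
        if gaps:
            report[seq] = gaps
    return report

# The front-matter example quoted from the proposal:
front = ["f0001.djvu", "f0002.djvu", "f0003.djvu", "f0005.djvu"]
print(missing_pages(front))
```

Run on the quoted front-matter example, the sketch reports pages 4 and 6 of the "f" sequence as missing, which are the two files the reply says are needed.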


re: [gutvol-d] Books containing works from several sources
by Bowerbird＠aol.com 31 May '06


dave said:
> Any ideas?

publish the book yourself, under your own imprint, and then slap a c.c. license on it and donate it to p.g. since you've taken responsibility for clearing the work, p.g. can simply point to your clearance as its own...

-bowerbird


re: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
by Bowerbird＠aol.com 31 May '06


michael-
> I've seen examples of this happening for a few pages, not for whole books.

you haven't seen anything even close to the array that i will throw at the problem.

> Depending on your proofreaders, this might have been largely true
> quite a number of years ago.

my "proofreaders" are the general public, who are just "regular" readers, so they're not very good. but they will have at their disposal a very simple system for reporting errors, and will be properly informed that we _adore_ error-reports, and that's all you really need to move e-texts to perfection... that and a whole bunch of eyeballs. and if a book can't attract eyeballs, then maybe it doesn't matter if it happens to have errors in it, eh?

but most of the text will be super-clean before it ever gets to these readers.

> I think the foundation of the rest of the history of eBooks will be laid down
> within 5 years, so I wouldn't wait if I were you, it might get to be too late.

nah. no matter what norms get laid down in the next 5 years, if they aren't good ones, then mine will eventually replace them.

> I would at least lay down an example set of a dozen or two this year

have already started. will continue to show developments.

> I've heard reports that many of Brewster's scans might have to be redone.

might be. but he hasn't scanned enough yet for this to be a problem. even google hasn't done more than 1% of what they will do eventually, so even if they had to re-do everything they've done so far, no big deal.

realistically, it's still too early to even _ask_ if the workflow is correct. you should do that about the 5% mark. and you can even profitably "start over" at the 10% mark -- which for google is 1 million books -- if you have identified and solved a significant flaw in your processes.
but until you've done 5% of the job (and hopefully a _random_ 5%, so your selection biases don't blind-spot any substantial problems), you haven't adequately confronted the difficulty-factor of the task, so you can't even make an informed decision about your workflow.

> Not to mention that it appears Google and Gallica both
> intentionally leave us only with reduced resolution scans
> that might not do OCR in a feasible manner.

that just means our post-o.c.r. program has to work a little harder. again, you don't need to worry about text digitization. it is solved.

> You don't think bots like "The Wayback Machine" can do this?

um, no. because google excludes bots from its cyberlibrary. you must respect robots.txt, or you're not being responsible. so there simply must be a human at the machine. it's more-or-less a technicality, yes, because we _are_ scraping, make no mistake about it, we say it out loud, but it is a technicality that we absolutely have to meet.

besides, we need someone to do the quality-control, and for _that_, you do have to have a sentient human. ditto with the re-upload to our own webspace.

> I'd like to think half an hour will be enough, when the time comes,

um, not really. not if you're going the download/qc/upload route.

you _could_ save a _lot_ of time if google just handed over their scans on a couple big hard-drives, or pointed you to an open-access server. somebody needs to open up such negotiations with google.

google's biggest flaw in this whole arena is its tight-lipped nature. i understand that that's typical of the company at large, but still, at some point -- which i place at december 14th of this year -- google will _have_to_ start talking with us electronic-book people, or risk losing our continuing support in regard to their lawsuits... if they want us to continue to believe they're doing this on our behalf, they must speak with us... they simply cannot continue to stonewall...
i truly believe they could win themselves all the friends they'll need if they simply release all of their public-domain scan-sets for free, and i also truly believe they're smart enough to figure that out, but at a minimum they will need to start sharing their progress with us.

again, somebody needs to open up the negotiations that would let us take a swift and convenient possession of their p-d scansets. if we can save one-half hour each on 100,000 books, it adds up. do that on a million books, or on 10 million, and it really adds up.

it'd also jack your count up considerably, and we could use that too, especially since this wouldn't be some meaningless inflation either, but a solid increase in their access to books by the general public...

> and eventually, after it has all been done once,
> it will be trivial to do it all over again, better.

actually, once it's been done once, it will be unnecessary to do it again.

-bowerbird
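[editorial sketch, not part of the original message] The figures in this message can be worked through explicitly. The short Python sketch below only restates numbers given in the text (the 10% mark being 1 million books, the 5% review point, and half an hour saved per book); the function names are hypothetical and chosen for illustration.

```python
def total_from_mark(books_at_mark, percent):
    """Project size implied by 'books_at_mark is percent% of the job'."""
    return books_at_mark * 100 // percent

def checkpoint(total_books, percent):
    """Number of books done when percent% of the job is complete."""
    return total_books * percent // 100

def hours_saved(books, minutes_saved_per_book=30):
    """Volunteer hours saved if each book takes half an hour less."""
    return books * minutes_saved_per_book / 60

total = total_from_mark(1_000_000, 10)  # the 10% mark is 1 million books
print(total)                            # implied size of the whole job
print(checkpoint(total, 5))             # the suggested 5% workflow review

for books in (100_000, 1_000_000, 10_000_000):
    print(books, hours_saved(books))
```

On these figures the whole job is 10 million books, the 5% review falls at 500,000 books, and a half-hour saving comes to 50,000 hours on 100,000 books, rising to 5 million hours on 10 million.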


!@!The Big Push, Well Not So Big. . . .
by Michael Hart 31 May '06


As most of you are aware, it is 5 weeks until we complete our 35th year of Project Gutenberg history, and we have about 462 eBooks left to make it to 20,000. This would be about 92 per week. . .we did 71 this week.

So it's not such a Big Push as we did to get to 10,000, but a rather smaller push, which is why you haven't heard me say an awful lot about it. . .things are working out a much closer match to reaching 20,000 on our 35th anniversary than anyone, myself included, would likely have predicted.

However, especially since I am planning on taking a week off right at July 4th (I am best man at my best friend's wedding), I am trying to get as much as possible done before I leave, as soon as I can after sending out the Newsletter a week before.

I am working on the July 5th Newsletter, and will have it out in a fairly complete manner half a day after the previous one goes out, and am hoping that some of our volunteers will have the wherewithal to update it and send it out July 5th with an entirely up to date revision, that hopefully will hit 20,000.

More later, I'm just trying to make it one day at a time right now. . . .

Thanks!!!

Michael


re: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
by Bowerbird＠aol.com 31 May '06


michael said:
> Actually, from what Brewster told us, and I think you were there,
> he really does want to take the cheap route, and would prefer to
> do mostly only raw scans if he could get away with it

the world will eventually explain to him that he can't get away with it. if you can't search their text, you lose the vast majority of the benefits associated with having electronic-books.

however, as i have said many times here, if you do the scans correctly, and do the o.c.r. correctly, and run a good post-o.c.r. correction app, you will end up with text that is _highly_accurate_. so accurate that it's much more than good enough for search purposes, even good enough to move to the public for "continuous proofreading".

and again, i will "make this happen" _by_myself_ -- if it's necessary -- within 5 years, so this is no sticking point, none at all, really, i promise.

and because brewster's stuff will be open-access, and guaranteed so, there's no need to even worry about that content. we'll get it straight.

it is the stuff from _google_ that we have to be concerned about, because there's no assurance so far, let alone a guarantee, that we will be able to access that content easily, not unless we scrape it...

i've said this before, too -- we need a coordinated scraping campaign. every public-domain book we scrape is liberated -- forever and ever. o.c.r., clean-up, and format conversions can be done _automatically_.

michael, you asked me backchannel if scansets would need volunteers. for scraping, yes. but nothing else. just scraping.

i'd estimate it takes 15 minutes of human work to scrape a book. (the actual downloading takes longer, but it can be unattended.)

i'd then say it takes 30 minutes, on average, to do quality-control. (most books will take 10 minutes, but any books with problems take a disproportionate amount of time, so 30 minutes average.) quality-control involves renaming files according to solid policy.
and then i'd estimate it takes 15 minutes to upload those scans to their permanent home, where they'll be viewable immediately and ready for treatment by automated systems when those come online.

so this one hour of work is all it takes to liberate a book from google.

-bowerbird
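[editorial sketch, not part of the original message] The per-book time budget above adds up as stated: 15 minutes to scrape, 30 to quality-control, 15 to upload. A minimal Python sketch of that sum follows. The step figures come from the message; the names `STEP_MINUTES` and `volunteer_hours` and the batch size in the usage line are assumptions for illustration.

```python
# Human minutes per book, as estimated in the message above.
STEP_MINUTES = {
    "scrape": 15,           # attended work only; downloading runs unattended
    "quality_control": 30,  # average; most books are said to take about 10
    "upload": 15,
}

def minutes_per_book(steps=STEP_MINUTES):
    """Total attended minutes needed to process one book."""
    return sum(steps.values())

def volunteer_hours(books, steps=STEP_MINUTES):
    """Attended volunteer hours needed for a batch of books."""
    return books * minutes_per_book(steps) / 60

print(minutes_per_book())    # one book
print(volunteer_hours(500))  # a hypothetical batch of 500 books
```

The three steps total 60 minutes, matching the "one hour of work" figure, so volunteer hours scale one-to-one with the number of books under these estimates.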
