re: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries

yahoo has enough money to generate press any time they want. but you have to take that money out of your pocket to do it... so the answer is that they haven't wanted it enough. indeed, the only contribution to o.c.a. that i've seen acknowledged is a $5-million one from microsoft. which is pretty much peanuts, coming from a company as rich as microsoft, and we all know it... i get the impression everyone at o.c.a. -- except maybe brewster -- is trying to go the cheap route. might work that way, or might not, but o.c.a. certainly ain't gonna get much publicity without buyin' it. -bowerbird

On Tue, 30 May 2006 Bowerbird@aol.com wrote:
yahoo has enough money to generate press any time they want.
but you have to take that money out of your pocket to do it...
so the answer is that they haven't wanted it enough.
indeed, the only contribution to o.c.a. that i've seen acknowledged is a $5-million one from microsoft. which is pretty much peanuts, coming from a company as rich as microsoft, and we all know it...
i get the impression everyone at o.c.a. -- except maybe brewster -- is trying to go the cheap route. might work that way, or might not, but o.c.a. certainly ain't gonna get much publicity without buyin' it.
-bowerbird
Actually, from what Brewster told us, and I think you were there, he really does want to take the cheap route, and would prefer to do mostly raw scans if he could get away with it, much as the Gallica collection appears to do.
I had some of my Parisians take a look at it, and they told me that Gallica got many complaints for being so hard to use and for not providing much of an actual full-text eLibrary, not even one like Google's, with text separated out for searching and different files for viewing.
No one seems to think Gallica is really an eBook collection; raw scans seem to be most of what is available, and even those are a set of low-res versions that are not really suitable for OCRing.
I must admit that I am relying on my friends here, as my Français is not really good enough to know whether I missed something that would have provided better results on their site.
Michael

Michael Hart <hart@pglaf.org> writes:
No one seems to think Gallica is really an eBook collection; raw scans seem to be most of what is available, and even those are a set of low-res versions that are not really suitable for OCRing.
OCRing is important, but OCR without the scans nearby is often not enough. I think Gallica is one of the best e-book collections. Their PDFs are very useful (you can download complete books as PDFs pretty easily, and they are readable)! This way I can access the Bulletin Monumental.
I must admit that I am relying on my friends here, as my Français is not really good enough to know whether I missed something that would have provided better results on their site.
Sure, you have to know how to create and download PDFs:
www.gallica.fr -> Recherche -> "Mots du titre" - enter the title, for example "Bulletin Monumental"
In the "Résultat de la recherche": click on "Bulletin Monumental"
Select the volume you are interested in, for example "1861 (Sér. 2)"
Now "Télécharger" and "ok" if you are interested in the complete book
Then wait, PDF preparation takes time.
Click Vous pouvez le télécharger "en cliquant ici." or use the supplied FTP address.
I hope this helps.
--
http://www.gnu.franken.de/ke/                            |      ,__o
                                                         |    _-\_<,
                                                         |   (*)/'(*)
Key fingerprint = F138 B28F B7ED E0AC 1AB4 AA7F C90A 35C3 E9D0 5D1C

On Wed, May 31, 2006 at 10:03:26PM +0200, Karl Eichwalder wrote:
OCRing is important, but OCR without the scans nearby is often not enough.
Most of the time the original typesetting does not matter much. The page breaks can matter (table of contents, index, references to other books) but apart from that...
I think Gallica is one of the best e-book collections. Their PDFs are very useful (you can download complete books as PDFs pretty easily, and they are readable)! This way I can access the Bulletin Monumental.
I believe you are missing the point. Michael doesn't care as much about collections of pictures as he does about digitized text. As long as scans and/or OCR technologies are so disappointing, we'll have to rely on higher-level human brains, with initiatives such as PGDP or ebooksgratuits.com.
Of course having easy access to pictures is useful, much better than nothing, and serves you well, but that's not what PG and ebooks are about. Ebooks are much more than photographs of regular analog books.
0. Before the printing press: books were few and expensive
1. Before computers: books were heavy, costly, took up space, and one had to leave home to get them (go to a shop/library/...)
2. With things like Gallica: books are light, free, ubiquitous (Internet, computers, CD)
3. With ebooks and PG: books are searchable and can be tailored to everybody's needs (more books on a given medium, PDA OK, ...)
OCR is somewhere between 2. and 3., depending on the quality of the scan/software.
Stage 3 is the goal we are heading for; stage 2 is just a step on the way.
Stage 2 was made possible by scanners, stage 2.5 by OCR software, and stage 3, at a larger scale, by PGDP.
www.gallica.fr -> Recherche -> "Mots du titre" - enter the title, for example "Bulletin Monumental"
In the "Résultat de la recherche": click on "Bulletin Monumental"
Select the volume you are interested in, for example "1861 (Sér. 2)"
Now "Télécharger" and "ok" if you are interested in the complete book
I did that and got 20845628 bytes for 604 pages, 34k/page on average. The text is just about 33 lines of about 55 characters, that is to say less than 2K/page. A PDF is much smaller when it holds text (for example as produced by pdflatex). That PDF uses on average 20 bytes for every byte of text. I am not impressed! Every little dot and character variation is accounted for. We don't care! The file is slow to load in xpdf. It has huge margins. pdftotext yields nothing. It's just a collection of pictures.
Now take http://www.eleves.ens.fr/home/blondeel/PGDP/ebooksgratuits/barbey_d_aurevill... (a test I did on ebooksgratuits files, to import them to PG).
The TXT is 575086 bytes long. The PDF is 794285 bytes long, and this includes 34860 bytes of images. (794285 - 34860) / 575086 = 1.3 byte/byte. This includes a table of contents with internal links and find-text functionality. Plus, if you don't like the typesetting, you can redo it to suit you better by changing a couple of lines in the LaTeX preamble. 2 columns? Landscape? Different margins? Different font? HTML version (CSS it the way you want)...?
Gallica is mainly a collection of pictures. It is useful as OCR fodder for PGDP or the like, but by no means is it a collection of "e-books". You cannot search or probe text in it, have it read out by a machine, typeset it differently, extract quotes, index it in a search engine...
Can you tell us in which issues of the Bulletin Monumental such and such a word appears, with context? No, you can't. Neither could the original readers (unless somebody built an index), but that is precisely the point: we want more than what these people had at the time.
Gallica has just over 1000 books in full text (search for nothing and just tick "Ouvrages en mode texte"), and the quality and format are far from PG standards: compare for example «L'île des pingouins» on both sides.
http://gallica.bnf.fr/document?O=N088395
http://www.gutenberg.org/dirs/etext05/8ilep10.txt
Regards,
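The size comparison above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic: the byte counts, page count, and lines/characters per page are the figures quoted in the message, while the function and variable names are made up for illustration.

```python
def bytes_per_text_byte(file_bytes, text_bytes, image_bytes=0):
    """Bytes of file spent per byte of plain text, ignoring embedded images."""
    return (file_bytes - image_bytes) / text_bytes


# Gallica PDF of the Bulletin Monumental volume (figures from the message).
GALLICA_PDF_BYTES = 20845628
GALLICA_PAGES = 604
LINES_PER_PAGE = 33   # "about 33 lines", an estimate
CHARS_PER_LINE = 55   # "about 55 characters", an estimate

gallica_bytes_per_page = GALLICA_PDF_BYTES / GALLICA_PAGES
text_bytes_per_page = LINES_PER_PAGE * CHARS_PER_LINE  # one byte per character
gallica_ratio = bytes_per_text_byte(
    GALLICA_PDF_BYTES, text_bytes_per_page * GALLICA_PAGES
)

# ebooksgratuits test file (figures from the message).
EBOOK_TXT_BYTES = 575086
EBOOK_PDF_BYTES = 794285
EBOOK_IMAGE_BYTES = 34860

ebook_ratio = bytes_per_text_byte(
    EBOOK_PDF_BYTES, EBOOK_TXT_BYTES, image_bytes=EBOOK_IMAGE_BYTES
)

print(f"Gallica: {gallica_bytes_per_page:.0f} bytes/page, "
      f"{text_bytes_per_page} text bytes/page, ratio {gallica_ratio:.1f}")
print(f"ebooksgratuits: ratio {ebook_ratio:.2f}")
```

With the estimates taken literally, the Gallica ratio comes out at about 19, which the message rounds to 20; the text-based PDF stays at about 1.3.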

Sebastien Blondeel <blondeel@clipper.ens.fr> writes:
3. With ebooks and PG: books are searchable and can be tailored to everybody's needs (more books on a given medium, PDA OK, ...)
Often the editor makes arbitrary decisions: turning long-s into s (or, together with the following s, into ß), limiting the character set to ISO-8859-1, fixing spelling issues, applying English formatting rules to LOTE texts, or using strange formatting that hurts the reader. All this is often useful, but often not enough. With today's technology, it is possible to provide the images nearby, and in this area PG fails miserably; worse, quite a few books are simply missing or offered twice (complete book, separated into chapters).
All this is okay, but it is not nice to deny the usefulness of other collections and to blame them for being "that bad", esp. if these collections are free.
BTW, loading monster HTML files also takes time (often more time than loading a well-done PDF file), and reading the ASCII file often is no fun. Sure, these ASCII files are also useful for special purposes, but telling us again and again that this is the best solution for all books and all times is highly arguable.

On Wed, 31 May 2006, Karl Eichwalder wrote:
Michael Hart <hart@pglaf.org> writes:
No one seems to think Gallica is really an eBook collection; raw scans seem to be most of what is available, and even those are a set of low-res versions that are not really suitable for OCRing.
OCRing is important, but OCR without the scans nearby is often not enough. I think Gallica is one of the best e-book collections. Their PDFs are very useful (you can download complete books as PDFs pretty easily, and they are readable)! This way I can access the Bulletin Monumental.
I must admit that I am relying on my friends here, as my Français is not really good enough to know whether I missed something that would have provided better results on their site.
Sure, you have to know how to create and download PDFs:
Each .pdf file seemed to just hold a .gif file. . .or is there something else going on there that was missed?
www.gallica.fr -> Recherche -> "Mots du titre" - enter the title, for example "Bulletin Monumental"
In the "Résultat de la recherche": click on "Bulletin Monumental"
Select the volume you are interested in, for example "1861 (Sér. 2)"
Now "Télécharger" and "ok" if you are interested in the complete book
Then wait, PDF preparation takes time.
Click Vous pouvez le télécharger "en cliquant ici." or use the supplied FTP address.
And this is supposed to prepare the book as a single .pdf file? Searchable?
Thanks!!!
Give the world eBooks in 2006!!!
Michael S. Hart
Founder
Project Gutenberg
Blog at http://hart.pglaf.org

The Gallica PDFs are mostly very low resolution. Where there are diagrams, they hardly come out at all, especially mathematical ones with small letters on them. It may be helpful to have a copy of the book nearby. OCRing PDFs is not for the faint-hearted, as they are not designed for this purpose. However, they are good for the layout of the original publications and for copyright use, as the date of publication is usually given. They also show the title page, which is often omitted from other PDF files.
I believe some Gallica books are available in text format if you push the "text" button.
nwolcott2@post.harvard.edu

On Thu, 1 Jun 2006, Norm Wolcott wrote:
I believe some gallica are available in text format if you push the "text" button.
From what my French friends tell me, this is only around 1% of them, and sometimes the full-text versions disappear after a while.
mh

"Norm" == Norm Wolcott <nwolcott2ster@gmail.com> writes:
Norm> The Gallica PDFs are mostly very low resolution. Where
Norm> there are diagrams, they hardly come out at all, especially
Norm> mathematical ones with small letters on them. It may be
Norm> helpful to have a copy of the book nearby. OCRing PDFs is
Norm> not for the faint-hearted, as they are not designed for this
Norm> purpose. However, they are good for the layout of the original
Norm> publications and for copyright use, as the date of
Norm> publication is usually given. They also show the title page,
Norm> which is often omitted from other PDF files.
But why do you download PDFs from Gallica? For OCR you should download TIFF, which is perfectly suited and does not pose conversion problems. The Gallica PDF is just a wrapper for the TIFF files (compare a Gallica PDF with a Gallica TIFF: the TIFF is contained in full in the PDF, with some extra wrapper for every page).
For example, if you feed FineReader a PDF, it passes through ghostscript, essentially "printing" the PDF and converting the resulting bitmap; if you choose the wrong dpi while converting, you lose resolution. A TIFF file, by contrast, it uses directly (TIFF is the internal image format in FineReader).
The Gallica PDF is OK if you want to read (but a multipage TIFF viewer is even better). But not for OCR. You cannot blame Gallica if you do not tick the correct box when you download.
Carlo Traverso
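The point about losing resolution when a PDF page is rasterized at the wrong dpi can be illustrated with a small calculation. This is a sketch only: the page size and the 300 dpi scan resolution are assumptions chosen for the example, not figures from Gallica, and it does not describe how FineReader or ghostscript work internally.

```python
def raster_size(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (inches) rendered at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical values, for illustration only.
PAGE_WIDTH_IN = 6.0
PAGE_HEIGHT_IN = 9.0
SCAN_DPI = 300     # assumed resolution of the embedded page image
RENDER_DPI = 150   # a "wrong" dpi picked when converting the PDF

native = raster_size(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, SCAN_DPI)
rendered = raster_size(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, RENDER_DPI)

# Fraction of the original pixels that survive the conversion.
kept = (rendered[0] * rendered[1]) / (native[0] * native[1])

print(f"embedded image: {native[0]} x {native[1]} pixels")
print(f"rendered at {RENDER_DPI} dpi: {rendered[0]} x {rendered[1]} pixels")
print(f"fraction of pixels kept: {kept:.2f}")
```

Halving the dpi keeps only a quarter of the pixels, which is why working from the TIFF directly avoids the problem.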
participants (6)
- Bowerbird@aol.com
- Carlo Traverso
- Karl Eichwalder
- Michael Hart
- Norm Wolcott
- Sebastien Blondeel