
On Wed, May 31, 2006 at 10:03:26PM +0200, Karl Eichwalder wrote:
OCRing is important, but OCR without the scans nearby is often not enough.
Most of the time the original typesetting does not matter much. The page breaks can matter (table of contents, index, references to other books) but apart from that...
I think Gallica is one of the best e-book collections. Their PDFs are very useful (you can download complete books as PDFs pretty easily and they are readable)! This way I can access the Bulletin Monumental.
I believe you are missing the point. Michael doesn't care as much about collections of pictures as he does about digitized text. As long as scans and/or OCR technologies remain so disappointing, we'll have to rely on higher-level human brains, through initiatives such as PGDP or ebooksgratuits.com.

Of course having easy access to pictures is useful, much better than nothing, and serves you well, but that's not what PG and ebooks are about. Ebooks are much more than photographs of regular analog books.

0. Before the printing press: books were few and expensive.
1. Before computers: books were heavy and costly, took up space, and one had to leave home to get them (go to a shop, a library, ...).
2. With things like Gallica: books are light, free, ubiquitous (Internet, computers, CD).
3. With ebooks and PG: books are searchable and can be tailored to everybody's needs (more books on a given medium, PDA OK, ...).

OCR is somewhere between 2. and 3., depending on the quality of the scan and of the software. 3. is the goal we are heading for; 2. is just a step on the way. 2. was made possible by scanners, 2.5 by OCR software, and 3., at a larger scale, by PGDP.
www.gallica.fr -> Recherche -> "Mots du titre": enter the title, for example "Bulletin Monumental".
In the "Résultat de la recherche", click on "Bulletin Monumental".
Select the volume you are interested in, for example "1861 (Sér. 2)".
Now "Télécharger" and "ok" if you are interested in the complete book.
I did that and got 20845628 bytes for 604 pages: 34k/page on average. The text is just about 33 lines of about 55 characters per page, that is to say less than 2K/page. A PDF that contains real text (for example as produced by pdflatex) is much smaller. This PDF uses on average about 20 bytes for every byte of text. I am not impressed! Every little dot and character variation is accounted for. We don't care! The file is slow to load in xpdf. It has huge margins. pdftotext yields nothing. It's just a collection of pictures.

Now take
http://www.eleves.ens.fr/home/blondeel/PGDP/ebooksgratuits/barbey_d_aurevill...
(a test I did on ebooksgratuits files, to import them to PG).
The TXT is 575086 bytes long. The PDF is 794285 bytes long, and this includes 34860 bytes of images.
(794285 - 34860) / 575086 = 1.3 bytes of PDF per byte of text.
This includes a table of contents with internal links, and find-text functionality. Plus, if you don't like the typesetting, you can redo it to suit you better by changing a couple of lines in the LaTeX preamble. 2 columns? Landscape? Different margins? Different font? HTML version (CSS-it the way you want)...?

Gallica is mainly a collection of pictures. It is useful as OCR-fodder for PGDP or the like, but by no means is it a collection of "e-books". You cannot search or probe text in it, have it read out by a machine, typeset it differently, extract quotes, index it in a search engine...

Can you tell us in which issues of the Bulletin Monumental such and such a word appears, with context? No, you can't. Neither could the original readers (unless somebody built an index), but that is precisely the point: we want more than what these people had at the time.

Gallica has just over 1000 books in full text (search for nothing and just tick "Ouvrages en mode texte"), and the quality and format are far from PG standards: compare for example «L'île des pingouins» on both sides.
http://gallica.bnf.fr/document?O=N088395
http://www.gutenberg.org/dirs/etext05/8ilep10.txt

Regards,
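
P.S. In case anyone wants to re-check the size figures above, here is a minimal sketch in Python. The byte counts and the page layout are the ones quoted in this message; the function and variable names are only illustrative, and the amount of text per page is my rough estimate (33 lines of about 55 characters), not a measurement.

```python
def bytes_per_text_byte(file_bytes, text_bytes, image_bytes=0):
    """Return how many file bytes are spent per byte of actual text.

    image_bytes is subtracted first, so that illustrations embedded
    in the file are not counted against the text.
    """
    return (file_bytes - image_bytes) / text_bytes


# Gallica scan of the Bulletin Monumental volume (figures from this message).
GALLICA_PDF_BYTES = 20845628
GALLICA_PAGES = 604
LINES_PER_PAGE = 33   # rough estimate
CHARS_PER_LINE = 55   # rough estimate

gallica_bytes_per_page = GALLICA_PDF_BYTES / GALLICA_PAGES
estimated_text_per_page = LINES_PER_PAGE * CHARS_PER_LINE
gallica_ratio = bytes_per_text_byte(
    GALLICA_PDF_BYTES, GALLICA_PAGES * estimated_text_per_page
)

# ebooksgratuits test file typeset with LaTeX (figures from this message).
EBOOK_TXT_BYTES = 575086
EBOOK_PDF_BYTES = 794285
EBOOK_IMAGE_BYTES = 34860

ebook_ratio = bytes_per_text_byte(
    EBOOK_PDF_BYTES, EBOOK_TXT_BYTES, EBOOK_IMAGE_BYTES
)

print(f"Gallica: {gallica_bytes_per_page:.0f} bytes/page, "
      f"about {estimated_text_per_page} bytes of text/page, "
      f"ratio {gallica_ratio:.1f}")
print(f"ebooksgratuits: ratio {ebook_ratio:.2f}")
```

With the estimate above, the scan comes out at about 19 bytes per byte of text, which is the "about 20" I quoted, against about 1.3 for the typeset PDF.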