Re: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries

31 May 2006

      On Wed, May 31, 2006 at 10:03:26PM +0200, Karl Eichwalder wrote:
...
OCRing is important, but OCR without the scans nearby is often not
enough.
Most of the time the original typesetting does not matter much.
The page breaks can matter (table of contents, index, references to
other books) but apart from that...
...
I think gallica is one of the best e-book collections.  Their
PDF are very useful (you can download complete books as PDFs pretty
easily and they are readable)!  This way I can access the Bulletin
Monumental.
I believe you are missing the point.

Michael doesn't care as much about collections of pictures as he does
about digitized text.

As long as scans and/or OCR technologies are so disappointing, we'll
have to rely on higher-level human brains, through initiatives such as
PGDP or ebooksgratuits.com.

Of course having easy access to pictures is useful and much better than
nothing and serves you well, but that's not what PG and ebooks are
about. ebooks are much more than photographs of regular analog books.

0. Before printing press:
books were few and expensive

1. Before computers:
books were heavy, costly, took space, one had to leave home to get
them (go to shop/library/...)

2. With things like Gallica:
books are light, free, ubiquitous (Internet, computers, CD)

3. With ebooks and PG:
books are searchable and can be tailored to everybody's needs
(more books on a given medium, PDA OK, ...)

OCR is somewhere between 2. and 3., depending on quality of
scan/software.

3. is the goal we are heading for. 2. is just a step on the way.
2. was made possible by scanners.
2.5 was made possible by OCR software.
3. was made possible at a larger scale by PGDP.
...
www.gallica.fr ->
Recherche      ->
"Mots du titre" - enter the title, for example "Bulletin Monumental"
In the "Résultat de la recherche": click on "Bulletin Monumental"
Select the volume you are interested in, for example "1861 (Sér. 2)"
Now "Télécharger" and "ok" if you are interested in the complete book
I did that and got
20845628 bytes for 604 pages. 34k/page on average.
The text is just about 33 lines of about 55 characters, that is to say
less than 2K/page.

A PDF is much smaller when it stores actual text (for example as
produced by pdflatex).

That PDF uses on avg. 20 bytes for every byte of text.
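
As a rough check of the arithmetic above, here is a small Python sketch.
The function names are mine, and the 33 lines of 55 characters per page
is the estimate given above, not a measurement of the actual book.

```python
def bytes_per_page(file_bytes, pages):
    """Average size of one page of the file, in bytes."""
    return file_bytes / pages


def overhead_ratio(file_bytes, pages, lines_per_page, chars_per_line):
    """File bytes spent per byte of plain text (one byte per character assumed)."""
    text_bytes = pages * lines_per_page * chars_per_line
    return file_bytes / text_bytes


# Figures quoted above for the Gallica PDF of the 1861 volume.
gallica_pdf_bytes = 20845628
gallica_pages = 604

per_page = bytes_per_page(gallica_pdf_bytes, gallica_pages)
ratio = overhead_ratio(gallica_pdf_bytes, gallica_pages, 33, 55)

print(f"{per_page:.0f} bytes/page, {ratio:.1f} PDF bytes per byte of text")
```

With these inputs the ratio comes out at roughly 19, which is where the
"about 20 bytes for every byte of text" comes from.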
I am not impressed! Every little dot and character variation is
accounted for. We don't care!

The file is slow to load in xpdf. It has huge margins.
pdftotext yields nothing. It's just a collection of pictures.

Now take
http://www.eleves.ens.fr/home/blondeel/PGDP/ebooksgratuits/barbey_d_aurevill...
(a test I did on ebooksgratuits files, to import them to PG)

The TXT is 575086 bytes long. The PDF is 794285 bytes long, and this
includes 34860 bytes of images.

(794285 - 34860) / 575086 = 1.3 byte/byte.
This includes a table of contents with internal links and find-text
functionality. Plus, if you don't like the typesetting, you can redo it to
suit you better by changing a couple of lines in the LaTeX preamble. 2
columns? Landscape? Different margins? Different font? HTML version
(CSS-it the way you want)...? 

Gallica is mainly a collection of pictures.
It is useful as OCR fodder for PGDP or the like, but by no means
is it a collection of "e-books". You cannot search or probe text in it,
have it read out by a machine, typeset it differently, extract quotes, index
it in a search engine...

Can you tell us in which issues of the Bulletin Monumental a given
word appears, with context? No, you can't. Neither could the original
readers (unless somebody built an index), but that is precisely the
point: we want more than what these people had at the time.
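
To make "which issues, with context" concrete, here is a minimal
keyword-in-context sketch in Python. It assumes each volume is already
available as plain text, which is exactly what a picture-only PDF does
not give you. The function name, the volume labels and the sample
sentences are all made up for illustration.

```python
import re


def keyword_in_context(text, word, width=30):
    """Return (offset, snippet) for each whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        # Collapse line breaks and repeated spaces inside the snippet.
        snippet = " ".join(text[start:end].split())
        hits.append((match.start(), snippet))
    return hits


# Hypothetical sample data: volume label -> plain text of that volume.
volumes = {
    "1861": "The cloister of the abbey was restored. The abbey church still stands.",
    "1862": "Notes on a Roman milestone found near the river.",
}

for label, text in volumes.items():
    for offset, snippet in keyword_in_context(text, "abbey"):
        print(f"{label} @ {offset}: ...{snippet}...")
```

A real index over hundreds of volumes would want something better than
a linear scan, but even this is impossible on scanned pictures.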

Gallica has just over 1000 books in full text (search for nothing and just
tick "Ouvrages en mode texte"), and the quality and format are far from
PG standards: compare for example «L'île des pingouins» on both sides.

http://gallica.bnf.fr/document?O=N088395
http://www.gutenberg.org/dirs/etext05/8ilep10.txt

Regards,

Sebastien Blondeel