Re: !@!Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries (fwd)

From one of my French friends much more familiar with Gallica.
On Thu, Jun 01, 2006 at 09:16:03AM -0700, Michael Hart wrote:
Each .pdf file seemed to just hold a .gif file. . .or is there something else going on there that was missed?
Gallica just has _pictures_. It is hardly more than a scanning bank. They can serve them as either PDF or TIFF files. When they have done the work to produce a TXT file (which they did for about 1% of their books), they offer that too (e.g. _L'île des pingouins_, available both on Gallica as text and on PG).
And this is supposed to prepare the book as a single .pdf file?
Yes.
Searchable?
No, just a bunch of pictures. The document is just like a document with a picture of a different painting, or a photograph of a different landscape, on each page.

It's like HTML: HTML can contain text (-> searchable) or display a sequence of pictures (-> not searchable, even if they are pictures of pages with text). PDF is more confusing because the layout depends less on the viewer than with HTML (one can define custom colors/sizes/margins in HTML with CSS and the like; not so with PDF).

Try the experiment with the ZIP file I pointed to. It contains a PDF file that is small, light, and searchable. The PDF file produced by Gallica (take the example given by the other person) is heavy and not searchable. I did the tests with xpdf, but it should be the same with Acrobat Reader.

I'm not sure there is a definitive way to know whether a PDF file is text or a picture. Here are some hints:
. pdftotext will only work with a text PDF
. searching too
. if the letters look dirty, with noise, or the lines are not quite horizontal, it is a picture.
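The pdftotext hint can be turned into a rough automatic check. What follows is only a sketch under a few assumptions: pdftotext (from xpdf or poppler) is installed and on the PATH, the function names has_text_layer and classify_pdf are made up for illustration, and the 20-character threshold is an arbitrary guess rather than a tested value. A scanned PDF that carries a hidden OCR text layer would be reported as "text", so treat this as a heuristic.

```python
import shutil
import subprocess


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Return True if pdftotext output holds a meaningful amount of text.

    min_chars is a hypothetical threshold: picture-only PDFs typically
    yield nothing but whitespace and form feeds (page separators).
    """
    visible = [c for c in extracted if not c.isspace()]
    return len(visible) >= min_chars


def classify_pdf(path: str) -> str:
    """Run pdftotext on `path` and return 'text' or 'image-only'."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        check=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return "text" if has_text_layer(text) else "image-only"
```

Calling classify_pdf("book.pdf") (a placeholder file name) on the small PDF from the ZIP file should give "text", while a picture-only PDF such as the ones Gallica serves should give "image-only".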

Michael Hart <hart@pglaf.org> writes:
Gallica just has _pictures_.
Gallica _has_ pictures and that's very nice.
Searchable?
No, just a bunch of pictures.
Searching isn't the only thing that matters. Think about children's books, where pictures are very important. The same is valid for books about architecture, etc. As I said earlier, we need both sides of the coin: the pictures and the text, or the text and the pictures (= scans). Not necessarily within the same file (PDF, Djvu, or .tar.bz2), but catalogued or archived in such a way that it is possible to download the wanted files easily.
-- 
http://www.gnu.franken.de/ke/ | ,__o | _-\_<, | (*)/'(*) Key fingerprint = F138 B28F B7ED E0AC 1AB4 AA7F C90A 35C3 E9D0 5D1C
participants (2): Karl Eichwalder, Michael Hart