Re: [gutvol-d] roundtripping formatted text through a .pdf

16 Jun 2005

      On 6/16/05, David A. Desrosiers <hacker@gnu-designs.com> wrote:
...
...
It is possible to make the text rather obscure, but that doesn't
mean that if formatted correctly you could not scan through the file
in a text editor and read it.  Granted, it's rarely done, but
doesn't mean it's impossible.
I just ran strings(1) across about 40 of the PDFs I have here
from various clients, online resources and PDFs I've created in
Windows and with OpenOffice.org, and not a single one contained any
readable strings that are actually in the _content_ of the documents
themselves, other than the strings which comprise URLs embedded in the
document itself.
So where is the text of the document stored? If it's somewhere
in here, why is it obfuscated by default, in every single PDF I have?
In text blocks within the documents which can be encoded and are
referenced from the part of the document that sets up the layout.
...
The document content itself is most-definitely NOT stored as
"plain text" in the pdf documents I have here, which is a pretty broad
sample set.
People are not arguing the average case.  Like I said, it's rare for
it not to be obfuscated.  But guess what, improbable != impossible. 
You said it was impossible, that the information was stored purely as
pixels and vectors.  It's not.  There is a whole subculture that is
quite used to the idea of there being embedded text from when direct
tinkering with postscript/tex processing was more common.  You might
need a tool more complex than strings to grab the textual information
out if obfuscated (since it can really be an encoding within an
encoding).
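
To make the "encoding within an encoding" point concrete, here is a rough
Python sketch, not a real PDF parser. It assumes the page content sits in a
stream compressed with the Flate (zlib) filter, which is one common case;
other filters, encryption and object streams are ignored. The names
inflate_streams and make_sample are made up for illustration, and the sample
object is built in memory rather than taken from any real file.

```python
import re
import zlib

# Bytes between the "stream" and "endstream" keywords (naive: no /Length lookup).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed body of every stream that zlib can inflate."""
    bodies = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            bodies.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (some other filter, or not compressed): skip it.
            continue
    return bodies


def make_sample():
    """Build a hypothetical one-object PDF fragment with a compressed stream."""
    # The line is repeated only so that the sample compresses well.
    content = b"BT /F1 12 Tf 72 720 Td (Call me Ishmael.) Tj ET\n" * 3
    packed = zlib.compress(content)
    header = b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\n" % len(packed)
    return header + b"stream\n" + packed + b"\nendstream\nendobj\n"


if __name__ == "__main__":
    sample = make_sample()
    print(b"Ishmael" in sample)          # what strings(1) would be looking at
    for body in inflate_streams(sample):
        print(body.decode("latin-1"))
```

The text never shows up in the raw bytes, so strings finds nothing, yet one
decompression step brings it straight back.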

I'm at a loss as to what your example run proved.  Just that it's rare.  And
Marcello was kind enough to provide an example where it was not
obfuscated.  See those ()?  The simple definition (it's been a while
since I read the Reference, so this isn't 100%) is that the characters
between them are a literal string, not in another encoding, so there is
no need to convert them when generating the page.
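
For what it's worth, pulling those () strings out of an uncompressed content
stream takes very little code. The sketch below is my own illustration, not
anything from Marcello's example: literal_strings is a made-up name, it only
handles parenthesised literal strings (balanced nested parentheses, backslash
escapes, octal codes), and it naively treats every "(" as the start of a
string, which a parser written from the Reference would not do.

```python
def literal_strings(stream):
    """Yield each (...) literal string in a content stream, as bytes."""
    escapes = {ord("n"): 10, ord("r"): 13, ord("t"): 9, ord("b"): 8, ord("f"): 12}
    i, n = 0, len(stream)
    while i < n:
        if stream[i] != ord("("):
            i += 1
            continue
        i += 1                      # step past the opening parenthesis
        depth, out = 1, bytearray()
        while i < n:
            c = stream[i]
            if c == ord("\\") and i + 1 < n:
                nxt = stream[i + 1]
                if nxt in b"01234567":
                    # Octal escape: one to three digits give a byte value.
                    j = i + 1
                    while j < n and j < i + 4 and stream[j] in b"01234567":
                        j += 1
                    out.append(int(stream[i + 1:j], 8) & 0xFF)
                    i = j
                    continue
                if nxt in escapes:
                    out.append(escapes[nxt])
                elif nxt in b"\r\n":
                    # Backslash at end of line is a line continuation.
                    if nxt == 13 and i + 2 < n and stream[i + 2] == 10:
                        i += 1
                else:
                    out.append(nxt)  # \( \) \\ and anything else: keep the char
                i += 2
                continue
            if c == ord("("):
                depth += 1           # unescaped nested parenthesis
            elif c == ord(")"):
                depth -= 1
                if depth == 0:
                    i += 1
                    break
            out.append(c)
            i += 1
        yield bytes(out)


if __name__ == "__main__":
    page = rb"BT /F1 12 Tf 72 720 Td (Call me \(Ishmael\).) Tj ET"
    for text in literal_strings(page):
        print(text.decode("latin-1"))
```

Text in hex strings, or in fonts with custom encodings, would need more work
than this.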

It's pretty well known that the great number of automatic pdf
generators can create some very unreadable code.  I knew someone who
was bitterly disappointed at the amount of cruft and difficulty it
brings to working with them.  But ideally they still follow the rules
in the Reference (it's annoying to find, but it is available through
the adobe site).  If it's not in that syntax, it's no more a pdf than
if an "almost-XML" document had elements with no closing tags.

If I had time, I'd write one by hand for ya that had none of the
encoding mess.

I'd agree most pdf documents would be a pain to handle by hand, but
you wouldn't have to apply OCR like techniques to most.  Just write a
parser based off specs.  I'm confused at the point of all of this. 
You seemed to be implying that bowerbird couldn't be doing what he
claimed because: "Acrobat doesn't store text in PDFs, they store
pixels and vectors and OCR'd coordinates."  Multiple people have pointed out
that this is wrong, that there is text within pdfs.  They've shown
examples.  Remember, probably most of the obfuscation code is there
for more nefarious reasons, but some of the ideas come from valid
problems with multiple character encoding sets.  (We're talking about
techniques established well before Unicode)

I'm not arguing that the format is good or bad, that we should abandon
ASCII files here at gutenberg or anything along those lines.  Just that
your statement was misleading.  A pdf is not like a jpeg.  In fact, as
far as vector-based systems go, I'm not familiar with any that doesn't
store text in the file as text, rather than as pure vector outlines of
the characters, for efficiency reasons.

Jon Gorman