
On 6/16/05, David A. Desrosiers <hacker@gnu-designs.com> wrote:
It is possible to make the text rather obscure, but that doesn't mean that, if formatted correctly, you could not scan through the file in a text editor and read it. Granted, it's rarely done, but that doesn't mean it's impossible.
I just ran strings(1) across about 40 of the PDFs I have here from various clients, online resources and PDFs I've created in Windows and with OpenOffice.org, and not a single one contained any readable strings that are actually in the _content_ of the documents themselves, other than the strings which comprise URLs embedded in the document itself.
So where is the text of the document stored? If it's somewhere in here, why is it obfuscated by default, in every single PDF I have?
In text blocks within the documents which can be encoded and are referenced from the part of the document that sets up the layout.
The document content itself is most definitely NOT stored as "plain text" in the PDF documents I have here, which is a pretty broad sample set.
People are not arguing the average case. Like I said, it's rare for it not to be obfuscated. But guess what, improbable != impossible. You said it was impossible, that the information was stored purely as pixels and vectors. It's not. There is a whole subculture that is quite used to the idea of there being embedded text, from when direct tinkering with PostScript/TeX processing was more common. You might need a tool more complex than strings to grab the textual information out if it is obfuscated (since it can really be an encoding within an encoding). I'm at a loss as to what your example run proved, other than that it's rare.
And Marcello was kind enough to provide an example where it was not obfuscated. See those ()? A simple definition of them (it's been a while since I read the Reference, so this isn't 100%) is that the characters are not in another encoding, so there is no need to convert them when generating the page.
It's pretty well known that a great number of automatic PDF generators can create some very unreadable code. I knew someone who was bitterly disappointed at the amount of cruft and the difficulty it brings to working with them. But ideally they still follow the rules in the Reference (it's annoying to find, but it is available through the Adobe site). If it's not in that syntax, it's no more a PDF than an "almost-XML" document with unclosed elements is XML. If I had time, I'd write one by hand for ya that had none of the encoding mess. I'd agree most PDF documents would be a pain to handle by hand, but you wouldn't have to apply OCR-like techniques to most of them. Just write a parser based on the spec.
I'm confused about the point of all of this. You seemed to be implying that bowerbird couldn't be doing what he claimed because: "Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates." Multiple people have pointed out that this is wrong, that there is text within PDFs. They've shown examples.
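To make the "encoding within an encoding" point concrete, here is a rough Python sketch. It is not a real PDF parser, and everything in it (the function names, the hand-built demo object, the sample sentence) is made up for illustration. It assumes the simplest case: a content stream that is either uncompressed or deflate-compressed, with the text held in literal () strings. Real files can also use other filters, hex strings, and font-specific encodings, none of which this handles.

```python
import re
import zlib

# Body of each "stream ... endstream" section in the raw file bytes.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\nendstream", re.DOTALL)
# A literal string in parentheses; backslash escapes allowed, nesting ignored.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)")


def extract_literal_strings(pdf_bytes):
    """Return the contents of (...) strings found in every stream."""
    texts = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Assumption: a compressed stream is plain zlib/deflate data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not deflate data, so treat it as uncompressed
        for literal in LITERAL_RE.finditer(data):
            texts.append(literal.group(1).decode("latin-1"))
    return texts


def make_demo_object(content, compress):
    """Build one hypothetical PDF-style stream object by hand (demo only)."""
    body = zlib.compress(content) if compress else content
    filter_entry = b" /Filter /FlateDecode" if compress else b""
    header = b"<< /Length %d%s >>" % (len(body), filter_entry)
    return b"1 0 obj\n" + header + b"\nstream\n" + body + b"\nendstream\nendobj\n"


# A tiny content stream: select a font, move to a position, show a string.
content = b"BT /F1 12 Tf 72 720 Td (Hello, Gutenberg) Tj ET"
plain = make_demo_object(content, compress=False)
packed = make_demo_object(content, compress=True)

print(b"Hello, Gutenberg" in plain)     # visible to strings(1)
print(b"Hello, Gutenberg" in packed)    # hidden from strings(1)
print(extract_literal_strings(plain))
print(extract_literal_strings(packed))
```

In the compressed object the sentence no longer appears in the raw bytes, which is the situation a strings run reports, yet one decompression step brings the same () string back. No OCR is involved at any point.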
Remember, most of the obfuscation code is probably there for more nefarious reasons, but some of the ideas come from valid problems with multiple character encoding sets. (We're talking about techniques established well before Unicode.) I'm not arguing that the format is good or bad, that we should abandon ASCII files here at Gutenberg, or anything along those lines. Just that your statement was misleading. A PDF is not like a JPEG. In fact, as far as vector-based systems go, I'm not familiar with any that, for efficiency reasons, doesn't store text in the file rather than a pure vector representation of the characters.
Jon Gorman