
You said it was impossible, that the information was stored purely as pixels and vectors. It's not.
I'll let it drop... except this one point: I never said it was "impossible" for a pdf to contain text in any of my messages (and further, I've never even used the word "impossible" in any message I've ever posted to this list, ever). Every single pdf I have here is exactly that: 7-bit ASCII text, and nothing more, but the text in the pdfs is definitely not the text that comprises the content of the document itself. I have heard of binary pdfs, but I don't have one here and couldn't find one out there. My collection includes pdfs which are heavily encrypted with the latest-and-greatest Adobe 7.whatever product, and they're still 100% ASCII text, but none of that text (except urls) is document "content".
You might need a tool more complex than strings to grab the textual information out if it's obfuscated (since it can really be an encoding within an encoding).
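For illustration only, here is a rough Python sketch of what "more complex than strings" could mean in practice. It assumes the common case where a page's content stream is compressed with the FlateDecode (zlib) filter and the text is shown with plain `(...) Tj` operators. The function names are made up for this example, and it ignores encryption, stacked filters such as ASCII85 over Flate, `TJ` arrays, hex strings, escaped parentheses and custom font encodings, so treat it as a starting point rather than a working extractor.

```python
import re
import zlib

def strings(data: bytes, min_len: int = 4) -> list[bytes]:
    """Rough stand-in for strings(1): runs of printable 7-bit ASCII."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

# Raw bytes between the "stream" and "endstream" keywords (assumed layout).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(pdf: bytes) -> list[bytes]:
    """Try to zlib-inflate every stream; skip the ones that are not Flate."""
    out = []
    for m in STREAM_RE.finditer(pdf):
        try:
            # decompressobj tolerates the trailing end-of-line before "endstream"
            out.append(zlib.decompressobj().decompress(m.group(1)))
        except zlib.error:
            pass  # some other filter, or another layer wrapped around it
    return out

# Simplest text-showing operator only: a literal string followed by Tj.
TEXT_RE = re.compile(rb"\((.*?)\)\s*Tj")

def shown_text(content: bytes) -> list[bytes]:
    """Literal strings passed to Tj in an already-inflated content stream."""
    return [m.group(1) for m in TEXT_RE.finditer(content)]

def extract(pdf: bytes) -> list[bytes]:
    """Inflate what can be inflated, then collect the Tj strings."""
    found = []
    for content in inflate_streams(pdf):
        found.extend(shown_text(content))
    return found
```

Running `strings()` over the raw file only sees the outer layer, while `extract()` looks one layer down. Whether what comes out is readable still depends on how the fonts in that particular file are encoded.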
I've got many here, and I've even seen quite a few commercial (proprietary, no source available) products hijacking pdftohtml's source for their pdf rendering. I think I may have found yet another one last night that converts PDFs for display on a Palm handheld device (a commercial "Office Documents on Palm" product). Of course the output is absolutely horrible, as is the output from most PDFs, but that's another matter.
You seemed to be implying that bowerbird couldn't be doing what he claimed because: "Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates."
Actually, no tools that can decompose PDF back to readable text produce anything worth using. In 100% of the cases I've found, which includes Open Source and commercial tools, you have to go back in and reformat the entire output by hand anyway. I've tried automating the rewrap, paragraph layout and many other aspects, and it's just not worth it. It's easier to load it up in xpdf or acroread and cut and paste from the GUI into another file and format from that baseline.
But back to the Bowerbird case... he contends that his Z.M.L. tool written in gwbasic (or whatever it's using these days) can do everything including make coffee, walk the dog, and oh yeah, convert pdfs to a pleasant-to-read format. If this is true, this would be the first tool out of literally dozens that I've tried to accomplish this feat successfully. But I'm not going to go install DOS and gwbasic to find out.
David A. Desrosiers
desrod@gnu-designs.com
http://gnu-designs.com