roundtripping formatted text through a .pdf

Bowerbird＠aol.com

15 Jun 2005

7:44 a.m.

recently i've worked on "roundtripping" styled z.m.l. text through a .pdf. my viewer-program can write z.m.l. text to a .pdf such that copying the text out of the .pdf gives a user the same text that went in. make a few global changes -- which restores the whitespace acrobat usually strips from text -- and you can load the text back into my z.m.l. viewer-program and generate the same .pdf once again... the proof is in the pudding, and the structure is in the presentation. that is _not_ something you can do with text that's copied out of a .pdf created with other programs i know. generally, the .pdf format is known as the "roach motel" of file-formats -- content goes in and can't get out... :+) -bowerbird

David A. Desrosiers

16 Jun

1:04 p.m.

...

make a few global changes -- which restores the whitespace acrobat usually strips from text -- and you can load the text back into my z.m.l. viewer-program and generate the same .pdf once again...

Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates. Most definitely not text.

David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

Jon Gorman

3:02 p.m.

On 6/16/05, David A. Desrosiers <hacker@gnu-designs.com> wrote:

...

Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates. Most definitely not text.

Ummm... so that's why Chapter 5 of the reference is all about text?

Seriously though, it is possible to put text into pdfs. That's why you can copy and paste out of them. Granted, there are a lot of places that just scan in material and post that, but it is not the only thing that you can do with pdfs. PDF is derived from postscript after all.

Unless I missed something in the conversation, in which case I'm sorry. Or you're being sarcastic and I just misread ;). Just didn't want anyone to be misled.

Jon Gorman

David A. Desrosiers

3:19 p.m.

...

Seriously though, it is possible to put text into pdfs. That's why you can copy and paste out of them. Granted, there are a lot of places that just scan in material and post that, but it is not the only thing that you can do with pdfs. PDF is derived from postscript after all.

Just because you can put down a cursor and go from one x,y to another x,y does not mean you are "selecting" what is visible on the screen, as your human eyes see it.

PDF is pure layout, no structure. Tables are positioned text and lines, columns are positioned text... it's basically OCR, without any character detection.

David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

Jon Gorman

3:52 p.m.

...

Just because you can put down a cursor and go from one x,y to another x,y does not mean you are "selecting" what is visible on the screen, as your human eyes see it.

Whoever said they were human eyes ;). Seriously though, while there are always encoding issues and the like, given a reasonable application/clipboard type the region you select should be what is visible, so I'm not sure what you're suggesting. Are you just making the point that when I select the text it's converting an essentially drawn image into encoded text? But of course anything displayed on the monitor or printed out could be argued to be just pixels and/or vectors.

...

PDF is pure layout, no structure. Tables are positioned text and lines, columns are positioned text... it's basically OCR, without any character detection.

Sorry, I guess I'm just confused. You said there was no text in pdfs (implying to me just images). Chapter 5 of the Reference has a lot of info on how to include text. I'm not sure what OCR (Optical Character Recognition) means if it doesn't do any character detection...

I didn't see any mention of structure anywhere in the email you sent, just that it was impossible to have text. Which is odd, since there are regions of text in a pdf document with instructions on how to draw that text. They can be encoded or just inserted when creating the document. Granted, the encoded streams are a bit of a pain, but they're arguably just as much text as any other.

I'm not trying to make a mountain out of a molehill here. Just didn't want some people to get the impression that pdfs were solely graphic-orientated (like, say, jpeg). Perhaps we have different ideas of text.

Seriously, no offense to anyone. Just wanted to clarify things. I'm skeptical about bowerbird's claims as well, but it's misleading to say that Acrobat doesn't store text in the document. It is possible to make the text rather obscure, but that doesn't mean that if formatted correctly you could not scan through the file in a text editor and read it. Granted, it's rarely done, but doesn't mean it's impossible.

Jon

...

David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

David A. Desrosiers

5:43 p.m.

...

It is possible to make the text rather obscure, but that doesn't mean that if formatted correctly you could not scan through the file in a text editor and read it. Granted, it's rarely done, but doesn't mean it's impossible.

I just ran strings(1) across about 40 of the PDFs I have here from various clients, online resources and PDFs I've created in Windows and with OpenOffice.org, and not a single one contained any readable strings that are actually in the _content_ of the documents themselves, other than the strings which comprise URLs embedded in the document itself.

So where is the text of the document stored? If it's somewhere in here, why is it obfuscated by default, in every single PDF I have?

The document content itself is most-definitely NOT stored as "plain text" in the pdf documents I have here, which is a pretty broad sample set.

David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

Marcello Perathoner

6:07 p.m.

David A. Desrosiers wrote:

...

I just ran strings(1) across about 40 of the PDFs I have here from various clients, online resources and PDFs I've created in Windows and with OpenOffice.org, and not a single one contained any readable strings that are actually in the _content_ of the documents themselves, other than the strings which comprise URLs embedded in the document itself.

So where is the text of the document stored? If it's somewhere in here, why is it obfuscated by default, in every single PDF I have?

The document content itself is most-definitely NOT stored as "plain text" in the pdf documents I have here, which is a pretty broad sample set.

A pdf is a chunked file format and each chunk can be compressed or even encrypted. A run-of-the-mill pdf is always at least compressed.

If you create your own pdf with pdftex you can set the compression level to 0 and lo! the text magically appears inside the pdf.

-- Marcello Perathoner webmaster@gutenberg.org
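The point about compression can be illustrated with a minimal Python sketch. It assumes the streams use the Flate filter (zlib) and are not encrypted; the function name `inflate_streams` and the in-memory `fake_pdf` are hypothetical, made up for illustration, and a real pdf would call for a proper parser rather than a regular expression.

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords.
# Simplification: a real parser would honour the /Length entry instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return every stream in pdf_bytes that zlib is able to inflate.

    Streams that are not Flate-compressed (images, encrypted data,
    other filters) raise zlib.error and are skipped.
    """
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue
    return inflated


# Toy demonstration: a fake single-stream "pdf" built in memory.
content = b"BT /F1 12 Tf 72 720 Td (Down the Rabbit-Hole) Tj ET"
fake_pdf = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)

recovered = inflate_streams(fake_pdf)
print(recovered[0].decode("ascii"))
```

A tool like strings(1) only looks at the raw bytes, so it never sees what is inside a compressed stream; inflating the stream first is what makes the text operators visible.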

D Garcia

10:42 p.m.

On Thursday 16 June 2005 02:07 pm, Marcello Perathoner wrote:

...

A pdf is a chunked file format and each chunk can be compressed or even encrypted. A run-of-the-mill pdf is always at least compressed.

If you create your own pdf with pdftex you can set the compression level to 0 and lo! the text magically appears inside the pdf.

And if you're truly insane (and/or interested) in the format, you can obtain the specs and learn how to write a PDF by hand in a standard text editor. (Which, yes, I have done, including writing vector graphics.)

If you understand the technique, you can even write simple scripts in (your interpreted language of choice) to output simple PDF files directly, which is great for doing things like CGI report generation without library dependencies and the like.

IIRC, the most commonly used compression in PDF is Flate, which is relatively trivial and fast/good enough for the majority of cases.
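As a rough illustration of the "simple script, no library dependencies" idea, here is a minimal Python sketch that assembles a one-page pdf with an uncompressed content stream. The function name `build_pdf`, the page size, the font choice and the text position are all assumptions made for this example, and it handles ASCII text only.

```python
def build_pdf(text: str) -> bytes:
    """Assemble a one-page PDF whose content stream is left uncompressed."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Assumption: ASCII-only text, drawn in 24pt Helvetica near the top left.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("ascii")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset recorded for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for object 0, head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


pdf = build_pdf("Down the Rabbit-Hole")
print(len(pdf), "bytes")
```

Writing the returned bytes to a file in binary mode should give something a viewer can open, and because no compression filter is applied, the text sits in the file in plain sight.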

Jon Gorman

6:09 p.m.

On 6/16/05, David A. Desrosiers <hacker@gnu-designs.com> wrote:

...

...
It is possible to make the text rather obscure, but that doesn't mean that if formatted correctly you could not scan through the file in a text editor and read it. Granted, it's rarely done, but doesn't mean it's impossible.

I just ran strings(1) across about 40 of the PDFs I have here from various clients, online resources and PDFs I've created in Windows and with OpenOffice.org, and not a single one contained any readable strings that are actually in the _content_ of the documents themselves, other than the strings which comprise URLs embedded in the document itself.

So where is the text of the document stored? If it's somewhere in here, why is it obfuscated by default, in every single PDF I have?

In text blocks within the documents which can be encoded and are referenced from the part of the document that sets up the layout.

...

The document content itself is most-definitely NOT stored as "plain text" in the pdf documents I have here, which is a pretty broad sample set.

People are not arguing the average case. Like I said, it's rare for it not to be obfuscated. But guess what, improbable != impossible. You said it was impossible, that the information was stored purely as pixels and vectors. It's not.

There is a whole subculture that is quite used to the idea of there being embedded text from when direct tinkering with postscript/tex processing was more common. You might need a tool more complex than strings to grab the textual information out if obfuscated (since it can really be an encoding within an encoding).

I'm at a loss as to what your example run proved. Just that it's rare. And Marcello was kind enough to provide an example where it was not obfuscated. See those ()? A simple definition of them (it's been a while since I read the Reference, so this isn't 100%) is that the characters are not in another encoding, so there is no need to convert them when generating the page.

It's pretty well known that the great number of automatic pdf generators can create some very unreadable code. I knew someone who was bitterly disappointed at the amount of cruft and difficulty it brings to working with them. But ideally they still follow the rules in the Reference (it's annoying to find, but it is available through the adobe site). If it's not in that syntax, it's no more a pdf than if an "almost-XML" document had elements with no closing tags.

If I had time, I'd write one by hand for ya that had none of the encoding mess. I'd agree most pdf documents would be a pain to handle by hand, but you wouldn't have to apply OCR-like techniques to most. Just write a parser based off the specs.

I'm confused at the point of all of this. You seemed to be implying that bowerbird couldn't be doing what he claimed because: "Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates." Multiple people have pointed out that this is wrong, that there is text within pdfs. They've shown examples.

Remember, probably most of the obfuscation code is there for more nefarious reasons, but some of the ideas come from valid problems with multiple character encoding sets. (We're talking about techniques established well before Unicode.)

I'm not arguing that the format is good or bad, that we should abandon ASCII files here at gutenberg or anything along those lines. Just that your statement was misleading. A pdf is not like a jpeg. In fact, as far as vector-based systems go, I'm not familiar with any that stores only a pure vector representation of characters rather than text in the file; storing text is simply more efficient.

Jon Gorman

David A. Desrosiers

6:52 p.m.

...

You said it was impossible, that the information was stored purely as pixels and vectors. It's not.

I'll let it drop... except this one point: I never said it was "impossible" for a pdf to contain text in any of my messages (and further, I've never even used the word "impossible" in any message I've ever posted to this list, ever.)

Every single pdf I have here is exactly that: 7-bit ascii text, and nothing more, but the text in the pdfs is definitely not the text that comprises the content of the pdf itself. I have heard of binary pdfs, but I don't have one here and couldn't find one out there. My collection includes pdfs which are heavily encrypted with the latest-n-greatest Adobe 7.whatever product, and they're still 100% ascii text, but none of the text (except urls) is document "content".

...

You might need a tool more complex than strings to grab the textual information out if obfuscated (since it can really be an encoding within an encoding).

I've got many here, and even seen quite a few commercial (proprietary, no source available) products hijacking pdftohtml's source for their pdf rendering. I think I may have found yet-another one last night that converts PDFs for display on a Palm handheld device (a commercial "Office Documents on Palm" product). Of course the output is absolutely horrible, as is the output of most PDFs, but that's another matter.

...

You seemed to be implying that bowerbird couldn't be doing what he claimed because: " Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates. "

Actually, no tools that can decompose PDF back to readable text produce anything worth using. In 100% of the cases I've found, which includes Open Source and commercial tools, you have to go back in and reformat the entire output by hand anyway. I've tried automating the rewrap, paragraph layout and many other aspects, and it's just not worth it. It's easier to load it up in xpdf or acroread and cut and paste from the GUI into another file and format from that baseline.

But back to the Bowerbird case... he contends that his Z.M.L. tool written in gwbasic (or whatever it's using these days) can do everything including make coffee, walk the dog, and oh yeah, convert pdfs to a pleasant-to-read format. If this is true, this would be the first tool out of literally dozens that I've tried to accomplish this feat successfully. But I'm not going to go install DOS and gwbasic to find out.

David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

Jon Gorman

9:39 p.m.

...

I never said it was "impossible" for a pdf to contain text in any of my messages (and further, I've never even used the word "impossible" in any message I've ever posted to this list, ever.)

And I shouldn't put words in your mouth. I'm sorry. I just interpreted "Acrobat doesn't store text in PDFs" to mean that the specification says it never stores text in pdfs, hence it would be impossible to add. I realized later it could also be interpreted slightly differently (referring either to the applications that create pdfs not doing it, or to common practice). It is a real pain to get the text out, and it is getting worse with each version of pdf.

...

But back to the Bowerbird case... he contends that his Z.M.L. tool written in gwbasic (or whatever it's using these days) can do everything including make coffee, walk the dog, and oh yeah, convert pdfs to a pleasant-to-read format.

I must admit perhaps I wasn't following closely, but I think bowerbird just claimed the pdf that he exported was easy to import back as pdf (via copying out the text), not necessarily that he converted an existing pdf file. Of course, I'm probably wrong about that. Without capitals I sometimes get lost ;). Except for reading e.e. cummings I suppose.

Again David, I'm sorry if I hurt any feelings or anything along those lines. I know some Palm developers so your name is familiar. They're happy for the help you contributed to the community so I would be in some hot water if I ticked you off by being a little too flippant.

For some reason certain threads on this mailing list tend to warp my brain I think. Wonder if it has anything to do with the odd nesting sensation I get when I read certain parts of gutvol-d.

Jon

David A. Desrosiers

17 Jun

9:46 a.m.

...

Again David, I'm sorry if I hurt any feelings or anything along those lines. I know some Palm developers so your name is familiar. They're happy for the help you contributed to the community so I would be in some hot water if I ticked you off by being a little too flippant.

I have very thick skin, it takes a lot to hurt my feelings ;) No harm, no foul. Your comments (and those of others) were informative and worthwhile.

...

Wonder if it has anything to do with the odd nesting sensation I get when I read certain parts of gutvol-d.

I know... let's blame Bowerbird! ;) j/k

David A. Desrosiers desrod@gnu-designs.com http://gnu-designs.com

Marcello Perathoner

16 Jun

10:27 p.m.

David A. Desrosiers wrote:

...

Every single pdf I have here is exactly that: 7-bit ascii text, and nothing more,

The encoding used in a pdf depends on the font technology: Type-1, Type-3, TrueType etc. You can link a dictionary to every font and thus change the standard encoding in any way you like. pdf can even accommodate multi-byte encodings.

-- Marcello Perathoner webmaster@gutenberg.org

Tim Meekins

5:03 p.m.

Wrong! PDF most definitely stores text.

----- Original Message -----
From: "David A. Desrosiers" <hacker@gnu-designs.com>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Sent: Thursday, June 16, 2005 6:04 AM
Subject: Re: [gutvol-d] roundtripping formatted text through a .pdf

...

...
make a few global changes -- which restores the whitespace acrobat usually strips from text -- and you can load the text back into my z.m.l. viewer-program and generate the same .pdf once again...

Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates. Most definitely not text.

Marcello Perathoner

5:18 p.m.

David A. Desrosiers wrote:

...

Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates. Most definitely not text.

You are most definitely wrong there. How else would the "find" function work?

Here's an example of a pdf file contents:

    /F23 17.215 Tf 56.693 509.046 Td[(Chapter)-250(I)]TJ
    /F23 24.787 Tf 0 -74.229 Td[(Do)10(wn)-250(the)-250(Rab)10(bit-Hole)]TJ
    /F20 11.955 Tf 0 -44.334 Td[(Alice)-300(w)10(as)-299(be)15(ginning)-300(to)-299(get)-300(v)15(ery)-299(tired)-300(of)-300(sitting)-299(by)-300(her)-299(sister)-300(on)]TJ
    0 -14.446 Td[(the)-354(bank,)-380(and)-354(of)-354(ha)20(ving)-354(nothing)-353(to)-354(do:)-518(once)-354(or)-354(twice)-354(she)-354(had)]TJ
    0 -14.446 Td[(peeped)-198(into)-199(the)-198(book)-199(her)-198(sister)-199(w)10(as)-198(reading,)-209(b)20(ut)-198(it)-199(had)-198(no)-199(pictures)]TJ
    0 -14.446 Td[(or)-321(con)40(v)15(ersations)-321(in)-321(it,)-339(`and)-321(what)-321(is)-321(the)-321(use)-321(of)-321(a)-321(book,')-339(thought)]TJ
    0 -14.445 Td[(Alice)-250(`without)-250(pictures)-250(or)-250(con)40(v)15(ersation?')]TJ

You see that all the text is there. Spaces are simulated by horizontal movement and kernings also. It would not be too difficult to write a perl script to recover the text out of the pdf.

-- Marcello Perathoner webmaster@gutenberg.org
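The recovery script mentioned above could look roughly like the following sketch, written in Python rather than perl. It assumes an uncompressed content stream that uses only `TJ` arrays of literal strings and numbers, as in the excerpt; the `SPACE_THRESHOLD` value of 150 is a guess chosen to separate the word gaps (around -200 to -500 here) from the small kerning adjustments, and the names are hypothetical.

```python
import re

# A PDF literal string: "(...)" with backslash escapes, no nested parentheses.
STRING = r"\((?:\\.|[^\\()])*\)"
# A "[ ... ] TJ" array made of literal strings and numbers.
ARRAY_RE = re.compile(r"\[((?:%s|[^\]()])*)\]\s*TJ" % STRING)
# One token inside the array: either a string or a number.
TOKEN_RE = re.compile(r"(%s)|(-?\d+(?:\.\d+)?)" % STRING)
# Simplification: "\x" becomes "x"; octal and newline escapes are not handled.
UNESCAPE_RE = re.compile(r"\\(.)")

# Assumption: an adjustment of -150 or lower stands in for a word space.
SPACE_THRESHOLD = 150


def text_from_content_stream(stream: str) -> list:
    """Return one recovered line of text per TJ array found in stream."""
    lines = []
    for array in ARRAY_RE.finditer(stream):
        parts = []
        for string, number in TOKEN_RE.findall(array.group(1)):
            if string:
                # Drop the enclosing parentheses and undo simple escapes.
                parts.append(UNESCAPE_RE.sub(r"\1", string[1:-1]))
            elif float(number) <= -SPACE_THRESHOLD:
                # A large negative number moves the next glyph to the right.
                parts.append(" ")
            # Small adjustments are kerning and are ignored.
        lines.append("".join(parts))
    return lines


sample = (
    "/F23 17.215 Tf 56.693 509.046 Td[(Chapter)-250(I)]TJ"
    "/F23 24.787 Tf 0 -74.229 Td[(Do)10(wn)-250(the)-250(Rab)10(bit-Hole)]TJ"
)
for line in text_from_content_stream(sample):
    print(line)
# prints:
# Chapter I
# Down the Rabbit-Hole
```

The threshold is the fragile part: a different font or generator may use other spacing values, so any real tool would need to derive it from the font metrics rather than hard-code it.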

JHowse

3:10 p.m.

At 07:18 PM 16/06/05 +0200, you wrote:

...

David A. Desrosiers wrote:

...
Acrobat doesn't store text in PDFs, they store pixels and vectors and OCR'd coordinates. Most definitely not text.

You are most definitely wrong there. How else would the "find" function work?

[snip] And fonts are embedded in a pdf file!

...

You see that all the text is there. Spaces are simulated by horizontal movement and kernings also. It would not be too difficult to write a perl script to recover the text out of the pdf.

Or, if you have the full Adobe Acrobat programme, you can simply export to an RTF file. I did that sort of thing at work for three years. You may have to do some formatting to pretty it up, but it's definitely text.

JHowse
================================================================================
"I'm not likely to write a great novel or compose a song or save a baby from a burning building...but I can help make sure that there is an electronic library of free knowledge available for future people to access."--jhutch.
Preserving History One Page at a Time!!
Celebrating our 6750th book posted to Project Gutenberg
Join Project Gutenberg's Distributed Proofreaders http://www.pgdp.net/c/
================================================================================
