Re: [gutvol-d] roundtripping formatted text through a .pdf

16 Jun 2005

      David A. Desrosiers wrote:
...
> Acrobat doesn't store text in PDFs, they store pixels and vectors and
> OCR'd coordinates. Most definitely not text.

You are most definitely wrong there. How else would the "find" function
work?

Here's an example of a pdf file contents:

/F23 17.215 Tf 56.693 509.046 Td[(Chapter)-250(I)]TJ/F23 24.787 Tf 0
-74.229 Td[(Do)10(wn)-250(the)-250(Rab)10(bit-Hole)]TJ/F20 11.955 Tf 0
-44.334
Td[(Alice)-300(w)10(as)-299(be)15(ginning)-300(to)-299(get)-300(v)15(ery)-299(tired)-300(of)-300(sitting)-299(by)-300(her)-299(sister)-300(on)]TJ
0 -14.446
Td[(the)-354(bank,)-380(and)-354(of)-354(ha)20(ving)-354(nothing)-353(to)-354(do:)-518(once)-354(or)-354(twice)-354(she)-354(had)]TJ
0 -14.446
Td[(peeped)-198(into)-199(the)-198(book)-199(her)-198(sister)-199(w)10(as)-198(reading,)-209(b)20(ut)-198(it)-199(had)-198(no)-199(pictures)]TJ
0 -14.446
Td[(or)-321(con)40(v)15(ersations)-321(in)-321(it,)-339(`and)-321(what)-321(is)-321(the)-321(use)-321(of)-321(a)-321(book,')-339(thought)]TJ
0 -14.445
Td[(Alice)-250(`without)-250(pictures)-250(or)-250(con)40(v)15(ersation?')]TJ

You can see that all the text is there. Word spaces are simulated by
horizontal movement, and so is kerning. It would not be too difficult to
write a Perl script to recover the text from the PDF.
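
To give a rough idea of what such a script might look like, here is a minimal sketch, written in Python rather than Perl. It assumes an uncompressed content stream with a simple font encoding, like the excerpt above. The names extract_lines and SPACE_THRESHOLD are made up for this illustration, and the cutoff of -100 between "kerning" and "word space" is only a guess based on the numbers in the excerpt. Real PDFs may need decompression and font decoding first.

```python
import re

# One "[ ... ]TJ" operator with its array operand. The array is a mix of
# (strings) and numbers; strings may contain backslash-escaped characters.
TJ_ARRAY = re.compile(r"\[((?:\((?:\\.|[^\\()])*\)|[^\[\]()])*)\]\s*TJ")

# One array element: either a (string) or a number.
ELEMENT = re.compile(r"\(((?:\\.|[^\\()])*)\)|([-+]?(?:\d+\.?\d*|\.\d+))")

# Assumption: a shift at or below this value is a word space; anything
# closer to zero is kerning and is ignored. Guessed from the excerpt.
SPACE_THRESHOLD = -100.0


def unescape(text):
    """Drop the backslash from simple escapes such as \\( and \\).

    Octal escapes and \\n are not handled in this sketch.
    """
    return re.sub(r"\\(.)", r"\1", text)


def extract_lines(stream):
    """Return one recovered text line per TJ operator found in stream."""
    lines = []
    for array in TJ_ARRAY.finditer(stream):
        parts = []
        for element in ELEMENT.finditer(array.group(1)):
            if element.group(2) is not None:
                # A number: large negative shifts stand in for a space.
                if float(element.group(2)) <= SPACE_THRESHOLD:
                    parts.append(" ")
            else:
                # A string: literal text to show.
                parts.append(unescape(element.group(1)))
        lines.append("".join(parts))
    return lines


if __name__ == "__main__":
    sample = (
        "/F23 17.215 Tf 56.693 509.046 Td[(Chapter)-250(I)]TJ"
        "/F23 24.787 Tf 0 -74.229 Td"
        "[(Do)10(wn)-250(the)-250(Rab)10(bit-Hole)]TJ"
    )
    for line in extract_lines(sample):
        print(line)
    # prints:
    # Chapter I
    # Down the Rabbit-Hole
```

The threshold is the fragile part: a font with very wide kerning pairs or very tight word spacing would need a different cutoff, or a comparison against the width of the font's space glyph.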

-- 
Marcello Perathoner
webmaster@gutenberg.org