
David A. Desrosiers wrote:
Acrobat doesn't store text in PDFs; it stores pixels and vectors and OCR'd coordinates. Most definitely not text.

You are most definitely wrong there. How else would the "find" function work? Here's an example of the contents of a PDF file:

    /F23 17.215 Tf 56.693 509.046 Td[(Chapter)-250(I)]TJ
    /F23 24.787 Tf 0 -74.229 Td[(Do)10(wn)-250(the)-250(Rab)10(bit-Hole)]TJ
    /F20 11.955 Tf 0 -44.334 Td[(Alice)-300(w)10(as)-299(be)15(ginning)-300(to)-299(get)-300(v)15(ery)-299(tired)-300(of)-300(sitting)-299(by)-300(her)-299(sister)-300(on)]TJ
    0 -14.446 Td[(the)-354(bank,)-380(and)-354(of)-354(ha)20(ving)-354(nothing)-353(to)-354(do:)-518(once)-354(or)-354(twice)-354(she)-354(had)]TJ
    0 -14.446 Td[(peeped)-198(into)-199(the)-198(book)-199(her)-198(sister)-199(w)10(as)-198(reading,)-209(b)20(ut)-198(it)-199(had)-198(no)-199(pictures)]TJ
    0 -14.446 Td[(or)-321(con)40(v)15(ersations)-321(in)-321(it,)-339(`and)-321(what)-321(is)-321(the)-321(use)-321(of)-321(a)-321(book,')-339(thought)]TJ
    0 -14.445 Td[(Alice)-250(`without)-250(pictures)-250(or)-250(con)40(v)15(ersation?')]TJ

You can see that all the text is there, in the parenthesized strings. The numbers between the strings are horizontal adjustments: the large negative ones (around -250) simulate the spaces between words, and the small positive ones (10, 15, 20, 40) are kerning between letters. It would not be too difficult to write a Perl script to recover the text from the PDF.

--
Marcello Perathoner
webmaster@gutenberg.org
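
To illustrate the last point, here is a rough sketch of such a script, written in Python rather than Perl. It is not a real PDF parser: it assumes the content stream is already uncompressed, that the font uses a plain ASCII-compatible encoding, and that no string contains a "]". The names `extract_text` and `SPACE_THRESHOLD`, and the cutoff value of -100, are my own choices for this example, picked because the word gaps above are all below -190 and the kerns are all positive.

```python
import re

# One text-showing array: "[ ... ]TJ". Assumes no "]" inside the strings.
TJ_ARRAY = re.compile(r"\[(.*?)\]\s*TJ", re.S)

# Inside an array: either a (string) or a number (horizontal adjustment).
ELEMENT = re.compile(r"\(((?:\\.|[^\\()])*)\)|(-?\d+(?:\.\d+)?)")

# Heuristic cutoff (an assumption): adjustments at or below this value
# are treated as word spaces, anything else as kerning and ignored.
SPACE_THRESHOLD = -100


def unescape(s):
    """Undo the backslash escapes for '(', ')' and '\\' only."""
    return re.sub(r"\\([()\\])", r"\1", s)


def extract_text(stream):
    """Return one line of text per TJ array found in the content stream."""
    lines = []
    for array in TJ_ARRAY.finditer(stream):
        parts = []
        for element in ELEMENT.finditer(array.group(1)):
            text, number = element.group(1), element.group(2)
            if number is not None:
                if float(number) <= SPACE_THRESHOLD:
                    parts.append(" ")
            else:
                parts.append(unescape(text))
        lines.append("".join(parts))
    return "\n".join(lines)


sample = (
    "/F23 17.215 Tf 56.693 509.046 Td[(Chapter)-250(I)]TJ"
    "/F23 24.787 Tf 0 -74.229 Td[(Do)10(wn)-250(the)-250(Rab)10(bit-Hole)]TJ"
)
print(extract_text(sample))
# Chapter I
# Down the Rabbit-Hole
```

In a real file the content streams are usually compressed and the fonts may use custom encodings, so this only works on streams like the one quoted above.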