re: [gutvol-d] roundtripping formatted text through a .pdf

16 Jun 2005

      jon gorman said:
...
Just wanted to clarify things.
that's good.  i like clarification...         :+)
...
I'm skeptical about bowerbird's claims as well
that's good.  i like skeptics...          ;+)

but the proof is in the pudding,
jon, the proof is in the pudding...
...
but it's misleading to say that 
  Acrobat doesn't store text in the document.
i believe, like you, that that would be a misleading statement.
...
It is possible to make the text rather obscure
well, as i said, one _can_ make it rather totally "obscure" by
converting it to graphic format before writing it to the .pdf.
in that case, the user cannot copy out the text -- as text --
to the clipboard.  such "text" is not found by "find" either.

(here i'm largely speaking, of course, as a _programmer_
who is actually outputting the content to the .pdf driver.
most people creating a .pdf don't have that luxury, in that
they're stuck with whatever their authoring tool might do.
as a sidebar here, i will note that the problems involved in
copying text from a .pdf are well-known and long-standing,
so they _should_ have been addressed by the programmers of
common authoring tools, like word-processors, by this time.
in programming my tool, i have sought to empower my users,
including in this arena of round-tripping text put into a .pdf.)
...
but that doesn't mean that if formatted correctly 
  you could not scan through the file in a text editor and
...
read it.  Granted, it's rarely done, but doesn't mean 
  it's impossible.
well, i believe your statement is misleading as well, jon...

(and if you're striving to "clarify" things, you really should try 
something to see if you _can_ do it before you _say_ you can...)

load a .pdf into an editor; you won't find much (if any) text qua text,
not in a recognizable form you can easily copy out to the clipboard.

(it's not _impossible_ you will find some text, depending upon
how the .pdf was created, just as there is text in some .ps files.
but it's never a long unbroken stretch before it is interrupted
by postscript-style drawing commands, so this approach is doomed to failure.)

so one shouldn't expect to find text -- stored as text -- in a .pdf,
not in the traditional sense.  (however, see the p.s. on this post.)
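
(a minimal sketch of why the editor shows so little, under stated
assumptions: the content stream below is a made-up two-line example,
not taken from any real file, and the helper names are hypothetical.
many .pdf writers flate-compress the page content, so the raw bytes
show no readable text at all; even after decompressing, the text sits
in short pieces between drawing operators.  the pattern handles only
the simplest "(string) Tj" case.)

```python
import re
import zlib

# a hypothetical page content stream: each bit of text is wrapped in operators
content = b"BT /F1 12 Tf 72 720 Td (call me) Tj 0 -14 Td (ishmael.) Tj ET"

# what a .pdf writer would typically put on disk: the same stream, flate-compressed
stored = zlib.compress(content)


def visible_in_editor(raw: bytes, phrase: bytes) -> bool:
    """true if the phrase appears literally in the raw bytes an editor would show."""
    return phrase in raw


def strings_shown(stream: bytes) -> list:
    """pull out literal strings followed by a Tj operator.

    simple case only: no escapes, no nested parentheses,
    no TJ arrays, no hex strings.
    """
    found = re.findall(rb"\(([^()\\]*)\)\s*Tj", stream)
    return [piece.decode("latin-1") for piece in found]


# the text comes back only after undoing the compression,
# and even then only as separate fragments
pieces = strings_shown(zlib.decompress(stored))
```

so the text is in there, but an editor would need to decompress the
stream and then stitch the fragments back together to get at it.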

nonetheless, if the text wasn't stored in the .pdf in _some_ way,
users wouldn't be able to copy it out to the clipboard, would they?
and acrobat wouldn't be able to do "find" operations on it, would it?

(notably, though, you'll discover that acrobat's "find" capabilities
don't extend to whitespace.  for instance, you can't do a search for
two spaces, even if there were such instances in the original file.)

-bowerbird

p.s.  it might be possible to store text in the comments of a .pdf,
i'm not sure.  if you could, then that _might_ be interesting to do.
(i will explore the possibility, especially when my app starts to
create .pdfs directly without running them through a .pdf driver.)
with such storage, one wouldn't need to pull the .pdf into acrobat
in order to retrieve the text from it, which might be a capability
that some people would find useful.  (it would also allow ordinary
search programs to search the .pdf.)  but that's just gravy to me;
as long as users can "roundtrip" text out of a .pdf, my goal is met.
once people get used to my viewer, they won't even _want_ .pdfs.
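
(a rough sketch of the comment idea, nothing more.  in a .pdf, a "%"
outside of a string or stream starts a comment that runs to the end
of the line.  the sketch assumes it is acceptable to append comment
lines after the final %%EOF, which i have not checked against any
particular viewer.  the "%bb-text " marker and both function names
are invented for illustration, and the "pdf" here is a stand-in
byte string, not a real file.)

```python
MARK = b"%bb-text "  # hypothetical marker, not part of any pdf standard


def append_text_as_comments(pdf_bytes: bytes, text: str) -> bytes:
    """append the text to the file, one comment line per line of text."""
    tail = b"".join(
        MARK + line.encode("utf-8") + b"\n" for line in text.split("\n")
    )
    # make sure the first comment starts on its own line
    sep = b"" if pdf_bytes.endswith(b"\n") else b"\n"
    return pdf_bytes + sep + tail


def read_text_from_comments(pdf_bytes: bytes) -> str:
    """recover the text without any pdf viewer, just by scanning lines."""
    found = [
        raw[len(MARK):].decode("utf-8")
        for raw in pdf_bytes.split(b"\n")
        if raw.startswith(MARK)
    ]
    return "\n".join(found)


# stand-in for a real file; the objects are elided
fake_pdf = b"%PDF-1.4\n% (objects elided)\n%%EOF\n"
original = "two  spaces here\n\nsecond paragraph"

tagged = append_text_as_comments(fake_pdf, original)
recovered = read_text_from_comments(tagged)
```

stored this way, the text would come back with its whitespace intact
(double spaces and blank lines included), and an ordinary search
program could find it in the raw file.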

Bowerbird＠aol.com