Re: [gutvol-d] Re: German texts and the m-dash

7 Jan 2005

Regarding the markup issue, this is how I feel:

PG TXT format is not meant to be read (it is ugly). It is meant to be
"the" reference format, waiting for something spiffier (XML or the
like). It is meant to be transformed into other formats, or viewed in
nice reading tools (e.g. a PDA with proportional fonts, anti-aliasing,
etc.).

As such, typography has no place in it: it is the backend's problem,
that is to say it falls within the bailiwick of the program that will
transform this basic interchange format into something else. (LaTeX
does it automatically with babel packages for instance; XHTML could
maybe do that with the right stylesheet --- then you won't have to worry
about inserting all paragraph indents for example).

When I type e-mails, even in French, I don't take the trouble to include
half- or full-width non-breaking spaces in front of ;:!?» and the
like, or after «. (By the way, I guess in German quotes work like this:
He said: »Hello« and not, like in French: He said: «Hello». I guess you
code those quotes just as they are in your raw text formats.) E-mails
are plain text in a fixed-width font, not a printed book with nice
typography.

As long as you don't destroy information, you can afterwards translate
those things properly respecting classical typography. I try to do that
for the PDF backend in
http://www.eleves.ens.fr/home/blondeel/PGDP/ebooksgratuits/

For instance, in a French text:

* any "--" appearing at the beginning of a paragraph is a dialog dash
  that should become "&ndash; " or maybe "&mdash; " in HTML.

* any other "--" is an em-dash that should become " &mdash; " in HTML
  (note the normal spaces: not unbreakable ones!)

* maybe other rules that escape me now (number intervals?)
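
The two rules above could be sketched like this in Python. This is only
an illustration, not the code behind the PDF backend mentioned earlier:
the name convert_dashes is made up, paragraphs are assumed to be
separated by blank lines, and the entities used are the standard HTML
names &ndash; and &mdash;.

```python
import re

# Hypothetical helper, not taken from any existing PG tool.
DIALOG_DASH = "&ndash; "   # the first rule also allows "&mdash; "
EM_DASH = " &mdash; "      # normal spaces, not unbreakable ones


def convert_dashes(text):
    """Apply the two French rules paragraph by paragraph."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = re.split(r"\n[ \t]*\n", text)
    out = []
    for para in paragraphs:
        # Rule 1: a "--" opening the paragraph is a dialog dash.
        para = re.sub(r"\A--(?!-) ?", DIALOG_DASH, para)
        # Rule 2: any other isolated "--" is an em-dash. Runs of three
        # or more hyphens are deliberately left alone.
        para = re.sub(r"(?<!-) ?--(?!-) ?", EM_DASH, para)
        out.append(para)
    return "\n\n".join(out)


sample = "--Bonjour, dit-il.\n\nIl vint--enfin--me voir."
print(convert_dashes(sample))
```

Working per paragraph rather than per line matters here: in a wrapped
text file, a "--" at the start of a line is not necessarily at the start
of a paragraph.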

On Thu, Jan 06, 2005 at 04:06:35PM -0800, Andrew Sly wrote:
...
However, I do see a problem.
Any "simple" global search/replace such as that has its risks.
You cannot assume that every instance of "--" is an emdash.
People who perform such search-and-replace operations are supposed to
know what they are doing. If you want to distinguish between a "--" at
the beginning of a paragraph and the others, for instance, you will run
a contextual search and replace.

I understand some people don't know how to do that and don't want to
learn how to do that. Then they will have to cope with the imperfect
typography, and wait for PG to move to other formats: if/when some
structured formats appear on PG, life will be much easier. For example
you could go:

User: Hey! show me book XXX in HTML format
Server: there you are: [...]
- Nice. Make the font bigger, the margins narrower, the titles
  bolder, etc. [*]
Server (compiling this format on the fly): - there you are: [...]
- Man! I like that book. Give it to me in PDF format.
- there you are: [...]
- Right. Give me both portrait format so I can print it,
  and landscape format with a bigger font so I can read it a little
  on the screen.
- there you are: [...]

[*] note: this you could do on your own, just changing the stylesheet of
the XHTML file (see examples at the URL above). But the website/layout
engine could do that for you.

I can already do all of the above with the ebooksgratuits experiment I
mentioned above (well, of course you would use drop-down menus and not
natural language; I mean I could if I took the time to code it, but
there is nothing difficult there: the proof of concept is out there. The
only slight problem is teaching LaTeX how to hyphenate words, but my
program gives me the list of the words LaTeX couldn't hyphenate, with
their severity and context, and lets me teach it how to break them).

As for the case mentioned here, maybe it is a PP issue. Of course the
HTML version should respect the typography more closely.
...
For instance, what would happen to the following (from
Roughing it in the Bush, PG#4389):
"You were fortunate, C---, to escape," said a backwood settler,
This would fail the contextual search and replace.
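
A quick way to see this, sketched in Python. The pattern below is just
one possible way to write such a contextual match, and the variable
names are illustrative: a "--" only counts when neither neighbour is a
hyphen, so the run in "C---" is left untouched and can be listed for
manual review instead.

```python
import re

# Assumed contextual pattern: exactly two hyphens, no hyphen on either side.
ISOLATED_DOUBLE_HYPHEN = re.compile(r"(?<!-)--(?!-)")
# Runs of three or more hyphens (e.g. an elided name) need a human decision.
LONG_HYPHEN_RUN = re.compile(r"-{3,}")

line = '"You were fortunate, C---, to escape," said a backwood settler,'

converted = ISOLATED_DOUBLE_HYPHEN.sub(" &mdash; ", line)
suspects = [m.group(0) for m in LONG_HYPHEN_RUN.finditer(line)]

print(converted == line)  # True: the contextual pattern matched nothing
print(suspects)           # ['---']
```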

To implement the transformations I detail above, you could do this (sed
syntax, but of course you would use an easier programming language):

# "\&" is a literal ampersand; a bare "&" would re-insert the match
s/^--\([^-]\)/\&ndash; \1/

s/\([^-]\)--\([^-]\)/\1 \&mdash; \2/g

Then you would check that no "--" remains, look for double spaces you
may have introduced with the second transform (in case there
were--wrongly--spaces around the "--" in the original text), etc.
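
Those follow-up checks could look like this in Python. Again, this is
only a sketch: leftover_hyphens and doubled_spaces are hypothetical
helper names, the input is assumed to be text that has already gone
through the two substitutions, and the entities looked for are the
standard HTML names &ndash; and &mdash;.

```python
import re


def leftover_hyphens(text):
    """Return (line number, line) pairs that still contain "--"."""
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if "--" in line]


def doubled_spaces(text):
    """Return (line number, line) pairs with 2+ spaces next to a dash entity."""
    pattern = re.compile(r" {2,}&[mn]dash;|&[mn]dash; {2,}")
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if pattern.search(line)]


# Made-up sample: one doubled-space case, one run the transforms skipped.
converted = "there were  &mdash;  spaces\nfortunate, C---, to escape"
print(leftover_hyphens(converted))
print(doubled_spaces(converted))
```

Both helpers only report; deciding what a leftover "---" should become
is still up to whoever runs the replacement.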

Sebastien Blondeel