Re: [gutvol-d] the end of the line

24 Jun 2006

Marcello wrote:
...
Jon Noring grudgingly admits:
...
...
<or:page/>                   (page break in a paper source)
<or:lb/>                     (line break in a paper source)
<or:marker/>                 (a generic marker)
...
Why not use <tei:pb> , <tei:lb> and <tei:milestone> ? Insisting on
making your own when there are perfectly good elements in TEI is just
plain ... sub-optimal.
Actually, a very good idea. We've not fixed the "custom" elements yet.

I'll have to look at the TEI-defined semantics of the use of the TEI
equivalents, but *if* reasonably close to what we need, will likely
embrace them. It will add to the list of namespace declarations, but
that downside is pretty minor. Thanks.
...
...
he began to crow delight&shy;<or:lb/>edly,
...
Sorry to rain on your parade but your (at best) half-baked proposal has
the following shortcomings:
No, I'm submitting the idea for feedback, and your feedback is
valuable.
...
1. Non-standard use of &shy;
The soft-hyphen is a "non-printable" character that may be replaced with
a "printable" hyphen by processors before output.
Your use is to record the place where an existent hyphen has been stripped.
Yes.
...
You got it backwards. You confuse the very different stages of text
feature recording and text output.
Actually, I've been debating whether or not to include the &shy; as it
is used.
...
2. Throws off grep
An xml-grep could find "delight<tei:lb/>edly" if searching for
"delighted", but it surely won't find "delight&shy;<tei:lb/>edly".
Well, with existing tools, this might be so. I believe, however, that
Unicode itself implies that text processors should ignore &shy;
(U+00AD). One reference is:

   http://www.unicode.org/unicode/reports/tr14/#SoftHyphen

In addition HTML discusses the use of the soft hyphen:

   http://www.w3.org/TR/html401/struct/text.html#hyphenation

In summary, user agents, for example when doing word searches, should
ignore the soft hyphen character. That some don't is a real-world issue
that unfortunately has to be considered pragmatically.
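To make the search concern concrete, here is a minimal sketch of a
search routine that ignores both the soft hyphen (U+00AD) and empty
line-break elements. The function names and the regular expression are
illustrative assumptions, not part of any OpenReader or TEI tool, and a
real implementation would work on a parsed XML tree rather than on raw
markup.

```python
import re

SOFT_HYPHEN = "\u00ad"  # the character that &shy; expands to

# Assumption: line breaks are empty elements named lb in the "tei" or
# "or" prefix, e.g. <tei:lb/> or <or:lb bid="book2"/>.
LB_TAG = re.compile(r"<(?:tei|or):lb\b[^>]*/>")


def normalize_for_search(markup: str) -> str:
    """Drop line-break markers and soft hyphens so words read as written."""
    return LB_TAG.sub("", markup).replace(SOFT_HYPHEN, "")


def contains_word(markup: str, word: str) -> bool:
    """Search the normalized text instead of the raw markup."""
    return word in normalize_for_search(markup)


sample = "he began to crow delight\u00ad<tei:lb/>edly,"
print(contains_word(sample, "delightedly"))
```

A plain substring search on the raw markup misses the word, which is
the real-world problem with tools that do not ignore U+00AD.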
...
3. Redundant text feature documentation
All you are doing here is repeatedly "documenting" that the character
used to hyphenate words in this text is the hyphen. You don't have to
repeat that statement through all of your text. A single statement to
that effect in the TEI header will suffice.
Two points (based on what I interpret you are saying):

1) We are not focusing on TEI documents, thus many XML documents will
   not have a TEI header.

2) The Unicode annex statement on the use of the soft hyphen (see
   above link) takes into account other characters used for word
   breaking purposes. It does not imply a "hard hyphen", but some
   character used for linebreaking depending upon the text's language
   and country code (required for all OpenReader Content Documents).
...
4. Incompatibility with LOTE
Remember that in LOTE you have to deal with cases like the German "ck"
and "fff" which got hyphenated this way:
dachdecker
  dachdek-ker
Schiffahrt
  Schiff-fahrt
Also remember French and Italian elisions that don't happen at line breaks.
Good points. I'll have to check the Unicode annex document (URL above)
to see what it talks about regarding this.
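The two German examples above can be checked with a small sketch. It
shows that simply dropping the line-end hyphen does not give back the
unbroken word, so the unbroken form has to be recorded somewhere. The
exception table and both function names are hypothetical and exist only
for illustration; the spellings are the ones quoted above.

```python
def naive_join(first: str, second: str) -> str:
    """Rejoin a word broken across two lines by dropping the hyphen."""
    head = first[:-1] if first.endswith("-") else first
    return head + second


# Hypothetical lookup of line-end forms whose unbroken spelling differs.
EXCEPTIONS = {
    ("dachdek-", "ker"): "dachdecker",
    ("Schiff-", "fahrt"): "Schiffahrt",
}


def join_with_exceptions(first: str, second: str) -> str:
    """Prefer a recorded unbroken form; otherwise fall back to naive_join."""
    return EXCEPTIONS.get((first, second), naive_join(first, second))


for parts in [("dachdek-", "ker"), ("Schiff-", "fahrt"), ("delight-", "edly")]:
    print(parts, naive_join(*parts), join_with_exceptions(*parts))
```

The naive join gives "dachdekker" and "Schifffahrt", neither of which
matches the unbroken spellings quoted above, while an ordinary break
such as "delight-" / "edly" rejoins correctly.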
...
5. Dependence on one edition
All those hard-coded 's will marry your electronic text to one
edition. You have no provision to encode different editions of the very
same text like hardcover and paperback (which may very well have
different line endings).
Yes, this is an issue. I do plan to allow addition of an attribute to
both the page break and line break pointing (via Binder identifier) to
the source work. So the markup may contain multiple source works.
Things get messy if in two works the same word is broken, but in
different places. But I think my system will work for this.

Example of identifier attribute (still using OR namespace):

   <or:page bid="book2" .../>

   <or:lb bid="book2"/>

In the Binder document, in the "descriptions" section (now being amended),
we might have:

   <markdesc id="book2">Second Edition Issued in 1922</markdesc>
...
My advice is: forget entirely about line breaks. They are random
artefacts introduced by the person operating the typesetting machine and
indirectly by the person who chose paper size and font. They have no
raison d'être once you separate the ebook from the scans, ie. after it
left DP. (That this suggestion was by "You Know Who" should have tipped
you off immediately.)
Disagreed. There may be a need, for example, to continue proofing work
in the future. Knowing where line breaks occurred makes it easier with
DP and similar processes. It also better correlates to the "bounding box
information" from OCR which is being preserved. And *someone* may want
to know this for formatting purposes. It is information about the source
which by and large is easy for user-agents to ignore.

Regarding you-know-who, I think you know that I often have profound
disagreements with him, but when I agree with him, I agree. I don't let
personal issues get in the way of acknowledging when I think he is
right. Those who believe in objectivity evaluate what a person says.
...
But if you belong to that fastidious class of people who can't throw
away even the most useless random artefact, I suggest doing it this
standard way:
<html:p>
  ...
  he began to crow de<tei:lb ed="paperback" />light<tei:lb ed="hardcover" />edly,
  ...
  </html:p>
A standard XHTML browser (OpenReader ?) will simply throw away the
unknown tags and render the normalized text. A special processor may be
used to reconstruct the paper layout of the text.
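As a rough sketch of such a "special processor", the following reads
the paragraph above and returns either the normalized text or the line
division of one edition. The namespace URIs are assumptions added to
make the fragment parse (the XHTML namespace and the TEI P5 namespace),
the function names are hypothetical, and the code only handles line
breaks that are direct children of the paragraph. It does not reinsert
a hyphen at the line end.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the tei: prefix.
TEI_LB = "{http://www.tei-c.org/ns/1.0}lb"

FRAGMENT = (
    '<html:p xmlns:html="http://www.w3.org/1999/xhtml" '
    'xmlns:tei="http://www.tei-c.org/ns/1.0">'
    'he began to crow de<tei:lb ed="paperback" />light'
    '<tei:lb ed="hardcover" />edly,'
    "</html:p>"
)


def normalized_text(p: ET.Element) -> str:
    """Text with every line-break marker ignored."""
    return "".join(p.itertext())


def lines_for_edition(p: ET.Element, edition: str) -> list[str]:
    """Split the paragraph at the line breaks recorded for one edition."""
    lines = [p.text or ""]
    for child in p:
        if child.tag == TEI_LB and child.get("ed") == edition:
            lines.append("")  # start a new line at this marker
        lines[-1] += child.tail or ""
    return lines


para = ET.fromstring(FRAGMENT)
print(normalized_text(para))
print(lines_for_edition(para, "paperback"))
print(lines_for_edition(para, "hardcover"))
```

Breaks belonging to other editions are skipped, so the same markup
yields a different line division for each value of the ed attribute.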
Well, the real issue is dealing with the "fff", etc. issue of LOTE.
I'll have to reread the Unicode annex. In OpenReader we reference that
spec, and recommend user agents follow its guidelines. But it might
not cover the particular LOTE "exceptions" you brought up.

Thanks for your frank feedback. Definitely needed.

Jon Noring