
Marcello wrote:
Jon Noring grudgingly admits:
<or:page/> (page break in a paper source)
<or:lb/> (line break in a paper source)
<or:marker/> (a generic marker)
Why not use <tei:pb>, <tei:lb> and <tei:milestone>? Insisting on making your own when there are perfectly good elements in TEI is just plain ... sub-optimal.
Actually, a very good idea. We've not fixed the "custom" elements yet. I'll have to look at the TEI-defined semantics of the use of the TEI equivalents, but *if* reasonably close to what we need, will likely embrace them. It will add to the list of namespace declarations, but that downside is pretty minor. Thanks.
he began to crow delight&shy;<or:lb/>edly,
Sorry to rain on your parade but your (at best) half-baked proposal has the following shortcomings:
No, I'm submitting the idea for feedback, and your feedback is valuable.
1. Non-standard use of &shy;
The soft-hyphen is a "non-printable" character that may be replaced with a "printable" hyphen by processors before output.
Your use is to record the place where an existent hyphen has been stripped.
Yes.
You got it backwards. You confuse the very different stages of text feature recording and text output.
Actually, I've been debating whether or not to include the &shy; as it is used.
2. Throws off grep
An xml-grep could find "delight<tei:lb/>edly" if searching for "delighted", but it surely won't find "delight&shy;<tei:lb/>edly".
Well, with existing toolbases, this might be. I believe, however, that Unicode itself implies that text processors should ignore the soft hyphen (U+00AD). One reference is:
http://www.unicode.org/unicode/reports/tr14/#SoftHyphen
In addition, HTML discusses the use of the soft hyphen:
http://www.w3.org/TR/html401/struct/text.html#hyphenation
In summary, user agents, such as those doing word searching, should ignore the soft hyphen character. That some don't is a real-world issue that unfortunately has to be pragmatically considered.
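The search behaviour described above can be sketched in a few lines of Python. This is only an illustration, not part of any OpenReader tool: the namespace URI is a placeholder (the real one is not given in this thread), the function names are made up, and the soft hyphen is written as the numeric reference &#xAD; because plain XML does not predefine the &shy; entity.

```python
import xml.etree.ElementTree as ET

SOFT_HYPHEN = "\u00ad"

# Placeholder namespace URI (assumption); the thread does not state the real one.
SAMPLE = (
    '<p xmlns:or="http://example.org/openreader-placeholder">'
    "he began to crow delight&#xAD;<or:lb/>edly,"
    "</p>"
)

def searchable_text(xml_fragment):
    """Drop all markup (including empty line-break elements) and soft hyphens."""
    root = ET.fromstring(xml_fragment)
    text = "".join(root.itertext())       # element text and tails, tags removed
    return text.replace(SOFT_HYPHEN, "")  # soft hyphens are ignored for searching

def contains_word(xml_fragment, needle):
    """True if needle occurs in the normalized text of the fragment."""
    return needle in searchable_text(xml_fragment)

print("delightedly" in SAMPLE)               # raw string search misses the word
print(contains_word(SAMPLE, "delightedly"))  # normalized search finds it
```

A tool that searches the raw file bytes would still miss the word, which is the pragmatic issue raised above.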
3. Redundant text feature documentation
All you are doing here is repeatedly "documenting" that the character used to hyphenate words in this text is the hyphen. You don't have to repeat that statement through all of your text. A single statement to that effect in the TEI header will suffice.
Two points (based on what I interpret you are saying):
1) We are not focusing on TEI documents, thus many XML documents will not have a TEI header.
2) The Unicode annex statement on the use of the soft hyphen (see above link) takes into account other characters used for word-breaking purposes. It does not imply a "hard hyphen", but rather some character used for line breaking that depends upon the text's language and country code (required for all OpenReader Content Documents).
4. Incompatibility with LOTE
Remember that in LOTE (languages other than English) you have to deal with cases like the German "ck" and "fff", which got hyphenated this way:
dachdecker dachdek-ker
Schiffahrt Schiff-fahrt
Also remember French and Italian elisions that don't happen at line breaks.
Good points. I'll have to check the Unicode annex document (URL above) to see what it talks about regarding this.
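To make the German cases concrete, here is a minimal Python sketch of why deleting the hyphen at a line break does not always recover the original word. It assumes the older German orthography used in the two examples above and covers only those two patterns; the function names are hypothetical, and the "k-k" rule would over-apply to compounds where both k's are genuine.

```python
VOWELS = set("aeiouäöüy")

def naive_join(first, second):
    """Rejoin a word broken at a line end by simply dropping the hyphen."""
    return first + second

def german_join(first, second):
    """Rejoin a broken word, undoing the two spelling changes discussed above."""
    # "k-k" at the break stands for an original "ck" (dachdek-ker -> dachdecker).
    if first.endswith("k") and second.startswith("k"):
        return first[:-1] + "c" + second
    # A triple consonant restored at the break collapses back to a double one
    # when a vowel follows (Schiff-fahrt -> Schiffahrt).
    if (len(first) >= 2 and len(second) >= 2
            and first[-1] == first[-2] == second[0]
            and first[-1].lower() not in VOWELS
            and second[1].lower() in VOWELS):
        return first + second[1:]
    return first + second

for first, second in [("dachdek", "ker"), ("Schiff", "fahrt"), ("delight", "edly")]:
    print(naive_join(first, second), german_join(first, second))
```

The point of the sketch is that the joining rule is language-dependent, so a soft hyphen alone does not carry enough information for these cases.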
5. Dependence on one edition
All those hard-coded &shy;'s will marry your electronic text to one edition. You have no provision to encode different editions of the very same text like hardcover and paperback (which may very well have different line endings).
Yes, this is an issue. I do plan to allow addition of an attribute to both the page break and line break pointing (via Binder identifier) to the source work, so the markup may contain multiple source works. Things get messy if in two works the same word is broken, but in different places. But I think my system will work for this. Example of identifier attribute (still using OR namespace):
<or:page bid="book2" .../>
<or:lb bid="book2"/>
In the Binder document, in the "descriptions" section (now being amended), we might have:
<markdesc id="book2">Second Edition Issued in 1922</markdesc>
My advice is: forget entirely about line breaks. They are random artefacts introduced by the person operating the typesetting machine and indirectly by the person who chose paper size and font. They have no raison d'être once you separate the ebook from the scans, i.e. after it left DP. (That this suggestion was by "You Know Who" should have tipped you off immediately.)
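As a rough illustration of how the "bid" attribute could be resolved against the Binder descriptions, here is a small Python sketch. Everything beyond the or:lb element, the bid attribute, and the markdesc line quoted above is an assumption: the namespace URI is a placeholder, the descriptions wrapper element and the "book1" entry are invented for the example, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

OR_NS = "http://example.org/openreader-placeholder"  # placeholder URI (assumption)

CONTENT = (
    '<p xmlns:or="%s">he began to crow de<or:lb bid="book1"/>'
    'light<or:lb bid="book2"/>edly,</p>' % OR_NS
)

# Hypothetical Binder fragment; only the book2 entry comes from the thread.
BINDER = (
    "<descriptions>"
    '<markdesc id="book1">First Edition (invented entry)</markdesc>'
    '<markdesc id="book2">Second Edition Issued in 1922</markdesc>'
    "</descriptions>"
)

def source_descriptions(binder_xml):
    """Map each markdesc id to its description text."""
    root = ET.fromstring(binder_xml)
    return {m.get("id"): m.text for m in root.iter("markdesc")}

def unresolved_bids(content_xml, descriptions):
    """Return bid values used on line breaks that have no markdesc entry."""
    root = ET.fromstring(content_xml)
    used = {lb.get("bid") for lb in root.iter("{%s}lb" % OR_NS)}
    return sorted(used - set(descriptions))

descriptions = source_descriptions(BINDER)
print(descriptions["book2"])
print(unresolved_bids(CONTENT, descriptions))
```

A check like unresolved_bids would catch line breaks that point at a source work the Binder document never describes.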
Disagreed. There may be a need, for example, to continue proofing work in the future. Knowing where line breaks occurred makes it easier with DP and similar processes. It also better correlates to the "bounding box information" from OCR which is being preserved. And *someone* may want to know this for formatting purposes. It is information about the source which by and large is easy for user-agents to ignore. Regarding you-know-who, I think you know that I often have profound disagreements with him, but when I agree with him, I agree. I don't let personal issues get in the way of acknowledging when I think he is right. Those who believe in objectivity evaluate what a person says.
But if you belong to that fastidious class of people who can't throw away even the most useless random artefact, I suggest doing it this standard way:
<html:p> ... he began to crow de<tei:lb ed="paperback" />light<tei:lb ed="hardcover" />edly, ... </html:p>
A standard XHTML browser (OpenReader?) will simply throw away the unknown tags and render the normalized text. A special processor may be used to reconstruct the paper layout of the text.
Well, the real issue is dealing with the "fff", etc. issue of LOTE. I'll have to reread the Unicode annex. In OpenReader we reference that spec, and recommend user agents follow its guidelines. But it might not cover the particular LOTE "exceptions" you brought up. Thanks for your frank feedback. Definitely needed.
Jon Noring
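Both behaviours mentioned here (ignoring the tags, and rebuilding one edition's lines) can be sketched in Python. This is a minimal sketch under stated assumptions: the namespace URIs are the usual XHTML and TEI P5 ones, which the thread itself does not declare; the line breaks are assumed to be direct children of the paragraph; and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"      # assumed; not declared in the thread
XHTML_NS = "http://www.w3.org/1999/xhtml"   # assumed; not declared in the thread
LB = "{%s}lb" % TEI_NS

SAMPLE = (
    '<html:p xmlns:html="%s" xmlns:tei="%s">'
    'he began to crow de<tei:lb ed="paperback" />'
    'light<tei:lb ed="hardcover" />edly,'
    "</html:p>" % (XHTML_NS, TEI_NS)
)

def normalized_text(p):
    """What a reader that ignores the line-break tags would render."""
    return "".join(p.itertext())

def lines_for_edition(p, edition):
    """Split the paragraph at the line breaks recorded for one edition."""
    lines, current = [], p.text or ""
    for child in p:
        if child.tag == LB and child.get("ed") == edition:
            lines.append(current)   # close the line at this edition's break
            current = ""
        # Breaks of other editions are skipped; their following text is kept.
        current += "".join(child.itertext()) + (child.tail or "")
    lines.append(current)
    return lines

p = ET.fromstring(SAMPLE)
print(normalized_text(p))
print(lines_for_edition(p, "paperback"))
print(lines_for_edition(p, "hardcover"))
```

The sketch does not re-insert a visible hyphen at a mid-word break; a real layout processor would need a language-aware rule for that.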