On quote-like marks...

Regarding the recent discussion about ASCII and the single/double quote marks (and what to use), I have my two cents to add (and those here who are much more expert at character sets and Unicode than I am will undoubtedly be able to add to this).

The situation regarding single and double quote-like marks is even more complicated than has been presented so far. It has an impact on the future expanded use of PG texts as envisioned by Michael Hart and others, such as text-to-speech and language conversion. So I believe it needs to be dealt with in a more standardized fashion (that is, don't simply use the straight keyboard ' and " for everything under the sun).

Quote-like marks are used for multiple purposes in texts -- especially single quote-like marks. And then there are the "curly" types of marks used in typographical presentation.

Here's a (probably) partial list of their multiple uses:

1) For marking up quotations (other conventions are also used)
2) Word contractions (e.g., "we're" for "we are")
3) Possessives ("the Emperor's crown")
4) Non-breaking character modifiers (see below)
5) Minutes and seconds of time and arc (50d3'25")
6) Feet and inches unit indicator (She is 5'7" tall)
7) Other mathematical symbol and unit measurement uses

Item (4) is particularly interesting since I'm working on cleaning up Burton's "1001 Arabian Nights Tales", and in it there are many Arabic names where, when Burton converted to Latin script, single quote-like marks were inserted to indicate a type of non-breaking character modifier for pronunciation purposes. For example: Ja'afar. This differs semantically from the apostrophes used for contractions/possessives -- or at least differs enough (imho) to warrant differentiation in character encoding/entities.

In the XML markup of the Arabian Nights, I've chosen to use the following Unicode character conventions to keep everything straight. It's not necessarily what I propose PG/DP do, but it indicates one possible approach. Since at present I do not enclose quotations in <q> (for example), I keep the quotation marks (double and single) in the text to identify quotations. In the Arabian Nights I find some odd quotation passages, a couple of which start in the middle of one paragraph and end in the middle of another paragraph later within a story, so adding <q>...</q> would result in non-well-formed XML (I could use the "mile marker" approach as defined in TEI, but for the Arabian Nights have chosen not to).

1) For quotations using double quote marks, I use the Unicode left-double quote mark for the beginning, and the right-double quote mark for the ending: “ and ”, respectively. (The "curly" quotes -- for those who don't like curly quotes for reading, it is trivial to convert them to straight keyboard quote marks, but going the other way is more difficult to do.)
2) For quotations using single quote marks, I use the Unicode left-single quote mark for the beginning, and the right-single quote mark for the ending: ‘ and ’, respectively.
3) For the non-breaking character modifier as described above, e.g. for "Ja'afar", I use the Unicode character specific for this purpose: ʼ
4) For word contractions and possessives I use the ordinary lower-ASCII single straight quote mark: ' (For later presentational purposes this character can always be converted to the right-single "curly" quote mark.)

For the use of ' and " for minutes and seconds of arc, and for feet/inches, there are special Unicode code points (I don't see this usage in the Arabian Nights footnotes, but maybe I'll encounter it somewhere, not having finished the 5000+ footnotes).

If one is working with plain text files (not XML), the above Unicode characters can be encoded using UTF-8 or UTF-16.

Jon Noring
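A minimal Python sketch of the conventions above, offered only as an illustration. It assumes the modifier character in item 3 is U+02BC (MODIFIER LETTER APOSTROPHE) and that the "special code points" for arc and feet/inches are the primes U+2032 and U+2033; the names MARKS, TO_STRAIGHT and to_straight are made up for this example. It shows why curly-to-straight is the trivial direction (a many-to-one table lookup) and what the characters look like in UTF-8 and UTF-16.

```python
import unicodedata

# Code points for the uses listed above. U+02BC for the name modifier and
# U+2032/U+2033 for the primes are assumptions of this sketch.
MARKS = {
    "open double quote": "\u201c",
    "close double quote": "\u201d",
    "open single quote": "\u2018",
    "close single quote": "\u2019",
    "name modifier (Ja'afar)": "\u02bc",
    "contraction/possessive": "\u0027",
    "minutes / feet": "\u2032",
    "seconds / inches": "\u2033",
}

# Many marks collapse onto two ASCII characters, so no context is needed.
# The reverse direction would have to guess: opening or closing quote?
# apostrophe or name modifier? minutes or feet?
TO_STRAIGHT = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u2033": '"',
    "\u2018": "'", "\u2019": "'", "\u02bc": "'", "\u2032": "'",
})


def to_straight(text):
    """Flatten every quote-like mark to the ASCII ' or "."""
    return text.translate(TO_STRAIGHT)


if __name__ == "__main__":
    for use, ch in MARKS.items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch):<36} {use}"
              f"  utf-8={ch.encode('utf-8').hex()}"
              f"  utf-16={ch.encode('utf-16-be').hex()}")
    sample = "\u201cJa\u02bcafar's crown,\u201d she said."
    print(to_straight(sample))
```

Once a text has been flattened this way the distinctions are gone, which is the argument for keeping the richer characters in the master file and deriving the plain version from it.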

When I prepare TEI versions of my texts, I normally use the following:

“ ‘ ” ’ for quotation marks (in English; LOTE has even more variants)
' for the apostrophe, including those used in the possessive, as they are the same
′ ″ for minutes and seconds (and even ‴ for triple primes)

For works using Arabic, I also use &ayn;, etc., to represent those Arabic letters, if they are thus represented in Roman script.

I map these entities to Unicode for HTML versions, and to the nearest ASCII equivalents in plain vanilla. I avoid <q>...</q> for the reasons you mention.

If you need help with validation / transforms, etc., drop me a note...

Jeroen.
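The two-target mapping described in the reply (entities to Unicode for HTML, entities to nearest ASCII for plain vanilla) could be sketched roughly as follows. This is not the actual transform used for those texts: the table, the function name expand, the ASCII fallbacks, and in particular the choice of U+02BF for &ayn; are all assumptions made for illustration.

```python
import re

# entity name -> (Unicode for HTML output, nearest ASCII for plain text).
# Illustrative table only; "ayn" -> U+02BF and every ASCII fallback here
# are assumptions of this sketch, not a documented mapping.
ENTITIES = {
    "ldquo": ("\u201c", '"'),
    "rdquo": ("\u201d", '"'),
    "lsquo": ("\u2018", "'"),
    "rsquo": ("\u2019", "'"),
    "apos": ("'", "'"),
    "prime": ("\u2032", "'"),
    "Prime": ("\u2033", '"'),
    "tprime": ("\u2034", "'''"),
    "ayn": ("\u02bf", "'"),
}

ENTITY_RE = re.compile(r"&([A-Za-z]+);")


def expand(text, ascii_only=False):
    """Replace known entities; leave unknown ones (e.g. &amp;) untouched."""
    def repl(match):
        pair = ENTITIES.get(match.group(1))
        if pair is None:
            return match.group(0)
        return pair[1] if ascii_only else pair[0]
    return ENTITY_RE.sub(repl, text)


if __name__ == "__main__":
    src = "&ldquo;Ja&ayn;afar&rdquo; stood at 50&prime;25&Prime;"
    print(expand(src))                   # Unicode, e.g. for HTML
    print(expand(src, ascii_only=True))  # plain-vanilla ASCII
```

Keeping named entities in the master file means each output format picks its own rendering, and the semantic distinction survives even when the ASCII output collapses several marks into one.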
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d