
On Thu, January 26, 2012 1:57 am, Keith J. Schultz wrote:
Hi Lee,
On 25.01.2012 at 20:22, Lee Passey wrote:
Depends on the file. XML files (XHTML, TEI, etc.) are guaranteed to be ASCII in their first line, and that first line declares the encoding, so no BOM is
Just to be picky, but you err here. The above-mentioned files are not guaranteed to be ASCII; only txt is. Yet, as you state, the first lines can contain encoding information.
XML files are not guaranteed to be "txt." In fact, the term "txt" has virtually no meaning, so you can't say anything is guaranteed to be "txt" unless you first define what that term means.

It is true that XML files can be encoded using UTF-16, in which case the first line will /not/ be ASCII, and a BOM should be required (what's the default byte ordering of UTF-16 if there's no BOM, I wonder?). But if the file is not UTF-16, then at least the first line is guaranteed to be ASCII, as it must be the "<?xml" signature, and that signature is composed only of ASCII characters. This is possible because ASCII, Latin-1, Windows 1252, and UTF-8 are all identical for values in the ASCII range. So if the first two bytes of an XML file are not a UTF-16 BOM, an automated process can safely read the first line of text and determine the encoding of the entire file from there.
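
To make the procedure concrete, here is a rough sketch in Python of what such an automated process might look like. The function name sniff_xml_encoding, the 200-byte lookahead, and the fallback to UTF-8 when no encoding is declared are my own assumptions for illustration. It only covers the cases discussed above (UTF-16 with a BOM, or an ASCII-compatible encoding named in the first line), so treat it as a sketch, not a complete detector.

```python
import re

# Byte-order marks. The UTF-8 one is optional, but some editors write it.
UTF16_BE_BOM = b"\xfe\xff"
UTF16_LE_BOM = b"\xff\xfe"
UTF8_BOM = b"\xef\xbb\xbf"

# Matches encoding="..." or encoding='...' inside a leading XML declaration.
# [^>]*? cannot run past the end of the declaration, so attributes on later
# elements are never mistaken for the declared encoding.
_ENCODING_RE = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of an XML document from its first bytes.

    Hypothetical helper: checks for a UTF-16 BOM first, then reads the
    ASCII-only XML declaration. `default` is an assumed fallback.
    """
    # Step 1: a UTF-16 BOM settles the question (and the byte order).
    if data.startswith(UTF16_BE_BOM):
        return "utf-16-be"
    if data.startswith(UTF16_LE_BOM):
        return "utf-16-le"

    # Step 2: skip an optional UTF-8 BOM so the declaration starts at byte 0.
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]

    # Step 3: the declaration is pure ASCII in ASCII, Latin-1,
    # Windows-1252 and UTF-8 alike, so it can be matched as raw bytes.
    match = _ENCODING_RE.match(data[:200])
    if match:
        return match.group(1).decode("ascii").lower()

    # No BOM and no declared encoding: fall back to the assumed default.
    return default
```

For example, calling sniff_xml_encoding(open("book.xml", "rb").read(200)) on a file that starts with an ISO-8859-1 declaration would hand back the declared name, which can then be passed to open() to read the rest of the file as text. The file name here is only a placeholder.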