Hello Dave,
Some very good questions.
At 00:27 07/05/19, Dave Singer wrote:
For a lot of these encodings, of course, the initial string is identical (all
the ones which have an 'ascii' core). UTF-16 uses twice the bytes etc.
But in general, given a MIME type with a "+xml" suffix, an XML reader should
be prepared to do what?
At the minimum, handle it if it's UTF-8 or UTF-16 (with BOM in the latter
case). Everything else is optional.
I think I am reading "treat the resource as being, in turn, all the encodings
you know of, and if you treat it as an encoding, do you find a confirming
"encoding" attribute?"
My reading of Appendix F of the XML Spec would be somewhat different.
(See http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info.)
First, it's not character encodings, but character encoding families,
that you try. This makes this process quite a bit faster.
Second, that appendix gives a list of character encoding families.
As the appendix is non-normative, it doesn't necessarily exclude
other character encoding families, but there aren't really any other
character encoding families that I know of.
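
To illustrate why trying families is quick, here is a rough Java sketch
of the initial byte patterns tabulated in Appendix F. The class name
EncodingFamilySniffer, the Family enum, and the choice to group all
32-bit cases under one label are assumptions made for illustration;
they do not come from the spec. A real parser would go on to read the
encoding declaration within the detected family.

```java
final class EncodingFamilySniffer {

    // Hypothetical labels for the families listed in Appendix F.
    enum Family { UCS4, UTF16_BE, UTF16_LE, UTF8_BOM, ASCII_COMPATIBLE, EBCDIC, UNKNOWN }

    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    static Family sniff(byte[] b) {
        // 32-bit cases first, so that FF FE 00 00 is not mistaken for UTF-16.
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF) || startsWith(b, 0xFF, 0xFE, 0x00, 0x00)
                || startsWith(b, 0x00, 0x00, 0xFF, 0xFE) || startsWith(b, 0xFE, 0xFF, 0x00, 0x00)
                || startsWith(b, 0x00, 0x00, 0x00, 0x3C) || startsWith(b, 0x3C, 0x00, 0x00, 0x00)
                || startsWith(b, 0x00, 0x00, 0x3C, 0x00) || startsWith(b, 0x00, 0x3C, 0x00, 0x00)) {
            return Family.UCS4;
        }
        // UTF-16 with BOM, or "<?" in 16-bit code units without BOM.
        if (startsWith(b, 0xFE, 0xFF) || startsWith(b, 0x00, 0x3C, 0x00, 0x3F)) {
            return Family.UTF16_BE;
        }
        if (startsWith(b, 0xFF, 0xFE) || startsWith(b, 0x3C, 0x00, 0x3F, 0x00)) {
            return Family.UTF16_LE;
        }
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) {
            return Family.UTF8_BOM;
        }
        // "<?xm" with ASCII in its usual positions: UTF-8, ISO 8859-x,
        // Shift_JIS, EUC and so on; the declaration picks the exact one.
        if (startsWith(b, 0x3C, 0x3F, 0x78, 0x6D)) {
            return Family.ASCII_COMPATIBLE;
        }
        // "<?xm" in EBCDIC.
        if (startsWith(b, 0x4C, 0x6F, 0xA7, 0x94)) {
            return Family.EBCDIC;
        }
        // Appendix F treats this case as UTF-8 without a declaration,
        // or as something the parser cannot handle.
        return Family.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] sample = "<?xml version='1.0' encoding='Shift_JIS'?>"
                .getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

Only a handful of comparisons on the first four bytes are needed, which
is the speed advantage over trying each registered encoding in turn.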
Which means that encoding='EBCDIC' (I made that up, by the way) would work?
You didn't have to make that up. EBCDIC as a family is listed in said
appendix. The IANA charset registry lists close to 50 EBCDIC variants
(see http://www.iana.org/assignments/character-sets). I guess that
for the 'original' EBCDIC, you'd have to write encoding='EBCDIC-US'.
Is a ZIP compressed XML file servable under a +xml MIME type?
"encoding='zipped Shift_JIS'"?
First, the encoding names allowed in the XML spec don't permit spaces
(see http://www.w3.org/TR/REC-xml/#NT-EncName), but that's a detail.
Second, I'm not familiar with ZIP encoding, but I guess that it
does not start with any of the byte sequences mentioned in Appendix F.
The third point is that ZIP files are archives, not compressions of
single files. So you would have to restrict this kind of thing to
archives containing single files.
Fourth, ZIP files don't have any way to identify internal character
encodings. And polluting the charset space with zipped_foo,
zipped_bar, ... does not look like a good idea.
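
The first two points can be checked mechanically. The following is a
minimal Java sketch, assuming the EncName production from the XML spec
translated into a regular expression, and using java.util.zip to build
a one-entry archive in memory so that its leading bytes can be
inspected. The class and method names (EncNameAndZipCheck, isEncName,
zipSingleEntry, startsLikeXmlDeclaration) are hypothetical, and only
the ASCII-compatible "<?xm" pattern is compared here, not the full
Appendix F table.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class EncNameAndZipCheck {

    // EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
    private static final Pattern ENC_NAME = Pattern.compile("[A-Za-z][A-Za-z0-9._-]*");

    static boolean isEncName(String name) {
        return ENC_NAME.matcher(name).matches();
    }

    // Builds a ZIP archive holding exactly one entry, entirely in memory.
    static byte[] zipSingleEntry(String entryName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // True if the data begins with "<?xm" in an ASCII-compatible encoding.
    static boolean startsLikeXmlDeclaration(byte[] data) {
        return data.length >= 4
                && data[0] == 0x3C && data[1] == 0x3F
                && data[2] == 0x78 && data[3] == 0x6D;
    }

    public static void main(String[] args) {
        System.out.println(isEncName("Shift_JIS"));        // matches the production
        System.out.println(isEncName("zipped Shift_JIS")); // contains a space

        byte[] xml = "<?xml version='1.0'?><doc/>".getBytes(StandardCharsets.US_ASCII);
        byte[] zipped = zipSingleEntry("doc.xml", xml);
        // A ZIP archive opens with the local file header signature "PK" 03 04.
        System.out.printf("%02X %02X %02X %02X%n",
                zipped[0] & 0xFF, zipped[1] & 0xFF, zipped[2] & 0xFF, zipped[3] & 0xFF);
        System.out.println(startsLikeXmlDeclaration(zipped));
    }
}
```

Note that a name such as zipped_foo would pass the syntactic check,
so the EncName grammar alone does nothing to keep such labels out.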
'Semantic' encodings (e.g. MPEG BiM, which uses the schema to be able to
compact the XML) are even greyer; the 'encoding=' is inserted by the BiM
decoder, so what does it say then? I think the 'sanity check' has to be not
that the resulting 'encoding=' says BiM, but that the BiM decode worked; it
makes no sense for the BiM decoder to produce a text document that says
"encoding='BiM'"!
Well, actually this is less of a problem. Or, to put it the other way
round, it's a problem that already turns up with plain old character
encodings.
The easiest way to understand this is to think in terms of Java, because
Java has a very clear distinction between byte sequences (Streams) and
character sequences (Readers/Writers).
The whole encoding stuff is important as long as you are on the byte
level. Once the decoding is done, you have external information about
the encoding (in Java, you know it's UTF-16), so the encoding
pseudo-attribute in the XML declaration becomes irrelevant.
That's how the implementations I know handle this: you can hand
an XML document from a Reader or a String to a Java XML parser,
and the characters in there might read: encoding='shift_jis',
but that's just ignored. There is not too much in the spec
that defines this explicitly, but it's pretty difficult to do
otherwise.
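
As a small illustration of the Reader case, here is a sketch using the
JAXP DOM parser that ships with the JDK. The class and method names
(ReaderIgnoresDeclaration, parseRootText) are made up for this example.
The document is handed over as characters through a StringReader, and
its declaration claims shift_jis even though no Shift_JIS bytes are
involved anywhere. The SAX InputSource documentation states that a
parser reads a character stream directly and disregards any encoding
declaration found in it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ReaderIgnoresDeclaration {

    // Parses XML that is already a character sequence and returns the
    // text content of the root element.
    static String parseRootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // InputSource wraps a Reader here, so decoding has already
            // happened before the parser sees the data.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Three kanji, written as escapes to keep this source file ASCII.
        String text = "\u65E5\u672C\u8A9E";
        String xml = "<?xml version='1.0' encoding='shift_jis'?><doc>" + text + "</doc>";
        System.out.println(parseRootText(xml).equals(text));
    }
}
```

If the same declaration arrived on a byte stream instead, the parser
would have to take it seriously, which is exactly the byte-level case
discussed above.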
This is all well off-topic for MPEG-21 of course, but by exploring these edge
cases we might get some clarity on +xml, which would be a Good Thing.
Yes indeed. Thanks a lot.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp
mailto:duerst(_at_)it(_dot_)aoyama(_dot_)ac(_dot_)jp