On Wednesday 27 March 2002 04:38, Anton Tagunov wrote:
....
The Yen sign and the backslash tend to be the most troublesome
characters as a single codepoint in 8-bit encodings has a tendency
to be used for both.
Also Korean Won and backslash.
Is this true?
Yes, the problem exists, but this does not state it clearly, completely or
correctly. I can be clear and correct, but not totally complete, thus:
JIS-Roman (Lunde, CJKV Information Processing p. 968) and KS-Roman (op. cit.
p. 970) substitute the Japanese and Korean currency symbols for backslash at
0x5C. It is possible to use either of these 8-bit encodings together with
16-bit encodings for the native characters in these languages. Some Japanese
and Koreans with influence in the software market have created broken
character set mappings, allegedly from these encodings to Unicode, that map
every 8-bit code in the ASCII range to itself. Thus either the Japanese Yen
symbol or the Korean Won symbol can end up mapped to backslash. There is no
algorithmic way to tell if any particular 0x5C in a text file was supposed to
be a backslash, a Yen symbol, or a Won symbol.
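To make the ambiguity concrete, here is a minimal Python sketch. The variant
labels and the decode_7bit helper are hypothetical names of my own, not part
of any library, and only the 0x5C difference is modelled. The point it shows
is that the same bytes give three different texts, and the reading has to be
declared from outside because nothing in the data selects it.

```python
# Hypothetical labels for three readings of byte 0x5C. Only that one
# byte is modelled here; every other 7-bit code is treated as ASCII.
BYTE_5C = {
    "ascii": "\u005c",      # REVERSE SOLIDUS (backslash)
    "jis-roman": "\u00a5",  # YEN SIGN
    "ks-roman": "\u20a9",   # WON SIGN
}


def decode_7bit(data: bytes, variant: str) -> str:
    """Decode 7-bit text, reading 0x5C according to the declared variant."""
    replacement = BYTE_5C[variant]
    text = data.decode("ascii")  # raises UnicodeDecodeError on bytes >= 0x80
    return text.replace("\\", replacement)


# The same bytes, read three ways: a path separator or a currency symbol.
sample = b"C:\\temp costs \\100"
for variant in BYTE_5C:
    print(variant, decode_7bit(sample, variant))
```

A converter that silently picks the "ascii" reading for JIS-Roman or KS-Roman
input is making exactly the broken mapping described above.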
I have recently had an extended discussion with a Japanese programmer who
insists that Unicode is broken because of this conflict, and will not
consider the possibility of using the Unicode code point U+00A5 for YEN SIGN
so that we can clean up software and fonts that perpetuate the error.
A big part of the problem is that Microsoft has put a Yen glyph at the
REVERSE SOLIDUS (backslash) code point in its supposedly Unicode-conformant
fonts for Japanese, and a Korean Won glyph at the same code point in its
Korean fonts, thus breaking them for use in any real Unicode context where
Microsoft-style path names are used. (Note also that recent versions of these
fonts all contain glyphs for the complete CJK Unified Ideographs block of
Unicode, so that they are in fact CJK fonts.) To compound this heinous error,
M$ created a code page for CJK containing this error (I don't have the number
handy) and uses the broken character mappings.
*<%-[
I have therefore sworn off Microsoft's CJK fonts for any use
whatsoever. They do have one correct Unicode font, Arial Unicode MS, and
there are others from many other sources. We still need Free fonts, and there
is a GNU project to create them.
Stick it somewhere, with a statement that we treat
this codepoint as a backslash, not Yen?
Yes. Point readers to the correct code point, and explain that Microsoft's
fonts, code pages, and character set converters are all broken on this point.
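For the note itself, the three code points can be listed by their Unicode
names. This is only a sketch using Python's standard unicodedata module; the
describe helper is a name I made up for illustration.

```python
import unicodedata

# The three characters that get confused at byte 0x5C.
CODE_POINTS = [0x005C, 0x00A5, 0x20A9]


def describe(cp: int) -> str:
    """Format a code point as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"


for cp in CODE_POINTS:
    print(describe(cp))
# Prints:
# U+005C REVERSE SOLIDUS
# U+00A5 YEN SIGN
# U+20A9 WON SIGN
```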
A KNOWN ISSUES section?
Definitely.
--
Edward "ISO MMXXII delenda est" Cherlin
edward(_at_)webforhumans(_dot_)com
Does your Web site work?