perl-unicode

Yen/Backslash problems (was Re: Encode::CJKguide)

2002-03-27 12:53:44
On Wednesday 27 March 2002 04:38, Anton Tagunov wrote:
....
  The Yen sign and the backslash tend to be the most troublesome
  characters, as a single codepoint in 8-bit encodings has a tendency
  to be used for both.

Also Korean Won and backslash.

Is this true? 

Yes, the problem exists, but this does not state it clearly, completely or 
correctly. I can be clear and correct, but not totally complete, thus:

JIS-Roman (Lunde, CJKV Information Processing p. 968) and KS-Roman (op. cit. 
p. 970) substitute the Japanese and Korean currency symbols for backslash at 
0x5C. Either of these single-byte encodings can be used together with the 
double-byte encodings for the native characters of these languages. Some 
Japanese and Koreans with influence in the software market have created 
broken character set mappings, allegedly from these encodings to Unicode, 
that map every code in the ASCII range to itself. Thus a 0x5C that was meant 
as the Japanese Yen symbol or the Korean Won symbol ends up at U+005C, the 
backslash. There is no algorithmic way to tell whether any particular 0x5C in 
a text file was supposed to be a backslash, a Yen symbol, or a Won symbol.
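
To make the ambiguity concrete, here is a minimal sketch in Python (not Perl, 
and not the Encode API). The table and the function name are invented for 
illustration. It decodes the same bytes under the three readings of 0x5C 
described above and prints the Unicode name of the resulting character.

```python
import unicodedata

# Hypothetical table: the three readings of byte 0x5C discussed above.
READINGS_OF_0X5C = {
    "ascii": "\u005c",      # REVERSE SOLIDUS (backslash)
    "jis-roman": "\u00a5",  # YEN SIGN
    "ks-roman": "\u20a9",   # WON SIGN
}

def decode_7bit(data: bytes, charset: str) -> str:
    """Decode 7-bit data, choosing the meaning of 0x5C by charset name.

    Every other byte is treated as ASCII, which is a simplification:
    JIS-Roman also differs from ASCII at 0x7E (overline, not tilde).
    """
    sub = READINGS_OF_0X5C[charset]
    return "".join(sub if b == 0x5C else chr(b) for b in data)

raw = b"C:\\100"  # bytes 43 3A 5C 31 30 30
for charset in READINGS_OF_0X5C:
    text = decode_7bit(raw, charset)
    # text[2] is whatever 0x5C became under this reading
    print(charset, text, unicodedata.name(text[2]))
```

The bytes are identical in all three runs; only the label supplied by the 
caller changes the result, which is why nothing in the data itself can settle 
the question.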

I have recently had an extended discussion with a Japanese programmer who 
insists that Unicode is broken because of this conflict, and will not 
consider the possibility of using the Unicode code point U+00A5 for YEN SIGN 
so that we can clean up software and fonts that perpetuate the error.

A big part of the problem is that Microsoft has put a Yen glyph at the 
REVERSE SOLIDUS (backslash) code point in its supposedly Unicode-conformant 
fonts for Japanese, and a Korean Won glyph at the same code point in its 
Korean fonts, thus breaking them for use in any real Unicode context where 
Microsoft-style path names are used. (Note also that recent versions of these 
fonts all contain glyphs for the complete CJK Unified Ideographs block of 
Unicode, so that they are in fact CJK fonts.) To compound this heinous error, 
M$ created a code page for CJK containing this error (I don't have the number 
handy) and uses the broken character mappings.

*<%-[

I have therefore sworn off Microsoft's CJK fonts for any use 
whatsoever. They do have one correct Unicode font, Arial Unicode MS, and 
there are others from many other sources. We still need Free fonts, and there 
is a GNU project to create them.


Stick it somewhere, with a statement that we treat
this codepoint as a backslash, not Yen?

Yes. Point readers to the correct code point, and explain that Microsoft's 
fonts, code pages, and character set converters are all broken on this point.
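
As a rough sketch of what "treat it as backslash, but point to the correct 
code point" could look like in practice, here is a small Python example. The 
function names are hypothetical and this is not how Encode does it. It works 
on text that has already been decoded with 0x5C read as backslash, and 
substitutes U+00A5 or U+20A9 only when the caller asks for it.

```python
YEN_SIGN = "\u00a5"  # U+00A5 YEN SIGN
WON_SIGN = "\u20a9"  # U+20A9 WON SIGN

def backslash_offsets(text):
    """Return the index of every U+005C in already-decoded text."""
    return [i for i, ch in enumerate(text) if ch == "\\"]

def backslashes_to_currency(text, sign):
    """Replace every U+005C with the given currency sign.

    Only appropriate when the caller knows, from outside the data,
    that no 0x5C in the source was meant as a backslash.
    """
    if sign not in (YEN_SIGN, WON_SIGN):
        raise ValueError("expected U+00A5 or U+20A9")
    return text.replace("\\", sign)

price = "\\1,000"  # decoded with 0x5C treated as backslash
print(backslash_offsets(price))
print(backslashes_to_currency(price, YEN_SIGN))
```

The replacement is all-or-nothing on purpose: a text that mixes path names 
and prices cannot be repaired this way, and has to be fixed by hand.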


A KNOWN ISSUES section?

Definitely.
-- 
Edward "ISO MMXXII delenda est" Cherlin
edward(_at_)webforhumans(_dot_)com
Does your Web site work?
