perl-unicode

Re: perlunicode comment - when Unicode does not happen

2003-12-23 13:30:06
Ed Batutis <ed(_at_)batutis(_dot_)com> writes:

The point I'm trying to make (agreeing with most perl 5 porters I suspect)
is that supporting Shift-JIS in Perl5 is hopeless. 

I seem to recall my Japanese colleagues at TI using it years ago...
just treating it as octets, and with a 'jperl' which did a little more.


We may reach the point where it makes sense to have a pragma
which enables auto encode/decode of args to system calls, but

I'd suggest taking some code from ICU or Mozilla that tries to figure out
what the platform encoding is. 

There may be licensing issues with that - perl needs a commercial-use 
friendly license. But I think ICU is being used for perl6.
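On POSIX-ish systems a large part of that detection needs no borrowed code at all: the C library will report the locale's codeset, and Encode can say whether it supports the answer. A minimal sketch (the fallback-to-undef policy is my assumption, matching the "leave things as they are" behaviour discussed below):

```perl
use strict;
use warnings;
use Encode ();
use POSIX qw(setlocale LC_ALL);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the user's locale so langinfo() reports the real codeset
# rather than the "C" locale's default.
setlocale(LC_ALL, "");

# Returns a canonical Encode name for the platform encoding, or undef
# when Encode has no support for whatever the C library reports.
sub platform_encoding {
    my $codeset = langinfo(CODESET);        # e.g. "UTF-8", "Shift_JIS"
    my $enc = Encode::find_encoding($codeset);
    return $enc ? $enc->name : undef;       # undef => leave things alone
}
```

This covers only the easy (libc-knows) cases; the ICU/Mozilla-style sniffing would still be needed where the locale machinery is absent or lying.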

Then, Perl can do a utf8/platform encoding
conversion before/after the file-system related calls. In the (many -
although way less popular) cases where the platform-encoding detection code
just can't do it (or Encode doesn't support the answer), 

Encode is a module - it is WAY easier to adapt than either perl itself 
or OS expectations.

Perl just leaves
things the way they are today. A 'use system-encoding "foo";' pragma would
provide an escape hatch. This solution doesn't break anything and makes at
least 90% of the world (reasonably) happy.
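No such pragma exists, but the mechanics it implies are simple enough to sketch. The helper name, and the idea of a per-program override of the detected encoding, are hypothetical stand-ins for whatever a 'use system-encoding' pragma would actually provide:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for the pragma's setting: either auto-detected or forced
# by something like  use system-encoding "shiftjis";
my $SYSTEM_ENCODING = "UTF-8";

# Hypothetical helper: turn a (possibly UTF-8) Perl character string
# into the octets the OS expects before a filesystem call sees it.
sub sys_path {
    my ($name) = @_;
    return encode($SYSTEM_ENCODING, $name);
}

# A file test would then operate on octets, not Perl's internal form:
# if (-d sys_path($dirname)) { ... }
```

The escape hatch falls out naturally: when detection fails, $SYSTEM_ENCODING stays unset and sys_path degrades to passing the string through untouched - i.e. today's behaviour.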

That is all a bit vague to be called a "solution", but in essence it describes 
what I think we will end up with eventually. Whether things settle down
enough for that to happen in perl5 is unclear. 

Whoever submits the patches
is going to have to get them "approved" on this list for the international 
issues - bald statements that 90% of the world is happy will not be 
accepted - and on perl5-porters for the technical issues.


I don't think we understand common practice (or that such practices
are even established) well enough to specify that yet.

I may be misunderstanding your point, but I don't see "common practice"
bearing on this. 

UTF-8 in Perl is new - and currently it is dead in the
water for things like "-d" - so why not just fix it?

Because we don't know how, because the "common practice" isn't established.
If we "just fix it" now the behaviour will be tied down and when the 
"common practice" is established we will not be able to support it. 

When _I_ want Unicode named things on Linux I just put file names in UTF-8.

Suits me fine, but it is not going to mesh with my locale setting, because 
I am going to leave that as en_GB - otherwise piles of legacy C apps get ill.
Now when I have samba-mounted a WinXP file system that is wrong; the same 
most likely goes for CD-ROMs. This mess will converge some more - I can 
already see that happening.

_My_ gut feeling is that on Linux at least the way forward is to 
pass the UTF-8 string through -d - and indeed possibly "upgrade" the string 
to UTF-8 if it has high-bit octets.
But you seem to be making the case that UTF-8 should be converted to 
some "local" multi-byte encoding - which of the two is the "common practice"?
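The two positions differ concretely in which octets reach the syscall. A sketch, using iso-8859-1 purely as a stand-in for whatever the local encoding happens to be:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $name = "caf\x{e9}";    # a character string with one high-bit character

# Position 1: hand the kernel UTF-8 octets, regardless of locale.
my $as_utf8 = encode("UTF-8", $name);          # "caf\xc3\xa9" - 5 octets

# Position 2: convert to the platform's local encoding first.
my $as_local = encode("iso-8859-1", $name);    # "caf\xe9" - 4 octets

# Same name, different octet sequences: a -d on one set of octets can
# fail to find a directory that the other set names correctly.
```

Which sequence the filesystem actually contains is exactly the "common practice" question - and it differs between a native ext2/3 volume, a samba mount, and a CD-ROM.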


