perl-unicode

When are you sure you have picked the right encoding?

2002-03-10 09:47:17
On 2002.03.10, at 23:37, Nick Ing-Simmons wrote:
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
We have to keep binmode for a while as that is what Camel-III describes.
But we could make it an alias or wrapper on something better if we
can agree what would be better.

  IMHO we should prepare OO layers as well, so we can go like

use FileHandle;

STDIN->encoding("iso-2022-jp");

or something like that. This one is definitely more intuitive than binmode(). Or is this already possible? I can find no mention of IO disciplines in the IO::Handle and FileHandle POD.
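As a rough sketch of what such an OO layer could look like: the encoding() method below is hypothetical and not part of IO::Handle. It assumes the method would simply wrap the existing binmode() method with an ":encoding(...)" layer, and it uses EUC-JP on an in-memory handle purely for demonstration.

```perl
use strict;
use warnings;
use IO::File;                 # also loads IO::Handle
use Encode qw(encode);

# Hypothetical method, not part of IO::Handle: a thin wrapper over binmode().
sub IO::Handle::encoding {
    my ($fh, $name) = @_;
    return $fh->binmode(":encoding($name)");
}

# Demo on an in-memory handle holding EUC-JP octets.
my $octets = encode("euc-jp", "\x{65e5}\x{672c}\x{8a9e}\n");
open(my $in, "<", \$octets) or die "cannot open in-memory handle: $!";
$in->encoding("euc-jp") or die "cannot set encoding: $!";
my $line = <$in>;             # now a character string, not raw octets
close $in;
```

With something like this in place, STDIN->encoding("iso-2022-jp") would read the same way as the call above.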

Another problem I would like to raise is that this "encoding on the IO layer" approach works if and only if you know a priori which encoding to use. In real life that is rather rare: in most cases you have no idea which encoding is appropriate until you peek at the contents. Suppose you want to write a program that mirrors web content, but this time you want to convert any given text to UTF-8 so you can build a multilingual index of it. In this case you won't know which encoding to use until you read the Content-Type: header. Okay, so you decide to make your program read as ASCII until the header ends and then use binmode() to switch the encoding according to the header. (I have yet to test whether this is possible, but the idea comes naturally.)

But even this will fail on many classic web sites in Japan that have been in service since HTTP/0.9. In that case you have to resort to code guessing. My humble Jcode implements code guessing, and it works fairly well except for ambiguous cases between EUC-JP and Shift JIS. But this works only because Jcode hubristically assumes that the string in question is in at least some sort of Japanese encoding. My experience with Jcode tells me that code guessing is possible so long as you have some idea of what language is used. Japanese is among the hardest, so it should work for other languages too.

But once again, how are you going to tell what language your string is written in? Some sort of hinting is imperative, and since we are dealing with *external* text here, locale is useless. We still have more than 3 weeks before April Fools' Day, but how about this?

study SCALAR LANGUAGE
        Takes extra time to study SCALAR ("$_" if unspecified)
        in anticipation of doing many pattern matches and charset
        conversion on the string before it is next modified.

This recycled reserved word makes me puke, but it is at least more intuitive than binmode :).
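For what it is worth, the "trust the header, otherwise guess with a hint" idea can be sketched roughly as follows. This is only a sketch under assumptions: decode_body() is a made-up helper name, the message is assumed to be already in memory as raw octets, and the guessing step leans on Encode::Guess rather than Jcode. The list of suspects passed by the caller plays the role of the language hint.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use Encode::Guess;            # provides guess_encoding()

# Hypothetical helper: takes a raw HTTP-style message (octets) plus a list
# of suspect encodings, and returns the body as a character string.
# Returns nothing if there is no body or no encoding can be chosen.
sub decode_body {
    my ($raw, @suspects) = @_;

    # Header lines are treated as plain ASCII; the body follows the first blank line.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    return unless defined $body;

    # 1. Trust a charset declared in Content-Type, if Encode knows it.
    if ($head =~ /^Content-Type:[^\r\n]*?charset\s*=\s*"?([\w.:-]+)/mi) {
        my $enc = find_encoding($1);
        return $enc->decode($body) if $enc;
    }

    # 2. Otherwise guess among the caller-supplied suspects (the "hint").
    #    guess_encoding() returns an object on success, an error string otherwise.
    my $guess = guess_encoding($body, @suspects);
    return ref $guess ? $guess->decode($body) : ();
}
```

A caller expecting Japanese text might write decode_body($raw, qw(euc-jp shiftjis 7bit-jis)); when the guess is ambiguous (the EUC-JP versus Shift JIS case), the helper simply gives up rather than pick one.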

Dan the Man with too Many Encodings to Deal with
