Re: Detecting 'narrowest' character set


On Thu, 27 Jun 2002 12:01:09 +0100
Jean-Michel Hiver <jhiver(_at_)mkdoc(_dot_)com> wrote:

Hi List,

  I was wondering if there was some kind of module / code snippet out
  there that could detect the 'narrowest' character set which a unicode
  string can fit into.

  The reason for that is to be able to treat everything as
  Unicode internally while being as friendly as possible with old
  fubared browsers...

I've taken a look on CPAN but I wasn't able to find anything.
Any ideas on how to do that?

Best regards,


How about Dan Kogai's Encode::InCharset on CPAN?
  http://search.cpan.org/search?dist=Encode-InCharset

Though it doesn't detect the 'narrowest' charset by itself,
it lets you check whether a string can be encoded in a given charset.

#!perl
use 5.008;
use Encode::InCharset qw(InASCII InISO_8859_1 InISO_8859_2 InShift_JIS);

my $ascii = pack 'C*', 0..127;
my $lat1  = pack 'C*', 0..255;
my $lat2  = pack 'U*', 0..127, 0x102..0x107;
my $sjis  = pack 'U*', 0..127, 0x3041; # with Hiragana
my $tag   = pack 'U*', 0..127, 0xE0041; # with Tag

print detectEncode($ascii), "\n";
print detectEncode($lat1), "\n";
print detectEncode($lat2), "\n";
print detectEncode($sjis), "\n";
print detectEncode($tag), "\n";

sub detectEncode {
    my $string = shift;

    if ($string !~ /\p{^InASCII}/) { # contains nothing but ASCII
        return "ASCII";
    }
    elsif ($string !~ /\p{^InISO_8859_1}/) { # latin 1
        return "ISO_8859_1";
    }
    elsif ($string !~ /\p{^InISO_8859_2}/) { # latin 2
        return "ISO_8859_2";
    }
    elsif ($string !~ /\p{^InShift_JIS}/) {
        return "Shift_JIS";
    }
    # Want to try more charsets? Add further tests here.
    # The order of the tests could also be tuned.

    return "Unicode"; # none of the charsets tried above fits
}

__END__
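
If installing Encode::InCharset is not an option, the same trial-and-fallback
idea can probably be sketched with the core Encode module alone: try to encode
the string into each candidate and take the first one that does not fail.
This is only a rough sketch, not a tested replacement. The helper name
narrowest_charset and the candidate list (and its order) are my own
assumptions; adjust them to the charsets your browsers actually understand.

```perl
#!perl
use strict;
use warnings;
use Encode ();

# Candidate charsets, narrowest first.
# The list and its order are assumptions; reorder or extend as needed.
my @candidates = qw(ascii iso-8859-1 iso-8859-2 shiftjis);

# narrowest_charset() is a hypothetical helper, not part of any module.
sub narrowest_charset {
    my $string = shift;

    for my $name (@candidates) {
        # With FB_CROAK, encode() dies on the first character that the
        # charset cannot represent. LEAVE_SRC keeps $string unmodified.
        my $ok = eval {
            Encode::encode($name, $string,
                           Encode::FB_CROAK | Encode::LEAVE_SRC);
            1;
        };
        return $name if $ok;
    }

    return 'UTF-8'; # none of the candidates fits
}

print narrowest_charset(pack 'U*', 0x61, 0x3041), "\n";
```

Unlike the regex approach, this encodes the whole string once per trial,
so it is likely slower on long strings, but it needs nothing beyond what
ships with perl 5.8.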

Regards.
SADAHIRO Tomoyuki