perl-unicode

Re: Strange characters when displaying HTML files saved in UTF-8 format

2001-12-17 00:25:13
On Tue, 11 Dec 2001 21:40:36 +0000, awiar(_at_)hotmail(_dot_)com (Jalal Kakavand) wrote:

      my $mydoc = shift;
      # check BOM
      my $top1 = unpack("C", substr($mydoc, 0, 1));
      my $top2 = unpack("C", substr($mydoc, 1, 1));
      my $top3 = unpack("C", substr($mydoc, 2, 1));

      # UTF-8
      if($top1 eq 239 && $top2 eq 187 && $top3 eq 191) {
              $mydoc = substr($mydoc, 3, length($mydoc) - 3);
      }

      return $mydoc;
}

Another way to do it might be:

    my $mydoc = shift;
    my $bom = substr($mydoc, 0, 3);
    # Check for UTF-8 BOM
    if($bom eq "\xef\xbb\xbf") {
        substr($mydoc, 0, 3) = '';
    }
    return $mydoc;

That way, you can compare all three bytes at once. (Your method looks
more like C :) ... except that you used 'eq' for a numeric comparison,
which just looks wrong; '==' is the numeric operator.) And I believe
that by assigning to substr you may avoid copying the entire string,
since Perl can simply remember that the real data starts three bytes
past the first allocated character (using the OOK flag, if you're into
the internals).
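
For completeness, here is a minimal sketch of the above as a complete
sub, plus a numeric variant that uses '==' where your version used 'eq'.
The names strip_utf8_bom and has_utf8_bom_numeric are made up for
illustration, and I am assuming the argument is a string of raw bytes
as read from the file, not one that has already been decoded to
characters.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the byte string without a
# leading UTF-8 BOM (the bytes EF BB BF), if one is present.
sub strip_utf8_bom {
    my $mydoc = shift;
    # One string comparison covers all three bytes.
    if (substr($mydoc, 0, 3) eq "\xef\xbb\xbf") {
        # Four-argument substr replaces the first three bytes with nothing.
        substr($mydoc, 0, 3, '');
    }
    return $mydoc;
}

# Hypothetical numeric variant: unpack yields numbers, so compare with '=='.
sub has_utf8_bom_numeric {
    my $mydoc = shift;
    return 0 if length($mydoc) < 3;
    my ($top1, $top2, $top3) = unpack("C3", $mydoc);
    return ($top1 == 239 && $top2 == 187 && $top3 == 191) ? 1 : 0;
}

my $with_bom    = "\xef\xbb\xbf<html></html>";
my $without_bom = "<html></html>";

print strip_utf8_bom($with_bom), "\n";
print strip_utf8_bom($without_bom), "\n";
print has_utf8_bom_numeric($with_bom), "\n";
```

If you want to check the OOK guess on your own perl, Devel::Peek's
Dump() prints a scalar's flags, so you can look for OOK after the
substr assignment. Whether it shows up may depend on the perl version.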

Cheers,
Philip