Re: How best to handle utf8 in octetty module


peter(_dot_)billam(_at_)dpiwe(_dot_)tas(_dot_)gov(_dot_)au said:

The encryption, of course, works with octets. I've just (version 2.13)
introduced a first attempt at handling utf8 string arguments; this is
still undocumented so I can change it if there's a better way. Currently,
at the top of sub encrypt, there is:

      use bytes;
      ...
   sub encrypt { my ($str,$key)=@_;
      if ($] > 5.007 && Encode::is_utf8($str)) {
         Encode::_utf8_off($str);
         # $str = Encode::encode_utf8($str);
      }
              ...

Is this the right sort of way to do it (e.g. functionality, portability)?


The man page for Encode still says that twiddling the utf8 flag yourself 
involves "messing with internals" that might change in later releases.  
(Maybe someone else will comment on that.)  Personally, I'd go for the 
"Encode::encode_utf8($str)" in order to get an unaltered copy of the text 
with the flag turned off.
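
A minimal sketch of the encode_utf8 route, in case it helps. The helper
name to_octets is made up for illustration and is not part of the module
under discussion; it keeps the same is_utf8 test as the snippet quoted
above, but returns a fresh octet copy rather than flipping the flag.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the module being discussed): return an
# octet copy of $str and leave the caller's string untouched.
sub to_octets {
    my ($str) = @_;
    # Only flagged (character) strings are encoded; anything else is
    # assumed to be octets already and is passed through unchanged.
    return Encode::is_utf8($str) ? Encode::encode_utf8($str) : $str;
}

my $text   = "caf\x{e9} \x{263A}";   # \x{263A} forces the utf8 flag on
my $octets = to_octets($text);

# The character string has 6 characters; its UTF-8 form has 9 octets.
printf "chars=%d octets=%d flagged=%s\n",
    length($text), length($octets),
    (Encode::is_utf8($octets) ? "yes" : "no");
```

The original $text is left as it was, so the caller can keep using it as a
character string after encrypting.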

It means that after decrypting again the is_utf8 information is lost; but
I don't see a way round that because 1) Perl's not the only language
involved, 2) putting encoding information into the cyphertext would break
backward compatibility and give an attacker a known-plaintext attack.


I have seen a lot of people putting a BOM (byte-order mark, U+FEFF) at the 
start of Unicode text, even when encoding it as utf8 (where it shows up as 
the three-byte sequence EF BB BF).  So if you're encrypting a utf8 string, 
you could just make sure there's a proper BOM at the start of it.  Then, 
when you decrypt in a Perl script and see a BOM rendered as that three-byte 
sequence, you know you can decode the octets from utf8 into Perl characters.

Hopefully, that's not too much of a giveaway for attackers, since it's 
only three bytes, and it might not be predictable whether there would be a 
BOM in a given cipher text.
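
Roughly, the BOM idea could look like the sketch below. Both helper names
(mark_utf8, unmark_utf8) are invented for illustration; the encryption
step itself is left out, and the fallback to raw octets on malformed input
is my own assumption about what would be sensible.

```perl
use strict;
use warnings;
use Encode ();

my $BOM_OCTETS = "\xEF\xBB\xBF";    # U+FEFF encoded as UTF-8

# Hypothetical helper: produce the octets to encrypt, prefixed with a
# BOM so the decrypting side can tell the plaintext was UTF-8 text.
sub mark_utf8 {
    my ($str) = @_;
    return $BOM_OCTETS . Encode::encode_utf8($str);
}

# Hypothetical helper: given decrypted octets, strip a leading BOM and
# decode the rest; without a BOM the octets come back unchanged.
sub unmark_utf8 {
    my ($octets) = @_;
    return $octets unless index($octets, $BOM_OCTETS) == 0;

    # decode() with a CHECK value may modify its input, so work on a copy.
    my $body  = substr($octets, length $BOM_OCTETS);
    my $chars = eval { Encode::decode('UTF-8', $body, Encode::FB_CROAK) };

    # Assumption: if the body is not well-formed UTF-8, hand back the
    # raw octets rather than dying.
    return defined $chars ? $chars : $octets;
}
```

One consequence of this scheme is that binary plaintext which happens to
begin with EF BB BF and is otherwise valid utf8 would be decoded as well.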

Would it be worth giving sub decrypt an option to decode the plaintext
into Perl's internal form (if it's well-formed), or should I leave that
to the user and the Encode module ?


If the plaintext is not utf8 (and not ascii), you have to leave it to the
user and the Encode module.  If it is utf8, I think it'll be a great
benefit to Perl users to decode the plaintext from utf8 into Perl's
internal form before returning it.  Providing an optional arg to the
decrypt sub to control how it handles the utf8 flag sounds like a good idea.
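
As a rough illustration of such an option: the wrapper below is purely
hypothetical (the name decrypt_text, the decode_utf8 option, and the code
ref standing in for the real octet-level decrypt are all assumptions, not
the module's interface).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: $decrypt is a code ref standing in for the
# module's octet-level decrypt routine.
sub decrypt_text {
    my ($decrypt, $ciphertext, $key, %opt) = @_;
    my $octets = $decrypt->($ciphertext, $key);

    # Default behaviour: return octets, exactly as before.
    return $octets unless $opt{decode_utf8};

    # decode() with a CHECK value may modify its input, so use a copy.
    my $copy  = $octets;
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };

    # Not well-formed UTF-8: leave the plaintext as octets.
    return defined $chars ? $chars : $octets;
}
```

Called without the option, the wrapper behaves like a plain octet decrypt,
so existing callers would not see any change; passing decode_utf8 => 1
asks for a character string when the plaintext is well-formed.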

        Dave Graff