perl-unicode

Re: Always setting UTF-8 flag - am I bad?

2004-08-05 02:30:06
Erland Sommarskog <esquel(_at_)sommarskog(_dot_)se> writes:
Jean-Michel Hiver (jhiver(_at_)mkdoc(_dot_)com) writes:
Erland Sommarskog wrote:
I am working with an XS module that passes queries to MS SQL Server and
returns data back using SQLOLEDB. MS SQL Server stores Unicode data
as UTF-16. Also, all metadata is UTF-16.

Currently when I get Unicode data back from SQL Server, I convert it to
UTF-8, stash it in an SV, and then set the UTF-8 flag, without checking
whether this is really necessary.

That should be okay. A reasonably cheap option is to convert to UTF-8
as above, then scan to see if any high bits are set, and only
call SvUTF8_on if they occur. That way pure ASCII isn't "penalized"
by having the UTF-8 flag set.
Converting to iso-8859-1 is the alternative, but note that
NOT setting the UTF-8 flag on high chars (even if representable)
sadly affects the semantics. Unless "locale" is used
(which is a bit alien to Win32), 'Ñ' (N with tilde) etc. are not alpha,
as perl defaults to the C locale.
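
A minimal sketch of the scan suggested above, in plain C. The helper
name has_high_bit is made up for illustration and is not part of the
perl API; it assumes the buffer already holds the converted UTF-8 bytes.

```c
#include <stddef.h>

/* Return 1 if any byte in buf[0..len) has its high bit set (>= 0x80),
 * i.e. the UTF-8 data contains at least one non-ASCII character.
 * Return 0 for pure ASCII (including an empty buffer). */
int has_high_bit(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i;

    for (i = 0; i < len; i++) {
        if (p[i] & 0x80)
            return 1;
    }
    return 0;
}

/* In the XS code the result would gate the flag, roughly:
 *     if (has_high_bit(SvPVX(sv), SvCUR(sv)))
 *         SvUTF8_on(sv);
 */
```

The scan is a single pass over the bytes and stops at the first
non-ASCII byte, so pure ASCII results cost one extra read of the buffer.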

Note too that the normal Windows "latin 1" code page (cp1252) is a
superset of iso-8859-1, so converting to that is wrong: it will encode
the Euro sign, smart quotes, em dash etc. into positions (0x80..0x9f)
where perl does not expect them.
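
If the iso-8859-1 route is taken anyway, one way to avoid the Windows
code page altogether is to check the UTF-16 data directly. This is only
a sketch under that assumption; the function name utf16_fits_latin1 is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if every UTF-16 code unit is <= 0xFF, so the text can be
 * stored as iso-8859-1 by keeping only the low byte of each unit.
 * Anything above that (Euro sign U+20AC, smart quotes, surrogates)
 * returns 0 and has to go down the UTF-8 path instead. */
int utf16_fits_latin1(const uint16_t *units, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (units[i] > 0xFF)
            return 0;
    }
    return 1;
}
```

With this check 'Ñ' (U+00D1) fits, while the Euro sign (U+20AC) does
not, even though cp1252 would have squeezed it into 0x80.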


Personally I try to use Encode as much as possible which does The Right 
Thing for me.

$string = Encode::decode ('utf-16', $octets); is pretty safe.

As far as I recall Encode::decode leaves the SvUTF8 flag on once it 
has done its thing. But Dan may have cleaned that up.


Regarding speed, Encode seems pretty fast to me - but YMMV I guess.

Alright, I failed to say that this is an XS module, so I convert with
WideCharToMultiByte, a Windows routine(*), put the result in an SV, and
then say SvUTF8_on.

The possible danger here is if the "multi byte" encoding for the
user's environment is not UTF-8 but (say) a Japanese one.
Using Encode avoids that.


(*) SQLOLEDB is available on Windows only, so portability is not an issue.
