Re: Interpretation of non-UTF8 strings

Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> writes:

Nick Ing-Simmons wrote:

Once we had 

use encoding qw(locale);

But it did not work well, as not all locale implementations
provide an API that returns the encoding.
(And even en_GB can be in ASCII, 8859-1, 8859-15 (with euro), UTF-8, ...)


True.

For the open :locale I opted for an easy (cheesy?) algorithm:
(1) if we have langinfo(), use the return value of langinfo(CODESET).
(2) if we do not have langinfo(), look at %ENV for locale variables
   and look at the part after the dot, and use that value.
(3) Use the value from either (1) or (2) and if Encode recognizes that,
   good.  Otherwise give up.

Or something like that.  (It's documented in the open pragma, somewhere).
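
The three steps above could be sketched roughly as follows.  This is
only an illustration, not the code in the open pragma: the sub names
are made up, and the choice of LC_ALL, LC_CTYPE and LANG (in that order)
as the "locale variables" is an assumption.

```perl
use strict;
use warnings;
use Encode ();

# Step (1): ask langinfo(CODESET), if I18N::Langinfo is usable here.
# Returns undef when the module or the call is not available.
sub codeset_from_langinfo {
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };
    return (defined $codeset && length $codeset) ? $codeset : undef;
}

# Step (2): take the part after the dot from the first locale variable
# that is set.  $env is a hash ref standing in for %ENV.
# (Hypothetical helper; the variable list is an assumption.)
sub codeset_from_env {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value && length $value;
        # Stop at any '@modifier', e.g. en_GB.ISO-8859-15@euro
        return $value =~ /\.([^@]+)/ ? $1 : undef;
    }
    return;
}

# Step (3): accept the name only if Encode recognizes it, else give up.
sub locale_encoding {
    my ($env) = @_;
    my $name = codeset_from_langinfo();
    $name = codeset_from_env($env) unless defined $name;
    return unless defined $name;
    my $enc = Encode::find_encoding($name);
    return $enc ? $enc->name : undef;
}

my $found = locale_encoding(\%ENV);
print defined $found ? "locale encoding: $found\n" : "giving up\n";
```

What it prints depends entirely on the platform and environment, which
is rather the point of the discussion.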


And I was mis-remembering which module it was that had 'locale'.
As I just posted, I think it makes sense that the argument to

use encoding ()

is a literal, as it affects the strings in the code below.  After all,
the strings are in an encoding (determined by the author), while the
locale varies with how the user is running the program.

So in my speech synthesis stuff I had:

use encoding qw(iso-8859-15);

And then it worked right even if I happened to run it in en_GB.utf8 
that day.
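
The same idea (a fixed, author-chosen encoding rather than the user's
locale) can be sketched with an explicit Encode::decode call instead of
the encoding pragma.  This is an illustrative fragment, not the speech
synthesis code; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they would sit in a source file saved as ISO-8859-15:
# 0xA4 is the euro sign in that encoding.
my $bytes = "price: \xA4";

# Decode with the author's fixed encoding; no locale variable is consulted,
# so the result is the same under en_GB.utf8 or any other locale.
my $text = decode('iso-8859-15', $bytes);

# Show the code point of the last character.
printf "U+%04X\n", ord(substr($text, -1));
```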

(By the way, I re-encoded to UTF-8 and changed that to 'use utf8';
 it all still works but needs much more memory and is slower.
 That seems to be the way regexps work, so I will probably switch back.)
