perl-unicode

RE: Source data for perl encodings

2001-01-10 12:08:37
In regard to source data for Perl encodings:

I have worked with encodings and encoding converters on many platforms, and 
incompatibilities between these converters can turn into a huge mess. Because 
Perl can be used to create essentially permanent data repositories, data 
written on one platform needs to convert the same way on every other.

This is why using a single source for code page conversions is important, and 
why relying on platform-specific conversions is a bad idea.

One could pick a set of encoding files from many places, but it seems to me 
that using the ICU data files has some key advantages:

1) There is someone to complain to when they are wrong, and I'm sure bugs will 
be found in them.
2) ICU's data files are used by real software, so using them isn't quite so 
bleeding edge.
3) They are under source control and versioned, so you can say 'this data was 
created with ICU data files version N.M', or at least someday you will be able 
to say that somehow.

If there are licensing issues, I think they can be resolved. If you try to 
contact the ICU team about this and you don't get anywhere, please let me know 
and I'll try to help.

I believe the best thing long term would be to use ICU for all conversions. 
Given this, it makes sense to use the ICU data files in the short run, so that 
whatever incompatibility does creep in is at least controlled.

I would also like to point out that although ICU's converter data file is 
largish, it doesn't need to be. It isn't hard to trim it back, or even to use 
separate table files. Also, ICU doesn't load the whole table data file into 
memory at once or some silly thing like that. It tries to be efficient.

So, to summarize, I think there is a need for built-in conversion tables right 
now, and I believe they should be derived from ICU UCM files. I also believe 
that only a small set of single-byte encodings should be built in. Multibyte 
and ISO-2022 related encodings, even just the ones used on the Internet, are 
amazingly nasty and best left to something like ICU. Even ICU doesn't get 
everything right, but there's a decent process for fixing things that are 
wrong.
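
To make "derived from ICU UCM files" a bit more concrete, here is a rough 
sketch (in Python, only for brevity) of pulling a single-byte table out of 
UCM-style text. The sample data is made up for illustration and is not a real 
ICU file. The function name parse_ucm is hypothetical, and the reading of the 
trailing flags (|0 round trip, |1 fallback from Unicode only, |3 reverse 
fallback to Unicode only) is my understanding of the format, so check it 
against the ICU documentation before relying on it.

```python
import re

# Matches a single-byte mapping line such as: <U00E9> \xE9 |0
_LINE = re.compile(r"^<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s+\|([0-3])\s*$")

def parse_ucm(text):
    """Return (decode, encode) dicts built from the CHARMAP section."""
    decode = {}   # byte value -> Unicode code point
    encode = {}   # Unicode code point -> byte value
    in_charmap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap or not line:
            continue
        m = _LINE.match(line)
        if m is None:
            continue   # multibyte or unrecognized line: out of scope here
        cp = int(m.group(1), 16)
        byte = int(m.group(2), 16)
        flag = m.group(3)
        if flag in ("0", "3"):   # usable when converting bytes to Unicode
            decode[byte] = cp
        if flag in ("0", "1"):   # usable when converting Unicode to bytes
            encode[cp] = byte
    return decode, encode

# Made-up sample in UCM style (assumption: not taken from any ICU file).
SAMPLE = """\
<code_set_name> "hypothetical-sample"
CHARMAP
<U0041> \\x41 |0   # LATIN CAPITAL LETTER A
<U00E9> \\xE9 |0   # LATIN SMALL LETTER E WITH ACUTE
<U2019> \\x27 |1   # RIGHT SINGLE QUOTATION MARK falls back to apostrophe
<U0027> \\x27 |0   # APOSTROPHE
END CHARMAP
"""

decode, encode = parse_ucm(SAMPLE)
```

The point of keeping the two directions in separate tables is that the 
fallback line only ever adds an entry to encode; decoding byte 0x27 still 
yields the plain apostrophe.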

In regard to the built-in converter engine, it needs these key features 
(assuming single byte and Western/Central European languages plus Cyrillic 
only):

1) Convert a single-byte code point to Unicode.
2) Convert several 'close cousins' from Unicode to a single character code. In 
other words, it needs to convert Unicode N to x and Unicode M to x, etc., with 
each Unicode character mapped individually. The ICU UCM files have these 
alternate mappings in them. 
3) Allow Perl internals (at least) to specify what to do when a character 
can't be translated from Unicode to the single-byte encoding. This is 
important because you might introduce interesting bugs/security holes when, for 
example, a question mark gets splotched into your regular expression or, if you 
strip untranslatables, when the match string becomes empty. This seems like a 
potentially big mess to me, but hopefully it really isn't.
4) A non-feature for this language group is converting Unicode combining 
sequences (or simply multiple Unicode characters) to single characters in the 
code page (and vice versa). This is required for some encodings, but not for 
this language group. (There are cases where it would be nice, but it isn't a 
critical feature.)
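
Features 1 to 3 above can be sketched roughly as follows (again in Python, 
only for brevity). This is an illustration under assumptions, not a proposal 
for the actual Perl internals: the class SingleByteCodec, the handler names 
strict, substitute and skip, and the tiny tables are all hypothetical.

```python
class Unmappable(Exception):
    """Raised by the strict handler for an untranslatable character."""

def strict(cp):
    # Refuse to guess: the caller must deal with the character.
    raise Unmappable("U+%04X" % cp)

def substitute(cp):
    # Replace the character with a question mark.
    return b"?"

def skip(cp):
    # Drop the character entirely.
    return b""

class SingleByteCodec:
    def __init__(self, decode, fallbacks=None):
        # decode: byte value -> Unicode code point (round-trip mappings)
        self.decode_table = dict(decode)
        self.encode_table = {cp: b for b, cp in decode.items()}
        # fallbacks: extra 'close cousin' code points -> byte value;
        # they never override a round-trip mapping.
        for cp, b in (fallbacks or {}).items():
            self.encode_table.setdefault(cp, b)

    def decode(self, data):
        # Feature 1: single byte to Unicode (U+FFFD for undefined bytes).
        return "".join(chr(self.decode_table.get(b, 0xFFFD)) for b in data)

    def encode(self, text, on_unmappable=strict):
        # Features 2 and 3: cousins share a byte, and the caller chooses
        # what happens to characters that have no mapping at all.
        out = bytearray()
        for ch in text:
            b = self.encode_table.get(ord(ch))
            if b is None:
                out += on_unmappable(ord(ch))
            else:
                out.append(b)
        return bytes(out)

# Hypothetical three-character code page plus one fallback mapping.
codec = SingleByteCodec({0x41: 0x41, 0x27: 0x27, 0xE9: 0xE9},
                        fallbacks={0x2019: 0x27})
```

With this toy table, codec.encode("A\u2019") gives b"A'", while a euro sign 
either raises Unmappable (the default), becomes "?" with substitute, or 
vanishes with skip. Making strict the default is a design choice in this 
sketch: the silent question mark and the silently empty string are exactly the 
cases described in item 3, so the caller has to opt in to them.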

Thanks and regards,

=Ed



