In regard to source data for Perl encodings:
I have a lot of experience with encodings and encoding converters on a lot of
platforms, and compatibility between these converters can turn into a huge
mess. Because Perl can be used to create essentially permanent data
repositories, compatibility across platforms is a good thing.
This is why using one single source for code page conversions is important and
why using platform-specific conversions is a bad thing.
One could pick a set of encoding files from a lot of places, but it seems to me
that using the ICU data files has some key advantages:
1) there's someone to complain to when they are wrong, and I'm sure there will
be bugs found in them.
2) ICU's data files are being used by real software, so using them isn't quite
so bleeding edge.
3) they are under source control and have versioning, so you can say 'this
data was created with ICU data files version N.M', or at least someday you
could say that somehow.
If there are licensing issues, I think they can be resolved. If you try to
contact the ICU team about this and you don't get anywhere, please let me know
and I'll try to help.
I believe the best thing long term would be to use ICU for all conversions.
Given this, it makes sense to use the ICU data files in the short run, so that
whatever incompatibility remains is at least controlled.
And I would like to point out that although ICU's converter data file is
largish, it doesn't need to be. It isn't hard to trim it back or even to use
separate table files. Also, ICU doesn't load the whole table data file into
memory at once or some silly thing like that; it tries to be efficient.
So, to summarize, I think there is a need for built-in conversion tables at
this point in time and I believe they should be derived from ICU UCM files.
Also, I believe that only a small set of single-byte encodings should be built
in. Multibyte and ISO-2022 related encodings - even only those used just on the
Internet - are amazingly nasty and best left to something like ICU. Even ICU
doesn't get everything right, but there's a decent process to fix things that
are wrong.
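To make "derived from ICU UCM files" a bit more concrete, here is a minimal
sketch of reading the single-byte entries out of UCM-style text. It is in
Python only for brevity. The function name, the sample table, and the charset
name in it are made up for illustration and are not real ICU data. The meaning
I assume for the trailing flag (|0 is a round-trip mapping, |1 is a fallback
used only when going from Unicode to the code page) is my understanding of the
format and should be checked against the ICU documentation. Other flags and
multibyte entries are simply skipped.

```python
import re

# One single-byte CHARMAP entry, e.g. "<U00E9> \xE9 |0".
# The "|N" flag is treated as optional here and defaults to 0 (assumption).
_ENTRY = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})(?:\s+\|(\d))?\s*$"
)

def parse_single_byte_ucm(text):
    """Return (to_unicode, from_unicode) dicts built from UCM-style text."""
    to_unicode = {}     # byte value -> Unicode code point
    from_unicode = {}   # Unicode code point -> byte value
    in_charmap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap or not line:
            continue                          # header lines and blanks
        m = _ENTRY.match(line)
        if m is None:
            continue                          # multibyte or unrecognized entry
        code_point = int(m.group(1), 16)
        byte = int(m.group(2), 16)
        flag = int(m.group(3) or 0)
        if flag == 0:                         # round trip: both directions
            to_unicode[byte] = code_point
            from_unicode[code_point] = byte
        elif flag == 1:                       # fallback: Unicode -> byte only
            from_unicode.setdefault(code_point, byte)
        # other flags are ignored in this sketch
    return to_unicode, from_unicode

# Hypothetical sample table, not taken from any real ICU file.
SAMPLE_UCM = r"""
<code_set_name> "example-sbcs"
CHARMAP
<U0041> \x41 |0
<U00E9> \xE9 |0
<U2019> \x27 |1   # right single quote falls back to the apostrophe
<U0027> \x27 |0
END CHARMAP
"""

to_u, from_u = parse_single_byte_ucm(SAMPLE_UCM)
```

Tables generated this way could be frozen into the built-in set at build time,
with the ICU data version recorded next to them.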
In regard to the built-in converter engine, it needs these key features
(assuming only single-byte encodings for Western/Central European languages
plus Cyrillic):
1) convert a single-byte code point to Unicode.
2) convert several 'close cousins' from Unicode to a single character code. In
other words, it needs to convert Unicode N to x and Unicode M to x, etc., even
though N and M are separate characters. The ICU UCM files have these alternate
mappings in them.
3) allow Perl internals (at least) to specify what to do when you can't
translate a character from Unicode to the single byte encoding. This is
important because you might introduce interesting bugs/security holes when, for
example, a question mark gets splotched into your regular expression or, if you
strip untranslatables, when the match string becomes empty. This seems like a
potentially big mess to me, but hopefully it really isn't.
4) a non-feature for this language group is converting Unicode combining
sequences (or simply multiple Unicode characters) to single characters in the
code page (and vice versa). Some encodings require this, but none in this
language group. (There are cases where it would be nice, but it isn't a
critical feature.)
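Features 1) through 3) can be sketched roughly as follows. This is only an
illustration in Python, not a proposal for the actual Perl internals: the
tables are tiny and hypothetical, and the names decode_bytes, encode_text,
on_unmappable, and Unmappable are invented for the example. The point is that
the caller, not the converter, decides what happens to an untranslatable
character, and that the default is to fail loudly rather than to drop in a
question mark silently.

```python
class Unmappable(Exception):
    """Raised when a character has no mapping and the policy is 'strict'."""

# Hypothetical tables for illustration only.
TO_UNICODE = {0x41: 0x0041, 0x27: 0x0027, 0x3F: 0x003F}
FROM_UNICODE = {
    0x0041: 0x41,
    0x0027: 0x27,   # apostrophe
    0x2019: 0x27,   # 'close cousin': right single quote maps to the same byte
    0x003F: 0x3F,
}

def decode_bytes(data, to_unicode):
    """Feature 1: map each byte to one Unicode character (U+FFFD if unknown)."""
    return "".join(chr(to_unicode.get(b, 0xFFFD)) for b in data)

def encode_text(text, from_unicode, on_unmappable="strict", substitute=0x3F):
    """Features 2 and 3: many-to-one encoding with an explicit failure policy."""
    out = bytearray()
    for ch in text:
        byte = from_unicode.get(ord(ch))
        if byte is not None:
            out.append(byte)
        elif on_unmappable == "strict":
            raise Unmappable("no mapping for U+%04X" % ord(ch))
        elif on_unmappable == "substitute":
            out.append(substitute)          # caller explicitly asked for this
        elif on_unmappable == "skip":
            continue                        # caller explicitly asked to drop it
        else:
            raise ValueError("unknown policy: %r" % (on_unmappable,))
    return bytes(out)
```

With these tables, both "A'" and "A" followed by U+2019 encode to the same two
bytes. A character with no mapping raises Unmappable by default, while the
"substitute" and "skip" policies reproduce the question mark and empty-string
cases described in 3), but only when the caller asks for them.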
Thanks and regards,
=Ed