Dear Perl Unicode Experts,
I have been looking at how much work it would take to get
the URI and URI::Escape modules to support IRIs in a reasonable
way. The IRI spec has just been published as an IETF Proposed
Standard at http://www.ietf.org/rfc/rfc3987.txt. Also, a new
version of the URI spec is now Internet Standard 66 and is
available at http://www.ietf.org/rfc/rfc3986.txt.
I'm looking for two things:
a) short-term, how to get IRI support using the above and maybe
some additional modules
b) long-term, how to make these modules (and maybe others)
work with IRIs as well as with the new URI spec
Support for these new specs mainly includes the following things:
1) Escaping with %hh is based on UTF-8, not some local character
encoding
2) URIs now allow %hh in the host name part, and require that
it be interpreted as UTF-8
3) IDNs (i.e. conversion to punycode, and if possible also
nameprep/stringprep) should be supported
4) The user of e.g. the URI module should ideally only have to
deal with one form of the URI/IRI, the one used to construct
the URI/IRI, although it should be possible to create other
forms (e.g. a fully %-encoded URI, an IRI that contains
as few %hh as possible)
5) It should be possible to apply normalization operations
as described in the IRI spec to different parts of a URI/IRI
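To make point 1) concrete, here is a minimal sketch of %hh escaping that always goes through UTF-8, independent of how Perl happens to store the string internally. It assumes Perl 5.8 or later (for the core Encode module); the helper name utf8_percent_escape is hypothetical and not part of URI or URI::Escape, and the set of characters left unescaped is the "unreserved" set from RFC 3986.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not part of URI::Escape: percent-encode a
# character string based on its UTF-8 octets.
sub utf8_percent_escape {
    my ($chars) = @_;
    # Convert characters to UTF-8 octets explicitly.
    my $octets = encode('UTF-8', $chars);
    # Escape every octet outside the RFC 3986 "unreserved" set.
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

print utf8_percent_escape("\x{FD}"),  "\n";   # prints %C3%BD
print utf8_percent_escape("\x{370}"), "\n";   # prints %CD%B0
```

Newer releases of URI::Escape also provide uri_escape_utf8, which may cover the same ground without a hand-written helper.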
I started with some very simple (I thought) tests, but got
completely confused very quickly. Here is the short program
that I was using:
>>>> test.pl
use utf8;
use URI;
use URI::Escape;
print (uri_escape("\xFD") . "\n");
print (iri_escape("\xFD") . "\n");
print (uri_escape("\x{FD}") . "\n");
print (iri_escape("\x{FD}") . "\n");
print (uri_escape("\x{370}") . "\n");
print (iri_escape("\x{370}") . "\n");
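# Trick: prepend U+0370 so the whole string is stored as UTF-8
# internally, escape it, then drop the 6 characters "%CD%B0"
# that the prepended character produced.
sub iri_escape
{
return substr (uri_escape("\x{370}".shift), 6);
}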
>>>>
With this, on perl, v5.6.1 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail), I get
>>>>
%FD
%C3%BD
%C3%BD
%C3%BD
%CD%B0
%CD%B0
>>>>
which seems to show that the trick with adding a non-Latin-1
character and then removing its escaped form works (compare
the first line to the second line).
However, on perl, v5.8.4 built for i386-linux-thread-multi,
I get:
>>>>
%FD
%FD
>>>>
Nothing seems to work anymore, although (or because?) 5.8
has better Unicode support.
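One thing that might help narrow this down (a guess, not a diagnosis): the trick relies on Perl's internal string representation, which is not visible at the string level. The sketch below, assuming Perl 5.8.1 or later for utf8::is_utf8, shows that "\xFD" with and without the prepended character compare equal, yet differ in the internal UTF-8 flag. The variable names are only for illustration.

```perl
use strict;
use warnings;

# One character, U+00FD, stored as a single byte internally.
my $plain = "\xFD";

# The same character after the prepend-and-remove trick; the
# concatenation with U+0370 upgrades the internal storage to UTF-8.
my $upgraded = substr("\x{370}" . "\xFD", 1);

print "equal:         ", ($plain eq $upgraded ? "yes" : "no"), "\n";
print "plain flag:    ", (utf8::is_utf8($plain)    ? 1 : 0), "\n";
print "upgraded flag: ", (utf8::is_utf8($upgraded) ? 1 : 0), "\n";
```

If an escaping routine produces different output for two strings that compare equal, it is presumably looking at the internal bytes rather than at the characters, and such behaviour could well change between Perl versions.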
Any help appreciated.
Regards, Martin.