perl-unicode

IRI support in URI and URI::Escape modules

2005-01-31 02:50:45
Dear Perl Unicode Experts,

I tried to have a look at how much would have to be done to get
the URI and URI::Escape modules to support IRIs in a reasonable
way. The IRI spec has just been published as an IETF Proposed
Standard at http://www.ietf.org/rfc/rfc3987.txt. Also, a new
version of the URI spec is now Internet Standard 66 and is
available at http://www.ietf.org/rfc/rfc3986.txt.

I'm looking for two things:
a) short-term, how to get IRI support using the above and maybe
   some additional modules
b) long-term, how to make these modules (and maybe others)
   work with IRIs as well as with the new URI spec

Support for these new specs mainly includes the following things:
1) Escaping with %hh is based on UTF-8, not some local character
   encoding
2) URIs now allow %hh in the host name part, and require that
   it is interpreted as UTF-8
3) IDNs (i.e. conversion to punycode, and if possible also
   nameprep/stringprep) should be supported
4) The user of e.g. the URI module should ideally only have to
   deal with one form of the URI/IRI, the one used to construct
   the URI/IRI, although it should be possible to create other
   forms (e.g. a fully %-encoded URI, an IRI that contains
   as few %hh as possible)
5) It should be possible to apply normalization operations
   as described in the IRI spec on different parts of a URI/IRI

I started with some very simple (I thought) tests, but got
completely confused very quickly. Here is the short program
that I was using:

>>>> test.pl
use utf8;
use URI;
use URI::Escape;

print (uri_escape("\xFD") . "\n");
print (iri_escape("\xFD") . "\n");
print (uri_escape("\x{FD}") . "\n");
print (iri_escape("\x{FD}") . "\n");
print (uri_escape("\x{370}") . "\n");
print (iri_escape("\x{370}") . "\n");

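# Trick: prepend U+0370 (outside Latin-1) so that the whole string is
# handled as Unicode, then cut off its 6-character escape "%CD%B0".
sub iri_escape
{
    return substr (uri_escape("\x{370}".shift), 6);
}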
>>>>


With this, on perl, v5.6.1 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail), I get

>>>>
%FD
%C3%BD
%C3%BD
%C3%BD
%CD%B0
%CD%B0
>>>>

which seems to show that the trick with adding a non-Latin-1
character and then removing its escaped form works (compare
the first line to the second line).

However, on perl, v5.8.4 built for i386-linux-thread-multi,
I get:

>>>>
%FD

%FD



>>>>

Nothing seems to work anymore, although (or because?) 5.8
has better Unicode support.
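
For comparison, the behavior I am after for item 1) can be sketched
without URI::Escape at all. This is only a minimal sketch, assuming the
core Encode module that ships with 5.8; the helper name
utf8_percent_escape is hypothetical (it is not part of URI::Escape),
and the set of characters left unescaped is my reading of the
"unreserved" set in RFC 3986:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not part of URI::Escape: percent-encode a
# character string based on its UTF-8 octets.
sub utf8_percent_escape {
    my ($chars) = @_;

    # Turn the character string into a string of UTF-8 octets first,
    # so the result does not depend on Perl's internal representation.
    my $octets = encode('UTF-8', $chars);

    # Escape every octet outside the RFC 3986 "unreserved" set
    # (assumed here to be A-Z a-z 0-9 - . _ ~).
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;

    return $octets;
}

print utf8_percent_escape("\x{FD}"),  "\n";   # %C3%BD
print utf8_percent_escape("\x{370}"), "\n";   # %CD%B0
```

These are the same escapes that the iri_escape trick produced on 5.6.1
above, but obtained by encoding explicitly rather than by relying on
how a particular perl version stores the string.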

Any help appreciated.

Regards, Martin.
